Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia Study

Tracking #: 1159-2371

Maribel Acosta
Amrapali Zaveri
Elena Simperl
Dimitris Kontokostas
Fabian Flöck
Jens Lehmann

Responsible editor: 
Guest Editors Human Computation and Crowdsourcing

Submission type: 
Full Paper
Abstract:
In this paper we examine the use of crowdsourcing as a means to master Linked Data quality problems that are difficult to solve automatically. We base our approach on the analysis of the most common errors encountered in Linked Data sources, and a classification of these errors according to the extent to which they are likely to be amenable to crowdsourcing. We then propose and study different crowdsourcing approaches to identify these Linked Data quality issues, employing the DBpedia dataset as our use case: (i) a contest targeting the Linked Data expert community, and (ii) paid microtasks published on Amazon Mechanical Turk. Second, we focus on adapting the Find-Fix-Verify crowdsourcing pattern to exploit the strengths of experts and lay workers. By testing two distinct Find-Verify workflows (lay users only, and experts verified by lay users) we reveal how to best combine the complementary aptitudes of different crowds in quality issue detection. The results show that a combination of the two styles of crowdsourcing is likely to achieve more efficient results than either used in isolation, and that human computation is a promising and affordable way to enhance the quality of Linked Data.
Minor Revision

Solicited Reviews:
Review #1
By Irene Celino submitted on 17/Oct/2015
Minor Revision
Review Comment:

First of all, I'd like to congratulate the authors because this revision is a major improvement w.r.t. the original submission. The paper gained in clarity of both the content and the discussion "flow"; its claims are now much more honestly discussed. Almost all sections are much more readable, to the benefit of both the authors and the readers; only Section 5 is still a bit hard to follow in some places, and the "naming" could be homogenized further (e.g. links vs. interlinks, value vs. object) so that the reader is not misled by different phrasing.

A final check of the English language is also suggested (some missing "s", some wrong tenses); a few spotted typos are listed below, but since I'm not a native speaker either, I recommend a final proofreading. Another minor comment (but maybe this depends on the journal template) is that most of the time tables and figures are far from the text in which they are referenced, forcing the reader to continuously go back and forth. I also thank the authors for the detailed explanations in the supplementary file.

Therefore, this time I have only minor comments for the final version:
- abstract: this is the only place where claims still sound a bit too general
- section 4.1, p. 7: "In the contest, a prize was announced for the user submitting the highest number of (real) quality problems" so, you "cheated" and gave the prize to the one who just submitted the highest number of issues? I'd recommend deleting "(real)".
- a few lines below: Figure 1 should be Figure 2.
- section 4.1, p. 9: "This means that RDF resources are typically checked multiple times" how did you ensure this in the experts' Find stage? If I got it correctly, the experts were somehow free to choose their preferred resources to check.
- section 4.3, p. 13: "Therefore, the right answer to this tasks is: the DBpedia data is incorrect" in this case for sure the DBpedia extraction is incorrect, but is this what you are looking for? Let's say the DBpedia correctly extracted the "Apizaco" string but that Rodrigo Salinas was born somewhere else: what would you like the crowd to say? The answer would be correct or incorrect?
- same page, next column: the subtitle sounds wrong, probably either "Incorrect datatypes OF literals" or "Incorrect datatypes or LANGUAGE TAGS" would be more accurate.
- a few lines below: I still think that the Elvis example is a bit odd, the "Elvis Presley" string is correct also in Italian for example.
- Proof 2, p. 14: "... which is ≤ |T||Q|" I'd add "(for beta>1)"
- a few lines below: "the Verify stage in theory produce" an "s" is missing
- section 5.1.2, p. 16: "Therefore, we also report" did you mean "then" instead of "therefore"?
- a few lines below: "Precision values were computed for all stages of the studied workflows" I would add "with respect to the gold standard explained below"
- a few lines below: "the outcome of our Find stages does not contain true or false negatives" I find it arguable, since the lack of the "check sign" on a triple could be easily considered as a signal of a triple "indicated as correct" by the user. Nonetheless, I do not expect the authors to change this, because it would require creating a "ground truth" also on correct triples.
- section 5.1.3, p. 16: "The first sample evaluated corresponds to the set of triples obtained from the contest and submitted to MTurk" I would rephrase in "The first sample evaluated corresponds to the set of triples obtained from the contest in the experts' Find stage and submitted to MTurk"
- Table 2, last cell on bottom-right: it seems the sentence ends prematurely, maybe you meant "each task focuses on one quality issue"
- section 5.2.3, p. 17: "... the experts browsed around 33,404" add "triples"
- some lines further on: "We compared the common 1,073 triples assessed in each crowdsourcing approach against our gold standard" did the gold standard include all the triples?
- section 5.2.4, p. 17: "(dbpedia:Oncorhynchus, dbp:subdivision, “See text”)" do you really think this triple is correct?!? again, are you asking the users to check the correctness of the extraction or the correctness of the triple?
- Figure 6(b), p. 19: I would delete the last three (millimeter, nanometer, volt) as they are not relevant; on the other hand, maybe the year and date cases could be worth a short discussion in the text
- section 5.2.6, p. 20, bullet list: any comment on those cases?
- some lines below, "dbpedia:Forever Green": a "with" is missing in between
- section 5.2.7, p. 20: "In total, 26,835 triples were identified as erroneous" it is a lot w.r.t. the experts' Find stage! any comment on this?
- section 5.3.3, p. 22: "the values of sensitive achieved in both settings were not high" I suppose you are talking about specificity
- Figure 7(b), p. 23: I would delete the last four (date, gmonthday, gyear, second) as they are not relevant
- Table 7 on p. 24 and Table 8 on p. 25: I must say that it took me a while to interpret the two tables correctly, therefore I'd suggest the following to make them "independently" clearer: I'd remove column "Errors" from Table 7, leaving the table to be only about test cases, and I'd remove the column "F. TCs" from Table 8, leaving the table to be only about errors. Then (if I got it correctly) the only special case that is missing in Table 8 is the "person height range"; I'd add it to Table 8 so that the total is 765, matching the text.
- section 5.4.1, p. 25: "13 failed regular expressions" can you give an example of a regular expression in your context?
- Listing 1, p. 26: I still find it completely useless (unless the message you want to convey is "the baseline is implemented in a single line of commands"); if you really find it important to discuss, make it an algorithm in pseudo-language
- section 5.4.2, p. 26: " the majority of the examined links could be properly assessed using simple solutions" do you mean that they could *not* be properly assessed? I'd change "simple" with "naive"
- section 7.1, p. 27: "intermediary" --> intermediate? "our Find-Fix-Verify workflow" --> either the FFV workflow or our FV workflow; "motivators" --> motivations
- section 7.2, p. 28: "but it fully automated and thus is limited as it does allow the user" --> "but it is fully automated and thus it is limited as it does not allow the user" (if I got it correctly); "to tackled" --> "to tackle"; "a form-based interface users" --> "a form-based interface where users"

Review #2
By Gianluca Demartini submitted on 22/Oct/2015
Review Comment:

The revised manuscript accurately addresses all my comments. I recommend this version to be accepted.

Review #3
By Harald Sack submitted on 02/Dec/2015
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

Dear authors,
Thanks for your careful revision of the paper. For reasons of simplicity, I will only address your answers to my first review (Reviewer #2) for which I have further comments:

- In your answer to p. 11, Fig. 3 / section 4.2, concerning errors introduced by your own Wikipedia wrapper: can you guarantee that the displayed infobox and the displayed RDF triple really correspond? It would be interesting to know how reliable this display is. Do you have any numbers on errors detected by crowdworkers? Or do you have the numbers of how often crowdworkers clicked on the Wikipedia page for further clarification? If you have both numbers (reported errors and number of Wikipedia clicks), do they correlate?
- section 5.2.6, reason for misclassification: the type of detected error would be rather interesting, such as e.g. for images it is unclear whether it depicts really something wrong or only something related to the original subject. In Wikipedia as well as in Wikimedia Commons, images must not violate rights restrictions. Thus, there are many subjects for which a (direct) depiction (e.g. photographs of living persons) is rights-restricted, and therefore something (closely) "related" to that is depicted.
- section 5.4.2, baseline discussion: now table 9 seems to be missing.

New issues:
- In sect. 2 (page 4) you justify your selection of the most frequent quality problem categories according to your previous study. For reasons of self-containment, I really would like to see the numbers, as compared to the rest that was not analyzed.