Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia Study

Tracking #: 1065-2276

Maribel Acosta
Amrapali Zaveri
Elena Simperl
Dimitris Kontokostas
Fabian Flöck
Jens Lehmann

Responsible editor: 
Guest Editors Human Computation and Crowdsourcing

Submission type: 
Full Paper
In this paper we examine the use of crowdsourcing as a means to master Linked Data quality problems that are difficult to solve automatically. We base our approach on the analysis of the most common errors encountered in Linked Data sources, and a classification of these errors according to the extent to which they are likely to be amenable to crowdsourcing. We then propose and compare different crowdsourcing approaches to identify these Linked Data quality issues, employing the DBpedia dataset as our use case: (i) a contest targeting the Linked Data expert community, and (ii) paid microtasks published on Amazon Mechanical Turk. Second, we focus on adapting the Find-Fix-Verify crowdsourcing pattern to exploit the strengths of experts and lay workers. By testing two distinct Find-Verify workflows (lay users only, and experts verified by lay users) we reveal how to best combine different crowds' complementary aptitudes in quality issue detection. The results show that a combination of the two styles of crowdsourcing is likely to achieve more efficient results than each of them used in isolation, and that human computation is a promising and affordable way to enhance the quality of Linked Data.

Major Revision

Solicited Reviews:
Review #1
By Gianluca Demartini submitted on 28/May/2015
Minor Revision
Review Comment:

This paper proposes new crowdsourcing techniques to identify errors in linked data by combining expert judgements with data obtained from crowdsourcing platforms. The paper addresses very valuable research questions. The paper contains all necessary definitions of crowdsourcing terminology, making it self-contained and understandable for a reader from the semantic web community.

The authors compare different combinations of worker/expert answers and propose different types of workflows to identify errors in linked data. The paper focuses on 3 specific types of errors. The authors focus exclusively on error identification, as fixes are best applied by correcting the automatic extraction process rather than the generated data.

The authors also make all data and results available online for others to re-use.

Results are discussed and analysed in detail comparing crowd and expert performance also including an error analysis.

(1) originality:
The addressed problem of linked data quality is important and the proposed solution is novel and reasonable.

(2) significance of the results:
The results show how to best combine experts and crowd for the proposed linked data quality problems. This does not solve all linked data quality problems, but it certainly contributes to bring this field forward.

(3) quality of writing:
The paper is very well written and structured. It is easy to follow and presents a detailed description of the approach and of the experimental results also including error analysis.

Detailed comments:
- A controversial point is that the ground truth was created by experts, and the experts' results are then evaluated against it. I agree that there is no way around this, but a small discussion of why the authors believe the ground truth data is of better quality than the expert answers would help (e.g., the experts did not necessarily put as much effort into the task as the ground truth creators did, who also resolved conflicts and discussed difficult triples together, etc.).
- In section 2 it is unclear to me how the 4 dimensions relate to the 3 error types addressed here. Expanding this section would make the paper easier to understand and more self-contained. Automatic approaches to identifying errors in linked data could also be discussed at this point to motivate the need for human computation approaches.
- The scalability of the approach is unclear: it seems to me that the proposed approach needs every single triple in a linked dataset to be manually checked. This would limit the scalability of the approach. Thus, while the focus of this paper is clearly different, it would be useful to briefly discuss the possibility of hybrid human-machine approaches to scale the approach to large amounts of triples (e.g., the English DBpedia 3.9 has 500M triples). Related to this is "Proposition 2", which does not sound scalable at all.
- At the end of section 4.3 it is unclear whether the incorrect links are errors present in Wikipedia or are generated by the wrappers.
- The impression from reading the paper is that the payment of crowd workers is extremely low (e.g., 0.04USD for 5 triples or 0.06USD for 30 triples). It would be interesting to report the hourly rate by considering the time spent by workers in completing the tasks to get a better idea of the adopted payment level.
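The hourly-rate calculation the reviewer asks for is straightforward to sketch; the completion time used below is illustrative (it is not reported in the paper), only the per-HIT rewards come from the review:

```python
def hourly_rate(reward_usd: float, seconds_per_hit: float) -> float:
    """Effective hourly wage implied by a per-HIT reward and a completion time."""
    hits_per_hour = 3600.0 / seconds_per_hit
    return reward_usd * hits_per_hour

# Illustrative: a 0.04 USD HIT (5 triples) taking ~120 s implies ~1.20 USD/hour.
print(round(hourly_rate(0.04, 120), 2))
```

Reporting this figure alongside the median task duration would make the adopted payment level directly comparable to other crowdsourcing studies.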
- Probably section 5.4 could be presented before the proposed approaches instead of after them. Moreover, section 5.4.1 seems not to be a relevant baseline as it looks for different types of errors.
- It would be good to add a final paragraph in section 7 stating how this paper compares to the two described areas of research.

Review #2
By Harald Sack submitted on 03/Jul/2015
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The authors present a study in which they examine the applicability of crowdsourcing to Linked Data Quality problems, with DBpedia as an example. They show the general feasibility of the approach and continue to investigate whether, and for which tasks in particular, unskilled laymen instead of experts can also be employed to solve LDQ problems. Furthermore, they address the problem of optimal or better crowdsourcing workflows to employ experts and laymen for Linked Data curation.

The problem addressed is a rather interesting one, giving first insights into how to adapt crowdsourcing to LDQ issues. From my point of view, I would have liked to see a more concise comparison of the two crowdsourcing approaches, also against sophisticated state-of-the-art automated tools. The proposed RDFUnit approach, in the way the authors conducted their experiments, has some flaws and is too limited (for details cf. below). Thus, I have some (significant) issues with the evaluation, which should be addressed by the authors.

1) At the end of the introduction (p.3) the limitations of automated methods for Linked Data quality assurance are mentioned, referring only to the checking of ontological inconsistencies. Besides, there also exist approaches based on statistics (e.g., outlier detection) which should not be neglected, or at least mentioned as such in the related work section. [1,2,3]

2) In section 2) Linked Data Quality Issues, you focus on only three RDF-triple-level quality issues out of the larger set of Linked Data Quality issues referred to in your previous work in that area. Unfortunately, you do not explain why the 3 categories of quality issues you focus on are representative, either for LDQ issues in general or for crowdsourcing in particular. What about the other quality issues concerning their importance, representativeness, suitability for crowdsourcing, etc.? A more detailed discussion would be helpful.

3) In section 3.2) you give background information on the Find-Fix-Verify pattern (2nd paragraph). This information (in which scenario it was used first, etc.) is not really necessary for the rest of the paper.

4) On page 9 you state that your function "prune" discards all RDF triples whose URI could not be dereferenced. Was there a significant amount of these RDF triples? Did it only concern triples with a relation to an external website that could not be dereferenced, or did it also concern DBpedia URIs?

5) In p.11, Fig 3, the DBpedia RDF triples (middle column) shown in your tool are compared with wikipedia infobox values (left column), for which you implemented an extractor. How do you ensure that your extractor does not introduce extraction errors of the same kind that produced the DBpedia RDF triple under consideration? What if both - your wikipedia infobox extractor and the DBpedia RDF triple - show the very same error? Many problems of the DBpedia extractors arise from wrong infobox information, because many wikipedia authors don't care about infobox conventions. Can these kinds of errors be detected at all with your crowdsourcing tool (which is based on the very same mechanisms to give a hint to the crowdworker)?

6) In your evaluation, p.17, section 5.2.6, you present an "analysis" of expert misclassifications. The analysis only states in aggregated form what kinds of misclassifications occurred, but does not give any explanation or details on why they occurred.

7) Also in your evaluation, p.21, section 5.3.5, you claim that rdf:type information could not be evaluated correctly by users because yago classes do not provide self-speaking labels or other textual information. But the URI of a yago class usually consists of a self-speaking name plus some numerical information, e.g. yago:AerospaceEngineer109776079, which is perfectly readable for humans.

8) Why don't you (also) consider the Open World Assumption (p.21) for your baseline approach? Please explain.

9) The rules you provide as constraints to be checked in your baseline approach (p.22) are sometimes questionable. E.g., persons without a birthdate (even if they sometimes have a deathdate): this holds for many historical persons born a long time ago, simply because their birthdate is not known.

10) In your experiments you should always give the LD experts full schema/ontology information to judge the correctness of an RDF triple.

11) Baseline evaluation (p.22, section 5.4.2). In addition to the foaf:name, you should also take into account alternative labels (from redirects and interlanguage links) of the entity under consideration if you want to find out automatically whether the external web page refers to this entity. Otherwise you might not detect it. Why have you set the threshold to ">1"? Why are two occurrences sufficient? In natural language texts, naming a subject repeatedly with the same name is often avoided; synonyms and pronouns are used instead.
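The reviewer's suggestion could be sketched as follows: count occurrences of any known label of the entity in the page text, rather than the foaf:name alone. The entity, labels, and page text below are hypothetical examples, not data from the paper:

```python
import re

def label_occurrences(page_text: str, labels: set[str]) -> int:
    """Count (case-insensitively) how often any of the entity's labels
    appears in the page text; overlapping labels may be double-counted."""
    text = page_text.lower()
    return sum(len(re.findall(re.escape(label.lower()), text))
               for label in labels)

# Hypothetical labels, e.g. gathered from redirects/interlanguage links:
labels = {"Edsger Dijkstra", "Edsger W. Dijkstra", "Dijkstra"}
page = "Edsger W. Dijkstra was a Dutch computer scientist. Dijkstra received..."
print(label_occurrences(page, labels) > 1)  # threshold ">1" as in the paper
```

With the foaf:name alone ("Edsger Dijkstra") this page would yield zero matches, illustrating the reviewer's recall concern.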

12) In p.23, table 9: I doubt that the baseline is able to identify whether a "thumbnail" or a "depiction" refers to the correct image for an entity. Please justify and explain how you ensure this.

13) In the related work section (p.24/25) Games with a Purpose are mentioned. There exist also games with the dedicated purpose of DBpedia quality check that have not been mentioned/compared [4].

14) Also in the related work section (p.25), tools for Linked Data Quality Assessment that are able to automatically extract/create ontology constraints from available data, and then use these constraints to assess the quality of the remaining data, have been neglected. [5,6]

[1] Heiko Paulheim and Christian Bizer. 2014. Improving the Quality of Linked Data Using Statistical Distributions. Int. J. Semant. Web Inf. Syst. 10, 2 (April 2014), pp. 63-86.
[2] Didier Cherix, Ricardo Usbeck, Andreas Both, Jens Lehmann. 2014. CROCUS: Cluster-based Ontology Data Cleansing. WASABI 2014 at Extended Semantic Web Conference 2014.
[3] Daniel Fleischhacker, Heiko Paulheim, Volha Bryl, Johanna Völker, and Christian Bizer. 2014. Detecting Errors in Numerical Linked Data Using Cross-Checked Outlier Detection. In Proc. 13th Int. Semantic Web Conference (ISWC '14), pp. 357-372.
[4] J. Waitelonis, N. Ludwig, M. Knuth, H. Sack: Whoknows? - Evaluating Linked Data Heuristics with a Quiz that cleans up DBpedia. International Journal of Interactive Technology and Smart Education (ITSE), Emerald Group, Bingley (UK), Vol. 8, 2011 (3).
[5] Jens Lehmann, Lorenz Bühmann: ORE - A Tool for Repairing and Enriching Knowledge Bases. In Proc. of the 9th Int. Semantic Web Conference (ISWC 2010), Lecture Notes in Computer Science, Springer, 2010.
[6] G. Töpper, M. Knuth, and H. Sack: DBpedia ontology enrichment for inconsistency detection. In Proc. of the 8th Int. Conf. on Semantic Systems (I-SEMANTICS '12). ACM, New York, NY, USA, pp. 33-40.

Review #3
By Irene Celino submitted on 06/Jul/2015
Major Revision
Review Comment:

Generally speaking, I like the paper and its topic, and I think that it could be worth publishing in SWJ, because it is strongly in line with the special issue CfP. Still, I have some major remarks that should be addressed before acceptance. My observations are mainly related to two aspects: the global scope of the paper and of the presented results, and the experiment design.

Regarding the paper scope, I am not convinced that the authors provided results that can be considered valid for Linked Data at large, nor for LD quality issues of any kind. This is quite adequately indicated in the paper title, but it is not fully reflected in the paper text itself.
The authors stated that they focused on DBpedia "as a representative data set for the broader Web of Data"; I largely disagree with that for the following reasons: (1) not all LD sources are produced by following a transformation/mapping process like the DBpedia one, and the types of errors that happen in a specific LD source heavily depend both on the intrinsic quality of the source and on the possible translation process to RDF; (2) DBpedia is very general in its coverage of topics, while LD sources (and their possible quality issues) can be very specific to a given domain; as a consequence, the capability of an experts' or workers' crowd to identify and assess LD quality issues is highly influenced by the domain/coverage of the source. Therefore, I'd recommend softening the claims of generality of the presented results and clearly stating that they are "proved" only on DBpedia. I'm sure the authors can speculate on the extent to which those results can be considered general, but at the present moment I believe they cannot affirm that they fully addressed the research questions as they are introduced.
Furthermore, the experiments focused on some specific LD quality issues and not on the whole list of possible issues (which are comprehensively listed in the authors' previous works). While this is fine per se - I didn't expect the authors to run experiments on the whole set of issues - it makes the presented results even less general. It would also be valuable if the authors added an explanation of why those specific quality issues (instead of others) were selected for the experiments.
As a global recommendation, I suggest the authors honestly rephrase the parts of the paper that would try to convince the readers of a possible general validity of the paper's results for any LD source and/or any LD quality issue.

Regarding the experiment design, my impression is that a number of results are not fully related to the intended characteristics of the experiments (expert vs. laymen crowdsourcing, Find vs. Verify stage, etc.) but are the collateral effect of non-optimal design choices, in terms of (1) the choice of triples/data and quality issues to be tested, (2) the user interface and support information provided to participants, and (3) the reported indicators and baselines.

Apart from the considerations on DBpedia already provided above, I have a number of concerns about the employed triples. The experts were given the opportunity to find quality issues in triples that were (i) random, (ii) instances of some class, or (iii) manually selected; while this can appear reasonable, the effect is that the workers' crowds (in both experimental workflows) were presented with information "chosen" by somebody else, thus possibly making the task hard or even impossible because of the triples' domain. I would have expected the authors to run a *controlled experiment*, i.e., to select a general-purpose subset of DBpedia that - at least from the point of view of the content - was at the same "difficulty" level for all the involved crowds. Furthermore, even restricting to a set of selected subjects, I think that not all triples were suitable for the intended experiments; indeed, some specific cases emerged that are not related to the intrinsic characteristics of quality assurance; while it is generally ok to let the experiments uncover problems, it is also reasonable to think that, when preparing an experiment, the obvious things that can lead to problems are avoided. Some examples:
- specific datatype objects (like dates vs. numbers, which are easily mistaken for one another)
- owl:sameAs links (which maybe were interpreted in a "purist" way by LD experts who can be careful in accepting those triples because of their logical implications)
- rdf:type triples among the incorrect link issues (apparently unclear to the MTurk workers, and partially to me as well: why were rdf:type triples considered among the "links" instead of the "values"?)
- DBpedia translation-specific triples (which do not make any sense in such an evaluation setting, and should have been filtered out in the first place).
Another fact to support this criticism is that the two authors who created the "ground truth" got quite low values of inter-rater agreement.
Also the tested quality issues are somehow unbalanced: while the incorrect object extraction or the incorrect links can be related to the entities' "meaning", the incorrect datatypes or language tags are more "structural" mistakes; as a consequence, it is not surprising that the latter is the case in which the paid workers performed worst.

Some experiment design flaws come also from the user interface and the information provided to the involved crowds.
First of all, the quality issues to be identified are not presented at the same granularity level to experts and laymen, since the experts got quite a detailed taxonomy of issues, while the MTurk users got only three possibilities.
Regarding the MTurk-based Find stage, I personally find the screenshot in Figure 3 very confusing, since it seems that the Wikipedia column (which should provide the "human readable" information) is less complete than the DBpedia column: how were the workers expected to interpret this fact? Were they instructed to click on the Wikipedia page link (if provided) to check?
Regarding the MTurk-based Verify stage, Figure 5 is also problematic, since it doesn't display the Wikipedia preview (explained in the text) and seems anyway to require quite an effort or some knowledge to judge; it would have been interesting to know whether the authors were able to trace if and how many times the workers actually clicked on the Wikipedia link, or how much time it took them to make a decision.
Also some of the examples given in the text are misleading and therefore not fully suitable to be offered to the crowds as explanations (if they were); e.g. is Elvis Presley's name language-dependent?

I also have some doubts about the choice of evaluation metrics.
Regarding the tables, the authors would have done better to use sensitivity and specificity rather than TP and FP, because rates are more easily compared and interpreted than counts. This also applies to the bar charts, which are hard to judge because of the different value ranges: using rates would improve readability and better convey the message.
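For clarity, the rates the reviewer asks for derive directly from the reported counts; the counts below are illustrative, not taken from the paper's tables:

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: fraction of actual quality issues that were flagged."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True negative rate: fraction of correct triples left unflagged."""
    return tn / (tn + fp)

# Illustrative counts: rates stay in [0, 1] regardless of sample size,
# so two crowds evaluated on differently sized samples remain comparable.
print(sensitivity(80, 20))   # 0.8
print(specificity(150, 50))  # 0.75
```

Normalized rates would let readers compare the expert and worker crowds directly, even where the underlying sample sizes differ.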
Furthermore, I am not at all convinced of the significance of keeping track of the first answer, even less of the comparison between the first answer and majority voting: while I understand the cost consideration, it would have been more meaningful to compare 3-worker majority voting vs. 5-worker majority voting, since a single worker cannot express any kind of answer "agreement" or "variance".
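The comparison the reviewer proposes is easy to operationalize: aggregate the same task under 3-worker and 5-worker majorities and compare the outcomes. A minimal sketch, with hypothetical worker answers:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Label chosen by the largest number of workers (ties broken arbitrarily)."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical answers for one triple: the 3-worker and 5-worker
# majorities can disagree, which a single first answer cannot reveal.
three = ["correct", "incorrect", "correct"]
five = ["correct", "incorrect", "correct", "incorrect", "incorrect"]
print(majority_vote(three))  # correct
print(majority_vote(five))   # incorrect
```

Unlike a lone first answer, either majority also yields an agreement level (e.g., 2/3 vs. 3/5) that can be reported alongside the verdict.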
Finally, I found the baseline section quite weird: the authors describe the interlinks approach, which makes perfect sense (even if it regards only one of the tested quality issues), but they also introduce the TDQA assessment, which cannot be compared to the experiment results (and thus cannot be considered a baseline approach). The authors would do better to create a baseline (e.g., using SPIN or ShEx-based constraint checks) that tries to identify datatype/language and object value issues (w.r.t. all those cases in which such checks can be implemented, of course); that would be a reasonable baseline to compare to.
In any case, I would suggest the authors add a final summary table that compares the two workflows as well as the comparable baselines, so as to support the final discussion.

Some relatively minor remarks:
- page 4, 1st column, postal code example: in some countries postal codes contain letters, so it is not necessarily true that it should be an integer
- sections 3.1.1 and 3.1.2 do not provide any reference
- section 3.2 is clearly related to reference [3], so there is no need to include the citation several times
- page 6, definition 1: why is it 2^Q? can all the quality issues happen at the same time?
- page 6, beginning of 2nd column: this is very specific to DBpedia, so it is in contradiction to the generality claims of the paper
- page 8, end of section 4.1: the authors explain the redundancy during the Find stage by experts; if an agreement is already achieved, is the Verify stage useful at all?
- page 9, 1st column: it seems that the prune step is specific to the experimental setting, rather than to the general case (non-dereferenceable URIs should have been discarded in the first place...)
- page 9, 2nd column: reference to Figure 1 should probably be Figure 3
- page 10, footnote 8: it is not simply for the sake of simplicity, since a datatype and a language tag cannot occur together; furthermore, for the laymen there is probably not much difference between "value" and "link" either
- section 4.4 is not completely necessary in the paper
- page 15, end of 1st column: why were the DBpedia Flickr links filtered out? if there was some doubt about their validity or relevance to the tests, why not filtering them out before the Find stage?
- page 15, section 5.2.4: the example triple is totally unclear, what does it mean? why is it correct?
- table 3: from the text 1512 seems to be the number of the "marked" triples rather than the evaluated ones
- table 4: the caption does not explain that the results refer to the "ground truth" sample (same for table 6); why was the LD expert inter-rater agreement computed for all the triples together?
- page 16, beginning of 1st column: the need for specific technical knowledge about datatypes seems to be yet another experiment design flaw
- page 17, list in the 1st column: what are Wikipedia-upload entries? what does it mean w.r.t. the misclassification discussion?
- page 18, section 5.3.2: the text says 30k triples while table 5 says almost 70k triples, so what's the correct number? why was the sample selected on the basis of "at least two workers" and not by majority voting? does the sample contain the "exact same number of triples" or exactly the same triples? why did this Verify stage take more time than in the case of the other workflow?
- page 18, end of 2nd column: the geo-coordinates example seems yet another symptom of an ill-designed experiment
- table 5: the sample used for the Verify task does not have the same distribution of triples over the quality issues as the Find stage; can the authors elaborate on the possible effects of those different proportions in terms of loss of information?
- page 19, 2nd column: the problem with non-UTF8 characters seems another sign of sub-optimal design of the user interface for the experiments
- page 20, 2nd column: possible design flaw also in the case of proper nouns
- figure 7(b): TP+TN are complementary w.r.t. FP+FN; rates would be more meaningful than total counts
- page 21, 1st column: there are a couple of occurrences of "Find" that should probably be "Verify"; it would be interesting to know whether the rdf:type triples that were correctly classified were all classified by the same worker(s)
- page 21, 2nd column: it is not clear on how many triples the 5146 tests were run (on the 509 "ground truth" triples?); what exactly counts as a success/failure in the tests?
- page 22, 2nd column: were only the foaf:name links used or also the rdfs:label ones? the listing is somewhat useless, the text was clear enough; also, it is unclear what the "triples subject to crowdsourcing" were, since different datasets were used in the previous tests
- page 23, 1st column: I didn't get what the following consideration refers to: "workers were exceptionally good and efficient at performing comparisons between data entries, specially when some contextual information is provided"
- page 24, footnote 21: the link is broken
- page 24, 2nd column: "fix-find-verify workflow" is probably Find-Verify
- page 25, end of 1st column: the authors write "Recently, a study [18]..." but the paper was published in 2012