Empirical Methodology for Crowdsourcing Ground Truth

Tracking #: 1739-2951

Anca Dumitrache
Oana Inel
Benjamin Timmermans
Carlos Ortiz
Robert-Jan Sips
Lora Aroyo
Chris Welty

Responsible editor: 
Guest Editors Human Computation and Crowdsourcing

Submission type: 
Full Paper
The process of gathering ground truth data through human annotation is a major bottleneck in the use of information extraction methods for populating the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the attempt to solve the issues related to volume of data and lack of annotators. Typically these practices use inter-annotator agreement as a measure of quality. However, in many domains, such as event detection, ambiguity in the data and a multitude of perspectives on the information examples are continuously present. In this paper we present an empirically derived methodology for efficiently gathering ground truth data in a number of diverse use cases that cover a variety of domains and annotation tasks. Central to our approach is the use of CrowdTruth metrics, which capture inter-annotator disagreement. We show that measuring disagreement is essential for acquiring a high-quality ground truth. We achieve this by comparing the quality of the data aggregated with CrowdTruth metrics with majority vote, over a set of diverse crowdsourcing tasks: medical relation extraction, Twitter event identification, news event extraction and sound interpretation. We also show that an increased number of crowd workers leads to growth and stabilization in the quality of annotations, going against the usual practice of employing a small number of annotators.

Minor Revision

Solicited Reviews:
Review #1
By Gerhard Wohlgenannt submitted on 30/Nov/2017
Review Comment:

The authors have sufficiently addressed the comments in my previous review (which was "minor revision").
Therefore, I suggest the acceptance of the paper.

Review #2
Anonymous submitted on 19/Dec/2017
Review Comment:

I went over the response letter and the new version of the paper and I still believe that the novel contribution of this work as compared to already published papers is very limited.

The new parts introduced in this minor revision only partially address my comments on the previous version of the paper.

The adjudication of expert and crowd labels by the authors, as explained in the text added to Section 3.3, is not a standard approach in existing crowdsourcing research. The cases which are highly ambiguous (i.e., "there was a large discrepancy between annotations of crowds and experts, with a very small overlap between their annotations") should be excluded from the collection rather than resolved by one author (who may or may not be an expert in the domain).

Review #3
By Maribel Acosta submitted on 23/Jan/2018
Review Comment:

I would like to thank the authors for addressing the concerns I raised in my previous review (Reviewer #3).

A final remark: in crowdsourcing experiments, it is important to specify the dates when the tasks were submitted. As crowdsourcing platforms are labor markets, appropriate payments per job vary over time, and it could well be that the findings reported in this work regarding task payments are not exactly reproducible in the future. Hence, I strongly recommend that the authors include the dates when the microtasks were executed.

With no further comments, my recommendation is to accept the manuscript.

Review #4
Anonymous submitted on 02/Mar/2018
Review Comment:

The paper summarizes the work done by the authors to introduce more nuanced measures of agreement/disagreement when aggregating workers' responses to crowdsourcing tasks (very minor comment: I don't think the proposed metrics reflect disagreement per se, but rather different levels of agreement). The authors present a comparative evaluation across four different types of task and contrast their results with a baseline of majority voting.

The main positive aspects of the paper are as follows:
- the introduced "disagreement" metrics are definitely the most interesting part: I clearly see their value and their potential applicability; however, those metrics were already published before (e.g. in [21])
- the evaluation was done across four different tasks, by soliciting different aspects of the approach (open/closed tasks) as well as different knowledge domains; this highlights the applicability of the approach to all those cases where there is no "single answer", but the actual truth is multi-faceted; I remain a bit curious to see if the proposed approach would exhibit the same properties when applied to cases in which there should be one single truth, i.e. when ambiguous results should not be acceptable
- it is a pity that the annotation metrics defined in section 2.2 (similarity, ambiguity, clarity) are not employed at all in the evaluation; I would be curious to see if those provide some other interesting perspective on worker behavior and task properties, especially in identifying the most difficult tasks, which may require more worker contributions
- the authors correctly report in the related work section that there are other ways to aggregate workers' contributions and that they don't claim anything w.r.t. those
- the paper is generally easy to read, well structured and comprehensive, so the quality of writing is high; there are only a few small details that would have been worth adding (e.g. the reasons for needing the combined expert-worker trusted judgment, whether "spammers" are considered in the number of workers/task and in the majority voting, the use of the McNemar test instead of a simple t-test)
- the authors also provide the experimental data on GitHub to allow for further comparative studies

The main shortcomings I see in the paper are as follows:
- even if the claim is correctly framed in the related work section, the experimental results seem to suggest a "superiority" of CrowdTruth w.r.t. state of the art methods; instead, CrowdTruth could actually be beaten by other methods which take into account labeling quality and/or task difficulty; to correctly present their contribution, the authors could have added an additional term of comparison, e.g. expectation maximization (also in substitution for the single annotator, which frankly seems quite useless)
- the comparison to majority voting (as I understood it) also seems a bit unfair: majority voting returns one annotation for each task and therefore is penalized "by design" w.r.t. CrowdTruth, especially because, in both open and closed tasks, workers were allowed to insert more than one annotation; moreover, as somewhat evident from Figure 4, for even numbers of workers, majority voting decreases for obvious reasons related to ties; it is also unclear whether "spammers" (as defined in Section 2.3) were excluded from majority voting computation
- the best value for the CrowdTruth unit annotation score (as per Figure 3) can be selected only ex post and by comparison to a "ground truth", so it is unclear how it should be set ex ante in real cases or how it should be chosen if no trusted data is available; in the case of open-ended tasks, it is also peculiar (and not discussed) that the maximum value of the F1 score occurs at the minimum value of the unit annotation score
- the choice of the trusted judgment collection is also questionable, especially because expert judgment is usually considered reliable, so it is unclear why this additional step was needed (are the authors more reliable than the experts? or are the experts not quite "experts"?)
- I generally disagree with the considerations related to the number of workers: the fact that crowdsourcing campaigns usually tend to limit the maximum number of workers per task is not a "bad practice", but is usually driven by cost considerations; if you have a much larger number of tasks (also in terms of variety), you cannot afford to pay 15 workers per task; there is literature on deciding when an additional annotation is needed (e.g., Sheng, Provost and Ipeirotis "Get Another Label? Improving Data Quality and Data Mining", SIGKDD, 2008)
- the relevance of the paper for this journal/issue is not so straightforward: apart from the mention of the Semantic Web in the abstract, introduction and conclusion, this paper could have been easily submitted to a crowdsourcing journal
- in the meantime (maybe because of long review timing on this submission), a relevant part of this paper was already published in: Anca Dumitrache, Oana Inel, Benjamin Timmermans and Lora Aroyo: Crowdsourcing Ambiguity-Aware Ground Truth. Collective Intelligence 2017
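
The tie issue raised in the majority-voting point above can be illustrated with a minimal sketch. It assumes a single-label binary task; the function names `majority_vote` and `agreement_ratio` are hypothetical, and the ratio shown is a plain fraction of workers, not the CrowdTruth unit annotation score defined in the paper.

```python
from collections import Counter


def majority_vote(labels):
    """Return the single most frequent label, or None when the top two tie."""
    counts = Counter(labels).most_common()
    if not counts:
        return None  # no worker judgments at all
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: no strict majority exists
    return counts[0][0]


def agreement_ratio(labels, label):
    """Fraction of workers who chose `label` (illustrative only)."""
    return labels.count(label) / len(labels)


odd_workers = ["yes", "no", "yes"]
even_workers = ["yes", "no", "yes", "no"]

print(majority_vote(odd_workers))            # a strict majority exists
print(majority_vote(even_workers))           # tie, so no label is returned
print(agreement_ratio(even_workers, "yes"))  # graded score is still defined
```

With an even number of workers the vote can split evenly, so a strict majority rule returns no label, whereas a graded score remains defined for every label.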

Since I was asked for a meta-review to help the editors reach a decision on the paper after the previous rounds of review/revision, and given the reasons above, I am afraid I have to suggest rejection.