Review Comment:
The paper summarizes the work done by the authors to introduce more nuanced measures of agreement/disagreement when aggregating workers' responses to crowdsourcing tasks (very minor comment: I don't think the proposed metrics reflect disagreement per se, but rather different levels of agreement). The authors present a comparative evaluation across four different types of task and contrast their results with a baseline of majority voting.
The main positive aspects of the paper are as follows:
- the introduced "disagreement" metrics are definitely the most interesting part: I clearly see their value and their potential applicability; however, those metrics were already published before (e.g. in [21])
- the evaluation was done across four different tasks, exercising different aspects of the approach (open/closed tasks) as well as different knowledge domains; this highlights the applicability of the approach to all those cases where there is no "single answer" and the actual truth is multi-faceted; I remain a bit curious to see whether the proposed approach would exhibit the same properties when applied to cases in which there should be one single truth, i.e. when ambiguous results are not acceptable
- it is a pity that the annotation metrics defined in Section 2.2 (similarity, ambiguity, clarity) are not employed at all in the evaluation; I would be curious to see whether they provide some other interesting perspective on worker behavior and task properties, especially in identifying the most difficult tasks, which may require more worker contributions
- the authors correctly report in the related work section that there are other ways to aggregate workers' contributions and that they don't claim anything w.r.t. those
- the paper is generally easy to read, well structured, and comprehensive, so the quality of writing is high; there are only a few small details that would have been worth adding (e.g. the reasons for the need for the combined expert-worker trusted judgment, whether "spammers" are counted in the number of workers per task and in the majority voting, and the rationale for using the McNemar test instead of a simple t-test)
- the authors also provide the experimental data on GitHub to allow for further comparative studies
The main shortcomings I see in the paper are as follows:
- even if the claim is correctly framed in the related work section, the experimental results seem to suggest a "superiority" of CrowdTruth w.r.t. state-of-the-art methods; instead, CrowdTruth could actually be beaten by other methods that take into account labeling quality and/or task difficulty; to correctly present their contribution, the authors could have added an additional term of comparison, e.g. expectation maximization (possibly in substitution for the single annotator, which frankly seems quite useless)
- the comparison to majority voting (as I understood it) also seems a bit unfair: majority voting returns one annotation per task and is therefore penalized "by design" w.r.t. CrowdTruth, especially because, in both open and closed tasks, workers were allowed to insert more than one annotation; moreover, as is somewhat evident from Figure 4, for even numbers of workers the performance of majority voting decreases for obvious reasons related to ties; it is also unclear whether "spammers" (as defined in Section 2.3) were excluded from the majority voting computation
- the best value for the CrowdTruth unit annotation score (as per Figure 3) can be selected only ex post and by comparison against a "ground truth", so it is unclear how it should be set ex ante in real cases, or how it should be chosen when no trusted data is available; in the case of open-ended tasks, it is also peculiar (and not discussed) that the maximum value of the F1 score is reached at the minimum value of the unit annotation score
- the choice of the trusted judgment collection is also questionable, especially because expert judgment is usually considered reliable, so it is unclear why this additional step was needed (are the authors more reliable than the experts? or are the experts not quite "experts"?)
- I generally disagree with the considerations related to the number of workers: the fact that crowdsourcing campaigns usually tend to limit the maximum number of workers per task is not a "bad practice", but is typically driven by cost considerations; if you have a much larger number of tasks (also in terms of variety), you cannot afford to pay 15 workers per task; there is literature on deciding when an additional annotation is needed (e.g., Sheng, Provost and Ipeirotis, "Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers", SIGKDD, 2008)
- the relevance of the paper for this journal/issue is not so straightforward: apart from the mention of the Semantic Web in the abstract, introduction and conclusion, this paper could have been easily submitted to a crowdsourcing journal
- in the meantime (maybe because of long review timing on this submission), a relevant part of this paper was already published in: Anca Dumitrache, Oana Inel, Benjamin Timmermans and Lora Aroyo: Crowdsourcing Ambiguity-Aware Ground Truth. Collective Intelligence 2017
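As a side note on the tie effect I mention in the majority-voting comparison above, the decrease at even worker counts can be made concrete with a small sketch (my own illustration, not taken from the paper; the worker accuracy of 0.7 is a hypothetical value). With independent workers of equal accuracy on a binary task, an even panel can split evenly; if a tie yields no majority answer (counted as a miss), accuracy at an even panel size falls below the preceding odd size, while with a coin-flip tiebreak the even size merely fails to improve on it:

```python
from math import comb

def majority_vote_accuracy(n_workers, p_correct, tie_is_error=True):
    """Exact probability that majority voting over n independent binary
    annotations recovers the true label. If tie_is_error, an even split
    yields no majority answer and counts as a miss; otherwise ties are
    broken by a fair coin flip."""
    acc = 0.0
    for k in range(n_workers + 1):           # k correct votes out of n
        p_k = comb(n_workers, k) * p_correct**k * (1 - p_correct)**(n_workers - k)
        if 2 * k > n_workers:                # strict majority is correct
            acc += p_k
        elif 2 * k == n_workers and not tie_is_error:
            acc += 0.5 * p_k                 # coin-flip tiebreak
    return acc

# Hypothetical worker accuracy of 0.7: even panel sizes dip below
# the preceding odd size when ties produce no answer.
for n in range(3, 8):
    print(n, round(majority_vote_accuracy(n, 0.7), 3))
```

This is why comparing CrowdTruth against majority voting at even worker counts, as in Figure 4, penalizes the baseline for reasons unrelated to aggregation quality.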
Since I was asked for a meta-review to help the editors make a decision on the paper after the previous rounds of review/revision, and given the reasons above, I am afraid I have to suggest rejection.