Review Comment:
I would like to thank the authors for their response. The focus and clarity of the paper have definitely improved. The new version of the manuscript addresses the major concerns raised in my previous review (Reviewer #2).
Still, I have some remarks that the authors should address in the paper before publication.
1. Clarification about the definition of CrowdTruth:
Throughout the paper, the authors refer to CrowdTruth variously as a method, a framework, and a methodology; the intended characterization should be made consistent. Furthermore, the authors should clarify the main differences between the CrowdTruth ‘methodology’ proposed in the current manuscript and the work presented in [21]. This will help readers to better understand the contributions of this work.
2. ‘Triangle’ or ‘pyramid’ of disagreement:
The authors seem to use the proposed terms ‘triangle of disagreement’ (e.g., the header of Section 2.1) and ‘pyramid of disagreement’ (e.g., the caption of Figure 2) interchangeably. It would be clearer if the authors stuck to only one term.
3. Settings of the CrowdTruth metrics in the experiments:
In the reported experiments, it is clear that the authors explore the impact of using different threshold values for the ‘media unit annotation score’ metric on the quality of the crowd answers. Nonetheless, the experimental settings do not describe how the other seven CrowdTruth metrics were used:
3.1) Were the other metrics also considered for generating the crowd results with CrowdTruth?
3.2) What thresholds were used for the other metrics, and how were they configured (if applicable)?
If the scope of this paper is to investigate the impact of only one CrowdTruth metric, then this should be clarified in the paper. A sketch of how I understand the thresholding step is given below.
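The sketch below is purely illustrative: the function and variable names, the example scores, and the use of a single fixed cut-off are my own assumptions and not the authors' implementation. It only shows the kind of thresholding on the media unit annotation score whose configuration I am asking about; question 3.2 asks whether analogous settings exist for the remaining metrics.

```python
# Illustrative only: my own naming and example values, not the authors' code.
# Keep an annotation for a media unit only if its unit-annotation score
# reaches a chosen threshold.
def accept_annotations(unit_annotation_scores, threshold):
    """unit_annotation_scores: dict mapping annotation -> score in [0, 1]."""
    return {ann for ann, score in unit_annotation_scores.items()
            if score >= threshold}

# Hypothetical scores for one media unit; with threshold 0.5,
# only 'car' and 'truck' are kept as crowd answers.
scores = {"car": 0.80, "truck": 0.55, "bicycle": 0.20}
print(accept_annotations(scores, threshold=0.5))
```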
4. CrowdFlower settings in the experiments:
In their response, the authors explain that besides the configurations reported in the paper, they used the default CrowdFlower settings. This clarification should definitely be included in the paper. Still, there are a couple of settings that should be further described:
4.1) When were the microtasks of each use case submitted to CrowdFlower?
4.2) The authors explain in Section 3.1 that the payment per task was gradually increased. What were the starting and maximum payments for each type of task? In addition, the authors report the cost per judgment in Table 3, but it is unclear how this cost relates to the payment configuration.
5. Computation of precision and recall:
The conducted evaluation reports the micro-F1 score to measure the quality of the studied crowdsourcing methodologies, with the aim of avoiding biases based on the size of the ‘classes’ in the dataset. However, it is unclear what the definition of ‘class’ is in this context; a sketch of the computation I have in mind is given below.
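A minimal sketch of micro-averaged F1, assuming a setting with one label per item; the items and labels are invented for illustration and this is not the authors' evaluation code. Because true and false positives are pooled over all classes, the definition of ‘class’ directly determines which pairs enter these counts, which is why it should be stated explicitly.

```python
# Illustrative micro-averaged F1 over (item, class-label) pairs.
# Items and labels are invented; not the authors' evaluation code.
def micro_f1(gold, predicted):
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)   # pooled true positives over all classes
    fp = len(pred_set - gold_set)   # pooled false positives
    fn = len(gold_set - pred_set)   # pooled false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [("clip1", "dog bark"), ("clip2", "engine"), ("clip3", "engine")]
pred = [("clip1", "dog bark"), ("clip2", "siren"), ("clip3", "engine")]
print(round(micro_f1(gold, pred), 3))  # 0.667
```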
6. Collection of Trusted Judgements:
In Section 5, the authors mention that for the ‘sound interpretation’ task all the answers collected from the crowd were accepted as part of the trusted judgments. However, this is not mentioned in Section 3.3. This clarification should be included before presenting the results.
Further minor comments:
- Page 4, w is not defined in wwa(w), wsa(w), and na(w).
- Page 12, the following passage is difficult to follow: “According to our theory of the disagreement triangle, where the ambiguity of the task propagates in the crowdsourcing system affecting the degree to which workers disagree (i.e. the optimal number of workers per task), and the clarity of the unit (i.e. the optimal media unit-annotation score threshold).”
- The sentence “an increased number of crowd workers leads to growth and stabilization in the quality of annotations, going against the usual practice of employing a small number of annotators.” appears identically in the abstract and the conclusions. This should be avoided.