Review Comment:
This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.
(1) originality
This work is built on top of the authors' previous work; however, it presents novel results.
(2) significance of the results
Certain empirical observations should be further analyzed to derive conclusive results.
(3) quality of writing
Overall, the paper is well written. The readability of the paper can be improved by including more examples.
This work presents a methodology for gathering ground truth data for semantic annotation tasks via microtask crowdsourcing. The proposed solution comprises four components: (i) the CrowdTruth metrics, which rely on workers' disagreement to detect low-quality answers as well as vagueness/ambiguity in crowdsourced data; (ii) a method for task complexity assessment to estimate efficient crowdsourcing task configurations; (iii) task templates that can be reused and adapted across different tasks; and (iv) a method for task setup to configure crowdsourcing task parameters, e.g., units per job and redundancy, among others. The proposed methodology was empirically evaluated on four different use cases using seven real-world datasets. In summary, the experimental results show that it is feasible to collect ground truth data via microtasks, and that disagreement can serve as an indicator of ambiguous tasks (using the CrowdTruth unit clarity metric) and can also be used to detect spam workers (using the CrowdTruth worker metrics).
The paper tackles a highly relevant problem in crowdsourcing: devising efficient and reusable methodologies to execute microtasks. My major concern about this work is that the description of the methodology is not very precise and the claimed optimality achieved by the proposed solution is not demonstrated. A similar comment applies to the empirical evaluation: there are no details about the computational techniques (e.g., clustering, entity extraction) or the crowdsourcing configurations in CrowdFlower; therefore, the experiments are not reproducible. In addition, some of the experimental results are not conclusive as presented. The major issues are detailed in the following:
1) Regarding the CrowdTruth metrics, the authors explain that "The length of the annotation vector depends on the number of possible answers". It is clear how the annotation vector is built when the question is a multiple-choice question. However, the paper does not explain whether a low-dimensional embedding of, for example, "free text input" or "highlighting" answers is needed, nor how it would be carried out. I suggest that the authors provide a formal definition of the annotation vector, explaining how workers' answers for the different task types are transformed into dimensions of the proposed vectorial representation.
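To illustrate the kind of formal definition I have in mind, here is a minimal sketch (my own illustration with a hypothetical answer space, not the authors' definition); it shows why the multiple-choice case is straightforward and where the open question lies for free-text or highlighting answers:

```python
# Reviewer's sketch: mapping a worker's answers to an annotation vector.
# The answer space below is hypothetical and only serves as an illustration.

ANSWER_SPACE = ["treats", "causes", "prevents", "none"]

def annotation_vector(selected_answers, answer_space=ANSWER_SPACE):
    """Binary vector with one dimension per possible answer (closed, multiple-choice task)."""
    return [1 if answer in selected_answers else 0 for answer in answer_space]

# Multiple-choice case: the mapping is obvious.
worker_vector = annotation_vector({"treats", "prevents"})  # -> [1, 0, 1, 0]

# Free-text / highlighting case: the answer space is open, so some mapping
# (normalization, clustering of keywords, span alignment, ...) is needed before a
# finite-dimensional vector can be built; it is exactly this mapping that the
# paper should define formally.
free_text_answer = "relieves the symptoms"
# worker_vector = annotation_vector(normalize(free_text_answer))  # 'normalize' is unspecified
```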
2) The task complexity assessment presented in Section 3.2 seems rather rushed. For instance, it is unclear how the different features that impact crowd performance in a task were identified. Also, the notions of "task complexity" and "task difficulty" are not properly defined or distinguished. Regarding Table 1, the authors should provide more details about how the different feature elements were classified into low, medium, and high difficulty.
2.1) Did the crowdworkers provide any feedback regarding the complexity/difficulty of the tasks?
2.2) Do the authors have empirical evidence that the task complexity of the 'medical relation extraction' use case was actually reduced, as claimed?
3) For the template reusability, the authors mention that “[users can] create new types of tasks by borrowing successful elements from existing templates”. How are users informed about the successful elements in different templates?
4) Regarding the task setup, in Section 3.4 the authors claim that the task complexity assessment and the template components "(...) provide the basis for the approach towards task setup in a way that each task is performed in an optimal setting in terms of time, cost, and quality of the results". If this is the case, the authors should formally define the optimization problem and the conditions that ensure optimality, and demonstrate that the parameter settings determined by the proposed approach are actually optimal, i.e., that there are no better configurations in terms of time, cost, and quality. However, from the description of the methodology it seems that the authors rather mean that the approach is able to find parameters that lead to *efficient* crowd performance, but not necessarily optimal ones. Section 3.4 should be carefully revised in this regard, since optimality is mentioned several times throughout the paper but not demonstrated.
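For concreteness, the kind of formalization I would expect is sketched below, in my own notation and under my own assumptions about the configuration space (not taken from the paper):

```latex
% Reviewer's sketch of a possible formalization of "optimal task setup".
% Let c = (units per job, redundancy, payment, ...) range over a space C of
% admissible configurations; then one could require
\[
  c^{*} \;=\; \arg\min_{c \in C} \;\bigl(\alpha \cdot \mathrm{cost}(c) + \beta \cdot \mathrm{time}(c)\bigr)
  \quad \text{subject to} \quad \mathrm{quality}(c) \geq q_{\min}.
\]
% A claim of optimality would then require showing that the configuration returned
% by the task-setup method attains (or provably approximates) c^{*}.
```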
5) The description of the dataset creation is not precise enough, leaving several questions open:
5.1) What distant supervision method is used to generate DS1?
5.2) What are the arguments/configurations used in the distant supervision method to process DS1?
5.3) How were the 1,000 sounds chosen in DS2?
5.4) What clustering algorithm was used to group the sounds from DS2 according to their length? From the description of the dataset, it seems that the sounds were *classified* rather than *clustered*. Please clarify.
5.5) How is it detected that a passage in DS3 "could potentially contain the answer to the question"?
5.6) When is a passage “too short” or “too long” in DS3?
5.7) How was it determined that a passage in DS3 was unreadable?
5.8) How was DS3 processed to build DS4? Manually?
5.9) Besides DS7, are the other datasets available online?
6) Similar to the previous comment, specifications of the automatic approaches used in the machine-crowd workflows should be provided, specifically:
6.1) In UC2, what is the clustering algorithm used to group the sound keywords provided by the crowd?
6.2) In UC4, which machine-based named entity extraction method was used? Also, the data cleaning and entity span processes should be explained.
7) Further details about the crowdsourcing settings used in the experiments should be specified:
7.1) When were the microtasks of each use case submitted to CrowdFlower? It is important to specify this information since microtask platforms are labor markets and appropriate payments per job may vary over time.
7.2) Besides the settings specified in Table 6, how were other CrowdFlower parameters (e.g., “test questions”, “minimum time per page”) configured?
7.3) Why was the maximum number of tasks per worker configured differently for each task? How was the maximum number of tasks per worker determined?
7.4) In Table 7, the number of jobs submitted to CrowdFlower is specified. However, it is not explained how units are grouped into jobs.
8) Given that the quality of the CrowdTruth metrics in the proposed methodology is compared against a manually created gold standard, the authors should provide more details about how the manual assessment was carried out in order to guarantee reliable results.
8.1) Were the annotators instructed regarding how to perform the manual assessment?
8.2) How was low-, medium-, and high-quality work defined?
8.3) The inter-coder agreement among the annotators should be reported. Although this might seem "contradictory" to the core of the proposed methodology, it is interesting to evaluate in some way the quality of the manual assessment (a sketch of how such agreement could be computed is given after point 8.4).
8.4) Out of the 146 workers, how many workers were finally classified as low-, medium-, high-quality, respectively?
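Regarding 8.3, a minimal sketch of how such an agreement score could be computed and reported follows (the annotator labels are hypothetical, and Cohen's kappa is only one of several suitable measures):

```python
# Reviewer's sketch: inter-coder agreement between two annotators who manually
# labeled the same workers as low-, medium-, or high-quality.
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels assigned by two annotators to the same five workers.
annotator_1 = ["low", "high", "medium", "high", "low"]
annotator_2 = ["low", "high", "high",   "high", "low"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
# For more than two annotators, Fleiss' kappa or Krippendorff's alpha would be appropriate.
```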
9) In the analysis of disagreement as a signal, a proper correlation analysis is not presented. Unfortunately, the "trends" depicted in Figure 11 are not sufficient to derive conclusive results. I suggest that the authors use an appropriate correlation method (e.g., Pearson correlation) to corroborate the observations reported in Section 6.3.
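For example, the observations could be corroborated along the following lines (a minimal sketch with hypothetical values standing in for the quantities plotted in Figure 11):

```python
# Reviewer's sketch: testing the strength of the trend in Figure 11 with a
# correlation coefficient instead of visual inspection alone.
from scipy.stats import pearsonr

# Hypothetical paired measurements per unit, standing in for whatever quantities
# Figure 11 plots (e.g., disagreement score vs. downstream performance).
disagreement = [0.12, 0.35, 0.41, 0.58, 0.73, 0.80]
performance  = [0.91, 0.84, 0.80, 0.71, 0.65, 0.60]

r, p_value = pearsonr(disagreement, performance)
print(f"Pearson r = {r:.2f}, p = {p_value:.3f}")
# If the relationship is expected to be monotonic but not linear, Spearman's rank
# correlation (scipy.stats.spearmanr) would be a reasonable alternative.
```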
10) The Related Work should also include the approach proposed by Bornea and Barker [1] for extracting medical relations, as well as the works by Kilicoglu et al. [2] and van Mulligen et al. [3] on creating ground truth annotations in the biomedical domain.
[1] Bornea, M., & Barker, K. (2015). Relational Path Mining in Structured Knowledge. In Proceedings of the 8th International Conference on Knowledge Capture (K-CAP 2015), p. 14. ACM.
[2] Kilicoglu, H., Rosemblat, G., Fiszman, M., & Rindflesch, T. C. (2011). Constructing a semantic predication gold standard from the biomedical literature. BMC Bioinformatics, 12(1), 486.
[3] van Mulligen, E. M., et al. (2012). The EU-ADR corpus: Annotated drugs, diseases, targets, and their relationships. Journal of Biomedical Informatics, 45(5), 879-884.
As a minor remark, the readability of the paper could be improved by including a running example to illustrate the different components of the methodology. In addition, the paper is not self-contained; several details that are crucial to understanding the methodology are not presented in the paper, and readers are instead pointed to previous works by the authors. More minor comments follow:
* Missing references in the following claims/assumptions:
** In Introduction, “(...) gathering ground truth data for training and evaluating IE systems is still a bottleneck (...)”.
** In Introduction, “(...) such methods often generate poor quality data due to noise and ambiguity-specific semantics”.
* In Table 3, the input size of DS1 is 902, but in the text appears 900.
* In Section 5.3, mu and sigma are not properly defined.
* In Table 6, “Maximum tasks per worker” could be more informative than “Tasks per Workers”.
* Figure 12 is displayed before Figure 11.