Empirically-derived Methodology for Crowdsourcing Ground Truth

Tracking #: 1072-2284

Authors: 
Anca Dumitrache
Oana Inel
Benjamin Timmermans
Carlos Ortiz
Robert-Jan Sips
Lora Aroyo

Responsible editor: 
Guest Editors Human Computation and Crowdsourcing

Submission type: 
Full Paper
Abstract: 
The main challenge for cognitive computing systems, and specifically for their natural language processing, video and image analysis components, is to be provided with large amounts of training and evaluation data. The traditional process for gathering ground truth data is lengthy, costly, and time-consuming: (i) expert annotators are not always available; (ii) automated methods generate data whose quality is affected by noise and ambiguity-specific semantics. Typically, these practices use inter-annotator agreement as a measure of quality. However, in many domains, such as event detection, ambiguity and a multitude of perspectives on the information examples are continuously present. Crowdsourcing-based approaches are gaining popularity in the attempt to solve the issues related to volume of data and lack of annotators. The majority of these approaches also use inter-annotator agreement as a quality measure, by assuming that there is only one correct answer for each example. In this paper we present an empirically derived methodology for efficiently gathering ground truth data in a number of diverse use cases that cover a variety of domains and annotation tasks. Central to our approach is the use of CrowdTruth disagreement-based quality metrics (1) to achieve efficiency in terms of time and cost and (2) to achieve an optimal quality of results in terms of capturing the variety of interpretations in each example. Experimental results show that this methodology can be adapted to a variety of tasks; that it is a time- and cost-efficient method for gathering ground truth data; and that inter-annotator disagreement is an effective signal for distinguishing, with high accuracy, good workers from spammers and clear examples from ambiguous ones.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Gerhard Wohlgenannt submitted on 29/Nov/2015
Suggestion:
Major Revision
Review Comment:

This paper is about an "empirically-derived methodology for crowdsourcing ground truth". It centers on the results from four use cases. In contrast to most existing crowdsourcing approaches, the authors use disagreement-based metrics instead of simple majority voting, etc.

The paper is well written and quite easy to understand.
Some of the results are interesting, e.g. the results on "low-quality worker" detection; some are rather obvious, especially that a worker becomes faster at a job when doing it repeatedly.

But coming to my main issue: I miss a clear statement of the point the authors want to make.
Before reading, I would have expected one of these two strategies for the paper:

A) an analytical paper on the benefits of using worker disagreement to a) find low-quality workers, and b) detect ambiguous units of work and provide ideas on how to handle this ambiguity. You address those points in the paper, especially the detection of low-quality workers, but not in a very detailed way. The same goes for the aspect of task complexity: the assessment of task complexity in your paper is rather vague. Maybe there would be a way to have some kind of formula (or other system) to assess task complexity, e.g. on a scale from 1-100, and then provide hints about the appropriate worker payment for the given task complexity? I am not sure if that makes sense, or if it is possible to estimate appropriate worker payment (automatically) in such a way. If not, then it would be good to state in the paper why it is necessary to assess task complexity and payment manually.
As another example, with such a strategy one could expect a more formal procedure for spammer detection: (i) set up the media unit vectors, (ii) detect spammers and low-quality workers using, e.g., worker disagreement in those vectors (providing a threshold/cut-off parameter?), (iii) remove low-quality data from the media unit vectors, (iv) calculate unit clarity with some algorithm from the remaining high-quality worker input (a minimal illustrative sketch of such a procedure is given after option B below).

B) providing and evaluating a clear method for using the discussed aspects such as disagreement and task complexity, and providing tools that use the presented metrics, templates, complexity analysis, etc. With this strategy you could provide a set of tools which help build the crowdsourcing tasks (for example, task complexity analysis and re-using/modifying existing templates) and tools for analysing the results (showing an analysis of the raw results, then a detection of low-quality workers, then an analysis of the results of high-quality workers only, including unit clarity values per unit, etc.).
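
To illustrate the kind of formal procedure sketched in option A, steps (i)-(iv), here is a minimal, purely hypothetical sketch in Python. It assumes simple cosine-based agreement scores and a manually chosen threshold; the function names, formulas and thresholds are illustrative only and are not the authors' actual CrowdTruth implementation:

```python
# Hypothetical sketch of steps (i)-(iv); all formulas and thresholds are
# illustrative and are not the authors' actual CrowdTruth metrics.
import numpy as np

def cosine(a, b):
    """Cosine similarity; returns 0 if either vector is all zeros."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

def filter_workers_and_score_units(annotations, threshold=0.5):
    """annotations: {unit_id: {worker_id: 0/1 numpy vector over possible answers}}"""
    # (i) media unit vectors: sum of all worker vectors per unit
    unit_vectors = {u: sum(ws.values()) for u, ws in annotations.items()}

    # (ii) worker agreement: mean cosine between a worker's vector and the
    # unit vector built from all *other* workers on the same unit
    scores = {}
    for u, ws in annotations.items():
        for w, v in ws.items():
            scores.setdefault(w, []).append(cosine(v, unit_vectors[u] - v))
    low_quality = {w for w, s in scores.items() if np.mean(s) < threshold}

    # (iii) rebuild the unit vectors from high-quality workers only
    clean = {u: sum((v for w, v in ws.items() if w not in low_quality),
                    np.zeros_like(unit_vectors[u]))
             for u, ws in annotations.items()}

    # (iv) unit clarity: how strongly the remaining workers converge on a
    # single answer (cosine with the closest one-hot answer vector)
    clarity = {u: max(cosine(vec, one_hot) for one_hot in np.eye(len(vec)))
               for u, vec in clean.items()}
    return low_quality, clarity
```

The cut-off `threshold` and the choice of aggregation (mean vs. median agreement per worker) are exactly the kind of parameters such a procedure would need to report and justify.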

Maybe a major revision of the paper is not necessary, but I would like to see the authors' motivation for choosing their strategy for the paper, and I would like the paper to state more clearly what its goal, central theme and benefit for the reader are.

---------------------------

Some more detailed/minor comments:

* There has been work showing the benefits in cost and scalability of using paid crowdsourcing as compared to domain experts, so I would like to see more information on the specific benefits of leveraging disagreement (something which is not available when using only one expert annotator). For example, it would be interesting to see which percentage of units per UC is ambiguous, and also the different levels of ambiguity. I suspect there are different levels, e.g. two solutions with workers split 50%/50%, or more than two solutions; there is also the question of where to draw the line. For example, if 80% of workers prefer answer A and 20% prefer answer B, is the unit clear or ambiguous?
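
To make the question of where to draw the line concrete, here is a small illustrative computation. It assumes a cosine-based clarity score; the function `clarity` below is a simplification for illustration only and not necessarily the exact metric used in the paper:

```python
# Illustrative only: a cosine-style "clarity" score for hypothetical answer
# distributions, defined here as the cosine similarity between the unit's
# answer-count vector and the closest single-answer (one-hot) vector.
import numpy as np

def clarity(counts):
    counts = np.asarray(counts, dtype=float)
    return counts.max() / np.linalg.norm(counts)

print(clarity([8, 2]))     # 80%/20% split over 2 answers -> ~0.97 (fairly clear)
print(clarity([5, 5]))     # 50%/50% split over 2 answers -> ~0.71 (ambiguous)
print(clarity([4, 3, 3]))  # split over 3 answers         -> ~0.69 (more ambiguous)
```

Under such a score, "where to draw the line" reduces to choosing a clarity threshold, which could itself be reported and analyzed per use case.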

* I like that you have a broad variety of use cases.

* In use case UC2/DS2: You say that one cluster of sound snippets contains audio files which are 0.0001 to 0.23 seconds long.
Who can hear a sound which is 0.0001 seconds long? :-)
Also, what happens to snippets of, e.g., 1 second or 10 seconds in length? They are not in any of the clusters. Are they simply omitted?

* Your way of setting task parameters using preliminary experiments (as described in Section 5.2) seems rather ad hoc and very labor-intensive. Is there a way to make this more formal or semi-automated? If not, no problem, but it would be good to say a word on whether there are options to do this in a more formal way.

* At first sight it was not clear to me that Table 10 is about the quality of spammer detection, and not about the quality of workers on their tasks --> improve the caption. Sometimes it is also not clear to me if and where you distinguish between spammers and low-quality (genuine) workers -- e.g. in Table 10, is the evaluation about "real" spammers only, or also about detecting other low-quality workers?

* Some of the findings are a bit bold, e.g. you conclude from a single use-case experiment on medical relation extraction that "domain appears to be not as relevant ... for task complexity". My intuition would say that if the domain is really complex, then this will have an impact on task complexity.

Typos etc:
* Related work section, first paragraph "disagreement-based (Section 2.4" --> missing a closing ")"
* page 4, footnote no 2 -- there should be no space before the footnote in the text?
* page 4 "[24] shows that often, machine learning" .. no comma before machine learning?
* In general: You use constructions like "Finally, [24] shows ...". I was taught that in this case one should write "Finally, Lin et al. [24] ...". But I am not sure; it may be a matter of style.
* In Table 7, you only give decimal places for "578.22" .. --> remove them?

Review #2
Anonymous submitted on 11/Jan/2016
Suggestion:
Reject
Review Comment:

This paper describes a crowdsourcing framework and experimentally addresses two research questions, one on crowd worker efficiency and one on how to measure work quality and task complexity. The paper is well written and presents an extensive set of different experiments. However, important details of the experimental setup and results seem to be missing, the findings are of limited novelty and are not backed by sufficient detail about the experimental results, and some conclusions are not fully supported by those results.

The proposed methodology is not very clear and is not evaluated in its entirety against alternative state-of-the-art approaches (for example, for spam detection). Many of the findings are not novel and have already been observed in previous work on crowdsourcing (e.g., learning effects of workers doing similar tasks in sequence, the fact that more units attract more workers, etc.).

The experimental evaluation presents results from many different workflows which are selected according to unclear criteria and are not well motivated by the chosen research questions and objectives.

Initially, test runs are used to set parameter values and to check the quality of the tasks and of the work. Then, experimental results addressing the two research questions are presented by comparing the results with a manual annotation of crowd work quality.

The approaches, e.g., for low-quality worker detection, are not compared with alternative approaches from the literature, and thus it is difficult to assess their accuracy, as custom data has been used to evaluate them.

Detailed comments:
- Section 3.1: The metrics definition part should be extended with examples and more detailed definitions to make the paper self-contained
- Section 3.2: Task complexity features seem to be highly qualitative dimensions and it is unclear how they can be measured given a task
- Section 3.3: It is unclear how templates relate to disagreement. Perhaps the evaluation should have included a comparison of tasks with and without templates to show the benefits of having them.
- Section 4.1: It seems that no task with open ended questions was included in the study. Multiple-choice questions are not the only task type in crowdsourcing. A discussion on the applicability of the method to different task types should be included.
- Section 5.2: you should quantify and experimentally show how much efficiency is improved by reading a question once per six pages
- Section 5.2: the duplication of questions does not seem a good strategy to adopt. Perhaps fewer than 5 questions could have been shown instead
- Section 5.2: as one of your objectives is to reduce the cost of ground truth creation, it would be interesting to report the cost of the parameter-tuning experiments and whether it was included in the comparison done in Table 9
- Section 5.3: For the spam detection task you should explain in more detail how thresholds were optimized.
- Figure 7: the authors should report whether the differences between the execution times are statistically significant before concluding that some tasks have shorter execution times than others.
- Section 6.2: The result that workers learn and become more efficient as they do more tasks is interesting, but it is not novel and seems not related to the CrowdTruth framework but rather intrinsic to general crowdsourcing dynamics, as shown by previous work.
- Section 6.2: The result that jobs with more units are completed faster is likewise interesting but not novel, and again seems intrinsic to general crowdsourcing dynamics rather than related to the CrowdTruth framework.
- Section 6.2: It is unclear if in any/all experiments the time needed by experts to perform the task was measured. For example, for the music experts who were not available, how was the task completion time estimation done? Overall, I believe more details need to be included to support the claim that, in general, crowdsourcing is cheaper than using domain experts.
- Section 6.3: fix “[reference exp. setup]”
- Section 6.3: The results for the spam detection task are not compared against any baseline and thus it is difficult to understand how effective the approach is and how difficult the task was on the specific dataset
- Section 7: The claim that the CrowdTruth method is beneficial when units are published over a long period of time has not been shown experimentally, as only short-term experiments have been reported
- Section 7.3: “medical medical” -> “medical”

Review #3
Anonymous submitted on 24/Jan/2016
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

(1) originality
This work is built on top of the authors' previous work; however, it presents novel results.

(2) significance of the results
Certain empirical observations should be further analyzed to derive conclusive results.

(3) quality of writing
Overall, the paper is well written. Readability of the paper can be improved by including more examples.

This work presents a methodology for gathering ground truth data for semantic annotation tasks via microtask crowdsourcing. The proposed solution comprises four components: (i) the CrowdTruth metrics, which rely on workers' disagreement to detect low-quality answers or vagueness/ambiguity in crowdsourced data; (ii) a method for task complexity assessment to estimate efficient crowdsourcing task configurations; (iii) task templates that can be reused and adapted in different tasks; (iv) a method for task setup to configure crowdsourcing task parameters, e.g., units per job and redundancy, among others. The proposed methodology was empirically evaluated on four different use cases using seven real-world datasets. In summary, the experimental results show that it is feasible to collect ground truth data via microtasks, and that disagreement can be an indicator of ambiguous tasks (using the CrowdTruth unit clarity metric) and can also be used to detect spam workers (using the CrowdTruth worker metrics).

The paper tackles a highly relevant problem in crowdsourcing: devising efficient and reusable methodologies to execute microtasks. My major concern about this work is that the description of the methodology is not very precise and the claimed optimality achieved by the proposed solution is not demonstrated. A similar comment applies to the empirical evaluation: there are no details about the computational techniques (e.g., clustering, entity extraction, etc.) or the crowdsourcing configurations in CrowdFlower; therefore, the experiments are not reproducible. In addition, some of the experimental results are not conclusive as presented. More details about the major issues follow:

1) Regarding the CrowdTruth metrics, the authors explain that "The length of the annotation vector depends on the number of possible answers". It is very clear how the annotation vector is built when the question is a multiple-choice question. However, the paper does not explain whether a low-dimensional embedding of, for example, "free text input" or "highlighting" answers is needed, and how it is carried out. I suggest that the authors provide a formal definition of the annotation vector, explaining how workers' answers for the different tasks are transformed into dimensions of the proposed vectorial representation.

2) The task complexity assessment presented in Section 3.2 seems very rushed. For instance, it is unclear how the different features that impact crowd performance on a task were identified. Also, the definitions of, and the differences between, "task complexity" and "task difficulty" are not properly stated. Regarding Table 1, the authors should provide more details about how the different feature elements were classified into low, medium, and high difficulty.
2.1) Did the crowdworkers provide any feedback regarding the complexity/difficulty of the tasks?
2.2) Do the authors have empirical evidence that, for the 'medical relation extraction' task, the "complexity" was actually reduced as mentioned?

3) For the template reusability, the authors mention that “[users can] create new types of tasks by borrowing successful elements from existing templates”. How are users informed about the successful elements in different templates?

4) Regarding the task setup, in Section 3.4 the authors claim that the task complexity assessment and the template components "(...) provide the basis for the approach towards task setup in a way that each task is performed in an optimal setting in terms of time, cost, and quality of the results". If this is the case, the authors should formally define the optimization problem and the conditions that ensure optimality, and demonstrate that the parameter settings determined by the proposed approach are actually optimal, i.e., that there are no better configurations in terms of time, cost, and quality. However, from the description of the methodology, perhaps the authors mean that the approach is able to find parameters that lead to *efficient* crowd performance, but not necessarily optimal ones. Section 3.4 should be carefully revised in this regard, since optimality is mentioned several times throughout the paper but not demonstrated.
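
As an illustration of what such a formal statement could look like (purely a sketch; the symbols below are not taken from the paper), the task setup could be cast as a constrained optimization over configurations:

$$\theta^{*} = \arg\min_{\theta \in \Theta} \; \alpha\,\mathrm{cost}(\theta) + \beta\,\mathrm{time}(\theta) \quad \text{subject to} \quad \mathrm{quality}(\theta) \ge q_{\min},$$

where $\theta$ ranges over task configurations (units per job, redundancy, payment, ...), $\alpha$ and $\beta$ weight cost against time, and quality could be measured, e.g., as agreement with a gold standard. Claiming optimality would then require showing that the chosen configuration solves, or at least approximates, such a problem.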

5) The description of the dataset creation is not precise enough, leaving several questions open:
5.1) What distant supervision method is used to generate DS1?
5.2) What are the arguments/configurations used in the distant supervision method to process DS1?
5.3) How were the 1,000 sounds chosen in DS2?
5.4) What clustering algorithm was used to group the sounds from DS2 according to their length? By the description of the dataset, it seems that the sounds were *classified* instead of *clustered*. Please, clarify.
5.5) How is it detected that a passage in DS3 "could potentially contain the answer to the question"?
5.6) When is a passage “too short” or “too long” in DS3?
5.7) How was it determined that a passage was unreadable in DS3?
5.8) How was DS3 processed to build DS4? Manually?
5.9) Besides DS7, are the other datasets available online?

6) Similar to the previous comment, specifications about automatic approaches used in the machine-crowd workflows should be provided, specifically:
6.1) In UC2, what is the clustering algorithm used to group the sound keywords provided by the crowd?
6.2) In UC4, which machine-based named entity extraction method was used? Also, the data cleaning and entity span processes should be explained.

7) Further details about the crowdsourcing settings used in the experiments should be specified:
7.1) When were the microtasks of each use case submitted to CrowdFlower? It is important to specify this information since microtask platforms are labor markets and appropriate payments per job may vary over time.

7.2) Besides the settings specified in Table 6, how were other CrowdFlower parameters (e.g., “test questions”, “minimum time per page”) configured?

7.3) Why was the maximum number of tasks per worker configured differently for each task? How was the maximum number of tasks per worker determined?

7.4) In Table 7, the number of jobs submitted to CrowdFlower is specified. However, it is not explained how units are grouped into jobs.

8) Given that the quality of the CrowdTruth metrics in the proposed methodology is compared against a manually created gold standard, the authors should provide more details about how the manual assessment was carried out in order to guarantee reliable results.
8.1) Were the annotators instructed regarding how to perform the manual assessment?

8.2) How was low-, medium-, and high-quality work defined?

8.3) The inter-coder agreement among the annotators should be reported. Although this might seem "contradictory" with the core of the proposed methodology, it is interesting to evaluate in some way the quality of the manual assessment.

8.4) Out of the 146 workers, how many workers were finally classified as low-, medium-, high-quality, respectively?

9) In the analysis of disagreement as a signal, a proper correlation analysis is not presented. Unfortunately, the "trends" depicted in Figure 11 are not sufficient to derive conclusive results. I suggest that the authors use an appropriate correlation method (e.g., Pearson correlation) to corroborate the observations reported in Section 6.3.
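
For instance, such an analysis could look like the following sketch (the data arrays are hypothetical and only illustrate the suggested use of scipy.stats.pearsonr; the actual inputs would be the per-worker disagreement scores and the manually assessed quality scores from Section 6.3):

```python
# Hypothetical illustration of the suggested correlation analysis.
from scipy.stats import pearsonr

# Made-up example data: one disagreement score per worker and the
# corresponding manually assessed quality score.
disagreement = [0.12, 0.35, 0.40, 0.55, 0.61, 0.78, 0.80, 0.91]
quality      = [0.95, 0.80, 0.85, 0.60, 0.55, 0.40, 0.35, 0.20]

r, p_value = pearsonr(disagreement, quality)
print(f"Pearson r = {r:.2f}, p = {p_value:.4f}")
```

If the relationship is expected to be monotonic but not linear, Spearman's rank correlation (scipy.stats.spearmanr) would be an equally reasonable choice.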

10) The Related Work section should also include the approach proposed by Bornea and Barker [1] for extracting medical relations, as well as the works by Kilicoglu et al. [2] and van Mulligen et al. [3] on creating ground truth annotations in the biomedical domain.

[1] Bornea, M., & Barker, K. (2015, October). Relational Path Mining in Structured Knowledge. In Proceedings of the 8th International Conference on Knowledge Capture (p. 14). ACM.

[2] Kilicoglu, H., Rosemblat, G., Fiszman, M., & Rindflesch, T. C. (2011). Constructing a semantic predication gold standard from the biomedical literature. BMC bioinformatics, 12(1), 486.

[3] van Mulligen, E. M., et al. (2012). The EU-ADR corpus: annotated drugs, diseases, targets, and their relationships. Journal of Biomedical Informatics, 45(5), 879-884.

As a minor remark, the readability of the paper could be improved by including a running example to illustrate the different components of the methodology. In addition, the paper is not self-contained; several details that are crucial for understanding the methodology are not presented in the paper, and readers are instead pointed to previous works of the authors. More minor comments follow:

* Missing references in the following claims/assumptions:
** In Introduction, “(...) gathering ground truth data for training and evaluating IE systems is still a bottleneck (...)”.
** In Introduction, “(...) such methods often generate poor quality data due to noise and ambiguity-specific semantics”.

* In Table 3, the input size of DS1 is 902, but in the text appears 900.

* In Section 5.3, mu and sigma are not properly defined.

* In Table 6, “Maximum tasks per worker” could be more informative than “Tasks per Workers”.

* Figure 12 is displayed before Figure 11.