Towards hybrid NER: an extended study of content and crowdsourcing-related performance factors

Tracking #: 1068-2279

Oluwaseyi Feyisetan
Elena Simperl
Markus Luczak-Roesch
Ramine Tinati
Nigel Shadbolt

Responsible editor: Guest Editors Human Computation and Crowdsourcing

Submission type: Full Paper
This paper explores the factors that influence the human component in hybrid approaches to named entity recognition (NER) in microblogs, which combine state-of-the-art automatic techniques with human and crowd computing. We identify a set of content and crowdsourcing-related features (number of entities in a post, types of entities, content sentiment, skipped true-positive posts, average time spent to complete the tasks, and interaction with the user interface) and analyse their impact on the accuracy of the results and the timeliness of their delivery. Using CrowdFlower and a simple, custom built gamified NER tool we run experiments on three datasets from related literature and a fourth newly annotated corpus. Our findings show that crowd workers are adept at recognizing people, locations, and implicitly identified entities within shorter microposts. We expect these findings to lead to the design of more advanced NER pipelines, informing the way in which tweets are chosen to be outsourced or processed by automatic tools. Experimental results are published as JSON-LD for further use by the research community.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 02/Jun/2015
Review Comment:

The paper is interesting, but it is largely a verbatim copy of the earlier version of the paper accepted at ESWC 2015:

The abstract of the paper is exactly the same, and so is most of the paper, including the discussion section where the contributions of the work are summarised. The few additions in this new version seem insufficient to me without further clarification from the authors.

I would suggest that the authors:
* Rewrite parts of the paper so it doesn't look the same as the earlier version.
* Better describe what the novel (and relevant) contributions are in the present version, so as to be able to differentiate it and assess it accordingly. The paragraph at the end of the introduction is rather vague at the moment.

Review #2
By Leon Derczynski submitted on 10/Jun/2015
Major Revision
Review Comment:

This paper on the crowd and NER expands an ESWC paper. I am very sympathetic to this work, and think the angle that it takes is very fruitful. However, the analyses are missing in places, or only cover surface observations, and it is hard to see that the novel contribution for SWJ goes beyond the ESWC paper in many places. To bring it up to standard, it could do with further analysis and some deeper insights (i.e. perhaps just some more thinking time and a refactoring of the content). Specific feedback is given below. Personally I would much rather see this ms invested in and developed into a strong, sharp contribution at the end of the trail that the ESWC work started, rather than end up at another lesser journal - understanding the link between diverse annotation environments and diverse annotation skillsets is crucial and a timely problem in the field.

Specific feedback:

The writing quality is never a problem.

Consider reworking the title, perhaps dropping the first three words and appending "in NE annotation". To me, "Towards" implies that you didn't get there, which is fine, but this study is pretty much complete.

Page 2 para 2 "This paper offers"...: this is almost identical to the ESWC work, but this paper offers more, and that additional content should be stated here.

Capturing uncertainty in annotation has been addressed recently (Plank et al. EACL 2014 best paper), and monitoring the interaction between crowd diversity and crowd recall has been examined too (Trushkowsky et al. ICDE 2013 best paper). Both of these works relate to this research at multiple places in the paper, and their findings/techniques are very relevant.

Reference 17's ø is encoded wrongly.

Cohn ACL 2013 "Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation" addresses extrinsic factors in estimating annotation quality, which is relevant to the literature noted here. The lit rev is very focused on the text-level task of crowdsourced NER, and does an excellent review as a direct consequence, but could be made broader in scope - we discover later that per-worker information is important, which intersects precisely with the Cohn ACL 2013 work. I.e., one should factor crowd worker performance into the annotation evaluation - a relevant point that has been made in the literature, and so should be present in the literature review.

Section 4.2: How good is Alchemy at sentiment? Can you evaluate this, e.g. with a SemEval test set? Or reference something that has? Otherwise, use of Alchemy just introduces noise of unknown reliability in the analysis.

Throughout: "nerd" -> "NERD"

Throughout: Table placing is very odd and really hinders readability. Please reconsider, and keep the tables near where they are referenced, in order. Personally I'd let LaTeX take control of it, and just leave the table code higher up in the document than each table's first mention. E.g., Table 12 is mentioned before Table 4 (pg 9) and is presented on pg. 14.

Section 5.2: After the paragraph marks please use a capital letter, e.g. "consider the tweet" -> "Consider the tweet"

Throughout: Use backticks for opening quotes, e.g. by "click on a phrase"

I'm not sure that the C1/C2 experiments demonstrate the point in ref [2] quite as directly as suggested in Section 7.1. Looking at Table 4, the loss is mostly in precision. This is at first sight a bit weird: precision is the part that the automated systems generally do better with (see [12]). In fact, in Table 4 recall is pretty much the same between C1 and C2. Low precision indicates cases where workers find "false positives", i.e. annotate entities not in the reference set. However, we already know that crowd recall is lower with smaller worker counts, and we know that diversity is key to getting good recall (cf. Trushkowsky ref above). So, what does this all suggest? I think what's really happening here, that the authors missed, is that existing datasets have low recall. The easy entities remain easy in C1 + C2; that's why recall is stable - the original annotators, and the crowd workers, always get them. Then, the crowd workers represent diversity beyond the original annotators, finding new entities, which presents as a precision drop. This is just like the inability of expert annotators to resolve things like "KKTNY", in prior datasets, which a diverse crowd *can* do. So while this table really shows us something happening, it's not enough to support [2] (although of course we all know it's intuitive: the raised cognitive load of annotating with a complex standard will lead to reduced performance); it's especially not enough to support [2] given the low number of crowd workers, and what we saw in the literature about required crowd worker counts. In short: something has been found here, but that thing is not strong evidence for [2].
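The precision argument above can be illustrated with a minimal sketch. It assumes set-based scoring over (tweet id, entity) pairs and uses entirely hypothetical toy data; the names `precision_recall`, `real_entities`, and `reference_gold` are invented for illustration and are not taken from the paper.

```python
def precision_recall(predicted, gold):
    """Set-based precision and recall over (tweet_id, entity) pairs."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: the reference annotation misses two real entities.
real_entities = {(1, "London"), (1, "Obama"), (2, "KKTNY"), (3, "Ritz")}
reference_gold = {(1, "London"), (1, "Obama")}  # incomplete gold standard

# A diverse crowd finds every real entity, including the two missing from gold.
crowd = {(1, "London"), (1, "Obama"), (2, "KKTNY"), (3, "Ritz")}

p_ref, r_ref = precision_recall(crowd, reference_gold)
p_real, r_real = precision_recall(crowd, real_entities)

print(f"against reference gold: P={p_ref:.2f} R={r_ref:.2f}")
print(f"against all real entities: P={p_real:.2f} R={r_real:.2f}")
```

In this toy case recall against the reference stays at its maximum while precision halves, even though every crowd annotation is correct. This is the pattern one would expect if the measured precision loss reflects low recall in the existing datasets rather than worker error.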

The LOC/ORG ambiguity mentioned on pg9 col2 para2 is just classical metonymy - a well-known problem in NER (see e.g. Maynard et al. RANLP 2003 "Towards a semantic extraction of named entities").

What does Table 7 say about sentiment in skipped tweets? It's not clear - don't we need to see the general sentiment distribution as well, in order to know whether these distributions are anomalous/significant? I couldn't work it out from the table and description - may just be an expository thing.

In Figure 3: do you have some analysis of why Wordsmith entities are skipped so consistently? Even LOC doesn't manage to dip below the average across all datasets.

On page 12, col 2, para 1, the paper discusses the times taken under C1 and C2. The paper would be stronger if some kind of analysis or hypothesis was made about the reasons behind this observation: is it perhaps due to greater annotator confidence, rooted in how much guideline material they've read?

On page 13, col 2, para 1, I lost track at the end of the description of the experiment detailed in Table 13. Is it possible to give one or two worked examples of this process? The intuition from the current description doesn't fit the data. In any event, the IAA scores are so low as to suggest the data is purely noise - how did this happen? Why? How could it be remedied? And what is the impact on dataset utility?

Continuing to table 13: How can raising the IAA threshold improve recall? Doesn't this only ever eliminate entity annotations? Removing annotations absolutely cannot lead to recall increases - unless we're removing them from the gold standard. It's really not clear.
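A small sketch of why this looks inconsistent, assuming that the threshold keeps only entities marked by at least a given number of workers and that the gold standard stays fixed. Both are assumptions about the experiment, and the function name and data are hypothetical.

```python
from collections import Counter


def recall_at_min_votes(worker_annotations, gold, min_votes):
    """Recall of the entities that at least `min_votes` workers annotated."""
    votes = Counter()
    for annotations in worker_annotations:
        votes.update(set(annotations))
    kept = {entity for entity, count in votes.items() if count >= min_votes}
    return len(kept & gold) / len(gold)


# Hypothetical annotations from three workers on the same tweets.
workers = [
    {"London", "Obama", "BBC"},
    {"London", "Obama"},
    {"London", "KKTNY"},
]
gold = {"London", "Obama", "BBC", "KKTNY"}

# Stricter agreement requirements can only shrink the kept set.
recalls = [recall_at_min_votes(workers, gold, k) for k in (1, 2, 3)]
print(recalls)
```

Under these assumptions recall can only stay flat or fall as the threshold rises, because the kept set shrinks while the gold set does not change. An increase would therefore imply that the gold standard itself is also being filtered.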

The reference to the confusion matrices (Table 5) on pg 15 is too far away from the content - please fix the table locations and order.

Page 5 col 2 para 1: Facebook, Youtube, Instagram ought to be italicised.

Section 7.2.2: RQ1.1 expects, rather than states, no?

The finding at the end of the section's "Number of entities" paragraph is key, and to my mind one of the most important points of the paper. It's buried in here, though; I think more focus should be drawn to this section's analyses in the conclusion and abstract.

In the "Entities types" para: this doesn't seem to be written in light of the base distribution of ORGs in tweets; if they're the most common entity, then won't they be the most skipped anyway? Can't immediately see that this field effect is controlled for before making these observations. Perhaps a very clear, low-reading-age description of the entity skipping process and experiments would help.
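The base-rate concern can be made concrete with a minimal sketch. The counts below are invented for illustration and do not come from the paper; `skip_rates` is a hypothetical helper.

```python
def skip_rates(total_by_type, skipped_by_type):
    """Per-type skip rate: skipped entities divided by all entities of that type."""
    return {
        entity_type: skipped_by_type.get(entity_type, 0) / total
        for entity_type, total in total_by_type.items()
        if total
    }


# Hypothetical counts: ORG is the most frequent type in the corpus.
total = {"PER": 200, "LOC": 100, "ORG": 400, "MISC": 100}
skipped = {"PER": 50, "LOC": 25, "ORG": 100, "MISC": 25}

# Raw share of all skips, which ignores how common each type is.
all_skips = sum(skipped.values())
share_of_skips = {t: n / all_skips for t, n in skipped.items()}

# Skip rate normalised by the base distribution of each type.
rates = skip_rates(total, skipped)

print(share_of_skips)
print(rates)
```

Here ORG accounts for half of all skipped entities, yet its normalised skip rate is the same as every other type. Reporting the normalised rate alongside the raw share would show whether ORGs are genuinely avoided or simply more frequent.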

Page 16 col 1 para 2: "well mainly formed well structured" - word order problem? And there should be more vspace after this paragraph.

The Micropost text length section reported findings, but I hoped it would address the question of why this happens. Can we see perhaps how these tweets look in the UI used? Are results consistent in other crowdsourcing settings, e.g. the GATE Crowd system (Bontcheva et al. EACL 2014), or is there a known weakness of the experiments in that they use the same UI? If that's the case, it's OK, but must be stated. The findings of this paragraph to me look like an artefact of an HCI choice.

Page 17 col 1 para 1: "started 1 entities" - spurious 1? This all sounded alright, I think; 24 is nicely between 15 and 35, so there's no problem here - is there? "We took into account the responsive nature" - I didn't understand this section, only knowing the tool from its description in this ms.

Section 8 "Useful guidelines are an art" - this is the only novel paragraph in the discussion, above the ESWC paper, as far as I can see. That places a lot of the paper's novelty on this content. Can it be expanded into more than one point? There's a lot going on here.

Conclusion: Implicitly named entities are mentioned at the top and foot of the paper, but get no real focus in its body. Can there be a section / subsection dedicated to them, including a definition - critical, as they seem to be introduced in the big, important parts of the paper, but aren't defined prominently in either this paper or the ESWC prior version.

Bibliography - first initials or firstnames?

Review #3
Anonymous submitted on 24/Nov/2015
Major Revision
Review Comment:

The paper is well written and has the potential to make a strong contribution. However, it needs to be differentiated better from the ESWC'2015 paper, through additional analysis.

Below I make more detailed suggestions for improvement:

In related work, could you please discuss [33], since theirs is one of the early papers exploring crowdsourcing of NER annotations and issues around that. Please could you compare your work in that context too?

Please clarify RH2: We can understand crowd worker preferences for NER tasks. Does this mean measure/evaluate or? It is a bit unclear to me, as formulated at present

Given that you are using CrowdFlower to recruit and pay crowdworkers, why did you choose to use a GWAP for the NER task, instead of implementing a UI directly in CrowdFlower? GWAPs, as I understand them, are typically aimed at self-motivated users, who are playing for fun. However, here, this is not the case. Does this affect the findings in any way?

Could you justify the choice of 3 crowdworkers per tweet/entity type instead of 5 or 7? For instance, Lawson et al (link to paper below) recommended a higher number of workers per NER type, based on their experiments, with some tricky entity types needing a higher number of crowdworkers in order to get them right.

Section 6.1: why did you decide to show the second gold tweet after 5 completed tweets, but not, e.g., at random?

Table 4: given that F1 for annotating organisations hovers typically between 33% and 45% on the corpora, can you please discuss what this means in terms of the actual usefulness of crowdsourcing organisation annotations in tweets? Likewise for MISC.

Table 5 and its explanation on page 9: Please could you also discuss the differences between condition 1 and condition 2. There seem to be stark differences in the number of entities annotated under those conditions. For example, on the Finin dataset: 78 PER as PER under condition 1 and 498 PER as PER under condition 2. Similar results are reported on other datasets and entity types in this table.

Table 8: could you present somewhere or discuss in the text how many tweets were skipped vs how many were annotated per dataset and per condition?

Figure 3 (Skipped tweets) has % of entities skipped on the Y axis, where, e.g. more than 45% of PER, ORG, and MISC entities are skipped on the Wordsmith dataset. At the same time, Table 4 shows recall on the Wordsmith dataset for PER as 71.41% in condition 1 and 57.9% under condition 2. Could you please clarify how the figures from Fig 3 relate to the recall figures in Table 4?

When discussing Table 12 with the inter-annotator results, please could you elaborate more, since IAA of 35% seems very low. Also, what is it about the Finin dataset that results in almost double IAA? Could you report IAA per entity type, in addition to the overall? Perhaps there are some interesting differences there. There is some of this information later, in Section 7.2.1, but I wonder whether it wouldn't be better to bring it forward or, at least, add a forward reference to that section, as otherwise just reading the numbers on their own first leaves rather a lot of questions.

Section 7.2.2. Entity types: do you have an idea why the annotators might be skipping ORGs? Are they more likely to be single or multi-word NEs? Also, are they more likely to be within a hashtag/@mention or in the text? For the Wordsmith dataset, where PER, LOC, and ORG were skipped equally -- what are the reasons for that? Is it because of them being mostly in #tags and @mentions?

Section "Number of skipped tweets" -> is it just the number of entities in the tweets that's a factor? What about the kind of entities, as well as whether they are within a #tag/@mention? One could imagine that annotators might skip entities embedded within hashtags, for example, as they wouldn't be sure whether to annotate them or not (e.g. #PrayForLondon). The results showed that when given more instructions, annotators were more likely to take less time and not skip as much, which points to potentially needing a bit wider analysis of crowdworker behaviour on skipping likelihood.

Section 8: "Crowds can identify people and places, but more expertise is needed to classify miscellaneous entities" -> what about the ORG ones? I think these should also be included, as currently only 3 of the 4 types are discussed. With respect to the final sentence on multi-step workflows and varying the number of judgements per entity type -- these ideas are not new - please refer to Snow et al [33] for having a different number of judgements, and also to Sabou's work on hybrid genre workflows, which was developed in the context of knowledge acquisition tasks:

Crowdsourced Knowledge Acquisition: Towards Hybrid-Genre Workflows
M Sabou, A Scharl, F Michael
International Journal on Semantic Web and Information Systems 9 (3), 14-41

"Closing the entity linking loop for the non-famous" -> I was a bit confused by this, since up to this point, the paper was only discussing NE recognition, not linking. Given that Ritter and Finin aren't URI annotated (unless I am mistaken), I found it hard to follow this paragraph on generating URIs for less famous entities. I wasn't sure how these conclusions were drawn, as I didn't see an analysis earlier of whether the correctly annotated PER entities were celebrities or not.

Do you have any crowdworker demographics information? It would be interesting, e.g. to check whether crowdworkers from the US found it hard to annotate names of European locations/organisations.

Will the authors be making available the datasets with the CrowdFlower annotations, so other researchers could carry out further analysis?

"choices in the Section 6." -> choices in Section 6.
"identify a number of distinguishing behavioural characteristics related NER tasks" -> hard to read, please rephrase
"these definition even" -> "these definitions even"
"whereas, it workers" -> please rephrase
"We presented the on the average inter-annotator agreement" -> not grammatical
"took a shorter time" -> delete a
"From the results in 8" -> is this Table or Figure or???
"which hovered slightly about this average length" -> is it about or above?
"This is asides the Ritter dataset which had an overall set of longer tweets. From this" -> please rephrase
"in which entity types are empirically" -> "in which entity types *that* are empirically"
"Literature on motivation tells us that people perform best when they can decide what they are given the freedom to choose what they contribute, how, and when" -> please rephrase - hard to read