Benchmark Corpora and Emerging Trends in Mining Semantics from Tweets

Tracking #: 1413-2625

Giuseppe Rizzo
Bianca Pereira
Andrea Varga
Marieke van Erp
Amparo Elizabeth Cano Basave

Responsible editor: Guest Editors Social Semantics 2016

Submission type: Survey Article
The large number of tweets generated daily gives policy makers a means to gain insight into recent events around the globe in near real time. The main barrier to extracting such insights is the impossibility of manually inspecting such a diverse and dynamic body of information. This problem has attracted the attention of industry and research communities, resulting in a series of algorithms for automatically extracting semantics from tweets and linking them to machine-readable resources. While a tweet is superficially comparable to any other textual content, it hides a complex and challenging structure characterized by acronyms, abbreviations, emojis, typos, and a rich set of metadata based on entities. The NEEL series of challenges, established in 2013, has helped to collect the emerging trends in the field and to define standardized benchmark corpora for entity recognition and linking in tweets, ensuring high-quality labeled data that enables easier comparison between approaches. This paper reports the findings and lessons learned from an analysis of the specific characteristics of the created corpora and their limitations, presents lessons learned from the participants in the challenges, and provides guidance for implementing top-performing approaches to entity recognition and linking in tweets.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 16/Aug/2016
Minor Revision
Review Comment:

This paper is a survey of entity recognition and linking in tweets. It first reports the details of the created corpora, highlighting their limitations, and then discusses the lessons learned from the different participants in the challenges. Generally speaking, the task is very useful and the survey is fairly comprehensive.

I have the following suggestions for revisions:
1. Add more details of the annotation process: what is the inter-annotator agreement?
2. Discuss more about error analysis and future work/directions.
3. Discuss more about the possible downstream applications and the possible impact to other research fields in the Semantic Web community.
4. Fix the incorrectly formatted citations in Table 11.
5. Fix the formatting of some formulas; they consume too much space.

Review #2
By Tim Baldwin submitted on 04/Sep/2016
Major Revision
Review Comment:

This paper presents an overview of the NEEL series of shared tasks on named
entity disambiguation and linking over the time period 2013-2016, in terms of
the task design and also the participant systems.

There is no question that the NEEL series has been a valuable one, in which
sense the authors should be commended in the highest terms, but in its current
form, I found the paper to be a frustrating read that does not do NEEL true justice,
on three main grounds: (1) there was not enough detail in the description of
the motivation behind and impact of critical decisions in the design of the
different iterations of NEEL; (2) the description of the participant systems
was rushed and largely superficial; and (3) the presentation quality of the
paper was poor in terms of language use. All of these need to be redressed for
the paper to be of utility to others (and achieve its true potential), in
terms of the design/cross-comparison with similar datasets, and also those in
the NER/D community who want to have a one-stop paper which documents the
best-practice wrt the NEEL datasets. Below, I provide detailed content-level
comments for (1) and (2) to aid in the reworking of the paper. For (3), the
amount of proofreading/editing required to bring the paper up to a publishable
standard is well beyond what I can provide in a review, but to give a sense of
the sort of thing I am referring to, consider the following from pp12-13:

> The approaches have proposed several differences, but we have observed
> some emerging trends that are uniquely to the top performing named entity
> recognition and linking approaches dealing with tweets

which I would rewrite as:

> There are substantial differences between the proposed approaches, but a
> number of trends can be observed in the top-performing named entity
> recognition and linking approaches to Twitter text.

First, with the task definition and dataset design details:

+ you need a better definition of NE (p1) than "A named entity is used in the
general sense of ..." which I found difficult to parse at first, but more
importantly, the definition seemed to generalise to any concept (named entity
or otherwise) in a given taxonomy (e.g. if the taxonomy contained the concept
ANIMAL, the definition seems compatible with me considering "aardvark" and
"beaver" as NEs, which they clearly are not); certainly, the following
sentence ("Thus ...") does not seem to logically follow from the first.

+ A small thing, but "micro-blog" is a more conventional term than

+ another small thing, but in what way would letters of the alphabet be NEs?

+ the biggest thing to my mind in the dataset description is that from year to
year you modify the strategy for sampling the tweets (in terms of the hashtags
used etc.), without providing any real detail of how you change it (what precise
set of hashtags did you use in 2014, were they provided to the participants
at the time/after the fact, and did you not use hashtags to sample the data in
2015, or did you just not mention it?) or justification for why (e.g. were the
changes from one iteration to the next motivated by issues in the data, and
were you aiming for [somewhat/differently] domain-biased samples in each
year?). As seen in Tables 2-5 and observed in passing on p7, these choices
have a big impact on the relative proportion of NEs in addition to the class
distribution, which presumably has an impact in turn on what strategies work
better in a given year. You need to better motivate/document these
differences, and tease out "learnings" for others who may be interested in
running a similar task, in terms of the impact of sampling decisions on the
dataset composition, and best practice in constructing this type of dataset.

+ related to this point, you mention event and non-event tweets in the context
of 2014 and 2015 (but not the other years) -- how do you define event and
non-event, or extract out event vs. non-event tweets? What balance do you aim
for in the two? You mention briefly the need for having a mix (p7), which I
think was well motivated, but given the importance, more needs to be said
about this.

+ the class set in 2013 is a standard one and doesn't require justification,
but you expand it in 2015 to include NE types that are less familiar to some
NE communities such as CHARACTER and EVENT. These should be defined, and the
alignment with the original label set from 2013 made plain.

+ in terms of annotation process, you state that Phase 3 (p8) led to "higher
consensus" -- can you quantify this relative to the original annotations
(which you presumably have)? This is another instance of a potential learning
for other dataset creators, and the impact of each step of your particular
annotation setup is an important consideration for any (potentially
small-budget) annotation project.
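
One way to quantify the consensus question above is to compute an agreement
coefficient over the annotations before and after Phase 3. The following is a
minimal sketch using Cohen's kappa for two annotators. The label sequences are
invented for illustration and are not NEEL data, and the choice of kappa is an
assumption; the authors may prefer a different agreement measure.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical entity-type labels from two annotators (illustrative only).
annotator_1 = ["PER", "LOC", "ORG", "PER", "LOC", "ORG"]
annotator_2_before = ["PER", "ORG", "ORG", "LOC", "LOC", "PER"]
annotator_2_after = ["PER", "LOC", "ORG", "PER", "LOC", "PER"]

kappa_before = cohen_kappa(annotator_1, annotator_2_before)
kappa_after = cohen_kappa(annotator_1, annotator_2_after)
print(f"kappa before: {kappa_before:.2f}, after: {kappa_after:.2f}")
```

Reporting the pair of values (before and after the consensus phase) would let
other dataset creators judge how much that phase contributes.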

+ you mention that you use CrowdFlower as the annotation interface in 2014,
but still claim to use expert annotators -- did you use the interface but your
own annotators, in which case, how was this managed? And why did you then move
away from CrowdFlower in 2015? Again, important learnings for others.

+ in Phase 4 for 2014/2015, how was the clustering performed?

+ in Phase 5 for 2015, what were the annotation guidelines to the third
annotator? Clustering evaluation is a notoriously thorny problem, and if you
found that your approach worked well, people need to know all the details to
be able to apply it in their own work.

+ what was the thinking in only annotating 10% of the test data in 2016?

+ again for Phase 3 of 2016, how was the clustering performed?

+ you state that your annotators have "the same background" (p10) and that
this made the annotation more efficient -- what background was this?

+ what do you mean by "resource" in your analysis of dominance? (p10)

+ the idea of analysing readability is an interesting one, but ultimately you
conclude that existing readability indices are not directly applicable to
Twitter, making me question all of the numbers in Table 9 and whether it was
really worth including this section in the paper -- you seem to generate a
table of numbers and then immediately discount the value of those numbers! I'd
advocate removing this section altogether.

+ you claim that LIX generates values in the range [20,60], but based on the
description, it would appear to range [0,101]; or did you mean "empirically"
(when applied to actual corpora), that is the range that tends to be observed?
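
To make the range question concrete, here is a minimal sketch of LIX under
its commonly cited definition (average sentence length plus the percentage of
words longer than six letters). The tokenisation rules below are simplifying
assumptions and may differ from the description in the paper.

```python
import re


def lix(text, long_word_len=6):
    """LIX = words / sentences + 100 * long_words / words."""
    # Assumption: a sentence ends at any run of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a word is any run of word characters.
    words = re.findall(r"\w+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    long_words = [w for w in words if len(w) > long_word_len]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


short_example = lix("Short words in one go.")
long_example = lix(" ".join(["extraordinary"] * 30) + ".")
print(short_example, long_example)
```

Under this assumed definition the first term grows with sentence length, so
the score has no fixed upper bound, which suggests that a range such as
[20,60] can only be an empirical observation over typical corpora.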

+ onto the systems, I found Section 5 to be too superficial. Tables 11-13 are
good but there is a lot of detail there that needs to be properly reflected
and analysed in the text of the paper. In terms of what is there currently in
the prose, your description of System 14 is a great case in point of what I

> (System 14) approached it as a joint task and proposed the so-called
> end-to-end ... (it) provided a from-scratch rule-based approach using a
> combination of machine learning gradient boosting approaches

Questions that arise out of this are: what specific joint formulation did they
use (in terms of ML methods)? what is the relationship between the rule-based
approach and the joint approach? what is a "gradient boosting" approach? I'd
expect to see some mention of the specific ML architecture used, what feature
engineering/preprocessing they used that was notably different, possibly some
error analysis on your end of particular entity or input types their method
did better over, what was the relative gain their method had over other
methods (e.g. did they win by 0.1% or 10%?). This is the sort of level of
detail that you need to provide for the descriptions to have utility for
people who come to this paper cold, without familiarity with the NEEL systems.

Similarly with your description of 2015, CRFs are sequence based, so I didn't
understand the comment about "other approaches" being sequential. Also with
2016, what specific normalisation was done, and what "graph weighting" was
performed (over what graph?)?

+ in terms of evaluation, 2013 was based on macro-averaging, but subsequent
challenges were based on micro-averaging. You describe what the change means
mathematically, but *why* the change?
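
The practical effect of the switch can be illustrated with a small sketch
contrasting the two schemes for precision. It assumes that macro-averaging
takes the mean of per-unit precision and that micro-averaging pools the counts
before dividing; the counts are invented, and the unit being averaged over
(tweets, entity types) is an assumption rather than the NEEL definition.

```python
def macro_precision(counts):
    """Mean of per-unit precision; every unit weighs the same."""
    per_unit = [tp / (tp + fp) for tp, fp in counts if tp + fp > 0]
    return sum(per_unit) / len(per_unit) if per_unit else 0.0


def micro_precision(counts):
    """Precision over pooled counts; every prediction weighs the same."""
    total_tp = sum(tp for tp, _ in counts)
    total_fp = sum(fp for _, fp in counts)
    total = total_tp + total_fp
    return total_tp / total if total > 0 else 0.0


# Hypothetical (true positives, false positives) per unit, illustrative only.
counts = [(1, 0), (1, 0), (2, 8)]

macro = macro_precision(counts)
micro = micro_precision(counts)
print(f"macro: {macro:.3f}, micro: {micro:.3f}")
```

With these numbers the two small, perfectly handled units dominate the macro
average, while the micro average is driven by the one unit with many
predictions. A stated rationale for the change would help readers decide which
behaviour suits their own evaluation.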

+ what does it mean to "resolve redirects" in terms of the 2014 evaluation?

+ is there a difference between the (m;l) and (m,l) notation? If so, what? If
not, harmonise.

+ why include the formulation for micro-averaging in 2015 and not 2014 (where
it was first used)?

+ where did the particular weights in Equation (19) come from?

+ even though you moved away from it in 2016, provide details of how you
incorporated computing time into your 2015 evaluation (rather than just
providing a link to a paper)

+ there are interesting details in the last two paragraphs of the Conclusion
which should be moved earlier in the paper (never include new information in
the Conclusion), and highlighted more (e.g. the relative breakdown of academic
vs. commercial participation, the grants, the projects, NEEL-IT, the bridges
that were built with TAC, ...)

Formatting/editing issues:

+ consistently format all URLs as \url (some are, some aren't)

+ consistently use Math mode (some functions/variables are defined in \textit,
others in Math mode, e.g. Eq (19) and the immediately preceding text on p18)

+ make sure to spell check any manuscript before submission; there weren't
huge numbers of typos that a spell checker would have picked up on, but enough
to be noticeable (e.g. "Misceleneous" and "variaty")

Review #3
Anonymous submitted on 18/Sep/2016
Minor Revision
Review Comment:

The paper presents the lessons learned by the organisers of a challenge series called “Named Entity Extraction and Linking (NEEL) from tweets”.

I tend to disagree with this paper being a 'Survey Article'. I reviewed it as a full article.

I like the paper. I value the retrospective work presented. I do not have comments related to the content, but I think the presentation can be improved.

The remainder of this review is split into two parts. In the first, I address my major comments on the paper structure; in the second, I point out minor comments on the abstract content and on the way the information in the tables is presented in the text.


First of all, since the paper is centred on the NEEL experience, I would make this as clear as possible. The first time I read the paper I got the wrong expectation that lessons are learned also from other challenges such as TAC-KBP, W-NUT, ERD, and SemEval.

To avoid setting wrong expectations, I recommend that the authors:
1) change the title of the paper to something like “Lessons Learnt from organising a series of challenges on Named Entity Extraction and Linking from tweets”.
2) create a proper related work section where the comparison between NEEL and the other challenges is discussed. In the current version of the paper, the comparisons among the challenges are spread across the sections. I would put this section at the end of the paper, before the conclusions. Minor references can, of course, appear in the introduction.

Moreover, I think that the readability of section 2 can be improved by not presenting general content together with NEEL-specific content.
My recommendation is to present the general content first and then to present the NEEL-specific content in a separate section.

In detail, I recommend the following structure for the new section 2:
1) section 2.1 should present the Named Entity Extraction and Linking problem by using the ideas already presented in section 2.1 but without mentioning the challenges. I would also discuss why this is particularly interesting and challenging when the texts to be analysed are tweets (see also minor comments).
2) section 2.2 should present the solution space. To this end I would use most of the left column of page 3 and the description of Typical Entity Linking workflow currently described in section 2.2. In this way, the authors will no longer interleave the presentation of the Typical Entity Linking workflow with the presentation of the NEEL evaluation process.
3) section 2.3 should concentrate on the challenges faced by solutions in this problem space without focusing on how NEEL aimed at comparing them. I would use the content in section 2.1 that discusses how the challenges relate to the problem, the last part of the left column on page 3, the right column of page 3 and the upper part of the left column on page 4.

Then, I recommend adding a new section before the current section 3, fully dedicated to introducing NEEL with respect to the generic content introduced in the new section 2. I would use all the parts that specifically talk about NEEL in the current version of section 2.


In the abstract, the authors introduce the problem of dealing with the “complex and challenging structure” of tweets, meaning acronyms, abbreviations, emojis, typos, and a rich set of metadata. While the problem overall is discussed in an adequate manner in the paper, acronyms are mentioned only once in a comment on page 13 and once in the first row of Table 11. Similarly, abbreviations are mentioned only once on page 12, and the metadata present in tweets is mentioned only once on page 16 and in Table 13.

I recommend rewording the abstract to avoid creating the expectation that those details are treated in depth. Of course, it would be even better if the authors extended the paper to cover them, but I see the risk of losing focus on the overall lessons learnt if the authors present too many details.

Last but not least, I recommend that the authors avoid repeating in the text the information already present in the tables. I believe the text should highlight what is important in the tables. It should guide the reader through the content of a table, not read the table aloud. A clear example of what I would avoid is the description of Tables 2-5 on page 7: it makes the text hard to read without helping the reader to better understand the overall message.