Review Comment:
This paper presents an overview of the NEEL series of shared tasks on named
entity disambiguation and linking over the time period 2013-2016, in terms of
the task design and also the participant systems.
There is no question that the NEEL series has been a valuable one, for which
the authors should be commended in the highest terms, but in its current form
I found the paper to be a frustrating read that does not do NEEL true justice,
on three main grounds: (1) there was not enough detail in the description of
the motivation behind and impact of critical decisions in the design of the
different iterations of NEEL; (2) the description of the participant systems
was rushed and largely superficial; and (3) the presentation quality of the
paper was poor in terms of language use. All of these need to be redressed for
the paper to be of utility to others (and achieve its true potential), in
terms of the design/cross-comparison with similar datasets, and also those in
the NER/D community who want to have a one-stop paper which documents the
best-practice wrt the NEEL datasets. Below, I provide detailed content-level
comments for (1) and (2) to aid in the reworking of the paper. For (3), the
amount of proofreading/editing required to bring the paper up to a publishable
standard is well beyond what I can provide in a review, but to give a sense of
the sort of thing I am referring to, consider the following from pp12-13:
> The approaches have proposed several differences, but we have observed
> some emerging trends that are uniquely to the top performing named entity
> recognition and linking approaches dealing with tweets
which I would rewrite as:
> There are substantial differences between the proposed approaches, but a
> number of trends can be observed in the top-performing named entity
> recognition and linking approaches to Twitter text.
First, with the task definition and dataset design details:
+ you need a better definition of NE (p1) than "A named entity is used in the
general sense of ..." which I found difficult to parse at first, but more
importantly, the definition seems to generalise to any concept (named entity
or otherwise) in a given taxonomy (e.g. if the taxonomy contained the concept
ANIMAL, the definition seems compatible with considering "aardvark" and
"beaver" as NEs, which they clearly are not); certainly, the following
sentence ("Thus ...") does not seem to follow logically from the first.
+ A small thing, but "micro-blog" is a more conventional term than
"micro-post"
+ another small thing, but in what way would letters of the alphabet be NEs?
(p5)
+ the biggest thing to my mind in the dataset description is that from year to
year you modify the strategy for sampling the tweets (in terms of the hashtags
used etc.), without providing any real detail of how you changed it (what
precise set of hashtags did you use in 2014, were they provided to the
participants at the time or after the fact, and did you not use hashtags to
sample the data in 2015, or did you just not mention it?) or justification for
why (e.g. were the changes from one iteration to the next motivated by issues
in the data, and were you aiming for [somewhat/differently] domain-biased
samples in each year?). As seen in Tables 2-5 and observed in passing on p7,
these choices
have a big impact on the relative proportion of NEs in addition to the class
distribution, which presumably has an impact in turn on what strategies work
better in a given year. You need to better motivate/document these
differences, and tease out "learnings" for others who may be interested in
running a similar task, in terms of the impact of sampling decisions on the
dataset composition, and best practice in constructing this type of dataset.
+ related to this point, you mention event and non-event tweets in the context
of 2014 and 2015 (but not the other years) -- how do you define event and
non-event, or extract out event vs. non-event tweets? What balance do you aim
for in the two? You mention briefly the need for having a mix (p7), which I
think was well motivated, but given the importance, more needs to be said
about this.
+ the class set in 2013 is a standard one and doesn't require justification,
but you expand it in 2015 to include NE types that are less familiar to some
NE communities such as CHARACTER and EVENT. These should be defined, and the
alignment with the original label set from 2013 made plain.
+ in terms of annotation process, you state that Phase 3 (p8) led to "higher
consensus" -- can you quantify this relative to the original annotations
(which you presumably have)? This is another instance of a potential learning
for other dataset creators, and the impact of each step of your particular
annotation setup is an important consideration for any (potentially
small-budget) annotation project.
+ you mention that you use CrowdFlower as the annotation interface in 2014,
but still claim to use expert annotators -- did you use the interface but your
own annotators, in which case, how was this managed? And why did you then move
away from CrowdFlower in 2015? Again, important learnings for others.
+ in Phase 4 for 2014/2015, how was the clustering performed?
+ in Phase 5 for 2015, what were the annotation guidelines to the third
annotator? Clustering evaluation is a notoriously thorny problem, and if you
found that your approach worked well, people need to know all the details to
be able to apply it in their own work.
+ what was the thinking in only annotating 10% of the test data in 2016?
+ again for Phase 3 of 2016, how was the clustering performed?
+ you state that your annotators have "the same background" (p10) and that
this made the annotation more efficient -- what background was this?
+ what do you mean by "resource" in your analysis of dominance? (p10)
+ the idea of analysing readability is an interesting one, but ultimately you
conclude that existing readability indices are not directly applicable to
Twitter, making me question all of the numbers in Table 9 and whether it was
really worth including this section in the paper -- you seem to generate a
table of numbers and then immediately discount the value of those numbers! I'd
advocate removing this section altogether.
+ you claim that LIX generates values in the range [20,60], but based on the
description, it would appear to range over [0,101]; or did you mean that,
empirically (when applied to actual corpora), that is the range that tends to
be observed?
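For reference, a minimal sketch of LIX as conventionally defined (mean
sentence length plus the percentage of words longer than six characters); the
naive tokenisation here is my own, not the authors' implementation:

```python
import re

def lix(text: str) -> float:
    """LIX = (words / sentences) + 100 * (long_words / words),
    where a long word has more than six characters."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100.0 * len(long_words) / len(words)

# Short, simple sentences with no long words give a low score:
print(lix("The cat sat. The dog ran."))  # 3.0
```

Nothing in the formula itself clamps the value to [20,60], which is why the
claimed range needs either a correction or an "empirically observed" hedge.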
+ onto the systems, I found Section 5 to be too superficial. Tables 11-13 are
good but there is a lot of detail there that needs to be properly reflected
and analysed in the text of the paper. In terms of what is there currently in
the prose, your description of System 14 is a great case in point of what I
mean:
> (System 14) approached it as a joint task and proposed the so-called
> end-to-end ... (it) provided a from-scratch rule-based approach using a
> combination of machine learning gradient boosting approaches
Questions that arise out of this are: what specific joint formulation did they
use (in terms of ML methods)? what is the relationship between the rule-based
approach and the joint approach? what is a "gradient boosting" approach? I'd
expect to see some mention of the specific ML architecture used, what feature
engineering/preprocessing they used that was notably different, possibly some
error analysis on your end of particular entity or input types their method
did better over, what was the relative gain their method had over other
methods (e.g. did they win by 0.1% or 10%?). This is the sort of level of
detail that you need to provide for the descriptions to have utility for
people who come to this paper cold, without familiarity with the NEEL systems.
Similarly with your description of 2015: CRFs are sequence-based, so I didn't
understand the comment about "other approaches" being sequential. Also with
2016, what specific normalisation was done, and what "graph weighting" was
performed (over what graph?)?
+ in terms of evaluation, 2013 was based on macro-averaging, but subsequent
challenges were based on micro-averaging. You describe what the change means
mathematically, but *why* the change?
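To make the practical difference concrete for readers, a toy illustration of
the two averaging schemes (the per-class counts here are invented for the
example, not drawn from NEEL data):

```python
def macro_f1(per_class):
    """Average the per-class F1 scores (each class weighted equally)."""
    f1s = []
    for tp, fp, fn in per_class:
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(f1s) / len(f1s)

def micro_f1(per_class):
    """Pool all decisions, then compute one F1 (each instance weighted equally)."""
    tp = sum(t for t, _, _ in per_class)
    fp = sum(f for _, f, _ in per_class)
    fn = sum(n for _, _, n in per_class)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

# One frequent class handled well, one rare class handled badly:
counts = [(90, 10, 10), (1, 9, 9)]  # (TP, FP, FN) per class
print(macro_f1(counts))  # 0.5   -- rare class drags the score down
print(micro_f1(counts))  # ~0.83 -- dominated by the frequent class
```

Given the skewed class distributions documented in Tables 2-5, the switch
materially changes what the scores reward, which is exactly why the rationale
belongs in the paper.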
+ what does it mean to "resolve redirects" in terms of the 2014 evaluation?
+ is there a difference between the (m;l) and (m,l) notation? If so, what? If
not, harmonise.
+ why include the formulation for micro-averaging in 2015 and not 2014 (where
it was first used)?
+ where did the particular weights in Equation (19) come from?
+ even though you moved away from it in 2016, provide details of how you
incorporated computing time into your 2015 evaluation (rather than just
providing a link to a paper)
+ there are interesting details in the last two paragraphs of the Conclusion
which should be moved earlier in the paper (never include new information in
the Conclusion), and highlighted more (e.g. the relative breakdown of academic
vs. commercial participation, the grants, the projects, NEEL-IT, the bridges
that were built with TAC, ...)
Formatting/editing issues:
+ consistently format all URLs as \url (some are, some aren't)
+ consistently use Math mode (some functions/variables are defined in \textit,
others in Math mode, e.g. Eq (19) and the immediately preceding text on p18)
+ make sure to spell check any manuscript before submission; there weren't
huge numbers of typos that a spell checker would have picked up on, but enough
to be noticeable (e.g. "Misceleneous" and "variaty")