ADEL: ADaptable Entity Linking

Tracking #: 1724-2936

Authors: 
Julien Plu
Giuseppe Rizzo
Raphaël Troncy

Responsible editor: 
Guest Editors LD4IE 2017

Submission type: 
Full Paper
Abstract: 
Four main challenges cause numerous difficulties when developing an entity linking system: i) the kind of textual documents to annotate (such as social media posts, video subtitles or news articles); ii) the number of types used to categorise an entity (such as Person, Location, Organization, Date or Role); iii) the knowledge base used to disambiguate the extracted mentions (such as DBpedia, Wikidata or MusicBrainz); iv) the language used in the documents. Among these four challenges, being agnostic to the knowledge base, and in particular to its coverage, whether it is encyclopedic like DBpedia or domain-specific like MusicBrainz, is arguably the most difficult. We propose to tackle all four challenges and, in order to be knowledge base agnostic, we introduce a method that indexes the data independently of the schema and vocabulary being used. More precisely, we design our index such that each entity carries at least two pieces of information: a label and a popularity score, such as a prior probability or a PageRank score. The result is a framework named ADEL, an entity recognition and linking system based on hybrid linguistic, information retrieval, and semantics-based methods. ADEL is a modular framework that is independent of the kind of text to be processed and of the knowledge base used as referent for disambiguating entities. We thoroughly evaluate the framework on six benchmark datasets: OKE2015, OKE2016, NEEL2014, NEEL2015, NEEL2016 and AIDA. Our evaluation shows that ADEL outperforms state-of-the-art systems in terms of extraction and entity typing. It also shows that our indexing approach generates an accurate set of candidates from any knowledge base that makes use of linked data and provides the required information for each entity, in minimal time and with a minimal index size.
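(Illustrative sketch: a minimal per-entity index record of the kind described above could look as follows in YAML; the field names are assumptions made for illustration, only the requirement of a label and a popularity score comes from the abstract.)

    entity: http://dbpedia.org/resource/Paris   # assumed identifier field
    label: Paris
    popularity: 0.0042   # e.g. a prior probability or a PageRank score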
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 10/Oct/2017
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

Originality:
The authors propose a modular system ADEL (Adaptable Entity Linking) that relies on hybrid architectural principles to address four challenges that the authors describe at the outset. An important challenge is the choice of a knowledge base to which entities must be linked. As the authors rightfully point out, this choice can significantly impact results, and several currently available entity linking systems are not easy to extend beyond the knowledge bases they have been designed to link to. The authors propose indexing multiple knowledge bases on Linked Open Data using a SPARQL-centric methodology. Evaluations on several datasets, including OKE and NEEL, illustrate competitive performance. The paper offers a nice use-case of how linked data principles can be leveraged to generalize the scope of entity linking.

Significance of results:
The authors have released their system as API endpoints published on Swagger, and I believe that this could end up being fairly significant. I have not tried out the API myself, but if it is able to provide links to multiple knowledge bases as ADEL claims to do, it would be an important addition to current tools like DBpedia Spotlight. The manner in which the authors index the knowledge bases in the paper is also directly relevant to the Semantic Web. Since the authors cite that, rather than an NLP algorithmic contribution, as a primary contribution, I believe this journal is an appropriate fit for publishing this work.

Since the system itself is a primary contribution (and since it is too complex to replicate from scratch), it would be good if the authors could provide and maintain some documentation. I saw a GitHub link to the source code of an algorithm, and also the links to the API itself, but (unless I ended up missing it somehow) I did not see formal documentation and examples, however brief. The authors may want to consider something along those lines to maximize the impact of this work.

Quality of writing:

The quality of writing was good, and the paper was relatively easy to follow. I do think Figure 2 can be made clearer, perhaps by using darker colors and by not using so many dashed boxes. I was also a little confused by the way the paper started (i.e. with the description of the projects). I think this should be flipped with the task description. The authors should talk about the general problem first, and then mention the TV project as a motivation. I felt a disconnect in going from the abstract to the introduction. However, this is a minor suggestion only.

Review #2
By René Speck submitted on 17/Oct/2017
Suggestion:
Minor Revision
Review Comment:

The present full paper "ADEL: ADaptable Entity Linking" introduces a framework for entity recognition and linking based on hybrid linguistic, information retrieval, and semantics-based methods. The paper is an extension of the former approaches "Enhancing Entity Linking by Combining NER Models" and "A Hybrid Approach for Entity Recognition and Linking".

A link to the API of ADEL is given in the paper, but no link to the source code of the framework. The authors should add the link to the source code to allow replication of the results.
The version numbers of the applied tools and frameworks are also necessary for replication; in many cases the source code is sufficient for that.
The versions of the knowledge bases are also of interest, e.g., DBpedia is updated regularly. If possible, the authors should give the URIs of the GERBIL experiments in the paper.

The paper is well written but has several typos, missing parts and formatting issues.
The authors have not mentioned the former approaches in this paper, but copied several parts from them without any change.

* Algorithm 1 is a copy of the algorithm in the former paper; a reference to the former work would reduce redundancy.

* Equation 1 and its description are copied from a former paper. I cannot find any improvements here; a reference to the former work would reduce redundancy.

* The "Overlap Resolution Module" example is given again, a reference to the former paper would be better.

* Figure 2 states something about "DSRM" in the Linkers Module, but unfortunately the authors never define or explain what exactly "DSRM" is, nor whether it is used in the approach or not.

* Figure 2 shows a "Social Media Account Dereference extractor" sub-module that is never explained in the paper.

* In the listings with SPARQL queries the authors use prefixes. Please use prefixes for everything to make the queries more readable, e.g. "PREFIX artist: <...>" and "PREFIX dbr: <http://dbpedia.org/resource/>".
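(For illustration, a query written in the fully prefixed style suggested above could look like the following sketch; the entity and property names are placeholders and are not taken from the paper's actual listings.)

    PREFIX dbr:  <http://dbpedia.org/resource/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?label WHERE {
      dbr:Paris rdfs:label ?label .
      FILTER (lang(?label) = "en")
    }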

* Include a part of the mappings given at https://gist.github.com/jplu/74843d4c09e72845487ae8f9f201c797 in the paper together with this link, instead of the link only.

* "state-of-the art" should be "state-of-the-art".

* "T xt" should be "Txt" in Algorithm 1.

* In section 5 the paper references are too long and overlap the caption of the table.

* In section 5.2 "Comparison with Other Systems", the references to the tables are too long.

* Entity classes in section 1.1 are formatted differently from the class "Thing" in section 3.1.2, the classes on page 12, and the classes in section 6.

* Captions sometimes end with a period, sometimes not.

Review #3
By Anastasia Dimou submitted on 28/Nov/2017
Suggestion:
Major Revision
Review Comment:

This is a full research paper on ADEL, an adaptable entity linking framework that is independent of the text to be processed and of the knowledge base used.

Overall, the paper is well written, the approach seems interesting, and the evaluation results are generally well discussed. However, the paper has certain drawbacks in the way it describes the approach, which make me think it should not be accepted as is. At a high level:

- The introduction of the proposed approach is poor and disappointing rather than engaging readers to continue reading, intrigued by the contributions! Instead of introducing all innovative aspects of the approach, the list of contributions (sec 1) is limited, and I think the listed items do not reflect the contributions that one can find spread throughout the text. Similarly, when the approach is described (sec 3), its introduction is limited to contrasting existing approaches on two aspects; the contributions are neither clearly stated there nor justified, and it is only claimed that the architecture is designed in a way that enables the changes. Therefore, I would strongly suggest improving the introduction of the approach, clearly stating the contributions, and justifying why they are significant.

- The focus of the approach is currently on the (new) architecture (of the implementation). If this is the case, I would expect this to be a system paper and not a full research paper. The approach should normally go beyond the proposed architecture. However, what is currently presented as the approach is nothing but a description of the current implementation, whereas what is presented as the implementation (current section 4) is just a description of the configuration file that accompanies it. Therefore, I would strongly suggest distinguishing the contribution from the implementation.

- There are often statements which are not supported by references or otherwise proven (especially in sections 1 and 2). I mention a few below in detail, but there are more. I would suggest adding references or providing evidence in most of these cases.

- I would request clarification as to whether the pipeline (or which of its components/contributions) is open source, and whether the experimental settings and results are made publicly available via permanent URLs (e.g. figshare) in order to enable reproducibility.

- (minor) Different acronyms are mentioned in different places within the text; it would be best if the full name were provided the first time each is mentioned. Moreover, proof-reading is required to correct all grammar and syntax errors.

In more details:

Introduction (section 1):
- "textual content represents the biggest part of content available on the Web" --> I would suggest that either a reference to this statement is provided or the argument is soften as it is not self-evident.
- in the task description (sec 1.1) several definitions are given. Are all these notions new? I would suggest providing corresponding references wherever possible, or clearly stating that these are the definitions of the terms as used in this paper.
- "the two main problems when processing natural language text are ambiguity and synonymy." --> I would suggest to provide a reference to support the argument. Is it only entity linking that solves the problems of ambiguity and synonymy? Not the entity recognition?
- sec 1.2: 1st challenge, why are newspapers, magazines and encyclopedias trusted sources? Are all newspapers trusted sources?
- sec 1.2: where are these challenges coming from? Are they defined by the authors? If so, based on what evidence? Is the list complete? Is it the result of "surveying" the state of the art? Then I would suggest providing references to publications that refer to or address each of the challenges. Or is there a publication that lists those challenges? Then I would suggest referring to it. Of course, it is mentioned that these are the main challenges, but again, why these and not others? I think this can be addressed by showing that several past papers proposed alternative solutions to this problem, by arguing that these challenges are relevant to the problem addressed in this paper, or by any other means that supports the argument.
- sec 1.2: "formal texts, usually well-written and coming from trusted sources such a newspaper, magazine, or encyclopedia;" How is the "well-written aspect determined? And (why) are all the newspapers and magazines trusted?"
- sec 1.2: I think the difference between formal and informal texts lies (merely) in how they are written (the genre, as it is later called) and not in their trustworthiness. Namely, why is a magazine more trustworthy than a tweet? Couldn't the magazine have a Twitter account? Why would its tweets be less trustworthy than its articles?
- sec 1.2/1.3: could it be explicitly elaborated which challenge affects each contribution? e.g. which challenge affects the third contribution?
- all contributions apart from contribution 3 have a reference to a corresponding section, could contribution 3 also have such a reference?
- sec 1.3: Why is the 4th item a contribution? It reads more like a result of the evaluation than an actual contribution.
- sec 1.4: "numerous" --> I would suggest rephrasing this!

section 2:
sec 2.1:
- "We identify two external entries for an entity linking system: the text to process and the knowledge base to use for disambiguating the extracted mentions. We extend the definition of what is an external entry for an entity linking system defined in [43] " --> This reads more like an assumption that was made within the frame of the proposed solution rather than related work. I would propose this part of the related work to be moved in another section or that this paragraph is rephrased. Moreover, I think that going from 3 (text, knowledge base and entity) to 2 (text and knowledge base) is not really an extension.
-"This definition is often extended by including other categories such as Event or Role" --> I would suggest a couple of examples to be provided with regard to where this happens.
- sec 2.1.1: "We propose a different orthogonal categorization where textual content is divided between formal text and informal text." --> This is not related work, but rather part of the paper's assumptions for the proposed approach. I would suggest moving this text to the corresponding section and limiting the related work section to presenting existing works, so that readers get an overview of the state of the art in this section.
- sec 2.1.1: Why are subtitles trusted? I would suggest backing up the argument with a reference. The same holds for ASR; I would suggest providing a reference to an example/publication stating that subtitles are generated by such a system.
- sec 2.1.2: there is an outline of certain knowledge bases, but none of the sub-challenges (coverage, data model and freshness) of the knowledge base challenge is covered in detail. Moreover, the section does not present evidence from the state of the art showing that the aforementioned challenges indeed exist. I would suggest clarifying both remarks in the text.

sec 2.2:
- Where does this classification come from? I would suggest either providing a reference to a source, or examples per case showing that such cases indeed exist.
- At the top of Table 1 it says the table is about mention extraction, but the column "entity recognition" refers to whether the entity is recognized during the mention extraction or the linking process (similarly for entity candidate generation), while Table 2 is dedicated to entity linking. Besides this issue, what do "Yes" and "No" mean? I would suggest representing this better. I would even recommend a table dedicated to these two aspects where, for each case, there is a tick for one of the two alternatives. Then Table 2 would be directly comparable to Table 3, and Table 1 almost comparable.
- a minor comment on these tables: the same term is written with different upper/lower cases, e.g. "lexical similairity" and "Lexical Similarity".
- another minor/optional comment with regard to Table 1: I would suggest putting the columns Main Features and Method first, and then the external tools and language resources, so that all tables have the same structure (at least at the beginning)
- while sec 2.2 provides a clear comparison among the different technologies, this is not the case for sec 2.1, which provides a plain outline of different alternatives.
- I would suggest providing a reference for the definition of "overlap resolution"
- "In Table 2, we observe three approaches" --> 3 approaches for doing what? I would suggest to explicitly say what these approaches do.

section 3:
- Figure 2 should be closer to its reference (the same occurs with other images and tables too, so please adjust overall)
- "ADEL comes with a new architecture" --> new compared to an older one or new compared to the others? If the former, please add a reference to the older and explain the difference but I guess the latter, so I would suggest to choose a more adequate term, perhaps innovative or alternative? But getting back to the contributions, it is mentioned that a modular architecture is proposed but could the readers assume that the approaches were not modular/adaptable so far? So, is this the contribution or does the innovation stand on other aspects, such as static Vs dynamic or flexibility? The contribution text should be updated accordingly to show the actual contribution.
- "little flexibility" --> This reads too vague. I would suggest that this is further clarified (perhaps it's best to happen within the related work section)
- "cannot be extended without ... spending a lot of time in terms of integration" --> Why would an extension require integration? I understand based on the example that replacing a module or complementing it with another module is considered an extension and I assume that integration is required for the new module to be added in the pipeline for that reason but this is best to be explained.
- "the knowledge base being used is often fixed as well" --> Was there knowledge bases that required fixing? This sentence needs to be rephrased.

sec 3.1 (all minor comments)
- it is mentioned what the Gazetteer Tagger relies on, but not what it does. I would suggest mentioning what it does too
- "to handle tweets, we use the model proposed in [10]." --> How is that relevant to the POS tagger and it needs to be mentioned there? If tweets are handled according to a methodology which is proposed by [10], I would expect that it is a different extractor.
- "While using a dictionary as extractor, it gives the possibility to be very flexible in terms of entities to extract and their corresponding type" --> it refers to GATE or ADEL in this case? Please mention explicitly.
- "If we only apply the 4 classes model" --> this refers to one of the Stannford NLP models but I would suggest that it is mentioned because as it is now, it remains vague.
- was the mapping among the different sources manually defined? Was a methodology followed?

sec 3.2
- How could we know in advance which columns to search? How is that determined in advance?
- "This optimization reduces the time of the query to generate the entity candidates from around 4 seconds to less than one second" --> This reads more like results which are produced after a certain evaluation. There is no context to turn them relevant in sec 3.2. Readers may assume that 4 seconds is the time of the system before optimizing but that was never mentioned. I would suggest that the optimization is part of the contributions (I assume that it is significant improvement) and that the concrete results are mentioned in the evaluations section together with the comparisons to other systems.

section 4:
- The configuration file is mentioned to be written in YAML, but it is not clearly stated that it consists of 3 parts, which are then further explained. I would suggest doing so. Moreover, I would suggest
-"In case of an Elasticsearch index, the properties query and name are mandatory. In case of Lucene, these properties are replaced by two other mandatory properties that are fields and size" --> This does not read as a very generic, modular and configurable solution. I would suggest this aspect to be clarified.

section 5:
- "the best configuration for the NEEL2015 dataset is not the same than for the NEEL2016 dataset despite the fact that both datasets are made of tweets." --> Could you explain why this happens? And do you have any idea of what needs to be done not to have this?
The evaluation is well discussed, but I miss a comparison with the best approaches in each case, e.g. OKE2015, OKE2016, etc. Namely, besides which configuration is the best, it would be good to know how the tool compares to other tools that used the same evaluation datasets. Of course, within the text it is mentioned that ADEL outperforms the state of the art, but I would suggest making this clearer in the corresponding tables; currently it is not obvious.

- For the index optimization, it would be best if the results both with and without the optimization were presented.

sec 5.1:
- "We evaluate our approach at different level: extraction (Tables 6, 5, 7 and 8)," --> I would suggest to be a bit more detailed within the text with regard to what each table presents. Note also the order to be correct, now it is firstly Table 6 mentioned and then Table 5.
- Could you provide a table with all configuration information together, so that it is comparable?
- "We tackle this problem by developing a novel hashtag segmentation method inspired by [51,24]." --> I think this segmentation method should be mentioned when the solution is presented.
- The experimental settings should be made publicly available via permanent URLs (e.g. figshare) in order to enable reproducibility.

Minors:
" there is currently no agreed upon definition of what is an entity. " --> "of what an entity is."
-sec 2.1:
"We extend the definition of what is an external entry" --> "what an external entry is" (more syntax errors like this one)
"The current entity linking systems tends to adopt" --> "tend"
"this generally consists in mention detection and entity typing" --> consists of

- sec 2.2:
"since these methods aims primarily to" --> "aim"

- sec 3.1:
"it is then possible to jump from one source to another" --> I would suggest to replace jump with another verb, such as alter.