Survey on English Entity Linking on Wikidata

Tracking #: 2670-3884

Authors: 
Cedric Moeller
Jens Lehmann
Ricardo Usbeck

Responsible editor: 
Guest Editors Advancements in Linguistics Linked Data 2021

Submission type: 
Survey Article
Abstract: 
Wikidata is an always up-to-date, community-driven, and multilingual knowledge graph. Hence, Wikidata is an attractive basis for Entity Linking, which is evident by the recent increase in published papers. This survey focuses on four subjects: (1) How do current Entity Linking approaches exploit the specific characteristics of Wikidata? (2) Which unexploited Wikidata characteristics are worth to consider for the Entity Linking task? (3) Which Wikidata Entity Linking datasets exist, how widely used are they and how are they constructed? (4) Do the characteristics of Wikidata matter for the design of Entity Linking datasets and if so, how? Our survey reveals that most Entity Linking approaches use Wikidata in the same way as any other knowledge graph missing the chance to leverage Wikidata-specific characteristics to increase quality. Almost all approaches employ specific properties like labels and sometimes descriptions but ignore characteristics like the hyper-relational structure. Thus, there is still room for improvement, for example, by including hyper-relational graph embeddings or type information. Many approaches also include information from Wikipedia which is easily combinable with Wikidata and provides valuable textual information which is Wikidata lacking. The current Wikidata-specific Entity Linking datasets do not differ in their annotation scheme from schemes for other knowledge graphs like DBpedia. The potential for multilingual and time-dependent datasets, naturally suited for Wikidata, is not lifted.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Vasilis Efthymiou submitted on 06/Feb/2021
Suggestion:
Minor Revision
Review Comment:

The paper is a survey on entity linking (EL) works that have Wikidata as their target knowledge graph (KG). Entity linking is defined as the task of finding the KG entities that correspond to a given set of extracted entity mentions (aka surface forms) from a given natural language utterance. In that sense, Entity Recognition (ER), which is the process of extracting entity mentions from utterances, is assumed to have been already performed, even if some EL works also perform ER first and then EL. The authors clearly state the coverage/scope of this work (i.e., the research questions that are investigated), and how they have decided to include/exclude specific papers (e.g., all EL papers not targeting Wikidata explicitly are excluded). After providing the scope, the authors define the overall problem, then they provide some background information about Wikidata (e.g., what is considered as a statement, a property, a reference in Wikidata) and what makes it special compared to other KGs, and then, they discuss existing EL approaches and EL benchmark datasets. The paper ends with a high-level discussion on the pros and cons of existing approaches and datasets, as well as some specific areas of improvement for future works to consider.

Overall, I found the paper very interesting, and very useful and informative for someone who wants to quickly gain a broad overview of the area and learn about existing works, approaches, and datasets. I have spotted some weak points though, which, if properly addressed, I believe will make the paper stronger. You can find them at the end of my review.

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.
The level of detail (not too detailed, but also not too abstract), the coverage, and the high-level discussion in this paper are great for a survey. I would not change much there, but I would include generic EL works (i.e., those working with Wikidata, but not targeting Wikidata explicitly).

(2) How comprehensive and how balanced is the presentation and coverage.
As in my previous comment, I found the paper very well-balanced. If space allows, I would add a few more words about each of the datasets covered (see detailed comments).

(3) Readability and clarity of the presentation.
Overall, this was a pleasure to read, the structure was good and there were very few issues that can be easily resolved (see detailed comments and typos).

(4) Importance of the covered material to the broader Semantic Web community.
Even though this survey focuses on EL works that target Wikidata only, due to the central role that Wikidata has played in recent years, this work is of great importance for the whole Semantic Web community.

Detailed comments:
Before going to the weaker aspects of this work, let me first commend the authors for their writing, the inclusion of high-level comments and remarks on existing works (Sections 5.2, 6.2), and the suggestion of future improvements (Section 8).

I found two major issues (MIs) and some smaller issues (SIs) in the paper.

Major issues:
MI-1: Table 8 is not only non-informative, but also confusing, and should be removed. Even though the seven (!) footnotes shed some light on this issue, in the end we have a table that is supposed to compare the works, but in that table we compare F1 scores to accuracy scores and recall scores (whichever is available per paper), for different datasets. I don't see a good reason for keeping this table and discussion. If the authors wanted to experimentally compare the works (which is not mandatory, of course), they should have run (new) experiments on a fair basis (using the same evaluation scores and the same parts of the datasets). Table 9 may remain, as long as the numbers reported are for the same measure (which could also be stated in the table caption for clarity); even if missing values appear, that is not ideal but acceptable.

MI-2: I can understand excluding works that target a different KG than Wikidata. What I don't understand, and am not convinced by the existing justification for, is why the authors have excluded works that are generic enough to capture not only Wikidata, but also other KGs. I don't understand why the authors consider generalization beyond Wikidata to be a bad thing. I am not saying that targeting Wikidata only is bad, but the opposite is not bad either. What if Wikidata, similarly to DBpedia and Freebase in the past, stops existing or is overcome by a newer, more popular KG in the (near) future? In that case, all tools working only with Wikidata would be lost, while works going beyond Wikidata would still be valid.
With that in mind, as well as with MI-1 in mind, I would ask the authors to include in their overview the works that go beyond Wikidata (without needing to report experimental results for them).

Smaller Issues:
SI-1: Section 2.1: What if the paper title contained the words "Linking (...) Entities" (instead of "Entity Linking") consecutively or with other intermediate words? Are those works excluded? For example, one such paper (not targeting Wikidata though, and older than 2017) is "A declarative framework for linking entities" by Burdick, Fagin, Kolaitis, Popa and Tan (TODS 2016). Similarly for "Disambiguating (...) Entities", and also "benchmark" or "data" for the dataset search.

SI-2: Formalisms in Section 3:
SI-2a: You may want to use a different subscript than n in m_n, to avoid confusion with the number of words in an utterance, which may be different.
SI-2b: Why do the rank functions have the real numbers as their range? Isn't a rank function always returning a natural number?
SI-2c: Related to SI-2b perhaps, why do you want to maximize (argmax) instead of minimize (argmin) the ranks in the objective functions? Isn't rank 1 the preferred rank?
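(To make SI-2b/c concrete, here is a minimal sketch of the usual convention; the candidate-set notation C(m_i) is assumed here, not taken from the paper. A rank function naturally maps candidates to natural numbers and is minimized,

    rank: C(m_i) -> {1, ..., |C(m_i)|},   e_i = argmin_{e \in C(m_i)} rank(e),

whereas an argmax objective fits a real-valued scoring function,

    score: C(m_i) -> R,   e_i = argmax_{e \in C(m_i)} score(e).)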

SI-3: Datasets like LC-QuAD are referred to in Section 5.2, before they are introduced in Section 6. While reading 5.2, it took me some time going back and forth to even understand whether LC-QuAD, for example, is a method or a dataset.

SI-4: The references should be more carefully examined. For example, the last reference [137] seems to be noise, while references 65 and 66 are duplicates.

Typos/syntax/grammar issues and minor comments (in order of appearance):
- abstract: "which is Wikidata lacking" -> "which Wikidata is lacking"
- page 1, col 2: "DBpedia Live [21] exists, which is consistently updated with Wikipedia information. But (...)" -> "DBpedia Live [21] is consistently updated with Wikipedia information, but (...)"
- Table 2: "Datasets must include Wikidata identifiers from the start" quite understandable, but please elaborate to avoid misinterpretations.
- page 7: "specify how long" -> "specify for how long"
- page 7: "qv_i \in V. (s,r,o)" -> "qv_i \in V. The triple (s,r,o)"
- page 7: "(Ennio Morricone, nominated for, Academy Award for Best Original Score, (for work, The Hateful Eight), (statement is subject of, 88th Academy Awards))" -> "(Ennio Morricone, nominated for, Academy Award for Best Original Score, {(for work, The Hateful Eight), (statement is subject of, 88th Academy Awards)})." (add curly brackets outside the pairs and end the sentence with a period.)
- page 9: "are item labels/aliases" -> "item labels/aliases are"
- page 10: "Both, Q76 vs Q61909968" -> "Both Q76 and Q61909968 (which is a disambiguation page)"
- page 10: "However, as Wikidata is closely related to Wikipedia, an inclusion is easily doable." please elaborate briefly on the close relation and on the inclusion
- page 10 (and in multiple occurrences): "the amount of" [times/methods] -> "the number of"
- page 11: "This link probability a statistic on (...)" -> "This link probability is a statistic on (...)"
- Table 7: remove the comma after "Wikipedia" (for Boros et al)
- page 14: "As it only based on rules" -> "As it is only based on rules, "
- page 15: "an existing Information Extraction tool" -> which one?
- page 15: "Open triples are non-linked triples" -> please elaborate a bit more
- page 15: "E2E" -> "end-to-end"? please introduce the acronym before its first usage
- page 15: "Therefore, 200 candidates are found" -> If my understanding is correct, this should be "up to 200 candidates", since overlapping candidates are possibly generated by the 4 methods. If not, please clarify.
- page 17: "micro F1" I think is more commonly referred to and more easily understandable as "micro-averaged F1"
- page 17: "tp are here the amount of true positives, fp the amount of false positives and fn the amount of false negatives." ->
"Here, tp is the number of (...) over a ground truth" (I guess there is a ground truth of correct links).
- page 22: "Eleven datasets were found for which Wikidata [...]" place the references after "datasets" or at the end of the sentence.
- page 26: "a limited amount of datasets were created" -> "a limited number of datasets that were created

Review #2
By Filip Ilievski submitted on 16/Feb/2021
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community.

This paper provides a review of methods and datasets for entity linking over Wikidata. It provides a lengthy motivation for linking to Wikidata instead of other sources, and an in-depth review of Wikidata's components. It investigates four research questions relating to existing benchmarks, methods, and unexploited Wikidata-specific directions. The paper shows that there are a lot of benchmarks for evaluating EL over Wikidata, covering different domains, though different papers have used different ones, leading to a lack of comparability between methods and sparsity of the result tables. Similarly, nearly a dozen methods are reviewed, most of which rely on surface form patterns and leverage little of Wikidata's structure, let alone its specificities (like qualifiers). The authors suggest valuable directions forward, relating to the investigation of the temporal dependency between Wikidata dumps and datasets, as well as expanding the evaluation to other languages besides English.

The pursuit of this paper is very valuable, as indeed Wikidata is becoming a dominant public KG, in terms of its activity, quality and size. The research questions are relevant, and the introduction to Wikidata will benefit readers that are not familiar with it. The choice of datasets and methods is transparent, and their description is detailed. This paper is very relevant to the SW community.

I appreciate the methodology and systematicity of the authors' work, but I do suggest that the presentation be greatly improved. Specifically:

1) The paper's story needs to be organized better. For instance, the methods take over five pages, each being described in a linear fashion. It would be much better if the methods were organized into subgroups, for instance based on the features that they use, or based on the method itself. As it is now, it is hard to read this section and understand the big picture from it. Similarly for the datasets - why not organize them according to genres; and for the motivation for Wikidata (1.1), which now reads like a collection of anecdotal analyses rather than a coherent story.

2) The research questions are suitable but I was genuinely confused by the contributions, which do not seem to correspond to the RQs. Also, the order of the contributions is unintuitive to me, starting with future research avenues.

3) Relating to the above two points: there is a lot of content that addresses the RQs, but adding individual descriptions and then simply stating that 'this addresses RQx' does not seem satisfactory. I was especially puzzled that RQ2: "Which unexploited Wikidata characteristics are worth to consider for the Entity Linking task?" was claimed to be answered on page 4, even before the paper said anything about which characteristics of Wikidata have been exploited (and also before answering RQ1). I think the RQs should be answered more directly, which could be done by a paragraph at the end of the part relevant to that RQ. I also think that the order of the RQs should be re-considered, because it is now inconsistent in the paper.

4) The paper has a lot of statements that are too vague to be left as such in a scientific paper. For instance, in the introduction to Wikidata (Section 4), there is this paragraph: "Not all items are entities in the context of EL. In general, items which are unique instances of some class are interpreted as entities. Of course, this also depends on the use case."
I don't know what is meant here by the "context of EL", "unique instances", or "depends on the use case". Similarly, while there is a lot of information, the early sections of the paper mostly provide anecdotal information and avoid providing concrete definitions/results: what is meant by an entity, a KG, etc. I agree that some of these might be controversial, but in that case, it might help to provide information about that controversy. Other examples are "not very generalizable" and "not that susceptible" (5.1).

5) Why is the term emerging entity used instead of NIL? Note that these two are not synonyms. Emerging entities generally refer to those that used to be NILs but have received a representation at a later point. This is probably the case for some former NILs but not all.

6) Did the authors cross-check the system papers for other datasets? This seems like an intuitive step, given that the (nice) methodology for selecting papers was mostly keyword-based, and is likely to have missed relevant datasets/methods.

7) I fully see how the survey falls into category 2 (identification of open problems), but I fail to see this about category 3: providing a novel approach for these problems. In fact, the authors themselves say that they *identify* important characteristics of WD, but not that this constitutes a novel approach to these problems.

8) Can the authors provide an argument why the KG-agnostic methods are left out? This does not seem well-justified. Similarly, some judgments on 'suitability' seem rather arbitrary, like the one on Deeptype (which BTW is the most commonly covered method in prior surveys, according to Sec 7): "Nevertheless, as it is also possible to adapt other algorithms, initially created for different KGs, to Wikidata, this method may not be suitable to be compared to the other algorithms."

9) I did not follow the reasoning on the differences of methods in terms of retraining. Sure, traditional transductive embeddings need to be re-calculated, but why is this different for word-based embeddings (which tend to also use some of the structure), or label-based features? Is the assumption that the labels are more stable? If so, is there support for this?

10) Conversely, in terms of long-tail entities - why is this only a problem for LMs?

For both these points to be less confusing, I propose that the judgments of suitability of systems/datasets should be made explicit somewhere in the paper, and clearly justified to the extent possible.

11) Multilingualism is very prominent in the method/data descriptions and in the discussion, but the authors claim that it is out of scope for this paper. I think this should be resolved.

12) The paper should do a better job of connecting the datasets and the methods. Right now, these are two nearly disjoint sections, though in practice methods are often developed with datasets in mind, and vice versa. It would also be useful if the cells in the result tables were filled for the systems that have an existing API, and perhaps open code.

13) The two main findings that answer RQ4 do not actually follow directly from the discussion of the results. This discrepancy should be addressed.

14) Being up to date is very valuable for the users, but why is that an important criterion for EL methods (cf. 8.1)? AFAIK, and as the authors say later, no system so far has taken the temporal evolution into account.

15) The future research avenues are very vague and should be made more concrete - e.g., how would you propose to train embeddings more efficiently? Is there an existing alternative that goes in this direction? Etc.

Other comments:
* I am not sure what the authors mean, but Wikidata's dump is not 'always' up to date (in fact, this would be nearly impossible with the current technology and size of WD). Wikidata's dump is being updated twice a month, which I agree is more often than other similar KGs. Wikidata's live endpoint and website are up-to-date, but this is not what is being compared to (say in Table 4).
* Why is table 3 a table?
* Footnote 2 points to a website in German.
* "While some very recent surveys exist" - missing citations
* "the vast amount of methods" -> "the vast majority of methods"
* in section 1.2, does 'Wikidata-specific' mean Wikidata-only? Please clarify
* does DBpedia really have 484k types?
* please introduce E2E abbreviation
* the order of words is ungrammatical here "The largest number of ambiguous mentions have the Wiki-Disamb30 datasets, resulting"
* Wiki-Disamb has not been introduced, I think
* In 8.1 - what is meant by WD and Wikipedia being a 'perfect pair'? Does this refer to the current methods, or is it rather a suggestion for future methods? If the former, then the paper would be inconsistent with prior statements that current methods do not really exploit Wikidata's particularities.

Review #3
Anonymous submitted on 06/May/2021
Suggestion:
Major Revision
Review Comment:

# Summary:

This survey investigates entity linking (EL) approaches and datasets that use Wikidata. The authors classify EL methods into three groups (rules, statistics, and deep learning) and discuss the extent to which they draw upon capabilities that are specific to Wikidata. Datasets are listed based on their usage, and the authors discuss the extent to which Wikidata's design is relevant to the datasets' construction.

## Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.

The paper is well suited as an introductory text, although it is specialized in EL and Wikidata. The focus on Wikidata as a knowledge source for EL distinguishes the paper from other similar work, makes it valuable to EL designers, and makes it relevant to the journal's scope.

## How comprehensive and how balanced is the presentation and coverage.

1. The survey seems to cover many relevant papers, although it would have been helpful if the authors could provide information on the search portals they used for identifying the papers covered in the literature review (similar to Table 3 for the datasets).
2. Given the survey's focus, I would like to suggest putting more emphasis on the differences between Wikidata and other knowledge sources (p6). After reading Section 4.1, a reader might wonder
- how/whether a Wikidata statement differs from a DBpedia statement;
- how references and ranks work and can be used for EL. Providing an example would be very useful for clarifying this point; and
- why the other structural Wikidata elements are not included in the article.
3. The survey provides a very short description of Wikidata's weaknesses (p9). Discussing these points in terms of the dimensions defined in "Zaveri, Amrapali, Anisa Rula, Andrea Maurino, Ricardo Pietrobon, Jens Lehmann, and Sören Auer. Quality Assessment for Linked Data: A Survey. Semantic Web 7, no. 1 (January 1, 2016): 63–93. https://doi.org/10.3233/SW-150175" might be useful.

## Readability and clarity of the presentation.

The presentation chosen in the survey is sometimes not very reader-friendly:
- p2, Fig. 3: Publishing years of included Wikidata EL papers (at this point the reader has no idea which papers are meant - please provide a reference to a list of these papers (e.g., Table 7 on p12)).
- p3: "While some very recent surveys exist..." > no citation or reference is given. Referencing Section 7 (related work which starts on page 25) or citing some of these studies would allow the reader to check related surveys prior to reading through the whole article.

The provided research questions are interesting and valuable to the survey's focus. Nevertheless, the answers included in the article should be more specific, better justified and structured in a way that allows the reader to quickly find them.
- RQ2 (p10): Which unexploited Wikidata characteristics are _worth_ to consider for the Entity Linking task? The answer notes that Wikidata _has_ characteristics that introduce new possibilities and challenges but does not clearly outline (i) which of these characteristics are relevant, (ii) why they are relevant, and (iii) how they are expected to impact EL performance. The following discussion of the approaches (Section 5) suggests that the hyper-relational structure and Wikidata's fine-grained type system might be such characteristics, but this is not clear at the point where the answer is given. In addition, no definition of _worth_ is provided. The authors also do not elaborate _why_ the suggested characteristics should be beneficial. This question actually remains open until Section 8.2, which provides some ideas on how Wikidata-specific properties might be beneficial to EL.
- RQ1 (p20): Refining the verbal description with an (extended) version of Table 7 (p12) would provide the reader with a clear and concise answer to the research question.
- RQ3 (p22): is implicitly answered by referring to Section 5.2 (p17)
- RQ4: I wholeheartedly agree with the authors that the Wikidata version used for creating a gold standard should be documented. Nevertheless, I feel that the problem of KB evolution is not specific to Wikidata but rather relevant to all evolving KBs.
- Table 7 (p12): It would be beneficial to extend this table with another column that lists mitigation strategies used for the shortcomings within Wikidata (e.g., draw upon further KBs to obtain larger amounts of textual data, etc.)
- Table 11 (p21), Section 8 (p26): Statistics on the number of Wikidata identifiers versus total identifiers, NIL entities, unmappable entities, and the annotation method (automatic versus manual) would be very beneficial.
- p23: more context (e.g., by providing some examples) on why removing exact matches from the Disamb30 dataset improves performance would be interesting

## Importance of the covered material to the broader Semantic Web community.

This survey provides valuable insights into EL approaches that operate on Wikidata, raises the awareness of EL designers of Wikidata's potential, and illustrates how a KB's design impacts the potential of components that draw upon this KB.

## Replicability of the experiments

The authors published their results and source code on GitHub. Although the code seems to be well organized, replicating the results will be very difficult, since the repository's `README.md` file does not contain instructions on
- how to obtain the necessary datasets and Wikidata dumps (links to the datasets would be tremendously helpful)
- the sequence and use of the scripts required for replicating the results.

## General remarks

1. I feel that the paper addresses an important topic, although its impact would highly benefit from
- a more detailed elaboration on the crucial differences between Wikidata and other knowledge bases, and
- a clearer and more concise assessment (which should be backed up by anecdotal evidence or, even better, experiments) of the possible impact the use of these differences might have on EL performance.

2. Ideas are often implicitly stated but not clearly expressed (e.g., the answers to the research questions). The evidence provided with some research questions (e.g., RQ2) seems to be too sparse for fully answering them. This issue could be mitigated by providing more evidence or adapting the research questions.

3. A more elaborate coverage of conflicts and how they affect EL would be interesting (is it only about ranking? Which types of conflicts actually occur? How frequently does Wikidata contain conflicts (in contrast to simple updates)?)

## Minor remarks

1. p5: it is mentioned that 11 Wikidata datasets have been included > Tables 12 and 14 both only mention 10 datasets (NYT2018, which is cited in Table 13, is missing)
2. p5: consistency
- both the letters `s` and `u` are used for referring to an utterance
- I would like to suggest adding `m=` to the second equation on page 5 to indicate that the span refers to an entity mention
3. p7: P[0-9]* > P[0-9]+ (a property identifier requires at least one digit; see the short example after this list)
4. Grammar:
- p2 _have_ attract_ed_ novel EL research ... in recent years
- p6: a_n_ is_instance property
5. Table 5: average., > average,
6. p10: I am wondering why KBPearl has been considered a statistical rather than a graph-based approach
7. abbreviations such as E2E should be defined prior to their first use
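
To make the remark in point 3 concrete, here is a minimal sketch (the property identifiers P31 and P1411 are merely illustrative examples, not taken from the paper):

    import re

    # Wikidata property identifiers are "P" followed by at least one digit,
    # so the pattern needs one-or-more digits ("+") rather than zero-or-more ("*").
    loose = re.compile(r"^P[0-9]*$")   # also accepts a bare "P"
    strict = re.compile(r"^P[0-9]+$")  # requires at least one digit, e.g. "P31"

    for candidate in ["P31", "P1411", "P"]:
        print(candidate, bool(loose.match(candidate)), bool(strict.match(candidate)))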