Survey on English Entity Linking on Wikidata

Cedric Moeller
Jens Lehmann
Ricardo Usbeck

Survey Article
Wikidata is an always up-to-date, community-driven, and multilingual knowledge graph. Hence, Wikidata is an attractive basis for Entity Linking, as is evident from the recent increase in published papers. This survey focuses on four subjects: (1) How do current Entity Linking approaches exploit the specific characteristics of Wikidata? (2) Which unexploited Wikidata characteristics are worth to consider for the Entity Linking task? (3) Which Wikidata Entity Linking datasets exist, how widely used are they, and how are they constructed? (4) Do the characteristics of Wikidata matter for the design of Entity Linking datasets and, if so, how? Our survey reveals that most Entity Linking approaches use Wikidata in the same way as any other knowledge graph, missing the chance to leverage Wikidata-specific characteristics to increase quality. Almost all approaches employ specific properties like labels and sometimes descriptions but ignore characteristics like the hyper-relational structure. Thus, there is still room for improvement, for example, by including hyper-relational graph embeddings or type information. Many approaches also include information from Wikipedia, which is easily combinable with Wikidata and provides valuable textual information that Wikidata lacks. The current Wikidata-specific Entity Linking datasets do not differ in their annotation scheme from schemes for other knowledge graphs like DBpedia. The potential for multilingual and time-dependent datasets, naturally suited for Wikidata, remains untapped.
Review #1
By Vasilis Efthymiou submitted on 06/Feb/2021
Suggestion: Major Revision

Review Comment:

# Summary:

This survey investigates entity linking (EL) approaches and datasets that use Wikidata. The authors classify EL methods into three groups (rules, statistics and deep learning) and discuss the extent to which they draw upon capabilities that are specific to Wikidata. Datasets are listed based on their usage and the authors discuss the extent to which Wikidata's design is relevant to the dataset's construction.

## Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.

The paper is well suited as an introductory text, although it is specialized on EL and Wikidata. The focus on Wikidata as a knowledge source for EL distinguishes the paper from other similar work, makes it valuable to EL designers, and relevant to the journal's scope.

## How comprehensive and how balanced is the presentation and coverage.

1. The survey seems to cover many relevant papers, although it would have been helpful if the authors had provided information on the search portals they used for identifying the papers covered in the literature review (similar to Table 3 for datasets).
2. Given the survey's focus, I would like to suggest putting more emphasis on the differences between Wikidata and other knowledge sources (p6). After reading Section 4.1 a reader might wonder
   - how/whether a Wikidata statement differs from a DBpedia statement;
   - how references and ranks work and can be used for EL (providing an example would be very useful for clarifying this point); and
   - why the other structural Wikidata elements are not included in the article.
3. The survey provides a very short description of Wikidata's weaknesses (p9). Discussing these points in terms of the dimensions defined in Zaveri, Amrapali, Anisa Rula, Andrea Maurino, Ricardo Pietrobon, Jens Lehmann, and Sören Auer. “Quality Assessment for Linked Data: A Survey.” Semantic Web 7, no. 1 (January 1, 2016): 63–93. https://doi.org/10.3233/SW-150175 might be useful.

## Readability and clarity of the presentation.

The presentation chosen in the survey is sometimes not very reader friendly:

- p2, Fig. 3: Publishing years of included Wikidata EL papers (at this point the reader has no idea which papers are meant; please provide a reference to a list of these papers, e.g., Table 7 on p12).
- p3: "While some very recent surveys exist..." > no citation or reference is given. Referencing Section 7 (related work, which starts on page 25) or citing some of these studies would allow the reader to check related surveys prior to reading through the whole article.

The provided research questions are interesting and valuable to the survey's focus. Nevertheless, the answers included in the article should be more specific, better justified and structured in a way that allows the reader to quickly find them.

- RQ2 (p10): Which unexploited Wikidata characteristics are _worth_ to consider for the Entity Linking task? The answer notes that Wikidata _has_ characteristics that introduce new possibilities and challenges but does not clearly outline (i) which of these characteristics are relevant, (ii) why they are relevant and (iii) how they are expected to impact EL performance. The following discussion of the approaches (Section 5) suggests that the hyper-relational structure and Wikidata's fine-grained type system might be such characteristics, but this is not clear at the point where the answer is given. In addition, no definition of _worth_ is provided. The authors also do not elaborate _why_ the suggested characteristics should be beneficial. This question actually remains open until Section 8.2, which provides some ideas on how Wikidata-specific properties might be beneficial to EL.
- RQ1 (p20): Refining the verbal description with an (extended) version of Table 7 (p12) would provide the reader with a clear and concise answer to the research question.
- RQ3 (p22): is implicitly answered by referring to Section 5.2 (p17).
- RQ4: I wholeheartedly agree with the authors that the Wikidata version used for creating a gold standard should be documented. Nevertheless, I feel that the problem of KB evolution is not specific to Wikidata but rather relevant to all evolving KBs.
- Table 7 (p12): It would be beneficial to extend this table with another column that lists mitigation strategies used for the shortcomings within Wikidata (e.g., draw upon further KBs to obtain larger amounts of textual data, etc.).
- Table 11 (p21), Section 8 (p26): Statistics on the number of Wikidata identifiers versus total identifiers, NIL entities, unmappable entities, and annotation method (automatically versus manually) would be very beneficial.
- p23: More context (e.g., by providing some examples) on why removing exact matches from the Disamb30 dataset improves performance would be interesting.

## Importance of the covered material to the broader Semantic Web community.

This survey provides valuable insights into EL approaches that operate on Wikidata, raises the awareness of EL designers of Wikidata's potential, and illustrates how a KB's design impacts the potential of components that draw upon this KB.

## Replicability of the experiments

The authors published their results and source code on GitHub. Although the code seems to be well organized, replicating the results will be very difficult, since the repository's README.md file does not contain instructions on

- how to obtain the necessary datasets and Wikidata dumps (links to the datasets would be tremendously helpful)
- the sequence and use of the scripts required for replicating the results.

## General remarks

1. I feel that the paper addresses an important topic, although its impact would highly benefit from
   - a more detailed elaboration on the crucial differences between Wikidata and other knowledge bases, and
   - a clearer and more concise assessment (which should be backed up by anecdotal evidence or, even better, experiments) of the possible impact the use of these differences might have on EL performance.
2. Ideas are often implicitly stated but not clearly expressed (e.g., the answers to the research questions). The evidence provided with some research questions (e.g., RQ2) seems to be too sparse for fully answering them. This issue could be mitigated by providing more evidence or adapting the research questions.
3. A more elaborate coverage of conflicts and how they affect EL would be interesting (is it only about ranking? Which types of conflicts actually occur? How frequently does Wikidata contain conflicts (in contrast to simple updates)?)

## Minor remarks

1. p5: mentioned that 11 Wikidata datasets have been included > Tables 12 and 14 both only mention 10 datasets (NYT2018, which is cited in Table 13, is missing)
2. p5:
   - consistency: both the letters s and u are used for referring to an utterance
   - I would like to suggest adding m= to the second equation on page 5 to indicate that the span refers to an entity mention
3. p7: P[0-9]* > P[0-9]+
4. Grammar:
   - p2: _have_ attract_ed_ novel EL research ... in recent years
   - p6: a_n_ is_instance property
5. Table 5: average., > average,
6. p10: I am wondering why KBPearl has been considered a statistical rather than a graph-based approach
7. Abbreviations such as E2E should be defined prior to their first use
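
The regular-expression correction in minor remark 3 (p7) can be illustrated with a minimal sketch. It assumes the pattern is meant to describe Wikidata property identifiers such as P31, and that the whole string should match. The names `LOOSE`, `STRICT`, `is_property_id` and the sample strings are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import re

# The two patterns quoted in minor remark 3. Names are illustrative only.
LOOSE = re.compile(r"P[0-9]*")   # zero or more digits: also accepts a bare "P"
STRICT = re.compile(r"P[0-9]+")  # one or more digits: requires a numeric part


def is_property_id(candidate: str, pattern: re.Pattern = STRICT) -> bool:
    """Return True if the entire string matches the given pattern."""
    return pattern.fullmatch(candidate) is not None


# Hypothetical sample strings: two well-formed ids, a bare prefix, an item id.
samples = ["P31", "P", "P279", "Q42"]

loose_hits = [s for s in samples if is_property_id(s, LOOSE)]
strict_hits = [s for s in samples if is_property_id(s, STRICT)]

print("P[0-9]* accepts:", loose_hits)
print("P[0-9]+ accepts:", strict_hits)
```

With `*`, the bare prefix "P" is accepted as an identifier, whereas `+` rejects it; this is the only difference between the two patterns on these samples.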