Survey on English Entity Linking on Wikidata

Tracking #: 2865-4079

Cedric Moeller
Jens Lehmann
Ricardo Usbeck

Responsible editor: Guest Editors Advancements in Linguistics Linked Data 2021

Submission type: Survey Article

Wikidata is a frequently updated, community-driven, and multilingual knowledge graph. Hence, Wikidata is an attractive basis for Entity Linking, which is evident by the recent increase in published papers. This survey focuses on four subjects: (1) Which Wikidata Entity Linking datasets exist, how widely used are they and how are they constructed? (2) Do the characteristics of Wikidata matter for the design of Entity Linking datasets and if so, how? (3) How do current Entity Linking approaches exploit the specific characteristics of Wikidata? (4) Which Wikidata characteristics are unexploited by existing Entity Linking approaches? Our survey reveals that current Wikidata-specific Entity Linking datasets do not differ in their annotation scheme from schemes for other knowledge graphs like DBpedia. Thus, the potential for multilingual and time-dependent datasets, naturally suited for Wikidata, is not lifted. Furthermore, we show that most Entity Linking approaches use Wikidata in the same way as any other knowledge graph, missing the chance to leverage Wikidata-specific characteristics to increase quality. Almost all approaches employ specific properties like labels and sometimes descriptions but ignore characteristics such as the hyper-relational structure. Thus, there is still room for improvement, for example, by including hyper-relational graph embeddings or type information. Many approaches also include information from Wikipedia, which is easily combinable with Wikidata and provides valuable textual information, which Wikidata lacks.

Solicited Reviews:
Review #1
By Vasilis Efthymiou submitted on 18/Sep/2021
Review Comment:

The authors have addressed all my concerns, which were already minor ones. I had originally suggested removing one table from the paper, but I am fine with their decision to keep it in the appendix.
I learned a lot by reading this paper and I believe that it should be accepted for publication.

Review #2
By Albert Weichselbraun submitted on 30/Sep/2021
Review Comment:

The revised paper is much better structured, and has gained considerably in readability and clarity.

From my point of view, the paper is ready for publication, provided that the issues outlined below are addressed.

## Remarks

- The definition of hyper-relational graphs is given on page 9, although the term is first used on page 2. Having this concept defined earlier (preferably before or as part of Section 3, Problem definition) would provide the reader with the reasoning behind the chosen formalization of knowledge graphs as G = (V, E, R) rather than the standard G = (V, E).
- page 10: the claim "DBpedia does ... not include aliases, only a single exact label" is misleading, since DBpedia pages do not only contain unique `rdfs:label` attributes but also additional attributes such as `foaf:name`, `dbp:commonName`, and `dbp:conventionalLongName` that effectively provide aliases.
- page 14: "Thus information on the quality and construction process of Wikidata is given." - how does information on how the dataset has been constructed provide information on the quality and construction process of Wikidata?
- page 28: "As it is proven that the inclusion ... can improve the performance of link prediction" - is this fact really _proven_ or rather confirmed by benchmarks?

## Minor remarks

- page 4, footnote 4: the query is incomplete
- page 7: "... are updated most frequently Table 3" > "... are updated most frequently (Table 3)"
- general: abbreviations such as KBP and LM should be defined prior to their first use.
- Table 11: some approaches are cited based on the authors (e.g., Huang et al.) others by their name (VCG, KBPearl, etc.) or their description (NED using DL on Graphs).
- Table 11: neither footnote 1 nor footnote 4 appears in the table

Review #3
By Filip Ilievski submitted on 06/Oct/2021
Review Comment:

I appreciate that the authors thoroughly revised the paper, and addressed my fairly long list of remarks and suggestions. In the current state, the paper seems much improved, and I vote for acceptance, as I think it will be fairly impactful and well-cited.

Provided that the paper is accepted, I ask the authors to consider the following comments:
1. A survey on Wikidata EL is very timely and useful, but the paper is really verbose in its current state, which makes it easy for the reader to miss important discussion points. Some condensing of the content to remove repetitions and emphasize the key points more would help readability.
2. The dataset domains seem to lie in different conceptual spaces (e.g., open-domain vs research vs twitter). If 'domain' here means genre, then perhaps a simple fix is changing 'open domain' into 'encyclopedic'?
3. Similarly, an overview/justification for the categorization of the approaches in 6.1 and 6.2 would help, as right now the categories seem to be relatively arbitrary, and it is hard to see how they relate to each other.
4. Some typos still exist, please proofread before sending a final version. For instance: "As one can see, is the accuracy positively correlated..."