Abstract:
Vast community-driven knowledge graphs (KGs), such as Wikidata, are the primary reference data sources for Entity Linking (EL) applications. However, they exhibit significant coverage bias towards information that is widely popular on the Web, leading to underrepresentation of long-tail entities, particularly from non-contemporary contexts. Concurrently, the ongoing mass digitisation of cultural heritage resources reveals numerous named entities and associated knowledge that are currently missing from general-purpose KGs. Enriching such KGs with these ``NIL'' entities offers an opportunity to improve completeness and mitigate biases, such as gender disparities in the representation of historical figures.
In this article, we investigate an approach based on retrieval-augmented generative AI to capture information about NIL entities and generate structured KGs suitable for integration into Wikidata.
The approach is applied to the case of persons unknown to Wikidata who are mentioned in a collection of 19th-century musical periodicals. We empirically select 6 properties from Wikidata for entities of that type and create a manually annotated NIL-entities KG as the gold standard for evaluation.
Through comprehensive experiments, we evaluate 6 state-of-the-art large language models (LLMs) from different vendors, combined with 6 different state-of-the-art retrievers.
Our results demonstrate significant variation in performance across model-retriever combinations, with high accuracy for gender identification and family name, promising results for occupation and country of citizenship, and low accuracy for date of birth.
We report a detailed error analysis and discuss the potential of our approach to mitigate historical bias in Wikidata.