CaLiGraph: A Knowledge Graph from Wikipedia Categories and Lists

Tracking #: 3601-4815

Authors: 
Nicolas Heist
Heiko Paulheim

Responsible editor: 
Raghava Mutharaju

Submission type: 
Dataset Description
Abstract: 
Knowledge Graphs (KGs) are increasingly used for solving or supporting tasks such as question answering or recommendation. To achieve a useful performance on such tasks, it is important that the knowledge modelled by KGs is as correct and complete as possible. While this is an elusive goal for many domains, techniques for automated KG construction (AKGC) serve as a means to approach it. Yet, AKGC has many open challenges, like learning expressive ontologies or incorporating long-tail entities. With CaLiGraph, we present a KG automatically constructed from categories and lists in Wikipedia, offering a rich taxonomy with semantic class descriptions and a broad coverage of entities. We describe its extraction framework and provide details about its purpose, resources, usage and quality. Further, we evaluate the performance of CaLiGraph on downstream tasks and compare it to other popular KGs.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 09/Apr/2024
Suggestion:
Accept
Review Comment:

Summary:
This paper presents a knowledge graph automatically constructed from semi-structured content available in Wikipedia. The key innovation of CaLiGraph lies in its approach to leveraging both Wikipedia categories and lists to develop an extensive taxonomy enriched with axioms for class descriptions and populate the KG with a large number of entities. This extraction framework aims to overcome challenges related to automated KG construction, particularly focusing on capturing long-tail entities and enhancing the expressiveness and granularity of the ontology.
CaLiGraph distinguishes itself by providing detailed semantic class descriptions through axioms, addressing the incompleteness common to public KGs by incorporating entities from semi-structured Wikipedia data sources.

Details:
The extraction framework consists of a pipeline that involves ontology construction (class and property definition, taxonomy induction, and axiom learning) and knowledge graph population (named entity recognition, disambiguation, entity typing, and relation extraction).
Ontology Construction: CaLiGraph's ontology is constructed from Wikipedia's categories and list pages. This process involves defining relevant classes and properties, discovering hierarchical relationships, and formulating axioms to describe the domain comprehensively.
Knowledge Graph Population: This phase involves identifying named entities in Wikipedia's text, disambiguating them to existing entities or creating new ones, typing these entities based on the ontology, and extracting relationships between them. The methodology emphasizes the use of semi-structured data (listings and tables) in Wikipedia for more accurate and less error-prone information extraction compared to free-text analysis.
CaLiGraph describes over 1.3 million classes and 13.7 million entities, showcasing the KG's rich taxonomy and broad entity coverage. This represents a significant enhancement over existing KGs, particularly in covering long-tail and emerging entities. The authors also evaluate CaLiGraph's performance on downstream tasks, highlighting its utility and comparing it with other popular KGs such as DBpedia and YAGO, demonstrating its effectiveness in improving task performance due to its richer semantic descriptions and wider entity coverage.
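The two-phase pipeline summarized above can be sketched in miniature. This is a hypothetical illustration, not CaLiGraph's actual implementation: the naive category-name parsing, the `artist` property, and the singularization rule are all assumptions made for the sake of the example.

```python
# Minimal sketch of the two-phase pipeline: (1) ontology construction
# derives a class and a restriction-style axiom from a category name,
# (2) KG population applies that axiom to the entities listed under it.
# All names and the parsing heuristics are illustrative assumptions.
import re

def build_class_axiom(category: str):
    """Phase 1: turn a category name such as 'Albums by Madonna'
    into a class with an implied (type, property, value) axiom."""
    m = re.match(r"(\w+) by (.+)", category)
    if not m:
        return None
    plural_type, value = m.groups()
    entity_type = plural_type.rstrip("s")  # naive singularization
    return {"class": category, "type": entity_type,
            "property": "artist", "value": value}

def populate(axiom, members):
    """Phase 2: every entity listed under the category receives the
    class's type and the relation implied by the axiom."""
    return ([(e, "type", axiom["type"]) for e in members]
            + [(e, axiom["property"], axiom["value"]) for e in members])

axiom = build_class_axiom("Albums by Madonna")
triples = populate(axiom, ["Like a Prayer", "Ray of Light"])
# triples now contains e.g. ("Like a Prayer", "type", "Album")
# and ("Like a Prayer", "artist", "Madonna")
```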

Minor remarks:
- "(subject,predicate,object)" (page 1, line 47): add spaces after commas.
- "Freebase [24] and, more recently, Wikidata [12] are examples of achieving scalability in manual curation via crowd-sourcing, but again, the capability to scale up is limited." (page 3, line 37): I would not say that Wikidata capability to scale up is limited given its size.
- "which allows more concise access to (semi-)structured page elements like sections, listings, and tables than plain HTML" (page 7, line 29): I would rephrase for clarity, e.g., "Wikipedia uses its own markup language, Wiki markup, which allows for more concise access to (semi-)structured page elements such as sections, listings, and tables, compared to plain HTML."

Conclusion:
I do not have much to say, besides that this paper is very interesting: it is very well written and contains a sufficient amount of detail on the different steps. The description of the pipeline and the experiments is comprehensive. The authors' discussion of limitations is very clear, and at the same time they provide convincing ideas for improving the situation in the future.
To conclude, I enjoyed reading this paper, which is an important contribution to the Semantic Web community, and I recommend accepting the paper.

Review #2
By Thomas Pellissier Tanon submitted on 10/May/2024
Suggestion:
Accept
Review Comment:

This paper presents improvements to automatically extract more complex ontology axioms from Wikipedia category and list pages. Such axioms go beyond the usual "rdfs:subClassOf", into more complex axioms like "if page X is in the category Y, then X is an album by artist A". The presented approach seems sound and has been used to build an improved version of DBpedia called CaLiGraph, which is publicly available.
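As an illustration of the kind of axiom referred to above, a sketch in description-logic notation (the class and property names are hypothetical, and the paper's exact formalization may differ):

```latex
% Illustrative restriction axiom: membership in the category
% "Albums by Artist A" implies being an album whose artist is A.
\mathit{AlbumsByArtistA} \sqsubseteq
    \mathit{Album} \sqcap \exists\, \mathit{artist}.\{\mathit{ArtistA}\}
```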

Some specific comments or improvement ideas:
- In section 3.2.3 it would be nice to describe which kind of axioms your approach can mine. It is suggested by examples but not formally stated.
- Even if it is not an automatically constructed knowledge base, adding Wikidata to the comparisons with other knowledge bases might have been a nice "anchor point".
- In figure 5, there is no "person" class but a big "birth" class. Is it an artifact of mining from Wikipedia classes like "birth in XXXX"?
- In table 4, reminding the reader of the accuracy of other knowledge bases like DBpedia or YAGO would have been convenient (of course, only for targets for which such evaluations exist).

Review #3
Anonymous submitted on 01/Jun/2024
Suggestion:
Minor Revision
Review Comment:

The work presents a new KG, CaLiGraph, built from Wikipedia categories and lists.

Section 1: The introduction is clear and explains the motivation behind the work well. I like the focus of the paper, which is specific to the considered problem.
Section 2: In general, section 2 is fine and explains automatic KG construction pipelines. However, there are several approaches in the literature that aim to improve KG construction, such as Jaradeh et al. (Information extraction pipelines for knowledge graphs), etc. I would ask the authors to also include a table comparing existing KG construction approaches with the approach followed by the authors. Currently, this section gives an idea of what the KG construction sub-tasks are, but it does not explain how the authors aim to fix the limitations.

Section 3: I have a couple of questions:
When the authors remove certain edges in taxonomy induction (sec 3.2.2), does this impact the overall quality of the KG? If not, why not? If yes, how?
For pattern mining, the authors use distant supervision, which in general has its own limitations. How do the authors ensure that distant supervision does not impact the overall quality?
What is the overall impact of the NED tool on KG quality?

Section 4: What is the overall plan to sustain the resource presented in this work? Sustaining a KG over a longer period requires continuous effort. Do the authors plan to sustain it? If yes, how? Does it require funding from the University of Mannheim? I am curious here. Adding 1-2 lines on sustainability plans, extending section 4.2.4, would not hurt.

Remaining sections: Overall good work. I would like to see a bit more discussion of the existing quality of the KG. Maybe the authors could define some KPIs on which the authors or other researchers can improve the existing KG? Which of them are low-hanging fruit? Which require longer-term efforts from the community?

Overall I like the effort of the paper.