Review Comment:
Summary:
This paper presents CaLiGraph, a knowledge graph (KG) automatically constructed from semi-structured content available in Wikipedia. The key innovation of CaLiGraph lies in its use of both Wikipedia categories and lists to build an extensive taxonomy, enrich it with axioms that describe the classes, and populate the KG with a large number of entities. The extraction framework aims to overcome challenges in automated KG construction, focusing in particular on capturing long-tail entities and on improving the expressiveness and granularity of the ontology.
CaLiGraph distinguishes itself by providing detailed semantic class descriptions through axioms, addressing the incompleteness common to public KGs by incorporating entities from semi-structured Wikipedia data sources.
Details:
The extraction framework consists of a pipeline that involves ontology construction (class and property definition, taxonomy induction, and axiom learning) and knowledge graph population (named entity recognition, disambiguation, entity typing, and relation extraction).
Ontology Construction: CaLiGraph's ontology is constructed from Wikipedia's categories and list pages. This process involves defining relevant classes and properties, discovering hierarchical relationships, and formulating axioms to describe the domain comprehensively.
Knowledge Graph Population: This phase involves identifying named entities in Wikipedia's text, disambiguating them to existing entities or creating new ones, typing these entities based on the ontology, and extracting relationships between them. The methodology emphasizes the use of semi-structured data (listings and tables) in Wikipedia for more accurate and less error-prone information extraction compared to free-text analysis.
CaLiGraph describes over 1.3 million classes and 13.7 million entities, showcasing the KG's rich taxonomy and broad entity coverage. This represents a significant enhancement over existing KGs, particularly in covering long-tail and emerging entities. The authors also evaluate CaLiGraph's performance on downstream tasks, highlighting its utility and comparing it with other popular KGs such as DBpedia and YAGO, demonstrating its effectiveness in improving task performance due to its richer semantic descriptions and wider entity coverage.
Minor remarks:
- "(subject,predicate,object)" (page 1, line 47): add spaces after commas.
- "Freebase [24] and, more recently, Wikidata [12] are examples of achieving scalability in manual curation via crowd-sourcing, but again, the capability to scale up is limited." (page 3, line 37): I would not say that Wikidata's capability to scale up is limited, given its size.
- "which allows more concise access to (semi-)structured page elements like sections, listings, and tables than plain HTML" (page 7, line 29): I would rephrase for clarity, e.g., "Wikipedia uses its own markup language, Wiki markup, which allows for more concise access to (semi-)structured page elements such as sections, listings, and tables, compared to plain HTML."
Conclusion:
I have little to add, besides that this paper is very interesting: it is very well written and gives sufficient detail on each step. The descriptions of the pipeline and the experiments are comprehensive. The authors' discussion of limitations is very clear, and they also provide convincing ideas for improving the situation in the future.
To conclude, I enjoyed reading this paper, which is an important contribution to the Semantic Web community, and I recommend accepting the paper.