Empirical ontology design patterns and shapes from Wikidata

Tracking #: 3368-4582

Valentina Anita Carriero
Paul Groth
Valentina Presutti

Responsible editor: Guest Editors Wikidata 2022

Submission type: Full Paper
The ontology underlying the Wikidata knowledge graph (KG) has not been formalized. Instead, its semantics emerges bottom-up from the use of its classes and properties. The Wikidata project has defined flexible guidelines and rules for the use of its ontology; however, it is still often difficult to reuse the ontology's constructs. Based on the assumption that identifying ontology design patterns from a knowledge graph contributes to making its (possibly) implicit ontology emerge, in this paper we present a method for extracting what we term empirical ontology design patterns (EODPs) from a knowledge graph. This method takes as input a knowledge graph and extracts the EODPs as sets of axioms/constraints involving the classes instantiated in the KG. These EODPs include data about the probability of such axioms/constraints occurring. We apply our method to two domain-specific portions of Wikidata, addressing the 'music' and 'art, architecture, and archaeology' domains, and we compare the empirical ontology design patterns we extract with the current support present in Wikidata. We show how these patterns can provide guidance for the use of the Wikidata ontology and its potential improvement, and can give an insight into the content of (domain-specific portions of) the Wikidata knowledge graph.

Minor Revision

Solicited Reviews:
Review #1
By Giorgos Stoilos submitted on 04/May/2023
Major Revision
Review Comment:

The paper presents so-called empirical ontology design patterns (EODPs) and probabilistic extensions of them as a means to analyse the ontological structure of a given KG. The method is applied to different sub-parts of Wikidata, showing the relevance and usefulness of the approach. Besides probabilistic EODPs, the authors also present probabilistic ShEx.

The experimental evaluation clearly shows that the approach is useful in practice and can be used to help define, revise, and maintain ontologies for large KGs like Wikidata. I have the following comments that I believe can help improve the paper.

1. The technical part presenting the method to extract pEODPs is quite sketchy. The authors cite a previous paper of theirs and then briefly present the method and the extensions relevant to this work with a figure and in text. It would be very good to make this section more formal: provide an algorithm in pseudo-code and define all concepts and formulas that the algorithm uses. For example, the statement "we count the number of distinct instances that have at least one triple involving each property in the subgraph, i.e., we compute their occurrences" should be turned into a formula that makes this statement clear and precise. It would also be important to provide a short introduction to ShEx and perhaps also RDF*.

2. Experimental evaluation. The analysis of results and the examples are given in running text, which makes them a bit hard to read. It would be good to use different formatting for the examples, like math environments. For example, instead of saying: "the property Chessgames.com player...(CODE1) is an instance of the class Wikidata (CODE2)... that is a subclass of Wikidata property...(CODE3) which includes e.g., the property number of medal (CODE4)", give the triples of the aforementioned axioms in a math-array format and not a natural language description of them. There are a lot of examples like that in the experimental sections, and they are very hard to read and parse. The evaluation section is generally quite lengthy. The authors cannot be faulted for doing a lot of experiments, but at the same time the presentation can feel a bit tedious. Perhaps the RDF*/OWL* listings can be pushed to an appendix, and some results can be summarized.

3. It is not clear what the results would be if this method were applied to a KG that, unlike Wikidata, doesn't specify rules or constraints for the use of properties and classes. It would be good to conduct a small study applying the method to a different KG.
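
To illustrate the kind of precision comment 1 asks for, the quoted occurrence count could be sketched as follows. This is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes that an instance "has a triple involving a property" when it is the subject of that triple, and that the probability attached to a property is its occurrence count divided by the number of instances of the class. The function names and the `ex:` identifiers are hypothetical.

```python
from collections import defaultdict


def property_occurrences(triples, instances):
    """Count, per property, the distinct instances that are the subject
    of at least one triple using that property (assumed reading of the
    quoted statement)."""
    subjects_by_property = defaultdict(set)
    for subject, prop, _obj in triples:
        if subject in instances:
            # A set ensures each instance is counted once per property,
            # however many triples it has with that property.
            subjects_by_property[prop].add(subject)
    return {prop: len(subjects) for prop, subjects in subjects_by_property.items()}


def property_frequencies(triples, instances):
    """Relative frequency of each property among the class instances
    (assumed definition: occurrences / number of instances)."""
    total = len(instances)
    if total == 0:
        return {}
    occurrences = property_occurrences(triples, instances)
    return {prop: count / total for prop, count in occurrences.items()}


# Toy data with hypothetical identifiers.
triples = [
    ("ex:a", "ex:performer", "ex:x"),
    ("ex:a", "ex:performer", "ex:y"),  # same instance, same property
    ("ex:b", "ex:performer", "ex:x"),
    ("ex:a", "ex:genre", "ex:g"),
    ("ex:z", "ex:genre", "ex:g"),      # subject is not an instance of the class
]
instances = {"ex:a", "ex:b", "ex:c", "ex:d"}

occ = property_occurrences(triples, instances)
freq = property_frequencies(triples, instances)
```

On the toy data, `ex:performer` is counted for two distinct instances although three triples use it, which is the distinction the quoted sentence leaves implicit.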

The paper is fairly well written (there are no typos), but presentation, readability, and understanding can be improved by strengthening the technical part and making the experiments easier to read.

Review #2
Anonymous submitted on 03/Jul/2023
Minor Revision
Review Comment:


This article presents a methodology to extract empirical ontology design patterns from knowledge graphs. (In particular, the authors focused their study on the case of Wikidata.) According to them, this approach can be especially useful for knowledge graphs whose ontology is "loosely" present and not enforced at all when new triples are added. The solution they describe is able to output both probabilistic OWL ontology design patterns and probabilistic ShEx shapes. A specific emphasis is put on the statistical aspect of their approach, since it is version-dependent: its recommendations may vary from one point in time to another, depending on the updates that have occurred in between. This aspect makes their approach particularly suitable for keeping track of how practice evolves as the data is updated. To showcase the solution, the article includes detailed sections presenting and discussing the results.

Structure and writing

- The article structure is clear and allows the reader to follow the logical path of the presented solution.
- Very good writing quality.


I only have minor comments regarding the article.

- It would be great to have more info about the time it took to run such analytics and the resources used, so as to give an idea of the "heaviness" of the process.

- To me, one of the most interesting parts is Section §6.4. First of all, it would be great to have the years / more details for the dates ("April version from now on") of the dumps the authors consider. Also, since some time has elapsed since April, it would be great for the final version of the article to include at least a third checkpoint, e.g. July 2023, so that the section rests on three points in time, which is a better basis for drawing conclusions and describing tendencies.

- A section with ideas for other datasets would be a nice addition to the article. Even though I admit it's not part of the main scope, as a reader I'd like to know more about the possible uses of the approach for other datasets instead of just a line in the Conclusion: "Moreover, we would like to test the method on knowledge graphs other than Wikidata". (For example: What to do for datasets having strong ontologies? Could this be used to discover pattern-errors? …)

- The link to the supplementary material is very much appreciated. Nevertheless, it could've been more "complete" in the sense that I'd have liked to see a do-it-all script, or an example of how to run it completely. Also, I couldn't find the .png generator in the scripts (maybe I did not search for it properly though). → Anyway, this could be easily added and does not impact the review at all. ;-)

Overall [Minor Revision]

This article presents a very interesting approach for extracting (empirical) ontology design patterns and ShEx shapes from Semantic Web knowledge graphs.

I would be happier if:
- §6.4 was augmented with an additional checkpoint and the findings updated accordingly.
- Some more details were provided on applying this approach to other datasets.

Compared to the original article (reference [5] in the article) presented at the Wikidata workshop (co-located with ISWC), there is a sufficient amount of new contributions.

For these reasons, I believe this effort is a great piece of work which deserves to be part of the Semantic Web Journal.

Thank you! ☺


Thanks, I find this topic definitely of interest. FYI, this (arXiv) paper may be loosely related (in the end, a different topic though): https://arxiv.org/abs/2205.14032 - Ontology Design Facilitating Wikibase Integration - and a Worked Example for Historical Data, by Cogan Shimizu, Andrew Eells, Seila Gonzalez, Lu Zhou, Pascal Hitzler, Alicia Sheill, Catherine Foley, Dean Rehberger