Formalizing and Validating Wikidata's Property Constraints using SHACL and SPARQL

Tracking #: 3533-4747

Nicolas Ferranti
Jairo Francisco de Souza
Shqiponja Ahmetaj
Axel Polleres

Responsible editor: 
Guest Editors Wikidata 2022

Submission type: 
Full Paper
In this paper, we delve into the crucial role of constraints in maintaining data integrity in knowledge graphs, with a specific focus on Wikidata, one of the most extensive collaboratively maintained open data knowledge graphs on the Web. The World Wide Web Consortium (W3C) recommends the Shapes Constraint Language (SHACL) as the standard constraint language for validating knowledge graphs; SHACL comes in two levels of expressivity, SHACL-Core and SHACL-SPARQL. Despite the availability of SHACL, Wikidata currently represents its property constraints through its own RDF data model, which relies on a proprietary reification mechanism based on authoritative namespaces and on partially ambiguous natural language definitions. In the present paper, we investigate whether and how the semantics of Wikidata property constraints can be formalized using SHACL-Core and SHACL-SPARQL, as well as directly as SPARQL queries. While the expressivity of SHACL-Core turns out to be insufficient for expressing all Wikidata property constraint types, we present SPARQL queries to identify violations for all 32 current Wikidata constraint types. We compare the semantics of this unambiguous SPARQL formalization with Wikidata's violation reporting system and discuss limitations in terms of evaluation via Wikidata's public SPARQL query endpoint, due to its current scalability. On the one hand, our study sheds light on the unique characteristics of constraints defined by the Wikidata community, in order to improve the quality and accuracy of data in this collaborative knowledge graph. On the other hand, as a "byproduct", our formalization extends existing benchmarks for both SHACL and SPARQL with a challenging, large-scale real-world use case.

Minor Revision

Solicited Reviews:
Review #1
By Jose Emilio Labra Gayo submitted on 25/Oct/2023
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (4) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

In my opinion, the paper addresses the interesting challenge of representing Wikidata constraints in SHACL-Core or SPARQL. I think the paper meets the following criteria:

1.- Originality. As far as I know, the work presented is original and has only been partially presented by the authors in a Wikidata workshop.
2.- Significance of the results. The results presented are significant for the semantic web community, as Wikidata is nowadays a very important community-driven knowledge graph with some peculiarities that are worth studying and analyzing.
3.- Quality of writing. The authors have done important work to make the paper readable by people who may be interested in both the Wikidata data model and the way it can be validated using SHACL or SPARQL technologies.

The resources are also available on GitHub.

The authors have addressed most of my previous comments and I think the paper can almost be accepted after some minor revisions:

- The authors use the word "proprietary" several times to refer to the Wikidata reification data model. I am not sure it is a good choice: at least to me, "proprietary" in the context of software stands in contrast to open, as in "proprietary versus open source software", and as far as I know, most things in Wikidata follow an open source tradition. Maybe what the authors are trying to indicate is that the reification model is specific to Wikidata? Replacing "proprietary" with "specific", "custom" or "Wikidata's own reification model" might be better. This is mainly a suggestion, as I am not a native English speaker, so maybe the authors are right and "proprietary" is the best word.

- In Section 2.1, "Data modeling and Wikidata", the authors try to present a tutorial on the Wikidata data model, which I appreciate. However, I think the tutorial should be revised a little, because some concepts are mentioned without explanation or out of sequence. For example, the authors start by introducing RDF, and for that they use a graph G based on Figure 1, which would be difficult to justify to a reader because it is quite a complex representation, with all those large wds:... URIs. I would suggest starting by introducing the Wikidata data model and only later briefly describing RDF and the RDF serialization used by Wikidata.

- Something similar happens when the authors explain the preferred and normal ranks on page 8 and use SPARQL and the Wikidata query service, neither of which has been mentioned before. I think that if the authors first explain the Wikidata data model, and then the RDF serialization and SPARQL, the contents would be easier to follow in a more sequential way.

- Following the previous suggestion, the authors could also briefly mention that the RDF serializations employed in Wikidata are used in three ways: in the Wikidata query service, in the RDF dumps (which are mentioned on page 9, line 47), and when retrieving information about an item in Turtle using content negotiation, e.g. the result of running

curl -L -H "Accept:text/turtle"

As far as I know there are some slight differences between those RDF serializations:


The authors mention this in Section 2.4.4 as a surprising finding, but it may be a design choice of Wikidata that could be explained by the custom RDF serialization mechanism employed.

- Page 6, line 30, "URIs represent hashes for "anonymous" reified statement and quantity value nodes..."

- Page 17. "although subproperty of (P1647) is considered the pendant of subclass of (P279) when it comes to the hierarchy of properties" I was surprised by the word "pendant" here...again, maybe it is right, but I would ask the authors to double check if it is the right word.

- Page 39, line 37, "were proposed before SHACL became a standard" is wrong as SHACL is not a standard but a W3C recommendation. It should be: "were proposed before SHACL became a W3C recommendation".

Review #2
Anonymous submitted on 04/Dec/2023
Minor Revision
Review Comment:

I was not involved in the first review round of this paper. Given the state of the review process, I focus in particular on the authors' response to "Ognjen Savkovic" from round one and assess the degree to which that reviewer's comments have been addressed.

In response to that review, the authors have substantially revised and extended their paper, and this effort is appreciated. It appears that not all comments are addressed in the way the reviewer had wished for, but the authors explain their reasons carefully and made at least some improvements for each comment. In my view, the authors' revisions are sufficient to bring the quality of this work close to acceptance. Therefore, subject to minor comments, I recommend acceptance.

Response letter:
- Scope: The reworked introduction indeed helps much to explain the significance of the work, and why Wikidata deserves special treatment.
- Regarding the suggestion of formalizing Wikidata's constraints in a formal language first, the authors make a fair point that this would add quite a distraction to the paper: an intermediate representation would shift the focus away from the practical goal of consistently and automatically verifying Wikidata's constraints in an established constraint language, SHACL, and in SPARQL. Also, the paper is quite long already.
- Abstract form: The authors argue similarly, and in my view, have a point.
- Locality of constraints and resulting simplicity: The authors have added this observation to the paper. Arguably, the fact that the constraints are simple from a computational complexity perspective is not a shortcoming of this work, and as the results show, even these "simple" constraints pose sufficient runtime evaluation challenges.
- Writing quality: The authors improved on this and restructured the paper. I agree, though, that some sentences are still hard to read. This does not necessarily reflect formal grammar errors, but rather phrasing and punctuation that are not entirely smooth. I suggest another careful proofreading.
- The following comments in the review seem less major, and appear all reasonably addressed.

Detailed comments (page/line):
5.34: Wikdata
5.38: use \textit, not \mathit for highlighting terms
- I concur with the other reviewer that the use of the adjective "proprietary" for Wikidata appears not an ideal fit.
22.43: Sentence difficult to parse/odd grammar; what is the meaning of the dash?
22.48: Labels of enumeration are hard to comprehend/remember - maybe back-reference the labels to where they are introduced in 15.14
42.6: I'm not sure whether the fact that only 0.2% of classes have a ShEx schema is informative. Since class sizes are likely exponentially distributed, the important question is how many entities are in a class that has a ShEx schema (which might well be 90% or 99%).
- Perhaps I'm just missing it, but given the focus of the work, I would much appreciate a discussion of the practical feasibility of this approach: how close is this to something that could be deployed in the Wikimedia ecosystem, and what would be the resource requirements and runtime considerations? It seems Section 5 only focuses on evaluating correctness. Although there are mentions of timeouts here and there, and 7.3 mentions that this could be an interesting benchmark (so, it is hard?), I think the Wikidata community would benefit greatly from a subsection or paragraph that tells them clearly whether this is something realistic to build without outrageous effort.