Formalizing and Validating Wikidata's Property Constraints using SHACL+SPARQL

Tracking #: 3378-4592

Authors: 
Nicolas Ferranti
Jairo Francisco de Souza
Axel Polleres

Responsible editor: 
Guest Editors Wikidata 2022

Submission type: 
Full Paper
Abstract: 
In this paper, we delve into the crucial role of constraints in maintaining data integrity in knowledge graphs with a specific focus on Wikidata, one of the largest collaboratively open data knowledge graphs on the web. Despite the availability of a W3C recommendation for validating RDF Knowledge Graphs against constraints via the Shapes Constraint Language (SHACL), however, Wikidata currently represents its property constraints through its own RDF data model, using proprietary authoritative namespaces, and -- partially ambiguous -- natural language definitions. In order to close this gap, we investigate the semantics of Wikidata property constraints, by formalizing them using SHACL and SPARQL. While SHACL Core's expressivity turns out to be insufficient for expressing all Wikidata property constraint types, we present SPARQL queries to identify violations for all current Wikidata constraint types. We compare the semantics of this unambiguous SPARQL formalisation with Wikidata's violation reporting system and discuss limitations in terms of evaluation via Wikidata's SPARQL query endpoint, due to its current scalability. Our study, on the one hand, sheds light on the unique characteristics of constraints in Wikidata that potentially have implications for future efforts to improve the quality and accuracy of data in collaborative knowledge graphs. On the other hand, as a ``byproduct'', our formalisation extends existing benchmarks for both SHACL and SPARQL with a challenging, large scale real-world use case.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Jose Emilio Labra Gayo submitted on 17/Apr/2023
Suggestion:
Major Revision
Review Comment:

The paper presents a partial formalization of Wikidata property constraints using SHACL, shows that not all property constraints can be expressed in SHACL, and provides a further formalization in SPARQL. The paper also presents some experiments and discussions about this approach.

In my opinion, the paper is well written and the approach presented is original and interesting. The authors have already presented part of the contents in the Wikidata workshop (https://ceur-ws.org/Vol-3262/paper1.pdf). I am not sure if they could include a link to that previous work, but in any case, I assume that it is compatible with this paper being presented as a journal paper.

One thing that in my opinion is a bit misleading is the title of the paper, as SHACL + SPARQL could be confused with SHACL-SPARQL, the SPARQL-based constraints that are also available in SHACL (sections 5 and 6 of the SHACL spec: https://www.w3.org/TR/shacl/#sparql-constraints). I would suggest: “Formalizing and Validating Wikidata's Property Constraints using SHACL-Core or SPARQL” as the authors provide two formalizations, one based on SHACL-Core and another one, based on SPARQL.

In fact, one question which I think is not answered in the paper is the possible formalization of the constraints using SHACL-SPARQL. Have the authors tried to do so? What would be the pros and cons of such an approach? In my opinion, there should be some discussion in the paper about that alternative and at least, the authors should review that when they use SHACL in the paper, they are referring to SHACL-Core, not to SHACL or SHACL-SPARQL.
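To make the alternative concrete, the "item requires statement" example of figure 3 (a) could presumably be captured with a SPARQL-based constraint roughly along the following lines (my own sketch; the shape name and message are invented, and the sh:prefixes declarations are omitted for brevity):

:P1469_ItemRequiresStatementSparqlShape a sh:NodeShape ;
  sh:targetSubjectsOf wdt:P1469 ;
  sh:sparql [
    a sh:SPARQLConstraint ;
    sh:message "Subjects of P1469 should have an occupation (P106) from the allowed list." ;
    # every solution of the SELECT query is reported as a violation
    sh:select """
      SELECT $this WHERE {
        FILTER NOT EXISTS {
          $this wdt:P106 ?occ .
          FILTER (?occ IN (wd:Q937857, wd:Q1851558, wd:Q21057452, wd:Q628099))
        }
      }
    """ ;
  ] .

Discussing such a variant would also make it easier to relate the SHACL Core and plain SPARQL formalizations to each other.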

Another topic which could be improved is the comparison with other technologies like ShEx. The authors indicate in the related work that Wikidata adopted ShEx for entity schemas, but their comparison considers ShEx only in the way it is used in the entity schemas extension, without indicating that, since ShEx is an RDF validation language, it could also be used in the same way as the authors use SHACL. In fact, the example given in figure 3 (a) in the paper could be expressed in ShEx as:

:P1469_ItemRequiresStatementShape {
  wdt:P106 [ wd:Q937857 wd:Q1851558 wd:Q21057452 wd:Q628099 ]
}

With the ShapeMap: { FOCUS wdt:P1469 _}@:P1469_ItemRequiresStatementShape

And the example in figure 3 (b) could be expressed as:

:P1469_ItemRequiresStatementShape [ wd:Q370014 ] OR {
  wdt:P106 [ wd:Q937857 wd:Q1851558 wd:Q21057452 wd:Q628099 ]
}

In my opinion, this is a more concise and readable syntax than SHACL Core, which is based on Turtle.
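For comparison, a rough SHACL Core counterpart needs something like a qualified value shape to capture the "at least one allowed value" reading (my own sketch; the actual shape in figure 3 (a) of the paper may differ in its details):

:P1469_ItemRequiresStatementShape a sh:NodeShape ;
  sh:targetSubjectsOf wdt:P1469 ;
  sh:property [
    sh:path wdt:P106 ;
    # at least one occupation value must come from the allowed list
    sh:qualifiedValueShape [ sh:in ( wd:Q937857 wd:Q1851558 wd:Q21057452 wd:Q628099 ) ] ;
    sh:qualifiedMinCount 1 ;
  ] .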

For a more in-depth comparison of ShEx vs SHACL, I recommend the authors chapter 7 of the validating RDF data book: http://book.validatingrdf.com/

Taking into account the main points to consider when evaluating a paper for the SWJ, I would summarize them as follows:

1. Originality. In my opinion the work presented is original and has not been presented before, except by the same authors at a Wikidata workshop, which is probably OK.
2. Significance of the results. The authors propose a declarative way to represent property constraints from Wikidata, which may help their adoption and a better understanding of their semantics. Indeed, the results reveal some discrepancies that already help to understand, for instance, whether one is counting preferred values or normal values. Overall, this work could help improve the quality of the data in Wikidata.
3. Quality of writing. In my opinion the authors have done a good job and, although the paper contains technical content, it is quite readable.
4. Provided data artifacts. The provided data artifacts are complete, and the authors also present a GitHub repository that contains the source code of the converter. One issue is that the URIs employed depend on the stability of Wikidata properties and items, which one may assume; another issue is that the authors employ a URI shortener based on a tool from their institution, for example https://short.wu.ac.at/8tb6, and I am not sure whether such a URI can be considered stable.

Some other suggestions through the paper:

The first citation in the paper about knowledge graphs is for a paper about knowledge graph refinement… I wonder if it would be better to cite a survey or even a book about knowledge graphs like: https://kgbook.org/
Page 1, Line 41, “Statement representing relationships…”
Page 1, Line 46. "In order to process such KGs, the Semantic Web community has defined standards such as…RDF, SPARQL, OWL, SHACL". That statement is wrong given that the current notion of knowledge graphs appeared around 2012 and, at that time, RDF, SPARQL and OWL already existed.
Page 1, line 49, “Standard Protocol and RDF Query Language” is wrong, it should be “SPARQL Protocol and RDF Query Language”
Page 2, line 12: I am not sure if the statement “The large user community is primarily motivated by Wikipedia, as the vast majority of Wikipedia pages incorporate content from WD” is true…maybe it was, but I am not sure if it is still true…in any case, to avoid including wrong or non-falsifiable statements I would probably avoid it.
Page 2, line 44, I am not sure if the “in” in “We study in how far the expressiveness…” is grammatically correct in English.
Page 3, line 1, “SHACL’s core language can not express property constraint, is not … and another can only…”. I think I am missing something as plus is , not ...
Page 3, line 37. Is “which has driven the development of own bespoke constraint representations and checking techniques” OK?
Page 9, the title of section 3.3 should probably be: “Limitations of SHACL core for WD property constraint checking” to avoid the confusion between SHACL (SHACL-SPARQL) and SHACL-Core.
Page 9, In figure 4, the authors use "Type constraint" but in the text (line 36) they mention a specific class (ConstraintType); should they be the same? If that's the case, should the authors say something about the "Value Type Constraint" that also appears in the figure?
Page 10. Line 45-46, in the SPARQL snippet, why don't you use prefixed properties like wdt:P734 and wdt:P1560 instead of the full URIs, which are more verbose?
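For instance, with a PREFIX declaration the patterns become considerably shorter (purely illustrative; the actual query in the paper differs):

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item WHERE {
  ?item wdt:P734 ?v1 ;
        wdt:P1560 ?v2 .
}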
Page 15. Line 19, “takes long to…”
Page 16, line 4, extra space after # of violations)
Page 17, line 14, I think there is something missing around "allows"; maybe use ", which allows…"?
Page 18, line 50, I am not sure if the sentence “where again we moved to the next best properties” is ok.
Page 21, the phrase "using SPARQL and SHACL" should probably be changed to "using SPARQL and SHACL Core".
If the authors are reviewing previous approaches that use SPARQL to validate RDF data portals: the use of SPARQL for validation was already proposed in the Validating RDF workshop, here: https://www.semanticscholar.org/paper/Validating-statistical-index-data-... ; a more recent paper that also uses SPARQL to represent Wikidata constraints is https://peerj.com/articles/cs-1085/
Finally, another approach that the authors could consider relevant is the WShEx approach: (https://ceur-ws.org/Vol-3262/paper3.pdf) which is an attempt to define a schema language similar to ShEx and based on the Wikibase data model.

Review #2
By Ognjen Savkovic submitted on 09/Jun/2023
Suggestion:
Major Revision
Review Comment:

# Summary

In this work, the authors attempt to express the constraints that hold over Wikidata in SHACL and SPARQL. The main challenge is that Wikidata constraints are somewhat implicit. They are required to follow a certain format, but they are described inside the documentation of Wikidata and, at the same time, they evolve over time in the background. This is an obvious issue, since they are not described in any formal standard (according to the authors), and it possibly limits the applications of Wikidata if the data is further processed. A natural question is whether one can model such Wikidata constraints in SHACL, since SHACL is the latest W3C standard for integrity constraints over RDF.

As an attempt to bridge this gap, the authors:
i) identify the different variants of the constraints that exist in Wikidata;
ii) formalize them in SHACL first and then, for those that cannot be formalized in SHACL, in SPARQL;
iii) test their implementation of the constraints against the Wikidata violation reports for selected constraints.

The motivation and the problem discussed are relevant issues for the Semantic Web community. However, I have concerns regarding the overall quality of the work, in particular i) the depth of the technical contributions and ii) the quality of the write-up, which does not yet seem to be at an acceptable stage.
In the following, I will first discuss a general critique of the work (and ways to improve it) and then pinpoint smaller remarks on language and terminology.

A GitHub link is provided, and the SHACL shapes and SPARQL queries are available there.

# General critique

M1. Depth of technical contributions

The technical contribution of the work is not impressive. Obviously, the task the authors had was not the biggest math problem either; however, I think the execution is not satisfactory. The main goal of this work, in my view, was to identify the constraints that Wikidata uses. The authors did that in Section 3, identifying 13 different cases. However, they present them in an ambiguous manner (which is also a problem of writing style) and in textual form, and immediately discuss how they can or cannot be expressed in SHACL.
However, at this point it is not even fully clear what those constraints are and what they look like.

I believe that the authors should have a separate section just describing the WD constraints and then expressing them in some formal language, preferably first-order logic, second-order logic, or Datalog under stable model semantics, depending on what complexity is required (or at least for the cases where SHACL is not sufficient).
Once this is clear, one can discuss what can and cannot be implemented in SHACL, SPARQL, etc. (or in any new language that may appear in the future).
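For instance, the "item requires statement" example for P1469 could presumably be written in first-order logic roughly as follows (my own sketch, treating properties as binary predicates and items as constants):

$\forall x\,\big(\exists y\, \mathrm{P1469}(x,y) \rightarrow \exists z\,(\mathrm{P106}(x,z) \wedge z \in \{\mathrm{Q937857}, \mathrm{Q1851558}, \mathrm{Q21057452}, \mathrm{Q628099}\})\big)$

Having one such definition per constraint type would make it easy to check the subsequent SHACL and SPARQL encodings against it.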

M2. The implementation of constraints proposed in the paper

There is not much discussion of the approach to implementing such constraints, especially for SHACL. It is a bit better for SPARQL, but again it seems a rather simple exercise. From what I can understand, the implementation seems rather straightforward. This part is perhaps hard to improve drastically; however, it is still possible to make some improvements.
For instance, the encoding in SHACL is given with two simple examples that are nice but give only a little insight into the implementation.
For instance, the authors could provide a description of all 13 cases using the abstract form proposed by Corman et al., or any other compact form, and perhaps extend that format with some additional expressions if needed. This would not take much space but would clarify the implementation.
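To illustrate, in a compact abstract syntax along the lines of Corman et al. (assuming value sets, conjunction, negation and qualified number restrictions $\geq_n r.\phi$ are available), the P1469 example could be summarized roughly as the shape

$\phi_{\mathrm{P1469}} \;=\; {\geq_1}\, \mathrm{P106}.\{\mathrm{Q937857}, \mathrm{Q1851558}, \mathrm{Q21057452}, \mathrm{Q628099}\}$

targeted at all subjects of P1469 (again, my own sketch of what such a compact description could look like; the notation would of course have to be extended for the qualifier-based constraint types).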

The SPARQL implementation description provides more insight into the approach.

Overall, all constraints seem to be rather local (there is no need for compositionality between the SHACL shapes or for long paths inside the constraints), which also makes the problem technically simple. Please note that implementing composition of constraints (even nonrecursive ones) would be a more interesting challenge.
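To make the point about compositionality concrete, none of the WD property constraints require one shape to refer to another, i.e., something along the lines of the following (entirely hypothetical) pair of shapes, where conformance of a person node depends on conformance of the team nodes it links to:

:TeamMemberShape a sh:NodeShape ;
  sh:property [
    sh:path wdt:P54 ;            # member of sports team
    sh:node :SportsTeamShape ;   # values must themselves conform to a second shape
  ] .

:SportsTeamShape a sh:NodeShape ;
  sh:property [
    sh:path wdt:P31 ;            # instance of
    sh:hasValue wd:Q476028 ;     # some required class (hypothetical)
  ] .

Validating such composed (let alone recursive) shapes is where a translation to SPARQL becomes non-trivial.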

M3. Quality of writing.

The paper does not read smoothly in parts, especially the technical parts.
Terms are often introduced without being explained beforehand, and even though I consider myself knowledgeable about the topic, it is very hard to interpret them. In general, my advice would be to write shorter sentences, since that would make the presentation clearer.
I will go chronologically.

Sec 2.1
Figure 1 is difficult to understand at first. I had to read the paragraph a few times to get a better idea, and even then it is not fully clear what is being described.
E.g., what are the dashed and what are the solid arrows, what do those arrows represent in general, and is this the authors' interpretation or is it taken from the WD documentation? I had to go back and forth a few times to get some understanding. I would rather start with a data instance (Neymar and Thiago), which is simple to explain, and then introduce all the other concepts little by little: first the Wikidata statements (at the bottom of the picture) and then the rest accordingly.

E.g., it is not clear how one gets the triple
wd:P1469 p:P2302 wds:P1469-667F9488-5C36-4E3B-BEAA-6FD5834885ED
and how one obtains information from it.

I mean the whole text around this is very hard to parse:
"To model our concrete constraint from Fig. 2(c), first the triple connects the property “FIFA player ID” (wd:P1469) to a statement node that is the bridge to the qualifiers; here, the p prefix is used by property “property constraint” (P2302) to describe a relation between an entity (wd) and .... "

In 2.1 it says: "WD defines 32 property constraints types" and then in Sec 3 there are 13 property constraint types, and it's not clear how they relate.

Sec 3

Many keywords are just used without prior explanation.

E.g.:
- p6l34: "availability of an (additional) property P′ on the subjects (or objects) of PID" - it is not clear what "additional" means here; better terminology is needed.
- p6l38: "checks items expected as values of either PID or the property P′ indicated by P2306." What are items here? I can understand it later, once I have seen the example and the usage of sh:hasValue, but that sentence on its own is not clear.
- p6l41: "A qualifier used by constraints to express that multiple statements for PID can exist as long as the values of the separator properties are distinct:" What is a separator? Where is this terminology coming from? I know what a composite key is, so I can imagine what is going on, but it is still not fully clear. I would like to know exactly what kind of keys we are talking about here (e.g., one could use the paper "PG-Keys: Keys for Property Graphs" as a reference to describe them).
- The word "qualifier" is used everywhere in the paper, and it is never properly explained.
  p5l45: "These concrete restrictions are defined through qualifiers specific to the particular constraint type and property; we will discuss these property contraint specific qualifiers in detail in Section 3." - sure, but what are those?
- "represents the set of PID’s subjects" What is the set of subjects?

- "Reason for deprecated rank (P2241): ... property." Too long for a sentence, could be easily three.

And so on. Every first sentence in this item list is unclear or hard to parse.

- "replacement property (P6824) and replacement value (P9729)." It's not clear what this is a condition for the constraints.

3.1

p8l44 "limit the allowed qualifier properties" it's not clear how to do those

"Single-best-value" is what?

p10
It is a good point that the implementation of Corman et al. can be improved. However, I believe one needs to clarify the complexity of the issue further. In that work, the authors introduce an algorithm to translate arbitrary shapes, which can also refer to other shapes and can have arbitrary nesting of conditions.
Obviously, in the example in the paper there is this obvious optimization, but in general, especially with nested shapes, that is not the case.

# Some More Comments on the writeup:

p3
SHACL’s core language can not express two property constraints, one is not reasonably expressible, and another three can only be expressed partially ->
split into two sentences

"unambiguously operationalizing such analyses continuously." not clear

p4
"A number of dedicated, authoritative [18] RDF" authoritative?

"deductive reasoning tasks such as node classification or overall satisfiability checking" rephrase

"Wikidata concepts are specified using the prefix wd, used for entities like “Neymar” (Q142794) 45 but also properties like “FIFA player ID” (P1469), i.e., both entities and properties can be “referred to” as concepts."

then a bit later

"That is, the wd prefix is never used in the predicate position"

p5:
The sentence at line 31 is too long.

p6
"one-by-one analysis " -> we analyze one-by-one

p8
" not reasonably expressible" -> cannot be expressed in a reasonable way

l29. The sentence is too long and repetitive; split it into at least three.
The sentence after it has the same issue.

"non-allowed paths implicitly" not clear

p9 "a tool prototype" -> a demo tool