Review Comment:
# Summary
In this work, the authors attempt to express the constraints that hold over Wikidata in SHACL and SPARQL. The main challenge is that Wikidata constraints are somewhat implicit: they are required to follow a certain format, yet they are written inside Wikidata's documentation and, at the same time, evolve over time in the background. This is an obvious issue, since (according to the authors) they are not described in any formal standard, which possibly limits the applications of Wikidata when the data is further processed. A natural question is whether one can model such Wikidata constraints in SHACL, since SHACL is the latest W3C standard for integrity constraints over RDF.
To bridge this gap, the authors:
i) identify the different variants of constraints that exist in Wikidata,
ii) formalize them first in SHACL and, for those that cannot be formalized in SHACL, in SPARQL, and
iii) test their implementation of the constraints against the Wikidata violation reports for selected constraints.
The motivation and the problem discussed are relevant to the Semantic Web community. However, I have concerns regarding the overall quality of the work, in particular i) the depth of the technical contributions and ii) the quality of the write-up, neither of which seems to be at an acceptable stage yet.
In the following, I will first discuss a general critique of the work (and ways to improve it) and then pinpoint smaller remarks on language and terminology.
A GitHub link is provided, and the SHACL shapes and SPARQL queries are available there.
# General critique
M1. Depth of technical contributions
The technical contribution of the work is not impressive. Admittedly, the task the authors set themselves was not the hardest mathematical problem, but I find the execution unsatisfactory. The main goal of this work, in my view, was to identify the constraints that Wikidata uses. The authors do that in Section 3, identifying 13 different cases. However, they present the cases ambiguously (this is also a problem of writing style) and in textual form only, and they immediately move on to discussing whether the cases can or cannot be expressed in SHACL. At that point, it is not even fully clear what those constraints are and what they look like.
I believe the authors should have a separate section that first just describes the WD constraints and then expresses them in some formal language, preferably first-order logic, second-order logic, or Datalog under stable model semantics, depending on the complexity required (or at least for the cases where SHACL was not sufficient). Once this is clear, one can discuss what can and cannot be implemented in SHACL, SPARQL, etc. (or any new language that may appear in the future).
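To illustrate what I have in mind (my own sketch, not taken from the paper): a single-value constraint on a property $P$ could be written in first-order logic as $\forall x\,\forall y\,\forall y'\,(P(x,y) \wedge P(x,y') \rightarrow y = y')$, and an item-requires-statement constraint as $\forall x\,\forall y\,(P(x,y) \rightarrow \exists z\,P'(x,z))$. Even such simple formalizations would make the 13 cases precise and directly comparable.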
M2. The implementation of constraints proposed in the paper
There is not much discussion of the approach to implementing such constraints in SHACL. It is a bit better for SPARQL, but even there it seems a rather simple exercise; from what I can understand, the implementation is rather straightforward. This part is perhaps hard to improve drastically, but some improvements are still possible.
For instance, the SHACL encoding is given with two simple examples that are nice but give only little insight into the implementation. The authors could instead describe all 13 cases using the abstract form proposed by Corman et al., or any other compact form, and extend that format if needed with additional expressions. This would not take much space but would clarify the implementation.
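If I remember the abstract syntax of Corman et al. correctly (the authors should double-check), constraints assigned to shapes are roughly of the form $\varphi ::= \top \mid s \mid I \mid \varphi_1 \wedge \varphi_2 \mid \neg\varphi \mid {\geq_n} r.\varphi$, where $s$ is a shape name, $I$ an IRI, and $r$ a property path. In this form, a single-value constraint on PID would simply read $\neg{\geq_2} \mathit{PID}.\top$, and all 13 cases could be listed side by side.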
The SPARQL implementation description provides more insight into the approach.
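For instance, a check for a single-value constraint presumably boils down to a query of roughly the following shape (my own illustrative sketch, assuming the standard Wikidata wdt: prefix; not the authors' query):

```sparql
# Illustrative sketch: items with two distinct (truthy) values
# for "FIFA player ID" (P1469) violate a single-value constraint
SELECT DISTINCT ?item WHERE {
  ?item wdt:P1469 ?v1, ?v2 .
  FILTER (?v1 != ?v2)
}
```

It would help to state explicitly which constraint types deviate from such simple patterns and why.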
Overall, all constraints seem to be rather local (there is no need for compositionality between the SHACL shapes, nor for long paths inside the constraints), which also makes the problem technically simple. Please note that implementing composition of constraints (even nonrecursive ones) would be a more interesting challenge.
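By composition I mean shapes that refer to other shapes, e.g. (my own example, in the abstract syntax mentioned above) $s_1 \leftarrow {\geq_1} r.s_2$: validating $s_1$ at a node then requires validating $s_2$ at its $r$-neighbors, which none of the Wikidata constraint types seems to need.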
M3. Quality of writing.
The paper does not read smoothly in parts, especially the technical parts. Terms are often introduced without being explained beforehand, and even though I consider myself knowledgeable about the topic, they are very hard to interpret. In general, my advice would be to write shorter sentences, which would make the presentation clearer.
I will go chronologically.
Sec 2.1
Figure 1 is difficult to understand on a first reading. I had to read the paragraph a few times to get a better idea, and even then it is not fully clear what is being described. E.g., what are the dashed and what are the solid arrows, what do those arrows represent in general, and is this the authors' interpretation or taken from the WD documentation? I had to go back and forth a few times to gain some understanding. I would rather start with a data instance (Neymar and Thiago), which is simple to explain, and then introduce all the other concepts little by little: first the Wikidata statements (at the bottom of the picture) and then the rest accordingly.
E.g., it is not clear how one gets the triple
wd:P1469 p:P2302 wds:P1469-667F9488-5C36-4E3B-BEAA-6FD5834885ED
and how one obtains information from it.
I mean, the whole text around this is very hard to parse:
"To model our concrete constraint from Fig. 2(c), first the triple connects the property “FIFA player ID” (wd:P1469) to a statement node that is the bridge to the qualifiers; here, the p prefix is used by property “property constraint” (P2302) to describe a relation between an entity (wd) and .... "
In Sec 2.1 it says "WD defines 32 property constraints types", but in Sec 3 there are 13 property constraint types, and it is not clear how the two numbers relate.
Sec 3
Many keywords are just used without prior explanation. E.g.:
- p6l34: "availability of an (additional) property P′ on the subjects (or objects) of PID". It is not clear what "additional" means; better terminology is needed here.
- p6l38: "checks items expected as values of either PID or the property P′ indicated by P2306." What are "items" here? I can understand it later, once I have seen the example and the usage of sh:hasValue, but on its own the sentence is not clear.
- p6l41: "A qualifier used by constraints to express that multiple statements for PID can exist as long as the values of the separator properties are distinct:" What is a separator, and where does this terminology come from? I know what a composite key is, so I can imagine what is going on, but it is still not fully clear. I would like to know exactly what kind of keys we are talking about here (e.g., the paper "PG-Keys: Keys for Property Graphs" could be used as a reference to describe them); see also the sketch after this list.
- The word "qualifier" is used everywhere in the paper and is never properly explained.
- p5l45: "These concrete restrictions are defined through qualifiers specific to the particular constraint type and property; we will discuss these property contraint specific qualifiers in detail in Section 3." Sure, but what are those qualifiers?
- "represents the set of PID’s subjects" What are set of subjects?
- "Reason for deprecated rank (P2241): ... property." Too long for a sentence, could be easily three.
And so on: every first sentence in this item list is unclear or hard to parse.
- "replacement property (P6824) and replacement value (P9729)." It's not clear what this is a condition for the constraints.
3.1
- p8l44: "limit the allowed qualifier properties". It is not clear how this is done.
- What is "single-best-value"?
p10
It is a good point that the implementation of Corman et al. can be improved. However, I believe one needs to clarify the complexity of the issue: in that work, the authors introduce an algorithm that translates arbitrary shapes, which can also refer to other shapes and can have arbitrary nesting of conditions. In the example in the paper the optimization is obvious, but in general, especially with nested shapes, that is not the case.
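For instance (my own example), for a nested shape such as $s \leftarrow {\geq_1} r.(s' \wedge \neg{\geq_2} r'.\top)$ the translation also has to handle the inner condition at the $r$-neighbors, and it is much less obvious that the optimization shown in the paper still applies.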
# Some more comments on the write-up
p3
- "SHACL’s core language can not express two property constraints, one is not reasonably expressible, and another three can only be expressed partially" -> split into two sentences.
- "unambiguously operationalizing such analyses continuously." is not clear.
p4
"A number of dedicated, authoritative [18] RDF" authoritative?
"deductive reasoning tasks such as node classification or overall satisfiability checking" rephrase
"Wikidata concepts are specified using the prefix wd, used for entities like “Neymar” (Q142794) 45 but also properties like “FIFA player ID” (P1469), i.e., both entities and properties can be “referred to” as concepts."
then a bit later
"That is, the wd prefix is never used in the predicate position"
p5
- The sentence at line 31 is too long.
p6
"one-by-one analysis " -> we analyze one-by-one
p8
" not reasonably expressible" -> cannot be expressed in a reasonable way
l29. Too long sentence and then repetitive. Split into 3 at least
and the sentence after, the same issue
"non-allowed paths implicitly" not clear
p9 "a tool prototype" -> a demo tool