RDF Graph Validation Using Rule-Based Reasoning

Tracking #: 2145-3358

Authors: 
Ben De Meester
Pieter Heyvaert
Dörthe Arndt
Anastasia Dimou
Ruben Verborgh

Responsible editor: 
Axel Polleres

Submission type: 
Full Paper
Abstract: 
The correct functioning of Semantic Web applications requires that given RDF graphs adhere to an expected shape. This shape depends on the RDF graph and the application’s supported entailments of that graph. During validation, RDF graphs are assessed against sets of constraints, and found violations help refining the RDF graphs. However, existing validation approaches cannot always explain the root causes of violations (inhibiting refinement), and cannot fully match the entailments supported during validation with those supported by the application. These approaches cannot accurately validate RDF graphs, or combine multiple systems, deteriorating the validator’s performance. In this paper, we present an alternative validation approach using rule-based reasoning, capable of fully customizing the used inferencing steps. We compare to existing approaches, and present a formal ground and practical implementation "Validatrr", based on N3Logic and the EYE reasoner. Our approach – supporting an equivalent number of constraint types compared to the state of the art – better explains the root cause of the violations due to the reasoner’s generated logical proof, and returns an accurate number of violations due to the customizable inferencing rule set. Performance evaluation shows that Validatrr is performant for smaller datasets, and scales linearly w.r.t. the RDF graph size. The detailed root cause explanations can guide future validation report description specifications, and the fine-grained level of configuration can be employed to support different constraint languages. This foundation allows further research into, a.o., handling recursion, validating RDF graphs based on their generation description, and providing automatic refinement suggestions.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Jose Emilio Labra Gayo submitted on 03/Jul/2019
Suggestion:
Accept
Review Comment:

After reading the answer letter provided by the authors and the new version of the paper, I think the paper complies with the three criteria of originality, significance of the results, and quality of writing.

Some minor typos:

- Page 2, line 12, I think the sentence "For example, the use case dictates following compound constraint c_compound..." is not well-formed...should it be "For example, the use case dictates following compound constraint..."
- Page 6, lines 4-6. The reference to SHACL [47] appears twice in the same sentence; one appearance is probably enough.
- Page 8, "the used entailment regime" should be "the entailment regime used" ?
- Page 19, line 47. "...into configuration details for a.o. ShEx and SHACL...". Could you replace the "a.o." abbreviation with the longer term?
- Page 23, reference [49] the author is "Dean Allemang", and it appears as "Dean Allemand"

Review #2
Anonymous submitted on 04/Jul/2019
Suggestion:
Major Revision
Review Comment:

# Summary

In this work, the authors argue for using a formal rule-based approach for RDF graph validation. The addressed problem is very relevant, and the paper introduces interesting ideas (such as a unique constraint language that consolidates OWA and CWA). I appreciate the effort the authors put into collecting all the work on the addressed problem and into trying to address it under one framework, which is very important for the Semantic Web community (and often neglected by similar work in our community). I also appreciate the effort of addressing the concerns I raised in my previous review.

However, I still have serious concerns about the correctness and academic rigor of the work. My comments below build on my previous review.

(1) Taken from my previous review:

[ The biggest concern is that the authors did not clearly state what is the rule language they are using (and this should be the main contribution of the paper!). There is no formal specification (grammar, syntax) nor semantics. ]

This is again my main issue. It is hard to validate claims like "(b) an accurate number of violations is returned by using a custom set of inferencing rules up to at least OWL-RL complexity and expressiveness;" and "(c) the number of supported constraint types is equivalent to existing validation approaches;". I see the effort of the authors, who provided several examples of the comparisons, but examples do not prove the claim!

I have now tried to investigate the underlying logic the authors adopted, N3Logic, and I think the main problem is that the authors rely on N3Logic. Is there any work at a relevant conference that is based on N3Logic, or any wider adoption of it? It seems that it was proposed almost a decade ago in [7] and [8], but in a rather informal way, and then abandoned. Even the authors of N3Logic state that they were not clear about the expressiveness of the language.

In [8],
"A formal categorization of N3Logic is complicated as it differs from most traditional logics in expressivity. ... However, unlike DL, N3Logic is not decidable, limiting expressivity in other ways motivated by the Web considerations discussed in this paper. As such, developing a formal model theory for N3Logic is quite challenging, and is the focus of current work."

It then seems that the language was not adopted by the community for further investigation (at least the authors do not provide further insights about that).

Along these lines is the comparison of the expressivity of the constraints in the authors' proposal and in languages such as SHACL and ShEx, which is based on the PhD work of Hartmann. I had a look at that work, and it also leans towards an informal way of defining things; with all respect, this work has not even been published at a peer-reviewed conference (or at least no such reference is given), which makes it hard to verify the claims there as well.

In general, many citations in this work are to non-peer-reviewed articles, which makes it hard to check the correctness of the claims and to understand their contributions w.r.t. the rest of the community.

Taken from my previous review:
[Almost half of the paper is consumed on criticizing existing approaches but then there is no clear proposal at the end.]

For my taste, if you claim that the main contribution is a new approach to constraint validation, then one should introduce the constraint language at the beginning or as soon as possible (not at page 12/24). As it stands, you raise large expectations but materialize them poorly.

(2) Taken from my previous review:

[In particular, my impression is that the authors did not carefully consider the work in [21] and [22] that discusses in detail combining CWA and OWL(OWA), nor more recent work on SHACL and ShEX.]

The authors provided more references on the above, but it does not seem that they put effort into understanding these works and comparing them with their usage of N3Logic. Again, this is partially the fault of selecting N3Logic, as commented above.

(3) Taken from my previous review:
[Secondly, only the intro (sec 1) is of a reasonable quality. The style of writing and academic rigor significantly gets worse afterwards.]

The quality of the presentation has improved but is still not at a high level. Often new terminology is used without being introduced previously, or is constructed in a way that makes it hard to parse (even after several iterations).

E.g.,

- Words like
"resource r_firstname"
"compound constraint"
I find these hard to understand because they do not fit standard logic terminology (or Semantic Web terminology) in the context in which they are used. E.g., compound constraint: is this used elsewhere in the literature? I would just say constraint. Resource r_firstname: is this a constraint formula? How can it be a resource?

- "Problem 2 (P2): the number of found violations depends on the supported entailments." This is a known problem already addressed in [54] and [71] (and many works afterwards)

- "To solve aforementioned observed validation problems, we pose following hypotheses" -- why do you call these hypotheses (hypothesis = a supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation)? I think they are more like your contributions; or rather, just drop part 1.2.

- "declarative logic" -- what is declarative logic? Probably you meant just mathematical logic.

- "In this work, we propose an alternative validation approach using rule-based reasoning. " -- as far as I am aware, almost all approaches to validation are rule-based in some sense (especially in relational and graph databases)

- "Problem 1 (P1):" I got the idea, but the writing style needs to be more precise.

- "These requirements are not common for Semantic Web logics", what are Semantic Web logics?

- "Semantic Web rule-based reasoning" I am not sure if this is a known term. I find the whole paragraph confusing.

- Section 5 - should be the main section in my view, and it is called "Application", why? Then 5.2 is called "Technologies".

- RDF-CV: RDF-CV is also important for the overall argument, but it is also not introduced. It is then not clear what Listing 2 (or even 3) is exactly specifying (other than a general intuition); that is, what is the semantics of rdfcv:leftProperties, rdfcv:contextClass, etc.?

Review #3
By Simon Steyskal submitted on 07/Jul/2019
Suggestion:
Accept
Review Comment:

The paper has been significantly revised and together with the authors' response accompanying the resubmission most if not all of my raised remarks/questions were addressed. Thank you very much!

There are only 2 (minor) issues that I would like to see addressed:

1) RDFS Entailment (rdfs:subClassOf) - you say: "SHACL specifies a fixed set of inferencing steps during validation, namely, rdfs:subClassOf entailment. Thus, one cannot validate, e.g., whether an RDF graph explicitly contains all triples that link resources to all their classes given a set of rdfs:subClassOf axioms, as rdfs:subClassOf triples are always inferred by a conform SHACL validator."

=> I would be interested in seeing an example of such a constraint, but fwiw the SHACL Specification explicitly states in 3.2 Data Graph:
"SHACL makes no assumptions about whether a graph contains triples that are entailed from the graph under any RDF entailment regime.
The data graph is expected to include all the ontology axioms related to the data and especially all the rdfs:subClassOf triples in order for SHACL to correctly identify class targets and validate Core SHACL constraints. "

As such, unless one explicitly specifies "sh:entailment ." in the shapes graph, there shouldn't be any RDFS inference happening. But agreed, that passage is a bit ambiguous.
Also keep in mind that potential SPARQL definitions as the one given in https://www.w3.org/TR/shacl/#targetClass are informative only!

E.g.:

:Teacher a sh:NodeShape , rdfs:Class ;
    sh:property [
        sh:path :teaches ;
        sh:class :Course ;
        sh:minCount 1
    ] ;
    sh:property [
        sh:path rdf:type ;
        sh:hasValue :Person
    ] .

=============================================
:bob a :Teacher ;
    :teaches :logic .

:carol a :Teacher ;
    :teaches :algebra .

:algebra a :Course .

:alice a :Person , :Teacher ;
    :teaches :algebra .

:teaches rdfs:domain :Teacher ;
    rdfs:range :Course .

:Teacher rdfs:subClassOf :Person .
=============================================
should produce (e.g., using the TopBraid SHACL API):

[ a sh:ValidationReport ;
    sh:conforms false ;
    sh:result [ a sh:ValidationResult ;
        sh:focusNode :bob ;
        sh:resultMessage "Value does not have class :Course" ;
        sh:resultPath :teaches ;
        sh:resultSeverity sh:Violation ;
        sh:sourceConstraintComponent sh:ClassConstraintComponent ;
        sh:sourceShape [] ;
        sh:value :logic
    ] ;
    sh:result [ a sh:ValidationResult ;
        sh:focusNode :bob ;
        sh:resultMessage "Missing expected value :Person" ;
        sh:resultPath rdf:type ;
        sh:resultSeverity sh:Violation ;
        sh:sourceConstraintComponent sh:HasValueConstraintComponent ;
        sh:sourceShape _:b0
    ] ;
    sh:result [ a sh:ValidationResult ;
        sh:focusNode :carol ;
        sh:resultMessage "Missing expected value :Person" ;
        sh:resultPath rdf:type ;
        sh:resultSeverity sh:Violation ;
        sh:sourceConstraintComponent sh:HasValueConstraintComponent ;
        sh:sourceShape _:b0
    ]
] .

while the same shapes graph with <> sh:entailment . added, produces no validation results.
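
The effect described above can be reproduced with a small toy model. The sketch below is not a SHACL implementation and does not use any SHACL library: the triple encoding, the function names `rdfs_closure` and `validate`, and the short violation labels are all assumptions made for illustration. It hard-codes the three constraints of the :Teacher shape and applies only three RDFS rules (rdfs2 for domain, rdfs3 for range, rdfs9 for subclass typing), which is enough to show how the number of violations depends on whether entailment is applied before validation.

```python
# Toy model of the example above; NOT a SHACL implementation.
# All names and the triple encoding are illustrative assumptions.
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"

# The data graph from the example, as (subject, predicate, object) tuples.
DATA = {
    (":bob", RDF_TYPE, ":Teacher"),
    (":bob", ":teaches", ":logic"),
    (":carol", RDF_TYPE, ":Teacher"),
    (":carol", ":teaches", ":algebra"),
    (":algebra", RDF_TYPE, ":Course"),
    (":alice", RDF_TYPE, ":Person"),
    (":alice", RDF_TYPE, ":Teacher"),
    (":alice", ":teaches", ":algebra"),
    (":teaches", DOMAIN, ":Teacher"),
    (":teaches", RANGE, ":Course"),
    (":Teacher", SUBCLASS, ":Person"),
}


def rdfs_closure(triples):
    """Apply rdfs2 (domain), rdfs3 (range) and rdfs9 (subclass typing) to a fixed point."""
    graph = set(triples)
    while True:
        new = set()
        for s, p, o in graph:
            for s2, p2, o2 in graph:
                if p2 == DOMAIN and s2 == p:
                    new.add((s, RDF_TYPE, o2))  # rdfs2
                if p2 == RANGE and s2 == p:
                    new.add((o, RDF_TYPE, o2))  # rdfs3
                if p == RDF_TYPE and p2 == SUBCLASS and s2 == o:
                    new.add((s, RDF_TYPE, o2))  # rdfs9
        if new <= graph:
            return graph
        graph |= new


def validate(triples):
    """Check the three :Teacher constraints; return a list of (focus node, label) pairs."""
    graph = set(triples)

    def types(node):
        return {o for s, p, o in graph if s == node and p == RDF_TYPE}

    violations = []
    # Focus nodes: every node explicitly typed as :Teacher in the given graph.
    for node in sorted(s for s, p, o in graph if p == RDF_TYPE and o == ":Teacher"):
        courses = sorted(o for s, p, o in graph if s == node and p == ":teaches")
        if len(courses) < 1:
            violations.append((node, "minCount"))
        for course in courses:
            if ":Course" not in types(course):
                violations.append((node, "class"))
        if ":Person" not in types(node):
            violations.append((node, "hasValue"))
    return violations


print(validate(DATA))                # validation on the asserted triples only
print(validate(rdfs_closure(DATA)))  # validation after the three RDFS rules
```

On the asserted triples alone, the toy validator reports the same three violations as the report above (two for :bob, one for :carol); after applying the three RDFS rules it reports none, because :logic is inferred to be a :Course and :bob and :carol are inferred to be :Person.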

2) random uppercase letter: SÖren [22] and JüRgen [41]

best regards, simon