RDF Graph Validation Using Rule-Based Reasoning

Tracking #: 1998-3211

Ben De Meester
Pieter Heyvaert
Dörthe Arndt
Anastasia Dimou
Ruben Verborgh

Responsible editor: 
Axel Polleres

Submission type: 
Full Paper
Semantic Web applications cannot function when the given data – i.e., an RDF graph – is not interpreted as expected. RDF graphs can be validated by defining and assessing constraints. These constraints define an RDF graph that can be correctly interpreted for a specific Semantic Web application or use case. Which entailment regime is used – e.g., whether rdfs:subClassOf inferencing is taken into account or not – is an integral part of how the RDF graph should be interpreted, and thus of the proper functioning of the application. Different types of validation approaches have been proposed to assess these constraints, namely hardcoded systems, ontology reasoners, and querying endpoints. However, these approaches do not allow full customization of the supported inferencing to match the entailment regimes intended by the use case. They are thus unable to validate RDF graphs properly, or need to combine systems, degrading validation performance. In this paper, we present an alternative validation approach using rule-based reasoning, capable of fully customizing the inferencing rules during validation. We compare existing approaches with a rule-based reasoning approach, and present both a formal grounding and a practical implementation based on N3Logic and the EYE reasoner. Our approach (a) better explains the root cause of violations thanks to the formal logical proof produced by the reasoner, (b) returns an accurate number of violations thanks to explicit inferencing rules, and (c) supports more constraint types by including inferencing up to at least OWL-RL complexity and expressiveness. Moreover, our performance evaluation shows that our implementation is faster than combining existing approaches. By allowing the inferencing rules to be defined precisely together with the constraints, we provide a more complete validation approach.
We ensure validated RDF graphs can be interpreted as intended with additional inferencing, allowing more precise Semantic Web applications, and opening opportunities for automatic RDF graph refinement and validating implicit graphs based on their generation rules.
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 28/Nov/2018
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

# Summary

In this work, the authors argue for using a formal rule-based approach for RDF graph validation. While the problem they are addressing is very important, I have several serious concerns about the quality of the work.

(2) The biggest concern is that the authors did not clearly state which rule language they are using (and this should be the main contribution of the paper!). There is no formal specification (grammar, syntax), nor semantics. Almost half of the paper is spent criticizing existing approaches, but then there is no clear proposal at the end.

(1) I think this paper requires a more careful way of addressing the problems it is solving. It states the problems too broadly, while broader solutions already exist in the literature. In particular, my impression is that the authors did not carefully consider the work in [21] and [22], which discusses in detail combining CWA and OWL (OWA), nor more recent work on SHACL and ShEx. Perhaps just focusing on problems (1) and (4) in more depth would make this paper more credible.

(3) Secondly, only the introduction (Section 1) is of a reasonable quality. The style of writing and academic rigor get significantly worse afterwards.

Overall, I think that this work requires significant revision.

Below I list some of the concerns in more detail.

# Comments about motivating problems

* Overall, I find problems (1) and (4) reasonable, though (1) has already been partially addressed in the literature. Problems (2) and (3), I think, have already been addressed in the literature in more depth than indicated in the paper; I am surprised the authors are not aware of this, despite citing [21] and [22].

* “Problem 1: it is hard to find the root causes of the constraint violations.” I think this problem is important, but it is not fully true that it has not been studied before. Especially for DL and OWL there are numerous papers on this issue. More recently, SHACL, as the W3C standard for RDF constraints, supports this.

* “Problem 2 (P2): the number of found violations is biased with respect to the used entailment.” If I am not mistaken, the idea of entailment regimes comes from SPARQL. But what you study is deeper than that. In your example, (2) is interpreted under OWA while (3) is checked under CWA — this goes beyond entailment regimes and tackles the problem of combining OWA and CWA reasoning. There have already been studies on this, e.g., [21] and [22].

* “Problem 3 (P3): not all constraint types are supported.” Which constraints in particular? Could you please be more precise? If we take OWL, you can express almost all imaginable constraints, but the question is whether that would be readable or useful. Again, today we have SHACL, which is also very expressive. See, e.g., “Semantics and Validation of Recursive SHACL”, Corman et al., ISWC 2018.

# Comments on related work in sec 2

* The work in [21] and [22] should be better understood. In essence, these papers address the issue of combining OWA and CWA, and formally they provide a semantics that can cover all listed approaches. My impression is that the authors did not analyze these works thoroughly.

* The comments on SHACL are also imprecise. SHACL semantics is formally defined in terms of SPARQL queries, but implementations do not need to follow that route (nor does it require endpoints). E.g., see the paper listed above.

* ShEx is also wrongly interpreted. What is a “domain specific language to declare validation rules”? It is a formal language with a semantics, like OWL. How it is implemented is, again, a choice.

* What is a “querying endpoint”? I know what a SPARQL endpoint is, but not the former. I tried to Google it, but the term doesn't seem to exist…

# Other comments

* Sentence: “These constraints and their intended meaning depend on the use case and are stated explicitly.” I am confused by the logic here. I would think that, depending on the intended meaning, an RDF graph can be seen as more fit or less fit. And the intended usage may require different constraints — but the way we interpret constraints should be fixed.

* Examples are provided to motivate the problem, but ideally a concrete use case that demonstrates the listed problems would make the paper more convincing.

* The quality of presentation deteriorates in Section 2.1. E.g., the sentence “A reasoner is a piece of software that performs reasoning: inferring logical consequences (an inference) from a set of asserted facts” sounds unnecessary.

* Strange terminology in the sentence “Asserted facts (axioms) are commonly annotated using an ontology language.” I would consider “asserted facts” to be a name for the ABox (data), while axioms are rather the TBox (logical formulas).

* Moreover, “A fixed set of inferencing rules that specify a specific (description) logic is called an entailment regime [14]”. An entailment regime in SPARQL is not the same as reasoning in a (description) logic. This is poor scholarship!

* The arguments in Section 2.4 are not supported by any example. Could you give a concrete example of a constraint that cannot be expressed in SHACL?

Review #2
By Jose Emilio Labra Gayo submitted on 26/Dec/2018
Major Revision
Review Comment:

The paper proposes a rule-based approach to validating RDF graphs. As far as I know, this is an original approach, which combines the use of rules for inferencing as well as for validation.
The results are interesting, as this approach can improve on the performance of other techniques that separate inferencing from validation; it can also increase expressiveness by representing some constraints that other approaches cannot, and it can offer better explanations of the violations.
The paper is well written and the approach is sound. Nevertheless, I think the authors are exceedingly optimistic in their assessment of the benefits of their approach, and I suggest they rewrite several parts of the paper, pointing not only to the pros but also to the cons of their approach. Given that this is a research paper, it must try to offer a more objective comparison with alternative approaches and avoid a style that sometimes reads like a marketing paper.
As an example, the phrase that starts section 3. Comparative analysis says: "In this section, we show the shortcomings of existing approaches, and …". I think the authors need to show the existing approaches in an objective way, and not just their shortcomings.
In the same way, the conclusions show only the benefits of the rule-based approach, ignoring the trade-offs that this approach entails. For example, with this approach, which combines inference and validation, there is no separation of concerns between those two tasks, which in some contexts is preferable. In some contexts, inference is handled by ontology engineers with a focus on domain entities like people, while validation may be handled by data engineers who are more focused on integrity constraints and data representation. Having different technologies for both can be important in several contexts where domain ontologies can be reused. In fact, in ShEx, it is possible to validate the RDF graph before inference with some shapes, and the RDF graph after inference with other shapes. This technique can be used to debug the inference process, and it seems that the rule-based approach could not be applied to this use case.
There is no treatment in the paper of recursion and negation, a topic that was one of the main differences between ShEx and SHACL. Although the paper mentions the use of Scoped Negation as Failure, it is not clear whether this approach could be extended to handle recursion and negation as in ShEx (see [1]) or as in a recent proposal for SHACL (see [2]).
The State of the Art and the comparison with other technologies needs to be updated to take into account recent work proposed for SHACL as SHACL-rules [3] and to take into account which of the constraint types mentioned in Hartmann's paper can in fact be expressed in SHACL-SPARQL.
In the same way, Hartmann's paper didn't take into account that ShEx can also handle advanced constraints using Semantic Actions. In fact, a lot of those constraint types could be expressed using ShEx with semantic actions.
The related-work section differentiates between hard-coded systems and grammar-based approaches like Description Set Profiles or ShEx. However, it says that ShEx does not rely on SPARQL and concludes that it is a hard-coded system, which I think is misleading. ShEx is based on a well-founded semantics (see [4]), which is a different approach from a hard-coded system. This mistake is replicated in Section 3 (comparative analysis), where ShEx seems to have been placed in the “hard-coded system” column, which I think is wrong; in the case of ShEx, I would change the “Explanation” row to yes, because shape maps in ShEx can explain which nodes conform or don't conform to some shape.
In the following, I enumerate some minor comments:

Page 2- Example line (1):
:birthdate "01-01-1970"^^xsd:date
should be:
:birthdate "1970-01-01"^^xsd:date
Page 2. Problem P1. “however, current approaches only report which resources violate which constraints, not why the violation occurs”. It is not clear to me which current approaches the authors are referring to. Do they also include ShEx? If that is the case: if a system tracks which triples come from the original RDF graph and which triples have been inferred, isn't it possible to show which triples raise the error?
Page 2, 2nd column. “and find implicit violations”. This is the first appearance of the “implicit violation” concept, which I think refers to violations caused by inconsistencies. I would ask the authors to at least define that concept… although I think it would also require some more justification of why an inconsistency is a violation… in ShEx/SHACL, an RDF node could conform to some shape while the RDF graph is inconsistent.
Page 3. Problem P4. "it is not clear whether a piece of RDF data came from the original dataset or was inferred (P1)". There are systems that can differentiate between triples that are from the original RDF graph and triples that have been inferred.
Page 6. End of first column: The phrase “ShEx does not rely on an underlying technology such as SPARQL to perform validation, a hard-coded system is used instead” is wrong… the fact that ShEx does not rely on an underlying technology does not imply that it must be implemented as a “hard-coded” system. ShEx is defined as a domain-specific language with a denotational semantics, which could have different implementation strategies… subsets of ShEx, or even ShEx itself, could be implemented with other strategies, like a rule-based engine.
Section 2.3 talks about validation reports… in this context, it would be worth mentioning that ShEx defines result shape maps as the result of the validation process.
Page 7, first column. I don't agree with the sentence “Supporting inferencing rules is thus an important requirement for validation approaches”: although I understand the motivation for supporting inferencing rules, I don't think it should be a validation requirement. This is in fact a controversial statement that is based on a single study (Hartmann's PhD thesis). Further research could be done on what the best validation requirements are, because, from a different perspective, adding inferencing rules to a validation system can be seen as extending the expressiveness of the validation language too much, in an uncontrolled way, which may not be desirable… if the task is to describe and validate the structure of RDF graphs, some people could consider it better to have a well-defined language with a clear semantics rather than a more expressive language whose rules are difficult to define and debug.
Page 7. Section 3 (first sentence). The sentence “we show the shortcomings of existing approaches” sounds too strong to me. I would prefer the authors to include a more objective comparative analysis, rather than focusing on criticizing the other approaches without also discussing the shortcomings of their own.
Page 7, "…via translation of the SPARQL queries using property paths [21][23]" Why do the authors include reference [23] here?
Page 7, 2nd column. "…without inspection the code…"
Page 7, "…Customization of hard-coded systems is limited without requiring a development effort [50]"…why do you include the reference [50] here?
Page 8, table 2. I am not sure in which of the columns ShEx or SHACL could be included… maybe add a specific column for each of them?
Page 8, Figure 1. In both diagrams a box titled "Background knowledge" is presented… but it is not described in the paper… and in fact, I have doubts whether it is really necessary. ShEx and SHACL don't have a different input for "background knowledge".
Page 8. "…has the following disadvantages: (1) multiple systems need to be combined and maintained, e.g. a reasoner and a querying endpoint". I understand the need of a reasoned, but why is a querying endpoint necessary?
"…(ii) different languages need to be learned and combined for the inferencing rules and constraints (e.g. OWL and SPARQL". Why is SPARQL necessary? And also…why is it a disadvantage to have 2 different languages for 2 different tasks? I consider it to be a good practice because it promotes a better separation of concerns…something like HTML and CSS which are different languages because they tackle different concerns.

Page 8, end of 2nd column. “Moreover, a rule-based reasoner natively supports custom inferencing rules, and thus, custom entailment regimes”. Is this really an advantage? It can also be seen as a challenge if the rule-based reasoner infers triples with a custom semantics that could differ from OWL…
In this respect, maybe the authors should also mention the problem that SHACL offers an entailment that is like a subset of RDFS but different from RDFS, i.e. it supports rdfs:subClassOf but not, for example, rdfs:domain or rdfs:range…
Section 4.1. The authors talk about SNAF… should they also talk about answer set programming?
Page 11, figure 2. It represents the components view of a rule-based approach. The title says "rule-based reasoner"; should it be "rule-based validator"?
That figure again contains an input titled "Background knowledge"; is it necessary?
The components of the figure, as represented, are very similar to the left part of Figure 1 (pre-processing approach), replacing "validator" with "constraint translation"… I wonder if both approaches are really as different as the authors claim… I understand that one possibility is that the "entailment regime" and "constraint translation" phases can be run in parallel… but that possibility could also be tried in the other approaches, where the validator could run at the same time as the reasoner, which would infer triples about the neighbourhood of a node on demand.
Page 11, "N3Logic supports at least OWL-RL inferencing…" so the system does not support OWL DL…maybe the authors should also mention this as another shortcoming of their approach…the fact that their approach is only viable when the reasoner can itself be implemented by rules.
Page 16, "regimes included"
Section 6.3. I think the sentence "Validatrr can support more constraint types than existing approaches RDFUnit, SHACL and ShEx" is wrong…in the case of ShEx, using semantic actions most of those constraint types could also be represented.
Page 17, "Without inferencing, our implementation is already faster for small RDF graphs. We perform about an order of magnitude faster until 10,000 triples, namely 1-2s per RDF graph compared to 30s per RDF graph…". And later the authors talk about set-up time required by RDFUnit…is it possible that if the comparison removed that setup-time, then both implementations would be equally fast?
Could the authors give some explanation about why after 10,000 triples, the times of their implementation increase considerably?
Page 18. "To make the results comparable, we used the EYE reasoner with the same RDFS rules to execute the reasoning preprocessing step…" What part of the time in RDFUnit is consumed by the reasoner compared to the validator? Could the authors use a different reasoner in RDFUnit?
Page 18. I could not understand the sentence "execution time drops from 120s to 80s for Validatrr whereas it rises from 25s to 185s for RDFUnit". Looking at the figure, I didn't see a point where the execution time drops…
Page 18. The four paragraphs that start with "RDF graph size" seem to be a justification that most RDF graphs are not very big… although I understand the argument, and it may be true that the size of current RDF graphs in LODLaundromat is not very big, I would not take that as a justification for not having highly performant validators. On the one hand, it may be that current RDF graphs are not big because current technologies don't support very big RDF graphs well… on the other hand, with better tooling, RDF adoption could improve and bigger RDF graphs would appear… and finally, I think the size of RDF graphs will keep increasing as long as there are better tools and computational resources to manage them… so I think the whole argument that current RDF graphs have fewer than 100,000 triples is not significant, and I would suggest removing those four paragraphs, as they don't contribute to the validation approach at all.
Page 19. Conclusions. I think the authors should try to offer not only the benefits of their approach, but also to point to some drawbacks.
[1] Boneva, I., Labra-Gayo, J.E., Prud'hommeaux, E.: Semantics and Validation of Shapes Schemas for RDF. 16th International Semantic Web Conference, ISWC 2017.
[2] Corman, J., Reutter, J.L., Savkovic, O.: Semantics and Validation of Recursive SHACL. 17th International Semantic Web Conference, ISWC 2018.
[3] https://lists.w3.org/Archives/Public/public-shacl/2018Sep/0003.html
[4] http://shex.io/shex-semantics/

Review #3
By Simon Steyskal submitted on 12/Jan/2019
Major Revision
Review Comment:

In the present article, the authors discuss how using a rule-based reasoning approach for RDF validation can provide better performance with respect to speed, coverage, and explainability compared to other state-of-the-art validation approaches.

While the article is in general very easy to read, there are some points that need to be addressed, namely:

1) Revising SHACL Coverage - I feel SHACL wasn't accurately represented, especially as [45] was cited as source for SHACL's coverage.

SHACL became a W3C Recommendation in 2017, while [45] was written in 2016. Furthermore, [45] even states multiple times that it expects SHACL to support more constraint types once it becomes an official Recommendation (cf. pp. 128-129, 177). I had a brief glance at the constraint types that SHACL was supposedly not supporting according to [45], and almost all of them are now supported, e.g., 13 - sh:class/sh:datatype, 27 - sh:languageIn, etc.
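To illustrate, the constraint components just mentioned can be expressed directly in SHACL Core today; the following shape is a minimal sketch (the ex: vocabulary and property names are invented for illustration, not taken from the reviewed paper):

```turtle
# Hypothetical shape combining the SHACL Core constraint components
# named above (sh:datatype, sh:class, sh:languageIn).
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.com/> .

ex:PersonShape
    a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [
        sh:path ex:birthdate ;
        sh:datatype xsd:date     # values must be xsd:date literals
    ] ;
    sh:property [
        sh:path ex:knows ;
        sh:class ex:Person       # values must be instances of ex:Person
    ] ;
    sh:property [
        sh:path ex:label ;
        sh:languageIn ( "en" "de" )  # only English or German labels
    ] .
```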
Apart from that, what about:
.) sh:entailment (https://www.w3.org/TR/shacl/#shacl-rdfs) ->
"However, SHACL processors may operate on RDF graphs that include entailments [sparql11-entailment] - either pre-computed before being submitted to a SHACL processor or performed on the fly as part of SHACL processing (without modifying either data graph or shapes graph). To support processing of entailments, SHACL includes the property sh:entailment to indicate what inferencing is required by a given shapes graph."

.) SPARQL-based constraints which allow for using arbitrary SPARQL queries

.) SHACL Rules (not part of the Rec. but published along side it as a Note https://www.w3.org/TR/shacl-af/#rules)

=> Please revise and update accordingly!
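To make the sh:entailment point concrete, a shapes graph can declare the inferencing it requires; the sketch below (shapes graph IRI, shape, and ex: vocabulary are invented for illustration) requests RDFS entailment, under which instances of subclasses of ex:Employee would also be validation targets:

```turtle
# Hypothetical shapes graph declaring required RDFS entailment,
# per https://www.w3.org/TR/shacl/#shacl-rdfs.
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.com/> .

<http://example.com/shapesGraph>
    sh:entailment <http://www.w3.org/ns/entailment/RDFS> .

ex:EmployeeShape
    a sh:NodeShape ;
    sh:targetClass ex:Employee ;  # under RDFS entailment, this also
                                  # targets instances of subclasses
    sh:property [
        sh:path ex:worksFor ;
        sh:minCount 1
    ] .
```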

2) Comparing a language against an implementation - in 6.1 you compare SHACL + its test cases against your implementation (Validatrr), claiming e.g. that SHACL's validation report isn't as rich as the one provided by Validatrr. That's true; however, you are comparing a constraint language against your custom implementation. That's like saying OWL 2 is worse than application xyz because xyz not only does OWL reasoning but also supports querying. Apples and oranges!

3) RDF graph size - you claim that LODLaundromat and LODStats both show a median of less than 100,000 triples per RDF graph. However, in LODLaundromat some datasets are split into several files (e.g. DBpedia), hence the sizes of the datasets in LODLaundromat are not really accurate. How would your approach perform on graphs with more than 1M triples?

0) General

.) excessive citing -> you don't have to cite the same publication 5 times in 5 consecutive sentences (e.g. p.5 last paragraph, [46] on p.7, etc.)

1) Introduction

.) "RDF graphs which are interpreted as expected are more “fit for use” and thus of higher quality [2]."
-> if you want to cite [2], maybe consider adding a few words explaining why the concepts of [2] apply to RDF graphs too. I was expecting [2] to be about RDF & quality, rather than a 30-year-old quality control handbook;
-> "higher quality" with respect to what? the intended use case? if one expects an RDF graph of bad quality to be interpreted as an RDF graph of bad quality, does it become an RDF graph of higher bad quality?
.) "denote the fictional schema http:example.com/" -> http://example.com/
.) "however, current approaches only report which resources violate which constraints, not why the violation occurs" -> do you know for sure that there exists absolutely no approach out there that does that? besides that, doesn't this heavily depend on how granular constraints are defined?
.) "The used entailment regime [1]" -> later you cite [14] instead of [1]; or consider citing https://www.w3.org/TR/rdf11-mt/ instead of [1] here
.) "let alone complex cases involving inferencing which are not (well) supported" -> such as?
.) "Hartmann et. al has shown that thirty-five out of the eighty-one constraint types (43.2%) are constraint types that benefit from including inferencing." -> missing reference
.) "They are not able to customize the used inferencing rules for validation" -> "customize" as in "choosing an entailment regime" or as in "using specific inferencing rules only" ?
.) "~P2" -> what does ~ indicate?
.) "and (iii) the combination slows down the overall system (P4)." -> that's a recursive reference; P4 doesn't actually motivate why you think introducing inferencing as preprocessing step would "deteriorate" performance
.) "and supports more constraint types" -> still doesn't
.) "Moreover, rule-based reasoners only need a [...] . Thus, this approach is faster than including an inferencing preprocessing step" -> I can't follow your reasoning here.. what if inferencing is done in the same language the constraints are written in, and validation is performed with the same system? also, why does the number of languages/systems used automatically lead to worse performance?
.) "existing validating approaches" -> "existing validation approaches"
.) " and compare to " -> "and compare them to"
.) missing ref for EYE
.) "and evaluate performance," -> performance of what?
.) "position rule-based reasoning as an alternative and compare in Section 3" -> alternative to/compare it with what?

2) State of the Art
.) "Whether those triples are then mentioned explicitly or not does not contribute to a violation when only described in an ontology." -> depends on the used approach/implementation though..
.) " The inferencing rules defining the reasoning are then specified, as this ontology language follows a certain logic" -> what?
.) "OWL-RL and OWL-QL prevail [13]" -> use https://www.w3.org/TR/owl2-profiles/ instead of [13]
.) s/the reasoner, such/reasoners, such/ + s/follows/follow/
.) Table 1 -> how should one read this table? all hard-coded systems implement either grammar-based languages or no language at all? I highly doubt that..
.) "They use the SPARQL query language" -> "They use SPARQL"
.) "[26]. Schneider" -> "[26]. Patel-Schneider"
.) p.5 last paragraph -> you don't have to cite [31] every time you mention Luzzu/LQML
.) W_3C -> W3C
.) "Hartmann et al. identified eighty-one general constraint types [44]." -> in [44] he was still called Bosch

3) Comparative analysis

.) "Using SPARQL property paths thus provides only limited inferencing expressiveness. Also, performance deteriorates [48]." -> [48] was written in 2012, one year before SPARQL 1.1 became a W3C Rec., and referred to a preliminary version of SPARQL 1.1; please investigate whether the claims of [48] are still relevant today.
.) "Inferencing rules Inherent support for (custom) inferencing rules [46]." & similar -> are those quotes from [46] really necessary? remove!

4) Logical Requirements

.) "These requirements are not common for Semantic Web logics, as data on the Web is decentralized and information is spread" -> not all Semantic Web data is LOD. what about LCD?
.) provide ref for SNAF the first time you mention it
.) "UNA is valid" -> "UNA holds"

5) Application

.) "at Section 5.4" -> "in Section 5.4"
.) s/made up/consists/
.) "N3 introduces an extension to the RDF 1.0 model, more specifically, it is a superset of Turtle [60]." -> and how does N3 relate to RDF 1.1?
.) Figure2 -> what does * indicate?
.) eulersharp eye links -> maybe consider updating them to https://github.com/josd/eye ?

6) Hypothesis validation
.) Validatrr), More -> Validatrr). More
.) update the link in footnote 20

7) Conclusion
.) s/up until/up to/

8) References

8.1) Please sort references alphabetically! It makes traversing them easier
8.2) Cite W3C documents in a consistent manner. E.g. [1,6] are both W3C Recommendations, but only [6] is denoted as such in its bibtex entry, while [1] is listed as a Technical Report
8.3) [14] -> update
8.4) [60] -> s/Beckter/Beckett/

FWIW: I'm happy to help translate the constraint types listed in [45] to SHACL!