Learning SHACL Shapes from Knowledge Graphs

Tracking #: 2906-4120

Pouya Ghiasnezhad Omran
Kerry Taylor
Sergio Rodriguez Mendez
Armin Haller

Responsible editor: 
Guest Editors KG Validation and Quality

Submission type: 
Full Paper
Abstract:
Knowledge Graphs (KGs) have proliferated on the Web since the introduction of knowledge panels to Google search in 2012. KGs are large data-first graph databases with weak inference rules and weakly-constraining data schemes. SHACL, the Shapes Constraint Language, is a W3C recommendation for expressing constraints on graph data as shapes. SHACL shapes serve to validate a KG, to underpin manual KG editing tasks, and to offer insight into KG structure. Often in practice, large KGs have no available shape constraints and so cannot obtain these benefits for ongoing maintenance and extension. We introduce Inverse Open Path (IOP) rules, a predicate logic formalism which presents specific shapes in the form of paths over connected entities. IOP rules express simple shape patterns that can be augmented with minimum cardinality constraints and also used as a building block for more complex shapes, such as trees and other rule patterns. We define formal quality measures for IOP rules and propose a novel method to learn high-quality rules from KGs. We show how to build high-quality tree shapes from the IOP rules. Our learning method, SHACLEARNER, is adapted from a state-of-the-art embedding-based open path rule learner (OPRL). We evaluate SHACLEARNER on some real-world massive KGs, including YAGO2s (4M facts), DBpedia 3.8 (11M facts), and Wikidata (8M facts). The experiments show that SHACLEARNER can effectively learn informative and intuitive shapes from massive KGs. The shapes are diverse in structural features such as depth and width, and also in quality measures that indicate confidence and generality.
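To make the abstract's notion of an IOP rule with a minimum cardinality constraint concrete, the sketch below shows how such a path rule might be rendered as a SHACL shape, following the W3C SHACL recommendation. The predicates and prefixes (`ex:directed`, `ex:hasActor`) are hypothetical, and the paper's actual translation may differ in detail; this illustrates only the general mapping from a path constraint to `sh:path` with `sh:minCount`.

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .

# Hypothetical shape derived from an IOP-style rule of the form
#   directed(x, y0) -> exists y1 : hasActor(y0, y1)
# i.e. anything that is directed (a film) should have at least one actor.
ex:FilmShape
    a sh:NodeShape ;
    sh:targetObjectsOf ex:directed ;   # targets chosen for illustration only
    sh:property [
        sh:path ex:hasActor ;
        sh:minCount 1 ;                # the rule's minimum cardinality constraint
    ] .
```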
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 14/Nov/2021
Review Comment:

The authors clarified my doubts on the adoption of RESCAL. For future work, I suggest further investigating the impact of KG embeddings in learning SHACL shapes.

Review #2
Anonymous submitted on 30/Nov/2021
Minor Revision
Review Comment:

I would like to thank the authors for providing evidence. In the revised version, the authors generally responded sufficiently to the comments provided in the initial review.

However, there are still some comments.

First of all, regarding the experiments in section 5.3, table 3 is difficult to read.
What is the probability that a rule satisfied in the KG is indeed discovered? Did you perform any analysis of what kinds of rules each of the configurations fails on, and why?

Second, the authors claim that using the complete KG for learning rules about all target predicates could harm the quality of the learned embeddings. So, in case we want to apply SHACLEARNER to DBpedia, how should the sampling work?

Third, do you have any evidence of how long the sampling takes? For example, if we consider DBpedia and run SHACLearner, how long will it take to prune all predicates and the relevant entities?

There are still some typos in the revised version; some further proofreading would be welcome.

Review #3
Anonymous submitted on 16/Jan/2022
Minor Revision
Review Comment:

I would like to thank the authors for the quite detailed response letter for the previous round of reviews. The paper has improved a lot since the previous version and most of the review comments are either addressed or clarified.

This paper presents an approach called SHACLearner that extracts rules from a KG in the form of paths (Inverse Open Paths) which can be trivially converted to SHACL shapes. The approach is based on embedding-based open path rule learning. The main concerns regarding the previous review related to Originality, Significance of results, Quality of writing. The authors have addressed all three of those major concerns in this revision. There are still a few minor things that can be improved in the paper as listed below.

Other comments:

(1) Regarding the response to 3.4, while I agree that there are various formalisms for shapes defined in the literature to express diverse patterns, the title of this paper, “Learning SHACL Shapes from Knowledge Graphs”, suggests that it focuses on SHACL shapes. Thus, I believe it is important to provide some insight on which SHACL features are covered, which could be covered in theory but are not due to time constraints, and which fundamentally cannot be covered by the approach, among the constraint types (value type, cardinality, value range, string-based, property pair, etc.) defined in https://www.w3.org/TR/shacl/.

(2) Regarding the response to 3.6, while I understand the syntactic difference of encapsulating the type in an entity or in a predicate, I still fail to see the benefit of that. Why not follow the RDF/OWL convention of using binary predicates for defining types? What problem is solved by unary predicates that could not be solved using binary predicates (similar to how the authors have done in Yago2)? Doesn't the notation P(e,e) for unary predicates conflict with reflexive properties in the KG? Including unary predicates adds a lot of additional detail to the method and makes the SHACL shapes have unconventional target classes such as `sh:targetclass class:_;`. Thus, I believe a clearer description of why binary predicates could not be used with types still has to be included in the paper.

(3) In general, there are several points that are explained in the answer letter but not in the paper, such as 3.8, 3.14, and 3.21. It might be beneficial to incorporate some of those explanations into the paper itself.

(4) Regarding the response to 3.25, even though I understand that a fully-fledged human evaluation is out of the scope of this paper, a qualitative analysis by the authors of a smaller sample of discovered trees would still help the reader understand the usefulness of the discovered trees and also their limitations. Such an analysis would validate the results and give readers an idea of the current challenges and potential future work.