Learning SHACL Shapes from Knowledge Graphs

Tracking #: 2681-3895

Authors: 
Pouya Ghiasnezhad Omran
Kerry Taylor
Sergio Rodriguez Mendez
Armin Haller

Responsible editor: 
Guest Editors KG Validation and Quality

Submission type: 
Full Paper
Abstract: 
Knowledge Graphs (KGs) have proliferated on the Web since the introduction of knowledge panels to Google search in 2012. KGs are large data-first graph databases with weak inference rules and weakly-constraining data schemes. SHACL, the Shapes Constraint Language, is a W3C recommendation for expressing constraints on graph data as shapes. SHACL shapes serve to validate a KG, to underpin manual KG editing tasks and to offer insight into KG structure. We introduce Inverse Open Path (IOP) rules, a predicate logic formalism which presents specific shapes in the form of paths over connected entities. Although IOP rules express simple shape patterns, they can be augmented with minimum cardinality constraints and also used as a building block for more complex shapes, such as trees and other rule patterns. We define quality measures for IOP rules and propose a novel method to learn high-quality rules from KGs. We show how to build high-quality tree shapes from the IOP rules. Our learning method, SHACLearner, is adapted from a state-of-the-art embedding-based open path rule learner (OPRL). We evaluate SHACLearner on some real-world massive KGs, including YAGO2s (4M facts), DBpedia 3.8 (11M facts), and Wikidata (8M facts). The experiments show SHACLearner can learn informative and intuitive shapes from massive KGs effectively. Our experiments show the learned shapes are diverse in both structural features such as depth and width, and in quality measures.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Iovka Boneva submitted on 09/Mar/2021
Suggestion:
Major Revision
Review Comment:

(References in the review are those from the paper).

Overview
========

The submission presents an adaptation of previous work of the authors [13,14] to learning SHACL constraints from knowledge graphs (KG).
The learning mechanism is based on discovering paths (ie sequences of properties) that occur frequently starting from a given node type (eg song, human, city, etc.)
These paths can then be merged into a single SHACL constraint for the given type.

The paper starts with recalling in Section 2 the definition of a closed path (CP), which is a dependency of the form (all variables are universally quantified)
P(x,y) <- P_1(x,z_1), ... P_n(z_{n-1},y)
stating that if a path P_1 ... P_n exists from x to y, then P(x,y) is a fact in the KG.

Then defines an open path (already defined in the tech report [13]) as a dependency of the form
P(x,z_0) <- P_1(z_0, z_1), ..., P_n(z_{n-1}, y)
which I do not fully understand because the quantifications are not explicitly given.
The authors say that this rule contains free variables but I do not know which variables are free. x? z_0 ? y?

In Section 3 the authors introduce inverse open path (IOP) as a rule of the form
P(x,z_0) -> \exists(z_1, ..., z_{n-1},y). P_1(z_0, z_1), ..., P_n(z_{n-1},y)
which states that whenever P(x,z_0) is a fact in the KG, there exists a path P_1 ... P_n starting at z_0 in the KG.
IOP rules correspond to SHACL constraints when the variables x and z_0 are unified and P is a node type.
For instance,
song(x,x) -> \exists (z_1,y). album(x,z_1), recordLabel(z_1,y) (*)
is a IOP rule and corresponds to a SHACL constraint saying that nodes of type song have a property album which value is a node that has a property recordLabel.
Several rules with common body, eg song(x,x), can be combined in what the authors call a tree to yield a more complex SHACL constraint.

The learning algorithm called SHACLearner is intuitively based on discovering correlations between the occurrences of the body of the rule (eg song(x,x)) and the head of the rule (eg existence of a path album.recordLabel starting at x).
It is also possible to discover minimal cardinality constraints by counting the number of different y's that allow to satisfy the rule (*) above.

The SHACLearner algorithm is given in Section 3 together with a brief explanation from which I was not able to understand several important aspects (see below).
I could however see that the algorithm is very close to the ones in [1,2]. The main difference seems to be the construction of the tree (which is clearly a contribution of the paper under review), and possibly the evaluation of the relevance of discovered rules is also different. The authors do state that the algorithm is an adaptation of the OPRL algorithm from [14], but I would have preferred to see explicitly which parts are previous work and which parts are contributions of the paper under review.

Then in Section 4 the authors give some related work, for which I have a couple of remarks (see below).

In Section 5 is presented an experimental evaluation of the algorithm with the KGs YAGO, DBPedia and Wikidata. The algorithm is used in each KG with 50 randomly selected node types (eg song in example (*)), with predefined quality thresholds for the rules.
The experiments show that the algorithm finds very quickly (less than 3 hours) at least one quality IOP rule for 80 to 98 percent of the types.
Other figures show the number of quality tree rules discovered by combining IOPs.

Finally some conclusions are drawn.

Evaluation
==========

The experiments show that SHACLearner is able to discover very quickly rules in huge knowledge bases, which is IMHO a very interesting feature.
The paper is well written and generally clear (with some exceptions listed below).
I however have some concerns regarding the usefulness and the relevance of the discovered rules, with two aspect that I will develop in the sequel.

Completeness of SHACLearner
---------------------------

Does the algorithm discover all quality rules satisfied in the KG ? If not, how many of the quality rules are missing in the final result (probability to discover a quality rule) ? What kind of rules are missing ?

SHACLearner prunes the graph and uses heuristics, which IMHO is necessary in order to handle huge KGs. It is however not clear whether pruning or heuristics discard relevant rules.

Regarding the pruning, p. 6 l. 37 it is said that "entities and predicates that are not relevant for the target predicate" are removed. Without an explanation of how you decide relevance, it is impossible to judge whether pruning indeed removes only irrelevant entities and predicates.

Also, SHACLearner discovers frequent paths by exploring all possible paths in the KG, which of course requires huge amount of computation. A heuristic is used in order to reduce the search space, based on so called similarity. Similarity is not defined, only intuitively given. As far as I can understand, a path P_1(y,z),P_2(z,t) has higher score (so more chance to be kept by the heuristic) if the set of entities P²_1 that appear in second position of facts of P_1 is "similar" to the set of entities P¹_2 that appear in first position of facts of P_2 (all these entities correspond to valuations of the variable z). Although I do not know what exactly "similar" means here, I can imagine cases where a path is relevant while the sets of entities P²_1 and P¹_2 have very small intersection. For instance, if P_1 and P_2 are very frequent properties that can appear virtually everywhere (so they have very small similarity ?), but the path P_1.P_2 appears for ALL entities of some type T. Does the definition of similarity allow to discover the corresponding rule ?

Here is a suggestion of how one could possibly discover rules discarded by the above mentioned approximations: take a small KG and apply SHACLearner on it w/o any approximations, and with the approximations. Do you get the same rules ?
If yes, then nothing is proved (SHACLearner could still miss important rules on another KG). If no, then you have an indication on the rules that the approximation does not allow to discover.

Is is possible to quantify the precision / completeness of the algorithm ? Can you compute the probability that a rule satisfied in the KG is indeed discovered ?

Relevance of discovered rules wrt use scenarios
-----------------------------------------------

While IOP rules indeed can be encoded in SHACL, they only cover a part of it. They do not allow to encode complex cardinalities, data values, Boolean operators (negation and disjunction are complex to learn, I won't expect to find good algorithms for that).
In particular, the zero-or-one cardinality is very important while modeling, but SHACLearner never generates rules with zero-or-one cardinality.
How relevant are the discovered shapes ? Can you describe use cases or use scenarios in which precisely these shapes are relevant ?

In the Conclusion the authors do motivate the use of SHACL shapes for knowledge graphs, but are the shapes discovered by SHACLearner relevant for any of these tasks ? For instance, are rules discovered by SHACLearner relevant for generating forms for data completion ? In order to complete data one needs a reliable and approved schema. Can automatically discovered rules be used to decide that some data is missing in the graph ?

Also, SHACLearner is based on discovering paths of some length that are then combined. Does it often occur in practical applications to use paths in constraints ? IMHO the most frequent case is for the shape to describe the immediate neighborhood of the node, or maybe up to distance 2 or 3. What use scenarios justify the use of such a complex algorithm as SHACLearner that explores the huge search space of all paths up to some length ? When is the exploration of just the immediate neighborhood not enough for discovering a 'good' shape ?

Other remarks
=============

Related work
------------

While I didn't see important references missing, I find unfair the comparison with some of the related work.

- p. 9 l. 27-34, comparison with [31]: I wouldn't say that [31] is "partial attempt" to provide logical foundations of the semantics of SHACL. It is I think quite complete, and in any case a more complete proposal of the semantics than the rules presented in the paper under review. Btw it could be interesting to compare the rules you propose with the formalization proposed in [31].
- p. 9 , first paragraph of Section 4. The related works [8,22-25] are qualified "without logical foundation" and their output formalisms are implicitly qualified as not "well-founded". While I myself like to use logic formalisms in order to define things w/o ambiguity, I wouldn't say that other kinds of definitions are not well-founded. As the main author of [25] I claim that the output of the shape inference algorithm there is very precisely qualified (see also the companion work [**]): it is a precisely defined generalization to SHACL rules of the neighborhoods of a set of randomly chosen sample nodes. No probabilities, no heuristics, but an outspoken random sampling with clearly identified use scenarios: help expert users to write schehmas, help novice users to get some idea of the shape of the data.

Miscellaneous
-------------

- p. 2, right, l. 46: it is said that the rule forms a "loop". The standard definition of a loop I know of is a path that starts and ends with the same element. It is not the case here. What kind of loop is it ?
- Definitions of OP and IOP rules, I found it a bit disturbing to distinguish the variable y from the z variables, as in the definitions of satisfies and support, SC and HC y is also existentially quantified. Later on I could see that y is relevant for the cardinality. Maybe this could be mentioned right away, to clarify why it treated differently from the z's.
- p. 12, right, l. 25: what you call %SucRate was previously called %SucTarget
- I'm curious about the matrix representation used to explain the efficiency of the algorithm. With the size of the KBs considered, you have matrices of size n x n with n=3 000 000. Can it even fit in RAM (even if you represent 1 Boolean value with 1 bit) ? Do you use sparse matrices ? What kind of matrix multiplication algorithm is being used ?

Requested revisions
===================

Explicit contributions
----------------------

- Clearly state what are the new contributions of this paper and what comes from previous work, especially in Section 3 and the SHACLearner algorithm.

Better formalization and explanations
-------------------------------------

- State precisely what are the external components that are used by SHACLearner (eg RLvLR, RESCAL) and what are their properties. This includes
- equation (5): what are the formal properties of the scoring function ? What precision does this scoring function have, i.e. what guarantee does the value computed in (5) give on the actual probability of P_0(e_1, e_2) ?
- what is similarity (mentioned on p. 6 l.48) ?
- Explain what is the sampling or at least what are its formal properties so that one can judge whether sampling would remove relevant parts of the KG.

Motivation
----------

- Qualify and/or quantify the output of the SHACLearner algorithm. What kinds of paths might not be discovered ? What is the probability that an actual frequently occurring path is discovered by the algorithm ?
- Identify use scenarios: what use cases can be supported by SHACLearner, in particular in comparison with simpler algorithms ?

Additional citation
===================
[**] Semi Automatic Construction of ShEx and SHACL Schemas
Iovka Boneva, Jérémie Dusart, Daniel Fernández Álvarez, Jose Emilio Labra Gayo.
https://arxiv.org/abs/1907.10603

Review #2
Anonymous submitted on 16/Mar/2021
Suggestion:
Major Revision
Review Comment:

The manuscript introduces a novel predicate logic formalism to express SHACL shapes in the form of paths connecting Knowledge Graphs (KGs) entities. These paths are formalized using an inverse version of the open path rules, defined as Inverse Open Path (IOP) rules. The validity of these rules in KGs is established by the authors creating a new set of quality metrics: they re-adapt well-known measures (support, head coverage, and standard confidence), which are broadly employed to evaluate rule-learning methods. The manuscript also introduces a new learning method based, called SHACLearner, to mine IOP rules from KGs automatically.

The manuscript is well written and its impact can be relevant, considering that large-scale KGs are currently populated using (semi-)automatic methods that are unavoidably prone to error. Indeed, automatically-generated KGs can be validated against a set of SHACL shapes. The IOP rules and the related quality measures are clearly described, starting from alternative rule mining for KG completion, such as closed-path and open-path rules. However, I believe that the motivation and analysis regarding the embedding method of the SHACLearner should be further extended.

As claimed by the authors (p.6, second column, line 42), there is a strong relationship between the logical statement of the rule and specific properties in the embedding space. Consequently, in the authors' approach, the embedding algorithm plays a crucial role in mining IOP rules. Nevertheless, it is not clear the choice behind RESCAL for the embedding generation. The authors state the SHACL learner is adapted from OPRL, but I believe that further analysis can be conduct in this direction.

(i) Is RESCAL the best model for learning embeddings to detect paths?
(ii) Could you show it empirically, comparing the results with other embedding methods?
(iii) Which is the setting of hyperparameters that allow you to achieve the best performance?

I think the KG embedding field is quite mature to perform this type of analysis, and the authors can exploit existing software libraries including, https://github.com/pykeen/pykeen, https://github.com/thunlp/OpenKE.

I suggest a major revision of the manuscript extending the analysis and the discussion of the embedding component, which is fundamental in the authors' approach.

Minor changes:
p.2, second column, line 28, “as the label of an link”: “a” not “an”.
p.6, second column, line 21: remove an “a”
P.6, second column, line 47: “emdeddings” → embeddings

Review #3
Anonymous submitted on 07/Jul/2021
Suggestion:
Major Revision
Review Comment:

This paper presents an approach called SHACLearner that extracts rules from a KG in the form of paths (Inverse Open Paths) which can be trivially converted to SHACL shapes. The approach is based on embedding-based open path rule learning.
- Originality: Framing the problem of learning SHACL Shapes as a rule learning problem with paths using Inverse Open Paths can be considered as original. However, it will be important explicitly identify which are the novel contributions of this paper and what is being reused from the existing works.
- Significance of results: One of the main weaknesses of this paper is the evaluation. The hypothesis being validated is vague with claims such as “handle the task satisfactorily”. Evaluation needs to be improved to include (a) a reasonable baseline to compare with (b) some evaluation of the correctness of the shapes learned (maybe manually on a sample) and/or the usefulness of them in a real life use case (c) some qualitative analysis. Detailed comments are provided below.
- Quality of writing: Quality of writing is good. As pointed out below, several sections need further explanation on rationale and motivation for some design decisions taken such as the use of unary predicates. Furthermore, please pay attention to definitions in the SHACL specification as it’s one of the main focuses of this paper.

Considering the comments described below, I recommend a major revision for this paper.

Detailed Comments:

Abstract:
- Minor - the definitions of KGs are very diverse and people define them in different ways. For example, refer to https://arxiv.org/abs/2003.02320 or http://ceur-ws.org/Vol-1695/paper4.pdf. However, I find “KGs are large data-first graph databases with weak inference rules and weakly-constraining data schemes.” a bit strange. Maybe you can review the other definitions to see if there are others that fit better.

Introduction:
- This section puts some emphasis on the incompleteness of the knowledge graphs. Though the path rules can be used to infer new triples, the focus of this paper is on learning SHACL Shapes. It might be good to somehow relate this paragraph about incompleteness to the overall story.
- Minor: SHACL[6] -> SHACL [6]
- As SHACL Shape learning is the main topic of the paper, it would be beneficial if the paper provided a brief description of all different constraint types such as value type, cardinality, value range, string-based, property-pair etc and state explicitly which ones are covered by the proposed approach (nodeKind and minCount?).
- It will be also interesting to identify which constraint types can not be fundamentally addressed using path-based approaches, for example, value ranges or property-pairs? It might also help to validate the claim “common underlying form of all these shapes is the path”.
- It would help the reader if you can include an explicit list contributions at the end of the introduction using the common practise of “the main contributions of this paper are: (a) … (b) … (c) …”. This will help the reader to understand what are the novel ideas and what is being reused.

Preliminaries:
- Even after reading the paper completely, I didn’t quite understand the need for these additional unary predicates. What is the motivation for this? The IOP definition handles the binary predicates (as you do in Yago2). Is there any difference if you used dbo:type(x, dbo:Singer) as the body of your IOP instead of the unary predicate?
- On a separate note (if unary predicates were really necessary), why rdf:type and P31/P279 properties were not used for this? A discussion on motivation for this decision and the selection criteria will be useful to the reader. If rdf:type was used, it would have prevented unconventional target classes in SHACL shapes such as `sh:targetclass class:_;`.
- Isn’t the notation P(e,e) for unary predicates conflict with reflexive properties in the KG. For me personally, the idea of representing unary predicates as self-loops is counterintuitive. Also in the first paragraph of Section 2, P is used to denote both predicates and classes (where P is a class or a data type). To avoid confusion, it might be better to use another symbol such as C.

3. SHACL Learning

- Please review your example in Figure 2 with the semantics of SHACL specification at https://www.w3.org/TR/shacl/#core-components. According to my reading, some of the elements in your example is wrong.
- node kind relation has a typo. It should be “sh:nodeKind”, camel case.
- The values of sh:nodeKind in a shape are one of the following six instances of the class sh:NodeKind: sh:BlankNode, sh:IRI, sh:Literal sh:BlankNodeOrIRI, sh:BlankNodeOrLiteral and sh:IRIOrLiteral.
- You should have used “sh:class” instead of “sh:nodeKind” for your intended purpose.
- it has invalid turtle syntax in “sh:path: citizenOf;”
- a correct representation would look like

sh:property [
sh:path property:citizenOf;
sh:class class:country;
sh:nodeKind sh:IRI ;
];
- Is Lemma 1 used in any part of the process? Why is it relevant?

- In 3.2, please provide details on the exact strategy/algorithm used for pruning the KG. Is it done once per each gold predicate? What is the avg size of the fragment of KG used for rule mining? This information is relevant as one of the claims of the paper is to handle massive KGs compared to the other approaches.

- Looking at the embedding based approach, one potential baseline would be to do the PathFinding but using only the symbolic information found in the graph. This might help to improve the evaluation.

- 3.2, there is a typo in predicate argument similarities. I assume the second one should be P12 and P21 not both P12.

- Is there a novelty in the path finding approach or is it reusing what’s defined in [13]?

- I believe the aspect of knowledge graph completion and the aspect question answering etc. are very interesting and could be important contributions of this paper. One of the requirements for that would be making them part of the evaluation. Can the evaluation be extended to cover some knowledge base completion or HCI aspects?

4. Related Work

- I think related work is too brief. For example for [22-25], it would be interesting to at least discuss what is the approach they use, what are the datasets they used, what are the differences with the proposed approach.
- For example, the authors claim “[22-25] are not shown to be scalable to handle real-world
KGs. However, it seems 22, 23, 24 uses DBpedia as their KG and 25 uses DBTunes and Wikidata. The authors will have to explain why they claim that each of the previous methods are not scalable?

5. Experiments

- If there is no fair comparison out there, I guess you will at least have to provide a good baseline in order to validate your hypothesis.
- “handle the task satisfactorily”, “satisfactory performance” sound opinionated and not objectively verifiable. It would be good to formulate this with respect to the metrics identified.
- “can be applied in practice” - how are you demonstrating this? Have you used this in any feasibility study in a real-world use case? I guess the trees learned (Sec 5.2) can be used for a qualitative analysis with a real use case.
- As mentioned previously, I don’t understand the motivation for converting binary type assertions to unary relations in Section 5.0.1.
- in +UP, what is the intuition or reason for selecting dbo:type / dbo:class (DBpedia) and occupation_P106 (Wikidata) as unary predicates? What was the selection criteria among all other predicates with classes as their argument (for example, Wikidata has many of them)?
- Similarly, what’s the reason for using only binary predicates with YAGO2? What are the different characteristics of the KGs that motivated you to make these decisions?
- For completeness and readability, it would be good to include a one liner about rationale for the specific minimum threshold (0.1 and 0.01) values for IOPSC and IOPHC though it might be discussed in detail in [10].
- Analysis of Figure 4 and Figure 5 should be written with more details as it is one of the most interesting parts in the evaluation. I believe it is useful to first establish what high/low IOPSC and IOPHC indicates as discussed in 5.1.1. and then explain the charts relating to those. For example, “In the left-hand chart we observe a consistent decrease in the proportion of quality rules as the IOPSC increases.” has to be further discussed.
- Also in Figure 5, what can we learn from this chart? Is this because of the characteristics of the domain such that there are more properties with less cardinality? Or is this because of the characteristics of the mining that it is easier to find rules that are defined as the lower bound? Furthermore, for practical use, you will have to pick one cardinality rule, isn’t it?
- The discussion in 5.1.1. touches the issue of concept hierarchy such that some rules apply to all human (and thus applies to all it’s subtypes such as singers) and some rules are very specific to singers, for example. It would be interesting to discuss how this aspect of hierarchy or inheritance can be incorporated into your proposed solution.
- A qualitative evaluation/discussion of the usefulness of discovered trees will be beneficial for the paper.
- The paper provides a link to the data in Dropbox which is good for other researchers working on the same area.

6. Conclusions

- “IOPSC effectively extends SHACL with shapes, representing the quantified uncertainty of a candidate shape to be selected” - do you plan to represent this information within the SHACL shape itself as metadata? It would be interesting to see an example of such representation.
- Is this functionality already incorporated in schimatos.github.io?
- It would be good to include some lessons learned / challenges/limitations of the approach in the conclusion.