Querying Knowledge Graphs with Extended Property Paths

Tracking #: 1899-3112

Valeria Fionda
Giuseppe Pirrò
Mariano P. Consens

Responsible editor: 
Guest Editors Knowledge Graphs 2018

Submission type: 
Full Paper
The increasing number of Knowledge Graphs (KGs) available today calls for powerful query languages that can strike a balance between expressiveness and complexity of query evaluation and, that can be easily integrated into existing query processing infrastructures. We present Extended Property Paths (EPPs), a significant enhancement of Property Paths (PPs), the navigational core included in the SPARQL query language. We introduce the EPPs syntax, which allows to capture in a succinct way a larger class of navigational queries than PPs and other navigational extensions of SPARQL, and provide formal semantics. We describe a translation from non-recursive EPPs (nEPPs) into SPARQL queries and provide novel expressiveness results about the capability of SPARQL sub-languages to express navigational queries. We prove that the language of EPPs is more expressive than that of PPs; using EPPs within SPARQL allows to express things that cannot be expressed when only using PPs. We also study the expressiveness of SPARQL with EPPs in terms of reasoning capabilities. We show that SPARQL with EPPs is expressive enough to capture the main RDFS reasoning functionalities and describe how a query can be rewritten into another query enhanced with reasoning capabilities. We complement our contributions with an implementation of EPPs as the SPARQL- independent iEPPs language and an implementation of the translation of nEPPs into SPARQL queries. What sets our approach apart from previous research on querying KGs is the possibility to evaluate both nEPPs and SPARQL with nEPPs queries under the RDFS entailment regime on existing query processors. We report on an experimental evaluation on a variety of real KGs.
Full PDF Version: 

Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Bo Fu submitted on 11/Jul/2018
Minor Revision
Review Comment:

This paper presents Extended Property Paths (EPPs) that extends the SPARQL query language with additional navigational expressiveness. The proposed EPPs were evaluated and compared to other navigational extensions with improved efficiency and added expressiveness. EPPs have also been made available to the general public via the Jena framework, as SPARQL endpoints, and as an independent API. The authors have made clear of the additional contributions of this manuscript compared to previously published versions of this work. The paper is well-written and presents comprehensive definitions and evaluations of the EPPs. There are some minor comments which the authors could consider in a revised version of the manuscript, as discussed below.

The evaluation appears to focus on the efficiency of the EPPs. Since a key motivation of this work is to potential “strike a balance between expressiveness and complexity”, it would be helpful to see the evaluation examining the expressiveness/complexity aspects of the proposed EPPs vs. other navigational extensions in addition to efficiency. This issue may be clarified if the notion of “overhead” mentioned in various places in the paper can be defined early on. Without any discussion on how overhead should be interpreted in the context of this paper, it becomes challenging for the reader to perceive what should be considered as “reasonable” and “minimal” overhead later (section 8.3). Is it possible for others to disagree with these overhead benchmarks? Should overhead be understood as power and memory consumption? It would be helpful if measures of overhead can be defined. In addition, it would be helpful to include a brief overview of existing evaluation approaches in the related work (section 9), which will provide the necessary justification for why the aforementioned evaluation criteria in particular were chosen for EPPs.

The differences reported in the paper, such as those shown in Fig. 14 provide a useful descriptive overview of the data collected, however, when differences appear marginal or mixed (e.g. Fig. 15) at times, it may be necessary to conduct statistical analysis to validate these findings are indeed statically significant. Would null hypothesis testing be appropriate?

Terminologies referring to some evaluation aspects appear inconsistent at times. For example, section 8.1, titled “transitional performance”, suggests that one could potentially expect other findings beyond just the speed itself, since performance can be understood as a broader concept than efficiency that may include other aspects such as scalability, correctness, completeness, latency, etc.

When a particular dataset is chosen for the evaluation, e.g. a portion of FOAF was used, it would be useful if the authors could briefly discuss the rationale for selecting one part as opposed to other parts of FOAF/other KGs altogether, which should help with the justification of the chosen dataset and how potential bias was minimized in the selection of this dataset that objectively compares EPPs against others.

Several diagrams in the paper such as Fig. 14, Fig. 15, and Fig. 16 do not have legends for the x-axis, it would be helpful if they can be added.

Review #2
By Peter T. Wood submitted on 01/Sep/2018
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The paper defines and studies Extended Property Paths (EPPs), which extend the Property Paths (PPs) of the SPARQL query language with features such as path and node tests, path conjunction and path difference. Although the non-recursive fragment of EPPs can be translated into SPARQL (as shown in the paper), EPPs can be more concise. In addition, the paper shows that full EPPs could be handled by existing SPARQL processors by a reasonably small modification to one of the evaluation functions.

The paper also studies the problem of answering queries where the reasoning capabilities of the RDFS entailment regime are taken into account. The authors show that this reasoning can be captured by query rewriting using EPPs. Empirical studies reported in the paper show that the overhead of the rewritings is not very large.

The paper is comprehensive and well written, and presents an interesting extension to SPARQL that provides increased expressive power. However, the novelty of EPPs is rather limited. As mentioned in fact by the authors on several occasions, many of the features of EPPs as well as results on expressiveness and computational complexity carry over from languages based on Nested Regular Expression (NREs), which have been studied by various authors over a number of years. Other features of EPPs are present in languages such as GXPath. Nevertheless, the focus on extending SPARQL with such features is useful. As a result, I believe the paper would be of interest and use to a significant number of researchers.

More specific comments follow below.

Since NRE-based languages specifically introduced tests to check existence of paths satisfying conditions, I think that it should be clarified in Example 3 what aspects of TP used in the example are not supported by them. This is done for Example 4, e.g.

Form a practical point of view, the motivation for using nEPPs (rather than EPPs) in SPARQL seems weak. They cannot be combined with PPs because of overlap in syntax, and one wouldn't want to replace PPs with nEPPs because the former are more expressive.

In the discussion on the conciseness of EPPs compared to the translation on p 16, an example using p1/p2 is used and it is pointed out that the translation produces two triple patterns with a connecting variable. However, p1/p2 could be translated into SPARQL as is, so the example hardly motivates the conciseness of EPPs. Also I fail to see the point of mentioning that there are infinitely many possible translations and the algorithm described implements one of them.

Again on p 26, the paper claims that navigational queries can be written more succinctly using nEPPs rather than SPARQL. As evidence of this, the lengths of nEPPs and equivalent SPARQL queries are compared. However, the SPARQL queries used are those resulting from the translation presented in the paper. It is quite conceivable that an automated translation will produce much more verbose queries than natively hand-coded queries, so the comparison is unfair. It would be more convincing to present results on conciseness, independent of the translation.

Theorem 26 (on p 21) shows that there is an EPP query that cannot be expressed as a PP query. In the proof, the EPP query using triple pattern ?b (p & q)* ?e expressed on the graph G in Figure 13 is used. The claim is that no PP query returns exactly the same result on G. In the inductive step for P = P_1 AND P_2, it is claimed that "all the additional answers" from evaluating P_1 and P_2 on G "will not be discarded" because they both return all the "self" bindings. However, all that can be deduced is that all the "self" bindings will be returned; other bindings may not be. So one cannot conclude that P_1 AND P_2 will return more answers than the EPP query. In fact, it seems to me that the PP pattern ?b p ?e . ?b q ?e will return exactly the answers returned by the EPP query (although obviously not being equivalent to it in general). So I think a more complex graph is needed to prove this, and the proof needs to be fixed.

It is also claimed on p 22 that it follows from Theorem 26 that a particular kind of recursive EPP query cannot be evaluated using the standard PP machinery. But this doesn't follow (even if the proof of the theorem is corrected) because the query used here is syntactically different from the one used in the proof of the theorem. The query used here includes "/" and TP, whereas the one in the theorem uses "&".

Earlier on p 22, it states that for the base case of p, the query must be rewritten using EPP constructs. However, the corresponding entry in Table 9 shows only PP is necessary. This needs to be clarified or fixed.

In the related work section, the focus is (correctly) on SPARQL and extensions to it. So on p 29, e.g., it is stated that "only EPPs allow to test node values in a nested expression" and "none of these extensions can express the Italian exclusive friends query mentioned in the Introduction". However, I think it should be pointed out that languages such as GXPath (or certainly conjunctive GXPath) can express both of these kinds of queries. Section 9.2 does mention GXPath [43] but dismisses it because it is not SPARQL-based. I believe readers who are not familiar with the expressive power of GXPath may be left with a incorrect impression of the significance of the work reported in the present paper. By the way, the full version of the GXPath paper appeared as "Querying Graphs with Data" in J ACM in March 2016.

The following is a list of more minor comments and typos.

abstract, l 2: "evaluation and," -> "evaluation, and"

p 2, l 1: "neither are" -> "are neither"

p 2, ex 1: "request" -> "requests"

(many places): "NREs-based languages" -> "NRE-based languages"

p 3, fig 1 caption: insert "a" before "Knowledge"

p 3, ex 6: "arbitrary length with labels" is clumsy; perhaps "arbitrary length comprising edges labeled"

p 4, l 6: "EPP expressions" -> "expression"

p 4, l 8: in fact the standard uses a single polymorphic function named ALP, not ALP1

p 4, ex 7: the expression from Example 6 cannot be rewritten as shown since the two expressions are not equivalent.

p 4, l -4,-5: "a mean of" -> "a means of"

p 4, c 2, l -17: "multiset" -> "multisets"

p 4, c 2, l -9: "into" -> "using"

p 6, l -10: "the only" -> "only the"

According to the definitions on pages 5 and 6, an RDF triple cannot have a literal in the subject position but a SPARQL triple pattern can.

p 7: double "then" (after definition of 'optional')

p 8: "spec." -> "specification" (and elsewhere)

p 8: second "ALP1" -> "ALP2" (see comment above about the standard); also the ALP1 and ALP2 functions seem to be from [48] in fact.

p 8: "NREs-like" -> "NRE-like"

p 8, def 12: delete "is" after "epp"

p 8: "a IRI" -> "an IRI"

p 9, Table 3: In R2, the result should be a set of mappings, not a projection over two variables; also, shouldn't "?v" be "?v_n"?

p 9: delete "it" before "is translated"

p 9: space missing between ":leaderParty" and "_s"

p 10, fig 8, caption: insert "of" after "semantics"

p 10: "denotes" -> "denote" (l 11)

p 10, ex 13: "in predicate" -> "in the predicate"; "mapping" -> "mappings"; insert "is bound" after "?v_R)"

p 10: delete "are" (c 2, l 4)

p 11: delete "text" before ":Carrara"; "that i" -> "that is"

p 12: add comma before "a position symbol"

p 12: "wrt" -> "with respect to"

p 12: function "coor" -> "corr"

p 13: "same type of" -> "same type as"

p 13: "translates in" -> "translate to"

p 13: "test are" -> "tests are"

p 14: "EPPs test" -> "EPP tests" (Table 5 caption)

p 14: "in the fact" -> "to the fact"

p 14: "in input" => "as input"

p 14: "To given" -> "To give"

p 14: "a nEPP" -> "an nEPP"

p 14: "of the for" -> "of the form"

p 14: "in translated" -> "is translated"

p 14 "a EBC" -> "an EBC"

p 16: "a EPPs" -> "an EPP"

p 16: "Exended" -> "Extended"

p 16: "changes into the" -> "changes to"

p 16: "table is the" -> "table is that the"

p 17: "full EPPs" -> "ful EPP"

p 17: "Fig. 5" -> "Fig. 6"

p 17: the second aim of the section (5) is actually in Section 6.2, so this needs to be rewritten.

p 17: "of RDFS vocabulary consisting in" -> "of the RDFS vocabulary consisting of"

p 17: "Authors" -> "The authors"

p 17: include the translation Phi as part of what is given in Lemma 22.

p 17: "three expression" -> "three expressions"

p 18: "UL" in the caption of Table 7 should be bold "IL"

p 18 "Enconding" -> "Encoding"

p 18 "capture rule" -> "captures rule"

p 19: "such type of expressions" -> "such types of expressions"

p 21: "in virtue of" -> "by virtue of"

p 21: "than all" -> "then all"

p 22: "close language" -> "closed language"

p 22: "our rewriting" -> "our rewritings"

p 22: "regime" -> "regimes" (c 2, l 9)

p 22: "Authors" -> "The authors"

p 22: "symmertricProperty" -> "symmetricProperty"

p 22: "give a receipt"?

p 22: "currently support" -> "currently supports"

p 22: "as a future" -> "as future"

p 22: "KG" -> "KGs"

p 22: "Semantics" -> "semantics"

p 23: Table 10 uses both "|" and ":" to mean the same thing; also the definitions given for R4 and R9 use "|" and "wedge" incorrectly - rather use "where" and "and"

p 24: "wrt" -> "with respect to"

p 24: "EPPs language" -> "EPP language"

p 25: "ran" -> "run"

p 25: "nEPPs constructs" -> "nEPP constructs"

p 25: "characters long nEPPs" -> "character long nEPP"

p 26: "EPPs" -> "EPP" (l 2); "iEPPs" -> "iEPP" (l 3)

p 26: "including" -> "comprising"

p 26: "of a total" -> "with a total"

p 26: insert "are evaluated" before "via Jena"

p 27: "under entailment" -> "under the entailment"

p 27: remove comma after "in some cases"

p 27: add "operators" after "UNION"

p 28: "available into" -> "available for"?

p 28: "from some time" -> "for some time"

p 29: "it is the language having less" -> "it has fewer"

p 29: "languages but" -> "languages except"

p 29: "wrt" -> "with respect to"

p 30: "GxPAth" -> "GXPath"

p 30: "wrt" -> "with respect to"

p 30: "2-EXPTIME" what, hard or complete?

p 30: "Reutters" -> "Reutter"

p 31: "such extension" -> "such an extension"

p 31: reference [14] is not 543 pages long!

p 32: presumably reference [51] has now appeared

Review #3
Anonymous submitted on 26/Sep/2018
Review Comment:

Graph-structured data collections are becoming increasingly common in a variety of application domains. Furthermore, the study of query languages is fundamental in knowledge and data management systems. Hence, the focus of this paper, query languages for knowledge graphs, is very well motivated and timely.

The authors here propose and study an algebra EPP for navigational queries in edge-labeled directed graphs. The semantics, relative expressive power, and evaluation complexity of the language are presented, also in terms of SPARQL for such graphs represented in RDF. The authors also demonstrate how the core schema language rho-df inference capabilities can be supported in EPP, thereby giving strategies for query-based reasoning. Finally, the authors present the results of experimental studies on: the overhead of compiling EPPs into SPARQL, the advantages of a purpose-built execution engine for EPPs over an off-the-shelf SPARQL engine, and the efficacy of query-based schema reasoning with EPP.

The paper is relatively well-written and organized. Furthermore, to my reading, all of the technical details are correct.

Regarding the contributions, I read this manuscript with several different hats on. First, as a language designer, I ask myself "What is innovative about EPP?". As a language, upon inspection EPP is essentially Tarski's Relation Algebra (RA) with transitive closure, extended with node predicates (i.e., a value selection operator on nodes, which in EPP is the "T" operator). The RA was proposed already in the 1940's and has been extensively studied ever since. For a modern presentation from a query language perspective, see:

* Relative expressive power of navigational querying on graphs. George Fletcher et al. Information Sciences, 2015.

* Relative expressive power of navigational querying on graphs using transitive closure. Dimitri Surinx et al. Logic Journal of the IGPL, 2015.

The node predicate is a fairly small extension, and doesn't really complicate the study of the expressivity or complexity of the language.

Next, as a language engineer, I ask myself "What is innovative about the execution strategies developed for EPP?". The translations into SPARQL are obvious and straightforward and do not provide any substantially new insights into the engineering of either language. The native execution strategy for iEPP follows directly the formal semantics of the language, and does not give us any special new tools for engineering efficient execution engines for path navigation.

Finally, as a community member, I ask myself "Does this work stimulate innovative exchanges between the data management and knowledge management research communities?". From the perspective of folks primarily working in the Knowledge Graph community, it is not clear to me what the interest would be in yet another query language for graphs (especially as EPP does not give us any substantially new leverage). From the perspective of folks primarily studying data management systems, it is not clear what new query formalisms or engineering insights can be gleaned from this particular bridge to SPARQL.

As the contributions in each of these perspectives are quite modest, it is not clear to me what major impact this work will have in the study of Knowledge Graphs. Consequently, I do not recommend the manuscript for publication.

A few minor comments follow.

-- what is a "knowledge graph"? You give a few examples of what might be knowledge graphs, but then it seems that knowledge graphs are just edge-labeled directed graphs ... sometimes represented in RDF. I would encourage you to not skirt the issue and take a stance here. (e.g., what are distinguishing new characteristics of knowledge graphs, from a query language design perspective?)

-- I find the leading example in Section 1, "find Italian exclusive friends" to be too vague as a first example. What is a friend? A friend with respect to who? I also think the further examples in Section 1.1 can be improved and simplified. For example the presentation of TP and T in Example 3 is very confusing.

-- In Section 8.1, you should give more details on q1-q3. It is hard to appreciate the results without having the concrete queries. Same comment applies for other queries used in later sections.