SPARQL with Property Paths on the Web

Tracking #: 1264-2476

Authors: 
Olaf Hartig
Giuseppe Pirrò

Responsible editor: 
Guest Editors ESWC2015

Submission type: 
Full Paper
Abstract: 
Linked Data on the Web represents an immense source of knowledge suitable to be automatically processed and queried. In this respect, there are different approaches for Linked Data querying that differ on the degree of centralization adopted. On one hand, the SPARQL query language, originally defined for querying single datasets, has been enhanced with features to query federations of datasets; however, this attempt is not sufficient to cope with the distributed nature of data sources available as Linked Data. On the other hand, extensions or variations of SPARQL try to find trade-offs between centralized and fully distributed querying. The idea is to partially move the computational load from the servers to the clients. Despite the variety and the relative merits of these approaches, as of today, there is no standard language for querying Linked Data on the Web. A specific requirement for such a language to capture the distributed, graph-like nature of Linked Data sources on the Web is a support of graph navigation. Recently, SPARQL has been extended with a navigational feature called property paths (PPs). However, the semantics of SPARQL restricts the scope of navigation via PPs to single RDF graphs. This restriction limits the applicability of PPs for querying distributed Linked Data sources on the Web. To fill this gap, in this paper we provide formal foundations for evaluating PPs on the Web, thus contributing to the definition of a query language for Linked Data. We first introduce a family of reachability-based query semantics for PPs that distinguish between navigation on the Web and navigation at the data level. Thereafter, we consider another, alternative query semantics that couples Web graph navigation and data level navigation; we call it context-based semantics. Given these semantics we find that for some PP-based SPARQL queries a complete evaluation on the Web is not possible. To study this phenomenon we introduce a notion of Web-safeness of queries, and prove a decidable syntactic property that enables systems to identify queries that are Web-safe. In addition to establishing these formal foundations, we conducted an experimental comparison of the context-based semantics and a reachability-based semantics. Our experiments show that when evaluating a PP-based query under the context-based semantics one experiences a significantly smaller number of dereferencing operations, but the computed query result may contain less solutions.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Pedro Szekely submitted on 13/Jan/2016
Suggestion:
Minor Revision
Review Comment:

This is an excellent paper on a very important topic. The authors address the problem of answering queries over linked data documents on the web. In this setting the processor must discover the documents that contain the relevant triples, download the documents, determine other documents that may contain relevant triples to resolve the queries, and compute the solution to the queries. Until recently, this problem has received little attention even though it is a crucial problem for the linked data community. The authors do a beautiful job in formalizing the problem, contrasting it with the centralized query answering problem, and presenting formal models for query answering in this realistic setting.

The paper is beautifully written. The paper introduces a simple yet rich example at the beginning, and uses it to illustrate and motivate all the definitions and theorems in the paper. The examples present the intuition behind the formal and sometimes difficult to follow definitions and proofs. The examples make the paper accessible to readers interested in the topic who would otherwise have trouble following the paper and understanding the significance of the results. At the same time, the paper presents the formal definitions and proofs for experts in the area.

The main contribution of the paper is the definition of a family of semantics for property paths in the linked data setting where triples are stored in distributed documents that must be discovered, downloaded and inspected to build the graph being queried. The paper presents impossibility results that show that the usual semantics of property paths in centralized RDF graph is impractical in the linked data setting. The paper then introduces a notion of Web safeness to characterize semantics that can be evaluated in finite time in the linked data setting. The paper then explores a specific Web safe semantics called context-based semantics that couples navigation and evaluation of queries in the discovered RDF graph and is Web safe. The paper also introduces a simple syntactic test that makes it possible to easily verify whether a query can be evaluated safely. The paper also formally defines the semantics of SPARQL based on the property paths, and so presents a semantics for safely evaluating SPARQL queries on in linked data.

The paper also presents a small empirical study that compares the number of derefencing operations and the number of solutions of two semantics, the context-based semantics and a semantics that derefences a larger number of documents. As expected, the context-based semantics yields a smaller number of solutions, but the difference in number of derefenced documents is striking. The discussion and inutition in the experiments is adequate. While a more comprehensive empirical would be ideal, I think the study is sufficient to illustrate the theory concepts in the paper and inspire others to explore the empirical aspects more deeply.

There are two topics that I would encourage the authors to discuss in more detail.

The first one is an elaboration of the context-based semantics having to do with the types of triples contained in each document. The authors make no assumptions in this respect, but perhaps it is possible to derive a set of guidelines for linked data authors that make it possible for the context-based semantics to yield a larger number of solutions. Consider document d_E in the example in figure 4. d_E, as the name suggests, is the document that contains information about Eve, noting that Eve knows Alice. This document also contains a triple specifying that Alice knows Eve. One could expect that statements about Alice are in d_A. If Eve wants to make a statement about Alice, she could introduce a new URI for Alice that derefences to d_E and include the statements about Alice in d_E. Eve would provide a same_as statement between the two Alice URIs. Alice may or may not include the equivalent same_as statement in d_A. An interesting question is whether such an approach leads to more complete answers under the context-based semantics.

It would be good to discuss the theory implications for linked data publishers, perhaps as a set of publishing guidelines that lead to more complete answers and perhaps to more predictable answers where the number of answers is less affected by the number of documents derefenced.

The second topic is about the relationship between the web of linked data, the context-based semantics and the well studied area of web crawling. Web crawlers work in the same way as the web model described in this paper. They download pages, they select a subset of the links to follow and continue downloading pages. It is common for web crawlers to use URL patterns to determine the links to follow. Often these patterns are used to limit crawlers to a subset of the web, keeping them focused. A related approach used in so called focused crawlers is to extract topics from web pages and to filter out pages whose topics are not compatible with a set of desired topics. In these systems, the query is defined as a set of topics. Significant work is being done to discover the seeds for focused crawling. There is the possibility of interesting cross-fertilization between these two communities, so pointing out some of the relationships could be a good start.

A few additional comments regarding grammar and wording are included as comments in the PDF document.

In summary, this is an excellent paper and I recommend accepting it with minor revisions.

Review #2
By Jérôme Euzenat submitted on 20/Jan/2016
Suggestion:
Accept
Review Comment:

The paper discusses the use of SPARQL queries with path patterns in a distributed evaluation context. It first discuss the impossibility to evaluate in a formal way SPARQL queries against the web. It then defines three different semantics for the evaluation of such path patterns and extend them to the usual SPARQL connectors. Finally, because some of these semantics are still not simply practical, the paper defines the notion of Web-safe queries. The paper ends by two experiments which are very illustrative of the properties of the different semantics.

The paper is an extension of a previous ESWC paper. This is a very welcome extension. I find it far more clearer than the initial paper.

This paper draws clean formal foundations for the distributed evaluation of SPARQL queries. The proposed semantics enable the implementation of evaluators along sensible precise rules. Most decisions are convincing and well explained. The experimental evaluation is also surprisingly insightful. So, it definitely should be accepted.

In make hereafter a few form reservation. I would love if the authors could take them into account, but they are no obstacle to the publication of the paper.

In several instances, the authors would do a good job in connecting the concepts that they develop to other ones which are well known in the domain and which are barely mentioned in the document:
- I would consider that the context-based semantics formalises the follow-your-nose way of traversing the semantic web, why is this not mentioned?
- dereferencing is only mentioned in the second part (after p16 if I read well). However, it seems to me that dom\not\bot is the set of dereferenceable IRIs. Why not saying so?
- a similar easy explanation is to explain the meaning of web-safe queries as those queries in which any variable used in subject is necessarily previously bound during the evaluation of the query... not such a simple explanation is given to the reader.
It is possible, that the explanations that I offer are incorrect. However, it would still be nice to explicitly explain why.

There are, in my opinion, other vocabulary problems:
- in the abstract, and later, the oposition is between "data sources available as linked data"; "on the web", "at the data level", "web-graph navigation". Nothing is clear here. People who query dbpedia in a SPARQL endpoint think, legitimately, that they are querying linked data... I appreciate the difficulty of adopting an adequate vacabulary, but we are not there yet. I feel that the (three) key opposition are between; distributed non-specified data sources (or distributed open-ended querying), federated dataset querying (that is distributed specified) and centralised querying.
- similarly, opposing "reachability-based" and "context-based" is misleading because "context-based" is also "reachability-based" but in another way.
- finally, the term Web-safe, is not really appealing. There could be many ways to be web-safe and this is only a specific one. Safe-navigational queries could be an option (there are others below).

I have a few questions about the experimental setting:
- is there any cache enabled or not?
- is it possible to display the proportion of duplicate answers? (and following the discussion below, several categories of those if there are).

Suggestions:
- First, the resort to Figure for expressing the most important part of definitions is ugly. This may be due to the format, but this is ugly anyway.
- on page 3, it would be good to have a picture of the "web of documents" and web of data illustrating minimally their features (i.e., less complex than Figure 4. I am also surprised that the Web of documents seems to exclude SPARQL endpoints.
- p4 "PP patterns with blank nodes can be simulated using fresh variables" not in predicate position.
- p5: the result contains THE solution mapping
- P7, 4.2: The definition of ScP-reachable should be a definition. Moreover it is difficult to read. At least put ``d' is ScP-reacheable in W'' right after ``d'\in D''
- p8, l4: "we defineD"
- p8: cPPMatch should also be a definition
- Definition 10. This is ugly because the reachability condition depends on P while it seems like a function [[.]]_Gr was defined that which is not true.
- p10 "both patterns are semantically equivalent under context-based semantics" it seems that they are equivalent in all semantics defined in this paper no?
- in separation -> in isolation?
- p11 and followers: it would be clearer to keep P for path pattern and to use the symbol G for graph patterns
- One could use \phi-safeness since this property actually depends on the used semantics. That would have the merit to recall that web-safeness is not an intrinsic property but that this depends on the given semantics.
- Table 1 contains redundencies. Indeed, 15 entails 16, 10 entails 11 and 10 entails 12. So, the meaning of this table is the same if 15 and 10 are suppressed.
- note 28: recursive: recursively enumerable? Well we are talking about a definition, not a function.
- for experimental results, please use patterns because in grayscale, it is difficult to distinguish colors.
- p18: "duplicates which results from finding a greater number of alternative path"? I am surprised. I though that duplicate could not be generated through "path" in SPARQL 1.1 semantics, only through the use of operators UNION, AND (as presented on p5)? How can duplicate through path happens? It would be interesting to compare this with the set-semantics instead of the pseudo-bag-semantics.

Review #3
By Oscar Corcho submitted on 01/May/2016
Suggestion:
Accept
Review Comment:

General comments:
This is a very relevant paper for scientists and practitioners who are interested in understanding and using SPARQL as the query language for Linked Data sources. Although there have been previous papers (some of them published in the Semantic Web journal, including at least one of the authors of this paper) on this general topic, this paper focuses on adding a feature, property paths, that appeared in the SPARQL1.1 specification, reflects clearly the need to traverse the graph nature of Linked Data on the Web, and unfortunately, most probably because of a syntactic restriction, did not allow in the case of query federation to consider easily the traversal over different graphs. In any case, this last drawback is indeed not a drawback in the context of Linked Data, where the notion of RDF graphs is not so relevant.

Now I provide a single statement about each of the three main features to be discussed in reviews for full papers at SWJ:
(1) Originality: As discussed above, this is not the first paper that discusses how to use SPARQL for querying Linked Data, and goes into the understanding of the formal foundations of SPARQL querying over Linked Data. However, this paper adds the analysis of the simple but very powerful feature of property paths, and for this it introduces a family of reachability-based semantics, the notion of Web-safeness and the demonstration of the existence of a syntactic property that allows determining the web-safeness of a query. This is considered enough for a new publication on this area, especially on top of the corresponding previous conference paper.

(2) Significance of the results: With an increasing set of approaches and tools exploring the space of Linked Data querying in contrast with the federated query processing approaches over SPARQL endpoints that were explored some years ago, these foundational results are very relevant to understand the characteristics and limits of SPARQL querying over Linked Data. Some of the proposed formalisations are even directly usable for those approaches (e.g., the Web-safeness proof).

(3) Quality of writing: The paper is easy to read (obviously, dense as any formal work of this type) and the formalisations are well done. The paper presents clearly the foundations behind the formalisations that are presented, so that it allows understanding the main advances presented here. The paper would only benefit from some additional examples, which may be obviously omitted to avoid having an excessive length.

Additional comments:
As it can be seen from my comments above, I have really enjoyed reading this very well-written paper, which I acknowledge that provides a very clear view on the current state of the art in Linked Data querying, and on the formalisations required for dealing with property paths.

Now I will go for some very detailed comments on aspects that I think that may be errors, or that at least would require some further explanation.

In section 3.1, on the second paragraph, you say that alpha belongs to the set I union L union V. I may be wrong, but I cannot see how L can be included here, as literals are no allowed to be the subject of triples in RDF.

In page 6, just above Definition 5, the authors include a formula {s,p,o} intersection {s',p',o'} intersection B not equal empty set. However, I cannot see how these sets can be intersected. I think that you are abusing notation here. I suggest reviewing and possibly rewriting this formula.

This is a more general statement. You aim at allowing that in any RDF document there may be triples whose subject is different from the URI that one is reaching. I know that this is not unnormal in Linked Data querying, since sometimes we need to add triples about subjects that are not the ones that we are exposing, for completeness sake, or just for convenience, but I wonder whether this too general approach is actually impacting too much your formalisation. I would feel even more confident with a formalisation that only focuses on accessing triples that have as subject the URI that you have just de-referenced. Just an opinion (I am happy to accept the another option). This came to my mind when checking the last paragraph in definition 5, where you comment about some u belonging to {s,p,o}, where I would forget about the s.

I can say that I really liked section 4 on Web-aware semantics of property paths. However, at the end there is only one place where I would need a further explanation, which is when you say "For instance, the PP pattern .... context-based semantics)". I must admit that I did not see it very clearly.

In terms of section 5, I have only one concern for which I would like an explanation from the authors (I was even about to put a minor revision comment in my suggestion so as to force that answer, which I am sure that will come). You comment on the footnote of page 10 that adding some features of SPARQL would just be an exercise without major implications on your results. However, I have the impression that it is not so easy, and I may ask for a tech-report or alike (a 1-page demostration), to indicate how you would deal with triple patterns with a variable as predicate. I really like to understand whether that one is so easy to handle.

In definition 29 you have not defined the symbol "tilde".

One simple thing for the evaluation (I am very sensitive to this since I have been recently working with early-stage PhD students on defining their evaluation setups). I do not think that the characteristics of the computer setup that you are using in section 7 is very relevant taking into account the experiments that you do. I would remove the comment on the MacBook Pro. Other than that, I was not thinking that any experimentation was actually needed for this paper, but I found the experiments, which are correctly called experiments rather than evaluation, something that I like a lot, very useful to let people understand the expected behaviour of the different approaches discussed.

Typos (some of them on section 6 - please recheck that section for grammar and typos):
- choses --> chooses
- algorithm consist of --> consists