Flexible Query Processing for SPARQL

Tracking #: 965-2176

Authors: 
Andrea Calì
Riccardo Frosini
Alexandra Poulovassilis
Peter T. Wood

Responsible editor: 
Guest Editors, Question Answering over Linked Data

Submission type: 
Full Paper
Abstract: 
Flexible querying techniques can enhance users' access to complex, heterogeneous datasets in settings such as RDF Linked Data where the user may not always know how a query should be formulated in order to retrieve the desired answers. This paper presents query processing algorithms for an extended fragment of SPARQL 1.1 incorporating regular path queries (property path queries), and query approximation and relaxation operators. Our flexible query processing approach is based on query rewriting and returns answers incrementally according to their "distance" from the exact form of the query. We formally show the soundness, completeness and termination properties of our query rewriting algorithm. We also present empirical results that show promising query processing performance for the extended language.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Michael Schmidt submitted on 22/Feb/2015
Suggestion:
Major Revision
Review Comment:

The paper presents an extension of a subset of SPARQL allowing for flexible querying of datasets. More precisely, the authors introduce two new operators, APPROX and RELAX, which can be wrapped around triple patterns (possibly including regular expressions similar in spirit to SPARQL 1.1 property paths) and either (i) approximate query results, in the sense that edit operations on predicates or predicate chains are probed during query evaluation, or (ii) relax the (ground) semantics of SPARQL by, in effect, applying RDFS rules in reverse over the query. It presents a formal semantics for the novel constructs, provides exhaustive complexity results for them (extending previous results for standard SPARQL) and contains an experimental evaluation. Generally speaking, the paper is very well written and the ideas are innovative, considerably extending the state of the art in the area of flexible querying over SPARQL. The chosen constructs and mechanisms (e.g., the implicit use of RDFS semantics) are reasonable and exploit the characteristics of semantic data formats, going beyond what one can do in, for instance, the traditional relational context.
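
To make the two operators concrete, here is a small illustration (my own sketch in Python; the predicates and the RDFS axiom are invented, and the syntax is paraphrased from the paper's description, not copied from it):

    # Sketch only: predicates and the sp-axiom are invented for illustration.
    # The operators wrap triple patterns, which may contain property paths:
    annotated = "RELAX(?x actedIn/hasLocation ?city)"

    # RELAX applies RDFS rules in reverse: assuming the ontology contains the
    # axiom (actedIn sp contributedTo), one relaxation step (at some cost)
    # replaces the sub-property by its super-property:
    relaxed = "?x contributedTo/hasLocation ?city"

    # APPROX instead probes edit operations on the predicate chain, e.g.
    # substituting a step by the any-predicate wildcard '_':
    approxed = "?x actedIn/_ ?city"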

I could well imagine that these constructs are particularly interesting for data scientists or application developers who are concerned with writing SPARQL queries frequently: without detailed knowledge of the ontology and/or dataset, the constructs make it much easier to develop queries. However, it should also be noted that it may be hard to interpret query results, as they might build upon predicates different from those mentioned in the original query (and hence the semantics of the results might be somewhat different from what the query author was asking for). Therefore, I think it would be very important in practice to couple results to provenance mechanisms, associating the rewritten queries with the respective (sub)results. This is an orthogonal issue, but I feel it is very important and should be discussed in the paper.

While the complexity results show that, from a theoretical point of view, the constructs do not increase the complexity of query answering, I am a bit sceptical about the practical query-enumeration approach to evaluation proposed by the authors. In particular, I see several problems here:

- If I understand correctly, there might generally be an exponential blowup in the number of generated queries, bounded only by the fixed edit cost (e.g., with n approximable triple patterns and k single-step rewrites each, up to (k+1)^n distinct queries). In particular, when the cost is not bounded, the space of queries might be infinite (due to, for instance, constructs such as P*).
- As the authors also argue in the evaluation, this approach has its limits for queries with many triple patterns and/or many/complex APPROX/RELAX operations.
- At some point, the authors also discuss that generated queries typically exhibit a common substructure (implicit in the rewriting algorithm, which performs incremental modifications to relaxed or approximated triple patterns). I wonder whether it would be more appropriate to perform the approximation/relaxation locally at triple pattern level, rather than computing all possible permutations at query level (see the sketch after this list). I understand that this would make the use of existing query engines more difficult (and might instead require implementing the operators "natively" in query engines), but it might dramatically speed up query evaluation. Whether implemented or not, I would really like to see this approach (and possible limitations/problems/pitfalls) discussed in the paper.
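
One way to read this suggestion, as a hypothetical sketch (Python; pattern_rewrites and its costs are invented placeholders, not the paper's operators): generate rewrites per triple pattern and combine them best-first by total cost, instead of materialising all query-level permutations up front.

    from heapq import heappush, heappop

    def pattern_rewrites(tp):
        """Sorted (cost, pattern) rewrites of ONE triple pattern; a placeholder
        for the per-pattern APPROX/RELAX steps."""
        return [(0, tp), (1, tp + " {relaxed}"), (2, tp + " {approxed}")]

    def queries_best_first(triple_patterns, max_cost):
        """Yield whole-query rewritings in non-decreasing total cost."""
        lists = [pattern_rewrites(tp) for tp in triple_patterns]
        start = (0,) * len(lists)
        heap = [(sum(l[0][0] for l in lists), start)]
        seen = {start}
        while heap:
            cost, idx = heappop(heap)
            if cost > max_cost:
                break
            yield cost, [lists[i][j][1] for i, j in enumerate(idx)]
            for i, j in enumerate(idx):  # advance one pattern at a time
                if j + 1 < len(lists[i]):
                    nxt = idx[:i] + (j + 1,) + idx[i + 1:]
                    if nxt not in seen:
                        seen.add(nxt)
                        delta = lists[i][j + 1][0] - lists[i][j][0]
                        heappush(heap, (cost + delta, nxt))

Only rewritings up to the cost bound are ever materialised, and the exact query comes out first.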

Another open question is the extensibility of the algorithm. Important SPARQL operators like UNION and OPTIONAL are not considered in the paper. Are there any obstacles to adding such constructs? Further, I feel that the OPTIONAL operator bears great potential for query relaxation (in the sense that the absence of triples might not prevent results from being returned), i.e., rewriting AND clauses into OPTIONAL clauses would be a very interesting relaxation approach in practice; a sketch follows below. A discussion of such issues (and possible plans/suggestions for future work) would be welcome.
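
For instance (my sketch; the predicates are invented), the AND-to-OPTIONAL relaxation could look as follows, so that a director whose films won no award is still returned, with ?award unbound:

    exact_query = """
    SELECT ?x ?award WHERE {
      ?x :directed ?film .
      ?film :wonAward ?award .
    }"""

    relaxed_query = """
    SELECT ?x ?award WHERE {
      ?x :directed ?film .
      OPTIONAL { ?film :wonAward ?award . }
    }"""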

Finally, regarding the experiments, I feel like these are very hard to interpret for several reasons:

- The discussion should also elaborate on generated A/R queries, why they deliver more results, and what kind of results are added (from a more practical, user perspective), at least for selected queries
- It would be nice to indicate the number of queries that were generated/evaluated in the A/R scenario
- The number of returned results should be indicated in the table as well, for both the baseline query and the A/R query
- In general, it would be helpful to better understand the characteristics of feasible vs. non-feasible settings, identifying characteristics that cause problems and those that do not (e.g., number of relaxed queries, size of the ontology/subclass/subproperty hierarchies, complexity of the base query, execution time of baseline queries, etc.); the claim that "the difference between the execution time of the exact form and the approxed/relaxed form of the queries is acceptable for queries with less than 7 conjuncts" seems way too general to me.

Overall, I think this is a very interesting contribution, containing many promising ideas and first insights regarding practicability and performance. I would, however, like to see a considerably improved/extended version of the experiments, and some of the open issues mentioned above discussed in the paper, prior to publication.

Minor issues:
- Def. 6: the domain of P_1, P_2 should be made explicit
- In Sec. 3.2.1, an example would be helpful
- Sec. 3.2.2: it should be mentioned that cl(K_1) is always finite (given that K_1 is finite)
- The first query in Ex. 3: it does not really make sense to ask for an actor who “wasCreatedOnDate”; the example to a certain degree puts your approach in question, so it would be good to have an example that shows “useful” applications of your approach
- The proof in Sec. 4.1 is long and very technical, not providing many new insights. I would propose to shorten it, restricting to the key ideas and demonstrating the induction only for the most interesting (rather than all) cases; or, alternatively, moving it to an Appendix.
- Equations (4) and (5) are identical, so (5) could be dropped

Review #2
By Aidan Hogan submitted on 07/Apr/2015
Suggestion:
Minor Revision
Review Comment:

In this paper, the authors propose a method for approximating and/or relaxing SPARQL queries so as to find more answers. The idea of relaxation is to generalise the triple patterns in the query using an RDF schema; for example, to search for super-properties or super-classes of those specified in the query. The idea of approximation is to replace predicates in the query by less restrictive patterns, such as match any predicate in the dataset. Approximation and relaxation are defined for a fragment of SPARQL that includes basic triple patterns, AND, FILTERS, and regular expressions over predicates similar to property paths in SPARQL 1.1. This fragment of SPARQL was previously introduced in a conference paper where the authors proved some results on the evaluation of queries (checking if a given result is contained in the results of a given query over the given graph/ontology) with different combinations of features. In this paper, the authors focus on providing a relaxation/approximation algorithm and showing its soundness and completeness for the fragment defined (with approx/relax operators). Some experimental results are also provided for a set of 10 queries over the YAGO knowledge-base.

The paper appears to be on topic for the Semantic Web Journal. I really like the problem area attacked: I think it is important for there to be tools that help users to get useful answers without needing to know the exact crisp form of the SPARQL query they need. The literature seems to be well-covered. The paper is also quite well-written. Although large parts of the paper are rather dry and technical, the arguments are well laid out and thus quite easy to follow. As far as I can tell, the paper appears technically correct. As such, I lean towards accepting the paper once some minor revisions are made. However, the paper and the work described do have some significant weaknesses.

In terms of the paper, it has a mix of theoretical and practical contributions, but for me, neither is really a major contribution. The main contribution is the provision of an algorithm and a proof that it is sound and complete; this section fills seven pages (pages that are quite easy to follow), but it feels like the authors are just mapping one syntax (the definition) to another (the formal algorithm) without deriving any real insight in the process. The result is that, for me, the main contribution, though well-executed and perhaps even necessary in some sense, is a bit uninteresting and dry.

In terms of the practical contributions, I feel it's hard to get a sense of how the approximation and relaxation operators perform from the examples given, in that I don't understand the rationale behind the queries that the authors chose to test. I think if you want to understand the behaviour of the relaxation and approximation operators, you would need to look at a lot of examples of single-triple-pattern relaxations and approximations (e.g., for every predicate in YAGO) and look at the runtimes, queries generated, and result sizes. Aside from the evaluation, the paper itself does not seem to care all that much about practical aspects: it does not seem practical, for example, to play with triple patterns in the query until you have multiple different types of queries and then return all the answers for all those queries to the user (asking the user to specify which triple patterns to relax/approximate likewise appears to be a chicken-and-egg sort of problem ... how should they know which ones to relax/approximate?). I'm not at all convinced that the system resulting from this algorithm would be useful in practice. The examples, though illustrative, are also a bit contrived (e.g., in Example 4, the predicate "isPoliticianOf" just happens to be approximated by the predicate "holdsPoliticalPosition", even though in practice approximation is not guided towards any particular predicate).

My last major concern – one that I ask the authors to address as part of a revision – is that the paper is not careful enough in characterising which parts of which standards are covered. The authors say in the introduction that they "describe an extension of SPARQL 1.1 with query approximation and query relaxation operators" and I thought, wow, that's pretty impressive. But in the end the authors have a very limited fragment of SPARQL (AND, FILTERS, SELECT) that doesn't even include OPTIONAL or UNION from SPARQL 1.0, let alone all the features of SPARQL 1.1 (aggregates, sub-queries, federation, values, variable assignment); likewise the regular path expressions that the authors use are quite different from SPARQL 1.1 property paths. Later on, the authors talk about RDF entailment, referring to six core entailment rules. But standard RDFS entailment involves quite a bit more than just this: it incorporates notions like simple entailment (which is arguably not relevant here if blank nodes are ignored), datatype entailment, etc. There are 13 ground rules in total. The authors do later mention that they cover a fragment of RDFS, but that should be made clear from the outset. I would thus ask the authors to be more careful and to say that they extend fragments of SPARQL 1.1 and fragments of RDFS (it looks like the fragment rho-DF described in [1]) consistently from the outset ... and to make other restrictions clear up front (e.g., no blank nodes, acyclic ontologies, etc.).

Despite these concerns, I am still leaning towards accepting the paper given that I feel it's an important problem and that this work could serve as a basis for other future works. Hence I would lean towards a "weak accept" with minor revisions.

COMMENTS TO ADDRESS:
* As mentioned above, distinguish between fragments of standardised languages and the languages themselves wherever necessary.
* "RDF Linked Data" .. the term sounds a bit awkward to me in that RDF is a redundant modifier. Maybe just "Linked Data"?
* Example 2: I don't like the use of disjunction in the example since it means one could not distinguish latitude from longitude in the results. (But okay.)
* "In an RDF-graph ..." A node in an RDF graph can also be a property ... but the language here seems strange. In RDF, there is no strong distinction between properties, classes and instances.
* After Definition 7, you should really clarify that you will not support other features like OPTIONAL, UNION, etc.
* When you mention cost in 3.2, it would be good to give the intuition of what cost refers to (for example, to be explicit on what sorts of values it takes, whether lower is better or not, what a 0 means, etc.) This would make it easier to follow the formal material directly afterwards.
* In section 3.2.1, there's an expression *[[]]_G ... is that supposed to be a Kleene star or what is it referring to?
* "A mapping satisfies a condition ..." Why does c need to be a literal in the first condition? Why can it not be a URI/IRI?
* "apply the rules of XXX in reverse until no longer applicable" I think this could do with more elaboration as to what it means to apply a rule in reverse. Likewise I'd like to see a short argument that the result of this "redundancy elimination" is unique in this case.
* "K needs to be acyclical in order for direct relaxation to be well-defined" That seems like a big assumption, no? Why does it need to be acyclical? What happens if it isn't? What about reflexive sc or sp properties?
* "by applying rule i from Figure 1" ... over a single triple pattern with variables? What does that mean? This needs to be elaborated on more.
* "we actually use (d,type-,y) ..." actually it's still unclear to me why you need to flip the type predicate specifically. Where is the asymmetry in the relaxation step? Could you elaborate or provide more of a hint on that, maybe with an example?
* Maybe the proofs of the theorems should be replicated in the journal paper? A journal paper should be as self-contained as possible in my mind. For example, I did not immediately get why adding SELECT made things so much harder (of course it's because you may not have the full mapping to ground the query) ... I had to go to the conference paper for that. Having the proofs (maybe as an appendix) would be nice I think.
* Example 5 has four pages between its start and its end meaning a bit of unnecessary flipping back and forward.
* Algorithm 2 has a weird if clause close to the bottom in italics in brackets ... I didn't get this.
* I find it strange in relaxation that you can only relax based on domain or range if a constant is given. What's the idea behind that?
* The material on the left column of page 14 is very cramped.
* On page 17, mention what the \mathcal{L} function means.
* Section 5: why not also include the LUBM results in this paper?
* "For each query [we] ran"
* In your evaluation, can all approximated queries be written as SPARQL queries? How can you perform, e.g., '_' in general? Do you need a list of all predicates a priori? (See the sketch after this list.)
* Figures 4 and 5 are tables.
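
Regarding the '_' question above, two possible list-free encodings come to mind (my sketch, with an invented predicate; whether the authors' rewriting instead uses an a priori predicate list is exactly the question):

    # (1) As a standalone step, a fresh predicate variable suffices:
    wildcard_as_variable = "SELECT ?x ?y WHERE { ?x ?p0 ?y . }"

    # (2) Inside a SPARQL 1.1 property path, where variables are not allowed,
    #     a negated property set over an IRI known not to occur matches any
    #     predicate without enumerating them:
    wildcard_in_path = "SELECT ?x ?y WHERE { ?x :p/!(:doesNotOccur) ?y . }"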

[1] Sergio Muñoz, Jorge Pérez, Claudio Gutierrez: Simple and Efficient Minimal RDFS. J. Web Sem. 7(3): 220-234 (2009)

Review #3
Anonymous submitted on 16/Apr/2015
Suggestion:
Major Revision
Review Comment:

The submitted manuscript is an extension of the conference paper [5], presenting an extension of the SPARQL 1.1 language with the operators RELAX and APPROX for approximating queries that have few or no matching triples in the given RDF graph. The RELAX operation relies on the RDFS axioms of the RDF graph in replacing subclasses with superclasses and subproperties with super-properties, thus generalizing the query in order to obtain more results. The APPROX operator creates variations of property paths, in order to obtain more variable bindings.

In [5], the complexity of SPARQL queries with the RELAX and APPROX operators, called SPARQL^AR, has been considered. Since [5] also contains comprehensive proof sketches for the complexity results, these are not repeated in the submission. Instead, a concrete implementation of the RELAX and APPROX operators is given, and its correctness is proven formally. Answering queries with the extension operators is done via rewriting as a union of SPARQL 1.1 queries. The description of the rewriting algorithm and the proof of its correctness are the main contributions of the current submission compared to [5]. One can also note that both the current manuscript and [5] discuss a problem very closely related to the work [16] (by two co-authors of the current submission), where the same approach is applied to the language of CQs with regular property paths over RDF graphs.

The submission addresses a relevant problem and provides an intuitive solution for it. The quality of presentation could be somewhat improved, but I do not see real problems in this respect. My main reservation towards this submission is the actual significance of its main technical contribution. The proposed rewriting algorithm basically restates, in pseudocode, the recursive definitions of the semantics of the APPROX and RELAX operators (pp. 6-7, also cf. [5]). The correctness proof then becomes a purely technical exercise in unwinding two recursive definitions using induction, unfolding shortcuts used in the semantics definition, and pointing out that if two queries are equal then so is their join resp. union (Lemma 2).

In my impression, the experiments section is rather formal: 10 queries are evaluated (of which one is too heavy to be processed in reasonable time even in the absence of approximation/relaxation), and we see that flexible processing is viable for simpler queries but complicates the evaluation in the presence of Kleene star (Q5), unless the approximated/relaxed query happens to be empty (Q10). There are no lessons learned from the experiments, no optimizations implemented based on them, so this feels like a loose end to me.

Overall, I cannot wholeheartedly recommend publication of this manuscript in its current state, since it does not provide substantial extensions to the already published work [5,16]. I do not believe that the translation of the semantics of APPROX and RELAX into pseudocode, together with a formal correctness proof of such a translation, merits a separate SWJ publication, or that readers can learn something from this paper that they do not know from [5] or actually even [16], where all technical ingredients of the approach have already been described. My suggestion would be to invest some effort in optimizing the implementation, explore other relaxation/approximation possibilities, or provide some other substantial delta w.r.t. [5,16].

Detailed comments:

- Extended reduction rules seem to play no role in the correctness proof of queries with RELAX. Please comment on this and preferably give an example illustrating the necessity of extRed.

- In Algorithm 2, is \in resp. \not\in in the "if" condition of the innermost foreach loop up to query equivalence, or just a syntactic match (e.g., string comparison)? The latter would be utterly inefficient for larger c (e.g., the algorithm as it is presented would keep approximating URIs in the disjunction that represents the wildcard "_"). This is related to the way Theorem 4 is proven: your argument is that you associate a counter with each rewritten query and that this counter automatically increases whenever any query is added, and so the algorithm terminates since there is a cut-off value for c. This looks like a workaround: it would be much nicer to have an algorithm that does not rely on the cut-off value to terminate, which would also be an indication that Algorithm 2 does not produce unnecessarily complex queries. If this is not desirable or is limiting (say, if for proper treatment of paths with the Kleene star you need to generate ever-increasing paths for some reason), this should be discussed.

- Would it be difficult to actually add a look-up to Algorithm 1 and remove duplicates with non-minimal costs?
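
A minimal sketch of such a look-up (Python; dedup_min_cost and canonical_form are hypothetical names, not part of Algorithm 1):

    def canonical_form(query):
        # Placeholder: real equivalence checking is harder (cf. the \in test
        # in Algorithm 2); normalising the string is a cheap approximation.
        return " ".join(sorted(query.split()))

    def dedup_min_cost(rewrites):
        """rewrites: iterable of (query, cost) pairs produced by the rewriting
        algorithm; keeps each query once, at its minimal cost."""
        best = {}
        for query, cost in rewrites:
            key = canonical_form(query)
            if key not in best or cost < best[key][1]:
                best[key] = (query, cost)
        return sorted(best.values(), key=lambda qc: qc[1])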

- Is there a justification for approximating P* as P*/APPROX(P)/P* and, say, not as (APPROX(P))*? In general, a discussion on the appropriateness of the approximation procedure would be very welcome: so far, your solution looks rather arbitrary.
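
For reference, the two alternatives side by side (illustrative strings only; the trade-off noted in the comments is my speculation):

    paper_choice = "P*/APPROX(P)/P*"  # at most one edited step inside the chain
    alternative = "(APPROX(P))*"      # every iteration may be edited; the total
                                      # edit cost of a match would then be
                                      # unbounded, which may be a reason for
                                      # applying a single edit instead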

- Proof of Theorem 3. The induction hypothesis is always redefined "on the fly", which does not fit the overall rigor of the proof. Say, on p. 13 the IH is defined as (5), for some cost $c$ --- by the way, (5) is literally the same as (4), so you need not repeat this equation --- and then on p. 14 you say that the IH was for the cost $c-\alpha$ resp. $c-\beta$ or $c-\gamma$. Why not go from $c$ to $c+\alpha$ / $c+\beta$ / $c+\gamma$ in the induction step, also pointing out that these three cases are mutually independent?

Minor comments:

- What is the benefit of replacing APPROX() / RELAX() with labeling in Algorithm 2? You never use labels in the proofs.

- Algorithm 2: nested if conditions like "(if \in queries then cost + cost' < cost'')" inside another if condition feel strange. Consider rewording, e.g. " \in queries implies cost + cost' < cost'' "

- p13, beginning of the proof of Th3. Explain the notation A/R(t) using a concrete example. Just saying what A and R stand for is not sufficient, since APPROX/RELAX is not defined and A/R reads like a path.

- Th3: you always use Lemma 2 to justify that as long as respective operands in two join/union expressions are equal, these join/union expressions are equal as well, which is trivial. Lemma 2, formulated for the subset relation, just complicates things here.

- p18, after eq. (33): I do not think you need to consider the special case of a simple triple pattern, since it should be covered by the case where the pattern is relaxed resp. approximated.

Typos:

p4, second col: "for greater clarify" -> clarity

p11, algorithm applyRelax: K needs to be passed as an argument to relaxTriplePattern

p14, second col: "A(x,_,z).:" -> "A(x,_,z):"

Then, in the equation immediately following: either add an opening parenthesis, or drop the one closing the first operand of the join. The same for the next equation with two CostProj functions.

p18, second col: [[]Q']] has an extraneous "]" in a couple of equations (search for "[]")