# Flexible Query Processing for SPARQL

### Tracking #: 965-2176

Authors:
Andrea Calì
Riccardo Frosini
Alexandra Poulovassilis
Peter T. Wood

Responsible editor:

Submission type:
Full Paper
Abstract:
Flexible querying techniques can enhance users' access to complex, heterogeneous datasets in settings such as RDF Linked Data where the user may not always know how a query should be formulated in order to retrieve the desired answers. This paper presents query processing algorithms for an extended fragment of SPARQL 1.1 incorporating regular path queries (property path queries), and query approximation and relaxation operators. Our flexible query processing approach is based on query rewriting and returns answers incrementally according to their "distance" from the exact form of the query. We formally show the soundness, completeness and termination properties of our query rewriting algorithm. We also present empirical results that show promising query processing performance for the extended language.
Revised Version:
Tags:
Reviewed

Decision/Status:
Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Michael Schmidt submitted on 22/Feb/2015
 Suggestion: Major Revision Review Comment: The submitted manuscript is an extension of the conference paper [5], presenting an extension of the SPARQL 1.1 language with the operators RELAX and APPROX for approximating queries that have few or no matching triples in the given RDF graph. The RELAX operation relies on the RDFS axioms of the RDF graph in replacing subclasses with superclasses and subproperties with super-properties, thus generalizing the query in order to obtain more results. The APPROX operator creates variations of property paths, in order to obtain more variable bindings. In [5], the complexity of SPARQL queries with the RELAX and APPROX opearators, called SPARQL^AR, has been considered. Since [5] also contains comprehensive proof sketches for the complexity results, these are not repeated in the submission. Instead, a concrete implementation of the RELAX and APPROX operators is given, and its correctness is proven formally. Answering queries with the extension operators is done via rewriting as a union of SPARQL 1.1 queries. The description of the rewriting algorithm and the proof of its correctness are the main contributions of the current submission compared to [5]. One can also note that both the current manuscript and [5] discuss a problem very closely related to the work [16] (by two co-authors of the current submission) where the same approach is applied to the language of CQs with regular property paths over RDF graphs. The submission addresses a relevant problem and provides an intuitive solution for it. The quality of presentation could be somewhat improved, but I do not see real problems in this respect. My main reservation towards this submission is the actual significance of its main technical contribution. The proposed rewriting algorithm basically restates the recursive definitions of the semantics of APPROX and RELAX operators (pp. 6,7, also cf. [5]), in pseudocode. The correctness proof then becomes a purely technical exercise in undwinding two recursive definitions using induction, unfolding shortcuts used in the semantics definition, and pointing out that if two queries are equal then so is their join resp. union (Lemma 2). In my impression, the experiments section is rather formal: 10 queries are evaluated (of which one is too heavy to be processed in reasonable time even in the absence of approximation/relaxation), and we see that flexible processing is viable for simpler queries but complicates the evaluation in the presence of Kleene star (Q5), unless the approximated/relaxed query happens to be empty (Q10). There are no lessons learned from the experiments, no optimizations implemented based on them, so this feels like a loose end to me. In overall, I cannot wholeheartedly recommend publication of this manuscript in its current state, since it does not provide substantial extensions to the already published work [5,16]. I do not believe that the translation of the semantics of APPROX and RELAX in pseudocode and giving a formal correctness proof of such a translation merits a separate SWJ publication, and that readers can learn something from this paper that they do not know from [5] or actually even [16], where all technical ingredients of the approach have been already described. My suggestion would be to invest some effort in optimizing the implementation, explore other relaxation/approximation possibilities or provide any other substantial delta w.r.t. [5,16]. Detailed comments: - Extended reduction rules seem to play no role in the correctness proof of queries with RELAX. Please comment on this and preferably give an example illustrating the necessity of extRed. - In Algorithm 2, is \in resp. \not\in in the "if" condition of the innermost foreach loop up to query equivalence or just syntactic match (e.g. string comparison?). The latter should be utterly inefficient for larger c (e.g., the algorithm as it is presented would keep approximating URIs in the disjunction that represents the wildcard "_"). This is related to the way the Theorem 4 is proven: your argument is that you associate a counter with each rewritten query and that this counter automatically increases whenever any query is added, and so the algorithm terminates since there is a cut-off value for c. This looks like a workaround: it would be much nicer to have an algorithm that does not rely on the cut-off value to terminate, which could be an indication that Algorithm 2 does not produce unnecessarily complex queries. If this is not desirable or limiting (say, for proper treatment of paths with the Kleene star you would need to generate ever increasing paths for some reason), this should be discussed. - Would it be difficult to actually add a look-up to Algorithm 1 and remove duplicates with non-minimal costs? - Is there a justification for approximating P* as P*/APPROX(P)/P* and, say, not as (APPROX(P))*? In general, a discussion on the appropriateness of the approximation procedure would be very welcome: so far, your solution looks rather arbitrary. - Proof of Theorem 3. The induction hypothesis is always redefined "on the fly", which does not fit the overall rigor of the proof. Say, on p13. the IH defined as (5), for some cost $c$ --- by the way, (5) is literally the same as (4), so you could be not repeating this equation --- and then on p14. you say that IH was for the cost $c-\alpha$ resp. $c-\beta$ or $c-\gamma$. Why not go from $c$ to $c+\alpha$ / $c+\beta$ / $c+\gamma$ in the induction step, also pointing out that these three cases are mutually independent? Minor comments: - What is the benefit of replacing APPROX() / RELAX() with labeling in Algorithm 2? You never use labels in the proofs. - Algorithm 2: nested if conditions like "(if \in queries then cost + cost' < cost' ')" inside another if condition feel strange. Consider rewording, e.g. " \in queries implies cost + cost' < cost' ' " -p13, beginning of the proof of Th3. Explain the notation A/R(t) using a concrete example. Just saying what A and R are standing for is not sufficient, since APPROX/RELAX is not defined and A/R reads like a path. - Th3: you always use Lemma 2 to justify that as long as respective operands in two join/union expressions are equal, these join/union expressions are equal as well, which is trivial. Lemma 2, formulated for the subset relation, just complicates things here. - p18, after eq. (33): I do not think you need to consider the special case of simple triple pattern, since it should be covered by the case when the pattern is relaxed resp. approximated. Typos: p4, second col: "for greater clarify" -> clarity p11. algorithm applyRelax: K needs to be passed as an argument to relaxTriplePattern p14. second col: "A(x,_,z).:" -> "A(x,_,z):" then in the equation immediately following : either add an opening parenthesis, or drop the one closing the first operand of the join. The same for the next equation with two CostProj functions. p18, second col: [[]Q']] has an extraneous "]" in a couple of equations (search for "[]")