A logical characterisation of distributed SPARQL processing

Tracking #: 632-1842

Audun Stolpe

Responsible editor: 
Oscar Corcho

Submission type: 
Full Paper
The paper provides a logical characterisation of distributed processing of SPARQL conjunctive queries. The principal notion analysed is that of a distribution scheme, which is a pair consisting of an evaluation rule and a distribution function. Different choices of evaluation rules and distribution functions give different federation schemes that are proved to be sound and complete wrt. different sets of clusters. A cluster is here understood as a finite set of RDF sources. Three distribution functions are singled out for particular attention, two of which are implicit in several examples from the literature. These are named the even-, standard- and prudent distribution and yield sound and complete federation schemes when paired with an evaluation rule named the collect-and-combine rule. These completeness results are next compared with two systems that have been proposed in the literature. They indicate that the level of comparison facilitated by a formal framework can be useful. In particular, certain sources of incompleteness turn out to recur in the literature without being afforded much attention as such. Finally, the concept of a distribution scheme is related to the heuristic notion of a join-ordering to yield sound and complete execution plans for the aforementioned federation schemes.
Full PDF Version: 

Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Juergen Umbrich submitted on 14/Apr/2014
Minor Revision
Review Comment:

The author presents a logical characterisation of distributed and federated SPARQL query processing consisting of an evaluation rule and distribution function. The work itself is very interesting and a good step to formally describe those query systems.
The paper shows how the developed theoretical framework can be used to explore and compare different query approaches approaches in a mathematical way based on two examples (however, more recent systems like ANAPSID, FEDX, SPLENDID are left out).

Overall the paper is well written and structured and the scope and contribution is clearly presented.
The paper requires a certain level of formal background knowledge in math and RDF/SPARQL to fully understand all bits and pieces.
The related work seems to be well studied in detailed and contains a wide coverage.
The provided formalisation, theorems and proofs appear to be sound and without any typos

Overall a nice formal contribution to the research field of distributed and federated query processing.

Considering the weak points of the paper: I see the main area of improvement in making the paper more accessible for a wider audience by providing some illustrations and examples.
It would be helpful to illustrate certain formalisation based on a running example or provide an overview table.
Both would help to keep track and faster understand of all the formalisations.
It would also make the paper more accessible for a wider audience.

Why does the author include the set of Literals as possibility for the subject position (cf. RDF 1.1. Primer[1] and RDF 1.1 Concepts and Abstract Syntax[2] )

Federation Schemes:
Definition 4.2.:
I would either reorder the causes in reverse order to match the text or reorder the text to match the causes.
Definition 4.4.:
The difference between standard and prudent distribution is not entirely clear from the definition.
A running example would help here

List of some typos:
p.2: "to to the analysis" -> "to analysis"
p3 "is as family of " -> "is a family of sets"
p7. "Of course,this" -> "Of course, this"
p7. "from let to right" -> "from left to right"
p7. "mu need to be" -> "mu needs to be"
p8. "what completeness is completeness with " -> "what completeness is with respect"
p10: consistent font size
p10. "form we have\delta" -> "form we have \delta"

Minor comments.
*) I enjoyed the writing style in general, however, I would rephrase certain parts which indicates that the reader has already certain background knowledge, such phrases are "fairly obvious", "non of the proofs are particular deep or difficult".
*) There is the one or other overful hbox, which should be fixed for the final version.
*) Cleanup of bibtex file

[1] http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140225/#section-triple
[2] http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/

Review #2
Anonymous submitted on 14/May/2014
Major Revision
Review Comment:

Summary of the paper’s main contributions and impact
The paper presents a formal framework for specifying federated source selection, and query decomposition and processing. Queries correspond to Basic Graph Patterns (BGPs) and a federation of SPARQL endpoints corresponds to a cluster of RDF graphs. Well-known conditions such as an answerable triple pattern by endpoint is represented by imposing restrictions on the signature of the triple pattern and the vocabulary terms used the endpoint RDF dataset. Additionally, the formalization of different query decomposition approaches is defined as distributions where distinct types of sub-queries are assigned to sources. For example, even distribution implemented by AliBaba or standard distribution by FedX. Further, evaluation rules to collect data from the sources that comprised a distribution of a query is defined, as well as some main of this schema of evaluation.
The paper finalizes illustration the proposed formal framework with the systems ADERIS and SemWIQ.

Overall the paper presents an interesting and challenging problem, and the proposed framework has the potential of providing the basis to understand the behavior and properties of existing federated query engines. Nevertheless, because of the lack of readability and illustration strengths of the framework cannot be clearly appreciated. First, variables used in the definition need to be defined in the definition and used consistently along the paper, e.g., T is not defined in Definition 4.5. Further, definition of “sig” is not presented, thus is not clear why sig(t) \in sig(R) is correct or should be sig(t) \subset sig(R). Although the framework is illustrated with existing federated engines, because these are not open source, is not possible to check soundness of the use cases. Additionally, the examples do not correspond to queries of existing benchmark. So, the optimal decomposition of the query is unknown as well as the way as the decomposition produced by existing federated engines would have been specified in the proposed framework.

Strong points of the paper
S1: A formal framework to define the results of tasks of source selection, query decomposition and reordering is defined.
S2: Three different types of query decomposition criteria are presented, as well as evaluation rules that ensure that correct execution of query plans.
S3: Two use cases illustrating the expressiveness power of the approach are presented.

Weak points of the paper
W1: Notation is ambiguously used along the paper, making hard to understand the proposed formalism.
W2: Terminology used in the context of SPARQL federations of endpoints is not reused, for example, signature, cluster, or sub-pattern.
W3: Use cases do not correspond to benchmarks defined by the community. There are many different and challenging queries in FedBench that could be used to explain the proposed framework. State-of-the art engines such as FedX, SPLENDID, or ANAPSID, that have been evaluated in existing benchmarks are not used to illustrate the usage of the proposed framework.

Detailed Comments
D1: Notation used in the paper has to be summarized in a table to enhance readability. Variables have to be unambiguously used across the paper. Terminology has to be unified to SPARQL formal semantics; definitions presented by Buil-Aranda et al. should be reused. Definitions need to be illustrated to improve readability.
D2: The proposed framework has to be illustrated using FedBench, the benchmark commonly used by the community to evaluate existing federated engines. Decomposition techniques implemented by state-of-the-art open source SPARQL federated engines have to be included as use cases.

Additional questions to the authors:
Q1: How decomposition produced by index-based heuristics as the ones implemented by ANAPSID (Acosta et al ISWC 2011 and Montoya et al. COLD 2012) or by HiBISCuS (Saleen et al ESWC 2014) can be formalized with the proposed formalism.
Q2: Can any of the state-of-the-art federated engines ensure the computation of an agnostic distribution if one exists for a query? Can any of the FedBench queries be decomposed by using an agnostic distribution?
Q3: What are the properties of the distributions of queries comprised by predicates on general vocabularies, e.g., rdf:type, owl:sameAs.
Q4: What happen is a query has variables in the predicate of a triple pattern?

C. Buil-Aranda, M. Arenas, and O. Corcho. Semantics and optimization of the SPARQL 1.1 federation extension. In G. Antoniou, M. Grobelnik, E. Simperl, B. Parsia, D. Plexousakis, P. Leenheer, and J. Pan, editors, The Semanic Web: Research and Applications, volume 6644 of Lecture Notes in Computer Science, pages 1–15. Springer Berlin Heidelberg, 2011.

Review #3
By Carlos Buil Aranda submitted on 26/May/2014
Minor Revision
Review Comment:

A Logical Characterisation of SPARQL Federation

This paper provides a formalisation of SPARQL 1.0 federated queries. In the paper the authors fist define the notions of query federation across a defined set of SPARQL endpoints (cluster) as well as the evaluation of a SPARQL query in that cluster of endpoints. The authors define a set of properties for the evaluation function of a SPARQL query over the cluster of RDF databases and that function is also used to characterise two SPARQL 1.0 federation engines.

In this section the author introduces the concept of query federation for SPARQL, arguing the high amount of implementation works in this topic and the small/inexistent works in formalising this topic. The author also introduces the work done in this paper, highlighting the query federation in an scenario of not having any knowledge about the data in the remote RDF database.
Comments: I think that in this introduction section there is missing a reference to the SPARQL 1.1 Federation Extension, an official W3C Recommendation which also defines what a SPARQL federation is (but in a different way). I think the author should mention in his work that there is another approach for federating SPARQL queries.

2. Related work
In this section the author presents a summary of the work done in the area. He presents three types of SPARQL query federation and places this work in one of the three.
Comments: Again, the author ignores the work done by the W3C in SPARQL query federation. I think that the SPARQL 1.1 Federation Extension should be cited either in this or in the previous section.

3. Preliminaries
In this section the author introduces the syntax used along the paper and also defines the key concept of cluster of RDF graphs and the evaluation of a SPARQL basic graph pattern P in this cluster of RDF graphs. This evaluation semantics in its two forms will be highly used used in the paper.
Not much to say here, I think it is clear what the author presents since they are only a couple of definitions. However I think it would be good to clarify a bit more the use of two forms of evaluation functions. Maybe with an example?

4. Federation schemes
In this section the author presents a distribution function which specifies where a SPARQL query pattern should be evaluated and an evaluation function describing how that SPARQL query should be evaluated on the selected SPARQL endpoints. The author uses a set of useful definitions which are widely used but in a more informal way like "distribution function”, exclusive pattern to a data source and the conditions that define that distribution function. These conditions should capture the way the SPARQL 1.0 federation engines distribute their queries. This section also presents the Evaluation rule definition which defines how the SPARQL query is evaluated across the RDF graphs. Finally the author defines what a federation scheme is: an evaluation rule for the SPARQL query and the distribution function for this query through the set of available RDF remote datasets.
I think these definitions are quite useful since they formalise what is normally said in plain words. The author also introduces a new evaluation scheme (the prudent distribution). Maybe the author should point out that there are three ways for evaluating a query, 2 of them are used in the existing systems and may yield into different results.

5. Completeness in the zero-knowledge case
In this section the author uses the previous definitions to show properties over them like cluster monotony, agnosticism, granularity, collect-and-combine, etc.
Comments: I found this section a bit hard to read and I think examples of these definitions would make them clearer. Specially the collect-and-combine rule: I would present examples of how the most popular SPARQL 1.0 federation systems collect-and-combine the results of their remote queries.

6. An analysis of two examples from the literature
In this section the authors analyse two SPARQL federation query engines: ADERIS and SemWIQ. The author starts with ADERIS and identifies that the system may return incomplete answers when the system does not have enough knowledge about the data stored in the remote sources. The author also presents some definitions and properties that describe the ADERIS’ behaviour. Finally the author presents a summarising paragraph of the analysis of ADERIS.
Comments: I think this section is similar to the previous one in the sense that more examples that clarify the concepts are needed. I would place them in the final paragraph to show the ADERIS’ behaviour.
Next, the author analyses SemWIQ and again extend the initial definitions for this system’s specific case. The author adds the consistently typed definition and he uses it in lemma 6.4. He concludes like with ADERIS SemWIQ may yield in incomplete results.
Comment: again, I would like to see a simple example of this incomplete results situation. It would clarify the section much more. Besides, why the author chose these two systems? why not to chose DARQ, FedX or ANAPSID? they are more “popular”. Also, is there any system that returns complete answers in a federation scenario?

7. A corollary wrt. join-order heuristics
In this section the author presents how the join order affects the final result set of the query. The author formalises the join order tree representation and how it is evaluated in a specific order using a given heuristic. The author highlights that evaluating groups may not be as constraining as expected and an analysis of the collect-and-combine rule.
Comments: I think this section is very interesting just one comment. Has the author how the engines implement the join operation may affect in the amount of results returned? specially the effect of joining unbound variables.

8. Conclusion
The author presents its conclusions which are mainly the usefulness of this formalisation and the rules presented here. He also presents as future work to extend the formalisation to the rest of the SPARQL language.
Comments: Regarding the conclusions I agree that this formalisation is useful and that it is needed. Regarding the systems analysed I would have chose either two different ones (two more popular ones) or one more system. If the author is planning to study the semantics & complexity of SPARQL I recommend him to read [1].

General comments:
Overall I like the paper, it presents clearly the ideas and formalises them properly. The author presents some properties and I think the proofs are correct. I have two comments only: I think some examples are needed to better understand the notions presented here, specially in the analysis of the systems and that I would add one more system to the evaluation (FedX or ANAPSID). Finally, the author should highlight that this is the study of the non standard SPARQL 1.0 federation, not the official W3C SPARQL 1.1 Federation Extension.

I only found one typo:
"there other” I guess it is there are other

[1] Jorge Pérez, Marcelo Arenas, and Claudio Gutierrez. 2009. Semantics and complexity of SPARQL. ACM Trans. Database Syst. 34, 3, Article 16 (September 2009), 45 pages.