Distributed query processing in the presence of blank nodes

Tracking #: 1277-2489

Audun Stolpe
Jonas Halvorsen

Responsible editor: 
Axel Polleres

Submission type: 
Full Paper
This paper demonstrates that the presence of blank nodes in RDF data represents a problem for distributed processing of SPARQL queries. It is shown that the usual decomposition strategies from the literature will leak information, even when that information derives from a single source. It is argued that this leakage, and the proper reparational measures, need to be accounted for in a formal semantics. To this end, the standard set-based SPARQL 1.1 semantics is generalized with a parameter representing different execution contexts. This makes it possible to keep tabs on the naming of blank nodes across execution contexts, which in turn makes it possible to articulate a decomposition strategy that is provably sound and complete w.r.t. any selection of RDF sources even when blank nodes are allowed. Alas, this strategy is not computationally tractable. However, there are ways of utilizing knowledge about the sources, if one has it, that will help considerably.
Minor Revision

Solicited Reviews:
Review #1
By Aidan Hogan submitted on 03/Feb/2016
Review Comment:

This paper discusses federated query processing (conjunctive query answering: CQA) when the individual sources contain blank nodes, adopting a SPARQL-style semantics where blank nodes are scoped differently across different sources, and scoped differently from the query results to the graph, etc. As the authors point out, the presence of blank nodes complicates the matter considerably and in a manner that none of the practical proposals for federation, thus far (to the best of the authors' knowledge ... and mine), have addressed: current proposals for federation rely on a single partition of a query (e.g., into singleton triple patterns, or into maximal sub-queries, etc.) that are only sound/complete in the absence of blank nodes. In the presence of blank nodes, the authors show formally that a single partition of a query is not enough: intuitively speaking, there is a tension between picking up on all remote (cross-source) joins and local joins involving blank nodes. If one, for example, creates a partition of a query into singleton triple patterns, local joins involving blank nodes will be missed. If one sends the entire query to each source, remote joins will be missed. The only way to solve this in the general case is to either have knowledge (or assumptions) of where blank nodes can coincide in the results, or to effectively "brute-force" the possibilities, splitting a query in all possible ways (modulo some filtering on predicates, if that knowledge is available, as the authors assume), and trying them all. Thus, in the general zero-knowledge case, the only way to maintain soundness and completeness for CQA in a federated scenario in the presence of blank nodes is to create exponentially many sub-queries from the original query. Actually I found this central observation of the paper really interesting: in some way it is sort of obvious, but despite there being a lot of work on federation, it was also new to me (sort of an "aha" moment, which is high praise). 
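The tension described above can be made concrete with a toy evaluator. The following is a minimal Python sketch, not taken from the paper: `eval_bgp`, the tuple encoding of triples, and the `_:N_` renaming scheme are all illustrative. Each call to `eval_bgp` is a fresh execution context that renames blank node labels apart, as SPARQL requires between separate query executions, so a singleton decomposition loses a local join on a blank node that evaluating the whole BGP in one context finds:

```python
from itertools import count

_ctx = count()  # fresh tag per execution context

def _match(binding, term, value):
    """Match one pattern term against one triple term, extending binding."""
    if term.startswith("?"):
        if term in binding:
            return binding[term] == value
        binding[term] = value
        return True
    return term == value

def eval_bgp(source, patterns):
    """Evaluate a basic graph pattern against one source in a fresh
    execution context: blank node labels are renamed apart first."""
    tag = next(_ctx)
    renamed = {tuple(f"_:{tag}_{t[2:]}" if t.startswith("_:") else t
                     for t in triple)
               for triple in source}
    solutions = [{}]
    for pattern in patterns:
        solutions = [binding
                     for mu in solutions
                     for triple in renamed
                     for binding in [dict(mu)]
                     if all(_match(binding, t, v)
                            for t, v in zip(pattern, triple))]
    return solutions

# One source in which a blank node joins :name and :knows locally.
source_a = {("_:b1", ":name", '"Alice"'),
            ("_:b1", ":knows", ":carol")}
p1 = ("?x", ":name", '"Alice"')
p2 = ("?x", ":knows", "?y")

# Whole BGP, one context: the local blank-node join is found.
whole = eval_bgp(source_a, [p1, p2])

# Singleton decomposition, two contexts: _:b1 is renamed apart, so
# joining the two answer sets on ?x yields nothing.
r1, r2 = eval_bgp(source_a, [p1]), eval_bgp(source_a, [p2])
joined = [dict(a, **b) for a in r1 for b in r2
          if all(a[k] == b[k] for k in a.keys() & b.keys())]
```

Sending the whole query to every source instead preserves such local joins but, symmetrically, misses joins that span sources; hence the exponential number of decompositions in the zero-knowledge case.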
It also raises interesting follow-up questions (as made explicit by the authors in terms of future work): are these exponential sub-queries really a problem in practice, are there other forms of guarantees under which the problem becomes simpler, etc. (On the other hand, the other conclusion one can take is that blank nodes are simply not worth the hassle in federated scenarios, but as the authors note, it's hard to sweep them under the carpet since they are so common in real-world data.)

Overall the paper is quite well-written, relevant to the journal, well-motivated, non-trivial, (to the best of my knowledge) technically sound and original; there are useful examples to help digest the formal machinery, etc. Thus, for me, it ticks all the boxes for acceptance.

My one overall "gripe" about the paper is that it is quite dense in a formal sense, which lowers the effort-to-reward ratio for some readers (including me). There is a lot of notation (and a lot of subtlety in the notation) to grapple with before the real pay-off of the paper in Section 7 (arguably also Section 6). On the one hand, I appreciate that the main contribution of this paper is a formal one and that the various lemmas have to have their i's dotted and their t's crossed, and in general I can appreciate that the authors have kept the notation quite clean. On the other hand, in the current format, I worry that the density of the paper reduces its accessibility: it has an important message independent of this formal machinery and I'm afraid that it could get lost in the notational jungle of the paper (which although necessary, is not the important part of the work). Also some parts really just seemed unnecessarily long-winded, in particular the part about contexts seemed like overkill for the goal of making sure the blank nodes were kept disjoint, and likewise the treatment in section 4 for the equivalence of answers drags its feet. I guess I would recommend to the authors to consider two types of readers: one who wants to get the intuition and see the main results, one who wants to see the proofs and re-use the formal machinery for extended results. One option might be, for example, to move the proofs and more minor lemmas to an appendix, keeping the body of the paper "on message" throughout with the main formal results and examples/discussion/intuition in one thread. In any case, since this refers more to the style of the paper, I will not make any concrete high-level requests to the authors, but perhaps suggest they may wish to take steps to have more of the former type of audience read their paper.

All that remains are some specific comments, and then some typos. There are quite a lot of them: some suggestions, some questions, etc. In general though, they all appear trivial to correct, and all my other comments are matters of style more than anything. Hence I recommend accept rather than minor revision (but if the editor wishes, I can have a second look).

-- Title
* The title mentions "distributed query processing" whereas the running title mentions "federation across blank nodes". These should be aligned better.
* With respect to the title again, while there is some hand-waving involved with terms like "distributed" vs. "federated", I think the current consensus in this community (esp. since used in the SPARQL 1.1 standard itself) would dictate federation as the better word to use in the title.

-- Section 1
* At the end of Section 1, I'm still not really clear on what the paper, concretely, is about. This is covered nicely with the intuitive examples in Section 3, but maybe this should be moved before related work? It's harder to read related work without knowing what the paper is really about yet.

-- Section 2
* The mention of OBDA and OBDI in the related work is a little weird for me, particularly the focus on owl:sameAs. It feels like if you extend the radius of related work to reach OBDA, then a lot of other stuff should fall in scope elsewhere. It's harmless but I would prefer to have the related work without this.

-- Section 3:
* Figure 4 projects a variable "?name" that does not appear in the query. The projected variables do not match the results presented later.
* "Generalizing from the present example ..." Conceptually you reach a problem here. You have not said what form of queries you support. For example, SPARQL 1.1 has a variety of "non-monotonic" features like negation, which would greatly complicate the discussion here. At some stage earlier than this point, you need to specify that you are only interested in conjunctive queries, aka. BGPs where only AND is allowed, aka. ... (In general, I would have liked a remark somewhere acknowledging that most of the features of SPARQL are ignored by the present work, and what it might mean to the discussion if they were included ... even if just to say such discussion is out of scope.)
- Do the papers on DARQ or FedX explicitly rule out blank nodes? That is to say, do they acknowledge the problem even if they don't solve it?
- "As a final example ..." This only became clear with the example much later in the paper of the prudent decomposition (which I assume refers to ADERIS). I wonder if those examples could be moved here in Section 3 and then simply referred back to later.
- "... viz. the decomposition below:" But P_3 still contains ?x, which needs to join with P_1 on a blank node?
- Perhaps (a suggestion here) it would be worthwhile to cover what the W3C officially recommends in terms of SPARQL 1.1 federation (it is somewhat orthogonal to the main message of this paper, but could be included for completeness; another option is that it could be discussed and "ruled out" as something orthogonal in the related work).

-- Section 4:
* "RDF graphs will be referred to interchangeably as sources, endpoints or datasets." Eek. This is a problem. These are four different things. Datasets are defined formally in SPARQL and RDF 1.1 as an extension to RDF graphs (effectively named graphs). This terminology needs to be cleaned up: for sure, you should not use the term "dataset" to refer to a "graph". Muddling endpoints, sources and graphs is also not ideal.
* Definition 4.1: Point 2, at first glance, I thought this was referring to SPARQL UNION, namely a disjunction of two query patterns, which confused me for a moment.
* "This is the standard set-based definition of answer sets" Actually the SPARQL standard defines solutions in terms of bags: the default semantics is to preserve every homomorphism in terms of the cardinality of the results, leading to duplicates when you have projection (here there is no projection defined, so it's not so relevant, but ideally this should be cleared up).
* Corollary 4.4, it would be good to introduce the symbol for equivalence used here.
* I'm in general a little uncomfortable with using the union symbol for mappings since I'm not sure what the result is if the mappings are incompatible; this is already the case in Theorem 4.5 for example. I would be more comfortable if that were somehow made explicit even if it is formally redundant (that way I don't have to wonder if it is formally redundant or not).

-- Section 5:
* Definition 5.2, condition two: so if a1 is (b,p1,o) and a2 is (b,p2,o), where b is a blank node, o is an IRI, then a1 and a2 cannot be in a b-component since o is shared in the object position but is not in B. Is this correct/deliberate?
* Example 5.1: again wondering, shouldn't P_3 be together with P_1 based on the connection with ?x bound to a blank node?
* Example 5.2: earlier you assumed no cartesian products. The example is intuitive enough as it is but it would perhaps be preferable to stick to the earlier assumption and find a case where there's a join variable (e.g., make the object a join variable and change the data to match).
* Lemma 5.3, you discuss the intuition after giving the formal statement. This is something you tend to do often, not just here. When reading, I thus tended to read the part after the lemma/proof first to get the intuition, then go back to the formal statement. As a general comment, I would try to put the intuition first in such cases.

-- Section 6:
* "The soundness and completeness of a federation scheme ..." I know in this paragraph you try three different ways to explain it, but I did not get this paragraph at all. In retrospect, I'm not sure it's needed here. It becomes clearer later when it's needed and with the help of examples.
* Definition 6.3: reading this I'm wondering if it is deliberate that the result is not a partition of P ... that P_i != P_j does not imply P_i and P_j are disjoint. Later it turns out this is deliberate but would give a brief note after the definition to remove the doubt.
* Theorem 6.1, point 3: "maximally connected subset of ..." This is not really ever defined nor motivated. I get the gist but wonder is it connected only through variables, etc.; it later becomes sort of clear from the example that it's connected through variables, and the goal is (presumably) to avoid useless cartesian products.

-- Section 7:
* "This requires an exponential number of steps and so is not computationally tractable." While true and important to note, this somehow hides the fact that query evaluation is in any case computationally intractable, even for a single source that doesn't contain blank nodes (assuming AND with projection for the decision problem; otherwise for enumerating all the answers, well there can be exponentially many for conjunctive queries so even without projection, it's intractable). Also, it is computationally intractable in combined complexity: in data complexity (considering the size of the query as fixed somehow, giving a nod to the fact that the query is typically small and thus relative to the data can be considered constant) it doesn't affect tractability. Most importantly though, I think it would be good to avoid giving the impression that this problem makes query evaluation intractable: it is already intractable ... rather it adds another EXP.
* Definition 7.3: I had forgotten what f(.,.) was and had to go back a long way. Perhaps include it in Table 4, or add a reminder.

- Section 1:
* "to {the} said dilemma"
* "semantical machinery" ... semantic machinery sounds better to me? It seems semantical exists as a word in some dictionaries, but it sounds ... wrong and it's a bit distracting.
* "that {the} distributed query answering"

Section 3:
* "by a single source only--- ... ---{source} into the same"
* "to show the following: [i]f answering ..."
* "is split to finely"
* "there is {a} decomposition"

Section 4:
* '"As far as it goes" because ...' not a sentence. (maybe 'We say "as far as it goes" ...')
* "for blank node"
* "Queries do{es} not"
* Lemma 4.2, bad box in the proof.

Section 5:
* "in join position"
* "all solution, and only"

Section 6:
* "such P_i \neq \emptyset"
* "defined as{ }:"
* "form an{d} exclusive"
* Definition 6.7: the '1' subscript is a curious choice: why not 'j'?
* "efinition is adjusted"
* Table 4: I think you use t more often than tr?

Review #2
Anonymous submitted on 10/Feb/2016
Minor Revision
Review Comment:

Reviewing Distributed Query Processing in the Presence of Blank Nodes

This paper presents a solution to the problem of information loss when joining blank nodes from different data sources in a federated scenario. The authors propose the use of mapping contexts when the joins are executed. To generate such contexts the authors propose to do "reverse engineering" on the SPARQL query patterns to identify which of these patterns will map to blank nodes and, once they are identified, to isolate those blank node joins. The final result is a "federation scheme" that generates a strategy for executing federated SPARQL queries that correctly deals with blank nodes.

In the Introduction section the authors introduce the problem of dealing with blank nodes in a distributed setting. I totally agree with the authors' motivation. In this regard the authors argue that this topic has been treated ambiguously in the community; however, I have to disagree. The authors could have a look at [1] or even at [2], a work that briefly touches on blank nodes in a distributed setting.

The Related Work section presents an overview of the systems that implement SPARQL federation. I think it is a quite complete overview; however, the authors miss one publication that actually deals with SPARQL federation in the presence of blank nodes [2].

Section 3 describes the problem of blank nodes in a distributed scenario in which data is distributed across several datasets. The ideal answer to the distributed query would be one generated from the union of the datasets being queried. However, if one uses a distributed query processor, the process would be to query both datasets separately and then join the results of the two queries. That process generates different results from the ones obtained over the union of the datasets. The authors also argue that the SPARQL 1.1 recommendation "requires every distinct query to be treated as a separate query execution context" and that consequently "join information is not preserved by the two tables" being joined. Furthermore, the query will always lose information using any of the decompositions proposed by the same authors in a previous paper. The solution the authors propose is to decompose the query carefully, a decomposition that should be driven by the variables that will be mapped to blank nodes.

The problem is well described; I do not have many comments about this section.

Section 4 describes the solution the authors propose, which is the use of execution contexts in federated SPARQL queries. Such an execution context is defined in Def. 4.4, which basically says that mu_c is a solution mapping taking into account the "context that provides the solution with a unique set of names for its blank nodes".

I think the context proposal is a nice idea to solve the problem of federated query processing in the presence of blank nodes. The idea is simple and understandable; solutions are now tied to an execution context. However, this use of contexts seems similar to the SERVICE keyword, especially when the authors write about inter-contextual joins; can the authors clarify a bit? I think the authors should also have a look at [1] regarding the blank node skolemisation solution. Theorem 4.6

Section 5 presents the method for generating the execution contexts that will allow a "safe" execution of the query, i.e. the query will not lose information when it is executed. The solution is to find a decomposition strategy that groups those SPARQL patterns whose compatible mappings involve blank nodes. This decomposition is done by doing "reverse engineering" over the SPARQL query patterns: the method examines the solution mappings and builds a set of b-components, and from this set of b-components the authors build the final decomposition strategy.
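The grouping idea can be sketched very roughly. The following is a hypothetical simplification, not the paper's actual construction: it assumes the b-component relation reduces to connected components of triple patterns over variables known to bind blank nodes, and the function and variable names are illustrative:

```python
def b_components(patterns, bnode_vars):
    """Group triple patterns into connected components over shared
    variables that may bind blank nodes (a toy stand-in for the
    paper's b-components). Patterns with no blank-node variable
    end up in singleton components and may be shipped separately."""
    comps = []  # list of (component patterns, component b-vars)
    for pattern in patterns:
        pvars = {t for t in pattern if t in bnode_vars}
        merged_patterns, merged_vars = [pattern], set(pvars)
        remaining = []
        for c_patterns, c_vars in comps:
            if c_vars & pvars:          # shares a b-var: merge
                merged_patterns += c_patterns
                merged_vars |= c_vars
            else:
                remaining.append((c_patterns, c_vars))
        comps = remaining + [(merged_patterns, merged_vars)]
    return [c_patterns for c_patterns, _ in comps]

# ?x may bind a blank node, so the two patterns sharing ?x must be
# evaluated together; the ?y/?e pattern can travel alone.
patterns = [("?x", ":name", "?n"),
            ("?x", ":knows", "?y"),
            ("?y", ":email", "?e")]
comps = b_components(patterns, bnode_vars={"?x"})
```

The point of the sketch is only the shape of the computation: patterns joined through blank-node-bound variables must be kept in one sub-query, which is exactly why the zero-knowledge case forces trying many ways of splitting the query.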

I find the idea interesting but quite infeasible. I do not think that there are going to be many implementations that perform this generation of b-components to generate a complete solution. However it is a solution to the problem.

Section 6 presents the federation schemes, which are the result of the authors' previous paper [3]; however, the authors adapt them to deal with blank nodes (e.g. Definition 6.8).

Comments: not much to say here, this is just a brief adaptation of the previous paper.

Section 7 presents the federation scheme to correctly evaluate the federated SPARQL query. I think this is the extension of the authors' previous work [3] to deal with blank nodes. To address the complexity problem previously described, the authors propose the use of ASK queries.

The authors present here how to generate the federation schemes, and how to skip the complexity problem by executing SPARQL ASK queries, which from my point of view is a good enough solution.

Overall comments:
I find the paper well written and very easy to read. The problem is a real problem and the authors found a solution to it. However, this solution seems quite infeasible, even using ASK queries. I also find the use of contexts and inter-contextual joins quite similar to the use of SERVICE. I would like to read the authors' comments about it.

[1] Hogan, Aidan. "Skolemising blank nodes while preserving isomorphism." Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015.
[2] Buil-Aranda, Carlos, Axel Polleres, and Jürgen Umbrich. "Strategies for executing federated queries in SPARQL 1.1." The Semantic Web – ISWC 2014. Springer International Publishing, 2014. 390-405.
[3] Stolpe, Audun. "A logical characterisation of SPARQL federation." Semantic Web 6.6 (2014): 565-584.


There was something I forgot to mention in the review: a (naive) alternative method that avoids the exponential number of partitions is to query for and download all of the federated databases (CONSTRUCT WHERE {?s ?p ?o }) and then process the query locally. Obviously this alternative has significant problems, but it does perhaps get around some of the exponential arguments the authors make, and raises question marks about the (slightly vague) claim in the paper "There seems to be no way around this conclusion".

If I have understood you correctly, what you are proposing is to transfer all of the remote data to the local host. That would work. This isn't feasible, but it would work in theory. Hence, since this is a mathematical paper, we should qualify our statement. Thanks.
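The download-everything fallback discussed above can be made concrete: after shipping each source's triples to the local host, one must merge them as RDF graphs, standardizing blank nodes apart so that same-labelled blank nodes from different sources are not accidentally identified. A minimal sketch in Python (triples as tuples of strings; `rdf_merge` and the `_:srcN_` renaming scheme are illustrative, not from the paper):

```python
def rdf_merge(sources):
    """Merge RDF graphs in the sense of the RDF semantics: blank nodes
    are standardized apart across sources before taking the union, so
    same-labelled blank nodes from different sources stay distinct."""
    merged = set()
    for i, source in enumerate(sources):
        for triple in source:
            merged.add(tuple(
                f"_:src{i}_{term[2:]}" if term.startswith("_:") else term
                for term in triple))
    return merged

# Two sources that both use the label _:b1 for (different) blank nodes.
source_a = {("_:b1", ":name", '"Alice"')}
source_b = {("_:b1", ":name", '"Bob"')}
merged = rdf_merge([source_a, source_b])
```

Evaluating the query locally over such a merge is sound and complete in a single execution context, which is why the approach works in theory; the cost is transferring every source in full.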