Observing the Web of Data through Efficient Distributed SPARQL Queries

Tracking #: 620-1830

Xin Wang
Thanassis Tiropanis
Wendy Hall

Responsible editor: 
Pascal Hitzler

Submission type: 
Application Report
Dealing with heterogeneity is one of the key challenges of Big Data analytics. The emergence of Linked Data provides better interoperability and thus further enhances the potential of Big Data analytics. The use of Linked Data for analytics raises performance challenges when considering the distribution of data sources and the performance of Linked Data stores in comparison to other storage technologies. This paper investigates the performance of distributed SPARQL approaches to gain an understanding of how well the distribution challenge can be addressed in such an environment. We describe the distributed SPARQL query infrastructure that has been deployed in the Southampton University Web Observatory (SUWO) to support analytics that require prompt access to a large number of Linked Data datasets. This distributed SPARQL approach adopts dynamic optimisation techniques to take advantage of runtime statistics, and exploits parallelism. The infrastructure is evaluated on Twitter data hosted in the SUWO and popular queries over these data. The evaluation demonstrates that the infrastructure described in this paper has a performance advantage over other existing distributed SPARQL engines.

Major Revision

Solicited Reviews:
Review #1
By Duc Thanh Tran submitted on 18/Feb/2014
Review Comment:

The authors propose a new approach for querying over distributed SPARQL endpoints. The main idea here is to exploit parallelism. To this end, the authors develop a strategy for identifying subqueries that can be executed in parallel. The idea behind it is simple, straightforward and makes sense: a subquery has the highest potential for parallel execution when it is largely independent from other parts of the query, i.e. the number of bindings it generates is not affected by other parts.
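To make this intuition concrete, here is a minimal sketch (my own illustration, not the authors' implementation) of dispatching independent subqueries concurrently; `evaluate_subquery` and the triple patterns are hypothetical stand-ins for real endpoint calls:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for sending one subquery to a SPARQL endpoint
# and counting the bindings it returns.
def evaluate_subquery(subquery):
    canned = {"?s foaf:knows ?o": 3, "?x dc:title ?t": 5}
    return canned[subquery]

# Two subqueries with no shared variables: neither constrains the
# other's bindings, so they can be evaluated concurrently.
subqueries = ["?s foaf:knows ?o", "?x dc:title ?t"]
with ThreadPoolExecutor(max_workers=len(subqueries)) as pool:
    cardinalities = list(pool.map(evaluate_subquery, subqueries))

print(cardinalities)  # binding counts gathered in parallel: [3, 5]
```

The hard part, as noted above, is deciding *which* subqueries are independent without detailed statistics, which is exactly where the greedy dynamic strategy comes in.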

Overall, the topic is very relevant, the paper is clearly structured, easy to follow, contains a sufficient level of technical detail, and the contributions and connections to related work are clear.

However, there are many issues with this idea / the current version of the paper:

- the idea makes sense but is hard to implement; as pointed out, it is very difficult to obtain statistics. Identifying independent subqueries requires detailed statistics (join cardinality, etc.) that are not available. This is why the authors employ dynamic optimization to leverage statistics obtained during query evaluation. It is, however, not clear whether the adopted greedy strategy is really effective here.
- this open issue should be addressed in the evaluation, i.e. it would be nice to see a validation of the greedy optimization strategy and the idea behind parallelization. However, since only 4 queries were used, the results reported are not conclusive and do not provide many insights. Especially with this idea, it is important to see how much parallelization potential really exists in real query workloads. As it is right now, the experimental setting is too limited and artificial, focusing only on a few queries (designed to fit the proposed idea).
- related work contains pointers to the most relevant work in the Semantic Web community. However, the problem of distributed query processing / optimization is well known and also widely studied in the database community. There are many works on query decomposition etc. that the authors should consider.

Overall, I think the direction is interesting but the authors need to address the issues above (especially: experimental validation) to improve the current version of the paper.

Review #2
Anonymous submitted on 10/Mar/2014
Major Revision
Review Comment:

In this paper the authors describe an algorithm for parallel SPARQL query processing that identifies subqueries that have no dependencies on other subqueries of the same query and can thus be executed in parallel by the underlying engine (SUWO). The authors also provide an evaluation showing how their approach outperforms FedX and LHD (a previous work by the authors).

In the Introduction section the authors describe the problem, specifically the increase of data, Linked Data, and the need for statistics for efficiently querying these data. The authors also introduce Ψ, the algorithm that they describe in the following sections.
The Related Work section presents the current state of the art regarding distributed SPARQL query execution, focusing on those systems that use statistics for generating their query plans. Since this is a journal paper, I would add references to other systems that also distribute SPARQL queries, like ANAPSID.

In Section 3, the authors describe their approach for processing distributed SPARQL queries. First, they describe the intuition behind the algorithm that they present later on. The authors introduce the notion of a cardinality node and describe it using Figure 1. However, I found this intuition a bit difficult to follow, especially when they describe a triple pattern as an edge, whereas I normally think of a property as an edge. I am not against this way of describing the intuition, but I would suggest using examples based on FOAF, for instance, in Figure 1. I think that would improve the understanding of the paper.
The algorithm presented identifies the triple patterns that are independent of each other, relying on the statistics gathered from the datasets. These statistics are gathered at runtime when queries are executed. I find the cost estimation functions reasonable and I think they will do their job.

In Section 4 the authors describe the datasets and the queries used in the evaluation. The dataset contains Twitter data about posts and users. These data are in RDF and can be queried using SPARQL. The authors describe these data in the paper along with the 4 queries used in the evaluation. While I think the data are described with enough detail, I find 4 queries too few. Isn't it possible to have more queries? Something like returning a conversation, or something else that exploits the use of RDF?

In Section 5 the evaluation is presented. In that section the authors present the experimental settings and also the metrics used for evaluating the queries. These metrics are queries per second (QPS), incoming network traffic (INT), outgoing network traffic (ONT) and average transmission rate (ATR). Regarding the QPS metric, I suppose it indicates that SUWO can execute almost 14 queries per second while FedX can execute about 5. I would suggest using the classic query execution time for evaluating each query and then explaining it in the results section.
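For reference, QPS and per-query execution time are directly convertible, which is part of why either metric could work here. A minimal sketch with made-up timings (not the paper's measurements):

```python
# Made-up per-query execution times in seconds (illustrative only,
# chosen so the workload lands near 14 QPS).
execution_times = [0.07, 0.08, 0.06, 0.075]

# Queries per second over the whole workload.
qps = len(execution_times) / sum(execution_times)
print(round(qps, 2))  # roughly 14 queries per second for these timings
```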

Finally, the Conclusion section consists of a summary of the work done and draws some conclusions, mainly stating that the system is faster in this specific use case. The section also presents the future work to be done; I think the log analysis is important for gathering statistics, but also for extracting user preferences regarding the data to be queried.

At this point I have some questions:

Since the SUWO engine is based on statistics gathered at runtime, why not evaluate the systems with cold and warm queries? That would present better results for SUWO and the statistics gathering process.

Also, I would like to have a brief summary of the engine running SUWO; since this is a new contribution, a brief introduction to the backend would be nice. It is not clear to me how SUWO works.

What about scalability? How would the system handle queries with more predicates, and queries against a more diverse dataset like DBpedia or bio2rdf?

The authors use a single dataset which is distributed across several nodes. Since all the data are about the same topic and the evaluation does not integrate data from other datasets, why not compare against a system with a NoSQL backend like Jena+HBase, or maybe try 4store, for instance?

Review #3
By Olaf Hartig submitted on 17/Mar/2014
Review Comment:

The authors of this manuscript claim to present "a distributed SPARQL query engine that adopts novel techniques to [...] improve [Linked Data] query efficiency."

Frankly, I stopped reading this manuscript after the second paragraph in Section 3.1, at which point it became apparent that a thorough assessment of the proposed approach is impossible, because the authors fail to accurately introduce (let alone define) even the very basic concepts. This does not only hold for concepts that readers who know SPARQL may be familiar with, but it also holds for specific concepts related to the proposed approach. Examples for the latter are given as follows:

1) The authors neither explain what a "concrete node" is, nor what a "binding" is. Consequently, it is also not clear what the "number of bindings, or cardinality of a concrete node" is. Furthermore, there is no explanation of why edges/triple patterns are conceived of as active entities that might change such a cardinality.

2) While "shared variables" seem to have a cardinality too (as one can infer from the last sentence in the first paragraph of Section 3.1), the authors do not provide a description (let alone a definition) of what the cardinality of a shared variable is.

3) The authors claim to "introduce the notion of fixed cardinality node" but then go straight to a description of a property of such a node (without introducing/defining the concept before). If this description is meant to be the definition, this is not very helpful because the given description assumes an understanding of what "the execution of all connected edges" is, which has not been discussed before (moreover, the unspecified notion of "bindings" appears again in this description). Hence, after reading the description of this property (of fixed cardinality nodes), I do not know what a "fixed cardinality node" is (and I doubt the majority of the readers will).

4) The operation of "removing all fixed-cardinality nodes" (based on which the authors try to make a case for parallel processing opportunities that seem to be the key idea of the authors' approach) is unclear. Neither the "more precise description" in footnote 6 nor the given example provide a sufficient definition. More specifically, it is not clear what the "more precise description" in footnote 6 refers to and what it means, and the example is formally incorrect for the following reason: In graph theory, any sub-graph of some graph is a graph itself, that is, a pair consisting of a set of vertices and a set of edges; I do not see such pairs in the example (instead, the authors represent any sub-graph as a set). Moreover, I do not see why removing vertices B and C from the graph in Fig.1 results in three subgraphs; the result that I would expect (based on the graph theory that I have been taught in university) is a _single_ (sub-)graph that consists of two vertices (A and D) and no edges.
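To make the graph-theoretic point concrete, here is a minimal sketch (my own, with a guessed edge set, since Fig. 1 is not reproduced in this review) of vertex removal as an induced-subgraph operation, which yields exactly one (vertex set, edge set) pair:

```python
# A guessed edge set loosely in the spirit of Fig. 1 (the actual figure
# is not reproduced here): every edge passes through B or C.
vertices = {"A", "B", "C", "D"}
edges = {("A", "B"), ("B", "C"), ("C", "D")}

def remove_vertices(vertices, edges, removed):
    """Induced subgraph after deleting `removed`: a (vertices, edges) pair."""
    kept = vertices - removed
    kept_edges = {(u, v) for (u, v) in edges if u in kept and v in kept}
    return kept, kept_edges

sub_vertices, sub_edges = remove_vertices(vertices, edges, {"B", "C"})
print(sub_vertices, sub_edges)  # vertices A and D remain, with no edges
```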

The demonstrated lack of a precise introduction of the basic concepts makes an in-depth understanding of the rest of the manuscript impossible. Therefore, the manuscript is unacceptable for publication in the journal.

Further aspects in which the part of the manuscript that I read lacks clarity are the following:

1) Given that the presented work focuses on SPARQL query distribution, I do not see what this work has to do with "observing the Web of Data" (as mentioned in the title).

2) The Abstract does not mention what "the distribution challenge" is that the authors aim to achieve an understanding of.

3) The first sentence in Section 1 is not clear about what era it refers to.

4) The authors must elaborate on why "machine-understandable, interoperable data and coordinated datasets [...] are contradictory to the properties of Big Data." (Section 1)

5) I would expect a reference for the claim that "many LD can be accessed via SPARQL endpoints." (Section 1)

6) The authors must elaborate on why "the latency of data transfer [...] becomes more challenging in the case of distributed SPARQL queries" (as opposed to other distributed database settings).

7) The related works section (Section 2) does not clearly identify the main conceptual difference (data shipping vs. (sub)query shipping) between the two categories of approaches mentioned.

8) In the last line on page 2, it is not clear what "they" refers to.

9) The authors must elaborate on why "statistics that are accurate enough [...] are unlikely to be available on a large scale." (Section 2)

10) The related works section should discuss how the proposed approach is related to the mentioned approaches.