A Benchmark Suite for Federated SPARQL Query Processing from Existing Workflows

Tracking #: 1594-2806

Antonis Troumpoukis
Angelos Charalambidis
Giannis Mouchakis
Stasinos Konstantopoulos
Daniela Digles
Ronald Siebes
Victor de Boer
Stian Soiland-Reyes

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Full Paper
This paper presents a new benchmark suite for SPARQL query processors. The benchmark is derived from workflows established by the pharmacology community and exploits the fact that these workflows are not only applied to voluminous data, but are also equivalent to complex and challenging queries. The value of this queryset is that it realistically represents actual community needs in a challenging domain, testing not only speed and robustness to large data volumes but also all features of modern query processing systems. In addition, the natural partitioning of the data into meaningful datasets makes these workflows ideal for benchmarking federated query processors. This emphasis on federated query processing motivated complementing the benchmark with an execution engine that can reproduce distributed and federated query processing experiments.


Solicited Reviews:
Review #1
By Ruben Verborgh submitted on 28/Mar/2017
Major Revision
Review Comment:

In this article, the authors describe a benchmark for federated SPARQL query processing systems, as well as an executor for such a benchmark. The proposed benchmark uses real-world data and queries from the biomedical/pharmaceutical domain. The authors analyze the properties of the datasets and queries, and run the benchmark on three federated systems and a single-source setup.

Unfortunately, the authors did not declare that this submission is an extension of earlier work [a], in violation of SWJ’s guidelines which state that “the submitted manuscript must clearly state the previous publications it is based on.” [b] This, by itself, would be grounds for immediate rejection.

After having reviewed the article, I have mixed opinions. On the one hand, the presented work addresses a clear need of the current federated SPARQL landscape. On the other hand, the article describing it has various flaws, but many of them seem to be fixable. Hence, I’m opting for “major revision” rather than “reject” (notwithstanding my remark above that the submission does not comply with the journal’s policy).

My main question is: why is the proposed benchmark a good benchmark, what does “good” mean, and how do we know? Does the benchmark really test those things that need testing in SPARQL federated engines? I am missing a clear section on requirements engineering, and a validation afterwards. While I agree that having real-world data and queries is important, this alone does not make something a good benchmark. If we want to improve on existing benchmarks, we need clear criteria for improvement; otherwise, it is impossible to assess whether the proposed benchmark is indeed a necessity, or just nice to have.

The strengths of the article are:
– addresses a need within SPARQL federation
– real-world data and queries
– large datasets

The weaknesses of the article are:
– the benchmark _itself_ is not evaluated
– no comparison with ANAPSID
– the majority of queries do not work
– missing important discussions such as a requirement analysis
– there is no link to benchmark data or queries
– not well structured
– several sloppy and/or unfounded statements

I will return to the above points in the detailed comments below.

– Why do you introduce a new benchmark? What are the problems with existing benchmarks?

– You explain the tension between generic and informative, but do not draw any conclusion from that. What does this mean for your benchmark?
– “KOBE” is mentioned without introduction.

– Reference for Open PHACTS?
– “the complexity of defining efficient Linked Data queries” => does Open PHACTS do anything regarding the “efficiency” of SPARQL queries? (The type FILTER in Listing 1 would contradict this.)
– Open PHACTS offers an HTTP API, but not a REST interface. (REST interfaces would be self-describing and offer hypermedia controls.)
– What is the observable difference between IMS “as a service” and as “a materialized dataset”? For instance, when accessing http://dbpedia.org/resource/World_Wide_Web, an observer cannot (and should not) conclude whether the underlying data infrastructure is materialized or not; i.e., the Linked Data Web consists of resources, not services. So does it mean that IMS is not offered as a resource-oriented HTTP interface?
– It might be useful to provide a brief inline definition of “structuredness”. Why doesn’t Conceptwiki have a structuredness in Table 1?

– Who has transformed the questions into SPARQL queries? If an external party, could you provide a reference?
– Structure-wise, the explanation of Q19 probably fits better in the previous section, in order to contain all domain-specific knowledge in a single section. In any case, it was hard to follow this mid-paragraph.
– Where are the queries published? I tried Googling a part of Listing 1, but this brought me to [a] instead of a dataset.
– The fact that no Open PHACTS workflow was defined does not seem a convincing argument for leaving out a question; the queries could have been manually created as well. As such, the paragraph starting with “Regarding workflow availability” also seems superfluous (and in any case does not provide arguments why the queries that _were_ included are good ones).
– The argument on IMS seems an unnecessary repetition of the previous section.

– “The typical flow of a federated query processing system consists of three phases” => reference needed
– Table 3 does not show the syntactic, but rather the structural point of view.
– How is the original organization in different named graphs?
– The term “graph annotations” is ill-defined and therefore confusing; please be specific when you mean the graph in which triples are organized. Similar for “graph annotations” in SPARQL queries later; they should be GRAPH clauses/keywords.
– typo: “on the other hard”
– The observation about the number of predicates is directly relevant to my remarks about an evaluation of the benchmark: to what extent does this property make your benchmark a good one?
– How was “potential contribution to the resultset” exactly measured? What is the relation to selectivity?
– Regarding the listed SPARQL features per query: this highly depends on how a query was generated. For instance, Q19 as shown in Listing 1 makes a crucial choice of placing a FILTER on the object of an rdf:type statement. Alternatively, this might have been expressed as a UNION of two rdf:type statements, without FILTER. In the latter case, federated engines might use class-based filters much more effectively. So then the question is what this query is measuring: an engine’s intrinsic source selection capabilities, or rather the optimization capabilities of a certain SPARQL implementation? And this is just a single question for a query that was given. I could not find the other queries online, but I imagine there could be more issues like this one.
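The two formulations contrasted in this point can be sketched as follows (a hypothetical illustration; the `ex:` prefix and the classes `ex:ClassA`/`ex:ClassB` are placeholders, not the actual terms of Listing 1):

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex:  <http://example.org/>

# Variant A: the type is restricted via a FILTER on the object of an
# rdf:type pattern, which is opaque to class-based source selection.
SELECT ?s WHERE {
  ?s rdf:type ?type .
  FILTER (?type = ex:ClassA || ?type = ex:ClassB)
}

# Variant B: a semantically equivalent UNION of two concrete rdf:type
# patterns, which a federated engine can match against per-source class
# statistics and use to prune sources.
SELECT ?s WHERE {
  { ?s rdf:type ex:ClassA } UNION { ?s rdf:type ex:ClassB }
}
```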

– Why were those 3 engines selected? In particular, why was ANAPSID not selected, which is known to perform well on certain queries that result in large intermediate results in other engines [c]?
– On a similar note, why were Triple Pattern Fragments not considered, which have been shown to perform well in federated scenarios [c]? A short answer here could be that current TPF engines do not support all SPARQL features; however, the text mentions that other federated engines also have issues regarding feature support. (Disclaimer: I am an author of TPF work; I am not necessarily requesting its inclusion, but I think the question is relevant.)
– FedX “typically being faster” needs a reference; there is also evidence to the contrary [c].
– Results also depend on network latency, which was not mentioned.
– Regarding performance on the FedBench benchmark, it should be noted that this benchmark has been criticized for its bias toward exclusive groups, and more complex queries have been suggested that put FedX at a disadvantage [d].
– “stand-alone Virtuoso database” should probably be “a single Virtuoso instance containing all datasets”
– Why 4 computation nodes? There are more datasets.
– Which network setup? What is the influence of the network?
– Why would a “very large” query be “probably difficult”? This is not necessarily the case. Furthermore, this also brings me back to my main concern: what are we testing then? If the results strongly depend on the syntax used to express a query, we might be evaluating a certain implementation’s capability to perform very specific optimizations as opposed to the general strengths of federation engines. As we all know, many federation engines are currently research efforts, and research groups can typically not afford to focus on specific optimizations. Improvements on the benchmark would thus not necessarily mean fundamental improvements of the engine, which seems problematic.
– Structure-wise, the recommendations at the end of each subsection probably belong in the conclusion.
– The syntax errors that result from SPLENDID make the results rather non-interesting… wouldn’t it be more interesting to simply fix SPLENDID and continue from there?
– “exponential” needs a reference
– It would be useful to indicate the type of errors on Tables 5 and 6 for the sake of overview; also, the assignment of these types is hard to follow in the text.
– You verify the number of results, but do you verify the correctness of individual results?
– With relation to Virtuoso errors, what was the exact Virtuoso configuration used?
– It seems highly problematic that the majority of engines fail the proposed queries. It is definitely interesting to have a couple of queries fail for clear reasons, as this sets a goal for a new generation of engines. However, given the high failure rates, it is unclear whether the proposed benchmark is a realistic next target goal, or rather a goal in some distant future. In other words, it is unclear what insights this benchmark delivers, and what we should conclude from it.

– This section comes too late, given that many of its concepts are already used earlier.
– This section should also discuss federation engines, and justify the selection of engines.
– The FedBench complex queries [d] need to be mentioned.
– More generally: what exactly are the strong and weak points of existing benchmarks? This would provide the motivation for a new benchmark.

– “To understand what new insights can be gained” => So, what are the new insights that can be gained?
– What are the conclusions drawn that have not been previously observed? They definitely belong in the “Conclusion[s]” section.
– Why would we need a “battery of benchmarks”?
– I appreciate the point on query loads; this domain should definitely broaden its focus from execution time only.

[a] http://ceur-ws.org/Vol-1700/paper-04.pdf
[b] http://www.semantic-web-journal.net/faq#q9
[c] http://linkeddatafragments.org/publications/jws2016.pdf
[d] http://ceur-ws.org/Vol-905/MontoyaEtAl_COLD2012.pdf

Review #2
Anonymous submitted on 11/Apr/2017
Major Revision
Review Comment:

The paper presents a benchmark for federated query processing engines based on real datasets and real queries from a pharmacological domain.
Datasets and queries are results of the Open PHACTS project. The queries correspond to relevant questions formulated by domain experts; they are meaningful and use a number of SPARQL features and a fairly large number of triple patterns (avg. 11.9).
While in the project the data were integrated and made available through a single endpoint, in this benchmark the natural partition of the data into different datasets has been exploited to use these real and challenging queries to benchmark federated query engines.
Federated query engines FedX, SPLENDID and SemaGrow have been evaluated using a generalization of the driver provided by FedBench (KOBE benchmarking engine).
The experimental setup compares the engines on the different tasks relevant to federated query processing: source selection, query planning and query execution. Unfortunately, most of the engines failed to process many of the queries.
Some engines had trouble parsing the queries; in other cases the endpoints were overloaded by a large number of calls in a very short period.
The benchmark is compared with existing benchmarks FedBench and BigRDFBench in terms of number of patterns, number of joins, join types, join vertex degree, number of results, use of SPARQL features, etc. According to these characteristics, the queries included in this benchmark are more challenging for federated query engines. They include more SPARQL features, more complex join types, larger number of results, more join vertices.

The main weaknesses of the paper are:
(W1) Evaluation: the choice of the tested engines seems biased, mostly old and RDF4J/Sesame-based engines.
(W2) Missing metrics: number of intermediate results, number of subqueries, selectivity of the subqueries, answer completeness.
(W3) The proportion of queries that ended with an error is huge. We cannot be certain of the usefulness of the queries to compare engines if they fail in very early stages of query processing; any trivial query with the unhandled features would have the same effect, but the included queries are not trivial.

The main strengths of the paper are:
(S1) Real and challenging queries.
(S2) Thorough evaluation of relevant tasks for federated query processing.
(S3) Comparison with existing benchmarks.

Detailed comments:
* A clearer and detailed description of your contribution is missing, in particular making clear what is new with respect to what already existed (datasets and queries from the Open PHACTS project).
* Large result sets are challenging to transfer to the federated query engine, but they do not necessarily imply large intermediate results or a large number of challenging sub-queries. These challenges have been overlooked in the benchmark design.
* Which version of FedX was used? The version 3.1 (available at https://www.fluidops.com/downloads/collateral/FedX%203.1.zip) handles VALUES clauses, and the paper claims FedX does not handle them. It is reported that Q18 execution with FedX ended with runtime error because it has unsupported SPARQL operations (page 9), but all the included features (BIND, FILTER and OPTIONAL) are supported by FedX 3.1.
* Why only RDF4J/Sesame based federated query engines were included in the experiments? (Maybe http://www.semantic-web-journal.net/system/files/swj954.pdf and http://dl.acm.org/citation.cfm?id=2907580 would help finding some other alternatives)
* Was any workaround tried for the missing VALUES feature? (Replacing the variables with the values, or UNION- or FILTER-based equivalent queries.) Since 8 out of 11 queries have at most one element in their VALUES clauses, the use of VALUES does not seem inherent to these queries.
* Why were GRAPH clauses used and not SERVICE clauses (SPARQL federation extension)?
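The last two points above can be illustrated with hypothetical queries (all prefixes, IRIs and endpoint URLs below are placeholders, not terms from the actual queryset):

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/>

# Original (hypothetical) query using a single-binding VALUES clause:
SELECT ?label WHERE {
  VALUES ?drug { ex:aspirin }
  ?drug rdfs:label ?label .
}
# Workaround 1: inline the single value directly:
SELECT ?label WHERE { ex:aspirin rdfs:label ?label . }
# Workaround 2: FILTER-based equivalent (generalises to several values):
SELECT ?label WHERE {
  ?drug rdfs:label ?label .
  FILTER (?drug IN (ex:aspirin))
}

# GRAPH scopes patterns to a named graph within the queried store:
SELECT ?s WHERE { GRAPH ex:chembl { ?s rdfs:label ?o } }
# SERVICE (SPARQL 1.1 Federated Query) delegates them to a remote endpoint:
SELECT ?s WHERE { SERVICE <http://example.org/sparql/chembl> { ?s rdfs:label ?o } }
```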

The paper is mostly well written and easy to follow. It addresses a relevant topic and the proposed benchmark seems novel and relevant. I recommend to review the benchmark design, and to improve the evaluation and the writing of the paper before its acceptance.

Minor comments:
There is no mention of OPFBench in Sections 1 and 2; when it is mentioned in Section 3, it is not clear whether it is your contribution or an existing benchmark.
Section 4 is not included in the paper overview at the end of Section 1.
What are accession codes?
What are linksets?
"cross link it with the with the ChEMBL and OPS Chemical Registry datasets" --> repeated words
"Some datasets... are now included in the Open PHACTS Discovery Platform, but have not been available when the workflows were created." --> why use the present perfect for the lack of availability? Aren’t they available now?
"Every query involves between one and five datasets to compute the result (ref. #sources columns in Table 4, for each version of our queryset)." --> "#sources" columns have values between one and four, or did you mean other column?
"but will not be joinable with any other dataset except the one that will be found." --> "the one that will be found" seems very strange wording
"versus the number of source that really contribute to the resultset." --> what is the column name?
What is "#sources span" in Table 4 (page 6)? (It has been defined in page 13, but it may be too late)
In Table 4, which column gives the number of included GRAPH clauses?
Definitions introduced in page 6 could benefit from some illustrating examples, perhaps using Listing 1.
"force these operators to be executed on the side of the federator than on the data stores." --> missing comparative word
Which version of Virtuoso was used?
How were the endpoints and federated query engines distributed in the cluster?
What is a "hot run"?
What are the configuration parameters used by the Virtuoso endpoints?
Does SemaGrow manage to make any source selection pruning of its own? Or is the reduction of the number of selected sources just because it takes into account the GRAPH clauses?
Missing reference for the "known Virtuoso issue" mentioned in Section 5.5.
In Section 5.7, there is no reference to the table where the discussed results were presented.
"the queryset without graphs is evaluated faster than those that the graph annotations are contained. This difference in query execution time is due to the difference in the cardinality of the resultset and the difference in the number of the relevant sources/graphs." --> but with graph annotations, there should be fewer selected sources and fewer answers... or am I missing something else here?
"Semagrow uses a more efficient plan than FedX, and as a result Semagrow is faster by an order of magnitude, but it is much slower that the standalone Virtuoso." --> In which setup? In Table 6, FedX is two times faster than Semagrow
In Table 9, what are the meaning of n, m in the given n/m values?
In Table 10, are the values presented averages?
"Sources span indicates the effect that non-trivial query selection can have on eliminating sources to be considered: A higher span will make more visible the effect of sophisticated" --> To really see this you should also include the number of sources that actually contribute to the answer; this determines how many sources could be pruned without losing part of the answer
"synthesised querysets" --> check if there is word misuse here
"Excluding from the coubt sources that contribute to the result for only one triple pattern." --> fix typo, what does "for only one triple pattern" mean here?
It is strange that the claimed second contribution is mainly located in the appendix.
"we will phenomena and choke points that will not be present in synthetic query loads" --> unclear
Section A.1, why only OPFBench and FedBench but not BigRDFBench?
"the configuartion files" -> "the configuration files"
"The driver can connect to the specific federation engine via the SPARQL protocol, and all the federation engines can access the data sources" --> do the engines act as SPARQL endpoints?
"usually due to cashing and metadata loading" --> cashing or caching? Of what kind?
Example in Section A.2: text says queries were executed six times, why does the output file show only three runs? Which column is the average number of results?
"the experiment should evoke the appropriate docker-compose commands" --> is evoke the right word here?

Review #3
Anonymous submitted on 18/Nov/2017
Review Comment:

The paper discusses the OPFBench benchmark and the KOBE driver that can be used for benchmarking federated query engines. OPFBench is based on datasets and queries from the Open PHACTS European project that deals with the integration of a multitude of Linked Data sets from the pharmacology domain.

Regarding (1) originality: the queries and datasets used by the authors to define OPFBench have already been introduced in the context of the Open PHACTS project (see http://www.sciencedirect.com/science/article/pii/S1359644614004437) and therefore I do not see any significant originality in the work here.

Regarding (2) impact, the definition of a benchmark driver for executing the benchmark is not per se a significant contribution, and I do not think it has an impact that justifies the publication of a journal paper. Furthermore, the fact that the queries were already defined in earlier works does not lend this paper any credit regarding impact. Moreover, the experiments run by the authors and the comments/observations presented justify a workshop paper but certainly not a journal paper. Therefore, should the paper be rejected from the journal, I would suggest that the authors shorten it and submit it to a workshop. Also, there is no link to the actual SPARQL queries that we could see and review.

Last, regarding (3) quality of writing, I found the paper hard to read because it is very verbose. I propose that the authors be more precise when presenting the different matters. The paper must also be reviewed by a native English speaker in order (a) to remove the grammatical errors and (b) to rewrite the extremely long sentences found in the text.