A Fine-Grained Evaluation of SPARQL Endpoint Federation Systems

Tracking #: 772-1982

Muhammad Saleem
Yasar Khan
Ali Hasnain
Axel-Cyrille Ngonga Ngomo

Responsible editor: 
Axel Polleres

Submission type: 
Full Paper
Abstract:
The Web of Data has grown enormously over the last years. Currently, it comprises a large compendium of interlinked and distributed datasets from multiple domains. The abundance of datasets has motivated considerable work on SPARQL query federation systems, the dedicated means to access data distributed over the Web of Data. However, the granularity of previous evaluations of such systems has not allowed the derivation of insights concerning their behavior in the different steps involved in federated query processing. In this work, we perform extensive experiments to compare state-of-the-art SPARQL endpoint federation systems using the comprehensive performance evaluation framework FedBench. In addition to the traditional query runtime as an evaluation criterion, we extend the scope of our performance evaluation by considering criteria that have received little attention in previous studies. In particular, we consider the number of sources selected, the total number of SPARQL ASK requests used, the completeness of answers, as well as the source selection time. We show that these criteria have a significant impact on the overall query runtime of existing systems. Moreover, we extend FedBench to mirror a highly distributed data environment and assess the behavior of existing systems using the same four criteria. As a result, we provide a detailed analysis of the experimental outcomes that reveals novel insights for improving current and future SPARQL federation systems.
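The abstract counts SPARQL ASK requests among its evaluation criteria. As background, the idea behind ASK-based source selection can be sketched as follows; the endpoint names, predicate sets, and helper functions here are purely illustrative mock-ups, not taken from the paper or from any evaluated system — a live engine would issue real ASK queries over HTTP to each endpoint.

```python
# Hypothetical sketch of triple-pattern-wise source selection via
# SPARQL ASK probes. All names (endpoints, predicates) are invented
# for illustration; no network access is performed.

def ask_query(triple_pattern):
    """Build the SPARQL ASK request that checks whether an endpoint
    holds at least one triple matching the pattern."""
    return "ASK WHERE { %s } " % triple_pattern

def select_sources(triple_patterns, endpoints):
    """For each triple pattern, keep only the endpoints whose (mocked)
    ASK probe would answer True. `endpoints` maps an endpoint name to
    the set of predicates it serves, standing in for a live ASK call."""
    relevant = {}
    for tp in triple_patterns:
        predicate = tp.split()[1]  # crude parsing: subject predicate object
        relevant[tp] = [name for name, preds in endpoints.items()
                        if predicate in preds]
    return relevant

# Mock federation: two endpoints with partially overlapping vocabularies.
endpoints = {
    "endpointA": {"dbo:birthPlace", "rdfs:label"},
    "endpointB": {"ex:target", "rdfs:label"},
}
patterns = ["?s dbo:birthPlace ?o", "?s rdfs:label ?l"]
selection = select_sources(patterns, endpoints)
# Only endpointA is relevant for dbo:birthPlace; both serve rdfs:label.
```

Engines that skip such probes (e.g., by consulting a precomputed index) save ASK requests but risk overestimating the set of relevant sources, which is exactly the trade-off the paper's source-selection metrics are meant to expose.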
Minor revision

Solicited Reviews:
Review #1
By Juergen Umbrich submitted on 19/Sep/2014
Minor Revision
Review Comment:

I would like to thank the authors for addressing most of my previous comments.
While the paper has further improved, there are still some comments that should be addressed to avoid confusion:

Section 5:
I found it slightly confusing that the independent variables are introduced and their influence on the dependent variables is discussed before the dependent variables themselves are introduced. I would suggest restructuring Section 5.
p. 9, 3rd paragraph (right side): I do not understand why "[...] the dataset size cannot be fully explored in existing SPARQL query federation" benchmarks. Please provide more explanation.

Sections 5 and 6 contain a contradiction regarding the setup.
Section 5 states that there were no limits on answer size and query execution, while Section 6.1 states that Virtuoso was set up with a maximum of 100,000 rows and a maximum query execution time of 60 seconds.

Section 6:
The server descriptions in the text and in Table 5 do not really match and are still very confusing.
Also, and more critically, the servers have different specifications, which might influence the experiment (e.g., server timeouts causing empty result sets, or other errors). Did the authors use some method to guarantee that the different server specifications did not cross-influence the results? If so, please explain this in the paper.

Overall, the evaluation in Section 6 is very hard to read and could benefit from a better structure (see also the minor comments).
Again, I would suggest merging Fig. 2 and Fig. 7 so that Fig. 4 is not needed.

More minor comments:

It might not be entirely clear to every reader why an abundance of datasets directly leads to the development of SPARQL federation systems. An additional clarifying sentence would be very helpful.

Section 2 regarding/related to the result completeness:
I wonder whether the authors considered the correctness of returned results to be crucial for a SPARQL federation engine?
It might be interesting to add a comment about how the federation setup and the join implementation can lead to wrong results.

Section 3:
last paragraph in section 3.1. Formatting of the keywords "SPARQL " and " query federation"
Maybe enumerate the three categories of engines rather than just having a paragraph; same for the three sub-categories.
-> Provides a better structure.

Section 5, first sentence:
The authors show *some* of the variables that may influence the behavior, which immediately raises the question of what other variables exist and which of them might have a significant influence.

Section 6.1:
I would explain already at this point of the paper why SP2Bench is not used in experiment 1.
Just to avoid confusion, please add that the Python "time" method reports the time in seconds as floats.
The current text might cause confusion if someone does not know the return values of the methods and assumes that you compare milliseconds (Java) vs. (rounded) seconds (Python) runtime measures.
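The reviewer's point about measurement units can be made concrete with a short sketch. The following is a minimal illustration (the workload is an arbitrary stand-in): Python's standard-library `time.time()` returns a float number of seconds since the epoch, so sub-second runtimes are preserved and can be converted to milliseconds for comparison with Java's `System.currentTimeMillis()`-based measurements.

```python
import time

# time.time() returns seconds as a float, NOT rounded integer seconds,
# so millisecond-scale runtimes are not lost.
start = time.time()
total = sum(range(100_000))      # arbitrary stand-in workload
elapsed_s = time.time() - start  # float seconds
elapsed_ms = elapsed_s * 1000.0  # convert for comparison with Java ms
```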

Section 6.1.1, paragraph 2: you mention three query categories, but list four (including SP2Bench).
p. 12: text formatting (line spacing too big).

Section 6.2, para 1:
"We select four metrics" -> the authors list five metrics.

Table 11: what is the meaning of "-"? Does it mean that the approach returned the right results? Please indicate.

Section 6.3.6: formatting of paragraph 2 (line spacing).

"queires" -> "queries"

Review #2
Anonymous submitted on 27/Sep/2014
Minor Revision
Review Comment:

The authors addressed the majority of my comments as well as my questions. The paper is definitely much clearer now, and I consider that the reported evaluation is important to understand the main features of existing federated engines and is a valuable contribution to the area. Nevertheless, some questions still remain open, reducing the value of the reported work.

First, since the systems are implemented in different programming languages, execution time must be measured using the same function, e.g., with the “time” command of the operating system. Please verify, and make explicit, whether this issue has introduced errors in the reported measurements.

Even though one of the highlighted properties of existing federated engines is adaptivity, nothing is said about the time required to produce the first answer of the studied queries and federated engines, or about how delays in the endpoints can impact the performance of these engines. In particular, according to Table 2, several engines implement non-blocking operators, e.g., FedX, LHD, Avalanche, DAW, ANAPSID, or ADERIS. However, the empirical evaluation does not reveal whether the implemented operators meet this property or not. To do so, it is required to report whether all answers are produced only after query execution finishes or, on the contrary, whether they are produced incrementally. Additionally, nothing is said about the impact that supporting adaptivity may have on execution time. I wonder if query execution time is affected not only by the source selection techniques, but also by the non-blocking operators implemented by some of the engines that exhibited poor performance. Reporting the time of the first answer would help to answer this question.

Other comments:
-) Change “Linked Open Data (LOD) Cloud” to “Linking Open Data (LOD) Cloud”.
-) Use the last name of the first author of a paper and the abbreviation “et al.” to refer to the characteristics of the work reported in the paper, e.g., change “[6] identifies various drawbacks..” to “Bezt et al. identify various drawbacks.”
-) Present references in increasing order, e.g., [25,32] instead of [32,25].
-) Page 9: fix errors as: the data dimension comprise*S* of and the platform dimension consist*S* of.
-) Clarify whether all the Virtuoso SPARQL endpoints for SlicedBench are installed on the same machine.
-) Table 10: In which sense are the results in bold key results?
-) Table 11: Meaning of “??” and “-”.
-) Section 6.3.4., “Figure 2 and Figure 2 show-s-….” “each of the selected approach*es*”.
-) Page 21 “wilcoxon” by “Wilcoxon”

Review #3
By Carlos Buil Aranda submitted on 06/Oct/2014
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. All responses are OK to me. The only concern I have is the following:

We have added two sentences: “The aim of this paper is to experimentally evaluate a large number of SPARQL 1.0 query federation systems (i.e., the relevant data sources needs to be transparently selected by the federation engine)”

Unfortunately I do not see that sentence in the paper. Am I looking at the right version of the paper (swj772.pdf)?

By 100% recall, we mean that the engine should return the complete set of results that can be derived from the data without missing a single entry. Table 2 is completely based on survey results (filled in by the authors of the papers). However, as mentioned in the survey (http://goo.gl/iXvKVT): “Note if your answer to the previous question, i.e., support for index/catalog update, is "No" then result completeness cannot be assured.”

I have the following concern regarding the “Result Completeness” category: I think that none of the evaluated systems can get all results from the RDF databases they query (and thus be complete). This is because the federation strategies implemented by the federation systems do not follow any standard against which one could compare and assure that all results were obtained. Besides, there is no theoretical study that validates the completeness of such systems. Thus, since each of these systems implements its federation without any reference, all of them may return completely different results (all correct from their implementation point of view), but there is no guarantee that they will be complete (i.e., all systems may miss entries). Thus, I think that the category should be renamed to something more appropriate.