A Fine-Grained Evaluation of SPARQL Endpoint Federation Systems

Tracking #: 625-1835

Muhammad Saleem
Yasar Khan
Ali Hasnain
Axel-Cyrille Ngonga Ngomo

Responsible editor: 
Axel Polleres

Submission type: 
The Web of Data has grown enormously over the last years. Currently, it comprises a large compendium of interlinked and distributed datasets from multiple domains. The abundance of datasets has motivated considerable work on SPARQL query federation systems, the dedicated means to access data distributed over the Web of Data. However, the granularity of previous evaluations of such systems has not allowed deriving insights concerning their behavior in the different steps involved in federated query processing. In this work, we perform extensive experiments to compare state-of-the-art SPARQL endpoint federation systems using the comprehensive performance evaluation framework FedBench. We extend the scope of the performance evaluation by considering criteria in addition to the commonly used key criterion, i.e., the query runtime. In particular, we consider the number of sources selected, the total number of SPARQL ASK requests used, and the source selection time, criteria which have received little attention in previous studies. Yet, we show that they have a significant impact on the overall query runtime of existing systems. We also extend FedBench to mirror a highly distributed data environment and assess the behavior of existing systems using the same four criteria. As a result, we provide a detailed analysis of the experimental outcomes that reveals novel insights for improving current and future SPARQL federation systems.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 14/May/2014
Major Revision
Review Comment:

Summary of the paper’s main contributions and impact
The paper reports on an evaluation study of existing federated engines. The study is two-fold: the first part comprises a public survey in which the developers of state-of-the-art federated engines described their engines in terms of their main features and implementation details. The second part of the study corresponds to an experimental evaluation of the efficiency and effectiveness of existing open source federated engines. FedBench was extended into SlicedBench to fragment the original FedBench collections. FedBench queries were evaluated on both sets of collections, the original and the fragmented one. Several metrics are reported to describe the performance and quality of these engines; in particular, the study focuses on the analysis of the impact that the source selection and query decomposition techniques implemented by the studied engines have on the overall execution time of the engine. Observed results and lessons learned are used to outline best-practice recommendations to be considered during the development of SPARQL federated engines.
Overall the paper is well written and clearly describes the main features of existing SPARQL federated query engines. The survey allows the engines' authors to describe their engines, and these descriptions provide the basis to understand the results observed during the empirical evaluation. Nevertheless, because of the lack of specification and statistical analysis of the results, neither reproducibility nor generality of the reported results can be ensured. First, parameters that may impact the behavior of the engines are not outlined. Second, the experimental setup is not described in detail. For example, with respect to the endpoints it is necessary to state: how the endpoints were implemented, what the timeouts, buffer sizes, and number of available threads are; for each engine, the version evaluated and, for those implemented in Java, the amount of memory assigned to the virtual machine; and the procedure used to measure the reported metrics — is time always measured in the same way? Is the same tool always used? Execution time measurements in Java may not be equivalent to the measurements reported by Python. The RAM consumed by each engine can differ, and with only 4 GB of RAM available this can negatively affect the performance of the engines. Further, the reported metrics are not very informative independently; a figure plotting execution time versus completeness of the answer could give a better idea of the trade-off between these two metrics. Finally, the reported results are not statistically analyzed; for example, a t-test or ANOVA could be reported and used to support the discussion and lessons learned; additionally, the observed results should be linked to the results obtained from the survey.
Strong points of the paper
S1: A detailed description of the existing federated engines.
S2: An extension of existing benchmarks to generate data fragments.
S3: An empirical evaluation of existing open source federated engines under the same conditions.

Weak points of the paper
W1: The experimental setup is poorly described.
W2: No statistical analysis is conducted.
W3: There is no analysis of the parameters that impact on the performance of the evaluated engines.
W4: There is no clear description of the procedure followed to measure the reported metrics.
W5: Relationship between the description of the engines provided during the survey and the results observed in the experimental evaluation is not clearly discussed.
W6: Nothing is said about the type of fragmentation produced by the Slice tool or how the generated slices are assigned to the different endpoints. The behavior of existing federated engines is completely different if data is fragmented horizontally, vertically or in a combination of both.
W7: Behavior of the source selection and query decomposition tasks in presence of general vocabulary predicates such as rdf:type, owl:sameAs, is not described.
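
To make the distinction in W6 concrete, the fragmentation kinds can be sketched on a handful of made-up triples; this is only an illustration of the terminology, not the behavior of the paper's Slice tool:

```python
# Hypothetical illustration of the fragmentation kinds mentioned in W6,
# on made-up triples; this is not the paper's Slice tool.
triples = [
    ("s1", "p1", "o1"), ("s1", "p2", "o2"),
    ("s2", "p1", "o3"), ("s2", "p2", "o4"),
]

# Horizontal fragmentation: split by subject, so each fragment holds
# all the triples of some resources.
horizontal = {
    "frag0": [t for t in triples if t[0] == "s1"],
    "frag1": [t for t in triples if t[0] == "s2"],
}

# Vertical fragmentation: split by predicate, so each fragment holds
# one "column" of the data.
vertical = {
    "p1": [t for t in triples if t[1] == "p1"],
    "p2": [t for t in triples if t[1] == "p2"],
}

# Hybrid fragmentation combines both, e.g. vertical fragments that are
# further split horizontally across endpoints.
```

A source selection strategy that works well under one scheme (e.g. predicate-based indexes over vertical fragments) can degrade badly under another, which is why the fragmentation type must be specified.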

Detailed Comments
D1: Parameters that may impact the performance and behavior of the engines should be explained. The effect of the amount of RAM on the performance of the engines should be discussed.
D2: The experimental setup needs to be clearly described, i.e., endpoint configuration, engines’ versions, RAM of the virtual machines.
D3: Procedure conducted to measure execution time and completeness must be the same for all the engines.
D4: Impact of the number of general vocabulary predicates in a query has to be discussed.
D5: The study needs to be extended to analyze the impact of vertical, horizontal, and hybrid fragmentation of the data.
D6: A statistical analysis that includes common tests, such as a t-test, a simple χ2 test, or the Wilcoxon or Mann-Whitney tests, should be used to support the evaluation conclusions and lessons learned. Measurement errors also have to be reported.
D7: Plots comparing execution time versus answer completeness are more informative and should be also reported.
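
The kind of paired analysis asked for in D6 can be sketched with a stdlib-only exact sign test, a simple stand-in for the suggested Wilcoxon test; the per-query runtimes below are hypothetical, not the paper's data:

```python
import math

def sign_test_p(a, b):
    """Exact two-sided sign test on paired samples (ties dropped):
    under H0, a positive difference has probability 0.5."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    k = sum(1 for d in diffs if d > 0)
    tail = min(k, n - k)
    p = 2 * sum(math.comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(p, 1.0)

# Hypothetical per-query runtimes in seconds (not the paper's measurements).
engine_a = [1.2, 0.8, 3.4, 2.1, 5.0, 0.9, 1.7]
engine_b = [1.5, 1.1, 3.9, 2.0, 6.2, 1.3, 2.4]

p = sign_test_p(engine_a, engine_b)
print(f"sign test p = {p:.3f}")  # 0.125 here: no significant difference
```

With real data one would prefer the Wilcoxon signed-rank test (which also uses the magnitudes of the differences), but even this minimal test makes explicit whether an observed runtime gap could be noise.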

Additional questions to the authors:
Q1: What would be the expected impact on the studied engines of different endpoint capability restrictions, e.g., limited execution time or a maximal number of triples in a query response?
Q2: What would be the expected behavior of the evaluated engines if FedBench queries were executed against public SPARQL endpoints instead of private endpoints?
Q3: What is the impact of data replication on the performance of the studied engines?

Review #2
By Juergen Umbrich submitted on 16/May/2014
Major Revision
Review Comment:

The paper presents a good overview of federated SPARQL engines and a detailed evaluation of 6 open source engines using FedBench and the authors' own benchmark.
The authors "extended" the benchmark with three additional measures (source selection time, number of ASK queries, and number of selected sources) and evaluated the systems over a highly decentralised environment with their own SlicedBench benchmark.
The paper has the potential to provide a good overview of existing systems and a comparison of their performance, which is one of the first such evaluations.
However, the structure, presentation and length of the paper need to be improved to clearly show the contribution of the work to the community.

The authors have good coverage of the literature, and the section could be further improved by explaining the general approaches to federated SPARQL query processing.
Sections 2, 3 and 4 provide background information and a survey of existing systems, which could be condensed and better grouped.
It might be good to think about merging the information of those sections:
e.g., Sections 2, 3 and 4 provide increasing detail on SPARQL federation strategies and approaches.
That information could be merged into one section that is consistent in itself and helps readers unfamiliar with the approaches to get a better overview faster, which is important for understanding the rest of the paper.

References: The references have no consistent format. Some references use the firstname as lastname. This has to be fixed.

The discussion of the results is very detailed, and it is clear that the authors spent a lot of effort in running the evaluation and in compiling and examining the results.
However, because of its length and detail, this section is quite confusing and the message is not clear.
The explanation of the setup is partially inconsistent (text vs. tables/figures) and the result discussion is too long.
The authors should shorten and restructure the evaluation and discussion sections (see details below).
Further, the influence of the partitioning is very important and deserves more attention from the authors and space in the paper.
In addition, the evaluation focuses only on query execution time and does not provide any information or discussion about the correctness of the results (e.g., as highlighted by Montoya [8]).

Eventually, the paper should provide some clear recommendations about which system is best suited for which queries, data and partitioning, based on the comprehensive results obtained in the evaluation.

Detailed comments:

For readers who are not familiar with SPARQL federation, it is not clear why the number of SPARQL ASK queries is of any importance for the benchmark.
The authors should better explain why such a measure is of interest. The same holds for the other two measures.
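
One way to see why the ASK count matters: a naive engine probes every endpoint with one ASK query per triple pattern, so the number of network round trips grows as |patterns| × |endpoints|. The sketch below is a hypothetical illustration (made-up endpoint URLs and patterns), not any specific engine's implementation:

```python
# Hypothetical sketch of triple-pattern-wise source selection via SPARQL ASK.
triple_patterns = [
    "?drug rdfs:label ?name",
    "?drug ex:interactsWith ?other",   # ex: is a made-up prefix
]
endpoints = [
    "http://ep1.example/sparql",
    "http://ep2.example/sparql",
    "http://ep3.example/sparql",
]

def ask_query(pattern: str) -> str:
    """Build the ASK request an engine would send to probe one endpoint."""
    return f"ASK {{ {pattern} }}"

# One probe per (pattern, endpoint) pair: the cost a benchmark should report.
requests = [(ep, ask_query(tp)) for tp in triple_patterns for ep in endpoints]
print(len(requests))  # 2 patterns x 3 endpoints = 6 ASK requests
```

Since each request is a network round trip before query execution even starts, the ASK count directly inflates the overall runtime, which is presumably why the authors measure it.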

End of Section 1: the SlicedBench benchmark has not been mentioned before; the authors talk about FedBench and SP2Bench and should also mention SlicedBench.
It is also not made clear that the SlicedBench benchmark was developed by the authors themselves.

section 2.1

"Olaf et. al" -> typically, authors are referred to by their last name; please fix the bibtex file.

Section 2.2
It seems important for the authors to cover the SPARQL features used in the benchmarks, and a discussion of the SPARQL features used in the different benchmarks would be interesting.

Section 3:
Section 3.1
The authors should provide more information on the necessity of the survey.
As pointed out in the paper, Rakhmawati performed a similar survey, although not in such detail, and it would be interesting to learn why the authors decided to do a second one.

Table 1:
It is not clear why the authors do not list other systems such as [30] or the link traversal approaches of Olaf Hartig.
Last paragraph: fix the text formatting of "SPARQLand query federation".

Section 3.2.
What exactly are the three main categories of federation systems? The authors discuss 6 different categories! Maybe using subsection headings or an item list would bring more structure to this section.
It is confusing to provide percentage values for the number of systems.
It would be better to just write "2 systems are index-free approaches" (vs. "14% of the systems ...").
The order of the systems in Tables 1 and 2 could be aligned.
The authors should also discuss the join implementations listed in Table 1 (e.g., what is a pull-based rank, or adaptive query processing?)
and explain why this is important information, since it is not used again in the paper.

One idea is to dedicate a subsection to the 14 systems and their characteristics (including the beginning of Section 3.2).
The results can be discussed in their own subsection.
Maybe even text from Section 4 could be moved to the beginning of Section 3.

Section 4: Table 1 lists 8 systems with available code; why did the authors limit their evaluation to six systems?

Section 5:
The authors explain the setup of their hardware, which does not match the information in Table 4 (e.g., hard disk, RAM, processors).
What system is used for the SPARQL endpoints (Fuseki, Sesame, 4store, Virtuoso, ...)?

Table 5: check the format (e.g., the version column, consistency of numbers and units, e.g. 1063 vs. 1.7k). What is the definition of a link in the table, and why is it important?
Table 6: I think it is important to list which SPARQL features the queries use.
Paragraph starting with "table 7 shows...":
the authors should make clear in Table 6 which queries are used for the first and second test.

Fig. 1: Why are sources which contribute to two triple patterns counted twice?
Table 6 uses TP for triple patterns, while the rest of the paper uses TP to refer to the triple pattern-wise selected sources.

Section 5.3.1
Why is the first experimental result the triple pattern-wise selected sources, while the introduction of Section 5 refers to the first experiment as measuring the runtime for the FedBench setup?
Table 8 is hard to understand; what exactly does "bold results are key" mean?
In addition, the authors talk about the optimal number of sources. This information is not provided in the tables (cf. Tables 6, 8).
Paragraph 2: the formatting of "rdf : type" can be improved for better readability.

There is no need to include approaches which do not send ASK queries in Table 9.

The result discussion is very long and detailed, and it is hard to understand the relation between source selection and ASK queries on the one hand and the total runtime on the other.
I would suggest changing the structure and presentation of the results.
First, present the overall runtime as a stacked bar chart, consisting of source selection and query evaluation time, for the queries (merge Figures 2 and 3).
This aligns with the authors' motivation that runtime is the most important factor in the federated query scenario (at least they do not discuss denial-of-service attacks caused by too many ASK queries, or interim lookups in nested loop joins).
Next, discuss the portion of the source selection in more detail. This can be backed up with the results of Tables 8 and 9.
I would merge the two tables into one, since Table 9 only has data for 3 of the 6 engines; the #TP and #ASK columns can then be presented next to each other.
This also makes it easier to understand how the number of ASK queries influences the number of triple pattern-wise selected sources.

Section 6:
The discussion of query execution time should be in Section 5 as part of the result discussion, and the question is why only FedX and SPLENDID are compared and not all systems; the query execution time can be calculated for all other systems as well.

Review #3
By Carlos Buil Aranda submitted on 20/May/2014
Major Revision
Review Comment:

A Fine-Grained Evaluation of SPARQL Endpoint Federation Systems

In this paper the authors present an exhaustive evaluation of several SPARQL endpoint federation systems. The contribution of the paper consists of extending the FedBench benchmark with more fine-grained features to evaluate and more systems to run this evaluation on. They provide a new dataset, SlicedBench, which contains the original datasets, but sliced and distributed across several Virtuoso servers. In the evaluation section the authors run the queries in FedX, SPLENDID, DARQ, LHD and ANAPSID.

1 Introduction Section:

In the Introduction the authors describe the increasing popularity of SPARQL federated systems and thus the need to evaluate them to find out which of these systems is the most appropriate for querying distributed SPARQL endpoints. They also note that current evaluations of the existing systems focus only on execution time and leave out other important features such as the number of sources selected, the number of ASK queries sent, or the source selection time. The authors also describe the problem of data partitioning, and at the end of this section they describe the contributions of the paper.

First, the authors almost always assume that the only way of federating SPARQL queries is by means of systems that implement an extension to SPARQL 1.0, leaving out/ignoring the SPARQL 1.1 Federation Extension. The only sentence I found in the paper stating the difference between the authors' approach and the SPARQL 1.1 federation approach is: "Federated queries [7,20,27,23], in a virtual integrated fashion is becoming increasingly popular". I think the paper should explicitly say that there are two ways of federating queries, a virtual integration of datasets using SPARQL 1.0 and the one in the recommendation, and that in the current paper the authors focus on the former.

2 Related Work section:

In this section (2.1) the authors list some of the works on SPARQL query federation engines (using SPARQL 1.0) and focus on their evaluation. This short list of engines and performance studies includes link traversal systems. This section seems quite disconnected, since it is just a list of works without analysing/describing each listed work in detail. Again, a reference to the SPARQL 1.1 Federation Extension should be included.

Section 2.2, which presents the existing benchmarks, is much more elaborate than Section 2.1. In this section the authors present these evaluation works together with a small analysis of them. However, I think that [1] is missing.

3. Federated Engines Public Survey Section

In this section the authors present a survey in which they ask for implementation details like the source selection technique, join implementation used, etc. The authors also ask in this survey for the desired features that a federation engine may have like partial results retrieval, privacy, adaptive operators, etc.

Comments: Regarding privacy, I am not sure why this should be a task for the federation engine. There are approaches like [2] that require an ID in the query URL for accessing the data. I think data access should be managed by the RDF database, not by the federation engine, since that engine is just a client.
Regarding the systems analysed in the survey, I still think that the authors should say clearly that SPARQL 1.1 federation engines are not considered in this survey.
Regarding result completeness, I would like the authors to clarify what 100% recall means in this evaluation. In Table 2 the authors show that some systems return complete result sets while others do not, but I do not know why some are complete; maybe they return more results than others? I would like to see an explanation of the diverging query result sets, since the technique used for joins (or other operations) may affect the final result sets. Besides, is this 100% recall measured for each query in the evaluation?

I find the other categories interesting and well described.

3.2 Discussion of the survey results Section

In this section the authors divide the "existing SPARQL query federation approaches" into three categories (SPARQL endpoint federation, query federation over Linked Data, query federation on top of Distributed Hash Tables), and these categories are further divided into three subcategories: catalogue/index-assisted solutions, catalogue/index-free solutions and hybrid solutions. This classification seems fine to me; the only problem I see is that Atlas has its own category, and I do not think it is a query federation engine: in the end it is an RDF storage system distributed across a P2P network. So it is normal that it has its own category, since it is not a SPARQL federation engine like the others, as shown in [3] and [4]. I think it should be removed from this work, though I can understand its presence here.

4 Details of Selected Systems Section

After analysing and describing 14 systems, the authors selected 6 engines to be benchmarked. In this section the authors provide a somewhat more detailed description of the selected systems and of what they consider SPARQL query federation, and also describe the implementation details of the selected engines in a bit more depth.

I think this section is good and easy to read. It clearly presents the systems that will be evaluated. Maybe the authors should highlight that ANAPSID is developed in Python, which has some drawbacks compared to Java (compiled code in the end).

5. Evaluation Section

First the authors describe the experimental setup, which uses the FedBench query sets for cross domain, life sciences and linked data, plus the SP2Bench datasets. All the datasets involved in these queries are sliced into 10 pieces and distributed across 10 different servers, so each server holds bits of data from each of the datasets. The authors detail how the slicing was done, and I find it a quite interesting technique.
The authors also select some queries from the LD and SP2B domains, discarding others, "to cover majority of the SPARQL query clauses and types along with variable results size": I do not think removing queries from a benchmark is a good way to cover the majority of the SPARQL query clauses. Personally, if I want to cover more cases I add more queries, not the contrary. Besides, the authors of the benchmark included these queries in their evaluation for some reason. For instance, SP2B queries 6 to 8 are very hard queries to answer, since they select the same elements from the dataset twice (using ?article1 and ?author1). I would like to see all queries from the SP2B and LD query sets executed in this evaluation. The goal is to produce a fine-grained and complete evaluation, and thus no queries should be removed.

5.2 Evaluation Criteria Section
In this section the authors describe the criteria used in the evaluation of the systems. I think they are good criteria, but I am missing the result completeness criterion that the authors introduced in Section 3. It would be interesting to add a column for each query showing the number of results returned by each engine.

5.3 Experimental Results Section
The next sections describe the results obtained from executing the queries on each of the datasets. The authors first describe the results of triple pattern-wise source selection, pointing out that ANAPSID is the best one. This section also presents the effects of the overestimation of the selected sources by the engines.
Next, the authors present the results for the number of ASK queries used by the SPARQL federation engines. Table 9 summarises the results, showing that ANAPSID is the system with the smallest number of ASK queries, which is in line with the results from the previous section. The source selection time is analysed in the next section, and the authors point out that the results for ANAPSID include not only the source selection time but also the time needed for decomposing a query, due to ANAPSID implementation details. That shows ANAPSID as the slowest system, which somewhat contradicts the two previous results. I would try to present fairer results here, e.g., calculating source selection + query decomposition times for all engines, or removing ANAPSID from the chart, since it shows this system as the slowest one when it is not.

Finally the authors present the query execution time results. These results show FedX as the fastest engine overall, especially with warm cache. The authors also analyse the effect of source overestimation, especially regarding FedX and SPLENDID.

6. Discussion section
The authors present a discussion mainly about why FedX and SPLENDID are the fastest engines. However, I do not see any explanation of why ANAPSID is slower than the two other engines, especially given that ANAPSID is the fastest engine at selecting the sources to query and the engine that submits the fewest ASK queries.

Finally, the authors also show the effects of partitioning the data, which are very interesting.

Final comments
I think this is a very interesting work, but it has to clarify some points. The first one is the fact that the authors totally ignore that there is a W3C Recommendation for SPARQL Federated Query and that most RDF storage systems implement it, including Virtuoso. The authors should clearly state that there are two ways of federating queries and that they focus on one of them. Next, some sections should be rewritten (like Section 2.1). Regarding the discussion, it should be explained why ANAPSID is one of the slowest engines when it is the best one at selecting sources. At the beginning of the paper the authors say:
"In our experiments, we made use of both FedBench [25] and SP2Bench [26] queries to ensure that we cover the majority of the SPARQL query types and clauses."
This is not true, since only some of the SP2Bench queries, namely those included in FedBench, are used. I would suggest that the authors use all queries from these query sets or remove that sentence.

Some typos:
In the survey discussion section:

repeated sentence: "However, two out of the eight with public implementation do not make use of the SPARQL endpoints and were thus not considered further in this study.”

Typo: cashing results? maybe caching results?


[1] Gabriela Montoya, Maria-Esther Vidal, Oscar Corcho, Edna Ruckhaus, and Carlos Buil-Aranda. 2012. Benchmarking federated SPARQL query engines: are existing testbeds enough?. In Proceedings of the 11th international conference on The Semantic Web - Volume Part II (ISWC'12), Philippe Cudré-Mauroux, Jeff Heflin, Evren Sirin, Tania Tudorache, and Jérôme Euzenat (Eds.), Vol. Part II. Springer-Verlag, Berlin, Heidelberg, 313-324.

[2] Manuel Salvadores, Matthew Horridge, Paul R. Alexander, Ray W. Fergerson, Mark A. Musen, and Natalya F. Noy. 2012. Using SPARQL to query bioportal ontologies and metadata. In Proceedings of the 11th international conference on The Semantic Web - Volume Part II (ISWC'12), Philippe Cudré-Mauroux, Jeff Heflin, Evren Sirin, Tania Tudorache, and Jérôme Euzenat (Eds.), Vol. Part II. Springer-Verlag, Berlin, Heidelberg, 180-195.

[3] Zoi Kaoudi, Iris Miliaraki, Matoula Magiridou, Antonios Papadakis-Pesaresi and Manolis Koubarakis. Storing and Querying RDF Data in Atlas. In 3rd European Semantic Web Conference (ESWC2006), 11 - 14 June 2006, Budva, Montenegro, Demo paper.

[4] Filali, I., Bongiovanni, F., Huet, F., & Baude, F. (2011). A survey of structured p2p systems for rdf data storage and retrieval. In Transactions on large-scale data-and knowledge-centered systems III (pp. 20-55). Springer Berlin Heidelberg.