An Empirical Evaluation of Cost-based Federated SPARQL Query Processing Engines

Tracking #: 2390-3604

Umair Qudus
Muhammad Saleem
Axel-Cyrille Ngonga Ngomo
Young-koo Lee

Responsible editor: 
Ruben Verborgh

Submission type: 
Full Paper
Finding a good query plan is key to the optimization of query runtime. This holds in particular for cost-based federation engines, which make use of cardinality estimations to achieve this goal. A number of studies compare SPARQL federation engines across different performance metrics, including query runtime, result set completeness and correctness, number of sources selected and number of requests sent. However, although they are informative, these metrics are generic and unable to quantify and evaluate the accuracy of the cardinality estimators of cost-based federation engines. In addition, to thoroughly evaluate cost-based federation engines, the effect of estimated cardinality errors on the overall query runtime performance must be measured. In this paper, we address this challenge by presenting novel evaluation metrics targeted at a fine-grained benchmarking of cost-based federated SPARQL query engines. We evaluate the query planners of five different cost-based federated SPARQL query engines using LargeRDFBench queries. Our results suggest that our metrics are clearly correlated with the overall query runtime performance of the selected federation engines, and can hence provide important solutions when developing the future generations of federation engines.
Full PDF Version: 

Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 15/Jan/2020
Review Comment:

In order to evaluate the effectiveness of cost-based SPARQL query federation algorithms,
two novel evaluation metrics were introduced: q-error and similarity errors.
The authors perform a statistical regression and correlation analysis
to determine the effectiveness of these metrics.
For this, the authors make use of the LargeRDFBench benchmark
using five state of the art SPARQL federation engines.

While I see the value of this work, I am unsure whether this is sufficient for a journal article.
The main contribution of this work is the introduction of two new metrics, and their accompanying evaluation (which has some significant issues).
I would consider this work a first step towards a broader analysis of cost-based federation algorithms,
where additional metrics could be introduced in the future,
as cardinality estimates are clearly not the only thing that impacts the effectiveness of these algorithms.
As such, my feeling is that this work deserves to be shared to the research community,
but in its current preliminary state perhaps only as a workshop/conference paper.
Without significant additions, I do not recommend this paper to be accepted.
More detailed comments can be found below.

A main strength of this work is the fact that the implementation of this metric in Java is available as open source on GitHub.
Furthermore, all experiments are available and reproducible.
All of this is clearly documented.

The analysis in section 6.2 is missing a proper critical interpretation of the p-values.
As can be seen in figures 2 and 3,
the q-error has almost no correlation to execution time (except for SemaGrow using robust regression),
and the similarity error has weak to no correlation to execution time in 2/4 cases for linear regression,
and in 1/4 cases for robust regression.
The authors should discuss this fact, and not merely discuss it as an alternative point of comparison next to the R values.
Instead, the p-values are necessary to show the significance of the R values.
Nevertheless, I still agree with the conclusion that the similarity errors are better predicators than the q-error for execution time.

The authors make use of robust regression as an alternative to linear regression.
As the authors claim, robust regression is indeed less sensitive to outliers, which can be beneficial.
However, this down-weighting/removal of outliers can also make the evaluation less realistic.
Therefore, it is important that the weighting is carefully studied afterwards,
to make sure that down-weighted observations are really meant to be down-weighted.
As such, it is important that the authors include a discussion on this analysis.
Open question: the weights are applied in favor of the similarity errors. Is an alternative weighting possible so that the q-error comes out better?

I am wondering to what respect the authors tested all required assumptions for the used statistical method.
For instance, whether or not the execution time and metric values are normally distributed for the simple linear regression,
and whether or not there is a monotonic relationship between the two variables when using the Spearman test.
It is important that all of the assumptions corresponding to each statistical method are tested thoroughly,
and are reported in this article.

Section 6.4 is not related to the new metrics, so this should not be included in the article.
However, as these execution times were used to calculate the correlations in previous sections,
they are valuable to have as a reference.
Therefore, I suggest the authors to move the plots from 6.4 into an appendix.

I find the current conclusion section rather weak.
Currently, it is mainly a summary of the paper, while the expected perspectives are missing.
Some aspects I would like to see discussed:
* Cardinality estimates are clearly not the only predicator, what else is? Can it be measured?
* How will this metric be used?
* What impact will this work have?
* It feels to me like the new metrics are to be used as alternative to execution times.
This makes me wonder when these metrics will be more valuable than execution times?
Are there cases where execution times can not be used, but these new metrics can be used?

Minor comments:
* In the similarity error definition there is a multiplication with 2. Why is this needed?
The experimental results show that this metric produces values in the range of [0, 2], which makes me thing that this factor can be removed.
* Table 1: To improve signal to noise ratio, I suggest to remove all crossmarks, and only keep the checkmarks.
* Definitions 2, 3 and 4 are nearly identical. It would be easier for the reader if those would be compressed into a single definition (with multiple possible instantiations).
* Page 6: "hard disc" -> "hard disk"

Review #2
Anonymous submitted on 20/Jan/2020
Major Revision
Review Comment:

The paper addresses the problem of measuring the accuracy of the cardinality estimators of federated SPARQL engines. The paper proposes three similarity error metrics, inspired by the q-error metric introduced in [5]. The paper measures the correlation of the value of the novel metrics with the overall query runtimes. An empirical evaluation is carried on five state of the art cost-based SPARQL federated query engines. Experimental results show that the proposed similarity-based errors have a positive correlation with the query runtime.

Detailed comments:

1) Originality

The proposed similarity-errors metrics are original, however:

O1) The rationale behind the definition of similarity errors is not explained. For instance, definition 2, line 33 why ET is defined like this? Same for definitions 3, and 4. Please give the intuition behind these definitions
O2) It is not clear why the proposed similarity errors are related to federated query engines? Could these metrics be applied to any cost-based federated query engine?

2) Significance of the results,

The evaluation section presents impressive experimentations, however,

S1) Experimental results need a better explanation. For instance, they show that one engine is better than another engine in terms of errors, etc. However, it is not clearly detailed why. For example, for the evaluation of the similarity errors, it is written: “ While CostFed remains the best system and produces the smallest errors, it is followed by Odyssey, SPLENDID, SemaGrow, and LHD ….” The experimental evaluation provides numerical comparisons without explaining to the reader if the obtained results are expected or not. Is this an implantation problem or this is expected results according to the theoretical model?
S2) Page 12, it is written, “We observed that it is possible for a federation engine to produce quite a high cardinality estimation error (i.e., 1.974217 is overall similarity error for S11 query in SemaGrow), yet it produces the optimal query plan. Hence, the runtime is smaller (i.e., 191 ms)”. This is in contradiction with the conclusion (line 22-24)!
S3) The experiments were run in a local environment where the network cost is negligible. However, in a real federated setting, the network (as highlighted in section 2) impacts the overall query execution time. My question: Do the results of the evaluation are significant?
S4) What is the added value and the real utility of using the LargeRDFBench benchmark? ( 10 large queries are omitted from the evaluation). My question: Does FedBech enough to evaluate queries similarity errors?

3) Quality of writing :

The paper is well written and easy to read for an expert. However, some sections need improvement:

Q1) The related work section is confusing and highlights aspects not really related to the proposal. Why Table 1 is related? What are dief@k, dief@t refer to?
Q2) The descriptions of selected federated queries engines are very restricted. Cardinality estimation methods used by the selected engines need to be described. This section needs to be rewritten in a way that allows the reader to understand the results of the empirical evaluation.
Q3) The values of the x-axis of q-error in figure 2 need to be fixed.

In summary, the paper handles an important problem, however, it misses deep reflections about the results of the evaluation. It is not clear if the obtained results are related to the implementation of theoretical models or to the models themselves. For a full research paper, the reader expects a better description of the selected engines and more elaborated discussions of the results of the evaluation.

Review #3
By Stasinos Konstantopoulos submitted on 19/Feb/2020
Major Revision
Review Comment:

The submission proposes a similarity metric that evaluates the accuracy of the predictions that a query optimizer makes about the cardinalities of triples and joins. As these predictions are the primary input for estimating cost, erroneous cardinality predictions can lead the optimizer to sub-optimal query plans.

The proposed similarity metric is compared against q-error, a metric for evaluating cardinality predictions from the relational database literature. The basis of comparison is how strongly the evaluation metric correlates with actual execution times on a benchmark suite. That is to say, the proposed similarity metric is found superior to q-error in the sense that the new metric is a better predictor of overall query execution performance than q-error. Such a metric can be used (in combination with other metrics evaluating other steps of the query processing pipeline) in order to pin-point the step that causes ill-performance.

This work can potentially be a significant contribution to the federated query processing literature, but the submitted manuscript needs substantial improvements to materialize this potential.

Substantial comments:

One comment is that the positive correlation between the similarity metric and the query runtimes was only clearly visible after applying regression to treat outliers. The authors also mention noteworthy outliers (see p. 12, left column, lines 5ff) where a significant error in cardinality estimation had no impact on query execution, because the numbers worked out so that the optimizer selected the optimal plan anyway. This instance (and, one assumes, many if not most of the outliers) is what the proposed similarity metric is promising to capture by normalizing the error with the Euclidean length of the cardinalities vector (see the denominator in the definition of the similarity metric).

The authors should investigate these instances and provide an error analysis where they explain why this normalization failed to capture how the cardinality estimation errors cancelled each other out and the correct plan was produced. Such an analysis can lead to insights about weights and other parameters in the definition of the metric, aiming to reflect how errors in the same direction in all cardinalities tend to cancel each other out. Even if the authors cannot devise an appropriate definition in the context of the work presented in this article, their analysis can guide future work.

Another omission is that the effect of index size is not investigated. Especially when it comes to estimating join cardinalities, the actual cardinalities can vary wildly depending on selectivity and cannot be predicted from individual triple cardinalities. This makes multi-dimensional indexes accurate, but multi-dimensional indexes are also voluminous and maintaining and querying the index can become a significant overhead. From this perspective, I suggest that instead of being a single value, the metric should be a plot of the current metric against index size (reducible to a scalar as AUC or similar, when necessary).

My final comment is that the submission does not evaluate the very first contribution claimed by the authors (p. 2, left column, lines 44ff): we do not see how the new metric can give a fine-grained evaluation of query engines. For this, I recommend that the authors identify in their query suite those queries where erroneous cardinality estimation has led to sub-optimal performance, and compare the similarity metric against metrics evaluating other aspects of the query processing pipeline (p.2, left column, lines 13-21). This will demonstrate how the similarity metric correctly identifies cardinality estimation error as the reason for the sub-optimal overall performance.

Editorial comments:

The paper is overall clear and well-written, with only a few editorial comments:

p2. 2, left column, line 21ff:

I do not understand the juxtaposition to Acosta et al. [24]. The cited paper does not address measuring the accuracy of the cardinality estimators; it is as relevant as any paper on evaluating query processing and as such I do not understand why it is being singled out. I urge the authors to revise this passage in order to avoid creating the (false) impression that they are referring to prior work on the same topic as what they are handling.

p. 2, left column, line 9:

"principle input": I think you mean principal

Section 4:

In all definitions, (r_1, ... r_n) and (e_1, ... e_n) and in general all these vectors are in R^n, not R

Also, it is better that you use \cdot and not * to denote the usual, arithmentic multiplication.

Finally, in Definitions 2 and 3 you are using |v| to denote the Euclidean norm instead of the generally used ||v||_2 notation.

Section 5:

It feels a bit odd that this brief introduction to the federation engines used in the experiments is elevated to a full section. I would only see a reason for having a full foreground section on the used engines if it included a full analysis that provided new, foreground knowledge. This new knowledge could be, for example, an analysis of these engines' innner workings for the purpose of providing prior theoretical motivation for the proposed q-error and similarity metrics; or an analysis that guides the reader on how to best read and interpret the empirical results that compare cardinality errors and query execution times for these engines. As things stand, the authors seem to have more pragmatic rather than theoretical reasons for their selection (see p. 6, left, lines 39-44 and p. 7, left, lines 16-19). These are perfectly valid reasons to select these systems, but I recommend moving the content currently making up Section 5 into the experimental setup in Section 6.1 or into the background in Section 2.

Figures 2 and 3:

The caption should explain two to interpret the gray area in the graphs.

p. 12:

"SPLENDID's average runtime is 98,494.52ms" and in other places: This is an abundance of significant digits, not necessary for the argument. 98sec would suffice for the comparisons you are making, 98.5 sec would already be slightly more than needed; 1/100th of a ms is really overkill and, for that matter, I suspect below the granularity that can be reliably measured.