Robust Query Processing for Linked Data Fragments

Tracking #: 2783-3997

Authors: 
Lars Heling
Maribel Acosta

Responsible editor: 
Guest Editors Web of Data 2020

Submission type: 
Full Paper
Abstract: 
Linked Data Fragments (LDFs) refer to interfaces that allow for publishing and querying Knowledge Graphs on the Web. These interfaces primarily differ in their expressivity and allow for exploring different trade-offs when balancing the workload between clients and servers in decentralized SPARQL query processing. To devise efficient query plans, clients typically rely on heuristics that leverage the metadata provided by the LDF interface, since obtaining fine-grained statistics from remote sources is a challenging task. However, these heuristics are prone to potential estimation errors based on the metadata which can lead to inefficient query executions with a high number of requests, large amounts of data transferred, and, consequently, excessive execution times. In this work, we investigate robust query processing techniques for Linked Data Fragment clients to address these challenges. We first focus on robust plan selection by proposing CROP, a query plan optimizer that explores the cost and robustness of alternative query plans. Then, we address robust query execution by proposing a new class of adaptive operators: Polymorphic Join Operators. These operators adapt their join strategy in response to possible cardinality estimation errors. The results of our first experimental study show that CROP outperforms state-of-the-art clients by exploring alternative plans based on their cost and robustness. In our second experimental study, we investigate to what extent different planning approaches can benefit from polymorphic join operators and find that they enable more efficient query execution in the majority of cases.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Hala Skaf-Molli submitted on 09/May/2021
Suggestion:
Minor Revision
Review Comment:

Robust Query Processing for Linked Data Fragments

Summary of the paper

The paper proposes two contributions for robust query processing for Linked Data Fragments (LDF).

The first contribution focuses on the robustness of the query plan with respect to cardinality estimation errors. The second contribution focuses on the robustness of query execution.

For cardinality estimation errors, CROP, a cost model combined with a robustness measure, is defined. This model is used by a query planner to devise efficient query plans.

For query execution robustness, Polymorphic Join Operators (PJO), a new class of adaptive join operators, are proposed. These operators are able to switch their join strategy from bind join to hash join, and vice versa, in response to estimation errors.
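
To make the switching idea concrete, the following minimal Python sketch (my own illustration, not the authors' PBJ/PHJ implementation; the names polymorphic_join, request and switch_factor, as well as the single join variable "x", are assumptions) shows how such an operator might abandon a bind join once the cardinality estimate proves too low:

def polymorphic_join(left, request, estimated_card, switch_factor=4):
    """Toy polymorphic join: start as a bind join (one request per left mapping) and
    switch to a hash-join-style strategy once the observed result size exceeds the
    estimate by switch_factor."""
    produced, results = 0, []
    for i, mu in enumerate(left):
        matches = request(mu)                      # bind-join request for this mapping
        produced += len(matches)
        results.extend({**mu, **m} for m in matches)
        if produced > switch_factor * estimated_card:
            # Estimation error detected: fetch the right-hand side once (request(None))
            # and probe it with the remaining left mappings instead of issuing
            # one request per mapping.
            table = {}
            for m in request(None):
                table.setdefault(m["x"], []).append(m)
            for mu_rest in left[i + 1:]:
                results.extend({**mu_rest, **m} for m in table.get(mu_rest["x"], []))
            break
    return results

# Toy data: 100 matching mappings, but the optimizer only expected 5 results.
left = [{"x": v} for v in range(100)]
right = [{"x": v, "y": v * v} for v in range(100)]
request = lambda mu: right if mu is None else [m for m in right if m["x"] == mu["x"]]
print(len(polymorphic_join(left, request, estimated_card=5)))   # 100 complete results

The sketch only illustrates the switching trigger; the operators proposed in the paper additionally come with correctness guarantees for the results when the strategy changes.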

The first contribution is evaluated experimentally. Results show that CROP outperforms state-of-the-art TPF clients. Robust query planning helps to devise efficient query plans that are less prone to potential cardinality estimation errors.

The second contribution is proved theoretically and evaluated empirically. The experiments show promising results in different settings. More precisely, they show that adaptive join operators in the context of LDF can enhance robustness during query execution by reducing the impact of suboptimal query planning decisions.

The paper extends a previous work published at ISWC 2020. The new contributions of the paper are clearly defined: 1) a refined cost model and a more detailed evaluation of CROP, 2) new adaptive join operators: the Polymorphic Bind Join (PBJ) and the Polymorphic Hash Join (PHJ), 3) results on the theoretical properties and correctness of the PBJ and the PHJ, and 4) an experimental evaluation of the PBJ and PHJ for different query planning approaches.

In general, the paper is well written and easy to read and follow. While the motivations, contributions, evaluation, and state-of-the-art are well explained, I still have a few recommendations and questions that I detail in the next section, as they appear in the paper.

Detailed comments

In Figure 1, I suggest adding the number of calls for each query plan. This will help the reader better compare the number of calls per alternative. For instance, for Figure 1(a), query plan alternative 1, the number of calls is 813.

On page 4, line 27 (Section 3: Robust Query Planning): What do you mean by « Finally, the optimizer selects are query that yields a suitable trade-off of both cost and robustness. »? Please check the sentence.

While definition 3.4 is clear, the text below in lines 30-33 is unclear for me. Could you clarify, please?

Page 5, line 48: The idea is that T(P) denotes that T is a query plan for the BGP P. I believe you can simplify this.

Page 6, line 40 (left column): For computing the cost of evaluating T over the LDF server c, cost(T), you consider cost(T1 join T2). This is confusing, as the server does not evaluate joins. Please clarify this.

Page 6, line 50 (left column): What do you mean by « the cost for processing the tuples from the server at the client »? Do you mean the tuples transferred from the server to the client?

Page 7 line 32 (left column): What is a discounting factor? This is not defined in the paper.

Finally, only simple cases are considered: « In the remainder of this work, we only focus on BJs where T2 is an access operator for a triple pattern because it allows for more accurate request cost estimations. » What happens if T2 is an expression? Does the proposed approach work correctly?

Page 9, lines 14-22 (after Definition 3.7): Different sets of estimation functions are tested empirically to find suitable functions for computing the average-case cost (min, max, a+b, max{a/b, b/a}). I wonder if there is any theoretical or rational reason to choose these functions. Are these the only functions? Did you choose these functions to fit your workload? Can these functions be applied to any workload? In other words, do these functions fit all workloads well, or do they have to be customized according to the workload distribution?

In Theorem 3.1, it is unclear to me how you obtain the complexity of Case I, O(2^n), for 2 < k = 6. From Table 1, I can count 16 queries with |P| < 6 and 45 queries with |P| >= 6. Is this true?

The experimental results show that the block size drastically impacts the optimization time, especially for queries with a large number of triple patterns. Is this an expected or unexpected result? Please justify.

Determining appropriate parameters k and \delta for the cost model requires considerable tuning. I wonder if it is possible to suggest any recommendations for determining appropriate parameters.

Page 20, line 42 (right column): One important result of the experiments is: « minimizing the number of requests not always yields the best execution times ». This is in line with the previous observation regarding the height discount factor. Is there any rational explanation for this result?

Page 20, line 43 (right column): Why are robust plans selected less frequently for the nLDE benchmark than for the WatDiv benchmark? Is there any rational explanation for this result?

Page 21, line 31 (left column): It is written « the selected robust query plans produce the same number of answers or even more ». Why even more? Is this because of timeouts?

Page 21 line 40 (left column): It is written « CROP allows for devising efficient query plans even for queries where the cost model produces high cardinality estimation errors in the presence of oo and s-o joins. » This means that the cost model has better accuracy for star queries than for o-o and s-o joins.

In Figure 5, the white color for nLDE makes it hard to see. I recommend changing it to a more visible color.

Figure 5a: it is written « we observe that CROP has either the lowest or second-lowest runtime in the majority of cases and it is not consistently outperformed by the other engines. »

Figure 5b: why do nLDE and Comunica outperform CROP for the single query (Q02)?

Is there any rational explanation for the results in Figures 5a and 5b?

Page 24, line 28 (left column). What do you mean by dynamic k?

Reproducibility: the current implementation of CROP uses a deprecated version of Python (Python 2.7), while the latest version of Python is 3.9. This makes installation tedious, because "Python 2.7 reached the end of its life on January 1, 2020."

Originality with respect to related works

The contributions are original in the context of LDF approaches. The contributions are compared to similar approaches in 1) Federated SPARQL query processing, 2) Linked Data Fragments and clients and 3) Robust query processing in Relational Databases. For each approach, the state of the art is clearly summarized and the CROP and the new class of adaptive operators are well compared.

While the reference to the SaGe query engine [4] is well described on page 27, lines 36-45, it is confusing to cite SaGe as an LDF client in the introduction. SaGe is a SPARQL query engine that relies on Web preemption. The SaGe server is able to process joins, aggregate queries, path queries, etc. The SaGe smart client interacts with the SaGe preemptive server and not with an LDF server. Please clarify this point.

In summary, the paper is well written and presents original ideas in the context of LDF. The idea of defining intra-operator adaptability to support robust query execution in LDF is original. The paper extends a previous work published at ISWC 2020. The new contributions in the paper are clearly defined and make an interesting contribution that deserves to be published. However, some results need to be better explained and some sentences need to be clarified. For these reasons, I propose a minor revision.

Review #2
By Oscar Corcho submitted on 16/May/2021
Suggestion:
Accept
Review Comment:

This is a very detailed paper that proposes two main contributions (the authors claim four, but the main contributions may be summarised in these two, IMO): (1) a cost model that can be used in the context of query evaluation using Linked Data Fragments; and (2) the design and implementation of two adaptive join operators that can be used in this context, called Polymorphic Bind Join (PBJ) and Polymorphic Hash Join (PHJ), whose main characteristic is that they can switch from one type of join to another during execution, aiming at reducing as much as possible, on average, the number of requests that must be made to the corresponding Linked Data Fragment server.

Given the fact that two main contributions are provided in the paper, it looks somehow too dense for a usual paper. That is, it may have been better to have two different papers, one focused on the cost model and another one focused on the design and implementation of the adaptive join operators. While reviewing, I have found myself trying to understand the core relationship between one contribution and the other in the context of this single paper.

Other than that, the paper is generally well written and easy to follow, especially for sections 3 and 4, which are sufficiently detailed to understand many of the design decisions for both the cost model and for the design of the operators.

However, section 5 may require some reorganisation and rewriting. I would suggest that the authors provide, at least at the beginning, a general (perhaps graphical) overview of the elements that they are trying to evaluate with the different types of experiments, of the parameters that are considered for each of those experiments, and the reasons why the results from the proposed approaches are being compared with systems XX, YY, and the usage of benchmarks XX and YY. Considering how section 5.1 is written, it is good for reproducibility purposes, but it is not so clear in terms of "why this is being designed in this way, and which hypothesis I am trying to validate". Furthermore, I find the discussion on the results very informative, but it would be good for readers to have clearer messages (probably highlighting the main messages in bold font) about the results.

Other than that, the work is solid and reflects a large amount of work devoted not only to the design of the cost model and operators, but also to the evaluation framework that has been used, which may be relevant for others building similar solutions.

Minor comments and typos:
- I would suggest including in the introduction, just before the research questions are included, a paragraph describing briefly what you understand by robust query processing, since this concept is not introduced until much later in the paper, and it makes the research questions more difficult to understand, as this concept is not so well understood in the community (even by those working on SPARQL query processing).
- I do not understand very well why federated query processing is included as part of the related work. It is true that it is generally relevant when it comes to performing queries on the Web, but in reality this work is not so much about federation, as it is generally assumed, if I am not wrong, that a single Linked Data Fragment server is considered for the execution, but mostly about Linked Data Fragments and about Web preemption, which I consider very relevant techniques.
- "one requests using" --> request
- "selects are query" ?
- "chose" instead of "choose" in several places.
- "first we first" ?

Long-term stable URL for resources. All resources are clearly identified, and results of the evaluation are archived in Zenodo, at https://zenodo.org/record/4639843. The organisation of the material in the corresponding resource.
I would also recommend archiving (in Zenodo, for instance) the content/release of the GitHub repository github.com/Lars-H/crop

Review #3
By Stasinos Konstantopoulos submitted on 05/Jul/2021
Suggestion:
Major Revision
Review Comment:

The article presents two approaches for alleviating the efficiency degradation caused by join cardinality mis-estimations during query execution planning.

The first method adds the concept of robustness in the optimizer's cost function: plan A is considered superior to plan B when A is expected to suffer minor performance degradation in the face of mis-estimations, although B would be more efficient if all estimations prove to be accurate.

The second method presents "polymorphic operators", operators that can switch mid-execution between bind-join and hash-join physical operators, if the observed selectivity indicates that such a switch would improve efficiency.

I found both methods interesting and the evaluation thorough and convincing. In general the submission is a demanding, but enjoyable, read describing interesting work.

The language is excellent with only some minor comments listed below.

One cannot help but notice how little overlap there is between these two: The submission reads like the interleaved concatenation of two articles. They both address the same concern, but in completely different ways. This makes the article extremely difficult to read: as soon as the reader leaves Section 3 and starts Section 4, it is a completely new work; and then entering Section 5 the reader has to switch back to the "robust planning" frame of mind to understand the evaluation in Section 5.2 and then switch again to the "polymorphic operators" frame of mind to understand Section 5.3.

Section 3:

p. 5, right column, lines 48 - 49:

This is unclear. I think it would make sense if you remove the "as", in which case I would understand T(P) as the "obvious plan", the plan that directly matches the SPARQL query. But I fear I am guessing, and guessing wrongly. Please clarify the text.

p. 9, right column, lines 12 - 27:

This is a non-trivial result, and since it is an original result I feel that it should be included in the paper. Since this is a preliminary test and not a core result, you might want to throw them in an appendix to not break the flow of the text, but I believe they should be available to the reader.

Also, the final three lines of Section 3.3 need to be expanded with more details to become clearer. As the text stands, it is unclear how exactly the average cost is computed for o-o and s-o joins.

I think it would also help if you used your running example to show concrete examples of calculating the best and median estimations. Listing 1, for instance, has both an s-o join and a star, so you can show us on that query how to compute the average cost.

Section 4:

I do not fully understand what theoretical guarantees are offered by Theorems 4.1 and 4.2; I have read them several times, leaving several days between readings to make sure that I am re-reading them with a fresh mind, and I just don't understand what is being proved other than that "there are two execution flow branches, and, in either branch, if the underlying operators work, then switching also works."

This is just my opinion, and your Theorems seem to be perfectly fine, but I think making the same argument based on following the flow of the two algorithms would be much easier to read (and just as convincing). I am saying this because, after reading the proofs, the essence of the proofs remains in the arguments that break down the algorithms into the two cases, rather than in any algebraic manipulation.

Of course I might have missed something important, in which case I urge you to guide the readers of Section 4 into a better understanding of the significance of what they have read.

Regardless of the above there are two errors that should be fixed:

There is a mismatch between Algorithm 2 (p. 12) and Theorem 4.1, Case II (p. 13, right column, lines 16-19). It seems to me that, to align the two, the Algorithm must be fixed and not the Theorem. Specifically, Theorem 4.1 assumes that the eval(T1) results received during the BJ phase are inserted into the H1 hashtable, but this step is missing in Algorithm 2. Furthermore, Algorithm 2 should also foresee that the eval(T1) results are kept so that they are accessible should the switch happen; for instance, by maintaining H1 during the BJ phase anyway, just in case it is needed.
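
A minimal sketch of this suggestion (the helper name bind_join_step and the request callable are mine, not Algorithm 2's notation): every eval(T1) mapping consumed during the BJ phase is also inserted into H1, so it remains probe-able after a switch.

def bind_join_step(mu1, join_var, H1, request):
    """Process one eval(T1) mapping in the bind-join phase while keeping H1 up to date."""
    H1.setdefault(mu1.get(join_var), []).append(mu1)   # retain mu1 for a possible switch
    return [{**mu1, **mu2} for mu2 in request(mu1)]    # bind-join results for mu1

# After a switch to the hash-join phase, a mapping mu2 from eval(T2) can still be
# joined with the T1 mappings consumed earlier by probing H1[mu2[join_var]].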

Similarly, there is a mismatch between Algorithm 3 (p. 15) and Theorem 4.2, Case II (p. 16, left column, lines 9-15). Again, it seems to me that the Algorithm needs fixing: Since T1 must be completely evaluated before a switch can take place, a switch makes no sense if T2 has also been completely evaluated in a previous iteration of the while loop, before T1 also completed. So line 15 of Algorithm 3 should also check that mu'' is NOT EOF to allow switching. This would also make the algorithm match more closely the argument in Theorem 4.2, Case II.
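
For illustration only (the names are mine, not the paper's notation), the amended guard would take a form like:

# In addition to the existing switch condition on line 15, require that mu'' (the next
# eval(T2) mapping) is not EOF, so a switch is only attempted while T2 still has
# unseen mappings.
def switch_allowed(existing_condition, mu_double_prime, EOF):
    return existing_condition and mu_double_prime is not EOF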

Section 6 is partly background and partly related work. This is not a strong recommendation, but please consider moving the material that introduces query planning and TPF in a new section before Section 3. The material (roughly) from p. 27, right column, l. 46 onwards is more of a comparison and reads better where it stands.

p. 25, right column, line 38:

Please explain how it can happen that different resultsets are returned. Do you timeout the experiment?

Some minor editorial comments:

p. 4, right column, l. 28:
"selects are query" -> "selects a query plan"

p. 6, left column, l. 44:
there should be no paragraph indentation. Use a line with a % in the source to visually separate the text from the formula in the source, and still have the text after the formula typeset without a paragraph indentation.

p. 28, left column, l. 34:
"uncertain" -> "uncertainty"

p. 28, left column, l. 35:
"cardinality- and selectivity-based" ->
"cardinality and selectivity-based" (no hyphen)

p. 29, left column, line 1:
There is a period missing after "plans"

p. 29, left column, line 16:
I wouldn't describe your analysis as "theoretical evaluation".
I think "theoretical analysis" is more appropriate.