Review Comment:
Robust Query Processing for Linked Data Fragments
Summary of the paper
The paper proposes two contributions to robust query processing for Linked Data Fragments (LDF).
The first contribution focuses on the robustness of the query plan with respect to cardinality estimation errors. The second contribution focuses on the robustness of query execution.
To address cardinality estimation errors, CROP, a cost model combined with a robustness measure, is defined. This model is used by the query planner to devise an efficient query plan.
For query execution robustness, Polymorphic Join Operators (PJOs), a new class of adaptive join operators, are proposed. These operators are able to switch their join strategy from bind join to hash join, and vice versa, in response to estimation errors.
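To illustrate how I understand this switching behaviour, here is a minimal sketch (my own reading, not the authors' implementation; the function names, the "key" attribute, and the switching heuristic are hypothetical):

```python
# Minimal sketch of a join operator that starts as a bind join and falls
# back to a hash join once the number of per-binding requests exceeds the
# optimizer's estimate. All names and the threshold are hypothetical.

def polymorphic_bind_join(left_tuples, fetch_matches, fetch_all,
                          estimated_requests, switch_factor=2.0):
    left_tuples = list(left_tuples)
    results, requests = [], 0
    for idx, left in enumerate(left_tuples):
        if requests > switch_factor * estimated_requests:
            # Estimate was too low: fetch the whole pattern once, build a
            # hash table, and join the remaining input locally.
            table = {}
            for right in fetch_all():
                table.setdefault(right["key"], []).append(right)
            for l in left_tuples[idx:]:
                results.extend(dict(l, **r) for r in table.get(l["key"], []))
            return results
        # Bind-join mode: one request per binding of the left input.
        results.extend(dict(left, **r) for r in fetch_matches(left))
        requests += 1
    return results
```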
The first contribution is evaluated experimentally. Results show that CROP outperforms state-of-the-art TPF clients. Robust query planning helps to devise efficient query plans that are less prone to potential cardinality estimation errors.
The second contribution is proved theoretically and evaluated empirically. The experiments show promising results in different settings. More precisely, they show that adaptive join operators, in the context of LDF, can enhance robustness during query execution by reducing the impact of suboptimal query planning decisions.
The paper extends previous work published at ISWC 2020. The new contributions of the paper are clearly defined: 1) a refined cost model and a more detailed evaluation of CROP, 2) new adaptive join operators, the Polymorphic Bind Join (PBJ) and the Polymorphic Hash Join (PHJ), 3) results on the theoretical properties and correctness of the PBJ and the PHJ, and 4) an experimental evaluation of the PBJ and PHJ for different query planning approaches.
In general, the paper is well written and easy to read and follow. While the motivations, contributions, evaluation, and state of the art are well explained, I still have a few recommendations and questions, which I detail in the next section in the order in which they appear in the paper.
Detailed comments
In Figure 1, I suggest adding the number of calls for each query plan. This would help the reader compare the number of calls across alternatives. For instance, for Figure 1(a), query plan alternative 1, the number of calls is 813.
Page 4, line 27 (Section 3: Robust Query Planning): What do you mean by "Finally, the optimizer selects are query that yields a suitable trade-off of both cost and robustness."? Please check this sentence.
While Definition 3.4 is clear, the text below it, in lines 30-33, is unclear to me. Could you clarify, please?
Page 5, line 48: The idea is that T(P) denotes that T is a query plan for the BGP P; I believe this sentence can be simplified.
Page 6, line 40 (left column): For computing the cost of evaluating T over the LDF server c, cost(T), you consider cost(T1 join T2). This is confusing, as the server does not evaluate joins. Please clarify this.
Page 6, line 50 (left column): What do you mean by "the cost for processing the tuples from the server at the client"? Do you mean the tuples transferred from the server to the client?
Page 7, line 32 (left column): What is a discounting factor? It is not defined in the paper.
Finally, only simple cases are considered: "In the remainder of this work, we only focus on BJs where T2 is an access operator for a triple pattern because it allows for more accurate request cost estimations." What happens if T2 is an expression? Does the proposed approach still work correctly?
Page 9, lines 14-22 (after Definition 3.7): Different estimation functions (min, max, a+b, max{a/b, b/a}) are tested empirically to find suitable functions for computing the average-case cost. I wonder whether there is any theoretical or rational reason for choosing these functions. Are these the only candidates? Were they chosen to fit your workload? Can they be applied to any workload? In other words, do these functions fit all workloads, or do they have to be customized according to the workload distribution?
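To make my question concrete, this is how I read the candidates (a minimal sketch with hypothetical inputs; how such a function enters the average-case cost is not shown):

```python
# My reading of the candidate functions: each combines the cardinality
# estimates a and b of the two join operands into a single value used in
# the average-case cost. The inputs below are hypothetical.
candidates = {
    "min":       lambda a, b: min(a, b),
    "max":       lambda a, b: max(a, b),
    "sum":       lambda a, b: a + b,
    "max_ratio": lambda a, b: max(a / b, b / a),
}

a, b = 120.0, 3000.0  # hypothetical operand cardinality estimates
for name, f in candidates.items():
    print(name, f(a, b))
```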
In Theorem 3.1, it is unclear to me how you obtain the complexity of Case I: O(2^n) for 2 < k <= 6. From Table 1, I count 16 queries with |P| < 6 and 45 queries with |P| >= 6. Is this correct?
The experimental results show that the block size drastically impacts the optimization time, especially for queries with a large number of triple patterns. Is this result expected or unexpected? Please justify.
Determining appropriate parameters k and δ for the cost model requires considerable tuning. I wonder whether it is possible to give recommendations for choosing appropriate values.
Page 20, line 42 (right column): One important result of the experiments is that "minimizing the number of requests not always yields the best execution times". This is in line with the previous observation regarding the height discount factor. Is there a rational explanation for this result?
Page 20, line 43 (right column): Why are robust plans selected less frequently for the nLDE benchmark than for the WatDiv benchmark? Is there a rational explanation for this result?
Page 21, line 31 (left column): It is written that "the selected robust query plans produce the same number of answers or even more". Why even more? Is this because of timeouts?
Page 21, line 40 (left column): It is written that "CROP allows for devising efficient query plans even for queries where the cost model produces high cardinality estimation errors in the presence of o-o and s-o joins." This suggests that the cost model is more accurate for star queries than for o-o and s-o joins.
In Figure 5, the white color used for nLDE is hard to see. I recommend changing it to a more visible color.
Regarding Figure 5a, it is written that "we observe that CROP has either the lowest or second-lowest runtime in the majority of cases and it is not consistently outperformed by the other engines."
Regarding Figure 5b, why do nLDE and Comunica outperform CROP for the single query Q02?
Is there a rational explanation for the results in Figures 5a and 5b?
Page 24, line 28 (left column): What do you mean by "dynamic k"?
Reproducibility: the current implementation of CROP uses a deprecated version of Python (2.7), while the latest version is 3.9. This makes installation tedious, since Python 2.7 reached the end of its life on January 1, 2020.
Originality with respect to related work
The contributions are original in the context of LDF approaches. They are compared to similar approaches in 1) federated SPARQL query processing, 2) Linked Data Fragments and their clients, and 3) robust query processing in relational databases. For each area, the state of the art is clearly summarized, and CROP and the new class of adaptive operators are well positioned with respect to it.
While the reference to the SaGe query engine [4] is well described on page 27, lines 36-45, it is confusing to cite SaGe as an LDF client in the introduction. SaGe is a SPARQL query engine that relies on web preemption: the SaGe server is able to process joins, aggregate queries, path queries, etc., and the SaGe smart client interacts with the SaGe preemptive server, not with an LDF server. Please clarify this point.
In summary, the paper is well written and presents original ideas in the context of LDF. The idea of defining intra-operator adaptability to support robust query execution in LDF is original. The paper extends previous work published at ISWC 2020. The new contributions are clearly defined and make an interesting contribution that deserves to be published. However, some results need to be better explained and some sentences need to be clarified. For these reasons, I propose a minor revision.