x-Avalanche: Optimisation Techniques for Large Scale Federated SPARQL Query Processing

Tracking #: 1100-2312

Authors: 
Cosmin Basca
Abraham Bernstein

Responsible editor: 
Sören Auer

Submission type: 
Full Paper
Abstract: 
Attributes like ease of linking and integration, flexibility and standardisation are making the RDF data model more popular. As a consequence, more RDF data gets published across different domains. This distributed publication of RDF data ethos embodies the spirit of the Web of Data. While centralised RDF storage has gotten more scalable in order to keep up with the increase of published data, the problem of querying large federations of RDF datasets has not received as much attention. In this paper we extend our existing AVALANCHE federation engine to address some of the most pertinent issues with federated RDF query processing. First, we add support for disjunctions by employing a distributed union operator capable of scaling to hundreds or thousands of endpoints. Second, we enhance the distributed state management with remote caches aimed to reduce the high latency typical of SPARQL endpoints. Finally, we introduce a novel and parallel-friendly optimisation paradigm designed not only to offer an optimal tradeoff between total query execution time and fast first results, but to also consider an extended planning space unexplored so far. Our results show that combined, these capabilities improve our system’s performance by up to 70 times over the best performing SPARQL federation engine and find an optimal performance tradeoff between delivering first results and total query execution time under external constraints.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Maria-Esther Vidal submitted on 03/Sep/2015
Suggestion:
Major Revision
Review Comment:

Summary of the paper:

The authors tackle problem of scaling up to large federations of RDF datasets during SPARQL query processing, and propose x-Avalanche a SPARQL query engine for Web-accessible RDF data. x-Avalanche extends Avalanche, a previous work by the same authors. x-Avalanche is empowered with the capability of traversing search spaces of plans that include disjunctions of sources and produce query plans that minimize time to generate the first and total results. Plans correspond to bi-partite graphs and are represented as plan matrices. A dynamic programming based algorithm is proposed to solve the problem of finding execution plans that correspond to reductions of the plan matrices. A cost model is defined to guide the optimizer into the space of low estimated cost plans. A union operator is defined to merge the data retrieved from the selected sources in a distributed fashion; the proposed union operator relies on a master-slave federation of endpoints, where master endpoints are able to union the data produced by the slaves.
Furthermore, SPARQL endpoints are wrapped by proxies, which are able to join and cache intermediate results. Performance and scalability of the approach is empirically evaluated against data sources generated using the LUBM data generator. Queries are generated using the Waterloo SPARQL Diversity Test Suite (WatDiv). Experimental results suggest that the proposed query optimization techniques are able to identify plans that improve the performance of a federated query engine.

Overall the paper is well-written and tackles a relevant problem in the area of data management. Nevertheless, because of the lack of formalization, the reported results cannot be reproduced or generalized. The main drawback of this work relies on the fact that the authors propose computational techniques and claim optimality of the solution of an optimization problem that is not defined. Similarly, basic concepts are ill-defined. For example, definition 3.1 states that a plan fragment is a query plan where only some sites are considered, but neither a query plan nor the site is defined. Furthermore, nothing is said about the properties of a query plan, e.g., if it is a digraph or tree, or how “sites” are represented in a plan. The definition of fragmented bushy plan suffers of similar problems, and the concepts of plan matrix and plan matrix reduction are not presented. This situation impedes the formal demonstration of the properties of the proposed algorithms, e.g., complexity, conditions of optimality, and completeness and correctness of the answers.
The experimental evaluation is also misspecified, and the experimental design is not statistically justified, e.g., the selection of the parameters that impact on the behavior of the approach are not statically evaluated. Finally, the baseline federated query engine does not implement operators or cost-based optimization techniques comparable to the ones implemented by xAvalanche, preventing thus the generalization of the reported results to the rest of the existing federated query engines, e.g., HiBISCuS-FedX, SPLENDID, or ANAPSID.

Pros:
+ A query planner able to traverse the space of query plans that comprise sub-plans executable against several SPARQL endpoints.
+ Design of physical operators which allow for the execution of join and union operators in a distributed fashion.
+ A distributed caching system able to enhance performance of SPARQL endpoints.
+ Formal properties of the proposed approach are discussed in terms of time complexity analysis.
+ Empirical evaluation of the properties of the proposed approach using existing testbeds and one state-of-the-art engine.

Cons:
- Basic concepts are not formally stated.
- The problem of partition-aware union grouping is not defined, so it is not possible to determine the optimality conditions of this problem.
- Intractability of the problem of identifying efficient query execution plans in the studied space of plans is not analyzed. Similarly, complexity of evaluating plans with unions is not discussed.
- Optimality of the approach is claimed without any formalization of the solved optimization problem. Additionally, characteristics of the instance of the problem that ensure optimality of the proposed solution are not stated or demonstrated.
- The configuration of the experiment is not explained in detail, e.g., characteristics of the SPARQL endpoints, number of buffers assigned to the SPARQL endpoints, maximum memory use. Parameters that impact on the performance of the approach are not statistically demonstrated.
- Behavior of the proposed query optimizer tasks in presence of general vocabulary predicates such as rdf:type, owl:sameAs, is not analyzed.
- Reported metrics are not precisely defined, e.g., time for the total and first results.

Detailed comments:
Abstract and Introduction
The authors claim that xAvalanche is able to explore a search space unexplored so far, and do not make any reference to the plans produced by HiBISCuS-FedX, or ANAPSID. Both approaches are able to execute sub-queries against sub-sets of SPARQL endpoints. To support this statement, the explored search space have to be clearly defined, as well as the feasibility of assessing this goal in presence of queries with a large number of triple patterns, e.g., queries as the ones reported by Sahoo [Sahoo2010] and Loizou et al. [DBLP:journals/ws/LoizouAG15]. Additionally, the proposed approach should be compared to unfolding strategies implemented by Global As View integration frameworks, where the union of all the rewritings (query plans) corresponds to a plan of the original plan and global views correspond to entries in the plan matrix.

1.2 Contributions:
Optimality of the proposed approach is stated as one of the contributions of the paper, but no formal definition of the optimization problem is presented or proof of the optimality of the solution is reported. Unfortunately, this statement cannot be supported with the results reported in this current version of the paper.

The proposed physical operators are not defined, so there is no evidence, theoretical or empirical, that supports the statement: “…. the union operator scalable to hundred or thousand of endpoints…”.

The proposed caching strategy attempts to mitigate the high latency of the SPARQL endpoints; nevertheless, no experimental evaluation is reported to back this claim.

2.1 Related work
Adaptive query engines do not rely on statistics to adjust execution schedulers. So it seems inconsistent with the fact that Avalanche implements an inter-operator adaptive query execution strategy, while it relies on statistics during query optimization. This point needs to be clarified.

3. Optimization
Bushy plans have shown to be more suitable than left-linear plans for SPARQL queries. The selection of left-linear plans have to be justified, and the performance of the proposed approach empirically compared with federated engines that do implement bushy plans, i.e., SPLENDID or ANAPSID.

3.2
Definitions of the plan matrix and the plan matrix reduction have to be included. Also, the optimization problem has to be defined, as well as the conditions to be me by the instance of the problem to ensure optimality.

Complexity analysis of the proposed solutions is presented without any proof. Further, correctness and completeness of the proposed algorithms are not stated.

4. Scalable Distributed Unions
Completeness of the results produced by the union operator is claimed without any formal proof or empirical evaluation. Similarly, properties of the operator that allows for scaling up to large number of endpoints have to be backed up.

Eliminating duplicates from the intermediate results prevents xAvalanche to implement the semantics with SPARQL queries without the DISTINCT modifier.
Because, all the queries evaluated in the experimental study did not include the DISTINCT modifier and FedX does produce duplicates, the query results produced by both engines may differ, impacting this difference in the performance of FedX.

5. Distributed State Management and Caching
Explain in more detail the properties of the proxy and the caching strategies.

6. Evaluation
Explain in more detail the experiment configuration. Which are the hypotheses of this experimental study? Are these hypotheses validated or falsified?

Claims about optimality of the plans found by the proposed optimizer have to be backed up.

Because xAvalanche is empirically evaluated against a federated engine that implements optimization strategies incomparable to the ones provided by xAvalanche, the reported results cannot be generalized to the rest of the federated query engines. Engines as HiBICus-FedX, ANAPSID, or SPLENDID should be included in the study.

Questions to the Authors:
Q0) Definition of plan fragment is misleading and should be defined inductively. Can the subset of sources be empty? What is the shape of the plan? Is a plan fragment a logical or a physical plan? According to the authors (page 5) only left-linear plans are considered. So, is a plan fragment a left-linear plan?
Q1) What is the complexity of the problem of deciding if a reduced plan matrix is an optimal solution of the partition-aware union grouping? When a plan matrix has an optimal reduction?
Q2) Plans defined in Definition 3.2. correspond to the union of left-linear plans, so calling these plans fragmented bushy plans is misleading. Furthermore, the authors claim that the space of the fragmented bushy plans is smaller than the space of bushy plans. The authors should include expressions that count both types of query plans to support this statement.
Q4) The authors claim that fragment bushy plans are variant of bushy plans; note that also left-linear plans are variant of bushy plans. Formulas expressing the number of left-plans included in a fragment bushy plan should be included to understand feasibility of applying this approach in real-world queries with a large number of triple patterns.
Q4) Time complexity of the optimizer that explores the extended planning space is presented without defining the optimization problem that is solving. What are the properties of the algorithm that has this time complexity? Does this algorithm solve the partition-aware union grouping problem?
Q5) What is a master endpoint? Can the authors illustrate in the existing SPARQL endpoints which of them are master endpoints? The same questions for slave SPARQL endpoints.
Q6) Why parallel tree-unions are computed? How a fragmented bushy plan is rewritten into a parallel tree-union?
Q7) What are the heuristics that support the generation of parallel plans as the ones reported in Figure 6?
Q8) What are the features of a proxy? Could the authors provide examples of existing proxies on top of existing SPARQL endpoints?
Q9) What are the main caching techniques implemented by xAvalanche? What is a record (unit in the cache) of an RDF document?
Q10) The characteristics of the benchmark that impact on the behavior of xAvalanche are presented without any justification or proof. Can the authors provide statistical evidence that these parameters have a significant impact? For example, the number of triple predicates on general vocabulary predicates such as rdf:type, owl:sameAs, is not mentioned. So, what would be the impact of these types of predicates?
Q11) The authors claim that horizontal partitioning schemas are a natural fit for federations of RDF data. Provide examples of existing RDF data sets on the LOD cloud that meet this property.
Q12) What will be the impact on having vertical partitioning schemas or hybrid?
Q13) What are the characteristics of the evaluated queries, i.e., number of triple patterns, number of joins, and unions? Queries with the DISTINCT modifier need to be included in the testbed as well as queries with a large number of triple patterns. Please note that in real world scenarios queries may have more than 50 triple patterns [Sahoo2010, DBLP:journals/ws/LoizouAG15]. Please include queries with more than 10 triple patterns, e.g., use real-world queries as the ones reported by Loizu et al. [DBLP:journals/ws/LoizouAG15].
Q14) Please cite existing federated query engines that implement a serial execution of the union of fragments? The serial execution of the union of fragments corresponds to the worst scenario that the parallel execution may trivially overcome. Adaptive implementations of the union operator provided by ANAPSID will allow for a more fear comparison of the proposed union operator.
Q16) Reported metrics are not defined. Please, define total time, best time, as well as the methods used to compute these metrics.
Q17) Explain how the fragments were distributed along the SPARQL endpoints; report on the number of SPARQL endpoints available and how the endpoints were distributed in the 11 machines.
Q18) FedX is implemented in Java while xAvalanche is implemented in Python 2.7.8. Please, explain how the clocks used to measure the time execution were set up to ensure precise values.
Q19) Neither FedX nor xAvalanche implements adaptive inter- or intra-operator techniques that ensure the generation of answers in an incremental fashion. Could the author explain the characteristics of FedX and xAvalanche that justify the different behaviors exhibit by these two engines when the first and the last answer are produced. Consider that all the queries do not include the DISTINCT modifier and FedX produces duplicates; so, the number of results produced by FedX may be larger that the ones produced by xAvalance. Exactly the same comment applies for the experiments reported in Section 6.7.
Q20) Although the proposed optimization and execution techniques are supposed to be applied on queries with joins and unions, only queries with joins are evaluated in the experimental study. Please, include queries that mix both joins and unions in the evaluation.
Q21) Authors highlight that queries “finish well”; do they want to say that all the answers are produced? Can completeness of the proposed approach ensure? Please, report percentage of overlap between the answers produced by two engines. For queries without the DISTINCT modifier, report the overlap between the produced answers as well as the multiplicity of the duplicates answers coincides. Please, report on metrics to compare multi-sets to report on the completeness of the answers [DBLP:conf/sgai/KostersL07].
Q22) Justify the statement: “an optimal tradeoff is obtained when \phi”. Did the author compute all the plans and find that the one produced by xAvalanche optimizers (GRDY and DP) with fragmentation is optimal. If so, how many triple patterns have these queries? Can this property be met for all the LINEAR shape queries? What are the conditions that ensure optimality of the greedy and dynamic based solution? Is optimality ensure with respect to the estimated cost model or with respect to the real cost? Is monotonicity one of the conditions to be met by the cost model to ensure optimality of the solution? How correlations between the triple patterns will affect the behavior of the proposed approach? How does the proposed cost model compare to the cost model proposed by Naumman et al. [DBLP:conf/icde/NeumannM11]? Could characteristic sets be computed from the plan matrix?
Q23) Given that the proposed engine is not adaptive, please justify sthe differences between the time of the first results of the plans produced by GRDY and DP. How are the properties of these plans? Are these plans able to produce complete answers?
Q24) Explain the meaning of “flexible scheduling of resources”, do the authors mean to say that their approach is adaptive? If so, what are the characteristics of the federation and queries that benefit adaptivity (flexibility) of the proposed approach? Please, indicate the impact at inter- and intra-operator level.
Q25) What are the characteristics of the best cases for xAvalanche? Can these results be generalized, i.e., can the authors probe optimality of their proposed approach whenever cases that meet the characteristics of the best cases? Are the plans optimal with respect to the cost model or to respect to the real cost?
Q26) How would the proposed parallel multi-cast join compare to existing parallel joins, e.g., NUMA-awarw Hash Joins by Lang et al. [DBLP:conf/vldb/LangLA0K13].
Q26) The proposed cost model is quite simple. Please, report on the correlation between the estimates and the real cost values. Additionally, show the impact of the selectivity of the join and duplicates of union on the accuracy of the proposed cost model.
Q27) What will be the expected behavior of xAvalanche if data is provided by Triple Pattern Fragments[DBLP:conf/semweb/VerborghHMHVSCCMW14] instead that from SPARQL endpoints?

@inproceedings{DBLP:conf/sgai/KostersL07,
author = {Walter A. Kosters and
Jeroen F. J. Laros},
title = {Metrics for Mining Multisets},
booktitle = {Research and Development in Intelligent Systems XXIV, Proceedings
of AI-2007, the Twenty-seventh {SGAI} International Conference on
Innovative Techniques and Applications of Artificial Intelligence,
Cambridge, UK, December 2007},
pages = {293--303},
year = {2007}
}

@inproceedings{DBLP:conf/vldb/LangLA0K13,
author = {Harald Lang and
Viktor Leis and
Martina{-}Cezara Albutiu and
Thomas Neumann and
Alfons Kemper},
title = {Massively Parallel NUMA-Aware Hash Joins},
booktitle = {In Memory Data Management and Analysis - First and Second International
Workshops, {IMDM} 2013, Riva del Garda, Italy, August 26, 2013, {IMDM}
2014, Hongzhou, China, September 1, 2014, Revised Selected Papers},
pages = {3--14},
year = {2013}
}

@inproceedings{DBLP:conf/icde/NeumannM11,
author = {Thomas Neumann and
Guido Moerkotte},
title = {Characteristic sets: Accurate cardinality estimation for {RDF} queries
with multiple joins},
booktitle = {Proceedings of the 27th International Conference on Data Engineering,
{ICDE} 2011, April 11-16, 2011, Hannover, Germany},
pages = {984--994},
year = {2011}
}

[Sahoo] S. S. Sahoo. Semantic Provenance: Modeling, Querying, and Application in Scientific Discovery. PhD Thesis. 2010.

@article{DBLP:journals/ws/LoizouAG15,
author = {Antonis Loizou and
Renzo Angles and
Paul T. Groth},
title = {On the formulation of performant {SPARQL} queries},
journal = {J. Web Sem.},
volume = {31},
pages = {1--26},
year = {2015}
}

@inproceedings{DBLP:conf/semweb/VerborghHMHVSCCMW14,
author = {Ruben Verborgh and
Olaf Hartig and
Ben De Meester and
Gerald Haesendonck and
Laurens De Vocht and
Miel Vander Sande and
Richard Cyganiak and
Pieter Colpaert and
Erik Mannens and
Rik Van de Walle},
title = {Querying Datasets on the Web with High Availability},
booktitle = {The Semantic Web - {ISWC} 2014 - 13th International Semantic Web Conference,
Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part {I}},
pages = {180--196},
year = {2014}
}

Review #2
Anonymous submitted on 09/Sep/2015
Suggestion:
Major Revision
Review Comment:

This paper presents x-Avalanche, an extended version of the authors’ previous work AVALANCHE presented in [6,5]. x-Avalanche is an adaptive engine for federated SPARQL query processing over multiple data sources (i.e., the SPARQL endpoints). In the current extension, a support for disjunctions by employing a distributed union operator is added as well as a remote caches aimed to reduce the high latency typical of SPARQL endpoints is introduced. In addition, fragmented bushy plans are used to assist parallel sub-queries processing. x-Avalanche is only compared to FedX and the baseline AVALANCHE [5]. The authors claimed that they improved the system performance by up to 70 times over FedX. Overall, I like the optimization strategies introduced in this paper. The paper reads well and easy to follow in majority of the sections. I have the following concerns that needs be addressed in the revised version.

1. Source Selection: I am not able to find how x-Avalanche performs source selection, i.e., how relevant SPARQL endpoints are identified? Is it index-free, index-only, or hybrid [23]? If it is not the first case, how does the index look like, i.e., what statistics it stores about the data sources? and how they are used during the source selection, query planning and query execution phases? This should be included in to the paper to better understand the query planning phase. In addition, how efficient is the proposed source selection? as efficient source selection leads to better query planning. The efficiency (in terms of the total triple pattern-wise source selected [23]) of the propose source selection approach should be compared with state-of-the-art source selection approaches introduced in HiBISCuS [R1], ANAPSID, and FedX.

2. Choice of Benchmark: It is stated “We propose a simple synthetic benchmark based on LUBM [12] and the design of the Waterloo SPARQL Diversity Test Suite or WatDiv [3] with support for different data distributions”. Unfortunately, both of these benchmarks have focused on the problem of query evaluation over local, centralized repositories. Hence, these benchmarks do not consider federated queries over multiple interlinked datasets hosted by different SPARQL endpoints. LUBM was designed with aim to evaluate the reasoners and triple stores with reasoning capabilities. Similarly, WatDiv was designed to test the triple stores for different workloads. Both of these benchmarks do not consider the requirements of the federated SPARQL query processing. For example, the number of sources the query span (i.e., the number of SPARQL endpoints actively participated during the query execution) is not considered. Existing federated benchmarks, e.g., FedBench was criticized “Unfortunately, semantically selective benchmarks do not shed any light into how the federation engine performs in worst case scenarios, where hundreds of endpoints are actively engaged in query answering”. I can see the data is horizontally partitioned but there is no information given in the paper how many SPARQL endpoints are actually participating while executing each query in the proposed benchmark? It is possible that a large number of endpoints are skipped during the source selection phase. In particular, join-aware source selection (used in HiBISCuS, ANAPSID) is able to skip many of the endpoints. One of the benchmark requirement mentioned on page 13 stats “benchmark must be able to provide a diverse and comprehensive set of queries.” Unfortunately, the queries used in the proposed benchmark is not diverse and comprehensive. In my opinion they are not complex enough, e.g., all of the queries are single BGPs, there is no query that makes use of the important and highly used SPARQL features, e.g., OPTIONAL, UNION, DISTINCT, FILTER etc. Even there is no query containing unbound predicate. The number of triple patterns only ranges from 1 to 9. There is no query with bound subject (i.e., a URI) in a triple pattern. According to Table 1, only 5 out of 14 benchmark queries scale. Thus, by only using 5 saleable queries how would you justify the first requirement “benchmark must be able to scale to as many endpoints as required” ? In the same table, how it was decided that a given query is less or highly selective? In summary, it would be nice to add the following information in to the benchmark section.

a. Perform source selection using any of the join-aware source selection approach (HiBISCuS or ANAPSID) and report the number of distinct sources that are selected for each benchmark query. This will give a rough idea of the number of actively participated sources during the query execution. I am assuming there must be some queries for which total number of distinct selected sources is around 100 to justify the claim that “we add support for disjunctions by employing a distributed union operator capable of scaling to hundreds or thousands of endpoints”.

b. In addition, to the existing queries, it would be nice to add some more complex queries (e.g., triple patterns greater than 10, number of GBPs greater than 1, and that make use of the FILTER, LIMIT, ORDY BY, Group By, OPTIONAL, UNION etc. The queries should scale as well. The result size for each of the benchmark query should either be provided in the paper or project home page. In section 6.1, it is written “we fixed the result-set selectivity threshold to 5000 tuple” suggesting that the proposed benchmark queries are not diverse enough in terms of the result sizes.

c. I would really like to see how x-Avalanche compares to FedX on FedBench, i.e., on simple yet well-known and the only (to the best of my knowledge) federated queries benchmark based on real multiple interlinked datasets.

3. Experimental Setup and Evaluation: I am very much interested to run x-Avalanche and compare with other federation engines. I tried to setup the experimental environment. Unfortunately, I was not successful to replicate the experimental results. A user manual may help to understand the whole setup. I would like to see dumps of one of the experimental setup say U3 on webpage so that one can upload to a local set of endpoint, run experiments and compare the results.

4. Fragmentation Results (section 6.7): x-Avalanche is only compared to the fastest (i.e., FedX according to [23]) SPARQL endpoint federation engine. I like the evaluation results presented in Figure 16. However, it only shows trade-off between first and total results pertaining to x-Avalanche. Since ANAPSID is also an adaptive federated query engine aiming to provide fast first result, I would like to see a comparison of x-Avalanche with ANAPSID in terms of how fast the first result is retrieved by both of these adaptive engines. In addition, ANAPSID also makes use of the busy trees.

5. Fragmented busy plans: I can clearly see the advantages of fragmented busy plans in terms of the parallel sub-queries executions. However, it seems this strategy can results in duplicate sub-queries execution in different fragments. For example, in Fig. 2 the (U5,TP3) is duplicated in F1 and F2. This may lead to the duplicate results retrieval? And thus affects the overall query execution time. I was just wondering, why such duplicates sub-plans exists? And is there any way to avoid them? It is possible that different disjoint fragments may not be possible for a given query. I was just wondering, in the is there any such query in the proposed benchmark for which the query plan only generated a single left-deep tree, same like FedX? If so what is the query performance of x-Avalanche compared to FedX for such queries?

6. Cost Model: In section 3.1, equations 1 and 2 show cardinality estimation of the join and union between two triple patterns, respectively. Both of these equations are based on cardinality estimation for single triple patterns. In order to better understand the cost model, the single triple pattern cardinality should be explained first. Θ represents the total number of triples in the dataset?

7. BayesianBlocks: A paper should contains all the necessary details to understand the overall approach. I was not able to understand Algorithm 1 because the key procedure BAYESIANBLOCKS(δ, p0) is not explained in the paper. Instead of pointing to another paper, the explanation should be added in to the paper.

8. On page 17 “We attribute FedX’s speedup over X-AVALANCHE for queries LQ5, LQ9 and LQ11 to the following: a) the queries are highly selective with 7, 3, and 133 results respectively, and b) FedX’s local cache, which can greatly improve performance by discarding sources known not to contribute to the current query.” I was just wondering why FedX performance is better in highly selective queries? Does this mean X-AVALANCHE is less optimized for such queries?

Some minor comments are:

“The limitation to of investigations to semantically selective situations can lead to a lack of attention and optimisations that target more difficult scenarios.”
==> This sentence needs better explanation or rephrasing.

“be resolved by one host, i.e., no disjunctions allowed inside a fragment.”
==> Is this same to the idea of exclusive groups used in FedX, ANAPSID ?

“In other words the general idea is to reduce the size of each column individually, i.e., by grouping together sites given a criteria for group fitness”
==> What is that criteria?

“In order to avoid the creation of segments that would span across multiple columns and therefore invalidating the semantics of the original SPARQL query, we introduce a helper structure referred to as breaks (B), which is a list holding the starting index of each PM column in D”
==> This sentence needs better explanation or rephrasing.

On page 9, cost o a fragment -> cost of a fragment.

“To address this notion, we distinguish between the selectivity of a query based on the number of result tuples, which we call result-set selectivity, and the source selectivity of the query. The latter kind of selectivity is the decisive factor during the source selection phase and can dramatically improve performance and recall.”
==> Where is the information about result-set and source selectivity given?

“In the first distribution U1, data specific to one LUBM university is allocated to one site, similarly, distributions U3, U5 and U7 split the triples of each university to 3, 5 and 7 sites respectively”
==> Does this mean a maximum of 7 sites (SPARQL endpoints) were used in the evaluation? If so how would you justify “Specifically, we measured the time it takes to union all partial bindings spread over 100 sites, while relying on the same experimental setup detailed earlier”? Overall, as mentioned before, the experimental setup needs to be better explained in further details in order to replicate the experimental results.

[R1] Saleem, Muhammad, and Axel-Cyrille Ngonga Ngomo. "Hibiscus: Hypergraph-based source selection for sparql endpoint federation." The Semantic Web: Trends and Challenges. Springer International Publishing, 2014. 176-191.