Online SPARQL Aggregate Queries Processing with Web Preemption

Tracking #: 2728-3942

Authors: 
Julien Aimonier-Davat
Hala Skaf-Molli
Pascal Molli
Arnaud Grall
Thomas Minier

Responsible editor: 
Guest Editors ESWC 2020

Submission type: 
Full Paper
Abstract: 
Getting complete results when processing aggregate queries on public SPARQL endpoints is challenging, mainly due to quotas enforcement. Although the Web preemption allows to process aggregation queries online, on preemptable SPARQL servers, data transfer is still very large when processing count-distinct aggregate queries. In this paper, it is shown that count-distinct aggregate queries can be approximated with low data transfer by extending the partial aggregation operator with HyperLogLog sketches. Experimental results demonstrate that the proposed approach outperforms existing approaches by orders of magnitude in terms of the amount of transferred data.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Stasinos Konstantopoulos submitted on 06/Apr/2021
Suggestion:
Minor Revision
Review Comment:

The submission presents a method and prototype for computing
approximations of count-distinct aggregate queries against servers
that apply Web preemption to ensure responsiveness and fairness
between their clients. Web preemption is likely to make it impractical
to get count-distinct results against larger datasets, as (a) such
queries likely cannot be computed within a single time slot;
and (b) receiving all results to perform the aggregation on the client
side wastes network resources when only aggregate results are needed.

The submission is well written. The text presents ideas and results
clearly and at the right level of technical detail.

The submission substantially extends the ESWC 2020 publication that
presented a method for preemptable aggregation, excluding the DISTINCT
operator. As handling grouping by distinct keys is a substantially
memory-intensive operation, there is significant added value (also
reflected in the text) with respect to the ESWC 2020 publication.

The proposed extension uses HyperLogLog, a prior algorithm, to
approximate counts with a limited memory budget. This allows Web
preemption to maintain state between slots allocated to the same query
with minimal overhead.
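The property that makes HLL sketches attractive for preemption is that they merge losslessly: combining two sketches is an element-wise maximum over their register arrays, so state saved at the end of one quantum can be folded into the state of the next at constant cost. A minimal illustration (not the paper's implementation; names are hypothetical):

```python
# Minimal illustration of why HyperLogLog sketches suit Web preemption:
# a sketch is a fixed-size register array, and merging two sketches is
# an element-wise maximum, so partial state saved at the end of one
# quantum can be combined with state produced in the next quantum.
def merge_hll(registers_a, registers_b):
    """Merge two HLL register arrays of equal length."""
    assert len(registers_a) == len(registers_b)
    return [max(a, b) for a, b in zip(registers_a, registers_b)]

# Example: sketches produced in two quanta for the same group key.
quantum_1 = [0, 3, 1, 2]
quantum_2 = [1, 2, 4, 2]
merged = merge_hll(quantum_1, quantum_2)  # [1, 3, 4, 2]
```

Because the merge is a per-register maximum, it is idempotent, commutative, and associative, so quanta can be combined in any order without affecting the final estimate.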

The idea was experimentally tested on BSBM and DBPedia. The method
delivers as expected, drastically reducing the transferred data.

What I am missing in order to feel that the behaviour of the
method is fully documented are the following two items:

(a) Does the drop between 150 and 1500 in Figure 9 (bottom left,
sage-approx) correspond to the time range where (almost) all queries
terminate? I would like to see a visualization (possibly grouping
queries that terminate at similar times?) where we can tell if the
increase in efficiency is (partially) due to not having to merge
partial, preempted results and receiving the result in one go.

(b) Testing not just for different length of the time quanta, but also
for different densities of distinct values in the dataset.

A minor comment in Section 3.1: Although technically any symbol can be
used to denote a variable, using "?x" is somewhat confusing. What is
meant is "any variable", a variable over variables, and not a specific
variable. Using something like "v" to be a variable over variables
would make more sense, in my opinion. As it stands, I had to re-read
carefully twice to double check that ?x is "any variable" and not a
specific variable that I missed being mentioned earlier in the text.

Review #2
By Aidan Hogan submitted on 27/Apr/2021
Suggestion:
Major Revision
Review Comment:

## Summary

The paper discusses methods by which aggregation queries can be made (efficiently) preemptible, meaning that the query can be paused and resumed efficiently, allowing it to be processed in discrete quanta (e.g., each initiated by an individual request). Though previous work has tackled the preemption of triple patterns, basic graph patterns, etc., the case for aggregation is considerably more challenging. The authors first propose a preemption framework that is similar to how distributed systems (such as Hadoop or Spark) process aggregations, decomposing the aggregation into a combiner function and a reducer function. This allows the base solution mappings to be processed in chunks on the server using the combiner, with the final result being computed by the client using the reducer. However, in the case of COUNT DISTINCT or AVG DISTINCT, a meaningful combiner is not possible, and thus, by default, the server would have to stream all values to the client, who would then have to apply the DISTINCT operation and the COUNT/AVG. Specifically for the COUNT DISTINCT case, the authors propose an approximate method based on HyperLogLog (HLL), which uses the maximum number of leading zeros in different buckets of values in order to estimate the overall cardinality. This ensures that cardinalities can be estimated in constant space within a given error guarantee.
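For readers unfamiliar with the combiner/reducer decomposition described above, a minimal sketch of how it applies to a GROUP BY COUNT query might look as follows (a hypothetical illustration, not the SaGe implementation; function and key names are assumptions):

```python
from collections import Counter

# Hypothetical sketch of the combiner/reducer decomposition for a
# GROUP BY ... COUNT query. The server runs the combiner over each
# chunk of solution mappings produced within a quantum; the client
# runs the reducer to merge partial aggregates into the final result.

def combiner(mappings, key):
    """Server side: fold one chunk of mappings into partial counts per group key."""
    partial = Counter()
    for mu in mappings:
        partial[mu[key]] += 1
    return partial

def reducer(partials):
    """Client side: merge partial counts received over several quanta."""
    total = Counter()
    for p in partials:
        total.update(p)
    return dict(total)

# Two quanta worth of solution mappings, grouped by ?class
chunk1 = [{"class": "Person"}, {"class": "Person"}, {"class": "Place"}]
chunk2 = [{"class": "Person"}, {"class": "Place"}]
result = reducer([combiner(chunk1, "class"), combiner(chunk2, "class")])
# result == {"Person": 3, "Place": 2}
```

The decomposition works because COUNT partials can be summed; as the review notes, no such lossless combiner exists for COUNT DISTINCT, which is what motivates the HLL-based approximation.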

The proposed methods are then used to extend an existing preemption-enabled query engine, called SaGe, and experiments are run evaluating aggregation queries over a variety of datasets. The systems included in the comparison are SaGe, Virtuoso, TPF, as well as SaGe extended with both forms of preemptible aggregation (exact and approximate). The results show that although Virtuoso provides the fastest runtimes and lowest data transfer overall, it would reach the timeout limits typically defined on public endpoints for many queries. The next best alternatives are the extensions of SaGe with the proposed aggregation frameworks, which though slower than Virtuoso, support pausing and resuming queries efficiently. The approximation framework based on HLL does not improve runtimes, but can reduce the amount of traffic when evaluating COUNT DISTINCT queries.

## Strengths:

S1: SPARQL endpoints have become a key part of the Semantic Web infrastructure (e.g., Wikidata receives millions of SPARQL queries per day), and their performance issues are well-known. Query preemption is a promising strategy to improve their reliability by allowing to break a query down into quanta of processing. Likewise aggregation queries are very common, and are often among the most challenging queries for SPARQL endpoints to process (as they often require massive intermediate results to produce a concise result). Hence, I think that preemption for aggregate queries is an interesting and natural topic to explore.

S2: The techniques proposed are promising, both in terms of exact aggregation and approximate aggregation.

S3: The technical content is presented both formally and with examples, which makes them relatively easy to follow. (With the caveat that some of the definitions do need some polishing, as discussed later.)

S4: Experiments consider a reasonable selection of alternatives that cover examples from each of the most prominent families of approaches. A range of datasets are considered.

## Weaknesses

W1: There are some claims in the paper regarding certain query operators being "full mappings" operators. From my understanding, this refers to the idea that all of the mappings must be computed before the operator can be applied, and are thus not directly preemptible. The authors include ORDER BY as an example, along with aggregation functions. But it is not very clear to me what precisely this concept of a "full mappings operator" means, or what the relation is to preemptibility (intuitively I understand the rough idea, but formally I am not convinced about the details). Further, the paper does not justify why these operators necessarily require the full mappings to be computed before they can be applied. Is this just in some cases, or in all cases? For example, ORDER BY can sometimes be evaluated by "streaming" solutions from an index with the appropriate order (in fact, worst-case optimal join algorithms rely on this property of being able to stream the results for any variable of a BGP in sorted order assuming every triple pattern of the BGP contains that variable in time proportional to the number of results). Likewise, in the case of aggregation, it seems that one can apply preemption in a similar manner to joins, processing the aggregation key by key. This is, in fact, precisely how distributed frameworks, and the framework proposed by this paper, work. Put another way, for a given value of K, LIMIT K queries over ORDER BY (even on multiple columns) and aggregation queries (assuming non-trivial groupings) can be computed without computing the full base multiset of solution mappings, assuming we have complete indexes. For ORDER BY, we can generate an ordered list of K candidates for the first variable in the order expression, and then compute the rest of the query in a nested-loop fashion, stopping when we produce K solutions.
For aggregate queries, we can again generate a list of candidates for K group keys, and then compute the rest of the query in a nested-loop way, generating the full results for each key, which we can then aggregate. (For aggregate queries without GROUP BY, it may be necessary to compute all the solution mappings, I am not sure.) In any case, I think these claims surrounding "full mapping operators" need to be clarified as they form a core part of the motivation for the paper.

W2: Though not previously familiar with the HLL approach, I wonder if it is really suitable for the proposed use-case. It would seem that HLL is better suited to estimating the cardinality of large sets, for which its size is quite small, and the estimation based on leading zeros can give relatively low errors, but often in the case of SPARQL, when grouping by keys, the sets (with distinct) can be small. In the case of having many small sets (when counting by groups), HLL would seem (intuitively) to give coarse results and to use quite a lot of space (1.5 kB per group key), which might even be larger than storing a small set. Unfortunately, the paper does not provide any reason to suggest that HLL is a good approach for the group-by aggregate case. The experimental results do not measure the actual error of the counts generated, and no other similar alternative is considered.
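To make the concern above concrete, the textbook HLL estimator (an array of m = 2^p registers updated with the rank of the leftmost 1-bit, combined via a bias-corrected harmonic mean) can be sketched as follows. This is a minimal sketch, not the paper's implementation: it omits the small- and large-range corrections of the full algorithm, which is precisely where small per-group sets would suffer, and all names and the choice of hash function are assumptions:

```python
import hashlib

P = 4            # precision: m = 2**P registers
M = 2 ** P

def _hash64(value):
    # 64-bit hash built from SHA-1 (a stand-in; any good 64-bit hash works)
    return int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")

def hll_add(registers, value):
    h = _hash64(value)
    j = h >> (64 - P)                        # first P bits select a register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 60 bits
    rank = (64 - P) - rest.bit_length() + 1  # position of the leftmost 1-bit
    registers[j] = max(registers[j], rank)

def hll_estimate(registers):
    # Bias-corrected harmonic mean of 2^-R[j]; the alpha constant for m = 16 is 0.673.
    # The full algorithm applies extra corrections for very small/large cardinalities.
    alpha = 0.673 if M == 16 else 0.7213 / (1 + 1.079 / M)
    z = sum(2.0 ** -r for r in registers)
    return alpha * M * M / z

registers = [0] * M
for i in range(1000):
    hll_add(registers, i)
estimate = hll_estimate(registers)
# relative standard error is roughly 1.04 / sqrt(M), i.e. about 26% for P = 4
```

With p = 4 the sketch occupies only 16 registers, but the ~26% standard error illustrates the precision/space trade-off the review questions for small groups; the 1.5 kB figure discussed above corresponds to a much larger p.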

W3: The paper could be much clearer about the limitations of the proposed approach. Some limitations are explicitly discussed, some are alluded to, and some (such as mentioned in W2) seem not to be discussed. I would like to see a more cohesive discussion (in one section or sub-section) of all of the limitations, before the experimental results are presented. These limitations might include: the space used by HLL, the error rate of HLL for smaller cardinality sets, the need to still stream all values for AVG DISTINCT, having to count over many keys, the lack of join and filter optimisations in the current implementation, the issues of scalability, etc. For each limitation, it would be good to clarify why it occurs (e.g., is it a limitation of the expressiveness of SPARQL, a trade-off associated with the proposed approach, or just an implementation issue), and how it might be addressed.

W4: There are a couple of key aspects of the experimental results that I found unclear:
- With respect to scalability, 100 million triples is not that much, and the experiments report results over a *fragment* of DBpedia v3.5.1. Why was a fragment chosen? Would it not work on the full dataset? Why not?
- Perhaps I am mistaken, but wouldn't the data transfer be guaranteed to be significantly larger for SAGE-approx than Virtuoso in SP since the former must send the HLL registers (1.5 kB per group key?), which will be significantly larger (by a constant factor) than returning just a simple count?
- As mentioned before, I miss discussion of the actual error observed when using HLLs. I suspect this might be larger than what the parameter suggests it should be (my guess is that the error rate is defined approaching infinity, and does not necessarily apply to small sets?).

W5: While the paper is quite readable, there are frequent language issues and bugs in the notation. I will list some of these below, but the paper would benefit from a more thorough proof-read.

## Recommendation

Overall I think the work addresses an important problem, and proposes interesting techniques. However, I would like to see the paper revised in order to address the aforementioned weaknesses, specifically:

- W1: to clarify the claims surrounding "full mapping operators" (or perhaps remove them entirely if not essential to the paper);
- W2: to present the observed errors associated with HLL in the experimental results (optionally adding a similar baseline);
- W3: to gather together and discuss all of the limitations of the work before the experimental section;
- W4: to clarify why larger datasets were not used in experiments, and why the data transfer with HLL is often competitive with that of Virtuoso;
- W5: to more carefully proof-read the paper (see minor comments below).

As these are quite significant changes that might affect my overall impressions of the work (considering W2 and W4 in particular), I recommend a Major Revision.

## Minor comments

General
- "In a previous work [6], it has been demonstrated" Since the submission is not double-blind, I think it would be more informative and natural to point out that this work is yours: "In our previous work [6], we demonstrated".
- "on [the] client side" (various examples)
- "preemptable" would rather be "preemptible"? The latter seems to be a more common spelling. (On checking, I guess that either is fine.)
- Make sure that there is a space before \cite{...} in every case.
- The colours in some of the figures are slightly too dark to be able to comfortably read the black text in small font, particularly the blue colour.

Abstract:
- "to quota[] enforcement"
- "Although [] Web preemption"

Introduction:
- "requires [materializing] the [query] mappings on [the] client side"
- "[Even] if the processing"
- "Compared to related Approximated Query Approaches [17], this approach ensures to find all group keys in a single pass with a pre-defined error rate for all values." I did not find this difference or benefit clear (at this point). I think this is a key sentence, so I would suggest to clarify, with an example if necessary.
- "The remainder of this paper ..." Section 6 is missing.

Related Works:
- "a lot of CPU, memory, [and] temporary storage [resources]" Also it seems redundant to include temporary storage and memory as the latter includes the former?
- "user[] community"
- "endpoints set[ ]up quotas" (when a phrasal verb it is two words)
- "Query results are then recombine[d] on [the] client-side"
- "This requires transfer[ing] ... and then [computing]"
- "computing the exact results for aggregate queries and approximate values for count-distinct queries" Are count-distinct queries not aggregate queries? (The phrasing is a little confusing.)
- "for count-distinct queries, ensure that all group keys are collected with guaranteed accuracy on values" What does guaranteed accuracy mean here, more precisely?
- "Amazon[, etc.,]"
- "that require counting"

Preliminaries:
- "let [us] consider"
- "RDF terms such [that]"
- "An RDF graph G (called also RDF dataset)" It is true that some papers call RDF graphs RDF datasets, but the RDF 1.1 standard defines RDF datasets in terms of named graphs, not RDF graphs, so I would drop the "(called also RDF dataset)" part.
- "Let [us] assume"
- "Mappings \mu_1 and \mu_2 are compatible on the variable ?x ..." This definition is a bit atypical. In this case, why not simply write $\mu_1(?x) = \mu_2(?x)$? Compatibility is not typically defined with respect to a variable, but rather between two mappings: \mu_1 ~ \mu_2. Also (if I am not mistaken) the definition is never used, so you could just drop it?
- "multiset[] of solution[] mappings"
- "with lists of expressions E = [E_1" (missing subscript)
- "E is restricted to variables, without reducing the expressive power of aggregates" I presume that this is only with the assignment of variables, which is not included in the defined fragment of SPARQL.
- Definition of group: "G an RDF graph", but the RDF graph is not used in the construct.
- "Let[ting] $\Omega = ...$"
- Definition of aggregate: "G an RDF graph", but the RDF graph is not used in the construct.
- Definition of aggregate: "a multiset of partial functions" Before it was defined as a set (and it should be a set as the keys are unique?).
- "that finally sort[s] them."
- "[Although] delegating ... allows [us to] effectively [] process"
- "In this example, Q_1' requires six quanta to complete" Why six quanta? Is this assumed for the purposes of the example or do we have some way to deduce that six quanta are needed?

Computing Partial Aggregations with Web Preemption
- "the way [in which] the property"

Count-Distinct SPARQL Aggregate Queries
- "over several quanta, as [per]"
- "requires [] transfer[ing]"
- "and [] bounded memory"
- "typical accuracy of 2%" Typical error? Otherwise it is not clear what accuracy means here.
- "This problem [is known] as the many-distinct count problem"
- "Thanks to [] web preemption"
- "In the following ... partial aggregations." Poorly structured; I would suggest to rephrase.
- In the definition of the HLL set, p is used to denote the precision (which I would understand to be a real value), while later it is used as an integer (m = 2^p). I do not (at this point) understand what a precision of (e.g.) 10 means here. The notion of precision needs to be defined more clearly at this point, rather than where it is currently defined. Also it would be good to provide the formula for computing the error rate given p. From the example given, it would appear to be something like (1 + p/100) x \sqrt{2^p}, which seems a bit strange.
- "an array R of m = 2^p registers" It would be good to indicate what kind of value (how many bits) each register contains.
- "represents the ind[ex]"
- "64 bit[] hash value"
- "[Long] quant[a] [reduce] data transfer as HLL-sets are better used."
- "the client memory is no [longer] a shared resource"

Experimental Study
- "What is the impact of [quantum duration] on the data transfer"
- "count distinct queries [reduce] data transfer?"
- "[]Python SAGE-AGG and SAGE-APPROX clients ... with [] Algorithm 2."
- "are [run] on synthetic and real-world datasets"
- "quer[y] workloads"
- "Virtuoso without quota [performs the best] in terms of"
- "many [HTTP] requests"
- "quer[y] execution time"
- "client-side [like] TPF"
- "red dotted line in [Figure] 8"
- "after [a] time quantum"
- "for each group key per quantum is bound[ed]"

Conclusion and Future Works
- "elements from [the] server to [the] clients"
- "in the experiment[s]"
- "could [reduce] the data transfer"

References
- Be sure to enforce capitalisation for acronyms (e.g., SPARQL), proper nouns (e.g., Virtuoso), etc., in the paper titles.

Review #3
By Oscar Corcho submitted on 16/May/2021
Suggestion:
Major Revision
Review Comment:

This paper presents a piece of work that is original, building on top of previous sound work done by some of the authors wrt making sure that complex long-running SPARQL queries can be run online, without the need to download RDF data dumps. As far as I know, this is the first attempt to cover a very specific type of SPARQL aggregate queries that is one of the most challenging ones: COUNT DISTINCT.

The work is well presented and sound in general, and in my opinion suitable for publication in this journal. However, I have many comments to improve it that motivate my request for a major revision, which I am sure will be easy for the authors to deal with. I will deal first with the ones that I consider more important, and then I will provide a set of other minor comments:
1. The title of the paper should be different, IMO, since currently it is a bit misleading and may suggest to a potential reader that the work is much wider. In my opinion, the title should be something like "Online approximate query processing for SPARQL COUNT-DISTINCT queries with Web Preemption". Another order may be possible, of course, but IMO two main aspects have to appear: make sure that we talk about COUNT-DISTINCT and not aggregate queries in general, and make sure that the approximate nature of the approach is signalled.

2. In the evaluation setup and in the discussion in general I am not able to understand why the authors do not include in the evaluation a comparison of the actual results obtained, given that the queries are approximate and hence there is a potential for error. I would include this as another hypothesis: we reduce data transfer, ok, but not at a major cost in terms of the accuracy of the results. This has to be included, IMO, to show clearly whether the initial assumption in terms of error rate from HLL stands here as well with the queries that are used in the evaluation. If this does not show good results, then probably we have to question whether this is a valid approach or not.

3. There is some mismatch all throughout the paper in terms of figure numbers, table/figure references, etc. This makes following some explanations very difficult. Please revise them all very carefully. And the quality of some of the key figures, which are showing examples of query processing, is extremely low, and it is difficult to follow them. Probably it would be even good to provide additional explanations about them in the text, with step numbers and mentions to them in the text so as to drive the reader throughout the examples.

4. I may be wrong, but Algorithm 1 may have a problem of infinite recursion in lines 5, 6 and 7. Could you please double check that?

These are the major comments. Now more detailed comments that are easy to fix:
5. The introduction is quite clear and very easy to read. I only have editorial comments. In the first paragraph I would include "for all available classes" after the "per class". In that first paragraph, as well, the authors claim that only partial results are delivered, but in reality Wikidata and DBpedia return no results (so partial is not a good term to use here).
6. In the introduction I would already describe briefly what preemptable means when you use it, since it will not be understood by the majority of readers, as this is not so well known. I knew about it because of having read previous works from the authors, but for a "normal" reader interested in SPARQL query processing, this is not necessarily easy to understand.
7. When you talk about the contributions, you do not mention the error rate (as I discuss above) nor the computation time. This second one is something that you evaluate later as well. Even if results may not be good yet, I think that it has to be included.
8. When you discuss about approximate query processing in section 2, you talk about the number of draws, but do not explain what draw means here. Please do so to facilitate reading. In that same section, it would be good in the last paragraph to show an example of why you can talk about thousands of group keys, since the first intuition is that when you talk about group keys you talk about those inside a SPARQL query, and normally there will be one or a few, but never thousands.
9. Why do you avoid OPTIONAL clauses? Is there any limitation for that?
10. In section 5 you say "now the data transfer is no more proportional to group cardinalities". But, is it proportional to something else? If it is, it would be good to mention that.
11. In section 5.1, when you discuss HLL_count, the natural language description includes the precision p, but this is not included in HLL_count_p or the like. I think that if it is a relevant parameter, it should be included in the term.
12. The use of the harmonic mean on the m registers should be probably justified even if it comes from HLL, so that it is easier to understand why this one is selected.
13. It is not easy to understand why the error rate is 1.04 × √(2^4). Where are you taking this calculation from? This is not explained, I think.
14. Figure 6 has only one part, so no need for 6.a). It would be also good to have Q_2 explicitly included in the figure, so as to facilitate understanding. And improve the figure in general to make it more readable.
15. The last paragraph of section 5 is somehow repeated.
16. In section 7 I miss the initial question about how good the approximation is, as discussed above.
17. Configuration of servers. I think that the description provided is not enough for reproducibility and for understanding whether you are doing a fair evaluation. For instance, in Virtuoso the description of the virtuoso.conf file with many of its parameters, memory actually used, etc., is relevant to understand the results obtained.
18. I am missing a general discussion in the last section about how this approach could be implemented in other SPARQL engines. Would it make sense? Which are the characteristics that the engines need to have in order to make it possible?

Typos:
"then recombine" --> then recombined
"paginated such as a" --> "paginated so that a"

Long-term sustainability of resources. There are some aspects that should be improved in terms of ensuring the appropriate reference to the resources used in the paper. Happy to see if this can be covered in the resubmitted version, since they are very easy to fix:
- The code is made available in a GitHub repository. While this is ok, it should be properly archived. I suggest generating a release and archiving it in Zenodo, for instance.
- The queries and datasets used in the experiments are not always provided properly. The BSBM ones are ok, since they are well known and reproducible by the community. However, the authors refer to using part of a specific version of DBpedia. Which one is that part? If you do not specify clearly which decisions you have made to generate that part of DBpedia or provide the dataset directly, we cannot reproduce results.
- It would be nice to have the exact queries used available as well in a long-term archival place.