Continuous Top-k Query Answering over Streaming and Evolving Distributed Data

Tracking #: 1843-3056

Shima Zahmatkesh
Emanuele Della Valle

Responsible editor: 
Oscar Corcho

Submission type: 
Full Paper
Continuously finding the most relevant answers to a query that joins streaming and distributed data has attracted growing attention in recent years. It is well known that remaining reactive can be challenging, because accessing the distributed data can be highly time-consuming as well as rate-limited. In this paper, we consider an even more extreme situation: the distributed data slowly evolves. The state of the art offers two families of partial solutions to this problem: i) the database community studied continuous top-k queries [1], ignoring the presence of distributed datasets; ii) the Semantic Web community studied approximate continuous query answering over RDF streams and dynamic linked datasets [2], ignoring the specificity of top-k query answering. In this paper, extending the state-of-the-art approaches, we investigate continuous top-k query evaluation over streaming and evolving distributed datasets. We extend the data structure proposed in [1] and introduce the Super-MTK+N list, which handles changes in the distributed dataset while minimizing memory usage. To address the query evaluation problem, we first propose the MinTopk+N algorithm, an extension of the algorithm proposed in [1], which handles changed objects in the distributed dataset by managing them as new arrivals. Then, taking the architectural approach presented in [2] as a guideline, we propose the AcquaTop algorithm, which keeps a local replica of the distributed dataset. To approximate the correct answer, we propose two maintenance policies to update the replica. We provide empirical evidence that the proposed policies guarantee reactiveness while providing more accurate and relevant results compared to the state of the art.
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 05/Jun/2018
Major Revision
Review Comment:

The paper proposes a framework for continuous top-k query answering over streaming and evolving distributed data. The research problem that is being tackled is a challenging one, since it considers both the velocity of streaming data, and the growth of static (or distributed) data.

The framework extends the previous approaches by Yang et al. [1] and Dehghanzadeh et al. [2]. The new additions are: the Super-MTK+N list as a data structure, the MinTopk+N algorithm, the AcquaTop algorithm, and two replica maintenance policies (i.e., MTKN-F and MTKN-T).
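To fix ideas for readers of this review: as far as I can tell, the Super-MTK+N list amounts to a bounded, score-ordered buffer that keeps the top K objects plus N runners-up, so that a re-scored object (after a change in the distributed data) can be treated as a new arrival. A rough, hypothetical sketch of that idea (names and details are mine, not the authors' implementation):

```python
class SuperMTKN:
    """Minimal illustrative sketch: keep the top K objects plus N buffered
    runners-up, sorted by descending score. Not the paper's data structure."""

    def __init__(self, k, n):
        self.k, self.n = k, n
        self.entries = []  # list of (score, obj), kept sorted descending

    def update(self, obj, score):
        # Drop any previous entry for obj (e.g. the object was re-scored
        # after a change in the distributed data), then treat it as new.
        self.entries = [(s, o) for (s, o) in self.entries if o != obj]
        self.entries.append((score, obj))
        self.entries.sort(reverse=True)
        # Objects falling below position K+N are discarded to bound memory.
        del self.entries[self.k + self.n:]

    def topk(self):
        return [o for (_, o) in self.entries[:self.k]]


mtk = SuperMTKN(k=3, n=2)
for obj, score in [("A", 5), ("B", 9), ("C", 7), ("D", 3), ("E", 8), ("F", 6)]:
    mtk.update(obj, score)
print(mtk.topk())        # ['B', 'E', 'C']
mtk.update("C", 2)       # C re-scored downward by a change in distributed data
print(mtk.topk())        # ['B', 'E', 'F'] -- the buffered runner-up F fills in
```

The point of the N extra slots is visible in the last step: without the buffer, the demotion of C would force a recomputation from scratch.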

Experimental evaluations were provided, which investigated different maintenance policies for stream queries over Twitter data. The experiments were reported to confirm that the MTKN-F policy maximizes the coverage accuracy of the top-k result, while MTKN-T maximizes the relevancy.

I believe that the paper in its current form cannot be accepted without a major revision. The following are the main points of my review:

# More convincing motivation

I am not yet fully convinced that the ability to handle slowly evolving data would bring significant benefits over the state of the art. How often does data evolve slowly, and why slowly rather than quickly? What are potential real-world applications of the proposed approach for handling slowly evolving data? What concrete real-world problems are severely limited without the ability to handle slowly evolving data? Also, a more elaborate distinction from [1,2] is necessary.

# Experimental evaluations

I believe that measuring response time would be a crucial aspect of the experimental evaluations, in particular since being reactive plays a key role in processing streaming data. Yet, the experimental evaluations only measure result relevancy. Moreover, measuring the overhead of the proposed approach in the case of non-evolving data would provide valuable insight as to when to adopt the proposed approach. Comparative experimental evaluations against [1,2] would be a valuable addition, too. Also, a stress-test case in the experiments could show to what extent the proposed approach still works reasonably.

# Quality of writing

There are many typos (see detailed remarks). Some sentences are too long (and too complex). Some parts are too dense, for example, Sec. 4.3.2. Some parts of writing lack sufficient examples to grasp the main ideas (see detailed remarks).

Below are my detailed remarks regarding the presented work:

# There is no explicit definition of top-k queries, as: A top-k query is defined as ...

# Can "reactive" be further elaborated? Is it equivalent to fast query processing? How does it differ from fast query processing over static data? Are there concrete examples of query executions that are reactive vs. not reactive?

# What are your assumptions regarding the poor network conditions for AcquaTop? How poor is poor? There must be some lower bounds of the latency, bandwidth, and rate for your approach to still work reasonably.

# Related work: The problem of finding the most relevant answers is very similar to that of retrieving the most relevant documents in the information retrieval domain. There, the data is vastly distributed and evolves quickly. Can you relate your approach to theirs?

# How much overhead is introduced by your approach when the distributed data does not change, particularly compared to [1,2]?

# The example in Listing 1 is a bit problematic. There you assume that mention counts are immediately available in the stream. Yet, in reality, the mention counts themselves are results from some other aggregations, which are also challenging to compute due to the size and quick evolution of social networks. I'd say that queries for computing mention counts also deserve some attention regarding their implementation.

# Pg. 2: "and they would recompute the result from scratch risking to lose reactiveness" -> Can you provide experimental evaluations that show some query execution time comparison of your approach to the baseline approach where results have to be recomputed from scratch?

# Grammar: Is "data" singular or plural? The important point is to be consistent in usage.

# As the paper directly extends [1,2], I'd also expect some performance comparison to the previous work. In the case where the previous work is not compatible with the new experimental settings (say, because the distributed data now slowly evolves), you might want to add some reasonable naive/baseline modifications to the previous work in order to make it work.

# The naming of the maintenance policies is somewhat misleading.

MTKN-F is said to maximize accuracy. However, it tries to get all top-k answers while ignoring the ordering. If the ordering is ignored, it becomes inaccurate, doesn't it? I believe "coverage" is a more suitable term.
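To make the terminological point concrete: a set-based measure (coverage) ignores ordering, while a rank-aware measure such as DCG@k does not. A small hypothetical illustration (the function names and relevance values are mine, not the paper's):

```python
from math import log2

def coverage(approx, exact, k):
    # Set overlap of the two top-k answers: ordering is ignored.
    return len(set(approx[:k]) & set(exact[:k])) / k

def dcg_at_k(ranked, rel, k):
    # Standard Discounted Cumulative Gain: rel maps an object to its
    # graded relevance; higher-ranked positions are discounted less.
    return sum(rel.get(o, 0) / log2(i + 2) for i, o in enumerate(ranked[:k]))

exact  = ["A", "B", "C"]
approx = ["C", "B", "A"]          # same objects, reversed order
rel = {"A": 3, "B": 2, "C": 1}

print(coverage(approx, exact, 3))  # 1.0: perfect coverage despite wrong order
print(dcg_at_k(approx, rel, 3) < dcg_at_k(exact, rel, 3))  # True: DCG penalizes it
```

A policy that retrieves all top-k objects in the wrong order thus scores perfectly on coverage but not on a rank-aware metric, which is why "coverage" seems the more honest name for what MTKN-F maximizes.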

# The sectioning can be further improved by breaking down Sec. 4 into two sections.

# What happens if the distributed data evolves quickly? What are the limitations of your approach? Moreover, what is the main motivation for considering slowly evolving data, instead of evolving data in general? In practice, how common/rare is it to encounter slowly evolving data?

# Sec. 2.1.1 and 2.1.2 lack examples needed to help understand the formalization.

# In order to be self-contained, at the end of Sec. 2.1.1, briefly explain QF.

# The metric DCG@K was given without sufficient intuitions and descriptions. For example, what is rel_i?

# Sec 2.2: I believe distributed datasets haven't been introduced, yet they are mentioned here.

# The 1:1 join relationship seems like quite a restriction. In practice, how often/rare do 1-1 joins occur?

# F(XS, XD) is monotone. However, the values of XS and XD themselves may increase and decrease over time. This should be discussed to clarify the matter.

# Fig. 2.a is just an example, or is it actually the generic form of SE? If it is the generic form, then I suppose that having multiple windows and multiple services is not allowed?

# At the end of Sec. 2, there is the idea of computing error by comparing Ans(RQi) and Ans(Qi). Can examples be provided?

# Pg. 7: For instance, objects E, C, and B are in the top-3 result of window W_0 -> There is no W_0 in Fig. 3? There is also no B?

# Fig. 3 is a bit too small. Moreover, the object scoring is not shown explicitly in the figure.

# Pg. 7: It appears to me that the MTK set relies on the assumption that previous top-k results are very likely to appear in future top-k results. Since your work is based upon the MTK set, how robust is this assumption, both from the theoretical and the practical point of view? What are the main limitations of this assumption? Generally, I think all results have the potential to appear in future top-k results.

# Sec. 3.2: The current writing is still not really accessible. Perhaps, an example might be introduced to improve the readability as to how replicas are updated.

# Reactiveness is indeed an important measurement for querying over data streams. Yet, the experimental evaluations are missing such a measurement.

# Pg. 9: Why is it sufficient to keep only the minimum value of the scoring variable X_S?

# Theorem 3 seems to be the hardest case out of the two previous theorems. How often does this occur, in the experiments and in practice?

# Pg. 10: What exactly is the difference between Super-MTK+N list, and MTK+N list? Can there be illustrations?

# Fig. 5 can be made more detailed, especially on the components of Topk+N.

# Algorithm 1: How does one get changed objects from the distributed datasets? How can one detect changes? How easy is it to detect changes?

# Sec. 4.3.2: "... thus the LBP set is recomputed from scratch" -> Is the set usually small?

# Sec. 4.3.2 can be made more accessible by first having a motivation and intuition of the proposed algorithm, before presenting the details.

# Having presented the algorithms, perhaps discuss what is new compared to [1].

# Perhaps when discussing Alg. 4, also highlight the similarities and differences to Alg. 1.

# In the experiment setup, why are ties not preferred? What happens if there are ties? Would ties affect the performance?

# As a stress-test, what happens when there are many changes per invocation, say more than 80?

# Sec. 5.4: The figure seems to show an elbow point, after which the performance rises only slowly as the refresh budget increases. Can this elbow point be elaborated on?

# Sec. 5.4: What are reasonable refresh budgets? Which parameters influence how large the budget should be?

# Sec. 5.6: Fig. 10.b and 10.d show an interesting behavior: the performance is the worst when K is in the middle. Any argument why?

# Sec. 5 lacks a closing discussion that elaborates and relates observations from the four experiments. Perhaps some parts of the conclusion can be made as a closing discussion.

# The experiments were done over Twitter. What are the assumptions on the characteristics of Twitter data streaming? Are these assumptions reasonable? Can they be generalized so that the experiments would provide some insights also to other kinds of data streaming?

# There are many typos, some of which are:
- Pg. 2: loosing -> losing
- Pg. 2: guaranteing -> guaranteeing
- Pg. 3: we presents -> we present
- Pg. 6: to proposed
- Pg. 9: we can keeping
- Pg. 17: it compute
- Pg. 19: Abising notatoion
- etc

Review #2
Anonymous submitted on 08/Aug/2018
Major Revision
Review Comment:

The paper addresses the problem of efficiently resolving top-k queries in the context of streaming data joined with evolving distributed data. To that end, the authors first propose an extension of the existing MinTopK algorithm and Super-MTK structure to support new arrivals of data resulting from changes in the evolving distributed data. The new structure, Super-MTK+N, adds N positions to the original structure, acting as a buffer for current and future evaluations. Given that streaming data outside the top list are discarded, the authors discuss the cases in which the new data can also serve to resolve future evaluations in an exact or approximate way. In this first step, the authors assume that data changes are notified by the distributed source. Then, the paper proposes an architecture that integrates Super-MTK+N into the existing Acqua architecture, which tackles the challenge of querying evolving distributed data by maintaining a local replica that is updated according to refresh policies. The main idea of the integration is that the candidates for refreshing are taken from the Super-MTK+N list; these are re-fetched from the distributed sources and may reveal new changes that need to be considered in the top-k computation. The authors propose two policies to select the elements to be re-fetched (one considering the top elements of the list, the other considering the elements at the border of position K), and perform an evaluation over user mentions and followers on Twitter with a varying number of changes, refresh budget, number of top-k results, and number of N extra elements. Results show that the two proposed policies generally outperform standard policies (e.g., based on LRU or random selection), while pointing to specific cases (e.g., when most elements can be re-fetched) in which a state-of-the-art technique is more efficient.
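As I understand them, the two candidate-selection policies reduce to picking refresh candidates either from the head of the list or from the border around position K. A rough, hypothetical sketch of that distinction (function names and the half-window placement are my own reading, not the authors' code):

```python
def select_top(entries, budget):
    # Policy 1: spend the refresh budget on the highest-ranked objects,
    # maximizing the relevance of the part of the answer that is fresh.
    return [obj for (_, obj) in entries[:budget]]

def select_border(entries, k, budget):
    # Policy 2: spend the budget around position k, where score changes
    # are most likely to move objects into or out of the top-k answer.
    lo = max(0, k - budget // 2)
    return [obj for (_, obj) in entries[lo:lo + budget]]

# entries: (score, object) pairs sorted by descending score
entries = [(9, "B"), (8, "E"), (7, "C"), (6, "F"), (5, "A")]
print(select_top(entries, 2))                # ['B', 'E']
print(select_border(entries, k=3, budget=2)) # ['C', 'F']
```

Seen this way, the reported results are plausible: refreshing the head preserves the relevance of the top entries, while refreshing the border preserves the membership (coverage) of the top-k set.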

Overall, the paper is interesting to read, but I found some limitations and problems that prevent acceptance in its current state.

First of all, the focus of the paper is on 1:1 joins between the streaming and the distributed data. My main concern is how realistic this is, or how many cases the approach can cover when multiple operations such as other types of joins, UNIONs, FILTERs, etc. are not considered. Besides stressing this limitation more in the paper, I would at least expect a detailed discussion on extending the current solution towards more realistic scenarios.

Then, the authors justify that the original MinTopK algorithm “cannot be applied for complex query that gets data from distributed datasets”. This justification is vague and unexplained. First, the authors do not handle complex queries either, as they limit themselves to simple 1:1 joins. Second, what matters is not that the data is distributed, but that it changes between windows. A small clarification on this would be needed. Also, it is unclear why the authors insist on the concept of “slowly change”, for example when they assume that the distributed dataset is evolving and the data in it changes slowly between two subsequent evaluations. Does the change frequency affect the technique, and if so, what is the change ratio at which the technique remains efficient?

In this regard, note that the evaluation is limited to a single small dataset and a single, very simple query. Acknowledging the challenge of finding proper corpora, synthetic data or the DBpedia Live changesets could be used to provide a more extensive evaluation. In addition, the datasets used in the evaluation and the source code are not available for reproducibility.

Other considerations:

- Regarding the structure of the paper, it is confusing to have two different sections addressing the “state of the art” (Section 3) and the “related work” (Section 6). These terms are often used for the same purpose of summarizing relevant existing work. I would recommend reformulating Section 3 as “Background”.

- When discussing Table 1 and Theorems 1, 2, and 3, I missed a discussion of when the results become accurate again after an approximate result has been introduced.

- I might be wrong, but I think the authors do not consider, in the Topk+N algorithm (Section 4.3), the case where the changed object from the distributed source is in fact a new object, which might need to satisfy other conditions (e.g., joins) with the stream data (which might already have been discarded from the top-k list). A mention of this should be provided. Indeed, in Section 4.4 the candidate set for updating the local replica comes from the Super-MTK+N list, which means that objects outside this set, including potential new objects in the distributed dataset, are never inspected.

- How is Algorithm 3 different from the state-of-the-art algorithm for updating the LBP?

- Consider removing the references from the abstract so that it is self-contained.

- The conclusions are not self-contained, as they contain acronyms that are only explained in the body of the paper.

- In Section 2.1.1, page 4, second column, I would suggest (i) adding a citation for the streaming graph pattern expressions, and (ii) better explaining the last item of the definition, as the names of the operators are unclear.

- The text should be reviewed to correct the (many) grammar mistakes and typos. For instance (a non-exhaustive list):

+ page 3, last paragraph of the first column: form → from
+ page 3, right before Section 2: discuss → discusses
+ page 3, last paragraph of the second column: we presents → we present
+ page 4, first sentence, formalize → formalizing
+ page 4, first paragraph of the second column: defines → defined
+ page 4, third paragraph of the second column: Streaming → streaming
+ page 10, Table 1, caption: Summery → Summary
+ page 17, Table 3, caption: Summery → Summary
+ page 19, right before section 5.3 Abising → Abusing ; notatoion → notation

- Figure 2(a) is redundant with 2(b) and could be eliminated. In both cases, the EXTEND operator is not explained.

- Figure 3 builds upon the example in Figure 1, but the latter has only two windows (w0 and w1), while the example in Figure 3 contains w1, w2, and w3. A minor clarification might be needed to explain that future windows (I assume they start at times 7 and 10) are considered.

- At the first mention of Table 1, the meaning of “V” is not explained, which confuses the reader until later on.

- Figure 5 is a bit unclear, as the listed “components” of TopK+N are not in the correct sequential order. It looks more like an architectural diagram.

- I would recommend changing the subscripts of the cost function in Section 4.6 for better readability, avoiding the hyphen (-) symbol and perhaps replacing p^{intopk}.

- The legend of Figure 7 could follow the same pattern as the subsequent figures to improve readability. Further explanation of the results would also be needed.

- Footnote 6 is too long and spills onto the next page, which should be avoided. I would suggest moving part of the text into the body.

- In the evaluation in Section 5, when the authors say that their techniques “outperform others”, it would be nice to highlight some numbers, for instance percentages.