Continuous Top-k Approximated Join of Streaming and Evolving Distributed Data

Tracking #: 2055-3268

This paper is currently under review
Shima Zahmatkesh
Emanuele Della Valle

Responsible editor: 
Oscar Corcho

Submission type: 
Full Paper
Continuously finding the most relevant (shortly, top-k) answer of a query that joins streaming and distributed data is getting a growing attention. In recent years, this is, in particular, happening in Social Media and IoT. It is well known that, in those settings, remaining reactive can be challenging, because accessing the distributed data can be highly time-consuming as well as rate-limited. In this paper, we consider even a more extreme situation: the distributed data slowly evolves. The state of the art proposes two families of partial solutions to this problem: i) the database community studied continuous top-k queries on a single data stream ignoring the join with distributed datasets, ii) the Semantic Web community studied approximate continuous joining of an RDF stream and a dynamic linked dataset that guarantees reactiveness but ignores the specificity of top-k queries. In this paper, extending the state-of-the-art approaches, we investigate continuous top-k query evaluation over a data stream joined with a slowly evolving distributed dataset. We extend the state-of-the-art data structure proposed for continuous top-k query evaluation and introduce Super-MTK+N list. Such a list handles changes in the distributed dataset while minimizing the memory usage. To address the query evaluation problem, first, we propose Topk+N algorithm. This is an extension of the state-of-the-art algorithm that handles the changed objects in the distributed dataset and manages them as new arrivals. Then, adopting the architectural approach presented by the Semantic Web community, we propose AcquaTop framework. This keeps a local replica of the distributed dataset and guarantees reactiveness by construction, but it may need to approximate the result. Therefore, we propose two maintenance policies to update the replica. We contribute a theoretical proof of the correctness of the proposed approach. We perform a complexity study. And we provide empirical evidence that the proposed policies provide more relevant and accurate results than the state of the art.
Full PDF Version: 
Under Review