Continuous Multi-Query Optimization for Subgraph Matching over Dynamic Graphs

Tracking #: 2782-3996

Xi Wang
Qianzhen Zhang
Deke Guo
Xiang Zhao

Responsible editor: 
Guest Editors Web of Data 2020

Submission type: 
Full Paper
There is a growing need to perform real-time analytics on dynamic graphs in order to deliver the values of big data to users. An important problem from such applications is continuously identifying and monitoring critical patterns when fine-grained updates at a high velocity occur on the graphs. Various approaches for incremental processing of queries have been proposed over the years. Unfortunately, current approaches are mainly designed for a single query: optimizing and answering each query separately. When multiple queries arrive at the same time, sequential processing is not always the most efficient. In this paper, we study the problem of continuous multi-query optimization for subgraph matching over dynamic graph data. (1) We propose annotated query graph, which is obtained by merging the multi-queries into one. (2) Based on the annotated query, we employ a concise auxiliary data structure to represent partial solutions in a compact form. (3) In addition, we propose an efficient maintenance strategy to detect the affected queries for each update and report corresponding matches in one pass. (4) Extensive experiments over real-life and synthetic datasets verify the effectiveness and efficiency of our approach and confirm a two orders of magnitude improvement of the proposed solution.
Full PDF Version: 

Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 08/May/2021
Major Revision
Review Comment:

The paper studies the continuous subgraph matching problem with multiple queries. The problem is important as the query is widely used in real-world applications. Overall, the algorithm proposed in this paper is good. However, some important problems should be addressed. The details are listed as follows. Thanks.

1. The problem definition is problematic: the definition says that "the query is to identify all satisfied query graphs". What does "identify all satisfied query graphs" mean? Based on Example 1, the query should report the positive or negative matches in the data graph, rather than output some query graphs. It seems that the definition contradicts the example. So the problem definition can be significantly improved.

2. How to merge all queries into a single query, i.e., AQG? I think this is the core design of the algorithm, while this paper fails to give the details. Section 3.1 says that AQG is the union of vertexes and edges of queries. However, it seems that it is hard to process some cases. For example, the first case is that queries contain multiple vertices with the same label. The second case is that a query is "A-B-C", and another query is "D-E-F". Then, how to merge them? The authors may want to give more details of the algorithm merging queries.

3. It seems that the authors misunderstand the design of TurboFlux. In Section 1, the authors claim that "It maintains candidate query vertices for each data vertex using a graph structure such that a data vertex can appear at most once". This is not correct. Actually, TurboFlux maintains candidate data vertexes for each query vertex, and a data vertex can appear multiple times in the auxiliary data structure. For example, given a query with two vertices u0 and u1 labeled as "A", each data vertex labeled with "A" can be the candidates of u0 and u1, respectively.

Minor issues:
- In Example 1, "When $\delta g$ occurs" -> "When $\delta g_1$ occurs".
- In Section 2.1, "$E \in V \times V$" -> "E \subset V \times V".
- In Definition 2, "homomorphism to" -> "homomorphic to".
- At the last line in page 2, "graph homomorphism" -> "subgraph homomorphism".
- At Line 9 in Algorithm 1, how can an operation be equal to an edge? The algorithm should be formalized carefully.

Review #2
Anonymous submitted on 21/Jun/2021
Major Revision
Review Comment:

The work proposes an approach to the problem of continuous multi-query optimization for subgraph matching containing 3 core aspects. First, the technique merges the set of input queries into a single query referred to as the "annotated query graph" obtained by merging each of the input queries into one. Second, An auxiliary data structure that stores intermediate results and finally, a matching strategy which outputs corresponding new/deleted matches in one pass.

The manuscript communicate the ideas of the work clearly with a reasonable set of experiments which can be improved on.

I have comments on the work's assumptions, prior work, or experiments:

1) Assumptions: The work assumes that all of the queries are available at the start making it possible to merge all queries in what is referred to as an "annotated query graph". This assumption is quite strong and the common case would be that queries are added one at a time over time. Specifically, within an organization where different teams are interested in different newly matched or deleted subgraphs for their needs, their demand wouldn't come all at once and therefore the right problem statement would assume an order in which the queries should be added and optimized.

2) Prior work: I believe the paper omits a lot of the prior research. Subgraph matching is equivalent to multi-way self-joins between using an Edge table (from,to,label). While the relational techniques wouldn't scale in the context of subgraph queries, it is important to acknowledge that given a mapping between the relational and graph models, why would the relational techniques not work. one can start from the following reference [1] and do a forward sweep for more related work.
In the context of continuous subgraph queries, it is worth pointing that multiple continuous queries have been studied unlike the claim in the abstract: "Unfortunately, current approaches are mainly designed for a single query: optimizing and answering each query separately."
For instance, reference [2] evaluates multiple subgraph queries under single-edge insertion workloads. The thesis by Kankanamge also does the same [3]. More recent work was published *after* your submission [4]. I do not expect you to compare against reference [4] due to the recency and conflict between submission times but I believe it is worth adding to the related work as another technique to consider in this context. A major and more thorough review of existing literature is a must.
It is important to also communicate in the related work section how your auxiliary data structure differs from ones used both in the one-time and continuous subgraph matching cases.

3) Experimental setup:
3.1) Is it possible to showcase an experiment regarding the matching order? Specifically one to show empirically that the intuition regarding matching vertices shared by more queries first is truly beneficial?
3.2) The experiments provide run times on the aggregates, it might be worth adding single query experiments to establish very clear baselines for the 0% sharing case instead of a large suit of queries none of which is shared.
3.3) It is important to showcase limitations with regard to either dataset size or query size for the proposed technique. Specifically, when does the auxiliary data structure used lead to a degradation in performance?

Handling these comments would be important to make the work impactful and also acknowledge and reflect a better understanding of existing literature.

[1] Efficient and extensible algorithms for multi query optimization. SIGMOD 2000.
[2] Efficient multiview maintenance under insertion in huge social networks. TWEB 2014.
[3] Multiple Continuous Subgraph Query Optimization Using Delta Subgraph Queries. Thesis 2018.
[4] Optimizing One-time and Continuous Subgraph Queries using Worst-case Optimal Joins. TODS 2021.