Enhancing the Scalability of Expressive Stream Reasoning via input-driven Parallelisation

Tracking #: 1750-2962

Thu-Le Pham
Ali Intizar
Alessandra Mileo

Responsible editor: 
Guest Editors Stream Reasoning 2017

Submission type: 
Full Paper
Stream reasoning is an emerging research area focused on providing continuous reasoning solutions for data streams. The exponential growth in the availability of streaming data on the Web has seriously hindered the applicability of state-of-the-art expressive reasoners to be applied to streaming information in a scalable way. However, we can leverage advances in continuous processing of Semantic Web streams to reduce the amount of data to reason upon at each iteration. Following this principle, in previous work we have combined semantic query processing and non-monotonic reasoning over data streams in the StreamRule system. We specifically focus on the scalability of a rule layer based on a fragment of Answer Set Programming (ASP). We recently expanded on this approach by designing an algorithm to analyse input dependency so as to enable parallel execution and combine the results. In this paper, we expand on this solution by providing i) a proof of correctness for the approach, ii) an extensive experimental evaluation for different levels of complexity of the input program, and iii) a clear characterization of all the algorithms involved in generating and splitting the graph and identifying heuristics for node duplication, as well as partitioning the reasoning process and combining the results.
Full PDF Version: 

Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 06/Dec/2017
Major Revision
Review Comment:

The article presents a stream reasoning system that integrates a standard stream reasoner with answer set programming (ASP) and thus allows for more expressive reasoning (i.e., ASP inference rules are taken into account); additionally, it demonstrates that the approach is efficient. In a nutshell, the latter is achieved by reasoning in parallel, based on a new analysis of the relations between the input data in view of the rules. The article claims to extend a conference paper in that it provides new contributions:
1. a better characterization of the algorithm applied to analyze input dependencies
2. a process that uses this input dependency graph to construct a plan for partitioning the input
3. a proof of correctness of the approach
4. experiments using a full implementation of the approach

I opt for major revision but, currently, I have strong concerns that the paper’s shape can be changed in the way necessary. The reasons are detailed below. Due to the large number of actual issues, I only list some examples (marked by *) with the single reasons.

(1) First of all, there are too many language issues, which also make it hard to understand the intended meaning sometimes:
- AE and BE style are mixed
* Abstract: analyse
* p. 2, col 1: characterization

- there are mistakes such as wrongly chosen or separated words, and strange sentence constructions:
* p. 3, col 1: Accordingly with the database terminology,
he avails of an instance of the StreamRule system
* p. 4, col 1: Fig-ure 1
* p. 4, col 2: Note that this is a different approach than distributing the processing

- there are general style issues such as missing/too much spaces:
* p. 3, col 1: b ← nota, a ← notb.
* p. 4, col 1: in following sections . The logic program

- the style could be more professional:
* p. 4, col 2: here is when the problem begins
* p. 14, col 2: In terms of comparison with similar systems, we recently became aware of two engines for complex stream reasoning we didn’t know about at the time the paper was written and the experiments performed.

(2) The technical parts (the definitions and the proof) do not comply with the general standards:
- Some definitions use notation that occurs in other definitions, but is not explicitly introduced as notation; in contrast, other definitions explicitly mention that (e.g., p. 3, col 1, Given a rule r as above, …):
* p. 4, col 1: An input window (or window), W, …
* Def. 3 refers to N_P, E_P1 which are mentioned in Def. 1, but in a way that implies that the notation is arbitrary instead of fixed

- The paper is inconsistent in the entities it defines and does not define, in terms of granularity. It defines rather general notions, such as “connected component” (in a graph), but uses much more specific notions without explanation:
* p. 3, col 2, consistency is not defined (whereas satisfaction is defined some sentences before): … is logically closed if it is consistent or contains all literals.

- Some details in the definitions do not make sense to me
* Def. 3, (i): it is not clear to me why there is a sequence p_1, …, p_n of predicates; I would the think you just need four different predicates
* Def. 3, (i): why the unique existential quantifier and not a general one

- Some definitions are confusing words (esp. “predicate” and “atom”) or missing conditions, or are inconsistent otherwise:
* Def. 1: p_u, q_u, p_d, q_d are nodes and hence predicates, and hence cannot be literals
* Def. 1, (ii), (a): only fits to the rest (e.g., Fig. 2) if p_u\neq q_u

- Proofs of Prop. 1,2:
* “recursive” rule is important but not defined
* Why is that case with recursive rules special?
* For the multiple rules case, I think, you need induction
* pre(W_i) is a set of predicates and cannot be connected components

(3) Regarding the article in general and Contributions 1.-4.:
- Large parts of the article are taken from the conference paper; e.g., the text is only slightly extended sometimes and the examples are the same.
- Further, I actually cannot make out exactly what is meant by the first two contributions. With 2., I guess, the authors mean that the article does not only contain a textual description of the approach but a more formal presentation as algorithms (Algorithms 1-3). However, in my opinion, these three algorithms do not contain interesting details - the corresponding definitions and descriptions, which are also contained in the conference paper already, should explain the necessary processing sufficiently.
- Similarly, regarding Contribution 3: First of all, the proof is lacking formality and details (see the technical issues above). Second, it is not surprising, but should be (!) a simple induction proof along the definitions.
- I am no expert in the practical field, but the experiments seem to be sound, to me: they compare the system to similar, existing systems, use standard benchmarks, and consider important metrics.

(4) Related work: The paper generally describes related works sufficiently, although sometimes giving details I do not see the reason for. On the other hand, the two related approaches mentioned in the conclusions should be described in the related work section and, especially, the processing of the first, also ASP-based system, should be delimited from the approach presented in the article.

Altogether, I think, that the approach is interesting, and that the experiments alone would be sufficient to classify the article as “extension” of the conference paper. However, as mentioned above, in its current form, the submission does not comply with journal standards and needs major revision. Amongst others, I would suggest to the authors to let a native speaker proof read the paper.

Review #2
By Konstantin Schekotihin submitted on 12/Dec/2017
Minor Revision
Review Comment:

The paper presents an interesting and novel approach to stream reasoning using answer set programming. The framework presented by the authors implements a number of preprocessing and input splitting techniques allowing for parallel processing of subsets of input window. The evaluation shows the effectiveness of the suggested approach in three different scenarios in which its performance was comparable to other modern stream processing engines.

In general I like the paper and find that the presented approach is reasonable and novel. The readability of the paper and some open technical questions listed below must still be improved.

Section 2
- q_1 ... q_n is not required since normal rules have only one atom in the head
- head literals -> head atoms (according to your definition)
- remove "term" from "A term, an atom, a literal..."
- check typing of rules in Section 2.1 Syntax.
- Semantics subsection is unclear, could you please check, e.g. [9], and improve it
- "a predicate occurring only in facts is referred to as an EDB" What about atoms over predicates appearing only in the body of a rule, e.g. traffic_light in Listing 1?

Section 3
- Grounding is the most expensive from a computational point of view, since you consider stratified programs. For many other application domain this statement false.
- predicates used in Listing 1 and in Figure 2 do not match
- Definition 3 is hard to read, especially the cases, when G_P is not mentioned explicitly. Directed path is defined on E_{P_2} in Definition 2 and in Definition 3 E_{P_1} is used. One can replace the undirected edges with the directed ones to keep the things simple.
- Also I do not get the explanation of Definition 3. How can a graph, which nodes is a set of predicates, define dependencies among ground atoms? Especially, if you allow for comparison predicates, like in Listing 1?
Consider the following program:
c(X) :- b(X), a(X), X > 5.
for all values of X <= 5 there is no dependency between ground atoms over predicates a and b.
- Algorithm 2 line 10: continue
- Algorithm 2: Since the algorithm ads all child nodes of a vertex, it might add also neighbor vertices over undirected edges and thus violate the definition. Please improve this in lines 15 and 20.
- Computation of the maximal cliques can be quite time consuming for a stream processing application. Only in the next section it gets clear that this procedure is off-line.

Section 4. Proposition 1. Proof:
- ground versions of predicate -> ground atoms over predicate
- I see why proposition 1 holds, but I do not get the sense of sentence: "This invalidates the hypotheses that $G^{inpre(P)}_P$ is not connected and $pre(W_i)$ are connected components of $G^{inpre(P)}_P$" What you want to show is that the inequality in the previous sentence does not hold. But I do not see this formally in the proof.

Review #3
By David Bowden submitted on 12/Feb/2018
Review Comment:

The paper describes a technique for scaling the process of reasoning over streamed RDF data, and is a more detailed investigation into the work started in “Towards scalable non-monotonic stream reasoning via input dependency analysis”, by the same authors.

1. Introduction
- Various reasoning systems are described, and the process is well defined. However, it could be clearer why the StreamRule system was selected for the implementation.
2. Preliminaries & Motivating Example
- Analysing streamed data in real-time is becoming an important tool in a number of industries, such as transport, so the motivating example is useful and relevant.
3. Input Dependency Analysis
- The approach of scaling the system by splitting the input stream, rather than the rule set, is an interesting approach, and is applicable to practical implementations.
4. Parallel Reasoning in StreamRule
- A good explanation of how the input dependency graph is used to split the input triples into sub-windows, with a validation of the approach.
5. Evaluation
- The experimental method was clearly explained, but a little more analysis of the results would have contributed to the overall evaluation.
6. Related Works
- The partitioning technique may also be applicable to the concepts set out by Chen and Dumontier, in their paper “A Framework for Distributed Ontology Systems”, although their work related to big data volumes, rather than big data velocities.
7. Conclusion
- It would have been nice to see some further conclusions drawn on the relative performance of the PR system.

I think the paper is a useful extension to previous work in this area and I would recommend publication.