Realizing Cascading Stream Reasoning with Streaming MASSIF

Tracking #: 1748-2960

Pieter Bonte
Riccardo Tommasini
Emanuele Della Valle
Filip De Turck
Femke Ongenae

Responsible editor: 
Guest Editors Stream Reasoning 2017

Submission type: 
Full Paper
To perform meaningful analysis over multiple streams of heterogeneous data, stream processing needs to support expressive reasoning capabilities to infer implicit facts and temporal reasoning to capture temporal dependencies. However, current stream reasoning approaches cannot perform the required reasoning expressivity while detecting time dependencies over high frequency data streams. Cascading Reasoning was meant to solve the problem of expressive reasoning over high frequency streams by introducing a hierarchical approach consisting of multiple layers. Each layer minimizes the processed data and increases the complexity of the data processing. However, the original Cascading Reasoning vision was never fully realized. Therefore, we propose a renewed and more generalized vision on Cascading Reasoning, serving as a blueprint for existing and future hierarchical approaches. Furthermore, we introduce Streaming MASSIF, a new Cascading Reasoning approach, performing expressive reasoning and complex event processing over high velocity streams. We show that our approach is able to handle high velocity streams up to hundreds of events per second, in combination with expressive reasoning and complex event processing.

Solicited Reviews:
Review #1
Anonymous submitted on 08/Dec/2017
Review Comment:

This paper presents a new vision for so-called cascading stream reasoning. The main idea of the architecture is to use different tools at different architectural layers so that different types of data can be handled with formalisms of appropriate expressivity. The authors also present an instantiation of their vision by extending the MASSIF stream reasoning system with two new components: a selection module, and an event processing module.

While the topic of the paper is highly relevant, I have numerous problems with the material presented in the paper. In my opinion, all of these problems combined make the paper not suitable for publication in its present form. The problems can be summarised as follows.

* I fail to see the value of the abstract vision of cascading reasoning. In fact, I found the surrounding discussion very confusing and without proper motivation. Also, I do not see how this benefits other practitioners in the field.

* The motivation for such a complex vision is unclear. In particular, instead of a layered architecture incorporating five different formalisms, I wonder whether the motivating example (and similar problems) could be solved using just one system and streaming processing language that can support querying, reasoning, and event detection.

* The technical material seems rather light-weight and has been presented poorly, and I question some technical choices made. The role of various attempts at formalising different languages is unclear, and the motivation behind them is unclear as well. As a consequence, the results are not reproducible (i.e., the paper does not contain sufficient information necessary to actually build different components of this system). Therefore, I do not see how the presented technical concepts could be of interest to other practitioners in the field.

* The motivation behind the evaluation is unclear, and it in fact shows an important weakness of unclear focus and contribution of the presented work.

* The quality of English is rather poor in places.

I discuss these issues in detail in the rest of my review.

1. The value of a cascading stream reasoning vision

I do not see what readers of this paper stand to gain from reading about the cascading stream reasoning vision: it is essentially just a picture that, in my opinion, provides very little guidance to implementors of stream reasoning systems. The rationale based on which the framework has been derived was presented in Section 4, but I could not follow it. For example, one "contribution" of the revised vision is replacing the DL and DLP layers from the original vision with a new Inference layer, but this is hardly an important insight: both DL and DLP are inference formalisms, and not a big leap of imagination is needed to support other formalisms as well. I also do not see how Figure 2 presents "the trade-off between expressiveness and rate of changes in the data": the fact that a logic is expressive and of high complexity does not necessarily mean that it cannot successfully handle high change rates. Such a trade-off should be demonstrated either theoretically or empirically.

The discussion in Section 4 is also very high-level. Moreover, the authors use terminology imprecisely: terms such as "descriptive analytics aspects" or "common semantic space" have not been defined. What does "populate a conceptual model" mean? Note that it could equally mean "add classes and/or properties to the model" or "add data to the elements already in the model".

Finally, no clear rationale for the vision design has been presented. I would expect that the design would be driven by clear qualitative or perhaps even quantitative requirements. Instead, this section seems to just present the authors' opinion, which has not been substantiated.

I believe that the paper would be better if it focussed on the description of the design of the extension of MASSIF, without any high-level philosophical discussions. This might make the paper more focused and more technical.

2. Complexity of the solution

As far as I can see, the proposed solution requires an integration of at least five languages: RSP-QL for querying streams, DatalogMTL, CEP, DLs, and DSL. This leaves me wondering whether such complexity is actually needed: I just do not see what each of these languages contributes to the entire picture. In fact, the entire vision seems very complex and difficult to understand.

I can imagine that one system supporting just one language capable of performing all of these tasks would be much easier to use. This language should clearly have the expressivity needed to support all different tasks, and I wonder whether temporal datalog could be used for this purpose. It should be possible to express the time-based windows in temporal datalog, and the authors also show that CEP operators can be expressed as well. This leaves DL reasoning, but I also wonder whether expressive DL constructs, such as existential quantifiers, are really needed: for example, for the purposes of data analysis, in the definition of HighTraficMainToadNearFlexibleOffice, the <= direction of the definition seems to be what is needed, and that direction can be expressed in datalog without problems.
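The claim about the <= direction can be illustrated with a single rule. This is only a sketch: C stands for the class named above, and the predicates MainRoad, HighTraffic, isNear and FlexibleOffice are hypothetical stand-ins for whatever the paper's ontology actually uses. The existential restriction on the DL side becomes an ordinary body variable y, so no existential quantifier is needed in the rule head.

```latex
% DL axiom, <= direction only (hypothetical class and role names):
%   MainRoad \sqcap HighTraffic \sqcap \exists isNear.FlexibleOffice \sqsubseteq C
% Equivalent datalog rule; y occurs only in the body:
\[
  C(x) \leftarrow \mathit{MainRoad}(x) \wedge \mathit{HighTraffic}(x)
                  \wedge \mathit{isNear}(x, y) \wedge \mathit{FlexibleOffice}(y)
\]
```

The opposite direction would require an existential in the rule head, which plain datalog cannot express; the point made above is that this direction may not be needed for data analysis.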

This would allow all processing to be done in a single framework, which would clearly be much simpler for users: they would just write the specification of their analysis in this one language and would not have to keep switching between various formalisms. Also, system implementation might be easier as one would need just one engine.

To address any scalability concerns, one could implement special processors for different fragments of the language. For example, if a part of a datalog program essentially implements the same functionality as RSP-QL, this program could surely be evaluated using similar techniques to what is found in an RSP-QL engine.

As a side-note, I wonder why DLs are needed in this system at all. The quantitative analysis of the form used in the examples is not really a strong point of description logics.

3. Presentation issues

My key criticism concerns the very poor presentation: many definitions are incomplete and unclear, the material is disorganised, and the level of detail is insufficient for readers to reproduce the presented results. As a consequence, I wonder what the take-home message of this paper really is. I will next point out many such specific problems.

An overview of the CEP language seems interesting enough, but the semantics has not been explained in sufficient detail. First, the authors do not say what the indexes of A and B in Figure mean. Moreover, the authors say that A AND B matches in both streams at t2, but they never explain whether the operator ever stops matching. I have analogous questions for the remaining operators. As an aside, it would be good to explain more clearly why high traffic cannot be defined in CEP: is this because there are no primitives for the quantitative manipulation of data?
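The question of whether A AND B ever stops matching can be made concrete with a small sketch. It is not taken from the paper: the function `match_and`, the `consume` flag, and the (type, timestamp) event tuples are all hypothetical, and the two readings shown are only two of several plausible semantics.

```python
from typing import List, Optional, Tuple

Event = Tuple[str, int]  # (event type, timestamp); hypothetical representation


def match_and(stream: List[Event], consume: bool) -> List[Tuple[int, int, int]]:
    """Return (detection time, A timestamp, B timestamp) for each A AND B match.

    consume=True:  a matched A/B pair is discarded, so each event is used once.
    consume=False: the latest A and B are kept, so every later A or B re-matches.
    """
    last_a: Optional[int] = None
    last_b: Optional[int] = None
    matches = []
    for kind, ts in sorted(stream, key=lambda e: e[1]):
        if kind == "A":
            last_a = ts
        elif kind == "B":
            last_b = ts
        else:
            continue  # ignore event types other than A and B
        if last_a is not None and last_b is not None:
            matches.append((ts, last_a, last_b))
            if consume:
                # Reading 1: the operator "stops matching" until a fresh pair arrives.
                last_a = None
                last_b = None
    return matches


stream = [("A", 1), ("B", 2), ("A", 3), ("B", 4)]
once = match_and(stream, consume=True)
repeated = match_and(stream, consume=False)
```

Under the first reading the example stream yields two matches, at timestamps 2 and 4; under the second it yields a match on every event from timestamp 2 onwards. A precise semantics in the paper would have to say which behaviour, if either, is intended.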

Definitions 3.1 -- 3.6 introduce a bunch of concepts that, as far as I can see, serve only to describe the syntax of RSP-QL. They are never used in the rest of the paper, and also the semantics of RSP-QL has not been specified. The notion of "RSP-QL algebraic expressions" has not been defined. Because of all that, I really do not see what is to be learned from all these definitions.

Similar comments apply to the definition of DatalogMTL: page 6 contains a bunch of formulas, but with just very high-level description. No formal semantics has been presented. I am not saying that a formal semantics should have been included; my main point is that presenting these definitions seems quite arbitrary to me.

Definitions in Section 5 are insufficiently precise to be called "definitions" in a formal sense. For example, Definition 5.1 defines the notion of a "physical event" in terms of an "event", but an "event" has never been formally defined. I was also confused by the relationship between E_phy and e_i in Definition 5.2: as far as I can see, e_i should be ontology individuals, and Definition 5.1 says that E_phy contains events; but then, this means that events are actually ontology individuals, which has never been explained. I was lost in the notation of Definition 5.2. What is a "complex event type" in Definition 5.4?

Section 5.2 contains a bunch of formulas that completely confused me. What exactly is a stream formally? (This has never been defined.)

I was also confused by Figure 6.a: apart from the fact that the text is typeset very poorly (there are strange spaces between various names), I really do not understand what I am supposed to learn from it.

I was also confused by the role of DatalogMTL in the paper, for two reasons. First, from the text I got the impression that the events are to be described using DatalogMTL, but event specifications are then translated into CEP from page 10. Note, however, that the table on page 10 presents the opposite translation: from CEP into DatalogMTL. Hence, this translation could be used if we had a DatalogMTL engine and wanted to use it to evaluate CEP operators; however, from the paper I understood that the goal was exactly the opposite, which completely confused me. Second, if there is a close translation between CEP and DatalogMTL, why bother with using both languages in this approach? Would it not be simpler if everything were defined just in CEP, so we can just forget about DatalogMTL?

4. Problems with evaluation

The goal of the evaluation was not clear to me, and I believe this to be indicative of the general lack of focus in this paper. Was the goal mainly to conduct a feasibility study? Or was the goal to attain certain scalability criteria? None of this has been stated in the paper explicitly, and therefore the meaning of the performance graphs is unclear to me. In fact, I do not even know how to interpret these numbers: for example, is processing 600 events in a window of 100 s (top-left part of Figure 8) good or bad performance?

If the objective of the paper was to make a performance claim, then a much more extensive comparison with existing approaches would be needed. Unfortunately, the only such comparison consists of one paragraph in Section 7.3.

5. Quality of English

The paper contains many grammatical and stylistic errors, of which I will next list just a few examples.

- "perform the required reasoning expressivity" makes no sense grammatically: there is no such thing as "performing expressivity".

- "allow to extract" and similar phrases are ungrammatical and should be "allow us to extract" or something similar. Similarly, "enables to use" should be "enables the use of".

- What does it mean for "a high traffic street to have many interpretations"? (I.e., how does a street have interpretations?)

- "the all the" is a typo.

- "Lets" should be written "Let's", but in fact the abbreviations should not be used in formal writing so one should write "Let us".

- "a renovated and more general vision" is not proper English: one can renovate a house, but not a vision.

- "and can consumes" is a typo.

Review #2
Anonymous submitted on 22/Dec/2017
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The manuscript addresses the topic of stream reasoning, thus fitting the scope of the special issue. There are two main contributions of the paper: 1) a new take on the cascading reasoning idea previously proposed, and 2) a realization of this newly proposed vision, by extending the MASSIF framework.

The manuscript initially provides an overview of the original cascading reasoning idea and the technologies it involves. It then highlights some of its limitations and suggests a new take on the cascading approach, this time by allowing more flexibility among layers and not limiting the technologies used. Everything is backed up by a working example, which facilitates understanding.

The paper then introduces the authors' realization of a cascading reasoning framework. It takes advantage of the layered architecture of MASSIF for event processing and adapts it by adding two extra layers to support the streaming aspect of the input data. A short evaluation is presented to demonstrate its feasibility.

Both the updated cascading idea and its implementation are novel. While the cascading pyramid is a nice way to depict the stream reasoning framework, the significance of the results is more concentrated in the implementation of Streaming MASSIF. The writing is clear and the paper is easy to follow.

Section 9 presents a nice comparison with other approaches and places them nicely within the new cascading reasoning idea. Based on that, it seems that the new cascading reasoning idea has already been followed by a number of proposed solutions, although it was never explicitly coined. I would suggest including a statement about this early on in the paper. Also, it was mentioned that some approaches support ASP but not CEP rules. It would be interesting to have a discussion about how Streaming MASSIF can be extended to support ASP.

Minor remark: In Figure 5 for the second observation it should be "hasValue(obs2,35)"

Review #3
By Daniel de Leng submitted on 26/Mar/2018
Major Revision
Review Comment:

The authors argue that cascading reasoning, which they describe as 'expressive reasoning over high frequency streams by introducing a hierarchical approach consisting of multiple layers', was never fully realised. It is my interpretation that they therefore (1) propose a generalisation of the original cascading reasoning vision, (2) suggest an instantiation of this vision that is based on CEP, DL, and RSP technologies, and (3) present a realisation of this suggested system as an extension to the pre-existing MASSIF platform. Author Della Valle was a co-author in the cited paper proposing the original cascading reasoning vision, and authors Bonte, De Turck, and Ongenae were co-authors in the cited paper on the original MASSIF platform. While the article is definitely relevant as part of Semantic Web and stream reasoning research, there are major shortcomings. In the following, I will explain my reasoning, starting with what I perceive to be major issues before concluding with a short discussion of perceived minor issues.

One of the biggest impediments to comprehending the material presented in the article is its structure. The article fails to make clear what its contributions really are. While it does briefly describe what it proposes, it does not state what it is that makes these proposals novel. The originality of the article is diminished when we find that contributions (1) and (3) are based on a body of work published prior by a subset of the article's authors. This does however not become clear in the article.

The key motivation of the work seems to be based on the argument that the original cascading reasoning vision was never realised. A discussion of why this is the case is kept to a minimum in Section 4.1. The article also does not explicitly state whether the MASSIF extension is a proper realisation. One can question whether the 'renewed and more generalised vision' amounts to an oversimplification (more on this later) that is not true to the original vision. In that case, even if the MASSIF extension realised this revised vision, can it be convincingly argued to realise cascading stream reasoning?

Another question is why cascading (stream) reasoning is important to begin with. The introduction does a poor job of explaining what the concepts of cascading reasoning (at least until we get to the background section) and stream reasoning entail. The latter is especially problematic for a special issue on stream reasoning. The motivating example section unfortunately does not help out either, instead focussing on a specific problem from the CityBench benchmark. I assume this motivating example is a representative problem that motivates the need for cascading reasoning, but no such argument is provided. Where is the need for cascading reasoning in this example? Since it is recurring throughout the article, it is vital that it is introduced properly, as it will provide a reader with a motivating context.

The lack of a motivating context is a recurring issue throughout the article. 'Stuff' gets introduced for no apparent reason, towards no apparent solution. It is not until Figure 5 on page 8 that we may start to see the big picture. I assume that the figure is a realisation of the revised pyramid in Figure 3? Figure 5 does not receive an explanation; merely a reference in Example 5.1. We have to wait until Section 5.2 on page 11 to get an explanation of Figure 5. Perhaps Figure 5 is too detailed to be described earlier, but a simplification or a high-level informal discussion of the desired system as part of the introduction would greatly help in providing the reader with a context. What is it you are going to do in this article? Why is it important? How does it stand out?

The article does not position itself relative to a large body of stream reasoning research. As mentioned before, there is not even an explanation of what this article considers stream reasoning to be. But the article also lacks a stream reasoning subsection in its background discussion. I am aware that there is a (very short) discussion of related work on specific stream reasoning systems within the Semantic Web context in Section 8. However, stream reasoning as a research area should be introduced early on as background material. At the very least I expect a discussion of the work on LARS and/or SECRET in a discussion about what constitutes stream reasoning. This also provides an opportunity to correct obviously false statements such as those in Section 4.1 claiming Temporal Logics and techniques for reasoning about time (among others) were 'recently' proposed beside 'traditional' stream reasoning research areas---Koymans introduced MTL in 1990 and progression was introduced by Bacchus and Kabanza in 1996, for example. I also disagree with the notion that '[t]he majority of the work on Stream Reasoning is focused in the area of RSP' (the citation seems to be somewhat misleading here). If our definitions of stream reasoning and what constitutes 'traditional' stream reasoning differ, this makes for a good argument to elaborate on the two in the article.

The revised Cascading Reasoning vision and associated pyramid are problematic, and I suspect there may be some confusion among the responsible author(s). The original cascading reasoning pyramid shown in Figure 1 is fairly specific. It assumes raw streams that can be converted to RDF streams (Section 5.1.2 erroneously states that raw streams can be converted to RDF streams; this is true only in special circumstances and excludes for example image data), and employs Description Logics and DL programming as cascading tools, nicely fitting a Semantic Web approach. On the left side, complexity is shown to increase as we move up the pyramid in terms of complexity classes. For some reason, the top class is mangled and listed as '2NextPTiMe'. The original paper on cascading reasoning correctly lists 2NEXPTIME, which is the complexity class for the Description Logic SROIQ, which forms the basis for OWL2. The revised vision has a corresponding pyramid in Figure 3 with the same complexity class mangling. This pyramid is incredibly general to the point of being generic. There have been many approaches including within Semantic Web research that follow these three steps, including the work on DyKnow (for example object linkage structures, adaptive state stream generation), Bröring's work on semantically-enabled sensor plug-and-play, and information fusion approaches in general. This generalisation is thus not novel. The pyramid height also becomes an important factor, as different choices for filling in the pyramid's layers affect the associated complexity classes and thereby the supported change frequency. One can potentially correct the figure by removing the explicit complexity classes and frequencies.

The article attempts to formalise concepts mathematically but does so poorly. While I understand that some of the formalisation borrows notation from RSP-QL, a more thorough explanation is needed. For some reason, Section 5.2 also refers back to the RSP-QL semantics paper midway through a formalisation. While this is ordinarily understandable, it makes me question why we spent time in Section 3.1.2 covering an apparently incomplete background subsection on this very topic. Some of the notation is inconsistent (SDS is sometimes a set, sometimes a function), incorrect (arguments disappear frequently in the rewritings), and incomplete; sometimes it is confusing (I still do not understand Definition 5.2 on abstract events), but it mostly seems to be unused outside of trying to describe things in terms of sets. None of the equations are numbered, and some are oddly centered for no apparent reason. Whereas there is a lot of detail without a clear purpose, there is also a lack of detail when describing, for example, MTL in Section 3.2 (no syntax or semantics) or the DSL (no semantics) in Section 6.

Benchmarks are great tools for doing comparative studies, but the article analyses the presented realisation in isolation in Section 7. This is followed by a brief discussion of other approaches within Semantic Web research in Section 8. As a full article submission, I would expect a more detailed description of and comparison to these other approaches that---if possible---extends to experimental results.

For the remaining minor issues, I would like to point out the presence of many typos (the most important one stating DL is the decidable fragment of FOL) and some odd wordings ('utterly meaningful', for example). Please verify that your references are listed correctly (specifically capitalisation), and check whether you can make Figure 5 easier to read (especially the grey text).

Despite the criticisms above, I believe the Streaming MASSIF realisation by itself is an interesting result that uses reasoning at different levels of abstraction, which is a similar approach to the one taken by DyKnow. This is a very important feature for stream reasoning to cope with the problems associated with handling streaming information, and few stream reasoning studies have explored this approach, making an argument for significance. The biggest issue with the article seems to be clarity rather than the work itself; the writing quality is somewhat poor. Finally, I find it difficult to judge the novelty of the work because the authors are not very clear in stating their contributions and how they are related to previous work. The combination of these factors leads me to recommend against inclusion of this article in its current form. If the listed problems are addressed, however, I believe this could be a good-quality article that provides a valuable addition to this special issue. I therefore suggest major revision as decision.