A Query Language for Semantic Complex Event Processing: Syntax, Semantics and Implementation

Tracking #: 1673-2885

Authors: 
Syed Gillani
Antoine Zimmermann
Gauthier Picard
Frédérique Laforest

Responsible editor: 
Oscar Corcho

Submission type: 
Full Paper
Abstract: 
Today most applications on the Web and in enterprises produce data in a continuous manner under the form of streams, which have to be handled by Data Stream Management Systems (DSMSs) and Complex Event Processing (CEP) systems. The Semantics Web, through its standards and technologies, is in constant pursue to provide solutions for such paradigms while employing the RDF data model. The integration of Semantic Web technologies in this context can handle the heterogeneity, integration and interpretation of data streams at semantic level. In this paper, we propose a new query language, called SPAseq, that extends SPARQL with new Semantic Complex Event Processing (SCEP) operators that can be evaluated over RDF graph-based events. The novelties of SPAseq includes (i) the expressibility of temporal operators such as Kleene+, conjunction, disjunction and event selection strategies, and (ii) the support for multiple heterogeneous streams. SPAseq not only enjoys good expressiveness but also uses a non-deterministic automata (NFA) model for an efficient evaluation of the SPAseq queries. We provide the syntax and semantics of SPAseq and based on this, we implement a query engine that employs NFA to evaluate these operators in an optimised manner. Moreover, we also present an experimental evaluation of its performance, showing that it improves over state-of-the-art approaches.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 23/Jun/2017
Suggestion:
Major Revision
Review Comment:

The paper presents a new query language named SPA_SEQ that extends SPARQL with
Complex Event Processing operators that are evaluated over RDF graph-based
events.

The authors provide a very good and critical analysis of existing systems and
highlight their limitations. They distinguish between RDF Stream Processing
(RSP) systems, which inherit the processing model of Data Stream Management
Systems and adapt it to RDF triple streams, and Semantic CEP system that
extend classical rule-based CEP systems. The former do not typically offer
processing abstractions to define and detect temporal event patterns. Thus,
the introduction of temporal operators is the main factor that distinguishes
SPA_SEQ from RSP systems.

Conversely, the novelty of SPA_SEQ with respect to other Semantic CEP systems
such as EP-SPARQL is more blurry. Although I buy that there's room to improve
the languages in this area, I'm not convinced by some of the limitations
discussed in the paper. For instance, I do not believe that distinguishing
multiple streams at the language level is necessary: indeed, many existing CEP
languages do not distinguish between streams, but instead consider all
incoming events as part of a single flow. Similarly, I think that the level in
which the integration of live and background knowledge is performed (query
language vs external rules) is not a key distinguishing factor. On the other
hand, one could claim that interval time semantics (offered by EP-SPARQL but
not by SPA_SEQ) is a relevant feature. In summary, the distinction and novelty
is a bit blurry and I'd like the authors to elaborate more on these points.

Concerning the SPA_SEQ language, I like the idea of separating the definition
of temporal patterns from the definition of graph patterns. Indeed, I believe
that this is a good and novel approach to the design of SCEP systems and I
would stress it more in the paper. On the other hand, Section 5 (Formal
Semantics) contains many well known concepts from the Semantic Web community
(semantics of graph pattern matching expressions) and the CEP community
(semantics of temporal pattern matching expressions). I suggest that the
authors remove this section and refer to the existing literature, simply
discussing how it can be adapted to SPA_SEQ. The same holds for Section 7: the
algorithms to translate to NFA are well known in the literature (Cayuga, SASE,
SASE+). I suggest that the authors refer to these work and simply discuss the
differences and novelty of their approach.

Furthermore, the decoupling of graph pattern matching and temporal pattern
matching makes it possible to consider CEP operators other than the ones
presented in the paper (which mostly mimic SASE+). I would maybe stress this
aspect more, showing that the system can potentially use different types of
CEP languages and operators.

Concerning the NFA algorithm, the authors repeatedly claim that this is the
de-facto standard for event detection. However, several different types of
algorithms have been proposed in the literature, including tree-based
algorithms, lazy evaluation algorithms (similar to one of the optimizations
proposed by the authors), and others. I suggest that the authors acknowledge
the different types of algorithms and put their solution in context.

Concerning the evaluation, I would be interested in seeing experiments that
involve background knowledge and reasoning on background knowledge, which is
one of the key motivations to adopt the RDF format.

To conclude, I believe that the topic is interesting and the solution presents
some original contributions with respect to the state of the art, and in
particular the decoupling of graph pattern matching and temporal pattern
matching. The paper is also well written and easy to read and
understand. Nevertheless, the current version of the paper repeats some well
known concepts, such as the semantics of temporal operators and the
translation to automata. I suggest that the authors remove these parts and
simply refer to existing work in the area to make the paper more compact and
better put the work in context. I also suggest that the authors extend the
evaluation to better consider the integration with background knowledge and
reasoning.

Review #2
By Jean Paul Calbimonte submitted on 18/Sep/2017
Suggestion:
Major Revision
Review Comment:

The authors identify the following contributions: the syntax and semantics of Spaseq, the implementation, optimization and evaluation of the language. However, it is also key for this paper to motivate why this new language is needed. The authors devote a section to try this but this should be better reflected in the introduction as well. There, essentially 3 main points are raised: multiple streams, explicit kleene, and separated sequence from graph pattern.
As it is currently written ,these 3 main points are not a strong enough argument. Multiple streams are present in RESP-QL, for example, although only in a theoretic form. Concerning kleene, is it only a syntax addition (making it explicit) or does it have a fundamental difference to what ETALIS provides? The separation of graph and sequence is also presented as a 'cleaner' solution, but to me, these arguments are not strong enough or convincing.
In the end, the main motivation of a language is to answer to requirements that current languages do not satisfy. In this sense, the motivation of this paper lacks evidence of real needs for these additional features. What are the real use cases that require all these features? The example is ok, but it is a didactic one, and it would be better to also support the motivational aspects with evidence of a use case and requirement analysis. For example, it would be easy to argue the need for multiple streams, and cite use case that actually demand this. Use cases can be taken from state of the art or the RSP CG lists.

In this same lines, in section 2 the identified requirements are way too vague (genericity, composition, and user friendly-ness) and do not indicate the needs of the new language features mentioned later on. The authors could try to do a more systematic work of identifying requirements directly linked to the new features that they propose. For instance, event selection is mentioned several times, but there is no explanation why this is important, and to what requirements it answers to.
This information is somehow present in different parts of the manuscript, but it would take some work to reformulate and make this more clear and coherent in order to make the motivation stronger.
One way to do this is to systematically identify requirements (linked to use cases) related directly to the new features that the language introduces.

The evaluation has some issues regarding the heterogeneity of queries/settings, as well as the choosing of the datasets. At this point, it would be expected that a more comprehensive evaluation would be provided, given the vast amount of work done in this sense in the literature. More details on this in the detailed comments. It appears to me that only 2 queries + minor extensions, and limited window sizes, lack of simultaneousness evaluation, etc. are some of the aspects that could be improved.
This is important not only for showing evidence of the efficiency of the system, but also for the community that already has a range of options to choose.
Furthermore, it remains unclear to me how this system compares to non-RDF CEPs. I think it would be very important to have an idea of how far (or how superior?) this system is compared to these non-RDF options. this is a key element in the end to show that this kind of Semantic Web system efforts are not just an academic experiment, but a real alternative that could be explored by the industrial sector.
Another evaluation issue is the user-friendly-ness. Even if it is listed as one of the requirements/contributions, is there any evaluation to sustain it?

Finally, I should say again that this work requires revision, especially in terms of the criteria used to choose the material/text/content that has to be included in the manuscript. In my opinion, large portions of the current version of the paper would be better placed in a Tech report or equivalent, while only the novel/innovative parts should be included in the journal paper. This step is, to me, required in order to do a better evaluation of the paper, and to make it accessible to the SWJ audience.

Detailed comments:

** Abstract

The Semantics Web -> Semantic

** Introduction

With the evolution of […], large volumes of data are generated … -> not sure what is meant by this sentence, is the evolution of social networks provoking the generation of streams?

Performed in real-time manner -> this is not really true. The authors may look at real-time processing literature, which is a totally different area. Real time systems imply other type of time constraints
regarding data processing, which usually do not correspond with CEP.

Within data streams -> within a data stream

Both academic and industrial world -> check grammar

An Additional characteristic -> the additional

SCEP con be evolved from the common practice of stitching heterogeneous environments -> not clear at all what this means.

Specify known queries or patterns in an intuitive way: This point is important. In many CEP languages, the resulting language syntax is not easy at all or intuitive for end users.

These languages do not provide any temporal pattern .. -> True, but with a note, C-SPARQL provides a timestamp function that actually allows to write temporal patterns. Not elegant but it exists as a sort of workaround feature.

From following drawbacks -> from the following drawbacks

They outperforms -> outperform

** Section 2

Notifies the users or an -> of an

That is attached -> that are

Switch to renewable -> switch to a renewable

Their query languages do not provide any operators for temporal pattern… -> as mentioned before, only partially true. C-SPARQL somehow supports it with the timestamp function.

Based on RDF graph -> based on the RDF graph model

RSEP-QL is a theoretical work that models a SCEP language. It would be important to check its compatibility with Spaseq.

** Section 3

Is an RDF graph stream a set or a sequence? Definition 2 as it stands does not define the order based on the timestamps. Although later there is ac constraint mentioned in the text, this should probably be incorporated into the definition. Why is there only one graph per timestamp? There is no graph contemporaneity in this model?

In example 1, the table view follows any RDF format for the graphs or is it only an informative representation? This is maybe only a detail, but it actually matters a lot if these streams are exposed on the Web.

Definition 4 is a bit confusing to me. I was expecting the formal definition of a Spaseq query, but instead it seems to be a tuple that references the syntax of the query. The authors may want to look at similar definitions, e.g. the formal definition of SPARQL 1.1, or the definition of RSEP-QL and RSP-QL.
In sum , the problem is that the current definition only contains the variables, a duration and the syntax of the query. So this definition is purely a description of the syntax of the query, not the formal definition of an abstract query. The latter would be expected to reference instead an algebra expression (that corresponds with the syntax), the stream dataset over which the query will operate, at least.

In order to understand the examples a bit better, e.g. Example 2, an example of stream matching the queries could be very useful.

** Section4

While UC2,3 and 4 are useful form an informative point of view, the differences among them are not many, and do not add much to the overall discussion. Perhaps would fit better as annexes in this already pretty long paper.

** Section 5

One problematic issue with the windowing mechanism proposed in the semantics, is that it may break sequences of events. If an event A is in a windows, and B in the following window, the pattern SEQ A B will not recognize this sequence.

Again, in this section we miss the definition of the query. In the previous sections the authors only define the syntax elements but not the query itself. The same applies for a GPM. It is currently defined in terms of a syntax expression. Syntax is one thing, different from an algebra expression.

In definition 7, the result of the evaluation is essentially a stream of time annotated mappings. this should be better described. Or, even the concept of annotated mappings or stream of mappings could be introduced beforehand.

Definition 8 is too informal. It is basically a description of a Bop, but one would expect a formal expression definition. As a consequence, in definition 9 we don't know what the Bops are really.

As a minor issue, the authors may consider using single-line braces for multiple line expressions. Otherwise the result is visually strange.

In definitions 10 and 11 we have variants, including the skip-till options. Would it make sense to include this in the language itself, instead of making it a configuration? Or would it complicate the usage of the language instead?

Example 8: will results -> will result

Definition 17 results in a set of timestamped mappings. however, a sequence would probably reflect better that this is in fact a stream of timestamped mappings.

The negation and optional discussions are interesting, but they are not really part of Spaseq, and hence these sections seem out of place, and could be reduced to a paragraph, or point to them as annex somewhere else.

** Section 6

This section repeats some of the points mentioned beforehand, and is just another list of motivations for introducing this language. I would suggest to omit this or integrate the most relevant parts in the motivation, or include a paragraph in the conclusions.

** Section 7

This section includes the description of the compilation of each operator (in terms of the automaton) and then the details of the algorithm as well. This description is too detailed, and somehow repetitive. A much shortened version with the most illustrative case would be enough. the rest could be an appendix though.

variable techniques > ???

mapped onto -> into?

I do not understand why the negation and optional are presented if they are not part of the query language, at least as defined in the previous sections.

** Section 8
An introduction to what an optimiser is, is unnecessary, at least to me.
cost_n formula: where c(P,G_e^î) ?

The authors may prefer to focus on the novel techniques and omit details on parts like the window pushing or other techniques that are rather standard.

** Section 9
About the SMD dataset, although it uses an interesting generation model, does it have any property that guarantees, or indicates a degree of similarity to a real world stock market data series? Or could the authors have used basically the same model for any other type of data (e.g. a simulated traffic stream).
For pattern detection, the nature of the dataset is certainly important, as the sequences, and types of sequences are essential for the query processing algorithms.

In a comprehensive evaluation one would expect a more extensive set of test queries. Instead, the authors propose essentially two queries that consider most of the aspects that they wish to put to test. I think it would be much more convincing to propose a wider range of queries with different characteristics, otherwise there may be a bias towards a certain type of query pattern or query type. Furthermore, I miss a multiple query behaviour evaluation. It is common in CEPs to evaluate how the system responds to different combinations of data/query loads simultaneously.
It is also interesting to see the window sizes chosen for the evaluation, which have a quite small range between 2 and 10 sec. What is the rationale for this choice?
Even if this range is small, we see a quite important decrease in throughput in most of the experiments. Of course, in cases where the comparison with EP_SPARQL is made, the latter has an even worse behaviour. However, I am not sure how realistic it is t take these window sizes for the datasets chosen.
Perhaps it would be interesting to see a more heterogeneous set of evaluation settings, also including larger window sizes. This is totally common in real use cases in stock market analysis and smart cities as well.
It would also be interesting to see a rough comparison with RDF stream query processors, even if they are of course limited in terms of features compared to Spaseq. However, it could serve as a reference for implementers or users who need to choose among both. This specially applies for CSPARQL, given that it has a Time Function feature that naively allows a form of sequence operator. Of course it would be expected that this would perform worse than the proposed techniques.
Nevertheless, given that Spaseq also performs simple graph pattern matching, it would be pedagogic to see a comparison with one of these RSPs at least in cases where a comparison is valid/possible.