Review Comment:
The authors identify the following contributions: the syntax and semantics of Spaseq, and the implementation, optimization and evaluation of the language. However, it is also key for this paper to motivate why this new language is needed. The authors devote a section to this, but it should be better reflected in the introduction as well. There, essentially 3 main points are raised: multiple streams, explicit Kleene, and the separation of sequence from graph pattern.
As it is currently written, these 3 main points are not a strong enough argument. Multiple streams are present in RSEP-QL, for example, although only in a theoretical form. Concerning Kleene, is it only a syntax addition (making it explicit) or does it differ fundamentally from what ETALIS provides? The separation of graph and sequence is also presented as a 'cleaner' solution, but to me, these arguments are not strong enough or convincing.
In the end, the main motivation of a language is to answer requirements that current languages do not satisfy. In this sense, the motivation of this paper lacks evidence of real needs for these additional features. What are the real use cases that require all these features? The example is ok, but it is a didactic one, and it would be better to also support the motivational aspects with evidence of a use case and a requirement analysis. For example, it would be easy to argue the need for multiple streams, and cite use cases that actually demand this. Use cases can be taken from the state of the art or the RSP CG lists.
Along these same lines, in section 2 the identified requirements are far too vague (genericity, composition, and user-friendliness) and do not indicate the need for the new language features mentioned later on. The authors could try to identify, more systematically, requirements directly linked to the new features that they propose. For instance, event selection is mentioned several times, but there is no explanation of why this is important, or of which requirements it answers.
This information is somehow present in different parts of the manuscript, but it would take some work to reformulate it and make it clearer and more coherent, so as to strengthen the motivation.
One way to do this is to systematically identify requirements (linked to use cases) related directly to the new features that the language introduces.
The evaluation has some issues regarding the heterogeneity of queries/settings, as well as the choice of datasets. At this point, a more comprehensive evaluation would be expected, given the vast amount of work done on this in the literature. More details on this are given in the detailed comments. Aspects that could be improved include the use of only 2 queries plus minor extensions, the limited window sizes, and the lack of an evaluation with simultaneous queries.
This is important not only for showing evidence of the efficiency of the system, but also for the community, which already has a range of options to choose from.
Furthermore, it remains unclear to me how this system compares to non-RDF CEPs. I think it would be very important to have an idea of how far behind (or how superior?) this system is compared to these non-RDF options. This is a key element in the end to show that this kind of Semantic Web system effort is not just an academic experiment, but a real alternative that could be explored by the industrial sector.
Another evaluation issue is user-friendliness. Even though it is listed as one of the requirements/contributions, is there any evaluation to support it?
Finally, I should say again that this work requires revision, especially in terms of the criteria used to choose the material/text/content that has to be included in the manuscript. In my opinion, large portions of the current version of the paper would be better placed in a Tech report or equivalent, while only the novel/innovative parts should be included in the journal paper. This step is, to me, required in order to do a better evaluation of the paper, and to make it accessible to the SWJ audience.
Detailed comments:
** Abstract
The Semantics Web -> Semantic
** Introduction
With the evolution of […], large volumes of data are generated … -> not sure what is meant by this sentence, is the evolution of social networks provoking the generation of streams?
Performed in real-time manner -> this is not really true. The authors may look at the real-time processing literature, which is a totally different area. Real-time systems imply other types of time constraints regarding data processing, which usually do not correspond with CEP.
Within data streams -> within a data stream
Both academic and industrial world -> check grammar
An Additional characteristic -> the additional
SCEP con be evolved from the common practice of stitching heterogeneous environments -> not clear at all what this means.
Specify known queries or patterns in an intuitive way: This point is important. In many CEP languages, the resulting language syntax is not easy at all or intuitive for end users.
These languages do not provide any temporal pattern .. -> True, but with a note: C-SPARQL provides a timestamp function that actually allows one to write temporal patterns. Not elegant, but it exists as a sort of workaround feature.
From following drawbacks -> from the following drawbacks
They outperforms -> outperform
** Section 2
Notifies the users or an -> of an
That is attached -> that are
Switch to renewable -> switch to a renewable
Their query languages do not provide any operators for temporal pattern… -> as mentioned before, only partially true. C-SPARQL somehow supports it with the timestamp function.
Based on RDF graph -> based on the RDF graph model
RSEP-QL is a theoretical work that models a SCEP language. It would be important to check its compatibility with Spaseq.
** Section 3
Is an RDF graph stream a set or a sequence? Definition 2 as it stands does not define the order based on the timestamps. Although a constraint is mentioned later in the text, this should probably be incorporated into the definition. Why is there only one graph per timestamp? Is there no graph contemporaneity in this model?
In example 1, does the table view follow any RDF format for the graphs, or is it only an informative representation? This is maybe only a detail, but it actually matters a lot if these streams are exposed on the Web.
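To make the two readings concrete, here is a minimal sketch of the ordering constraint in question. It is not taken from the manuscript: the names `GraphEvent` and `is_ordered_stream` are hypothetical, and the `strict` flag is only my way of contrasting "one graph per timestamp" with a model that allows contemporaneous graphs.

```python
from typing import FrozenSet, List, Tuple

Triple = Tuple[str, str, str]
# Hypothetical representation of one stream element: (RDF graph, timestamp).
GraphEvent = Tuple[FrozenSet[Triple], int]


def is_ordered_stream(stream: List[GraphEvent], strict: bool = True) -> bool:
    """Check that timestamps never decrease along the sequence.

    strict=True  -> at most one graph per timestamp (strictly increasing).
    strict=False -> contemporaneous graphs allowed (non-decreasing).
    """
    timestamps = [ts for _, ts in stream]
    pairs = list(zip(timestamps, timestamps[1:]))
    if strict:
        return all(a < b for a, b in pairs)
    return all(a <= b for a, b in pairs)


g1 = frozenset({(":m1", ":power", "10")})
g2 = frozenset({(":m2", ":power", "12")})

# Two graphs sharing timestamp 5: valid only under the non-strict reading.
contemporaneous = [(g1, 5), (g2, 5)]
strict_ok = is_ordered_stream(contemporaneous, strict=True)
relaxed_ok = is_ordered_stream(contemporaneous, strict=False)
```

Stating in Definition 2 which of these two conditions holds would remove the ambiguity.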
Definition 4 is a bit confusing to me. I was expecting the formal definition of a Spaseq query, but instead it seems to be a tuple that references the syntax of the query. The authors may want to look at similar definitions, e.g. the formal definition of SPARQL 1.1, or the definition of RSEP-QL and RSP-QL.
In sum, the problem is that the current definition only contains the variables, a duration and the syntax of the query. So this definition is purely a description of the syntax of the query, not the formal definition of an abstract query. The latter would be expected to reference, at the least, an algebra expression (that corresponds with the syntax) and the stream dataset over which the query will operate.
In order to understand the examples a bit better, e.g. Example 2, an example of stream matching the queries could be very useful.
** Section 4
While UC2, 3 and 4 are useful from an informative point of view, the differences among them are not many, and they do not add much to the overall discussion. Perhaps they would fit better as annexes in this already pretty long paper.
** Section 5
One problematic issue with the windowing mechanism proposed in the semantics is that it may break sequences of events. If an event A is in one window, and B in the following window, the pattern SEQ A B will not recognize this sequence.
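The following sketch illustrates the issue. It assumes simple tumbling (non-overlapping) windows and a naive matcher, both of which are my own simplifications rather than the manuscript's actual semantics; the names `tumbling_windows` and `matches_seq_a_b` are hypothetical.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # (event type, timestamp in seconds)


def tumbling_windows(events: List[Event], size: int) -> List[List[Event]]:
    """Group events into consecutive, non-overlapping windows of `size` seconds."""
    windows: Dict[int, List[Event]] = {}
    for etype, ts in events:
        windows.setdefault(ts // size, []).append((etype, ts))
    return [windows[k] for k in sorted(windows)]


def matches_seq_a_b(window: List[Event]) -> bool:
    """True if some A is followed by a later B within the given events."""
    a_times = [ts for etype, ts in window if etype == "A"]
    b_times = [ts for etype, ts in window if etype == "B"]
    return any(a < b for a in a_times for b in b_times)


# A at t=9 and B at t=11 are 2 seconds apart but fall on either side
# of the boundary between the windows [0, 10) and [10, 20).
stream = [("A", 9), ("B", 11)]
per_window = [matches_seq_a_b(w) for w in tumbling_windows(stream, 10)]
whole_stream = matches_seq_a_b(stream)
print(per_window, whole_stream)  # [False, False] True
```

The sequence exists in the stream but is found in neither window. The authors should state whether their semantics behaves this way and, if so, justify it.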
Again, in this section we miss the definition of the query. In the previous sections the authors only define the syntax elements but not the query itself. The same applies for a GPM. It is currently defined in terms of a syntax expression. Syntax is one thing, different from an algebra expression.
In definition 7, the result of the evaluation is essentially a stream of time-annotated mappings. This should be better described. Alternatively, the concept of annotated mappings or a stream of mappings could be introduced beforehand.
Definition 8 is too informal. It is basically a description of a Bop, but one would expect a formal expression definition. As a consequence, in definition 9 we don't know what the Bops are really.
As a minor issue, the authors may consider using single-line braces for multiple line expressions. Otherwise the result is visually strange.
In definitions 10 and 11 we have variants, including the skip-till options. Would it make sense to include this in the language itself, instead of making it a configuration? Or would it complicate the usage of the language instead?
Example 8: will results -> will result
Definition 17 results in a set of timestamped mappings. However, a sequence would probably better reflect that this is in fact a stream of timestamped mappings.
The negation and optional discussions are interesting, but they are not really part of Spaseq, and hence these sections seem out of place, and could be reduced to a paragraph, or point to them as annex somewhere else.
** Section 6
This section repeats some of the points mentioned beforehand, and is just another list of motivations for introducing this language. I would suggest to omit this or integrate the most relevant parts in the motivation, or include a paragraph in the conclusions.
** Section 7
This section includes the description of the compilation of each operator (in terms of the automaton) and then the details of the algorithm as well. This description is too detailed, and somewhat repetitive. A much shortened version with the most illustrative case would be enough. The rest could be moved to an appendix.
variable techniques -> ???
mapped onto -> into?
I do not understand why the negation and optional are presented if they are not part of the query language, at least as defined in the previous sections.
** Section 8
An introduction to what an optimiser is, is unnecessary, at least to me.
cost_n formula: where c(P,G_e^î) ?
The authors may prefer to focus on the novel techniques and omit details on parts like the window pushing or other techniques that are rather standard.
** Section 9
About the SMD dataset, although it uses an interesting generation model, does it have any property that guarantees, or indicates a degree of similarity to, a real-world stock market data series? Or could the authors have used basically the same model for any other type of data (e.g. a simulated traffic stream)?
For pattern detection, the nature of the dataset is certainly important, as the sequences, and types of sequences are essential for the query processing algorithms.
In a comprehensive evaluation one would expect a more extensive set of test queries. Instead, the authors propose essentially two queries that consider most of the aspects that they wish to put to test. I think it would be much more convincing to propose a wider range of queries with different characteristics, otherwise there may be a bias towards a certain type of query pattern or query type. Furthermore, I miss a multiple query behaviour evaluation. It is common in CEPs to evaluate how the system responds to different combinations of data/query loads simultaneously.
It is also interesting to see the window sizes chosen for the evaluation, which have a quite small range between 2 and 10 sec. What is the rationale for this choice?
Even if this range is small, we see a quite important decrease in throughput in most of the experiments. Of course, in cases where the comparison with EP_SPARQL is made, the latter has an even worse behaviour. However, I am not sure how realistic it is to take these window sizes for the datasets chosen.
Perhaps it would be interesting to see a more heterogeneous set of evaluation settings, also including larger window sizes. Larger windows are very common in real use cases in stock market analysis and in smart cities as well.
It would also be interesting to see a rough comparison with RDF stream query processors, even if they are of course limited in terms of features compared to Spaseq. It could serve as a reference for implementers or users who need to choose between the two. This especially applies to C-SPARQL, given that it has a timestamp function that naively allows a form of sequence operator. Of course it would be expected that this would perform worse than the proposed techniques.
Nevertheless, given that Spaseq also performs simple graph pattern matching, it would be instructive to see a comparison with one of these RSPs, at least in cases where a comparison is valid/possible.