Scalable RDF Stream Reasoning in the Cloud

Tracking #: 1747-2959

Authors: 
Xiangnan Ren
OLIVIER CURE
Hubert Naacke
Ke Li

Responsible editor: 
Guest Editors Stream Reasoning 2017

Submission type: 
Full Paper
Abstract: 
Reasoning over semantically annotated data is an emerging trend in stream processing aiming to produce sound and complete answers to a set of continuous queries. It usually comes at the cost of finding a trade-off between data throughput, latency and the cost of expressive inferences. StriderR proposes such a trade-off and combines a scalable RDF stream processing engine with an efficient reasoning system. The main reasoning services are based on a query rewriting approach for SPARQL that benefits from an intelligent encoding of an extension of the RDFS (i.e., RDFS with owl:sameAs) ontology elements. StriderR runs in production at a major international water management company to detect anomalies from sensor streams. The system is evaluated along different dimensions and over multiple datasets to emphasize its performance.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 04/Nov/2017
Suggestion:
Major Revision
Review Comment:

The paper presents a system for RDF stream reasoning that exploits Apache
Kafka and Apache Spark to perform distributed computations in cluster/cloud
environments. Specifically, the paper focuses on reasoning techniques for a
subset of RDFS + owl:sameAs.

The topic is interesting and relevant for the journal, but in my opinion the
current version of the paper does not include enough details and comparison
with the state of the art to judge its novelty and contribution. I suggest
that the authors prepare a major revision of the paper to address the
limitations discussed below.

First of all, it is not clear to me which are the novel contributions of the
paper. I understand that the original Strider system does not include any form
of reasoning, while the presented StriderR includes RDFS + owl:sameAs
reasoning. However, the reasoning techniques seem to be inherited from the
LiteMat reasoner. Is this the case? If so, how do the authors changed the
LiteMat algorithms to adapt them to StriderR? Which are the most important
aspects that a reader can learn from this adaptation?

Also, when discussing the algorithms implemented in StriderR, it is not clear
where the "standard" approaches come from. Are they used in state-of-the-art
systems? Have they been proposed in the literature? Or do they represent a
naive approach that the authors use for explanation. As an example, take
Section 5.1. Which systems (if any) adopt the "standard" rewriting approach? I
suggest that the author better position their work with respect to existing
systems, discussing (here and in the background/related work) which are the
alternative approaches to perform the kind of reasoning they are considering,
showing their pros and cons. If the "standard" approach is adopted in some
concrete system, then it would be better to reference it.

Concerning reasoning for owl:sameAs, if I understand correctly the RB
approach, I would argue that it does not output complete results. This makes
the comparison with the SAM approach unfair, since they are producing
different results (and only the second one correctly reports all the
results). If this is the case, I suggest that the authors provide at least
some evidence that the RB approach can indeed be useful in practice, even if
it does not return all the results that derive from the owl:sameAs reasoning.

Section 6.3.4 needs to be discussed in more details. While I understand the
problem, the solution is not presented in enough details and it is hard to
understand how it can guarantee completeness. I suggest that the authors
include a precise algorithm to complete the example-driven description. This
is true also for all the algorithms presented in Section 5 and Section 6.

The evaluation lacks several details that are necessary to understand the
presented results. First of all, it is not clear to me how Spark partitions
the triples to perform the computation in parallel in the first place. This
should probably be discussed at the beginning of the paper, giving an
intuition of how the algorithms map to the Spark architecture. Where is the
background knowledge stored? On the disk? In memory? In every node or
partitioned?

Concerning the latency and thorughput, it is not clear to me how these metrics
are evaluated. How many triples enter the system before a window closes? If a
window closes every 1 million triples, then all these triples trigger a single
computation, and this makes it much easier to reach a thoughput of 1 million
triples per second (I would also argue that this result would not be
impressive). Conversely, if one computation starts every few triples, than
achieving a high throughput would be much more difficult.

Also, the graphs seem to be small enough to fit the memory of a single
computer. If elements are encoded using 32bit numbers, than each triple can be
represented with 12 bytes, which enables for storing billions of triples in
main memory. This would enable a direct comparison with existing RSP
approaches and RDF triples stores, which is an important element missing in
the current paper. It should also enable to test scalability (moving from 1 to
all 11 nodes), thus proving that the algorithm indeed benefits from the
presence of additional computers, and so is "scalable" as claimed in the
title.

Concerning the comparison with existing systems, the authors could compare
with RDFox (for instance) on datasets that fit in memory. RDFox includes
incremental reasoning techniques, which could be used to evaluate queries as
the set of triples changes over time (i.e., at different window
evaluations). In the case of RDF stream reasoner, I understand that the
comparison is more difficult, since most of them do not support reasoning.

Finally, the paper contains several syntax errors and non idiomatic
sentences. Because of this, some parts of the paper are difficult to read and
follow. I suggest that the authors get editing help from someone with full
professional proficiency in English.

To conclude, the paper considers an important topic, but the current
presentation makes it difficult to judge the contributions and put them in the
context of other work in the area. I suggest that the authors try to solve
these issues and resubmit their work.

Review #2
Anonymous submitted on 21/Dec/2017
Suggestion:
Major Revision
Review Comment:

The paper proposes a system to perform stream reasoning on the cloud. The idea
is to extend an existing stream processing system (Strider) with reasoning
capabilities. The discussion is carried on from an engineering perspective,
giving space for a description of the various components of the system and how
the computation takes place within the system. Unfortunately, the paper does
not describe sufficiently well what kind of reasoning is performed, not it
analyses its computational property. In my opinion, this is a major limitation
of this paper that prevents a successful replication of the work. Below, I
elaborate more on this point and other issues.

- On page 3, it is mentioned that the system implements \rhopdf with support
for owl:sameAs but it is not clear which restrictions are imposed on
the stream. For instance, on page 10 left column the authors mention that
sameAs triples occur in the static KB. But do they also occur in the stream?
If they are, then rules might need to be rewritten. Moreover, it seems that
TBox knowledge can only appear in the static KB, and never in the stream. But
what if the TBox is hidden within the stream? For instance, you could have a
TBox triple like and then in
the stream have the triple . In this case, leads to the
derivation of new knowledge which will not be included.

- The authors claim that the system performs complete reasoning (page 7, left
column) but the notion of completeness is left unspecified. My feeling is
that the authors use the notion of completeness to describe the feature that
the system returns all answers that can be inferred. However, there should be
a definition of what is the set of all answers, and which subset is being
returned. Does this subset include all expansions that can be obtained using
the sameAs links?

- The technique used to deal with sameAs derivations (i.e., use a
representative item) is not novel and already implemented in several systems.
The related work mentions that indeed this is already implemented in RDFOfx,
but I still do not understand what the difference is because sameAs triples
(apparently) do not appear in the stream, so the computation of the sameAs
cliques is an offline process, exactly like RDFOx.

- It seems that the system does not remember any previous inference, and that
everything is recomputed at each time point. This is very inefficient, and
there are quite some works which study the problem of incremental reasoning
(most of them are cited in this work). The authors should motivate whether
they also implement some form of incremental processing, or, in case they do
not support it, what's the reason for ignoring it.

- The system is evaluated only using the LUBM ontology. Other datasets should
be included in order to present a more robust evaluation. For instance, the
authors could consider DBPedia, or enrich data produced by other benchmark
systems with some additional knowledge in order to trigger more complex
reasoning.

My opinion is that this work should be significantly restructured in order to
become valuable to the community. With the current description, it is not
possible to understand what is the output of the system, nor the restrictions
that must be enforced on the stream. Thus, it is impossible (or at least very
hard) to compare this work with other existing solutions. In case the authors
have the chance to improve this paper, then I would like to find in the new
version a formal description of the supported rules, the restrictions on the
tuples that can appear in the input, and more detailed discussion over
completeness. Also, the evaluation certainly needs to be expanded with other
datasets in order to provide a clear picture of the performance the system with
data that triggers a more complex reasoning than LUBM.

Review #3
Anonymous submitted on 28/May/2018
Suggestion:
Major Revision
Review Comment:

The article presents StriderR a stream reasoning engine. StriderR builds on top of Strider, an existing RDF Stream Processor with query capabilities, and it adds features to perform RDFS and sameAs reasoning.

The main concerns I have about this submission are related to its originality:
- This manuscript is heavily based on a previous article of the same authors [1]. Such article is not cited in the submitted manuscript, but most of the contributions were initially presented there. Authors must explain in the introduction what is novel w.r.t. their previously published articles.
- The encoding of StriderR uses LiteMat, but it is not really clear if it is extended for the stream reasoning setting: Section 5.2.2 states that when the resource is not known since the beginning, it is not encoded. Thinking to a typical stream processing scenario, one would expect that only a minimal portion of symbols are known since the beginning, while most of them would appear online. It would have been interesting to study the trade-off between the encodings through experiments.
- The idea of using dictionaries to manage sameAs is a well-known technique implemented in several reasoners.

The novel part added w.r.t. [1] is the comparison of the RB and SAM approaches to manage the sameAs links. I have two concerns on this part of the article:
- what is the expected "correct" answer? Looking at the example of Section 6.3.1, shouldn't the the answer of Q be {?x=pDoc1, ?y=mary@gmail.com}, {?x=pDoc2, ?y=mary@gmail.com} and {?x=pDoc3, ?y=mary@gmail.com}? If the answer is only {?x=pDoc2, ?y=mary@gmail.com}, the query result is probably not complete. The article is lacking a formal notion of completeness, which makes hard to understand what the system should answer.
- the motivation of behind SAM. If "Mary" is a property of pDoc2, why should {?x=pDoc3, ?y=mary@gmail.com} be better than {?x=pDoc2, ?y=mary@gmail.com}?

Moreover, the article lacks a section describing the assumptions made. They are all around the text, e.g.,
- the stream carries only ABox axioms, but later it adds that such stream cannot carry axioms with owl:sameAs as a predicate.
- the individuals involved in owl:sameAs relations have to be known from the beginning (to process the sameAs relations and to realise the encoding)
Adding a section explaining all of these assumptions would improve the readability of the article.

Without addressing the problem listed above, it is hard to have a clear view of the manuscript and judge the significance of the results.

The paper is easy to follow, but the language can be improved (it could be more formal and less subjective). On this regard:
- the formalisation of SPARQL seems to be wrong - you should define recursively the notion of graph pattern and not the one of triple pattern;
- the main reasoning approaches listed in the paper are forward chaining and query rewriting. I am surprised that backward chaining is not mentioned (in this specific case, where there is a query involved, it would be a natural approach);
- authors mention the adaptivity of StriderR at the end of Section 3, but this is not described in the article;
- the algorithm in Section 5.2 should be explained more formally - a listing would be better than explaining it in the text with some examples. Moreover, it is not clear how you manage cyclic TBoxes and multiple inheritances

[1] Xiangnan Ren, Olivier Curé, Hubert Naacke, Jérémy Lhez, Li Ke:
StriderR: Massive and distributed RDF graph stream reasoning. BigData 2017: 3358-3367