Review Comment:
The paper presents a system for RDF stream reasoning that exploits Apache
Kafka and Apache Spark to perform distributed computations in cluster/cloud
environments. Specifically, the paper focuses on reasoning techniques for a
subset of RDFS + owl:sameAs.
The topic is interesting and relevant for the journal, but in my opinion the
current version of the paper does not include enough details and comparison
with the state of the art to judge its novelty and contribution. I suggest
that the authors prepare a major revision of the paper to address the
limitations discussed below.
First of all, it is not clear to me what the novel contributions of the
paper are. I understand that the original Strider system does not include any
form of reasoning, while the presented StriderR includes RDFS + owl:sameAs
reasoning. However, the reasoning techniques seem to be inherited from the
LiteMat reasoner. Is this the case? If so, how did the authors change the
LiteMat algorithms to adapt them to StriderR? What are the most important
aspects that a reader can learn from this adaptation?
Also, when discussing the algorithms implemented in StriderR, it is not clear
where the "standard" approaches come from. Are they used in state-of-the-art
systems? Have they been proposed in the literature? Or do they represent a
naive approach that the authors use only for explanation? As an example, take
Section 5.1: which systems (if any) adopt the "standard" rewriting approach? I
suggest that the authors better position their work with respect to existing
systems, discussing (here and in the background/related work) the alternative
approaches to performing the kind of reasoning they consider, along with
their pros and cons. If the "standard" approach is adopted in some concrete
system, it would be better to reference it.
Concerning reasoning for owl:sameAs, if I understand the RB approach
correctly, I would argue that it does not output complete results. This makes
the comparison with the SAM approach unfair, since the two produce different
results (and only the second one correctly reports all of them). If this is
the case, I suggest that the authors provide at least some evidence that the
RB approach can indeed be useful in practice, even though it does not return
all the results that derive from owl:sameAs reasoning.
Section 6.3.4 needs to be discussed in more detail. While I understand the
problem, the solution is not presented in enough detail, and it is hard to
understand how it can guarantee completeness. I suggest that the authors
include a precise algorithm to complement the example-driven description. The
same holds for all the algorithms presented in Section 5 and Section 6.
The evaluation lacks several details that are necessary to understand the
presented results. First of all, it is not clear to me how Spark partitions
the triples to perform the computation in parallel in the first place. This
should probably be discussed at the beginning of the paper, giving an
intuition of how the algorithms map to the Spark architecture. Where is the
background knowledge stored? On disk? In memory? Replicated on every node,
or partitioned?
Concerning latency and throughput, it is not clear to me how these metrics
are evaluated. How many triples enter the system before a window closes? If a
window closes every 1 million triples, then all these triples trigger a single
computation, which makes it much easier to reach a throughput of 1 million
triples per second (I would also argue that this result would not be
impressive). Conversely, if one computation starts every few triples, then
achieving a high throughput would be much more difficult.
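To make this concern concrete, here is a back-of-envelope calculation (the window sizes are hypothetical, since the paper does not report them) showing how the same nominal throughput can correspond to very different per-second workloads:

```python
# Sanity check: how window size changes the meaning of a
# "1 million triples per second" throughput figure.
# All window sizes below are hypothetical examples.

def computations_per_second(throughput_tps, window_size_triples):
    """Number of window evaluations the system must complete per second."""
    return throughput_tps / window_size_triples

# Same 1M triples/s throughput, two very different workloads:
coarse = computations_per_second(1_000_000, 1_000_000)  # one big batch/s
fine = computations_per_second(1_000_000, 1_000)        # many small windows

print(coarse)  # 1.0  -> a single evaluation per second
print(fine)    # 1000.0 -> a thousand evaluations per second
```

In the first case the reported throughput says little about per-evaluation performance; in the second it would be a much stronger result. The paper should state which regime its measurements correspond to.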
Also, the graphs seem to be small enough to fit in the memory of a single
computer. If elements are encoded using 32-bit numbers, then each triple can
be represented with 12 bytes, which allows storing billions of triples in
main memory. This would enable a direct comparison with existing RSP
approaches and RDF triple stores, which is an important element missing in
the current paper. It would also make it possible to test scalability (moving
from 1 to all 11 nodes), thus proving that the algorithm indeed benefits from
additional computers, and is therefore "scalable" as claimed in the title.
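For reference, the 12-byte estimate follows directly from the encoding (three 32-bit identifiers per triple); the 64 GB node size below is an assumption chosen only for illustration:

```python
# Back-of-envelope memory estimate for dictionary-encoded triples.
# Each triple = subject, predicate, object, each a 32-bit identifier.
# The 64 GB node RAM figure is a hypothetical example.

BYTES_PER_ID = 4                   # one 32-bit identifier
BYTES_PER_TRIPLE = 3 * BYTES_PER_ID

node_ram_bytes = 64 * 1024**3      # assumed 64 GB of RAM on one node
triples_in_ram = node_ram_bytes // BYTES_PER_TRIPLE

print(BYTES_PER_TRIPLE)  # 12
print(triples_in_ram)    # 5726623061, i.e. roughly 5.7 billion triples
```

Even ignoring index and runtime overheads, a single commodity node can hold datasets of the size used in the evaluation, which is why a single-machine baseline seems feasible.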
Concerning the comparison with existing systems, the authors could compare
with RDFox (for instance) on datasets that fit in memory. RDFox includes
incremental reasoning techniques, which could be used to evaluate queries as
the set of triples changes over time (i.e., at different window
evaluations). In the case of RDF stream reasoners, I understand that the
comparison is more difficult, since most of them do not support reasoning.
Finally, the paper contains several syntax errors and non-idiomatic
sentences. Because of this, some parts of the paper are difficult to read and
follow. I suggest that the authors get editing help from someone with full
professional proficiency in English.
To conclude, the paper considers an important topic, but the current
presentation makes it difficult to judge the contributions and put them in the
context of other work in the area. I suggest that the authors try to solve
these issues and resubmit their work.