Real-time Generation of Linked Sensor Data and Multidimensional Data Cubes for Smart Environments

Tracking #: 914-2125

Authors: 
Muntazir Mehdi
Ratnesh Sahay
Wassim Derguech
Weiping Qu
Stefan Deßloch
Edward Curry

Responsible editor: 
Guest Editors Smart Cities 2014

Submission type: 
Full Paper
Abstract: 
Events represent a record of an activity in the system, are logged as soon as they happen, and are chronologically independent. In most Smart Environment settings, events usually refer to sensor data. The dynamicity of sensor data sources and the publishing of real-time sensor data over a generalised infrastructure like the Web pose a new set of integration challenges. Semantic Sensor Networks demand excessive expressivity for efficient formal analysis of sensor data. Topics like data warehousing, event processing, and decision making are very well established in research and industry. However, with the frequently changing nature of data models, the challenge is to deal with data-model-specific processing. With the inception of ideas like the Internet of Things (IoT) and the Web of Things (WoT), research today deems it important to invest effort in processing sensor or event data. In this work, we present a methodology to deal with sensor data using Semantic Web technologies, in real-time, and for smart environments is proposed. The proposed methodology addresses two different problems: 1) Collection of sensor data, transformation, meta-data enrichment of sensed data and publishing to the Linked Open Data (LOD) Cloud, and 2) Adopting data-model-specific or context-specific properties in the automatic generation of multidimensional data cubes and publishing to the LOD Cloud. This article also details an extensive evaluation and analysis of the results obtained during the study. The results are compared with the state-of-the-art W3C-recommended RDF Data Cube Vocabulary.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Sven Schade submitted on 18/Dec/2014
Suggestion:
Reject
Review Comment:

The authors propose an integrated solution for real-time event processing following the Linked Data paradigm. They suggest using data cubes for associating contextual information with measurements from sensor networks, and developed a specific cube vocabulary as an alternative to W3C's RDF Data Cube. Initial tests in the area of smart buildings favor an Event-Data warehouse based on the proposed solution (in terms of performance). This is obviously a demanding work, and the authors certainly put considerable effort into the software development that was required to realize the proposed solution.

However, in its current form, the paper faces four fundamental drawbacks (listed below), so that I can only recommend rejecting this contribution. Given the substantial development work that went into it, I still encourage the authors to consider resubmitting this work after thorough changes to its presentation.

1) The authors assume that RDF and data cubes are the ideal solution to the challenges of real-time event processing. The evaluation only compares alternatives at this level, rather than comparing the proposed approach with different architectural solutions, such as the use of existing standards from, e.g., the Open Geospatial Consortium (OGC), or emerging trends in the area of Big Data (e.g. Kafka and Storm). In this sense, they take the second step before the first. When applying novel technologies to a problem, it has to be shown that existing approaches (i) cannot solve the problem, (ii) are not as elegant as the proposed solution, or (iii) are not as simple as the proposed system.

2) The style of writing clearly takes an engineering perspective (especially in sections 1 and 3), which greatly limits the potentially interested audience of the article. Instead of starting from events in a system, the paper would strongly benefit from a clear introduction to the user challenges (handling real-world events that can now be detected thanks to novel sensor technologies and the mobile internet) and the associated technical problems (managing data streams, integrating multiple data flows, understanding context, etc.).

3) Related to the above, the current text mixes elements of the conceptual setting, the implementation details and the example. Readability could be greatly improved if these aspects were clearly separated, e.g. in a structure such as: motivation, problem description and user challenges, possible use cases and the example, background and related work, technical requirements, method, proof-of-concept implementation, discussion, conclusion.

4) The presented solution seems highly specific to the applied use case. A discussion of interoperability and generalizability is completely missing. Instead the focus is put on technical feasibility and the comparison between the Event-Data warehouse and the pure W3C solution. The potential for scalability might be addressed, too.

Additional comments:
- In 1.1.1, I did not get the intended meaning of the sentences “In this case, the closed data, from different data providers can be easily combined using linked data techniques. Such a scenario is depicted in Figure 1 …”. Why closed data?
- The second paragraph in the right column on page 3 (“The approach presented…”) summarizes the core of this work. It should appear more prominently.
- The introduction mentions too many concepts at once. It would be helpful if these were introduced more carefully and step by step, e.g. by restructuring the paper as suggested under (2) above.
- Parts of the background, especially in section 2.2, are too long. An SWJ paper does not need to introduce the history of Linked Data etc. Instead, each of the sub-sections could finish with a sentence that explains the relevance of this particular concept to the work presented in this paper. In addition, the related work (currently in section 7.1) could be moved here.
- The separation (if any) between events and observations/measurements should be clarified early in the paper. The current use of both terms is confusing, especially when considering the last part of page 13 “This event, after transformation is now called an observation with RDF:Type ‘Observation’”.
- Section 3.1.1: “In this article, the terms “sensed data stream”, “sensor data” and “event-data” are used interchangeably.” The overall readability of the article could be improved if the authors would decide for one of these terms, and use it consistently throughout the text.
- Last sentences of Section 3: “A separate set of techniques, proposed in [31] (to extract keywords or query terms) and [32] (to discover relevant SPARQL endpoints or LOD datasets) can be combined together to discover LOD datasets and link with using Silk or LIMES.” If there is a proof for this, it should be clearly referenced.
- The ontological model looks like a data model that is encoded in RDF, instead of really being ontology-based. It would be more honest to talk about a vocabulary.
- Although it seems central to this work, the authors’ differentiation between graphs and cubes remains unclear to me. I would appreciate clear discussions and justifications for this choice. (Notably, I might be biased here, because I usually think of Geospatial Data Cubes and array databases.)

Review #2
By Christoph Stasch submitted on 26/Dec/2014
Suggestion:
Major Revision
Review Comment:

This paper presents a methodology to collect, transform and publish sensor data as linked data and to generate multidimensional data cubes from it in order to deal with heterogeneous data management. The approach is developed in a real-world scenario where a building is equipped with different sensors measuring, e.g., power consumption or temperature. In general, I’d suggest a major revision of the paper for the following reasons:

Overall, the paper is well written, though there are a few minor errors that need to be corrected (e.g., „In this work, we present a methodology ... is proposed.“ in the abstract; either remove „we present“ or „is proposed“). The structure of the paper needs to be improved: I’d suggest shortening the motivation section (1.1) and moving it to the beginning of the introduction section. Also, related work (section 7.1) is not part of the conclusions of the authors, so it should be moved to another section. I’d suggest integrating the related work in section 2. In sections 3, 4 and 5, implementation details from the use case and examples are mixed up with rather abstract methods and concepts. While I agree that providing examples is always useful, I think that the readability of the paper would be improved if there were a clear section about the real-world use case and the implementation of the proposed methodology, and if sections 3, 4 and 5 described the methodology without implementation details.

The paper aims to address two issues: 1. publishing sensor data as linked data and 2. proposing a method for automatically generating data cubes from the sensor data. The novelty of the approach addressing the first issue is quite limited. As the authors correctly state, there has been much previous work on publishing sensor data as linked data, and the proposed approach, which appears to be rather an implementation/engineering issue, does not add new results to the existing methods. Addressing the second issue is novel and is the most interesting part of the paper. However, as it is written in the paper, there is „... a lack of vocabularies that bridge the W3C SSN ontology and W3C Data Cube vocabulary...“ (page 10, left column). Unfortunately, the authors propose an ontology based on the concepts of the W3C Data Cube Vocabulary, but do not show how to map to concepts of the SSN ontology. In general, I acknowledge that the two issues of publishing sensor data as LOD and generating data cubes are linked, but I would still propose to remove (or shorten) the description of publishing the sensor data as linked data, and to elaborate more on the issue of generating the data cubes, especially with a focus on combining the SSN and the Data Cube vocabularies.

After reading the paper, I was rather confused about the notion of an event and of an observation. In the abstract, it is stated that an event is „a record of an activity in the system“ (p.1, abstract) and that events are „chronologically“ independent. The definition of an event is then extended in the introduction section to „a record of an activity or a result of a particular function or of a business process“ (p.1, left column).
Also, it is stated that an event „... is chronologically ordered.“ (p.1, right column). A single event cannot be chronologically ordered, it just happens in time. Several events may be considered to be chronologically ordered. Also, I doubt that a meteorologist would agree that a record of a temperature value is a weather event, as described on page 8. According to the authors, „..., an observation symbolizes one instance of an event-data value.“ (p.10, right column). This is somewhat different from the definition in the SSN ontology, which largely follows the definition of the Observations & Measurements standard: „An Observation is a Situation in which a Sensing method has been used to estimate or calculate a value of a Property of a FeatureOfInterest. Links to Sensing and Sensor describe what made the Observation and how; links to Property and Feature detail what was sensed; the result is the output of a Sensor; other metadata details times etc.“ (Source: http://www.w3.org/2005/Incubator/ssn/ssnx/ssn#Section_Observation). The paper needs to illustrate how the concepts of the SSN ontology can be mapped to the concepts defined in the Data Cube vocabulary and needs to provide clear definitions of events, observations, measures, etc. For a good discussion of objects, processes and events, I’d like to point to Antony Galton’s work available at http://iospress.metapress.com/content/y45h6g86488v0884/. For definitions of event, observation and observed property, the SSN ontology and the O&M standard can serve as a basis. Despite this issue, the description of the background and foundations in section 2 is well written and does provide the information needed to understand the approach.
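
To make the missing bridge concrete: a minimal sketch (hypothetical URIs, written with the Python rdflib library; this is an illustration, not the authors' implementation) of how a single sensor reading could be typed both as an SSN observation and as a Data Cube observation, which is the kind of mapping the paper should spell out:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

SSN = Namespace("http://purl.oclc.org/NET/ssnx/ssn#")   # SSN (SSNX) ontology
QB  = Namespace("http://purl.org/linked-data/cube#")    # W3C RDF Data Cube vocabulary
EX  = Namespace("http://example.org/smartbuilding/")    # hypothetical namespace

g = Graph()
obs = EX["observation/42"]

# SSN view: what was sensed and by which sensor
g.add((obs, RDF.type, SSN.Observation))
g.add((obs, SSN.observedProperty, EX.powerConsumption))
g.add((obs, SSN.observedBy, EX["sensor/room-101"]))

# Data Cube view: the same record as a cell of a dataset,
# with an explicit time dimension and a measure value
g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, EX["cube/power-hourly"]))
g.add((obs, EX.refTime, Literal("2014-12-01T10:00:00", datatype=XSD.dateTime)))
g.add((obs, EX.powerConsumption, Literal(3.2, datatype=XSD.decimal)))

print(g.serialize(format="turtle"))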

The methodology described in section 3 for publishing the sensor data as linked data is straightforward and sound, though it does not provide many new insights, except for the event enricher. The enricher uses a sensor meta-data knowledge base to enrich the sensor data with, e.g., consumerType, consumer and consumer location. It remains unclear to the reader how this information is generated in the knowledge base. Also, the EventType seems to be determined by the type of measure. As such, is it really needed for the EventType registration (Section 4), or is it enough to state the type of measure?
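
For clarity, the kind of lookup such an enricher performs could be as simple as the following sketch (hypothetical sensor identifiers and metadata; not the authors' implementation), which is exactly why it should be stated where this metadata comes from:

# Hypothetical sensor knowledge base keyed by sensor id
SENSOR_KB = {
    "sensor/room-101": {
        "consumerType": "HVAC",
        "consumer": "AirConditioner-3",
        "consumerLocation": "Building-A/Floor-1/Room-101",
    },
}

def enrich(event: dict) -> dict:
    """Attach consumer meta-data to a raw sensor event, if the sensor is known."""
    meta = SENSOR_KB.get(event.get("sensorId"), {})
    return {**event, **meta}

raw = {"sensorId": "sensor/room-101", "value": 3.2, "timestamp": 1417428000000}
print(enrich(raw))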

The section on event registration and the EDWH is novel and interesting to read. However, as stated before, a proper description of how the concepts introduced in the ontology relate to the SSN ontology is missing. Furthermore, the description is mixed up with implementation details, e.g. that JMS is used for implementing the messaging infrastructure.

The most interesting part is the cube generation. This is done using the metadata provided during the event registration process. Links are provided to the observations forming the cube and to additional information. A general comment: the appropriate aggregation function depends on the measure (observed property). As an example, the pure sum of temperature values does not tell you much, especially if the sampling rates of the temperature sensors differ, while an average temperature (a weighted sum) is useful for comparisons.
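
A small numerical sketch of this point (hypothetical readings given as (value, sampling interval in seconds)):

readings = [(20.0, 60), (21.0, 60), (22.0, 600)]  # the last sensor reports only every 10 minutes

plain_sum = sum(v for v, _ in readings)
weighted_mean = sum(v * w for v, w in readings) / sum(w for _, w in readings)

print(plain_sum)      # 63.0  -- grows with the number of samples, not meaningful as a temperature
print(weighted_mean)  # 21.75 -- duration-weighted mean, comparable across differently sampled sensors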

The Evaluation section is quite comprehensive and evaluates the storage size, the query time, the overall performance for generating cubes and the accuracy of the data cubes. It is probably obvious that queries on aggregated cubes are faster than on the raw data, but the provided numbers may still be of interest for the reader. The evaluation of the overall performance of data cubes depends heavily on the sampling rate with which the sensors are recording events (or observations). Regarding the error evaluation, I doubt that „error“ is the correct term here. It is rather a kind of deviation from the computed aggregates. The „error“ computed here also largely depends on the short-term and long-term temporal variability of the measure (observed property). As it is now, I think it is not very meaningful. In general, I think that the evaluation just focuses on performance and storage size (except for the „error“ evaluation), which does not prove that the issues addressed in the paper are solved. The main goal of the paper is not optimizing performance and storage size, but rather to provide a common methodology for integrating sensor data into the Linked Data cloud and automatically generating data cubes, with the overall goal of easing integration and analysis. Illustrating this using the real-world use case should be achievable in addition to a pure performance and storage size evaluation.

The last section rather lists conclusions (not a single one) and adds an outlook on future work, so I’d suggest changing the title to „Conclusions and Outlook“. The conclusions drawn rely on the performance evaluation, but do not directly relate to the goals stated at the beginning in the introduction. I was expecting that the main goal is to ease data integration and to provide a means to do data-model-independent processing by providing metadata to automate the mapping between different data models. Either the introduction needs to be updated or the conclusions need to relate to the goals stated in the beginning.

To sum up, I think that the idea to develop an approach for automatically generating data cubes from pure sensor data using Semantic Web technologies is novel and interesting. However, the approach itself and also its description in the paper need significant improvements.

Review #3
Anonymous submitted on 18/Jan/2015
Suggestion:
Reject
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

--------------------

The authors have a great idea -- an automated processing workflow that automatically publishes heterogeneous sensor-derived data almost as fast as the data comes off the sensors, both in the SSN ontology model and also (user-designed) data cubes, suited for analysis applications. This is both novel and worthwhile.
Then the problems start. The title is good, but the first two-thirds of the abstract are misleading and disconnected statements without context.

Grammar problems start here and plague the paper, making it sometimes hard to understand what is meant.

The introduction immediately talks about "event processing" and contradicts the abstract about the nature of an "event" (is it "chronologically ordered" or "chronologically independent"?). I don't think that event processing, as defined in the first sentence, is relevant to the paper's work (and clearly the author who wrote that "sensed data stream", "sensor data" and "event-data" are used interchangeably on page 7 agrees with me). JMS middleware is used (sensibly enough) for event-oriented programming, but its significance in the overall problem is minor. Instead, the paper is certainly about streaming heterogeneous sensor data. I cannot see why the paper talks so much about event processing and complex event processing throughout.

The paper has a running example around a smart building – which works well.

There are a lot of poor word choices, overuse of commas, grammatical errors etc., too many for me to report. The authors are advised to do a thorough rework. Figure 1 adds no value. Section 2 is much too introductory for this journal – 2.1 should be removed, 2.2 is much too long and mostly irrelevant, and the figures in Figure 3 are lifted and have to be properly attributed ([38] is possibly OK, but [33] certainly is not; see the LOD diagram website for the right attribution of both).

Sec 2.3 – why does the user “not to worry about domain concepts and restrictions on quality” if they use Semantic Sensor Networks?

3.1.1 what is a “channel” in this context? What is a “sensor” (it seems a single sensor can measure multiple qualities in your example).

3.1.2 It seems weird to convert “raw”, quite readable, date and time character strings to “milliseconds” along the processing chain! Does that not reduce interoperability and require the reference time to be known? Later on in the process you clearly convert again into a different date/time format – where did the base reference time come from this time? Why do this?
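
To illustrate the point about the reference time: an epoch-millisecond value is only interpretable relative to an agreed epoch (the Unix epoch in UTC is assumed below); a sketch of the two conversions, not taken from the paper:

from datetime import datetime, timezone

def iso_to_epoch_ms(iso: str) -> int:
    """ISO 8601 timestamp -> milliseconds since the Unix epoch (UTC)."""
    return int(datetime.fromisoformat(iso).astimezone(timezone.utc).timestamp() * 1000)

def epoch_ms_to_iso(ms: int) -> str:
    """Milliseconds since the Unix epoch -> ISO 8601 timestamp in UTC."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat()

ms = iso_to_epoch_ms("2014-12-01T10:00:00+00:00")
print(ms, epoch_ms_to_iso(ms))  # the round trip only works if both ends agree on the epoch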

3.1.4 “event enricher” – is it using a linked data “knowledge base” for its “meta-data”? It looks like it, but the writing has a strange way of saying so. It says the W3C SSN ontology is used, but it does not look like it – it certainly does not show in Listing 4.

3.1.5 “Event middleware” – say this is JMS and be done with it!

3.2 “proposed methodology” – it seems you have ideas for, but have not implemented, publishing to the LOD cloud. This section should then be deleted (along with all the previous LOD background) – it is enough to say that you propose to publish to the LOD cloud. And remove the claims in the conclusion that you have shown how to do it.

3.2.1 Again, too much well-known stuff here, including some (all the linking technologies) that is irrelevant to this work. You mention two techniques usually used together, and say you (unusually) use only one of them. Why only that one, then? Why even mention the other?

4.1 Is this OWL? What part of it? Or something else? Use the terminology of the modelling language you actually use to describe your model (you are not using the language of OWL). Ref [27] does not speak to a “lack of vocabularies” as you suggest. As it uses the W3C datacube ontology, it certainly does not support your design of a fresh datacube ontology, as is implied. Why is the design of [27], which also puts SSN together with a datacube, not good enough for your use case?

Fig 5a is useless (covered by 5(b)). Provide the ontology of 5(b) online if possible – it helps understanding, since you do not formally describe it in the paper. Use double colons for namespace prefixes in the diagram. It is not clear why you model this data as an ontology at all – why is it not a database schema, as its only use is internal to your software (it seems)?

4.2 The “following rules” are nonsense, as members of D and M are individuals (instances of classes) but members of P are properties. The extended discussion about “intended” and “desired” object URIs is confusing.

Fig 8: do you really mean “rdf:type” to be there? Surely you only want domain properties (you could filter on namespace).
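
For example, the filter could be as simple as the following sketch (hypothetical domain namespace, using rdflib; not the authors' code):

from rdflib import Graph, Namespace

EX = Namespace("http://example.org/smartbuilding/")  # hypothetical domain namespace

def domain_properties(g: Graph, subject):
    """Return only (predicate, object) pairs from the domain namespace, dropping rdf:type etc."""
    return [(p, o) for p, o in g.predicate_objects(subject)
            if str(p).startswith(str(EX))]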

Fig 12 is good to have – but you should have a much more informative screen copy. Otherwise remove it.

Listing 5: several of these edwh: namespace terms were missing from fig 5(b).

Sec 5.1, page 14, right column: you should have explained much earlier that you can only do one-dimensional cubes – this is a major limitation, and stating it earlier would help to clear up confusion about how the approach works (and also why you use different cubes for quarter, hour, day and month).

Sec 5.1: it becomes clear here that you do not use the W3C RDF Data Cube. Why not? You have to justify your independent design! Would your auto-generation not work, for some reason I cannot see?
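
For contrast, in the W3C RDF Data Cube vocabulary a single observation can carry several dimensions at once (e.g. time and location), which is precisely what the one-dimensional cubes give up; a minimal rdflib sketch with hypothetical URIs (not taken from the paper):

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

QB = Namespace("http://purl.org/linked-data/cube#")
EX = Namespace("http://example.org/smartbuilding/")  # hypothetical

g = Graph()
cell = EX["cube/power-hourly/2014-12-01T10/room-101"]
g.add((cell, RDF.type, QB.Observation))
g.add((cell, QB.dataSet, EX["cube/power-hourly"]))
g.add((cell, EX.refPeriod, Literal("2014-12-01T10:00:00", datatype=XSD.dateTime)))  # dimension: time
g.add((cell, EX.room, EX["room/101"]))                                              # dimension: location
g.add((cell, EX.avgPower, Literal(3.2, datatype=XSD.decimal)))                      # measure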

There is a lot of space devoted to evaluation in section 6. Unfortunately the evaluation does not seem to address the thrust of the paper, except perhaps “accuracy”, which tests whether your software does some of what it should do correctly, something like a reverse-engineering test. The part of the third evaluation that looks at the performance of generating a datacube could be worthwhile if done comprehensively (vary some more parameters).
The rest is all about comparing the construction and retrieval speed and size of your datacube design vs the W3C datacube. I find your results rather insignificant. However, if you had written a paper about your design and how it improves in many ways over the RDF datacube, these results might be worth having in an evaluation. As it is, there are too many unexplained assumptions, neither your queries nor the alternative models are adequately explained, and the performance is not much different anyway (nor is there any explanation of how the observed differences might scale, what is causing them, or at least what the pattern of differences is). I can’t really see how the evaluation can be used by the reader of this paper. You do not make any general conclusions about feasible limits on datacube sizes or numbers, or on the number or frequency of sensor readings. On the other hand, it seems that you *have* set up a W3C RDF datacube version of your processing pipeline – an explanation of this would make a much better paper! How did you align the SSN and the datacube?
Why do you assume zero readings are missing data? And why only for some cubes but not others?

Tables 4 and 5 appear out of sequence.

There are a lot of irrelevant references throughout, and lots of unrelated “related work” (e.g. 17, 25, maybe 28 – or else it needs more detail; [2] needs more detail and comparison, as it sounds highly relevant; also [18] needs expansion, as the difference you identify is rather tiny, deserving a much smaller paper) – but these should disappear with a tightening of the paper; there is no need to reference every paper you have read or written. [16] must be updated to the 2014 final version. [9] and [10] are duplicates. Use e.g. {IoT}, {RDF} in titles in BibTeX to preserve upper case when appropriate.

The authors might also be interested in this breaking news: http://www.w3.org/2015/01/spatial, which is also where the SSN and RDF Data Cube ontologies might be combined. If you made your combination available, it might influence the standard.