A Performance Study of RDF Stores for Linked Sensor Data

Tracking #: 2249-3462

Authors: 
Hoan Nguyen
Martin Serrano
Han Nguyen Mau
John Breslin
Danh Le Phuoc

Responsible editor: 
Armin Haller

Submission type: 
Full Paper
Abstract: 
The ever-increasing amount of Internet of Things (IoT) data emanating from sensor and mobile devices is creating new capabilities and unprecedented economic opportunity for individuals, organisations and states. In comparison with traditional data sources, and in combination with other useful information sources, the data generated by sensors also provides a meaningful spatio-temporal context. This spatio-temporal correlation makes the sensor data even more valuable, especially for applications and services in Smart City, Smart Health-Care, Industry 4.0, etc. However, due to the heterogeneity and diversity of these data sources, their potential benefits will not be fully achieved if there are no suitable means to support interlinking and exchanging this kind of information. This challenge can be addressed by adopting the suite of technologies developed in the Semantic Web, such as the Linked Data model and SPARQL. When using these technologies, in an application scenario which requires managing and querying a vast amount of sensor data, the task of selecting a suitable RDF engine that supports spatio-temporal RDF data is crucial. In this paper, we present our empirical studies of applying an RDF store to Linked Sensor Data. We propose an evaluation methodology and metrics that allow us to assess the readiness of an RDF store. An extensive performance comparison of the system-level aspects of a number of well-known RDF engines is also given. The results obtained can help to identify the gaps and shortcomings of current RDF stores and related technologies for managing sensor data, which may be useful to others in their future implementation efforts.

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 21/Sep/2019
Suggestion:
Major Revision
Review Comment:

(1) originality,

This work is largely built upon the authors' previous work in this domain, and here they provide a very detailed analysis of RDF stores for handling and querying large-scale linked data. For the evaluation, a suitably large meteorological dataset has been used.

The experiments and evaluations are well planned and clearly explained, but the rationale behind using an RDF store for time-series and/or sensor data needs stronger justification. This could be a different point of view, but it seems most of the performed tasks could be done much more effectively on a SQL or NoSQL database, with the semantic linking performed in an overlay wrapper layer.

(2) significance of the results

Large-scale RDF stores for the IoT and high-velocity data would be interesting and useful only if there are strong applications for them and if similar functions could not be performed by SQL or NoSQL databases.

Implicit querying, relation analysis, and reasoning over data could be a better justification for using RDF stores for such data, but these are not discussed or experimented with in the paper.

(3) quality of writing.

The paper is well written and the results are well presented and discussed.

Overall, this is a well-written and technically sound paper and provides insightful information about the capabilities of the existing RDF stores.
With regard to the above, the paper is timely and relevant. However, in terms of the key reasons for and applicability of the work to real-world applications (or theoretical advancement in this area), the authors should improve the discussion and explain (or demonstrate experimentally) why the same functionality could not be provided by a NoSQL database and a set of wrapper functions interfacing between the underlying data and the semantic query/processing requirements.

Review #2
By Leslie Sikos submitted on 02/Oct/2019
Suggestion:
Major Revision
Review Comment:

The paper provides a review and performance comparison of the common RDF datastores that can be considered for storing and querying RDF statements that are saved along with spatial, temporal, or spatiotemporal data, apparently with a focus on data derived from sensor networks, which is a relatively novel idea, although it needs more justification.

Storing RDF data with spatial and/or temporal data is somewhat oversimplified in the paper: there is no mention of the challenges associated with extending the standard RDF data model, nor of capturing the semantics of spatial and temporal terms themselves (not only property values, and definitely not as string literals). It is also not mentioned that adding spatiotemporal data to RDF statements can lead to undecidability when reasoning over the knowledge base.

Regarding the statement “the value with a timestamp is indexed as an RDF literal value” (in Virtuoso), it is not explained how the type and quality of the captured data affect machine-processability and machine-interpretability. A well-known limitation of the standard RDF data model is the incapability of capturing metadata and related data, such as provenance and spatial and temporal data, at the statement element level and at the statement level. How do the authors justify using RDF for storing sensor data that requires spatial and temporal data to be stored alongside the statements? Using a formalism formally grounded in description logics for capturing spatiotemporal data is not justified; it should be emphasized that while the benefits of the RDF data model are inherently exploited this way, doing so also introduces some undesirable side effects.
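A minimal Turtle sketch may make the reviewer's distinction concrete (the `ex:` properties are hypothetical, not from the paper): when the timestamp is stored as an ordinary literal in the object position, the time attaches to one property value of the observation resource, not to the RDF statement itself:

```turtle
@prefix ex:  <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Value-level timestamp: the time is just another property value of ex:obs1.
# Nothing here asserts *when* the statement ex:obs1 ex:hasValue "21.5" held.
ex:obs1 ex:hasValue     "21.5"^^xsd:double ;
        ex:hasTimestamp "2019-09-21T10:00:00Z"^^xsd:dateTime .
```

Statement-level temporal validity would require one of the non-standard extensions discussed later in this review.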

No formal definitions are provided for any of the concepts. The various (sometimes proprietary) indices of the RDF triplestores and quadstores discussed in the paper are not explained, and it remains unclear how the context element of RDF quads is used for storing spatial and/or temporal data. When it comes to spoc, posc, and opsc, no explanation is provided. By having quads with a context element, how and where is the semantics defined for the context? How do we know when it is used for temporal and when for spatial data? What happens if we need both spatial and temporal data simultaneously? How is spatiotemporal data stored by the context element? Do the authors consider RDF statement-level spatiotemporal data only, and why? Without explaining this, the significance of the results cannot be assessed.
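To illustrate the ambiguity raised here (spoc, posc, and opsc denote index permutations over subject, predicate, object, and context), consider a hypothetical N-Quads snippet in which the fourth (context) element names a graph; nothing in the data model itself says whether that graph denotes a time, a place, or both:

```nquads
<http://example.org/obs1> <http://example.org/hasValue> "21.5" <http://example.org/g/2019-09-21> .
<http://example.org/obs2> <http://example.org/hasValue> "22.0" <http://example.org/g/stationA> .
```

The first context looks temporal and the second spatial, but that reading rests entirely on naming conventions outside the RDF semantics, which is precisely the gap the paper leaves unexplained.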

As for section 3, another big challenge (beyond the mentioned ones) is to retain decidability when reasoning over metadata-enriched RDF statements. Also, the capturing itself requires non-standard solutions that should be at least backward-compatible with standard RDF triples or quads. It is not mentioned anywhere what the implications of diverging from standards to store spatiotemporal data are, nor is there any alignment with alternatives to RDF reification and n-ary relations, such as extending the standard RDF data model (e.g., RDF+, SPOTL, and RDF*), extending the RDFS semantics (Annotated RDF Schema, G-RDF), using alternate data models (e.g., N3Logic), decomposing RDF graphs (RDF molecule), capturing context with each statement (e.g., named graphs, RDF triple coloring), and using external vocabularies and ontologies (OWL-Time Ontology, 4D-Fluent Ontology, SWRL Temporal Ontology, etc.). These are at different levels of abstraction, and all have their own strengths and weaknesses.
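As a sketch of two of the alternatives named above, the same temporally annotated statement can be expressed via standard RDF reification and via RDF* (the `ex:` names are illustrative):

```turtle
@prefix ex:  <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Standard RDF reification: four extra triples per annotated statement.
ex:stmt1 a rdf:Statement ;
         rdf:subject   ex:sensor1 ;
         rdf:predicate ex:observes ;
         rdf:object    ex:temperature ;
         ex:validAt    "2019-09-21T10:00:00Z"^^xsd:dateTime .

# RDF* (Turtle* syntax): the quoted triple is annotated directly.
<< ex:sensor1 ex:observes ex:temperature >>
         ex:validAt    "2019-09-21T10:00:00Z"^^xsd:dateTime .
```

The reification form is backward-compatible with any standard triple store but inflates the data and burdens querying; the RDF* form is compact but supported only by stores that implement the extension, which is exactly the standards trade-off the paper leaves undiscussed.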

In 5.1, a link to Virtuoso should be added as a footnote, similar to the other tools discussed in the manuscript.

In Table 1, since there is a dedicated license column, it would be more useful to provide actual licenses, rather than indicating a commercial vs. open source license type. The latest release date column does not add to the discussion and will obsolete quickly, so it can be omitted.

There are writing inconsistencies, such as using two versions of the same word in the manuscript (indexes and indices). Between a section heading and a lower-level subsection heading, such as 7 and 7.1 or 8.1 and 8.1.1, there should be at least one sentence. There are some typos, such as “benchmark queries set” instead of “benchmark query set,” “firstly define” instead of “first define,” “data for all over the world” instead of “data from all over the world,” “Virutoso” instead of “Virtuoso,” space before the full stop at the end of a sentence, etc. Some sentences are hard to read and should be reworded (e.g., “This reason is also explained for the poor data loading performance” should be “This can also be the reason for the poor data loading performance” or similar).

Regarding the case study, it is not clear what kind of spatiotemporal data is stored exactly and how. Only querying examples are provided, but a concrete example that would demonstrate the storage of spatiotemporal data-enriched RDF statements for data derived from real-world sensor networks is missing completely.
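For concreteness, the kind of example asked for here might look like the following SOSA-based Turtle sketch (the resource names and coordinate values are hypothetical, not taken from the paper's dataset):

```turtle
@prefix sosa: <http://www.w3.org/ns/sosa/> .
@prefix geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .

# One sensor observation with its temporal context ...
ex:obs42 a sosa:Observation ;
    sosa:madeBySensor    ex:station17 ;
    sosa:hasSimpleResult "21.5"^^xsd:double ;
    sosa:resultTime      "2003-08-01T10:00:00Z"^^xsd:dateTime .

# ... and the spatial context attached to the sensor.
ex:station17 geo:lat  "53.27"^^xsd:double ;
             geo:long "-9.05"^^xsd:double .
```

Showing exactly which of these triples (or which annotated variants of them) each evaluated store ingests would answer the question of what spatiotemporal data is stored and how.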

Apart from a 2018 and a 2019 article, the References section include older papers only. More recent papers should be cited. Under References, everything is lower case, which is incorrect (“rdf” instead of “RDF,” “Dbpedia” instead of “DBpedia,” “sparql” instead of “SPARQL,” etc.). This should be corrected throughout. DOI numbers are missing.

Review #3
By Anila Butt submitted on 21/Oct/2019
Suggestion:
Major Revision
Review Comment:

The paper presents a performance study of RDF stores for Linked Sensor Data. In this study, five RDF stores are evaluated for linked sensor data using eleven benchmark queries. The authors claim to identify the gaps and shortcomings of existing RDF stores with regard to the management of Linked Sensor Data. Although the presented study is relevant and important for the IoT and Semantic Web communities, the paper in its current form is lacking in detail and in positioning of the work with respect to previously conducted studies.

***** Related Work ******
The related work fails to motivate and highlight the gaps in the literature. This section should provide comparisons between existing RDF store performance studies and the one presented by the authors, to position the work and clarify its novelty. In the paper, some studies that the authors consider related work are mentioned [9, 18, 19, 26, 31, 38, 39] but not critically discussed. The only point made to validate the need for another performance study is “there is no existing approach that focusses on the study of readiness of RDF stores as regards linked sensor data” (Page 3, lines 35-37).

However, there is a recently published study on the “performance assessment of RDF graph databases for smart city services” available at (https://www.sciencedirect.com/science/article/pii/S1045926X1730246X). This study (referred to as the existing study later in this review) assesses existing RDF stores for linked sensor data. This work is not only the most relevant but also has considerable overlap with the proposed work. For example, Section 3 (i.e., Fundamental requirements of a processing engine for linked sensor data) repeats Section 2 (i.e., Smart city requirements for RDF stores) of the existing study. Similarly, the proposed benchmark queries are more generic but are a subset of the queries previously presented in the existing study. The evaluation methodology is also inspired by the existing study. The authors have cited this work but have not discussed it as related work.
The authors need to position their work and state their contribution relative to the existing study.

***** Benchmark Design *****
The authors analysed existing RDF stores based on their popularity and maturity, but then others, such as GraphDB and Blazegraph, have been skipped. Secondly, it is not clear why one architecture of an RDF store is preferred over others. For example, the RDF4j native store is evaluated, while previous studies have shown that RDF4j with a third-party storage solution is more efficient than its native store (https://www.researchgate.net/publication/267864377_Scalability_and_Perfo... ).
There is no strong reason given for selecting the benchmark queries. I understand that each query covers some features (page 9, lines 41-49), but multiple queries share the same features; e.g., GroupBy is part of 5 out of 11 queries. In this case, how would one evaluate a store for its ability to execute a particular feature in a query?

********* Results and Discussion **************
Figure 3 shows that data loading performance is evaluated for 2 billion triples. What was the method of capturing the loading time for 500M, 570M, … 2000M triples? For some RDF stores, such as Jena, Strabon and RDF4j, data loading is shown for 1.5B or fewer triples. I assume they failed to load more triples; it would be good to state the reason (e.g., timeout, memory exhaustion, or an error) in the paper. Moreover, the authors might consider evaluating these stores for their incremental-loading functionality, which is a desirable feature for dynamic datasets.
The query performance presented in Figures 6-16 shows that Virtuoso outperforms the other RDF stores for all queries, but I have concerns about results where query execution time remains constant as the dataset size increases. To my understanding, query evaluation time depends not only on the dataset size but also on the number of matching triples (result-set size) in the store. So, I would suggest the authors evaluate a query (say, Q1) with different parameter values and see whether the performance varies across parameters. While data loading performance has been evaluated using >2 billion triples, query performance is evaluated on 200 million triples. I understand that most of the RDF stores fail to run a query on a bigger dataset, but by evaluating queries on a large dataset the authors could verify the claims that RDF store creators make about query performance on large datasets.
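The parameterisation suggested above can be illustrated with a hypothetical benchmark query template (the `ex:` properties are illustrative, not the paper's actual schema): varying the bound sensor or time window changes the result-set size and should therefore change the evaluation time if the store is genuinely matching against the data.

```sparql
PREFIX ex:  <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Template: re-run with different ?sensor bindings and time windows;
# selectivity (and hence expected runtime) varies with each choice.
SELECT ?value ?time
WHERE {
  ?obs ex:observedBy ex:station17 ;   # swap in different stations
       ex:hasValue   ?value ;
       ex:hasTime    ?time .
  FILTER (?time >= "2003-08-01T00:00:00Z"^^xsd:dateTime &&
          ?time <  "2003-09-01T00:00:00Z"^^xsd:dateTime)
}
```

A flat execution-time curve across such variations would suggest caching or plan reuse rather than true scalability.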
The discussion section reads more like a summary of the benchmark design and results. The authors may discuss whether the evaluation on linked sensor data yields different results from previous comparative studies on linked data.

Summarising: overall, the paper lacks focus on the specific contributions of the study and on justifying the benchmark design decisions. My suggestion is a major revision.