StarVers - Versioning and Timestamping RDF data by means of RDF-star - An Approach based on Annotated Triples

Tracking #: 3506-4720

This paper is currently under review
Filip Kovacevic
Fajar J. Ekaputra
Tomasz Miksa1
Andreas Rauber

Responsible editor: 
Harald Sack

Submission type: 
Full Paper
In the era of data-driven research, we are confronted with the challenge of preserving datasets utilized in experiments for an extended duration. As a result of unretrievable datasets, research cannot be reproduced nor verified. One of the specific challenges is the preservation of the history of evolving datasets. To tackle this challenge, we nowadays use versioning mechanisms on datasets so that each version or revision can be identified. The retrieval of specific versions usually requires queries to be enhanced with some form of version identifiers in order to target specific snapshots. What these version identifiers look like depends on the versioning policy. In timestamp-based policies we would use timestamps in order to retrieve datasets as they were at a specific point in time. The implementation of such policies again depends on the type of data and more importantly on the database system. In the context of RDF data, researchers have contributed with countless temporal RDF models and RDF versioning systems over the last two decades. Many well-known RDF metadata representation models, such as named graphs, reification, n-ary and singletons have simply been applied to represent temporal metadata. However, only little reserach has been done in this direction with one of the more recent RDF extensions with the capability of representing metadata, namely RDF-star. In this paper, we explore the possibilities of RDF-star as basis of a timestamp-based versioning framework. We show how temporal metadata can be represented by utilizing RDF-star's nested triples and stored within RDF-star stores. Moreover, we develop and showcase timestamp-based SPARQL-star templates that can be used to 1) transform RDF datasets into RDF-star datasets 2) update and thereby evolve RDF dtasets and 3) query RDF subsets of time-specific snapshots. We also explain our utilization of the SPARQL query algebra for the purpose of translating SPARQL queries into SPARQL-star queries and link our python-based API that is capable of automatically generating and executing latter queries. Finally, we evaluate our work with datasets and queries from the BEAR benchmark and two RDF stores, namely, Jena TDB2 and GraphDB. Our results suggest that our solution is preferable if the frequency and manner of dataset evolution are uncertain, implying that it outperforms the baseline approaches in sum. However, in cases where datasets are prone to high change rates between consecutive updates our approach does not outperform the baseline approaches in terms of storage consumption. Moreover, the choice of RDF store is significant as the query performance differs vastly between the evaluated RDF-star stores. Our reproducible evaluation process is available on:
Full PDF Version: 
Under Review