Evaluating Systems and Benchmarks for Archiving Evolving Linked Datasets

Tracking #: 1605-2817

Irini Fundulaki
Vassilis Papakonstantinou
Yannis Roussakis
Giorgos Flouris
Kostas Stefanidis

Responsible editor: 
Ruben Verborgh

Submission type: 
Full Paper
As dynamicity is an indispensable part of Linked Data, which are constantly evolving at both schema and instance level, there is a clear need for archiving systems that are able to support the efficient storage and querying of such data. The purpose of this paper is to provide a framework for systematically studying the state-of-the-art RDF archiving systems and the different types of queries that such systems should support. Specifically, we describe the strategies that archiving systems follow for storing multiple versions of a dataset, and detail the characteristics of the archiving benchmarks. Moreover, we evaluate the archiving systems, and present results regarding their performance. Finally, we highlight difficulties and open issues that arose during experimentation in order to serve as a springboard for researchers in the Linked Data community.
Major Revision

Solicited Reviews:
Review #1
By Miel Vander Sande submitted on 13/Apr/2017
Major Revision
Review Comment:

This paper evaluates current RDF versioning systems using the state of the art in RDF versioning benchmarks. It is well written and nicely structured. However, its contents are insufficiently strong for publication, so I recommend a Major Revision.
The weakest point of this paper is the relevance of the experiments. In total, two benchmarks are tested against two systems. Although this is not the authors' fault -- this is simply the current state of the art -- one could question whether these experiments make sense at this time. Also, the two systems are hard to compare, as their use cases are different. R43ples aims at version management, i.e., multiple parties manipulating data, while Tailr aims at archiving Linked Data resources, i.e., retrieving prior versions of a resource state. As a result, their storage approaches are completely different, as well as their query interfaces (as noted by the authors).

All in all, is there actually something that can be concluded? I don't believe so: neither for the benchmarks nor for the systems. A theoretical and feature evaluation would have been a better fit to the contents of the paper. What approach is interesting for which case and why? What scenarios will cause problems? What can we learn for future directions?

There are some issues with the argumentation in sec 2.1.1. I wouldn't say that Full Materialization has no processing cost. The large data volumes that come with snapshots are a handful, especially when they need to be indexed for query execution. At the very least, they have a serialization cost. The complexity may be lower than that of a delta-based approach, but it is still there. Also, there is no guarantee that the space overhead is a bigger problem than with delta-based storage. Fully materialized versions can be aggressively compressed and, when many triples are deleted, could actually result in a smaller size. Although these scenarios might be uncommon, these claims are still incorrect.
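
The space argument above can be illustrated with a toy sketch. It is not taken from the paper or from any of the evaluated systems: the cost model (counting stored triples rather than bytes, ignoring compression and indexes), the function names and the synthetic versions are all assumptions made for illustration only.

```python
# Toy cost model (an assumption, not the paper's): cost = number of triples stored.

def snapshot_cost(versions):
    """Full materialization: every version is stored as a complete snapshot."""
    return sum(len(v) for v in versions)

def delta_cost(versions):
    """Delta-based: first snapshot plus the added and deleted triples per step."""
    cost = len(versions[0])
    for prev, curr in zip(versions, versions[1:]):
        cost += len(curr - prev) + len(prev - curr)
    return cost

# Hypothetical base version with 1000 triples.
v0 = {(f"s{i}", "p", "o") for i in range(1000)}

# Case 1: the next version deletes 900 of the 1000 triples.
mass_delete = {(f"s{i}", "p", "o") for i in range(100)}

# Case 2: the next version only adds 10 triples.
small_insert = v0 | {(f"new{i}", "p", "o") for i in range(10)}

print(snapshot_cost([v0, mass_delete]), delta_cost([v0, mass_delete]))
print(snapshot_cost([v0, small_insert]), delta_cost([v0, small_insert]))
```

Under this simplified model, the snapshots are cheaper in the mass-deletion case and the deltas are cheaper in the small-insertion case, which is why a blanket claim about the space overhead does not hold.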

Some minor issues:
- I don't think 'annotated triples' is the best name, as deltas can also be stored in an annotated fashion.
- Sec 2.2 could use some rephrasing and should be written more in line with Figure 1.
- In Table 1, R&Wbase uses the Annotated strategy, but in 3.5, it says 'Hybrid'.
- Section 4.1 states a few requirements on benchmarking, but I'm missing argumentation and references.
- It's unfair to require benchmarks to adhere to all the requirements set by the authors, e.g., sec 4.1. A benchmark can be use case specific, as long as it is representative of that use case and that is clearly communicated. In fact, I'm not sure whether it is possible to create a high-quality, fully generic benchmark.
- There’s a typo in 2.1: approac*h*es

Review #2
Anonymous submitted on 17/Apr/2017
Review Comment:

The authors tackle the challenge of studying the state-of-the-art RDF archiving systems. To do so, they first describe potential archiving strategies and query retrieval features. Then, they review current archiving frameworks. Finally, they make use of a prototypical benchmark for archives (EvoGen) to provide an evaluation of two particular systems (TailR and R43ples). Results show that (i) EvoGen is not reliable enough as a benchmark and (ii) the evaluated systems do not scale or do not provide complex retrieval functionality, hence they point to the lack of maturity of the area.

Although the paper is certainly timely, in my honest opinion, the contribution and novelty of the paper are rather marginal. First, the review of the strategies and queries is not novel; both are sufficiently covered in references 10, 11 and 31 in the paper, and in the important missing reference:
“Y. Tzitzikas, Y. Theoharis, and D. Andreou. On Storage Policies for Semantic Web Repositories That Support Versioning. In Proc. of ESWC, pp. 705–719. 2008”

The review of current archiving systems may complement the review of other related works (e.g. 11), but it is far from being exhaustive. On the one hand, the review disregards the authors' categorization of queries and does not indicate which queries are then supported. On the other hand, it misses at least four related systems:

- Shi Gao, Jiaqi Gu, Carlo Zaniolo. RDF-TX: A fast, user-friendly system for querying the history of RDF knowledge bases. In Proc. of EDBT, 2016
- Ana Cerdeira-Pena, Antonio Farina, Javier Fernandez, and Miguel A Martínez-Prieto. Self-indexing RDF archives. In Proc. of DCC, 2016
- Frommhold, Marvin, et al. Towards Versioning of Arbitrary RDF Data. Proceedings of the 12th International Conference on Semantic Systems. ACM, 2016.
- I. Dong-Hyuk, L. Sang-Won, and K. Hyoung-Joo. A Version Management Framework for RDF Triple Stores. Int. J. Softw. Eng. Know., 22(1):85–106, 2012

Finally, although it is interesting that the authors reflect on the difficulties of running current archives (besides the missing works listed above), the evaluation of the archiving systems is too narrow and does not help to gain further insights for future developments. Note that the decision to use EvoGen, a very initial and prototypical benchmark, (i) is very arguable and may lead to unreliability (as pointed out by the authors), and (ii) is not justified in terms of scale, as the authors only test a very limited number of versions and data sizes.

Other remarks:

- [35] is categorized as annotated triples and hybrid strategy.
- Figure 1 could be improved by moving “Type” to the right side.
- The review of the systems in Section 3 is unbalanced: some systems are evaluated in too much detail and some notes on the evaluation and performance are given, whereas for others the authors omit the performance (such as x-RDF-3X) or the strategy (e.g. Dydra which, by the way, I believe is not fully materialized as stated in Table 1).
- Is blank node enrichment similar to skolemization?
- References (and expanded meaning) of LDF and LDP should be provided.
- Are the principles in the beginning of Section 4 standard? A minimum review of related benchmarks both in RDF stores and other areas (e.g. databases) would complement the work.
- The number and position of tables does not help in the comparison of the systems.
- What is the raw size (in bytes) of the corpus?
- The beginning of section 5.3 suggests that TailR is not evaluated anywhere in the section, but it is in fact tested in 5.3.3.
- Although TailR does not support full querying, simple subject lookups could be tested.
- The “diachron” queries are introduced in the evaluation but they are not mentioned in the categorization of queries.
- The discussion of results is a bit naïve. When the authors state that the important factor for query performance is the shape of the query, it seems they actually mean the number of intermediate results, which is an already well-known factor in SPARQL benchmarking.

Review #3
Anonymous submitted on 17/Jul/2017
Major Revision
Review Comment:

The submitted manuscript aims to provide a framework for systematically studying the state-of-the-art RDF archiving systems and different types of queries.
The paper provides a comprehensive survey of the scenario, state-of-the-art systems and available benchmark efforts, but falls short of providing the advertised framework.

The main recommendation for improving this manuscript is that the authors should make the main contribution of the paper clearer. Is it the comprehensive survey with a brief analysis of the availability of systems and benchmarks, or is it a framework? If it is the latter, the authors should provide their framework.
Also, the evaluation should be discussed in more detail, rather than presenting the numbers and leaving the interpretation to the reader.
The paper currently does not “[…] provide the first complete evaluation of existing archiving systems using existing archiving benchmarks.”.

Hence my verdict of a major revision. The manuscript contains much useful information, but it needs to be shaped and the contribution should be stated very clearly.

The strong point of the manuscript is the detailed review of existing benchmarks, archiving strategies and systems (Sections 2, 3 and 4).

The main criticisms of the manuscript relate to the motivation and execution of the evaluation.
Reading the manuscript, I was expecting a large-scale benchmark of many systems using a large dataset with many versions.
The manuscript states “We evaluate the archiving systems using existing benchmarks and we report on the different identified technical difficulties as well as systems’ performance“.

Unfortunately, the evaluation uses only one benchmark and two systems.
Also, setting the memory for R43ples to 64 and imposing no limitation for the other system is not a fair comparison.
This might also explain why the experiment with 10M triples could not be conducted.
Also, the two engines are not really compared, since TailR does not provide certain query functionalities.

There are also a couple of statements which are not really backed up by the data:

*) The authors also state “This can be seen clearly in Table 10, where the number of triples in version v4 are more than those in version v3 ”.
However, Table 10 does not contain the number of triples, but rather the size and size increase.

*) Also, the import time for the 5M dataset does not differ by much (R43ples requires 1573K ms while TailR requires 1670K ms, summing up all duration times).

Tables 11 and 12:
The authors also fail to discuss the difference in the result sets between the two engines.

p14: there is a formatting error on the bottom left -> line spacing too big