Optimizing Storage of RDF Archives using Bidirectional Delta Chains

Tracking #: 2830-4044

Ruben Taelman
Thibault Mahieu
Martin Vanbrabant
Ruben Verborgh

Responsible editor: 
Aidan Hogan

Submission type: 
Full Paper
Linked Open Datasets on the Web that are published as RDF can evolve over time. There is a need to be able to store such evolving RDF datasets, and query across their versions. Different storage strategies are available for managing such versioned datasets, each being efficient for specific types of versioned queries. In recent work, a hybrid storage strategy has been introduced that combines these different strategies to lead to more efficient query execution for all versioned query types at the cost increased ingestion time. While this trade-off is beneficial in the context of Web querying, it suffers from exponential ingestion times in terms of the number of versions, which becomes problematic for RDF datasets with many versions. As such, there is a need for an improved storage strategy that scales better in terms of ingestion time for many versions. We have designed, implemented, and evaluated a change to the hybrid storage strategy where we make use of a bidirectional delta chain instead of the default unidirectional delta chain. In this article, we introduce a concrete architecture for this change, together with accompanying ingestion and querying algorithms. Experimental results from our implementation show that the ingestion time is significantly reduced. As an additional benefit, this change also leads to lower total storage size and even improved query execution performance in some cases. This work shows that modifying the structure of delta chains within the hybrid storage strategy can be highly beneficial for RDF archives. In future work, other modifications to this delta chain structure deserve to be investigated, to further improve the scalability of ingestion and querying of datasets with many versions.
Full PDF Version: 

Minor Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 19/Jul/2021
Review Comment:

I would like to thank authors for their effort on the reviewed manuscript. Having reviewed the first version, I would like to just acknowledge the following big changes:

- Authors expanded their experiments very significantly, mitigating the criticism on their limited tests on a high number of versions, now reaching 1,299 versions.
- Authors also compare with 6 archive systems from the BEAR benchmark, mostly based on HDT and Jena
- The paper added an extended compared analysis of performance
- Authors clarified several misunderstandings like the qualification of OSTRICH as an IC approach, or the Out-Of-Order Ingestion Approach

Acknowledging all this work, it is still my honest feeling that the contribution of the improvement is limited and the technique has to be further enriched with the suggested future work to be a more practical system. However I would like to also give value to this intermediate steps and would like to support the publication of the paper as is.

Tiny details:

- Italics in Table 4 are difficult to distinguish, I might use bold or a sign (e.g. +) close to the winner.
- typo: "our-of-order" --> out-of-order

Review #2
By Johannes Frey submitted on 24/Aug/2021
Review Comment:

In the revised paper, the authors improved verbosity and clarity of aspects of the presentation, fixed errors, typos and addressed all issues raised by me in the previous review. Furthermore, they improved the Related Work section and extended the evaluation with 4 Jena- and 2 HDT-based versioning strategies as well as a set of detailed plots. While impact/significance of the single bidirectional delta chain approach seems to still leave room for discussion, I see inherent value in the benchmarking and evaluation part of this paper for the scientific community and therefore suggest accepting this revision. I have a few minor comments that could be addressed in the camera ready version.

Further Details:

page 2: “require all preceding deltas to be checked”. I think we misunderstood, based on your comment. I still wonder whether this holds true for aggregated delta chains (OSTRICH), AFAIK no deltas have to be checked when using an aggregated “chain” because the deltas are independent of each other and are not forming an actual “chain”. I think the process is slowed down due to the size of the add and del indexes / data structures. Maybe the word ‘checked’ is confusing.

Table 1: “fullfulls”

page 8: “our-of-order”

Subfig. 2.3 snapshot version label is incorrect

Subfig. 3.1: I do not understand the size of 0 (or very close to 0) for version 5 for COBRA which actually seems to be a snapshot of noticeable size. I would assume it on a level close to version 0 of OSTRICH.

Subfig. 3.2: The cum. size jump between version 0 (46th inserted version) and 46 (47th inserted version) for COBRA is suspicious (if the delta between version 45 (snapshot) and version 46 would be so high, I would expect to see the same gap in COBRA* since both use version 45 as reference snapshot)

Subfig. 3.3: Tearing lines (esp. for OSTRICH) and the clear size *drop* in the middle for COBRA* are counterintuitive and not explained (temporary data structure overhead/cleanup variety?)

Review #3
By Christoph Lange submitted on 30/Aug/2021
Minor Revision
Review Comment:

This is the first revision of the manuscript I am reviewing. Therefore, I am judging only the manuscript itself, not the comments on the revision.

This manuscript presents COBRA, an approach for storing RDF archives in a way that supports several versioned query patterns (introduced in Section 2.1). The authors' previous OSTRICH implemented, which already serves the same purpose, is extended from unidirectional delta chains to bidirectional ones in an attempt to reduce both storage size and ingestion time. This is measured in a number of evaluations using the BEAR benchmark. The hypothesis cannot be confirmed universally, but in a majority of situations.

Section 1 briefly introduces the problem. Section 2 introduces the basics of versioned queries, different storage strategies and discusses related work. Related approaches are classified w.r.t. storage strategy. Section 3 states the problem and phrases the research hypotheses. Section 4 introduces the bidirectional delta chain approach, for the case of subsequently ingesting additional versions during the lifetime of a dataset, and for the case of ingesting all past versions of a dataset at once, the latter of which can be performed out of order. Section 5 presents the results of evaluating OSTRICH and COBRA in different BEAR settings. Section 6 draws conclusions, also providing concise guidance w.r.t. what storage approach to employ in what practical setting. The implementation is open source and accompanied with everything needed to reproduce the experiments. At least it seems so – I did not try. Just the creation of the BEAR input datasets is not exactly reproduceable, as the evaluation scripts assume that the respective data already exists on a server donizetti.labnet.

The following aspects need improvement (cf. the annotated PDF at https://www.dropbox.com/s/bs96w03ab2ikiu2/swj2830.pdf?dl=0 for details):
* Algorithm 2 is introduced as showing the fix-up algorithm. However, that's actually what Algorithm 1 shows.
* Algorithm 2 is said to assume that n is even, but the shown implementation, which uses Math.floor, does not.
* There are multiple references to the OSTRICH article [5]. These references would be easier to use if they pointed to specific individual sections of that article.
* In Section 4.6.2, the text about delta materialization says that "the results from the two queries are sorted". Why are they sorted, and by what? Do you mean that the (unsorted) results of the first query come first, and then the (once more unsorted) results of the second query?
* In Section 4.6.3, Version queries are defined as "results being annotated with the version in which they occur". However, is the version always unique? (That's what this phrasing seems to assume.)
* The setting shown in Subfigure 2.2 does not involve bidirectionality.
* In Section 5.2, why do you restrict your scope to "at most two delta chains"? In other words, does this not mean that you assume that the threshold for a chain that's "too long" is "(number of versions) / 2"? Would your approach not perform even better with more delta chains?
* In Section 5.3.1, what do you mean by "ingesting a raw representation"?
* Regarding Figure 3: Before reading your reminder about the reverse order of ingestion on the next page, it's hard to understand that we have a zero value in the middle. You could facilitate understanding by showing an arrow that indicates the order of ingestion.
* clarity of phrasing (cf. PDF annotations and comments)
* multiple minor linguistic issues (cf. PDF)