Optimizing Storage of RDF Archives using Bidirectional Delta Chains

Tracking #: 2666-3880

Authors: 
Ruben Taelman
Thibault Mahieu
Ruben Verborgh

Responsible editor: 
Aidan Hogan

Submission type: 
Full Paper
Abstract: 
Linked Open Datasets on the Web that are published as RDF can evolve over time. There is a need to be able to store such evolving RDF datasets, and query across their versions. Different storage strategies are available for managing such versioned datasets, each being efficient for specific types of versioned queries. In recent work, a hybrid storage strategy has been introduced that combines these different strategies to lead to more efficient query execution for all versioned query types, at the cost of increased ingestion time. While this trade-off is beneficial in the context of Web querying, it suffers from exponential ingestion times in terms of the number of versions, which becomes problematic for RDF datasets with many versions. As such, there is a need for an improved storage strategy that scales better in terms of ingestion time for many versions. We have designed, implemented, and evaluated a change to the hybrid storage strategy where we make use of a bidirectional delta chain instead of the default unidirectional delta chain. In this article, we introduce a concrete architecture for this change, together with accompanying ingestion and querying algorithms. Experimental results from our implementation show that the ingestion time scales significantly better. As an additional benefit, this change also leads to lower total storage size and improved query execution performance for most cases. This work shows that modifying the structure of delta chains within the hybrid storage strategy can be highly beneficial for RDF archives. In future work, other modifications to this delta chain structure deserve to be investigated, as they may be able to provide additional benefits.
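For context, a minimal sketch of the two chain shapes being compared follows; the types and names are illustrative only, not taken from the OSTRICH/COBRA codebase:

    # Hypothetical sketch of the two delta-chain layouts; names are
    # illustrative, not the actual OSTRICH/COBRA data structures.
    from dataclasses import dataclass, field

    @dataclass
    class Snapshot:
        version: int
        triples: set  # fully materialized version

    @dataclass
    class Delta:
        version: int
        additions: set = field(default_factory=set)
        deletions: set = field(default_factory=set)

    # Unidirectional chain (OSTRICH): one snapshot at version 0,
    # forward aggregated deltas for every later version:
    #   S0 -> D1 -> D2 -> ... -> Dn
    unidirectional = {"snapshot": Snapshot(0, set()),
                      "forward": [Delta(v) for v in range(1, 9)]}

    # Bidirectional chain (COBRA): one snapshot in the middle,
    # reverse deltas before it, forward deltas after it:
    #   D0 <- D1 <- ... <- Sm -> ... -> Dn
    m = 4
    bidirectional = {"snapshot": Snapshot(m, set()),
                     "reverse": [Delta(v) for v in range(0, m)],
                     "forward": [Delta(v) for v in range(m + 1, 9)]}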
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Johannes Frey submitted on 14/Mar/2021
Suggestion:
Minor Revision
Review Comment:

In the full paper titled “Optimizing Storage of RDF Archives using Bidirectional Delta Chains”, the authors describe and evaluate a storage-layout improvement of the RDF archiving system OSTRICH. The improvement uses two aggregated delta chains based on a snapshot in the middle of the version sequence. Apart from minor typos, the language is very good and comprehensible.
The paper is structured well and follows clear research questions. Related work and the problem statement are described at a level that also allows readers not familiar with RDF archiving to easily understand the paper. The storage strategies are evaluated w.r.t. ingestion time, storage size, and different versioning query workloads using subsets of BEAR-A and BEAR-B daily+hourly.
However, some details of the related work and the evaluation raise questions. Moreover, the evaluation does not show whether the problems with OSTRICH that the authors mention starting at version 1,100 are resolved, since the maximum number of versions evaluated is 400. Most importantly, based on what I can see in the paper, I do not agree with the authors' conclusion that COBRA solves the scalability issues of OSTRICH w.r.t. ingestion. The scalability issues seem to be only delayed by a roughly constant factor (around 2); put another way, it seems possible to store twice the number of versions before the same performance barrier is reached.
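To make my factor-of-two argument concrete, here is a toy cost model of my own, under the assumption that per-version ingestion cost grows roughly linearly with the length of the aggregated delta chain being extended:

    # Toy model of the "delayed by a factor of ~2" argument; the linear
    # cost assumption and the numbers are illustrative only.
    BARRIER = 1100  # chain length at which OSTRICH reportedly degrades

    # OSTRICH: a single forward chain rooted at version 0 reaches the
    # barrier after 1100 versions.
    ostrich_versions = BARRIER

    # COBRA: two chains around one middle snapshot; each side can grow
    # to length 1100 before the same per-version cost recurs, so roughly
    # twice as many versions fit before the same barrier is hit.
    cobra_versions = 2 * BARRIER

    print(ostrich_versions, cobra_versions)  # 1100 2200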
I am not an expert in the field and therefore it is hard for me to assess the significance and impact of this work. Therefore, I consider the concerns I have more a matter of presentation that could be fixed in a Minor Revision.

________________
Further Details:

page 1 “velocity” of changes?

page 2 “require all preceding deltas to be checked” at first this seems reasonable, but given the aggregated delta chain this should not be necessary (?), and it remains confusing to me after reading the entire paper. I could imagine the index insert/update is slowing it down.
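To illustrate my point (a sketch of my own, with hypothetical names): with plain deltas, every preceding delta must be replayed, whereas an aggregated chain needs only a single delta per version:

    def materialize_plain(snapshot, deltas, v):
        # Plain deltas: all preceding deltas are checked/replayed.
        triples = set(snapshot)
        for d in deltas[:v]:
            triples = (triples | d["add"]) - d["del"]
        return triples

    def materialize_aggregated(snapshot, agg, v):
        # Aggregated deltas: one delta, relative to the snapshot.
        d = agg[v - 1]
        return (set(snapshot) | d["add"]) - d["del"]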

Table 1
I understand the idea of hybrid characteristics and the intended purpose of Table 1, but in my opinion the presentation with checkmarks is misleading. From what I can see in the paper, OSTRICH does not seem to fulfill the independent copies (IC) criterion. My interpretation of this table is that OSTRICH uses independent copies/snapshots - by definition of IC, for every version - and additional deltas for every version. While the latter is accurate, the former does not seem valid at the current stage. I cannot judge whether TB is strictly fulfilled. I suggest using fully or half-filled circles, or a similar presentation, to reflect this.
I argue that git also does not strictly fulfill IC, due to the garbage collection employed by QuitStore. The underlying git storage layer uses a mixture of snapshots (objects) and pack files (compression and deltas). Moreover, QuitStore is described as fragment-based in [4]. Additionally, the “virtual snapshots” in QuitStore are not independent but can be reused, e.g. if a change of one triple is “rolled back” in the next version (3 versions but only 2 “copies”, because v1 and v3 are equivalent).

p2 “as shown in .”

p3 “make us” -> makes use
“suffer are not”

p4 “would be form”

p7 “by invoking the streaming ingestion algorithm for unidirectional agg….” the input chain is non-aggregated, so why does the streaming algorithm expect aggregated deltas?

Fig. 2: the varying number of deltas can be confusing. I think it would be more intuitive if they had the same number of deltas. I recommend using BEAR-A (versions) as a concrete and consistent example.

p9 “stored out of order”. Please describe the order more clearly. It would help to label the deltas and snapshots in Fig. 2 with actual logical version numbers and the ingestion time. I understand that, e.g., for BEAR-A you create version 4 as an HDT file for COBRA. But what happens then? Are versions 3-2-1-0 inserted, and is the metadata in the database then patched to represent the correct logical order of the snapshots? This needs clarification and more explanation. Or is it possible to insert versions out of order in a way that allows explicitly assigning them a logical version number? My natural (probably naive) understanding of archiving is that the ingestion order of the files also determines the monotonic order of the versions (ingestion time -> version timestamp / incremented version number). I think it is crucial here to understand what data input the 3 variants expect. A named graph of triples with additional version metadata (version number / label)?
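For illustration, this is what out-of-order ingestion with explicit logical version numbers could look like; `Archive` and `ingest` are my own hypothetical names, not the actual COBRA interface:

    # Hypothetical sketch: explicit logical version numbers decouple
    # ingestion order from logical version order.
    class Archive:
        def __init__(self):
            self.versions = {}  # logical version -> changeset

        def ingest(self, logical_version, additions, deletions):
            self.versions[logical_version] = (additions, deletions)

    archive = Archive()
    archive.ingest(4, {("s", "p", "o4")}, set())  # middle snapshot first
    for v in (3, 2, 1, 0):                        # then the reverse chain
        archive.ingest(v, set(), {("s", "p", f"o{v}")})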

p10 “3 show” -> shows
“These tables” -> plural?

p12 “4 show” -> shows
“to to”
“has is”

Fig. 6.1: I do not understand why values for 8 version deltas are reported. DM between version 0 and version 0 (0-0) does not seem meaningful. Why is DM 0-3 most efficient for both approaches? I would expect only 0-4 to be significantly faster for COBRA.

Fig. 6.2/6.3: I do not understand the execution-time minima at the very beginning. For COBRA I would expect minima in the middle, and for OSTRICH I would not expect them at all.

p14 “ could easily … can happen”

p15 “reject this hypotheses” -> hypothesis

“solves the main scalability problem”: it is not obvious from the evaluation that this is the case.

General question:
Does the fixup require the database to be read-only?

Please use the review layout with line numbers and pages in future revisions.

Review #2
Anonymous submitted on 19/Apr/2021
Suggestion:
Major Revision
Review Comment:

The paper focuses on the storage and querying (across time) of RDF archives (RDF datasets with version information). In particular, the authors extend an existing archiving approach called OSTRICH, which merges several archiving strategies. OSTRICH is mostly a change-based approach, storing the differences of each version w.r.t. the initial one, with the deltas stored and annotated in a B+tree. In this paper, the authors try to alleviate the main scalability issue of this approach: its change-based design suffers from long ingestion times, as the aggregated deltas can become too large. The proposed solution, called COBRA, is to enable a bidirectional delta chain: the materialized snapshot is placed at a given version N, the initial versions 0, 1, 2, ..., N-1 become reverse deltas on the reference version N, while the following ones, N+1, N+2, ..., |V| (where |V| is the number of versions), are computed forward as before. In addition to addressing the scalability issues, the authors expect to reduce the storage requirements and improve query performance. The authors then describe the ingestion approach and the query algorithms. The ingestion approach mostly consists of a regular OSTRICH procedure plus a mechanism for fixing up the initial versions (0, 1, 2, ..., N-1) at a given moment in time (the optimal moment is not studied in this paper). The query algorithms are OSTRICH adaptations that consider both types of deltas. Experiments on the BEAR archiving benchmark show that the COBRA approach effectively improves OSTRICH's ingestion times, although it can produce larger storage sizes when datasets have many versions. As for query performance, COBRA shows mixed results: overall better results for version and delta materialization, but worse results for version queries on big datasets.
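A minimal sketch of the fix-up step summarized above, assuming each version can be materialized; all names are mine, not the paper's:

    # Hedged sketch: versions before the middle snapshot N, initially
    # ingested as forward deltas, are rewritten as reverse deltas on
    # the new snapshot.
    def fix_up(materialized, N):
        """materialized: dict logical_version -> set of triples."""
        snapshot = materialized[N]
        reverse = {}
        for v in range(N):
            reverse[v] = {
                "add": materialized[v] - snapshot,  # restore going back
                "del": snapshot - materialized[v],  # undo going back
            }
        return snapshot, reverse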

The paper is overall interesting: it clearly states the motivation of the work (I should congratulate the authors for a very complete and comprehensible state-of-the-art review) and describes the approach in simple terms. The code is available and the solution should be easily reproducible.

 
While I acknowledge the soundness of the technical solution and the relative novelty of the bidirectional delta chain approach applied to RDF (bidirectional deltas have been applied in other fields, e.g. https://www.sciencedirect.com/science/article/pii/S0306457311000926), one could argue that the overall contribution of the paper is limited, or at least hindered by a few key facts:

- First of all, the authors motivate the need to improve OSTRICH for a very large number of versions ("...  [OSTRICH] starts showing high ingestion times starting around version 1,100"), yet the evaluation stops at 400 versions. Why not test with the full set of BEAR-B versions?

- While the improvement in ingestion time is noticeable (41% less on average), this assumes that all versions are known beforehand. If I am not mistaken, adding the in-order ingestion time of COBRA* (Table 3) and the fix-up time (Table 4) can equal or even exceed the original time. Please clarify this point.

- The mixed results in query performance, and the performance difference between large and small archives, clearly show that the authors' idea of considering multiple snapshots and delta chains in future work might actually make this contribution stronger. The authors mention that "a certain ingestion time threshold could be defined, which would initiate a new snapshot when this threshold is exceeded". However, given that the authors only support one snapshot, this fix can only be applied once; hence the authors are only partially resolving (or rather postponing) the problem, since once the number of versions grows further, a new snapshot might be needed as well.
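For illustration, the quoted threshold policy could look like the following sketch; the names, the threshold value, and the `start_new_snapshot` call are assumptions, and the latter is exactly what the current single-snapshot implementation lacks:

    # Illustrative sketch of a threshold-based snapshot policy; all
    # names and values here are hypothetical.
    THRESHOLD_SECONDS = 60.0

    def ingest_with_policy(archive, version, changeset, last_ingestion_time):
        if last_ingestion_time > THRESHOLD_SECONDS:
            archive.start_new_snapshot(version)  # hypothetical extension
        archive.ingest(version, changeset)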

- While I understand that the work focuses on improving OSTRICH, and hence the evaluation compares against OSTRICH, knowing how the new COBRA results position with respect to the state of the art would enrich the big picture.

In addition, some further clarifications might be needed:

- Why is OSTRICH considered to also be an IC approach if it only keeps the first version as a snapshot? Isn't that a normal change-based approach? (In the end, there should always be a base dataset, right?)
- Why does COBRA not show a visible change in storage size for BEAR-A (Subfig. 3.1) at the middle (materialized) version, in contrast to the two BEAR-B datasets (Subfigs. 3.2 and 3.3)? This is acknowledged in the caption of the figure ("For BEAR-B Daily and Hourly, the middle snapshot leads to a significant increase in storage size") but not explained.

- In Section 5.4.1, when the authors talk about "many small versions", does this refer to few triples in each version, or to few changes in a version compared to the previous one? If the latter, in absolute value or as a percentage of the total size?

- The text should be reviewed for a few small typos (e.g. "RDF archiving solutions suffer are not sufficiently capable of"). In the abstract, this sentence does not convey much information: "In future work, other modifications to this delta chain structure deserve to be investigated, as they may be able to provide additional benefits."