Review Comment:
Summary
--------
The authors present a number of significant contributions, expanding previous work on the representation of RDF archives to improve performance and add new query capabilities. The key idea in the article is to improve OSTRICH, a solution based on aggregated delta chains. First, the authors describe strategies to use multiple delta chains, potentially improving space requirements, ingestion times and query times. Additionally, the authors improve previous work by introducing a new encoding scheme that reduces disk usage and ingestion time. Finally, the authors describe how to enhance the system to allow SPARQL queries over the RDF archive. The experimental evaluation expands previous work by studying the multiple-delta-chain strategy on all the datasets of the BEAR benchmark, and tests the performance of the new contributions.
The paper is well written and the problem is worthy of attention. The combination of all the contributions yields several steps forward in the context of RDF archives, and significantly improves the baseline OSTRICH tool in many scenarios. In terms of query performance, I find that the overall relevance of the proposal (i.e., its position in the state of the art) is not easy to evaluate without additional discussion or results. On the other hand, the improvement over the baseline is promising, and the experimental evaluation with the BEAR-C benchmark is a significant contribution that provides a baseline for future work.
Compliance with open science data
---------------------------------
The authors provide several links to resources that can be used to partially reproduce the experimental evaluation:
- The Zenodo dataset linked in the article is well organized and easy to use, but it appears to be the same file used in previous work [13]. It provides code and instructions to reproduce the experiments in Section 8.3 and most of the results in Section 8.2.
- There are no instructions on how to reproduce the experiments in Section 8.4, and I did not find in the linked resources any obvious way to switch between metadata representations.
- Several sources are linked in relation to SPARQL support, covered in Section 8.5. The contents of the GitHub repositories should be enough to run experiments using SPARQL, but there is no specific information on how to reproduce the experiments in the article.
Following the journal guidelines, a single file including all the necessary resources should be provided. I would suggest creating a new Zenodo archive, following a structure similar to the previous one, that includes all the information needed to reproduce the experiments.
Main comments: suggestions in the experiments section
-----------------------------------------------------
I find that the use of OSTRICH as the only baseline makes it difficult to gauge the significance of the results in relation to query performance. The authors achieve better space requirements, ingestion times and query times than the OSTRICH baseline. However, they omit any comparison with the reference systems included with BEAR, with the claim that they are outperformed by OSTRICH. The results in [17], however, call for a more nuanced reading: OSTRICH obtained good results and offered a good tradeoff, but it was neither the smallest nor the fastest alternative. This leaves several open questions hidden behind the claim that OSTRICH outperforms the BEAR baselines:
- Would the multiple-snapshot strategy on top of OSTRICH be faster than the BEAR reference systems at ingestion time? The comparison with other alternatives may not be completely fair, as noted in [17], but the improvement of one order of magnitude over the OSTRICH baseline is less relevant if the system is still slower than the alternatives.
- Would the multiple-snapshot strategy be faster (for VM, DM queries) than the HDT-based systems? Would it be faster for V queries than the BEAR baselines? It will surely be more competitive, but it may not outperform at least the HDT-based alternatives.
- Would the multiple-snapshot strategy use less space than the alternatives? Probably yes, except for HDT-CB.
A full comparison with the reference systems included with BEAR may not be necessary to demonstrate most of the current improvements, but a limited experimental evaluation could easily answer the previous questions. At a minimum, some additional discussion is necessary to place the contributions of this work within the state of the art in terms of performance for triple pattern queries, or to justify the omission of the BEAR reference systems as baselines.
Section 8.4 only reports results of the compressed representation for BEAR-B. The improvement is impressive, especially at ingestion time, so it would be useful to know whether a similar improvement can be expected on BEAR-A (if reasonable to compute) or BEAR-C, which have few versions and bulky delta chains. BEAR-C ingestion times and disk usage with the compressed representation are relevant since it is used in Section 8.5, and I assume that the results in Table 6 were obtained with the version with uncompressed metadata.
Other comments
--------------
The third contribution described on page 2 mentions an extended evaluation of previous work with additional baselines. However, the only addition in Section 8.3, compared to [13], seems to be the BEAR-A baseline and the evaluation of ingestion time and disk usage on BEAR-C.
The choice of keeping snapshots also as aggregated deltas is briefly discussed in Section 4.1. This is used to improve DM queries between consecutive snapshots, but it also incurs extra disk space and ingestion time. Are there other practical advantages of this choice? I would suggest discussing the benefits of this optimization in the experimental evaluation.
In Section 4.2, the change-ratio strategy computes the sum of change ratios for all the versions in the current delta chain to determine whether to build a new snapshot. The explanation in p.7, l.25 implies that this calculation is used to estimate the amount of data not materialized in the snapshot. However, the value of $\delta_{s,k}$ is already a good estimation of this; the same could be said of aggregating the relative deltas $\delta_{i,i+1}$ for all versions. If the goal is to estimate the redundancy in the delta chain more accurately, this should be explained. As a side note, it is not clear whether the same metric (or at least the same value of $\gamma$) remains equally effective before and after changing the metadata encoding as described in Section 6.
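To make my reading of the strategy concrete, here is a minimal sketch of a change-ratio snapshot policy. The ratio definition, function names, and the value of gamma are my own illustrative assumptions, not the paper's exact formulation:

```python
# Hypothetical sketch of a change-ratio snapshot policy (illustrative only).

def change_ratio(snapshot_size, additions, deletions):
    """Ratio of changed triples in one version relative to the snapshot size."""
    return (additions + deletions) / snapshot_size

def should_create_snapshot(ratios, gamma):
    """Create a new snapshot once the summed change ratios of the
    current delta chain exceed the threshold gamma."""
    return sum(ratios) > gamma

# Example: a snapshot of 1000 triples followed by three versions.
ratios = [change_ratio(1000, a, d) for a, d in [(50, 10), (80, 20), (120, 40)]]
print(should_create_snapshot(ratios, gamma=0.5))  # sum is 0.32, so no new snapshot yet
```

Under this reading, the summed ratios grow monotonically with the chain length, which is what makes the metric sensitive to accumulated (possibly redundant) changes rather than only to the last delta.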
In Table 2, the notation for the first two rows is confusing: $u_{i,j}^+$ and $u_{i,j}^-$ should be $u_{k}^+$ and $u_{k}^-$.
In Table 2 it is assumed that changeset sizes can be added up directly to compute the deltas. I suggest providing an example of the delta calculation, or simply stating that the changesets are independent.
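To illustrate why the independence assumption matters, here is a small sketch with made-up triples (the data and names are hypothetical, not taken from the paper):

```python
# Hypothetical illustration: changeset sizes only add up directly when
# changesets are independent. If a triple added in one version is removed
# in a later one, the net delta is smaller than the sum of the sizes.
add_1 = {"t1", "t2"}   # additions in version 1
add_2 = {"t3"}         # additions in version 2
del_2 = {"t2"}         # deletions in version 2: removes a triple added in v1

# Naively summing changeset sizes over-counts:
naive = len(add_1) + len(add_2) + len(del_2)

# Net delta relative to the snapshot: t2 cancels out entirely.
net_additions = (add_1 | add_2) - del_2   # {"t1", "t3"}
net_deletions = del_2 - (add_1 | add_2)   # empty: t2 never existed in the snapshot

print(naive, len(net_additions), len(net_deletions))  # 4 2 0
```

If the article's changesets are guaranteed to be independent (no triple both added and removed across the chain), the direct sum in Table 2 is exact, and stating that guarantee would resolve the ambiguity.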
Algorithms 1, 2, and 4 provide limited added value, since they could be summarized as the OSTRICH algorithms applied to one of the many delta chains. In any case, the pseudocode of Algorithms 1 and 2 may be useful to make the article self-contained, as well as to introduce the notation.
A few small typos in the first algorithms:
- Alg. 2, line 9: replace $u_j^+$ by $u_j^-$
- Alg. 2, line 11: replace the second $u_j^-$ by $u_i^-$
- Comments in Algorithms 1 and 2: "correspond" -> "corresponds"
In Algorithm 3, l. 15, $delta$ should be $(u_i, u_j)$. I would suggest using the same notation inside snapshotDiff instead of referring to a generic $delta$, so that notation is consistent also with Algorithm 2.
The explanation of Algorithm 4 is incomplete: it is stated that the algorithm iterates over the triples that match p in the snapshot, but the additions are not mentioned. The behavior of "queryAdditions" should be included in the explanation, and the description of line 6 should be adjusted.
Other small details found in the text:
p. 5, l. 31: "a HDT" -> "an HDT"
p. 8, l. 44: "correspond" -> "corresponds"
p. 11, l. 24: "describe" -> "describes"
p. 12, l. 3: "paramount to functioning" -> "paramount to the functioning" (?)
p. 12, l. 4: "the multiple aspects" -> "multiple aspects" (?)
p. 13, l. 17: "exists" -> "exist"
p. 13, l. 23: "Our experiments Section 8.4" -> "Our experiments (Section 8.4)" or "Our experiments in Section 8.4"
p. 14, l. 33: "create" -> "created"/"creates"
p. 16, l. 50: "can ingest the entire history (in around 26 hours)" -> mention the dataset