Review Comment:
The paper focuses on the storage and querying (across time) of RDF Archives (RDF datasets with version information). In particular, authors extend an existing archiving approach called OSTRICH, which merges several archiving strategies. OSTRICH is mostly a change-based approach, storing the differences across versions w.r.t. the initial one, while deltas are stored and annotated in a B+Tree. In this paper, authors try to alleviate the main scalability issue of this approach: its pure change-based strategy suffers from long ingestion times, as the aggregated deltas might become too large. The proposed solution, called COBRA, is to enable a bidirectional delta chain, so the materialized version is placed at a given version N, and the initial versions 0, 1, 2, ..., N-2, N-1 can be stored as reverse deltas on the reference version N, while the next ones, N+1, N+2, ..., |V| (where |V| is the number of versions), are computed forward as before. In addition to alleviating the scalability issues, authors expect to reduce the storage requirements and improve query performance. Then, authors describe the ingestion approach and the query algorithms. As for the ingestion approach, it mostly consists of a regular OSTRICH procedure and a mechanism for fixing up the initial versions (0, 1, 2, ..., N-2, N-1) at a given moment in time (the optimal moment is not studied in this paper). The query algorithms are OSTRICH adaptations that consider both types of deltas. Experiments on the BEAR archiving benchmark show that the COBRA approach effectively improves the ingestion times of OSTRICH, although it might produce bigger storage sizes when the datasets have many versions. As for query performance, COBRA shows mixed results, with overall better results for version and delta materialization, but worse results in version queries with big datasets.
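To make the summary above concrete, the following is a minimal toy sketch of the difference between a forward-only aggregated delta chain and a bidirectional one. It is an illustration under stated assumptions, not the OSTRICH or COBRA implementation: versions are plain Python sets of triples, the B+Tree storage and delta annotations are ignored, and the names `build_chain`, `materialize`, and `delta_size` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Triple = Tuple[str, str, str]
Delta = Tuple[FrozenSet[Triple], FrozenSet[Triple]]  # (additions, deletions)


def build_chain(versions: List[FrozenSet[Triple]], snapshot_index: int):
    """Store one full snapshot plus one aggregated delta per other version."""
    snapshot = versions[snapshot_index]
    deltas: Dict[int, Delta] = {}
    for i, version in enumerate(versions):
        if i != snapshot_index:
            # Aggregated delta: always relative to the snapshot,
            # not to the neighbouring version.
            deltas[i] = (version - snapshot, snapshot - version)
    return snapshot_index, snapshot, deltas


def materialize(chain, i: int) -> FrozenSet[Triple]:
    """Rebuild version i from the snapshot and its aggregated delta."""
    snapshot_index, snapshot, deltas = chain
    if i == snapshot_index:
        return snapshot
    additions, deletions = deltas[i]
    return (snapshot - deletions) | additions


def delta_size(chain) -> int:
    """Total number of added and deleted triples stored across all deltas."""
    _, _, deltas = chain
    return sum(len(a) + len(d) for a, d in deltas.values())


# Toy archive (assumption): each version adds one triple to the previous one.
versions = [
    frozenset(("s", "p", f"o{k}") for k in range(n + 1)) for n in range(5)
]
forward_only = build_chain(versions, snapshot_index=0)   # snapshot at version 0
bidirectional = build_chain(versions, snapshot_index=2)  # snapshot in the middle
```

In this toy archive the forward-only chain stores more delta entries in total than the bidirectional one, while the middle snapshot itself is larger than the initial one, which mirrors the trade-off between ingestion cost and storage size discussed in the review.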
The paper is overall interesting: it clearly states the motivation of the work (I should congratulate the authors for a very complete and comprehensible state of the art review) and describes the approach in simple terms. The code is available and the solution should be easily reproducible.
While I acknowledge the soundness of the technical solution and the relative novelty of the bidirectional delta chain approach applied to RDF (bidirectional deltas have been applied in other fields, e.g. https://www.sciencedirect.com/science/article/pii/S0306457311000926), one could argue that the overall contribution of the paper is limited, or that it is hindered by a few key facts:
- First of all, although authors motivate the need to improve OSTRICH for a very large number of versions ("... [OSTRICH] starts showing high ingestion times starting around version 1,100"), the evaluation stops at 400 versions. Why not test with all the versions of BEAR-B?
- While the improvement in ingestion size is noticeable (41% less on average), this assumes that all versions are known beforehand. If I am not wrong, adding the in-order time of COBRA* (Table 3) to the fixing time (Table 4) can result in an equal or even larger total time. Please clarify this point.
- The mixed results in query performance and the differing performance on large and small archives clearly show that the authors' idea of considering multiple snapshots and delta chains, currently left for future work, might actually make this contribution stronger. Authors mention that "a certain ingestion time threshold could be defined, which would initiate a new snapshot when this threshold is exceeded". However, given that authors only support 1 snapshot, this fix can only be done once; hence authors are only partially resolving (or rather postponing) the problem, because if the number of versions is exceeded again, a new snapshot might be needed as well.
- While I understand that the work focuses on improving OSTRICH and hence the evaluation compares against OSTRICH, knowing how the new results of COBRA position it with respect to the state of the art would enrich the big picture.
In addition, some further clarifications might be needed:
- Why is OSTRICH also considered an IC approach if it only keeps the first version? Isn't that a normal change-based approach? (In the end, there should always be a base dataset, right?)
- Why does COBRA not suffer a visible change in space in BEAR-A (Subfig. 3.1) at the middle (materialized) version, in contrast to the two BEAR-B cases (Subfig. 3.2 and 3.3)? This is indeed acknowledged in the caption of the figure ("For BEAR-B Daily and Hourly, the middle snapshot leads to a significant increase in storage size"), but it is not explained why.
- In Section 5.4.1, when authors talk about "many small versions", do they refer to few triples in each version, or to few changes in a version compared to the previous one? In the latter case, is this in absolute value or in % over the total size?
- The text can be reviewed for a few small typos (e.g. "RDF archiving solutions suffer are not sufficiently capable of"). In the abstract, this does not give much information: "In future work, other modifications to this delta chain structure deserve to be investigated, as they may be able to provide additional benefits."