Towards Fully-fledged Archiving for RDF Datasets

Tracking #: 2458-3672

Olivier Pelgrin
Luis Galárraga
Katja Hose

Responsible editor: 
Guest Editors Web of Data 2020

Submission type: 
Full Paper
The dynamicity of RDF data has motivated the development of solutions for archiving, i.e., the task of storing and querying previous versions of an RDF dataset. Querying the history of a dataset finds applications in data maintenance and analytics. Notwithstanding the value of RDF archiving, the state of the art in this field is under-developed: (i) existing systems are neither scalable nor easy to use, (ii) no solution supports multi-graph RDF datasets, (iii) there is no standard way to query RDF archives, and (iv) solutions do not exploit the evolution patterns of real RDF data. On these grounds, this paper surveys the existing works in RDF archiving in order to characterize the gap between the state of the art and a fully-fledged solution. It also provides RDFev, a framework to study the dynamicity of RDF data. We use RDFev to study the evolution of YAGO, DBpedia, and Wikidata, three dynamic and prominent datasets on the Semantic Web. All these insights allow us to set the ground to sketch a fully-fledged archiving solution for RDF data.

Major Revision

Solicited Reviews:
Review #1
By Andre Valdestilhas submitted on 15/Apr/2020
Minor Revision
Review Comment:

(1) originality
> The paper explores solutions for archiving, i.e., storing and querying versions of RDF datasets.

(2) significance of the results
> The vocabulary dynamicity: the correlation with the change ratio shown in Figures 3b, 3e and 3h is a significant and innovative point, as is the demonstration of the big changes between YAGO releases.
> The evaluation of the related work adds detail and gives more credibility to the paper. It also exposes the well-known problem of reproducibility: Ostrich was the only system able to run the experiments.

(3) quality of writing
> The paper is very well written, I have no comments on this aspect.

Points to improve:
> About the implementation and result replication.
>> The code is provided as a Dropbox link to a compressed file. I recommend putting the code on GitHub.
>> It was not possible to run the experiments, maybe it is because of my lack of knowledge in C++.
> In RDFev, I miss information about the input data dubbed “revisions”: what do they stand for?
> It does not work with a multi-graph.
> It’s well known that YAGO, or part of it, is contained in DBpedia [5]. How does RDFev deal with a YAGO version inside DBpedia? Which versions of YAGO could be there? Is it complete?
> The compression provided by Ostrich should not be considered because it is part of the HDT format.
> I did not understand the term “delta chain”.
> Ostrich already has its evaluation in its own paper; thus, there is no need to include it in yours.
> Need more information about an “arbitrary BGP”, an example could be a good choice.
> Where are those 100 queries? Why 100? Where do they come from? Which domain?
> Which version of Ostrich was used?
> I miss a justification to use 64 GB RAM.

> WIMU [1] could help because it is an index of datasets where the user can obtain versions of the same dataset if they share some URI. Moreover, the authors could also use wimuQ [2] to query datasets from WIMU, as the two are already integrated.

> For versions of RDF datasets and tracking provenance information, I recommend including Quit Store [3] in your related works.

> Part of this work is about a study of the evolution of 3 datasets YAGO, DBpedia and Wikidata. Is this work aware of the concept of RDF dcat:distribution used in works, such as [4]?

> Regarding the recommendation in section 7 to add metadata to the datasets, I recommend highlighting that this will be easy for the HDT format, since several other formats do not work very well with metadata.

[1] Valdestilhas, A., Soru, T., Nentwig, M., Marx, E., Saleem, M., & Ngomo, A. C. N. (2018, June). Where is my URI?. In the European Semantic Web Conference (pp. 671-681). Springer, Cham.
[2] Valdestilhas, André, Tommaso Soru, and Muhammad Saleem. "More Complete Resultset Retrieval from Large Heterogeneous RDF Sources." Proceedings of the 10th International Conference on Knowledge Capture. 2019.
[3] Arndt, N., Naumann, P., Radtke, N., Martin, M., & Marx, E. (2019). Decentralized Collaborative Knowledge Management using Git. Journal of Web Semantics, 54, 29-47.
[4] Frey, J., Hofer, M., Obraczka, D., Lehmann, J., & Hellmann, S. (2019, October). DBpedia FlexiFusion the Best of Wikipedia> Wikidata> Your Data. In International Semantic Web Conference (pp. 96-112). Springer, Cham.
[5] Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P. N., ... & Bizer, C. (2015). DBpedia–a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2), 167-195.

Review #2
Anonymous submitted on 04/May/2020
Minor Revision
Review Comment:

Title: Towards Fully-fledged Archiving for RDF Datasets
Authors: Olivier Pelgrin, Luis Galárraga and Katja Hose

The paper focuses on RDF archiving, i.e., storing and querying the entire
edition history of an RDF dataset. This topic is well-known in the semantic
web community with many useful use-cases and related works.

The paper raises the following issue: if many works already address
RDF archiving, there is no available fully-fledged solution. According
to the paper, this situation is the result of:
- performance and functional limitations of RDF engines
- No standard for querying RDF archives
- Disregard of the evolution of real RDF data.

So the paper proposes the following contributions:
- A set of metrics to analyze the evolution of RDF dataset
- A study of the evolution of representative datasets
- A survey of existing work on RDF archiving, with a benchmark on
previous datasets.
- A sketch of a full-fledged RDF archiving system.

Strong points:
The paper is well written
The topic is important for the semantic web community
The study of evolution highlights interesting points.
Experiments in the survey part highlight limitations of existing
solutions on representative datasets.

Weak points:
We have several parts in the paper: the first one is a study, with its
methodology and results. A second one is a quick survey and the last
one is a sketch of a "full-fledged" system. Although these parts are
related, it could also be 3 different papers: a study, a survey, and a
vision paper. Consequently, the paper may also be understood as
a sequence of 3 different minor contributions.
Research directions of section 7 remain blurry. There is a gap
between spotted issues and conclusions from experiments.
The root cause of the performance limitation of the only tested system is not established. Performance issues are not sufficiently covered in the paper, although I think they are very important for RDF archiving systems.
After reading the paper, we can also conclude that before having a full-fledged solution, maybe we first need a partial solution able to manage big RDF datasets over long periods.

Section introduction is clear and easy to follow.

Section 2 presents the preliminaries. It describes basic formalism to
describe RDF graph archives. Although the notations look complex, they are quite
easy to follow. It seems to me that there is a problem in figure 1
with delta1= USA:dr:Cuba. When applied to G0, I cannot obtain G1. I
have the same problem with figure 2. I think it is a typo, and I
understand the overall idea. The presentation of queries on the archive in
section 2.5 is very clear.

Section 3 presents RDFev, a set of metrics to observe the evolution of
RDF datasets. The contribution of this set of metrics is to
observe "high-level change", i.e., change at the level of entities vs.
change at the level of triples [18]. Such metrics allow highlighting
patterns of evolution of RDF minor/major releases. It is stated in
section 4.4 that "Wikidata exhibits a stable release cycle as our
metrics did not exhibit big fluctuations from release to release". It
is true in the observed period that ends in 2016. After this period,
Wikidata was at ~2 billion triples, and it is now at 8 billion triples. I'm
not sure that observations made on Wikidata before 2016 can be
extended to the last period. Consequently, conclusions of section 4.4
may be impacted if the recent period was included. Overall, just spotting
from this long experiment that there are major and minor releases on
datasets is a little disappointing.

Section 5 presents a survey of RDF archiving systems. It retains several
criteria to compare different approaches: storage, data model, query types,
concurrent updates... Table 2 presents a synthetic view of systems vs
criteria. Results presented in figure 6 and table 3 are interesting.
The experiments of section 6 mainly redo the experiment of [49] with
bigger datasets. It is quite disappointing to see that of the 9
systems presented in the survey, only 4 have their sources available
and only one is able to just ingest data. This can be
considered as a contribution of the paper. Maybe more conclusions can
be derived from that situation.

In section 6.2, "ingestion time increases linearly with the number of
revision", then it is written, "Overall, Ostrich proves to handle large
deltas in a reasonable amount of time". I disagree with this
conclusion. If ingestion time increases linearly with the number of
revisions, it just means that the whole approach is not sustainable.
Suppose you run the same experiment with Wikidata 2016->2020; I think
that ingestion times will just become prohibitive. One remaining
question is why it grows linearly: is it an implementation problem or
is it a complexity problem?

I also disagree that 30 hours for ingesting 51M triples is a
reasonable amount of time. Does it mean 300 hours is reasonable for
ingesting 510M triples?
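
The extrapolation above can be made concrete with a short sketch. It assumes a constant throughput derived from the reported 30 hours for 51M triples and, separately, one possible reading of "ingestion time increases linearly" in which the time to ingest revision i grows linearly with i. The function names and the per-revision coefficients are hypothetical and are not taken from the paper.

```python
def extrapolate_hours(triples, ref_triples=51_000_000, ref_hours=30.0):
    """Scale the reported ingestion time linearly with the number of triples."""
    return ref_hours * triples / ref_triples


def cumulative_hours(num_revisions, base_hours, slope_hours):
    """Total time if ingesting revision i costs base_hours + slope_hours * i.

    Both coefficients are hypothetical. Under this assumed cost model the
    total grows quadratically with the number of revisions.
    """
    return sum(base_hours + slope_hours * i for i in range(1, num_revisions + 1))


if __name__ == "__main__":
    # Ten times the reference number of triples.
    print(extrapolate_hours(510_000_000))
    # Doubling the number of revisions more than doubles the total time.
    print(cumulative_hours(10, 1.0, 0.5))
    print(cumulative_hours(20, 1.0, 0.5))
```

Whether the measured behaviour follows the first or the second model is exactly the open question about implementation versus complexity raised above.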

Section 7 describes some directions to design a full-fledged RDF
archiving system. The section is well structured but the research
directions remain too blurry for me. The introduction stated that
current RDF archiving systems do not take into account the actual
evolution of real RDF data. From the study, we learn that they follow
major/minor patterns of evolution. Finally, in section 7, the paper only
suggests an adaptive data-oriented system. I cannot really understand
what "a more conservative allocation of the storage resources for
single entity index" means. I expected more precise directions, with
references to related papers.

On the other hand, there is something very clear: from experiments on
OSTRICH, there is a linear growth of ingestion time. If it is linked
to a complexity problem, then it is a serious issue. This problem is
spotted in the paragraph "Accounting for evolution patterns": "it is
vital for solutions to improve their ingestion time complexity".
Improving complexity is challenging but we don't know if the complexity
problem is linked to OSTRICH or to all RDF archiving systems.

More generally, I don't see, from the experiments of section 5, clear
evidence of issues highlighted in the introduction. Ostrich is the only
system that works, and finally, it seems that the only problem is
related to ingestion time. IMHO, it impacts the storytelling of the
paper. Maybe the results of the experiments of section 5 can be used more
wisely to highlight issues spotted in the introduction and improve
the storytelling of the paper.

In the introduction, the lack of a standard for querying was highlighted,
but it does not appear in section 7. In the survey section, the only
working system supports only VM, DM, and V queries. Does it mean
that a standard should concentrate on these query types?

In the serialization and querying paragraph of section 7, RDF* is
highlighted. I agree that RDF* makes reification easier, but it is
just syntactic and performance is significantly impacted. As
performance issues have been highlighted in the introduction, I don't
see CLEARLY in the research directions how the performance issues will be
globally tackled.

The conclusion is quite syntactic.

Review #3
By Natanael Arndt submitted on 16/May/2020
Major Revision
Review Comment:

The paper "Towards Fully-fledged Archiving for RDF Datasets" consists of an introduction (section 1), preliminaries (section 2), two major parts (sections 3-4 and 5-7) and a short conclusion.
The two major parts are a "Framework for the Evolution [of] RDF Data" based on the metrics proposed by Fernández et al. in [18] in section 3 and exemplary applied in section 4; and a "Survey of RDF Archiving Solutions" in section 5, which is discussed in sections 6 and 7.
From the title "Towards Fully-fledged Archiving for RDF Datasets" I expect a thorough requirements analysis and conceptual specification of an RDF Archiving system with a classification of the related work.
The four points listed in the abstract
(i) existing systems are neither scalable nor easy to use,
(ii) no solution supports multi-graph RDF datasets,
(iii) there is no standard way to query RDF archives, and
(iv) solutions do not exploit the evolution patterns of real RDF data.
support these expectations.
While reading the paper I see contributions towards (ii)-(iv), while I'm missing a definition of "scalable" and "easy to use". Moreover, (ii) is not true: in Table 2 you list Dydra with multi-graph support; according to our research [ANR+18], R43ples also supports multiple graphs (maybe not full RDF datasets, but still "multi-graph" support); and the Quit Store [ANR+18] has support for real RDF datasets.

I was very surprised that the Quit Store [ANR+18] is not in the list of related work, as it provides answers to some of the raised questions. Compared to R43ples and R&Wbase, it was already evaluated regarding its scalability, it comes with a user interface that guides through the versioning system, it supports multi-graph RDF datasets, and it provides a standard SPARQL 1.1 Query & Update interface with full BGP support and a virtual endpoint for each version in the history. Further, it allows VM, DM, and V queries (DM and V using the provenance endpoint), it has support for branches and tags, it allows concurrent updates (see also [AR19]), and it is Open Source (GPLv3, and available as a docker image). Maybe you should also include Stardog in your comparison, even though the source code is not available, as is the case for many of the systems (which, I agree, is a huge problem for research!!!).

For the selection of systems to evaluate the performance you state "In addition to the limitations in functionality, Table 2 shows that most of the existing systems are not easily available, either because their source code is not directly usable, or because they do not easily compile in modern platforms." I admit that it is not easy to get research prototypes of other teams running, but I would expect some more effort to get the systems running for a proper comparison. In the course of performing the evaluations for the Quit Store [ANR+18] we have made docker images available for the R43ples and R&Wbase systems. We had to ask the respective teams for some support, but in the end it worked. You are welcome to re-use these images. Further, you should state exactly which limitations of R43ples made it impossible to run your experimental datasets.

In section 2.1 you define the label g ∈ I. But the RDF standard says "Each named graph is a pair consisting of an IRI or a blank node (the graph name), and an RDF graph." Please also check your section 2.4 in this regard.
Please make your definitions compatible with the RDF standard here.
This also raises the question of how you deal with blank nodes in general.

In section 2.2 p2 l28 right you define with "∆^+_i ∪ ∆^-_i != ∅" that the changeset is not allowed to be empty. Why not allow empty changesets? Is there any problem with allowing empty changesets? That could be an option to make your notation of dataset changesets easier. You could use the same revision numbers for the dataset as for each graph and you would not need to deal with the drift of the two indices.
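
To illustrate the suggestion, here is a minimal sketch of a changeset as a pair of added and deleted triple sets, where an empty changeset is permitted and simply leaves the graph unchanged. The class and function names and the example triples are hypothetical; this is not the paper's formalization, only one way to read it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Changeset:
    """A pair <added, deleted> of triple sets; both may be empty."""
    added: frozenset = frozenset()
    deleted: frozenset = frozenset()

    def is_empty(self):
        return not self.added and not self.deleted


def apply_changeset(graph, changeset):
    """Return the next revision: (graph minus deleted) union added."""
    return (frozenset(graph) - changeset.deleted) | changeset.added


if __name__ == "__main__":
    # Hypothetical triples, written as (subject, predicate, object) tuples.
    g0 = frozenset({("ex:a", "ex:p", "ex:b")})
    u1 = Changeset(added=frozenset({("ex:a", "ex:p", "ex:c")}),
                   deleted=frozenset({("ex:a", "ex:p", "ex:b")}))
    g1 = apply_changeset(g0, u1)
    # An empty changeset yields an identical revision, so every graph in a
    # dataset could share the dataset's revision numbers.
    g2 = apply_changeset(g1, Changeset())
    print(g1 == g2)
```

With empty changesets allowed, a graph that does not change in a given dataset revision still receives a revision with the same index, which is the simplification of the notation proposed above.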

Further, in section 2.2, you define the notion of a changeset as follows: "We extend the notion of changesets to arbitrary pairs of revisions i, j with i < j, and use the notation u_ij = <∆^+_{i,j} , ∆^-_{i,j}>." As I understand it, i and j are the numbers of the revisions of a graph. In section 2.3 you transfer your concept from graphs to datasets. You then use several notations to address a graph in a dataset. In one place you use û_j and in another u^1_1, but these notations are not properly introduced. I would expect a concise definition of a changeset for datasets.

In section 2.3 you write "In contrast to an RDF graph archive, an RDF dataset is a set D = {G^1, G^2, …, G^m} of named graphs where each graph has a label g^k ∈ I. Differently from revisions in a graph archive, we use the notation G^k for the k-th graph in a dataset, whereas G^k_i denotes the i-th revision of G^k ."
Please see the remarks to section 2.1 with regard to blank nodes here, according to the RDF 1.1 Concepts and Abstract Syntax g^k ∈ I∪B. Also per the definition of RDF 1.1 Concepts and Abstract Syntax a dataset consists of exactly one default graph. How do you deal with the default graph?
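
As an illustration of the two points above, the following sketch models a dataset with exactly one default graph plus named graphs whose names may be IRIs or blank nodes, as in RDF 1.1 Concepts and Abstract Syntax. All class and method names are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass
class Dataset:
    """One default graph plus zero or more named graphs."""
    default_graph: set = field(default_factory=set)
    # Maps a graph name (IRI or BlankNode) to its set of triples.
    named_graphs: dict = field(default_factory=dict)

    def add_named_graph(self, name, triples):
        if not isinstance(name, (IRI, BlankNode)):
            raise TypeError("graph name must be an IRI or a blank node")
        self.named_graphs[name] = set(triples)


if __name__ == "__main__":
    ds = Dataset()
    ds.add_named_graph(IRI("http://example.org/g1"), {("ex:a", "ex:p", "ex:b")})
    ds.add_named_graph(BlankNode("b0"), set())
    print(len(ds.named_graphs), len(ds.default_graph))
```

A formalization along these lines would make explicit how revisions of the default graph and of blank-node-labelled graphs are tracked.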

Please put the relevant formulas in section 2 especially 2.2 and 2.3 in separate definition environments and try to illustrate them with simple conceptual figures or examples. In the current state the definitions are convoluted with examples, this is hard to read. Maybe you can also take a look at the formalization to express changes as presented in [ANR+18] sections 6 and 7. This formalization might not be perfect, but it could be a possibility to pick up some ideas about what I want to say and maybe you can even extend the model according to your needs.

In section 3 on page 5 you provide a link to your source code, but Dropbox is not a proper archive. Could you please provide your source code in a proper software source code archive? Otherwise you contribute further to the problem that scientific prototypes are hard to reproduce.

I like the idea of ρ, ζ, l(), and rv(). But there is some problem with the consistent usage of the notation. On p10 l1 you write and i = rv(ρ). On p11 l42 you write , why don't you consistently write here. Is there a semantic difference? The same in l51.

In general I like how you include the categorization of types of queries and archiving policies, as introduced by Fernández et al. in [18] and [17] respectively, into your research. (Sometimes you refer to [18] and sometimes to [17] in this regard; that should be consistent.) But you should better point out what new contributions you make to this model. I also suggest adding the archiving policy "Fragment-based (FB)" [ANR+18] for systems that take snapshots of fragments of a dataset and thus are neither Independent Copies (IC) nor Change-based approaches (CB).

Regarding your citation style I have some remarks. You often use references as nouns in your sentences, e.g. "the approach presented in [13] relies …". In my eyes "[13]" is not a word, and it makes such sentences hard to read. It would be better to use phrases like "the approach presented by Dong-hyukim et al. [13] relies …" or to name the respective systems. I can't remember all of the numbers while reading the text.
For reference [50] there is some problem with the encoding "… WWW âĂŹ19, page 961âĂŞ965, …". Also for [8].
Could you also please add the DOIs for your references?
For W3C standards, could you please refer to the actual respective RDF standard documents, e.g. RDF 1.1 Concepts and Abstract Syntax, instead of the non-normative W3C Working Group Note (RDF 1.1 Primer).
Also it is good to reference an exact version: e.g. for [42] Yves Raimond and Guus Schreiber. RDF 1.1 semantics. W3C recommendation, 2014. use instead.

As a summary, I would expect a conceptual model or specification of a "fully-fledged solution". The various aspects and features expected by the authors from a fully-fledged archiving system are distributed across the paper. So this should be a matter of bringing them all together in a concise definition, which might involve some major re-work of the paper.
The overarching story to integrate the major parts of the paper "Framework for the Evolution [of] RDF Data" and "Survey of RDF Archiving Solutions" needs to be clear, currently they seem to be two different topics which should be part of two individual papers. Also it is not clear to me if this is a "Full Paper" or a "Survey Article".
The comparison lacks a state of the art system.
There are some problems in its presentation (citations, formula and missing examples).
As a result I like the idea of the paper and would really like to see this paper published but at its current state it needs a MAJOR REVISION clarifying all of the mentioned aspects.

[ANR+18] Decentralized Collaborative Knowledge Management using Git by Natanael Arndt, Patrick Naumann, Norman Radtke, Michael Martin, and Edgard Marx in Journal of Web Semantics, 2018.
[AR19] Conflict Detection, Avoidance, and Resolution in a Non-Linear RDF Version Control System: The Quit Editor Interface Concurrency Control by Natanael Arndt, and Norman Radtke in Companion Proceedings of the 2019 World Wide Web Conference (WWW '19 Companion), 5th Workshop on Managing the Evolution and Preservation of the Data Web, San Francisco, CA, USA, 2019.

Some pedantic notes:

p3 l22 right: "(variables are always prefixed with the symbol ?)". This is not correct it can also be "$".

p4 l27 right: Framework for the Evolution [of] RDF Data

p11 l38,39,42,48 left, l22 right, and Table 2: Dryda -> Dydra (please go through the document with find and replace)

p5 l12-l13 right: "The grow ratio is the ratio between [between] the number of triples in two revisions i, j."

p6 l26-28 left: "These are subsets of triples of the same nature, e.g., triples with literal objects extracted with certain extraction method[s]."

p10 l40 right: "R&WBase has inspired the design of R43ples [24]. Unlike its predecessor, [24] …" Does inspiring something make the one the predecessor of the other?

p14 l40 left "that go beyond the [?], treat …" In place of [?] there is something missing in the sentence.

p14 l31 right: vocabular[it]y dynamicity
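
Regarding the note on p3 l22 above: in SPARQL 1.1 a variable may be written with either "?" or "$", and both forms denote the same variable. The sketch below extracts variable names with a deliberately simplified, ASCII-only regular expression; the pattern and the function name are assumptions for illustration and do not implement the full SPARQL grammar (for instance, they do not skip string literals or IRIs).

```python
import re

# Simplified, ASCII-only approximation of the SPARQL 1.1 variable tokens:
# a "?" or "$" marker followed by a name. Not the full grammar.
VAR_PATTERN = re.compile(r"[?$]([A-Za-z0-9_]+)")


def variable_names(query):
    """Return the set of variable names, ignoring which marker was used."""
    return set(VAR_PATTERN.findall(query))


if __name__ == "__main__":
    query = "SELECT ?s $o WHERE { ?s <http://example.org/p> $o }"
    print(sorted(variable_names(query)))
```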