Towards Fully-fledged Archiving for RDF Datasets

Tracking #: 2700-3914

Olivier Pelgrin
Luis Galàrraga
Katja Hose

Responsible editor: Guest Editors Web of Data 2020

Submission type: Full Paper
The dynamicity of RDF data has motivated the development of solutions for archiving, i.e., the task of storing and querying previous versions of an RDF dataset. Querying the history of a dataset finds applications in data maintenance and analytics. Notwithstanding the value of RDF archiving, the state of the art in this field is under-developed: (i) most existing systems are neither scalable nor easy to use, (ii) there is no standard way to query RDF archives, and (iii) solutions do not exploit the evolution patterns of real RDF data. On these grounds, this paper surveys the existing works in RDF archiving in order to characterize the gap between the state of the art and a fully-fledged solution. It also provides RDFev, a framework to study the dynamicity of RDF data. We use RDFev to study the evolution of YAGO, DBpedia, and Wikidata, three dynamic and prominent datasets on the Semantic Web. These insights set the ground for the sketch of a fully-fledged archiving solution for RDF data.

Solicited Reviews:
Review #1
By Natanael Arndt submitted on 12/Feb/2021
Review Comment:

The minor revision was successful from my point of view. Also the newly introduced and edited sections are nice to read.

In my eyes the paper can be accepted. Nice work!

Remaining pedantic note:

p13 l24-25 "… is physically stored in text files (e.g. N-quads files)" -> it has to be "i.e." (id est, that is) instead of "e.g." (exempli gratia, for example). In our paper we wrote that we use N-quads files, but we have since changed to N-triples, so you could write "(i.e. N-quads files, or N-triples files in the latest implementation)".

Review #2
By Pascal Molli submitted on 11/Apr/2021
Review Comment:

Dear All,

I have replied to the authors below, prefixing each of my replies with a "*". I have kept parts of the authors' responses to give my replies context.

The major difficulty of scalable RDF archiving lies in handling the different trade-offs between disk usage, query runtime, and ingestion time, which can be scenario-dependent. We discuss the interaction between those factors in Section 7, which has now been split into two subsections: Subsection 7.1 discusses the features that a fully-fledged RDF archiving solution should offer, whereas Subsection 7.2 elaborates on the algorithmic and design challenges implied by those functionalities.

* I think that splitting Section 7 greatly improves its readability. I'm still not 100% convinced by all the arguments, but the overall challenges seem fine to me.

Thank you for sharing this observation with us. We have updated the conclusions in Section 4.4 accordingly to reflect the fact that there are not only two evolution patterns; instead, each dataset exhibits a different pattern: even though DBpedia and YAGO both have major and minor releases, DBpedia does show negative growth ratios. We now emphasize more clearly that a fully-fledged solution should be able to handle different evolution patterns.

* Ok. I prefer this last conclusion.

To summarize, the authors took all my remarks into account, and I broadly agree with the conclusions of the paper.