A Toolset for Supporting Evolution and Preservation of Linked Data: the DIACHRON approach

Tracking #: 1019-2230

Panagiotis Hasapis
Danae Vergeti
Aggelos Liapis
Antonis Ramfos
Giorgos Flouris
Kostas Stefanidis
Ioannis Chrysakis
Yannis Roussakis
Marios Meimaris
George Papastefanatos
Yannis Stavrakas
Christos Pateritsas
Theodora Galani
Peter Buneman
James Cheney
Slawomir Staworko
Stratis D. Viglas
Jeremy Debattista
Natalja Friesen
Loïc Petit
Simon Jupp
Tony Burdett
Robert Isele
Knud Möller

Responsible editor: 
Rinke Hoekstra

Submission type: 
Tool/System Report
Over the course of the last few years, a vast and rapidly increasing quantity of scientific, corporate, government and crowd-sourced data has been published on the emerging Data Web for open access. Open Data is expected to play a catalyst role in the way structured information is exploited at large scale, offering great potential for building innovative products and services that create new value from already collected data. Open data published according to the Linked Data paradigm is essentially transforming the Web from a document-publishing environment into a knowledge ecosystem where users have become active data aggregators and generators themselves. A traditional view of digital preservation, pickling data and locking it away for future use like groceries, would conflict with this evolution. A number of approaches and frameworks, such as the LOD2 stack, manage the full life-cycle of the Data Web. More specifically, these techniques are expected to tackle major issues such as the synchronization problem (how can we monitor changes?), the curation problem (how can data imperfections be repaired?), the appraisal problem (how can we assess the quality of a dataset?), the citation problem (how can we cite a particular version of a linked dataset?), the archiving problem (how can we retrieve the most recent or a particular version of a dataset?), and the sustainability problem (how can we spread preservation, ensuring long-term access?). In this paper we describe DIACHRON, a unified semantic platform for supporting the evolution and preservation of Linked Datasets, with modules that tackle each of these issues. With regard to the synchronization problem, our approach allows the identification and analysis of the evolution of a dataset in an efficient, user-friendly and customizable manner.
The proposed solution allows the execution of queries spanning multiple versions, as well as queries about the evolution itself (rather than just the data). For the citation problem, we describe a rule-based mechanism for specifying, extracting, and assigning citable persistent identifiers to diachronic resources. For the appraisal problem, a sequential process efficiently assesses a dataset's quality, providing the user with the necessary quality metadata and a quality problem report. With regard to the archiving problem, we have designed and developed a conceptual model that captures both structural and semantic aspects of evolving data, thus enabling evolution management at different granularity levels. On top of this model we have implemented a query language, an extension of SPARQL, that inherently supports querying evolving entities and their changes across time. The platform is validated through three real-life use cases: a business case, a life science case, and an Open Data case.
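To picture what the abstract means by querying evolving entities across versions, here is a minimal sketch of such a query. This is an illustration only: the `diachron:` prefix URI, class and property names below are assumptions made for exposition, not the actual DIACHRON syntax, which the abstract does not specify.

```sparql
# Illustrative sketch only: the vocabulary below is assumed,
# not taken from the DIACHRON specification.
PREFIX diachron: <http://example.org/diachron#>
PREFIX rdfs:     <http://www.w3.org/2000/01/rdf-schema#>

# List all recorded versions of a diachronic dataset, i.e. a query
# that spans multiple versions rather than a single snapshot.
SELECT ?version ?label
WHERE {
  ?dataset a diachron:DiachronicDataset ;
           diachron:hasVersion ?version .
  ?version rdfs:label ?label .
}
```

In a plain SPARQL endpoint this would only work if every version were materialized as a separate resource; the point of a version-aware extension is to make such cross-version access a first-class construct.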

Solicited Reviews:
Review #1
By Christophe Guéret submitted on 20/Apr/2015
Major Revision
Review Comment:

This system paper describes an integrated approach for the preservation of Linked Open Data. Preserving Linked Data is an emerging, relevant and pressing issue that is well worth discussing in the community. The paper is well written and the approach interesting, but I think the overall clarity of the description suffers from the page limitation. There is too much to be said in too little space, hence many aspects are left unclear. I suggest the authors leave some aspects of the system out in order to save space to better describe the others. Interested readers could be invited to browse a more exhaustive description on the system's web site.

Here are some more specific comments, in no particular order:

* There is no proof of an actual or anticipated adoption of the approach / tool.

* A reference to https://www.openphacts.org/ could be added in the life sciences use-case of section 2.2.

* It is not accurate that INSEE does not publish LOD (see section 2.3): http://rdf.insee.fr/

* In section 3, I suppose that "used by" was meant where the platform is said to "adjust to" the needs of the pilot applications. Otherwise this sounds as if every use case would have to deploy its own custom DIACHRON instead of using a general one. I reckon it would be more interesting for an archive to offer an instance of the platform as a service catering to the needs of several customers.

* The "Data storage module" cited in 4.1.2 is not present in Figure 1.

* The description of Figure 1 could be enhanced to describe the meaning behind the color coding used

* It could be interesting to swap 4.2 and 4.1 in order to start by describing the lower level components and then move up toward the end-user interfaces

* The relation between the Quality and Cleaning service (apparently aka the "Validation Repair Service" in Fig. 1) is unclear. Is it the quality of the cleaned dataset that is assessed? Further, I do not trust that such a cleaning service can detect "incomplete, incorrect and inaccurate facts". Incompleteness does not make much sense under an open-world assumption. Incorrectness is pretty much a matter of inconsistency on the Semantic Web; it can be argued that all statements are correct from a logical point of view. Inaccuracy is more interesting but very challenging: finding that some factual information is inaccurate and fixing it is an issue that goes beyond the scope of preserving an evolving KB.

* The scalability section is very vague. I would take it out unless it could be made more descriptive. By default the reader will trust the system is implemented in a scalable way.

* There is no link to a supporting web site where a running system and/or some more documentation about it could be found. At the very least a link to the project web site should be provided.

* It would be very interesting to have a short piece of text explaining how DIACHRON differs from https://archive.org/ in the way it works. The described approach also shows some similarities with http://sindice.com and http://lodlaundromat.org/ that could be worth discussing.

* Archivists would surely appreciate reading an explanation of how DIACHRON fits the OAIS data model (http://en.wikipedia.org/wiki/Open_Archival_Information_System)

Back to a more general point of view, I do not agree that the preservation of Linked Open Data should extend to active crawling on the Web. Evolution of the data is a fact, but its preservation should be a publisher-driven activity. It is (IMHO) up to the publisher to decide what is worth preserving and when, and then take an active action to send the content to a trusted digital repository. There is also a massive amount of LOD that is *not* on the Web and can thus not be crawled (e.g. projects publishing datasets under example.org and using them only via SPARQL). But this is only my own 2 cents so nothing much to worry about ;-)

Review #2
By Eero Hyvonen submitted on 04/Jun/2015
Review Comment:

This tools and systems paper presents a framework and a toolset, called DIACHRON, for Linked Data (LD) publication.

The work covered addresses a wide variety of general problem areas in such work, such as data synchronization, curation, quality, citation, and archiving.
The topic fits well in the journal and the paper is generally well-written and finished.

In section 1 the challenges addressed in the paper are listed. The authors argue that the issues "LOD is Structured", "LOD are Dynamic" and "LOD are Distributed" have not been "actually addressed" before, yet no references to other LD publication and preservation works are given. However, there are lots of LD platforms and publication projects around, and the relation and contributions of DIACHRON w.r.t. these remain unclear.

The paper then lists five problems to be addressed (data monitoring, evolution, spatio-temporal quality, citation, and preservation). These are quite wide topics, and it is unclear what problems in particular are addressed and how this work contributes to the state of the art. No references to related work are given here.

Then, in section 2, three use cases for the framework are explained. A wide variety of issues is raised. The discussion remains on a very generic level. More focus and detail is needed.

Sections 3 and 4 present the platform and services of DIACHRON on a general level. It would be nice to learn more about why this architecture was chosen and how it relates to other LD publication systems.

According to the text, the system has not been implemented and deployed, but "will be". Later on in the paper, it is said that there is a first prototype implemented. In a journal article, one needs to present evaluated results with pin-pointed contributions to the state of the art.

Finally, five general qualities of the system are listed as a kind of goals of the work. No evaluations of the framework w.r.t. these are presented, nor any supporting argumentation, and the contributions of the work remain unclear.

Review #3
Anonymous submitted on 11/Jun/2015
Review Comment:

(1) Quality, importance, and impact of the described tool or system

This paper is a plan for a set of tools for handling and managing Linked Data. There seem to be no actual tools yet to report on, only initial plans. Besides this, the authors give the impression that they solve "challenges not actually addressed by past and on-going Linked Data and Digital Preservation projects". I would propose a proper literature review: there are a lot of relevant high-quality research papers, for instance on modeling change and on identity resolution, or on addressing data quality issues in the Semantic Web research area (and in related areas).

(2) Clarity, illustration, and readability of the describing paper

The paper is quite easy to read. However, the reader gets the impression that the paper is made out of a project proposal rather than results from a well-designed research setting.