DataGraft: One-Stop-Shop for Open Data Management

Tracking #: 1285-2497

Authors: 
Dumitru Roman
Nikolay Nikolov
Antoine Pultier
Dina Sukhobok
Brian Elvesæter
Arne Berre
Xianglin Ye
Marin Dimitrov
Alex Simov
Momchill Zarev
Rick Moynihan
Bill Roberts
Ivan Berlocher
Seon-Ho Kim
Tony Lee
Amanda Smith
Tom Heath

Responsible editor: 
Rinke Hoekstra

Submission type: 
Tool/System Report
Abstract: 
This paper introduces DataGraft (https://datagraft.net/) – a cloud-based platform for data transformation and publishing. DataGraft was developed to provide better and easier to use tools for data workers and developers (e.g. open data publishers, linked data developers, data scientists) who consider existing approaches to data transformation, hosting, and access too costly and technically complex. DataGraft offers an integrated, flexible, and reliable cloud-based solution for hosted open data management. Key features include flexible management of data transformations (e.g. interactive creation, execution, sharing, reuse) and reliable data hosting services. This paper provides an overview of DataGraft focusing on the rationale, key features and components, and evaluation.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Christophe Guéret submitted on 04/Apr/2016
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

The manuscript describes a cloud-hosted data management solution called "DataGraft". The paper covers the system, some use cases, and related tools.

# Quality, importance and impact of the system
The system is technically sound and is likely to be used. However, the evidence provided in Section 3 fails to demonstrate adoption outside the authors' group. All of the given examples of DataGraft usage are from co-authoring companies/individuals. This is not a bad thing per se, but it would be interesting to discuss how much the tool is used outside the community that built it. This would, IMHO, be more telling than saying there are 181 users having entered about 500 data transformations.

In terms of related work, I was surprised not to read about DataLift (http://datalift.org/), ClioPatria (http://cliopatria.swi-prolog.org/home) or LDIF (http://ldif.wbsg.de/). That list could probably be extended; these are only the names off the top of my head. The issue of lowering data publishing costs is not new and many tools have been proposed to tackle it. I agree with the authors that having such tools offered as a ready-to-use, web-hosted system is convenient and important, but I would not consider this a "notable differentiation" of DataGraft that would rule out the LOD2 stack and others. As DataGraft is provided as a cloud-hosted solution, it would be fairer to compare it to cloud-hosted versions of alternative platforms, that is, to explain how DataGraft would compare to a ready-to-use cloud-hosted version of the LOD2 or DataLift stack.

Finally, I am a bit puzzled about the position of DataGraft versus PublishMyData and think it would be interesting to clarify this a bit more, especially considering that Swirrl is co-authoring. Will DataGraft be the free alternative to PublishMyData? Or will the two platforms eventually merge back into a freemium model? Are, or will, DataGraft users who reach the limits of the free account be invited to move over to PublishMyData?

# Clarity, illustration and readability of the paper
The paper is clear and informative. Only a few points deserve a bit more precision:

* CSV import: it seems that Grafter is only able to import well-formed CSV that follows a clear and straightforward structure of one record per row and one field per column. Is that so? What could the system do with "messy" tabular data as input? For instance, with data like http://www.volkstellingen.nl/web/excel/BRT_1899_01_T.xls . Furthermore, I was surprised to see that this part of the platform does not rely on https://www.w3.org/TR/csv2rdf/ . Was there any practical reason to discard this recommendation and go for another solution?

* Licensing: the (machine-readable) licensing of datasets is increasingly becoming a pressing issue holding back the consumption of some datasets, even those described as open data. For example, and for various reasons, http://res.space cannot index LOD that does not come with an HTTP Link header pointing to an open license. It is unclear whether DataGraft can handle licensing, either from a data entry or a data publication point of view.

* Platform: some parts of the description provided on page 8 do not completely reflect what is pictured in Figure 6. It would be useful to revise this figure to name all the components after the labels used in the description ("Load Balancer", "Routing nodes", "Integration services", ...)
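To make the licensing point above concrete, the following is a minimal sketch of the kind of check a consumer might run on an HTTP `Link` header to find a machine-readable license. It is not taken from DataGraft or res.space; the function name `license_links` and the sample header are hypothetical, and the parsing is deliberately simplified (it does not handle commas inside quoted parameter values).

```python
import re

def license_links(link_header):
    """Return the target URIs of links whose rel includes 'license'.

    Simplified parser (assumption): parameters must not contain
    commas or semicolons inside quoted strings.
    """
    targets = []
    # Each link looks like: <uri>; param=value; param="value"
    for match in re.finditer(r'<([^>]*)>((?:\s*;\s*[^,;]+)*)', link_header):
        uri, params = match.group(1), match.group(2)
        rel = re.search(r'\brel\s*=\s*"?([^";]+)"?', params, re.IGNORECASE)
        # A rel value may list several space-separated relation types
        if rel and "license" in rel.group(1).lower().split():
            targets.append(uri)
    return targets

# Hypothetical header as a dataset server might send it
sample_header = (
    '<https://creativecommons.org/licenses/by/4.0/>; rel="license", '
    '<http://example.org/doc>; rel="describedby"'
)
found = license_links(sample_header)
```

A dataset response without any `rel="license"` link yields an empty list, which is the case the reviewer describes as blocking indexing.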

Lastly, the sentence "Further information about semantic graph database can be found in [6]" is unclear: does [6] provide general information about all semantic graph databases, or only about the one presented?

Review #2
Anonymous submitted on 03/May/2016
Suggestion:
Minor Revision
Review Comment:

The paper presents a tool for managing open datasets in a cloud-based platform, dubbed DataGraft.
Using the tool, it is possible to define, execute and monitor pipelines for publishing raw data as open data or linked data. The main components (backend and frontend) were introduced.

The paper was submitted as 'Tools and Systems Report'. It is reviewed considering two dimensions: (1) Quality, importance, and impact of the described tool or system; and (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool. The points listed below should be improved for the paper to be accepted.

Minor details:
a) Is it “open linked data” or “linked open data”?
b) To keep the same order for enumeration ahead, change “by the namespaces grafter.rdf and grafter.tabular” to “by the namespaces grafter.tabular and grafter.rdf”.
c) If possible, as in “3.2 Transforming and Publishing Environmental Data”, the numbers for the PLUQI example could be detailed (how many data transformations, uploaded files, and so on).
d) Could Figure 2 and Figure 9 be merged? They express almost the same issues. In this case, Table 1 should be moved.
e) Change “Related Systems” to “Related Work”.

Mandatory details to fix or improve:
a) Figure 3 is too dark compared to the others.
b) Figures 4 and 5 should represent the same example.
c) Figure 7 could be divided into part (a) and part (b), as in Figure 4.
e) In Figure 7, scroll the query to show the beginning of the SPARQL query.
f) Table 1 and Figure 9 do not correspond – see Semantic graph DBaaS and Semantic graph DB.

Appreciated details:
a) Could Figure 5 be reworked so that both the Pipeline and RDF mapping views are depicted? The reader could then easily correlate parts A and B of Figure 4 with parts A and B of Figure 5.

Review #3
By Wouter Beek submitted on 10/Jun/2016
Suggestion:
Minor Revision
Review Comment:

Review “DataGraft: One-Stop-Shop for Open Data Management”

First of all, I would like to thank the authors for building what
seems to be a very useful tool and for writing a very readable and
also interesting paper.

# Overview

This paper presents DataGraft, a tool that seeks to lower the threshold
for (1) data publication and (2) data consumption. The paper
identifies the shortcomings of current tools and explains the features
and benefits of DataGraft. I applaud the authors for painting what I
believe to be a fairly comprehensive picture of current issues.
However, I found some of the issues/features/benefits to be spread
through the text without much structure. I've distilled the following
lists:

Current approaches exhibit the following ISSUES:

1. Data preparation: (1a) technical complexity; (1b) poor toolkit
integration; (1c) requires expert knowledge (which is costly?).

2. Data publication: (2a) reliability, (2b) scalability and (2c)
sustainability? cost.

3. Data consumption: (3a) data is distributed over a vast number of
nodes (findability); (3b) different versions of the same dataset
exhibit structural differences.

4. UX: (4a) interactivity; (4b) repeatability; (4c) shareability; (4d)
scalability

DataGraft focuses on the following FEATURES:

1. Reliable cloud-based hosting.

2. Flexible data transformations.

DataGraft provides the following BENEFITS:

1. Reduced cost.

2. Reduced technical complexity.

In ISSUES and BENEFITS the term `technical complexity' is used.
Although I have some idea of what this may denote it may be good to
qualify this term a bit more.

# Main issues

1. DataGraft is only evaluated qualitatively, not quantitatively.

2. It is unclear what DataGraft is evaluated against. For instance,
on p11 it is stated that the use of DataGraft has resulted in a
cost reduction of 23% “compared to traditional approaches”. Are
these the approaches described in Section 4 (‘Related Systems’)?

3. When discussing existing systems it is not made clear that the
existing systems indeed suffer from the ISSUES identified earlier
(see also my enumeration above). Which existing system exhibits
which ISSUES? And which ISSUES does DataGraft solve? For
instance, it is not clear to me whether/how DataGraft solves the
distributed data problem (copying would introduce the problem of
data ownership and would result in out-of-sync data, problems
that may be as big or even bigger). Why does DataGraft reduce
expert knowledge costs more than other solutions? It is also not
clear why toolkit integration of DataGraft is better than the
existing LOD pipelines (some are reusable OS images with
centralized configuration of interacting software components).
The DIY benefit of DataGraft is very clearly explained BTW (other
stacks need to be installed/configured/maintained).

# Minor issues / clarifications

In Section 1 the authors make a _quantitative_ comparison of Open Data
WRT the Internet. The authors may point out here that data differs
considerably in various _qualitative_ aspects as well, e.g., most data
on the Internet today is only partially structured or not structured
at all.

The authors mention that there are limitations and difficulties with
data publishing and reuse. At the same time the authors claim at the
beginning of page 2 that “Open data is increasingly showing effects in
solving problems [...]”. If the limitations and difficulties are
still in effect, how then can the problem-solving capability of open
data increase over time?

On p2 the authors claim that the majority of Linked Data is converted
from tabular data (“most often tabular data”). It would be nice to
have some empirical proof for such a quantitative claim, e.g., in the
form of a literature reference (if such statistics are available, that
is). Off the top of my head: a quantitative analysis based on the
metadata stored in data catalogs such as CKAN might give an inkling.

The various steps that make up the data processing pipeline are
enumerated but not defined. E.g., terms like ‘cleaning’,
‘transforming’ and ‘preparation’ are still somewhat abstract. Is all
cleaning also preparing? Can some transformations be cleaning tasks
as well? Maybe the acts of cleaning and transformation result in data
that is ‘prepared’? Then preparation would not be a separate step but
rather a state that data can be in after some steps have been taken,
etc.

The authors mention that an ontology “represents a data model”. I’m
not sure what ‘representing a data model’ means. An ontology _has_ a
data model, of course, but so does instance data.

What is a ‘semantic RDF graph’? The semantics of RDF graphs is
well-defined and is tightly coupled to RDF (abstract) syntax. As
such, a non-semantic RDF graph cannot be syntactically expressed.

On p2 the authors state that triple stores make accessing data easy
for users. It would be illustrative to describe the group of users
for which this is the case. I expect triple store-mediated access to
data to be very difficult for some groups of users!

Why is it a problem that toolkits for Linked Data preparation require
expert knowledge? Are experts generally unable to articulate their
knowledge in a way that is understandable (implicit versus explicit
knowledge)? Or are experts generally able to articulate their
knowledge in an understandable way but are they too scarce, expensive
or otherwise (culturally?) reluctant to effectively collaborate with?
This is of course the traditional expert bottleneck problem from KR,
but it would be good to substantiate the claim for the specifics of
Linked Data (maybe with a pointer to literature establishing this
problem?).

On p4 we read: “Transformations implemented as pure functions on
immutable data, which makes the logic of the transformation process
significantly easier to reason about.” Why are pure functions better
than something else? What else is there? Are there non-pure
functions? What types of reasoning can be applied to the pure
functions implementing the transformation process?

The authors use a streaming approach. This is very good WRT the
scalability requirement! However, this choice also limits the kinds
of transformations that can be applied to the data. It would also be
interesting to know what window is chosen (one column, one row, one
cell, something else?). To give a very simple example: if I want to
determine the RDF datatype for values in a certain column I must first
stream through all column values in order to determine which datatypes
hold all/most of those values in their value space. Also, after
checking for candidates for transformations, does DataGraft have to
stream a second time in order to transform the values to RDF typed
literals?
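
The two-pass concern raised above can be illustrated with a small sketch. It is not DataGraft or Grafter code: the helper names (`infer_datatype`, `typed_literals`, `open_rows`), the sample data, and the simplified lexical patterns are all assumptions made for illustration. The first pass streams the column to eliminate datatypes whose value space does not cover every value; only then can a second pass emit typed literals.

```python
import csv
import io
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

# Candidate datatypes, most specific first. The patterns are simplified
# approximations of the XSD lexical forms (assumption).
CANDIDATES = [
    (XSD + "integer", re.compile(r"[+-]?\d+")),
    (XSD + "decimal", re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")),
]

def infer_datatype(rows, column):
    """Pass 1: stream all rows once, dropping candidates a value violates."""
    alive = list(CANDIDATES)
    for row in rows:
        value = row[column]
        alive = [(iri, pat) for iri, pat in alive if pat.fullmatch(value)]
        if not alive:
            break  # nothing fits; fall back to plain strings
    return alive[0][0] if alive else XSD + "string"

def typed_literals(rows, column, datatype):
    """Pass 2: stream again, emitting N-Triples-style typed literals.

    String escaping is omitted for brevity.
    """
    for row in rows:
        yield f'"{row[column]}"^^<{datatype}>'

def open_rows(text):
    """Open a fresh row stream; a second pass needs a second stream."""
    return csv.DictReader(io.StringIO(text))

SAMPLE = "station,reading\nA,12\nB,7.5\nC,-3\n"
datatype = infer_datatype(open_rows(SAMPLE), "reading")
literals = list(typed_literals(open_rows(SAMPLE), "reading", datatype))
```

Here the value `7.5` rules out the integer candidate only at the second row, so no correct literal for the first row could have been emitted during the first pass.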

On p4, what are ‘graph templates’? Can they be defined or enumerated?

On p4 the streaming benefit is explained a second time IINM. Should
this be merged into one benefit? Also, what is a ‘melt operation’?
What does normalization mean here?
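
For readers unfamiliar with the term, the sketch below assumes that 'melt' carries its usual meaning in tabular tooling, namely a wide-to-long reshape; whether the paper's operation matches this is an assumption, and the function and column names are hypothetical. It also shows why such an operation is compatible with streaming: each input row is processed independently.

```python
def melt(rows, id_columns):
    """Reshape wide rows into long rows, one (variable, value) pair each.

    Works row by row, so the input can be an arbitrary stream.
    """
    for row in rows:
        ids = {key: row[key] for key in id_columns}
        for key, value in row.items():
            if key not in id_columns:
                yield {**ids, "variable": key, "value": value}

# Hypothetical wide table: one column per year
wide = [
    {"region": "Oslo", "2014": "10", "2015": "12"},
    {"region": "Bergen", "2014": "4", "2015": "5"},
]
long_rows = list(melt(wide, ["region"]))
```

Each of the two wide rows yields two long rows, which is a shape that maps more directly onto one RDF resource per observation.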

On p5 function names are mentioned as if they have been defined
earlier, but they are not: ‘derive-column’, ‘fill-when’, ‘melt’.

The separation into pipes and grafts is explained well! Why do users
prefer transformations on pipes instead of grafts? Is it because of
familiarity with the spreadsheet paradigm or is it because graph
navigation tooling is not sufficiently developed yet?

There is a strong focus on reusability of data transformation tasks
(nice!). The use cases show that transformation reuse takes place
within the same task and by the same group of users (already a great
thing!). Can the authors indicate how often transformations are
reused between tasks and between user groups? What is the cost of
searching for existing transformations versus the cost of building a
transformation anew?

When discussing existing systems the following point seemed a bit
underspecified: “OpenRefine also has security issues”. Can the
authors indicate what issues these are? Are they described in
literature or online?

Streaming seems to be erroneously mentioned on p16 as future work?

# Spelling errors

- p3, “or on creating” → “or for creating”
- p3, “an[d] overview of”
- p4, ”in further detail[s]”
- p4, “the maximum size of [the] dataset that can be processed”
- p13, “it's” → “it is”
- p13, “[a] live services”
- p14, “[a] central access”
- p15, “open linked data” → “linked open data”