Review Comment:
Review “DataGraft: One-Stop-Shop for Open Data Management”
First of all, I would like to thank the authors for building what
seems to be a very useful tool and for writing a very readable and
interesting paper.
# Overview
This paper presents DataGraft, a tool that seeks to lower the threshold
for (1) data publication and (2) data consumption. The paper
identifies the shortcomings of current tools and explains the features
and benefits of DataGraft. I applaud the authors for painting what I
believe to be a fairly comprehensive picture of current issues.
However, I found some of the issues/features/benefits to be spread
throughout the text without much structure. I've distilled the following
lists:
Current approaches exhibit the following ISSUES:
1. Data preparation: (1a) technical complexity; (1b) poor toolkit
   integration; (1c) requires expert knowledge (which is costly?).
2. Data publication: (2a) reliability; (2b) scalability; (2c)
   sustainability/cost.
3. Data consumption: (3a) data is distributed over a vast number of
   nodes (findability); (3b) different versions of the same dataset
   exhibit structural differences.
4. UX: (4a) interactivity; (4b) repeatability; (4c) shareability;
   (4d) scalability.
DataGraft focuses on the following FEATURES:
1. Reliable cloud-based hosting.
2. Flexible data transformations.
DataGraft provides the following BENEFITS:
1. Reduced cost.
2. Reduced technical complexity.
In both ISSUES and BENEFITS the term ‘technical complexity’ is used.
Although I have some idea of what this may denote, it would be good to
qualify this term a bit more.
# Main issues
1. DataGraft is only evaluated qualitatively, not quantitatively.
2. It is unclear what DataGraft is evaluated against. For instance,
on p11 it is stated that the use of DataGraft has resulted in a
cost reduction of 23% “compared to traditional approaches”. Are
these the approaches described in Section 4 (‘Related Systems’)?
3. When discussing existing systems it is not made clear that the
existing systems indeed suffer from the ISSUES identified earlier
(see also my enumeration above). Which existing system exhibits
which ISSUES? And which ISSUES does DataGraft solve? For
instance, it is not clear to me whether/how DataGraft solves the
distributed data problem (copying would introduce the problem of
data ownership and would result in out-of-sync data, problems
that may be as big or even bigger). Why does DataGraft reduce
expert knowledge costs more than other solutions? It is also not
clear why DataGraft's toolkit integration is better than that of the
existing LOD pipelines (some are reusable OS images with
centralized configuration of interacting software components).
The DIY benefit of DataGraft is very clearly explained, by the way
(other stacks need to be installed/configured/maintained).
# Minor issues / clarifications
In Section 1 the authors make a _quantitative_ comparison between Open
Data and the Internet. The authors may point out here that data differs
considerably in various _qualitative_ aspects as well, e.g., most data
on the Internet today is only partially structured or not structured
at all.
The authors mention that there are limitations and difficulties with
data publishing and reuse. At the same time the authors claim at the
beginning of page 2 that “Open data is increasingly showing effects in
solving problems [...]”. If the limitations and difficulties are
still in effect, how then can the problem-solving capability of open
data increase over time?
On p2 the authors claim that the majority of Linked Data is converted
from tabular data (“most often tabular data”). It would be nice to
have some empirical proof for such a quantitative claim, e.g., in the
form of a literature reference (if such statistics are available, that
is). Off the top of my head: a quantitative analysis based on the
metadata stored in data catalogs such as CKAN might give an inkling,
along the lines of the sketch below.
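For instance (purely as an illustration: the endpoint and facet field
follow CKAN's standard Action API, but the instance URL and the list of
tabular formats are my own assumptions), one could tally resource
formats like this:

```python
import json
import requests

CKAN = "https://demo.ckan.org"  # example instance; any public CKAN portal works

# rows=0: we only want the facet counts, not the datasets themselves.
resp = requests.get(
    f"{CKAN}/api/3/action/package_search",
    params={"rows": 0, "facet.field": json.dumps(["res_format"])},
)
resp.raise_for_status()
counts = resp.json()["result"]["facets"]["res_format"]  # format -> count

# Rough proxy: share of tabular formats among all catalogued resources.
TABULAR = {"CSV", "TSV", "XLS", "XLSX", "ODS"}
total = sum(counts.values())
tabular = sum(n for fmt, n in counts.items() if fmt.upper() in TABULAR)
print(f"tabular: {tabular}/{total} ({tabular / max(total, 1):.0%})")
```

Such a count would of course only cover catalogued resources, not what
actually gets converted to Linked Data, but it would at least ground
the claim.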
The various steps that make up the data processing pipeline are
enumerated but not defined. For example, terms like ‘cleaning’,
‘transforming’ and ‘preparation’ remain somewhat abstract. Is all
cleaning also preparing? Can some transformations be cleaning tasks
as well? Maybe the acts of cleaning and transformation result in data
that is ‘prepared’? Then preparation would not be a separate step but
rather a state that data can be in after some steps have been taken,
etc.
The authors mention that an ontology “represents a data model”. I’m
not sure what ‘representing a data model’ means. An ontology _has_ a
data model, of course, but so does instance data.
What is a ‘semantic RDF graph’? The semantics of RDF graphs is
well-defined and is tightly coupled to RDF (abstract) syntax. As
such, a non-semantic RDF graph cannot be syntactically expressed.
On p2 the authors state that triple stores make accessing data easy
for users. It would be illustrative to describe the group of users
for which this is the case. I expect triple store-mediated access to
data to be very difficult for some groups of users!
Why is it a problem that toolkits for Linked Data preparation require
expert knowledge? Are experts generally unable to articulate their
knowledge in a way that is understandable (implicit versus explicit
knowledge)? Or are experts generally able to articulate their
knowledge in an understandable way, but too scarce, too expensive, or
otherwise (culturally?) difficult to collaborate with effectively?
This is of course the traditional expert bottleneck problem from KR,
but it would be good to substantiate the claim for the specifics of
Linked Data (maybe with a pointer to literature establishing this
problem?).
On p4 we read: “Transformations implemented as pure functions on
immutable data, which makes the logic of the transformation process
significantly easier to reason about.” Why are pure functions better
than something else? What else is there? Are there non-pure
functions? What types of reasoning can be applied to the pure
functions implementing the transformation process?
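To make the question concrete, here is a minimal sketch (my own, not
DataGraft's code) of the contrast I assume the authors intend:

```python
# Pure: the input row is untouched and the output depends only on the
# input, so the step can be replayed, cached, or reordered safely.
def uppercase_name(row: dict) -> dict:
    return {**row, "name": row["name"].upper()}

# Impure: mutates its argument in place; running the pipeline twice, or
# sharing the row between two steps, now silently changes the result.
def uppercase_name_in_place(row: dict) -> None:
    row["name"] = row["name"].upper()
```

If this is the intended contrast, then "easier to reason about"
presumably means referential transparency: equal inputs always give
equal outputs, regardless of execution order or repetition. Stating
this explicitly would answer the questions above.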
The authors use a streaming approach. This is very good WRT the
scalability requirement! However, this choice also limits the kinds
of transformations that can be applied to the data. It would also be
interesting to know what window is chosen (one column, one row, one
cell, something else?). To give a very simple example: if I want to
determine the RDF datatype for values in a certain column I must first
stream through all column values in order to determine which datatypes
hold all/most of those values in their value space. Also, after
checking for candidates for transformations, does DataGraft have to
stream a second time in order to transform the values into RDF typed
literals?
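To illustrate the two-pass scenario (all names below are illustrative;
this is not DataGraft's API):

```python
from typing import Iterable, Iterator

def infer_xsd_type(values: Iterable[str]) -> str:
    """Pass 1: stream every column value to find the narrowest datatype."""
    dtype = "xsd:integer"
    for v in values:
        if dtype == "xsd:integer":
            try:
                int(v)
                continue
            except ValueError:
                dtype = "xsd:decimal"
        try:
            float(v)
        except ValueError:
            return "xsd:string"  # nothing narrower can hold this column
    return dtype

def typed_literals(values: Iterable[str], dtype: str) -> Iterator[str]:
    """Pass 2: the source must be re-streamed to emit typed literals."""
    for v in values:
        yield f'"{v}"^^{dtype}'
```

With a forward-only stream, the second pass means either re-reading the
source or buffering it, which is exactly the cost the question above is
after.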
On p4, what are ‘graph templates’? Can they be defined or enumerated?
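My best guess at what is meant (the URIs and structure below are
entirely invented for illustration, not taken from the paper) is a
parameterized triple pattern instantiated once per tabular row:

```python
# Guess at a 'graph template': triple patterns with column placeholders,
# instantiated once per row. All URIs here are made up.
TEMPLATE = [
    ("http://example.org/person/{id}", "foaf:name", '"{name}"'),
    ("http://example.org/person/{id}", "foaf:mbox", "<mailto:{email}>"),
]

def instantiate(row: dict) -> list[tuple]:
    return [tuple(part.format(**row) for part in triple) for triple in TEMPLATE]

print(instantiate({"id": "42", "name": "Ada", "email": "ada@example.org"}))
```

If the notion is close to this, then templates can indeed be defined
(and even enumerated per dataset), and the paper should say so.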
On p4 the streaming benefit is explained a second time, if I am not
mistaken. Should this be merged into one benefit? Also, what is a
‘melt operation’? What does normalization mean here? (My guess at both
appears after the next paragraph.)
On p5 function names are mentioned as if they had been defined
earlier, but they have not: ‘derive-column’, ‘fill-when’, ‘melt’.
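For what it is worth, here is my guess at their semantics, based on
similar tabular DSLs; the paper should confirm or correct this:

```python
from typing import Callable, Iterable, Iterator

Row = dict

def derive_column(rows: Iterable[Row], new: str,
                  fn: Callable[[Row], object]) -> Iterator[Row]:
    """Guess: add a column computed from each row's existing cells."""
    for row in rows:
        yield {**row, new: fn(row)}

def fill_when(rows: Iterable[Row], col: str) -> Iterator[Row]:
    """Guess: fill empty cells in `col` with the last non-empty value."""
    last = None
    for row in rows:
        last = row[col] if row[col] not in (None, "") else last
        yield {**row, col: last}

def melt(rows: Iterable[Row], id_cols: list[str]) -> Iterator[Row]:
    """Guess: reshape wide to long, one (variable, value) pair per cell."""
    for row in rows:
        for col, val in row.items():
            if col not in id_cols:
                yield {**{k: row[k] for k in id_cols},
                       "variable": col, "value": val}
```

If ‘melt’ indeed follows this wide-to-long convention, then the
‘normalization’ mentioned earlier may simply refer to bringing the
table into this long shape; making that explicit would resolve both
questions.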
The separation into pipes and grafts is explained well! Why do users
prefer transformations on pipes instead of on grafts? Is it because of
familiarity with the spreadsheet paradigm, or is it because graph
navigation tooling is not sufficiently developed yet?
There is a strong focus on reusability of data transformation tasks
(nice!). The use cases show that transformation reuse takes place
within the same task and by the same group of users (already a great
thing!). Can the authors indicate how often transformations are
reused between tasks and between user groups? What is the cost of
searching for existing transformations versus the cost of building a
transformation anew?
When discussing existing systems the following point seemed a bit
underspecified: “OpenRefine also has security issues”. Can the
authors indicate what issues these are? Are they described in
literature or online?
Streaming seems to be erroneously mentioned on p16 as future work?
# Spelling and grammar errors
- p3, “or on creating” → “or for creating”
- p3, “an[d] overview of”
- p4, “in further detail[s]”
- p4, “the maximum size of [the] dataset that can be processed”
- p13, “it's” → “it is”
- p13, “[a] live services”
- p14, “[a] central access”
- p15, “open linked data” → “linked open data”