Review Comment:
Overall assessment
This paper has a number of interesting features, such as a real deployment, the application of the Datanode ontology, and license policy reasoning. Unfortunately, most of these aspects are discussed more fully in referenced papers rather than here. This paper focuses on an outline methodology to support quality improvement of datasets by ensuring that the appropriate provenance metadata is collected. It is not clear to me how this improves on the simple method of collecting provenance on all data processing activities. There is also a stated goal of enabling automated exploitability assessment for users of the datasets, but as the conclusion itself acknowledges, the application of this to actual legal decision making is unlikely: "However, while this [exploitability] assessment is part of an early analysis, when the user wants to assess whether a given dataset is eligible to be adopted, we expect this assessment to be performed manually, on a case by case basis."
Hence, despite some good work, the twin problems of this paper are that it sits at the edge of the planned scope for the special issue (metadata quality enhancement) and that the contribution of the work described here is limited (as opposed to the wider scope of the project it describes, which is very interesting). This is compounded by the patchy presentation of the paper, with many typos and a lack of a clear focus (data catalogue vs. data hub) or logical flow in some sections.
(1) Originality
The specific deployment scenario described in this paper is original. However, the methodology contains many elements common to data quality/data lifecycle systems, policy-based management systems, and automated license processing. Only the last of these is adequately covered in the paper's related work section; I provide some links below on the other topics. Nonetheless, these are all open areas of research, and so more work like that described here is welcome. The advances in this paper seem incremental compared with the other papers being published by the team of authors.
(2) Significance of the results
The research questions tackled by the paper are problematic, or at least their true value has not been made clear to me. Automatically making exploitability decisions seems to be a focus, but as the conclusion makes clear, if this is a legal decision made by a consumer, it is more realistic that the techniques will support human decision-making rather than supplant it. The earlier parts of the paper would be stronger if they emphasised this rather than the idealised case of automated decisions, especially given the lack of a trust infrastructure between the consumer and the data hub, which would seem to be a basic requirement for any distributed decision-making.
Deciding what provenance metadata to capture or present to the consumer in order to support their exploitability decisions is another open question: it is not demonstrated how the proposed approach is superior to simply capturing all data processing steps and making them available for the end user to query.
Finally, the metadata value chain architecture/methodology has a lot in common with many data quality lifecycle models, which exhibit huge diversity (see, for example, Data Life Cycle Models and Concepts, CEOS.WGISS.DSIG.TN01, Issue 1.0, September 2011), but it is not directly motivated. Moreover, when the MK Data Hub is used as an example of the methodology, it is not compelling as a validation, since the two were developed by the same team for the same use case.
It is hard to evaluate the significance of the results when no real evaluation of the methodology is provided. In Section 4 the MK Data Hub is explained as a use case for the methodology, but very little analysis or explanation of the technical underpinnings is provided; instead, the section reads at a use-case level. In Section 5 good work is done to identify a number of assumptions built into the methodology (although at times the analysis again drifts to the MK Data Hub rather than the methodology itself), but in many cases I take issue with the conclusions of even this lightweight evaluation. For example:
==Assumption 1.1==
"While we do not support complex policies at the moment, we could deal with it by user profiling
(with a commercial or non commercial account), or by including a taxonomy of usage contexts to consider separately, thus obtaining multiple policy sets depending on the usage context."
Would this not increase the complexity of the system, and has the trade-off been analysed? An implication of this is that not all policy information is then included in the policy model (instead it is split between the policy model and the usage contexts); would the PPP Reasoner have access to all of it? I think this makes it likely that implications of rules that cross contexts would be missed by the reasoner.
Assumptions 1.2 and 3.3 make a similar assumption about the ease of splitting the problem this way (and implicitly depriving the reasoner of knowledge). Of course, the exact implications are hard to evaluate, since the specification of the reasoner is out of the scope of the paper. In the end, this makes it hard for me to be convinced that Assumption 4.1 is satisfied without some evidence.
==Assumption 1.4==
Your discussion here is confusing to me because, on the one hand, you say that ODRL can support non-binary relations (violating your assumption) and, on the other, you say that as far as you have seen the binary assumptions are sufficient. Since your PPRs are based on ODRL, why is the assumption the correct one to make?
(3) Quality of writing
The overall structure and presentation of the paper is good.
There is a recurring confusion in the text as to whether the paper is about data catalogues or about data sharing platforms (i.e. the MK Data Hub) that have a broader scope. This leads to statements like this one (Section 1):
"It is clear however that, as the number and diversity of the datasets they need to handle is growing, there is a need for these systems to play a further role in fully supporting the delivery and reuse of datasets."
Is this a bit strong for a simple data catalogue? Doesn't this depend on the resources available and the role of the catalogue provider? What about the end-to-end principle of the Internet, which favours placing service intelligence at the edge rather than in the middle of the network?
Unfortunately there are a large number of typographic errors; see below for details.
There are a large number of RDF fragments included. I am not sure how much most of these add to the paper, as they use space that could otherwise be used for discussion of the details of the system. For example, Listing 1 in Section 4.1 would be much easier to read if you used prefixes rather than full URIs for all the terms, as in the sketch below.
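To illustrate (with hypothetical URIs, since I am not reproducing the actual content of Listing 1), compare the full-URI form used in the paper with the equivalent prefixed Turtle:

    # Full-URI form, as in the current listings (hypothetical terms):
    <http://data.example.org/dataset/air-quality>
        <http://purl.org/dc/terms/license>
        <http://purl.org/NET/rdflicense/cc-by4.0> .

    # Equivalent prefixed form, which is far easier to scan:
    @prefix ds:   <http://data.example.org/dataset/> .
    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix rdfl: <http://purl.org/NET/rdflicense/> .

    ds:air-quality dct:license rdfl:cc-by4.0 .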
The paper is not specific in a number of places where further elaboration would be useful. For example, Section 4.1 states "Moreover, the policies include a peculiar attribution requirement (Listing 4)." It would be useful to explain how it is peculiar rather than just stating this as a fact and expecting the reader to interpret the listing. E.g. how have you quantified how unusual it is? Do you just mean that it doesn't map well to ODRL? Areas where further clarification would be desirable are listed below under "Minor Comments".
Section 4.4 needs more detail on how policy propagation is actually performed, its limitations, advantages, etc. If this were done in the context of the specific use case, it would be a strong addition to the paper; a worked example is sketched below.
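To make the request concrete, even a small worked example of what propagation is expected to produce would help. The following sketch uses hypothetical terms and assumes a PROV-style derivation link, where the paper would presumably use its own Datanode relations:

    @prefix ds:   <http://data.example.org/dataset/> .
    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix prov: <http://www.w3.org/ns/prov#> .
    @prefix rdfl: <http://purl.org/NET/rdflicense/> .

    # Input: a source dataset with a license, and a dataset derived from it.
    ds:sensor-readings dct:license rdfl:cc-by4.0 .
    ds:daily-averages  prov:wasDerivedFrom ds:sensor-readings .

    # Is this the expected output of propagation, and does it always hold,
    # e.g. for share-alike licenses, or for processes that aggregate away
    # the original content?
    ds:daily-averages  dct:license rdfl:cc-by4.0 .

Spelling out cases like this would expose exactly where propagation is sound and where it needs human judgement.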
(4) Relevance to the call
The paper argues (Section 1) that "maximising the exploitability of data is an issue of quality of the catalogue itself". In my opinion, this makes the paper fairly peripheral to the SI call, in that it does not deal with data quality itself, but rather with the quality of the metadata, or, even more removed, of the service that provides the data.
=Minor Comments=
=Section 2=
Policy Reasoning -
"specific forms of policy compatibility assessment are also found in fields whose primary focus is tasks rather than data, as in workflow modelling for task delegation"
What about network management, where data access is a prime concern?
See for example: S. Davy, Harnessing Information Models and Ontologies for Policy Conflict Analysis, http://repository.wit.ie/1059/1/2008_SDavy_Thesis_final_v2.pdf
= Section 3 =
"Our methodology follows the Data life-cycle, which comprises four phases"
Surely it is only one of many possible life-cycles? Is this life-cycle central to your methodology? Does this limit the applicability of your results? You should discuss these points.
It also implies, to me, a confusion between a catalogue (which does not necessarily care about data lifecycles) and a data hub (which does).
"Processing: data are processed, manipulated and analysed in order to generate a new dataset, targeted to support some data-relying task"
Q: Why does the data hub do this processing; couldn't it be done at the client or the provider? How does the hub know what the client wants? It seems like a very centralised approach. This should be documented as a limitation.
=Section 3.2=
"This activity can be rather complex, including automatic and supervised methods, and going into
the details of it is out of scope for this article. What is important for us is that this phase should provide a sufficient amount of metadata in order to support data processing."
Q: This seems like a hard requirement to meet, since it is only very lightly specified. I think more detail is needed to scope things here.
= Section 3.4 =
"The exploitability task is indeed reduced to the assessment
of the compatibility between the actions performed by
the user’s application and the policies attached to the
datasets, with an approach similar to the one presented
in [16], for example using the SPIN-DLE reasoner described in [21]"
Q: But without trust between the consumer and the provider, how can this be done?
= Section 4 =
"Our hypothesis is that an end-to-end solution for exploitablity assessment can be developed by using stateof-the-art Semantic Web technologies."
Typically the possibility of developing a system is not a strong hypothesis, since, given sufficient time and resources, the flexibility of IT systems means "something" can always be developed. Hence it would be better to reformulate your hypothesis in terms of the limits, extent, or desirable properties of such a system.
= Section 5 =
==Ass 2.1==
Q: Should this assumption be changed to state that "Content metadata appropriate for ETL generated from the data source is available"? This seems to be what you actually need, rather than access to the data itself.
==Ass 3.3==
Q: It is not clear how license changes are handled, i.e. what happens when a dataset's license changes but the dataset itself does not? Does the ETL need to be run again? This needs to be made clearer.
==Ass 4.2==
"The user’s task need to be expressible in terms of ODRL policies, thus enabling reasoning on policies compatibility"
Q: Should this not be documented as a new assumption?
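For what it is worth, expressing a simple user task in ODRL terms does look feasible; a minimal sketch, with hypothetical URIs and an arbitrarily chosen action, might be:

    @prefix odrl: <http://www.w3.org/ns/odrl/2/> .
    @prefix ex:   <http://example.org/> .

    # The task "redistribute this dataset in my application",
    # recast as an ODRL Request policy:
    ex:user-task a odrl:Request ;
        odrl:permission [
            odrl:target   ex:air-quality-dataset ;
            odrl:action   odrl:distribute ;
            odrl:assignee ex:app-developer
        ] .

Whether more realistic tasks (combining several datasets, or commercial use with attribution obligations) remain expressible this way is precisely what should be stated, and argued for, as an assumption.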
=Typos and English Improvements=
=Sec 1 =
Typo: " applyed" -> " applied"
= Sec 3 =
typo: "to what extend" -> "to what extent"
typo: "In this Section" -> "in this section"
typos: "including Air quality and Soil moisture" -> "including _a_ir quality and _s_oil moisture"
typo: "given geospatials coordinates" -> "given geospatial_ coordinates"
Example of poor readability, re-phrase: "The aforementioned ward (see Figure 3 for some example data) and museum in Milton Keynes are examples of named entities the ECAPI may be queried for; but also, an arbitrary geographical area within a fixed radius of given geospatials coordinates (e.g. 51.998,-0.7436 in decimal degrees) could be an entity for an application to try to get information about (see Figure 4 for example data)."
= Sec 4.1 =
typos: "Air Quality and Moisture Sensors" -> "_a_ir _q_uality and _m_oisture _s_ensors"
= Sec 4.2 =
typo: "to supporting the data processing" -> "to support_ the data processing"
Figs 3 + 4: not readable, due to both the font sizes and the colour schemes used.
= Sec 4.3 =
typo:@ "b) a description on the process capable of" -> "b) a description o_f_ the process capable of"
typo: "that, as SPARQL query remodels" -> "that, a_ SPARQL query remodels"
= Sec 4.1 =
Typo: "UK Food Estanblishments" -> "UK Food Esta_blishments"
= Sec 5 =
==Ass 2.2 ==
typo: "multiple datasets, these cases" -> "multiple datasets, th_i_s_ case_"
typo: "The need of setting up Dataflow" -> "The need _t_o_ set____ up Dataflow"
==Ass 3.3 ==
typo: "Process executions do not influence
policies propagation." -> "Process executions do not influence
polic_y_ propagation."
typo: "hipothetical" -> "h_y_pothetical"
= Conclusions =
typo: "b) a description on the process" -> "b) a description o_f_ the process"
typo: "metadata-relying tasks" -> "metadata-rel_iant_ tasks"
End