Review Comment:
SUMMARY:
This submission introduces a semantic, machine readable approach to modeling the data usage requirements encoded in the classes of the DUO data usage ontology (which is part of OBO).
Data use limitations in DUO are encoded as unstructured human-readable text that is used by a review board to decide whether a data request should be accepted. This paper proposes a machine-readable encoding of these data use requirements based on ODRL and DPV (a rich vocabulary for data privacy concepts developed by a W3C community group).
This paper is more than a simple ontology description, as it also considers automated compliance checking methods for the policies encoded with the proposed ontology.
Pros:
- a major improvement of DUO: machine-readable policies pave the way to significantly better support for the ethical board (e.g. tools for policy validation, compliance checking, etc.);
- the proposed approach can be smoothly integrated with DUO's workflows and tools, because it is complementary to DUO's ontology;
- the integration with DPV fills a major gap of DUO, which currently does not model any legal aspects; with DPV's upper ontology general legal aspects can be modelled, and DPV's profiles can be leveraged to express conditions related to specific regulations, such as the GDPR;
- ODRL and DPV are publicly available resources.
Cons:
- apparently, the choice of ODRL is suboptimal with respect to the requirements collected by the authors;
- the motivations for using ODRL do not convincingly address the above issue and should be strengthened;
- this involves also a refinement of the related work section;
- the algorithms for compliance checking and agreement generation are just sketched; related desiderata cannot be checked at this level of detail; matching might as well be intractable.
QUESTIONS FROM THE EDITOR:
This manuscript was submitted as 'Ontology Description' and should be reviewed along the following dimensions:
(1) Quality and relevance of the described ontology (convincing evidence must be provided).
The DPV vocabulary is being developed within W3C. It results from a careful collection, harmonization, and extension of previous vocabularies, and is relevant in different application contexts. It is serialized in several semantic formats, including OWL and RDFS. ODRL is another W3C standard that can be regarded as an ontology without formal semantics. Both have high potential impact.
(2) Illustration, clarity and readability of the describing paper, which shall convey to the reader the key aspects of the described ontology.
Several clarifications are needed, see detailed comments.
(3) Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess
(A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data,
yes
(B) whether the provided resources appear to be complete for replication of experiments, and if not, why,
yes
(C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and
The repository is on GitHub
(4) whether the provided data artifacts are complete.
yes
DETAILED COMMENTS:
1) The authors advocate "mathematical guarantees regarding correctness and consistency" of policies and compliance checking (p.2, line 35). However, they propose ODRL, which does not have a formal semantics, so obtaining formal guarantees is impossible.
There exist papers that provide such semantics for fragments of ODRL by translating its expressions into answer set programs [20]. However, representing policies as answer set programs makes the compliance checks based on subsumption (cf. section 3.6) very expensive and intrinsically intractable.
The authors implicitly hint at the possibility of mapping ODRL into low-complexity description logics (see the comment on reusing previous approaches on p. 11, lines 22-23, especially [19]); this cannot be done for full ODRL, whose constraints are not fully supported by the policy language of [19] (as they involve unrestricted mathematical comparison operators). Checking policy subsumption with those constraints is intractable, in general.
In the light of the above discussion, the exact fragment of ODRL to be used, its semantics, and the complexity of reasoning with that semantics need to be described more precisely, in order to convince the reader that using a non-semantic language like ODRL for a domain that requires semantics is a good idea.
2) Another negative consequence of the lack of standard semantics is that the meaning of policies is ambiguous and, in principle, ODRL processors and evaluators may interpret the same policy differently (needless to say how bad this is in a context where sensitive data are processed). Such ambiguity is mentioned in section 3.6, but the related risks are not discussed.
I also have some doubts about the interpretation of DUO's access determination approach. According to this paper's interpretation, a request for accessing a class of data C would also allow access to a wider class of data P; for example, this would make DUO's geographical restrictions useless, because permission to use the data of US citizens would permit using the data of all citizens, including those in the EU. This makes no sense to me. Maybe some examples could clarify the rationale behind the two opposite access control methods described in this section.
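To make this concern concrete, here is a minimal sketch of the two opposite readings over a toy taxonomy. The class names and the helper functions are hypothetical illustrations of mine; they are not DUO terms and not the authors' algorithm.

```python
# Hypothetical toy taxonomy (child -> parent); names are illustrative, not DUO terms.
PARENT = {
    "USCitizenData": "CitizenData",
    "EUCitizenData": "CitizenData",
    "CitizenData": None,
}


def is_subclass_of(c, p):
    """Return True if class c equals p or is a descendant of p."""
    while c is not None:
        if c == p:
            return True
        c = PARENT[c]
    return False


def allowed_narrowing(permitted, requested):
    """Reading 1: the requested class must fall within the permitted class."""
    return is_subclass_of(requested, permitted)


def allowed_widening(permitted, requested):
    """Reading 2 (the one I question): a permission also covers superclasses."""
    return is_subclass_of(permitted, requested)


# A permission limited to US data, and a request for all citizens' data.
narrow = allowed_narrowing("USCitizenData", "CitizenData")
wide = allowed_widening("USCitizenData", "CitizenData")
```

Under the first reading the request for all citizens' data is rejected; under the second it is granted, which is exactly the outcome that would defeat a geographical restriction.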
3) Another criticism of DUO is that it "does not offer much guidance on how the matching is performed between datasets and requests annotated with DUO concepts". From point 2 it is clear that ODRL (which has no formal semantics itself) does not help to solve this problem.
4) Some examples would help in several places, for example to show how DUO's taxonomies should be restructured (p.5 lines 36-42).
5) The authors suggest to "take advantage of ODRL’s ability to express [...] rules as code through which it can identify when a given collection of rules associated with a single dataset are contradictory or impossible to satisfy". Again, according to which semantics? ODRL's MUST directives in the specification deal mostly with syntactic restrictions only; they don't say much about logical inconsistencies. Again, some examples would help to see the actual potential of such consistency checks (and their limitations).
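As an illustration of the kind of example I have in mind, the sketch below detects one simple form of contradiction: a permission and a prohibition over the same action and target. The `Rule` structure and the notion of conflict are assumptions of mine chosen for illustration; they are not taken from the ODRL specification or from the paper, and they ignore constraints entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical, simplified rule: no constraints, duties, or assignees."""
    kind: str    # "permission" or "prohibition"
    action: str
    target: str


def find_conflicts(rules):
    """Return (action, target) pairs that are both permitted and prohibited."""
    permitted = {(r.action, r.target) for r in rules if r.kind == "permission"}
    prohibited = {(r.action, r.target) for r in rules if r.kind == "prohibition"}
    return sorted(permitted & prohibited)


# Illustrative policy attached to a single dataset.
policy = [
    Rule("permission", "use", "dataset-1"),
    Rule("prohibition", "use", "dataset-1"),
    Rule("permission", "share", "dataset-1"),
]
conflicts = find_conflicts(policy)
```

Even this trivial check presupposes a semantics (here, exact matching of actions and targets). Once constraints and action hierarchies enter the picture, what counts as "contradictory" depends on the chosen interpretation, which is the point of the question above.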
6) The role of templates and their expansion mechanism are not clear either. Please provide examples to illustrate the role of templates and how they are processed.
7) (p. 11 lines 33-39) This description of the policy matching algorithm should be provably correct and complete, in order to fulfil the requirements that the authors themselves have put forward in the introduction. However, in order to prove this, the informal description included here is not adequate.
8) (p.13, lines 33-37) This paragraph explains how to encode the inputs needed to check compliance with the GDPR. How can this be checked automatically, though? The encoding of the GDPR's restrictions is not discussed, but as far as I understand it should be within the scope of the paper.
9) (p.14 lines 38-43) The algorithm for assembling the agreement is not described, and its cost is not analyzed.
10) Last but not least, the authors write: "we also consider ODRL the most suitable candidate for representing DUO concepts as it can be used without requiring any of the existing DUO-based data use or request governance processes to make radical and incompatible changes". In order to support this claim, the authors should convincingly argue that competing approaches cannot do the same. My feeling is that other approaches could do the same with less invasive changes. Take [19], for example: that approach is vocabulary-neutral (it is based on a vocabulary-independent fragment of OWL2) so adding (or changing) properties and classes is trivial, while the authors had to adopt a nonstandard extension of ODRL (which jeopardizes the advantages of using a standard language).
Moreover, OWL2 does have a formal semantics, which addresses all the issues raised above.
11) The observation that the policy language of [19] is vocabulary-independent calls for a refinement of the discussion of [19] in section 2.2. In particular the sentence "In principle, this is similar to DUOS’s matching algorithm where the concepts to be matched in a policy are pre-determined" suggests that [19] uses a pre-determined set of concepts, while the approach is vocabulary-independent (which the ODRL approach only partially is).
MINOR COMMENTS:
- (p.5 line 36) "data user permission" -> "data use permission" ?
- (p.8 line 20) "the concept would be an instance of the appropriate DUO class": instance or subclass?
- (p.15 line 45) complimentary -> complementary ?