Enhancing Data Use Ontology (DUO) for Health-Data Sharing by Extending it with ODRL and DPV

Tracking #: 3127-4341

Authors: 
Harshvardhan J. Pandit
Beatriz Esteves

Responsible editor: 
Tania Tudorache

Submission type: 
Ontology Description
Abstract: 
The Global Alliance for Genomics and Health is an international consortium that is developing the Data Use Ontology (DUO) as a standard providing machine-readable codes for automation in data discovery and responsible sharing of genomics data. DUO concepts, which are OWL classes, only contain textual descriptions regarding the conditions for data use they represent, which limits their usefulness in automated systems. We present use of the Open Digital Rights Language (ODRL) to make these conditions explicit as rules, and combine them to create policies that can be attached to datasets, and used to identify compatibility with a data request. To associate the use of DUO and the ODRL policies with concepts relevant to privacy and data protection law, we use the Data Privacy Vocabulary (DPV). Through this, we show how policies can be declared in a jurisdiction-agnostic manner, and extended as needed for specific laws like the GDPR. Our work acknowledges the socio-technical importance of DUO, and therefore is intended to be complementary to it rather than a replacement. To assist in the improvement of DUO, we provide ODRL rules for all of its concepts, an implementation of the matching algorithm, and a demonstration showing it in practice. All resources described in this article are available at: https://w3id.org/duodrl/repo.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 05/Jul/2022
Suggestion:
Major Revision
Review Comment:

SUMMARY:

This submission introduces a semantic, machine-readable approach to modeling the data usage requirements encoded in the classes of the DUO data usage ontology (which is part of OBO).
Data use limitations in DUO are encoded as unstructured human-readable text that is used by a review board to decide whether a data request should be accepted. This paper proposes a machine-readable encoding of these data use requirements based on ODRL and DPV (a rich vocabulary for data privacy concepts developed by a W3C community group).
This paper is something more than a simple ontology description, as it considers also automated compliance checking methods for the policies encoded with the proposed ontology.

Pros:
- a major improvement of DUO: machine-readable policies pave the way to significantly enhanced support for the ethical board (e.g., tools for policy validation, compliance checking, etc.);

- the proposed approach can be smoothly integrated with DUO's workflows and tools, because it is complementary to DUO's ontology;

- the integration with DPV fills a major gap of DUO, that currently does not model any legal aspects; with DPV's upper ontology general legal aspects can be modelled, and DPV's profiles can be leveraged to express conditions related to specific regulations, such as the GDPR;

- ODRL and DPV are publicly available resources.

Cons:
- apparently, the choice of ODRL is suboptimal with respect to the requirements collected by the authors;

- the motivations for using ODRL do not convincingly address the above issue and should be strengthened;

- this involves also a refinement of the related work section;

- the algorithms for compliance checking and agreement generation are just sketched; related desiderata cannot be checked at this level of detail; matching might as well be intractable.

QUESTIONS FROM THE EDITOR:

This manuscript was submitted as 'Ontology Description' and should be reviewed along the following dimensions:

(1) Quality and relevance of the described ontology (convincing evidence must be provided).

The DPV vocabulary is being developed within W3C; it results from the careful collection, harmonization, and extension of previous vocabularies, and is relevant in different application contexts. It is serialized in several semantic formats, including OWL and RDFS. ODRL is another W3C standard that can be regarded as an ontology without formal semantics. Both have a high potential in terms of impact.

(2) Illustration, clarity and readability of the describing paper, which shall convey to the reader the key aspects of the described ontology.

Several clarifications are needed, see detailed comments.

(3) Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess
(A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data,
yes

(B) whether the provided resources appear to be complete for replication of experiments, and if not, why,
yes

(C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and
The repository is on GitHub.

(4) whether the provided data artifacts are complete.
yes

DETAILED COMMENTS:

1) The authors advocate "mathematical guarantees regarding correctness and consistency" of policies and compliance checking (p.2, line 35). However, they propose ODRL, which does not have a formal semantics, so obtaining formal guarantees is impossible.

There exist papers that provide such semantics for fragments of ODRL by translating its expressions into answer set programming [20]. However, representing policies as answer set programs makes the compliance checks based on subsumption (cf. section 3.6) horribly expensive and intrinsically intractable.

The authors implicitly hint at the possibility of mapping ODRL into low-complexity description logics (see the comment on reusing previous approaches on p. 11, lines 22-23, especially [19]); this cannot be done for full ODRL, whose constraints are not fully supported by the policy language of [19] (as they involve unrestricted mathematical comparison operators). Checking policy subsumption with those constraints is intractable, in general.

In the light of the above discussion, the exact fragment of ODRL to be used, its semantics, and the complexity of reasoning with that semantics need to be described more precisely, in order to convince the reader that using a non-semantic language like ODRL for a domain that requires semantics is a good idea.

2) Another negative consequence of the lack of standard semantics is that the meaning of policies is ambiguous and, in principle, ODRL processors and evaluators may interpret the same policy differently (needless to say how bad this is in a context where sensitive data are processed). Such ambiguity is mentioned in section 3.6, but related risks are not discussed.
I also have some doubts about the interpretation of DUO's access determination approach. According to this paper's interpretation, a request for accessing a class of data C would also allow access to a wider class of data P; for example, this would make DUO's geographical restrictions useless, because permission to use the data of US citizens would permit using the data of all citizens, including those in the EU. This makes no sense to me. Maybe some examples could clarify the rationale behind the two opposite access control methods described in this section.

3) Another criticism of DUO is that it "does not offer much guidance on how the matching is performed between datasets and requests annotated with DUO concepts". From point 2 it is clear that ODRL (which has no formal semantics itself) does not help to solve this problem.

4) Some examples would help in several places. For example, how to restructure DUO's taxonomies (p.5 lines 36-42).

5) The authors suggest to "take advantage of ODRL’s ability to express [...] rules as code through which it can identify when a given collection of rules associated with a single dataset are contradictory or impossible to satisfy". Again, according to which semantics? ODRL's MUST directives in the specification deal mostly with syntactic restrictions only - they don't say much about logical inconsistencies. Again, some examples would help to see the actual potential of such consistency checks (and their limitations).

6) The role of templates and their expansion mechanism are not clear either. Please provide examples to illustrate the role of templates and how they are processed.

7) (p. 11 lines 33-39) This description of the policy matching algorithm should be provably correct and complete, in order to fulfil the requirements that the authors themselves have put forward in the introduction. However, in order to prove this, the informal description included here is not adequate.

8) (p.13, lines 33-37) This paragraph explains how to encode the inputs needed to check compliance with the GDPR. How can this be checked automatically, though? The encoding of GDPR's restrictions is not discussed, but as far as I understand it should be within the scope of the paper.

9) (p.14 lines 38-43) The algorithm for assembling the agreement is not described, and its cost is not analyzed.

10) Last but not least, the authors write: "we also consider ODRL the most suitable candidate for representing DUO concepts as it can be used without requiring any of the existing DUO-based data use or request governance processes to make radical and incompatible changes". In order to support this claim, the authors should convincingly argue that competing approaches cannot do the same. My feeling is that other approaches could do the same with less invasive changes. Take [19], for example: that approach is vocabulary-neutral (it is based on a vocabulary-independent fragment of OWL2) so adding (or changing) properties and classes is trivial, while the authors had to adopt a nonstandard extension of ODRL (which jeopardizes the advantages of using a standard language).
Moreover, OWL2 does have a formal semantics, which addresses all the issues raised above.

11) The observation that the policy language of [19] is vocabulary-independent calls for a refinement of the discussion of [19] in section 2.2. In particular the sentence "In principle, this is similar to DUOS’s matching algorithm where the concepts to be matched in a policy are pre-determined" suggests that [19] uses a pre-determined set of concepts, while the approach is vocabulary-independent (which the ODRL approach only partially is).

MINOR COMMENTS:

- (p.5 line 36) "data user permission" -> "data use permission" ?

- (p.8 line 20) "the concept would be an instance of the appropriate DUO class": instance or subclass?

- (p.15 line 45) complimentary -> complementary ?

Review #2
Anonymous submitted on 06/Jul/2022
Suggestion:
Accept
Review Comment:

Very well-written manuscript; conceptually well developed; repository present, well organized, and maintained.

Review #3
By Arianna Rossi submitted on 07/Oct/2022
Suggestion:
Accept
Review Comment:

This article convincingly presents the work done by the authors to extend the Data Use Ontology (DUO), i.e., a controlled vocabulary describing the conditions for health data sharing permissions, with the Open Digital Rights Language (ODRL). This extension converts such conditions into actionable rules that can be used to create use policies for datasets (e.g., sticky policies) and automate the management of use permissions. The paper also describes their integration with the Data Privacy Vocabulary (DPV), that can be further specified to be relevant for specific jurisdictions (in this case, the GDPR) and therefore assist in compliance tasks. The article ends with a policy editor prototype, the implementation of the matching algorithm and a nuanced discussion that proposes solutions on how to enhance DUO’s coherence (such recommendations could be directly shared with DUO authors and users), ease integration and support compliance checking.

The work is of quality and solves a well-stated and precisely defined challenge by drawing together and harmonizing controlled resources that were created for different goals. The topic is very timely: as briefly mentioned in the introduction, there is a pressing need for interoperable rules and automation for data access permissions and their compatibility checking in healthcare settings, one reason being the growing efforts to ease the sharing of health-related data across institutions, states and jurisdictions and to enhance its accountability. Such efforts are encouraged by recent EU regulations like the Data Governance Act and the Health Data Space that are instrumental to realizing the European Digital Strategy. The state-of-the-art clearly outlines past and ongoing efforts in this respect and the limitations of existing ontologies that need to be overcome. The relevance and novelty of the work are unquestionable. About the stable URL, the file is clearly organized and complete.

I’d like to praise the authors for the clarity of their exposition throughout the article and the sound argumentative structure, where the objectives are clearly stated (even though the R05 “Elucidating relevance” could be better restated) and follow logically from the research gaps identified in Section 2. Also notable is the rich, nuanced expression of the arguments presented in the work and of its limitations. Finally, the examples provided are appropriate.

Minor comments that would improve the paper:
- A few concrete examples of how the extension can be leveraged, by whom and in which use cases would make the findings more relevant for researchers, practitioners and other professionals (regulators, tech innovators, etc.) working in health sharing scenarios. What are the concrete application(s) for such an ontology extension?
- Similarly, “impacts” at line 44 p. 1 could also be further specified to illustrate the risks of lack of accountability and unclarity/confusion of health data use permissions. Both would help the readers to better understand the relevance of this contribution.
- p. 4 – l. 4 Add a few words about the objectives of the pilot mentioned to grasp why it is relevant here
- P. 4 Sec. 2.2. clarify that even though the applications can be many, in this article the focus is on the GDPR.
- I would have appreciated seeing whether the interpretations of DUO concepts could be verified with an expert (someone that created the ontology? Someone that uses it?), at least as future work
- Add the link at l. 1 p. 14 “link in abstract”
- Sec. 6.3 p. 16: the authors could state more clearly what are the implications for the semantic modelling of “laws such as the GDPR are fairly recent in terms of how their obligations are understood to be applied” – what are the exact challenges that are implicitly recalled?
- Sec. 6.3 p. 16: what exactly is intended with “digital contract”? I seem to understand that in the EU a digital contract is a contract applied to digital goods, contents and services.

Typos:
- p. 3 l. 1 missing subject for “can choose which aspects”
- p. 3 l. 42 “adding new datasets and (in) data access requests”
- p. 4 l. 3 “comparable to [the outputs? The decisions? of] human data access committees”
- p. 4 l. 25 health data [is] personal data
- p. 4 l. 27 identifying and meeting(s) their compliance
- p. 5 l. 5 “based on (on) these identified”
- p. 5 l. 36 data use(r) permissions
- p. 10 l. 32-33 (odrl:Assigner) and (odrl:Assignee)
- p. 12 l. 35 “in a(n) jurisdiction agnostic manner”

Review #4
Anonymous submitted on 16/Oct/2022
Suggestion:
Minor Revision
Review Comment:

The paper is not exactly an Ontology Description.

However, it proposes improvements to an existing ontology and shows how to complement it with descriptions of rules and policies, in particular by using ODRL. In addition, it uses DPV to address how legal consistency can be added.

From this point of view, the paper is very interesting and sound. A deep analysis and justification is given, together with a demo.

Things to improve include:
- Table 1 is not fully comprehensive.
- Evaluate the effort needed for concepts that are not directly mappable and how useful they are.
- Define the updated ontology as such.

Review #5
By Visara Urovi submitted on 28/Dec/2022
Suggestion:
Major Revision
Review Comment:

This work presents the use of the Open Digital Rights Language (ODRL) to make Data Use Ontology (DUO) concepts explicit as machine-readable rules, and to combine them to create policies that can be attached to datasets. The work clearly describes the mapping process. The work is accompanied by a link to a repository and the shared resources seem to be complete for others to reproduce the work.

Currently, the paper reads more as a technical report than a publication; therefore, I would suggest that the authors improve the paper in the following ways:

1. The introduction should make the novelty of the work explicit. I noticed the contributions listed as RO1-RO5; however, these list the work of the authors but do not clearly position the work in relation to other research.
2. Several related works are identified in section 2.1 (ref 10-16); however, no attempt is made to position them in relation to the current work. For example, the work cited as [11] also represents DUO as automatable rules (as smart contracts) that map to ADA-M.
3. You state that "Of note in these identified articles and other resources is that we did not find a clear example or workflow for how the machine-readability of DUO should be associated with datasets, expressed as part of a request, or how the matching algorithm should function." However, such a statement is a weak indicator of contribution. Possibly, your overall contribution is in making DUO explicit as rules and combining them with DPV, but you need a more detailed comparison to existing works.
4. Section 5 (Demonstration and Evaluation ...) shows an interface. No link is included to try such an interface and, moreover, I cannot see an evaluation as such. I think you might want to guide the reader a bit more into understanding your contribution: What was not possible before, in other works or with DUO itself, that is possible now?
5. The Conclusions should include any planned future work. How will this work need to be improved and extended going forward?