Analysis of Ontologies and Policy Languages to Represent Information Flows in GDPR

Tracking #: 2703-3917

Authors: 
Beatriz Esteves
Víctor Rodríguez-Doncel

Responsible editor: 
Guest Editors ST 4 Data and Algorithmic Governance 2020

Submission type: 
Survey Article
Abstract: 
This article surveys existing vocabularies, ontologies and policy languages that can be used to represent informational items referenced in GDPR rights and obligations, such as the 'notification of a data breach', the 'controller's identity' or a 'DPIA'. Rights and obligations in GDPR are analyzed in terms of information flows between different stakeholders, and a complete collection of 57 different informational items that are mentioned by GDPR is described. 12 privacy-related policy languages and 9 data protection vocabularies and ontologies are studied in relation to this list of informational items. ODRL emerges as the language that can partially represent the highest number of rights and obligations in GDPR if complemented with DPV and GDPRtEXT, since 39 out of the 57 informational items can be modelled. Online supplementary material is provided, including a simple search application and a taxonomy of the identified entities.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 28/Feb/2021
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions:

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.

The paper provides a comprehensive overview of relevant work accomplished in the area of representing and processing GDPR-related policies and conditions. The work is relevant for experts and laypeople alike as it provides a discussion of up-to-date resources and their practical applicability for improving the automation of privacy preservation according to the European legislation. Additionally, it identifies deficiencies, discusses improvements and provides additional resources to enhance the technological capabilities for enriching online resources with privacy-related information.

(2) How comprehensive and how balanced is the presentation and coverage.

The paper starts with a discussion of privacy-related information flows and corresponding rights and obligations defined within the GDPR from the perspective of various actors involved in the processing of personal data. As a major outcome the authors present an expressive flow chart illustrating rights and obligations as information flows between various actors such as data subjects, data controllers, authorities and other agents defined within the GDPR framework.
In the following sections the authors describe 12 policy languages and eight ontologies that have either been explicitly designed to express and model privacy-related issues or can be re-used for such purposes. They close this section by also briefly discussing the ties between GDPR and linked open data.
In the last chapter the authors compare the languages' and ontologies' expressivity against the information flows defined at the beginning of the paper. This analysis nicely illustrates the capabilities and deficiencies of existing resources for the automated processing of privacy-related rights and obligations.

(3) Readability and clarity of the presentation.

Generally speaking, the paper is well written and easy to read, although the topic itself is not trivial, especially for readers not particularly familiar with the regulatory aspects of data protection and its intersections with knowledge representation. Nevertheless, the authors present their arguments and issues in an intelligible manner, thus providing a good starting point to engage in this specific area of research and application.

Nevertheless, some aspects of the work could be improved:

- Chapter 3 is necessarily of a very descriptive nature. Although this is necessary to gain an understanding of the research object - the relevant languages and ontologies - it would be nice to add a table at the beginning of chapters 3.2 and 3.3 that gives a short descriptive comparison / profile of the resources w.r.t. criteria such as
-- abbreviation
-- full name
-- creator
-- date of publication
-- version no.
-- deprecated (yes/no)
This would improve readability a lot and present the following information in a nicely condensed manner.

- Please also check the relevance of the following articles for the related work section:
- https://ieeexplore.ieee.org/abstract/document/8923532
- https://dl.acm.org/doi/abs/10.1145/3266237.3266270
- https://dl.acm.org/doi/10.1145/2872518.2890590
- https://link.springer.com/chapter/10.1007/978-3-319-58469-0_33
- https://ieeexplore.ieee.org/abstract/document/4262578
- https://www.researchgate.net/profile/Annie-Anton/publication/228941676_A...

- There are some formal issues that need to be mended:
p1. l32: "Alan Westin’s dreams" --> better: "Alan Westin’s vision"
p1. l44: "... as the reference framework." --> "... framework of reference."
p3. l38: "when said processing in unlawful;" --> "when said processing is unlawful;"
p6. l42: "... policies with the users preferences..." --> user's or users'
p7. l8: "... about the websites privacy ..." --> website's or websites'
p7. l13: "... but rather to establish the practices of each website." --> What do you mean with "establish"?
p7. l23: "... the policy and an access, disputes and remedies" --> delete "an"
p7. l27: "... while the remedy specifies" --> "while the remedies element specifies ..."
p7. l28: "In relation to the statements applied only to specific data types, ..." --> What do you mean with this formulation?
p7. l33: "... processing, such as the .." --> When you say "such as" giving examples and in the second part of the sentence you say "should contain at least one of them", you must explicitly specify that there are more than the three listed purposes. Or you can make the second part a separate sentence.
p7. l39: "by P3P and the" --> for reasons of clarity: "..., and ..."
p7. l16: "... about permissions, prohibitions and duties related ..." --> In a former passage you refer to deontic concepts as permissions, prohibitions and obligations. Here you refer to duty instead of obligation. Would be good to stay consistent - even though ODRL is referring to duty itself. ... but who am I telling this ;) ... would be interesting to discuss the semantic difference between duties and obligations.
p7. l34: "... is related to the inability to ..." --> Is it really "inability" or rather "forbiddance" ... inability refers to a non existent physical or mental ability of a subject, while forbiddance refers to a state of denial imposed upon a subject.
p7. l38: "... conditions in which ..." --> conditions under which
p7. l51: "... and the temporal, spacial, sector, ... constraint ..." --> Did you intend to enter more attributes here? If not, please try to avoid using "...".
p8. l14: "... extension of the company ..." --> company's
p9. l1: "... within a more broad-domain framework." --> What do you exactly mean with this?
p9. l4&6: "the rights and duties regarding data disclosure" --> Again: duties or obligations?
With duties you would refer to the ODRL convention. With obligations you would refer to the broader context of deontic logic and reasoning. Or is "duties" a specific concept defined within POL? If yes, put it in italics.
p9. l23: "i.e. event/driven time, ..." --> shouldn't it be "event-driven"?
p9. l32: "... questions in relation to the electronic identifiers and electronic identities progresses." --> Pls check formulation for grammar.
p9. l38: "... to define whom has .." --> either "who has access" or "to whom access is granted"
p10. l1: "fine whom can" --> who
p10. l4: "regulate users access" --> user or user's or users'
p10. l49: "... services that monitor the consumers and providers activities ..." --> consumer's or consumers' // provider's or providers'
p10. l15-18: Also, the retention period of the purpose should be defined in days and a negotiable attribute, set to false by default, can also be detailed. --> Check grammar!
p10. l19-21: "composed by one or more data elements and each one can have an expiry period, ..." --> "composed of" // Is it really "period" or rather "date"?
p10. l38: "... expands the PrimeLife Policy Language (PPL) to take into guidelines ..." --> by taking into account
p10. l41: "extensible privacy policy language designed on the context of the" --> do you mean "designed within the context of ..." or "developed within the XYZ project"?
p11. l12: "Each log has meta-data associated to it," --> either "has associated meta-data" or "has meta-data attached to it".
p12. l6: "and description," --> and a description
p12. l11&22: "conditions in which" --> conditions under which
p12. l13: "request is in accordance with" --> maybe better: complies with
p12. l29: "In this subsection, we describe the found data .." --> delete: the found
p13. l2-5: "The consent should be given by the data subject in an unambiguous way and for a specific purpose and how it is given depends on the type of data it is related to." --> divide in two sentences
p13.l33: "does it give detail on" --> details
p13. l38: "between entities and also to monitor" --> either "and to monitor" or "and also monitor"
p14. l15: "the work-flow. For this, provenance meta-data on the" --> "... workflow. To do so, ..."
p14. l29: data on the exercising of rights --> do you mean "execution"
p14.l39: "ontology focused on the cloud" --> focussing & delete "the"
p14. l9: falls on the authority --> into
p14. l26: "was also further developed to" --> either "developed further" or "extended"
p15 l20&22: work-flow --> workflow
p15. l23: "has associated several properties" --> several associated
p15. l30: "which the context was" --> consent
p15. l34: "and will be running until April 2021." --> was running until
p16. l12: "hierarchically according to the detail level of the data and" --> what do you mean with this?
p16. l19: "used to module the" --> What does that mean?
p17. l29: application scenarios, is ongoing to date. --> is going on to this date.
p17. l34: excluding PPO. P3P, --> excluding PPO, P3P,
p20. l37: tools both in the side of the individuals and in the side --> tools both on the side of the individuals and on the side

(4) Importance of the covered material to the broader Semantic Web community.

Given that the semantic web is not just a technological endeavour but unfolds within a social context, engineers working in this area need to be aware of how to model and represent regulatory conditions within shared resources. This paper provides a comprehensive primer to the important area of data protection and thus is of relevance to anybody who needs to deal with data protection issues from a knowledge engineering perspective.

Review #2
By Julian Padget submitted on 07/Mar/2021
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community.
Section 1.

The introduction is fine as far as it goes, but given this is a survey paper, it needs to say more than that the contributions are a survey: first there needs to be a motivation (or motivations) for the survey - some of this is implied in earlier parts of the intro, but this needs bringing out and developing, and second there needs to be a clear statement of what the outcomes of the survey are. A third aspect is whether this is a partial or a complete survey. This may be tied back to the motivational aspect.

In the case of a survey paper, the reader needs to know up front what the authors' methodology is. This could be rolled into the introduction, but would probably be better as a standalone section. Perhaps promote 3.1 to here. It also needs expansion: for example, what criteria were used to include or exclude work? Some of this is covered at the start of 3.2. Provide some reflection on process too.

Section 2.

The comprehensive identification of subject rights (2.1, 2.2) and of the rights and obligations of controllers and processors (2.3) is extremely useful, but the section needs to start by explaining why it is being done and how it fits into the argument of the paper. In addition, the reader needs to be told what the origins are or what the derivation process is for these rights and information flows. Likewise the mapping of rights to information items (table 2) and of rights and obligations of controllers and processors to information items (table 3) refer respectively to chapter III and chapter IV of the GDPR, but miss a reference to the analysis that underpins the mappings.

Section 3.

The opening of this section is about other surveys, but the survey the authors are presenting comprises sections 3.2 and 3.3. I feel these should be separate sections at the top level, not part of related work. Then it makes sense to have an intro to the section on privacy-related policy languages that maps out what is to come, why it's in the order it is, etc., as well as flagging up the outcomes, such as the comparison tables, in advance.

The content provides plenty of description, but not much analysis, or at least variable amounts of analysis. Would be reasonable to omit (but state explicitly) in the case of obsolete languages - 3.2 identifies some obsolete languages that are left out (sic), but others are left in (need explanation). But for the obsolete, what are the takeaways? How have they changed the landscape, what have they influenced? Would an influence/dependency graph be possible to capture the flow of research ideas? It is scholarly artefacts like this that make survey papers genuinely useful.

Any particular reason for the order? Might be useful to make it broadly chronological, particularly since the plain citation style is unhelpful in this respect. At present it is just a collection of names without any apparent connections, which is what needs addressing.

3.2.2

ODRL constraints however are at the same level as everything else, making their specificity (which permission, obligation etc.) unclear, when there may need to be different, possibly conflicting constraints for different rules, or for the same rules at different times. I'm positive in general about ODRL, but as it stands ODRL 2.2 does have its limitations and arguably confusions.

3.2.3

"XPref resorts to XPath ... making the preferences formulation more user-friendly and less error prone." It is a little surprising to find XPath being described as user friendly and less error prone. Is there a user study whose results support this view that can be cited here?

3.2.4

precising -> making precise

How does the S4P data disclosure protocol for third parties work? What happens to data subsequently is a particularly tricky matter that is not mentioned in any of the other sections (I think).

3.2.5

"These annotations are then incorporated by the AIR reasoner in its justifications and can be used to hide PIIs present in the rule set." This is intriguing, but needs illustration to make the point: unless the reader knows the material, they cannot imagine how this works.

"Also, the rules graph format allows for the nesting of rules within the same rule set, thus providing a way to segment the conditions stated by the rule in order to only expose part of them in the justifications." This explanation works for someone who knows about it, but is opaque, at least to this reader.

3.2.6

to module distinct -> to model distinct

3.2.7

whom has access to what -> who has access to what
restriction abilities should apply -> restriction abilities apply (?); can't see a reason for should

3.2.10

Given title should 3.2.10 be in 3.3 rather than 3.2?

PROV-O needs citation (it's in the bibliography)

3.2.11

The DPF rule engine sounds quite complicated; needs some technological grounding: how does it work (e.g. on what logic is it based?)

3.2.12

Heavy use of bold here; the different presentational style distracts, compared to other sections where there is little or no bold.

3.2 general points

What is the descriptive coverage of each language? What makes one better than another for a given task? What are their formal underpinnings? For example, there seems to be limited consideration of the reasoning aspects.

3.3

The current intro is rather brief. Needs expansion to complement that of 3.2. As per the suggestion above, this would be a top-level section, and as for 3.2 it would map out what is to come, why it's in the order it is, etc., as well as flagging up the outcomes, such as the comparison tables, in advance.

Contrast also missing here, especially apparent with the GDPR-related ontologies.

Have to motivate categories for table too; say more about the methodology; how the GDPR informs the process;

In general there is a sense of some entries getting more insightful coverage. Review for balance.

In the end, which ones are the right ones to use? How to choose between them?

Section 4.

blank section

section 4.1

content quite dense: needs signposting/structure; content mostly observations about the table: what are conclusions?

Reason for order in table 4? Order by number of asterisks, then alphabetically?

Likewise Table 5?

It feels like it ought to be a matter of concern that "Most of the ontologies and vocabularies presented are obsolete or without new developments in recent years, with BPR4GDPR's IMO, GDPRov, GConsent, DPV and GDPRtEXT being the only ones that continue to be improved.". Will the same fate befall those currently being maintained?

I could not follow what was intended by "Moreover, only DPKO, IMO and PrOnto do not have open and accessible resources.". Perhaps "of these", instead of "only"?

Not convinced by Listing 1. Feels like part of another paper and not sure it contributes much here.

The tables appear too late in the paper: table 7 even falls into the appendices. Since there appears to be a fairly clear break on line 11 on p.18 between the languages part and the vocabularies part, the structure of the paper could potentially be improved by putting the discussion (and the tables) for each survey part at the end of what is currently 3.2 and 3.3 respectively, so that the reader can more easily look at the tables and at the discussion together.

Conclusion

Proposing to combine three languages/ontologies is not very informative and misses a clear justification beyond maturity. What are the principles underpinning each component and how will they satisfy the goals that the analysis/survey has brought into focus? Again, more reflection needed.

Bibliography

Entries 7, 37, 58, 71 need attention or are incomplete in some way.

Online docs need access dates.

Review #3
Anonymous submitted on 10/Jun/2021
Suggestion:
Accept
Review Comment:

(1) The paper presents a survey on policy languages, vocabularies and ontologies focused on privacy, and it analyses to what extent they support GDPR-related applications aimed at supporting individuals and other stakeholders to manage GDPR compliance. The paper is presented in a very clear way and provides a comprehensive introduction to the topic, which is a current challenge in industry as well.

(2) The main contribution is the review of existing ontologies, vocabularies etc. and to what extent they support the representation of subjects' rights as specified in the GDPR. To achieve this, the paper introduces a sort of evaluation framework in which the authors first identify a collection of informational items and relate them to the subject rights, which further simplifies the evaluation of the different proposed approaches. The framework itself already adds some value with respect to the challenges industry is currently facing when trying to be compliant.
(3) While section 3 elaborates very well on the characteristics of each approach using a high-level classification, it would have been nice to have had an earlier and perhaps simpler representation/categorization of each approach highlighting the pros and cons. Tables 4 and 5 achieve that, but their placement impacts the readability/flow. Otherwise, the paper is nicely written, clear and well structured.

(4) The covered material is of high value as GDPR compliance continues to be an important challenge and there aren't enough tools/resources to support a better understanding and automation for exercising personal data identification, classification, modeling of rights and implementation of actions related to those rights. The supporting materials that have been provided add value to the community. However, as they seem to be resources published as part of an H2020 ITN, I wonder about the long-term availability of the resources and how other peers could contribute to the development, maintenance and maturity of the approach. Note that it is quite common that topics are not maintained after the end of an H2020 project.