ciTIzen-centric DAta pLatform (TIDAL): Sharing Distributed Personal Data in a Privacy-Preserving Manner for Health Research

Tracking #: 3121-4335

Authors: 
Chang Sun
Marc Gallofré Ocaña
Johan van Soest
Michel Dumontier

Responsible editor: 
Guest Editors SW Meets Health Data Management 2022

Submission type: 
Full Paper
Abstract: 
Developing personal data sharing tools and standards in conformity with data protection regulations is essential to empower citizens to control and share their health data with authorized parties for any purpose they approve. This can be, among others, for primary use in healthcare, or secondary use for research to improve human health and well-being. Ensuring that citizens are able to make fine-grained decisions about how their personal health data can be used and shared will significantly encourage citizens to participate in more health-related research. In this paper, we propose a ciTIzen-centric DatA pLatform (TIDAL) to give individuals ownership of their own data and connect them with researchers to donate their personal data for research while being in control of the whole data life cycle including data access, storage, and analysis. We recognize that most existing technologies focus on one particular aspect such as personal data storage, struggle to execute data analysis over a large number of participants, and face challenges of low data quality and insufficient data interoperability. To address these challenges, the TIDAL platform integrates a set of components for requesting subsets of RDF (Resource Description Framework) data stored in personal data vaults based on SOcial LInked Data (SOLID) technology and analyzing them in a privacy-preserving manner. We demonstrate the feasibility and efficiency of the TIDAL platform by conducting a set of simulation experiments using three different pod providers (Inrupt.net, Solidcommunity.net, Self-hosted Server). On each pod provider, we evaluated the performance of TIDAL by querying and analyzing personal health data from an increasing number of participants and variables. The performance evaluation of TIDAL shows that the execution time is linearly correlated with the number of pods on all pod providers.
Platforms such as TIDAL can play an important role in connecting citizens, researchers, and data organizations to increase the trust placed by citizens in the processing of their personal data.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Dimitrios Karapiperis submitted on 03/Jul/2022
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality,

This paper is a system paper, which proposes a ciTIzen-centric DatA pLatform (TIDAL) that gives individuals ownership of their own data and connects them with researchers to donate their personal data for research while being in control of the whole data life cycle, including data access, storage, and analysis.

(2) significance of the results,

The authors seem to provide a robust solution and use state-of-the-art technologies and tools. The problem of sharing data for research purposes is quite important, and the authors' proposed system implements this sharing while respecting the European legislation (GDPR).
The only part that wasn't very clear to me is the connection between the researcher's submission and the process of matching it with the corresponding pods of the participants. In my opinion, the authors should elaborate further on this matching, since it is crucial for the whole lifecycle.

Another point that I will raise is the following: "Considering the scalability, we enable TIDAL to access participants' pods in a concurrent way using HTTP requests." It would be nice to know how the authors implement this concurrent access (library, package, etc.).
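
For illustration only, concurrent pod access of the kind the quoted sentence describes could look like the minimal Python sketch below. It is not the authors' implementation: the function names `http_get` and `fetch_all`, the worker count, and the idea of passing in the fetch function are all assumptions made for this example, and a real SOLID client would additionally need authenticated requests.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable
from urllib.request import urlopen


def http_get(url: str, timeout: float = 10.0) -> str:
    """Plain HTTP GET (hypothetical helper); SOLID authentication is omitted."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str] = http_get,
    max_workers: int = 8,
) -> Dict[str, str]:
    """Fetch every URL concurrently and return a mapping of URL to body.

    max_workers caps the number of simultaneous requests, which matters
    if a pod provider limits how many requests it answers at one time.
    """
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as url_list.
        bodies = pool.map(fetch, url_list)
        return dict(zip(url_list, bodies))
```

Stating which mechanism is actually used (thread pool, asynchronous I/O, or a library-specific facility) and how the number of simultaneous requests is bounded would answer this point.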

and (3) quality of writing.
The presentation and writing are good. There are, though, a few sentences that should be rephrased:
- "The request form was designed as a digital consent to be informed and specific."
- "The pod providers limit the number of requests that can be responded to at one time by the servers."
- "It will contribute to enrich and improve the quality of personal data in the SOLID pods by linking data from multiple data sources."

Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data,
(B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete.

The data file, which is hosted on GitHub, is well organized and its instructions are sufficient.

Please refer to the reviewer instructions and the FAQ for further information.

Review #2
By Vassilis Kilitntzis submitted on 10/Jul/2022
Suggestion:
Major Revision
Review Comment:

The paper describes a very interesting prototype platform for managing and performing secure and authorized analysis on health data. The use of ontologies such as DPV, the links to terminologies such as SNOMED CT, and the use of the human- and machine-readable RDF serialization Turtle are also highlights.

A few changes are needed, though, to make the text more comprehensible:

c1: "A single SOLID user can own more than one data"
It is not clear whether data from one person can be scattered across different pods.

c2: "grant or revoke the permissions via a web application.": Are the permissions granted per whole pod?

c3: It is not clear how the applications access the pods. For example, if a user has their data in several pods, does the application query all the pods to find where the data is hosted?

c4: Section 4.1 is hard to read and understand and should be revised; e.g., the participation request is mentioned at the beginning and only explained much later. Also, in the same section, the participation request is "crafted", "post", and "published". Do these represent different steps?

c5: "Subsequently, TIDAL reads..."
How? Who is orchestrating this, and is it polling or event-based?

c6 typo: "where researchers indicate the data elements (URI) are requested" should read "... that are requested".

c7: It is quite vague how RF4 works. The authors assume that all information is described in key-value pairs, where the key is, e.g., a SNOMED CT PURL and the value has no defined datatype (presumably a string). This is not a major issue, though, since data modeling is not the focus of the study. The last paragraph of 4.2 is also a step toward addressing this issue.

c8: "Each request form is assigned with a URI when it gets published": Is the URI generated by the SOLID pod and stored in TIDAL? A related general comment: please state where information is located whenever data are persistently stored in the workflow, so that the reader can also judge whether there are possible security issues.

c9: "the request is signed with the researcher’s private encryption key while it is published": Where is the private key? Is it stored in TIDAL or at the researcher's end? Does TIDAL or SOLID do the signing?

c10: "All published requests that are in the valid period": Since the URI is only stored in TIDAL, this can be achieved only if TIDAL fetches all requests each time and filters them in a second phase. Is this the case?

c11: "TIDAL queries RDF data from the request files and displays them in a human readable manner in a card view... Each card is linked to the original request file from the researcher’s SOLID pod": What is the request file? Is it the RDF instance of Listing 1?

c12: "The participant’s WebID will be registered at the trusted party under the analysis request URI": Please elaborate on this. What does "under the analysis request" mean?

c13 typo: "integrating the Biopartal API" should read "BioPortal".
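
To make comment c10 concrete, the two-phase reading (fetch every published request, then keep only those in their valid period) can be sketched as below. This is an illustration of the question only, not TIDAL's code: the `Request` record, its field names, and the function `currently_valid` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class Request:
    """Hypothetical, minimal view of a published participation request."""
    uri: str
    valid_from: date
    valid_until: date


def currently_valid(requests: Iterable[Request], today: date) -> List[Request]:
    """Second phase: keep requests whose validity period includes today.

    Both period boundaries are treated as inclusive (an assumption).
    """
    return [r for r in requests if r.valid_from <= today <= r.valid_until]
```

If this is indeed the approach, the cost of the first phase grows with the total number of published requests rather than with the number of valid ones, which is worth stating in the paper.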

Review #3
Anonymous submitted on 11/Jul/2022
Suggestion:
Major Revision
Review Comment:

The paper describes a platform, called TIDAL, which integrates a set of existing technologies for the sharing of personal data in a privacy-preserving manner and for use in health research.
The platform consists of a set of components that make use of SOLID technology for requesting subsets of data stored in personal (distributed) data vaults and for offering a privacy-aware way to analyze the data.

Using this platform, health researchers can post participation requests which can be viewed and approved by participants. The personal health data of the participants are then retrieved and analysed in a privacy-preserving manner. All data, including participation requests, approvals, analysis, etc., are expressed and stored in RDF using established data models.

The paper is in general well-written and tackles a very interesting and timely topic. However, several claims are not justified, and also several aspects need to be better motivated and explained.

Specifically:

As stated in the introduction, the authors aim at addressing the research question: "how to engage individuals to “donate” their personal data for health-related research with maximal control in data access, storage, and analysis".
However, the paper does not provide any evidence about this "engagement", nor any explanation of how citizens will be motivated to use the platform.

Authors state in the introduction: "The current personal data management technologies are mostly research-driven and in their early stages.".
There is no evidence that the proposed platform is not research-driven and not in its early stage as well, since it has not been used in a real environment.

Section 4 provides implementation details; however, it does not explain how participants can provide the data. In RDF directly? In other formats like spreadsheets using, for example, templates (which means that data transformation is needed afterwards)? What background knowledge do participants need? How easy is it for a participant to create a pod? How do you ensure that data is provided in the desired manner? What if important data/parameters are missing?
Although there is a short relevant discussion in section 6, these aspects are, in my opinion, very important and need to be clearly explained early in the paper. Also, the platform needs to contain mechanisms that can automate relevant processes (data entry, curation, etc.). Without a clear solution to this, it is difficult to judge whether the platform will manage to be used in practice and actually *engage* individuals.

Other comments:
- Please provide examples of queries for each step of the pipeline: querying data request URIs, querying signature and verification key of data requests, querying RDF data from the participants' pods, etc.
- Section 4.3: It is not clear how one can register a new analysis algorithm, and how such a registration/extension is implemented in the platform.
- Last paragraph of section 4: "The queried data is then fed into the data analysis model which is pre-defined in the Docker image" => What is the format/model of the input and the output?
- Evaluation/Experiments: The objective and motivation of the evaluation need to be explained. E.g., why is efficiency important in this context? (It does not seem to be.)
- Section 6 (Discussion): "TIDAL supports users to store and request personal data in a structured RDF format..." => There is no evidence on how users/citizens are *supported* on storing their personal data. E.g., is there a user interface for this?
- Section 6 (Discussion): "After the participants approve the data request, they can still update the data elements" => How?
- TIDAL emphasizes on engaging individuals in health research and connecting them with both researchers and data sources. => There is no evidence on how this engagement can happen.

About the resources provided and their sustainability:
There is a git repository explaining how to build and use a SOLID application. It was last updated two years ago (13-06-2020). There are lengthy video tutorials for each step of the process; however, there is no mention of TIDAL.
With respect to privacy preservation, there are two videos explaining the processes of participation and analysis.