Differential Privacy and SPARQL

Tracking #: 3474-4688

Carlos Buil Aranda
Jorge Lobo
Federico Olmedo

Responsible editor: 
Guest Editors ST 4 Data and Algorithmic Governance 2020

Submission type: 
Full Paper
Differential Privacy is a framework that provides formal tools to develop algorithms to access databases and answer statistical queries with quantifiable accuracy and privacy guarantees. The notions of Differential Privacy are defined independently of the data model and the query language at steak. Most Differential Privacy results have been obtained on aggregation queries such as counting or finding maximum or average values, and on grouping queries over aggregations such as the creation of histograms. So far, the data model used by the framework research has typically been the relational model and the query language SQL. However, effective realizations of Differential Privacy for SQL queries that required joins had been limited. This has imposed severe restrictions on applying Differential Privacy in RDF knowledge graphs and SPARQL query language. By the simple nature of RDF data, most useful queries accessing RDF graphs will require intensive use of joins. Recently, new Differential Privacy techniques have been developed that can be applied to many types of joins in SQL with reasonable results. This opened the question of whether these new results carry over to RDF and SPARQL. In this paper we provide a positive answer to this question by presenting an algorithm that can answer counting queries over a large class of SPARQL queries that guarantees Differential Privacy, if the RDF graph is accompanied with semantic information about its structure. We have implemented our algorithm and conducted several experiments, showing the feasibility of our approach for large graph databases. Our aim has been to present an approach that can be used as a stepping stone towards extensions and other realizations of Differential Privacy for SPARQL and RDF.
Full PDF Version: 


Solicited Reviews:
Click to Expand/Collapse
Review #1
By Olaf Hartig submitted on 05/Jun/2023
Review Comment:

I am okay with accepting the manuscript now. However, there are still a few things that the authors need to fix when preparing the camera-ready version:

1/ The definition of the notion of induced sub-graph is still ambiguous and should be improved:
* The concept of "mappings of individual triple patterns in S over G" is undefined.
* It is not clear what are the "individual triple patterns in S over G that share the same image for the center of S." What is the "image" of a triple pattern? ... and what is an "image for the center of [a star BGP] S"?
* It is not clear whether such a "set of compatible mappings (of S with respect to G)" must contain at least one solution mapping for each of the triple patterns in S.
* It is not made explicit what mappings are to be considered when "Doing so for every mapping μ."
* The concept of "the RDF triples defining G" is undefined.
* The notation S_1(|G|) that you start using in Example 3.2 is undefined (as already pointed out in my previous review!)

Why can't you simply provide a formal definition??? Here is an almost complete sketch of such a definition:

* Given an RDF graph G, a star BGP S with center ?v, and an RDF term x, the \textbf{subgraph of G induced by S and x}, denoted by ind(G,S,x), is the following set of triples:

ind(G,S,x) = { t ∈ G | ∃ tp ∈ S and a solution mapping μ such that μ ∈ [[tp]]_G and μ(?v)=x }

* Given an RDF graph G and a star BGP S, the \textbf{induced subgraphs of G by S}, denoted by S(|G|), is the following set of RDF graphs:

S(|G|) = { ind(G,S,x) | ∃ an RDF term x such that ind(G,S,x) is not empty }

2/ Point 3 of my previous review (which was point 7 in the review before that) has been addressed, but still not in an entirely satisfactory way. Now there is the statement that "the benefit of this latter query is that it can be pre-computed for every tp," (page 11, lines 34-36 on the RHS) which is inaccurate because it is not clear what triple pattern is meant by "every tp." I can see different options for what this might mean:
i) In the context of the paragraph that contains this statement, the symbol tp is introduced to refer to a triple pattern that is "obtained from a triple pattern in B [...]," which means that tp depends on B. As a consequence, in this case we are back again at the issue that I raised in point 3 of my previous review.
ii) Another interpretation may be that "every tp" actually refers to every possible triple pattern. In this case, I don't see how the "latter query" can be pre-computed for every triple pattern.
My assumption, however, is that you actually mean all the triple patterns of all star BGPs of the given dp-schema. You need to be explicit about that in this paragraph! ...and then, to make the connection to the triple pattern tp in the paragraph, you also need to be explicit that a triple pattern tp, constructed as described in this paragraph, is guaranteed to be isomorphic (same predicate IRI but potentially different variables in subject and object) to one of the triple patterns in one of the star BGPs of the dp-schema.

3/ The new changes to the manuscript contain new typos as well. For instance:

Page 9, RHS, line 3: "reminder" -> "remainder"

Page 11, RHS, line 30: "other other" -> "other"