Private Record Linkage

Tracking #: 1360-2572

Pawel Grzebala
Michelle Cheatham

Responsible editor: 
Guest Editors Linked Data Security Privacy Policy

Submission type: 
Full Paper
The rise of Big Data Analytics has shown the utility of analyzing all aspects of a problem by bringing together disparate data sets. Efficient and accurate private record linkage algorithms are necessary to achieve this. However, records are often linked based on personally identifiable information, and protecting the privacy of individuals is critical. This paper contributes to this field by studying an important component of the private record linkage problem: linking based on names while keeping those names encrypted, both on disk and in memory. We explore the applicability, accuracy, speed and security of three different primary approaches to this problem (along with several variations) and compare the results to common name-matching metrics on unprotected data. While these approaches are not new, this paper provides a thorough analysis on a range of datasets containing systematically introduced flaws common to name-based data entry, such as typographical errors, optical character recognition errors, and phonetic errors.
Solicited Reviews:
Review #1
By Jetzabel Serna submitted on 20/May/2016
Minor Revision
Review Comment:

In this paper the authors present an evaluation of two privacy-preserving name-matching techniques used for querying and joining databases while preserving some degree of privacy and security (name matching on encrypted records). Concretely, the authors focus their work on the evaluation of the Soundex and q-gram algorithms, including a number of variations of the latter. To this end, they performed their evaluation on 30 automatically generated and corrupted datasets containing four types of errors (Character Edit, Keyboard Edit, OCR Edit and Phonetic Edit) and compared the aforementioned techniques with two non-privacy-preserving algorithms (according to the authors, Levenshtein and Jaro-Winkler are two standard methods commonly employed on unencrypted data).
The paper has a very well delimited scope: it focuses exclusively on name-based linking, i.e. privacy-preserving querying and merging of string attributes.
The evaluation of the approaches has been performed by considering the accuracy of the algorithms, their applicability, computational efficiency, security and privacy. The main contribution is therefore the comparative analysis, which aims to give insight into the costs of privacy, the degree of security offered and the impact on accuracy when various types of errors are present in the dataset.
In general I find the paper very interesting, well organized and well written; it is clear, concise and quite focused, and a paper that I definitely enjoyed reading.
Some remarks/weaknesses:
- The originality/contribution of the paper is not so clear. There are papers that have also evaluated the accuracy and computational efficiency of different techniques; therefore, the authors should be more precise about why this paper is needed, what the original contribution of their evaluation is, and why it is important to have this evaluation. Not being an expert in the area, I would also like to know the rationale for selecting only those two mechanisms (are they considered the best available?).
- Since Soundex's applicability is evaluated negatively from the beginning, the remainder of the paper focuses on the q-gram technique and its variations.
- The review of the state of the art seems limited or outdated (e.g. [1-3], see below). In particular, the framework proposed in [1] could have been useful for the purposes of this paper.
- 4.1: Within the first paragraph the authors mention that "Levenshtein and Jaro achieve on encrypted data"; I believe this is a typo/mistake, since both algorithms were previously introduced as standard methods employed on unencrypted data.
- The meaning of the threshold is briefly explained, but not its implications, especially considering that the evaluated q-gram algorithm performs badly when the threshold is set to a high percentage. It would be nice to have an insight into what threshold would be considered useful (realistic).
- Also, regarding the variation of q in the q-gram technique: is it realistic to have q=4?
- When applied to names, how realistic is q=3 (as recommended in the conclusions)?
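The questions about q and the threshold can be made concrete with a minimal sketch of q-gram matching under a Dice-coefficient threshold (an illustrative re-implementation under my own assumptions, not the authors' code; the function names and sample names are invented):

```python
def qgrams(name, q):
    """Overlapping substrings of length q from a lowercased name."""
    name = name.lower()
    return [name[i:i + q] for i in range(len(name) - q + 1)]

def dice_similarity(a, b, q=2):
    """Dice coefficient over the q-gram multisets of two names."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    pool = list(gb)
    shared = 0
    for g in ga:
        if g in pool:       # multiset intersection: consume matched grams
            pool.remove(g)
            shared += 1
    return 2 * shared / (len(ga) + len(gb))

def is_match(a, b, q=2, threshold=0.8):
    return dice_similarity(a, b, q) >= threshold
```

For a single dropped letter ("michelle" vs. "michele"), q=2 still scores about 0.92, while q=3 falls to about 0.73; a larger q therefore only remains usable with a correspondingly lower threshold, which is why the implications of the threshold choice matter.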
Minor remarks:
- Abstract: "three different primary approaches" – aren't they only two?
- p. 2: the acronym PRL appears without being introduced (even if it is quite obvious)
- p. 2: the authors claim that "several" name-based similarity metrics are used; I believe there are only two, plus some variations of those two.
- The evaluated techniques are at times referred to as techniques, algorithms and similarity metrics; it would be nice to pick a single term and use it consistently throughout the paper.
- Figure 1 would benefit greatly from the use of different symbols; it is quite difficult to see the differences on a printed copy.
- The labels used in Table 2 are not explained (although somewhat obvious): FN = false negatives; FP = false positives.
- The paragraph (last three lines in column 1) on page 7 gives a negative impression of the presentation.
- Figure 3 is quite small; also, it is not clear why the authors decided to include Soundex again when it was already evaluated as not applicable.
- Figure 9 is also quite small.
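For reference, the coarseness that limits Soundex's applicability is visible from the algorithm itself. The sketch below is my own implementation of standard American Soundex (not the authors' code): quite different names collapse to the same 4-character code, so it offers only all-or-nothing matching rather than graded similarity.

```python
def soundex(name):
    """Standard American Soundex: first letter + up to three digit codes."""
    mapping = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            mapping[ch] = digit
    name = name.upper()
    code = name[0]
    prev = mapping.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":          # H and W do not separate identical codes
            continue
        digit = mapping.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit            # vowels map to "" and reset the run
    return (code + "000")[:4]
```

For example, "Robert" and "Rupert" both encode to R163, so once two names share a code there is no way to rank how close they actually are.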

[1] Dinusha Vatsalan, Peter Christen, Christine O'Keefe, and Vassilios S. Verykios. "An Evaluation Framework for Privacy-Preserving Record Linkage." Journal of Privacy and Confidentiality 6.1 (2014).
[2] Zikai Wen and Changyu Dong. "Efficient protocols for private record linkage." Proceedings of the 29th Annual ACM Symposium on Applied Computing. ACM, 2014.
[3] Kurt Schmidlin, Kerri M. Clough-Gorr, and Adrian Spoerri. "Privacy Preserving Probabilistic Record Linkage (P3RL): a novel method for linking existing health-related data and maintaining participant confidentiality." BMC Medical Research Methodology 15.1 (2015): 1.

Review #2
Anonymous submitted on 01/Jun/2016
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The paper analyses different approaches for "approximate" name search over encrypted strings. The use cases for the approach are not clear enough. The weakest point is that the data consumer must have access to the symmetric key to encrypt the queries. The level of protection does not seem to pay off against the overhead of the solution. Although data is kept encrypted in memory, at some point in time the encryption key is also in memory to process the queries – whether on the data consumer side or on the data provider side – although this is not explained in the paper. Figure 3 is alarming: the time necessary to process 30k records is at least 500 seconds using any of the techniques. The performance is just too poor, unless there is a mistake and the y-axis is in milliseconds.
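A back-of-the-envelope check of the throughput concern above (the figures are the reviewer's reading of Figure 3; the calculation is mine and purely illustrative):

```python
records = 30_000
elapsed = 500                            # as read off Figure 3's y-axis

per_second = records / elapsed           # throughput if the axis is seconds
if_millis = records / (elapsed / 1000)   # throughput if it is milliseconds
print(per_second, if_millis)
```

That is 60 records per second if the axis is in seconds, versus 60,000 per second if it is in milliseconds, which is why clarifying the unit matters so much for judging the approach.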
Limitation: the study is applicable only to "Anglo-Saxon" names, whereas personal data repositories will inevitably contain names of all origins, given the proportion of immigrants in all societies and the global reach of internet services, which rapidly congregate users from all over the world. Would the same study, applied to typically Italian or French names, generate the same results? It is unclear whether observations 1-3 would remain the same, whereas observations 4-6 seem straightforward. Errors in database entries stem from multiple factors, such as names with the same pronunciation but different spellings, or the keyboard layout, which depends on regional factors. These issues are not taken into account by GeCo. It would be important to understand how these would influence the results as well.
The security analysis also relies on American population names dating from the 1990s. There is a mismatch between the names GeCo would generate, and their distribution, and this rather outdated sample. Still, the decryption success rates for 10k samples are already surprising, raising questions about the actual security of the solution when an attacker has more realistic name distributions for larger samples.
Minor problems: some figures have the caption above them, others below. Page 7 has typesetting issues at the bottom of the first column. At the beginning of Section 5, a comma seems to be missing: "once the record linkage is complete, only a limited amount".

Review #3
By Víctor Rodríguez-Doncel submitted on 05/Jul/2016
Review Comment:

This paper evaluates the accuracy and speed of three string metrics used for approximate matching when querying and joining databases on encrypted strings.
The evaluation is performed over synthetic strings generated and corrupted with GeCo. In particular, an analysis is made of the value of q in the q-gram matching algorithm, and of the overall security of the system. The paper is well presented and technically correct. The introduction/motivation might still be improved, as the use case of interest is presented alongside other situations, which might generate confusion.

As a very positive aspect I would mention the availability on GitHub of both data and code, which I could happily inspect at will.

I would reject this paper because the degree of originality is limited, because it is not closely related to the SWJ areas of interest, and because the contribution is of modest importance (in my view).

As the authors acknowledge, this paper builds on their previous work at the European Semantic Web Conference 2016, already published and well visible in the LNCS proceedings. My objections are twofold:

a) Most of the paragraphs in the paper are a verbatim copy of the contribution at ESWC ([7]). From Section 1 to Section 4.5, only two paragraphs differ, which means the first 8 pages have already been published. Subsection 4.6 and Section 5 are original, and the conclusions are again a verbatim copy of [7] plus one sentence.
I am aware of SWJ's policy that "results previously published at conferences or workshops may be submitted as extended versions", but I would still expect an editorial effort to present the work in a better manner (there are now consistency problems with sentences like the one starting with "In this work, we have evaluated...").
Also, if the authors signed the ESWC copyright form, they agreed to transfer to Springer the exclusive right to publish the text "in whole, in part or in abridged form". If I were Springer I would raise a complaint.
If we build on the shoulders of giants, the sentence "Please read the first 8 pages of [7]" would have sufficed.

b) Even setting aside the previous comment, on the basis that SWJ is happy to republish existing work, I believe the extension may not be valuable enough.
It is common sense that record linkage quality degrades with a lower "q" in the q-gram algorithm. By increasing q, the text can be summarized with fewer bits and the information is coarser (perhaps entropy, as an information content measure, could have been offered for each of the values of q).
I believe that offering a quantitative measure of that intuition is not valuable enough, as the degradation may be strongly dependent on the dataset used.
Regarding the security analysis, I believe it is a rather complete study, although I am unaware of all the possible attacks and would need further reading to assess these aspects in depth.
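The dataset-dependence point can be illustrated numerically: a single mid-string typo corrupts up to q consecutive q-grams, so overlap decays faster for larger q, but the slope depends on where the errors land. A tiny sketch (Jaccard over q-gram sets as a rough stand-in for the paper's similarity measure; the names are invented):

```python
def qgram_set(name, q):
    """Set of overlapping length-q substrings of a lowercased name."""
    name = name.lower()
    return {name[i:i + q] for i in range(len(name) - q + 1)}

def jaccard(a, b, q):
    ga, gb = qgram_set(a, q), qgram_set(b, q)
    return len(ga & gb) / len(ga | gb)

# one substituted character in the middle of the name
for q in (1, 2, 3, 4):
    print(q, round(jaccard("michelle", "michalle", q), 2))
```

For this pair the overlap falls from about 0.86 at q=1 to about 0.11 at q=4; a typo at the edge of a name would corrupt far fewer grams, which is exactly why the degradation curve is dataset-dependent.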

Understanding that the Semantic Web Journal's topics of interest are broad, one would still expect a bigger dimension with respect to "semantics". After all, linking names is not much of an activity operating at the semantic level.