Logical Foundations for Data Interlinking with Keys and Link Keys

Tracking #: 1893-3106

Manuel Atencia
Jérôme David
Jérôme Euzenat

Responsible editor: 
Bernardo Cuenca Grau

Submission type: 
Full Paper
Both keys and their generalisation, link keys, have been proposed as a means to perform data interlinking, i.e. finding identical resources in different RDF datasets. However, the usage of keys and link keys for data interlinking has not been formalised yet. This is necessary to ensure the correctness of data interlinking tools based on keys or link keys. Furthermore, such a formalisation allows to understand the differences between keys and link keys and to pin down the conditions under which keys and link keys are equivalent. In this paper, we first formalise how keys can be combined with ontology alignments for data interlinking. Then, we extend the definition of a link key by giving the formal semantics of six kinds of link keys: weak, plain and strong link keys, and their in- and eq-variants. Moreover, we establish the conditions under which link keys are equivalent to keys. Finally, we logically ground the usage of these link keys for data interlinking and show that data interlinking with keys and alignments can be reduced to data interlinking with link keys, but not the other way around.
Full PDF Version: 


Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 29/Jun/2018
Review Comment:

Data interlinking is a major inference in the setting of the Semantic Web. It allows finding identical resources among different datasets, and also, as a particular case, finding duplicate resources within a single dataset. Data interlinking tools are therefore fundamental for developing the Linked Open Data cloud (LOD), as well as for exploring/navigating within it, based on resources of interest.

Data interlinking builds on the notion of keys, which identify resources within a dataset. These keys are defined for classes of resources by some properties they have. Two distinct resources of a same class can then be stated equivalent if their descriptions intersect for some properties and/or are equal for some other properties.
Based on these keys, alignments (correspondences) between datasets can be found/set/enriched.

This paper starts by revisiting two mains notions of keys that have been used in the literature for data interlinking: S-keys and F-keys. S-keys, called here in-keys, state that two resources of a class are the same if their values on each key property intersect, while F-keys, called here eq-keys, impose that their values on each key property are equal.
Two trivial results are then provided: eq-keys are in-keys, and in-keys are eq-keys when the considered properties are functional.
These notions of in- and -eq keys are then generalized to that of keys, so that class resources can be identified with the behavior of in-keys for some properties, and that of eq-keys from some other properties.

Then, the paper studies data interlinking when keys are used in combination with alignments. It points out two simple results that allows finding links, using in- and eq-keys, between the resources of some class C in one dataset and these of another class D, known to be subsumed by C in the alignment, from another dataset.

Finally, this is the main part of the paper, link keys are introduced, compared with keys, and used for data interlinking.
Link keys generalize in- and -eq keys; they are used to establish that a class C resource is the same as a class D resource if their descriptions intersect or are equal wrt some properties. Various flavors of link keys are defined; they differ on whether the set properties used for the descriptions of C and D are keys (strong in- and eq-link keys), are not keys (weak in- and eq-link keys), or are not keys for C and D but for the linked resources (plain in- and eq-link keys).
Some simple results then follow, roughly eq-link keys are in-link keys, and in-link keys are eq-link keys when the considered properties are functional, as well as strong link keys are plain link keys, and plain linked keys are weak keys.
Expected relationships are then pointed out between link keys and keys.
The paper ends with two simple results that allows finding links, using weak in- and eq-link keys. Interestingly, it is pointed out that data interlinking with link keys is more general than with keys and alignments due to the use of properties that do not form keys for the class they describe.


The paper studies the inference problem of data interlinking, which is central for the Semantic Web, especially for the development and use of the LOD cloud datasets.

The goal of the paper is to provide logical foundations for data interlinking with keys and link keys, but unfortunately, in my opinion, this goal is not met.

First, the paper is hard to read and follows because the writing is very technical, though the results (which are easy to understand) are simple. For instance, in the Introduction, the problem of data linking as well as the goal of the study, though important, are neither motivated nor exemplified in details. There is a deluge of variations of key notions, with non-intuitive names, with no idea about their meaning or purpose, etc and of technical results about them, that only specialists on the topic may catch. I think that part of the good examples used in the paper could be brought into the introduction to make it more digestible.

Second, the paper does not really provide logical foundations for data interlinking, but rather use description logics to show some ways of using keys to infer links between data. Further, all the provided technical results are easy to obtain and quite directly follow from the various definitions of keys.

Overall, in my opinion, the paper lacks in originality and significance of the results, and also suffers from the way it is written (more below), for a venue like SWJ.

Other comments:
- in- and eq-keys are introduced right from the start (abstract, introduction) with almost no intuition; we need to reach section 4 to understand that in- means intersection and eq- equality, just before the formal definitions. This should be said in the first place, to help having the meaning of what in- and eq-keys are.

- Propositions 1, 4. There is something wrong here. Every eq-key is a in-key, hence the left-hand and right-hand sides of \models should be switched, as every model of an eq-key is a model of the corresponding in-key.

- Propositions 2, 5. There is something wrong here. When properties are functional, an in-key is an eq-key, hence the left-hand and right-hand sides of \models should be switched, as every model of an in-key with functional properties is a model of the corresponding eq-key.

- Page 8. The notation {p_i}_{i=1}^k = {p_1,…,p_k} is introduced while it has been already been used.

- Beginning of Section 5. It must be said in the text that A is the alignment at hand.

Review #2
Anonymous submitted on 05/Aug/2018
Major Revision
Review Comment:

In this paper the authors present a set of theorems that show how to use keys and link keys to find/link data in different RDF datasets. There are two types of keys: F-Keys and S-Keys. S-Keys need that two instances of a class have the same value for a property, i.e. (reviewer hasEmail rev@swj.org), (reviewer email rev@acm.org) and (reviewer email rev@swj.org) means that hasEmail and email are the same. The system would infer that email and hasEmail are the same. For an F-Key all the instance must share all values, thus in the example before the triple (reviewer email rev@acm.org) should not exists so the property email is an F-Key. The authors generalize these keys plus alignments as (strong) in-keys and (strong) eq-keys and provide six kind of keys based on these two and show how to link data based on link keys. By using these kind of keys that combine keys and alignments the authors show how can dataset linking can be done.

The Introduction section section is well written, however I would add a better explanation of F-Keys and S-Keys by by adding full examples like the triples I write above. This example could be used along all the paper, since when the reader reaches other sections it will be quite useful. The Related Work section is easy to follow and to understand. The authors provide a complete state of the art in linking and existing work on link keys. The preliminaries section include the necessary terminology and definitions to understand the next sections.

Section 4 formalizes the key notions explained before in Description Logic and the authors add an example about these keys. I would add/convert that example in a running example along the whole paper, so the reader can understand better the content. Technically, this section seems sound to me.

Section 5 Data Interlinking with Keys and Alignments provides the reader the theorems to understand how data interlinking can be done by using ontology alignments for in and eq-keys. These two theorems are the grounds for he next sections and they seem sound to me.

Section 6 Link Keys
This section builds the link key concepts from the previous eq and in-link kets. Again this seems sound to me.

The last two sections are theorems and proofs about the relations between keys and link keys.

Overall comments
I think it is a nice work and I find it interesting but I also have two concerns: 1. there is almost no story in this paper, unfortunately it seems a compendium of 3 papers in which the keys, link keys and their relations are described. I would like to see a running example to ease its reading and somehow a story, people use to remember stories rather than a set of sections with definitions, theorems and proofs. My second concern is about the coverage of the techniques in a dataset, i.e. the experimental evaluation. I would like to see its effectiveness on a dataset, for instance the one the authors provide as an example. Since this is a theoretical paper this has not to be an extensive evaluation, just a proof of work that in practice the amount of links discovered is significant.