A Survey of Current Link Discovery Frameworks

Tracking #: 1029-2240

Markus Nentwig
Michael Hartung
Axel-Cyrille Ngonga Ngomo
Erhard Rahm

Responsible editor: 
Natasha Noy

Submission type: 
Survey Article
Links build the backbone of the Linked Data Cloud. With the steady growth in size of datasets comes an increased need for end users to know which frameworks to use for deriving links between datasets. In this survey, we comparatively evaluate current Link Discovery tools and frameworks. For this purpose, we outline general requirements and derive a generic architecture of Link Discovery frameworks. Based on this generic architecture, we study and compare the features of state-of-the-art linking frameworks. We also analyze reported performance evaluations for the different frameworks. Finally, we derive insights pertaining to possible future developments in the domain of Link Discovery.

Minor revision

Solicited Reviews:
Review #1
By Miriam Fernandez submitted on 16/Apr/2015
Minor Revision
Review Comment:

The following paper presents a comprehensive overview of Link Discovery (LD) frameworks. The paper is very instructive and suitable as an introductory text for researchers, PhD students, or practitioners to get started on the topic.

The authors categorised and explained research works based on a general workflow of LD frameworks that includes several phases: configuration, pre-processing, matching and post-processing. The methods used in each phase by the different frameworks are specified, and the works are compared based on these phases and on the methods used. The works are also compared based on reported evaluations.

An interesting aspect of this paper is also the list of requirements presented in Section 2. The authors specified five main requirements for LD frameworks. However, something that I am missing in this paper, and that I think would be really valuable, is a discussion of which frameworks best fit each of these requirements and in which situations or scenarios it would be more appropriate to use one framework over the others. For example, if online LD is required, and effectiveness should be prioritised over efficiency, which of the listed frameworks should I use?

The other issue that I would recommend the authors address (wherever possible) is the "?" -> unclear from publication entries. I would recommend contacting the authors of the unclear papers to clarify with them whether their frameworks fit the selected criteria. This would make the paper more complete.

Regarding specific issues with respect to some sections:

Section 2: “Most entity resolution approaches focus on homogeneous datasets … By contrast the resources for LD can be heterogeneous and highly interrelated”. Any reference to support this criticism?
The LD process usually involves an ontology and instance -> and an instance

Section 2.2
Simple match techniques -> matching techniques
Semantic neighbourhood of an resource -> a resource

Section 3
Workflows which consists -> consist

Section 4.1
This statement obviously does not hold for the framworsks -> frameworks
This in strong contrast -> This is in strong contrast
It is mentioned in this section that dictionaries are used for ontology matching but not so much for linking instance data. Any suggestions or insights as to why? Have dictionaries been tested and not found very useful for linking instance data? Any reference discussing this issue, or any insights from the authors, would be desirable.

Table 3
(*) The legend "not in current release" is applicable to all frameworks that do not include MapReduce. I understand that the authors may be working on adding this element to LIMES. Indeed, it should be, according to reference [17]. So if it is available at the time of publication, please add it to your table; otherwise, I would advise modifying the legend to "investigated [17] but not available as part of the current release".

"Space tiling" is mentioned as a filtering mechanism of LIMES but is not explained in the paper. Please add the corresponding description of this filtering mechanism.

Section 4.6
This section mentions the use of distributed computing as a beneficial element for obtaining high efficiency and scalability. However, distributed computing may not be necessary for all scenarios, particularly when the datasets are small. A discussion of the situations/criteria in which distributed computing may be useful (e.g., datasets bigger than size X) would make this section more useful.

Section 4.9.
In this section it is mentioned that "the high potential of utilising existing links and mappings as well as other data sources or dictionaries as background knowledge has not yet been explored". A similar comment about the use of dictionaries was made in Section 4.1. Reading this, I wondered why this is the case, and whether there are already studies that have applied these resources and found them useful. If so, please add the corresponding references here. Also, learning from links generated in one domain may not help to discover links for a different domain. A discussion explaining the potential benefits/drawbacks of using these resources and how this could be a future research direction for LD would be desirable.

An additional brief discussion that may be interesting to add to the paper is an overview of the different areas where LD has been investigated, so that the list of studied frameworks can be positioned within the LD literature: for example, LD over relational databases, or online LD performed by semantic search / question answering applications.

Review #2
By Arnab Dutta submitted on 23/Apr/2015
Minor Revision
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community.

1. The topic is suitable for a broad audience ranging from students to researchers. It is nicely introduced and easily comprehensible. As a starting point, this paper is recommended. It is also valuable for the whole Semantic Web community.

2. The authors have presented the details in a well-structured way. Tables are always good in cases like this, where comparative studies are being conducted. The use of simple English and one connected flow makes the paper a pleasure to read.

3. The authors have introduced the well-known benchmarks. They have also briefly described them, which gives a sneak peek into these kinds of benchmarks and datasets, and compared them across the important LD frameworks.

4. The key aspects of a good LD framework have been presented in Section 2.2, but later on I missed which frameworks are efficient and which lack powerful infrastructure. It would be informative to present a simple table giving a snapshot of all the LD frameworks along with the requirements they satisfy. The information is there, but between the lines; one consolidated view would give readers an easy reference.

5. The authors have mentioned different benchmarks and reported the F-measures for the different LD frameworks. But if an end user turns to this paper to choose a good LD framework for their task, they won't benefit greatly. The reason is that no specific use cases are defined that provide a clear verdict on the choice of a particular LD framework for a particular task. Do I always choose KnoFuss since its F1 is better? Or could some other LD framework be better suited, even if it did not take part in the benchmark or has a lower F1? A bit of analytical discussion is missing.

6. I understand that most of the LD frameworks did not participate in all the tasks and in all the years. This is a serious flip side of using such an evaluation setup: it makes the setup skewed, without all the LD frameworks being seen in the same light. The authors mention this. Can the authors briefly suggest some generic evaluation framework? Is it impossible to devise one?

7. It would be helpful to discuss a few pointers to future areas of research in designing an LD framework. What can be improved? What cannot? Are the state-of-the-art frameworks absolutely complete? These are some of the research questions one would ask when planning to create a new LD framework.

8. A very minor comment: the table and figure captions seem smaller than the content itself. If this is not dictated by the style file, please alter it; it de-emphasizes what the table or figure is about.

Review #3
By Kavitha Srinivas submitted on 28/Apr/2015
Minor Revision
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community.

1. Suitability: This is a well written survey paper, covering 10 systems that perform linking at the instance level, and focuses particularly on systems that have had relatively good performance on OAEI benchmarks in the appropriate year, or used learning based approaches using OAEI datasets.

2. Comprehensive/Balanced: This is where I felt that the paper could improve. The criteria used to select the tools (must use OAEI benchmarks) would make sense if OAEI were actually used to compare them at the end, so researchers could get a sense of which tools have strengths on a given set of benchmarks. However, given that it is actually hard to compare the tools, even by the authors' own admission ("Despite the laudable effort of the OAEI instance matching tracks the comparable evaluation of existing tools for LD is still a largely open challenge."), it would perhaps strengthen the paper to also consider work that did not use the OAEI benchmarks. One example of such a paper (and by no means the only one) is "Discovering Linkage Points over Web Data" by O. Hassanzadeh et al.

3. Readability and Clarity: No comments here; the paper is well written and well organized around features that are desirable for such systems.

4. Importance: Clearly important.