DRX: A LOD browser and dataset interlinking recommendation tool

Tracking #: 1267-2479

Alexander Mera
Bernardo Pereira Nunes
Marco Antonio Casanova

Responsible editor: 
Jérôme Euzenat

Submission type: 
Tool/System Report

Abstract:
With the growth of the Linked Open Data (LOD) cloud, data publishers face a new challenge: finding related datasets to interlink with. To address this challenge, this paper describes a tool, called DRX, that assists data publishers in interlinking datasets and browsing the LOD cloud. DRX is organized in five main modules responsible for: (i) collecting data from datasets on the LOD cloud; (ii) processing the collected data to create dataset profiles; (iii) grouping datasets using clustering algorithms; (iv) providing dataset recommendations; and (v) supporting browsing of the LOD cloud. Experimental results show that DRX obtains good overall MAP when applied to real-world datasets, which demonstrates its ability to facilitate dataset interlinking.
Major Revision

Solicited Reviews:
Review #1
By Heiko Paulheim submitted on 15/Jan/2016
Major Revision
Review Comment:

The paper introduces a tool which generates topical profiles for datasets, and uses those profiles for recommendation of datasets to be interlinked with other datasets. The recommendation is done based on topic vectors, derived from the top categories in Wikipedia. An evaluation on datasets from the Mannheim LOD catalogue shows a moderate precision. The paper is nicely written and easy to follow.

The problem addressed is very relevant. For data publishers, finding possible candidate datasets for interlinking may be challenging. Thus, solutions such as the one proposed in this paper would be very helpful, especially when integrated with an actual interlinking tool, such as SILK.

That being said, I am not very confident that the approach proposed in this paper actually solves that problem. My main concern is that the authors use topical similarity as a proxy for interlinking suitability. This is actually not valid:
* as shown by Schmachtenberg et al. 2014 (cited by the authors), there are a few established interlinking hubs, such as DBpedia and Geonames, which are used as interlinking targets by many datasets. However, the topical profile of those hubs will usually be very different from that of the interlinking source.
* on the other hand, if two datasets have a similar topical profile, this does not guarantee any instance overlap. For example, a dataset about English and a dataset about Spanish literature will be topically very similar, but not share a significant amount of instances. Such cases are actually quite frequent in the LOD cloud (e.g., different scientific publishers opening up their metadata as LOD, different local governments publishing open data, etc.).
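The second point can be illustrated with a toy sketch (all vectors, URIs, and set contents below are hypothetical, invented purely for illustration): two datasets may be almost identical in topic space while sharing no instances at all.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def jaccard(a, b):
    """Instance overlap between two sets of entity URIs."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Toy topic vectors over (Literature, Geography, Science).
english_lit = [0.9, 0.05, 0.05]
spanish_lit = [0.85, 0.1, 0.05]

# Toy instance sets (author URIs) with no overlap.
english_instances = {"dbr:Charles_Dickens", "dbr:Jane_Austen"}
spanish_instances = {"dbr:Miguel_de_Cervantes", "dbr:Federico_Garcia_Lorca"}

print(cosine(english_lit, spanish_lit))               # near 1.0: topically very similar
print(jaccard(english_instances, spanish_instances))  # 0.0: no shared instances
```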

These problems are probably reflected in the relatively low MAP value. Here, I would have appreciated a more thorough discussion, e.g., which sorts of problems have been observed, and how the datasets for which the approach works well differ from those for which it does not. Simply stating that there are a few false positives is not sufficient for a discussion.
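For readers unfamiliar with the metric, MAP (mean average precision) can be computed as in the sketch below. This is the textbook formulation, not necessarily the authors' exact implementation, and the dataset identifiers are hypothetical.

```python
def average_precision(ranked, relevant):
    """AP for one query: mean of precision@k taken at each relevant hit."""
    hits, precisions = 0, []
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_recommendations, relevant_set) pairs, one per source dataset."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example: two source datasets with ranked recommendation lists.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d4", "d5"], {"d5"}),              # AP = (1/2) / 1
]
print(mean_average_precision(runs))      # 2/3 ≈ 0.667
```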

With respect to related work, there are a few more works that attempt topical profiling of datasets, also using features other than Wikipedia categories (e.g., vocabulary usage) [1-3]. It might be worth reviewing those works for the task at hand.

A few minor remarks:
* Details on how the interlinking to Wikipedia is done are missing. This step is not trivial, and it may introduce a lot of errors (i.e., wrong links), which may lead to errors at a later stage in the process. Thus, this step deserves a small evaluation of its own.
* Euclidean distance may not be the best distance measure for the representation at hand. These are vectors in a topic space, and I would expect cosine similarity to deliver better results.
* Fig. 3 depicts the results at different clustering sizes, but X-means does an automatic assessment of the cluster size. Did the authors run another set of experiments with k-means, using different values for k manually?
* I would appreciate a comparison of the 23 categories used by the authors to the 8 categories used in the LOD diagram. Are there any interesting correlations? The original categories are most likely not perfect, so this could lead to some interesting insights.
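The remark above on distance measures can be made concrete with a small sketch (toy vectors, not the tool's actual fingerprints): two datasets with the same topical distribution but different scale (e.g., one simply has ten times more entities) are far apart under Euclidean distance, yet point in the same direction under cosine similarity.

```python
from math import sqrt

def euclidean(u, v):
    """Euclidean distance between two topic vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine similarity between two topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Same topical distribution, different magnitude.
small = [1.0, 2.0, 3.0]
large = [10.0, 20.0, 30.0]

print(euclidean(small, large))  # large distance despite identical topic mix
print(cosine(small, large))     # ≈ 1.0: identical direction in topic space
```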

In summary, the work addresses an important topic. However, the approach has a few significant drawbacks, and the results are not yet convincing, which limits the practical usefulness of the tool introduced. I suggest deepening the research on the topic a bit before resubmission.

[1] Meusel et al. (2015): Towards Automatic Topical Classification of LOD Datasets
[2] Ellefi et al. (2014): Towards semantic dataset profiling
[3] Böhm et al. (2013): Latent topics in graph-structured data

Review #2
By Anastasia Dimou submitted on 01/Mar/2016
Major Revision
Review Comment:

This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

This is a system paper that describes the DRX tool. This tool aims to support users in discovering datasets to interlink with a given dataset. The paper deals with an interesting topic; however, it suffers from some vague or weak points that need to be addressed or clarified. I explain below in more detail:

The Contextualization section reads more like a state-of-the-art section. It consists of a brief introduction of two paragraphs, which, however, fails to clearly introduce what the problem is and why it requires attention. The remainder of the section is like a typical related work section. Even though the state of the art has sufficient length, it focuses more on the methodologies followed by other approaches than on describing the systems themselves, which would be more relevant given the purpose of this paper. My suggestion would be to give more focus to the details of the other systems and to a comparison between systems rather than between methodologies.

Again in the contextualization section, the following is mentioned: “.. the selection of the source and target datasets to be interlinked is still a manual, often non-trivial task. In what follows, we refer to this task as dataset interlinking and to the problem of suggesting a list of datasets to be interlinked with a given dataset as the dataset interlinking recommendation problem”. However, according to [1], dataset interlinking is the process of explicitly establishing links between instances from different data sources. I think what you describe is closer to what is known in the literature as dataset discovery, applied in your case in the context of dataset interlinking.

One last remark about the contextualization section, which also applies to the rest of the paper: the term “textual resource” is used. However, this term is associated with plain text. What the system appears to support, considering Figure 1, is structured data. So, what exactly do you mean by “textual resources”? Is it a solution that deals with both plain text and structured data, or do you aim to differentiate from other media? Could you please clarify? Overall, I would suggest being more careful with the proper use of terminology.

In the second section, while the first module claims that datasets are collected from the LOD cloud, non-RDF data is considered later on. Where do these data sources come from? The LOD cloud and the Mannheim catalog only contain datasets in RDF. Or is it only the case that manually submitted data sources might not be in RDF?

In the same context, one of my major concerns regarding this paper is how generic these Linked Data wrappers are. Do they employ Direct Mappings? If not, how do they function? How is the data model identified, and which vocabularies are used? Automatically generating proper annotations is a rather cumbersome task and, as [2] showed, most tools for generating mappings from relational databases to RDF suffer from shortcomings, let alone other formats. So, how exactly is this task addressed in the case of DRX? How complete is the generated RDF representation after all, and how much does it influence the actual interlinking task? Without more clarification regarding how the Linked Data wrappers function, the data acquisition layer is rather vaguely described, and I would suggest elaborating further.

Moreover, in relation to the aforementioned comment and the lack of clarity regarding how Linked Data is generated from raw data, it is not clear which predicates are used, and how they are selected, to assess whether a certain dataset is relevant or not. In Section 3, it is mentioned that rdfs:label, skos:altLabel, or skos:prefLabel are used. Is this only for this use case, or are these properties always used? Are all matchings, then, performed over literal values? Other well-known interlinking tools allow more options, e.g., also taking numerical values into consideration. What about DRX? I would suggest clarifying how the properties used to match entities are selected and what types of comparisons are supported.
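To illustrate this reading of the matching step, here is a hypothetical sketch of purely label-based matching over rdfs:label, skos:prefLabel, and skos:altLabel. This is an assumption about how such matching might work, not DRX's documented implementation, and all triples below are invented.

```python
# Hypothetical label-based matching: entities are compared only on their
# literal label values, normalized (lowercased, stripped) before comparison.
LABEL_PREDICATES = {"rdfs:label", "skos:prefLabel", "skos:altLabel"}

def labels(triples, subject):
    """Collect normalized label literals for one subject from (s, p, o) triples."""
    return {obj.strip().lower()
            for s, p, obj in triples
            if s == subject and p in LABEL_PREDICATES}

def label_match(triples_a, ent_a, triples_b, ent_b):
    """Two entities match if they share at least one normalized label."""
    return bool(labels(triples_a, ent_a) & labels(triples_b, ent_b))

a = [("ex:1", "rdfs:label", "Rio de Janeiro")]
b = [("ex:9", "skos:prefLabel", "rio de janeiro"),
     ("ex:9", "skos:altLabel", "RJ")]
print(label_match(a, "ex:1", b, "ex:9"))  # True: labels match after normalization
```

Note that such a scheme is inherently restricted to string comparison, which is exactly why the question about numerical values matters.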

In the same context, how are the entities extracted? Could you elaborate more? Also, you mention that you link the extracted entities to Wikipedia articles. Literally Wikipedia articles? Or do you mean DBpedia entities? Considering the type of the paper, I would expect not only the architecture but also the workflow to be clearly described, so that the reader is in a position to have a good understanding of how the tool functions.

Overall, (i) a running example, with results after each step/layer, would help the reader better understand how the system functions; (ii) aligning the modules, steps, and layers would help clarify which need is covered by each module; (iii) it would be useful to also summarize here the techniques that DRX implements, instead of only providing a reference to another paper ([9]). This paper should be self-standing.

From the abstract, it is already mentioned that the DRX tool has ‘good overall MAP’. What does “good” mean? I would suggest that the authors make it more explicit how good their system is by providing evidence from their evaluation. Looking at the evaluation results, though, I am not convinced how good the solution is after all. Similar comments are spread all over the text, for instance “to provide an easy way to read and understand dataset profiles...”. Why is this an easy way? Is there any evidence, or is it proven to be easy? I would suggest that the authors keep the text more neutral if there is no evidence.

I tried several times to access the demo, but it was either down or not working, so I could not actually use it to assess its functionality. All in all, I think it is a useful tool; however, due to my concerns stated above, I do not think this paper is publishable as is.

p. 1, sec. 1 → it would be better to treat all references equally: either provide a footnote with a reference to the webpage for both LIMES and Silk (least preferable), or a citation for both tools, or both the webpage and a citation.
p. 2, TRT and other abbreviations: preferably, first provide the full name together with its abbreviation, and use the abbreviation afterwards.

[1] Euzenat, J., Shvaiko, P.: Ontology Matching. Springer-Verlag (2013)

[2] Pinkel, C., Binnig, C., Jiménez-Ruiz, E., May, W., Ritze, D., Skjaeveland, M. G., Solimando, A., Kharlamov, E.: RODI: A Benchmark for Relational-to-Ontology Data Integration (2015)

Review #3
Anonymous submitted on 02/Mar/2016
Major Revision
Review Comment:

This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

This paper presents the DRX tool for dataset interlinking. When publishing a new dataset, it enables the user to get recommendations of datasets available on the LOD cloud to which to link their dataset.
The interest of such a tool to support linked data publishers is thus quite clear.
The paper is well written, well illustrated and clear.

However, IMO, there are several improvements to be made before this paper can be published.
1) The system should be made free, open, and accessible on the Web and the paper should indicate a download address.
2) The capabilities of the tool are clear enough but there is no mention of its limitations. Moreover, I would not say that DRX is a LOD browser since there are other ways (and kinds of tools) to browse the LOD, here the browsing is part of the recommendation process.
3) The tool implements a technique proposed by other authors in a previously published paper. Of course, the present paper does not have to describe it again, but IMO the motivation for choosing this technique should be given. The same remark applies to the clustering module; for the Ranking module, we would expect an indication of the kind of distance used between fingerprints and the motivation for this choice. Also, how data is integrated (Fig. 1) is not explained, and it seems that the chosen technique is not specific to LOD data and could apply to any textual data, which is quite surprising.
4) For the evaluation of the tool, the results should be compared at least with those of TRT and TRTML (and possibly also with those of the other cited papers). Moreover, a (possibly comparative) user evaluation of the tool would be very welcome. Finally, a report on an application of the tool to a real-world publishing problem would also help convince the reader of the actual capabilities/usability of the tool.
5) Finally, a brief description of the ongoing or future work would be welcome in conclusion.