DRX: A LOD dataset interlinking recommendation tool

Tracking #: 1370-2582

Alexander Mera
Bernardo Pereira Nunes
Marco Antonio Casanova

Responsible editor: 
Jérôme Euzenat

Submission type: 
Tool/System Report

Abstract:
With the growth of the Linked Open Data (LOD) cloud, data publishers face a new challenge: finding related datasets to interlink with. To face this challenge, this paper describes a tool, called DRX, to assist data publishers in interlinking datasets and browsing the LOD cloud. DRX is organized in five main modules responsible for: (i) collecting data from datasets on the LOD cloud; (ii) processing the collected data to create dataset profiles; (iii) grouping datasets using clustering algorithms; (iv) providing dataset recommendations; and (v) supporting browsing of the LOD cloud. Experimental results show that DRX has the potential to be used as a dataset interlinking facilitator.

Solicited Reviews:
Review #1
By Heiko Paulheim submitted on 12/May/2016
Minor Revision
Review Comment:

As stated in my review for the previous version, I deem the topic of this paper very relevant. The paper itself is clearly written and very easy to follow. I am happy that my suggestion of using cosine similarity instead of Euclidean distance increased the result quality :-)
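To make the distinction concrete, here is a minimal sketch of the two measures over made-up profile vectors (the vectors and function names are illustrative only, not DRX code): cosine similarity ignores vector magnitude, so two datasets with profiles of very different sizes but the same topic proportions still score as highly similar, whereas Euclidean distance penalizes the size difference.

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance; sensitive to vector magnitude (profile size).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Angle-based; invariant to magnitude, so profiles with the same topic
    # proportions at different scales still score as maximally similar.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two hypothetical profiles with the same topic mix at different scales:
p1 = [1.0, 2.0, 0.0]
p2 = [10.0, 20.0, 0.0]
print(euclidean_distance(p1, p2))  # large, because the magnitudes differ
print(cosine_similarity(p1, p2))   # 1.0, because the directions coincide
```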

Most of my concerns raised in the previous review have been properly addressed. Most prominently, the authors discuss the limitations of the tool in a very appropriate manner.

I have a few minor suggestions. First, I like the fact that the authors are critical about the categorization of the LOD cloud with its eight topical categories, showing that eleven clusters actually lead to better results. The relation of those eleven clusters to the original eight categories, as stated in the rebuttal, should be made available, e.g., as an appendix to this paper, because it is a very interesting practical result of data profiling.

Second, while I like the discussion of false positives and negatives in 5.3, I think it would be more helpful to use real examples instead of the abstract d_j, d_k, etc. (DBpedia being mentioned as a typical false negative is a notable exception here).

Third, WikipediaMiner should be mentioned and briefly described (a few sentences are sufficient) already in section 3, not in 5. The remaining steps of the fingerprinting are rather clear.

In summary, with this revision, the paper is on a good path to being accepted for SWJ.

Review #2
By Anastasia Dimou submitted on 06/Jun/2016
Minor Revision
Review Comment:

This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

This is the second attempt to describe the DRX tool. DRX is a tool that aims to find candidate datasets in the Linked Open Data cloud and recommend them for interlinking with a given dataset. To achieve this, it creates profiles of the datasets in the Linked Open Data cloud and clusters them using clustering algorithms.
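As I understand it, the profile-and-recommend part of the pipeline amounts to something like the following toy sketch (the dataset names, text literals, and simple bag-of-words profiling below are invented for illustration; DRX's actual profiling technique is the one cited in the paper, and the clustering step is omitted here):

```python
from collections import Counter
import math

def profile(text_literals):
    # Build a simple bag-of-words profile from a dataset's text literals.
    words = " ".join(text_literals).lower().split()
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine(p, q):
    # Cosine similarity between two sparse word-frequency profiles.
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def recommend(target_profile, candidate_profiles):
    # Rank candidate datasets by profile similarity to the target dataset.
    return sorted(candidate_profiles,
                  key=lambda name: cosine(target_profile, candidate_profiles[name]),
                  reverse=True)

# Hypothetical datasets with a few text literals each:
datasets = {
    "geo-a": ["city population latitude", "country capital"],
    "geo-b": ["latitude longitude city"],
    "bio-a": ["protein gene sequence"],
}
profiles = {name: profile(texts) for name, texts in datasets.items()}
candidates = {n: p for n, p in profiles.items() if n != "geo-a"}
print(recommend(profiles["geo-a"], candidates))  # geo-b ranks above bio-a
```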

The major problem that I identify in this version of the system's description is the lack of a clear presentation of the data flows. The paper describes the system's architecture but not the workflow(s). As far as I understood, there is the *analysis* workflow, where the LOD datasets are analysed and profiles are generated, and there is a *consumption* workflow, which is triggered by a user who submits a dataset and expects to get back candidate datasets to interlink with the submitted dataset. (I name the two workflows this way for future reference.)

The analysis workflow is described under the architecture section, whereas part of the consumption workflow is described in the last (two) paragraphs of the architecture section and continues in the use case section, while certain modules are used by both workflows, e.g. the profiling module (if I understand properly). In particular, the fact that steps are mentioned in the architecture section raises expectations for a workflow description rather than an architecture description, yet the exact steps are not explicitly stated, even though they are implied within the GUI/Case Study section. It would help if at least the double arrows were avoided and, instead, two different types of arrows showed exactly what the (two) different workflows are.

Minors in this respect:
- the “Integrated data” of fig 1 is nowhere mentioned in the text
- On the contrary, the Wikipedia Miner is mentioned in the text but is not present in the figure, and a reference or footnote is missing
- “Independently of the strategy chosen, for a given dataset dt, the dataset recommendation module outputs a list of datasets ordered by the probability of being interlinked.” → But it is not clarified how the two strategies are combined.
- It is contradictory to conclude that the “maximum value of overall MAP is 18.44%, when the number of clusters was equal to 11.” Why, then, is the user allowed to choose the number of clusters and seeds, if an optimal value is known? In which cases is it meaningful to choose something different?
- Besides the aforementioned, and returning to the now-called “text literals”: are they xsd:string literals only? Or are other datatypes also taken into consideration? I assume it is the former, and that gives me the impression that there is much room for improvement.
- Moreover, I still find the following a bit ad hoc: “In the case study, we used a minimum of 8 clusters, since this is the number of categories of the LOD diagram. The maximum number of clusters and the number of seeds were set to 10.” But I take it that some number needed to be chosen.
- "However, defining RDF links between datasets helps improve data quality" → Could you provide a reference for this statement?
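For context on the metric quoted above, Mean Average Precision over a set of recommendation queries can be computed as follows (a minimal sketch with made-up rankings and relevance sets, not DRX output):

```python
def average_precision(ranked, relevant):
    # Average of precision@k at each rank k where a relevant item appears.
    hits, score = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / k
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    # queries: list of (ranked_recommendations, set_of_truly_linked_datasets)
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

# Hypothetical recommendation lists for two input datasets:
queries = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6", "d7"], {"d7"}),              # AP = 1/3
]
print(mean_average_precision(queries))
```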

However, once the workflows are better described, at least the second dimension under which this paper is reviewed (i.e. clarity, illustration and readability) is covered. The paper describes the capabilities of the system and, now, also its limitations.

Talking about limitations, I would still consider that the tool provides a naive approach to the underlying problem, which causes the discussed limitations and the corresponding MAP. However, it is at least one of the few existing tools aiming to address the problem of dataset recommendation for interlinking, and that makes it important in my opinion (covering the first dimension under which this paper is reviewed).

Review #3
By Catherine Faron-Zucker submitted on 13/Jun/2016
Minor Revision
Review Comment:

The authors answer the following points raised in my review of the first version of their paper.
- The quality of LOD browser has been removed from the title which makes the focus of the paper clearer.
- A url is provided where a demo can be found, the system can be tested online and the source code can be downloaded.
- The limitations of the tool are discussed in section “Discussion” which is new.

The motivation for the techniques chosen for the three modules of the tool in the data processing layer remains absent. The profiling module implements a technique proposed by other authors. What is the motivation for choosing this technique? The same question applies to the clustering module and to the ranking module (why the two strategies described, which scenarios do they address, and why the cosine distance?). In each case, one or two sentences would be sufficient to make the reader aware of the choices underlying the modules.
Additionally, as already mentioned in my first review, I am surprised by the description of the crawling module, which does not seem to give a special place to the RDF data. It seems that the technique described is not specific to handling LOD data and could apply to any textual data, which is quite surprising. Do I understand correctly? If yes, I suggest that you justify your approach (and if not, that you explain things better). For instance, I am inclined to think that class labels could be given a higher weight than other literals. Here again, a few sentences would change things for the reader.

My suggestions about the evaluation of the tool (with regards to the “quality, importance, and impact of the described tool or system” in the call) are only partially answered:
- The results obtained with the tool on a gold standard are still not compared with any other state-of-the-art tools. The comparison with TRT and TRTML, which I recommended in my previous review, now appears as future work.
- I also suggested a user evaluation of the tool and/or a report on its application to a real-world publishing problem in order to show the actual capabilities/usability of the tool. These are still not mentioned in this new version of the paper. The (newly added) discussion of the results of the evaluation conducted shows the usefulness of the user evaluation I suggested.
I recommend again to compare the results of the experiment conducted on a gold standard with those that would be obtained with TRT and TRTML or to conduct a user evaluation.

As a conclusion, as already expressed in my first review, I think DRX may be a very useful tool for the community and that the paper describing it is well written, but also that it still lacks some justification of the choices made when designing the different modules of the tool, as well as a comparative evaluation or a user evaluation.