Review Comment:
This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.
This is the second attempt to describe the DRX tool. DRX is a tool that aims to find candidate datasets in the Linked Open Data cloud and recommend them for interlinking with a given dataset. To achieve that, it creates profiles of the datasets in the Linked Open Data cloud and clusters them using corresponding algorithms.
The major problem that I identify in this version of the system's description is the lack of a clear presentation of the data flows. The paper describes the system's architecture but not the workflow(s). As far as I understood, there is an *analysis* workflow, where the LOD datasets are analysed and profiles are generated, and a *consumption* workflow, which is triggered by a user who submits a dataset and expects to get candidate datasets to interlink with it. (I name the two workflows this way for future reference.)
The analysis workflow is described in the architecture section, whereas the consumption workflow is split: part of it appears in the last two paragraphs of the architecture section and the rest in the use case section. In addition, certain modules seem to be used by both workflows, e.g. the profiling module (if I understand correctly). Because steps are mentioned in the architecture section, the reader expects a workflow description rather than an architecture description, yet the exact steps are never stated explicitly, even though they are implied in the GUI/Case Study section. It would help if, at the very least, the double arrows were avoided and two different types of arrows showed exactly what the two workflows are.
Minor issues in this respect:
- The “Integrated data” of Fig. 1 is not mentioned anywhere in the text.
- Conversely, the Wikipedia Miner is mentioned in the text but is not present in the figure, and a reference or footnote for it is missing.
- “Independently of the strategy chosen, for a given dataset dt, the dataset recommendation module outputs a list of datasets ordered by the probability of being interlinked.” → But it is not clarified how the two strategies are combined.
- It is contradictory to conclude that the “maximum value of overall MAP is 18.44%, when the number of clusters was equal to 11.” and yet allow the user to choose the number of clusters and seeds. If an optimal value is known, why is the user allowed to choose? In which cases is it meaningful to choose something different?
- Besides the aforementioned, and returning to the now-called “text literals”: are they literals of type xsd:string only, or are other datatypes also taken into consideration? I assume it is the former, and that gives me the impression that there is much room for improvement.
- Moreover, I still find the following a bit ad hoc: “In the case study, we used a minimum of 8 clusters, since this is the number of categories of the LOD diagram. The maximum number of clusters and the number of seeds were set to 10.” However, I accept that some number had to be chosen.
- "However, defining RDF links between datasets helps improve data quality" → Could you provide a reference for this statement?
However, as soon as the workflows are better described, at least the second dimension under which this paper is reviewed (i.e. clarity, illustration, and readability) will be covered. The paper describes the capabilities of the system and, now, also its limitations.
Speaking of limitations, I still consider the tool's approach to the underlying problem naive, which causes the discussed limitations and the corresponding MAP. However, it is one of the few existing tools that aim to address the problem of dataset recommendation for interlinking, and that makes it important in my opinion (covering the first dimension under which this paper is reviewed).