Review Comment:
This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.
This is a system paper describing the DRX tool, which aims to support users in discovering datasets to interlink with a given dataset. The paper addresses an interesting topic; however, it suffers from some vague or weak points that need to be addressed or clarified. I explain below in more detail:
The Contextualization section reads more like a state-of-the-art section. It consists of a brief two-paragraph introduction that nevertheless fails to clearly state what the problem is and why it requires attention. The remainder of the section is a typical related-work section. Even though the state of the art is of sufficient length, it focuses on the methodologies followed by other approaches rather than on system descriptions, which would be more relevant given the purpose of this paper. My suggestion would be to put more focus on the details of the other systems and on a comparison of systems rather than of methodologies.
Again in the Contextualization section, the following is mentioned: “.. the selection of the source and target datasets to be interlinked is still a manual, often non-trivial task. In what follows, we refer to this task as dataset interlinking and to the problem of suggesting a list of datasets to be interlinked with a given dataset as the dataset interlinking recommendation problem”. However, according to [1], dataset interlinking is the process of explicitly establishing links between instances from different data sources. I think what you describe is closer to what is known in the literature as dataset discovery, applied in your case in the context of dataset interlinking.
A last remark about the Contextualization section, which also applies to the rest of the paper: the term “textual resource” is used. However, this term is associated with plain text, whereas what the system appears to support, judging from Figure 1, is structured data. So, what exactly do you mean by “textual resources”? Is it a solution that deals with both plain text and structured data, or do you aim to differentiate from other media? Could you please clarify? Overall, I would suggest being more careful with the use of terminology.
In the second section, the first of the modules claims that datasets are collected from the LOD cloud, yet non-RDF data is considered later on. Where do these data sources come from? The LOD cloud and the Mannheim catalog only contain datasets in RDF. Or is it the case that only manually submitted data sources might not be in RDF?
In the same context, one of my major concerns regarding this paper is how generic these Linked Data wrappers are. Do they employ Direct Mappings? If not, how do they function? How is the data model identified, and which vocabularies are used? Automatically generating proper annotations is a rather cumbersome task and, as [2] showed, most tools for generating mappings from relational databases to RDF suffer from shortcomings, let alone other formats. So, how exactly is this task addressed in the case of DRX? How complete is the generated RDF representation, after all, and how much does it influence the actual interlinking task? Without more clarification of how the Linked Data wrappers function, the data acquisition layer remains vaguely described, and I would suggest elaborating further.
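To illustrate the level of detail I would expect: if the wrappers were to follow the W3C Direct Mapping, a relational row would be translated mechanically into triples along the following lines. This is only a sketch of my own; the table, column names, and base IRI are hypothetical, and a real wrapper would also have to handle foreign keys, NULLs, datatypes, and vocabulary reuse — exactly the points the paper leaves open.

```python
# Sketch of W3C Direct Mapping-style triple generation for one relational
# row. Table/column names and base IRI are hypothetical; real wrappers must
# also handle foreign keys, NULLs, datatypes, and percent-encoding.

BASE = "http://example.org/base/"

def direct_map_row(table, pk_col, row):
    """Translate one row (dict of column -> value) into RDF triples."""
    subject = f"<{BASE}{table}/{pk_col}={row[pk_col]}>"
    triples = [(subject,
                "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>",
                f"<{BASE}{table}>")]
    for col, val in row.items():
        # Each column becomes a table-scoped predicate with a literal object.
        triples.append((subject, f"<{BASE}{table}#{col}>", f'"{val}"'))
    return triples

row = {"id": 7, "name": "Berlin", "population": 3645000}
for t in direct_map_row("City", "id", row):
    print(" ".join(t), ".")
```

Whether DRX's wrappers produce something comparable, and how complete it is, is precisely what the paper should spell out.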
Moreover, in relation to the previous comment and given the lack of clarity about how Linked Data is generated from raw data, it is not clear which predicates are used, and how they are selected, to assess whether a certain dataset is relevant or not. In Section 3, it is mentioned that rdfs:label, skos:altLabel, or skos:prefLabel are used. Is this specific to this use case, or are these properties always the ones used? Are all matches therefore performed on literal values? Other well-known interlinking tools allow more options, e.g., also taking numerical values into consideration. What about DRX? I would suggest clarifying how the properties used to match entities are selected and what types of comparison are supported.
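To make the concern concrete: if matching is restricted to the three label-like predicates named in Section 3, then comparison reduces to string similarity over literals, and typed values such as populations can never contribute. A minimal sketch of my own (the entity representation and the 0.9 threshold are hypothetical, not taken from the paper):

```python
from difflib import SequenceMatcher

# The three predicates named in Section 3 of the paper.
LABEL_PREDICATES = {"rdfs:label", "skos:altLabel", "skos:prefLabel"}

def label_values(entity):
    """Collect only the literal values of label-like predicates.
    An entity is represented here as a list of (predicate, value) pairs."""
    return [v for p, v in entity if p in LABEL_PREDICATES]

def match(e1, e2, threshold=0.9):
    """String-similarity match over labels. Numeric predicates such as
    dbo:populationTotal are silently ignored, whereas tools like LIMES
    or Silk can compare them with numeric distance measures."""
    return any(SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
               for a in label_values(e1) for b in label_values(e2))

e1 = [("rdfs:label", "Berlin"), ("dbo:populationTotal", "3645000")]
e2 = [("skos:prefLabel", "berlin"), ("dbo:populationTotal", "3644826")]
print(match(e1, e2))  # True: labels match; the population values play no role
```

If DRX's matching is indeed of this shape, the paper should say so; if it supports more comparison types, those should be listed.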
In the same context, how are the entities extracted? Could you elaborate? Also, you mention that you link the extracted entities to Wikipedia articles. Literally Wikipedia articles, or do you mean DBpedia entities? Considering the type of the paper, I would expect not only the architecture but also the workflow to be clearly described, so that the reader is in a position to understand how the tool functions.
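The distinction matters less than it may seem, since an English Wikipedia article title corresponds to a DBpedia resource IRI by the standard title-to-IRI convention — but the paper should still say which identifier is actually stored. For reference, a minimal sketch (the function name is mine, and percent-encoding of edge cases is simplified):

```python
from urllib.parse import quote

def wikipedia_to_dbpedia(title):
    """Map an English Wikipedia article title to its DBpedia resource IRI:
    spaces become underscores; most other punctuation is percent-encoded."""
    return "http://dbpedia.org/resource/" + quote(title.replace(" ", "_"),
                                                  safe="_()',")

print(wikipedia_to_dbpedia("Tim Berners-Lee"))
# http://dbpedia.org/resource/Tim_Berners-Lee
```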
Overall, (i) a running example, with results after each step/layer, would help the reader better understand how the system functions; (ii) aligning the modules, steps, and layers would help clarify which need is covered by each module; (iii) the techniques that DRX implements should be summarized here as well, rather than only referenced from another paper ([9]). This paper should be self-standing.
Already in the abstract, it is mentioned that the DRX tool has “good overall MAP”. What does “good” mean? I would suggest that the authors make explicit how good their system is by providing evidence from their evaluation. Looking at the evaluation results, though, I am not convinced of how good the solution is after all. Similar unsupported claims are spread throughout the text, for instance “to provide an easy way to read and understand dataset profiles...”. Why is this an easy way? Is there any evidence, or has it been shown to be easy? I would suggest that the authors keep the text more neutral where there is no evidence.
I tried several times to access the demo, but it was either down or not working, so I could not actually use it to assess its functionality. All in all, I think this is a useful tool; however, due to the concerns stated above, I do not think this paper is publishable as is.
Minors:
p. 1, sec. 1 → treat all references consistently: either provide a footnote with a reference to the webpage for both LIMES and Silk (least preferable), a citation for both tools, or both webpage and citation for each.
p. 2, TRT and other abbreviations: preferably provide the full name together with its abbreviation at first use, and use the abbreviation thereafter.
[1] Euzenat, J., Shvaiko, P.: Ontology Matching. Springer-Verlag (2013)
[2] Pinkel, C., Binnig, C., Jiménez-Ruiz, E., May, W., Ritze, D., Skjæveland, M.G., Solimando, A., Kharlamov, E.: RODI: A Benchmark for Relational-to-Ontology Data Integration (2015)