Review Comment:
The paper presents the Matcha-DL ontology matching system, which uses a neural network to learn how to rank correspondences between ontologies from a partial alignment.
I am in favour of publishing this paper. However, the points listed below deserve to be addressed beforehand, and some of them must be.
The system performs well on the tasks on which it is evaluated, so it is natural to publish its description.
The use of machine learning techniques in ontology matching is relatively novel, especially as concerns systems based on language models, such as large language models, or on embeddings into otherwise generated structures of that kind. The impact of such a system can be large, all the more so if it is widely and openly available.
The text is in general clear and readable (some suggestions to improve it are given below).
However, the description itself is not very deep. More information should be provided to give the reader the opportunity to understand how the system works.
More generally, the paper could be improved by making it more precise and self-contained (suggestions below).
I would also have liked to see a discussion of the limitations of the system or the evaluation setting. In particular, it seems very reliant on lexical comparison. Could Bio-ML be extended in order to match ontologies with different lexicons? Is Matcha-DL really a general purpose system if it assumes such a lexical similarity? Is there evidence that the matcher selection is sensitive to this, i.e. would give lower weight to lexicon-based matchers if lexicons are different?
Suggested improvements follow in the order of the paper:
- page 2, lines 1-3: it is unclear what this sentence means for someone who does not already know the topic: explain that, for learning, examples may be necessary, hence specific tasks are required for that.
- line 28: 'unsupervised': this term is usually reserved for machine learning. Since most systems are not learning systems, it would be better not to use it in this context, or to specify that this would correspond to an unsupervised setting for a learning system.
- lines 19-20: explain what is new in Bio-ML that allows it to evaluate learning systems.
(all the above comments are about the same thing: the paper takes for granted that the reader already knows this background).
- page 3, line 48: does the search space really escalate exponentially? Given that correspondences are between pairs of entities with a limited number of relations, adding one entity increases the search space only by the size of the other ontology. This is faster than linear growth, because that size is itself constantly growing, but it remains polynomial (quadratic in the ontology sizes), hence subexponential.
- page 4: a picture illustrating the process involved in Matcha-DL would help the reader's understanding.
- page 4, line 6: 'optimal combination': it would be worth describing the optimisation criterion.
- page 4, line 14: 'describe' is a bit strange; 'extract' or 'generate' seems better.
- page 4: it would be worth describing what is called the 'lexicon' of an ontology in the context of Table 1.
- page 5: it does not seem worth restating the formulas for the classical measures; give references instead.
- page 5, line 31: it is unclear what 'null reference mappings' are. The term seems to refer to the 'semi-supervised setting', but this does not seem to have been described before.
- page 5, line 6: larger then -> larger than
- page 5, line 46: 'friendly': the term does not feel very accurate. 'ML-oriented' would definitely be better (since non-learning systems would not use the partial reference, these tasks are indeed ML-oriented).
- pages 5-6: it would be better to describe the data sets in just one enumeration, instead of their list and then their description.
- page 7, line 42: no hyperparameter tuning as is customary?
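
To make the remark on page 3, line 48 concrete, here is a minimal sketch of how the set of candidate correspondences grows. It assumes that a candidate is simply a pair of entities, one from each ontology, with a single relation; the function name and the sizes are hypothetical illustrations, not taken from the paper. With several relations, the count would be multiplied by a constant factor, which does not change the growth rate.

```python
def candidate_pairs(n_source: int, n_target: int) -> int:
    """Count candidate correspondences, assuming one relation per entity pair."""
    return n_source * n_target


if __name__ == "__main__":
    n_target = 1000  # hypothetical size of the second ontology
    for n_source in (10, 100, 1000):
        before = candidate_pairs(n_source, n_target)
        after = candidate_pairs(n_source + 1, n_target)
        # Adding one source entity adds exactly n_target new candidates.
        print(f"n_source={n_source}: {before} candidates, "
              f"+{after - before} when one entity is added")
```

The increment equals the size of the other ontology whatever the current size of the first one, which is the signature of polynomial growth, not exponential growth.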
Concerning the URL provided as a 'long term stable link', this is a GitHub link, so fully appropriate for code. However, as far as I can judge (commit 8e365b6):
- the README.md file is reduced to its simplest expression and gives me no instructions on how to use the system.
- the source code is not available in the main branch. This is disturbing, as the repository seems empty (the code, or at least some of it, is in the dev branch).
- the repository contains automatically generated files, which is, in principle, bad practice: the code should work and thus be able to regenerate them.
These issues should definitely be fixed before publishing a 'Tools and Systems Report'.