CANARD: An Approach for Generating Expressive Correspondences based on Alignment Need and ABox-based Relation Discovery

Tracking #: 3197-4411

Authors: 
Elodie Thieblin
Ollivier Haemmerlé
Cassia Trojahn dos Santos

Responsible editor: 
Jérôme Euzenat

Submission type: 
Full Paper
Abstract: 
Ontology matching aims at making ontologies interoperable. While the field has matured considerably in recent years, most approaches are still limited to the generation of simple correspondences. More expressiveness is, however, required to better address the different kinds of ontology heterogeneity. This paper presents CANARD (Complex Alignment Need and A-box based Relation Discovery), an approach for generating expressive correspondences that relies on the notion of competency questions for alignment (CQAs). A CQA expresses the user's knowledge needs in terms of the alignment and aims at reducing the alignment space. The approach takes as input a set of CQAs expressed as SPARQL queries over the source ontology. The generation of correspondences is performed by matching the subgraph from the source CQA to the similar surroundings of the instances from the target ontology. The evaluation has been carried out on synthetic and real-world datasets.
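To make the notion concrete, the following is a minimal sketch (with a hypothetical namespace, class name, and toy data, not taken from the paper) of a CQA expressed as a SPARQL query over the source ontology, here executed with rdflib; the answers to such a query are the instances whose surroundings CANARD then compares with instances on the target side.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

# Hypothetical source-ontology namespace and a toy instance (illustration only).
CONF = Namespace("http://example.org/source-conference#")
g = Graph()
g.add((CONF.paper42, RDF.type, CONF.AcceptedPaper))
g.add((CONF.paper42, RDFS.label, Literal("Paper 42")))

# CQA "Which are the accepted papers?" written as a SPARQL query over the source ontology.
cqa = """
PREFIX conf: <http://example.org/source-conference#>
SELECT ?paper WHERE { ?paper a conf:AcceptedPaper . }
"""

# Each answer instance seeds the subgraph that is matched against the target ontology.
for row in g.query(cqa):
    print(row.paper)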
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 22/Sep/2022
Suggestion:
Reject
Review Comment:

This article describes an approach to generate complex correspondences between entities of different ontologies. The approach is based on the notion of Competency Questions for Alignment (CQAs).
The paper is an extension of work published at the ISWC 2020 conference. The authors propose a further analysis of the impact of different choices related to the alignment process.

General review:
The positive point in this new paper is that the different steps of the alignment process are better detailed, and the finer analysis of the experiments makes it easier to understand the conclusions of the original paper. But this extension seems limited to me, reading more like a version that expands the individual sections of the ISWC paper. Indeed, no new experiments, comparisons, or conclusions represent a real addition to the original contributions. Moreover, the thesis manuscript of Élodie Thiéblin, defended in 2019 and published here: https://tel.archives-ouvertes.fr/tel-02735724/document, is a more complete document for understanding the approach and the different results obtained. So I do not see the need for a new paper that is less detailed. New experiments (on the same datasets) and analyses that address the limitations of the approach, such as the measure used to match the instances, would have been a real addition.

Detailed review:
I focus here in particular on the experimentation section, where there is the most new material. In the rest of the paper, the two things that deserve, in my opinion, to be detailed further are: the choice of the instance matching measure (the exact label matching) and the consideration of counterexamples in the alignment process, which might have a negative effect in some cases.
In Section 5.2 it would have been desirable to properly define each metric (classical, recall-oriented, ..., overlap), as they are necessary to understand the different experiments. For example, the definition of "overlap precision" only comes five pages later (page 16), which in my opinion is too late.
Examples are sometimes missing to support the analyses, for example when the precision is low or high in some specific cases (Section 5.3.2) or when accidental correspondences are introduced.
Finally, a qualitative analysis of the types of generated correspondences (especially complex ones) is largely missing. The only time this point is addressed (page 16), no details are given to explain the results beyond a quantitative analysis.
Regarding the form, the paper is hard to read, especially in the evaluation section, because the figures are positioned very badly (sometimes one has to look several pages further to find the right figure).

Review #2
Anonymous submitted on 12/Oct/2022
Suggestion:
Major Revision
Review Comment:

Summary

The paper describes CANARD, a tool that creates complex mappings between populated ontologies based on the notion of Competency Questions
for Alignment (CQAs).
The paper describes the approach in detail and presents an evaluation based on two different tasks: the complex alignment of the popular Conference ontologies, and of the Taxon dataset, which covers plant species.
CANARD has been previously published, so this review also focuses on the degree of novel content in this submission compared to the 2020 ISWC publication, as well as on the overall evaluation approach.

Motivation and Introduction

The paper presents a strong motivation for the problem. However, the wording of the two hypotheses in the introduction is not clear. In fact, neither point is a hypothesis. The first one is close to one, but lacks an explicit proposition, while the latter is simply not a hypothesis in any sense. While I can infer what the authors are referring to, this needs a complete rewording and a clear exposition of what the hypotheses actually are.

It is not made entirely clear that the paper does not actually introduce CANARD, but rather expands on work previously published at ISWC 2020. Although this is alluded to later in the text, it should be stated more clearly.

Methodology

The methodology is presented in a detailed and well-structured fashion.
However, I am curious about a few aspects:

1. There is no clear definition of what a "support answer" is.

2. In Section 4.4, it is not clear to me why a threshold for the Levenshtein similarity is needed. Later on, it is mentioned that this is because of noise.

3. In Equation 4, the sum of labelSim and structureSim can add up to 1.5, since labelSim is in [0,1] and structureSim is set to 0.5 or 0. Is this correct? I was expecting similarity values in [0,1]. Why this unusual definition of similarity? And then later: "When putting the DL formula in a correspondence, if its similarity score is greater than 1, the correspondence confidence value is set to 1." This means that a Levenshtein similarity of 0.5 combined with a structural match is already enough to reach the maximal confidence of 1 (see the sketch after this list).

4. There is not a lot of detail on the computational complexity of CANARD. There is a limit on the length of paths, but this could be more clearly presented.
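To make the arithmetic in point 3 explicit, here is a minimal sketch (not CANARD's actual code; the function names and the example strings are assumptions) of the aggregation that Equation 4 appears to describe: a normalized Levenshtein label similarity in [0,1], a structural bonus of 0.5 or 0, and a cap of the resulting confidence at 1.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def label_sim(a: str, b: str) -> float:
    """Levenshtein similarity normalized to [0,1]."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def confidence(a: str, b: str, structural_match: bool) -> float:
    structure_sim = 0.5 if structural_match else 0.0
    score = label_sim(a, b) + structure_sim   # can reach 1.5
    return min(score, 1.0)                    # capped at 1, as the paper states

# With the cap, a label similarity of only 0.5 plus a structural match
# already yields the maximal confidence of 1 -- the concern raised above.
print(confidence("accepted paper", "acceptedPaper", structural_match=True))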

Evaluation

While CANARD has been evaluated in OAEI editions, which is a clear plus, many of those results do not make it into this paper. I think this really detracts from the paper, and I struggle to understand why they were left out.

1. The evaluation is based on parametrizing the many equations behind this approach with manually set values. I appreciate that some are varied in the evaluation; however, the DL formula threshold is fundamental to the final results obtained and is not contemplated. Is this because results do not vary considerably when altering it?

2. The different variants in Table 2 are not clearly described. The Table should include a short textual description of each variant.

3. In 5.3.2, did you limit the maximum number of answers per CQA to the threshold, or was it exactly that number?

4. The authors identify running time as a limitation of the exact label match approach. While I understand this is a limitation of their implementation, it is not a limitation of the approach in itself. This large running time is probably due to inefficient data structures and multiple calls to the SPARQL endpoint, and could be considerably reduced. Many OM systems (ALIN, LogMap, AML to name a few) perform exact label matching on ontologies with thousands of labels in a matter of seconds (see the sketch after this list).

5. In 5.4, the authors describe the alignment data sets they are comparing their approach to. While it is easy to understand that Ritze and AMLC are the results of complex matching approaches, it is not clear what the query rewriting and ontology merging alignment sets are, and what characteristics they have. To make the paper more self-contained, it would be best to briefly introduce these.

6. The authors recognize that there is some circularity in the evaluation. The same CQAs used by CANARD are the basis of the coverage evaluation, and CANARD is the only system based on the CQAs. The OAEI 2020 campaign actually included more complex matching tasks on which CANARD was evaluated (Populated GeoLink and Populated Enslaved). However, this paper omits these results. Why? I believe they should be included and discussed. While I understand these two tasks do not come with pre-defined CQAs, these results could highlight the reliance of CANARD on manually defined CQAs vs automatically generated ones. In fact, it would be great to see an evaluation for Conference based on both the high-quality CQAs and automatically generated ones (which were made for CANARD's 2020 OM paper).
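To illustrate the point on exact label matching, here is a minimal sketch (hypothetical code, not taken from ALIN, LogMap, AML, or CANARD) of hash-indexed exact label matching: the target labels are indexed once in a dictionary, and each source label is then looked up in constant time, which is why systems of this kind finish on ontologies with thousands of labels in seconds.

from collections import defaultdict

def normalize(label: str) -> str:
    """Lowercase and collapse whitespace before comparing labels."""
    return " ".join(label.lower().split())

def build_label_index(entities_with_labels):
    """entities_with_labels: iterable of (entity_iri, label) pairs."""
    index = defaultdict(set)
    for entity, label in entities_with_labels:
        index[normalize(label)].add(entity)
    return index

def exact_label_matches(source_labels, target_labels):
    """Return (source_entity, target_entity) pairs that share a normalized label."""
    index = build_label_index(target_labels)
    matches = []
    for entity, label in source_labels:
        for target_entity in index.get(normalize(label), ()):
            matches.append((entity, target_entity))
    return matches

# Hypothetical toy input for illustration only.
src = [("src:AcceptedPaper", "Accepted Paper"), ("src:Author", "Author")]
tgt = [("tgt:Paper_accepted", "accepted paper"), ("tgt:Person", "Person")]
print(exact_label_matches(src, tgt))  # [('src:AcceptedPaper', 'tgt:Paper_accepted')]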

Related Work

1. I would have expected more emphasis on the work of Zhou et al. (ref. 7 in the paper). This work was evaluated against CANARD in OAEI 2020.

Minor

p. 3: A competency questions --> A competency question
Which are the accepted paper? --> Which are the accepted papers?

p. 24: The ontologies are are populated --> The ontologies are populated

p. 28: CANARD relies common instances. --> CANARD relies on common instances.