Review Comment:
This paper proposes a benchmark for complex alignment evaluation composed of an automatic evaluation system
that relies on queries and instances, and a populated dataset about conference organization with a set of associated competency
questions for alignment, expressed as SPARQL queries.
The work is very relevant, and producing benchmarks is a much-needed contribution to allow progress in the field of complex matching.
The work is sound and the paper is well-written; however, a number of topics need to be clarified to support a better understanding of the work. I have organized my comments around the main topics to be addressed. The main issues are:
1. There should be a better discussion of the limitations/impact of query rewriting systems. Empirical results could be shown comparing the two documented approaches.
2. The paper addresses both reference alignment and reference query-based evaluations. There are important distinctions between the two that are not always clear throughout the paper.
3. The paper puts forward an evaluation system based on instance data, but it lacks a more thorough discussion of the desiderata for instance data to support complex matching.
4. Evaluation based on CQAs may be unintentionally skewed. There is a body of work on complex matching based on patterns, and existing systems (also used in the paper) take this approach. If the CQAs have different coverage for mappings achieved through different patterns, this may have an impact on evaluation. This should be acknowledged and discussed.
5. I was surprised by the lack of a discussion of future work.
1. Query rewriting systems
The paper describes the employed query-rewriting systems/approaches only lightly. The authors cite their previous work in [34], but this presents only one of the approaches.
1.1 Both approaches need to be described in more detail
1.1.1 In pp12: "This rewriting system cannot, however, work the other way around. For example, the CQA
SELECT ?s WHERE {
  ?s cmt:hasDecision ?o .
  ?o a cmt:Acceptance .
} cannot be rewritten with c11."
This is not clear to me. Why can't it be rewritten the other way around? Is it a feature of the rewriting system?
1.1.2 The proposed query rewriting system is not very clearly presented. "It can deal with (m:n) correspondences but cannot combine correspondences in the rewriting process." Can you explain this more clearly?
1.2 Why use the two query rewriting approaches? Are there any advantages of using the one from [20] vs the new one presented here?
1.3. Although the first approach was evaluated in [34], the approach proposed here is not evaluated. Given the impact that the rewriting approach can have on the validity of the evaluation method proposed, a standalone evaluation of the proposed query rewriting method is needed.
1.4. In 5.5.2: "the anchoring phase is strongly dependent on the employed rewriting system"
After reading this I was hoping to see a study comparing the impact of the two different query rewriting systems.
2. Automated evaluation
2.1. Is there a difference between Anchor selection and Comparison when using a reference alignment?
2.2 In 5.1: "In the case of reference queries, the anchoring phase consists in translating a source query based on the evaluated alignment,"
How do you then select which pairs of queries are anchors? I found this paragraph very confusing.
2.3. I find the paper lacks a clearer statement of the conceptual and practical difference between Comparison and Scoring.
2.4. In 5.2, I would appreciate a clarification of "In the case of queries, relations and confidence comparison may be not expressed and then not used in the comparison step." If they may not be expressed, how are they expressed when they are?
2.5. In 5.3: "There really is no “best scoring function” or “best metric”. It all depends on what the evaluation process is supposed to measure."
Could you exemplify what could be different evaluation goals?
2.6 In 5.5 the example with D1 and D2 illustrates a limitation of instance-based evaluation. The paper would be improved by a more thorough discussion of the desiderata for instance data to support complex matching.
2.7 In 5.5.1: "The pair scores considered in this step are: score(c11, cr1) = 1, score(c12, cr1) = 0.5, score(c21, cr2) = 0.2. As no evaluated correspondence ci was paired with more than one reference crj, no evaluated correspondence aggregation needs to be performed"
Which of the strategies was used to compute these scores?
2.8 It would be nice to also get the scores for the 5.5.2. example
2.9 In 6.1: "None of these systems consider the correspondence relation or correspondence value."
How impactful is this?
2.10. In 6.2 the concept of instance-based precision is introduced. Couldn't recall also be computed?
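To make this question concrete, here is a minimal sketch (function names and instance IRIs are invented for illustration; the paper's actual formulation may differ) of how both instance-based precision and an analogous instance-based recall could be computed from the result sets of an evaluated (rewritten) query and a reference query:

```python
# Hypothetical sketch: instance-based precision AND recall, assuming each
# query (evaluated rewriting vs. reference) returns a set of instance IRIs.
# Names and formulation are illustrative, not the paper's actual definitions.

def instance_precision(evaluated, reference):
    """Fraction of instances returned by the evaluated query that also
    appear in the reference query's result set."""
    if not evaluated:
        return 0.0
    return len(evaluated & reference) / len(evaluated)

def instance_recall(evaluated, reference):
    """Fraction of the reference result set that the evaluated query
    manages to retrieve."""
    if not reference:
        return 0.0
    return len(evaluated & reference) / len(reference)

ref = {":paper1", ":paper2", ":paper3", ":paper4"}
ev = {":paper1", ":paper2", ":paper5"}
print(instance_precision(ev, ref))  # 2 correct out of 3 returned
print(instance_recall(ev, ref))     # 2 found out of 4 expected
```

If instance-based precision is computable from these result sets, recall seems equally computable, so the asymmetry in 6.2 deserves an explanation.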
3. Performance metrics
3.1. In 5.5.2 an example of the computation of scoring using queries should be given
3.2. Section 5.3 discusses scoring. This is a highly complex subject in complex matching. I believe the paper should provide a stronger motivation for the need for scoring metrics beyond classical precision and recall.
3.3. In pp12 6.1.3: "The query F-measure was preferred over other metrics to be the scoring function. Indeed, it represents how well the evaluated query suits the user needs in comparison to the reference one"
This assumes both kinds of errors are equally important for users' needs, which may not be the case. In fact, in semi-automated applications of complex matching, relaxing precision to increase recall may make sense, since it may be easier for the user to filter out incorrect mappings than to perform exhaustive searches over both ontologies to produce the missing mappings. As a general-purpose scoring function, I have no problem with using F-measure, but the statement should make this clear.
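To illustrate the point, a small sketch (illustrative numbers only, not taken from the paper) of the generalized F-beta score: beta = 1 gives the symmetric F-measure the paper uses, while beta > 1 weights recall more heavily, which matches the semi-automated use case described above.

```python
# Illustrative only: the generalized F-beta score. beta = 1 reduces to the
# symmetric F-measure; beta > 1 weights recall higher, which suits
# semi-automated settings where missed mappings cost more than spurious ones.

def f_beta(precision, recall, beta=1.0):
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A permissive, high-recall alignment:
p, r = 0.5, 0.9
print(round(f_beta(p, r), 3))          # F1 = 0.643: penalizes the low precision
print(round(f_beta(p, r, beta=2), 3))  # F2 = 0.776: rewards the high recall
```

The same evaluated alignment thus ranks quite differently depending on the weighting, which is why the equal-weighting assumption behind the plain F-measure should be stated explicitly.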
4. Dataset and Evaluation
4.1. In pp13 7.1, how did you ensure that the fact that a single researcher created the CQAs is not a limitation/bias on the validity of results? Was there a set of criteria followed when creating the first set of CQAs?
4.2. I am unfamiliar with the use of the term "pivot format" as it is used throughout the paper. I believe a pivot format describes a data format that can be used to bridge two heterogeneous ones, and I am unsure of how this relates to the artifact mentioned in the paper.
4.3 Although the coverage of the ontologies over the CQAs was assessed, it appears that the coverage and complexity of the CQAs over the ontologies, i.e. how many entities of the ontologies are covered by the CQAs, was not. This may be relevant for precision evaluation. If a system creates valid mappings for areas of the ontology not covered by the CQAs, this may have a negative impact on its measured performance.
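One way to quantify this concern, sketched here with invented entity names (the paper does not report such a measure), is the fraction of ontology entities that occur in at least one CQA:

```python
# Hypothetical sketch: CQA coverage over an ontology, i.e. what fraction of
# the ontology's classes and properties appear in at least one CQA.
# Entity names below are invented for illustration.

def cqa_coverage(ontology_entities, cqas):
    """cqas: iterable of sets, each holding the entities one CQA references."""
    covered = set().union(*cqas) & set(ontology_entities)
    return len(covered) / len(ontology_entities)

onto = {"cmt:Paper", "cmt:Acceptance", "cmt:hasDecision", "cmt:Reviewer"}
cqas = [{"cmt:Paper", "cmt:hasDecision"}, {"cmt:Acceptance"}]
print(cqa_coverage(onto, cqas))  # 3 of 4 entities covered
```

Reporting such a figure per ontology would let readers judge how much of the mapping space the precision evaluation can actually see.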
4.4. In 7.3.1 the process of refining the CQA list is described. It would be great to have some numbers on this, detailing the original number of CQAs and how this number changed in subsequent steps of the process.
4.5 Five different alignments were used in the evaluation. They are not described at all, and the differences between them are not always taken into account when discussing the results. A small description of how the alignments were obtained (excepting ra1) would help the reader understand the results without having to consult three different papers.
5. Others
In 4.2 it is not clear how the Hydrography and GeoLink evaluation is conducted. Is it manual or automated?
Typos etc
pp1 "However, simple correspondences are not fully enough" -> However, simple correspondences are not sufficient
pp6: " the focus is done on how this" --> the focus is on how this
pp13 "Based SPARQL INSERT" -> Based on SPARQL INSERT
pp14 "it has been partially populated." -> it has only been partially populated.
pp15 "The idea is to provide the same conference ontologies but with more or less common instances." This sentence does not read well, maybe use "partially overlapping set of instances"?
pp16 Another sentence that could be improved: "All the ontology concepts were not covered by the pivot CQAs." should be: Not all the ontology concepts were covered by the pivot CQAs.
pp16 (7.5) "a few things are needed" - two things are needed
The examples of D1 and D2 on page 2 could be given in a figure or table format so that they stand out more. They are used throughout the paper but are not easy to find.