Review Comment:
Overall evaluation Select your choice from the options below and
write its number below.
== 3 strong accept
== 2 accept
== 1 weak accept
== 0 borderline paper
== -1 weak reject
== -2 reject
== -3 strong reject
-1
Reviewer's confidence Select your choice from the options below
and write its number below.
== 5 (expert)
== 4 (high)
== 3 (medium)
== 2 (low)
== 1 (none)
5
Interest to the Knowledge Engineering and Knowledge Management
Community Select your choice from the options below and write its
number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
5
Novelty Select your choice from the options below and write its
number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
3
Technical quality Select your choice from the options below and
write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
4
Evaluation Select your choice from the options below and write its
number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 not present
3
Clarity and presentation Select your choice from the options below
and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
4
Review Please provide your textual review here.
Summary:
The paper presents three similarity measures for ontologies that
use TBox information only. After introducing the basics and
discussing related work, baseline measures are introduced. The new
similarity measures are then presented and evaluated in two
empirical studies. The paper closes with a discussion of
limitations and an outlook.
Overall the paper is well written and easy to read. The
contribution is clearly stated and appears to be technically
sound. However, some of the claims are too strong and unfocused
and need to be corrected. The proposed measures are well
integrated into a larger framework, and the discussion of their
relationships is well done. Another big plus is the evaluation and
discussion of the relationships between the measures in exp. 2.
The weak points are the related work, which covers only a very
small part of the existing literature and excludes very important
work from the evaluation, and the evaluation of the method itself,
which uses only 19 concept pairs, which is not much.
While I think the work is quite nice and addresses a very
interesting and important topic, I cannot accept it because of the
weak points mentioned above.
More comments:
The first thing I would like to stress is the way claims are made
in the introduction. Let me explain this by quoting the following
sentence from the first page: "To date, there has been no thorough
empirical investigation of similarity measures." To me, such a
sentence implies that no other researcher working on similarity
measures has ever made a serious and complete empirical
investigation of them. Given your work, I would say that your
paper has not reached this goal either. In addition, there is
existing work that compares a whole range of similarity measures.
Work by colleagues such as "Budanitsky, A. & Hirst, G. (2006),
'Evaluating WordNet-based Measures of Lexical Semantic
Relatedness', Computational Linguistics 32(1), 13-47" goes a step
further and grounds the comparison in human judgments. This work
is completely ignored, even though it presents a complete
comparison of similarity measures for WordNet relying on
information similar to that used in your work (I agree that you go
beyond it, but not by much). In the future, I suggest making a
weaker claim that better reflects both the presented and the
existing work and is more defensible. Please revise the first two
paragraphs of the introduction (there are more very general and
not very nice phrases like the one quoted) to address this issue.
Another body of work that is not even touched on in your paper is
the work using the WordSim-353 (WS353) dataset:
http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
There is a large amount of work on similarity measures, and much
of it may not be relevant here because it uses the ABox. But since
your work claims to be so "thorough", I would expect all major
research areas in similarity measures to at least be mentioned.
Fig. 1: all of the proposed measures follow a pattern similar to
the Wu and Palmer measure. Any idea why? How about leaving out
some of the concepts to test the influence of the structure on the
measure? This was one criticism of ABox-based measures, and I
think it holds here as well; such a test would answer the
questions: What happens if the ontology is not complete? How
sensitive is your measure to the current ontology? It might also
give insight into patterns common to all the measures.
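The suggested robustness test could look roughly as follows: remove an intermediate concept from the taxonomy and recompute a structural similarity such as Wu & Palmer's. This is only a sketch with an illustrative toy taxonomy; the concept names and hierarchy are assumptions, not taken from the paper under review.

```python
# Sketch of the suggested ablation test: drop an intermediate concept
# from a toy taxonomy and recompute a Wu & Palmer-style similarity.
# The taxonomy below is illustrative, not from the paper.

def ancestors(taxonomy, concept):
    """The concept plus its ancestors, ordered bottom-up to the root."""
    chain = [concept]
    while concept in taxonomy:          # follow child -> parent links
        concept = taxonomy[concept]
        chain.append(concept)
    return chain

def wu_palmer(taxonomy, c1, c2):
    """sim = 2 * depth(LCS) / (depth(c1) + depth(c2)), root at depth 1."""
    anc1 = ancestors(taxonomy, c1)
    # First ancestor of c2 (walking bottom-up) that c1 also has is the
    # deepest common ancestor, i.e. the least common subsumer.
    lcs = next(a for a in ancestors(taxonomy, c2) if a in anc1)
    depth = lambda c: len(ancestors(taxonomy, c))
    return 2 * depth(lcs) / (depth(c1) + depth(c2))

# Full taxonomy (child -> parent).
full = {"Animal": "Thing", "Carnivore": "Animal", "Herbivore": "Animal",
        "Lion": "Carnivore", "Cow": "Herbivore"}
# Ablated taxonomy: "Animal" removed, its children reattached to "Thing".
ablated = {"Carnivore": "Thing", "Herbivore": "Thing",
           "Lion": "Carnivore", "Cow": "Herbivore"}

print(wu_palmer(full, "Lion", "Cow"))     # 0.5
print(wu_palmer(ablated, "Lion", "Cow"))  # ~0.333
```

Removing a single intermediate concept shifts the score from 0.5 to about 0.33 here, which is exactly the kind of structural sensitivity such a test would expose.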
Furthermore, exp. 1 is the only real evaluation, since exp. 2
includes neither baselines nor a comparison with human judgments.
This brings me to my most critical question: who decides which
concepts are similar, and to what extent? My guess would be that a
human makes this decision, but then 19 concept pairs are not very
representative. Yes, you mention the limits of the results at the
end of the paper, but how does this limitation square with the
very complete empirical investigation promised in the
introduction? And exp. 2 is only an internal comparison of the new
measures without any human grounding, so this part is more of a
discussion than an evaluation.
Let us go back to example 1. It is stated that "Sim(Carnivore,
Omnivore) > Sim(Carnivore, Herbivore)" for technical reasons. Is
this also true for humans? In every case? I suggest extending the
example and discussing what similarity means and, with respect to
that, the goals you would like to reach with your work.
At the end of sec. 5, two IC measures are introduced that are
never used in the comparison. While it may be that Rada's measure
has sometimes outperformed them, it would be appropriate, even for
a journal publication, to include at least one measure from this
family. In addition, the measure of Jiang & Conrath in combination
with WordNet turns out to be the best similarity measure in the
above-mentioned study by Budanitsky and Hirst. I therefore expect
results for this measure as well.
Parameters are always tricky. Why is a delta of 0.1 in sec. 8 a
good choice?