Tversky’s feature-based similarity and beyond

Tracking #: 1667-2879

Silvia Likavec
Ilaria Lombardi
Federica Cena

Responsible editor: 
Lora Aroyo

Submission type: 
Full Paper
Abstract:
Similarity is one of the most straightforward ways to relate two objects and guide the human perception of the world. It has an important role in many areas, such as Information Retrieval, Natural Language Processing (NLP), the Semantic Web and Recommender Systems. To help applications in these areas achieve satisfying results in finding similar concepts, it is important to simulate human perception of similarity and assess which similarity measure is the most adequate. In this work we wanted to gain some insight into Tversky’s feature-based semantic similarity measure on instances in a specific ontology. We experimented with several variations of this measure, trying to improve its performance. We propose Normalised common-squared Jaccard’s similarity as an improvement of Tversky’s similarity measure. We also explored the performance of some hierarchy-based approaches and showed that feature-based approaches outperform them on the two specific ontologies we tested. Above all, the combination of feature-based with hierarchy-based approaches shows the best performance on our datasets. We performed two separate evaluations. The first evaluation includes 137 subjects and 25 pairs of concepts in the recipes domain, and the second one includes 147 subjects and 30 pairs of concepts in the domain of drinks. To our knowledge these are some of the most extensive evaluations performed in the field.

Decision:
Major Revision

Solicited Reviews:
Review #1
By Pasquale De Meo submitted on 15/Jul/2017
Minor Revision
Review Comment:

The paper is about Tversky’s feature-based semantic similarity measure on instances in a specific ontology. The authors considered many variations of this classic metric with the goal of gaining some intuition on how to improve it. Interestingly, a variation of the popular Jaccard coefficient (Normalised common-squared Jaccard’s similarity) and some hierarchy-based approaches were also considered. Experiments involving real evaluators were performed in the domains of recipes and drinks.
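For readers unfamiliar with the measure under discussion, Tversky’s ratio model over feature sets can be sketched as follows. This is a minimal illustration with hypothetical function names, not the paper’s exact formulation; setting alpha = beta = 1 recovers Jaccard, and alpha = beta = 0.5 recovers Dice.

```python
def tversky(a, b, alpha=1.0, beta=1.0):
    """Tversky's ratio-model similarity over feature sets.

    alpha weights the features of a that b lacks, beta the converse.
    Unequal alpha and beta make the measure asymmetric.
    """
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0

def jaccard(a, b):
    """Jaccard: the symmetric special case alpha = beta = 1."""
    return tversky(a, b, 1.0, 1.0)

# jaccard({1, 2, 3}, {2, 3, 4}) -> 2 / (2 + 1 + 1) = 0.5
# With alpha = 1, beta = 0 the measure becomes asymmetric:
# a set contained in another scores 1.0, but not vice versa.
```

The asymmetry obtained by weighting the two difference terms independently is exactly what Review #2 below argues is lost when the paper focuses on the symmetric Jaccard case.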

I liked reading this manuscript a lot. It provides the right balance of theory and practice. The main message is a set of practical guidelines allowing developers to choose the best way of assessing semantic similarity. The paper clearly fits the scope of this journal and covers the related literature in a detailed and precise way. It clearly advances the state of the art, and I am happy that the experiments involved real human evaluators.
I think that, after a few modifications, the paper is ready to be published.

Firstly, I think that the background section about ontologies is perhaps too long: it provides many details that are probably not novel for the average reader of the Semantic Web journal. If possible, I would recommend avoiding some non-essential details. I really liked this part, which is clearly written and interesting, but I think that some concepts should be discussed in a lighter format.

As for Section 3, please provide concrete examples of usage of the proposed measures. In which domains is one metric for assessing semantic similarity better than the others? What are the main pros and cons of each approach to computing similarity? What is the computational effort required to calculate each metric? The ability to quickly calculate similarities over complex and large datasets is a feature worth considering. Please comment on Equation 8. How do you choose the \alpha and \beta parameters? They have a crucial impact on the results (as the authors correctly say); on this point there is only a brief comment referring to WordNet, but I would expect a more detailed discussion.

Explain Equation 10 better. What are the advantages of considering squared terms? I suspect that the similarity scores get denser and denser around 0, thus making the process of detecting similar objects harder (similarity scores would concentrate around zero, making it practically impossible to separate pairs of objects on the basis of their similarity scores).
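To make the concern concrete, here is an assumed squared variant (the exact form of Equation 10 is not reproduced in this review, so this is only an illustration): squaring a similarity bounded in [0, 1] is a monotone transform, so pairwise rankings strictly survive, but low scores are compressed toward 0, shrinking the gaps between dissimilar pairs.

```python
# Hypothetical squared variant: for a similarity s in [0, 1],
# use s ** 2 instead of s (Eq. 10's exact form is not reproduced here).
scores = [0.1, 0.3, 0.5, 0.7, 0.9]
squared = [s * s for s in scores]   # ~ [0.01, 0.09, 0.25, 0.49, 0.81]

# Squaring is monotone, so the ranking of pairs is preserved ...
assert squared == sorted(squared)

# ... but the gap separating the two least similar pairs shrinks,
# concentrating scores near zero, as the reviewer suspects.
gap_before = scores[1] - scores[0]   # ~ 0.2
gap_after = squared[1] - squared[0]  # ~ 0.08
```

So, strictly speaking, ranking is not destroyed; the practical issue is that low scores become hard to tell apart numerically.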

Can you provide some statistics about the sample used to evaluate the results? Were the tests performed in Italian only? Do you expect relevant differences with other languages? What about the gender of the interviewed persons? Might the specific domains you consider (recipes and drinks) induce some kind of bias? Why did you choose the Pearson coefficient as the reference test metric? Pearson correlation makes some strong statistical assumptions that may be violated in a particular domain. Perhaps some kind of non-parametric test (Kendall or Spearman) would be more informative.
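The difference the reviewer points at can be illustrated with a small pure-Python sketch (hypothetical data; `pearson` and `spearman` are hand-rolled here rather than taken from a statistics library): a measure that agrees with human judgments only monotonically, not linearly, gets a perfect Spearman score but a noticeably lower Pearson one.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman = Pearson on ranks (this toy version assumes no ties)."""
    rank = lambda v: [sorted(v).index(e) for e in v]
    return pearson(rank(x), rank(y))

human = [1, 2, 3, 4, 5]                  # hypothetical human ratings
measure = [0.01, 0.02, 0.05, 0.3, 0.9]   # monotone in human, but nonlinear

# spearman(human, measure) is exactly 1.0,
# while pearson(human, measure) is only about 0.86.
```

This is why a non-parametric rank correlation can be more informative when the measure and the human scale relate nonlinearly.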

Review #2
By Jérôme Euzenat submitted on 21/Jul/2017
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include :
(1) originality: the paper is not outstandingly original, but it provides original measures and a new benchmark
(2) significance of the results: the results are convincing neither in terms of statistical significance nor in terms of practical significance
(3) quality of writing: this should be revised, especially in terms of narrative, i.e., what the main point of this paper is.

The paper introduces variants of some similarity measures, mostly Jaccard, and compares them on new data sets that have been built for the purpose.

To be direct, I do not think that the paper should be published in its current shape.
This is mostly a question of form.
Here are the reasons:

The paper, p.2, declares four main contributions. The way they are phrased is puzzling: a concrete proposal, an evaluation, considerations and conclusions. Only one seems a concrete contribution; the rest only assesses that contribution.
This is not the whole story.
My opinion is that this paper has two aspects on which it contributes:
(a) new similarities based on features but taking into account hierarchy,
(b) a new benchmark for such similarities.
Either of these aspects is a worthy topic for a paper. However, bringing the new benchmark together with the new measures is prone to cast doubt on the quality of both.

Independently from this, the reader needs a clear statement about what the contribution of this paper is.

The conclusion of the paper is itself confusing: "We came to the conclusion that the underlying hierarchical information [...] is expressed better with features than with underlying hierarchy" and "the measure with the best performance is the one combining feature-based and hierarchy-based measures".

More details below.

* About similarities:

New similarities of this kind are created all the time. They are no less legitimate than others, but the necessity of introducing them should be carefully defended. For instance, in Section 4.1, the introduction of SIM_{SQ} is not justified. Why would it be good? The next one has a better justification.

I also have a question: the correlation is used to compare the similarities. However, for those containing a sigmoid, the maximum value is below 1, hence the scale is not the same as that adopted by the reference. I am not sure I am right, but this may have an influence on the results.
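A small check of this worry (hypothetical values; `pearson` is hand-rolled): Pearson correlation is invariant under affine rescaling, so a maximum below 1 is not in itself a problem, but a nonlinear squashing such as a sigmoid can lower the correlation. The concern thus applies to the sigmoid's shape rather than to its scale.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

x = [-4, -2, 0, 2, 4]

# An affine rescaling (e.g. a measure whose values never reach 1)
# leaves the Pearson correlation untouched ...
p_affine = pearson(x, [0.2 * v + 0.5 for v in x])        # ~ 1.0

# ... but a sigmoid squashing is nonlinear and does lower it.
p_sig = pearson(x, [1 / (1 + math.exp(-v)) for v in x])  # below 1.0
```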

The discussion in Section 4.2 is confusing, in particular because this is the first place where the justifications are based on examples related to the specific evaluation data set. The justification mixes the reasons for adopting particular similarity measures with the reasons for modifying the data set, which do not have their place here.
Then, the presentation of V1-V3 is less clear than the shorter ones on page 10. In fact, it is not very clear at that point what "without counting" means. It would be better to first discuss the idea of adding one term to the denominator of the equations to account for the hierarchy.

In the end, it would be better to present that there are two options: either keeping rdf:type in the features or not, and either adding a specific term to the similarities or not. This would clarify why, with three variants, there are four tables.

The properties of the measures taking hierarchies into account are not discussed further.

In summary, there are good reasons for introducing some of these measures. The presentation of these ones and why they are introduced should be clarified.

* About benchmarks:

The point is that the traditional benchmark for such measures is well-accepted and has a long history. It may have some limitations which may require a new benchmark to overcome. However, the new benchmark should also be very strong.

There are justifications for the proposed benchmark which are surprising:
- "It is extremely difficult to find a publicly available ontology with defined properties"?
Is it? What about all the medical ontologies? For instance Galen (or the one cited in the conclusion)?
Actually, such ontologies seem more like what is expected from an ontology than the considered ones, which are very regular, with the same small set of properties for all classes. On the other side, the little that is shown about the ontologies is not particularly appealing: it seems from Section 2 that Aspargus or Side_Dish are two individuals.
- "specific domain ontologies" + "could be tested with non-expert": why? It seems that a specific domain, as opposed to the knowledge in WordNet, is not really what the layman may assess.
- "the correlation of these results was mostly used to test similarity measures on WordNet", with a reference to WordNet and not to a paper supporting this assertion.

In particular, with respect to the latter assertion, if one shows that what is observed using WordNet as a corpus does not work with another corpus, then there is a point in discussing this benchmark. Given that it predates WordNet, one cannot suspect a bias, so there would be a nice paper to write. So far, these reasons are not obvious.

It is very difficult to replace such an established benchmark by another. This is the reason why this should be done carefully.
The proposed one (or ones, but I will keep the singular) has interesting new features:
- it has hierarchies and properties
- it has been evaluated by a large population with a control population.

But it also has "deficiencies":
- it has been processed in Italian by an Italian population
- it is not really what would be expected from an ontology in terms of heterogeneity of described concepts and in terms of number of properties
- contentwise, there are also some arguable choices (Preserve and Special diet as Dish types; moreover, the class Dish type suggests that its instances are dish types).
- it is not public (we have entered the era of highly reproducible science, so the data should be published).

In summary, there are good reasons for introducing such a benchmark. However, it should be made clear how and why this one is a good one, and this should not be justified by "bad" reasons.

* About evaluation:

5 hypotheses for one experiment is a lot. This explains why the results did not come with p-value statements.
The statement of the first hypothesis is vague: what is a good correlation?

Do you obtain similar quantitative results if you invert the control and reference groups?

The results are quite mitigated, and the number of hypotheses does not help: "in the first experiment the best results are obtained by considering the rdf:type and in the second experiment by excluding it". So we cannot conclude, right?
No: the next sentence is "In both cases, there is a good correlation".
It would be good to define "good correlation"; otherwise, it is difficult to judge this statement.
In particular, it is difficult to know whether good old Jaccard has good correlation or not, in which case the necessity of the new measures may be questioned --- and significance information is needed.

5.5.4: "the consistently best performing measure is the NCSJS". I am not sure that I understand well, but it does seem to be the best measure in Tables 6 and 8, yet in none of Tables 2, 3, 4, 5, 7 and 9. So, how should this sentence be interpreted?

Finally, given that Tversky's measure dates back to 1977, there have already been many attempted improvements cited in the related work. It would have been worth comparing with them as well.

Making a clear statement about what is brought by some of the new measures, and actually showing that this hypothesis holds, would already be a nice result.
Here it is not clear what has been evaluated.

I have an additional point. I am always surprised that people test value correlation, since these similarities are most of the time used not for obtaining an absolute value, but to decide what is more similar than what. So, why not consider rank correlation?

* Varia

- The title may seem like a nice catch. However, it does not tell what the contribution of the paper is. Hence it is not a very good title.
Actually, it is difficult to defend that the paper is about Tversky similarity, since its originality was not only to weight the factors of the similarity but to weight them independently, leading to "non-symmetric similarities". Focusing on Jaccard, there does not remain much of Tversky, and the addition of measures derived from Li and Wu-Palmer does not help.
There is a paragraph about avoiding parameters in the introduction. However, it is not convincing as to why results obtained without tuning the parameters would be worth anything if the goal is to tweak the weights later on.

- Section 2 is terrible. Sections 2.1 and 2.2 are very vague. For instance, 2.1 describes a partial order through many paraphrases: "groups the concepts into classes"? "A conceptual hierarchy is a simple knowledge structure when the properties [which ones? they have never been mentioned nor defined before] are not taken into account": is this supposed to help understanding? "The characteristics of a property are defined with the property axiom, which in its basic form asserts the existence of the property": this is supposed to describe properties in OWL, and I have no idea what it refers to in OWL. "Instances in the ontology [...] give life to classes". There are ways to be precise about all this.

* Spelling
- 'The' in English is used for something that has been introduced before.
- can_ba_eaten_as
- in the definitions of DF_p^1 and CF_p^1, why don't they take O_1 as an argument like the other functions?
- why give only the properties of Jaccard, since several similarities have been introduced?
- Eq. 7: DIST is not a distance, which is a function of two arguments; DEPTH would have been better.
- uses THE external corpus to compute THE similarity or similarities.
- The introduction of Normalised Jaccard's similarity and Normalised CSJS is difficult to follow, because it refers to "the following formulas" without telling which one corresponds to which. Naming them with the equation label or their name (SIM_N) would help.
- Eq. 16 and 20: the O_1, O_2 arguments are missing
- 5.5.1: column -> row P1-P2
- Since there are 4 different tables for each option, it is misleading to add +h or no t. after the names of the measures
- I do not understand what the values in bold are.
- what is the point of using 4 digits of precision for correlations between values that users provided with 1 digit?
- I think that Table 11 should have appeared earlier, as a control of the benchmark, just like P2 is a control for P1.