Review Comment:
The paper presents a study of similarity in Wikidata and of the impact that retrofitting (subsequent training of embeddings to fit external information, in this case the similarity of entity pairs in Wikidata) can have on the similarity computed from both KG and text-based embeddings.
This is a very relevant topic with several interesting applications in multiple domains.
I believe the paper would be much improved by addressing the following issues:
1. Related work
There is a very relevant paper in this area that is not covered in the related work:
Lastra-Díaz, Juan J., Josu Goikoetxea, Mohamed Ali Hadj Taieb, Ana García-Serrano, Mohamed Ben Aouicha, and Eneko Agirre. "A reproducible survey on word embeddings and ontology-based methods for word similarity: linear combinations outperform the state of the art." Engineering Applications of Artificial Intelligence 85 (2019): 645-665.
This is a fully reproducible paper, and producing results in the same conditions for the proposals presented in the manuscript would increase its value substantially.
Moreover, this work combines language models and KG embeddings, but does not discuss the related work on this overlap, for example:
Wang, Zhen, Jianwen Zhang, Jianlin Feng, and Zheng Chen. "Knowledge graph and text jointly embedding." In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1591-1601. 2014.
Xie, Ruobing, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. "Representation learning of knowledge graphs with entity descriptions." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1. 2016.
Peters, Matthew E., Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. "Knowledge enhanced contextual word representations." arXiv preprint arXiv:1909.04164 (2019).
2. Clear definitions
The concept of retrofitting is only presented very late in the text. This is not something most readers will be familiar with, and it represents an important aspect of the work. It should be defined in the introduction to help readers understand the goals. A definition of KG embeddings and text embeddings is also lacking. Although these are increasingly commonplace, a soft introduction to these terms would improve readability.
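To illustrate the level of detail I have in mind, here is a minimal sketch of retrofitting as I understand it, assuming the standard iterative update of Faruqui et al. (2015); this is my own illustration and not necessarily the authors' method. The function name `retrofit`, the parameter `alpha`, and the toy vectors are hypothetical, and the edge weights are assumed to be non-negative.

```python
def retrofit(vectors, edges, alpha=1.0, iterations=10):
    """Pull each vector towards its neighbours while staying close to the original.

    vectors: dict mapping an entity name to a list of floats (original embedding).
    edges:   dict mapping an undirected pair (a, b) to a non-negative weight.
    alpha:   weight of the original embedding (hypothetical default).
    """
    # Build symmetric neighbour lists from the undirected pairs.
    neighbours = {name: [] for name in vectors}
    for (a, b), weight in edges.items():
        neighbours[a].append((b, weight))
        neighbours[b].append((a, weight))

    original = {name: list(vec) for name, vec in vectors.items()}
    current = {name: list(vec) for name, vec in vectors.items()}

    for _ in range(iterations):
        for name, nbrs in neighbours.items():
            if not nbrs:
                continue  # entities without pairs keep their original vector
            total = alpha + sum(weight for _, weight in nbrs)
            # Weighted average of the original vector and the current neighbour vectors.
            current[name] = [
                (alpha * original[name][d]
                 + sum(weight * current[n][d] for n, weight in nbrs)) / total
                for d in range(len(original[name]))
            ]
    return current


# Toy example: "a" and "b" form a pair, "c" has no pair.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [5.0, 5.0]}
toy_edges = {("a", "b"): 1.0}
retrofitted = retrofit(toy_vectors, toy_edges)
```

In this toy example the vectors of "a" and "b" move towards each other, while "c" is left unchanged. Even a short description of this kind in the introduction would make the goals of the paper much easier to follow.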
3. More focus on novel contributions
Retrofitting appears to be the more original aspect of the work. However, this is not described or analysed in depth. Results are only shown for cosine similarity of DistilRoberta embeddings.
How the pairs are built is not well described. I believe the word "edge" is used to mean a pair of concepts. The selected pairs are in all likelihood quite similar (parents and siblings), and the distribution of similarity for these pairs is not studied.
"We focus our experiments on cosine similarity as a weighting function, because we observed empirically that it consistently performs better or comparable to the other two weighting functions." This is a pity. A comparison of the weighting functions is exactly what I was hoping to find in the paper. In the end, I am unsure whether there is any real advantage in using Wikidata to measure conceptual similarity, or whether we are simply better off using language models.
The analysis of the quartiles is potentially quite interesting, but the results are not easy to read (there is no table), and the performance metric switches to F-measure without any explanation of how it is computed.
I strongly advise the authors to apply their retrofitting method to the same datasets and under the same conditions as Lastra-Díaz et al.
4. Justifications and clarification of methodological aspects
How TopSim is computed is not clear at all. Is this the measure proposed in 10.1109/ICDE.2012.109?
Why does Composite-6 not include labels and desc?
Why is DistilRoberta used for abstract, labels, label+description and BERT-base for lexicalization?
5. The large size of Wikidata is referred to multiple times, but the method was applied to at most a few hundred entity pairs. What are the true implications of the large size of Wikidata for similarity estimation?
Minor: it would be better to employ the terminology defined by Lastra-Díaz et al. when categorizing the different similarity metrics.
In summary, there is an interesting idea in applying retrofitting to concept similarity with KGs. However, the paper does not consider related work appropriately, which limits the value of its contributions (see Lastra-Díaz et al.). It also does not give sufficient detail in the description of the methods and choices, and could be much richer in terms of tested configurations, presented results, and discussion.