Large Language Models for Creation, Enrichment and Evaluation of Taxonomic Graphs

Tracking #: 3921-5135

Authors: 
Viktor Moskvoretskii
Irina Nikishina
Ekaterina Neminova
Alina Lobanova
Alexander Panchenko
Chris Biemann

Responsible editor: 
Guest Editors KG Construction 2024

Submission type: 
Full Paper

Abstract: 
Taxonomies play a crucial role in organizing knowledge for various natural language processing tasks. Recent advancements in LLMs have opened new avenues for automating taxonomy-related tasks with greater accuracy. In this paper, we explore the potential of contemporary LLMs in learning, evaluating and predicting taxonomic relations across multiple lexical semantic tasks. We propose a novel method for taxonomy-based instruction dataset creation, encompassing multiple graph relations. Using this dataset, we build TaxoLLaMA, a unified model fine-tuned on datasets exclusively based on English WordNet 3.0, designed to handle a wide range of taxonomy-related tasks such as Taxonomy Construction, Hypernym Discovery, Taxonomy Enrichment, and Lexical Entailment. The experimental results demonstrate that TaxoLLaMA achieves state-of-the-art performance on 11 out of 16 tasks and ranks second on 4 other tasks. We also explore the ability of LLMs to refine constructed taxonomy graphs and present a comprehensive ablation study and a thorough error analysis supported by both manual and automated techniques.
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Review #1
By John McCrae submitted on 25/Jul/2025
Suggestion:
Accept
Review Comment:

This paper presents the TaxoLLaMA model, a suite of instruction-tuned models based on LLaMA, trained on Princeton WordNet 3.0 to perform a wide range of taxonomy-related tasks. The authors introduce a novel dataset covering diverse graph relations and show that their model outperforms prior approaches, achieving state-of-the-art results on 11 of 16 tasks. This paper summarizes and extends two previous conference submissions by introducing new experiments with updated models (e.g., LLaMA 3.1), refining taxonomies using bidirectional and multi-relational strategies, resolving graph cycles, and adding deeper ablation studies and metrics such as F&M.

This paper presents a comprehensive framework for taxonomy extraction as well as new datasets going beyond traditional hypernymy. The results are strong, and the rigorous evaluation through ablation studies and multiple metrics is highly commendable. There are still some challenges with ambiguous hypernyms, and some of the methods (such as using the NLTK ID) are questionable; expanding the context and the prompt with examples from a WSD corpus could be a way to further improve the results.

"English WordNet" is not a resource and probably refers to Princeton WordNet, a lexical database for English developed at Princeton University. Another more recent project is the Open English Wordnet (note the lowercase 'n'), an open-source model that has released improved versions of the Princeton WordNet. The authors should explain why they chose to use outdated data.

Note that the IDs used in examples 2 and 4 are from NLTK and are not official IDs from any English-language wordnet project.
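For readers unfamiliar with the distinction, the sketch below illustrates it with NLTK's WordNet interface: NLTK's `lemma.pos.sense_number` names are library-specific, whereas the Princeton WordNet database files identify synsets by offset plus part-of-speech tag. (The synset chosen here is only an illustration and is not taken from the paper's examples.)

```python
# Requires the WordNet data: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# NLTK-style identifier: lemma.pos.sense_number (not an official WordNet ID)
synset = wn.synset('dog.n.01')

# Offset-based identifier as used in the Princeton WordNet data files
# (e.g. data.noun), zero-padded to eight digits.
official_id = f"{synset.offset():08d}-{synset.pos()}"

print(synset.name())   # dog.n.01     (NLTK-specific name)
print(official_id)     # 02084071-n   (WordNet 3.0 offset ID)
```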

The authors use perplexity to assess hypernym relations, but, while this works, it may introduce bias against infrequent terms. Perplexity reflects probability scores, so hypernyms that are less frequent (e.g., technical jargon, low-resource vocabulary) may be unfairly penalized even if they are correct. Perhaps the authors could consider correcting for this by using unigram frequency scores?
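One concrete form such a correction could take, in the spirit of this suggestion, is a PMI-style score that subtracts a unigram baseline from the model's log-probabilities before converting to perplexity. The sketch below is purely illustrative and assumes per-token log-probabilities are already available from the LLM and from a corpus-based unigram model; it does not reflect the authors' actual implementation.

```python
import math

def corrected_perplexity(lm_logprobs, unigram_logprobs):
    """Frequency-corrected perplexity for a candidate hypernym.

    lm_logprobs:      per-token log-probs of the hypernym under the LLM prompt.
    unigram_logprobs: per-token log-probs of the same tokens under a unigram
                      model estimated from a large corpus.
    Subtracting the unigram baseline yields a PMI-style score, so a rare but
    correct hypernym is not penalized merely for being infrequent.
    """
    n = len(lm_logprobs)
    lm_avg = sum(lm_logprobs) / n
    unigram_avg = sum(unigram_logprobs) / n
    # Lower is better, as with raw perplexity.
    return math.exp(-(lm_avg - unigram_avg))

# Toy usage with made-up numbers: a rare hypernym (two tokens) vs. a
# frequent, generic one (one token).
print(corrected_perplexity([-6.0, -5.5], [-12.0, -11.0]))  # rare term, corrected
print(corrected_perplexity([-2.0], [-6.5]))                # frequent term, corrected
```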

Minor
-----
p1 l28 *a* novel method
The paper exhibits inconsistent capitalization of technical terms that are not proper nouns. Here are several examples:
* Entity Linking
* Relation Classification
* Named Entity Recognition
* Hypernym Discovery
* Taxonomy Construction
* Taxonomy Enrichment
* Lexical Entailment
* etc. etc.
p7 l36. "id" -> "ID"
p8. Don't use fixed-width fonts for emphasis
p8 l27. ",,"
p11 l23. "Table ??"
p11 l34. "VS" -> "versus"
In several places "SoTA" should be either "SotA" or "SOTA"
The references need to be checked with the same care as the rest of the manuscript (don't rely on BibTeX!). There are many oddities such as capitalization issues or duplicate DOI/URLs.

Review #2
Anonymous submitted on 11/Aug/2025
Suggestion:
Accept
Review Comment:

In this paper, the authors describe TaxoLLaMA, a large language model (LLM) fine-tuned on English WordNet 3.0, designed to handle taxonomy-related tasks such as Taxonomy Construction, Hypernym Discovery, Taxonomy Enrichment, and Lexical Entailment. The paper is framed by the recent advancements in using LLMs to improve the automation of organising knowledge in natural language processing tasks. The results reported in the paper indicate that TaxoLLaMA achieves state-of-the-art performance in 11 out of 16 tasks and ranks second in 4 others. In this regard, the topic dealt with and the results obtained are of interest and relevant.
In this paper, the authors have significantly extended the previous results presented in references [36, 37] by i) adding new experiments using different models in zero- and few-shot settings (Phi3, QWEN, etc.), ii) fine-tuning and updating TaxoLLaMA, iii) adding an additional ablation study on the consistency and performance of TaxoLLaMA with different numbers of generations, and iv) adding an additional metric, F&M, for Taxonomy Construction evaluation.
For the Hypernym Discovery task, TaxoLLaMA is tested on the three datasets of the SemEval-2018 benchmark and two additional general datasets for Italian and Spanish. The results indicate that the fine-tuned TaxoLLaMA outperforms the five other state-of-the-art models considered on all five datasets.
For the Taxonomy Enrichment task, TaxoLLaMA was tested on the Taxonomy Enrichment benchmark. The results indicate that it outperforms all previous approaches on the WordNet Noun and WordNet Verb datasets, but falls short of the current SoTA method on the more specialised taxonomies (MAG-CS and MAG-PSY).
Finally, the Hyperlex benchmark and the ANT entailment subset were used to test performance on the Lexical Entailment task. For ANT, the results differ between the two metrics considered: for Average Precision, the proposed approach ranks above the four other SoTA proposals considered, whereas for AUC it ranks second. For Hyperlex, TaxoLLaMA outperforms the five other zero-shot proposals considered on the Lexical dataset and ranks second on the Random dataset. In both cases, it falls short of the best fine-tuned proposal (RoBERTa best [40]).
The experiments are methodologically sound and demonstrate that the proposed approach is well-suited for solving the considered taxonomy-related tasks and challenges.
Together with the paper, the authors provide links to GitHub and Zenodo repositories with the code for the paper and the dataset used. The repositories are well organised and allow the reader to assess the data, as well as reproduce the results. In this regard, the reproducibility of the reported results is guaranteed.

Review #3
By Pablo Calleja submitted on 16/Sep/2025
Suggestion:
Accept
Review Comment:

All the comments from the previous review have been handled. The overall result is a good paper with a good contribution. Moreover, the clarity of explanation has been improved and tables such as Table 2 have better presentation and details that are useful during the reading.

Review #4
Anonymous submitted on 17/Sep/2025
Suggestion:
Accept
Review Comment:

I have read the authors' comments and have seen the changes they made in the new manuscript with respect to (1) the readability of the text (formalisms) and the missing technical details, and (2) publishing the dataset on Zenodo and including more details for reproducibility in the GitHub repository. I agreed with reviewer 3 on the 'overly broad predictions', but am satisfied with the answer that the authors give.

From my perspective, I believe the article to be of interest to the journal's special issue and would make a valuable contribution in its present state.