From GPT to Mistral: Cross-Domain Ontology Learning with NeOn-GPT

Tracking #: 4014-5228

Authors: 
Nadeen Fathallah
Arunav Das
Stefano De Giorgis
Andrea Poltronieri
Peter Haase
Liubov Kovriguina
Elena Simperl
Albert Meroño-Peñuela
Steffen Staab
Alsayed Algergawy

Responsible editor: 
Marta Sabou

Submission type: 
Full Paper
Abstract: 
We present the extended NeOn-GPT pipeline, an LLM-powered, domain-agnostic ontology learning pipeline grounded in the NeOn methodology. NeOn-GPT comprises two components: (i) ontology draft generation, a multi-step prompting pipeline following the NeOn methodology that covers requirements specification, competency question generation, ontology conceptualization and implementation, formal modeling, population, and documentation; and (ii) automated ontology verification and resolution, achieved through orchestrated calls to third-party tools complemented by LLM-suggested repairs. The extended pipeline incorporates an explicit step for reusing existing relevant domain ontologies to guide LLMs toward more consistent modeling decisions. We evaluate NeOn-GPT across four distinct domains (Wine, Cheminformatics, Environmental Microbiology, and Sewer Networks) using both proprietary (GPT-4o) and open-source (Mistral, Llama-4, DeepSeek) models. Alignment with the gold standards is assessed using three complementary kinds of metrics: structural metrics (class, property, and axiom profiles), lexical metrics (exact matches and Jaro-Winkler similarity ≥ 0.8), and semantic metrics based on sentence-transformer embeddings. Results show that LLMs consistently generate ontologies with rich relational structures (including functional, transitive, and domain-range constraints) and meaningful semantic alignment, with most entity and triple similarities falling in the 0.5-0.8 range. Overall, this study provides a comprehensive, cross-domain evaluation of a NeOn-guided LLM ontology learning pipeline, clarifying its capabilities and limitations.
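For illustration, the lexical and semantic alignment checks mentioned in the abstract could be sketched roughly as below. Only the Jaro-Winkler threshold of 0.8 comes from the abstract; the packages (jellyfish, sentence-transformers), the embedding model (all-MiniLM-L6-v2), and the function names are assumptions made for this sketch, not the authors' actual implementation.

```python
# Minimal sketch of lexical (exact / Jaro-Winkler >= 0.8) and semantic
# (sentence-transformer cosine similarity) matching between generated and
# gold-standard labels. Package and model choices are assumptions.
from jellyfish import jaro_winkler_similarity
from sentence_transformers import SentenceTransformer, util

def lexical_match(generated: str, gold: str, threshold: float = 0.8) -> bool:
    """Exact match or Jaro-Winkler similarity >= threshold (0.8 per the abstract)."""
    a, b = generated.lower(), gold.lower()
    return a == b or jaro_winkler_similarity(a, b) >= threshold

def semantic_similarity(generated_labels, gold_labels, model_name="all-MiniLM-L6-v2"):
    """Pairwise cosine similarity between sentence-transformer embeddings of labels."""
    model = SentenceTransformer(model_name)
    gen_emb = model.encode(generated_labels, convert_to_tensor=True)
    gold_emb = model.encode(gold_labels, convert_to_tensor=True)
    return util.cos_sim(gen_emb, gold_emb)

# Usage example (hypothetical labels):
print(lexical_match("WineGrape", "wine grape"))                    # lexical check
print(semantic_similarity(["sewer pipe"], ["drainage conduit"]))   # semantic check
```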
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Review #1
Anonymous submitted on 17/Feb/2026
Suggestion:
Accept
Review Comment:

I wish to express my appreciation for the thoroughness with which the authors have addressed the reviewers' comments and suggestions. I accept the responses provided to my (Reviewer #1's) points; the authors have addressed my concerns with an appropriate balance of additional work (e.g., expanding the range of LLMs and ontologies evaluated, and the ablation study in Section 7), acknowledgement of limitations and future work (e.g., prompt development, comparisons of multiple expert curations in a single domain), and improvements to clarity and presentation (e.g., the charts in Section 6). As indicated in my previous review, my suggested decision is to accept.

Review #2
By David Chaves-Fraga submitted on 09/Mar/2026
Suggestion:
Accept
Review Comment:

I would like to thank the authors for the clarifications, revisions, and updates made to the paper. I believe the manuscript is now ready for acceptance (I was Reviewer 3 in the first round).

Review #3
Anonymous submitted on 20/Mar/2026
Suggestion:
Accept
Review Comment:

The revised version of the paper has significantly improved and addresses many of the concerns raised in the previous review. In particular, the contribution is now clearer, the evaluation framework is better justified and structured, and the overall positioning of the work within the literature is more convincing.

I have only a few minor comments that could be addressed to further improve clarity and reproducibility:

* **Appendix references and links**: some references to appendix materials (e.g., prompt templates or external resources) are not entirely clear; some links work while others are difficult to access. I suggest carefully checking and harmonising all appendix references to ensure they are complete, consistent, and easily accessible.

* **Experimental setup clarity**: it would be helpful to explicitly state whether the reported results are based on a single run per model/domain or whether any repetitions were performed. This would help readers better interpret the robustness of the results.