Knowledge Engineering with Large Language Models: A Capability Assessment in Ontology Evaluation

Tracking #: 3852-5066

Authors: 
Stefani Tsaneva
Guntur Budi Herwanto
Majlinda Llugiqi
Marta Sabou

Responsible editor: 
Guest Editors 2025 LLM GenAI KGs

Submission type: 
Full Paper
Abstract: 
Advancements in large language models (LLMs) offer opportunities for automating challenging and time-intensive Knowledge Engineering (KE) tasks. Constructing an ontology is a complex process, particularly when logical restrictions are modeled or when the development is performed by novice knowledge engineers or domain experts with limited training in KE. Consequently, developed ontologies often contain modeling errors, undermining the success of ontology-based applications and hindering subsequent KE tasks. Thus, it is important to investigate how LLMs can support KE tasks such as the evaluation of ontologies, involving the detection and correction of errors in knowledge-based resources. However, challenges remain in systematically evaluating LLM performance and comparing different models in terms of their capabilities to perform concrete KE tasks. Moreover, there is a lack of comprehensive, task-specific benchmarks needed for such LLM capability assessments. As a result, selecting the right LLM to effectively support knowledge engineers presents a nontrivial problem. To fill these gaps, this study investigates how and to what extent LLMs can support four concrete ontology evaluation sub-tasks: the detection, classification, explanation, and possible correction of modeling issues in ontologies, focusing on the use of existential, universal, and cardinality constraints. To this end, we construct a benchmark dataset based on student-built ontologies and perform experimental assessments of the performance of four LLMs (GPT-4o, Claude Sonnet, DeepSeek V3, and Llama 3.3) on these four KE sub-tasks. Additionally, we exemplify the definition of an annotation framework for the qualitative evaluation of LLM outputs and perform a comparative analysis of each model's capabilities. Our findings reveal notable differences in model behavior and task-specific strengths, underscoring the importance of selecting the most appropriate model to support concrete KE tasks.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 30/Jun/2025
Suggestion:
Minor Revision
Review Comment:

------------ Short Summary of the work ------------
This work points in a good direction for research on LLM capability assessment in ontology evaluation. The authors study four LLMs, namely GPT-4o, Claude Sonnet, DeepSeek, and LLaMA, across four knowledge engineering tasks to examine how model behavior and task-specific strengths matter for KE. The four tasks designed for this study are: 1) Detecting an Ontology Modeling Issue, 2) Classifying an Ontology Modeling Issue, 3) Explaining an Ontology Modeling Issue, and 4) Generating a Correct(ed) Ontology Modeling. Tasks 1 and 2 are evaluated using standard metrics such as precision, recall, and F1-scores, targeted at measuring how well an LLM can detect a modeling issue; when an issue is found, Task 3 asks for an explanation, which is evaluated by experts using the introduced assessment schema. Finally, in Task 4, the LLMs are used to correct the ontology modeling.

------------ Originality ------------
This work addresses a timely topic in KE: which tasks LLMs can support, what the limitations and behaviors of LLMs are, and how we can define good evaluation criteria. Given the good direction of the work, the only weak spot I can see is the limited number of LLMs in this study. RQ2 is rather broad, and studying it with only four LLMs can be misleading; the findings may not generalize without evaluating a more diverse set of LLMs. My recommendation would be to incorporate, if feasible, at least two more LLMs from different model families (such as Gemini, Mistral, or Qwen) to confirm your findings for this RQ. I believe this would strengthen your work.

Moreover, the assessment criteria for Task 3 are "Modeling Comprehension, Inference Capability, Mistake Correction Competence and Vocabulary Usage", where each dimension takes a value between -1 (incorrect) and 1 (correct). It would be great to clarify this choice and the reasoning behind it. Considering the LLM-as-a-Judge research in NLP, why not use a finer-grained scale, for example: 1 unacceptable, 2 poor, 3 medium, 4 good, and 5 very good? Continuing this point, Task 4 also appears to involve expert evaluation. If so, a reproducibility concern arises as to whether other researchers can use your benchmark for experimentation: different researchers might rely on different experts, and inconsistent perspectives could yield different scores for these tasks, making it difficult to determine which LLM is doing a good job, and why. I think it would be great if the authors clarified these points to enable a more reliable benchmarking effort across these four tasks.

------------ Significance of the results ------------

The presented results appear solid and support the research findings across the defined tasks. The quantitative assessments are comprehensive and reveal several interesting insights. Below are a few comments that could help strengthen the discussion and improve the clarity of the results:

- According to Table 2, it appears that the number of evaluated ontology axioms is quite limited. For instance, in the ML-based workflow, there are only 82 samples for Task 1, 29 for Task 2, 29 for Task 3, and 82 for Task 4. If this interpretation is correct, I would like to refer to your statement on page 4, lines 4–7, where you critique work [12] for being "constrained by a small dataset and a limited number of investigated relation types". A brief clarification on how your dataset addresses or differs from this limitation would help contextualize your claims more clearly.

- I suggest presenting a table in the results section (or an adjustment in Table 2) that clearly reports the exact number of samples for each task, including the number of correct and incorrect instances in the gold standard datasets. This would enhance the clarity and transparency of the evaluation.

- Following the previous comment on task statistics: in Table 3, GPT-4o achieves a relatively high accuracy but a lower F1 score, primarily due to reduced recall. This discrepancy makes it somewhat difficult to interpret the model's actual performance, given the missing statistics on correct/incorrect instances. It would be helpful if the authors could provide a brief explanation for this behavior (such as whether the model is favoring the majority class, being overly conservative in its predictions, or missing positive instances) so readers can better understand the trade-offs between accuracy and F1 in this context; a purely hypothetical numeric illustration of how such a gap can arise is given after this list.

- Based on Table 3 and Figure 3, as you noted, GPT-4o demonstrates strong precision but suffers from low recall. It would be helpful if the authors could clarify whether this behavior indicates that the model is overly conservative in its predictions. Furthermore, is there any evidence of hallucinations or a high rate of false positives that might be contributing to this performance imbalance?
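
(As a purely hypothetical numeric illustration of how accuracy and F1 can diverge, with a class split that is assumed here and not taken from the paper: if 25 of the 82 Task 1 axioms contained modeling issues and a model flagged only 8 of them, with no false positives, then accuracy = (8 + 57) / 82 ≈ 0.79, precision = 1.0, recall = 8 / 25 = 0.32, and F1 = 2 · (1.0 · 0.32) / (1.0 + 0.32) ≈ 0.48. The high accuracy is driven largely by the majority class of correct axioms, while the low recall pulls F1 down.)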

------------ Quality of writing ------------

The paper is well-written and effectively communicates the key ideas. However, I noted a few minor issues and typos that should be addressed:

- Page 1, Line 46: The phrase "..., for instance in an ontology [25]" appears to be incomplete. Please revise the sentence for completeness.
- Page 11, Line 44: The text states "We identify six main mistake types", but there are seven bullet points listed from "Type I (6 axioms)" to "Other (2 axioms)". Please clarify this.
- Page 12, Line 38: The category "Other (2 axioms)" lacks explanation. It would be great to briefly describe these two axioms or clarify why they were grouped separately.
- Page 15, Line 43: In Table 3, under the "GPT-4o" row, please correct the typo "45.4%5" by removing the "5" after the percentage.

------------ Data File Assessment ------------

- The dataset is not provided via a stable, long-term URL. No specific README or repository is linked for accessing the benchmark. The only related link is https://github.com/wu-semsys/ontology-analysis, which contains only the Ontology Analysis API, not the dataset itself. I recommend publishing the dataset in a public repository with a clear README to support community use and reproducibility.

- The dataset creation methodology is well described. Involving undergraduate students across 16 domains resulted in 31 ontologies and 82 curated axioms with modeling intentions. This indicates strong potential, but public access is essential to validate its impact.

------------ Other Comments ------------

- I would suggest reconsidering the phrasing of "Comprehensive and Comparative Evaluation" as a standalone contribution. While evaluation is indeed an essential part of any research work, it is generally expected as a standard component rather than a unique contribution. Moreover, the paper refers to the "Guided Expert-Annotation Schema for LLM Outputs" as a key contribution. However, this aspect is only briefly presented—mainly through a single table—without sufficient explanation of how the schema was developed or the rationale behind its design. Additionally, such a schema should be more accurately framed as part of an experimental LLM capability assessment rather than a standalone contribution. I recommend revisiting the contributions section and rewriting it with greater abstraction and clarity, ensuring that each claimed contribution is well-motivated, distinct, and appropriately positioned within the scope of the work.

- Is there any discussion on potential LLM hallucinations in the generated outputs?

Review #2
Anonymous submitted on 20/Sep/2025
Suggestion:
Major Revision
Review Comment:

Summary
This paper presents a systematic, empirical evaluation of the capabilities of large language models (LLMs) for ontology evaluation tasks. It addresses an important and timely topic in the Semantic Web community. The paper is well written, and the study is carefully conducted.
The authors investigate four sub-tasks in ontology evaluation involving logical constraints (existential, universal, cardinality):
1. Detection of modeling issues
2. Classification of issue types
3. Explanation of the modeling mistake
4. Generation of a corrected modeling solution
To support this investigation, they construct a benchmark of 96 ontology axioms from student-built ontologies, design task-specific prompts for four LLMs (GPT-4o, Claude Sonnet, DeepSeek V3, Llama 3.3), and evaluate performance across these subtasks using gold-standard and expert-annotated assessments. The paper offers nuanced insights into model behavior and identifies strengths and weaknesses of the evaluated systems.

Strengths
• Relevance: The topic is timely and of clear importance for both Semantic Web research and practice.
• Methodological rigor: The empirical study is thorough, covering multiple LLMs, diverse tasks, and both quantitative and qualitative analyses.
• Clarity: The paper is clearly written and well structured.
• Novel dataset: Using student-created ontologies provides a realistic basis for evaluation and introduces a valuable (if somewhat specialized) benchmark.
• Balanced analysis: The authors provide detailed observations on model performance differences across subtasks.

Major Weaknesses and Issues
1. Scope and Positioning
The paper claims to provide a capability assessment of LLMs for ontology evaluation in general. However, the actual scope is much narrower: evaluating correspondences between natural language intent descriptions and specific ontology axioms with certain property restrictions.
Ontology evaluation as a field is far broader, encompassing checks for logical consistency, alignment with competency questions, structural analysis, and more (see, for example, work by Vrandecic). The current study should be presented as addressing one specific corner of this space, rather than ontology evaluation overall.
2. Semantic Precision
The semantics of the ontology language used in the study are not made explicit. Examples are given in Turtle syntax, but boundaries of axioms are unclear. The syntactic form suggests some profile of OWL, but OWL provides an open-world semantics, not constraint checking. The paper frequently refers to “constraints,” but OWL does not support constraints in this sense.
At best, the examples can be interpreted as property restrictions in Description Logics. Even then, only a very small subset of OWL expressivity (existential, universal, cardinality) is considered. This lack of semantic grounding undermines the precision of the study. A small Turtle sketch illustrating the difference between an OWL restriction and a constraint check is given after this list.
3. Limited Generalizability
The approach requires explicit representations of modeling intent alongside axioms. This setting is plausible for student assignments, where tasks and results are clearly specified, but less so in real-world ontology engineering, where intent is often expressed in broader ontology requirement documents rather than in neat NL–axiom pairs. Thus, the generalizability of results beyond student-built ontologies is questionable.
4. Narrow Evaluation Focus
The study only considers single axiom–sentence pairs. Broader aspects of ontology evaluation such as coherence across multiple axioms, global consistency, or integration with larger ontologies are not examined.
5. Quality of Examples
Some example axioms are questionable or contain errors, e.g., :Guest rdfs:subClassOf :WeddingParty, which does not make semantic sense. This raises concerns about the reliability of the benchmark.
6. Fragility of Conclusions
The results are based on detailed experiments with prompts and LLMs available at the time of writing. However, models evolve quickly, prompting is highly sensitive, and the chosen LLMs may already be outdated by publication. This raises doubts about the robustness and lasting value of the reported conclusions.
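
To make the distinction in point 2 concrete, consider the following sketch (purely illustrative; the class and property names are hypothetical and not taken from the paper's benchmark). An existential restriction in OWL is an axiom used for inference under the open-world assumption: a :Wedding individual without an explicit :hasVenue assertion is not an error, it is simply underspecified. A constraint in the closed-world sense (e.g., a SHACL shape requiring sh:minCount 1 on :hasVenue) would instead flag such an individual as a violation.

    @prefix :     <http://example.org/wedding#> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    # Existential restriction (axiom): every Wedding has some Venue.
    # Open-world reading: each :Wedding individual is entailed to have a
    # (possibly unstated) :Venue; no data is "checked" or rejected when a
    # :hasVenue triple is missing.
    :Wedding rdfs:subClassOf [
        a owl:Restriction ;
        owl:onProperty :hasVenue ;
        owl:someValuesFrom :Venue
    ] .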

Recommendation
While the study is well executed within its chosen scope, the scope itself is narrow, semantically underspecified, and not clearly aligned with the broader notion of ontology evaluation.
I therefore recommend Major Revision, with the following priorities:
1. Clearly delimit the scope of the work and avoid overgeneralized claims about ontology evaluation.
2. Fix the semantic imprecision: specify the ontology language and its semantics, and align the terminology accordingly.
3. Revise examples and ensure semantic correctness of axioms.
4. Discuss the limitations of generalizability beyond student-built ontologies more explicitly.
If these issues are addressed, the paper could make a useful contribution as a focused case study on LLM capabilities for a specific ontology evaluation task.

Minor Comments
Page 1: "limitted" → "limited"
Page 4: NeOnGPT is based on the NeOn methodology, not the 101 methodology
Page 6: "In the filed of psychology" → "In the field of psychology"