Review Comment:
------------ Short Summary of the work ------------
This work points to a promising direction of research on assessing LLM capabilities in ontology evaluation. The authors study four LLMs (GPT-4o, Claude Sonnet, DeepSeek, and LLaMA) across four knowledge engineering tasks to examine model behavior and task-specific strengths relevant for KE. The four tasks designed for this study are: 1) Detecting an Ontology Modeling Issue, 2) Classifying an Ontology Modeling Issue, 3) Explaining an Ontology Modeling Issue, and 4) Generating a Correct(ed) Ontology Modeling. Tasks 1 and 2 are evaluated with standard metrics (precision, recall, and F1-score) to measure how well an LLM can detect a modeling issue; when an issue is found, Task 3 asks the model to provide an explanation, which is assessed by experts using the assessment schema introduced in the paper. Finally, in Task 4, the LLMs are used to correct the ontology modeling.
------------ Originality ------------
This work addresses a timely topic in KE: which tasks LLMs can support, what their limitations and behaviors are, and how we can define good evaluation criteria. Given the good direction of the work, the main weak spot I can see is the limited number of LLMs in this study. RQ2 is rather broad, and studying it with only four LLMs can be misleading; the findings may therefore not generalize without evaluating a more diverse set of LLMs. My recommendation would be to incorporate (if feasible) at least two more LLMs from different model families, such as Gemini, Mistral, or Qwen, to confirm your findings for this RQ. I believe this would strengthen your work.
Moreover, the assessment criteria for Task 3 are "Modeling Comprehension, Inference Capability, Mistake Correction Competence and Vocabulary Usage", where each dimension takes a value between -1 (incorrect) and 1 (correct). It would be great to clarify this choice and the reasoning behind it. Considering the LLM-as-a-Judge research in NLP, why not use a finer-grained scale, for example: 1 unacceptable, 2 poor, 3 medium, 4 good, and 5 very good? Continuing this point, Task 4 also appears to involve expert evaluation; if so, a reproducibility concern arises as to whether other researchers can reuse your benchmark, since they might employ different experts, and inconsistent expert perspectives could yield different scores for these tasks, making it challenging to determine which LLM is doing a good job, and why. I think it would be great if the authors clarified these points for a more reliable benchmarking effort within these four tasks.
------------ Significance of the results ------------
The presented results appear solid and support the research findings across the defined tasks. The quantitative assessments are comprehensive and reveal several interesting insights. Below are a few comments that could help strengthen the discussion and improve the clarity of the results:
- According to Table 2, the number of evaluated ontology axioms appears quite limited. For instance, in the ML-based workflow, there are only 82 samples for Task 1, 29 for Task 2, 29 for Task 3, and 82 for Task 4. If this interpretation is correct, I would like to refer to your statement on page 4, lines 4–7, where you critique work [12] for being "constrained by a small dataset and a limited number of investigated relation types". A brief clarification on how your dataset addresses or differs from this limitation would help contextualize your claims more clearly.
- I suggest presenting a table in the results section (or adjusting Table 2) that clearly reports the exact number of samples for each task, including the number of correct and incorrect instances in the gold standard datasets. This would enhance the clarity and transparency of the evaluation.
- Following the previous comment on task statistics, in Table 3 GPT-4o achieves a relatively high accuracy but a lower F1 score, primarily due to reduced recall. This discrepancy is difficult to interpret without the statistics on correct/incorrect instances. It would be helpful if the authors could provide a brief explanation for this behavior, such as whether the model is favoring the majority class, being overly conservative in its predictions, or missing positive instances, so readers can better understand the trade-offs between accuracy and F1 in this context (see the illustrative sketch after this list).
- Based on Table 3 and Figure 3, as you noted, GPT-4o demonstrates strong precision but suffers from low recall. It would be helpful if the authors could clarify whether this behavior indicates that the model is overly conservative in its predictions. Furthermore, is there any evidence of hallucinations, or of missed positive instances (false negatives), that might be contributing to this performance imbalance?
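To illustrate the kind of pattern I have in mind, below is a minimal sketch with purely hypothetical confusion-matrix counts (these numbers are not taken from the paper): on an imbalanced gold standard, a conservative model can reach high accuracy while recall, and therefore F1, stays low.

```python
# Illustrative only -- hypothetical counts, not taken from the paper.
# With a gold standard dominated by "no issue" axioms, a conservative
# model can score high accuracy while recall (and hence F1) stays low.
tp, fp, fn, tn = 5, 1, 15, 61   # hypothetical confusion-matrix entries

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  "
      f"recall={recall:.2f}  F1={f1:.2f}")
# accuracy=0.80  precision=0.83  recall=0.25  F1=0.38
```

Reporting the per-task class distribution alongside Table 3 would let readers verify whether something like this is what is happening for GPT-4o.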
------------ Quality of Writing ------------
The paper is well-written and effectively communicates the key ideas. However, I noted a few minor issues and typos that should be addressed:
- Page 1, Line 46: The phrase "..., for instance in an ontology [25]" appears to be incomplete. Please revise the sentence for completeness.
- Page 11, Line 44: The text states "We identify six main mistake types", but there are seven bullet points listed from "Type I (6 axioms)" to "Other (2 axioms)". Please clarify this.
- Page 12, Line 38: The category "Other (2 axioms)" lacks explanation. It would be great to briefly describe these two axioms or clarify why they were grouped separately.
- Page 15, Line 43: In Table 3, under the "GPT-4o" row, please correct the typo "45.4%5" by removing the trailing "5" after the percent sign.
------------ Data File Assessment ------------
- The dataset is not provided via a stable, long-term URL. No specific README or repository is linked for accessing the benchmark. The only related link is https://github.com/wu-semsys/ontology-analysis, which contains only the Ontology Analysis API, not the dataset itself. I recommend publishing the dataset in a public repository with a clear README to support community use and reproducibility.
- The dataset creation methodology is well described. Involving undergraduate students across 16 domains resulted in 31 ontologies and 82 curated axioms with modeling intentions. This indicates strong potential, but public access is essential to validate its impact.
------------ Other Comments ------------
- I would suggest reconsidering the phrasing of "Comprehensive and Comparative Evaluation" as a standalone contribution. While evaluation is indeed an essential part of any research work, it is generally expected as a standard component rather than a unique contribution. Moreover, the paper refers to the "Guided Expert-Annotation Schema for LLM Outputs" as a key contribution. However, this aspect is only briefly presented—mainly through a single table—without sufficient explanation of how the schema was developed or the rationale behind its design. Additionally, such a schema should be more accurately framed as part of an experimental LLM capability assessment rather than a standalone contribution. I recommend revisiting the contributions section and rewriting it with greater abstraction and clarity, ensuring that each claimed contribution is well-motivated, distinct, and appropriately positioned within the scope of the work.
- Is there any discussion on potential LLM hallucinations in the generated outputs?