A Framework for Assessing LLM Consistency in Knowledge Engineering

Tracking #: 3967-5181

Authors: 
Mohammad Javad Saeedizade
Reham Alharbi
Hamed Babaei Giglou
Anna Sofia Lippolis
Eva Blomqvist
Valentina Tamma
Floriana Grasso
Terry Payne
Jennifer D'Souza
Sören Auer
Andrea Giovanni Nuzzolese
Robin Keskisärkkä
Zebah Valeyil

Responsible editor: 
Cogan Shimizu

Submission type: 
Full Paper
Abstract: 
Consistency, i.e. the degree to which a system, process, or its results produce similar outcomes when repeated under identical or different conditions, is a critical concern in knowledge engineering (KE). This is particularly the case given the increasing reliance on Large Language Models (LLMs) in various tasks. This paper introduces CoLLM, a framework designed to assess whether a system or process produces consistent results in LLM-based KE tasks through three tests: (1) the LLM Repeatability Test, which evaluates the level of stochasticity or non-determinism of LLMs in existing studies; (2) the LLM Update Impact Test, which examines the effect that LLM updates may have on results; and (3) the LLM Replacement Test, which explores the effect of using alternative LLMs to perform the same study. Through 59 different experiments taken from five separate, recent studies, and leveraging various LLMs and datasets, we investigate the consistency of the results to empirically validate the reliability of the original findings for each study. Our investigation shows that in the majority of cases (81.4%), a consistent behaviour with respect to the original studies can be observed, despite some variability across the individual outputs. Additionally, in some cases, changing the choice of LLM can result in a consistent improvement across different metrics. These results demonstrate the viability of the proposed framework in general to assess the consistency of LLM-based KE tasks.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 05/Dec/2025
Suggestion:
Minor Revision
Review Comment:

The authors have improved the paper based on the reviewers’ comments. However, I remain unsatisfied with the related work section. In particular, for the ‘replacement test’, there already exists work comparing different LLMs on the same KE tasks, including KG completion and reasoning [1], ontology learning [2], relation extraction [3], ontology generation [4], ontology matching [5]...

[1] Li, Qian, et al. "LLM-based multi-level knowledge generation for few-shot knowledge graph completion." Proceedings of the 33rd International Joint Conference on Artificial Intelligence. Vol. 3. 2024.
[2] Mai, Huu Tan, Cuong Xuan Chu, and Heiko Paulheim. "Do LLMs really adapt to domains? An ontology learning perspective." International Semantic Web Conference. Cham: Springer Nature Switzerland, 2024.
[3] Zhang, Bohui, et al. "Using large language models for knowledge engineering (LLMKE): A case study on Wikidata." arXiv preprint arXiv:2309.08491 (2023).
[4] Llugiqi, Majlinda, Fajar J. Ekaputra, and Marta Sabou. "From experts to LLMs: Evaluating the quality of automatically generated ontologies." 2nd Workshop on Evaluation of Language Models in Knowledge Engineering (ELMKE), co-located with ESWC-25, to appear. 2025.
[5] Qiang, Zhangcheng, Weiqing Wang, and Kerry Taylor. "Agent-OM: Leveraging LLM agents for ontology matching." arXiv preprint arXiv:2312.00326 (2023).

Review #2
By Anelia Kurteva submitted on 05/Jan/2026
Suggestion:
Accept
Review Comment:

The authors have addressed my comments sufficiently. The content of the paper is clearer and more intuitive to follow.

However, I would like to point out that it was difficult to figure out what exactly was updated and where, since the response letter included only limited pointers (e.g., "update on page X, line Y"), and the updates in the paper have not been highlighted to guide the reviewer.

For example, for "We have now clarified the difference between reproducibility and reproducing (similar to replicating and repeating) throughout the paper," it would be beneficial to point the reviewer to a concrete section (at least).

Minor formatting comment: There seems to be quite a lot of space before and after the abstract. This could come from the template itself, but if not, the page will look better if the authors include part of Section 1 on the first page.

The beginning of Section 5 is still cut off by Table 5.

References [28][55][82] are incomplete.
Reference [88] is missing a hyperlink. What was accessed?

Review #3
By Maria Angela Pellegrino submitted on 27/Jan/2026
Suggestion:
Accept
Review Comment:

The paper presents a framework for assessing consistency in LLM-driven knowledge engineering by considering experiment repeatability, the impact of LLM updates, and the effects of LLM replacement. The topic is timely and clearly presented. Compared to the previous version, the paper has significantly improved, particularly in its introduction of the problem, positioning of the contribution, and explanation of the proposed framework.

As a minor suggestion, the related work section could benefit from the inclusion of a comparative table to better structure differences among existing approaches and make patterns more evident. The results are clearly and thoroughly described. Another minor comment concerns accessibility: the authors should consider using color-blind–friendly color schemes in tables to indicate increases and decreases (e.g., yellow and blue instead of red and green).

The discussion section is the main point of concern, as it reads more like concluding remarks than an in-depth discussion. However, the results are sufficiently rich to support this structure. The authors might consider either more clearly separating results and discussion (e.g., organized by research questions or in relation to prior studies) or merging the two sections.

Overall, I recommend acceptance of the paper as is, with a few minor suggestions for improvement.