Abstract:
Consistency, i.e., the degree to which a system, process, or its results produce similar outcomes when repeated under identical or different conditions, is a critical concern in knowledge engineering (KE). This is particularly the case given the increasing reliance on Large Language Models (LLMs) in a variety of tasks. This paper introduces CoLLM, a framework designed to assess whether a system or process produces consistent results in LLM-based KE tasks through three tests: (1) the LLM Repeatability Test, which evaluates the level of stochasticity or non-determinism of LLMs in existing studies; (2) the LLM Update Impact Test, which examines the effect that LLM updates may have on results; and (3) the LLM Replacement Test, which explores the effect of using alternative LLMs to perform the same study. Through 59 different experiments taken from five
separate, recent studies, and leveraging various LLMs and datasets, we investigate the consistency of the results to empirically validate the reliability of the original findings for each study. Our investigation shows that in the majority of cases (81.4%), consistent behaviour with respect to the original studies can be observed, despite some variability across individual outputs. Additionally, in some cases, changing the choice of LLM can result in a consistent improvement across different metrics. These results demonstrate the general viability of the proposed framework for assessing the consistency of LLM-based KE tasks.
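To make the first test more concrete, the sketch below shows one minimal way a repeatability check could be operationalised: issue the same prompt to the same model several times and measure how often the outputs agree. The query_llm callable, the stub model, the agreement metric, and the example prompt are illustrative assumptions, not the protocol used in the paper.

from collections import Counter
from typing import Callable, List

def repeatability_score(query_llm: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Query the same model with the same prompt several times and return the
    fraction of runs agreeing with the most frequent output (1.0 = fully repeatable)."""
    outputs: List[str] = [query_llm(prompt).strip().lower() for _ in range(runs)]
    top_count = Counter(outputs).most_common(1)[0][1]
    return top_count / runs

# Hypothetical stand-in for a real LLM client; a deterministic stub yields a score of 1.0.
def stub_llm(prompt: str) -> str:
    return "dbpedia:Barack_Obama"

if __name__ == "__main__":
    print(repeatability_score(stub_llm, "Link the mention 'Obama' to a KG entity."))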