A Survey on Interaction Design with Large Language Models for Ontology Requirements Elicitation with Competency Questions

Tracking #: 3866-5080

Authors: 
Yihang Zhao
Xi Hu
Timothy Neate
Albert Meroño-Peñuela
Elena Simperl

Responsible editor: 
Dagmar Gromann

Submission type: 
Survey Article
Abstract: 
Competency questions (CQs) are essential in ontology engineering (OE), as they express an ontology's functional goals and serve as a foundation for its construction, evaluation, and reuse. Large language model (LLM)-based systems for CQ elicitation have recently attracted substantial attention. These systems can generate thousands of candidate CQs from domain experts or knowledge sources to help define the boundaries of a target application domain. However, current interaction paradigms fall short in supporting knowledge engineers in auditing target domain boundaries to ensure that CQs are neither too few, risking critical omissions, nor too many, resulting in information overload. This gap reflects the absence of support for processes that closely align with the divergent (lateral) and convergent (vertical) thinking observed in the arts and creativity domain. We therefore present a systematic literature review (N = 50) investigating interaction design patterns in LLM-based systems that support divergent–convergent thinking in the arts and creativity domain. We then map the identified patterns to the context of CQ elicitation and propose an interaction model that extends the existing OntoChat system, enabling knowledge engineers to navigate a set of candidate CQs on an interactive canvas, supporting exploration, evaluation, and informed decisions about what to add, discard, or revise in order to progressively define the scope of the target domain.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 11/Nov/2025
Suggestion:
Major Revision
Review Comment:

This submission presents the results of a systematic literature review on the use of divergent and convergent thinking (from creativity domains) in the generation of competency questions for the construction of ontologies with the help of LLMs. The focus is on understanding what kinds of interactions are meaningful, what interfaces elicit these interactions, and how they help in the construction of competency questions.
The submission describes the systematic way in which the papers were selected, their classification (made by two human analysts), and some of their details, especially focusing on the interaction techniques and the interfaces used. The authors then propose how these different approaches can be combined adequately.

The submission is quite systematically developed and covers a reasonable volume of publications related to the use of LLMs to elicit knowledge from experts, viewed from different perspectives. The text is suitable as an introductory text, but accessible only to readers who already have knowledge of ontology engineering and competency question elicitation, as these elements are never explained. Moreover, some important details are not fully convincing in their current form.

To find the relevant literature, the authors ran a query on three different academic databases, obtaining 40,184 papers from 184 different venues. The first filter is by venue: the authors select only the venues where at least 3 out of the top 13 papers (or 12? Figure 1 says more than 2 out of 13, but the text in Section 3.1 says at least 1.2 out of 12!) are "relevant". Two issues here (the sketch after this list makes the second concrete):
1) the authors do not explain what "relevant" means or how papers are judged relevant at this stage;
2) if the publications are already being evaluated for their relevance, why is their venue important? That is, a highly relevant paper in an uncommon venue will be removed from the analysis without any adequate justification.
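
To make the second issue concrete, here is a minimal sketch of the venue-level filter as I understand it from the text. The threshold, the top-k value, the function names, and the ranking criterion are all my assumptions, since the submission itself is ambiguous on these points:

from typing import Callable, List

TOP_K = 13         # papers inspected per venue (the paper says 12 or 13)
MIN_RELEVANT = 3   # required "relevant" papers; "relevant" is undefined

def keep_venue(ranked_papers: List[str],
               is_relevant: Callable[[str], bool]) -> bool:
    """Keep a venue iff enough of its top-ranked papers are relevant."""
    top = ranked_papers[:TOP_K]  # how this ranking is produced is unclear
    return sum(1 for p in top if is_relevant(p)) >= MIN_RELEVANT

# Issue 2 in a nutshell: one highly relevant paper in an otherwise
# off-topic venue is dropped together with its entire venue.
venue = ["highly relevant paper", "off-topic", "off-topic"]
print(keep_venue(venue, lambda p: "relevant" in p))  # False -> discarded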

After this first screening, 3,275 papers remain, which are then filtered through 8 criteria, leaving only 42. Again, there is an issue of imprecision here: while the first 5 criteria make sense, the last 3 ask for "inspiring" techniques/designs, without clarifying what this means.

The authors then snowball (3 iterations forward, i.e., examining papers that cite the included set, and 2 backward, i.e., following the references of the included set; the submission describes these directions the wrong way around), which adds 8 more publications, ending up with a total of 50.

To improve clarity and allow for reproducibility, the data obtained throughout this process should be made available. Instead, only the final 50 publications are known. Hence, it is essentially impossible to verify whether important work has been dismissed throughout the filtering process (or was already missed by the original query, but that is a different issue). The lack of a full dataset makes it impossible to evaluate the breadth and balance of the survey's coverage.

The description of the findings is necessarily compact, as in every literature review, but the new proposal is too vague. The presentation in general lacks clarity due to an inadequate level of detail. In addition, the authors seem to use a varying notion of an LLM, which they never define. Is an LLM just a text-to-text system, or are multi-modal systems also allowed? This is important especially when considering the UI elements described. The authors also appear to take the success of LLM systems at face value, which is of course problematic when one is building a knowledge base and hallucinations and other artefacts should be avoided. Specifically, how certain can a user be that when, say, they ask for more detailed elements, or for a tangential topic, the LLM is actually doing that? Is the general space really being explored?

To conclude, another important question is why the authors limit the search to works that refer to the arts and creative processes. While I understand the idea of divergent and convergent thinking, I do not see the close connection to general knowledge elicitation. Perhaps this should be better clarified. It is also unclear how relevant this is to the SWJ and the Semantic Web community as a whole.

Overall, I believe that the submission still requires some major revisions before being ready for publication.

Below is the assessment of the long-term stable resources:
1. the data is relatively well-organized and easy to explore, but
2. the data is not sufficient to replicate the systematic review, the results, or the experiments, as many notions are human-annotated without a clear explanation.

Review #2
Anonymous submitted on 29/Jan/2026
Suggestion:
Major Revision
Review Comment:

SUMMARY:
This paper presents a survey of, as well as an interactive catalog of, LLM-based interaction patterns, with the objective of better specifying competency questions as used in ontology engineering. To this end, interaction patterns in LLM-based applications in the arts and creative domain are explored. Patterns derived from this domain are then to be integrated, in the form of an interaction model, into OntoChat, an existing competency question elicitation and ontology engineering tool.

OVERALL EVALUATION: MAJOR
The idea of taking interaction patterns from LLM-based applications that focus on creative and artistic tasks and activities and transferring them to a tool for competency questions in ontology engineering is definitely interesting; however, it only became entirely clear to me on Page 11. Thus, the first sections, especially the very first ones, need a thorough revision to strongly improve the clarity of the proposed approach and objectives.

The validity of the idea is to be assessed based on an interaction model that is integrated into OntoChat. However, the interaction model and its integration are described like a technical manual, without any justification or explanation of the decisions made. The most considerable weakness of the submission, however, is the lack of evaluation of the strongly adjusted and visually/cognitively enriched user interface/application. At least a use case with ontology engineers not involved in the development of the interaction model using the newly created UI would be necessary in my view. The fact that the paper merely presents a conceptual model, with no actual interaction model being integrated into OntoChat, does not become clear before Page 19. If a conceptual model were sufficient, which I believe it is not for a journal publication, this should be stated explicitly from Page 1.

The idea of improving human-LLM interaction in order to ensure that creative thinking processes are impeded less is an innovative direction of research. However, the lack of clarity and of concretely tested and evaluated interaction patterns/designs reduces the validity of the submission.

DETAILED COMMENTS:
While it might be very clear to the authors what they mean by convergent-divergent thinking, to ensure wide intelligibility for a varied group of readers, I would strongly recommend defining both terms properly, beyond merely aligning them with lateral/vertical thinking, early on, i.e., before Page 3.

The illustration of the literature review process in Figure 1 is nicely done and provides a very good overview of the selection process. The selection of search databases is not entirely clear to me, as the ones presented are very selective. Furthermore, the explanation of the keyword selection needs further clarification. The text suggests the term LLMs was excluded, but the example query indicates a different direction. Furthermore, I was lost at this point as to what the objective of this survey is and why creat* and art* would be included if the objective is interaction and user interface design. Should the search not at least also cover interaction design patterns and/or user interfaces?

On what basis were the retrieved papers ranked? The method indicates that the top 12 papers served as a basis for exclusion/inclusion; however, how was this ranking from top to bottom determined? Simply based on date or citations? The relevance would presumably strongly depend on the keywords. It would be good to provide the full keyword list used. I would also like to point out that Figure 1 states the top 13 papers were utilized, whereas the text states it is the top 12 papers.

For the screening, the calibration of the eligibility criteria is well explained and reasonable. However, the actual screening process is not further specified. Was this done manually? By how many people? Only based on title, abstract and keywords?

As regards the reported kappa value for the codebook generation and application, was the reported score calculated before or after the detailed in-person coordinating discussions? I would presume the number was calculated after that process, as it is quite high for the task.
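
For context, Cohen's kappa corrects observed agreement for the agreement expected by chance, which is why measuring it after coordinating discussions inflates it. A minimal, self-contained sketch of the standard computation; the helper name and the include/exclude labels below are hypothetical, not taken from the submission:

from collections import Counter

def cohens_kappa(coder1, coder2):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(coder1)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(coder1, coder2)) / n
    # Chance agreement: product of each coder's marginal label rates.
    c1, c2 = Counter(coder1), Counter(coder2)
    p_e = sum((c1[lab] / n) * (c2[lab] / n) for lab in set(c1) | set(c2))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical screening decisions from two independent coders:
a = ["inc", "inc", "exc", "inc", "exc", "exc", "inc", "exc"]
b = ["inc", "exc", "exc", "inc", "exc", "inc", "inc", "exc"]
print(cohens_kappa(a, b))  # 0.5: moderate agreement before any discussion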

In the findings section, Table 2 indicates IRR, but there is no mention of it in the text. It should be mentioned in the text that this presumably is the kappa value per code. To me personally, it is also strange that the codes and the codebook are presented, but the actual results of the survey are reduced to the categorization of references in Table 2. Should a survey not also include a description of the result set? In Section 4.1 the whole text presents these codes and categories as if they were well-established and matter-of-fact; however, no references are included. Where are these descriptions taken from? How do the interaction techniques directly relate to the cited references? Even though assumptions can be made, this should still be explicitly stated.

The objective of this paper is very clearly described on Page 15, when the limitations of the OntoChat system are presented. I believe that a similar argumentation very early on, i.e., in the abstract or introduction, would tremendously help clarify the idea.

The introduction of the interaction model, much like Section 4.1, reads like a technical manual rather than a research contribution. To change this, decisions should be explained and justified. The major issue in my view is the complete lack of evaluation of the newly proposed, creatively/artistically inspired interaction model integrated into OntoChat. Without an evaluation in a practical setting, or at least a use case, the validity of the whole idea cannot be determined.

In the limitations, an evaluation of the quality of the CQs is mentioned; however, no explicit evaluation is presented elsewhere. This is very confusing, as are the contradictory statements across the paper about actually integrating an interaction model versus finally only proposing a conceptual model.

MINOR COMMENTS:
1.23 absence of supports => the noun support has no plural
2.4 em-dashes => I would strongly recommend replacing em-dashes with standard academic punctuation, which in this case would be commas; in 2.14 it's even grammatically problematic, as is the sentence; since em-dashes are used annoyingly often, I would strongly recommend removing them, as they are very typical of LLM-generated texts, as are brackets around examples.
2.8 CQs candidates => CQ candidates
2.22 with LLM => Did you mean what LLM-based UI designs are needed?
2.23 What supporting can => What support can
2.31 we also includes => include
2.32 cognitive process => I would propose either cognitive processes or the cognitive process
3.7 It is quite uncommon to have a heading be followed by a (sub-)heading
3.48 This is the first time that an acronym is introduced correctly, with the letters that contribute to the acronym capitalized in the long form
4.43 lung is at best a term, but not a terminology, and a terminology may not be a point in space
5.10 The acronym HCI should be introduced the first time it is used
14.17 This is a very strange way of introducing acronyms; please make the introduction of acronyms consistent, ideally Long Form (short form), where the letters contributing to the acronym are capitalized in the long form
15.35 Referring to Figure 3, => As depicted in Figure 3,
17.37 Figure 3d => Figure 3 has no d and based on the context, did you maybe mean Figure 4d