Review Comment:
The paper studies the use of Large Language Models (LLMs), namely GPT-4o and GPT-4.5, for the semantic alignment of archival metadata schemas to the Records-in-Contexts Ontology (RiC-O), with the process assisted by other schemas from Cultural Heritage as well as from other domains, such as those available in Linked Open Vocabularies (LOV).
In this context, the submitted paper contributes to the state of the art by (a) evaluating the efficiency of LLMs (GPT-4o and GPT-4.5) in assisting the alignment of new schemas and (b) evaluating the ability of LLMs to propose meaningful reuse of ontologies and ontology components across domains.
The paper is of interest to the readership of the SWJ and relevant to the journal’s aim and scope. Moreover, the topic of the paper is within the discourse of current research.
The authors use real-world data in their evaluation, which is a major strength of the research since it supports the applicability of the results and the reuse of the method in other real-world settings. The results include a quantitative evaluation and two distinct scenarios (scenario 1: alignment, scenario 2: cross-domain reuse), thus exploring a breadth of alternatives. The results clearly indicate modest performance, with accuracy below 70% in all experiments, while a considerable share of hallucinations (approx. 30%) is also identified. While this may be a barrier to using the proposed method in a fully automated scenario, it testifies to its potential in human-in-the-loop approaches. The results also establish the current level of performance of LLM-based alignment.
The authors’ work also builds on well-established cultural heritage standards, including CIDOC/CRM, RiC-O and ISAD(G), thus supporting reusability in a wide range of contexts.
The work presented by the authors mostly involves prompt engineering and evaluation, deviating from the majority of works published in the SWJ, where theoretical contributions are accommodated. This is not a shortcoming per se; however, the fact is noted.
The human evaluation is not clearly described. The authors state that “The outputs of experiment 1 were initially human-evaluated by the authors, who were domain experts in this case”; however, the number of evaluators per result element is not given, and no protocol for quantifying the level of agreement between evaluators is described. The use of multiple evaluators might also change the outcome of the experiment: where evaluators disagree, the severity of a misalignment could be reduced (on the rationale that a human expert might opt for the modelling approach suggested by the LLM, even if after discussion it could prove to be suboptimal). Conversely, a choice made by the LLM that was found to be in agreement with one human evaluator could be successfully challenged by a second evaluator.
In the same vein, a deeper analysis of the errors identified in the LLM results would benefit the paper. For instance, errors might include structural errors, semantic drift, missing elements, or the inclusion of unneeded or false elements. The authors could also provide some insight into the reasons why the LLM may fail in the assigned tasks. Exploring different prompt variants and the effect of this variation on the results would be useful. Reporting graph-based similarity metrics could also provide an overview of the quality of the results.
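As an illustration of the kind of agreement protocol meant in the comment on human evaluation, the following is a minimal sketch of Cohen's kappa for two evaluators who each judge the same LLM-proposed alignments. The function name, the label values and the sample judgements are hypothetical and are not taken from the paper; with more than two evaluators a measure such as Fleiss' kappa would be needed instead.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: share of items on which the two raters coincide.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: derived from each rater's own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical judgements on six proposed alignments (not data from the paper).
evaluator_1 = ["correct", "correct", "incorrect", "correct", "incorrect", "correct"]
evaluator_2 = ["correct", "incorrect", "incorrect", "correct", "incorrect", "correct"]
kappa = cohen_kappa(evaluator_1, evaluator_2)
```

Reporting such a coefficient alongside the accuracy figures would let readers judge how much of the reported performance depends on the individual evaluator.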
The paper lacks a comparison with other state-of-the-art approaches. For instance, [R1] and [R2] are recent works that could potentially serve as baselines. More established baselines, such as AML and LogMap, should also be included.
The paper also does not discuss the aspect of constraints. It would also be worth investigating whether combining LLMs with other approaches (e.g. graph matching, reasoners, etc.) could improve result quality.
The portions of the paper covering the background on archival standards and the discussion are somewhat repetitive and should be reworked for conciseness.
The authors have published the results on Zenodo, following the journal’s recommendations.
[R1] Nguyen, L., Barcelos, E., French, R., Wu, Y. (2026). KROMA: Ontology Matching with Knowledge Retrieval and Large Language Models. In: Garijo, D., et al. The Semantic Web – ISWC 2025. ISWC 2025. Lecture Notes in Computer Science, vol 16140. Springer, Cham. https://doi.org/10.1007/978-3-032-09527-5_34
[R2] Rinaldi, A.M., Russo, C. & Tommasino, C. A semantic approach for cultural heritage ontology matching and integration based on textual and multimedia information. Soft Comput 29, 1019–1034 (2025). https://doi.org/10.1007/s00500-025-10517-y