Using LLMs for Semantic Alignment: A Study on Archival Metadata Description

Tracking #: 3973-5187

Authors: 
Maria Ioanna Maratsi
Charalampos Alexopoulos
Yannis Charalabidis

Responsible editor: 
Guest Editors 2025 OD+CH

Submission type: 
Full Paper
Abstract: 
The advantages of aligning custom data schemas with standardised ontologies within their respective knowledge domain have long since been proven in practice. Sharing a common structural representation by mapping concepts and relationships between the schemas is essential to ensure data interoperability (especially on a semantic level), integration, reuse, and the ability to leverage machine-processable and advanced-search capabilities. Archival institutions preserve, manage, and provide access to large amounts of diverse cultural and historical data, demonstrating a high potential to be active contributors to a global knowledge network, should archival data be transformed and offered as linked (open) data. Based on the expert-validated dataset of the alignment (mapping) of the Swedish National Archives schema to the Records-in-Contexts (RiC-O) ontology, the purpose of this study is two-fold. First, to examine whether it is possible to automatically and effectively extend one case (Sweden) to other archival institutions and align new custom schemas to RiC-O, given an expert-curated dataset of this domain. Secondly, using the aforementioned dataset and one more of a few human-evaluated examples of mapping to other cultural heritage ontologies as input, to examine whether an LLM (e.g., GPT-4o) can recommend meaningful alignments for enhanced metadata description to more ontologies within the same domain (CH and archives), but also across other domains. The experiments reveal challenges and shortcomings of the LLM prompting approach for these tasks, but also possible opportunities to leverage towards this direction.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 06/Apr/2026
Suggestion:
Accept
Review Comment:

The reviewers' suggestions were taken into consideration. To better justify the work, the authors could still add a paragraph that acknowledges existing mapping tools and explains the need for, or interest in, evaluating new LLM-based ones in the specific context of this study.

Review #2
Anonymous submitted on 03/May/2026
Suggestion:
Major Revision
Review Comment:

The paper studies the use of Large Language Models (LLMs), namely GPT-4o and GPT-4.5, for performing semantic alignment of archival metadata schemas to the Records-in-Contexts Ontology (RiC-O), assisting the process with other schemas in Cultural Heritage but also other domains, such as ones available in Linked Open Vocabularies (LOV).

In this context, the submitted paper contributes to the state of the art by (a) evaluating the efficiency of LLMs (GPT-4o and GPT-4.5) in the assisted alignment of new schemas and (b) evaluating the ability of LLMs to propose meaningful reuse of ontologies and ontology components across domains.

The paper is of interest to the readership of the SWJ and relevant to the journal’s aim and scope. Moreover, the topic of the paper is within the discourse of current research.
The authors use real-world data in their evaluation, which is a major strength of the research, since it supports the applicability of the results and the reuse of the method in other real-world settings. The results include a quantitative evaluation and two distinct scenarios (scenario 1: alignment, scenario 2: cross-domain reuse), thus exploring a breadth of alternatives. The results clearly indicate modest performance, with accuracy below 70% in all experiments and a considerable proportion of hallucinations (approx. 30%). While this may be a barrier to using the proposed method in a fully automated scenario, it testifies to the method's potential in human-in-the-loop approaches. The results also establish the current level of performance of LLM-based alignment.

The authors’ work also builds on well-established cultural heritage standards, including CIDOC/CRM, RiC-O and ISAD(G), thus supporting reusability in a wide range of contexts.

The work presented by the authors mostly involves prompt engineering and evaluation, deviating from the majority of works published in the SWJ, where theoretical contributions are accommodated. This is not a shortcoming per se; however, the fact is noted.

The evaluation by humans is not clearly described. The authors state that “The outputs of experiment 1 were initially human-evaluated by the authors, who were domain experts in this case”; however, the number of evaluators per result element is not given, and no protocol for quantifying the level of agreement between evaluators is described. The use of multiple evaluators might also change the outcome of the experiment: where human evaluators disagree, the severity of a misalignment could be reduced (on the rationale that a human expert might opt for the modelling approach suggested by the LLM, even if it proves suboptimal after discussion). Conversely, a choice made by the LLM that one human evaluator agreed with could be successfully challenged by a second evaluator. Along the same lines, a deeper analysis of the errors identified in the LLM's results would benefit the paper. For instance, errors might include structural errors, semantic drift, missing elements, or the inclusion of unneeded or false elements. The authors could also provide some insight into why the LLM may fail in the assigned tasks. An exploration of different prompt variants and of the effect this variation has on the results would be useful. Reporting graph-based similarity metrics could also provide an overview of the quality of the results.
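
To illustrate the kind of agreement protocol meant here, the following is a minimal Python sketch of Cohen's kappa for two evaluators who each label every proposed alignment. It is only one possible agreement measure; the function name, the label values and the example judgements are hypothetical and are not taken from the manuscript.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two evaluators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both evaluators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both evaluators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each evaluator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both evaluators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical judgements on six LLM-proposed alignments.
evaluator_1 = ["correct", "correct", "incorrect", "correct", "incorrect", "incorrect"]
evaluator_2 = ["correct", "incorrect", "incorrect", "correct", "incorrect", "correct"]
kappa = cohen_kappa(evaluator_1, evaluator_2)
print(round(kappa, 3))
```

Reporting such a coefficient alongside the number of evaluators per item would let readers judge how stable the reported accuracy figures are.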

The paper lacks a comparison with other state-of-the-art approaches. For instance, [R1] and [R2] are recent works that could potentially serve as baselines. More established baselines, such as AML and LogMap, should also be included.

The paper also does not discuss the aspect of constraints. It would also be worth investigating whether combining LLMs with other approaches (e.g. graph matching, reasoners, etc.) could improve result quality.

The portions of the paper covering the background on archival standards and the discussion are somewhat repetitive and should be reworked for conciseness.
The authors have published the results on Zenodo, following the journal’s recommendations.

[R1] Nguyen, L., Barcelos, E., French, R., Wu, Y. (2026). KROMA: Ontology Matching with Knowledge Retrieval and Large Language Models. In: Garijo, D., et al. The Semantic Web – ISWC 2025. ISWC 2025. Lecture Notes in Computer Science, vol 16140. Springer, Cham. https://doi.org/10.1007/978-3-032-09527-5_34

[R2] Rinaldi, A.M., Russo, C., Tommasino, C. (2025). A semantic approach for cultural heritage ontology matching and integration based on textual and multimedia information. Soft Comput 29, 1019–1034. https://doi.org/10.1007/s00500-025-10517-y

Review #3
Anonymous submitted on 15/May/2026
Suggestion:
Major Revision
Review Comment:

The manuscript presents an evaluation of semantic alignment using large language models (LLMs). The authors employ GPT‑4o to evaluate two scenarios in which the injected context varies, and they partially validate the results with GPT‑4.5.

From a general perspective, the manuscript presents an interesting topic for the Digital Humanities, exploring how LLMs can support decision‑making and automation through the injection of contextual or specialised knowledge. However, the manuscript lacks the technical depth that is expected in comparable studies.

Normally, DH artifacts (ontologies, mapping files, etc.) are large, far exceeding an LLM context window. It is not clear from the experiments how the authors addressed this, and the discussion section offers only vague arguments.
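
One common way to handle this, sketched below in Python, is to split the ontology terms into batches that each fit a token budget. This is an assumption about what could be done, not a description of the authors' procedure. The function `chunk_terms`, the `ex:` term strings and the whitespace-based token count are all hypothetical; a real setup would use the tokenizer of the model in question.

```python
def chunk_terms(terms, max_tokens, count_tokens=lambda text: len(text.split())):
    """Greedily group term descriptions into batches within a token budget.

    count_tokens defaults to a crude whitespace word count (an assumption);
    a model-specific tokenizer should replace it in practice.
    """
    chunks, current, used = [], [], 0
    for term in terms:
        cost = count_tokens(term)
        if cost > max_tokens:
            raise ValueError(f"single term exceeds the budget: {term!r}")
        # Start a new batch when adding this term would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(term)
        used += cost
    if current:
        chunks.append(current)
    return chunks


# Hypothetical term descriptions (placeholders, not real ontology content).
terms = [
    "ex:TermA first placeholder description text",  # 5 words
    "ex:TermB second placeholder description text",  # 5 words
    "ex:TermC short description",                    # 3 words
]
batches = chunk_terms(terms, max_tokens=8)
print(len(batches))
```

Whatever strategy was actually used (batching, retrieval, or passing the full files), stating it explicitly would make the reported metrics easier to interpret.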

The prompts, shared only in the supplements, lack structure. A Role → Context → Task → Constraints → Outputs structure is typically used to steer LLMs toward specific tasks over specialised knowledge; its absence diminishes the LLM's role in task performance and, therefore, in the computation of the presented metrics.
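
As an illustration of the structure referred to above, the following Python sketch assembles a prompt from the five sections in a fixed order. The function name, section headings and example wording are hypothetical and are not the prompts used in the manuscript.

```python
def build_prompt(role, context, task, constraints, output_format):
    """Assemble a prompt in Role -> Context -> Task -> Constraints -> Output order."""
    sections = [
        ("Role", role),
        ("Context", context),
        ("Task", task),
        # Constraints are rendered as a bullet list, one per line.
        ("Constraints", "\n".join(f"- {c}" for c in constraints)),
        ("Output", output_format),
    ]
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections)


# Hypothetical wording, for illustration only.
prompt = build_prompt(
    role="You are an expert in archival metadata and ontology alignment.",
    context="Source schema fields and target ontology terms are listed below.",
    task="Propose one target term for each source field.",
    constraints=[
        "Use only terms that appear in the provided target list.",
        "Answer 'no match' when no suitable term exists.",
    ],
    output_format="One line per field: <source field> -> <target term>",
)
print(prompt)
```

Keeping the sections separate also makes it straightforward to vary one of them at a time when comparing prompt variants.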

It would be valuable to see a more comprehensive evaluation that includes zero‑shot, one‑shot, few‑shot, or chain‑of‑thought techniques, and that validates specific alignment tasks.

Strengths

- The manuscript tackles a challenging area: using LLMs to aid decision‑making and automation by injecting contextual or domain‑specific knowledge.
- Within Digital Humanities, semantic alignment of archives and knowledge bases is a highly relevant and timely issue.
- The experimental framework is comprehensive: it reports TP, FP, TN, FN, as well as Accuracy, Recall, and Precision.
- The supplementary materials (available on Zenodo) provide sufficient detail to understand the experiment and the evaluation.

Weaknesses
+ The reported experiments show just one model evaluation (e.g., GPT-4o) in both of the RA, besides zero-shot experimentation
+ Neither the figures nor the text make clear how the new knowledge is injected into the LLM to predict the required answers. Perhaps a RAG technique was used to create the LLM context; this is not clear in the manuscript.
+ The prompts reported in the supplements pose simple questions to the LLM and do not give it enough instructions to perform the requested tasks.
+ Within the experimental design, there is no definition of the ground truth used for the computation of the Precision, Recall and F1-Score.
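
To make the last point concrete, the Python sketch below shows one way the metrics could be tied to an explicit ground truth: the gold standard is a set of (source field, target term) pairs, and each predicted pair is counted against it. The function name and the `src:`/`tgt:` identifiers are hypothetical; the manuscript may define its ground truth differently.

```python
def alignment_scores(predicted, gold):
    """Precision, recall and F1 of predicted mapping pairs against a gold set."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)   # proposed pairs that are in the gold standard
    fp = len(predicted - gold)   # proposed pairs that are not in the gold standard
    fn = len(gold - predicted)   # gold pairs the model did not propose
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical gold standard and model output.
gold = {
    ("src:title", "tgt:name"),
    ("src:creator", "tgt:agent"),
    ("src:date", "tgt:date"),
    ("src:place", "tgt:place"),
}
predicted = {
    ("src:title", "tgt:name"),
    ("src:creator", "tgt:agent"),
    ("src:date", "tgt:event"),
}
scores = alignment_scores(predicted, gold)
print(scores)
```

Note that true negatives are not defined in this pair-based formulation, so an accuracy figure would additionally require a statement of which candidate pairs count as negatives.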

Other Remarks:
* Even though all supplements were published on Zenodo, they are uploaded as separate resources (one per file) and lack cohesion with the manuscript. The authors must create a consistent supplement that complements the research, allowing the community to see the full landscape of the experiments conducted in the manuscript.
* The manuscript is still not consistent with the journal's format and style requirements, making it especially difficult to review and assess. (https://www.sagepub.com/journals/information-for-authors/preparing-your-...)

The paper addresses an important problem and offers a solid experimental framework, but it falls short in methodological detail, prompt design, and compliance with journal standards. Addressing the points above would significantly strengthen the contribution.