Ontology-based Information Extraction from Cultural Heritage Digital Representations: A Case Study in Portuguese Archives

Tracking #: 3912-5126

Authors: 
Mariana Dias
Carla Teixeira Lopes

Responsible editor: 
Guest Editors 2025 OD+CH

Submission type: 
Tool/System Report
Abstract: 
Linked Data (LD) enables cultural heritage institutions to refine archival descriptions and improve findability, but manually creating LD descriptions remains labor-intensive. This paper presents an ontology-guided information extraction system that assists archivists by automatically identifying concepts and relations in digitized archival records. Focusing on Portuguese archival collections, we extract and structure data according to ArchOnto, a CIDOC-CRM-based LD model for archives, to support future metadata enrichment. Our approach identifies core archival entities from textual digital representations of archival records obtained through optical character recognition and human-made transcriptions. However, it shows limited results in extracting some entities and relational facts. Our low-performing results indicate that fine-tuning information extraction models using adapted general-domain datasets for Cultural Heritage tasks in 20th-century documents is only marginally viable.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 05/Nov/2025
Suggestion:
Major Revision
Review Comment:

Overview:
This paper tackles the challenge of generating Linked Data (LD) descriptions for cultural heritage artifacts. Since manually producing LD annotations from archival records is time-intensive, the authors propose automating the process through ontology-based information extraction. Given the scarcity of digitized and annotated archival documents, the study aims to evaluate how NER and RE models trained on general-domain datasets perform when applied to 20th-century archival documents. With a focus on Portuguese Cultural Heritage (CH) documents, they compare the performance of these models on a baseline of contemporary Portuguese documents versus historical documents, covering both OCR-extracted and human-transcribed texts. The results show that the models perform poorly on historical data, highlighting the limited transferability of such models to historical CH archives.

(1) Quality, importance, and impact of the described tool or system
Strengths
* The paper addresses an important challenge in the Cultural Heritage (CH) domain. Extracting structured information from non-born-digital archival records, which are often poorly digitized and lack consistent annotation, is a very common problem for archivists and digital humanities researchers.
* I also appreciated that the study contributes to exploring these tasks in non-English language settings.
* The methodology is clear: the authors present a transparent and reproducible pipeline, detailing dataset creation/adaptation, model training, and evaluation procedures.
* The implementation leverages open-source frameworks and publicly available datasets.
* The accompanying datasets and trained models are openly shared through a stable repository (Figshare), which also includes a README-like description of the content of the repo.

Weaknesses
* Despite its title, the work focuses more on machine learning experiments than on actually presenting a new tool or system. The ontology serves only as a schema for alignment, not as an active tool to guide the extraction. For this reason, I think that the Semantic Web aspect remains only marginal. The paper also does not show how extracted entities and relations are transformed into Linked Data (e.g., RDF or SPARQL-queryable resources).
* Performance results are weak, particularly for RE and for RE/NER on OCR inputs. This suggests that the approach is currently not viable for the intended setting.
* The work does not position itself adequately within the current state of the art. More recent models, such as GLiNER for multilingual entity extraction and Relik for relation and event linking, as well as LLMs, are neither discussed nor compared against. These models could handle both ontology-aware extraction and some of the challenges related to domain-specific or noisy documents. I believe that this omission weakens the paper’s contribution.
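
To make the missing Linked Data step concrete, the following is a minimal sketch of how recognized entities could be serialized as RDF in N-Triples form. It is not the authors' method. The label set (`PER`, `LOC`, `EVENT`), the `LABEL_TO_CLASS` mapping, the `http://example.org/archive/` namespace, and the function names are all assumptions made for illustration; the actual ArchOnto alignment is not specified in the manuscript, so plain CIDOC-CRM classes stand in for it.

```python
# Illustrative sketch only: label names, mapping, and instance namespace are assumptions.
CRM = "http://www.cidoc-crm.org/cidoc-crm/"
BASE = "http://example.org/archive/"  # placeholder instance namespace (hypothetical)
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical mapping from NER dataset labels to CIDOC-CRM classes.
LABEL_TO_CLASS = {
    "PER": CRM + "E21_Person",
    "LOC": CRM + "E53_Place",
    "EVENT": CRM + "E5_Event",
}


def escape_literal(text):
    """Escape backslashes, double quotes, and newlines for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def entities_to_ntriples(entities):
    """Turn (surface form, label) pairs into N-Triples lines.

    Entities whose label has no mapped class are skipped.
    """
    lines = []
    for index, (surface, label) in enumerate(entities):
        cls = LABEL_TO_CLASS.get(label)
        if cls is None:
            continue
        subject = f"<{BASE}entity/{index}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{cls}> .")
        lines.append(f'{subject} <{RDFS_LABEL}> "{escape_literal(surface)}" .')
    return lines


# Made-up NER output; "DATE" is deliberately unmapped to show the skip path.
example = [("Maria Silva", "PER"), ("Porto", "LOC"), ("1923", "DATE")]
triples = entities_to_ntriples(example)
for line in triples:
    print(line)
```

A step of this kind, however small, would let readers load the system output into a triple store and query it, which is what a Semantic Web contribution would be expected to demonstrate.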

(2) Clarity, illustration, and readability of the describing paper
Strengths
* The paper is clearly written and well organized, following a logical structure.
* Tables and figures effectively support the narrative.
* The authors are honest about limitations (i.e. “limited viability”) and provide a detailed account of both the methods and results, which is commendable in a Tools & Systems Report.
* The discussion section clearly identifies sources of poor performance, including OCR noise and domain mismatch, which helps readers contextualize the results.

Weaknesses
* The abstract and introduction do not fully reflect the actual focus of the work. They frame the paper as a solution for automating LD generation, while the real contribution is an evaluation of model transferability. In the abstract, the authors also overstate the applicability of the system, stating that it successfully ‘identifies core archival entities from textual digital representations’, while results show otherwise.
* The ontology-based aspect of the extraction system is only partially described. Specifically, the paper does not sufficiently explain how ontology concepts/relations are mapped from dataset labels to classes and properties. Also, in both cases the mapping process remains partially manual, which goes against the main scope of the paper.
* While the limitations are acknowledged, a more explicit reflection on how these findings inform future system design would strengthen the conclusion.
* The related work section could be significantly strengthened.

(3) Quality of available resources:
The models and datasets are publicly shared via Figshare. The resource package appears well organized, including distinct datasets for training, testing, and evaluation (OCR vs human transcription), and accompanied by metadata and a README that explains the structure. However, the folder lacks the code for exact reproducibility. I encourage the authors to share both the data and the code on GitHub.

(4) Suggestions for future work:
* Explore the integration of state-of-the-art models and LLMs.
* Compare performance with models trained on domain-specific data.
* Improve ontological grounding.
* Reinforce positioning within the state of the art.

Summary:
Overall, the paper addresses common challenges for researchers working in the CH domain, such as noisy and unannotated data. However, the work does not fully align with its claim of presenting a new system. Rather, it constitutes an exploratory study on the transferability of IE models to archival data. While the aim remains meaningful, results show that the approach is not yet viable for real archival applications. Furthermore, the work needs to be better positioned within the current state of the art and should include more recent advances in NER and RE, particularly those leveraging transformer architectures and LLMs.

Review #2
Anonymous submitted on 12/Nov/2025
Suggestion:
Accept
Review Comment:

The proposal implements an ontology-based information extraction (OBIE) system guided by ArchOnto on Portuguese texts in the Cultural Heritage domain.
Here too, it would be useful to provide an adequate description of the ArchOnto ontology, presented in previous works, in order to fully understand the conversion in Table 2. In addition, persistent URLs for the ontology should also be added here.
I would suggest adding a few more examples in natural language to better illustrate Tables 2 and 4.
Overall, the paper is well written and methodically structured. Aims and motivations are presented with clarity. The Background and Related Work is well structured, with a particular focus on archival corpora and Relation Extraction in Portuguese. The paper also includes clear and explanatory tables and figures. The bibliography is good and satisfactory. The results are substantiated by experimental evaluation, including analytical measurements.
Future work could include testing the OBIE system with other Romance languages such as Spanish or Italian.
Finally, in the introduction and keywords, it would be preferable to use ontology instead of LD.

Review #3
Anonymous submitted on 21/Nov/2025
Suggestion:
Major Revision
Review Comment:

This paper presents an information extraction system applied to Portuguese archival material.
This is an important and timely real-world use case, as many archival collections remain non-digital,
and such systems can significantly support and accelerate their digitization.
Overall, the paper addresses a relevant problem and is generally interesting;
however, several sections require improvement or further clarification.
In particular, the introduction, background, and several other parts need clarification.
Moreover, the authors should explain why large language models (LLMs) were not considered in their approach and whether they plan to explore them in future work. Below, I provide detailed suggestions for the next version of the paper.
My recommendation is Major Revision.

Abstract
The abstract is well written. However, please add a sentence briefly explaining what CIDOC-CRM is,
as the term may be unfamiliar to some readers.

General Comment
Please add section numbering.
Additionally, discuss whether the proposed solution can be adapted for other languages beyond Portuguese.

Introduction
The introduction is missing a running example that illustrates the problem more concretely, for instance,
an example of a Linked Data description from the cultural heritage domain.
Please also expand the background information, especially on CIDOC-CRM and ArchOnto.

Moreover, the introduction should clearly present:
First, the main challenges addressed by the paper
Second, the research questions
Third, a description of the proposed approach (and steps)

Background and Related Work
This section should explain CIDOC-CRM and ArchOnto in greater depth.
At the end of the related work discussion,
please provide a clear comparison with existing approaches, showing how they differ from your method.

Ontology-based Named Entity Recognition

The pipeline figure is helpful, but several details regarding the manual annotation work are missing.
For example:
How many ambiguous entities were encountered?
Which classes were most affected (only E5 Event and E53 Place, or others as well)?
Regarding the evaluation datasets, Table 4 requires more explanation.
Specifically, the values 105 (Eval.Human) and 73 (Eval.OCR) are unclear to me,
please clarify how this subset was derived from the initial Arch.NER dataset and what selection criteria were used.

Training of Machine Learning Models and Evaluation
A key question is why the authors did not consider using or evaluating a large language model (LLM) for NER and relation extraction,
given their strong performance in similar tasks.
Please discuss this choice explicitly.
Additionally, Tables 6 and 7 require more interpretation.
Explain the results in greater detail and discuss how and why they differ from the results shown in Table 5.

Ontology-based Relation Extraction / Dataset Creation
All datasets used or created in the study should be described together in a single subsection for clarity.
After presenting the datasets, the corresponding experimental results should be provided.
In particular, the results for the ontology-based relation extraction should be presented in a table or figure to improve readability.

Discussion and Conclusion
The conclusion should also include key quantitative results—e.g., F1 scores.