The InTaVia Knowledge Graph – European National Biographical and Cultural Heritage Object Data

Tracking #: 3851-5065

Authors: 
Matthias Schlögl
Jouni Tuominen
Joonas Kesäniemi
Petri Leskinen
Go Sugimoto
Victor de Boer

Responsible editor: 
Guest Editors 2025 OD+CH

Submission type: 
Dataset Description
Abstract: 
The InTaVia Knowledge Graph (IKG) is a large knowledge graph containing heterogeneous multilingual data from four European national biographies, connected to related cultural heritage objects. This resource provides researchers, heritage professionals, and the informed public with access to this biographical information. This paper describes the source data, the data model, the pipeline components for managing and harmonizing the data, and the resulting knowledge graph. The data model combines the domain standards CIDOC CRM and Bio CRM with elements to represent multiple perspectives on biographical information. The knowledge graph was consolidated from four prosopographical databases (PDBs) and enriched with links to Cultural Heritage Objects (CHOs) from Europeana and Wikidata. The resulting knowledge graph has information about 112,050 persons, described by 257,673 person proxies. In addition to the data model and the data itself, we also describe the infrastructure used to harmonize and maintain this heterogeneous knowledge graph.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Yannis Marketakis submitted on 16/May/2025
Suggestion:
Major Revision
Review Comment:

This work presents the construction of a knowledge graph of national biographies by integrating heterogeneous data from multiple sources. The authors outline the contents of the original datasets, detail the workflow for building the final knowledge graph, and describe the reconciliation and enrichment processes. The paper is well-structured and organized, with relevant references and supplementary materials provided via external URLs where appropriate. A few issues remain that should be addressed, and I encourage the authors to consider the comments listed below.

- In the introduction section, the datasets are sometimes described as ‘biographical datasets’ and sometimes as ‘biographical dictionaries’. Please consider referring to them using a single, consistent term.
- Missing citation or reference for TEI (in related work)
- Footnote 13 is a dead URL
- Regarding provenance, the authors mention that they rely on PROV-O. Why was this preferred over CRMdig, which is an extension of CIDOC-CRM (that is already adopted as the core model for IKG)?
- Figure 1: Increase the fonts, and if possible, mention the actual namespaces that are used (i.e. crm, idm)
- The description of the four data sources shows that several records are not included in the integrated set (e.g. 5833 of 7000 from BiographySampo, 7908 of 11660 from SBI). What is the reason for this?
- Figure 2: the fonts are really small and practically unreadable on a printed version. Perhaps adding those details to a table would be better.
- Regarding the overall number of resources, are the figures mentioned in Figure 2 the actual sizes in IKG? Are there any duplicates (i.e. Persons or places that appear in several data sources) or are they eliminated after reconciliation?
- It is not clear if the authors apply sameAs rules for reconciling resources (e.g. persons), or if they rely on existing sameAs from the original data sources (based on the mappings with wikidata and GND they mention).
- In the ETL pipeline the authors mention that they support a workflow that updates IKG against updating contents from the original sources. This is OK for new content being added in the original data sources, but how do they deal with updated content (e.g. changed names for persons or other details)?
- Regarding the transformation of the datasets to CIDOC CRM, the authors mention that they convert the data sources, and only for one source do they apply mappings to transform them. Why only for one? Are the rest of the sources translated programmatically?
- In general, I am missing a technical discussion about the semantic data integration of the data sources. For example, in what format and structure are these sources harvested? Do they deal with data cleaning or normalization (e.g. normalize formats of dates)? I think such a discussion should be included.
- Consider making footnote 31 (SHACL) a citation.
- The example SPARQL queries do not work. I am using the provided URLs but they redirect to the landing page of the IKG SPARQL endpoint (https://qlever-ui.acdh-ch-dev.oeaw.ac.at/intavia/HW8Vyo)
- It is not clear if the authors managed to “merge” records found in heterogeneous data sources. If so, this should be explicitly stated, since it would provide a holistic view of resources and support additional scenarios, for example, complex query answering that is possible only after combining data sources.
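To make the reconciliation and merging questions above concrete, a sketch of what such a structure might look like is given below in Turtle. All class, property, and instance names here are hypothetical illustrations, not the actual IKG vocabulary: two person proxies, one per source dataset, point to a shared person resource, which in turn carries an owl:sameAs link to an external identity.

```turtle
@prefix :    <http://example.org/ikg/> .          # hypothetical namespace
@prefix idm: <http://example.org/idm/> .          # hypothetical namespace
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix wd:  <http://www.wikidata.org/entity/> .

# Two proxies for the same historical person, one per source dataset.
:person_proxy_apis_1 a idm:Person_Proxy ;
    idm:proxy_for :person_1 .

:person_proxy_bs_1 a idm:Person_Proxy ;
    idm:proxy_for :person_1 .

# External identity link, either carried over from the source data
# (e.g. existing Wikidata/GND mappings) or asserted by a reconciliation rule.
:person_1 owl:sameAs wd:Q937 .
```

Stating explicitly which of these links are inherited from the sources and which are produced by the pipeline would answer the sameAs question raised above.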

Review #2
Anonymous submitted on 23/May/2025
Suggestion:
Major Revision
Review Comment:

This article presents the InTaVia Knowledge Graph (KG) to integrate heterogeneous cultural heritage data produced by different organizations. The integration is done through Semantic Web languages and technologies, resulting in a KG based on CRM as core ontology.

IDM-RDF, the ontology used to integrate the data, is an OWL taxonomy that declares the domain and range of object and data properties and includes some axioms of equivalence between classes. As this paper is a 'dataset description', the emphasis is on data publication rather than the ontology itself.

However, I think some aspects of the ontology deserve to be presented or discussed more thoroughly in the paper. In particular:

-- In the introduction, the authors state that knowledge graphs can be used to reveal hidden patterns and relationships in data. This is indeed interesting. Does automated reasoning play any role in this project? If so, what type of reasoning does the ontology support?

-- Figure 1: the diagram deserves explanation, in particular the use of roles for representing participants in events. For example, consider the following RDF triples, based on the model in Figure 1. The triples represent two instances of E12_Production, i.e., :event1 and :event2, with participant :Printer (I consider it as an instance of :Event_Role, following the authors’ proposal). I assume that the same role can be carried by multiple actors; in the example, :John and :Mary both carry the role of :Printer. How can one understand in this representation who is the agent participating in the event? For example, does Mary participate in :event1 or in :event2?
Perhaps there is something that I don’t understand well in the authors’ proposal that can clarify my concerns.

:event1 :had_participant_in_role :Printer.
:event2 :had_participant_in_role :Printer.

:Mary :bearerOf :Printer.
:John :bearerOf :Printer.

As a suggestion, capturing that an agent participates in an event with a role might require the use of reification methods for relations with arity higher than 2 (https://www.w3.org/TR/swbp-n-aryRelations/).
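Following the W3C note cited above, the ambiguity could be resolved by reifying each participation as its own node, so that a role-bearing is scoped to exactly one event and one agent. The names below are illustrative only, not the actual IDM vocabulary:

```turtle
@prefix : <http://example.org/> .

# One participation node per (event, agent, role) combination.
:participation1 a :Participation ;
    :in_event :event1 ;
    :agent    :Mary ;
    :in_role  :Printer .

:participation2 a :Participation ;
    :in_event :event2 ;
    :agent    :John ;
    :in_role  :Printer .
```

With this pattern it is unambiguous that Mary participates in :event1 and John in :event2, while both still carry the same role :Printer.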

-- The authors emphasize the importance of representing conflicting information, yet they fail to provide any examples. Consequently, it is difficult to grasp what they mean or the nature of the conflicting data they require. Apart from the lack of examples, conflicting data seems to be represented through so-called 'proxies', but nothing is said about modelling them. It would be very interesting if the authors could explain this aspect of their proposal in more detail, as otherwise it remains unclear how it works.

Other comments:

-- Regarding the dataset description, the authors should specify the type of data contained in each dataset (e.g. name, surname, date of birth, date of death, relatives, artworks produced, etc.). This information would help users to understand the context of the research proposal and the type of data that can be retrieved through the SPARQL endpoint. Adding a table that schematically compares the integrated datasets would also help readers to understand the similarities and differences between the data sources.

-- I accessed the SPARQL endpoint given in footnote 9, and ran the queries provided in the Example tab. The queries return data which are, however, not explorable. For example, I clicked on the IRI for a person with id 53823, but the system returns a ‘page not found’.

-- In the GitHub repository (https://github.com/InTaVia/idm-rdf/tree/main/idm-OWL), the ontology folder contains three files. The paper says that InTaVia uses a modular structure, so it would be helpful to know how the modules relate to each other. More specifically, of the OWL files in the repository, which one is the main ontology file? I browsed the intavia_idm1.ttl file, but I’m not sure whether it is the main file or how it relates to the others.

-- From a research perspective, I believe that detailing the challenges faced by the authors while developing their project would enhance the quality of this proposal. For instance, did they encounter difficulties when integrating data from various sources or aligning it with CRM? Did they gain any specific advantages from a data modelling, conceptual, or other perspective by reusing CRM?

-- The paper lacks an evaluation of the materials presented. Setting aside the evaluation of the ontology, has the usability of the platform been assessed by users? Have the authors collected user feedback to ascertain whether the platform meets their research needs and expectations?

Review #3
By Michalis Sfakakis submitted on 22/Jun/2025
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'Data Description' and should be reviewed along the following dimensions: Linked Dataset Descriptions - short papers (typically up to 10 pages) containing a concise description of a Linked Dataset. The paper shall describe in concise and clear terms key characteristics of the dataset as a guide to its usage for various (possibly unforeseen) purposes. In particular, such a paper shall typically give information, amongst others, on the following aspects of the dataset: name, URL, version date and number, licensing, availability, etc.; topic coverage, source for the data, purpose and method of creation and maintenance, reported usage etc.; metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressivity, growth; examples and critical discussion of typical knowledge modeling patterns used; known shortcomings of the dataset. Papers will be evaluated along the following dimensions: (1) Quality and stability of the dataset - evidence must be provided. (2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided. (3) Clarity and completeness of the descriptions. Papers should usually be written by people involved in the generation or maintenance of the dataset, or with the consent of these people. We strongly encourage authors of dataset description papers to provide details about the used vocabularies; ideally using the 5 star rating provided here. Please also assess the data file provided by the authors under “Long-term stable URL for resources”.
In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.