WarSampo Knowledge Graph: Finland in the Second World War as Linked Open Data

Tracking #: 2354-3567

Mikko Koho
Esko Ikkala
Petri Leskinen
Minna Tamper
Jouni Tuominen
Eero Hyvonen

Responsible editor: 
Christoph Schlieder

Submission type: 
Dataset Description
The Second World War (WW2) is arguably the most devastating catastrophe of human history, a topic of great interest to not only researchers but the general public. However, data about the Second World War is heterogeneous and distributed in various organizations and countries making it hard to utilize. In order to create aggregated global views of the war, a shared ontology and data infrastructure is needed to harmonize information in various data silos. This makes it possible to share data between publishers and application developers, to support data analysis in Digital Humanities research, and to develop data-driven intelligent applications. As a first step towards these goals, this article presents the WarSampo knowledge graph (KG), a shared semantic infrastructure, and a Linked Open Data (LOD) service for publishing data about WW2, with a focus on Finnish military history. The shared semantic infrastructure is based on the idea of representing war as a spatio-temporal sequence of events that soldiers, military units, and other actors participate in. The used metadata schema is an extension of CIDOC CRM, supplemented by various military historical domain ontologies. With an infrastructure containing shared ontologies, maintaining the interlinked data brings upon new challenges, as one change in an ontology can propagate across several datasets that use it. To support sustainability, a repeatable automatic data transformation and linking pipeline has been created for rebuilding the whole WarSampo KG from the individual source datasets. The WarSampo KG is hosted on a data service based on W3C Semantic Web standards and best practices, including content negotiation, SPARQL API, download, automatic documentation, and other services supporting the reuse of the data. The WarSampo KG, a part of the international LOD Cloud and totalling ca. 14 million triples, is in use in nine end-user application views of the WarSampo portal, which has had over 400 000 end users since its opening in 2015.

Minor Revision

Solicited Reviews:
Review #1
By Laura Pandolfo submitted on 23/Mar/2020
Minor Revision
Review Comment:

This manuscript was submitted as 'Data Description' and should be evaluated along the following dimensions: (1) Quality and stability of the dataset - evidence must be provided. (2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided. (3) Clarity and completeness of the descriptions.

This paper presents the WarSampo knowledge graph, a shared semantic infrastructure, and its Linked Open Data service, which aims to publish data about the Second World War, with a special focus on Finnish military history. As is widely acknowledged, cultural heritage is a complex domain due to the heterogeneity of its contents and, for the same reason, it is one of the domains where Semantic Web technologies and Linked Data can bring great benefits to Digital Humanities research.

This paper is built largely upon the authors' previous work in the domain. Even though there are no new relevant research results in this paper, I think this is a good use of existing work, since it provides a comprehensive and detailed overview of the WarSampo knowledge graph and its Linked Open Data service for the first time. The proposed work describes an extensive dataset of 14 million triples, on which the WarSampo portal is based. The general approach seems quite appropriate and the work fits well in the Linked Data Descriptions category of the journal.

The authors provide a clear description of the source datasets (in particular, I appreciated the organization and the high level of detail of the information presented in Table 1), the event-based data model used for harmonizing the data, and the five-step data transformation process used to populate the model.

***Quality and stability of the dataset***
The quality of the data is certainly high, since most of the sources considered have been provided by national archives, institutions and associations. Also, the stability of the datasets is not in question; URIs seem to be stable and reliable, and the version history of the datasets is presented in Table 3. The dataset is released under the Creative Commons BY 4.0 license.

***Usefulness (or potential usefulness) of the dataset***
The WarSampo knowledge graph represents interconnected data about events that occurred during the Second World War, as well as information about the lives of the actors (e.g., soldiers) who participated in them. Digital humanities research aims to grasp the potential of this data for humanistic inquiry, but the general public could also be interested in finding information about, e.g., battles and soldiers' lives. The authors reported that more than 500 000 end-users accessed the datasets through the WarSampo portal. I would be curious to know how many of those end-users are domain experts, e.g., historians, who access the data for research purposes. In other words, it would be interesting to know in detail how this data is used by interested parties.

***Clarity and completeness of the descriptions***
The whole paper is well written and quite easy to follow. The authors clearly described the source datasets, the vocabularies and ontologies used in the data model, and the data transformation process for populating the data model. The main classes in the data model are described well, and there are two good diagrams (Figure 1 and Figure 2) of how these classes interact. However, an example with real individuals and properties would have been helpful.

A few minor comments/suggestions for the improvement of the work are as follows:
- In the Abstract, it is reported that the WarSampo portal has had over 400 000 end-users since 2015, while in Section 1 this number changes to 550 000.
- I would suggest changing the structure of contents in this way: Section 1 should be the “Introduction” of the paper, while Section 2 should present the WarSampo initiative, including the few related works mentioned. The outline of the following sections should be moved at the end of the Introduction.
- Footnote n. 9: website page not found.
- In Section 4, the explanation of the difference between domain ontologies and meta-datasets should be added, as you did in one of your previous works - i.e. reference [20].
- The KG webpage on the LDF platform is well organized and contains all the important information, but I only found a couple of SPARQL query examples. I would suggest that the authors add a larger selection, since examples help users become familiar with the dataset schemas and give a clearer idea of what can be achieved.
- In a previous version of the datasets, the FOAF vocabulary was used in your schema for modelling, e.g., family names and first names. It is not clear to me why you decided to remove it from the current version and preferred to use your own ontology.
- In Figure 2, the WarSampo core classes are presented. Why is crm:E52_Time-Span not included as a core class? Time should be a core class in an event-based model, since events are mainly characterized by times.

Review #2
Anonymous submitted on 02/Apr/2020
Minor Revision
Review Comment:

The described dataset is very interesting and the paper is very clear. The authors illustrate the process followed with intuitive figures, and all steps of the process are sufficiently described. All key choices related to the production of this integrated dataset are well justified. The selected conceptual backbone is a good choice, and I liked the discussion about the event-based conceptual modeling approach (it is informative and fair). Overall, the presentation is very good.

The only step that was not entirely clear is the transformation from CSV to RDF: the authors could describe in more detail how they produce identifiers (URIs) and whether these identifiers remain stable over time (after every reconstruction).
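
To illustrate what stability across reconstructions could mean in practice, here is a minimal sketch of deterministic URI minting from CSV rows. It is not the authors' pipeline: the namespace, the column names (`id`, `family_name`, `given_name`, `birth_date`) and the sample rows are all hypothetical. The idea is that a URI derived only from a source identifier, or from a hash of normalized key fields when no identifier exists, does not depend on row order or on when the rebuild is run.

```python
import csv
import hashlib
import io
from urllib.parse import quote

# Hypothetical namespace, used only for this sketch.
BASE = "http://example.org/warsampo-sketch/"

# Invented sample data; the second row has no source identifier.
SAMPLE_CSV = (
    "id,family_name,given_name,birth_date\n"
    "p42,Virtanen,Matti,1915-03-02\n"
    ",Korhonen,Antti,1920-11-17\n"
)


def mint_uri(entity_type, source_id):
    """Derive a URI from a stable source identifier."""
    return f"{BASE}{entity_type}/{quote(source_id.strip(), safe='')}"


def mint_hashed_uri(entity_type, *key_fields):
    """Fallback: derive a URI from a hash of normalized key fields."""
    normalized = "|".join(field.strip().lower() for field in key_fields)
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:12]
    return f"{BASE}{entity_type}/h_{digest}"


def rows_to_uris(csv_text):
    """Map each CSV row to a URI that depends only on the row's content."""
    uris = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["id"].strip():
            uri = mint_uri("person", row["id"])
        else:
            uri = mint_hashed_uri(
                "person",
                row["family_name"],
                row["given_name"],
                row["birth_date"],
            )
        uris[uri] = row
    return uris


if __name__ == "__main__":
    for uri in rows_to_uris(SAMPLE_CSV):
        print(uri)
```

The trade-off in such a scheme is that hash-based URIs change whenever a key field is corrected in the source, so a description of the identifier policy would also need to say how corrections are handled.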

Review #3
By Günther Görz submitted on 14/Apr/2020
Review Comment:

This is an excellent contribution to the category "Data Description". All the criteria in the "Review comments" are met and I have no suggestions for improvements. The authors are well known in this field and already have an impressive list of pertinent publications. The data have been collected from a variety of heterogeneous sources about WWII in Finland, then aggregated and harmonized by means of a shared ontology based on CIDOC CRM and an appropriate data infrastructure. In particular, the paper presents a knowledge graph and a LOD service that allow data to be shared and support data analysis in DH research. They are available through a portal that has already had more than half a million end-users in the last five years, which is excellent evidence of its usefulness; it was awarded the LODLAM Challenge Open Data Prize in 2017. The data form an integrated, interlinked 5-star LOD publication and are part of the global LOD cloud. The paper clearly describes how the information in the source datasets was harmonized and then presents the underlying event-based data model. Furthermore, it outlines the data transformation process and provides an analysis of the data quality as well as the stability and usefulness of the data. With respect to LOD, the authors made the right decision to extend CRM with a subontology that introduces a class for each specific type. In my view, the descriptions in the paper are clear and complete, and overall it is an exemplary model of how things should be done.