Digital Humanities on the Semantic Web: Sampo Model and Portal Series

Tracking #: 2832-4046

Eero Hyvönen

Responsible editor: Christoph Schlieder

Submission type: Application Report

Cultural heritage (CH) contents are typically strongly interlinked, but published in heterogeneous, distributed local data silos, making it difficult to utilize the data on a global level. Furthermore, the content is usually available only for humans to read, and not as data for Digital Humanities (DH) analyses and application development. This application report paper addresses these problems by presenting a collaborative publication model for CH Linked Data and six design principles for creating shared data services and semantic portals for DH research and applications. This Sampo model has evolved gradually in 2002–2021 through lessons learned when developing the Sampo series of semantic portals in use, including MuseumFinland (2004), CultureSampo (2009), BookSampo (2011), WarSampo (2015), BiographySampo (2018), NameSampo (2019), WarWictimSampo (2019), MMM (2020), AcademySampo (2021), and FindSampo (2021). These Semantic Web applications surveyed in this paper cover a wide range of application domains in CH and have attracted up to millions of users on the Semantic Web, suggesting feasibility of the proposed Sampo model. This work shows a shift of focus in research on CH semantic portals from data aggregation and exploration systems (1. generation systems) to systems supporting DH research (2. generation systems) with data analytic tools, and finally to automatic knowledge discovery and Artificial Intelligence (3. generation systems).

Minor Revision

Solicited Reviews:
Review #1
By Kai Eckert submitted on 23/Dec/2021
Minor Revision
Review Comment:

This is a joint review by Kai Eckert and Benjamin Schnabel. Benjamin Schnabel is a PhD student in the field of Digital Humanities (Jewish Studies).

In this paper, the author describes the “Sampo Model”, an informal collection of principles for LOD publishing. It is an extended version of a DHN 2020 conference (poster) paper. The extension is indeed substantial and therefore a valid contribution to the SWJ.

The author lists six principles for LOD publishing: 1. Support collaborative data creation and publishing, 2. Use a shared open ontology infrastructure, 3. Support data analysis and knowledge discovery in addition to data exploration, 4. Provide multiple perspectives to the same data, 5. Standardize portal usage by a simple filter-analyze two-step cycle, 6. Make clear distinction between the LOD service and the user interface (UI).

As emphasized in the discussion, none of these principles are new, but it is still valuable to describe them together in context. In addition to a short explanation, each principle is illustrated with examples from the various LOD portals that use the Sampo model.

In a sense, the paper is a retrospective of the ongoing work for the past 19 years where many LOD portals (mainly in Finland) have been developed, also with a focus on interoperability.

The principles are meant to be seen as an extension to well-known ideas and requirements such as the four LOD principles and 5-star linked data. The paper is not so much about why we need LOD as about how it should be done. Nevertheless, there are more recent developments where LOD provides immediate advantages, such as for further analysis of the data. There is a need to have analytic tools and automatic knowledge discovery.

This paper does not go into further technical details. While it is meant as an overview paper, some more details on the technical setup and the commonalities and differences between the different Sampo instances would be very interesting.

While it is acknowledged that the principles themselves are not new, it should also be stated that all of the principles are certainly also implemented in many LOD projects apart from the Sampo portals.

Regarding the first principle, the author remains very vague about how the data should actually be created. It is assumed that a knowledge graph is already available to be published. While such a scope limitation of course makes sense, the principle does not help with publication beyond the well-known LOD principles. Similarly, the second principle does not really provide new insights beyond data and ontology reuse and interlinking. We would argue that both principles are simply the foundation for the following principles, which actually focus on the publishing aspect beyond plain data publication and a SPARQL endpoint.

The third principle acknowledges a change in user requirements towards data portals. While simple search and access to the data used to be sufficient, today the data should be easy to analyze and to combine with further data sources. Unfortunately, very little information is provided here as to how this can be achieved.

The fourth and fifth principles emphasize the importance of curating data for the user instead of merely putting it online, and of providing simple, standardized tools so that users can explore both prepared questions and their own questions independently.

Finally, the sixth principle is about frontend-backend separation. Here, the author only mentions SPARQL as a backend API. This is not sufficient in all cases, and not even in many. While SPARQL is rightfully seen as the lingua franca of the Semantic Web, many application developers prefer dedicated APIs and additional services such as a search index (Elastic, Solr, etc.) to speed up application development and the resulting applications.

Overall, the paper is well-written and easy to understand. The scope is a mixture between a research paper about the identified principles and a retrospective on close to 20 years of LOD portal development. As an application report, the paper is certainly acceptable and provides interesting and actionable insights for similar projects.

On the other hand, with this mixture, the paper falls somewhat short both as a research paper and as a retrospective. It would actually be interesting to get more insights into how these principles evolved over time and what other lessons have been learned. The presentation of the principles should address, earlier and more clearly, their relation to the existing principles (4 principles, 5 stars, etc.) and how they deal with the shortcomings of those principles, i.e., the questions that arise once the LOD is in place. We are certain the paper would benefit from rather minor adjustments and clarifications regarding the above-mentioned issues.

Minor issues:
“shared publishing infra”: probably “infrastructure” is meant.
Footnote to dumb down principle should point to
p. 4: ahead to is based on, superfluous “to”.
p.4: mdel, instead of model.
perspectiveS in principle P4.
p.1 WarWictim, should be WarVictim.
many more typos, please ensure proper proof-reading.

Review #2
By Christoph Schlieder submitted on 17/Jan/2022
Minor Revision
Review Comment:

The application report systematizes the findings from almost two decades of research on the Sampo portals, which focus on the domains of cultural heritage (CH) and the digital humanities (DH). As the main conceptual contribution, the author proposes (1) a set of design principles for semantic portals and (2) a description of three technology generations. The article also provides an excellent entry point to the literature on the Sampo portals.

The article extends a short paper published in the DH Nordic conference proceedings. There is little overlap with this prior publication, however. Both texts present an overview of selected Sampo projects, but the article covers more projects and gives additional technical details. More importantly, the discussion of the design principles does not appear in the short paper. The submitted article is also sufficiently independent from other publications on the Sampo model to justify its publication.

As an application report, the submission is not typical since it describes several related semantic portals instead of a single system. This raises the question of the common technological ground of the portals, which the article addresses convincingly by formulating the Sampo design principles. The author mentions that the Sampo model “has evolved gradually over time” but does not discuss how the principles evolved over time. Nevertheless, it seems possible to extract a timeline by combining the information given in section 3 on individual projects with the dates from table 2.

P1 (= data linking) in 2004, MuseumFinland
P2 (= shared ontologies) and P4 (= multiple perspectives) in 2008, CultureSampo
P6 (= separation of data services and UI) in 2015, WarSampo
P5 (= faceted search) and P3 (= knowledge discovery) in 2019, BiographySampo

I am not sure whether this timeline is correct. In any case, it would be interesting for the reader to learn in which order the principles have been adopted and how the principles map onto the three technology generations of semantic portals. P1 and P2 are general principles implemented by virtually all LOD projects – not just the Sampo portals. That makes them candidates for first generation systems (data aggregation and exploration). In contrast, P3 is the principle introduced most recently and least explored in the Sampo portal series. P3 also marks "the next step ahead" in research, that is, the third generation of semantic portals (knowledge discovery and AI). If such a correspondence between the principles and the technology generations exists, the article should describe it. Optionally, the principles could be renumbered to reflect their chronological position. The current P3 would become the new P6, for instance.

While this is my main observation, there are minor points that would profit from further clarification.

(1) The author mentions related design principles such as FAIR, but I missed a discussion of the nature of this relation. This is especially relevant with respect to the four LD principles and the 5 star principles. Are P1 and P2 equivalent to the four LD principles? Where exactly do the Sampo principles go beyond the LD or 5 star principles? The four LD principles refer to specific technologies whereas the Sampo principles are stated in technology agnostic form. Do the Sampo principles intend to cover non-RDF knowledge graph technologies as well?

(2) The Sampo model has been designed for CH and DH applications. Yet it is difficult to see how the principles reflect that origin. I asked myself whether different principles would have emerged from a series of projects focusing on, say, genetic epidemiology. To put it differently: what in the Sampo model is specific to the way humanities scholars collect, systematize and analyze historical sources? The article should mention whether the design principles of the Sampo model address research challenges in the humanities.

(3) With 20 years of operation, the Sampo series of semantic portals provides a unique opportunity to study issues of LD preservation. Have there been lessons learned beyond the fact that data needs curation? What are the challenges (and costs) of maintenance of the portals?

The application report is well structured, points to the relevant references and gives an adequate discussion of related work. The author clearly succeeds in demonstrating impact, which is a central requirement for systems described in a SWJ application report. There are few LD systems in the domains of CH and DH that have attracted a larger user base over such a long period than the Sampo portals.

In summary, I believe that all points raised can be addressed by minor revisions.

Open Science Data

The SemanticComputing GitHub provides the long-term stable link to the resources. Some of the projects mentioned in table 2 do not appear in this GitHub, however. In addition, the access link to some of the semantic portals is missing, or the link given in the footnote is defective. This should be resolved before publication.

Not in the SemanticComputing GitHub
MuseumFinland, CultureSampo, BookSampo, Norssit Alumni, Mapping Manuscript Migrations

no link given
Norssit Alumni, U.S. Legislator Proso[po]grapher

404 error
WarVictimSampo (footnote 28), AcademySampo (footnote 31)


Please check carefully for spelling errors, as the following list is incomplete.

page 1, line 23, abstract : War[V]ictimSampo
page 2, line 41: infra[structure]
page 4, line 35 : m[o]del
page 7, line 14, table 2 : Proso[po]grapher