Analyzing Biography Collections Historiographically as Linked Data: Case National Biography of Finland

Minna Tamper
Petri Leskinen
Eero Hyvönen
Risto Valjus
Kirsi Keravuori

Christoph Schlieder

Full Paper
Biographical collections are available on the Web for close reading. However, the underlying texts can also be used for data analysis and distant reading, if the documents are available as data. Such data is usable for creating intelligent user interfaces to biographical data, including Digital Humanities tooling for visualizations, data analysis, and knowledge discovery in biographical and prosopographical research. In this paper, we re-use biographical collection data from a historiographical perspective for analyzing the underlying collection. For example: What kind of people have been included in the collection? Does the language used for describing female biographees differ from that for men? As a case study, the Finnish National Biography, available as part of the Linked Open Data service and semantic portal BiographySampo - Finnish Biographies on the Semantic Web is used. The analyses show interesting results related to, e.g., how specific prosopographical groups, such as women or professional groups are represented and portrayed. Various novel statistics and network analyses of the biographees are presented. Our analyses give new insights to the editors of the National Biography as well as to researchers in biography, prosopography, and historiography. The presented approach can be applied also to similar biography collections in other countries.
Review #1
Anonymous submitted on 11/Jul/2021
This paper is organised in four parts. In the first one, the authors introduce their project that allowed the extraction of structured semantic data from the 6500 biographical records of the online National Biography of Finland. Transformed into a knowledge graph, the data is now available for browsing, querying and analyzing with data-analytic and visualization tools on the BiographySampo semantic portal, and on a SPARQL-endpoint. This undertaking is then situated in its specificity in relation to contemporary projects devoted to the online publication of national biographies. In the second part, the process of semantic data extraction from the biographical records and their metadata is presented. The third part consists in a presentation with detailed examples of the different analysis functionalities made available in the BiographySampo portal. This part presents at the same time some essential characteristics of this population of 6500 persons using the extracted, semantified data. Finally, the authors discuss the usefulness for historical and prosopographical research of these tools and the whole process carried out, and at the same time point to the difficulties that result from a partial inaccuracy of the information due to automatic extraction.
The article is presented as a full paper but is at the same time a report on the BiographySampo portal (application report) and a description of the content of the data produced (dataset description) and of the process of semantification from the original texts and metadata. The data is available on a SPARQL endpoint (tested successfully), published with a CC BY 4.0 licence, the URIs are dereferenced and well documented, and given that the platform is a part of the Linked Data Finland service, it should be sustainable. The project presented in the article is truly cutting-edge and promising in terms of the perspectives opened up more generally, and realised here concretely, for research in biography and prosopography. The reference to other projects in the first part rightly highlights its originality, even though the APIS Austrian Prosopographical Information System, also based on a national biographical dictionary and carried by the Austrian Centre for Digital Humanities and Cultural Heritage, should have been mentioned and presented in detail, as it has some features similar to those of the present project and is also based on semantified data available online.
The 'mixed' nature of the paper (presentation of a methodologically innovative project, a portal with rich analysis tools and a specific dataset, and its production, at the same time) makes the article interesting but also entails some limitations and imbalances (some parts, such as the presentation of the analysis tools, are very developed while others are quite concise) that do not fully meet the reader's expectations. This raises the issue of the target audience: digital humanists interested in semantic engineering, or more traditional historians wishing to discover the potential of semantic methodologies and data analysis in web applications?
As for the former, they will regret not finding more information, notably in the paragraph “Transformation into Linked Data” (p.5), on the decisions about information choices, modelling and the vocabularies used in order to semantify the data, and this especially in the context of the SWJ. There is certainly a mention of the BIO CRM extension of CIDOC CRM, for which reference is made to an external publication, but a more developed explanation of the choices made, in particular with a view to integrating heterogeneous data, and the reasons why these event-centered vocabularies seemed more suitable than others for this purpose would have been greatly appreciated as it represents the foundation for the semantification of the data. This applies in particular to the 130,000 events extracted from the biographical records, which represent considerable added value and originality, but whose semantics and content are not specified, whereas the analyses in the third part often mobilise the 'vocations' (professions or occupations) that will be considered as more questionable classifications by historians.
As for the latter, experience shows that they are often sceptical about the usefulness of this type of data for historical research because this information is not extracted from sources but, as the authors point out, is derived from texts that reflect “the editorial values and biases in selecting the biographees” (p.3). For a prosopographical approach it is essential to have a complete corpus of people, whereas a biographical dictionary is by definition a selection of people considered as being ‘representative’. It is therefore quite appropriate, as the authors of the article do, to focus on the “historiographical perspective” in order to illustrate the full potential of the data semantification method implemented, and use the data produced to create a profile of the individuals’ generations, geographic origins, occupational and parental relationships, etc.
But this demonstration could have been reinforced, based on the available data, to deepen the historiographic approach, by focusing on more specific questions going beyond a purely exploratory approach. For instance, by comparing the Finnish people present in DBpedia (Wikipedia) or Wikidata with those available in BiographySampo: what kind of persons are missing? Are the temporal, geographical or ‘vocational’ profiles of the biographees different in these ‘semantified encyclopedias’ stemming from crowdsourcing and data aggregation, and not from a classical editorial approach? The question would have been worth asking at least in view of future research. Or one could have compared the vocabularies of the records (produced by the NLP processing) with the ‘vocations’ in order to highlight possible differences between these editorial classifications and the contents of the records: is the choice of core activities (“vocational groups”) sufficiently robust to allow historical analysis, even in the sense of distant reading? Vocations and “vocational groups” stem from the original metadata of the records and they could result from a classification by the editorial staff in order to facilitate queries on the Dictionary website. Their origin and relevance for the historiographical analysis should have been subjected to a critical discussion.
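The comparison suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the identifiers, vocation labels and both toy dictionaries are hypothetical and are not taken from BiographySampo, DBpedia or Wikidata, and the helper names `missing_from` and `vocation_shares` are invented for this example. It also assumes that persons in the two sources have already been aligned to shared identifiers, which in practice is a substantial task of its own.

```python
from collections import Counter


def missing_from(curated, crowd):
    """Identifiers found in the crowdsourced source but not in the curated one."""
    return sorted(set(crowd) - set(curated))


def vocation_shares(people):
    """Relative frequency of each vocation label in a {person_id: vocation} mapping."""
    counts = Counter(people.values())
    total = sum(counts.values())
    return {vocation: n / total for vocation, n in counts.items()}


# Hypothetical toy data: person identifier -> vocation label.
curated = {"p1": "clergy", "p2": "politician", "p3": "politician", "p4": "artist"}
crowd = {"p2": "politician", "p4": "artist", "p5": "athlete", "p6": "athlete"}

# Who appears only in the crowdsourced source?
missing = missing_from(curated, crowd)

# How do the vocational profiles of the two sources differ?
curated_shares = vocation_shares(curated)
crowd_shares = vocation_shares(crowd)

print(missing)
print(curated_shares)
print(crowd_shares)
```

A vocation that has a large share in one source and is absent from the other (here the hypothetical "athlete" label) would be a natural starting point for the critical discussion of editorial selection asked for above.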
Finally, it is essential in the historiographical analysis to highlight the relationships between the content and form of the texts and their authors. Some aspects of this problem have been addressed (Author Analysis, p.25) but this subject would have deserved a more developed treatment given the general aim of the paper stated in the introduction, by trying, for example, to reconstitute the profiles of the entries’ authors by using external public data in relation to the distribution of the biographees. A slightly more elaborate critical and historiographical discussion of the results, in connection with some developments on the semantic modeling choices, and a slight reduction of the quite long and more ‘classical’ presentation of the visualization tools, would strengthen the demonstration of the usefulness of the adopted methodology in the eyes of historians who are sensitive not only to the quality of the data but also to its semantic meaning and context of production.

Review #2
By Werner Scheltjens submitted on 01/Aug/2021
This paper provides a detailed, well-structured and original overview of different perspectives for the exploration and analysis of biography collections. The results are significant and deserve to be published, as they align well with other recent surveys of biography collections (especially the ODNB, see Warren 2018). The quality of writing is high; the paper is well-written and the language is clear. The introduction and the description of the transformation process from the NBF to a Linked Data service in sections one and two are concise and introduce the reader to the topic of the paper. In the main section of the paper (3. Analyzing and Visualizing the NBF), the reader is guided through seven perspectives for searching and exploring the data from the National Biography of Finland. The space devoted to the description of the different analytical perspectives is divided somewhat unevenly and the number of tables and figures is rather overwhelming. For the sake of the reader, I would suggest reducing the number of figures and tables in sections 3.1 and 3.4. The discussion (section 4) touches upon the issue of data literacy, which is very important.

The authors pursue two goals: (1) “to argue and show that using biographies as Linked Data opens up unprecedented new possibilities for the study by distant reading” (p. 2) and (2) to “present novel insights into the nature and contents of NBF” (p. 2). In the title of the paper and on p. 3, the authors point out that biographies can also be studied from a “historiographical perspective as an artifact reflecting its own time, the editorial values and biases in selecting the biographees, the authors’ perspectives, and also from a linguistic point of view.”
The paper convincingly argues that semantic technologies have the potential of opening up biography collections in novel ways. The authors show at length how exploratory statistical and network analysis can be conducted based on data from the National Biography of Finland. The exploratory analysis is very informative; in particular, I like the use of the PageRank measure very much. At the same time, the precise nature and value-added of the historiographical perspective remains rather implicit throughout most of the text (although it is there). It is only in sections 3.6 and 3.7 that more profound text-analytical insights into the biography collections as the result of man-made selection processes are given. From a historian’s point of view, it would have been great to learn more about the differences between the vocabularies for male and female entries in the NBF (p. 25), about the different styles of authors delivering biographies to the NBF, or about the impact of (tacit?) editorial decisions. Finally, for a paper that engages with the historiographical perspective on biography collections, section 3.7. (Author analysis) is rather short. It would have been great to gain a better insight into the different writing and editorial strategies during the production of the biographies. The authors of the paper do hint at these differences on p. 25, when referring to later additions to the NBF (“Multifaceted Finland”), but the topic of ‘history writing’ is not pursued much further. The impression remains that the authors’ engagement with the ‘historiographical perspective’ could obtain a more prominent position throughout the text, e.g. by pointing out how the analysis of NBF data contributes to understanding the process of biography writing. Perhaps, a more direct comparison with the findings discussed in Warren (2018), which has clearly inspired the present paper, might be useful.

Besides these suggestions, I have some (very) minor remarks about the text, figures and tables, which I list below:
p. 2, left, r. 44: …1997 … when Finland celebrated her 90 years … => 80 years?
p. 2, right, r. 19: 13100 biographies by 980 scholars. Fig. 2 on p. 7 visualizes a subset of 6500 biographies written by 1000 authors. => is the latter figure about the number of authors correct?
p. 3, left, footnote 12: is redundant (see footnote 9)
p. 3, right, r. 32-40: It seems to me that these lines fit better at the end of section 1.2.
p. 4, right, r. 15: 6478 entries => I arrive at 6476 entries
p. 5, left, r. 37: CVS => CSV
p. 5, right, r. 16: on p.4 it is said that only data for men and women from the core NBF are used. These are 6197 entries. Here, a total of 6510 biographies is given. It would be good to explain where the different numbers come from.
p. 6, right, r. 12-17: It would be helpful to point out where this discussion could be found later in the text.
p. 7, left, r. 24: [? ] => a reference is missing
p. 8, left, r. 30: … from the start of the 20th century => from 1950?
p. 10, left, r. 37: I could not find the term vapaaherratar on fig. 8.
p. 11, right, r. 41-43 = p. 12, left, r. 47-48.
p. 13, left, r. 39-42: To me, it is a bit puzzling that the selection of biographies does not reflect the importance of agriculture until the 1960s, because ‘farmer’ or ‘farmer’s wife’ are listed consistently among the top vocations of parents in table 2. It would be helpful to explain this briefly in the text.
p. 28, right: I think [21] and [22] refer to the same title.

Review #3
Anonymous submitted on 07/Aug/2021
Review Comment:

The authors analyze data from biographies of historical persons published by the Finnish Literature Society. The national biography data covers several tens of thousands of biographies of deceased as well as living Finnish people.

The paper is well-written and easy to follow. The related work overview is done quite comprehensibly. The paper is then divided into two main parts: the first is a description of the data preprocessing and of the infrastructure established for data access and analysis, and the second is the analysis of the dataset.

Regarding the first part, the authors have done many preprocessing steps to standardize, enrich and link data so that it can be further analyzed by others. This is an important contribution in terms of disseminating knowledge of cultural heritage and facilitating analysis as well as knowledge discovery from accumulated data.

The authors should provide a more detailed description of certain processing steps, such as life event extraction and understanding.

As for the second part, the paper provides many interesting analytical results and the analysis is done from many diverse angles. Fig. 8, which presents a highly interesting analysis, is unfortunately not readable for readers who don't speak Finnish. I would suggest adding a translation in the figure or in its caption.

It would be good if the authors provided a description of how their analytical findings could be used in practice. This could be coupled with a discussion of the ways in which the data can be used, searched and visualized by external parties.

Finally, I would suggest using italics or bold font to emphasize key findings in the text, thereby making the manuscript more readable. This could also be done in the form of a brief list summarizing key observations in the discussion or conclusions section.

Overall, I think the submission presents very useful and interesting work, and I can’t find any major flaws; hence I would suggest its acceptance, perhaps with the small modifications suggested above.