Knowledge Graphs and Data Services for Studying Historical Epistolary Data in Network Science on the Semantic Web

Petri Leskinen
Javier Ureña-Carrion
Jouni Tuominen
Mikko Kivelä
Eero Hyvönen

Communication data between people are a rich source of insights into societies and organizations, in areas ranging from historical research to investigations of fraudulent behavior. Such data are typically heterogeneous datasets in which the communication networks between people, and the times and geographical locations at which the communication takes place, are important aspects. We argue that these features make temporal communications a promising application case for Linked Data (LD) based methods combined with temporal network analyses. The key result of this paper is a framework, with tools and systems, for creating, publishing, and analyzing historical LD from a network science perspective. The focus is on network analysis of epistolary network data (metadata about letters), based on recent advances in the analysis of temporal communication networks and the behavioral patterns commonly found in them. To test, evaluate, and demonstrate the usability of the framework, it has been applied to (1) the Dutch CKCC corpus (ca. 20000 letters), (2) the pan-European correspSearch corpus (ca. 135000 letters), (3) the Early Modern Letters Online data (ca. 160000 letters), and (4) the aggregated Finnish CoCo collection of more than 300000 letters from 1809–1917.
Anonymous submitted on 05/Jul/2023
My comments on a previous version of this paper appear to be applicable to this version of the paper as well, so I am not repeating them here.

By Andrea Mannocci submitted on 06/Sep/2023
The paper focuses on epistolary datasets and shows how SW technologies and LD can be used to represent data in such a domain and facilitate the perspectives of network science analysis.
The paper extends previous work and elaborates further on the Sampo framework for deploying Web UIs and enabling exploration and querying of such data.

The paper is clear and well-structured. The Sampo framework is interesting and well-exploited. The released resources can be beneficial for the DH research community.

- Perhaps the title is somewhat overloaded and could be simplified (especially the part "in Network Science on the Semantic Web")
- As stated, the paper extends [17] and [37], broadening the network science perspective and the relevant tools. However, several visualisations presented here were already available in [17] (e.g., network, timeline, top ranking). Network measures have clearly been added to the UI, but I am wondering if there is anything else I am missing. This makes me question whether the paper brings enough novelty w.r.t. the authors' previous work.
- Are the network measures configurable somehow when someone is instantiating Sampo? For example, include one specific centrality measure, rather than all of them.
- Also, I often got back the error "One of the backend services is not available at the moment. Please try again later." when playing with the demo.
- I do not understand how you use all four datasets stated in Table 1. For CKCC and correspSearch, this seems clear, as you state, "In this case study, the Linked Data of CKCC and correspSearch were analyzed" and "These measures are based on a network containing both the CKCC and the correspSearch datasets."
However, I do not comprehend at what point and for what purpose you leverage EMLO, while for CoCo, you just briefly state, "The framework will be used in the Constellations of Correspondence (CoCo) project on correspondences in the Grand Duchy of Finland in the 19th century [35]", which sounds more like future work than a current application.
Reading the abstract, one could expect the demo to seamlessly switch between the four datasets.

Minor remarks:
- "Early Modern Letters online" missing capital O
- "Thedatasets of historical" missing space
- "postal services, This" punctuation
- The URL in footnote 23 is not working
- For the sake of uniformity, please stick to the same punctuation style (for example, "i.e.," and "e.g.,"). Also, on some occasions, such as "As a comparison the correspSearch" and "In our work these datasets were transformed", you do not enforce the same style as in other occurrences, such as "For the static approach, a network".
- "public use. [17] Th" move the full stop
- "SPARQLWrapper and Networkx." missing capital X
- Is footnote 27 redundant?
- The colour coding in Fig. 8 is somewhat troublesome, as the same colour is repeated for multiple persons.

By Ruben Verborgh submitted on 25/Feb/2024
This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.