Knowledge Graphs and Data Services for Studying Historical Epistolary Data in Network Science on the Semantic Web

Tracking #: 3679-4893

Authors: 
Petri Leskinen
Javier Ureña-Carrion
Jouni Tuominen
Mikko Kivelä
Eero Hyvönen

Responsible editor: 
Guest Editors Tools Systems 2022

Submission type: 
Tool/System Report
Abstract: 
Communication data between people is a rich source for insights into societies and organizations in areas ranging from research on history to investigations on fraudulent behavior. These data are typically heterogeneous datasets where communication networks between people and the times and geographical locations they take place are important aspects. We argue that these features make the area of temporal communications a promising application case for Linked Data (LD) -based methods combined with temporal network analyses. The key result of this paper is to present a framework, tools and systems, for creating, publishing, and analyzing historical LD from a network science perspective. The focus is on network analysis of epistolary network data (metadata about letters), based on recent advances in analysis of temporal communication networks and the behavioral patterns commonly found in them. To test, evaluate, and demonstrate the usability of the framework, it has been applied to (1) the Dutch CKCC corpus (of ca. 20 000 letters), (2) the pan-European correspSearch corpus (of ca. 135 000 letters), (3) to the Early Modern Letters Online data (of ca. 160 000 letters), and to (4) the aggregated Finnish CoCo collection of more than 300 000 letters from 1809-1917.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Andrea Mannocci submitted on 03/Jun/2024
Suggestion:
Minor Revision
Review Comment:

This is a new iteration of a previously reviewed paper.
I think this version has been improved, but it still needs some edits and an overall, thorough check by the authors.

- "Section 4. and": remove the full stop
- "Data sets conforming" + another instance; please use "datasets", which seems to be the form you prefer in the rest of the paper
- "AcademySampo and ParliamentSampo" add references or footnote, perhaps?
- "red line for time period 1643–1650" it's purple in fig. 4, right?
- There is still some confusion about the datasets you tested in this paper. The sentence "(3) to the Early Modern Letters Online data (of ca. 160 000 letters), and to (4) the aggregated Finnish CoCo collection of more than 300 000 letters from 1809–1917." in the abstract and the "four datasets discussed," in the discussion allude to 4 datasets, but you are testing your method against two, as Tab. 1 now suggests. Please revise accordingly. CoCo is mentioned as future work, while EMLO is something your data link to, as far as I understand.
- Ensure that you put a reference/footnote at the first instance of something (e.g., NetworkX), rather than later on. Make a thorough check.

In general, the paper appears well-written and easily understandable. Resources are made available for the DH research community, which is to be praised.

Review #2
Anonymous submitted on 25/Sep/2024
Suggestion:
Reject
Review Comment:

This paper presents the use of the Sampo framework for network analysis. The novelty of the presented work seems to lie in the transformation of the historical data into linked data, which is then visualized, to some extent, in the already existing Sampo framework, whereas the network analysis itself is done in separate Google Colab and Jupyter notebooks. I am unsure how this paper fits the call for a "Tool and Systems Report".

There are multiple large problems with this paper that need to be addressed, in addition to my concern about its fit for the call.

Even though it is stated multiple times, the goal of this paper remains unclear, as different signals are given throughout. The analysis of the datasets, to my knowledge, seems to be well done. However, the authors state multiple times that the conclusions from this analysis are not the main goal of the paper, which, given that the authors are not experts in the field, makes a lot of sense. However, there does not seem to be any focus on the stated goal. The tool itself is not analysed in a way that would answer the question of whether using LD and the Sampo framework is sufficient for, or even provides more support in, this sort of analysis than conventional tools. The authors do not showcase how the tool/system supports this sort of analysis (better) as opposed to just using NetworkX without transforming the data into LD. I suggest that the authors also take a modern network that has previously been analysed without transformation to LD and conduct an analysis of that network. That would provide a direct comparison and would enable the authors to draw conclusions on the actual system/tool/approach, rather than having to state multiple times that the data is historical and hence could be incomplete and biased. (This is an important limitation, but it has nothing to do with the tool and everything to do with the analysed data, and hence does not contribute to the goal of the paper.)

The abstract, introduction and conclusions (called discussion in this work) are not aligned with each other. For example, the abstract and conclusions mention 4 datasets, but throughout the paper and in the introduction only 2 datasets are introduced, discussed and analysed. A third dataset is mentioned briefly as being part of future work, but in the abstract it is presented as if 4 datasets are transformed into LD, made available and also analysed in the paper itself.

Given the major issue with the contribution of this paper, I refrain from going into too much detail on form and grammar; however, there are inconsistencies in, e.g., the usage of abbreviations or how figures are referred to (Fig. X vs. Figure X) throughout the paper.

Lastly, I have some concerns about how the LD was transformed into a NetworkX network, as this is not described in detail in the work itself. Also, it is very unclear to me why the two datasets were combined, or taken together, when calculating the measures for the specific actors which are reported in Table 3. This is very unconventional, especially given that certain measures, like betweenness or eigenvector centrality, take the entire network into account, and by combining the two datasets the numbers are calculated over the combined network in its entirety rather than over the relevant network.
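
The point about centralities depending on the whole network can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two toy "corpora" and the actor names `a` to `e` are hypothetical and are not taken from the paper, and the betweenness function is a small pure-Python implementation of Brandes' algorithm for unweighted, undirected graphs, used here as a stand-in for a library routine such as the one in NetworkX. It only shows that the same actor can receive a different (unnormalized) betweenness score depending on whether the score is computed on one corpus or on the union of two.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (sender, recipient) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        sigma = dict.fromkeys(adj, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):  # accumulate dependencies back towards s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # In an undirected graph every pair is counted twice.
    return {v: score / 2 for v, score in bc.items()}


# Hypothetical toy corpora: each pair is one correspondence link.
corpus_a = [("a", "b"), ("b", "c")]
corpus_b = [("c", "d"), ("d", "e")]

separate_a = betweenness(build_graph(corpus_a))
combined = betweenness(build_graph(corpus_a + corpus_b))

print("actor c, corpus A only:", separate_a["c"])
print("actor c, A and B combined:", combined["c"])
```

In this toy case actor `c` is a peripheral endpoint in corpus A but becomes a bridge once the two corpora are merged, so a table of per-actor scores needs to state which network each score was computed on.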

Review #3
Anonymous submitted on 22/Oct/2024
Suggestion:
Minor Revision
Review Comment:

The paper presents a series of tools to analyse networks of epistolary exchange from different sources. Although the paper itself is generally well-written, I would ask the authors to revise the discussion section, as it seems unorganised, and the English in some of the sentences could be improved (e.g., Line 32-23 page 13: "This as well as conceptual difficulties in modeling complex real world ontologies, such as historical geogazetteers, become sometimes embarrassingly visible when using and exposing the knowledge structures to end-users.").

Important comments:

I am not sure how the "usability" is tested: were end-users involved in any of the analysis? Was there a user study? How do you evaluate the usability without doing a user study or providing potential users for the platform?

Second, and I suppose this is just a matter of the paper having changed since the previous versions, in the abstract and in some parts of the paper it is mentioned that you are analysing and reusing 4 datasets. Yet, in Table 1, you only list two. More clarity on what was actually reused and how would improve the paper.

Overall, after answering my comments above, this contribution should be sufficient for publication.