Knowledge Graphs and Data Services for Studying Historical Epistolary Data in Network Science on the Semantic Web

Tracking #: 3236-4450

Authors: 
Petri Leskinen
Javier Ureña-Carrion
Jouni Tuominen
Mikko Kivelä
Eero Hyvonen

Responsible editor: 
Guest Editors Tools Systems 2022

Submission type: 
Tool/System Report
Abstract: 
Communication data between people is a rich source for insights into societies and organizations in areas ranging from research on history to investigations on fraudulent behavior. These data are typically heterogeneous datasets where communication networks between people and the times and geographical locations they take place are important aspects. We argue that these features make the area of temporal communications a promising application case for Linked Data (LD)-based methods combined with temporal network analyses. A key result of this paper is to show how to create and publish a global Linked Open dataset and data service about historical epistolary data, based on distributed data from several international heterogeneous data sources, that can be enriched by data linking and reasoning, and that can be served back for the research community as an open infrastructure, a data service, and a semantic portal for further study in Digital Humanities (DH). A framework for this purpose is presented for publishing and analyzing communication network data, based on recent advances in analysis of temporal communication networks and the behavioral patterns commonly found in them. The framework was applied to two created and published open LD services (CC BY 4.0): (1) the data schema, dataset, and data service of the Dutch CKCC corpus (of ca. 20 000 letters) and (2) the pan-European correspSearch corpus (of ca. 135 000 letters) related to the Republic of Letters (1500–1800). To evaluate and demonstrate the usability of the new data services in DH research, a semantic portal was implemented on top of their SPARQL endpoint demonstrating re-usability, flexibility, and feasibility of the Linked Data approach from a DH research perspective.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Andreas Kuczera submitted on 29/Dec/2022
Suggestion:
Minor Revision
Review Comment:

The paper suggests using LOD and Semantic Web technologies to create and publish a Linked Open Dataset and data service for historical epistolary data. For the authors, the study of communication by letters is based on communication metadata, not on the content of the letters. Furthermore, they state that the paper "focuses on presenting a technical framework and approach for applying network analysis and LD technology to publishing and using historical epistolary data in research, not on particular domain specific analyses of the datasets from a humanities point of view." (p. 2, ll. 42ff.).
Here, from my point of view, the authors fall short of their possibilities, as research should include the domain's perspective. But from a technical perspective the paper is well written, and it addresses the important task of connecting research data from different sources via semantic technologies. This last point, however, is only half addressed, as the correspSearch data was exported and transformed into different formats. Maybe the correspSearch creators could present their data directly via a SPARQL endpoint, as suggested by the authors.
The impact of the paper differs between research communities. For the Semantic Web and network science communities the paper brings an interesting new approach, but for the (digital) humanities in general the missing connection to humanities research questions is a problem. On the other hand, SPARQL is a barrier even for trained DH people.

The paper is well written and clearly structured, and its illustrations are well made. To be more convincing, it would be helpful to have one specific research question that can then be answered with this new approach.
The authors introduce social signatures as a stable pattern but show no evidence that this measure, derived from contemporary social media data, can be used for early modern letter communication. I doubt that measures derived from more or less complete social media data can be adapted to the often incomplete source situation of the material they focus on.
Moreover, it would be interesting to take a look at the contents of the letters. This could be fruitful for completing material that has only partly been handed down (e.g. if only the letters from one correspondent survive, statements in one letter may be picked up in the next, and the missing side can be reconstructed from that). For DH research, the content of the letters should be taken into account when modelling information.

The data is well organized and has a README, but the README contains only links to online resources, which could change over time and then no longer be compatible with the stored data. A prominent link to the data on the first page would be helpful.
It is not that easy to set up a SPARQL service, and SPARQL is not easy to handle, so I cannot say anything about that. This could be a caveat for usage even in the digital humanities research community.

Review #2
Anonymous submitted on 22/Jan/2023
Suggestion:
Major Revision
Review Comment:

GENERAL REMARKS

This article deals with the use of network science for the exploitation
of knowledge graphs (KGs) with an application on KGs about epistolary
relations (in a digital humanities framework).

It is correctly written and is quite easy to read (however, see below
for remarks on details).

Now, the impact of this work is either rather limited or poorly defended.
It gives some views on the data, but I have not been fully convinced of
their usefulness for an end-user (e.g. a historian). I did
notice that it is claimed (line 27 of page 12) that the goal of this
paper was not to present the analyses on those KGs, but at least
one /convincing/ example should have been given that points out a
particular phenomenon in the data that the experts were not aware of
before (or some other way should be found to help the reader better
understand the impact of your work).

The URL https://zenodo.org/record/6631385#.Y8ql-K2ZOUl associated with
this publication gives information to access the data in an appropriate
way (it contains a README.md file). It is published on Zenodo, which
is an appropriate long-term repository.

PAGE BY PAGE REMARKS
p. 2
Just above table 1, lines 12-14, the end of the sentence
"Furthermore, ..." is hard to read:
"... the Semantic Web methodology [12] the practical LD
publishing principles including SPARQL endpoints."
(a grammar error?)

p. 3
Line 15, "early Early Modern": is it written this way intentionally?

You might want to mention the research on the use of Semantic Web
in the Henri Poincaré correspondence. For instance:
http://semantic-web-journal.net/content/applying-and-developing-semantic...
(Or more recent work of some of the authors.)

p. 4
Line 31
"by reasoning new triples" is odd. Maybe use
"by inferring new triples".

p. 5
Lines 10 and 11
"The sent letters by each actor are announced using the
property :created."
Ambiguous (to me): if (x :created y) is a triple, is x a person and
y a letter or x a letter and y a person? (Is it "created by" or
"has created"?)

Line 32
Adding parentheses around
"Findable, Accessible, Interoperable, and Re-usable data"
would make the sentence easier to read

p. 7
Lines 25 and 26
"It turned out that contemporary and historical epistolary
communication networks resemblance each other strikingly even
if the media were quite different."
This sentence seems to have a grammar problem (missing verb).

p. 8 (and 9, etc.)
The order in which figures are referenced is odd. For instance,
figures 2 and 3 are referenced in the text after figure 4,
which affects the readability of the article.

p. 9
Line 26: ". [16]" --> " [16]."

Line 29: "network distance". Could you explain what distance
function is used? Is it simply the length of the shortest
path from one node to another? Or does it take into account
the valuation on the edges? Or something else?

Line 43: "Due the to performance" --> "Due to the performance"

Line 48: "Figure. 3" --> "Figure 3"

p. 10
Figure 6: could you give an accurate definition of the meaning
of the values on the vertical axis?

p. 11
Line 50
You mention the "Granovetter effect". Could you recall briefly
(e.g. as a footnote) what it is? Sorry, I did not know it before
(I'm not a specialist in network analysis and I assume that
many readers are not either) and I had to look it up elsewhere
(I know that it is related to reference [6], but for a self-contained
paper, this is not enough).

Line 51 and lines 1 and 2 of page 12
"We found, however, difficulties in drawing conclusions from global
network analyses, particularly given that some individuals are
overrepresented in historical datasets."
Would there be a way to overcome these difficulties?
I suggest that this important issue be discussed in Section 5,
along with some possible lines of research.

p. 12
The video is nice: having the link before page 12 would have been
helpful to quickly understand the paper (I have not learnt from
the video more information than from the paper, but viewing it
would have quickened my understanding).

Review #3
Anonymous submitted on 06/Feb/2023
Suggestion:
Major Revision
Review Comment:

After reading the paper and some other publications by the authors, I am not convinced that this paper adds much substantial information to the already published papers by the authors, nor that it usefully addresses an identifiable target audience.

Specifically to my first point, there are four papers cited in the bibliography (16, 20, 34, 36 with 38 being a near-duplicate entry), that have been written by a group of authors strongly overlapping with the authors of the present paper, have been published in 2022 or are forthcoming, and appear to be very close in content to the present article, based on their content (where available), abstract and/or titles. What exactly the present article adds to these other papers is not made sufficiently clear, whether in the abstract or in the "related work" section.
- Papers 36 / 38 describe (among other things) the "Granovetter effect" and the "social signature" phenomenon in 4 datasets, two of which correspond to the datasets also used here. The passages in the present paper on these two phenomena are therefore not new.
- Paper 34, based on its title only, appears to describe the same "Linked Data Service and Portal for Studying Large and Small Networks of Epistolary Exchange" as the present article, except that it likely focuses on a different dataset for illustration. The added value of the present article is unclear.
- Paper 20 introduces and describes the Sampo UI, the framework also used in the present paper. The description of this framework, then, is not new.
- Paper 16, presumably published since the authors submitted the present paper (at https://dl.acm.org/doi/10.1145/3569372), describes the LetterSampo approach and framework to publish and analyse three datasets: CKCC, correspSearch and the EMLO dataset. Again, the modeling, publication and analysis of the first two datasets are also the object of the present article, and the added value of the present article is not clear.
- In addition, the data model for the letters has been developed in the EMLO project and is described in reference 18.

Given this situation, the present article, while providing appropriate references when it summarizes earlier research by the authors, does not appear to contain much content where this is not the case.

An additional or alternative source of usefulness of a publication, beyond pure novelty, can of course be derived from presenting materials already described elsewhere for a specific audience not previously addressed. However, I am not convinced that the present article achieves this. The audience of this particular paper is actually rather unclear, as it mixes rather cursory technical information (possibly aimed at semantic web tool developers) with similarly cursory, illustrative analyses (aimed, possibly, at DH scholars who would use the platform, but who are not really introduced to it). Also, because it pulls together so many different aspects of the (certainly complex) project described, it does so with a quite limited degree of detail for all of them, with limited usefulness for any group of intended / possible readers. Cases in point are the technical aspects of the implementation, or the analyses of the Granovetter effect and social signature phenomenon, or the datasets, or the analysis scenarios.

As a consequence, I think the paper needs to be thoroughly re-structured to better distinguish it from the earlier papers and/or programmatically re-oriented towards a much more specific audience. Such a re-orientation may or may not mean that the audience that is then targeted can best be reached via SWJ.