Temporal Knowledge Representation for Historical Corpora: Application to the Henri Poincaré Correspondence Corpus

Tracking #: 2774-3988

Authors: 
Nicolas Lasolle
Olivier Bruneau
Jean Lieber
Laurent Rollet

Responsible editor: 
Special Issue Cultural Heritage 2021

Submission type: 
Full Paper
Abstract: 
The study of Henri Poincaré's life and works (1854-1912) has led to the constitution of a corpus which includes various document types (articles, books, reports, letters, etc.). There is a keen interest in his correspondence, which gathers scientific, administrative and private exchanges. Semantic Web technologies have been chosen to represent and to exploit corpus data: the RDF model, the RDFS knowledge representation language and the SPARQL query language. The existing representation does not include temporal knowledge, yet it is an important aspect when studying historical corpora. Different methods have been proposed by the Semantic Web community to include temporal knowledge in RDF databases. This article presents the temporal knowledge representation issues encountered in the exploitation of the Henri Poincaré correspondence corpus. Through examples, it compares existing methods and ontologies and details the implementation of the approach chosen for this corpus. A form-based interface has been developed to assist users in corpus data querying. The system relies on the use of a transformation rule mechanism to address two issues faced when dealing with temporal graphs: simplifying SPARQL query generation and reasoning with temporal data.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Alessandro Adamou submitted on 21/Jul/2021
Suggestion:
Major Revision
Review Comment:

The paper discusses the possible ways of representing time-indexed historical information as Linked Data, and elaborates on the choices made for the execution of this task for a particular corpus: the one of Henri Poincaré's correspondence.

There are therefore at least two perspectives to read this paper from: the merits of semantically representing temporal data and the issues arising; and the presentation of a novel resource as a historical corpus, with the particular challenges it presents. While the former aspect is amply covered in the paper, the latter would benefit from further elaboration.

The first section introduces the setting and gives an overview of the resources at hand. This is already a place where I believe the authors should make clearer what makes the Poincaré case study so special and challenging from a historical KR point of view: the examples included refer to statements that are true within an interval or at an instant. This is already significant, but it is found in many other corpora and I have reason to believe this use case presents much more challenging occurrences that should be illustrated. A hint to that is given later in the section about rules, which has an example of an undated letter known to be a reply. Perhaps there are more challenging instances, like vague or not fully qualified temporal indicators (say, "late Summer") that are worth tackling?

Then, there is a section about the core standards of RDF[S], OWL and SPARQL: in a Semantic Web Journal paper, this might well be omitted, but since it is not too long and features a few dataset-specific examples, it might make sense to keep it. However, its positioning interrupts the narrative flow of the paper: section 3 goes back to the problem of KR for historical corpora, and does so quite agnostically from Semantic Web technologies, barring one mention of ontologies. For that reason, I would at least consider switching Sections 2 and 3.

A couple of notes re Section 2:
1. the Turtle syntax is used not to store RDF databases, but to serialize them
2. that OWL corresponds to SROIQ(D): that depends. It is the case for OWL 2, but which DL is adopted depends on the chosen fragment of OWL. Besides, description logics are not discussed much in the paper and this aspect may perhaps be generalized or omitted.

Sections 4 and 5 go on illustrating the existing recommendations for Semantic Web compliant KR of temporal facts, followed by a discussion on pros and cons and justifying the choice to settle for n-ary relations for the use case at hand. While the discussion of the merits of each is sensible, and the reasons for the final choice may be agreeable in several ways, the decision to rule out anything that extends the RDF semantics is quite drastic as it excludes quite recent approaches. I am not arguing in favor of any in particular, but those approaches may deserve further elaboration. As a hint: one problem with RDF* could be that one will most likely want to use at least one embedding level for the metadata of statements (provenance etc.) and using it for temporal facts might require at least another level of embedding.

As for ontologies, it is recommended that the authors do not just focus on dedicated models like OWL-Time, but also look at how temporal statements are modelled in top-level ontologies like DOLCE, the TimeIndexedSituation design pattern, and the like.

One small item to address there is the convention in Figure 5(b): using the syntax of variables for blank nodes (?node) may be confusing. I would suggest using a Turtle-like convention (_:node).

Up to this point, I would have been keen to recommend a Minor revision. However, section 6 shifts the narrative by introducing something quite different altogether, namely a rule language for SPARQL query expansion and a human interface for query generation. Taken individually, each covers quite a complex aspect that deserves longer elaboration and comparison with existing work. Now, the rule language has been detailed in another paper, but the underlying principles of the GUI need further discussion as to e.g. why text searches on names are preferred over entity selection: in the examples, a Semantic Web expert would rather select the URI for Poincaré from a search index than enter name and surname into two fields. More importantly, though, this part needs to show awareness of approaches in user interfaces for SPARQL generation, like e.g. SPARKLIS, Nitelight, or the VOWL query interface. That is also the main reason why I believe the paper is in need of a related work section, rather than scattering the authors' knowledge of the state of the art across the paper.

The quality of writing is itself quite high and I only have a few recommendations to make in that regard:
- Sometimes "Poincaré" is spelled "Poincare" without accent (e.g. P4L41), please double-check
- The text size and line height change in some sections up to their end, e.g. section 2 starting from P3L11 up to the end, and 5.2 from "The formulation..." on.
- P12L41: "when it comes to _querying_"
- P13L37: "substituting left(r) _with_ right (r)"
- P15L21: perhaps use "restricting" rather than "restraining"
- Figure 7(a): is the opening tag missing?

Finally, the paper is rather unclear as to where the resources associated with the paper are. There is a mention of http://henripoincare.fr/ along with a couple of example entries that seem to indicate that there is a data API and other resources. By going to http://henripoincare.fr/ I was able to access a wealth of information, including a SPARQL endpoint, but I cannot be sure which parts of it pertain to the content of this paper. The documentation is in French, but this is partly mitigated by the language-independent nature of the Semantic Web and standard ontologies. Lacking a single, stable and persistent URL on a data repository that gathers all the resources, perhaps a short dedicated section that illustrates what and where they are would be a good idea.

Review #2
By Andrea Giovanni Nuzzolese submitted on 03/Aug/2021
Suggestion:
Reject
Review Comment:

The paper investigates the problem of representing temporal knowledge in knowledge graphs. More specifically, the article focuses on the temporal knowledge representation issues the authors faced when modelling an ontology for exploiting the Henri Poincaré correspondence corpus. Such an investigation is carried out by presenting and comparing existing approaches and ontologies aimed at introducing the temporal dimension in RDF/OWL. The approaches taken into account are: Temporal RDF, Named graphs, n-ary, 4D fluents, RDF+, aRDF, Singleton property, RDF*, and RDFt. Among those the authors opt for the n-ary approach in order to express time-indexed facts gathered from the correspondence and works of Henri Poincaré. Then the authors present the SPARQL Query Transformation Rule Language (SQTRL). SQTRL is a tool that allows one to define transformation rules that can be applied to generate new queries from an initial application-independent query. The authors claim that SQTRL eases the creation of SPARQL queries that need to deal with the complexity of the triple patterns required by the use of n-ary relations for adding the temporal dimension to RDF triples. Finally, the authors present the RDF Transformation Rule Language (RTRL). RTRL allows the creation of entailment rules by relying on the same assumptions valid for SQTRL.

==== Overall comments ====
The paper is well written and structured in all its parts.
Accordingly, the readability is good and the language adopted is adequate from a technical perspective.

=== Strengths ===
The topic is relevant to the Semantic Web journal and to the special issue on cultural heritage.
The literature analysis of how temporal information can be injected into RDF/OWL knowledge graphs is interesting both from the knowledge representation perspective and the reasoning one.
The analysis provides preliminary seeds that might be helpful for semantic web practitioners in adopting the right solution for dealing with the representation of relations of higher arity than binary ones.
The examples presented with a graphical representation along with SPARQL queries are very helpful in understanding the different possibilities discussed by the authors.
The use case is helpful in understanding how to apply the methodology.

=== Weaknesses ===
Nevertheless, the paper shows significant weaknesses that prevent its publication in its current form.
Those weaknesses are:

+++ Limited SOA +++
The authors dedicate an entire section (Section 2) to introduce the preliminaries on Semantic Web technologies. Those preliminaries are totally unnecessary if we take into account the venue of publication and its audience.
On the contrary, the authors do not make room for a proper literature analysis focused on the state of the art for representing temporal facts on the one hand and n-ary relations on the other.
Notable examples are provided by the Ontology Design Patterns repository [1], such as the TimeIndexedSituation ODP [2].
Other relevant works are [3] and [4]. Both provide modelling solutions for time-based factual knowledge. The former focuses on building a knowledge graph in the cultural heritage domain, while the latter builds a knowledge graph in the scholarly domain.
Finally, works such as [5], [6], and [7] should be taken into account for providing a comprehensive overview on the problem.

+++ Analytical methodology +++
The general goal or research questions are never made explicit. On the contrary, they should be made explicit.
Additionally, the authors claim they started from an existing version of the ontology used for the Henri Poincaré scenario. However, the ontology is neither publicly provided nor introduced. Accordingly, it is hard to clearly contextualise the nature of the problem based on what the authors assume to be the starting point of their research.
Then, the authors identify the possible alternative approaches for representing time in RDF/OWL by providing useful examples and comparisons. Nevertheless, the analysis is limited to the number of triples required and to a vague concept of complexity. This should be explained with more clarity and rigour. What do the authors mean by complexity? What are the parameters that a knowledge engineer should take into account for selecting the most appropriate approach according to her design problem and context? How can this analysis be reused by knowledge engineers, linked data experts and practitioners?

+++ SQTRL and RTRL +++
The authors then focus on describing two tools for simplifying the design of SPARQL queries and the execution of reasoning entailments. However, the introduction of those tools is not preceded by a clear analysis of what the requirements of the final system are. It is fully comprehensible and straightforward that n-ary relations might imply more complex SPARQL triple patterns than binary relations, but the role of the two systems here is not well justified. What are the requirements? Who are the target users?

+++ Results +++
Is the generated knowledge graph available? If it is not available for licensing reasons, then this should be clarified.

+++ Evaluation +++
There is no evaluation of either the n-ary solution adopted or the two tools presented. This makes the paper not acceptable for publication.

[1] Gangemi, A. and Presutti, V., 2009. Ontology design patterns. In Handbook on ontologies (pp. 221-243). Springer, Berlin, Heidelberg.
[2] http://ontologydesignpatterns.org/wiki/Submissions:TimeIndexedSituation
[3] Carriero, V.A., Gangemi, A., Mancinelli, M.L., Nuzzolese, A.G., Presutti, V. and Veninata, C., 2021. Pattern-based design applied to cultural heritage knowledge graphs. Semantic Web, (Preprint), pp.1-45.
[4] Nuzzolese, A.G., Gentile, A.L., Presutti, V. and Gangemi, A., 2016, October. Conference linked data: the scholarlydata project. In International Semantic Web Conference (pp. 150-158). Springer, Cham.
[5] Scheuermann, A., Motta, E., Mulholland, P., Gangemi, A. and Presutti, V., 2013, June. An empirical perspective on representing time. In Proceedings of the seventh international conference on Knowledge capture (pp. 89-96).
[6] Rouces, J., De Melo, G. and Hose, K., 2015, May. Framebase: Representing n-ary relations using semantic frames. In European Semantic Web Conference (pp. 505-521). Springer, Cham.
[7] Presutti, V., Lodi, G., Nuzzolese, A., Gangemi, A., Peroni, S. and Asprino, L., 2016, November. The role of ontology design patterns in linked data projects. In International Conference on Conceptual Modeling (pp. 113-121). Springer, Cham.

Review #3
Anonymous submitted on 06/Aug/2021
Suggestion:
Reject
Review Comment:

This paper describes the application of several techniques, used to represent time-indexed situations within the Semantic Web, to a corpus available in RDF of different types of Henri Poincaré’s documents. The paper also discusses the application of existing techniques (e.g., SQTRL) to build a form-based web interface that facilitates the query task for users who are not necessarily experts in the standard SPARQL language, the language used in the Semantic Web to query RDF knowledge graphs, such as the one provided for the Henri Poincaré corpus.

Unfortunately, I believe the paper cannot be accepted for publication in the Semantic Web Journal.
The main reasons for this rejection can be summarised as follows.

In general, I am struggling to see the overall scientific contribution of the paper in its current form. The paper simply describes an activity done for producing a Semantic Web-based resource; that is, an ontology with additional extensions that are capable of dealing with temporal situations. What surprises me is that the representation of time for this type of data was not considered from the beginning of the development of the ontology. The Competency Questions (Q1 and Q2) presented in the paper are quite trivial and they can be addressed with proper modelling patterns that have been available in the state of the art for years.

An analogous paper by the authors is already published in the Semantic Web Journal - http://semantic-web-journal.net/system/files/swj2328.pdf (in some cases the same technologies are described again in this paper under review, e.g., SQTRL), even if this paper focuses more on the time properties.

I was not able to find a direct reference to the ontology under revision, either on GitHub or on any other platform (e.g., figshare, Zenodo). I was not able to find a direct reference to the linked data corpus; it seems it is not possible to download it in bulk. The data does not seem to be open data: the applicable license is CC-BY-ND according to the website http://henripoincare.fr/s/correspondance/page/accueil, therefore no derivative works are possible.

Also, the conclusion of applying n-ary relations (reification) surprised me: this is a well-known practice in the Semantic Web world, applied whenever a relation involves more than two dimensions, as can happen with time. The authors themselves say that this is the best practice recommended by W3C.
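
For readers less familiar with the pattern, a minimal Python sketch of an n-ary, time-indexed statement follows. It uses plain tuples rather than an RDF library, and the relation node is a minted IRI instead of a blank node. The namespace `http://example.org/`, the property names, the resource names, the years and the helper `nary_statement` are all hypothetical and are not taken from the paper under review.

```python
# Hypothetical namespace and vocabulary, for illustration only.
EX = "http://example.org/"


def nary_statement(subject, role, obj, start, end, node_id):
    """Return the triples of a time-indexed n-ary relation.

    The relation node is a minted IRI (EX + node_id), not a blank node,
    so it can be referenced from outside the graph.
    """
    node = EX + node_id
    return [
        (subject, EX + role, node),   # subject points at the relation node
        (node, EX + "value", obj),    # the original object of the relation
        (node, EX + "start", start),  # beginning of the validity interval
        (node, EX + "end", end),      # end of the validity interval
    ]


def is_blank(term):
    # Blank nodes are conventionally written with the "_:" prefix.
    return term.startswith("_:")


# Placeholder resources and years, not facts from the corpus.
triples = nary_statement(
    EX + "person1", "hasAffiliation", EX + "institution1",
    "1900", "1905", "affiliation/1",
)

for s, p, o in triples:
    print(s, p, o)
```

One binary statement becomes four triples here, which is the kind of overhead the comparison in the paper counts.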

The paper does not seem to consider important ontology design patterns that already propose standard ways of modelling time (e.g., TimeIndexedSituation, TimeIndexedParticipation, TimeIndexedPersonRole - http://ontologydesignpatterns.org/wiki/Submissions:ContentOPs). These patterns can be helpful for the situations the authors mention in the paper.

The rest of the paper is more or less just a matter-of-fact description of a web form application built by using existing techniques.

I think that the most interesting part for a special issue on cultural heritage for the semantic web journal is section 3.1 and the notion of uncertain knowledge but it is mentioned as future work.

In addition, there are also sentences that denote an incomplete knowledge of the potentials offered by certain standards of the semantic web. For instance:

"FILTER clause that is used to add constraints for the literal values associated with the properties" —> the filter clauses can be used in more complex scenarios

"A syntax inspired by RDF reification is introduced to associate the label with a triple within a classical RDF graph." —> what does “inspired” mean?

"Thanks to the properties subject predicate and object" —> what does this mean? Which properties?

"The n-ary approach requires to work with blank nodes" —> Not necessarily. you can use n-ary also without using blank nodes. It is not a good practice to use blank nodes when producing RDF data.

Other comments:
"In this document, an abstract syntax close to the Turtle syntax is used to represent RDF triples." —> is it the one presented in Fig. 5 (b)? Why not using directly turtle? I mean, it is true it is a special issues on cultural heritage, but it is the semantic web journal!

subclassof —> it should be subClassOf

subpropertyof —> it should be subPropertyOf

“This ontology has several limitations and is currently being restructured”. The reasons seem to be ambiguities and redundancies — > probably some examples to justify your work can be beneficial.

Huto reference seems to me a pre-print version only. Is there a peer-reviewed publication?

“Here are the two requests to be made, expressed here in an informal way” —> at the state of the art there exists a work that defines these questions as “competency questions”; they are derived from use case scenarios and then used for ontologies design purposes. Please see http://ceur-ws.org/Vol-516/pap21.pdf

I imagine that in the SPARQL queries the name Henri Poincaré is given as an example, but in a real query I guess there is an object of type person whose name is Henri Poincaré, right? I would probably state in the paper that this was done for the sake of readability of the query.

The names of some sections are not appropriate, from my point of view. I wouldn’t name a section “Taking into account the n-ary representation”. It can be something like “N-ary representation”.

In section 6.3, the inference example that t1 is before t2 is weak for me. For instance, you say that letter22 hasWritingDate t2 and t2 hasTime 1883, and since letter22 replies to letter11, the date of letter11 is earlier than that of letter22. This is quite straightforward, but since you have included only the year as the date, letter11 could have been written in 1883 as well, some months before letter22 for example. So the example still works, since the competency question asked for the letters exchanged before 1885; however, the example presented seems a little bit weak to me. The example would probably be better presented using a full date or a month (not only a year). In addition, the representation of t1 and t2 is not very clear. It seems to be a class with a property hasTime, but I would suggest specifying it better.
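
The granularity point can be made concrete with a small Python sketch. It is only an illustration of the reasoning in this comment, not the authors' rule mechanism: the helper names are hypothetical, and a year-only date is assumed to stand for the whole day-level interval of that year.

```python
from datetime import date


def year_bounds(year):
    """Day-level interval covered by a year-only date."""
    return date(year, 1, 1), date(year, 12, 31)


def latest_possible_date(reply_year):
    """Upper bound for the date of a letter answered in reply_year.

    A letter cannot be written after its reply, so its date is at most
    the last day of the year in which the reply was written.
    """
    _, last_day = year_bounds(reply_year)
    return last_day


def certainly_before(upper_bound, cutoff_year):
    """True if every date up to upper_bound falls before cutoff_year."""
    return upper_bound < date(cutoff_year, 1, 1)


bound = latest_possible_date(1883)        # the reply is dated 1883
print(certainly_before(bound, 1885))      # query: written before 1885?
print(certainly_before(bound, 1883))      # query: written before 1883?
```

Under these assumptions the first check succeeds and the second does not, which is why the 1885 competency question is answered while a strict "earlier year" conclusion is not entailed.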

Other minor comments:
“Used to generate SPARQL queries for the interrogation of the Henri Poincaré ..” —> I would simply say “used to generate SPARQL queries on the Henri Poincaré correspondence corpus”

I am not a native English speaker but the form “allows” is “to allow somebody to verb” (allows one to..)

Review #4
By Go Sugimoto submitted on 10/Aug/2021
Suggestion:
Major Revision
Review Comment:

Overall this is an interesting article which compares different RDF modelling approaches to facilitate temporal information in the Henri Poincaré Correspondence Corpus. Such an analysis contributes to the Semantic Web communities in the cultural heritage and digital humanities sector. As a background, the paper starts from the discussions on some important issues on temporal knowledge representation for historical studies. The project aims to develop an easy-to-use semantic application for the historians who may not have substantial technical skills and knowledge on graph databases, which is vital for the wide acceptance of the technology in the research arena.

The previous studies in the article are unfortunately limited to the presentation of temporal modelling in RDF in general, although it is interesting. For the special issue of SWJ, it would be required to outline the existing applications of the models in the historical studies. Thus, the literature section can be improved.

The methodologies are generally sound to analyse the temporal data representation models by SPARQL queries. A minor issue is that the second SPARQL query is too similar to the first one, thus various query patterns are not well investigated. Reconsideration is appreciated. The implemented application will certainly help the end users of the digital corpus, which is a valuable development for digital humanities alike.

Diagrams and figures are well prepared. The graphical illustrations of the models are very helpful. Table 1 is a good contribution to SWJ.

In general the article is presented in professional academic English in a good order.

The most critical shortcoming of the paper is the logic or coherence between research questions/goals and conclusions. Some parts of the argumentation and conclusions are not adequately clarified and interlinked. For those reasons, a major revision is recommended. More details about this point are provided below.

Major issues
-------------
There are many aspects of temporal data modelling, so it is recommended to specify/clarify (in the beginning) what aspects of modelling are discussed in this paper. For instance, temporal modelling may include discussion of temporal hierarchy, relations between relative time, and individual interpretations of time. This paper focuses on the relations between absolute time instants or intervals mostly for the biography of persons and the letter objects, but it is not clearly stated as such, due to the generalisation of the subject (including the paper title).

Section 3.1 provides valuable information about critical issues on time in historical studies, but they are dealt with as generic examples. It is not certain if those points are actually evaluated in the later sections in the sense of Semantic Web (SPARQL queries), which is the central topic of this paper. In this regard, this part of the argument is not the most convincing support for the conclusion. Clearer connections between Section 3.1 and Sections 5 and 6 (and 7) would be required.

Section 3.3 should better describe how temporal information is currently stored (outside RDF in a separate database, perhaps?). It is not enough to say the current data model/ontology is not able to represent temporal information. In case such data does not include temporal data at all, the problem is not the existing ontology, but the lack of data (value) in the first place. Then, the new data modelling would not solve the problem. Similarly, Section 3.3, states “In a history of science context, it is necessary to consider knowledge that is sometimes incomplete because of the lack of resources related to a certain context.”. This is absolutely true, but this is a general uncertainty issue, not necessarily the time issue about which the paper would like to discuss.
Moreover, “These issues require the addition of a temporal element to these relationships between people and places.”. This part also mixes up the uncertainty and temporal elements. Although those two issues are related in some cases, they are separate in other cases. Therefore, more clarification would be needed about what types of temporal data issues the paper concentrates on.

In Section 3.3 “For this knowledge representation work, the granularity is defined at the level of the day. This allows consideration of data associated with letters in the correspondence for which the day of writing is sometimes known.” It is good to define the day level granularity, but the issue in the earlier part (reasoning) has little to do with it. The issue can be solved by providing the process of reasoning (by historians): who made the reasoning and when etc. Therefore, I am not sure if the day level granularity provides an actual solution to the issue raised.

Section 4.1.3. The temporal model of CIDOC-CRM should be explained and examined in more depth, as it is a standard for cultural heritage ontology. This is a big shortcoming when discussing cultural heritage data modelling.

Section 4.2. This part would need to be extended to include more references to other important initiatives, including Wikidata, DBpedia, LODE, HuTime, CIDOC-CRM, EDM etc.

Section 5.6. The arguments are too weak to justify the decision. To improve them, they can be more tightly related to the issues raised in Section 3. One discussion missing is that the decision is based on the Henri Poincaré corpus only. The Semantic Web is advantageous when data is integrated with external datasets. A common problem of such data integration is that it becomes very complex to make federated queries across multiple endpoints, due to the variety of ontologies. For this reason it is doubtful that n-ary is the best choice. It may be intuitive for the Henri Poincaré project members, but it may not be the most popular way of encoding historical corpora (mainly because "original triple is lost"). In this regard, the CIDOC-CRM (4D fluent) approach seems to be more suitable for interoperability and data integration. Even if the authors still argue that n-ary is the best solution, the comparison of different approaches could be done more carefully, and convincing justifications could be presented.

Section 6.3. “This entailment mechanism is useful for specifying relationships between temporal elements.” The mechanism is fascinating and this method can be reused/extended for any inferences, thanks to the rule based approach. However, it is uncertain if historians and data modellers can define all such inferences beforehand. The danger is, in case a few rules are missing, the users would believe that they have obtained query results with all possible inferences included. In addition, this method seems to lack ways to preserve the rules in RDF, next to the source RDF data. XML representation is acceptable, but it would be desirable if it is somehow more integrated into the source RDF data, so that the rules can be also queried/examined by the end users. Moreover, the versioning and provenance (metadata of rules) would be a useful addition. It would be recommended to present not only “good results”, but also “pending issues”, and how to address the latter in the future.

This paper did not investigate how the rules could be modeled in the ontology itself, instead of adding extra rules in XML afterwards. As mentioned above, the rules can be written only when historians know beforehand how to infer information from the source data. Thus, in principle it is also possible to encode inferences in the ontology. This point could be examined further.

Conclusions in Section 7 are generally well written, but it seems that the outcome of Section 4.2 is not really reflected. Only the “t1 before t2” example is presented. In order to represent diverse rules and time relations for historical studies, it would be important that other properties such as "A overlaps B", "A metBy B" are also evaluated in the context of historical analysis. The modelling is even more complex when dealing with uncertainties. For those reasons, at least CIDOC-CRM should be analysed in more detail, but it is largely missing in the article.
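
As an indication of what evaluating other relations would involve, here is a minimal Python sketch of three Allen-style interval relations at day granularity. The function names echo the OWL-Time properties `time:intervalBefore`, `time:intervalOverlaps` and `time:intervalMetBy`, but the tuple encoding, the helper names and the sample dates are assumptions made for illustration and do not come from the paper.

```python
from datetime import date

# An interval is a (start, end) pair of dates with start < end (assumed).


def before(a, b):
    """Allen 'before': a ends strictly before b starts."""
    return a[1] < b[0]


def overlaps(a, b):
    """Allen 'overlaps': a starts first and ends strictly inside b."""
    return a[0] < b[0] < a[1] < b[1]


def met_by(a, b):
    """Allen 'metBy': a starts exactly where b ends."""
    return a[0] == b[1]


# Hypothetical intervals, not taken from the corpus.
stay_a = (date(1900, 1, 1), date(1900, 6, 30))
stay_b = (date(1900, 3, 1), date(1900, 9, 30))
stay_c = (date(1900, 9, 30), date(1900, 12, 31))

print(overlaps(stay_a, stay_b))
print(met_by(stay_c, stay_b))
print(before(stay_a, stay_c))
```

Each relation needs both endpoints of both intervals, so intervals with unknown or uncertain bounds would require additional modelling choices.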

Minor issues
-------------
Section 2 is too basic for SWJ. As this article is not for a cultural heritage journal, it can be significantly shorter, or completely omitted.

Footnote 14. It is interesting, but rather too detailed, not directly related to the Semantic technologies. It would not be needed for SWJ.

Section 4.1.3 and 4.1.4. Explanation is too vague and not enough for readers. For example, 4.1.3 has no mention of how to model temporal data. In general, it is easier to understand the subsections of Section 4, when reading the modelling examples in the subsections of Section 5. But, it is hard to understand without them. Thus, it would be nice to restructure and harmonise the two sections.

It would be interesting to investigate (in the future) how to preserve and distinguish the transcription of the source letter and the interpretation of it. (i.e. in this case, the inference of t1 before t2, which is not written on the source letter). This remark is not directly related to the ontology modelling that this paper is about, but, since the ontology modelling needs to take text encoding into consideration, it is highly relevant.