Applying and Developing Semantic Web Technologies for Exploiting a Corpus in History of Science: the Case Study of the Henri Poincaré Correspondence

Tracking #: 2328-3541

Olivier Bruneau
Nicolas Lasolle
Jean Lieber
Emmanuel Nauer
Siyana Pavlova
Laurent Rollet

Responsible editor: 
Special Issue Cultural Heritage 2019

Submission type: 
Full Paper
The Henri Poincaré correspondence is a corpus of letters sent and received by this mathematician. The edition of this correspondence is a long-term project that began in the 1990s. Since 1999, a website has been devoted to publishing this correspondence online with digitized letters. In 2017, it was decided to rebuild this website using Omeka S. This content management system offers useful services, but some user needs have led to the development of an RDFS infrastructure associated with it. Approximate and explained searches are managed through SPARQL query transformations. A prototype for efficient RDF annotation of this corpus (and similar corpora) has been designed and implemented. This article deals with these three research issues and how they are addressed.

Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 20/Nov/2019
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The paper reports important engineering work that provides tools for tagging digitized scientific letters with descriptive, scientific and mathematical content. RDF is used for describing the documents and providing a semantic querying facility. The solution, which concerns scientific heritage, is original in its use of Semantic Web techniques for dealing with document content.

(1) The originality of the paper lies in applying semantic techniques (data annotation and querying) to scientific correspondence using existing Semantic Web languages and tools. Thanks to this strategy, the solution enables approximate and explained searches on an RDF-annotated corpus, which is somewhat novel in corpus data exploration.

(2) The work introduced in this paper resulted in a “complete” piece of software that enables both the annotation of corpora, particularly those with historical scientific content, and their exploration through the different kinds of queries that seem to come up when document corpora are studied.

(3) The paper is in general well written. Figures, particularly screenshots of interfaces, could be moved to an appendix to preserve the continuity of the text. Similarly, formal details about RDF might not be that useful if the authors consider that the intended audience of the journal has a basic knowledge of RDF and SPARQL. Besides the technical and formal description of the solution, the authors could say a word about how the tool is integrated into the scientific method of the history of science.

Review #2
By Guillem Rull submitted on 01/Dec/2019
Minor Revision
Review Comment:

The paper focuses on a system for querying and editing RDF metadata associated with the digital corpus of the Henri Poincaré correspondence. In terms of searching, the system has three interfaces: SPARQL, form-based and graphical. The graphical interface seems to be a sort of query builder in which the user constructs a graph that represents the query. The paper, however, focuses more on the SPARQL interface and proposes a method that goes beyond simple querying and allows for approximate searches. The authors also present an online editor for the corpus’ metadata that takes advantage of Semantic Web reasoning capabilities.

The need for approximate searches in the discussed use case is very well established. Examples are provided that clearly show why straightforward SPARQL query processing does not suffice in this context. In particular, the reason is that searches must be able to deal with vague concepts such as ‘the end of the 19th century’ and should also consider how historians use the corpus in their research. In the latter sense, the system should provide results that are not strictly answers to the posed question but are related to it and might provide valuable insights to the historian’s research.

To perform these approximate searches, the authors propose to apply a technique referred to as elastic search, in which the original query is expanded in multiple ways by applying a set of rules. Each rule has a cost, which allows ranking the results according to it. In the end, this results in a tree of queries, with the original query at the root, and where each child node is the result of applying one of the rules to the parent. Each of these rules performs a query transformation, replacing some of the triples in the query pattern. They leverage information from the ontology, namely class and property hierarchies, and also domains and ranges. For example, one rule generalizes the given query by replacing a property with a superproperty.
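The expansion described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy hierarchies, the rule costs, the representation of a query as a set of triples, and all function names are assumptions made for illustration. It also merges queries that can be reached by several rule sequences, so it explores a graph rather than a strict tree.

```python
import heapq
from itertools import count

# Hypothetical toy ontology (not taken from the paper).
SUPER_PROPERTY = {"sentBy": "hasCorrespondent"}
SUPER_CLASS = {"Mathematician": "Scientist"}

def generalize_property(query):
    """Yield (cost, new_query), replacing a property with its superproperty."""
    for triple in query:
        s, p, o = triple
        if p in SUPER_PROPERTY:
            yield 1.0, (query - {triple}) | {(s, SUPER_PROPERTY[p], o)}

def generalize_class(query):
    """Yield (cost, new_query), replacing a class with its superclass."""
    for triple in query:
        s, p, o = triple
        if p == "rdf:type" and o in SUPER_CLASS:
            yield 2.0, (query - {triple}) | {(s, p, SUPER_CLASS[o])}

RULES = [generalize_property, generalize_class]  # assumed rule set and costs

def expand(query, max_cost):
    """Return [(total_cost, query)] in non-decreasing cost order, root first."""
    tie = count()  # tie-breaker so the heap never compares two queries
    heap = [(0.0, next(tie), frozenset(query))]
    seen, ranked = set(), []
    while heap:
        cost, _, q = heapq.heappop(heap)
        if q in seen:
            continue  # already reached by a cheaper or equal-cost path
        seen.add(q)
        ranked.append((cost, q))
        for rule in RULES:
            for step, child in rule(q):
                if cost + step <= max_cost and child not in seen:
                    heapq.heappush(heap, (cost + step, next(tie), child))
    return ranked

query = {("?l", "sentBy", "?p"), ("?p", "rdf:type", "Mathematician")}
ranked = expand(query, max_cost=3.0)
```

Each entry of `ranked` pairs a transformed query with the accumulated cost of the rules applied to reach it, so results obtained from cheaper transformations can be listed first.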

Regarding the editor for the corpus metadata and its use of reasoning, the paper presents two approaches: RDFS inference and case-based reasoning. They both aim at solving the same problem, which the authors do a great job of formalizing. RDFS inference allows the editor to propose a ranked list of potential values to the users when they are editing a triple. The system takes into account (1) the field that is being edited, (2) which of the other fields in the triple have already been filled, and (3) the triples that are already present in the system. Values are ranked higher the more they match the ontology. Essentially, this relies on the domains and ranges of the properties in the ontology to suggest the most likely candidates. Case-based reasoning compares the letter currently being edited with the letters that have already been annotated, and ranks higher those values that were used on letters that are more similar to the current one. The combination of these two methods is left as future work.
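Both ranking ideas can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's algorithm: the toy `RANGE` and `TYPES` tables, the feature-overlap similarity, the summed-similarity score, and the function names are all hypothetical.

```python
# Hypothetical toy data (not taken from the corpus).
RANGE = {"sentBy": "Person", "writtenIn": "Place"}
TYPES = {
    "Henri_Poincare": {"Person"},
    "Gosta_Mittag-Leffler": {"Person"},
    "Nancy": {"Place"},
    "Paris": {"Place"},
}

def suggest_objects(prop, candidates):
    """Rank object candidates: values typed by the range of prop come first,
    and alphabetical order breaks ties."""
    expected = RANGE.get(prop)

    def key(value):
        matches = expected is not None and expected in TYPES.get(value, set())
        return (0 if matches else 1, value)

    return sorted(candidates, key=key)

def suggest_by_cases(current_features, annotated_letters, prop):
    """Rank values used for prop in past letters by summed similarity, where
    similarity is the number of shared features (an assumed, crude measure)."""
    scores = {}
    for letter in annotated_letters:
        shared = len(current_features & letter["features"])
        for value in letter["values"].get(prop, []):
            scores[value] = scores.get(value, 0) + shared
    return sorted(scores, key=lambda v: (-scores[v], v))

by_range = suggest_objects("writtenIn", TYPES)
```

With the toy data, `by_range` lists the two places before the two persons, whereas a purely alphabetical editor would list the persons first.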

The authors report on a first evaluation of this editor that was done automatically, comparing the already annotated letters in the system with a dummy version of the editor that uses no inference and relies just on alphabetical order, and the proposed version with inference. Unsurprisingly, the inference-based editor is found to be better. The authors however acknowledge that this is just a preliminary experiment and that further evaluation, involving human users, is required to fully assess its usefulness.

Overall, the paper is very well written and makes very clear the significance of the proposed system in the discussed use case. What it lacks, however, is a related work section, which makes it hard to evaluate its novelty. Most of the techniques applied in the paper seem to be taken from other works, and since no comparison with other systems is provided, it is unclear whether this is the first time these techniques have been used in the context of cultural heritage. This makes the paper feel more like an application report than a full research paper. Nevertheless, I think that adding the aforementioned related work section would remedy this, and in that case I would recommend the acceptance of the paper.

Review #3
By Ranka Stankovic submitted on 21/Dec/2019
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.
This paper presents original work related to the semantic digital library of the Henri Poincaré correspondence, implemented in the Omeka S content management system and extended with an RDFS infrastructure to enable a SPARQL query system.
The presented work is significant for the Semantic Web community, especially for those in the digital humanities field.
A prototype for different types of queries is available online, giving a unique opportunity to search such a valuable collection. Advanced query approximation and elastic query research will probably shed new light on this subject.

The available user interfaces are a classical interface (based on Solr full-text search), SPARQL querying, a form-based interface (more classical) and an interface using a graphical view.
The paper is well written, but a few suggestions related to terminology are given.

Minor changes and explanations are needed.

Sphinx is not referenced; the authors should add a link and state that it is an open source full-text search server.

It is clear that mathematical search on formulae (like in or similar) is not implemented, but are there any ideas for supporting a math ontology and some specific querying of mathematical content?

In the right column, p. 3, rows 39-43, it is not explicitly stated for the variables in the formulae that p, q, r are properties (predicates), x and y nodes, C and D are...
In the r3 formula, is it "subc" or "subp"? This is probably a typo.

"SPARQL querying" seems to be a more frequent term than "SPARQL interrogation". The authors should reconsider the term used.

Please clarify: "assumed to be modulo RDFS entailment,"

p. 5: "A specific RDFS base had to be installed." It is clear that Turtle syntax is used for RDF, but what type of database or endpoint solution is used for the RDF store? More technical details on the implementation of the SPARQL endpoint and the RDFS base are needed.

More detailed statistics on the triples (by class, by property, etc.) would give an insight into the digital collection.

Has any type of normalisation been used for the Solr index, e.g. are tokens lemmatized, stemmed, ...?

One more terminological issue: corpus vs. digital library. The result from a "corpus system" (in NLP) is generally a concordance from the corpus text, that is, chunks retrieved from several documents, while for a "digital library" a query result is a list of documents. So the unit of response is a part of a text in the first case and a whole document in the second. Also, a corpus is usually annotated with grammatical information, which is not included in this system. In my opinion, the term "semantic digital library" is more suitable than "corpus" for this collection and system.