Semantic Technologies for Historical Research: A Survey

Albert Meroño-Peñuela, Ashkan Ashkpour, Marieke van Erp, Kees Mandemakers, Leen Breure, Andrea Scharnhorst, Stefan Schlobach, Frank van Harmelen
The diverse sources of information for historical research fill a continuum between individual accounts, transmitted for instance in letters but also in poems and songs, and aggregated statistical information, as in the case of historical censuses. Historiography shares this heterogeneity and complexity of source material with other humanities fields. Methods to order this rich material, and by this ordering also to determine the way history is told, are as old as history writing and vary among the different branches (or subdisciplines) of historical research. In this paper we focus on the work of historians, and even more specifically on economic and social history. At the crossroads of information and historical sciences, so-called Historical Informatics or History and Computing emerged as a specific profession during the 1990s. Together with computer scientists, historians created a research agenda concentrating on the question of how to create, design, enrich, edit, retrieve, analyze and present historical information with the help of information technology. There exist a number of problems and challenges in this field; some of them are closely related to semantics and the meaning of knowledge in general. In this context, Semantic Web technologies can be applied in a number of situations, environments and applications of historical computing and historical information science. However, only a few contributions have yet considered these technologies. In this survey we present an overview of the past and present problems, challenges and advances of historical science computing, from the perspective of Semantic technology.
Survey Article
Review 1 by Jack Owens:

Review of "Semantic Technologies for Historical Research: A Survey" by Albert Meroño-Peñuela et al.
This is a wonderful article, even inspiring. I congratulate the authors and urge its publication.

The authors do not appear to have grasped the importance of what they have done. For example, on page 11, they write: "Semantic technologies do not require much introduction in an article in a Semantic Web journal", and there are aspects of the article that reflect this attitude. However, historians must find this article, and when they do, they need to find a text that is accessible to them, even though much of the information will be unfamiliar. With some slight changes, this article will become a major introduction for historians who have been snared by the suggestive title.

The most serious problem introduced by this neglect of historians as readers is that acronyms are frequently introduced without definition, at least the first time they are encountered. In the same way, certain terms require a brief explanation (e.g. "triples").

On page 3, the authors need to do a better job of defining the historical sources with which they are concerned. For example, later in the text, they discuss projects dealing with letters, photos, film, and oral accounts, but here, it appears that such sources are not relevant. Similarly, in discussing administrative sources, they do not mention the category of judicial proceedings, which are important to many historians, including those who study, for example, Chinese criminal proceedings from the Song Dynasty, Inquisition tribunals, or criminal trials of labor union leaders in the 20th century.

For researchers interested in world history or the history of any large geographic region, the linking of data is fundamental. However, the presentation of the task as forming a "graph" will be mysterious to almost all of them. The concept of a "graph-building pipeline" (quoting page 21) should be clarified.

Technical matters:
All of the figures are too difficult to read. Fig. 1 is the best, but it may be useful to number the circulating entities to increase clarity. It is always hard to recognize quickly where a circle begins. The text in Fig. 2 is too small. The rest cannot be read without unreasonable effort.

The type font for URLs is ugly and harder to read than the text, both in footnotes and bibliography.

Finally, the English needs to be improved. I am sympathetic because I have the same difficulties when writing in a language that is not my native one. But I repeat: this article is really important. I do not want anything to block access of historians who need its guidance. If, after the other corrections are made, someone will send me the article in WORD or other similar format, I will be happy to recommend the corrections that I have noted while reading the review text.

Review 2 by Monica Wachowicz:

This paper describes the authors' perspective on the importance of semantic web technologies in the field of historiography. It is not presented as a survey paper or a road map that gives readers the opportunity to become closely familiar with the current work being done in the particular fields of Semantic Web technologies and eHistory (also referred to as Historical Informatics/History and Computing/Historical Semantic Web). Rather, it is presented as a summary of a literature review, the outcomes of interviews with 8 scientists in the Netherlands, as well as a description of current research projects, mostly in the Netherlands. Moreover, the paper has numerous generalisations that are presented as facts but without any apparent supporting evidence (see the specific comments below for examples). The technical details on the semantic web are ill-structured and insufficiently detailed to understand how the authors and/or previous research work propose to share historical information across agents and services based on more intuitive search engines.

1. The goal of the paper is not clearly stated in the introduction but rather later in the paper. Some examples include:
- Section 2, page 3. "We will discuss in this paper how semantic technologies can support research practices we depicted archetypically above: the careful, authentic and preserving collection …..";
- Section 3, page 4. "This paper looks also forward on how semantic web technology can be applied to historical data sets, and how these technologies can facilitate, boost, and improve research by historians."
- Section 3.1, page 5. "In this article we look into the benefits of the utilization of semantic technologies in the field of history…";
- Section 5, page 15. "The goal … was to identify possible bridge heads for common problem solving processes on a conceptual level."
None of the goals mentioned above is in line with the standard purpose of producing a state-of-the-art survey paper containing a comparison of two (or more) problems, things, ideas or events and the evaluation of their differences and similarities.

2. Section 2 should be reduced and integrated with Section 3. Most of the ideas discussed in section 2 (e.g. types of historical data sets) have been also discussed in Section 3 and to a certain extent in Section 4.

3. Section 3 introduces the field of Computational Semantics as given in Wikipedia: "Computational semantics is the study of how to automate the process of constructing and reasoning with meaning representations of natural language expressions. It consequently plays an important role in natural language processing and computational linguistics." However, computational semantics is not further discussed in the paper.

4. It is not clear why the lifecycle of historical information as proposed by Boonstra et al. (2004) was not further used in the paper. I would suggest that the lifecycle is the appropriate framework for writing a survey paper in which the authors can demonstrate how the semantic web can be or has been applied to historical information. Unfortunately, the discussion in the paper was restricted to primary historical sources as mentioned in Section 2, page 2.

5. In Section 2, page 3, the authors state that "… knowledge encoding is interesting, and it requires more sophisticated ontologies than currently available". What is meant by "sophisticated"? Current examples of developed ontologies and their advantages/disadvantages are needed to support such a statement. The authors seem to have adopted a bottom-up approach for explaining the role of knowledge encoding in historiography. What about top-down approaches? Why are they not mentioned in the review?

6. In Section 3, page 5, the authors briefly state that the challenges are mainly related to textual, linkage, structuring, interpretation and visualisation problems. More detailed examples based on the literature review are needed to support such a statement. Further, the authors seem to have given more focus on linkage and structuring. Why?

7. The authors do not consider that the dichotomy of historical sources (i.e. structured versus unstructured data) is a major issue in the semantic technologies pipeline (Section 3, page 7). I agree, and therefore I would like to suggest that the authors reduce the text discussing this dichotomy and extend the ideas on context issues, since these are more relevant to semantic web applications. For example: How is context being used by historians? What sort of knowledge management models can be useful in contextualising ontologies in e-History?
See Contextualizing ontologies, Bouquet et al., Web Semantics: Science, Services and Agents on the World Wide Web, Volume 1, Issue 4, October 2004, Pages 325–343 for more details.

8. In Section 3.2, the authors claim that "…historians often have different interpretations and no clear research question when starting an investigation and it is neither possible nor desirable to model the data according to certain requirements in advance." I would argue that historians usually carry out an abductive reasoning task in their early investigation steps. In this case, abductive logic programming will be more suitable, since it allows historians to generate hypotheses rather than test them. Plan recognition, diagnosis, language interpretation and many other tasks can be viewed as types of abductive reasoning (Charniak and McDermott 1985). In contrast, deductive reasoning engines are usually used for the Semantic Web. Wouldn't it be the case that semantic web based approaches are inappropriate for the early reasoning tasks of historians? If this is correct, what will be the impact on the proposed historical information lifecycle?
Charniak, E., and McDermott, D. 1985. Introduction to Artificial Intelligence. Reading, MA: Addison-Wesley.

9. Section 3.2.1 on semantic interoperability is very ill-structured. The focus is on describing data structures rather than semantic interoperability issues. Moreover, some examples are more related to syntactic interoperability (e.g. the examples given in the tabular data section). For a review on interoperability levels please refer to Manso-Callejo, M. A. and Wachowicz, M. (2009). GIS Design: A review of current issues in interoperability. Geography Compass Journal, 3(3), pp. 1105-1124.

10. There is a vast literature by historians describing their efforts on georeferencing, which should be mentioned in the discussion of annotation techniques for semi-structured sources.

11. Have the authors encountered published research work using NLP in e-History? Examples are needed here.

12. Section 4 is ill-structured as well. The focus is on describing how semantic web languages can/could handle historical data integration/harmonisation issues. However, the Semantic Web is basically an extension of the Web and of the Web enabling database and Internet technology, and, as a consequence, the Semantic Web methodologies, representation mechanisms and logics strongly rely on those developed in databases. This is the motivation for many attempts to merge the two worlds, more or less loosely, such as the various proposals to use relational database technology for storing web data or the use of ontologies for data integration. The authors need to re-structure this section according to Semantic Web methodologies, representation mechanisms and logics. This will help readers understand the research issues identified in Section 5 (i.e. knowledge modelling and ontologies, text mining, and search/retrieval).

13. Section 5 is a very well structured section but it also introduces the concept of Historical Semantic Web. What is that?

14. In Section 5, it is not clear what the authors actually mean by holistic approaches.

15. There are several misspelled words throughout the text.

Review 3 by Aldo Gangemi:

This is a survey paper about semantic technology for historical research (here and there called eHistory). As a survey it has most (not all) the references to the field, and attempts a classification and a comparison that are not however mature enough, in terms of depth as well as of precision.
I think this paper is worth acceptance, and I encourage the authors to revise the paper accordingly.

The overall introduction is interesting, and useful for historians, and even more for computer scientists to make them knowledgeable of the domain.

The authors then explain that they concentrate on primary (non-historian-made) sources of historical research. This is incomprehensible to me, because, for the sake of a survey, it does not make sense to restrict to a certain type of sources, especially considering that no paper/project/tool surveyed actually adheres to one single type of sources.

The subsection 2.1 on the notion of "structured/unstructured" is not very clear, and not complemented with the similar Section 3.1.2. It'd be better to state briefly that there are different notions of structure, and how they are mapped to the computer science one.

In Section 4 the paper arrives at a core contribution, which consists in revisiting the problem space of eHistory in terms of the Semantic Web problem space. Six problems are singled out. I suggest that the authors modify their narrative, and clarify from the very beginning their contribution by means of the six problems, and the related added value for historians.

Figure 3 contains a summary of the possible contribution from five semantic web "things" to the six problems. This seems quite limited. The five "things" (called "technologies" by the authors) include languages like RDF, query syntaxes like SPARQL SELECT, vocabularies, and reasoners. What about all the rest? Even for pure SW, this looks too partial. And actually the authors mention many more things later in Section 5.

Section 5 is eventually the real survey, organized by: papers, projects, resources, and tools. Good enough, however the list contains inaccuracies in my opinion. I list some of them that are quite evident:
- LODifier is listed as a relevant paper, but that paper does not talk about history, but envisages a tool for arbitrary domains
- half of the "tools" are actually ontologies or data models (SEM, OpenCyc, Dublin Core, SUMO, TEI), or lexical resources (WordNet, FrameNet).

The comparison is made along four dimensions – knowledge modelling, text processing, search, and interoperability – but the results are just visualized, and only partly interpreted in text. In particular, their contribution to eHistory in domain terms, rather than in terms of functionalities, is not clear. For example: what has been obtained in the studies described or carried out? How are resources and tools related to eHistory? Just as potential ones, or have they actually been used for eHistory? And with what results?

Some useful references may be added, in particular about history:

C. Retoré et al. A discursive analysis of itineraries in an historical and regional corpus of travels: syntax, semantics, and pragmatics in a unified type theoretical framework. In Constraints in Discourse 2011, Agay, Sept. 14-16

J. B. Owens et al. Visualizing Historical Narratives: Geographically-Integrated History and Dynamics GIS. National Endowment for the Humanities workshop “Visualizing the Past: Tools and Techniques for Understanding Historical Processes”, 20-21 February 2009, University of Richmond, Virginia, USA

and about tools, see at least the recent:

Aldo Gangemi. A Comparison of Knowledge Extraction Tools for the Semantic Web. Proceedings of ESWC2013, LNCS, Springer, 2013.

and for event extraction, the FRED tool:

Overall, there is a sense of shallowness: the survey is quite extensive and the classification useful (besides some inaccuracies), but the actual articulation of the research surveyed seems to remain implicit. It looks more like the state-of-the-art section of a project proposal, a technical report or a PhD thesis than an actual scientific contribution. Whatever its interest as a roadmap to historical research, I feel the need for something deeper.

I invite the authors to provide a more definite and scientific contribution to the comparisons between the research entities in eHistory, and to accompany it with at least one leading example that makes it clear what can be done by what. Deeper classification/comparison is barely needed in my view.

Definitely a very interesting and relevant survey. However, I was asking myself when skimming through the article why Semantic Wikis are mentioned nowhere in the survey. From my perspective, Semantic Wikis are wonderful tools for historians to collect, structure and access information, data and knowledge related to historical research. They are easy to deploy, configure and use, which is often an issue with notoriously underfunded humanities. The following paper describes a use case, where a Semantic Wiki was used for the creation of a comprehensive historical knowledge base:

Knowledge Engineering for Historians on the Example of the Catalogus Professorum Lipsiensis In: Proceedings of the 9th International Semantic Web Conference ISWC2010.

Disclaimer: I'm co-author of this one ;-)