Cultural Heritage Information Retrieval: Data Modelling and Applications

Babak Ranjgar
Abolghasem Sadeghi-Niaraki
Maryam Shakeri
Soo-Mi Choi

Responsible editor: 
Mehwish Alam

Submission type: 
Survey Article
Abstract. The Cultural Heritage (CH) community is one of the domains to adopt Semantic Web recommendations and technologies, which can provide interoperability between various organizations by creating a shared understanding in the community. The CH employed Semantic Web technologies step by step along its evolution process for better knowledge management and a uniform understanding among the community. To identify this evolution process, there is a need to review CH knowledge engineering and the process to improve information retrieval, which new researchers could follow the newest developments in the area. This paper presents this process from its initial steps and the various challenges faced to the latest developments in the CH information retrieval. CH has the goal of preserving and dissemination of the historical information to people and society. Therefore, by making data machine-readable and achieving data interoperability thus a better information retrieval, there is a wide set of opportunities to develop smart applications based on rich CH information as a form of interactive, user-friendly, and context-aware dissemination of information to users. We also reviewed intelligent applications and services developed in the CH domain after establishing semantic data models and Knowledge Organization Systems. Finally, challenges and possible future research directions are discussed.
Review #1
Anonymous submitted on 13/Feb/2022
Major Revision
Review Comment:

Overall, the structure is clear and the presentation is adequate, although a native speaker should review the text for spelling.

[1] Some aspects deserve more attention for the paper to be really comprehensive. Two main examples:
- The paper states "There are two important issues [...]. The first is technical interoperability, which is solved by the decentralized architecture of the web and its platform independent protocols for data sharing and exchange. The very web itself lead to the second problem, which is semantic interoperability." There is a third important issue, namely understanding what is in the data and collections themselves. Think about issues related to polyvocality and semantic drift: objects mean different things to different people over time.
- Looking at "Fig. 1. Overall methodology of the research.", it seems that the step of automatic information extraction (e.g., using AI to extract keywords or low-level features in data, such as colour) is missing. More generally, AI is only briefly mentioned although there is a lot of work in this area (look, for instance, at the Europeana Tech working group on this subject).

[2] The paper feels a bit dated at times. This is an issue for people new to the field:
- The bibliography contains only a few references less than four years old.
- Many of the examples, such as "Yahoo!, community portals" and "semantic information portals", aren't really contemporary. Also, the sentence "CultureSampo are well-known examples of semantic information portals in the CH domain." refers to a paper from 2009. The same is true for Amalgame, a tool that is no longer in operation. I would encourage the authors to find more contemporary examples.

[3] Many of the examples listed relate to (predominantly European) research projects and smaller-scale implementations. However, there are quite a few commercial vendors that offer solutions that have accelerated the uptake of semantic technologies in the CH domain, for instance the Semantic Web Company (PoolParty), Ontotext and many others.
Also from industry: graph databases such as TerminusDB and neo4j are having more and more impact.
=> The paper should also cover these developments in order to reflect the current state of the art and current practices.

[4] I'm missing references to a few initiatives that should be included, notably:
- important standards (notably FRBR, RDA, EBUCore and ISBD);
- the OAIS Reference Model.

[5] In Section "3.4.1. CRM vs. EDM" it would be good to discuss the practical uptake of the two ontologies, basically unpacking the statement "Memory institutions have stored their data mostly in an object-centric way, and its conversion to the event-centric type needs a great deal of effort." a bit more. The fact that "This shows that CRM has higher ontological commitment than EDM." could also make CRM more complex to apply in practice. I think it would be informative to add this dimension.

[6] "5.2. Data reuse and dissemination"
=> if the paper wants to be comprehensive, it should also mention intellectual property. (f.i. referring to
=> there's a lot of work in this area; check, for instance, reuse on Wikipedia.

[7] Some claims need to be substantiated:
=> "However, visualization is quite young in CH domain."
=> "CH from the very beginning embraced Semantic Web technologies, so it evolved as it did." ?

Review #2
By Victor de Boer submitted on 23/Feb/2022
Minor Revision
Review Comment:

This survey article gives an overview of the history and state of usage of information modeling technologies in the domain of Cultural Heritage. It presents a historical view on the evolution of usage of metadata schemas, vocabularies, thesauri and ontologies in the domain and presents some applications of such models. The final sections discuss some current challenges around information modelling.

As this manuscript was submitted as 'Survey Article', I address the following dimensions as indicated by the journal:
(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.

What is quite interesting about the article is that it places the Semantic Web elements (ontologies, LOD, etc.) in a historical context, relating them to other KOS solutions. It is also quite complete and presents a nice overview that would mostly be relevant reading for interested parties from the cultural heritage and digital humanities domains. For readers from the Semantic Web field, this relation to older and/or less semantic KOS solutions is of interest.

(2) How comprehensive and how balanced is the presentation and coverage.

The balance is appropriate, with parts dedicated to the various KOS solutions, and the most important models. The section on challenges is less balanced. 5.1 and 5.2 present relevant challenges, but 5.3 and 5.4 are more descriptions of applications/specific types of information that could be better presented elsewhere in the paper. A more comprehensive listing of challenges would make the paper more interesting.

(3) Readability and clarity of the presentation.

A downside of the paper is that the language and grammar can be greatly improved to make the overall presentation more effective. In many cases, grammatical errors lead to unclear sentences and ambiguity.

(4) Importance of the covered material to the broader Semantic Web community.

Although this is indeed a survey article, no new analysis or synthesis is performed (or really claimed) in the paper. It brings together information that has been presented before, also in other survey articles, or articles comparing models (for example Dijkshoorn et al). However, as a starting text this survey is quite complete and presents most elements in a comprehensive way.

Minor issues:

- Fig 1: the method of structuring this figure is quite unclear. I would think that it does not really comprise a comprehensive methodology but rather a conceptual framework.
- Abstract: the abstract should state the main results of the study. It currently does not provide much information on the outcome of the survey research.
- p2: " In the late 20th century, " -> this is from 2001, so 21st C
- Section 3.1 makes several claims about the relations between various KOSs that are not backed up by references or original research (for example "These types of KOSs were not enough to address the heterogeneity of the data and semantic interoperability").
- P4: "(DDC), which is a system of 10 numeric sections with decimal extensions." This is incorrect. DDC has 10 main classes, each with 10 divisions, each with 10 sections, so 1000 sections in total.
- P4: "They have no special structure within them and are not of much interest in the CH domain." -> this would be a matter of opinion or needs to be backed up by references or research
- Section 3.1.4: here the authors describe XML and RDF; however, XML and RDF are not particular to metadata schemas but also play a role in authority files, ontologies, structured vocabularies, etc. I suggest moving this paragraph out of this section into a separate section describing technologies (for example, combining it with OWL/SKOS etc.).
- 3.4 "Functions in this domain as said in" -> This concerns applications and not ontologies, I suggest moving that part to another section.
- Fig 3 is not centered
- P20: "The concept-based modelling of OWL prevents it from performing inferences based on the properties. For example, OWL lacks composition constructors for properties that makes it unable to capture the relationship between concepts associated with a combination of properties" -> this is of course possible with OWL 2 property chain axioms (owl:propertyChainAxiom)
- p20 "However, visualization is quiet young in CH domain" -> (quite) / Young as opposed to what? I think visualizations are quite common, especially geo-viz, social and object networks etc. This assessment would need some more clarification

- On multiple occasions, the authors use "the Cultural Heritage" as an actor, for example in the abstract: "The CH employed Semantic Web technologies step...". This is incorrect and confusing. Please rephrase as "the CH community" or "In the CH domain, various actors..."
- p1: "For example, artists who lived in a desired city during a special period of time. " -> in a specific city during a specific period
- There are many typos and grammatical issues and I suggest running spell- and grammar-checkers for a next version.
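
To illustrate the P20 point about property composition in OWL 2, the following is a minimal sketch in plain Python of what a property chain axiom licenses. It is not a reasoner and does not use any OWL library. The property names (createdBy, livedIn, associatedWithPlace), the individuals, and the helper apply_property_chain are all hypothetical and do not come from the manuscript. The sketch performs a single pass over asserted triples only, so it assumes the chained properties are not themselves inferred.

```python
# Sketch only: emulates the entailment of a chain axiom p1 o p2 o ... -> q
# over a small set of asserted triples. All names below are hypothetical.

def apply_property_chain(triples, chain, implied):
    """Return the triples entailed by the axiom: chain[0] o chain[1] o ... -> implied."""
    # Pairs reachable by following the first property in the chain.
    pairs = {(s, o) for (s, p, o) in triples if p == chain[0]}
    # Extend each pair by following every further property in order.
    for prop in chain[1:]:
        step = {(s, o) for (s, p, o) in triples if p == prop}
        pairs = {(a, c) for (a, b) in pairs for (b2, c) in step if b == b2}
    return {(s, implied, o) for (s, o) in pairs}


triples = {
    ("painting1", "createdBy", "artistA"),
    ("artistA", "livedIn", "cityX"),
    ("painting2", "createdBy", "artistB"),  # no livedIn statement for artistB
}

inferred = apply_property_chain(
    triples, ["createdBy", "livedIn"], "associatedWithPlace"
)
print(inferred)
```

With these sample triples, the only inferred statement links painting1 to cityX via associatedWithPlace; painting2 yields nothing because the chain cannot be completed for artistB.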

Review #3
By Mehwish Alam submitted on 01/Mar/2022
Major Revision
Review Comment:

The following points should be addressed by the authors for improving the manuscript:

- Many works related to the intersection of Artificial Intelligence and Cultural Heritage are missing.
- The authors should mention the paper selection criteria such as which platforms were chosen to search for the papers along with the keywords and the filtering criteria.
- The list of challenges should be extended. Moreover, a vision for open problems should be given.
- Grammatical mistakes should be corrected.