A Strategy for Archives Metadata Representation on CIDOC-CRM and Knowledge Discovery

Tracking #: 2657-3871

Dora Melo
Irene Pimenta Rodrigues
Davide Varagnolo

Responsible editor: 
Eero Hyvonen

Submission type: 
Full Paper
This paper presents a strategy for the semantic migration of Portuguese National Archives records into the CIDOC-CRM standard, an ontology developed for museums, within the context of the EPISA project. The approach to automatically populating CIDOC-CRM is based on Mapping Description Rules that semantically translate the archives' descriptive information into a CIDOC-CRM representation. Compliance with the CIDOC-CRM model recommendations guarantees that the populated CIDOC-CRM ontology of archival descriptive information is interoperable and can be linked and integrated with other populated CIDOC-CRM ontologies. The information modelling takes into account requirements on the mapping representation that arise from the intent to interpret natural language text, both to automatically extract information from metadata text fields and to interpret natural language queries. To automatically interpret the Mapping Description Rules, the OWL API was used to obtain the set of assertions that represents the information in the target ontology, and two datasets with migration examples are available. The knowledge representation is explored through Description Logic queries that highlight the advantages of this new representation of the National Archives. The resulting representation can be evaluated automatically, proving its correctness for the metadata that has a direct representation in CIDOC-CRM.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 04/Mar/2021
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The paper introduces and discusses a strategy for a semantic migration process that maps information from descriptive records of the Portuguese National Archives onto the CIDOC CRM.

The major result presented in the text appears to be the definition of Mapping Description Rules and their application to the specific use case of DigitArq, the national archive system of Portugal. Beyond that, the paper discusses a couple of interesting modelling challenges that constitute valuable lessons learned, as well as a range of relevant "open problems" with regard to the further development and implementation of the Mapping Description Rules.

The kind of (practical) work discussed in the paper is very relevant (and needed) for the further advancement of the idea of semantic networks, i. e. integrated and rich knowledge bases in the cultural heritage domain.

However, my overall impression is that the manuscript is borderline with regard to a full (research) paper publication. The work that is discussed is original, relevant (also to the SWJ's scope) and useful, but seems to be very much work-in-progress. The results presented appear, overall, to be still too much of an interim nature. Furthermore, the quality of writing, for the most part, makes it hard to follow the argument. At the very least, the text needs to be revised linguistically.

Additional Comments:

- The Introduction (p. 2) states that the "Mapping Description Rules, as defined, can be easily adapted to the use of other ontologies": This statement is not taken up or substantiated further in the remainder of the text.
- The Introduction (p. 2) states that Section 5 will present an evaluation of the results of the migration process. Section 5 does, however, as far as I understand, present how the results could be evaluated, but not what the outcome of the evaluation was. The Conclusion then states, if I am not mistaken, that the evaluation is still pending, and that there are actually two "sub processes" (still) to be evaluated.
- The authors could be more clear about (or formulate more clearly) the state of evaluation: What exactly is evaluated and how; and what are the (possibly preliminary) results?
- However, it appears that the evaluation part is still very much on-going; for a research paper, I would expect corresponding results to be included.
- p.3: CIDOC CRM has been chosen as the target ontology, and a couple of reasons for this choice are mentioned. However, have other choices been considered/evaluated?
- How does the work presented here – the mapping from ISAD(G) to CIDOC CRM – relate to ArchOnto?
- p. 7, l. 45: "2 NewInst(IDE22 , ’E31 Man-Made Object’)" → "2 NewInst(IDE22 , ’E22 Man-Made Object’)" (read: change "E31" to "E22")
- On p. 14 it is stated that "any baptism event is established as an instance of the entity ’E5 Event’": Why is baptism not an instance of "E7 Activity"? Related to this (p. 14/15): Why is the domain of "PC14 Carried Out By" defined as "E5 Event" (and not "E7 Activity")?
- On p. 17: Librarians and archivists are specified as the main target users for the "Query Ontology Interface". Why also librarians? Are there library data / bibliographic records included in the archival data set you are working with? Are researchers – who use the archives and want to utilize the archival aids (themselves) – not part of the target group?
- Generally, the paper would benefit from a more extensive discussion of the modelling challenges/decisions encountered, and, related to that, a more extensive discussion of the advantages of the semantic network approach (based on CIDOC CRM) in terms of (new/advanced) query options (that were not possible before and are relevant to the target user groups).

Review #2
By Carlo Meghini submitted on 11/Mar/2021
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The paper is original and presents significant results. The quality of writing could be improved, so I suggest acceptance with minor (editorial, non scientific) revision. A detailed review is given below.

The paper presents a method and a set of associated tools for transforming archival data expressed in ISAD(G) and ISAAR(CPF) into equivalent data expressed in the CIDOC-CRM, with the aim of making the original data interoperable with a larger set of applications. This objective is very important because archives are fundamental information sources for knowing our past and better understanding our present. Making them more interoperable is therefore paramount, and the CIDOC CRM is an ideal choice since it is a well-known and widely used standard in the Cultural Heritage domain. The method and the tools presented in the paper succeed in realizing this objective, facing and solving several conceptual and technical problems; therefore I recommend acceptance.

On the other hand, the paper may prove difficult to read for the non-initiated, therefore I also recommend some revisions to the presentation, detailed below section by section.

Section 1 clearly sets the context, the objectives and the achievements of the paper.

Section 2 provides a detailed and well-argued state of the art.

Section 3 would benefit from some modification.

• Figure 1 is hard to read, I recommend enlarging the text in the boxes
• In section 3.2, the sentence “the description level of a unit is Fonds” is analyzed as follows: “description level” is akin to a class, “Fonds” is akin to an individual, and the whole sentence asserts membership of the individual “Fonds” in the class “description level”. In other words, the authors interpret the sentence in the same way as “Fido is a dog”, which asserts membership of the individual “Fido” in the class “dog”. I believe this interpretation is flawed, because it treats the notion “description level” as an independent entity, whereas that notion depends on the notion “unit”, which the interpretation ignores. More simply, the sentence is akin to “the birthplace of a person is Place”, a categorical assertion about universals that is typically interpreted in semantic modelling as follows: “Person” and “Place” are classes, “the birthplace of” is a property, and the sentence asserts that the domain of the property “the birthplace of” is the class “Person” and its range is the class “Place”. Similarly, “the description level of a unit is Fonds” asserts that the property “the description level of” has the class “unit” as domain and the class “Fonds” as range. Once “description level” is interpreted as a class, the damage is done, and it is not repaired by resorting to a type in place of a class, because the two notions are akin and entirely different from the notion of property. The misinterpretation continues with the next sentence analyzed by the authors: “The description level of the unit with reference code xxx is Fonds” (en passant, the NLP tagging of the sentence with PoS elements does not help in detecting the mistake, so perhaps it is not so important and can be omitted for the sake of brevity). Here, the obvious reading would be that the individual unit identified by code xxx has Fonds as description level. The fundamental difference from the previous sentence is that this sentence is not categorical, but merely expresses factual knowledge about a specific unit, identified by means of another property, “with reference code”. So here we have (a) two classes, “unit” and “Fonds”; (b) two properties, “the description level of” and “with reference code”; and (c) two individuals, the unit which is the subject of the assertion and the code “xxx”. But of course, this is not the interpretation of the authors, who insist on “the description level of” being a class “Description_Level”. I urge the authors to fix these issues, which do not have any impact on the mapping to CRM; it is just a rhetorical issue.
• Table 1 in Section 3.3 is very hard to read, and I wonder whether it would be more appropriate to present the rules in the form they are given in Appendix A, moving Table 1 to the Appendix. I would also move to the same Appendix the six commands that produce the transformed data and the presentation of the workflow with the examples. I understand the authors are very proud of these technical aspects of their work, but placing them in the middle of the paper may create a barrier for the reader, who first needs to understand the concepts and then, possibly, the way these concepts have been implemented. Note also that the concepts are far more important than the technical implementation, because the latter may change due to many contingent matters, while the former are set once and for all.
• Another issue with Section 3.3 is the way the authors handled the so-called “.1 properties” of the CIDOC CRM, such as P1.1 (illustrated in Figures 4 to 6). I spent a few hours finding the CIDOC CRM recommendation for dealing with these properties that the authors refer to. I finally found it in a PowerPoint presentation on a web page of the CIDOC CRM website (http://new.cidoc-crm.org/Resources/modeling-properties-of-properties-in-...). Please insert this reference in the paper, either as a footnote or as a reference proper.
• Related to the previous point, I did not find in the paper any statement reporting which version of the CIDOC CRM the authors have used. I believe they used version 6.2 and its RDF Schema expression, but this needs to be stated clearly. Notice that the latest version of CIDOC CRM is 7.1, which I believe is fully compatible with the one used by the authors, but they have to check this.
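
To make the reading proposed for Section 3.2 concrete, the following minimal sketch treats “the description level of” and “with reference code” as properties with a domain and a range, and derives class membership from the facts about one unit. It is an illustration only, not the authors' mapping: plain tuples stand in for RDF triples, and every identifier (`has_description_level`, `has_reference_code`, `unit_1`, `level_of_unit_1`, `infer_types`) is hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Categorical (schema-level) statements: properties with a domain and a range.
SCHEMA = {
    Triple("has_description_level", "domain", "Unit"),
    Triple("has_description_level", "range", "Fonds"),
    Triple("has_reference_code", "domain", "Unit"),
}

# Factual (instance-level) statements about one specific unit.
# "level_of_unit_1" is a placeholder node introduced only for this sketch.
FACTS = {
    Triple("unit_1", "has_reference_code", "xxx"),
    Triple("unit_1", "has_description_level", "level_of_unit_1"),
}


def infer_types(schema, facts):
    """Derive class membership from domain and range declarations."""
    domains = {t.subject: t.obj for t in schema if t.predicate == "domain"}
    ranges = {t.subject: t.obj for t in schema if t.predicate == "range"}
    inferred = set()
    for fact in facts:
        if fact.predicate in domains:
            # The subject of a property belongs to the property's domain.
            inferred.add(Triple(fact.subject, "type", domains[fact.predicate]))
        if fact.predicate in ranges:
            # The object of a property belongs to the property's range.
            inferred.add(Triple(fact.obj, "type", ranges[fact.predicate]))
    return inferred


TYPES = infer_types(SCHEMA, FACTS)
```

In this reading no class “Description_Level” is needed: the unit is typed as `Unit` and the value of its description level as `Fonds` purely through the domain and range of the property.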

Section 4 is fine, except for Table 5, which presents technical rules that are likely to be of interest only for the implementors. Please consider moving this Table in an Appendix as suggested for Table 1, leaving in this Section only a conceptual illustration. Another option is to drop the Table from the paper, presenting only one rule as an example and referring to a technical document for the complete set of rules.

Section 5 is fine, except that I found missing quotes (for instance on the right-hand side of query DLq1) and that the queries are not very readable in the present form. The authors should consider indentation as a way of highlighting the structure of the queries, perhaps using a different font to save space. Or using a Table for presenting the queries, thus availing of the full page width.

Section 6, in my opinion, can be made much shorter, if not removed altogether. The reason is that this Section mostly discusses quality issues of the data that the authors have been working with. These issues are not related to the strategy presented by the paper, and in any case they can be dealt with by using specific techniques, such as de-duplication or lexical data management. So I would urge the authors to leave these issues out, perhaps just mentioning them en passant.

Section 7 is fine.

Review #3
Anonymous submitted on 06/Apr/2021
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The aim of this paper is to introduce a rule-based approach to automatically populate a CIDOC-CRM model with the Portuguese National Archives metadata that obey ISAD(G) and ISAAR recommendations.

1 Introduction

This section introduces the problem addressed in the paper.

I think more explanation of what "Mapping Description Rules" are is needed, together with references. An example would help to make the section more readable, as the readers of this journal are mostly not specialists in archiving.

2 The Archival ...

This section explains the ISAD(G) and ISAAR (CPF) formats used in archives.

3 Representing

4/44 Explain Figure 1. The reader may not be familiar with the archival concepts it uses.

5/40 The subsection starts with bullet points. For readability, more textual explanation is needed. Explain also why NL interpretation is needed here and how it is used; what is happening is not obvious to the reader. Explain also what tagging means here. I think you cannot assume that the reader knows the formats you are dealing with. The parse tree also appears out of nowhere.

6/7 Mapping Description Rules system should be explained -- this system is not common knowledge to readers even if perhaps CIDOC CRM experts may know it and have read [8].

8/46 from -> of

9/13 ternary property ??? Aren't we dealing here with RDF and binary ones only?
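
One common way to reconcile a property-of-property with RDF's binary triples is reification: the statement itself becomes a node that carries the extra argument. The sketch below is an illustration under assumptions, not the authors' rule set. It assumes the PC14 / P01 / P02 / P14.1 naming of the CIDOC CRM property-class (PC) RDFS extension; the `ex:` identifiers and the helper `reify_with_role` are hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def reify_with_role(event, actor, role, node_id):
    """Express 'event carried out by actor in the role of role' with binary triples only.

    The statement is turned into a node (node_id) of an assumed property class,
    which links to the event, the actor and the role through three binary properties.
    """
    return [
        Triple(node_id, "rdf:type", "crm:PC14_carried_out_by"),
        Triple(node_id, "crm:P01_has_domain", event),
        Triple(node_id, "crm:P02_has_range", actor),
        Triple(node_id, "crm:P14.1_in_the_role_of", role),
    ]


# Hypothetical example: a baptism carried out by a priest acting as officiant.
TRIPLES = reify_with_role(
    event="ex:baptism_1",
    actor="ex:priest_1",
    role="ex:role_officiant",
    node_id="ex:pc14_1",
)
```

Each resulting statement is an ordinary subject-predicate-object triple, so what reads as a "ternary property" in the text can still be stored in plain RDF.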

4 Automatic Migration ...

This section explains in detail how CIDOC-CRM data is generated, especially from the textual element values of the original data. There are lots of details and an example is used, but I found this section difficult to follow. Perhaps more focus on describing the method at a more general level would make the text more readable?

Listing 1: are the red arrows really needed in the listings?

11/16 The migration process ... this sentence does not make sense grammatically

5 Querying CIDOC-CRM ...

This section explains how the CIDOC-CRM data is queried using SPARQL-DL queries. The end user is supposed to use a GUI system called Query Ontology Interface that translates requests into SPARQL-DL. It is not explained how this works. Some kind of explanation or motivation for this approach is needed: what are the benefits and challenges compared to the legacy system? It seems that in this case data is not aggregated from different data silos but only from one database, and that CIDOC CRM is used only as a schema to represent the data in another, more structured form. Explain why this is useful for archivists.

6 Open problems

This section raises several open problems. For example, the key problem of entity resolution seems to remain open: how, for instance, can people with similar names be disambiguated, and how can different name variants be identified as referring to the same person?

What is the quality of the resulting data in terms of errors made and completeness of information? These issues are crucial when querying the structured data.

The system has not been evaluated yet; it is not really shown that the results are fit for their purpose, even though the examples and implementation details seem to indicate that something operational and useful has been created.

My impressions about the paper using the SWJ criteria for full research papers are presented below:

(1) originality

This paper has some originality as not much about the topic has been published.

(2) significance of the results

This remains somewhat open: there is a system, but it has not been evaluated, and the qualitative arguments and motivations for the approach could be expanded. My concern is that the paper may not be mature enough for the journal yet. The paper could explain more about how the system can be generalized to similar cases beyond the Portuguese one; this would be important for significance. Now the focus is more on explaining and documenting one particular Portuguese system. More motivation for using CIDOC CRM would also be helpful.

(3) quality of writing

The paper is in general carefully written and in detail. However, there are several issues in readability as explained above.

Minor corrections:
1/28 EPISA -> the EPISA
2/39 automatic -> automatically
2/51 intend -> intent
2/27 add space before (ICA)
2/45 add space before references such as [2]
3/36 mean -> means
3/36 The sentence "The need ..." is complex; reformulate it more simply.
3/7 trough -> through
3/50 Population that -> Population system that
3/19 mean -> means
4/42 recomendations -> Recommendations
6/49 follows -> follow

Check all references; you seem to have trouble with BibTeX:

Ref [4] What is C.b.t.C.S.I.G. ?
[7] G.d.T.d.N.d.D. Diração ?
Ref [12] T.T.S. for ..?. Something wrong here ??

Check CIDOC refs:

[4] C.b.t.C.S.I.G. ICOM/CIDOC Documentation Standards
Group, Definition of the CIDOC Conceptual Reference
Model, 7.0.1 edn, ICOM/CRM Special Interest Group, 2020.
[5] C. Meghini and M. Doerr, A first-order logic expression of the
CIDOC conceptual reference model, International Journal of
Metadata, Semantics and Ontologies 13(2) (2018), 131–149.
[6] C.b.t.C.S.I.G. ICOM/CIDOC Documentation Standards
Group, Definition of the CIDOC Conceptual Reference
Model, 7.0.1 edn, ICOM, 2020.
[7] G.d.T.d.N.d.D. Diração-Geral de