Review Comment:
The manuscript presents an ontological framework, named CAAPT (computational approaches to addressing problematic terminology), aimed at supporting the activity of *critical cataloguing* in museums by representing and interlinking the contents of *terminology guidance documents* about *potentially problematic terms*, as well as decisions about how to handle them. CAAPT consists of three complementary and independently reusable ontology modules along with a supporting vocabulary, publicly available as resources, which were used to populate a non-publicly available knowledge graph (KG) spanning different data sources. The ontology modules include a core module about guidance documents with terms and suggestions (CAAPT-O), a module about terms' use contexts (CAAPT-UC), and a module about decision making (CAAPT-DM), developed by integrating and extending several related existing ontologies (CIDOC-CRM, OntoLex, CULCO, OA, SKOS) through an ontology engineering methodology that involved domain experts and was inspired/informed by lean ontology development and critical theory (namely, feminist and queer theory). The supporting vocabulary defines SKOS concepts relevant for populating the proposed ontology in the KG. The KG covers guidance documents from Words Matter, part of Inclusive Terminology Guidance (ITG), and Victoria and Albert Museum (TGD), as well as a spreadsheet of past decisions from the latter institution, which were processed through a semi-automated approach and for which data statistics and some insights are reported in the paper. CAAPT was evaluated by showing the feasibility of building the KG, as well as through 43 competency questions that were successfully answered through SPARQL queries over the KG.
The manuscript extends an LREC-COLING 2024 workshop short paper [15] and a subsequent ESWC 2024 PhD Symposium paper [14] by the same author. It is submitted to the special issue on Ontology Design for Cultural Heritage and fits the latter's topics of interest, being concerned with the development and application of ontologies and knowledge graphs in the field of cultural heritage, also through the reuse of standard ontologies such as CIDOC-CRM. The manuscript is submitted as a full paper, though it may also fit the requirements of an ontology description paper (https://www.semantic-web-journal.net/authors). Since it is submitted as a full paper, this review focuses on the dimensions of originality, significance of results, and quality of writing, also covering the resource material linked in the submission.
### ORIGINALITY ###
In general, there is limited related work concerning ontological support in the domain of critical cataloguing considered here. Relevant prior work appears to be the CULCO ontology [63] and prior papers by the same author [14, 15]. Concerning the CULCO ontology, it only covers contentious terms and related suggestions, which are a small fragment of the larger domain addressed by CAAPT. Concerning the author's prior work, the submitted paper appears to provide a substantial extension of it, with additional contributions (identified through a comparison with [14, 15] and substantially aligned with what is stated in the manuscript) that consist in:
* the extension and restructuring of the ontology into three independently reusable modules, with the revision of some prior modeling decisions (e.g., the inclusion of caapt:TermEntry) and with the content of the CAAPT-DM module appearing completely novel;
* the extension of the knowledge graph (mentioned in [15]) with additional data from Words Matter, part of ITG (one document), and a Victoria & Albert decision spreadsheet;
* the online availability and documentation of the three ontology modules and the supporting vocabulary, along with 43 competency questions and associated SPARQL queries, all this material linked to the submission;
* the more in-depth discussion in the paper of the underlying domain, the data sources considered, related ontologies, and the artefacts developed in this work, as well as the methodology followed and its grounding in critical theory, including feminist and queer theories (only briefly mentioned in [14]).
These contributions, along with the effort (which goes beyond my field of competence, though) to ground the work in the very same critical theory that underlies critical cataloguing, all represent novel aspects of the submitted article.
### SIGNIFICANCE OF RESULTS ###
The motivation for using ontologies and semantic techniques for critical cataloguing is nicely stated in the paper, particularly as concerns the integration and establishment of community resources to support practitioners in this field. I think CAAPT has the potential to provide a valuable and comprehensive resource in that area, going beyond the scope of prior work such as CULCO. In particular, I appreciate the variety of sources (different terminology guidance documents, a decision-log spreadsheet, experts' interviews and workshops, etc.) considered in designing CAAPT, as well as the substantial effort put into combining existing ontologies (CIDOC-CRM, OntoLex, CULCO, etc.) wherever applicable, which is far from trivial due to the inevitable challenges in finding meaningful and compatible ways to align and extend their concepts.
At the same time, I have some major concerns, listed next, regarding the alignment with existing ontologies (C1, C2, C3) and data availability (C4):
* (C1) Reuse of ontolex:reference to link caapt-uc:UseContext to crm:E33_Linguistic_Object -- Property ontolex:reference (https://www.w3.org/community/ontolex/wiki/Final_Model_Specification#Lexi...) is meant to link a sense of a lexical entry to the ontology element **denoted** by that entry / sense. In the proposed framework, caapt-uc:intended_target_mark already covers the role of ontolex:reference, e.g., linking indian_uc-1 and indian_uc-6 to the crm:E74_Group instances "People of India" and "Indigenous people of Canada", respectively. Instead, ontolex:reference is used here (and already in [14]) to link a use context to some description text in the terminology guidance document, so both indian_uc-1 and indian_uc-6 (and others) have the crm:E33_Linguistic_Object "Description of 'Indian' in TGD" as their ontolex:reference. Such a description (i) is not the concept denoted by these senses, and (ii) even worse, covers all possible senses of "Indian", conflating them. I see this form of reuse as a major incompatibility with OntoLex, especially in light of the central role of ontolex:reference (and other denotation-related concepts) in OntoLex. At the same time, I think ontolex:reference can easily be replaced here with some other unrelated property without substantially changing the proposed ontological framework, so fixing this issue should be straightforward.
* (C2) Modeling of caapt:TermEntry as sub-class of ontolex:LexicalConcept -- I find this decision rather questionable and deserving of further justification in the paper. On the one hand, caapt:TermEntry captures an entry in a terminology guidance document, such as the entry for 'Indian' (listing on pages 18-19) and the one for 'aboriginal' (Figure 1). In these entries, **multiple meanings** are described, such as 'Indian' as an inhabitant of India or as a Native American, and 'aboriginal' as indigenous people in certain areas or as flora/fauna that has existed in a place since the earliest known time. On the other hand, an ontolex:LexicalConcept "represents a mental abstraction, concept or unit of thought that can be lexicalized by a given collection of senses" (cit: OntoLex specification), such as a WordNet synset, which I understand as a **unitary meaning**, thus conflicting with what we find in a caapt:TermEntry.
* (C3) Modeling of caapt:TermEntry as sub-class of both crm:E55_Type and culco:ContentiousIssue -- Class culco:ContentiousIssue (https://cultural-ai.github.io/wordsmatter/) is defined as a "discussion about a term" having "various formats and types (a textual publication, a social media post, a multimedia file, a verbal conversation etc.)". Given this definition, it appears to me to be intended as a kind of crm:E73_Information_Object, and this class is explicitly stated in CIDOC-CRM (https://cidoc-crm.org/html/cidoc_crm_v7.1.3.html#E73) to be incompatible with conceptual items such as types and classes, which include crm:E55_Type. This issue is mostly a follow-up of the previous one (C2), since the use of crm:E55_Type follows from the use of skos:Concept, which in turn stems from reusing ontolex:LexicalConcept.
* (C4) Unavailable KG data and/or example data showing how to populate the ontologies -- Building a KG may demonstrate the feasibility of using CAAPT on data spanning different, representative sources. However, the fact that its data is not publicly available (for the reasons justified in the paper) hinders reproducibility. More generally, I think there is a need for example data showing how to concretely use CAAPT. If not the KG itself, this data might be a publishable subset of it (possibly anonymized), or alternatively a sufficiently representative set of data snippets showcasing (e.g., in the online resource documentation) how to instantiate the proposed ontological framework. Right now, the examples in the Turtle listing on pages 18-19 and in Figure 7 are insufficient in my opinion, as they necessarily cover only part of the ontology, leaving out important practical details related to its usage (e.g., about other properties whose presence is required by the extended ontologies, as mentioned in minor comment M7), thereby hindering the potential impact and reuse of CAAPT by practitioners.
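To make the conflation described in C1 concrete, a query along the following lines could list every description that serves as the ontolex:reference of more than one use context. This is only a sketch under stated assumptions: the caapt-uc: namespace IRI is a placeholder (the actual IRI is not given in this review), and the query assumes that use contexts and descriptions are explicitly typed in the KG.

```sparql
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
PREFIX crm:     <http://www.cidoc-crm.org/cidoc-crm/>
# Placeholder namespace (assumption): replace with the actual CAAPT-UC IRI.
PREFIX caapt-uc: <http://example.org/caapt-uc#>

# Descriptions referenced by several use contexts, most shared first.
SELECT ?description (COUNT(DISTINCT ?useContext) AS ?useContexts)
WHERE {
  ?useContext a caapt-uc:UseContext ;
              ontolex:reference ?description .
  ?description a crm:E33_Linguistic_Object .
}
GROUP BY ?description
HAVING (COUNT(DISTINCT ?useContext) > 1)
ORDER BY DESC(?useContexts)
```

Any row returned corresponds to a single text being treated as the denotation of several distinct senses, which is the situation described for "Description of 'Indian' in TGD".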
### QUALITY OF WRITING ###
The paper appears adequately structured and written. The quality of English is good overall, though there are typos and other minor presentation issues (listed later) and a tendency to write very long, complex sentences that are at times difficult for readers to parse (disclaimer: I'm not a native speaker).
Concerning diagrams, I find them rather complex (esp. Figure 7), partly because the same referenced concept is represented multiple times in the same diagram (e.g., crm:E74_Group in Figure 3, but there are many cases of this). I appreciate the effort put into preparing diagrams and the attempt to provide a comprehensive overview of each ontology module through a single diagram, but I wonder whether splitting complex diagrams into multiple "sub-diagrams", each focusing on a subset of the proposed concepts, would simplify understanding.
Also, showing concrete instantiation examples (e.g., in Turtle, possibly coming from some unifying running example) along with diagrams, instead of concentrating them towards the end of the paper (listing on pages 18-19), might facilitate understanding of the proposed modeling solution and its intended use.
### QUALITY OF RESOURCES ###
The submission links to a GitHub repository -- and specifically to a ZIP file hosted therein, in compliance with submission requirements -- which provides the RDF/OWL source files for the three ontology modules and the supporting vocabulary, the 43 competency questions, and a couple of README files providing guidance to users accessing this material.
The enclosed material appears to be complete and covers all modeling-related outcomes of the work. However, some of the ontology/vocabulary files are not valid RDF, leading to errors when parsing / opening them: caapt-uc.ttl contains an undefined caap-uct: namespace (line 74, typo?); caapt-dm.ttl contains a ';' in place of a '.' (line 278, again typo?); caapt-v.ttl contains the undefined vann: namespace; caapt-all.ttl (undocumented, possibly redundant) also has syntax errors. These errors point to manual authoring of the ontology files in a text editor, and to these files not having been imported into the described KG. They are trivial to fix, and for this I suggest using a proper ontology editor such as Protégé, as this will also facilitate spotting errors other than syntax ones. There are also some syntax errors in the competency questions (e.g., the ?frac:attestation typo in CQ 33 and other queries), so I suggest double-checking their correctness as well.
KG data is not distributed as it covers private material, as mentioned in the paper. I acknowledge this restriction. If possible, I would still suggest extracting and providing representative (possibly anonymized) snippets of the KG content to illustrate how the proposed ontology modules and vocabulary can be used in practice.
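As an illustration of how such a snippet might be extracted, the following CONSTRUCT query sketches pulling out the sub-graph around a single term. It reuses the property paths of the alternative CQ 20 formulation given under M11; the ex: namespace and the ex:indian IRI are hypothetical, the culco: namespace IRI is an assumption to be checked against the CULCO documentation, and any anonymization would still have to be applied to the result.

```sparql
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
PREFIX crm:     <http://www.cidoc-crm.org/cidoc-crm/>
# Assumed namespace IRI: verify against the CULCO documentation.
PREFIX culco:   <https://w3id.org/culco#>
# Hypothetical namespace for KG individuals.
PREFIX ex:      <http://example.org/kg/>

CONSTRUCT {
  ?term  ontolex:sense ?use .
  ?entry ontolex:isEvokedBy ?term ;
         culco:hasSuggestion ?sug .
  ?sug   rdf:value ?sugValue ;
         crm:P129_is_about ?sugCondition .
}
WHERE {
  VALUES ?term { ex:indian }          # hypothetical IRI of the chosen term
  ?entry ontolex:isEvokedBy ?term ;
         culco:hasSuggestion ?sug .
  ?sug   rdf:value ?sugValue .
  # Template triples with unbound variables are simply omitted from the output.
  OPTIONAL { ?term ontolex:sense ?use }
  OPTIONAL { ?sug crm:P129_is_about ?sugCondition }
}
```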
### MINOR COMMENTS ###
* (M1) [page 8] In which sense does the "level of complexity" of OntoLex exceed "what is required for the representation of the project domain"? The core of OntoLex is rather simple, and the argument about ontolex:LexicalConcept being "beyond the requirements" is contradicted later by reusing and extending exactly that class in the proposed ontological framework.
* (M2) [page 8] I think the statement that ontolex:usage (with its rdfs:Resource range) "does not make it possible to differentiate between types of conditions or implications" is too strong, since this OntoLex solution still provides the foundations for introducing sub-properties of ontolex:usage and/or sub-classes of rdfs:Resource in order to achieve that goal, as done in LexInfo or done later in the proposed ontological framework.
* (M3) [Figure 4, page 13-14] The text states that caapt-uc:describes_replacing and caapt-uc:describes_replaced_by capture a "shift of term use" rather than a "shift of meaning", so I would find it more precise to have them link a caapt-uc:UseContext to another caapt-uc:UseContext for a different lexical entry but with the same meaning, i.e., the same referenced ontological concept. Using ontolex:LexicalEntry as the range for such properties works, although it leads to a simplified representation, as the linked lexical entry may have multiple senses / use contexts and it is left unspecified which of them is being considered (which therefore doesn't have to be explicitly represented). I suggest considering relaxing the range of these properties, admitting either ontolex:LexicalEntry (simplified representation, no need to model senses of the linked lexical entry) or ontolex:LexicalSense / caapt-uc:UseContext (precise representation, where a specific sense / use context is pointed to).
* (M4) [page 13, figure 4] Why restrict the range of property caapt-uc:intended_target_mark to crm:E74_Group, i.e., a group of people? While this may mostly be the case when using CAAPT-UC for critical cataloguing, this solution prevents applying CAAPT-UC to other domains that also involve the modeling of terms' use contexts.
* (M5) [page 14] Consider mentioning the Web Annotation Data Model (OA) and OntoLex FrAC as part of the review of related ontologies in pages 7-9.
* (M6) [Figure 5, page 15] I'm fine with the following "chain": caapt-dm:EncounterInstance (an occurrence of the lexical form) --crm:P106i_forms_part_of--> crm:E33_Linguistic_Object (a text snippet including that occurrence) --crm:P165i_is_incorporated_by--> crm:E31_Document (the text field) --crm:P148i_is_component_of--> crm:E31_Document (the record). What I see as problematic, however, is the proposed typing of the encounter, snippet, and field individuals instantiating the chain with OA selector concepts. In general, this makes these individuals simultaneously the annotated / enclosing text (which are linguistic objects) and the annotations / selectors over them. I'm unsure whether this meets the intended use of OA concepts. An apparent incompatibility that I see is in having the crm:E31_Document field individual also be an oa:FragmentSelector. Consider the case where two encounter instances occur in the same field. Given how OA refined selectors work, two oa:FragmentSelector individuals would need to be introduced, each one being oa:refinedBy a different snippet selector (crm:E33_Linguistic_Object + oa:TextQuoteSelector). However, both of these oa:FragmentSelector individuals would also stand for the same crm:E31_Document field, which would thus be represented twice.
* (M7) [Figure 5] The proper reuse of frac:Attestation (https://ontolex.github.io/frequency-attestation-corpus-information/#atte...) would require having a frac:observedIn attestation property linking to the enclosing document (could be the crm:E31_Document record), as well as an rdf:value property with the quoted text. A user querying the KG based on FrAC would expect these properties to be populated, but they are neither mentioned in the paper nor present in Figure 5 or in the online documentation, and it's not possible to assess whether they are populated in the knowledge graph due to data unavailability. I acknowledge that this -- and similar cases of ontology reuse entailing the inclusion of specific properties, e.g., for OA -- are secondary details in the scope of the paper, where the focus is on the key aspects of the proposed ontological framework. Yet, such aspects are relevant to practitioners wishing to reuse the proposed framework and be compliant with the ontologies reused therein, so I would suggest covering these aspects in the online documentation (e.g., adding instructions about which other properties to populate, besides documenting the introduced ones) and, if possible, adding a disclaimer in the paper to clearly state that such details are omitted in text and diagrams.
* (M8) [Figure 5, page 15 around "... act of annotation ..."] Coherently with the modeling of crm:E13_Attribute_Assignment as a specific crm:E7_Activity aimed at classifying a PPT occurrence, consider explicitly modeling the annotation activity of PPTs, possibly allowing an instance of such activity to be linked to the resulting annotated caapt-dm:EncounterInstance instances. In case this can already be achieved by reusing CRM classes / properties, I still suggest tackling this aspect in the text.
* (M9) [Figure 6, page 17-18] What are the identity criteria for determining whether two caapt:TermEntry(ies) are the same across different sources? Are they based on the main term for the entry, checking for an exact / "close enough" match? Was coreferencing of such caapt:TermEntries across sources done automatically to produce the statistics of Figure 6, or manually? Suggest briefly covering these aspects in the text.
* (M10) [pages 18-19, Figure 7] The reported RDF/Turtle snippet lacks variety, as it focuses on listing the 8 suggestions (a couple of them might have sufficed) with just a few of their properties, not exemplifying aspects of the ontology such as the crm:P129_is_about properties, the replacement terms, caapt:encountered restrictions on historical / current context, etc. Some of these aspects are covered in Figure 7, but I find this figure very complex, also lacking a clear starting point from which to navigate it (e.g., I could not find the caapt:TermRoot and caapt:TermEntry that the "Indian" suggestions are for).
* (M11) [Table 6] For CQ 20, variable ?contents and related OPTIONAL clause are unused and can be removed. The second FILTER clause is problematic as it is formulated now, since it's enough to have a use context having two of the considered ontolex:usage sub-properties (e.g., caapt-uc:diachronic_mark, caapt-uc:diatopic_mark) for the NOT EXISTS condition to evaluate to false, even if there is exact match between all the suggestion's crm:P129_is_about and the use context's ontolex:usage sub-properties. An alternative SPARQL query formulation is reported next. Similar to the original query, it checks for exact matches between use context and suggestion conditions. However, I think that exact match may be too restrictive depending on scenarios: e.g., what if there is time overlap between suggestion and use context, but no exact equivalence? In general, these difficulties and the complexity of the query for this CQ stem from not having explicitly linked caapt:Suggestion individuals to caapt-uc:UseContext individuals, which might be done at population time possibly using dedicated logic, difficult to express in SPARQL, to account for partial/non-exact matches. Alternative SPARQL formulation:
```
SELECT ?use ?sugValue {
  EG:TERM ontolex:sense ?use ;
          ^ontolex:isEvokedBy/culco:hasSuggestion ?sug .
  ?sug rdf:value ?sugValue .
  # conditions attached to the suggestion and to the use context, if any
  OPTIONAL { ?sug crm:P129_is_about ?sugCondition }
  OPTIONAL { ?use caapt-uc:diachronic_mark|caapt-uc:diatopic_mark|caapt-uc:diaevaluative_mark|caapt-uc:intended_target_mark ?useCondition }
  # bound only when both conditions are bound and equal
  BIND (IF(?sugCondition = ?useCondition, ?sugCondition, 1/0) AS ?sharedCondition) # 1/0 = error, producing no binding
}
GROUP BY ?use ?sugValue
# exact match: same number of distinct conditions on both sides, all of them shared
HAVING (COUNT(DISTINCT ?sugCondition) = COUNT(DISTINCT ?useCondition) && COUNT(DISTINCT ?sharedCondition) = COUNT(DISTINCT ?sugCondition))
```
* (M12) [Table 6] For CQ 41, the ORDER BY syntax is wrong (or at least, non-standard) due to the presence of the AS keyword. Suggest using: ORDER BY DESC(COUNT(DISTINCT ?encounter)) (note: DISTINCT may or may not be needed).
* (M13) [Table 6] For CQ 37, the second OPTIONAL clause is redundant and the SPARQL query can be simplified (also removing the following BIND clause), unless the intention was to match persons and groups separately and that involves different ontology properties.
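Regarding M7, a check along the following lines could be run over the KG to find attestations that lack the properties a FrAC-aware user would expect. It is a minimal sketch: the frac: namespace IRI is assumed from the OntoLex FrAC draft and should be verified, and the query assumes attestations are explicitly typed as frac:Attestation.

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# Assumed namespace IRI: verify against the FrAC specification.
PREFIX frac: <http://www.w3.org/ns/lemon/frac#>

# Attestations missing the enclosing document and/or the quoted text.
SELECT ?attestation ?missingObservedIn ?missingValue
WHERE {
  ?attestation a frac:Attestation .
  BIND (NOT EXISTS { ?attestation frac:observedIn ?document } AS ?missingObservedIn)
  BIND (NOT EXISTS { ?attestation rdf:value ?quotedText } AS ?missingValue)
  FILTER (?missingObservedIn || ?missingValue)
}
```

An empty result would indicate that both properties are populated for every attestation; documenting this expectation alongside the ontology would help practitioners who cannot inspect the KG itself.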
### LIST OF TYPOS AND PRESENTATION ISSUES ###
* (T1) [page 1] "... larger goals.However ..." -> missing space before "However"
* (T2) [page 2] "The acronym "CAAPT" is used ... terminology" -> I suggest simply introducing the CAAPT acronym when the project is first mentioned under "Research context", and drop this sentence
* (T3) [page 2] "The ontologies that will be referenced ... RDF, RDFS ..." -> [minor] RDF is a data model and RDFS is an ontology language, primarily. That said, I understand that they come with vocabularies and those are referenced in the paper. Suggest possibly rephrasing. Also consider adding/moving references to Table 1, where their association to corresponding ontology/vocabulary will be clearer.
* (T4) [most tables] If allowed by the template, suggest increasing vertical separation between rows. In most tables, like Tables 1 and 2, there is little/no separation, which, combined with multi-line row contents, makes it difficult to see where a row ends and to connect cells of the same row.
* (T5) [page 3] "... more direct guidance: and ..." -> remove "and"
* (T6) [page 3] "... (42; 43) describe ..." -> [minor] suggest "The works in (42; 43) describe ..." (or mention authors' names), and similar for other citations
* (T7) [page 3] "... goes back a 2021 ..." -> "to a"
* (T8) [page 4] "... 170 years of collecting activity, and which is still ongoing ..." -> remove "and"
* (T9) [page 4] "... (TGD) ... (WM) ... (ITG) ..." -> [minor] consider introducing these acronyms when corresponding resources are first introduced (background section)
* (T10) [page 4] "... their reuse and reference ... evidences ..." -> "evidence"
* (T11) [page 5] "... and constantly changing ..." -> "change" (consider simplifying sentence)
* (T12) [page 5] "Terms defined as a related to ..." -> remove "a"
* (T13) [page 6] There is substantial overlap between bullets 2a and 2b, and between 3a and 3b: consider restructuring these lists
* (T14) [page 6] "They also contain references ... 4a ... 4b ... 4c ..." -> Suggest having this paragraph as further bullet 4 of previous list
* (T15) [page 7] "... combined with the researcher's experiences regularly attending ..." -> "experience(s) in" or just drop "experiences"
* (T16) [page 7] "... additional notes ... that they feel is relevant" -> "are"
* (T17) [page 8] Suggest changing paragraph title "Ontolex" to "Ontolex, LexInfo and Lex-0" or similar, as LexInfo and Lex-0 are also covered there
* (T18) [page 8] "... only being to connect ..." -> "only being employed to connect"
* (T19) [page 8] "... to from being a generic attribute ..." -> remove "to"
* (T20) [page 8] In Table 4, the match of LexInfo and Lex-0 properties - as well as the diachronic / diatopic / etc. concepts - to CAAPT is only approximate (for reasons discussed in the text), hence I suggest reflecting that in the CAAPT column caption and possibly moving that column to the end of the table
* (T21) [page 9] "To accomplish this, an entity ... instead." -> I find this sentence unclear. Suggest rephrasing.
* (T22) [page 9] "... to have skosXL:Label as their domain" -> I think it should be "as their range"
* (T23) [page 9] "Culco is made up of two classes" -> perhaps "core classes", since later on there is also a culco:ContentiousIssueScheme being mentioned besides the two classes listed in this paragraph
* (T24) [page 9 and later] "cuclo:" -> "culco:" (multiple occurrences)
* (T25) [page 9] "... robust foundation to build from" -> "build on"?
* (T26) [page 11] "This decision responds specifically the principle ..." -> "to the principle"
* (T27) [page 11] "... other documents which entries ..." -> "whose"
* (T28) [page 11] "... of as it is ... this class as being for ..." -> remove "of", then "being used for" (or "intended")
* (T29) [page 11] "The classifications of terms is expressed ..." -> "classification"
* (T30) [page 11] "... can then come the subject ..." -> "become"
* (T31) [page 12] "... are not the entries ..., but instead are the concepts ..." -> replace "but" with "which"
* (T32) [Figure 3] The relation caapt:Suggestion --crm:P129_is_about--> crm:E55_Type + skos:Concept is represented twice in the figure
* (T33) [page 13] "... as this is the language used in the source materials ... in the record" -> unclear, suggest rephrasing
* (T34) [page 13] "wwhat" -> "what"
* (T35) [page 13] "Suggestions also be further ..." -> "can also" (or "may", etc)
* (T36) [page 13] "... the intended meaning that has been attributed to the use of the term in the given context, and who the term was intended to refer to when it was used in the context being described ..." -> here, "intended meaning" and "who the term was intended to refer to" appear to overlap and both regard the *denotation* of the term, whereas "intended meaning" here actually refers to term *connotation* or, as written earlier in the paper, the attitude of the speaker / sociopolitical context, for which caapt-uc:diaevaluative_mark is introduced; suggest rephrasing to avoid confusion in the reader
* (T37) [page 13] "... where an analysis of the descriptions revealed ... of the term." -> redundant text whose content is already covered by prior sentence "Use contexts can be described by four attributes ..."
* (T38) [page 14] "... and things (crm:E55_Type)" -> "concepts" may be more appropriate than "things", due to referring to instances of crm:E55_Type
* (T39) [Figure 4] "caapt-uc:describes_relacing" -> "replacing"
* (T40) [page 15] "Terms are connected to instances ... discussed further below." -> suggest adding references to corresponding elements in Figure 5, to make the text clearer and further facilitate interpreting the figure
* (T41) [page 15] "set of technology" -> "technologies"
* (T42) [page 15] "... strategy taken by the DE-BIAS Project who use OA ..." -> "that uses"
* (T43) [Figure 5] In relation oa:SpecificResource --oa:refinedBy--> crm:E31_Document + oa:FragmentSelector, oa:refinedBy should actually be oa:hasSelector
* (T44) [page 16] Are crm:E7_Activity_1, crm:E7_Activity_2, etc., instances or classes? If instances, I suggest making that explicit in the text and possibly using a Turtle snippet to represent these assertions
* (T45) [page 16] "provided in at" -> remove "in"
* (T46) [page 17] "includes two skos:Collection" -> different font for "s"
* (T47) [page 17] "... domain position of a triple ..." -> "object" position?
* (T48) [page 17] "there are 34 cases" -> in Figure 6, I get 19 + 13 + 1 = 33
* (T49) [page 17] "(Table 5, row 1, "Entries listed in source documents")" and "(Table 5, row 2, "Terms listed or referenced in source documents")" refer to non-existing rows in Table 5
* (T50) [page 18] "... bring compatible strengths to the knowledge graph." -> "complementary"?
* (T51) [page 19] "... are shown in Table 6. and a reference ..." -> remove period
* (T52) [Figure 7] "indian_sug-8" different from "indian_s7" of previous turtle snippet (same for other suggestions)
* (T53) [Table 6] "distinct" -> "DISTINCT" (for consistency)
* (T54) [page 22] "... make space to represent details around ..." -> suggest adding a colon after "around", since it is followed by a semicolon-separated list
* (T55) [page 23] "knowledges" -> "knowledge"
* (T56) [page 24] "... by integration the results ..." -> "integrating"
* (T57) [all figures] Suggest using a vector format (vs. the current raster one) for figures so that their text is searchable, which would make it possible to search for a concept and find its occurrences in the diagrams