CIDOC CRM Negative Properties Test Dataset

Tracking #: 2760-3974

Stephen Stead

Responsible editor: Special Issue Cultural Heritage 2021

Submission type: Dataset Description
Abstract. This data set is intended for the testing of software tools that combine and integrate sets of semantically rich data. It exercises the ability of the software to integrate data that has been recorded at different granularities and using different recording approaches. It allows testing of the ability to detect contradictions when using positive, negative and class level assertions. The data set comes with definitions of the content and the software required to generate additional sets of similar test data.

Solicited Reviews:
Review #1
By Robert Sanderson submitted on 08/Apr/2021
Minor Revision
Review Comment:

The paper describes an artificially constructed dataset with a very narrow scope - the conservation of manuscript bindings. The basic information is provided (name, URL, version date and number, license) along with a description of its intended purpose (testing software that should consume such narrowly scoped data). It covers the source of the data (artificially constructed) and connectivity (only internal, as it's artificial). The structure described is interesting, with both instance data and a referenced poly-hierarchical thesaurus. It describes the format (XML + CSV) for the data, and a reference to the version of the CIDOC Conceptual Reference Model which it uses. Section 5 goes into a lot of detail about the possible constructs that are generated. It clearly passes the initial gate of a clear description.

However, I feel that it does not reach the necessary standards for acceptance in its current form according to the three criteria for evaluating such papers:

1 - Quality and Stability: No evidence is provided of the quality of the dataset, relative to cultural heritage linked data. Was it reviewed by subject experts to validate that the records actually resemble real world data? Are there any real world datasets that this artificial construct could be a reasonable facsimile of, and if not, is there any likelihood of them being created in the coming decade? My experience would suggest this is, unfortunately, unlikely at a scale of 28000 instances.

2 - Usefulness: Given the artificial nature, the limited scope, and no determination that it resembles any real world data, it is hard to imagine it being useful other than for the stated purpose of testing software designed to consume this particular dataset, which is not all that useful. Secondly, it is also difficult to determine the usefulness as Linked Open Data, as it is not available as Linked Open Data. The description discusses the XML and CSV files and claims that they can be transformed into LOD through software in the repository. However, that software is in a non-standard (these days) rar format. After downloading an unarchiver, the code wouldn't compile and the jar wouldn't run with Java 15.0.1 under MacOSX. The included CRM ontology was many years out of date, compared to the paper which uses the 2021 version 7.1. As such, I have grave doubts that anyone would find the dataset useful in its current state.

3 - Description: As per the initial summary, the description of what is provided is very clear. However, it uses custom extensions to the CRM, isn't provided as LOD, isn't used by anyone apart from the author and collaborators, and doesn't make references outside of itself.

For it to be acceptable, I feel that the following, relatively minor but very important, revisions must first occur:

* Provide LOD in at least one of the regular RDF formats - Turtle, N-Quads, JSON-LD or RDF/XML. This is the Semantic Web journal, after all.
* Provide evidence that the dataset is useful to anyone outside of the originating context, following criterion 2 "shown by corresponding *third-party* uses - evidence must be provided."
* Provide evidence that the dataset has been reviewed by appropriately qualified domain specialists, confirming that the artificially generated content reflects real world data, to meet the quality criterion.

Review #2
By Richard Smiraglia submitted on 09/May/2021
Review Comment:

The presentation is clear, concise, and exhaustive. One or two emendations might improve the final version.

I might have preferred an explanation of “CIDOC CRM Negative Properties” in the introduction. In particular, in the penultimate paragraph of section 1, examples of positive and negative type-level statements would be helpful.

(1) Quality and stability of the dataset is clear

(2) Usefulness of the dataset: I don’t see explicit evidence

(3) Clarity and completeness of the descriptions: clear

Papers should usually be written by people involved in the generation or maintenance of the dataset: Yes, clearly

Details about the vocabularies used: clear

Review #3
Anonymous submitted on 03/Jul/2021
Review Comment:

The paper is not very clear; it doesn't explain the domain of application and it doesn't show the usefulness of the dataset, such as how in practice this dataset has been or could be used (e.g. by discussing a specific use case). While I believe that the dataset has potential, I don't think that this paper is publishable. I think it would make more sense to integrate the description of this dataset in [5], where there are even more details, e.g. about the logic based on which contradictory statements should be identified, that are not mentioned here.

- Quality and stability of the dataset - evidence must be provided: the dataset and the source code are online. However, it seems that the CIDOC CRM extension [2] used in the dataset has not been implemented yet. I believe this is not a good practice.

- Usefulness of the dataset, which should be shown by corresponding third-party uses: it is not clear from the paper how this dataset is useful in practice, except for general considerations (such as "The dataset is designed for testing the capabilities and efficiency of software intended to draw conclusions about complementary sets of semantically rich data."). It is missing not only a third-party use case, but even a project/real-world motivation for building this dataset. As a consequence, it becomes even more difficult to understand the dataset content and scope.

- Clarity and completeness of the descriptions: A paragraph dedicated to the explanation of the domain of the dataset is missing. In this paper, many domain-specific concepts are introduced (mostly in sec. 5) without any explanation, so it is very difficult to follow.
It is not clear to me what the percentages in sec. 5 mean. In general, sec. 5 is very hard to follow.
Figures would have helped.

Ref [5] is not "forthcoming", it is still under review.