The Rijksmuseum Collection as Linked Data

Tracking #: 1210-2422

Authors: 
Chris Dijkshoorn
Lizzy Jongma
Lora Aroyo
Jacco van Ossenbruggen
Guus Schreiber
Wesley ter Weele
Jan Wielemaker

Responsible editor: 
Harith Alani

Submission type: 
Dataset Description
Abstract: 
Many museums are currently providing online access to their collections. State-of-the-art research in the last decade shows that it is beneficial for institutions to provide their datasets as Linked Data in order to achieve easy cross-referencing, interlinking and integration. In this paper, we present the Rijksmuseum linked dataset (accessible at http://datahub.io/dataset/rijksmuseum), along with collection and vocabulary statistics, as well as lessons learned from the process of converting the collection to Linked Data. This dataset contains over half a million objects, including detailed descriptions and high-quality images released under a public domain license.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Dana Dannells submitted on 04/Dec/2015
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'Data Description' and should be reviewed along the following dimensions: Linked Dataset Descriptions - short papers (typically up to 10 pages) containing a concise description of a Linked Dataset. The paper shall describe in concise and clear terms key characteristics of the dataset as a guide to its usage for various (possibly unforeseen) purposes. In particular, such a paper shall typically give information, amongst others, on the following aspects of the dataset: name, URL, version date and number, licensing, availability, etc.; topic coverage, source for the data, purpose and method of creation and maintenance, reported usage etc.; metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressivity, growth; examples and critical discussion of typical knowledge modeling patterns used; known shortcomings of the dataset. Papers will be evaluated along the following dimensions: (1) Quality and stability of the dataset - evidence must be provided. (2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided. (3) Clarity and completeness of the descriptions. Papers should usually be written by people involved in the generation or maintenance of the dataset, or with the consent of these people. We strongly encourage authors of dataset description papers to provide details about the used vocabularies, ideally using the 5 star rating provided here.

The paper describes the current state of the Rijksmuseum collection as Linked Data. It presents the history of the data and its characteristics, and provides some statistics and an overview of the links from the collection. The paper is solid work that points to the state of the art in semantic technologies for the cultural heritage domain. It is clear and well written; however, there are some weak points regarding the description of the usefulness of the data, and little is said about some aspects of the dataset, which I specify below.

(1) Quality and stability of the dataset - evidence must be provided.
No evidence of the quality and stability of the dataset is provided, although there is a link to an object that suggests the dataset is stable. Nevertheless, there is no mention of how frequently the dataset is updated, what the major difficulties are in keeping it up to date while keeping the links to other datasets stable, or what procedure is followed for maintaining version numbers.

You mention in Section 6 that there is a danger of the Getty vocabularies disappearing. What is this statement based on? Please add a reference. The authors also write that the museum chooses to maintain its own vocabulary; what implications does this have for your dataset and for others?

The authors should also consider adding more description of the Iconclass vocabulary: how was it linked, and is it sufficient? Furthermore, the authors write that 189,041 objects have at least one Iconclass annotation. What about the remaining objects? How many of them are linked to other annotations, and to which ones?

In the last paragraph of Section 6 the authors mention a matching process. Was it an automatic process? If so, how accurate is it, and what are the major difficulties?

(2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided.
The authors describe the usage of the data rather than its usefulness. There are some statistics about the number of people who are using the collection, and some older references to systems which demonstrate that the collection was in use 10 years ago and is in use nowadays through the Europeana API, but they say nothing about the purposes of these usages. How does this collection extend other collections, and what benefits does it bring to other users? Do you know who uses the collection? What are the types of API requests? What has changed between the previous systems you refer to and today's systems?

(3) Clarity and completeness of the descriptions.
Related to Figure 1, it would be interesting to see a figure representing the hierarchy of these concepts in your model, for example how many concepts are related to ic:71 via skos:broader.
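The count asked for here could be obtained along the following lines. This is only a minimal sketch on toy data: the triples are represented as plain tuples of prefixed names, and every identifier other than ic:71 and skos:broader (the child notations, the function name `count_narrower`) is a hypothetical placeholder, not taken from the dataset.

```python
from collections import defaultdict, deque

SKOS_BROADER = "skos:broader"

def count_narrower(triples, root):
    """Count concepts that reach `root` through one or more skos:broader steps."""
    # Invert the broader relation: parent -> set of direct children.
    children = defaultdict(set)
    for s, p, o in triples:
        if p == SKOS_BROADER:
            children[o].add(s)
    # Breadth-first walk down the hierarchy, guarding against cycles.
    seen = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child != root and child not in seen:
                seen.add(child)
                queue.append(child)
    return len(seen)

# Toy data (assumed for illustration only).
toy_triples = [
    ("ic:71A", "skos:broader", "ic:71"),
    ("ic:71A1", "skos:broader", "ic:71A"),
    ("ic:71B", "skos:broader", "ic:71"),
    ("ic:73", "skos:broader", "ic:7"),
    ("ic:71A", "skos:prefLabel", "toy label"),
]

# Direct children only, versus all descendants.
direct = sum(1 for s, p, o in toy_triples if p == SKOS_BROADER and o == "ic:71")
total = count_narrower(toy_triples, "ic:71")
print(direct, total)
```

Reporting both the direct and the transitive count per top-level concept would give a compact picture of how deep and how wide each branch of the hierarchy is.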

Most of the predicates listed in Table 1 are straightforward and easy to understand, but the authors could expand the text in Section 5 and describe these predicates more thoroughly. For example, was dcterms:hasPart also used to specify objects belonging to an exhibition or a collection? How many "edm:type" values are there in the Rijksmuseum collection? Is it a closed set of types? Do you use any vocabulary for the types?

In the discussion section, the authors make some claims without providing any references, for example "only a limited number of institutions have managed to make their collection available as Linked Data" and "... many institutions are hesitant to do so, in fear of losing a possible revenue stream".

In the last paragraph: "digitised objects are added on a daily basis and employees extend and refine information ...". Is this done manually? Can you please elaborate on how information is defined?

Some minor comments:
* The abbreviations presented in Fig. 1 should be explained. For example, I assume ic: is the abbreviation for Iconclass, but this is not stated explicitly before the figure is presented.
* In Section 5 on page 4, state explicitly that "textual description" is dc:description, in the same way you exemplify the other predicates in this paragraph.
* The URL in reference [1] is not reachable.

Review #2
By Mariana Damova submitted on 06/Dec/2015
Suggestion:
Accept
Review Comment:

This paper presents the 5 star Linked Data version of the Rijksmuseum collection. It explains the value of such a dataset from different angles, e.g. as part of the linked data movement, as part of the cultural heritage domain that has huge potential to benefit from open access and linked data, and as part of a museum that has recognized the potential impact of opening the collection, measured in traditional museum visits.

It presents in detail the process of transforming the data into a semantic format, and the model used to describe them - the Europeana Data Model - emphasizing the ease of extending it with museum-specific information if necessary. The paper discusses the linking of the representation model and the semantic dataset to three other datasets, e.g. SKOS, the Art and Architecture Thesaurus and the Short-Title Catalogue Netherlands. This puts the collection in a broader context of art and museum artifact description. The size of the dataset is discussed in several sections of the paper, showing the number of museum objects, the number of generated triples, the number of objects with images, etc. The URIs of the museum objects are indicated in such a way that they can be used to trace back the entire collection - http://rijksmuseum.sealinc.eculture.labs.vu.nl/browse/list_graphs. A special section dedicated to dataset statistics outlines different facets of it, thus making very clear what a potential re-user can expect from it, and how to approach it in a more efficient way. The API that allows access to the data for re-use is also provided. Finally, the authors report on the usage of the dataset, and show the number of visitors and registrations of interested parties to explore and employ the Rijksmuseum dataset for creative applications via Europeana Thought Lab, e.g. MultimediaN E-Culture project, CHIP demonstrator, SEALINCMedia, and an impressive number of 34 206 unique visitors on the Europeana portal over a period of two to three months in 2015.

One important aspect of the paper is that it reveals that the entire infrastructure for converting and handling semantic linked data, communicating with Europeana, etc. is in place at the Rijksmuseum, and that it is its curators who are responsible for the entire publication pipeline. This is an example of a best practice that can serve as a guideline for many cultural institutions around the world.

This paper presents a very valuable dataset in a field that can benefit immensely from open access and from publishing in a semantic linked data format. It showcases a pioneering effort not only from a technical standpoint, by demonstrating the transformation of museum collections into digital re-usable collections in a semantic format, but also as a leading example of actual re-use of the dataset in a creative manner, and as an exemplary best practice of empowering museums to embrace modern technologies to enhance their work and enrich their audience.

I think this paper meets all standards to present a dataset, set by SWJ, and goes beyond this by outlining the whole context of cultural institutions dealing with such datasets and the whole tremendous world of potential creative usages and their impact on spreading quality and culture.

Review #3
By Eetu Mäkelä submitted on 12/Jan/2016
Suggestion:
Minor Revision
Review Comment:

This article presents the fruits of a series of quality long-standing research projects. It does so in a clearly organized way, with a succinct narrative that was a joy to read. In short, an excellent article.

I do have a couple of small suggestions still. First, in Table 1, could you add notes on whether the objects of the properties are literals, objects or both? Possibly also add 1) example objects and 2) source vocabularies to the table where applicable?
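The literal-versus-resource notes requested for Table 1 could be derived mechanically from the data. The following is a rough sketch under stated assumptions: triples are given as tuples in an N-Triples-like notation where literal objects start with a double quote, the sample values are invented, and dc:creator and the `obj:`/`person:` identifiers are hypothetical examples rather than properties confirmed for this dataset.

```python
from collections import defaultdict

def object_kinds(triples):
    """Map each predicate to 'literal', 'resource' or 'both'."""
    kinds = defaultdict(set)
    for _s, p, o in triples:
        # Assumption: literals are serialised with a leading double quote.
        kinds[p].add("literal" if o.startswith('"') else "resource")
    return {p: "both" if len(k) == 2 else next(iter(k)) for p, k in kinds.items()}

# Toy data (assumed for illustration only).
sample = [
    ("obj:1", "dc:description", '"Een schilderij"@nl'),
    ("obj:1", "dcterms:hasPart", "obj:2"),
    ("obj:1", "dc:creator", '"Rembrandt"'),
    ("obj:2", "dc:creator", "person:1"),
]

summary = object_kinds(sample)
for predicate, kind in sorted(summary.items()):
    print(predicate, kind)
```

A predicate reported as "both" is a natural candidate for a footnote in the table, since consumers need to handle two kinds of values for it.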

From the present text, it is also unclear to me in which languages the various properties are available. I'm guessing that at least the descriptions are in Dutch, but this should be spelled out for each combination. Also, what languages are the labels in for the Rijksmuseum thesaurus?
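A per-property overview of languages could be produced by tallying the language tags on literal values. The sketch below assumes the same N-Triples-like tuple notation (`"text"@lang` for tagged literals); the sample titles, labels and identifiers are invented for illustration, and untagged literals are grouped under a placeholder key.

```python
import re
from collections import Counter, defaultdict

# Matches a quoted literal followed by a language tag such as @nl or @en-GB.
LANG_TAG = re.compile(r'^".*"@([A-Za-z]+(?:-[A-Za-z0-9]+)*)$', re.DOTALL)

def languages_per_property(triples):
    """Count language tags of literal objects, grouped by predicate."""
    coverage = defaultdict(Counter)
    for _s, p, o in triples:
        if not o.startswith('"'):
            continue  # skip resources, only literals carry language tags
        match = LANG_TAG.match(o)
        coverage[p][match.group(1).lower() if match else "(untagged)"] += 1
    return {p: dict(c) for p, c in coverage.items()}

# Toy data (assumed for illustration only).
sample_literals = [
    ("obj:1", "dc:description", '"Een portret"@nl'),
    ("obj:1", "dc:title", '"De Nachtwacht"@nl'),
    ("obj:1", "dc:title", '"The Night Watch"@en'),
    ("concept:1", "skos:prefLabel", '"schilderij"'),
    ("obj:1", "dcterms:hasPart", "obj:2"),
]

language_table = languages_per_property(sample_literals)
print(language_table)
```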

After reading the paper, I was also left somewhat unsure about how the LD version in the end compares with the data in the original CMS in terms of breadth and accuracy. For example, how many original fields are left out aside from 'rejected creator'? Are all artworks now transformed? Does the transformation introduce any possible sources of errors or losses in precision?

Finally, while this dataset paper is not the place for it, I would be greatly interested in a longer reflective piece on the insights gained during this long process, both into technical aspects (such as literals vs resources, different ways of linking thesauri, etc.) and into CH institution cultures (desires for thesauri ownership, position towards metadata quality and "dumbing down" in Europeana, etc.).