DM2E: A Linked Data Source of Digitised Manuscripts for the Digital Humanities

Tracking #: 831-2041

Konstantin Baierer
Evelyn Dröge
Kai Eckert
Doron Goldfarb
Julia Iwanowa
Christian Morbidoni
Dominique Ritze

Responsible editor: 
Christoph Schlieder

Submission type: 
Dataset Description
The DM2E dataset is a five-star dataset providing metadata and links for direct access to digitized content from various cultural heritage institutions across Europe. The data model is a true specialization of the Europeana Data Model and reflects specific requirements from the domain of manuscripts and old prints, as well as from developers who want to create applications on top of the data. One such application is a scholarly research platform for the Digital Humanities that was created as part of the DM2E project and can be seen as a reference implementation. The Linked Data API was developed with versioning and provenance from the beginning, leading to new theoretical and practical insights.
Major Revision

Solicited Reviews:
Review #1
By Eetu Mäkelä submitted on 05/Dec/2014
Review Comment:

Both the article and the dataset still need significant work in order to be publishable.

A big problem I have with the article is the haphazard nature of what information has been included and what has been left out. Particularly, many aspects that would help people to actually use the dataset are missing, as well as discussions that would help in determining the relevance of the dataset to a particular task.

First, the statement on licensing should be worded better. At present you first say that all RDF data is CC0, but in the very next sentence you say that this doesn't extend to the original metadata, which have restrictions of their own. For a person to evaluate the possibility of using this dataset, these distinctions and their ramifications should be discussed in much more detail, going through all subsets and noting which licenses apply to which parts of the original data.

In addition to license metadata, more information on the actual contents of the subcollections would also be of use. For example, elsewhere it is stated that the quality of the metadata ranges widely between the subcollections, with some making references to agent, place and subject authorities, while others do not. For a potential user of the dataset, this is essential information to know, along with information on exactly which reference authorities are used, etc. It also seems a lot of the data only has labels in German. This would be good to point out.

It would also help to have instance counts associated with each individual collection instead of just aggregate numbers. Also state upfront the number of total documents vs. individual pages of the documents, instead of hiding this information by forcing the reader to subtract the count for individual pages from the total count of CHOs (if I counted correctly and adequately accounted for the duplication inherent in EDM, you should have some 83,000 documents in your repository?).

I would also like the article to contain concrete examples of what the data is good for, going for example through a search and browse session that highlights some interesting connections in the data.

The data model should also be described more concretely. First, the example associated with the data model should be moved much earlier in the section, and should be used as a focus to highlight the various aspects of the model. Then, more detail on the model itself should be provided.

For example, how does the model actually align with EDM? It is stated that the DM2E model is an application-specific specialization of EDM. However, later we learn that for ingestion into Europeana, the model actually undergoes a transformation from the DM2E model to the EDM model, implying that they are distinct. This really needs to be spelled out.

It would also be interesting to know from a modeling perspective what changes were made between e.g. the 1.1 and 1.2 versions of the model, and what caused them.

There are also design decisions in the model that I would not have made, and that I would therefore like to see rationalized. First, I would think the choice of which items to show is an application-specific one. Thus, putting the dm2e:displayLevel property on the same level as more neutral metadata strikes me as odd, particularly as this information seems to be something that can be directly inferred from the hierarchy. The same goes for the dm2e:levelOfHierarchy property, even though here I can see the use in providing this information shortcut as a means of lessening query time complexity.
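To make the inference point concrete, here is a minimal sketch (plain Python; the URIs and the dcterms:isPartOf map are made up for illustration, not taken from the actual DM2E data) of deriving a level-of-hierarchy value at query time instead of storing dm2e:levelOfHierarchy on every resource:

```python
# Illustrative child -> parent map standing in for dcterms:isPartOf links.
is_part_of = {
    "page/1": "manuscript/A",
    "page/2": "manuscript/A",
    "manuscript/A": "collection/X",
}

def level_of_hierarchy(uri: str) -> int:
    """Walk isPartOf links up to the root; the root counts as level 1."""
    level = 1
    while uri in is_part_of:
        uri = is_part_of[uri]
        level += 1
    return level

print(level_of_hierarchy("page/1"))        # 3
print(level_of_hierarchy("collection/X"))  # 1
```

The trade-off the review acknowledges is visible here: the derivation is a walk up the hierarchy, so materialising the value is a shortcut that trades redundancy for query speed.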

The model also currently seems to contain a lot of other duplication that could just be inferred at query time, such as associating the organization and creator resources with each individual page of a manuscript in addition to the manuscript as a whole.
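A toy sketch of such query-time inference (all identifiers and property choices are illustrative assumptions, not taken from the dataset): given isPartOf links and a creator statement only at the manuscript level, the creator of a page can be derived on demand rather than duplicated onto every page:

```python
# Minimal in-memory triple list; creator is stated only on the manuscript.
triples = [
    ("page/1", "dcterms:isPartOf", "manuscript/A"),
    ("manuscript/A", "dc:creator", "agent/1"),
]

def inferred_creator(resource):
    """Follow isPartOf upwards until a dc:creator statement is found."""
    while resource is not None:
        creators = [o for s, p, o in triples
                    if s == resource and p == "dc:creator"]
        if creators:
            return creators[0]
        parents = [o for s, p, o in triples
                   if s == resource and p == "dcterms:isPartOf"]
        resource = parents[0] if parents else None
    return None

print(inferred_creator("page/1"))  # agent/1
```

In SPARQL the same derivation would be a property path (`?page dcterms:isPartOf+ ?ms . ?ms dc:creator ?c`), which is why the duplication is not strictly necessary for consumers with a query endpoint.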

As I said, these are probably justified design choices, but I'd just want to see the justifications.

Regarding access, at the time of writing this review, the SOLR search API did not function (but did in earlier testing). As the LD site for the most part doesn't disclose incoming references to concepts, this severely limits the ability of an LD agent to browse and search the site (e.g. I cannot currently find out which items refer to the concept Sagengestalt, because the RDF returned for that concept does not include this). Related to this, it would also be extremely beneficial if the data could be made available as a SPARQL service. At the very least, providing a data dump in addition to the current LD site would be a must.
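For reference, the dereferencing step an LD agent performs can be sketched as follows (a minimal illustration using only the standard library; the concept URI is a placeholder, not a real identifier from the dataset):

```python
import urllib.request

def rdf_request(uri: str) -> urllib.request.Request:
    """Build a dereferencing request that asks for RDF via content negotiation."""
    return urllib.request.Request(
        uri,
        headers={"Accept": "text/turtle, application/rdf+xml;q=0.9"},
    )

req = rdf_request("http://example.org/concept/sagengestalt")  # placeholder URI
print(req.get_header("Accept"))  # text/turtle, application/rdf+xml;q=0.9
```

Without incoming links in the returned RDF (or a SPARQL endpoint to query `?s ?p <concept>`), this forward dereferencing is all an agent has, which is the limitation described above.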

Regarding the integration with Pundit and Feed, a concrete example would help. I was unable to find any dm2e:hasAnnotatableVersionAt properties on the items I browsed. Similarly, with regard to statement-level provenance, the example given didn't actually contain what was described in the paper.

From the presentation, I couldn't work out how provenance and versioning fit together as a whole. On exploration, it seems that the different VoID dataset versions record which items were produced in each run, along with batch details. However, the items appear to be recorded in these version datasets using their unversioned identifiers, so that in fact even earlier version datasets always refer to the newest version of an item! I also didn't find the versioning links between the resource maps that were advertised in the paper, which would have remedied this situation somewhat.
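The identifier problem can be shown with a toy model (identifiers and the versioning scheme are invented for illustration): if a version dataset records only unversioned identifiers, dereferencing any of them yields the latest state, regardless of which version dataset the identifier came from:

```python
# Unversioned identifier -> whatever the current version happens to be.
latest = {"item/42": "item/42?version=3"}

# A v1 version dataset that records only the unversioned identifier.
version_dataset_v1 = ["item/42"]

def resolve(identifier: str) -> str:
    """Dereference an identifier; unversioned ids land on the newest state."""
    return latest.get(identifier, identifier)

# Even the v1 dataset dereferences to version 3:
print([resolve(i) for i in version_dataset_v1])  # ['item/42?version=3']
```

Recording versioned identifiers (or versioning links between resource maps, as the paper advertises) would let an earlier version dataset pin its items to the state they had at that run.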

Minor comments:
* In the introduction, a reference is made to "Scholarly Primitives" with an outside reference. If these are important enough to be mentioned, they should also be spelled out in this paper.

Review #2
By Bernhard Haslhofer submitted on 09/Dec/2014
Review Comment:

This dataset provides interlinked metadata from several cultural institutions across Europe, focusing on manuscripts and old prints. It is available as a 5-star dataset under a public domain dedication, which makes it reusable in other application scenarios. The underlying data model is based on Europeana's EDM model, which provides a high level of interoperability with other related datasets. Instead of introducing new properties, it reuses vocabulary definitions from established vocabularies, which is beneficial for interoperability. All resources are accessible via their URIs, and searchable via a SPARQL endpoint and a RESTful search API. Provenance has been considered as well. In total, I consider this to be a high-quality dataset.

The DM2E dataset comprises an impressive number of collections, which are certainly useful for scholars and application developers in the digital humanities domain. The authors have also demonstrated the usefulness of this dataset by providing a faceted search interface and a tool that supports scholars in annotating documents.

The dataset description is clear and provides useful guidance for potential consumers of the dataset.

Minor issues:

- Sec 3: The DM2E model as created and refined IN an...

Review #3
Anonymous submitted on 23/Mar/2015
Review Comment:

I like that paper. It presents an interesting and - from the viewpoint of the Digital Humanities - very important application: a linked data source of digitised manuscripts of various kinds (cf. table on p. 3). The data model and the characteristics of the data set are described clearly and comprehensively and the justification of the design decisions for the data model are reasonable.

The paper meets the technical characteristics for the description of the dataset and metadata standards very well. Due to the many participating premium cultural heritage institutions there is not the least doubt about the quality and stability of the dataset. Since delivery to Europeana is one of the main goals, I presume that all licensing problems have been clarified. Multiple resource handling, versioning, and questions of provenance are clearly addressed. I checked many of the links and found them working; maybe the authors should add a "last accessed" date.

The API worked very well for the examples I tried out. So, in my opinion, the paper meets the requirements for a five-star dataset providing metadata and links for direct access to digitised content from various CH institutions very well.

State of the art: 5 (of 5)
Comprehensiveness: 5 (of 5)
Completeness: 5 (of 5)
Usefulness for community: 5 (of 5)