DM2E: A Linked Data Source of Digitised Manuscripts for the Digital Humanities

Tracking #: 1121-2333

Konstantin Baierer
Evelyn Dröge
Kai Eckert
Doron Goldfarb
Julia Iwanowa
Christian Morbidoni
Dominique Ritze

Responsible editor: 
Christoph Schlieder

Submission type: 
Dataset Description
The DM2E dataset is a five-star dataset providing metadata and links for direct access to digitized content from various cultural heritage institutions across Europe. The data model is a true specialization of the Europeana Data Model and reflects specific requirements from the domain of manuscripts and old prints, as well as from developers who want to create applications on top of the data. One such application is a scholarly research platform for the Digital Humanities that was created as part of the DM2E project and can be seen as a reference implementation. The Linked Data API was developed with versioning and provenance from the beginning, leading to new theoretical and practical insights.

Minor Revision

Solicited Reviews:
Review #1
By Bernhard Haslhofer submitted on 28/Jul/2015
Review Comment:

This article provides a concise description of the DM2E dataset, which represents a normalized aggregation of data from a number of dataset providers that are of central importance in the Digital Humanities field. Key dataset characteristics and possible applications are well described and could serve as a blueprint for similar datasets in that domain.

The authors also addressed the reviewers' comments and provided the requested additional details. Overall, I recommend accepting this article.

Review #2
By Günther Görz submitted on 30/Jul/2015
Review Comment:

This manuscript was submitted as 'Data Description' and should be reviewed along the following dimensions: Linked Dataset Descriptions - short papers (typically up to 10 pages) containing a concise description of a Linked Dataset. The paper shall describe in concise and clear terms key characteristics of the dataset as a guide to its usage for various (possibly unforeseen) purposes. In particular, such a paper shall typically give information, amongst others, on the following aspects of the dataset: name, URL, version date and number, licensing, availability, etc.; topic coverage, source for the data, purpose and method of creation and maintenance, reported usage etc.; metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressivity, growth; examples and critical discussion of typical knowledge modeling patterns used; known shortcomings of the dataset. Papers will be evaluated along the following dimensions: (1) Quality and stability of the dataset - evidence must be provided. (2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided. (3) Clarity and completeness of the descriptions.

The paper has been revised. It meets the requirements as requested very well. The data model and the characteristics of the dataset are described clearly and comprehensively, and the justification of the design decisions for the data model is reasonable. Because of the many participating premium cultural heritage institutions, there is not the least doubt about the quality and stability of the dataset. Multiple resource handling, versioning, and questions of provenance are clearly addressed. I checked many of the links and found them working; maybe the authors should add a "last accessed" date. The API worked very well for the examples I tried out. So, in my opinion, the paper meets the requirements for a five-star dataset providing metadata and links for direct access to digitized content from various CH institutions very well.

State of the art: 5 (of 5)
comprehensiveness: 5 (of 5)
completeness: 5 (of 5)
usefulness for community: 5 (of 5)

Review #3
By Eetu Mäkelä submitted on 11/Aug/2015
Review Comment:

First, the paper has in my opinion improved a great deal, particularly with regard to clarity of expression.

However, with the benefit of this clarity I've come to realize that, in my mind, this is actually not a Linked Dataset Description paper. In fact, it reads 2/3 like a project description and only 1/3 like a dataset paper. Both of these would be interesting in their own right (the project description even more so), but trying to cram these two different narratives into a single paper does a disservice to both. Therefore, while I actually now like the content, I am still recommending a rejection and resubmission as properly focused separate papers.

With this in mind, I'll give some comments for both of these orientations:

First, for a dataset description (which as defined by SWJ is primarily aimed at potential dataset users), the article really should delve deeper into the dataset itself. Here, a further detailing of the contents of the different subdatasets _would_ be essential (i.e. which metadata fields are recorded for each subcollection, and which vocabularies they refer to). On the other hand, for a project description paper I would agree with you that such detail could be left out.

For a dataset description paper, again, the Linked Data API section should be much more detailed in exposing how to actually access the dataset programmatically (e.g. detailing and giving examples of how to use the Solr API, as well as giving direct links to where dataset dumps are provided), and should go into even more detail on e.g. how to parse and use the dataset/statement revisions (examples would be good here). On the other hand, the details on data ingestion tools, export to Europeana or Pundit integration actually don't concern a dataset user.

For a project description paper, this would again be reversed. In such a paper I would, however, be interested in additional reflection: why did you end up with two different tools for ingestion? How did the dataset providers relate to these? Is there actually a need for statement-level provenance (or actually dataset-level, for that matter)? What are those use cases? Are they actualized? How did that go? Are there any lessons to be learned from how the project approached the Europeana ingestion? Comparisons to the methods of other projects? Expand on the reasons you needed to extend Pubby, and so on.

On a separate matter, I must also note that both the search site and the LD API were again down for at least two days at the time of this re-review, so I haven't been able to validate all responses relating to e.g. provenance information encoding.

Finally, I found one typo: in "Furthermore, for datasets contains links to annotatable digital objects", you should have "containing links" instead of "contains links".