Crowd-sourced Digital Humanities linked data contributing to library datasets: the case of the Listening Experience Database

Tracking #: 828-2038

Authors: 
Alessandro Adamou
Mathieu d’Aquin

Responsible editor: 
Lora Aroyo

Submission type: 
Dataset Description
Abstract: 
In this paper, we present a linked dataset for 'early access' to information crowd-sourced as part of the Listening Experience Database project. We call it early access, consistently with the practice in modern software development, as its main aim at this stage is to collect feedback and initial use cases that can support the evolution of the dataset. The Listening Experience Database is a Digital Humanities project aimed at gathering structured and documented evidence of how music is perceived throughout history. The content is largely represented in terms of widespread ontologies for the domains of music and literature. Reuse from external datasets such as DBpedia and the British National Bibliography is guaranteed by the data entry workflow, and reused entities are re-published with data that improve upon the original datasets, for instance by modelling portions of published written works. The dataset is updated daily by both a community of enthusiasts and a team of experts, the latter also being in charge of approving and curating data.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Jacco van Ossenbruggen submitted on 05/Nov/2014
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'Data Description'.

(1) Quality of the dataset.
============================
In general, this seems to be a small (55K triples) but rich (70 different predicates, 23 different classes) and high quality dataset, as every experience needs to be grounded in a literature reference. Real quality of the content is hard to judge for a non-domain expert.

Minor technical issues:
Listening experience instances do not have rdfs:labels or dc:titles, which makes them hard to display or visualise in the UI of many public RDF tools. This is unlike other instance types, such as written works and agents, which display nicely in the tool I tried.
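
A check of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool used for this review: triples are assumed to be available as plain (subject, predicate, object) string tuples, the label predicates checked are rdfs:label, dc:title and dcterms:title, and the `http://example.org/` URIs and class names in the sample are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Predicates accepted as a human-readable label (an assumption of this sketch).
LABEL_PREDICATES = {
    RDFS_LABEL,
    "http://purl.org/dc/elements/1.1/title",
    "http://purl.org/dc/terms/title",
}


def unlabelled_instances(triples, class_uri):
    """Return the sorted instances of class_uri that have no label predicate."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == class_uri}
    labelled = {s for s, p, _ in triples if p in LABEL_PREDICATES}
    return sorted(instances - labelled)


# Hypothetical sample data, invented for illustration only.
EX = "http://example.org/"
sample = [
    (EX + "experience/1", RDF_TYPE, EX + "ListeningExperience"),
    (EX + "experience/2", RDF_TYPE, EX + "ListeningExperience"),
    (EX + "experience/2", RDFS_LABEL, "A concert heard in 1773"),
    (EX + "work/1", RDF_TYPE, EX + "WrittenWork"),
]

missing = unlabelled_instances(sample, EX + "ListeningExperience")
print(missing)
```

Run against an export of the dataset, a non-empty result for the listening experience class would list the instances that generic RDF browsers cannot display by name.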

There are 404 errors on many publisher objects. Example:
http://data.open.ac.uk/page/led/organization/Daldy+Isbister+&+Co./138618...
referred to by: The Life and Letters of Frances Baroness Bunsen
http://data.open.ac.uk/led/source/The+Life+and+Letters+of+Frances+Barone...
These publisher objects also seem to be repeating too many times (looks like a blank node issue?)

There are a few suspicious circular triples where the subject == object:
http://reference.data.gov.uk/id/day/1773-08-27 http://www.w3.org/2006/time#hasEnd http://reference.data.gov.uk/id/day/1773-08-27 .
http://bnb.data.bl.uk/id/resource/005634860 dcterms:isPartOf http://bnb.data.bl.uk/id/resource/005634860 .
http://data.open.ac.uk/time/edtf/1780-uu-uu http://www.w3.org/2006/time#hasEnd http://data.open.ac.uk/time/edtf/1780-uu-uu .
http://data.open.ac.uk/led/source/The+John+Marsh+Journals:+The+Life+and+...(1752-1828) dcterms:isPartOf http://data.open.ac.uk/led/source/The+John+Marsh+Journals:+The+Life+and+...(1752-1828) .
http://data.open.ac.uk/time/edtf/1774-uu-uu http://www.w3.org/2006/time#hasEnd http://data.open.ac.uk/time/edtf/1774-uu-uu .
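
Triples of this shape are easy to detect automatically. The following Python sketch shows one way to do it, assuming the triples have already been loaded as (subject, predicate, object) string tuples. The first two sample triples are taken from the list above, with dcterms:isPartOf expanded to its full URI; the third is a hypothetical control triple added for illustration.

```python
OWL_TIME_HAS_END = "http://www.w3.org/2006/time#hasEnd"
DCTERMS_IS_PART_OF = "http://purl.org/dc/terms/isPartOf"


def self_referential(triples):
    """Return every triple whose subject and object are the same resource."""
    return [(s, p, o) for s, p, o in triples if s == o]


sample = [
    # Two of the circular triples reported above.
    ("http://reference.data.gov.uk/id/day/1773-08-27",
     OWL_TIME_HAS_END,
     "http://reference.data.gov.uk/id/day/1773-08-27"),
    ("http://bnb.data.bl.uk/id/resource/005634860",
     DCTERMS_IS_PART_OF,
     "http://bnb.data.bl.uk/id/resource/005634860"),
    # Hypothetical well-formed triple, included as a control.
    ("http://example.org/chapter/1",
     DCTERMS_IS_PART_OF,
     "http://example.org/book/1"),
]

loops = self_referential(sample)
for s, p, _ in loops:
    print(s, p, "(points to itself)")
```

Not every hit is necessarily an error: a single day ending on itself may be intentional, whereas a work that is part of itself probably is not, so the output is best treated as a list of candidates for manual inspection.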

(2) Usefulness (or potential usefulness) of the dataset.
It is hard to predict if (and which) other user groups than those directly involved with the project will be using this dataset. However, for me the value of the contribution of the article goes beyond the data set, as also the description of the curation life cycle could be a very useful inspiration for publishers of other data sets.
(3) Clarity and completeness of the descriptions.
The description of the dataset is clear. The required basics (name, URL, licensing, availability) are nicely summarised in table 1, and the versioning policy is explained and provided with a clear rationale. More extensive metadata is available using VOiD. The license seems to have "temporarily" changed from CC BY to CC BY-NC-SA; see the authors' comment. This needs to be resolved before publication, as it might confuse potential users.

Topic coverage and sources for the data are well-defined.
The purpose of the linked dataset is demonstrated by relevant queries or inferences over it.
Applications using the dataset and other metrics of use are not available because the dataset has only been opened recently. The main usage figures are about the contributions to the dataset. It would be nice if some recent usage figures could be added in the next version.

Creation, maintenance and update mechanisms as well as policies to ensure sustainability and stability are all described in the paper.
Quality, quantity and purpose of links to other datasets: the dataset reuses data from the British National Bibliography, DBpedia, the MusicBrainz instrument taxonomy and VIAF.
The LED domain ontology has been modeled using OWL in a modular fashion, using many established vocabularies including Bibo, the DBpedia ontology, OWL-Time, BBC MO, event.owl, FOAF and Dublin Core.
Examples and critical discussion of typical knowledge modelling patterns used mainly focus on the representation of time, especially representing underspecified time references.
Known shortcomings of the dataset are discussed in the future work section.
The data is discussed in terms of the traditional five stars of linked data, not explicitly in terms of the Five Stars of Linked Data Vocabulary Use. However, IMHO this dataset would also qualify as a five-star set in these terms.

Review #2
By Marta Sabou submitted on 19/Nov/2014
Suggestion:
Major Revision
Review Comment:

This paper describes the Listening Experience Dataset (LED), a recently published dataset on the OU’s Linked Data platform, providing information about listening experiences which are gathered using a controlled-crowdsourcing approach (i.e., volunteers contribute listening experiences which are then checked and approved by moderators).

The work described has several interesting aspects (strong points). Firstly, from a content perspective, it provides a rather unique dataset, since as far as I know no similar type of data is yet available on the linked data cloud. Secondly, the data creation process relying on volunteer contributions is interesting and differs from the more frequent approach of generating linked data from databases. Thirdly, the technical realisation of the data representation and publishing approach is of high quality, thus fulfilling the “quality of the dataset” evaluation criterion of the Call.

On the less positive side, there are also some concerns. Most importantly, as the authors themselves point out, the dataset is very new and as such has not yet been used by third parties. It is therefore rather difficult to judge the usefulness of the dataset (this being one of the criteria in the Call). Given the specialised focus of the data content, a broad adoption of the dataset is probably not realistic to expect. However, this dataset might enable specialised, niche applications. It would be useful if the authors made this point clear in the paper.

The crowdsourced data creation is one of the particularities of the dataset, and therefore it would be important to briefly overview related work at the intersection of crowdsourcing and linked data. I am aware that short papers are not required to have an extensive related work section; however, given that the crowdsourcing aspect is a core differentiator of this work, the authors should position themselves in the landscape of other works in this spirit. At a minimum, a short definition of crowdsourcing and an overview of its main genres should be included. The authors should also mention other similar works (e.g., [1, 2, 3]), which use different genres (GWAPs, paid-for crowdsourcing) and also have different aims, mostly the verification of existing data. Another important aspect to cover would be providing an example data instance. Section 5.2 about URI schemes would better fit in Section 4, where dataset design issues are described, as opposed to Section 5, which focuses on dataset usage. Finally, I would suggest reconsidering the title and providing a shorter, more concise title.

Smaller comments and typos:
•p2: link against => link to;
•p3: coming across => encountering
•p3: to relate them at the time. => to relate them at the time of creation.
•P3: remove “Basically”
•P6: EDTF should probably not be shown in boldface
•P7: This is in true => This is true
•P8: and one in towards => and one towards

[1] Irene Celino: "Geospatial dataset curation through a location-based game", Semantic Web Journal, DOI: 10.3233/SW-130129, IOS Press

[2] J. Waitelonis, N. Ludwig, M. Knuth, and H. Sack. WhoKnows? Evaluating Linked Data Heuristics with a Quiz that Cleans Up DBpedia. Interact. Techn. Smart Edu., 8(4):236–248, 2011.

[3] L. Wolf, M. Knuth, J. Osterhoff, and H. Sack. RISQ! Renowned Individuals Semantic Quiz - a Jeopardy like Quiz Game for Ranking Facts. In Proc. of the 7th Int. Conf. on Semantic Systems, I-Semantics ’11, pages 71–78, 2011.

Review #3
Anonymous submitted on 16/Dec/2014
Suggestion:
Reject
Review Comment:

This is a dataset description paper concerning listening experiences of music. The dataset contains more than 1000 written experiences from 21 contributors. The data is interlinked primarily with library data.

Below, the paper is evaluated in terms of the three evaluation criteria set for this kind of paper: (1) Quality of the dataset. (2) Usefulness (or potential usefulness) of the dataset. (3) Clarity and completeness of the descriptions.

(1) Quality of the dataset

The quality of the crowd-sourced data is checked by human moderators, and the annotation process supports creating correct semantic references using autocompletion. The process seems to be reasonable, but it is difficult to say how good the actual outcome is. No direct evaluation of data quality is given.

The dataset seems quite versatile and is also interlinked with external data, such as DBpedia, BNB, and VIAF.

(2) Usefulness (or potential usefulness) of the dataset

According to the paper, the data was published two weeks before the paper was written, and there are no reports of how it is used outside the research consortium.
Applications of the data are not described, and the original database, not the linked data version, seems to be used for the current data portal for listening experiences.
It is therefore not quite clear how the data is actually used by its developers.
It would have been nice to see what benefits are obtained in the portal in practice by using the linked data.

In my mind, the usefulness of the dataset is not presented convincingly enough in the paper, even if the work done may well have potential for this.

(3) Clarity and completeness of the descriptions

A concern of the paper is that it does not present any explicit model of the data, although a link (http://led.kmi.open.ac.uk/ontology) to an OWL2-DL ontology is given, and resources and external vocabularies used in the data are listed in tables.
It is therefore not possible to evaluate the data model used. I had a look at the ontology referred to. The ontology there should be explained, and the modeling decisions made in it should be motivated.

The paper contains lots of verbal descriptions about the dataset and details. The English used is perfect, but I found the text flow a bit unstructured and difficult to read and understand. There are, for example, references to forthcoming sections on page 4, which is usually an indication of non-optimal structuring in a paper.
Presenting the data in a more structured and formal/systematic fashion could help.

It unfortunately seems that the criterion for clarity and completeness of description is not satisfied well enough, even if the paper has many virtues, too.

In further versions of the paper, consider giving more motivation for the design choices made, in addition to documenting them. Discussion of lessons learned would always be welcome in a paper like this. Related work is hardly discussed in the paper and should be extended. For example, the References currently cite only two of the authors' own publications and four standards/recommendations.


Comments

We had to temporarily re-license the dataset described in the paper as CC BY-NC-SA, http://creativecommons.org/licenses/by-nc-sa/4.0/ . This will be reflected in further manuscript updates unless the licence is reverted to the original one by then.