Hello Cleveland! Linked Data Publication of Live Music Archives

Paper Title: Hello Cleveland! Linked Data Publication of Live Music Archives
Sean Bechhofer, David De Roure, Kevin Page
We describe the publication of a linked data set exposing metadata from the Internet Archive Live Music Archive. The collection provides access to recorded performances and is linked to existing musical and geographical resources. The dataset contains over 17,000,000 triples describing 100,000 performances by 4,000 artists.
Submission type: Dataset Description
Responsible editor: Pascal Hitzler
Major Revision

Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-special-ca...

Resubmission after a "reject and resubmit" and subsequent "major revisions". First round reviews are beneath the second round reviews.

Solicited review by Prateek Jain:

The revised version of this paper does not cover all the issues raised by me and the other reviewers. For example:

What kind of additional relationships beyond owl:sameAs can be useful for this dataset? Can OWL based modeling help with the dataset?

Note: Granted, the authors use relations beyond owl:sameAs, and they refer to the other ontology. However, how this helps them, and which requirements it fulfills, is not explained.

I do not see any discussion of dataset maintenance either, which I believe is an important aspect of any dataset.

What unprecedented modeling challenges were faced?

Have the authors investigated using a tool like SILK for doing the linkage?

I did not find answers to any of these questions in the new version either.

Solicited review by Peter Haase:

The authors present the Hello Cleveland! data set, which exposes metadata from the Live Music Archives as Linked Data. The paper is well written and structured, with a clear description of the data set itself as well as the process of its generation.

The data set (re)uses many state-of-the-art vocabularies, ontologies and patterns, including e.g. the music ontology, similarity ontology, provenance ontology, SKOS, PROV-O, VoID and others. Further, the data set links to numerous other data sets, such as MusicBrainz and Geonames.
In this sense, the data set is a good example of how data sets can be properly published by reusing ontologies to address typical, re-occurring publishing problems.

On the other hand, the use of these ontologies and patterns seems overly complex considering the simplicity of the problem. This introduces difficulties and barriers in using the data set.
For example, the problem of identifying links between this data set and MusicBrainz is essentially left to the consumer of the data set. While the similarity values are published, they are not effectively usable as links: on the one hand, the relationships are rather hard to understand and query (compared with a sameAs link); on the other hand, I doubt their quality, since the similarity measure based on the names of the artists is rather ineffective. The authors do not provide an evaluation of the quality of the links/similarities.

Regarding the relevance of the data set, it would be nice to see examples of adoption, sample applications, or at least some indication of what the expected/intended uses are.
As it is described now, I doubt that the data set is relevant enough to justify a journal publication.

Minor comments:
- consistency: linked data vs. Linked Data
- In the data graphs in Fig. 1 and 2, it would be good to follow standard RDF notation (e.g. labeling edges with rdf:type, rdfs:subClassOf etc. instead of color coding).

Solicited review by Danh Le Phuoc:

The authors added some explanatory text in response to the reviewer's comments, but they did not completely address the reviewer's concerns. However, I agree to accept this article, with the recommendation that the following comments be considered in the final version.

It is interesting to use the Similarity Ontology instead of owl:sameAs to relate LMA entities to external entities. However, it would be interesting to see the quantitative difference between the two, for instance the percentage of matches.

I assume that using the Similarity Ontology makes SPARQL queries on this dataset different from ordinary queries, since the matching patterns of the queries are driven by the Similarity Ontology. Therefore, example queries with such peculiar patterns should be demonstrated.

Clearer explanations of the figures on external links and of the connections among individuals would be appreciated.

First round reviews:

Solicited review by Prateek Jain:

The work "Hello Cleveland! Linked Data Publication of Live Music Archives" explains a dataset related to large community-contributed collections of live recordings.

The dataset was generated using the Internet Archive Live Music Archive. The data source provides free and non-commercial access to the archives. The main contribution by the authors is in creating the dataset using this information and existing vocabularies. Further, they also link to the MusicBrainz and Geonames datasets.

The dataset contains information about artists, venues and tracks. Considering that numerous datasets in this field, such as MusicBrainz, Last.FM and Jamendo, are available as part of the LOD, the usefulness does not require any justification. The key distinction of this dataset is its creation from live recordings and its use of the Internet Archive.

Based on the guidelines provided on the CFP for this issue, the work addresses most of the points.

* Description of the dataset

The work gives sufficient information about the name, URL, licensing and availability. It also covers the topics addressed, statistics about interlinks, and details about the use of existing vocabularies. The authors have given a good description of the known shortcomings of the dataset.
However, I find it very strange not to see even a single RDF/OWL snippet of any key entity modeled in the paper. Further, I am curious: because the collection does not distinguish between artists and bands, how is the relationship between a singer and a band identified? I would imagine it is an important relationship to model in the music domain.

* Quality and usefulness of the dataset
The dataset is definitely useful as it deals with a different part of the music dataset, i.e. the live music archives.

I have a few questions though and would hope the authors address them subsequently.

What motivated the authors to create this dataset and use this source?

What practical problems are the authors planning on solving using this dataset?

What kind of additional relationships beyond owl:sameAs can be useful for this dataset? Can OWL based modeling help with the dataset?

What unprecedented modeling challenges were faced?

Is the name for the dataset inspired from the Cleveland Rock n Roll Hall of Fame?

Have the authors investigated using a tool like SILK for doing the linkage?

What is the threshold used for linking lat/longs as described on Page 4?

What is the overlap versus new entities in the dataset? I do see some figures for this in Table 2. So would the number of new entities be 4000 (artists) - 1168 (artists)? But I would imagine not all entities linked to MusicBrainz are artists, so what is the real figure?

* Clarity and completeness of the descriptions.

The work is nicely written and presented, and gives sufficient details.

Overall, I would say that while the dataset can be useful, a lot of questions have been left unanswered. I hope the authors can answer these questions, which would make it a high-quality submission.

Solicited review by Axel Polleres:

The paper describes an export of the Internet Archive Live Music Archive as Linked Data. The data is linked to sources such as last.fm, GeoNames and MusicBrainz, which seems adequate.
The re-use of existing vocabularies seems appropriate.

Figures 1 and 2 are not very readable; I recommend using larger fonts. Even if you do that, it seems you could make the figures smaller and thus save some space there.
Also, it is not explained what certain things mean. Some edges don't have labels (are these rdf:type?), and some are gray (it is not clear what relations they denote).

My main problem is that the work does not make clear how sustainable the data set is, i.e. what are the plans to keep it up to date and running, and how long is the project running? This should be elaborated on, e.g. in Section 6.

Also, at the moment, linkage with other datasets seems to be rather preliminary. This is OK for a first step, but I am not convinced that, at this stage and without adopters, this dataset already justifies a journal publication. I would expect this special issue to serve as a reference for sustainable, already usable datasets with proven adopters and ideas on how they can be used to motivate further usage, rather than to contain first attempts at exporters (which are certainly needed, but which I think should first go through e.g. workshops and find adopters before being published in SWJ).

I explicitly note that this might be in disagreement with the intentions of the guest editors/call for this special issue, and I leave the final decision to the editors.

The authors are fair in reporting the limitations of the current approach, and the format (except for the unclear description of the figures) is adequate to the call.
From my end, though, I would expect at least a proof-of-concept usage of the data set and of the added value from converting it into linked data, e.g. in a demo application.

Solicited review by Danh Le Phuoc:

The paper presents the dataset exposed from the Internet Archive Live Music Archive. I think the paper and dataset need significant improvements to be published. Here are some comments for improvement:

- Method of creation: I suppose that Sections 2 and 6 present this aspect. The paper mentions a layered approach, but I had a hard time understanding the technical details after that. I don't understand how the data was created or where the contributions are. It is not quite clear what the authors mean by "resource is published "as is" without attempts to align entities".

- The vocabularies used in Figure 1 are quite simple. The paper also does not reveal how many properties the original data sources contain, or what decisions were made on mapping/aligning. The paper mentions that it does not align entities; if so, explain why.

- The statistics on external links are quite modest. For example, there are no external links for over a million tracks or for 100k performances. There are no figures/statistics on internal links. From Figure 1, the dataset seems to be partitioned into several groups of instances, and the connections among these groups are scarce.

- All the further work mentioned in the paper is desirable to make the dataset useful and to bring it to a publishable level.

- Some other aspects for judging the quality of the dataset are not clearly addressed, such as reported usage, maintenance, language expressivity, quality of the dataset, and completeness of description.