Migration of a library catalogue into RDA linked open data

Tracking #: 1453-2665

Gustavo Candela
Pilar Escobar
Rafael Carrasco
Manuel Marco-Such

Responsible editor: Christoph Schlieder

Submission type: Dataset Description

The catalogue of the Biblioteca Virtual Miguel de Cervantes contains about 200,000 records which were originally created in compliance with the MARC21 standard. The entries in the catalogue have recently been migrated to a new relational database whose data model adheres to the conceptual models promoted by the International Federation of Library Associations and Institutions (IFLA), in particular the FRBR and FRAD specifications. The database content was later mapped, by means of an automated procedure, to RDF triples which mainly employ the RDA (Resource Description and Access) vocabulary to describe the entities, as well as their properties and relationships. This RDF-based semantic description of the catalogue is now accessible online through an interface which supports browsing and searching the information. Due to their open nature, these public data can easily be linked and used for new applications created by external developers and institutions. The methods applied for the automation of the conversion, which build upon open-source software components, are described here.

Solicited Reviews:
Review #1
By Carlo Meghini submitted on 15/Nov/2016
Review Comment:

The manuscript describes the steps followed in generating the dataset from an initial set of MARC21 bibliographic records. These steps are presented clearly and the relevant challenges are properly highlighted, although many details (especially in the preprocessing) had to be omitted for reasons of space. This is not a big problem as long as interested readers can contact the authors for clarification. In this respect, it would help if the authors could make the relevant material available on demand.

The resulting dataset has been evaluated according to the standard methods for measuring its connectivity to established datasets and its compliance with the 5-star classification. It relies on a number of established vocabularies, all relevant to the bibliographic field and all duly documented.

The authors also report on links from important datasets (Wikidata) to their dataset.

Overall, a successful project resulting in a state of the art dataset.

Review #2
By Trond Aalberg submitted on 27/Nov/2016
Minor Revision
Review Comment:

The submitted revised article is a good step towards a final version. Many of the major comments on the first submission have been properly addressed, and the paper is now well structured with clear content and a clear contribution.

The paper still needs a new round of revision which mainly should focus on the following parts:

I would like to see a more insightful discussion of the “semantic quality” of the data. The biggest problem in FRBRized data sets is the occurrence of false positives. These are typically found for works that have more than one expression and for authors having more than one work. This is also the part of the result that is likely to be linked to/from (reused). The paper documents that 112 work groups have been inspected, but does not say how they were selected or what they represent. Checking a random selection of groups will give a number for correctness, but that number does not really say anything about the actual quality of the data. 8 false positives out of 112 may sound like a low number, but if these errors are in the 10% of the data that is most likely to be linked to/from, it could imply very low quality.
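
A rough calculation may help to illustrate this concern. The sketch below is not taken from the paper: it uses the two figures quoted above (112 inspected groups, 8 false positives) and assumes, purely hypothetically, that every error falls within the 10% of groups most likely to be linked to/from. The variable names and the rounding of that 10% share to a whole number of groups are likewise assumptions made for illustration.

```python
# Figures quoted in the review comment: 112 inspected work groups, 8 false positives.
inspected_groups = 112
false_positives = 8

# Error rate over the whole inspected sample.
overall_rate = false_positives / inspected_groups

# Hypothetical worst case (assumption, not a finding): all errors fall within
# the 10% of groups that are most likely to be linked to or from.
critical_share = 0.10
critical_groups = round(inspected_groups * critical_share)
critical_rate = false_positives / critical_groups

print(f"overall error rate: {overall_rate:.1%}")
print(f"groups in the critical 10%: {critical_groups}")
print(f"worst-case error rate in the critical groups: {critical_rate:.1%}")
```

Under these assumptions an overall error rate of roughly 7% would correspond to an error rate of over 70% in the groups that matter most for reuse, which is why the selection procedure for the inspected groups needs to be documented.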

The technical quality of the paper must be improved. There are still many errors and odd phrases.

The “em dash” is simply overused, and the spacing before and after it is inconsistent. Please replace a reasonable number of these with commas, and rephrase accordingly where needed.

Part 1, first paragraph:
• “…RDF format own description…”: I do not understand this.

Part 1, second paragraph:
• “… alternative replacement…”: RDA is a modernized version of AACR2. It is also a bit strange to state that something is an “alternative replacement”; it is sufficient to say “alternative” or “replacement”.

Part 1, third paragraph:
• “…provide easier navigation…”: RDA descriptions themselves do not necessarily provide easier navigation and retrieval….
• “…data expressed primarily in natural language text…”: I do not agree with this description of bibliographic records. They are highly structured and have some elements that contain natural language text, but most fields are more value-like than human-text-like.
• The last part of this paragraph is rather meaningless and odd.

Part 2, first paragraph
• Is the publication of linked data one of the building blocks of the semantic web, and does the publication require normalization and adaptation?
• What is a web-oriented format?

Part 2, second paragraph
• The TELplus project was an experiment on a selected set of records. Use words like experiment, case study, prototype etc.

Part 2, fourth paragraph
• “different approach to musical content”: different approach “based on” or “for” musical content?

Part 2, paragraph 9
• FRBRoo is an elaborated version of FRBR implemented as an extension of CIDOC CRM.

Part 2, paragraph 12
• It is a bit simplistic to describe BIBFRAME just as an RDF-based alternative to MARC21. “Replacement for” is maybe more appropriate?

Part 3, first paragraph:
• “see for instance….”: remove the comma.
• “, a common requirement …”: can be deleted.

Part 3.1
• Consider using an alternative to a bulleted list. Bulleted lists are not particularly readable when each item is a longer text.

Part 3.2, second paragraph
• Comparing persistent storage technologies with semantic storage? Most triple stores are persistent too.

Part 3.3, second paragraph
• The subject relationship between work and author is already described in FRBR. Why present it as something you are introducing?

Part 3.3, third paragraph
• Reference to figure 2 should be to figure 3.

Part 4:
• The defined constraints are only documented by a URL to a git repository. In the git repository I was looking for a readable listing of these constraints, but could not find any.

Reference [1] is missing information that is needed to retrieve this publication. Describe all conference proceedings with the same level of detail.