Migration of a library catalogue into RDA linked open data

Tracking #: 1162-2374

Gustavo Candela
Pilar Escobar
Rafael Carrasco
Manuel Marco-Such

Responsible editor: 
Christoph Schlieder

Submission type: 
Application Report
The catalogue of the Biblioteca Virtual Miguel de Cervantes contains about 200,000 records which were originally created in compliance with the MARC21 standard. The entries in the catalogue have recently been migrated to a new relational database whose data model adheres to the conceptual models promoted by the International Federation of Library Associations and Institutions (IFLA), in particular to the FRBR and FRAD specifications. The database content was then mapped, by means of an automated procedure, to RDF triples which mainly employ the RDA (Resource Description and Access) vocabulary to describe the entities as well as their properties and relationships. Compared to a direct transformation, the intermediate relational model (which ensures, for example, referential integrity) provides tighter control over the process and, therefore, enhanced validation of the output. This RDF-based semantic description of the catalogue is now accessible online through an interface which supports browsing and searching the information. Due to their open nature, these public data can be easily linked and used for new applications created by external developers and institutions. The methods applied for the automation of the conversion, which build upon open-source software components, are described here.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 26/Oct/2015
Major Revision
Review Comment:

This manuscript was submitted as 'Application Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described application (convincing evidence must be provided). (2) Clarity and readability of the describing paper, which shall convey to the reader the key ideas regarding the application of Semantic Web technologies in the application.

The paper reports the main steps of a project to migrate a library catalogue structured according to the MARC21 standard into RDA linked open data.

The paper is acceptable on dimension (2), since it is well written and quite readable (apart from several minor mistakes which can be cleaned up with proofreading).

The paper is, however, not acceptable on dimension (1), since it does not offer any contribution. This is not to say that the project per se was not high-quality or important. But the paper fails to highlight the quality or the importance of the project, presenting it as the n-th migration process, which does not add anything to those already carried out, some of which are described in the Related Work section.

More specifically, the basic motivation of the project was to develop tools that assist libraries in transforming old records into the new formats that facilitate automatic processing of information, because the textual descriptions contained in those old records cannot be easily read and interpreted by computers (beginning of Section 3). But then the paper says nothing at all about the methods and algorithms employed to overcome the limits of the old records. How were the MARC records mapped into relational rows? How were the textual descriptions processed to extract the formal knowledge held in the new structures? How were the values found in the MARC fields transformed into relational values and (subsequently) into URIs or literals? The paper provides only general information about the data models and ontologies employed, and (irrelevant) technical details about the languages and frameworks used in the implementation, but, as I said, it fails to give the truly significant information.

I also have many doubts about the proposed architecture. The passage through a relational database is justified (Section 3.1) on the grounds that the relational data model provides controls over data integrity, in contrast to RDF. To exemplify, the authors mention referential integrity on the domain and range of relationships (such as creatorOf) and cardinality constraints. This is simply not true. RDF/S offers the rdfs:domain and rdfs:range properties for enforcing the mentioned referential integrity, while OWL offers terms for expressing cardinality constraints. Since OWL can be encoded in RDF, cardinality constraints can also be expressed in RDF. So, the basic motivation offered to justify the use of a relational database does not stand. In fact, one wonders why the catalogue was not migrated from MARC to RDF directly. The considerations on the open vs. closed world reading reported in the Conclusions are truly obscure. We are talking about the same set of data, so why would two different readings (CWA vs. OWA) be appropriate?
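
To make the point about expressing such statements in RDF concrete, the following is a minimal sketch in Python (standard library only) that writes a domain, a range and a maximum-cardinality statement as N-Triples. The `http://example.org/catalogue#` namespace, the class names `Agent` and `Work`, and the `preferredTitle` property are hypothetical and chosen purely for illustration; only `creatorOf` is taken from the discussion above, and nothing here is taken from the authors' actual ontology.

```python
# W3C namespaces for RDF, RDFS, OWL and XSD.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
XSD = "http://www.w3.org/2001/XMLSchema#"
# Hypothetical namespace, used only for this illustration.
EX = "http://example.org/catalogue#"


def iri(namespace, local_name):
    """Return an IRI in N-Triples syntax."""
    return f"<{namespace}{local_name}>"


def typed_literal(value, datatype_iri):
    """Return a typed literal in N-Triples syntax."""
    return f'"{value}"^^{datatype_iri}'


def schema_triples():
    """Domain/range and cardinality statements as (s, p, o) tuples."""
    restriction = "_:atMostOneTitle"  # blank node for the OWL restriction
    return [
        # creatorOf relates an Agent (domain) to a Work (range).
        (iri(EX, "creatorOf"), iri(RDFS, "domain"), iri(EX, "Agent")),
        (iri(EX, "creatorOf"), iri(RDFS, "range"), iri(EX, "Work")),
        # A Work has at most one preferredTitle (hypothetical property).
        (restriction, iri(RDF, "type"), iri(OWL, "Restriction")),
        (restriction, iri(OWL, "onProperty"), iri(EX, "preferredTitle")),
        (restriction, iri(OWL, "maxCardinality"),
         typed_literal(1, iri(XSD, "nonNegativeInteger"))),
        (iri(EX, "Work"), iri(RDFS, "subClassOf"), restriction),
    ]


def to_ntriples(triples):
    """Serialize the triples, one N-Triples statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(schema_triples()))
```

The sketch only shows that such statements can be written as plain RDF triples; how a particular store or reasoner interprets or checks them is outside its scope.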

I suggest that the authors rewrite the paper to include the required information about the transformation process and a better motivation for the chosen architecture, and to highlight in what ways their work is going to help people carrying out similar projects.

Review #2
By Trond Aalberg submitted on 18/Nov/2015
Major Revision
Review Comment:


The contribution describes the migration of a library catalogue into linked open data. The resulting data is primarily based on the RDF vocabulary for RDA (Resource Description and Access), but other vocabularies are also used. The emphasis in the article is on the migration process, the underlying conceptual model (IFLA’s FRBR model), and the vocabularies used in the final coding. The article is submitted as an “Application Report”, but it does not describe any particular application that uses this data, and should rather be considered a “Linked Dataset Description”. The project was presented earlier this year as a poster at TPDL 2015, accompanied by a short paper in the proceedings, but this contribution is sufficiently different and elaborated to justify a new publication.

The paper is well written and easy to read, reasonably well organized and the figures are relevant and of good quality. The contribution is relevant for others that are looking for examples on how to migrate library catalogues to linked open data. The potential impact, however, is somewhat limited because the main focus is on the process, the software and vocabularies used, rather than the quality and reuse value of the resulting data set.

Given that this is a semantic web journal, I find the introduction a bit elementary; it appears to be written for readers without any prior knowledge of semantic web technologies. The statement about RDF being based on XML should be revised, because RDF is better presented as a graph-based data model independent of XML. The motivation for this work, however, is well described. The description of the RDF vocabulary developed for RDA is acceptable, but the authors could distinguish better between RDA as a standard for descriptive cataloguing and the RDF vocabulary developed for the elements and relationship designators in RDA. The authors argue that there is a need for data processing before bibliographic records can be published as linked open data, which of course is perfectly correct, but it is difficult to figure out what they mean by “encoded using heterogeneous library standards”.

In “Related work” the authors demonstrate knowledge of recent comparable initiatives and projects in the library domain, but I miss references to research on the process and problems of transforming library records to the FRBR model, and I get the impression that this research is largely ignored in this project. The transformation process is described merely from a technical point of view, but the main challenge is quality in terms of the semantic correctness of the result. Please check, e.g., the paper by Decourselle from the TPDL 2015 proceedings, earlier papers by Aalberg et al. (this reviewer), Manguinhas et al., and M. Yee, as well as papers from the Variations2 project, etc.
The relational database that is used internally to store the Biblioteca Virtual Miguel de Cervantes catalogue appears less interesting, and the presentation could benefit from a stronger focus on the final ontology instead. After all, it is the final output of triples that will be exposed to others. I also miss a better description of the logic applied in the transformation. The guidelines by LC that the authors refer to are primarily a mapping between properties, which typically has to be accompanied by some interpretation logic for identifying and relating the entities described in each record.

Results are presented in the final section, and the claim is that the procedure has been able to automatically transform a reasonable number of records “successfully”. The main problem in the results section is the lack of discussion of what the authors have succeeded with. The quality is discussed merely from a syntactic point of view, by counting classes, properties, triples and entities that are linked to external collections (such as VIAF). A more in-depth discussion and analysis of the data in the context of its use is needed to show the actual quality and reuse value of this data. A simple query performed by this reviewer on the SPARQL endpoint for works having Cervantes as author returns a listing of 408 works (?), including numerous works with different URIs but titles that indicate equivalent entities (“El ingenioso hidalgo don Quijote de la Mancha” and variants of this title). This simple test indicates that the problem of deduplication and erroneously identified instances is ignored in this project, which was also my impression from reading the article. This reviewer may of course be wrong, but the paper does not give any evidence to the contrary. The result is a collection that implements the vocabularies of RDA and shows what can be done in terms of coding data for the semantic web, but does not recognize and deal with the migration problems and quality issues that have been identified in previous research.
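
The kind of check described above can be approximated with a very simple title-based grouping. The following is a rough sketch in Python (standard library only), assuming the query results are available as (URI, title) pairs; the `http://example.org/work/...` URIs and the sample rows are hypothetical illustrations, not data retrieved from the endpoint, and exact matching on normalized titles is only one crude heuristic for spotting candidate duplicates.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    words = re.findall(r"[a-z0-9]+", without_accents.lower())
    return " ".join(words)


def duplicate_candidates(works):
    """Group work URIs by normalized title; keep groups with more than one URI."""
    groups = defaultdict(list)
    for uri, title in works:
        groups[normalize_title(title)].append(uri)
    return {key: uris for key, uris in groups.items() if len(uris) > 1}


if __name__ == "__main__":
    # Hypothetical sample rows, for illustration only.
    sample = [
        ("http://example.org/work/1",
         "El ingenioso hidalgo don Quijote de la Mancha"),
        ("http://example.org/work/2",
         "El ingenioso hidalgo Don Quijote de la Mancha."),
        ("http://example.org/work/3", "Novelas ejemplares"),
    ]
    for key, uris in duplicate_candidates(sample).items():
        print(f"{len(uris)} URIs share the title: {key}")
```

Counting such groups would give only a lower bound on duplicates, since variant titles that differ by more than case, accents or punctuation would not be matched by this heuristic.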

The main conclusion is that this is potentially a relevant contribution, but there is a need for major revision before it can be accepted. In particular, the authors should include a discussion of the typical migration problems and quality issues that others have identified when transforming MARC data into FRBR. Secondly, they need to include some evidence demonstrating the actual quality of the final data, e.g. by looking at the results for known cases and counting duplicate as well as erroneously generated entities. Given that the proper migration of library data into richer semantic models such as FRBR coded using the RDA vocabulary is a very hard problem, I do not expect such data to show perfect results, but I do expect a thorough discussion of well-known migration challenges, the solutions the authors have implemented, and evidence of the results they have achieved.