Fiction Literature as Linked Open Data - the BookSampo Dataset

Paper Title: 
Fiction Literature as Linked Open Data - the BookSampo Dataset
Eetu Mäkelä, Kaisa Hypén, Eero Hyvönen
The BookSampo dataset provides information as linked data on fiction literature published in Finland going back to the 15th century, along with rich descriptions of both their content and context. The dataset contains data on nearly 400,000 subjects, including literary works, authors, book covers, reviews, awards, images, and movies, over 3 million triples in total. The data has been applied as the basis of the BookSampo portal in public use in Finland, and is aligned with the cross-domain cultural heritage contents and ontologies of CultureSampo, another in-use semantic portal. The data has been used to answer complex questions, such as what topics one should write about if one wants to get a literary award (based on statistics). The metadata was transformed into RDF from legacy library databases, then enriched manually by dozens of librarians in a Web 2.0 fashion in Finnish public libraries, and is constantly updated at a rate of some 90,000 new triples monthly.
Submission type: 
Dataset Description

Revised resubmission after an "accept with minor revisions", then accepted with minor revisions, and finally accepted for publication. The first round reviews are beneath the second round reviews.

Solicited review by Oscar Corcho:

Following on from the comments in my review of the earlier version of this paper, the paper describes a relevant dataset that can be useful for a large community of users, which indicates that the work presented is suitable for this special issue.

From a methodological point of view, the design decisions on the vocabularies reused and the part of the vocabulary that is created for the publication of this dataset are good, although the paper still lacks some clear descriptions of the design decisions for URIs. It is worrying that the constraints on RDF Schema that are presented in the paper still hold, when the vocabulary could have been presented in OWL instead.

I tried the URLs provided for the SPARQL endpoints and dumps tonight and they do not work, btw. This should be solved, the dataset registered in a registry, etc.

Solicited review by Fabien Gandon:

I am satisfied with the answers to the reviews.

Solicited review by Aba-Sah Dadzie:

My two main concerns have been much better addressed in the revised submission. I'd recommend accept, minor comments below. Wrt the key review criteria for the call:

* Quality of the dataset
* Usefulness (or potential usefulness) of the dataset - in use and continuously updated (from multiple sources). Specific use cases are also described.

* Clarity and completeness of the descriptions - good amount of detail provided. The paper is well written and overall, fairly easy to follow.

* Name, URL, version date and number, licensing, availability, etc. - licensing information provided, with relevant information on which types apply to which versions of the data.

* Topic coverage, source for the data, purpose and method of creation and maintenance, reported usage etc.
* Metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressivity, growth. - provided in good detail. While most linking is within the Finnish cultural heritage system, reuse of other standard ontologies and vocabularies provides points from which to link to other "external" data.

* Examples and critical discussion of typical knowledge modeling patterns used - improved compared to original submission. One of my main concerns was design that limited the structure of the dataset. The reasons for the decision and its effects have been more fully addressed. Options for addressing this are also presented. Improving linking to the LOD cloud is discussed in "future work".

* Known shortcomings of the dataset - addressed (see above)

Additional points to address

RDF Export URL (section 2) not reachable (07-08.11.2012)

Table 2 (Important classes in BookSampo along with their instance counts) is missing

"[section] 5. Uses Cases for the BookSampo Dataset" -> "5. Use Cases for the BookSampo Dataset"

First round reviews:

Solicited review by Oscar Corcho:

The paper describes a dataset about fiction literature from libraries in Finland, which is continuously updated as new books are entered into the source data system. The paper does not provide much information about typical aspects that would be necessary to check the quality of the data inside (e.g., a sample URI for a sample item), but clearly describes the types of entities that it deals with and the use cases that can be run on the dataset.

The quality of the dataset is, as in many of these papers for this special issue, relative to the quality of the source dataset. Here a clear potential problem of co-reference resolution appears, but it is nicely solved in the approach presented, and additionally Web 2.0-like annotation from experts is used in order to curate and add more information into the set. The vocabularies used for the RDF export are adequate and those that are normally used in the bibliographic domain, and the design decisions on which parts of those vocabularies to use (e.g., from FRBR) are appropriate.

The dataset is useful, from what can be inferred from the use cases that are presented, although its use may be limited mainly to Finland. However, that should not be a problem, obviously.

Finally, the dataset is quite complete, considering the sources that are being used, and it would be nice if links to external resources were added, or at least described, such as authority records for the authors, which would increase the value and completeness of the dataset. A discussion on this should be available in the revised version.

Solicited review by Fabien Gandon:

The paper presents the BookSampo dataset that provides linked data on fiction literature published in Finland.

The provided URLs were working at the time of writing this review.
Metrics and stats are provided for the internal content.
Interlinking with external schemas and datasets is mentioned but no statistics are provided.

The authors identify a list of shortcomings among which one is very disturbing: "the schema definitions in the dataset virtually violate RDFS semantics in one major aspect, due to the specifics of the SAHA editor used: properties may have multiple separate domain and range constraint statements, but this doesn't imply that the instances related by these properties are members of the intersection of domain/range classes, as required in the RDF Schema specification."
This could be avoided with additional (abstract) classes: use as the domain and range a class defined as the superclass of the united classes, i.e., replace the union with dedicated superclasses.
Breaking the RDFS semantics is a very big problem for interoperability: if I load your data into my triple store, I will draw false conclusions.
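
To make the interoperability risk concrete, here is a minimal sketch of RDFS domain entailment (rule rdfs2) in plain Python, with no RDF library involved. All names (ex:hasName, ex:Author, ex:Novel, ex:NamedThing, ex:item1) are hypothetical and are not taken from the BookSampo schema; the sketch only illustrates why two separate domain statements are read as an intersection, and how a dedicated superclass sidesteps that.

```python
RDFS_DOMAIN = "rdfs:domain"
RDF_TYPE = "rdf:type"

# Hypothetical schema: two separate domain statements on one property,
# written with the intention "the subject is an Author OR a Novel".
schema_multi = {
    ("ex:hasName", RDFS_DOMAIN, "ex:Author"),
    ("ex:hasName", RDFS_DOMAIN, "ex:Novel"),
}

# Hypothetical fix: a single domain pointing at a dedicated superclass
# (ex:Author and ex:Novel would each be declared rdfs:subClassOf it).
schema_super = {
    ("ex:hasName", RDFS_DOMAIN, "ex:NamedThing"),
}

# Hypothetical instance data: one resource that uses the property.
data = {("ex:item1", "ex:hasName", "Example Name")}


def apply_rdfs2(schema, data):
    """Rule rdfs2: (p rdfs:domain C) and (s p o) entail (s rdf:type C)."""
    inferred = set()
    for prop, pred, cls in schema:
        if pred != RDFS_DOMAIN:
            continue
        for subj, used_prop, _obj in data:
            if used_prop == prop:
                inferred.add((subj, RDF_TYPE, cls))
    return inferred


types_multi = apply_rdfs2(schema_multi, data)
types_super = apply_rdfs2(schema_super, data)

# With two domain statements the resource is typed as BOTH classes.
print(sorted(types_multi))
# With the superclass only the harmless general type is inferred.
print(sorted(types_super))
```

Under these assumptions, a standards-compliant reasoner would conclude that ex:item1 is both an author and a novel in the first case, which is the kind of false conclusion referred to above.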

"Bringing events to the fore, the approach fractured and distributed the metadata of the original primary objects. For example, people wanted much more to see information on authors' birth and death dates and places as simply attribute-object values of the author, instead of as events where the author was involved in. The project thus changed back to a more traditional model, where data about times and places of occurrences are directly saved as author, not event attributes. In the case of representing degrees attained by authors, this did lead to some loss of data, since the flat attributes allowed only representation of multiple degrees without dates. However, the librarians deemed the simplicity to outweigh the costs in this situation."

This is surprising: you changed the conceptual model because of an interaction design issue. Why not design an interaction mechanism that bridges the two worlds? For instance, in RDF the Fresnel initiative was introduced to decouple RDF models and RDF views.
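
As a rough illustration of what such a bridge could look like, the sketch below keeps an event-based model and derives the flat, author-centred attributes as a view on top of it. It is plain Python, not Fresnel, and every name in it (Event, flat_view, the field names, the sample values) is a hypothetical placeholder rather than part of the BookSampo model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical event record: something an author was involved in."""
    kind: str                     # "birth", "death" or "degree"
    person: str                   # identifier of the author
    date: Optional[str] = None
    place: Optional[str] = None
    label: Optional[str] = None   # e.g. the name of a degree


def flat_view(person, events):
    """Project the events of one person onto flat display attributes."""
    view = {"person": person, "degrees": []}
    for event in events:
        if event.person != person:
            continue
        if event.kind in ("birth", "death"):
            view[event.kind + "_date"] = event.date
            view[event.kind + "_place"] = event.place
        elif event.kind == "degree":
            # The date stays attached to the degree, so nothing is lost.
            view["degrees"].append((event.label, event.date))
    return view


# Hypothetical sample data.
events = [
    Event(kind="birth", person="ex:author1", date="1900", place="ex:place1"),
    Event(kind="degree", person="ex:author1", date="1925", label="Example Degree"),
    Event(kind="birth", person="ex:author2", date="1950", place="ex:place2"),
]

view = flat_view("ex:author1", events)
print(view)
```

The stored data remains event-based, while the editing and browsing interface would only ever show the flat projection; in this sketch, dated degrees survive because the view is computed rather than stored.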

Solicited review by Aba-Sah Dadzie:

The paper discusses design decisions taken in building the Finnish BookSampo linked dataset, with the use of Web 2.0 technology to index fiction literature such that it provides a rich resource for browsing and analysis not possible with traditional indexing.
The main data source was an RDF dump provided by the Helsinki metropolitan area library; new data and annotations are provided by a named company. The dataset is already in use, and continues to grow, due to annotation carried out by librarians. In addition to web services that support this annotation, support for browsing by ordinary users of the libraries is provided through a dedicated web portal.

The authors describe the domain-specific ontologies used, out of a subset of Finnish resources for describing cultural information (primarily the KOKO ontology), and also the links to other standard ontologies and resources (e.g., DBpedia and GeoNames to match to physical locations). They also discuss restrictions to their model that limit interconnectivity with other resources, due mainly to the need to simplify the model to suit the librarians who provide the annotations and other related information (e.g., awards). A few examples of use are given, including statistical analysis to derive information about the dataset itself and its use for other purposes such as grant sourcing based on subject area. Design to enable easy linking to Finnish cultural heritage resources is also highlighted.
Conflicts in licensing are discussed; the dataset does however appear to be largely accessible for analysis and other (re)use.


"In the case of representing degrees attained by authors ..." (p.4) - does "degrees" refer to academic degrees? If so this should be explicitly stated - the word is ambiguous.
I am a bit puzzled as to the justification for using a flat structure for storing data about dates, with the specific example of degrees. Based on the authors' description I (safely?) assume the librarians do not manually edit the backing ontologies - would it not have been possible to set up this structure to capture the date information as well and provide a more usable interface to support simpler provision of the information? Along with additional training as described for the example of annotating a part or a series of a book (p.5-6)? Alternatively or additionally, automatic methods could be used to (attempt to) retrieve this information by making use of other related author attributes, e.g., timestamped information about attendance at relevant institutions - information which IS also captured.

The paper ends suddenly. A brief conclusion with a discussion of future work is necessary; the discussion within the paper (mostly in section 4) covers specific design decisions and the data model, but does not identify any open issues and/or plans to revisit those areas the authors acknowledge to be less than optimal.


Figures & Tables

Convention places table captions at the top. Also, because the tables in the paper do not have a line at the bottom, the text in the bottom cells runs into the captions, making reading more difficult.

Citation & Bibliography

[1] is a non-English citation. While it may well be appropriate it requires at least a translation of its title into English to give some indication of its relevance - simply because THIS article is written in English, which is, fortunately or not, the lingua franca when it comes to scientific articles. (see, e.g.,
Further, the claim it is meant to support appears to be the main justification for the creation of this linked dataset.

(p.2) - The "Getty AAT thesaurus" is not cited - at least a URL must be provided.

Language & Presentation

(p.3) - "so the user should be aware of a few conventions, or lacks thereof" -> "so the user should be aware of a few conventions, or lack[no 's'] thereof"

(p.4) - There is a weird split from the paragraph at the top to the next - it would make for better reading to move the 1st sentence in para2 to the end of the previous one - "The project thus changed back to a more traditional model, where data about times and places of occurrences are directly saved as author, not event attributes."