CEDAR: The Dutch Historical Censuses as Linked Open Data

Tracking #: 878-2088

Authors: 
Albert Meroño-Peñuela
Christophe Guéret
Ashkan Ashkpour
Stefan Schlobach

Responsible editor: 
Pascal Hitzler

Submission type: 
Dataset Description
Abstract: 
In this document we describe the CEDAR dataset, a five-star Linked Open Data representation of the Dutch historical censuses, conducted in the Netherlands once every 10 years from 1795 to 1971. We produce a linked dataset from a digitized sample of 2,300 tables. The dataset contains more than 6.8 million statistical observations about the demography, labour and housing of the Dutch society in the 18th, 19th and 20th centuries. The dataset is modeled using the RDF Data Cube vocabulary for multidimensional data, uses Open Annotation to express rules of data harmonization, and keeps track of the provenance of every single data point and its transformations using PROV. We link these observations to well known standard classification systems in social history, such as the Historical International Standard Classification of Occupations (HISCO) and the Amsterdamse Code (AC), which in turn link to DBpedia and GeoNames. The two main contributions of the dataset are the improvement of straightforward data access for historical research, and the emergence of new historical data hubs, like classifications of historical religions and historical house types, in the Linked Open Data cloud.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 05/Jan/2015
Suggestion:
Major Revision
Review Comment:

This article describes the CEDAR dataset, a LOD representation of the Dutch historical census data covering 1795 to 1971. Overall the article is easy to follow, and although I am not a domain expert, from a general point of view I believe the dataset offers good value to a specific community. However, I also think the article should be further improved along a couple of dimensions to better 'sell itself' to both general and domain-specific readers.

== Quality ==
The dataset is built with data from authoritative sources with extensive human input. However, the article is missing a section evaluating the dataset using explicitly defined metrics. Hence it is difficult to make an objective judgement of quality. The authors themselves also point out at the end of the article that checking the data for consistency and errors is listed as 'future work'. I personally think this is acceptable. However, the authors should at least evaluate the usefulness of the data (see below).

== Usefulness ==
I believe the dataset is very useful for a number of reasons: 1) it is the first historical census data made available as LOD while previous legacy data have been difficult to use; 2) some 'auxiliary' resources (e.g., classification schemes, vocabulary mappings) can also be very valuable; 3) it is linked to other Linked Data sets including some major ones such that it is likely to be exposed to a wider community.

However, I feel that these are inadequately addressed in the article.
First, the authors use an example in Section 4.1 to illustrate how the dataset can be useful to historical census researchers (this also seems to be the same example as on the demo website). But one example is not enough: it gives little sense of what the dataset allows one to do, and there is not enough information to teach a non-expert to write similar queries and test them for themselves. In my opinion, this can be improved by: a) adding more examples, at least on the demo website; b) quantifying the benefits and showing some indicators in a table. For instance, test 10 different queries and show the average number of Excel tables that a user needs to consult to answer them without CEDAR. Also, are there any queries that users would not have been able to answer previously?

Second, the auxiliary resources such as the classification schemes and vocabulary mappings are mentioned across a number of different sections. I think the article would benefit from a summary table listing these resources, together with some statistics (e.g., number of new concepts created; number of mappings; avg. number of terms mapped in a group), and some discussion of whether, and how, they can be useful to the community in general. I notice that some of these are available on GitHub, but I think it is important to make them explicit in the article as they are important contributions.

== Clarity and completeness ==
In general the article is clear to me. The demo also shows statistics about the proportion of data that still remains to be converted, giving a sense of 'completeness'. The article can be further improved for clarity as follows:

1. The dataset is described using a number of vocabularies. While many of them are already covered across different sections, it'll be nice to have a 'lookup' table (or similar) listing them and also their characteristics. This can be very helpful for readers who want to use your SPARQL endpoint to try out their own queries. At the moment I find it extremely difficult to do this. Even to understand the example query, I had to frequently go back to the paper and navigate through paragraphs to find relevant descriptions.

2. As part of the data creation process, you have created many mapping scripts and rules. As explained above, I think these are valuable resources. I understand they are available on GitHub, but it would be nice to show some concrete examples and statistics in the article.

3. On page 2, section 2.1: "To this end, we developed TabLinker, a supervised Excel-to-RDF converter that relies on human markup on critical areas of these tables". How many markups are there? Does Figure 2 show all of them?
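
To illustrate the kind of lookup aid meant in point 1, here is a minimal sketch, not taken from the paper. It assumes only the standard namespace URIs of the three vocabularies named in the abstract (RDF Data Cube, Open Annotation, PROV). The helper `build_query` and the exploration query are hypothetical, and no CEDAR-specific property names are assumed.

```python
# Standard namespace URIs of the vocabularies named in the abstract.
# (Assumption: CEDAR uses these vocabularies under their usual namespaces.)
VOCABULARIES = {
    "qb": "http://purl.org/linked-data/cube#",   # RDF Data Cube
    "oa": "http://www.w3.org/ns/oa#",            # Open Annotation
    "prov": "http://www.w3.org/ns/prov#",        # PROV
}


def build_query(body: str, limit: int = 10) -> str:
    """Prepend a PREFIX declaration for every vocabulary to a SPARQL body.

    Hypothetical helper for illustration; not part of CEDAR.
    """
    prefixes = "\n".join(
        f"PREFIX {prefix}: <{uri}>" for prefix, uri in VOCABULARIES.items()
    )
    return f"{prefixes}\nSELECT * WHERE {{\n{body}\n}} LIMIT {limit}"


# Generic exploration query: list the properties attached to a few
# Data Cube observations, so a newcomer can see which dimensions exist.
EXPLORE_OBSERVATIONS = build_query(
    "  ?obs a qb:Observation ;\n       ?property ?value ."
)

if __name__ == "__main__":
    print(EXPLORE_OBSERVATIONS)
```

A table of this kind in the article, extended with the dataset's own dimension and measure properties, would let readers write their first query without searching through the text.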

Other minor comments regarding clarity:
- In figure 1, component "Raw-data analysis (TabCluster+LSD)" is never explained in the article. What is it?
- Some links provided in the article did not work at the time of review. These are the links in footnotes 8, 16 and 17.
- On the webpage shown in footnote 9, 'Example of visualisations' is empty.

Review #2
Anonymous submitted on 08/Jan/2015
Suggestion:
Major Revision
Review Comment:

The paper presents the CEDAR dataset, an RDF dataset exposed as linked data, about the Dutch historical censuses between 1795 and 1971. The authors first introduce how the Dutch historical censuses were collected and some of the initiatives to make such data more easily accessible. Then, they discuss their contribution, which is to convert the Dutch historical censuses to RDF and make the converted data available as linked data. They explain the data conversion approach, involving the conversion of raw data to RDF and the integration of data coming from the censuses of different years. They also discuss how they track data provenance and how they generate URIs. Then, after describing how RDF links are generated, they conclude the paper with an overview of the impact of having the data exposed as linked data and provide the links to access the dataset.

Linked data almost completely lacks historical data, and thus the problem of providing historical linked datasets is, in my opinion, absolutely relevant. However, I have several concerns about the paper. First, even if I agree that the problem of making historical linked data available is interesting, the paper should have a related work section (at least as a subsection of the introduction). The authors seem to care only about their previous work; indeed, 7 out of 13 references are papers previously published by (a subset of) the same authors as the submission. You should at least briefly mention some of the initiatives supporting the importance of having temporal and historical information in RDF. Some examples are:

- Hoffart, J.; Suchanek, F. M.; Berberich, K.; and Weikum, G. 2013. Yago2: A spatially and temporally enhanced knowledge base from wikipedia. Artif. Intell. 194:28–61.
- Rula, A.; Palmonari, M.; Ngomo, A.-C. N.; Gerber, D.; Lehmann, J.; and Bühmann, L. 2014. Hybrid acquisition of temporal scopes for RDF data. In Proc. of the Extended Semantic Web Conference 2014.
- Fionda, V.; Grasso, G. 2014. Linking Historical Data on the Web. In Poster session at the 13th International Semantic Web Conference (ISWC2014).

Moreover, it is not clear to me why the authors limit themselves to the censuses of the years up to 1971. As far as I know, the Dutch census continued at least up to 2001 (see also your reference [3]), and I can imagine that there was also a census in 2011. Is there any particular reason why you decided to stop in 1971? If so, please make that reason explicit; otherwise it is somewhat limiting to have data that ends 40 years ago and completely neglects the last 4 editions of the census. Supposing that there is a reason to stop in 1971, in my opinion it would also be necessary to discuss in more detail the kinds of applications that would benefit from the CEDAR dataset. Some discussion is given in Section 4.1, but again the fact that the dataset stops in 1971 seems to limit even the example suggested by the authors. Why would a researcher be interested in mapping the evolution of the total number of inhabitants of a specific gender, in a specific municipality, and for a specific occupation, while completely neglecting the last 40 years?

Some minor points are the following:

- In sections 2.1 and 2.2 Figure 2 is discussed by referring to the colours of the different areas of Fig. 2. However, I printed the paper in b/w and for me it was impossible to understand what the authors tried to explain. It would be better to find another way to discuss Fig. 2 than the colours of the different areas of the diagram.

- Also Figure 3 is unreadable when printed and its description as reported in section 2.5 is not clear to me. It would be necessary either to explain such figure better or to remove it since in my opinion it does not give any added value to the discussion as it is.