CEDAR: The Dutch Historical Censuses as Linked Open Data

Tracking #: 1140-2352

Albert Meroño-Peñuela
Christophe Guéret
Ashkan Ashkpour
Stefan Schlobach

Responsible editor: 
Pascal Hitzler

Submission type: 
Dataset Description
In this document we describe the CEDAR dataset, a five-star Linked Open Data representation of the Dutch historical censuses, conducted in the Netherlands once every 10 years from 1795 to 1971. We produce a linked dataset from a digitized sample of 2,300 tables. The dataset contains more than 6.8 million statistical observations about the demography, labour and housing of Dutch society in the 18th, 19th and 20th centuries. The dataset is modeled using the RDF Data Cube vocabulary for multidimensional data, uses Open Annotation to express rules of data harmonization, and keeps track of the provenance of every single data point and its transformations using PROV. We link these observations to well-known standard classification systems in social history, such as the Historical International Standard Classification of Occupations (HISCO) and the Amsterdamse Code (AC), which in turn link to DBpedia and GeoNames. The two main contributions of the dataset are the improvement of data integration and access for historical research, and the emergence of new historical datahubs, such as classifications of historical religions and historical house types, in the Linked Open Data cloud.
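The abstract describes census observations modeled with the RDF Data Cube vocabulary. As a rough illustration only — the cedar.example.org namespace, the identifiers, and the property names below are invented and do not reflect the dataset's actual URI scheme — one observation could be serialized as N-Triples along these lines:

```python
# Minimal sketch of an RDF Data Cube observation, serialized as N-Triples.
# All cedar.example.org URIs and codes are hypothetical placeholders.

QB = "http://purl.org/linked-data/cube#"
SDMX = "http://purl.org/linked-data/sdmx/2009/dimension#"
CEDAR = "http://cedar.example.org/resource/"  # placeholder namespace

def ntriple(s, p, o):
    """Serialize one triple in N-Triples syntax; o may be a typed literal."""
    obj = o if o.startswith('"') else f"<{o}>"
    return f"<{s}> <{p}> {obj} ."

obs = CEDAR + "observation/VT_1889_01_H1-S8-O42"  # hypothetical observation ID
triples = [
    ntriple(obs, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
            QB + "Observation"),
    ntriple(obs, QB + "dataSet", CEDAR + "dataset/VT_1889_01"),
    ntriple(obs, SDMX + "refArea", CEDAR + "code/ac/10996"),         # Amsterdamse Code
    ntriple(obs, CEDAR + "occupation", CEDAR + "code/hisco/22670"),  # HISCO occupation
    ntriple(obs, CEDAR + "population",
            '"127"^^<http://www.w3.org/2001/XMLSchema#integer>'),
]

print("\n".join(triples))
```

Each observation points to its dataset via qb:dataSet and carries one value per dimension (area, occupation, and so on), which is what makes the harmonized tables queryable as a single multidimensional cube.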
Minor Revision

Solicited Reviews:
Review #1
By Ziqi Zhang submitted on 26/Aug/2015
Review Comment:

This resubmission has been largely improved over the previous version, and I can see that my comments have been properly addressed. I think the paper is ready to be accepted, subject to correction of a few typos.

The current paper is clear, and easy to read and follow. The authors have added more detailed descriptions to address the ambiguities raised by reviewers. The authors have added examples (in the form of tables, figures and paragraphs) that make the usefulness of the dataset more convincing. Also details about the 'auxiliary resources' have been provided. In particular, I notice that the mapping rules are made available for download and I believe can be very useful resources to future research.

A number of typos have been found:

page 2, left column, line 4: "books books" => "books"
page 4, right column, line 7: "that where" => "that were"
page 9, left column, line 3 of the 4th paragraph: "our is" => "ours is"

Review #2
Anonymous submitted on 27/Aug/2015
Review Comment:

In my original review, I raised minor concerns about the related work, the data covered by the dataset, and some details of presentation in the paper. After reviewing the authors' response and the corresponding passages in the manuscript, I am satisfied that this paper is acceptable. In particular, a related work section has been added, and the authors clearly motivated the rationale behind the fact that they only consider the years 1795-1971. Moreover, they solved the minor issues about the figures' readability and clarity of explanation.

Review #3
By Eetu Mäkelä submitted on 11/Sep/2015
Minor Revision
Review Comment:

In general, I like the content of the article and think it deserves publication. However, I feel there is still work to be done in cleaning up and expanding the text, as well as fixing bugs in the dataset and its homepage.

Going chronologically, the Related work section of the introduction is puzzling. In my opinion, much of what is presented as related is really far from the actual content of the paper, while a lot that would be of relevance is missing. I'd suggest as related work 1) other projects and tools that do Excel/CSV-to-RDF conversion, and 2) other projects that publish data as RDF data cubes (or publish similar data, even if not as data cubes). (I do have to note, however, that most of the referenced works I find puzzling actually come from the suggestions of a prior reviewer, so I can understand your dilemma.) As a small note, there is a typo in "knwoledge graph".

Similarly, in section 2.1, it is stated that "well-known generic community tools" cannot be used. However, no actual references to such tools are given. Such a statement would also need at least a couple of sentences of argumentation with relation to those tools.

It is not easy to understand Listing 1 when not even an example dimension resource is included. I would also like the examples to cover the information associated with the different types of row and column header resources.

In the article, multiple links to GitHub resources no longer resolve due to refactorings. These should be changed to point to a particular tagged version so this does not happen.

Also, presently, multiple of the example queries on the dataset query page do not work. These should be fixed.

In the article, as well as on the dataset page, the term/graph cedar-mini sometimes appears instead of cedar. The relationship between these two is not explained, which hampers use of the dataset.

For many of the tables in the article, the values depicted are not adequately explained. For example, what exactly is the frequency/% in Table 2 a portion of? What does SPARQL as a means of generation mean in Table 5? And so on.

In the article, links to outside data sources are mentioned in passing in several places, but not comprehensively discussed. An additional table listing all outside link targets, as well as how these links were generated, would help here.

Inside the dataset itself, some of the URIs are currently not dereferenceable (they contain a port :8888). Also, some OA resources are in my opinion unnecessarily blank nodes, which hinders their exploration.

The property cedar:isTotal is present in a table in the paper, but not discussed. However, looking at the example queries on the site, this seems to be an important property that discerns different types of observations from each other. Thus, its significance should be explained.

Finally, and most importantly, I do not think that the analysis of dataset (rule) coverage is adequate. For example, at present one does not really know how much of the information of the original data is transferred into the final representation. One would need at least percentages for the number of different codes handled thus far vs. the total number of codes, as well as a qualitative evaluation of, for example, how many of the original dimensions have thus far been mostly mapped. The paper states that these questions are answered by the "full statistical analysis" on the dataset home page, but at least for me this was not the case. On that page, I only see the total number of sheets processed, but no deeper detail on how much of their information made its way into the end result.