Review Comment:
This article describes the CEDAR dataset, a LOD representation of the Dutch historical census data covering the period 1795 to 1971. Overall the article is easy to follow and, although I am not a domain expert, from a general point of view I believe the dataset offers good value to a specific community. However, I also think the article should be improved along a couple of dimensions to better 'sell itself' to both general and domain-specific readers.
== Quality ==
The dataset is built from authoritative sources with extensive human input. However, the article is missing a section that evaluates the dataset against explicitly defined metrics, so it is difficult to make an objective judgement of quality. The authors themselves point out at the end of the article that checking the data for consistency and errors is left as future work. I personally find this acceptable. However, the authors should at least evaluate the usefulness of the data (see below).
== Usefulness ==
I believe the dataset is very useful for a number of reasons: 1) it is the first historical census dataset made available as LOD, whereas the legacy data it is based on have been difficult to use; 2) some 'auxiliary' resources (e.g., classification schemes, vocabulary mappings) can also be very valuable; 3) it is linked to other Linked Data sets, including some major ones, so it is likely to reach a wider community.
However, I feel that these are inadequately addressed in the article.
First, the authors use a single example in Section 4.1 to illustrate how the dataset can be useful to historical census researchers (this also appears to be the same example as on the demo website). One example is simply not enough: it gives little sense of the range of questions the dataset can answer, and there is not enough information for a non-expert to write similar queries and test them (a rough sketch of the kind of query I mean follows below). In my opinion, this can be improved by: a) adding more examples, at least on the demo website; b) quantifying the benefits and showing some indicators in a table. For instance, test 10 different queries and report the average number of Excel tables a user would need to consult to answer them without CEDAR. Also, are there any queries that users could not answer at all previously?
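To make point b) concrete, below is a minimal sketch of the kind of aggregation query I have in mind. The qb: and sdmx-dimension: prefixes are the standard RDF Data Cube vocabularies, but the cedar: namespace URI and the property names (cedar:population, sdmx-dimension:refArea) are my assumptions for illustration only and would need to be checked against the actual CEDAR vocabulary:

    # Hypothetical sketch: sums a population measure per area across
    # all census observations. Property names are illustrative
    # assumptions, not verified against the CEDAR vocabulary.
    PREFIX qb:             <http://purl.org/linked-data/cube#>
    PREFIX sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#>
    PREFIX cedar:          <http://lod.cedar-project.nl/vocab/cedar#>

    SELECT ?area (SUM(?pop) AS ?total)
    WHERE {
      ?obs a qb:Observation ;
           sdmx-dimension:refArea ?area ;
           cedar:population ?pop .
    }
    GROUP BY ?area

Ten queries of roughly this shape, each paired with a count of the Excel tables needed to answer the same question manually, would make the benefit tangible.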
Second, the auxiliary resources such as the classification schemes and vocabulary mappings are mentioned across a number of different sections. I think the article would benefit from a summary table listing these resources, together with some statistics (e.g., number of new concepts created; number of mappings; average number of terms mapped per group), and some discussion of whether, and how, they can be useful to the community in general. I notice that some of these are available on GitHub, but I think it is important to make them explicit in the article, as they are important contributions.
== Clarity and completeness ==
In general the article is clear to me. The demo also shows statistics about the proportion of data that still remains to be converted, giving a sense of 'completeness'. The article can be further improved for clarity as follows:
1. The dataset is described using a number of vocabularies. While many of them are already covered across different sections, it would be nice to have a 'lookup' table (or similar) listing them together with their characteristics. This would be very helpful for readers who want to try out their own queries on your SPARQL endpoint; at the moment I find this extremely difficult (see the exploratory query sketch after this list). Even to understand the example query, I had to go back to the paper repeatedly and navigate through paragraphs to find the relevant descriptions.
2. As part of the data creation process, you have created many mapping scripts and rules. As noted above, I think these are valuable resources. I understand they are available on GitHub, but it would be nice to show some concrete examples and statistics in the article.
3. On page 2, Section 2.1: "To this end, we developed TabLinker, a supervised Excel-to-RDF converter that relies on human markup on critical areas of these tables". How many kinds of markup are there? Does Figure 2 show all of them?
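To illustrate point 1 above: without a vocabulary table, a newcomer's only real option is generic endpoint exploration along the lines below. This is plain SPARQL 1.1 and assumes nothing specific about CEDAR, which is precisely the problem: it reveals predicate URIs but not their intended meaning or how they combine.

    # Generic vocabulary-discovery query: list the most frequently
    # used predicates in the endpoint. Works against any SPARQL
    # endpoint; nothing here is CEDAR-specific.
    SELECT ?p (COUNT(*) AS ?uses)
    WHERE { ?s ?p ?o }
    GROUP BY ?p
    ORDER BY DESC(?uses)
    LIMIT 50

A documented vocabulary table would spare readers this reverse engineering.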
Other minor comments regarding clarity:
- In Figure 1, the component "Raw-data analysis (TabCluster+LSD)" is never explained in the article. What is it?
- Some links provided in the article did not work at the time of review, namely those in footnotes 8, 16, and 17.
- On the webpage linked in footnote 9, the 'Example of visualisations' section is empty.