Review Comment:
This paper describes a dataset about government expenditure that is made available as Linked Data by transforming the data available at OpenSpending.org.
As this is a dataset description paper, I will follow the usual evaluation criteria proposed in the context of previous special issues on Linked Data dataset descriptions, namely quality of the dataset, usefulness (or potential usefulness) of the dataset, and clarity and completeness of the descriptions. Besides, I am also following the checklist that is used for such dataset descriptions:
- Name, URL, versioning, licensing, availability: COVERED
- Topic coverage, source for the data: COVERED
- Purpose of the Linked Dataset, e.g. demonstrated by relevant queries or inferences over it: PARTIALLY COVERED, since the queries provided mostly demonstrate what can be done once some data is available in the RDF Data Cube vocabulary, but the paper does not clearly describe other applications currently using the dataset.
- Applications using the dataset and other metrics of use: NOT COVERED (see my previous comment)
- Creation, maintenance and update mechanisms as well as policies to ensure sustainability and stability: COVERED
- Quality, quantity and purpose of links to other datasets: NOT COVERED (this should be discussed, especially with respect to linking to SDMX vocabularies, where relevant, or to SKOS/XKOS)
- Domain modeling and use of established vocabularies: RDF Data Cube is extensively and adequately used. However, it would also be useful to link the dimensions to SKOS/XKOS vocabularies/thesauri.
- Examples and critical discussion of typical knowledge modeling patterns used: COVERED
- Known shortcomings of the dataset: COVERED
As this checklist analysis shows, most of the aspects required for dataset descriptions in the journal are covered. The two main missing points, which should be addressed should the paper be accepted with some type of revision, are a better description of potential applications and better use of the other vocabularies that are normally associated with RDF Data Cube.
Now I will move on to the other aspects that are normally covered in dataset reviews:
Quality of the dataset
----------------------
I have some concerns about the overall quality of the described dataset. This is not to say that the dataset is wrongly transformed: the use of Data Cube seems adequate (both in how the observations are generated and in how the data cube structure is generated), and the transformation process seems adequate as well. However, in my opinion it falls short of what it should do. I will comment on some of these shortcomings/design decisions, which are not well explained:
- The threshold applied to the number of missing values for a component property is very ad hoc. Why 30 observations and 10% of the observation-property pairs? I can imagine cases where the potential values of the dimensions are very detailed, but the data is heavily aggregated, providing only observations of aggregations over some property values; I have seen many cases of statistical data like this. It seems to me that the paper needs to establish more clearly, and in a more justified manner, why these thresholds are applied.
- Using the same URI for properties with the same name in different datasets seems OK in general, and the authors state that this has not caused any problem in the transformation. However, when dealing with this kind of data it is very common to use properties proposed, for instance, by the SDMX COG (e.g., sex, area, age). Are any of those properties applicable? Why do you create new URLs (probably URIs, in fact) instead of trying to reuse the properties already available, using some linking technique or even humans in the loop? This would probably increase the quality of the dataset, IMO.
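For illustration, a component specification could reuse the SDMX COG dimensions published as RDF instead of minting new local properties; a sketch along these lines (the DSD URI is assumed):

```turtle
@prefix qb:             <http://purl.org/linked-data/cube#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix ex:             <http://example.org/> .

ex:dsd-budget a qb:DataStructureDefinition ;
    # reuse the SDMX reference-area and reference-period dimensions
    # rather than local "region"/"year" properties
    qb:component [ qb:dimension sdmx-dimension:refArea ] ,
                 [ qb:dimension sdmx-dimension:refPeriod ] .
```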
Some additional questions that I have are the following:
- On page 4, Figure 4 includes a description of a slice. In fact, I would recommend changing ns0 there to qb, which is the commonly used prefix for Data Cube. However, you do not explain how you generate slices from the data that you are using. How do you generate this slice? The same applies to the Dublin Core properties.
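For reference, a Data Cube slice normally fixes one or more dimensions via a slice key and attaches the matching observations, along these lines (all URIs here are assumed):

```turtle
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/> .

ex:slice-2013 a qb:Slice ;
    qb:sliceStructure ex:sliceByYear ;  # a qb:SliceKey fixing the year dimension
    ex:refYear "2013" ;
    qb:observation ex:obs1 , ex:obs2 .
```

It would be useful if the paper stated how such slice keys and observation groupings are derived from the source data.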
- You say that the LinkedSpending dataset is continually growing, but from the CubeViz visualisation that is available, and from running the following query in the SPARQL endpoint (assuming the datasets carry a dcterms:created date):
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?a ?date WHERE { ?a a qb:DataSet ; dcterms:created ?date } ORDER BY ?date
I get that all datasets were created at the same time (early September 2013). There is also an inconsistency in the number of datasets with respect to the paper (which says 321): evaluating
SELECT (COUNT(DISTINCT ?a) AS ?count) WHERE { ?a a qb:DataSet }
I get 261 different datasets. Where does this inconsistency come from?
- Have you actually done what you propose in the interlinking section on page 6, i.e., using the labels of datasets and dimensions and linking them to regions?
Usefulness (or potential usefulness) of the dataset
---------------------------------------------------
Personally, I see a lot of potential for this dataset, especially with regard to achieving a better
degree of transparency in the public accounts and expenditure of governments around the world.
It is also true that this potential really comes from the dataset used as the origin of this
linked dataset (OpenSpending), and it is difficult to see what is added by providing a Linked Data
transformation of data that is already available through an API. I would see the benefit if the
current dataset were better linked with other existing datasets, if SKOS/XKOS vocabularies were
identified, if common vocabularies about spending were used, etc. This would make it possible to go
across datasets from different countries and would allow people like data journalists to make
better comparisons.
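For example, a coded dimension could be tied to a SKOS code list so that values from different countries become comparable; a sketch (all URIs are assumed, with COFOG as an illustrative classification):

```turtle
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

ex:expenditureCategory a qb:DimensionProperty , qb:CodedProperty ;
    qb:codeList ex:cofogScheme ;        # e.g. a COFOG-based classification
    rdfs:range skos:Concept .

ex:health a skos:Concept ;
    skos:inScheme ex:cofogScheme ;
    skos:prefLabel "Health"@en .
```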
However, the dataset does not seem to be reused by others yet, which is something very important for
linked dataset descriptions in SWJ.
BTW, in Table 5 you show several queries. Some of them, for instance query 2, could be written more
easily using property paths. Why don't you use them? This is a minor point, obviously.
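For illustration, a chain of intermediate variables can often be collapsed into a single path expression; something like the following (the exact predicates of query 2 are assumed here):

```sparql
PREFIX qb: <http://purl.org/linked-data/cube#>
# instead of: ?ds qb:structure ?dsd . ?dsd qb:component ?c . ?c qb:dimension ?dim .
SELECT ?ds ?dim WHERE {
  ?ds qb:structure/qb:component/qb:dimension ?dim .
}
```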
Clarity and completeness of the descriptions
--------------------------------------------
The description is very thorough, both on the dataset itself and on the transformation process.
Minor comments - typos
----------------------
There are several typos and grammar errors throughout the paper. I have identified some of them here:
- Page 2: "the is" --> "this is"
- Page 2: "the the" --> "the"
- Page 2. The paragraph "Because the source data adheres..." is not understandable. What do you mean there?
- Page 2. When you refer to the previously explained data cube model, it may be better to bring forward the description of SDMX, on which RDF Data Cube is largely based. It is described much later in the paper.
- Page 3. You refer to other tools in the LOD2 stack. It would be good to provide their names. Personally, I have only used CubeViz, and I was not aware of the existence of other tools.
- Page 4. Facetted --> faceted
- Page 4. underlaying --> underlying
- Page 4. Examplarily. Is this correct?
- Page 6. I cannot understand the paragraph that starts with "We generally preferred..." Can you rephrase it?