LinkedSpending: OpenSpending becomes Linked Open Data

Tracking #: 562-1768

Authors: 
Konrad Höffner
Michael Martin
Jens Lehmann

Responsible editor: 
Natasha Noy

Submission type: 
Dataset Description
Abstract: 
There is a high public demand to increase transparency in government spending. Open spending data has the power to reduce corruption by increasing accountability, and it strengthens democracy because voters can make better informed decisions. An informed and trusting public also strengthens the government itself, because it is more likely to commit to large projects. OpenSpending.org is an open platform that provides public finance data from governments around the world. In this article, we present its RDF conversion, LinkedSpending, which provides more than 2.4 million planned and carried out financial transactions from nearly 250 datasets from all over the world, covering the years 2005 to 2035, as Linked Open Data. This data is represented in the RDF Data Cube format and is freely available and openly licensed.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Juergen Umbrich submitted on 08/Dec/2013
Suggestion:
Minor Revision
Review Comment:

The paper presents the transformation of the OpenSpending datasets into Linked Data.
Overall, the authors manage to give a good overview of the transformation process; however, the level of detail in the paper is low.
The main part of the work is the transformation of the JSON data of the OpenSpending datasets into RDF, whereas the interlinking with existing LOD datasets seems to be of minor relevance:
only URIs for DBpedia currency instances and countries from the LinkedGeoData project are reused.

Detailed comments:

Section 1:

I suggest introducing the benefit and purpose of the Linked Data transformation
of the OpenSpending project in this section, to make a strong and convincing case for the purpose of this work.

Remove unused prefixes from Table 1, since not all of them are currently used in the paper.

Section 2:
This section would strongly benefit from more details about the OpenSpending project, such as details about the submission process of datasets,
the type of information available (e.g. transactional spending data, budgetary data),
and the structure of the datasets (which is, for each dataset, completely up to the creator of the dataset; does that also mean the model might change?).
Also, it would help to present the Data Cube vocabulary in more detail and with an appropriate example, e.g. along the lines of the sketch below.
In addition, the authors should present the challenges of the transformation based on those details (e.g. the structure is defined by the creator, which might cause problems in identifying properties, etc.).
(This can also be nicely aligned with the future work section of the paper.)
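For illustration, a minimal Data Cube example could look as follows (a hypothetical sketch with invented ex: names, not taken from the paper):

PREFIX qb:  <http://purl.org/linked-data/cube#>
PREFIX ex:  <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Hypothetical sketch: one dataset, its structure and a single observation.
INSERT DATA {
  ex:berlin-budget a qb:DataSet ;
      qb:structure ex:berlin-budget-structure .

  ex:berlin-budget-structure a qb:DataStructureDefinition ;
      qb:component [ qb:dimension ex:refYear ] ;
      qb:component [ qb:measure   ex:amount ] .

  ex:obs1 a qb:Observation ;
      qb:dataSet ex:berlin-budget ;
      ex:refYear "2013"^^xsd:gYear ;
      ex:amount  1200000.0 .
}

Such an example would make it much easier for readers unfamiliar with the vocabulary to follow the transformation section.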

Section 3:

This section should include a discussion of 1) the frequency and handling of updates of existing datasets and 2) how often new datasets arrive and need to be processed.
The authors mention that existing datasets are very unlikely to undergo changes; however, they do not exclude that changes can happen, and those should be considered in their transformation process.
Do the authors have an idea of how frequently new datasets are added?
Also, the description of Table 2 is not entirely clear to me. An example or more details about the URL patterns would support a better and easier understanding.
The sentence "In the first step, all datasets are downloaded, in several parts if necessary," caused confusion:
are the transformation steps applied to each dataset individually or over the aggregated data?

The authors describe their error handling but mention only one type of error, namely "missing values of component properties".
What are the other types of errors observed, and can they be related to matching problems concerning equivalent component properties?

It would be very interesting to have some insights into the resources used to convert the datasets (e.g. memory requirements) and statistics about runtimes.
I assume it is feasible to run those transformation steps on a daily basis, but I would like to see some statements about this in the paper.

I suggest moving the interlinking subsection of Section 5 into the transformation section and describing in more detail to what extent the LinkedSpending datasets reuse external LOD identifiers.

Section 4 & 5:
I would suggest merging Sections 4 and 5 to give an overall overview of the LinkedSpending project, including the details of the data and the various access options.
I think an overview of the reused vocabularies would be a nice contribution to this section, in addition to the paragraph "Use of Established Vocabularies".
Table 4 is missing the number of links to external datasets.

Section 6:
I suggest extracting some appealing use cases from this section and presenting them as high-level motivation in or after Section 1.
Since the queries in Section 5 are only discussed in Section 6, I suggest moving them into that section and presenting only the queries that are actually discussed.
In addition, the authors present the use case of comparing "the spending on education per person" by using the population size from LinkedGeoData.
I assume the authors do not import information from LinkedGeoData; in case they do, they should briefly discuss the problem of keeping such third-party data up to date.

Section 7:
I am missing a discussion of how those projects (e.g. FTS) are related to and differ from LinkedSpending.

Section 8:
The authors mention in Section 6 that more datasets can be integrated with the LinkedSpending project.
It would be interesting to have some concrete examples and to learn what the added value is.
It seems that the tasks of "individual modelling" and "drilldowns" are the most crucial factors for this project.
In addition to the possibilities for solving those challenges, it would be very interesting to learn from the authors' experience which way might be the most promising.

Additional comments

The URL http://linkedspending.aksw.org/ was not accessible during the review (08.12.2013): "Unable to connect to Virtuoso Universal Server via ODBC".

In general, the presentation is clear; however, there are some sentences which are hard to read and break the flow of the text,
e.g.:
"All datasets have at least one measure... "
"69 of the datasets do not contain"
chronjob -> cronjob
"...a mapping entry is not specified, however, and the .." ->" not specified and the ..."
...

Review #2
By Oscar Corcho submitted on 11/Dec/2013
Suggestion:
Major Revision
Review Comment:

This paper describes a dataset about government expenditure that is made available as Linked Data by reusing and transforming the data available at OpenSpending.org as a source.

As a dataset description paper, I will follow the usual evaluation criteria that have been proposed in the context of previous special issues on Linked Data dataset descriptions, namely quality of the dataset, usefulness (or potential usefulness) of the dataset, as well as clarity and completeness of the descriptions. Besides, I am also following the checklist that is used for such dataset descriptions:
- Name, URL, versioning, licensing, availability: COVERED
- Topic coverage, source for the data: COVERED
- Purpose of the Linked Dataset, e.g. demonstrated by relevant queries or inferences over it: PARTIALLY COVERED, since the queries that are provided mostly demonstrate what can be done once some data is available in the RDF Data Cube vocabulary, but the paper does not provide a clear description of other applications currently using the dataset.
- Applications using the dataset and other metrics of use: NOT COVERED (see my previous comment)
- Creation, maintenance and update mechanisms as well as policies to ensure sustainability and stability: COVERED
- Quality, quantity and purpose of links to other datasets: NOT COVERED (this should be discussed, especially with respect to linking to the SDMX vocabularies, if relevant, or to SKOS/XKOS)
- Domain modeling and use of established vocabularies: RDF Data Cube is extensively and adequately used. However, it would also be useful to add links to SKOS/XKOS vocabularies/thesauri for the dimensions (see the sketch after this list).
- Examples and critical discussion of typical knowledge modeling patterns used: COVERED
- Known shortcomings of the dataset: COVERED
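To make the SKOS point concrete: dimensions can be linked to code lists via qb:codeList, roughly as in the following sketch (hypothetical names, not from the paper):

PREFIX qb:   <http://purl.org/linked-data/cube#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX ex:   <http://example.org/>

# Hypothetical sketch: attaching a SKOS code list to a dimension property.
INSERT DATA {
  ex:spendingSector a qb:DimensionProperty ;
      qb:codeList ex:sectors .

  ex:sectors a skos:ConceptScheme .

  ex:education a skos:Concept ;
      skos:inScheme ex:sectors ;
      skos:prefLabel "Education"@en .
}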

As can be seen from this checklist analysis, most of the aspects that are required for dataset descriptions in the journal are covered. The only two main missing points, which should be improved should the paper be accepted with some type of revision, are a better description of potential applications and a better usage of other vocabularies that are normally associated with the use of RDF Data Cube.

Now I will move into reviewing the other aspects that are normally covered in dataset reviews:

Quality of the dataset
----------------------
I have some concerns about the overall quality of the described dataset. This is not to say that the dataset is wrongly
transformed, as the use of Data Cube seems to be adequate (both in terms of how the observations are generated and
in terms of how the data cube structure is generated). Besides, the transformation process seems adequate as well, but in my opinion it
falls short of what should be done. I will comment on some of these shortcomings/design decisions, which
are not well explained:
- The threshold that is applied to the number of missing values for a component property is very ad hoc. Why do you
use 30 observations and 10% of the observation-property pairs? I can imagine cases where there is a lot
of detail in the potential values of the dimensions, but where the data is heavily aggregated in many cases, providing
only observations of aggregations over some property values. I have seen many cases of statistical data like this. It
seems to me that it is necessary to establish more clearly, and in a more justified manner, why these thresholds are
applied.
- The use of the same URL for properties with the same name in different datasets seems OK in general, and the authors
justify that this has not caused any problems in the transformation. However, when dealing with this kind of data, it is
very common to use properties that are proposed, for instance, by the SDMX COG (e.g. sex, area, age). Are any of
those properties applicable? Why do you create new URLs (it should probably be URIs, in fact) instead of trying to reuse
some of those already available properties, by using some linking technique or even humans in the loop (see the sketch below)?
This would probably increase the quality of the dataset, IMO.
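For instance, a data structure definition reusing the SDMX COG dimension properties could look roughly like this (a hypothetical sketch; the ex: names are invented):

PREFIX qb:             <http://purl.org/linked-data/cube#>
PREFIX sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#>
PREFIX ex:             <http://example.org/>

# Hypothetical sketch: reusing SDMX COG properties instead of minting new ones.
INSERT DATA {
  ex:structure a qb:DataStructureDefinition ;
      qb:component [ qb:dimension sdmx-dimension:refArea ] ;
      qb:component [ qb:dimension sdmx-dimension:refPeriod ] .
}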

Some additional questions that I have are the following:
- On page 4, Figure 4 includes the description of a slice. In fact, I would recommend changing ns0 there to qb, which is
the commonly used prefix for Data Cube. However, you do not explain how you generate slices from the data that you are using.
How do you generate this slice (a minimal slice sketch is given below)? The same applies to the Dublin Core ones.
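For reference, a slice in the Data Cube vocabulary looks roughly as follows (a hypothetical sketch with invented ex: names), which is why the choice of the fixed dimensions deserves an explanation:

PREFIX qb:  <http://purl.org/linked-data/cube#>
PREFIX ex:  <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Hypothetical sketch: a slice fixing the year dimension to 2013.
INSERT DATA {
  ex:slice2013 a qb:Slice ;
      qb:sliceStructure ex:sliceByYear ;
      ex:refYear "2013"^^xsd:gYear ;
      qb:observation ex:obs1 .

  ex:sliceByYear a qb:SliceKey ;
      qb:componentProperty ex:refYear .
}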
- You say that the LinkedSpending dataset is continually growing, but the CubeViz visualisation that is available and
the following query against the SPARQL endpoint suggest otherwise:

PREFIX qb:      <http://purl.org/linked-data/cube#>
PREFIX dcterms: <http://purl.org/dc/terms/>
# assuming the creation date is recorded via dcterms:created
SELECT DISTINCT ?a ?date WHERE { ?a a qb:DataSet ; dcterms:created ?date } ORDER BY ?date

I get that all datasets were created at the same time (early September 2013), and there is an inconsistency in the number
of datasets with respect to the paper (which says 321): evaluating

PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT (COUNT(DISTINCT ?a) AS ?count) WHERE { ?a a qb:DataSet }

I get 261 different datasets. Where does this inconsistency come from?

- Have you actually done what you propose in the interlinking section on page 6? That is, using the labels of
datasets and dimensions and linking them to regions?

Usefulness (or potential usefulness) of the dataset
---------------------------------------------------
Personally, I see a lot of potential in this dataset, especially with regard to
achieving a better degree of transparency in the public accounts and expenditure
of governments around the world. It is also true that this potential really comes from the dataset
that is used as the origin of this linked dataset (OpenSpending), and it is difficult to see
what is added by providing a Linked Data transformation of data that is already
available through an API. I would see the benefit if the current dataset were better linked
with other existing datasets, if SKOS/XKOS vocabularies were identified, if common vocabularies
about spending were used, etc. This would allow going across datasets from different countries
and would allow people like data journalists to make better comparisons.

However, the dataset does not seem to be reused by others yet, which is something very important for
linked dataset descriptions in SWJ.

BTW, in Table 5 you show several queries. Some of them, for instance query 2, could be written more easily
using property paths, as the sketch below illustrates. Why don't you use them? This is a minor point, obviously.
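As a generic illustration (not the paper's actual query 2): a chain of triple patterns such as

PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT ?component WHERE {
  ?obs qb:dataSet ?ds .
  ?ds  qb:structure ?dsd .
  ?dsd qb:component ?component .
}

can be collapsed into a single property path:

PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT ?component WHERE {
  ?obs qb:dataSet/qb:structure/qb:component ?component .
}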

Clarity and completeness of the descriptions
--------------------------------------------
The description is very thorough, both on the dataset itself and on the transformation process.

Minor comments - typos
----------------------
There are several typos and grammar errors throughout the paper. I have identified here some of them:
- Page 2: "the is" --> "this is"
- Page 2: "the the" --> "the"
- Page 2. The paragraph "Because the source data adheres..." is not understandable. What do you mean there?
- Page 2. When you refer to the previously explained data cube model, it may be better to bring forward a little the description of SDMX, on which RDF Data Cube is largely based. It is described much later in the paper.
- Page 3. You refer to other tools in the LOD2 stack. It would be good to provide their names. Personally, I have only used CubeViz, but I was not aware of the existence of other tools.
- Page 4. Facetted --> faceted
- Page 4. underlaying --> underlying
- Page 4. Examplarily. Is this correct?
- Page 6. I cannot understand well the paragraph that starts with "We generally preferred...". Can you rephrase?

Review #3
By Andreas Hotho submitted on 26/Feb/2014
Suggestion:
Reject
Review Comment:

The contribution of this work is an integrated view on publicly
available financial data. OpenSpending.org, a new platform, is
presented; the transformation and integration of the mentioned data as
well as the model are introduced, and potential queries to work
with it are given. In addition, a nice visualisation gives a first
impression of the power of the proposed solution.

The preparation, transformation and integration of the data are
quite nice. The chosen model seems reasonable, and I like the idea
that everything is open source and that the data becomes accessible in
an integrated form. The presented technical work as well as the
discussed queries and usage scenarios demonstrate nicely what can
be done with the system.

Unfortunately, there are some issues which need to be addressed
before the paper can be published. As far as I understand, the
contribution of the work is limited to the transformation of the
data into RDF, the integration, and the setup of the system
including the visualization. What I miss is a discussion of the
underlying reasons why this architecture, and not another one, was
chosen as the solution. What are the benefits of the proposed way of
setting up such a system? This could be shown by some experiments, or at
least there should be a detailed discussion of other systems and
approaches using other techniques as well, such as OLAP cubes (more
details below). This would lead to a clear design decision.

There is another point which needs to be improved: what is the
scientific challenge in setting up such a system? This needs to be
explained and presented in the paper as well. The benefit would be
a presentation of the system including the lessons learnt and the
scientific challenges solved, which would increase the value of the
work and give better insights.

To summarize my arguments: I like the idea of the work and I am a
fan of this kind of system, but I think the presentation of the
work needs to be done in such a way that everyone can easily see
why this was a clever way to solve the task, and so that the lessons
learnt become immediately evident to everyone.

Some additional remarks:

A comparison with traditional OLAP and data warehouse systems is
missing. It is mentioned within the text but not discussed as a
competing solution. So, what is the difference between the OpenSpending
approach and a data cube in an OLAP system? The mentioned
advantages are not clear, as the idea of a data warehouse is the
integration of different datasets over time. This is mentioned as
one advantage in the introduction of OpenSpending as well, but it
is never explained why it holds and why it is a benefit.

As a remark, an OLAP cube is modeled as a star schema in a
relational database; it is not, as stated in the paper, merely
similar to the star schema.

Only one query (query 7) is given to show the power of the
RDF-based system over DB systems. I think this query can also be stated
in current relational databases. Such database systems are
moving in the direction of knowledge bases too and often allow such
queries. I suggest providing more background information to
substantiate this claim.

The remark in Sec. 5, Example Queries: "...An equivalent query
using relational databases would thus be more convoluted." needs
to be explained in more detail. Even if the query is more
difficult to write, this says nothing about the performance of the
underlying system, nor does it show that the query cannot be stated at all.

The last question with respect to the comparison with databases
concerns performance. I suggest comparing the
proposed system with a standard database system, either by
pointing to literature which discusses cases comparable to the
one in this work, or by doing some performance analysis, which I
would prefer.

The related work misses important parts, as it only compares with other
models from the application point of view, but not from the point of view of the
underlying framework and its performance. Either the contribution
is limited to the representation of the financial datasets and the
possibility to query them, independent of performance, or the
related work is not complete. I miss the discussion of two parts:
the modeling in an OLAP-like way in RDF, and the system component.
For the latter, I would like to see a comparison with systems like:

http://lsm.deri.ie/

which of course do not model financial data, but other data with
similar properties. Another option could be the usage of:

http://superstreamcollider.org/

which integrates data in an easy way. Finally, relational databases
could act as a kind of competing system.

Minor:

What is meant by "2005 to 2035" in the abstract? Is this a
mistake?