Supporting the Linked Data publication process with the LOD2 Statistical Workbench

Tracking #: 591-1798

Valentina Janev
Bert Van Nuffelen
Vuk Mijović
Karel Kremer
Michael Martin
Uroš Milošević
Sanja Vraneš

Responsible editor: 
Philippe Cudre-Mauroux

Submission type: 
Tool/System Report
Abstract:

In the last few years, with the rise of the open data movement, a large and increasing number of governments and organizations have started to make information freely available and easily accessible online. Additionally, in order to increase transparency, improve interoperability and interaction with citizens and society as a whole, and also create new businesses and job opportunities, national governments publish their data in a machine-readable and future-proof format. In this paper we present the LOD2 Statistical Workbench, an integrated set of professional tools for accessing, manipulating, exploring and publishing statistical data. The data representation and processing are based on the W3C standard vocabularies (with the RDF Data Cube as the main model) and open source components delivered by the LOD2 consortium. The system meets the needs of both publishers and consumers of statistical data and directs the potential of the LOD2 tools to the specific domain of the statistical office. Using an illustrative case study of the Statistical Office of the Republic of Serbia, the paper introduces the user requirements, gives an overview of possible scenarios and shows examples of its use. The first results indicate that wider adoption of the LOD2 tools in practice can be foreseen.
Decision: Major Revision

Solicited Reviews:
Review #1
By Oktie Hassanzadeh submitted on 09/Mar/2014
Major Revision
Review Comment:

The paper presents the LOD2 Statistical Workbench project, a collection of existing tools and systems, along with a set of new components, put together to support the publication of statistical data in RDF. The paper is certainly a valuable contribution to this field, on a very relevant topic, and could make a nice article in the “Reports on tools and systems” category. However, the paper’s organization and presentation style make it read more like a nice technical report or project documentation than a high-quality journal article. More detailed comments below.

Issue #1: Motivation

First, the motivation behind the work and this paper is not at all clear. One would expect at least a high-level discussion of why there is a need for publication and management of statistical data using RDF and Linked Data in the introduction section. Then there is Section 2 entitled “motivation” in which I was hoping to find such a discussion, but the section talks about:
1) The European Commission investment in "delivering large amounts of trusted data to the public and improve interoperability".
2) A list of a large number of projects funded by EU FP7, and a very brief description of some of these projects and tools.
3) The goals of the LOD2 Statistical Workbench.
So, what is the "motivation"? The only statement related to motivation in this section is: "The work … was motivated by the need to support the process of publishing statistical data in the RDF format using common vocabularies such as the RDF Data Cube [10]". Are you expecting the reader to refer to reference [10] to find the motivation? Still, reference [10] is a technical specification of the RDF Data Cube vocabulary, with little discussion of the motivation for a workbench for publishing data in RDF using the vocabulary, or of the potential applications.

In Section 2, I do not understand the purpose of Figure 1, and it is not described in enough detail. Similarly, Table 1, which seems to include a set of scenarios that in part motivate this work, is not explained at all. For example, what is "Code lists - creating and maintaining"? Why is "Export" an important goal? We can do "Export" using any popular data model and format; we don't need RDF and RDF Data Cubes for that. Am I missing something here?

After reading the first two sections, I was still wondering: what is the problem addressed in this paper? That is, what is the problem that the LOD2 Statistical Workbench addresses? Why is it important? And what is missing from the existing tools or in the literature? Typically, at this point I should know why I should continue reading the paper and whether it is addressing an important and relevant problem.

Issue #2: Missing details and scientific/technical discussions

Here are a few examples where I was hoping to read more discussion of details (or discussion related to goals, motivation, justification of choice of tools, and experience with real-world use cases).

In Section 3.1, where you discuss Import features, one example of missing details is the requirements for loading XML data into the system. Do you expect that users are experts in their domain, familiar with XSLT, and will write custom transformations? Is this a reasonable assumption? How would the system take in the transformation scripts as input? Why did you pick this method, and what about the alternatives? For example, how about XSPARQL or the approach used in xCurator?
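To make the concern concrete, this is roughly the kind of hand-written transformation a non-expert user would otherwise be asked to produce (a Python sketch standing in for XSLT; the element names, attribute names, and URIs here are invented purely for illustration, not taken from the paper):

```python
import xml.etree.ElementTree as ET

# Hypothetical input: a tiny, SDMX-like XML fragment (all names invented).
XML_SNIPPET = """
<observations>
  <obs area="RS" year="2012" value="7199077"/>
  <obs area="RS" year="2013" value="7181505"/>
</observations>
"""

def xml_to_turtle(xml_text, base="http://example.org/obs/"):
    """Emit one qb:Observation per <obs> element as Turtle text."""
    root = ET.fromstring(xml_text)
    lines = ["@prefix qb: <http://purl.org/linked-data/cube#> ."]
    for i, obs in enumerate(root.findall("obs")):
        subject = "<{}{}>".format(base, i)
        lines.append("{} a qb:Observation ;".format(subject))
        lines.append('    <{}area> "{}" ;'.format(base, obs.get("area")))
        lines.append('    <{}year> "{}" ;'.format(base, obs.get("year")))
        lines.append('    <{}value> "{}" .'.format(base, obs.get("value")))
    return "\n".join(lines)

print(xml_to_turtle(XML_SNIPPET))
```

Even this toy mapping requires the user to know the target vocabulary and URI scheme, which is exactly why the assumption about user expertise deserves discussion.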

Similarly, how does the “Find more data online” feature work? Do you perform similarity analysis of some sort to find related/similar data? Or is it just a search over CKAN repository metadata, for example?

Similar questions can be raised for other features of the system. This is one of the reasons the paper reads more like a technical report or project documentation: no detailed description of the features, or justification of the choice of features or techniques, is provided.

In Section 4, I was hoping to find the answer to some of my questions regarding motivation and technical details by reading real-world use case scenarios. But again, a high-level description of a set of tasks is given without justification of why they are needed.

For example, in Section 4.2 first paragraph, you state “every dataset should be validated to ensure it conforms to the RDF Data Cube model.” Why? What happens if you don't do this properly? Are there real-world application scenarios you can describe that have suffered from quality issues and therefore your solution has helped?
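For context on what such validation involves: the RDF Data Cube specification does define a set of integrity constraints (IC-1 through IC-21, expressed there as SPARQL ASK queries). The following plain-Python analogue, with invented dimension names, merely illustrates the kinds of problems validation catches (cf. IC-11, "all dimensions required", and IC-12, "no duplicate observations"); it is not the paper's mechanism:

```python
# Toy well-formedness check in the spirit of the Data Cube integrity
# constraints. Dimension names and data are invented for illustration.
DIMENSIONS = {"area", "year"}

observations = [
    {"area": "RS", "year": "2012", "value": 7199077},
    {"area": "RS", "year": "2012", "value": 7181505},  # duplicate key
    {"area": "RS", "value": 7146759},                  # missing 'year'
]

def validate(obs_list, dimensions):
    """Return a list of violation messages (empty means well-formed)."""
    errors, seen = [], set()
    for i, obs in enumerate(obs_list):
        missing = dimensions - obs.keys()
        if missing:  # cf. IC-11: every dimension must have a value
            errors.append("obs %d: missing dimension(s) %s" % (i, sorted(missing)))
            continue
        key = tuple(obs[d] for d in sorted(dimensions))
        if key in seen:  # cf. IC-12: no two observations share all dimension values
            errors.append("obs %d: duplicate dimension values %s" % (i, key))
        seen.add(key)
    return errors

for e in validate(observations, DIMENSIONS):
    print(e)
```

A cube failing such checks can silently break downstream slicing and aggregation, which is presumably the answer to "what happens if you don't do this properly" — but the paper should say so.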

In Section 4.3, you describe a set of tasks that can be performed in the UI. Why are these tasks important? Again, are there real-world scenarios that take advantage of these? What would be the alternative if LOD2 Statistical Workbench was not developed?

In Section 4.4, you describe the need for alignment. Do you perform the alignment? Are there any scenarios you have explored that require this type of alignment?

Section 5 has the same issue. In Section 5.1, why is SORS interested in using the LOD2 Statistical Workbench? What are the advantages? Any use cases where publishing LOD makes a difference, compared with publishing a bunch of CSV files, for example? Similarly, in Section 5.2, what is achieved as a result of this “triplification” process and the visualization? And again in Section 5.3, are there any example scenarios showing why this transformation from CSV is useful? I am not saying it is not, and I am sure it is very useful, but this paragraph really has no value unless you explain why the outcome is useful in this case, why the LOD2 Statistical Workbench is needed, and what is wrong with just using a simple RDF wrapper, for example.

Note that in all of the above examples, I am not questioning your motivation or the design of your system, and I am sure there are good answers to each of the questions. Providing answers to some of these questions would be a very valuable contribution and would make the paper a higher quality article.

Issue #3: Presentation issues

The paper has way too many URLs that are cited improperly for a journal article. Your article is expected to have a much longer life than most of the URLs listed or cited in the paper, so a person reading this paper in, say, 10 years will have to deal with a long set of broken URLs. You need to move all of the URLs to your list of references and include the date of access, as pointed out in commonly used MLA or APA styles. This will at least make it possible to use web archive copies (if available) in the future.

Another issue is the lack of proper description of figures and tables, as I also mentioned in the above issues. In Section 3.8, this lack of discussion has apparently resulted in a typo (I cannot see the scenarios in Table 2). In Section 5.4, what is Figure 6 describing?

My overall suggestion is to change the organization of the paper to start with a brief description of the high-level goals and motivations, then continue with real-world scenarios that clearly show the application of the LOD2 Statistical Workbench and why existing systems or tools fail in achieving similar goals (if they fail), and then describe the technical details. You can discuss related work while discussing scenarios or technical details, or in a separate section afterwards, but again the discussions should be more than brief sentences that only say what the related project/system/paper is.

Review #2
By Francois Scharffe submitted on 14/Mar/2014
Major Revision
Review Comment:

This paper was submitted as a report on tools and systems. As a reminder, the SWJ guidelines for this type of paper say the following:

Reports on tools and systems – short papers describing mature Semantic Web related tools and systems. These reports should be brief and pointed, indicating clearly the capabilities of the described tool or system. It is strongly encouraged, that the described tools or systems are free, open, and accessible on the Web. If this is not possible, then they have to be made available to the reviewers. For commercial tools and systems, exceptions can be arranged through the editors. These submissions will be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

This paper presents a framework to manage statistical linked data. The framework consists of a set of tools integrated in a common environment. It supports various operations on the data, based on the Data Cube vocabulary. The paper really is a description of the various tools integrated together, as expected for the tools and systems category.

The paper is well written and well structured. It is, however, too long: 12 pages in double-column format is not a short paper.

The problem of publishing and managing statistical data as linked data is an important one, and the tools seem to provide the necessary set of features required. The absence of any evaluation, however, does not allow the reader to form an objective opinion about the efficiency of the tools. The paper describes the tools at a high level and does not include information such as the volume of data considered, the execution time, the amount of user involvement, or the amount of expertise needed to use them.

The related work only considers European projects. The scope should be enlarged to be more international. Related work should be more than a list of links and project names: references to published papers and an objective comparison with current works should be included.

There are already various papers about the LOD2 stack:
- Managing the Life-Cycle of Linked Data with the LOD2 Stack (ISWC 2012)
- Facilitating the Publication of Open Governmental Data with the LOD2 Stack
- Facilitating Data-Flows at a Global Publisher using the LOD2 Stack

each of which redundantly describes many of the components described in this paper. It would be clearer for people interested in the technology to publish one common tool paper about the LOD2 stack, including its various extensions for statistical data, government data, or publishing data.

The paper is too long in its current form. It is both an application report and a tool description. The authors should choose one category and shorten the paper accordingly.

Also the paper should discuss the limitations of the tools.

Minor remarks and typos:
- 3.7: is RDF/JSON the same as JSON-LD?
- The paper discuss -> This paper discusses
- "in future" -> "in the future"
- ref 1: S. Auer, et all -> et al.
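On the 3.7 remark: RDF/JSON and JSON-LD are in fact distinct serializations, which is why the paper should say which one it means. A minimal sketch of the same single triple in both formats (the URIs are invented for illustration): RDF/JSON keys triples as subject -> predicate -> list of value objects, while JSON-LD uses a @context to map plain keys onto IRIs.

```python
import json

# One triple, two serializations (example.org URIs are invented).
rdf_json = {
    "http://example.org/RS": {
        "http://example.org/population": [
            {"type": "literal", "value": "7199077"}
        ]
    }
}

json_ld = {
    "@context": {"population": "http://example.org/population"},
    "@id": "http://example.org/RS",
    "population": "7199077",
}

# Structurally different documents, even though both encode one triple.
print(json.dumps(rdf_json) != json.dumps(json_ld))
```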

Review #3
By Christophe Guéret submitted on 25/Mar/2014
Major Revision
Review Comment:

This system paper presents the LOD2 Statistical Workbench, an integration of several tools developed by the members of the LOD2 consortium that relate to the creation, publication and enhancement of data cubes. The system is presented in the context of three use cases, one of which is already implemented.

The paper is interesting, but I would recommend revising it to add some more information about the workbench. Namely:
* A set of screenshots showing the integrated interface, with some call-outs referring to the different topics listed on Page 4, right column.
* A comparison with other tools aimed at analysing statistical data, whether they are based on LD or not. There must be such tools around using SDMX and OLAP, and it would be interesting to read how the presented workbench compares to them (in terms of performance, user interface, etc.).
* More information about the use cases. How many datasets have been published using the workbench? How many users are working with it daily within the statistical office of Serbia? Are they pleased with it? Some hints about the adoption of the tool would be a great addition to Sections 5.1, 5.2 and 5.3.
* The challenges highlighted in 5.4 are rather general and could apply to any system. Weren't there any more specific challenges to overcome?

If space becomes an issue, I would suggest cutting heavily through the text describing the Data Cube specification. Though informative and well written, such an introduction could take place in another document, to leave more room to focus on describing the system this paper introduces.