Review Comment:
The paper presents LOD2 Statistical Workbench project, which is a collection of existing tools and systems along with a set of new components put together to support publication of statistical data in RDF. The paper is certainly a valuable contribution to this field and on a very relevant topic which can make a nice article in “Reports on tools and systems” category. However, the paper’s organization and presentation style makes it more like a nice technical report or project documentation, and not a high-quality journal article. More detailed comments below.
Issue #1: Motivation
First, the motivation behind the work and this paper is not at all clear. One would expect at least a high-level discussion of why there is a need for publication and management of statistical data using RDF and Linked Data in the introduction section. Then there is Section 2 entitled “motivation” in which I was hoping to find such a discussion, but the section talks about:
1) The European Commission investment in "delivering large amounts of trusted data to the public and improve interoperability".
2) A list of a large number of projects funded by EU FP7, and a very brief description of some of these projects and tools.
3) The goals of the LOD2 Statistical Workbench.
so, what is the "motivation"? The only statement related to motivation in this section is: "The work … was motivated by the need to support the process of publishing statistical data in the RDF format using common vocabularies such as the RDF Data Cube [10]". Are you expecting the reader to refer to reference [10] to find the motivation? Still, reference [10] is a technical specification of RDF Data Cube, with not much motivation for a need for a workbench on publishing data in RDF using the vocabulary and the potential applications.
In Section 2, I do not understand the purpose of Figure 1, and it is not described in enough detail. Similarly, Table 1 that seems to include a set of scenarios that in part motivate this work, is not explained at all. For example, what is "Code lists - creating and maintaining"?. Why "Export" is an important goal? We can do "Export" using any popular data model and format. We don't need RDF and RDF Data Cubes for that. Am I missing something here?
After reading the first two sections, I was still wondering: what is the problem addressed in this paper? i.e., what is the problem that LOD2 statistical benchmark addresses? why is it important? and what is missing from the existing tools or in the literature? Typically, at this point I should know why I should continue reading the paper and if it is addressing an important and relevant problem.
Issue #2: Missing details and scientific/technical discussions
Here are a few examples where I was hoping to read more discussion of details (or discussion related to goals, motivation, justification of choice of tools, and experience with real-world use cases).
In Section 3.1 where you discuss Import features, one example of missing details is the requirements of loading XML data in the system. Do you expect that users are experts in their domain and familiar with XSLT and will write custom transformations? Is this a reasonable assumption? How would the system take in the transformation scripts as input? Why do you pick this method and how about the alternatives? For example, how about XSPARQL http://www.w3.org/Submission/xsparql-language-specification/ or the approach used in xCurator http://dblab.cs.toronto.edu/project/xcurator/ ?
Similarly, how does the “Find more data online” feature work? Do you perform similarity analysis of some sort to find related/similar data? or is just a search over a CKAN repository metadata for example?
Similar questions can be raised for other features of the system. This is one of the reasons that the paper reads more like a technical report or project documentation, where no detailed description of the features or justification in choice of features or techniques is provided.
In Section 4, I was hoping to find the answer to some of my questions regarding motivation and technical details by reading real-world use case scenarios. But again, a high-level description of a set of tasks are given without justifying why they are needed.
For example, in Section 4.2 first paragraph, you state “every dataset should be validated to ensure it conforms to the RDF Data Cube model.” Why? What happens if you don't do this properly? Are there real-world application scenarios you can describe that have suffered from quality issues and therefore your solution has helped?
In Section 4.3, you describe a set of tasks that can be performed in the UI. Why are these tasks important? Again, are there real-world scenarios that take advantage of these? What would be the alternative if LOD2 Statistical Workbench was not developed?
In Section 4.4, you describe the need for alignment. Do you perform the alignment? Are there any scenarios you have explored that require this type of alignment?
Section 5 has the same issue. In Section 5.1, why is SORS interested in using LOD2 Statistical Workbench? What are the advantages? Any use cases where publishing LOD makes a difference, comparing with publishing bunch of CSV files for example? Similarly in Section 5.2, what is achieved as a results of this “triplification” process and the visualization? And again in Section 5.3, any example scenarios why this transformation from CSV is useful? I am not saying it is not, and I am sure it is very useful, but this paragraph really has no value unless you explain why the outcome is useful in this case, and why LOD2 Statistical Workbench is needed and what is wrong with just using a simple RDF wrapper for example.
Note that in all of the above examples, I am not questioning your motivation or the design of your system, and I am sure there are good answers to each of the questions. Providing answers to some of these questions would be a very valuable contribution and would make the paper a higher quality article.
Issue #3: Presentation issues
The paper has way too many URLs that are cited improperly for a journal article. Your article is expected to have a much longer life than most of the URLs listed or cited in the paper, and so a person reading this paper in say 10 years from now will have to deal with a long set of broken URLs. You need to move all of the URLs to your list of references and include date of access as pointed out in commonly used MLA or APA styles:
https://owl.english.purdue.edu/owl/resource/747/08/
http://www.studygs.net/citation.htm
This will at least make it possible to use web archive copies (if available) in the future.
Another issue is lack of proper description of figures and tables as I also mentioned in the above issues. In Section 3.8, this lack of discussion has apparently resulted in a typo (I cannot see scenarios in Table 2). In 5.4, what is Figure 6 describing?
My overall suggestion is to change the organization of the paper to start with a brief description of high-level goals and motivations, then continue with real-world scenarios that clearly show the application of the LOD2 Statistical Workbench in real-world scenarios and why existing systems or tools fail in achieving similar goals (if they fail), and then describe the technical details. You can discuss related work while discussing scenarios or technical details, or in a separate section afterwards, but again the discussions should be more than brief sentences that only say what the related project/system/paper is.
|