Review Comment:
The paper presents a survey of methodologies for publishing government data as linked open data, so-called linked open government data (LOGD). The authors identified 25 relevant papers and analyzed them to answer 4 research questions about the methodologies described in the papers. On the basis of the analysis, they derive a new unified model for publishing LOGD.
(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.
The paper is a survey paper and as such it shall serve as an introductory text. However, in the introduction to the paper there are several problems which need to be addressed by the authors.
The first sentence states that "linking and combining datasets have become major topics for data consumers". However, there are many well-known issues consumers have to deal with, e.g. data quality or provenance. Linking and combining is a topic for some of them. Many of them, however, do not even know that linking and combining is possible, and some of them do not even need it. They typically work with a single dataset. Therefore, it is questionable for which consumers this is a major topic. An analysis would be necessary to support such a strong statement.
There are several statements in the first paragraph of the introduction that may be true but the paper does not provide any evidence for them.
1) using simple algorithms to convert data to LOD, without curation efforts - the referenced papers 1,2,3 do not say this. Is there any analysis which shows this? If no analysis is known, some examples of datasets suffering from these issues should be presented.
2) focus on metadata instead of data - again, the referenced papers 1,2,3 do not state this. An analysis showing this or at least some examples should be provided and explained.
The end of the first paragraph of the introduction (starting at line 34, 2nd column) is about open data in general, not linked government data. How is this related to and relevant for the paper and the surrounding paragraphs? It shall be explained better.
In the second paragraph of the introduction, the authors say that government data has many important applications. We can agree with this intuitively. However, the referenced work 6 does not seem to be relevant evidence for this statement. According to its abstract, its purpose is to investigate best practices of publishing linked data and how they are applied in practice. It says nothing about applications of governmental linked data.
The statement on the 2nd page, 1st column, 1st line that the LOD cloud comprises 200 linked datasets is not true. It is probably a typo, not consistent with Fig. 1. However, there is a more serious problem. Referencing datahub.io is not relevant here. Several years ago the original datahub.io service as a data catalog was shut down and replaced by a commercial product. It seems that it does not work at all now. The authors of lod-cloud.net maintain their own list of datasets which is described here: https://lod-cloud.net/#about .
I provided detailed comments on the introduction because the paper is submitted as a survey paper. It shall be used as an introductory text. Therefore, its introduction needs to be perfect and it cannot be misleading. My remarks above need to be resolved before the paper is accepted.
Section 2 is a general introduction to open data and linked open data and it is OK as an introduction to the problem.
At the end of Section 3, in its last paragraph, it would be helpful to state the specifics of publishing LOGD which distinguish it from publishing LOD in general. Without this knowledge a reader may ask why a survey specific to government data is necessary.
(2) How comprehensive and how balanced is the presentation and coverage
Section 4 presents the methodology used for the survey. This is the core of the paper. However, it is also its weakest part in my opinion. The methodology is not very well fitted to the specifics of the domain of government data. I explain this in the following paragraphs.
In my opinion the research questions defined in Section 4.1 are not sufficient. The authors say that the research questions were defined with the goal to investigate how methodologies cover data quality. However, for RQ1-3 it is not clear how they are related to data quality. On the other hand, RQ4 is about data quality in general. As we all know, data quality has many dimensions (Zaveri, Amrapali et al. ‘Quality Assessment for Linked Data: A Survey’, Semantic Web, vol. 7, no. 1, pp. 63-93, 2016). It is not clear from the research questions how these different dimensions will be considered in the survey. The answer to RQ4 may be very generic and it can be, therefore, hard to map it to the quality dimensions. Where the research questions are defined, I would like to see what data quality is and how its dimensions are targeted by the presented study.
Moreover, the research questions do not reflect the fact that the study concentrates only on publishing LOGD. If we agree on the necessity of doing a survey specifically for LOGD, the survey should have some research questions specific to LOGD. For example, a question could be how a methodology fits with the specific hierarchical structure of governments, their strictly separated competencies given by law, etc. The current research questions are too generic; they do not investigate how methodologies fit the needs of governments who shall publish their data as LOD. In my opinion, the research questions shall be more specific and reflect the specifics of government data. Regarding RQ1, it would also be interesting and valuable to see what steps methodologies for publishing LOGD add to general methodologies for publishing LOD.
The problems with the defined research questions arise in Section 5 where the results of analyzing the identified papers are presented. The answer to the 1st research question (RQ1) shows that RQ1 is insufficient. It lists generic steps which are common to publishing any LOD. It does not emphasize the specifics of LOGD. We can see this only partly, hidden in the text - for example on page 7, 1st column, around line 17: "In government scope, datasets are created by different agencies, using different formats, different levels of granularity for the data and the metadata". However, it is hidden in the surrounding text, which is valid for any LOD. It is not described systematically and explicitly, and it is not supported by evidence. Therefore, it is not clear how the surveyed methodologies differ from methodologies for LOD in general and how they fit the needs of governments.
I also have a minor remark (question) on the answer to RQ1. Vocabularies are discussed on page 7, 2nd column, around line 7. In the case of governmental data, vocabularies have to match concepts defined by legislation. Existing LOD vocabularies are generic, not related to legislation. How is this problem reflected?
Section 6 defines a unified model for publishing LOGD. It combines steps from existing methodologies. However, it is described without any direct and explicitly expressed relationship to the original methodologies. Therefore, it is not clear what steps are taken from which methodology. It seems to be a separate methodology with only weak relationships to the surveyed ones.
Moreover, it is hard to understand why a unified publishing model is necessary. The authors should provide some arguments for this. The problem of the whole survey is that it does not specify a goal or goals a methodology for publishing LOGD should fulfill. In general, the goal is to publish GD as LOD. This is however not sufficient. It is necessary to specify the needs and specifics of governments. These would be the goals. The existing methodologies shall be compared on how they meet the goals. The unified publishing model should pick the best parts of these methodologies and combine them to fit the needs of governments. The unified model should also cover what the existing methodologies lack the most - evaluation. None of this is covered by the proposed unified publishing model. Its purpose is therefore unclear.
In Section 7 - discussion, it is written that the presented paper reflects all specifics of open government data. This is a very strong statement. The paper does not provide a list of all specifics. It does not list any specifics explicitly.
(3) Readability and clarity of the presentation
The paper is easily readable and all statements are clear and easy to understand.
(4) Importance of the covered material to the broader Semantic Web community
The paper is important to the broader SW community as it shows how hard it is for governments to publish their data as LOD. However, the paper does not fulfill the potential of the topic, as I explain in my review of criterion (2) above.
In summary, I would expect the following for a major revision:
- Improve the introduction with better evidence of the presented statements
- Clarify explicitly the list of specifics of open government data
- Focus the research questions better on the specifics of open government data
- Show how existing methodologies reflect the specifics
- Better explain the relationships of the unified model to the existing methodologies and better define the purpose of the unified model
- Show some evaluation of the proposed unified model, at least on a case study