Methodologies for publishing linked open government data for the data on the web: a systematic mapping

Tracking #: 2382-3596

Bruno Elias Penteado
Seiji Isotani

Responsible editor: 
Jens Lehmann

Submission type: 
Survey Article
Since the beginning of the release of open data by many countries, different methodologies for publishing linked government data have been proposed. However, they seem to not be adopted by early studies exploring linked government data, for different reasons. In this work, we conducted a systematic mapping in the literature with the aim of synthesizing the different approaches around the following topics: common steps, associated tools and practices, quality assessment validations and evaluation of the methodology. The findings show a core set of activities, based on the linked data principles, but with very important additional steps for practical reuse in scale. Although a fair amount of quality issues are reported in the literature, very few of these methodologies embed validation steps in their process. We describe an integrated overview of the different activities and how they can be executed with appropriate tools. We also present research challenges that need to be addressed in future works in this area.

Major Revision

Solicited Reviews:
Review #1
By Marijn Janssen submitted on 09/Jan/2020
Minor Revision
Review Comment:

This paper describes an integrated method for publishing linked open data (LOD) that includes the different activities and how they can be executed with appropriate tools. The overview and mapping of activities is of interest and results in a more comprehensive understanding.

Table 2 provides an overview of the papers. An analysis of how this field developed over the years, based on these papers, is missing.

Fig. 4 should be labelled as a table. This table is a bit shallow and disappointing. The overview of tools is of interest, but a discussion of their differences, similarities, peculiarities, etc. is missing. As it is now, it provides limited insight.
The paper does not pay attention to maintaining the links (only ‘define maintenance tasks’). This part should be described in more detail, as it is crucial for the long term.

Review #2
Anonymous submitted on 15/Jan/2020
Major Revision
Review Comment:

The article describes a systematic mapping of existing methodologies for linked open government data (LOGD). The authors consider 18 articles extracted from an extensive search of the existing literature and analyze them along 4 research questions.

The topic is really interesting given the availability of different approaches in LOGD. In my opinion, articles like this must offer researchers (in the initial phase) and professionals a clear and reasoned state of the art, but also provide useful information for future research and a methodological tool for selecting an existing approach for a given task. This last point is very useful for professionals who want to find the most suitable methodologies for their business.

The document partially misses this point for the following reasons. First of all, the selected works are described only approximately; readers would therefore have to read all the articles before truly understanding this survey. A better description of all selected items is required, for example via an appendix. Second, the authors could use Google Scholar as an additional data source, where they would find more articles (I list some below: [1] [2] [3] [4] [5]).
The number of considered papers is very relevant in a “systematic” paper, and I strongly suggest also considering dedicated methodologies for relevant fields such as census data, bibliographic data, and so on.

Figure 3 is a key element in the document because it includes the phases and steps considered for analyzing the selected documents. The phases are not described in much detail, and I suggest creating a subsection for each phase in which the authors describe the objectives of its steps, which of the reviewed documents applied them, and which technologies they used.
Another point that requires more attention is the context in which each article proposes its methodology (e.g. type of source, type of data, type of organization and so on).

The discussion section is rather short and I would like to see a more in-depth analysis.
The process model of section 5 is merely presented, and it seems to be just a grouping of the previously discussed steps. For example, why is the definition of vocabularies part of the specification phase and not of the modeling phase?

In addition, the choice of license cannot be made only in the publication phase; it is also part of the specification phase, in which the organization or person who publishes a dataset must agree with the owner of the dataset (who may be different from the publisher).
In the exploitation phase, the engagement of the communities is fundamental (and in this case a list of proposals suggested by the paper could be very useful).
The data portal plays an essential role in defining, e.g., the management of dataset metadata; therefore the choice of the platform used to publish the data should probably be identified in the specification phase and implemented in the publication phase, not in the maintenance phase. At this stage, it is appropriate to apply a well-defined data lifecycle that includes phases for updating and removing datasets.

Moreover, it could be interesting (especially for practitioners) to understand which methodologies are most suitable with respect to the following:
- scope of the methodology (general purpose, or a specific domain such as statistical data or public health)
- automation level of the methodology
- size of the data and number of data sources to publish (e.g. a given methodology is proposed for a single large/small, structured/unstructured dataset)

[1] Ghislain Atemezing, Oscar Corcho, Daniel Garijo, José Mora, María Poveda-Villalón, Pablo Rozas, Daniel Vila-Suero, Boris Villazón-Terrazas. Transforming meteorological data into Linked Data. Semantic Web, vol. 4, no. 3, pp. 285-290, 2013.
[2] Guidelines for Linked Data generation and publication: An example in building energy consumption.
[3] J. Kučera. Open government data publication methodology. Journal of Systems Integration, 2015.
[4] Alex B. Andersen, Nurefşan Gür, Katja Hose, Kim A. Jakobsen, Torben Bach Pedersen. Publishing Danish Agricultural Government Data as Semantic Web Data. Joint International Semantic Technology Conference, 2014.
[5] Gustavo Pabón, Claudio Gutiérrez, Javier D. Fernández, Miguel A. Martínez-Prieto. Linked Open Data technologies for publication of census microdata. Journal of the American Society for Information Science and Technology.

Review #3
By Luis-Daniel Ibáñez submitted on 02/Feb/2020
Major Revision
Review Comment:

(0) Importance of the covered material to the Broader Semantic Web Community.

I think the aim of this study is important for the Semantic Web Community

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners.

I find this to be overall correct. Could benefit from the improvements on the other aspects proposed below.

(2) How comprehensible and balanced the presentation coverage is

- In section 3, it is mentioned that in [13] "linked data publishing methodologies are elicited, mostly from the government domain". However, [13] is about publishing drug data. I am unsure whether this is just a typo, but it does not allow me to assess how the authors position themselves against this previous work.

- In section 2.2, it is mentioned that Linked Data extends the concept of open data. This is arguable, as Linked Data can exist without open data and vice versa. I believe that, in the context of this work, it is Linked Open Data and Linked Open Government Data that are more important. For the latter, I miss some specific references about what it is, why it is important, and what has been done about it aside from the methodologies that will be studied later. Papers like this one might be useful to strengthen the

My main remarks are on the methodology section, mostly related to how comprehensive the survey is.

- I am missing the motivation behind the proposed research questions. How does the answer to each of them help to "describe what should be addressed to a better outcome of LOGD policies"? I also note the use of the word "policies". Why not other methodologies? What about tools and technical matters? RQ4 seems a bit odd.

- I tried to reproduce the queries made available by the authors in the separate online spreadsheet on 30 January 2020. I had some issues with them:
ACM DL: returned no results
IEEE Xplore: OK
Scopus: 3 results instead of 39 reported
Science Direct: Query appears incomplete with respect to the others
ISI Web of Knowledge: returns error "Invalid use of boolean operator"
Springer Link: A few more results; I assume the extra papers were indexed in the second half of 2019 (after the query date reported in the paper), so I assume OK

This needs to be clarified in a revised version.
A minor question is why Google Scholar was not considered?

- I consider it very important that the authors make available the full list of (469-18) papers they excluded and the corresponding reason for exclusion. This is important for reproducibility, and will also help reviewers (and future readers), as it will be easier to understand why a certain paper that I think should have been included is missing.

- A minor point on the inclusion criteria: please clarify whether you had institutional access to the listed search engines, and at what level of subscription. This would help quantify how paywalling affects your study.

- On the exclusion criteria, why have "focus on the application of LD in a specific domain"? Does this mean that papers that focus on a specific subset of open government data are ruled out? As an example, the following paper seems relevant but is not included:
I suspect this one in particular was filtered out due to the abstract not including the word "government" ("public administration" instead), but my general comment remains.

- It is unclear how reference [30] was used as a "control". Does this mean that you checked that the papers there also appear on your list?

- For a survey paper, I think the results and discussion sections lack some content. For example, for RQ3, only a small subset of the steps is discussed. For RQ1, I miss a discussion about steps that are considered in only one or two papers: Are they needed in general? Are they needed in certain contexts?

(3) Readability and clarity of the presentation.

- The results section per RQ is hard to read. I think there is enough space to make each description much more structured.

- What does it mean for a row to be empty in Figure 4?

- I missed how the process model of Figure 5 was developed. I assume there was a methodology and some work to derive it, so I was a bit surprised that the authors have not elaborated on this.

Overall, readability needs to be improved; there are many places where the article could benefit from proofreading to increase clarity.

Other minor comments:

- p5. "The study is from a peer-reviewed vehicle" -> probably "article" was meant
- Use of the word "artifact" in Table 3: what is an artifact?
- Figure 5 has too low resolution.
- p10. There are academic references describing the deployment of the UK and US portals
- I think the Linked Data book by Bizer is a better reference for the Linked Data principles than the original website by Berners-Lee

In summary, what I would expect for a major revision is:

- Clarify issues with queries
- Make available dataset of excluded papers and reason why
- Clarify the specific domain exclusion criteria and why it was chosen
- Improve structure, and therefore readability of results and discussion section
- Improve results/discussion around tools, and what are the implications of
- Expand on the methodology for building the process model in Figure 5, which I think is a very valuable contribution and probably deserves a section of its own