Methodologies for publishing linked open government data on the web: a systematic mapping and a unified process model

Tracking #: 2532-3746

Authors: 
Bruno Elias Penteado
José Carlos Maldonado
Seiji Isotani

Responsible editor: 
Jens Lehmann

Submission type: 
Survey Article
Abstract: 
Since the beginning of the release of open data by many countries, different methodologies for publishing linked data have been proposed. However, they seem not to have been adopted by early studies exploring linked data, for different reasons. In this work, we conducted a systematic mapping of the literature to synthesize the different approaches around the following topics: common steps, associated tools and practices, quality assessment validations, and evaluation of the methodology. The findings show a core set of activities, based on the linked data principles, but with very important additional steps for practical use at scale. Although a fair number of quality issues are reported in the literature, very few of these methodologies embed validation steps in their processes. We describe an integrated overview of the different activities and how they can be executed with appropriate tools. We also present research challenges that need to be addressed in future work in this area.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Martin Necasky submitted on 07/Sep/2020
Suggestion:
Major Revision
Review Comment:

The paper presents a survey of methodologies for publishing government data as linked open data, so-called linked open government data (LOGD). The authors identified 25 relevant papers and analyzed them to answer 4 research questions about the methodologies described in the papers. On the basis of the analysis, they derive a new unified model for publishing LOGD.

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.

The paper is a survey paper and as such it shall serve as an introductory text. However, in the introduction to the paper there are several problems which need to be addressed by the authors.

The first sentence states that "linking and combining datasets have become major topics for data consumers". However, there are many well-known issues consumers have to deal with, e.g. data quality or provenance. Linking and combining is a topic for some of them. Many of them, however, do not even know that linking and combining is possible, and some of them do not need this. They typically work with a sole dataset. Therefore, it is a question for which consumers this is a major topic. An analysis would be necessary for such a strong statement.

There are several statements in the first paragraph of the introduction that may be true, but the paper does not provide any evidence for them.
1) using simple algorithms to convert data to LOD, without curation efforts - the referenced papers 1,2,3 do not say this. Is there any analysis which shows this? If there is no analysis known, some examples of datasets suffering from these issues should be presented.
2) focus on metadata instead of data - again, the referenced papers 1,2,3 do not state this. An analysis showing this or at least some examples should be provided and explained.

The end of the first paragraph of the introduction (starting at line 34, 2nd column) is about open data in general, not linked government data. How is this related to and relevant for the paper and the surrounding paragraphs? It shall be explained better.

In the second paragraph of the introduction, the authors say that government data has many important applications. We can agree with this intuitively. However, the referenced work 6 does not seem to be relevant evidence for this statement. According to its abstract, its purpose is to investigate best practices of publishing linked data and how they are applied in practice. Not a word about applications of governmental linked data.

The statement on the 2nd page, 1st column, 1st line that the LOD cloud comprises 200 linked datasets is not true. It is probably a typo, not consistent with Fig. 1. However, there is a more serious problem. Referencing datahub.io is not relevant here. Several years ago the original datahub.io service as a data catalog was shut down and replaced by a commercial product. It seems that it does not work at all now. The authors of lod-cloud.net maintain their own list of datasets which is described here: https://lod-cloud.net/#about .

I provided detailed comments on the introduction because the paper is submitted as a survey paper. It shall be used as an introductory text. Therefore, its introduction needs to be perfect and it cannot be misleading. My remarks above need to be resolved before the paper is accepted.

Section 2 is a general introduction to open data and linked open data and it is OK as an introduction to the problem.

At the end of Section 3, in its last paragraph, it would be helpful to have the specifics of publishing LOGD which distinguish it from publishing LOD in general. Without this knowledge a reader may ask why a survey specific for government data is necessary.

(2) How comprehensive and how balanced is the presentation and coverage

Section 4 presents the methodology used for the survey. This is the core of the paper. However, it is also its weakest part in my opinion. The methodology is not very well fitted to the specifics of the domain of government data. I explain this in the following paragraphs.

In my opinion the research questions defined in Section 4.1 are not sufficient. The authors say that the research questions were defined with the goal to investigate how methodologies cover data quality. However, for RQ1-3 it is not clear how they are related to data quality. On the other hand, RQ4 is about data quality in general. As we all know, data quality has many dimensions (Zaveri, Amrapali et al. ‘Quality Assessment for Linked Data: A Survey’, Semantic Web, vol. 7, no. 1, pp. 63-93, 2016). It is not clear from the research questions how these different dimensions will be considered in the survey. The answer to RQ4 may be very generic and it can, therefore, be hard to map it to the quality dimensions. I would like to see here, at the definition of the research questions, what data quality is and how its dimensions are targeted by the presented study.

Moreover, the research questions do not reflect the fact that the study concentrates only on publishing LOGD. If we agree on the necessity of doing a survey specifically for LOGD, the survey should have some research questions specific for LOGD. For example, a question could be how a methodology fits with the specific hierarchical structure of governments, their strictly separated competencies given by law, etc. The current research questions are too generic; they do not investigate how methodologies fit the needs of governments who shall publish their data as LOD. In my opinion, the research questions shall be more specific and reflect the specifics of government data. Regarding RQ1, it would also be interesting and valuable to see what steps methodologies for publishing LOGD add to general methodologies for publishing LOD.

The problems with the defined research questions arise in Section 5, where the results of analyzing the identified papers are presented. The answer to the 1st research question (RQ1) shows that RQ1 is insufficient. It lists generic steps which are common to publishing any LOD. It does not emphasize the specifics for LOGD. We can see this only partly, hidden in the text - for example on page 7, 1st column, around line 17: "In government scope, datasets are created by different agencies, using different formats, different levels of granularity for the data and the metadata". However, it is hidden in the surrounding text, which is valid for any LOD. It is not described systematically and explicitly, and it is not supported by evidence. Therefore, it is not clear what the difference is from other methodologies for LOD in general and how the surveyed methodologies fit the needs of governments.

I also have a minor remark (question) to the answer to RQ1. Vocabularies are discussed at page 7, 2nd column, around line 7. In case of governmental data, vocabularies have to match concepts defined by legislation. Existing LOD vocabularies are generic, not related to legislation. How is this problem reflected?

Section 6 defines a unified model for publishing LOGD. It combines steps from existing methodologies. However, it is described without any direct and explicitly expressed relationship to the original methodologies. Therefore, it is not clear which steps are taken from which methodology. It seems to be a separate methodology with only weak relationships to the surveyed ones.

Moreover, it is hard to understand why a unified publishing model is necessary. The authors should provide some arguments for this. The problem of the whole survey is that it does not specify a goal or goals a methodology for publishing LOGD should fulfill. In general, the goal is to publish GD as LOD. This is however not sufficient. It is necessary to specify the needs and specifics of governments. These would be the goals. The existing methodologies shall be compared on how they meet the goals. The unified publishing model should pick the best parts of these methodologies and combine them together to fit the needs of governments. The unified model should also cover what the existing methodologies lack the most - evaluation. Nothing of this is covered by the proposed unified publishing model. Its purpose is therefore unclear.

In Section 7 - discussion, it is written that the presented paper reflects all specifics of open government data. This is a very strong statement. The paper does not provide a list of all specifics. It does not list any specifics explicitly.

(3) Readability and clarity of the presentation

The paper is easily readable and all statements are clear and easy to understand.

(4) Importance of the covered material to the broader Semantic Web community

The paper is important to the broader SW community as it shows how hard it is for governments to publish their data as LOD. However, the paper does not fulfill the potential of the topic which I explain in my review related to the criterion (2) above.

In summary, I would expect the following for a major revision:
- Improve the introduction with better evidence of the presented statements
- Clarify explicitly the list of specifics of open government data
- Focus better the research questions on the specifics of open government data
- Show how existing methodologies reflect the specifics
- Explain better relationships of the unified model to the existing methodologies and better define the purpose of the unified model
- Show some evaluation of the proposed unified model at least on some case study

Review #2
By Fathoni A. Musyaffa submitted on 09/Sep/2020
Suggestion:
Minor Revision
Review Comment:

Linked Data Methodologies Paper Review

The paper discusses linked open government data methodologies, addressing an important topic that deserves more research effort due to the growth of open government data in general and its potential if published as linked data. The paper is well written and generally easy to understand, and could also potentially serve as an introductory article for public administrators who intend to implement Linked Open Government Data (LOGD) or for researchers who are starting to get into the subject of LOGD publishing and the LOGD pipeline.

Some comments, inputs, and questions:

(1) Page 3, col 1, line 5: Instead of those three factors, there are more issues in this regard that would be worth deeper study, for example, the lack of domain expertise (e.g., fiscal open data, public accounting, company registers, procurement, etc.) as well as of technical expertise to process and use the necessary analytical/visualization tools. Another issue is that the decentralization of open data publishing leads to data heterogeneity, which makes the data hard to link, integrate, and analyze even if the domain and/or technical expertise requirement is satisfied.

(2) Page 3, col 1, line 23: It is not only differences in data format that hamper the use of OGD, but also how the data are structured, even when the format is the same. For example, in the fiscal domain, how columns are structured and the granularity of information available in the datasets can be very different across data publishers. Standardization of both format and structure for open data publication may be necessary, which involves collaboration between public administrations (and hence, possibly, political entities) as well as certain standardization consortia/communities (e.g., W3C, Open Govt Partnership, OKF, etc.).

(3) Page 3, col 2, line 17: Is Excel still considered a proprietary format? (see https://docs.microsoft.com/en-us/openspecs/dev_center/ms-devcentlp/1c24c... - “Microsoft irrevocably promises not to assert any Microsoft Necessary Claims against you for making, using, selling, offering for sale, importing or distributing any implementation to the extent it conforms to a Covered Specification ("Covered Implementation"), subject to the following…”)

(4) The paper’s scope is LOGD in the general domain, and hence the specific dataset domains published by LOGD initiatives are not explicitly discussed. However, since Research Question 2 of the paper mentions vocabularies (which are often tied closely to the domain), and given the potential of this paper to serve as an intro for public administrators on LOGD publishing, it would be a great addition if the authors discussed the top n domains that are published by OGD initiatives (see a survey by e.g., OKF - https://index.okfn.org/dataset/ for reference) as well as provided a short table listing possible ontologies that can be used to represent datasets from these domains.
(5) Could the search terms (Section 4.2) be made more inclusive? There may be synonyms that are not covered by the terms used for the search in Table 1, e.g., some publications may use “knowledge graph” instead of “linked data”.
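For instance, a broadened search string (purely illustrative; the exact synonym sets would have to be derived from the terms in Table 1) could look like:

("linked open government data" OR "linked data" OR "knowledge graph") AND ("publishing" OR "publication") AND ("methodology" OR "method" OR "process")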

(6) The exclusion of general open data (Section 4.3) is debatable, since good general open data is a prerequisite for producing good-quality linked data.

(7) Are there any more comprehensive parameters (beyond the inclusion/exclusion criteria) used to analyze the papers that pass these criteria? How were the filtered publications narrowed down to the 25 publications featured in this survey? A clarification in the manuscript would be great.

(8) RQ2: Tools and vocabularies may need to be separated into two research questions, as both tools and vocabularies have a wide, separate scope.

(9) Fig. 4 could also be formulated in an inverted way to guide the community, particularly novice adopters, in picking the right tool (i.e., the list of individual tools in the left column, and the supported steps of each tool in the right column).

(10) In the unified process diagram, is there any reason why there are only mandatory and optional processes? An additional option could be added (e.g., strongly recommended) as well. Moreover, according to which criteria are the optional and mandatory labels assigned to these processes? Perhaps it is related to the literature or a survey in the LOGD/LD community? One can argue that the process “search and reuse vocabularies” is very important/mandatory in LOGD, since it leads to the interoperability and linkability of the represented linked data.

(11) Regarding the “engagement with the community” (Page 13), in addition to those three factors, success stories of how LOGD adoption improves the use case/analytics/application would drastically help promote LOGD adoption further. A switch to LOGD without success stories may be seen as not worth the resources spent to transform the datasets into the linked data format.

Minor issues:

(1) The acronym LOGD on page 2 is not clarified, but it is clarified on page 3. Please clarify the acronym on its first use on page 2 instead.

(2) The term “inadequate links” on page 2, col 2, line 20 is not clear.

(3) Typo: “...RDF format 2,...” on page 2.

(4) This sentence needs to be formulated clearer: “However, it is not clear the reason why it reappeared recently since none of the papers (since 2016) cited a different reason for not existing a large scale production of LOGD.”

(5) Fig. 3 and Fig. 4 are essentially tables, so it would be better to name them accordingly instead of figures.

(6) Grammatical mistake (Page 7): “design of URIs must also (be) considered”.

(7) This may need more clarification (Page 7): “...and files or tables are transformed as classes in the resulting graph”.

(8) Please clarify “the last column” in Page 10: “Observing the last column, ...”

(9) Typo on Page 10: “...guideline were (was) specified in any of the studies. (there should be no space/underline) 12”.

(10) Fig. 4 can be improved by using a tabular text instead of an image to preserve a better-quality table, especially when printed. Alternatively, an embedded PDF image can also be used if the authors use LaTeX.

(11) Fig 4: inconsistent reference to [W7] in the row “Build apps on top of data”. It should be cited as [W7] instead of [7].

(12) Also, in the row “Build apps on top of data” of Fig 4, “KNMINE” is written while it could possibly refer to KNIME (even though the original paper, referred to as no. 55, states KNMINE). Please cross-check.

(13) Typos on Page 13: “It provides more features then (than)”, “...and guidelines for the definition of non-functional (requirements?)”.

(14) A reference for W3C Data on the Web Best Practices (Page 15) may need to be provided.

(15) V&V acronym (Page 15) is not clarified. Is it referring to verification and validation?

(16) Fig. 5: Typo - Conversion instead of “Converion”.

(17) Typo in Page 18: “as exemplified in (Fig.) 5”

(18) Typo in Page 18: “in a 99(-)pages report”.

(19) Page 19: Please clarify “auxiliary” and “core” steps.

(20) Page 19: Please clarify “longitudinal studies” here.

Review #3
By Michael Martin submitted on 09/Oct/2020
Suggestion:
Minor Revision
Review Comment:

In the following, a summary of the review comments is listed.

Overall, the paper provides an overview of the publication of Linked Open Government Data (LOGD), which is an important and interesting topic.
The example of LOGD is a good one, as it provides many examples and serves as a prototype for other linked data publication processes.
The survey is performed as a systematic literature analysis.
Especially the derived process model is a very valuable contribution.
It complements the Linked Data Lifecycle with a lot more detail and can help in future work, as a reference when a more detailed understanding and description of the data publication process is needed.

Detailed comments
=================

In the introduction (page 2) you speak about levels of open data. What are the 4th and 5th levels? According to which scheme? Do you refer to the 5-Star Linked Open Data model? If so, this is not clear from the context.

It would be good to consistently reference the articles chosen for the survey by their Paper ID, e.g., W3 instead of 12 in the following (page 9): "The data community contributes to the process by[12]: i) providing feedback on what data to release; ii) contributing to the quality of the data; iii) collaborating with other members to create solutions over the data."

If you are using BibLaTeX, it is even possible to produce two separate bibliographies with prefixes, which could help improve readability. (Feel free to contact me if you need some examples.)
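For instance, a minimal sketch could look as follows (assuming a numeric biblatex style; the bib file name, citation keys and the "surveyed" keyword are only placeholders, and newer biblatex releases would use \newrefcontext[labelprefix=W] instead of prefixnumbers):

\documentclass{article}
% defernumbers=true is needed so that the prefixed labels are assigned correctly
\usepackage[style=numeric, defernumbers=true]{biblatex}
\addbibresource{references.bib} % surveyed entries carry keywords = {surveyed}
\begin{document}
A surveyed methodology~\cite{surveyedPaper} and a regular reference~\cite{otherPaper}.
% surveyed papers are labeled [W1], [W2], ...
\printbibliography[keyword=surveyed, prefixnumbers={W}, title={Surveyed studies}]
% all remaining references are labeled [1], [2], ...
\printbibliography[notkeyword=surveyed, resetnumbers=true, title={References}]
\end{document}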

In section 3 related work on page 4 you write "The LOD2 Project [33] also developed a lifecycle for linked data and provided software tools for the steps, although leaving out important steps - such as data modelling, alignment, and the publication of the data on the Web."
But the step "alignment" is covered by the step "linking/fusing" and the step "publication" by the steps "search/browsing/exploration" and also "Storage/Querying". As you yourself provide a very complex process model, it would be helpful to discuss the relation of your model to the Linked Data Lifecycle in more detail.

Within the discussion of the results (Section 5) it is hard to follow which papers of the survey you are speaking about, as the references are missing. Even though you provide a "mapping of tools" in Figure 4, it is still hard to see the relation while reading. Please explain Figure 4 and what it tells more precisely.
Examples of sentences where I would expect a reference indicating which papers you are speaking about:
- page 10: "For cleaning up the data, OpenRefine was used in two studies, along with two other custom tools." which are the two studies?
- page 11: "For the definition of vocabularies two distinct approaches were identified: tools to search for existing vocabularies (such as LOV, Swoogle) and tools to create new vocabularies, like Protégé, OntoWiki, and TopBraid Composer." which paper chose which approach?

It would be good to split Section 5 (Results) into smaller subsections; currently it is 9 pages long and it is hard to get an overview of this section.

In Figure 5, why are some arrows orange and others blue? Could you please explain why these arrows are special?

Even though your systematic method did not bring up results for the versioning of datasets, it would be worth mentioning that there are some approaches, which have received quite some attention in the semantic web/linked data community, for building linked data versioning and archiving systems [1,2,3,4] or even embedding them in a knowledge management process [4].
Also for the step "Engage with [the] Community", several approaches exist in the area of the Social Semantic Web, like the Solid platform and distributed semantic social networks [5,6,7,8].
For the step "Define non functional requirements" I'm not an expert.

Please do not use link shorteners: "The complete results are available online: https://bit.ly/319BGAH."
It would be even better to publish these results on a proper static web page and provide a copy of the page as supplementary material.

Please provide the tables in Figure 3 and Figure 4 as proper tables.

References
==========

[1] "Open Semantic Revision Control with R43ples: Extending SPARQL to access revisions of Named Graphs" (https://doi.org/10.1145/2993318.2993336)
[2] "OSTRICH: versioned random-access triple store" (https://doi.org/10.1145/3184558.3186960)
[3] "TailR: a platform for preserving history on the web of data" (https://doi.org/10.1145/2814864.2814875)
[4] "Decentralized Collaborative Knowledge Management using Git" (https://dx.doi.org/10.1016/j.websem.2018.08.002)
[5] "A Demonstration of the Solid Platform for Social Web Applications" (https://doi.org/10.1145/2872518.2890529)
[6] "An architecture of a distributed semantic social network" (https://doi.org/10.3233/SW-2012-0082, http://www.semantic-web-journal.net/content/architecture-distributed-sem...)
[7] "Structured feedback: A distributed protocol for feedback and patches on the web of data" (http://ceur-ws.org/Vol-1593/article-02.pdf)
[8] "Linked Data Notifications: a resource-centric communication protocol" (https://doi.org/10.1007/978-3-319-58068-5_33)

Typos and minor issues
======================

*Sometimes I highlight typos and provide suggestions to fix them: the part in [square brackets] should be removed and the part in (parentheses) should be inserted.*

I personally find it sometimes awkward when a sentence uses a citation number as the subject ("In [15] it is indicated, …"). I prefer to have author names, which assist the reader in telling the references apart.

The abbreviation LOGD is used on page 2 but introduced on page 3

Please choose a consistent capitalization: "the Web" vs. "the web" (I prefer the first)

page 3 "open data government (OGD)" should be "open government data (OGD)"

page 3: "Open data is data that is publicly accessible via the Internet, without any physical or virtual barriers to access[ing] them."

page 3: "Linked Open Data, in turn, is data that allows relationships to be expressed among [these] (this) data, enriching the datasets with complementary information from elsewhere [25]."

page 5: Something is wrong with this sentence "The identified keywords are methodology, publishing and linked open government data, which were grouped these terms and their synonyms were considered to elaborate on the search string (Table 1)."

page 5: exclusion criteria: "The study focus[es] on the application of LD in a specific domain;"

page 6: there is something wrong with this sentence: "However, it is not clear _the reason_ why it reappeared recently since none of the papers (since 2016) cited a different reason for not existing a large scale production of LOGD."

page 7: "As one of the pillars of linked data is the unique and persistent identification of data resources, the careful design of URIs must also [be] considered."

page 7: "In [73] [it is discussed] the use of different protocols that can be used to provide this uniqueness of the resource (is discussed)."

page 9: "Some studies point to the importance of building applications with the data, to help the community raise [its] (the) awareness of it."

page 16, figure 5: "Conver[s]ion"

page 17: There is something wrong with this sentence: "While browsing for returned papers from the search in the data sources, many studies in the last years concerned the application of a method to create linked data for a particular purpose, sometimes based on one of the studies listed here and, mostly, time by creating an ad-hoc approach for their problems."

We would be happy to see a further version of your paper.
All the Best
Natanael Arndt and Michael Martin