The Nova Scotia Disease Knowledge Graph

Tracking #: 2904-4118

Enayat Rajabi
Rishi Midha
Jairo Francisco de Souza

Responsible editor: 
Guest Editors KG Validation and Quality

Submission type: 
Dataset Description
The majority of published datasets in open government data are statistical. They are widely published by different governments to be used by the public and data consumers. However, most datasets in open data portals are not provided in RDF format. Moreover, the datasets are isolated from one another, while conceptually connected. Through this paper, a knowledge graph is constructed for the disease-related datasets of a Canadian government data portal, Nova Scotia Open Data. We transformed all the disease-related datasets to RDF and enriched them by semantic rules and an external ontology. The study shows that integrating open statistical datasets from multiple sources using ontologies and interlinking them potentially leads to valuable data sources and generates a dense knowledge graph with cross-dimensional information. The ontology designed to develop the graph adheres to best practices and standards, allowing for expansion, modification and flexible re-use.

Solicited Reviews:
Review #1
By Armin Haller submitted on 03/Nov/2021
Major Revision
Review Comment:

As suggested in my earlier review, the paper has now been submitted as a Dataset Description. This type is better suited for the paper. The authors do need to pay very close attention to the call for papers for dataset descriptions, though. Characteristics of the dataset that matter for its usage are still missing, e.g., spelling out the name, the URI, the version date and number, licensing, and availability. Evidence of its use is also still missing. Moreover, while datasets in a Government data repository are probably fairly safe in their long-term availability, a w3id identifier, a DOI and availability on Figshare or Zenodo would be advisable.

Beyond the advertisement of the dataset and reporting its importance, a dataset description paper should also serve the purpose of lessons learned in the generation of the dataset. The paper still falls a bit short in this aspect. Two main issues:

- The transformation process is not described in detail and is not repeatable, i.e., someone with a similar problem cannot follow a methodology presented in this paper. There is some mention that the ontology was built manually, but then how was the mapping done? It looks like it was all done with custom scripts. Why not use mapping tools such as the ones listed here: potential tools for the process could be any2rdf, triply, tarql, J2RM or any23.
Also, to be of real value to others, these tools should be configured (with a wrapper built around them) to automate the process for the widely used data portal software behind the Nova Scotia Open Data portal, i.e., Socrata.

- The level of human intervention is also not clear. It is stated in the conclusion that there was a tool developed to retrieve open datasets, but the identification of disease datasets was carried out manually. This is unclear. And the methodology section needs to clearly state which parts are manual and which parts can be automated. For example, the ontology development was obviously manual, but the mapping to the ontology can be automated using the aforementioned tools.
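
To make the first issue above concrete, a repeatable mapping can be as small as a declarative table from source columns to properties plus one generic row converter. The following is a minimal sketch only, not the authors' pipeline and not the configuration of any of the tools named above. The base namespace, the property IRIs, the column names and the dataset identifier are all hypothetical, and the use of the RDF Data Cube class qb:Observation is an assumption about the target model.

```python
import csv
import io

# Hypothetical namespace and property names, chosen for illustration only.
BASE = "http://example.org/nsdkg/"
# RDF Data Cube namespace; using qb:Observation here is an assumption.
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"

# Declarative mapping: CSV column -> (property IRI, kind of object).
MAPPING = {
    "disease": (BASE + "prop/disease", "iri"),
    "year": (BASE + "prop/year", "literal"),
    "cases": (BASE + "prop/numberOfCases", "integer"),
}


def slug(value):
    """Turn a cell value into a URL-safe path segment."""
    return "".join(c if c.isalnum() else "_" for c in value.strip().lower())


def row_to_triples(dataset_id, index, row):
    """Map one CSV row to N-Triples lines describing one observation."""
    subject = f"<{BASE}obs/{dataset_id}/{index}>"
    lines = [f"{subject} <{RDF_TYPE}> <{QB}Observation> ."]
    for column, (prop, kind) in MAPPING.items():
        value = row[column].strip()
        if kind == "iri":
            obj = f"<{BASE}{column}/{slug(value)}>"
        elif kind == "integer":
            obj = f'"{int(value)}"^^<{XSD_INT}>'
        else:
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"{subject} <{prop}> {obj} .")
    return lines


def csv_to_ntriples(dataset_id, csv_text):
    """Convert a whole CSV document (header row required) to N-Triples lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    triples = []
    for index, row in enumerate(reader):
        triples.extend(row_to_triples(dataset_id, index, row))
    return triples


if __name__ == "__main__":
    sample = "disease,year,cases\nLyme Disease,2017,12\n"
    print("\n".join(csv_to_ntriples("demo", sample)))
```

Keeping the mapping in a table rather than in code is what makes the manual part (choosing properties) visibly separate from the automated part (converting rows), which is the distinction the second issue asks the methodology section to spell out.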

Overall, the paper can be shortened in several sections to save space to clearly describe the dataset, the methodology and the ontologies developed. For example, the description of the data portal itself is too long. Table 1 is not needed, as these seem to be the datasets in their original format, at least that is how it appears to me when I click on them. Figure 3 shows an example observation. It would be better to first present the ontology in detail and then show an example instance. Section 4.4 is redundant. It seems as if these are just some proof-of-concept SWRL rules. Are they deployed and used? If not, they should not be included in the paper. Section 4.6 about the queries again looks like a proof of concept. Are these predefined SPARQL queries using SPIN or SHACL advanced features, and can they be used through the portal? If not, again, they should not be in the paper. Wikidata has a SPARQL interface with predefined queries. This is one way of helping users to access the data.

There should also be a stronger focus on lessons learned for jurisdictions dealing with Linked Data. There are several Government Linked Data Working Groups globally (and the W3C) that publish (or have published) guidelines and recommendations on best practices. If they have been followed, what aspects of those had to be changed or customised for the local context, and which ones were applicable?

There are a few language issues in the paper that need to be addressed, e.g.,

"a multi-dimensional structure should be defined consists of measures, and dimensions describing the measures" misses a verb
"As a proof of concept, we designed a SWRL rule to infer the transitive relationship of diseases in a dataset using Protege rule engine"
"downing the road"

There are also some formatting errors such as ??? for Figures and references.
The reference list is also ill-formatted and not consistent. See FAQ10

Review #2
By José María Álvarez Rodríguez submitted on 26/Nov/2021
Major Revision
Review Comment:

The paper presents the work to properly publish data under the principles of linked (open) data. To do so, the authors have brought a dataset from Nova Scotia into the linked data initiative.

The work is interesting and applicable to many domains in which data is sometimes published for the sake of publishing. However, there are some things to improve:

-The abstract should be more informative and quantify expressions like “most of datasets” to give some context and convey the magnitude of the problem.

-Include also more technical details about the results: ontologies/semantics rules and provide some quality assessment if possible.

-In the introduction, some existing issues must be justified: "The datasets act as isolated pools of information that cannot be queried or linked." This requires categorizing and identifying the current dimensions of the problem to be addressed:
--Problem of reusing data: which are the principles? For instance, alignment to Open Science/Data principles, etc.

Then, establish the specific technical issues that are preventing a proper federated data environment: data modelling (linked data patterns?), data integration, data quality, data querying mechanisms, lack of APIs, use of standards and existing vocabularies, etc., to finally assess whether there are interoperability issues: communication protocols, syntax and semantics.

-In the state of the art/background section, there are multiple works about linked data lifecycles that must be mentioned.

-In the methodology section, it would be nice to see some description of the datasets (type of data, issues in data: consistency, naming, etc., need of logics, etc.) to finally end with the decision on the vocabularies and data model including a description of what is being reused (with more details of the process) and what is new.

-Regarding the semantic rules, does it make sense to directly use SPARQL to produce new facts? In terms of consistency, it would be nice to see the reconciliation process between entities (if it was necessary).

-The authors also show some SPARQL queries. The potential of these queries should be linked to the initial needs.

-The quality checking of the dataset seems a bit simple. It would also be nice to see how to consume that information (if it is publicly available).
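
As an illustration of the question above on semantic rules: a transitive inference is small enough that it does not strictly need a dedicated rule engine. In SPARQL 1.1 a property path with `+` inside a CONSTRUCT or INSERT query can materialise such facts. The sketch below shows the same fixpoint logic in plain Python over (subject, predicate, object) tuples. It is an assumption-laden example, not the authors' rule, and the predicate name `relatedTo` is hypothetical.

```python
def transitive_closure(triples, predicate):
    """Forward-chain p(a, b) and p(b, c) => p(a, c) until no new fact appears.

    `triples` is an iterable of (subject, predicate, object) tuples.
    Triples with other predicates are kept unchanged.
    """
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        # Snapshot the current edges for the predicate of interest.
        edges = [(s, o) for (s, p, o) in facts if p == predicate]
        successors = {}
        for s, o in edges:
            successors.setdefault(s, set()).add(o)
        # Join each edge with the edges leaving its object.
        for s, o in edges:
            for nxt in successors.get(o, ()):
                new_fact = (s, predicate, nxt)
                if new_fact not in facts:
                    facts.add(new_fact)
                    changed = True
    return facts


if __name__ == "__main__":
    # Hypothetical data: "relatedTo" is an illustrative predicate name.
    data = {
        ("a", "relatedTo", "b"),
        ("b", "relatedTo", "c"),
        ("c", "relatedTo", "d"),
    }
    for fact in sorted(transitive_closure(data, "relatedTo")):
        print(fact)
```

Whichever mechanism is used, materialising inferred triples makes them visible to any plain SPARQL client, whereas rule-engine inferences are only available where that engine runs.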

Finally, the authors properly discuss the main conclusions and outline some future work.

Review #3
Anonymous submitted on 31/Dec/2021
Review Comment:

Overall impression
The article presents a dataset description of aggregated statistical data on diseases related to a Canadian locality. Although the topic is interesting and could have potential, the dataset presented is only a fragment of the main dataset, which has been transformed to RDF, and this does not necessarily constitute a real contribution to the state of the art. The article neither shows the impact of the data published in RDF, nor demonstrates its usefulness to third parties. The relationship between the authors and the generators or maintainers of the data is not described either. Finally, in my opinion there are technical deficiencies in the specification of the dataset.

Positive aspects
As an example, use cases based on SPARQL query logic and pattern matching are presented. Also, the basic aspects to be specified in a dataset description are presented, such as name, URL, version date and number, licensing, and availability.

Negative aspects
Quality and stability of the dataset
No evidence is presented on how the authors are involved in the generation or maintenance of the dataset, or on whether they have the consent of the people who are.

Usefulness of the dataset
A substantial contribution that would justify the impact required for publication in the journal is missing.
The dataset does not show growth potential; in fact, the data only go up to 2017. Nor is there a SPARQL endpoint on which to execute the queries.
No evidence of use of the data by third parties is shown.

Clarity and completeness of the descriptions
There are no explanatory diagrams of the data model used, nor of the way in which the referenced vocabularies are related.
The vocabularies and ontologies are vaguely described.
Interlinking mechanisms to achieve 5-star linked open data are not presented.
The URI scheme used and other types of specifications, such as shapes (ShEx or SHACL), are not described.
Although minor, there are typographical and formatting errors, as well as unfinished paragraphs (for example, the first paragraph of the conclusions).