Constructing a Knowledge Graph for Open Statistical Data

Tracking #: 2706-3920

Enayat Rajabi
Rishi Midha
Devanshika Ghosh

Responsible editor: 
Guest Editors KG Validation and Quality

Submission type: 
Full Paper
Abstract:
Open Government Data is published widely by governments for use by the public and data consumers, and the majority of the published datasets are statistical. Transforming Open Government Data into knowledge graphs bridges the semantic gap and gives machines the power to infer and reason logically. In this paper, a knowledge graph is proposed for Open Statistical Data: an RDF-based knowledge graph with a rule-based ontology. A case study on Nova Scotia Open Data (a provincial Open Data portal in Canada) is also presented. The proposed knowledge graph can be applied to any statistical Open Data and can bring all provincial Open Government Data under a single umbrella. The knowledge graph was tested and underwent a quality-check process. The study shows that integrating statistical data from multiple sources using ontologies and the interlinking features of the Semantic Web enables advanced data analytics, leads to the production of valuable data sources, and generates a dense knowledge graph with cross-dimensional information and data. The ontology designed to develop the graph adheres to best practices and standards, thereby allowing for expansion, modification, and flexible reuse.

Solicited Reviews:
Review #1
By Armin Haller submitted on 17/May/2021
Review Comment:

This manuscript was submitted as 'full paper' and is therefore reviewed according to the criteria, (1) originality, (2) significance of the results, and (3) quality of writing.

The paper describes the open data portal of Nova Scotia, which has more than 500 datasets in various categories, including Environment and Energy, Health and Wellness, Population and Demographics, etc. The authors describe a process to create a knowledge graph from these datasets. The process that has been followed to create the knowledge graph (i.e., building an ontology, creating dimensions using the RDF Data Cube vocabulary, adding semantics and rules, refinement, and defining template queries) is standard and has been discussed and applied many times before. Beyond that, the paper does not make any scientific contributions, i.e., it lacks the originality or significance of results needed for a full paper at SWJ. There are, however, two submission types in the Semantic Web Journal that would potentially fit a paper like this, i.e., an Application Report or a Dataset Description. The latter is probably the better option, and there are many examples of previously published statistical/environmental-statistical dataset descriptions that are similar in nature to this paper [1,2,3,4,5]. However, even if such a submission type is chosen, there needs to be a lot more detail on the methodology that has been applied to the specific Nova Scotia data portal. First, what is (are) the ontology(ies) that has (have) been developed for the knowledge graph? How are external ontologies integrated? What was the ontology engineering process that has been followed to define questions such as "Question 1: What diseases by infectious agents occur in Nova Scotia?" Are they based on expert interviews (e.g., with users of the portal) or are they based on query logs? Next, it is unclear what exact metadata is extracted from the datasets beyond their categories/terms. Then the paper describes a Disease dataset that seems to be completely ontologised, i.e., it is mapped to a QB-structured ontology.
How many datasets in the portal, if any others, have been mapped to the aggregated knowledge graph? What process was followed, and how accurate was the mapping? Or is this a dataset that was developed as a knowledge graph from the outset? Does the dataset use semantics for the Observation types, i.e., external ontologies for observations? In the next step of the methodology, the paper mentions interlinking the multiple disease datasets with other datasets such as the Disease Ontology and GeoNames. Again, it is unclear whether this happens on the metadata level, i.e., the level of descriptions of the datasets, or on the level of the data within those datasets. There is no detail in the paper on the number of mappings or their precision/recall either. In the rule-generation part of the methodology, some SWRL rules are presented that seem to be additions to the Disease ontology. Again, this appears to mix meta-level and data-level discussions in the paper. Why are these rules defined on the data-portal level? They seem to be interpretations necessary for a specific ontology. They should therefore be encoded in a use-case ontology that is then applied in this use case. All the queries presented in the last part of the methodology are again Disease-specific. This confirms my suspicion that a dataset description of a specific Disease dataset, or a set of Disease datasets, would be the better submission type for the SWJ.
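For context on the QB mapping the review questions, translating a single flat statistical record into an RDF Data Cube observation typically takes the following shape. This is a minimal, dependency-free sketch; all dataset, dimension, and measure names are hypothetical, and a real pipeline would use an RDF library rather than raw string triples.

```python
# Sketch: mapping one flat statistical record onto the RDF Data Cube
# vocabulary as (subject, predicate, object) triples. All identifiers
# below are invented for illustration only.

QB = "http://purl.org/linked-data/cube#"   # RDF Data Cube namespace
EX = "http://example.org/ns#"              # hypothetical local namespace

def observation(record):
    """Turn one flat statistical record into qb:Observation triples."""
    obs = EX + "obs/" + record["id"]
    return [
        (obs, "rdf:type", QB + "Observation"),
        (obs, QB + "dataSet", EX + "dataset/notifiable-diseases"),
        (obs, EX + "refPeriod", record["year"]),        # dimension
        (obs, EX + "refArea", EX + record["zone"]),     # dimension
        (obs, EX + "disease", EX + record["disease"]),  # dimension
        (obs, EX + "caseCount", record["cases"]),       # measure
    ]

triples = observation(
    {"id": "2019-central-influenza", "year": "2019",
     "zone": "zone/central", "disease": "disease/influenza", "cases": 412}
)
for s, p, o in triples:
    print(s, p, o)
```

Making this mapping explicit per dataset would answer the questions above about which observation types carry external semantics.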

Overall, as mentioned, the paper does not provide an original contribution or any evaluation/results that would make it fit for a full-paper submission. However, it has good potential for resubmission as a dataset description if the missing details (as per my review) are discussed and the established structure of such papers (as per the examples) is followed. This should also include a discussion and evaluation of its impact, i.e., why a semantic version of the dataset(s) is beneficial to the users of the Nova Scotia data portal.

[1] Konrad Hoeffner, Michael Martin, Jens Lehmann: LinkedSpending: OpenSpending becomes Linked Open Data, SWJ
[2] Albert Meroño-Peñuela, Ashkan Ashkpour, Christophe Guéret, Stefan Schlobach: CEDAR: The Dutch Historical Censuses as Linked Open Data, SWJ
[3] Marcos Zárate, German Braun, Mirtha Lewis, Pablo R. Fillottrani: Observational/Hydrographic data of the South Atlantic Ocean published as LOD, SWJ
[4] Laurent Lefort, Armin Haller, Kerry Taylor, Geoffrey Squire, Peter Taylor, Dale Percival, Andrew Woolf: The ACORN-SAT Linked Climate Dataset, SWJ
[5] Catherine Roussey: Weather Data Publication on the LOD using SOSA/SSN Ontology, SWJ

Review #2
By José María Álvarez Rodríguez submitted on 04/Aug/2021
Review Comment:

The paper describes an approach to publishing statistical data under the principles of Linked Data, exposing a set of 5-star datasets. To do so, the authors motivate the work by the nature of the data published by governments. They also review some vocabularies, previous approaches, and key points to consider when publishing and linking data. Essentially, the authors try to solve the problem of linking data so as to be able to infer new facts and add semantics to each data item. Finally, the authors apply the methodology to a domain, a disease database, obtaining datasets that follow the RDF Data Cube vocabulary. Furthermore, the authors present some quality metrics to ensure the data has been properly transformed. The authors also offer an interesting discussion, drawing conclusions and future research lines.

In general, the paper is interesting; it presents a good application of Linked Data to a specific context: statistical data. It is true that many institutions primarily publish statistical data, and the need to consume such information under a common, unified data model is becoming critical for building added-value services. However, the approach should be improved on the following topics:

-State of the art. The main contribution of the paper is a kind of methodology to publish statistical data, so it is necessary to review Linked Data lifecycles and the different stages they comprise. Furthermore, statistical data, as the authors mention, requires a data model (combining different types of vocabularies, etc.), so a review of the typology of vocabularies, Linked Data patterns, the need for semantics for different purposes (e.g., data consistency), etc. is required. The state of the art must be extended beyond similar applications to topics covering the Linked Data lifecycle, data modelling (especially for statistical data), and quality features for this type of data.

-Methodology and concept. Following on from the previous comment, there are many Linked Data lifecycles and methodologies. The authors introduce a process that should be based on existing lifecycles (extending or tailoring them) or should somehow instantiate one of them. Furthermore, the transformation process is too descriptive, without entering into the details to consider when modelling statistical data: type of data, structure, etc. In this context, it is also relevant to consider how to deal with the issues that arise when processing this type of data: resolving missing values (if possible), ensuring data domains and ranges, keeping consistency with the sources, handling updates, etc. In general, this section requires more details and tasks to constitute a framework for transforming statistical data in a systematic way. Otherwise, it is difficult to see the contribution with respect to other approaches that have published and linked data in other domains. From a technical perspective, the process of entity reconciliation (matching and linking) is a cornerstone that requires details on the type of algorithms to be used (from classical approaches like the PROMPT algorithm or reconciliation frameworks to others based on recent advances in embedding representations). In terms of publishing, describing how to organize the datasets, slices, and observations, and how to model the data structure definitions, is strictly necessary. A justification for the selected vocabularies (apart from the RDF Data Cube) is also necessary.
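To make the reconciliation request above concrete, even the simplest baseline (string similarity between local labels and labels from an external ontology) can be stated precisely enough to report precision/recall against. The sketch below is a hedged illustration, not the paper's method: the labels and the 0.85 threshold are invented, and a production pipeline would use a dedicated reconciliation framework or embedding-based matching.

```python
# Baseline entity reconciliation: link local disease labels to external
# ontology labels by best string-similarity match above a threshold.
# All labels and the threshold are illustrative assumptions.
from difflib import SequenceMatcher

def reconcile(local_labels, external_labels, threshold=0.85):
    """Return (local, external, score) links above the similarity threshold."""
    links = []
    for loc in local_labels:
        best, best_score = None, 0.0
        for ext in external_labels:
            # ratio() is a normalized similarity in [0, 1]
            score = SequenceMatcher(None, loc.lower(), ext.lower()).ratio()
            if score > best_score:
                best, best_score = ext, score
        if best is not None and best_score >= threshold:
            links.append((loc, best, round(best_score, 2)))
    return links

local = ["Influenza", "Lyme disease", "Pertussis (whooping cough)"]
external = ["influenza", "Lyme Disease", "pertussis", "measles"]
print(reconcile(local, external))
```

Reporting which algorithm fills this role, and its accuracy on the actual portal data, is exactly the missing detail.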

-Application and results. The authors apply the previous process to a domain, disease information, and have implemented, with different technologies, the stages to gather, transform, link, and publish data. They also present an ontology and some rules to infer facts and check consistency. However, this case study is not properly documented; it should include its objectives and the implementation of the methodology: a description of the data (typology), a description of the ontology (how it is defined, its structure, etc.), a description of the transformation (problems found), details of the quality metrics (not just the indicators), and how the quality-checking process is performed.
As a final comment, the introduction and motivation of the paper are good, but the state of the art, the conceptual approach (methodology), the quality-checking process (indicators, metrics, and implementation), and the case study require more details to provide a systematic way of getting, linking, structuring, and publishing statistical data.
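As an illustration of the difference between an indicator and a metric that the review draws, a quality indicator such as "completeness" only becomes reportable once it is defined as a computable measurement. The sketch below shows one such definition under invented field names; it is not taken from the paper.

```python
# Illustrative quality metric: completeness of a set of statistical
# observations, defined as the fraction of records in which every
# required field is present and non-empty. Field names are hypothetical.

REQUIRED = ("year", "zone", "disease", "cases")

def completeness(records):
    """Fraction of records with all required fields present and non-empty."""
    if not records:
        return 0.0
    ok = sum(
        1 for r in records
        if all(r.get(f) not in (None, "") for f in REQUIRED)
    )
    return ok / len(records)

records = [
    {"year": "2019", "zone": "central", "disease": "influenza", "cases": 412},
    {"year": "2019", "zone": "central", "disease": "pertussis", "cases": None},
]
print(completeness(records))  # 0.5
```

Documenting each quality indicator at this level (definition, inputs, and how the score is computed over the transformed graph) is what the review asks for.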

Other comments:

-The abstract is correct, but it should include more details on the results and main contributions: a methodology, and an application with its results (N datasets, etc.).

-The structure of the paper is ok.

-“rdf data cube”: fix the first occurrence and add the proper citation.

-What do you mean by RDF multidimensional models? Should it not be RDF multi-domain/cross-domain models?

-The references are relevant to the paper's content but, as commented above, some are missing on Linked Data lifecycles, data modelling, Linked Data patterns, Linked Data quality, etc.