Representing COVID-19 information in collaborative knowledge graphs: a study of Wikidata

Tracking #: 2572-3786

Houcemeddine Turki
Mohamed Ali Hadj Taieb
Thomas Shafee
Tiago Lubiana
Dariusz Jemielniak
Mohamed Ben Aouicha
Jose Emilio Labra Gayo
Mus'ab Banat
Diptanshu Das
Daniel Mietchen

Responsible editor: 
Armin Haller

Submission type: 
Full Paper
Information related to the COVID-19 pandemic ranges from biological to bibliographic and from geographical to genetic. Wikidata is a vast interdisciplinary, multilingual, open collaborative knowledge base of more than 88 million entities connected by well over a billion relationships and is consequently a web-scale platform for broader computer-supported cooperative work and linked open data. Here, we introduce four aspects of Wikidata that make it an ideal knowledge base for information on the COVID-19 pandemic: its flexible data model, its multilingual features, its alignment to multiple external databases, and its multidisciplinary organization. The structure of the raw data is highly complex, so converting it to meaningful insight requires extraction and visualization, the global crowdsourcing of which adds both additional challenges and opportunities. The created knowledge graph for COVID-19 in Wikidata can be visualized, explored and analyzed in near real time by specialists, automated tools and the public, for decision support as well as educational and scholarly research purposes via SPARQL, a semantic query language used to retrieve and process information from databases saved in Resource Description Framework (RDF) format.

Major Revision

Solicited Reviews:
Review #1
By Pouya Ghiasnezhad Omran submitted on 23/Oct/2020
Major Revision
Review Comment:

The paper discusses different aspects of Wikidata that make it a good fit for handling COVID-19 pandemic information. To support their statement, the authors enumerate some features of Wikidata, including a flexible data model, multilingual features, alignment to external sources and visualization tools. They discuss some challenges in managing data related to the COVID-19 pandemic, such as the complexity of the raw data and the globally crowdsourced nature of the data.
Overall, the paper is written clearly, but its organisation could be improved to make the logical flow more evident. The authors briefly explain how Wikidata works and illustrate the COVID-related data in Wikidata. However, given the paper's goal of showing the fitness of Wikidata for handling COVID-related data, discussions of the following would be insightful: i. issues specific to COVID data, ii. the different methods and technologies available for handling these issues, and iii. the advantages of Wikidata's techniques for addressing the raised issues.
The paper provides a brief introduction to the RDF data model and Wikidata's modelling method, including qualifiers and different kinds of data types. However, it avoids discussing the underlying semantic technologies that have been proposed and deployed for handling various aspects of complex real-world data, including its geospatial and temporal characteristics. For example, for qualifying a fact, there are different competing approaches, including property graphs and RDF*. The explanation of Wikidata's qualifiers does not adequately characterize the syntax and semantics of the qualification method used, nor how Wikidata's approach is better suited to modelling COVID data than the other methods.
Arguing that a knowledge base (e.g. Wikidata) is a reasonable solution for handling COVID-19 related data is an exciting idea. However, the authors do not provide convincing arguments to show how the characteristics of Wikidata address the specific issues that COVID-19 related data raises.
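
To make the comparison between qualification approaches more concrete, the following is a minimal sketch of how a single qualified fact could be written under a statement-node encoding (the style Wikidata's RDF export follows, with separate `p:`, `ps:` and `pq:` predicates) and under an RDF*-style quoted triple. The identifiers (`ex:Item`, `cases`, `pointInTime`, `ex:stmt1`) and the values are hypothetical placeholders rather than real Wikidata IDs or real data, and the output is illustrative text, not validated RDF.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Statement:
    """One claim plus its qualifiers (all identifiers are hypothetical)."""
    subject: str
    prop: str
    value: str
    qualifiers: dict[str, str] = field(default_factory=dict)


def to_statement_node_triples(st: Statement, node: str) -> list[tuple[str, str, str]]:
    """Encode the claim through an intermediate statement node.

    The subject points at the node, the node carries the main value,
    and each qualifier becomes one more triple on the same node.
    """
    triples = [
        (st.subject, f"p:{st.prop}", node),
        (node, f"ps:{st.prop}", st.value),
    ]
    for qualifier, value in st.qualifiers.items():
        triples.append((node, f"pq:{qualifier}", value))
    return triples


def to_rdf_star(st: Statement) -> list[str]:
    """Encode the same claim as RDF*-style annotations on a quoted triple."""
    quoted = f"<< {st.subject} {st.prop} {st.value} >>"
    return [f"{quoted} {q} {v} ." for q, v in st.qualifiers.items()]


# Hypothetical example: a case count qualified by the date it refers to.
claim = Statement("ex:Item", "cases", '"100"', {"pointInTime": '"2020-03-01"'})
node_triples = to_statement_node_triples(claim, "ex:stmt1")
star_lines = to_rdf_star(claim)
```

In this sketch the statement-node form needs two triples before any qualifier is attached, whereas the RDF*-style form annotates the triple directly; which of the two is preferable for COVID-19 data is exactly the question the paper is asked to address.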

Review #2
Anonymous submitted on 06/Nov/2020
Major Revision
Review Comment:

This paper discusses the collaborative efforts of the Wikidata community to build a general-purpose knowledge graph related to COVID-19. The topics covered are comprehensive, illustrative, and, most importantly, very timely. The motivation behind this work is clear, and the selected examples generally justify the merit of knowledge graphs in multidisciplinary research such as that on the COVID-19 pandemic. Below are my comments:

Major concerns:
1. In the Introduction, the authors talk about the benefits and drawbacks of the ‘community developed ontology and typology’ (second paragraph). Regarding the drawbacks, the paper claims that “it makes methodical planning of the whole structure and its granularity very difficult”. However, I do not clearly see in the main text how these issues are addressed in this project.

2. In the Data Model section:
a). The authors claim that ‘… an ontological database representing all aspects of the outbreak’. Is this really the case? For example, does it cover economic aspects, including information about the unemployment rate and supply chain disruption during this outbreak? I think this statement is too ambitious.
b). What exact lessons were learned from the Zika pandemic?
c). The authors mention ‘… could all be represented in Wikidata if matters related to the coverage and conflicts of information from multiple sources are solved’. It would be great if the authors could discuss how the model resolves conflicting statements in the project. For COVID-19, this is particularly essential, as we see various reported ‘facts’ that conflict or are inconsistent with each other. In addition, what does ‘coverage’ mean here? Spatial coverage? Temporal coverage? Or property coverage? This is a little confusing.

3. In the Language Representation section:
a). Figure 4E is confusing. Is the x-axis the rank of languages by usage? If so, what does the y-axis represent? The sentence “The degree of translation of that information is increasingly high with an important representation of the concepts in more than 50 languages (Figure 4E)” does not help in understanding the figure.
b). More importantly, there are multiple correlation analyses in this section, yet no statistical analysis is applied at all. The conclusions are all drawn by informally inspecting the tables. For example, the statement “Despite several differences like the higher visibility of Asian language… the query results largely match the literature-derived data … ” has to be justified more rigorously, e.g., by statistical testing.

4. In the Database Alignment section:
This section lists multiple alignment tables for different domains. However, how were these alignments accomplished? Were any automated algorithms used, or were they based entirely on human effort? Have these alignments been evaluated?

5. In the Visualizing facets of COVID-19 via SPARQL section and the Conclusion:
It is great to see the authors present a relatively comprehensive and well-organized list of SPARQL queries and demonstrate several promising visualizations in the paper. However, I am wondering how accessible and easy it is for a non-SPARQL expert to explore the graph (or simply to understand a query). Do the authors have any empirical examples or cases showing how useful the graph has been to domain experts or the general public? Table S2 seems to be a list of fulfilled tasks, but I cannot find any further context for this table. Elaborating on one of its rows as an example might help readers understand the value of the proposed graph.

6. Last but not least, the authors have to proofread the paper substantially. There are many long sentences, inconsistent uses of terms, typos, duplicates, and many weird sentences. In general, the paper is not that easy to follow. For example, solely in the first paragraph of Section 5.2:
a). whereas others common visualization --> other
b). from scratch from granularity --> one ‘from’ has to be deleted
c). its change over time over time --> duplicates
d). Wikidata’s granularity and collaborating … --> What does ‘wikidata’s granularity’ mean here?
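
As an illustration of the kind of statistical testing requested in concern 3 b), the following is a minimal sketch of a Spearman rank correlation with an exact two-sided permutation p-value, written with the Python standard library only. It is one possible test rather than a prescribed one, the function names are illustrative, and the two score lists are hypothetical placeholders, not figures from the paper. The exact permutation loop is only practical for short lists, and the sketch assumes neither list is constant.

```python
from itertools import permutations
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def exact_permutation_p(x, y):
    """Two-sided p-value: share of rank permutations with |rho| >= observed."""
    rx, ry = average_ranks(x), average_ranks(y)
    observed = abs(pearson(rx, ry))
    count = total = 0
    for perm in permutations(ry):
        total += 1
        if abs(pearson(rx, perm)) >= observed - 1e-12:
            count += 1
    return count / total


# Hypothetical scores for five languages from two sources (placeholder data).
query_scores = [1, 2, 3, 4, 5]
literature_scores = [2, 1, 4, 3, 5]
rho = spearman_rho(query_scores, literature_scores)
p_value = exact_permutation_p(query_scores, literature_scores)
```

Reporting a coefficient and p-value of this kind alongside each table would let readers judge how well the query results match the literature-derived data.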

Minor issues (this is by no means the complete list. As my sixth major point indicated, the authors have to proofread the paper carefully and make it more readable.)
1. page 2:
a). basing --> based
b). entities named items --> entities, named items
2. page 3:
>17,000 (what is this number? Cases? Deaths?)
3. page 5:
Table S1 --> Table 1
4. page 13:
table S2 --> Table S2
5. page 14:
a). allowed --> allows
b). WIkidata --> Wikidata

Review #3
By Gengchen Mai submitted on 20/Dec/2020
Minor Revision
Review Comment:

The paper titled “Representing COVID-19 information in collaborative knowledge graphs: a study of Wikidata” presents the current Wikidata efforts in creating, enriching, and interlinking different data about COVID-19. The Wikidata efforts are discussed in terms of four aspects: its flexible data model, its multilingual features, its alignment to multiple external databases, and its multidisciplinary organization.

Overall, I do think this is an important and valuable paper to be published in the Semantic Web journal.

Nevertheless, I have several suggestions I wish the authors to consider:

1. Some of the Wikidata features the authors discuss are in fact well known, for example the data model, the multilingual features and the alignment to other databases. Since this paper is explicitly about the COVID-19 efforts of Wikidata, I suggest the authors highlight the specific features Wikidata offers for COVID-19.

2. The contribution of this paper was not clear enough to me at the beginning. Only at the end did I realize that the authors are responsible for managing COVID-19 information in Wikidata. I suggest the authors list their contributions at the beginning of the paper.

3. The authors claim this paper is a research paper, while I think it is a dataset paper. I do think dataset papers are also very important, especially for the Semantic Web community. So please reconsider the paper type you want to submit under.