Representing COVID-19 information in collaborative knowledge graphs: the case of Wikidata

Tracking #: 2736-3950

Houcemeddine Turki
Mohamed Ali Hadj Taieb
Thomas Shafee
Tiago Lubiana
Dariusz Jemielniak
Mohamed Ben Aouicha
Jose Emilio Labra Gayo
Eric A. Youngstrom
Mus'ab Banat
Diptanshu Das
Daniel Mietchen

Responsible editor: 
Armin Haller

Submission type: 
Full Paper
Abstract:
Information related to the COVID-19 pandemic ranges from biological to bibliographic, from geographical to genetic and beyond. The structure of the raw data is highly complex, so converting it to meaningful insight requires data curation, integration, extraction and visualization, the global crowdsourcing of which provides both additional challenges and opportunities. Wikidata is an interdisciplinary, multilingual, open collaborative knowledge base of more than 90 million entities connected by well over a billion relationships. A web-scale platform for broader computer-supported cooperative work and linked open data, it can be queried in multiple ways in near real time by specialists, automated tools and the public, including via SPARQL, a semantic query language used to retrieve and process information from databases saved in Resource Description Framework (RDF) format. Here, we introduce four aspects of Wikidata that enable it to serve as a knowledge base for general information on the COVID-19 pandemic: its flexible data model, its multilingual features, its alignment to multiple external databases, and its multidisciplinary organization. The rich knowledge graph created for COVID-19 in Wikidata can be visualized, explored and analyzed, for purposes like decision support as well as educational and scholarly research.
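As a concrete illustration of the SPARQL access the abstract describes, here is a minimal Python sketch that builds, and optionally runs, a query against the public Wikidata Query Service. The endpoint URL, the property P921 ("main subject"), and the item Q84263196 (COVID-19) are real Wikidata identifiers; the query shape and the helper names are illustrative assumptions, not taken from the paper.

```python
# Sketch: query the Wikidata SPARQL endpoint (WDQS) for items about COVID-19.
# build_query/run_query are hypothetical helper names for this illustration.
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def build_query(disease_qid: str = "Q84263196", limit: int = 10) -> str:
    """Return a SPARQL query listing items whose 'main subject' (P921)
    is the given disease item, with English labels."""
    return f"""
    SELECT ?item ?itemLabel WHERE {{
      ?item wdt:P921 wd:{disease_qid} .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT {limit}
    """

def run_query(query: str) -> dict:
    """POST the query to WDQS and parse the JSON result (requires network)."""
    data = urllib.parse.urlencode({"query": query, "format": "json"}).encode()
    req = urllib.request.Request(
        WDQS_ENDPOINT, data=data,
        headers={"User-Agent": "covid-kg-demo/0.1 (example script)"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(build_query())  # inspect the generated SPARQL text
```

The same query can be pasted directly into the query.wikidata.org web interface, which also provides the visualization options (maps, graphs, timelines) mentioned in the abstract.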

Minor Revision

Solicited Reviews:
Review #1
By Pouya Ghiasnezhad Omran submitted on 22/Mar/2021
Minor Revision
Review Comment:

The paper discusses different aspects of Wikidata that make it a fit for handling COVID-19 pandemic information. To support their statement, the authors enumerate several features of Wikidata, including a flexible data model, multilingual features, alignment to external sources, and visualization tools. They discuss some challenges in managing data related to the COVID-19 pandemic, such as the complexity of the raw data and the globally crowdsourced nature of the data.
I reviewed the previous version of this paper. Having given Major Revision as my evaluation, I felt the paper needed improvements to be publishable in a quality journal like the Semantic Web Journal. I thank the authors for their efforts to address my concerns, but I think there is still room for further improvement.
Based on the authors' response to my comment about the advantages of the proposed model for managing COVID-19 data ("Here, Wikidata's model offers a 'good-enough' model to assess this statement."), the motivation for the work could be strengthened. In general, I think addressing a task with an existing tool in a moderately good manner is not adequate motivation for a research paper.
The paper's novelty could be highlighted more explicitly; in particular, the authors could enumerate their contributions in the introduction.
Discussions of the general characteristics of COVID data and the general capacities of Wikidata, regarding the expressivity of its formalism and its tools for validating data, could be included more extensively. Explaining the pros and cons of using Wikidata and related technologies to handle COVID-like data would be constructive for choosing suitable technologies for future problems.

Review #2
Anonymous submitted on 29/Mar/2021
Minor Revision
Review Comment:

The paper has improved a lot after this round of revision, and the responses to my comments are satisfactory. Below are further suggestions to help the authors improve the manuscript:

1. The authors claim that “With respect to COVID-19 data challenges, …. Its existing community has been using it to capture COVID-19-related knowledge right from the start” (section 2). Are there any citations or other evidence to support such a statement?
2. The many general technical descriptions of Wikidata, RDF, etc. make the paper unnecessarily long and cause it to lose focus. For instance, in section 2.1, I don’t think it is relevant to discuss the reification of statements across different graphs; the discussion of how YAGO uses a different approach seems even stranger here. IMHO, this paper should focus directly on the use of Wikidata to help address COVID-19 challenges. Any unnecessary or irrelevant content will distract readers from the main contribution of the paper. I think this partially explains why other reviewers share the feeling that the contribution of this paper is unclear.
3. In section 2.2 (page 12), the ‘low correlation of Wikidata and the number of speakers’ indicates that there is no significant correlation between the two factors, right? Then how can one draw the suggestion that “encouraging the contribution by speakers of under-resourced and unrepresented languages to medical Wikipedia projects and to Medical Wikidata is highly valuable”? I think the authors might have misinterpreted the correlation coefficients.
4. The tables in Section 2.3 can be shortened by showing only examples; the full tables can be moved to an Appendix.
5. The new version discusses many drawbacks of Wikidata, but one can hardly find potential solutions. For instance: “However, this Wikidata coverage of the availability of COVID-19-related publications in external research databases does not seem to fully represent full records of COVID-19 literature in aligned resources.” and “In addition to such sampling biases, there are also differences in annotation workflows, e.g. in terms of the multilinguality of or the hierarchical relationships between topic tags in Wikidata versus comparable systems like Medical Subject Headings.” Solutions to these issues would be more valuable and would also highlight the contribution of this paper.
6. The text in many figures (e.g., Figures 7, 8, 9, and S2) is hard to read; these figures need adjustment.
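The statistical point in comment 3 can be made concrete with a small, self-contained sketch (toy numbers, not the paper's data; the `pearson` helper is purely illustrative): a Pearson coefficient near zero indicates only a weak linear relationship between the two quantities and should not, by itself, be read as support for a directional recommendation.

```python
# Toy illustration (made-up data): a low Pearson correlation between
# number of speakers and Wikidata coverage means the two quantities
# move together only weakly; it says nothing about either variable alone.
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

speakers = [1, 2, 5, 10, 100, 500]   # hypothetical speaker counts (millions)
coverage = [30, 5, 80, 10, 40, 20]   # hypothetical label counts (thousands)
r = pearson(speakers, coverage)      # a value near 0 => weak linear relation
```

With samples this small, even a moderately large |r| can arise by chance, which is why a significance test (or at least a confidence interval) should accompany any interpretation of the coefficient.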


Review #3
Review Comment:

Dear authors,

I had a brief look at some parts of the paper and quite a few questions developed during reading.

# Recency

On page 22 you are giving as a source for active users in Wikidata, but these are defined as performing "any action"; active editors are listed at 13k per month, which is about half. I mention this because recency or up-to-dateness has, to the best of my knowledge, been verified for Wikipedia (e.g. in comparison to Encyclopedia Britannica), but not for Wikidata (or has it? Do you have a reference?). For EN Wikipedia, 46k active editors are listed. So EN Wikipedia has 3.5 times more editors, but 12 times fewer pages/items to maintain (7 million instead of 90 million). You also mention Wikipedia as being more up-to-date in certain areas; this part should be looked at more carefully. Wikidata was not adopted to replace Wikipedia's infoboxes because it is not as recent as Wikipedia, see discussion here: Infoboxes are still growing considerably in Wikipedia.

You also mention DBpedia as being made by machines. The main advantage of DBpedia is that it does not need the continuous effort of 13k human users: this work has already been done by 46k Wikipedians, and the quality of Wikipedia edits is very high and timely. So in terms of recency, the DBpedia approach of extracting data from the up-to-date Wikipedia should often be more recent.

* EN Wikipedia:
* Wikidata: copied on April 1st, 2020 from EN Wikipedia. Note that the numbers were copied erroneously, i.e. they are one order of magnitude too small!
* Ad hoc DBpedia extraction:

DBpedia is not perfect, of course, but it does quite a good job of reflecting the up-to-date Wikipedia. Admittedly, this was only one example, but it was the first one I looked at. Maybe I was lucky.

I am sceptical whether the assumption holds that Wikidata handles COVID-19 data well, as such data is highly volatile and evolving. Theoretically, it can or could handle such data well, given that enough editors spend the effort. Often, however, the quality of human curation reaches only 80%, as beyond this point it gets much harder to contribute (the 20/80 rule, or the law of diminishing returns). Quite a few fields in Wikidata were filled using an infobox extractor of unclear quality ("imported from" in the statement metadata; see the example). I am unsure whether this extractor has an update function, and I assume that most of the time it is run only once and never updated again (see the example above). Did you verify the data used regarding its recency or correctness?

My question would be whether the paper has produced anything conclusive in the direction of recency, i.e. can you really use Wikidata's data to draw solid or reliable conclusions? In my opinion, it could be troublesome to use partial or incorrect information to visualize something and thus make it look more "truthy". Couldn't the many linked data sources (Tables 5-8) or DBpedia Live/ad hoc extraction be used for verification and comparison? If data is copied into Wikidata, there is always a risk that it becomes stale after a while.

# minor
* Tables 5-8 are missing a "k" for thousands in the counts.
* "to drive Wikidata instead of other systems that represent entities using textual expressions, particularly Virtuoso [19]." Virtuoso is a graph database; you can host anything with it. It is open source and quite scalable as well, see Wikidata loaded into Virtuoso:

# 5. Conclusions
Reading the conclusion section, I do not get a clear idea of what the contribution of this paper is. It mentions that Wikidata is user-friendly to query and that visualisations can be created. What are the "deeper insights" mentioned in the conclusions section?
For DBpedia, the research methodology was quite clear: everybody found shortcomings and criticized DBpedia heavily, and then devised ways to improve it, which made DBpedia better in the end. In this paper, I saw some paragraphs of self-criticism and shortcomings mentioned in the text, but no suggestions on how these could be improved upon or mitigated, except for the argument that "the community will take care of these". I am sure that Wikidata could be quite good at integrating data (naturally, if you load data from different Linked Data sources into one database, you can do very good analytics). I am just wondering how well it worked here, what worked well, what didn't, and what needs to be improved. The conclusion section seems very unspecific on this, but I am quite curious to know.

# 2.3 Database alignment and 5 Conclusions
The conclusion claims that "Wikidata has become a hub for COVID-19 knowledge." This is probably based on section 2.3 (Database alignment), which also mentions the 5302 Wikidata properties used for alignment. DBpedia also extracts these from Wikidata, trying hard to judge their semantics on extraction. They range from HTML or wiki links to rdfs:seeAlso to skos: to owl:sameAs, which is quite mixed (as also written in the cover letter). Sometimes Wikipedia articles (and therefore the related DBpedia and Wikidata IDs) are quite general and need two or more links with "skos:narrower" semantics when linking to the same source. Did you gain any insights there into which semantics apply to which of the 5302 Wikidata properties, or how to distinguish between these semantics or use these links? Any insight here would be highly appreciated.

As written before, I only took a brief look at the paper, starting with the introduction and the conclusion section, then selectively some more parts. Sorry if I am asking questions that have been answered elsewhere in the paper.

Dear Sir,

Thank you for your comments on our research paper. We believe that these comments will improve the final version of this research publication. Please find the answers to all your comments attached in the last pages of the updated edition of this work. We would be honoured to receive any further comments on our paper.