Review Comment:
The paper presents a method for improving Wikidata items about underrepresented writers, using natural language processing to extract information from the corresponding English Wikipedia pages. The paper is interesting and relevant, providing a good overview of the methodology followed by the authors to achieve their goal, and reporting the results of a limited evaluation.
However, there are some significant issues that need to be addressed, requiring a major revision of the paper. The main issues are described below.
1. Classification
In my opinion, the choice of “Transnational” vs. “Western” as the reference categories for writers is quite problematic. To define “Transnational”, the authors cite source [43] which states that it refers to people who “operated outside their own nation’s boundaries, or negotiated with them”. By this definition, any writer who moved between two countries (even if they are both Western countries) should be considered “Transnational”.
Therefore, the dichotomy Western vs. Transnational seems incorrect. If the goal is to look specifically at writers who migrated to a different country (i.e. Migrants), this should be clarified in the paper. If the goal is instead to improve the representation of writers from the Global South vs. the Global North, I suggest adopting the term Global Majority.
The selection criteria for this classification are also not sufficiently explained in the paper. Section 3.1 begins by stating that “it is necessary to define a clear classification of people based on their ethnicity”. However, Wikidata provides ethnicity information (through property P172) for less than 1% of people, so it is unclear how the authors can rely on ethnicity without introducing a significant selection bias.
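To illustrate how sparse this property is, its coverage can be estimated with a query along the following lines against the Wikidata Query Service. This is only a sketch: the exact figure changes over time, and an unrestricted count over all humans may time out on the public endpoint, so a restriction (e.g. to writers) may be needed in practice.

```sparql
# Count humans (Q5) that have at least one ethnic-group (P172) statement.
# The wd:/wdt: prefixes are predefined on the Wikidata Query Service.
SELECT (COUNT(DISTINCT ?person) AS ?withEthnicity) WHERE {
  ?person wdt:P31 wd:Q5 ;        # instance of: human
          wdt:P172 ?ethnicity .  # ethnic group
}
```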
Moreover, the two criteria are described very vaguely (e.g. which ethnicities? which countries?). The authors should describe them in more detail and possibly provide a flow chart showing how the writers were classified and why. There is more information on the website of the project, but this needs to be reported in the paper or at least cited more clearly.
Finally, given that the topic of the paper is underrepresentation, I find it quite strange that the authors are not adequately situating their research within a wider anti-colonial framework, citing only two papers by Global Majority scholars. I should also note that the reference selected by the authors for their definition of Transnational [43] is from a book that disproportionately features Western white scholars.
2. Methodology
The methodology for event extraction is sound, but there is one important limitation that is not addressed in the paper. The authors state that they are extracting triples that can be used to improve Wikidata biographies, however, they are only extracting the subject and the predicate of the triple as entities. The object is apparently extracted as a simple text string.
The paper seems to suggest that the new triples can be readily added to Wikidata, but it is actually impossible to do so without performing some form of entity linking. I fail to understand why this is never mentioned in the paper, as it makes the results unusable without further work.
The “extracted triple” reported on page 7 (which is actually a set of 5 triples) strangely lists “FAAP” as the value of wdt:P69, which is inconsistent with the Wikibase ontology because the range of property P69 is WikibaseItem, not String. The correct value should be wd:Q5508993 (Fundação Armando Alvares Penteado).
Moreover, the triple "wdt:P69 prov:wasDerivedFrom nodeID://b29989498" is incorrect. In Wikibase, the provenance is attached to the statement, not to the property. Please review the paper by Erxleben et al. (2014) "Introducing Wikidata to the Linked Data Web".
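For clarity, in the Wikibase RDF model a reference hangs off the statement node, not off the property or the truthy `wdt:` triple. The correct shape can be seen in a query sketch such as the following (the `p:`, `ps:` and `pr:` prefixes are the standard Wikibase namespaces, predefined on the query service):

```sparql
# item --p:P69--> statement node --ps:P69--> value
#                 statement node --prov:wasDerivedFrom--> reference node
SELECT ?almaMater ?refURL WHERE {
  ?writer p:P69 ?statement .                       # full statement node
  ?statement ps:P69 ?almaMater .                   # statement value (an item)
  OPTIONAL {
    ?statement prov:wasDerivedFrom ?reference .    # provenance on the statement
    ?reference pr:P854 ?refURL .                   # reference URL, if any
  }
}
LIMIT 10
```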
When the authors mention removing sentences which do not mention the target entity in Section 4.2, it is important to clarify whether sentences containing only pronouns, variant spellings of the name, or other references to the subject (e.g. “the poet”) are kept or removed.
3. Evaluation
The evaluation section of the paper has significant issues. First of all, the definition of “precision” is unclear, and the authors report neither recall nor F-score. The selection process of the evaluation dataset is not well described, and the evaluation is limited to 200 samples out of a dataset of 180,000+ sentences, which seems insufficient.
When looking at the evaluation dataset on Zenodo, the first sentence is evaluated as True, however the sentence refers to “Oxford’s Socratic Club” which is not a school or university, and the object of the triple is the text “Oxford ’s”, which is ambiguous.
The second sentence is also evaluated as True, however the object of the triple is “Nottingham”, while the sentence says “Nottingham and Liverpool universities”. The result is thus ambiguous (Nottingham is a city with two universities) and also incomplete (Liverpool has not been matched).
The fourth sentence is evaluated as True, but the object of the triple “Paris University” is ambiguous as there are several universities in Paris.
The seventh sentence is evaluated as True, but it claims that the writer was educated at the City Museum of Stockholm, which is not at all what the sentence claims.
The 12th sentence is evaluated as True, but it claims that the writer was educated at BFA. The meaning of BFA is "Bachelor of Fine Arts", which is a degree, not a university.
For the “employed at” property, there are some issues with associations or clubs (e.g. Rotary Club, Alliance of Independent Journalists) that are recognised as employers, but the correct Wikidata property should probably be “P463 member of”.
For the awards, I have seen examples where the prize is recognised as "writers guild" (an organisation), but the actual prize is the Writers Guild Award.
How can the authors consider these results to be True when in practice they are highly vague and ambiguous, and basically unusable for improving Wikidata without performing either a manual check or some form of entity linking?
Moreover, how did the authors compare their results to the current Wikidata statements to understand whether they are actually providing an improvement? For example, if the result says Nottingham and the Wikidata statement says University of Nottingham, is this a new statement?
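One simple way to perform such a check is an ASK query per extracted candidate, comparing the extracted string against the labels and aliases of the item's existing P69 values. This is only a sketch (`wd:Q_TARGET` is a placeholder for the writer's QID, and substring matching is a deliberately naive heuristic):

```sparql
# Does the target item already have a P69 value whose label or alias
# contains the extracted string? Replace wd:Q_TARGET with the writer's QID.
ASK {
  wd:Q_TARGET wdt:P69 ?school .
  ?school rdfs:label|skos:altLabel ?name .
  FILTER (contains(lcase(str(?name)), "nottingham"))
}
```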
From all the above, I think that the evaluation requires a thorough review.
4. Goals
Looking at the results, the improvement to biographies overall seems significant (provided that the evaluation is correct), but it is not specific to underrepresented writers. In Fig. 3 and 5, the Transnational category continues to be underrepresented. This leads to the question of what is the ultimate goal of the authors' research.
If the goal is to address underrepresentation in Wikidata biographies, the authors should explain how they plan to do so given the limitations highlighted above, and also considering three more issues which are not addressed at all in the paper:
• While import of data from Wikipedia to Wikidata is quite common, this is problematic because Wikidata’s reliable sources policy (https://www.wikidata.org/wiki/Help:Sources) specifically states that Wikipedia is *not* an appropriate source for Wikidata statements. In the past, mass addition of information extracted from Wikipedia has caused significant issues with the sourcing and accuracy of statements on Wikidata. Have the authors considered how to address this issue, e.g. by also extracting the original sources cited in Wikipedia and using them as references for the statements?
• The authors’ approach is language-dependent, and they focused on English, but this is not discussed anywhere in the paper. This language choice can significantly affect the dataset and introduce biases, as the biographies of authors who are not represented on the English Wikipedia will not be improved. Do the authors plan to address this limitation in the future?
• The copyright license of the Zenodo dataset (CC BY 4.0) is different from the one used on Wikipedia, which is CC BY-SA 4.0. While the Wikipedia text that is incorporated in the dataset probably falls under fair use for research purposes, it would be better to republish it under the same license and provide links to the Wikipedia pages the text is extracted from (see https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content). The license is also incompatible with Wikidata, which uses CC Zero — this is probably not a problem because the Wikipedia text would not actually be added to Wikidata, but these copyright issues should still be addressed.
5. Other minor issues
• The introduction does a good job of introducing the problem, but it takes some things for granted. For example, the authors should briefly explain what Wikidata is and how it relates to the Semantic Web.
• “these knowledge bases have proven to be flawed by the lack of neutrality” —> this is quite a bold statement that requires references to previous studies that analysed biases in OpenLibrary and Worldcat, and also a definition of what "neutrality" means in this context.
• Please capitalize Black in “Black studies” (see https://news.ucdenver.edu/is-the-b-in-black-capitalized-yes/)
• In Section 3.1, the sentence “There are 0.93 properties...” sounds a bit strange. I suggest rephrasing such as for example "Wikidata items about Transnational writers contain on average..."
• In Fig. 1, non-binary people are invisible. I suggest improving the image or, if not possible, acknowledging this limitation in the text of the paper or in the image label.
• In Section 4.2, the table reference shows two question marks.
• In Section 4.2, please check the spelling of DistilBert as it is inconsistent (first lowercase, then titlecase).
• In Section 5, “led to few errors, but probably affected the recall” —> this is very vague. If the recall is not measured, there is no way to verify this.
• The charts in Fig. 3, Fig. 4 and Fig. 5 are difficult to understand. The label states that they show the before and after, but it seems that the first two columns show the before, the second two columns show the additions, and the third shows the after. It would probably be simpler to use a stacked bar chart.