Reducing the Underrepresentation of Transnational Writers through Biographical Event Extraction

Tracking #: 3385-4599

Marco Stranisci
Viviana Patti
Rossana Damiano

Responsible editor: 
Guest Editors Wikidata 2022

Submission type: 
Full Paper
Wikidata represents an important source of literary knowledge, which is collaboratively created and curated by a large community of users. In this archive, it is possible to find hundreds of thousands of pages about writers and their works. However, Wikidata is affected by the underrepresentation of Transnational authors, as recently demonstrated. The issue is present at different levels: not only are Transnational writers fewer in number, but their pages also contain less biographical information. In this paper we present an approach for reducing this form of underrepresentation by automatically extracting biographical information from Wikipedia through transformers and lexico-semantic patterns, and encoding it into the Wikidata semantic model. Results show that our approach increases the number of biographical triples on Wikidata for all writers while rebalancing the knowledge base in favour of Transnational writers.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 01/May/2023
Major Revision
Review Comment:

# outline
This paper presents an analysis of the representation of authors from different backgrounds in Wikidata. The authors of the paper highlight the underrepresentation of so-called "transnational" authors in Wikidata, and present an automated process to increase the number of triples related to them.

Specifically, the authors:
* conduct an exploratory analysis to show that entities of "transnational" authors have a smaller number of triples associated with them, compared to "non-transnational" authors.
* collect a small corpus of biographical texts, and use it to train and test a pipeline for biographical event extraction.
* propose a hybrid method for entity and event extraction, that uses Lexico-Semantic Patterns and Language Models.
* using the proposed method, they attempt to augment the set of triples associated to "transnational" authors in Wikidata.

# strengths
* The problem of representation in vast, popular, and often-used-as-ground-truth resources such as Wikipedia and Wikidata is serious and important to the community.
* The paper highlights an issue that, while not surprising, is important to be brought up, measured, and addressed.

# weaknesses
* The approach described in the paper includes a number of methodological shortcomings. Specifically:
  * the definition of "transnational" is not one that is globally known (or respected or acceptable). The definition adopted by the authors (which is presented very late in the paper, at the end of page 3) is not clear: an exhaustive list of the "countries of birth" and "ethnic minorities in a Western country" would help immensely to identify the contributions of the paper. Moreover, the authors do not include a note on their own positionality and the fact that their definition of "transnational" emerges from their own understanding and lived experiences.
  * there is not enough justification for the selection of the four properties that are selected to be modelled and added for "transnational" authors. Moreover, the fact that a number of authors might not have had formal education, or might not have received (or been nominated for) any (well-known and well-represented) awards, actually perpetuates the issue that the proposed method sets out to address.
  * the motivation of the paper is not very clear. Specifically, the connection between increasing the number of Wikidata triples and the downstream representation problem is not very strong (how exactly does an increased number of triples lead to more representation of a specific author?).
  * the authors do not comment on the fact that extracting triples from raw text has exactly the same issues as the one they attempt to resolve: most likely, free text on "transnational" authors is not available, not easy to find, or not available in English. The proposed method does not address the issue of retrieving relevant text, and merely runs a triple extraction system on a number of documents on "transnational" writers. Is this process technically more challenging for documents on "transnational" writers?
* The approach described in the paper includes a number of technical methodological shortcomings. Specifically:
  * the size of the collected corpus is extremely small, especially since it is used to train systems. It is very challenging to attribute any performance improvements to the inclusion of 5 training documents.
  * Figure 1 is not very informative, unless it accounts for a number of priors: prior distributions of each of the modelled "categories", the fact that there might be more information available for older (not alive) and more popular authors, the fact that there are a number of factors at play in revealing sexual and gender identities for earlier generations.
  * section 4.1: to my understanding, the described process looks more like relation extraction and less like coreference resolution.
  * some more experimental information would make the technical sections of the paper stronger:
    * section 4.2, on the configuration of the data splits: it would be great to include some more details on the selection process.
    * section 4.2, on the selection of the best-performing model: it is not clear if the model selection was based on a development set or on the test data.
    * it would be great if the authors included a justification for choosing to measure the performance of the system using precision only (Table 1). Also, it would be great to include some metrics in statements related to recall (page 8, line 49).
* Finally, given that this paper builds on top of previous work also by the authors, the contributions of this paper and their difference from prior work are not very clear.

# typos, writing
* In Figures 3 and 4, it is not clear to me what the y-axis is: is it thousands of authors?
* page 6, line 20: missing Table reference.

Review #2
Anonymous submitted on 02/Jun/2023
Review Comment:

This paper focuses on the problem of reducing the underrepresentation of Transnational writers, which has strong practical applications. To address this problem, the authors propose a pipeline method that first performs Biographical Event Detection and then extracts triples.

For the approach section, it would help readers better understand the method if the authors provided a diagram of the model structure.

In the experiments section, the authors should compare their method with more types of methods, such as pretrained-based and non-pretrained-based methods, to verify its validity.

Problems of Typos, Grammar, Style, and Presentation:
1. Line 47, page 4, “As described in the review of the related work (Section 4.3),”;
2. Line 20, page 6, “Therefore, in Table ?? 4 named entities”.

Review #3
By Daniele Metilli submitted on 02/Jul/2023
Major Revision
Review Comment:

The paper presents a method for improving Wikidata items about underrepresented writers, using natural language processing to extract information from the corresponding English Wikipedia pages. The paper is interesting and relevant, providing a good overview of the methodology followed by the authors to achieve their goal, and reporting the results of a limited evaluation.

However, there are some significant issues that need to be addressed, requiring a major revision of the paper. The main issues are described below.

1. Classification

In my opinion, the choice of “Transnational” vs. “Western” as the reference categories for writers is quite problematic. To define “Transnational”, the authors cite source [43] which states that it refers to people who “operated outside their own nation’s boundaries, or negotiated with them”. By this definition, any writer who moved between two countries (even if they are both Western countries) should be considered “Transnational”.

Therefore, the dichotomy Western vs. Transnational seems incorrect. If the goal is to look specifically at writers who migrated to a different country (i.e. Migrants), this should be clarified in the paper. If the goal is instead to improve the representation of writers from the Global South vs. the Global North, I suggest adopting the term Global Majority.

The selection criteria for this classification are also not sufficiently explained in the paper. Section 3.1 begins by stating that “it is necessary to define a clear classification of people based on their ethnicity”, however, Wikidata provides ethnicity information (through property P172) for less than 1% of people, therefore it is unclear how the authors can rely on ethnicity without introducing a significant selection bias.

Moreover, the two criteria are described very vaguely, e.g. which ethnicities? which countries? The authors should describe them in more detail, possibly provide a flow chart showing how the writers have been classified and explaining why — there is more information on the website of the project, but this needs to be reported in the paper or at least cited more clearly.

Finally, given that the topic of the paper is underrepresentation, I find it quite strange that the authors are not adequately situating their research within a wider anti-colonial framework, citing only two papers by Global Majority scholars. I should also note that the reference selected by the authors for their definition of Transnational [43] is from a book that disproportionately features Western white scholars.

2. Methodology

The methodology for event extraction is sound, but there is one important limitation that is not addressed in the paper. The authors state that they are extracting triples that can be used to improve Wikidata biographies, however, they are only extracting the subject and the predicate of the triple as entities. The object is apparently extracted as a simple text string.

The paper seems to suggest that the new triples can be readily added to Wikidata, but it is actually impossible to do so without performing some form of entity linking. I fail to understand why this is never mentioned in the paper, as it makes the results unusable without further work.

The “extracted triple” reported on page 7 (which is actually a set of 5 triples) strangely lists “FAAP” as the value of wdt:P69, which is inconsistent with the Wikibase ontology because the range of property P69 is WikibaseItem, not String. The correct value should be wd:Q5508993 (Fundação Armando Alvares Penteado).

Moreover, the triple "wdt:P69 prov:wasDerivedFrom nodeID://b29989498" is incorrect. In Wikibase, the provenance is attached to the statement, not to the property. Please review the paper by Erxleben et al. (2014) "Introducing Wikidata to the Linked Data Web".
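The statement-level modelling referred to here can be sketched as follows. This is only an illustration of the Wikibase RDF pattern, not the authors' pipeline: the subject identifier `Q_SUBJECT`, the statement and reference node identifiers, and the choice of P143 (imported from Wikimedia project) with Q328 (English Wikipedia) as the reference are all assumptions made for the example. Only P69 and Q5508993 come from the paper under review.

```python
# Sketch of the Wikibase RDF pattern: provenance is attached to the
# statement node, never to the property itself. Apart from P69 and
# Q5508993, every identifier used in the example call is a placeholder.

PREFIXES = """\
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix wds: <http://www.wikidata.org/entity/statement/> .
@prefix wdref: <http://www.wikidata.org/reference/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix p: <http://www.wikidata.org/prop/> .
@prefix ps: <http://www.wikidata.org/prop/statement/> .
@prefix pr: <http://www.wikidata.org/prop/reference/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
"""


def statement_turtle(subject, prop, value, statement_id,
                     reference_id, ref_prop, ref_value):
    """Return Turtle for one item-valued statement with one reference."""
    lines = [
        # "Truthy" shortcut triple: the object is an item, not a string.
        f"wd:{subject} wdt:{prop} wd:{value} .",
        # Full statement: subject -> statement node -> value.
        f"wd:{subject} p:{prop} wds:{statement_id} .",
        f"wds:{statement_id} ps:{prop} wd:{value} ;",
        # The provenance link starts from the statement node.
        f"    prov:wasDerivedFrom wdref:{reference_id} .",
        f"wdref:{reference_id} pr:{ref_prop} wd:{ref_value} .",
    ]
    return PREFIXES + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(statement_turtle(
        subject="Q_SUBJECT",        # hypothetical writer item
        prop="P69",                 # educated at
        value="Q5508993",           # Fundação Armando Alvares Penteado
        statement_id="STMT_EXAMPLE",  # hypothetical statement node
        reference_id="REF_EXAMPLE",   # hypothetical reference node
        ref_prop="P143",            # assumed reference property
        ref_value="Q328",           # assumed reference value
    ))
```

Producing the item value `wd:Q5508993` instead of the string "FAAP" is exactly the step that requires entity linking.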

When the authors mention removing sentences which do not mention the target entity in Section 4.2, it is important to clarify whether sentences containing only pronouns, variant spellings of the name, or other references to the subject (e.g. “the poet”) are kept or removed.

3. Evaluation

The evaluation section of the paper has significant issues. First of all, the definition of “precision” is unclear. The authors report neither recall nor F-score. The selection process of the evaluation dataset is not well described. The evaluation is limited to 200 samples from a dataset of 180,000+ sentences, which seems insufficient.
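To make the concern about metrics and sample size concrete, here is a minimal sketch of the standard definitions together with a Wilson score interval for a precision estimated from a sample. The counts used in the example call (180 correct out of 200) are hypothetical and are not taken from the paper. Note that recall needs the number of false negatives, which cannot be obtained by sampling only the system's outputs.

```python
import math


def precision_recall_f1(tp, fp, fn):
    """Standard definitions from true/false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Hypothetical outcome: 180 of 200 sampled triples judged correct.
    low, high = wilson_interval(180, 200)
    print(f"precision 0.90, approximate 95% interval [{low:.3f}, {high:.3f}]")
```

With a sample of this size the interval is several percentage points wide on each side, and it says nothing about per-property precision unless the sample is stratified.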

When looking at the evaluation dataset on Zenodo, the first sentence is evaluated as True, however the sentence refers to “Oxford’s Socratic Club” which is not a school or university, and the object of the triple is the text “Oxford ’s”, which is ambiguous.

The second sentence is also evaluated as True, however the object of the triple is “Nottingham”, while the sentence says “Nottingham and Liverpool universities”. The result is thus ambiguous (Nottingham is a city with two universities) and also incomplete (Liverpool has not been matched).

The fourth sentence is evaluated as True, but the object of the triple “Paris University” is ambiguous as there are several universities in Paris.

The seventh sentence is evaluated as True, but it claims that the writer was educated at the City Museum of Stockholm, which is not at all what the sentence claims.

The 12th sentence is evaluated as True, but it claims that the writer was educated at BFA. The meaning of BFA is "Bachelor of Fine Arts", which is a degree, not a university.

For the “employed at” property, there are some issues with associations or clubs (e.g. Rotary Club, Alliance of Independent Journalists) that are recognised as employers, but the correct Wikidata property should probably be “P463 member of”.

For the awards, I have seen examples where the prize is recognised as "writers guild" (an organisation), but the actual prize is the Writers Guild Award.

How can the authors consider these results to be True when in practice they are highly vague and ambiguous, and basically unusable for improving Wikidata without performing either a manual check or some form of entity linking?

Moreover, how did the authors compare their results to the current Wikidata statements to understand whether they are actually providing an improvement or not? For example, if the result says Nottingham and the Wikidata statement says University of Nottingham, is this a new statement?
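The difficulty can be illustrated with a small sketch. The function below is hypothetical and is not part of the authors' pipeline: it compares an extracted string with the labels of the values already present on the item, and it can only flag the Nottingham case as ambiguous rather than decide it.

```python
def compare_to_existing(extracted, existing_labels):
    """Classify an extracted object string against existing value labels.

    Returns "duplicate" for an exact match after normalisation,
    "ambiguous" when one token set contains the other, else "new".
    This is a naive label comparison, not entity linking.
    """
    def tokens(text):
        return text.lower().split()

    extracted_tokens = tokens(extracted)
    if not extracted_tokens:
        raise ValueError("extracted string is empty")

    label_tokens = [tokens(label) for label in existing_labels]
    if extracted_tokens in label_tokens:
        return "duplicate"

    extracted_set = set(extracted_tokens)
    for candidate in label_tokens:
        candidate_set = set(candidate)
        if extracted_set <= candidate_set or candidate_set <= extracted_set:
            return "ambiguous"
    return "new"


if __name__ == "__main__":
    print(compare_to_existing("Nottingham", ["University of Nottingham"]))
```

Any count of "new" triples that does not resolve the ambiguous cases, whether manually or through entity linking, risks overstating the improvement.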

From all the above, I think that the evaluation requires a thorough review.

4. Goals

Looking at the results, the improvement to biographies overall seems significant (provided that the evaluation is correct), but it is not specific to underrepresented writers. In Fig. 3 and 5, the Transnational category continues to be underrepresented. This leads to the question of what is the ultimate goal of the authors' research.

If the goal is to address underrepresentation in Wikidata biographies, the authors should explain how they plan to do so given the limitations highlighted above, and also considering three more issues which are not addressed at all in the paper:

• While import of data from Wikipedia to Wikidata is quite common, this is problematic because Wikidata’s reliable sources policy specifically states that Wikipedia is *not* an appropriate source for Wikidata statements. In the past, mass addition of information extracted from Wikipedia has caused significant issues with the sourcing and accuracy of statements on Wikidata. Have the authors considered how to address this issue, e.g. by also extracting the original sources cited in Wikipedia and using them as references for the statements?

• The authors’ approach is language-dependent, and they focused on English, but this is not discussed anywhere in the paper. This language choice can significantly affect the dataset and introduce biases, as the biographies of authors who are not represented on the English Wikipedia will not be improved. Do the authors plan to address this limitation in the future?

• The copyright license of the Zenodo dataset (CC BY 4.0) is different from the one used on Wikipedia, which is CC BY-SA 4.0. While the Wikipedia text that is incorporated in the dataset probably falls under fair use for research purposes, it would be better to republish it under the same license and provide links to the Wikipedia pages the text is extracted from. The license is also incompatible with Wikidata, which uses CC Zero. This is probably not a problem because the Wikipedia text would not actually be added to Wikidata, but these copyright issues should still be addressed.

5. Other minor issues

• The introduction does a good job of introducing the problem, but it takes some things for granted. For example, the authors should briefly explain what Wikidata is and how it relates to the Semantic Web.

• “these knowledge bases have proven to be flawed by the lack of neutrality” —> this is quite a bold statement that requires references to previous studies that analysed biases in OpenLibrary and Worldcat, and also a definition of what "neutrality" means in this context.

• Please capitalize Black in “Black studies”.

• In Section 3.1, the sentence “There are 0.93 properties...” sounds a bit strange. I suggest rephrasing it, for example as "Wikidata items about Transnational writers contain on average..."

• In Fig. 1, non-binary people are invisible. I suggest improving the image or, if not possible, acknowledging this limitation in the text of the paper or in the image label.

• In Section 4.2, the table reference shows two question marks.

• In Section 4.2, please check the spelling of DistilBert as it is inconsistent (first lowercase, then titlecase).

• In Section 5, “led to few errors, but probably affected the recall” —> this is very vague. If the recall is not measured, there is no way to verify this.

• The charts in Fig. 3, Fig. 4 and Fig. 5 are difficult to understand. The label states that they show the before and after, but it seems that the first two columns show the before, the second two columns show the additions, and the third two columns show the after. It would probably be simpler to use a stacked bar chart.