How can the social sciences benefit from knowledge graphs? A case study on using Wikidata and Wikipedia to examine the world’s billionaires

Tracking #: 3370-4584

Authors: 
Daria Tisch
Franziska Pradel

Responsible editor: 
Guest Editors Wikidata 2022

Submission type: 
Full Paper
Abstract: 
This study examines the potentials of Wikidata and Wikipedia as knowledge graphs for the social sciences. The study demonstrates how social science research may benefit from these knowledge bases by examining what we can learn from Wikidata and Wikipedia about global billionaires (2010-2022). First, knowledge graphs provide human knowledge, which can be used to generate datasets informing about, for example, political, economic, and cultural elites or other notable people. Second, knowledge graphs provide linked (open) data that can be used to examine social networks of a different kind but also enable social scientists to connect different databases to enrich their research data. We show that the English Wikipedia and, to a lesser extent, Wikidata exhibit gender and nationality biased in the coverage and information about global billionaires. Using the genealogical information that Wikidata provides, we examine the family webs of billionaires and show that at least 15% of all billionaires have a family member also being a billionaire. We discuss the challenges and limitations of using Wikidata and Wikipedia for research purposes.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 14/Apr/2023
Suggestion:
Major Revision
Review Comment:

Summary:
The paper studies how the social sciences can benefit from Wikidata and Wikipedia. As a use case, the authors investigate how these structured, open community datasets can give insight into the world’s billionaires. They measure bias in the portrayal of the world’s billionaires on Wikipedia and Wikidata, gauge public interest in these billionaires from user views, and perform a network analysis of the family web of billionaires, indicating that billionaires are connected by family and marital lines.
The authors conclude that Wikidata and Wikipedia metadata can be used to examine data coverage and bias. Moreover, they conclude that Wikipedia can be a useful data source to study public interest in persons or subjects. Lastly, they indicate that genealogical data (or other data, such as elite family networks or cultural organizations) in Wikidata can be used for network analysis.

Contributions: 1. a study of coverage and bias of Wikidata and Wikipedia, 2. a feasibility study of the use of Wikidata and Wikipedia for answering social science research questions.
Strengths: the paper is clearly written and demonstrates an interesting use case of Wikipedia and Wikidata. Related work appears mostly complete and reflects in an interesting way on the use of these (semi-)structured resources for social science questions. The methods and results are interesting as well as technologically sound.

Weaknesses: the contributions are limited in novelty and generalisability, as studies measuring coverage and bias of Wikidata and Wikipedia have been done before, on a more systematic and larger scale. The authors mention a couple (a publication they could add: "An Analysis of Content Gaps Versus User Needs in the Wikidata Knowledge Graph").
Moreover, the second contribution is similarly limited in generalisability and novelty: similar social science studies have been performed on Wikidata and Wikipedia, whereas the paper claims its contribution is to see whether social science questions can be asked using Wikidata and Wikipedia in general. It should be made clearer what the added novelty of the research paper is. As the research done is interesting for social scientists or as a use case of the applicability of semantic web technologies, the paper could possibly be written up solely as a novel dataset or case study for using Wikidata and Wikipedia to get insight into the world’s billionaires.

To conclude, I believe the paper is interesting and well written, but limited in novelty and impact. I wonder whether it is not better suited as a data paper (although as far as I can see, no reference to a dataset is included in the paper), or as an application report.

Review #2
By David Schindler submitted on 02/Jun/2023
Suggestion:
Reject
Review Comment:

The paper investigates the application of the knowledge bases Wikidata and Wikipedia in the context of social science studies, demonstrating both in a case study that examines billionaires and their relations.

(1) originality

The article states two main contributions: 1) demonstrating the benefits of the knowledge bases Wikidata and Wikipedia for social sciences and 2) extending the knowledge on global elites.

Regarding 1): The article does a good job of introducing both resources and highlighting potential caveats when working with them, outlining potential biases. However, everything is based on commonly applied, established methods.

Regarding 2) In my opinion the analyses stay at a rather superficial level and do not offer any novelty beyond the specific selection of the sample set. The article presents results on how well billionaires are covered in the databases and potential biases in coverage dependent on gender, age, and birth place. Wikipedia page views are used to demonstrate the relation between 1. wealth of a billionaire, 2. length of their Wikidata entry, and 3. number of page views. The page views are then anecdotally discussed for 5 selected billionaires. Lastly, the article examines relationship bonds between billionaires based on Wikidata.

(2) significance of the results

Regarding 1): I am not sure demonstrating the benefits of Wikipedia and Wikidata is necessary as I would consider both well established resources that are already widely applied and main object of many investigations. However, some researchers getting started with working with these knowledge bases might profit from a condensed introduction.

Regarding 2): The analyses on coverage in Wikidata and Wikipedia show some biases. Here I am missing some context. As the authors outline, both databases suffer from inherent biases. Are the presented findings billionaire-specific biases or do they just confirm the overall bias of the knowledge base on a selected subset? I think further investigation or context from other investigations is needed at this point, e.g., are the same biases also present when looking at other groups of persons of popular interest, such as actors and actresses or politicians?

The analysis on page views shows correlations between wealth, article length, and article views, but offers no insights beyond this point. Revision history could, for instance, be included to investigate how popularity, length, and revision frequency interact.
The specific investigation of 5 selected billionaires anecdotally shows that public interest strongly varies both between and for a specific billionaire, but a general systematic assessment of public interest in billionaires is omitted.

The results on billionaire networks are interesting, but might suffer from sparse data on family relations represented in Wikidata (see comments below).

(3) quality of writing.

The writing is concise and the article is easy to follow.

(4) Data files and code:

I was not able to locate any data or source code nor any related statement. I understand from the journal policy that data should be made available by authors whenever possible.
I realize that some data might not be publishable by authors, for instance, due to copyright regarding Forbes list. Publication of Wikidata related data should, however, not be an issue. Furthermore, the manual mapping between both datasets could be published and would probably benefit other researchers.
The publication of intermediate results would allow a partial replication of the study.
Furthermore, I think other researchers would benefit from knowledge of the actual implementation of the outlined analyses, especially since this work has the claim to demonstrate the benefits of the knowledge bases, e.g., the implemented SPARQL queries or the code for gathering and analyzing page views.
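As one illustration of the kind of implementation detail requested here, page views could be gathered along the following lines. This is a minimal sketch, not the authors' code: it assumes the public Wikimedia Pageviews REST API (per-article endpoint), and the helper names `pageviews_url` and `total_views` as well as the sample payload are hypothetical.

```python
import json
from urllib.parse import quote

# Base of the Wikimedia Pageviews REST API (per-article endpoint).
API = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(article, start, end, project="en.wikipedia",
                  access="all-access", agent="user", granularity="monthly"):
    """Build the request URL for one article; dates are YYYYMMDD strings."""
    title = quote(article.replace(" ", "_"), safe="")
    return f"{API}/{project}/{access}/{agent}/{title}/{granularity}/{start}/{end}"


def total_views(payload):
    """Sum the 'views' field over all items of a decoded API response."""
    return sum(item["views"] for item in payload.get("items", []))


# Hypothetical response fragment with made-up numbers, used so that the
# sketch runs offline; a real run would fetch the URL and decode the JSON.
sample = json.loads(
    '{"items": [{"timestamp": "2022010100", "views": 120},'
    ' {"timestamp": "2022020100", "views": 80}]}'
)

url = pageviews_url("Elon Musk", "20220101", "20220228")
views = total_views(sample)
```

Publishing a script of this size alongside the paper would let readers see which project, access type, and agent filter were used, all of which affect the resulting counts.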

(5) Summary

Overall, the article is methodically sound and gives a good introduction and guideline to working with Wikipedia and Wikidata. My main critique is, however, that the contribution of the analyses is quite small and limited in novelty. In its current form I interpret the article as a well written tutorial on how to use Wikipedia and Wikidata with the implemented analyses mainly serving as a demonstration. Due to this reason, I am of the opinion the contribution of the article is not sufficient for a publication in Semantic Web.

(6) Comments:

I have some further comments regarding data collection/processing/analyses and some minor issues.

Forbes data: How and when was it obtained and what tools were used for this purpose? Why is it a suitable source for the world's billionaires? The main assumption of the article is that this list is the ground truth on the world's billionaires. How does the magazine generate this list, are there potential biases, and is there quality control for this data?
It might be a given that this data is valid, in this case authors could refer to some prior work that provides information about the validity of this data.

(p5 l11) Authors state that the software OpenRefine was used, without providing information on the software developer, version, or a location/identifier of the software, information that can be essential for reproducibility of results.

(p5 l13) Authors state that strings were matched between Forbes data and Wikidata, with results being manually checked. Does this also apply to results that were not successfully matched? Is there an evaluation of how well the tool worked? What about Wikipedia? Were the corresponding entries matched through Wikidata?

(p5 l21) What properties are particularly often covered? It might be beneficial for further analyses to outline how often specific properties are covered (e.g., regarding siblings or spouses).

(p5 l24) I do not understand to what the term "respective articles" refers here. Is this about the relation between Forbes and Wikidata entries? In general, it would be interesting to assess how often the resources are linked.

Table 1: the statistic information should be better sorted and horizontally split between information concerning Wikidata and Wikipedia. In general, further elaboration on the table content is needed.

Figure 1: I like the illustration but can hardly see some color differences on my screen. Maybe an interval based color scheme is necessary here.

(p6 l36) "is biased" to "can be biased"?

4.3.2 I think it is necessary to look at the interactions between all three variables at the same time in addition to the current analyses.

(p11 l6) I am confused by the statement: how is Elon Musk the most viewed and still has fewer views than Donald Trump?
(p11 l7) Does Michael Jordan have the fewest views of all billionaires? The paragraph is written around the most viewed, and it is confusing whether this refers to all or just some selection of billionaires.

(p11 l27) Some background on the networks is missing. Are there any restrictions on their generation, e.g., max hops in terms of relations or were the networks spanned until no further relations were found? Were nodes/edges omitted for readability or is the available data so sparse?

(p13 l44) "all billionaires" this relates to the previous comment: is the Forbes list complete?

Figure 4: The color grading is hard to read on my screen. The line for Bernard Arnault is hardly visible in A. Further, I think readers would have a better reference if the Elon Musk plot was the same height in A and B. Maybe this could be achieved with a logarithmic scale?

Implementation: I am missing some details on what tools authors used to implement their analyses. I appreciate that a broad description of available tools is outlined, but in terms of reproducibility it would be beneficial to know which tools were also used.

(p1 l27) ".. nationality biased .. " to ".. nationality biases .."?

Review #3
Anonymous submitted on 13/Jan/2024
Suggestion:
Major Revision
Review Comment:

This paper discusses how Wikidata and Wikipedia might be useful for the social sciences, and makes an exemplary point by studying the world's billionaires.

Connecting Wikidata and the Semantic Web to use cases in other disciplines is of strong interest to the special issue on Wikidata, and the journal as a whole, so the paper is definitely in scope.

The paper overall is reasonably well readable, owing to well written opening and closing sections. The technical sections, in contrast, are not as well written, with figures and plots in odd formatting, size and resolution, and a significant lack of detail and depth. I think the paper would require a very substantial revision and extension, before it could be accepted in the journal.

Detailed comments (X.Y refers to line Y on page X):

1.27 "biased" - typo
2.6 "N=..." it is unclear what is meant here: how many entities were attempted to be linked, or how many could be linked.
2.14 "wealth but" add comma
3.1 "elite researchers" - potential for misunderstanding. Perhaps better "researchers studying global elites"
Section 2.3 mentions "easily" too often for my taste. There are actually quite some issues around working with Wikidata, most notably the difficulty of working offline (the data is too big) and the difficulty of working online with the query service for complex queries (frequent timeouts). For example, larger family trees often cannot be computed due to timeouts.
Sec 2.3 only talks about R - I wonder whether that's a social science thing, but in my community, Python is by far more popular nowadays.

Sec 4: It puts forward an interesting question, and an easily relatable one. However, I see some conceptual issues, and feel the analysis is currently too shallow. The opening question "What do WD and WP know about the world's billionaires?" is a sensible top-level one, but there are two issues:
1. It is simply assumed that WD and WP are good reflections of what the public knows, but I think this assumption needs analysis. For example, on popular people like Trump, it is likely the case that there are many more sources, with much more detail than WP (e.g., biographies). Perhaps this is not surprising, so more interesting is the other direction: That on the other hand, it might be that the assumption is actually correct for many of the more reclusive billionaires. But we don't know, and I think the paper should look at this, at least exemplarily.
2. The examples of Trump and Musk are not well chosen, and bring up a more fundamental question. To me, Trump is not primarily known for being a billionaire but for being an eccentric and arguably crazy US president. Similarly, to a lesser degree, for Musk, who rings bells as an active businessman, Tesla owner and Starlink-know-it-all etc. Of course these characterizations are two-way causally linked to being billionaires, but they are certainly not the typical reclusive rank-500-something "old money" people who attempt to keep their life private. Yet these are exactly the ones where it is interesting to know whether WD and WP can be of use. Overall, in fact, the paper needs to position itself here: which people it is after, how public visibility is actually distributed (there are plots, but they are not much discussed), and more, and more representative, examples.

5.9 what's the unique identifier? where from?
5.14 "the mapping was manually checked" - how long did this take, how easy/difficult is that? how many could be mapped at all should be mentioned here already
Table 1 is very hard to read. Where does one read the number 66%? Standard deviations do not make sense for binary variables, nor do min and max. Horizontal partitioning (lines) would help, putting WD yes/no and WP yes/no first, then the respective fields that stem from each encyclopedia under these headers, so one knows which line refers to which encyclopedia.
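The point about standard deviations can be made concrete. For a 0/1 variable with mean p, the population standard deviation equals sqrt(p(1 - p)), so it carries no information beyond the share itself. A minimal sketch with made-up values (the data and the name `has_article` are hypothetical, not taken from the paper):

```python
import math
import statistics

# Hypothetical 0/1 indicator: 1 = billionaire has an article, 0 = none.
has_article = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

share = statistics.mean(has_article)            # proportion of ones
sd_direct = statistics.pstdev(has_article)      # population SD from the data
sd_from_share = math.sqrt(share * (1 - share))  # SD implied by the share alone
```

Both routes give the same value, which is why reporting the share (and N) is sufficient for such rows of the table.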

4.3.1 the idea of logistic regression is odd for binary variables, and overall insufficiently explained.

table 2 feels like a replication of figure 2, and both appear unnecessarily big and with not very nice layout. Resolution of numbers in table 2 seems also excessive. I also don't know what meaning of intercept and rank is in those tables, it feels a bit like a printout from an ML library, without proper selection or explanation of what is relevant.

Figure 3 deserves much more detail. Who are exemplary entities in each corner of the plot? Musk and Trump are presumably extreme to the top? Who are some "averages"? Who are "the reclusive ones" at the bottom?

Section 4.5 goes in a very interesting direction by looking at family connections, but then falls dramatically short in looking only at 1-hop neighbourhoods! Graphs are exactly about that, enabling longer paths, and I do not see any computational reason not to look at least a few hops further. But the paper does not even explain the reason for that choice. Perhaps the authors are not aware of the concept of SPARQL path queries?
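To illustrate what looking beyond one hop could involve, the sketch below builds a SPARQL property-path query bounded to a fixed number of family links and mirrors the same traversal on a toy in-memory graph. It is an illustration under stated assumptions rather than the authors' method: the choice of Wikidata properties (P22 father, P25 mother, P26 spouse, P40 child, P3373 sibling), the helper names, the placeholder item ID, and the toy edges are all made up for this example.

```python
from collections import deque

# Wikidata family properties assumed relevant for this illustration.
FAMILY = "wdt:P22|wdt:P25|wdt:P26|wdt:P40|wdt:P3373"


def family_path_query(qid, max_hops=3):
    """Return a SPARQL query for relatives within 1..max_hops family links.

    SPARQL 1.1 has no bounded-repetition operator, so the bound is written
    as one mandatory step followed by (max_hops - 1) optional steps.
    The wd:/wdt: prefixes are predefined on the Wikidata Query Service.
    """
    step = f"({FAMILY})"
    path = "/".join([step] + [step + "?"] * (max_hops - 1))
    return f"SELECT DISTINCT ?relative WHERE {{ wd:{qid} {path} ?relative . }}"


def within_hops(edges, start, max_hops):
    """Breadth-first search: nodes reachable in at most max_hops steps."""
    neighbours = {}
    for a, b in edges:  # treat family ties as undirected
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {start: 0}, deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return {n: d for n, d in seen.items() if n != start}


# Toy chain of hypothetical persons: A - B - C - D.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D")]
one_hop = within_hops(toy_edges, "A", 1)
three_hops = within_hops(toy_edges, "A", 3)
# "Q_EXAMPLE" is a placeholder, not a real Wikidata item ID.
query = family_path_query("Q_EXAMPLE", max_hops=2)
```

Bounding the path keeps the query cheaper than an unbounded `+` path, which may matter given the query-service timeouts mentioned earlier in this review.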

There are other obvious questions left out
- Does WD/WP enable tracking how wealth develops (in other words, do they capture wealth, and in enough resolution)?
- Are there interesting queries enabled by having so many billionaires in a structured format? How many migrated to another country? How many founded a company (as opposed to inherited one), how many have a university degree, how many have spouses half their age, or any other serious or trivia questions? This is also the big selling point of structured data: aggregate queries.
- How well are they covered? The analysis stops short at reporting only the average number of properties, but which ones are most frequently present? Do we, for many of them, have birth place, or social media accounts, or university degree?
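The coverage question in the last bullet could be answered with a simple tally once the statements per billionaire are available. A minimal sketch, assuming each item has already been reduced to the set of property IDs it carries; the item identifiers and the data are hypothetical, while P19 (place of birth), P69 (educated at) and P2002 (Twitter username) are existing Wikidata properties:

```python
from collections import Counter


def property_coverage(items):
    """Return {property: share of items carrying it}, most frequent first."""
    counts = Counter(p for props in items.values() for p in set(props))
    total = len(items)
    return {p: n / total for p, n in counts.most_common()}


# Hypothetical items (placeholder IDs) and the properties found on each.
toy_items = {
    "item_1": {"P19", "P69", "P2002"},
    "item_2": {"P19", "P69"},
    "item_3": {"P19"},
    "item_4": {"P69"},
}

coverage = property_coverage(toy_items)
```

A table of such shares, split by Wikidata and Wikipedia, would show directly which attributes are usable for aggregate analyses and which are too sparse.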

There was one apparent contradiction - younger billionaires are on average less likely to have encyclopedic articles, but on average, have longer ones. I'd appreciate insight into this.

For Section 4.4 (page views) there are obvious alternatives, e.g., Google Trends. Perhaps worth mentioning.