Review Comment:
The paper investigates the application of the knowledge bases Wikidata and Wikipedia in social science research, demonstrating both in a case study that examines billionaires and their relations.
(1) originality
The article states two main contributions: 1) demonstrating the benefits of the knowledge bases Wikidata and Wikipedia for social sciences and 2) extending the knowledge on global elites.
Regarding 1): The article does a good job of introducing both resources and highlighting potential caveats when working with them, outlining potential biases. However, everything is based on commonly applied, established methods.
Regarding 2): In my opinion, the analyses stay at a rather superficial level and do not offer any novelty beyond the specific selection of the sample set. The article presents results on how well billionaires are covered in the databases and on potential biases in coverage dependent on gender, age, and birthplace. Wikipedia page views are used to demonstrate the relation between 1. the wealth of a billionaire, 2. the length of their Wikidata entry, and 3. the number of page views. The page views are then anecdotally discussed for 5 selected billionaires. Lastly, the article examines relationship bonds between billionaires based on Wikidata.
(2) significance of the results
Regarding 1): I am not sure that demonstrating the benefits of Wikipedia and Wikidata is necessary, as I would consider both to be well-established resources that are already widely applied and are the main object of many investigations. However, some researchers getting started with these knowledge bases might profit from a condensed introduction.
Regarding 2): The analyses on coverage in Wikidata and Wikipedia show some biases. Here I am missing some context. As the authors outline, both databases suffer from inherent biases. Are the presented findings billionaire-specific biases, or do they just confirm the overall bias of the knowledge base on a selected subset? I think further investigation or context from other investigations is needed at this point, e.g., are the same biases also present when looking at other groups of people of popular interest, such as actors and actresses or politicians?
The analysis on page views shows correlations between wealth, article length, and article views, but offers no insights beyond this point. Revision history could, for instance, be included to investigate how popularity, length, and revision frequency interact.
The specific investigation of 5 selected billionaires anecdotally shows that public interest varies strongly both between billionaires and for a single billionaire, but a general systematic assessment of public interest in billionaires is omitted.
The results on billionaire networks are interesting, but might suffer from sparse data on family relations represented in Wikidata (see comments below).
(3) quality of writing
The writing is concise and the article is easy to follow.
(4) Data files and code:
I was not able to locate any data or source code, nor any related statement. I understand from the journal policy that data should be made available by authors whenever possible.
I realize that some data might not be publishable by the authors, for instance, due to copyright regarding the Forbes list. Publication of Wikidata-related data should, however, not be an issue. Furthermore, the manual mapping between both datasets could be published and would probably benefit other researchers.
The publication of intermediate results would allow a partial replication of the study.
Furthermore, I think other researchers would benefit from knowing the actual implementation of the outlined analyses, e.g., the implemented SPARQL queries or the code for gathering and analyzing page views, especially since this work claims to demonstrate the benefits of the knowledge bases.
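To illustrate the kind of artifact meant here, the following is a minimal sketch, not the authors' implementation. It builds a SPARQL query for the family-relation statements of one Wikidata item and a request URL for the Wikimedia Pageviews REST API. The function names, the chosen set of properties (P22 father, P25 mother, P26 spouse, P40 child, P3373 sibling), and the example item and article are my own assumptions.

```python
import re
from urllib.parse import quote

# Assumed set of family-relation properties; the paper may have used others.
FAMILY_PROPERTIES = ("P22", "P25", "P26", "P40", "P3373")

PAGEVIEWS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def family_relations_query(qid: str) -> str:
    """Return a SPARQL query listing family-relation statements of one item."""
    if not re.fullmatch(r"Q[1-9]\d*", qid):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    props = " ".join(f"wdt:{p}" for p in FAMILY_PROPERTIES)
    return (
        "SELECT ?relation ?relative ?relativeLabel WHERE {\n"
        f"  VALUES ?relation {{ {props} }}\n"
        f"  wd:{qid} ?relation ?relative .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


def pageviews_url(article: str, start: str, end: str,
                  project: str = "en.wikipedia") -> str:
    """Return the URL for daily per-article page views; dates are YYYYMMDD."""
    for day in (start, end):
        if not re.fullmatch(r"\d{8}", day):
            raise ValueError(f"expected YYYYMMDD, got {day!r}")
    # Wikipedia titles use underscores; percent-encode everything else.
    title = quote(article.replace(" ", "_"), safe="")
    return f"{PAGEVIEWS_BASE}/{project}/all-access/user/{title}/daily/{start}/{end}"


if __name__ == "__main__":
    # Q42 is an arbitrary example item, not one taken from the paper.
    print(family_relations_query("Q42"))
    print(pageviews_url("Elon Musk", "20230101", "20230131"))
```

The query string can be sent to the public Wikidata query service and the URL fetched with any HTTP client. Publishing snippets of this kind alongside the paper would make the data collection reproducible.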
(5) Summary
Overall, the article is methodologically sound and gives a good introduction and guideline to working with Wikipedia and Wikidata. My main critique is, however, that the contribution of the analyses is quite small and limited in novelty. In its current form, I interpret the article as a well-written tutorial on how to use Wikipedia and Wikidata, with the implemented analyses mainly serving as a demonstration. For this reason, I am of the opinion that the contribution of the article is not sufficient for publication in Semantic Web.
(6) Comments:
I have some further comments regarding data collection/processing/analyses and some minor issues.
Forbes data: How and when was it obtained, and what tools were used for this purpose? Why is it a suitable source for the world's billionaires? The main assumption of the article is that this list is the ground truth on the world's billionaires. How does the magazine generate this list? Are there potential biases, and is there quality control for this data?
It might be a given that this data is valid, in which case the authors could refer to prior work that provides information about its validity.
(p5 l11) Authors state that the software OpenRefine was used, without providing information on the software developer, version, or a location/identifier of the software, information that can be essential for reproducibility of results.
(p5 l13) Authors state that strings were matched between Forbes data and Wikidata, with results being manually checked. Does this also apply to results that were not successfully matched? Is there an evaluation of how well the tool worked? What about Wikipedia? Were the corresponding entries matched through Wikidata?
(p5 l21) What properties are particularly often covered? It might be beneficial for further analyses to outline how often specific properties are covered (e.g., regarding siblings or spouses).
(p5 l24) I do not understand to what the term "respective articles" refers here. Is this about the relation between Forbes and Wikidata entries? In general, it would be interesting to assess how often the resources are linked.
Table 1: the statistical information should be better sorted and horizontally split between information concerning Wikidata and Wikipedia. In general, further elaboration on the table content is needed.
Figure 1: I like the illustration but can hardly see some color differences on my screen. Maybe an interval-based color scheme is necessary here.
(p6 l36) "is biased" to "can be biased"?
Section 4.3.2: I think it is necessary to look at the interactions between all three variables at the same time in addition to the current analyses.
(p11 l6) I am confused by the statement: how can Elon Musk be the most viewed and still have fewer views than Donald Trump?
(p11 l7) Does Michael Jordan have the fewest views of all billionaires? The paragraph is written around the most viewed, and it is confusing whether this refers to all billionaires or just a selection of them.
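One simple way to consider the three variables jointly is a partial correlation, i.e., the correlation between two variables after the linear effect of the third has been removed. The following is a minimal sketch of that idea under my own assumptions, not a prescription: the function names are hypothetical, and skewed quantities such as wealth and page views would probably need a log transform first.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

For example, calling partial_correlation with log wealth, log page views, and log article length (in that order) would indicate whether wealth is still associated with page views once article length is accounted for.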
(p11 l27) Some background on the networks is missing. Are there any restrictions on their generation, e.g., max hops in terms of relations or were the networks spanned until no further relations were found? Were nodes/edges omitted for readability or is the available data so sparse?
(p13 l44) "all billionaires": this relates to the comment on the Forbes data above. Is the Forbes list complete?
Figure 4: The color grading is hard to read on my screen. The line for Bernard Arnault is hardly visible in A. Further, I think readers would have a better reference if the Elon Musk plot was the same height in A and B. Maybe this could be achieved with a logarithmic scale?
Implementation: I am missing some details on which tools the authors used to implement their analyses. I appreciate that a broad description of available tools is outlined, but in terms of reproducibility it would be beneficial to know which of these tools were actually used.
(p1 l27) ".. nationality biased .. " to ".. nationality biases .."?