Mapping Articles on China in Wikipedia: An Inter-Language Semantic Network Analysis

Tracking #: 1332-2544

Authors: 
Ke Jiang
Grace A. Benefield
Junfei Yang
George A. Barnett

Responsible editor: 
Guest Editors Social Semantics 2016

Submission type: 
Full Paper
Abstract: 
This article describes an inter-language semantic network analysis examining the differences between the articles about China in the Chinese and English versions of Wikipedia. The results not only confirm previous findings of inter-language Wikipedia studies but also extend that research by providing an example of exactly how Chinese- and English-speaking groups frame the same topic differently in Wikipedia. Specifically, while both versions shared a common focus on government, population, language, character, diplomatic relations, and the development of the economy, science, and technology, the Chinese-speaking and English-speaking contributors framed the article on China differently according to dissimilarities in national cultures, values, interests, situations, and emotions.

Keywords: 
Wikipedia, Article of China, Inter-Language, Semantic Network Analysis
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Claudia Wagner submitted on 06/Mar/2016
Suggestion:
Minor Revision
Review Comment:

The authors use semantic network analysis to explore the differences between the English and the Chinese Wikipedia when they write about China. The approach the authors present is very interesting. However, I would recommend that the authors explain it in more detail, e.g., how is the network correlation measured? The reader may not know what QAP or UCINET means. Some explanations would certainly help.
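For context, what I have in mind when I mention QAP is a permutation test on the correlation between corresponding cells of two adjacency matrices. The following is only my own minimal sketch of that idea in Python, not the authors' UCINET procedure, and it assumes the two matrices already share the same node labels:

    import numpy as np

    def qap_correlation(A, B, n_perm=5000, seed=0):
        """Correlate corresponding off-diagonal cells of two same-sized
        adjacency matrices, then repeatedly permute the node labels of B
        to obtain a null distribution for that correlation (the QAP idea)."""
        rng = np.random.default_rng(seed)
        off_diag = ~np.eye(len(A), dtype=bool)
        observed = np.corrcoef(A[off_diag], B[off_diag])[0, 1]
        null = np.empty(n_perm)
        for i in range(n_perm):
            p = rng.permutation(len(B))
            B_perm = B[np.ix_(p, p)]            # relabel B's nodes
            null[i] = np.corrcoef(A[off_diag], B_perm[off_diag])[0, 1]
        p_value = np.mean(np.abs(null) >= abs(observed))
        return observed, p_value

Even a two-sentence explanation along these lines in the paper would make the QAP/UCINET results accessible to readers outside the social network analysis community.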

The paper is very well written and I enjoyed reading it.

Some issues:

What is shown in Fig. 1? What are the nodes here? The authors first talk about a semantic network, so I assumed the nodes are words. Now the authors talk about users as nodes? What are the links between users? I guess it is not necessary to introduce a network notation here. The y axis seems to show the number of edits and the x axis the ranked articles; I suggest the authors use these labels.

The authors need to improve their citations.
*) For example, they cite Laufer 2014 incorrectly. Besides the structural analysis that the authors mention, Laufer et al. have also analyzed the concept overlap and view statistics of articles in different language editions.
http://arxiv.org/abs/1411.4484 (the paper was published at Webscience 2015 in Oxford)
*) The authors do not cite Omnipedia. They briefly mention one of Brent Hecht's papers, but not his work on Omnipedia. I suggest the authors have a closer look at Brent Hecht's work, since he has worked on this topic for several years.
http://www.brenthecht.com/papers/bhecht_CHI2012_omnipedia.pdf

Review #2
By Laura Hollink submitted on 31/Mar/2016
Suggestion:
Reject
Review Comment:

This paper presents a comparison between two Wikipedia articles: the articles on China in the English and the Chinese Wikipedia editions. Using semantic network analysis, the authors explore differences in word usage between the two pages. They discuss the results in the light of the cultural differences between the two communities.

The paper does not fit the scope of this journal. The most important argument for this is that the paper does not provide any new insights related to the field of the Semantic Web (or, more broadly, the Web or Computer Science/AI). What can be learned from this paper is mostly related to the cultural differences between the English- and Chinese-speaking parts of the world. The method that is employed in this paper, semantic network analysis, is only loosely related to the topic of the journal. This type of semantic network analysis is based on word co-occurrence matrices. Issues such as knowledge representation, data integration, modelling, reuse of (web) data, etc. do not play a role.

In addition, I feel that this paper is too lightweight for publication in this journal. This is mainly due to the fact that only two Wikipedia articles were examined, and that it is a purely observational study (in contrast to, for example, an experiment). In the remainder of this review I will give a more detailed motivation.

The motivation given for this study is that in previous work "insufficient attention has been paid to the detailed content of Wikipedia articles". It is not clear to me why this is a problem in itself. I would expect, for example, to hear about an unanswered question that could not be answered with the older, less detailed analyses.

The structure of the paper can be improved. The introduction is very short, and does not contain any information on the research methods (other than the general term Semantic Network Analysis). Also, the choice of the page on China as a use case is not explained. For me, it was not clear from the introduction that only two Wikipedia pages were examined. I was under the impression that all pages related to China were included (and I did wonder how this selection was made). Section 2 is a related work section (and should be named as such). However, at the end of the section new information is presented about the goal of the paper ("to map how different language speakers illustrate the meaning of a particular concept in various ways in Wikipedia"). I would move this to the introduction. Also, could you please explain what you mean by "illustrate the meaning of a concept"? Section 3 describes the method of semantic network analysis. However, again, there is also additional information about the focus of the paper ("This paper focuses on analyzing the salience of the concepts in texts."). I would move this to the introduction as well.

The explanation of the research method leaves some open questions. For one, the words 'concept' and 'word' seem to be used interchangeably at some points, e.g. on p 10 in the sentence "words that occurred within seven concepts of each other": no definition of 'concept' was given so far, so I have assumed this means something like 'seven words that are not stop words'. Also, I find the rationale for using a seven-word threshold a bit unusual. Is there a reason for not using sentence breaks? Can you give some indication of how these two options compare in terms of performance and in terms of results? Later on page 10, the paper mentions 'cells of the two networks'. I have an idea of what is meant (a cell in a matrix representation of a network), but please be explicit about this. Also, it is not clear to me how you can correlate the corresponding cells of the two networks, given that the two networks do not necessarily have the same nodes.
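To make that last question concrete, here is how I currently read the construction: a co-occurrence network per language version built with a seven-word window, and a correlation computed only over cells whose word pairs come from the shared vocabulary (e.g. after translating one side). The sketch below is my own interpretation, not the authors' code, so please state explicitly in the paper whether this is what was done:

    from collections import defaultdict
    import numpy as np

    def cooccurrence_counts(tokens, window=7):
        """Symmetric co-occurrence counts for words appearing within
        `window` positions of each other (stop words assumed removed)."""
        counts = defaultdict(int)
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if v != w:
                    counts[tuple(sorted((w, v)))] += 1
        return counts

    def correlate_shared_cells(counts_a, counts_b):
        """Pearson correlation over corresponding cells of the two networks,
        restricted to word pairs drawn from the shared vocabulary."""
        vocab_a = {w for pair in counts_a for w in pair}
        vocab_b = {w for pair in counts_b for w in pair}
        shared = sorted(vocab_a & vocab_b)
        pairs = [(u, v) for i, u in enumerate(shared) for v in shared[i + 1:]]
        a = np.array([counts_a.get(p, 0) for p in pairs], dtype=float)
        b = np.array([counts_b.get(p, 0) for p in pairs], dtype=float)
        return np.corrcoef(a, b)[0, 1]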

The discussion of the results is interesting and makes sense. I like the fact that the authors refer back to individual sentences from the Wikipedia pages to illustrate/prove their points. However, as mentioned above, the discussion does not contain any new insights related to the Semantic Web. The research question is not discussed at all.

Specific comments:

p5 the top 30 people in each language version -> ranked based on what? Nr. of edits? Length of the page? Nr. of views? Centrality of the node?
p4 one-tenth of one percent is comprised of common concepts -> Can you explain what "common concepts" are in this case? And why not just write 0.1%?
p5 Although this research was significant in using advancing algorithm models -> I don't agree that the use of advanced algorithms is a reason to call a study significant.
p6 from a particular semantic context -> Could you please explain what you mean by "from a particular semantic context?"
p7 clusters that composing the semantic networks -> please rephrase
p9 The axis labels on Fig 1 can be improved (e.g. mentioning the contributors and number of edits). Also, I don't see the advantage of normalization here. Why not just a log scale?
p12 In the caption of table 4, should this be "ordered by Greatest Normalized Eigenvector Centralities", instead of "with Great...."?

Review #3
Anonymous submitted on 08/Apr/2016
Suggestion:
Accept
Review Comment:

This paper presents an inter-language semantic network analysis studying the differences between the Chinese and English authors of the same Wikipedia article. Experimental results show that Chinese and English natives approached the article in different ways according to dissimilarities in national cultures, values, interests, situations, and emotions. The paper is well written and structured. The motivations are solid and well referenced. The findings are also interesting. In my opinion, this work may be interesting for the community. I have only two minor comments:

The first one is about the use of distributed representations. By means of the recent and popular word vectors generated using the word2vec toolkit [1] (https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html), you could compare the vectors of corresponding words in the English and Chinese Wikipedia and their relationships with their closest words. That type of analysis should also give an idea of the cultural and emotional differences between the authors.
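As a rough illustration of what I mean, using gensim's word2vec implementation rather than the TensorFlow tutorial code (the toy sentences below merely stand in for the tokenized English and Chinese article texts, and the parameter values are arbitrary):

    from gensim.models import Word2Vec

    # Toy stand-ins for the tokenized English and Chinese article texts.
    english_sentences = [["china", "government", "economy", "development"],
                         ["government", "population", "language"]]
    chinese_sentences = [["中国", "政府", "经济", "发展"],
                         ["政府", "人口", "语言"]]

    en_model = Word2Vec(english_sentences, vector_size=50, window=5, min_count=1)
    zh_model = Word2Vec(chinese_sentences, vector_size=50, window=5, min_count=1)

    # Compare the neighbourhoods of translation-equivalent terms,
    # e.g. "government" in the English model and "政府" in the Chinese one.
    print(en_model.wv.most_similar("government", topn=5))
    print(zh_model.wv.most_similar("政府", topn=5))

Contrasting the nearest neighbours of such translation pairs would give a complementary, vector-based view of the differences discussed in the paper.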

The second comment is about automatically generated multilingual semantic networks. Taking into account your findings, do you think that semantic networks such as BabelNet [2] (http://babelnet.org/) correctly map the Wikipedia pages in other languages (e.g. Chinese) to English? They map each Chinese article to the same concept that was selected for its English translated article. If there is no translation, and there is no almost equivalent concept, a new concept is generated. If you think that these differences between Chinese and English authors may change the meaning of the Wikipedia page, it would be interesting to add a short discussion about that in your work.
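One quick way to check this would be to query BabelNet for the synset ids of the English and Chinese article titles and see whether they intersect. The sketch below assumes the public BabelNet HTTP API with a getSynsetIds endpoint of roughly this form; the exact endpoint version and parameters should be verified against the current documentation, and the API key is a placeholder:

    import requests

    API_KEY = "YOUR_API_KEY"  # placeholder; a key must be obtained from babelnet.org

    def synset_ids(lemma, lang):
        """Return the set of BabelNet synset ids found for a lemma."""
        resp = requests.get(
            "https://babelnet.io/v5/getSynsetIds",
            params={"lemma": lemma, "searchLang": lang, "key": API_KEY},
        )
        resp.raise_for_status()
        return {entry["id"] for entry in resp.json()}

    # If the cross-lingual mapping is consistent, the two titles should share
    # at least one synset id.
    print(synset_ids("China", "EN") & synset_ids("中国", "ZH"))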

Suggested references:

[1] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).

[2] Navigli, R., & Ponzetto, S. P. (2012). BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193, 217-250.