Using Natural Language Generation to Bootstrap Empty Wikipedia Articles: A Human-centric Perspective

Tracking #: 2402-3616

Lucie-Aimée Kaffee
Pavlos Vougiouklis
Elena Simperl

Responsible editor: 
Philipp Cimiano

Submission type: 
Full Paper

Abstract:
Nowadays natural language generation (NLG) is used in everything from news reporting and chatbots to social media management. Recent advances in machine learning have made it possible to train NLG systems to achieve human-level performance in text writing and summarisation. In this paper, we propose such a system in the context of Wikipedia and evaluate it with Wikipedia readers and editors. Our solution builds upon the ArticlePlaceholder, a tool used in 14 under-served Wikipedias, which displays structured data from the Wikidata knowledge base on empty Wikipedia pages. We train a neural network to generate text from the Wikidata triples shown by the ArticlePlaceholder, and explore how Wikipedia users engage with it. The evaluation, which includes an automatic, a judgement-based, and a task-based component, shows that the text snippets score well in terms of perceived fluency and appropriateness for Wikipedia, and can help editors bootstrap new articles. It also hints at several potential implications of using NLG solutions in Wikipedia at large, including content quality, trust in technology, and algorithmic transparency.
Major Revision

Solicited Reviews:
Review #1
By Denny Vrandecic submitted on 03/May/2020
Major Revision
Review Comment:

I enjoyed the paper a lot, and I think it is an important contribution. I really liked the human-centered approach, instead of just relying on ROUGE and BLEU. I recommend accepting it, but there are a number of changes I think are necessary; once these are addressed, I would be happy to recommend acceptance of the paper.

As is the nature of a review, I list in particular the things I have issues with, and not the ones that I liked. But rest assured that there is a lot in this paper I liked.

== Main points ==

A) Under-served languages

The whole narrative of the paper is built around supporting under-resourced or under-served languages, which you define in Section 4.2.3 by number of articles. The problem, though, is that the work is done exclusively on Wikipedias that are within the top 12% of all Wikipedias by size: Swedish is #3 by number of articles, ahead of German; Arabic (ranked #15) and Ukrainian (ranked #17) have 1M+ articles; Persian 700k+ (ranked #18); Indonesian 500k+ (ranked #22); and even Hebrew (ranked #36), the smallest one in this list, has more than a quarter million articles.

In fact, Breton, which would have by far the best claim to be an under-served Wikipedia, was removed from your scope because of a lack of training data (which I don't understand - given that for the experiments here you had a Wizard of Oz type solution of creating the one sentence by hand, why would you not be able to create that one sentence about Marrakesh in Breton?). But even Breton has more than 68k articles, and with that is #81 out of ~300 language Wikipedias.

As can be seen in Table 5, all of the Wikipedias listed here have more than a thousand active editors - those are all large communities!

It remains true that all of these Wikipedias have large gaps - but calling them the under-served or under-resourced ones is problematic, given that the vast majority of Wikipedias are much smaller and have a much stronger claim to that description. I would drop all mentions of under-served languages from the paper (particularly in light of dropping Breton), as you never touch any of the bottom half of Wikipedias by size. This doesn't weaken the paper, as its principal findings still hold, and it avoids confusion about what the paper achieved.

B) Neutrality of structured data

page 4, line 24-28 (left): "Instead of using content from one Wikipedia version to bootstrap another, we take structured data labelled in the relevant language and create a more accessible representation of it as text, keeping those cultural expressions unimpaired." I don't buy this argument. This would mean that Wikidata is culturally neutral, which I don't believe. There are plenty of places where bias can slip into Wikidata: which properties are deemed relevant at all, which statements are given for a given item, from which sources are these statements taken, which values to prefer for a particular statement, in which order are the values and statements given, etc. The claim that Wikidata is culturally unbiased is nice, but I don't think it holds to scrutiny. Or put differently: citation needed for that claim.

C) Unclear approach in 3.2.2

Section 3.2.2: "Afterwards, we retrieved the corresponding Wikidata item to the article and queried all triples where the item appeared as a subject or an object in the Wikidata truthy dump. In doing so, we relied on keyword matching against labels from Wikidata from the corresponding language, due to the lack of reliable entity linking tools for under-served languages."

I do not understand why you would do that. So first you have an article, and then look up the QID. But now that you already have the QID, why would you use the label to look up for triples that have the QID as a subject or object? Something seems missing here.

"In case a rare entity in the text is not matched to any of the input triples, its realisation is replaced by the special token.": When are you matching entities in the text to input triples? How do you even know if something is an entity, and not a property? e.g. in "Floridia estas komunumo de" how do you know whether you need to look up "estas" as an item or not and try to match it? This Section is confusing me, sorry.

D) Make all data available

Section 4.2.5: can you make the raw annotation data available, including the sentences and their ratings? How high was interannotator agreement?

How could you create the sentences with the special token if sometimes the sentence's agreement and grammaticality would depend on it?

Table 9: could you please show all the results, instead of just 2? It's only ten anyway, and I find them really helpful. Ideally with translations to English.

E) Automatic text creation problematic for trust

Section 6, Discussion, argues: "However, none of these approaches take the automatic creation of text by non-human algorithms into account. Therefore, a high quality, not distinguishable from a human-generated Wikipedia summary can be a double-edged sword."

The "Therefore" is a bit strong here: inferring from existing research on trust in Wikipedia (research that disregarded the automatic creation of text) that automatically generated text would directly lead to a loss of trust is a strong statement that would need more support.

F) Hallucinations

Section 7, Limitations, argues: "Beside the Arabic sentence, the editors worked on are synthesized, i.e. not generated but created by the authors of this study. While those sentences were not created by natural language generation, they were created and discussed with other researchers in the field. We therefore focused on the most common problem in text generative tasks similar to ours: the tokens. Other problems of neural language generation, such as factually wrong sentences or hallucinations [76], were excluded from the study as they are not a common problem for short summaries as ours."

You state repeatedly that hallucinations are not problematic because your sentences are so short, but at the same time, the one single sentence you created with your method and selected for the second part of your study actually contained a factually incorrect statement. I am thankful that you mention it - and the fact that an actual inhabitant of Marrakesh didn't notice it was a very interesting anecdote, thank you for sharing that - but it makes the whole dismissal of hallucinations even more problematic.

In fact, I would claim that hallucinations and factual errors are a more pressing problem than ungrammatical text.

== Copyediting and minor issues ==

1. Title: "empty Wikipedia articles" sounds weird. What about "missing Wikipedia articles"?

2. Abstract, "under-served Wikipedias" - it is not the Wikipedias that are under-served, but the language communities. The text states "Under-resourced language versions" which sounds much better.

3. "fewer editors means fewer articles" - given Cebuano and Swedish, is that actually true?

4. Figure 1, on the left, the multilingual view of Wikidata is not particularly evident. I wonder if that could be made more visible (either by showing English labels instead of, or in addition to, the French labels). I could imagine having three images: Wikidata English, Wikidata French, Wikipedia Infobox French.

5. Figure 1, subtitle: "Representation of Wikidata triples" -> "Representation of Wikidata statements".

6. ibid. "in the article about fromage infobox from the French Wikipedia" -> "in articles using the fromage infobox from the French Wikipedia"

7. page 2, line 39-40 (left): "as it is the case for a large share of the Wikipedia community" -> "as it is the case for a large share of Wikipedia readers" - the main audience of this would be the readers of Wikipedia, not the editors, right? And this makes the statement even stronger.

8. page 3, line 1 (left): line overrun.

9. page 3, line 17-22 (right): skipped Section 7 (limitations, mentioned but number is skipped)

10. page 4, line 21-23 (left): something off with the citations "that argue that the language of global projects such as Wikipedia Hecht [21] should express cultural reality Kramsch and Widdowson [23]."

11. Section 2.1 second half, but throughout the paper: I have trouble with the simplification that Wikidata statements are just triples. In Figure 1 we can see that this is not really the case. I would prefer to use here either "statement" or "reified triples" (sigh).

12. page 4, line 30-31 (right): "for under-resources Wikipedias" -> "under-resourced"

13. page 4, line 37-39 (right): " templates always assume that entities will always have the relevant triples to fill in the slots" -> remove one "always"

14. page 5, line 4-6 (right): "costing 75 thousand pounds, assumed to be the most costly NLG evaluation at this point" - which, compared to how much it costs to train current language models, should not be a deterrent anymore. I do understand it nevertheless is.

15. page 5, line 24 (right): "is not easy to follow" -> "to replicate"

16. page 5, line 34-35 (right): "we run an mixed-methods study" -> "a"

17. page 6, line 29-30 (left): "with an median of 69, 623.5 articles, between 253, 539 (Esperanto) and 7464 (Northern Sami)" -> "a median", remove the spaces after the commas in the numbers, add a comma in the number for northern Sami for consistency.

18. page 6, line 35-36 (left): "append a short" - why append the text? Shouldn't the text come before the boxes? Or replace them? Do the boxes remain if the text already covers a statement?

19. Figure 4 and example in Section 3.2.1: what language is this?

20. Section 3.2.1: if the input is the concatenation of the vectors, does the order of the concatenated vectors make a difference for the generated text?

21. Page 6, line 2-4 (right): "A vector representation hf1 , hf2 to hfn for each of the input triples is computed by processing their subject, predicate and object" -> what kind of processing?

22. Section 3.2.1, surface form tuples: does this also cover different forms with regards to agreement?

23. page 7: there is one line of text on this page. Get rid of it.

24. The description of the property placeholder was a bit confusing. Looking at the example in Table 1, I think I understand what is going on, but rereading the text in Section 3.2.1 three times still left me confused. I would kindly ask to edit this section for clarity.

25. page 8, line 24-25 (left): "(as describe in the previous section" -> described

26. Section 4.1.1: how many of the genuine Wikipedia summaries could have been recreated with the vocabulary generated like this? Should be part of the baseline numbers.

27. Table 2. The numbers are from 2017 - that's three years ago. Could we get updated numbers?

28. Section 4.2.3: I was trying to do something with the ERGO number, but it seems these are only accessible to people with a Southampton account? (same for 4.3.3)

29. Page 10, line 49, right: "and 15 from Wikipedia sentences used the train the neural network" -> "used to train"

30. Table 4: could you also add the median number of sentences annotated by each annotator?

31. The formalisation in Section 4.2.5 seems unnecessary. I would remove formula 1 and 2, and the set-theory based definitions. You are taking the averages of all annotations for every sentence, and then the average of all sentences per system - you say that quite clearly in your sentence "For each sentence, we calculated the mean fluency given by all participants and then averaging over all summaries of each category." The rest of the paragraph, all the way to the formula, seems redundant. The same for appropriateness.
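To underline that the quoted sentence already says everything the formulas do: the computation is just a two-level mean, which a few lines (my own illustrative sketch, not the authors' code) capture completely:

```python
from statistics import mean

def category_score(annotations):
    """Two-level average: first the mean rating per sentence,
    then the mean of those per-sentence means for one category.

    annotations maps a sentence id to the list of ratings
    (fluency or appropriateness) it received from all participants.
    """
    per_sentence = [mean(ratings) for ratings in annotations.values()]
    return mean(per_sentence)

# e.g. two sentences of one system, rated by two participants each
category_score({"s1": [4, 6], "s2": [5, 5]})  # two-level mean: 5
```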

32. Could example screenshots for Figure 6 and 7 also be provided in English, to give a better intuition of what is going on? In particular, the edit window in Figure 7 does not seem to correspond to Figure 6, which seems to only have a single sentence, whereas Figure 7 contains more content.

33. Page 13, lines 43-46 (left): "Further, n = 4 editors worked in at least two other languages beside their main language, while n = 2 editors were active
in as many as 4 other languages." -> remove the "n =". The same for the rest of the paragraph.

34. Page 14, line 25-29 (left): "Marrakesh is a good starting point, as it is a topic possibly highly relevant to readers and is widely covered, but falls into the category of topics that are potentially under-represented in Wikipedia due to its geographic location" Alas for the fact that there are Marrakesh articles in 93 language editions, including each of the six in this study.

35. page 14, line 37 (right): "As the interviews were remotely" -> "remote"

36. page 14, line 50 (right): "about theur experience." -> "their"

37. page 15, line 3-5 (left): "Finally, we left them time discuss open questions of the participants." -> "time to discuss"

38. page 15, line 47/48 (left): "how much of real structure of the generated summary is being copied." -> "how much of the real structure" Also, what's a real structure?

39. page 15, top paragraph (right): it looks like you're putting gstscore in math mode (i.e. $gstscore$) which makes the kerning look weird. Try putting the word into a \text{}, or \mbox{} or \textrm{} command inside the math mode.
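For instance (assuming the amsmath package is loaded for \text):

```latex
% italic math mode kerns the word as a product of variables g·s·t·s·c·o·r·e:
$gstscore$
% typeset it as an upright word instead:
$\text{gstscore}$  % alternatives: \mathrm{gstscore}, \mbox{gstscore}
```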

40. Since it is only one sentence per language, and only 10 participants, can you also show all the ten sentences they produced?

41. Page 15, line 25 (right): " in this space and a data" -> remove "a"

42. Page 15, line 29 (right): "class of techniques in NGL" -> NLG

43. Page 16, line 30 (right): "to a few number of domains" -> "small number", or drop "few" or "to a few domains"

44. Also re Domains: could you list the / some domains and their values?

45. Section 5.2.1: Isn't it curious that the Esperanto news articles seem to score considerably lower on Fluency than the generated text? Isn't it surprising that the Wikipedia texts and the news texts all scored below an average of 5.0 (besides Arabic news)? Is there some other effect going on? (Such as Arabic having many different dialects, and Esperanto not being a first language for anyone?)

46. Section 5.3, subsection "Importance": in Wikidata, the label and description should already offer the disambiguation you are looking for. Are these actually displayed in the ArticlePlaceholder? (based on Figure 6, it looks like the label is, but the description is actually missing). It probably would be good to add that to the ArticlePlaceholder no matter what.

47. Page 18, line 34 (right): " and none of the them mentioned them during the interviews" -> "of the editors mentioned" or "of them mentioned"

48. Section 5.3: I am not sure if I understand - was the special token literally displayed in the realization? I get the argument for Arabic, where it is only used for the Berber name, but in the other languages, such as Indonesian, there are two literal tokens, and the interviewed editor didn't comment on the word with the angle brackets?

49. page 19, line 24/25 (left): ", in which the participants explained how and they changed the text." -> missing "why"?

50. Table 9: Why do the tiles A and C count toward the score? I thought the minimum match length factor was set to 3, but A and C both have a length of 2? (Also, without C, #2 would drop below the PD threshold)
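To make point 50 concrete: in a minimal greedy string tiling sketch (my own illustration of the algorithm as commonly described, not the authors' implementation), a minimum match length of 3 means no two-word run such as A or C should ever become a tile, let alone count toward the score:

```python
def greedy_string_tiling(a, b, mml=3):
    """Return the maximal non-overlapping common word runs (tiles)
    of length >= mml between two strings, longest runs first."""
    a, b = a.split(), b.split()
    marked_a, marked_b = [False] * len(a), [False] * len(b)
    tiles = []
    while True:
        best, matches = 0, []
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > best:
                    best, matches = k, [(i, j, k)]
                elif k == best and k > 0:
                    matches.append((i, j, k))
        if best < mml:          # runs shorter than mml never become tiles
            break
        for i, j, k in matches:
            if not any(marked_a[i:i + k]) and not any(marked_b[j:j + k]):
                for t in range(k):
                    marked_a[i + t] = marked_b[j + t] = True
                tiles.append(a[i:i + k])
    return tiles
```

On a toy pair such as "the city at the foot of the mountains" vs. "a city at the foot of hills", this yields the single five-word tile "city at the foot of"; the leftover two-word overlaps are discarded.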

51. Section 5.4: "Three editors did not change the text snippet at all, but only added to it based on the triples shown on the ArticlePlaceholder page." - I don't understand. In all of the text snippets, there is a literal token. If they didn't change it all, in particular if they only added to it, does this mean they left the literal token in the text when they stored it?

52. Regarding the further discussion of the token: looking at Sentence #8 in Table 9, if my Ukrainian is strong enough (which it is not), I seem to understand that the editor just removed the token instead of adding the name of the mountain range, Атлас. The resulting sentence then turned from the precise "is a city at the foot of the Atlas Mountains." to the rather poetic and imprecise "is a city at the foot of (the) mountains." (Ukrainian doesn't have determiners, so the translation to English is ambiguous). I find this problematic.

53. Page 20, line 25 (left): "One participant commented at length the presence" -> "... at length on the presence"

54. Page 20, line 41-43 (right): "Therefore, a high quality, not distinguishable from a human-generated Wikipedia summary can be a double-edged sword."
-> ""Therefore, a high quality Wikipedia summary, which is not distinguishable from a human-generated one, can be a double-edged sword."

55. Page 21, line 16 (left): "those wrongly generated statements" -> "such infactual statements"

56. Section 6, Discussion, argues: "In particular, as research suggests that the length of an article indicates its quality – basically the longer, the better [62]. However, from the interviews with editors, we found that they mostly skim articles when reading them." - I don't think that's an "however", both statements can be equally true and are not contradictory to each other.

57. Page 21, line 1 (right): "approach, however using a tool with gamification " -> remove ", however"

58. Section 6, Discussion, argues: "In comparison to Wikipedia Adventure, the readers are exposed to the ArticlePlaceholder pages and, thus, it could lower their reservation to edit by offering a more natural start of editing." Maybe true, but isn't the Cebuano Wikipedia a counterpoint to this argument?

59. Page 22, line 5ff (right): "Beside the Arabic sentence, the editors worked on are synthesized, i.e. not generated but created by the authors of this study." -> Rephrase the sentence, something's off here.

60. Reference 7: replace arxiv preprint with EMNLP paper.

61. Reference 16 is missing the venue or journal.

62. Reference 17, capitalization.

63. Reference 20, incomplete.

64. Reference 33, missing venue.

Review #2
By John Bateman submitted on 13/May/2020
Minor Revision
Review Comment:

This paper addresses a nicely constrained problem that I could imagine
would also be a quite useful result: producing place-holder wikipedia
entries for resource-limited wikipedias that may have articles/pages
missing, although information is available (and available in the
appropriate language) in wikidata. The paper is well structured, with
some nice results; also good suggestions for human-centered NLG
evaluation. The evaluation revealed interesting materials for further
investigations and system building in its own right and so will be
good for follow-up research and development.

There are also several places where the limitations of the work should
be made far clearer, however. Indeed, there are several
claims/comments that should be toned down; for example: "... have made
it possible to train NLG systems to achieve human-level performance in
text writing and summarisation." : this is of course hopelessly
over-stated as-is! The task (and corresponding evaluations) has always
to be restricted to particular domains and evaluation criteria: it is
possible in *some* situations to achieve human-level performance, but
it is not possible in general as this sentence leaves open to
interpretation! The results in this area are sufficiently strong on
their own merits as to not require overselling of this kind. Indeed,
the results of the paper contribute to this as well; but it is
nevertheless important to maintain awareness of the kinds of tasks and
quality measures that are still not scoring so well so that these can
be focused on more in the future.

Part of the overstatement even here in the abstract may also have had
some influence in the rest of the text, as it was actually not clear
to me for quite a while just what was being generated. The discussion
is almost always of 'text', which makes questions of fluency and
style, etc. very relevant. But actually it then seemed to be the case
that what was being generated was individual *sentences*, not texts at
all. Judging sentences is in many respects a far easier task, as is
producing them. Texts have many more opportunities to mess up than do
single sentences, and so it is far easier to achieve 'human-level' performance.

The authors need to make it MUCH CLEARER PRECISELY WHAT WAS BEING
GENERATED: if full summary articles were being generated, then this is
clearly a much more substantial achievement. If actually introductory
sentences is all that was being produced, then this needs to be stated
THROUGHOUT so as not to mislead readers with respect to the scope of
what is being done. Some indication of the range of complexity
targeted for the generated sentences would then also be useful to
know. The question of what is actually being generated is fairly
crucial: it affects particular issues of reuse. The beginning of the
reuse assessment sounds as if one might be seeing entire blocks of
text being reused (which would be a very strong positive result), but
as one proceeds, it seems that blocks of text might just refer to
reuse of one or two words, a much less impressive result (but still
worth reporting). So, were "generated text snippets" really blocks of
text, or short sequences of words from sentences?

Just to strike this point home: in several places the strong
impression is given that we are talking about full wikipedia articles,

"I think that if I saw such an article in Ukrainian,
I would probably then go to English anyway," (p. 17)

here 'article' strongly suggests connected text, not a single sentence. But
then (p. 18) it is explicitly stated, perhaps for the first time, that:

"While the model generated just one sentence, the editors thought it
to be a helpful starting point: ... While generating larger pieces of
text could arguably be more useful, reducing the need for manual
editing even further, the fact that the placeholder page contained
just one sentence made it clear to the editors that the page still
requires work."

This needs to be said upfront! I would suggest that 'text' should be
replaced by 'introductory sentence' or 'summary sentence' in *all*
cases where generation of some linguistic output by this system is
used. Then it is clear what is done, what problems are being
addressed, and which issues are not raised. The circumlocution 'text
snippet' used at the beginning of the article is not clear enough in
this respect. The task of text generation is not solved by the
generation of single sentences and the authors should show more
awareness of this.

This would also open up explicit further directions for useful
research and extensions of the methods described. It would then be
particularly interesting, for example, to extend the proposed
evaluation techniques so that they are done with full texts and not
with single sentences and to place the current work within a context
working towards or with full article generation. With this inherent
limitation of the work cleared up, what remains looks solid and worthwhile.

Minor corrections:

The referencing style sometimes leaves bare numbers serving
grammatical roles in their clauses, e.g., '[27] show that' - this
should be edited throughout to make names visible (regardless of
journal style!) as is also often already done in the paper in any

p. 4 col1 l. 20-23: referencing goes a bit awry here.
p.4. col2 l. 31: under-resources -> under-resourced
p.5. col1 l. 28: 'use' --> 'uses'
l. 36: 'judgement-based' --> 'Judgement-based'
p.5 col2. l. 2: 'costed' --> 'cost'
l. 35: 'an mixed' --> 'a mixed'
p.6 col2. l.12-17: uses wrong closing quotes several times
p7 has two text columns, each with only a single line of text...
p.10 col2 l.48 'used the train' --> 'used to train'
p.13 col1 l. 31 'a a bot' --> 'a bot'

Review #3
By Leo Wanner submitted on 13/Jul/2020
Major Revision
Review Comment:

The submission deals with the generation of Wikipedia entries from RDF triples using an existing natural language text generator, which has been adapted to the task.

The originality of the paper is moderate, but it might have a considerable practical impact since it facilitates the creation of Wikipedia articles in under-resourced languages. The extensive use of the ArticlePlaceholder, on which the work described in the paper is based, by the editors shows that they are ready to accept and use tools that facilitate their work.
In general, the paper is reasonably well-written. However, the use of examples in the original language of a Wikipedia article without translation into English (as the language all readers of the SWJ can be assumed to speak) is of limited use only. For instance, Table 6 and Table 9 remain opaque, as do most of the figures (cf., e.g., Figs 1, 5, 6 and 7). I am aware of the difficulty that mirroring the content of the illustrative material in English implies, but I don't think that there is another option if the authors want to make their paper accessible to all readers of the SWJ.

According to the authors, the article aims to address three questions: (i) Can a neural network be trained to generate text from triples in a low-resource setting? (ii) How do editors perceive the generated text on the ArticlePlaceholder page? And (iii) how do editors use the text snippets in their work? The research contribution in the authors' answer to the first question is limited. The novelty consists in the introduction, into Vougiouklis et al.'s generator, which has already been published elsewhere (cf. Vougiouklis et al., 2018), of a "property placeholder" mechanism that learns multiple verbalizations of entities that appear rarely in the training material.
The authors use a number of baselines in the context of a quantitative evaluation of their model, but none of these baselines is a state-of-the-art content-to-text generation model. Therefore, the question on how well the model performs compared to the state of the art (and thus whether the generator that has been used is the best option) remains open. The WebNLG Challenge (cf. Gardent et al., 2017 in INLG 2017 proceedings) would have been a good source of inspiration for competitive baselines.
As far as the evaluation of the fluency and the appropriateness of the generated texts is concerned: although the authors apply quantitative metrics to obtain some figures, their evaluation is in its nature qualitative since it is based on the qualitative assessment of the sentences by native speakers. Overall, the discussion of the contribution to RQ1 is very short. The authors simply report the numbers; an error analysis would have been appropriate.
The WebNLG Challenge should also be discussed in the Related Work section, since it is immediately relevant to the topic of the paper (i.e., generation from RDF triples). In general, the Related Work section should be revised and completed: between 2017 (when, I assume, (Vougiouklis et al., 2018) was written and submitted) and early 2020 (when this paper was submitted), a number of works on NLG from RDF triples have been published; cf., e.g., the proceedings of INLG, EMNLP and ACL. There is also an already somewhat outdated but, I reckon, still useful survey by Bouayad-Agha et al. in this journal.

The merit of the paper lies more in showing the practical benefit of applying NLG (not necessarily state of the art) to the creation of Wikipedia articles from Wikidata. Addressing their RQs 2 and 3, the authors conduct an evaluation with editors of Wikipedia articles in this respect. The evaluation is well planned and, in general, thorough. However, the description of the editors' involvement is not always clear; cf., e.g., "After the editing was finished, they were asked questions about their experience." (cf. line 49, p. 14). Which questions? Were these questions the same for all participants?