Review Comment:
I enjoyed the paper a lot, and I think it is an important contribution. I particularly liked the human-centered approach, instead of relying solely on ROUGE and BLEU. I recommend acceptance, although there are a number of changes I think are necessary. Once these are addressed, I would be happy to see the paper accepted.
As is the nature of a review, I list in particular the things I have issues with, and not the ones that I liked. But rest assured that there is a lot in this paper I liked.
== Main points ==
A) Under-served languages
The whole narrative of the paper is built around supporting under-resourced or under-served languages, which you define in Section 4.2.3 by number of articles. The problem, though, is that the work is done exclusively on Wikipedias that are within the top 12% of all Wikipedias by size: Swedish is #3 by number of articles, ahead of German; Arabic (ranked #15) and Ukrainian (ranked #17) have 1M+ articles, Persian 700k+ (ranked #18), Indonesian 500k+ (ranked #22), and even Hebrew (ranked #36), the smallest one in this list, has more than a quarter million articles.
In fact, Breton, which would have by far the best claim to being an under-served Wikipedia, was removed from your scope because of a lack of training data (which I don't understand: given that, for the experiments here, you had a Wizard-of-Oz-type solution of creating the one sentence by hand, why would you not be able to create that one sentence about Marrakesh in Breton?). But even Breton has more than 68k articles, and with that is #81 out of ~300 language Wikipedias.
As can be seen in Table 5, all of the Wikipedias listed here have more than a thousand active editors - those are all large communities!
It remains true that all of these Wikipedias have large gaps - but calling them the under-served or under-resourced ones is problematic, given that the vast majority of Wikipedias are much smaller and have a much stronger claim to that description. I would drop all mentions of under-served languages from the paper (in particular in light of dropping Breton), as you never touch any of the bottom half of Wikipedias by size. This does not weaken the paper, as its principal findings still hold, and it avoids confusion about what the paper actually achieved.
B) Neutrality of structured data
page 4, line 24-28 (left): "Instead of using content from one Wikipedia version to bootstrap another, we take structured data labelled in the relevant language and create a more accessible representation of it as text, keeping those cultural expressions unimpaired." I don't buy this argument. This would mean that Wikidata is culturally neutral, which I don't believe. There are plenty of places where bias can slip into Wikidata: which properties are deemed relevant at all, which statements are given for a given item, from which sources these statements are taken, which values are preferred for a particular statement, in which order the values and statements are given, etc. The claim that Wikidata is culturally unbiased is nice, but I don't think it holds up to scrutiny. Or, put differently: citation needed for that claim.
C) Unclear approach in 3.2.2
Section 3.2.2: "Afterwards, we retrieved the corresponding Wikidata item to the article and queried all triples where the item appeared as a subject or an object in the Wikidata truthy dump. In doing so, we relied on keyword matching against labels from Wikidata from the corresponding language, due to the lack of reliable entity linking tools for under-served languages."
I do not understand why you would do that. First you have an article, and then you look up the QID. But once you already have the QID, why would you use the label to look for triples that have the QID as subject or object? Something seems to be missing here.
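To make concrete what I would have expected: once the QID is known, the triples can be pulled from the truthy dump by URI alone, with no keyword matching against labels at all. A minimal sketch, assuming an N-Triples dump; this is my reconstruction, not the authors' code, and the file name and example QID are purely illustrative:

    # Hedged sketch: retrieve all truthy triples in which a known item
    # appears as subject or object, using only its QID.
    WD_ENTITY = "http://www.wikidata.org/entity/"

    def triples_for_item(dump_path, qid):
        """Yield (subject, predicate, object) from an N-Triples truthy dump
        for every triple in which the item occurs as subject or object."""
        target = f"<{WD_ENTITY}{qid}>"
        with open(dump_path, encoding="utf-8") as dump:
            for line in dump:
                parts = line.rstrip(" .\n").split(" ", 2)
                if len(parts) != 3:
                    continue  # skip comments and malformed lines
                subject, predicate, obj = parts
                if subject == target or obj == target:
                    yield subject, predicate, obj

    # e.g. list(triples_for_item("latest-truthy.nt", "Q42"))  # Q42 only as an example

If the keyword matching against labels is instead needed for a second step - linking surface forms in the article text back to the retrieved triples - then two distinct steps are conflated in the current text, and it would help to describe them separately.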
"In case a rare entity in the text is not matched to any of the input triples, its realisation is replaced by the special token.": When are you matching entities in the text to input triples? How do you even know if something is an entity, and not a property? e.g. in "Floridia estas komunumo de" how do you know whether you need to look up "estas" as an item or not and try to match it? This Section is confusing me, sorry.
D) Make all data available
Section 4.2.5: can you make the raw annotation data available, including the sentences and their ratings? How high was interannotator agreement?
How could you create the sentences with the special token if sometimes the agreement and grammaticality of the sentence would depend on what the token stands for?
Table 9: could you please show all the results, instead of just 2? It's only ten anyway, and I find them really helpful. Ideally with translations to English.
E) Automatic text creation problematic for trust
Section 6, Discussion, argues: "However, none of these approaches take the automatic creation of text by non-human algorithms into account. Therefore, a high quality, not distinguishable from a human-generated Wikipedia summary can be a double-edged sword."
The "Therefore" is a bit strong here: inferring from existing research on trust in Wikipedia, and that the existing research disregarded the automatic creation of text, inferring from that that automatically generated text would directly lead to a loss of trust is a strong statement that would need more support.
F) Hallucinations
Section 7, Limitations, argues: "Beside the Arabic sentence, the editors worked on are synthesized, i.e. not generated but created by the authors of this study. While those sentences were not created by natural language generation, they were created and discussed with other researchers in the field. We therefore focused on the most common problem in text generative tasks similar to ours: the [special] tokens. Other problems of neural language generation, such as factually wrong sentences or hallucinations [76], were excluded from the study as they are not a common problem for short summaries as ours."
You state repeatedly that hallucinations are not problematic because your sentences are so short, yet the one single sentence you created with your method and selected for the second part of your study actually contained a factually incorrect statement. I am thankful that you mention it - and the fact that an actual inhabitant of Marrakesh didn't notice it was a very interesting anecdote, thank you for sharing that - but it makes the whole dismissal of hallucinations even more problematic.
In fact, I would claim that hallucinations and factual errors are a more pressing problem than ungrammatical text.
== Copyediting and minor issues ==
1. Title: "empty Wikipedia articles" sounds weird. What about "missing Wikipedia articles"?
2. Abstract, "under-served Wikipedias" - it is not the Wikipedias that are under-served, but the language communities. The text states "Under-resourced language versions" which sounds much better.
3. "fewer editors means fewer articles" - given Cebuano and Swedish, is that actually true?
4. Figure 1, on the left, the multilingual view of Wikidata is not particularly evident. I wonder if that could be made more visible (either by showing English labels instead of, or in addition to, the French labels). I could imagine having three images: Wikidata in English, Wikidata in French, and the Wikipedia infobox in French.
5. Figure 1, subtitle: "Representation of Wikidata triples" -> "Representation of Wikidata statements".
6. ibid. "in the article about fromage infobox from the French Wikipedia" -> "in articles using the fromage infobox from the French Wikipedia"
7. page 2, line 39-40 (left): "as it is the case for a large share of the Wikipedia community" -> "as it is the case for a large share of Wikipedia readers" - the main audience of this would be the readers of Wikipedia, not the editors, right? And this makes the statement even stronger.
8. page 3, line 1 (left): line overrun.
9. page 3, line 17-22 (right): the list of sections skips Section 7 (the limitations section is mentioned, but its number is skipped).
10. page 4, line 21-23 (left): something off with the citations "that argue that the language of global projects such as Wikipedia Hecht [21] should express cultural reality Kramsch and Widdowson [23]."
11. Section 2.1 second half, but throughout the paper: I have trouble with the simplification that Wikidata statements are just triples. In Figure 1 we can see that this is not really the case. I would prefer to use here either "statement" or "reified triples" (sigh).
12. page 4, line 30-31 (right): "for under-resources Wikipedias" -> "under-resourced"
13. page 4, line 37-39 (right): " templates always assume that entities will always have the relevant triples to fill in the slots" -> remove one "always"
14. page 5, line 4-6 (right): "costing 75 thousand pounds, assumed to be the most costly NLG evaluation at this point" - which, compared to how much it costs to train current language models, should not be a deterrent anymore. I do understand that it nevertheless is.
15. page 5, line 24 (right): "is not easy to follow" -> "to replicate"
16. page 5, line 34-35 (right): "we run an mixed-methods study" -> "a"
17. page 6, line 29-30 (left): "with an median of 69, 623.5 articles, between 253, 539 (Esperanto) and 7464 (Northern Sami)" -> "a median"; remove the spaces after the commas in the numbers; and add a comma to the number for Northern Sami for consistency.
18. page 6, line 35-36 (left): "append a short" - why append the text? Shouldn't the text come before the boxes? Or replace them? Do the boxes remain if the text already covers a statement?
19. Figure 4 and example in Section 3.2.1: what language is this?
20. Section 3.2.1: if the input is the concatenation of the vectors, does the order of the concatenated vectors make a difference for the generated text?
21. Page 6, line 2-4 (right): "A vector representation hf1 , hf2 to hfn for each of the input triples is computed by processing their subject, predicate and object" -> what kind of processing?
22. Section 3.2.1, surface form tuples: does this also cover different forms with regards to agreement?
23. page 7: there is one line of text on this page. Get rid of it.
24. The description of the property placeholder was a bit confusing. Looking at the example in Table 1, I think I understand what is going on, but rereading the text in Section 3.2.1 three times still left me confused. I would kindly ask you to edit this section for clarity.
25. page 8, line 24-25 (left): "(as describe in the previous section" -> described
26. Section 4.1.1: how many of the genuine Wikipedia summaries could have been recreated with the vocabulary generated like this? Should be part of the baseline numbers.
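To make the request concrete, here is a minimal sketch of the check I have in mind (all names are mine, not the paper's): count how many of the genuine summaries could be reproduced token-for-token from the generated vocabulary.

    def vocabulary_coverage(summaries, vocab):
        """Fraction of summaries (each a list of tokens) whose every token
        appears in the generated vocabulary. The tokenisation should match
        whatever the paper uses; this is only illustrative."""
        covered = sum(1 for tokens in summaries if all(t in vocab for t in tokens))
        return covered / len(summaries) if summaries else 0.0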
27. Table 2. The numbers are from 2017 - that's three years ago. Could we get updated numbers?
28. Section 4.2.3: I was trying to do something with the ERGO number, but it seems these are only accessible to people with a Southampton account? (same for 4.3.3)
29. Page 10, line 49, right: "and 15 from Wikipedia sentences used the train the neural network" -> "used to train"
30. Table 4: could you also add the median number of sentences annotated by each annotator?
31. The formalisation in Section 4.2.5 seems unnecessary. I would remove formula 1 and 2, and the set-theory based definitions. You are taking the averages of all annotations for every sentence, and then the average of all sentences per system - you say that quite clearly in your sentence "For each sentence, we calculated the mean fluency given by all participants and then averaging over all summaries of each category." The rest of the paragraph, all the way to the formula, seems redundant. The same for appropriateness.
32. Could the example screenshots for Figures 6 and 7 also be provided in English, to give a better intuition of what is going on? In particular, the edit window in Figure 7 does not correspond to https://www.wikidata.org/w/index.php?title=User:Frimelle/Marrakesh&actio... and also does not seem to correspond to Figure 6, which seems to only have a single sentence, whereas Figure 7 contains more content.
33. Page 13, lines 43-46 (left): "Further, n = 4 editors worked in at least two other languages beside their main language, while n = 2 editors were active in as many as 4 other languages." -> remove the "n =". The same for the rest of the paragraph.
34. Page 14, line 25-29 (left): "Marrakesh is a good starting point, as it is a topic possibly highly relevant to readers and is widely covered, but falls into the category of topics that are potentially under-represented in Wikipedia due to its geographic location" - alas, there are Marrakesh articles in 93 language editions, including each of the six in this study.
35. page 14, line 37 (right): "As the interviews were remotely" -> "remote"
36. page 14, line 50 (right): "about theur experience." -> "their"
37. page 15, line 3-5 (left): "Finally, we left them time discuss open questions of the participants." -> "time to discuss"
38. page 15, line 47/48 (left): "how much of real structure of the generated summary is being copied." -> "how much of the real structure" Also, what's a real structure?
39. page 15, top paragraph (right): it looks like you're putting gstscore in math mode (i.e. $gstscore$) which makes the kerning look weird. Try putting the word into a \text{}, or \mbox{} or \textrm{} command inside the math mode.
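Concretely, something along these lines (assuming amsmath is loaded for \text):

    % current: the letters are typeset as a product of variables, hence the odd kerning
    $gstscore$
    % better:
    $\text{gstscore}$   % needs \usepackage{amsmath}
    % or, without amsmath:
    $\mathrm{gstscore}$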
40. Since it is only one sentence per language, and only 10 participants, can you also show all ten sentences they produced?
41. Page 15, line 25 (right): " in this space and a data" -> remove "a"
42. Page 15, line 29 (right): "class of techniques in NGL" -> NLG
43. Page 16, line 30 (right): "to a few number of domains" -> "small number", or drop "few" or "to a few domains"
44. Also regarding domains: could you list some (or all) of the domains and their values?
45. Section 5.2.1: Isn't it curious that the Esperanto news articles seem to score considerably lower on fluency than the generated text? Isn't it surprising that the Wikipedia texts and the news texts all scored below an average of 5.0 (except the Arabic news?)? Is there some other effect going on? (Such as Arabic having many different dialects, and Esperanto not being a first language for anyone?)
46. Section 5.3, subsection "Importance": in Wikidata, the label and description should already offer the disambiguation you are looking for. Are these actually displayed in the ArticlePlaceholder? (based on Figure 6, it looks like the label is, but the description is actually missing). It probably would be good to add that to the ArticlePlaceholder no matter what.
47. Page 18, line 34 (right): " and none of the them mentioned them during the interviews" -> "of the editors mentioned" or "of them mentioned"
48. Section 5.3, subsection "": I am not sure if I understand - was literally displayed in the realization? I get the argument for Arabic, where it is only used for the Berber name, but in the other languages, such as Indonesian, there are two literal tokens, and the interviewed editor didn't comment on the word with the angle brackets?
49. page 19, line 24/25 (left): ", in which the participants explained how and they changed the text." -> missing "why"?
50. Table 9: Why do the tiles A and C count toward the score? I thought the minimum match length factor was set to 3, but A and C both have a length of 2? (Also, without C, #2 would drop below the PD threshold)
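For reference, my understanding of how tiles should contribute under a minimum match length of 3 - a hedged sketch, assuming the score is the share of the reference covered by sufficiently long tiles:

    def gst_score(tile_lengths, reference_length, min_match_length=3):
        """Share of the reference covered by tiles that meet the minimum
        match length. Tiles shorter than the minimum - such as A and C in
        Table 9, both of length 2 - should not contribute at all."""
        covered = sum(l for l in tile_lengths if l >= min_match_length)
        return covered / reference_length

    # e.g. gst_score([2, 5, 2, 4], 20) == 0.45, not 0.65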
51. Section 5.4: "Three editors did not change the text snippet at all, but only added to it based on the triples shown on the ArticlePlaceholder page." - I don't understand. In all of the text snippets, there is a literal token. If they didn't change it at all, and in particular if they only added to it, does this mean they left the literal token in the text when they stored it?
52. Regarding the further discussion of the token: looking at sentence #8 in Table 9, if my Ukrainian is strong enough (which it is not), I seem to understand that the editor just removed the token instead of adding the name of the mountain range, Атлас (Atlas). The resulting sentence then turned from the precise "is a city at the foot of the Atlas Mountains." into the rather poetic and imprecise "is a city at the foot of (the) mountains." (Ukrainian doesn't have determiners, so the translation to English is ambiguous.) I find this problematic.
53. Page 20, line 25 (left): "One participant commented at length the presence" -> "... at length on the presence"
54. Page 20, line 41-43 (right): "Therefore, a high quality, not distinguishable from a human-generated Wikipedia summary can be a double-edged sword." -> "Therefore, a high quality Wikipedia summary, which is not distinguishable from a human-generated one, can be a double-edged sword."
55. Page 21, line 16 (left): "those wrongly generated statements" -> "such factually incorrect statements"
56. Section 6, Discussion, argues: "In particular, as research suggests that the length of an article indicates its quality – basically the longer, the better [62]. However, from the interviews with editors, we found that they mostly skim articles when reading them." - I don't think that's a "however": both statements can be true at the same time and do not contradict each other.
57. Page 21, line 1 (right): "approach, however using a tool with gamification " -> remove ", however"
58. Section 6, Discussion, argues: "In comparison to Wikipedia Adventure, the readers are exposed to the ArticlePlaceholder pages and, thus, it could lower their reservation to edit by offering a more natural start of editing." Maybe true, but isn't the Cebuano Wikipedia a counterpoint to this argument?
59. Page 22, line 5ff (right): "Beside the Arabic sentence, the editors worked on are synthesized, i.e. not generated but created by the authors of this study." -> Rephrase the sentence, something's off here.
60. Reference 7: replace the arXiv preprint with the EMNLP paper.
61. Reference 16 is missing the venue or journal.
62. Reference 17, capitalization.
63. Reference 20, incomplete
64. Reference 33, missing venue.