Review Comment:
The paper describes the development of a Spanish adaptation of the WebNLG dataset using machine translation, together with a thorough evaluation of different methodologies for the task of triple verbalisation using LLMs.
The work is generally coherent, and its originality lies in it being the first released version of a fully-covered Spanish translation of the WebNLG task. From this point of view, the merits of the paper are undeniable, especially given the need for more language-specific data in the current age of neural models. Furthermore, the analysis of LLMs as models to verbalise triples is of great interest and invites further discussion.
On top of this, the paper is well written and to the point, and all the data are publicly accessible and well-documented.
Having said that, I feel the paper is in need of major revisions before publication. In particular, there are two major points that severely hinder the significance of the work, which I will discuss in detail in the next section.
# MAJOR ISSUES
The first major issue I have with the paper is that the results from LLMs are evaluated against automatically translated verbalisations generated using DeepL, which have not themselves been thoroughly evaluated.
The authors explicitly mention this issue, in particular given that manually checking the translated verbalisations (more than 45,000) is well beyond the scope of the paper, and refer to future work for a thorough crowdsourced revision of the dataset.
As for the current paper, however, the authors choose to manually revise only the translations with a low cosine similarity score (computed using a multilingual model) with respect to their English sources.
While I can stand behind this solution from a practical point of view, I feel that its implementation, as presented, is severely lacking, especially given that the authors cite no prior work in which this process of manually checking only the translations with a low cosine similarity score has been implemented or tested.
What I would expect is a reference to a previous implementation in which an automatically translated dataset was adopted as a gold standard for a specific task, in order to give some theoretical foundation to the process itself.
Furthermore, according to the authors, the threshold for retrieving verbalisations to be manually revised is set at 0.9, for statistical reasons. But this says nothing about the verbalisations themselves, which might exhibit particular phenomena that cosine similarity does not capture.
Once again, this issue could be improved upon simply by citing other works using this technique, so that its theoretical foundations are made more explicit.
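To make the filtering step concrete, here is a minimal sketch of the procedure as I understand it from the paper. The embeddings are toy vectors standing in for the output of the multilingual sentence encoder (not shown), and all identifiers are my own, not the authors':

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

THRESHOLD = 0.9  # value reported by the authors

# (id, English embedding, Spanish embedding) -- toy stand-ins for encoder output
pairs = [
    ("verb-1", [0.9, 0.1, 0.4], [0.88, 0.12, 0.41]),  # near-parallel pair
    ("verb-2", [0.9, 0.1, 0.4], [0.10, 0.95, 0.20]),  # divergent pair
]

# only pairs below the threshold are routed to manual revision
flagged = [pid for pid, en, es in pairs if cosine(en, es) < THRESHOLD]
print(flagged)
```

As the sketch makes explicit, everything above the threshold is accepted unseen, which is precisely why the procedure needs prior validation in the literature.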
The second major issue I have with the paper is that the authors state in several sections that Spanish, as a language, is intrinsically more difficult to tackle (NLP-wise) than English.
While this might be argued to be true in a rule- or template-based environment, given the differences in morphology, neural models should not be impacted by such linguistic phenomena, since the quality of their output is more a question of training data than of linguistic complexity.
Currently, I am not aware of works describing the impact of inflection on neural networks. If such work exists, thus showing the authors' point to be correct, it should be cited in the paper.
The authors make several claims regarding this aspect, most without a source. I am listing some instances that should be looked into:
- "Data-to-text generation in Spanish is a more complex task than in English due to its richer grammar, more complex morphology, and larger vocabulary variations"
- "there is a high likelihood that the methodology followed to create our dataset does not encompass all possible verbalization variations, given the richness of the Spanish language"
- "[...] this is likely due to the restrictive nature of BLEU, which relies heavily on lexical overlap and struggles to account for the rich grammatical and *expressive diversity* of Spanish" <- BLEU shows issues even with a weakly inflected language such as English; it is not as if Spanish faces a different standard in this sense
- "Also, vocabulary in Spanish tends to be more varied and expressive, often requiring more words or longer phrases to convey nuances that English expresses concisely—such as "te quiero" vs. "te amo" to distinguish degrees of affection, whereas English simply uses "I love you." " <- this is just a case of semantic and pragmatic misalignment between the two languages (an English speaker might actually say "I desire you" instead of "I love you", but in different contexts than a Spanish speaker saying "te quiero" instead of "te amo"), and I cannot see how it is of any importance to the work presented
- "Because Spanish allows for greater syntactic flexibility, a single phrase can often be expressed in multiple grammatically correct ways with the same meaning" <- how is English any different in this aspect? Even the example given in this case (I hope you come tomorrow) can be expressed in different ways in English, such as 'I wish you would come tomorrow'; 'I am awaiting your arrival tomorrow'; 'I can't wait to see you tomorrow'
- "We also see that, for rich languages such as Spanish, BLEU might not be a good lexical metric choice, but rather others such as METEOR and CHRF++" <- what do the authors mean by "richness"? If they are referring to inflectional morphemes, this should be made more explicit. Also, as previously mentioned, BLEU is generally considered a poor metric even for a weakly inflected language such as English
- "During the cross-lingual and error analysis, we also saw that, generally, Spanish grammar and vocabulary are more complex than English, with richer verb conjugations, gendered nouns, and varied word order". <- is Spanish vocabulary actually more complex? In what sense? While Spanish might categorise certain aspects of reality using a larger variety of words, this cannot be said in a vacuum, and the phenomena in which this might happen are, I believe, out of the scope of the paper.
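On the metric point specifically, the contrast the authors observe can be illustrated without appealing to any special "richness" of Spanish: word-level n-gram overlap (the basis of BLEU) scores zero on a sentence pair that differs only in number inflection, while a character n-gram F-score (the idea behind chrF++) still credits the shared stems. A simplified, illustrative sketch of my own, not the actual sacreBLEU implementations:

```python
from collections import Counter

def word_precision(hyp, ref):
    """Fraction of hypothesis words that also appear in the reference."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    return sum((h & r).values()) / sum(h.values())

def char_f1(hyp, ref, n=3):
    """F1 over character n-grams, whitespace removed (chrF-style, simplified)."""
    def grams(s):
        s = s.replace(" ", "")
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))
    h, r = grams(hyp), grams(ref)
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    p, q = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * p * q / (p + q)

ref = "los gatos negros duermen"   # plural reference
hyp = "el gato negro duerme"       # singular hypothesis, same stems

print(word_precision(hyp, ref))    # 0.0: no exact word matches
print(round(char_f1(hyp, ref), 2)) # substantial character-level overlap
```

The same behaviour would appear with English inflection pairs (cat/cats, run/ran), which is exactly why the word-level vs character-level contrast is a property of the metrics, not of Spanish.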
Unlike the first major point, this issue is more theoretical in nature and, as such, I feel it should either be removed from the paper or backed up by citing published sources. The instances listed above are only some examples of this issue, which I feel requires a deep revision of several sections of the paper.
# MINOR ISSUES
I have some minor issues that the authors should address before publishing the paper. Rather than sections to be revised, these are more doubts I have about the paper, and if the authors have a good enough reason to back them up, they can be left as they are.
First of all, some entries are strangely missing from the background section. For instance, when discussing Transformer models, it is strange to see BART mentioned but not BERT, to which it is closely related.
Furthermore, I believe a richer background on the evaluation process should also be included: for instance, in past years there have been a number of publications analysing BLEU and its issues, see for instance https://aclanthology.org/W18-6319/, https://aclanthology.org/E06-1032/, https://aclanthology.org/J18-3002/
As for the last point in the background section, it looks like some previous work has been done on WebNLG and Spanish. Although it implements a template-based system, I think it should at least be mentioned: https://aclanthology.org/W19-8659.pdf
In the evaluation section, is there a reason why the authors did not use a Spanish monolingual model for the cosine similarity evaluation? The authors mention that the chosen model is the most popular one that supports Spanish, but I expected at least one test with a Spanish-only model.
Similarly, the evaluation section does not mention which model has been used for the BERTScore evaluation.
In the future work section, the authors state "First, we plan to continue investigating additional approaches for improving Spanish triple-to-text verbalisation, refining existing methods, and exploring new techniques to enhance model performance and adaptability". I think at least some of these methods and techniques should be named in this section, in order to be more to the point.
Finally, while this might be beyond the scope of the present paper, and as such need not be covered by a full experiment, I feel that the end-to-end training of a neural model should at least be mentioned as possible future work, especially since resource efficiency is part of RQ1.