Spanish Triple-to-Text Benchmark on Low-Resource Large Language Models

Tracking #: 3828-5042

Authors: 
Virginia Ramon-Ferrer
Carlos Badenes-Olmedo
Oscar Corcho

Responsible editor: 
Blerina Spahiu

Submission type: 
Full Paper
Abstract: 
The verbalisation of structured data is a beneficial process for several applications. In the context of knowledge graphs (KGs), transforming RDF triples into natural language facilitates tasks such as KG documentation or alternative exploration methods for different user needs. While significant progress has been made on the English verbalisation of KGs, Spanish remains an under-represented language for this task due to the lack of suitable resources. This hinders the development and evaluation of models capable of generating high-quality Spanish verbalisations. To tackle this problem, we create a Spanish adaptation of the WebNLG dataset, a benchmark consisting of over 45,000 verbalisations paired with DBpedia triple sets. To our knowledge, this is the first formal attempt to provide such a dataset in Spanish, which not only serves for data verbalisation but can also potentially support the automated generation of RDF triples from text. We leverage this dataset to conduct a comprehensive evaluation of resource-efficient models for the Spanish triple-to-text task employing two different learning approaches: in-context learning (zero-shot, one-shot, and few-shot settings) and supervised learning through partial fine-tuning. Our results highlight the challenges of generating fluent and accurate Spanish text and demonstrate that partial fine-tuning of the evaluated models significantly improves performance.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 31/Oct/2025
Suggestion:
Accept
Review Comment:

The paper focuses on building a Spanish adaptation of the English WebNLG benchmark. This is the first such benchmark for the Spanish language. Detailed evaluation has been conducted, alongside intricate explanations of the semantic and syntactic differences of the two languages that might account for some of the hurdles encountered in this process.
The methodology, evaluation scenarios and the obtained results provide a valuable resource for scholars wishing to transform their ontologies into natural language text in a salient fashion.
The quality of writing is satisfactory in most parts of the paper, but it might benefit from another careful read-through. E.g., on p. 14, lines 25-30, only one efficiency metric is listed, although the subtitle says "metrics"; furthermore, on p. 20, line 47, instead of "we have that..." a phrase like "we see that" or "it can be seen that" could be used.

(A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data,
(B) whether the provided resources appear to be complete for replication of experiments, and if not, why,
(C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and
(D) whether the provided data artifacts are complete.

Review #2
Anonymous submitted on 09/Nov/2025
Suggestion:
Accept
Review Comment:

1. Originality
The manuscript presents the methodology and results of the process of creating a Spanish version of the WebNLG dataset, consisting of over 45,000 verbalisations paired with DBpedia triple sets, which can be employed for generating Spanish texts under the triple-to-text task using LLMs. The work is original because it investigates a challenging problem, especially for the Spanish language, thus contributing new knowledge to the field and demonstrating an innovative way to use LLMs in NLG tasks.

2. Significance of the Results
The results reported in the manuscript are significant in terms of advancing the field. The results are also in line with state-of-the-art research on LLM fine-tuning and one-shot prompting strategies. The results are promising and show that the proposed approach could easily be reproduced. The evaluation step has been conducted meticulously and by resorting to different metrics.

3. Quality of Writing
The manuscript is well written and generally clear. The authors use a clear structure and logical flow. Furthermore, figures and tables are helpful in conveying the results.

The GitHub repository is well organized and includes all the data and a README file, which makes it easy to understand and use the data.

Review #3
By Gennaro Nolano submitted on 16/Nov/2025
Suggestion:
Major Revision
Review Comment:

The paper describes the development of a Spanish adaptation of the WebNLG dataset using machine translation, together with a thorough evaluation of different methodologies for the task of triple verbalisation using LLMs.

The work is generally coherent, and its originality lies in it being the first released version of a fully-covered Spanish translation of the WebNLG dataset. From this point of view, the merits of the paper are undeniable, especially given the need for more language-specific data in the current age of neural models. Furthermore, the analysis of LLMs as models to verbalise triples is of great interest and invites further discussion.

On top of this, the paper is well written and to the point, and all the data are publicly accessible and well-documented.

Having said that, I feel the paper is in need of major revisions before publication. In particular, there are two major points that severely hinder the significance of the work, which I discuss in detail in the next section.

# MAJOR ISSUES

The first major issue I have with the paper is that the results from LLMs are evaluated against automatically translated verbalisations generated using DeepL, which have not themselves been thoroughly evaluated.

The authors explicitly acknowledge this issue, in particular given that manually checking the translated verbalisations (more than 45,000) is well out of the scope of the paper, and refer to future work for a thorough crowdsourced revision of the dataset.

As for the current paper, however, the authors opt to manually revise only the translations that have a low cosine similarity score (computed using a multilingual model) with their English sources.

While I can stand behind this solution from a practical point of view, I feel that its implementation, as presented, is severely lacking, especially given that the authors cite no prior work in which this process of manually checking only translations with a low cosine similarity score has been implemented or tested.

What I would expect is a previous instance of an automatically translated dataset being used as a gold standard for specific tasks, in order to give some theoretical foundation to the process itself.

Furthermore, according to the authors, the threshold for retrieving verbalisations to be manually revised is set at 0.9, for statistical reasons. But this says nothing about the verbalisations themselves, which might exhibit particular phenomena that are not captured by cosine similarity.
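For clarity, the selection step as I understand it amounts to something like the following sketch (the cosine function is standard; the vectors are toy stand-ins for multilingual sentence embeddings, since the authors' exact implementation is not specified):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

def flag_for_revision(pairs, threshold=0.9):
    """Return indices of (source_vec, translation_vec) pairs whose
    cosine similarity falls below the threshold, i.e. the only
    verbalisations that would receive a manual check."""
    return [i for i, (src, tgt) in enumerate(pairs)
            if cosine(src, tgt) < threshold]

# Toy vectors standing in for sentence embeddings.
pairs = [([1.0, 0.0], [1.0, 0.1]),   # near-identical -> kept as-is
         ([1.0, 0.0], [0.2, 1.0])]   # divergent -> manually revised
print(flag_for_revision(pairs))  # [1]
```

My concern is precisely that any translation pathology that preserves embedding similarity (e.g. a fluent but subtly wrong rendering) never enters this revision loop.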

Once again, this issue could be addressed simply by presenting other works using this technique, so that its theoretical foundations are made more explicit.

The second major issue I have with the paper is that the authors state in several sections that Spanish, as a language, is intrinsically more difficult to tackle (NLP-wise) than English.

While this might be argued to be true in a rule- or template-based environment, given the differences in morphology, neural models should not be impacted by such linguistic phenomena, since the quality of their output is more a question of training data.

Currently, I am not aware of any work describing the impact of inflection on neural networks. If such work exists, thus showing the point made by the authors to be correct, it should be cited in the paper.

The authors make several claims regarding this aspect, most without a source. I am listing some instances of this which should be looked into:
- "Data-to-text generation in Spanish is a more complex task than
in English due to its richer grammar, more complex morphology, and larger vocabulary variations"
- "there is a high likelihood that the methodology followed to create our dataset does not encompass all possible verbalization variations, given the richness of the Spanish language"
- "[...] this is likely due to the restrictive nature of BLEU, which relies heavily on lexical overlap and struggles to account for the rich grammatical and *expressive diversity* of Spanish" <- BLEU shows issues with a weakly inflected language such as English as well, it's not like Spanish has a different standard in this sense
- "Also, vocabulary in Spanish tends to be more varied and expressive, often requiring more words or longer phrases to convey nuances that English expresses concisely—such as "te quiero" vs. "te amo" to distinguish degrees of affection, whereas English simply uses "I love you." " <- this is just a case of semantic and pragmatic misalignment between the two languages (an English speaker might actually say "I desire you" instead of "I love you", but in different contexts of a Spanish speaker saying "te quiero" instead of "the amo"), and I can't see how it is of any importance in the work presented
- "Because Spanish allows for greater syntactic flexibility, a single phrase can often be expressed in multiple grammatically correct ways with the same meaning" <- how is English any different in this aspect? Even the example given in this case (I hope you come tomorrow) can be expressed in different ways in English, such as 'I wish you would come tomorrow'; 'I am waiting your arrival tomorrow'; 'I can't wait to see you tomorrow'
- "We also see that, for rich languages such as Spanish, BLEU might not be a good lexical metric choice, but rather others such as METEOR and CHRF++" <- what do the authors mean by "richness"? If they are referring to inflectional morphemes, it should be made more explicit. Also, as previously mentioned, BLEU is generally considered not a good score for a weakly inflected language such as English as well
- "During the cross-lingual and error analysis, we also saw that, generally, Spanish grammar and vocabulary are more complex than English, with richer verb conjugations, gendered nouns, and varied word order". <- is Spanish vocabulary actually more complex? In which sense? While Spanish might categorize certain aspects of reality using a larger variety of words, this cannot be said in a vacuum, and the phenomena in which this might happen are, I believe, out of the scope of the paper.

Unlike the first major point, this issue is more theoretical in nature and, as such, I feel it should either be removed from the paper or backed up by citing published sources. The instances listed above are only some examples of this issue, which I feel requires a deep revision of several sections of the paper.

# MINOR ISSUES

I have some minor issues that the authors should address before publishing the paper. Rather than sections to be revised, these are more doubts I have about the paper, and if the authors have a good enough reason to back up their choices, these parts can be left as they are.

First of all, there are some entries that are strangely missing from the background section. For instance, when talking about Transformer models, it's strange to see BART mentioned but not BERT, from which it derives.

Furthermore, I believe a richer background on the evaluation process should also be included: for instance, in past years there have been a number of publications providing an analysis of BLEU and its issues, see for instance https://aclanthology.org/W18-6319/, https://aclanthology.org/E06-1032/, https://aclanthology.org/J18-3002/

For the last point in the background section, it seems some previous work has been done on WebNLG and Spanish. Despite implementing a template-based system, I think it should at least be mentioned: https://aclanthology.org/W19-8659.pdf

In the evaluation section, is there a reason why the authors did not use a Spanish monolingual model for the cosine similarity evaluation? The authors mention that the chosen model is the most popular one that supports Spanish, but I would expect at least one test with a Spanish-only model.

Similarly, in the evaluation section it is not mentioned which model was used for the BERTScore evaluation.

In the future work section, the authors state "First, we plan to continue investigating additional approaches for improving Spanish triple-to-text verbalisation, refining existing methods, and exploring new techniques to enhance model performance and adaptability". I think at least some methods and new techniques should be mentioned in this section, in order to be more to the point.

Finally, while this might be beyond the scope of the present paper, and as such not covered by a full experiment, I feel the end-to-end training of a neural model should at least be mentioned as part of possible future work, especially since resource efficiency is part of RQ1.

Review #4
By Barbara Heinisch submitted on 23/Nov/2025
Suggestion:
Minor Revision
Review Comment:

The manuscript Spanish Triple-to-Text Benchmark on Low-Resource Large Language Models presents the first Spanish adaptation of the WebNLG dataset, providing over 45,000 verbalised DBpedia triple sets to support Spanish knowledge-graph verbalisation and related tasks. Using this resource, the authors evaluate several resource-efficient large language models (LLMs) and show that, despite the challenges of generating fluent and accurate Spanish text, partial fine-tuning substantially improves performance compared to zero-, one- and few-shot approaches.
(1) Originality
The manuscript makes an original contribution in several important respects. First, it presents the Spanish adaptation of the WebNLG dataset (addressing a gap in the field, as no comparable Spanish resource previously existed). Second, the work is methodologically innovative in that it systematically combines and compares a wide range of evaluation metrics commonly used in the NLP community, thereby offering valuable insights into the strengths and limitations of current assessment practices.
(2) Significance of the results
Given the scarcity of high-quality Spanish resources for knowledge-graph verbalisation, the results presented are both significant and timely. The study provides an empirically grounded benchmark and offers a meaningful foundation for future research on Spanish triple-to-text generation, particularly with respect to resource-efficient model architectures.
(3) Quality of writing
The manuscript is clearly written and generally easy to follow. The methodological workflow is well explained, and the authors provide a convincing rationale for selecting the WebNLG dataset as their basis. The figures (especially the three-stage workflow diagram) effectively support comprehension and improve the overall readability of the paper.

Despite these strengths, several aspects of the methodology and dataset creation would benefit from further elaboration:
• The manuscript does not specify which machine translation tool was used for the initial automatic translation step. Moreover, it remains unclear how many individuals revised the MT output (is this the same person mentioned later as "The manual revision was carried out by a native Spanish speaker with formal training in English proficiency"?), what qualifications they held, and whether inter-annotator agreement was measured or assessed.
• When no Spanish DBpedia label or alias was available, the authors relied on MT output. It would be useful to specify the proportion of such cases and to explain how the quality of these machine-generated translations was evaluated before inclusion in the dataset.
• The authors state that a native Spanish speaker with formal English training revised the triple instances, but it should be clarified whether this revision was performed by a single annotator or multiple annotators, and how consistency was ensured.
• Although Spanish is one of the most widely spoken languages globally, the dataset appears to rely predominantly on Peninsular Spanish. This raises concerns regarding representativeness and the exclusion of major Latin American varieties. How do the authors account for variation, and what steps were taken to avoid a Eurocentric bias in lexical or syntactic choices?
• The authors note that generating all possible variants in Spanish is infeasible and that they addressed this limitation by translating all available verbalizations from the English dataset “to capture a broader range of expressions”. However, it remains unclear how this broader range was determined, what criteria governed inclusion or exclusion of certain expressions and how the authors assessed the adequacy of this coverage.
• Additional information on how coherence between translated triples and their corresponding verbalisations was ensured would strengthen confidence in the dataset’s reliability.
• Regarding the figures, for consistency and improved interpretability, the y-axis in Figure 4 could begin at zero, as is also the case in Figure 5.
However, there is also some praise for the authors:
• I commend the authors for deliberately choosing resource-efficient models, thereby reducing the environmental impact of their experiments compared with typical large-scale model evaluations.
• The breadth of evaluation metrics used and compared is impressive and enhances the methodological contribution of the work. The study convincingly shows that multiple evaluation metrics and model-specific strengths must be considered when assessing multilingual performance.
• The manuscript is supported by a robust and up-to-date reference list, covering relevant literature across the necessary subfields.
• The data and code are provided both on Zenodo and GitHub. They are well structured and include a clear README, specified requirements, the codebase, the relevant data and the corresponding results.
• The provided resources, especially the data seem to be complete (for the replication of experiments).
• The chosen repositories (GitHub and Zenodo) are appropriate for long-term discoverability and archiving as well as for reaching different communities (including computational linguistics and the NLP community).