Spanish Triple-to-Text Benchmark on Low-Resource Large Language Models

Tracking #: 3993-5207

Authors: 
Virginia Ramon-Ferrer
Carlos Badenes-Olmedo
Oscar Corcho

Responsible editor: 
Blerina Spahiu

Submission type: 
Full Paper
Abstract: 
The verbalisation of structured data benefits several applications. In the context of knowledge graphs (KGs), transforming RDF triples into natural language facilitates tasks such as KG documentation and alternative exploration methods for different user needs. While significant progress has been made on the English verbalisation of KGs, Spanish remains an under-represented language for this task due to the lack of suitable resources, which hinders the development and evaluation of models capable of generating high-quality Spanish verbalisations. To tackle this problem, we create a Spanish adaptation of the WebNLG dataset, a benchmark consisting of over 45,000 verbalisations paired with DBpedia triple sets. To our knowledge, this is the first formal attempt to provide such a dataset in Spanish; beyond data verbalisation, it can also potentially support the automated generation of RDF triples from text. We leverage this dataset to conduct a comprehensive evaluation of resource-efficient models on the Spanish triple-to-text task using two learning approaches: in-context learning (zero-shot, one-shot, and few-shot settings) and supervised learning through partial fine-tuning. Our results highlight the challenges of generating fluent and accurate Spanish text and demonstrate that partial fine-tuning of the evaluated models significantly improves performance.
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Review #1
Anonymous submitted on 25/Jan/2026
Suggestion:
Accept
Review Comment:

This paper presents a well-motivated and timely study on Spanish triple-to-text generation using resource-efficient large language models. The creation of the Spanish WebNLG dataset and the systematic evaluation of prompt-based and fine-tuned approaches constitute valuable contributions to multilingual data-to-text generation, particularly for under-represented languages such as Spanish.

Strengths:

One of the main strengths of the work lies in the development of the Spanish WebNLG dataset through a semi-supervised pipeline followed by manual revision. This effort addresses a significant resource gap and enables both generation and potential reverse (text-to-triples) applications. The authors clearly articulate their research questions and design experiments that directly address them, offering a structured and coherent narrative throughout the paper.

The experimental findings convincingly demonstrate that contextualisation (one-shot prompting) and parameter-efficient fine-tuning substantially improve performance over zero-shot settings. The analysis of different models provides useful practical insights for model selection in low-resource scenarios, particularly the strong performance of Qwen2.5-1.5B-Instruct and the contrasting behaviour of Llama-3.2-1B-Instruct across languages. The multilingual and error analyses further strengthen the study by showing that linguistic properties such as morphological richness and syntactic flexibility impact both model behaviour and metric reliability.

The discussion is thorough and well-aligned with the results, and the conclusions appropriately reflect the scope of the experiments. The authors’ emphasis on language-specific evaluation and adaptation strategies is particularly relevant for multilingual NLG research.

Weaknesses:
While the dataset creation process is carefully described, the paper would benefit from a more detailed quantitative and qualitative analysis of the manual revision stage (e.g., inter-annotator agreement or explicit error categories). This would improve transparency regarding the quality of the final corpus and help assess the reliability of the semi-supervised pipeline. The study, and the wider community, would also benefit from the "internal guidelines" for evaluation being made public, with examples and descriptions (e.g., on GitHub).

Overall, the paper makes a solid contribution to multilingual data-to-text generation and provides practical guidance for adapting resource-efficient models to underrepresented languages. With minor improvements in evaluation, the work would be even stronger.

Review #2
By Gennaro Nolano submitted on 03/Feb/2026
Suggestion:
Accept
Review Comment:

I have checked the new version against my previous review, and the authors have addressed most of the points I raised.

As such, I have no more issues with the paper, and I think it can be published without further changes.

Review #3
By Barbara Heinisch submitted on 10/Feb/2026
Suggestion:
Accept
Review Comment:

As mentioned in my first review, the manuscript makes an original contribution in several respects, including the use of several (not only one) evaluation metrics commonly used in the NLP community and addressing the scarcity of high-quality Spanish resources for knowledge-graph verbalisation. Since the majority of the comments from the first review were taken into account in this version (including the limitations), the current manuscript provides a more comprehensive picture.