Extensive Benchmark of Small Language Models for Datatype Properties Extraction and RDF Knowledge Graph Generation

Tracking #: 3845-5059

This paper is currently under review
Authors: 
Célian Ringwald
Fabien Gandon
Catherine Faron
Franck Michel
Hanna Abi Akl

Responsible editor: 
Guest Editors 2025 LLM GenAI KGs

Submission type: 
Full Paper
Abstract: 
The representation chosen for the inputs and outputs of generative pre-trained language models (PLMs) can impact their fine-tuning on a new task. This article focuses on the fine-tuning and linearization process used to generate facts extracted from text. On a restricted relation extraction (RE) task, we challenged five encoder-decoder models, namely BART, T5, CodeT5, FlanT5 and PileT5, by fine-tuning them on 13 linearization variants, including standard RDF syntaxes and variations thereof. Our benchmark covers the validity of the produced triples, the models' performance, their training behaviour and the resources needed. We show that these PLMs learn some syntaxes more easily than others, and we identify a promising "Turtle Light" syntax supporting quick and robust learning of the RE task.
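As a minimal sketch of what a linearization target for such a fine-tuned model could look like, the following Python snippet serializes one extracted datatype-property fact as standard Turtle with rdflib. The entity, property and namespace are placeholders chosen for illustration; the paper's 13 linearization variants, including the "Turtle Light" syntax, are defined in the article itself.

```python
# Hypothetical illustration (not taken from the paper): one extracted fact
# serialized as standard Turtle, i.e. the kind of target string a
# sequence-to-sequence PLM would be fine-tuned to generate.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")  # placeholder namespace

g = Graph()
g.bind("ex", EX)

# A single datatype-property triple, e.g. a birth date extracted from text.
g.add((EX["Ada_Lovelace"], EX["birthDate"],
       Literal("1815-12-10", datatype=XSD.date)))

print(g.serialize(format="turtle"))
```

Lighter variants of such a serialization would typically shorten the target sequence the model has to produce; how exactly the paper's variants differ from standard Turtle is specified in the full article.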
Tags: 
Under Review