Review Comment:
This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.
The paper is a revision of a previous paper, and it is not entirely
clear to me what has changed compared to the previous version. In my
last review, my single point concerned the evaluation metrics and the
unusual use of ROUGE. As the authors note in their response, ROUGE has
been used with the WebNLG dataset, but there it was used for
graph-to-text evaluation. My concern remains that in a typical
knowledge graph the individual triples are not ordered; the order of
the triples in the reference data is somewhat arbitrary. When ROUGE-2
and ROUGE-L are used for evaluation, the reported results will depend
on this arbitrary ordering. Specifically, the columns R-2 and R-L in
Table 4, Figure 2, Figure 3 and Figure 4 will change depending on the
ordering of the triples for a specific prediction, if my understanding
of the method is correct.
As an example, consider the following "result" with two generated
triples in different orderings. In the first case the ordering is the
same between the target and the prediction, and all ROUGE scores are 1.
In the second case the ordering has changed, and the ROUGE-2 score
drops to 0.8 while the ROUGE-L score drops to 0.5. This is a
considerable difference.
>>> import evaluate
>>> rouge = evaluate.load('rouge')
>>> rouge.compute(predictions=["[(dog # isA # animal) | (Rome # CapitalOf # Italy)]"], references=["[(dog # isA # animal) | (Rome # CapitalOf # Italy)]"])
{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
>>> rouge.compute(predictions=["[(dog # isA # animal) | (Rome # CapitalOf # Italy)]"], references=["[(Rome # CapitalOf # Italy) | (dog # isA # animal)]"])
{'rouge1': 1.0, 'rouge2': 0.8000000000000002, 'rougeL': 0.5, 'rougeLsum': 0.5}
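To make concrete what an order-insensitive evaluation could look like, here is a minimal sketch of a set-based triple-level F1. The parser for the bracketed serialization is my own assumption about the format shown above, not the authors' code; the point is only that this score is identical for both orderings:

```python
def parse_triples(s: str) -> set:
    """Parse the bracketed serialization '[(a # b # c) | (d # e # f)]'
    into an (unordered) set of triples. Format is assumed from the
    examples above, not taken from the authors' implementation."""
    inner = s.strip().strip("[]")
    return {tuple(part.strip(" ()").split(" # ")) for part in inner.split("|")}

def triple_f1(prediction: str, reference: str) -> float:
    """F1 over sets of triples; invariant to the order of the triples."""
    pred, ref = parse_triples(prediction), parse_triples(reference)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred), overlap / len(ref)
    return 2 * p * r / (p + r)

a = "[(dog # isA # animal) | (Rome # CapitalOf # Italy)]"
b = "[(Rome # CapitalOf # Italy) | (dog # isA # animal)]"
print(triple_f1(a, a), triple_f1(a, b))  # both 1.0
```

Under such a metric the arbitrary ordering of reference triples would no longer affect the reported numbers.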
The authors do not seem to be concerned about this issue, yet it could
be a major one.
Rereading Section 2.3, I now also see that the equations and the
description of the equations are either questionable or unclear. The
ROUGE measures are not measures on sets, but on counts. An example with
"dog dog" predicted where the target is "dog" yields a ROUGE-1 score
that is not 1.
>>> rouge.compute(predictions=["dog dog"], references=["dog"])
{'rouge1': 0.6666666666666666, 'rouge2': 0.0, 'rougeL': 0.6666666666666666, 'rougeLsum': 0.6666666666666666}
If ROUGE-1 were based on sets of unigrams, we would have expected it
to be 1. Either the notation and explanation are unclear or they are
wrong, or I am fundamentally misunderstanding how the evaluation is
done.
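To spell out the count-based definition behind the output above, here is a minimal sketch of ROUGE-1 F1 computed from clipped unigram counts rather than sets (my own illustration, not the authors' formula):

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """ROUGE-1 F1 from clipped unigram counts (not sets)."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: each reference token is matched at most as many
    # times as it occurs in the reference.
    overlap = sum(min(c, ref_counts[t]) for t, c in pred_counts.items())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("dog dog", "dog"), 4))  # 0.6667
```

With counts, the second "dog" in the prediction hurts precision (1/2) even though recall is 1, giving F1 = 2/3, which matches the evaluate output above; with sets, the score would have been 1.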
Minor issue:
Page 3, line 14: unresolved reference