On General and Biomedical Text-to-Graph Large Language Models

Tracking #: 3808-5022

Authors: 
Lorenzo Bertolini
Roel Hulsman
Sergio Consoli
Antonio Puertas Gallardo
Mario Ceresa

Responsible editor: 
Guest Editors KG Gen from Text 2023

Submission type: 
Full Paper

Abstract: 
Knowledge graphs and ontologies represent symbolic and factual information that can offer structured and interpretable knowledge. Extracting and manipulating this type of information is a crucial step in complex processes. While Large Language Models (LLMs) are known to be useful for extracting and enriching knowledge graphs and ontologies, previous work has largely focused on comparing architecture-specific models (e.g. encoder-decoder only) across benchmarks from similar domains. In this work, we provide a large-scale comparison of the performance of certain LLM features (e.g. model architecture and size) and task learning methods (fine-tuning vs. in-context learning (iCL)) on text-to-graph benchmarks in two domains, namely the general and biomedical ones. Experiments suggest that, in the general domain, small fine-tuned encoder-decoder models and mid-sized decoder-only models used with iCL reach overall comparable performance with high entity and relation recognition and moderate yet encouraging graph completion. Our results further tentatively suggest that, independent of other factors, biomedical knowledge graphs are notably harder to learn and better modelled by small fine-tuned encoder-decoder architectures. Pertaining to iCL, we analyse hallucinating behaviour related to sub-optimal prompt design, suggesting an efficient alternative to prompt engineering and prompt tuning for tasks with structured model output.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Fidel Jiomekong submitted on 11/Mar/2025
Suggestion:
Accept
Review Comment:

The previous review was only about proofreading of the manuscript, and the authors addressed those comments.

Review #2
Anonymous submitted on 03/Apr/2025
Suggestion:
Accept
Review Comment:

The authors have satisfactorily addressed all my comments. I believe the paper is now ready for acceptance.

Review #3
By Finn Årup Nielsen submitted on 11/Apr/2025
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

The paper is a revision of a previous paper. It is not entirely clear
to me what has been changed compared to the previous version. From the
last review my single point was concerning the evaluation metrics and
the unusual use of ROUGE. As the authors note in their response ROUGE
has been used in the WebNLG dataset. But this has been for
graph-to-text evaluation. My concern is still that
in a typical knowledge graph the individual triples are not
ordered. The order of triples of the reference data is somewhat
arbitrary. When ROUGE-2 and ROUGE-L are used for evaluation, the
reported result will depend on this arbitrary
ordering. Specifically, the columns R-2 and R-L in Table 4, Figure 2,
Figure 3 and Figure 4 will change depending on the ordering of the
triples for a specific prediction, if my understanding of the method
is correct.

As an example, consider the following "result" with two generated
triples in different orderings. In the first case, the ordering is the same
between the target and the prediction, and the ROUGE scores are 1.
In the second case, the ordering has changed, and the ROUGE-2 score is
now 0.8 while the ROUGE-L score is now 0.5. This is a considerable
difference.

>>> rouge = evaluate.load('rouge')
>>> rouge.compute(predictions=["[(dog # isA # animal) | (Rome # CapitalOf # Italy)]"], references=["[(dog # isA # animal) | (Rome # CapitalOf # Italy)]"])
{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}

>>> rouge.compute(predictions=["[(dog # isA # animal) | (Rome # CapitalOf # Italy)]"], references=["[(Rome # CapitalOf # Italy) | (dog # isA # animal)]"])
{'rouge1': 1.0, 'rouge2': 0.8000000000000002, 'rougeL': 0.5, 'rougeLsum': 0.5}

The authors do not seem concerned about this issue, yet it could
be a major one.
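An order-invariant alternative would be to score predictions at the triple level rather than over the serialized string. A minimal sketch, assuming the "[(h # r # t) | (h # r # t)]" serialization shown in the example above (the function names `parse_triples` and `triple_f1` are mine, not the authors'):

```python
def parse_triples(s):
    """Parse the '[(h # r # t) | (h # r # t)]' serialization into a set of triples."""
    triples = set()
    for part in s.strip().strip('[]').split('|'):
        h, r, t = [x.strip() for x in part.strip().strip('()').split('#')]
        triples.add((h, r, t))
    return triples

def triple_f1(prediction, reference):
    """Order-invariant F1 over the sets of predicted and reference triples."""
    pred, ref = parse_triples(prediction), parse_triples(reference)
    tp = len(pred & ref)  # triples present in both, regardless of ordering
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

same = "[(dog # isA # animal) | (Rome # CapitalOf # Italy)]"
swapped = "[(Rome # CapitalOf # Italy) | (dog # isA # animal)]"
print(triple_f1(same, swapped))  # 1.0 regardless of triple ordering
```

Under such a metric the two orderings in the example above receive identical scores, which removes the dependence on the arbitrary ordering of reference triples.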

Rereading Section 2.3, I now also see that the equations and
their descriptions are either questionable or unclear. The
ROUGE measures are not measures on sets, but on counts. An example with
"dog dog" predicted where the target is "dog" shows a ROUGE-1 score
that is not 1.

>>> rouge.compute(predictions=["dog dog"], references=["dog"])
{'rouge1': 0.6666666666666666, 'rouge2': 0.0, 'rougeL': 0.6666666666666666, 'rougeLsum': 0.6666666666666666}

If ROUGE-1 were based on sets of unigrams, we would have expected it
to be 1. Either the notation and explanation are unclear or
wrong, or I am fundamentally misunderstanding how the evaluation is
done.
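The counts-based behaviour can be reproduced directly: ROUGE-1 is computed on clipped unigram counts (multisets), so the repeated "dog" in the prediction lowers precision. A small sketch (the function name `rouge1_f` is mine):

```python
from collections import Counter

def rouge1_f(prediction, reference):
    """ROUGE-1 F-measure on clipped unigram counts (multisets), not sets."""
    p, r = Counter(prediction.split()), Counter(reference.split())
    overlap = sum(min(p[w], r[w]) for w in p)  # clipped counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f("dog dog", "dog"))  # 0.666... — matching the library output above
```

Here the prediction "dog dog" gets precision 1/2 and recall 1/1, hence F = 2/3, exactly the 0.6666… reported by the `evaluate` library; a set-based definition would have yielded 1.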

Minor issue:

Page 3, line 14: unresolved reference