Evaluating Large Language Models for RDF Knowledge Graph Related Tasks - The LLM-KG-Bench-Framework 3

Tracking #: 3994-5208

Authors: 
Lars-Peter Meyer
Johannes Frey
Felix Brei
Desiree Heim
Sabine Gründer-Fahrer
Sara Todorovikj
Claus Stadler
Markus Schröder
Natanael Arndt
Michael Martin

Responsible editor: 
Guest Editors 2025 LLM GenAI KGs

Submission type: 
Full Paper
Abstract: 
Current Large Language Models (LLMs) can work with structured information and even assist in developing program code, but can they support working with Knowledge Graphs (KGs) as well? Which LLM offers the best capabilities in the field of Semantic Web and Knowledge Graph Engineering (KGE)? Is it possible to determine this without manually checking many answers? The LLM-KG-Bench framework is designed to answer these questions. It consists of an extensible set of tasks for which the LLM answers are automatically evaluated, and covers different aspects of working with semantic technologies. This article gives a description of the LLM-KG-Bench framework, its main concepts, and the tasks implemented. In a benchmark run, a comprehensive dataset has been generated with it, evaluating more than 40 contemporary open and proprietary LLMs with 26 benchmark tasks, resulting in interaction logs and evaluations of roughly 45 000 LLM task dialogues. Finally, this dataset is used for an analysis of the SPARQL-related capabilities of the LLMs tested.
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Review #1
By Michael Röder submitted on 21/Feb/2026
Suggestion:
Accept
Review Comment:

Since this is a new version of a paper that I reviewed before, I am going to focus on the points that were raised back then.

# Publication Summary

The paper presents the current state of the benchmarking framework LLM-KG-Bench, which has been created for automatically assessing and comparing Large Language Models (LLMs) with respect to their capabilities to work with Semantic Web technologies. Within this work, the authors focus on evaluations related to the LLMs' capabilities to process and generate SPARQL SELECT queries and RDF data. The framework is described in detail in Section 3. The authors show the usefulness of the evaluation framework in Section 4 by comparing the performance of 41 LLMs in several tasks and draw conclusions from these results, e.g., which RDF format certain LLMs prefer.

This paper is an extended version of Lars-Peter Meyer, Johannes Frey, Desiree Heim, Felix Brei, Claus Stadler, Kurt Junghanns, and Michael Martin: "LLM-KG-Bench 3.0: A Compass for Semantic Technology Capabilities in the Ocean of LLMs", published at ESWC 2025. In comparison to the previous publication, the authors increased the number of LLMs that they evaluate and enhanced the analysis of the evaluation results.

# Review Summary

## Originality

There are several works that look at the performance of LLMs on tasks related to knowledge graphs. The submitted work itself lists several related articles that evaluate LLMs in similar ways. However, it seems like the presented work provides a large set of different SPARQL- and RDF-related tasks, includes connectors to a large number of LLMs, and offers automatic evaluations. The latter point is especially important as manual or crowd-based evaluations remain costly.

## Significance of the Results

The authors present some significant insights. They are not only able to compare the performance of the evaluated LLMs for a single task but also show that further insights can be gathered based on their framework. It is also pointed out that all evaluation data is collected and made available for further analysis. In addition, intermediate results (i.e., the answers of the LLMs) are stored by the framework and can be evaluated again in case further analysis methods are implemented.

## Quality of Writing

The presentation of the paper has been improved compared to the previous version. From my understanding, the shortcomings that have been pointed out by the reviewers have been addressed by the authors.

## Open Science Data

The repeatability of the experiments seems to be good. The framework itself is hosted as an open-source project on GitHub and has a DOI on Zenodo. The installation and usage of the framework are described in the readme file. A list of existing tasks has been added, completing the parts of the documentation that I missed in the previous version.
The experiment results are shared on GitHub in a separate project. The project structure is documented in the readme file.

## Conclusion

In my humble opinion, the quality of the submission has been improved and the article should be accepted.

### Typos, etc.
- Several times, a whitespace is missing in front of a parenthesis. Some examples are:
-- p 4: "Turtle(TTL)"
-- Table 1: "many("
-- p 10: "connectors(or model connectors)"
-- p 15: "parameters size(targeted task size), knowsCount(number of incoming edges for normal nodes)"
-- p 20: "task dialogues(task"
- p 5: "Javascript,and" --> "Javascript, and"
- I appreciate the addition of the score explanations in Section 3.4. They help a lot in understanding the tasks better. However, the pattern "Most important score: <score name>. See explanation above." leads to incomplete sentences since a verb is missing. It might be better to formulate "The most important score(s) is/are <score name>." or "The most important score is the previously explained <score name>."

Review #2
Anonymous submitted on 20/Mar/2026
Suggestion:
Accept
Review Comment:

The authors have carefully addressed the concerns I raised in the previous review, and the manuscript has improved as a result.

The revised version now clarifies the scope and positioning of the benchmark with respect to real-world knowledge graph usage, including a more balanced Discussion and Outlook section on the role of LLMs relative to deterministic graph database systems. The expanded discussion of the contributions and related work also better articulates the novelty of the framework relative to prior publications by the same authors, which was one of my earlier concerns.

The clarifications added to the previously ambiguous passages (e.g., regarding format comparisons, references to external benchmarks, and the motivation for compatibility with BigBench task interfaces) improve readability and make the design choices of the framework clearer.

Overall, I am satisfied that the authors have addressed the requested revisions, and I believe the manuscript is now suitable for publication.

Review #3
By Bohui Zhang submitted on 23/Mar/2026
Suggestion:
Accept
Review Comment:

Thank you for the revision. The paper has improved noticeably, and most of my previous concerns have been addressed to a satisfactory extent. In particular, the revised version is clearer in its presentation of the benchmark design, task descriptions, experimental setup, and result reporting.

Overall, I believe the paper now makes a solid contribution to the area of LLM evaluation for knowledge graph-related tasks, and I recommend acceptance.