Evaluating Large Language Models for RDF Knowledge Graph Related Tasks - The LLM-KG-Bench-Framework 3

Tracking #: 3994-5208

Authors: 
Lars-Peter Meyer
Johannes Frey
Felix Brei
Desiree Heim
Sabine Gründer-Fahrer
Sara Todorovikj
Claus Stadler
Markus Schröder
Natanael Arndt
Michael Martin

Responsible editor: 
Guest Editors 2025 LLM GenAI KGs

Submission type: 
Full Paper
Abstract: 
Current Large Language Models (LLMs) can work with structured information and even assist in developing program code, but can they support working with Knowledge Graphs (KGs) as well? Which LLM offers the best capabilities in the field of Semantic Web and Knowledge Graph Engineering (KGE)? Is it possible to determine this without manually checking many answers? The LLM-KG-Bench framework is designed to answer these questions. It consists of an extensible set of tasks for which the LLM answers are automatically evaluated, and covers different aspects of working with semantic technologies. This article gives a description of the LLM-KG-Bench framework, its main concepts, and the tasks implemented. In a benchmark run, a comprehensive dataset has been generated with it, evaluating more than 40 contemporary open and proprietary LLMs with 26 benchmark tasks, resulting in interaction logs and evaluations of roughly 45 000 LLM task dialogues. Finally, this dataset is used for an analysis of the SPARQL-related capabilities of the LLMs tested.
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Review #1
By Michael Röder submitted on 21/Feb/2026
Suggestion:
Accept
Review Comment:

Since this is a new version of a paper that I reviewed before, I am going to focus on the points that were raised back then.

# Publication Summary

The paper presents the current state of the benchmarking framework LLM-KG-Bench, which has been created for automatically assessing and comparing Large Language Models (LLMs) with respect to their capabilities to work with Semantic Web technologies. Within this work, the authors focus on evaluations related to the LLMs' capabilities to process and generate SPARQL SELECT queries and RDF data. The framework is described in detail in Section 3. The authors show the usefulness of the evaluation framework in Section 4 by comparing the performance of 41 LLMs in several tasks and draw conclusions from these results, e.g., which RDF format certain LLMs prefer.

This paper is an extended version of Lars-Peter Meyer, Johannes Frey, Desiree Heim, Felix Brei, Claus Stadler, Kurt Junghanns, and Michael Martin: "LLM-KG-Bench 3.0: A Compass for Semantic Technology Capabilities in the Ocean of LLMs", published at ESWC 2025. In comparison to the previous publication, the authors increased the number of LLMs that they evaluate and enhanced the analysis of the evaluation results.

# Review Summary

## Originality

There are several works that look at the performance of LLMs on tasks related to knowledge graphs. The submitted work itself lists several related articles that evaluate LLMs in similar ways. However, it seems like the presented work provides a large set of different SPARQL- and RDF-related tasks, includes connectors to a large number of LLMs, and offers automatic evaluations. The latter point is especially important as manual or crowd-based evaluations remain costly.

## Significance of the Results

The authors present some significant insights. They are not only able to compare the performance of the evaluated LLMs for a single task but also show that further insights can be gathered based on their framework. It is also pointed out that all evaluation data is collected and made available for further analysis. In addition, intermediate results (i.e., the answers of the LLMs) are stored by the framework and can be evaluated again in case further analysis methods are implemented.

## Quality of Writing

The presentation of the paper has been improved compared to the previous version. From my understanding, the shortcomings that have been pointed out by the reviewers have been addressed by the authors.

## Open Science Data

The repeatability of the experiments seems to be good. The framework itself is hosted as an open-source project on GitHub and has a DOI on Zenodo. The installation and usage of the framework are described in the readme file. A list of existing tasks has been added, completing the parts of the documentation that I missed in the previous version.
The experiment results are shared on GitHub in a separate project. The project structure is documented in the readme file.

## Conclusion

In my humble opinion, the quality of the submission has been improved and the article should be accepted.

### Typos, etc.
- Several times, a whitespace is missing in front of a parenthesis. Some examples are:
-- p 4: "Turtle(TTL)"
-- Table 1: "many("
-- p 10: "connectors(or model connectors)"
-- p 15: "parameters size(targeted task size), knowsCount(number of incoming edges for normal nodes)"
-- p 20: "task dialogues(task"
- p 5: "Javascript,and" --> "Javascript, and"
- I appreciate the addition of the score explanations in Section 3.4. They help a lot in understanding the tasks better. However, the pattern "Most important score: <score name>. See explanation above." leads to incomplete sentences since a verb is missing. It might be better to formulate "The most important score(s) is/are <score name>." or "The most important score is the previously explained <score name>."

Review #2
Anonymous submitted on 20/Mar/2026
Suggestion:
Accept
Review Comment:

The authors have carefully addressed the concerns I raised in the previous review, and the manuscript has improved as a result.

The revised version now clarifies the scope and positioning of the benchmark with respect to real-world knowledge graph usage, including a more balanced Discussion and Outlook section on the role of LLMs relative to deterministic graph database systems. The expanded discussion of the contributions and related work also better articulates the novelty of the framework relative to prior publications by the same authors, which was one of my earlier concerns.

The clarifications added to the previously ambiguous passages (e.g., regarding format comparisons, references to external benchmarks, and the motivation for compatibility with BigBench task interfaces) improve readability and make the design choices of the framework clearer.

Overall, I am satisfied that the authors have addressed the requested revisions, and I believe the manuscript is now suitable for publication.

Review #3
By Bohui Zhang submitted on 23/Mar/2026
Suggestion:
Accept
Review Comment:

Thank you for the revision. The paper has improved noticeably, and most of my previous concerns have been addressed to a satisfactory extent. In particular, the revised version is clearer in its presentation of the benchmark design, task descriptions, experimental setup, and result reporting.

Overall, I believe the paper now makes a solid contribution to the area of LLM evaluation for knowledge graph-related tasks, and I recommend acceptance.