Review Comment:
Since this is a new version of a paper that I reviewed before, I am going to focus on the points that were raised back then.
# Publication Summary
The paper presents the current state of the benchmarking framework LLM-KG-Bench, which has been created for automatically assessing and comparing Large Language Models (LLMs) with respect to their capabilities to work with Semantic Web technologies. Within this work, the authors focus on evaluations related to the LLMs' capabilities to process and generate SPARQL SELECT queries and RDF data. The framework is described in detail in Section 3. The authors show the usefulness of the evaluation framework in Section 4 by comparing the performance of 41 LLMs on several tasks and draw conclusions from these results, e.g., which RDF format certain LLMs prefer.
This paper is an extended version of Lars-Peter Meyer, Johannes Frey, Desiree Heim, Felix Brei, Claus Stadler, Kurt Junghanns, and Michael Martin: "LLM-KG-Bench 3.0: A Compass for Semantic Technology Capabilities in the Ocean of LLMs", published at ESWC 2025. In comparison to the previous publication, the authors increased the number of evaluated LLMs and enhanced the analysis of the evaluation results.
# Review Summary
## Originality
There are several works that look at the performance of LLMs including tasks related to knowledge graphs. The submitted work itself lists several related articles that evaluate LLMs in similar ways. However, it seems like the presented work provides a large set of different SPARQL- and RDF-related tasks, includes connectors to a large number of LLMs and offers automatic evaluations. The latter point is especially important as manual or crowd-based evaluations remain costly.
## Significance of the Results
The authors present several significant insights. They are not only able to compare the performance of the evaluated LLMs on a single task but also show that their framework enables further insights to be gathered. It is also pointed out that all evaluation data is collected and made available for further analysis. In addition, intermediate results (i.e., the answers of the LLMs) are stored by the framework and can be evaluated again in case further analysis methods are implemented.
## Quality of Writing
The presentation of the paper has been improved compared to the previous version. From my understanding, the shortcomings that have been pointed out by the reviewers have been addressed by the authors.
## Open Science Data
The repeatability of the experiments seems to be good. The framework itself is hosted as an open-source project on GitHub and has a DOI on Zenodo. The installation and usage of the framework are described in the README file. A list of existing tasks has been added, completing the parts of the documentation that I missed in the previous version.
The experiment results are shared on GitHub in a separate project. The project structure is documented in the README file.
## Conclusion
In my humble opinion, the quality of the submission has been improved and the article should be accepted.
### Typos, etc.
- Several times, a whitespace is missing in front of parentheses. Some examples are:
  - p 4: "Turtle(TTL)"
  - Table 1: "many("
  - p 10: "connectors(or model connectors)"
  - p 15: "parameters size(targeted task size), knowsCount(number of incoming edges for normal nodes)"
  - p 20: "task dialogues(task"
- p 5: "Javascript,and" --> "Javascript, and"
- I appreciate the addition of the score explanations in Section 3.4; they help a lot in understanding the tasks better. However, the pattern "Most important score : See explanation above." leads to incomplete sentences since a verb is missing. It might be better to formulate "The most important score(s) is/are : ." or "The most important score is the previously explained ."