Graph RAG in the Wild: Insights and Best Practices from Real-World Applications

Tracking #: 3862-5076

Authors: 
Diego Collarana Vargas
Christopher Ingo Pack
Yan-Ying Liao
Marlena Flüh
Jonathan Lehmkuhl
Abhishek Nageri
Alexander Graß
Moritz Busch
Prinon Das
Lara Dingels
Stefan Decker
Christian Beecks

Responsible editor: 
Harald Sack

Submission type: 
Full Paper
Abstract: 
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by enabling access to external knowledge without retraining. While effective, traditional RAG methods, typically reliant on vector-based retrieval, face limitations in understanding complex semantics, connecting dispersed information, and supporting user-centric search workflows. Graph Retrieval-Augmented Generation (Graph RAG) addresses these challenges by incorporating knowledge graphs into the retrieval process, enabling semantically enriched and structured query handling. This paper explores the application of Graph RAG across seven real-world domains, including legal compliance, customer support, enterprise knowledge management, finance, education, data protection enforcement, and time series analytics. For each use case, we outline the distinct challenges, solutions, and design decisions made. In addition, we introduce a modular Graph RAG Engine to support ingestion, graph construction, hybrid retrieval, and LLM orchestration. We present empirical evidence demonstrating improvements in accuracy, latency, and user trust, and offer a practical design playbook for making schema choices, selecting retrieval strategies, and constructing prompts. Additionally, we address cross-domain challenges such as graph drift and evaluation strategies. These contributions aim to guide researchers and practitioners beyond traditional RAG and to inspire further research at the intersection of generative AI and knowledge graphs.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 21/Aug/2025
Suggestion:
Minor Revision
Review Comment:

This paper presents an overview of Graph RAG applied in seven scenarios from Fraunhofer partners.
Given the number of authors and scenarios, the paper is written consistently and shows how Graph RAG is applied in each of the presented use cases.
The paper is well structured and easy to follow.

Overall, I like the idea of presenting an overview of how Graph RAG can be applied, which options exist in each step, and providing some best practices to guide both practitioners and research.
Below I have some proposals on how to further improve the paper:

In the introduction (page 2, lines 23–40), the authors introduce the different use cases and refer to the sections.
Already at that stage, it would be helpful to know that sections 4 and 5 are different because of the schema vs. schemaless scenarios.
In line 29, the WFW Assistant is introduced. Since this assistant appears for the first time, the full name should be used.

The background section covers LLMs and their limitations.
I also like the detailed introduction to LLMs, especially the differentiation between encoder and decoder models.
Afterwards, the different RAG-based approaches are explained.
Already showing the limitations of traditional RAG-based systems motivates the reader to find out how Graph RAG can further improve the results.
One related work that could already be introduced here is the GraphRAG approach developed by Microsoft — especially because it is also used later on in one of the scenarios.

In Section 3, the Graph RAG Engine is introduced.
Figure 1 is a nice overview and helps to understand where decisions need to be taken and which options exist.
One possible improvement for this figure would be to show a smaller graph as the output of step 2.
Furthermore, the graph formats in step 3 should already be determined before the prompt is generated.
Maybe the prompt box can be an additional input to step 3, which is the direct successor of step 2 (with the smaller graph in between).
After the prompt (containing the graph) is generated, only the model choice is left.

In Section 4, the scenarios without a schema are discussed.
The structure of presenting first the motivation, realization, and then the evaluation is very good.
Sometimes, the text stays at a very high level without many details.
As an example, in Section 4.1.2 line 44:
"This iterative, self-correcting process continues until a satisfactory answer is produced or all retrieval options are exhausted."
It is unclear what a satisfactory answer is and how it is determined. Is it a human judgment or an automatic decision?
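
If the decision is automatic, the criterion could be as simple as a judge-score threshold. A minimal sketch of such a loop (all components are hypothetical and passed in as callables; the paper does not state which variant is used):

```python
def answer_with_retries(query, strategies, retrieve, generate, grade, threshold=0.8):
    """Try retrieval strategies in order until the graded answer clears the bar.
    `retrieve`, `generate`, and `grade` are placeholder callables, not the
    authors' actual components; the cutoff is one possible 'satisfactory' rule."""
    best_answer, best_score = None, 0.0
    for strategy in strategies:
        context = retrieve(query, strategy)    # e.g. vector, local graph, global graph
        answer = generate(query, context)      # LLM call
        score = grade(query, answer, context)  # automatic judge score in [0, 1]
        if score > best_score:
            best_answer, best_score = answer, score
        if score >= threshold:                 # "satisfactory" as an explicit cutoff
            break                              # otherwise: retrieval options exhausted
    return best_answer, best_score
```

Stating which of these pieces is automatic and which (if any) involves a human would clear this up.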

In Figure 2, the left image of the graph does not really show anything. In that case, it might be better to leave it out. For the right-hand part of the figure, it would be good to reference it somewhere in the text.
Figure 3 on page 10 is very similar. I do not see any information that helps the reader further understand the use case. Thus, I would rather exclude it and leave the space for more detailed explanations of the evaluation.
The description of Table 1 should be formatted similarly to the figures (the text should directly follow after the table number).

In Section 4.1.3 the authors mention that they use RAGAS for the evaluation.
A bit more detailed information on how the dataset is automatically generated would be helpful to further understand the numbers — e.g., how many test examples, how the three metrics are computed, what the evaluation setup is, etc.
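
For orientation, a reproducible report would only need the dataset plus a standard RAGAS evaluate call. A minimal sketch, assuming the ragas 0.1-style API and an OpenAI-compatible backend (the paper's actual setup and data are unknown; the sample below is made up):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Hypothetical single-sample dataset; the paper's RAGAS-generated data is not public.
data = {
    "question": ["What does the WFW Assistant answer?"],
    "answer": ["It answers questions about Fraunhofer-internal expertise."],
    "contexts": [["The WFW Assistant retrieves Fraunhofer expertise from a KG ..."]],
    "ground_truth": ["It answers questions about Fraunhofer-internal expertise."],
}

# evaluate() calls an LLM backend (e.g. OPENAI_API_KEY must be set).
result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)
```

Reporting the dataset size and the per-metric aggregation from such a run would already answer most of the questions above.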

Section 5 shows the schema-first use cases.
The ABEL use case is easy to understand and well written.
For the GDPR use case, Figure 5 needs a higher resolution and/or a bit more space to make it readable.
Table 4 should also be referenced in the text, e.g., in 5.2.3 at the beginning of the paragraph: “The results show that …”
In the same use case, it would be interesting to know how the KPIs are actually modeled and which attributes exist (the structure is shown in 6(b) but no attributes for KPIs are listed).
For the data science agent use case, some more information on how the evaluation was performed would help.
In Table 6, most of the numbers can be cut off after a maximum of 2 decimal places (and filled with zeros).

Section 6 is a summary of insights and best practices, which covers the main points for such Graph RAG approaches.
From a scientific point of view, it would also be nice to further discuss the reproducibility of the results in a bit more detail—e.g., greedy token selection strategy (by setting the temperature to zero), using open-weight models (since proprietary models can be shut down at any point in time), etc.
For statements like "Bigger models don't always perform better for triple extraction", models from the same family should be compared to ensure a fair comparison, instead of comparing Gemma 2:27B with Llama 3.1:70B.
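
On the reproducibility point above, pinning the decoding is a one-line change in most clients. A sketch using the OpenAI-compatible chat API (the paper does not state which client or serving stack it uses):

```python
from openai import OpenAI

client = OpenAI()  # also works against local open-weight servers exposing this API

resp = client.chat.completions.create(
    model="gpt-4o",  # hypothetical; an open-weight model served via vLLM/Ollama is safer long-term
    messages=[{"role": "user", "content": "Extract (subject, predicate, object) triples from: ..."}],
    temperature=0,   # greedy token selection removes sampling randomness
    seed=42,         # best-effort determinism where the backend supports it
)
print(resp.choices[0].message.content)
```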

In Section 6.2 (line 46) the authors argue to “extract named entities first and combine them with the original query during retrieval.” Here I asked myself how this combination should be done. Some more sentences would be helpful.
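
One plausible reading, sketched purely as an illustration (graph_lookup and vector_search are hypothetical components, not the authors'):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is installed

def entity_augmented_retrieve(query, graph_lookup, vector_search, k=5):
    """Entities anchor graph lookups; the full query drives dense retrieval;
    both result sets are merged (and would typically be deduplicated/reranked)."""
    entities = [ent.text for ent in nlp(query).ents]
    graph_hits = [t for e in entities for t in graph_lookup(e)]  # triples touching the entities
    vector_hits = vector_search(query, k=k)                      # similarity search on the raw query
    return graph_hits + vector_hits
```

Stating explicitly whether the combination is concatenation into one query string or parallel retrieval with merging, as above, would resolve the ambiguity.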

It is nice to see in Section 6.3 that the authors also found that “code-like formats work best.” This has also been shown in various other works, and it is nice that this applies to Graph RAG as well.
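
For readers unfamiliar with that finding: "code-like" typically means serializing the retrieved subgraph as tuples or JSON rather than prose, e.g. (an illustrative example, not taken from the paper):

```python
# Triples rendered as Python-style tuples inside the prompt context.
triples = [
    ("WFW_Assistant", "developed_by", "Fraunhofer"),
    ("WFW_Assistant", "uses", "Graph_RAG"),
]
prompt_context = "\n".join(f"({s!r}, {p!r}, {o!r})" for s, p, o in triples)
print(prompt_context)
# ('WFW_Assistant', 'developed_by', 'Fraunhofer')
# ('WFW_Assistant', 'uses', 'Graph_RAG')
```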

Section 7 discusses the challenges of Graph RAG.
Table 7 shows that changing the prompts heavily impacts performance.
My question here is where the comparison actually is. Does it mean one should compare, e.g., the WFW column with Table 3 (the evaluation of WFW)?
If that is the case, it would be nice to have them in one table (also to show whether it is a comparison to local or global).
I also don’t understand what the green numbers represent: Is it the difference compared to Table 3?
Maybe some more words in the text describing what can be seen in the table would be helpful.

Lastly, in their conclusion and also throughout the paper, the authors talk about their Graph RAG Engine, which serves as a reference implementation. Is there any reference to the implementation that I might have missed?
If it refers to the general framework that the authors described in Figure 1 (and not to an actual implementation), I would slightly change the text.

Overall, the paper shows a lot of use cases of Graph RAG and discusses the design decisions that need to be taken.
If the proposed changes are added to the paper (especially a bit more information about the evaluation setup and result tables), the paper can be published.

Smaller improvements:

- page 6 line 49: "Since Graph the introduction of RAG" (typo)
- page 11 line 49, page 12 line 14: [citation] should be replaced by the actual citation
- all table descriptions should be formatted similarly to figures
- place all figures and tables at the top of the pages to improve reading flow (e.g., page 21 where lines 25–30 are just a small portion of text between two tables)
- data science agent use case (page 20): DataKnowledge, ScenarioKnowledge, and MethodKnowledge should include spaces

Review #2
Anonymous submitted on 15/Oct/2025
Suggestion:
Reject
Review Comment:

The submission type does not fit the actual content of this work. As a full paper, it lacks coherence because it does not present any novel contribution. I suggest changing the type to an application report. My score is based on the review guideline for a full paper type, which I am happy to change. In any case, I hope to provide helpful feedback for all types of submissions with my comments.

# Summary

The paper presents seven applications using Graph-RAG methods in heterogeneous domains (legal, finance, customer support, education, agriculture, enterprise knowledge management). After a brief explanation of RAG and Graph-RAG methods, an overview of the overall process is given for each application. Additionally, each application is individually evaluated, and minimal evaluation results, mostly comparing RAG and Graph-RAG, are discussed. Most applications focus on Knowledge Graph-based Graph-RAG approaches and, in particular, describe the knowledge organisation / KG construction step. The authors group their applications into two categories: construction using an existing schema (section 5) and constructions that do not use a prior schema (section 4). Based on the obtained results, an attempt is made to derive best practices and other insights, and general challenges of KG-RAG are identified.
# Strengths
* Overall, the paper is well written (in particular the introduction), follows a clear structure and motivates the presented use-cases of Graph-RAG quite well.
* Many heterogeneous applications are presented in sufficient detail to give a good overview of KG-RAG-based applications, which are especially interesting for practitioners.
* The authors make an effort to compare the use-cases and discover general insights to inform the development of further (Graph-)RAG applications.
# Weaknesses
* [Only for submission as full paper] The presented contribution has no scientific novelty and no comprehensive evaluation of all methods, i.e. this work does not contain any research contribution that could be reviewed along the dimensions of originality or significance of the results. While the overview of the seven applications is novel, the methods used are not. Additionally, the generated insights are (1) not sufficiently supported by the experiments and (2) already reported by related work (based on more detailed evaluations).
* The sections for the individual applications repeat some information frequently, which negatively influences the readability. Better aligning each application section with the overall paper could shorten the paper by avoiding these repetitions.
* The presented methods and evaluations lack many relevant details, which renders most evaluations useless and does not allow for replication.
# Comments
* The introduction works really well to motivate this work and gives a good overview of the work.
* What should be the novel aspect of the presented Graph-RAG engine? The stages are exactly the stages presented in [61]. If the authors want to claim some novelty in this description, they should explicitly point out the novelty. Additionally, if this engine is to support the understanding of each application, it is helpful to adhere to this framework when describing the applications. Currently, this is only loosely connected to the applications.
* Graph-based indexing is described from a pure KG construction perspective and misses the retrieval system perspective, i.e. it is not clear how the constructed KG is integrated into the overall retrieval system.
* The claim "However, KGs generated automatically from text using LLMs are prone to errors" (page 5, line 43-44) needs a reference. This statement also contradicts "With the emergence of LLMs, new end-to-end approaches for building KGs have become possible" (page 6, line 24). If LLMs are generally prone to errors in this task, it does not make sense to have end-to-end approaches. This might simply be an issue of ordering. First mentioning the end-to-end approaches and then remarking that these pipelines still produce some errors is more reasonable.
* Figure 2 (left) does not provide any helpful information. Yes, the dots and connections resemble a graph, but a more effective example would be one where the dots and edges are actually connected to some readable information.
* Is the RAGAS-generated dataset publicly available? What are the basic statistics of this data, e.g. number of samples, typical length, and size of the retrieved data? Confidence intervals might be helpful as well (see the bootstrap sketch after this comment list). That is relevant for all evaluations of this work. An evaluation based on unknown data cannot produce any reproducible, scientific results.
* Was the LLM-As-A-Judge framework reliable here? Was a subset manually evaluated?
* Figure 3 does not provide any helpful information. What should be the point of an anime-style picture of someone sitting in front of a computer? Everyone knows what this looks like (most likely by looking in the mirror). The picture does not give a source, which raises even bigger concerns (Is it taken from an anime? Is it AI-generated?). Instead of anonymising it, it can be removed. The right subfigure has the same issue as Figure 2.
* The financial report generation in section 5.3 is missing any evaluation results.
* "particularly in industries like agriculture, which are less connected to technology and have limited data science knowledge" (page 19, line 17-18) I am not sure if this is true. As far as I know, technology and data science methods are frequently used in the agricultural sector. Some evidence for this claim is needed.
* "and can only be achieved using Graph RAG on our time series knowledge graph" (page 20, line 46-47) Such statements are highly unscientific and should not be made.
* Tables 5 and 6 present an average response time, but this is only comparable if the same hardware is used for all models. For GPT-4o, Mixtral and Gemma 2 27B, this is not the case.
* "If you tell the LLM what you want to see in the graph, it will generate that based on the provided documents" (page 22, line 15-16). The presented applications did not discuss an evaluation which could actually support such a statement. Such a general statement (even with actual evidence for it) is not a good scientific style.
* The majority of "insights" described in section 6 do not seem to be connected to the performed experiments, e.g. the claim that bigger models do not necessarily perform better at triple extraction does not directly connect to the presented experiments (only 5.4 seems to consider multiple models, and in this case, only results for the whole pipeline are given).
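
Regarding the confidence intervals requested above, a percentile bootstrap over per-sample metric scores would suffice. A sketch with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap confidence interval for a mean metric score."""
    scores = np.asarray(scores)
    means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Hypothetical per-sample faithfulness scores from one evaluation run.
print(bootstrap_ci([0.91, 0.84, 0.88, 0.95, 0.79, 0.90]))
```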

## Minor Comments
* "Graph RAG improves on RAG by integrating knowledge graphs into the retrieval process." (page 2, line 6-7) This assumes a pretty generic definition of KGs; what would be the difference between a graph and a KG here? [61] defines it as "GraphRAG retrieves graph elements containing relational knowledge pertinent to a given query from a pre-constructed graph database", which keeps the definition more general (any graph, not only KGs). Why did the authors assume a more specific definition here? It might make sense to highlight the focus on KGs as part of the introduction, instead of redefining the more general Graph-RAG term.
* The summary of all application domains (Page 2, line 23 - 25), followed by more detailed explanations, could be structured more coherently. The summary creates a jump backwards in the reading flow, i.e. the text jumps back from describing section 5.4 to 4.1 again. I suggest presenting the overview as a small diagram instead.
* "LLMs were first developed by OpenAI, with GPT-2" (page 3, line 23) Without background knowledge, this statement might produce the wrong impression. Language modelling with transformer-based models existed before GPT-2 ("Attention is all you need" paper, BERT, GPT-1) and language modelling even before that. GPT-2 was the first model to scale its architecture to over 1 billion parameters. Since the "large" part of LLMs is not clearly defined, it would be beneficial to provide more context here.
* "Since Graph the introduction of RAG, many [...]" (page 6, line 49) this is most likely a typo (Maybe without "Graph"?)
* "[...] we also identity communities [...]" (page 8 line 33). That should be "identify".
* "Zentralverriegelung" (page 10, line 45) I suggest including the information that this is the German term for central locking (increases readability for non-German speakers)
* "4.3 Wenn Fraunhofer Wüsste -Wass Fraunhofer Weiß" (page 11, line 42) is not a very descriptive title, especially for non-German speakers.
* page 11, line 49 and page 12, line 15 contain "[citation]"; actual citations should replace this.
* "The former adds weights to the model and trains these instead of the model's parameters, reducing the total number of training parameters" (page 25, line 48-49). Surely, adding something increases the overall trained parameters. LoRA reduces the fine-tuned parameters, but not the *total number* of parameters. Clarifying what the total number of training parameters refers to is useful.

# Score
As a full paper, this work clearly lacks novelty and does not provide any serious evaluation or experiments. As such, I recommend rejecting this work as a full paper. For an application report submission, I still see room for improvement, but this could be addressed with minor revisions.

Review #3
Anonymous submitted on 15/Nov/2025
Suggestion:
Major Revision
Review Comment:

Summary:
The paper presents an investigation of Graph RAG across seven real-world domains, including legal compliance, customer support, enterprise knowledge management, finance, education, data protection enforcement, and time series analytics from Fraunhofer partners. The authors propose a modular Graph RAG Engine divided into three stages: KG-Indexing, KG-Retrieval, and KG-Generation. KG-Indexing is the creation or linking of a KG, often involving extracting entities and relations from text, guided by a formal ontology. The KG-Retrieval module extracts relevant information such as triples or subgraphs based on the input queries, using hybrid retrieval methods like similarity search, automatic translation of natural language queries into graph query languages, or semantic clustering. Finally, the KG-Generation module integrates the retrieved graph data with the input query and feeds it to the LLM to generate the final response. The engine is evaluated using LLM-as-a-judge comparisons between Graph RAG and traditional RAG. Empirical results show improvements in accuracy, latency, and user trust across domains, measured along comprehensiveness, diversity, and empowerment.
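
Schematically, the three stages compose as follows (a sketch of this reviewer's reading of the architecture, not the authors' code; all bodies are placeholders):

```python
def kg_indexing(documents, ontology=None):
    """Extract entities/relations (optionally schema-guided) into a KG."""
    ...

def kg_retrieval(kg, query):
    """Hybrid retrieval: similarity search, NL-to-graph-query translation,
    or semantic clustering over the KG."""
    ...

def kg_generation(llm, query, graph_context):
    """Fuse retrieved triples/subgraphs with the query and call the LLM."""
    ...

def graph_rag(llm, documents, query):
    kg = kg_indexing(documents)
    context = kg_retrieval(kg, query)
    return kg_generation(llm, query, context)
```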

Strengths:
1. Empirical evidence across several domains shows that Graph RAG delivers measurable improvements compared to traditional RAG approaches. Graph RAG's retrieval proves beneficial for reasoning compared to traditional vector retrieval. By leveraging knowledge graphs, Graph RAG reduces the risk of hallucinations, leading to more coherent outputs.
2. The authors also demonstrate methodological rigour by combining vector and graph retrieval (hybrid RAG) and by employing both similarity‑based and semantic‑clustering modes, which improves robustness in sparse or noisy graphs.
3. The method supports both schema-first and schemaless approaches, which improves flexibility in knowledge construction.

Weaknesses:

1. The evaluation methodology relies too heavily on LLM-as-a-judge: the analysis is not consistent, which the authors themselves point out when they show the sensitivity of the approach to prompt engineering. Hence, I recommend another layer of evaluation, i.e., human evaluation, perhaps on a subset of the entire dataset, together with inter-annotator agreement (IAA) against the LLM-as-a-judge (see the agreement sketch after this list).
2. Detailed ablation studies are missing, i.e., what is the contribution of each component (schema vs. schemaless, hybrid switching, graph traversal, etc.) to the improvement of the results? Since KGs are inherently incomplete, it would also be interesting to design an experimental setup to analyse the sensitivity towards noisy or incomplete KGs.
3. As the KG grows, the number of candidate subgraphs during the retrieval phase increases exponentially, complicating efficient retrieval. The paper reports model response times, but more systematic experiments on throughput, memory, and KG-size scaling would help practitioners weigh costs. Did the authors conduct experiments on different KG sizes and measure factors such as retrieval latency and response time?
4. The Graph RAG applications do not utilize temporal KGs; therefore, a careful balance is necessary between older and newer information in the KGs. A discussion of how to achieve this would improve the quality of the manuscript (see the recency-weighting sketch after this list).
5. The code, data, and prompt details are not made available, which limits reproducibility. It is recommended to make them available.
6. The paper focuses on data and information related to Fraunhofer partners. A detailed discussion is necessary to establish the generalisability of the proposed framework.
7. It is recommended to provide a compact, consolidated table describing each use case (LLMs used, embedding models, prompt templates, etc.). This would provide a direct comparison across all the use cases and improve readability.
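
On weakness 1, the requested agreement check is cheap to run: have humans and the LLM judge label a shared subset and compute, e.g., Cohen's kappa. A sketch with made-up labels:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical pass/fail judgments on the same 10-answer subset.
human     = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
llm_judge = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]

print(f"Cohen's kappa: {cohen_kappa_score(human, llm_judge):.2f}")
```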
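
On weakness 4, even without a temporal KG, a simple recency prior on retrieval scores would make the old-vs-new balance explicit. A sketch (illustrative, not from the paper):

```python
import time

def time_decayed_score(similarity, fact_timestamp, half_life_days=180.0):
    """Down-weight stale facts with an exponential half-life so newer
    triples outrank equally similar older ones."""
    age_days = (time.time() - fact_timestamp) / 86_400
    return similarity * 0.5 ** (age_days / half_life_days)
```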