Review Comment:
The submission type does not fit the actual content of this work. As a full paper, it lacks coherence because it does not present any novel contribution; I suggest changing the type to an application report. My score follows the review guidelines for a full paper, and I am happy to revise it if the submission type changes. In any case, I hope my comments provide helpful feedback for any submission type.
# Summary
The paper presents seven applications of Graph-RAG methods in heterogeneous domains (legal, finance, customer support, education, agriculture, enterprise knowledge management). After a brief explanation of RAG and Graph-RAG methods, an overview of the overall process is given for each application. Each application is also evaluated individually, and minimal evaluation results, mostly comparing RAG and Graph-RAG, are discussed. Most applications focus on knowledge-graph-based Graph-RAG approaches and, in particular, describe the knowledge organisation / KG construction step. The authors group their applications into two categories: construction using an existing schema (section 5) and construction without a prior schema (section 4). Based on the obtained results, the authors attempt to derive best practices and other insights, and they identify general challenges of KG-RAG.
# Strengths
* Overall, the paper is well written (in particular the introduction), follows a clear structure, and motivates the presented use-cases of Graph-RAG convincingly.
* The many heterogeneous applications are presented in sufficient detail to give a good overview of KG-RAG in practice, which is especially interesting for practitioners.
* The authors make an effort to compare the use-cases and discover general insights to inform the development of further (Graph-)RAG applications.
# Weaknesses
* [Only for submission as full paper] The presented contribution has no scientific novelty and no comprehensive evaluation across methods; i.e., this work contains no research contribution that could be reviewed along the dimensions of originality or significance of the results. While the overview of the seven applications is novel, the methods used are not. Additionally, the generated insights are (1) not sufficiently supported by the experiments and (2) already reported by related work (based on more detailed evaluations).
* The sections for the individual applications frequently repeat information, which hurts readability. Better aligning each application section with the overall structure would avoid these repetitions and shorten the paper.
* The presented methods and evaluations lack many relevant details, which renders most evaluations useless and does not allow for replication.
# Comments
* The introduction motivates this work really well and gives a good overview of its contents.
* What is supposed to be the novel aspect of the presented Graph-RAG engine? The stages are exactly the stages presented in [61]. If the authors want to claim some novelty in this description, they should point it out explicitly. Additionally, if this engine is meant to support the understanding of each application, it would help to adhere to this framework when describing the applications; currently, the framework is only loosely connected to them.
* Graph-based indexing is described from a pure KG construction perspective and misses the retrieval-system perspective, i.e. it is not clear how the constructed KG is integrated into the overall retrieval system. A sketch of the kind of description I am missing is given after this list.
* The claim "However, KGs generated automatically from text using LLMs are prone to errors" (page 5, line 43-44) needs a reference. This statement also contradicts "With the emergence of LLMs, new end-to-end approaches for building KGs have become possible" (page 6, line 24). If LLMs are generally prone to errors in this task, it does not make sense to have end-to-end approaches. This might simply be an issue of ordering. First mentioning the end-to-end approaches and then remarking that these pipelines still produce some errors is more reasonable.
* Figure 2 (left) does not provide any helpful information. Yes, the dots and connections resemble a graph, but a more effective example would be one where the nodes and edges carry actual readable information.
* Is the RAGAS-generated dataset publicly available? What are the basic statistics of this data, e.g. number of samples, typical length, and size of the retrieved data? Confidence intervals would be helpful as well (see the bootstrap sketch after this list). This is relevant for all evaluations in this work: an evaluation based on unknown data cannot produce any reproducible, scientific results.
* Was the LLM-as-a-judge framework reliable here? Was a subset manually evaluated? Even a small audit set would allow a chance-corrected agreement check, as sketched after this list.
* Figure 3 does not provide any helpful information. What is the point of an anime-style picture of someone sitting in front of a computer? Everyone knows what this looks like (most likely from looking in the mirror). The picture does not give a source, which raises even bigger concerns (Is it taken from an anime? Is it AI-generated?). Instead of anonymising it, it should be removed. The right subfigure has the same issue as Figure 2.
* The financial report generation in section 5.3 lacks any evaluation results.
* "particularly in industries like agriculture, which are less connected to technology and have limited data science knowledge" (page 19, line 17-18) I am not sure if this is true. As far as I know, technology and data science methods are frequently used in the agricultural sector. Some evidence for this claim is needed.
* "and can only be achieved using Graph RAG on our time series knowledge graph" (page 20, line 46-47) Such statements are highly unscientific and should not be made.
* Tables 5 and 6 present an average response time, but response times are only comparable if the same hardware is used for all models. For GPT4o, Mixtral and Gemma2 27b, this is not the case.
* "If you tell the LLM what you want to see in the graph, it will generate that based on the provided documents" (page 22, line 15-16). The presented applications did not discuss an evaluation which could actually support such a statement. Such a general statement (even with actual evidence for it) is not a good scientific style.
* The majority of "insights" described in section 6 do not seem to be connected to the performed experiments, e.g. the claim that bigger models do not necessarily perform better triple extraction does not directly connect to the presented experiments (only 5.4 seems to consider multiple models, and in this case, only results for the whole pipeline are given).
## Minor Comments
* "Graph RAG improves on RAG by integrating knowledge graphs into the retrieval process." (page 2, line 6-7) This assumes a pretty generic definition of KGs; what would be the difference between a graph and a KG here? [61] defines it as "GraphRAG retrieves graph elements containing relational knowledge pertinent to a given query from a pre-constructed graph database", which keeps the definition more general (any graph, not only KGs). Why did the authors assume a more specific definition here? It might make sense to highlight the focus on KGs as part of the introduction, instead of redefining the more general Graph-RAG term.
* The summary of all application domains (page 2, line 23-25), followed by more detailed explanations, could be structured more coherently. The summary creates a jump backwards in the reading flow, i.e. the text returns from describing section 5.4 to section 4.1. I suggest presenting the overview as a small diagram instead.
* "LLMs were first developed by OpenAI, with GPT-2" (page 3, line 23) Without background knowledge, this statement might produce the wrong impression. Language modelling with transformer-based models existed before GPT-2 ("Attention is all you need" paper, BERT, GPT-1) and language modelling even before that. GPT-2 was the first model to scale its architecture to over 1 billion parameters. Since the "large" part of LLMs is not clearly defined, it would be beneficial to provide more context here.
* "Since Graph the introduction of RAG, many [...]" (page 6, line 49) this is most likely a typo (Maybe without "Graph"?)
* "[...] we also identity communities [...]" (page 8 line 33). That should be "identify".
* "Zentralverriegelung" (page 10, line 45) I suggest including the information that this is the German term for central locking (increases readability for non-German speakers)
* "4.3 Wenn Fraunhofer Wüsste -Wass Fraunhofer Weiß" (page 11, line 42) is not a very descriptive title, especially for non-German speakers.
* page 11, line 49 and page 12, line 15 contain "[citation]"; actual citations should replace this.
* "The former adds weights to the model and trains these instead of the model's parameters, reducing the total number of training parameters" (page 25, line 48-49). Surely, adding something increases the overall trained parameters. LoRA reduces the fine-tuned parameters, but not the *total number* of parameters. Clarifying what the total number of training parameters refers to is useful.
# Score
As a full paper, this work clearly lacks novelty and does not provide any serious evaluation or experiments. As such, I recommend rejecting this work as a full paper. For an application report submission, I still see room for improvement, but this could be addressed with minor revisions.