
RAG vs. GraphRAG: A Systematic Evaluation and Key Insights
Key Points
- This paper systematically evaluates Retrieval-Augmented Generation (RAG) and various GraphRAG methods on general text-based tasks, including question answering and query-based summarization, using widely adopted benchmarks.
- The evaluation reveals distinct strengths: RAG excels in single-hop and detailed queries, while GraphRAG is more effective for multi-hop, reasoning-intensive questions and diverse summarization, highlighting their complementary nature.
- Based on these insights, the authors propose and demonstrate that hybrid Selection and Integration strategies can leverage these strengths to generally enhance overall performance, while also discussing current GraphRAG limitations.
The paper presents a systematic evaluation and comparison of Retrieval-Augmented Generation (RAG) and Graph Retrieval-Augmented Generation (GraphRAG) on general text-based tasks, specifically Question Answering (QA) and Query-based Summarization. It addresses a gap in the understanding of when and why explicit graph structures, constructed from text, benefit retrieval-augmented generation.
The core methodology involves comparing RAG against three representative GraphRAG approaches under identical experimental settings, including the same Large Language Models (LLMs), embedding models, and retrieval configurations.
RAG Methodology:
The paper employs a vanilla semantic similarity-based RAG approach. Text documents are first segmented into chunks, each approximately 256 tokens in length. These chunks are then indexed using OpenAI's text-embedding-ada-002 model to create vector embeddings. For a given query, the system retrieves the top-10 most semantically similar text chunks. The retrieved chunks, along with the query, are then fed into a generative LLM (Llama-3.1-8B-Instruct or Llama-3.1-70B-Instruct) to produce the final response.
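The vanilla RAG pipeline above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: toy whitespace tokenization stands in for the ~256-token chunker, and plain vectors stand in for text-embedding-ada-002 embeddings; only the chunk-then-rank-by-cosine-similarity structure is from the source.

```python
import math

def chunk_text(text, chunk_size=256):
    # Split a document into fixed-size chunks (whitespace tokens stand in
    # for the ~256-token chunks described in the paper).
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query_vec, chunk_vecs, k=10):
    # Rank chunk embeddings by cosine similarity to the query embedding
    # and return the indices of the top-k matches (top-10 in the paper).
    scored = sorted(enumerate(chunk_vecs),
                    key=lambda iv: cosine(query_vec, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:k]]
```

The retrieved chunks would then be concatenated with the query into the generator's prompt, a step omitted here since it is a single LLM call.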
GraphRAG Methodologies (Core Technical Details):
- KG-based GraphRAG (Knowledge Graph-based GraphRAG):
- Graph Construction: A knowledge graph (KG) is constructed from text chunks using LLMs. This process involves extracting triplets of the form (head, relation, tail) from the text.
- Retrieval Mechanism: When a query is received, entities are extracted from the query using LLMs and matched to entities within the constructed KG. The retrieval process then traverses the graph from these matched entities, gathering relevant triplets from their multi-hop neighbors.
- Variants:
  - KG-GraphRAG (Triplets): Retrieves only the extracted (head, relation, tail) triplets.
  - KG-GraphRAG (Triplets+Text): Retrieves both the triplets and their corresponding original source text, enhancing contextual completeness.
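The multi-hop traversal step can be sketched as a breadth-first walk over the triplet store. This is a simplified illustration under toy assumptions: the KG is a plain list of (head, relation, tail) tuples, and the LLM-based entity extraction and matching are assumed to have already produced the seed entities.

```python
from collections import deque

def neighbors(kg, entity):
    # All triplets touching an entity, either as head or as tail.
    return [t for t in kg if t[0] == entity or t[2] == entity]

def retrieve_triplets(kg, seed_entities, max_hops=2):
    # Breadth-first traversal from the query-matched entities,
    # collecting triplets from their multi-hop neighbors.
    seen = set(seed_entities)
    frontier = deque((e, 0) for e in seed_entities)
    collected = []
    while frontier:
        entity, hops = frontier.popleft()
        if hops >= max_hops:
            continue
        for head, rel, tail in neighbors(kg, entity):
            if (head, rel, tail) not in collected:
                collected.append((head, rel, tail))
            nxt = tail if head == entity else head
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    return collected
```

The Triplets variant would feed `collected` to the generator directly; the Triplets+Text variant would additionally look up each triplet's source chunk.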
- Community-based GraphRAG:
- Graph Construction: Similar to KG-based GraphRAG, an initial KG is generated from text using LLMs. Following this, hierarchical communities are formed within the graph using graph community detection algorithms. Each community is associated with a textual summary or report; lower-level communities provide detailed information, while higher-level communities offer summaries of their sub-communities. GPT-4o-mini is primarily used for graph construction.
- Retrieval Mechanisms:
  - Community-GraphRAG (Local): Performs a local search. Retrieval is based on entity matching between the query's extracted entities and the graph, returning entities, relations, descriptions, and detailed lower-level community reports.
  - Community-GraphRAG (Global): Performs a global search. It retrieves only high-level community summaries based on semantic similarity to the query, providing a broader but less granular overview.
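The local/global contrast can be sketched over a toy community hierarchy. This is an illustrative simplification, not the actual implementation: each community is assumed to be a dict with a `level`, a list of member `entities`, and a textual `report`, and the query-relevance scorer is passed in as a plain callable standing in for semantic similarity.

```python
def local_search(communities, query_entities):
    # Local search: follow query-matched entities into their
    # lowest-level communities and return the detailed reports.
    return [c["report"] for c in communities
            if c["level"] == 0 and set(c["entities"]) & set(query_entities)]

def global_search(communities, score_fn, top_k=2):
    # Global search: skip entity matching entirely and rank only the
    # high-level community summaries by relevance to the query.
    high = [c for c in communities if c["level"] > 0]
    high.sort(key=lambda c: score_fn(c["report"]), reverse=True)
    return [c["report"] for c in high[:top_k]]
```

The sketch makes the trade-off concrete: local search surfaces entity-grounded detail, while global search only ever sees the coarser summaries, which matches the paper's finding that the Global variant loses detail on QA.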
- Text-based GraphRAG (HippoRAG2):
- Graph Construction: This method treats original text chunks as nodes in a graph. A KG is constructed, where entities extracted from the text are linked back to their corresponding original text chunks.
- Retrieval Mechanism: Entities relevant to the query are first identified within the graph. Subsequently, the original text chunks directly connected to these relevant entities are retrieved and used for generation.
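The entity-to-chunk linking at the heart of this design can be sketched as a simple inverted index. This is a toy illustration of the structure, not HippoRAG2 itself: the entity extractor is passed in as a callable (an LLM in the real system), and retrieval here is a plain set union over linked chunks rather than the method's full graph scoring.

```python
def build_entity_index(chunks, extract_entities):
    # Link each extracted entity back to the chunks it appears in,
    # mirroring the entity -> source-chunk edges of the text-based graph.
    index = {}
    for i, chunk in enumerate(chunks):
        for ent in extract_entities(chunk):
            index.setdefault(ent, set()).add(i)
    return index

def retrieve_chunks(index, chunks, query_entities):
    # Gather the original chunks directly connected to the
    # query-relevant entities, preserving document order.
    hit = set()
    for ent in query_entities:
        hit |= index.get(ent, set())
    return [chunks[i] for i in sorted(hit)]
```

Because the generator receives original text chunks rather than triplets or summaries, this design explains why the method also performs well on summarization, where outputs are compared against human-written text.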
Evaluation and Findings:
The paper evaluates these methods on established datasets: Natural Questions (NQ) for single-hop QA, HotPotQA and MultiHop-RAG for multi-hop QA, and NovelQA for fine-grained query types. For summarization, SQuALITY and QMSum (single-document) and ODSum-story/ODSum-meeting (multi-document) are used. Evaluation metrics include Precision, Recall, F1-score, Accuracy for QA, and ROUGE-2 and BERTScore for summarization.
Key findings indicate that RAG excels at single-hop and fine-grained detail questions, while Community-GraphRAG (Local) demonstrates superior performance for multi-hop and reasoning-intensive QA tasks. Community-GraphRAG (Global) often struggles in QA due to loss of detail, but shows potential for queries requiring comparative or temporal understanding. KG-based GraphRAG generally underperforms in QA, attributed to incomplete entity coverage in constructed KGs.
The analysis reveals significant complementary strengths: a substantial portion of queries are answered correctly by only RAG or only GraphRAG. Building on this, the paper proposes two hybrid strategies to enhance QA performance:
- Selection: An LLM-based classifier determines whether a query is fact-based or reasoning-based, directing it to RAG (for facts) or GraphRAG (for reasoning) accordingly. This is the more efficient option, since each query invokes only one retrieval pipeline.
- Integration: Information retrieved by both RAG and GraphRAG is concatenated and fed to the generator. This typically yields higher performance but at a greater computational cost.
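The two hybrid strategies can be sketched as follows. This is a schematic illustration under stated assumptions: the classifier is a plain callable (an LLM prompt in the paper), and Integration is shown as simple deduplicated concatenation of the two retrievers' evidence.

```python
def select_route(query, classify):
    # Selection: a classifier routes fact-seeking queries to RAG and
    # reasoning-heavy ones to GraphRAG; only one pipeline runs per query.
    return "rag" if classify(query) == "fact" else "graphrag"

def integrate_contexts(rag_chunks, graphrag_chunks):
    # Integration: concatenate both retrievers' evidence (deduplicated,
    # RAG first) and hand the combined context to a single generator call.
    merged = list(rag_chunks)
    for c in graphrag_chunks:
        if c not in merged:
            merged.append(c)
    return "\n".join(merged)
```

The structure makes the cost difference visible: Selection runs one retriever per query, while Integration always pays for both retrievers plus a longer generation context.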
For query-based summarization, RAG and HippoRAG2 generally perform well as they retrieve original text, which aligns closely with human-written ground truths. KG-based GraphRAG benefits from combining triplets with their source text. Community-GraphRAG (Local) outperforms Community-GraphRAG (Global) in summarization, emphasizing the importance of detailed information.
The paper also highlights a critical limitation in current LLM-as-a-Judge evaluations for summarization, demonstrating the presence of position bias that can impact result reliability. This suggests that reference-based metrics such as ROUGE and BERTScore, computed against human-written ground-truth summaries, remain crucial complements for unbiased evaluation.
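A standard way to detect the position bias described above is to run each pairwise comparison in both presentation orders and check that the verdict is order-invariant. The sketch below is illustrative and not from the paper; the judge is a callable returning "A" or "B" for whichever summary it prefers.

```python
def judge_pair(summary_a, summary_b, judge):
    # Run the pairwise judge in both presentation orders. If the winner
    # changes when the order is swapped, the verdict is position-biased
    # and is reported as None (i.e., unreliable) rather than trusted.
    first = judge(summary_a, summary_b)    # "A" or "B", original order
    second = judge(summary_b, summary_a)   # "A" or "B", swapped order
    swapped_back = {"A": "B", "B": "A"}[second]
    return first if first == swapped_back else None
```

A judge that always prefers whichever summary appears first will fail this check on every pair, which is exactly the failure mode motivating the paper's caution about LLM-as-a-Judge results.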