
KGGen: Extracting Knowledge Graphs from Plain Text with Language Models
Key Points
- KGGen is a novel text-to-knowledge-graph generator that leverages language models and an iterative entity and edge resolution algorithm to produce high-quality, dense knowledge graphs from plain text.
- The paper introduces MINE, a new benchmark for KG extraction, on which KGGen outperforms OpenIE and performs comparably to GraphRAG in retrieval, while demonstrating better information retention.
- KGGen significantly reduces graph sparsity by clustering and de-duplicating entities and relations, leading to more generalizable KGs and superior efficiency and scaling properties compared to existing methods like GraphRAG.
KGGen is a novel text-to-knowledge-graph (KG) generator that addresses the scarcity and incompleteness of high-quality KGs by leveraging language models (LMs) and a sophisticated entity and edge resolution approach. Existing methods like OpenIE and Microsoft's GraphRAG often produce sparse and disconnected KGs due to a lack of effective entity resolution and relation normalization. KGGen aims to create dense, high-quality KGs by clustering and de-duplicating related entities and normalizing relation types. The paper also introduces MINE (Measure of Information in Nodes and Edges), the first benchmark specifically designed to evaluate KG extractors from plain text.
The core methodology of KGGen is a multi-stage approach:
- Entity and Relation Extraction:
- Step 1 (Entity Extraction): Given the source text, the LLM is prompted to extract a list of relevant entities.
- Step 2 (Relation Extraction): Using the source text and the previously extracted list of entities, the LLM is prompted to output a list of subject-predicate-object relations, ensuring that subjects and objects are consistent with the extracted entities.
- Aggregation: The entities and relations extracted from the individual source texts are combined into a single aggregated graph, whose node set is the union of all extracted entities and whose edge set is the union of all extracted relations.
- Entity and Edge Resolution:
- Semantic Embedding: All items (entities or edges) in the aggregated graph are first converted into semantic embeddings. The paper specifically mentions using `all-MiniLM-L6-v2` from SentenceTransformers for this purpose.
- Clustering: K-means clustering is applied to these semantic embeddings, grouping similar items into clusters (e.g., clusters of 128 items). This pre-processing step efficiently handles large KGs by partitioning the resolution task.
- Similar Item Retrieval: Within each cluster, for every item, the top-k most semantically similar items are retrieved. This retrieval leverages a fused approach combining BM25 (for lexical similarity) and semantic embedding similarity. The combined similarity score for a query q and an item i can be represented as a weighted sum:

  score(q, i) = λ · BM25(q, i) + (1 − λ) · sim_emb(q, i)

where λ is typically 0.5 for equal weighting.
- LLM-based De-duplication: The retrieved set of similar items is then passed to an LLM. The LLM is prompted to identify *exact duplicates* from this set, specifically considering variations in tense, plurality, case, abbreviations, and shorthand forms (e.g., "Olympic Winter Games", "Winter Olympics", "winter Olympic games"). This step leverages the LLM's understanding of natural language nuances for precise matching.
- Canonical Representative Selection: For each identified set of duplicates, the LLM selects a single, canonical representative (similar to aliases in Wikidata) that best captures the shared meaning. Cluster maps are maintained to track which original entities or edges are mapped to which canonical alias.
- Iteration: The item and its identified duplicates are removed from the current cluster, and the process (retrieval, LLM de-duplication, canonical selection) repeats until no items remain in the cluster. This iterative refinement ensures comprehensive de-duplication.
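The two-stage extraction described above can be sketched as follows. This is a minimal illustration, not the paper's actual prompts: `call_llm` is a hypothetical hook standing in for any chat-completion client, and the JSON response format is an assumption made for parsing convenience.

```python
import json


def extract_entities(text, call_llm):
    """Step 1: prompt the LLM for a JSON list of entities found in the text."""
    prompt = (
        "Extract all relevant entities from the text below. "
        "Respond with a JSON list of strings.\n\n" + text
    )
    return json.loads(call_llm(prompt))


def extract_relations(text, entities, call_llm):
    """Step 2: prompt for subject-predicate-object triples, constraining
    subjects and objects to the previously extracted entity list."""
    prompt = (
        "Given the text and the entity list, output a JSON list of "
        "[subject, predicate, object] triples. Subjects and objects must "
        f"come from this list: {json.dumps(entities)}\n\nText:\n{text}"
    )
    return [tuple(t) for t in json.loads(call_llm(prompt))]


def extract_graph(text, call_llm):
    """Run both stages and return (entities, triples) for one source text."""
    entities = extract_entities(text, call_llm)
    triples = extract_relations(text, entities, call_llm)
    return entities, triples
```

Passing the extracted entity list back into the second prompt is what keeps subjects and objects consistent across the two stages.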
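The iterative resolution loop over a single cluster can be sketched as below. To keep the example self-contained, a simple token-overlap score stands in for BM25 and cosine similarity is computed over pre-supplied embedding vectors; `llm_dedup` is a hypothetical hook that, like the LLM step in the paper, returns the candidates judged exact duplicates of the query together with a canonical representative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def lexical_sim(a, b):
    """Cheap stand-in for BM25: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def fused_score(query, item, emb, lam=0.5):
    """Weighted sum of lexical and embedding similarity (lambda = 0.5)."""
    return lam * lexical_sim(query, item) + (1 - lam) * cosine(emb[query], emb[item])


def resolve_cluster(items, emb, llm_dedup, k=5):
    """Repeatedly pick an item, retrieve its top-k neighbours by fused score,
    ask the LLM for exact duplicates and a canonical form, then remove the
    whole group; repeat until the cluster is empty."""
    remaining = list(items)
    canon_map = {}  # original item -> canonical alias
    while remaining:
        query = remaining.pop(0)
        neighbours = sorted(
            remaining, key=lambda it: fused_score(query, it, emb), reverse=True
        )[:k]
        duplicates, canonical = llm_dedup(query, neighbours)
        for d in [query] + duplicates:
            canon_map[d] = canonical
        remaining = [it for it in remaining if it not in duplicates]
    return canon_map
```

The returned `canon_map` plays the role of the paper's cluster maps, recording which original entities or edges collapse to which canonical alias.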
KGGen's performance is evaluated using the novel MINE benchmark, which includes:
- MINE-1 (Knowledge Retention): Measures the ability of a KG extractor to capture information from short texts. It assesses if 15 pre-defined, ground-truth facts from an article can be recovered from the extracted KG via a semantic query process and LLM-based verification.
- MINE-2 (KG-Assisted RAG Description): Evaluates downstream RAG performance using KGs built from multi-million token datasets (WikiQA). It assesses how well the KG facilitates answer synthesis for given questions, using retrieved triples and an LLM to generate responses.
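A MINE-1-style recover-and-verify loop could be approximated as follows. This is an illustrative sketch only: `embed` and `llm_judge` are hypothetical hooks for an embedding model and an LLM yes/no verifier, and the paper's actual query pipeline may differ in detail.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve_triples(fact, triples, embed, top_k=3):
    """Rank KG triples by embedding similarity to a ground-truth fact."""
    fact_vec = embed(fact)
    return sorted(
        triples, key=lambda t: cosine(embed(" ".join(t)), fact_vec), reverse=True
    )[:top_k]


def mine1_score(facts, triples, embed, llm_judge):
    """Fraction of ground-truth facts the judge deems recoverable
    from the triples retrieved for each fact."""
    hits = sum(1 for f in facts if llm_judge(f, retrieve_triples(f, triples, embed)))
    return hits / len(facts)
```

With the paper's setup, `facts` would be the 15 pre-defined ground-truth facts for an article and `triples` the edges of the extracted KG.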
The paper demonstrates that KGGen significantly outperforms GraphRAG and OpenIE in terms of information retention on MINE-1, achieves comparable RAG performance on MINE-2, and critically, produces KGs with more concise, generalizable entities and relations. Qualitatively, KGGen generates KGs that are more informative and coherent, avoiding the hyperspecific or generic nodes often found in outputs from other tools. Furthermore, KGGen exhibits superior scaling properties, with the average re-usability of relation types increasing with corpus size, unlike GraphRAG, which generates nearly as many relation types as edges even for large corpora. Efficiency analysis shows KGGen to be significantly faster and more cost-effective for large corpora compared to GraphRAG, primarily due to its effective de-duplication reducing the number of LLM calls in the later stages.