GitHub - stair-lab/kg-gen: [NeurIPS '25] Knowledge Graph Generation from Any Text
Key Points
- `kg-gen` is a tool that uses language models to extract knowledge graphs from diverse text formats, including plain text, large documents, and conversational messages.
- It offers chunking for large texts, clustering of similar entities and relations, aggregation of multiple graphs, and support for various LLM providers through LiteLLM.
- It is designed to support applications such as Retrieval-Augmented Generation (RAG), synthetic data creation for model training, general text structuring, and the analysis of relationships within source texts.
KGGen is a knowledge graph generation system designed to extract structured knowledge graphs from arbitrary plain text, including single strings, large documents, and conversational message arrays. The core methodology leverages large language models (LLMs) to perform the extraction, with model calls routed via LiteLLM, supporting various API-based and local providers such as OpenAI, Ollama, Anthropic, Gemini, and Deepseek. Structured output generation is facilitated by DSPy.
The system outputs knowledge graphs consisting of:
- `entities`: A set of unique concepts identified in the text.
- `edges`: A set of unique relationship types.
- `relations`: A set of triples `(subject, predicate, object)` representing the extracted relationships.
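To make the three components concrete, here is a minimal sketch of a container that keeps them consistent with each other. The `GraphSketch` class and its `add` method are hypothetical names used only for illustration; they are not kg-gen's own classes, and the example triples are made up.

```python
from dataclasses import dataclass, field

Triple = tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class GraphSketch:
    """Hypothetical stand-in for the output structure; not kg-gen's own class."""
    entities: set[str] = field(default_factory=set)
    edges: set[str] = field(default_factory=set)
    relations: set[Triple] = field(default_factory=set)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        # Every triple contributes its endpoints to `entities`
        # and its predicate to `edges`, so the three sets stay in sync.
        self.entities.update((subject, obj))
        self.edges.add(predicate)
        self.relations.add((subject, predicate, obj))


graph = GraphSketch()
graph.add("Linda", "is mother of", "Josh")
graph.add("Ben", "is brother of", "Josh")
graph.add("Linda", "is mother of", "Josh")  # exact duplicate; the set keeps one copy
```

Because all three fields are sets, repeated concepts, relationship types, and triples collapse to a single entry, which matches the "unique" wording in the description above.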
KGGen offers several key functionalities:
- Text Processing:
  - Single String Input: Directly processes a given text string.
  - Large Text Chunking: For extensive documents, KGGen can process text in segments of a configurable `chunk_size` (e.g., 5000 characters) to stay within the context window limits of LLMs.
  - Message Array Processing: Handles conversational data provided as a list of `Message` objects, each with a `role` and `content`. This feature preserves message order and role information, extracting entities and relationships not only between concepts mentioned in messages but also between speakers (roles) and concepts, and across multiple messages within a conversation. For instance, `(user, "asks about", "France")` can be extracted from a dialogue.
- Graph Manipulation and Refinement:
  - Clustering: Identifies and groups semantically similar entities and relations, disambiguating variations and aliases (e.g., 'AI' and 'artificial intelligence' or 'is type of' and 'is a type of'). Clustering can be applied during the initial graph generation or as a post-processing step on an existing graph. An optional `context` parameter can be provided to guide the clustering process. The output includes `entity_clusters` and `edge_clusters` mappings.
  - Aggregation: Allows for the combination of multiple independently generated knowledge graphs into a single, comprehensive graph using the `kg.aggregate()` method. This enables merging knowledge extracted from different sources or segments.
- Visualization: Provides a `KGGen.visualize()` method to render the generated knowledge graphs, writing them to a specified path (e.g., an HTML file) and optionally opening the result in a browser.
- Model Configuration: Users can specify the LLM to be used by passing a model string (e.g., `"openai/gpt-4o"`, `"gemini/gemini-2.5-flash"`, `"ollama_chat/deepseek-r1:14b"`). Custom API base URLs and API keys can also be configured.
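The chunking, aggregation, and clustering steps listed above can be sketched in plain Python without calling any LLM. This is an illustration under stated assumptions, not kg-gen's implementation: `chunk_text`, `aggregate_relations`, and `apply_clusters` are hypothetical helpers, the chunker splits on fixed character counts (the library may split differently, for example on sentence boundaries), and the cluster mappings are assumed to map a representative label to the set of its aliases.

```python
Triple = tuple[str, str, str]  # (subject, predicate, object)


def chunk_text(text: str, chunk_size: int = 5000) -> list[str]:
    """Split text into consecutive segments of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def aggregate_relations(graphs: list[set[Triple]]) -> set[Triple]:
    """Union the triples of several graphs; exact duplicates collapse."""
    merged: set[Triple] = set()
    for relations in graphs:
        merged |= relations
    return merged


def apply_clusters(
    relations: set[Triple],
    entity_clusters: dict[str, set[str]],
    edge_clusters: dict[str, set[str]],
) -> set[Triple]:
    """Rewrite each triple so every alias becomes its cluster representative."""
    # Assumed mapping shape: representative -> set of aliases.
    entity_rep = {m: rep for rep, members in entity_clusters.items() for m in members}
    edge_rep = {m: rep for rep, members in edge_clusters.items() for m in members}
    return {
        (entity_rep.get(s, s), edge_rep.get(p, p), entity_rep.get(o, o))
        for s, p, o in relations
    }


# A 12,000-character document splits into three chunks.
chunks = chunk_text("a" * 12000, chunk_size=5000)

# Two graphs that say the same thing with different wording.
graph_a = {("AI", "is type of", "technology")}
graph_b = {("artificial intelligence", "is a type of", "technology")}
merged = aggregate_relations([graph_a, graph_b])

clustered = apply_clusters(
    merged,
    entity_clusters={"artificial intelligence": {"AI", "artificial intelligence"}},
    edge_clusters={"is a type of": {"is type of", "is a type of"}},
)
```

Aggregation alone keeps both triples because they differ as strings; only after the cluster mappings are applied do the two variants collapse into one. In KGGen the grouping itself is done by the language model, which this sketch does not attempt to reproduce.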
KGGen's applications include assisting with RAG (Retrieval-Augmented Generation) systems, generating synthetic graph data for model training and testing, structuring unstructured text, and analyzing relationships between concepts within source materials. An accompanying MCP (Model Context Protocol) server is available for AI agents requiring persistent memory capabilities.