GitHub - CausalInferenceLab/Lang2SQL
Key Points
- Lang2SQL is an open-source tool designed to convert natural language queries into optimized SQL statements, built upon LangGraph and integrated with DataHub for metadata management.
- It addresses common data challenges by enabling smart table discovery, automated schema-aware SQL generation, and result visualization, simplifying data access for users regardless of their SQL expertise.
- The project employs a multi-stage LangGraph architecture, supports flexible VectorDBs like FAISS and pgvector, and actively seeks community contributions for containerization, agentic architecture development, and enhanced DataHub integration.
Lang2SQL is an open-source tool developed by the Causal Inference Lab, designed to convert natural language queries into optimized SQL statements. Its primary goal is to democratize data access and analysis by allowing users to generate complex SQL queries without extensive knowledge of database schemas or SQL syntax, thereby enabling them to focus on higher-value tasks.
The core methodology of Lang2SQL is built upon a multi-stage, agentic architecture orchestrated by LangGraph, which leverages Large Language Models (LLMs), DataHub for metadata management, and vector databases for efficient table discovery. The process unfolds through the following key stages:
- Natural Language Processing (NLP): This initial stage focuses on parsing the user's natural language query to understand their intent and extract core entities. While the specific NLP models are not detailed, this step is crucial for transforming unstructured text into structured information suitable for subsequent processing.
- Table Discovery (Semantic Search): This is a critical component for identifying relevant database tables that match the user's query. It operates as follows:
- Vector Embeddings: Information about database tables, including their schemas, column names, and potentially sampled data or descriptions obtained from DataHub, is transformed into high-dimensional vector embeddings. These embeddings capture the semantic meaning of the table metadata.
- Query Embedding: The user's natural language query is also embedded into the same vector space.
- Vector Similarity Search: A vector similarity search is performed to find tables whose embeddings are semantically similar to the query embedding. Lang2SQL supports flexible vector database backends, including FAISS (for local, in-memory or disk-based indexing) and pgvector (for PostgreSQL-backed vector storage). The user can specify the vector database type (`--vectordb-type`) and location (`--vectordb-location`).
- Schema Awareness: DataHub plays a vital role here by providing rich metadata, ensuring "accurate column mapping" and "semantic search using DataHub metadata," which improves the relevance and precision of retrieved tables. This stage can be configured to retrieve a specified number of top-N relevant tables (`--top-n`).
- SQL Generation: With the user's parsed intent and the semantically retrieved relevant tables, an LLM is employed to generate an optimized SQL query. This stage incorporates "best practices for performance optimization" into the generated SQL. The LangGraph framework allows for a modular and configurable workflow, accessible via the "Graph Builder" in the Streamlit interface. Users can activate or deactivate specific sub-components, including:
- PROFILE_EXTRACTION: Likely extracts specific user or query profiles to better contextualize the SQL generation.
- CONTEXT_ENRICHMENT: Enriches the context provided to the LLM by integrating additional metadata from DataHub (e.g., glossary terms, query examples) or other sources, leading to more accurate and contextually relevant SQL.
- QUERY_MAKER: The core component responsible for invoking the LLM to construct the final SQL query based on the processed intent and retrieved schema information. Disabling this module yields only table search results, without SQL generation.
The `--use-enriched-graph` option enables a workflow that includes profile extraction and context enrichment.
- Query Visualization: After SQL generation and execution, Lang2SQL provides capabilities to visualize the results of the generated SQL queries through various charts and graphs, facilitating intuitive data insights.
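The table-discovery stage above can be sketched in a few lines of Python. This is an illustrative toy, not the project's code: in Lang2SQL the embeddings come from an LLM embedding model and the index lives in FAISS or pgvector, and the table descriptions would be pulled from DataHub. Here a deterministic bag-of-words embedding, brute-force cosine similarity, and hypothetical table metadata stand in so the idea is visible end to end.

```python
import re
import numpy as np

def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-letters, so 'customer_id' -> ['customer', 'id']."""
    return re.findall(r"[a-z]+", text.lower())

# Hypothetical table metadata, as it might be fetched from DataHub.
TABLES = {
    "orders": "orders table: order_id, customer_id, order_date, total_amount",
    "customers": "customers table: customer_id, name, signup_date, region",
    "web_logs": "web_logs table: session_id, page_url, timestamp, user_agent",
}

# Shared vocabulary built from the table descriptions keeps the toy
# embedding deterministic (a real system would call an embedding model).
VOCAB: dict[str, int] = {}
for desc in TABLES.values():
    for tok in tokenize(desc):
        VOCAB.setdefault(tok, len(VOCAB))

def embed(text: str) -> np.ndarray:
    """Unit-normalized bag-of-words vector over the shared vocabulary."""
    vec = np.zeros(len(VOCAB))
    for tok in tokenize(text):
        if tok in VOCAB:
            vec[VOCAB[tok]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

TABLE_VECS = {name: embed(desc) for name, desc in TABLES.items()}

def top_n_tables(query: str, n: int = 2) -> list[str]:
    """Rank tables by cosine similarity to the query embedding (the --top-n idea)."""
    q = embed(query)
    scored = sorted(TABLE_VECS.items(), key=lambda kv: -float(q @ kv[1]))
    return [name for name, _ in scored[:n]]

print(top_n_tables("total sales per customer last month", n=2))
# → ['orders', 'customers']
```

Swapping the brute-force loop for a FAISS index or a pgvector-backed query changes the storage and search mechanics but not this ranking logic.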
The technical stack underpinning Lang2SQL includes LangGraph for orchestrating complex LLM workflows, DataHub for comprehensive metadata management, and Streamlit for creating an interactive web-based user interface.
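The modular, toggleable workflow (PROFILE_EXTRACTION, CONTEXT_ENRICHMENT, QUERY_MAKER) can be approximated in plain Python. The real project wires these stages into a LangGraph graph; here the stage names mirror the modules described above, but the function bodies, the dict-based state, and the hard-coded SQL string are illustrative stand-ins, not the actual implementation.

```python
# Plain-Python sketch of the staged workflow. Each stage takes and returns a
# shared state dict, mimicking how nodes pass state through a LangGraph graph.

def profile_extraction(state: dict) -> dict:
    # Stand-in: derive a query profile to contextualize generation.
    state["profile"] = f"analyst question: {state['question']}"
    return state

def context_enrichment(state: dict) -> dict:
    # Stand-in: in Lang2SQL this would pull glossary terms and
    # query examples from DataHub. The glossary entry is invented.
    state.setdefault("context", []).append("glossary: revenue = SUM(total_amount)")
    return state

def query_maker(state: dict) -> dict:
    # Stand-in for the LLM call that writes the final SQL.
    state["sql"] = (
        "SELECT customer_id, SUM(total_amount) AS revenue "
        "FROM orders GROUP BY customer_id"
    )
    return state

def run_pipeline(question: str, use_enriched_graph: bool = True,
                 make_query: bool = True) -> dict:
    """Assemble the stage list from the toggles, then run it in order."""
    state = {"question": question}
    stages = []
    if use_enriched_graph:          # the --use-enriched-graph idea
        stages += [profile_extraction, context_enrichment]
    if make_query:                  # disabling QUERY_MAKER = table search only
        stages.append(query_maker)
    for stage in stages:
        state = stage(state)
    return state

print(run_pipeline("revenue per customer")["sql"])
```

Deactivating a module in the Streamlit "Graph Builder" corresponds to dropping a stage from the list before the graph runs, which is why disabling QUERY_MAKER leaves only the table-search results.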
Key features of Lang2SQL include natural language to SQL conversion, smart table discovery through semantic search, schema awareness powered by DataHub, an interactive web interface, and result visualization. It addresses common pain points for data teams (navigating vast table spaces, validating JOIN operations, optimizing query performance, and extracting meaningful insights) by automating table recommendations, generating SQL with appropriate column combinations, and applying performance best practices.
Installation is possible via pip or uv, or directly from the source. The tool can be run via a Command Line Interface (CLI) for direct query execution or as a Streamlit web application. Configuration options, previously managed via CLI flags, are increasingly consolidated within the Streamlit UI (e.g., DataHub server settings, VectorDB settings), though CLI flags remain for lang2sql query commands.
The project identifies several areas for future contribution:
- Containerization: Enhancing deployment simplicity through Docker and improving CI/CD pipelines.
- Agentic Architecture Development: Advancing the intelligence and autonomy of SQL generation and data discovery.
- DataHub Integration Enhancement: Deeper integration of DataHub's metadata (glossary, query examples) into the SQL generation process.
- VectorDB Flexibility Improvement: Decoupling the dependency on DataHub for VectorDB creation, allowing direct use of pre-existing vector databases.
- Monitoring/Logging Enhancement: Implementing robust systems for tracking usage patterns and performance.
- Documentation Enhancement: Creating comprehensive guides to lower contribution barriers.
- LLM Frontend Separation: Centralizing LLM calls and key management to a backend API for improved security, authorization, and monitoring.
Lang2SQL is an ongoing project by the Causal Inference Lab at Pseudo Lab (가짜연구소), operating under the MIT License, and welcomes community contributions.