Inside OpenAI’s in-house data agent

2026.01.30
· Service · by 이호민

#AI #Data Agent #OpenAI #LLM #Internal Tool

Key Points

  1. OpenAI developed an in-house AI data agent to empower employees with rapid, accurate data analysis from their massive internal data platform via natural language.
  2. Powered by GPT-5.2, the agent ensures accuracy through a self-learning process and a multi-layered context system incorporating schema, human annotations, Codex-derived code understanding, and persistent memory.
  3. Acting as a conversational teammate, this bespoke tool accelerates data analysis and maintains quality by continuously learning from user interactions and systematic evaluation using the Evals API.

The blog post, titled "Inside OpenAI’s in-house data agent," details the development and functionality of a bespoke, internal-only AI data agent designed to streamline data analysis within OpenAI. Built to address the challenges of navigating a massive data platform (3.5k users querying 600 petabytes across 70k datasets), the agent aims to provide quick, correct, and contextually relevant answers to complex data questions, enabling employees across various functions to go from question to insight in minutes.

Problem Statement:
OpenAI's data landscape poses significant challenges for analysts:

  1. Data Discovery: Difficulty in identifying the correct tables among thousands of similar ones, requiring extensive time to understand differences and relationships (e.g., logged-in vs. logged-out users, overlapping fields).
  2. Query Correctness: Producing accurate results necessitates reasoning about table data and relationships to apply transformations and filters correctly. Common failure modes include many-to-many joins, filter pushdown errors, and unhandled nulls, which can silently invalidate results.
  3. Analyst Focus: Analysts often spend undue time debugging SQL semantics and query performance instead of focusing on higher-value activities like metric definition, assumption validation, and data-driven decision-making.
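The first failure mode above can be made concrete with a small, hypothetical example (table and column names are invented): joining two tables on a non-unique key produces a many-to-many fan-out that silently inflates aggregates.

```python
import sqlite3

# Hypothetical schema: each user has multiple sessions and multiple devices.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sessions (user_id INT, session_id INT);
    CREATE TABLE devices  (user_id INT, device_id INT);
    INSERT INTO sessions VALUES (1, 10), (1, 11);   -- user 1: 2 sessions
    INSERT INTO devices  VALUES (1, 100), (1, 101); -- user 1: 2 devices
""")

# Naive join on user_id is many-to-many: 2 x 2 = 4 rows, so the
# session count is silently doubled.
inflated = con.execute("""
    SELECT COUNT(s.session_id) FROM sessions s
    JOIN devices d ON s.user_id = d.user_id
""").fetchone()[0]

# One fix: count distinct sessions (or aggregate before joining).
correct = con.execute("""
    SELECT COUNT(DISTINCT s.session_id) FROM sessions s
    JOIN devices d ON s.user_id = d.user_id
""").fetchone()[0]

print(inflated, correct)  # 4 2
```

Nothing errors and no rows go missing, which is exactly why this class of bug invalidates results silently.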

Core Methodology and Architecture:
The agent is powered by GPT-5.2 and leverages other OpenAI tools like Codex, GPT-5, the Evals API, and the Embeddings API. Its core strength lies in its closed-loop, self-learning reasoning process and its ability to curate and utilize rich, multi-layered context.

  1. Reasoning and Self-Correction: Instead of following a fixed script, the agent evaluates its own progress. If an intermediate result is incorrect (e.g., zero rows due to an invalid join), it investigates the error, adjusts its approach, and retries, carrying learnings forward. This iterative, self-correcting mechanism shifts the burden of iteration from the user to the agent, leading to faster and higher-quality analyses. The agent handles the full analytics workflow: data discovery, SQL generation and execution, and report publication.
  2. Contextual Grounding (Six Layers of Context): High-quality answers are ensured by grounding the agent in OpenAI's data, code, and institutional knowledge through six layers of context, with relevant context retrieved using Retrieval Augmented Generation (RAG) at query time:
    • Layer #1: Table Usage Metadata: Utilizes schema metadata (column names, data types) for SQL writing and table lineage (upstream/downstream relationships) for understanding data flow. Historical queries are ingested to infer common joins and query patterns.
    • Layer #2: Human Annotations: Domain experts provide curated descriptions of tables and columns, capturing intent, semantics, business meaning, and caveats not inferable from schema or query history.
    • Layer #3: Codex Enrichment: Codex derives a code-level definition of tables by crawling the codebase. This provides deeper understanding of data contents, uniqueness of values, update frequency, scope, and how data is derived from analytics events. This layer helps distinguish seemingly similar tables that differ critically (e.g., first-party ChatGPT traffic only). This context is automatically refreshed.
    • Layer #4: Institutional Knowledge: The agent accesses internal documents (Slack, Google Docs, Notion) containing critical company context (launches, incidents, codenames, metric definitions). These documents are ingested, embedded using the Embeddings API, and stored with metadata and permissions. A retrieval service handles access control and caching for efficient, safe information pull.
    • Layer #5: Memory: When given corrections by users or when it discovers nuances, the agent saves these learnings for future use. This allows it to retain non-obvious corrections, filters, and constraints crucial for data correctness but difficult to infer otherwise. Memories can be global or personal, and are editable.
    • Layer #6: Runtime Context: For new or stale information, the agent can issue live queries to the data warehouse to inspect and validate schemas and data in real-time. It can also interact with other Data Platform systems (metadata service, Airflow, Spark) for broader context outside the warehouse.
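The self-correcting loop described in point 1 above can be sketched as follows. This is a minimal, hypothetical illustration (the function names and retry policy are assumptions, not OpenAI's implementation): run a step, inspect the intermediate result, and retry with the failure carried forward as context rather than returning a silently wrong answer.

```python
# Hypothetical sketch of a self-correcting agent loop.
def run_with_self_correction(generate_sql, execute, question, max_attempts=3):
    learnings = []  # failures carried forward across retries
    for attempt in range(max_attempts):
        sql = generate_sql(question, learnings)
        rows = execute(sql)
        if rows:  # plausible intermediate result: stop iterating
            return sql, rows
        # e.g. zero rows from an invalid join: record it and retry
        learnings.append(f"attempt {attempt + 1}: {sql!r} returned no rows")
    raise RuntimeError("could not produce a plausible result")

# Toy stand-ins for the model and the warehouse:
def fake_generate(question, learnings):
    return "SELECT * FROM good_join" if learnings else "SELECT * FROM bad_join"

def fake_execute(sql):
    return [("row",)] if "good_join" in sql else []

sql, rows = run_with_self_correction(fake_generate, fake_execute, "daily actives?")
```

The key design point is that the iteration burden moves from the user to the agent: the user sees only the corrected final query and result.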

A daily offline pipeline aggregates table usage, human annotations, and Codex-derived enrichment into a normalized representation. This enriched context is then converted into embeddings using the OpenAI Embeddings API and stored for efficient RAG, ensuring fast and scalable table understanding.
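The retrieval side of this pipeline can be sketched as below. The embedding function here is a toy stand-in for the OpenAI Embeddings API (a bag-of-characters vector, so the example runs offline), and the table names and descriptions are invented; only the index-then-rank-by-cosine-similarity shape reflects the described design.

```python
import math

def embed(text: str) -> list[float]:
    # Stand-in for the Embeddings API: a toy bag-of-characters vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Offline: one normalized context document per table (usage metadata,
# human annotations, Codex enrichment), embedded and stored.
table_context = {
    "chatgpt_first_party_traffic": "first-party ChatGPT traffic only, logged-in users",
    "all_api_requests": "all API requests including third-party integrations",
}
index = {name: embed(doc) for name, doc in table_context.items()}

# Query time: embed the question and return the closest tables.
def retrieve(question: str, k: int = 1) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda name: cosine(q, index[name]), reverse=True)
    return ranked[:k]
```

Precomputing the per-table embeddings daily keeps the query-time work to a single embedding call plus a similarity ranking, which is what makes table understanding fast and scalable.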

  3. Teammate-like Interaction: The agent is designed for conversational, iterative exploration. It maintains full context across turns, allowing follow-up questions and redirection. It proactively asks clarifying questions when instructions are unclear and applies sensible defaults if no response is provided (e.g., assuming last 7/30 days for growth metrics without specified date range). It also supports workflows for recurring analyses, packaging common tasks into reusable instruction sets for consistency.
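The "sensible defaults" behavior might look like the sketch below. The 7/30-day windows come from the post; tying the choice to a granularity parameter is a hypothetical detail added for illustration.

```python
from datetime import date, timedelta

def resolve_date_range(start=None, end=None, granularity="daily", today=None):
    # If the user gave an explicit range, respect it; otherwise fall
    # back to the last 7 or 30 days (mapping to granularity is assumed).
    today = today or date.today()
    if start and end:
        return start, end
    days = 7 if granularity == "daily" else 30
    return today - timedelta(days=days), today

start, end = resolve_date_range(granularity="weekly", today=date(2026, 1, 30))
# -> (date(2025, 12, 31), date(2026, 1, 30))
```

Applying a stated default when the user does not answer a clarifying question keeps the conversation moving without hiding the assumption.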

Evaluation and Security:

  1. Systematic Evaluation: Quality control is maintained using the OpenAI Evals API. Evals are based on curated question-answer pairs, each linked to a "golden" SQL query representing the expected correct result. The agent's generated SQL is executed, and its output is compared against the golden result. The Evals grader assesses correctness and acceptable variation, providing a score and explanation. These act as continuous unit tests, identifying regressions early.
  2. Security: The agent functions as an interface layer, inheriting and enforcing OpenAI's existing security and access-control model. It provides pass-through access, meaning users can only query tables they are authorized to access. It either flags access denials or falls back to authorized datasets.
  3. Transparency: The agent exposes its reasoning process by summarizing assumptions and execution steps, linking directly to underlying results and raw data for user verification.
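The golden-SQL eval described in point 1 can be sketched as follows. This is a simplified, hypothetical version: an exact-match comparison stands in for the Evals API's model-based grader, which also accepts semantically acceptable variation, and the schema and queries are invented.

```python
import sqlite3

def grade(con, agent_sql, golden_sql):
    # Run both queries and compare result sets (order-insensitive).
    agent_rows = sorted(con.execute(agent_sql).fetchall())
    golden_rows = sorted(con.execute(golden_sql).fetchall())
    score = 1.0 if agent_rows == golden_rows else 0.0
    explanation = "match" if score else f"expected {golden_rows}, got {agent_rows}"
    return score, explanation

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE signups (day TEXT, n INT);
    INSERT INTO signups VALUES ('2026-01-01', 5), ('2026-01-02', 7);
""")
golden = "SELECT SUM(n) FROM signups"
score, why = grade(con, "SELECT 5 + 7", golden)
```

Run over a curated set of question-answer pairs, checks like this behave as continuous unit tests: any change to prompts, tools, or context layers that regresses a previously correct answer surfaces immediately.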

Lessons Learned:

  1. Less is More: Initially, exposing the full toolset led to ambiguity and confusion for the agent. Consolidating and restricting tool calls improved reliability.
  2. Guide the Goal, Not the Path: Highly prescriptive prompting degraded results. Shifting to higher-level guidance allowed GPT-5's reasoning to choose appropriate execution paths, leading to more robust and accurate outcomes.
  3. Meaning Lives in Code: Schemas and query history provide shape and usage, but Codex crawling the codebase reveals true data meaning, assumptions, freshness guarantees, and business intent, enabling more accurate reasoning about data contents and applicability.

Future Vision:
OpenAI is continuously improving the agent's ability to handle ambiguous questions, enhance reliability, and deepen workflow integrations, aiming for it to blend seamlessly into existing user workflows and provide fast, trustworthy data analysis.