Introducing GPT-5.3-Codex
Key Points
- OpenAI introduces GPT-5.3-Codex, a new agentic coding model that integrates advanced coding and reasoning capabilities, achieving a 25% speed increase and excelling at long-running tasks involving research, tool use, and complex execution.
- This model sets new industry highs on benchmarks like SWE-Bench Pro, Terminal-Bench, and OSWorld, demonstrating expanded capabilities from autonomous web development and general computer use to professional knowledge work such as creating presentations and analyzing data.
- GPT-5.3-Codex was instrumental in its own development and is the first model classified as "High capability" for cybersecurity with built-in vulnerability identification training and a comprehensive safety stack, now available through paid ChatGPT plans.
OpenAI introduced GPT-5.3-Codex on February 5, 2026, as a new agentic coding model that expands Codex's capabilities across the full spectrum of professional computer-based work. The model combines the frontier coding performance of GPT-5.2-Codex with the reasoning and professional knowledge capabilities of GPT-5.2, while also running 25% faster.
A significant advancement is GPT-5.3-Codex's ability to undertake long-running tasks involving research, tool use, and complex execution. It functions as an interactive collaborator: users can steer it in real time without losing context, receive frequent updates, and discuss approaches as it works. Notably, early versions of GPT-5.3-Codex were instrumental in the model's own creation. They were used to debug its training, manage deployment, and diagnose test results, which accelerated development.
The model achieves state-of-the-art performance on several key benchmarks:
- Coding:
  - SWE-Bench Pro: Achieves 56.8%, surpassing the previous state of the art of 56.4% set by GPT-5.2-Codex. SWE-Bench Pro evaluates real-world software engineering across four languages and offers greater contamination resistance, diversity, and industry relevance than SWE-bench Verified.
  - Terminal-Bench 2.0: Scores 77.3%, significantly exceeding GPT-5.2-Codex's 64.0%. This benchmark measures the terminal skills essential for coding agents.
  - These coding results are achieved with fewer tokens than prior models.
- Web Development: GPT-5.3-Codex demonstrates the ability to build highly functional, complex games and applications from scratch over several days, iteratively improving with generic prompts like "fix the bug." Examples include a racing game and a diving game. It also exhibits superior intent understanding for day-to-day websites, defaulting to more functional and production-ready designs (e.g., automatically discounted yearly plans, transitioning testimonial carousels).
- Professional Knowledge Work: Extending beyond coding, GPT-5.3-Codex supports the entire software lifecycle (debugging, deployment, monitoring, PRDs, user research) and general professional tasks like creating slide decks or analyzing data.
  - GDPval: Matches GPT-5.2's performance with a win/tie rate of 70.9% on GDPval, an evaluation released in 2025 measuring performance on well-specified knowledge-work tasks across 44 occupations, including presentations and spreadsheets.
  - OSWorld-Verified: Demonstrates significantly stronger computer-use capabilities with 64.7%, a substantial improvement over GPT-5.2-Codex's 38.2%. OSWorld is an agentic computer-use benchmark where models complete productivity tasks in a visual desktop environment.
The announcement highlights GPT-5.3-Codex's internal impact at OpenAI, where it has accelerated research and engineering. It assisted in monitoring and debugging training runs, analyzing interaction quality, proposing fixes, optimizing deployment harnesses, identifying bugs, scaling GPU clusters dynamically, and performing productivity analyses. For instance, it used simple regex classifiers to estimate clarification frequency, user responses, and task progress from session logs. It also enabled data scientists to build new pipelines and visualize alpha testing results, summarizing key insights from thousands of data points.
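To make the regex-classifier idea concrete, here is a minimal sketch of how clarification frequency and user responses might be estimated from session logs. It is not OpenAI's implementation: the patterns, the `(role, text)` log format, and the names `classify_turn` and `summarize_session` are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical patterns: the announcement does not describe the real classifiers.
CLARIFICATION = re.compile(
    r"\b(could you clarify|do you mean|which (one|file|option))\b", re.IGNORECASE
)
POSITIVE = re.compile(r"\b(thanks|great|perfect|looks good)\b", re.IGNORECASE)
NEGATIVE = re.compile(
    r"\b(wrong|not what i|still broken|doesn't work)\b", re.IGNORECASE
)


def classify_turn(role, text):
    """Label one log turn; `role` is assumed to be 'user' or 'assistant'."""
    if role == "assistant" and CLARIFICATION.search(text):
        return "clarification"
    if role == "user":
        # Check negative first so "thanks, but still broken" counts as negative.
        if NEGATIVE.search(text):
            return "negative_response"
        if POSITIVE.search(text):
            return "positive_response"
    return "other"


def summarize_session(turns):
    """Count labels and compute the share of assistant turns that ask to clarify."""
    counts = Counter(classify_turn(role, text) for role, text in turns)
    assistant_turns = sum(1 for role, _ in turns if role == "assistant")
    rate = counts["clarification"] / assistant_turns if assistant_turns else 0.0
    return {"counts": dict(counts), "clarification_rate": rate}


session = [
    ("user", "Fix the bug"),
    ("assistant", "Do you mean the login bug?"),
    ("user", "Yes, thanks"),
    ("assistant", "Done."),
]
summary = summarize_session(session)
```

Classifiers like this are cheap to run over many sessions, at the cost of missing phrasings the patterns do not anticipate, so their output is best read as a rough estimate.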
In cybersecurity, GPT-5.3-Codex is the first model classified as "High capability" under OpenAI's Preparedness Framework for cybersecurity tasks and the first directly trained to identify software vulnerabilities. Although OpenAI has no definitive evidence that the model can automate cyberattacks end to end, it nonetheless deploys a comprehensive safety stack that includes safety training, automated monitoring, trusted access, and enforcement pipelines. OpenAI is also launching "Trusted Access for Cyber," expanding "Aardvark" (a security research agent), partnering with open-source maintainers on codebase scanning (e.g., Next.js), and committing $10M in API credits through its Cybersecurity Grant Program to accelerate cyber defense research, particularly for open-source software and critical infrastructure.
GPT-5.3-Codex is available to paid ChatGPT users across the Codex app, CLI, IDE extension, and web interfaces, with API access planned soon. Improvements to the infrastructure and inference stack account for the 25% faster operation for Codex users. The model was co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems. This release marks a step towards a single, general-purpose agent capable of reasoning, building, and executing across the full spectrum of real-world technical work.
The appendix provides detailed benchmark results:
- SWE-Bench Pro (Public): GPT-5.3-Codex (xhigh) 56.8%, GPT-5.2-Codex (xhigh) 56.4%, GPT-5.2 (xhigh) 55.6%.
- Terminal-Bench 2.0: GPT-5.3-Codex (xhigh) 77.3%, GPT-5.2-Codex (xhigh) 64.0%, GPT-5.2 (xhigh) 62.2%.
- OSWorld-Verified: GPT-5.3-Codex (xhigh) 64.7%, GPT-5.2-Codex (xhigh) 38.2%, GPT-5.2 (xhigh) 37.9%.
- GDPval (wins or ties): GPT-5.3-Codex (xhigh) 70.9%, GPT-5.2 (high) 70.9%.
- Cybersecurity Capture The Flag Challenges: GPT-5.3-Codex (xhigh) 77.6%, GPT-5.2-Codex (xhigh) 67.4%, GPT-5.2 (xhigh) 67.7%.
- SWE-Lancer IC Diamond: GPT-5.3-Codex (xhigh) 81.4%, GPT-5.2-Codex (xhigh) 76.0%, GPT-5.2 (xhigh) 74.6%.
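As a reading aid, the size of each improvement over GPT-5.2-Codex can be derived directly from the figures listed above. The sketch below computes both the absolute gain in percentage points and the relative gain. The scores are copied from the list; the names `RESULTS` and `deltas` are illustrative, and GDPval is left out because the list gives no GPT-5.2-Codex score for it.

```python
# Scores (%) from the list above: (GPT-5.3-Codex, GPT-5.2-Codex), both at xhigh.
RESULTS = {
    "SWE-Bench Pro (Public)": (56.8, 56.4),
    "Terminal-Bench 2.0": (77.3, 64.0),
    "OSWorld-Verified": (64.7, 38.2),
    "Cybersecurity CTF": (77.6, 67.4),
    "SWE-Lancer IC Diamond": (81.4, 76.0),
}


def deltas(results):
    """Return {benchmark: (percentage-point gain, relative gain in %)}."""
    out = {}
    for name, (new, old) in results.items():
        points = new - old
        out[name] = (round(points, 1), round(100 * points / old, 1))
    return out


for name, (points, relative) in deltas(RESULTS).items():
    print(f"{name}: +{points} points ({relative}% relative)")
```

The two measures answer different questions: the gap on SWE-Bench Pro is 0.4 points, while OSWorld-Verified rises by 26.5 points from a much lower baseline, which makes its relative gain the largest of the five.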