Introducing GPT-5.4

2026.03.06
· Service · by 권준호
#Agent #AI #Computer Vision #GPT #LLM

Key Points

  1. OpenAI has launched GPT-5.4, its new frontier model designed for professional work, integrating advances in reasoning, coding, and agentic workflows.
  2. GPT-5.4 features significant improvements in native computer-use capabilities, visual perception, and factual accuracy, alongside enhanced token efficiency and support for up to 1M tokens of context.
  3. The model achieves state-of-the-art performance across various benchmarks, including knowledge work (GDPval), computer use (OSWorld-Verified), coding (SWE-Bench Pro), and web browsing (BrowseComp), and is available in ChatGPT, the API, and Codex.

This article summarizes the release of OpenAI's GPT-5.4, a new frontier model designed for professional work, integrating advancements in reasoning, coding, and agentic workflows. GPT-5.4 is available in ChatGPT (as GPT-5.4 Thinking and GPT-5.4 Pro), the API (as gpt-5.4 and gpt-5.4-pro), and Codex, succeeding GPT-5.3-Codex and GPT-5.2.

Core Methodological Advancements and Technical Details:

  1. Enhanced Reasoning and Knowledge Work:
GPT-5.4 significantly improves upon GPT-5.2's general reasoning capabilities, focusing on professional-grade output. On the GDPval benchmark, which assesses agents' abilities to produce well-specified knowledge work across 44 occupations, GPT-5.4 achieves an 83.0% win/tie rate against industry professionals, a notable increase from GPT-5.2's 70.9%. This involves tasks such as sales presentations, accounting spreadsheets, and manufacturing diagrams, typically executed with a high reasoning effort setting ("xhigh"). For internal spreadsheet modeling tasks relevant to junior investment banking analysts, GPT-5.4 scores 87.3% accuracy, compared to GPT-5.2's 68.4%. Furthermore, human raters preferred presentations generated by GPT-5.4 68.0% of the time due to superior aesthetics and visual variety. A key aspect is the reduction of factual errors and hallucinations: GPT-5.4's individual claims are 33% less likely to be false, and its full responses are 18% less likely to contain any errors compared to GPT-5.2.
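The "xhigh" reasoning effort setting mentioned above would typically be selected per request. A minimal sketch of such a request payload, modeled on the shape of the OpenAI Responses API (field names here are assumptions, not confirmed documentation for this model):

```python
# Sketch: building a request payload with an explicit reasoning effort.
# The "reasoning"/"effort" field shape follows the OpenAI Responses API;
# treat the exact names and the "xhigh" value as assumptions.

def build_request(prompt: str, effort: str = "xhigh") -> dict:
    """Assemble a request payload with a chosen reasoning effort."""
    allowed = {"low", "medium", "high", "xhigh"}
    if effort not in allowed:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return {
        "model": "gpt-5.4",
        "reasoning": {"effort": effort},
        "input": prompt,
    }

payload = build_request("Draft a sales presentation outline for Q3.")
```

The payload would then be sent with an API client; no network call is made in this sketch.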

  2. Native Computer-Use and Advanced Vision Capabilities:
GPT-5.4 marks a significant leap by introducing native, state-of-the-art computer-use capabilities, enabling agents to operate computers and execute complex workflows across applications. This is facilitated by a substantially expanded context window, supporting up to 1 million tokens, crucial for long-horizon planning, execution, and verification of tasks.
  • Desktop Navigation: On OSWorld-Verified, which measures a model's ability to navigate a desktop environment using screenshots and keyboard/mouse actions, GPT-5.4 achieves a 75.0% success rate, dramatically surpassing GPT-5.2's 47.3% and even human performance at 72.4%.
  • Web Interaction: For browser-based tasks, GPT-5.4 shows improved performance on WebArena-Verified (67.3% with DOM- and screenshot-driven interaction) and Online-Mind2Web (92.8% using screenshot-based observations).
  • Visual Perception: These computer-use capabilities are underpinned by improved general visual perception. On MMMU-Pro, a test of visual understanding and reasoning, GPT-5.4 achieves 81.2% success without tools and 82.1% with tools, an improvement over GPT-5.2's scores. Document parsing is also enhanced, with OmniDocBench showing an average error (normalized edit distance) of 0.109 for GPT-5.4, down from 0.140 for GPT-5.2. The model now supports higher image input detail levels: "original" up to 10.24 million pixels or 6000-pixel maximum dimension, and "high" up to 2.56 million pixels or 2048-pixel maximum dimension, improving localization, image understanding, and click accuracy for agents.
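The image detail tiers above imply a simple fit check before upload. The limits ("original": 10.24 MP / 6000 px, "high": 2.56 MP / 2048 px) come from the text; the helper itself is illustrative and not part of any official SDK:

```python
# Illustrative check for the image input detail tiers described above.
# Limits: max total pixels and max single dimension per tier (from the text).

DETAIL_LIMITS = {
    "high": (2_560_000, 2048),       # 2.56 MP, 2048-pixel maximum dimension
    "original": (10_240_000, 6000),  # 10.24 MP, 6000-pixel maximum dimension
}

def fits_tier(width: int, height: int, tier: str) -> bool:
    """True if an image fits the given detail tier without downscaling."""
    max_pixels, max_dim = DETAIL_LIMITS[tier]
    return width * height <= max_pixels and max(width, height) <= max_dim
```

For example, a 4000x2500 screenshot (10 MP) exceeds the "high" tier but still fits "original".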

  3. Advanced Coding and Developer Workflows:
GPT-5.4 integrates the industry-leading coding capabilities of GPT-5.3-Codex with its new knowledge work and computer-use strengths. It matches or outperforms GPT-5.3-Codex on SWE-Bench Pro (57.7% for GPT-5.4 vs. 56.8% for GPT-5.3-Codex), while offering lower latency across various reasoning efforts. A "fast mode" in Codex and "priority processing" in the API enable up to 1.5x faster token velocity. The model excels in complex frontend tasks, and an experimental "Playwright (Interactive)" skill in Codex allows for visual debugging and testing of web and Electron applications, even during their construction.
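Priority processing is ordinarily requested per call. As a sketch: the `service_tier` field does exist in the OpenAI API, but whether `"priority"` is the exact value tied to this feature is an assumption here:

```python
# Sketch: opting a request into priority processing for lower latency.
# "service_tier" is a real OpenAI API field; the "priority" value's mapping
# to the feature described above is assumed, not confirmed.

def coding_request(prompt: str, priority: bool = False) -> dict:
    """Build a request payload, optionally marked for priority processing."""
    payload = {"model": "gpt-5.4", "input": prompt}
    if priority:
        payload["service_tier"] = "priority"
    return payload

req = coding_request("Fix the failing test in utils.py", priority=True)
```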

  4. Sophisticated Tool Use and Agentic Orchestration:
The model introduces "tool search" in the API, allowing agents to efficiently navigate large tool ecosystems. Instead of including all tool definitions upfront, GPT-5.4 receives a lightweight list and dynamically looks up and appends specific tool definitions when needed. This significantly reduces token usage (e.g., 47% reduction on MCP Atlas benchmark) and improves response times by preserving the cache. Tool calling accuracy and efficiency are also enhanced on benchmarks like Toolathlon (54.6% for GPT-5.4 vs. 45.7% for GPT-5.2). GPT-5.4's web search capabilities are substantially improved, leading to a 17 percentage point absolute gain over GPT-5.2 on BrowseComp, particularly for "needle-in-a-haystack" queries requiring persistent multi-round searching and synthesis.
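The tool-search pattern described above can be sketched as a registry the agent queries on demand: the model initially sees only a lightweight name list, and a full definition is looked up and appended to the context only when needed. The registry contents below are hypothetical:

```python
# Minimal sketch of the "tool search" pattern: lightweight listing upfront,
# full tool definitions appended on demand. Tool names and schemas here are
# hypothetical examples, not a real API surface.

TOOL_REGISTRY = {
    "get_weather": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {"city": "string"},
    },
    "create_invoice": {
        "name": "create_invoice",
        "description": "Create a draft invoice for a customer.",
        "parameters": {"customer_id": "string", "amount": "number"},
    },
}

def lightweight_listing() -> list[str]:
    """What the model sees upfront: tool names only, not full schemas."""
    return sorted(TOOL_REGISTRY)

def lookup_tool(name: str) -> dict:
    """Fetch a full tool definition on demand."""
    return TOOL_REGISTRY[name]

# Full schemas enter the context only when the agent decides it needs them,
# which keeps the upfront prompt small and preserves the prompt cache.
context_tools: list[dict] = [lookup_tool("get_weather")]
```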

  5. Efficiency and Steerability:
GPT-5.4 is described as the most token-efficient reasoning model yet, using significantly fewer tokens to solve problems compared to GPT-5.2, resulting in reduced cost and faster speeds. In ChatGPT, GPT-5.4 Thinking can now provide an upfront plan of its reasoning process, allowing users to adjust course mid-response and guiding the model more effectively. It also maintains stronger context awareness over longer conversations and complex prompts.

Safety and Deployment:
Treated as "High cyber capability" under OpenAI's Preparedness Framework, GPT-5.4 incorporates an expanded cyber safety stack, including monitoring systems, trusted access controls, and asynchronous blocking. Safety research on Chain-of-Thought (CoT) monitorability indicates that GPT-5.4 Thinking has low CoT controllability, suggesting it lacks the ability to intentionally obfuscate its reasoning, thus validating CoT monitoring as an effective safety tool.

Pricing and Availability:
GPT-5.4 is priced higher than GPT-5.2 to reflect its enhanced capabilities ($2.50/M input tokens and $15/M output tokens for gpt-5.4, vs. $1.75/M input and $14/M output for gpt-5.2). However, its greater token efficiency aims to reduce overall task costs. GPT-5.4 Pro models (gpt-5.4-pro) are available for maximum performance on complex tasks. It is gradually rolling out across OpenAI's platforms.
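A back-of-the-envelope check of how token efficiency can offset the higher per-token price: the rates are those quoted above, while the token counts are hypothetical (here, assuming gpt-5.4 needs 30% fewer output tokens for the same task):

```python
# Per-task cost comparison at the published per-million-token rates.
# Token counts below are hypothetical, chosen only to illustrate how
# fewer reasoning tokens can offset a higher per-token price.

PRICES = {  # USD per million tokens: (input, output)
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.2": (1.75, 14.00),
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one task at the given model's rates."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

old_cost = task_cost("gpt-5.2", 10_000, 50_000)  # same prompt, more tokens
new_cost = task_cost("gpt-5.4", 10_000, 35_000)  # 30% fewer output tokens
```

With these assumed counts the gpt-5.4 run comes out cheaper despite the higher rates; with equal token counts it would not.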