gpt-oss
Key Points
- OpenAI has launched gpt-oss, a series of open-weight models (20B and 120B parameters) in partnership with Ollama, designed for powerful reasoning, agentic tasks, and versatile developer use cases.
- These models feature agentic capabilities like function calling and web browsing, full chain-of-thought access, configurable reasoning effort, and are fine-tunable under a permissive Apache 2.0 license.
- Utilizing MXFP4 quantization for their mixture-of-experts weights, the 20B model requires as little as 16GB memory, while the 120B model is optimized to fit on a single 80GB GPU.
This document details the OpenAI gpt-oss open-weight models, a collaboration with Ollama, designed for powerful reasoning, agentic tasks, and versatile developer use cases. The offering includes two primary models: a 20B parameter model and a 120B parameter model, both optimized for local deployment via Ollama.
The models are characterized by a 128K token context window and are primarily designed for text-based inputs. The 20B parameter model runs in as little as 16GB of memory, while the 120B parameter model is optimized to fit on a single 80GB GPU.
Key features of the gpt-oss models include:
- Agentic Capabilities: Native support for function calling, web browsing (with optional built-in web search via Ollama), Python tool calls, and generation of structured outputs.
- Full Chain-of-Thought: Provides complete access to the model's internal reasoning processes, enhancing debuggability and output trustworthiness.
- Configurable Reasoning Effort: Allows users to adjust the computational effort for reasoning (low, medium, high) to balance performance with latency requirements.
- Fine-tunability: Models are designed to be customizable through parameter fine-tuning for specific use cases.
- Permissive Licensing: Distributed under the Apache 2.0 license, enabling unrestricted experimentation, customization, and commercial deployment without copyleft restrictions or patent risk.
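The agentic features above use the OpenAI-style chat and function-calling request shape. As a minimal sketch, the payload below shows how a tool definition and a reasoning-effort hint might be combined in one request; the `get_weather` tool and its schema are hypothetical, and the `Reasoning: high` system hint follows the convention OpenAI describes for gpt-oss (exact mechanics may vary by client). Actually sending this requires a running Ollama server, so the sketch only assembles and prints the body:

```python
import json

# Hypothetical tool definition in the OpenAI-style function-calling schema.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Chat request body (sketch only; not sent anywhere in this example).
request_body = {
    "model": "gpt-oss:20b",
    "messages": [
        # Reasoning effort is selectable: low, medium, or high.
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "What's the weather in Berlin?"},
    ],
    "tools": [get_weather_tool],
}

print(json.dumps(request_body, indent=2))
```

When the model decides to call a tool, the response carries the tool name and arguments back in structured form, so the caller can execute the function and feed the result into a follow-up message.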
A core methodological innovation for memory footprint reduction is the quantization of model weights. Specifically, the gpt-oss models undergo post-training quantization of their Mixture-of-Experts (MoE) weights to the MXFP4 format. This process quantizes the weights to 4.25 bits per parameter. Given that MoE weights constitute over 90% of the total parameter count, this quantization significantly reduces memory requirements. The 20B parameter model, after quantization, can operate on systems with as little as 16GB of memory, while the 120B parameter model is optimized to fit on a single 80GB GPU. Ollama provides native support for the MXFP4 format, eliminating the need for additional quantizations or conversions. New kernels have been developed within Ollama's engine to support MXFP4, with benchmarking performed in collaboration with OpenAI to ensure equivalent quality to their reference implementations.
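The memory savings can be sanity-checked with back-of-the-envelope arithmetic. Assuming roughly 90% of the ~20B parameters are MoE weights stored at about 4.25 bits each (MXFP4's 4-bit values plus shared block scaling factors), and, as a simplifying assumption, that the remaining weights stay in 16-bit precision:

```python
# Rough weight-memory estimate for the 20B model under MXFP4 MoE quantization.
# 90% MoE share and 4.25 bits/parameter are the cited figures; keeping the
# remaining weights in bf16 (16 bits) is a simplifying assumption.
total_params = 20e9
moe_share = 0.90
mxfp4_bits = 4.25   # 4-bit values plus per-block scaling overhead
other_bits = 16.0   # non-MoE weights left in bf16

moe_bytes = total_params * moe_share * mxfp4_bits / 8
other_bytes = total_params * (1 - moe_share) * other_bits / 8
total_gb = (moe_bytes + other_bytes) / 1e9

print(f"~{total_gb:.1f} GB of weights")  # comfortably inside a 16GB budget
```

This leaves headroom in a 16GB system for the KV cache and activations, which is why the 20B model is quoted as running in "as little as" 16GB rather than exactly its weight size.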
The 20B parameter variant, gpt-oss:20b, is tailored for lower latency, local, or specialized applications. The 120B parameter model, gpt-oss:120b, is designed for more demanding tasks requiring larger capacity. Both models can be accessed and run through the Ollama platform using commands like ollama run gpt-oss:20b or ollama run gpt-oss:120b.