gpt-oss

2025.08.10
Web · by Anonymous
#LLM #OpenAI #Agent #Quantization #Ollama

Key Points

  • OpenAI has launched gpt-oss, a series of open-weight models (20B and 120B parameters) in partnership with Ollama, designed for powerful reasoning, agentic tasks, and versatile developer use cases.
  • These models feature agentic capabilities like function calling and web browsing, full chain-of-thought access, configurable reasoning effort, and are fine-tunable under a permissive Apache 2.0 license.
  • Utilizing MXFP4 quantization for their mixture-of-experts weights, the 20B model requires as little as 16GB memory, while the 120B model is optimized to fit on a single 80GB GPU.

This document details the OpenAI gpt-oss open-weight models, a collaboration with Ollama, designed for powerful reasoning, agentic tasks, and versatile developer use cases. The offering includes two primary models: a 20B parameter model and a 120B parameter model, both optimized for local deployment via Ollama.

The models are characterized by a 128K token context window and are primarily designed for text-based inputs. The 20B parameter model occupies approximately 14GB of memory, while the 120B parameter model requires around 65GB.

Key features of the gpt-oss models include:

  • Agentic Capabilities: Native support for function calling, web browsing (with optional built-in web search via Ollama), Python tool calls, and generation of structured outputs.
  • Full Chain-of-Thought: Provides complete access to the model's internal reasoning processes, enhancing debuggability and output trustworthiness.
  • Configurable Reasoning Effort: Allows users to adjust the computational effort for reasoning (low, medium, high) to balance performance with latency requirements.
  • Fine-tunability: Models are designed to be customizable through parameter fine-tuning for specific use cases.
  • Permissive Licensing: Distributed under the Apache 2.0 license, enabling unrestricted experimentation, customization, and commercial deployment without copyleft or patent restrictions.
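To make the configurable reasoning effort concrete, here is a minimal sketch of a request body for Ollama's local /api/chat endpoint. The convention of selecting effort through a "Reasoning: low|medium|high" system message follows OpenAI's harmony prompt format; whether your Ollama version exposes a dedicated setting instead is worth checking in the current documentation, so treat this payload shape as an assumption.

```python
import json

def build_chat_request(prompt: str, effort: str = "medium") -> dict:
    """Build a JSON body for POST http://localhost:11434/api/chat.

    The "Reasoning: <effort>" system message is an assumption based on
    OpenAI's harmony prompt format, not a guaranteed Ollama feature.
    """
    if effort not in ("low", "medium", "high"):
        raise ValueError(f"unknown reasoning effort: {effort}")
    return {
        "model": "gpt-oss:20b",
        "messages": [
            # Higher effort trades latency for a deeper chain of thought.
            {"role": "system", "content": f"Reasoning: {effort}"},
            {"role": "user", "content": prompt},
        ],
        "stream": False,
    }

body = build_chat_request("Why is the sky blue?", effort="high")
print(json.dumps(body, indent=2))
```

Posting this body to a locally running Ollama instance would return the model's reply along with its chain of thought, which the application can surface or hide as needed.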

A core methodological innovation for memory footprint reduction is the quantization of model weights. Specifically, the gpt-oss models undergo post-training quantization of their Mixture-of-Experts (MoE) weights to the MXFP4 format. This process quantizes the weights to 4.25 bits per parameter. Given that MoE weights constitute over 90% of the total parameter count, this quantization significantly reduces memory requirements. The 20B parameter model, after quantization, can operate on systems with as little as 16GB of memory, while the 120B parameter model is optimized to fit on a single 80GB GPU. Ollama provides native support for the MXFP4 format, eliminating the need for additional quantizations or conversions. New kernels have been developed within Ollama's engine to support MXFP4, with benchmarking performed in collaboration with OpenAI to ensure equivalent quality to their reference implementations.
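The 4.25-bits-per-parameter figure follows from the MX format itself: MXFP4 stores 4-bit values in blocks of 32 that share one 8-bit scale, i.e. 4 + 8/32 = 4.25 bits. A back-of-envelope estimate of the resulting footprint can be sketched as below; the 90% MoE share and 16-bit precision for the remaining weights are assumptions for illustration, not official figures (the larger model's MoE share is higher, which is why its real footprint lands nearer 65GB than this floor of 90% suggests).

```python
def mxfp4_footprint_gb(total_params: float, moe_share: float = 0.90) -> float:
    """Rough memory estimate for an MXFP4-quantized MoE model.

    Assumes `moe_share` of the parameters are MoE weights at 4.25 bits
    each (4-bit values plus one shared 8-bit scale per 32-element block:
    4 + 8/32 = 4.25) and the rest stay in 16-bit precision.
    """
    moe_bits = total_params * moe_share * 4.25        # quantized expert weights
    rest_bits = total_params * (1 - moe_share) * 16   # unquantized remainder
    return (moe_bits + rest_bits) / 8 / 1e9           # bits -> decimal GB

print(round(mxfp4_footprint_gb(20e9), 1))  # ~13.6, near the quoted ~14GB
```

Without quantization, the same 20B parameters at 16 bits would need roughly 40GB, so MXFP4 is what brings the model within reach of 16GB machines.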

The 20B parameter variant, gpt-oss:20b, is tailored for lower-latency, local, or specialized applications. The 120B parameter model, gpt-oss:120b, is designed for more demanding tasks requiring larger capacity. Both models can be accessed and run through the Ollama platform using commands like ollama run gpt-oss:20b or ollama run gpt-oss:120b.