Introducing Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning | NVIDIA Technical Blog
Key Points
- Nemotron 3 Super is an open, 120B-total/12B-active-parameter hybrid Mamba-Transformer Mixture-of-Experts (MoE) model designed to tackle the "thinking tax" and "context explosion" in agentic AI systems.
- It introduces architectural innovations including a hybrid Mamba-Transformer backbone for a 1M-token context, Latent MoE for efficient expert utilization, multi-token prediction for faster generation, and native NVFP4 pretraining for optimized performance.
- The model achieves leading accuracy on agentic benchmarks like PinchBench and is fully open-sourced with weights, datasets, and recipes to enable easy customization, optimization, and deployment.
Nemotron 3 Super is an open, hybrid Mamba-Transformer Mixture-of-Experts (MoE) model designed for agentic AI systems, addressing limitations such as "context explosion" and "thinking tax." It is a 120B total parameter model with 12B active parameters, featuring a native 1M-token context window and aiming for maximum compute efficiency and accuracy for complex multi-agent applications.
The core methodology of Nemotron 3 Super incorporates several architectural innovations:
- Hybrid Mamba-Transformer Backbone: The model interleaves Mamba-2 layers with Transformer attention layers. Mamba-2 layers handle the majority of sequence processing in linear time with respect to sequence length, keeping the memory footprint low enough to make the 1M-token context window practical. Transformer attention layers are interleaved at key depths to preserve precise associative recall, which is crucial for retrieving specific facts within long contexts and is an area where pure state space models (SSMs) struggle. This hybrid approach delivers higher throughput and 4x better memory and compute efficiency.
- Latent MoE: Unlike standard MoE architectures, where tokens are routed directly from the full hidden dimension, Latent MoE projects token embeddings into a compressed, low-rank latent space *before* routing decisions. Expert computation then occurs in this smaller dimension, and results are projected back to the full model dimension. Because tokens are compressed before they reach the experts, the model can consult 4x as many expert specialists for the same inference cost, enabling finer-grained specialization (e.g., distinct experts for Python syntax vs. SQL logic).
- Multi-Token Prediction (MTP): Instead of predicting one token at a time, Super is trained with MTP, where specialized prediction heads forecast multiple future tokens simultaneously from each position. This design uses a shared-weight approach across all MTP heads, minimizing parameter overhead and improving training stability. During training, MTP forces the model to internalize longer-range structure and logical dependencies, leading to stronger reasoning. At inference, it provides built-in speculative decoding, offering draft predictions that can be verified in parallel, resulting in up to 3x wall-clock speedups for long sequence generation.
- Native NVFP4 Pretraining: Most quantized models are compressed after full-precision training. Nemotron 3 Super, however, runs the majority of floating-point multiply-accumulate operations during pretraining directly in NVFP4, NVIDIA's 4-bit floating-point format optimized for Blackwell. This native reduced-precision training allows the model to learn accuracy within 4-bit arithmetic constraints from the first gradient update, significantly cutting memory requirements and speeding up inference by 4x on NVIDIA B200 compared to FP8 on NVIDIA H100, while maintaining accuracy and ensuring mathematical stability.
The model's training pipeline involves three sequential phases:
- Pretraining: Conducted on 25 trillion tokens (including 10 trillion unique curated tokens with additional compute focused on reasoning and coding) using native NVFP4.
- Supervised Fine-tuning (SFT): The model is fine-tuned on approximately 7 million SFT samples, drawn from a broader 40 million post-training corpus, covering reasoning, instruction following, coding, safety, and multi-step agent tasks. This establishes a behavioral foundation.
- Multi-environment Reinforcement Learning (RL): Post-training is performed using reinforcement learning across 21 diverse environment configurations in NVIDIA NeMo Gym and NeMo RL, involving over 1.2 million environment rollouts. This trajectory-based RL aligns the model with real agentic behavior: sequences of actions are evaluated against verifiable outcomes, reducing reasoning drift and improving the handling of structured operations common in agentic pipelines.
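The trajectory-based, verifiable-reward pattern behind the RL phase can be sketched as a rollout loop. The environment, policy, and reward function here are toy stand-ins, not the NeMo Gym or NeMo RL APIs:

```python
# Minimal sketch of trajectory-based RL with verifiable rewards: the agent
# takes a sequence of actions, and reward comes from an outcome checker
# rather than a learned preference model.
import random

random.seed(1)

def environment_task():
    # Toy verifiable task: an arithmetic prompt with a known answer.
    a, b = random.randint(1, 9), random.randint(1, 9)
    return {"prompt": f"{a}+{b}", "answer": a + b}

def policy(prompt):
    # Toy multi-step "agent": parse, compute, emit an action trajectory.
    a, b = map(int, prompt.split("+"))
    return [("parse", (a, b)), ("compute", a + b), ("answer", a + b)]

def rollout():
    task = environment_task()
    trajectory = policy(task["prompt"])
    final = trajectory[-1][1]
    reward = 1.0 if final == task["answer"] else 0.0  # verifiable outcome
    return trajectory, reward

rewards = [rollout()[1] for _ in range(100)]
```

Because the reward checks the whole trajectory's outcome rather than scoring individual tokens, the training signal matches how agents are actually judged in deployment.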
Nemotron 3 Super achieves leading accuracy on agentic benchmarks, scoring 85.6% on PinchBench. The model is fully open, providing weights on Hugging Face and NVIDIA NIM, complete training and evaluation recipes, deployment cookbooks (vLLM, SGLang, TensorRT LLM), fine-tuning cookbooks (LoRA/SFT, GRPO/DAPO), and open datasets (pretraining corpora, post-training datasets, RL tasks/environments). This open ecosystem aims to facilitate customization, optimization, and deployment by developers.