The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation
Key Points
- Meta AI introduces the Llama 4 herd, featuring Llama 4 Scout and Llama 4 Maverick, its first open-weight, natively multimodal, Mixture-of-Experts (MoE) models.
- These models leverage distillation from the larger Llama 4 Behemoth, training methods such as online RL, and the iRoPE architecture to achieve industry-leading context lengths and strong benchmark performance.
- Available for immediate use and download, the Llama 4 series aims to foster open innovation, enable personalized AI experiences, and address safety and bias in large language models.
The Llama 4 herd, comprising Llama 4 Scout, Llama 4 Maverick, and the unreleased teacher model Llama 4 Behemoth, represents Meta AI's first suite of open-weight, natively multimodal large language models leveraging a Mixture-of-Experts (MoE) architecture. Released on April 5, 2025, these models aim to enable personalized multimodal AI experiences with significantly extended context capabilities.
Model Architectures and Parameters:
- Llama 4 Scout: A general-purpose model with 17 billion active parameters and 16 experts. It fits on a single NVIDIA H100 GPU when using Int4 quantization.
- Llama 4 Maverick: A more powerful general-purpose model, also with 17 billion active parameters but utilizing 128 experts, resulting in 400 billion total parameters. It fits on a single NVIDIA H100 host.
- Llama 4 Behemoth: The teacher model for the Llama 4 series, featuring 288 billion active parameters and 16 experts, with nearly two trillion total parameters. It is also a multimodal MoE model.
Pre-training Innovations:
- Mixture-of-Experts (MoE): Llama 4 models are the first in the series to adopt MoE, where a single token activates only a subset of the total parameters. For instance, Llama 4 Maverick has 17B active parameters but 400B total parameters. The architecture employs alternating dense and MoE layers for inference efficiency. Each token is routed to a shared expert and one of 128 routed experts, activating only a fraction of total parameters at inference time, which reduces serving costs and latency.
- Native Multimodality with Early Fusion: The models incorporate early fusion to seamlessly integrate text, image, and video tokens into a unified model backbone. This enables joint pre-training on large, unlabeled text, image, and video datasets.
- Improved Vision Encoder: The vision encoder, based on MetaCLIP, is trained separately in conjunction with a frozen Llama model to better adapt the encoder to the LLM.
- MetaP Training Technique: A novel technique for reliably setting critical model hyperparameters, such as per-layer learning rates and initialization scales. These hyperparameters demonstrate transferability across different batch sizes, model widths, depths, and training tokens.
- Multilingual Data Scaling: Pre-training on 200 languages, with over 100 languages containing more than 1 billion tokens each, resulting in 10x more multilingual tokens than Llama 3.
- Efficient Training: Training uses FP8 precision without sacrificing quality, yielding high model FLOPs utilization; Llama 4 Behemoth pre-training on 32K GPUs sustained 390 TFLOPs/GPU.
- Massive Data Mixture: Training on over 30 trillion tokens, more than double the Llama 3 pre-training mixture, encompassing diverse text, image, and video datasets.
- Mid-training for Core Capabilities: Continued training with new recipes, including long context extension using specialized datasets, to enhance model quality and achieve the 10M input context length for Llama 4 Scout.
- Context Length Innovations: Llama 4 Scout supports an industry-leading 10 million tokens. A key architectural innovation is the use of interleaved attention layers without positional embeddings. The iRoPE (interleaved Rotary Position Embeddings) architecture is introduced, where "i" signifies "interleaved" attention layers and aims for "infinite" context length. Inference-time temperature scaling of attention is employed to enhance length generalization.
- Visual Understanding: Pre-training on a wide variety of image and video frame stills (up to 48 images) for broad visual understanding, including temporal activities. Post-training has shown good results with up to eight images.
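The shared-plus-routed expert scheme described above can be sketched in a few lines of numpy. This is a toy illustration only: the hidden size, gating function, and weight initialization here are assumptions, not Llama 4's actual implementation; what it shows is the key cost property, that each token executes only the shared expert and its single top-scoring routed expert out of the full expert pool.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_ROUTED = 64, 128   # toy hidden size; 128 routed experts as in Maverick

def make_expert():
    """A tiny two-layer feed-forward 'expert' (illustrative shapes)."""
    return (rng.standard_normal((D, 4 * D)) * 0.02,
            rng.standard_normal((4 * D, D)) * 0.02)

def ffn(x, w_in, w_out):
    return np.maximum(x @ w_in, 0.0) @ w_out

shared = make_expert()
routed = [make_expert() for _ in range(N_ROUTED)]
router_w = rng.standard_normal((D, N_ROUTED)) * 0.02

def moe_layer(tokens):
    """Each token runs through the shared expert plus its single
    top-scoring routed expert: 2 of 129 expert FFNs per token."""
    logits = tokens @ router_w                     # (n_tokens, N_ROUTED)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)          # softmax router gate
    top1 = logits.argmax(-1)                       # chosen expert per token
    out = np.empty_like(tokens)
    for i, tok in enumerate(tokens):
        e = top1[i]
        out[i] = ffn(tok, *shared) + probs[i, e] * ffn(tok, *routed[e])
    return out

tokens = rng.standard_normal((4, D))
print(moe_layer(tokens).shape)   # (4, 64)
```

The serving-cost claim follows directly: total parameters scale with the number of experts, while per-token compute scales only with the two experts actually executed.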
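The inference-time attention temperature scaling mentioned for iRoPE can be sketched as follows. Meta does not publish the exact rule, so the log-length temperature used here (sharpening attention once the context exceeds a nominal training length) is an assumed illustrative form; `train_len` and `beta` are hypothetical parameters.

```python
import numpy as np

def attn_probs(q, k, seq_len, train_len=8192, beta=0.1):
    """Scaled dot-product attention with an inference-time temperature.
    The temperature grows logarithmically once seq_len exceeds train_len,
    counteracting the flattening of softmax over very long contexts.
    (Assumed form for illustration; not Meta's published formula.)"""
    scale = 1.0 / np.sqrt(q.shape[-1])
    temp = 1.0 + beta * max(0.0, np.log(seq_len / train_len))
    logits = temp * scale * (q @ k.T)              # (n_queries, n_keys)
    logits -= logits.max(-1, keepdims=True)        # numerical stability
    p = np.exp(logits)
    return p / p.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 32))
k = rng.standard_normal((16, 32))
print(attn_probs(q, k, seq_len=10_000_000).shape)  # (4, 16)
```

At 10M tokens the temperature exceeds 1, so attention distributions are sharper than at the training length, which is the intended length-generalization effect; in the interleaved design this would apply to the attention layers that carry no positional embeddings.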
Post-training Pipeline and Distillation:
- Revamped Pipeline: A new three-stage pipeline: lightweight supervised fine-tuning (SFT) > online reinforcement learning (RL) > lightweight direct preference optimization (DPO).
- Data Curation: A key learning was that SFT and DPO could over-constrain the model. The pipeline removed over 50% of data tagged as "easy" using Llama models as judges, performing lightweight SFT on the remaining "harder" set.
- Multimodal Online RL: During the multimodal online RL stage, careful selection of harder prompts led to significant performance improvements.
- Continuous Online RL: An iterative strategy alternating between model training and using the model to filter and retain only medium-to-hard difficulty prompts, optimizing compute and accuracy.
- Distillation from Llama 4 Behemoth: Llama 4 Maverick was codistilled from Llama 4 Behemoth during pre-training, amortizing the computational cost of generating distillation targets. A novel distillation loss function dynamically weights soft and hard targets during training. For new data incorporated into student training, forward passes on Behemoth generate targets.
- Behemoth Post-training Specifics: For Llama 4 Behemoth, 95% of SFT data was pruned to focus on quality. Lightweight SFT followed by large-scale RL yielded further improvements in reasoning and coding. The RL recipe focused on sampling hard prompts via pass@k analysis and crafting a curriculum of increasing prompt hardness. Dynamically filtering out prompts with zero advantage and constructing batches that mix prompts from multiple capabilities improved performance in math, reasoning, and coding.
- RL Infrastructure Scaling: A fully asynchronous online RL training framework was developed for the two-trillion-parameter Behemoth, enabling flexible allocation of models to separate GPUs and an improvement in training efficiency.
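The codistillation loss described above blends soft targets (the teacher's output distribution) with hard targets (ground-truth labels). The sketch below uses a standard cross-entropy-plus-KL formulation with a weight `alpha`; Llama 4's actual loss and its dynamic weighting schedule are unpublished, so the linear ramp shown is purely an assumption for illustration.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(-1, keepdims=True)
    return z - np.log(np.exp(z).sum(-1, keepdims=True))

def distill_loss(student_logits, teacher_logits, hard_labels, alpha):
    """alpha * CE(student, hard labels) + (1 - alpha) * KL(teacher || student).
    `alpha` is the dynamic soft/hard weight (assumed form, not Meta's)."""
    ls = log_softmax(student_logits)               # (n_tokens, vocab)
    lt = log_softmax(teacher_logits)
    hard = -ls[np.arange(len(hard_labels)), hard_labels].mean()
    soft = (np.exp(lt) * (lt - ls)).sum(-1).mean() # KL divergence, >= 0
    return alpha * hard + (1.0 - alpha) * soft

# Illustrative schedule: shift weight from teacher soft targets
# toward hard targets as training progresses.
step, total = 500, 1000
alpha = step / total

rng = np.random.default_rng(0)
s = rng.standard_normal((4, 32))                   # student logits
t = rng.standard_normal((4, 32))                   # teacher logits (from a
                                                   # forward pass on Behemoth)
y = rng.integers(0, 32, size=4)                    # hard labels
print(float(distill_loss(s, t, y, alpha)))
```

The "amortized" aspect in the text is orthogonal to the loss itself: teacher logits are produced during Behemoth's own pre-training forward passes, so only genuinely new student data requires extra teacher forward passes.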
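The pass@k analysis used to sample hard prompts has a standard unbiased estimator (Chen et al., 2021): given n attempts with c successes, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch of using it to bucket prompts by difficulty, with illustrative (assumed) thresholds for the easy/medium/hard split:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n attempts (c of them correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def bucket(n, c, k=8, easy=0.9, hard=0.05):
    """Tag a prompt by estimated difficulty. Thresholds are hypothetical;
    the text only says 'easy' data is dropped and hard prompts sampled."""
    p = pass_at_k(n, c, k)
    if p >= easy:
        return "easy"    # candidate for removal from training batches
    if p <= hard:
        return "hard"    # reserved for later curriculum stages
    return "medium"

print(pass_at_k(10, 5, 1))   # 0.5
print(bucket(10, 0))         # hard
```

Under this framing, the continuous online RL loop alternates between training and re-scoring prompts, retaining only the medium-to-hard buckets for the next round.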
Performance and Capabilities:
- Llama 4 Scout: Best-in-class performance for its size, outperforming all previous Llama models. It excels in image grounding, aligning user prompts with visual concepts, and precise visual question answering.
- Llama 4 Maverick: Considered the best-in-class multimodal model, exceeding GPT-4o and Gemini 2.0 Flash across coding, reasoning, multilingual, long-context, and image benchmarks. It is competitive with DeepSeek v3.1 on coding and reasoning at less than half the active parameters, offering a superior performance-to-cost ratio.
- Llama 4 Behemoth: Outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM benchmarks like MATH-500 and GPQA Diamond.
Safety and Bias Mitigation:
- Layered Mitigations: Safeguards are integrated from pre-training to post-training, with tunable system-level mitigations.
- Data Filtering: Data filtering and other mitigations during pre-training, and various techniques applied post-training to ensure policy conformance.
- System-level Tools (Open-sourced):
- Llama Guard: An input/output safety LLM for detecting policy violations.
- Prompt Guard: A classifier for detecting malicious prompts (Jailbreaks) and prompt injections.
- CyberSecEval: Evaluations for assessing and reducing generative AI cybersecurity risk.
- Evaluations and Red-teaming: Systematic testing with adversarial dynamic probing. Introduction of Generative Offensive Agent Testing (GOAT) to simulate multi-turn interactions of medium-skilled adversarial actors, increasing testing coverage and allowing expert human red teamers to focus on novel risks.
- Bias Reduction: Significant efforts to reduce political and social bias by making the model more balanced and responsive to diverse viewpoints without judgment. Llama 4 refuses fewer prompts on debated political/social topics (below 2%, down from 7% in Llama 3.3) and exhibits significantly more balanced response refusals (less than 1% unequal response refusals). Its strong political lean rate is comparable to Grok and half that of Llama 3.3.
The Llama 4 models are available for download on llama.com and Hugging Face, and integrated into Meta AI products like WhatsApp, Messenger, and Instagram Direct.