Introducing Gemma 3: The Developer Guide- Google Developers Blog

Philipp Schmid
2025.03.22
#Gemma #LLM #Multimodality #GoogleAI #GenerativeAI

Key Points

  1. Gemma 3 is introduced as the most advanced open model in its family, featuring multimodality (vision-language input), an extended 128K-token context window, stronger math, reasoning, and chat capabilities, and support for over 140 languages, available in four distinct sizes.
  2. Developed with techniques including distillation, RLHF, RLMF, and RLEF, Gemma 3 achieves top-tier performance in LMArena (score of 1338), with significant improvements in math, coding, and instruction following.
  3. It incorporates a frozen SigLIP vision encoder for image and video analysis, supports high-resolution and non-square images, and is accessible via Google AI Studio, Hugging Face, Kaggle, and a range of development and deployment tools.

Gemma 3 is the latest iteration of the Gemma open-model family, building upon previous releases with significant enhancements driven by community feedback. This version introduces multimodality, extended context windows, enhanced multilingual support, and improved capabilities in reasoning, mathematics, and chat.

Gemma 3 supports vision-language input and text output, with a context window of up to 128,000 tokens. It understands over 140 languages, leveraging a new tokenizer for improved multilingual performance. The model is available in four sizes: 1B, 4B, 12B, and 27B parameters, offered both as pre-trained checkpoints (suitable for fine-tuning) and as general-purpose instruction-tuned versions.

The core methodology for building Gemma 3 involved an optimized pre-training and post-training process, combining distillation, reinforcement learning, and model merging. During pre-training, Gemma 3 models were trained on Google TPUs using the JAX Framework, with data scales of 2 trillion tokens for the 1B model, 4 trillion for 4B, 12 trillion for 12B, and 14 trillion for the 27B model.
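The per-size data scales above can be summarized programmatically. A small sketch (the token counts come from the text; the helper name and the tokens-per-parameter ratio are illustrative, not part of Google's release):

```python
# Pre-training data scale per Gemma 3 size, in tokens (figures from the text above).
PRETRAIN_TOKENS = {
    "1B": 2_000_000_000_000,    # 2 trillion
    "4B": 4_000_000_000_000,    # 4 trillion
    "12B": 12_000_000_000_000,  # 12 trillion
    "27B": 14_000_000_000_000,  # 14 trillion
}

def tokens_per_parameter(size_label: str, n_params: int) -> float:
    """Rough tokens-per-parameter ratio for a given model size."""
    return PRETRAIN_TOKENS[size_label] / n_params

# The smallest model sees the highest ratio: 2e12 / 1e9 = 2000 tokens per parameter.
print(tokens_per_parameter("1B", 1_000_000_000))  # 2000.0
```

Note the ratio shrinks as the models grow: the 27B model is trained on roughly 14e12 / 27e9 ≈ 519 tokens per parameter.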

For post-training, Gemma 3 utilized four distinct components to enhance performance and alignment:

  1. Distillation: Knowledge transfer from a larger instruct model into the Gemma 3 pre-trained checkpoints, effectively leveraging a more capable "teacher" model.
  2. Reinforcement Learning from Human Feedback (RLHF): This technique aligns the model's predictions with human preferences, optimizing for desired behavioral characteristics.
  3. Reinforcement Learning from Machine Feedback (RLMF): Specifically designed to enhance mathematical reasoning, this involves using machine-generated feedback signals to guide the learning process for mathematical tasks.
  4. Reinforcement Learning from Execution Feedback (RLEF): Employed to improve coding capabilities, this method leverages the outcome of code execution (e.g., correctness or errors) as a feedback signal for reinforcement learning.
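The execution-feedback idea behind RLEF can be illustrated with a toy reward function: run model-generated code against test cases and turn the outcome into a scalar reward. This is a simplified sketch of the concept only, not Google's training code; `execution_reward` and its signature are invented for illustration:

```python
def execution_reward(candidate_code: str,
                     test_cases: list[tuple[tuple, object]],
                     func_name: str) -> float:
    """Toy RLEF-style reward: 1.0 if the generated function passes every
    test case, partial credit otherwise, 0.0 if the code fails to run."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # run the model-generated code
        func = namespace[func_name]
    except Exception:
        return 0.0  # code that does not even define the function earns nothing
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply earns no credit
    return passed / len(test_cases)

# A correct candidate earns full reward; a buggy one earns partial reward.
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(execution_reward(good, tests, "add"))  # 1.0
print(execution_reward(bad, tests, "add"))
```

In actual RLEF training, such a reward signal would drive a reinforcement-learning update on the policy model rather than being printed.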

These advancements significantly improved the model's math, coding, and instruction-following abilities, positioning it as a top-performing open compact model in LMArena with a score of 1338.

For multimodality, Gemma 3 integrates a vision encoder based on SigLIP, which was kept frozen during training across the 4B, 12B, and 27B model sizes. This enables Gemma 3 to process images and videos, analyze visual content, answer image-based questions, compare images, identify objects, and understand text within images. While originally optimized for 896x896 pixel images, an adaptive window algorithm allows it to work with high-resolution and non-square images.
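The adaptive-window idea can be sketched in a few lines: tile a non-square or high-resolution image into crops no larger than the encoder's native 896x896 input. This is a simplified illustration of the concept, not the actual algorithm; `crop_grid` is a hypothetical helper:

```python
import math

NATIVE = 896  # SigLIP encoder's native input resolution, per the text above

def crop_grid(width: int, height: int, window: int = NATIVE) -> tuple[int, int]:
    """Number of (cols, rows) of window-sized crops needed to cover the image."""
    return math.ceil(width / window), math.ceil(height / window)

# A 1792x896 panorama needs a 2x1 grid of native-resolution windows.
print(crop_grid(1792, 896))  # (2, 1)
```

Each crop would then be encoded by the frozen vision encoder, letting the model reason over images well outside the 896x896 shape it was optimized for.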

Instruction-tuned versions of Gemma 3 maintain the same dialog format as Gemma 2 for text-only inputs, ensuring tool compatibility. For interleaved image and text inputs, a new format is supported. ShieldGemma 2, a 4B image safety classifier, is built on Gemma 3 and outputs labels across safety categories for both synthetic and natural images.
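For text-only inputs, the turn format carried over from Gemma 2 uses Gemma's documented `<start_of_turn>`/`<end_of_turn>` control tokens. A minimal prompt builder (the function itself is illustrative; in practice a library's chat template would handle this):

```python
def build_prompt(turns: list[tuple[str, str]]) -> str:
    """Render (role, text) turns in the Gemma dialog format.
    Roles are 'user' and 'model'; a trailing model header cues generation."""
    parts = [f"<start_of_turn>{role}\n{text}<end_of_turn>\n" for role, text in turns]
    return "".join(parts) + "<start_of_turn>model\n"

print(build_prompt([("user", "What is Gemma 3?")]))
```

Keeping this format identical to Gemma 2's is what preserves compatibility with existing tooling.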

Gemma 3 models are accessible via Google AI Studio, downloadable weights on Hugging Face and Kaggle, and can be integrated using various development tools and frameworks including Hugging Face Transformers, Ollama, Gemma JAX library, MaxText, LiteRT, Gemma.cpp, llama.cpp, and Unsloth. Deployment options include Google GenAI API, Vertex AI, Cloud Run, Cloud TPU, and Cloud GPU.