gemma3

2025.03.22
· Web · by Anonymous
#LLM #Multimodal #Google #Gemma

Key Points

  • Gemma 3 is a family of lightweight, multimodal AI models from Google, built on Gemini technology, available in parameter sizes ranging from 270M to 27B and supporting text and image processing with a context window of up to 128K tokens.
  • The family includes text-only and multimodal versions, alongside Quantization Aware Trained (QAT) variants that maintain quality while significantly reducing the memory footprint for efficient deployment.
  • Evaluation benchmarks demonstrate Gemma 3's strong performance across diverse tasks, including reasoning, logic, code, multilingual understanding, and multimodal capabilities, with scores generally improving at larger model sizes.

The Gemma 3 model family, developed by Google and built upon Gemini technology, represents a series of lightweight, multimodal large language models designed for efficient deployment on resource-limited devices. These models are capable of processing both text and images, feature a substantial 128K context window, and support over 140 languages. They are primarily engineered to excel in tasks such as question answering, summarization, and reasoning.

The Gemma 3 family is available in five parameter sizes: 270 million (270M), 1 billion (1B), 4 billion (4B), 12 billion (12B), and 27 billion (27B). The 270M and 1B models are text-only with a 32K context window. The larger 4B, 12B, and 27B models are multimodal, supporting both text and image inputs, and offer a 128K context window. The models can be run locally with Ollama 0.6 or later.
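As a sketch of the Ollama deployment path mentioned above, the commands below pull and run Gemma 3 models; the size-specific tags shown follow Ollama's usual `model:size` naming convention and should be checked against the Ollama model library before use:

```shell
# Pull and run the default Gemma 3 model (requires Ollama 0.6 or later)
ollama pull gemma3
ollama run gemma3 "Summarize the benefits of quantization in one sentence."

# A specific size can be addressed by tag, e.g. the 4B multimodal variant
ollama run gemma3:4b "Explain what a 128K context window allows."
```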

A key technical innovation is the use of Quantization Aware Trained (QAT) models. Quantization reduces the precision of a model's numerical representations, typically from floating point (e.g., 32-bit or 16-bit) to lower-bit integer formats (e.g., 8-bit). Quantization Aware Training integrates this precision reduction directly into the training process, allowing the model to learn and adapt to the lower precision. This differs from post-training quantization, where quantization is applied after the model is fully trained and can cause a more significant drop in performance. The Gemma 3 QAT models are designed to preserve quality comparable to their half-precision (BF16) counterparts while cutting the memory footprint roughly 3x relative to the non-quantized models. QAT variants are available for the 1B, 4B, 12B, and 27B models.
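To make the idea concrete, here is a minimal, self-contained sketch (not Google's actual QAT recipe) of symmetric 8-bit quantization and the "fake-quant" round trip that QAT inserts into the forward pass, so that training can compensate for the rounding error:

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to int8 codes."""
    scale = max(abs(v) for v in values) / 127.0  # map the largest magnitude to 127
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from the int8 codes."""
    return [c * scale for c in codes]

def fake_quant(values):
    """QAT-style round trip: the forward pass sees quantized-then-dequantized
    weights, so the model learns to tolerate the precision loss."""
    codes, scale = quantize_int8(values)
    return dequantize(codes, scale)

weights = [0.52, -1.30, 0.07, 0.88]
approx = fake_quant(weights)
# The round-trip error is small, while storage per weight drops from
# 16 bits (BF16) to 8 bits here (Gemma 3's QAT releases go lower still).
```

At inference time only the integer codes and the scale need to be stored, which is where the memory savings come from.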

The models underwent extensive evaluation across a diverse range of benchmarks covering reasoning, logic, code capabilities, multilingual performance, and multimodal understanding.

For the Gemma 3 270M instruction-tuned model, 0-shot and few-shot evaluations on common NLP benchmarks yielded results such as HellaSwag (37.7), PIQA (66.2), ARC-c (28.2), WinoGrande (52.3), BIG-Bench Hard (26.7), and IF Eval (51.2).

The larger Gemma 3 Pre-trained (PT) models demonstrate improved performance with increasing parameter count across various tasks:

  • Reasoning, Logic, and Code: Benchmarks include HellaSwag (10-shot, 85.6 for 27B), MMLU (5-shot, top-1, 78.6 for 27B), GSM8K (5-shot, maj@1, 82.6 for 27B), and HumanEval (pass@1, 48.8 for 27B). The performance generally scales positively with model size, with the 27B model consistently outperforming smaller variants. Notably, MATH and GSM8K show significant gains from 1B to 27B (e.g., GSM8K from 1.36 to 82.6).
  • Multilingual Capabilities: Evaluated on datasets like MGSM (74.3 for 27B), Global-MMLU-Lite (75.7 for 27B), Belebele (78.0 for 12B), and WMT24++ (ChrF, 55.7 for 27B). These scores indicate robust cross-lingual understanding and generation abilities, also showing a clear trend of performance improvement with scale.
  • Multimodal Capabilities: Applicable to the 4B, 12B, and 27B models, these capabilities are assessed on benchmarks like COCOcap (116 for 27B), DocVQA (val, 85.6 for 27B), MMMU (pt, 56.1 for 27B), TextVQA (val, 68.6 for 27B), and ChartQA (augmented, 88.7 for 27B). This category specifically highlights the models' proficiency in understanding and reasoning over visual information combined with text, with larger models again showing superior performance.
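For reference, the HumanEval pass@1 figure above uses the standard pass@k metric: given n sampled completions of which c pass the unit tests, it estimates the probability that at least one of k samples is correct. The widely used unbiased estimator can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), i.e. the chance
    that a random k-subset of the n samples contains at least one of the
    c correct ones."""
    if n - c < k:
        return 1.0  # every k-subset must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k = 1 this reduces to the fraction of correct samples, c / n.
```

GSM8K's maj@1 is simpler: the answer produced by a single sampled chain of reasoning is checked directly against the reference.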