google/gemma-3n-E2B-it-litert-preview · Hugging Face

2025.05.25
Hugging Face · by Anonymous
#LLM#Multimodal AI#Gemma#Google AI#Open Source

Key Points

  • Gemma 3n is a family of lightweight, state-of-the-art open multimodal models from Google, built on Gemini technology and designed for efficient execution on resource-constrained devices.
  • These models utilize a novel MatFormer architecture and selective parameter activation to achieve effective sizes of 2B and 4B parameters, supporting text, image, video, and audio input with text outputs.
  • Trained on diverse data in over 140 languages, Gemma 3n demonstrates strong performance in benchmarks for reasoning, multilingual tasks, and STEM/code, while incorporating rigorous ethics and safety evaluations.

Gemma 3n is a family of lightweight, state-of-the-art open multimodal models developed by Google DeepMind, derived from the same research and technology as the Gemini models. Designed for efficient execution on low-resource devices, Gemma 3n features a novel "MatFormer" architecture, which nests smaller sub-models inside a larger one, combined with selective parameter activation. This technique lets the models operate at an effective size of 2B (E2B) or 4B (E4B) parameters, lower than their total parameter count, thereby reducing resource requirements.
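The nested-sub-model idea behind MatFormer can be illustrated with a toy feed-forward layer: a smaller effective model reuses a prefix slice of the full model's weights, so one checkpoint can run at several effective sizes. All dimensions and names below are illustrative, not Gemma 3n's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff_full, d_ff_small = 8, 32, 16   # hypothetical widths

# One shared set of feed-forward weights for both effective sizes.
W_in = rng.standard_normal((d_model, d_ff_full))
W_out = rng.standard_normal((d_ff_full, d_model))

def ffn(x: np.ndarray, width: int) -> np.ndarray:
    """Run the feed-forward block using only the first `width` hidden units."""
    h = np.maximum(x @ W_in[:, :width], 0.0)   # ReLU over the active slice
    return h @ W_out[:width, :]

x = rng.standard_normal(d_model)
full = ffn(x, d_ff_full)     # "E4B-like": all hidden units active
small = ffn(x, d_ff_small)   # "E2B-like": fewer parameters activated
```

Both calls produce an output of the same shape; the smaller path simply touches fewer parameters, which is the source of the reduced effective size.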

The models are capable of processing multimodal inputs and generating text outputs. Supported inputs include:

  • Text strings (e.g., questions, prompts, documents).
  • Images: Normalized to 256×256, 512×512, or 768×768 resolution, encoded to 256 tokens each.
  • Audio data: Encoded to 6.25 tokens per second from a single channel.
The total input context can be up to 32K tokens. Outputs are generated text, with a total output length of up to 32K tokens minus the tokens consumed by the request input.
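The figures above (256 tokens per image, 6.25 tokens per second of audio, a 32K combined budget) make it easy to estimate how much output room a given request leaves. The function name and the 32,768-token interpretation of "32K" below are assumptions for illustration.

```python
CONTEXT_LIMIT = 32_768          # assumed value of the "32K" combined budget
IMAGE_TOKENS = 256              # per image, regardless of resolution tier
AUDIO_TOKENS_PER_SEC = 6.25     # single-channel audio encoding rate

def remaining_output_budget(text_tokens: int, n_images: int = 0,
                            audio_seconds: float = 0.0) -> int:
    """Return how many output tokens remain after accounting for the inputs."""
    used = (text_tokens
            + n_images * IMAGE_TOKENS
            + int(audio_seconds * AUDIO_TOKENS_PER_SEC))
    return max(CONTEXT_LIMIT - used, 0)
```

For example, a 1,000-token prompt with two images and 30 seconds of audio consumes 1,699 tokens, leaving 31,069 for generation.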

Gemma 3n models were trained on a massive dataset totaling approximately 11 trillion tokens, with a knowledge cutoff date of June 2024. The training data encompassed a wide variety of sources to foster broad linguistic exposure and multimodal capabilities, including:

  • Web Documents: Diverse collection in over 140 languages.
  • Code: To learn programming language syntax and patterns.
  • Mathematics: For logical reasoning and symbolic representation.
  • Images: For analysis and visual data extraction.
  • Audio: For speech recognition, transcription, and audio information identification.
Data preprocessing involved rigorous CSAM (Child Sexual Abuse Material) filtering at multiple stages, automated sensitive data filtering to remove personal information, and additional filtering based on content quality and safety policies.

Model training was conducted using Google's Tensor Processing Unit (TPU) hardware (TPUv4p, TPUv5p, and TPUv5e), leveraging their advantages in performance, memory capacity, scalability (via TPU Pods), and cost-effectiveness for large-scale generative model training. The software stack for training included JAX and ML Pathways. JAX facilitated efficient use of TPUs, while ML Pathways, designed for building AI systems capable of generalizing across multiple tasks, utilized a "single controller" programming model for streamlined development, as described for the Gemini family of models.
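The shape of that training stack, JAX tracing a step function so XLA can compile it for TPUs (or CPU/GPU locally), can be sketched with a toy jit-compiled gradient step. This is a generic JAX idiom, not Gemma's actual training code.

```python
import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    """Mean-squared-error loss for a toy linear model."""
    pred = x @ w
    return jnp.mean((pred - y) ** 2)

@jax.jit  # compile the whole step once; subsequent calls run the XLA binary
def train_step(w, x, y, lr=0.1):
    loss, grads = jax.value_and_grad(loss_fn)(w, x, y)
    return w - lr * grads, loss

w = jnp.zeros(3)
x = jnp.eye(3)
y = jnp.array([1.0, 2.0, 3.0])
for _ in range(100):
    w, loss = train_step(w, x, y)
```

At scale, the same step function is sharded across TPU Pods; ML Pathways' "single controller" model keeps that orchestration in one Python program.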

Evaluation of Gemma 3n models (at full precision, float32) involved a comprehensive suite of benchmarks across various categories:

  • Reasoning and Factuality: Including HellaSwag, BoolQ, PIQA, SocialIQA, TriviaQA, Natural Questions, ARC-c/e, WinoGrande, BIG-Bench Hard, and DROP (measuring Accuracy or Token F1 score).
  • Multilingual: MGSM, WMT24++ (ChrF), Include, MMLU (ProX), OpenAI MMLU, Global-MMLU, and ECLeKTic.
  • STEM and Code: GPQA Diamond, LiveCodeBench v5, Codegolf v2.2, AIME 2025, MBPP, HumanEval, and HiddenMath (measuring Accuracy or pass@1).
Benchmark results were provided for both pre-trained (PT) and instruction-tuned (IT) variants of E2B and E4B models, with varying n-shot settings.
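The pass@1 metric cited for the code benchmarks is commonly computed with the unbiased estimator introduced with HumanEval: given n sampled solutions per problem of which c pass the tests, pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch, assuming that convention applies here:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With a single sample per problem (n = k = 1), this reduces to the plain fraction of problems solved.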
Android performance benchmarks were also conducted on a Samsung S25 Ultra, demonstrating efficiency on edge devices. These benchmarks reported Prefill (tokens/sec), Decode (tokens/sec), Time to first token (TTFT), Model size (MB), Peak RSS Memory (MB), and GPU Memory (MB) for dynamic_int4 quantized models on CPU (accelerated via LiteRT XNNPACK delegate with 4 threads) and GPU backends. For instance, the dynamic_int4 quantized model showed 620 tokens/sec prefill and 23.3 tokens/sec decode on GPU with a model size of 2991MB.
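Those throughput figures translate directly into a back-of-the-envelope latency model: prompt tokens are processed at the prefill rate and output tokens at the decode rate. Real TTFT also includes model load and tokenization, so this is a lower bound, and the function name is illustrative.

```python
PREFILL_TPS = 620.0   # GPU prefill rate quoted above (tokens/sec)
DECODE_TPS = 23.3     # GPU decode rate quoted above (tokens/sec)

def estimated_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    """Approximate end-to-end generation time in seconds."""
    return prompt_tokens / PREFILL_TPS + output_tokens / DECODE_TPS
```

For example, a 1,024-token prompt with a 128-token reply works out to roughly 1.65 s of prefill plus 5.49 s of decoding, about 7.1 seconds in total.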

Ethics and safety evaluations were robust, employing structured evaluations and internal red-teaming across categories like Child Safety, Content Safety (harassment, violence, hate speech), and Representational Harms (bias, stereotyping). "Assurance evaluations," conducted independently from the development team, further informed release decisions. The evaluations showed minimal policy violations and significant improvements over previous Gemma models, though a limitation noted was the primary use of English language prompts.

Intended uses span Content Creation (text generation, chatbots, summarization, image/audio data extraction) and Research & Education (NLP/generative model research, language learning, knowledge exploration).
Limitations include:

  • Training Data: Quality, diversity, biases, or gaps can impact responses.
  • Context and Task Complexity: Better performance with clear prompts; struggles with open-ended or highly complex tasks.
  • Language Ambiguity and Nuance: Difficulty grasping subtleties, sarcasm, or figurative language.
  • Factual Accuracy: Models are not knowledge bases and may generate incorrect or outdated facts.
  • Common Sense: Reliance on statistical patterns may lead to a lack of common sense reasoning.

Ethical considerations and risks, such as perpetuation of biases, misinformation, misuse for malicious purposes, and privacy violations, were addressed through data preprocessing, continuous monitoring, de-biasing techniques, content safety mechanisms, technical limitations, user education, and adherence to the Gemma Prohibited Use Policy. The models aim to provide high-performance, open generative model implementations designed for responsible AI development, demonstrating superior performance to other comparably sized open models based on benchmark metrics.