naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B · Hugging Face
Key Points
- HyperCLOVAX-SEED-Vision-Instruct-3B is NAVER's new lightweight, multimodal model designed for efficient visual understanding and text generation, specifically optimized for the Korean language.
- It features a LLaVA-based architecture, combining a 3.2B-parameter LLM with a 0.43B-parameter SigLIP vision encoder, trained using SFT and RLHF with an automated validation system.
- The model achieves competitive performance, outperforming similarly sized open-source models on Korean benchmarks, and represents Korea's first open-source vision-language model.
HyperCLOVAX-SEED-Vision-Instruct-3B is a lightweight, multimodal model developed by NAVER, capable of understanding both text and images, and generating text responses. It is built upon a proprietary backbone and fine-tuned through post-training, prioritizing computational efficiency and a Pareto-optimal balance specifically for the Korean language.
Model Architecture: The model employs a LLaVA-based Vision-Language Model architecture. Its components include:
- LLM Module: A Transformer-based dense model with 3.2 billion parameters.
- Vision Encoder: A SigLIP-based architecture that processes images with an input resolution of 378x378 pixels per grid.
- Vision-Language Connector: A C-Abstractor-based architecture featuring the AnyRes mechanism, which supports up to 1.29 million total pixels across 9 grids.
The total parameter count is 3.2 billion (LLM module) plus 0.43 billion (vision module). The model accepts Text + Image + Video inputs and produces Text output, with a 16k context length and a knowledge cutoff of August 2024.
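The grid arithmetic behind the 1.29-million-pixel figure follows directly from the numbers quoted above (378x378 pixels per grid, up to 9 grids); a quick sketch to verify:

```python
# Verify the AnyRes pixel budget quoted above:
# up to 9 grids, each at 378x378 input resolution.
GRID_SIDE = 378      # input resolution per grid (pixels)
MAX_GRIDS = 9        # maximum grids under AnyRes

pixels_per_grid = GRID_SIDE * GRID_SIDE          # 142,884
total_pixels = pixels_per_grid * MAX_GRIDS       # 1,285,956 ~= 1.29 million

print(f"{pixels_per_grid=}, {total_pixels=}")
```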
Training Methodology:
The training process involved both text and vision components, with a focus on overcoming data quality and cost challenges.
- Text Training: To secure high-quality data for post-training without relying heavily on manual annotation, an automated validation system powered by HyperCLOVA X was utilized. This system improved data quality and streamlined the training process, leading to enhanced performance in tasks with definitive answers, such as mathematics and coding. The model was developed by starting from HyperCLOVAX-SEED-Text-Base-3B and applying both Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), specifically using an online reinforcement learning algorithm called GRPO.
- Vision Training: Vision understanding capabilities, including image-based Question Answering (VQA) and chart/diagram interpretation, were integrated into the model architecture without compromising the existing performance of the HyperCLOVA X LLM. A key focus for this 3B model was optimizing the efficiency of video input tokens, carefully adjusting the number of tokens extracted per frame to enable efficient video understanding with minimal tokens. Additionally, during the RLHF phase, vision-specific V-RLHF data was incorporated to enhance the model's learning on visual tasks, mirroring the approach used for text. The model supports OCR-free processing.
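The card does not detail GRPO further; its widely published core idea is to replace a learned value baseline with group-relative reward normalization over several responses sampled for the same prompt. The sketch below illustrates only that normalization step, with made-up reward values:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled response's reward
    by the mean and standard deviation of its group (same prompt)."""
    mu = mean(rewards)
    sigma = stdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Hypothetical rewards for 4 responses sampled from one prompt.
advs = group_relative_advantages([0.2, 0.8, 0.5, 0.9])
print(advs)  # responses above the group mean get positive advantage
```

These advantages then weight the policy-gradient update in place of a critic's value estimate, which is what makes the method cheap enough for online RLHF at this scale.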
Performance and Benchmarks:
The model demonstrates competitive performance, particularly excelling in Korean-language inputs and outperforming similarly sized open-source models in relevant benchmarks.
- Text Benchmarks: While HyperCLOVAX-SEED-Vision-Instruct-3B shows slightly lower scores on KMMLU, HAE-RAE, CLiCK, and KoBEST than its text-base counterpart (HyperCLOVAX-SEED-Text-Base-3B), it still performs competitively with, or surpasses, other instruct models such as Qwen2.5-3B-instruct and gemma-3-4b-it on certain Korean benchmarks (e.g., HAE-RAE).
- Vision Benchmarks: The model uses 1,856 tokens and 108 frames for video input. It achieves an overall score of 59.54 across 9 benchmarks (4 image, 5 video), spanning Korean-specific datasets such as KoNet (Ko), Korean VisIT-Bench (Ko), VideoMME (Ko), NAVER-TV-CLIP (Ko), and VideoChatGPT (Ko), as well as English benchmarks such as PerceptionTest (En), ActivityNet-QA (En), MMBench-Val (En), and TextVQA-Val (En). It shows strong performance on Korean VisIT-Bench (79.2) and KoNet (81.8), and generally performs comparably to or better than other 3B-4B models such as Qwen-2.5-VL-3B (when token count is constrained), Gemma-3-4B, InternVL2-2B, InternVL2-4B, and InternVL2-8B across various vision tasks. For optimal image understanding, supplying additional information such as Optical Character Recognition (OCR) results and entity recognition (Lens) is recommended.
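The per-frame token budget implied by the video figures quoted above (1,856 tokens across 108 frames) works out to roughly 17 tokens per frame, which is the efficiency lever the training section describes:

```python
# Per-frame token budget implied by the benchmark configuration above:
# 1,856 video tokens spread across 108 sampled frames.
VIDEO_TOKENS = 1856
FRAMES = 108

tokens_per_frame = VIDEO_TOKENS / FRAMES
print(f"{tokens_per_frame:.2f} tokens/frame")
```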
Deployment: The model supports vLLM engine integration for faster inference, with specific instructions provided for setting up the API server and performing offline inference.
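The exact setup instructions are not reproduced here. As an illustration only: vLLM's API server exposes an OpenAI-compatible chat-completions endpoint, so a multimodal request for this model might be shaped as below. The endpoint URL, port, image URL, and `max_tokens` value are placeholder assumptions, not taken from the card:

```python
import json

# Hypothetical request body for a vLLM OpenAI-compatible server, e.g. one
# started with: vllm serve naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B
payload = {
    "model": "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in Korean."},
                # Placeholder image URL for illustration only.
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
    "max_tokens": 256,
}

body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions with any HTTP client.
print(body[:60])
```

For offline (non-server) inference, vLLM's Python API (`from vllm import LLM`) can load the same model name directly; consult the model card's own instructions for the supported flags.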