naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B Β· Hugging Face
Key Points
- HyperCLOVA X SEED 8B Omni is an 8-billion-parameter unified multimodal model that integrates text, vision, and speech capabilities in a single Transformer architecture.
- It supports consistent understanding and generation across modalities, including vision-language QA, text-to-image generation, image editing, speech recognition, and text-to-speech.
- The model ships with OmniServe, a production-ready inference system, marking a key milestone in HyperCLOVA X's any-to-any, Korean-first intelligence initiative.
HyperCLOVA X SEED 8B Omni is a unified multimodal model designed to integrate text, vision, and speech capabilities within a single auto-regressive Transformer architecture. It aims to achieve consistent multimodal understanding and generation by aligning textual, visual, and audio representations into a shared semantic space, facilitating bidirectional interactions across these modalities. This model is an 8-billion parameter dense model.
The core methodology revolves around a central Large Language Model (LLM) that processes inputs from every modality and routes outputs to modality-specific heads, with specialized encoders and decoders handling non-textual data. For input processing, a Vision Encoder converts visual data (images and videos) into embeddings, and an Audio Encoder does the same for speech. These embeddings, along with text tokens, are fed into the 8B-parameter LLM, which operates within a 32K-token context window and can therefore process complex multimodal queries over the integrated semantic space.

For output generation, the LLM's processed representations are channeled to dedicated decoders: a Text Decoder for textual responses, a Vision Decoder for images (returned as S3 URLs), and an Audio Decoder for speech (also returned as S3 URLs). Text-to-image generation and image-to-image transformations are invoked through a tool named t2i_model_generation, which requires a discrete_image_token: a serialized string of discrete vision tokens (i.e., a quantized image representation) that must follow a strict format.
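The tool-call mechanism described above can be illustrated with a minimal sketch of what a t2i_model_generation invocation might look like in an OpenAI-style tool call. The tool name and the discrete_image_token argument come from the description; every other field name and the token string itself are placeholders, since the actual serialization format is not reproduced here.

```python
# Hypothetical sketch of a t2i_model_generation tool call, shaped like an
# OpenAI-style function call. Only the tool name and the discrete_image_token
# argument are documented; the rest is illustrative.
import json

tool_call = {
    "name": "t2i_model_generation",
    "arguments": json.dumps({
        # A serialized string of discrete (quantized) vision tokens.
        # Placeholder value -- the real token string follows a strict format
        # not shown in this summary.
        "discrete_image_token": "<placeholder-discrete-vision-tokens>",
    }),
}

# The serving layer would parse the arguments and hand the token string to
# the Vision Decoder, which returns the generated image as an S3 URL.
args = json.loads(tool_call["arguments"])
print(args["discrete_image_token"])
```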
HyperCLOVA X SEED 8B Omni supports a wide range of capabilities: established text-to-text functionality, vision-language question answering (QA), text-to-image generation and editing, speech recognition, speech translation, and text-to-speech synthesis. The model's knowledge cutoff is May 2025. Reported benchmarks cover each modality pair:
- Text-to-text: MMLU-Pro, GSM8K, KMMLU-Pro, HAERAE 1.0
- Vision-to-text: SEED-IMG, AI2D, K-MMBench
- Text-to-vision: GenEval, ImgEdit
- Audio-to-text: LibriSpeech, KsponSpeech
- Audio-to-audio: Fleurs en2ko, Fleurs ko2en
Inference is managed through OmniServe, a production-ready system with an OpenAI-compatible API. The system orchestrates the multimodal pipeline and requires specific GPU VRAM per component: approximately 8 GB for the Vision Encoder, 4 GB for the Audio Encoder, 16 GB for the LLM, 16 GB for the Vision Decoder, and 4 GB for the Audio Decoder, totaling around 48 GB deployed across three NVIDIA A100 80GB GPUs. S3-compatible storage is required for image and audio outputs. The API exposes flexible parameters such as max_tokens and temperature, and reasoning can be skipped via chat_template_kwargs.skip_reasoning.
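The request parameters named above can be sketched as a plain chat-completions request body. It is built as a dict so no particular client library is assumed; the model identifier and image URL are placeholders, and only the parameter names (max_tokens, temperature, chat_template_kwargs.skip_reasoning) come from the description.

```python
# Minimal sketch of an OpenAI-compatible chat-completions request body for
# OmniServe. Model name and image URL are placeholders.
request_body = {
    "model": "HyperCLOVAX-SEED-Omni-8B",  # placeholder identifier
    "messages": [
        {
            "role": "user",
            "content": [
                # Standard OpenAI-style multimodal content parts for a
                # vision-language QA query.
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sample.png"}},
            ],
        }
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    # Skip the reasoning phase via chat template kwargs, as exposed by the API.
    "chat_template_kwargs": {"skip_reasoning": True},
}

print(sorted(request_body))
```

Because the API is OpenAI-compatible, this body could be POSTed to the server's chat-completions endpoint with any HTTP client; image or audio outputs would come back as S3 URLs rather than inline payloads.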