naver-hyperclovax/HyperCLOVAX-SEED-Think-32B · Hugging Face
Key Points
- HyperCLOVA X SEED 32B Think is a 32-billion-parameter vision-language model with a unified Transformer backbone and a reasoning-centric training recipe.
- It supports multimodal understanding up to 128K tokens, processing text, image, and video inputs within a shared embedding space and offering an optional "thinking mode" for deep reasoning.
- Designed for practical reasoning and agentic capabilities, and particularly strong in Korean, the model requires significant GPU resources for deployment via its OmniServe inference system.
HyperCLOVA X SEED 32B Think is an advanced multimodal large language model, succeeding the SEED Think 14B series, designed for enhanced reasoning with particular strength in Korean-language contexts.
Architecture and Core Methodology: The model employs a unified vision-language Transformer backbone, classifying it as a dense model with 32 billion parameters. A core methodological aspect is its ability to process diverse input modalities—text tokens and visual patches (from images or video frames)—within a shared embedding space. This unified representation facilitates deep, integrated multimodal understanding. It supports an extensive context length of up to 128,000 tokens, enabling comprehensive processing of long textual and visual sequences.
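The shared-embedding idea described above can be illustrated with a toy sketch: text tokens and flattened visual patches are each mapped to the same hidden size and concatenated into one sequence for the Transformer backbone. All dimensions below are made up for illustration and do not reflect the model's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 64  # toy hidden size; the real model's is much larger

# Text side: an embedding table lookup for token ids.
vocab_embed = rng.normal(size=(1000, hidden))
text_ids = np.array([5, 42, 7])
text_emb = vocab_embed[text_ids]              # shape (3, hidden)

# Vision side: a linear projection of flattened image patches
# into the same hidden dimension as the text embeddings.
patch_proj = rng.normal(size=(768, hidden))
patches = rng.normal(size=(4, 768))           # 4 flattened patches
patch_emb = patches @ patch_proj              # shape (4, hidden)

# One interleavable sequence in the shared embedding space,
# ready to be consumed by a unified Transformer backbone.
sequence = np.concatenate([text_emb, patch_emb], axis=0)  # shape (7, hidden)
```

Because both modalities live in the same space, the backbone attends across text and visual positions uniformly, which is what enables the integrated multimodal understanding described above.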
Key Capabilities and Reasoning Mode: HyperCLOVA X SEED 32B Think accepts text, image, and video as inputs and produces text as output. A distinctive feature is its optional "thinking mode," which enables deep and controllable reasoning akin to chain-of-thought (CoT) prompting. When activated, the model generates intermediate reasoning steps enclosed in dedicated tags in its output, providing transparency and better control over complex problem-solving. This mode is activated by setting chat_template_kwargs.thinking to True, and the length of the reasoning process can be capped via thinking_token_budget. The model also natively supports multi-turn conversations and is proficient in image and video understanding.
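A minimal sketch of wiring up thinking mode, assuming the parameter names described above (thinking, thinking_token_budget) are passed through chat_template_kwargs to a Hugging Face-style apply_chat_template call. The helper function and default budget here are illustrative, not part of the model's documented API.

```python
def build_generation_request(messages, thinking=True, thinking_token_budget=2048):
    """Assemble kwargs for a tokenizer.apply_chat_template(...) call.

    `thinking` toggles the optional reasoning mode; `thinking_token_budget`
    caps the length of the intermediate reasoning (names follow the model
    card's description; defaults here are assumptions).
    """
    return {
        "conversation": messages,
        "add_generation_prompt": True,
        "chat_template_kwargs": {
            "thinking": thinking,
            "thinking_token_budget": thinking_token_budget,
        },
    }

request = build_generation_request(
    [{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    thinking=True,
    thinking_token_budget=1024,
)
```

With thinking=False the same template would produce a direct answer without the intermediate reasoning trace, which is useful when latency matters more than transparency.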
Performance and Application Focus: Building upon its predecessor, the 32B model specifically strengthens Korean-centric reasoning and agentic capabilities, aiming to improve practical reasoning quality and reliability in real-world applications. It has been evaluated on a diverse set of benchmarks, including Korean text-based general knowledge (KoBalt, CLIcK, HAERAE Bench 1.0), vision understanding (ChartQA, TextVQA, K-MMBench, K-DTCBench), and agentic tasks (Tau^2-Airline, Tau^2-Retail, Tau^2-Telecom).
Inference and Deployment: The model is made available through OmniServe, a production-ready multimodal inference system with an OpenAI-compatible API. Deployment of the 32B model necessitates significant hardware resources, specifically requiring a total of 3x NVIDIA A100 80GB GPUs (1x ~8GB for the Vision Encoder and 2x ~60GB for the 32B LLM).
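Since OmniServe exposes an OpenAI-compatible API, a multimodal request can be built as a standard chat-completions payload. The endpoint URL, model name, and image URL below are placeholders, not documented values; the payload shape follows the common OpenAI chat-completions convention for mixed text-and-image content.

```python
import json

def chat_completion_payload(prompt, image_url=None,
                            model="HyperCLOVAX-SEED-Think-32B"):
    """Build an OpenAI-style chat-completions request body.

    Text and (optionally) an image are combined into one user message,
    mirroring the multimodal input support described above.
    """
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {"model": model, "messages": [{"role": "user", "content": content}]}

payload = chat_completion_payload(
    "Describe this chart.",
    image_url="https://example.com/chart.png",  # placeholder URL
)
body = json.dumps(payload)  # POST this to the OmniServe /v1/chat/completions route
```

In practice this body would be sent with an HTTP client (or the official openai Python SDK pointed at the OmniServe base URL); the sketch stops at payload construction to stay self-contained.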