LGAI-EXAONE/EXAONE-4.0-1.2B · Hugging Face
Key Points
- EXAONE 4.0 is a new series of large language models integrating non-reasoning and reasoning modes, designed for advanced capabilities like agentic tool use and expanded multilingual support in English, Korean, and Spanish.
- The model series comes in 32B and 1.2B sizes, incorporating novel architectural changes like Hybrid Attention and QK-Reorder-Norm for optimized performance across different deployment scenarios.
- Extensive evaluations highlight EXAONE 4.0's strong performance in both reasoning and non-reasoning tasks, with detailed guidelines for optimal usage and deployment on platforms like TensorRT-LLM and vLLM.
EXAONE 4.0 is a series of unified large language models that integrates both a Non-reasoning mode for general usability and a Reasoning mode for complex problem-solving. Developed by LG AI Research, this iteration extends multilingual capabilities to include Spanish, alongside English and Korean, and incorporates essential agentic tool use features. The series includes a 32B parameter model optimized for high performance and a 1.2B parameter model designed for on-device deployment.
The core methodology of EXAONE 4.0 introduces two significant architectural changes from previous EXAONE models:
- Hybrid Attention: Employed in the 32B model, this scheme combines Local attention (sliding window attention) with Global attention (full attention) in a 3:1 ratio. Notably, Rotary Positional Embedding (RoPE) is explicitly *not* used for global attention to enhance global context understanding. This selective application of attention mechanisms allows the model to efficiently process long sequences while retaining a comprehensive global view.
- QK-Reorder-Norm: This modification reorders the Layer Normalization (LayerNorm) position from the traditional Pre-Layer Normalization (Pre-LN) scheme. In QK-Reorder-Norm, LayerNorm is applied directly to the attention and Multi-Layer Perceptron (MLP) outputs. Additionally, Root Mean Square (RMS) normalization is applied immediately after the Query (Q) and Key (K) projection layers. This refined normalization strategy is reported to yield improved performance on downstream tasks, albeit with increased computational cost.
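The placement of RMS normalization directly after the Q and K projections can be sketched as follows. This is a minimal single-head NumPy illustration of the idea, not the model's actual implementation: it assumes unit-scale RMSNorm and omits learned scale parameters, multi-head layout, masking, and RoPE.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Root-mean-square normalization over the last dimension (no mean-centering).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def qk_reorder_attention(x, w_q, w_k, w_v):
    # Sketch of the QK-Reorder-Norm idea: RMS-normalize Q and K immediately
    # after their projection layers, before computing attention scores.
    q = rms_norm(x @ w_q)
    k = rms_norm(x @ w_k)
    v = x @ w_v
    d = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d)
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Tiny demo with random weights (shapes are illustrative only).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 32))
w_q = rng.standard_normal((32, 16))
w_k = rng.standard_normal((32, 16))
w_v = rng.standard_normal((32, 16))
attn_out = qk_reorder_attention(x, w_q, w_k, w_v)
```

Normalizing Q and K keeps their per-token scale near unity, which stabilizes the magnitude of the attention logits independently of the projection weights.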
The model also supports agentic tool use, allowing it to interact with external functions by processing tool schemas and generating appropriate tool calls. The `tokenizer.apply_chat_template` method activates reasoning mode via a dedicated argument, which initiates a reasoning block marked by an opening tag in the generated output.
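The shape of the inputs involved can be sketched as follows. This shows the OpenAI-style tool schema and message list that Hugging Face chat templates conventionally accept; the `get_weather` tool is a hypothetical example, and the commented-out call shows where `tokenizer.apply_chat_template` and the reasoning-mode flag (whose exact name the model card defines) would be supplied.

```python
# Hypothetical tool schema in the OpenAI-style format commonly passed to
# Hugging Face chat templates; "get_weather" is an illustrative example only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Seoul?"}]

def looks_like_tool_schema(tool):
    # Minimal structural check: the fields a chat template typically reads.
    fn = tool.get("function", {})
    return tool.get("type") == "function" and {"name", "description", "parameters"} <= fn.keys()

# With a real tokenizer, rendering the prompt would look roughly like:
# prompt = tokenizer.apply_chat_template(
#     messages, tools=tools, add_generation_prompt=True, tokenize=False,
#     # plus the reasoning-mode argument documented in the model card
# )
```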
Model configurations include a 1.2B parameter model and a 32B parameter model. The 1.2B model has 1.07B parameters excluding embeddings, 30 layers, and uses Grouped Query Attention (GQA) with 32 attention heads and 8 KV heads. It has a vocabulary size of 102,400 and a context length of 65,536 tokens.
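These configuration numbers make the memory benefit of GQA concrete. The back-of-envelope calculation below uses the stated 30 layers, 8 KV heads, and 65,536-token context; the head dimension of 128 and fp16 (2-byte) cache precision are assumptions for illustration, not figures from the model card.

```python
# Back-of-envelope KV-cache sizing for the 1.2B configuration.
# Assumptions (not from the model card): head_dim = 128, fp16 cache (2 bytes).
n_layers, n_kv_heads, head_dim = 30, 8, 128
context_len, bytes_per_elem = 65_536, 2

# K and V each store n_kv_heads * head_dim values per token, per layer.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
print(f"{kv_bytes / 2**30:.1f} GiB")  # prints: 7.5 GiB
```

With full multi-head attention (32 KV heads instead of 8), the cache under the same assumptions would be 4x larger, which is why GQA matters for a long-context on-device model.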
Performance evaluations demonstrate competitive results across various benchmarks for both reasoning and non-reasoning modes. In reasoning mode, EXAONE 4.0 32B achieves strong scores in World Knowledge (e.g., 92.3 MMLU-Redux), Math/Coding (e.g., 85.3 AIME 2025), Instruction Following (e.g., 83.7 IFEval), Agentic Tool Use (e.g., 63.9 BFCL-v3), and Multilinguality (e.g., 67.7 KMMLU-Pro, 85.6 MMMLU (ES)). The 1.2B model also shows robust performance for its size in reasoning capabilities (e.g., 71.5 MMLU-Redux, 45.2 AIME 2025). In non-reasoning mode, the models maintain strong performance in world knowledge, instruction following, and multilingual tasks, while showing lower scores in complex reasoning tasks like Math/Coding, aligning with their design purpose.
Deployment of EXAONE 4.0 models is officially supported by TensorRT-LLM and vLLM, with specific configurations recommended for optimal serving, including --enable-auto-tool-choice and specific tool/reasoning parsers for vLLM. Usage guidelines suggest lower temperature values for non-reasoning mode and specific sampling parameters for reasoning mode, together with a repetition-related penalty to mitigate degeneration.
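A serving command along these lines would apply the flags above. This is a sketch, not a verbatim command from the model card: `--enable-auto-tool-choice` is stated in the source, while `--tool-call-parser` and `--reasoning-parser` are standard vLLM options whose correct values for EXAONE 4.0 should be taken from the official model card.

```shell
# Hypothetical vLLM serving sketch for EXAONE 4.0 1.2B.
# --enable-auto-tool-choice comes from the source; the parser values below
# are placeholders and must be replaced with those the model card specifies.
vllm serve LGAI-EXAONE/EXAONE-4.0-1.2B \
  --enable-auto-tool-choice \
  --tool-call-parser <parser-from-model-card> \
  --reasoning-parser <parser-from-model-card>
```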