EXAONE Deep: Reasoning Enhanced Language Models
Paper


Sunkyoung Kim
2025.06.22
· arXiv · by Anonymous
#LLM #Reasoning #Chain-of-Thought #Fine-tuning #DeepLearning

Key Points

  • EXAONE Deep introduces a series of language models (2.4B, 7.8B, and 32B parameters) from LG AI Research, fine-tuned specifically for superior performance on reasoning tasks, including math and coding.
  • These models were developed by fine-tuning the EXAONE 3.5 Instruct models on a reasoning-specialized dataset of long chain-of-thought traces, using Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning (Online RL).
  • Evaluations show that EXAONE Deep's smaller models outperform peers of comparable size, while the largest 32B model achieves competitive performance against leading open-weight reasoning models; all models are openly available for research.

EXAONE Deep is a series of large language models (LLMs) developed by LG AI Research, specifically fine-tuned for enhanced performance on reasoning tasks, including mathematics and coding. The series comprises three models: EXAONE Deep 2.4B, 7.8B, and 32B. These models are derived from the EXAONE 3.5 Instruct base models, which already possess instruction-following capabilities.

The core methodology for training EXAONE Deep models involves a multi-stage fine-tuning process utilizing three prominent techniques: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning (Online RL).

Data for Fine-Tuning:
To imbue the models with strong reasoning abilities, a specialized dataset was constructed:

  • SFT Dataset: Consists of 1.6 million instances and approximately 12 billion tokens. This dataset is designed to guide models through an extended Chain-of-Thought (CoT) process. Token distribution varies: code-related data points are notably longer on average, while "others" tend to be shorter. Each SFT instance follows a templated format: a user query, followed by a structured thought process encapsulated within <thought> and </thought> tags, and finally a concise, self-contained answer synthesizing the reasoning steps. This structure explicitly trains the models to perform step-by-step logical progression, including reflection, self-checking, and correction within the designated thought tags.
  • DPO Dataset: Comprises 20,000 instances of preference data.
  • Online RL Dataset: Includes an additional 10,000 instances.
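The templated SFT format described above can be sketched as a small formatting helper. The tag names follow the paper, but the exact delimiters between query, thought, and answer are an assumption, as the precise chat template is not given in this summary:

```python
def format_sft_instance(query: str, thought: str, answer: str) -> str:
    """Assemble one SFT instance: user query, a <thought>...</thought>
    reasoning trace (with room for self-checking and correction), then a
    concise, self-contained answer. Separators are illustrative only."""
    return (
        f"{query}\n"
        f"<thought>\n{thought}\n</thought>\n"
        f"{answer}"
    )

example = format_sft_instance(
    "What is 12 * 13?",
    "12 * 13 = 12 * 10 + 12 * 3 = 120 + 36 = 156. Check: 13 * 12 = 156. Correct.",
    "12 * 13 = 156.",
)
```

Training on instances of this shape is what teaches the model to emit its reasoning inside the thought tags before committing to a final answer.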

Training Process:
The EXAONE 3.5 Instruct base models are fine-tuned using the aforementioned datasets and techniques:

  • Supervised Fine-Tuning (SFT): This initial phase uses the 1.6 million CoT-structured instances to teach the models the desired reasoning patterns and response formats.
  • Direct Preference Optimization (DPO): Following SFT, DPO is applied using 20,000 preference instances. The paper specifies the use of SimPER [19] as the training algorithm for DPO.
  • Online Reinforcement Learning (Online RL): The final stage incorporates Online RL, utilizing 10,000 instances. For this, the authors employed their self-designed GRPO [15] variant.
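The paper uses a self-designed GRPO variant for the Online RL stage; its details are not reproduced here, but the core GRPO idea of group-relative advantages can be sketched as follows. This is only the vanilla normalization (reward minus group mean, divided by group standard deviation), not the authors' variant:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Group-relative advantages as in vanilla GRPO: sample a group of
    responses per prompt, then normalize each response's scalar reward
    by the group's mean and standard deviation. No value network needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against all-equal reward groups
    return [(r - mu) / sigma for r in rewards]

# e.g. 4 sampled responses, two judged correct (reward 1) and two incorrect
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses within a group receive positive advantages and incorrect ones negative, which is what drives the policy update.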

The total computational expenditure (FLOPs) for training each model, combining pretraining and fine-tuning, is significant. For instance, the 32B model consumed $1.25 \times 10^{24}$ FLOPs for pretraining and $7.04 \times 10^{21}$ FLOPs for fine-tuning, totaling $1.26 \times 10^{24}$ FLOPs. Training was conducted on NVIDIA H100 GPU clusters provided by Google Cloud Platform, leveraging the NVIDIA NeMo Framework.
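A quick arithmetic check on the 32B figures confirms how the total is dominated by pretraining:

```python
pretrain_flops = 1.25e24   # 32B model pretraining, from the paper
finetune_flops = 7.04e21   # 32B model reasoning fine-tuning
total_flops = pretrain_flops + finetune_flops

# Fine-tuning adds well under 1% on top of pretraining compute.
finetune_share = finetune_flops / total_flops
```

Rounded to three significant figures, the sum reproduces the reported $1.26 \times 10^{24}$ FLOPs total.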

Evaluation and Results:
The models were rigorously evaluated across a diverse set of benchmarks covering mathematics, science, coding, and general knowledge:

  • Mathematics: MATH-500, American Invitational Mathematics Examination (AIME) 2024 and 2025, and South Korea’s College Scholastic Ability Test (CSAT) Math 2025.
  • Science: GPQA Diamond.
  • Coding: LiveCodeBench (24.08-25.02).
  • General Knowledge: MMLU and MMLU-Pro.

The evaluation setup followed the DeepSeek-R1 technical report, employing a maximum generation length of 32K tokens. To ensure reliability, the pass@1 metric averaged over $k$ sampled responses was used, defined as:

$$\text{pass@1} = \frac{1}{k} \sum_{i=1}^{k} p_i$$

where $k$ is the number of responses generated (e.g., 8 for MATH-500, 16 for CSAT 2025, 64 for AIME) and $p_i$ denotes the correctness of the $i$-th response. A sampling temperature of 0.6 and a top-p value of 0.95 were used. Additionally, cons@k was reported for AIME, where the most frequently generated answer among the $k$ responses is chosen. Evaluation prompts were tailored for short-answer, multiple-choice, and code-generation tasks, instructing the models to reason step by step and provide answers within specific delimiters (e.g., \boxed{} for math/MCQ answers).
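The two metrics are straightforward to compute; a minimal sketch of both, assuming binary per-response correctness labels:

```python
from collections import Counter

def pass_at_1(correct_flags):
    """pass@1 averaged over k samples: the fraction of the k generated
    responses that are correct (each flag is 1 if correct, else 0)."""
    return sum(correct_flags) / len(correct_flags)

def cons_at_k(answers, reference):
    """cons@k (consistency / majority vote): pick the most frequent
    final answer among the k responses and check it against the reference."""
    majority_answer, _count = Counter(answers).most_common(1)[0]
    return majority_answer == reference

p = pass_at_1([1, 0, 1, 1])              # 3 of 4 samples correct -> 0.75
c = cons_at_k(["42", "41", "42"], "42")  # majority answer "42" matches
```

cons@k can exceed pass@1 on hard benchmarks like AIME because a correct answer only needs to be the plurality, not appear in every sample.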

Experimental results demonstrate strong performance:

  • EXAONE Deep 32B: Achieves competitive performance against leading open-weight reasoning models like DeepSeek-R1 and QwQ-32B, and outperforms distilled versions such as DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Llama-70B.
  • EXAONE Deep 7.8B: Outperforms models of similar scale, including DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Llama-8B, and the proprietary OpenAI o1-mini.
  • EXAONE Deep 2.4B: Shows superior performance compared to DeepSeek-R1-Distill-Qwen-1.5B.

Limitations and Future Work:
While excelling at reasoning tasks, EXAONE Deep models are specifically fine-tuned for this purpose. For broader real-world applications requiring general instruction-following capabilities, the base EXAONE 3.5 Instruct models are recommended. Future work aims to extend the models' capabilities to domains with less clear or undiscovered answers.

Licensing:
The EXAONE Deep models are openly available for research purposes via Hugging Face. The license is a non-commercial (NC) agreement, explicitly prohibiting commercial use, development of revenue-generating products, and using the models or their outputs to develop or improve other models. It permits access, download, installation, use for research, public disclosure of research results, modification to create derivatives for research, and distribution with a copy of the agreement, with mandatory attribution. The license strictly forbids reverse engineering and mandates ethical use, preventing the generation of harmful, false, or discriminatory content. All intellectual property, including generated output, remains with LG Management Development Institute Co., Ltd. The models are provided "as-is," without warranties.