deepseek-ai/DeepSeek-R1-0528-Qwen3-8B · Hugging Face
Key Points
- DeepSeek-R1-0528 is a significant upgrade to the DeepSeek R1 model, enhancing its reasoning and inference capabilities through increased computational resources and algorithmic optimizations.
- This new version shows marked improvements across various benchmarks, including mathematics and coding, with reasoning depth increasing from 12K to 23K tokens per AIME question, leading to higher accuracy and reduced hallucination rates.
- A distilled 8B parameter version, DeepSeek-R1-0528-Qwen3-8B, achieves state-of-the-art performance among open-source models, and usage recommendations now include system prompt support without needing the `<think>` token.
DeepSeek-R1-0528 is a minor version upgrade of the DeepSeek R1 model that significantly enhances its reasoning and inference capabilities. The improvement comes from leveraging increased computational resources and introducing algorithmic optimization mechanisms (not specified in the model card) during post-training. The model now approaches the performance of leading models such as o3 and Gemini 2.5 Pro on mathematics, programming, and general-logic benchmarks.
Key performance improvements for DeepSeek-R1-0528 include:
- An increase in AIME 2025 test accuracy from 70% to 87.5%.
- Enhanced reasoning depth, evidenced by an average increase in token usage from 12K to 23K per question on the AIME test set.
- Reduced hallucination rates, improved function calling support, and a better experience for vibe coding.
Evaluations were conducted with a maximum generation length of 64K tokens. For benchmarks requiring sampling, a fixed temperature and top-p value were used, and 16 responses were generated per query to estimate pass@1. Performance gains were observed across various benchmarks:
| Category | Benchmark (Metric) | DeepSeek-R1 | DeepSeek-R1-0528 |
|---|---|---|---|
| General | MMLU-Redux (EM) | 92.9 | 93.4 |
| General | MMLU-Pro (EM) | 84.0 | 85.0 |
| General | GPQA-Diamond (Pass@1) | 71.5 | 81.0 |
| General | FRAMES (Acc.) | 82.5 | 83.0 |
| General | Humanity's Last Exam (Pass@1) | 8.5 | 17.7 |
| General | SimpleQA (Correct) | 30.1 | 27.8 |
| Code | LiveCodeBench (Pass@1) | 63.5 | 73.3 |
| Code | Codeforces-Div1 (Rating) | 1530 | 1930 |
| Code | SWE Verified (Resolved) | 49.2 | 57.6 |
| Code | Aider-Polyglot (Acc.) | 53.3 | 71.6 |
| Math | AIME 2024 (Pass@1) | 79.8 | 91.4 |
| Math | AIME 2025 (Pass@1) | 70.0 | 87.5 |
| Math | HMMT 2025 (Pass@1) | 41.7 | 79.4 |
| Math | CNMO 2024 (Pass@1) | 78.8 | 86.9 |
| Tools | BFCL\_v3\_MultiTurn (Acc.) | — | 37.0 |
| Tools | Tau-Bench Airline (Pass@1) | — | 53.5 |
| Tools | Tau-Bench Retail (Pass@1) | — | 63.9 |

Notes: SimpleQA is the one regression (30.1 to 27.8 Correct). SWE Verified was resolved using the Agentless framework; Tau-Bench was evaluated with GPT-4.1 acting as the user.
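The pass@1 estimation procedure described above (16 sampled responses per query, averaged over the benchmark) can be sketched as follows; the per-query correctness counts here are made up purely for illustration:

```python
def pass_at_1(num_samples: int, num_correct: int) -> float:
    """Estimate pass@1 for one query as the fraction of sampled
    responses that are correct. With k samples, the unbiased pass@k
    estimator reduces to num_correct / num_samples when k = 1."""
    return num_correct / num_samples

# Hypothetical correctness counts for four queries, each with 16 samples.
per_query_correct = [16, 12, 0, 8]
scores = [pass_at_1(16, c) for c in per_query_correct]

# The benchmark score is the mean pass@1 over all queries.
benchmark_pass_at_1 = sum(scores) / len(scores)
print(benchmark_pass_at_1)  # → 0.5625
```

Sampling 16 responses rather than one greedy decode reduces the variance of the estimate, which matters on small test sets such as AIME (30 questions per year).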
A distilled version, DeepSeek-R1-0528-Qwen3-8B, was created by post-training the Qwen3 8B Base model using chain-of-thought distillation from DeepSeek-R1-0528. This smaller model achieves state-of-the-art performance among open-source models on AIME 2024 (86.0 Pass@1), surpassing Qwen3 8B by +10.0% and matching Qwen3-235B-thinking.
Deployment guidelines for local execution include several changes:
- System prompt: the model now supports a standard system prompt, eliminating the need to prepend the `<think>` token to force a thinking pattern.
- Architecture and tokenizer: DeepSeek-R1-0528-Qwen3-8B shares the same architecture as Qwen3-8B but uses the tokenizer configuration from DeepSeek-R1-0528, so configuration files must be sourced from the DeepSeek repository.
- Temperature: a recommended default temperature is given for inference in web/application environments.
The model card also provides prompt templates for file uploading and web search, distinguishing between Chinese and English queries for search-augmented generation; these templates give detailed instructions for integrating `search_results` with `cur_date` and `question`, emphasizing citation format and content structuring.
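To illustrate the system-prompt change, a chat request can now be built as an ordinary system/user message list, with no `<think>` token manually prepended to the assistant turn; the prompt strings below are hypothetical examples, and the commented call shows where a chat template (e.g. `tokenizer.apply_chat_template` in Transformers) would consume them:

```python
# With DeepSeek-R1-0528, a standard system prompt is supported and the
# thinking pattern is handled by the chat template itself, so no
# "<think>" token needs to be inserted by the caller.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Solve: what is 17 * 24?"},
]

# In a real deployment the messages would be passed to the tokenizer's
# chat template, e.g.:
#   prompt = tokenizer.apply_chat_template(messages, tokenize=False,
#                                          add_generation_prompt=True)

# Sanity checks: a plain system turn, and no manual "<think>" anywhere.
assert messages[0]["role"] == "system"
assert not any("<think>" in m["content"] for m in messages)
```

Earlier R1 releases required forcing the assistant output to begin with `<think>`; with R1-0528 the request above is all the caller needs to construct.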
The models are licensed under the MIT License, allowing commercial use and distillation.