
rStar2-Agent: Agentic Reasoning Technical Report
Key Points
- rStar2-Agent introduces a 14B math reasoning model trained with agentic reinforcement learning, enabling advanced cognitive behaviors such as careful tool use and reflection on code execution feedback.
- This capability is powered by three innovations: an efficient RL infrastructure with a reliable Python environment, GRPO-RoC (Group Relative Policy Optimization with Resampling on Correct) to manage environment noise, and an efficient multi-stage training recipe starting with non-reasoning SFT.
- rStar2-Agent-14B achieves state-of-the-art math reasoning, scoring 80.6% on AIME24 and 69.8% on AIME25, outperforming significantly larger models such as DeepSeek-R1 (671B) with minimal compute and strong generalization.
The paper introduces rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning (RL) to achieve frontier-level performance. It highlights limitations of current long Chain-of-Thought (CoT) models on complex problems that are prone to subtle errors or require creative shifts, advocating "smarter thinking" via autonomous tool use, validation, and learning from feedback. The agentic RL environment consists of Python coding tools and a Python interpreter.
Challenges in Agentic Reinforcement Learning:
- Inherent Environment Noises: The complexity of coding tools introduces noise. Syntactically or logically incorrect code leads to error messages that can mislead the model, causing it to waste tokens on corrections rather than advancing reasoning.
- Impact of Outcome-only Reward: Current outcome-only rewards (binary accuracy of the final answer) do not penalize undesirable intermediate behaviors. This means trajectories with incorrect intermediate tool calls can still receive positive rewards if the final answer is correct, reinforcing low-quality reasoning with tool errors (observed as ~10-15% tool-related errors in correctly answered trajectories even after training).
Core Innovations of rStar2-Agent:
rStar2-Agent proposes three key innovations to address these challenges and enable effective agentic RL at scale:
- Efficient RL Infrastructure: A reliable Python code environment capable of high-throughput execution (45K concurrent tool calls, 0.3s average feedback) and a load-balanced rollout scheduler that dynamically allocates rollout requests based on KV cache capacity to maximize GPU utilization. This enables training on limited GPU resources (64 MI300X GPUs).
- GRPO-RoC Algorithm: Group Relative Policy Optimization with Resampling on Correct (GRPO-RoC), an agentic RL algorithm that specifically addresses environment noise.
- Efficient Agent Training Recipe: A multi-stage RL training strategy starting with non-reasoning Supervised Fine-Tuning (SFT) and progressing through RL stages.
Smarter Reasoning in a Code Environment (Methodology):
The model performs multi-turn rollouts, where it interacts iteratively with the code environment.
- Process: An initial system prompt and question are given. The model generates reasoning and, optionally, a tool_call. If a tool call is present, the code block is extracted and executed by an environment service, and the output (tool_response) is appended to the trajectory under the user role. The model then takes this updated context and continues reasoning under the assistant role. This repeats until a final answer is produced or the maximum number of turns is reached.
- Tool Call Format: Tool calls use a structured JSON format. The tool_response wraps standard output, IPython output, execution errors (with tracebacks), or timeouts. This API-like interface separates reasoning from execution and generalizes across tools.
- Prompt Template: The prompt instructs the model to wrap its reasoning and its final answer in designated tags, with the numeric result boxed.
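The multi-turn rollout loop described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: run_rollout, extract_tool_call, and the model/environment interfaces are hypothetical names, and the exact tool-call wire format is an assumption.

```python
import json

MAX_TURNS = 8  # hypothetical cap; the report does not state the exact limit


def extract_tool_call(text):
    """Hypothetical parser: pull a JSON tool_call object out of model output."""
    start = text.find('{"tool_call"')
    if start == -1:
        return None
    return json.loads(text[start:])["tool_call"]


def run_rollout(model, env, system_prompt, question, max_turns=MAX_TURNS):
    """Minimal multi-turn agentic rollout loop (illustrative)."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
    for _ in range(max_turns):
        reply = model.generate(messages)           # reasoning, maybe a tool call
        messages.append({"role": "assistant", "content": reply})
        call = extract_tool_call(reply)
        if call is None:                           # no tool call -> final answer
            return messages
        result = env.execute(call["code"])         # stdout / errors / timeouts
        tool_response = json.dumps({"tool_response": result})
        # execution feedback is appended under the user role, per the report
        messages.append({"role": "user", "content": tool_response})
    return messages
```

The key design point the report emphasizes is that the model never executes code itself: execution feedback arrives as a user-role message, keeping reasoning and execution cleanly separated.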
End-to-End Agentic Reinforcement Learning (Technical Details):
Preliminary: GRPO
rStar2-Agent builds upon the Group Relative Policy Optimization (GRPO) algorithm. For each question $q$ with ground-truth answer $a$, GRPO samples a group of $G$ rollout trajectories $\{o_1, \dots, o_G\}$ from the old policy $\pi_{\theta_\text{old}}$. The policy $\pi_\theta$ is optimized by maximizing the clipped surrogate objective:

$$\mathcal{J}(\theta) = \mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\Big( r_{i,t}(\theta)\,\hat{A}_i,\ \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}}\big)\,\hat{A}_i \Big) \right],$$

where $r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t} \mid q, o_{i,<t})}$ is the token-level importance ratio and $\hat{A}_i$ is the estimated advantage, computed group-relatively:

$$\hat{A}_i = \frac{R_i - \mathrm{mean}(\{R_1, \dots, R_G\})}{\mathrm{std}(\{R_1, \dots, R_G\})}.$$

The reward $R_i$ is outcome-only, based on whether the final boxed answer matches the ground truth.
Modifications from standard GRPO include: removal of the KL divergence penalty and the entropy loss, and an increased upper clipping bound $\epsilon_{\text{high}} = 0.28$ (Clip-Higher) to encourage exploration.
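The group-relative advantage and the asymmetrically clipped per-token term can be sketched in a few lines. Function names are illustrative; $\epsilon_{\text{high}} = 0.28$ follows the report's Clip-Higher setting, while $\epsilon_{\text{low}} = 0.2$ is an assumed default (the report does not state it).

```python
import math

def group_advantages(rewards):
    """Group-relative advantage: normalize each rollout's reward by the
    group mean and standard deviation, as in GRPO."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = math.sqrt(var) or 1.0  # guard against uniform-reward groups
    return [(r - mean) / std for r in rewards]

def clipped_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token PPO-style clipped objective with the asymmetric
    Clip-Higher upper bound described in the report."""
    clipped = min(max(ratio, 1 - eps_low), 1 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```

Because the bound is asymmetric (0.28 up vs. 0.2 down), tokens whose probability the new policy wants to raise get more headroom, which is the exploration-encouraging effect Clip-Higher targets.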
GRPO-RoC: Group Relative Policy Optimization with Resampling on Correct
To address environment noise and improve trajectory quality while retaining outcome-only rewards, GRPO-RoC introduces the Resample on Correct (RoC) rollout strategy.
- Oversampling: Instead of sampling G rollouts, a larger pool of 2G trajectories is initially sampled.
- Asymmetric Selection: The oversampled rollouts are then downsampled to G for policy updates, with different selection strategies for negative and positive samples:
  - Negative Samples: Zero-reward trajectories are uniformly sampled from the pool, preserving the diversity of failure modes.
  - Positive Samples: Successful trajectories (reward = 1) are sampled with probability *inversely proportional* to a total penalty score ptotal, prioritizing higher-quality traces. ptotal considers two types of intermediate issues:
    - Tool Call Errors (perr): penalizes erroneous tool calls; a default penalty is assigned to trajectories with no tool calls at all, encouraging tool usage.
    - Format Violations (pformat): punishes undesirable formats, such as redundant answer blocks or incorrect tag counts.
  The total penalty ptotal aggregates these two scores.
This asymmetric sampling guides the model toward cleaner, higher-quality positive trajectories with correct tool usage and formatting, while still exposing it to diverse failure modes. The final objective is the same as GRPO's but applied to the G selected rollouts.
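The RoC selection step described above might look like the following sketch. The exact half/half split between negatives and positives and the concrete weighting scheme are simplifying assumptions; only the asymmetry (uniform negatives, penalty-weighted positives) is from the report.

```python
import random

def roc_downsample(rollouts, group_size, rng=None):
    """Resample-on-Correct downsampling (illustrative sketch).
    `rollouts` is a list of (reward, p_total) pairs from an oversampled pool."""
    rng = rng or random.Random(0)
    positives = [r for r in rollouts if r[0] == 1]
    negatives = [r for r in rollouts if r[0] == 0]
    k_neg = min(len(negatives), group_size // 2)   # assumed 50/50 split
    k_pos = group_size - k_neg

    # Negatives: uniform sampling preserves diverse failure modes.
    selected = rng.sample(negatives, k_neg)

    # Positives: weight each trace inversely to its penalty p_total,
    # then draw without replacement (simple sequential weighted draw).
    pool = list(positives)
    for _ in range(min(k_pos, len(pool))):
        weights = [1.0 - p + 1e-6 for _, p in pool]
        pick = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        selected.append(pool.pop(pick))
    return selected
```

The effect is that the gradient sees clean successes preferentially, without ever changing the outcome-only reward itself.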
Large-Scale Agentic RL Infrastructure:
- Reliable High-Throughput Code Environment: Handles up to 45,000 concurrent tool calls with average feedback latency of 0.3 seconds.
- Load-Balanced Rollout Scheduler: Optimizes computational utilization by dynamically allocating rollout requests based on available KV cache capacity across GPUs.
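A greedy, capacity-aware assignment conveys the scheduling idea. The real scheduler operates dynamically over live requests and KV-cache state; the static setting, names, and units here are assumptions.

```python
import heapq

def schedule_rollouts(requests, gpu_free_kv):
    """Assign each rollout request (id, estimated KV footprint) to the GPU
    with the most free KV-cache capacity (illustrative greedy sketch)."""
    # max-heap over free capacity (negate values for heapq's min-heap)
    heap = [(-free, gpu) for gpu, free in enumerate(gpu_free_kv)]
    heapq.heapify(heap)
    assignment = {}
    for req_id, kv_need in requests:
        free, gpu = -heap[0][0], heap[0][1]
        if kv_need > free:
            continue  # would overflow every GPU's cache; defer the request
        heapq.heapreplace(heap, (-(free - kv_need), gpu))
        assignment[req_id] = gpu
    return assignment
```

Balancing on KV-cache headroom rather than request count matters because agentic rollouts have highly variable lengths, so equal request counts can still leave some GPUs idle while others thrash.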
Training Recipe:
- Non-Reasoning Cold Start for SFT: The training begins with SFT focused solely on general instruction following, coding tool usage, and formatting, without explicitly enhancing reasoning. This aims to avoid SFT overfitting and keeps initial average responses short, allowing RL to cultivate reasoning effectively.
- Multi-Stage RL Training: GRPO-RoC is applied in multiple stages, gradually increasing task difficulty and maximum training length. Each stage uses shorter rollout lengths (8K-12K) to encourage efficient reasoning strategies, unlike prior methods that scale rollouts to 16K-48K.
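The staged schedule can be sketched as a simple step-to-config mapping. The stage count, boundaries, and difficulty labels below are illustrative assumptions consistent with the report's 8K-12K range and 510 total steps, not the paper's exact configuration.

```python
# Illustrative multi-stage RL schedule: later stages raise the maximum
# rollout length and shift toward harder problems.
STAGES = [
    {"max_len": 8_000,  "difficulty": "all"},        # assumed stage 1
    {"max_len": 12_000, "difficulty": "all"},        # assumed stage 2
    {"max_len": 12_000, "difficulty": "hard-only"},  # assumed stage 3
]

def stage_for_step(step, steps_per_stage=170, stages=STAGES):
    """Map a global RL step to its stage config (hypothetical even split
    of 510 steps across 3 stages)."""
    idx = min(step // steps_per_stage, len(stages) - 1)
    return stages[idx]
```

Keeping rollout lengths short early on forces the policy to learn concise, tool-assisted reasoning rather than padding out long chains of thought.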
Results:
rStar2-Agent-14B achieves state-of-the-art math reasoning performance in only 510 RL steps within one week. On AIME24, it scores 80.6%, surpassing DeepSeek-R1 (671B), OpenAI o3-mini (medium), and Claude-Opus-4.0. It also shows strong generalization to scientific reasoning and agentic tool-use tasks beyond mathematics.