
Agentic Reasoning for Large Language Models
Key Points
- This paper defines agentic reasoning for large language models as bridging thought and action, reframing LLMs as autonomous agents that plan, act, and learn through continual interaction.
- It provides a systematic roadmap organizing agentic reasoning into foundational, self-evolving, and collective multi-agent dimensions, analyzed across in-context and post-training optimization settings.
- The survey contextualizes these mechanisms with real-world applications and benchmarks, outlining open challenges such as personalization, long-horizon interaction, and governance for future development.
The paper "Agentic Reasoning for Large Language Models" introduces agentic reasoning as a paradigm shift where Large Language Models (LLMs) are reframed as autonomous agents that plan, act, and learn through continual interaction, unifying reasoning with acting. This concept moves beyond traditional LLM reasoning, which is typically a static, one-shot prediction task, to an interactive, dynamic, and stateful process that enables planning, adaptation, and collaboration.
The survey systematically organizes agentic reasoning along three complementary dimensions, which characterize environmental dynamics and agent capabilities:
- Foundational Agentic Reasoning: Establishes core single-agent capabilities like planning, tool use, and search in stable environments. Agents decompose goals, invoke external tools, and verify results.
- Self-Evolving Agentic Reasoning: Focuses on how agents continually improve through cumulative experience, integrating feedback and memory-driven adaptation in evolving settings. This involves persistent updates of internal states and policies without full retraining.
- Collective Multi-Agent Reasoning: Extends intelligence to collaborative scenarios where multiple agents coordinate roles, share knowledge, and pursue shared goals through communication and shared memory systems.
Across these dimensions, the paper distinguishes two complementary optimization settings:
- In-context Reasoning: Scales inference-time interaction through structured orchestration, search-based planning, and adaptive workflow design, without modifying model parameters. It focuses on how agents navigate complex problem spaces dynamically during deployment.
- Post-training Reasoning: Targets capability internalization by optimizing behaviors through reinforcement learning (RL) and supervised fine-tuning (SFT), consolidating successful reasoning patterns or tool-use strategies into the model's weights.
Core Methodology and Formalization:
The paper formalizes agentic reasoning as operating within a Partially Observable Markov Decision Process (POMDP) framework, defined by a tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{T}, \mathcal{M}, P, \Omega, R, \gamma)$. Here, $\mathcal{S}$ is the latent environment state space, $\mathcal{O}$ is the observation space, $\mathcal{A}$ is the external action space, $\mathcal{T}$ is a reasoning trace space (e.g., latent plans, chain-of-thought), and $\mathcal{M}$ is the agent's internal memory/context. $P$ and $\Omega$ are the transition and observation kernels, $R$ is the reward function, and $\gamma$ is the discount factor.
The agent's policy is factorized into an internal thought process and an external action execution:

$$\pi_\theta(\tau_t, a_t \mid h_t) = \pi_\theta^{\text{think}}(\tau_t \mid h_t)\,\pi_\theta^{\text{act}}(a_t \mid h_t, \tau_t),$$

where $h_t = (o_1, \tau_1, a_1, \ldots, o_t)$ is the history up to timestep $t$. This decomposition highlights the core shift from traditional LLMs by making the "think-act" structure explicit. The objective is to maximize the expected return $J(\theta) = \mathbb{E}\big[\sum_{t} \gamma^{t} r_t\big]$.
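As a concrete illustration, the factorized think-act policy can be sketched as a simple interaction loop. The names `llm_think` and `llm_act` are hypothetical stand-ins for the two policy heads, and `observe`/`act_in_env` for the environment interface; none of these are an API from the paper.

```python
def think_act_loop(observe, act_in_env, llm_think, llm_act, max_steps=10):
    """Minimal sketch of the factorized policy: each step first samples
    a reasoning trace tau_t (think), then an external action a_t
    conditioned on that trace (act), appending both to the history h_t."""
    history = []  # h_t: interleaved observations, thoughts, and actions
    for _ in range(max_steps):
        o_t = observe()
        history.append(("obs", o_t))
        tau_t = llm_think(history)          # internal reasoning trace
        history.append(("thought", tau_t))
        a_t = llm_act(history, tau_t)       # external action given tau_t
        history.append(("action", a_t))
        done = act_in_env(a_t)
        if done:
            break
    return history
```

The point of the sketch is purely structural: the thought is sampled before, and conditions, the action, which is exactly the "think-act" decomposition above.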
Technical Details of Optimization Modes:
- In-Context Reasoning (Inference-Time Search): With frozen model parameters $\theta$, the agent optimizes the reasoning trajectory by searching over the trace space $\mathcal{T}$ to maximize a heuristic value function $V(\tau)$. Methods like ReAct perform greedy decoding over alternating thoughts ($\tau_t$) and actions ($a_t$). Tree-of-Thoughts (ToT) and similar Monte Carlo Tree Search (MCTS)-style approaches treat partial thoughts as nodes (e.g., a partial trace $\tau_{1:k}$ from $\mathcal{T}$) and search for an optimal path $\tau^* = \arg\max_{\tau} V(\tau)$, where $V$ is a heuristic evaluator or verifier. This corresponds to planning in $\mathcal{T}$ without updating policy parameters.
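A ToT-style inference-time search can be sketched as beam-limited best-first expansion over partial thoughts. Here `expand` and `value` are hypothetical stand-ins for the LLM thought proposer and the heuristic evaluator V; the beam-search strategy is one common choice, not the only variant the paper covers.

```python
import heapq

def tree_of_thoughts_search(root, expand, value, beam=3, max_depth=4):
    """Sketch of inference-time search over partial reasoning traces:
    expand(node) proposes candidate next thoughts, value(node) scores a
    partial trace; only the best `beam` nodes survive each depth."""
    frontier = [root]
    best = root
    for _ in range(max_depth):
        candidates = [child for node in frontier for child in expand(node)]
        if not candidates:
            break
        # keep the top-`beam` partial traces by heuristic value
        frontier = heapq.nlargest(beam, candidates, key=value)
        if value(frontier[0]) > value(best):
            best = frontier[0]
    return best
```

Because the model parameters never change, all the optimization effort lives in which trace the search commits to, matching the "planning without parameter updates" framing.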
- Post-Training (Policy Optimization): This paradigm optimizes $\theta$ to align the policy with long-horizon rewards $R$. While Proximal Policy Optimization (PPO) is common, Group Relative Policy Optimization (GRPO)-based methods are widely used. For a group of $G$ sampled outputs $\{o_i\}_{i=1}^{G}$ from the same prompt $q$, the GRPO objective is:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\!\Big(\rho_{i,t}\hat{A}_{i,t},\ \text{clip}\big(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_{i,t}\Big)\right] - \beta\,\mathbb{D}_{\text{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big),$$

where $\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q,\, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q,\, o_{i,<t})}$ and the group-normalized advantage is $\hat{A}_{i,t} = \frac{r_i - \mu}{\sigma}$, with $\mu = \text{mean}(\{r_j\}_{j=1}^{G})$ and $\sigma = \text{std}(\{r_j\}_{j=1}^{G})$. Advanced methods like ARPO and DAPO extend this for sparse rewards and stability.
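The group-normalized advantage at the heart of GRPO is simple to compute: each sampled output's reward is standardized against its own group. The small epsilon guarding division by zero is an implementation detail assumed here, not taken from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-normalized advantages: for G sampled outputs of the same
    prompt, A_i = (r_i - mean(r)) / std(r), using the population
    standard deviation over the group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]
```

Because the baseline is the group mean rather than a learned value function, GRPO needs no critic network, which is a large part of its appeal for LLM fine-tuning.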
- Collective Intelligence (Multi-Agent Reasoning): This extends the single-agent formulation to a decentralized partially observable multi-agent setting (Dec-POMDP). Each agent's observation $o_t^i$ includes a communication channel $c_t$. For $N$ agents, the joint policy is composed of individual policies $\pi = (\pi^1, \ldots, \pi^N)$. Communication acts as an extension of reasoning, where one agent's action can prompt another's internal reasoning. Centralized-Training/Decentralized-Execution (CTDE) paradigms are used to stabilize cooperative behaviors.
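One communication round in this decentralized setting might be sketched as follows. Here `agents` is a hypothetical map from agent name to a local policy taking (observation, channel) and returning (action, message); messages written to the channel become inputs for the next round, which is how one agent's output can trigger another's reasoning.

```python
def multi_agent_round(agents, observations, channel):
    """Sketch of one Dec-POMDP step with a shared communication channel:
    each agent i sees its local observation plus the channel contents,
    then emits an external action and an optional message for peers."""
    actions, new_channel = {}, []
    for name, policy in agents.items():
        action, message = policy(observations[name], channel)
        actions[name] = action
        if message is not None:
            new_channel.append((name, message))  # visible next round
    return actions, new_channel
```

Note that every agent in a round reads the same previous channel, so execution stays decentralized even though the message log is shared.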
- Self-Evolving Agents (The Meta-Learning Loop): These agents optimize the agent system itself across episodes. Let $\mathcal{E}$ denote the evolvable system state (e.g., explicit memories, tool libraries, code). A generic meta-update rule is $\mathcal{E}_{k+1} = U(\mathcal{E}_k, F_k)$, where $F_k$ represents environmental feedback (rewards, execution errors). Types of evolution include:
- Verbal Evolution: $\mathcal{E}$ consists of textual reflections or guidelines (e.g., Reflexion updates $\mathcal{E}$ by synthesizing error logs into linguistic cues).
- Procedural Evolution: $\mathcal{E}$ consists of executable tools or skills (e.g., Voyager synthesizes new code-based skills).
- Structural Evolution: $\mathcal{E}$ consists of the agent's source code or architecture, where an LLM acts as a mutation operator to search for superior reasoning algorithms.
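The generic meta-update loop behind all three evolution types can be sketched directly. `run_episode` and `update` are hypothetical stand-ins for episode execution and the update operator; in verbal evolution, for instance, the state would be a list of textual reflections and the update would append a new one distilled from the feedback.

```python
def self_evolve(state, run_episode, update, num_episodes=3):
    """Generic meta-update loop across episodes: run an episode with
    the current evolvable state (memories, tools, code), collect
    feedback (rewards, execution errors), and fold it back into the
    state without any weight retraining."""
    for _ in range(num_episodes):
        feedback = run_episode(state)     # environmental feedback
        state = update(state, feedback)   # meta-update of the system
    return state
```

The loop makes the key contrast explicit: what changes between episodes is the external system state, not the model parameters.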
The paper outlines its contributions as providing a conceptual framing of agentic reasoning, a systematic review across single-agent, adaptive, and multi-agent systems, a survey of real-world applications and benchmarks, and an identification of future challenges. It structures the survey into foundational, self-evolving, and collective reasoning, followed by applications, benchmarks, and open problems.