
Agent Lightning: Train ANY AI Agents with Reinforcement Learning
Key Points
- Agent Lightning is a novel framework that decouples Reinforcement Learning (RL) training from AI agent execution, enabling seamless, low-code integration for training Large Language Models (LLMs) within any existing agent.
- It achieves this by formulating agent execution as a Markov Decision Process (MDP), defining a unified data interface for trajectories, and introducing LightningRL, a hierarchical RL algorithm for credit assignment.
- The framework's Training-Agent Disaggregation architecture provides a standardized training service, demonstrating stable and continuous performance improvements across diverse real-world agent tasks like text-to-SQL and RAG.
Agent Lightning is a novel, flexible, and extensible framework designed to enable Reinforcement Learning (RL)-based training of Large Language Models (LLMs) for any AI agent. It addresses the significant challenges in applying RL to complex, dynamic agent behaviors, which current methods struggle with due to their focus on static, single-call tasks. The core innovation lies in its complete decoupling of agent execution from RL training, allowing seamless integration with existing agents (built with frameworks like LangChain, OpenAI Agents SDK, AutoGen, or from scratch) with minimal code modifications.
The framework's methodology is grounded in a rigorous formulation of agent execution as a Markov Decision Process (MDP). This allows for a unified data interface that abstracts away the underlying orchestration logic and agent framework specifics.
Unified Data Interface:
Agent execution, similar to software execution, can be conceptualized as a directed acyclic graph (DAG) of component invocations. Agent Lightning simplifies this by focusing on key state changes relevant for RL optimization.
- State and Call: A "state" at timestep $t$ of an execution of a task, denoted $s_t$, represents a snapshot of the agent's execution, comprising "semantic variables" that encapsulate critical program intents and evolve over time. Changes to these semantic variables occur through "Component Invocations," or "calls." A call invokes a component $f \in \mathcal{F} = \mathcal{F}_{\mathrm{LLM}} \cup \mathcal{F}_{\mathrm{tool}}$ (where $\mathcal{F}_{\mathrm{LLM}}$ is the set of LLMs and $\mathcal{F}_{\mathrm{tool}}$ is the set of tools), producing an output $O_t$ from an input $I_t$. Both $I_t$ and $O_t$ are semantic variables at specific timesteps, representing the visible and modified parts of the state during an invocation.
- Reward and Dataset: Each agent execution is augmented with scalar reward signals $r_t$, where $r_t$ evaluates the quality of the $t$-th invocation. An execution with rewards is represented as the sequence $\tau = ((c_1, r_1), \ldots, (c_T, r_T))$, where $c_t$ denotes the $t$-th call. Rewards can be intermediate (e.g., successful tool use) or terminal (overall task success).
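As a sketch of what this unified interface might capture, an execution can be recorded as a list of calls paired with per-call rewards. The dataclass names below are illustrative only, not the framework's actual API:

```python
from dataclasses import dataclass, field
from typing import Literal

# Hypothetical structures: the paper describes the unified data
# interface abstractly; these names are illustrative, not Agent
# Lightning's real types.

@dataclass
class Call:
    """One component invocation: a component f reads input I_t and
    produces output O_t, both slices of the semantic-variable state."""
    component: Literal["llm", "tool"]
    input: str    # I_t: the part of the state visible to the component
    output: str   # O_t: the part of the state modified by the component

@dataclass
class Execution:
    """One agent run on a task: the sequence of calls c_t, each
    paired with a scalar reward r_t."""
    task: str
    calls: list[Call] = field(default_factory=list)
    rewards: list[float] = field(default_factory=list)

run = Execution(task="text-to-SQL: list all users")
run.calls.append(Call("llm", input="Write SQL for: list all users",
                      output="SELECT * FROM users;"))
run.rewards.append(1.0)  # terminal reward, e.g. query executed correctly
```

Because the record is just (component, input, output, reward) tuples, it carries no trace of the orchestration logic that produced it, which is what lets the same interface cover agents built with any framework.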
Markov Decision Process (MDP) Formulation:
For an LLM to be optimized as a policy model, its decision-making is modeled as a Partially Observable Markov Decision Process (POMDP) $(\mathcal{S}, \mathcal{O}, \mathcal{A}, P, R)$:
- $\mathcal{S}$: the space of states, i.e., $s_t \in \mathcal{S}$.
- $\mathcal{O}$: the observation space, corresponding to inputs visible to the policy LLM, i.e., $o_t = I_t$.
- $\mathcal{A}$: the action space, where an action $a_t = O_t$ is the entire token sequence generated by a single LLM invocation.
- $P$: the (unknown) transition dynamics $P(s_{t+1} \mid s_t, a_t)$.
- $R$: the reward function $r_t = R(s_t, a_t)$.
For RL training, the system extracts the tuples $\{(o_t, a_t, r_t)\}$ from each execution, keeping only the calls where the invoked component is the policy LLM $\pi_\theta$. This approach focuses solely on the LLM's input and output, abstracting away complex agent logic and enabling RL to be applied to diverse agents. The formulation flexibly extends to single-LLM multi-agent scenarios (where one LLM adopts different roles based on prompts) and can potentially be extended to multi-LLM settings using Multi-Agent RL.
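A minimal sketch of this extraction step, assuming an execution recorded as a list of call dicts (the field names are hypothetical) alongside per-call rewards:

```python
def extract_transitions(calls, rewards):
    """Keep only calls handled by the policy LLM and emit
    (o_t, a_t, r_t) triplets: o_t is the LLM's input, a_t its
    generated token sequence, r_t the per-call reward.
    Illustrative sketch; field names are assumptions."""
    return [
        (c["input"], c["output"], r)       # (o_t, a_t, r_t)
        for c, r in zip(calls, rewards)
        if c["component"] == "llm"         # drop tool calls etc.
    ]

calls = [
    {"component": "llm",  "input": "Q: 2+2?", "output": "call calculator(2+2)"},
    {"component": "tool", "input": "calculator(2+2)", "output": "4"},
    {"component": "llm",  "input": "Tool says 4", "output": "The answer is 4."},
]
rewards = [0.0, 0.1, 1.0]
print(extract_transitions(calls, rewards))
# two triplets survive: only the LLM calls, not the tool call
```

Note that the tool call disappears entirely from the training data; its effect is visible only through the next observation and the rewards, which is exactly what makes the orchestration logic irrelevant to the trainer.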
LightningRL Algorithm:
Building on the MDP formulation, Agent Lightning introduces LightningRL, a hierarchical RL algorithm designed specifically for agent training. While the detailed algorithm is not fully described in the provided text, it features a crucial "credit assignment module." This module attributes trajectory-level returns to individual responses generated by each LLM call, enabling the optimization of the policy. LightningRL is designed to be fully compatible with existing single-turn RL methods for LLMs, allowing efficient and effective training by avoiding issues like excessively long sequences and supporting flexible context construction.
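Since the text does not spell out LightningRL's exact credit-assignment rule, the following is only an illustrative sketch of two simple schemes for attributing a trajectory-level return to the individual calls, so that each $(o_t, a_t)$ pair can then be fed to a standard single-turn RL method:

```python
def assign_credit(call_rewards, method="final"):
    """Sketch of a credit-assignment step (hypothetical schemes; the
    paper's actual rule is not described in the text above).
    Returns one scalar credit per call."""
    if method == "final":
        # simplest scheme: every call inherits the full episode return
        episode_return = sum(call_rewards)
        return [episode_return] * len(call_rewards)
    # alternative: reward-to-go, crediting each call with the sum of
    # rewards from that call onward
    credits, running = [], 0.0
    for r in reversed(call_rewards):
        running += r
        credits.append(running)
    return list(reversed(credits))

print(assign_credit([0.0, 1.0, 2.0]))                  # [3.0, 3.0, 3.0]
print(assign_credit([0.0, 1.0, 2.0], method="togo"))   # [3.0, 3.0, 2.0]
```

Whatever the scheme, the output is a per-call scalar that a single-turn method can treat as that call's reward, which is why no changes to existing RL-for-LLM trainers are needed.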
Training-Agent Disaggregation Architecture:
To implement this, Agent Lightning employs a Training-Agent Disaggregation (TA Disaggregation) architecture, which cleanly separates RL training from agent execution:
- Lightning Server: Acts as the controller for the RL training system, managing the training process and exposing an OpenAI-like API for the updated model to clients.
- Lightning Client: Comprises two parts:
- Communication Component: Handles data transmission and reception with the Lightning Server.
- Agent Runtime: Runs the agent and performs data collection. Crucially, it transparently manages agent execution and trajectory collection without requiring code modifications to the agent itself. This design leverages existing observability frameworks (e.g., OpenTelemetry) for trajectory collection, connecting monitoring data directly to RL training. The agent runtime also facilitates an Automatic Intermediate Rewarding (AIR) mechanism, which assigns intermediate rewards based on system monitoring signals (e.g., tool call success/failure), effectively mitigating the sparse reward problem common in RL.
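The AIR idea can be sketched as a simple mapping from monitored spans to scalar rewards. The span schema below is hypothetical; in practice the signals would come from OpenTelemetry-style traces emitted by the agent runtime:

```python
def air_rewards(spans):
    """Sketch of an Automatic Intermediate Rewarding (AIR) rule
    (illustrative only). Maps each monitored span to a small
    intermediate reward based on its recorded status, so the
    trajectory is not left with a single sparse terminal reward."""
    rewards = []
    for span in spans:
        if span["kind"] != "tool":
            rewards.append(0.0)      # non-tool spans carry no AIR signal
        elif span["status"] == "ok":
            rewards.append(0.1)      # successful tool call: small bonus
        else:
            rewards.append(-0.1)     # failed tool call: small penalty
    return rewards

spans = [
    {"kind": "llm",  "status": "ok"},
    {"kind": "tool", "status": "ok"},
    {"kind": "tool", "status": "error"},
]
print(air_rewards(spans))  # [0.0, 0.1, -0.1]
```

The magnitudes here are arbitrary placeholders; the point is that the rewards are derived automatically from monitoring data the runtime already collects, with no extra instrumentation inside the agent's code.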
Agent Lightning's key advancements include: (1) full decoupling of agents and RL training, enabling application to any AI agent with near-zero code changes; (2) a unified data interface and MDP formulation, transforming diverse agent execution data into training trajectories; (3) the hierarchical LightningRL algorithm with credit assignment; and (4) the TA Disaggregation architecture, which provides a standardized training service, integrates observability, and supports automatic intermediate rewarding. Experimental results across text-to-SQL, RAG, and math tool-use tasks demonstrate stable and continuous performance improvements, showcasing its practical potential for real-world agent training and deployment.