CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Weiqiang Lou
2026.03.03
· arXiv · by 이호민
#Agent #CUDA #Kernel Generation #LLM #Reinforcement Learning

Key Points

  • CUDA Agent introduces a large-scale agentic reinforcement learning system specifically designed to improve large language models' capabilities in generating high-performance CUDA kernels.
  • The system achieves this through a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling, and novel RL algorithmic techniques that ensure stable multi-turn training.
  • CUDA Agent achieves state-of-the-art results on KernelBench, consistently outperforming torch.compile and surpassing leading proprietary models, especially on the most complex tasks, by learning sophisticated optimization strategies.

CUDA Agent is a large-scale agentic reinforcement learning (RL) system designed to address the challenges of high-performance CUDA kernel generation for deep learning, a task traditionally requiring specialized hardware expertise. The paper highlights that despite the general proficiency of Large Language Models (LLMs) in software development, they remain uncompetitive with compiler-based systems like torch.compile for CUDA kernel optimization. Existing approaches, such as training-free refinement or fine-tuning within fixed multi-turn execution-feedback loops, are limited because they fail to fundamentally enhance the model's intrinsic CUDA optimization capabilities.

To overcome these limitations, CUDA Agent systematically improves the base model's CUDA kernel coding abilities through contributions across three complementary dimensions:

  1. Scalable Data Synthesis Pipeline:
The scarcity of high-quality, expert-level CUDA kernel implementations for training is a significant bottleneck. CUDA Agent addresses this by developing a scalable data collection pipeline to generate a vast and diverse corpus of training problems.
  • Seed Problem Crawling: Fundamental computational primitives are mined from PyTorch and Transformers libraries, establishing a comprehensive set of seed operators.
  • Combinatorial Problem Construction: LLMs are employed to synthesize aggregated operators by sequentially composing up to five sampled operator classes. This process generates fused, multi-operator tasks that often present more complex optimization landscapes than individual operations.
  • Rubric-based Problem Filtering: A rigorous execution-based filtering process ensures data quality. Problems are validated against four criteria: successful execution in both Eager and Compile modes, non-stochasticity for reproducibility, numerical distinctness for different inputs to prevent trivial solutions, and execution time within a reasonable range (1ms to 100ms in eager mode) to filter out trivial or excessively heavy tasks. Additionally, problems with high similarity to KernelBench test cases are excluded to prevent data contamination. This process yields CUDA-Agent-Ops-6K, a curated operator-level dataset.
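
The paper specifies the four rubric criteria but not their implementation. Below is a minimal Python sketch of what such an execution-based filter might look like; the helper names `make_model` and `sample_inputs` and all structural details are hypothetical, and it assumes a CUDA device and single-tensor model outputs. (The KernelBench similarity exclusion is omitted here.)

```python
import time
import torch

def passes_rubric(make_model, sample_inputs, device="cuda"):
    """Hypothetical sketch of the rubric-based problem filter.
    make_model constructs the candidate problem's module; sample_inputs
    returns a tuple of input tensors on the given device."""
    model = make_model().to(device)

    # 1) Must execute successfully in both Eager and Compile modes.
    try:
        x = sample_inputs(device)
        eager_out = model(*x)
        compiled = torch.compile(make_model().to(device))
        compiled(*x)
    except Exception:
        return False

    # 2) Non-stochastic: repeated runs on the same input must agree,
    #    so results are reproducible.
    if not torch.allclose(eager_out, model(*x)):
        return False

    # 3) Numerically distinct outputs for different inputs, so a constant
    #    or otherwise trivial kernel cannot pass the correctness check.
    y = sample_inputs(device)
    if torch.allclose(model(*x), model(*y)):
        return False

    # 4) Eager runtime within a reasonable range (1 ms to 100 ms),
    #    filtering out trivial or excessively heavy tasks.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(10):
        model(*x)
    torch.cuda.synchronize()
    ms = (time.perf_counter() - start) / 10 * 1e3
    return 1.0 <= ms <= 100.0
```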
  2. Skill-Augmented CUDA Development Environment:
The system adopts the agent skills paradigm, providing the LLM with a structured specification and automated tools to formalize the CUDA kernel development workflow.
  • Agent Loop: The system utilizes a ReAct-style agent loop, interleaving reasoning, action execution (via standard shell utilities like BashTool, GlobTool, MultiEditTool, TodoWriteTool), and observation. This enables iterative coding, debugging, and performance optimization.
  • CUDA Coding Skill: A SKILL.md instruction file formulates a standard four-step process for CUDA kernel optimization: 1) Analyze native PyTorch performance; 2) Implement custom CUDA operators in model_new.py with corresponding CUDA kernel source and binding code; 3) Compile and iteratively refine in a GPU sandbox until correctness and performance (at least 5% speedup over torch.compile) are met; 4) Repeat optimization. CUDA-specific tools, such as a profiling tool to compare performance against torch.compile, are integrated.
  • Robust Reward Scheduling: To overcome issues with raw speedup as a reward signal (outliers, bias towards easy kernels), a normalized, robust reward scheme is introduced. The reward $r \in \{-1, 1, 2, 3\}$ is assigned based on correctness and performance (a code sketch follows this list):

$$r = \begin{cases} -1 & \text{if the correctness check fails} \\ 3 & \text{if } b(t, t_{\text{eager}}) \land b(t, t_{\text{compile}}) \\ 2 & \text{if } b(t, t_{\text{eager}}) \\ 1 & \text{otherwise} \end{cases}$$

where $t$ is the generated kernel's runtime, $t_{\text{eager}}$ and $t_{\text{compile}}$ are the runtimes of PyTorch's eager and torch.compile versions, respectively, and $b(t, t_0) = I[(t_0 - t)/t_0 > 5\%]$ indicates a significant speedup over baseline $t_0$.
  • Efforts to Avoid Reward Hacking: The system incorporates several safeguards to ensure accurate and unhackable reward signals: protected verification/profiling scripts, enforcement of execution-time constraints to prevent trivial fallbacks, validation against five randomly sampled inputs, careful profiling with synchronization/warm-up/averaging to reduce noise, and the absence of web search/external information retrieval tools.
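
Putting the reward mapping and the profiling discipline together, a minimal Python sketch might look as follows. The function names (`profile_ms`, `speedup_over`, `reward`) and the warm-up/iteration counts are illustrative assumptions, not details from the paper.

```python
import time
import torch

def profile_ms(fn, inputs, warmup=10, iters=50):
    """Careful timing: warm-up runs, device synchronization, and averaging
    to reduce measurement noise. Requires a CUDA device."""
    for _ in range(warmup):
        fn(*inputs)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*inputs)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3

def speedup_over(t, t0, threshold=0.05):
    """b(t, t0): True if the kernel beats baseline t0 by more than 5%."""
    return (t0 - t) / t0 > threshold

def reward(correct, t, t_eager, t_compile):
    """Map correctness and runtimes to the robust reward r in {-1, 1, 2, 3}."""
    if not correct:
        return -1
    if speedup_over(t, t_eager) and speedup_over(t, t_compile):
        return 3
    if speedup_over(t, t_eager):
        return 2
    return 1
```

In the actual system, the verification and profiling scripts are protected from modification and correctness is checked against five randomly sampled inputs; the sketch above covers only the timing and reward mapping.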
  3. RL Algorithmic Techniques for Stable Training:
Initial RL trials faced instability and performance collapse due to a severe domain distribution mismatch between the base model's learned prior and the CUDA kernel coding data.
  • Multi-Stage Warm-up: The core solution is a warm-up strategy for both the actor and critic models to adapt to the target distribution.
    • Single-Turn Warm-up: Initial RL (PPO) is performed on the base model to enhance its basic CUDA kernel generation capability.
    • Actor Initialization (Rejection Fine-Tuning, RFT): Agent trajectories generated by the single-turn RL model are collected. Rejection sampling filters these trajectories, retaining only high-quality rollouts (positive rewards and no inefficient or invalid behaviors). The filtered trajectories $D'$ are then used to optimize the actor model $\pi_\theta$ via supervised fine-tuning with the objective:

$$L_{\text{RFT}}(\theta) = -E_{\tau \sim D'} \left[ \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t, a_{<t}) \right]$$

    • Critic Initialization (Value Pretraining): The sampled agent trajectories $D$ are used to pretrain the critic network $V_\phi$. Target values $V^{\text{targ}}_t$ are computed using Generalized Advantage Estimation (GAE): $V^{\text{targ}}_t = V_\phi(s_t) + \hat{A}_t$, where $\hat{A}_t = \sum_{l=0}^{T-1-t} (\gamma\lambda)^l \delta_{t+l}$ and $\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$. The critic parameters $\phi$ are optimized by minimizing the mean squared error (both objectives are sketched in code after this list):

$$L_{\text{VP}}(\phi) = \frac{1}{2}\, E_{\tau \sim D} \left[ \frac{1}{T} \sum_{t=0}^{T-1} \left( V_\phi(s_t) - V^{\text{targ}}_t \right)^2 \right]$$
  • RL Algorithm: After warm-up, Proximal Policy Optimization (PPO) is employed to optimize the actor model $\pi_\theta$ using the clipped surrogate objective:

$$L_{\text{CLIP}}(\theta) = E_{\tau \sim D} \left[ \frac{1}{T} \sum_{t=0}^{T-1} \min\left( \rho_t(\theta)\, \hat{A}_t,\ \text{clip}\big(\rho_t(\theta),\, 1 - \epsilon_{\text{lower}},\, 1 + \epsilon_{\text{higher}}\big)\, \hat{A}_t \right) \right]$$

where $\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the importance sampling ratio, and $a_t$ is the action (token) taken at position $t$.
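
The paper gives these objectives only as formulas; the following PyTorch sketch shows one way to compute them. The function names and the $\gamma$, $\lambda$, and $\epsilon$ defaults are illustrative assumptions, not values reported in the paper.

```python
import torch

def rft_loss(logp_tokens, mask):
    """RFT objective on filtered trajectories D': negative log-likelihood
    of the retained action tokens (averaged here for numerical stability)."""
    return -(logp_tokens * mask).sum() / mask.sum()

def gae_targets(rewards, values, gamma=0.99, lam=0.95):
    """GAE advantages A_t and value targets V_t^targ = V(s_t) + A_t.

    rewards: [T] per-step rewards r_t; values: [T+1] critic estimates
    V(s_0)..V(s_T). The backward recursion computes exactly
    A_t = sum_l (gamma*lam)^l * delta_{t+l} over the finite horizon."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv, values[:T] + adv  # advantages, value targets

def value_pretrain_loss(v_pred, v_targ):
    """Critic value-pretraining loss: 0.5 * MSE against the GAE targets."""
    return 0.5 * torch.mean((v_pred - v_targ) ** 2)

def ppo_clip_loss(logp_new, logp_old, adv, eps_lower=0.2, eps_higher=0.28):
    """Clipped surrogate objective with the asymmetric epsilon_lower /
    epsilon_higher bounds from the formula above (epsilon values are
    illustrative). Returns a loss to minimize, i.e. the negated objective."""
    ratio = torch.exp(logp_new - logp_old)                     # rho_t(theta)
    clipped = torch.clamp(ratio, 1 - eps_lower, 1 + eps_higher) * adv
    return -torch.mean(torch.min(ratio * adv, clipped))
```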

Scaled to a 128k-token context length and up to 200 interaction turns, CUDA Agent achieves state-of-the-art results on KernelBench: 100%, 100%, and 92% of its generated kernels run faster than torch.compile on the Level-1, Level-2, and Level-3 splits, respectively. It outperforms proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by approximately 40% on the hardest Level-3 setting.