
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
Key Points
- CUDA Agent introduces a large-scale agentic reinforcement learning system specifically designed to improve large language models' ability to generate high-performance CUDA kernels.
- The system is built on a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling, and novel RL algorithmic techniques that ensure stable multi-turn training.
- CUDA Agent achieves state-of-the-art results on KernelBench, consistently outperforming torch.compile and surpassing leading proprietary models, especially on more complex tasks, by learning sophisticated optimization strategies.
CUDA Agent is a large-scale agentic reinforcement learning (RL) system designed to address the challenges of high-performance CUDA kernel generation for deep learning, a task traditionally requiring specialized hardware expertise. The paper highlights that despite the general proficiency of Large Language Models (LLMs) in software development, they remain uncompetitive with compiler-based systems like torch.compile for CUDA kernel optimization. Existing approaches, such as training-free refinement or fine-tuning within fixed multi-turn execution-feedback loops, are limited because they fail to fundamentally enhance the model's intrinsic CUDA optimization capabilities.
To overcome these limitations, CUDA Agent systematically improves the base model's CUDA kernel coding abilities through contributions across three complementary dimensions:
- Scalable Data Synthesis Pipeline:
- Seed Problem Crawling: Fundamental computational primitives are mined from PyTorch and Transformers libraries, establishing a comprehensive set of seed operators.
- Combinatorial Problem Construction: LLMs are employed to synthesize aggregated operators by sequentially composing up to five sampled operator classes. This process generates fused, multi-operator tasks that often present more complex optimization landscapes than individual operations.
- Rubric-based Problem Filtering: A rigorous execution-based filtering process ensures data quality. Problems are validated against four criteria: successful execution in both Eager and Compile modes, non-stochasticity for reproducibility, numerical distinctness for different inputs to prevent trivial solutions, and execution time within a reasonable range (1ms to 100ms in eager mode) to filter out trivial or excessively heavy tasks. Additionally, problems with high similarity to KernelBench test cases are excluded to prevent data contamination. This process yields CUDA-Agent-Ops-6K, a curated operator-level dataset.
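The execution-based rubric above can be sketched as a small Python predicate over pre-collected measurements. This is an illustrative sketch: the `ProblemStats` schema, its field names, and the similarity threshold are assumptions; only the four criteria and the decontamination check come from the pipeline description.

```python
from dataclasses import dataclass

@dataclass
class ProblemStats:
    """Execution measurements for one synthesized problem (hypothetical schema)."""
    runs_in_eager: bool            # executes without error in eager mode
    runs_in_compile: bool          # executes without error under torch.compile
    deterministic: bool            # identical outputs across repeated runs
    distinct_outputs: bool         # different inputs yield different outputs
    eager_ms: float                # mean eager-mode runtime in milliseconds
    kernelbench_similarity: float  # similarity to KernelBench cases, in [0, 1]

def passes_rubric(s: ProblemStats, min_ms: float = 1.0, max_ms: float = 100.0,
                  sim_threshold: float = 0.9) -> bool:
    """Apply the four execution-based criteria plus the contamination check."""
    return (
        s.runs_in_eager and s.runs_in_compile         # 1) runs in both modes
        and s.deterministic                           # 2) reproducible (non-stochastic)
        and s.distinct_outputs                        # 3) no trivial constant solutions
        and min_ms <= s.eager_ms <= max_ms            # 4) runtime in a reasonable range
        and s.kernelbench_similarity < sim_threshold  # decontamination vs. KernelBench
    )
```

A candidate pool would then be reduced to the curated dataset with a simple filter, e.g. `kept = [p for p in pool if passes_rubric(p)]`.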
- Skill-Augmented CUDA Development Environment:
- Agent Loop: The system utilizes a ReAct-style agent loop, interleaving reasoning, action execution (via standard shell utilities like BashTool, GlobTool, MultiEditTool, TodoWriteTool), and observation. This enables iterative coding, debugging, and performance optimization.
- CUDA Coding Skill: A SKILL.md instruction file formulates a standard four-step process for CUDA kernel optimization: 1) analyze native PyTorch performance; 2) implement custom CUDA operators in model_new.py with the corresponding CUDA kernel source and binding code; 3) compile and iteratively refine in a GPU sandbox until correctness and performance targets (at least a 5% speedup over torch.compile) are met; 4) repeat optimization. CUDA-specific tools, such as a profiling tool that compares performance against torch.compile, are integrated.
- Robust Reward Scheduling: To overcome issues with raw speedup as a reward signal (outliers, bias towards easy kernels), a normalized, robust reward scheme is introduced. The reward is assigned based on correctness and performance: incorrect kernels receive zero reward, while correct kernels receive a reward normalized by the baseline runtimes, where T is the generated kernel's runtime, T_eager and T_compile are the runtimes of PyTorch's eager and torch.compile versions, respectively, and an indicator term marks a significant speedup over the torch.compile baseline.
- Efforts to Avoid Reward Hacking: The system incorporates several safeguards to ensure accurate and unhackable reward signals: protected verification/profiling scripts, enforcement of execution-time constraints to prevent trivial fallbacks, validation against five randomly sampled inputs, careful profiling with synchronization/warm-up/averaging to reduce noise, and the absence of web search/external information retrieval tools.
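The robust reward scheme above can be sketched as a small Python function. This is a hedged illustration rather than the paper's exact formula: the significant-speedup gate `alpha`, the clipping `cap`, and normalizing only against the torch.compile runtime are assumptions made for the sketch.

```python
def kernel_reward(correct: bool, t_kernel: float, t_compile: float,
                  alpha: float = 0.95, cap: float = 4.0) -> float:
    """Sketch of a normalized, robust kernel reward.

    alpha and cap are illustrative constants, not values from the paper:
    alpha gates the "significant speedup over torch.compile" bonus, and
    cap clips the normalized speedup so outlier kernels cannot dominate.
    """
    if not correct:
        return 0.0                      # correctness is a hard gate
    # Normalize against the torch.compile baseline instead of using raw
    # speedup, so easy kernels with huge raw speedups do not bias training.
    speedup = t_compile / t_kernel
    shaped = min(speedup, cap) / cap    # clipped and rescaled to (0, 1]
    bonus = 1.0 if t_kernel < alpha * t_compile else 0.0  # significant-speedup indicator
    return shaped + bonus
```

For example, a correct kernel twice as fast as torch.compile earns both the shaped term and the bonus, while a correct kernel merely matching the baseline earns only a small shaped reward.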
- RL Algorithmic Techniques for Stable Training:
- Multi-Stage Warm-up: The core solution is a warm-up strategy for both the actor and critic models to adapt to the target distribution.
- Single-Turn Warm-up: Initial RL (PPO) is performed on the base model to enhance its basic CUDA kernel generation capability.
- Actor Initialization (Rejection Fine-Tuning, RFT): Agent trajectories generated by the single-turn RL model are collected. Rejection sampling filters these trajectories, retaining only high-quality rollouts (positive rewards and no inefficient or invalid behaviors). The filtered trajectories are then used to optimize the actor model via supervised fine-tuning with the standard negative log-likelihood objective $\mathcal{L}_{\mathrm{RFT}}(\theta) = -\mathbb{E}_{\tau \sim \mathcal{D}_{\mathrm{filtered}}}\big[\sum_t \log \pi_\theta(a_t \mid s_t)\big]$.
- Critic Initialization (Value Pretraining): The sampled agent trajectories are used to pretrain the critic network $V_\phi$. Target values are computed using Generalized Advantage Estimation (GAE): $\hat{V}_t = V_\phi(s_t) + \hat{A}_t$, where $\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \delta_{t+l}$ and $\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$. The critic parameters $\phi$ are optimized by minimizing the mean squared error $\mathcal{L}(\phi) = \mathbb{E}_t\big[(V_\phi(s_t) - \hat{V}_t)^2\big]$.
- RL Algorithm: After warm-up, Proximal Policy Optimization (PPO) is employed to optimize the actor model using the clipped surrogate objective $\mathcal{L}^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\big]$, where $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the importance sampling ratio and $a_t$ is the action (token) taken at position $t$.
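The GAE targets used for value pretraining and the per-token PPO clipped surrogate can be sketched in pure Python. This is a toy, per-scalar sketch: the discount, GAE lambda, and clip epsilon defaults are the conventional PPO values rather than ones confirmed by the paper, and real training applies these quantities to token-level log-probabilities produced by the actor.

```python
import math

def gae_targets(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages and value targets V̂_t = V(s_t) + Â_t.

    rewards[t] is the reward after action t; values must have
    len(rewards) + 1 entries (bootstrap value for the final state).
    """
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual δ_t
        running = delta + gamma * lam * running                  # Â_t accumulator
        adv[t] = running
    targets = [values[t] + adv[t] for t in range(T)]             # critic regression targets
    return adv, targets

def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-token clipped surrogate: min(r·Â, clip(r, 1−ε, 1+ε)·Â)."""
    ratio = math.exp(logp_new - logp_old)  # importance sampling ratio r_t(θ)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)
```

When the new and old policies agree (ratio = 1), the surrogate reduces to the raw advantage; when the ratio drifts outside the clip band, the objective stops rewarding further movement in that direction.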
CUDA Agent, scaled to a context length of 128k tokens and supporting up to 200 interaction turns, achieves state-of-the-art results on KernelBench, beating torch.compile on 100%, 100%, and 92% of tasks in the Level-1, Level-2, and Level-3 splits, respectively. It outperforms proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by approximately 40% on the hardest Level-3 setting.