
Learning to Discover at Test Time
Key Points
- TTT-Discover proposes a novel method that applies reinforcement learning at test time to continually train a Large Language Model (LLM) for scientific discovery, prioritizing the generation of a single, highly optimized solution rather than broad generalization.
- This approach employs an entropic objective to maximize rewards and a PUCT-inspired state reuse mechanism, enabling the LLM to adapt and learn from its own attempts to solve specific, out-of-distribution problems.
- TTT-Discover achieves new state-of-the-art results across diverse domains, including mathematics, GPU kernel engineering, and algorithm design, using an open model and demonstrating significant improvements over prior methods with cost-effective training.
This paper introduces Test-Time Training to Discover (TTT-Discover), a novel approach that uses reinforcement learning (RL) to continually train a large language model (LLM) at test time to solve challenging scientific discovery problems. Unlike prior methods that prompt a frozen LLM for search (e.g., AlphaEvolve), TTT-Discover allows the LLM to adapt and improve its internal representations based on experience specific to the target problem. The core motivation is that discovery problems demand solutions beyond an LLM's pre-training data, requiring learning during the problem-solving process itself.
The paper formalizes a scientific problem as an environment, specifically a Markov Decision Process (MDP), characterized by a text description $c$, a candidate solution state $s$, an action $a$ (generated by the LLM), a transition function producing a new state $s'$, and a continuous reward function $r$. A "discovery" is defined as finding a state $s$ such that $r(s) > r^*$, where $r^*$ is the reward of the current best-known solution. Actions typically involve generating thinking tokens and code, which the environment parses and executes to yield a new state $s'$.
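The MDP framing above can be sketched as a minimal interface. This is an illustrative skeleton, not code from the paper; the class and method names (`DiscoveryEnv`, `transition`, `is_discovery`) are assumptions made for clarity.

```python
from dataclasses import dataclass


@dataclass
class State:
    """A candidate solution: code plus its measured reward."""
    code: str
    reward: float


class DiscoveryEnv:
    """Illustrative discovery environment: the LLM proposes an action
    (code), the environment executes and scores it as a new state."""

    def __init__(self, description: str, best_known_reward: float):
        self.description = description              # text description c
        self.best_known_reward = best_known_reward  # r* of the best-known solution

    def evaluate(self, code: str) -> float:
        # Problem-specific continuous reward; stubbed out for illustration.
        return 0.0

    def transition(self, state: State, action: str) -> State:
        # Parse/execute the proposed code; here the action is treated
        # directly as the new candidate solution.
        return State(code=action, reward=self.evaluate(action))

    def is_discovery(self, state: State) -> bool:
        # A discovery is any state whose reward beats the best known, r(s) > r*.
        return state.reward > self.best_known_reward
```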
Traditional search methods like Best-of-N sample i.i.d. rollouts from a frozen policy $\pi_0$. More advanced methods, such as evolutionary search (e.g., AlphaEvolve), employ state-action reuse by maintaining a buffer of previous attempts and using heuristics (reuse) to select an initial state and context for the LLM's next generation. However, these methods do not update the LLM's weights $\theta$.
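For contrast, Best-of-N with a frozen policy reduces to i.i.d. sampling plus a max. A toy sketch, where `sample_action` and `evaluate` are illustrative stand-ins for the frozen LLM and the environment:

```python
import random


def sample_action(rng: random.Random) -> str:
    # Stand-in for drawing one rollout from a frozen LLM policy.
    return f"candidate-{rng.randint(0, 10**6)}"


def evaluate(action: str) -> float:
    # Stand-in for executing a candidate and measuring its reward in [0, 1).
    return (sum(ord(c) for c in action) % 100) / 100.0


def best_of_n(n: int, seed: int = 0) -> tuple[str, float]:
    """Draw n i.i.d. rollouts from the frozen policy and keep the best.
    No weights are updated and no state is reused between samples."""
    rng = random.Random(seed)
    candidates = [sample_action(rng) for _ in range(n)]
    return max(((a, evaluate(a)) for a in candidates), key=lambda p: p[1])
```

The key limitation the paper targets is visible here: every sample is drawn from the same distribution, so no information flows from earlier attempts into later ones.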
TTT-Discover addresses the limitations of both frozen LLMs and standard RL for discovery problems. Standard RL aims to maximize expected reward and produce a generalizable policy, which is misaligned with the discovery goal of finding a single, maximal solution. Specifically, naive RL's objective function is indifferent to the maximum reward, its fixed initial state distribution limits the effective horizon, and its exploration strategies can favor safe actions over potentially groundbreaking but riskier ones.
TTT-Discover's methodology, outlined in Algorithm 1, is an iterative process:
- Initialize a buffer $\mathcal{B}$ with an empty solution.
- For $t = 1, \dots, T$:
  a. Select an initial state $s_0$ from $\mathcal{B}$ using the reuse heuristic.
  b. Generate an action $a \sim \pi_\theta(\cdot \mid s_0)$.
  c. Transition to the new state $s'$ and evaluate its reward $r(s')$.
  d. Add $(s', r(s'))$ to $\mathcal{B}$.
  e. Update the policy weights $\theta$ from $\mathcal{B}$ using a train subroutine.
- Return the state with the highest reward found across all iterations.
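The loop above can be sketched end to end. This is a structural skeleton only: `reuse`, `generate`, `transition_and_reward`, and `train` are hypothetical stubs standing in for the paper's PUCT-inspired reuse, the LLM policy, the environment, and the entropic-objective update.

```python
import random


def ttt_discover(num_steps: int, rollouts_per_step: int, seed: int = 0):
    """Skeleton of the TTT-Discover loop: select a start state from the
    buffer, roll out, score, store, then update the policy weights."""
    rng = random.Random(seed)
    buffer = [("", 0.0)]  # (state, reward); initialized with an empty solution

    def reuse(buf):                      # stub for PUCT-inspired reuse
        return max(buf, key=lambda x: x[1])

    def generate(state):                 # stub for sampling from the LLM policy
        return state + str(rng.randint(0, 9))

    def transition_and_reward(action):   # stub environment step
        return action, rng.random()

    def train(buf):                      # stub policy update (entropic objective)
        pass

    for _ in range(num_steps):
        s0, _ = reuse(buffer)
        for _ in range(rollouts_per_step):
            a = generate(s0)
            s_new, r = transition_and_reward(a)
            buffer.append((s_new, r))
        train(buffer)

    # Return the best state found across all iterations.
    return max(buffer, key=lambda x: x[1])
```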
The key innovations in TTT-Discover lie in its specialized train and reuse subroutines:
- Entropic Objective ($J_\tau$): To prioritize maximum-reward actions, TTT-Discover optimizes an entropic objective:

$$J_\tau(\theta) = \tau \log \mathbb{E}_{a \sim \pi_\theta}\!\left[ e^{r(a)/\tau} \right]$$

The temperature parameter $\tau$ is adaptively set per initial state by constraining the KL divergence of the induced policy (details in Appendix A.1). This objective asymptotically approaches maximizing the maximum reward as $\tau \to 0$. The gradient is computed as:

$$\nabla_\theta J_\tau(\theta) = \mathbb{E}_{a \sim \pi_\theta}\!\left[ w(a)\, A(a)\, \nabla_\theta \log \pi_\theta(a) \right], \qquad w(a) = \frac{e^{r(a)/\tau}}{\mathbb{E}_{a' \sim \pi_\theta}\!\left[ e^{r(a')/\tau} \right]}$$

where $w(a)$ serves as a re-weighting term for actions. Advantages are shaped with a KL penalty: $A(a) = \hat{A}(a) - \beta \log \frac{\pi_\theta(a)}{\pi_0(a)}$, where $\pi_0$ is the initial policy.
- PUCT-inspired Reuse: For selecting initial states from the buffer $\mathcal{B}$, TTT-Discover employs a PUCT-inspired rule:

$$s_0 = \arg\max_{s \in \mathcal{B}} \left[ Q(s) + c \, P(s) \, \frac{\sqrt{N}}{1 + n(s)} \right]$$

Critically, $Q(s)$ is defined as the *maximum* reward achieved by any descendant generated when starting from $s$, not the average. $P(s)$ is proportional to the rank of $s$ in the buffer sorted by reward, $n(s)$ is the number of times $s$ or its descendants have been expanded, and $N$ is the total number of expansions. This heuristic balances exploitation of high-potential states with exploration of under-visited ones.
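The re-weighting term $w(a)$ in the entropic gradient is a softmax of the batch rewards at temperature $\tau$. A numerically stable sketch (the max-subtraction trick is a standard implementation detail, not from the paper):

```python
import math


def entropic_weights(rewards: list[float], tau: float) -> list[float]:
    """Normalized weights w(a) proportional to exp(r(a)/tau) over a batch.
    As tau -> 0 the weight concentrates on the maximum-reward action,
    recovering the 'maximize the maximum' behavior; as tau grows, the
    weights approach uniform (plain expected-reward RL)."""
    m = max(rewards)  # subtract the max before exponentiating for stability
    exps = [math.exp((r - m) / tau) for r in rewards]
    z = sum(exps)
    return [e / z for e in exps]
```

For example, at a small temperature nearly all gradient weight falls on the single best rollout in the batch, which is exactly the bias toward discovery that plain expected-reward RL lacks.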
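The PUCT-inspired selection can be sketched as below. The exploration constant and the rank-normalization used for the prior $P(s)$ are illustrative choices, not the paper's exact ones; what matters is that $Q$ is the max over descendants and the bonus shrinks with visit count.

```python
import math


def select_state(buffer, visit_counts, c_puct=1.0):
    """buffer: list of (state, max_descendant_reward) pairs.
    visit_counts: dict index -> number of expansions of that state.
    Returns the index maximizing Q(s) + c * P(s) * sqrt(N) / (1 + n(s))."""
    N = sum(visit_counts.values()) + 1
    # Prior P(s) proportional to the state's rank by reward (1 = worst).
    order = sorted(range(len(buffer)), key=lambda i: buffer[i][1])
    rank = {i: pos + 1 for pos, i in enumerate(order)}
    total_rank = sum(rank.values())
    best_i, best_score = None, -math.inf
    for i, (_state, q_max) in enumerate(buffer):
        p = rank[i] / total_rank             # higher reward -> larger prior
        n = visit_counts.get(i, 0)
        score = q_max + c_puct * p * math.sqrt(N) / (1 + n)
        if score > best_score:
            best_i, best_score = i, score
    return best_i
```

With no visits, the rule picks the highest-reward state; once that state has been expanded many times, the shrinking bonus lets a less-visited state win, which is the intended exploitation/exploration balance.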
TTT-Discover is implemented using gpt-oss-120b with LoRA fine-tuning (rank 32) on Tinker. Each run involves 50 training steps, with 512 rollouts per step (8 groups of 64). The train step involves one gradient update on the entire batch.
The paper demonstrates TTT-Discover's effectiveness across mathematics (Erdős' minimum overlap problem, autocorrelation inequalities), GPU kernel engineering, algorithm design (AtCoder competitions), and biology (single-cell analysis denoising), achieving new state-of-the-art results on almost all attempted problems. For instance, on Erdős' minimum overlap problem it tightened the upper bound beyond AlphaEvolve's previous best, and on the first autocorrelation inequality it set a new upper bound that improves on ThetaEvolve's. These improvements are often achieved by discovering qualitatively new constructions (e.g., an asymmetric 600-piece step function for Erdős' problem), in contrast with prior work that refined existing constructions. The reported results are achieved with an open-source model, can be reproduced with public code, and cost a few hundred dollars per problem.