Learning to Discover at Test Time

James Zou
2026.01.31
· arXiv · by 네루
#LLM #Reinforcement Learning #Test-Time Training #AI Discovery #Open Model

Key Points

  • TTT-Discover proposes a novel method that applies reinforcement learning at test time to continually train a large language model (LLM) for scientific discovery, prioritizing the generation of a single, highly optimized solution rather than broad generalization.
  • The approach pairs an entropic objective, which prioritizes maximum-reward actions over expected reward, with a PUCT-inspired state-reuse mechanism, enabling the LLM to adapt and learn from its own attempts on a specific, out-of-distribution problem.
  • TTT-Discover achieves new state-of-the-art results across diverse domains, including mathematics, GPU kernel engineering, and algorithm design, using an open model and cost-effective training, with significant improvements over prior methods.

This paper introduces Test-Time Training to Discover (TTT-Discover), a novel approach that uses reinforcement learning (RL) to continually train a large language model (LLM) at test time to solve challenging scientific discovery problems. Unlike prior methods that prompt a frozen LLM for search (e.g., AlphaEvolve), TTT-Discover allows the LLM to adapt and improve its internal representations based on experience specific to the target problem. The core motivation is that discovery problems demand solutions beyond an LLM's pre-training data, requiring learning during the problem-solving process itself.

The paper formalizes a scientific problem as an environment, specifically a Markov Decision Process (MDP), characterized by a text description $d$, a candidate solution state $s$, an action $a$ (generated by the LLM), a transition function $T(a)$ producing a new state $s'$, and a continuous reward function $R(s')$. A "discovery" is defined as finding a state $s$ such that $R(s) > r_{\text{sota}}$, where $r_{\text{sota}}$ is the reward of the current best-known solution. Actions typically involve generating thinking tokens and code, which the environment parses and executes to yield a new state $s'$.
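To make the formulation concrete, here is a minimal Python sketch of such an environment; the class and function names (`DiscoveryEnv`, `is_discovery`) are illustrative and not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DiscoveryEnv:
    """One scientific problem cast as the paper's MDP formulation."""
    description: str                  # problem description d
    transition: Callable[[str], str]  # T(a): parse/execute the LLM action, return new state s'
    reward: Callable[[str], float]    # R(s'): continuous score of a candidate solution

    def step(self, action: str) -> tuple[str, float]:
        """Apply an LLM-generated action (thinking + code) and score the resulting state."""
        new_state = self.transition(action)
        return new_state, self.reward(new_state)

def is_discovery(reward: float, r_sota: float) -> bool:
    """A 'discovery' is any state whose reward beats the best-known solution."""
    return reward > r_sota
```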

Traditional search methods like Best-of-N sample i.i.d. rollouts from a frozen $\pi_\theta$. More advanced methods, such as evolutionary search (e.g., AlphaEvolve), employ state-action reuse by maintaining a buffer $H_i$ of previous attempts and using heuristics (reuse) to select an initial state $s_i$ and context $c_i$ for the LLM's next generation $a_i \sim \pi_\theta(\cdot \mid d, s_i, c_i)$. However, these methods do not update the LLM's weights $\theta$.
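For contrast with TTT-Discover, a Best-of-N baseline with a frozen policy can be sketched as follows, reusing the hypothetical `DiscoveryEnv` above; `sample_action` is a stand-in for prompting the frozen LLM.

```python
def best_of_n(env: DiscoveryEnv, sample_action, n: int):
    """Best-of-N: draw n i.i.d. rollouts from a frozen policy and keep the best state.

    sample_action(description) -> str stands in for prompting the frozen LLM;
    no buffer is kept and no weights are updated.
    """
    best_state, best_reward = None, float("-inf")
    for _ in range(n):
        action = sample_action(env.description)      # i.i.d. samples from pi_theta
        state, reward = env.step(action)
        if reward > best_reward:
            best_state, best_reward = state, reward
    return best_state, best_reward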

TTT-Discover addresses the limitations of both frozen LLMs and standard RL for discovery problems. Standard RL aims to maximize expected reward and produce a generalizable policy, which is misaligned with the discovery goal of finding a single, maximal solution. Specifically, naive RL's objective function is indifferent to the maximum reward, its fixed initial state distribution limits the effective horizon, and its exploration strategies can favor safe actions over potentially groundbreaking but riskier ones.

TTT-Discover's methodology, outlined in Algorithm 1, is an iterative process (a code sketch follows the list):

  1. Initialize a buffer $H_0$ with an empty solution.
  2. For $i = 0, \ldots, N-1$:
a. Select an initial state $s_i$ and context $c_i$ from $H_i$ using a reuse heuristic.
b. Generate an action $a_i \sim \pi_{\theta_i}(\cdot \mid d, s_i, c_i)$.
c. Transition to state $s'_i = T(a_i)$ and evaluate its reward $r_i = R(s'_i)$.
d. Add $(s_i, a_i, s'_i, r_i)$ to $H_i$.
e. Update the policy weights $\theta_{i+1}$ from $\theta_i$ using a train subroutine.
  3. Return the state $s'_i$ with the highest reward found across all iterations.
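A hedged Python sketch of this loop, with a single rollout per iteration for brevity; `reuse`, `policy.generate`, and `train` are placeholders for the paper's subroutines rather than its released implementation.

```python
def ttt_discover(env: DiscoveryEnv, policy, reuse, train, n_iters: int):
    """Test-time training loop: search and weight updates interleaved on one problem.

    reuse(buffer)               -> (s_i, c_i): pick an initial state and context (PUCT-inspired)
    policy.generate(d, s, c)    -> action a_i sampled from the current weights
    train(policy, buffer)       -> updated policy (entropic-objective gradient step)
    """
    buffer = [("", None, "", float("-inf"))]              # H_0: start from an empty solution
    best_state, best_reward = "", float("-inf")
    for _ in range(n_iters):
        s_i, c_i = reuse(buffer)                          # a. select initial state and context
        a_i = policy.generate(env.description, s_i, c_i)  # b. sample an action from pi_theta_i
        s_next, r_i = env.step(a_i)                       # c. transition and evaluate reward
        buffer.append((s_i, a_i, s_next, r_i))            # d. grow the experience buffer H_i
        policy = train(policy, buffer)                    # e. one policy update at test time
        if r_i > best_reward:
            best_state, best_reward = s_next, r_i
    return best_state, best_reward                        # highest-reward state found
```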

The key innovations in TTT-Discover lie in its specialized train and reuse subroutines:

  1. Entropic Objective ($J_\beta(\theta)$): To prioritize maximum-reward actions, TTT-Discover optimizes an entropic objective (a numerical sketch follows this list):
$$J_\beta(\theta) = \mathbb{E}_{s \sim \text{reuse}(H)}\left[\log \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[e^{\beta(s) R(s, a)}\right]\right]$$
The temperature parameter $\beta(s)$ is set adaptively per initial state $s$ by constraining the KL divergence of the induced policy (details in Appendix A.1). As $\beta \to \infty$, this objective asymptotically approaches maximizing the maximum reward. The gradient is computed as:
$$\nabla_\theta J_\beta(\theta) = \mathbb{E}_{s \sim \text{reuse}(H),\, a \sim \pi_\theta(\cdot \mid s)}\left[ w_\beta(s)(a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right]$$
where $w_\beta(s)(a) = \frac{e^{\beta(s) R(s, a)}}{\mathbb{E}_{\tilde{a} \sim \pi_\theta(\cdot \mid s)}\left[e^{\beta(s) R(s, \tilde{a})}\right]}$ serves as a re-weighting term for actions. Advantages are shaped with a KL penalty: $A(a; s) = w_\beta(s)(a) - 1 - \lambda \log \frac{\pi_\theta(a \mid s)}{\pi_{\theta_0}(a \mid s)}$.

  2. PUCT-inspired Reuse: To select initial states $s_i$ from the buffer $H_i$, TTT-Discover employs a PUCT-inspired rule (also sketched after this list):
$$\text{Score}(s) = Q(s) + c \cdot P(s) \cdot \sqrt{\frac{1 + T}{1 + n(s)}}$$
Critically, $Q(s)$ is defined as the *maximum* reward achieved by any descendant generated when starting from $s$, not the average. $P(s)$ is proportional to the rank of $s$ in the buffer sorted by reward, $n(s)$ is the number of times $s$ or its descendants have been expanded, and $T$ is the total number of expansions. This heuristic balances exploitation of high-potential states with exploration of under-visited ones.
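A minimal numpy sketch of the re-weighting and advantage shaping for a single initial state $s$, estimating the inner expectation with the batch mean over sampled actions; the variable names are mine, not the paper's.

```python
import numpy as np

def entropic_advantages(rewards, logp_cur, logp_ref, beta: float, lam: float):
    """Re-weighting and KL-shaped advantages for one initial state s.

    rewards  : R(s, a) for a batch of actions sampled from the current policy
    logp_cur : log pi_theta(a | s) under the current weights
    logp_ref : log pi_theta_0(a | s) under the initial (reference) weights
    beta     : temperature beta(s); larger values concentrate weight on the max reward
    lam      : KL-penalty coefficient lambda
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    # w_beta(a) = exp(beta * R) / E_a~pi[exp(beta * R)], estimated by the batch mean;
    # subtract the max before exponentiating for numerical stability.
    z = beta * (rewards - rewards.max())
    w = np.exp(z) / np.exp(z).mean()
    # A(a; s) = w_beta(a) - 1 - lambda * log(pi_theta(a|s) / pi_theta_0(a|s))
    return w - 1.0 - lam * (np.asarray(logp_cur) - np.asarray(logp_ref))
```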
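And a sketch of the PUCT-inspired selection over buffer states; the rank-based prior below is one plausible reading of "$P(s)$ proportional to the rank", so treat it as an assumption rather than the paper's exact rule.

```python
import numpy as np

def puct_select(q_max, n_visits, c: float = 1.0) -> int:
    """Pick the index of the next initial state to expand from the buffer.

    q_max    : per-state maximum reward achieved by any descendant of s (Q(s))
    n_visits : per-state expansion counts n(s)
    c        : exploration constant
    """
    q_max = np.asarray(q_max, dtype=np.float64)
    n_visits = np.asarray(n_visits, dtype=np.float64)
    total = n_visits.sum()                      # T: total expansions so far
    # P(s): prior proportional to the state's rank when sorted by reward
    # (best state gets the largest prior); normalized to sum to 1.
    ranks = q_max.argsort().argsort() + 1       # 1 = worst reward, len = best reward
    prior = ranks / ranks.sum()
    scores = q_max + c * prior * np.sqrt((1.0 + total) / (1.0 + n_visits))
    return int(scores.argmax())
```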

TTT-Discover is implemented using gpt-oss-120b with LoRA fine-tuning (rank 32) on Tinker. Each run involves 50 training steps, with 512 rollouts per step (8 groups of 64). The train step involves one gradient update on the entire batch.
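The reported setup can be summarized in a small configuration sketch; the values are the ones stated above, while the key names are illustrative rather than taken from the released code.

```python
# TTT-Discover setup as reported in the paper (key names are illustrative)
ttt_discover_config = {
    "base_model": "gpt-oss-120b",
    "finetuning": {"method": "LoRA", "rank": 32, "platform": "Tinker"},
    "train_steps": 50,               # RL training steps per problem
    "rollouts_per_step": 512,        # 8 groups of 64 rollouts
    "groups_per_step": 8,
    "rollouts_per_group": 64,
    "gradient_updates_per_step": 1,  # one update on the entire batch
}
```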

The paper demonstrates TTT-Discover's effectiveness across mathematics (Erdős' minimum overlap problem, autocorrelation inequalities), GPU kernel engineering, algorithm design (AtCoder competitions), and biology (single-cell analysis denoising). TTT-Discover achieves new state-of-the-art results in almost all attempted problems. For instance, in Erdős' minimum overlap problem, it improved the upper bound to 0.380876, surpassing AlphaEvolve's 0.380924. In the first autocorrelation inequality, it set a new upper bound of 1.50286, beating ThetaEvolve's 1.50314. These improvements are often achieved by discovering qualitatively new constructions (e.g., an asymmetric 600-piece step function for Erdős' problem), in contrast with prior work that refined existing constructions. The reported results are achieved with an open-source model and can be reproduced with public code, costing a few hundred dollars per problem.