
The Second Half

Shunyu Yao
2025.08.31
#AI #RL #LLM #Reasoning #Evaluation

Key Points

  1. The first half of AI focused on developing novel training methods and models to hillclimb benchmarks, where method innovation was prioritized over task definition.
  2. A new "recipe" combining language pre-training, scale, and reasoning has made RL generalize across tasks, effectively standardizing benchmark-solving and diminishing the impact of incremental method improvements.
  3. The second half of AI demands a shift from solving problems to defining them, focusing on fundamentally rethinking evaluation setups to prioritize real-world utility and drive truly game-changing research.

The paper posits that Artificial Intelligence (AI) is at a "halftime" inflection point, shifting from a focus on developing new training methods and models to defining real-world problems and evaluating for utility.

The first half of AI, spanning decades, was characterized by an emphasis on creating novel algorithms and model architectures (e.g., backpropagation, convolutional networks, the Transformer). Success was measured by "hillclimbing" established benchmarks (e.g., ImageNet, WMT'14). Seminal works like AlexNet or the Transformer received significantly more citations than the benchmarks they outperformed, illustrating that the game favored methods over tasks. This was because methods were generally harder to invent, more exciting, and more broadly applicable across domains (e.g., the Transformer's impact on CV, NLP, and RL). Tasks, in contrast, often involved simply adapting existing human challenges into benchmarks.

The paper argues that this game is now "ruined" by the emergence of a "recipe" that fundamentally changes the landscape. This recipe comprises:

  1. Massive language pre-training: Distilling general commonsense and linguistic knowledge into models.
  2. Scale: Utilizing vast amounts of data and compute.
  3. Reasoning and Acting (ReAct-like capabilities): Enabling agents to perform internal "thought" steps before external actions.
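
Taken together, the recipe amounts to an agent loop in which internal "thought" steps are interleaved with external actions. A minimal, self-contained sketch of such a ReAct-style loop follows; `ToyLM` and `ToyEnv` are stand-ins invented for illustration, not real APIs.

```python
class ToyEnv:
    """A 3-step corridor: the agent must move 'right' to reach the goal."""
    def reset(self):
        self.pos = 0
        return f"position {self.pos}"
    def step(self, action):
        if action == "right":
            self.pos += 1
        done = self.pos >= 3
        reward = 1.0 if done else 0.0
        return f"position {self.pos}", reward, done

class ToyLM:
    """Stub 'language model' that emits a thought, then an action."""
    def generate(self, prompt):
        if prompt.endswith("Thought:"):
            return "the goal is to the right, so I should move right"
        return "right"  # action conditioned on the preceding thought

def react_episode(lm, env, max_steps=10):
    obs, trajectory = env.reset(), []
    for _ in range(max_steps):
        # Internal step: reasoning does not change the external world.
        thought = lm.generate(f"Observation: {obs}\nThought:")
        # External step: the action is conditioned on the thought.
        action = lm.generate(f"Thought: {thought}\nAction:")
        obs, reward, done = env.step(action)
        trajectory.append((thought, action, obs, reward))
        if done:
            break
    return trajectory

traj = react_episode(ToyLM(), ToyEnv())
```

The key structural point is that the "thought" never touches `env.step`; it only shapes the action that follows.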

Framed through the lens of Reinforcement Learning (RL), the traditional view prioritized the RL algorithm (e.g., PPO, DQN) over the environment and priors. Early efforts like OpenAI Gym sought to generalize environments. However, the critical missing piece was *priors*. Language pre-training provided powerful priors, but generalization to domains like computer control or video games remained elusive.

The "eureka moment" described is the integration of reasoning as an action within the RL framework. Classical RL theory treats actions as directly affecting the external environment. However, adding internal "thinking" or "reasoning" steps—which do not immediately change the external world but leverage pre-trained language models—allows for powerful generalization. This concept is analogous to an agent being presented with "infinite empty boxes" alongside one valuable box. While classical RL might suggest this dilutes the expected reward, the paper argues that incorporating these "empty boxes" (reasoning steps) allows the agent to better prepare and choose the correct box by leveraging learned language priors. This means that:
*Language generalizes through reasoning in agents.*
With the right RL priors (from large-scale language pre-training) and an RL environment that incorporates language-based reasoning as part of its action space, the RL algorithm itself becomes "the most trivial part." This reversal of traditional RL research priorities has led to breakthroughs like the "o-series" models and computer-using agents.
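
The dilution argument above can be made concrete with a toy calculation (the setup is invented for illustration): a uniform policy's expected reward shrinks as empty reasoning actions are added to the action space, while a prior-guided policy is unaffected.

```python
def uniform_expected_reward(n_env_actions, n_reasoning_actions):
    """Uniform policy: reward is diluted by every extra (empty) action."""
    n = n_env_actions + n_reasoning_actions
    return 1.0 / n  # exactly one action yields reward 1

def prior_guided_expected_reward(n_env_actions, n_reasoning_actions):
    """A policy with a good prior ignores the dilution entirely."""
    return 1.0  # it reasons first, then picks the rewarding action

# Classical view: more empty actions dilute expected reward.
print(uniform_expected_reward(10, 0))         # 0.1
print(uniform_expected_reward(10, 990))       # 0.001
# With language priors, extra reasoning actions cost nothing:
print(prior_guided_expected_reward(10, 990))  # 1.0
```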

The second half of AI is necessitated because the "recipe" industrializes and standardizes benchmark hillclimbing, making novel methods less impactful (e.g., a 5% improvement from a novel method pales in comparison to a 30% improvement from the next scaled "o-series" model). Even harder benchmarks are rapidly solved.

Therefore, the paper advocates for a fundamental rethinking of evaluation. Instead of just creating harder benchmarks, the goal is to question existing evaluation setups and devise new ones that force the invention of methods beyond the current recipe. The author highlights the utility problem: despite AI's prowess in games and exams, its real-world economic impact is still limited. This is attributed to a disconnect between evaluation setups and real-world scenarios. Two examples are given:

  1. Automated vs. Human-engaged Evaluation: Current evaluations often assume autonomous agents with single inputs and outputs. Real-world tasks frequently require continuous human interaction (e.g., customer service chatbots). New benchmarks like Chatbot Arena or tau-bench address this by incorporating human or simulated human interaction.
  2. I.I.D. vs. Sequential/Contextual Evaluation: Machine learning benchmarks typically assume independent and identically distributed tasks. However, real-world problems (e.g., a software engineer familiarizing with a codebase) benefit from sequential learning, long-term memory, and contextual understanding, which i.i.d. evaluation fails to capture.

The proposed game for the second half is:

  1. Develop novel evaluation setups or tasks that reflect real-world utility.
  2. Solve them using the existing recipe or augment the recipe with novel components.
  3. Continue this loop.

This shift encourages research that creates new assumptions to "break" the existing recipe, leading to truly game-changing advancements focused on building useful products and driving economic value, rather than merely incremental improvements on established benchmarks.