The Father of Transformers, Łukasz Kaiser: Inference is Now Layer 1

2026.01.23
YouTube · by 이호민
#Transformer #LLM #Reasoning #AI #OpenAI

Key Points

  1. Łukasz Kaiser, a co-author of the "Attention Is All You Need" paper, identifies a new "reasoning paradigm" in AI that fundamentally differs from traditional LLMs, learning from orders of magnitude less data and enabling more sophisticated, iterative thinking.
  2. This nascent paradigm is poised for a steep path of improvement, accelerating scientific discovery by allowing models to execute complex ideas and aiding researchers in tasks like coding and experiment management.
  3. Despite the ultimate bottleneck of GPU and energy resources, Kaiser dismisses fears of an "AI winter" and foresees continued rapid progress, with significant advancements expected in the near future.

The interview discusses the evolution and future of Artificial Intelligence, particularly Large Language Models (LLMs), from the perspective of a researcher who was part of the "Attention Is All You Need" paper team and is currently involved in advanced AI research at OpenAI.

The "Attention Is All You Need" paper, published in 2017, is highlighted as a foundational and iconic work that introduced the Transformer paradigm, which underpins modern LLMs. The interviewee emphasizes its significance as a starting point for much of the public's awareness of current AI capabilities.

A central theme is the distinction between "old-style LLMs" and a new class of "reasoning models." Old-Style LLMs: These models operate by predicting the next word, trained on vast amounts of general internet data. The interviewee suggests that this paradigm of simply scaling up next-word prediction on ever more internet data is reaching a plateau, since much of the readily available data has already been consumed. Larger models in this paradigm still perform better (following "scaling laws"), but their core mechanism remains pattern matching and output imitation.
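The "scaling laws" referenced here are commonly written as a power law in model size; one representative form from the scaling-law literature (an illustration, not a formula given in the interview) is:

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
```

where $L$ is the model's loss, $N$ the parameter count, and $N_c$, $\alpha_N$ empirically fitted constants. Loss falls smoothly but with diminishing returns as $N$ grows, which is consistent with the plateau the interviewee describes for pure next-word-prediction scaling.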

Reasoning Models: These are presented as a "fundamentally very different" and "new paradigm" that is only just beginning its "steep path up." Unlike old LLMs, reasoning models are trained not just to imitate an output but to learn the *process* of reaching that output, akin to human reasoning. The core methodology involves:

  1. Chain of Thought (CoT): Initially, this was a prompting technique where models were instructed to "think step by step." This showed early promise in eliciting intermediate thought processes.
  2. Reinforcement Learning (RL): The significant breakthrough for reasoning models comes from training them with RL. Unlike traditional gradient descent, which can train from random weights, RL for reasoning requires a "prior that knows a little bit already about how to think." This "finicky" training method allows models to:
    • Self-Correction: If an error is made, the model learns to "go back and start from scratch" and try again.
    • Deeper Exploration: It considers things for much longer, exploring different paths and checking for consistency.
    • Tool Use: Reasoning models learn to call external tools (e.g., search, specific APIs) to gather information, verify facts, and resolve discrepancies.
The training signal for this complex behavior is surprisingly simple: the need to get the correct answer. This RL-based training enables the model to learn the "latent thinking" process.
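As a loose illustration of how simple that training signal can be, here is a minimal sketch (all function names are hypothetical; the actual OpenAI training setup is not public): a chain-of-thought prompt plus a binary reward that checks only the final answer, leaving the intermediate reasoning unconstrained so the model is free to backtrack, retry, or call tools.

```python
def make_cot_prompt(question: str) -> str:
    """Wrap a question in a chain-of-thought instruction,
    eliciting intermediate reasoning before the final answer."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then give the final answer "
        "on a line starting with 'Answer:'."
    )

def extract_answer(model_output: str) -> str:
    """Pull the final 'Answer:' line out of the model's reasoning trace."""
    for line in reversed(model_output.splitlines()):
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return ""

def reward(model_output: str, correct_answer: str) -> float:
    """Binary RL signal: 1.0 only if the final answer is correct.
    The reasoning steps themselves are never graded directly."""
    return 1.0 if extract_answer(model_output) == correct_answer else 0.0

# Example trace a reasoning model might produce:
trace = (
    "17 * 24 = 17 * 20 + 17 * 4.\n"
    "17 * 20 = 340, 17 * 4 = 68, so 340 + 68 = 408.\n"
    "Answer: 408"
)
print(reward(trace, "408"))  # 1.0
```

The key design point mirrored here is that only the outcome is scored; everything between the prompt and the final line is "latent thinking" that the model discovers for itself under RL.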

Data Efficiency and Generalization: A crucial characteristic of reasoning models is their ability to learn from "another order of magnitude less data" compared to old LLMs. For instance, they learn complex mathematics from a "tiny" dataset relative to the internet. This reduced data requirement is a "huge change" and implies better generalization to unseen problems, as the models are learning underlying principles rather than just memorizing patterns. This directly addresses critiques against old LLMs, such as those by Richard Sutton, who argued that LLMs don't truly "reason" because they only imitate output. The interviewee counters that reasoning models *do* imitate the underlying actions and processes of thought.

Bottlenecks and Progress: The ultimate bottlenecks to AI progress are identified as "GPUs and energy." Despite ongoing efforts to secure massive compute resources (e.g., OpenAI's Stargate project, Nvidia partnership), the demand for GPUs far outstrips supply, limiting the number of parallel experiments and the scale of models that can be trained. However, the interviewee remains optimistic, dismissing concerns of an "AI winter" and predicting "sharp improvement in the next year or two." This progress will come from:

  • Continued scaling: Even the "old paradigm" of bigger models will still yield benefits when combined with reasoning.
  • Research in the new paradigm: The steep ascent of reasoning models, with much more research to be done on scaling them up and refining their methods.
  • AI accelerating AI: Reasoning models are already being used to generate "synthetic data" for training new models, which is proving more effective than traditional data. AI is also assisting researchers directly by accelerating coding and experiment scheduling, thereby reducing "technical drudgery."

Creativity and Intelligence Explosion: The discussion touches on whether AI can become truly creative and accelerate scientific discovery. The interviewee suggests that AI, particularly reasoning models with tool access (and human collaboration), can "speed up science" significantly, because a major bottleneck in science is not just generating ideas but executing and testing them. If models can automate much of the execution and experimental process, researchers can focus on generating and refining ideas, leading to much more rapid progress.

The concept of an "intelligence explosion" (singularity) is viewed more cautiously: while external observers might perceive rapid, explosive progress, the reality for researchers is a continuous series of "hard work" and incremental breakthroughs, where bottlenecks shift from one area to another (e.g., from programming to compute). The overall confidence stems from the historical trend of finding solutions to current challenges, often by building better software and tools, with AI itself playing an increasing role in that development.