Richard Sutton – Father of RL thinks LLMs are a dead end
🎯 Summary
This 66-minute podcast features an in-depth discussion with Richard Sutton, Turing Award laureate and a foundational figure in Reinforcement Learning (RL). The episode focuses primarily on his critique of Large Language Models (LLMs) and his advocacy for the RL paradigm as the true path to general intelligence.
1. Focus Area
The discussion centers on the fundamental philosophical and technical differences between the Reinforcement Learning (RL) paradigm (learning from experience, action, and reward) and the Large Language Model (LLM) paradigm (learning via next-token prediction based on massive static datasets). The core theme is whether LLMs possess genuine “world models” or “goals,” and whether imitation learning can serve as a viable foundation for continual, experiential learning.
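The contrast can be made concrete by writing the two training signals side by side (a standard textbook formulation, not notation used in the episode): the LLM objective maximizes the likelihood of the next token in a fixed corpus, while the RL objective maximizes expected cumulative reward under the agent's own behavior.

$$\max_\theta \; \sum_t \log p_\theta(x_t \mid x_{<t}) \;\; \text{(LLM)} \qquad \text{vs.} \qquad \max_\pi \; \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] \;\; \text{(RL)}$$

The first objective never references the consequences of an action; the second is defined by nothing else, which is precisely the "ground truth" distinction Sutton draws.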
2. Key Technical Insights
- The Necessity of Goals and Ground Truth: Sutton argues that intelligence requires a goal, which defines “right” and “wrong” actions (i.e., reward). LLMs, trained purely on next-token prediction over text, have no substantive goal of influencing the external world, and therefore no ground truth against which to validate prior knowledge or drive meaningful continual learning.
- Experience vs. Imitation: True learning, as seen in nature and in RL, is defined by the stream of experience (action → sensation → reward). LLMs learn from static human output (imitation/supervised learning), which Sutton contends is fundamentally different from learning from the consequences of one’s own actions in the world (see the sketch after this list).
- The “Bitter Lesson” Re-evaluated: While LLMs leverage massive computation (fitting the “Bitter Lesson”), Sutton predicts they will ultimately be superseded by systems that learn purely from experience, just as previous methods relying heavily on human-engineered knowledge were superseded by scalable, general methods.
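The experiential loop in the second point fits in a few lines of code. Below is a minimal sketch, assuming a toy three-armed bandit environment; the arm probabilities, exploration rate, and step size are illustrative choices, not details from the episode. The agent improves its estimates using nothing but the rewards its own actions produce, with no static dataset anywhere.

```python
import random

# Hidden ground truth of the world: reward probability of each arm.
TRUE_REWARD_PROBS = [0.2, 0.5, 0.8]
EPSILON = 0.1      # exploration rate
STEP_SIZE = 0.1    # learning rate for incremental value updates

def pull(arm):
    """The environment's response: a stochastic reward for the chosen action."""
    return 1.0 if random.random() < TRUE_REWARD_PROBS[arm] else 0.0

values = [0.0, 0.0, 0.0]  # the agent's estimates, learned online

for step in range(10_000):
    # Act: mostly exploit current estimates, occasionally explore.
    if random.random() < EPSILON:
        arm = random.randrange(len(values))
    else:
        arm = max(range(len(values)), key=lambda a: values[a])

    # Sense the consequence and update toward it: reward is the ground truth.
    reward = pull(arm)
    values[arm] += STEP_SIZE * (reward - values[arm])

print(values)  # estimates hover near [0.2, 0.5, 0.8], learned from reward alone
```

There is no imitation target in this loop: the agent's knowledge is validated only by what the world returns when it acts, which is the property Sutton argues next-token prediction lacks.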
3. Business/Investment Angle
- RL Environment Bottleneck: The transition to experiential AI requires complex, messy, real-world RL environments. Building these environments is difficult (it demands deep subject-matter expertise to model real-world subtleties such as changing data states), making them both a significant current barrier and a potential opportunity for specialized service providers.
- LLMs as a Temporary Scaffold: The current investment wave in LLMs might be seen as a temporary scaffold—a large initial injection of human knowledge—but the truly scalable, superior systems will likely emerge from the RL/experiential paradigm, potentially rendering the LLM foundation obsolete for achieving AGI.
- Continual Learning Advantage: Digital intelligence offers the potential to aggregate knowledge across instances (unlike human children who start anew), suggesting that once the experiential paradigm is dominant, scaling knowledge transfer will be a massive advantage over current data-limited approaches.
4. Notable Companies/People
- Richard Sutton: Central figure, advocate for RL as the foundation of intelligence.
- Alan Turing: Quoted regarding the goal of a machine that can “learn from experience.”
- John McCarthy: Quoted defining intelligence as the computational ability to achieve goals.
- Joseph Henrich: Mentioned for his anthropological view that human cultural knowledge transmission heavily relies on imitation, a point Sutton acknowledges but subordinates to basic trial-and-error learning.
- Labelbox: Mentioned in an embedded segment regarding the difficulty and necessity of building high-fidelity, complex simulation environments for training RL agents (e.g., simulating dynamic e-commerce storefronts).
5. Future Implications
The industry is currently caught in a “fashion” driven by LLMs, which Sutton believes is a detour. The future, according to this perspective, lies in building systems capable of general, continual learning from interaction. This requires a fundamental architectural shift back toward the RL framework, where agents actively explore, form predictions about the world’s response to their actions, and update based on surprise (reward/error signals).
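That “update based on surprise” mechanism is what temporal-difference (TD) learning, which Sutton pioneered, formalizes: the agent predicts the value of each state and adjusts each prediction by the error between what it expected and what actually followed. Below is a minimal sketch; the five-state chain environment, discount factor, and step size are illustrative assumptions, not details from the episode.

```python
import random

N_STATES = 5   # simple chain: states 0..4; reaching state 4 yields reward and ends
GAMMA = 0.9    # discount factor
ALPHA = 0.1    # step size

V = [0.0] * N_STATES  # predictions of future reward from each state

for episode in range(5_000):
    s = 0
    while s < N_STATES - 1:
        s_next = s + random.choice([0, 1])         # drift right at random
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        surprise = r + GAMMA * V[s_next] - V[s]    # TD error: outcome vs. prediction
        V[s] += ALPHA * surprise                   # update only when surprised
        s = s_next

print([round(v, 2) for v in V])  # predictions grow as states near the reward
```

When the world behaves as predicted, the surprise term is near zero and nothing changes; learning is driven entirely by the mismatch between expectation and experience.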
6. Target Audience
This episode is highly valuable for AI Researchers, Machine Learning Engineers, and Technology Strategists who are evaluating the long-term viability of current generative AI trends versus foundational AI paradigms like Reinforcement Learning.
🏢 Companies Mentioned
- Labelbox
💬 Key Insights
"And number four is that once it's inevitable over time that the most intelligent things around would gain resources and power."
"And number three, we won't stop just with human-level intelligence. We will get super intelligence."
"I do think succession to digital or digital intelligence or augmented humans is inevitable."
"A big question, a big issue will become corruption. If you really could just get information from anywhere and bring it into your central mind, you could become more and more powerful. And it's all digital... you can lose your mind this way. If you pull in something from the outside and build it into your inner thinking, it could take over you. It could change you. It could be your destruction, rather than your increment in knowledge."
"The bitter lesson is not saying necessarily that human artisanal researcher tuning doesn't work, but that it obviously scales much worse than compute, which is growing exponentially."
"In the old days, it was interesting because things like search and learning were called weak methods because they're just, these use general principles are not using the power that comes from imbuing a system with human knowledge. So those were called strong. And so I think the weak methods have just totally won."