Diffusion LLMs - The Fastest LLMs Ever Built | Stefano Ermon, cofounder of Inception Labs

Infinite Curiosity Podcast · October 09, 2025 · 39 min
artificial-intelligence ai-infrastructure generative-ai startup investment openai anthropic nvidia

🎯 Summary

This episode of Infinite Curiosity features Stefano Ermon, co-founder of Inception Labs and Associate Professor at Stanford, discussing diffusion language models (diffusion LLMs) as a fundamental shift away from traditional autoregressive LLMs such as GPT and Claude.

1. Focus Area

The primary focus is on applying diffusion modeling, successfully used in image and video generation, to text and code generation. The discussion centers on the architectural differences, performance advantages (speed and cost efficiency), and the commercialization efforts by Inception Labs with their model, Mercury.

2. Key Technical Insights

  • Autoregressive vs. Diffusion Generation: Autoregressive models generate text sequentially, one token at a time (left-to-right), which is inherently slow due to its sequential nature. Diffusion LLMs generate the entire output in parallel by starting with a rough, noisy guess and iteratively refining it through multiple denoising steps, analogous to sculpting or outlining a document before editing.
  • Training Methodology: Diffusion LLMs are trained via denoising. Noise is intentionally introduced into clean data, and the neural network (often transformer-based) is trained to fix these mistakes. At inference, the model refines a random initial state until a high-quality output is achieved.
  • Efficiency through Parallelism: Because diffusion models modify multiple tokens simultaneously in each forward pass (step), they exploit the parallel processing capabilities of GPUs far more effectively than sequential autoregressive decoding, yielding significant speed and cost advantages. A toy sketch of the training and sampling loop follows this list.
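
To make the denoising idea concrete, here is a minimal, self-contained sketch of a masked-diffusion language model in PyTorch. Everything here (the tiny transformer, the masking schedule, the hyperparameters) is an illustrative assumption, not Inception's Mercury implementation:

```python
import torch
import torch.nn as nn

VOCAB, MASK, SEQ_LEN = 1000, 0, 32  # token id 0 is reserved as the [MASK] symbol

class TinyDenoiser(nn.Module):
    """A small transformer that predicts the clean token at every position."""
    def __init__(self, d=128):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, d)
        self.pos = nn.Embedding(SEQ_LEN, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, tokens):
        pos = torch.arange(tokens.size(1), device=tokens.device)
        return self.head(self.body(self.emb(tokens) + self.pos(pos)))

def training_step(model, clean, opt):
    """Corrupt clean text by masking a random fraction, learn to restore it."""
    rate = torch.rand(clean.size(0), 1, device=clean.device)  # noise level per example
    mask = torch.rand_like(clean, dtype=torch.float) < rate   # positions to corrupt
    noisy = torch.where(mask, torch.full_like(clean, MASK), clean)
    logits = model(noisy)                                     # (batch, seq, vocab)
    loss = nn.functional.cross_entropy(logits.transpose(1, 2), clean)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def generate(model, steps=8):
    """Start from pure noise (all masks); each step refines every position in parallel."""
    seq = torch.full((1, SEQ_LEN), MASK)
    for s in range(steps):
        conf, pred = model(seq).softmax(-1).max(-1)   # confidence + guess per position
        k = SEQ_LEN * (s + 1) // steps                # commit more tokens each step
        keep = torch.zeros_like(seq, dtype=torch.bool)
        keep.scatter_(1, conf.topk(k, dim=-1).indices, True)
        seq = torch.where(keep, pred, torch.full_like(seq, MASK))
    return seq

model = TinyDenoiser()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
batch = torch.randint(1, VOCAB, (8, SEQ_LEN))         # stand-in "clean" data
print(training_step(model, batch, opt))
print(generate(model))
```

The key contrast with autoregressive decoding: `generate` runs a fixed number of full-width forward passes (8 here) regardless of output length, committing its most confident tokens at each step, whereas an autoregressive decoder needs one pass per token.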

3. Business/Investment Angle

  • Performance Frontier Shift: Diffusion LLMs are shifting the trade-off frontier between quality, speed, and cost, offering models that match autoregressive quality while being significantly faster (potentially 10x) or cheaper to serve (a back-of-envelope comparison follows this list).
  • Engineering Hurdles: A major commercialization challenge was developing a custom serving engine optimized for diffusion inference, as existing frameworks (like vLLM) are built for autoregressive architectures.
  • Data Importance: Moving from academic research to production highlighted the critical, often underestimated, importance of high-quality, filtered training data for real-world performance.
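
A rough way to see where the speedup comes from, using illustrative assumed numbers rather than measured benchmarks:

```python
# Back-of-envelope latency comparison (all numbers are assumptions).
# An autoregressive model needs one forward pass per generated token;
# a diffusion model needs one pass per refinement step, each pass
# updating every token position at once.
TOKENS_OUT = 512       # tokens to generate (assumed)
PASS_MS_AR = 10.0      # ms per autoregressive decode step (assumed)
PASS_MS_DIFF = 40.0    # ms per diffusion step; a wider pass costs more (assumed)
DIFF_STEPS = 16        # denoising steps (assumed)

ar_ms = TOKENS_OUT * PASS_MS_AR       # 512 sequential passes
diff_ms = DIFF_STEPS * PASS_MS_DIFF   # 16 parallel-refinement passes

print(f"autoregressive: {ar_ms:.0f} ms ({TOKENS_OUT / ar_ms * 1000:.0f} tok/s)")
print(f"diffusion:      {diff_ms:.0f} ms ({TOKENS_OUT / diff_ms * 1000:.0f} tok/s)")
# Under these assumptions the diffusion path is ~8x faster end to end,
# which is the shape of the speedup described in the episode, not an exact figure.
```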

4. Notable Companies/People

  • Stefano Ermon: Co-founder of Inception Labs, pioneer in applying diffusion models to language, and the central expert in the discussion.
  • Inception Labs: Developing and commercializing Diffusion LLMs under the name Mercury.
  • Google and Baidu: Mentioned as having announced internal diffusion language model prototypes, though Inception claims to be ahead in production deployment.
  • Continue (ContinueDev): An open-source AI coding assistant that has integrated Mercury as the default model for its “next-edit” feature.

5. Future Implications

  • Dominance in Specific Niches: Diffusion LLMs are predicted to excel where speed and iterative refinement are crucial, particularly code generation and editing workflows (autocomplete, auto-edit), since code is written non-linearly, with frequent back-and-forth edits that align well with diffusion-style refinement (a small editing sketch follows this list).
  • Multimodality Potential: Ermon is highly optimistic about building truly multimodal models using diffusion, leveraging the proven success of diffusion in image/video generation alongside the new text capabilities.
  • Hardware-Agnostic Efficiency: The algorithmic improvements in diffusion LLMs let them reach speeds comparable to specialized inference chips (like Groq) on commodity Nvidia GPUs, suggesting software innovation can substitute for hardware specialization.
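
The editing use case is worth a sketch of its own. The loop below holds the surrounding code fixed and refines only a masked edit span in parallel; the denoiser is an untrained stub standing in for a real model such as Mercury (the interface and all names are assumptions for illustration):

```python
import torch

VOCAB, MASK = 1000, 0

def denoiser_logits(tokens):
    """Stub standing in for a trained model; returns random logits over the vocab."""
    return torch.randn(tokens.size(0), tokens.size(1), VOCAB)

@torch.no_grad()
def next_edit(code_tokens, edit_start, edit_end, steps=4):
    """Hold surrounding code fixed; iteratively refine only the edit span, in parallel."""
    seq = code_tokens.clone()
    editable = torch.zeros_like(seq, dtype=torch.bool)
    editable[:, edit_start:edit_end] = True     # only this span may change
    seq[editable] = MASK                        # noise the edit region
    for _ in range(steps):
        pred = denoiser_logits(seq).argmax(-1)  # re-predict conditioned on context
        seq = torch.where(editable, pred, seq)  # refine span, keep context intact
    return seq

code = torch.randint(1, VOCAB, (1, 64))         # a pretend-tokenized source file
edited = next_edit(code, edit_start=20, edit_end=36)
print(edited.shape)  # torch.Size([1, 64]); positions 20..35 were regenerated
```

Because the edit is an in-place refinement conditioned on code both before and after the span, nothing about the process is left-to-right, which is the alignment with code editing the episode emphasizes.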

6. Target Audience

This episode is highly valuable for AI/ML Engineers, Research Scientists, CTOs, and AI Product Managers interested in next-generation LLM architectures, inference optimization, and competitive advantages in the rapidly evolving generative AI landscape.

🏢 Companies & Models Mentioned

Inception Labs (Mercury)
Google (Gemini Diffusion)
Baidu
Nvidia
Groq
Continue (ContinueDev)
Copilot Arena
GPT-2
Stable Diffusion
Generative Adversarial Networks

💬 Key Insights

"The moment you change the inference algorithm to something that is fundamentally more parallel, that changes the way you build GPUs, the number of FLOPs that you need, and then kind of a mature line of memory bandwidth and the amounts of memory that you need. So the game changes, and so it's going to be exciting to think about these effects on the whole stack from hardware all the way to the applications."
Impact Score: 10
"I wouldn't be surprised, and I think it's very likely that diffusion models will become the default for a lot of the coding use cases in the next 12 to 18 months."
Impact Score: 10
"If you think about code, I think it's a lot less left-to-right. If you think about a big codebase, there's not necessarily an order between the files, and even the way you write code involves a lot more back and forth and edits and changes. And so that was actually one of the reasons we started out with code. We felt like that's going to be a space where diffusion could be really, really good."
Impact Score: 10
"What we are seeing is that we are paradigm-shifting what's possible with autoregressive models. And so for the same speed, we can reduce the cost, or for the same cost, we can increase the speed. And so we are excited about this technology because eventually it's all going to be an inference game."
Impact Score: 10
"existing diffusion language models are not capable of reasoning. That's one thing that has enabled the autoregressive models to drastically improve the quality of the answers."
Impact Score: 10
"It surprised me how important data was. I mean, I knew that data was important. I've heard about it. But in the academic setting, we were always able to get around it and just use standard datasets... But then in the real world, the kind of data that we use for training the model, the quality of the data, the filtering that we do is really, really important."
Impact Score: 10

📊 Topics

#artificialintelligence 57 #aiinfrastructure 22 #generativeai 11 #investment 3 #startup 3

🧠 Key Takeaways

💡 Major hardware providers are already thinking about how hardware should evolve and change to better support these kinds of workloads.

Generated: October 09, 2025 at 08:10 PM