Recurrence and Attention for Long-Context Transformers with Jacob Buckman - #750

Unknown Source · October 07, 2025 · 57 min
artificial-intelligence ai-infrastructure startup investment google nvidia
41 Companies
96 Key Quotes
4 Topics
2 Insights

🎯 Summary

Podcast Episode Summary: Recurrence and Attention for Long-Context Transformers with Jacob Buckman - #750

This episode of the TWIML AI Podcast, hosted by Sam Charrington, features Jacob Buckman, co-founder and CEO of Manifest AI, discussing the central challenge of achieving effective long context in Transformer models and introducing Manifest AI's proposed solution, the Power Retention architecture.

1. Focus Area

The primary focus is on overcoming the computational bottlenecks of standard Transformer attention when processing extremely long input sequences (long context). The discussion centers on architectural innovations that aim for compute-optimal scaling by balancing all axes of scale, with particular focus on the context axis via a mechanism called Retention.

2. Key Technical Insights

  • Context as an Axis of Scale: Effective long context is framed as another crucial axis of scaling, analogous to parameters or dataset size. Improvements in context utilization lead to measurable, log-linear improvements in pre-training negative log-likelihood, suggesting that better context handling directly translates to better predictive ability.
  • The Power of Retention (Recurrence + Attention): Retention is presented as a family of architectures (related to Mamba 2 and DeltaNet) that mathematically unifies recurrent and attention-based computation. It allows for computation to be expressed either sequentially (recurrent form, linear cost, fixed state size) or in parallel (attention form, quadratic cost, GPU-friendly).
  • Chunking for Optimal Performance: The key innovation is the chunked algorithm for Retention. Long sequences are broken into small chunks; attention is used within each chunk for hardware efficiency (parallel matrix multiplication), and recurrence carries information between chunks. This yields the parallelization benefits of attention while keeping the overall cost linear in context length, since the state passed between chunks has a fixed size; a minimal sketch follows this list.
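
To make the chunked pattern concrete, below is a minimal NumPy sketch of chunked causal linear attention, assuming an identity feature map and no normalization. It illustrates the attention-within-chunks / recurrence-between-chunks structure discussed in the episode, not Manifest AI's actual Power Retention kernels.

```python
import numpy as np

def chunked_linear_attention(Q, K, V, chunk_size=64):
    """Causal linear attention computed chunk by chunk.

    Inside each chunk a small quadratic attention matrix is formed
    (parallel, matmul-friendly); between chunks a fixed-size state S
    carries the contribution of all earlier tokens, so the total cost
    stays linear in sequence length. Shapes: Q, K, V are (seq_len, dim).
    """
    seq_len, dim = Q.shape
    S = np.zeros((dim, dim))                       # fixed-size recurrent state
    out = np.zeros_like(V)
    for start in range(0, seq_len, chunk_size):
        q = Q[start:start + chunk_size]
        k = K[start:start + chunk_size]
        v = V[start:start + chunk_size]
        # attention form within the chunk: causal q·kᵀ, quadratic only in chunk_size
        intra = np.tril(q @ k.T) @ v
        # recurrent form across chunks: all earlier chunks summarized in S
        inter = q @ S
        out[start:start + chunk_size] = intra + inter
        # update the state with this chunk's keys and values
        S += k.T @ v
    return out
```

Because the chunk size is a constant chosen to fit the hardware, the quadratic within-chunk work stays bounded, and the cross-chunk work is a fixed-size state update per chunk.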

3. Business/Investment Angle

  • Context Utility vs. Context Length: There is a growing realization that simply having a large context window (e.g., millions of tokens) does not guarantee utility. Value is derived from the model's ability to effectively use that context, which requires architectural improvements.
  • The Bottleneck of Context Scaling: The high cost associated with scaling context length is identified as a major current bottleneck preventing models from reaching their full potential across various domains (text, video, etc.). Companies solving this problem, like Manifest AI, are targeting a fundamental architectural limitation.
  • Unifying Architectural Design: The discussion provides a framework for understanding various long-context solutions (such as Mamba, windowed attention, and GQA) by analyzing how they reduce the size or growth rate of the model's "state" (analogous to the KV cache in Transformers). This offers strategic insight into architectural trade-offs; a back-of-the-envelope size comparison follows this list.
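
As a rough illustration of that state-size framing, the sketch below compares a Transformer KV cache, which grows linearly with context length, against a fixed-size recurrent state. The model dimensions (32 layers, 8 KV heads, head dimension 128, fp16) are illustrative assumptions, not figures from the episode.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # K and V tensors per layer; grows linearly with the number of cached tokens
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def fixed_state_bytes(state_dim=128, n_layers=32, n_heads=8, bytes_per_elem=2):
    # a retention/recurrent-style state is constant regardless of context length
    return n_layers * n_heads * state_dim * state_dim * bytes_per_elem

for ctx in (8_192, 131_072, 1_048_576):
    print(f"{ctx:>9} tokens: KV cache ≈ {kv_cache_bytes(ctx) / 1e9:6.1f} GB, "
          f"fixed state ≈ {fixed_state_bytes() / 1e9:.2f} GB")
```

Under these assumptions the KV cache exceeds a hundred gigabytes at around a million tokens, while the fixed state remains a few megabytes, which is the cost asymmetry the episode frames as the context bottleneck.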

4. Notable Companies/People

  • Jacob Buckman (Manifest AI): Co-founder and CEO, presenting the research on the Power Retention architecture.
  • Sam Charrington (Host): Host of the TWIML AI Podcast.
  • Mila / Google Brain: Institutions where Jacob Buckman conducted prior research in deep learning and RL.
  • Mamba: Mentioned as a prominent example of a state-space model that falls under the broader "retention" umbrella.

5. Future Implications

The industry is moving toward architectures that achieve compute optimality by balancing all axes of scale, not just parameters. The future of long-context models likely involves hybrid architectures like chunked Retention, which leverage hardware capabilities (parallel matrix multiplication) while maintaining linear scaling costs, potentially making extremely long-context processing economically viable.

6. Target Audience

This episode is highly valuable for AI/ML Researchers, Deep Learning Engineers, and Technical Product Managers involved in building or optimizing large language models, especially those focused on efficiency, scaling laws, and novel sequence modeling architectures.

🏢 Companies Mentioned

ThreeDot ✅ ai_infrastructure
DeepSeek ✅ ai_application
Chris Lattner ✅ unknown
CuTe Layout ✅ unknown
Power Retention ✅ unknown
Python JIT ✅ unknown
Grouped Query Attention ✅ unknown
Google Brain ✅ unknown
Carnegie Mellon ✅ unknown

💬 Key Insights

"Yeah, but you can just change. So there are some parameters like the RoPE, the rotational positional embedding parameters, that do sort of fix the context length, but you can work around it."
Impact Score: 10
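
The quote does not specify the workaround; one widely used option is position interpolation, where positions are rescaled so that a longer sequence reuses the rotation-angle range the model was trained on. The sketch below is a generic illustration with made-up dimensions, not necessarily the approach Buckman had in mind.

```python
import numpy as np

def rope_angles(positions, dim=64, base=10_000.0, scale=1.0):
    """Rotation angles used by rotary positional embeddings (RoPE).

    `base` and `scale` are the usual knobs for stretching context:
    raising `base` slows the rotation frequencies, while `scale` < 1
    (position interpolation) squeezes long positions back into the
    range seen during training.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions * scale, inv_freq)      # (n_positions, dim/2)

# Trained on 4k positions; reuse the same angle range at 16k by interpolating.
short = rope_angles(np.arange(4_096))
long_interp = rope_angles(np.arange(16_384), scale=4_096 / 16_384)
print(short.max(), long_interp.max())   # ≈ 4095 vs ≈ 4096: same trained range
```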
"If you download the StarCoder weights and download the StarCoder architecture, but then replace the call to FlashAttention in the StarCoder architecture with a call to Power Retention, you now have a new model with weights that are equivalent to the StarCoder weights."
Impact Score: 10
"ining run like any other with the Power Retention architecture, but instead of initializing randomly, initialize to the known good StarCoder weights."
Impact Score: 10
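
A hedged sketch of the weight-transfer ("metamorphosis") idea in the two quotes above: keep the pretrained projection weights and swap only the token-mixing operator, then continue training from the known-good checkpoint rather than a random initialization. The `linear_mix` stand-in, module names, and the commented-out builder and checkpoint path are hypothetical; the real Power Retention operator is not reproduced here.

```python
import torch
import torch.nn as nn

def linear_mix(q, k, v):
    # toy stand-in for a retention-style kernel (unnormalized causal linear
    # attention); a real Power Retention op would be dropped in here instead
    scores = torch.tril(q @ k.transpose(-2, -1))
    return scores @ v

class SequenceMixer(nn.Module):
    """A block whose QKV and output projections keep the pretrained weights;
    only the token-mixing function in the middle is swapped out."""
    def __init__(self, d_model, mix_fn=linear_mix):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.mix_fn = mix_fn

    def forward(self, x):                              # x: (batch, seq, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        return self.out(self.mix_fn(q, k, v))

# "Metamorphosis": rebuild the model around the new mixing op, load the
# known-good pretrained weights instead of a random init, then run an
# otherwise ordinary training job (names below are hypothetical).
# model = build_starcoder_like_model(mix_fn=power_retention_op)
# model.load_state_dict(torch.load("starcoder_weights.pt"), strict=False)
```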
"But where we get the huge speedups is on the problem shapes that are less typical, like if you want to have some weird number of tokens in your sequence length. For example, if you're doing pre-fill on a real document, that document isn't going to have a nice even 1,024 token length. It's going to have some crazy number."
Impact Score: 10
"one is this separation of static and dynamic computation, where you at static time, like at compile time, learn everything you can about what memory is moving where for any given configuration."
Impact Score: 10
"usually think about going up a level of abstraction as introducing inefficiencies, but in this case, it actually allows us to search the space of possible low-level implementations to get the best one."
Impact Score: 10

📊 Topics

#artificialintelligence 74 #aiinfrastructure 32 #investment 2 #startup 2

🧠 Key Takeaways

🤖 Processed with true analysis

Generated: October 08, 2025 at 03:25 AM