Dataflow Computing for AI Inference with Kunle Olukotun - #751
🎯 Summary
This episode features Sam Charrington interviewing Kunle Olukotun (Professor at Stanford and Co-founder/Chief Technologist at SambaNova Systems) about the evolution of computer architecture, focusing specifically on reconfigurable dataflow architectures and their application to accelerating large-scale AI inference.
1. Focus Area
The primary focus is on Dataflow Computing for AI Inference. The discussion centers on how architectures that directly map the computational graph of ML models (derived from frameworks like PyTorch) can overcome the inherent bottlenecks of traditional instruction-based architectures (like GPUs) when running massive models (trillions of parameters) and agentic workloads.
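To make the "map the computational graph directly" idea concrete, here is a minimal, hedged sketch (not SambaNova's toolchain; `TinyBlock` is a made-up module) showing how a PyTorch model's graph can be extracted with `torch.fx`. A dataflow compiler lowers a graph like this onto hardware units rather than executing it as a stream of instructions.

```python
# Minimal sketch: extract the dataflow graph of a toy PyTorch module with torch.fx.
# A dataflow compiler maps the nodes of such a graph onto on-chip compute units
# instead of fetching and executing instructions one at a time.
import torch
import torch.nn as nn
from torch.fx import symbolic_trace

class TinyBlock(nn.Module):          # hypothetical stand-in for a model layer
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.proj(x))

traced = symbolic_trace(TinyBlock())
print(traced.graph)                  # nodes: placeholder -> linear -> gelu -> output
```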
2. Key Technical Insights
- Reconfigurable Dataflow Architecture: This paradigm shifts from fetching sequential instructions to configuring hardware (coarse-grained tensor units, known as Reconfigurable Dataflow Units, or RDUs) to directly match the model's dataflow graph. This eliminates the need for complex synchronization mechanisms such as locks and shared memory, relying instead on dataflow tags/tokens for low-overhead communication.
- HBM Bandwidth Optimization via Fusion: The architecture is specifically designed to minimize reliance on external memory bandwidth (HBM), the primary bottleneck in LLM inference. Because large sections of the model (e.g., an entire decoder block, going beyond techniques like FlashAttention) are fused onto the chip fabric, intermediate data stays local, and the critical HBM interface can be driven to roughly 90% utilization, significantly higher than GPUs achieve (a back-of-envelope sketch of this effect follows this list).
- Asynchronous Execution and Latency Management: The dataflow approach naturally supports extreme asynchrony, allowing computation and memory access to be heavily overlapped. The result is a better position on the latency-throughput trade-off: significantly lower latency than GPUs even at high batch sizes, where serialization latency otherwise dominates.
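As a rough illustration of the fusion point above, the sketch below uses made-up sizes (batch, sequence length, hidden dimension, and intermediate count are assumptions, not figures from the episode) to show why keeping a decoder block's intermediates on-chip cuts HBM traffic.

```python
# Back-of-envelope sketch with assumed sizes: activation traffic over HBM for one
# decoder block when every intermediate round-trips through HBM (unfused kernels)
# versus when the whole block is fused on-chip and only its input/output cross HBM.
BYTES_PER_ELEM = 2                   # bf16/fp16
batch, seq_len, d_model = 8, 4096, 8192
n_intermediates = 12                 # assumed activation tensors inside the block

act_bytes = batch * seq_len * d_model * BYTES_PER_ELEM

unfused_hbm = 2 * n_intermediates * act_bytes   # write out + read back each intermediate
fused_hbm = 2 * act_bytes                       # only the block's input and output move

print(f"unfused: {unfused_hbm / 1e9:6.1f} GB of activation traffic per block")
print(f"fused:   {fused_hbm / 1e9:6.1f} GB of activation traffic per block")
print(f"reduction: {unfused_hbm / fused_hbm:.0f}x less HBM pressure")
```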
3. Business/Investment Angle
- Inference as the Bottleneck: The conversation highlights that fast, energy-efficient inference for trillion-parameter models is the critical commercial challenge in the current AI landscape.
- Multi-Model Serving Efficiency: SambaNova's architecture, leveraging a high-capacity DDR memory tier (1.5TB), allows a system to hold multiple large models simultaneously and switch between them in about a millisecond. This is crucial for multi-tenancy and for serving specialized, fine-tuned models efficiently (a toy sketch of this serving pattern follows this list).
- Domain-Specific Compiler Advantage: While the architecture is specialized, the compiler abstracts complexity. Since most modern models are transformer-based, adapting to new models (like DeepSeek) takes engineers only about a week, using Python-based descriptions rather than low-level CUDA programming.
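The model-switching point above lends itself to a toy sketch. Everything here is hypothetical (class names, per-system capacity, and model sizes are assumptions for illustration, not SambaNova's software stack): the idea is simply that when many models stay resident in a large memory tier, "switching" is closer to a pointer change than a multi-second weight reload.

```python
# Toy sketch (hypothetical API): many models resident in a large DDR tier,
# with per-request switching of the active model instead of per-deployment loading.
from dataclasses import dataclass, field

@dataclass
class ResidentModel:
    name: str
    weights_gb: float                     # footprint in the high-capacity tier

@dataclass
class MultiModelServer:
    capacity_gb: float = 1536.0           # ~1.5 TB DDR tier cited in the episode
    resident: dict = field(default_factory=dict)
    active: str | None = None

    def load(self, model: ResidentModel) -> None:
        used = sum(m.weights_gb for m in self.resident.values())
        if used + model.weights_gb > self.capacity_gb:
            raise MemoryError("would exceed the DDR capacity of this system")
        self.resident[model.name] = model

    def serve(self, name: str, prompt: str) -> str:
        if self.active != name:           # switch is ~1 ms on the hardware described,
            self.active = name            # not a full weight reload from disk
        return f"[{name}] response to: {prompt!r}"

server = MultiModelServer()
server.load(ResidentModel("general-llm", 140.0))      # made-up model names and sizes
server.load(ResidentModel("finetuned-agent", 90.0))
print(server.serve("general-llm", "plan my week"))
print(server.serve("finetuned-agent", "book the meetings"))
```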
4. Notable Companies/People
- Kunle Olukotun: Pioneer in multicore architecture and parallel programming, now leading the charge in dataflow hardware at SambaNova.
- SambaNova Systems: The company implementing these concepts, currently shipping the SN40L chip (100B transistors, 5nm), which features three memory tiers including high-capacity DDR.
- Jim Smith (Cray Architect): Quoted to establish the principle: "If you have a vector problem, build a vector computer." Applied here: if you have a dataflow problem, build a dataflow computer.
5. Future Implications
The conversation suggests a future where hardware architecture is increasingly tightly coupled to the computational graph of the dominant ML workloads (transformers). The industry is moving away from general-purpose instruction-set architectures (ISAs) for AI acceleration toward highly specialized, reconfigurable fabrics that prioritize data locality and asynchronous execution to solve the memory wall problem inherent in scaling LLMs.
6. Target Audience
AI/ML Infrastructure Engineers, Computer Architects, Hardware Designers, and Technology Strategists focused on optimizing large-scale LLM deployment and next-generation accelerator design.
🏢 Companies Mentioned
💬 Key Insights
"And so we created an agentic system for doing this and an adaptive self-improving loop around that in order to implement the solution."
"The LLM is not going to be very good at doing this because it doesn't have any training examples, right? This is a brand new architecture."
"Even from SambaNova's point of view, the most difficult part of the whole endeavor has been delivering the compiler infrastructure for our systems."
"So that's where we're going from an architecture point of view. So, can you take the fundamental benefits of reconfigurable data flow architectures and put the "D" in front of it? Can you make them dynamic, right, with low overhead..."
"the models are fundamentally becoming more dynamic, right? So you're getting these mixture-of-experts style of models. You're getting environments where you've got multiple users with different context lengths... You are getting models like graph-based neural nets that have a kind of dynamic data access pattern."
"Then the other use cases are sort of these agentic systems that require a large number of models to coexist at the same time, right? And then using this model-switching capability, we can support that with far fewer resources than you would be required if you had to put each of these models on a separate GPU-based system."