Distilling Transformers and Diffusion Models for Robust Edge Use Cases with Fatih Porikli - #738
🎯 Summary
This episode of the TWIML AI Podcast features Fatih Porikli, Senior Director of Technology at Qualcomm, and focuses on two of Qualcomm's CVPR research papers: DiMA (Distilling Multi-modal Large Language Models for Autonomous Driving) and SharpDepth (diffusion distillation for depth estimation). The theme unifying both discussions is the critical role of efficiency and distillation in deploying advanced AI models, particularly Large Language Models (LLMs), on resource-constrained edge devices such as autonomous vehicles.
1. Focus Area
The primary focus is on advancing Autonomous Driving (AD) systems by integrating Multi-Modal Large Language Models (LLMs) into end-to-end planning architectures, specifically addressing the challenge of robustness in long-tail, rare scenarios. Secondary focus includes the application of diffusion model distillation for high-fidelity perception tasks like depth estimation.
2. Key Technical Insights
- DiMA Architecture: DiMA proposes an end-to-end AD solution that leverages the world knowledge and semantic reasoning capabilities of LLMs. Vision-based low-level perception (projected into a bird's-eye view) and vehicle/agent state information are tokenized and fed into the LLM-based planner.
- LLM Role as Regularizer: The LLM's world-knowledge representation acts as a powerful regularizer, allowing the system to generalize to rare, long-tail scenarios without explicit training data for every event and significantly improving robustness over purely modular systems.
- Distillation for Efficiency: To overcome the latency of running full LLMs on edge hardware, DiMA employs distillation. A smaller, efficient transformer-based model (the student) is trained to approximate the outputs (updated tokens) of the larger LLM teacher, achieving significant speedups while retaining high performance. Surrogate tasks such as trajectory prediction and Visual Question Answering (VQA) are used during distillation so the student captures the necessary spatial-temporal dynamics and semantic grounding (see the sketch below).
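A minimal PyTorch sketch of what such a teacher-student setup could look like. All module names, dimensions, and the loss weighting here are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of DiMA-style token distillation with a surrogate
# trajectory task; shapes and names are illustrative assumptions only.
import torch
import torch.nn as nn

class StudentPlanner(nn.Module):
    """Small transformer trained to mimic the LLM teacher's updated tokens."""
    def __init__(self, d_model=256, n_heads=8, n_layers=4, horizon=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.horizon = horizon
        self.traj_head = nn.Linear(d_model, horizon * 2)  # (x, y) waypoints

    def forward(self, scene_tokens):
        tokens = self.encoder(scene_tokens)       # refined scene tokens
        plan = self.traj_head(tokens[:, 0])       # plan from a query token
        return tokens, plan.view(-1, self.horizon, 2)

def distillation_loss(student_tokens, teacher_tokens, pred_traj, gt_traj, alpha=0.5):
    # Match the frozen teacher's updated tokens (feature distillation) while
    # supervising the surrogate trajectory-prediction task at the same time.
    feat_loss = nn.functional.mse_loss(student_tokens, teacher_tokens)
    traj_loss = nn.functional.l1_loss(pred_traj, gt_traj)
    return alpha * feat_loss + (1 - alpha) * traj_loss

# Toy usage: BEV perception + ego/agent state already tokenized to (B, N, d).
student = StudentPlanner()
scene = torch.randn(2, 64, 256)           # tokenized BEV + agent-state features
teacher_tokens = torch.randn(2, 64, 256)  # stand-in for frozen LLM teacher output
gt_traj = torch.randn(2, 6, 2)            # ground-truth waypoints
tokens, traj = student(scene)
loss = distillation_loss(tokens, teacher_tokens, traj, gt_traj)
loss.backward()
```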
3. Business/Investment Angle
- End-to-End Superiority: End-to-end systems like DiMA are setting a new state of the art, demonstrating large KPI improvements (e.g., an 80% reduction in collision rate relative to prior end-to-end baselines such as VAD) over traditional modular AD stacks.
- Edge Deployment Viability: The success of distillation proves that complex, knowledge-rich models (LLMs) can be made efficient enough to run faster than real-time on Qualcomm accelerators, making advanced AI practical for mass-market automotive deployment.
- Semantic Interpretability as a Feature: Unlike modular systems, where interpretability is often limited to component outputs, DiMA's LLM integration provides semantic interpretability: the system can explain why it is braking or changing lanes (e.g., "slowing down due to congestion"), which is crucial for trust and safety validation.
4. Notable Companies/People
- Fatih Porikli (Qualcomm): Senior Director of Technology, presenting Qualcomm’s latest research in efficient, robust edge AI for AD.
- Qualcomm AI Research: The organization driving this research, focused on making perception, reasoning, and action ubiquitous across devices.
- VAD (Vectorized Autonomous Driving): A previous state-of-the-art end-to-end system that DiMA surpasses.
5. Future Implications
The industry is moving toward hybrid architectures where large, knowledge-rich models (LLMs) guide the reasoning process, but smaller, highly optimized models (Transformers) handle the real-time execution. Future research will likely focus on running both models concurrently, using the LLM for high-level safety checks or low-frequency reasoning, while the distilled transformer handles the primary, high-frequency planning loop. This trend suggests a future where semantic understanding is deeply embedded in real-time control systems.
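A minimal sketch of such a dual-rate arrangement is below. Every name here is an illustrative assumption, not an API or design from the episode:

```python
# Hypothetical dual-rate control pattern: a distilled planner runs every
# control tick, while the full LLM is consulted only every Nth tick for
# low-frequency semantic/safety review.
def fallback_safe_stop(obs):
    # Placeholder: command a conservative slow-down when the LLM flags a plan.
    return {"action": "decelerate"}

def control_loop(fast_planner, slow_llm_checker, sensor_stream, llm_every=10):
    """Distilled transformer plans at high frequency; LLM reviews at low frequency."""
    last_review = "ok"
    for tick, obs in enumerate(sensor_stream):
        plan = fast_planner(obs)                       # high-frequency planning
        if tick % llm_every == 0:                      # low-frequency reasoning
            last_review = slow_llm_checker(obs, plan)  # e.g., semantic safety check
        if last_review != "ok":
            plan = fallback_safe_stop(obs)
        yield plan
```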
6. Target Audience
This episode is highly valuable for AI/ML Engineers, Robotics Researchers, Autonomous Vehicle Developers, and Technology Strategists interested in practical applications of foundation models (LLMs) in safety-critical, real-time edge computing environments.
🏢 Companies Mentioned
Qualcomm (Qualcomm AI Research)
đź’¬ Key Insights
"Text-to-3D is a model. Literally, a user speaks through an audio interface or text prompt, and you describe an object, like a cactus sword, you know, kind of a hippo wearing a sweater, you know, anything you can imagine, and it will generate on-device a 3D mesh and also associated texture map, like color, everything, in less than 3 seconds."
"The first one is going to be a text-to-3D demo. The second one is going to be either video-to-video or image-to-video generation demo. All these are all multi-modal demos running on device."
"We also show in the paper that you don't really need a lot of it [metric data]. We are using maybe 100, 150 times smaller than the amount of data used to train such discriminative models."
"What we did, our intuition is we take these two estimations and then compare the head is adaptive subtraction. A skill of a subtraction which generates a difference map. So the intuition is in this—this is a difference map—the regions with minimal differences are more reliable in terms of their metric depth estimation, and while and other areas with larger kind of differences build require maybe updates."
"This paper, SharpDepth, bridges these two approaches, integrating metric accuracy with detailed boundary preservation of the generative methods."
"Generative models... provide very sharp depth monocular depth estimation because there's a lot they can use synthetic data... but there is, you know, scale. It is point, you know, it is not like inches or millimeters or anything like that. You don't know what that point, each pixel, how big it is. It could be one meter or one millimeter, you know, very different. There are relative within the reconstructed image, but it's not absolute, so you can measure with it. Yeah, it is relative."