Distilling Transformers and Diffusion Models for Robust Edge Use Cases with Fatih Porikli - #738
🎯 Summary
This episode of the TWIML AI Podcast features Fatih Porikli, Senior Director of Technology at Qualcomm, and focuses on two of Qualcomm's CVPR research papers: DiMA (Distilling Multi-modal Large Language Models for Autonomous Driving) and SharpDepth (diffusion distillation for depth estimation). The theme unifying both discussions is the critical role of efficiency and distillation in deploying advanced AI models, particularly Large Language Models (LLMs), on resource-constrained edge devices such as autonomous vehicles.
1. Focus Area
The primary focus is on advancing Autonomous Driving (AD) systems by integrating Multi-Modal Large Language Models (LLMs) into end-to-end planning architectures, specifically addressing the challenge of robustness in long-tail, rare scenarios. Secondary focus includes the application of diffusion model distillation for high-fidelity perception tasks like depth estimation.
2. Key Technical Insights
- DiMA Architecture: DiMA proposes an end-to-end AD solution that leverages the world knowledge and semantic reasoning capabilities of LLMs. Vision-based low-level perception (projected into a bird's-eye view) and vehicle/agent state information are tokenized and fed into the LLM-based planner.
- LLM Role as Regularizer: The LLM's world-knowledge representation acts as a powerful regularizer, allowing the system to generalize to rare, long-tail scenarios without explicit training data for every event and significantly improving robustness over purely modular systems.
- Distillation for Efficiency: To overcome the latency of running full LLMs on edge hardware, DiMA employs distillation. A smaller, efficient transformer-based model (the student) is trained to approximate the outputs (updated tokens) of the larger LLM teacher, achieving significant speedups while retaining high performance. Surrogate tasks such as trajectory prediction and Visual Question Answering (VQA) are used during distillation so the student captures the necessary spatial-temporal dynamics and semantic grounding (see the sketch below).
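A minimal PyTorch sketch of what such a teacher-student setup could look like. All module names, dimensions, and the loss weighting here are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of DiMA-style token distillation with a surrogate
# trajectory task; shapes and names are illustrative assumptions only.
import torch
import torch.nn as nn

class StudentPlanner(nn.Module):
    """Small transformer trained to mimic the LLM teacher's updated tokens."""
    def __init__(self, d_model=256, n_heads=8, n_layers=4, horizon=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.horizon = horizon
        self.traj_head = nn.Linear(d_model, horizon * 2)  # (x, y) waypoints

    def forward(self, scene_tokens):
        tokens = self.encoder(scene_tokens)       # refined scene tokens
        plan = self.traj_head(tokens[:, 0])       # plan from a query token
        return tokens, plan.view(-1, self.horizon, 2)

def distillation_loss(student_tokens, teacher_tokens, pred_traj, gt_traj, alpha=0.5):
    # Match the frozen teacher's updated tokens (feature distillation) while
    # supervising the surrogate trajectory-prediction task at the same time.
    feat_loss = nn.functional.mse_loss(student_tokens, teacher_tokens)
    traj_loss = nn.functional.l1_loss(pred_traj, gt_traj)
    return alpha * feat_loss + (1 - alpha) * traj_loss

# Toy usage: BEV perception + ego/agent state already tokenized to (B, N, d).
student = StudentPlanner()
scene = torch.randn(2, 64, 256)           # tokenized BEV + agent-state features
teacher_tokens = torch.randn(2, 64, 256)  # stand-in for frozen LLM teacher output
gt_traj = torch.randn(2, 6, 2)            # ground-truth waypoints
tokens, traj = student(scene)
loss = distillation_loss(tokens, teacher_tokens, traj, gt_traj)
loss.backward()
```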
3. Business/Investment Angle
- End-to-End Superiority: End-to-end systems like DiMA are setting a new state of the art, demonstrating large KPI improvements (e.g., an 80% reduction in collision rate relative to prior end-to-end baselines such as VAD) over traditional modular AD stacks.
- Edge Deployment Viability: The success of distillation proves that complex, knowledge-rich models (LLMs) can be made efficient enough to run faster than real-time on Qualcomm accelerators, making advanced AI practical for mass-market automotive deployment.
- Semantic Interpretability as a Feature: Unlike modular systems, where interpretability is often limited to component outputs, DiMA's LLM integration provides semantic interpretability: the system can explain why it is braking or changing lanes (e.g., "slowing down due to congestion"), which is crucial for trust and safety validation.
4. Notable Companies/People
- Fatih Porikli (Qualcomm): Senior Director of Technology, presenting Qualcomm’s latest research in efficient, robust edge AI for AD.
- Qualcomm AI Research: The organization driving this research, focused on making perception, reasoning, and action ubiquitous across devices.
- VAD (Vectorized Autonomous Driving): A previous state-of-the-art end-to-end system that DiMA surpasses.
5. Future Implications
The industry is moving toward hybrid architectures where large, knowledge-rich models (LLMs) guide the reasoning process, but smaller, highly optimized models (Transformers) handle the real-time execution. Future research will likely focus on running both models concurrently, using the LLM for high-level safety checks or low-frequency reasoning, while the distilled transformer handles the primary, high-frequency planning loop. This trend suggests a future where semantic understanding is deeply embedded in real-time control systems.
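A minimal sketch of such a dual-rate arrangement is below. Every name here is an illustrative assumption, not an API or design from the episode:

```python
# Hypothetical dual-rate control pattern: a distilled planner runs every
# control tick, while the full LLM is consulted only every Nth tick for
# low-frequency semantic/safety review.
def fallback_safe_stop(obs):
    # Placeholder: command a conservative slow-down when the LLM flags a plan.
    return {"action": "decelerate"}

def control_loop(fast_planner, slow_llm_checker, sensor_stream, llm_every=10):
    """Distilled transformer plans at high frequency; LLM reviews at low frequency."""
    last_review = "ok"
    for tick, obs in enumerate(sensor_stream):
        plan = fast_planner(obs)                       # high-frequency planning
        if tick % llm_every == 0:                      # low-frequency reasoning
            last_review = slow_llm_checker(obs, plan)  # e.g., semantic safety check
        if last_review != "ok":
            plan = fallback_safe_stop(obs)
        yield plan
```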
6. Target Audience
This episode is highly valuable for AI/ML Engineers, Robotics Researchers, Autonomous Vehicle Developers, and Technology Strategists interested in practical applications of foundation models (LLMs) in safety-critical, real-time edge computing environments.
🏢 Companies Mentioned
Qualcomm (Qualcomm AI Research)
đź’¬ Key Insights
"Text-to-3D is a model. Literally, a user speaks through an audio interface or text prompt, and you describe an object, like a cactus sword, you know, kind of a hippo wearing a sweater, you know, anything you can imagine, and it will generate on-device a 3D mesh and also associated texture map, like color, everything, in less than 3 seconds."
"The first one is going to be a text-to-3D demo. The second one is going to be either video-to-video or image-to-video generation demo. All these are all multi-modal demos running on device."
"We also show in the paper that you don't really need a lot of it [metric data]. We are using maybe 100, 150 times smaller than the amount of data used to train such discriminative models."
"What we did, our intuition is we take these two estimations and then compare the head is adaptive subtraction. A skill of a subtraction which generates a difference map. So the intuition is in this—this is a difference map—the regions with minimal differences are more reliable in terms of their metric depth estimation, and while and other areas with larger kind of differences build require maybe updates."
"This paper, SharpDepth, bridges these two approaches, integrating metric accuracy with detailed boundary preservation of the generative methods."
"Generative models... provide very sharp depth monocular depth estimation because there's a lot they can use synthetic data... but there is, you know, scale. It is point, you know, it is not like inches or millimeters or anything like that. You don't know what that point, each pixel, how big it is. It could be one meter or one millimeter, you know, very different. There are relative within the reconstructed image, but it's not absolute, so you can measure with it. Yeah, it is relative."