Chelsea Finn: Building Robots That Can Do Anything
🎯 Summary
This 44-minute podcast episode features Chelsea Finn discussing the challenge of moving robotics beyond highly specialized, application-specific solutions toward general-purpose foundation models for physical intelligence. The core narrative traces her work, particularly at her company Physical Intelligence, on models that enable robots to perform diverse, long-horizon tasks in novel environments, drawing strong parallels to the success of large language models (LLMs).
1. Focus Area
The primary focus is on General-Purpose Robotics using Foundation Models. Key themes include overcoming the traditional robotics bottleneck (requiring a new company/stack for every task), the role of scale vs. diversity in training data, and developing robust pre-training and fine-tuning recipes for real-world deployment across different hardware and environments.
2. Key Technical Insights
- Decoupling Pre-training and Fine-tuning: The most significant breakthrough for complex tasks like laundry folding was adopting a two-stage approach: extensive pre-training on all available robot data, followed by fine-tuning on a highly curated, consistent, high-quality demonstration dataset specific to the target task. This recipe drastically outperformed both training solely on the curated data and training on uncurated data alone.
- Leveraging Vision-Language Models (VLMs) for Language Following: To improve generalization and adherence to open-ended prompts (e.g., "clean the bedroom"), the team integrated larger, pre-trained VLMs (like PaliGemma). Crucially, they prevented the randomly initialized diffusion action head from degrading the VLM backbone's knowledge by stopping the gradient flow from that head into the backbone (see the first sketch after this list), raising the language-following rate from roughly 20% to 80%.
- Diversity Over Specificity in Pre-training: For generalization to unseen environments (e.g., testing in new Airbnbs), the model benefited significantly from diverse pre-training data. Mobile manipulation data (tidying kitchens and bedrooms) accounted for only about 2.4% of the total pre-training mixture, yet excluding static manipulation and web data reduced novel-environment performance by over 20%, demonstrating that broad, diverse exposure is key to closing the generalization gap (see the second sketch after this list).
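
The gradient-stopping trick in the second bullet above reduces to a one-line `detach()`. Below is a minimal PyTorch sketch under stated assumptions: `InsulatedPolicy`, the MLP head, and the dimensions are illustrative stand-ins, not Physical Intelligence's actual architecture (which pairs the VLM with a diffusion action head).

```python
import torch
import torch.nn as nn

class InsulatedPolicy(nn.Module):
    """Backbone features feed the action head through a detach(), so the
    randomly initialized head cannot push gradients into the pretrained VLM."""

    def __init__(self, backbone: nn.Module, feat_dim: int, action_dim: int):
        super().__init__()
        self.backbone = backbone  # pretrained VLM, still trained by its own language losses
        self.action_head = nn.Sequential(  # hypothetical stand-in for the diffusion action head
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs_tokens: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(obs_tokens)
        # Stop the gradient here: the action loss updates only the head,
        # leaving the VLM backbone's pretrained knowledge intact.
        return self.action_head(feats.detach())
```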
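
The pre-training mixture in the third bullet amounts to weighted sampling over heterogeneous data sources. In the sketch below, only the 2.4% mobile-manipulation share comes from the episode; the split between the other two sources is an assumption for illustration.

```python
import random

# Pre-training mixture weights. Only the 2.4% mobile-manipulation share is
# stated in the episode; the other two shares are illustrative assumptions.
MIXTURE = {
    "mobile_manipulation": 0.024,  # tidying kitchens/bedrooms
    "static_manipulation": 0.576,  # assumed
    "web_data": 0.400,             # assumed
}

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next training example by mixture weight."""
    sources = list(MIXTURE)
    weights = list(MIXTURE.values())
    return rng.choices(sources, weights=weights, k=1)[0]
```

Dropping the `static_manipulation` and `web_data` rows from this mixture mirrors the ablation described above, which cost over 20 points in novel environments.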
3. Business/Investment Angle
- Market Shift from Custom to Generalist: The current robotics landscape requires building an entire company (hardware, software, primitives) for every application. The foundation model approach promises to lower the barrier to entry by allowing a single model to serve multiple applications, mirroring the LLM ecosystem.
- Data Strategy is Paramount: Investment must focus not just on sheer data scale (industrial data, for example, can be highly repetitive) but on diverse, high-quality embodied data. The success of the laundry folding task hinged on finding the right recipe for curating demonstration data, not just collecting more of it.
- Hardware Agnosticism Potential: The ability to fine-tune a pre-trained model on data from an entirely new robot platform (even without full knowledge of its control representation) suggests a path toward faster deployment across heterogeneous hardware fleets.
4. Notable Companies/People
- Chelsea Finn: Speaker, co-founder of Physical Intelligence, driving the research agenda for general-purpose physical intelligence.
- Physical Intelligence: The company focused on developing these general-purpose foundation models for robotics.
- PaliGemma: The 3-billion-parameter open-source vision-language model used as the backbone for action prediction in later experiments.
5. Future Implications
The industry is moving toward embodied foundation models that can handle long-horizon, dexterous tasks (like folding laundry) and generalize across novel objects and environments (like cleaning unknown homes). The immediate bottleneck is shifting from data collection diversity to achieving higher reliability and performance (currently around 80% success rates in novel settings). This approach promises to unlock widespread, useful robotic applications beyond controlled industrial settings.
6. Target Audience
This episode is highly valuable for AI/ML Researchers, Robotics Engineers, Venture Capitalists focused on Deep Tech/Robotics, and Product Leaders aiming to deploy intelligent physical systems. It requires a foundational understanding of machine learning concepts like imitation learning, foundation models, and fine-tuning.
💬 Key Insights
"the world model will hallucinate a video of completing the task successfully, even if the actions that you provided as input didn't, weren't actually going to successfully lead to a good outcome."
"instead of only predicting the next action, you predict the intermediate subgoal image, like what should happen in the future in order to accomplish the task, and then predict an action from there."
"And we found that in blue, the performance at following instructions and making progress on the task was substantially lower than the performance of our system which is shown in green. And in general, we found that these frontier models generally struggle with visual understanding as it pertains to robotics, which makes sense because in general, these models aren't really targeting many physical applications and have very little data in the physical world."
"And then lastly, it's able to handle interjections and situated corrections. So in this case, the robot is kind of getting items for a user. The user interjects and says, "Get me something sweet that's not in the basket." Right after I had put a Kit Kat into the basket and the robot says, "Sure, let me get you some Skittles," and reasons through kind of basic reasoning of how to fulfill the user's request and is able to respond to those kinds of corrections situated in the world that the robot is at."
"And so what we did is we kind of took all of our existing robot data and we can actually generate synthetic data for the existing robot data. And particularly we can use language models to relabel and generate hypothetical human prompts for the scenarios that the robots are in."
"We rented three AirBnBs that we had never been to before... The robot's able to succeed, even though it's never been to here before. There's different countertops, different furniture, different objects, and so forth."