Chelsea Finn: Building Robots That Can Do Anything
🎯 Summary
This 44-minute podcast episode features Chelsea Finn discussing the challenge of moving robotics beyond highly specialized, application-specific solutions toward general-purpose foundation models for physical intelligence. The core narrative traces her work, particularly at her company Physical Intelligence, on models that enable robots to perform diverse, long-horizon tasks in novel environments, drawing strong parallels to the success of large language models (LLMs).
1. Focus Area
The primary focus is on General-Purpose Robotics using Foundation Models. Key themes include overcoming the traditional robotics bottleneck (requiring a new company/stack for every task), the role of scale vs. diversity in training data, and developing robust pre-training and fine-tuning recipes for real-world deployment across different hardware and environments.
2. Key Technical Insights
- Decoupling Pre-training and Fine-tuning: The most significant breakthrough for complex tasks like laundry folding was adopting a two-stage approach: extensive pre-training on all available robot data, followed by fine-tuning on a highly curated, consistent, high-quality demonstration dataset specific to the target task. This recipe drastically outperformed both training solely on the curated data and training on uncurated data alone.
- Leveraging Vision-Language Models (VLMs) for Language Following: To improve generalization and adherence to open-ended prompts (e.g., "clean the bedroom"), the team integrated larger, pre-trained VLMs (like PaliGemma). Crucially, they prevented the randomly initialized diffusion action head from degrading the VLM backbone's knowledge by stopping the gradient flow from that head into the backbone (see the first sketch after this list), raising the language-following rate from roughly 20% to 80%.
- Diversity Over Specificity in Pre-training: For generalization to unseen environments (e.g., testing in new Airbnbs), the model benefited significantly from diverse pre-training data. Mobile manipulation data (tidying kitchens and bedrooms) accounted for only about 2.4% of the total pre-training mixture, yet excluding static manipulation and web data reduced novel-environment performance by over 20%, demonstrating that broad, diverse exposure is key to closing the generalization gap (see the second sketch after this list).
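
The gradient-stopping trick in the second bullet above reduces to a one-line `detach()`. Below is a minimal PyTorch sketch under stated assumptions: `InsulatedPolicy`, the MLP head, and the dimensions are illustrative stand-ins, not Physical Intelligence's actual architecture (which pairs the VLM with a diffusion action head).

```python
import torch
import torch.nn as nn

class InsulatedPolicy(nn.Module):
    """Backbone features feed the action head through a detach(), so the
    randomly initialized head cannot push gradients into the pretrained VLM."""

    def __init__(self, backbone: nn.Module, feat_dim: int, action_dim: int):
        super().__init__()
        self.backbone = backbone  # pretrained VLM, still trained by its own language losses
        self.action_head = nn.Sequential(  # hypothetical stand-in for the diffusion action head
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs_tokens: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(obs_tokens)
        # Stop the gradient here: the action loss updates only the head,
        # leaving the VLM backbone's pretrained knowledge intact.
        return self.action_head(feats.detach())
```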
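
The pre-training mixture in the third bullet amounts to weighted sampling over heterogeneous data sources. In the sketch below, only the 2.4% mobile-manipulation share comes from the episode; the split between the other two sources is an assumption for illustration.

```python
import random

# Pre-training mixture weights. Only the 2.4% mobile-manipulation share is
# stated in the episode; the other two shares are illustrative assumptions.
MIXTURE = {
    "mobile_manipulation": 0.024,  # tidying kitchens/bedrooms
    "static_manipulation": 0.576,  # assumed
    "web_data": 0.400,             # assumed
}

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next training example by mixture weight."""
    sources = list(MIXTURE)
    weights = list(MIXTURE.values())
    return rng.choices(sources, weights=weights, k=1)[0]
```

Dropping the `static_manipulation` and `web_data` rows from this mixture mirrors the ablation described above, which cost over 20 points in novel environments.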
3. Business/Investment Angle
- Market Shift from Custom to Generalist: The current robotics landscape requires building an entire company (hardware, software, primitives) for every application. The foundation model approach promises to lower the barrier to entry by allowing a single model to serve multiple applications, mirroring the LLM ecosystem.
- Data Strategy is Paramount: Investment must focus not just on sheer data scale (industrial data, for example, can be highly repetitive) but on diverse, high-quality embodied data. The success of the laundry folding task hinged on finding the right recipe for curating demonstration data, not just collecting more of it.
- Hardware Agnosticism Potential: The ability to fine-tune a pre-trained model on data from an entirely new robot platform (even without full knowledge of its control representation) suggests a path toward faster deployment across heterogeneous hardware fleets.
4. Notable Companies/People
- Chelsea Finn: Speaker, co-founder of Physical Intelligence, driving the research agenda for general-purpose physical intelligence.
- Physical Intelligence: The company focused on developing these general-purpose foundation models for robotics.
- PaliGemma: The 3-billion-parameter open-source vision-language model used as the backbone for action prediction in later experiments.
5. Future Implications
The industry is moving toward embodied foundation models that can handle long-horizon, dexterous tasks (like folding laundry) and generalize across novel objects and environments (like cleaning unknown homes). The immediate bottleneck is shifting from data collection diversity to achieving higher reliability and performance (currently around 80% success rates in novel settings). This approach promises to unlock widespread, useful robotic applications beyond controlled industrial settings.
6. Target Audience
This episode is highly valuable for AI/ML Researchers, Robotics Engineers, Venture Capitalists focused on Deep Tech/Robotics, and Product Leaders aiming to deploy intelligent physical systems. It requires a foundational understanding of machine learning concepts like imitation learning, foundation models, and fine-tuning.
💬 Key Insights
"the world model will hallucinate a video of completing the task successfully, even if the actions that you provided as input didn't, weren't actually going to successfully lead to a good outcome."
"instead of only predicting the next action, you predict the intermediate subgoal image, like what should happen in the future in order to accomplish the task, and then predict an action from there."
"And we found that in blue, the performance at following instructions and making progress on the task was substantially lower than the performance of our system which is shown in green. And in general, we found that these frontier models generally struggle with visual understanding as it pertains to robotics, which makes sense because in general, these models aren't really targeting many physical applications and have very little data in the physical world."
"And then lastly, it's able to handle interjections and situated corrections. So in this case, the robot is kind of getting items for a user. The user interjects and says, "Get me something sweet that's not in the basket." Right after I had put a Kit Kat into the basket and the robot says, "Sure, let me get you some Skittles," and reasons through kind of basic reasoning of how to fulfill the user's request and is able to respond to those kinds of corrections situated in the world that the robot is at."
"And so what we did is we kind of took all of our existing robot data and we can actually generate synthetic data for the existing robot data. And particularly we can use language models to relabel and generate hypothetical human prompts for the scenarios that the robots are in."
"We rented three AirBnBs that we had never been to before... The robot's able to succeed, even though it's never been to here before. There's different countertops, different furniture, different objects, and so forth."