How OpenAI Builds AI Agents That Think and Act with Josh Tobin - #730

The TWIML AI Podcast · May 06, 2025 · 67 min
artificial-intelligence generative-ai ai-infrastructure startup openai

🎯 Summary

Podcast Summary: How OpenAI Builds AI Agents That Think and Act with Josh Tobin - #730

This episode of the TWIML AI Podcast features Sam Charrington in conversation with Josh Tobin, a Member of Technical Staff at OpenAI, focusing on the research and development behind OpenAI’s agentic products, such as Operator and Deep Research. The discussion centers on the shift from manually designed, multi-step LLM workflows to end-to-end trained, robust AI agents capable of complex, real-world task execution.


1. Focus Area

The primary focus is AI Agent Development and Architecture at OpenAI. Key areas covered include:

  • The limitations of traditional, human-designed, multi-step LLM workflows, whose per-step errors compound over long tasks (a short numeric illustration follows this list).
  • The necessity of end-to-end training using reinforcement learning (RL) to enable agents to learn recovery mechanisms from failures during multi-step processes.
  • The role and application of OpenAI’s agentic products (Deep Research, Operator, Codex CLI).
  • The evolution of the ML infrastructure landscape following the rise of powerful foundation models.

2. Key Technical Insights

  • Agentic Training Paradigm: The core technical challenge in building reliable agents is moving beyond sequential LLM calls governed by human-designed rules. True agentic capability requires training models end-to-end on the entire workflow, rewarding overall task success, which allows the model to learn recovery strategies (e.g., correcting a bad search term) that brittle, pre-designed workflows cannot handle (see the toy training sketch after this list).
  • Compounding Error Mitigation: Traditional workflows suffer from error propagation across multiple steps. RL-trained agents overcome this by learning to recognize and self-correct deviations during execution, leading to significantly higher reliability in long, complex tasks.
  • Reasoning and Model Size: Larger models generally exhibit better generalization capabilities, which is crucial for novel agentic tasks that developers did not explicitly anticipate. Furthermore, the ability for models to dynamically allocate reasoning effort (i.e., deciding how much “thinking” time to spend on a step) is a valuable component for robust agent performance.
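
The end-to-end training idea can be sketched with a toy example. The snippet below is a deliberately simplified illustration, not OpenAI's training setup: a tabular softmax policy is trained with REINFORCE on a five-step task whose only reward signal is whether the whole trajectory succeeded. All names and numbers are assumptions chosen for illustration.

```python
import numpy as np

# Toy sketch (not OpenAI's method): a 5-step task where the agent must pick
# the "right" action at each step. Reward arrives only at the end, for overall
# success, so the policy is optimized over whole trajectories.

rng = np.random.default_rng(0)
N_STEPS, N_ACTIONS = 5, 3
CORRECT = rng.integers(N_ACTIONS, size=N_STEPS)   # hidden "right" action per step
logits = np.zeros((N_STEPS, N_ACTIONS))           # tabular softmax policy

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def run_episode():
    """Sample a trajectory; terminal reward = 1 only if every step was correct."""
    actions = [rng.choice(N_ACTIONS, p=softmax(logits[t])) for t in range(N_STEPS)]
    reward = float(all(a == c for a, c in zip(actions, CORRECT)))
    return actions, reward

def reinforce(lr=0.5, episodes=5000):
    """REINFORCE on the terminal reward alone, with no per-step supervision."""
    for _ in range(episodes):
        actions, reward = run_episode()
        for t, a in enumerate(actions):
            grad = -softmax(logits[t])
            grad[a] += 1.0                        # gradient of log pi(a | step t)
            logits[t] += lr * reward * grad       # reinforce successful trajectories

reinforce()
success = np.mean([run_episode()[1] for _ in range(200)])
print(f"success rate after training ≈ {success:.0%}")
```

The point of the sketch is that supervision attaches to the trajectory outcome rather than to each hand-specified intermediate step; in the real setting, that is the property that lets an agent learn to recover from a bad intermediate action instead of compounding the error.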

3. Business/Investment Angle

  • Foundation Model Dominance: The era of every company building custom models is largely over. Foundation models (like those from OpenAI) are now so capable that businesses should exhaust all possibilities with off-the-shelf commercial models before investing in proprietary training infrastructure.
  • Agentic Products as Technology Previews: Early agentic offerings like Operator serve as crucial technology previews. While not yet broadly useful to mainstream users (much like early GPT-3 API access), they demonstrate the direction of the technology and provide early value to power users and developers.
  • Shifting MLOps Landscape: The business model for ML infrastructure startups focused on model training (pre-ChatGPT) became less feasible as general-purpose models absorbed much of the need for bespoke model development.

4. Notable Companies/People

  • Josh Tobin (OpenAI): Leads the agents research team, responsible for models powering agentic products. Previously co-founded the ML infrastructure startup Gantry.
  • OpenAI Agentic Products: Deep Research (for thorough literature review/synthesis), Operator (for real-world interaction via a virtual browser, e.g., booking reservations), and Codex CLI.
  • Andrej Karpathy: Mentioned regarding the idea that good models often outperform manually designed systems.

5. Future Implications

The industry is moving toward co-working entities where the AI assistant (like an evolved ChatGPT) intuitively knows when to provide a quick answer versus when to autonomously execute complex, multi-step tasks (like Deep Research) on the user’s behalf. The goal is to evolve ChatGPT into a natural partner that manages task delegation, including knowing when to pause research to ask clarifying questions.
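
As a rough illustration of that delegation decision, the sketch below is a hypothetical, hand-written router; all types, fields, and thresholds are assumptions, and in practice this decision would be learned by the model rather than hard-coded:

```python
from dataclasses import dataclass
from enum import Enum, auto

# Hypothetical sketch: decide whether a request deserves a quick inline answer,
# a long-running Deep Research-style task, or a clarifying question first.

class Route(Enum):
    QUICK_ANSWER = auto()
    DEEP_RESEARCH = auto()
    ASK_CLARIFYING_QUESTION = auto()

@dataclass
class Request:
    text: str
    needs_sources: bool   # does the user want a cited, synthesized report?
    is_ambiguous: bool    # is the scope or goal underspecified?

def route(request: Request) -> Route:
    """Toy delegation policy with arbitrary heuristics."""
    if request.is_ambiguous:
        return Route.ASK_CLARIFYING_QUESTION   # pause and ask before spending minutes of work
    if request.needs_sources or len(request.text.split()) > 50:
        return Route.DEEP_RESEARCH             # autonomous, multi-step execution
    return Route.QUICK_ANSWER                  # answer inline immediately

print(route(Request("Compare EV battery chemistries with sources", True, False)))
```

The interesting design question the episode raises is exactly the part this toy hard-codes: knowing when a request is underspecified enough to warrant a clarifying question before committing to a long autonomous run.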

6. Target Audience

This episode is highly valuable for AI Researchers, Machine Learning Engineers, Product Managers building AI applications, and Technology Strategists interested in the practical deployment and next generation of LLM capabilities beyond simple chat interfaces.

🏢 Companies Mentioned

Airbnb ✅ ai_application
Craigslist ✅ ai_application
eBay ✅ ai_application
Shopify ✅ ai_user/enterprise
Full Stack Deep Learning ✅ ai_education/community

đź’¬ Key Insights

"Do not just say, "Click upload in the menu." Say, "The menu is the hamburger thing on the right, and upload," like how granular do you need to—first of all, are we talking about localizing capabilities on the page, or are we talking about other types of context that is useful for the agent?"
Impact Score: 10
"I think of it a lot like early GPT-3 API, if you remember that, right? Where it is like people used it, and it was—I do not think OpenAI would have framed it this way at the time, but in a lot of ways it is kind of a technology preview."
Impact Score: 10
"Do not just say, 'Click upload in the menu.' Say, 'The menu is the hamburger thing on the right, and upload,' like how granular do you need to—"
Impact Score: 10
"One thing that I found to be really helpful for getting the most value from it is there is a way to add site-specific instructions, add or customize site-specific instructions."
Impact Score: 10
"So one thing that I found to be really helpful for getting the most value from it is there is a way to add site-specific instructions, add or customize site-specific instructions."
Impact Score: 10
"I think of it a lot like early GPT-3 API, right? Where it is like people used it, and it was—I do not think OpenAI would have framed it this way at the time, but in a lot of ways it is kind of a technology preview."
Impact Score: 10

📊 Topics

#artificialintelligence 96 #generativeai 30 #aiinfrastructure 8 #startup 5


Generated: October 05, 2025 at 07:43 PM