From Prompts to Policies: How RL Builds Better AI Agents with Mahesh Sathiamoorthy - #731

The TWIML AI Podcast · May 13, 2025 · 61 min
artificial-intelligence ai-infrastructure generative-ai investment startup openai google
64 Companies
85 Key Quotes
5 Topics
1 Action Item

🎯 Summary

Podcast Summary: From Prompts to Policies: How RL Builds Better AI Agents with Mahesh Sathiamoorthy - #731

This episode of the TWIML AI Podcast, hosted by Sam Charrington, features Mahesh Sathiamoorthy, co-founder and CEO of Bespoke Labs. The conversation focuses on the critical role of Reinforcement Learning (RL) and data curation in moving beyond brittle prompting techniques to build robust, high-performing AI agents.

1. Focus Area

The discussion centers on advanced post-training techniques for Large Language Models (LLMs), specifically contrasting the limitations of prompt engineering with the power of RL Fine-Tuning for developing sophisticated AI agents capable of complex reasoning and tool use. A strong secondary focus is the Data-Centric AI philosophy, emphasizing the necessity of high-quality, curated data (including reasoning traces) for model improvement.

2. Key Technical Insights

  • RL as the Policy Builder: RL is presented as the necessary mechanism to teach models “what’s good and what’s bad,” moving beyond static instructions provided via prompting. This is exemplified by OpenAI’s Deep Research Agent, which utilized RL fine-tuning despite the general industry push toward prompting.
  • The AlphaGo Moment in LLMs: The current application of RL to LLMs is analogous to AlphaGo, not AlphaZero. LLMs possess significant prior knowledge (world understanding from pre-training/SFT), meaning RL only requires nudging this knowledge in the right direction (fewer rollouts/less compute) rather than learning everything from scratch.
  • Reward Shaping as the New Prompting: The method of defining success and failure through reward functions in RL is becoming the new “programming language” for controlling model behavior, replacing complex, brittle prompt chains.
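To make the reward-shaping point concrete, here is a minimal sketch of a reward function for RL fine-tuning a tool-using agent. The tags, weights, and task are illustrative assumptions rather than anything specified in the episode; GRPO/PPO-style trainers generally consume a per-rollout scalar of exactly this shape in place of a long behavioral prompt.

```python
# Hypothetical reward function for RL fine-tuning of a tool-using agent.
# Instead of encoding "what good looks like" in a long prompt, the policy
# is optimized against a scalar score computed per rollout.
import re

def reward(rollout: str, expected_answer: str) -> float:
    """Score one model rollout; tags and weights are illustrative."""
    score = 0.0

    # Format shaping: reward emitting the final answer inside a parseable tag.
    match = re.search(r"<answer>(.*?)</answer>", rollout, re.DOTALL)
    if match:
        score += 0.2
        # Outcome reward: exact-match correctness of the final answer.
        if match.group(1).strip() == expected_answer.strip():
            score += 1.0

    # Tool-use shaping: small bonus for a well-formed tool call.
    if re.search(r"<tool_call>\s*\{.*?\}\s*</tool_call>", rollout, re.DOTALL):
        score += 0.1

    return score

# A rollout that follows the expected format vs. one that does not.
good = '<tool_call>{"name": "search"}</tool_call> <answer>42</answer>'
bad = "The answer is 42."
print(reward(good, "42"), reward(bad, "42"))  # ~1.3 vs 0.0
```

Swapping this function out changes what the policy optimizes for, which is the sense in which reward design becomes the new "programming language."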

3. Business/Investment Angle

  • Enterprise Adaptation is Key: The primary business value lies in adapting powerful, general open-source models to specific enterprise environments, ecosystems, and proprietary tools (e.g., internal APIs). RL/SFT combinations allow models to become “hired” and trained for specific organizational contexts.
  • Data Curation as the Bottleneck: The failure point for most custom model projects is not the model architecture but the data curation pipeline. Tools that systematize large-scale, batch-mode data curation and visualization unlock significant alpha (a minimal pipeline sketch follows this list).
  • Agent Development Fragility: The current reliance on complex, multi-step prompting for agents leads to “prompt hell”—fragile, expensive, and difficult-to-debug systems. RL offers a more flexible, baked-in policy mechanism for agent behavior.
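As a minimal sketch of that curation bottleneck, the pass below deduplicates and length-filters candidate training examples before they reach fine-tuning. The field names and thresholds are hypothetical; this is not Bespoke Labs' actual tooling.

```python
# Hypothetical batch curation pass: dedupe, filter, and tag candidate
# fine-tuning examples. Field names and thresholds are illustrative.
import hashlib
import json

def curate(records, min_len=32, max_len=8192):
    """Yield deduplicated, length-filtered records with provenance tags."""
    seen = set()
    for rec in records:
        digest = hashlib.sha256((rec["prompt"] + rec["response"]).encode()).hexdigest()
        if digest in seen:                      # drop exact duplicates
            continue
        seen.add(digest)
        if not (min_len <= len(rec["response"]) <= max_len):
            continue                            # drop empty or runaway generations
        rec["meta"] = {"source": rec.get("source", "unknown"), "sha": digest[:12]}
        yield rec

raw = [
    {"prompt": "Summarize RL fine-tuning.",
     "response": "RL fine-tuning optimizes a policy against a reward signal.",
     "source": "batch-01"},
    {"prompt": "Summarize RL fine-tuning.",
     "response": "RL fine-tuning optimizes a policy against a reward signal.",
     "source": "batch-02"},
]
print(json.dumps(list(curate(raw)), indent=2))  # the duplicate is dropped
```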

4. Notable Companies/People

  • Mahesh Sathiamoorthy (Bespoke Labs): Discussed his background at Google DeepMind (working on Bard/Gemini) and the inspiration for Bespoke Labs—the realization that data recipes drive most model improvement.
  • OpenAI Deep Research Agent: Cited as a prime example of using RL fine-tuning for agent construction.
  • DeepSeek R1: Mentioned as a catalyst for the community's focus on reasoning data, prompting Bespoke Labs to release their own trained model (Bespoke-Stratos) and contribute to the OpenThoughts/OpenThinker consortium.
  • Andrej Karpathy: Quoted regarding “Programming in English” to frame reward shaping as the new control mechanism.

5. Future Implications

The industry is moving toward RL-driven policy creation for agents, especially those requiring tool use and complex reasoning. This shift will allow enterprises to deeply customize foundation models for proprietary tasks, moving away from runtime prompt manipulation toward baked-in, robust behaviors that adhere to the “Bitter Lesson” by letting the model learn optimal policies through experience rather than explicit human instruction.

6. Target Audience

AI/ML Engineers, Research Scientists, AI Product Managers, and Technology Leaders focused on deploying custom, robust AI agents and understanding the next frontier beyond standard prompt engineering.

🏢 Companies Mentioned

Rich Sutton ✅ ai_research
David Silver ✅ ai_research
Andrej Karpathy ✅ ai_research
DSP ✅ ai_tooling
Llama ✅ ai_model
MiniChart ✅ ai_model
Math Olympiad ✅ unknown
GRPO ✅ unknown
Deep Research ✅ unknown

đź’¬ Key Insights

"MiniChart is kind of also similar: it does Q&A on charts, and we are able to train a 7B model, and it's much better than existing 7B models. It also achieves the level of Claude 3.5 and Gemini 1.5."
Impact Score: 10
"The advantage of using the MiniCheck model is it's actually—we measure it on a benchmark—and it's actually better than GPT-4o in this task. And then it's actually way cheaper because it's a 7B model."
Impact Score: 10
"MiniCheck is a model that is trained to detect hallucinations. It's a specific problem called groundedness, basically. If you take, especially in RAG systems, you have a context which the model looks at and produces an answer or a claim. This particular model is good at checking if this claim is supported by this context or not."
Impact Score: 10
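The groundedness check described above can be sketched as a claim-versus-context call. The interface below is an assumption for illustration; the toy heuristic stands in for a trained checker such as a MiniCheck-style 7B model.

```python
# Sketch of a groundedness check in a RAG pipeline. `check_support` is a
# hypothetical stand-in for a trained claim-verification model.
from dataclasses import dataclass

@dataclass
class GroundednessResult:
    claim: str
    supported: bool
    score: float  # probability that the context supports the claim

def check_support(context: str, claim: str) -> GroundednessResult:
    """Placeholder: a real implementation would run the (context, claim)
    pair through a fine-tuned checker and threshold its support probability."""
    score = 1.0 if claim.lower() in context.lower() else 0.1  # toy heuristic
    return GroundednessResult(claim, supported=score >= 0.5, score=score)

context = "Bespoke Labs trains small task-specific models for data curation."
for claim in [
    "Bespoke Labs trains small task-specific models for data curation.",
    "Bespoke Labs was founded in 1999.",
]:
    print(check_support(context, claim))
```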
"DeepSeek RL does something like that. So, in this case, things worked fine, but in the large-scale DeepSeek, what they do is they had DeepSeek RL 0, but they didn't do any SFT; they did just the RL-based fine-tuning, and using that, they got a model, and they used that model to generate some data, which then went into the SFT."
Impact Score: 10
"Is it possible to capture the data that's generated in the RL process and use that as prompts for SFT? ... they had DeepSeek RL 0, but they didn't do any SFT; they did just the RL-based fine-tuning, and using that, they got a model, and they used that model to generate some data, which then went into the SFT."
Impact Score: 10
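The RL-to-SFT bootstrap described in the two quotes above amounts to rejection sampling over the RL model's own traces. A rough sketch, where every function is a hypothetical placeholder rather than DeepSeek's or Bespoke Labs' actual code:

```python
# Rough sketch of the RL -> SFT bootstrap: an RL-only model (R1-Zero style)
# generates reasoning traces, verified traces are kept, and the kept traces
# become supervised fine-tuning data for the next model.

def generate_traces(rl_model, prompts, samples_per_prompt=8):
    """Sample candidate reasoning traces from the RL-trained policy."""
    return [(p, rl_model(p)) for p in prompts for _ in range(samples_per_prompt)]

def keep_if_verified(traces, verifier):
    """Rejection sampling: keep only traces whose final answer checks out."""
    return [(p, t) for p, t in traces if verifier(p, t)]

def bootstrap_sft_dataset(rl_model, prompts, verifier):
    curated = keep_if_verified(generate_traces(rl_model, prompts), verifier)
    # These (prompt, trace) pairs become SFT targets for a fresh base model,
    # which can then go through a further RL stage.
    return [{"prompt": p, "completion": t} for p, t in curated]

# Toy stand-ins for the model and verifier:
toy_model = lambda p: f"<think>...</think><answer>{len(p)}</answer>"
toy_verifier = lambda p, t: f"<answer>{len(p)}</answer>" in t
print(len(bootstrap_sft_dataset(toy_model, ["2+2?", "3*3?"], toy_verifier)))  # 16
```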
"What we found is that RL is actually much nicer to work with because you don't need as much data. In that particular case, all we needed were about a hundred good-quality examples instead of tens of thousands."
Impact Score: 10

📊 Topics

#artificialintelligence 92 #aiinfrastructure 13 #investment 7 #generativeai 7 #startup 3

🎯 Action Items

🎯 democratize investigation

🤖 Processed with true analysis

Generated: October 05, 2025 at 05:56 PM