Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar
🎯 Summary
Overview
Evaluation (evals) is emerging as the highest ROI and most critical new skill for building successful AI products, moving beyond simple 'vibe checks' to systematic measurement and improvement. The process begins with manual error analysis, or 'open coding', where domain experts review application traces to identify and document issues, which is crucial before attempting automation. This foundational step allows builders to gain deep, actionable insights necessary for iterating confidently on complex, stochastic LLM applications.

Key takeaways
- Evals are systematic data analytics used to measure and improve AI applications, considered the highest ROI activity for product builders.
- The initial, crucial step in building evals is manual error analysis (open coding) of application traces, which should not be fully automated by LLMs at this stage.
- The 'benevolent dictator' concept suggests appointing a single domain expert to own the initial, rapid note-taking process to avoid committee slowdowns.
- Builders should review traces until they reach 'theoretical saturation', the point where they stop uncovering new types of errors or insights.
- Evals are broader than traditional software unit tests, encompassing metrics, tracking user feedback (like thumbs up/down), and identifying new user cohorts.
- Common misconceptions include believing an LLM can perform the initial error analysis or that evals must be done perfectly from the start.

Themes
- The Definition and Importance of Evals
- The Eval Process: Error Analysis and Open Coding
- Misconceptions and Pitfalls in Eval Implementation
- Structuring the Eval Process (Benevolent Dictator vs. Committee)
- The Role of Domain Expertise in Evaluation
- Moving from Manual Analysis to Systematic Metrics
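As a rough illustration of the 'theoretical saturation' idea in the summary, here is a minimal Python sketch. It assumes each reviewed trace has already been tagged by hand with a set of short error labels, and it treats saturation as a run of `window` consecutive traces that surface no new label. The function name, the label strings, and the window size are hypothetical choices for illustration; the episode summary does not prescribe a specific stopping rule.

```python
from typing import Iterable, Optional, Set


def saturation_point(coded_traces: Iterable[Set[str]], window: int = 20) -> Optional[int]:
    """Return how many traces had been reviewed when `window` consecutive
    traces added no new error label, or None if that never happens.

    `window` is an assumed threshold, not a value taken from the source.
    """
    seen: Set[str] = set()
    streak = 0  # consecutive traces that revealed nothing new
    for reviewed, codes in enumerate(coded_traces, start=1):
        new_codes = codes - seen
        if new_codes:
            seen |= new_codes
            streak = 0
        else:
            streak += 1
        if streak >= window:
            return reviewed
    return None


# Hypothetical labels written by a human reviewer; an empty set means
# the reviewer found nothing wrong with that trace.
example_traces = [
    {"hallucinated_feature"},
    {"bad_tone"},
    set(),
    {"hallucinated_feature"},
    set(),
]
stop_at = saturation_point(example_traces, window=3)
```

With a window of 3, the sketch stops after the fifth trace, because traces three through five add no label that was not already seen. In practice the window would be much larger, and the labels would come from the reviewer's own notes rather than a fixed list.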
💬 Key Insights
"Benevolent dictator is just a catchy term for the fact that when you're doing this open coding, a lot of teams get bogged down in having a committee do this... You need to cut through the noise in a lot of organizations. If you look really deeply, especially small, medium-sized companies, there's really like, you can appoint one person whose taste that you trust."
"I can guarantee you, I would bet money on this: if I put that into ChatGPT and asked, 'Is there an error?' it would say, 'No, it did a great job.' But Hamel had the context of knowing, 'Oh, we don't actually have this virtual tour functionality.'"
"Both the chief product officers of Anthropic and OpenAI shared that eVals are becoming the most important new skill for product builders."
"To build great AI products, you need to be really good at building eVals. It's the highest ROI activity you can engage in."
"It should be the person with domain expertise. So in this case, it would be the person who understands the business of leasing, apartment leasing, and has context to understand if this makes sense. It's always a domain expert."
"The first two or three can be very painful, but it doesn't—we can do a bunch of them really fast. So here's another one. And let's skip the system prompt again. And the user asks, 'Hey, I'm looking for a two- to three-bedroom with either one or two baths. Do you provide virtual tours?' And a bunch of tools are called. And it says, 'Hi, Sarah. Currently, we have three-bedroom, two-and-a-half-bathroom apartment available for $2,175. Unfortunately, we don't have any two-bedroom options at the moment. We do offer virtual tours. Let's schedule a tour, blah blah.' It just so happens that there's no virtual tour. Nice. It is hallucinating something that doesn't exist."