EP 628: What’s the best LLM for your team? 7 Steps to evaluate and create ROI for AI
🎯 Summary
This episode of the Everyday AI Show, hosted by Jordan Molson, provides a practical framework for organizations to select the right Large Language Model (LLM) for their needs and, crucially, to measure the Return on Investment (ROI) when deploying these tools, particularly in their front-end, “AI Operating System” capacity.
The central narrative argues that the industry is shifting from using LLMs primarily via backend APIs to integrating them as front-end AI Operating Systems (like enterprise versions of ChatGPT, Gemini, or Claude), where users interact directly with rich interfaces, modes, and external applications. This shift necessitates a new, rigorous evaluation process beyond standard scientific benchmarks.
1. Focus Area
The primary focus is on evaluating front-end LLM deployments for knowledge workers, moving beyond simple API benchmarks to assess real-world utility, reliability, and ROI within enterprise workflows. Key themes include the concept of the LLM as an operating system, the complexity introduced by multiple models and “modes” (e.g., web search, agent mode), and common pitfalls in achieving measurable AI ROI.
2. Key Technical Insights
- Front-End vs. API Usage: The most powerful use cases emerge when utilizing the modes and features available in the front-end interfaces (like ChatGPT’s Connectors, Canvas, or Agent mode), which are often unavailable or abstracted when using models purely via API connections.
- Generative Reliability: Because generative AI outputs vary significantly even with identical inputs, structured AI Evals (evaluations) are essential for ensuring consistency, verifying performance, and catching bias before company-wide deployment (a minimal harness sketch follows this list).
- Public Evals as Starting Points: While custom testing is necessary, organizations can leverage public evaluation sites like LM Arena (which uses blind user voting to generate Elo scores across categories like coding, math, and creativity) and LiveBench to narrow down which models to test internally.
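To make the idea of structured evals concrete, here is a minimal Python sketch of the kind of harness this implies: a small set of real-work test cases with pass/fail checks tied to a rubric, each run several times to expose output variability. The prompts, check functions, and the `run_model` stub are illustrative assumptions, not details from the episode.

```python
import statistics

# Hypothetical test cases: each pairs a real-work prompt with pass/fail checks
# tied to the grading rubric. Names and structure are illustrative only.
TEST_CASES = [
    {
        "prompt": "Summarize the attached Q3 sales report and cite the source file.",
        "checks": [
            lambda out: "source:" in out.lower(),   # requires a working citation
            lambda out: len(out.split()) < 300,     # respects a length constraint
        ],
    },
    # ...extend to 20-40 real examples, including deliberately messy inputs
]

def run_model(prompt: str) -> str:
    """Stand-in for whichever model/mode combination is being evaluated."""
    return "Summary of Q3 results... Source: q3_sales_report.xlsx"  # replace with a real call

def evaluate(runs_per_case: int = 3) -> None:
    """Score each case several times, since generative outputs can vary on identical input."""
    for i, case in enumerate(TEST_CASES, start=1):
        scores = []
        for _ in range(runs_per_case):
            output = run_model(case["prompt"])
            passed = sum(check(output) for check in case["checks"])
            scores.append(passed / len(case["checks"]))
        print(f"case {i}: mean score {statistics.mean(scores):.2f}, "
              f"spread {max(scores) - min(scores):.2f}")

if __name__ == "__main__":
    evaluate()
```

In practice, `run_model` would wrap the specific front-end model and mode under evaluation, and the per-case scores would feed the grading rubric and KPIs defined in the framework described below.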
3. Business/Investment Angle
- The AI Operating System Decision: Companies must treat the choice of a primary front-end LLM (ChatGPT, Gemini, Claude) as a strategic decision akin to choosing a foundational operating system (like Windows or macOS) in the 90s, as business processes will reorganize around it.
- ROI Traps: Common reasons for failing to see ROI include overly long pilot phases, severe lack of change management and training, and, most critically, failing to measure pre-Gen AI human baselines for comparison.
- Shiny Object Syndrome: Businesses frequently get distracted by weekly AI advancements, preventing them from locking down a test workflow and seeing a pilot through to measurable completion.
4. Notable Companies/People
- Stephen Johnson (Co-founder of NotebookLM): Featured as a guest/expert, emphasizing the need for tools that help organize complex information and discussing the shift toward AI operating systems.
- Jordan Molson (Host): Driving the discussion, advocating for the AI Operating System concept and detailing the 7-step evaluation framework.
- LLM Providers: OpenAI (ChatGPT), Google (Gemini), and Anthropic (Claude) are the primary platforms whose enterprise/team plans are the focus of the evaluation.
- Evaluation Platforms: LM Arena, LiveBench, Epoch AI, and Scale’s SEAL Leaderboard were cited as valuable public resources for initial model assessment.
5. Future Implications
The conversation strongly suggests that the future of knowledge work involves embedding business processes directly into the user interfaces of major LLM platforms, leveraging integrated apps and connectors. Success will depend not just on the model’s raw intelligence but on the organization’s ability to rigorously test, train staff, and establish clear, measurable baselines to prove efficiency gains.
6. Target Audience
This episode is highly valuable for Business Leaders, IT/Digital Transformation Managers, AI Strategy Teams, and Knowledge Workers responsible for piloting, selecting, and scaling generative AI tools within their organizations.
Comprehensive Summary of the 7-Step Evaluation Framework
The core of the episode is the 7-Step Plan for Evaluating AI Models and Creating ROI. The speaker stresses that before starting, organizations must secure executive buy-in, clear legal/security permissions, and commit to a short (2-4 week) evaluation sprint focused on one specific workflow, ignoring new advancements during that period.
The First Four Steps Detailed:
- Define Success Criteria: Write a “job description” for the pilot, explicitly identifying the required outcome, constraints, allowed tools, and “do not do” actions. Create a measurable grading rubric (1-10 scale) and define 3-5 black-and-white Key Performance Indicators (KPIs) for the workflow (e.g., human time spent, accuracy, revisions required).
- Measure Your Human Baseline First: This step is deemed critical but often skipped. Multiple employees must complete the exact same task without AI to establish the average time, error rate, and cost per task. Without this baseline, any ROI calculation is guesswork (a rough calculation using these numbers is sketched after this list).
- Build a Realistic and Controllable Test Data Set: Gather 20-40 actual work examples that include “messiness” (e.g., renamed files, dead links) to test the model’s adaptability, especially for agentic workflows. Create a pass/fail checklist for each test case tied to the rubric.
- Configure Your Workspace Like Production: Test using the exact subscription tier (e.g., a Team or Enterprise plan rather than a personal account), models, modes, and connectors that employees will actually use in production, so pilot results reflect real conditions.
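As a rough illustration of why the human baseline matters, the sketch below turns baseline and pilot measurements into a simple monthly ROI figure. All numbers and variable names are hypothetical placeholders, not figures cited in the episode.

```python
# Hypothetical baseline figures gathered in step 2 (not from the episode).
human_minutes_per_task = 45        # average across several employees, no AI
ai_minutes_per_task = 12           # average during the pilot, AI-assisted
tasks_per_month = 200
loaded_cost_per_hour = 60.0        # fully loaded labor cost, USD
ai_seat_cost_per_month = 30.0      # per-user subscription for the front-end plan
pilot_users = 5

minutes_saved = (human_minutes_per_task - ai_minutes_per_task) * tasks_per_month
labor_savings = (minutes_saved / 60) * loaded_cost_per_hour
tool_cost = ai_seat_cost_per_month * pilot_users

roi = (labor_savings - tool_cost) / tool_cost
print(f"Monthly labor savings: ${labor_savings:,.2f}")
print(f"Monthly tool cost:     ${tool_cost:,.2f}")
print(f"ROI multiple:          {roi:.1f}x")
```

The same structure extends to the other KPIs from step 1 (error rate, revisions required) by comparing their baseline and pilot values side by side.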
💬 Key Insights
"This is probably where even if you follow these steps one through seven, this is probably where you're going to fail. Most people have no clue how to prompt an AI."
"Let's say for whatever reason, you're using GPT-5 thinking as your model of choice, and then you're using Canvas mode, okay? Canvas mode gets updated often. Most people, unless you're a nerd like me, don't know this."
"You need to retest this monthly to track changes."
"I think these seven factors are pretty good to keep a look to keep a look at. So cost, latency, accuracy, stability, safety, integration, and compliance, right?"
"You also need to require working citations, file paths, or artifacts for every accepted answer. No exceptions."
"Most humans have no clue what they're doing when they're using large language models. You need to use the right model, the right mode, and the right prompting techniques. You got to know the basics."