Generative Benchmarking with Kelly Hong - #728

The TWIML AI Podcast · April 23, 2025 · 54 min
artificial-intelligence investment ai-infrastructure generative-ai openai
26 Companies
84 Key Quotes
4 Topics

🎯 Summary

Generative Benchmarking with Kelly Hong - #728: Comprehensive Summary

This podcast episode features Sam Charrington in conversation with Kelly Hong, a researcher at Chroma, focusing on the introduction and necessity of Generative Benchmarking: a novel approach to systematically evaluating Retrieval-Augmented Generation (RAG) systems and embedding models. The core narrative addresses the inadequacy of current public benchmarks and the common practice of "vibe checking" system performance.

1. Focus Area

The primary focus is on Evaluation Methodologies for AI/ML Systems, specifically within the Retrieval Space (RAG). Key technologies discussed include embedding models, vector databases, synthetic data generation for evaluation, and the use of LLMs as judges for automated assessment.

2. Key Technical Insights

  • Two-Step Generative Benchmarking Process: The method centers on creating custom, representative evaluation sets by first performing document/chunk filtering (to ensure relevance to the use case) and then context-steered query generation (to mimic realistic, often vague, user queries); a minimal sketch of this flow follows the list below.
  • Importance of Contextual Alignment: Effective evaluation requires injecting specific application context (e.g., "This is a technical support bot") into both the filtering and query generation steps so that the synthetic test set reflects the true production environment.
  • LLM Judge Alignment is Crucial: Blindly using an LLM as a judge for relevance or filtering yields poor results (initially 46% alignment with human labels). Frameworks like EvalGen are needed to iterate on prompts and criteria until the LLM judge's decisions align with human preferences (reaching over 70% alignment).
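
The two-step flow above can be made concrete with a short sketch. This is a minimal illustration under stated assumptions rather than Chroma's actual implementation: llm() is a hypothetical wrapper around whatever chat-completion API you use, and APP_CONTEXT, the prompts, and the alignment check are placeholders to adapt to your application.

```python
# Minimal sketch of generative benchmarking: filter chunks with a context-steered
# LLM judge, check the judge against human labels, then generate realistic queries.

APP_CONTEXT = "This is a technical support bot for a developer tools product."  # hypothetical

def llm(prompt: str) -> str:
    """Hypothetical wrapper around whatever chat-completion API you use."""
    raise NotImplementedError

# Step 1: context-steered filtering of document chunks with an LLM judge.
def judge_chunk(chunk: str) -> bool:
    prompt = (
        f"Application context: {APP_CONTEXT}\n"
        "Would a real user of this application plausibly ask a question that this "
        "chunk answers? Reply YES or NO.\n\n"
        f"Chunk:\n{chunk}"
    )
    return llm(prompt).strip().upper().startswith("YES")

# Measure judge/human agreement before trusting the judge (the 46% -> 70%+ step).
def judge_alignment(chunks: list[str], human_labels: list[bool]) -> float:
    agree = sum(judge_chunk(c) == label for c, label in zip(chunks, human_labels))
    return agree / len(chunks)

# Step 2: query generation over the retained chunks, steered by a few real
# production queries so the output stays short and vague like real user queries.
def generate_query(chunk: str, example_queries: list[str]) -> str:
    prompt = (
        f"Application context: {APP_CONTEXT}\n"
        "Here are examples of real user queries (note how short and vague they are):\n"
        + "\n".join(f"- {q}" for q in example_queries)
        + "\n\nWrite ONE query in the same style that this chunk would answer:\n"
        + chunk
    )
    return llm(prompt).strip()

def build_benchmark(chunks: list[str], example_queries: list[str]) -> list[tuple[str, str]]:
    kept = [c for c in chunks if judge_chunk(c)]
    return [(generate_query(c, example_queries), c) for c in kept]
```

The example queries passed to generate_query are the steering piece: without them, generated queries tend to be cleaner and more on-topic than real production traffic, which inflates retrieval scores relative to what the system will actually see in deployment.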

3. Business/Investment Angle

  • Public Benchmark Limitations: Scores on generic benchmarks like MTEB often fail to predict real-world performance; models that score higher on MTEB may perform worse on domain-specific, messy production data.
  • Model Performance Divergence: The conversation highlighted empirical findings in which one embedding model (Jina AI) performed worse on real-world data than its MTEB score suggested, while another (Voyage 3 Large) performed best, underscoring the need for custom evaluation before deployment.
  • Cost vs. Benefit of Advanced Techniques: Techniques like contextual rewriting (prepending context to chunks) can significantly boost retrieval performance but require weighing the computational cost of running an LLM over every chunk; a sketch of this evaluation and rewriting step follows the list below.
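
The recall@k comparison and the contextual-rewriting idea mentioned above can be sketched against the generated (query, chunk) pairs. This is a minimal sketch, not a definitive setup: Chroma's default embedding function stands in for whichever model is under test (e.g., OpenAI text-embedding-3-large, Jina, or Voyage), and llm(), the collection names, and the one-sentence preamble prompt are assumptions.

```python
# Minimal sketch: recall@k over the generated benchmark, with optional contextual
# rewriting (prepend a short LLM-written preamble to each chunk before indexing).
import chromadb

def llm(prompt: str) -> str:
    """Hypothetical chat-completion wrapper, as in the earlier sketch."""
    raise NotImplementedError

def contextualize(chunk: str, source_name: str = "product docs") -> str:
    # One LLM call per chunk: this is where the cost/benefit trade-off appears.
    preamble = llm(f"In one sentence, say what part of '{source_name}' this chunk covers:\n{chunk}")
    return f"{preamble}\n\n{chunk}"

def recall_at_k(pairs: list[tuple[str, str]], k: int = 5, rewrite: bool = False) -> float:
    client = chromadb.Client()  # in-memory client; embeddings use Chroma's default model
    name = "bench_rewritten" if rewrite else "bench_raw"
    collection = client.create_collection(name=name)

    docs = [chunk for _, chunk in pairs]
    if rewrite:
        docs = [contextualize(chunk) for chunk in docs]

    ids = [f"chunk-{i}" for i in range(len(pairs))]
    collection.add(ids=ids, documents=docs)

    hits = 0
    for i, (query, _) in enumerate(pairs):
        result = collection.query(query_texts=[query], n_results=k)
        if f"chunk-{i}" in result["ids"][0]:  # did the gold chunk land in the top k?
            hits += 1
    return hits / len(pairs)
```

Comparing recall_at_k(pairs) against recall_at_k(pairs, rewrite=True) on the same generated benchmark gives a like-for-like read on whether contextual rewriting earns its extra per-chunk LLM calls for your data.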

4. Notable Companies/People

  • Kelly Hong (Chroma): Researcher and lead on the Generative Benchmarking project.
  • Chroma: The company developing vector database solutions and focusing on systematic AI system debugging and evaluation.
  • Weights & Biases (W&B): Used as a case study partner; production query logs from their technical support bot informed the development and validation of the generative benchmarking approach.
  • Embedding Model Providers: OpenAI (text-embedding-3-large), Jina AI, and Voyage were mentioned in the context of comparative performance testing.

5. Future Implications

The industry is moving away from relying solely on static, public benchmarks toward dynamic, synthetic evaluation sets tailored to specific application data. The future of robust RAG deployment hinges on making evaluation accessible (not just for experts) and ensuring that the tools used for judging (LLM judges) are rigorously aligned with human expectations.

6. Target Audience

This episode is highly valuable for AI/ML Engineers, RAG Developers, MLOps Professionals, and Technical Product Managers involved in building, deploying, and maintaining production-grade LLM and retrieval applications.

🏢 Companies Mentioned

Hugging Face ✅ ai_infrastructure
EvalGen ✅ unknown
Jina AI ✅ unknown
Chroma ✅ unknown
Generative Benchmarking ✅ unknown
Kelly Hong ✅ unknown
Sam Charrington ✅ unknown
TWIML AI ✅ unknown
Retrieval Space ✅ unknown
Claude 🔥 ai_application
AirBench 🔥 ai_infrastructure

💬 Key Insights

"It's interesting that you're going for multi-party alignment, and it makes me curious about whether there's specific research on how to align an LLM not just to a particular party but to two parties in this case: the user that's issuing queries... but then you're also aligning to the creator of the system and their input to the LLM as judge part."
Impact Score: 10
"And we also have some human alignment in the whole LLM judge process where you're filtering document chunks. So, if we didn't have that, as we saw initially, we only had 46% alignment; that's not very reflective of how you would want to evaluate your system."
Impact Score: 10
"We tested this out with a naive query generation method as well, where we wouldn't give any example queries, we wouldn't give it any context; we would literally just feed in the chunk and tell the LLM to generate a query. And in those cases, oftentimes it would generate pretty relevant queries, more relevant queries than the real production queries. And we noticed a performance increase by a lot, which if you're just looking at the numbers, it looks good, but it's not really reflective of what you'll actually see in production."
Impact Score: 10
"A lot of real user queries are very ambiguous. And a lot of the polished datasets that you see, the query-document pairs are highly relevant, so it's very obvious that a query matches the documents. Whereas in the real world, maybe a query is only relevant to the first sentence of a chunk."
Impact Score: 10
"I think one of the important things is understanding how important retrieval is in the context of your entire AI system. A lot of people just use it for RAG, where you retrieve relevant documents and then you have an LLM output. A lot of people just tend to focus only on the LLM output... but maybe the problem is just in the retrieval itself."
Impact Score: 10
"We noticed that retrieval performance drops a lot when you're working with these very domain-specific datasets... it's very hard to differentiate between different chunks. So, I think one area that could be interesting to explore is how can we improve performance in these very domain-specific use cases..."
Impact Score: 10

📊 Topics

#artificialintelligence (108) · #investment (8) · #aiinfrastructure (3) · #generativeai (1)

🤖 Processed with true analysis

Generated: October 05, 2025 at 10:35 PM