How to test, optimize, and reduce hallucinations of AIs with Thomas Natschlaeger
🎯 Summary
This 49-minute episode of “Pure Performance” features Thomas Natschlaeger, Principal Data Scientist at Dynatrace, discussing the critical challenges of ensuring quality, reliability, and trustworthiness in Large Language Models (LLMs), particularly within enterprise applications like Dynatrace’s internal managerial co-pilot. The discussion bridges the long history of AI research with the modern imperative of rigorous testing for generative models.
1. Focus Area
The primary focus is on AI/ML Quality Assurance and Reliability Engineering, specifically the testing methodologies, hallucination mitigation, and system integration challenges of deploying LLMs in production environments. Key applications discussed include building managerial co-pilots and translating natural language into a proprietary query language (DQL, the Dynatrace Query Language).
2. Key Technical Insights
- RAG Architecture Testing: Testing LLM applications requires treating the entire system—including the Retrieval Augmented Generation (RAG) pipeline (retrieval mechanism + LLM)—as an integrated unit, not just testing the LLM in isolation.
- Dual Testing Requirements: Testing splits into two tracks: one for free-text/summary outputs (checking linguistic quality and relevance) and a stricter one for structured outputs such as DQL generation, which must be syntactically and semantically correct to be executable at all (see the end-to-end sketch after this list).
- LLMs as Judges: A common and effective technique for measuring output quality (faithfulness, relevance) is to use a more powerful, external LLM as a “judge” of the primary model’s output, moving beyond older statistical metrics such as n-gram overlap (a minimal sketch of this pattern also follows the list).
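To make the pipeline-level testing and structured-output requirements concrete, here is a minimal pytest-style sketch of an end-to-end check on a RAG pipeline that emits DQL. The `answer_with_rag` and `is_valid_dql` helpers are hypothetical placeholders for a retrieval-plus-generation entry point and a DQL syntax checker; the episode does not describe Dynatrace's actual interfaces.

```python
# Hypothetical end-to-end RAG test: the helpers below are assumed stand-ins,
# not real Dynatrace APIs.
from dataclasses import dataclass, field


@dataclass
class RagResult:
    answer: str                                    # free-text explanation for the user
    dql: str                                       # generated DQL statement
    sources: list[str] = field(default_factory=list)  # retrieved documents used for grounding


def answer_with_rag(question: str) -> RagResult:
    """Placeholder for the real retrieval + LLM generation pipeline."""
    raise NotImplementedError


def is_valid_dql(query: str) -> bool:
    """Placeholder for a DQL parser or dry-run executor."""
    raise NotImplementedError


def test_dql_generation_end_to_end():
    # Exercise retrieval and generation together, not the LLM in isolation.
    result = answer_with_rag("Show the error rate of the checkout service for the last hour")
    # Structured output must be syntactically correct, or it is simply not executable.
    assert is_valid_dql(result.dql)
    # The answer should be grounded in at least one retrieved source.
    assert len(result.sources) > 0
```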
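And a minimal sketch of the LLM-as-a-judge pattern, assuming a generic `call_llm` wrapper around the external evaluator model; the rubric and JSON response format are illustrative, not the evaluation framework used at Dynatrace.

```python
# Hypothetical LLM-as-a-judge evaluator; `call_llm` is an assumed wrapper, not a real API.
import json


def call_llm(prompt: str) -> str:
    """Placeholder for a call to a more capable external 'judge' model."""
    raise NotImplementedError


JUDGE_PROMPT = """You are grading an AI-generated answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Score the answer from 1 (worst) to 5 (best) on:
- faithfulness: every claim is supported by the retrieved context
- relevance: the answer actually addresses the question
Reply with JSON only, e.g. {{"faithfulness": 4, "relevance": 5}}."""


def judge(question: str, context: str, answer: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    # The parsed scores replace older n-gram-overlap metrics and can gate a release
    # pipeline, e.g. require faithfulness >= 4 before shipping a prompt or model change.
    return json.loads(raw)
```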
3. Business/Investment Angle
- Trust as the New KPI: As AI moves into core business functions (like observability analysis), the “trust factor”—ensuring AI outputs are true and reliable—becomes paramount, mirroring the need for trust in traditional monitoring data.
- Enterprise Adoption Requires Rigor: The shift from experimental LLMs to reliable enterprise tools requires well-defined Key Performance Indicators (KPIs) for text generation, analogous to the RMSE/MAE metrics long established for numerical ML forecasting (their standard definitions are recalled after this list).
- Data Gating is Crucial: A major business risk is the quality of the underlying knowledge base. Ensuring the retrieved data sources are accurate and relevant is the first line of defense against propagating misinformation or “mass hallucinations” from the training data.
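For reference, the forecasting metrics mentioned above have the following standard definitions (general textbook definitions, not specific to the episode), where $y_i$ are observed values and $\hat{y}_i$ the model's predictions:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
```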
4. Notable Companies/People
- Thomas Natschlaeger (Dynatrace): Guest expert, Principal Data Scientist, with a deep background in neural networks dating back to the 1990s and experience in weather prediction ML.
- Dynatrace: The company where Natschlaeger is building internal AI tools, specifically a managerial co-pilot that translates natural language into their proprietary DQL.
- Jürgen Schmidhuber: Mentioned in an anecdote regarding the historical credit for AI innovations, particularly automated gradient descent.
- OpenAI: Referenced as the entity that commercialized the Transformer model, leading to the current LLM boom.
5. Future Implications
The industry is moving toward a future where AI quality assurance is formalized through dedicated testing frameworks that account for the unique non-deterministic nature of generative models. The focus will remain on grounding LLMs in proprietary, verified knowledge bases (RAG) and using sophisticated LLM-based evaluation systems to maintain high standards of factual accuracy and relevance, especially as AI is integrated across diverse enterprise personas (developers, testers, managers).
6. Target Audience
This episode is highly valuable for AI/ML Engineers, Software Quality Assurance Professionals, Data Scientists, and Technology Leaders involved in building, deploying, or integrating LLM-powered features into enterprise software, particularly those concerned with model drift, hallucination, and production reliability.
🏢 Companies Mentioned
- Dynatrace
- OpenAI
💬 Key Insights
"I feel like if you're just trying to model AI based on the world as we know it right now to help us, then maybe we missed an opportunity to really redefine everything, because we don't need to just model the world we live in right now digitally, but maybe it is now an easier and a different way, right?"
"back in the days, there was a new technology, containers, orchestration, Kubernetes, but people were just repackaging their apps and putting them on Kubernetes, but this is not cloud native."
"defining what AI native is, is like back in 2014 if you would have defined what cloud native is."
"if you go to an OpenAI API, you don't get this enterprise SLAs, you cannot opt out from prompt logging. So they will log everything... then all this information goes back into their service, and they will leverage it for the next training."
"With how does that feedback go back into the model so the model can learn from that, right? And how critical is that? Because it seems like a lot of what the AI is learning is based on what exists, but then as we implement things, implement things, it's not getting those updates of what these changes are."
"So having those smaller models, but also publishing what the data we're making public, what the data sources are, I think would be critical to this trust of, is my answer a reliable answer if I can see the sources that are being used..."