901: Automating Legal Work with Data-Centric ML (feat. Lilith Bat-Leah)

Super Data Science Podcast July 01, 2025 66 min
artificial-intelligence generative-ai ai-infrastructure investment nvidia anthropic
62 Companies
78 Key Quotes
4 Topics
3 Insights

🎯 Summary

This episode of the Super Data Science podcast features Lilith Bat-Leah, Senior Director of AI Labs at Epiq, discussing how AI, and Data-Centric Machine Learning (DCML) in particular, is transforming the legal technology sector, specifically e-discovery.

1. Focus Area

The primary focus is the application of machine learning to legal tech, centered on automating the e-discovery process. Key technical discussions revolve around Technology Assisted Review (TAR) workflows, active learning strategies (relevance feedback vs. uncertainty sampling), the integration of Large Language Models (LLMs) via Retrieval Augmented Generation (RAG), and the critical importance of data quality and uncertainty quantification in high-stakes legal environments.
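
The two active learning strategies named above differ only in how they rank unreviewed documents for the next human-review batch. Below is a minimal sketch of that difference, assuming a fitted classifier that emits a probability of relevance per document; all names and numbers are illustrative, not from the episode.

```python
import numpy as np

def select_batch(scores: np.ndarray, k: int, strategy: str) -> np.ndarray:
    """Pick k document indices to route to human review next.

    scores: model-estimated probability of relevance per document.
    'relevance' ranks the most-likely-relevant documents first
    (relevance feedback, the usual TAR 2.0 choice); 'uncertainty'
    ranks documents closest to the 0.5 decision boundary first
    (uncertainty sampling).
    """
    if strategy == "relevance":
        order = np.argsort(-scores)
    elif strategy == "uncertainty":
        order = np.argsort(np.abs(scores - 0.5))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return order[:k]

# Example: 10,000 unreviewed documents with synthetic model scores
rng = np.random.default_rng(0)
scores = rng.beta(2, 5, size=10_000)
print(select_batch(scores, k=5, strategy="relevance"))
print(select_batch(scores, k=5, strategy="uncertainty"))
```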

2. Key Technical Insights

  • Advanced TAR with LLMs: Epiq’s AI Discovery Assistant moves beyond traditional TAR (which relies on standard classifiers like Random Forests or SVMs) by leveraging LLMs with RAG to “kickstart” the classification process, using both human-labeled examples and natural language instructions to train case-specific classifiers (e.g., relevance, privilege); see the first sketch after this list.
  • The Metric of “Elusion”: The legal tech domain uses a metric called elusion (False Negatives / (False Negatives + True Negatives)), analogous to the False Omission Rate in general ML. It is crucial for estimating recall, and a confidence interval around it, in TAR 2.0 workflows where every predicted-relevant document is human-reviewed.
  • Prioritizing Uncertainty over Point Estimates: Lilith strongly advocates reporting model performance with confidence intervals rather than single point estimates (like precision or recall). She argues that a point estimate without a sample size or confidence interval is effectively “lying with statistics,” because it hides the inherent uncertainty in the estimation process; the second sketch after this list makes this concrete.
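
Two hedged sketches to make the bullets above concrete. First, the “kickstart” idea: retrieve a few human-labeled example documents and pass them, together with a natural-language relevance instruction, to an LLM whose labels can then seed a cheap case-specific classifier. This is not Epiq’s actual pipeline; it assumes the `openai` Python SDK, and the model name, instruction text, and helper names are invented.

```python
from openai import OpenAI  # assumes the openai v1 SDK is installed

client = OpenAI()

# Natural-language criterion, as counsel might phrase it (invented example)
INSTRUCTION = (
    "A document is RELEVANT if it discusses the 2019 supply agreement "
    "between the parties, including drafts, pricing, or termination terms."
)

def llm_label(document: str, retrieved_examples: list[tuple[str, str]]) -> str:
    """Label one document, grounding the LLM with retrieved human-labeled
    examples (the RAG step) plus the instruction above."""
    shots = "\n\n".join(
        f"Document: {text}\nLabel: {label}" for text, label in retrieved_examples
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Answer RELEVANT or NOT_RELEVANT only."},
            {"role": "user",
             "content": f"{INSTRUCTION}\n\n{shots}\n\nDocument: {document}\nLabel:"},
        ],
    )
    return resp.choices[0].message.content.strip()
```

Second, the elusion-to-recall arithmetic, with a Clopper-Pearson interval so the recall estimate carries its uncertainty instead of standing as a bare point estimate. The figures in the example are invented:

```python
from scipy.stats import beta

def elusion_interval(fn_found: int, sample_size: int, conf: float = 0.95):
    """Clopper-Pearson interval for elusion, FN / (FN + TN): the share of
    the discard (predicted non-relevant) pile that is actually relevant,
    estimated from a random sample of that pile."""
    alpha = 1 - conf
    lo = beta.ppf(alpha / 2, fn_found, sample_size - fn_found + 1) if fn_found else 0.0
    hi = beta.ppf(1 - alpha / 2, fn_found + 1, sample_size - fn_found) if fn_found < sample_size else 1.0
    return lo, hi

def recall_interval(true_positives: int, discard_size: int,
                    fn_found: int, sample_size: int, conf: float = 0.95):
    """Turn the elusion interval into a recall interval, assuming every
    predicted-relevant document was human-reviewed (TAR 2.0), so the
    produced set holds `true_positives` confirmed relevant documents."""
    lo_e, hi_e = elusion_interval(fn_found, sample_size, conf)
    fn_lo, fn_hi = lo_e * discard_size, hi_e * discard_size  # FNs hiding in the discard pile
    return (true_positives / (true_positives + fn_hi),   # pessimistic recall
            true_positives / (true_positives + fn_lo))   # optimistic recall

# 3 relevant docs found in a 1,500-doc sample of a 100,000-doc discard pile:
print(recall_interval(true_positives=8_000, discard_size=100_000,
                      fn_found=3, sample_size=1_500))
```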

3. Business/Investment Angle

  • Massive Automation Potential: AI tools like Epiq’s assistant claim to automate over 80% of traditional e-discovery processes and to speed up reviews by up to 90% compared to linear review, representing significant cost savings in high-stakes litigation (often involving millions or billions of dollars).
  • Expertise Gap in Law Firms: Most large law firms lack in-house data scientists for discovery tasks, making them heavily reliant on specialized legal tech vendors like Epiq for both the tools and the domain expertise needed to ensure defensibility.
  • Defensibility as a Negotiated Metric: Unlike in other industries, the required performance metrics (precision/recall) in e-discovery are often negotiated with opposing counsel or regulatory bodies, so data scientists must clearly articulate the consequences of a given margin of error to legal professionals; the sketch below shows how directly that margin drives review cost.
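
One concrete consequence worth articulating in those negotiations: tightening the margin of error carries a quadratic cost in sample size. A minimal sketch using the standard normal-approximation formula for a proportion (our example, not from the episode):

```python
from math import ceil
from scipy.stats import norm

def sample_size_for_margin(margin: float, conf: float = 0.95, p: float = 0.5) -> int:
    """Documents to sample so a proportion estimate (e.g. recall measured
    on a validation sample) lands within +/- `margin` at `conf` confidence.
    p = 0.5 is the worst case, giving the widest interval."""
    z = norm.ppf(1 - (1 - conf) / 2)
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Halving the margin of error roughly quadruples the review burden:
print(sample_size_for_margin(0.05))    # 385 documents
print(sample_size_for_margin(0.025))   # 1537 documents
```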

4. Notable Companies/People

  • Lilith Bat-Leah: Senior Director of AI Labs at Epiq (a legal tech firm with over 6,000 employees). Co-chair of the Data-Centric Machine Learning Research (DCMLR) working group at MLCommons.
  • Andrew Ng: Credited with coining the term Data-Centric AI, which spurred Lilith’s interest in focusing on data quality over algorithm optimization.
  • Epiq AI Discovery Assistant: The specific product discussed, designed to accelerate document review using advanced ML techniques.
  • MLCommons (DataPerf): The organization where Lilith is involved in benchmarking and advancing DCMLR.

5. Future Implications

The industry is moving toward a model where AI tools are sophisticated enough to handle the bulk of document review, requiring legal professionals to focus on interpreting results and defending methodologies rather than manual processing. The conversation highlights a broader industry shift toward Data-Centric AI, where improving the quality and consistency of training data (especially in noisy domains like legal) yields greater performance gains than marginal algorithmic tweaks. There is a growing need for data scientists to become better communicators of statistical uncertainty to non-technical stakeholders.
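
That data-centric claim is measurable: before tweaking models, a team can quantify how consistent its training labels are. A minimal illustration using Cohen’s kappa from scikit-learn; the reviewer labels are invented.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical: two reviewers label the same 10 documents (1 = relevant)
reviewer_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
reviewer_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Kappa corrects raw agreement for chance; a low score flags label noise
# worth fixing before reaching for marginal algorithmic tweaks.
print(cohen_kappa_score(reviewer_a, reviewer_b))  # 0.4 on these labels
```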

6. Target Audience

This episode is most valuable for hands-on practitioners including Data Scientists, AI/ML Engineers, and Software Developers, particularly those interested in applying ML to specialized, high-stakes domains, or those involved in Data-Centric AI research and implementation.

🏢 Companies Mentioned

Serg Masís ai_research
Khan Academy ai_research
MIT OpenCourseWare ai_research
dataperf.org ai_infrastructure
JMLR journal ai_research
Northwestern research_institution
Claude Pro unknown
Adversarial Nibbler unknown

💬 Key Insights

"if you ran the experiments 20 times, you would anticipate with a 0.05 alpha that one of those 20 times, you would get a significant result by chance alone. And this is like a century-old idea, and from the age of Fisher and Pearson and statistics, and the idea there is that you'll kind of accept that you'll end up getting a significant result by chance alone one or 20 times, and that's kind of tolerable, but it is completely arbitrary."
Impact Score: 10
"like you should be running that model a bunch of times in both the A case and the B case, get a distribution of results, and be comparing those. And then if you have a statistically significant result."
Impact Score: 10
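
What the two quotes above describe, in minimal code: repeat each model variant across several runs, then test whether the gap between the two metric distributions exceeds chance. A hedged sketch with synthetic numbers, using Welch’s t-test from SciPy.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
# Hypothetical: recall from 20 training runs of each model variant
recall_a = rng.normal(loc=0.90, scale=0.02, size=20)
recall_b = rng.normal(loc=0.92, scale=0.02, size=20)

# Welch's t-test (no equal-variance assumption) on the two distributions
t, p = ttest_ind(recall_a, recall_b, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")  # small p -> difference unlikely by chance
```
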
"I don't think you're able to properly evaluate the performance of the models that you're building. So you might be able to build the model without statistics, but I think especially in this era of black box models, it's so important to be able to actually evaluate the performance."
Impact Score: 10
"It's covering a lot of those subjects, linear algebra, calculus, probability theory, and statistics. And we go in that order so that hopefully by the time we get to the statistics part, you're able to understand based on the fundamental building blocks underlying it what's going on as opposed to just being able to get an A by following the examples, not by rote, that's not exactly it, but by being able to apply the abstractions as opposed to understand the underlying fundamentals."
Impact Score: 10
"there is a critique that the intent focus on benchmark performance doesn't necessarily translate to real-world impact in the way that we would expect. So there's definitely a balance to be found there."
Impact Score: 10
"I totally see the idea of how benchmarks and competition have led us to having such a model-centric approach to machine learning."
Impact Score: 10

📊 Topics

#artificialintelligence 159 #generativeai 11 #investment 8 #aiinfrastructure 8

🧠 Key Takeaways

💡 stop obsessing over model improvements and focus on the data work that already takes up 80% of data scientists' time; Lilith also talks about how she grew from being a temp receptionist to eventually an AI Labs director by falling in love with statistics
💡 be able to accept that human reviewers' labels are the gold standard when evaluating TAR models

🤖 Processed with true analysis

Generated: October 05, 2025 at 05:28 AM