914: Data Lakes 101 (and Why They’re Key for AI Models), with Oz Katz

Unknown Source October 04, 2025 26 min
artificial-intelligence ai-infrastructure startup
37 Companies
29 Key Quotes
3 Topics

🎯 Summary


This episode of the Super Data Science podcast, hosted by Jon Krohn, features Oz Katz, co-founder and CTO of lakeFS, focusing on the critical role of modern data storage architectures, specifically Data Lakes, in supporting advanced AI and Machine Learning applications.


1. Focus Area

The primary focus is the evolution of data storage for AI, contrasting traditional Data Warehouses with Data Lakes. Key discussions centered on the challenges introduced by multimodal AI data (images, text, embeddings, etc.), the need for versioning and collaboration in data pipelines, and how modern systems must manage this complexity. The conversation heavily featured the analogy of Git for Data.

2. Key Technical Insights

  • Data Lake Definition: A Data Lake is fundamentally a centralized, shared repository (like a “shared folder”) designed to ingest data of any structure (structured or unstructured), allowing for faster iteration compared to the rigid structure required by Data Warehouses.
  • Multimodal Data Complexity: Modern AI requires managing diverse data modalities (images, text, vector embeddings) simultaneously. This often leads to “multiple sources of lies” (data scattered across object stores, vector databases, feature stores) that easily fall out of sync.
  • lakeFS as Data Version Control: lakeFS brings Git-like versioning (branching, committing, merging, pull requests) directly onto the data lake's object store. This lets data scientists experiment in isolated environments without copying massive datasets, ensuring reproducibility and controlled collaboration.
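
The branching idea above can be sketched in a few lines. This is a toy illustration of copy-on-write branching over an object store, not the lakeFS API: a branch starts as a cheap copy of its source's key-to-object mapping, so only the keys it actually changes consume new storage, and the source branch stays frozen until a merge.

```python
# Toy copy-on-write branching over a key -> object mapping.
# NOT the lakeFS API -- a conceptual sketch of "Git for Data".

class ToyDataRepo:
    def __init__(self):
        self.branches = {"main": {}}  # branch name -> {key: object bytes}

    def put(self, branch, key, obj):
        self.branches[branch][key] = obj

    def branch(self, name, source="main"):
        # Shallow copy: only the mapping is duplicated; the underlying
        # objects are shared until a branch overwrites a key.
        self.branches[name] = dict(self.branches[source])

    def merge(self, source, target):
        # Naive merge policy: the source branch's changes win on conflict.
        self.branches[target].update(self.branches[source])


repo = ToyDataRepo()
repo.put("main", "images/cat.png", b"raw-bytes-v1")

repo.branch("experiment")  # isolated view, no data copied
repo.put("experiment", "images/cat.png", b"raw-bytes-v2")

assert repo.branches["main"]["images/cat.png"] == b"raw-bytes-v1"  # main untouched
repo.merge("experiment", "main")
assert repo.branches["main"]["images/cat.png"] == b"raw-bytes-v2"  # change landed
```

The real system adds commits, metadata, and conflict handling, but the economics are the same: branching is a metadata operation, not a data copy.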

3. Business/Investment Angle

  • Data as a Managed Asset: The core business value of tools like lakeFS is treating data as a managed, auditable company asset, similar to how code is managed via Git, reducing operational risk and improving governance.
  • Convergence on Object Storage: The industry trend shows that various data types (tabular via Apache Iceberg, vectors, raw files) are increasingly converging onto the Object Store (e.g., S3) as the single organizational source of truth.
  • Enabling Complex Pipelines: By unifying data access and versioning, these solutions reduce friction in complex, multi-stage AI pipelines involving pre-processing, feature extraction, and model training across different modalities.

4. Notable Companies/People

  • Oz Katz (lakeFS): Co-founder and CTO, providing the technical perspective on data storage needs for AI.
  • lakeFS: The company providing the solution, which offers a unified facade and version-control layer over object stores.
  • Apache Iceberg: Mentioned as an open-source table format that allows structured, schema-managed tables to exist on top of object stores, enabling updates and consistent reads across different compute engines.

5. Future Implications

The future of data infrastructure for AI points toward complete convergence on the object store as the foundational layer. Solutions must provide robust versioning, governance, and schema management (like Iceberg) directly on this storage layer to handle the increasing complexity of multimodal data and large-scale model training/deployment. Reproducibility, guaranteed by data snapshots, will become non-negotiable.

6. Target Audience

This episode is highly valuable for AI/ML Engineers, Data Engineers, Data Architects, and CTOs involved in building or scaling data platforms specifically designed to support large-scale, multimodal AI model development and MLOps.


Comprehensive Summary Narrative

The podcast opens with a lighthearted discussion on time zones before diving into the core topic: the data infrastructure required for modern AI. Oz Katz defines the Data Lake as a flexible, centralized repository, contrasting it with the rigid structure of a Data Warehouse. He emphasizes that the needs of AI have drastically changed the required data landscape, moving beyond simple tabular data to encompass complex, multimodal data like images, labels, and vector embeddings.

The central challenge identified is managing the resulting data sprawl. When data scientists work with these different modalities, the data often fragments across specialized systems (vector databases, feature stores), leading to synchronization issues and making it difficult to trace the exact input data used for a specific model run.

Katz introduces lakeFS as the solution, drawing a direct parallel to Git for Data. lakeFS sits as a facade over the object store, allowing users to branch the entire data lake for experimentation. This branching is efficient, as it only tracks changes, not full copies. Users can modify, add, or remove data within their branch, knowing their work is isolated and reproducible because they reference a specific, frozen snapshot (version) of the data. Changes can then be merged back via a controlled pull request workflow, often incorporating automated quality checks.
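
The reproducibility guarantee described above rests on one mechanism: a training run records exactly which frozen version of the data it read. A minimal sketch of that idea (function and field names are hypothetical, not a lakeFS API) is to content-address the inputs and store the digest alongside the run:

```python
# Sketch of snapshot-pinned reproducibility: fingerprint the exact input
# data and record the digest with the run. For a deterministic pipeline,
# same code + same data digest => same result.
import hashlib

def snapshot_id(objects: dict) -> str:
    """Content-address a key -> bytes mapping, independent of insertion order."""
    h = hashlib.sha256()
    for key in sorted(objects):
        h.update(key.encode())
        h.update(objects[key])
    return h.hexdigest()[:12]

inputs = {"train/a.csv": b"1,2,3", "train/b.csv": b"4,5,6"}
run_record = {"code_version": "abc123", "data_snapshot": snapshot_id(inputs)}

# Later, before a re-run: verify the data is still the exact frozen version.
assert snapshot_id(inputs) == run_record["data_snapshot"]
```

In practice the snapshot ID is a commit reference on the data branch rather than a hash you compute yourself, but the contract is the same: the run is pinned to an immutable data state.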

Looking ahead, Katz notes the industry trend of data convergence on the object store. He highlights the importance of technologies like Apache Iceberg, which enables structured table definitions on top of object storage, allowing different compute engines to access the same consistent data state. Ultimately, the conversation stresses that as AI models become more sophisticated, the underlying data management must evolve from simple storage to sophisticated, version-controlled asset management to ensure reliability and scalability.

🏢 Companies Mentioned

Martin Kleppmann ai_research
Designing Data-Intensive Applications ai_research
Snowflake ai_application
AWS Athena ai_application
Pandas ai_application
Apache Iceberg unknown

💬 Key Insights

"What this guarantees is kind of a side effect is that whatever I'm building now is going to be reproducible later. Right? As long as my code doesn't introduce any variability into it, if it's deterministic, same code, same input data would guarantee the same result."
Impact Score: 10
"I want to introduce a change. I'll open a pull request. And John's team, they have to sign off on that change before it gets introduced. I'd have a very structured workflow to how changes get implemented to bring that same concept, that same notion to the data itself. Right? Not just the code, but also the data itself regardless of what modality it is or what type of actual business value it represents."
Impact Score: 10
"The place where this kind of gets difficult is that even though the dream or the idea of a data lake is that everything streams into that one centralized location, reality is a bit more dirty than that. Right. If I have images, yeah, I might have those images on my day, just the raw data lake, just as image files, but I'm probably also going to have an embedding of those images toward the vector database. And maybe I have some features that were extracted from those images in a database. Maybe labels in some other third-party labeling solution. Now instead of having that one source of truth that I was expecting to have, I have multiple sources of lies, right? They get out of sync very easily."
Impact Score: 10
"It used to be that the data that we get value from and that we can actually derive business impact out of was your, let's say, tabular data, right? Tables organized into a database, it's all very well structured. But that's not necessarily the case nowadays, right? Especially with advances around AI models, what we can do with them, we can extract value from a lot of different other kinds of data as well, right? So it could be images, it could be embeddings out of those images, labels attached to them, right? All different kinds of modalities that are representative of the information we have at the company."
Impact Score: 10

📊 Topics

#artificialintelligence 27 #aiinfrastructure 8 #startup 1

🤖 Processed with true analysis

Generated: October 04, 2025 at 02:39 PM