906: How Prof. Jason Corso Solved Computer Vision’s Data Problem
🎯 Summary
This episode of the Super Data Science Podcast features Dr. Jason Corso, Professor at the University of Michigan and co-founder and Chief Science Officer of Voxel51. The discussion centers on the evolution of computer vision, the critical role of data over algorithms, and how Voxel51 is addressing a major bottleneck in visual AI development: data tooling and labeling.
1. Focus Area
The primary focus is Computer Vision (CV) Development Tooling and Data-Centric Machine Learning (ML). Specific topics include the explosion of academic CV research (evidenced by the growth of conferences such as CVPR), the shift in focus from model architecture to data quality, and the practical application of this philosophy in real-world systems such as autonomous vehicles.
2. Key Technical Insights
- Data Dominance: For many modern CV tasks, the performance ceiling is now dictated more by the quality and coverage of the training dataset than by the specific model architecture chosen (assuming standard, high-performing architectures are used).
- Curation is the New Annotation: The industry is moving beyond manual, brute-force data labeling (Annotation 1.0). The future involves leveraging foundation models for Verified Auto Labeling, where AI generates initial labels, and sophisticated ML ranking systems prioritize which samples require human verification (Annotation 1.5/2.0).
- Physically Grounded Systems: Professor Corso's academic work focuses on building AI systems that operate alongside humans, such as guiding rural healthcare providers through complex procedures (e.g., cardiac ultrasound) using vision-based guidance.
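The Verified Auto Labeling idea in the list above can be sketched as a simple confidence triage. This is a minimal illustration under stated assumptions, not Voxel51's implementation: the `AutoLabel` type, the `triage` function, the sample data, and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AutoLabel:
    sample_id: str
    label: str
    confidence: float  # model score in [0, 1] (hypothetical field)


def triage(labels, accept_threshold=0.9):
    """Split auto-labels into an accepted set and a human review queue.

    The review queue is sorted so the least confident samples come first.
    The threshold is an illustrative assumption, not a recommended value.
    """
    accepted = [l for l in labels if l.confidence >= accept_threshold]
    review = sorted(
        (l for l in labels if l.confidence < accept_threshold),
        key=lambda l: l.confidence,
    )
    return accepted, review


# Made-up auto-labels standing in for foundation-model output.
labels = [
    AutoLabel("img_001", "car", 0.98),
    AutoLabel("img_002", "truck", 0.55),
    AutoLabel("img_003", "pedestrian", 0.91),
    AutoLabel("img_004", "cyclist", 0.32),
]
accepted, review = triage(labels)
print([l.sample_id for l in accepted])  # ['img_001', 'img_003']
print([l.sample_id for l in review])    # ['img_004', 'img_002']
```

In practice the ranking signal would likely be richer than a single confidence score, but the split between "accept automatically" and "send to a human" is the core of the workflow described in the episode.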
3. Business/Investment Angle
- Tooling Gap: A significant market opportunity exists in providing robust development tools for visual AI, as the tooling around data analysis and model iteration has lagged behind algorithmic advancements.
- Cost Reduction in Labeling: Verified Auto Labeling offers massive cost and time savings by automating the labeling of high-confidence data segments (e.g., accepting 70% of labels automatically), reserving expensive human expertise only for corner cases and challenging scenarios.
- Data Strategy as Competitive Edge: Companies that master data curation and analysis—understanding failure modes and strategically adding data—will outperform those focused solely on chasing the newest model releases.
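To make the cost argument above concrete, here is a back-of-the-envelope sketch. Only the 70% auto-accept figure comes from the episode; the dataset size and per-label price are made-up assumptions, and the estimate ignores model inference cost and any spot-checking of accepted labels.

```python
def labeling_cost_cents(num_samples, cents_per_human_label, auto_accept_percent):
    """Return (baseline, verified) human labeling cost in cents.

    baseline: every sample is labeled by a human.
    verified: only samples that are not auto-accepted go to a human.
    Integer arithmetic is used to avoid floating-point rounding.
    """
    baseline = num_samples * cents_per_human_label
    human_samples = num_samples * (100 - auto_accept_percent) // 100
    verified = human_samples * cents_per_human_label
    return baseline, verified


# Hypothetical job: 100,000 images at 10 cents per human label, 70% auto-accepted.
baseline, verified = labeling_cost_cents(100_000, 10, 70)
print(baseline / 100)  # 10000.0 dollars
print(verified / 100)  # 3000.0 dollars
```

Under these assumptions the human labeling bill drops from $10,000 to $3,000, which is the kind of saving the auto-accept figure implies.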
4. Notable Companies/People
- Dr. Jason Corso: Professor at UMich (Robotics, EECS) and Chief Science Officer/Co-founder of Voxel51. His 20+ years of research bridge academia and industry, focusing on physically grounded cognitive systems.
- Voxel51: The company co-founded by Corso, which provides an open-source tool (with 3M+ installs) and a commercial platform for visual AI development, focusing on data analysis and model iteration workflows.
- Foundation Models: Mentioned as the key enablers for the next generation of automated labeling tools.
5. Future Implications
The industry is rapidly shifting toward data-centric ML, where the focus moves from writing code to curating and verifying data pipelines. The next evolution (Annotation 2.0) is predicted to be more agentic, where AI models proactively query human experts only when necessary, further minimizing manual involvement in the data lifecycle. This trend is crucial for achieving the near-perfect reliability required in safety-critical systems like autonomous driving.
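The "ask the human only when necessary" behavior described above could be approximated with a margin test on the model's class probabilities. This is one possible reading of the idea, not the method discussed in the episode: `should_ask_human`, the 0.2 margin, and the sample predictions are hypothetical.

```python
def should_ask_human(class_probs, min_margin=0.2):
    """Return True when the top two class probabilities are too close.

    A small gap between the best and second-best class is treated as a
    sign that the model is unsure and should query a human expert.
    """
    top_two = sorted(class_probs.values(), reverse=True)[:2]
    if len(top_two) < 2:
        return False  # a single-class prediction has no runner-up to compare
    return (top_two[0] - top_two[1]) < min_margin


# Made-up per-frame predictions from an unspecified model.
predictions = {
    "frame_01": {"car": 0.95, "truck": 0.04, "bus": 0.01},
    "frame_02": {"car": 0.48, "truck": 0.45, "bus": 0.07},
}
questions = [sid for sid, probs in predictions.items() if should_ask_human(probs)]
print(questions)  # ['frame_02']
```

Only the ambiguous frame generates a question for the human, which is the direction of control the episode describes: the model initiates the query rather than the annotator.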
6. Target Audience
This episode is highly valuable for ML Engineers, Data Scientists, Computer Vision Practitioners, and technical leaders involved in building, deploying, or investing in AI systems that rely on large visual datasets. The discussion on tooling, data pipelines, and the shift from annotation to curation is directly relevant to their daily challenges.
💬 Key Insights
"What is actually annotation 2.0? ... is this notion that instead of the humans asking the foundation models what they should label or what the labels are or what have you, there's more agentic, where there's a problem statement given, the amount of unlabeled data, and then the models are able to actually ask the humans questions just when it's necessary."
"The tagline that I like—not approved by marketing, but that I like these days—is 'Curation is the new annotation,' right?"
"The hardest part about this world of building highly successful, like 99.999 whatever percent accurate systems is getting the data, then getting the data labeled, and then training the model and figuring out what are the failure modes, what are the success cases, what are my failure modes, and where do I need to add more data and begin this process, right?"
"As the evolution from what some folks have called software 1.0, just code, to software 2.0, which is essentially just a different type of code—it's just humans can't really write it; we write other code to train it from data."
"When you think of machine learning, there's code, and then there's the data that goes into the code in some sense; like it gets kind of transformed into data weights or coefficients or something like that. But these two things are inseparable."
"And we basically began to build this conviction around data is at least as important, if not more important than the model architecture you choose."