929: Dragon Hatchling: The Missing Link Between Transformers and the Brain, with Adrian Kosowski
🎯 Summary
This episode features Adrian Kosowski from Pathway discussing a new neural network architecture, BDH (Baby Dragon Hatchling), positioned as a potential successor to the Transformer. The core innovation lies in bridging the gap between high-performing modern AI (such as LLMs) and biologically plausible mechanisms observed in the human brain, particularly concerning attention and sparse activation.
1. Focus Area
The discussion centers on Artificial Intelligence and Machine Learning, specifically the development of post-Transformer architectures. Key themes include reconciling computational efficiency with biological realism, the concept of attention in both ML and neuroscience, and the role of Hebbian learning and sparse activation in next-generation reasoning models.
2. Key Technical Insights
- Bridging Attention Mechanisms: BDH utilizes an attention mechanism rooted in State Space Models (SSMs), which allows attention to be viewed locally (like biological neurons focusing on neighbors) rather than purely as a global context lookup structure, as is typical in standard Transformers.
- Sparse Activation for Efficiency: Unlike densely activated Transformers (where all parameters are used per inference step), BDH exhibits sparse positive activation, with only about 5% of its artificial neurons firing at any given time, mimicking the energy efficiency observed in biological brains (see the sketch after this list).
- Scalability and Performance: The BDH architecture, demonstrated successfully at the one-billion-parameter scale (comparable to GPT-2), is shown to be GPU-efficient and potentially outperforms Transformers on specific tasks, particularly those requiring complex, long-context reasoning.
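To make the sparsity figure above concrete, here is a minimal NumPy sketch (the hidden size, the positivity constraint, and the exact top-5% selection are illustrative assumptions, not details taken from the BDH paper) contrasting a dense layer, where every unit contributes at every step, with a sparse positive activation where only about 5% of units fire:

```python
import numpy as np

rng = np.random.default_rng(0)

n_neurons = 10_000                      # illustrative hidden size, not from the paper
x = rng.standard_normal(256)            # toy input vector
W = rng.standard_normal((n_neurons, x.size)) / np.sqrt(x.size)

# Dense activation: every neuron produces a (generally nonzero) value each step.
dense_act = W @ x
print("dense fraction active:", np.mean(dense_act != 0))   # ~1.0

# Sparse positive activation: keep only the top ~5% of positive pre-activations.
k = int(0.05 * n_neurons)
pre = np.maximum(W @ x, 0.0)                       # positivity constraint
threshold = np.partition(pre, -k)[-k]              # value of the k-th largest entry
sparse_act = np.where(pre >= threshold, pre, 0.0)
print("sparse fraction active:", np.mean(sparse_act > 0))  # ~0.05
```

The downstream computation then only needs to touch the roughly 5% of units that fired, which is where the claimed compute and energy savings come from.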
3. Business/Investment Angle
- Transformer Replacement Potential: BDH and the broader Baby Dragon family represent a significant technological shift that could challenge the dominance of the Transformer architecture in future large-scale model development.
- Efficiency Gains: The sparse activation mechanism promises substantial computational and energy savings, making future large models potentially more cost-effective to train and run compared to current dense models.
- Advancing Reasoning Capabilities: The architecture is specifically targeted at overcoming current LLM limitations in lifelong learning and generalizing complex reasoning—areas where human cognition excels—opening new commercial avenues for robust AI systems.
4. Notable Companies/People
- Adrian Kosowski (Pathway): The guest and lead researcher behind the BDH architecture and the Baby Dragon family of models.
- Donald Hebb: Mentioned for his foundational concept of Hebbian learning (“neurons that fire together, wire together”), which heavily influences the theoretical underpinnings of BDH.
- AWS (Sponsor): Mentioned for their purpose-built AI chip, Trainium, highlighting the hardware ecosystem supporting advanced AI development.
5. Future Implications
The conversation suggests the industry is moving toward biologically inspired architectures that prioritize efficiency and deep reasoning over sheer parameter count and dense computation. BDH hints at a future where AI models can handle virtually unlimited context windows efficiently and exhibit superior generalization in reasoning tasks, potentially leading to AI systems capable of true lifelong learning.
6. Target Audience
This episode is highly valuable for AI/ML Researchers, Deep Learning Engineers, Computational Neuroscientists, and Technology Strategists interested in the fundamental architectural shifts beyond the current Transformer paradigm.
Comprehensive Summary
The podcast episode dives deep into the Baby Dragon Hatchling (BDH) architecture developed by Adrian Kosowski and his team at Pathway, positioning it as a critical “missing link” between contemporary Transformer models and the mechanics of the biological brain.
The discussion begins by establishing the historical context: early neural networks were biologically inspired (like RNNs), but the Transformer architecture, while dominant, is computationally optimized for GPUs and diverges significantly from natural neural processing. Kosowski frames BDH as a reconciliation effort, starting from the concept of attention—a mechanism present in both neuroscience and ML.
A key theoretical anchor for BDH is Hebbian learning, the biological principle governing synaptic strengthening based on correlated firing. Kosowski contrasts the brain’s dynamic, local attention mechanisms (where neurons prioritize connections to neighbors) with the Transformer’s attention, which functions more like a global context search structure optimized for parallel processing.
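As a reminder of what the Hebbian principle means in practice, here is a textbook outer-product update rule in NumPy; the learning rate and toy vectors are illustrative and this is not presented as the update BDH itself uses:

```python
import numpy as np

def hebbian_update(W, pre, post, lr=0.01):
    """Textbook Hebbian rule: W[i, j] grows when post[i] and pre[j] are active together."""
    return W + lr * np.outer(post, pre)

pre = np.array([1.0, 0.0, 1.0])         # presynaptic activity
post = np.array([0.0, 1.0, 0.0, 1.0])   # postsynaptic activity
W = hebbian_update(np.zeros((4, 3)), pre, post)
print(W)  # only synapses linking co-active units are strengthened
```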
Technically, BDH is described as a post-Transformer architecture built upon State Space Models (SSMs). This SSM foundation allows attention to be interpreted locally, aligning better with biological models. Crucially, BDH introduces sparse positive activation. While current large models are densely activated (expensive), BDH achieves performance comparable to a one-billion-parameter GPT-2 model while only activating about 5% of its artificial neurons per step—a feature mirroring the energy efficiency of the human brain.
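The "attention as a local, state-based computation" framing can be illustrated with a generic linear-attention recurrence. This is not the BDH formulation (the feature dimension, the unnormalized scores, and the outer-product state are assumptions chosen for brevity), but it shows how an SSM-style view replaces a global lookup over all past tokens with a constant-size, step-by-step state update:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 16                           # toy feature dimension and sequence length
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

# Local, recurrent view: a fixed-size state S accumulates key/value outer
# products, so each step touches only the current token and the state.
S = np.zeros((d, d))
outputs = []
for t in range(T):
    S = S + np.outer(K[t], V[t])       # constant-size state update
    outputs.append(Q[t] @ S)           # read out through the current state
outputs = np.stack(outputs)

# Global-lookup view for comparison (causal, unnormalized attention).
reference = np.tril(Q @ K.T) @ V
print(np.allclose(outputs, reference))  # True: same result, computed locally
```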
The implications are twofold: ML breakthrough and Neuroscience insight. On the ML front, BDH aims to solve the Transformer’s known limitations in reasoning generalization and long-term context management. Kosowski suggests BDH removes the effective context window bottleneck, allowing for continuous, efficient learning over vast amounts of data, akin to human expertise development. This efficiency and improved reasoning capability suggest BDH could become the preferred architecture for next-generation LLMs that require deeper, more sustained cognitive functions beyond pattern matching.
🏢 Companies Mentioned
- Pathway: The company where Adrian Kosowski and his team developed the BDH architecture.
- AWS: Episode sponsor; mentioned for its purpose-built AI training chip, Trainium.
đź’¬ Key Insights
"reasoning model which goes through billions of tokens of context. Here you are in this space in which you can, for example, ingest our contextualized data set, private to enterprise, like the documentation of an entire technology, which is like one million pages of paper, one million sheets of paper, that's one billion tokens. You ingest it in a matter of minutes, given enough hardware in this architecture."
"What we are doing is we are entering reasoning models. We are entering it from the module scale, obviously. But this is a scale where we can display the advantage of this architecture."
"There's nothing particularly stopping us from releasing a super large model like in the 70-80 billion scale or larger. The kind of question which is super pertinent is, why do it? Because if you are in the world of language models, just language models, there's a certain market which we could call a bit of a commodity market for the kind of chatbot-like applications, discussions, and so on."
"you were able to concatenate, literally just like a concatenate operation. You could have one neural network trained on one language, let's say English, and you could have another language trained on, let's say, French... And because of the sparse activation, it just works, and it's a multilingual model."
"one fascinating technical detail here is that in our model... is the concept of mother synapse other than the grandmother neuron, which means that if you think of how the state of the system works, the state of the system, the context that we are listening to is represented by my synapse activation, and those synapses that are responsible or related to specific contexts activate in those settings."
"find in our network an individual synapse which is sufficient evidence for a concept being mentioned. So this touches on notion known as monosimplicity or being responsible for one concept and digging on one concept."