LLMs for Equities Feature Forecasting at Two Sigma with Ben Wellington - #736
🎯 Summary
This episode of the TWIML AI Podcast features an in-depth discussion with Ben Wellington, Deputy Head of Feature Forecasting at Two Sigma, focusing on how Large Language Models (LLMs) and Generative AI are revolutionizing the creation and utilization of predictive features for quantitative equity trading.
1. Focus Area
The primary focus is the application of advanced NLP/GenAI techniques (specifically LLMs) to financial data analysis for the purpose of generating novel, high-signal features used in quantitative investment models. The discussion bridges Ben Wellington’s background in traditional NLP with the current paradigm shift driven by data-centric AI in finance.
2. Key Technical Insights
- The ROI Revolution in Feature Engineering: LLMs have drastically reduced the cost of testing complex hypotheses derived from unstructured data, from potentially months of specialized engineering to minutes (e.g., analyzing video content for non-verbal cues like “nose touching” during CEO interviews). This lowers the barrier to exploring previously intractable features (see the first sketch after this list).
- Shift from Syntactic/One-Hot Encoding to Embeddings: The evolution from older NLP methods (counting specific words, one-hot encoding) to dense vector embeddings allows models to capture semantic relationships between concepts (e.g., understanding that “innovate” and “creative” are related), leading to better generalization in prediction models (second sketch below).
- The Importance of Raw, Historical Data Capture: A core philosophy at Two Sigma is the imperative to record the rawest form of data possible (e.g., raw video feeds, unedited news wires). This “time capsule” approach ensures that future, yet-to-be-invented analytical techniques (like advanced vision models) can be applied retrospectively to historical data, a capability that is impossible if only derived features are saved (third sketch below).
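To make the first bullet concrete, here is a minimal sketch of prompting an off-the-shelf LLM to turn an unstructured transcript snippet into a numeric feature. The prompt, model choice, and `nervousness_score` helper are illustrative assumptions; the episode does not describe Two Sigma's actual tooling.

```python
# Minimal sketch: turning an unstructured-data hypothesis into a numeric
# feature with an off-the-shelf LLM. Hypothetical prompt and model choice;
# the episode does not describe Two Sigma's actual stack.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def nervousness_score(transcript_snippet: str) -> float:
    """Ask an LLM to rate apparent executive nervousness on a 0-10 scale."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Rate the speaker's apparent nervousness from 0 (calm) "
                        "to 10 (very nervous). Reply with a single number."},
            {"role": "user", "content": transcript_snippet},
        ],
    )
    return float(resp.choices[0].message.content.strip())

# A hypothesis that once needed a bespoke model is now a prompt:
# score = nervousness_score("Um, well, uh, regarding guidance, we, uh...")
```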
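A small, self-contained sketch of the second bullet's point: one-hot vectors for distinct words are always orthogonal, so no relatedness is visible, while dense embeddings (toy values here, standing in for a real embedding model) can place related concepts close together.

```python
# Why dense embeddings generalize where one-hot encodings cannot:
# one-hot vectors for distinct words are always orthogonal, so "innovate"
# and "creative" look exactly as unrelated as "innovate" and "turnip".
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = ["innovate", "creative", "turnip"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(cosine(one_hot["innovate"], one_hot["creative"]))  # 0.0 -- no signal

# Toy dense embeddings (made-up values; a real model, e.g. word2vec or an
# LLM embedding endpoint, would produce vectors with this qualitative shape).
dense = {
    "innovate": np.array([0.9, 0.8, 0.1]),
    "creative": np.array([0.8, 0.9, 0.2]),
    "turnip":   np.array([0.1, 0.0, 0.9]),
}
print(cosine(dense["innovate"], dense["creative"]))  # high (~0.99)
print(cosine(dense["innovate"], dense["turnip"]))    # low  (~0.16)
```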
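A sketch of the “time capsule” principle from the third bullet: write the rawest available bytes, with a capture timestamp, into an append-only store so future models can be replayed over history. The directory layout and `capture` helper are hypothetical, not Two Sigma's pipeline.

```python
# "Time capsule" sketch: persist the raw payload plus its capture time so
# future, not-yet-invented models can be run retrospectively. The filenames
# and layout are illustrative only.
import hashlib
import json
import pathlib
from datetime import datetime, timezone

ARCHIVE = pathlib.Path("raw_archive")

def capture(source: str, payload: bytes) -> pathlib.Path:
    """Append-only write of the rawest available bytes, never a derived feature."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    digest = hashlib.sha256(payload).hexdigest()[:16]
    path = ARCHIVE / source / f"{ts}_{digest}.bin"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    # Sidecar metadata so retrospective studies can trust the timeline.
    path.with_suffix(".json").write_text(
        json.dumps({"source": source, "captured_at": ts})
    )
    return path

# capture("news_wire", b"<raw, unedited wire payload>")
```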
3. Business/Investment Angle
- Feature Forecasting as the Core Business: Two Sigma’s objective is to predict future asset prices by quantifying the world into millions of observable “features.” Feature forecasting is the dedicated process of discovering, quantifying, and validating these signals.
- The Value of Holistic Data Capture: The firm prioritizes capturing data across all traded entities simultaneously (e.g., tracking job postings for every company), recognizing that a holistic, cross-sectional view often yields more predictive power than siloed data sets.
- Competitive Edge in Data Provenance: Having proprietary, time-stamped historical records (like unedited news feeds) that competitors lack provides a significant edge, as this data allows for testing hypotheses that others cannot validate historically.
4. Notable Companies/People
- Ben Wellington (Two Sigma): Deputy Head of Feature Forecasting, expert in NLP, driving the integration of GenAI into feature discovery.
- Two Sigma: The quantitative investment manager where this work is applied, focused on using data science to predict asset prices.
- NYU: Mentioned as the institution where Ben Wellington pursued his PhD in machine translation, situating his background in pre-LLM NLP research.
5. Future Implications
The conversation suggests a renaissance in feature creation. As the technical overhead for extracting complex signals from unstructured data plummets due to LLMs, researchers will shift from prioritizing technically feasible ideas to pursuing any hypothesis that seems potentially valuable, regardless of initial complexity. This democratization of feature engineering will likely lead to a rapid expansion of the feature space used in quantitative finance.
6. Target Audience
This episode is highly valuable for AI/ML Engineers, Quantitative Researchers (Quants), Data Scientists working in finance, and technology leaders interested in the practical, high-stakes application of Generative AI beyond consumer-facing products.
💬 Key Insights
"...which when you combine orthogonal signals, you get a much smoother response than when they're correlated signals."
"I'm not always looking at the best at things; I'm looking for a group of things that each have their own take that when I average out among them, I'm better off and more robust in the future than had I just picked one."
"there's not going to be a horse to bet on. You're going to be well-suited to have a diversified set of inputs to build interesting things, and you need to be comfortable using a wide array of technologies, not just betting on a single one."
"So, it is kind of scary for us to use an off-the-shelf model that's been trained on data in 2020 to ask questions from a document of 2019, right? So, if I could say, 'Hey, here's this Enron conference call. Do you think it's good or bad?' Is the word 'Enron' going to trigger a negative reaction because somewhere deep in the psyche of the LLM, there was a big bankruptcy?"
"That's an exact example where somebody had forced that hop [intermediate text step], you would actually have a less good system than we have today when they said, 'Oh, look, let me remove these abstractions that humans have added and just let the system go with enough data.'"
"The things that you build need to be plug-and-playable with the changing world."