CTIBench: Evaluating LLMs in Cyber Threat Intelligence with Nidhi Rastogi - #729
🎯 Summary
This episode features Dr. Nidhi Rastogi of the Rochester Institute of Technology discussing the intersection of Large Language Models (LLMs) and Cyber Threat Intelligence (CTI), with a focus on the development and application of her team's benchmark, CTIBench.
1. Focus Area
The primary focus is the application and rigorous evaluation of LLMs specifically within the domain of Cyber Threat Intelligence (CTI). Discussions covered the evolution from traditional ML/DL in security to the contextual understanding provided by LLMs, the role of Retrieval-Augmented Generation (RAG) in keeping models current, and the creation of a practical, analyst-centric benchmark (CTIBench) to measure performance on real-world CTI tasks.
2. Key Technical Insights
- LLMs Comprehend Diverse Syntax: LLMs (even smaller open-source models like Llama) demonstrated a surprising ability to comprehend non-prose data formats common in security, including log files, code, JSON, and even malware hashes.
- RAG for Temporal Relevance: Because of training-data cutoffs, Retrieval-Augmented Generation (RAG) is essential in CTI to inject the most recent threat intelligence, so models are not operating on stale information in a fast-moving threat landscape (a minimal sketch of this pattern follows this list).
- CTIBench Task Structure: The benchmark is designed around the daily activities of a threat analyst, encompassing knowledge retrieval (e.g., defining MITRE ATT&CK), reasoning (e.g., CVE mapping), threat attribution, and severity scoring (CVSS calculation).
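The episode doesn't describe a specific retrieval stack, so the Python sketch below only illustrates the generic RAG pattern discussed above: fetch advisories newer than the model's training cutoff and ground the prompt in them. The `advisory_index` object and its `search` method are hypothetical placeholders, not CTIBench or podcast code.

```python
# Minimal RAG-style pattern for CTI queries: retrieve recent advisories,
# then ask the model to answer only from that retrieved context.
# `advisory_index` and its fields are hypothetical placeholders.

from datetime import datetime, timedelta

def build_cti_prompt(question: str, advisory_index, days: int = 30) -> str:
    """Fetch threat advisories newer than the model's training cutoff
    and prepend them as context so answers reflect current intelligence."""
    cutoff = datetime.utcnow() - timedelta(days=days)
    recent = advisory_index.search(query=question, published_after=cutoff, top_k=5)
    context = "\n\n".join(f"[{a.source}] {a.title}: {a.summary}" for a in recent)
    return (
        "Answer using only the advisories below; say 'unknown' if they do not cover it.\n\n"
        f"Advisories:\n{context}\n\nQuestion: {question}"
    )
```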
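The severity-scoring task mentioned above ultimately reduces to the published CVSS v3.1 base-score formula from FIRST.org. The sketch below is that standard arithmetic, not CTIBench's grading code; it is included only to make concrete what a correct answer to such a question must compute.

```python
# CVSS v3.1 base-score arithmetic (FIRST.org specification), the kind of
# severity calculation a CTIBench scoring question checks against.

import math

WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},    # Scope Unchanged
    "PR_C": {"N": 0.85, "L": 0.68, "H": 0.5},   # Scope Changed
    "UI": {"N": 0.85, "R": 0.62},
    "CIA": {"H": 0.56, "L": 0.22, "N": 0.0},
}

def roundup(x: float) -> float:
    """CVSS v3.1 Roundup: smallest value, to one decimal place, >= x."""
    n = round(x * 100000)
    return n / 100000.0 if n % 10000 == 0 else (math.floor(n / 10000) + 1) / 10.0

def base_score(av, ac, pr, ui, scope, c, i, a) -> float:
    changed = scope == "C"
    pr_w = WEIGHTS["PR_C" if changed else "PR"][pr]
    exploitability = 8.22 * WEIGHTS["AV"][av] * WEIGHTS["AC"][ac] * pr_w * WEIGHTS["UI"][ui]
    iss = 1 - (1 - WEIGHTS["CIA"][c]) * (1 - WEIGHTS["CIA"][i]) * (1 - WEIGHTS["CIA"][a])
    impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15 if changed else 6.42 * iss
    if impact <= 0:
        return 0.0
    return roundup(min((1.08 if changed else 1.0) * (impact + exploitability), 10))

# e.g. CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H -> 9.8
print(base_score("N", "L", "N", "N", "U", "H", "H", "H"))
```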
3. Business/Investment Angle
- High Stakes of Hallucination: In CTI, LLM hallucinations are "detrimental," as convincing but false information can lead analysts to implement ineffective or harmful security measures. Accuracy is paramount.
- Efficiency Gains in CTI Analysis: LLMs can compress tasks that take human analysts hours (gathering, analyzing, and correlating threat reports) into seconds, offering massive productivity boosts for security teams.
- Need for Domain-Specific Benchmarking: The existence of CTIBench highlights a market gap: general LLM benchmarks are insufficient for specialized, high-stakes domains like cybersecurity, driving demand for domain-specific evaluation tools.
4. Notable Companies/People
- Nidhi Rastogi (RIT): The expert driving the research, focusing on applied AI/ML in cybersecurity and leading the development of CTIBench.
- General-Purpose LLMs (Llama, Gemini, ChatGPT/GPT-4): These models were the subjects benchmarked against CTIBench, with performance scaling with model size and sophistication (GPT-4 performed exceedingly well).
- Google (Sec-Gemini v1): Mentioned as a developer of a specialized, security-focused LLM that has utilized CTIBench for its own evaluation.
- Foundational Sources (NIST, MITRE ATT&CK, GDPR): These standards and frameworks provided the trustworthy source material used to construct the CTIBench questions.
5. Future Implications
The industry is moving toward specialized, continuously updated LLMs validated by domain-specific benchmarks. Future research (already underway in Rastogi's lab) will likely focus on extending benchmarks to cover remediation and mitigation strategies, moving beyond knowledge retrieval to actionable defense recommendations. The success of CTIBench suggests a trend toward creating standardized, practical evaluation suites for all critical AI applications.
6. Target Audience
This episode is highly valuable for Cybersecurity Professionals (especially Threat Intelligence Analysts and SOC Managers), AI/ML Researchers focusing on applied domains, and Technology Investors tracking the commercialization and validation of enterprise AI solutions.
💬 Key Insights
"It was definitely the need, which was very surprising to us, that there are so many benchmarks getting built or designed every other day, but there's nothing for cybersecurity which can be all applied in practice."
"And at the same time, it would say it very, very convincingly, adding information, you know, like, 'This threat was found in this organization at this location on this in this time period in this country,' and so on. So it would be very, very convincing but incorrect."
"Can the LLM explain itself? Can the LLM explain why it gave this response? That takes us to another side of my research, which is explainable AI. Can the machine learning model or AI model explain itself along with the confidence that it has in its response?"
"It's better to be aware of that, and that's what the role of the benchmark is. It's not to hide any kind of these edge cases or corner cases but to reveal them so the analyst knows that my model will be able to respond 80% of the time for these types of questions, and for these other types of questions, we need more of a human in the loop or some kind of human intervention or an expert intervention so we're not making mistakes that might cost us."
"benchmarks are kind of an approach to telling what the model is capable of doing and where it will fail, and identify those blind spots or those edge cases where it either needs better training data set or there's something else needs to be done. Turning a blind eye towards it is not going to help anybody."
"Very interestingly, pretty much every single model fared very poorly on some of the questions, and that was very interesting for us. Why? Because we kind of looked back at those questions: What is it about the questions? Is it about the training? And then we also had human evaluators who were also experts kind of checking if the question is complex or does it require a lot of detailed understanding or deep knowledge, which is what we determined was the case."