CTIBench: Evaluating LLMs in Cyber Threat Intelligence with Nidhi Rastogi - #729
🎯 Summary
This episode features Dr. Nidhi Rastogi of the Rochester Institute of Technology discussing the intersection of Large Language Models (LLMs) and Cyber Threat Intelligence (CTI), with a focus on the development and application of her team's benchmark, CTIBench.
1. Focus Area
The primary focus is the application and rigorous evaluation of LLMs specifically within the domain of Cyber Threat Intelligence (CTI). Discussions covered the evolution from traditional ML/DL in security to the contextual understanding provided by LLMs, the role of Retrieval-Augmented Generation (RAG) in keeping models current, and the creation of a practical, analyst-centric benchmark (CTIBench) to measure performance on real-world CTI tasks.
2. Key Technical Insights
- LLMs Comprehend Diverse Syntax: LLMs (even smaller open-source models like Llama) demonstrated a surprising ability to comprehend non-prose data formats common in security, including log files, code, JSON, and even malware hashes.
- RAG for Temporal Relevance: Because of training-data cutoffs, Retrieval-Augmented Generation (RAG) is essential in CTI to inject the most recent threat intelligence, so models are not operating on stale information in a fast-moving threat landscape (a minimal sketch of this pattern follows this list).
- CTIBench Task Structure: The benchmark is designed around the daily activities of a threat analyst, encompassing knowledge retrieval (e.g., defining MITRE ATT&CK), reasoning (e.g., CVE mapping), threat attribution, and severity scoring (CVSS calculation).
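The episode doesn't describe a specific retrieval stack, so the Python sketch below only illustrates the generic RAG pattern discussed above: fetch advisories newer than the model's training cutoff and ground the prompt in them. The `advisory_index` object and its `search` method are hypothetical placeholders, not CTIBench or podcast code.

```python
# Minimal RAG-style pattern for CTI queries: retrieve recent advisories,
# then ask the model to answer only from that retrieved context.
# `advisory_index` and its fields are hypothetical placeholders.

from datetime import datetime, timedelta

def build_cti_prompt(question: str, advisory_index, days: int = 30) -> str:
    """Fetch threat advisories newer than the model's training cutoff
    and prepend them as context so answers reflect current intelligence."""
    cutoff = datetime.utcnow() - timedelta(days=days)
    recent = advisory_index.search(query=question, published_after=cutoff, top_k=5)
    context = "\n\n".join(f"[{a.source}] {a.title}: {a.summary}" for a in recent)
    return (
        "Answer using only the advisories below; say 'unknown' if they do not cover it.\n\n"
        f"Advisories:\n{context}\n\nQuestion: {question}"
    )
```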
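The severity-scoring task mentioned above ultimately reduces to the published CVSS v3.1 base-score formula from FIRST.org. The sketch below is that standard arithmetic, not CTIBench's grading code; it is included only to make concrete what a correct answer to such a question must compute.

```python
# CVSS v3.1 base-score arithmetic (FIRST.org specification), the kind of
# severity calculation a CTIBench scoring question checks against.

import math

WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},    # Scope Unchanged
    "PR_C": {"N": 0.85, "L": 0.68, "H": 0.5},   # Scope Changed
    "UI": {"N": 0.85, "R": 0.62},
    "CIA": {"H": 0.56, "L": 0.22, "N": 0.0},
}

def roundup(x: float) -> float:
    """CVSS v3.1 Roundup: smallest value, to one decimal place, >= x."""
    n = round(x * 100000)
    return n / 100000.0 if n % 10000 == 0 else (math.floor(n / 10000) + 1) / 10.0

def base_score(av, ac, pr, ui, scope, c, i, a) -> float:
    changed = scope == "C"
    pr_w = WEIGHTS["PR_C" if changed else "PR"][pr]
    exploitability = 8.22 * WEIGHTS["AV"][av] * WEIGHTS["AC"][ac] * pr_w * WEIGHTS["UI"][ui]
    iss = 1 - (1 - WEIGHTS["CIA"][c]) * (1 - WEIGHTS["CIA"][i]) * (1 - WEIGHTS["CIA"][a])
    impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15 if changed else 6.42 * iss
    if impact <= 0:
        return 0.0
    return roundup(min((1.08 if changed else 1.0) * (impact + exploitability), 10))

# e.g. CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H -> 9.8
print(base_score("N", "L", "N", "N", "U", "H", "H", "H"))
```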
3. Business/Investment Angle
- High Stakes of Hallucination: In CTI, LLM hallucinations are "detrimental," as convincing but false information can lead analysts to implement ineffective or harmful security measures. Accuracy is paramount.
- Efficiency Gains in CTI Analysis: LLMs can compress tasks that take human analysts hours (gathering, analyzing, and correlating threat reports) into seconds, offering massive productivity boosts for security teams.
- Need for Domain-Specific Benchmarking: The existence of CTIBench highlights a market gap: general LLM benchmarks are insufficient for specialized, high-stakes domains like cybersecurity, driving demand for domain-specific evaluation tools.
4. Notable Companies/People
- Nidhi Rastogi (RIT): The expert driving the research, focusing on applied AI/ML in cybersecurity and leading the development of CTIBench.
- General-Purpose LLMs (Llama, Gemini, ChatGPT/GPT-4): These models were the subjects benchmarked against CTIBench, with performance scaling with model size and sophistication (GPT-4 performed exceedingly well).
- Google (Sec-Gemini v1): Mentioned as a developer of a specialized, security-focused LLM that has utilized CTIBench for its own evaluation.
- Foundational Sources (NIST, MITRE ATT&CK, GDPR): These standards and frameworks provided the trustworthy source material used to construct the CTIBench questions.
5. Future Implications
The industry is moving toward specialized, continuously updated LLMs validated by domain-specific benchmarks. Future research (already underway in Rastogi's lab) will likely focus on extending benchmarks to cover remediation and mitigation strategies, moving beyond knowledge retrieval to actionable defense recommendations. The success of CTIBench suggests a trend toward creating standardized, practical evaluation suites for all critical AI applications.
6. Target Audience
This episode is highly valuable for Cybersecurity Professionals (especially Threat Intelligence Analysts and SOC Managers), AI/ML Researchers focusing on applied domains, and Technology Investors tracking the commercialization and validation of enterprise AI solutions.
💬 Key Insights
"It was definitely the need, which was very surprising to us, that there are so many benchmarks getting built or designed every other day, but there's nothing for cybersecurity which can be all applied in practice."
"And at the same time, it would say it very, very convincingly, adding information, you know, like, 'This threat was found in this organization at this location on this in this time period in this country,' and so on. So it would be very, very convincing but incorrect."
"Can the LLM explain itself? Can the LLM explain why it gave this response? That takes us to another side of my research, which is explainable AI. Can the machine learning model or AI model explain itself along with the confidence that it has in its response?"
"It's better to be aware of that, and that's what the role of the benchmark is. It's not to hide any kind of these edge cases or corner cases but to reveal them so the analyst knows that my model will be able to respond 80% of the time for these types of questions, and for these other types of questions, we need more of a human in the loop or some kind of human intervention or an expert intervention so we're not making mistakes that might cost us."
"benchmarks are kind of an approach to telling what the model is capable of doing and where it will fail, and identify those blind spots or those edge cases where it either needs better training data set or there's something else needs to be done. Turning a blind eye towards it is not going to help anybody."
"Very interestingly, pretty much every single model fared very poorly on some of the questions, and that was very interesting for us. Why? Because we kind of looked back at those questions: What is it about the questions? Is it about the training? And then we also had human evaluators who were also experts kind of checking if the question is complex or does it require a lot of detailed understanding or deep knowledge, which is what we determined was the case."