
CTI-RAG

Threat Intel

RAG pipeline over CVE and MITRE ATT&CK datasets with custom chunking strategies and vector indexing for high-precision semantic retrieval of threat intelligence.

LangChain · ChromaDB · Streamlit · Python

What it is

CTI-RAG is a retrieval-augmented generation (RAG) assistant for cybersecurity analysts. It ingests cyber threat intelligence (CTI) datasets, builds a semantic vector index, and answers analyst questions with citations to CVE and MITRE ATT&CK records.

Problem it solves

Threat intelligence is fragmented across vulnerability feeds, ATT&CK technique references, and internal notes, forcing analysts to manually cross-reference sources. CTI-RAG reduces lookup overhead by centralizing retrieval and grounding responses in indexed evidence.

How it works

  • Ingest CTI files from the dataset layer and normalize CVE and ATT&CK records into consistent retrieval-ready documents.
  • Chunk and embed documents with SentenceTransformer pipelines and persist vectors in ChromaDB.
  • Run semantic retrieval for each analyst query and assemble evidence-rich context through LangChain components.
  • Send the grounded prompt to a Gemini-backed generator and return responses that cite concrete CVE and technique identifiers.
  • Expose the workflow through a Streamlit analyst interface for iterative investigation and mitigation-focused questioning.
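
The ingest, chunk, embed, retrieve, and prompt steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the project's code. The record field names (cve_id, technique_id, name, description) and the chunk sizes are hypothetical. A bag-of-words vector stands in for the SentenceTransformer embedding, an in-memory list stands in for ChromaDB, and the prompt is only built, not sent to Gemini. The three sample records are short paraphrases used for illustration.

```python
import math
import re
from collections import Counter


def normalize_cve(rec):
    """Flatten a CVE record (hypothetical field names) into a retrieval-ready document."""
    return {"id": rec["cve_id"], "source": "CVE",
            "text": f'{rec["cve_id"]}: {rec["description"]}'}


def normalize_attack(rec):
    """Flatten an ATT&CK technique record (hypothetical field names) the same way."""
    return {"id": rec["technique_id"], "source": "ATT&CK",
            "text": f'{rec["technique_id"]} {rec["name"]}: {rec["description"]}'}


def chunk(doc, max_words=40, overlap=10):
    """Split a document into overlapping word windows that keep the source identifier."""
    words = doc["text"].split()
    step = max_words - overlap
    chunks = []
    for start in range(0, max(len(words), 1), step):
        piece = words[start:start + max_words]
        if piece:
            chunks.append({"id": doc["id"], "source": doc["source"],
                           "text": " ".join(piece)})
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy bag-of-words vector; a stand-in for a SentenceTransformer embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_index(docs):
    """In-memory list of (chunk, vector) pairs; a stand-in for a ChromaDB collection."""
    return [(c, embed(c["text"])) for d in docs for c in chunk(d)]


def retrieve(index, query, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]


def build_prompt(query, hits):
    """Assemble a grounded prompt in which every evidence line carries its identifier."""
    context = "\n".join(f'[{h["id"]}] {h["text"]}' for h in hits)
    return ("Answer using only the evidence below and cite identifiers in brackets.\n\n"
            f"Evidence:\n{context}\n\nQuestion: {query}")


# Illustrative sample records (paraphrased, not taken from the project's datasets).
docs = [
    normalize_cve({"cve_id": "CVE-2021-44228",
                   "description": "Apache Log4j2 JNDI lookups allow remote code "
                                  "execution via attacker-controlled LDAP endpoints."}),
    normalize_attack({"technique_id": "T1190", "name": "Exploit Public-Facing Application",
                      "description": "Adversaries exploit weaknesses in internet-facing "
                                     "hosts to gain initial access."}),
    normalize_attack({"technique_id": "T1566", "name": "Phishing",
                      "description": "Adversaries send phishing messages to gain access "
                                     "to victim systems."}),
]
index = build_index(docs)
query = "remote code execution in Log4j2"
hits = retrieve(index, query, k=2)
prompt = build_prompt(query, hits)
```

Keeping the identifier on every chunk is what lets the generator cite a specific CVE or technique. In a real deployment the same identifier would presumably travel as vector-store metadata rather than as a dictionary key.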

Key capabilities

  • Semantic retrieval over mixed CTI formats for faster vulnerability and technique discovery.
  • Citation-oriented answer generation tied to specific CVE and ATT&CK artifacts.
  • Mitigation extraction workflow for turning retrieved intelligence into practical action items.
  • Two-stage execution model with explicit index-build and chatbot-run phases.
  • Extensible dataset folder pattern for onboarding additional feeds without redesigning the stack.
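
Three of these capabilities lend themselves to a short sketch: the two-stage build and chat phases, the dataset folder pattern, and citation checking. The example below is illustrative only and rests on several assumptions. The folder layout (one subfolder per feed, each holding JSON files that contain a list of records) is hypothetical, as are the function names and the cti-rag-sketch command name. A JSON file stands in for the persisted ChromaDB index, and the chat phase only loads the index instead of starting a chatbot.

```python
import argparse
import json
import re
from pathlib import Path

# Matches CVE identifiers and ATT&CK technique or sub-technique identifiers.
ID_PATTERN = re.compile(r"\b(CVE-\d{4}-\d{4,}|T\d{4}(?:\.\d{3})?)\b")


def cited_ids(answer):
    """Collect every CVE or technique identifier mentioned in a generated answer."""
    return set(ID_PATTERN.findall(answer))


def unsupported_citations(answer, evidence_ids):
    """Identifiers the answer cites that were not in the retrieved evidence."""
    return cited_ids(answer) - set(evidence_ids)


def discover_feeds(dataset_root):
    """Map each subfolder name to its JSON files; a new feed is just a new folder."""
    root = Path(dataset_root)
    return {p.name: sorted(p.glob("*.json"))
            for p in sorted(root.iterdir()) if p.is_dir()}


def build_index(dataset_root, index_path):
    """Phase 1: read every feed and persist one combined index file.

    Assumption: each JSON file holds a list of record objects.
    """
    records = []
    for feed, files in discover_feeds(dataset_root).items():
        for path in files:
            for rec in json.loads(path.read_text(encoding="utf-8")):
                records.append({"feed": feed, **rec})
    Path(index_path).write_text(json.dumps(records), encoding="utf-8")
    return len(records)


def load_index(index_path):
    """Phase 2 entry point: load the index written by the build phase."""
    return json.loads(Path(index_path).read_text(encoding="utf-8"))


def main(argv=None):
    """Dispatch to the build or chat phase; returns the number of indexed records."""
    parser = argparse.ArgumentParser(prog="cti-rag-sketch")
    sub = parser.add_subparsers(dest="phase", required=True)
    build = sub.add_parser("build", help="ingest the dataset folder and write the index")
    build.add_argument("dataset_root")
    build.add_argument("index_path")
    chat = sub.add_parser("chat", help="load an existing index for querying")
    chat.add_argument("index_path")
    args = parser.parse_args(argv)
    if args.phase == "build":
        return build_index(args.dataset_root, args.index_path)
    return len(load_index(args.index_path))
```

Separating the phases means the expensive indexing step runs once, while the chat phase only needs the persisted index. A check like unsupported_citations could flag answers that name an identifier the retriever never returned.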

Impact and outcomes

  • Cuts analyst time spent switching across disconnected intelligence sources.
  • Improves response traceability by returning grounded outputs with identifiable threat references.
  • Provides a reusable CTI-focused RAG architecture for SOC and threat research teams.