CTI-RAG

Threat Intel

RAG pipeline over CVE and MITRE ATT&CK datasets with custom chunking strategies and vector indexing for high-precision semantic retrieval of threat intelligence.

LangChain, ChromaDB, Streamlit, Python

What it is

CTI-RAG is a retrieval-augmented generation (RAG) assistant for cyber threat intelligence (CTI). It ingests CTI datasets, builds a semantic vector index, and answers analyst questions with citations to CVE and MITRE ATT&CK records.

Problem it solves

Threat intelligence is fragmented across vulnerability feeds, ATT&CK technique references, and internal notes, forcing analysts to manually cross-reference sources. CTI-RAG reduces lookup overhead by centralizing retrieval and grounding responses in indexed evidence.

How it works

Ingest CTI files from the dataset layer and normalize CVE and ATT&CK records into consistent retrieval-ready documents.
Chunk and embed documents with SentenceTransformer pipelines and persist vectors in ChromaDB.
Run semantic retrieval for each analyst query and assemble evidence-rich context through LangChain components.
Send grounded prompts to a Gemini model for generation and produce responses tied to concrete CVE and technique identifiers.
Expose the workflow through a Streamlit analyst interface for iterative investigation and mitigation-focused questioning.
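
The steps above can be sketched end to end in a few dozen lines. This is a minimal, standard-library-only illustration, not the project's implementation: the record fields, function names, and chunk sizes are hypothetical, the bag-of-words `embed` function stands in for a SentenceTransformer encoder, the in-memory list stands in for a ChromaDB collection, and the final prompt is printed rather than sent to Gemini. The sample record texts are abbreviated for illustration.

```python
import math
import re
from collections import Counter

def normalize_cve(rec):
    """Flatten a CVE-style record (hypothetical fields) into a retrieval document."""
    return {"id": rec["id"], "source": "cve",
            "text": f'{rec["id"]}: {rec["description"]}'}

def normalize_attack(rec):
    """Flatten an ATT&CK-style record (hypothetical fields) into a retrieval document."""
    return {"id": rec["technique_id"], "source": "attack",
            "text": f'{rec["technique_id"]} {rec["name"]}: {rec["description"]}'}

def chunk(doc, max_words=40, overlap=10):
    """Split a document into overlapping word windows that keep the source ID."""
    words = doc["text"].split()
    step = max_words - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        piece = " ".join(words[start:start + max_words])
        chunks.append({"id": doc["id"], "source": doc["source"], "text": piece})
    return chunks

def embed(text):
    """Toy bag-of-words vector (assumption: replaces a real sentence encoder)."""
    return Counter(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, index, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(index, key=lambda e: cosine(q, e["vector"]), reverse=True)[:k]

def build_prompt(query, hits):
    """Assemble a grounded prompt whose context lines carry bracketed source IDs."""
    context = "\n".join(f'[{h["id"]}] {h["text"]}' for h in hits)
    return ("Answer using only the context and cite IDs in brackets.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

# Abbreviated, illustrative records; real feeds carry many more fields.
records = [
    normalize_cve({"id": "CVE-2021-44228",
                   "description": "Apache Log4j2 JNDI features do not protect against "
                                  "attacker-controlled LDAP endpoints, allowing remote "
                                  "code execution."}),
    normalize_attack({"technique_id": "T1190", "name": "Exploit Public-Facing Application",
                      "description": "Adversaries may attempt to exploit a weakness in an "
                                     "Internet-facing host or program."}),
    normalize_attack({"technique_id": "T1566", "name": "Phishing",
                      "description": "Adversaries may send phishing messages to gain "
                                     "access to victim systems."}),
]

# "Index build": chunk every document and store each chunk with its vector.
index = [dict(c, vector=embed(c["text"])) for d in records for c in chunk(d)]

# "Query time": retrieve evidence and build the grounded prompt.
query = "Which technique covers exploiting a public-facing application?"
hits = retrieve(query, index, k=1)
prompt = build_prompt(query, hits)
print(prompt)
```

Keeping the source identifier on every chunk is what lets the generated answer cite a specific CVE or technique later; the embedding and storage layers can be swapped without changing that contract.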

Key capabilities

Semantic retrieval over mixed CTI formats for faster vulnerability and technique discovery.
Citation-oriented answer generation tied to specific CVE and ATT&CK artifacts.
Mitigation extraction workflow for turning retrieved intelligence into practical action items.
Two-stage execution model with explicit index-build and chatbot-run phases.
Extensible dataset folder pattern for onboarding additional feeds without redesigning the stack.
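
The two-stage execution model could be exposed as two subcommands of a single entry point. The sketch below is only an illustration of that split: the program name, subcommand names, flags, and default directories are all hypothetical and are not taken from the project, and each stage returns a description of what it would do instead of doing real work.

```python
import argparse

def build_parser():
    """Command-line parser with one subcommand per stage (all names hypothetical)."""
    parser = argparse.ArgumentParser(prog="cti-rag")
    sub = parser.add_subparsers(dest="command", required=True)

    build = sub.add_parser("build-index", help="ingest datasets and persist vectors")
    build.add_argument("--data-dir", default="datasets")
    build.add_argument("--persist-dir", default="chroma_store")

    chat = sub.add_parser("chat", help="answer questions against an existing index")
    chat.add_argument("--persist-dir", default="chroma_store")
    chat.add_argument("--top-k", type=int, default=5)
    return parser

def run(argv):
    """Dispatch to a stage; the return value is a placeholder for the real work."""
    args = build_parser().parse_args(argv)
    if args.command == "build-index":
        return f"indexing {args.data_dir} into {args.persist_dir}"
    return f"chat over {args.persist_dir} (top_k={args.top_k})"

print(run(["build-index", "--data-dir", "feeds"]))
print(run(["chat", "--top-k", "3"]))
```

Separating the stages this way means the expensive embedding pass runs only when the dataset folder changes, while the chat stage only reads the persisted index.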

Impact and outcomes

Cuts analyst time spent switching across disconnected intelligence sources.
Improves response traceability by returning grounded outputs with identifiable threat references.
Provides a reusable CTI-focused RAG architecture for SOC and threat research teams.
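
One way to make the traceability claim checkable is to compare the identifiers cited in an answer against the identifiers of the retrieved evidence. The following is a sketch under assumptions: the function names are hypothetical, and the patterns cover only the common `CVE-YYYY-NNNN` and ATT&CK `Tnnnn` / `Tnnnn.nnn` identifier shapes.

```python
import re

# CVE IDs have a four-digit year and a sequence number of four or more digits.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b")
# ATT&CK technique IDs, with an optional three-digit sub-technique suffix.
ATTACK_RE = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")

def extract_ids(text):
    """Collect every CVE and ATT&CK identifier mentioned in the text."""
    return set(CVE_RE.findall(text)) | set(ATTACK_RE.findall(text))

def check_citations(answer, context_ids):
    """Split cited identifiers into those backed by retrieved context and those not."""
    cited = extract_ids(answer)
    context_ids = set(context_ids)
    return {"grounded": sorted(cited & context_ids),
            "unsupported": sorted(cited - context_ids)}

answer = "Patch CVE-2021-44228 and monitor for T1190; see also T1059.001."
report = check_citations(answer, {"CVE-2021-44228", "T1190"})
print(report)
```

An identifier listed under `unsupported` does not prove the answer is wrong, only that the retrieved context did not contain it, which is a useful signal for an analyst to review.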