CTI-RAG
Threat Intel
RAG pipeline over CVE and MITRE ATT&CK datasets with custom chunking strategies and vector indexing for high-precision semantic retrieval of threat intelligence.
LangChain, ChromaDB, Streamlit, Python
What it is
CTI-RAG is a retrieval-augmented generation (RAG) assistant for cyber threat intelligence (CTI). It ingests CTI datasets, builds a semantic vector index, and answers analyst questions with citations to CVE and MITRE ATT&CK records.
Problem it solves
Threat intelligence is fragmented across vulnerability feeds, ATT&CK technique references, and internal notes, forcing analysts to manually cross-reference sources. CTI-RAG reduces lookup overhead by centralizing retrieval and grounding responses in indexed evidence.
How it works
- Ingest CTI files from the dataset layer and normalize CVE and ATT&CK records into consistent retrieval-ready documents.
- Chunk and embed documents with SentenceTransformer pipelines and persist vectors in ChromaDB.
- Run semantic retrieval for each analyst query and assemble evidence-rich context through LangChain components.
- Send grounded prompts to Gemini-backed generation and produce responses tied to concrete CVE and technique identifiers.
- Expose the workflow through a Streamlit analyst interface for iterative investigation and mitigation-focused questioning.
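The ingestion and chunking steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's code: the `Document` type, the input record keys (`id`, `description`, `cvss`, `technique_id`, `name`), and the overlapping word-window chunker are all assumptions, since the project's custom chunking strategies are not detailed here.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical common shape for CVE and ATT&CK records."""
    doc_id: str   # e.g. a CVE ID or an ATT&CK technique ID
    source: str   # "cve" or "attack"
    text: str
    metadata: dict = field(default_factory=dict)


def normalize_cve(record: dict) -> Document:
    # Assumed input keys: "id", "description", optional "cvss".
    return Document(
        doc_id=record["id"],
        source="cve",
        text=f'{record["id"]}: {record["description"]}',
        metadata={"cvss": record.get("cvss")},
    )


def normalize_attack(record: dict) -> Document:
    # Assumed input keys: "technique_id", "name", "description".
    return Document(
        doc_id=record["technique_id"],
        source="attack",
        text=f'{record["technique_id"]} {record["name"]}: {record["description"]}',
        metadata={"name": record["name"]},
    )


def chunk(doc: Document, max_words: int = 40, overlap: int = 10) -> list[Document]:
    """Split a document into overlapping word windows.

    Each chunk keeps the parent ID in its metadata so that a retrieved
    chunk can still be cited by its CVE or technique identifier.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = doc.text.split()
    step = max_words - overlap
    chunks = []
    for n, start in enumerate(range(0, len(words), step)):
        piece = " ".join(words[start:start + max_words])
        meta = {**doc.metadata, "parent_id": doc.doc_id}
        chunks.append(Document(f"{doc.doc_id}#{n}", doc.source, piece, meta))
        if start + max_words >= len(words):
            break  # the current window already reaches the end of the text
    return chunks
```

In a real pipeline, each chunk's `text` would be passed to an embedding model and stored with its metadata in the vector store. Carrying `parent_id` on every chunk is one simple way to keep answers traceable to the original record.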
Key capabilities
- Semantic retrieval over mixed CTI formats for faster vulnerability and technique discovery.
- Citation-oriented answer generation tied to specific CVE and ATT&CK artifacts.
- Mitigation extraction workflow for turning retrieved intelligence into practical action items.
- Two-stage execution model with explicit index-build and chatbot-run phases.
- Extensible dataset folder pattern for onboarding additional feeds without redesigning the stack.
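The two-stage execution model listed above (build the index once, then query it at chat time) can be illustrated with a small standard-library sketch. Everything here is a stand-in chosen for illustration: `embed` is a toy bag-of-words vector rather than a SentenceTransformer embedding, a JSON file takes the place of the ChromaDB store, and the function names `build_index` and `query_index` are hypothetical.

```python
import json
import math
import re
import tempfile
from collections import Counter
from pathlib import Path


def embed(text: str) -> dict[str, float]:
    """Toy unit-length bag-of-words vector (stand-in for a real embedding model)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    # Both vectors are unit-normalized, so the dot product is the cosine similarity.
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def build_index(docs: dict[str, str], index_path: Path) -> None:
    """Stage 1: embed every document once and persist the result to disk."""
    rows = [
        {"id": doc_id, "text": text, "vector": embed(text)}
        for doc_id, text in docs.items()
    ]
    index_path.write_text(json.dumps(rows))


def query_index(index_path: Path, question: str, k: int = 2) -> list[tuple[str, float]]:
    """Stage 2: load the persisted index and rank documents for one question."""
    rows = json.loads(index_path.read_text())
    q = embed(question)
    scored = [(row["id"], cosine(q, row["vector"])) for row in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    # Illustrative records only; the texts are paraphrased placeholders.
    docs = {
        "T1059": "command and scripting interpreter abuse to execute commands",
        "T1566": "phishing messages with malicious attachments or links",
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "index.json"
        build_index(docs, path)                        # run once, offline
        print(query_index(path, "phishing attachments"))  # run per analyst query
```

Separating the phases this way means the expensive embedding work happens once, while the query phase only needs read access to the persisted index.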
Impact and outcomes
- Cuts analyst time spent switching across disconnected intelligence sources.
- Improves response traceability by returning grounded outputs with identifiable threat references.
- Provides a reusable CTI-focused RAG architecture for SOC and threat research teams.
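One way to make the traceability claim checkable is to compare the identifiers cited in a generated answer against the identifiers present in the retrieved evidence. The sketch below is an assumption about how such a check could look, not a description of the project's implementation; `extract_ids` and `check_citations` are hypothetical names.

```python
import re

# CVE IDs have the form CVE-<year>-<sequence>, with a sequence of four or more digits.
CVE_ID = re.compile(r"\bCVE-\d{4}-\d{4,}\b")
# ATT&CK technique IDs look like T1059, or T1059.001 for sub-techniques.
TECHNIQUE_ID = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")


def extract_ids(text: str) -> set[str]:
    """Collect every CVE and ATT&CK technique identifier mentioned in the text."""
    return set(CVE_ID.findall(text)) | set(TECHNIQUE_ID.findall(text))


def check_citations(answer: str, evidence: list[str]) -> dict[str, set[str]]:
    """Split the IDs cited in an answer into those backed by evidence and those not."""
    cited = extract_ids(answer)
    available = set().union(*(extract_ids(chunk) for chunk in evidence))
    return {"supported": cited & available, "unsupported": cited - available}
```

An answer with a non-empty `unsupported` set cites an identifier that never appeared in the retrieved context, which is a useful signal to flag the response for review or to regenerate it.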