XJailGuard
Explainable LLM security framework with a modular pipeline for input sanitization, intent classification, and output validation against prompt injection attacks.
What it is
XJailGuard is an explainable LLM safety gateway for multilingual conversational systems. It wraps a generative model with input screening, multi-turn jailbreak detection, output filtering, and token-level explanation, so that unsafe requests and responses are blocked before they reach the user.
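The control flow below is a minimal, illustrative sketch of that wrapping. All stage functions (`screen_input`, `screen_output`, `generate`, `explain_block`) are hypothetical names, shown as toy stubs so the skeleton runs end to end; the real project wires in trained classifiers, a quantized LLM, and SHAP explanations.

```python
from dataclasses import dataclass
from typing import Optional

# Toy stand-ins for the real pipeline stages, so the skeleton runs end to end.
# The actual components are sketched under "How it works" below.
def screen_input(prompt: str, history: list[str]) -> bool:
    return "ignore previous instructions" in prompt.lower()  # toy heuristic

def screen_output(text: str) -> bool:
    return False  # toy: allow every response

def generate(prompt: str) -> str:
    return f"echo: {prompt}"  # toy stand-in for the generation layer

def explain_block(text: str) -> dict:
    return {"tokens": text.split()}  # toy stand-in for SHAP attributions

@dataclass
class GatewayResult:
    blocked: bool
    text: str
    explanation: Optional[dict] = None

def guard(prompt: str, history: list[str]) -> GatewayResult:
    # Ingress: screen the prompt (plus recent turns) before any generation.
    if screen_input(prompt, history):
        return GatewayResult(True, "Request blocked.", explain_block(prompt))
    candidate = generate(prompt)
    # Egress: validate the candidate response before it reaches the user.
    if screen_output(candidate):
        return GatewayResult(True, "Response blocked.", explain_block(candidate))
    return GatewayResult(False, candidate)
```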
Problem it solves
Single-layer moderation often misses role-play jailbreaks, translated attack prompts, and chained multi-turn escalation. XJailGuard solves this by applying defense-in-depth at both ingress and egress, while exposing explainability artifacts for review and policy tuning.
How it works
- Accept the user prompt and route it through multilingual input classifiers that score jailbreak intent before generation (see the screening sketch after this list).
- Evaluate recent conversational context with a dedicated multi-turn detector to catch staged or progressive attack chains.
- Forward safe requests to a quantized Vicuna-family generation layer and collect the candidate output (loading sketch below).
- Pass generated output through a second safety classifier to prevent unsafe responses from leaving the system.
- Generate SHAP-based token attributions whenever content is blocked, so operators can inspect why a decision was made (attribution sketch below).
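A sketch of the ingress screening, covering both the single-prompt classifier and the multi-turn check. The checkpoint name and the `unsafe` label are placeholders, not the project's actual model, and the multi-turn detector here simply rescopes the same classifier over a joined window of recent turns, which may differ from the project's dedicated detector.

```python
from transformers import pipeline

# Placeholder checkpoint: any multilingual prompt-safety classifier would slot
# in here; the name and the "unsafe" label are assumptions, not the project's.
MODEL_ID = "path/to/multilingual-jailbreak-classifier"
clf = pipeline("text-classification", model=MODEL_ID)

def jailbreak_score(text: str) -> float:
    """Probability-like score that the text carries jailbreak intent."""
    result = clf(text, truncation=True)[0]
    return result["score"] if result["label"] == "unsafe" else 1.0 - result["score"]

def multi_turn_score(history: list[str], window: int = 4) -> float:
    """Score the recent conversation window as a whole, so staged attacks
    that look benign turn-by-turn are still caught."""
    context = " [SEP] ".join(history[-window:])
    return jailbreak_score(context)
```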
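The generation layer can be loaded lazily and in 4-bit via bitsandbytes. A sketch, assuming `lmsys/vicuna-7b-v1.5` as an example Vicuna checkpoint; the project's actual model and quantization settings may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "lmsys/vicuna-7b-v1.5"  # example checkpoint; actual choice may differ
_model, _tokenizer = None, None    # populated on first request, not at import time

def _load():
    global _model, _tokenizer
    if _model is None:
        quant = BitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
        _tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        _model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID, quantization_config=quant, device_map="auto")
    return _model, _tokenizer

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    model, tok = _load()
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens so only the new completion is returned.
    return tok.decode(output[0][inputs["input_ids"].shape[1]:],
                      skip_special_tokens=True)
```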
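For the explanation step, SHAP can wrap a transformers text-classification pipeline directly. A sketch under the same placeholder checkpoint as above; the returned structure is simplified for illustration and is not the project's actual artifact format.

```python
import shap
from transformers import pipeline

# top_k=None makes the pipeline return scores for every label, which is the
# output shape SHAP's text explainer expects. Placeholder checkpoint as above.
clf = pipeline("text-classification",
               model="path/to/multilingual-jailbreak-classifier", top_k=None)
explainer = shap.Explainer(clf)

def token_attributions(blocked_text: str):
    """Per-token SHAP attributions for a blocked prompt or response."""
    sv = explainer([blocked_text])
    # sv.data[0] holds the tokens; sv.values[0] their contribution per label.
    return list(zip(sv.data[0], sv.values[0].tolist()))
```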
Architecture diagram
Pipeline showing prompt filtering, LLM generation, response validation, and explainability stages.
Key capabilities
- Modular guardrail pipeline with separate input, output, and context-aware classifiers.
- Multilingual attack detection to handle unsafe prompts beyond English-only threat models.
- Token-level SHAP explanation views for blocked prompts and blocked model responses.
- Gradio-based interactive interface for security testing and model-behavior auditing (see the interface sketch after this list).
- Lazy LLM loading and quantized inference path for practical deployment on constrained hardware.
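A minimal sketch of the audit interface, reusing the `guard()` function from the gateway sketch above. The layout and widgets here are illustrative, not the project's actual UI.

```python
import gradio as gr

def audit(prompt: str, history_text: str):
    history = [line for line in history_text.splitlines() if line.strip()]
    result = guard(prompt, history)  # guard() from the gateway sketch above
    verdict = "BLOCKED" if result.blocked else "ALLOWED"
    return verdict, result.text

demo = gr.Interface(
    fn=audit,
    inputs=[gr.Textbox(label="Prompt"),
            gr.Textbox(label="Recent turns (one per line)", lines=4)],
    outputs=[gr.Textbox(label="Verdict"), gr.Textbox(label="Response")],
    title="XJailGuard audit console",
)

if __name__ == "__main__":
    demo.launch()
```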
Impact and outcomes
- Reduces jailbreak exposure for assistant workflows by filtering threats at both prompt and response stages.
- Improves trust and triage speed by pairing each block decision with transparent token-level evidence.
- Creates an extensible blueprint for explainable guardrails in multilingual production assistants.