XJailGuard
Explainable LLM security framework with a modular pipeline for input sanitization, intent classification, and output validation against prompt injection attacks.
What it is
XJailGuard is an explainable LLM safety gateway for multilingual conversational systems. It wraps a generative model with input screening, multi-turn jailbreak detection, output filtering, and token-level explanation, so that unsafe requests and responses are blocked before they reach the user.
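The control flow below is a minimal, illustrative sketch of that wrapping. All stage functions (`screen_input`, `screen_output`, `generate`, `explain_block`) are hypothetical names, shown as toy stubs so the skeleton runs end to end; the real project wires in trained classifiers, a quantized LLM, and SHAP explanations.

```python
from dataclasses import dataclass
from typing import Optional

# Toy stand-ins for the real pipeline stages, so the skeleton runs end to end.
# The actual components are sketched under "How it works" below.
def screen_input(prompt: str, history: list[str]) -> bool:
    return "ignore previous instructions" in prompt.lower()  # toy heuristic

def screen_output(text: str) -> bool:
    return False  # toy: allow every response

def generate(prompt: str) -> str:
    return f"echo: {prompt}"  # toy stand-in for the generation layer

def explain_block(text: str) -> dict:
    return {"tokens": text.split()}  # toy stand-in for SHAP attributions

@dataclass
class GatewayResult:
    blocked: bool
    text: str
    explanation: Optional[dict] = None

def guard(prompt: str, history: list[str]) -> GatewayResult:
    # Ingress: screen the prompt (plus recent turns) before any generation.
    if screen_input(prompt, history):
        return GatewayResult(True, "Request blocked.", explain_block(prompt))
    candidate = generate(prompt)
    # Egress: validate the candidate response before it reaches the user.
    if screen_output(candidate):
        return GatewayResult(True, "Response blocked.", explain_block(candidate))
    return GatewayResult(False, candidate)
```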
Problem it solves
Single-layer moderation often misses role-play jailbreaks, translated attack prompts, and chained multi-turn escalation. XJailGuard solves this by applying defense-in-depth at both ingress and egress, while exposing explainability artifacts for review and policy tuning.
How it works
- Accept the user prompt and route it through multilingual input classifiers that score jailbreak intent before generation (see the screening sketch after this list).
- Evaluate recent conversational context with a dedicated multi-turn detector to catch staged or progressive attack chains.
- Forward safe requests to a quantized Vicuna-family generation layer and collect the candidate output (loading sketch below).
- Pass generated output through a second safety classifier to prevent unsafe responses from leaving the system.
- Generate SHAP-based token attributions whenever content is blocked, so operators can inspect why a decision was made (attribution sketch below).
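A sketch of the ingress screening, covering both the single-prompt classifier and the multi-turn check. The checkpoint name and the `unsafe` label are placeholders, not the project's actual model, and the multi-turn detector here simply rescopes the same classifier over a joined window of recent turns, which may differ from the project's dedicated detector.

```python
from transformers import pipeline

# Placeholder checkpoint: any multilingual prompt-safety classifier would slot
# in here; the name and the "unsafe" label are assumptions, not the project's.
MODEL_ID = "path/to/multilingual-jailbreak-classifier"
clf = pipeline("text-classification", model=MODEL_ID)

def jailbreak_score(text: str) -> float:
    """Probability-like score that the text carries jailbreak intent."""
    result = clf(text, truncation=True)[0]
    return result["score"] if result["label"] == "unsafe" else 1.0 - result["score"]

def multi_turn_score(history: list[str], window: int = 4) -> float:
    """Score the recent conversation window as a whole, so staged attacks
    that look benign turn-by-turn are still caught."""
    context = " [SEP] ".join(history[-window:])
    return jailbreak_score(context)
```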
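The generation layer can be loaded lazily and in 4-bit via bitsandbytes. A sketch, assuming `lmsys/vicuna-7b-v1.5` as an example Vicuna checkpoint; the project's actual model and quantization settings may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "lmsys/vicuna-7b-v1.5"  # example checkpoint; actual choice may differ
_model, _tokenizer = None, None    # populated on first request, not at import time

def _load():
    global _model, _tokenizer
    if _model is None:
        quant = BitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
        _tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        _model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID, quantization_config=quant, device_map="auto")
    return _model, _tokenizer

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    model, tok = _load()
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens so only the new completion is returned.
    return tok.decode(output[0][inputs["input_ids"].shape[1]:],
                      skip_special_tokens=True)
```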
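For the explanation step, SHAP can wrap a transformers text-classification pipeline directly. A sketch under the same placeholder checkpoint as above; the returned structure is simplified for illustration and is not the project's actual artifact format.

```python
import shap
from transformers import pipeline

# top_k=None makes the pipeline return scores for every label, which is the
# output shape SHAP's text explainer expects. Placeholder checkpoint as above.
clf = pipeline("text-classification",
               model="path/to/multilingual-jailbreak-classifier", top_k=None)
explainer = shap.Explainer(clf)

def token_attributions(blocked_text: str):
    """Per-token SHAP attributions for a blocked prompt or response."""
    sv = explainer([blocked_text])
    # sv.data[0] holds the tokens; sv.values[0] their contribution per label.
    return list(zip(sv.data[0], sv.values[0].tolist()))
```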
Architecture diagram
Pipeline showing prompt filtering, LLM generation, response validation, and explainability stages.
Key capabilities
- Modular guardrail pipeline with separate input, output, and context-aware classifiers.
- Multilingual attack detection to handle unsafe prompts beyond English-only threat models.
- Token-level SHAP explanation views for blocked prompts and blocked model responses.
- Gradio-based interactive interface for security testing and model-behavior auditing (see the interface sketch after this list).
- Lazy LLM loading and quantized inference path for practical deployment on constrained hardware.
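A minimal sketch of the audit interface, reusing the `guard()` function from the gateway sketch above. The layout and widgets here are illustrative, not the project's actual UI.

```python
import gradio as gr

def audit(prompt: str, history_text: str):
    history = [line for line in history_text.splitlines() if line.strip()]
    result = guard(prompt, history)  # guard() from the gateway sketch above
    verdict = "BLOCKED" if result.blocked else "ALLOWED"
    return verdict, result.text

demo = gr.Interface(
    fn=audit,
    inputs=[gr.Textbox(label="Prompt"),
            gr.Textbox(label="Recent turns (one per line)", lines=4)],
    outputs=[gr.Textbox(label="Verdict"), gr.Textbox(label="Response")],
    title="XJailGuard audit console",
)

if __name__ == "__main__":
    demo.launch()
```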
Impact and outcomes
- Reduces jailbreak exposure for assistant workflows by filtering threats at both prompt and response stages.
- Improves trust and triage speed by pairing each block decision with transparent token-level evidence.
- Creates an extensible blueprint for explainable guardrails in multilingual production assistants.