XJailGuard

LLM Security

Explainable LLM security framework with a modular pipeline for input sanitization, intent classification, and output validation against prompt-injection attacks.

Python · PyTorch · NLP · SHAP

What it is

XJailGuard is an explainable LLM safety gateway for multilingual conversational systems. It wraps a generative model with input screening, multi-turn jailbreak detection, output filtering, and token-level explanations, so unsafe requests and responses are blocked before they reach the user.

Problem it solves

Single-layer moderation often misses role-play jailbreaks, translated attack prompts, and chained multi-turn escalation. XJailGuard solves this by applying defense-in-depth at both ingress and egress, while exposing explainability artifacts for review and policy tuning.

How it works

  • Accept the user prompt and route it through multilingual input classifiers that score jailbreak intent before generation.
  • Evaluate recent conversational context with a dedicated multi-turn detector to catch staged or progressive attack chains.
  • Forward safe requests to a quantized Vicuna-family generation layer and collect the candidate model output.
  • Pass the generated output through a second safety classifier to prevent unsafe responses from leaving the system.
  • Generate SHAP-based token attributions whenever content is blocked, so operators can inspect why a decision was made (a sketch follows this list).

Architecture diagram

Pipeline showing prompt filtering, LLM generation, response validation, and explainability stages.

Key capabilities

  • Modular guardrail pipeline with separate input, output, and context-aware classifiers.
  • Multilingual attack detection to handle unsafe prompts beyond English-only threat models.
  • Token-level SHAP explanation views for blocked prompts and blocked model responses.
  • Gradio-based interactive interface for security testing and model-behavior auditing.
  • Lazy LLM loading and a quantized inference path for practical deployment on constrained hardware (sketched below).

Impact and outcomes

  • Reduces jailbreak exposure for assistant workflows by filtering threats at both prompt and response stages.
  • Improves trust and triage speed by pairing each block decision with transparent token-level evidence.
  • Creates an extensible blueprint for explainable guardrails in multilingual production assistants.