Real-Time Guardrails
Enforce safety boundaries and operational constraints so that your AI applications behave as intended in production. Guardrails evaluate every model input and output against configurable thresholds, then block, flag, or pass the content before it reaches users.
Why This Matters
- Production failures are visible and costly. When a model generates harmful content, leaks PII, or returns irrelevant answers in front of customers, the impact is immediate — regulatory exposure, reputational damage, and loss of trust.
- Safety can't be solved at design time alone. Users will find ways to misuse AI systems that pre-deployment testing cannot anticipate. Real-time guardrails provide the last line of defense.
- Compliance requires ongoing protection. Regulatory frameworks expect organizations to demonstrate continuous safeguards, not just pre-deployment testing.
How It Works
```mermaid
graph TB
    REQ["User Input"]
    REQ --> INPUT_CHK
    INPUT_CHK["Input Guardrail<br>Jailbreak, HAP"]
    INPUT_CHK -->|"block"| BLOCKED1["Input rejected"]
    INPUT_CHK -->|"pass"| MODEL["AI Model"]
    MODEL --> OUTPUT_CHK["Output Guardrail<br>PII, Safety, Quality"]
    OUTPUT_CHK --> DEC{"PASS / FLAG / BLOCK"}
    DEC -->|"pass"| RESP["Response delivered"]
    DEC -->|"flag"| FLAG["Delivered + flagged"]
    DEC -->|"block"| BLOCKED2["Fallback response"]
```
Three-tier response handling:
- PASS — Content is within acceptable limits, serve normally
- FLAG — Content is borderline, serve but log for human review
- BLOCK — Content violates thresholds, serve a fallback response instead
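The routing above can be sketched in plain Python. This is a minimal illustration, not the SDK's implementation: `guarded_call`, `Decision`, `Result`, and `FALLBACK` are hypothetical names, and the two check functions stand in for whatever guardrail evaluation you plug in.

```python
from enum import Enum
from typing import Callable, NamedTuple


class Decision(Enum):
    PASS = "pass"
    FLAG = "flag"
    BLOCK = "block"


class Result(NamedTuple):
    decision: Decision
    text: str


# Hypothetical fallback text served when an output is blocked.
FALLBACK = "Sorry, I can't help with that request."


def guarded_call(
    user_input: str,
    check_input: Callable[[str], Decision],
    model: Callable[[str], str],
    check_output: Callable[[str, str], Decision],
) -> Result:
    # Input guardrail: a blocked input never reaches the model.
    # Any other input decision is treated as "pass", as in the diagram.
    if check_input(user_input) is Decision.BLOCK:
        return Result(Decision.BLOCK, "Input rejected")

    response = model(user_input)
    decision = check_output(user_input, response)

    # Output guardrail: BLOCK swaps in the fallback response.
    if decision is Decision.BLOCK:
        return Result(Decision.BLOCK, FALLBACK)

    # PASS and FLAG both deliver the response; the caller can
    # route FLAG results to a human review queue.
    return Result(decision, response)
```

Returning the decision alongside the text lets the caller log flagged responses without a second evaluation pass.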
Guardrail Types
Built-in Content Safety
Detects harmful content using IBM watsonx.governance pre-trained models. No additional setup is required beyond API credentials.
| Metric | What It Detects | Threshold Type |
|---|---|---|
| HAP | Hate, abuse, and profanity | Upper-limit (block when exceeded) |
| PII | Names, emails, SSNs, phone numbers | Upper-limit |
| Jailbreak | Prompt injection and jailbreak attempts | Upper-limit |
| Social Bias | Stereotyping and discriminatory language | Upper-limit |
| Violence | Violent content | Upper-limit |
| Profanity | Profanity | Upper-limit |
| Harm | General harm | Upper-limit |
| Sexual Content | Sexual content | Upper-limit |
| Unethical Behavior | Unethical content | Upper-limit |
| Evasiveness | Evasive or non-committal responses | Upper-limit |
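As a rough sketch of how upper-limit thresholds could map scores to decisions, the snippet below assumes each metric returns a score in [0, 1], where higher means more likely unsafe. The threshold values, the metric keys, and the function names are illustrative assumptions, not SDK defaults.

```python
from typing import Dict, Tuple

# Hypothetical thresholds per metric: (flag_above, block_above).
UPPER_LIMITS: Dict[str, Tuple[float, float]] = {
    "hap": (0.5, 0.8),
    "pii": (0.3, 0.6),
    "jailbreak": (0.4, 0.7),
}

SEVERITY = {"PASS": 0, "FLAG": 1, "BLOCK": 2}


def classify_upper(score: float, flag_above: float, block_above: float) -> str:
    """Upper-limit rule: the decision escalates as the score exceeds each limit."""
    if score > block_above:
        return "BLOCK"
    if score > flag_above:
        return "FLAG"
    return "PASS"


def evaluate_safety(scores: Dict[str, float]) -> Tuple[str, Dict[str, str]]:
    """Classify each known metric, then take the most severe decision overall."""
    per_metric = {
        name: classify_upper(score, *UPPER_LIMITS[name])
        for name, score in scores.items()
        if name in UPPER_LIMITS
    }
    overall = max(per_metric.values(), key=SEVERITY.__getitem__, default="PASS")
    return overall, per_metric
```

For example, `evaluate_safety({"hap": 0.1, "pii": 0.45})` flags the content because the PII score exceeds its flag limit but not its block limit.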
RAG Quality Guardrails
Real-time quality checks on RAG pipeline responses. Blocks responses that are hallucinated or off-topic.
| Metric | What It Checks | Threshold Type |
|---|---|---|
| Answer Relevance | Does the response address the question? | Lower-limit (block when quality drops) |
| Context Relevance | Are retrieved passages relevant? | Lower-limit |
| Faithfulness | Is the response grounded in context? | Lower-limit |
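Lower-limit thresholds invert the comparison: a response is blocked when a quality score falls below its limit. The sketch below assumes scores in [0, 1] where higher is better. The limits, the fallback text, and `guard_rag_response` are hypothetical placeholders.

```python
from typing import Dict, Tuple

# Hypothetical limits per metric: (flag_below, block_below).
LOWER_LIMITS: Dict[str, Tuple[float, float]] = {
    "answer_relevance": (0.7, 0.5),
    "context_relevance": (0.6, 0.4),
    "faithfulness": (0.7, 0.5),
}

# Hypothetical fallback served instead of a low-quality answer.
RAG_FALLBACK = "I couldn't find a reliable answer in the available documents."


def classify_lower(score: float, flag_below: float, block_below: float) -> str:
    """Lower-limit rule: the decision escalates as the score drops below each limit."""
    if score < block_below:
        return "BLOCK"
    if score < flag_below:
        return "FLAG"
    return "PASS"


def guard_rag_response(response: str, scores: Dict[str, float]) -> Tuple[str, str]:
    """Return (decision, text to serve) for a RAG response and its quality scores."""
    decisions = [
        classify_lower(score, *LOWER_LIMITS[name])
        for name, score in scores.items()
        if name in LOWER_LIMITS
    ]
    if "BLOCK" in decisions:
        return "BLOCK", RAG_FALLBACK
    if "FLAG" in decisions:
        return "FLAG", response
    return "PASS", response
```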
Custom LLM-as-Judge Guardrails
Define your own guardrail criteria using an LLM evaluator. Two approaches:
- Prompt template — Full control over the evaluation prompt (e.g., answer completeness)
- Criteria + Options — Structured rubric with named options and scores (e.g., conciseness, helpfulness)
Both approaches use LLMAsJudgeMetric with WxAIFoundationModel as the judge.
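To make the "Criteria + Options" idea concrete, here is a library-independent sketch of a rubric that maps named options to scores. It does not use or reproduce the `LLMAsJudgeMetric` API. `Rubric`, its fields, and the `judge` callable are assumptions for illustration, and the judge can be any function that takes a prompt and returns the model's text reply.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Rubric:
    name: str
    criteria: str
    options: Dict[str, float]  # option label -> score

    def build_prompt(self, question: str, answer: str) -> str:
        """Render the evaluation prompt sent to the judge model."""
        labels = ", ".join(self.options)
        return (
            f"Criteria: {self.criteria}\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            f"Reply with exactly one of: {labels}"
        )

    def score(self, judge: Callable[[str], str], question: str, answer: str) -> float:
        """Ask the judge for an option label and translate it into a score."""
        reply = judge(self.build_prompt(question, answer)).strip().lower()
        if reply not in self.options:
            raise ValueError(f"Judge returned unknown option: {reply!r}")
        return self.options[reply]


# Example rubric for the conciseness criterion (labels and scores are illustrative).
conciseness = Rubric(
    name="conciseness",
    criteria="Is the answer free of unnecessary detail?",
    options={"concise": 1.0, "somewhat verbose": 0.5, "verbose": 0.0},
)
```

The resulting score can then be compared against a lower-limit threshold like any other quality metric. Rejecting unknown replies keeps a malformed judge output from silently passing the guardrail.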
Available Assets
| Script | What It Does |
|---|---|
| Content Safety Guardrails | Screen inputs/outputs for 10 safety metrics with configurable BLOCK/FLAG/PASS thresholds |
| RAG Quality Guardrails | Real-time faithfulness, relevance, context quality checks with fallback responses |
| Custom Guardrails | Define custom LLM-as-judge guardrails (completeness, conciseness, helpfulness) |
| Guardrail Pipeline | End-to-end: validate input → call model → validate output → audit log |
All scripts use the ibm_watsonx_gov SDK with MetricsEvaluator and GenAIConfiguration.
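The audit-log step of the pipeline could be approximated as one JSON line per guardrail decision. The sketch below uses only the standard library. The field names and the `audit_record` and `append_audit` helpers are hypothetical and do not reflect the format used by the scripts.

```python
import json
import time
from typing import Dict


def audit_record(
    request_id: str, stage: str, decision: str, scores: Dict[str, float]
) -> str:
    """Serialize one guardrail decision as a single JSON line."""
    record = {
        "timestamp": time.time(),
        "request_id": request_id,
        "stage": stage,        # "input" or "output"
        "decision": decision,  # "PASS", "FLAG", or "BLOCK"
        "scores": scores,
    }
    return json.dumps(record, sort_keys=True)


def append_audit(path: str, line: str) -> None:
    """Append a record to a JSON Lines audit file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
```

This sketch records scores and decisions rather than the raw text, so the audit log itself does not store content that a guardrail flagged for PII. Whether that trade-off fits depends on your review process.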