Run-time Monitoring¶
Why This Matters¶
An AI system that passes every evaluation at design time can still fail in production. User behavior shifts, upstream data changes, and adversarial inputs introduce conditions that pre-deployment testing cannot fully anticipate. Runtime monitoring closes this gap by providing continuous observation and real-time safeguards for AI systems operating in the real world.
For enterprise teams, this is critical because:
- Production failures are visible and costly. When a model generates harmful content, leaks PII, or returns irrelevant answers in front of customers, the impact is immediate — regulatory exposure, reputational damage, and loss of trust. Real-time guardrails provide the last line of defense before outputs reach users.
- Model quality degrades gradually. Drift in data distributions, shifts in user behavior, or changes to upstream systems can erode model performance over weeks or months. Without continuous measurement, teams discover degradation only after business metrics decline or customers complain.
- Compliance requires ongoing evidence. Regulatory frameworks don't just require pre-deployment testing — they expect organizations to demonstrate continuous monitoring and the ability to detect and respond to problems in production. Monitoring logs, drift alerts, and safety records serve as this operational evidence.
- Different AI modalities need different monitoring. Generative AI systems and traditional ML models have distinct failure modes. Generative AI can hallucinate, produce unsafe content, or drift in response quality. Traditional ML can drift in prediction accuracy, develop fairness issues, or degrade on new data distributions. Both need tailored monitoring approaches.
Runtime monitoring integrates with IBM watsonx.governance to provide production-grade observability and safeguards.
Architecture Overview¶
The runtime monitoring capability provides two independent layers — real-time guardrails for immediate protection and continuous monitoring for long-term observability. Each addresses a distinct use case and can be adopted independently.
Real-Time Guardrails¶
Real-time guardrails evaluate every AI input and output against configurable safety and quality thresholds before responses reach users.
graph TD
REQ["User / Application Request"]
subgraph guardrails["REAL-TIME GUARDRAILS"]
direction TB
subgraph metrics_row[" "]
direction LR
CS["Content Safety<br>HAP, PII, Harm,<br>Jailbreak, Bias,<br>Violence, etc."]
RQ["RAG Quality<br>Answer relevance,<br>Context relevance,<br>Faithfulness"]
RSQ["Response Quality<br>Completeness,<br>Conciseness,<br>Helpfulness,<br>Custom validators"]
end
DEC{"PASS / BLOCK<br>decision"}
end
REQ --> guardrails
CS --> DEC
RQ --> DEC
RSQ --> DEC
DEC -->|"pass"| RESP["Response delivered<br>to user"]
DEC -->|"block"| BLOCK["Response blocked<br>risk flagged"]
Continuous Monitoring¶
Continuous monitoring tracks quality, drift, and fairness metrics over time across both generative AI and traditional ML workloads.
graph TD
subgraph continuous["CONTINUOUS MONITORING"]
direction TB
subgraph mon_row[" "]
direction LR
GEN["Generative AI Monitoring<br><br>Manual evaluation<br>Scheduled evaluation<br>Custom metrics<br>Interactive dashboard"]
TRAD["Traditional AI Monitoring<br><br>Model risk management<br>Fairness & bias detection<br>Custom metrics & monitors"]
end
WOS["watsonx.governance<br>Subscriptions · Monitors · Feedback datasets · Drift tracking"]
WXG["watsonx.governance<br>Factsheets · Audit records · Governance artifacts"]
end
GEN --> WOS
TRAD --> WOS
WOS --> WXG
Real-time guardrails operate synchronously in the request path. Every input and output is scored against configurable thresholds. Content safety metrics use upper-limit thresholds (block when exceeded), RAG metrics use lower-limit thresholds (block when quality falls below), and response quality metrics use LLM-as-Judge or custom rule-based evaluation. Responses that violate thresholds are blocked before reaching the user.
Continuous monitoring operates asynchronously, independent of the request path. Evaluation data is collected over time — either manually, on a schedule, or via streaming payloads — and fed into watsonx.governance for trend analysis, drift detection, and alerting. An interactive dashboard provides visualization of metrics, drift, and governance artifacts.
Supported Metrics Reference¶
The metrics listed below are those demonstrated in the sample applications and notebooks included in the Trusted AI GitHub repository. The full set of metrics available through IBM watsonx.governance is more comprehensive — refer to the watsonx.governance documentation for the complete list.
Real-Time Guardrails Metrics¶
| Metric | Category | Description | Threshold Type |
|---|---|---|---|
| HAP | Content Safety | Hate, abuse, and profanity detection | Upper-limit |
| PII | Content Safety | Personally identifiable information detection | Upper-limit |
| Harm | Content Safety | General harm detection | Upper-limit |
| Violence | Content Safety | Violence-related content detection | Upper-limit |
| Profanity | Content Safety | Profanity detection | Upper-limit |
| Social Bias | Content Safety | Bias and stereotyping detection | Upper-limit |
| Jailbreak | Content Safety | Jailbreak attempt detection | Upper-limit |
| Unethical Behavior | Content Safety | Unethical content detection | Upper-limit |
| Sexual Content | Content Safety | Sexual content detection | Upper-limit |
| Evasiveness | Content Safety | Evasive or non-committal response detection | Upper-limit |
| Answer Relevance | RAG Quality | Whether the response addresses the user's question | Lower-limit |
| Context Relevance | RAG Quality | Whether retrieved passages are relevant to the query | Lower-limit |
| Faithfulness | RAG Quality | Whether the response is consistent with the provided context | Lower-limit |
| Answer Completeness | Response Quality | Whether the response fully addresses the question (LLM judge) | LLM-as-Judge |
| Conciseness | Response Quality | Whether the response avoids unnecessary verbosity (LLM judge) | LLM-as-Judge |
| Helpfulness | Response Quality | Whether the response is useful to the user (LLM judge) | LLM-as-Judge |
| Action-Oriented Validator | Response Quality | Custom rule-based check for actionable responses | Rule-based |
Continuous Monitoring Metrics¶
| Metric | Category | Applies To | Description |
|---|---|---|---|
| ROUGE | Quality | Generative AI | N-gram overlap between generated and reference text |
| Readability | Quality | Generative AI | Text readability scores for generated outputs |
| Drift (confidence) | Drift | Gen AI + Traditional | Change in model confidence distributions over time |
| Drift (prediction) | Drift | Gen AI + Traditional | Shift in model prediction distributions |
| Drift (metadata) | Drift | Gen AI + Traditional | Changes in input feature distributions |
| Model Health | Performance | Gen AI + Traditional | Operational health of the model deployment |
| Fairness | Fairness | Traditional AI | Bias detection across protected attributes |
| Indirect Bias | Fairness | Traditional AI | Bias detected through proxy features |
| Custom Metrics | Custom | Gen AI + Traditional | User-defined metrics attached via watsonx.governance |
Metrics by Use Case¶
| Use Case | Design-Time Baseline | Runtime Monitoring |
|---|---|---|
| Text Summarization | ROUGE, SARI, readability, sentence similarity | ROUGE/SARI drift, latency, PII/HAP violations |
| Content Generation | BLEU, METEOR, fluency, novelty | BLEU degradation, safety violations, failed generations |
| Question Answering | F1, exact match, faithfulness, relevance | F1/EM decline, hallucination rate, response time |
| Entity Extraction | Precision, recall, F1, span accuracy | Accuracy drop, latency, PII leakage |
| RAG Systems | Retrieval recall@k, nDCG, faithfulness, ROUGE | Retrieval drift, faithfulness decline, failed retrievals |
| Code Generation | CodeBLEU, syntax correctness, test pass rate | Execution failures, unsafe patterns, hallucinated blocks |
End-to-End Workflow¶
Setting Up Real-Time Guardrails¶
graph TD
A["Define safety thresholds<br>for each metric"] --> B["Start the guardrails<br>application"]
B --> C["Submit AI inputs/outputs<br>for evaluation"]
C --> D["Review pass/block<br>decisions + risk scores"]
D --> E["Tune thresholds based<br>on observed patterns"]
E -->|"iterate"| C
- Set thresholds. Define acceptable limits for each content safety, RAG quality, and response quality metric based on your risk tolerance.
- Launch. Start the guardrails application and access the dashboard.
- Evaluate. Submit AI system inputs and outputs through the interface. Each interaction is scored against all configured metrics.
- Review. Inspect the color-coded risk dashboard. Red indicates high risk (threshold violated), green indicates acceptable output.
- Tune. Adjust thresholds based on observed patterns — tighten limits where the system is too permissive, relax where it is overly conservative.
Setting Up Continuous Monitoring¶
graph TD
A["Manual evaluation<br>Establish baselines"] --> B["Create Prompt Template<br>Asset in watsonx"]
B --> C["Deploy runtime subscription<br>in watsonx.governance"]
C --> D["Score prompt inputs +<br>store feedback"]
D --> E["Create monitors +<br>plot baseline metrics"]
E --> F["Automated scheduled<br>evaluation"]
F --> G["Configure batch processing<br>+ scheduled runs"]
G --> H["Track drift, readability,<br>risk over time"]
H --> I["Custom metrics"]
I --> J["Define domain-specific<br>monitors"]
J --> K["Generate factsheets"]
K --> L[("watsonx.governance")]
H -->|"drift detected"| M["Feed back into<br>design-time evaluation"]
- Start with manual evaluation. Create a Prompt Template Asset, deploy a runtime subscription in watsonx.governance, score inputs from sample data, and establish baseline metrics.
- Automate evaluation. Configure batch processing and scheduled evaluation runs. Track ROUGE, readability, and risk metrics over time to detect drift.
- Add custom metrics. Define domain-specific monitors (e.g., user feedback scores, business-specific quality checks) and attach them to your deployment.
- Use the dashboard. Launch the interactive dashboard for visualization — manage prompts, run evaluations, monitor drift, and generate factsheets from a single interface.
- Close the loop. When monitoring detects degradation or drift, feed that signal back into the design-time evaluation phase. Re-evaluate, update, and re-deploy with full traceability.
For full source code, notebooks, setup instructions, and configuration details, visit the Trusted AI GitHub repository.