Model Evaluation

Evaluate your AI and ML models against key metrics, including performance quality, fairness, reliability, drift, and bias, throughout the AI lifecycle.

Why This Matters

  • Unvalidated systems fail at the edges. LLM pipelines can hallucinate, leak sensitive data, or degrade when upstream data changes. Evaluation surfaces these issues before release.
  • Production failures are costly. Issues like PII leakage or ungrounded responses become significantly harder to diagnose once embedded in live workflows.
  • Compliance requires evidence. Regulatory frameworks such as the EU AI Act and NIST AI RMF expect structured testing with reproducible scoring and stored evaluation artifacts.
  • Baselines enable monitoring. Metrics captured at evaluation time become reference points for detecting drift and regression in production.

What's Covered

Area | What It Evaluates
Gen AI Evaluations | RAG pipelines, LLM outputs, chatbot safety: quality, safety, and readability metrics
Predictive ML Evaluations | Traditional ML models: scoring, confidence assessment, credit risk prediction

Gen AI Evaluations

Evaluate generative AI applications — RAG pipelines, LLM outputs, and chatbot safety — using IBM watsonx governance metrics.

Evaluation Scripts

Script | What It Evaluates
RAG Quality | Answer relevance, faithfulness, context relevance, retrieval precision, NDCG
Content Safety | HAP, PII, jailbreak, social bias, violence, profanity (15 metrics)
LLM-as-Judge | Evasiveness detection, topic relevance with system prompt boundaries
Readability | Text grade level, Flesch reading ease
Deployment Readiness | Combined quality and safety check with a pass/fail verdict
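
A combined pass/fail check like the Deployment Readiness script can be pictured as a threshold gate over already-computed metric scores. The sketch below is illustrative only and is not the script's actual logic. It assumes scores in the range 0 to 1, quality metrics where higher is better, and safety metrics where lower is better. The function name, the `Verdict` type, and all threshold values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    """Outcome of the readiness gate (hypothetical type for this sketch)."""
    passed: bool
    failures: list = field(default_factory=list)


def readiness_verdict(scores, quality_floors, safety_ceilings):
    """Return a pass/fail verdict from metric scores and thresholds.

    Assumptions (not taken from the evaluation scripts themselves):
    - every score is a float in [0, 1]
    - quality metrics must be at or above their floor
    - safety metrics must be at or below their ceiling
    - a metric missing from `scores` counts as a failure
    """
    failures = []
    for metric, floor in quality_floors.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing")
        elif value < floor:
            failures.append(f"{metric}: {value:.2f} below floor {floor:.2f}")
    for metric, ceiling in safety_ceilings.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing")
        elif value > ceiling:
            failures.append(f"{metric}: {value:.2f} above ceiling {ceiling:.2f}")
    return Verdict(passed=not failures, failures=failures)


# Example with made-up scores and thresholds.
example = readiness_verdict(
    scores={"faithfulness": 0.82, "answer_relevance": 0.90, "pii": 0.0, "hap": 0.30},
    quality_floors={"faithfulness": 0.70, "answer_relevance": 0.70},
    safety_ceilings={"pii": 0.10, "hap": 0.10},
)
print(example.passed, example.failures)
```

Listing the failing metrics next to the verdict keeps the result auditable, which fits the stored-evidence requirement described under Why This Matters.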

Metrics Reference

Metric | Category | Description
Faithfulness | Quality | Is the response grounded in the provided context?
Answer Relevance | Quality | Does the response address the user's question?
Answer Similarity | Quality | Semantic similarity to a ground-truth reference
Context Relevance | Retrieval | Are retrieved passages relevant to the query?
Retrieval Precision | Retrieval | Proportion of retrieved passages that are relevant
NDCG | Retrieval | Ranking quality of retrieved results
Hit Rate | Retrieval | Did at least one relevant passage get retrieved?
HAP | Safety | Hate, abuse, and profanity detection
PII | Safety | Personally identifiable information detection
Jailbreak | Safety | Prompt injection / jailbreak attempt detection
Social Bias | Safety | Stereotyping and discriminatory language
Evasiveness | Quality | Is the model dodging the question?
Topic Relevance | Quality | Is the response on-topic?
Text Grade Level | Readability | US school grade needed to understand the text
Text Reading Ease | Readability | Flesch Reading Ease score (0–100)
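
Several metrics in the table have simple closed-form definitions. The sketch below uses the standard textbook formulas for retrieval precision, hit rate, NDCG, and the Flesch readability scores. The watsonx governance implementations may differ in details such as relevance judgments, rank cutoffs, or syllable counting. All function names are illustrative, and treating Text Grade Level as the Flesch-Kincaid grade is an assumption.

```python
import math


def retrieval_precision(relevances):
    """Fraction of retrieved passages with relevance > 0 (0.0 if none retrieved)."""
    if not relevances:
        return 0.0
    return sum(1 for rel in relevances if rel > 0) / len(relevances)


def hit_rate(relevances):
    """1.0 if at least one retrieved passage is relevant, else 0.0."""
    return 1.0 if any(rel > 0 for rel in relevances) else 0.0


def dcg(relevances):
    """Discounted cumulative gain: relevance at rank i is divided by log2(i + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the best ordering of the same passages.

    Note: this normalizes over the retrieved list only, not over every
    relevant passage in the corpus.
    """
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease from raw counts; higher means easier to read."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level (assumed here to be the Text Grade Level metric)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Relevance labels for four retrieved passages, in ranked order (made-up data).
ranked = [1, 0, 1, 0]
print(retrieval_precision(ranked), hit_rate(ranked), round(ndcg(ranked), 4))
```

The readability functions take word, sentence, and syllable counts as inputs because syllable counting is heuristic and tool-dependent. Scores for the same text can therefore vary slightly between libraries.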

Predictive ML Evaluations

Evaluate predictive ML models deployed on IBM watsonx ML — scoring, confidence assessment, and interactive exploration.

Available Assets

Asset | What It Does
Credit Risk Prediction App | Interactive Dash web app for credit risk scoring with real-time predictions
Model Scoring API | Direct REST API calls to deployed watsonx ML models; suitable for batch scoring and pipeline integration

Both assets authenticate via IBM Cloud IAM and call deployed watsonx ML model endpoints.
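
The authenticate-then-score flow can be sketched with the Python standard library. This is a minimal illustration, not the code from either asset. It assumes the public IBM Cloud IAM token endpoint and the `input_data` payload shape (`fields` plus `values`) commonly used for watsonx ML online scoring. Confirm both against your deployment's documentation. The scoring URL and field names in the usage comment are placeholders.

```python
import json
import urllib.parse
import urllib.request

# Public IBM Cloud IAM endpoint that exchanges an API key for a bearer token.
IAM_TOKEN_URL = "https://iam.cloud.ibm.com/identity/token"


def build_iam_token_request(api_key):
    """Build the form-encoded POST that trades an API key for an access token."""
    body = urllib.parse.urlencode({
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": api_key,
    }).encode("utf-8")
    return urllib.request.Request(
        IAM_TOKEN_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded",
                 "Accept": "application/json"},
        method="POST",
    )


def build_scoring_request(scoring_url, token, fields, rows):
    """Build a scoring POST; the payload shape is an assumption (see lead-in)."""
    payload = {"input_data": [{"fields": fields, "values": rows}]}
    return urllib.request.Request(
        scoring_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def send_json(request, timeout=30):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (requires network access and real credentials; values are placeholders):
# token = send_json(build_iam_token_request("<your-api-key>"))["access_token"]
# result = send_json(build_scoring_request(
#     "<your-deployment-scoring-url>", token,
#     fields=["<feature_1>", "<feature_2>"], rows=[[0, 0]]))
```

Building requests separately from sending them lets you inspect and unit-test the payload without calling a live endpoint. For batch scoring, pass multiple rows in `rows`.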

Bob Modes

A Bob mode for Gen AI evaluation is available, providing an AI-assisted workflow that guides you through the evaluation process step by step.

GitHub Repository

Model Evaluation Assets