
Quality Guard: Continuous AI Output Monitoring That Catches Drift Before Your Users Do

JieGou's Quality Guard continuously samples production runs, scores them with an LLM judge, establishes quality baselines, and alerts you when output quality drifts — with auto-remediation built in.

JieGou Team · 6 min read

Bakeoffs tell you which prompt is better at a point in time. But prompts degrade. Model updates change behavior. Input distributions shift. A recipe that scored 92 last month might score 74 today, and you won’t know until a customer complains.

You need continuous monitoring, not one-time evaluation. That’s what Quality Guard does.

How Quality Guard works

Quality Guard attaches to any recipe from its detail page. Once enabled, it samples production runs at a configurable rate — default 5%, adjustable from 1% to 20%. Each sampled run is automatically scored by an LLM judge using weighted criteria.

The scoring is fire-and-forget: it never blocks run completion. Your production latency is unaffected. The evaluation happens asynchronously after the run finishes.
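The sampling gate can be sketched in a few lines. This is a minimal illustration, not Quality Guard's actual implementation: `should_sample` and the clamp bounds are hypothetical names standing in for the configurable 1%–20% rate described above.

```python
# Illustrative sketch of a sampling gate; names and structure are
# assumptions, not the real Quality Guard API.
import random

SAMPLE_RATE_MIN, SAMPLE_RATE_MAX = 0.01, 0.20  # configurable 1%-20%

def should_sample(rate: float, rng: random.Random) -> bool:
    """Decide whether this production run gets sent to the LLM judge."""
    rate = min(max(rate, SAMPLE_RATE_MIN), SAMPLE_RATE_MAX)  # clamp to bounds
    return rng.random() < rate

# At the default 5% rate, roughly 1 in 20 runs is sampled.
rng = random.Random(42)
sampled = sum(should_sample(0.05, rng) for _ in range(10_000))
```

In a real deployment the `True` branch would enqueue the run for asynchronous judging, which is how the evaluation stays off the production latency path.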

Two controls keep costs predictable:

  • Daily budget cap — Default 20 evaluations per day, configurable from 1 to 100
  • Judge model — Default is Claude Haiku 4.5 for cost efficiency. Switch to Sonnet for higher-accuracy evaluations when the stakes justify it

Budget tracking is backed by Redis with fail-open behavior — if Redis is temporarily unavailable, evaluations continue rather than silently dropping.
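The fail-open pattern is worth seeing concretely. Below is a minimal sketch under assumed names (`within_budget`, the key format, the `FakeRedis` stand-in); the only real API used is Redis's standard `INCR`/`EXPIRE` pair.

```python
# Sketch of fail-open daily budget tracking. On any Redis error we allow
# the evaluation rather than silently dropping it (fail-open).
def within_budget(redis_client, recipe_id: str, cap: int = 20) -> bool:
    key = f"qg:budget:{recipe_id}:today"  # illustrative key scheme
    try:
        count = redis_client.incr(key)    # atomic increment across workers
        if count == 1:
            redis_client.expire(key, 86_400)  # first hit sets the daily TTL
        return count <= cap
    except Exception:
        return True  # fail open: a Redis outage must not stop evaluations

class FakeRedis:
    """In-memory stand-in so the sketch runs without a Redis server."""
    def __init__(self):
        self.store = {}
    def incr(self, key):
        self.store[key] = self.store.get(key, 0) + 1
        return self.store[key]
    def expire(self, key, ttl):
        pass

r = FakeRedis()
results = [within_budget(r, "summarizer", cap=3) for _ in range(5)]
# First 3 calls fit the cap; the last 2 are rejected.
```

Because `INCR` is atomic, the cap holds even when many workers evaluate runs concurrently.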

Evaluation criteria

Each sampled run is scored from 0 to 100 using weighted criteria:

| Criterion | Weight | What it measures |
| --- | --- | --- |
| Relevance | 30% | How well the output addresses the input |
| Completeness | 25% | Whether all aspects of the request are covered |
| Clarity | 20% | Organization and readability |
| Accuracy | 15% | Factual correctness, absence of hallucinations |
| Format | 10% | Adherence to expected output structure |

These are the defaults. You can customize the criteria, adjust the weights, and change the judge model per recipe. A recipe that generates structured JSON might weight Format at 40%. A research summary recipe might weight Accuracy at 35%.
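The aggregation itself is a straightforward weighted average. Here is a sketch using the default weights from the table; the function name and dict shape are illustrative, and normalizing by the total weight means customized weights don't have to sum to exactly 1.

```python
# Illustrative weighted-score aggregation; the judge is assumed to return
# per-criterion scores on a 0-100 scale.
DEFAULT_WEIGHTS = {
    "relevance": 0.30,
    "completeness": 0.25,
    "clarity": 0.20,
    "accuracy": 0.15,
    "format": 0.10,
}

def overall_score(criterion_scores: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted average of per-criterion scores, normalized by total weight."""
    total = sum(weights.values())
    return sum(criterion_scores[c] * w for c, w in weights.items()) / total

score = overall_score({
    "relevance": 90, "completeness": 80, "clarity": 85,
    "accuracy": 70, "format": 100,
})
# score -> 84.5
```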

Baseline establishment

When you first enable Quality Guard, it enters a collecting phase. Evaluations accumulate without any drift analysis — there’s no baseline to compare against yet.

After 20 evaluations (configurable), the baseline is automatically computed. It stores:

  • Mean and standard deviation of overall scores
  • Percentiles: p5, p25, p50, p75, p95
  • Per-criterion statistics — mean and standard deviation for each individual criterion

Once the baseline is established, a notification is sent to all configured alert recipients. From that point forward, every new evaluation is compared against the baseline.

You can manually reset or recompute the baseline at any time — useful after a deliberate prompt change that you expect to shift scores.
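The baseline statistics described above can be computed with nothing more than the standard library. This sketch assumes a `compute_baseline` helper (a hypothetical name) fed the first batch of collected scores.

```python
# Sketch of baseline computation: mean, standard deviation, and the
# p5/p25/p50/p75/p95 percentiles listed above.
import statistics

def compute_baseline(scores: list[float]) -> dict:
    # quantiles(n=100) yields 99 cut points; qs[k-1] is the k-th percentile
    qs = statistics.quantiles(scores, n=100, method="inclusive")
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "percentiles": {p: qs[p - 1] for p in (5, 25, 50, 75, 95)},
    }

baseline = compute_baseline([88, 91, 85, 90, 87, 92, 86, 89, 90, 88,
                             84, 93, 87, 91, 89, 88, 90, 86, 92, 85])
```

Storing per-criterion means and standard deviations is the same computation applied to each criterion's score series separately.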

Drift detection

Quality Guard uses a rolling window of recent evaluations (default 30, minimum 5) to detect two types of drift:

Score drops. The rolling mean is compared against the baseline mean. Two thresholds trigger alerts:

  • Warning — 10-point drop from baseline (configurable 5-30)
  • Critical — 20-point drop from baseline (configurable 10-50)

Variance spikes. If the rolling standard deviation exceeds 2x the baseline standard deviation, Quality Guard flags it as quality becoming inconsistent — even if the mean hasn’t changed. This catches situations where a recipe alternates between great and terrible outputs.

The minimum 5-evaluation requirement for the rolling window prevents false positives from early noise.
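Both checks can be expressed compactly. This sketch uses the default thresholds from the bullets above; the function name, return values, and the ordering of the checks are assumptions rather than the documented behavior.

```python
# Illustrative drift check: rolling-mean drop vs. baseline, plus a
# variance spike (rolling stdev > 2x baseline stdev).
import statistics

def detect_drift(rolling: list[float], baseline_mean: float,
                 baseline_stdev: float,
                 warn_drop: float = 10.0, crit_drop: float = 20.0):
    if len(rolling) < 5:  # minimum window: avoid false positives from noise
        return None
    if statistics.stdev(rolling) > 2 * baseline_stdev:
        return "variance_spike"  # inconsistent quality, even if mean is steady
    drop = baseline_mean - statistics.mean(rolling)
    if drop >= crit_drop:
        return "critical"
    if drop >= warn_drop:
        return "warning"
    return None

alert = detect_drift([75, 78, 74, 77, 76],
                     baseline_mean=90.0, baseline_stdev=3.0)
# 14-point drop from baseline -> "warning"
```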

Alerting

When drift is detected, Quality Guard notifies through two channels:

In-app notifications go to all configured alert recipients immediately. Each notification includes the severity level, the current rolling score, the baseline score, and the magnitude of the drift.

Email alerts use severity-colored styling — red for critical drift, amber for warnings. Emails include the same metrics plus a direct link to the recipe’s quality dashboard.

An alert cooldown prevents notification fatigue. The default is 6 hours (configurable from 60 to 1440 minutes). During cooldown, drift continues to be tracked but additional alerts are suppressed. All alerts are acknowledgeable and tracked — you can see who acknowledged what and when.
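The cooldown gate reduces to a single timestamp comparison. A minimal sketch, with `should_alert` and the epoch-seconds timestamps as illustrative choices:

```python
# Sketch of the alert cooldown: suppress repeat alerts inside the window
# while drift tracking continues independently.
COOLDOWN_SECONDS = 6 * 60 * 60  # default 6h; configurable 60-1440 minutes

def should_alert(last_alert_at, now: float,
                 cooldown: float = COOLDOWN_SECONDS) -> bool:
    """True if there is no prior alert or the cooldown window has elapsed."""
    return last_alert_at is None or (now - last_alert_at) >= cooldown

t0 = 1_000_000.0
first = should_alert(None, now=t0)                  # no prior alert: fire
repeat = should_alert(t0, now=t0 + 3_600)           # 1h later: suppressed
later = should_alert(t0, now=t0 + 7 * 3_600)        # 7h later: fire again
```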

Auto-remediation

Quality Guard doesn’t just alert. It acts.

Prompt refinement. When drift is detected, Quality Guard automatically triggers a prompt refinement analysis. It examines the best-scoring and worst-scoring recent runs, identifies patterns in what’s degrading, and suggests specific prompt improvements. Rate limit: once per 24 hours.

Mini-bakeoffs. Quality Guard can auto-trigger a mini-bakeoff comparing the current prompt against the suggested improvements. This closes the loop — drift is detected, a fix is proposed, and the fix is evaluated, all without manual intervention. Rate limit: once per 7 days.

Knowledge base capture. High-quality outputs (score >= 85) are automatically captured to the recipe’s knowledge base, building a library of excellent examples over time.

Few-shot nomination. Good outputs (score >= 80) are auto-nominated as few-shot examples for the recipe’s prompt. The best outputs teach the recipe how to produce more outputs like them.

Quality dashboard

The quality dashboard gives you visibility across all monitored recipes.

Trend chart. An SVG visualization shows the score line (indigo), baseline mean (dashed green), interquartile range band (green shading), and drift markers — red circles for critical, amber for warnings. You see exactly when quality changed and by how much.

Recipe sparklines. Each monitored recipe shows a 14-day trend sparkline, a rolling 7-day average, and a trend arrow (up, down, or stable). Scan the list and immediately spot which recipes need attention.

Per-criterion breakdown. Drill into any recipe to see how individual criteria are trending. A recipe might maintain high Relevance and Completeness while Accuracy degrades — a pattern that’s invisible in an aggregate score.

Improvement report. A summary view across all recipes: how many improved, how many are stable, how many degraded. Average score change. Mini-bakeoffs triggered. This is the view for weekly team reviews.

How Quality Guard differs from bakeoffs

Bakeoffs and Quality Guard solve different problems:

| | Bakeoffs | Quality Guard |
| --- | --- | --- |
| Timing | One-time, on-demand | Continuous, automated |
| Comparison | Relative (A vs B) | Absolute (vs baseline) |
| Purpose | Experiment and choose | Monitor and maintain |
| Trigger | Manual | Automatic (production sampling) |

They complement each other. Quality Guard monitors. Bakeoffs experiment. When Quality Guard detects drift, it can auto-trigger a bakeoff to test a fix. When a bakeoff picks a winner and you deploy it, Quality Guard establishes a new baseline and watches for the next regression.

Cost control

Quality Guard is designed to run indefinitely without runaway costs. Three mechanisms keep spending predictable:

  1. Sample rate — Only a fraction of runs are evaluated (default 5%)
  2. Daily budget cap — Hard limit on evaluations per day (default 20)
  3. Judge model choice — Haiku for cost-efficient monitoring, Sonnet for high-accuracy evaluation

At default settings with Claude Haiku 4.5 as the judge, a recipe running 400 times per day generates approximately 20 judge evaluations (400 × 5% sample rate) — right at the default budget cap. Redis-backed budget tracking ensures the cap is enforced across distributed workers.

Availability

Quality Guard is available on Pro plans and above. Learn more about Quality Guard and other features or start your free trial.
