Offline evaluation tells you which AI configuration looks better on test data. A/B testing tells you which one performs better in production, with real users and real inputs. JieGou’s bakeoff system supports both — and this guide covers the live A/B testing side.
When to A/B test (vs. offline evaluation)
Offline bakeoffs (comparing outputs on a fixed set of inputs) are great for:
- Initial model selection before launch
- Prompt iteration during development
- Comparing fundamentally different approaches
Live A/B testing is better when:
- You’ve already narrowed down to 2 strong candidates
- Production inputs differ from your test set in ways that matter
- You want to measure real-world performance over time
- Stakeholder buy-in requires production data, not test results
Setting up an A/B test
Here’s the step-by-step process in JieGou:
Step 1: Create a bakeoff with A/B routing
Navigate to the bakeoff section and select “A/B Test Routing” as the mode. Choose the two variants you want to compare — these can be two recipes, two model configurations, or two workflows.
Step 2: Configure the traffic split
By default, traffic splits 50/50 between variants. You can adjust this if you want to be conservative — for example, 90/10 to limit exposure to the experimental variant while still gathering data.
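Under the hood, a traffic split is a weighted random draw per new execution context. The sketch below is illustrative only — JieGou's routing internals aren't exposed, and the function and variant names are hypothetical:

```python
import random

# Hypothetical sketch of a 90/10 split; not JieGou's actual routing API.
def assign_variant(weights, rng=random):
    """Draw a variant according to the configured traffic split."""
    variants = list(weights)
    return rng.choices(variants, weights=[weights[v] for v in variants])[0]

rng = random.Random(42)
counts = {"A": 0, "B": 0}
for _ in range(10_000):
    counts[assign_variant({"A": 0.9, "B": 0.1}, rng)] += 1
# Over many requests, the observed split converges on the configured 90/10.
```

A conservative split like 90/10 slows down data collection on the experimental variant proportionally, so expect a longer wait before reaching significance.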
Step 3: Set auto-stop conditions
JieGou uses chi-square statistical testing to determine when one variant is significantly better than the other. You can configure:
- Minimum sample size — Don’t declare a winner until at least N executions have run through each variant
- Significance threshold — The p-value threshold for declaring a winner (default: 0.05)
When the auto-stop condition is met, JieGou automatically routes 100% of traffic to the winning variant and notifies you.
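The statistics behind the auto-stop check can be sketched with a standard Pearson chi-square test on a 2×2 contingency table, counting judged successes and failures per variant. This illustrates the math, not JieGou's implementation:

```python
import math

def chi_square_2x2(a_success, a_failure, b_success, b_failure):
    """Pearson chi-square test on a 2x2 contingency table.

    Returns (statistic, p_value). With one degree of freedom the
    chi-square survival function has a closed form via erfc.
    """
    table = [[a_success, a_failure], [b_success, b_failure]]
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(stat / 2))  # chi2 survival, 1 d.o.f.
    return stat, p_value

# Variant A: 30 judged "good" out of 100; variant B: 50 out of 100.
stat, p = chi_square_2x2(30, 70, 50, 50)
significant = p < 0.05  # the default significance threshold from Step 3
```

The minimum-sample-size condition simply gates this check: the test is not evaluated for a winner until each variant has accumulated at least N executions, which guards against declaring significance on early noise.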
Step 4: Monitor results
While the test is running, the bakeoff dashboard shows:
- Execution count per variant
- LLM judge scores over time
- Current statistical significance
- Estimated time to reach significance based on current traffic
Step 5: Review and finalize
When the test concludes (either by auto-stop or manual decision), review the full results: score distributions, confidence intervals, cost comparison, and execution time differences. Then promote the winning variant to be the default.
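As a rough guide to reading those confidence intervals, the difference in mean judge scores can be bracketed with a normal-approximation interval. This is a minimal sketch using only the standard library, not necessarily JieGou's exact method:

```python
import math
import statistics

def score_diff_ci(scores_a, scores_b, z=1.96):
    """Approximate 95% CI for the difference in mean judge scores."""
    diff = statistics.mean(scores_a) - statistics.mean(scores_b)
    se = math.sqrt(statistics.variance(scores_a) / len(scores_a)
                   + statistics.variance(scores_b) / len(scores_b))
    return diff - z * se, diff + z * se

# Synthetic example scores on a 1-5 judge scale.
scores_a = [4.0] * 50 + [5.0] * 50
scores_b = [3.0] * 50 + [4.0] * 50
lo, hi = score_diff_ci(scores_a, scores_b)
# If the interval excludes zero, the score difference is unlikely to be noise.
```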
Consistency guarantees
A/B routing decisions are cached in Redis. Once a specific execution context is assigned to a variant, it stays on that variant for the duration of the test. This prevents confusing behavior where the same recipe produces different results on consecutive runs.
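The assign-once behavior can be sketched as follows, with an in-memory dict standing in for Redis and a SETNX-style first-write-wins rule. The class and names are hypothetical:

```python
import random

class StickyRouter:
    """Assign each execution context to a variant once, then reuse it.

    A dict stands in for Redis here; the real system would use an
    atomic set-if-absent (SET ... NX) keyed by context, with a TTL
    scoped to the lifetime of the test.
    """
    def __init__(self, variants, rng=None):
        self.variants = variants
        self.rng = rng or random.Random()
        self.cache = {}  # context_id -> variant

    def route(self, context_id):
        if context_id not in self.cache:  # first write wins
            self.cache[context_id] = self.rng.choice(self.variants)
        return self.cache[context_id]

router = StickyRouter(["control", "candidate"])
first = router.route("ctx-123")
# Repeated runs of the same context always land on the same variant.
repeats_consistent = all(router.route("ctx-123") == first for _ in range(10))
```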
What to measure
LLM judge scores are the primary metric, but consider these additional signals:
- Execution cost — A slightly lower-quality variant that costs 60% less might be the better production choice
- Execution time — Faster responses improve user experience even if quality is equal
- Error rate — A variant that fails 5% of the time is worse than one that never fails, even if its successes score higher
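One way to combine these signals into a promotion decision is to gate on error rate first, then score, then cost. The helper below and its thresholds are hypothetical illustrations, not JieGou defaults:

```python
def pick_winner(a, b, max_error_rate=0.05, score_margin=0.1):
    """Weigh judge score against cost and error rate for two variants.

    `a` and `b` are dicts like
    {"score": 4.2, "cost": 0.013, "error_rate": 0.01}.
    """
    # Gate: disqualify any variant whose error rate is unacceptable.
    ok = [name for name, v in (("a", a), ("b", b))
          if v["error_rate"] <= max_error_rate]
    if len(ok) < 2:
        return ok[0] if ok else None  # one (or neither) survives the gate
    # A clear quality gap decides it.
    if abs(a["score"] - b["score"]) >= score_margin:
        return "a" if a["score"] > b["score"] else "b"
    # Scores effectively tied: the cheaper variant wins.
    return "a" if a["cost"] <= b["cost"] else "b"

a = {"score": 4.00, "cost": 0.010, "error_rate": 0.00}
b = {"score": 4.05, "cost": 0.025, "error_rate": 0.00}
winner = pick_winner(a, b)  # scores tied within margin, so cost decides
```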
Practical tips
- Run tests for at least 48 hours to capture variation in input patterns across different times of day and days of the week
- Don’t A/B test too many things at once — changing the model and the prompt simultaneously makes it impossible to attribute any difference to one change or the other
- Document your hypothesis before starting — “I expect the Claude variant to score higher on nuance but cost 2x more” helps you evaluate whether the results are actionable
- Use offline bakeoffs first to narrow the field, then A/B test the top 2 candidates in production
Availability
A/B test routing is available on Enterprise plans. Offline bakeoffs (recipe vs. recipe, model vs. model) are available on Pro. Learn more about all bakeoff modes.