Offline evaluation tells you which AI configuration looks better on test data. A/B testing tells you which one performs better in production, with real users and real inputs. JieGou’s bakeoff system supports both — and this guide covers the live A/B testing side.
When to A/B test (vs. offline evaluation)
Offline bakeoffs (comparing outputs on a fixed set of inputs) are great for:
- Initial model selection before launch
- Prompt iteration during development
- Comparing fundamentally different approaches
Live A/B testing is better when:
- You’ve already narrowed down to 2 strong candidates
- Production inputs differ from your test set in ways that matter
- You want to measure real-world performance over time
- Stakeholder buy-in requires production data, not test results
Setting up an A/B test
Here’s the step-by-step process in JieGou:
Step 1: Create a bakeoff with A/B routing
Navigate to the bakeoff section and select “A/B Test Routing” as the mode. Choose the two variants you want to compare — these can be two recipes, two model configurations, or two workflows.
Step 2: Configure the traffic split
By default, traffic splits 50/50 between variants. You can adjust this if you want to be conservative — for example, 90/10 to limit exposure to the experimental variant while still gathering data.
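Under the hood, a traffic split is a weighted random draw per new execution context. The sketch below is illustrative only — JieGou's routing internals aren't exposed, and the function and variant names are hypothetical:

```python
import random

# Hypothetical sketch of a 90/10 split; not JieGou's actual routing API.
def assign_variant(weights, rng=random):
    """Draw a variant according to the configured traffic split."""
    variants = list(weights)
    return rng.choices(variants, weights=[weights[v] for v in variants])[0]

rng = random.Random(42)
counts = {"A": 0, "B": 0}
for _ in range(10_000):
    counts[assign_variant({"A": 0.9, "B": 0.1}, rng)] += 1
# Over many requests, the observed split converges on the configured 90/10.
```

A conservative split like 90/10 slows down data collection on the experimental variant proportionally, so expect a longer wait before reaching significance.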
Step 3: Set auto-stop conditions
JieGou uses chi-square statistical testing to determine when one variant is significantly better than the other. You can configure:
- Minimum sample size — Don’t declare a winner until at least N executions have run through each variant
- Significance threshold — The p-value threshold for declaring a winner (default: 0.05)
When the auto-stop condition is met, JieGou automatically routes 100% of traffic to the winning variant and notifies you.
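The statistics behind the auto-stop check can be sketched with a standard Pearson chi-square test on a 2×2 contingency table, counting judged successes and failures per variant. This illustrates the math, not JieGou's implementation:

```python
import math

def chi_square_2x2(a_success, a_failure, b_success, b_failure):
    """Pearson chi-square test on a 2x2 contingency table.

    Returns (statistic, p_value). With one degree of freedom the
    chi-square survival function has a closed form via erfc.
    """
    table = [[a_success, a_failure], [b_success, b_failure]]
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(stat / 2))  # chi2 survival, 1 d.o.f.
    return stat, p_value

# Variant A: 30 judged "good" out of 100; variant B: 50 out of 100.
stat, p = chi_square_2x2(30, 70, 50, 50)
significant = p < 0.05  # the default significance threshold from Step 3
```

The minimum-sample-size condition simply gates this check: the test is not evaluated for a winner until each variant has accumulated at least N executions, which guards against declaring significance on early noise.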
Step 4: Monitor results
While the test is running, the bakeoff dashboard shows:
- Execution count per variant
- LLM judge scores over time
- Current statistical significance
- Estimated time to reach significance based on current traffic
Step 5: Review and finalize
When the test concludes (either by auto-stop or manual decision), review the full results: score distributions, confidence intervals, cost comparison, and execution time differences. Then promote the winning variant to be the default.
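As a rough guide to reading those confidence intervals, the difference in mean judge scores can be bracketed with a normal-approximation interval. This is a minimal sketch using only the standard library, not necessarily JieGou's exact method:

```python
import math
import statistics

def score_diff_ci(scores_a, scores_b, z=1.96):
    """Approximate 95% CI for the difference in mean judge scores."""
    diff = statistics.mean(scores_a) - statistics.mean(scores_b)
    se = math.sqrt(statistics.variance(scores_a) / len(scores_a)
                   + statistics.variance(scores_b) / len(scores_b))
    return diff - z * se, diff + z * se

# Synthetic example scores on a 1-5 judge scale.
scores_a = [4.0] * 50 + [5.0] * 50
scores_b = [3.0] * 50 + [4.0] * 50
lo, hi = score_diff_ci(scores_a, scores_b)
# If the interval excludes zero, the score difference is unlikely to be noise.
```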
Consistency guarantees
A/B routing decisions are cached in Redis. Once a specific execution context is assigned to a variant, it stays on that variant for the duration of the test. This prevents confusing behavior where the same recipe produces different results on consecutive runs.
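The assign-once behavior can be sketched as follows, with an in-memory dict standing in for Redis and a SETNX-style first-write-wins rule. The class and names are hypothetical:

```python
import random

class StickyRouter:
    """Assign each execution context to a variant once, then reuse it.

    A dict stands in for Redis here; the real system would use an
    atomic set-if-absent (SET ... NX) keyed by context, with a TTL
    scoped to the lifetime of the test.
    """
    def __init__(self, variants, rng=None):
        self.variants = variants
        self.rng = rng or random.Random()
        self.cache = {}  # context_id -> variant

    def route(self, context_id):
        if context_id not in self.cache:  # first write wins
            self.cache[context_id] = self.rng.choice(self.variants)
        return self.cache[context_id]

router = StickyRouter(["control", "candidate"])
first = router.route("ctx-123")
# Repeated runs of the same context always land on the same variant.
repeats_consistent = all(router.route("ctx-123") == first for _ in range(10))
```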
What to measure
LLM judge scores are the primary metric, but consider these additional signals:
- Execution cost — A slightly lower-quality variant that costs 60% less might be the better production choice
- Execution time — Faster responses improve user experience even if quality is equal
- Error rate — A variant that fails 5% of the time is worse than one that never fails, even if its successes score higher
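One way to combine these signals into a promotion decision is to gate on error rate first, then score, then cost. The helper below and its thresholds are hypothetical illustrations, not JieGou defaults:

```python
def pick_winner(a, b, max_error_rate=0.05, score_margin=0.1):
    """Weigh judge score against cost and error rate for two variants.

    `a` and `b` are dicts like
    {"score": 4.2, "cost": 0.013, "error_rate": 0.01}.
    """
    # Gate: disqualify any variant whose error rate is unacceptable.
    ok = [name for name, v in (("a", a), ("b", b))
          if v["error_rate"] <= max_error_rate]
    if len(ok) < 2:
        return ok[0] if ok else None  # one (or neither) survives the gate
    # A clear quality gap decides it.
    if abs(a["score"] - b["score"]) >= score_margin:
        return "a" if a["score"] > b["score"] else "b"
    # Scores effectively tied: the cheaper variant wins.
    return "a" if a["cost"] <= b["cost"] else "b"

a = {"score": 4.00, "cost": 0.010, "error_rate": 0.00}
b = {"score": 4.05, "cost": 0.025, "error_rate": 0.00}
winner = pick_winner(a, b)  # scores tied within margin, so cost decides
```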
Practical tips
- Run tests for at least 48 hours to capture variation in input patterns across different times of day and days of the week
- Don’t A/B test too many things at once — changing the model and the prompt simultaneously makes it impossible to attribute any difference to one change or the other
- Document your hypothesis before starting — “I expect the Claude variant to score higher on nuance but cost 2x more” helps you evaluate whether the results are actionable
- Use offline bakeoffs first to narrow the field, then A/B test the top 2 candidates in production
Availability
A/B test routing is available on Enterprise plans. Offline bakeoffs (recipe vs. recipe, model vs. model) are available on Pro. Learn more about all bakeoff modes.