When you build an AI recipe, how do you know it’s the best version? When you pick a model, how do you know it’s the right one for the job? Most teams rely on intuition — run it a few times, eyeball the output, and move on. That works for prototyping, but not for production.
Today we’re launching bakeoffs: a built-in system for comparing AI recipes, models, and entire workflows with rigorous, automated evaluation.
What is a bakeoff?
A bakeoff runs the same inputs through two or more AI configurations and scores the results. The scoring is done by an independent LLM judge — not the model that produced the output — so results aren't skewed by a model grading its own work.
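At its core, the comparison loop looks something like the sketch below. Here `run_config` and `judge_output` are hypothetical stand-ins for executing one configuration and calling a separate judge model; they are not JieGou's actual API.

```python
def run_config(config, inp):
    # Stand-in for executing one recipe/model configuration on an input.
    return f"{config['name']} processed {inp}"

def judge_output(judge_model, inp, output):
    # Stand-in for an LLM judge call; returns a fake 0-10 score.
    return len(output) % 11

def bakeoff(configs, inputs, judge_model="independent-judge"):
    """Run every input through every configuration, then score each
    output with a judge that is not one of the models under test."""
    scores = {c["name"]: [] for c in configs}
    for inp in inputs:
        for config in configs:
            output = run_config(config, inp)
            scores[config["name"]].append(judge_output(judge_model, inp, output))
    return scores
```

The key design point is that the judge model is a separate parameter from the configurations being compared, which is what keeps the evaluation independent.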
You can compare across six modes:
- Recipe vs. recipe — Two different recipes processing the same inputs
- Model vs. model — The same recipe on different LLM providers (e.g., Claude vs. GPT)
- Full matrix — Every recipe × every model combination in a single evaluation grid
- Workflow vs. workflow — Full end-to-end workflow execution compared side-by-side
- Workflow model vs. model — The same workflow executed with different LLM providers across its steps
- A/B test — Live traffic splitting that routes real recipe executions between two variants
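To make the full-matrix mode concrete, here is a small sketch of how the evaluation grid expands. The recipe and provider names are illustrative placeholders, not real identifiers.

```python
from itertools import product

# Hypothetical recipe and provider names, for illustration only.
recipes = ["summarize_v1", "summarize_v2"]
models = ["claude", "gpt", "gemini"]

# Full-matrix mode evaluates every recipe x model combination as its own arm:
# 2 recipes x 3 models gives 6 arms in the evaluation grid.
arms = [{"recipe": r, "model": m} for r, m in product(recipes, models)]
```

The other modes are special cases of this grid: recipe vs. recipe fixes the model, model vs. model fixes the recipe.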
How scoring works
Each output is scored by an LLM judge on dimensions like quality, accuracy, relevance, and completeness.
For higher confidence, enable multi-judge mode with two or three independent judges. JieGou calculates inter-judge agreement using Kendall's tau and Spearman's rho rank correlation coefficients, so you can see whether judges converge or disagree. Results include 95% confidence intervals and standard deviations, so you can tell whether a difference is statistically meaningful or just noise.
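The agreement math is standard statistics. As a minimal sketch (not JieGou's implementation), here is the tie-free tau-a variant of Kendall's tau for two judges' score lists, plus a normal-approximation 95% confidence interval for a mean score:

```python
from math import sqrt

def kendall_tau(a, b):
    """Tau-a rank correlation: (concordant - discordant) / total pairs.
    Assumes no tied scores; production code would use tau-b to handle ties."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def ci95(scores):
    """Normal-approximation 95% confidence interval for the mean score."""
    n = len(scores)
    m = sum(scores) / n
    var = sum((x - m) ** 2 for x in scores) / (n - 1)
    half = 1.96 * sqrt(var / n)
    return m - half, m + half
```

A tau near 1.0 means the judges rank the outputs in nearly the same order; a tau near 0 means their rankings are unrelated, which is a signal to inspect the judging criteria.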
Synthetic inputs
Don’t have enough real data for a meaningful comparison? The synthetic input generator creates diverse test cases from your recipe or workflow input schemas. It reads the JSON Schema definitions — field names, types, descriptions, and constraints — and produces realistic inputs that cover a range of scenarios.
This is especially useful for new recipes that haven’t accumulated real-world usage data yet.
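The idea can be illustrated with a toy generator. Note the hedge: JieGou's generator is LLM-driven and reads field descriptions to produce realistic, scenario-diverse inputs, whereas this sketch only does type- and constraint-driven generation over a JSON Schema subset.

```python
import random
import string

def synth_value(schema):
    """Produce one plausible value for a JSON Schema node (illustrative subset)."""
    t = schema.get("type")
    if t == "string":
        if "enum" in schema:
            return random.choice(schema["enum"])
        n = schema.get("minLength", 8)
        return "".join(random.choices(string.ascii_lowercase, k=n))
    if t == "integer":
        return random.randint(schema.get("minimum", 0), schema.get("maximum", 100))
    if t == "boolean":
        return random.choice([True, False])
    if t == "array":
        return [synth_value(schema["items"]) for _ in range(schema.get("minItems", 2))]
    if t == "object":
        return {k: synth_value(v) for k, v in schema.get("properties", {}).items()}
    return None

# A hypothetical recipe input schema.
schema = {
    "type": "object",
    "properties": {
        "topic": {"type": "string", "minLength": 5},
        "tone": {"type": "string", "enum": ["formal", "casual"]},
        "word_count": {"type": "integer", "minimum": 100, "maximum": 500},
    },
}
sample = synth_value(schema)
```

Every generated input satisfies the schema's constraints, so each one is a valid test case for the recipe under evaluation.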
A/B test routing
For recipes and workflows already in production, bakeoffs support live A/B test routing. Traffic is split between two variants, and JieGou tracks performance using chi-square statistical testing. When one variant reaches statistical significance, routing automatically stops sending traffic to the losing variant.
Routing decisions are cached in Redis for consistency — the same user sees the same variant across requests.
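The routing mechanics can be sketched as follows. A stable hash stands in for the Redis-backed assignment cache, and the chi-square statistic is computed for a 2x2 success/failure table; both function names are illustrative, not JieGou's API.

```python
import hashlib

def assign_variant(user_id, experiment_id):
    """Deterministic 50/50 split: the same user always lands in the same arm.
    (JieGou caches assignments in Redis; a stable hash gives the same stickiness.)"""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).digest()
    return "A" if digest[0] % 2 == 0 else "B"

def chi_square_2x2(success_a, total_a, success_b, total_b):
    """Chi-square statistic for a 2x2 success/failure table (no continuity
    correction). Compare against 3.84 for p < 0.05 with 1 degree of freedom."""
    table = [[success_a, total_a - success_a],
             [success_b, total_b - success_b]]
    grand = total_a + total_b
    stat = 0.0
    for i in range(2):
        for j in range(2):
            row = sum(table[i])
            col = table[0][j] + table[1][j]
            expected = row * col / grand
            stat += (table[i][j] - expected) ** 2 / expected
    return stat
```

Once the statistic clears the significance threshold, the router can safely stop sending traffic to the losing arm, which is the auto-stop behavior described above.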
Bakeoff templates
Setting up a bakeoff — choosing arms, configuring judges, selecting input schemas — takes thought. Templates let you save a bakeoff configuration and reuse it later, so you don’t repeat that setup work every time you want to re-evaluate.
Templates support visibility scoping: keep them private, share with your department, or make them available account-wide. When your team establishes a standard evaluation methodology for a particular use case, saving it as a template ensures everyone evaluates consistently.
When to use bakeoffs
Bakeoffs are most valuable when:
- Choosing a model — You’re launching a new recipe and want to pick between Claude, GPT, and Gemini based on output quality, not assumptions
- Iterating on prompts — You’ve rewritten a recipe’s prompt and want to verify the new version is actually better before rolling it out
- Optimizing cost — A cheaper model might produce equivalent output for certain tasks, but you need data to prove it
- Comparing workflows — Two different automation strategies produce different outputs, and you need to know which is better end-to-end
Availability
Recipe and model bakeoffs are available on Pro plans. Workflow bakeoffs and A/B test routing are available on Enterprise. Learn more about bakeoffs or start your free trial.