Evaluating AI output is one of the hardest problems in applied AI. Human evaluation is the gold standard, but it’s slow, expensive, and doesn’t scale. JieGou’s bakeoff system uses LLM-as-judge — a technique where one language model evaluates the output of another — to automate quality scoring with statistical rigor.
Here’s how it works under the hood.
The basic setup
In a bakeoff, two variants (recipes, models, or workflows) process the same set of inputs. Each produces an output. An independent LLM judge — separate from the models being evaluated — scores each output on predefined dimensions.
The judge sees both outputs (anonymized as “Output A” and “Output B”) along with the original input and scoring criteria. It produces a structured score for each dimension: quality, accuracy, relevance, completeness, and an overall winner.
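Structured scores are only useful if they are well-formed, so a validation step is a natural companion to the judge call. Here is a minimal sketch in Python; the field names and the 1-10 scale are illustrative assumptions, not JieGou's actual schema.

```python
DIMENSIONS = ("quality", "accuracy", "relevance", "completeness")

def validate_score(score):
    # Accept only well-formed judge output: each dimension must be a
    # 1-10 integer, and the overall winner must be "A", "B", or "tie".
    dims_ok = all(isinstance(score.get(d), int) and 1 <= score[d] <= 10
                  for d in DIMENSIONS)
    return dims_ok and score.get("winner") in {"A", "B", "tie"}
```

Rejecting malformed judge responses (and retrying them) keeps one bad generation from corrupting an entire bakeoff's statistics.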
Why use an LLM as judge?
The alternative is manual evaluation: have a human read every output pair and score them. For small tests (5-10 inputs), that’s feasible. For meaningful statistical analysis (50-100+ inputs), it becomes a bottleneck.
LLM judges scale well: cost grows linearly with the number of inputs, but when calls run in parallel, evaluating 100 input pairs takes roughly the same wall-clock time as evaluating 10. The cost is predictable (it’s just tokens), and the evaluation is consistent. A human’s judgment drifts across a long evaluation session; an LLM applies the same criteria to the first pair and the hundredth.
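The parallelism is straightforward to sketch with a thread pool. In this sketch, `judge_pair` is a stub standing in for a real judge-model API call; the real call would be network-bound, which is exactly where a thread pool helps.

```python
from concurrent.futures import ThreadPoolExecutor

def judge_pair(pair):
    # Stub for a real LLM judge call; a production version would send
    # the input and both outputs to a judge model and parse its score.
    input_text, out_a, out_b = pair
    return {"input": input_text, "winner": "tie"}

def evaluate_all(pairs, max_workers=10):
    # Cost grows linearly with len(pairs); wall-clock time is bounded
    # by the slowest in-flight batch, not the total pair count.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(judge_pair, pairs))
```

`pool.map` preserves input order, so results line up with the original pairs without extra bookkeeping.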
The trade-off is that LLM judges have known biases: they tend to prefer longer outputs, more formal language, and outputs that match their own training distribution. JieGou mitigates this by randomizing presentation order (A/B position) and by supporting multi-judge consensus.
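Position randomization is simple to implement: flip a coin for which output gets the “A” slot, and keep the mapping so scores can be attributed back to the right variant. A minimal sketch (function and label names are illustrative):

```python
import random

def present_pair(out_x, out_y, rng=random):
    # Randomly assign A/B positions to counter position bias, and
    # return a mapping so the judge's scores can be de-anonymized.
    if rng.random() < 0.5:
        return {"A": out_x, "B": out_y}, {"A": "x", "B": "y"}
    return {"A": out_y, "B": out_x}, {"A": "y", "B": "x"}
```

Over many inputs, each variant appears in the “A” position about half the time, so any systematic preference for one slot washes out of the aggregate scores.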
Multi-judge consensus
For high-stakes evaluations, JieGou supports 2-3 independent judges. Each judge scores independently, and the system measures inter-judge agreement using two rank correlation metrics:
Kendall’s tau compares every pair of ranked items and measures the balance of concordant versus discordant rankings between judges. It ranges from -1.0 (perfect disagreement) through 0.0 (no correlation) to 1.0 (perfect agreement). In practice, tau values above 0.7 indicate strong agreement.
Spearman’s rho measures rank-order correlation. It’s similar to Kendall’s tau but more sensitive to large ranking disagreements. Rho values above 0.8 indicate strong agreement.
When judges disagree significantly (low tau/rho), the system flags the bakeoff for human review rather than declaring a winner — because disagreeing judges usually means the outputs are close in quality or the evaluation criteria are ambiguous.
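Both agreement metrics, and the resulting review flag, fit in a few lines of stdlib Python. This sketch assumes each judge produces a ranking of the same items with no tied ranks (ties require the tie-corrected variants of both formulas); the thresholds mirror the guidance above but should be treated as tunable.

```python
from itertools import combinations

def kendall_tau(a, b):
    # Tau-a: (concordant - discordant) / total pairs; assumes no tied ranks.
    pairs = list(combinations(range(len(a)), 2))
    c = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) > 0)
    d = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) < 0)
    return (c - d) / len(pairs)

def spearman_rho(a, b):
    # Classic rank-difference formula: 1 - 6*sum(d^2) / (n*(n^2 - 1));
    # the squared differences are what make rho more sensitive to
    # large ranking disagreements than tau.
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (n * (n * n - 1))

def needs_human_review(tau, rho, tau_min=0.7, rho_min=0.8):
    # Flag the bakeoff when either agreement metric falls below threshold.
    return tau < tau_min or rho < rho_min
```

For example, two judges who rank four outputs as [1, 2, 3, 4] and [1, 2, 4, 3] disagree on a single adjacent pair, giving tau ≈ 0.67 and rho = 0.8.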
Statistical confidence
Every score in a bakeoff includes:
- Mean score across all inputs
- Standard deviation showing score consistency
- 95% confidence interval so you know the range of true performance
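All three summary statistics can be computed with the standard library. This is a sketch using a normal-approximation interval (mean ± 1.96 standard errors); for small samples a t-based interval would be more appropriate, and the actual method JieGou uses may differ.

```python
import math
import statistics

def summarize(scores, z=1.96):
    # Mean, sample standard deviation, and a normal-approximation 95% CI.
    # z=1.96 corresponds to 95% coverage under the normal approximation.
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    half = z * sd / math.sqrt(len(scores))
    return mean, sd, (mean - half, mean + half)
```

Note that the interval narrows with the square root of the sample size: quadrupling the input count roughly halves the CI width, which is why scaling from 25 to 100 inputs makes close calls much easier to resolve.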
A bakeoff that shows Variant A scoring 7.2 (CI: 6.8-7.6) vs. Variant B scoring 7.0 (CI: 6.5-7.5) has overlapping confidence intervals. That doesn’t prove the variants are equal, but it does mean you can’t call the difference statistically significant. You’d need more inputs or a different evaluation approach.
A bakeoff showing Variant A at 8.1 (CI: 7.7-8.5) vs. Variant B at 6.3 (CI: 5.9-6.7) has non-overlapping intervals — that’s a clear winner.
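The overlap check itself is a one-liner. Here it is applied to the two examples above (intervals written as (low, high) tuples):

```python
def intervals_overlap(ci1, ci2):
    # Two intervals overlap unless one ends before the other begins.
    return ci1[0] <= ci2[1] and ci2[0] <= ci1[1]

# The inconclusive bakeoff: 7.2 (6.8-7.6) vs. 7.0 (6.5-7.5)
inconclusive = intervals_overlap((6.8, 7.6), (6.5, 7.5))

# The clear winner: 8.1 (7.7-8.5) vs. 6.3 (5.9-6.7)
clear_winner = not intervals_overlap((7.7, 8.5), (5.9, 6.7))
```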
Cost considerations
LLM-as-judge adds evaluation cost on top of the base execution cost. Each judge call processes both outputs plus the scoring prompt, which typically runs 2-4x the token count of a single output.
Multi-judge mode multiplies this: 3 judges means 3x the evaluation cost. JieGou shows estimated costs before you run a bakeoff so you can decide whether the evaluation budget is worth it.
For cost-sensitive scenarios, single-judge mode with more inputs often gives better statistical power than multi-judge mode with fewer inputs.
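A back-of-the-envelope cost model makes the trade-offs above concrete. This sketch uses the 2-4x prompt-size rule of thumb from this section; the default multiplier and the per-token price are hypothetical placeholders, not real rates.

```python
def estimate_judge_cost(n_inputs, avg_output_tokens, n_judges=1,
                        prompt_multiplier=3.0, price_per_1k_tokens=0.01):
    # Each judge call processes both outputs plus the scoring prompt,
    # modeled here as prompt_multiplier x one output's token count.
    # price_per_1k_tokens is an illustrative placeholder.
    tokens_per_call = avg_output_tokens * prompt_multiplier
    total_tokens = tokens_per_call * n_inputs * n_judges
    return total_tokens * price_per_1k_tokens / 1000
```

Comparing `estimate_judge_cost(150, 500)` against `estimate_judge_cost(50, 500, n_judges=3)` shows the budget trade-off directly: the same spend buys either three judges on 50 inputs or one judge on 150, and the latter usually yields tighter confidence intervals.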
Practical recommendations
Based on our experience running thousands of bakeoffs internally:
- Start with 20-30 inputs for an initial signal, then scale to 50-100 for production decisions
- Use synthetic inputs when you don’t have enough real data — they cover edge cases that real data might miss
- Single judge is sufficient for clear differences (> 1 point gap). Use multi-judge for close calls
- Check confidence intervals before acting — overlapping intervals mean you need more data, not a decision
- Vary your judges — using Claude to judge Claude outputs can introduce self-preference bias; cross-provider judging reduces this
Learn more
Bakeoffs are available on Pro and Enterprise plans. See the full bakeoff feature page for details on all six evaluation modes.