Evaluating AI output is one of the hardest problems in applied AI. Human evaluation is the gold standard, but it’s slow, expensive, and doesn’t scale. JieGou’s bakeoff system uses LLM-as-judge — a technique where one language model evaluates the output of another — to automate quality scoring with statistical rigor.
Here’s how it works under the hood.
The basic setup
In a bakeoff, two variants (recipes, models, or workflows) process the same set of inputs. Each produces an output. An independent LLM judge — separate from the models being evaluated — scores each output on predefined dimensions.
The judge sees both outputs (anonymized as “Output A” and “Output B”) along with the original input and scoring criteria. It produces a structured score for each dimension: quality, accuracy, relevance, completeness, and an overall winner.
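Structured scores are only useful if they are well-formed, so a validation step is a natural companion to the judge call. Here is a minimal sketch in Python; the field names and the 1-10 scale are illustrative assumptions, not JieGou's actual schema.

```python
DIMENSIONS = ("quality", "accuracy", "relevance", "completeness")

def validate_score(score):
    # Accept only well-formed judge output: each dimension must be a
    # 1-10 integer, and the overall winner must be "A", "B", or "tie".
    dims_ok = all(isinstance(score.get(d), int) and 1 <= score[d] <= 10
                  for d in DIMENSIONS)
    return dims_ok and score.get("winner") in {"A", "B", "tie"}
```

Rejecting malformed judge responses (and retrying them) keeps one bad generation from corrupting an entire bakeoff's statistics.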
Why use an LLM as judge?
The alternative is manual evaluation: have a human read every output pair and score them. For small tests (5-10 inputs), that’s feasible. For meaningful statistical analysis (50-100+ inputs), it becomes a bottleneck.
LLM judges scale well: cost grows linearly with the number of inputs, but when calls run in parallel, evaluating 100 input pairs takes roughly the same wall-clock time as evaluating 10. The cost is predictable (it’s just tokens), and the evaluation is consistent. A human’s judgment drifts across a long evaluation session; an LLM applies the same criteria to the first pair and the hundredth.
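The parallelism is straightforward to sketch with a thread pool. In this sketch, `judge_pair` is a stub standing in for a real judge-model API call; the real call would be network-bound, which is exactly where a thread pool helps.

```python
from concurrent.futures import ThreadPoolExecutor

def judge_pair(pair):
    # Stub for a real LLM judge call; a production version would send
    # the input and both outputs to a judge model and parse its score.
    input_text, out_a, out_b = pair
    return {"input": input_text, "winner": "tie"}

def evaluate_all(pairs, max_workers=10):
    # Cost grows linearly with len(pairs); wall-clock time is bounded
    # by the slowest in-flight batch, not the total pair count.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(judge_pair, pairs))
```

`pool.map` preserves input order, so results line up with the original pairs without extra bookkeeping.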
The trade-off is that LLM judges have known biases: they tend to prefer longer outputs, more formal language, and outputs that match their own training distribution. JieGou mitigates this by randomizing presentation order (A/B position) and by supporting multi-judge consensus.
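Position randomization is simple to implement: flip a coin for which output gets the “A” slot, and keep the mapping so scores can be attributed back to the right variant. A minimal sketch (function and label names are illustrative):

```python
import random

def present_pair(out_x, out_y, rng=random):
    # Randomly assign A/B positions to counter position bias, and
    # return a mapping so the judge's scores can be de-anonymized.
    if rng.random() < 0.5:
        return {"A": out_x, "B": out_y}, {"A": "x", "B": "y"}
    return {"A": out_y, "B": out_x}, {"A": "y", "B": "x"}
```

Over many inputs, each variant appears in the “A” position about half the time, so any systematic preference for one slot washes out of the aggregate scores.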
Multi-judge consensus
For high-stakes evaluations, JieGou supports 2-3 independent judges. Each judge scores independently, and the system measures inter-judge agreement using two rank correlation metrics:
Kendall’s tau compares every pair of ranked items and measures the balance of concordant versus discordant rankings between judges. It ranges from -1.0 (perfect disagreement) through 0.0 (no correlation) to 1.0 (perfect agreement). In practice, tau values above 0.7 indicate strong agreement.
Spearman’s rho measures rank-order correlation. It’s similar to Kendall’s tau but more sensitive to large ranking disagreements. Rho values above 0.8 indicate strong agreement.
When judges disagree significantly (low tau/rho), the system flags the bakeoff for human review rather than declaring a winner — because disagreeing judges usually means the outputs are close in quality or the evaluation criteria are ambiguous.
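Both agreement metrics, and the resulting review flag, fit in a few lines of stdlib Python. This sketch assumes each judge produces a ranking of the same items with no tied ranks (ties require the tie-corrected variants of both formulas); the thresholds mirror the guidance above but should be treated as tunable.

```python
from itertools import combinations

def kendall_tau(a, b):
    # Tau-a: (concordant - discordant) / total pairs; assumes no tied ranks.
    pairs = list(combinations(range(len(a)), 2))
    c = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) > 0)
    d = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) < 0)
    return (c - d) / len(pairs)

def spearman_rho(a, b):
    # Classic rank-difference formula: 1 - 6*sum(d^2) / (n*(n^2 - 1));
    # the squared differences are what make rho more sensitive to
    # large ranking disagreements than tau.
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (n * (n * n - 1))

def needs_human_review(tau, rho, tau_min=0.7, rho_min=0.8):
    # Flag the bakeoff when either agreement metric falls below threshold.
    return tau < tau_min or rho < rho_min
```

For example, two judges who rank four outputs as [1, 2, 3, 4] and [1, 2, 4, 3] disagree on a single adjacent pair, giving tau ≈ 0.67 and rho = 0.8.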
Statistical confidence
Every score in a bakeoff includes:
- Mean score across all inputs
- Standard deviation showing score consistency
- 95% confidence interval so you know the range of true performance
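All three summary statistics can be computed with the standard library. This is a sketch using a normal-approximation interval (mean ± 1.96 standard errors); for small samples a t-based interval would be more appropriate, and the actual method JieGou uses may differ.

```python
import math
import statistics

def summarize(scores, z=1.96):
    # Mean, sample standard deviation, and a normal-approximation 95% CI.
    # z=1.96 corresponds to 95% coverage under the normal approximation.
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    half = z * sd / math.sqrt(len(scores))
    return mean, sd, (mean - half, mean + half)
```

Note that the interval narrows with the square root of the sample size: quadrupling the input count roughly halves the CI width, which is why scaling from 25 to 100 inputs makes close calls much easier to resolve.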
A bakeoff that shows Variant A scoring 7.2 (CI: 6.8-7.6) vs. Variant B scoring 7.0 (CI: 6.5-7.5) has overlapping confidence intervals. That doesn’t prove the variants are equal, but it does mean you can’t call the difference statistically significant. You’d need more inputs or a different evaluation approach.
A bakeoff showing Variant A at 8.1 (CI: 7.7-8.5) vs. Variant B at 6.3 (CI: 5.9-6.7) has non-overlapping intervals — that’s a clear winner.
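The overlap check itself is a one-liner. Here it is applied to the two examples above (intervals written as (low, high) tuples):

```python
def intervals_overlap(ci1, ci2):
    # Two intervals overlap unless one ends before the other begins.
    return ci1[0] <= ci2[1] and ci2[0] <= ci1[1]

# The inconclusive bakeoff: 7.2 (6.8-7.6) vs. 7.0 (6.5-7.5)
inconclusive = intervals_overlap((6.8, 7.6), (6.5, 7.5))

# The clear winner: 8.1 (7.7-8.5) vs. 6.3 (5.9-6.7)
clear_winner = not intervals_overlap((7.7, 8.5), (5.9, 6.7))
```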
Cost considerations
LLM-as-judge adds evaluation cost on top of the base execution cost. Each judge call processes both outputs plus the scoring prompt, which typically runs 2-4x the token count of a single output.
Multi-judge mode multiplies this: 3 judges means 3x the evaluation cost. JieGou shows estimated costs before you run a bakeoff so you can decide whether the evaluation budget is worth it.
For cost-sensitive scenarios, single-judge mode with more inputs often gives better statistical power than multi-judge mode with fewer inputs.
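A back-of-the-envelope cost model makes the trade-offs above concrete. This sketch uses the 2-4x prompt-size rule of thumb from this section; the default multiplier and the per-token price are hypothetical placeholders, not real rates.

```python
def estimate_judge_cost(n_inputs, avg_output_tokens, n_judges=1,
                        prompt_multiplier=3.0, price_per_1k_tokens=0.01):
    # Each judge call processes both outputs plus the scoring prompt,
    # modeled here as prompt_multiplier x one output's token count.
    # price_per_1k_tokens is an illustrative placeholder.
    tokens_per_call = avg_output_tokens * prompt_multiplier
    total_tokens = tokens_per_call * n_inputs * n_judges
    return total_tokens * price_per_1k_tokens / 1000
```

Comparing `estimate_judge_cost(150, 500)` against `estimate_judge_cost(50, 500, n_judges=3)` shows the budget trade-off directly: the same spend buys either three judges on 50 inputs or one judge on 150, and the latter usually yields tighter confidence intervals.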
Practical recommendations
Based on our experience running thousands of bakeoffs internally:
- Start with 20-30 inputs for an initial signal, then scale to 50-100 for production decisions
- Use synthetic inputs when you don’t have enough real data — they cover edge cases that real data might miss
- Single judge is sufficient for clear differences (> 1 point gap). Use multi-judge for close calls
- Check confidence intervals before acting — overlapping intervals mean you need more data, not a decision
- Vary your judges — using Claude to judge Claude outputs can introduce self-preference bias; cross-provider judging reduces this
Learn more
Bakeoffs are available on Pro and Enterprise plans. See the full bakeoff feature page for details on all six evaluation modes.