“Which prompt is better?” sounds like a simple question. In practice, evaluating AI outputs is one of the hardest problems in applied AI. Human evaluation is expensive and slow. Automated metrics like BLEU or ROUGE miss nuance. And single-judge LLM evaluation is biased by position, verbosity, and the judge model’s own preferences.
We built a bakeoff system that addresses these problems with multi-judge scoring, position randomization, and statistical confidence intervals. This post covers the architecture, the statistics, and the lessons we learned.
What’s a Bakeoff?
A bakeoff is a structured comparison of two or more AI “arms” — different prompts, models, recipes, or workflows — evaluated against the same inputs. Think of it as A/B testing for AI outputs, but with automated scoring instead of user click-through rates.
JieGou supports 6 bakeoff modes:
- Prompt vs. prompt — Same recipe, different prompt templates
- Model vs. model — Same recipe, different LLM providers
- Recipe vs. recipe — Different recipes entirely
- Workflow vs. workflow — Different multi-step workflows
- Workflow model vs. model — Same workflow, different models
- A/B test — Live production routing with user feedback
Each bakeoff can have up to 8 arms and up to 10 inputs, with a combined cap of 40 cells (arms × inputs) to keep costs reasonable.
LLM-as-Judge Evaluation
Each cell is evaluated by an LLM judge using a listwise comparison approach. Rather than scoring each output independently (which loses relative context), the judge sees all arm outputs side by side and scores them against defined criteria.
The evaluation criteria are weighted, and each is scored from 0 to 100. The default preset uses:
| Criterion | Weight |
|---|---|
| Relevance | 30% |
| Completeness | 25% |
| Clarity | 20% |
| Accuracy | 15% |
| Format | 10% |
Users can customize these criteria or create their own. Weights must sum to 100%.
The judge prompt presents each arm’s output with a randomized label (A, B, C…) and asks for structured scores. We use invokeLLMStructured() with a Zod schema to ensure the judge returns valid, parseable scores.
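The contract looks roughly like this. The real code enforces it with a Zod schema passed to invokeLLMStructured(); the sketch below checks the same shape with a dependency-free validator, and the field names are illustrative, not the actual schema:

```typescript
// Shape the judge must return: one evaluation per randomized label,
// one 0-100 score per criterion. Field names are illustrative.
interface CriterionScore {
  criterion: string;            // e.g. "Relevance"
  score: number;                // must be within 0-100
}

interface ArmEvaluation {
  label: string;                // randomized label: "A", "B", "C", ...
  scores: CriterionScore[];
}

function isValidJudgeResponse(raw: unknown): raw is { evaluations: ArmEvaluation[] } {
  if (typeof raw !== "object" || raw === null) return false;
  const evals = (raw as any).evaluations;
  if (!Array.isArray(evals) || evals.length === 0) return false;
  return evals.every((e: any) =>
    typeof e?.label === "string" && /^[A-Z]$/.test(e.label) &&
    Array.isArray(e?.scores) &&
    e.scores.every((s: any) =>
      typeof s?.criterion === "string" &&
      typeof s?.score === "number" && s.score >= 0 && s.score <= 100)
  );
}
```

A response with a score of 120, a multi-character label, or a missing field is rejected before any aggregation runs — the same failure mode the Zod schema guards against.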
Position Randomization
A known bias in LLM evaluation is position preference — models tend to favor outputs presented first or last. We randomize the arm-to-label mapping for every evaluation call. Arm 1 might be labeled “C” in one run and “A” in the next.
This doesn’t eliminate position bias, but it distributes it randomly across arms rather than systematically favoring one. Over multiple inputs, the effect averages out.
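The mapping can be as simple as a Fisher–Yates shuffle over labels, re-run on every evaluation call (a minimal sketch; the function name is ours):

```typescript
// Assign each arm a fresh randomized label ("A", "B", "C", ...) per
// evaluation call, via Fisher-Yates shuffle of the label array.
function randomizeLabels(armIds: string[]): Map<string, string> {
  const labels = armIds.map((_, i) => String.fromCharCode(65 + i)); // "A", "B", ...
  for (let i = labels.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [labels[i], labels[j]] = [labels[j], labels[i]];
  }
  return new Map(armIds.map((id, i) => [id, labels[i]]));
}
```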
Multi-Judge Consensus
Single-judge evaluation is inherently noisy. Different models have different preferences, and even the same model can give different scores on the same input. We run 2-3 independent judges in parallel on the same evaluation and measure their agreement.
Kendall’s tau measures rank correlation by counting concordant and discordant pairs. If two judges both rank arm A above arm B, that’s a concordant pair. If they disagree, it’s discordant. The coefficient ranges from -1 (perfect disagreement) to 1 (perfect agreement).
Spearman’s rho provides a complementary measure: rho = 1 - (6 * sum of d^2) / (n * (n^2 - 1)), where d is the difference between the two judges’ ranks for a given arm and n is the number of arms.
We classify agreement as:
- High — Kendall’s tau ≥ 0.7
- Moderate — 0.4 ≤ tau < 0.7
- Low — tau < 0.4
Low agreement is itself a useful signal. It usually means the criteria are ambiguous, the outputs are too similar to differentiate, or the task doesn’t have a clear “better” answer.
Consensus rankings are computed by averaging scores across judges, with aggregated win counts for each arm.
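A consensus ranking along these lines, assuming each judge’s weighted scores are keyed by arm id (names and data layout are illustrative):

```typescript
// Average each arm's score across judges, then sort descending.
// scoresByJudge[j][armId] = judge j's 0-100 weighted score for that arm.
function consensusRanking(
  scoresByJudge: Record<string, number>[]
): { armId: string; mean: number }[] {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const judge of scoresByJudge) {
    for (const [armId, score] of Object.entries(judge)) {
      const t = totals.get(armId) ?? { sum: 0, count: 0 };
      t.sum += score;
      t.count += 1;
      totals.set(armId, t);
    }
  }
  return [...totals.entries()]
    .map(([armId, t]) => ({ armId, mean: t.sum / t.count }))
    .sort((x, y) => y.mean - x.mean);
}
```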
Statistical Confidence
For each arm, we compute a 95% confidence interval around the mean score: mean ± 1.96 × stdDev / √n, where n is the number of inputs.
When confidence intervals overlap between two arms, we flag it — the difference may not be statistically meaningful. This prevents teams from making decisions based on noise. A 2-point score difference on 3 inputs is probably random. A 15-point difference on 10 inputs with non-overlapping CIs is worth acting on.
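The interval and the overlap flag can be sketched as follows (function names are ours; the sketch assumes the sample standard deviation is used):

```typescript
// 95% CI around an arm's mean score: mean +/- 1.96 * stdDev / sqrt(n),
// using the normal approximation and the sample standard deviation.
function confidenceInterval(scores: number[]): { mean: number; lo: number; hi: number } {
  const n = scores.length;
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  const variance = scores.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1);
  const margin = 1.96 * Math.sqrt(variance) / Math.sqrt(n);
  return { mean, lo: mean - margin, hi: mean + margin };
}

// Two intervals overlap when each one's low end sits below the other's high end.
function intervalsOverlap(
  x: { lo: number; hi: number },
  y: { lo: number; hi: number }
): boolean {
  return x.lo <= y.hi && y.lo <= x.hi;
}
```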
A/B Test Integration
Bakeoffs answer “which is better in theory.” A/B tests answer “which performs better in production.”
When an A/B test is active, incoming workflow runs are randomly routed to different arms (50/50 split). Users provide star ratings (1-5) as feedback on each output, and we run a chi-square test for statistical significance.
The test auto-stops when two conditions are met: p-value < 0.05 and at least 30 feedback responses per arm. This prevents both premature conclusions and unnecessarily long tests.
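A sketch of the significance and auto-stop logic. How the 1–5 star ratings are binned is an internal detail; this version makes one plausible assumption, collapsing them to positive (4–5) vs. negative (1–3) and running a 2×2 chi-square test:

```typescript
// Pearson chi-square statistic for a 2x2 contingency table:
// rows = arms, columns = positive vs. negative feedback.
function chiSquare2x2(aPos: number, aNeg: number, bPos: number, bNeg: number): number {
  const total = aPos + aNeg + bPos + bNeg;
  const rowA = aPos + aNeg, rowB = bPos + bNeg;
  const colPos = aPos + bPos, colNeg = aNeg + bNeg;
  const cells: [number, number][] = [
    [aPos, rowA * colPos / total],
    [aNeg, rowA * colNeg / total],
    [bPos, rowB * colPos / total],
    [bNeg, rowB * colNeg / total],
  ];
  // sum of (observed - expected)^2 / expected over all four cells
  return cells.reduce((chi, [obs, exp]) => chi + (obs - exp) ** 2 / exp, 0);
}

// df = 1 for a 2x2 table; chi-square > 3.841 corresponds to p < 0.05.
function isSignificant(chi: number): boolean {
  return chi > 3.841;
}

// Auto-stop: significant AND at least 30 feedback responses per arm.
function shouldAutoStop(chi: number, nA: number, nB: number): boolean {
  return isSignificant(chi) && nA >= 30 && nB >= 30;
}
```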
Routing decisions are cached in Redis with a 30-second TTL for consistency — the same user hitting the endpoint repeatedly within the cache window gets the same arm.
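A sticky-routing sketch, with an in-memory map standing in for Redis (the key scheme and names are illustrative):

```typescript
// Sticky routing within a TTL window. The real system keys this in Redis
// with a 30-second expiry; a Map with explicit expiry stands in here.
const TTL_MS = 30_000;
const routingCache = new Map<string, { arm: string; expiresAt: number }>();

function routeUser(key: string, arms: string[], now: number = Date.now()): string {
  const hit = routingCache.get(key);
  if (hit && hit.expiresAt > now) return hit.arm;            // same arm within TTL
  const arm = arms[Math.floor(Math.random() * arms.length)]; // 50/50 for two arms
  routingCache.set(key, { arm, expiresAt: now + TTL_MS });
  return arm;
}
```

Once the entry expires, the next request re-rolls — which is fine, since per-request randomization is exactly what the A/B test wants over longer horizons.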
Cost Management
Multi-judge evaluation multiplies execution costs by the number of judges (2-3x). We provide upfront cost estimates before a bakeoff starts, using historical token usage medians as baselines and accounting for the judge multiplier.
The 40-cell cap (arms × inputs) and the 10-criteria limit keep individual bakeoffs bounded. For workflow arms, we use multi-step token projections that account for each step’s model and expected input/output size.
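A rough version of the estimate, with illustrative names and a made-up per-token price — the real projection breaks workflow arms down per step, which this sketch does not:

```typescript
// Back-of-envelope bakeoff cost: generation once per cell, judging
// multiplied by the number of judges. All parameter names are illustrative.
function estimateBakeoffCost(opts: {
  arms: number;
  inputs: number;
  medianTokensPerCell: number; // baseline from historical token usage
  judges: number;              // 2-3 in practice
  pricePerToken: number;
}): number {
  const cells = opts.arms * opts.inputs;
  if (cells > 40) throw new Error("exceeds 40-cell cap");
  const generationTokens = cells * opts.medianTokensPerCell;
  const judgingTokens = cells * opts.medianTokensPerCell * opts.judges;
  return (generationTokens + judgingTokens) * opts.pricePerToken;
}
```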
What We Learned
Position randomization is essential, not optional. In our early testing without randomization, the first-presented output won 60-65% of evaluations regardless of quality. Randomization brought this to 48-52%, which is within noise.
Two judges catch most disagreements. Going from 1 to 2 judges dramatically improved reliability. Going from 2 to 3 judges improved it further but with diminishing returns. For most use cases, 2 judges with Kendall’s tau reporting is the sweet spot.
Confidence intervals change behavior. Before we added CI reporting, teams would optimize for 1-2 point score differences. Now they see overlapping intervals and correctly interpret them as “no meaningful difference.” This saves engineering time that would have been spent chasing noise.
Structured output schemas make evaluation reliable. Early versions used free-form judge responses and parsed scores with regex. This broke constantly — models would add explanations, use different formats, or skip criteria. Switching to Zod-validated structured output eliminated parsing failures entirely.