

Don't just run AI — prove it works

Every platform now supports multiple models. Only JieGou lets you prove which is best — on your data, with your recipes, tracking real costs. Compare recipes, models, and entire workflows side by side with LLM-as-judge scoring, multi-judge consensus, and live A/B routing.

Bakeoff Modes

Six ways to evaluate your AI

From simple recipe comparison to live traffic routing, choose the evaluation approach that fits your needs.

Recipe vs. Recipe (Pro)

Compare two different recipes on the same inputs

Recipe vs. Model (Pro)

Same recipe, different LLM providers or models

Multi-Judge (Pro)

2-3 independent LLM judges with consensus scoring

Workflow vs. Workflow (Enterprise)

Full end-to-end workflow comparison

A/B Test Routing (Enterprise)

Live traffic splitting with statistical auto-stop

Synthetic Inputs (Pro)

Auto-generated test data from input schemas

Recipe Comparison

Recipe vs. recipe, model vs. model

Run the same inputs through different recipes or the same recipe on different models. See outputs side by side and let an LLM judge score each result on quality, accuracy, and relevance.

  • Compare two recipes on identical inputs
  • Test the same recipe across different LLM providers
  • Side-by-side output display with diff highlighting
  • LLM-as-judge scores each output automatically
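The comparison loop can be sketched as follows. This is a minimal illustration, not JieGou's implementation: `run_recipe` and `judge_output` are hypothetical stand-ins (stubbed here so the sketch runs); in practice they would call your recipe engine and prompt a judge model for per-criterion scores.

```python
# Sketch of a recipe-vs-recipe bakeoff: run identical inputs through
# both recipes, have a judge score each output, and tally totals.

def run_recipe(recipe: str, text: str) -> str:
    # Stub: a real implementation would execute the recipe against an LLM.
    return f"{recipe}:{text.upper()}"

def judge_output(output: str, criteria=("quality", "accuracy", "relevance")) -> dict:
    # Stub judge: a real judge would prompt an LLM for a 1-10 score
    # per criterion; this deterministic stand-in keeps the sketch runnable.
    base = len(output) % 5 + 5
    return {c: base for c in criteria}

def bakeoff(recipe_a: str, recipe_b: str, inputs: list[str]) -> dict:
    """Run the same inputs through both recipes and sum judge scores."""
    totals = {recipe_a: 0, recipe_b: 0}
    for text in inputs:
        for recipe in (recipe_a, recipe_b):
            scores = judge_output(run_recipe(recipe, text))
            totals[recipe] += sum(scores.values())
    return totals

totals = bakeoff("summarize_v1", "summarize_v2", ["hello world", "bakeoff"])
winner = max(totals, key=totals.get)
```

The same loop covers recipe-vs-model runs: keep the recipe fixed and vary the provider passed to `run_recipe`.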

Multi-Judge Evaluation

Consensus scoring with statistical confidence

Use two or three independent LLM judges to evaluate outputs. JieGou calculates inter-judge agreement using Kendall's tau and Spearman's rho, and reports 95% confidence intervals so you know when results are statistically meaningful.

  • 2-3 independent LLM judges per evaluation
  • Kendall's tau and Spearman's rho correlation
  • 95% confidence intervals with standard deviation
  • Cost estimation with multi-judge multiplier
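The agreement statistics named above can be computed as in this sketch, which assumes `scipy` is available and uses illustrative judge scores. Kendall's tau and Spearman's rho are taken pairwise between judges, and the 95% interval is a normal approximation over per-output consensus means.

```python
# Sketch of multi-judge consensus metrics: pairwise rank agreement
# plus a 95% confidence interval on the mean consensus score.
from math import sqrt
from statistics import mean, stdev
from scipy.stats import kendalltau, spearmanr

# Illustrative scores from three independent judges over five outputs.
judge_scores = {
    "judge_a": [8, 6, 9, 5, 7],
    "judge_b": [7, 6, 9, 4, 8],
    "judge_c": [8, 5, 9, 5, 7],
}

def pairwise_agreement(scores: dict) -> list[tuple]:
    """Kendall's tau and Spearman's rho for every judge pair."""
    names = sorted(scores)
    out = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            tau, _ = kendalltau(scores[a], scores[b])
            rho, _ = spearmanr(scores[a], scores[b])
            out.append((a, b, round(tau, 3), round(rho, 3)))
    return out

def consensus_interval(scores: dict) -> tuple:
    """95% confidence interval for the mean consensus score."""
    per_output = [mean(col) for col in zip(*scores.values())]
    m, s = mean(per_output), stdev(per_output)
    half = 1.96 * s / sqrt(len(per_output))
    return round(m - half, 2), round(m + half, 2)

agreement = pairwise_agreement(judge_scores)
lo, hi = consensus_interval(judge_scores)
```

High tau/rho between judges means the extra judge calls (and their cost multiplier) are buying you corroboration rather than noise.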

Workflow Bakeoffs

Compare entire workflows head-to-head

Go beyond single recipes. Run full workflows against each other to compare end-to-end output quality, execution time, and cost. Ideal for evaluating different automation strategies before committing to one.

  • Full workflow execution with token tracking
  • Compare total cost and execution time
  • End-to-end output quality scoring
  • Available on Enterprise plans
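The end-to-end accounting reduces to aggregating per-step token counts and durations per variant, roughly as below. The step records and the flat per-token price are illustrative assumptions, not JieGou's schema or pricing.

```python
# Sketch of workflow-vs-workflow accounting: roll per-step token
# counts and wall time up into a per-variant total for comparison.

PRICE_PER_1K_TOKENS = 0.002  # assumed flat rate for the sketch

def summarize(steps: list[dict]) -> dict:
    """Total tokens, estimated cost, and execution time for one workflow run."""
    tokens = sum(s["tokens"] for s in steps)
    return {
        "tokens": tokens,
        "cost_usd": round(tokens / 1000 * PRICE_PER_1K_TOKENS, 4),
        "seconds": round(sum(s["seconds"] for s in steps), 2),
    }

# Illustrative per-step traces for two competing workflows.
workflow_a = [{"tokens": 1200, "seconds": 1.4}, {"tokens": 800, "seconds": 0.9}]
workflow_b = [{"tokens": 2100, "seconds": 1.1}, {"tokens": 400, "seconds": 0.7}]

report = {"A": summarize(workflow_a), "B": summarize(workflow_b)}
```

Pairing these totals with the quality scores lets you see, for example, that a cheaper workflow is only marginally worse, before committing to one strategy.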

A/B Test Routing

Live traffic splitting with auto-stop

Route live execution traffic between recipe or workflow variants. JieGou tracks performance with chi-square statistical testing and automatically stops routing to the losing variant when the winner reaches significance.

  • Split live traffic between two variants
  • Chi-square statistical testing for significance
  • Auto-stop when a winner is determined
  • Redis-cached routing decisions for consistency
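The auto-stop rule amounts to a 2x2 chi-square test on success/failure counts per variant. A hedged sketch, with illustrative counts and the standard 0.05 critical value for one degree of freedom:

```python
# Sketch of the A/B auto-stop check: Pearson chi-square on a 2x2
# success/failure table, stopping once the statistic clears the
# 0.05 significance threshold (3.841 at df = 1).

CHI2_CRITICAL_95 = 3.841  # df = 1

def chi_square_2x2(a_success: int, a_fail: int,
                   b_success: int, b_fail: int) -> float:
    """Pearson chi-square statistic for a 2x2 contingency table.
    Assumes every row and column has at least one observation."""
    n = a_success + a_fail + b_success + b_fail
    num = n * (a_success * b_fail - a_fail * b_success) ** 2
    den = ((a_success + a_fail) * (b_success + b_fail)
           * (a_success + b_success) * (a_fail + b_fail))
    return num / den

def should_stop(a_success, a_fail, b_success, b_fail) -> bool:
    return chi_square_2x2(a_success, a_fail, b_success, b_fail) >= CHI2_CRITICAL_95

# Variant A: 180/200 successes; variant B: 150/200.
stop = should_stop(180, 20, 150, 50)
```

Once `should_stop` fires, routing to the losing variant can be shut off; caching the routing decision (per the Redis note above) keeps repeat requests on a consistent variant.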

Synthetic Inputs

Auto-generate test data from schemas

Don't have enough real data to run a meaningful comparison? JieGou generates synthetic inputs from your recipe or workflow input schemas, giving you a diverse set of test cases without manual effort.

  • Generate test inputs from JSON Schema definitions
  • Diverse, realistic data for meaningful comparisons
  • No manual test case creation required
  • Works with both recipe and workflow schemas
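Schema-driven generation can be sketched like this. Only a handful of JSON Schema keywords (`type`, `properties`, `items`, `enum`, `minimum`, `maximum`) are handled; a real generator would also cover formats, string constraints, and deeper nesting. The sample schema is an assumption for illustration.

```python
# Sketch of synthetic-input generation from a JSON Schema fragment:
# walk the schema and emit a random value matching each declared type.
import random

def synth(schema: dict, rng: random.Random):
    t = schema.get("type")
    if t == "object":
        return {k: synth(v, rng) for k, v in schema.get("properties", {}).items()}
    if t == "array":
        return [synth(schema["items"], rng) for _ in range(rng.randint(1, 3))]
    if t == "integer":
        return rng.randint(schema.get("minimum", 0), schema.get("maximum", 100))
    if t == "string":
        if "enum" in schema:
            return rng.choice(schema["enum"])
        return "sample-" + str(rng.randint(1000, 9999))
    if t == "boolean":
        return rng.random() < 0.5
    raise ValueError(f"unsupported type: {t}")

# Hypothetical input schema for a recipe.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "priority": {"type": "integer", "minimum": 1, "maximum": 5},
        "tags": {"type": "array", "items": {"type": "string", "enum": ["a", "b"]}},
    },
}
cases = [synth(schema, random.Random(seed)) for seed in range(3)]
```

Seeding the generator per case keeps the synthetic dataset reproducible, so two bakeoff runs see identical inputs.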

How It Works

From setup to results in four steps

1

Choose a mode

Pick recipe vs. recipe, model vs. model, workflow comparison, or A/B routing.

2

Add inputs

Use real data, generate synthetic inputs from schemas, or provide your own test cases.

3

Run the bakeoff

Both variants execute in parallel. LLM judges score each output independently.

4

Review results

See scores, confidence intervals, cost comparison, and the winning variant.
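The four steps above can be sketched as one script against a hypothetical client. `BakeoffClient` and all of its method names are assumptions made for illustration, with the run step stubbed; they are not JieGou's actual SDK.

```python
# The four-step flow as code: choose a mode, add inputs, run, review.

class BakeoffClient:
    def __init__(self):
        self.config = {}

    def choose_mode(self, mode: str):       # step 1: pick a bakeoff mode
        self.config["mode"] = mode
        return self

    def add_inputs(self, inputs):           # step 2: real or synthetic inputs
        self.config["inputs"] = list(inputs)
        return self

    def run(self) -> dict:                  # step 3: stubbed parallel execution
        n = len(self.config["inputs"])
        return {"scores": {"A": 8.1, "B": 7.4}, "cases": n}

    @staticmethod
    def review(results: dict) -> str:       # step 4: pick the winning variant
        return max(results["scores"], key=results["scores"].get)

results = (BakeoffClient()
           .choose_mode("recipe_vs_recipe")
           .add_inputs(["case 1", "case 2"])
           .run())
winner = BakeoffClient.review(results)
```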

Start your first bakeoff

Find the best recipe, model, or workflow for every use case — with data, not guesswork.