JieGou ships starter packs for nine departments, each with 7-10 recipes and 2-5 workflows. That’s roughly 150 recipes and 40 workflows. Writing each one by hand, testing it with realistic inputs, evaluating the output quality, and iterating until it’s good enough to ship — that’s a full-time job for a team.
We needed a better approach. So we built the Recipe Factory: an automated pipeline that generates, tests, evaluates, and promotes templates at scale.
The pipeline
The Recipe Factory runs five stages:
1. Catalog → Generate
Every recipe starts as a spec in the catalog: a title, description, target department, expected input fields, expected output fields, and quality criteria. The catalog currently has ~150 recipe specs across 9 departments.
The generation stage takes each spec and uses an LLM to produce a complete recipe: a polished prompt template, a detailed input schema with field descriptions, and a structured output schema. The LLM follows a meta-prompt that encodes our standards for recipe quality — clear instructions, appropriate detail level, and schema best practices.
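Concretely, a catalog spec might look like the sketch below. The interface and field names are illustrative assumptions, not the actual JieGou catalog schema:

```typescript
// Hypothetical shape of a catalog recipe spec. Field names are
// illustrative, not the real JieGou schema.
interface RecipeSpec {
  title: string;
  description: string;
  department: string;
  inputFields: { name: string; description: string }[];
  outputFields: { name: string; description: string }[];
  qualityCriteria: string[];
}

// An example spec of the kind the generation stage consumes.
const exampleSpec: RecipeSpec = {
  title: "Prospect research brief",
  description: "Summarize a prospect company ahead of an outbound call.",
  department: "sales",
  inputFields: [
    { name: "companyName", description: "Legal or trading name of the prospect" },
  ],
  outputFields: [
    { name: "summary", description: "Three-paragraph company brief" },
    { name: "talkingPoints", description: "Bulleted openers for the call" },
  ],
  qualityCriteria: [
    "Uses only facts present in the input",
    "Ends with concrete talking points",
  ],
};
```

The generation stage's meta-prompt then expands a spec like this into the full prompt template and schemas.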
2. Generate → Test Data
For each generated recipe, a second LLM call produces 3-5 synthetic test inputs. These cover the typical case, edge cases (minimal input, unusually long input, ambiguous input), and department-specific scenarios. The test inputs are realistic enough to produce meaningful output without requiring real customer data.
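A minimal sketch of how that second call might be framed, assuming a simple prompt-builder (the function name and wording are hypothetical; the case categories come from the description above):

```typescript
// Case categories the test-data stage asks for. "typical" plus the
// edge cases named above.
const caseKinds = ["typical", "minimal", "unusually long", "ambiguous"];

// Hypothetical helper that assembles the test-input generation request.
function testInputPrompt(recipeTitle: string, kinds: string[]): string {
  const caseLines = kinds.map((k, i) => `${i + 1}. A ${k} input.`);
  return [
    `Generate ${kinds.length} realistic synthetic inputs for the recipe "${recipeTitle}".`,
    `Do not use real customer data. Cover:`,
    ...caseLines,
  ].join("\n");
}
```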
3. Test Data → Run Tests
Each recipe is executed against a real LLM using the synthetic test inputs. This catches problems that look fine in the template but fail at runtime: schema mismatches, prompts that produce off-target output, templates that exceed context limits, and instructions that the model interprets differently than intended.
4. Run Tests → Evaluate
This is where LLM-as-judge comes in. A separate LLM evaluates each test result across five dimensions:
- Schema compliance — Does the output match the expected schema? Are all required fields present and correctly typed?
- Completeness — Does the output address all aspects of the input? Are there gaps or missing sections?
- Actionability — Is the output useful? Can a person act on it without significant additional work?
- Format quality — Is the output well-organized, clearly written, and appropriately detailed?
- Consistency — Across multiple test inputs, does the recipe produce consistently structured output?
Each dimension is scored from 0 to 100, and the overall score is a weighted average.
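The scoring step reduces to a weighted average over the five dimensions. The weights below are assumptions for illustration; the actual weighting is not stated here:

```typescript
// Illustrative weights for the five dimensions. The factory's actual
// weighting is not stated, so these are assumptions.
const WEIGHTS: Record<string, number> = {
  schemaCompliance: 0.3,
  completeness: 0.25,
  actionability: 0.2,
  formatQuality: 0.15,
  consistency: 0.1,
};

// Overall score: weighted average of 0-100 dimension scores.
function overallScore(scores: Record<string, number>): number {
  let total = 0;
  for (const [dimension, weight] of Object.entries(WEIGHTS)) {
    total += (scores[dimension] ?? 0) * weight;
  }
  return total;
}
```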
5. Evaluate → Promote
Recipes that score 75+ overall with no dimension below 50 get promoted to the demo account in Firestore. Recipes that fail get flagged for manual review and iteration.
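The promotion gate itself is a simple predicate. A sketch, with an assumed signature:

```typescript
// The gate described above: overall score of 75+ and no single
// dimension below 50. The signature is an illustrative sketch.
function shouldPromote(dimensionScores: number[], overall: number): boolean {
  return overall >= 75 && dimensionScores.every((score) => score >= 50);
}
```

Note that both conditions are required: a strong average cannot mask one badly failing dimension.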
What LLM-as-judge taught us
Building an automated quality evaluation system produced several non-obvious insights.
Prompt specificity matters more than length. Early recipe templates were long and detailed. The LLM-as-judge consistently scored shorter, more specific prompts higher on schema compliance and format quality. A 200-word prompt that says exactly what to do outperforms a 500-word prompt that covers every edge case. The model follows clear instructions better than comprehensive instructions.
Output schemas are the most important quality lever. Recipes with detailed output schemas — field descriptions, enum constraints, nested object structures — scored significantly higher on completeness and consistency. The schema acts as a second set of instructions that constrains the model’s output independent of the prompt.
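As an illustration, here is a hypothetical output schema fragment using those levers: field descriptions, enum constraints, and nested objects. The field names and values are made up for the sketch, not a real production schema:

```typescript
// Hypothetical output schema fragment. Field names and enum values
// are illustrative, not an actual JieGou recipe schema.
const outputSchema = {
  type: "object",
  properties: {
    priority: {
      type: "string",
      description: "Triage priority for the follow-up.",
      enum: ["low", "medium", "high"],
    },
    nextSteps: {
      type: "array",
      description: "Concrete actions a person can take without rework.",
      items: {
        type: "object",
        properties: {
          action: { type: "string", description: "What to do next." },
          owner: { type: "string", description: "Role responsible for it." },
        },
        required: ["action"],
      },
    },
  },
  required: ["priority", "nextSteps"],
};
```

Each description, enum, and required-field constraint narrows what the model can emit, which is why detailed schemas lift completeness and consistency scores.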
Edge-case test inputs reveal fragility. The standard test case usually works. The edge cases — minimal input, unusual formatting, missing optional fields — expose recipes that break under real-world conditions. A recipe that scores 90 on the typical case but 40 on edge cases isn’t ready for production.
Evaluation criteria need calibration. Our first pass at LLM-as-judge scoring was too lenient. Recipes scored 85+ that produced mediocre output. We tightened the evaluation prompt with specific examples of what each score level looks like. “Actionability” at 80+ means the output includes specific next steps, not just analysis. “Format quality” at 80+ means clear headings, consistent structure, and appropriate length.
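A calibrated rubric fragment of this kind might be embedded in the judge prompt like so (the wording is illustrative, not the production prompt):

```typescript
// Sketch of a calibrated scoring rubric for one dimension. The exact
// wording is hypothetical; the 80+ bar matches the description above.
const actionabilityRubric = `
Actionability scoring guide:
- 90-100: Specific next steps with owners and ordering; ready to execute.
- 80-89: Specific next steps included, not just analysis.
- 50-79: Useful analysis, but next steps are vague or missing.
- 0-49: Restates the input or gives only generic advice.
`.trim();
```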
Cross-department consistency requires shared standards. A sales prospect research recipe and an HR resume screening recipe both extract structured data from unstructured input. The quality bar should be the same. We standardized evaluation criteria across departments so the factory applies consistent standards regardless of the recipe’s domain.
The workflow factory
After the Recipe Factory proved effective, we built an equivalent pipeline for workflows. The workflow factory generates multi-step workflows from specs, creates test inputs that exercise the full workflow (including condition branches and loop iterations), executes them end-to-end, and evaluates the composite output.
Workflow evaluation is harder than recipe evaluation because the quality depends on how well the steps chain together, not just individual step quality. A workflow where each step scores 85 individually might score 60 overall if the data doesn’t flow cleanly between steps. The evaluation criteria include inter-step data integrity and overall coherence.
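One way to check inter-step data integrity is a static pass over the workflow before execution: verify that every step's declared inputs are covered by earlier steps' outputs. A sketch, under assumed step shapes:

```typescript
// Assumed step shape for the sketch; not the actual workflow model.
interface Step {
  id: string;
  inputs: string[];
  outputs: string[];
}

// Returns "stepId:inputName" for every input that no earlier step
// (or the initial workflow input) provides.
function dataFlowGaps(steps: Step[], initialInputs: string[]): string[] {
  const available = new Set(initialInputs);
  const gaps: string[] = [];
  for (const step of steps) {
    for (const input of step.inputs) {
      if (!available.has(input)) gaps.push(`${step.id}:${input}`);
    }
    step.outputs.forEach((output) => available.add(output));
  }
  return gaps;
}
```

A check like this catches structural wiring mistakes cheaply; the LLM judge still has to assess whether the data that flows through is semantically coherent.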
Running the factory
The full pipeline runs with a single command: npm run recipe-factory. It takes the catalog specs, generates recipes, creates test data, executes tests, evaluates results, and promotes passing recipes to the demo account.
Individual stages can be run separately for iteration:
- npm run generate-recipes — Regenerate recipes from catalog specs
- npm run generate-test-inputs — Create new test data
- npm run run-recipe-tests — Execute recipes against the LLM
- npm run evaluate-recipes — Score the results
- npm run promote-recipes — Push passing recipes to Firestore
This modularity lets us iterate on a specific stage without re-running the entire pipeline.
Why this matters for users
The Recipe Factory is an internal tool, but its effects are user-facing. Every recipe in a starter pack has been:
- Generated from a spec that defines quality criteria
- Tested with realistic inputs including edge cases
- Evaluated by an LLM judge across five quality dimensions
- Promoted only if it meets the quality threshold
When you install a starter pack, the recipes aren’t rough drafts. They’ve been through an automated quality pipeline that catches the most common issues before they reach you.
That said, they’re starting points. Your team’s specific terminology, quality standards, and output preferences are things the factory can’t know. The recipes are designed to be customized — the factory ensures they start at a high baseline so your customization is refinement, not remediation.