You built a recipe. The prompt looks right. You ran it once with a hand-crafted input and the output looked good. Time to deploy?
Not so fast. One input is not a test suite. The recipe might handle your carefully written example perfectly and fall apart on the messy, incomplete, contradictory inputs that real users send. Deploying without systematic testing is a bet — and most teams don’t realize the odds until something breaks in production.
Test My Recipe eliminates the guesswork. Generate realistic inputs, execute the recipe against each one, and watch results stream in before you commit to anything.
The problem with manual testing
Most teams test recipes the same way: type an input, hit run, read the output, repeat. This approach has three problems.
It’s slow. Typing inputs by hand, waiting for each result, and mentally evaluating quality take minutes per test. Testing 20 variations takes an hour you don’t have.
It’s biased. You write inputs based on what you think users will send. Your mental model of the input distribution is wrong — it always is. Real inputs include typos, missing fields, contradictory instructions, and edge cases you never imagined.
It’s not repeatable. There’s no record of what you tested, what the results were, or whether the recipe improved after your last prompt edit. Every test cycle starts from zero.
Generating realistic inputs
Click the Test Recipe button on any recipe’s detail page and JieGou generates synthetic test inputs for you. The generation uses the recipe’s input schema — field names, types, descriptions, and any examples you’ve provided — to produce N realistic variations (configurable from 5 to 50).
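To make the schema-driven generation concrete, here’s a sketch of what an input schema and a generation request might look like. The `FieldSpec` shape, the example ticket schema, and `buildGenerationRequest` are illustrative assumptions, not JieGou’s actual API; only the 5-to-50 range comes from the feature itself.

```typescript
// Hypothetical shapes: JieGou's real schema types may differ.
interface FieldSpec {
  name: string;
  type: "string" | "number" | "boolean";
  description: string;
  example?: string;
}

interface InputSchema {
  fields: FieldSpec[];
}

// Example schema for a hypothetical "summarize support ticket" recipe.
const ticketSchema: InputSchema = {
  fields: [
    { name: "subject", type: "string", description: "Ticket subject line", example: "Login broken" },
    { name: "body", type: "string", description: "Full ticket text" },
    { name: "priority", type: "string", description: "low | medium | high" },
  ],
};

// Build a generation request; count is clamped to the 5-50 range the UI allows.
function buildGenerationRequest(schema: InputSchema, count: number) {
  const clamped = Math.min(50, Math.max(5, count));
  return { schema, count: clamped };
}

console.log(buildGenerationRequest(ticketSchema, 100).count); // 50
```

The field descriptions and examples matter: the richer the schema, the more realistic the generated variations can be.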
The generated inputs aren’t random noise. They cover the realistic spectrum: well-formed inputs, edge cases with minimal information, inputs with conflicting requirements, and inputs that push the boundaries of what the recipe was designed to handle. Think of it as an automated QA engineer who reads your recipe’s spec and writes test cases.
You can review the generated inputs before execution starts. Delete any that aren’t relevant, edit others to target specific scenarios, or add your own custom inputs to the set. The goal is a test suite that reflects reality, not a synthetic exercise.
Real-time streaming with NDJSON
Once you start the test run, JieGou executes the recipe against each input sequentially. Results stream back to your browser in real time using NDJSON (newline-delimited JSON) — each line is a complete JSON object representing one event.
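NDJSON is easy to consume incrementally because every complete line is a standalone JSON object. A minimal client-side parser sketch, assuming a hypothetical event shape (JieGou’s actual payloads may differ):

```typescript
// Hypothetical event shape; real fields are an assumption.
type TestEvent = { type: string; index?: number; ok?: boolean };

// Feed chunks as they arrive; emits one parsed object per complete line.
function createNdjsonParser(onEvent: (e: TestEvent) => void) {
  let buffer = "";
  return (chunk: string) => {
    buffer += chunk;
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep the trailing partial line for the next chunk
    for (const line of lines) {
      if (line.trim()) onEvent(JSON.parse(line) as TestEvent);
    }
  };
}

// Network reads can split a JSON object across chunks; the buffer copes.
const events: TestEvent[] = [];
const feed = createNdjsonParser((e) => events.push(e));
feed('{"type":"result","index":0,"ok":true}\n{"type":"res');
feed('ult","index":1,"ok":false}\n');
console.log(events.length); // 2
```

The key detail is keeping the last, possibly incomplete line in the buffer: a chunk boundary can land mid-object, and only newline-terminated lines are safe to parse.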
The TestMyRecipeModal progresses through four phases:
- Idle — Ready to configure and start
- Generating — Synthetic inputs are being created
- Running — Recipe is executing against each input, with results streaming in
- Complete — All tests finished, summary available
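The four phases form a simple linear progression, which can be sketched as a tiny state machine. The phase names come from the modal; the transition logic below is an illustrative assumption, not the component’s actual implementation.

```typescript
type Phase = "idle" | "generating" | "running" | "complete";

// Linear progression: each phase advances to the next; complete is terminal.
const next: Record<Phase, Phase> = {
  idle: "generating",
  generating: "running",
  running: "complete",
  complete: "complete",
};

function advance(p: Phase): Phase {
  return next[p];
}

let phase: Phase = "idle";
phase = advance(phase); // generating
phase = advance(phase); // running
phase = advance(phase); // complete
console.log(phase); // complete
```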
During the Running phase, you see results arrive one by one. No waiting for the entire batch to finish. No spinner hiding all progress behind a single loading state. Each result appears as soon as its execution completes, so you can start reading outputs while later tests are still running.
This matters for longer-running recipes. If your recipe calls external APIs or processes lengthy documents, individual executions might take 10-30 seconds. Without streaming, testing 20 inputs means staring at a spinner for several minutes. With NDJSON streaming, you’re reviewing the first result within seconds.
Reading the results
When the test run completes, the results view gives you two levels of detail.
Summary statistics show the big picture at a glance: total tests run, success count, failure count, average execution time, and average token usage. If 18 of 20 tests succeeded but 2 failed, you know immediately that the recipe has gaps to address.
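Deriving those summary figures from per-test results is straightforward. A minimal sketch, assuming a result record with a success flag, duration, and token count (the field names are illustrative, not JieGou’s actual types):

```typescript
// Assumed per-test result shape.
interface TestResult {
  ok: boolean;
  durationMs: number;
  tokens: number;
}

function summarize(results: TestResult[]) {
  const successes = results.filter((r) => r.ok).length;
  const avg = (xs: number[]) =>
    xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : 0;
  return {
    total: results.length,
    successes,
    failures: results.length - successes,
    avgDurationMs: avg(results.map((r) => r.durationMs)),
    avgTokens: avg(results.map((r) => r.tokens)),
  };
}

const s = summarize([
  { ok: true, durationMs: 1200, tokens: 350 },
  { ok: true, durationMs: 800, tokens: 300 },
  { ok: false, durationMs: 400, tokens: 120 },
]);
console.log(s.successes, s.failures); // 2 1
```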
Per-test accordions let you drill into each individual execution. Expand any test to see the input that was sent, the full output that was returned, the execution time, token count, and any error messages. Side-by-side comparison of input and output makes it easy to judge whether the recipe understood the request and produced a useful result.
The combination works the way code test suites work: the summary tells you if something is wrong, and the details tell you what and where.
Audit trail integration
Every test run is logged as a recipe.tested audit action. The audit record captures who ran the test, when, which recipe was tested, how many inputs were generated, and the success/failure breakdown.
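A recipe.tested record might look something like the following sketch. The interface and field names are assumptions based on the fields described above, not JieGou’s actual audit schema.

```typescript
// Hypothetical audit record shape; field names are assumptions.
interface RecipeTestedAudit {
  action: "recipe.tested";
  actorId: string;
  recipeId: string;
  timestamp: string; // ISO 8601
  inputCount: number;
  successCount: number;
  failureCount: number;
}

function buildAuditRecord(
  actorId: string,
  recipeId: string,
  results: { ok: boolean }[],
): RecipeTestedAudit {
  const successCount = results.filter((r) => r.ok).length;
  return {
    action: "recipe.tested",
    actorId,
    recipeId,
    timestamp: new Date().toISOString(),
    inputCount: results.length,
    successCount,
    failureCount: results.length - successCount,
  };
}

const record = buildAuditRecord("user-1", "recipe-9", [
  { ok: true },
  { ok: false },
  { ok: true },
]);
console.log(record.successCount, record.failureCount); // 2 1
```

Storing the breakdown rather than raw outputs keeps the audit log compact while still answering the compliance question: was this recipe tested, by whom, and how did it do?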
This serves two purposes. First, it creates an accountability trail for teams with compliance requirements — you can demonstrate that recipes were tested before deployment. Second, it gives you a historical record of testing activity. When a recipe starts misbehaving in production, you can check the audit log to see when it was last tested and what the results looked like.
Audit records are visible in the Operations Hub alongside other system activity, so testing is part of the same operational visibility as execution, approvals, and configuration changes.
Why this matters for production confidence
The gap between “it worked when I tried it” and “it works reliably at scale” is where most AI automation failures happen. A recipe might handle 90% of inputs perfectly but produce nonsense for the other 10%. Without systematic testing, that 10% failure rate only becomes visible after real users encounter it.
Test My Recipe closes that gap by making it fast and easy to run a meaningful test suite before every deployment. Generate inputs, watch results stream in, review the summary, fix any issues, and test again. The entire cycle takes minutes, not hours.
Combined with Quality Guard for ongoing monitoring and bakeoffs for prompt comparison, Test My Recipe completes the quality lifecycle: test before you deploy, compare when you experiment, monitor after you ship.
Test My Recipe is available on all plans. Try it now.