
The Prompt Engineering Studio: Version, Optimize, and A/B Test Your Prompts

Inside JieGou's Prompt Engineering Studio — a 5-tab panel embedded in the recipe editor for version tracking, token budgeting, variable inspection, few-shot management, and AI-powered prompt optimization.

JieGou Team · 6 min read

Prompt engineering is trial and error for most teams. You tweak a system prompt, run it a few times, decide it “feels better,” and move on. There’s no version history. No way to compare iteration 14 against iteration 11. No systematic feedback loop connecting production quality back to prompt changes.

We built the Prompt Engineering Studio to fix this. It’s a collapsible panel embedded directly in the recipe editor — not a separate page, not a different tool. Five tabs sit right next to the prompt textarea: Token Budget, Variables, Versions, Few-Shot, and Optimizer. Iteration happens in context, not in a disconnected workflow.

Version Tracking and Diff Comparison

Every prompt change creates a version in a Firestore subcollection. Each version stores:

  • Version number — auto-incremented integer
  • Template text — the full prompt template at that point in time
  • Similarity score — 0-100, via normalized Levenshtein distance against the previous version
  • Author — who made the change
  • Changelog — freeform description of what changed and why

Quality metrics are tracked per version: total runs, success count, thumbs-up count, thumbs-down count, feedback ratio, and average token usage. These metrics are Redis-cached with a 5-minute TTL to keep the UI responsive without hammering Firestore on every panel open.
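A normalized Levenshtein similarity like the one each version stores can be computed in a few lines. This is a minimal sketch, not JieGou's implementation; the function name is ours:

```python
def similarity_score(prev: str, curr: str) -> int:
    """Normalized Levenshtein similarity between two templates, scaled to 0-100."""
    if not prev and not curr:
        return 100
    m, n = len(prev), len(curr)
    # Single-row dynamic-programming edit distance.
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev_diag, row[0] = row[0], i
        for j in range(1, n + 1):
            cost = 0 if prev[i - 1] == curr[j - 1] else 1
            prev_diag, row[j] = row[j], min(
                row[j] + 1,        # deletion
                row[j - 1] + 1,    # insertion
                prev_diag + cost,  # substitution
            )
    distance = row[n]
    # Normalize by the longer string so the score lands in 0-100.
    return round(100 * (1 - distance / max(m, n)))
```

A score of 90 or above would land in the green "minor tweaks" band described below; anything under 50 would read as a substantial rewrite.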

The diff viewer shows line-by-line comparisons between any two versions. Additions render in green, removals in red, with statistics summarizing how many lines changed. Similarity scores are color-coded: green for >= 90% similarity (minor tweaks), amber for >= 50% (moderate rewrites), and red for < 50% (substantial changes).

Rollback is non-destructive. Restoring a previous version creates a new version with the old content — it never overwrites history. Version 15 might be identical to version 8, and that’s fine. The full chain of changes is always preserved.
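Non-destructive rollback falls out naturally if restoring is just another append. A toy sketch against an in-memory version list (the real system writes to a Firestore subcollection; the dict shape here is assumed):

```python
def rollback(versions: list[dict], target_number: int, author: str) -> dict:
    """Restore an old version by appending a new one; history is never rewritten."""
    target = next(v for v in versions if v["number"] == target_number)
    new_version = {
        "number": versions[-1]["number"] + 1,  # auto-incremented, like any edit
        "template": target["template"],        # old content, new entry
        "author": author,
        "changelog": f"Rolled back to version {target_number}",
    }
    versions.append(new_version)
    return new_version
```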

Live Token Budget Visualization

The Token Budget tab renders a real-time bar chart showing context window utilization as you type. Five labeled sections break down where your tokens go:

  • System — 200-token overhead (system prompt framing)
  • Glossary — 500-token cap
  • Few-Shot — 2,000-token cap
  • RAG Context — 4,000-token cap
  • User Prompt — estimated from text length / 4

The visualization is model-aware. Select Claude and the bar scales to 200K tokens. Switch to GPT-4o and it rescales to 128K. Switch to Gemini and it extends to 1M. The proportions shift accordingly, making it immediately obvious when a prompt that fits comfortably in one model’s context window is dangerously tight in another.

Three warning levels trigger automatically:

  • >80% utilization — amber warning, suggesting you trim context or few-shot examples
  • >90% utilization — red warning, indicating high risk of truncation
  • <1,000 tokens remaining for output — explicit alert that the model won’t have enough room to generate a useful response

Updates are debounced at 500ms on keystroke, so the chart stays responsive without recalculating on every character.
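Putting the section caps, model windows, and warning thresholds together, the budget check could look roughly like this. A sketch only: the function and key names are ours, the real panel renders all applicable warnings rather than a single level, and length / 4 is the same rough characters-per-token estimate used above:

```python
CONTEXT_WINDOWS = {"claude": 200_000, "gpt-4o": 128_000, "gemini": 1_000_000}

SECTION_CAPS = {"system": 200, "glossary": 500, "few_shot": 2_000, "rag_context": 4_000}

def budget_report(user_prompt: str, model: str, min_output_tokens: int = 1_000) -> dict:
    sections = dict(SECTION_CAPS)
    sections["user_prompt"] = len(user_prompt) // 4  # rough chars-per-token estimate
    used = sum(sections.values())
    window = CONTEXT_WINDOWS[model]
    utilization = used / window
    if window - used < min_output_tokens:
        level = "output_squeeze"  # not enough room left for a useful response
    elif utilization > 0.90:
        level = "red"
    elif utilization > 0.80:
        level = "amber"
    else:
        level = "ok"
    return {"sections": sections, "used": used, "utilization": utilization, "warning": level}
```

The same prompt can come back "ok" against Gemini's 1M window and "red" against GPT-4o's 128K, which is exactly the model-awareness the bar chart visualizes.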

Variable Inspector

The Variables tab detects {{variable}} and {{fragment:name}} references in real time as you edit the prompt template. Each detected variable is cross-referenced against the recipe’s inputSchema and assigned one of four statuses:

  • Matched (green) — Variable exists in both the template and the schema. Everything is wired up correctly.
  • Orphan (amber) — Variable appears in the template but isn’t defined in the schema. The prompt references something that won’t have a value at runtime.
  • Unused (red) — Variable is defined in the schema but never referenced in the template. You’re collecting input that goes nowhere.
  • Fragment (blue) — A {{fragment:name}} reference to a reusable prompt fragment.

For matched variables, the inspector pulls example values from historical run data, so you can see what real inputs look like without leaving the editor. This catches a common class of bugs: the schema defines a field as companyName but the template references {{company_name}}.
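The four statuses can be derived from a single pass over the template plus a set difference against the schema. A simplified sketch (the regex and function name are assumptions, and a real schema is richer than a set of field names):

```python
import re

# Matches {{variable}} and {{fragment:name}} references.
VAR_RE = re.compile(r"\{\{\s*(fragment:)?([A-Za-z0-9_]+)\s*\}\}")

def inspect_variables(template: str, input_schema: set[str]) -> dict[str, str]:
    statuses: dict[str, str] = {}
    referenced: set[str] = set()
    for frag_prefix, name in VAR_RE.findall(template):
        if frag_prefix:
            statuses[name] = "fragment"  # reusable prompt fragment reference
        else:
            referenced.add(name)
            statuses[name] = "matched" if name in input_schema else "orphan"
    # Schema fields never referenced in the template collect input that goes nowhere.
    for name in input_schema - referenced:
        statuses[name] = "unused"
    return statuses
```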

Few-Shot Example Management

The Few-Shot tab lets you pin successful runs as curated examples. Each pinned example has an editable output field — you can correct or refine the model’s original response to create a gold-standard demonstration.

Quality scoring determines which examples surface at runtime. Each example starts with a base score of 50. A thumbs-up adds 40 points. A thumbs-down subtracts 30 points. Judge scores from bakeoff evaluations are weighted in as well.

At runtime, the system supports three retrieval strategies:

  • Feedback-based — selects thumbs-up runs, diversified to avoid repetitive examples
  • Recent — selects the most recent successful runs
  • Similar — uses embedding cosine similarity to find the runs closest to the current input

Pinned curated examples always take priority over dynamically retrieved ones. They’re injected first, and the remaining budget fills with strategy-selected examples.

All few-shot examples are injected into the prompt as <few_shot_examples> XML blocks, constrained to a 2,000-token budget. If your curated examples exceed the budget, the system truncates from the lowest-scored examples first.
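The priority and truncation rules can be expressed as a greedy fill: pinned examples first, then strategy-selected ones, each group ordered by score, dropping whatever no longer fits. A sketch with an assumed example shape (per-example token counts precomputed):

```python
def assemble_few_shot(pinned: list[dict], retrieved: list[dict], budget: int = 2_000) -> str:
    """Greedily pack examples into the token budget; pinned always come first."""
    chosen, used = [], 0
    # Within each group, highest-scored first, so truncation hits low scores first.
    for example in (sorted(pinned, key=lambda e: -e["score"]) +
                    sorted(retrieved, key=lambda e: -e["score"])):
        if used + example["tokens"] <= budget:
            chosen.append(example)
            used += example["tokens"]
    body = "\n".join(f"Input: {e['input']}\nOutput: {e['output']}" for e in chosen)
    return f"<few_shot_examples>\n{body}\n</few_shot_examples>"
```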

AI-Powered Optimizer

The Optimizer tab provides three tiers of prompt improvement, escalating from manual to fully automated.

Tier 1: User-Triggered Analysis

Click “Analyze” and the optimizer pulls the last 50 successful runs for the recipe, partitions them into thumbs-up and thumbs-down buckets, and sends the distribution to Claude Sonnet 4.6 for structured analysis. The model returns a list of suggestions, each containing:

  • Section — which part of the prompt to change
  • Original text — the current prompt text in that section
  • Suggested text — the proposed replacement
  • Rationale — why this change should improve output quality
  • Confidence — 0-100 score indicating the model's certainty

You review each suggestion and choose Apply (inline replacement in the editor) or A/B Test (creates a bakeoff comparing the current prompt against the suggestion in production).
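Because each suggestion carries the original text alongside the replacement, applying one inline reduces to a targeted substitution. A hypothetical shape for the suggestion record and the apply step (names are ours, not JieGou's API):

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    section: str
    original_text: str
    suggested_text: str
    rationale: str
    confidence: int  # 0-100

def apply_suggestion(template: str, s: Suggestion) -> str:
    """Inline replacement: swap the original span for the suggested text."""
    if s.original_text not in template:
        # The prompt was edited after the suggestion was generated; refuse to guess.
        raise ValueError("suggestion no longer matches the current template")
    return template.replace(s.original_text, s.suggested_text, 1)
```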

Tier 2: Auto-Triggered Suggestions

When a recipe accumulates 5 or more thumbs-down ratings, the optimizer automatically generates 1-3 improvement suggestions without user intervention. These appear as a notification badge on the Optimizer tab.

This tier is rate-limited to once per hour per recipe to prevent suggestion fatigue. The intent is a gentle nudge — “your users are unhappy with this recipe’s output, here are some ideas” — not a firehose of changes.

Tier 3: Quality-Drift Refinement

This tier is triggered by the Quality Guard system when it detects drift in a recipe’s output quality over time. The optimizer analyzes the top 5 and bottom 5 runs from the drift window, identifies patterns in what’s degrading, and generates targeted prompt changes.

Tier 3 can also auto-trigger mini-bakeoffs — creating a structured A/B comparison between the current prompt and the suggested revision, routed to live production traffic. This closes the loop entirely: quality drops, the system proposes a fix, tests it against real inputs, and reports back.

Rate limits keep this conservative: refinement suggestions are limited to once per 24 hours, and auto-triggered bakeoffs to once per 7 days. Prompt optimization should be deliberate, not a runaway feedback loop.

Why It Lives in the Editor

The Studio is a panel, not a page. This is a deliberate design decision. Prompt engineering is iterative — you change a line, check the token budget, glance at the diff, run a test. Context-switching between separate tools breaks the flow.

With the Studio collapsed, you have a clean editor. With it expanded, every signal you need — token utilization, variable health, version history, few-shot examples, optimization suggestions — is one tab away. No navigation, no loading screens, no lost context.

Availability

The Prompt Engineering Studio is available on Pro plans and above. Version tracking, token budgeting, and variable inspection are included in Pro. Few-shot management and the AI-powered optimizer (all three tiers) are available on Pro and Enterprise.

Explore all features on the features page or start a free trial.
