Engineering

When Your Workflow Fails at Step 7: Debugging AI Automation with Execution Traces

JieGou's execution traces give you a hierarchical view of every step, input, output, timing, and retry attempt — so you can pinpoint exactly what went wrong.

JieGou Team · 4 min read

A workflow with 12 steps, two conditional branches, and a loop fails on the third iteration of the loop’s second sub-step. The error message says “unexpected response format.” Where do you start?

Without detailed execution data, you’re guessing. With execution traces, you see exactly what happened — every input, every output, every timing boundary, every retry attempt — in a hierarchical view that mirrors the workflow’s structure.

What execution traces capture

Every workflow run produces an execution trace: a tree of spans that mirrors the workflow’s execution graph. Each span records:

  • Step identity — Which step, by ID and name
  • Timing — Start time, end time, and duration in milliseconds
  • Inputs — The exact values that were passed to the step after template resolution
  • Outputs — The structured output the step produced (or the error it threw)
  • Retry attempts — If the step retried, each attempt is recorded with its error and the backoff delay before the next attempt
  • Nesting context — Which branch of a condition, which iteration of a loop, which branch of a parallel step
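Concretely, a span might look like this. The field names below are illustrative, not JieGou's actual schema:

```typescript
// Hypothetical shape of one trace span; children mirror the workflow graph.
interface TraceSpan {
  stepId: string;                                      // step identity
  stepName: string;
  startedAt: number;                                   // epoch ms
  endedAt: number;
  durationMs: number;
  input: unknown;                                      // after template resolution
  output?: unknown;                                    // structured output on success
  error?: string;                                      // error message on failure
  attempts: { error: string; backoffMs: number }[];    // retry history
  context?: { branch?: string; iteration?: number };   // nesting context
  children: TraceSpan[];
}

const span: TraceSpan = {
  stepId: "step-1",
  stepName: "Research prospect",
  startedAt: 1_700_000_000_000,
  endedAt: 1_700_000_002_300,
  durationMs: 2_300,
  input: { prospect: "acme.com" },
  output: { company_name: "Acme Corp" },
  attempts: [],
  children: [],
};
```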

Hierarchical structure

The trace tree follows the workflow’s structure exactly:

Workflow Run
├── Step 1: Research prospect (recipe) — 2.3s, success
├── Step 2: Qualify lead (recipe) — 1.8s, success
├── Step 3: Condition (score >= 70)
│   └── Then branch
│       ├── Step 3a: Parallel
│       │   ├── Branch A: Draft email — 3.1s, success
│       │   └── Branch B: Draft LinkedIn message — 2.7s, success
│       └── Step 3b: Review (approval) — waiting
└── Step 4: Archive (skipped — condition took then-branch)

Conditional branches show which path was taken and why. Loops show each iteration with its item and sub-step results. Parallel branches show concurrent execution with individual timings.

Three layers of observability

Execution traces are one of three complementary observability layers:

Execution traces (Firestore) are the detailed, per-run inspection tool. They power the run inspector UI and are designed for debugging individual runs. Every run gets a trace, and traces are retained for the life of the run record.

Prometheus metrics track aggregate operational health: total executions, success rates, duration percentiles, and error counts. These feed dashboards and alerting — “workflow X’s success rate dropped below 95% in the last hour.”

OpenTelemetry spans provide distributed tracing across service boundaries. The withSpan() wrapper creates spans for any instrumented code path, carrying workflow and step metadata. These integrate with any OTel-compatible backend (Jaeger, Datadog, Honeycomb) for cross-system trace correlation.
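The pattern behind a `withSpan()`-style wrapper can be sketched in a few lines. This is a standalone illustration with an assumed signature, not JieGou's actual OTel integration, which exports spans to a configured backend rather than logging:

```typescript
// Minimal sketch of the withSpan() pattern: attach workflow/step metadata
// and timing to any async code path. SpanMeta is an assumed shape.
type SpanMeta = { workflowId: string; stepId: string };

async function withSpan<T>(meta: SpanMeta, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    const durationMs = Date.now() - start;
    // A real implementation would emit an OpenTelemetry span here.
    console.log(`${meta.workflowId}/${meta.stepId}: ${durationMs}ms`);
  }
}
```

Because the timing lives in a `finally` block, the duration is recorded whether the wrapped function resolves or throws.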

Using traces to debug

Finding the failure point

Open the run detail page and expand the trace tree. Failed steps are highlighted. Click the failed step to see its exact inputs and the error message. Compare the inputs to what you expected — often the issue is an upstream step producing unexpected output that gets passed to a downstream step.

Spotting template resolution issues

Each step’s trace shows the resolved input values — after template substitution. If a step expected {{step.research.company_name}} but got undefined, the trace shows you the actual resolved value. Check whether the referenced step produced the expected output field.
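A minimal sketch of what resolution does, assuming a `{{step.<id>.<field>}}` path syntax, shows how a missing upstream field surfaces as `undefined`:

```typescript
// Illustrative template resolver: looks up a prior step's output field.
// Returns undefined when the referenced field was never produced.
function resolveTemplate(
  template: string,
  outputs: Record<string, Record<string, unknown>>
): unknown {
  const match = template.match(/^\{\{step\.([\w-]+)\.([\w-]+)\}\}$/);
  if (!match) return template; // literal value, no substitution needed
  const [, stepId, field] = match;
  return outputs[stepId]?.[field];
}

const outputs = { research: { company_name: "Acme Corp" } };
resolveTemplate("{{step.research.company_name}}", outputs); // "Acme Corp"
resolveTemplate("{{step.research.revenue}}", outputs);      // undefined
```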

Identifying performance bottlenecks

Sort by duration to find the slowest steps. A step that takes 15 seconds when similar steps take 2 seconds might be hitting a rate limit, using a more expensive model than necessary, or processing unexpectedly large input.
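The same ranking can be done programmatically if you export a trace: flatten the tree, then sort by duration. The span shape here is illustrative:

```typescript
// Sketch: walk a trace tree and return the N slowest steps.
type Span = { stepName: string; durationMs: number; children: Span[] };

function slowestSteps(root: Span, topN = 3): Span[] {
  const flat: Span[] = [];
  const walk = (s: Span) => { flat.push(s); s.children.forEach(walk); };
  walk(root);
  return flat.sort((a, b) => b.durationMs - a.durationMs).slice(0, topN);
}

const sampleRoot: Span = {
  stepName: "run", durationMs: 20_000, children: [
    { stepName: "qualify", durationMs: 1_800, children: [] },
    { stepName: "draft-email", durationMs: 15_000, children: [] },
  ],
};
```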

Debugging retry behavior

When a step retried before succeeding (or failing permanently), the trace shows each attempt: the error that triggered the retry, the backoff delay, and the outcome of the next attempt. Common patterns: transient 429 rate limits (retries succeed after backoff), persistent 400 errors (retries all fail with the same error — fix the input, not the retry count).
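Those two patterns can be told apart mechanically from the attempt history. A sketch, with an assumed attempt shape where a `null` error marks the successful attempt:

```typescript
// Sketch: classify a step's retry history from its trace.
// null error = that attempt succeeded.
type Attempt = { error: string | null; backoffMs: number };

function classifyRetries(
  attempts: Attempt[]
): "succeeded" | "transient-recovered" | "persistent-failure" {
  const last = attempts[attempts.length - 1];
  if (attempts.length === 1 && last.error === null) return "succeeded";
  if (last.error === null) return "transient-recovered"; // e.g. 429 then success
  return "persistent-failure"; // e.g. same 400 on every attempt
}
```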

Traces are automatic

You don’t configure tracing. Every workflow run produces a trace, every step is instrumented, and the trace is persisted asynchronously (fire-and-forget) so it doesn’t slow down execution. The run inspector reads traces on demand when you open the detail view.

Execution traces are available on all plans. Prometheus metrics and OpenTelemetry integration are available on Enterprise. Learn more about workflows or start your free trial.
