Building a single-step AI operation is straightforward: take input, call an LLM, return output. Building a multi-step workflow engine that supports branching, loops, parallel execution, human approval gates, and automatic retries is a different kind of problem entirely.
This post covers the architecture of JieGou’s workflow engine — the execution model, the 8 step types, how data flows between steps, how approvals pause and resume execution, and the guardrails that keep everything reliable.
Execution Model
A workflow is a directed graph of steps. Execution starts with executeWorkflow(), which creates a WorkflowRun record, builds a shared StepExecutionContext, and calls executeStepList() to process steps sequentially.
The context carries a previousStepOutputs Map — a key-value store where each completed step deposits its output for downstream steps to reference. This is the backbone of data flow. Step B can reference Step A’s output using template syntax like {{step.stepA.fieldName}}.
Each workflow has a configurable timeout (default 5 minutes), enforced via a deadlineMs field in the context. Individual steps also have their own timeout (default 60 seconds) enforced by a withStepTimeout() wrapper.
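A per-step timeout wrapper like `withStepTimeout()` can be sketched as a race between the step's promise and a timer. This is an illustrative shape, not the engine's actual implementation; the signature and error message are assumed.

```typescript
// Hypothetical sketch of a per-step timeout wrapper.
// Races the step's work against a timer; whichever settles first wins.
async function withStepTimeout<T>(
  work: Promise<T>,
  timeoutMs: number = 60_000, // the 60-second per-step default
  stepId: string = "unknown",
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Step ${stepId} timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // avoid a dangling timer keeping the process alive
  }
}
```

The `finally` block matters: without it, every completed step would leave a live timer behind.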
The 8 Step Types
Recipe Step
The workhorse. Executes a reusable prompt template via executeRecipe() and stores the parsed output in previousStepOutputs. Input mapping resolves references to workflow inputs, previous step outputs, static values, or loop items.
Condition Step
Evaluates a boolean expression and executes either thenSteps or elseSteps recursively. This is a true branch — both paths can contain any step type, including nested conditions. The engine calls executeStepList() recursively on the chosen branch.
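The recursion can be sketched with heavily simplified types. The step shapes and predicate form here are assumptions for illustration; only the `executeStepList` / `thenSteps` / `elseSteps` names come from the engine.

```typescript
// Minimal sketch of recursive branch execution (types simplified, internals assumed).
interface Ctx {
  outputs: Map<string, unknown>; // stand-in for previousStepOutputs
}

type Step =
  | { type: "recipe"; id: string; run: (ctx: Ctx) => unknown }
  | { type: "condition"; test: (ctx: Ctx) => boolean; thenSteps: Step[]; elseSteps: Step[] };

function executeStepList(steps: Step[], ctx: Ctx): void {
  for (const step of steps) {
    if (step.type === "condition") {
      // True branching: recurse into whichever list the predicate selects.
      const branch = step.test(ctx) ? step.thenSteps : step.elseSteps;
      executeStepList(branch, ctx);
    } else {
      ctx.outputs.set(step.id, step.run(ctx));
    }
  }
}
```

Because branches are just step lists, nesting a condition inside a condition falls out for free.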
Loop Step
Iterates over a collection and executes a list of sub-steps for each item. The collection can come from four sources: a static array defined in the workflow, a previous step’s output, a workflow input field, or parent loop items.
Each iteration gets its own loopContext Map, so sub-steps can reference the current item via {{loop_item.fieldName}}. Iteration results are stored as an array with nested stepRuns for observability.
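The per-iteration isolation can be sketched like this. The body callback and result shape are assumptions; only the `loopContext` idea and the per-iteration result array come from the description above.

```typescript
// Sketch of loop iteration with a fresh loopContext per item (shapes assumed).
interface LoopResult {
  item: unknown;
  outputs: Map<string, unknown>; // stand-in for this iteration's nested stepRuns
}

function executeLoop(
  items: unknown[],
  body: (loopContext: Map<string, unknown>) => Map<string, unknown>,
): LoopResult[] {
  return items.map((item) => {
    // A fresh context per iteration means {{loop_item.*}} resolves
    // to this item only, never to a neighbor's.
    const loopContext = new Map<string, unknown>([["loop_item", item]]);
    return { item, outputs: body(loopContext) };
  });
}
```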
Parallel Step
Executes multiple branches concurrently via Promise.allSettled(). Each branch is an independent list of steps with its own stepRuns array. This is useful when multiple independent operations can run simultaneously — for example, enriching a lead from three different data sources at once.
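A minimal sketch of the fan-out, assuming branches reduce to async thunks. `Promise.allSettled()` is the key choice: one failing branch doesn't reject its siblings, so every branch's outcome is recorded.

```typescript
// Sketch of parallel branch execution (branch and result shapes assumed).
async function executeParallel(
  branches: Array<() => Promise<unknown>>,
): Promise<Array<{ status: "ok" | "failed"; value?: unknown; error?: string }>> {
  // allSettled never rejects: each branch settles independently.
  const settled = await Promise.allSettled(branches.map((branch) => branch()));
  return settled.map((r) =>
    r.status === "fulfilled"
      ? { status: "ok", value: r.value }
      : { status: "failed", error: String(r.reason) },
  );
}
```

With `Promise.all()` instead, one data source being down would discard the results of the other two enrichment branches.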
Approval Step
The most architecturally interesting step type. When execution reaches an approval step, it throws an ApprovalPauseError. This is a controlled exception — not a crash.
The error is caught at the top level, the WorkflowRun is persisted with status pending_approval, and eligible approvers are notified via email. Execution stops completely. No resources are held.
When an approver acts (approve or reject via the API), resumeWorkflowFromApproval() loads the persisted run, calls reconstructPreviousOutputs() to rebuild the previousStepOutputs Map from the saved stepRuns, and resumes execution from the next step after the approval gate.
The reconstruction is recursive — it walks through thenStepRuns, elseStepRuns, iterations[], and branchStepRuns[] to restore every nested output. This means approvals work correctly even when they’re inside a conditional branch or a loop iteration.
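The walk can be sketched as a recursive accumulator. The `StepRun` shape here is an assumption; the field names (`thenStepRuns`, `elseStepRuns`, `iterations`, `branchStepRuns`) follow the description above.

```typescript
// Hypothetical sketch of recursive output reconstruction from persisted runs.
interface StepRun {
  stepId: string;
  output?: unknown;
  thenStepRuns?: StepRun[];
  elseStepRuns?: StepRun[];
  iterations?: { stepRuns: StepRun[] }[];
  branchStepRuns?: StepRun[][];
}

function reconstructPreviousOutputs(
  runs: StepRun[],
  outputs: Map<string, unknown> = new Map(),
): Map<string, unknown> {
  for (const run of runs) {
    if (run.output !== undefined) outputs.set(run.stepId, run.output);
    // Walk every nesting shape so a resumed approval sees all prior outputs,
    // regardless of how deeply the approval gate was nested.
    reconstructPreviousOutputs(run.thenStepRuns ?? [], outputs);
    reconstructPreviousOutputs(run.elseStepRuns ?? [], outputs);
    for (const it of run.iterations ?? []) reconstructPreviousOutputs(it.stepRuns, outputs);
    for (const branch of run.branchStepRuns ?? []) reconstructPreviousOutputs(branch, outputs);
  }
  return outputs;
}
```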
Write-to-KB Step
Captures a step’s output and writes it to a knowledge base document. This enables workflows that build institutional knowledge — a support triage workflow might write resolution summaries to a KB that future runs reference via RAG.
Handoff Step
Notifies users in a target department via email and in-app notification. It’s a no-op for execution — it doesn’t produce output or block the pipeline. Useful for escalation flows where a human needs to be alerted but the workflow should continue.
Browser Action Step
Executes an MCP tool call via the browser extension. Acquires a client from the MCP connection pool, resolves template arguments (workflow input fields and previous step outputs), and returns the tool result.
Data Flow: Template Resolution
Steps reference each other through a template syntax resolved at execution time:
- {{workflow_input.fieldName}} — References the workflow’s input data
- {{step.stepId.path.to.value}} — References a previous step’s output using dot notation
- {{loop_item.fieldName}} — References the current item in a loop iteration
The resolver uses getNestedValue() for dot-notation path traversal, handling arrays and nested objects. Input mappings declare their source explicitly: workflowInput, previousStep, static, or loopItem.
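The traversal and substitution can be sketched in a few lines. The names `getNestedValue` and the `{{...}}` syntax come from the description above; the `resolveTemplate` helper and its scope shape are assumptions for illustration.

```typescript
// Sketch of dot-notation traversal; array indices work because
// numeric keys index arrays the same way as object properties.
function getNestedValue(obj: unknown, path: string): unknown {
  return path.split(".").reduce<unknown>((cur, key) => {
    if (cur == null) return undefined;
    return (cur as Record<string, unknown>)[key];
  }, obj);
}

// Hypothetical resolver: replaces each {{...}} expression with the value
// found in the given scopes (e.g. workflow_input, step, loop_item).
function resolveTemplate(template: string, scopes: Record<string, unknown>): string {
  return template.replace(/\{\{([^}]+)\}\}/g, (_match, expr: string) => {
    const value = getNestedValue(scopes, expr.trim());
    return value === undefined ? "" : String(value);
  });
}
```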
Retry and Error Handling
Not all failures are permanent. Rate limits, transient 5xx errors, timeouts, and connection failures are retryable. Client-side validation errors (4xx) are not.
The retry strategy uses exponential backoff with jitter: Math.min(30000, 2000 * 2^attempt + random jitter). That’s roughly 2 seconds, 4 seconds, 8 seconds, 16 seconds, capped at 30 seconds, with random jitter to avoid thundering herd problems. Default max attempts is 3, configurable per step.
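The formula above translates directly into code. The jitter amount here (up to one second) is an assumption; the base, growth, and cap follow the formula. The random source is injectable so the function is testable.

```typescript
// Sketch of the backoff formula: min(30000, 2000 * 2^attempt + jitter).
function backoffDelayMs(attempt: number, rand: () => number = Math.random): number {
  const base = 2000 * 2 ** attempt; // 2s, 4s, 8s, 16s, ...
  const jitter = rand() * 1000;     // up to 1s of jitter (amount assumed)
  return Math.min(30_000, base + jitter); // hard cap at 30s
}
```

The jitter term is what breaks up the thundering herd: without it, every step that hit the same rate limit retries at the same instant.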
Each retry is logged with the attempt index and error details. If all attempts fail, the step is marked as failed and the workflow terminates (unless error handling at the workflow level specifies otherwise).
Concurrency Control
Each account is limited to 10 concurrent LLM calls via a Redis semaphore. This prevents a batch workflow from monopolizing provider connections.
MCP connections use a pooled client with LRU eviction (max 20 connections, 60-second idle timeout). The pool handles connection setup, keepalive, and cleanup.
Both systems use fail-open semantics on Redis errors — a degraded cache layer should never prevent workflow execution.
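The fail-open pattern can be sketched as a thin wrapper around the acquire call. The `tryAcquire` interface is an assumption; the point is only the `catch` branch.

```typescript
// Hypothetical fail-open wrapper around a Redis-backed semaphore acquire.
// tryAcquire resolves true when a permit is granted, false at the limit,
// and rejects when Redis itself is unreachable.
async function acquireFailOpen(tryAcquire: () => Promise<boolean>): Promise<boolean> {
  try {
    return await tryAcquire();
  } catch {
    // Redis down: proceed as if a permit were granted. A degraded cache
    // layer should throttle nothing, not everything.
    return true;
  }
}
```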
Observability
Three layers of observability run concurrently:
Prometheus metrics track workflowExecutionsTotal and workflowDuration by status. This gives operational dashboards for success rates and latency percentiles.
OpenTelemetry spans create a trace hierarchy with workflow-level and step-level spans, carrying workflow ID, name, and step metadata. These integrate with any OTel-compatible backend.
Execution traces are a custom hierarchical span tree persisted to Firestore via fire-and-forget. These power the detailed run inspector in the UI, showing exactly what happened at each step including inputs, outputs, timing, and retry attempts.
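Fire-and-forget reduces to dispatching the write without awaiting it, while still attaching an error handler so a failed write can't surface as an unhandled rejection. The function names here are assumed.

```typescript
// Sketch of fire-and-forget persistence: dispatch the write, log failures,
// return to the caller immediately.
function persistTraceFireAndForget(
  write: () => Promise<void>,
  log: (msg: string) => void = console.error,
): void {
  // Intentionally not awaited: the hot path does not wait for Firestore.
  void write().catch((err) => log(`trace write failed: ${err}`));
}
```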
Pre-Flight Checks
Before execution begins, the engine runs validation:
- Input schema validation — Workflow inputs are checked against the declared schema
- Mapping validation — All step input mappings are verified (referenced steps exist, paths are plausible)
- MCP verification — If any steps require browser actions, the engine verifies MCP client connectivity upfront rather than failing mid-execution
- Auto-context resolution — Knowledge base documents are resolved from workflow, department, and explicit ID sources, so RAG context is ready before the first step runs
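The mapping validation can be sketched as a single forward pass: a `previousStep` reference is valid only if the referenced step appears earlier in the list. The mapping and step shapes here are assumptions; the source names (`workflowInput`, `previousStep`, `static`, `loopItem`) come from the template-resolution section.

```typescript
// Sketch of a pre-flight mapping check (structures assumed).
interface Mapping {
  source: "workflowInput" | "previousStep" | "static" | "loopItem";
  ref?: string; // for previousStep: the id of the step being referenced
}
interface StepDef {
  id: string;
  inputs: Mapping[];
}

function validateMappings(steps: StepDef[]): string[] {
  const seen = new Set<string>();
  const errors: string[] = [];
  for (const step of steps) {
    for (const mapping of step.inputs) {
      // A previousStep mapping must point at a step already seen.
      if (mapping.source === "previousStep" && (!mapping.ref || !seen.has(mapping.ref))) {
        errors.push(`step ${step.id}: unknown previous step "${mapping.ref}"`);
      }
    }
    seen.add(step.id);
  }
  return errors;
}
```

Catching a dangling reference here costs microseconds; catching it mid-execution costs a half-finished run.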
Lessons Learned
Approval gates as exceptions simplify the architecture. Our first design used a state machine with explicit pause/resume states. The exception-based approach is cleaner — the workflow runs forward until it hits an approval, throws, persists, and stops. Resume reconstructs state and continues. No complex state transitions to manage.
Recursive output reconstruction is worth the complexity. Rebuilding previousStepOutputs from persisted step runs requires walking through every nesting level — conditions, loops, parallel branches. It’s intricate code, but it means approvals work correctly at any nesting depth without special-casing.
Per-step timeouts prevent cascading delays. A single slow LLM call shouldn’t eat the entire workflow timeout. The 60-second per-step default catches hung requests early, and the workflow-level timeout acts as a backstop.
Fire-and-forget observability keeps the hot path fast. Execution traces, webhook deliveries, and destination outputs are all dispatched asynchronously after the main execution completes. The user gets their result without waiting for observability writes to finish.