
Execution Insights: Automated Anomaly Detection for AI Workflows

JieGou's Execution Insights panel detects failure patterns, cost spikes, latency anomalies, and error clustering across your AI recipes — with severity-ranked insights and actionable recommendations, directly in the Operations Hub.

JieGou Team · 6 min read

Running one recipe is simple. Running 50 recipes across 8 departments, each calling different LLM providers with different cost profiles and latency characteristics, is an operations problem. Standard monitoring tools can tell you if a server is down. They can’t tell you that your contract review recipe started consuming 3x the tokens last Tuesday, or that three different recipes are failing with semantically similar errors that point to the same upstream issue.

Execution Insights is an anomaly detection system built specifically for AI workflow operations. It lives in the Operations Hub on the /operations/landscape page and continuously analyzes execution data to surface problems you would otherwise miss.

Four detection patterns

Execution Insights runs four specialized detectors, each designed to catch a different class of operational problem.

Failure pattern detection

The failure detector flags recipes whose error rate exceeds 20% over the configured time window. A recipe that fails once in 100 runs is normal. A recipe that fails 25 times in 100 runs has a systemic problem — a broken API integration, a prompt that chokes on a new input pattern, or a model that started refusing certain requests.

The detector doesn’t just count failures. It examines the failure trajectory. A recipe that went from 2% failure rate to 22% failure rate in the last 48 hours is more urgent than one that has been hovering at 21% for weeks. The insight includes which specific recipes are affected, the time range over which the pattern was detected, and a recommendation for investigation.
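Conceptually, the check described above looks something like the following sketch. The record shapes, function names, and the "compare the recent half of the window against the earlier half" trajectory heuristic are illustrative assumptions, not JieGou's actual implementation; only the 20% threshold comes from the product behavior described here.

```typescript
// Hypothetical sketch of the failure-pattern detector.
interface Execution {
  recipeId: string;
  success: boolean;
  timestamp: number; // epoch ms
}

const FAILURE_THRESHOLD = 0.2; // 20% error rate, per the detector's rule

function errorRate(runs: Execution[]): number {
  if (runs.length === 0) return 0;
  const failures = runs.filter((r) => !r.success).length;
  return failures / runs.length;
}

// Flags a recipe whose error rate exceeds the threshold, and marks it
// "escalating" when the recent half of the window is markedly worse
// than the earlier half (an assumed stand-in for the trajectory check).
function checkFailurePattern(
  runs: Execution[],
  windowStart: number,
  windowEnd: number
) {
  const inWindow = runs.filter(
    (r) => r.timestamp >= windowStart && r.timestamp <= windowEnd
  );
  const mid = (windowStart + windowEnd) / 2;
  const early = inWindow.filter((r) => r.timestamp < mid);
  const recent = inWindow.filter((r) => r.timestamp >= mid);
  const rate = errorRate(inWindow);
  return {
    flagged: rate > FAILURE_THRESHOLD,
    rate,
    escalating: errorRate(recent) > 2 * errorRate(early),
  };
}
```

A recipe that fails 25 times in 100 runs would be flagged; if 24 of those failures landed in the last half of the window, it would also be marked as escalating.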

Cost spike detection

LLM costs are proportional to token usage, and token usage can change without any code changes. A model update might produce longer outputs. An upstream data source might start returning larger documents. A prompt refinement might accidentally remove a length constraint.

The cost detector flags recipes whose token usage has increased by more than 50% compared to their baseline. The baseline is computed from historical execution data within the configured time window. When a recipe that typically uses 2,000 tokens per run starts averaging 3,500 tokens, the detector surfaces it — along with the affected recipes, the magnitude of the increase, and the estimated cost impact.
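The baseline comparison can be sketched in a few lines. Using the mean of historical token counts as the baseline is an assumption for illustration; only the 50%-over-baseline rule comes from the detector's described behavior.

```typescript
const COST_SPIKE_RATIO = 1.5; // >50% above baseline, per the detector's rule

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Compares recent average token usage against a baseline computed from
// historical executions inside the configured time window.
function checkCostSpike(baselineTokens: number[], recentTokens: number[]) {
  const baseline = mean(baselineTokens);
  const current = mean(recentTokens);
  const ratio = current / baseline;
  return {
    flagged: ratio > COST_SPIKE_RATIO,
    increasePct: Math.round((ratio - 1) * 100),
  };
}
```

The example from the paragraph above (2,000 tokens per run drifting to 3,500) is a 75% increase, comfortably over the 50% threshold.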

This is a signal that generic monitoring tools don’t provide. CPU and memory look fine. HTTP status codes are all 200. But your bill is growing 50% faster than your usage, and the reason is buried in token-level execution data that only an AI-specific monitoring system tracks.

Latency anomaly detection

The latency detector compares recent execution times against the p95 baseline and flags recipes exceeding 2x that threshold. A recipe with a p95 latency of 4 seconds that starts regularly taking 10 seconds has a problem — even if it’s technically completing successfully.

Latency anomalies in AI workflows often signal upstream issues: a model provider experiencing degradation, an MCP tool taking longer to respond, or a knowledge base query hitting a slow path. The insight includes the baseline p95, the current observed latency, and which recipes are affected, giving you enough context to start diagnosis immediately.
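As a sketch of the comparison, here is a nearest-rank p95 with the 2x rule applied. Nearest-rank is one common percentile convention; the article doesn't specify which method JieGou uses, so treat that detail as an assumption.

```typescript
const LATENCY_MULTIPLIER = 2; // flag when recent p95 exceeds 2x baseline p95

// Nearest-rank p95 over a sample of latencies (one common convention;
// the exact percentile method here is an assumption).
function p95(latenciesMs: number[]): number {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length) - 1;
  return sorted[rank];
}

function checkLatencyAnomaly(baselineMs: number[], recentMs: number[]) {
  const baselineP95 = p95(baselineMs);
  const observed = p95(recentMs);
  return {
    flagged: observed > LATENCY_MULTIPLIER * baselineP95,
    baselineP95,
    observed,
  };
}
```

With a baseline p95 of 4 seconds, recent runs landing around 10 seconds clear the 2x threshold and get flagged, matching the example above.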

Error clustering

Individual errors are noise. Three or more recipes failing with semantically similar error messages is a pattern. The error clustering detector groups errors across recipes and flags clusters of 3 or more similar errors within the time window.

This catches cross-cutting failures that per-recipe monitoring misses. If your Anthropic API key expires, five different recipes will start failing with similar authentication errors. Without clustering, you see five separate failures. With clustering, you see one root cause affecting five recipes — and the recommendation points you toward the shared dependency.
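A greedy clustering pass over error messages conveys the idea. Token-overlap (Jaccard) similarity here is a deliberately simple stand-in for whatever semantic similarity measure JieGou actually uses, and the 0.5 threshold is an assumption; only the minimum cluster size of 3 comes from the detector's described behavior.

```typescript
const MIN_CLUSTER_SIZE = 3; // 3+ similar errors, per the detector's rule

// Token-Jaccard similarity: a simple stand-in for semantic similarity.
function similarity(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\W+/).filter(Boolean));
  const tb = new Set(b.toLowerCase().split(/\W+/).filter(Boolean));
  const inter = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : inter / union;
}

interface RecipeError {
  recipeId: string;
  message: string;
}

// Greedy single-pass clustering: each error joins the first cluster whose
// representative (first) message is similar enough, then small clusters
// are dropped.
function clusterErrors(
  errors: RecipeError[],
  threshold = 0.5
): RecipeError[][] {
  const clusters: RecipeError[][] = [];
  for (const err of errors) {
    const home = clusters.find(
      (c) => similarity(c[0].message, err.message) >= threshold
    );
    if (home) home.push(err);
    else clusters.push([err]);
  }
  return clusters.filter((c) => c.length >= MIN_CLUSTER_SIZE);
}
```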

Severity ranking and recommendations

Every insight is classified into one of three severity levels:

  • Critical — Immediate attention required. High failure rates, extreme cost spikes, or large error clusters that indicate systemic problems.
  • Warning — Degradation detected but not yet critical. Moderate cost increases, elevated latency, or emerging failure patterns.
  • Info — Worth knowing but not urgent. Minor deviations, single-recipe anomalies, or patterns that are trending toward a threshold but haven’t crossed it.

Each insight includes a structured recommendation — not just “investigate this recipe” but specific next steps. A cost spike insight might recommend checking the recipe’s prompt for missing length constraints or comparing token usage before and after a recent model change. A failure pattern insight might recommend reviewing the recipe’s error logs for the top failure reason.

Insights are displayed in the ExecutionInsightsPanel sorted by severity, so critical issues are always at the top. Each insight card shows the type, severity, title, description, affected recipes, time range, recommendation, and supporting data points.
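The severity-first ordering is straightforward to express. The `Insight` shape below mirrors the card fields described above but is a hypothetical type, not the panel's actual props.

```typescript
type Severity = "critical" | "warning" | "info";

// Lower number sorts first; critical issues always surface at the top.
const SEVERITY_ORDER: Record<Severity, number> = {
  critical: 0,
  warning: 1,
  info: 2,
};

// Hypothetical insight shape for illustration.
interface Insight {
  severity: Severity;
  title: string;
}

function sortInsights(insights: Insight[]): Insight[] {
  return [...insights].sort(
    (a, b) => SEVERITY_ORDER[a.severity] - SEVERITY_ORDER[b.severity]
  );
}
```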

Time range configuration

Anomaly detection is only as good as the window you’re looking at. A spike that’s alarming over 7 days might be normal seasonal variation over 90 days. Execution Insights supports three configurable time ranges:

  • 7 days — Best for catching acute problems. Short baseline, high sensitivity.
  • 30 days — Balanced view. Smooths out daily variation while still catching week-over-week changes.
  • 90 days — Long-term trends. Best for identifying gradual drift in cost or latency that accumulates slowly.

Switching between time ranges updates all four detectors simultaneously, so you can quickly cross-reference whether a 7-day anomaly is also visible at 30 days (real problem) or disappears at wider windows (temporary blip).

Operations Hub integration

Execution Insights lives alongside the other Operations Hub views: automation landscape, governance, revenue analytics, availability monitoring, and security monitoring. This placement is intentional. Anomaly detection isn’t a standalone tool — it’s part of operational awareness.

The insights API is accessible at /api/insights/execution with audit:read permission. This means any team member with operational visibility can query insights programmatically — feeding them into Slack alerts, external dashboards, or automated remediation workflows.
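Querying the API from a script might look like this. The endpoint path and the `audit:read` requirement come from the paragraph above; the base URL, bearer-token auth, `days` query parameter, and response shape are all assumptions for illustration.

```typescript
// Builds the insights URL; the `days` query parameter name is assumed.
function insightsUrl(baseUrl: string, days: 7 | 30 | 90): string {
  return `${baseUrl}/api/insights/execution?days=${days}`;
}

// Fetches insights for the given time range. Requires a token whose
// holder has audit:read permission; the auth scheme here is assumed.
async function fetchExecutionInsights(
  baseUrl: string,
  token: string,
  days: 7 | 30 | 90
): Promise<unknown> {
  const res = await fetch(insightsUrl(baseUrl, days), {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) throw new Error(`insights request failed: ${res.status}`);
  return res.json();
}
```

From here, the parsed insights can be forwarded to a Slack webhook, pushed into an external dashboard, or used to trigger a remediation workflow.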

Why AI-specific monitoring matters

Generic application monitoring watches HTTP status codes, response times, error rates, and resource utilization. These metrics matter, but they miss the signals that are unique to AI workflows.

Token cost is invisible to APM tools. A recipe can return HTTP 200 with a correct output and still cost 3x what it should because the model is generating unnecessarily verbose responses. Execution Insights tracks token usage at the recipe level and detects when costs diverge from baselines.

Model latency is not server latency. A 12-second response time might be normal for a recipe that calls Claude Opus with a 50,000-token context window. The same 12-second response from a Haiku recipe that normally completes in 2 seconds is a red flag. Execution Insights maintains per-recipe baselines instead of applying one-size-fits-all latency thresholds.

Semantic error clustering requires understanding error messages. Traditional monitoring groups errors by HTTP status code or error class. Execution Insights groups errors by semantic similarity, catching patterns like “rate limit exceeded” and “too many requests” as the same underlying issue even though they’re different strings.

These are the signals that tell you whether your AI automation is healthy — not just whether your servers are running.

Execution Insights is available on Team and Enterprise plans. Explore the Operations Hub or start your free trial.

operations monitoring anomaly-detection observability insights