Engineering

99.18% Test Coverage, 24,000+ Tests: The Most Tested AI Automation Platform

Why JieGou runs 24,000+ automated tests at 99.18% coverage — and how our testing infrastructure feeds directly into SOC 2 compliance evidence.

JieGou Team · 4 min read

AI automation platforms make decisions that affect real business processes. When a recipe generates a customer email, or a workflow approves a purchase order, or an agent delegates tasks across departments — the output matters. If the platform has bugs, the business has bugs.

That’s why JieGou runs 24,000+ automated tests with 99.18% code coverage. Every night. Across all 4 LLM providers. With accessibility audits, visual regression testing, and RBAC enforcement verification included.

No other AI automation platform publishes these numbers. Most don’t have them.

Why testing matters more for AI platforms

Traditional SaaS testing is straightforward: given input X, expect output Y. AI automation platforms add three layers of complexity:

  1. Non-deterministic outputs — LLMs don’t return the same response twice. Tests must validate structure, constraints, and quality rather than exact strings.
  2. Multi-provider variability — JieGou supports 4 LLM providers (Anthropic, OpenAI, Google, and any OpenAI-compatible endpoint). Each has different capabilities, error modes, and response formats.
  3. Orchestration complexity — Workflows chain multiple steps with conditional logic, parallel execution, approval gates, and convergence loops. A bug in step 3 can corrupt step 7’s output through shared state.

These challenges are exactly why testing discipline matters. Without it, you’re shipping bugs you can’t reproduce because they only appear under specific LLM response patterns.
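The first challenge — validating structure rather than exact strings — boils down to asserting a predicate over the response's shape and constraints. A minimal sketch (the type and helper names here are illustrative, not JieGou's actual API):

```typescript
interface GeneratedEmail {
  subject: string;
  body: string;
}

// Validate the *shape* of an LLM-generated email, not its exact wording,
// so the assertion holds across non-deterministic responses.
function isValidEmailOutput(email: GeneratedEmail): boolean {
  return (
    email.subject.trim().length > 0 &&
    email.subject.length <= 120 &&        // constraint: sane subject length
    email.body.trim().length > 0 &&
    !/\{\{.*?\}\}/.test(email.body)       // constraint: no unresolved template placeholders
  );
}

// Two different model responses both pass; a malformed one fails.
console.log(isValidEmailOutput({ subject: "Order shipped", body: "Hi Alex, it's on the way." })); // true
console.log(isValidEmailOutput({ subject: "", body: "Hello {{name}}" })); // false
```

Any response satisfying the constraints passes, which is exactly what makes the test stable across providers and temperatures.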

What 24,000+ tests cover

Unit tests (Vitest)

The bulk of our test suite — server-side logic, data transformations, validation rules, and business logic:

  • LLM layer: Provider routing, BYOK key resolution, circuit breaker state machines, concurrency limiting, token usage tracking
  • Workflow engine: Step execution (recipe, condition, loop, parallel, approval, LLM, eval, router, aggregator), DAG execution, convergence loops, checkpoint/resume
  • Security: RBAC enforcement (20 permissions across 5 roles), auth guard, API key encryption/decryption, session management
  • SOC 2 evidence: Access review generation, encryption inventory, vendor register, incident response runbook, audit log summaries
  • Data layer: Firestore CRUD, Redis caching, rate limiting, dead letter queue
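To make "circuit breaker state machines" concrete: the unit tests for a breaker walk it through its closed → open → half-open transitions deterministically. A minimal sketch of that kind of state machine (thresholds, names, and the cooldown policy are assumptions, not JieGou's internals):

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Minimal circuit breaker: opens after `threshold` consecutive failures,
// moves to half-open after `cooldownMs`, and closes again on a success.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold = 3, private cooldownMs = 30_000) {}

  current(now = Date.now()): BreakerState {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  recordFailure(now = Date.now()): void {
    this.failures += 1;
    // A failure while half-open, or hitting the threshold, (re)opens the breaker.
    if (this.failures >= this.threshold || this.state === "half-open") {
      this.state = "open";
      this.openedAt = now;
      this.failures = 0;
    }
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }
}
```

Because `current` takes an explicit clock, every transition — including the cooldown — can be asserted without sleeping, which is what keeps such tests fast enough to run thousands of them nightly.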

E2E tests (Playwright)

Full browser automation that exercises the real application:

  • User journeys: Admin onboarding, department lead review, developer workflow creation
  • Route coverage: Every route in the application (bundles, entities, groups, integrations, knowledge bases, recordings, pricing, redirects)
  • RBAC enforcement: Negative tests verifying that unauthorized users get 403s
  • Data consistency: API response ↔ UI rendering verification, concurrent operation handling
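At the core of those RBAC negative tests is a deny-by-default permission check: if the matrix doesn't grant it, the API must return 403. A hypothetical sketch with an illustrative subset of roles and permissions (the post mentions 20 permissions across 5 roles; none of the names below are confirmed):

```typescript
type Role = "admin" | "department_lead" | "developer" | "viewer";
type Permission = "workflow.create" | "workflow.approve" | "apikey.manage";

// Illustrative subset of a role → permission matrix; anything absent is denied.
const grants: Record<Role, ReadonlySet<Permission>> = {
  admin: new Set<Permission>(["workflow.create", "workflow.approve", "apikey.manage"]),
  department_lead: new Set<Permission>(["workflow.approve"]),
  developer: new Set<Permission>(["workflow.create"]),
  viewer: new Set<Permission>(),
};

function can(role: Role, permission: Permission): boolean {
  return grants[role].has(permission);
}

// A negative test asserts the denial, mirroring the E2E 403 checks:
console.log(can("viewer", "apikey.manage")); // false → the API should return 403
```

The E2E layer then verifies the same matrix end-to-end: log in as each role, hit each protected route, and assert the HTTP status matches what `can` predicts.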

Accessibility audits (@axe-core/playwright)

WCAG 2.1 AA compliance scanning on key pages:

  • Color contrast ratios
  • ARIA attribute correctness
  • Keyboard navigation
  • Screen reader compatibility
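axe-core tags each violation with the WCAG level it maps to (tags like "wcag2aa" or "wcag21aa"), so gating a page to WCAG 2.1 AA means filtering the scan results to those levels. A sketch of that gating helper — the violation shape mirrors axe-core's output, but the helper itself is ours:

```typescript
// Shape mirroring an axe-core violation (simplified).
interface AxeViolation {
  id: string;      // e.g. "color-contrast"
  tags: string[];  // e.g. ["wcag2aa", "cat.color"]
  nodes: unknown[];
}

// Keep only violations mapped to WCAG 2.0/2.1 level A or AA —
// the set a WCAG 2.1 AA audit must fail on.
function aaViolations(violations: AxeViolation[]): AxeViolation[] {
  const levels = new Set(["wcag2a", "wcag2aa", "wcag21a", "wcag21aa"]);
  return violations.filter((v) => v.tags.some((t) => levels.has(t)));
}

const results: AxeViolation[] = [
  { id: "color-contrast", tags: ["wcag2aa", "cat.color"], nodes: [{}] },
  { id: "region", tags: ["best-practice"], nodes: [{}] },
];
console.log(aaViolations(results).map((v) => v.id)); // ["color-contrast"]
```

Best-practice findings are still reported, but only the A/AA set fails the build.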

Visual regression testing

Playwright screenshot comparison to catch unintended UI changes:

  • Component rendering across viewport sizes
  • Theme consistency (light/dark)
  • Layout stability after dependency updates
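In Playwright, this kind of setup lives in the project config: a diff tolerance for `toHaveScreenshot` plus one project per viewport, so the same specs render at every size. A representative `playwright.config.ts` fragment (the thresholds and viewports are illustrative):

```typescript
import { defineConfig } from "@playwright/test";

export default defineConfig({
  expect: {
    toHaveScreenshot: {
      // Tolerate sub-pixel anti-aliasing noise, fail on real layout drift.
      maxDiffPixelRatio: 0.01,
    },
  },
  projects: [
    // Re-run the same screenshot specs at multiple viewport sizes.
    { name: "desktop", use: { viewport: { width: 1280, height: 800 } } },
    { name: "mobile", use: { viewport: { width: 390, height: 844 } } },
  ],
});
```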

LLM mock testing

Deterministic test doubles for all 4 LLM providers via llm-mock.ts (818 lines):

  • Each provider’s response format is precisely mocked
  • Tool calling, structured output, and streaming are all covered
  • Tests verify behavior under timeout, rate limit, and error conditions
  • Custom OpenAI-compatible endpoint mocking for self-hosted LLM testing
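The internals of llm-mock.ts aren't shown here, but the core idea of a deterministic per-provider double can be sketched: same prompt in, same response out, no network, with each provider's response vocabulary mimicked (e.g. Anthropic's `end_turn` stop reason versus OpenAI-style `stop`). Everything below beyond those stop reasons is illustrative:

```typescript
type Provider = "anthropic" | "openai" | "google" | "openai-compatible";

interface MockResponse {
  provider: Provider;
  text: string;
  stopReason: string;
}

// Deterministic double: derives the response from the prompt so every
// assertion is reproducible, while mimicking provider-specific fields.
function mockComplete(provider: Provider, prompt: string): MockResponse {
  const stopReason = provider === "anthropic" ? "end_turn" : "stop";
  return {
    provider,
    text: `[mock:${provider}] ${prompt.slice(0, 40)}`,
    stopReason,
  };
}

console.log(mockComplete("anthropic", "Summarize Q3 revenue").stopReason); // "end_turn"
```

Because the double is a pure function, the same suite can be replayed against all four provider formats without flakiness.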

Performance baselines

Page load metrics tracked as test assertions:

  • Time to interactive
  • Largest contentful paint
  • Bundle size thresholds
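Treating these metrics as test assertions means comparing each measurement against a per-page budget and failing with the offending metric's name. A minimal sketch, with budgets that are purely illustrative:

```typescript
interface PageMetrics {
  timeToInteractiveMs: number;
  largestContentfulPaintMs: number;
  bundleSizeKb: number;
}

// Illustrative budgets; real suites keep per-page baselines.
const budget: PageMetrics = {
  timeToInteractiveMs: 3000,
  largestContentfulPaintMs: 2500,
  bundleSizeKb: 500,
};

// Return the name of every metric over its budget, so a failing
// assertion names the regression instead of just failing.
function overBudget(actual: PageMetrics, limits: PageMetrics = budget): string[] {
  return (Object.keys(limits) as (keyof PageMetrics)[])
    .filter((k) => actual[k] > limits[k]);
}

console.log(overBudget({ timeToInteractiveMs: 2100, largestContentfulPaintMs: 1900, bundleSizeKb: 620 }));
// ["bundleSizeKb"]
```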

The n8n contrast

While we’re running 24,000+ tests nightly, the open-source automation platform n8n has accumulated 8 critical CVEs — several requiring only workflow editor access (not admin) for remote code execution. Censys identified 26,512 exposed n8n instances on the public internet.

Self-hosted doesn’t mean self-secure. Testing discipline does.

How testing feeds SOC 2

Our test suite isn’t just about catching bugs. It’s part of our SOC 2 evidence collection:

  • CC5.2 (Control Activities): The test suite itself is evidence of quality controls
  • CC6.2 (Access Controls): RBAC enforcement tests prove access controls work
  • CC7.1 (System Operations): Nightly CI proves continuous monitoring
  • CC8.1 (Change Management): Every PR runs the full test suite before merge

The SOC 2 evidence aggregator (/api/soc2-evidence) references test coverage as a key metric. When our auditor asks “how do you ensure changes don’t introduce security regressions?”, we have a concrete answer: 24,000+ tests, 99.18% coverage, every commit.

The nightly CI pipeline

Every night, our CI pipeline:

  1. Runs the full Vitest unit test suite (~9,500 tests)
  2. Runs Playwright E2E tests (~500 tests) against a fresh deployment
  3. Runs accessibility audits on 20+ key pages
  4. Runs visual regression comparisons
  5. Reports coverage to the team

If any test fails, the team is notified before the next business day. If coverage drops below 98%, the build fails.
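A coverage floor like this is typically enforced in the test runner's config rather than in CI scripting. In Vitest that looks roughly like the following `vitest.config.ts` fragment (the provider and exact thresholds here are illustrative):

```typescript
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    coverage: {
      provider: "v8",
      // Fail the run if coverage drops below the floor.
      thresholds: {
        lines: 98,
        branches: 98,
        functions: 98,
        statements: 98,
      },
    },
  },
});
```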

Try it yourself

JieGou is available for free evaluation. Every feature mentioned here — the 4-provider LLM support, the workflow engine, the SOC 2 evidence collection — is available on Enterprise plans.

Start a free trial or contact our team to discuss compliance requirements.

Tags: testing, quality, security, soc2, compliance, engineering, ci-cd, enterprise