AI automation platforms make decisions that affect real business processes. When a recipe generates a customer email, or a workflow approves a purchase order, or an agent delegates tasks across departments — the output matters. If the platform has bugs, the business has bugs.
That’s why JieGou runs 24,000+ automated tests with 99.18% code coverage. Every night. Across all 4 LLM providers. With accessibility audits, visual regression testing, and RBAC enforcement verification included.
No other AI automation platform publishes these numbers. Most don’t have them.
Why testing matters more for AI platforms
Traditional SaaS testing is straightforward: given input X, expect output Y. AI automation platforms add three layers of complexity:
- Non-deterministic outputs — LLMs rarely return the same response twice for the same prompt. Tests must validate structure, constraints, and quality rather than exact strings.
- Multi-provider variability — JieGou supports 4 LLM providers (Anthropic, OpenAI, Google, and any OpenAI-compatible endpoint). Each has different capabilities, error modes, and response formats.
- Orchestration complexity — Workflows chain multiple steps with conditional logic, parallel execution, approval gates, and convergence loops. A bug in step 3 can corrupt step 7’s output through shared state.
These challenges are exactly why testing discipline matters. Without it, you’re shipping bugs you can’t reproduce because they only appear under specific LLM response patterns.
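To make the non-determinism challenge concrete, here is a minimal sketch of validating structure and constraints instead of exact strings. This is not JieGou's actual test code; the `EmailDraft` shape and `validateEmailDraft` name are hypothetical:

```typescript
// Hypothetical shape of a recipe's email-generation output.
interface EmailDraft {
  subject: string;
  body: string;
  tone: "formal" | "friendly";
}

// Validate structure and constraints, never the exact wording:
// any of the LLM's phrasings that satisfy these checks should pass.
function validateEmailDraft(raw: unknown): string[] {
  const errors: string[] = [];
  const d = raw as Partial<EmailDraft>;
  if (typeof d.subject !== "string" || d.subject.length === 0) {
    errors.push("subject missing");
  } else if (d.subject.length > 120) {
    errors.push("subject too long");
  }
  if (typeof d.body !== "string" || d.body.length < 20) {
    errors.push("body too short");
  }
  if (d.tone !== "formal" && d.tone !== "friendly") {
    errors.push("tone out of range");
  }
  return errors; // empty array means the draft passes
}
```

A test built this way passes for any response the LLM produces, as long as the response respects the contract the workflow depends on.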
What 24,000+ tests cover
Unit tests (Vitest)
The bulk of our test suite — server-side logic, data transformations, validation rules, and business logic:
- LLM layer: Provider routing, BYOK key resolution, circuit breaker state machines, concurrency limiting, token usage tracking
- Workflow engine: Step execution (recipe, condition, loop, parallel, approval, LLM, eval, router, aggregator), DAG execution, convergence loops, checkpoint/resume
- Security: RBAC enforcement (20 permissions across 5 roles), auth guard, API key encryption/decryption, session management
- SOC 2 evidence: Access review generation, encryption inventory, vendor register, incident response runbook, audit log summaries
- Data layer: Firestore CRUD, Redis caching, rate limiting, dead letter queue
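The circuit breaker tests above exercise a classic closed/open/half-open state machine. As a rough illustration of what is under test (a generic sketch of the pattern, not JieGou's implementation), with a clock injected so tests stay deterministic:

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Minimal circuit breaker: opens after `threshold` consecutive failures,
// moves to half-open after `cooldownMs`, and closes again on a success.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private threshold: number,
    private cooldownMs: number,
    private now: () => number = Date.now, // injectable clock for tests
  ) {}

  current(): BreakerState {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // allow one probe request through
    }
    return this.state;
  }

  canRequest(): boolean {
    return this.current() !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.threshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

Injecting `now` is what makes the time-based transitions unit-testable without real waits.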
E2E tests (Playwright)
Full browser automation testing that exercises the real application:
- User journeys: Admin onboarding, department lead review, developer workflow creation
- Route coverage: Every route in the application (bundles, entities, groups, integrations, knowledge bases, recordings, pricing, redirects)
- RBAC enforcement: Negative tests verifying that unauthorized users get 403s
- Data consistency: API response ↔ UI rendering verification, concurrent operation handling
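The RBAC negative tests boil down to asserting that a role without a permission gets a 403. A toy slice of that check (JieGou's real matrix has 20 permissions across 5 roles; the role and permission names below are placeholders, not the actual keys):

```typescript
// A small illustrative slice of a role -> permission matrix.
const grants: Record<string, Set<string>> = {
  admin: new Set(["workflow.read", "workflow.write", "workflow.delete"]),
  developer: new Set(["workflow.read", "workflow.write"]),
  viewer: new Set(["workflow.read"]),
};

// Simulates the guard an E2E negative test exercises:
// authorized requests proceed (200), unauthorized ones get a 403.
function guard(role: string, permission: string): number {
  return grants[role]?.has(permission) ? 200 : 403;
}
```

The E2E suite makes the same assertion through a real browser session, which also verifies that the guard is wired into every route rather than just unit-tested in isolation.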
Accessibility audits (@axe-core/playwright)
WCAG 2.1 AA compliance scanning on key pages:
- Color contrast ratios
- ARIA attribute correctness
- Keyboard navigation
- Screen reader compatibility
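The color contrast check is the most mechanical of these: WCAG 2.1 defines contrast as a ratio of relative luminances, with AA requiring at least 4.5:1 for normal-size text (3:1 for large text). A self-contained sketch of the formula axe applies:

```typescript
// WCAG 2.1 relative luminance of an sRGB color (channels 0-255).
function luminance(r: number, g: number, b: number): number {
  const lin = (v: number) => {
    const c = v / 255;
    return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
  };
  return 0.2126 * lin(r) + 0.7152 * lin(g) + 0.0722 * lin(b);
}

// Contrast ratio between foreground and background colors.
function contrastRatio(
  fg: [number, number, number],
  bg: [number, number, number],
): number {
  const l1 = luminance(...fg);
  const l2 = luminance(...bg);
  const [lighter, darker] = l1 >= l2 ? [l1, l2] : [l2, l1];
  return (lighter + 0.05) / (darker + 0.05);
}
```

Black on white scores the maximum 21:1; the familiar #767676 gray on white sits just above the 4.5:1 AA cutoff.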
Visual regression testing
Playwright screenshot comparison to catch unintended UI changes:
- Component rendering across viewport sizes
- Theme consistency (light/dark)
- Layout stability after dependency updates
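Under the hood, screenshot comparison is a pixel diff with a tolerance. A naive sketch in the spirit of Playwright's `toHaveScreenshot()` (the `maxDiffRatio` parameter here is analogous to Playwright's `maxDiffPixelRatio` option; this is not Playwright's actual algorithm, which also handles anti-aliasing):

```typescript
// Fraction of pixels that differ between two equally sized RGBA buffers.
function diffRatio(base: Uint8Array, actual: Uint8Array): number {
  if (base.length !== actual.length || base.length % 4 !== 0) {
    throw new Error("buffers must be same-sized RGBA data");
  }
  let differing = 0;
  const pixels = base.length / 4;
  for (let p = 0; p < pixels; p++) {
    for (let c = 0; c < 4; c++) {
      if (base[p * 4 + c] !== actual[p * 4 + c]) {
        differing++; // any channel mismatch marks the pixel as changed
        break;
      }
    }
  }
  return differing / pixels;
}

// A comparison passes when the changed-pixel fraction stays under tolerance.
function screenshotsMatch(
  base: Uint8Array,
  actual: Uint8Array,
  maxDiffRatio = 0.01,
): boolean {
  return diffRatio(base, actual) <= maxDiffRatio;
}
```

The tolerance matters: zero tolerance produces flaky failures from font rendering differences across CI machines, while a loose tolerance lets real regressions through.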
LLM mock testing
Deterministic test doubles for all 4 LLM providers via llm-mock.ts (818 lines):
- Each provider’s response format is precisely mocked
- Tool calling, structured output, and streaming are all covered
- Tests verify behavior under timeout, rate limit, and error conditions
- Custom OpenAI-compatible endpoint mocking for self-hosted LLM testing
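The core idea of per-provider mocking is that each provider wraps its text in a different envelope, so code that normalizes responses needs deterministic doubles for every shape. A much-simplified illustration (not JieGou's llm-mock.ts; the envelopes below are reduced versions of the real OpenAI and Anthropic response formats):

```typescript
type Provider = "openai" | "anthropic";

// Deterministic test doubles returning each provider's (simplified)
// response envelope, so orchestration code runs without network calls.
function mockCompletion(provider: Provider, text: string): unknown {
  switch (provider) {
    case "openai":
      // OpenAI-style chat completion envelope
      return { choices: [{ message: { role: "assistant", content: text } }] };
    case "anthropic":
      // Anthropic Messages API-style envelope
      return { role: "assistant", content: [{ type: "text", text }] };
  }
}

// The normalizer under test: extracts plain text from either shape.
function extractText(provider: Provider, resp: any): string {
  return provider === "openai"
    ? resp.choices[0].message.content
    : resp.content[0].text;
}
```

With mocks like these, a round-trip test can assert that every provider's envelope normalizes to the same text, and failure-mode tests can return malformed envelopes on purpose.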
Performance baselines
Page load metrics tracked as test assertions:
- Time to interactive
- Largest contentful paint
- Bundle size thresholds
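Tracking metrics "as test assertions" means comparing measured values against fixed budgets and failing the build on any overrun. A sketch of that pattern (the threshold numbers are illustrative, not JieGou's actual budgets, though 2500 ms is Google's published "good" LCP cutoff):

```typescript
interface PageMetrics {
  timeToInteractiveMs: number;
  largestContentfulPaintMs: number;
  bundleSizeKb: number;
}

// Illustrative performance budgets asserted in CI.
const budgets: PageMetrics = {
  timeToInteractiveMs: 3000,
  largestContentfulPaintMs: 2500, // Google's "good" LCP threshold
  bundleSizeKb: 500,
};

// Returns the list of budget violations; a passing page returns [].
function checkBudgets(actual: PageMetrics, budget: PageMetrics = budgets): string[] {
  return (Object.keys(budget) as (keyof PageMetrics)[])
    .filter((k) => actual[k] > budget[k])
    .map((k) => `${k}: ${actual[k]} exceeds budget ${budget[k]}`);
}
```

Encoding budgets this way turns performance drift into a red build instead of a dashboard nobody watches.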
The n8n contrast
While we’re running 24,000+ tests nightly, the open-source automation platform n8n has accumulated 8 critical CVEs — several requiring only workflow editor access (not admin) for remote code execution. Censys identified 26,512 exposed n8n instances on the public internet.
Self-hosting a platform doesn’t make it secure. Testing discipline does.
How testing feeds SOC 2
Our test suite isn’t just about catching bugs. It’s part of our SOC 2 evidence collection:
- CC5.2 (Control Activities): The test suite itself is evidence of quality controls
- CC6.2 (Access Controls): RBAC enforcement tests prove access controls work
- CC7.1 (System Operations): Nightly CI proves continuous monitoring
- CC8.1 (Change Management): Every PR runs the full test suite before merge
The SOC 2 evidence aggregator (/api/soc2-evidence) references test coverage as a key metric. When our auditor asks “how do you ensure changes don’t introduce security regressions?”, we have a concrete answer: 24,000+ tests, 99.18% coverage, every commit.
The nightly CI pipeline
Every night, our CI pipeline:
- Runs the full Vitest unit test suite (~9,500 tests)
- Runs Playwright E2E tests (~500 tests) against a fresh deployment
- Runs accessibility audits on 20+ key pages
- Runs visual regression comparisons
- Reports coverage to the team
If any test fails, the team is notified before the next business day. If coverage drops below 98%, the build fails.
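The coverage gate in that last step can be enforced declaratively. One way to express it, assuming Vitest's built-in coverage thresholds (the provider choice and per-metric numbers here are assumptions, not JieGou's actual config):

```typescript
// vitest.config.ts — sketch of a coverage gate that fails the build
// when any metric drops below the configured floor.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    coverage: {
      provider: "v8",
      thresholds: {
        lines: 98,
        functions: 98,
        branches: 98,
        statements: 98,
      },
    },
  },
});
```

With thresholds in the config rather than a CI script, the gate travels with the repository and applies to local runs too.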
Try it yourself
JieGou is available for free evaluation. Every feature mentioned here — the 4-provider LLM support, the workflow engine, the SOC 2 evidence collection — is available on Enterprise plans.
Start a free trial or contact our team to discuss compliance requirements.