Test coverage is one of those metrics everyone agrees matters and nobody wants to be responsible for improving. You know the pattern: a codebase starts at 90% coverage, then a few features ship without tests, then a refactor skips the test update because “we’ll come back to it.” Six months later you’re at 60% and climbing back up feels like a full-time job that nobody signed up for.
This case study walks through how one team used a JieGou workflow with the Coding Agent to go from 60% to 94% test coverage in three weeks — automatically, with minimal human intervention.
The problem
The codebase in question was a Node.js backend with 312 modules. Coverage had been steadily declining for months:
- 60% overall test coverage, down from 85% at the start of the year
- 47 modules with zero tests — mostly utility functions, data transformers, and validation logic
- New features shipping without tests because developers were focused on delivery deadlines
- CI pipeline ran tests but didn’t enforce coverage thresholds, so the decline was invisible until someone checked
The team had tried allocating “test writing sprints” twice. Both times, the sprint got deprioritized when a customer escalation came in. Manual test writing is slow, tedious, and competes with every other priority.
The workflow
The team built a 4-step workflow in JieGou that triggers automatically when code merges to main. No manual intervention required unless a generated test needs revision.
Step 1: Trigger — GitHub webhook on PR merge
The workflow starts with a webhook trigger that fires when a pull request is merged to main.
```yaml
trigger:
  type: webhook
  source: github
  events: ["pull_request.closed"]
  filters:
    - field: "action"
      value: "closed"
    - field: "pull_request.merged"
      value: true
    - field: "pull_request.base.ref"
      value: "main"
```
The webhook payload includes the list of changed files, the merge commit SHA, and the PR metadata. All of this feeds into the next step.
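The filter logic above can be sketched as a single predicate over the webhook payload. This is an illustrative sketch, not JieGou's implementation; the field names (`action`, `merged`, `base.ref`) come from GitHub's standard `pull_request` webhook payload.

```javascript
// Minimal sketch of the trigger's filters: fire only when a PR is
// actually merged into main (a closed-without-merge PR is skipped).
function shouldTrigger(payload) {
  return (
    payload.action === "closed" &&
    payload.pull_request?.merged === true &&
    payload.pull_request?.base?.ref === "main"
  );
}
```

Note that `pull_request.closed` fires for abandoned PRs too, which is why the `merged: true` filter matters.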
Step 2: Analyze — Identify untested functions
A recipe step takes the list of changed files from the webhook and cross-references them against the most recent coverage report. The recipe prompt is straightforward:
```yaml
step: analyze-coverage
type: recipe
model: claude-sonnet-4-6
input:
  changed_files: "{{trigger.pull_request.changed_files}}"
  coverage_report: "{{secrets.COVERAGE_REPORT_URL}}"
prompt: |
  You are a test coverage analyst. Given the list of changed files
  and the coverage report (lcov format), identify:
  1. Which changed files have less than 80% line coverage
  2. Which exported functions in those files have zero coverage
  3. The function signatures and a brief description of what each does
  Output a JSON array of objects with fields:
  file_path, function_name, signature, description, current_coverage
```
This step produces a structured list of exactly which functions need tests. It filters out files that already have adequate coverage, so the coding agent doesn’t waste time on code that’s already tested.
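For a sense of what the 80% filter is doing, here is a sketch of the same per-file coverage computation done deterministically, using the lcov format's `SF:` (source file), `DA:<line>,<hits>` (line hit count), and `end_of_record` markers. This is not the recipe step itself, which delegates the analysis to the model; it just shows the arithmetic.

```javascript
// Compute per-file line coverage from an lcov report and flag files
// below the threshold, mirroring point 1 of the analyze prompt.
function filesBelowThreshold(lcovText, threshold = 0.8) {
  const flagged = [];
  let file = null, covered = 0, total = 0;
  for (const line of lcovText.split("\n")) {
    if (line.startsWith("SF:")) {
      file = line.slice(3);          // start of a new source file record
      covered = 0;
      total = 0;
    } else if (line.startsWith("DA:")) {
      const hits = Number(line.split(",")[1]);
      total += 1;
      if (hits > 0) covered += 1;    // a line is covered if hit at least once
    } else if (line === "end_of_record" && file) {
      const coverage = total ? covered / total : 1;
      if (coverage < threshold) {
        flagged.push({ file_path: file, current_coverage: coverage });
      }
      file = null;
    }
  }
  return flagged;
}
```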
Step 3: Generate — Coding Agent writes the tests
This is where the Coding Agent does the heavy lifting. It receives the list of untested functions, clones the repository, and writes unit tests for each one.
```yaml
step: generate-tests
type: coding-agent
model: claude-sonnet-4-6
repo: "{{secrets.REPO_URL}}"
branch: "test-gen/{{trigger.pull_request.number}}"
maxTurns: 30
sandbox:
  memory: "1GB"
  timeout: "5m"
  network: false
tools:
  - read
  - write
  - edit
  - bash
  - glob
  - grep
input:
  untested_functions: "{{steps.analyze-coverage.output}}"
task: |
  You are a senior test engineer. Your job is to write unit tests for
  the following untested functions:
  {{untested_functions}}
  Instructions:
  1. Read the source file for each function to understand its behavior
  2. Read existing test files in the same directory for style conventions
  3. Write tests using the project's test framework (vitest)
  4. Each test file should cover: happy path, edge cases, error handling
  5. Run `npm test -- --reporter=verbose <test-file>` after writing each
     test file to verify all tests pass
  6. If a test fails, read the error, fix the test, and re-run
  7. Do NOT modify source code — only create or update test files
  Match the existing test style: describe blocks, clear test names,
  arrange-act-assert pattern. Use the existing fixtures and helpers
  where available.
```
The key configuration choices here:
- Model: Claude Sonnet 4.6 — fast enough for high-volume test generation, smart enough to understand complex function signatures and edge cases.
- Sandbox: 1 GB memory, 5-minute timeout — enough headroom for `npm test` runs without risking runaway processes.
- Network disabled — the agent doesn’t need external access; everything is in the cloned repo.
- maxTurns: 30 — allows the agent to iterate on multiple test files and fix failures across several cycles.
The task description includes a critical instruction: read existing test files first. This ensures the generated tests match the team’s conventions — same assertion style, same describe/it structure, same fixture patterns. Without this, generated tests are technically correct but stylistically inconsistent, which creates friction during review.
Step 4: Submit — Create a PR with results
After the coding agent finishes, a final recipe step creates a pull request with the generated tests and includes the test results in the PR description.
```yaml
step: submit-pr
type: recipe
model: claude-sonnet-4-6
input:
  modified_files: "{{steps.generate-tests.modified_files}}"
  test_output: "{{steps.generate-tests.output}}"
  source_pr: "{{trigger.pull_request.number}}"
prompt: |
  Create a pull request with the following:
  - Title: "test: auto-generated tests for PR #{{source_pr}}"
  - Body: Include a summary of tests added, functions covered,
    and the full test output showing all tests passing
  - Label: "auto-tests"
  - Request review from the original PR author
```
The PR is labeled auto-tests so the team can filter and batch-review generated test PRs. Requesting review from the original PR author means the person most familiar with the code sees the tests first.
Results
The team ran this workflow for three weeks on every merge to main. Here are the numbers:
| Metric | Value |
|---|---|
| Starting coverage | 60% |
| Ending coverage | 94% |
| Tests generated | 847 |
| Pull requests created | 42 |
| Average PR review time | 12 minutes |
| Tests that caught existing bugs | 23 |
| Average cost per day | ~$4.50 (Claude API) |
A few things stand out.
847 tests across 42 PRs means roughly 20 tests per PR. Most PRs covered 2-4 source files, with 5-6 tests per function (happy path, boundary conditions, error cases, null/undefined inputs, type coercion edge cases).
12-minute average review time is remarkable. Most generated tests were correct on the first pass. The main review comments were stylistic — renaming test descriptions, adjusting fixture data to be more realistic. Very few tests needed functional corrections.
23 tests caught actual bugs in existing code. This is the most interesting result. The coding agent wrote a test for a date parsing function that expected ISO 8601 format, and the test revealed the function silently returned `Invalid Date` for timestamps with timezone offsets. That bug had been in production for four months. Similar discoveries followed: bugs in validation logic, off-by-one errors in pagination helpers, and a race condition in a caching module.
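The timezone bug is worth illustrating because it is such a common class. The sketch below is hypothetical (the team's actual parser is not shown in the case study), but it reproduces the failure mode: a strict ISO 8601 regex that accepts only `Z` and silently returns `Invalid Date` for `+HH:MM` / `-HH:MM` offsets.

```javascript
// Hypothetical parser with the bug class the generated tests surfaced.
function parseIsoStrict(s) {
  // Bug: pattern only allows "Z" (UTC), never a numeric timezone offset
  if (!/^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$/.test(s)) {
    return new Date(NaN); // silently yields Invalid Date
  }
  return new Date(s);
}
```

A test that feeds in an offset timestamp and asserts on the parsed value, rather than just on the function returning, exposes this immediately.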
$4.50/day in Claude API costs is trivial compared to the engineering time saved. A conservative estimate: writing 847 tests manually at 15 minutes per test would take roughly 212 hours — over five weeks of a full-time engineer’s time.
Lessons learned
After three weeks of running this workflow, the team refined their approach based on what worked and what didn’t.
Start with pure functions
The first version of the workflow targeted all untested code. This worked well for pure functions — utilities, validators, transformers, formatters — because they have clear inputs, clear outputs, and no side effects. The agent could read the function, understand the contract, and write comprehensive tests.
Complex modules with database connections, external API calls, and shared mutable state were harder. The agent sometimes generated tests that mocked too much or tested implementation details rather than behavior. The team adjusted the analyze step to prioritize pure functions first and queue complex modules for manual review.
Provide existing test examples
The single most impactful change was adding this instruction to the task description: “Read existing test files in the same directory for style conventions.” Before this instruction, the agent produced tests that were functionally correct but used different assertion patterns, different describe block structures, and different naming conventions than the rest of the test suite. After the instruction, generated tests were nearly indistinguishable from hand-written ones.
Run tests before the PR is created
Early iterations of the workflow submitted PRs with tests that hadn’t been verified. Some had import errors, missing fixtures, or assertion failures. Adding the npm test step inside the coding agent task — and instructing it to fix failures before finishing — eliminated almost all of these issues. The agent’s iterative loop (write test, run test, read error, fix test, re-run) is exactly how a human would work, but faster.
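That loop can be sketched as plain control flow. The helpers `runTests` and `fixFromError` are hypothetical stand-ins, injected here so the sketch is self-contained; in the real workflow the agent plays both roles itself, shelling out to `npm test` via its bash tool and editing the test file in response to failures.

```javascript
// Write/run/fix loop from the task description: retry until the test
// file passes, or give up and surface it for human review.
function writeUntilGreen(runTests, fixFromError, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { passed, error } = runTests();  // e.g. `npm test -- <test-file>`
    if (passed) return attempt;            // green: done with this file
    fixFromError(error);                   // read the failure, revise the test
  }
  return -1;                               // still red after maxAttempts
}
```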
Review for quality, not just coverage
Coverage is a proxy metric. A test that asserts `expect(result).toBeDefined()` technically covers the function but doesn’t verify behavior. The team added a review checklist for generated test PRs:
- Does each test verify a meaningful behavior, not just that the function runs?
- Are edge cases covered (null inputs, empty arrays, boundary values)?
- Do test descriptions clearly state what behavior is being tested?
- Are assertions specific (`.toBe(42)`, not `.toBeTruthy()`)?
Most generated tests passed this checklist without changes. The few that didn’t were easy to spot and fix during the 12-minute review.
The configuration in full
For teams that want to replicate this workflow, here is the complete step configuration:
```yaml
workflow:
  name: "Auto Test Generation"
  description: "Generate unit tests for untested code on PR merge"
  trigger:
    type: webhook
    source: github
    events: ["pull_request.closed"]
    filters:
      - field: "pull_request.merged"
        value: true
      - field: "pull_request.base.ref"
        value: "main"
  steps:
    - id: analyze-coverage
      type: recipe
      model: claude-sonnet-4-6
      # ... (see Step 2 above)
    - id: generate-tests
      type: coding-agent
      model: claude-sonnet-4-6
      dependsOn: [analyze-coverage]
      # ... (see Step 3 above)
    - id: submit-pr
      type: recipe
      model: claude-sonnet-4-6
      dependsOn: [generate-tests]
      # ... (see Step 4 above)
```
The workflow runs end-to-end in 8-15 minutes depending on the number of untested functions. Most of that time is the coding agent iterating through test files and running `npm test` after each one.
What’s next
The team is now extending the workflow in two directions:
- Integration tests — a separate coding agent step that generates integration tests for API endpoints, using the existing test helpers and database fixtures
- Coverage enforcement — a CI check that fails the build if coverage drops below 90%, now that the baseline is high enough to enforce
The broader lesson: test coverage doesn’t have to be a manual chore. With the right workflow, you can treat it as an automated pipeline that runs alongside your existing CI/CD — no dedicated sprints, no deprioritization, no coverage debt.
The Coding Agent is available on Pro plans and above. Build your first coding workflow.