Your AI agent just generated three new modules in twenty minutes. Clean code, good structure, it even follows your conventions. You merge it. Next day, another three modules. By the end of the week, your codebase grew by 2,000 lines. Your test suite? It grew by 200. Maybe.

Your agents are building faster than you can verify. And if you can’t verify it, you don’t really know what you shipped.

My side project — a security-focused AI agent framework with 309 modules, 13+ plugins, and a six-stage security pipeline — has 115,000 lines of test code against 67,000 lines of production code. 7,649 tests across 399 files. The test code is 1.7x the size of the production code. I didn’t aim for that ratio. I aimed for 100% coverage. That’s just what 100% coverage looks like when you’re actually testing behavior.

The Problem: AI Writes Code Fast. Your Safety Net Is Still Thin.

Tests have always been necessary. But before agentic coding, the speed bottleneck was writing code — you could only ship so fast, so the test suite had a chance to keep up. Now AI agents ship features in hours. The bottleneck flipped. The code writes itself. The question is whether you can trust it.

Most teams using AI agents still have the same thin test suites they had before. A few happy-path assertions, maybe some integration tests. The production codebase doubles, the test suite barely moves. The ratio gets worse every week.

That’s a problem unique to agentic coding. When a human writes code, they carry context — they know what they changed, they can mentally trace the impact. When an AI agent writes three modules in twenty minutes, nobody has that context. The only thing that can verify the behavior is the test suite. And if the test suite is thin, you’re trusting code that nobody — human or AI — has actually verified.

The Solution: 100% Coverage as the Baseline for Agentic Coding

With AI agents, 100% coverage should be the default, not the stretch goal. And real 100% coverage — where every branch, every edge case, every integration point has a test that would fail if the behavior changed — naturally produces more test code than production code.

Every if branch needs at least two tests. Every error path needs a test. Every integration point needs a contract test. When you add all that up across 309 modules, you get 115,000 lines of test code. That’s not excessive — that’s what it takes to actually trust the code your agents are shipping.
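
A minimal sketch of that arithmetic, using a hypothetical function (the name and thresholds are invented for illustration): a single if branch plus one error path already means three tests.

```python
def classify_severity(score: float) -> str:
    """Toy example: map a risk score to a severity label."""
    if score < 0:
        raise ValueError("score must be non-negative")
    if score >= 0.8:               # one branch: two tests, one per outcome
        return "critical"
    return "low"

def test_branch_true():
    assert classify_severity(0.9) == "critical"

def test_branch_false():
    assert classify_severity(0.2) == "low"

def test_error_path():             # the error path gets its own test
    try:
        classify_severity(-1.0)
        assert False, "expected ValueError"
    except ValueError:
        pass

test_branch_true()
test_branch_false()
test_error_path()
```

Multiply three-plus tests per branch-and-error-path cluster across hundreds of modules, and test code outgrowing production code stops looking surprising.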

The good news: AI agents write tests too. The same agents that generate production code generate the test suite. I describe the behavior, the AI writes the tests — including the edge cases I’d be tempted to skip at 5pm on a Friday. A test file that would take me an hour takes minutes. The AI doesn’t get tired on the sixth edge case. It doesn’t skip the error path because “it probably works.”

My test suite covers every layer:

  • Unit tests for domain logic, business rules, and individual components
  • Scenario tests for multi-step workflows and integration points
  • E2E tests with Playwright for the UI
  • Contract tests for external API integrations
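
To make the contract-test layer concrete, here is a hedged sketch of the idea: pin the response shape you depend on, so a provider-side change fails loudly instead of silently. The field names and fixture are made up for illustration.

```python
# Fields our code actually reads from the external API's response.
REQUIRED_FIELDS = {"id", "status", "created_at"}

def check_contract(response: dict) -> list[str]:
    """Return the required fields missing from an API response."""
    return sorted(REQUIRED_FIELDS - response.keys())

# In practice this would run against a recorded fixture of the real API.
fixture = {"id": "42", "status": "ok", "created_at": "2026-01-01"}
assert check_contract(fixture) == []
```

When the provider renames a field, this test breaks at merge time rather than in production.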

Why This Matters With AI

Any agentic workflow needs an eval phase — a way to give the agent feedback on its output so it can correct course toward the goal. In coding, your test suite is that eval. When an AI agent generates a module and the tests pass, that’s a green signal. When the tests fail, that’s concrete, specific feedback the agent can act on — not a vague “try again,” but “this method returns the wrong value for this input.” Test automation, on top of everything it always was, is now the best external eval tool for AI coding agents.
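
The eval loop described above can be sketched in a few lines. The callables here are hypothetical placeholders: `generate` stands in for the coding agent, and `run_tests` returns a list of failure messages, where an empty list is the green signal.

```python
def agent_loop(task, generate, run_tests, max_rounds=3):
    """Generate code, then iterate on concrete test failures until green."""
    code = generate(task, feedback=None)
    for _ in range(max_rounds):
        failures = run_tests(code)
        if not failures:
            return code            # green signal: a merge candidate
        # Specific, actionable feedback, not a vague "try again".
        code = generate(task, feedback=failures)
    raise RuntimeError(f"still failing after {max_rounds} rounds")
```

The quality of this loop is bounded by the quality of `run_tests`: a thin suite returns a green signal it hasn't earned.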

The payoff shows up especially when things change — which, with agentic coding, is constantly.

When I merged upstream changes that refactored a core abstraction from one class into five, the test suite immediately lit up. Nine test files needed updates. Every breakage was visible — which mocks pointed at classes that no longer existed, which assertions checked methods that had been renamed.

Without 7,649 tests, that merge would have been guesswork. With them, it was mechanical — update each test, run the suite, fix the next failure. AI helped with the fixes too. It read the new production code and rewrote 1,425 lines of test code to match the new structure. Three commits. Zero regressions.

When your agents ship code daily, merges and refactors happen constantly. A comprehensive test suite is the only thing standing between “AI-assisted development” and “AI-assisted chaos.”

What This Looked Like in Practice

The upstream merge brought three changes at once:

  • A core abstraction was split into multiple classes. Every mock in the corresponding test file targeted a class that no longer existed. 1,425 lines of test code needed rewriting — not tweaking, rewriting.
  • A new integration was added. Test files for routing and configuration needed updates to handle the new registration paths and fallback behavior.
  • An external API client switched versions. Field names changed, response structures shifted. Three test files needed updates.

After fixing: 7,649 tests passing, zero failures. Every breakage was caught, fixed with AI assistance, and verified automatically. No manual testing. No guessing. The merge that could have broken everything took an afternoon.

How to Build This In

Never merge AI-generated code without AI-generated tests. If the agent writes a module, it writes the tests in the same session. One doesn’t ship without the other. Make this a rule, not a suggestion — enforce it in your pipeline.
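
One way to enforce the rule in a pipeline, sketched under assumed conventions (a `src/` plus `tests/` layout with `test_<module>.py` naming; feed it the output of `git diff --name-only` against your main branch):

```python
from pathlib import PurePosixPath

def missing_tests(changed: list[str]) -> list[str]:
    """Return changed src/ modules whose matching tests/ file isn't in the diff."""
    changed_set = set(changed)
    missing = []
    for f in changed:
        p = PurePosixPath(f)
        if p.parts[:1] == ("src",) and p.suffix == ".py":
            expected = PurePosixPath("tests", *p.parts[1:-1], f"test_{p.name}")
            if str(expected) not in changed_set:
                missing.append(f)
    return missing

# A changed module with no touched test file fails the gate:
assert missing_tests(["src/auth.py"]) == ["src/auth.py"]
assert missing_tests(["src/auth.py", "tests/test_auth.py"]) == []
```

In CI, exit nonzero when the returned list is non-empty, and the "one doesn't ship without the other" rule stops depending on reviewer discipline.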

Aim for 100% behavior coverage — the ratio follows. If your test code is smaller than your production code, your agents are outrunning your safety net. With AI writing the tests, there’s no excuse for the gap.

Use a dedicated quality-check agent. Don’t let the same agent that writes the code decide the tests are good enough. A separate agent that runs coverage, checks edge cases, and rejects insufficient tests keeps the bar high — even when you’re shipping fast. And line coverage alone isn’t enough — mutation testing catches the tests that cover lines but don’t actually verify behavior.
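
Here is what a line-covering-but-behavior-blind test looks like, alongside the kind of mutant a mutation testing tool (mutmut, for example, in the Python world) would use to expose it. The discount function is invented for illustration.

```python
def apply_discount(price: float, rate: float) -> float:
    return price * (1 - rate)

# A mutant a mutation tool might generate: the '-' flipped to '+'.
def apply_discount_mutant(price: float, rate: float) -> float:
    return price * (1 + rate)

def weak_test(fn):
    fn(100.0, 0.5)                 # executes the line, asserts nothing
    return True                    # "passes" no matter what fn returns

def strong_test(fn):
    return fn(100.0, 0.5) == 50.0  # pins the actual behavior

# The weak test gives 100% line coverage on both versions; only the
# strong test kills the mutant.
assert weak_test(apply_discount) and weak_test(apply_discount_mutant)
assert strong_test(apply_discount) and not strong_test(apply_discount_mutant)
```

A quality-check agent that runs mutation testing rejects `weak_test` even though a coverage report says the function is fully tested.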

Treat the test suite as your trust layer. In agentic coding, the test suite isn’t a nice-to-have — it’s the only thing that tells you whether your agents built what you asked for. Invest in it the way you’d invest in any critical infrastructure.

The Takeaway

If your test code is smaller than your production code, your AI agents are building faster than you can verify. 100% real coverage naturally produces more test code than production code — and with AI agents writing both, there’s no reason to accept anything less. The agents that write your code should be matched by agents that prove it works.

This article was written by AI and approved by Hossein for publication.