Code coverage dashboard showing 100% with a bug slipping through, illustrating why high coverage is misleading for AI-generated tests

Why Code Coverage Is Misleading for AI-Generated Tests

Tom Piaggio, Co-Founder at Autonoma

Code coverage is misleading because it measures which lines were executed during a test run, not whether any assertion verified correct behavior. For AI-generated tests, this gap is systemic: the model writes tests that execute every branch and report 100% coverage without asserting outcomes that would fail when the logic is wrong. High coverage with hollow assertions is not protection. It is a metric that looks right and means nothing.

There is a specific kind of engineering team this post is written for. They have coverage numbers they are proud of: 80%, 90%, sometimes 100% on critical modules. They run coverage checks in CI. They catch it when coverage drops. And they still get paged at 2am when a bug ships that their test suite had already signed off on.

The "we hit 90% coverage and still shipped the bug" moment is not bad luck. It is the predictable outcome of measuring execution instead of verification. With AI-generated tests now contributing to many codebases, the gap between those two things has grown wider and become harder to see. Autonoma was built, in part, around this problem: the tests our agents generate are derived from the codebase by an agent that did not write the code, and they run against a real preview environment per PR. That independence is what line coverage cannot buy.

What code coverage actually measures

Coverage tools instrument your source code and report which lines, branches, or statements were reached during a test run. A line is "covered" the moment execution passes through it, regardless of what happens next.

That word, "covered," does the damage. It sounds like protection. It is not. Coverage is an execution trace, not a behavioral guarantee. A function can be fully covered and still contain a wrong formula, a missing edge case, or a logic inversion. All the test had to do was call it.

The execution-vs-verification distinction matters because these are two completely different questions. Execution asks: "Did this code run?" Verification asks: "Did this code produce the right output?" Coverage tools answer the first question. They say nothing about the second.
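To make the distinction concrete, here is a minimal hand-instrumented sketch of what a line-coverage tool records. The `hit` helper, the `hits` set, the `applyDiscount` function, and its "10% off" requirement are all hypothetical stand-ins chosen for illustration, not a description of how any particular coverage tool works internally.

```javascript
// Hypothetical hand-rolled instrumentation. Real coverage tools insert
// equivalent counters automatically; this only illustrates the idea.
const TOTAL_LINES = 3;
const hits = new Set();
const hit = (line) => hits.add(line);

function applyDiscount(price, isMember) {
    hit(1);
    if (isMember) {
        hit(2);
        return price * 0.5; // assumed requirement is 10% off, so this is wrong
    }
    hit(3);
    return price;
}

// A "test" that executes both branches and checks nothing.
applyDiscount(100, true);
applyDiscount(100, false);

const lineCoverage = hits.size / TOTAL_LINES;
console.log(`line coverage: ${lineCoverage * 100}%`); // line coverage: 100%
```

Every instrumented line was reached, so the report is perfect, yet nothing ever looked at the value the member branch returned.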

A senior QA engineer described it plainly: "Coverage tells me the test suite touched the code. It doesn't tell me the test suite understood the code."

Diagram contrasting code execution, where every line is touched and coverage reports 100%, against verification, where a hollow assertion check never proves the behavior is correct
Coverage answers the left question. It says nothing about the right one.

That gap has always existed. What AI test generation did was industrialize it. The tautological test pattern (AI writes the function, then asserts the function's output back at itself) is covered in depth in AI-generated tests that pass but don't assert anything. This post is about the metric that makes the pattern invisible: coverage.

100% covered, 0% protected

The demonstration is clearest with a small function. Consider a price calculation that applies a discount based on customer tier. The function has a few branches: one for premium customers, one for standard, one for the default case.

A coverage-maximizing test calls the function once per branch. Every line executes. Coverage reports 100%. But if the assertion in each test call is just "the output equals whatever the function returned," no bug can ever be caught. The test is asserting the implementation back to itself.

Here is exactly that pattern:

import { describe, it } from 'node:test';
import assert from 'node:assert/strict';

// PREMIUM_DISCOUNT should be 0.20 (20%), but is set to 0.02 (2%) by mistake.
// No test below will catch this because every assertion mirrors the return value
// of the function itself, not an independently-derived expected value.
export const PREMIUM_DISCOUNT = 0.02; // bug: should be 0.20
export const STANDARD_DISCOUNT = 0.10;
export const DEFAULT_DISCOUNT = 0.0;

export function calculatePrice(basePrice, tier) {
    if (tier === 'premium') {
        return basePrice * (1 - PREMIUM_DISCOUNT);
    }
    if (tier === 'standard') {
        return basePrice * (1 - STANDARD_DISCOUNT);
    }
    return basePrice * (1 - DEFAULT_DISCOUNT);
}

// Coverage result: 100% lines, 100% branches.
// Bug detection result: 0%. Every assertion is tautological.
describe('calculatePrice', () => {
    it('applies premium tier pricing', () => {
        const price = calculatePrice(100, 'premium');
        // Asserts the output back to itself. If PREMIUM_DISCOUNT is wrong,
        // this expected value is also wrong, and the test stays green.
        assert.strictEqual(price, calculatePrice(100, 'premium'));
    });

    it('applies standard tier pricing', () => {
        const price = calculatePrice(100, 'standard');
        assert.strictEqual(price, calculatePrice(100, 'standard'));
    });

    it('applies default pricing for unknown tier', () => {
        const price = calculatePrice(100, 'guest');
        assert.strictEqual(price, calculatePrice(100, 'guest'));
    });
});

The function runs. Every branch is hit. The test file turns green. And if someone swaps the premium discount rate from 20% to 2% by accident (or by a bad merge), the test stays green, because the assertion was never asking "is this 20%?" It was asking "is the output what this function returned?" The function returned the wrong number. The assertion confirmed the wrong number. Coverage: 100%. Protection: zero.
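For contrast, here is a sketch of the same three checks with expected values taken from the pricing rule rather than from the function. The expected prices (80, 90, 100) assume the intended discounts described above: 20% for premium, 10% for standard, none by default. Plain comparisons are used instead of `node:assert` only to keep the sketch dependency-free.

```javascript
const PREMIUM_DISCOUNT = 0.02; // same bug as above: should be 0.20
const STANDARD_DISCOUNT = 0.10;
const DEFAULT_DISCOUNT = 0.0;

function calculatePrice(basePrice, tier) {
    if (tier === 'premium') {
        return basePrice * (1 - PREMIUM_DISCOUNT);
    }
    if (tier === 'standard') {
        return basePrice * (1 - STANDARD_DISCOUNT);
    }
    return basePrice * (1 - DEFAULT_DISCOUNT);
}

// Expected values come from the intended pricing rule, written out by hand.
const cases = [
    { tier: 'premium', expected: 80 },
    { tier: 'standard', expected: 90 },
    { tier: 'guest', expected: 100 },
];

// Compare with a small tolerance because the prices are floating-point.
const failures = cases.filter(
    ({ tier, expected }) => Math.abs(calculatePrice(100, tier) - expected) > 1e-9
);

for (const { tier, expected } of failures) {
    console.log(`FAIL ${tier}: expected ${expected}, got ${calculatePrice(100, tier)}`);
}
```

Line and branch coverage are identical to the tautological version. The only difference is that the premium case now fails while the bug is present.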

The recommended fix is to add a layer that does not derive expected behavior from the same implementation. Autonoma does that at the E2E boundary: the Planner Agent reads the codebase for routes, components, user flows, and data-state needs; the Executor Agent checks the running preview for observable behavior; and the Reviewer Agent evaluates whether failures are real product signals. The hollow unit test can still exist, but it is no longer the last signal before merge.

This is not a contrived edge case. It is the default output when a test is optimized for coverage rather than for failure on incorrect behavior.

Diagram showing a full 100% line coverage bar sitting above a cracked, hollow shield while a bug slips through the gap, illustrating that high coverage does not equal real protection
A full coverage bar can sit on top of a shield that stops nothing.

Why AI generation inflates coverage specifically

Coverage is the easiest metric for an AI to optimize. Write a test that calls the function, exercise each branch by passing different inputs, confirm no exception is thrown. Done. Every line touched. Coverage moves to 100%.

The model is not being malicious. It is doing exactly what it was implicitly asked to do: write a test that passes. A test that passes without asserting anything meaningful passes just as well as a test with correct assertions. From the model's perspective, there is no difference.

The same-model loop makes this structurally worse. When the AI that wrote the implementation also writes the test in the same session or the same context, any bug in the implementation propagates into the test's expected value. The model saw the logic when it wrote it. The test reflects that logic. A discount rate that is wrong in the implementation becomes the "expected" discount rate in the assertion. The test is internally consistent and behaviorally useless.
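The loop does not always look like a function asserted against itself. A subtler variant, sketched here with the premium discount from the earlier example, derives the expected value from the implementation's own constant. The names `premiumPrice`, `echoedExpected`, and `specExpected` are illustrative, and the value 80 assumes the intended 20% rate.

```javascript
const PREMIUM_DISCOUNT = 0.02; // bug: the intended rate is 0.20

function premiumPrice(basePrice) {
    return basePrice * (1 - PREMIUM_DISCOUNT);
}

// Looks like a real expected value, but it is computed from the same
// constant the implementation reads, so the bug sits on both sides.
const echoedExpected = 100 * (1 - PREMIUM_DISCOUNT);

// Derived from the stated requirement (20% off 100), not from the code.
const specExpected = 80;

const actual = premiumPrice(100);
const echoedCheckPasses = actual === echoedExpected;
const specCheckPasses = Math.abs(actual - specExpected) < 1e-9;

console.log({ echoedCheckPasses, specCheckPasses });
// { echoedCheckPasses: true, specCheckPasses: false }
```

The echoed check stays green for any value of the constant. Only the check that states the requirement independently can disagree with the code.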

This is why teams running AI-heavy workflows describe finding that their QA engineers keep catching things the test suite already approved. The tests are not broken. They are just not checking what engineers assume they are checking.

Code coverage is the metric that makes this invisible. It reports the execution and nothing else. A 90% coverage number in an AI-assisted codebase can represent a test suite where a third of the assertions are tautological and another third are missing entirely.

Better signals than line coverage

Two metrics have real signal where line coverage does not.

Mutation score asks whether your tests can detect small, deliberate changes to the code. A mutation testing tool makes targeted modifications (flipping a comparison operator, changing a constant, swapping a boolean) and reruns the suite. If a mutant survives (the tests stay green after the code is broken), the assertions are not protecting that behavior. A function with 100% line coverage but a 40% mutation score has a lot of lines being executed by tests that would not notice if the logic changed. Mutation score is the topic of its own post in this cluster: mutation testing vs code coverage goes deeper on how to read and act on those numbers.
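The idea can be sketched by hand for the pricing example. Real mutation testing tools (Stryker is one for JavaScript) generate and run mutants automatically; the `makeCalculatePrice` factory, the three hand-written mutants, and both suite functions below are assumptions made purely for illustration.

```javascript
// Build a pricing function from a discount table so mutants are easy to make.
function makeCalculatePrice(discounts) {
    return (basePrice, tier) => {
        // Tiers missing from the table get no discount.
        const rate = tier in discounts ? discounts[tier] : 0;
        return basePrice * (1 - rate);
    };
}

// Hand-written mutants of the intended table { premium: 0.20, standard: 0.10 }.
const mutants = [
    { premium: 0.02, standard: 0.10 },              // premium constant changed
    { premium: 0.20, standard: 0.01 },              // standard constant changed
    { premium: 0.20, standard: 0.10, guest: 0.50 }, // default case changed
];

// Each suite returns true when all of its checks pass.
const tautologicalSuite = (calc) =>
    ['premium', 'standard', 'guest'].every(
        (tier) => calc(100, tier) === calc(100, tier)
    );

const close = (a, b) => Math.abs(a - b) < 1e-9;
const independentSuite = (calc) =>
    close(calc(100, 'premium'), 80) &&
    close(calc(100, 'standard'), 90) &&
    close(calc(100, 'guest'), 100);

// Mutation score = killed mutants / total mutants.
function mutationScore(suite) {
    const killed = mutants.filter((m) => !suite(makeCalculatePrice(m))).length;
    return killed / mutants.length;
}

console.log('tautological:', mutationScore(tautologicalSuite)); // tautological: 0
console.log('independent:', mutationScore(independentSuite));   // independent: 1
```

Both suites execute the same lines, so a coverage report cannot tell them apart. The mutation score can.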

Assertion coverage measures the ratio of assertions to test steps. A test that calls five functions and asserts one outcome has four unchecked assumptions. Assertion coverage is a proxy for how much of what the test does is actually verified. The post on assertion coverage vs line coverage covers how to use this metric alongside mutation score to get a complete picture.
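One possible way to approximate that ratio is to count steps and assertions while a test runs. The `createRecorder` helper and the cart functions below are hypothetical and exist only to mirror the "five calls, one assertion" case described above.

```javascript
// Hypothetical recorder: counts the steps a test performs and the checks it makes.
function createRecorder() {
    let steps = 0;
    let assertions = 0;
    return {
        // Wrap a function so that each call counts as one test step.
        step: (fn) => (...args) => {
            steps += 1;
            return fn(...args);
        },
        // A minimal equality check that counts itself as one assertion.
        check: (actual, expected) => {
            assertions += 1;
            if (actual !== expected) {
                throw new Error(`expected ${expected}, got ${actual}`);
            }
        },
        ratio: () => (steps === 0 ? 0 : assertions / steps),
    };
}

const rec = createRecorder();
const createCart = rec.step(() => []);
const addItem = rec.step((cart, price) => [...cart, price]);
const subtotal = rec.step((cart) => cart.reduce((sum, p) => sum + p, 0));
const applyCoupon = rec.step((amount, off) => amount - off);
const addShipping = rec.step((amount, fee) => amount + fee);

const cart = addItem(createCart(), 40);                        // steps 1-2
const total = addShipping(applyCoupon(subtotal(cart), 10), 5); // steps 3-5
rec.check(total, 35); // the only assertion: 40 - 10 + 5

console.log(rec.ratio()); // 0.2
```

A ratio of 0.2 means four of the five intermediate results were never checked directly, even though every function involved shows up as covered.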

Both metrics complement line coverage; neither replaces it. Coverage as a floor still makes sense: a line that is never executed by any test has zero protection, and the coverage tool will find it. The problem is treating the floor as the ceiling.

How Autonoma measures real protection

The root problem with code coverage as a quality signal is that it is produced by the same process it is supposed to evaluate. The tests run, the lines are touched, the tool reports. Nothing in that loop requires anyone to ask whether the assertions are correct.

We built Autonoma around independent verification from the start. The Planner Agent reads your codebase: routes, components, business logic, user flows. It generates test cases from application structure and plans checks at the user-flow boundary instead of echoing a function's current return value. Crucially, the Planner Agent did not write the application code.

The Executor Agent executes those test cases against a real, running preview environment per PR. Not mocks. Not a simulated DOM. Observable behavior in a running instance of the application. This is where execution-vs-verification collapses into the same test: the test either sees the right output from the running application or it does not.

The Reviewer Agent evaluates the run output, classifies failures, and filters false positives so the signal is not just execution noise.

The Diffs Agent keeps the test suite aligned with code changes, without requiring engineers to rewrite tests manually after every refactor.

AI verification is only trustworthy when it is independent of the thing being verified. A test that runs against a live preview, written by an agent that did not write the code, checking observable behavior instead of tracing execution: that is the signal line coverage cannot produce. Green on that suite means consistency with correct behavior, not just consistency with the implementation that was shipped.

The result is a test signal that means something: not "these lines were reached" but "this behavior was verified against a running instance by an agent with no stake in the implementation." That is the solution to the pain shown in the code snippet: a coverage gate can approve a tautological assertion, but an independent E2E run can still fail when the running product produces the wrong outcome.

Add the independent layer coverage cannot provide

Code coverage has been the default quality proxy for so long that questioning it feels counterintuitive. But the metric was always measuring the wrong thing. It measured execution. It reported it as coverage. Teams trusted the word and skipped the question underneath it. AI test generation did not create this problem. It accelerated it, because AI is extremely good at writing tests that maximize coverage without maximizing protection. The same-model loop means the bugs in the implementation become the expected values in the assertions.

If your team has accumulated AI-generated tests behind a coverage gate, the practical next steps are direct: run mutation testing on a critical module and see how many mutants survive; review a sample of AI-generated assertions manually and check whether they are asserting a computed correct value or echoing the implementation back; then put Autonoma on the real product flows so the final pre-merge signal is independent of the code that produced the bug. The broader question of which test metrics actually correlate with release quality is covered in test automation metrics and release quality.
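The manual review step can be given a rough first pass with a script. The `findEchoAssertions` function below is a hypothetical heuristic: it flags `assert` equality calls whose expected argument is a call to the function under test. A regular expression cannot parse JavaScript, so treat the output as a list of places to look rather than a verdict.

```javascript
// Heuristic only: expect misses and false positives. The function name is
// inserted into the pattern unescaped, so it must be a plain identifier.
function findEchoAssertions(testSource, functionName) {
    const pattern = new RegExp(
        `assert\\.(?:strictEqual|equal|deepStrictEqual)\\([^,]+,\\s*${functionName}\\(`
    );
    return testSource
        .split('\n')
        .map((text, index) => ({ line: index + 1, text: text.trim() }))
        .filter(({ text }) => pattern.test(text));
}

const sample = [
    "const price = calculatePrice(100, 'premium');",
    "assert.strictEqual(price, calculatePrice(100, 'premium'));",
    "assert.strictEqual(calculatePrice(100, 'standard'), 90);",
].join('\n');

const flagged = findEchoAssertions(sample, 'calculatePrice');
console.log(flagged);
```

On this sample only the second line is flagged: its expected value is the function's own output, while the third line compares against a literal.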

That last step supplies what coverage cannot. Autonoma provides it: four-agent E2E checks, executed against a managed preview environment per PR, with expected behavior checked at the boundary users actually touch.


Code coverage is a useful floor check: if a line is never executed by any test, it has zero protection. But as a quality signal, coverage is misleading because it measures execution, not verification. A test can touch every line in a function and assert nothing meaningful about its behavior. High coverage numbers routinely coexist with undetected bugs, especially in codebases that rely on AI-generated tests.

Coverage can be high while bugs still ship because coverage tracks which lines were executed during a test run, not whether the tests check for correct behavior. A test that calls a function, exercises every branch, and then asserts nothing (or asserts the output back to itself) will report 100% coverage and catch zero bugs. AI-generated tests are especially prone to this pattern: the model writes a test that runs without errors, which is enough to execute lines, but the assertion reflects the implementation rather than an independently derived expected outcome.

Mutation score and assertion coverage are both stronger signals than line coverage. Mutation score measures whether your tests can detect small, deliberate changes to the code (mutations). If mutants survive, your assertions are not checking what they should. Assertion coverage measures the ratio of assertions to test steps, flagging tests that call code without asserting outcomes. Neither metric is perfect alone, but both reveal what line coverage hides.

AI-generated tests do inflate coverage. AI models optimize for generating a test that executes and passes, not for generating a test that would fail when the behavior is wrong. Coverage is the easiest metric for that: call the function, touch the branches, confirm no exception is thrown. This is enough to drive coverage to 100% without any real behavioral assertion. The same-model loop makes it worse: when the AI that wrote the code also writes the test, bugs in the implementation become the expected value in the assertion.

Coverage does not tell you whether the tested behavior is correct, whether the assertions are meaningful, whether edge cases are handled, or whether a change to the logic would be caught. It tells you only which lines were executed at least once during a test run. A function with 100% coverage can contain an off-by-one error, a wrong formula, a missing authorization check, or a silent data corruption path, and the coverage report will show no signal of any of it.
