AI-Generated Tests That Pass But Don't Assert Anything

AI-generated unit tests often pass not because the code is correct, but because the assertions mirror the implementation rather than verifying it. A green run measures execution consistency, not behavioral correctness. When the same model writes the code and the test, any bug in the logic becomes the expected value, producing a tautological test that can never catch the very mistake it was supposed to find.

A Series B engineering team we spoke with described the moment clearly: "It asserts something, but not what it should be asserting." They had a Cursor-heavy workflow, 70%+ line coverage, green CI on every PR. Bugs were still reaching staging. Their QA engineers were still finding things their test suite had already signed off on.

This is not an edge case. It is the default outcome when AI generates both the implementation and the test in the same context window. The team is not lazy. The coverage number is real. The protection is not.

The green checkmark has become the primary signal engineers use to trust a PR. CI passes, tests are green, the PR merges. What most teams have not yet internalized is that the green checkmark on an AI-generated test suite measures execution, not correctness. It tells you the code ran without throwing an exception. It does not tell you the code did the right thing.

This article is for AI-forward engineering teams at Series A and Series B stage who ship with Cursor, Claude, or Copilot and have accumulated a meaningful library of green tests. You are not the no-QA startup that has zero coverage. You have coverage. The problem is that your green tests may be asserting consistency rather than correctness, and there is no visible signal telling you which ones. This is a different problem from having no tests at all: it is harder to see and harder to fix, because the CI dashboard looks fine.

What a Tautological Test Looks Like

The cleanest way to see the problem is with a pricing function. Suppose your application calculates a discounted price based on a user tier.

A human engineer writing this test would look up the business rule ("Silver tier gets 15% off, Gold gets 25% off"), compute the expected result independently (for a Gold user and a price of 100, that is 100 * 0.75 = 75.00), and write an assertion against that number. If the discount logic is wrong, the assertion fails.
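For reference, here is a minimal sketch of what such a pricing function might look like. The article does not show the implementation, so the `Tier` type, the `PRICE_MULTIPLIER` table, and the `RangeError` on negative input are assumptions inferred from the business rule above and from the tests later in this section.

```typescript
// Hypothetical implementation, assumed for illustration only.
type Tier = "Silver" | "Gold";

// Multipliers taken from the business rule: Silver 15% off, Gold 25% off.
const PRICE_MULTIPLIER: Record<Tier, number> = {
  Silver: 0.85,
  Gold: 0.75,
};

function calculateDiscountedPrice(price: number, tier: Tier): number {
  // Reject negative, NaN, and infinite prices before applying a discount.
  if (!isFinite(price) || price < 0) {
    throw new RangeError(`price must be a non-negative number, got ${price}`);
  }
  return price * PRICE_MULTIPLIER[tier];
}
```

In a real module the function would be exported so the test file can import it. The point of the sketch is only that the rates live in one place, which is exactly where a typo or copy-paste error would go unnoticed by a test that derives its expectation from the function itself.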

An AI writing the test in the same session that generated the discount function will do something subtler. It will call the function with the same inputs, capture the output, and assert that the output equals whatever the function currently returns. The assertion re-states the implementation back to itself. Here is the before/after:

import { calculateDiscountedPrice } from "./calculateDiscountedPrice";

/**
 * BEFORE: the tautological test.
 *
 * This is the kind of test an AI assistant produces when asked to
 * "write a test for calculateDiscountedPrice" without being told what
 * the correct output should be. The expected value is derived from the
 * function itself, so the assertion compares the implementation to a
 * copy of the implementation.
 *
 * The test passes. It always passes. It will keep passing even if the
 * discount logic is wrong, because the "expected" side moves in lockstep
 * with the "actual" side. It asserts that the code does what the code
 * does, which is true by construction and tells you nothing.
 */
describe("calculateDiscountedPrice (tautological / before)", () => {
  it("applies the Silver discount", () => {
    const price = 100;

    // The expected value is computed by re-running the same code path.
    // If the Silver rate were changed from 15% to 5%, this line would
    // change with it, and the assertion would still pass.
    const expected = calculateDiscountedPrice(price, "Silver");

    expect(calculateDiscountedPrice(price, "Silver")).toBe(expected);
  });

  it("applies the Gold discount", () => {
    const price = 200;

    // Same failure mode: the function is its own oracle.
    const expected = calculateDiscountedPrice(price, "Gold");

    expect(calculateDiscountedPrice(price, "Gold")).toBe(expected);
  });
});

import { calculateDiscountedPrice } from "./calculateDiscountedPrice";

/**
 * AFTER: the behavior-asserting test.
 *
 * The expected values here are derived independently of the function.
 * They are computed by hand from the business rule:
 *
 *   Silver = price * 0.85   (15% off)
 *   Gold   = price * 0.75   (25% off)
 *
 * Because the expected side does not call the function, it cannot drift
 * with a buggy implementation. If someone changes the Silver rate from
 * 15% to 5%, calculateDiscountedPrice(100, "Silver") returns 95 while the
 * test still expects 85, so the test fails and the regression is caught.
 */
describe("calculateDiscountedPrice (behavioral / after)", () => {
  it("charges 85 percent of price for Silver", () => {
    // Hand-computed: 100 * 0.85 = 85. Not produced by the function.
    expect(calculateDiscountedPrice(100, "Silver")).toBe(85);
  });

  it("charges 75 percent of price for Gold", () => {
    // Hand-computed: 200 * 0.75 = 150.
    expect(calculateDiscountedPrice(200, "Gold")).toBe(150);
  });

  it("rejects negative prices", () => {
    expect(() => calculateDiscountedPrice(-1, "Silver")).toThrow(RangeError);
  });
});

The "before" test will pass whether the discount is 15%, 25%, or 0%. It passes when there is a typo in the tier comparison. It passes when the wrong tier branch is hit. Whatever value the buggy function produces, that is the value the assertion expects, because the assertion was derived from the function rather than from the requirement. The test is green but broken.

The corrected version asserts against an independently-derived expected value: the number a human computed from the specification, not from the code. Now if the discount logic has a bug, the test fails.
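To make the difference concrete without a test runner, the sketch below runs both styles of check against a deliberately broken implementation. The names `buggyPricing`, `tautologicalCheck`, and `behavioralCheck` are hypothetical, and the 5% Silver rate is an assumed bug chosen for illustration.

```typescript
type Tier = "Silver" | "Gold";
type PricingFn = (price: number, tier: Tier) => number;

// Hypothetical buggy implementation: Silver was mistyped as 5% off instead of 15%.
const buggyPricing: PricingFn = (price, tier) =>
  tier === "Silver" ? price * 0.95 : price * 0.75;

// Tautological check: the expected value comes from the function under test.
function tautologicalCheck(fn: PricingFn): boolean {
  const expected = fn(100, "Silver");
  return fn(100, "Silver") === expected;
}

// Behavioral check: the expected value comes from the business rule (100 * 0.85).
function behavioralCheck(fn: PricingFn): boolean {
  return fn(100, "Silver") === 85;
}

const tautologicalPasses = tautologicalCheck(buggyPricing); // true: the bug is invisible
const behavioralPasses = behavioralCheck(buggyPricing); // false: the bug is caught
```

Both checks execute the same line of pricing code, so a coverage tool would score them identically. Only the second one can go red.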

The distinction matters more than it looks. The "before" version will also pass all your linting, type checking, and code review. It compiles cleanly. It runs quickly. It contributes to your coverage metric. From every signal except a real bug in production, it looks like a good test. This is what makes false coverage so dangerous at scale: the false signal is indistinguishable from the real one until users start hitting the broken path.

To see the full range of shapes this anti-pattern takes, the sibling article Useless Unit Tests: The Tautological Test Anti-Pattern catalogues five variants. This article focuses on the canonical before/after and the mutation argument.

Why AI Generation Produces Them

The core mechanism is simple: AI optimizes for producing a test that passes, not for producing a test that would fail if the behavior were wrong. The model has no access to the original business requirement. It has access to the code you just gave it. So it writes a test that is correct with respect to that code: whatever the code returns becomes the expected value. No external specification is consulted. This dynamic is part of a broader pattern we cover in generative AI testing and QA of AI-generated code: the same generation-first mindset that accelerates delivery also undermines the independence that verification requires. The full explanation of how generation-for-coverage differs from generation-for-behavior lives in AI Test Generation: Why Green Tests Still Let Bugs Through.
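One way to supply the missing external specification is to write the expected values down as data before any test is generated. The following is a sketch under assumptions: the `SpecCase` shape, the `SPEC_CASES` table, and the `violatedRules` helper are hypothetical names, and the two cases simply restate the Silver and Gold rule used in this article.

```typescript
type Tier = "Silver" | "Gold";

interface SpecCase {
  rule: string; // the requirement this case is traced to
  price: number;
  tier: Tier;
  expected: number; // computed by hand from the rule, never by calling the code
}

const SPEC_CASES: SpecCase[] = [
  { rule: "Silver tier gets 15% off", price: 100, tier: "Silver", expected: 85 },
  { rule: "Gold tier gets 25% off", price: 200, tier: "Gold", expected: 150 },
];

// Returns the rules an implementation violates; an empty list means it conforms.
function violatedRules(fn: (price: number, tier: Tier) => number): string[] {
  return SPEC_CASES
    .filter((c) => Math.abs(fn(c.price, c.tier) - c.expected) > 1e-9)
    .map((c) => c.rule);
}
```

A table like this can be handed to the model as context or fed to a parameterized test such as Jest's `it.each`. Either way, the expected column exists before the implementation is consulted, so a bug in the code cannot rewrite it.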

Mutation Score: The Metric That Exposes It

Line coverage tells you which lines were executed. It says nothing about whether any assertion would catch a mistake on those lines. A test that calls a function and asserts nothing still counts every line it executes as covered. A tautological test that compares the function's output to itself does the same. Both can score 100% line coverage on that function.

Mutation testing exposes this. A mutation testing tool (Stryker for JavaScript and TypeScript, PIT for Java) introduces small, deliberate bugs into your codebase: changing a > to >=, flipping a boolean, replacing a return value with a constant. For each mutation, it runs your test suite. If no test goes red, the mutation "survived." Mutation score is the percentage of mutations that were killed.
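The scoring logic can be sketched by hand in a few lines. This is not how Stryker or PIT work internally; the `mutants` below are hand-written stand-ins for what a tool would generate automatically, and the names `Suite`, `tautologicalSuite`, `behavioralSuite`, and `mutationScore` are hypothetical.

```typescript
type Tier = "Silver" | "Gold";
type PricingFn = (price: number, tier: Tier) => number;
type Suite = (fn: PricingFn) => boolean; // true means every assertion passed

// Hand-written mutants of a pricing function (Silver 15% off, Gold 25% off).
const mutants: PricingFn[] = [
  (price) => price * 0.75, // Silver charged at the Gold rate
  (price, tier) => (tier !== "Silver" ? price * 0.85 : price * 0.75), // flipped comparison
  (price) => price, // discount removed
  () => 0, // return value replaced with a constant
];

// The function is its own oracle, so no mutant can make this suite fail.
const tautologicalSuite: Suite = (fn) =>
  fn(100, "Silver") === fn(100, "Silver") && fn(200, "Gold") === fn(200, "Gold");

// Expected values come from the business rule, not from the function.
const behavioralSuite: Suite = (fn) =>
  fn(100, "Silver") === 85 && fn(200, "Gold") === 150;

// Mutation score: percentage of mutants for which the suite goes red.
function mutationScore(suite: Suite, candidates: PricingFn[]): number {
  const killed = candidates.filter((mutant) => !suite(mutant)).length;
  return (killed / candidates.length) * 100;
}

const tautologicalScore = mutationScore(tautologicalSuite, mutants); // 0
const behavioralScore = mutationScore(behavioralSuite, mutants); // 100
```

Both suites call the function with the same inputs, so their line coverage is identical. The mutation score is the only number here that separates them.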

A tautological test suite will show high line coverage and low mutation score. Practitioner reports and published accounts from teams adopting mutation testing suggest that AI-generated test suites frequently survive the majority of mutations even when line coverage looks healthy. The exact number varies by codebase and generation tool, but the pattern is consistent: the tests execute the code without asserting its correctness in any way that a small behavioral change would expose.

Figure (bar chart): The same suite, measured two ways. Line coverage says 78% of lines ran. Mutation score says 31% of injected bugs were caught. The gap between the two is false coverage.

If you want to pressure-test this on your own suite, Mutation Testing vs Code Coverage walks through the setup and interpretation in detail.

The Self-Deception Cycle

There is a deeper structural problem than any single tautological test. When vibe-coding workflows use the same model to write a feature and then immediately prompt it to "write tests for what I just built," the model does not have access to the business requirement. It has access to the code it just wrote. It tests what the code does, not what the code should do. This is precisely the structural argument behind why Cursor and Claude Code cannot reliably test their own output: the model that wrote the logic carries the same implicit assumptions into the test session.

Any bug in the implementation is invisible to this process. The model cannot see the difference between "the function computes 0.75 * price because the discount is 25%" and "the function computes 0.75 * price because of a copy-paste error from the Gold tier branch." Both produce the same output on the first test run. Both produce the same assertion. The test passes.
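The same thing happens when the generated test contains a hard-coded number instead of a second function call. The sketch below mimics a generator that runs the code once and freezes whatever it observes into a literal. The names `generatedPricing` and `recordExpectation` are hypothetical, and the copy-paste bug in the Silver branch is assumed for illustration.

```typescript
type Tier = "Silver" | "Gold";
type PricingFn = (price: number, tier: Tier) => number;

// Hypothetical copy-paste bug: the Silver branch reuses the Gold multiplier.
function generatedPricing(price: number, tier: Tier): number {
  if (tier === "Silver") {
    return price * 0.75; // copied from the Gold branch; the rule says 0.85
  }
  return price * 0.75;
}

// Runs the code once and turns the observed output into an assertion.
// The literal looks hand-written, but it was dictated by the code.
function recordExpectation(fn: PricingFn, price: number, tier: Tier) {
  const observed = fn(price, tier);
  return {
    source: `expect(calculateDiscountedPrice(${price}, "${tier}")).toBe(${observed});`,
    passes: (candidate: PricingFn) => candidate(price, tier) === observed,
  };
}

const recorded = recordExpectation(generatedPricing, 100, "Silver");
// recorded.source asserts toBe(75), and recorded.passes(generatedPricing) is true,
// even though the business rule says a Silver user should pay 85.
```

The recorded assertion would fail against a correct implementation, which means that fixing the bug later turns the test red and the fix looks like a regression.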

Figure (closed four-step loop): The model writes the implementation with a hidden bug, the same model writes a test that asserts the code's own output instead of the spec, the test goes green with the bug as the expected value, and the feature is reported as done so the bug ships. When one model writes both the code and the test, the loop has no independent source of truth to break it.

The test passes but the bug ships. Not because the model made an error when writing the test. Because the model had no independent source of truth to test against.

This is the self-deception cycle: the same agent writes the code, writes the tests, watches them go green, and reports the feature as done. The CI pipeline confirms it. Nobody disagrees. The bug is now in production.

AI-forward teams at Series A and Series B are particularly exposed because they often ship 20-40 PRs per day with heavy Cursor or Claude Code usage. Each PR brings its own AI-generated test suite. Each suite goes green. The aggregate coverage number looks strong. The aggregate protection is hollow.

The core principle, stated plainly: AI verification is only trustworthy when it is independent of the thing being verified. Green means the code is consistent with itself. It does not mean the behavior is correct.

The False Coverage Trap at Scale

The "green but broken" state is not a one-time event. It compounds. Each sprint adds more AI-generated tests. Each test that passes without asserting meaningful behavior adds to a coverage number that senior engineers use to justify shipping confidently. The coverage number grows. The actual protection does not.

"Our QA engineers are still finding things after the AI test suite went green. We assumed the tests would surface what QA used to surface. They don't."

A Series B team we worked with had reached 78% line coverage across their critical checkout flow. Three of the five tests in that flow were asserting on values derived from the same functions they were testing. The other two were structural (checking that a component rendered, not that it rendered correctly). A billing tier bug shipped twice in one month. Both times the CI pipeline was green.

The trap is not that the tests were bad in isolation. Any one of them might have caught a bug. The trap is that the aggregate signal they produced ("78% coverage, all green") was being used as a proxy for "the checkout flow is protected." It was not. It was a measure of execution, not protection.

False test coverage is the term for this gap, and it is worth being precise about what it means: coverage that makes you feel protected without making you actually protected. The AI Test Theater concept names the broader phenomenon where AI-generated tests, AI PR review, and AI code generation combine to produce a production pipeline that looks rigorous from the outside but has no independent verification layer inside.

For a direct look at how to diagnose whether your own test suite has this problem, see How to Tell If Your Tests Are Actually Testing Anything.

How Autonoma Verifies Behavior AI Tests Can't

The pattern we have been describing is a structural limitation, not a prompt engineering problem. A tautological test exists because the test author and the implementation author share the same source of truth. Prompt differently, get the same issue, because the missing ingredient is not better instructions: it is an independent source of truth.

That is the layer Autonoma provides. We built our platform to verify application behavior from the outside, independent of the code that generated the implementation or its unit tests.

Our three agents divide the work. The Planner reads your codebase: routes, components, API contracts, user flows. It plans test cases from the structure of the application, not from the implementation of any single function, and sets up the database state each scenario needs. This is a meaningful difference. A unit test is derived from a function. Our planned tests are derived from what the function is part of: a user flow with an expected outcome. The Automator turns those plans into executable tests and runs them against your live preview environment, asserting on observable application behavior (what the UI shows, what the API returns, what the database state becomes). The Maintainer keeps those tests healthy as your code evolves, surfacing real divergence instead of noise.

That does not mean the agents ignore the code: the Planner reads it to understand how the application is structured. The independence comes from not deriving the expected value from the function's current output. The agents exercise the running app from the outside and compare against expected user-visible behavior. A bug that a tautological unit test would call the "expected value" is exactly the kind of divergence this surfaces, because the expected value in our system comes from the application flow, not from the function's own output.

It is worth being precise about how the layers fit together. Autonoma does not replace unit tests or AI code review. It adds the independent behavioral E2E layer that catches the class of business-logic bugs those layers structurally cannot, because they tend to ratify the implementation or the diff instead of independently checking user-flow behavior. If you want to know how to improve the unit tests themselves, How to Write Good Test Assertions is the prescriptive guide. The two layers are complementary, not competing.

One boundary worth naming honestly: Autonoma verifies behavior at the application E2E layer, so tautological tests living entirely in a library of pure functions with no user-facing surface are still Stryker's job. For everything else, the recommendation is direct. The bug class that survives AI-generated unit tests is behavioral: the wrong discount applied in a specific edge case, the wrong redirect after a specific error state, correct-looking logic that produces a wrong outcome for a real user. That is the class Autonoma exists to catch. If your suite is green and your QA engineers are still finding things, this is the layer to add.

Why do AI-generated unit tests pass but miss bugs?

AI-generated unit tests pass but miss bugs because the model optimizes for writing a test that executes without error, not for writing an assertion that would fail when the behavior is wrong. When the same AI that wrote the implementation also writes the test, any bug in the implementation becomes the 'expected value' in the assertion. The test is consistent with the code, not with the correct behavior. This is called a tautological test.

What is a tautological test?

A tautological test is a test that asserts the implementation back to itself rather than asserting against an independently-derived expected behavior. Instead of calculating what the function should return and comparing, the test re-invokes the same logic (or a copy of it) and compares the output to itself. The test can never fail unless the code raises an exception, because whatever the code produces is what the assertion expects.

Does high code coverage mean good tests?

No. Code coverage measures execution, not verification. A test can execute every line of a function and still assert nothing meaningful about its behavior. AI test generation is particularly prone to inflating coverage this way because it writes tests that call the function and touch all branches without requiring those branches to produce specific correct outcomes. Mutation score is a far better proxy for test quality than line coverage.

Can AI write good test assertions?

AI can write structurally valid assertions. The problem is that a good assertion requires knowing the correct expected value independently of the implementation. When AI writes both the implementation and the test in the same session, any mistake in the logic propagates into both. AI can write better assertions when given a specification, a business rule, or an independently-derived expected value to assert against. Without that independent source of truth, the assertion reflects the code's behavior, right or wrong.

How do I know if my AI-generated tests are real?

The simplest check: introduce a deliberate bug into the function and see if any test goes red. If your suite stays green after you break the logic, the tests are not protecting you. A more systematic approach is to run a mutation testing tool (Stryker for JavaScript/TypeScript, PIT for Java). Mutation score below 50% on a function that has 80%+ line coverage is a strong signal that the assertions are tautological. Independently verify a sample of assertions by manually computing what the correct output should be before checking what the test expects.
