How to Write Good Test Assertions

Tom Piaggio, Co-Founder at Autonoma

A good test assertion verifies observable behavior, not implementation. It must be capable of failing: break the function it covers and it must go red. It checks a specific real value, stays scoped to one logical behavior, and uses real outputs rather than mock round-trips. That is what assertion quality means. The test most teams skip: would this assertion catch a real bug?

Most teams shipping with Cursor, Claude, or Copilot have no shortage of tests. They have hundreds of green ones. The problem is that "green" stopped meaning much, because the tests were written by the same system that wrote the code. "It asserts something, but it's not really asserting what it should be asserting." That is word-for-word what an engineer at a Series B startup told us after auditing their AI-generated suite.

This article is not for teams that have no tests. It is for AI-forward engineering teams, the ones shipping with vibe-coding workflows and heavy Cursor and Copilot usage, who have plenty of tests and still find bugs in production. The assertion rules below answer one question: how do I know if my tests are actually good? (If you are earlier in the journey and want context on how generative AI interacts with QA workflows more broadly, generative AI testing and QA covers the landscape.)

5 Rules for Assertions That Protect

Rule 1: Assert on observable behavior, not implementation

Observable behavior means what the function returns, what the UI shows, what the database contains after the operation. Implementation means that a specific internal method was called, a private variable was set, or a particular code path executed.

Figure: Implementation-detail assertions (internal method called, private variable set, code path ran) break on valid refactors and pass on real bugs. Observable-behavior assertions (return value, what the UI shows, database state after the operation) are where to assert.

Implementation assertions break on valid refactors and pass on real bugs. You can rename the internal method and the test fails, even though behavior is unchanged. You can introduce a bug in the return value and the test passes, because the right internal calls still happened. The rule: if your assertion would pass on a subtly wrong implementation, it is an implementation assertion. Drop it.

/**
 * Rule 1: Assert observable behavior, not implementation details.
 *
 * The function under test computes a cart total after applying a
 * tier-based discount. There are two ways to test it:
 *
 *   BAD  - spy on an internal helper and assert it was called.
 *          This couples the test to HOW the function works, not WHAT
 *          it returns. The assertion stays green even when the math
 *          is wrong, as long as the internal helper still runs.
 *
 *   GOOD - assert the actual returned total for a concrete input.
 *          This tests WHAT the caller observes. It goes red the
 *          moment the output is wrong, which is the only thing the
 *          rest of the system actually depends on.
 *
 * Syntax is jest-style (`describe` / `it` / `expect`). The file is
 * illustrative and has no runtime dependencies; it is meant to be
 * read alongside the blog post, not executed.
 */

// --- subject under test -----------------------------------------------------

const DISCOUNT_RATES = {
  standard: 0,
  silver: 0.1,
  gold: 0.2,
};

// Internal helper. Note: this is an implementation detail. Callers do
// not care that it exists; they only care about the total they get back.
function discountRateFor(tier) {
  return DISCOUNT_RATES[tier] ?? 0;
}

function applyDiscount(cart, tier) {
  const subtotal = cart.items.reduce(
    (sum, item) => sum + item.price * item.quantity,
    0
  );
  const rate = discountRateFor(tier);
  return Math.round(subtotal * (1 - rate) * 100) / 100;
}

// --- the helpers module the BAD test reaches into ---------------------------
// In a real codebase `discountRateFor` would be imported from a module so a
// spy could replace it. We expose it here to mirror that shape.
const helpers = { discountRateFor };

function applyDiscountViaHelpers(cart, tier) {
  const subtotal = cart.items.reduce(
    (sum, item) => sum + item.price * item.quantity,
    0
  );
  const rate = helpers.discountRateFor(tier);
  return Math.round(subtotal * (1 - rate) * 100) / 100;
}

// --- fixtures ---------------------------------------------------------------

const cart = {
  items: [
    { name: 'keyboard', price: 80, quantity: 1 },
    { name: 'mouse', price: 20, quantity: 2 },
  ],
};
// subtotal = 80 + (20 * 2) = 120
// gold tier => 20% off => 96.00

describe('applyDiscount', () => {
  // BAD: spies on the internal helper and asserts it was called.
  //
  // Why it is bad: this passes as long as `discountRateFor` is invoked,
  // even if `applyDiscount` later multiplies by the wrong factor, drops
  // a line item, or returns `subtotal` untouched. The discount could be
  // completely broken and this test would still be green, because it
  // never looks at the number the caller receives.
  it('BAD: asserts the internal helper was called', () => {
    const spy = jest.spyOn(helpers, 'discountRateFor');

    applyDiscountViaHelpers(cart, 'gold');

    expect(spy).toHaveBeenCalledWith('gold');
    expect(spy).toHaveBeenCalledTimes(1);
    // No assertion on the returned total. If the math regresses, this
    // test does not notice.

    spy.mockRestore();
  });

  // GOOD: asserts the observable result for a concrete input.
  //
  // Why it is good: it pins the exact value the caller depends on.
  // If anyone breaks the discount math (wrong rate, missing item,
  // off-by-one rounding), the expected total no longer matches and
  // the test fails loudly, pointing straight at the regression.
  it('GOOD: asserts the returned total for a gold-tier cart', () => {
    const total = applyDiscount(cart, 'gold');

    expect(total).toBe(96.0);
  });

  it('GOOD: standard tier returns the full subtotal', () => {
    const total = applyDiscount(cart, 'standard');

    expect(total).toBe(120.0);
  });
});

module.exports = { applyDiscount };

Rule 2: One logical behavior per test

If the first assertion in a test block fails, the rest do not run. You lose information about what else broke. More importantly, a test that covers three behaviors tells you "something broke," not "this behavior broke."

One logical behavior does not mean one line of code. Testing that a discount is applied correctly may need multiple assertions (line item price changed, total changed, discount label appeared). Those are all one behavior: the discount was applied. What is not acceptable is asserting discount behavior and pagination behavior in the same test. The practical heuristic: if your test description has "and" in it, you have two tests.
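A minimal sketch of that distinction follows. The `applyCoupon` and `paginate` functions are hypothetical, invented only for this illustration, and Node's built-in assert module stands in for a test runner so the snippet runs as a plain script. The first test makes three assertions but still describes one behavior; pagination gets its own test.

```javascript
const { strictEqual, deepStrictEqual } = require('node:assert');

// Hypothetical subjects under test, invented for this illustration.
function applyCoupon(order, percentOff) {
  const discount = (order.subtotal * percentOff) / 100;
  return {
    subtotal: order.subtotal,
    total: order.subtotal - discount,
    label: `${percentOff}% off`,
  };
}

function paginate(items, page, pageSize) {
  const start = (page - 1) * pageSize;
  return items.slice(start, start + pageSize);
}

// One behavior ("the coupon was applied"), checked with several assertions.
function testCouponIsApplied() {
  const result = applyCoupon({ subtotal: 200 }, 25);
  strictEqual(result.subtotal, 200);
  strictEqual(result.total, 150);
  strictEqual(result.label, '25% off');
}

// A different behavior gets its own test, so a failure names what broke.
function testSecondPageHoldsTheNextItems() {
  deepStrictEqual(paginate(['a', 'b', 'c', 'd', 'e'], 2, 2), ['c', 'd']);
}

testCouponIsApplied();
testSecondPageHoldsTheNextItems();
```

If the coupon math regresses, only the coupon test fails, and the pagination test keeps reporting on its own behavior.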

Rule 3: The assertion must fail if you intentionally break the function

This is the most important rule, and the easiest to validate. Pick one behavior your test is supposed to cover. Modify the function to return the wrong value for that case. Does the test go red?

If the test still passes after you introduced a deliberate bug, it is not protecting you. It is asserting something orthogonal to the behavior it claims to cover. AI verification is only trustworthy when it is independent of the thing being verified. Green means consistency, not correctness. AI-generated tests fail this check far more often than handwritten ones, because they are optimized to pass on the current code, not to catch a broken version.
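As a sketch of that check, the snippet below runs the same two assertions against a correct function and a deliberately broken copy. `shippingCost`, its 50-dollar free-shipping threshold, and the 5.99 flat fee are hypothetical, chosen only for illustration. Node's built-in assert module is used in place of a test runner.

```javascript
const { strictEqual } = require('node:assert');

// Hypothetical rule, for illustration: orders of 50 or more ship free,
// everything else pays a flat 5.99.
function shippingCost(orderTotal) {
  return orderTotal >= 50 ? 0 : 5.99;
}

// Deliberately broken copy: the free-shipping branch is gone.
function shippingCostBroken(orderTotal) {
  return 5.99;
}

// Weak: only checks the type, which is orthogonal to the rule.
function weakTest(fn) {
  strictEqual(typeof fn(80), 'number');
}

// Strong: pins the value the rule requires for this input.
function strongTest(fn) {
  strictEqual(fn(80), 0);
}

// Returns true when the test throws, that is, when it goes red.
function goesRed(test, fn) {
  try {
    test(fn);
    return false;
  } catch (err) {
    return true;
  }
}

console.log(goesRed(weakTest, shippingCost)); // false: green on correct code
console.log(goesRed(weakTest, shippingCostBroken)); // false: still green on the bug
console.log(goesRed(strongTest, shippingCost)); // false: green on correct code
console.log(goesRed(strongTest, shippingCostBroken)); // true: red on the bug
```

Both tests are green on the correct function, so a passing run alone cannot tell them apart. Only the broken copy separates the assertion that protects from the one that does not.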

Rule 4: Prefer real values over mock round-trips

A test that mocks a dependency and then asserts against the mock's return value is asserting that the mock works. It always passes because you configured it. The rule is not "never mock." Mocking for speed and isolation is legitimate. The rule is: the assertion must check the output of the code under test, not the input you fed it via the mock.

If the only thing your assertion checks is that a mock was called with the right argument, you have not tested behavior. You have tested that a function was invoked. Assert against real values: a specific string, a specific number, a specific structure. That brittleness is intentional. It will break if the business logic changes. That is the point.

/**
 * Rule 4: Assert real values, not the mock round-trip.
 *
 * The function under test fetches raw invoice line items from an API
 * client and turns them into a single total in a target currency.
 * The API client is mocked. There are two ways to write the assertion:
 *
 *   BAD  - call the mock directly and assert that it returns what it
 *          was configured to hand back. This is tautological: you told
 *          the mock to return X, then asserted you got X. The function's
 *          own logic (summation, currency conversion) is never tested.
 *          Delete the entire body of `fetchInvoiceTotal` and the test
 *          still passes.
 *
 *   GOOD - assert the concrete value the function PRODUCES from the
 *          mocked input: the summed, converted total. This fails if the
 *          summation, the conversion rate, or the rounding regresses,
 *          which is the actual business logic worth protecting.
 *
 * Syntax is jest-style. The file is illustrative and has no runtime
 * dependencies; it is meant to be read alongside the blog post.
 */

// --- subject under test -----------------------------------------------------

// Static FX rate. In production this would itself come from an external
// source; here it is a constant so the example stays dependency-free.
const USD_PER_EUR = 1.1;

// `apiClient.getInvoiceLines(id)` returns line items priced in EUR.
// `fetchInvoiceTotal` sums them and converts the total to USD.
async function fetchInvoiceTotal(apiClient, invoiceId) {
  const lines = await apiClient.getInvoiceLines(invoiceId);
  const totalEur = lines.reduce(
    (sum, line) => sum + line.amountEur * line.quantity,
    0
  );
  const totalUsd = totalEur * USD_PER_EUR;
  return Math.round(totalUsd * 100) / 100;
}

// --- fixtures ---------------------------------------------------------------

// What the mocked API client is configured to return.
const MOCK_LINES = [
  { sku: 'seat', amountEur: 50, quantity: 2 }, // 100 EUR
  { sku: 'support', amountEur: 30, quantity: 1 }, // 30 EUR
];
// totalEur = 130, converted at 1.1 => 143.00 USD

describe('fetchInvoiceTotal', () => {
  // BAD: calls the mock directly and asserts it returns its own payload.
  //
  // Why it is bad: the expected value is the mock's configured return.
  // The test proves only that the mock was wired up, not that
  // `fetchInvoiceTotal` sums or converts anything. If the conversion
  // factor is dropped, or the quantities are ignored, this assertion
  // still passes because `fetchInvoiceTotal` is never called.
  it('BAD: asserts the mock echoes the mocked lines', async () => {
    const apiClient = {
      getInvoiceLines: jest.fn().mockResolvedValue(MOCK_LINES),
    };

    const lines = await apiClient.getInvoiceLines('inv-1');

    // Tautology: we configured the mock to return MOCK_LINES, then
    // assert we got MOCK_LINES back. The unit under test is bypassed.
    expect(lines).toEqual(MOCK_LINES);
  });

  // GOOD: asserts the value the function PRODUCES from the mocked input.
  //
  // Why it is good: 143.00 is not something the mock returns; it is the
  // result of the function's own summation and currency conversion.
  // Break the conversion (or the reduce) and this expectation no longer
  // holds, so the test fails and names the regression.
  it('GOOD: asserts the summed, converted total in USD', async () => {
    const apiClient = {
      getInvoiceLines: jest.fn().mockResolvedValue(MOCK_LINES),
    };

    const total = await fetchInvoiceTotal(apiClient, 'inv-1');

    expect(total).toBe(143.0);
    expect(apiClient.getInvoiceLines).toHaveBeenCalledWith('inv-1');
  });

  it('GOOD: an empty invoice converts to a zero total', async () => {
    const apiClient = {
      getInvoiceLines: jest.fn().mockResolvedValue([]),
    };

    const total = await fetchInvoiceTotal(apiClient, 'inv-empty');

    expect(total).toBe(0);
  });
});

module.exports = { fetchInvoiceTotal };

Rule 5: Assert specific values, not just shape or truthiness

expect(result).toBeTruthy() passes for any truthy value, including an empty object or an empty array. expect(result).toHaveProperty("discountAmount") tells you the property exists, not whether the value is correct. A function that always returns { discountAmount: 0 } passes both assertions even when a 20% discount should have been applied.

Assert the actual value. Not "is there a discount." The specific number the business rule requires, for the specific input you gave it. This is the hardest rule for AI-generated tests because writing the expected value requires knowing the business rule. The AI knows the code. It does not know whether the rule requires 20% or 15% or a floor of $5. "It doesn't cover the business case" is a direct consequence of this structural gap.
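The gap can be seen in a small sketch. `computeDiscount` below is a hypothetical function carrying the always-zero bug described above, and the 20% discount on a 150 order is an assumed business rule for illustration. Node's built-in assert module is used so the snippet runs without a test runner.

```javascript
const { ok, strictEqual, throws, AssertionError } = require('node:assert');

// Hypothetical buggy implementation: the discount is never computed.
function computeDiscount(orderTotal, rate) {
  return { discountAmount: 0 };
}

const result = computeDiscount(150, 0.2);

// Truthiness: passes, because any object is truthy.
ok(result);

// Shape: passes, because the property exists.
ok('discountAmount' in result);

// Specific value: under the assumed rule, 20% of 150 is 30, so this
// assertion throws. It is wrapped in `throws` only so the script can
// keep running and show that the failure is an AssertionError.
throws(() => strictEqual(result.discountAmount, 30), AssertionError);
```

The first two checks would sit green in CI indefinitely. Only the third encodes the number the rule requires, and that number has to come from someone who knows the rule.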

The Mutation Test for an Assertion

The five rules have a single diagnostic that covers all of them: can this assertion be killed by a mutation? This is the basis of independent verification: not "does this code run?" but "would a different output be caught?"

A mutation is a small, deliberate change to the function under test. A return value flipped. A condition negated. An off-by-one. If you introduce a mutation and the test still passes, the test is not testing the behavior the mutation broke. The assertion is decorative.

Figure: The mutation test: break the function on purpose (flip a return value, negate a condition), then run the test. Red means the mutation was killed and the assertion protects you. Green means the mutation survived and the assertion is decorative.

Running a formal mutation framework (Stryker for JavaScript/TypeScript, PIT for Java) gives you a mutation score across the whole suite. Practitioner reports suggest AI-generated suites with high line coverage can score very low here, because the assertions check the implementation rather than the behavior. The assertion coverage vs line coverage article covers the metric mechanics.

You do not need a framework. For any assertion you are unsure about: change the return value in the function. Run the test. If it stays green, the assertion was giving you false confidence, not protection.
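The manual version can be sketched as a loop over hand-written mutants. Everything here is hypothetical: `priceWithTax`, the 8% rate, and the three mutants are invented for illustration. This is not how Stryker or PIT work internally; it only mirrors the idea of counting killed mutants. Amounts are in integer cents to keep the arithmetic exact.

```javascript
const { ok, strictEqual } = require('node:assert');

// Original (hypothetical): add 8% tax to an amount in cents.
const priceWithTax = (netCents) => Math.round(netCents * 1.08);

// Hand-written mutants, each a small deliberate change.
const mutants = [
  (netCents) => Math.round(netCents * 0.92), // sign flipped: tax subtracted
  (netCents) => Math.round(netCents * 1.08) + 1, // off by one cent
  (netCents) => netCents, // tax step removed
];

// Two candidate suites for the same function.
const decorativeSuite = (fn) => ok(fn(10000) > 0);
const protectiveSuite = (fn) => strictEqual(fn(10000), 10800);

// A mutant is killed when the suite throws against it.
function killed(suite, fn) {
  try {
    suite(fn);
    return false;
  } catch (err) {
    return true;
  }
}

// Fraction of mutants the suite kills.
function mutationScore(suite) {
  if (killed(suite, priceWithTax)) {
    throw new Error('suite must pass on the unmutated function');
  }
  return mutants.filter((mutant) => killed(suite, mutant)).length / mutants.length;
}

console.log(mutationScore(decorativeSuite)); // 0
console.log(mutationScore(protectiveSuite)); // 1
```

Both suites execute every line of `priceWithTax`, so line coverage is identical. The score differs only because one assertion pins the expected value and the other accepts any positive number.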

Checklist: Is This Assertion Worth Keeping?

Before committing a test, run through this:

  • Would this assertion fail if the function returned the wrong value?
  • Is the expected value a real business outcome, not just shape or truthiness?
  • Does this test cover exactly one logical behavior?
  • Does the assertion check what comes out of the code, not what went into a mock?
  • If you comment out the line that implements the behavior, does the test go red?

If any answer is no, the assertion needs work. Teams we have worked with that ran this checklist against their AI-generated suites found that a meaningful fraction of their tests failed at least one check. Not because the engineers were careless. Because the AI generating those tests was reasoning about what would pass, not about what would fail.

How Autonoma Writes Assertions from Real Behavior

The pattern this article documents is a specific kind of fragility: tests that assert consistency with the code rather than correctness against the business requirement. Every rule above is an attempt to make the assertion independent of the implementation. Assert the output, not the mechanism. Assert the value, not the shape.

Our team built Autonoma to be that independent layer. Our three agents work from your codebase and the running application, not from PR-diff review alone. The Planner reads your routes, components, and user flows to plan test cases against the running application, including the DB state each scenario needs. The Automator executes those cases against a real preview environment per PR. The Maintainer keeps the tests passing as code changes. The assertions in those E2E tests are derived from what the application actually does when driven through real user flows, which is exactly the independence the five rules above are trying to recover one assertion at a time.

The platform does not write your unit-test assertions (those need the business-rule knowledge of the engineer who owns the domain), but for the behavioral class of bug it is the layer we recommend AI-forward teams add first. AI code reviewers like CodeRabbit and Bugbot catch the syntactic class. Unit-test assertions, written to the rules above, catch the function-level class. Autonoma catches the behavioral class at the integration boundary: the wrong total, the broken flow, the page that renders but lies. That is the class most AI-generated suites are missing entirely, and the one that keeps reaching production while CI stays green.

Why AI-Written Assertions Fail These Rules by Default

The structural reason AI-generated assertions are weak is not a model quality problem. It is an independence problem.

When the same model writes both the function and the test, it has a strong prior: the current implementation is correct. The test it writes will naturally assert the current behavior. If the implementation has a bug, that bug becomes the expected value. "The test passes but the bug ships" is not a coincidence. It is the predictable output of a system that is not independent of the thing it is verifying.

This is why AI-generated tests so often fail Rules 3 and 5. They assert the output the code currently produces, not the output the business logic requires. CI is green. Nothing fails in review. The bug reaches staging or production, where the behavior is compared against reality rather than against the code that generated it.
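A small sketch of how a bug becomes the expected value. The business rule (gold tier gets 20% off) and the buggy `goldPrice` are hypothetical, invented for this illustration. The first expectation is obtained by running the code, which is effectively what a generator anchored on the implementation does; the second is worked out from the rule. Node's built-in assert module stands in for a test runner.

```javascript
const { strictEqual } = require('node:assert');

// Assumed business rule for this illustration: gold tier gets 20% off.
// Hypothetical buggy implementation: 0.02 was typed instead of 0.2.
function goldPrice(priceCents) {
  return Math.round(priceCents * (1 - 0.02));
}

// Expected value copied from the current output: consistent with the code.
const expectedFromCode = goldPrice(10000); // 9800, which is the bug itself

// Expected value worked out from the rule: 20% off 10000 cents.
const expectedFromRule = 8000;

// Returns true when the assertion passes for the given expectation.
function passes(expected) {
  try {
    strictEqual(goldPrice(10000), expected);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(passes(expectedFromCode)); // true: green, and the bug ships
console.log(passes(expectedFromRule)); // false: red, the bug is caught
```

The two assertions have the same shape and the same specificity. The only difference is where the expected number came from.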

For a detailed look at the specific shapes bad assertions take, the article on useless unit tests and the tautological anti-pattern catalogs them. For the root concept of how the same model writing both code and tests creates the self-deception cycle, see AI-Generated Tests That Pass But Don't Assert Anything. The five rules above are the positive counterpart to both.

If your team ships with Cursor or Claude and the suite stays green while production keeps surprising you, do not stop at repairing assertions one by one. Add the layer that is independent by construction: Autonoma verifies your application's behavior from outside the code that wrote it, which is the one property no AI-generated assertion can give itself. Fix the unit assertions with the rules above, and let the behavioral E2E layer catch what they structurally cannot.

FAQ

What makes a good test assertion?

A good test assertion verifies observable behavior and must be capable of failing. It checks a specific real value (not just shape or truthiness), covers exactly one logical behavior, and is independent of the implementation internals. The core test: if you intentionally break the function, does the assertion go red? If it stays green, the assertion is not protecting you.

How many assertions should a test have?

As many as needed to verify one logical behavior, and no more. If your test covers a discount being applied, you may assert the line item price, total, and discount label. That is one behavior. The rule is not one assertion per test. It is one behavior per test. If your test description has the word 'and' in it, you likely have two tests.

Should tests assert on implementation details?

No. Implementation assertions (internal method calls, private state, specific code paths) break on valid refactors and pass on real bugs. Assert on observable outputs: return values, database state, rendered UI, emitted events. If the assertion would pass on a broken but structurally similar implementation, it is an implementation assertion. Rewrite it.

How do I know if my assertions are actually good?

Run the mutation test: intentionally break the function for a behavior the test covers, and check if the test goes red. If it stays green, the assertion is not testing what you think. For the full suite, a mutation testing tool (Stryker for JS/TS, PIT for Java) gives you a mutation score. High line coverage with a low mutation score is the signature of AI-generated test theater: green but not protecting anything.

Why are AI-generated assertions weak by default?

The model that writes the function has a strong prior that the current implementation is correct. The assertion it generates checks that the code behaves as written, not that the code is correct. When the implementation has a bug, the bug becomes the expected value. This is the tautological test failure mode. AI verification is only trustworthy when it is independent of the thing being verified. Green means consistency, not correctness.
