A good test assertion verifies observable behavior, not implementation. It must be capable of failing: break the function it covers and it must go red. It checks a specific real value, stays scoped to one logical behavior, and uses real outputs rather than mock round-trips. That is what assertion quality means. The test most teams skip: would this assertion catch a real bug?
Most teams shipping with Cursor, Claude, or Copilot have no shortage of tests. They have hundreds of green ones. The problem is that "green" stopped meaning much, because the tests were written by the same system that wrote the code. "It asserts something, but it's not really asserting what it should be asserting." That is word-for-word what an engineer at a Series B startup told us after auditing their AI-generated suite.
This article is not for teams that have no tests. It is for AI-forward engineering teams, the ones shipping with vibe-coding workflows and heavy Cursor and Copilot usage, who have plenty of tests and still find bugs in production. The assertion rules below answer one question: how do I know if my tests are actually good? (If you are earlier in the journey and want context on how generative AI interacts with QA workflows more broadly, generative AI testing and QA covers the landscape.)
5 Rules for Assertions That Protect
Rule 1: Assert on observable behavior, not implementation
Observable behavior means what the function returns, what the UI shows, what the database contains after the operation. Implementation means that a specific internal method was called, a private variable was set, or a particular code path executed.
Implementation assertions break on valid refactors and pass on real bugs. You can rename the internal method and the test fails, even though behavior is unchanged. You can introduce a bug in the return value and the test passes, because the right internal calls still happened. The rule: if your assertion would pass on a subtly wrong implementation, it is an implementation assertion. Drop it.
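The difference can be sketched with a small example. Everything here is hypothetical (the cartTotal function, the sumItems helper, and the call log that stands in for a spy), and plain boolean checks are used in place of a test runner so the snippet runs on its own.

```typescript
type LineItem = { price: number; quantity: number };

// Hypothetical call log standing in for a spy on an internal helper.
const internalCalls: string[] = [];

// Internal helper: an implementation detail of cartTotal.
function sumItems(items: LineItem[]): number {
  internalCalls.push("sumItems");
  return items.reduce((sum, item) => sum + item.price * item.quantity, 0);
}

// Function under test: the observable behavior is the number it returns.
function cartTotal(items: LineItem[]): number {
  return sumItems(items);
}

const cartResult = cartTotal([
  { price: 10, quantity: 2 },
  { price: 5, quantity: 1 },
]);

// Implementation assertion: still true if sumItems computes the wrong number,
// and broken by a behavior-preserving rename of sumItems.
const implementationCheck = internalCalls.includes("sumItems");

// Behavior assertion: pins the observable result for this specific input.
const behaviorCheck = cartResult === 25;
```

If sumItems is changed to add instead of multiply, implementationCheck stays true while behaviorCheck turns false, which is the asymmetry the rule describes.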
Rule 2: One logical behavior per test
In most test runners a failed assertion throws, so if the first assertion in a test block fails, the rest do not run. You lose information about what else broke. More importantly, a test that covers three behaviors tells you "something broke," not "this behavior broke."
One logical behavior does not mean one line of code. Testing that a discount is applied correctly may need multiple assertions (line item price changed, total changed, discount label appeared). Those are all one behavior: the discount was applied. What is not acceptable is asserting discount behavior and pagination behavior in the same test. The practical heuristic: if your test description has "and" in it, you probably have two tests.
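A minimal sketch of that split, under stated assumptions: applyDiscountToOrder, pageOf, and the expectSame helper are all invented for illustration, prices are in cents, and the tests are plain functions rather than runner-registered cases so the snippet is self-contained.

```typescript
type DiscountedOrder = { lineItemCents: number; totalCents: number; discountLabel: string | null };

// Hypothetical order logic: applies a percentage discount to a single line item.
function applyDiscountToOrder(priceCents: number, quantity: number, percent: number): DiscountedOrder {
  const lineItemCents = Math.round((priceCents * (100 - percent)) / 100);
  return {
    lineItemCents,
    totalCents: lineItemCents * quantity,
    discountLabel: percent > 0 ? `${percent}% off` : null,
  };
}

// Hypothetical pagination helper, unrelated to discounts (pages are 1-based).
function pageOf<T>(items: T[], page: number, pageSize: number): T[] {
  const start = (page - 1) * pageSize;
  return items.slice(start, start + pageSize);
}

// Minimal stand-in for a test runner's equality assertion.
function expectSame(actual: unknown, expected: unknown, label: string): void {
  if (JSON.stringify(actual) !== JSON.stringify(expected)) {
    throw new Error(`${label}: expected ${JSON.stringify(expected)}, got ${JSON.stringify(actual)}`);
  }
}

// One behavior ("the discount was applied"), three assertions.
function testDiscountIsApplied(): void {
  const order = applyDiscountToOrder(5000, 2, 20);
  expectSame(order.lineItemCents, 4000, "line item price");
  expectSame(order.totalCents, 8000, "total");
  expectSame(order.discountLabel, "20% off", "discount label");
}

// A different behavior gets its own test, so a failure names the behavior.
function testSecondPageHoldsNextItems(): void {
  expectSame(pageOf(["a", "b", "c", "d", "e"], 2, 2), ["c", "d"], "page 2");
}

testDiscountIsApplied();
testSecondPageHoldsNextItems();
```

Neither test name needs an "and": each one fails for exactly one reason a reader can name.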
Rule 3: The assertion must fail if you intentionally break the function
This is the most important rule, and the easiest to validate. Pick one behavior your test is supposed to cover. Modify the function to return the wrong value for that case. Does the test go red?
If the test still passes after you introduced a deliberate bug, it is not protecting you. It is asserting something orthogonal to the behavior it claims to cover. AI verification is only trustworthy when it is independent of the thing being verified. Green means consistency, not correctness. AI-generated tests fail this check far more often than handwritten ones, because they are optimized to pass on the current code, not to catch a broken version.
Rule 4: Prefer real values over mock round-trips
A test that mocks a dependency and then asserts against the mock's return value is asserting that the mock works. It always passes because you configured it. The rule is not "never mock." Mocking for speed and isolation is legitimate. The rule is: the assertion must check the output of the code under test, not the input you fed it via the mock.
If the only thing your assertion checks is that a mock was called with the right argument, you have not tested behavior. You have tested that a function was invoked. Assert against real values: a specific string, a specific number, a specific structure. Yes, that makes the test brittle, and the brittleness is intentional: the test will break if the business logic changes. That is the point.
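The three kinds of check can be put side by side in a short sketch. The priceWithTax function, the RateProvider shape, the region code, and the 25% rate are assumptions made up for this example; the stub is hand-rolled instead of using a mocking library so nothing beyond TypeScript is required.

```typescript
type RateProvider = { getTaxRate: (region: string) => number };

// Hypothetical code under test: adds tax to a net price in cents.
function priceWithTax(netCents: number, region: string, rates: RateProvider): number {
  return Math.round(netCents * (1 + rates.getTaxRate(region)));
}

// Hand-rolled stub: records the regions it was asked about and returns a fixed rate.
const requestedRegions: string[] = [];
const stubRates: RateProvider = {
  getTaxRate: (region) => {
    requestedRegions.push(region);
    return 0.25;
  },
};

const grossCents = priceWithTax(1000, "SE", stubRates);

// Mock round-trip: re-reads the value the stub was configured to return.
// It cannot fail, whatever priceWithTax does.
const roundTripCheck = stubRates.getTaxRate("SE") === 0.25;

// Call-only check: proves the dependency was invoked, not that the result is right.
const callOnlyCheck = requestedRegions.includes("SE");

// Output check: asserts the value the code under test actually produced.
const outputCheck = grossCents === 1250;
```

The stub is still doing its job of isolating the dependency. Only outputCheck would turn false if priceWithTax subtracted the tax instead of adding it.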
Rule 5: Assert specific values, not just shape or truthiness
expect(result).toBeTruthy() passes for any truthy value, including an empty object or an empty array. expect(result).toHaveProperty("discountAmount") tells you the property exists, not whether the value is correct. A function that always returns { discountAmount: 0 } passes both assertions even when a 20% discount should have been applied.
Assert the actual value. Not "is there a discount." The specific number the business rule requires, for the specific input you gave it. This is the hardest rule for AI-generated tests because writing the expected value requires knowing the business rule. The AI knows the code. It does not know whether the rule requires 20% or 15% or a floor of $5. The complaint "it doesn't cover the business case" is a direct consequence of this structural gap.
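The gap between shape and value shows up clearly when the same checks run against a broken and a correct function. This is a sketch only: both functions are hypothetical, the 20% rule is assumed for the example, and the three helper predicates mimic what toBeTruthy, toHaveProperty, and an exact-value comparison would each accept.

```typescript
type DiscountResult = { discountAmount: number };

// Deliberately broken: never applies a discount.
function brokenDiscount(_subtotalCents: number): DiscountResult {
  return { discountAmount: 0 };
}

// Assumed business rule for this sketch: 20% off the subtotal.
function twentyPercentDiscount(subtotalCents: number): DiscountResult {
  return { discountAmount: Math.round(subtotalCents * 0.2) };
}

// Stand-ins for the three styles of assertion discussed above.
const isTruthy = (value: unknown): boolean => Boolean(value);
const hasDiscountAmount = (value: object): boolean => "discountAmount" in value;
const hasAmount = (value: DiscountResult, expectedCents: number): boolean =>
  value.discountAmount === expectedCents;

const broken = brokenDiscount(10000);
const correct = twentyPercentDiscount(10000);

// Truthiness and shape cannot tell the two apart.
const truthyOnBroken = isTruthy(broken); // true
const shapeOnBroken = hasDiscountAmount(broken); // true

// Only the specific value separates them.
const valueOnBroken = hasAmount(broken, 2000); // false
const valueOnCorrect = hasAmount(correct, 2000); // true
```

The expected 2000 comes from the business rule and the input, not from running the code, which is why it can disagree with a wrong implementation.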
The Mutation Test for an Assertion
The five rules share a single diagnostic: does this assertion fail when the code is mutated? In mutation-testing terms, can it kill the mutant? This is the basis of independent verification: not "does this code run?" but "would a different output be caught?"
A mutation is a small, deliberate change to the function under test. A return value flipped. A condition negated. An off-by-one. If you introduce a mutation and the test still passes, the test is not testing the behavior the mutation broke. The assertion is decorative.
Running a formal mutation framework (Stryker for JavaScript/TypeScript, PIT for Java) gives you a mutation score across the whole suite. Practitioner reports suggest AI-generated suites with high line coverage can score very low here, because the assertions check the implementation rather than the behavior. The assertion coverage vs line coverage article covers the metric mechanics.
You do not need a framework. For any assertion you are unsure about: change the return value in the function. Run the test. If it stays green, the assertion was giving you false confidence, not protection.
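The manual check can be sketched in a few lines. The shipping rule (free at 5000 cents or more, otherwise 499 cents), both function names, and the two tests are assumptions for illustration; the mutant changes only the boundary operator.

```typescript
type ShippingFn = (subtotalCents: number) => number;

// Assumed rule for this sketch: orders of 5000 cents or more ship free, otherwise 499 cents.
const shippingOriginal: ShippingFn = (subtotal) => (subtotal >= 5000 ? 0 : 499);

// Mutant: the boundary operator changed from >= to >.
const shippingMutant: ShippingFn = (subtotal) => (subtotal > 5000 ? 0 : 499);

// Each "test" returns true when it passes.
const weakTest = (fn: ShippingFn): boolean => typeof fn(5000) === "number";
const strongTest = (fn: ShippingFn): boolean => fn(5000) === 0;

// A mutant is killed when the test passes on the original and fails on the mutant.
function mutantIsKilled(test: (fn: ShippingFn) => boolean): boolean {
  return test(shippingOriginal) && !test(shippingMutant);
}

const weakKills = mutantIsKilled(weakTest); // false: the mutant survives
const strongKills = mutantIsKilled(strongTest); // true: the mutant is killed
```

Both tests are green on the original code, so a CI run cannot tell them apart. Only running them against the mutant does.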
Checklist: Is This Assertion Worth Keeping?
Before committing a test, run through this:
- Would this assertion fail if the function returned the wrong value?
- Is the expected value a real business outcome, not just shape or truthiness?
- Does this test cover exactly one logical behavior?
- Does the assertion check what comes out of the code, not what went into a mock?
- If you comment out the line that implements the behavior, does the test go red?
If any answer is no, the assertion needs work. Teams we have worked with that ran this checklist against their AI-generated suites found that a meaningful fraction of their tests failed at least one check. Not because the engineers were careless. Because the AI generating those tests was reasoning about what would pass, not about what would fail.
How Autonoma Writes Assertions from Real Behavior
The pattern this article documents is a specific kind of fragility: tests that assert consistency with the code rather than correctness against the business requirement. Every rule above is an attempt to make the assertion independent of the implementation. Assert the output, not the mechanism. Assert the value, not the shape.
Our team built Autonoma to be that independent layer. Our three agents work from your codebase and the running application, not from PR-diff review alone. The Planner reads your routes, components, and user flows to plan test cases against the running application, including the DB state each scenario needs. The Automator executes those cases against a real preview environment per PR. The Maintainer keeps the tests passing as code changes. The assertions in those E2E tests are derived from what the application actually does when driven through real user flows, which is exactly the independence the five rules above are trying to recover one assertion at a time.
The platform does not write your unit-test assertions (those need the business-rule knowledge of the engineer who owns the domain), but for the behavioral class of bug it is the layer we recommend AI-forward teams add first. AI code reviewers like CodeRabbit and Bugbot catch the syntactic class. Unit-test assertions, written to the rules above, catch the function-level class. Autonoma catches the behavioral class at the integration boundary: the wrong total, the broken flow, the page that renders but lies. That is the class most AI-generated suites are missing entirely, and the one that keeps reaching production while CI stays green.
Why AI-Written Assertions Fail These Rules by Default
The structural reason AI-generated assertions are weak is not a model quality problem. It is an independence problem.
When the same model writes both the function and the test, it has a strong prior: the current implementation is correct. The test it writes will naturally assert the current behavior. If the implementation has a bug, that bug becomes the expected value. "The test passes but the bug ships" is not a coincidence. It is the predictable output of a system that is not independent of the thing it is verifying.
This is why AI-generated tests so often fail Rules 3 and 5. They assert the output the code currently produces, not the output the business logic requires. CI is green. Nothing fails in review. The bug reaches staging or production, where the behavior is compared against reality rather than against the code that generated it.
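A small sketch of how the bug becomes the expected value. The loyaltyDiscountCents function and the 20% rule are hypothetical; the point is only where each expected value comes from.

```typescript
// Hypothetical implementation with a bug: the assumed rule is 20%, the code applies 15%.
function loyaltyDiscountCents(subtotalCents: number): number {
  return Math.round(subtotalCents * 0.15);
}

// Expected value copied from what the code currently returns.
const expectedFromCode = loyaltyDiscountCents(10000);

// Expected value written down from the (assumed) business rule: 20% of 10000 cents.
const expectedFromRule = 2000;

// Derived test: compares the code with itself, so it passes with the bug in place.
const derivedTestPasses = loyaltyDiscountCents(10000) === expectedFromCode; // true

// Independent test: compares the code with the rule, so it catches the bug.
const independentTestPasses = loyaltyDiscountCents(10000) === expectedFromRule; // false
```

The derived test asserts a specific value and would still satisfy a reviewer skimming for toBeTruthy-style weaknesses. Its flaw is the source of the number, not the form of the assertion.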
For a detailed look at the specific shapes bad assertions take, the article on useless unit tests and the tautological anti-pattern catalogs them. For the root concept of how the same model writing both code and tests creates the self-deception cycle, see AI-Generated Tests That Pass But Don't Assert Anything. The five rules above are the positive counterpart to both.
If your team ships with Cursor or Claude and the suite stays green while production keeps surprising you, do not stop at repairing assertions one by one. Add the layer that is independent by construction: Autonoma verifies your application's behavior from outside the code that wrote it, which is the one property no AI-generated assertion can give itself. Fix the unit assertions with the rules above, and let the behavioral E2E layer catch what they structurally cannot.
FAQ
What makes a good test assertion?
A good test assertion verifies observable behavior and must be capable of failing. It checks a specific real value (not just shape or truthiness), covers exactly one logical behavior, and is independent of the implementation internals. The core test: if you intentionally break the function, does the assertion go red? If it stays green, the assertion is not protecting you.
How many assertions should a test have?
As many as needed to verify one logical behavior, and no more. If your test covers a discount being applied, you may assert the line item price, total, and discount label. That is one behavior. The rule is not one assertion per test. It is one behavior per test. If your test description has the word 'and' in it, you likely have two tests.
Should I assert on implementation details?
No. Implementation assertions (internal method calls, private state, specific code paths) break on valid refactors and pass on real bugs. Assert on observable outputs: return values, database state, rendered UI, emitted events. If the assertion would pass on a broken but structurally similar implementation, it is an implementation assertion. Rewrite it.
How do I know if my assertions would catch a real bug?
Run the mutation test: intentionally break the function for a behavior the test covers, and check if the test goes red. If it stays green, the assertion is not testing what you think. For the full suite, a mutation testing tool (Stryker for JS/TS, PIT for Java) gives you a mutation score. High line coverage with a low mutation score is the signature of AI-generated test theater: green but not protecting anything.
Why do AI-generated assertions fail these rules?
The model that writes the function has a strong prior that the current implementation is correct. The assertion it generates checks that the code behaves as written, not that the code is correct. When the implementation has a bug, the bug becomes the expected value. This is the tautological test failure mode. AI verification is only trustworthy when it is independent of the thing being verified. Green means consistency, not correctness.