Tests Green But App Broken: A Diagnostic

Tests green but app broken is the defining symptom of an AI-generated test suite that has lost the ability to catch real regressions. When the same model writes both the code and the tests, any bug becomes the expected value, and the CI signal (always-passing tests, rising coverage, green pipelines on every PR) stops correlating with whether the application actually works.

At some point, the green check stopped meaning anything. Your pipeline runs clean. Deploys go out. Users file tickets. The disconnect is not a fluke, and it is not a testing volume problem. You have more tests than ever. The issue is structural: the tests you added are unable to fail on the bugs that reach production.

This post is a diagnostic. It walks through the specific mechanisms that produce a green-but-broken state in AI-heavy codebases, how to confirm you are in that state, and what restores signal without throwing away the existing suite. If your engineers are already asking "why do bugs get to production when CI is green?", the answer is in the failure modes covered below.

This is the version of "green but broken" that belongs to AI-heavy engineering teams specifically. We have written before about why E2E tests pass while the product is broken. That post covers the structural reasons E2E suites fail silently. This article is about a different problem: what happens when AI generates the unit and integration suite itself, and why that suite becomes structurally incapable of failing on the exact regressions your team ships.

This is for teams that have tests. Lots of them. Green ones. If your team has no tests at all, you are exposed in a straightforward way. This problem is harder to see: your dashboard looks fine. Your coverage is real. The protection is not.

Autonoma is the layer that makes this diagnosis actionable. It is not a unit-test runner. It adds a separate E2E signal, planned from the codebase and executed against a managed preview environment per PR, so CI includes a codebase-first check at the product-flow level.

What a suite that always passes actually contains

When you prompt an AI to write tests after it has already written the implementation, it works from what it has: the code in front of it. It does not reach for the product requirement or the original specification. It writes a test that exercises the function and asserts the current output. Whatever the function returns today becomes the expected value in the assertion.

That is a tautological test. It does not check whether the function is correct. It checks whether the function is consistent with itself. These are not the same thing.
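A minimal sketch of the difference, assuming a made-up requirement ("orders of $50.00 or more ship free, everything else pays $5.00") and hypothetical names throughout. Each check is written as a plain function that returns whether it passes, so no test framework is needed to follow it:

```typescript
// Hypothetical requirement: orders of $50.00 or more ship free,
// anything below pays a $5.00 fee. Amounts are in cents.
type ShippingFee = (orderTotalCents: number) => number;

// Implementation with a boundary bug: it uses > where the
// requirement says "or more".
const shippingFee: ShippingFee = (orderTotalCents) =>
  orderTotalCents > 5000 ? 0 : 500;

// Tautological check: the expected value (500) was copied from what the
// code returned when the test was written, so it encodes the bug.
const tautologicalTestPasses = (fee: ShippingFee): boolean =>
  fee(5000) === 500;

// Requirement-derived check: the expected value (0) comes from the
// requirement text, not from the implementation.
const requirementTestPasses = (fee: ShippingFee): boolean =>
  fee(5000) === 0;
```

Run against the buggy `shippingFee`, the first check passes and the second fails. Only the second one was ever capable of going red on this bug.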

The assertion deep dive lives in the Day 1 pillar of this cluster: AI-generated tests that pass but don't assert anything walks through the before/after with a pricing function and explains why mutation testing exposes the gap. This article focuses on what that pattern does to the CI signal at scale.

Beyond tautological assertions, an AI-generated suite tends to accumulate four structural problems: mocked-to-death tests that verify the mock was called rather than the behavior it stands in for; snapshot tests that get regenerated when the UI changes, collapsing into consistency checks rather than correctness checks; tests that mirror implementation branches rather than user-visible requirements; and tests written at the function level for behavior that only makes sense at the flow level.
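The first of those patterns, the mock-only test, can be sketched as follows. Everything here is hypothetical (the `Mailer` interface, `sendReceipt`, the hand-rolled recording mock), and the bug is planted on purpose: the receipt subtracts tax instead of adding it.

```typescript
// Hypothetical dependency: something that sends a receipt email.
interface Mailer {
  send(to: string, body: string): void;
}

// Implementation with a planted bug: tax is subtracted instead of added.
function sendReceipt(
  mailer: Mailer,
  to: string,
  subtotalCents: number,
  taxCents: number,
): void {
  const totalCents = subtotalCents - taxCents; // bug: should be +
  mailer.send(to, `Total charged: ${(totalCents / 100).toFixed(2)}`);
}

// Hand-rolled recording mock, standing in for a mocking library.
type Call = { to: string; body: string };
function recordingMailer(): Mailer & { calls: Call[] } {
  const calls: Call[] = [];
  return {
    calls,
    send: (to: string, body: string): void => {
      calls.push({ to, body });
    },
  };
}

// Mock-only check: verifies that the mock was called, nothing more.
function mockOnlyTestPasses(): boolean {
  const mailer = recordingMailer();
  sendReceipt(mailer, "user@example.com", 1000, 80);
  return mailer.calls.length === 1;
}

// Behavior check: verifies what the user would actually read.
function behaviorTestPasses(): boolean {
  const mailer = recordingMailer();
  sendReceipt(mailer, "user@example.com", 1000, 80);
  return mailer.calls[0]?.body === "Total charged: 10.80";
}
```

`mockOnlyTestPasses()` returns `true` even though the customer is told the wrong amount. `behaviorTestPasses()` returns `false`, because its expectation was written from what the receipt should say.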

None of these patterns cause a test to fail. They all cause a test to pass. That is the problem. An engineering leader at a security company we spoke with put it plainly: "That test passes and it asserts something, but it's not really asserting what it should be asserting." The tests are not broken in any visible way. They just cannot catch the class of bug your team is actually shipping. This is the CI-signal dimension of what we call AI test theater: a production pipeline that looks rigorous from the outside but has no independent verification layer inside.

That is why Autonoma belongs before the team starts debating whether to rewrite every unit test. The first missing piece is not more assertions written from the same code. It is a codebase-first E2E layer that runs against the product flow.

The feedback loop

Here is what happens over time. The team adds more Cursor flows. Each flow generates tests. Each PR adds tests. Test count goes up. Coverage goes up. Pass rate stays at or near 100%. All three metrics that engineering leaders track trend positive. Meanwhile, the bug rate in production does not move.

The loop is self-reinforcing in the worst possible way. A green CI pipeline removes the prompt to inspect the tests. If CI were red, someone would look. Because it is green, no one does. The tests accumulate. The coverage number grows. The CI dashboard becomes a measure of how many tautological assertions the team has written, not of how well the application is protected.

This is CI signal degradation. The output of the pipeline (green) has decoupled from the thing it is supposed to measure (regressions caught). The gap between the two grows silently, PR by PR, sprint by sprint.

The coverage angle of this loop is a separate and equally important failure mode. The coverage-metric myth (how high coverage numbers hide the absence of meaningful regression detection) is the focus of the Day 4 post in this cluster. What this article focuses on is the signal layer: why the green checkmark itself stops being meaningful.

[Figure: closed-loop diagram. AI writes code, AI writes tests, tests pass, CI goes green, no one inspects the tests, more AI-generated tests are added, and the cycle repeats while bugs reach production unchanged.] *The feedback loop: greener CI removes the incentive to inspect the tests. The loop tightens with every AI-generated PR.*

"Our QA engineers are still finding things." That is the sentence you do not want your engineering leader saying after six months of AI test generation.

Restoring signal

The first step is an audit, not a rewrite. Take your most critical user path (checkout, authentication, core data flow, whatever a user would call your primary feature) and introduce a deliberate bug. Change a condition, flip a return value, corrupt a calculation. Run the suite. If it stays green, the tests are not protecting that path. That single experiment is worth running before you do anything else, because it tells you the actual state of your protection rather than the reported state.
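The audit can be sketched in a few lines. This is an illustration under stated assumptions, not a tool: `canCheckout`, the sabotaged variant, and both suites are hypothetical, and a "suite" is modeled as a list of named checks that can be run against whichever implementation you pass in.

```typescript
// Hypothetical critical-path function: may this user check out?
type CanCheckout = (cartItems: number, isLoggedIn: boolean) => boolean;

const canCheckout: CanCheckout = (cartItems, isLoggedIn) =>
  cartItems > 0 && isLoggedIn;

// Deliberate bug for the audit: the login condition is dropped.
const sabotaged: CanCheckout = (cartItems) => cartItems > 0;

// A suite is a list of named checks run against an implementation.
type Check = { name: string; passes: (impl: CanCheckout) => boolean };

const existingSuite: Check[] = [
  { name: "returns a boolean", passes: (impl) => typeof impl(1, true) === "boolean" },
  { name: "logged-in user with items can check out", passes: (impl) => impl(2, true) === true },
];

// The same suite plus one check that pins down the login requirement.
const strengthenedSuite: Check[] = [
  ...existingSuite,
  { name: "anonymous user cannot check out", passes: (impl) => impl(2, false) === false },
];

// Names of the checks that go red for a given implementation.
function failingChecks(suite: Check[], impl: CanCheckout): string[] {
  return suite.filter((check) => !check.passes(impl)).map((check) => check.name);
}

// An empty redOnSabotaged list means the suite stayed green on a broken path.
const auditResult = {
  redOnOriginal: failingChecks(existingSuite, canCheckout),
  redOnSabotaged: failingChecks(existingSuite, sabotaged),
};
```

Here `auditResult.redOnSabotaged` is empty: the existing suite stays green while anonymous users can check out. Running `failingChecks(strengthenedSuite, sabotaged)` turns exactly one check red, which is the signal the audit is looking for.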

Once you know where the gaps are, the path to restoring signal has two components.

Mutation testing is the fastest way to expose tautological assertions across the existing suite. Tools like Stryker (for JavaScript and TypeScript) introduce small, deliberate changes into your codebase and check whether any test goes red. A test whose expectation moves with the implementation will not catch mutations: when the expected value is recomputed through the code under test, or the assertion is reduced to "the mock was called" or "the result is defined", the mutation changes the behavior, nothing in the test pins that behavior down, and the test still passes. A mutation score below 50% on functions with 80% line coverage is a strong indicator that the assertions are not protecting the logic. The existing post on regression testing for AI-generated code covers a similar setup in detail.
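To make the mutation score concrete, here is a hand-rolled sketch. It does not use Stryker or any real tool: the `discount` function, the four hand-written mutants, and both suites are hypothetical stand-ins for what a mutation tool would generate and run automatically.

```typescript
// Hypothetical rule: members get 10% off orders of $100.00 or more.
type Discount = (totalCents: number, isMember: boolean) => number;

const discount: Discount = (t, m) =>
  m && t >= 10000 ? Math.round(t * 0.1) : 0;

// Hand-written mutants, each a small deliberate change to the original.
const mutants: Discount[] = [
  (t, m) => (m || t >= 10000 ? Math.round(t * 0.1) : 0), // && became ||
  (t, m) => (m && t > 10000 ? Math.round(t * 0.1) : 0),  // >= became >
  (t, m) => (m && t >= 10000 ? Math.round(t / 0.1) : 0), // * became /
  (t, m) => (m && t >= 10000 ? 0 : Math.round(t * 0.1)), // branches swapped
];

type Test = (impl: Discount) => boolean;

// Weak suite: executes both branches (full line coverage) but the
// assertions would hold for almost any implementation.
const weakSuite: Test[] = [
  (impl) => typeof impl(20000, true) === "number",
  (impl) => impl(500, false) >= 0,
];

// Stronger suite: expected values come from the rule, including the boundary.
const strongSuite: Test[] = [
  (impl) => impl(20000, true) === 2000,
  (impl) => impl(10000, true) === 1000,
  (impl) => impl(20000, false) === 0,
  (impl) => impl(500, true) === 0,
];

// A mutant is "killed" when at least one test fails against it.
function mutationScore(suite: Test[], candidates: Discount[]): number {
  const killed = candidates.filter((mutant) =>
    suite.some((test) => !test(mutant)),
  ).length;
  return killed / candidates.length;
}
```

Both suites pass against the original `discount`. `mutationScore(weakSuite, mutants)` is `0` despite full line coverage, while `mutationScore(strongSuite, mutants)` is `1`. The gap between coverage and mutation score is the thing to look for in a real report.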

The second component is independent E2E verification. This is the layer that unit tests structurally cannot replace. Unit tests verify functions. E2E tests verify that the application does what a user expects. These are different questions, and an AI-generated unit suite (however large, however green) does not substitute for the second question. We covered the structural reasons for this earlier in the context of why E2E tests pass while the product breaks. The short version: the independence of the verification layer is what makes it capable of catching the bug.
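A small in-memory sketch of why the two layers answer different questions. A real E2E check would drive the running application. This stand-in only shows the shape of the gap, and every name in it is hypothetical: two functions that are each correct on their own, wired together with a unit mismatch.

```typescript
// Returns a price in cents.
const priceInCents = (sku: string): number => (sku === "book" ? 1250 : 0);

// Expects an amount in dollars.
const formatTotal = (dollars: number): string => `$${dollars.toFixed(2)}`;

// Flow wiring bug: cents are passed where dollars are expected.
const checkoutSummary = (sku: string, quantity: number): string =>
  formatTotal(priceInCents(sku) * quantity);

// Function-level checks: each unit is correct in isolation.
const unitChecksPass =
  priceInCents("book") === 1250 && formatTotal(12.5) === "$12.50";

// Flow-level check: the expectation comes from what a user should see
// (two books at $12.50 each), not from either function's output.
const flowCheckPasses = checkoutSummary("book", 2) === "$25.00";
```

`unitChecksPass` is `true` and `flowCheckPasses` is `false`: the summary reads `$2500.00`. No amount of additional function-level tests on `priceInCents` or `formatTotal` would surface that, because the defect lives in the wiring between them.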

[Figure: diagram of independent E2E verification checking user flows against the running app.] *Independent E2E verification restores CI signal by forming expectations outside the implementation and checking the running application before merge.*

That boundary is where Autonoma fits: codebase-first E2E verification against a managed preview environment per PR, with tests planned from routes, components, and user flows rather than from the function's current output.

How Autonoma makes CI mean something again

The dead CI signal is a structural problem, not a discipline problem. The team did not write bad tests on purpose. They used the tools available to them and got the outcome those tools produce by default: tests that are consistent with the implementation, not independent of it. The green checkmark reflects that consistency. It does not reflect correctness.

This is the problem Autonoma was built to address. Our three agents approach testing from a fundamentally different angle. The Planner reads your codebase (routes, components, API contracts, user flows) and plans test cases from the structure of the application, not from the implementation of any individual function. The expected behavior comes from what the route is supposed to do for a real user, not from what the code currently returns. The Planner also handles database state setup, generating the endpoints needed to put the application in the right state for each scenario. The Automator executes those tests against a live preview environment per PR, asserting on observable behavior: what the UI shows, what the API returns, what the state becomes. The Maintainer keeps those tests current as the codebase evolves, surfacing real divergence rather than maintenance noise.

Because our tests are derived from codebase structure and executed against a running application independently of whatever code-writing model generated the implementation, they can fail when the app actually breaks. A bug that a tautological unit test would accept as the expected value is exactly the kind of divergence our Automator surfaces. That is what restores the CI signal: not more tests, but tests that are capable of failing when something is wrong. Autonoma is open-source and self-hostable, which means you can add this independent layer without changing your deployment model.

The complement framing matters here. We are not replacing the unit tests your team has accumulated. We are adding the independent behavioral layer that those tests structurally cannot provide. The two layers answer different questions. Your AI-generated unit suite answers: "Is the code consistent with itself?" The Autonoma layer answers: "Does the application do what a user would expect?" CI becomes meaningful again when both questions have independent answers.

FAQ

Why is my CI green but the app broken?

CI goes green when tests pass, not when the app behaves correctly. When an AI generates both the implementation and the tests, any bug in the implementation becomes the expected value in the test assertion. The test is consistent with the code, so it passes. The app is broken, but CI has no way to know: the tests were never capable of catching that class of failure in the first place.

Why do my tests always pass?

Tests always pass when they assert on the implementation's own output rather than on an independently derived expected behavior. AI-generated tests are particularly prone to this: the model writes a test that calls the function and asserts the current output, so whatever the code returns becomes the expected value. A suite of always-passing tests is not a sign of quality. It is a sign that the assertions are tautological.

Why do bugs reach production despite passing tests?

Passing tests only block bugs the tests are capable of detecting. When AI generates tests from the implementation rather than from the requirement, the tests ratify the code's current behavior, including any bugs in that behavior. The tests pass because they agree with the code. The bugs reach production because the tests were never checking whether the code agreed with the requirement.

Do AI-generated tests make CI less reliable?

Yes, in a specific and measurable way. AI-generated tests tend to inflate pass rates without improving regression detection. As more AI-generated tests accumulate, CI goes greener while the underlying bug rate stays flat. The dashboard looks healthier. The protection does not improve. This is CI signal degradation: the pipeline's output (green) stops correlating with the thing you care about (bugs not reaching users).

How do I make CI catch real bugs?

Start by auditing your existing suite: introduce a deliberate bug in a critical path and check whether any test goes red. If the suite stays green, your tests are not protecting that path. Then layer in independent verification: mutation testing to expose tautological assertions, and behavioral E2E tests derived from user flows rather than from implementation details. Independent verification means the tests were derived from a source of truth that is separate from the code being tested.
