
Why AI Misses Business Logic Bugs

Tom Piaggio, Co-Founder at Autonoma

Business-logic bugs are structurally invisible to AI code generation and AI PR review because neither has an independent source of truth for correct behavior. They only see the code as written. The requirement ("annual plans get 20% off, monthly plans get none") lives in a PRD, a Slack thread, a founder's head. What catches them: tests grounded in real user flows and expected business outcomes, executed against the running application.

An identity-security engineering team we work with runs AI PR review on every single pull request. Cursor on every feature. Green CI before anything merges. They told us plainly: "It doesn't cover the business case. Our QA engineers are still finding things."

That is not a tooling failure. It is a structural one. The same pattern shows up across the AI-forward teams we speak with at Series A and Series B: heavy Cursor or Claude Code usage, high PR volume, green tests on every merge. Bugs still reaching staging. QA engineers still finding what the test suite already signed off on.

This article is for those teams specifically. Not the pre-seed startup with zero test coverage (different problem). For teams that have the AI stack, have the green tests, and are still shipping logic errors that users find first.

The root cause is one sentence: AI verification is only trustworthy when it is independent of the thing being verified. Green CI measures consistency with the code. It does not measure correctness against what the system is supposed to do.

Autonoma is that independent layer. It is not a unit-test runner: it runs E2E checks against the preview application, so product behavior is checked in the running app, not inside the diff.

What a business-logic bug actually is

A business-logic bug is a defect where the code runs without errors and the tests pass, but the outcome is wrong relative to what the system is supposed to do.

Three examples that come up repeatedly. First: pricing-tier boundaries. An annual plan is supposed to get 20% off; monthly plans get no discount. The code has a conditional that checks plan type, applies a discount multiplier, and returns the final price. The logic is syntactically correct. The tier comparison uses the wrong constant, pulled from another branch during a refactor. Monthly plans get 20% off. The code compiles. The test passes. The billing is wrong.
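A minimal sketch of what that refactor slip could look like. The names here (`ANNUAL`, `MONTHLY`, `DISCOUNTED_PLAN`, `final_price`) are hypothetical and not taken from any real codebase; the point is only that nothing in the code signals which constant the business intended.

```python
from decimal import Decimal

ANNUAL = "annual"
MONTHLY = "monthly"

# Business rule (lives outside the code): annual plans get 20% off,
# monthly plans get no discount.
DISCOUNT_MULTIPLIER = Decimal("0.80")

# Hypothetical refactor slip: this should be ANNUAL.
DISCOUNTED_PLAN = MONTHLY


def final_price(plan_type: str, base_price: Decimal) -> Decimal:
    """Return the price charged for a plan (buggy on purpose)."""
    if plan_type == DISCOUNTED_PLAN:
        return base_price * DISCOUNT_MULTIPLIER
    return base_price


print(final_price(MONTHLY, Decimal("100")))  # 80.00 (should be 100)
print(final_price(ANNUAL, Decimal("100")))   # 100 (should be 80)
```

The function type-checks, handles both plan types, and raises nothing. Only someone who already knows the rule can tell that the two outputs are swapped.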

Second: permission checks. A multi-tenant application is supposed to enforce that users in role B can only see their own organization's data. The access control check looks correct in the diff. A query parameter that was supposed to be scoped to the session is instead pulled from the request body, where a crafted request can override it. Role B users can see tenant A's records. No exception is thrown. Every test passes.
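A sketch of that scoping mistake, using an in-memory list in place of a database. `Record`, `list_records_buggy`, `list_records_scoped`, and the `session` and `body` dictionaries are all illustrative assumptions, not the API of any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    org_id: str
    name: str


# Hypothetical in-memory data standing in for a multi-tenant table.
RECORDS = [
    Record("org_a", "Tenant A invoice"),
    Record("org_b", "Tenant B invoice"),
]


def list_records_buggy(session: dict, body: dict) -> list[Record]:
    # Bug: the org scope is read from the request body first, so a crafted
    # request can override the value that should come from the session.
    org_id = body.get("org_id", session["org_id"])
    return [r for r in RECORDS if r.org_id == org_id]


def list_records_scoped(session: dict, body: dict) -> list[Record]:
    # Intended behavior: the scope comes only from the authenticated session.
    return [r for r in RECORDS if r.org_id == session["org_id"]]


role_b_session = {"user": "bob", "role": "B", "org_id": "org_b"}
crafted_body = {"org_id": "org_a"}

leaked = list_records_buggy(role_b_session, crafted_body)
safe = list_records_scoped(role_b_session, crafted_body)

print([r.name for r in leaked])  # ['Tenant A invoice']
print([r.name for r in safe])    # ['Tenant B invoice']
```

With an ordinary request (an empty body), both functions return the same records, which is why a happy-path test passes against the buggy version.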

Third: state transitions. An order is supposed to move to "shipped" only after payment settles. A race condition in the async flow allows the state transition to fire before the payment confirmation returns. Orders get marked shipped before money clears. The function works. The sequence is wrong.
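The sequencing problem can be sketched with `asyncio`. This is a simplified, deterministic stand-in for the race described above: the payment call is scheduled but not awaited before the transition fires. The function names and the shape of the `order` dictionary are assumptions made for illustration.

```python
import asyncio


async def settle_payment(order: dict) -> None:
    # Stand-in for a payment provider call that takes a moment to confirm.
    await asyncio.sleep(0.05)
    order["payment"] = "settled"
    order["log"].append("payment settled")


async def mark_shipped(order: dict) -> None:
    order["status"] = "shipped"
    order["log"].append(f"shipped while payment={order['payment']}")


async def fulfil_buggy() -> dict:
    order = {"status": "pending", "payment": "pending", "log": []}
    # Bug: payment is scheduled but not awaited before the transition fires.
    payment_task = asyncio.create_task(settle_payment(order))
    await mark_shipped(order)
    await payment_task
    return order


async def fulfil_ordered() -> dict:
    order = {"status": "pending", "payment": "pending", "log": []}
    await settle_payment(order)  # wait for the money to clear first
    await mark_shipped(order)
    return order


buggy = asyncio.run(fulfil_buggy())
ordered = asyncio.run(fulfil_ordered())

print(buggy["log"])    # ['shipped while payment=pending', 'payment settled']
print(ordered["log"])  # ['payment settled', 'shipped while payment=settled']
```

Both runs end with `status == "shipped"` and `payment == "settled"`, so a test that only inspects the final state cannot tell them apart. The defect is visible only in the order of events.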

In all three cases: the code is internally consistent. The behavior is wrong. No exception, no type error, no lint warning. The diff looks fine.

This is where Autonoma enters the workflow. The Planner reads the codebase to understand routes, components, and user flows, then the Automator checks the running application for the outcome the flow implies. A monthly plan getting the annual discount is not a style issue for a reviewer to flag. It is a behavioral failure in the product.

Why the diff is not enough

An AI reviewer reads the code as written. That is the artifact it has access to.

The requirement behind the pricing bug ("annual plans get 20% off, monthly get none") lives somewhere else entirely: a product spec, a Slack thread between the founder and the growth lead, a comment in a Notion doc that was last edited eight months ago. It does not appear in the pull request. It was never in the codebase. It is the intent behind the feature, and intent is not code.

Figure: AI review verifies the diff, not the business requirement. The review confirms that the diff is internally consistent and CI passes, while the requirement that defines correct behavior lives outside the diff and is never checked.

When an AI reviews that PR, it sees a conditional that applies a discount multiplier based on plan type. It checks whether the logic is syntactically sound, whether the types align, whether there are edge cases like null plan types. It does not check whether the constant being compared matches the business rule, because the business rule is not in context. The code is internally consistent, so the review comes back clean.

One engineering team we work with described this directly: "That test passes and it asserts something, but it's not really asserting what it should be asserting." They had a full AI review setup and a healthy coverage number. The tests were asserting consistency with the implementation, not correctness against the requirement. The difference is invisible in CI.
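One shape this can take, sketched under assumed names (`final_price`, `DISCOUNTED_PLAN`, `test_discount_is_applied`): the test computes its expected value from the same constants the implementation uses, so it asserts something, but only that the code agrees with itself.

```python
from decimal import Decimal

DISCOUNT_MULTIPLIER = Decimal("0.80")
DISCOUNTED_PLAN = "monthly"  # hypothetical bug: the rule says "annual"


def final_price(plan_type: str, base_price: Decimal) -> Decimal:
    if plan_type == DISCOUNTED_PLAN:
        return base_price * DISCOUNT_MULTIPLIER
    return base_price


def test_discount_is_applied() -> bool:
    base = Decimal("100")
    for plan in ("annual", "monthly"):
        # The expectation re-derives the answer from the implementation's own
        # constants, so it can only confirm that the code agrees with itself.
        expected = base * DISCOUNT_MULTIPLIER if plan == DISCOUNTED_PLAN else base
        assert final_price(plan, base) == expected
    return True


print(test_discount_is_applied())  # True, although monthly plans are discounted
```

Both branches are exercised, so the coverage number is healthy. Changing `DISCOUNTED_PLAN` to any value at all leaves this test green.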

This is the structural gap. A diff reviewer, human or AI, can only verify what the diff says relative to what the surrounding code says. It cannot verify whether what the code says matches what the business intends, because the business intent is not in the diff. For a deeper look at how AI PR review compares to testing as a verification layer, automated code review vs testing covers the category distinction in full.

Why more AI does not fix it

The intuitive response to "AI missed this" is to add more AI: a second reviewer, a stronger model, a specialized prompt. None of these add the missing ingredient.

A second AI reviewer reads the same artifact. It lacks the same context. The business rule is still not in the diff, and a larger model or a more detailed prompt cannot summon context that is not present. Two reviewers with the same blind spot are still blind.

This is the circular verification problem at its clearest. When the same model (or models from the same training distribution) writes the code and reviews it, any systematic gap in what the model treats as "correct" propagates through both stages. The code can be consistent with itself end-to-end and still be wrong relative to the original intent.

Specialized AI PR reviewers like Cursor BugBot and its equivalents are genuinely good at catching syntactic and structural issues: unused variables, type mismatches, obvious null dereferences. They are not structured to verify semantic correctness against business requirements, because those requirements are not in the diff. That is not a limitation of the current models. It is a limitation of reviewing a diff without access to the specification.

The pattern we call AI test theater is the broader phenomenon: AI generation, AI review, and AI-generated tests can all be running on every PR and still produce no independent verification of the system's behavior. The pipeline looks rigorous. The independence is not there.

How Autonoma verifies the business outcome

The pattern documented above describes a structural gap: AI generation and AI review both operate on the code as written, so any error in what the code does relative to what it should do is invisible to both. The missing layer is external verification of the running application against intended outcomes.

That is the layer Autonoma provides. We built Autonoma to verify application behavior from the outside, independent of the implementation and the diff that introduced it.

Our three agents divide the work across the pipeline. The Planner reads your codebase: routes, components, API contracts, user flows. It derives test cases from the structure of the application and plans the database state each scenario requires, setting it up before execution so the test reflects a real user context, not a mocked one. The Automator executes those plans against your running preview environment, asserting on observable business outcomes: the price displayed at checkout, the records visible after login, the order status after a payment flow completes. The Maintainer keeps tests aligned as code changes, surfacing genuine divergence rather than maintenance noise.

The key property: our tests do not derive expected values from the function's current output. They derive them from user flows and observable application behavior. A business-logic bug that an AI-generated unit test would ratify as the "expected value" is exactly the class of error our Automator surfaces, because it is checking what the running application shows a user, not what the code says it returns.

This is additive, not a replacement. AI code review and unit tests are still useful for what they do well: type coverage, structural consistency, fast feedback on syntactic correctness. Autonoma adds the independent behavioral layer that those tools structurally cannot provide, because that layer requires executing the running application from the outside with intent derived from user flows, not from the implementation.

What does catch business-logic bugs

The category that catches business-logic bugs is independent, outcome-based verification.

Independent means the verification is not derived from the code being verified. The expected behavior comes from somewhere other than the implementation: a user flow, a business rule, an observable outcome that can be stated before the code is written.

Outcome-based means the test asserts on what users and the business observe. Not "this function returns 0.80 * price" but "when an annual plan user checks out, the price shown is 20% lower than the base price." Not "this query is scoped" but "when a role-B user loads the dashboard, they see only their organization's records." Not "the state transition fires" but "after completing checkout, the order confirmation page shows a pending status until payment settles."

These are E2E assertions against the running application. The test exercises the full stack from the outside, the way a real user would, and verifies that the application produces the outcome the business expects. A comprehensive E2E testing strategy for AI teams covers how to structure this layer alongside the unit and integration layers that already exist.

The critical property is independence.

Figure: Outcome-based verification works because the expected business result is defined before the implementation is consulted. The business rule is checked against the running app and the user-visible result.

The test is not derived from the function's current output. The expected value is set from the business rule, before the function is consulted. When the code is wrong, the test fails, because the test is measuring something external to the code.
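A small sketch of that property. The expected prices are written from the business rule alone, and the check accepts any callable, so it has no knowledge of how the price is computed. In practice the observed value would come from the running application (the price shown at checkout); here a plain function stands in for it, and every name is hypothetical.

```python
from decimal import Decimal
from typing import Callable

PriceFn = Callable[[str, Decimal], Decimal]

# Written from the business rule alone, before looking at any implementation:
# annual plans get 20% off, monthly plans get none.
EXPECTED_CHECKOUT_PRICE = {
    ("annual", Decimal("100")): Decimal("80"),
    ("monthly", Decimal("100")): Decimal("100"),
}


def failed_outcomes(price_fn: PriceFn) -> list[str]:
    """Describe every case where the observed price differs from the rule."""
    failures = []
    for (plan, base), expected in EXPECTED_CHECKOUT_PRICE.items():
        observed = price_fn(plan, base)
        if observed != expected:
            failures.append(f"{plan}: expected {expected}, observed {observed}")
    return failures


def buggy_price(plan: str, base: Decimal) -> Decimal:
    # Hypothetical bug: the discount is keyed to the wrong plan.
    return base * Decimal("0.80") if plan == "monthly" else base


def fixed_price(plan: str, base: Decimal) -> Decimal:
    return base * Decimal("0.80") if plan == "annual" else base


print(failed_outcomes(buggy_price))  # two failures, one per plan
print(failed_outcomes(fixed_price))  # []
```

Because the table is fixed before any implementation is consulted, the buggy version cannot make the check pass by being consistent with itself.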

Does this matter if you have strong AI test generation?

Yes, and the mechanism is the same.

AI test generation produces tests that are consistent with the code. When AI writes both the implementation and the tests in the same context window, any business-logic error in the implementation propagates into the assertions. The test asserts the function's current output as the expected value. The test passes. The bug ships. We cover this pattern in detail in AI-generated tests that pass but don't assert anything.

The coverage number is real. The green signal is real. The protection for the business-logic layer is not there. The fix is the same: an independent, outcome-based layer that exercises the running application from the outside, with expected values that come from the business flow, not from the code being tested.

That is why Autonoma treats outcome verification as the merge gate: the running preview has to produce the user-visible result the flow implies. AI review still handles diff-level issues; the independent E2E layer handles whether the business outcome survives the PR.


FAQ

AI misses business-logic bugs because it has no independent source of truth for what 'correct behavior' means. An AI reviewer reads the code as written. The requirement (say, 'annual plans get 20% off, monthly plans get none') lives in a PRD, a Slack thread, or a founder's head, never in the diff. The code can be internally consistent and still be wrong. AI verification is only trustworthy when it is independent of the thing being verified.

AI code review catches syntactic and structural issues well: unused imports, type errors, obvious null dereferences. It does not reliably catch business-logic errors because the logic's correctness depends on requirements that rarely appear in the diff. Adding a second AI reviewer or a stronger model does not fix this: both read the same artifact and lack the same missing input (the intended behavior). Two reviewers with the same blind spot are still blind.

A business-logic bug is a defect where the code runs correctly and throws no exceptions, but the outcome is wrong relative to the intended behavior of the system. Examples: a pricing function applies a 20% discount to monthly plans when only annual plans should qualify; a permission check lets a user in role B see tenant A's data; an order is marked shipped before payment settles. The code is internally consistent. The behavior is wrong.

Not for business-logic verification. AI excels at syntactic review, type checking, and generating tests that execute code paths. It does not replace the independent verification layer that checks whether the running application produces the outcomes users and the business expect. An identity-security customer running AI review on every PR told us: 'It doesn't cover the business case. Our QA engineers are still finding things.' The gap is not speed or scale. It is independence.

Outcome-based, independent verification catches business-logic bugs. Tests must be derived from real user flows and expected business outcomes, executed against the running application, and asserting on what the user or business observes (the price charged, the page visible, the state persisted), not on what the code says it does. E2E tests that exercise the full application stack from the outside provide the independence that code review and unit tests structurally cannot.
