
How to Test the Auth Code an AI Agent Wrote

Tom Piaggio, Co-Founder at Autonoma

When an AI agent writes your authentication, there is no human baseline for what correct behavior looks like, so define expected auth behavior explicitly: valid users log in, invalid credentials are rejected, and protected routes redirect. Then verify the AI-generated code against that spec with end-to-end tests.

The AI wrote the login. You merged it. The tests passed. And yet, somewhere in the back of your head, there is a quiet unease: you never actually defined what "correct" looks like for this code, and neither did the agent. You reviewed a diff. The diff looked plausible. You approved it.

This is the risk that AI-generated auth code carries more acutely than any other category of code. A billing function written by an AI might calculate a discount wrong. That is bad. But the wrong discount does not lock every user out of your product overnight. Authentication is the highest-stakes code path in most applications, and when an AI agent writes it, the verification gap is not just technical. It is definitional. Nobody specified what correct auth behavior is, and nobody checked whether the generated code matches it.

The incidents that keep coming up in engineering retrospectives in 2026 are variations of the same story: an AI coding agent dropped an auth wrapper, the code compiled, code review passed, and users hit a locked-out production app. We wrote the full teardown in When the AI Agent Silently Broke Authentication in Production. The pattern is not a fluke. It is structural.

Why AI-written auth has no baseline

When a human engineer writes login functionality, there is a baseline embedded in the process even if nobody writes it down. The engineer has seen login flows before. They know a valid user should reach the dashboard. They know a wrong password should return an error. They know a logged-out user hitting a protected route should land on the login page. That intuition is the spec. It is informal, but it is real.

An AI coding agent has no such intuition about your specific application. It has pattern-matched on millions of login implementations. It will generate something that looks correct. It will use the right library calls. It will return reasonable status codes. What it will not do is independently verify that the behavior it just created matches what your application actually needs, because it was never given a behavioral spec. It was given a prompt.

The result is AI-generated code that compiles, passes linting, passes static analysis, and passes any tests the same agent or its peer generates, because those tests are derived from the same implementation the agent just wrote. The tests assert the code back to itself. A wrong assumption produces a wrong expected value, and the test passes.
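A minimal sketch of that circularity, in TypeScript. Everything here is hypothetical and made up for illustration: the `loginStatus` function, the deliberately wrong rule inside it, and the two check functions. The point is only that a test whose expected value is copied from the implementation's own output passes on buggy code, while a test whose expected value comes from the spec does not.

```typescript
// Hypothetical AI-generated login check with a wrong assumption baked in:
// it accepts any non-empty password for a known email.
const knownEmails: string[] = ["ana@example.com"];

function loginStatus(email: string, password: string): number {
  const isKnown = knownEmails.some((e) => e === email);
  if (!isKnown) return 401;
  // Bug (intentional, for illustration): the password is never compared
  // against a stored credential.
  return password.length > 0 ? 200 : 401;
}

// A test derived from the implementation: the expected value is whatever the
// code returned when the test was generated, so the bug becomes "expected".
function implementationDerivedCheck(): boolean {
  const expectedFromObservedOutput = 200; // copied from a run of loginStatus
  return loginStatus("ana@example.com", "wrong-password") === expectedFromObservedOutput;
}

// A test derived from the behavioral spec: a wrong password must be rejected.
function specDerivedCheck(): boolean {
  const expectedFromSpec = 401;
  return loginStatus("ana@example.com", "wrong-password") === expectedFromSpec;
}
```

In this sketch `implementationDerivedCheck()` passes and `specDerivedCheck()` fails on the same code. Only the second one reports the bug.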

This is what we mean by "no baseline." The broader version of this problem applies to regression testing AI-generated code across the entire codebase. Auth amplifies it: there is no previous implementation to diff against, no earlier behavior to compare to, and the failure mode is catastrophic rather than gradual.

This is exactly the gap a verification layer like Autonoma is built to close: it derives the expected auth behavior from the running application rather than from the same code the AI just wrote, so it can tell when the generated code deviates.

Defining expected auth behavior

The fix is to do what neither the AI agent nor the post-merge test suite typically does: define expected auth behavior before verifying the code. This does not require a formal spec document. It requires answering three questions about your application's auth flows, then writing those answers down as the behavioral contract the tests must verify.

The questions are:

Who can log in, and what do they reach? A valid user with correct credentials should authenticate and land on the right destination. This seems obvious, but it needs to be stated explicitly, because "the right destination" varies: a standard user hits the dashboard, an admin hits the admin panel, a deactivated user should not get in at all. Each of these is a distinct test case.

What should fail, and how? Wrong password. Expired token. Unregistered email. Account locked after N attempts. Each failure mode should return a specific, expected response: the correct error message, the correct HTTP status, no accidental session creation. AI-generated auth code frequently handles the happy path correctly and handles failure modes inconsistently, because the prompt described login, not what happens when login fails.

What do protected routes actually protect? A logged-out user requesting a protected route should be redirected to login, not served the page with a blank state. A user without the right role should receive an authorization error, not reach the route with limited data. These redirect and guard behaviors are exactly what auth middleware is supposed to enforce, and they are exactly what AI coding agents omit silently. Protecting a route is the kind of wrapper that looks like boilerplate and gets dropped.

| Behavior to define | Expected outcome | Failure mode to test |
| --- | --- | --- |
| Valid user logs in | Reaches correct destination | Wrong destination, blank state |
| Wrong password submitted | Error shown, no session created | Error silent, session exists |
| Unregistered email | Specific error, no session | Generic error, info leak |
| Logged-out user hits protected route | Redirects to login | Serves page, no redirect |
| Authed user reaches protected route | Renders correct content | Auth loop, blank page |

Once you have these behaviors written down, you have a spec. The spec is not code. It is the behavioral contract that the AI-generated code must satisfy. Everything the AI wrote is implementation detail. The spec is what you are actually testing.
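One way to keep that contract outside the implementation is to record it as plain data. The sketch below is an assumption about how you might do that, not a prescribed format: the `AuthBehavior` type, the `authSpec` array, and the `uncoveredBehaviors` helper are hypothetical names, and the rows simply restate the table above.

```typescript
// One row of the behavioral contract, mirroring the table's three columns.
type AuthBehavior = {
  behavior: string;
  expectedOutcome: string;
  failureModeToTest: string;
};

// The spec as data. It lives outside the auth implementation on purpose.
const authSpec: AuthBehavior[] = [
  { behavior: "Valid user logs in", expectedOutcome: "Reaches correct destination", failureModeToTest: "Wrong destination, blank state" },
  { behavior: "Wrong password submitted", expectedOutcome: "Error shown, no session created", failureModeToTest: "Error silent, session exists" },
  { behavior: "Unregistered email", expectedOutcome: "Specific error, no session", failureModeToTest: "Generic error, info leak" },
  { behavior: "Logged-out user hits protected route", expectedOutcome: "Redirects to login", failureModeToTest: "Serves page, no redirect" },
  { behavior: "Authed user reaches protected route", expectedOutcome: "Renders correct content", failureModeToTest: "Auth loop, blank page" },
];

// Lists the spec rows that no end-to-end test claims to cover yet.
function uncoveredBehaviors(spec: AuthBehavior[], testedBehaviors: string[]): string[] {
  return spec
    .map((row) => row.behavior)
    .filter((behavior) => testedBehaviors.indexOf(behavior) === -1);
}
```

Calling `uncoveredBehaviors(authSpec, ["Valid user logs in"])` returns the four rows that still need a test, which makes a missing check visible before anything ships.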

Diagram showing AI-generated auth code checked against an external behavior spec before end-to-end verification decides whether the code matches expected login, password rejection, and protected-route redirect behavior

The expected behavior comes from the auth contract, not from the AI-generated implementation.

Verifying it end-to-end

The reason this spec must be verified end-to-end, and not unit-by-unit, is that auth behavior only exists in the context of a running application. A unit test on an authentication function can confirm the function returns a token for valid credentials. It cannot confirm that the token is correctly passed to the session layer, that the session layer sets the cookie, that the middleware reads the cookie, or that the protected route actually checks the middleware. Each of those is a separate piece of code. A dropped auth wrapper is invisible to any individual unit test, because the wrapper is not a function that gets called. It is a guard that is supposed to surround the route.
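The dropped-wrapper case can be sketched in a few lines. This is a simplified, hypothetical model rather than any real framework's API: `issueToken`, `requireSession`, the two route tables, and the hard-coded credentials are all invented for illustration. It shows the same unit test passing whether or not the guard surrounds the route.

```typescript
type IncomingCall = { path: string; sessionToken?: string };
type Outcome = { statusCode: number; body?: string; redirectTo?: string };
type PageHandler = (call: IncomingCall) => Outcome;

// The authentication function a unit test would target.
function issueToken(email: string, password: string): string | null {
  return email === "ana@example.com" && password === "correct-horse" ? "token-ana" : null;
}

// The guard: wraps a handler and redirects callers without a valid session.
function requireSession(inner: PageHandler): PageHandler {
  return (call) =>
    call.sessionToken === "token-ana" ? inner(call) : { statusCode: 302, redirectTo: "/login" };
}

const dashboardPage: PageHandler = () => ({ statusCode: 200, body: "dashboard" });

// Two route tables. The second is what a refactor that drops the wrapper leaves behind.
const guardedRoutes: Record<string, PageHandler> = { "/dashboard": requireSession(dashboardPage) };
const unguardedRoutes: Record<string, PageHandler> = { "/dashboard": dashboardPage };

// The unit test never touches the route table, so it cannot see the difference.
function unitTestPasses(): boolean {
  return issueToken("ana@example.com", "correct-horse") === "token-ana"
    && issueToken("ana@example.com", "nope") === null;
}

// A request with no session, sent through the whole route table.
function loggedOutVisit(routes: Record<string, PageHandler>, path: string): Outcome {
  const handler = routes[path];
  return handler ? handler({ path }) : { statusCode: 404 };
}
```

Here `unitTestPasses()` is true in both configurations. Only `loggedOutVisit(unguardedRoutes, "/dashboard")`, which returns the page with status 200 instead of a redirect to `/login`, exposes the missing guard.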

End-to-end verification drives a real browser through the login flow. It submits credentials, observes the response, navigates to a protected route, and asserts on what it sees. The assertion is against the behavioral spec, not against the implementation.

For AI-generated auth code specifically, this means writing E2E tests that cover each row of the table above. Valid user logs in and reaches the dashboard: navigate to /login, fill in valid credentials, submit, assert the URL is now /dashboard and the user's name appears. Wrong password: fill in invalid credentials, submit, assert the error message is visible and the URL is still /login. Logged-out user hits a protected route: navigate directly to /dashboard without a session, assert the browser lands on /login.
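The three scenarios can be sketched as follows. In a real project these checks would run in a browser automation tool against a preview deployment. To keep the example self-contained, it uses a tiny in-memory stand-in instead, so `FakeBrowser`, its methods, the credentials, and the page text are all hypothetical.

```typescript
type PageView = { url: string; text: string; hasSession: boolean };

// Hypothetical stand-in for a running app plus one browser session.
class FakeBrowser {
  private loggedInAs: string | null = null;

  // Submit the login form at /login.
  submitLogin(email: string, password: string): PageView {
    if (email === "ana@example.com" && password === "correct-horse") {
      this.loggedInAs = "Ana";
      return this.goto("/dashboard");
    }
    return { url: "/login", text: "Invalid email or password", hasSession: this.loggedInAs !== null };
  }

  // Navigate to a path, following the app's redirect for the protected route.
  goto(path: string): PageView {
    if (path === "/dashboard") {
      if (this.loggedInAs === null) return this.goto("/login");
      return { url: "/dashboard", text: "Welcome, " + this.loggedInAs, hasSession: true };
    }
    return { url: path, text: "Log in", hasSession: this.loggedInAs !== null };
  }
}

type ScenarioResult = { label: string; passed: boolean };

// Each scenario starts from a fresh browser and asserts on what the user would see.
function runAuthScenarios(): ScenarioResult[] {
  const results: ScenarioResult[] = [];

  const valid = new FakeBrowser().submitLogin("ana@example.com", "correct-horse");
  results.push({
    label: "valid login reaches /dashboard and shows the user's name",
    passed: valid.url === "/dashboard" && valid.text.indexOf("Ana") !== -1,
  });

  const wrong = new FakeBrowser().submitLogin("ana@example.com", "nope");
  results.push({
    label: "wrong password stays on /login with a visible error and no session",
    passed: wrong.url === "/login" && wrong.text.indexOf("Invalid") !== -1 && !wrong.hasSession,
  });

  const loggedOut = new FakeBrowser().goto("/dashboard");
  results.push({
    label: "logged-out visit to /dashboard lands on /login",
    passed: loggedOut.url === "/login",
  });

  return results;
}
```

Each assertion is phrased in terms of the URL and the visible page, never in terms of internal functions, so the same three checks apply unchanged if the auth implementation is regenerated.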

None of these are technically complex assertions. What makes them valuable is that they are derived from the behavioral spec, not from the AI-generated code. They will fail if the implementation is wrong, even if the implementation compiles and passes every unit test the same AI generated.

The sibling piece on how to test auth middleware and protected routes covers the mechanics of these assertions in detail. For this article, the point is simpler: define the expected behavior first, then verify the AI-generated code against it, and do the verification at the browser level where the full auth stack runs together.

How Autonoma Verifies AI-Written Auth

The "no baseline" problem has a structural solution. The behavioral spec you defined above does not need to be written manually in a test file. An agent can derive expected auth behavior from the running application and verify the AI-generated code against it end-to-end, catching deviations before they reach production.

Autonoma is the verification layer we built specifically for this. The Planner agent reads your codebase, including the routes, components, and user flows the AI agent just generated, and plans the test cases from the application's intended architecture. Not from the implementation details of any individual PR. Not from the same context the AI used to write the auth code. The Planner establishes the behavioral spec: what flows exist, what should succeed, what should fail, what protected routes should guard. The Executor runs those test cases against a live preview environment, driving the real browser through the real login flow. The Reviewer evaluates each result and separates a real auth bug from an agent error or a test mismatch. The Diffs Agent runs on every subsequent PR and maintains those test cases as the codebase evolves, adding and deprecating them as the auth implementation changes.

This maps directly to the "no baseline" problem. When an AI coding agent writes your login and immediately generates tests from the same implementation, both the implementation and the tests share the same context. A dropped auth wrapper is invisible to both. Autonoma's Planner derives its behavioral expectations from the application's route architecture, independent of the PR that introduced the AI-generated auth code. When the middleware guard is absent, the Executor drives a logged-out browser to the protected route, does not see a redirect, and the Reviewer classifies it as a real auth bug, not an agent error.

The auth flows this catches are exactly the ones the table above names: the valid login that silently fails, the protected route that does not redirect, the wrong-password case that creates a session anyway. These are the deviations that compile fine, pass code review, and only surface when a real browser exercises the full auth stack.

Diagram showing Autonoma separating an AI-generated auth PR from independent verification: Planner derives expected behavior, Executor runs the live preview, Reviewer classifies the result, and Diffs Agent keeps coverage current

Autonoma keeps the behavior spec, browser run, and reviewer verdict separate from the AI-generated code path.

Catching the deviation before users do

The pattern that ends in a production lockout is not a series of mistakes. It is a series of individually reasonable decisions that together leave a gap. The AI agent wrote auth code, which is reasonable. Code review confirmed the diff was internally consistent, which is what code review does. Tests passed, which they do when the tests were generated from the same implementation. Nobody defined expected auth behavior independently, because nobody had a reason to think it was missing.

The gap is invisible right up until users start hitting it. The first signal is often a spike in failed login attempts, a support ticket, or a monitoring alert on session creation rate. By then, the deviation has been in production for however long it took to reach measurable volume.

Defining expected auth behavior explicitly, before or immediately after the AI agent writes the code, is the intervention. Write down the three questions. Fill in the table. Then verify the generated code against those behaviors with end-to-end tests that drive the real browser.

The test suite that covers a valid login, a wrong-password rejection, and a protected-route redirect is not sophisticated. It is not covering edge cases. It is covering the baseline that should have been defined before the AI agent started writing. The broader patterns for testing AI-generated code beyond auth apply across the whole codebase and are worth reading alongside this article. The auth path is where the stakes are highest, but the principle that AI-generated code needs behavioral verification derived from something other than the AI-generated code itself applies everywhere.

The check that catches the dropped auth wrapper is not technically hard to write. It is just the check that nobody wrote, because nobody defined what "correct" was supposed to look like. Autonoma makes that independent check part of the PR loop: Planner defines what should happen, Executor runs it in a browser, Reviewer decides whether the deviation is real, and the Diffs Agent keeps the auth coverage current.

FAQ

Define expected auth behavior explicitly first: which users should log in and reach which destination, what failure modes should return what errors, and which protected routes should redirect unauthenticated users. Then write end-to-end tests that drive a real browser through each of those scenarios and assert on what the browser observes. The key is that the behavioral spec must be defined independently of the AI-generated implementation. Tests derived from the implementation itself will inherit any wrong assumption the AI made and pass regardless.

An AI coding agent does not know what correct behavior is for your application unless you tell it. It pattern-matches on similar implementations it has seen. It will generate code that looks plausible and compiles. What it cannot do is independently verify that the behavior it created matches what your application actually needs, because it was given a prompt, not a behavioral spec. For authentication specifically, this means the generated code may handle the happy path correctly and handle failure modes inconsistently, or may omit middleware guards that were never described in the prompt. Defining expected behavior explicitly is the only way to give the verification step something to check against.

You can trust AI-written authentication after it has been verified against an explicit behavioral spec with end-to-end tests. Before that verification, the trust is premature. The risk is not that AI-generated auth code is generically bad; it is that the code was generated without an external spec, so the only way to know it is correct is to run it against one. The specific failure mode to watch for is not incorrect logic but omitted guards: auth wrappers, middleware checks, and route protections that the AI did not include because they were not described in the prompt and feel like boilerplate. These omissions compile without error and only fail at runtime when a real browser exercises the full auth stack.

For auth flows, answer three questions: who can log in and what do they reach, what should fail and how (wrong password, expired token, unregistered email), and what do protected routes actually protect. Each answer becomes a test case. The test case submits the input through the real browser, observes the result, and asserts it matches the defined expectation. The behavioral spec does not need to be formal documentation. It needs to exist outside the AI-generated implementation, so the tests verify the code rather than reflecting it back to itself.

Cover the behavioral contract in four areas: the valid login reaches the correct destination, invalid credentials return the expected error without creating a session, each failure mode (wrong password, unregistered email, locked account) produces the correct response, and protected routes redirect unauthenticated users rather than serving the page. These cover the baseline behaviors that AI coding agents most commonly get wrong or omit. For each test, verify at the browser level: navigate, submit, observe, assert. Unit tests on individual auth functions will not catch a dropped middleware guard or a missing auth wrapper, because those are structural gaps, not logic errors in a function.
