[Figure: shift-left testing pipeline, with bugs caught at the PR stage before production for a small engineering team]

Shift-Left Testing for Small Engineering Teams in 2026

Tom Piaggio, Co-Founder at Autonoma

Shift-left testing is the practice of catching bugs earlier in the development cycle, before they reach production. For small engineering teams at Seed-to-Series-A startups, it means building verification into the PR stage rather than relying on a QA team that doesn't exist yet. The earlier a bug is caught, the cheaper it is to fix, and the less likely it is to become a churn event.

We built Autonoma so a 3-engineer team can have the shift-left coverage that used to require a 30-engineer QA function. The term itself goes back to Larry Smith's 2001 paper, where "shift left" meant moving testing activities toward the left side of the project timeline. Two decades later the principle holds, but the tooling assumption underneath it, a dedicated QA function with test writers and a maintenance team, has not caught up with how Seed-to-Series-A startups actually ship. This article is written for small engineering teams and engineers shipping without a QA hire: the 3-to-6-person shop where everyone reviews code, everyone is on-call, and "we don't have any QA" is not a complaint but a description of the org chart.

Shift-left when you ARE the test team

The standard shift-left pitch assumes there is a QA function to shift. There is a handoff point: dev writes code, QA writes tests, QA runs tests, bugs flow back to dev. Shift left means moving that handoff earlier. Smaller teams, more collaboration, less queue.

That model does not describe a 5-person startup. There is no handoff. The person who wrote the feature is the person who tests it, deploys it, and monitors it. When a bug slips through, "we hear about it real quick" because a customer emails within the hour. The shift-left insight still applies, but the framing has to change: you are not shifting a handoff left, you are building automated verification into a flow that has never had any.

This matters because the cost structure is different. Enterprise teams pay for QA labor to find bugs before release. Small teams pay in churn. A bug that breaks checkout for 20 minutes on a Tuesday doesn't show up in a QA report. It shows up in the cancellation email three days later. The leverage of shift-left testing for startups is not "reduce QA headcount." It is "reduce the probability that a production bug becomes a churned account."

The question is not whether to test earlier. The question is whether you can build shift-left into a workflow where no one is assigned to testing in the first place.

The 4 types of shift-left, reranked for 3-6 person teams

The original taxonomy comes from Donald Firesmith's 2015 write-up for the SEI and was later formalized by vendors. Four types are usually listed: Traditional, Agile, Model-based, and Incremental. Most vendors present them in that order, with roughly equal weight. For small teams, the ordering is wrong.

| Type | What it is | Effort to adopt | ROI for 3-6 person teams | Verdict |
| --- | --- | --- | --- | --- |
| Traditional | Run tests earlier in the sprint, not just at the end | Low | High: immediate feedback without new tooling | Adopt now |
| Agile | Tests written alongside code (TDD/BDD); dev and test happen in parallel | Medium | High: especially for critical paths | Adopt now |
| Incremental | Run a subset of tests on every commit, not just nightly | Low-Medium | Medium: depends on existing test suite density | Adopt later |
| Model-based | Generate tests from formal specs or state machines | Very high | Low: requires formal spec discipline few startups have | Skip |

Traditional and Agile shift-left are the right starting points for small teams. They require the least infrastructure investment and produce the most immediate return. Traditional means running whatever tests you have as early and as often as possible. Agile means writing test cases before or alongside the feature, not after.

Model-based shift-left is not worth the setup cost at this scale. It requires formal specification of system behavior before any code is written. Teams that have shipped a product and are iterating on it rarely have the spec discipline that model-based testing assumes. Skip it.

Incremental shift-left is useful once you have test coverage worth running incrementally. If your test suite is thin, running a subset frequently just catches a subset of nothing.
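
To make the incremental idea concrete, here is a minimal sketch of per-commit test selection. The path prefixes, spec file names, and the `selectTests` helper are all hypothetical; a real setup would get the changed-file list from its CI system.

```typescript
// Hypothetical mapping from source-path prefixes to the spec files that
// exercise them. Paths and spec names are illustrative only.
type TestMap = Record<string, string[]>;

const testMap: TestMap = {
  "src/checkout/": ["tests/checkout.spec.ts"],
  "src/billing/": ["tests/billing.spec.ts", "tests/checkout.spec.ts"],
  "src/auth/": ["tests/login.spec.ts"],
};

// Return the de-duplicated, sorted list of specs touched by the changed files.
// Files that match no prefix select nothing.
function selectTests(changedFiles: string[], map: TestMap): string[] {
  const selected = new Set<string>();
  for (const file of changedFiles) {
    for (const prefix of Object.keys(map)) {
      if (file.startsWith(prefix)) {
        map[prefix].forEach((spec) => selected.add(spec));
      }
    }
  }
  return Array.from(selected).sort();
}

// A billing change selects both specs mapped to it; the README selects none.
console.log(selectTests(["src/billing/invoice.ts", "README.md"], testMap));
```

The sketch also shows the thin-suite problem: if the mapping is nearly empty, most commits select no tests at all.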

Local-to-prod: closing the gap

The "local-to-prod" gap is where most small-team bugs live. The code works on the developer's machine. It works in the CI environment. It fails in production because production has different data, different configuration, different third-party state, or a different load pattern that no one modeled.

Shift-left, in practical terms for a startup, is about closing that gap. The smaller the difference between where you test and where you run, the fewer surprises at deploy time. This is why preview environments are a meaningful shift-left primitive: a per-PR environment that mirrors production configuration is closer to "testing in production" than a local dev server running against fixtures.

The preview environment is not a complete shift-left strategy on its own. It solves the environment gap. It does not solve the coverage gap: running no tests against a perfect environment still catches nothing. The full shift-left stack is environment fidelity (preview) plus automated coverage (E2E tests that run against the preview) plus fast feedback (results visible before merge).

[Figure: preview environment with E2E coverage closing the local-to-prod gap before merge]

A concrete shift-left stack for 5-person teams

Walk through what a small team actually signs up for if it assembles shift-left coverage manually. There are three layers, and each one quietly hands the team a permanent maintenance surface.

Layer 1: Vercel preview + E2E on preview URL. Every pull request gets a preview deployment, and a GitHub Actions workflow listens for the preview URL and runs E2E tests against it. What the team owns once this is wired up: the workflow YAML and the secrets it depends on, the Playwright suite that runs against the preview, the flaky-test triage when a preview boots slowly, and the policy of which tests gate merge versus which post warnings.

Here is the workflow file you sign up to own:

# GitHub Actions workflow: run Playwright E2E tests against Vercel preview deployments.
#
# Trigger: every time Vercel finishes a preview deployment for a PR, GitHub
# receives a `deployment_status` event. This workflow filters that event down
# to successful Vercel previews, pulls the preview URL out of the payload,
# and runs the Playwright suite against that URL.
#
# Optional GitHub repo secrets:
#   - VERCEL_TOKEN          (read-only token; only needed if you call the
#                            Vercel API for richer deployment data)
#
# Required GitHub repo configuration:
#   - The repo must be linked to Vercel so deployment_status events fire.
#
# Notes:
#   - `PLAYWRIGHT_BASE_URL` is set per-run from the deployment payload, not
#     from a static secret. That is the whole point of preview testing:
#     each run targets the ephemeral URL Vercel just produced.
#   - This workflow does NOT run on `pull_request` directly; it waits for
#     Vercel to finish so we don't race the deployment.

name: E2E on Vercel Preview

on:
  deployment_status:

jobs:
  playwright-on-preview:
    # Only run when Vercel reports a successful preview deployment.
    # `environment` is set by Vercel; "Preview" excludes Production deploys.
    if: >-
      github.event.deployment_status.state == 'success' &&
      github.event.deployment_status.environment == 'Preview' &&
      github.event.deployment.creator.login == 'vercel[bot]'

    runs-on: ubuntu-latest

    env:
      # Extract the ephemeral preview URL Vercel just published.
      # `target_url` is the public preview URL on the deployment_status event.
      PLAYWRIGHT_BASE_URL: ${{ github.event.deployment_status.target_url }}

    steps:
      # 1. Check out the commit the preview was built from, not the PR HEAD.
      #    This guarantees the tests we run match the code that's deployed.
      - name: Check out repository at deployed SHA
        uses: actions/checkout@v4
        with:
          ref: ${{ github.event.deployment.sha }}

      # 2. Install Node. Match the version your app uses in CI.
      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"

      # 3. Install project dependencies (including @playwright/test).
      - name: Install dependencies
        run: npm ci

      # 4. Install the Playwright browsers + OS-level system deps.
      #    `--with-deps` pulls in the apt packages Chromium/Firefox need on
      #    Ubuntu runners. Without it, browser launches fail with cryptic
      #    shared-library errors.
      - name: Install Playwright browsers
        run: npx playwright install --with-deps

      # 5. Run the suite against the preview URL.
      #    PLAYWRIGHT_BASE_URL is inherited from the job-level `env` above.
      #    Playwright does not read this variable on its own: wire it
      #    through `use.baseURL` in playwright.config.ts.
      - name: Run Playwright tests against preview
        run: npx playwright test

      # 6. Always upload the HTML report so failures are debuggable from the
      #    PR check, not just from a red X in the Actions tab.
      - name: Upload Playwright report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: playwright-report
          path: playwright-report/
          retention-days: 14
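
The workflow only exports `PLAYWRIGHT_BASE_URL`; the Playwright config still has to consume it. Below is a sketch of the resolution logic such a config might use. The `resolveBaseURL` helper and the `http://localhost:3000` fallback are assumptions for illustration, not part of Playwright or Vercel.

```typescript
// Environment shape, kept generic so the helper has no Node-specific types.
type Env = Record<string, string | undefined>;

// ASSUMPTION: the local dev server listens here when no preview URL is set.
const LOCAL_FALLBACK = "http://localhost:3000";

// Pick the base URL for a test run: the preview URL if the workflow set one,
// otherwise the local fallback.
function resolveBaseURL(env: Env): string {
  const raw = (env.PLAYWRIGHT_BASE_URL ?? "").trim();
  if (raw === "") return LOCAL_FALLBACK;
  // In case the value arrives as a bare host, default to https.
  const withScheme = /^https?:\/\//.test(raw) ? raw : `https://${raw}`;
  // Drop trailing slashes so relative paths like "/checkout" join cleanly.
  return withScheme.replace(/\/+$/, "");
}

// In playwright.config.ts this would be wired up roughly as:
//   export default defineConfig({ use: { baseURL: resolveBaseURL(process.env) } });

console.log(resolveBaseURL({ PLAYWRIGHT_BASE_URL: "https://my-app-git-feature.vercel.app/" }));
```

With `use.baseURL` set this way, calls such as `page.goto("/checkout")` resolve against whichever environment the run targets.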

Layer 2: Sentry-to-Slack alert routing. Some bugs reach production. Faster production-to-engineer routing compresses the time-to-fix. What the team owns: the Sentry project configuration, the alert rules, the Slack integration, and the ongoing tuning so the channel does not get noisy enough that people mute it.

The minimal Sentry rule, and the configuration the team keeps current:

# Sentry alert rule: route high-severity production errors to #eng-alerts in Slack.
#
# Apply this rule under: Sentry project settings -> Alerts -> Create Alert -> Issue Alert.
# Sentry's UI does not import YAML directly; this file is the source of truth
# you keep in version control, and you mirror it into the Sentry UI (or use
# the Sentry Terraform provider / API to apply it).
#
# Replace the placeholders marked REPLACE before applying:
#   - project_slug: your Sentry project slug (e.g. "web-frontend")
#   - workspace: the Slack workspace name as it appears in Sentry
#   - channel: the channel you want noisy enough to be noticed but quiet
#              enough to not be muted. We use #eng-alerts.

alert:
  name: "Production errors -> #eng-alerts"
  project_slug: "your-project-slug"   # REPLACE: e.g. "web-frontend"
  environment: "production"

  # Fire when: a Sentry issue's events spike past 5 in a single minute AND
  # the event level is `error` or worse (fatal). We do NOT alert on warnings
  # or info — those drown the channel and train people to ignore Slack.
  filter_match: "all"
  conditions:
    - id: "sentry.rules.conditions.event_frequency.EventFrequencyCondition"
      interval: "1m"
      value: 5
    - id: "sentry.rules.filters.level.LevelFilter"
      match: "gte"           # greater than or equal to
      level: "error"         # error or fatal

  # The rule's actions fire at most once per issue per cooldown window.
  # Cooldown prevents one flapping issue from spamming Slack every minute.
  frequency_minutes: 30

  actions:
    - id: "sentry.integrations.slack.notify_action.SlackNotifyServiceAction"
      workspace: "your-slack-workspace"   # REPLACE: Slack workspace name in Sentry
      channel: "#eng-alerts"              # REPLACE if you want a different channel
      tags: "environment,release,url"     # show these as fields in the Slack message
      notes: |
        On-call: ack within 5 minutes. If unrelated to a deploy in the last
        hour, page the next on-call instead of investigating solo.

# ---------------------------------------------------------------------------
# Reference: illustrative Slack message payload for this alert.
#
# This sample uses Slack's `attachments` field with the `color` set by
# issue level (red = fatal/error), which Slack renders as the familiar
# bordered card with title, fields, footer. The exact payload depends on
# the version of Sentry's Slack integration, so treat it as a sketch.
# ---------------------------------------------------------------------------
#
# {
#   "channel": "#eng-alerts",
#   "username": "Sentry",
#   "icon_url": "https://sentry.io/_static/favicon.png",
#   "attachments": [
#     {
#       "color": "#E03E2F",
#       "title": "TypeError: Cannot read properties of undefined (reading 'id')",
#       "title_link": "https://your-org.sentry.io/issues/123456789/?referrer=slack",
#       "text": "at CheckoutPage.handleSubmit (src/pages/checkout.tsx:142)",
#       "fields": [
#         { "title": "Project",     "value": "web-frontend",  "short": true },
#         { "title": "Environment", "value": "production",    "short": true },
#         { "title": "Release",     "value": "frontend@a1b2c3d", "short": true },
#         { "title": "Events",      "value": "12 in 1m",      "short": true },
#         { "title": "URL",         "value": "https://app.example.com/checkout", "short": false }
#       ],
#       "footer": "Sentry",
#       "footer_icon": "https://sentry.io/_static/favicon.png",
#       "ts": 1716240000,
#       "actions": [
#         { "type": "button", "text": "View issue",  "url": "https://your-org.sentry.io/issues/123456789/" },
#         { "type": "button", "text": "Assign to me","url": "https://your-org.sentry.io/issues/123456789/?assign=me" },
#         { "type": "button", "text": "Resolve",     "url": "https://your-org.sentry.io/issues/123456789/?resolve=true" }
#       ]
#     }
#   ]
# }
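
The rule above boils down to three checks: enough events in the window, a high enough level, and no recent alert for the same issue. The sketch below models that decision in TypeScript. It is a simplified illustration of the rule's intent, not Sentry's implementation, and every name in it (`shouldNotify`, `IssueEvent`, `RuleState`) is hypothetical.

```typescript
// Severity levels in ascending order, mirroring the "error or worse" filter.
const LEVELS = ["debug", "info", "warning", "error", "fatal"] as const;
type Level = (typeof LEVELS)[number];

interface IssueEvent {
  level: Level;
  timestampMs: number; // when the event was received
}

interface RuleState {
  lastNotifiedMs: number | null; // null until the first alert fires
}

const THRESHOLD = 5;             // alert when MORE than 5 events ...
const WINDOW_MS = 60_000;        // ... arrive within one minute
const COOLDOWN_MS = 30 * 60_000; // frequency_minutes: 30
const MIN_LEVEL: Level = "error";

// Decide whether the alert should fire for one issue at time `nowMs`.
// Simplification: only events at MIN_LEVEL or above count toward the threshold.
function shouldNotify(events: IssueEvent[], state: RuleState, nowMs: number): boolean {
  const minRank = LEVELS.indexOf(MIN_LEVEL);
  const recent = events.filter(
    (e) =>
      e.timestampMs <= nowMs &&
      nowMs - e.timestampMs <= WINDOW_MS &&
      LEVELS.indexOf(e.level) >= minRank,
  );
  if (recent.length <= THRESHOLD) return false;
  const coolingDown =
    state.lastNotifiedMs !== null && nowMs - state.lastNotifiedMs < COOLDOWN_MS;
  return !coolingDown;
}

// Helper for the example: `count` events of one level, one second apart.
function makeEvents(count: number, level: Level, nowMs: number): IssueEvent[] {
  const events: IssueEvent[] = [];
  for (let i = 0; i < count; i++) {
    events.push({ level, timestampMs: nowMs - i * 1000 });
  }
  return events;
}

const now = 10_000_000;
console.log(shouldNotify(makeEvents(6, "error", now), { lastNotifiedMs: null }, now));
console.log(shouldNotify(makeEvents(6, "warning", now), { lastNotifiedMs: null }, now));
```

Writing the rule as a pure function makes the tuning knobs (threshold, window, cooldown) explicit, which helps when the channel starts getting noisy.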

Layer 3: Coding-agent PR workflow. A developer prompts Claude Code or Cursor to scaffold tests as new code is written. What the team owns: the prompting discipline (every developer, every PR), the review of agent-generated tests, and the corner cases the prompt did not mention (more on this below).

This is the floor of what a manual shift-left stack costs a small team to operate. Nothing in it is unusual or unreasonable. It is just labor that does not go away.

Coding agents at the PR stage

Coding agents like Claude Code and Cursor shift testing left, but only for the scenarios the developer remembered to prompt for. That is the structural limit worth naming.

Prompt the agent with: "Write a Playwright test for the checkout flow: user adds item to cart, enters payment details, submits order, sees confirmation page." The agent produces a test that covers exactly that path. Here is what comes out, and what is missing from it:

/**
 * Playwright test for the checkout flow.
 *
 * Generated from the prompt:
 *   "Write a Playwright test for the checkout flow: user adds item to cart,
 *    enters payment details, submits order, sees confirmation page."
 *
 * Run with:
 *   npx playwright test tests/checkout.spec.ts
 *
 * The PLAYWRIGHT_BASE_URL env var controls the target environment
 * (local dev, Vercel preview, staging). Set it in CI or via .env.
 *
 * ---------------------------------------------------------------------------
 * Corner cases this prompt did NOT generate
 * ---------------------------------------------------------------------------
 * The single-sentence prompt above produced a clean happy-path test, but it
 * silently skipped two failure modes that matter in production. A human
 * reviewer should add these before shipping:
 *
 * 1. Soft payment decline
 *    The card is valid (Luhn-passes, expiry in the future) but the issuer
 *    rejects the transaction. Two Stripe test cards exercise this:
 *      - 4000 0000 0000 0002  -> generic decline
 *      - 4000 0000 0000 9995  -> insufficient funds
 *    Expected behavior: the checkout page stays mounted, the inline error
 *    surfaces the decline reason, the cart is preserved, and the user can
 *    retry with a different card. The prompt-generated test never enters
 *    this branch — it only asserts the happy path lands on /order/confirmation.
 *
 * 2. Session expiry between cart and checkout
 *    The user adds items, takes a phone call, comes back 30 minutes later,
 *    and submits payment. Either the cart cookie expired or the auth token
 *    expired while the payment form was open. Expected behavior: graceful
 *    redirect to /login with the cart preserved server-side, so after
 *    re-auth the user lands back on /checkout with the same line items.
 *    The prompt-generated test runs entirely within a single fresh session
 *    and never exercises the expiry path.
 *
 * Both corner cases are common in real traffic and both are invisible to a
 * happy-path-only test. They are exactly the kind of coverage gap a coding
 * agent will not invent unless you prompt for it explicitly.
 * ---------------------------------------------------------------------------
 */

import { test, expect } from "@playwright/test";

test("checkout happy path: add to cart, pay, see confirmation", async ({
  page,
}) => {
  // 1. Land on a product page and add the item to the cart.
  await page.goto("/products/example-widget");
  await page.getByRole("button", { name: /add to cart/i }).click();

  // The Add-to-cart button typically updates a header badge or opens a
  // mini-cart. We don't assert on that here — the next step (visiting
  // /checkout) proves the item was persisted into the cart.

  // 2. Go to checkout.
  await page.goto("/checkout");
  await expect(page).toHaveURL(/\/checkout$/);

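  // 3. Fill payment details with Stripe's canonical success test card.
  //    Card:    4242 4242 4242 4242
  //    Expiry:  12/30
  //    CVC:     123
  //    Postal:  12345
  //    NOTE: this assumes the card inputs are rendered in the page itself.
  //    If the form embeds them in an iframe (as Stripe Elements does),
  //    reach them through page.frameLocator(...) instead.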
  await page.getByLabel(/card number/i).fill("4242 4242 4242 4242");
  await page.getByLabel(/expiry|expiration/i).fill("12/30");
  await page.getByLabel(/cvc|cvv|security code/i).fill("123");
  await page.getByLabel(/postal|zip/i).fill("12345");

  // 4. Submit the order.
  await page.getByRole("button", { name: /place order/i }).click();

  // 5. Confirmation page renders an order ID. We assert on the URL shape
  //    and on the presence of an order-id node, not on the exact ID value
  //    (it's generated server-side).
  await page.waitForURL(/\/order\/confirmation/);
  await expect(page).toHaveURL(/\/order\/confirmation/);

  const orderId = page.getByTestId("order-id");
  await expect(orderId).toBeVisible();
  await expect(orderId).not.toBeEmpty();
});

The two cases that actually cause production bugs are absent: the payment provider returning a soft decline (card valid but transaction rejected), and the session expiring between cart and checkout while the payment form sits open. These corner cases live in customer support tickets, not in the prompt the developer wrote.

The gap is structural, not a prompting failure. The agent only covers what the developer thought to describe. Closing it manually means writing every corner case yourself, every PR, for the rest of the product's life. For the canonical taxonomy of happy-path, sad-path, edge-case, and corner-case coverage, see "Happy Path Testing: What It Covers and What It Misses."
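
One way to keep the gap visible is to list the scenarios next to the test and check which ones have coverage. The sketch below does that for the checkout flow, using the Stripe test cards named in the test file's header comment. The `CheckoutScenario` shape, the scenario ids, and the `uncovered` helper are hypothetical.

```typescript
type Outcome = "confirmation" | "inline-decline-error" | "redirect-to-login";

interface CheckoutScenario {
  id: string;
  card: string;            // Stripe test card number
  sessionExpires: boolean; // whether the session lapses before submit
  expected: Outcome;
}

// The happy path plus the corner cases the one-sentence prompt left out.
const scenarios: CheckoutScenario[] = [
  { id: "happy-path", card: "4242 4242 4242 4242", sessionExpires: false, expected: "confirmation" },
  { id: "generic-decline", card: "4000 0000 0000 0002", sessionExpires: false, expected: "inline-decline-error" },
  { id: "insufficient-funds", card: "4000 0000 0000 9995", sessionExpires: false, expected: "inline-decline-error" },
  { id: "session-expiry", card: "4242 4242 4242 4242", sessionExpires: true, expected: "redirect-to-login" },
];

// Return the ids of scenarios that no existing test claims to cover.
function uncovered(all: CheckoutScenario[], coveredIds: string[]): string[] {
  const covered = new Set(coveredIds);
  return all.filter((s) => !covered.has(s.id)).map((s) => s.id);
}

// The prompt-generated suite covers only the happy path.
console.log(uncovered(scenarios, ["happy-path"]));
```

The list still has to be written and maintained by a person, which is the labor this section is describing.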

How Autonoma covers shift-left testing

For a small team without QA, the only shift-left approach that survives contact with the next twelve months of product changes is one where no human writes or maintains the scenario library. Everything in the section above is labor that compounds: the workflow YAML, the Playwright suite, the prompting discipline, the corner cases the agent missed. Autonoma is the version of shift-left that runs every PR and does not hand the team that labor.

The four agents map to what each one removes from your team's workload:

  • Planner agent. Reads your codebase (routes, components, API endpoints, user flows) and derives the scenario plan from what the code does. No human writes the scenario list.
  • Generation agent (Automator). Drives scenarios directly against the running application (in a PR workflow, the preview URL for that branch) through Autonoma's own AI-native runtime. No human writes test code, in Playwright or anything else.
  • Replay engine. Reruns the same scenarios deterministically with verification layers at each step. No human chases flakes.
  • Reviewer. Posts PR-level pass/fail per scenario, including which corner cases were exercised. No human triages.

Autonoma is not a layer on top of Playwright. It is its own test system: scenarios live inside Autonoma, the runtime executes them directly, and the artifacts the Reviewer surfaces are pass/fail results on those scenarios, not Playwright .spec.ts files.

Tied back to the prior sections: this is what removes the labor that the Vercel workflow plus the Playwright suite plus the prompting discipline imposes on a 5-person team. The preview environment stays. The IDE agent stays. What goes away is the Playwright suite itself and the human-owned scenario library behind it. Autonoma replaces both with its own pipeline, not a wrapper on top of theirs.

Two boundaries worth stating directly. Autonoma does not replace unit tests: classical shift-left includes unit tests written by developers alongside their code, and that discipline stays yours. Autonoma also does not replace post-production observability: Autonoma is pre-deploy, Sentry is post-production, and a complete shift-left posture has both.

The cost-of-defect math for startups

The enterprise framing of shift-left cost math is "$1 to find in requirements, $10 in development, $100 in QA, $1,000 in post-release, $10,000 in production." That framing does not map to a startup.

For a startup, the cost of a production bug is not a support ticket and a hotfix. It is a churn event. If your average contract value is $5,000 per year and a billing bug causes a customer to cancel, that bug cost you $5,000 in annual recurring revenue. It also cost you the lifetime value of that customer's potential expansion, and possibly a reference account. The $10,000 generic multiplier understates the real number for most SaaS businesses.

The math that actually matters for startup shift-left decisions:

One production bug that reaches a customer. Call it a p0: broken checkout, wrong billing charge, data not saved. A realistic churn probability from a p0 is not trivial; customers who hit a serious bug in their first 90 days cancel at elevated rates. If that one bug costs you one customer at $5,000 ACV, that is the cost. One month of Autonoma coverage costs less than that. The shift-left decision is not an engineering decision, it is a unit economics decision.

The real cost of not testing before production is not developer time fixing bugs. It is customers leaving because they hit the bug first.

There is another cost that compounds invisibly: developer time. Every production bug creates an interrupt. Someone investigates logs. Someone writes a hotfix. Someone reviews it. Someone deploys it. On a 5-person team where everyone is also building features, a production incident typically consumes half a day of engineering time across the team. If your team ships two production bugs per sprint, that is one engineer-day per sprint lost to firefighting, every sprint, until someone invests in shift-left coverage.
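
The arithmetic above can be written down as a small model. The ACV, bug rate, and half-day figure come from the text; the churn probability and the cost of an engineer-day are placeholder assumptions, so substitute your own numbers.

```typescript
interface CostInputs {
  acvUsd: number;             // annual contract value per customer
  churnProbability: number;   // chance a p0 bug churns the affected customer
  bugsPerSprint: number;      // production bugs shipped per sprint
  engineerDaysPerBug: number; // team-wide time consumed per incident
  engineerDayCostUsd: number; // loaded cost of one engineer-day
}

// Expected revenue lost to churn from a single production bug.
function expectedChurnCostPerBug(i: CostInputs): number {
  return i.acvUsd * i.churnProbability;
}

// Engineer-days per sprint spent investigating, fixing, and redeploying.
function firefightingDaysPerSprint(i: CostInputs): number {
  return i.bugsPerSprint * i.engineerDaysPerBug;
}

// Total expected cost per sprint: churn exposure plus firefighting time.
function costPerSprint(i: CostInputs): number {
  return (
    i.bugsPerSprint * expectedChurnCostPerBug(i) +
    firefightingDaysPerSprint(i) * i.engineerDayCostUsd
  );
}

const inputs: CostInputs = {
  acvUsd: 5000,            // from the example above
  churnProbability: 0.25,  // ASSUMPTION: illustrative only
  bugsPerSprint: 2,        // from the example above
  engineerDaysPerBug: 0.5, // "half a day" per incident
  engineerDayCostUsd: 800, // ASSUMPTION: illustrative only
};

console.log(firefightingDaysPerSprint(inputs));
console.log(expectedChurnCostPerBug(inputs));
console.log(costPerSprint(inputs));
```

With these inputs the firefighting term comes to one engineer-day per sprint, matching the figure above. The churn term depends entirely on the probability you plug in.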

The "catch bugs before they reach production" goal is not just about product quality. It is about protecting the sprint cadence that keeps a small team competitive.

What shift-left will NOT save you from

Shift-left testing catches bugs in flows you test before they reach production. It does not catch what it does not cover.

Unit test discipline. Autonoma and most E2E shift-left tools do not replace unit tests. If your business logic has subtle off-by-one errors, type coercion bugs, or edge cases in pure functions, unit tests catch those. E2E tests will only catch them if they surface through the UI or API. Write unit tests for the code that matters.

Post-production observability. Shift-left testing assumes you can predict which scenarios matter before a real user runs them. Real users find paths you did not predict. A production monitoring layer (error tracking plus alerting) is the complement to shift-left testing, not a replacement for it. Sentry, or a comparable error-tracking tool, is your post-production safety net. Autonoma is your pre-deploy safety net. Both belong in a complete stack.

On-call and incident response. Shift-left reduces the frequency of incidents. It does not reduce the severity of incidents that get through. A clear on-call rotation and a documented incident response process are independent of your testing posture. Teams that invest in shift-left sometimes let on-call discipline slip because bugs become less frequent. That is the wrong trade.

Infrastructure and configuration drift. If your preview environment is not production-shaped, shift-left tests pass against a configuration that production does not share. A test passing against a feature flag that is on in preview but off in production is not a caught bug. It is a deferred bug. Maintaining environment parity is a prerequisite for shift-left testing to mean anything.
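
A cheap guard against this kind of drift is to diff the flag sets of the two environments before trusting a green preview run. The sketch below assumes flags can be exported as simple name-to-boolean maps. The flag names and the `findFlagDrift` helper are hypothetical.

```typescript
type Flags = Record<string, boolean>;

interface Drift {
  flag: string;
  preview: boolean | undefined;    // undefined when the flag is missing
  production: boolean | undefined;
}

// List every flag whose value differs between preview and production,
// including flags that exist in only one of the two environments.
function findFlagDrift(preview: Flags, production: Flags): Drift[] {
  const names = Array.from(
    new Set([...Object.keys(preview), ...Object.keys(production)]),
  ).sort();
  return names
    .filter((name) => preview[name] !== production[name])
    .map((name) => ({
      flag: name,
      preview: preview[name],
      production: production[name],
    }));
}

// Example: the new checkout is on in preview but off in production.
const previewFlags: Flags = { newCheckout: true, darkMode: false };
const productionFlags: Flags = { newCheckout: false, darkMode: false };

console.log(findFlagDrift(previewFlags, productionFlags));
```

Any non-empty result means a passing preview run exercised a configuration that production does not share.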

Shift-left testing is not a QA strategy. It is an engineering leverage strategy. For a 5-person team, catching one production bug that would have churned a customer pays for a year of tooling investment. The shift-left stack for small teams does not require a QA hire, a test infrastructure team, or a formal testing process. It requires a preview environment with automated E2E coverage, a fast feedback loop from production, and the discipline to run tests before merging.

Autonoma handles the E2E part of that stack automatically: the Planner reads your codebase, the Generation agent builds the tests, and the Replay engine runs them on every PR. For engineers shipping without a QA team, that is the shift-left primitive that changes the unit economics of a production bug.

FAQ

What is shift-left testing?

Shift-left testing is the practice of running tests and verification earlier in the development lifecycle, ideally before code is merged rather than after it is deployed. In 2026, the most practical expression for small teams is running E2E tests automatically on every pull request against a preview environment that mirrors production, using tools like Autonoma that generate and maintain those tests from the codebase itself.

Do small engineering teams need shift-left testing?

Yes, and arguably more than large teams. Large teams have QA engineers whose job is to catch bugs before release. Small teams have no one in that role, so every bug that is not caught pre-deploy reaches a real customer. The cost of a production bug for a startup is frequently a churned account. Shift-left testing is the investment that keeps production bugs from becoming churn events.

Can coding agents handle shift-left testing on their own?

Partially. Coding agents like Claude Code and Cursor can generate test scaffolding for new features when prompted, which is a meaningful shift-left contribution. The gap is that they cover the happy path the developer described, not the corner cases the developer did not think to describe. Tools like Autonoma complement coding agents by deriving test coverage from the codebase itself, catching the paths the coding agent's prompt did not anticipate.

Is shift-left testing for small teams the same as traditional shift-left?

Not exactly. Traditional shift-left assumes a QA function exists and moves its activities earlier in the timeline. For small teams without a QA function, shift-left is better framed as building automated verification into the engineering workflow from the start. The goal is the same (catch bugs earlier) but the mechanism is different: automated tooling running on every PR rather than a QA team running a test cycle before release.

Does Autonoma replace Sentry?

No. Autonoma does not replace Sentry, and shift-left testing does not replace post-production observability. Autonoma is pre-deploy: it catches bugs before a PR is merged. Sentry is post-production: it catches errors that reach real users. Both belong in a complete safety net. A team that only has shift-left has no visibility into production. A team that only has Sentry is finding out about bugs from customers instead of from tests.
