Deploys aren't the constraint anymore

  • engineering-practice
  • deployment
  • ci
  • devops

The deployment cycle most teams run today was tuned for an era when writing code was expensive and shipping it was the constraint. The cycle’s habits (small PRs, comprehensive CI, careful canary rollouts, rollback as the universal safety net) all encoded the same belief: the dangerous part of engineering is the boundary between “we have it” and “users have it,” and our job is to make that boundary safe to cross.

That belief was correct for two decades. It is no longer correct. AI tools moved most of the danger upstream of the deploy, into the part of the cycle where code becomes plausible without becoming correct. The deploy itself is no longer where the failures live — it is where they manifest, sometimes much later, in shapes the cycle was not designed to catch.

Three of the cycle’s most-relied-on habits stopped working the way they used to. The cycle still runs. The habits still feel like discipline. They are increasingly checking for failures that no longer happen, and missing the failures that do.

1. The small-PR discipline

For a long time, “keep your PRs small” was good advice that everyone agreed on for the same reason: a small diff is easier to review, easier to reason about, easier to roll back. The principle was sound, and it produced a useful heuristic: when in doubt about whether a diff is safe, look at its size.

The principle still survives. The heuristic stopped working.

AI tools produce small diffs with wide commitments. A twenty-line AI-generated diff can silently change an invariant a hundred lines away, can introduce a plausible-looking method call to an API that does not exist, can conform to the local pattern while violating a global one. The size of the diff is no longer a reliable signal about the size of the change; the model can compress arbitrary structural impact into a small number of typed characters. The reviewer who applies the “small = safe” heuristic gets fooled by exactly the diffs that should require the most attention.

The failure mode is hard to spot because nothing in the workflow changed visibly. The PR is small. The reviewer does the usual quick read. The code merges. The bug ships. Two weeks later, when the consequences surface, the team has trouble reconstructing why the obvious-looking change introduced an unobvious problem. The answer is that the diff was small for AI reasons and the review was small for human reasons, and the asymmetry was invisible to both sides at the time.

The remedy is to separate two things that used to move together: diff size (an artifact attribute) and review depth (a practice). Pre-AI, the two correlated tightly. Post-AI, they have to be set independently. A small AI-generated diff in a critical-path module deserves more careful review than a large diff in a peripheral module, which is the inverse of the old heuristic. Teams that update their review practices to match the new asymmetry catch failures the cycle would otherwise have shipped. Teams that do not update them go on approving twenty-line diffs in two minutes and meeting their bug rate three weeks later.
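One way to make that separation concrete is to compute review depth from integration impact first and diff size last. The sketch below is illustrative only: the `Change` fields, the 200-line threshold, and the three depth levels are assumptions made for the example, not a policy taken from any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class ReviewDepth(Enum):
    QUICK = 1
    STANDARD = 2
    DEEP = 3


@dataclass(frozen=True)
class Change:
    lines_changed: int
    touches_critical_path: bool  # hypothetical flag, e.g. derived from an ownership map
    ai_generated: bool           # hypothetical flag, e.g. derived from a PR label


LARGE_DIFF_LINES = 200  # assumed threshold, chosen only for illustration


def review_depth(change: Change) -> ReviewDepth:
    """Choose review depth from where the change lands, then from its size."""
    if change.touches_critical_path and change.ai_generated:
        # A small diff is not evidence of a small change here, so size is ignored.
        return ReviewDepth.DEEP
    if change.touches_critical_path:
        if change.lines_changed > LARGE_DIFF_LINES:
            return ReviewDepth.DEEP
        return ReviewDepth.STANDARD
    if change.ai_generated:
        return ReviewDepth.STANDARD
    # Human-written, peripheral code: size is still a reasonable signal.
    if change.lines_changed > LARGE_DIFF_LINES:
        return ReviewDepth.STANDARD
    return ReviewDepth.QUICK
```

Under these assumptions, `review_depth(Change(20, True, True))` is `DEEP` while `review_depth(Change(800, False, False))` is only `STANDARD`: the twenty-line diff gets the closer look.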

2. Tautological CI

Continuous integration was always a partial gate, not a complete one. Tests catch what they were written to catch; lint catches what it was configured to catch; type checks catch what the type system encodes. The collective coverage was good but never complete, and engineering teams have always operated with the understanding that green CI is a necessary condition for shipping but not a sufficient one.

What changed is more subtle. The CI signal weakened, and the weakness is structural rather than a matter of incomplete coverage.

The mechanism is straightforward once named. AI tools generate both code and tests, often in the same session, often with the tests written immediately after the code they exercise. The tests verify that the implementation matches what the model expected to implement. They do not verify that the implementation matches what the team wanted. CI has become, increasingly, tautological: a measure of whether the model agrees with itself, dressed up as a measure of whether the code is correct.

This is not the same as the tests being wrong; they are usually right at what they check for. It is a measure-of-itself problem: the test catches errors the model would have noticed, and misses errors the model would not have noticed — which is the exact category of error that matters most. A confident-wrongness bug, locally correct and globally wrong, is not the kind of thing the model’s tests look for, because the model didn’t think the bug was there.

Green CI therefore became a thinner signal than it used to be, in a way the dashboard does not reflect. Test coverage might be at 90%. Build status is green. The deploy is safe by the historical definition. The bugs that ship through this gate are the bugs the gate was never going to catch, and they ship more confidently because the gate’s reassurance is stronger than it has ever been.

The remedy is unfashionable: human-authored tests for the contracts that matter. Not the test that exercises the function — the model wrote one of those automatically. The test that encodes the intent of the function, the constraint it is meant to honor, the invariant it is committing to. These tests are slower to write, harder to maintain, and the only ones with a meaningful chance of catching what AI-generated tests cannot. They are the tests the cycle now needs and most teams have not started writing.
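The difference between the two kinds of test is easier to see side by side. In the sketch below, `apply_discount` is a hypothetical function invented for the example. The first test restates one input/output pair the implementation already produces; the second encodes what the function is committing to, whatever the implementation looks like.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function under test: price after a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents - (price_cents * percent) // 100


def test_example_case() -> None:
    # The kind of test that tends to be generated alongside the code:
    # it checks that the implementation does what it already does.
    assert apply_discount(1000, 10) == 900


def test_discount_contract() -> None:
    # Human-authored contract: a discount never makes the price negative,
    # never raises it, and a larger discount never costs the customer more.
    for price in (0, 1, 99, 1000, 123_457):
        previous = price
        for percent in range(0, 101):
            discounted = apply_discount(price, percent)
            assert 0 <= discounted <= price
            assert discounted <= previous
            previous = discounted
```

The contract test says nothing about how the discount is computed, which is the point: it would still fail if a later rewrite rounded in the wrong direction or let the price go negative, even if that rewrite came with its own freshly generated example tests.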

3. The rollback safety net

Rollback was the universal escape hatch of the deployment cycle. Whatever review and CI missed, the rollback would catch — push a bad deploy, see the alarms, revert, fix at leisure. The rollback sat at the bottom of the safety stack, the one practice every team relied on and every team trusted.

It worked because of an implicit assumption: bad deploys announce themselves quickly. The metric goes red, the alert fires, the on-call rolls back within minutes. Total exposure: an hour, maybe two.

That assumption is increasingly false. AI-shaped failures often do not announce themselves quickly. The recognizable shapes of confident wrongness surface days or weeks after the deploy that introduced them. By the time the failure is visible, the deploy is no longer the most recent deploy. A dozen other deploys have happened on top of it. Some of them depend on the bad code. Rolling back the original now requires either rolling back the cascade or hand-fixing the original in place — neither of which is the smooth one-click revert the cycle was designed around.

There is a deeper problem under the surface one. The cycle was designed with three layers of defense — review, CI, rollback — that were independent in the old failure regime. A bug that slipped past one had two more chances to be caught. Each layer compensated, in expectation, for the failures of the others.

Under AI, the three layers stopped being independent. They are all weakened by the same underlying shift — the gap between code being plausible and code being correct — and they are all weakened in the same direction, on the same kinds of failures. The reviewer misses the small diff with wide commitments; CI passes because the model agrees with itself; rollback is no longer fast enough because the failure surfaces too late. Three layers, one shared blind spot. The cycle still has all three in place; in practice it has one.
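The cost of losing independence can be shown with a little arithmetic. The miss rates below are made-up numbers chosen only to illustrate the shape of the effect; they are not measurements. If layers fail independently, the chance that a bug slips past all of them is the product of their miss rates. If they share a blind spot completely, the chance can be as high as the smallest single miss rate, which is as if only one layer were present.

```python
import math
from typing import Sequence


def escape_rate_independent(miss_rates: Sequence[float]) -> float:
    """Probability a bug evades every layer when layers fail independently."""
    return math.prod(miss_rates)


def escape_rate_shared_blind_spot(miss_rates: Sequence[float]) -> float:
    """Upper bound on the escape probability when failures are fully correlated.

    If every bug the strongest layer misses is also missed by the others,
    the extra layers add nothing and the escape rate equals the smallest
    individual miss rate.
    """
    return min(miss_rates)


# Illustrative, assumed miss rates for review, CI, and rollback.
ASSUMED_MISS_RATES = (0.2, 0.3, 0.1)

independent = escape_rate_independent(ASSUMED_MISS_RATES)
correlated = escape_rate_shared_blind_spot(ASSUMED_MISS_RATES)
print(f"independent layers: {independent:.3f}")
print(f"shared blind spot:  {correlated:.3f}")
```

With these assumed rates, the independent case lets through roughly 0.6% of bugs and the fully correlated case lets through 10%. Real systems sit somewhere between the two, and the argument above is that they have moved toward the correlated end.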

The remedy is structural. Shorter feedback loops on the things that actually matter — production monitoring of the contracts that matter, not just the metrics that are easy to graph. Longer canary periods for the kinds of changes most likely to harbor confident-wrongness bugs (small AI-generated diffs in established modules, ironically). Smaller-than-feels-right cohorts at each rollout stage, because the time-to-manifest of an AI-shaped failure can outlast a stage that was sized for the old time-to-manifest. None of these is novel. They are the same gradual-rollout disciplines the field already has, applied with a corrected expectation of when failures actually surface.
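The interaction between stage length, cohort size, and time-to-manifest can be sketched as a small calculation. Everything here is an assumption made for illustration: the `Stage` fields, both rollout plans, and the 72-hour time-to-manifest are hypothetical values, not recommendations.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Stage:
    name: str
    traffic_fraction: float  # share of users on the new version during the stage
    soak_hours: float        # how long the stage runs before promotion


def exposure_when_failure_surfaces(
    stages: Sequence[Stage], time_to_manifest_hours: float
) -> float:
    """Return the share of users on the bad version when the failure becomes visible."""
    elapsed = 0.0
    for stage in stages:
        elapsed += stage.soak_hours
        if time_to_manifest_hours <= elapsed:
            return stage.traffic_fraction
    # The failure surfaced after every stage was promoted: everyone has it.
    return 1.0


# A plan sized for failures that announce themselves within hours (assumed values).
FAST_PLAN = (
    Stage("canary", 0.01, 1),
    Stage("early", 0.10, 2),
    Stage("half", 0.50, 4),
)

# A plan sized for failures that surface after days (assumed values).
SLOW_PLAN = (
    Stage("canary", 0.01, 48),
    Stage("early", 0.05, 72),
    Stage("quarter", 0.25, 96),
)

print(exposure_when_failure_surfaces(FAST_PLAN, 72))
print(exposure_when_failure_surfaces(SLOW_PLAN, 72))
```

For a failure that takes 72 hours to show up, the fast plan has already finished and every user is exposed, while the slow plan is still in its 5% stage. The same function can be run against a team's real stage definitions and its own estimate of how late failures tend to surface.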

Closing

The deployment cycle most teams run is not broken. It is mistuned. It was designed for an asymmetry — code is expensive to write, easy to ship safely with the right gates — that no longer applies. AI made writing cheap. The gates the cycle relied on (small PR, green CI, rollback) all weakened in correlated ways, and the cycle as a whole is now checking for a class of failure that is not the dominant one.

The remedy is not to add new gates. It is to retune the existing gates for what they now have to catch. Review depth that scales with integration impact rather than diff size. Tests written by humans for contracts that matter, not just AI-generated tests for code that exists. Detection windows tuned for failures that surface late. Canary periods that respect the actual time-to-manifest of confident-wrongness bugs. None of these is technical. All of them are process disciplines that have to be rebuilt, because the old disciplines depended on assumptions that no longer hold.

The cycles, before and after retuning, look identical from the outside. Same vendors, same CI/CD, same dashboards, same ceremony. They ship at different rates and discover their differences a year in, once the cumulative cost of the failures one cycle catches and the other doesn’t has surfaced as incidents that a cycle tuned for the actual failure distribution would have prevented.