On-call without an author

  • engineering-practice
  • on-call
  • incident-response
  • debugging

When something breaks in production at three in the morning, the unit of progress is a question. What was this code trying to do? Why does it depend on that field? What did we learn last time this fell over? The on-call rotation has depended, for a long time, on the answers to these questions being reachable — sometimes from the author, sometimes from runbooks, sometimes from the senior engineer who has been there long enough to remember. Each question has a path to an answer. The path is part of how on-call works.

A growing share of the questions an on-call asks now do not have answers at the end of the path. The author, for code in the AI-authored regions, did not write what they accepted. The runbook was tuned to a failure-mode catalog that does not quite fit anymore. The senior engineer remembers the system shape from before the last refactor and does not fully know the shape after. The questions are still the right questions. The answers shifted, and the shift broke four things about incident response that nobody fixed.

1. Debugging under pressure got harder, not easier

The vendor pitch for AI-augmented engineering implies easier debugging. The code is more uniform, the implementations are more idiomatic, the commits are smaller, the diffs are cleaner. All of these are true on a good day. None of them help on a bad one.

Debugging in production, under time pressure, with users affected, is the bad day. The skill it depends on is the same as it has always been: reconstruct what the code is supposed to do, observe what it is actually doing, find the gap, fix the gap. The reconstruction step used to be quick because the author was reachable, the comments were intent-bearing, and the failure mode was usually a recognizable shape — null pointer, race condition, off-by-one, expired token.

The reconstruction step is now slower in three ways. The author cannot be asked, because the author was a model that does not remember. The comments increasingly describe what the code does rather than why, because the model’s comments lean stylistic. And the failure modes include the four shapes of confident wrongness — locally correct and globally wrong, stale pattern, plausible API, invariant violation — which are visually subtle and require longer to recognize than the classical bug shapes.

The on-call rotation has not picked up new debugging skills to match. The skill that used to be spot the bug fast has quietly been replaced by reconstruct intent fast, and the second skill is harder, takes longer to develop, and is not what teams hire for at the on-call level. Mean time to recovery is creeping up on teams that adopted AI tools meaningfully twelve to eighteen months ago, and the creep lives mostly in the reconstruction time, not the fix time.

Teams that respond by hiring more senior on-calls are addressing the symptom. Teams that respond by making intent reconstructable — annotated review comments, design records, explicit invariants — are addressing the cause. The first works in the short run. The second is the only fix that scales.

2. Blast radius opacity

When a senior engineer reads code, part of the read is unconscious estimation: if I change this, what else breaks. The estimate depends on understanding what the code commits to — its contract with the rest of the system, the invariants it relies on, the assumptions baked into its shape. Hand-written code expresses these implicitly. Reading the code carefully, an experienced reader unpacks them within minutes.

AI-authored code is harder to read this way. The code does what it does, but the commitments are unclear. Was this null check protecting something real or just there? Was that ordering chosen for a reason or accepted because it looked right? Is this caching behavior a contract or an accident? The reader cannot tell from the code, because the code does not distinguish chosen patterns from accepted ones. Under time pressure, the on-call does not have the luxury of a long careful read.

The result is that blast radius — the set of things that might break if you change a given piece — gets harder to bound. The conservative response is to assume the radius is wide and move slowly; the aggressive response is to assume the radius is narrow and move fast. Both cost. Conservative on-call adds minutes or hours to recovery. Aggressive on-call ships fixes that introduce new incidents inside the rollback window. Neither is a stable equilibrium.

Call this blast radius opacity: the gap between what the code commits to and what the code reveals about those commitments. The gap used to be small for hand-written code in mature systems. It is wider for AI-authored code, and the on-call rotation has not yet learned to treat the wider gap as a permanent condition rather than a fixable irregularity. Teams that systematically write down which behaviors are contracts and which are coincidences — at module boundaries, in design records, on the wall above the on-call screen if needed — are narrowing the opacity. Teams that don’t are making the on-call’s job a guess against a clock.

3. Runbooks decay against unfamiliar failure shapes

Runbooks are a catalog of recognized failure modes and the appropriate response to each. They are written from observation: someone saw the system fail, wrote down what worked, and the next on-call benefits. The catalog accumulates over years and represents the team’s compressed wisdom about how its own system breaks.

AI-authored systems break in shapes the older catalog does not fully cover. The four shapes of confident wrongness manifest in production differently than the failure modes that filled the older runbook. Locally correct, globally wrong shows up as a feature behaving fine in isolation while a related feature silently misbehaves; the on-call investigating the visible failure is looking at the wrong code. Plausible API shows up as a method call that worked in development against a mock and now fails against the real service in a way the trace doesn’t immediately explain. Stale pattern shows up as code passing CI but tripping a deprecation in a library’s tail-end behavior. Invariant violation shows up as data that types correctly, validates correctly, and silently corrupts a downstream consumer’s assumptions.

A runbook tuned for the older failure-mode distribution misroutes the on-call when the failure is actually one of these. The runbook is not wrong; it is insufficient, and the difference is hard to detect from inside the runbook. The team finds out by watching its mean-time-to-recovery on AI-shaped failures be markedly worse than on classical failures, often without realizing the gap is in the failure shape rather than the on-call’s skill.

The remedy is to add the new failure shapes to the runbook deliberately, with diagnostic patterns and recovery steps, before the team encounters all four in production. Most teams discover them one at a time, painfully, and update the runbook reactively. The reactive path costs incidents that did not need to be costly.

4. The five-whys wall

Post-mortems traditionally drive toward root cause through depth reduction: ask why until further whys stop adding information. Why did the service go down? Because the cache invalidation failed. Why? Because we missed a case in the eviction logic. Why? Because the code was written without considering concurrent writes from the new admin path. Why? Because that interaction was not in the spec. Five layers, root cause, action items. The technique works because each why has an answer that adds something specific.

The technique hits a wall in AI-authored code. The chain often runs: Why did the service go down? Because the cache invalidation failed. Why? Because we missed a case. Why was that case missed? — Because the model produced this implementation and the reviewer accepted it without surfacing the missed case. And then the chain stops. The answer to why was the case missed is structural, not reasoned: nobody chose, the model proposed, the human accepted. There is no fifth why that goes anywhere useful.

Call this the five-whys wall. Two failure modes follow from running into it.

The first is blame the tool. The post-mortem concludes that the model produced bad output, the model is at fault, and the action item is be more careful with model output, which is unactionable in the way drive more carefully is unactionable after a near-miss. The team has identified a target without identifying a remedy. Six months later, a post-mortem of the same shape happens again, and the lesson the team thinks it took from the first one turns out not to have been a lesson.

The second is truncated post-mortem. The team, sensing that the why chain is going nowhere, stops the analysis early and ships an action item about the surface of the bug. The cache invalidation case gets a fix; the structural condition that produced the gap — review without intent reconstruction, accepted code without articulated rationale — stays untouched. The next failure happens in a different surface, with the same structural cause, and is treated as a different incident.

Both failure modes are escapable, with a different post-mortem technique. The shift is to stop asking why was this chosen and start asking what conditions allowed this to be accepted. The first question assumes a chooser. The second admits the choice may not have happened in the strong sense, and instead asks about the review and intake conditions that let the absent choice through. What about our review process did not catch this? What about our spec did not constrain it? What about the codebase did not surface the inconsistency before production? Each of these is a real question with a real answer. The answers are uncomfortable — they are usually about review discipline, spec rigor, or the absence of design records — but they go somewhere.

The five-whys wall is not a defect of the technique; it is a signal that the technique was tuned for code with authors. Post-mortems that do not update for the signal get stuck on it, repeatedly, and the team accumulates the same incident in different shapes until they change the question.

Closing

On-call work is the surface where the cost of accepted-without-chosen code is paid most visibly, and most asymmetrically. The team accepted the code at leisure, in PR review, with the option of asking for clarification. The on-call lives with it under time pressure, in production, without that option. The asymmetry is the cost, and it is paid in the currency the team is least able to refuse: minutes during an incident.

The fixes that reduce the cost are familiar in shape: reconstruct intent at review time so the on-call does not have to reconstruct it under pressure; write design records so the why lives somewhere outside the code that decayed; mark trust topography so the on-call knows which regions of the codebase to suspect first; update post-mortem technique so it does not dead-end on the wall. None of these is technical. All of them are institutional habits that have to be built, and most teams are operating on the habits they had before the codebase changed shape.

The pager goes off the same way it always did. The thing on the other end is not the same thing. Teams that have not yet retuned for the difference are running their incident response on assumptions the codebase no longer supports, and they will keep finding out the assumptions do not hold, one incident at a time, until they decide to retune deliberately rather than by attrition.