The handoff problem

The vendor pitch for AI coding tools pictures a closed loop: engineer thinks, tool writes, engineer reviews, repeat. In practice the loop isn’t closed. It has seams — places where a person hands work to the tool and takes it back — and the work either compounds across those seams or it leaks out of them. Teams that treat the seams as friction to minimize keep being surprised by the same failures. Teams that treat them as first-class engineering work come out of adoption with something useful.

There are three handoffs that matter. Each has a signature failure mode, each has a remedy that is cheaper than people expect, and none of them is about “better prompting.”

1. Prompt in — silent disagreement

The engineer is picturing one problem. The tool is solving an adjacent one. Neither notices until an artifact exists.

This is the failure mode that eats the most time and gets blamed the least, because by the time you notice, you have output in front of you and the conversation has moved on to whether the output is any good. It isn’t any good; it is an answer to a question you didn’t ask. The question drifted inside the prompt, between sentences, while you were typing.

The remedy is old and boring: write the spec before the prompt, not inside it. A prompt that substitutes for a spec produces an answer that substitutes for a solution. A tight spec is three things — the outcome stated in one sentence, the constraints stated as a short list, and the non-goals stated as another. Anything missing from the spec has to be inferred by the tool, and the tool’s inferences are only correct on average. You do not want the average.
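One way to make the three-part spec concrete is to write it down as data before any prompt exists. The sketch below is an illustration only: the `Spec` class, its field names, and the `gaps` method are hypothetical, and the one-sentence check is a rough punctuation heuristic rather than a real sentence parser.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Hypothetical container for the three parts of a tight spec."""
    outcome: str                          # the outcome, in one sentence
    constraints: tuple[str, ...] = ()     # short list of constraints
    non_goals: tuple[str, ...] = ()       # short list of non-goals

    def gaps(self) -> list[str]:
        """List what is missing, i.e. what the tool would have to infer."""
        problems = []
        # Rough heuristic: split after sentence-ending punctuation.
        sentences = [
            s for s in re.split(r"(?<=[.!?])\s+", self.outcome.strip()) if s
        ]
        if not sentences:
            problems.append("outcome is missing")
        elif len(sentences) > 1:
            problems.append("outcome is longer than one sentence")
        if not self.constraints:
            problems.append("no constraints listed")
        if not self.non_goals:
            problems.append("no non-goals listed")
        return problems


tight = Spec(
    outcome="Retry failed uploads up to three times.",
    constraints=("no new dependencies",),
    non_goals=("changing the upload API",),
)
loose = Spec(outcome="Fix uploads. Also tidy the logging.")
```

Here `tight.gaps()` comes back empty, while `loose.gaps()` names three things the tool would otherwise fill in on its own.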

The tell for silent disagreement is a specific sensation: the output reads as plausible and you can’t say exactly why it’s wrong. That sensation is almost always a spec problem, not a model problem. Sharpen the spec, re-run, and most of the time the “wrong” output becomes visibly off-target rather than diffusely uncomfortable. The cost of this loop is twenty minutes at the front. The cost of skipping it is a day at the back.

The worst version of this failure is when the engineer adapts their mental model to the output mid-conversation, without noticing, and rewrites the question to match the answer. Catching that drift is where senior judgment earns its pay. If you catch yourself doing it, throw the thread out.

2. Artifact out — confident wrongness

The tool returns work. The output is coherent, stylistically appropriate, and wrong in ways that are hard to see.

Confident wrongness has a small number of shapes, and naming them is most of the battle:

Locally correct, globally wrong. The change does the thing the prompt asked for, inside the file it edited, and breaks an invariant that lives somewhere else. This is the modal failure in mature codebases and the one least visible at review time.
Stale pattern. The output uses the right library, in a style that was idiomatic two major versions ago. CI passes because the old API still works; the team’s convention quietly regresses.
Plausible API. A method, flag, or config key that does not exist but obviously should. The output is exactly what the library ought to have looked like, which is why the reviewer doesn’t notice.
Invariant violation. The change respects the types, respects the tests, respects the linter, and silently violates a constraint the code was enforcing by convention. These are the ones that ship.
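Of the four shapes, the plausible API is the cheapest to check mechanically. A minimal sketch, assuming the library in question is Python and is installed in the environment where the check runs: `api_exists` is a hypothetical helper that resolves a dotted name with the standard `importlib` module. It only proves that a name exists in the installed version, not that it does what the output assumes, and importing a module runs that module’s top-level code.

```python
import importlib


def api_exists(dotted_path: str) -> bool:
    """Return True if a name like 'package.module.attr' resolves to an object."""
    parts = dotted_path.split(".")
    # Try the longest importable module prefix, then walk the rest as attributes.
    for i in range(len(parts), 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:i]))
        except (ImportError, ValueError):
            continue  # not a module at this length; try a shorter prefix
        try:
            for name in parts[i:]:
                obj = getattr(obj, name)
        except AttributeError:
            return False  # module is real, the attribute is invented
        return True
    return False


real = api_exists("json.dumps")           # exists in the standard library
invented = api_exists("json.dump_pretty") # plausible, but not a real function
```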

Style and coherence are not evidence of correctness; they are evidence of the training distribution. A clean diff that type-checks is the beginning of evaluation, not the end.

The remedy is an evaluation step that is adversarial by default. Not “read it and accept it if nothing jumps out.” Run it. Break it. Disprove it. Try the input it didn’t think of. Ask what invariant this module was maintaining and whether the diff still maintains it. The default stance toward AI-authored output is not trust-but-verify; it is disbelieve-until-shown. That sounds harsh. It is the only stance that keeps the review bar from sliding.
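As a sketch of what disbelieve-until-shown can look like in code: state the invariant as a function, then throw the inputs the output did not think of at it. Everything here is hypothetical, including the `violations` helper and the `chunk` function, which stands in for a plausible-looking piece of generated code.

```python
def violations(func, invariant, cases):
    """Run func on each case; collect cases that raise or break the invariant."""
    failed = []
    for case in cases:
        try:
            result = func(case)
        except Exception as exc:
            failed.append((case, repr(exc)))
            continue
        if not invariant(case, result):
            failed.append((case, result))
    return failed


def chunk(items, size):
    """Stand-in for generated code: reads fine, drops a trailing partial chunk."""
    return [items[i:i + size] for i in range(0, len(items) - size + 1, size)]


def no_items_lost(items, chunks):
    """Invariant: flattening the chunks gives back the original list."""
    return [x for c in chunks for x in c] == items


# The happy path plus the inputs a skim review would not try.
cases = [[1, 2, 3, 4], [], [1, 2, 3], [1]]
failed = violations(lambda items: chunk(items, 2), no_items_lost, cases)
```

The even-length case passes, which is exactly why the diff would have looked fine. The odd-length and single-item cases end up in `failed`.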

Good teams build an evaluation scaffold around AI output the same way they build a test harness around a service — deliberately, up front, reusable. Bad teams evaluate by vibes per diff, and never notice the bar dropping.

3. Review back — the degraded reviewer

A human accepts, rejects, or modifies the output. This is the handoff that most quietly determines whether adoption compounds or decays, because review quality is a ratchet: once it drops, it tends to stay down.

A reviewer who understands a diff less well than they would if they had written the code themselves is a weaker reviewer. This is not a character flaw; it is physics. Reading code carries less information than writing it; reading code in a style you did not choose, written to solve a problem you did not frame, carries less still. Multiply that across every diff in the queue and the team’s aggregate understanding of its own codebase thins out in a way that is hard to see week over week and obvious over a quarter.

The social dynamic around AI-authored pull requests makes this worse. The author can’t defend the choice, because the author didn’t make it — they accepted it. “Why is it done this way?” gets answered with “that’s what came back,” which ends the conversation without resolving it. The reviewer, sensing that the author has no strong stake, downgrades the review to a skim. The diff merges. Nobody is quite responsible for the code that just shipped.

Good teams respond in one of two ways, usually both:

Slow the review down. More context in the PR description, more time per diff, explicit confirmation that the author has run the adversarial evaluation above. Reviews that used to take five minutes now take fifteen; the fifteen is well spent.
Narrow the scope. Smaller diffs, more reviews per ticket. The temptation with AI tooling is to land larger chunks of work per round because the tool will gladly write them. Resist. Review quality falls as diff size grows, and AI output makes that curve steeper, not flatter.
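A scope limit is easier to hold when something counts for you. The sketch below assumes the diff is available as unified-diff text; `changed_lines`, `needs_split`, and the default budget of 200 lines are all hypothetical, and the right budget is a team decision. The count is approximate: a removed line whose own content starts with `--` looks like a file header and is skipped.

```python
def changed_lines(unified_diff: str) -> int:
    """Count added and removed lines in unified-diff text (approximate)."""
    count = 0
    for line in unified_diff.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header lines, not content changes
        if line.startswith(("+", "-")):
            count += 1
    return count


def needs_split(unified_diff: str, budget: int = 200) -> bool:
    """True when the diff is over the team's line budget (arbitrary default)."""
    return changed_lines(unified_diff) > budget


sample = (
    "--- a/upload.py\n"
    "+++ b/upload.py\n"
    "@@ -1,2 +1,2 @@\n"
    "-retries = 1\n"
    "+retries = 3\n"
    " timeout = 30\n"
)
```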

The team-level signal to watch for is blame log decay. When “git blame” increasingly lands on commits whose author-of-record didn’t write the code and can’t remember the reasoning, review has already degraded. You are now debugging a codebase nobody on the team fully authored. That is a harder codebase to debug than one the team wrote by hand — and it is a much harder codebase than the one the adoption narrative promised.

4. The handoff nobody trains for — throwing it away

The three handoffs above get the attention. The skill that decides whether the three handoffs pay off gets almost none: knowing when to throw the output away.

Not tweak it. Not argue with it. Not salvage it. Discard it, re-read the spec, and start over.

The sunk cost fallacy has a new surface area, and it looks like this: the output is seventy percent right. One more round will fix it. Two more rounds will fix it. At the fifth round you have a Frankenstein diff that nobody, including the tool, has a coherent model of, and the team lead is about to approve it because it’s Friday. The mistake happened at round two, when the output should have been thrown away and the spec re-written. Each subsequent round added inconsistency without adding correctness, and the total inconsistency is now greater than any individual round can repair.

A working heuristic: if the third attempt on a problem is not clearly better than the first, throw all three away and re-spec. The spec is the problem, not the tool. Iterating on bad output only looks cheaper than re-specifying. In practice it is more expensive, because you are paying with attention instead of time, and attention does not restore with a coffee.
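The third-attempt rule can be written down as a check, provided there is some score per attempt. That score is an assumption of this sketch, for example the number of acceptance checks each attempt passes; `should_respec` and its `margin` parameter are hypothetical names, and "clearly better" remains a judgment the margin only approximates.

```python
def should_respec(scores: list[float], margin: float = 0.0) -> bool:
    """Return True when attempt three is not clearly better than attempt one.

    `scores` holds one number per attempt, in order; higher is better.
    `margin` is how much improvement counts as "clearly better".
    """
    if len(scores) < 3:
        return False  # the rule only applies from the third attempt on
    return scores[2] <= scores[0] + margin


stalled = should_respec([4, 5, 4])               # third is no better than first
improving = should_respec([4, 5, 9], margin=2)   # third clears the margin
```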

Engineers who can cut losses on bad output ship better code than engineers who write it all by hand. Engineers who can’t cut losses ship worse code than either — faster, but worse, and reviewed by a team that has already stopped pushing back. The skill is simple to describe and almost impossible to institutionalize, because throwing work away looks like waste to a manager who wasn’t in the loop, and the engineer who does it correctly has nothing to show for the afternoon but a cleaner spec. Protect this skill on your team. It is the one that doesn’t show up on a dashboard.

What this isn’t

None of this is an argument against adoption. The leverage is real. It is an argument that the leverage only materializes if the handoffs are treated as engineering work rather than paperwork. The vendors ship a tool. The handoffs are your job.

A team that writes tight specs, evaluates output adversarially, slows reviews to match the new load, and throws bad output away without ceremony will compound AI-assisted work into something meaningfully better than hand-written work. A team that skips any one of those will adopt the same tool and get a pile of confidently wrong diffs reviewed by increasingly disengaged humans. Same tool. Same vendor. Same benchmarks. Different outcome.

The handoffs aren’t the friction around the work. They are the work.