What the dashboard stopped measuring

  • engineering-org
  • metrics
  • dora

Every engineering metric has two layers. There is the surface: the number on the dashboard, the figure in the quarterly review, the bar in the velocity chart. And there is the signal: the thing the surface was quietly tracking, the underlying state the metric was meant to summarize. Lead time surfaces the median time from commit to deploy; the signal it tracks is how healthy the change pipeline is. Change failure rate surfaces the fraction of deploys that broke prod; the signal is whether we are shipping things we understand. Review latency surfaces the time from PR open to approval; the signal is whether the team is carefully reading each other’s work.

For most of the last decade, surface and signal moved together. The metric was a faithful instrument; the dashboard was a faithful map. AI tools, in the eighteen months they have been seriously in production, broke the link between many surface metrics and the signals they were tracking. The instruments still move. The dashboards are still bright. They are increasingly reading something other than what they used to read, and most teams are still treating the readings as if the calibration held.

1. DORA reads healthier than the codebase

The four DORA metrics — deployment frequency, lead time for changes, change failure rate, mean time to recovery — are the closest thing the field has to a standard dashboard for engineering team health. They are also among the most aggressively decoupled from their underlying signals under AI adoption.

Deployment frequency goes up. This is the first thing leadership notices and the first thing they are pleased about. The signal it was tracking — the team can ship safely, with confidence — is partially preserved (smaller diffs do ship more easily) and partially abandoned (smaller-and-shipped does not mean understood-and-shipped). Higher deploy frequency in an AI-augmented team is consistent with a healthy team. It is also consistent with a team that has accelerated the wrong work and reduced the friction that used to slow it.

Lead time drops. The time from commit to production shortens because the implementation phase shortens. The signal it was tracking — the change pipeline is healthy — gets noisier rather than cleaner. A short lead time used to mean the team had a smooth process; it now also means the team has compressed implementation without necessarily compressing decision-making. A team can have an excellent lead time and an architectural latency — the gap between “we should decide this” and “we have decided this” — that is silently compounding into a future rewrite.

Change failure rate stays roughly level, until it doesn’t. This is the slow one. AI-authored code does not necessarily fail more often than human-authored code; sometimes it fails less, because models have absorbed the common bug shapes and avoid them. What changes is how it fails: in the four shapes of confident wrongness, which classical defect categories don’t catch cleanly. The metric reports the same number while the failure-mode distribution underneath shifts. Six to twelve months in, teams notice the failure rate ticking up, not from incompetence but from the accumulated mass of accepted-but-never-chosen code finally manifesting under load. The metric saw it coming late.

Mean time to recovery goes up, and this is the canary the dashboard can see, if anyone is looking. Recovery time grows as on-call faces more incidents whose code has no available author, more failure shapes the runbook does not catalog, and more post-mortems that hit a wall on “why was this written this way?” Of the four DORA metrics, MTTR is the most honest about the changing conditions. It is also the one most often dismissed as a bad month.

Reading the DORA quartet through its old calibration produces a comforting picture in conditions the calibration no longer fits. Three of the four metrics need to be re-anchored before they will read true again. The team that keeps reading the old dashboard makes confident decisions on data that decoupled from its signal eighteen months ago.

2. Review metrics measure activity, not depth

The instrumentation around code review — PR comments per review, review latency, approval rate, reviewer-assignment fairness — was designed to make sure the team’s review process was active and balanced. The metric was a proxy: if reviews are happening, with comments, in reasonable time, distributed reasonably across the team, then the review process is healthy.

The proxy was always imperfect, because it could not see depth. A reviewer who reads carefully and has nothing to flag leaves zero comments and approves quickly; a reviewer who skims and has nothing to flag does the same. The metric cannot distinguish them.

This imperfection used to be tolerable. The team’s culture, the calibration of senior reviewers, the social cost of approving something that broke later — all of these created an ambient pressure toward depth, even if the metric did not measure it. The metric was a check on the floor, not a measurement of the ceiling.

Under AI, the floor and the ceiling collapsed toward each other. Reviewing AI-authored code is structurally harder than reviewing human-authored code — the reviewer cannot ask the author, the intent is less visible, the recognizable failure shapes are subtle — and the social pressure that sustained depth weakened, because authors do not feel ownership of code they accepted, and reviewers sense the absence of ownership and respond in kind. Depth dropped. The activity metrics did not. The dashboard reports a healthy review process that is, in practice, less critical than it used to be.

The signal worth tracking, whether the team is carefully reading each other’s work, has no current instrument. Indirect proxies exist (review-comment quality, post-merge revert rate, the correlation between a reviewer’s approvals and the defects that escape them), but most teams do not collect them, and the dashboards that do collect them present them alongside activity metrics in a way that flattens the difference. A team that wants to know whether its review process is genuinely working has to look beyond the dashboard. Most teams do not, because the dashboard is reassuring, and reassurance is hard to argue with.
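Two of the indirect proxies named above can be computed crudely from merge records. The sketch below is a minimal illustration, not an established method: the `MergedPR` record and its fields (`reviewer`, `reverted`, `escaped_defect`) are hypothetical names, and how a team decides that a defect "escaped" a given review is left open.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MergedPR:
    """Hypothetical record of one merged pull request."""
    pr_id: int
    reviewer: str          # who approved it
    reverted: bool         # was it reverted after merge?
    escaped_defect: bool   # was a defect later traced back to it?


def post_merge_revert_rate(prs: list[MergedPR]) -> float:
    """Fraction of merged PRs that were later reverted."""
    if not prs:
        return 0.0
    return sum(pr.reverted for pr in prs) / len(prs)


def escaped_defect_rate_by_reviewer(prs: list[MergedPR]) -> dict[str, float]:
    """Per reviewer: fraction of their approvals that let a defect through."""
    approved: dict[str, int] = defaultdict(int)
    escaped: dict[str, int] = defaultdict(int)
    for pr in prs:
        approved[pr.reviewer] += 1
        if pr.escaped_defect:
            escaped[pr.reviewer] += 1
    return {name: escaped[name] / approved[name] for name in approved}


# Illustrative data only, not measurements.
sample_prs = [
    MergedPR(1, "ana", reverted=False, escaped_defect=False),
    MergedPR(2, "ana", reverted=True, escaped_defect=True),
    MergedPR(3, "ben", reverted=False, escaped_defect=False),
    MergedPR(4, "ben", reverted=False, escaped_defect=False),
]

revert_rate = post_merge_revert_rate(sample_prs)
per_reviewer = escaped_defect_rate_by_reviewer(sample_prs)
```

With only a handful of approvals per reviewer, these rates are noisy. They are better used to start a conversation about depth than to score individuals.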

3. The signals worth measuring don’t have instruments yet

Of the metrics that matter most under AI authorship, almost none are on standard dashboards. The instruments do not exist, because the surfaces they would measure are new, and the signals they would track were either invisible before or did not need explicit measurement.

A short list of the signals teams should now be reading, in approximate order of importance.

  • Architectural latency. The gap between “we should decide this” and “we have decided this.” Tracked as: average age of open architectural questions, time from RFC opened to RFC resolved, count of in-flight design discussions older than a sprint. Most teams have no idea what these numbers are. The teams that start measuring them tend to discover that architectural decision-making is the actual throughput bottleneck on the team.
  • Intent reconstructability. The fraction of accepted code that has, attached to it somewhere, an articulated reason for why this and not that. Tracked as: fraction of merged PRs with substantive design rationale in the description, fraction of modules with current design records, count of “I don’t remember why we did it this way” moments per quarter. Approximate measurement is enough; precision is not the point.
  • Review depth. Not review activity. The fraction of merged PRs where at least one reviewer engaged with the contracts the change made or modified, not just the surface correctness of the diff. The hardest of the first three to measure cleanly. Some teams approximate by adding a structured review template that asks the reviewer to explicitly answer “what does this commit to?” before approving.
  • Trust topography coverage. The fraction of the codebase where the team has explicitly written down its current confidence level. A team with high coverage has a navigable codebase; a team with low coverage is transferring confidence as folklore, paid out in onboarding cost and incident cost both.
  • Post-mortem dead-end rate. The fraction of post-mortems that stopped at “the model produced this and we accepted it” without asking what conditions allowed the acceptance. A high rate indicates the post-mortem technique has not been updated; the rate falling over time is a sign the team has internalized a new technique.
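The three architectural-latency measures in the first item above can be sketched in a few lines. This is a rough illustration under stated assumptions: the `DesignQuestion` record is hypothetical, the 14-day sprint length is an assumed default, and the dates are made up for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class DesignQuestion:
    """Hypothetical record of one architectural question or RFC."""
    title: str
    opened: date
    resolved: Optional[date] = None  # None means still open


def average_open_age_days(questions: list[DesignQuestion], today: date) -> float:
    """Average age, in days, of questions that are still open."""
    ages = [(today - q.opened).days for q in questions if q.resolved is None]
    return sum(ages) / len(ages) if ages else 0.0


def count_open_older_than(
    questions: list[DesignQuestion],
    today: date,
    sprint: timedelta = timedelta(days=14),  # assumed sprint length
) -> int:
    """Number of open questions that have outlived one sprint."""
    return sum(
        1 for q in questions if q.resolved is None and today - q.opened > sprint
    )


def mean_resolution_days(questions: list[DesignQuestion]) -> float:
    """Average days from opened to resolved, over resolved questions only."""
    spans = [(q.resolved - q.opened).days for q in questions if q.resolved is not None]
    return sum(spans) / len(spans) if spans else 0.0


# Illustrative data only.
today = date(2024, 3, 1)
questions = [
    DesignQuestion("queue vs. stream", opened=date(2024, 1, 1)),
    DesignQuestion("tenant isolation", opened=date(2024, 2, 20)),
    DesignQuestion("auth boundary", opened=date(2024, 1, 10), resolved=date(2024, 1, 30)),
]

avg_age = average_open_age_days(questions, today)
stale = count_open_older_than(questions, today)
resolution = mean_resolution_days(questions)
```

The hard part is not the arithmetic but the input: someone has to record the date a question was first raised, which is usually earlier than the date an RFC was formally opened.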

None of these is hard to instrument crudely. Most teams have not, because the existing dashboards do not have a slot for them, and adding a metric to a dashboard requires institutional energy that most teams reserve for problems already visible. These problems are not yet visible — that is the point. By the time they show up on the existing dashboard, the cost of the gap has already been paid.
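As one example of how crude a first instrument can be, intent reconstructability might be approximated by scanning merged PR descriptions for language that explains a choice. The sketch below is a heuristic under loud assumptions: the marker words, the 20-word minimum, and the function names are all invented for illustration, and a keyword match is a weak stand-in for a human judging whether the rationale is substantive.

```python
import re

# Assumed marker phrases that tend to appear when a choice is being justified.
RATIONALE_MARKERS = re.compile(
    r"\b(because|instead of|rather than|trade-?off|alternative)\b",
    re.IGNORECASE,
)


def has_rationale(description: str, min_words: int = 20) -> bool:
    """Heuristic: long enough to say something, and contains a marker phrase."""
    long_enough = len(description.split()) >= min_words
    return long_enough and bool(RATIONALE_MARKERS.search(description))


def rationale_fraction(descriptions: list[str]) -> float:
    """Fraction of PR descriptions that pass the heuristic."""
    if not descriptions:
        return 0.0
    return sum(has_rationale(d) for d in descriptions) / len(descriptions)


# Illustrative descriptions only.
descriptions = [
    "Fix typo.",
    "Use a bounded queue instead of an unbounded one because the consumer "
    "can stall for minutes and memory growth took down the worker last quarter.",
    "Generated by the assistant, tests pass, looks fine to me.",
]

fraction = rationale_fraction(descriptions)
```

A number like this is useful mainly as a trend. If it is tracked at all, spot-checking a sample of flagged and unflagged descriptions by hand keeps the heuristic honest.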

Closing

The dashboard is no longer a sufficient instrument for whether the team is healthy. It still produces numbers. The numbers no longer mean what the team thinks they mean. Some are reading low signal because their underlying tracker decoupled. Some are reading high signal in a way that misleads. And the most important readings — architectural latency, intent reconstructability, review depth, trust coverage — are not on the dashboard at all.

The fix is structural, not numerical. Re-anchor the metrics that decoupled by writing down what each one is now actually tracking, and deciding which to keep, which to demote, and which to retire. Add the missing instruments for the signals that matter under the new conditions, accepting that the first versions will be crude and partial. Stop reading the dashboard as a single source of truth for team health, and add the conversations — about depth, about ownership, about confidence — that the dashboard cannot have on the team’s behalf.

A team that does this looks, on the surface, like it has more meetings and a worse-looking dashboard than the team next door that didn’t bother. The dashboard reads worse because it is now reading something true. The meetings happen because the depth they require cannot be summarized in a number. The team next door is shipping the wrong things faster, and the dashboard is not yet reporting it. By the time the dashboard does report it, the cost has been paid: that is the structure of the gap between the surface and the signal, and the gap does not close itself just because the surface is reassuring.