Why Your DORA Dashboard Is Lying to You
Extended briefing
Your DORA dashboard is lying to you. Not with bad data -- with incomplete data. You are tracking deployment frequency. Lead time. Change failure rate. Recovery time. And those numbers might look great. But none of them tell you why your team keeps shipping things that get immediately rewritten. None of them point upstream -- to where the real problem starts. That is what we are talking about today.
GitClear published research in 2026 that stopped a lot of engineering leaders in their tracks. Teams using AI coding tools were seeing code churn -- meaning code written and then quickly rewritten or deleted -- at nine times the rate of teams that were not. Nine times. That is not a rounding error. That is a signal.
The instinct is to blame the AI. But that is the wrong diagnosis. Churn is a symptom. The cause is upstream. The AI is not confused -- the AI is doing exactly what it was asked. The problem is what it was asked to do. Vague requirements. Assumptions baked in. Missing edge cases. In other words: the spec.
And here is what makes this interesting right now. The industry is starting to wake up to it. Amazon launched Kiro -- a spec-first IDE where you write the spec before you write a single line of code. It is a fundamentally different workflow. And on GitHub, a project called Spec Kit hit 71,000 stars. Not a framework. Not a new language. A structured specification toolkit. Engineers are voting with their attention.
But here is the gap nobody has closed yet. We have tools to write better specs. We do not have tools to measure spec quality. We are still flying blind on the upstream signal that drives everything downstream.
So let me walk you through three metrics I use with clients. These are not theoretical. These are things I have measured in real engineering organizations.
The first one is Spec-to-Ship Ratio.
Simple concept. Of the features that made it to production -- how many had a complete spec before development started? Not a ticket. Not a Slack message. A spec. Acceptance criteria. Edge cases. Dependencies. The real thing.
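To make the definition concrete, here is a minimal sketch of how the ratio could be computed from a list of shipped features. The `ShippedFeature` record and its field names are hypothetical, made up for illustration; they simply encode the checklist above (acceptance criteria, edge cases, dependencies, finished before development started). Adapt them to however your tracker stores this.

```python
from dataclasses import dataclass


@dataclass
class ShippedFeature:
    # Hypothetical record for one feature that reached production.
    name: str
    has_acceptance_criteria: bool
    has_edge_cases: bool
    has_dependencies: bool
    spec_done_before_dev: bool  # spec existed before development started

    def had_complete_spec(self) -> bool:
        # "Complete" here means every checklist item was present
        # before development began. A ticket alone does not count.
        return (
            self.spec_done_before_dev
            and self.has_acceptance_criteria
            and self.has_edge_cases
            and self.has_dependencies
        )


def spec_to_ship_ratio(features: list) -> float:
    """Share of shipped features that had a complete spec up front."""
    if not features:
        return 0.0
    complete = sum(1 for f in features if f.had_complete_spec())
    return complete / len(features)


# Illustrative data only, not real measurements.
shipped = [
    ShippedFeature("bulk export", True, True, True, True),
    ShippedFeature("sso login", True, False, True, True),    # no edge cases
    ShippedFeature("audit log", True, True, True, False),    # spec written late
    ShippedFeature("rate limits", True, True, True, True),
]

ratio = spec_to_ship_ratio(shipped)
print(f"Spec-to-Ship Ratio: {ratio:.0%}")
```

The strict all-or-nothing check is a design choice in this sketch. A team could score partial completeness instead, but a binary bar is harder to argue with.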
Most teams have never measured this. And when they do for the first time, the number is almost always lower than they expect. They thought they had a documentation problem. What they actually have is a rework time bomb.
When you do not know your spec-to-ship ratio, you cannot explain your churn. You cannot explain why sprints keep extending. You cannot explain why senior engineers are spending their time in code review catching things that should have been caught weeks earlier.
The second metric is Rework Attribution.
Not all rework is equal. I bucket it into four categories. Spec gap -- meaning the requirement was never defined or defined incorrectly. Implementation error -- meaning the spec was fine but the build did not match it. Requirement change -- meaning the business changed direction mid-flight. And integration issue -- meaning the interfaces between systems were not aligned.
Here is why this matters. Most teams track rework in aggregate. They know it is high. They try to fix it with better standups, tighter sprints, more code review. Nothing moves. The reason is they are not separating the causes. If the majority of your rework traces to spec gaps, no amount of process improvement at the implementation layer will help. You are treating a symptom four steps downstream from the actual problem.
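Separating the causes can be sketched in a few lines. This example assumes each rework item has already been tagged with one of the four categories and an hours estimate. The tagging itself, the `ReworkCause` names, and the sample numbers are all illustrative assumptions, not data from a real team.

```python
from collections import Counter
from enum import Enum


class ReworkCause(Enum):
    # The four buckets described above.
    SPEC_GAP = "spec gap"
    IMPLEMENTATION_ERROR = "implementation error"
    REQUIREMENT_CHANGE = "requirement change"
    INTEGRATION_ISSUE = "integration issue"


def rework_attribution(items: list) -> dict:
    """Return each cause's share of total rework hours.

    `items` is a list of (ReworkCause, hours) pairs.
    """
    hours = Counter()
    for cause, spent in items:
        hours[cause] += spent
    total = sum(hours.values())
    if total == 0:
        return {cause: 0.0 for cause in ReworkCause}
    # Counter returns 0 for causes that never appeared.
    return {cause: hours[cause] / total for cause in ReworkCause}


# Illustrative rework log for one sprint (hours are made up).
rework_log = [
    (ReworkCause.SPEC_GAP, 12),
    (ReworkCause.IMPLEMENTATION_ERROR, 4),
    (ReworkCause.SPEC_GAP, 8),
    (ReworkCause.REQUIREMENT_CHANGE, 3),
    (ReworkCause.INTEGRATION_ISSUE, 3),
]

shares = rework_attribution(rework_log)
for cause, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cause.value:<22} {share:.0%}")
```

Weighting by hours rather than by ticket count is an assumption of this sketch. Either works, as long as the same unit is used every sprint so the trend is comparable.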
The third metric is Spec Completeness Rate.
This one sounds obvious. It is not. Spec completeness is not whether a spec exists -- it is whether the spec was complete enough that no new requirements were discovered during implementation. That is the bar.
I ask teams to track a simple thing: how often does a developer encounter something during build that was not in the spec and has to go back to product or design or a stakeholder to get an answer? Every one of those trips is a gap. And when you start counting them, the number is almost always shocking.
Try tracking it for one sprint, and count every one of those trips. The research backs up why this matters -- a paper posted on arXiv reported that human-refined specs reduce LLM errors by up to 50%. This is not just about human developers. If your specs are incomplete, your AI is guessing. And a CodeRabbit study found 1.7 times more bugs in codebases where specs were weak. The math is not subtle.
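One way to turn that count into a rate is sketched below. It assumes a simple tally of clarification trips per spec over one sprint and treats a spec as complete only if it needed zero trips, which is one reading of "that is the bar" above. The spec IDs and counts are invented for illustration.

```python
def spec_completeness_rate(trips_per_spec: dict) -> float:
    """Share of specs implemented with zero clarification trips.

    `trips_per_spec` maps a spec ID to the number of times a developer
    had to go back to product, design, or a stakeholder during build.
    """
    if not trips_per_spec:
        return 0.0
    complete = sum(1 for trips in trips_per_spec.values() if trips == 0)
    return complete / len(trips_per_spec)


# Illustrative one-sprint tally (IDs and counts are made up).
sprint_trips = {
    "SPEC-101": 0,
    "SPEC-102": 3,
    "SPEC-103": 1,
    "SPEC-104": 0,
    "SPEC-105": 2,
}

rate = spec_completeness_rate(sprint_trips)
total_trips = sum(sprint_trips.values())
print(f"Spec Completeness Rate: {rate:.0%} ({total_trips} clarification trips)")
```

Reporting the raw trip count next to the rate is deliberate in this sketch. The rate shows how many specs cleared the bar, and the count shows how much interruption the incomplete ones caused.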
Now -- those three metrics are in the public blog post. But in the downloadable framework, there are two more that I want to mention because they are the ones that tend to surprise people the most.
The first is Spec Review Depth. Not whether a spec was reviewed -- but how deeply. A spec that got a two-minute skim before approval is not the same as one that went through structured review. Depth matters. We measure it.
The second one is what I call the Cognitive Debt Index. And this one has become really relevant in the last 18 months because of AI-generated specs. Here is what it tracks: AI-generated specs that nobody actually read before implementation started. The team used an AI to generate requirements, the developer used an AI to write the code, and at no point did a human actually sit with the spec and say -- does this make sense? Is this what we meant? That is cognitive debt. You borrowed cognitive work from a machine and you will pay it back in rework. METR ran a randomized controlled trial and found experienced developers were 19 percent slower with AI tools on complex tasks. That is not because AI tools are bad. It is because when the specification layer is weak, AI amplifies the gap.
Both of those metrics are in the framework. Link is in the show notes.
If you want the full framework -- all five metrics, how to measure them, and a scoring rubric -- download it at Just Keen A.I. dot com. It is free. And it is the thing I hand to every new client before we talk about process or tooling.
Next week we are covering rework attribution in depth. The four categories I mentioned today -- we are going to go through each one, how to tag it, how to report it to leadership, and what the data actually tells you about where quality breaks down in your organization.
I'm Jess Keeney from Just Keen A.I. dot com. Thanks for listening.