Your Specs Are the Leading Indicator Nobody Measures

Three things happened in the last twelve months that should change how you measure engineering quality.

DORA added Rework Rate as a fifth core metric in 2025, acknowledging that shipping fast means nothing if the work comes back. Amazon launched Kiro, a spec-first IDE that generates structured specifications before writing code. And GitHub's Spec Kit crossed 71,000 stars, signaling that the developer community is hungry for specification tooling.

The industry is converging on a single insight: specs are the unit of work. Not tickets. Not stories. Specs.

But here is the gap nobody is filling. We now have tools that generate specs (Kiro), frameworks that measure delivery speed (DORA), and platforms that track code quality post-merge (GitClear). Nobody is measuring whether the specs themselves are any good. That is like measuring deployment frequency without asking whether you are deploying the right thing.

The evidence is already here

Spec quality is not a vibes-based argument. The research is converging from four independent directions.

Human-refined specs cut AI errors by up to half. A paper on arXiv (2602.00180) reported that when humans refine specifications before passing them to LLMs, error rates drop by up to 50%. The specification is the leverage point, not the model.

Specification is the bottleneck, not coding. The METR randomized controlled trial found that experienced developers were 19% slower with AI coding assistants. Not because the tools were bad, but because the developers spent more time specifying, reviewing, and correcting AI output than they saved on writing code. The bottleneck moved upstream.

AI tools correlate with 9x higher code churn. GitClear's 2026 analysis found that repositories using AI coding assistants showed nine times higher code churn -- code written and then rewritten within 14 days. The code is not the problem. The input to the code is the problem.

Without good specs, AI produces 1.7x more bugs. CodeRabbit's analysis of Stack Overflow developer survey data showed that AI-assisted code without strong specifications produced 1.7 times more defects than conventionally written code with clear requirements.

DORA tells you how fast you ship. Rework Rate tells you how much comes back. Neither tells you why. Spec quality does.

3 metrics you can derive from git today

Each one can be computed from your existing git history, with no new tooling required.

Metric 1: Spec-to-Ship Ratio

Definition: The percentage of merged PRs that have a corresponding spec file committed before the first implementation commit on that branch.

Why it matters: This metric tells you what percentage of your engineering output was built against a defined target versus built on assumptions and Slack threads.

How to derive from git:

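# Rough count: distinct spec files ever added vs. merge commits (commit ordering is not checked)
# --format= suppresses commit headers so only file paths are counted
SPECS=$(git log --all --diff-filter=A --name-only --format= -- specs/ docs/specs/ | sort -u | grep -c .)
MERGES=$(git rev-list --all --merges --count)
echo "Spec-to-Ship Ratio: $SPECS specs / $MERGES merges"
# Skip the percentage when there are no merge commits (avoids division by zero)
[ "$MERGES" -gt 0 ] && echo "Ratio: $(echo "scale=2; $SPECS * 100 / $MERGES" | bc)%"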

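This is an approximation. It counts spec files and merge commits across the whole repository without pairing them, and it undercounts merges on teams that squash-merge, since those PRs leave no merge commit. A more precise version cross-references spec file timestamps against the first non-spec commit per feature branch. But even this version tells you something most teams have never quantified: roughly what fraction of features had a written specification at all.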
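The more precise, per-branch check could look something like the sketch below. It is only an illustration under a few assumptions: specs live under specs/ or docs/specs/ (as in the command above), commits are fed oldest-first, and an '@' prefix marks each commit header. The function name spec_came_first and the branch names in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Reads `git log --reverse --name-only --format='@%H'` output on stdin
# (oldest commit first). Prints "spec-first" when a file under specs/ or
# docs/specs/ appears in a strictly earlier commit than the first commit
# touching any other path, or when the branch only touches spec files.
# Otherwise prints "no-spec-first".
spec_came_first() {
  awk '
    /^@/ { commit++; next }                                # commit header line
    /^$/ { next }                                          # blank separator
    /^(docs\/)?specs\// { if (!spec) spec = commit; next } # first spec commit
    { if (!impl) impl = commit }                           # first non-spec commit
    END {
      if (spec && (!impl || spec < impl)) print "spec-first"
      else print "no-spec-first"
    }'
}

# Usage (hypothetical branch names):
#   git log --reverse --name-only --format='@%H' main..feature/login | spec_came_first
```

Running this once per merged branch and counting the "spec-first" results gives the numerator of the ratio. A spec committed in the same commit as implementation code is deliberately not counted, which matches the "before the first implementation commit" wording of the definition.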

When teams do measure it, the number is almost always lower than they expect.

Metric 2: Rework Attribution

Definition: When a file changes within 14 days of its merge commit, classify the cause into one of four categories.

GitClear measures rework volume. This metric measures rework cause. You cannot fix what you cannot classify.

The four categories:

Spec gap -- the requirement was missing or ambiguous. Look for commit messages containing: "missed requirement," "spec didn't cover," "edge case not specified," "adding handling for [scenario] not in spec."
Implementation error -- the spec was clear, the code was wrong. Patterns: "fix typo," "off-by-one," "wrong variable," "null check," "regression from [PR]."
Requirement change -- the business changed its mind after development started. Patterns: "per [stakeholder] feedback," "updated requirement," "scope change," "new ask from product."
Integration issue -- the spec was correct in isolation but broke at system boundaries. Patterns: "API contract mismatch," "upstream dependency changed," "race condition in [service]," "timeout not accounted for."

Why it matters: If the majority of your rework traces to spec gaps, investing in better CI/CD will not help. You need better specs. If it traces to integration issues, you need better interface contracts and dependency mapping. The category tells you where to invest.

How to derive from git: Analyze commit messages and PR descriptions for the patterns above. Manual classification with AI-assist is practical at fewer than 100 PRs per month. For higher volumes, an LLM pass over commit messages with the four-category taxonomy produces consistent results after a short calibration period.
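As a first pass before any manual or LLM review, the keyword patterns above can be matched mechanically. The sketch below is a rough illustration, not a complete classifier: the function name classify_rework, the category labels, and the "unclassified" fallback are assumptions, and it does not apply the 14-day window itself.

```shell
#!/usr/bin/env bash
# Keyword-based first pass over a single commit message.
# Matching is case-insensitive; the first matching category wins, so the
# order of the branches below is a design choice of this sketch.
classify_rework() {
  local msg
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *"missed requirement"*|*"spec didn't cover"*|*"not specified"*|*"not in spec"*)
      echo "spec-gap" ;;
    *"per "*" feedback"*|*"updated requirement"*|*"scope change"*|*"new ask"*)
      echo "requirement-change" ;;
    *"contract mismatch"*|*"upstream dependency"*|*"race condition"*|*"timeout not accounted"*)
      echo "integration-issue" ;;
    *"typo"*|*"off-by-one"*|*"wrong variable"*|*"null check"*|*"regression from"*)
      echo "implementation-error" ;;
    *)
      echo "unclassified" ;;
  esac
}

# Usage: tally categories for recent commit subjects
#   git log --since="30 days ago" --format='%s' |
#     while IFS= read -r subject; do classify_rework "$subject"; done |
#     sort | uniq -c
```

Anything that lands in "unclassified" is the pile worth handing to a human or an LLM, which keeps the expensive review focused on messages the keywords cannot settle.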

Metric 3: Spec Completeness Rate

Definition: The percentage of a spec's final acceptance criteria that were already present before development started, as opposed to added mid-sprint.

Why it matters: Acceptance criteria added after coding starts are reactive, not proactive. High mid-sprint additions correlate directly with scope creep and rework. This metric separates teams that think before they build from teams that discover requirements during implementation.

How to derive from git: Diff the spec file at branch creation against the spec file at PR merge. Count acceptance criteria lines using two patterns:

Given/When/Then blocks: Count Given...When...Then sequences at branch creation versus at merge. New sequences added after the first implementation commit represent mid-development discoveries.
Checkbox items: Specs using task-list format (- [ ] in markdown) are even easier. Count checkboxes at branch creation, count again at merge. The delta is your mid-sprint addition rate.

A Spec Completeness Rate above 85% indicates a team that front-loads requirements work. Below 60% indicates a team that routinely discovers what they are building while they are building it.
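For specs in checkbox format, the count-and-compare step can be sketched as follows. This is a minimal illustration that assumes criteria are only ever added, never removed, between the two revisions; the function names, file path, and revision variables are hypothetical.

```shell
#!/usr/bin/env bash
# Counts markdown task-list items ("- [ ]", "- [x]", "* [ ]") read from stdin.
count_criteria() {
  # grep -c prints 0 but exits non-zero when nothing matches; ignore that status
  grep -cE '^[[:space:]]*[-*] \[[ xX]\]' || true
}

# Spec Completeness Rate as a whole-number percentage:
# criteria present before development / criteria present at merge.
completeness_rate() {
  local before=$1 after=$2
  if [ "$after" -eq 0 ]; then
    echo "n/a"   # a spec with no criteria at merge has no meaningful rate
    return
  fi
  echo $(( before * 100 / after ))
}

# Usage (hypothetical revisions and path):
#   before=$(git show "$START_REV:specs/login.md" | count_criteria)
#   after=$(git show "$MERGE_REV:specs/login.md" | count_criteria)
#   completeness_rate "$before" "$after"
```

Which commit counts as the starting revision is left to the caller: the branch point works when the spec predates the branch, while the parent of the first implementation commit works when the spec is written on the branch itself.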

There are 2 more metrics in the full framework

These three metrics are the start. The complete Spec Quality Metrics Framework includes two additional metrics: Spec Review Depth (are your specs getting real scrutiny or rubber-stamp approval?) and Cognitive Debt Index (what percentage of your AI-generated specs ship with zero human edits?). The framework also includes scoring rubrics calibrated against 200+ engineering assessments and benchmark data segmented by company size.

The Cognitive Debt Index matters most for teams adopting Kiro, Cursor, or other AI-assisted specification tools. Generating a spec is not the same as understanding a spec. If your team ships AI-generated specifications without human review, you are accumulating cognitive debt: the gap between what the system does and what the team understands.

Get the Complete Framework

5 metrics, scoring rubrics, benchmark data, and integration patterns for Jira, ADO, and Linear. Free download.

This is the third post in the Agentic Transformation series. Previously: The Uncomfortable Question DORA Can't Answer and The Real Numbers Behind Feature Delivery.