Free Download · CTO / VP Eng

Spec Quality Metrics Framework

5 metrics, scoring rubrics, benchmark data, and integration patterns. The systematic approach to evaluating specifications that nobody else has published.

Section 1: The 5-Metric Model

Metric 1: Spec-to-Ship Ratio

Definition: The percentage of merged pull requests with a spec artifact committed before the first implementation commit.

This measures spec presence, not quality. Organizations that do not track this consistently tend to discover that large fractions of shipped features were built against informal Slack threads or verbal agreements.

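# Rough proxy: unique spec files ever added vs. merge commits; does not check spec-before-code order
SPECS=$(git log --all --diff-filter=A --name-only --format= -- "specs/" "docs/specs/" | sort -u | grep -c .)
MERGES=$(git rev-list --all --merges --count)
[ "$MERGES" -gt 0 ] && echo "Spec-to-Ship: $(echo "scale=1; $SPECS * 100 / $MERGES" | bc)%"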
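The shell snippet is a repository-wide approximation. A per-PR calculation closer to the definition might look like the following Python sketch. The `PullRequest` record and its fields are hypothetical; the timestamps would need to come from your own git or PR tooling.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class PullRequest:
    """Hypothetical record; populate it from your own git or PR tooling."""
    number: int
    first_impl_commit: datetime
    spec_commit: Optional[datetime] = None  # None means no spec artifact was found


def spec_to_ship(prs: list[PullRequest]) -> float:
    """Percentage of merged PRs whose spec was committed before the first implementation commit."""
    if not prs:
        return 0.0
    with_spec_first = sum(
        1
        for pr in prs
        if pr.spec_commit is not None and pr.spec_commit < pr.first_impl_commit
    )
    return 100.0 * with_spec_first / len(prs)
```

A PR whose spec was committed after implementation began counts the same as a PR with no spec, which matches the "committed before" wording of the definition.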

Metric 2: Rework Attribution

Definition: When a file changes within 14 days of its merge, classify the cause into one of four categories:

Spec Gap -- requirement was missing or ambiguous
Implementation Error -- spec was clear, code was wrong
Requirement Change -- business changed direction mid-flight
Integration Issue -- worked in isolation, broke at boundaries

GitClear measures rework volume. This measures rework cause. The category tells you where to invest.
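Once rework events have been classified by hand, the per-category shares can be tallied with a few lines of Python. This is a minimal sketch; the `ReworkCause` enum and `attribution_shares` function are illustrative names, not part of any tool mentioned here.

```python
from collections import Counter
from enum import Enum


class ReworkCause(Enum):
    SPEC_GAP = "Spec Gap"
    IMPLEMENTATION_ERROR = "Implementation Error"
    REQUIREMENT_CHANGE = "Requirement Change"
    INTEGRATION_ISSUE = "Integration Issue"


def attribution_shares(events: list[ReworkCause]) -> dict[str, float]:
    """Return each category's share of rework events as a percentage."""
    counts = Counter(events)
    total = len(events)
    return {
        cause.value: (100.0 * counts[cause] / total if total else 0.0)
        for cause in ReworkCause
    }
```

The "Spec Gap" share is the value the Section 2 rubric scores as "Rework (Spec Gap %)".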

Metric 3: Spec Completeness Rate

Definition: The percentage of a spec's final acceptance criteria that were already present before development started, as opposed to criteria added mid-sprint.

Diff the spec file at branch creation against the spec at PR merge. Count Given/When/Then sequences and checkbox items (- [ ]) at each point.
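The counting step can be sketched in Python as below. It assumes each line starting with "Given" opens one Given/When/Then sequence and that checked and unchecked checkboxes both count as criteria. Both are simplifying assumptions, and the function names are illustrative.

```python
import re

# A line that opens a Given/When/Then scenario.
GIVEN_RE = re.compile(r"^[ \t]*Given\b", re.IGNORECASE | re.MULTILINE)
# A Markdown checkbox item, checked or unchecked.
CHECKBOX_RE = re.compile(r"^[ \t]*[-*] \[[ xX]\]", re.MULTILINE)


def count_criteria(spec_text: str) -> int:
    """Count Given/When/Then sequences plus checkbox items in a spec."""
    return len(GIVEN_RE.findall(spec_text)) + len(CHECKBOX_RE.findall(spec_text))


def completeness_rate(spec_at_branch: str, spec_at_merge: str) -> float:
    """Share of the criteria at merge that already existed at branch creation."""
    start = count_criteria(spec_at_branch)
    final = count_criteria(spec_at_merge)
    if final == 0:
        return 0.0
    # Cap at 100% in case criteria were removed during the sprint.
    return 100.0 * min(start, final) / final
```

The two inputs could come from `git show <commit>:<path>` at the branch point and at the merge commit.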

Metric 4: Spec Review Depth

Formula: (reviewers x substantive comments) / spec length (paragraphs)

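Threshold bands: <0.5 = under-reviewed, 0.5-1.5 = healthy, >1.5 = investigate (may indicate over-governance or an under-specified spec). Note that the Per-Metric Thresholds rubric in Section 2 gives values above 1.5 the top score, so a 5 on this metric still warrants a check for over-governance.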
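As a worked illustration, the formula and bands translate directly into Python. This is a sketch; what counts as a "substantive" comment is left to the team, and the function names are illustrative.

```python
def review_depth(reviewers: int, substantive_comments: int, paragraphs: int) -> float:
    """(reviewers x substantive comments) / spec length in paragraphs."""
    if paragraphs <= 0:
        raise ValueError("spec length must be at least one paragraph")
    return (reviewers * substantive_comments) / paragraphs


def review_band(depth: float) -> str:
    """Map a Review Depth value onto the three threshold bands."""
    if depth < 0.5:
        return "under-reviewed"
    if depth <= 1.5:
        return "healthy"
    return "investigate"
```

For example, two reviewers leaving six substantive comments on a 12-paragraph spec give a depth of 1.0, which falls in the healthy band.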

Metric 5: Cognitive Debt Index

Definition: The percentage of spec files committed with zero human edits after AI generation.

If 100% of lines in a spec attribute to a single AI-assisted commit with no subsequent human edits before implementation begins, the team is accumulating cognitive debt: the gap between what the system does and what the team understands.

git blame --porcelain docs/specs/feature.md \
  | grep "^[0-9a-f]\{40\}" | awk '{print $1}' | sort -u
# Single unique hash + AI marker in commit message = cognitive debt
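The same check can be scripted over the captured blame output. The Python sketch below parses `git blame --porcelain` text and applies the single-hash rule. It assumes `ai_commits` is a set of commit hashes you have already identified as AI-generated (for example, by an AI marker in the commit message); how that set is built is up to your tooling.

```python
import re

# Porcelain header lines look like: "<40-hex sha> <orig line> <final line> [<group size>]".
HEADER_RE = re.compile(r"^([0-9a-f]{40}) \d+ \d+")


def unique_commits(porcelain: str) -> set[str]:
    """Collect the commit hashes that own at least one line of the file."""
    return {m.group(1) for line in porcelain.splitlines() if (m := HEADER_RE.match(line))}


def is_cognitive_debt(porcelain: str, ai_commits: set[str]) -> bool:
    """True when every line traces to one commit and that commit is AI-generated."""
    commits = unique_commits(porcelain)
    return len(commits) == 1 and commits <= ai_commits


def cognitive_debt_index(blames: dict[str, str], ai_commits: set[str]) -> float:
    """Percentage of spec files (path -> porcelain output) flagged as cognitive debt."""
    if not blames:
        return 0.0
    flagged = sum(is_cognitive_debt(text, ai_commits) for text in blames.values())
    return 100.0 * flagged / len(blames)
```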

Section 2: Scoring Rubrics

Maturity Scale

Score	Label	Definition
1	Ad Hoc	Not tracked; no awareness
2	Emerging	Occasionally measured; no org-level targets
3	Defined	Consistently measured; team-level targets
4	Measured	Automated tracking; org-level targets
5	Optimized	Predictive use; continuous improvement

Per-Metric Thresholds

Metric	1	2	3	4	5
Spec-to-Ship	<20%	20-40%	40-60%	60-80%	>80%
Rework (Spec Gap %)	>60%	40-60%	25-40%	10-25%	<10%
Completeness	<30%	30-50%	50-70%	70-85%	>85%
Review Depth	<0.2	0.2-0.5	0.5-1.0	1.0-1.5	>1.5
Cognitive Debt	>70%	50-70%	30-50%	15-30%	<15%
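The table can be turned into a lookup with a short Python sketch. Adjacent bands share their edge values (40% appears in both "20-40%" and "40-60%"), so the code assumes a shared edge resolves to the higher score, while the open-ended bands ("<20%", ">80%") stay strict as written. That tie-break is an assumption, not something the table specifies.

```python
from bisect import bisect_right

# Band edges copied from the Per-Metric Thresholds table; True means higher is better.
THRESHOLDS = {
    "spec_to_ship": ([20, 40, 60, 80], True),
    "rework_spec_gap": ([10, 25, 40, 60], False),
    "completeness": ([30, 50, 70, 85], True),
    "review_depth": ([0.2, 0.5, 1.0, 1.5], True),
    "cognitive_debt": ([15, 30, 50, 70], False),
}


def maturity_score(metric: str, value: float) -> int:
    """Map a raw metric value onto the 1-5 maturity scale."""
    edges, higher_is_better = THRESHOLDS[metric]
    if not higher_is_better:
        # Mirror the scale so the same "higher is better" logic applies.
        edges = sorted(-e for e in edges)
        value = -value
    if value > edges[-1]:
        return 5  # strictly beyond the last edge, e.g. >80% or <10%
    # Shared interior edges resolve to the higher score (assumption).
    return 1 + bisect_right(edges[:-1], value)
```

For instance, a Spec-to-Ship value of 55% maps to 3, and a Spec Gap share of 30% also maps to 3.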

Composite Score

Composite = (Spec-to-Ship x 0.30) + (Rework Attribution x 0.25)
          + (Spec Completeness x 0.20) + (Review Depth x 0.15)
          + (Cognitive Debt x 0.10)
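A minimal Python sketch of the weighted sum follows. It assumes each term is the metric's 1-5 maturity score from the thresholds table rather than its raw value; because the weights sum to 1.00, the composite then also falls between 1 and 5. The dictionary keys are illustrative names.

```python
# Weights from the composite formula; they sum to 1.00.
WEIGHTS = {
    "spec_to_ship": 0.30,
    "rework_spec_gap": 0.25,
    "completeness": 0.20,
    "review_depth": 0.15,
    "cognitive_debt": 0.10,
}


def composite_score(scores: dict[str, int]) -> float:
    """Weighted sum of per-metric maturity scores (each assumed to be 1-5)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing metric scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

With scores of 3, 2, 4, 3, and 1 in the order listed, the terms are 0.90 + 0.50 + 0.80 + 0.45 + 0.10, for a composite of about 2.75.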

Section 3: Benchmark Data

Research Context

GitClear (2026): AI-assisted workflows show 9x code churn relative to non-AI workflows.
ArXiv 2602.00180: Human-refined specs reduce LLM-generated code errors by up to 50%.
METR RCT: Experienced developers were 19% slower with AI assistants. This framework reads that result as a sign that specification, not code generation, is the bottleneck.
Capers Jones: Requirement-origin defects represent 10-25% of all defects but 40-60% of total remediation cost.

Typical Patterns by Company Size

50-200 employees: High variance on all metrics. Spec practices are often founder-defined rather than systematically adopted. Highest-leverage intervention: standardize spec format so metrics can be measured at all.

200-500 employees: Spec debt becomes visibly expensive. Different teams have different practices. Rework Attribution studies at this stage frequently reveal that Spec Gap is the dominant rework category. Highest-leverage intervention: establish org-level Spec-to-Ship targets and make Rework Attribution visible in retrospectives.

500-2,000 employees: Metrics are often partially measured but siloed. Cognitive Debt is rarely tracked even as AI adoption accelerates. Highest-leverage intervention: automation and aggregation into an executive dashboard.

Correlation: Spec Quality and Rework

Organizations that enforce spec review before development consistently report lower rates of mid-sprint requirement changes and post-merge rework. Moving from 20% to 50% completeness tends to produce larger rework reductions than moving from 70% to 90%, consistent with defect-cost curve literature.

Section 4: Implementation Guide

Level 1: Manual Tracking (~1 hour/sprint)

At the end of each sprint, review closed PRs and score each metric manually. Record in a shared spreadsheet:

Sprint	Merged PRs	Spec-Present	S2S %	Rework Events	Spec Gaps	SG %	Completeness %	Review Depth	CD %	Composite
S1
S2
S3

Level 2: Semi-Automated (Claude Code Skill)

Run /spec-quality score at sprint close. The skill handles git log parsing and rework classification, then outputs a filled scorecard.

Level 3: Fully Automated (CI Integration)

Pre-merge checks: spec presence gate, completeness floor, review depth floor, cognitive debt flag. Post-merge monitor: daily scan for rework events. Sprint close: aggregate and post to dashboard.
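The pre-merge stage could be sketched in Python as follows. Everything here is hypothetical: the `PrSnapshot` record, the floor values, and the choice to block on the gates and floors while treating the cognitive debt flag as a warning are assumptions to adapt per team.

```python
from dataclasses import dataclass


@dataclass
class PrSnapshot:
    """Hypothetical inputs a CI job would collect for one pull request."""
    has_spec: bool
    completeness_pct: float
    review_depth: float
    ai_only_spec: bool


# Placeholder floors, not values prescribed by the framework; tune them per team.
COMPLETENESS_FLOOR = 50.0
REVIEW_DEPTH_FLOOR = 0.5


def pre_merge_check(pr: PrSnapshot) -> tuple[bool, list[str]]:
    """Return (passes, messages). Gates and floors block; the cognitive debt flag only warns."""
    failures: list[str] = []
    warnings: list[str] = []
    if not pr.has_spec:
        failures.append("spec presence gate: no spec artifact found")
    if pr.completeness_pct < COMPLETENESS_FLOOR:
        failures.append(f"completeness floor: {pr.completeness_pct:.0f}% < {COMPLETENESS_FLOOR:.0f}%")
    if pr.review_depth < REVIEW_DEPTH_FLOOR:
        failures.append(f"review depth floor: {pr.review_depth:.2f} < {REVIEW_DEPTH_FLOOR:.2f}")
    if pr.ai_only_spec:
        warnings.append("cognitive debt flag: spec has no human edits after AI generation")
    return (not failures, failures + warnings)
```

A CI job would exit non-zero when the first element is `False` and print the messages in either case.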

Section 5: Integration Patterns

Jira

Use the issue changelog API to track acceptance criteria field changes after the "In Progress" transition:

GET /rest/api/3/issue/{issueKey}/changelog
# Filter for changes to the "Acceptance Criteria" field made after the "In Progress" transition
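Filtering the response might look like the Python sketch below. It assumes the changelog JSON has already been fetched in full (the endpoint is paginated) and has the shape `values[].created` and `values[].items[].field` / `toString`. The field label "Acceptance Criteria" and the status name "In Progress" depend on your Jira configuration.

```python
from datetime import datetime

# Timestamp format assumed for Jira Cloud, e.g. "2024-01-11T09:00:00.000+0000".
JIRA_TS = "%Y-%m-%dT%H:%M:%S.%f%z"


def criteria_changes_after_start(changelog: dict, field_name: str = "Acceptance Criteria") -> int:
    """Count edits to `field_name` made after the first transition to "In Progress"."""
    entries = sorted(
        changelog.get("values", []),
        key=lambda entry: datetime.strptime(entry["created"], JIRA_TS),
    )
    started = False
    changes = 0
    for entry in entries:
        for item in entry.get("items", []):
            if item.get("field") == "status" and item.get("toString") == "In Progress":
                started = True
            elif started and item.get("field") == field_name:
                changes += 1
    return changes
```

Edits made before the transition are ignored, so only mid-sprint changes are counted.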

Azure DevOps

Work item history provides a full audit trail:

GET https://dev.azure.com/{org}/{project}/_apis/wit/workitems/{id}/updates
# Filter for changes to the Acceptance Criteria field (reference name Microsoft.VSTS.Common.AcceptanceCriteria)
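A Python sketch of that filter is shown below. It assumes the updates response has the shape `value[].rev` and `value[].fields[<reference name>].oldValue` / `newValue`, and that the field uses the default reference name; custom process templates may differ.

```python
AC_FIELD = "Microsoft.VSTS.Common.AcceptanceCriteria"


def acceptance_criteria_edits(updates: dict) -> list[int]:
    """Return the revision numbers in which existing acceptance criteria were changed.

    Updates that only set the field for the first time (no "oldValue") are skipped,
    so the initial authoring of the criteria is not counted as an edit.
    """
    revisions = []
    for update in updates.get("value", []):
        change = update.get("fields", {}).get(AC_FIELD)
        if change is not None and "oldValue" in change:
            revisions.append(update["rev"])
    return revisions
```

To restrict this to mid-sprint changes, compare each update's revision against the revision in which `System.State` moved to your in-progress state.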

Aha! / Productboard

Schedule nightly API pulls, beginning at sprint start, to snapshot feature descriptions. Compare the sprint-end snapshot to the sprint-start snapshot to measure completeness rate.

Linear

Sub-issues created after the parent issue transitions to "In Progress" can be treated as mid-sprint criteria additions. Use the GraphQL API history query to find the transition timestamp.
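Once the transition timestamp and the sub-issue list have been fetched, the comparison is a simple filter. The Python sketch below assumes each sub-issue is a dict with `identifier` and `createdAt` keys holding ISO 8601 strings; the GraphQL query that produces them is not shown.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing "Z" for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def mid_sprint_sub_issues(sub_issues: list[dict], in_progress_at: str) -> list[str]:
    """Identifiers of sub-issues created after the parent moved to "In Progress"."""
    cutoff = parse_ts(in_progress_at)
    return [s["identifier"] for s in sub_issues if parse_ts(s["createdAt"]) > cutoff]
```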

Quick-Start Checklist

Define "spec" for your organization: what counts as one canonical source per team
Establish the measurement period: sprint, quarter, or rolling 30 days
Assign one owner per metric
Set baseline scores in your first sprint using Level 1 manual tracking
Add composite score to the engineering retrospective agenda
Set 12-month targets at the next engineering leadership offsite
If AI coding tools are in use, prioritize Cognitive Debt Index tracking immediately

Want help implementing this framework?

Just Keen AI helps PE-backed SaaS companies build AI-ready engineering practices that hold up in diligence.