Spec Quality Metrics Framework
Five metrics, scoring rubrics, benchmark data, and integration patterns: a systematic approach to evaluating specifications.
Section 1: The 5-Metric Model
Metric 1: Spec-to-Ship Ratio
Definition: The percentage of merged pull requests with a spec artifact committed before the first implementation commit.
This measures spec presence, not quality. Organizations that do not track this consistently tend to discover that large fractions of shipped features were built against informal Slack threads or verbal agreements.
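# Rough proxy only: compares spec files added with merge commits; it does not check commit order.
# --format= suppresses commit headers so that only file paths are counted.
SPECS=$(git log --all --diff-filter=A --name-only --format= -- "specs/" "docs/specs/" | sort -u | grep -c .)
MERGES=$(git rev-list --all --merges --count)
[ "$MERGES" -gt 0 ] && echo "Spec-to-Ship: $(echo "scale=1; $SPECS * 100 / $MERGES" | bc)%"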
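The shell snippet only approximates the definition, because it never compares commit order. A minimal sketch of the definition itself follows, assuming per-PR records have already been extracted from your Git host. The `MergedPR` type and its field names are hypothetical, not part of any tool.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class MergedPR:
    """Hypothetical record for one merged pull request."""
    number: int
    first_impl_commit_at: datetime
    spec_committed_at: Optional[datetime] = None  # None = no spec artifact found


def spec_to_ship_ratio(prs: list[MergedPR]) -> float:
    """Percentage of merged PRs whose spec predates the first implementation commit."""
    if not prs:
        return 0.0
    spec_first = sum(
        1 for pr in prs
        if pr.spec_committed_at is not None
        and pr.spec_committed_at < pr.first_impl_commit_at
    )
    return 100.0 * spec_first / len(prs)


prs = [
    MergedPR(101, datetime(2025, 3, 4), datetime(2025, 3, 1)),  # spec first
    MergedPR(102, datetime(2025, 3, 5), datetime(2025, 3, 9)),  # spec written after code
    MergedPR(103, datetime(2025, 3, 6)),                        # no spec at all
    MergedPR(104, datetime(2025, 3, 8), datetime(2025, 3, 7)),  # spec first
]
print(f"Spec-to-Ship: {spec_to_ship_ratio(prs):.1f}%")
```

In this sample, two of the four PRs count, so the ratio is 50%. A spec written after the code (PR 102) is treated the same as no spec.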
Metric 2: Rework Attribution
Definition: When a file changes within 14 days of its merge, classify the cause into one of four categories:
- Spec Gap -- requirement was missing or ambiguous
- Implementation Error -- spec was clear, code was wrong
- Requirement Change -- business changed direction mid-flight
- Integration Issue -- worked in isolation, broke at boundaries
GitClear measures rework volume. This measures rework cause. The category tells you where to invest.
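Once each rework event has been labeled with one of the four categories above, the per-category shares are a simple tally. The sketch below assumes the labeling has already been done (by hand or by a classifier); the enum and function names are illustrative.

```python
from collections import Counter
from enum import Enum


class ReworkCause(Enum):
    SPEC_GAP = "Spec Gap"
    IMPLEMENTATION_ERROR = "Implementation Error"
    REQUIREMENT_CHANGE = "Requirement Change"
    INTEGRATION_ISSUE = "Integration Issue"


def attribution_shares(events: list[ReworkCause]) -> dict[ReworkCause, float]:
    """Return each category's share of all rework events, as a percentage."""
    counts = Counter(events)
    total = len(events)
    return {
        cause: (100.0 * counts[cause] / total if total else 0.0)
        for cause in ReworkCause
    }


# Ten labeled rework events from one hypothetical sprint.
events = (
    [ReworkCause.SPEC_GAP] * 5
    + [ReworkCause.IMPLEMENTATION_ERROR] * 2
    + [ReworkCause.REQUIREMENT_CHANGE] * 2
    + [ReworkCause.INTEGRATION_ISSUE] * 1
)
for cause, share in attribution_shares(events).items():
    print(f"{cause.value}: {share:.0f}%")
```

The Spec Gap share is the figure that feeds the "Rework (Spec Gap %)" row of the scoring rubric.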
Metric 3: Spec Completeness Rate
Definition: The percentage of acceptance criteria present before development started, versus criteria added mid-sprint.
Diff the spec file at branch creation against the spec at PR merge. Count Given/When/Then sequences and checkbox items (- [ ]) at each point.
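The counting step can be sketched as follows, assuming the two spec versions are already available as strings (for example from `git show <commit>:<path>`). Treating each "Given" line as the start of one sequence, and capping the rate at 100% when criteria were removed, are assumptions of this sketch.

```python
import re

# One Given/When/Then sequence is counted once, at its "Given" line.
GIVEN = re.compile(r"^\s*(?:[-*]\s*)?Given\b", re.IGNORECASE | re.MULTILINE)
# Markdown checkbox items, checked or unchecked: "- [ ]" or "- [x]".
CHECKBOX = re.compile(r"^\s*[-*]\s*\[[ xX]\]", re.MULTILINE)


def count_criteria(spec_text: str) -> int:
    """Count acceptance criteria: Given/When/Then sequences plus checkbox items."""
    return len(GIVEN.findall(spec_text)) + len(CHECKBOX.findall(spec_text))


def completeness_rate(spec_at_branch: str, spec_at_merge: str) -> float:
    """Share of the final criteria that already existed at branch creation."""
    at_merge = count_criteria(spec_at_merge)
    if at_merge == 0:
        return 0.0  # nothing countable in the final spec
    return min(100.0, 100.0 * count_criteria(spec_at_branch) / at_merge)


before = """- [ ] User can reset password
Given a locked account
When the user requests a reset
Then an email is sent
"""
after = before + """- [ ] Reset link expires after 24 hours
- [x] Rate-limit reset requests
"""
print(f"Completeness: {completeness_rate(before, after):.0f}%")
```

Here two criteria existed at branch creation and four at merge, so the rate is 50%.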
Metric 4: Spec Review Depth
Formula: (reviewers x substantive comments) / spec length (paragraphs)
Threshold bands: <0.5 = under-reviewed, 0.5-1.5 = healthy, >1.5 = investigate (may indicate over-governance or an under-specified spec).
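As a worked example of the formula and bands above, here is a small sketch. The function names are illustrative, and deciding which comments count as "substantive" is left to the caller.

```python
def review_depth(reviewers: int, substantive_comments: int, paragraphs: int) -> float:
    """(reviewers x substantive comments) / spec length in paragraphs."""
    if paragraphs <= 0:
        raise ValueError("spec length must be at least one paragraph")
    return (reviewers * substantive_comments) / paragraphs


def review_band(depth: float) -> str:
    """Map a depth value onto the three threshold bands."""
    if depth < 0.5:
        return "under-reviewed"
    if depth <= 1.5:
        return "healthy"
    return "investigate"


for reviewers, comments, paragraphs in [(1, 3, 20), (2, 6, 12), (3, 10, 10)]:
    depth = review_depth(reviewers, comments, paragraphs)
    print(f"{depth:.2f} -> {review_band(depth)}")
```

The three sample specs come out at 0.15 (under-reviewed), 1.00 (healthy), and 3.00 (investigate).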
Metric 5: Cognitive Debt Index
Definition: The percentage of spec files committed with zero human edits after AI generation.
If 100% of lines in a spec attribute to a single AI-assisted commit with no subsequent human edits before implementation begins, the team is accumulating cognitive debt: the gap between what the system does and what the team understands.
git blame --porcelain docs/specs/feature.md \
| grep "^[0-9a-f]\{40\}" | awk '{print $1}' | sort -u
# Single unique hash + AI marker in commit message = cognitive debt
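To turn the per-file check into the percentage the definition asks for, the blame output of every spec file can be aggregated. The sketch below works on captured `git blame --porcelain` text. How AI-assisted commits are identified (the `ai_commits` set) is an assumption left to the caller, for example a marker in the commit message.

```python
import re

# In porcelain output, each blamed line is introduced by a header that starts
# with the full 40-character commit hash; content lines start with a tab.
HEADER = re.compile(r"^([0-9a-f]{40}) \d+ \d+", re.MULTILINE)


def blamed_commits(porcelain: str) -> set[str]:
    """Unique commit hashes that own at least one line of the file."""
    return set(HEADER.findall(porcelain))


def is_cognitive_debt(porcelain: str, ai_commits: set[str]) -> bool:
    """True when every line belongs to a single AI-assisted commit."""
    commits = blamed_commits(porcelain)
    return len(commits) == 1 and commits <= ai_commits


def cognitive_debt_index(blames: dict[str, str], ai_commits: set[str]) -> float:
    """Percentage of spec files flagged as unedited AI output."""
    if not blames:
        return 0.0
    flagged = sum(is_cognitive_debt(text, ai_commits) for text in blames.values())
    return 100.0 * flagged / len(blames)


AI, HUMAN = "a" * 40, "b" * 40  # placeholder hashes for the example
blames = {
    "docs/specs/untouched.md": f"{AI} 1 1 2\nauthor Bot\n\t# Feature\n{AI} 2 2\n\tGiven a user\n",
    "docs/specs/edited.md": f"{AI} 1 1 1\n\t# Feature\n{HUMAN} 2 2 1\n\tGiven a locked account\n",
}
print(f"Cognitive Debt Index: {cognitive_debt_index(blames, {AI}):.0f}%")
```

One of the two sample files was never touched by a human, so the index is 50%.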
Section 2: Scoring Rubrics
Maturity Scale
| Score | Label | Definition |
|---|---|---|
| 1 | Ad Hoc | Not tracked; no awareness |
| 2 | Emerging | Occasionally measured; no org-level targets |
| 3 | Defined | Consistently measured; team-level targets |
| 4 | Measured | Automated tracking; org-level targets |
| 5 | Optimized | Predictive use; continuous improvement |
Per-Metric Thresholds
| Metric | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Spec-to-Ship | <20% | 20-40% | 40-60% | 60-80% | >80% |
| Rework (Spec Gap %) | >60% | 40-60% | 25-40% | 10-25% | <10% |
| Completeness | <30% | 30-50% | 50-70% | 70-85% | >85% |
| Review Depth | <0.2 | 0.2-0.5 | 0.5-1.0 | 1.0-1.5 | >1.5 |
| Cognitive Debt | >70% | 50-70% | 30-50% | 15-30% | <15% |
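The table can be applied mechanically. A minimal sketch follows; the metric keys are illustrative. Because adjacent bands share endpoints, the tie-breaking is an assumption: the strict outer edges (for example "<20%" and ">80%") are honored, and a value on an interior boundary goes to the better band.

```python
from bisect import bisect_right

# Cut points copied from the table, ordered from score 1 toward score 5,
# paired with whether a higher raw value is better.
THRESHOLDS = {
    "spec_to_ship": ([20, 40, 60, 80], True),
    "rework_spec_gap": ([60, 40, 25, 10], False),
    "completeness": ([30, 50, 70, 85], True),
    "review_depth": ([0.2, 0.5, 1.0, 1.5], True),
    "cognitive_debt": ([70, 50, 30, 15], False),
}


def maturity_score(metric: str, value: float) -> int:
    """Map a raw metric value to its 1-5 maturity score."""
    cuts, higher_is_better = THRESHOLDS[metric]
    if not higher_is_better:
        # Negate both sides so the same "bigger is better" logic applies.
        cuts, value = [-c for c in cuts], -value
    if value > cuts[3]:
        return 5
    return 1 + bisect_right(cuts[:3], value)


raw = {
    "spec_to_ship": 55,
    "rework_spec_gap": 45,
    "completeness": 72,
    "review_depth": 0.75,
    "cognitive_debt": 12,
}
print({name: maturity_score(name, value) for name, value in raw.items()})
```

For the sample values, the scores are 3, 2, 4, 3, and 5 respectively.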
Composite Score
Composite = (Spec-to-Ship x 0.30) + (Rework Attribution x 0.25)
+ (Spec Completeness x 0.20) + (Review Depth x 0.15)
+ (Cognitive Debt x 0.10)
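A worked example of the weighted sum, assuming each term is the metric's 1-5 maturity score from the rubric (the formula does not say so explicitly). The dictionary keys are illustrative.

```python
# Weights copied from the composite formula; they sum to 1.0.
WEIGHTS = {
    "spec_to_ship": 0.30,
    "rework_attribution": 0.25,
    "spec_completeness": 0.20,
    "review_depth": 0.15,
    "cognitive_debt": 0.10,
}


def composite_score(scores: dict[str, int]) -> float:
    """Weighted sum of the five per-metric maturity scores (each 1-5)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing metric scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


scores = {
    "spec_to_ship": 3,
    "rework_attribution": 2,
    "spec_completeness": 4,
    "review_depth": 3,
    "cognitive_debt": 5,
}
print(f"Composite: {composite_score(scores):.2f}")
```

With these scores the terms are 0.90 + 0.50 + 0.80 + 0.45 + 0.50, for a composite of 3.15 on the same 1-5 scale.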
Section 3: Benchmark Data
Research Context
- GitClear (2026): AI-assisted workflows show 9x code churn relative to non-AI workflows.
- arXiv 2602.00180: Human-refined specs reduce LLM-generated code errors by up to 50%.
- METR RCT: Experienced developers were 19% slower with AI assistants. This framework reads that result as a sign that specification is the bottleneck.
- Capers Jones: Requirement-origin defects represent 10-25% of all defects but 40-60% of total remediation cost.
Typical Patterns by Company Size
50-200 employees: High variance on all metrics. Spec practices are often founder-defined rather than systematically adopted. Highest-leverage intervention: standardize spec format so metrics can be measured at all.
200-500 employees: Spec debt becomes visibly expensive. Different teams have different practices. Rework Attribution studies at this stage frequently reveal that Spec Gap is the dominant rework category. Highest-leverage intervention: establish org-level Spec-to-Ship targets and make Rework Attribution visible in retrospectives.
500-2,000 employees: Metrics often partially measured but siloed. Cognitive Debt is rarely tracked even as AI adoption accelerates. Highest-leverage intervention: automation and aggregation into an executive dashboard.
Correlation: Spec Quality and Rework
Organizations that enforce spec review before development consistently report lower rates of mid-sprint requirement changes and post-merge rework. Moving from 20% to 50% completeness tends to produce larger rework reductions than moving from 70% to 90%, consistent with defect-cost curve literature.
Section 4: Implementation Guide
Level 1: Manual Tracking (~1 hour/sprint)
At the end of each sprint, review closed PRs and score each metric manually. Record in a shared spreadsheet:
| Sprint | Merged PRs | Spec-Present | S2S % | Rework Events | Spec Gaps | SG % | Completeness % | Review Depth | CD % | Composite |
|---|---|---|---|---|---|---|---|---|---|---|
| S1 | | | | | | | | | | |
| S2 | | | | | | | | | | |
| S3 | | | | | | | | | | |
Level 2: Semi-Automated (Claude Code Skill)
Run /spec-quality score at sprint close. The skill handles git log parsing and rework classification, then outputs a filled scorecard.
Level 3: Fully Automated (CI Integration)
Pre-merge checks: spec presence gate, completeness floor, review depth floor, cognitive debt flag. Post-merge monitor: daily scan for rework events. Sprint close: aggregate and post to dashboard.
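The pre-merge portion could be sketched as a single function that a CI job calls per pull request. Everything here is hypothetical: the `SpecCheckInput` type, the default floors, and the choice to make the cognitive debt check a non-blocking flag rather than a gate.

```python
from dataclasses import dataclass


@dataclass
class SpecCheckInput:
    """Hypothetical per-PR inputs produced by earlier pipeline steps."""
    spec_present: bool
    completeness_pct: float   # Metric 3 for this PR's spec
    review_depth: float       # Metric 4 for this PR's spec
    unedited_ai_spec: bool    # Metric 5 signal for this PR's spec


def pre_merge_check(
    pr: SpecCheckInput,
    completeness_floor: float = 50.0,  # assumed floor, tune per team
    review_depth_floor: float = 0.5,   # lower edge of the "healthy" band
) -> tuple[list[str], list[str]]:
    """Return (blocking failures, non-blocking flags) for one pull request."""
    failures: list[str] = []
    flags: list[str] = []
    if not pr.spec_present:
        failures.append("spec presence gate")
    if pr.completeness_pct < completeness_floor:
        failures.append("completeness floor")
    if pr.review_depth < review_depth_floor:
        failures.append("review depth floor")
    if pr.unedited_ai_spec:
        flags.append("cognitive debt flag")
    return failures, flags


failures, flags = pre_merge_check(
    SpecCheckInput(spec_present=True, completeness_pct=42.0,
                   review_depth=0.8, unedited_ai_spec=True)
)
print("block merge:" if failures else "ok:", failures, flags)
```

A CI wrapper would exit non-zero when `failures` is non-empty and post `flags` as a comment on the pull request.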
Section 5: Integration Patterns
Jira
Use the issue changelog API to track acceptance criteria field changes after the "In Progress" transition:
GET /rest/api/3/issue/{issueKey}/changelog
# Filter for changes to "Acceptance Criteria" field after sprint start
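The filtering step might look like the sketch below, which works on an already-fetched changelog page and makes no network calls. It assumes the response is a page object with a `values` list whose entries carry a `created` timestamp and an `items` list with `field`, `fromString`, and `toString`. It also assumes the field is displayed as "Acceptance Criteria"; this is typically a custom field, so the name may differ in your instance.

```python
from datetime import datetime, timezone

JIRA_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"  # e.g. 2025-03-04T09:15:00.000+0000


def criteria_changes_after(
    changelog_page: dict,
    started_at: datetime,  # timezone-aware "In Progress" transition time
    field_name: str = "Acceptance Criteria",
) -> list[dict]:
    """Collect edits to the acceptance criteria field made after started_at."""
    changes = []
    for history in changelog_page.get("values", []):
        created = datetime.strptime(history["created"], JIRA_TIME_FORMAT)
        if created <= started_at:
            continue
        for item in history.get("items", []):
            if item.get("field") == field_name:
                changes.append({
                    "created": created,
                    "from": item.get("fromString"),
                    "to": item.get("toString"),
                })
    return changes


sample_page = {
    "values": [
        {"created": "2025-03-03T09:00:00.000+0000",
         "items": [{"field": "Acceptance Criteria", "fromString": None, "toString": "AC1"}]},
        {"created": "2025-03-06T14:30:00.000+0000",
         "items": [{"field": "status", "fromString": "To Do", "toString": "In Progress"},
                   {"field": "Acceptance Criteria", "fromString": "AC1", "toString": "AC1\nAC2"}]},
    ]
}
sprint_start = datetime(2025, 3, 4, 12, 0, tzinfo=timezone.utc)
print(len(criteria_changes_after(sample_page, sprint_start)), "mid-sprint change(s)")
```

Only the second history entry falls after the transition, so the sample reports one mid-sprint change.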
Azure DevOps
Work item history provides a full audit trail:
GET https://dev.azure.com/{org}/{project}/_apis/wit/workitems/{id}/updates
# Filter for changes to the Microsoft.VSTS.Common.AcceptanceCriteria field
Aha! / Productboard
Schedule an API pull at sprint start (or nightly throughout the sprint) to snapshot feature descriptions. Compare the sprint-end snapshot to the sprint-start snapshot to measure completeness rate.
Linear
Sub-issues created after the parent issue transitions to "In Progress" are mid-sprint criteria additions. Use the GraphQL API history query to find the transition timestamp.
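Once the transition timestamp and the sub-issue creation times are in hand, the comparison is a simple filter. The sketch below assumes ISO 8601 timestamps with a trailing "Z" and a `createdAt` key on each sub-issue record. Both are assumptions about the shape of your query results, not a statement of the Linear schema.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def mid_sprint_additions(parent_started_at: str, sub_issues: list[dict]) -> list[dict]:
    """Sub-issues created after the parent moved to "In Progress"."""
    started = parse_ts(parent_started_at)
    return [s for s in sub_issues if parse_ts(s["createdAt"]) > started]


sub_issues = [
    {"title": "Validate email format", "createdAt": "2025-03-03T10:00:00.000Z"},
    {"title": "Handle expired links", "createdAt": "2025-03-06T16:20:00.000Z"},
    {"title": "Add rate limiting", "createdAt": "2025-03-08T09:05:00.000Z"},
]
added = mid_sprint_additions("2025-03-04T09:00:00.000Z", sub_issues)
print([s["title"] for s in added])
```

In the sample, two of the three sub-issues were created after the transition and would count as mid-sprint criteria additions.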
Quick-Start Checklist
- Define "spec" for your organization: what counts as a spec, with one canonical source per team
- Establish the measurement period: sprint, quarter, or rolling 30 days
- Assign one owner per metric
- Set baseline scores in your first sprint using Level 1 manual tracking
- Add composite score to the engineering retrospective agenda
- Set 12-month targets at the next engineering leadership offsite
- If AI coding tools are in use, prioritize Cognitive Debt Index tracking immediately