An open scoring rubric. An invitation to replicate.
The rubric is the same one used to score the certification track. It's public, replicable, and frozen for the duration of each report's reference quarter.
Dimensions
Three dimensions.
Every challenge, and every report, is scored along these three. No composite. No black box.
Dimension
Time-to-correct-fix
Time from session start to a fix that passes the challenge's hidden acceptance tests. Measured in wall-clock time, with idle gating so that periods of inactivity don't count against the candidate.
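A minimal sketch of idle-gated timing, assuming the session emits activity timestamps (edits, test runs, and so on) and that idle gating caps each gap between consecutive events at a fixed threshold. The rubric doesn't publish its threshold or exact gating rule; the 5-minute `IDLE_THRESHOLD` and the `time_to_correct_fix` helper are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable

# Hypothetical idle threshold; the rubric does not publish its actual value.
IDLE_THRESHOLD = timedelta(minutes=5)


def time_to_correct_fix(
    session_start: datetime,
    activity: Iterable[datetime],
    fix_passed_at: datetime,
    idle_threshold: timedelta = IDLE_THRESHOLD,
) -> timedelta:
    """Idle-gated wall-clock time from session start to the passing fix.

    Each gap between consecutive activity events counts in full up to
    `idle_threshold`; anything beyond that is treated as idle and dropped.
    """
    if fix_passed_at < session_start:
        raise ValueError("fix cannot pass before the session starts")
    # Keep only events strictly inside the session, in chronological order.
    events = sorted(t for t in activity if session_start < t < fix_passed_at)
    points = [session_start, *events, fix_passed_at]
    total = timedelta(0)
    for prev, curr in zip(points, points[1:]):
        total += min(curr - prev, idle_threshold)
    return total


start = datetime(2024, 1, 1, 10, 0)
events = [start + timedelta(minutes=m) for m in (2, 4, 30)]
# Gaps of 2, 2, 26 (capped to 5) and 3 minutes.
print(time_to_correct_fix(start, events, start + timedelta(minutes=33)))
# 0:12:00
```

Capping gaps rather than discarding them keeps short pauses for thought on the clock while stopping a long absence from inflating the time.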
Dimension
30-day regression rate
Fraction of submissions whose fix is reverted, patched, or rolled back inside a 30-day post-merge window. Measured against the canonical regression test set.
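One way to compute the rate, assuming each submission records its merge time and the time of its first revert, patch, or rollback, if any. The `Submission` type and `regression_rate` helper are hypothetical, counting day 30 as inside the window is an assumption, and the sketch assumes every submission's 30-day window has already closed (counting still-open windows would understate the rate).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence

WINDOW = timedelta(days=30)


@dataclass(frozen=True)
class Submission:
    merged_at: datetime
    # Hypothetical field: first revert, patch, or rollback of this fix, if any.
    # How such events are detected is outside this sketch.
    regressed_at: Optional[datetime] = None


def regression_rate(subs: Sequence[Submission], window: timedelta = WINDOW) -> float:
    """Fraction of submissions that regressed within `window` after merge."""
    if not subs:
        raise ValueError("no submissions")
    regressed = sum(
        1
        for s in subs
        if s.regressed_at is not None
        and s.merged_at <= s.regressed_at <= s.merged_at + window
    )
    return regressed / len(subs)


merged = datetime(2024, 3, 1)
subs = [
    Submission(merged, merged + timedelta(days=10)),  # inside the window
    Submission(merged, merged + timedelta(days=45)),  # too late to count
    Submission(merged),
    Submission(merged),
]
print(regression_rate(subs))
# 0.25
```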
Dimension
Defensibility score
Blind panel score (0–5) on how well the candidate's replay walks a reviewer through the decision points. Inter-rater agreement is reported per round.
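The rubric reports inter-rater agreement per round without naming the statistic. As one illustrative choice, not necessarily the one used, here is Fleiss' kappa for a round in which every replay is scored by the same number of panelists. The `fleiss_kappa` name and the list-of-scores input layout are assumptions.

```python
from collections import Counter
from typing import Sequence

SCALE = range(0, 6)  # panel scores run from 0 to 5


def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for one round.

    `ratings[i]` holds every panelist's score for replay i; each replay
    must be scored by the same number of panelists (at least two).
    Returns NaN when every score in the round is identical, since chance
    agreement is then 1 and kappa is undefined.
    """
    if not ratings:
        raise ValueError("no rated replays in this round")
    n = len(ratings[0])
    if n < 2 or any(len(row) != n for row in ratings):
        raise ValueError("each replay needs the same number (>= 2) of raters")
    num_replays = len(ratings)
    totals: Counter = Counter()
    p_bar = 0.0
    for row in ratings:
        counts = Counter(row)
        if any(score not in SCALE for score in counts):
            raise ValueError(f"score outside 0..5 in {row}")
        totals.update(counts)
        # Observed agreement for this replay: agreeing rater pairs / all pairs.
        p_bar += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
    p_bar /= num_replays
    # Expected chance agreement from the pooled score distribution.
    p_e = sum((c / (num_replays * n)) ** 2 for c in totals.values())
    if p_e == 1.0:
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


print(round(fleiss_kappa([[4, 4, 4], [2, 2, 3], [5, 5, 5]]), 3))
# 0.69
```

Fleiss' kappa treats the six scores as unordered categories, so a 4-versus-5 split counts the same as 0-versus-5. An ordinal or interval-weighted statistic such as Krippendorff's alpha would give partial credit for near-misses.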
Task categories
Twelve categories.
Each task is tagged by category. Reports break out findings by category and experience band.
- Debugging — single-service
- Debugging — distributed
- Security — auth / authz
- Security — injection / SSRF
- Refactoring — within-module
- Refactoring — cross-module
- Feature build — bounded
- Feature build — cross-cutting
- System design — write-path
- System design — read-path
- Code review — accept / reject
- Incident response — diagnosis
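A sketch of the category-by-band breakdown, assuming each result carries a category tag, an experience band, its time-to-correct-fix in minutes, and a regression flag. The slugs in `CATEGORIES` are hypothetical stand-ins for the twelve categories listed above, the band labels are invented for illustration, and median time plus regression rate are just two example per-cell statistics.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, NamedTuple, Tuple

# Hypothetical slugs for the twelve task categories.
CATEGORIES = frozenset({
    "debugging/single-service", "debugging/distributed",
    "security/auth-authz", "security/injection-ssrf",
    "refactoring/within-module", "refactoring/cross-module",
    "feature-build/bounded", "feature-build/cross-cutting",
    "system-design/write-path", "system-design/read-path",
    "code-review/accept-reject", "incident-response/diagnosis",
})


class Result(NamedTuple):
    category: str
    band: str  # experience band; labels such as "junior" are hypothetical
    minutes_to_fix: float
    regressed: bool


def breakdown(results: Iterable[Result]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Group results by (category, band) and summarize each cell."""
    cells = defaultdict(list)
    for r in results:
        if r.category not in CATEGORIES:
            raise ValueError(f"unknown category: {r.category}")
        cells[(r.category, r.band)].append(r)
    return {
        key: {
            "n": len(rs),
            "median_minutes_to_fix": median(r.minutes_to_fix for r in rs),
            "regression_rate": sum(r.regressed for r in rs) / len(rs),
        }
        for key, rs in cells.items()
    }


rows = [
    Result("debugging/distributed", "junior", 40, False),
    Result("debugging/distributed", "junior", 60, True),
    Result("debugging/distributed", "senior", 20, False),
]
print(breakdown(rows)[("debugging/distributed", "junior")])
# {'n': 2, 'median_minutes_to_fix': 50.0, 'regression_rate': 0.5}
```

Keeping the cell size `n` next to each statistic makes thin cells visible before anyone reads too much into them.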