glitterkill-sdl-mcp — Benchmark Reproduction Guide
Source: tools/glitterkill-sdl-mcp/ (pinned: 492b5e8)
Date: 2026-04-14
Status: repro guide only — harness not executed
Harness location
The primary real-world benchmark is at:

    tools/glitterkill-sdl-mcp/scripts/real-world-benchmark.ts
    tools/glitterkill-sdl-mcp/scripts/real-world-benchmark-matrix.ts

Supporting files:

    benchmarks/real-world/
      tasks.json                   - Task definitions (code review, bug fix, feature review, etc.)
      matrix.json                  - Matrix of task × repo combinations
      CLAIMS.md                    - Formal claim policy and gate thresholds
      external-repos.config.json   - External OSS repos used as benchmark corpora
      symptom-tasks.json           - Symptom-driven task definitions

    config/
      benchmark.ci.config.json     - CI threshold config (indexing + quality metrics)
      benchmark.config.json        - Local benchmark config (repo paths)

    scripts/
      check-benchmark-claims.ts    - Validates aggregate output against claim gates
      benchmark.ts                 - Indexing microbenchmark
      budget-sensitivity-sweep.ts  - Sensitivity analysis over budget parameters

npm scripts:

    "benchmark:real": "node scripts/real-world-benchmark.ts",
    "benchmark:matrix": "node scripts/real-world-benchmark-matrix.ts",
    "benchmark:claims": "node scripts/check-benchmark-claims.ts",
    "benchmark:ci": "node dist/cli/index.js benchmark:ci"

What the harness measures
The real-world benchmark compares two workflows end-to-end for engineering tasks:
- Traditional: file search + open files (approximates baseline LLM file reads)
- SDL-MCP: symbol search → Symbol Cards → slice → skeletons
Task families covered by tasks.json:
| Family | Examples |
|---|---|
| Code review | PR review, diff analysis |
| Feature review | Understanding new code paths |
| Bug fixing | Locating and diagnosing failures |
| Feature understanding | Comprehending architectural areas |
| Code change implementation | Implementing a described change |
| Performance investigation | Profiling and bottleneck analysis |
| Impact analysis | Blast-radius for a proposed change |
| Test triage | Identifying failing or flaky tests |
Key metrics (from benchmarks/real-world/README.md):
| Metric | Definition |
|---|---|
| Token Reduction | % fewer tokens than traditional at task completion |
| File Coverage | relevant files found / relevant files total |
| Symbol Coverage | relevant symbols found / relevant symbols total |
| Composite Score | Weighted: token efficiency + coverage quality + efficiency + precision |
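
The ratio metrics in the table can be illustrated with a minimal sketch. The function names and the handling of an empty ground-truth set are assumptions made for illustration; this guide did not inspect how the harness actually computes these values, and the Composite Score weights are not reproduced here because they are not stated above.

```typescript
// Illustrative only: hypothetical helpers, not the harness's own code.

/** Percent fewer tokens used by the SDL-MCP arm relative to the traditional arm. */
function tokenReductionPct(traditionalTokens: number, sdlTokens: number): number {
  if (traditionalTokens <= 0) {
    throw new RangeError("traditionalTokens must be positive");
  }
  return ((traditionalTokens - sdlTokens) / traditionalTokens) * 100;
}

/** Fraction of relevant items (files or symbols) that a workflow found. */
function coverage(found: Iterable<string>, relevant: Iterable<string>): number {
  const relevantSet = new Set(relevant);
  // Assumption: an empty ground-truth set scores 0 rather than raising.
  if (relevantSet.size === 0) return 0;
  const foundSet = new Set(found);
  let hits = 0;
  for (const item of relevantSet) {
    if (foundSet.has(item)) hits += 1;
  }
  return hits / relevantSet.size;
}
```

For example, a task that costs 1000 tokens in the traditional arm and 250 in the SDL-MCP arm has a 75% token reduction, and finding 2 of 4 relevant files gives a File Coverage of 0.5.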
Formal claim gates (from benchmarks/real-world/CLAIMS.md):
- `p50(capped token reduction) >= 50%` per family (realism profile)
- `p25(capped token reduction) >= 40%` per family
- Per-task floor: `capped reduction >= 20%`
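
The three gates above can be sketched as a per-family check. This is a hedged illustration, not the logic of check-benchmark-claims.ts: the names `percentile` and `checkFamilyGates` are hypothetical, the input is assumed to be already-capped reduction percentages for one task family, and linear interpolation between ranks is an assumed percentile method that the real script may not share.

```typescript
// Illustrative only: hypothetical names and an assumed percentile method.

interface GateResult {
  family: string;
  p50: number;
  p25: number;
  min: number;
  pass: boolean;
}

/** Percentile with linear interpolation between closest ranks (an assumption). */
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new RangeError("no values");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = (p / 100) * (sorted.length - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (rank - lo);
}

/** Applies the p50 >= 50, p25 >= 40, and per-task floor >= 20 gates to one family. */
function checkFamilyGates(family: string, cappedReductionsPct: number[]): GateResult {
  const p50 = percentile(cappedReductionsPct, 50);
  const p25 = percentile(cappedReductionsPct, 25);
  const min = Math.min(...cappedReductionsPct);
  return { family, p50, p25, min, pass: p50 >= 50 && p25 >= 40 && min >= 20 };
}
```

Under these assumptions, a family with capped reductions of 30, 45, 55, 60, and 70 passes (p50 = 55, p25 = 45, floor = 30), while a single task below 20% fails the family regardless of its median.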
Important distinction: the headline 81% figure in the README applies to tools/list payload size reduction (gateway vs. flat schema surface), which is not what this harness measures. The harness measures end-to-end task-level token reduction. These are separate claims with different methodologies.
Environment requirements
- Node.js 24+ (required by SDL-MCP)
- A built SDL-MCP (run `npm run build:all` first)
- External benchmark repos (downloaded by `benchmark:setup-external`)
- The `benchmarks/real-world/benchmark.config.json` must be built by merging the local config with the external-repos config (see step 3 below)
- A running SDL-MCP server instance (for the SDL-MCP workflow arm)
The benchmark config at benchmarks/real-world/benchmark.config.json contains Windows-absolute paths (F:/Claude/projects/...) indicating it was authored on a Windows machine. These paths must be updated to local absolute paths before running.
How to run
Step 1: Build SDL-MCP

    cd tools/glitterkill-sdl-mcp
    npm install
    npm run build:all

Step 2: Download external benchmark repos

    cd tools/glitterkill-sdl-mcp
    npm run benchmark:setup-external

This clones the OSS repos referenced in benchmarks/real-world/external-repos.config.json (e.g., zod, preact) into .tmp/external-benchmarks/.
Step 3: Build merged benchmark config

    cd tools/glitterkill-sdl-mcp
    node -e "
    const fs = require('fs');
    const b = JSON.parse(fs.readFileSync('config/sdlmcp.config.json', 'utf8'));
    const e = JSON.parse(fs.readFileSync('benchmarks/real-world/external-repos.config.json', 'utf8'));
    fs.writeFileSync(
      'benchmarks/real-world/benchmark.config.json',
      JSON.stringify({ ...b, repos: [...(b.repos || []), ...(e.repos || [])] }, null, 2) + '\n'
    );
    "

Then edit benchmarks/real-world/benchmark.config.json to replace any Windows-absolute rootPath values with local paths.
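
The manual path edit in step 3 could be scripted along the following lines. This is a sketch under stated assumptions: the config is assumed to hold a `repos` array whose entries carry a string `rootPath` (the only shape implied by this guide), and `findWindowsRootPaths` and `rebaseRootPaths` are hypothetical helpers that do not exist in the repo.

```typescript
// Illustrative only: hypothetical helpers over an assumed config shape.

interface RepoEntry {
  rootPath?: string;
  [key: string]: unknown;
}

interface BenchmarkConfig {
  repos?: RepoEntry[];
  [key: string]: unknown;
}

// Matches drive-letter paths such as "X:/..." or "X:\...".
const WINDOWS_ABSOLUTE = /^[A-Za-z]:[\\/]/;

/** Lists rootPath values that still look like Windows-absolute paths. */
function findWindowsRootPaths(config: BenchmarkConfig): string[] {
  return (config.repos ?? [])
    .map((repo) => repo.rootPath)
    .filter((p): p is string => typeof p === "string" && WINDOWS_ABSOLUTE.test(p));
}

/** Returns a copy of the config with one path prefix swapped for another. */
function rebaseRootPaths(
  config: BenchmarkConfig,
  fromPrefix: string,
  toPrefix: string,
): BenchmarkConfig {
  const normalize = (p: string): string => p.replace(/\\/g, "/");
  const from = normalize(fromPrefix);
  return {
    ...config,
    repos: (config.repos ?? []).map((repo) => {
      if (typeof repo.rootPath !== "string") return repo;
      const current = normalize(repo.rootPath);
      return current.startsWith(from)
        ? { ...repo, rootPath: toPrefix + current.slice(from.length) }
        : repo;
    }),
  };
}
```

Running `findWindowsRootPaths` on the rebased config and checking for an empty result is one way to confirm that no Windows-absolute path was missed before starting the matrix run.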
Step 4: Run the matrix benchmark

    cd tools/glitterkill-sdl-mcp
    npm run benchmark:matrix -- \
      --matrix benchmarks/real-world/matrix.json \
      --config benchmarks/real-world/benchmark.config.json \
      --out-dir benchmarks/real-world/runs/coverage-matrix

Step 5: Validate claim thresholds

    cd tools/glitterkill-sdl-mcp
    node --experimental-strip-types scripts/check-benchmark-claims.ts \
      --in benchmarks/real-world/runs/coverage-matrix/aggregate.json \
      --profile realism

Exit code 0 means all claim gates pass. The script validates the aggregate.json that the matrix run in step 4 wrote to its output directory.
CI regression benchmarks (indexing + quality)

    cd tools/glitterkill-sdl-mcp
    npm run benchmark:ci

This runs the indexing microbenchmark against CI thresholds defined in config/benchmark.ci.config.json (e.g., max 3000 ms per file, min 5 symbols per file, graph connectivity ≥ 0.3).
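
A threshold check of this kind can be sketched as follows. The field names (`maxMsPerFile`, `minSymbolsPerFile`, `minGraphConnectivity`) and both interfaces are hypothetical stand-ins for the three example thresholds quoted above; the actual keys in config/benchmark.ci.config.json were not verified for this guide.

```typescript
// Illustrative only: hypothetical field names, not the real config schema.

interface CiThresholds {
  maxMsPerFile: number;
  minSymbolsPerFile: number;
  minGraphConnectivity: number;
}

interface IndexingMeasurement {
  msPerFile: number;
  symbolsPerFile: number;
  graphConnectivity: number;
}

/** Returns one message per violated threshold; an empty array means the run passes. */
function findThresholdViolations(m: IndexingMeasurement, t: CiThresholds): string[] {
  const violations: string[] = [];
  if (m.msPerFile > t.maxMsPerFile) {
    violations.push(`indexing took ${m.msPerFile} ms per file (max ${t.maxMsPerFile})`);
  }
  if (m.symbolsPerFile < t.minSymbolsPerFile) {
    violations.push(`found ${m.symbolsPerFile} symbols per file (min ${t.minSymbolsPerFile})`);
  }
  if (m.graphConnectivity < t.minGraphConnectivity) {
    violations.push(`graph connectivity ${m.graphConnectivity} (min ${t.minGraphConnectivity})`);
  }
  return violations;
}
```

Collecting every violation instead of stopping at the first one makes a failing CI run easier to diagnose in a single pass.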
Notes on this guide
- This guide was written from source review only; neither the real-world benchmark nor the CI regression suite has been executed in this analysis.
- The benchmark config contains Windows-absolute paths that must be replaced with local paths on any other machine; this is a friction point for reproduction.
- The 81% `tools/list` token reduction (gateway vs. flat mode) is a separate claim not tested by this harness; it would require measuring the raw schema payload size of `tools/list` responses under each mode, which is not covered by any script in the repo.
- The formal claim gates in `CLAIMS.md` apply only to the benchmarked matrix (specific OSS repos × task families); performance on other repos or task shapes is not guaranteed.
- LLM-generated Symbol Card summaries are produced at index time; the benchmark does not disclose which model was used for this step or what it cost.
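
The `tools/list` measurement that the third note describes as missing could start from a size comparison like the one below. This is a sketch only: `payloadBytes` and `payloadReductionPct` are hypothetical helpers, capturing the two responses from a running server is left out, and serialized byte size is used as a rough proxy because this guide does not establish how the README counted tokens for its 81% figure.

```typescript
// Illustrative only: hypothetical helpers; byte size is a proxy, not a token count.

/** UTF-8 byte length of a payload once serialized as compact JSON. */
function payloadBytes(payload: unknown): number {
  return new TextEncoder().encode(JSON.stringify(payload)).length;
}

/** Percent size reduction of the gateway payload relative to the flat payload. */
function payloadReductionPct(flatPayload: unknown, gatewayPayload: unknown): number {
  const flatBytes = payloadBytes(flatPayload);
  if (flatBytes === 0) throw new RangeError("flat payload is empty");
  return ((flatBytes - payloadBytes(gatewayPayload)) / flatBytes) * 100;
}
```

A faithful reproduction would replace the byte count with the same tokenizer the README used, if that can be determined.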