Skip to content

glitterkill-sdl-mcp — Benchmark Reproduction Guide

glitterkill-sdl-mcp — Benchmark Reproduction Guide

Section titled “glitterkill-sdl-mcp — Benchmark Reproduction Guide”

Source: tools/glitterkill-sdl-mcp/ (pinned: 492b5e8) Date: 2026-04-14 Status: repro guide only — harness not executed


The primary real-world benchmark is at:

tools/glitterkill-sdl-mcp/scripts/real-world-benchmark.ts
tools/glitterkill-sdl-mcp/scripts/real-world-benchmark-matrix.ts

Supporting files:

benchmarks/real-world/
tasks.json — Task definitions (code review, bug fix, feature review, etc.)
matrix.json — Matrix of task × repo combinations
CLAIMS.md — Formal claim policy and gate thresholds
external-repos.config.json — External OSS repos used as benchmark corpora
symptom-tasks.json — Symptom-driven task definitions
config/
benchmark.ci.config.json — CI threshold config (indexing + quality metrics)
benchmark.config.json — Local benchmark config (repo paths)
scripts/
check-benchmark-claims.ts — Validates aggregate output against claim gates
benchmark.ts — Indexing microbenchmark
budget-sensitivity-sweep.ts — Sensitivity analysis over budget parameters

npm scripts:

"benchmark:real": "node scripts/real-world-benchmark.ts",
"benchmark:matrix": "node scripts/real-world-benchmark-matrix.ts",
"benchmark:claims": "node scripts/check-benchmark-claims.ts",
"benchmark:ci": "node dist/cli/index.js benchmark:ci"

The real-world benchmark compares two workflows end-to-end for engineering tasks:

  • Traditional: file search + open files (approximates baseline LLM file reads)
  • SDL-MCP: symbol search → Symbol Cards → slice → skeletons

Task families covered by tasks.json:

FamilyExamples
Code reviewPR review, diff analysis
Feature reviewUnderstanding new code paths
Bug fixingLocating and diagnosing failures
Feature understandingComprehending architectural areas
Code change implementationImplementing a described change
Performance investigationProfiling and bottleneck analysis
Impact analysisBlast-radius for a proposed change
Test triageIdentifying failing or flaky tests

Key metrics (from benchmarks/real-world/README.md):

MetricDefinition
Token Reduction% fewer tokens than traditional at task completion
File Coveragerelevant files found / relevant files total
Symbol Coveragerelevant symbols found / relevant symbols total
Composite ScoreWeighted: token efficiency + coverage quality + efficiency + precision

Formal claim gates (from benchmarks/real-world/CLAIMS.md):

  • p50(capped token reduction) >= 50% per family (realism profile)
  • p25(capped token reduction) >= 40% per family
  • Per-task floor: capped reduction >= 20%

Important distinction: the headline 81% figure in the README applies to tools/list payload size reduction (gateway vs. flat schema surface), which is not what this harness measures. The harness measures end-to-end task-level token reduction. These are separate claims with different methodologies.


  • Node.js 24+ (required by SDL-MCP)
  • A built SDL-MCP (run npm run build:all first)
  • External benchmark repos (downloaded by benchmark:setup-external)
  • The benchmarks/real-world/benchmark.config.json must be built by merging the local config with external-repos config (see step 3 below)
  • A running SDL-MCP server instance (for the SDL-MCP workflow arm)

The benchmark config at benchmarks/real-world/benchmark.config.json contains Windows-absolute paths (F:/Claude/projects/...) indicating it was authored on a Windows machine. These paths must be updated to local absolute paths before running.


Terminal window
cd tools/glitterkill-sdl-mcp
npm install
npm run build:all
Terminal window
cd tools/glitterkill-sdl-mcp
npm run benchmark:setup-external

This clones the OSS repos referenced in benchmarks/real-world/external-repos.config.json (e.g., zod, preact) into .tmp/external-benchmarks/.

Terminal window
cd tools/glitterkill-sdl-mcp
node -e "
const fs = require('fs');
const b = JSON.parse(fs.readFileSync('config/sdlmcp.config.json', 'utf8'));
const e = JSON.parse(fs.readFileSync('benchmarks/real-world/external-repos.config.json', 'utf8'));
fs.writeFileSync(
'benchmarks/real-world/benchmark.config.json',
JSON.stringify({ ...b, repos: [...(b.repos || []), ...(e.repos || [])] }, null, 2) + '\n'
);
"

Then edit benchmarks/real-world/benchmark.config.json to replace any Windows-absolute rootPath values with local paths.

Terminal window
cd tools/glitterkill-sdl-mcp
npm run benchmark:matrix -- \
--matrix benchmarks/real-world/matrix.json \
--config benchmarks/real-world/benchmark.config.json \
--out-dir benchmarks/real-world/runs/coverage-matrix
Terminal window
cd tools/glitterkill-sdl-mcp
node --experimental-strip-types scripts/check-benchmark-claims.ts \
--in benchmarks/real-world/runs/coverage-matrix/aggregate.json \
--profile realism

Exit code 0 = all claim gates pass. Results are written to aggregate.json.

CI regression benchmarks (indexing + quality)

Section titled “CI regression benchmarks (indexing + quality)”
Terminal window
cd tools/glitterkill-sdl-mcp
npm run benchmark:ci

This runs the indexing microbenchmark against CI thresholds defined in config/benchmark.ci.config.json (e.g., max 3000ms per file, min 5 symbols per file, graph connectivity ≥ 0.3).


  • This guide was written from source review only — neither the real-world benchmark nor the CI regression suite has been executed in this analysis.
  • The benchmark config contains Windows-absolute paths that require updating for non-Windows environments; this is a friction point for reproduction.
  • The 81% tools/list token reduction (gateway vs. flat mode) is a separate claim not tested by this harness; it would require measuring the raw schema payload size of tools/list responses under each mode, which is not covered by any script in the repo.
  • The formal claim gates in CLAIMS.md apply only to the benchmarked matrix (specific OSS repos × task families); performance on other repos or task shapes is not guaranteed.
  • LLM-generated Symbol Card summaries are produced at index time; the benchmark does not disclose which model was used for this step or what it cost.