
context-mode — Benchmark Reproduction

Source: tools/context-mode/ (pinned: 601aaf1)
Date: 2026-04-09
Environment: macOS Darwin 25.4.0, Node.js v24.14.1
Outcome: partially verified. Context savings confirmed; session aggregate unverified (fixture-based).


tools/context-mode/tests/benchmark.ts # primary harness
tools/context-mode/tests/context-comparison.ts # side-by-side comparison
tools/context-mode/tests/ecosystem-benchmark.ts # cross-platform scenarios

npm scripts (from package.json):

npm run benchmark # npx tsx tests/benchmark.ts
npm run test:use-cases # npx tsx tests/use-cases.ts
npm run test:compare # npx tsx tests/context-comparison.ts
npm run test:ecosystem # npx tsx tests/ecosystem-benchmark.ts

Run at pinned commit 601aaf1 on macOS Darwin 25.4.0 with Node.js v24.14.1, Bun 1.3.11, and Python 3.14.3. The dev dependencies had to be installed with a separate npm install inside the submodule before npx tsx was available.

Context savings (verified):

| Scenario | Raw | Output | Savings |
| --- | --- | --- | --- |
| API Response (200 users) | ~49 KB | 22 B | 100% |
| Build Output (500 lines) | ~24 KB | 37 B | 100% |
| Log File (1000 entries) | ~78 KB | 67 B | 100% |
| npm ls output | ~39 KB | 25 B | 100% |
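
The 100% figures are rounded: the output column is tens of bytes against tens of kilobytes of raw data. A minimal sketch of the arithmetic follows, assuming savings are computed as one minus the output-to-raw ratio and that "KB" means 1024 bytes. The `savingsPercent` helper is hypothetical and is not taken from the harness.

```typescript
// Hypothetical helper; the harness may compute and round savings differently.
function savingsPercent(rawBytes: number, contextBytes: number): number {
  if (rawBytes <= 0) {
    throw new RangeError("rawBytes must be positive");
  }
  return (1 - contextBytes / rawBytes) * 100;
}

// "~49 KB" is taken as 49 * 1024 bytes here, which is an assumption.
const apiResponseSavings = savingsPercent(49 * 1024, 22);

console.log(apiResponseSavings.toFixed(2)); // prints "99.96"
console.log(Math.round(apiResponseSavings)); // prints 100
```

Under these assumptions the API Response row saves about 99.96% before rounding, which is why the table shows 100%.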

Cold start latency (verified):

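| Runtime | Avg | Min | P95 |
| --- | --- | --- | --- |
| JavaScript (Bun) | 2851 ms | 2315 ms | 3650 ms |
| TypeScript (Bun) | 3418 ms | 3062 ms | 3854 ms |
| Python | 1958 ms | 1558 ms | 2679 ms |
| Shell | 2553 ms | 2402 ms | 3122 ms |
| Perl | 1182 ms | 289 ms | 3671 ms |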
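
For readers comparing their own runs against these columns, the sketch below shows one common way to derive Avg, Min, and P95 from a list of timings. It uses the nearest-rank percentile definition, which is an assumption: the harness may use a different method. The `summarize` function and the sample values are hypothetical.

```typescript
interface LatencySummary {
  avg: number;
  min: number;
  p95: number;
}

// Nearest-rank P95: the smallest sample such that at least 95% of samples
// are less than or equal to it. This definition is an assumption.
function summarize(samplesMs: number[]): LatencySummary {
  if (samplesMs.length === 0) {
    throw new RangeError("need at least one sample");
  }
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const sum = sorted.reduce((acc, value) => acc + value, 0);
  // Integer arithmetic avoids floating-point drift in the rank.
  const rank = Math.ceil((95 * sorted.length) / 100); // 1-based
  return {
    avg: sum / sorted.length,
    min: sorted[0],
    p95: sorted[rank - 1],
  };
}

// Invented samples, not measurements from this reproduction.
const example = summarize([300, 320, 350, 400, 3600]);
console.log(example); // { avg: 994, min: 300, p95: 3600 }
```

A single slow run dominates both the average and the P95 in small samples like this one, a pattern that resembles the wide gap between Min and P95 in the Perl row.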

Note: the README and BENCHMARK.md do not disclose subprocess cold start overhead.

cd tools/context-mode
npm install # install dev deps (tsx, vitest, etc.)
npm run benchmark 2>&1 | tee benchmark-out.txt

Expected output: per-scenario table with raw KB, context KB, and savings %. Compare against BENCHMARK.md figures in the same directory.

Reported figures (from BENCHMARK.md, not independently verified)

| Metric | Value |
| --- | --- |
| Total scenarios | 21 |
| Total raw data | 376 KB |
| Total context consumed | 16.5 KB |
| Overall savings | 96% |

Session aggregate (curated fixture): 177 KB raw → 10.2 KB context (94%, ~45,300 → ~2,600 tokens)
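
The token counts in the session aggregate are consistent with a rough heuristic of 4 bytes per token and 1024 bytes per KB. Both constants are assumptions inferred from the numbers; BENCHMARK.md's actual estimation method was not confirmed here. A short sketch of that arithmetic, with a hypothetical `estimateTokens` helper:

```typescript
const BYTES_PER_KB = 1024; // assumption
const BYTES_PER_TOKEN = 4; // assumption: a size heuristic, not a real tokenizer

// Hypothetical helper: converts a size in KB to an approximate token count.
function estimateTokens(kilobytes: number): number {
  return Math.round((kilobytes * BYTES_PER_KB) / BYTES_PER_TOKEN);
}

const rawTokens = estimateTokens(177); // 45312, reported as ~45,300
const contextTokens = estimateTokens(10.2); // 2611, reported as ~2,600
const sessionSavings = Math.round((1 - 10.2 / 177) * 100); // 94

console.log({ rawTokens, contextTokens, sessionSavings });
```

Because the estimate is a fixed ratio of byte size, the token savings track the byte savings exactly; a real tokenizer could give a different ratio for structured data such as CSV or JSON.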

  • Fixture corpus is curated (Playwright snapshots, GitHub Issues, CSV data, etc.) — not drawn from a real debugging session.
  • Savings on ctx_execute_file (95–100%) depend on the agent writing an effective summarization script. The harness likely uses pre-written scripts optimized for the fixture data.
  • ctx_index+ctx_search savings (44–93%) depend on query quality and corpus density.
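
To illustrate why the savings depend on the summarization script, the sketch below reduces a synthetic 1000-line log to a one-line summary. Everything in it is hypothetical: the log format, the `summarizeLog` function, and the fixture are invented for illustration and are not the harness's scripts or data.

```typescript
// Hypothetical fixture: 1000 synthetic log lines, not the harness's data.
const levels = ["INFO", "WARN", "ERROR"] as const;
const logLines: string[] = Array.from(
  { length: 1000 },
  (_, i) =>
    `2026-04-09T00:00:00Z ${levels[i % 3]} request ${i} handled by worker ${i % 8}`,
);

// Counts lines per log level and returns a single summary line.
// Only this string would need to enter the agent's context.
function summarizeLog(lines: string[]): string {
  const counts = new Map<string, number>();
  for (const line of lines) {
    const level = line.split(" ")[1] ?? "UNKNOWN";
    counts.set(level, (counts.get(level) ?? 0) + 1);
  }
  const parts = [...counts.entries()].map(([level, n]) => `${level}=${n}`);
  return `${lines.length} lines: ${parts.join(", ")}`;
}

const rawLog = logLines.join("\n");
const summary = summarizeLog(logLines);

console.log(summary); // prints "1000 lines: INFO=334, WARN=333, ERROR=333"
console.log(`${rawLog.length} chars raw -> ${summary.length} chars in context`);
```

A script that printed every ERROR line instead of a count would return far more text, so the measured savings reflect the script's design as much as the tool itself.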