context-mode — Benchmark Reproduction
Source: tools/context-mode/ (pinned: 601aaf1)
Date: 2026-04-09
Environment: macOS Darwin 25.4.0, Node.js v24.14.1
Outcome: partially verified — context savings confirmed; session aggregate unverified (fixture-based)
Harness location

    tools/context-mode/tests/benchmark.ts            # primary harness
    tools/context-mode/tests/context-comparison.ts   # side-by-side comparison
    tools/context-mode/tests/ecosystem-benchmark.ts  # cross-platform scenarios

npm scripts (from package.json):

    npm run benchmark        # npx tsx tests/benchmark.ts
    npm run test:use-cases   # npx tsx tests/use-cases.ts
    npm run test:compare     # npx tsx tests/context-comparison.ts
    npm run test:ecosystem   # npx tsx tests/ecosystem-benchmark.ts

Reproduction attempt
Run at pinned commit 601aaf1, macOS Darwin 25.4.0, Node.js v24.14.1, Bun 1.3.11, Python 3.14.3.
Dev dependencies required a separate npm install in the submodule before npx tsx was available.
Context savings (verified):
| Scenario | Raw | Output | Savings |
|---|---|---|---|
| API Response (200 users) | ~49 KB | 22 B | 100% |
| Build Output (500 lines) | ~24 KB | 37 B | 100% |
| Log File (1000 entries) | ~78 KB | 67 B | 100% |
| npm ls output | ~39 KB | 25 B | 100% |
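
The 100% figures above are rounded: each output is a few dozen bytes against tens of kilobytes of raw data. The sketch below reworks that arithmetic from the table values. It is an illustration, not the harness's code: `savingsPercent` is a hypothetical helper, the raw sizes are the approximate figures from the table, and 1 KB = 1024 bytes is an assumption.

```typescript
// Hypothetical helper: share of the raw payload kept out of the context window.
function savingsPercent(rawBytes: number, outputBytes: number): number {
  return (1 - outputBytes / rawBytes) * 100;
}

// Sizes copied from the table above. Raw sizes are approximate ("~49 KB"),
// and 1 KB = 1024 bytes is an assumption made for this sketch.
const scenarios = [
  { name: "API Response (200 users)", rawBytes: 49 * 1024, outputBytes: 22 },
  { name: "Build Output (500 lines)", rawBytes: 24 * 1024, outputBytes: 37 },
  { name: "Log File (1000 entries)", rawBytes: 78 * 1024, outputBytes: 67 },
  { name: "npm ls output", rawBytes: 39 * 1024, outputBytes: 25 },
];

for (const s of scenarios) {
  const pct = savingsPercent(s.rawBytes, s.outputBytes);
  // Print the unrounded value next to the whole-percent value shown in the table.
  console.log(`${s.name}: ${pct.toFixed(2)}% (rounds to ${Math.round(pct)}%)`);
}
```

Every scenario lands between 99.8% and 100% before rounding, so the table's 100% should be read as "nearly all", not "all".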
Cold start latency (verified):
| Runtime | Avg | Min | P95 |
|---|---|---|---|
| JavaScript (Bun) | 2851 ms | 2315 ms | 3650 ms |
| TypeScript (Bun) | 3418 ms | 3062 ms | 3854 ms |
| Python | 1958 ms | 1558 ms | 2679 ms |
| Shell | 2553 ms | 2402 ms | 3122 ms |
| Perl | 1182 ms | 289 ms | 3671 ms |
Note: the README and BENCHMARK.md do not disclose subprocess cold start overhead.
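
For readers who want to check cold start numbers independently, the following is a minimal sketch of how such a measurement could be taken. It is not the harness's implementation: `measureColdStart` and `summarize` are hypothetical names, the no-op Node.js workload is a stand-in for the runtimes in the table, and the nearest-rank P95 is an assumption about how the percentile is defined.

```typescript
import { spawnSync } from "node:child_process";
import { performance } from "node:perf_hooks";

// Summary statistics matching the table columns (Avg, Min, P95).
// P95 uses the nearest-rank method; the harness may compute it differently.
function summarize(samplesMs: number[]): { avg: number; min: number; p95: number } {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const avg = sorted.reduce((sum, v) => sum + v, 0) / sorted.length;
  const rank = Math.ceil(0.95 * sorted.length); // 1-based nearest rank
  return { avg, min: sorted[0], p95: sorted[rank - 1] };
}

// Spawn one fresh subprocess per run so every sample includes process startup.
function measureColdStart(command: string, args: string[], runs: number): number[] {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    const result = spawnSync(command, args, { stdio: "ignore" });
    if (result.error) throw result.error;
    samples.push(performance.now() - start);
  }
  return samples;
}

// Example workload (hypothetical): a Node.js process that exits immediately.
const samples = measureColdStart(process.execPath, ["-e", "0"], 5);
console.log(summarize(samples));
```

Swapping the command and arguments for `bun`, `python3`, `sh`, or `perl` would give per-runtime samples, provided those binaries are on the PATH.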
How to reproduce

    cd tools/context-mode
    npm install    # install dev deps (tsx, vitest, etc.)
    npm run benchmark 2>&1 | tee benchmark-out.txt

Expected output: per-scenario table with raw KB, context KB, and savings %.
Compare against BENCHMARK.md figures in the same directory.
Reported figures (from BENCHMARK.md, as reported)
| Metric | Value |
|---|---|
| Total scenarios | 21 |
| Total raw data | 376 KB |
| Total context consumed | 16.5 KB |
| Overall savings | 96% |
Session aggregate (curated fixture): 177 KB raw → 10.2 KB context (94%, ~45,300 → ~2,600 tokens)
- Fixture corpus is curated (Playwright snapshots, GitHub Issues, CSV data, etc.) — not drawn from a real debugging session.
- Savings on `ctx_execute_file` (95–100%) depend on the agent writing an effective summarization script. The harness likely uses pre-written scripts optimized for the fixture data.
- `ctx_index` + `ctx_search` savings (44–93%) depend on query quality and corpus density.
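
The reported percentages and token counts can be sanity-checked with a little arithmetic. The sketch below assumes 1 KB = 1024 bytes and roughly 4 bytes per token; neither figure is stated in BENCHMARK.md, but together they reproduce the reported token counts to within rounding. `estimateTokens` and `roundedSavings` are hypothetical helpers.

```typescript
// Assumptions (not from the source): 1 KB = 1024 bytes, ~4 bytes per token.
const BYTES_PER_KB = 1024;
const BYTES_PER_TOKEN = 4;

// Rough token estimate for a payload given in kilobytes.
function estimateTokens(kilobytes: number): number {
  return Math.round((kilobytes * BYTES_PER_KB) / BYTES_PER_TOKEN);
}

// Savings as a whole percent, matching the precision of the reported figures.
function roundedSavings(rawKb: number, contextKb: number): number {
  return Math.round((1 - contextKb / rawKb) * 100);
}

// Figures as reported in BENCHMARK.md.
const session = { rawKb: 177, contextKb: 10.2 };
const overall = { rawKb: 376, contextKb: 16.5 };

console.log(estimateTokens(session.rawKb));                      // 45312 (reported ~45,300)
console.log(estimateTokens(session.contextKb));                  // 2611 (reported ~2,600)
console.log(roundedSavings(session.rawKb, session.contextKb));   // 94
console.log(roundedSavings(overall.rawKb, overall.contextKb));   // 96
```

The reported figures are therefore internally consistent. That says nothing about whether the fixture corpus is representative, which is the open question for the session aggregate.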