# colbymchenry-codegraph — Benchmark Reproduction Guide

Source: tools/colbymchenry-codegraph/ (pinned: 19532a81)
Date: 2026-04-13
Status: repro guide only — harness not executed
## Harness location

The eval runner is at `tools/colbymchenry-codegraph/tests/evaluation/runner.ts`. Supporting files under `tests/evaluation/`:

- `runner.ts` — main runner (CLI entry point)
- `scoring.ts` — recall and MRR scoring functions
- `test-cases.ts` — 12 hardcoded test cases
- `types.ts` — `EvalReport`, `EvalResult`, `EvalTestCase` interfaces

The harness is also wired into `package.json` as:

```json
"eval": "npm run build && npx tsx tests/evaluation/runner.ts"
```

## What the harness measures
The runner tests two CodeGraph APIs:

- `searchNodes` (6 cases) — symbol lookup precision, scored by recall and MRR.
- `findRelevantContext` (6 cases) — exploration quality, scored by recall and edge density.

Pass threshold: recall >= 0.5 (defined in `scoring.ts`).
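The actual `scoring.ts` implementation is not reproduced here, but recall and MRR are conventionally computed as below. This is a hedged sketch with hypothetical signatures — the repo's functions may differ in shape:

```typescript
// Hypothetical sketch of recall and MRR scoring, assuming ranked result
// symbols are compared against an expectedSymbols list (as in test-cases.ts).
// NOT the repo's actual scoring.ts implementation.

// Recall: fraction of expected symbols that appear anywhere in the results.
function recall(results: string[], expected: string[]): number {
  if (expected.length === 0) return 1;
  const found = expected.filter((s) => results.includes(s)).length;
  return found / expected.length;
}

// Mean reciprocal rank: average of 1/(rank of first hit) per expected symbol;
// symbols never returned contribute 0.
function mrr(results: string[], expected: string[]): number {
  if (expected.length === 0) return 0;
  const sum = expected.reduce((acc, s) => {
    const idx = results.indexOf(s);
    return acc + (idx === -1 ? 0 : 1 / (idx + 1));
  }, 0);
  return sum / expected.length;
}

// One of two expected symbols found, at rank 1:
const results = ["TransportService", "TransportHandler"];
const expected = ["TransportService", "Transport"];
console.log(recall(results, expected)); // 0.5 -> exactly at the pass threshold
console.log(mrr(results, expected));    // (1/1 + 0) / 2 = 0.5
```

Under this reading, a case passes when at least half of its `expectedSymbols` show up in the returned results, regardless of rank.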
Important caveat: the 12 test cases are hardcoded to an Elasticsearch/OpenSearch-like codebase (symbols such as TransportService, RestController, AllocationService, BulkRequest). The runner is a regression guard for one target codebase, not a general benchmark. The README headline figures (92% fewer tool calls, 71% faster) come from a separate manual benchmark using VS Code, Excalidraw, Alamofire, and Swift Compiler codebases — the automated runner does not reproduce those figures.
## Environment requirements

- Node.js 18–24 (engine constraint: `>=18.0.0 <25.0.0`)
- npm or npx available
- A pre-indexed codebase with `.codegraph/codegraph.db` present
- The target codebase must contain Elasticsearch/OpenSearch-like symbols for the default test cases to pass; for a different codebase, `tests/evaluation/test-cases.ts` must be modified with project-specific expected symbols
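These requirements can be sanity-checked before a run. The script below is a hypothetical preflight helper (not part of the repo) that verifies the Node engine range and the presence of the index database:

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Hypothetical preflight check for the requirements above; not part of the repo.
function preflight(codebaseDir: string): string[] {
  const problems: string[] = [];

  // Engine constraint from package.json: >=18.0.0 <25.0.0
  const major = Number(process.versions.node.split(".")[0]);
  if (major < 18 || major >= 25) {
    problems.push(`Node ${process.versions.node} is outside the supported 18-24 range`);
  }

  // The eval runner needs a pre-built index in the target codebase.
  const db = join(codebaseDir, ".codegraph", "codegraph.db");
  if (!existsSync(db)) {
    problems.push(`missing index database: ${db}`);
  }
  return problems;
}

// Prints any problems found; an empty array means the environment looks usable.
console.log(preflight("/path/to/target-codebase"));
```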
## How to run

### Step 1: Build the package

```shell
cd tools/colbymchenry-codegraph
npm install
npm run build
```

### Step 2: Prepare a target codebase
Section titled “Step 2: Prepare a target codebase”The default test cases target an Elasticsearch-like Java codebase. To use a different codebase, either:
- Use a clone of Elasticsearch (https://github.com/elastic/elasticsearch) — the test cases reference symbols present in that repo, or
- Edit
tests/evaluation/test-cases.tsto replaceexpectedSymbolsarrays with symbols from your target codebase.
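When editing `test-cases.ts`, an entry plausibly looks like the sketch below. The field names here are hypothetical — check the real `EvalTestCase` interface in `types.ts` before editing:

```typescript
// Hypothetical shape of a test case; the actual EvalTestCase interface
// in tests/evaluation/types.ts may name fields differently.
interface EvalTestCase {
  id: string;
  api: "searchNodes" | "findRelevantContext";
  query: string;
  expectedSymbols: string[];
}

// Swapping the Elasticsearch-specific symbols for ones from your own codebase:
const customCase: EvalTestCase = {
  id: "search-class-exact",
  api: "searchNodes",
  query: "MyProjectService",
  expectedSymbols: ["MyProjectService", "MyProjectServiceImpl"],
};

console.log(customCase.id);
```

The key point is that `expectedSymbols` must name symbols that actually exist in the indexed codebase, or recall will be zero by construction.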
### Step 3: Index the target codebase

```shell
cd /path/to/target-codebase
node /path/to/tools/colbymchenry-codegraph/dist/bin/codegraph.js init -i
```

Or, if installed globally:

```shell
codegraph init -i
```

This creates `.codegraph/codegraph.db`.
### Step 4: Run the eval runner

```shell
cd tools/colbymchenry-codegraph
EVAL_CODEBASE=/path/to/target-codebase npx tsx tests/evaluation/runner.ts
```

Or via the npm script (requires the build step to be complete):

```shell
cd tools/colbymchenry-codegraph
npm run eval -- /path/to/target-codebase
```

### Step 5: Read results
Section titled “Step 5: Read results”The runner prints a per-case table to stdout:
search-class-exact PASS recall=1.00 mrr=1.00 12ms explore-rest-layer FAIL recall=0.25 density=0.42 340ms missed: BaseRestHandler, RestHandlerA JSON report is saved to tests/evaluation/results/<timestamp>.json.
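The saved JSON report can be post-processed for an aggregate view. The sketch below assumes a report shape with per-case `pass` and `recall` fields — the real `EvalReport` interface in `types.ts` may differ:

```typescript
// Hypothetical report shape; the actual EvalReport interface in
// tests/evaluation/types.ts may differ.
interface ReportResult { id: string; pass: boolean; recall: number }
interface EvalReport { results: ReportResult[] }

// Aggregate pass count and mean recall across all cases.
function summarize(report: EvalReport): { passed: number; total: number; meanRecall: number } {
  const total = report.results.length;
  const passed = report.results.filter((r) => r.pass).length;
  const meanRecall =
    total === 0 ? 0 : report.results.reduce((a, r) => a + r.recall, 0) / total;
  return { passed, total, meanRecall };
}

// Against a saved report you would JSON.parse the file at
// tests/evaluation/results/<timestamp>.json; here, a synthetic sample:
const sample: EvalReport = {
  results: [
    { id: "search-class-exact", pass: true, recall: 1.0 },
    { id: "explore-rest-layer", pass: false, recall: 0.25 },
  ],
};
console.log(summarize(sample)); // { passed: 1, total: 2, meanRecall: 0.625 }
```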
## Reproducing the README benchmark (manual, not automated)

The README reports tool-call counts for 6 codebases (VS Code, Excalidraw, Claude Code Python+Rust, Claude Code Java, Alamofire, Swift Compiler), comparing Claude Code with and without CodeGraph. These figures were produced by manually running Claude Code sessions, not by the eval runner.
To reproduce:

1. Install and index one of the named codebases (e.g., the VS Code source).
2. Run Claude Code with CodeGraph enabled and use the exact query from the README benchmark table (e.g., “How does the extension host communicate with the main process?”).
3. Record the tool-call count, elapsed time, and token usage from the Claude Code session log.
4. Repeat without CodeGraph (remove the MCP server config) using the same query.
5. Compare the two sessions.
This reproduction requires a paid Claude Code session and is not automatable without an API-level tool-call tracing harness.
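Turning the two recorded sessions into README-style percentages is simple relative-delta arithmetic. A sketch, using made-up sample counts for illustration (not the README's raw data):

```typescript
// "X% fewer tool calls" / "X% faster" = (without - with) / without * 100.
function percentReduction(withoutCodegraph: number, withCodegraph: number): number {
  return ((withoutCodegraph - withCodegraph) / withoutCodegraph) * 100;
}

// Made-up sample numbers for illustration only; the README's raw
// per-session counts are not reproduced in this guide.
console.log(percentReduction(25, 2).toFixed(0));   // "92" fewer tool calls
console.log(percentReduction(140, 40).toFixed(0)); // "71" faster (seconds)
```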
## Notes on this guide

- This guide was written from source review only — neither the eval runner nor the manual benchmark has been executed.
- No stored results exist in the vendored snapshot (`tests/evaluation/results/` is absent).
- The eval runner uses `npx tsx` to run TypeScript directly; `tsx` is not listed as a dev dependency in `package.json` but is available via `npx`. Alternatively, run it after `npm run build` against the compiled output.