
# tirth8205-code-review-graph — Benchmark Reproduction Guide


**Source:** `tools/tirth8205-code-review-graph/` (pinned: v2.3.1, commit `36777165`)
**Date:** 2026-04-13
**Status:** repro guide only — harness not executed in this session


## Layout

The eval runner lives inside the Python package:

```
tools/tirth8205-code-review-graph/
└── code_review_graph/eval/
    ├── __init__.py         — public API; lazy-imports the runner (requires pyyaml)
    ├── runner.py           — orchestrator: clones repos, builds graphs, runs benchmarks, writes CSVs
    ├── scorer.py           — compute_token_efficiency, compute_mrr, compute_precision_recall
    ├── reporter.py         — Markdown and CSV report generators
    ├── token_benchmark.py  — workflow-level token benchmarks (review, architecture, debug, onboard, pre_merge)
    ├── benchmarks/
    │   ├── token_efficiency.py   — per-commit naive vs diff vs graph token counts
    │   ├── impact_accuracy.py    — precision/recall/F1 for blast-radius predictions
    │   ├── search_quality.py     — MRR for keyword search
    │   ├── flow_completeness.py  — flow detection recall
    │   └── build_performance.py  — full build time and search latency
    └── configs/
        ├── express.yaml — 2 pinned commits (qs CVE fix, res.type edge-case tests)
        ├── fastapi.yaml — 2 pinned commits
        ├── flask.yaml   — 2 pinned commits
        ├── gin.yaml     — 3 pinned commits
        ├── httpx.yaml   — 2 pinned commits
        └── nextjs.yaml  — 2 pinned commits
```

The runner is also wired into the CLI as `code-review-graph eval` (defined in `cli.py`).


## Benchmark types

Five benchmark types, each backed by a `run(repo_path, store, config)` function:

| Benchmark | Primary metric | Notes |
| --- | --- | --- |
| `token_efficiency` | `naive_to_graph_ratio` | Naive = full changed-file contents; graph = `get_review_context` JSON output; token count = `len(text) // 4` |
| `impact_accuracy` | precision / recall / F1 | Ground truth = changed files + files with CALLS/IMPORTS_FROM edges to changed nodes |
| `search_quality` | MRR | Expected result matched by substring of `qualified_name` |
| `flow_completeness` | recall | Entry points detected vs expected from YAML config |
| `build_performance` | `build_ms`, `search_latency_ms` | Wall time for full build and single search |
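As a concrete sketch of the search-quality and flow-completeness scoring: `compute_mrr` is named in `scorer.py`, but the internals below are a hedged reconstruction from the substring-matching rule above, and `flow_recall` is a hypothetical helper name.

```python
def reciprocal_rank(ranked_names, expected_substring):
    """1/rank of the first result whose qualified_name contains the
    expected substring; 0.0 if nothing matches (assumed convention)."""
    for rank, name in enumerate(ranked_names, start=1):
        if expected_substring in name:
            return 1.0 / rank
    return 0.0

def compute_mrr(queries):
    """Mean reciprocal rank over (ranked_names, expected_substring) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, e) for r, e in queries) / len(queries)

def flow_recall(detected_entry_points, expected_entry_points):
    """flow_completeness-style recall: entry points detected vs the
    expected list from the repo's YAML config."""
    expected = set(expected_entry_points)
    if not expected:
        return 1.0
    return len(expected & set(detected_entry_points)) / len(expected)
```

Under this scoring, an MRR of 0.35 means the expected symbol lands, on average, around rank 3 in keyword search results.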

The README table (8.2× average, 100% recall, MRR 0.35) is produced by running all five benchmarks across all six repos.


## Caveats

Token counting approximation: the token count is `len(text) // 4`, a rough character-to-token ratio. For code, actual token counts from a BPE tokenizer (e.g., tiktoken's cl100k_base) typically differ by 10–30%, so the 8.2× ratio may shift when re-run with a real tokenizer.
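A minimal sketch of the approximation: the `len(text) // 4` rule is from the source, but the helper names here are assumptions.

```python
def approx_tokens(text: str) -> int:
    # The harness's heuristic: roughly 4 characters per token.
    return len(text) // 4

def naive_to_graph_ratio(naive_text: str, graph_json: str) -> float:
    # Guard against an empty graph context (assumed; the real
    # benchmark's handling of this edge case is not shown here).
    return approx_tokens(naive_text) / max(approx_tokens(graph_json), 1)
```

Note that because both sides of the ratio use the same heuristic, the ratio only shifts under a real tokenizer to the extent that source code and JSON context tokenize at different densities.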

Impact accuracy ground truth is circular: the benchmark computes ground truth from the same graph’s edge data (CALLS, IMPORTS_FROM edges). This means a file is “actually impacted” only if the graph contains a relevant edge — inflating recall to 100% by construction. The benchmark validates internal consistency, not independent ground truth.
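The circularity is easy to see in a toy reconstruction (a sketch; the edge representation and function names are assumptions):

```python
def graph_ground_truth(changed_files, edges):
    """Ground truth as the benchmark derives it: changed files plus any
    file with a CALLS/IMPORTS_FROM edge into a changed file."""
    impacted = set(changed_files)
    for src, kind, dst in edges:
        if kind in ("CALLS", "IMPORTS_FROM") and dst in changed_files:
            impacted.add(src)
    return impacted

def precision_recall_f1(predicted, actual):
    tp = len(predicted & actual)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(actual) if actual else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```

If the blast-radius prediction is itself derived from the same edge data, every "actually impacted" file is necessarily predicted, so recall is 1.0 by construction and only precision can fall below 1.0.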

Token efficiency methodology gap: the benchmark compares naive (full changed files) to graph (get_review_context) but does not include a diff-only baseline. A reviewer using only git diff would produce far fewer tokens than reading full files. The standard_tokens field (diff tokens) is computed but not used in the headline ratio.
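One way to surface the unused diff baseline (a sketch: `standard_tokens` is the field named above, but the function and output keys are hypothetical):

```python
def token_ratios(naive_tokens: int, standard_tokens: int, graph_tokens: int) -> dict:
    """Report the headline ratio alongside a diff-only baseline, using
    the standard_tokens (diff) count the benchmark already computes."""
    graph = max(graph_tokens, 1)  # guard against an empty graph context
    return {
        "naive_to_graph_ratio": round(naive_tokens / graph, 1),
        "diff_to_graph_ratio": round(standard_tokens / graph, 1),
    }
```

A `diff_to_graph_ratio` near or below 1.0 would indicate the graph context is no smaller than the diff itself, which is the comparison a reviewer actually cares about.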


## Prerequisites

  • Python 3.10–3.13
  • uv recommended (or pip)
  • pyyaml (included in [eval] extra)
  • matplotlib (included in [eval] extra)
  • Git (for cloning benchmark repos and computing diffs)
  • Network access (clones 6 real repos from GitHub)
  • Disk: ~500 MB for 6 repo clones

## Installation

From the vendored source:

```sh
cd tools/tirth8205-code-review-graph
pip install -e ".[eval]"
```

Or with uv:

```sh
cd tools/tirth8205-code-review-graph
uv pip install -e ".[eval]"
```

To also reproduce community detection (Leiden) and search quality with semantic embeddings:

```sh
pip install -e ".[eval,communities,embeddings]"
```

## Running the benchmarks

```sh
code-review-graph eval --all
```

This will:

  1. Clone or update all 6 repos into evaluate/test_repos/
  2. Build the code graph for each repo
  3. Run all 5 benchmark types against each repo
  4. Write CSV results to evaluate/results/
  5. Print a summary table
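The five steps can be sketched as an orchestration loop. This is hypothetical: the real `runner.py` structure and function names are not shown in this guide, so the steps are passed in as callbacks.

```python
from pathlib import Path

REPOS = ["express", "fastapi", "flask", "gin", "httpx", "nextjs"]
BENCHMARKS = ["token_efficiency", "impact_accuracy", "search_quality",
              "flow_completeness", "build_performance"]

def run_all(base: Path, clone, build_graph, run_benchmark, write_csv):
    """Orchestration skeleton; callbacks stand in for the real steps."""
    rows = []
    for repo in REPOS:
        repo_path = base / "evaluate" / "test_repos" / repo
        clone(repo, repo_path)            # 1. clone or update the repo
        store = build_graph(repo_path)    # 2. build the code graph
        for bench in BENCHMARKS:          # 3. run all 5 benchmark types
            rows.append(run_benchmark(bench, repo_path, store))
    write_csv(base / "evaluate" / "results", rows)  # 4. persist CSVs
    return rows                           # 5. caller prints the summary
```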

To run a single repo:

```sh
code-review-graph eval --repo fastapi
```

To run a single benchmark type:

```sh
code-review-graph eval --benchmark token_efficiency
```

## Generating a report

```sh
code-review-graph eval --report
```

Writes a Markdown summary to evaluate/reports/summary.md.

Alternatively, invoke the reporter directly:

```python
from code_review_graph.eval import generate_markdown_report

report = generate_markdown_report(results)
print(report)
```

## Reported results

| Repo | Commits | Avg naive tokens | Avg graph tokens | Reduction |
| --- | --- | --- | --- | --- |
| express | 2 | 693 | 98 | 30.7x |
| fastapi | 2 | 4,944 | 614 | 8.1x |
| flask | 2 | 44,751 | 4,252 | 9.1x |
| gin | 3 | 21,972 | 1,153 | 16.4x |
| httpx | 2 | 12,044 | 1,728 | 6.9x |
| nextjs | 2 | 9,882 | 1,249 | 8.0x |
| **Average** | **13** | | | **8.2x** |

Impact accuracy across all repos: 100% recall, 0.54 average F1, 0.38 average precision.

Search quality (MRR): 0.35 (keyword search only; not re-run with semantic embeddings in this guide).


## Notes

  • This guide was produced from source review only. The harness has not been executed; figures above are as-reported by the tool authors.
  • The 49x figure from the README monorepo diagram has no corresponding benchmark config or YAML entry and cannot be reproduced using the eval harness.
  • The evaluate/reports/summary.md referenced in the README does not exist in the vendored snapshot — it is generated on first run.
  • For a meaningful token efficiency comparison, consider adding a diff_to_graph_ratio column (standard_tokens / graph_tokens) to see how the tool compares to a reviewer using only git diff.