Skip to content

ryandonofrio3-osgrep — Benchmark Reproduction

ryandonofrio3-osgrep — Benchmark Reproduction

Section titled “ryandonofrio3-osgrep — Benchmark Reproduction”

Source: tools/osgrep/ (pinned: 9f2faf7, v0.5.16) Date: 2026-04-14 Environment: not run — harness missing from working tree Outcome: README benchmark is not reproducible from source; internal MRR harness is runnable but covers retrieval quality only, not token counts


The README states:

“In our public benchmarks, osgrep can save about 20% of your LLM tokens and deliver a 30% speedup.”

Source of these figures: benchmark/benchmark_opencode.csv — 10 queries run against the opencode codebase, comparing baseline (no osgrep) vs with osgrep on wall-clock time and API cost.

Raw LLM responses are committed in benchmark/raw_responses/baseline/ (10 files) and benchmark/raw_responses/with_osgrep/ (10 files). An HTML chart is at benchmark/output/benchmark-results.html.

QueryBaseline time (s)Baseline cost ($)osgrep time (s)osgrep cost ($)Winner
authentication + authorization checks1450.26940.26tie
file watching implementation2840.502050.59baseline
LSP integration configured870.33680.23osgrep
TUI client–server communication1310.261340.38osgrep
agent types defined460.13930.31baseline
terminal rendering + UI state1790.661040.30osgrep
AI model provider integration1470.24630.21osgrep
codebase search + navigation2160.83930.23osgrep
bash command execution + sandboxing780.07580.11osgrep
mobile app remote control architecture2580.41900.23osgrep
Mean157.10.359100.20.2856–2 osgrep

Observed from CSV: −36% mean time, −21% mean cost, consistent with README headline.


package.json defines three benchmark scripts:

benchmark: ./run-benchmark.sh
benchmark:index: ./run-benchmark.sh $HOME/osgrep-benchmarks --index
benchmark:agent: npx tsx src/bench/benchmark-agent.ts
benchmark:chart: npx tsx src/bench/generate-benchmark-chart.ts

None of these files exist in the working tree (commit 9f2faf7):

  • run-benchmark.sh — absent
  • src/bench/benchmark-agent.ts — absent (src/bench/ directory does not exist)
  • src/bench/generate-benchmark-chart.ts — absent

The benchmark infrastructure has been removed or was never committed alongside the CSV results. Reproduction is not possible from the repo in its current state.

Internal MRR harness (src/eval.ts) — runnable

Section titled “Internal MRR harness (src/eval.ts) — runnable”

eval.ts defines 70+ EvalCase records asserting that specific source files within the osgrep codebase rank first for given natural-language queries. This is a retrieval quality regression suite, not a token count or cost benchmark.

To run (requires an indexed copy of the osgrep repo):

Terminal window
cd tools/osgrep
pnpm install
pnpm build
# First, index the repo against itself
node dist/index.js index --reset
# Then run eval (no dedicated script — must import and execute manually)
# eval.ts exports EvalCase[] and EvalResult types; it must be driven by a runner
# No committed runner script found; experiments/ scripts are ad-hoc explorations

experiments/mrr-sweep.ts, experiments/ranking-test.ts, experiments/quick_check.ts, and experiments/verify-fix.ts are present but undocumented and not part of a repeatable harness.

Terminal window
pnpm test

Tests cover unit logic; no integration tests against a real indexed repo were found.


GapDetail
No runnable harnessrun-benchmark.sh and src/bench/ are absent from the repo
Corpus undisclosed in READMEIdentified from CSV content as the opencode codebase; not stated in README
No answer quality assessmentCSV records only wall-clock time and API cost; whether osgrep answers were correct or equivalent to baseline is not evaluated
10-query sampleStatistical power is low; 2/10 queries favoured baseline on time; 3/10 cost same or more with osgrep
Single codebaseGeneralization to other repos is unknown
Agent and model unspecifiedThe agent framework (opencode), model, and prompt templates used are not documented in the CSV or README

Not reproducible. The harness scripts referenced in package.json are absent from the repository. The raw LLM responses and CSV are committed, but there is no code to re-run the collection or compute aggregate figures from them.

The internal MRR harness (eval.ts) is runnable and tests retrieval quality independently of the CSV benchmark, but it measures file-ranking precision, not token or cost savings.