ryandonofrio3-osgrep — Benchmark Reproduction
ryandonofrio3-osgrep — Benchmark Reproduction
Section titled “ryandonofrio3-osgrep — Benchmark Reproduction”Source: tools/osgrep/ (pinned: 9f2faf7, v0.5.16)
Date: 2026-04-14
Environment: not run — harness missing from working tree
Outcome: README benchmark is not reproducible from source; internal MRR harness is runnable but covers retrieval quality only, not token counts
Benchmark claims
Section titled “Benchmark claims”The README states:
“In our public benchmarks,
osgrepcan save about 20% of your LLM tokens and deliver a 30% speedup.”
Source of these figures: benchmark/benchmark_opencode.csv — 10 queries run against the opencode codebase, comparing baseline (no osgrep) vs with osgrep on wall-clock time and API cost.
Raw LLM responses are committed in benchmark/raw_responses/baseline/ (10 files) and benchmark/raw_responses/with_osgrep/ (10 files). An HTML chart is at benchmark/output/benchmark-results.html.
CSV summary (as observed from source)
Section titled “CSV summary (as observed from source)”| Query | Baseline time (s) | Baseline cost ($) | osgrep time (s) | osgrep cost ($) | Winner |
|---|---|---|---|---|---|
| authentication + authorization checks | 145 | 0.26 | 94 | 0.26 | tie |
| file watching implementation | 284 | 0.50 | 205 | 0.59 | baseline |
| LSP integration configured | 87 | 0.33 | 68 | 0.23 | osgrep |
| TUI client–server communication | 131 | 0.26 | 134 | 0.38 | osgrep |
| agent types defined | 46 | 0.13 | 93 | 0.31 | baseline |
| terminal rendering + UI state | 179 | 0.66 | 104 | 0.30 | osgrep |
| AI model provider integration | 147 | 0.24 | 63 | 0.21 | osgrep |
| codebase search + navigation | 216 | 0.83 | 93 | 0.23 | osgrep |
| bash command execution + sandboxing | 78 | 0.07 | 58 | 0.11 | osgrep |
| mobile app remote control architecture | 258 | 0.41 | 90 | 0.23 | osgrep |
| Mean | 157.1 | 0.359 | 100.2 | 0.285 | 6–2 osgrep |
Observed from CSV: −36% mean time, −21% mean cost, consistent with README headline.
Harness status
Section titled “Harness status”README benchmark harness — missing
Section titled “README benchmark harness — missing”package.json defines three benchmark scripts:
benchmark: ./run-benchmark.shbenchmark:index: ./run-benchmark.sh $HOME/osgrep-benchmarks --indexbenchmark:agent: npx tsx src/bench/benchmark-agent.tsbenchmark:chart: npx tsx src/bench/generate-benchmark-chart.tsNone of these files exist in the working tree (commit 9f2faf7):
run-benchmark.sh— absentsrc/bench/benchmark-agent.ts— absent (src/bench/directory does not exist)src/bench/generate-benchmark-chart.ts— absent
The benchmark infrastructure has been removed or was never committed alongside the CSV results. Reproduction is not possible from the repo in its current state.
Internal MRR harness (src/eval.ts) — runnable
Section titled “Internal MRR harness (src/eval.ts) — runnable”eval.ts defines 70+ EvalCase records asserting that specific source files within the osgrep codebase rank first for given natural-language queries. This is a retrieval quality regression suite, not a token count or cost benchmark.
To run (requires an indexed copy of the osgrep repo):
cd tools/osgreppnpm installpnpm build
# First, index the repo against itselfnode dist/index.js index --reset
# Then run eval (no dedicated script — must import and execute manually)# eval.ts exports EvalCase[] and EvalResult types; it must be driven by a runner# No committed runner script found; experiments/ scripts are ad-hoc explorationsexperiments/mrr-sweep.ts, experiments/ranking-test.ts, experiments/quick_check.ts, and experiments/verify-fix.ts are present but undocumented and not part of a repeatable harness.
vitest test suite
Section titled “vitest test suite”pnpm testTests cover unit logic; no integration tests against a real indexed repo were found.
Methodology gaps
Section titled “Methodology gaps”| Gap | Detail |
|---|---|
| No runnable harness | run-benchmark.sh and src/bench/ are absent from the repo |
| Corpus undisclosed in README | Identified from CSV content as the opencode codebase; not stated in README |
| No answer quality assessment | CSV records only wall-clock time and API cost; whether osgrep answers were correct or equivalent to baseline is not evaluated |
| 10-query sample | Statistical power is low; 2/10 queries favoured baseline on time; 3/10 cost same or more with osgrep |
| Single codebase | Generalization to other repos is unknown |
| Agent and model unspecified | The agent framework (opencode), model, and prompt templates used are not documented in the CSV or README |
Repro status
Section titled “Repro status”Not reproducible. The harness scripts referenced in package.json are absent from the repository. The raw LLM responses and CSV are committed, but there is no code to re-run the collection or compute aggregate figures from them.
The internal MRR harness (eval.ts) is runnable and tests retrieval quality independently of the CSV benchmark, but it measures file-ranking precision, not token or cost savings.