
rtk — Benchmark Reproduction

Source: https://github.com/rtk-ai/rtk (v0.35.0, master branch)
Date: 2026-04-10
Environment: macOS Darwin 25.4.0
Outcome: not yet run; harness structure verified from source (see notes below)


scripts/benchmark.sh # primary benchmark harness (17.7 KB, verified from source)
scripts/rtk-economics.sh # session-level economics estimates

No npm or language runtime dependencies are required; the harness is a plain bash script. It requires the rtk binary and the language toolchains for any sections you want to run (git, cargo, go, python, node, etc.).

# Install rtk (required)
brew install rtk
# or
curl -fsSL https://raw.githubusercontent.com/rtk-ai/rtk/refs/heads/master/install.sh | sh
# Clone the repo
git clone https://github.com/rtk-ai/rtk
cd rtk
# Run benchmark against installed rtk
bash scripts/benchmark.sh
# Or build from source and run against local binary
cargo build --release
bash scripts/benchmark.sh # picks up ./target/release/rtk automatically

What the harness measures (verified from source)

  • Runs live commands (git status, cargo test, pytest -v, go test -v, golangci-lint, ruff check, etc.) on temporary fixtures created in mktemp -d directories.
  • Compares ceil(chars / 4) token estimates for raw output vs rtk-filtered output. The bash implementation is $(( (len + 3) / 4 )) (integer ceiling division).
  • Includes a bench_rewrite section that verifies rtk rewrite correctness (e.g., compound cargo test && git push rewrites to rtk cargo test && rtk git push) — these are correctness tests, counted in the GOOD/FAIL totals.
  • Reports per-test: icon (✅/⚠️/❌), name, raw command, rtk command, raw tokens, filtered tokens, savings %.
    • ✅ GOOD: rtk output is non-empty and smaller (strictly fewer tokens) than raw.
    • ⚠️ SKIP: rtk output has same or more tokens than raw (or raw has 0 tokens). In SKIP mode, raw token count is added to both TOTAL_UNIX and TOTAL_RTK (no credit for savings).
    • ❌ FAIL: rtk output is empty. Raw count added to both totals (no savings assumed).
  • Reports aggregate: ✅ N good ⚠️ M skip ❌ P fail, then Tokens: TOTAL_UNIX → TOTAL_RTK (-PCT%).
  • 80% CI gate: exits non-zero (exit 1) if GOOD_PCT (= GOOD_TESTS * 100 / TOTAL_TESTS) is less than 80. Source: lines 587-591 of scripts/benchmark.sh.
  • Optionally writes per-test debug files to scripts/benchmark/{unix,rtk,diff}/ when $CI is unset.
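
The scoring rules above can be sketched in a few lines of bash. This is a hypothetical reconstruction from the description, not the harness source: the function names score and gate and the RESULT variable are illustrative, and checking for FAIL before SKIP is an assumption about ordering.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the scoring rules described above; not the harness source.

GOOD_TESTS=0; SKIP_TESTS=0; FAIL_TESTS=0; TOTAL_TESTS=0
TOTAL_UNIX=0; TOTAL_RTK=0
RESULT=""

# score RAW_TOKENS RTK_TOKENS
# Sets RESULT to GOOD, SKIP, or FAIL and updates the running totals.
score() {
  local raw=$1 rtk=$2
  TOTAL_TESTS=$(( TOTAL_TESTS + 1 ))
  if (( rtk == 0 )); then
    # Empty rtk output: no savings assumed (ordering before SKIP is an assumption).
    RESULT=FAIL
    FAIL_TESTS=$(( FAIL_TESTS + 1 ))
    TOTAL_UNIX=$(( TOTAL_UNIX + raw )); TOTAL_RTK=$(( TOTAL_RTK + raw ))
  elif (( raw == 0 || rtk >= raw )); then
    # Same or more tokens than raw: raw count goes into both totals.
    RESULT=SKIP
    SKIP_TESTS=$(( SKIP_TESTS + 1 ))
    TOTAL_UNIX=$(( TOTAL_UNIX + raw )); TOTAL_RTK=$(( TOTAL_RTK + raw ))
  else
    # Non-empty and strictly smaller: credit the savings.
    RESULT=GOOD
    GOOD_TESTS=$(( GOOD_TESTS + 1 ))
    TOTAL_UNIX=$(( TOTAL_UNIX + raw )); TOTAL_RTK=$(( TOTAL_RTK + rtk ))
  fi
}

# gate: succeeds only when at least 80% of tests are GOOD (integer division).
gate() {
  (( TOTAL_TESTS > 0 )) || return 1
  local good_pct=$(( GOOD_TESTS * 100 / TOTAL_TESTS ))
  (( good_pct >= 80 ))
}
```

Under these assumptions, scoring the pairs (420, 84), (100, 100), and (50, 0) yields GOOD, SKIP, and FAIL in turn, with totals of 570 → 234. GOOD_PCT is then 1 * 100 / 3 = 33, so a caller running gate || exit 1 would exit non-zero.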

Token estimation methodology (verified from source)


The harness uses $(( (len + 3) / 4 )) (bash integer ceiling of string length / 4) as the token proxy — the same ceil(chars / 4) heuristic used by rtk gain. The Rust implementation in src/core/tracking.rs::estimate_tokens() is:

pub fn estimate_tokens(text: &str) -> usize {
    (text.len() as f64 / 4.0).ceil() as usize
}

This is not a real LLM tokenizer. It operates on byte length (.len() in Rust returns bytes, not Unicode codepoints), which means:

  • ASCII-only outputs: reliable approximation.
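  • Code with multi-byte Unicode (e.g., emoji in commit messages, non-ASCII identifiers): the byte count exceeds the character count, so token counts are overestimated relative to a per-character heuristic, which can inflate estimated savings.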
  • Actual LLM token savings could differ by 20-30% from reported figures depending on content type and tokenizer.
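
The gap between byte length and code-point count can be illustrated with a small bash sketch. The helper names byte_len and codepoint_len are hypothetical, and the sample string is chosen only for illustration. Whether the bash harness itself measures len in bytes or characters is not established here; the sketch only shows how far the two can diverge.

```shell
#!/usr/bin/env bash
# Sketch: compare a byte-based and a code-point-based chars/4 estimate.

# Number of bytes in the argument (what Rust's str::len() reports).
byte_len() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

# Number of code points in a UTF-8 string: count every byte that is not a
# continuation byte (0x80-0xBF). LC_ALL=C keeps tr byte-oriented.
codepoint_len() {
  printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}

# ceil(n / 4) using integer arithmetic.
estimate_tokens() {
  echo $(( ($1 + 3) / 4 ))
}

# Four rocket emoji (U+1F680), written as octal escapes so this file stays ASCII.
sample=$(printf '\360\237\232\200\360\237\232\200\360\237\232\200\360\237\232\200')

echo "bytes:       $(byte_len "$sample") -> $(estimate_tokens "$(byte_len "$sample")") tokens"
echo "code points: $(codepoint_len "$sample") -> $(estimate_tokens "$(codepoint_len "$sample")") tokens"
```

For this sample the byte-based estimate is 4 tokens (16 bytes) and the code-point-based estimate is 1 token (4 code points). Neither figure is a real tokenizer count.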

Report layout (per-test lines, then the aggregate summary):

✅ git status │ git status │ rtk git status │ 420 → 84 (-80%)
✅ cargo test │ cargo test 2>&1 │ rtk cargo test │ 12400 → 1240 (-90%)
✅ pytest │ pytest -v 2>&1 || true │ rtk pytest -v │ 3200 → 320 (-90%)
...
═══════════════════════════════════════════════════════
✅ N good ⚠️ M skip ❌ P fail N/T (PCT%)
Tokens: TOTAL_UNIX → TOTAL_RTK (-SAVE_PCT%)

Environment requirements (verified from source)


The benchmark auto-skips sections for unavailable toolchains. Sections and their requirements:

| Section | Requirement |
| --- | --- |
| ls, find, grep, diff, wc, json, env, log, read, summary | none (uses repo files) |
| git | git in PATH; must be run inside a git repo |
| cargo, test, err | cargo in PATH |
| curl | curl in PATH |
| wget | wget in PATH |
| Modern JS (tsc, eslint, vitest, playwright, prisma, pnpm) | package.json in CWD; individual binaries in PATH or node_modules/.bin/ |
| gh | gh in PATH and inside a git repo |
| docker | docker in PATH |
| kubectl | kubectl in PATH |
| python (ruff, pytest) | python3, ruff, pytest all in PATH |
| go (golangci-lint, go test) | go and golangci-lint in PATH |
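
Before a run, a quick preflight can show which sections are likely to be skipped. The sketch below is a hypothetical helper, not part of the rtk repository: it checks only that binaries are in PATH, using the tool names from the requirements above, and does not cover the package.json, node_modules/.bin/, or git-repo conditions.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper; tool lists mirror the requirements listed above.

have() { command -v "$1" >/dev/null 2>&1; }

# check_section NAME TOOL...
# Prints "ok" when every tool is in PATH, otherwise the missing tool names.
check_section() {
  local section=$1; shift
  local missing=()
  local tool
  for tool in "$@"; do
    have "$tool" || missing+=("$tool")
  done
  if (( ${#missing[@]} == 0 )); then
    printf '%-8s ok\n' "$section"
  else
    printf '%-8s missing: %s\n' "$section" "${missing[*]}"
    return 1
  fi
}

preflight() {
  check_section git     git                 || true
  check_section cargo   cargo               || true
  check_section curl    curl                || true
  check_section wget    wget                || true
  check_section gh      gh git              || true
  check_section docker  docker              || true
  check_section kubectl kubectl             || true
  check_section python  python3 ruff pytest || true
  check_section go      go golangci-lint    || true
}

preflight
```

A section reported as missing here should correspond to one the harness auto-skips, though the harness's own detection logic may differ in detail.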

Running from the rtk repo root covers git, cargo, grep, find, and most system sections without any extra setup.

Notes

  • The rtk rewrite correctness section tests 6 cases, including git status, ls -al, npm exec, cargo test, and the compound cargo test && git push. These are included in GOOD/FAIL counts and affect the 80% gate.
  • Debug files in scripts/benchmark/{unix,rtk,diff}/ are only written when $CI is unset. In CI mode the directory is not created.
  • The scripts/rtk-economics.sh script generates the README benchmark table by running commands on a sample project and computing aggregate session-level estimates. The table is therefore a projection based on per-command measurements, not an independent measurement.
  • Source-reviewed version: tools/rtk/ vendored at v0.35.0 (Cargo.toml). The benchmark script is the copy at tools/rtk/scripts/benchmark.sh.