juliusbrussee-caveman — Benchmark Reproduction
Source: https://github.com/JuliusBrussee/caveman
Date: 2026-04-10
Updated: 2026-04-13 (source-verified review; compress scripts now fully documented)
Environment: macOS Darwin 25.4.0
Outcome: not run (API key required for benchmarks/run.py); eval snapshot offline analysis
completed via character-proxy; full tiktoken run not executed
Harness locations
Two distinct harnesses exist in the repo (verified from source):

    benchmarks/run.py             # Anthropic API benchmark: normal vs caveman, token counts
    benchmarks/prompts.json       # 10 fixed prompts (developer tasks)
    benchmarks/requirements.txt   # anthropic SDK
    benchmarks/results/           # .gitkeep only; results not committed

    evals/llm_run.py              # Three-arm eval generator: calls claude CLI
    evals/measure.py              # Offline measurement via tiktoken (no API key needed)
    evals/prompts/en.txt          # 10 prompts (different set from benchmarks/prompts.json)
    evals/snapshots/results.json  # Committed snapshot, generated 2026-04-08

The benchmark harness (benchmarks/) and eval harness (evals/) use different prompt sets.
benchmarks/prompts.json has 10 developer tasks structured as JSON with IDs, categories,
and prompt text. evals/prompts/en.txt has 10 developer questions as plain text, one per
line (both sets verified from source).
Harness 1: benchmarks/run.py
Methodology
Calls the Anthropic API directly with temperature=0, 3 trials per prompt per mode
(normal vs caveman), reports median output tokens. Model default: claude-sonnet-4-20250514.
Prompts are 10 fixed developer tasks (debugging, bugfix, setup, explanation, refactor,
architecture, code-review, devops). The script also supports --update-readme to inject
results into the README benchmark table.
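The aggregation described above (median output tokens across trials, compared between the two modes) can be sketched as follows. This is a minimal illustration and not the code in benchmarks/run.py; the function names and the sample token counts are hypothetical.

```python
from statistics import median


def median_tokens(trials: list[int]) -> float:
    """Median output-token count across repeated trials of one prompt."""
    return median(trials)


def savings(normal_trials: list[int], caveman_trials: list[int]) -> float:
    """Fractional reduction in median output tokens, caveman vs normal."""
    normal = median_tokens(normal_trials)
    caveman = median_tokens(caveman_trials)
    return 1 - caveman / normal


# Hypothetical token counts for one prompt, 3 trials per mode.
normal_trials = [400, 420, 380]
caveman_trials = [150, 140, 160]
print(f"{savings(normal_trials, caveman_trials):.1%}")  # 62.5%
```

Using the median rather than the mean keeps a single unusually long or short trial from dominating the per-prompt figure.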
How to reproduce

    git clone https://github.com/JuliusBrussee/caveman
    cd caveman
    pip install -r benchmarks/requirements.txt
    export ANTHROPIC_API_KEY=<your-key>
    python benchmarks/run.py

Optional flags:

    python benchmarks/run.py --dry-run                 # Preview config, no API calls
    python benchmarks/run.py --trials 1                # Single trial (faster/cheaper)
    python benchmarks/run.py --model claude-haiku-4-5  # Cheaper model
    python benchmarks/run.py --update-readme           # Write results into README.md

Status
Not run. The results directory (benchmarks/results/) contains only .gitkeep, with no committed
artefacts. The 65% headline figure from the README cannot be verified without an API key.
Harness 2: evals/ (three-arm, offline-measurable)
Evals methodology
Three-arm design (verified from source):
- baseline: no system prompt
- terse: "Answer concisely."
- skill arms: "Answer concisely.\n\n{SKILL.md}" (one arm per skills/<name>/SKILL.md)
The honest skill delta is skill vs terse, not skill vs baseline. This isolates the
skill’s contribution from generic terseness. llm_run.py auto-discovers all skill
directories under skills/.
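As a rough sketch of the three-arm design above, the system prompt for each arm and the skill-vs-terse delta could be built like this. It is an illustration under stated assumptions, not the logic in evals/llm_run.py: build_arms and skill_delta are hypothetical names, and the skill text is a placeholder.

```python
from typing import Optional

TERSE = "Answer concisely."


def build_arms(skills: dict[str, str]) -> dict[str, Optional[str]]:
    """Map arm name -> system prompt (None means no system prompt)."""
    arms: dict[str, Optional[str]] = {"baseline": None, "terse": TERSE}
    for name, skill_md in skills.items():
        # Each skill arm layers the skill text on top of the terse control.
        arms[name] = f"{TERSE}\n\n{skill_md}"
    return arms


def skill_delta(terse_tokens: int, skill_tokens: int) -> float:
    """Reduction attributable to the skill itself: skill vs terse."""
    return 1 - skill_tokens / terse_tokens


arms = build_arms({"caveman": "<contents of skills/caveman/SKILL.md>"})
print(list(arms))             # ['baseline', 'terse', 'caveman']
print(skill_delta(200, 100))  # 0.5
```

Because every skill arm already contains the terse instruction, comparing against terse removes the share of the saving that any "be brief" prompt would achieve.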
Snapshot committed to evals/snapshots/results.json:
- Generated: 2026-04-08T22:01:24Z
- Model: claude-opus-4-6
- CLI version: 2.1.97 (Claude Code)
- Prompts: 10 (from evals/prompts/en.txt)
- Runs per arm: 1 (single run)
- Arms: baseline, terse, caveman, caveman-cn, caveman-es, compress
Snapshot contents (verified from source)
The 10 prompts in the committed snapshot:
- Why does my React component re-render every time the parent updates?
- Explain database connection pooling.
- What’s the difference between TCP and UDP?
- How do I fix a memory leak in a long-running Node.js process?
- What does the SQL EXPLAIN command tell me?
- How does a hash table handle collisions?
- Why am I getting CORS errors in my browser console?
- What’s the point of using a debouncer on a search input?
- How does git rebase differ from git merge?
- When should I use a queue vs a topic in messaging systems?
Offline analysis — character-proxy results
Character-count proxy analysis of the committed snapshot (run 2026-04-13, without tiktoken). Character length correlates with token count, so the ratios are directionally consistent but will differ from tiktoken output. Values are percentage reductions in character count; a negative value means the caveman response was longer than the comparison arm:
| Metric | caveman vs terse | caveman vs baseline |
|---|---|---|
| Median | ~53% | ~49% |
| Mean | ~50% | ~49% |
| Min | ~-2% | ~+10% |
| Max | ~89% | ~87% |
| Stdev | ~24% | — |
Notable per-prompt results: database connection pooling shows ~89% reduction (terse arm response was longest at 1405 chars; caveman 156 chars). Node.js memory leak shows ~-2% (caveman response slightly longer than terse — high-variance prompt). TCP/UDP shows ~37% reduction.
The caveman-cn arm (Chinese variant) shows ~78% char-proxy reduction vs terse (min 54%,
max 97%).
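The character-proxy figures above come down to a per-prompt reduction followed by summary statistics. A minimal sketch, assuming the per-prompt character counts have already been extracted (the layout of results.json is not assumed here, and the function names are hypothetical):

```python
from statistics import mean, median, stdev


def reduction(reference_chars: int, arm_chars: int) -> float:
    """Fractional character reduction of an arm relative to a reference arm."""
    return 1 - arm_chars / reference_chars


def summarize(reductions: list[float]) -> dict[str, float]:
    """Median/mean/min/max/stdev over per-prompt reductions."""
    return {
        "median": median(reductions),
        "mean": mean(reductions),
        "min": min(reductions),
        "max": max(reductions),
        "stdev": stdev(reductions),
    }


# The connection-pooling prompt from the text: terse 1405 chars, caveman 156 chars.
print(f"{reduction(1405, 156):.1%}")  # 88.9%
```

A negative result from reduction() corresponds to the case noted above where the caveman response is longer than the terse one.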
How to reproduce (offline — no API key)

    # Using the vendored clone:
    cd tools/juliusbrussee-caveman
    uv run --with tiktoken python evals/measure.py

This reads the committed snapshot and prints per-skill savings (median/mean/min/max/stdev)
using tiktoken o200k_base. Requires uv on PATH.
Alternatively, from a fresh clone:

    git clone https://github.com/JuliusBrussee/caveman
    cd caveman
    uv run --with tiktoken python evals/measure.py

How to regenerate the snapshot (requires claude CLI logged in)

    uv run python evals/llm_run.py
    # Cheaper model variant:
    CAVEMAN_EVAL_MODEL=claude-haiku-4-5 uv run python evals/llm_run.py

Regeneration calls claude -p --system-prompt ... once per (prompt × arm). With 4 skill arms plus 2 control arms, that is (4 + 2) arms × 10 prompts = 60 claude CLI invocations. Output is written to evals/snapshots/results.json.
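The invocation count can be checked by enumerating the (prompt, arm) pairs. This sketch uses the arm names from the committed snapshot; the prompt labels are placeholders rather than the real prompt text.

```python
from itertools import product

control_arms = ["baseline", "terse"]
skill_arms = ["caveman", "caveman-cn", "caveman-es", "compress"]
prompts = [f"prompt-{i}" for i in range(1, 11)]  # placeholders for the 10 prompts

# One CLI call per (prompt, arm) pair.
invocations = list(product(prompts, control_arms + skill_arms))
print(len(invocations))  # 60
```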
Known approximation
tiktoken o200k_base is OpenAI’s BPE tokenizer. Claude uses a different tokenizer. Ratios
between arms are directionally correct; absolute token counts are approximate.
Evals status
Snapshot exists and is committed (verified from source). Offline measurement via
evals/measure.py is runnable without an API key. Single run per arm — not a statistically
powered experiment. No fidelity / accuracy evaluation exists for any arm.
Compress sub-tool — reported savings
The README compress table reports 35%–60% savings on five prose memory files. The test
fixture files (both original and compressed versions) are committed in
tests/caveman-compress/ (verified from source — five .md / .original.md pairs).
Fixture files present (verified from source)
| File | Status |
|---|---|
| tests/caveman-compress/claude-md-preferences.md | present |
| tests/caveman-compress/claude-md-preferences.original.md | present |
| tests/caveman-compress/project-notes.md | present |
| tests/caveman-compress/project-notes.original.md | present |
| tests/caveman-compress/claude-md-project.md | present |
| tests/caveman-compress/claude-md-project.original.md | present |
| tests/caveman-compress/todo-list.md | present |
| tests/caveman-compress/todo-list.original.md | present |
| tests/caveman-compress/mixed-with-code.md | present |
| tests/caveman-compress/mixed-with-code.original.md | present |
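The pairing in the table can be re-checked mechanically. The helper below is a hypothetical sketch (fixture_pairs is not part of the repo) that walks a fixtures directory and confirms every .original.md file has a compressed counterpart.

```python
from pathlib import Path


def fixture_pairs(fixtures: Path) -> list[tuple[Path, Path]]:
    """Return (original, compressed) pairs; raise if a compressed file is missing."""
    pairs = []
    for orig in sorted(fixtures.glob("*.original.md")):
        # "foo.original.md" -> "foo.md" in the same directory.
        comp = orig.with_name(orig.name.replace(".original.md", ".md"))
        if not comp.exists():
            raise FileNotFoundError(f"missing compressed counterpart: {comp}")
        pairs.append((orig, comp))
    return pairs
```

Calling fixture_pairs(Path("tests/caveman-compress")) from the repo root should return five pairs if the listing above still matches the repository.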
How to reproduce compress savings
To verify the README figures, run tiktoken o200k_base against each .original.md vs
the compressed .md file:

    import tiktoken
    from pathlib import Path

    # Same encoding that evals/measure.py uses.
    enc = tiktoken.get_encoding("o200k_base")
    fixtures = Path("tests/caveman-compress")
    for orig in fixtures.glob("*.original.md"):
        # "foo.original.md" has stem "foo.original"; strip the suffix to get "foo".
        stem = orig.stem.replace(".original", "")
        comp = fixtures / f"{stem}.md"
        if comp.exists():
            orig_toks = len(enc.encode(orig.read_text()))
            comp_toks = len(enc.encode(comp.read_text()))
            saved = 1 - comp_toks / orig_toks
            print(f"{stem}: {orig_toks} -> {comp_toks} ({saved:.1%})")

Run from the repo root: uv run --with tiktoken python <above_script>. No API key required.
Compress script implementation (verified from source)
The compress pipeline is fully inspectable. All five Python modules are present in
caveman-compress/scripts/:
- compress.py: orchestrator; calls detect → compress → validate → retry → restore
- detect.py: file-type classifier (extension table + JSON/YAML/code-line heuristics)
- validate.py: structural validator (headings, code blocks exact, URLs exact, paths, bullets within 15%)
- cli.py: argument parsing
- main.py: entry point
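To make the structural checks listed for validate.py more concrete, here is a simplified sketch covering three of them: headings, URLs, and the 15% bullet tolerance. It is not the repo's validator. The regexes, the function names, and the choice to require identical heading lines are all assumptions, and the code-block and path checks are left out.

```python
import re


def headings(text: str) -> list[str]:
    """Markdown ATX heading lines, in order."""
    return re.findall(r"^#{1,6} .*$", text, flags=re.MULTILINE)


def urls(text: str) -> set[str]:
    """Set of http(s) URLs found in the text."""
    return set(re.findall(r"https?://[^\s)>\]]+", text))


def bullet_count(text: str) -> int:
    """Number of '-', '*' or '+' bullet lines."""
    return len(re.findall(r"^[ \t]*[-*+] ", text, flags=re.MULTILINE))


def validate(original: str, compressed: str, tolerance: float = 0.15) -> list[str]:
    """Return a list of structural problems; an empty list means the check passed."""
    problems = []
    if headings(original) != headings(compressed):
        problems.append("headings changed")
    if urls(original) != urls(compressed):
        problems.append("URLs changed")
    before, after = bullet_count(original), bullet_count(compressed)
    if before and abs(after - before) / before > tolerance:
        problems.append("bullet count drifted beyond tolerance")
    return problems
```

Returning a list of problems rather than a single boolean makes it easier to report why a compression attempt was rejected before retrying.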
The backup-overwrite guard is implemented: if <file>.original.md already exists, the
script aborts rather than overwriting (verified from source, compress.py line ~135).
Model used: claude-sonnet-4-5 (overridable via CAVEMAN_MODEL env var). API key path:
ANTHROPIC_API_KEY env var → Anthropic SDK; fallback to claude --print CLI.
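The control flow described above (backup guard, then compress, validate, retry, and restore) might look roughly like the sketch below. It is a hedged illustration and not compress.py itself: the function names, the retry count, and the decision to delete the backup after a failed run are assumptions, and the compress and validate callables stand in for the model call and the structural validator.

```python
from pathlib import Path
from typing import Callable


def backup_path(target: Path) -> Path:
    """FILE.md -> FILE.original.md (naming taken from the text above)."""
    return target.with_name(f"{target.stem}.original.md")


def compress_file(
    target: Path,
    compress: Callable[[str], str],
    validate: Callable[[str, str], bool],
    max_retries: int = 2,  # assumed value, not taken from the repo
) -> bool:
    """Back up, compress, validate, retry, and restore on failure."""
    backup = backup_path(target)
    if backup.exists():
        # Guard: never clobber an existing backup of the original.
        raise FileExistsError(f"{backup} already exists; remove or rename it first")

    original = target.read_text(encoding="utf-8")
    backup.write_text(original, encoding="utf-8")

    for _ in range(1 + max_retries):
        candidate = compress(original)
        if validate(original, candidate):
            target.write_text(candidate, encoding="utf-8")
            return True

    # All attempts failed validation: restore the original and drop the backup.
    target.write_text(original, encoding="utf-8")
    backup.unlink()
    return False
```

Checking for the backup before reading or writing anything means a second run on an already-compressed file fails fast instead of replacing the only copy of the original with compressed text.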
Reproduce compress on a live file

    cd tools/juliusbrussee-caveman
    export ANTHROPIC_API_KEY=<your-key>
    # Compress a file (backs up original to FILE.original.md):
    cd caveman-compress && python3 -m scripts /path/to/your/CLAUDE.md

Note: the script aborts if CLAUDE.original.md already exists. Remove or rename it to
re-run.