# ANALYSIS: juliusbrussee-caveman

## Summary

Caveman is a Claude Code skill (and multi-agent plugin) that forces the model to respond in
minimal caveman-style prose, stripping articles, filler phrases, and verbose hedging while
preserving all technical substance and code blocks. The companion /caveman:compress sub-tool
applies the same style to persistent memory files (e.g. CLAUDE.md), reducing per-session
input tokens. The output-token benchmark harness (benchmarks/run.py) calls the Anthropic
API with temperature=0 on 10 fixed prompts and is fully reproducible; a separate three-arm
eval harness (evals/) provides a methodologically stronger control that isolates the skill’s
contribution from generic terseness. The README headline “~75% output token reduction” is
self-reported but the eval harness includes a committed snapshot (verified from source) and
runs offline via tiktoken.
Key findings from source review (2026-04-13): The eval snapshot is verified present and
intact. Character-proxy analysis of the committed snapshot yields ~53% median reduction
(caveman vs terse control arm), not the ~75% headline — consistent with the evals README’s
own note that the honest delta is skill vs terse, not skill vs baseline. The compress/
scripts are fully inspectable in the vendored clone (prior triage stated they were not
exposed; this is incorrect). A backup-overwrite guard is implemented in source and addresses
the previously identified data-loss risk. The snapshot contains four skill arms: caveman,
caveman-cn, caveman-es, and compress — only caveman was documented in the prior
analysis.
## What it does (verified from source)

### Core mechanism

The skill works by injecting a system-prompt constraint (caveman/SKILL.md) at session start.
The constraint instructs the model to:
- Drop articles (a/an/the), filler words (just/really/basically), pleasantries, and hedging.
- Use fragments and short synonyms (big not extensive; fix not “implement a solution for”).
- Leave code blocks, technical terms, inline code, and error messages completely unchanged.
- Auto-escalate to full prose for security warnings and irreversible-action confirmations, then revert to caveman after the critical section.
Six intensity levels are defined in skills/caveman/SKILL.md (verified from source —
caveman/SKILL.md is the auto-synced copy; canonical source is skills/caveman/SKILL.md):
- `lite` — drop filler, keep articles and full sentences.
- `full` — default; drop articles, fragments OK, short synonyms.
- `ultra` — abbreviate to DB/auth/config/req/res/fn/impl, arrows for causality (X → Y), one word when one word is enough.
- `wenyan-lite`, `wenyan-full`, `wenyan-ultra` — classical Chinese variants; full wenyan mode claims 80–90% character reduction (as reported in SKILL.md; the snapshot `caveman-cn` arm shows ~78% char-proxy reduction vs terse, partially verified).
The single source of truth is skills/caveman/SKILL.md. All other SKILL.md copies are
auto-generated by CI on push to main (verified from source — CLAUDE.md documents the sync
pattern; .github/workflows/sync-skill.yml is present).
The constraint is session-scoped, with no persistent state: it lasts only until the session ends or the user says “stop caveman” / “normal mode”.
### Interface / API

Activation commands (verified from source):

- `/caveman`
- `/caveman lite`
- `/caveman full`
- `/caveman ultra`
- `/caveman wenyan`
- `/caveman wenyan-ultra`
- `$caveman`

Compress sub-tool:

- `/caveman:compress <filepath>`

The compress sub-tool (caveman-compress/SKILL.md) instructs the agent to run
`cd caveman-compress && python3 -m scripts <absolute_filepath>`.

The Python scripts module (caveman-compress/scripts/) handles (verified from source):

- File-type detection via extension table and content heuristics (`detect.py`) — no tokens.
- Calls Claude to compress via `compress.py`; uses `ANTHROPIC_API_KEY` if set, otherwise falls back to the `claude --print` CLI (model: `claude-sonnet-4-5`, overridable via the `CAVEMAN_MODEL` env var).
- Validates output (`validate.py`) — checks heading count, exact code-block preservation, URL preservation, file-path preservation, and bullet count tolerance — no tokens consumed.
- On validation failure: calls Claude with a targeted fix prompt (cherry-picks errors only, no full recompression). Retries up to 2 times.
- On second retry failure: restores the original from backup, deletes the backup, reports the error.
- Backup-overwrite guard (verified from source): if `<file>.original.md` already exists, aborts with a warning rather than overwriting — prevents the data-loss risk identified in the original triage.
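The guard and write order can be sketched as follows. This is a minimal illustration, not the actual compress.py code: `call_model` is a hypothetical stand-in for the Claude call, and the function shape is assumed.

```python
from pathlib import Path

def call_model(text: str) -> str:
    # Hypothetical stand-in for the Claude compression call.
    return text

def compress_file(path: str) -> str:
    src = Path(path)
    backup = src.with_name(src.stem + ".original.md")
    # Backup-overwrite guard: abort rather than clobber an existing backup,
    # which would destroy the only pre-compression copy of the file.
    if backup.exists():
        return f"abort: {backup.name} already exists"
    original = src.read_text()
    compressed = call_model(original)
    backup.write_text(original)   # write the backup first ...
    src.write_text(compressed)    # ... then overwrite the file in place
    return "ok"
```

On a second invocation the guard fires, which is exactly the double-compression accident the implementation protects against.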
Installation:

- `npx skills add JuliusBrussee/caveman`
- `claude plugin marketplace add JuliusBrussee/caveman`

Supports Claude Code, Cursor, Windsurf, Copilot, Cline, Codex, Gemini CLI, and 40+ agents
via the skills protocol (verified from source: CLAUDE.md agent distribution table).
### Dependencies

- Node.js (`npx`) — installer only; no runtime dependency.
- Python 3 + `anthropic>=0.40.0` — `benchmarks/run.py` and `caveman-compress/scripts/`.
- `uv` + `tiktoken` — `evals/measure.py` (offline measurement; no API key required).
- No server-side component; runs entirely within the agent session.
### Scope / limitations

- Compression quality is model-dependent; no parser or linter enforces the output style. The model may drift from caveman-speak on long or complex outputs.
- Ultra mode compresses conjunctions and uses abbreviations; there is no evaluation of whether this causes the model to omit correct steps or introduce errors.
- `benchmarks/results/` contains only `.gitkeep`; the benchmark figures in the README are generated by running `benchmarks/run.py --update-readme` and are not committed as artefacts (verified from source).
- The `evals/snapshots/results.json` file IS committed (verified from source): generated 2026-04-08, model `claude-opus-4-6`, Claude Code CLI v2.1.97, 10 prompts.
- The evals tokenizer is tiktoken `o200k_base` (OpenAI BPE), an approximation of Claude’s tokenizer. Ratios between arms are meaningful; absolute token counts are approximate.
- Snapshot arms: `caveman`, `caveman-cn`, `caveman-es`, `compress` (verified from source). The prior analysis documented only the `caveman` arm.
- Wenyan mode: no dedicated benchmark file, but the `caveman-cn` arm in the evals snapshot provides quantitative data (~78% char-proxy reduction vs terse, partially verified).
- The `caveman-compress/scripts/` Python module is fully inspectable in the vendored clone (prior triage stated it was not exposed; this is incorrect — all five source files are present and readable).
## Benchmark claims — verified vs as-reported

### README benchmark table (benchmarks/run.py)

Methodology (verified from source): Anthropic API, temperature=0, 3 trials per prompt per
mode (normal vs caveman), median output tokens reported. Model: claude-sonnet-4-20250514.
10 fixed prompts in benchmarks/prompts.json. Results saved as timestamped JSON to
benchmarks/results/. README updated via --update-readme flag.
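The per-row arithmetic (median of three trials per mode, then percent saved) can be sketched as follows; this is a minimal illustration, not the run.py code itself, and the non-median trial values are invented around the table's reported medians.

```python
from statistics import median

def pct_saved(normal_trials, caveman_trials):
    # Median output tokens per mode, then percentage saved,
    # as in the "Saved" column of the benchmark table.
    n, c = median(normal_trials), median(caveman_trials)
    return round(100 * (n - c) / n)

pct_saved([1180, 1175, 1190], [159, 161, 158])  # 87
```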
| Task | Normal (tokens) | Caveman (tokens) | Saved | Status |
|---|---|---|---|---|
| Explain React re-render bug | 1180 | 159 | 87% | as reported |
| Fix auth middleware token expiry | 704 | 121 | 83% | as reported |
| Set up PostgreSQL connection pool | 2347 | 380 | 84% | as reported |
| Explain git rebase vs merge | 702 | 292 | 58% | as reported |
| Refactor callback to async/await | 387 | 301 | 22% | as reported |
| Architecture: microservices vs monolith | 446 | 310 | 30% | as reported |
| Review PR for security issues | 678 | 398 | 41% | as reported |
| Docker multi-stage build | 1042 | 290 | 72% | as reported |
| Debug PostgreSQL race condition | 1200 | 232 | 81% | as reported |
| Implement React error boundary | 3454 | 456 | 87% | as reported |
| Average | 1214 | 294 | 65% | as reported |
Range: 22%–87%. Script and prompts are committed and reproducible with an API key. Result
artefacts are not committed (verified from source — benchmarks/results/ contains only
.gitkeep).
### Evals harness — three-arm design (evals/)

Methodology (verified from source): three arms — baseline (no system prompt), terse
(“Answer concisely.”), skill (“Answer concisely.\n\n{SKILL.md}”). The honest delta is
skill vs terse. Snapshot committed to evals/snapshots/results.json, generated 2026-04-08.
Measured offline with tiktoken o200k_base.
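The three-arm delta computation can be sketched as follows. This is a dependency-free sketch under stated assumptions: a whitespace split stands in for tiktoken's `o200k_base` encoding, and the function shape is assumed, not taken from evals/measure.py.

```python
def count_tokens(text: str) -> int:
    # Whitespace split as a stand-in; the real harness uses
    # tiktoken.get_encoding("o200k_base") for offline measurement.
    return len(text.split())

def honest_delta(baseline: str, terse: str, skill: str) -> dict:
    b, t, s = (count_tokens(x) for x in (baseline, terse, skill))
    return {
        "skill_vs_baseline": round(100 * (b - s) / b),
        "skill_vs_terse": round(100 * (t - s) / t),  # the honest delta
    }
```

The two percentages diverge exactly when the terse control is itself much shorter than the baseline, which is why skill-vs-baseline figures overstate the skill's own contribution.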
| Metric | Value | Status |
|---|---|---|
| Snapshot committed to git | Yes | verified from source |
| CI runs offline (no API key) | Yes | verified from source |
| Tokenizer | tiktoken o200k_base (approx) | verified from source |
| Number of prompts | 10 | verified from source |
| Runs per arm | 1 (single run) | verified from source |
| Statistical significance | Not powered; stdev disclosed | verified from source |
| Arms in snapshot | baseline, terse, caveman, caveman-cn, caveman-es, compress | verified from source |
Character-proxy savings (caveman arm vs terse control, computed from committed snapshot using character count as a token-length proxy — tiktoken numbers differ but ratios are directionally consistent):
| Metric | caveman vs terse | caveman vs baseline | Status |
|---|---|---|---|
| Median | ~53% | ~49% | partially verified (char proxy) |
| Mean | ~50% | ~49% | partially verified (char proxy) |
| Min | ~-2% | ~+10% | partially verified (char proxy) |
| Max | ~89% | ~87% | partially verified (char proxy) |
| Stdev | ~24% | — | partially verified (char proxy) |
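The character-proxy figures above can be reproduced from a committed snapshot in a few lines. The snapshot shape used here is an assumption for illustration, not the actual results.json schema.

```python
from statistics import mean, median

def char_proxy_savings(snapshot: dict) -> dict:
    # Assumed snapshot shape: {prompt_id: {arm_name: response_text}}.
    # len() serves as the token-length proxy, as in the table above.
    deltas = [
        100 * (len(arms["terse"]) - len(arms["caveman"])) / len(arms["terse"])
        for arms in snapshot.values()
    ]
    return {"median": median(deltas), "mean": mean(deltas),
            "min": min(deltas), "max": max(deltas)}
```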
The prior analysis stated “~75% output token reduction (updated from 65% triage)”. This
figure reflects the skill-vs-baseline comparison, not the skill-vs-terse honest delta.
From the committed snapshot, the honest delta is approximately 50–53% (median, char proxy).
The README’s ~75% headline is derived from benchmarks/run.py runs (caveman vs normal
baseline) whose artefacts are not committed.
### Compress sub-tool — README table

| File | Original (tokens) | Compressed (tokens) | Saved | Status |
|---|---|---|---|---|
| claude-md-preferences.md | 706 | 285 | 59.6% | as reported |
| project-notes.md | 1145 | 535 | 53.3% | as reported |
| claude-md-project.md | 1122 | 687 | 38.8% | as reported |
| todo-list.md | 627 | 388 | 38.1% | as reported |
| mixed-with-code.md | 888 | 574 | 35.4% | as reported |
| Average | 898 | 494 | 45% | as reported |
Test fixture files (both compressed and original versions) are committed in
tests/caveman-compress/ and match the five files in this table (verified from source).
The reported savings percentages are as-reported pending a tokenizer run against the
fixtures.
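Such a tokenizer run could look like the following sketch. The fixture naming convention (compressed file plus a `<stem>.original.md` sibling) is an assumption, and a crude chars-per-token heuristic stands in for tiktoken.

```python
from pathlib import Path

def count_tokens(text: str) -> int:
    # Crude chars-per-token heuristic; a real run would swap in
    # tiktoken.get_encoding("o200k_base").encode instead.
    return len(text) // 4

def fixture_savings(fixtures_dir: str, name: str) -> float:
    d = Path(fixtures_dir)
    # Assumed layout: the compressed fixture sits next to a
    # <stem>.original.md sibling, mirroring the compress tool's convention.
    compressed = count_tokens((d / name).read_text())
    original = count_tokens((d / (Path(name).stem + ".original.md")).read_text())
    return round(100 * (original - compressed) / original, 1)
```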
## Architectural assessment

### What’s genuinely novel

Style-layer compression, not structural compression. Almost all context-management work targets memory retrieval, chunking, or eviction policies — decisions made before the model sees tokens. Caveman operates purely at the style layer: a system-prompt constraint forces output compression without altering what the model knows or retrieves. This makes it complementary to tiered-loading or RAG approaches, not competing with them.
Write-once input compression with a human-readable backup. The compress sub-tool converts
persistent memory files once and keeps a .original.md backup. The amortisation logic is
correct: a CLAUDE.md that saves 45% input tokens on every session start compounds across
the lifetime of the project.
Three-arm eval design. The evals/ harness separates “skill vs no skill” from “skill vs
generic terseness request”. The honest delta (skill vs terse) prevents conflating the
skill’s contribution with the well-known effect that “be brief” instructions reduce output
length. This is methodologically more rigorous than the benchmarks/run.py harness and is
rare for a tool of this maturity.
Committed snapshot with offline measurement. The eval snapshot is in git; CI can verify the numbers without an API key. Any SKILL.md change that alters token counts appears as a diff.
Auto-clarity escape hatch. The SKILL.md explicitly instructs the model to revert to full prose for security warnings and irreversible-action confirmations, then resume caveman. This is a practical safety valve that most style-compression tools omit (verified from source).
Backup-overwrite guard. compress_file() in caveman-compress/scripts/compress.py checks
whether <file>.original.md already exists before writing the backup, and aborts with a
warning if it does (verified from source). The prior triage identified this as a risk; it is
addressed in the implementation.
Hook system. Three hooks communicate via a flag file at ~/.claude/.caveman-active: a
SessionStart hook that activates caveman mode and injects the ruleset as system context, a
UserPromptSubmit hook that tracks mode changes from slash commands, and a statusline script
that renders a visual badge. All hooks silent-fail on filesystem errors (verified from source).
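The flag-file handshake can be sketched as follows. The file's contents (a mode name) are an assumption since only the path is documented; the silent-fail behavior mirrors what the hooks do.

```python
from pathlib import Path

def caveman_mode(flag=Path.home() / ".claude" / ".caveman-active"):
    # Read the shared flag file written by the SessionStart hook and
    # consumed by the statusline. Any filesystem error (missing file,
    # permissions) returns None rather than raising, as the hooks do.
    try:
        return Path(flag).read_text().strip() or None
    except OSError:
        return None
```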
### Gaps and risks

No fidelity measurement. The evals README explicitly acknowledges this gap: a skill that replies with a single character would score best and “win”. There is no judge-model rubric or task-accuracy evaluation. Ultra mode’s abbreviation density is high enough that omissions are plausible on long multi-step answers.
Benchmark artefacts absent for benchmarks/run.py. The benchmarks/results/ directory
contains only .gitkeep. The 65% / 75% figures cannot be verified without running the
script against a live API key (verified from source).
Single run per eval arm. The evals snapshot is one run per (prompt, arm). The README discloses this correctly and provides stdev, but numbers can be noisy — especially for prompts where the model’s response length is bimodal. One prompt (Node.js memory leak) has a negative savings value (~-2%) in the caveman arm, confirming real variance (verified from source — character-proxy analysis of committed snapshot).
Tokenizer mismatch. tiktoken o200k_base is OpenAI’s BPE. Claude uses a different
tokenizer. Ratios are directionally correct but absolute token counts should not be quoted
as exact Claude tokens.
Compress validation is structural, not semantic (verified from source). The validate.py
checks heading count, exact code blocks, URLs, paths, and bullet count tolerance. It does
not detect compressed inline code spans, subtly altered technical instructions, or semantic
paraphrase of prose. An error that passes structural validation could still degrade quality.
Style drift in long sessions. No mechanism enforces the constraint beyond the initial system prompt. Intensity level resets at session end, requiring re-activation.
Wenyan mode: the snapshot caveman-cn arm is the only quantitative data for the Chinese
variant (~78% char-proxy reduction vs terse, partially verified). The SKILL.md claim of
80–90% character reduction sits slightly above this measured figure (partially verified).
Single-source-of-truth drift risk. The CI sync pattern means editing any auto-synced
SKILL.md copy has no persistent effect — CI overwrites on next push to main. This could
confuse contributors who edit caveman/SKILL.md instead of skills/caveman/SKILL.md
(verified from source — documented in CLAUDE.md).
## Source review

### File structure and single source of truth (verified from source)

The repo uses a CI sync pattern (.github/workflows/sync-skill.yml) to distribute a single
canonical skills/caveman/SKILL.md to all agent-specific locations. Any edit to an
auto-synced copy is overwritten on next push to main. This is documented in CLAUDE.md
and confirmed by the workflow file.
### Compress pipeline — full implementation (verified from source)

The compress pipeline in caveman-compress/scripts/ is fully inspectable in the vendored
clone. All five Python modules are present: main.py (CLI entry), cli.py (argument
parsing), compress.py (orchestrator), detect.py (file-type classifier), and
validate.py (structural validator). The pipeline is:
- `detect.py` — classifies the file as `natural_language`, `code`, `config`, or `unknown` via an extension table; for extensionless files, falls back to JSON/YAML/code-line heuristics. Skips `.original.md` backup files.
- `compress.py` — reads the file, checks for an existing backup (aborts if found), calls Claude, writes the backup to `<stem>.original.md`, writes the compressed text to the original path.
- `validate.py` — structural checks: heading count (error if mismatched), code blocks exact match (error), URL set exact match (error), file-path set match (warning), bullet count within 15% tolerance (warning). Only errors trigger a retry.
- Retry loop — on error, sends a targeted fix prompt with the specific error messages. Max 2 retries. On final failure, restores the original from backup.
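A minimal sketch of the structural checks described above (the function shape is assumed rather than taken from validate.py, and the file-path set check is omitted for brevity):

```python
import re

def validate(original: str, compressed: str):
    # Errors trigger a retry; warnings are reported but non-blocking.
    errors, warnings = [], []
    headings = lambda t: re.findall(r"^#{1,6} ", t, re.M)
    blocks = lambda t: re.findall(r"`{3}.*?`{3}", t, re.S)  # fenced code blocks
    urls = lambda t: set(re.findall(r"https?://\S+", t))
    bullets = lambda t: re.findall(r"^\s*[-*] ", t, re.M)

    if len(headings(original)) != len(headings(compressed)):
        errors.append("heading count mismatch")
    if blocks(original) != blocks(compressed):
        errors.append("code blocks not preserved exactly")
    if urls(original) != urls(compressed):
        errors.append("URL set changed")
    b0, b1 = len(bullets(original)), len(bullets(compressed))
    if b0 and abs(b1 - b0) / b0 > 0.15:
        warnings.append("bullet count outside 15% tolerance")
    return errors, warnings
```

Note how purely structural the checks are: a compressed file that paraphrases an instruction incorrectly but keeps its headings, code blocks, and URLs intact passes cleanly, which is the semantic-validation gap flagged earlier.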
The model used is claude-sonnet-4-5 (overridable via CAVEMAN_MODEL env var). API key
path: ANTHROPIC_API_KEY env var → Anthropic SDK; fallback to claude --print CLI.
### Eval snapshot arms (verified from source)

The committed snapshot contains six arms: baseline, terse, caveman,
caveman-cn, caveman-es, and compress. The compress arm measures the output-token
reduction of the compress skill (a SKILL.md for response style), not the file-compression
savings. The caveman-cn arm is the Chinese-language caveman variant; caveman-es is the
Spanish-language variant. These additional arms were not documented in the prior analysis.
## Recommendation

Adopt for individual Claude Code sessions where output verbosity is the bottleneck. The
mechanism is sound, the install is a single command, and the style constraint is
self-reverting on safety-critical outputs. The three-arm eval design is more rigorous than
most comparable tools — run evals/measure.py against the committed snapshot to obtain the
honest skill-vs-terse delta before quoting the numbers.
Use caveman:compress on CLAUDE.md and persistent memory files where the files are in
git (so the backup is implicit in version history) and where the project lifetime is long
enough to amortise the write-once compression cost across many sessions. The backup-overwrite
guard prevents double-compression accidents (verified from source).
Do not use Ultra mode for multi-step sequences until fidelity evaluation exists. The evals harness’s acknowledged gap (no judge-model accuracy rubric) is the main blocker for recommending Ultra in production workflows.
When quoting savings figures: use ~50–53% (median, caveman vs terse control, char proxy from
committed snapshot) as the honest output-token delta. The README’s ~75% headline is from
benchmarks/run.py (caveman vs normal baseline, API key required, results not committed).
The two figures measure different things; the evals honest delta is the correct figure to use
when comparing caveman against other output-compression approaches.
## Comparison hooks (for ANALYSIS.md matrix)

| Dimension | caveman |
|---|---|
| Approach | Style-layer system-prompt constraint (output); LLM-rewrite of memory files (input) |
| Compression | ~50–53% median output tokens vs terse control (partially verified, char proxy, committed snapshot); 22%–87% vs baseline (as reported, benchmarks/run.py); 35%–60% input on memory files (as reported, README) |
| Token budget model | None — no hard budget; style constraint is session-scoped and intensity-selectable |
| Injection strategy | System-prompt injection at session start; compress sub-tool writes files to disk once |
| Eviction | None — operates on output, not context retrieval or eviction |
| Benchmark harness | Two: benchmarks/run.py (API key required, results not committed); evals/ (snapshot committed, offline measurement via tiktoken) |
| License | MIT |
| Maturity | Single-file skill; 6 intensity levels; 10,897 stars at 6 days; no fidelity eval |