# ANALYSIS: juliusbrussee-caveman

## Summary

Caveman is a Claude Code skill (and multi-agent plugin) that forces the model to respond in
minimal caveman-style prose, stripping articles, filler phrases, and verbose hedging while
preserving all technical substance and code blocks. The companion /caveman:compress sub-tool
applies the same style to persistent memory files (e.g. CLAUDE.md), reducing per-session
input tokens. The output-token benchmark harness (benchmarks/run.py) calls the Anthropic
API with temperature=0 on 10 fixed prompts and is fully reproducible; a separate three-arm
eval harness (evals/) provides a methodologically stronger control that isolates the skill’s
contribution from generic terseness. The README headline “~75% output token reduction” is
self-reported but the eval harness includes a committed snapshot (verified from source) and
runs offline via tiktoken.
Key findings from source review (2026-04-13): The eval snapshot is verified present and
intact. Character-proxy analysis of the committed snapshot yields ~53% median reduction
(caveman vs terse control arm), not the ~75% headline — consistent with the evals README’s
own note that the honest delta is skill vs terse, not skill vs baseline. The compress/
scripts are fully inspectable in the vendored clone (prior triage stated they were not
exposed; this is incorrect). A backup-overwrite guard is implemented in source and addresses
the previously identified data-loss risk. The snapshot contains four skill arms: caveman,
caveman-cn, caveman-es, and compress — only caveman was documented in the prior
analysis.
## What it does (verified from source)

### Core mechanism

The skill works by injecting a system-prompt constraint (caveman/SKILL.md) at session start.
The constraint instructs the model to:
- Drop articles (a/an/the), filler words (just/really/basically), pleasantries, and hedging.
- Use fragments and short synonyms (big not extensive; fix not “implement a solution for”).
- Leave code blocks, technical terms, inline code, and error messages completely unchanged.
- Auto-escalate to full prose for security warnings and irreversible-action confirmations, then revert to caveman after the critical section.
Six intensity levels are defined in skills/caveman/SKILL.md (verified from source —
caveman/SKILL.md is the auto-synced copy; canonical source is skills/caveman/SKILL.md):
- `lite` — drop filler, keep articles and full sentences.
- `full` — default; drop articles, fragments OK, short synonyms.
- `ultra` — abbreviate to DB/auth/config/req/res/fn/impl, arrows for causality (X → Y), one word when one word is enough.
- `wenyan-lite`, `wenyan-full`, `wenyan-ultra` — classical Chinese variants; full wenyan mode claims 80–90% character reduction (as reported in SKILL.md; the snapshot `caveman-cn` arm shows ~78% char-proxy reduction vs terse, partially verified).
The single source of truth is skills/caveman/SKILL.md. All other SKILL.md copies are
auto-generated by CI on push to main (verified from source — CLAUDE.md documents the sync
pattern; .github/workflows/sync-skill.yml is present).
The constraint is session-scoped, with no persistent state: it lasts only until the session ends or the user says “stop caveman” / “normal mode”.
### Interface / API

Activation commands (verified from source):

- `/caveman`
- `/caveman lite`
- `/caveman full`
- `/caveman ultra`
- `/caveman wenyan`
- `/caveman wenyan-ultra`
- `$caveman`

Compress sub-tool:

- `/caveman:compress <filepath>`

The compress sub-tool (caveman-compress/SKILL.md) instructs the agent to run
`cd caveman-compress && python3 -m scripts <absolute_filepath>`.

The Python scripts module (caveman-compress/scripts/) handles (verified from source):

- File-type detection via extension table and content heuristics (`detect.py`) — no tokens.
- Calls Claude to compress via `compress.py`; uses `ANTHROPIC_API_KEY` if set, otherwise falls back to the `claude --print` CLI (model: `claude-sonnet-4-5`, overridable via the `CAVEMAN_MODEL` env var).
- Validates output (`validate.py`) — checks heading count, exact code-block preservation, URL preservation, file-path preservation, and bullet count tolerance — no tokens consumed.
- On validation failure: calls Claude with a targeted fix prompt (cherry-picks errors only, no full recompression). Retries up to 2 times.
- On second retry failure: restores the original from backup, deletes the backup, reports the error.
- Backup-overwrite guard (verified from source): if `<file>.original.md` already exists, aborts with a warning rather than overwriting — prevents the data-loss risk identified in the original triage.
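The guard and write order can be sketched as follows. This is a minimal illustration, not the actual compress.py code: `call_model` is a hypothetical stand-in for the Claude call, and the function shape is assumed.

```python
from pathlib import Path

def call_model(text: str) -> str:
    # Hypothetical stand-in for the Claude compression call.
    return text

def compress_file(path: str) -> str:
    src = Path(path)
    backup = src.with_name(src.stem + ".original.md")
    # Backup-overwrite guard: abort rather than clobber an existing backup,
    # which would destroy the only pre-compression copy of the file.
    if backup.exists():
        return f"abort: {backup.name} already exists"
    original = src.read_text()
    compressed = call_model(original)
    backup.write_text(original)   # write the backup first ...
    src.write_text(compressed)    # ... then overwrite the file in place
    return "ok"
```

On a second invocation the guard fires, which is exactly the double-compression accident the implementation protects against.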
Installation:

- `npx skills add JuliusBrussee/caveman`
- `claude plugin marketplace add JuliusBrussee/caveman`

Supports Claude Code, Cursor, Windsurf, Copilot, Cline, Codex, Gemini CLI, and 40+ agents
via the skills protocol (verified from source: CLAUDE.md agent distribution table).
### Dependencies

- Node.js (`npx`) — installer only; no runtime dependency.
- Python 3 + `anthropic>=0.40.0` — `benchmarks/run.py` and `caveman-compress/scripts/`.
- `uv` + `tiktoken` — `evals/measure.py` (offline measurement; no API key required).
- No server-side component; runs entirely within the agent session.
### Scope / limitations

- Compression quality is model-dependent; no parser or linter enforces the output style. The model may drift from caveman-speak on long or complex outputs.
- Ultra mode compresses conjunctions and uses abbreviations; there is no evaluation of whether this causes the model to omit correct steps or introduce errors.
- `benchmarks/results/` contains only `.gitkeep`; the benchmark figures in the README are generated by running `benchmarks/run.py --update-readme` and are not committed as artefacts (verified from source).
- The `evals/snapshots/results.json` file IS committed (verified from source): generated 2026-04-08, model `claude-opus-4-6`, Claude Code CLI v2.1.97, 10 prompts.
- The evals tokenizer is tiktoken `o200k_base` (OpenAI BPE), an approximation of Claude’s tokenizer. Ratios between arms are meaningful; absolute token counts are approximate.
- Snapshot arms: `caveman`, `caveman-cn`, `caveman-es`, `compress` (verified from source). The prior analysis documented only the `caveman` arm.
- Wenyan mode: no dedicated benchmark file, but the `caveman-cn` arm in the evals snapshot provides quantitative data (~78% char-proxy reduction vs terse, partially verified).
- The `caveman-compress/scripts/` Python module is fully inspectable in the vendored clone (prior triage stated it was not exposed; this is incorrect — all five source files are present and readable).
## Benchmark claims — verified vs as-reported

### README benchmark table (benchmarks/run.py)

Methodology (verified from source): Anthropic API, temperature=0, 3 trials per prompt per
mode (normal vs caveman), median output tokens reported. Model: claude-sonnet-4-20250514.
10 fixed prompts in benchmarks/prompts.json. Results saved as timestamped JSON to
benchmarks/results/. README updated via --update-readme flag.
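The per-row arithmetic (median of three trials per mode, then percent saved) can be sketched as follows; this is a minimal illustration, not the run.py code itself, and the non-median trial values are invented around the table's reported medians.

```python
from statistics import median

def pct_saved(normal_trials, caveman_trials):
    # Median output tokens per mode, then percentage saved,
    # as in the "Saved" column of the benchmark table.
    n, c = median(normal_trials), median(caveman_trials)
    return round(100 * (n - c) / n)

pct_saved([1180, 1175, 1190], [159, 161, 158])  # 87
```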
| Task | Normal (tokens) | Caveman (tokens) | Saved | Status |
|---|---|---|---|---|
| Explain React re-render bug | 1180 | 159 | 87% | as reported |
| Fix auth middleware token expiry | 704 | 121 | 83% | as reported |
| Set up PostgreSQL connection pool | 2347 | 380 | 84% | as reported |
| Explain git rebase vs merge | 702 | 292 | 58% | as reported |
| Refactor callback to async/await | 387 | 301 | 22% | as reported |
| Architecture: microservices vs monolith | 446 | 310 | 30% | as reported |
| Review PR for security issues | 678 | 398 | 41% | as reported |
| Docker multi-stage build | 1042 | 290 | 72% | as reported |
| Debug PostgreSQL race condition | 1200 | 232 | 81% | as reported |
| Implement React error boundary | 3454 | 456 | 87% | as reported |
| Average | 1214 | 294 | 65% | as reported |
Range: 22%–87%. Script and prompts are committed and reproducible with an API key. Result
artefacts are not committed (verified from source — benchmarks/results/ contains only
.gitkeep).
### Evals harness — three-arm design (evals/)

Methodology (verified from source): three arms — baseline (no system prompt), terse
(“Answer concisely.”), skill (“Answer concisely.\n\n{SKILL.md}”). The honest delta is
skill vs terse. Snapshot committed to evals/snapshots/results.json, generated 2026-04-08.
Measured offline with tiktoken o200k_base.
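The three-arm delta computation can be sketched as follows. This is a dependency-free sketch under stated assumptions: a whitespace split stands in for tiktoken's `o200k_base` encoding, and the function shape is assumed, not taken from evals/measure.py.

```python
def count_tokens(text: str) -> int:
    # Whitespace split as a stand-in; the real harness uses
    # tiktoken.get_encoding("o200k_base") for offline measurement.
    return len(text.split())

def honest_delta(baseline: str, terse: str, skill: str) -> dict:
    b, t, s = (count_tokens(x) for x in (baseline, terse, skill))
    return {
        "skill_vs_baseline": round(100 * (b - s) / b),
        "skill_vs_terse": round(100 * (t - s) / t),  # the honest delta
    }
```

The two percentages diverge exactly when the terse control is itself much shorter than the baseline, which is why skill-vs-baseline figures overstate the skill's own contribution.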
| Metric | Value | Status |
|---|---|---|
| Snapshot committed to git | Yes | verified from source |
| CI runs offline (no API key) | Yes | verified from source |
| Tokenizer | tiktoken o200k_base (approx) | verified from source |
| Number of prompts | 10 | verified from source |
| Runs per arm | 1 (single run) | verified from source |
| Statistical significance | Not powered; stdev disclosed | verified from source |
| Arms in snapshot | baseline, terse, caveman, caveman-cn, caveman-es, compress | verified from source |
Character-proxy savings (caveman arm vs terse control, computed from committed snapshot using character count as a token-length proxy — tiktoken numbers differ but ratios are directionally consistent):
| Metric | caveman vs terse | caveman vs baseline | Status |
|---|---|---|---|
| Median | ~53% | ~49% | partially verified (char proxy) |
| Mean | ~50% | ~49% | partially verified (char proxy) |
| Min | ~-2% | ~+10% | partially verified (char proxy) |
| Max | ~89% | ~87% | partially verified (char proxy) |
| Stdev | ~24% | — | partially verified (char proxy) |
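The character-proxy figures above can be reproduced from a committed snapshot in a few lines. The snapshot shape used here is an assumption for illustration, not the actual results.json schema.

```python
from statistics import mean, median

def char_proxy_savings(snapshot: dict) -> dict:
    # Assumed snapshot shape: {prompt_id: {arm_name: response_text}}.
    # len() serves as the token-length proxy, as in the table above.
    deltas = [
        100 * (len(arms["terse"]) - len(arms["caveman"])) / len(arms["terse"])
        for arms in snapshot.values()
    ]
    return {"median": median(deltas), "mean": mean(deltas),
            "min": min(deltas), "max": max(deltas)}
```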
The prior analysis stated “~75% output token reduction (updated from 65% triage)”. This
figure reflects the skill-vs-baseline comparison, not the skill-vs-terse honest delta.
From the committed snapshot, the honest delta is approximately 50–53% (median, char proxy).
The README’s ~75% headline is derived from benchmarks/run.py runs (caveman vs normal
baseline) whose artefacts are not committed.
### Compress sub-tool — README table

| File | Original (tokens) | Compressed (tokens) | Saved | Status |
|---|---|---|---|---|
| claude-md-preferences.md | 706 | 285 | 59.6% | as reported |
| project-notes.md | 1145 | 535 | 53.3% | as reported |
| claude-md-project.md | 1122 | 687 | 38.8% | as reported |
| todo-list.md | 627 | 388 | 38.1% | as reported |
| mixed-with-code.md | 888 | 574 | 35.4% | as reported |
| Average | 898 | 494 | 45% | as reported |
Test fixture files (both compressed and original versions) are committed in
tests/caveman-compress/ and match the five files in this table (verified from source).
The reported savings percentages are as-reported pending a tokenizer run against the
fixtures.
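Such a tokenizer run could look like the following sketch. The fixture naming convention (compressed file plus a `<stem>.original.md` sibling) is an assumption, and a crude chars-per-token heuristic stands in for tiktoken.

```python
from pathlib import Path

def count_tokens(text: str) -> int:
    # Crude chars-per-token heuristic; a real run would swap in
    # tiktoken.get_encoding("o200k_base").encode instead.
    return len(text) // 4

def fixture_savings(fixtures_dir: str, name: str) -> float:
    d = Path(fixtures_dir)
    # Assumed layout: the compressed fixture sits next to a
    # <stem>.original.md sibling, mirroring the compress tool's convention.
    compressed = count_tokens((d / name).read_text())
    original = count_tokens((d / (Path(name).stem + ".original.md")).read_text())
    return round(100 * (original - compressed) / original, 1)
```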
## Architectural assessment

### What’s genuinely novel

Style-layer compression, not structural compression. Almost all context-management work targets memory retrieval, chunking, or eviction policies — decisions made before the model sees tokens. Caveman operates purely at the style layer: a system-prompt constraint forces output compression without altering what the model knows or retrieves. This makes it complementary to tiered-loading or RAG approaches, not competing with them.
Write-once input compression with a human-readable backup. The compress sub-tool converts
persistent memory files once and keeps a .original.md backup. The amortisation logic is
correct: a CLAUDE.md that saves 45% input tokens on every session start compounds across
the lifetime of the project.
Three-arm eval design. The evals/ harness separates “skill vs no skill” from “skill vs
generic terseness request”. The honest delta (skill vs terse) prevents conflating the
skill’s contribution with the well-known effect that “be brief” instructions reduce output
length. This is methodologically more rigorous than the benchmarks/run.py harness and is
rare for a tool of this maturity.
Committed snapshot with offline measurement. The eval snapshot is in git; CI can verify the numbers without an API key. Any SKILL.md change that alters token counts appears as a diff.
Auto-clarity escape hatch. The SKILL.md explicitly instructs the model to revert to full prose for security warnings and irreversible-action confirmations, then resume caveman. This is a practical safety valve that most style-compression tools omit (verified from source).
Backup-overwrite guard. compress_file() in caveman-compress/scripts/compress.py checks
whether <file>.original.md already exists before writing the backup, and aborts with a
warning if it does (verified from source). The prior triage identified this as a risk; it is
addressed in the implementation.
Hook system. Three hooks communicate via a flag file at ~/.claude/.caveman-active: a
SessionStart hook that activates caveman mode and injects the ruleset as system context, a
UserPromptSubmit hook that tracks mode changes from slash commands, and a statusline script
that renders a visual badge. All hooks silent-fail on filesystem errors (verified from source).
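The flag-file handshake can be sketched as follows. The file's contents (a mode name) are an assumption since only the path is documented; the silent-fail behavior mirrors what the hooks do.

```python
from pathlib import Path

def caveman_mode(flag=Path.home() / ".claude" / ".caveman-active"):
    # Read the shared flag file written by the SessionStart hook and
    # consumed by the statusline. Any filesystem error (missing file,
    # permissions) returns None rather than raising, as the hooks do.
    try:
        return Path(flag).read_text().strip() or None
    except OSError:
        return None
```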
### Gaps and risks

No fidelity measurement. The evals README explicitly acknowledges this gap: a skill that replies with a single character would score best and “win”. There is no judge-model rubric or task-accuracy evaluation. Ultra mode’s abbreviation density is high enough that omissions are plausible on long multi-step answers.
Benchmark artefacts absent for benchmarks/run.py. The benchmarks/results/ directory
contains only .gitkeep. The 65% / 75% figures cannot be verified without running the
script against a live API key (verified from source).
Single run per eval arm. The evals snapshot is one run per (prompt, arm). The README discloses this correctly and provides stdev, but numbers can be noisy — especially for prompts where the model’s response length is bimodal. One prompt (Node.js memory leak) has a negative savings value (~-2%) in the caveman arm, confirming real variance (verified from source — character-proxy analysis of committed snapshot).
Tokenizer mismatch. tiktoken o200k_base is OpenAI’s BPE. Claude uses a different
tokenizer. Ratios are directionally correct but absolute token counts should not be quoted
as exact Claude tokens.
Compress validation is structural, not semantic (verified from source). The validate.py
checks heading count, exact code blocks, URLs, paths, and bullet count tolerance. It does
not detect compressed inline code spans, subtly altered technical instructions, or semantic
paraphrase of prose. An error that passes structural validation could still degrade quality.
Style drift in long sessions. No mechanism enforces the constraint beyond the initial system prompt. Intensity level resets at session end, requiring re-activation.
Wenyan mode: the snapshot caveman-cn arm is the only quantitative data for the Chinese
variant (~78% char-proxy reduction vs terse, partially verified). The SKILL.md claim of
80–90% character reduction sits slightly above this measured figure (partially verified).
Single-source-of-truth drift risk. The CI sync pattern means editing any auto-synced
SKILL.md copy has no persistent effect — CI overwrites on next push to main. This could
confuse contributors who edit caveman/SKILL.md instead of skills/caveman/SKILL.md
(verified from source — documented in CLAUDE.md).
## Source review

### File structure and single source of truth (verified from source)

The repo uses a CI sync pattern (.github/workflows/sync-skill.yml) to distribute a single
canonical skills/caveman/SKILL.md to all agent-specific locations. Any edit to an
auto-synced copy is overwritten on next push to main. This is documented in CLAUDE.md
and confirmed by the workflow file.
### Compress pipeline — full implementation (verified from source)

The compress pipeline in caveman-compress/scripts/ is fully inspectable in the vendored
clone. All five Python modules are present: main.py (CLI entry), cli.py (argument
parsing), compress.py (orchestrator), detect.py (file-type classifier), and
validate.py (structural validator). The pipeline is:
- `detect.py` — classifies the file as `natural_language`, `code`, `config`, or `unknown` via an extension table; for extensionless files, falls back to JSON/YAML/code-line heuristics. Skips `.original.md` backup files.
- `compress.py` — reads the file, checks for an existing backup (aborts if found), calls Claude, writes the backup to `<stem>.original.md`, writes the compressed text to the original path.
- `validate.py` — structural checks: heading count (error if mismatched), code blocks exact match (error), URL set exact match (error), file-path set match (warning), bullet count within 15% tolerance (warning). Only errors trigger a retry.
- Retry loop — on error, sends a targeted fix prompt with the specific error messages. Max 2 retries. On final failure, restores the original from backup.
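A minimal sketch of the structural checks described above (the function shape is assumed rather than taken from validate.py, and the file-path set check is omitted for brevity):

```python
import re

def validate(original: str, compressed: str):
    # Errors trigger a retry; warnings are reported but non-blocking.
    errors, warnings = [], []
    headings = lambda t: re.findall(r"^#{1,6} ", t, re.M)
    blocks = lambda t: re.findall(r"`{3}.*?`{3}", t, re.S)  # fenced code blocks
    urls = lambda t: set(re.findall(r"https?://\S+", t))
    bullets = lambda t: re.findall(r"^\s*[-*] ", t, re.M)

    if len(headings(original)) != len(headings(compressed)):
        errors.append("heading count mismatch")
    if blocks(original) != blocks(compressed):
        errors.append("code blocks not preserved exactly")
    if urls(original) != urls(compressed):
        errors.append("URL set changed")
    b0, b1 = len(bullets(original)), len(bullets(compressed))
    if b0 and abs(b1 - b0) / b0 > 0.15:
        warnings.append("bullet count outside 15% tolerance")
    return errors, warnings
```

Note how purely structural the checks are: a compressed file that paraphrases an instruction incorrectly but keeps its headings, code blocks, and URLs intact passes cleanly, which is the semantic-validation gap flagged earlier.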
The model used is claude-sonnet-4-5 (overridable via CAVEMAN_MODEL env var). API key
path: ANTHROPIC_API_KEY env var → Anthropic SDK; fallback to claude --print CLI.
### Eval snapshot arms (verified from source)

The committed snapshot contains six arms: baseline, terse, caveman,
caveman-cn, caveman-es, and compress. The compress arm measures the output-token
reduction of the compress skill (a SKILL.md for response style), not the file-compression
savings. The caveman-cn arm is the Chinese-language caveman variant; caveman-es is the
Spanish-language variant. These additional arms were not documented in the prior analysis.
## Recommendation

Adopt for individual Claude Code sessions where output verbosity is the bottleneck. The
mechanism is sound, the install is a single command, and the style constraint is
self-reverting on safety-critical outputs. The three-arm eval design is more rigorous than
most comparable tools — run evals/measure.py against the committed snapshot to obtain the
honest skill-vs-terse delta before quoting the numbers.
Use caveman:compress on CLAUDE.md and persistent memory files where the files are in
git (so the backup is implicit in version history) and where the project lifetime is long
enough to amortise the write-once compression cost across many sessions. The backup-overwrite
guard prevents double-compression accidents (verified from source).
Do not use Ultra mode for multi-step sequences until fidelity evaluation exists. The evals harness’s acknowledged gap (no judge-model accuracy rubric) is the main blocker for recommending Ultra in production workflows.
When quoting savings figures: use ~50–53% (median, caveman vs terse control, char proxy from
committed snapshot) as the honest output-token delta. The README’s ~75% headline is from
benchmarks/run.py (caveman vs normal baseline, API key required, results not committed).
The two figures measure different things; the evals honest delta is the correct figure to use
when comparing caveman against other output-compression approaches.
## Comparison hooks (for ANALYSIS.md matrix)

| Dimension | caveman |
|---|---|
| Approach | Style-layer system-prompt constraint (output); LLM-rewrite of memory files (input) |
| Compression | ~50–53% median output tokens vs terse control (partially verified, char proxy, committed snapshot); 22%–87% vs baseline (as reported, benchmarks/run.py); 35%–60% input on memory files (as reported, README) |
| Token budget model | None — no hard budget; style constraint is session-scoped and intensity-selectable |
| Injection strategy | System-prompt injection at session start; compress sub-tool writes files to disk once |
| Eviction | None — operates on output, not context retrieval or eviction |
| Benchmark harness | Two: benchmarks/run.py (API key required, results not committed); evals/ (snapshot committed, offline measurement via tiktoken) |
| License | MIT |
| Maturity | Single-file skill; 6 intensity levels; 10,897 stars at 6 days; no fidelity eval |