01 — Caveman: design teardown

"Why use many token when few do trick." — the caveman tagline

Caveman is the output-side member of the trio. It compresses what the model writes back — visible prose — and nothing else. It is also the only one of the three you are likely already running: this very research was produced with caveman mode active. Its design is the most surprising of the three precisely because there is almost nothing to it.

| Field | Value |
|---|---|
| Repository | JuliusBrussee/caveman |
| Pitch | Terse-register output compression; README headline "cuts ~75% of output tokens" |
| Form factor | Claude Code plugin (skills + hooks); not a binary, not a proxy, not an MCP server |
| Latest seen | v1.9.0 |
| Adoption (2026-06-18) | 74,446★ / 166 watchers (PR-inflated; see evidence) |
| Bucket hit | Visible output (~17% of heavy-session dollars) |
| Cache interaction | Neutral: never touches the cached prefix |

The magic: a prompt, not a program

The single most important fact about caveman is that it has no compression engine. There is no parser, no model, no subprocess, no byte ever transformed by caveman's own code. The "compression" happens inside the model's own decoding, for free, because a prompt instructed the model to talk shorter.

The engine is a markdown file — SKILL.md — that carries a register-shift instruction the model applies to its own generation: drop articles, filler, and hedging; keep code, errors, API names, and commit keywords verbatim; fragments are fine. That is the entire mechanism. Everything else in the plugin exists only to deliver that instruction reliably and to measure its effect.

   CAVEMAN ACTIVATION + EFFECT  (no bytes transformed by caveman code)

   session opens
        │
        ▼
   ┌──────────────────────────┐   SessionStart hook
   │ caveman-activate.js      │   injects the mode prompt from SKILL.md
   └──────────────────────────┘   into the system context
        │
        ▼
   ┌──────────────────────────┐   the register-shift INSTRUCTION now
   │ SKILL.md rule-set        │   sits in context: "drop articles/filler/
   │ (lite/full/ultra/wenyan) │    hedging; keep code/errors verbatim"
   └──────────────────────────┘

   user turn
        │
        ▼
   ┌──────────────────────────┐   UserPromptSubmit hook
   │ caveman-mode-tracker.js  │   re-asserts the active level every turn,
   └──────────────────────────┘   tallies tokens for /caveman-stats
        │
        ▼
   ┌──────────────────────────┐   the MODEL'S DECODER produces terser text
   │ model generation         │   because the prompt told it to.
   │ (compression happens     │   THIS is the "compression": it is
   │  HERE, in decoding)      │   generation-time, not a transform.
   └──────────────────────────┘
        │
        ▼
   shorter visible output  ── fewer output tokens billed at the 5×-priced rate

Because the compression is just the model choosing shorter words, it is unconditionally cache-safe (it only affects newly generated output, never the cached input prefix) and it costs zero runtime compute. That is the whole appeal: the cheapest possible intervention on the most expensive token class.
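To make the "prompt, not a program" point concrete, here is a minimal Python sketch of what a SessionStart-style hook has to do: take the rule-set text and hand it back to the host as extra context. This is not caveman's code (the real hook is caveman-activate.js). The function name is hypothetical, and the hookSpecificOutput / additionalContext envelope is an assumption about Claude Code's hook output format that should be checked against the current hook reference.

```python
import json
import sys


def build_session_start_output(skill_text: str, level: str = "full") -> dict:
    """Wrap a register-shift instruction in a hook response envelope.

    Nothing is compressed here: the function only forwards text.
    The envelope keys are an ASSUMPTION about the host's hook protocol.
    """
    context = f"caveman mode active, level={level}\n\n{skill_text.strip()}"
    return {
        "hookSpecificOutput": {
            "hookEventName": "SessionStart",
            "additionalContext": context,
        }
    }


if __name__ == "__main__":
    # Stand-in for the SKILL.md rule-set; the real file is longer.
    demo_rules = "Drop articles, filler, hedging. Keep code and errors verbatim."
    json.dump(build_session_start_output(demo_rules), sys.stdout)
```

The sketch contains no parser and no transform of model output, which is the point: all of the "compression" happens later, when the model generates under that instruction.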

The six intensity levels

The levels are not different programs — they are different strictness clauses in the same SKILL.md. Each ratchets the register harder.

| Level | What changes | Measured token cut (local, Fable tokenizer) |
|---|---|---|
| lite | Drop pleasantries and obvious filler; grammar intact | modest; quality-neutral |
| full | Drop articles/filler/hedging; fragments allowed (the default) | ~50–60% of visible prose |
| ultra | Adds prose-word abbreviation and causal arrows (→) | 58.5% (vs the README's "~75%" headline) |
| wenyan-lite | Switch to Classical Chinese register, lighter | character cut ≫ token cut (see below) |
| wenyan-full | Classical Chinese, fuller | 80.9% character cut = 56.6% token cut |
| wenyan-ultra | Classical Chinese, maximal word-dropping | 74.5% token cut (the measured ceiling) |

The wenyan levels are the cautionary tail of the ladder. Their headline "80–90%" is a character reduction; on the Fable tokenizer classical Chinese runs at roughly 0.9 characters per token (i.e. at or above one token per character), so the token saving is far smaller than the character saving implies, and on short phrases wenyan can cost more tokens than plain English ("凡測試皆過則提交" = 9 tokens vs "if all tests pass then commit" = 8). The savings that do exist come from extreme word-dropping in wenyan grammar, not from cheap characters — and they are bought with maximal, completely unmeasured comprehension and operator-legibility risk.
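The gap between character cut and token cut is plain arithmetic, sketched below. If the original text averages a certain number of characters per token and the rewrite averages fewer, the token saving is the character saving scaled by that density ratio. The function names are illustrative; the only inputs are the figures quoted above (80.9% and 56.6%, and the 8-token and 9-token phrase pair).

```python
def token_cut(char_cut: float, density_ratio: float) -> float:
    """Token saving implied by a character saving.

    density_ratio = (chars per token of the original text)
                  / (chars per token of the rewritten text).
    A ratio above 1 means the rewrite packs fewer characters into each
    token, so part of the character saving is lost.
    """
    return 1.0 - (1.0 - char_cut) * density_ratio


def implied_density_ratio(char_cut: float, measured_token_cut: float) -> float:
    """Back out the density ratio from a measured (char cut, token cut) pair."""
    return (1.0 - measured_token_cut) / (1.0 - char_cut)


# wenyan-full: 80.9% character cut measured as a 56.6% token cut.
ratio = implied_density_ratio(0.809, 0.566)
print(f"implied density ratio: {ratio:.2f}")  # 2.27

# Short-phrase case, using the token counts quoted in the text.
en, zh = "if all tests pass then commit", "凡測試皆過則提交"
en_tokens, zh_tokens = 8, 9
phrase_char_cut = 1 - len(zh) / len(en)
phrase_ratio = (len(en) / en_tokens) / (len(zh) / zh_tokens)
print(f"character cut: {phrase_char_cut:.1%}")                        # 72.4%
print(f"token cut: {token_cut(phrase_char_cut, phrase_ratio):.1%}")   # -12.5%
```

On the short phrase, a 72.4% character cut turns into a 12.5% token increase, which is the character-token confusion in miniature.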

Why deleting words is the only lever that works

Caveman's effectiveness rests on one property of byte-pair encoding: a BPE token count tracks word count, not character count. This single fact decides what works and what is folklore.

  • Deleting words saves tokens — the entire safe core of caveman. The local register ladder on identical content: polite/full 72 tok → plain 33 (−54%) → telegraphic 29 (−60%) → caveman 24 (−67%). Notably, on the reply side, caveman grammar bought zero tokens over plain telegraphic English — the savings come from removing words, not from breaking grammar.
  • Abbreviating common words does not save tokens, and usually costs them. Measured per-word: function = 1 token but fn = 2; without = 1 but w/o = 3; because = 1 but bc = 2; config = 1 but cfg = 3. BPE already compressed the common words. Abbreviation only pays on genuinely multi-token words (asynchronous → async, parameters → params, implementation → impl).
  • Glyph/symbol DSLs are a trap on Claude. Exotic glyphs cost 3.9–4.9 tokens each; a glyph DSL lost to plain telegraphic ASCII by ~26–28% in two independent local samples. The lone exception is the arrow → at 1.0 token (cheaper than ASCII -> at 2.0), which is why caveman-ultra adopts causal arrows. Beyond that one symbol, symbolic notation is anti-recommended, and a comprehension failure is layered on top of the cost failure: Claude Haiku 4.5 parses symbolic instructions at 100% but obeys them at only 26% fidelity (silent disobedience).
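The measurements in the list above can be turned into a small lookup that makes the rule of thumb checkable. The sketch below calls no tokenizer; it only re-uses the per-word counts and the register-ladder totals quoted in the text, and the helper names are illustrative.

```python
# Token counts copied from the measurements quoted above (no tokenizer is called).
MEASURED_TOKENS = {
    "function": 1, "fn": 2,
    "without": 1, "w/o": 3,
    "because": 1, "bc": 2,
    "config": 1, "cfg": 3,
}

# Register ladder on identical content, in tokens.
LADDER = [("polite/full", 72), ("plain", 33), ("telegraphic", 29), ("caveman", 24)]


def abbreviation_delta(word: str, abbrev: str) -> int:
    """Tokens saved by abbreviating; a negative value means it costs tokens."""
    return MEASURED_TOKENS[word] - MEASURED_TOKENS[abbrev]


def percent_cut(before: int, after: int) -> int:
    """Whole-percent token reduction from `before` to `after`."""
    return round(100 * (1 - after / before))


for word, abbrev in [("function", "fn"), ("without", "w/o"),
                     ("because", "bc"), ("config", "cfg")]:
    print(f"{word} -> {abbrev}: {abbreviation_delta(word, abbrev):+d} tokens")

baseline = LADDER[0][1]
for name, tokens in LADDER[1:]:
    print(f"{name}: -{percent_cut(baseline, tokens)}%")
```

Every abbreviation in the sample comes out negative, while each step down the ladder that removes words comes out positive.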

The practical consequence baked into caveman's design: the safe rungs (filler-strip, telegraphic) capture essentially all the realizable saving; the aggressive rungs (caveman-ultra, wenyan) add lossiness and risk for diminishing token returns.

The two hooks (and the host-state writes)

Delivering the instruction reliably requires two Claude Code hooks, both of which write to the host's ~/.claude configuration — a detail that matters because it is the exact surface that collides with RTK's hook (see combining).

  • SessionStart → caveman-activate.js: injects the mode prompt at session open.
  • UserPromptSubmit → caveman-mode-tracker.js: re-asserts the active level every turn (so it does not drift out of context over a long session) and tallies tokens for the /caveman-stats readout.

plugin.json declares the two hooks, and the installer (bin/lib/settings.js) writes them into ~/.claude/settings.json idempotently, with a Zod-validated shape, a hasCavemanHook probe to avoid duplicate registration, and a removeCavemanHooks uninstall path. The /caveman-stats numbers are computed by the hook reading the session JSONL; the model does not compute them.
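The idempotent-install pattern described above can be sketched in a few lines. This is an illustration in Python, not a port of bin/lib/settings.js: it skips schema validation entirely, the command strings and the "caveman" substring marker are hypothetical, and the nested hooks layout is an assumption about how Claude Code's settings.json is shaped.

```python
import copy

MARKER = "caveman"  # hypothetical: our entries are recognised by command text


def has_caveman_hook(settings: dict, event: str) -> bool:
    """Probe: is a caveman command already registered for this event?"""
    for group in settings.get("hooks", {}).get(event, []):
        for hook in group.get("hooks", []):
            if MARKER in hook.get("command", ""):
                return True
    return False


def add_caveman_hooks(settings: dict, commands: dict) -> dict:
    """Return a copy of settings with each hook registered at most once."""
    merged = copy.deepcopy(settings)
    hooks = merged.setdefault("hooks", {})
    for event, command in commands.items():
        if not has_caveman_hook(merged, event):
            hooks.setdefault(event, []).append(
                {"hooks": [{"type": "command", "command": command}]}
            )
    return merged


def remove_caveman_hooks(settings: dict) -> dict:
    """Return a copy of settings with caveman entries removed, others kept."""
    cleaned = copy.deepcopy(settings)
    hooks = cleaned.get("hooks", {})
    for event in list(hooks):
        kept_groups = []
        for group in hooks[event]:
            kept = [h for h in group.get("hooks", [])
                    if MARKER not in h.get("command", "")]
            if kept:
                kept_groups.append({**group, "hooks": kept})
        if kept_groups:
            hooks[event] = kept_groups
        else:
            del hooks[event]
    if not hooks:
        cleaned.pop("hooks", None)
    return cleaned


# Hypothetical command strings, for illustration only.
COMMANDS = {
    "SessionStart": "node caveman-activate.js",
    "UserPromptSubmit": "node caveman-mode-tracker.js",
}
```

Running add_caveman_hooks twice yields the same settings as running it once, and remove_caveman_hooks leaves any other tool's hook on the same event in place, which is the property that matters when another plugin shares the file.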

The ecosystem: the same trick aimed elsewhere

Caveman is the flagship of a family, and every sibling is the same "instruction, not engine" idea pointed at a different surface. This breadth is part of what distinguishes caveman from headroom and RTK — it is an ecosystem of small register-shift tools, not a single compressor.

                       THE CAVEMAN FAMILY

   ┌──────────────────────────────────────────────────────────────┐
   │  OUTPUT side                                                   │
   │    caveman          terse register on the model's visible prose│
   │    caveman-commit   terse Conventional-Commit messages         │
   │    caveman-review   terse PR-review comments (one line each)   │
   │    caveman-code     "code" variant, "~2× fewer tokens vs Codex"│
   └──────────────────────────────────────────────────────────────┘
   ┌──────────────────────────────────────────────────────────────┐
   │  INPUT / memory side                                           │
   │    caveman-compress rewrites memory files (CLAUDE.md) in place,│
   │                     keeps a .original.md backup ("~46% input") │
   │    caveman-shrink   compresses MCP tool DESCRIPTIONS           │
   │    cavemem          pre-compressed persistent memory over MCP  │
   │                     (SQLite+FTS5, BM25+vector retrieval)       │
   └──────────────────────────────────────────────────────────────┘
   ┌──────────────────────────────────────────────────────────────┐
   │  MULTI-AGENT side                                              │
   │    cavecrew         investigator / builder / reviewer subagents│
   │                     return caveman-compressed reports to the   │
   │                     main thread (~43.9% smaller, not the       │
   │                     claimed 60%)                               │
   └──────────────────────────────────────────────────────────────┘
   ┌──────────────────────────────────────────────────────────────┐
   │  ORCHESTRATION side                                            │
   │    cavekit          spec-driven autonomous build loop;         │
   │                     caveman-encoded blueprints + durable       │
   │                     SPEC.md so /clear doesn't force re-derive  │
   └──────────────────────────────────────────────────────────────┘

Two members are worth singling out because they create the only genuine overlaps with headroom and RTK:

  • cavecrew (compressed subagents) is caveman's answer to expensive agent-to-agent transfer. Investigator/builder/reviewer subagents do work and return reports that are caveman-compressed before re-entering the main context. This is the same goal as headroom's SharedContext, but cavecrew is Claude-Code-native and the compression is again just a register instruction on the subagent's output. The honest measured shrink is ~43.9%, not the claimed 60%.
  • cavemem (compressed memory) is the only caveman-family tool that acts on input. It stores caveman-compressed (lossy prose) observations in SQLite+FTS5 with BM25/vector retrieval, single-agent. This is the member that competes head-on with headroom's cross-agent memory + CCR — and loses on architecture, because cavemem's compression is lossy with no recovery path (the "confidently-wrong recalled fact" failure mode), whereas headroom's memory is reversible. If you adopt headroom memory, you retire cavemem for that workflow; running both is pure overhead.

What caveman has, and what it lacks

| Feature | Caveman |
|---|---|
| Compresses output (the 5×-priced class) | Yes: its whole purpose |
| Zero runtime compute | Yes (compression is in the model's decoder) |
| Unconditionally cache-safe | Yes (never touches the input prefix) |
| Cross-tool (works under any agent/model) | Yes: it is just a prompt; the register shift applies wherever the prompt is honored |
| Minutes to adopt | Yes |
| Multi-surface ecosystem (commit/review/memory/subagents) | Yes: the broadest family of the three |
| Compresses input (tool output, files, RAG) | No: out of scope (cavemem covers only memory injection, lossily) |
| Reversible / recoverable compression | No: dropped caveats are gone; re-ask to recover |
| Touches thinking (20% of dollars) | No: the README concedes it |
| Independent benchmark on agentic task success | No: every quality datapoint is QA/MCQA on older models |
| A measurable whole-session telemetry stream | No: /caveman-stats is a local tally, not a controlled measurement |

Self-cost, eval method, and failure mode

Self-cost. The caveman skill listing carried in context costs ~940 tokens of prefix per session (about 0.5% of the modeled heavy day in cache-read rent before it saves anything), plus the two hooks' host-state writes. Runtime compute is essentially zero.

Eval method (why the honest number is below the headline). The plugin's evals/measure.py counts tokens with tiktoken o200k_base, which approximates Claude's BPE but is not Claude's own tokenizer. It compares caveman against an "Answer concisely." terse-control baseline over 10 prompts and reports median/min/max/stdev. The README's "~75%" is the pooled ratio of that table (1214 → 294 tokens, a 75.8% cut); the same table's per-task mean is 65% (range 22–87%). A local re-measure on the Claude tokenizer puts caveman-ultra at a 58.5% cut in visible-prose tokens. The number is real and useful; it is simply smaller than the headline, and it measures what caveman adds on top of output that was already told to be concise.
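The difference between a pooled ratio and a per-task mean is easy to see in code. The sketch below reproduces the pooled figure from the two totals quoted above. The three-task table is a hypothetical toy, not the plugin's data, and is there only to show how a pooled cut can sit well above the per-task mean when the longest baseline compresses best.

```python
from statistics import mean


def pooled_cut(baseline: list, compressed: list) -> float:
    """Cut computed on summed tokens: a baseline-weighted mean of per-task cuts."""
    return 1 - sum(compressed) / sum(baseline)


def per_task_mean_cut(baseline: list, compressed: list) -> float:
    """Unweighted mean of each task's own cut."""
    return mean(1 - c / b for b, c in zip(baseline, compressed))


# Totals quoted from the README table.
print(f"pooled (README totals): {pooled_cut([1214], [294]):.1%}")  # 75.8%

# HYPOTHETICAL toy data: one long task that compresses well, two short ones
# that compress poorly.
toy_baseline = [1000, 100, 100]
toy_compressed = [200, 70, 60]
print(f"toy pooled:        {pooled_cut(toy_baseline, toy_compressed):.1%}")
print(f"toy per-task mean: {per_task_mean_cut(toy_baseline, toy_compressed):.1%}")
```

Because the pooled figure weights each task by its baseline length, a pooled 75% next to a per-task mean of 65% suggests the longer baselines in the benchmark compressed more than the shorter ones.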

Failure mode. Over-terse prose with no recovery — because the long form was never generated, recovering it means re-asking. The project's own issue tracker records the predictable version of this (issue #484 — caveman shrinking PR/issue titles it should have left verbatim). On the dollar axis, the deeper limit is structural: visible output is only ~17% of a heavy session, so even total muteness caps caveman's whole-bill effect at 17%, and the realistic figure is ~10% of session dollars — because the 20%-of-dollars thinking bucket is billed in full (and on Fable 5 cannot even be disabled), reachable only by the effort lever, never by any register instruction. That ceiling is the honest frame for everything caveman does; it is detailed in the dossier's output-discipline and style-compression chapters.
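The ceiling is a one-line product: the share of dollars in the bucket times the fraction of that bucket removed. The sketch below uses only the shares and the measured cut quoted in this teardown; the function name is illustrative.

```python
def whole_bill_saving(bucket_share: float, bucket_cut: float) -> float:
    """Fraction of the total bill saved by cutting one bucket."""
    return bucket_share * bucket_cut


VISIBLE_OUTPUT_SHARE = 0.17  # visible output as a share of heavy-session dollars
THINKING_SHARE = 0.20        # thinking bucket, untouched by a register instruction

ceiling = whole_bill_saving(VISIBLE_OUTPUT_SHARE, 1.0)      # total muteness
realistic = whole_bill_saving(VISIBLE_OUTPUT_SHARE, 0.585)  # measured ultra cut
thinking = whole_bill_saving(THINKING_SHARE, 0.0)           # no lever here

print(f"ceiling:   {ceiling:.1%}")    # 17.0%
print(f"realistic: {realistic:.1%}")  # 9.9%
print(f"thinking:  {thinking:.1%}")   # 0.0%
```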

Evidence and claims to kill

  • "Caveman cuts ~75% of tokens." The 75% is the pooled benchmark ratio; the per-task mean is 65%; the local Claude-tokenizer replication is 58.5% (ultra). And it targets visible prose only (~17% of dollars), so the whole-bill effect is ~4–6% per day. The README itself calls the cost saving "a bonus."
  • "Wenyan/Classical-Chinese saves ~80%." Character-token confusion: 80.9% character cut = 56.6% token cut; wenyan-ultra reaches 74.5% tokens only at maximum lossiness and on some short phrases costs more than English.
  • "A terse style cuts your Claude Code bill proportionally." No — visible output is 17% of dollars; thinking (20%) is billed in full though displayed summarized.

Caveman's evidence tier is T3, partially reproduced locally for the output-token mechanism; there is no agentic-task quality benchmark of register-compressed output anywhere, which is the single largest open question hanging over it. The full per-technique records, the folklore ledger, and the tokenizer measurement battery live in the dossier's style-and-language-compression chapter and the prior-art scan.


Next: 02 — Headroom design, a genuine runtime system rather than a prompt or a filter binary.
