# 01 — Caveman: design teardown (https://jackin.tailrocks.com/research/token-optimization-tools/01-caveman-design/)


# 01 — Caveman: design teardown [#01--caveman-design-teardown]

> "Why use many token when few do trick." — the caveman tagline

Caveman is the **output-side** member of the trio. It compresses what the model *writes back* — visible prose — and nothing else. It is also the only one of the three you are likely already running: this very research was produced with caveman mode active. Its design is the most surprising of the three precisely because there is almost nothing to it.

| Field                 | Value                                                                                                            |
| --------------------- | ---------------------------------------------------------------------------------------------------------------- |
| Repository            | `JuliusBrussee/caveman`                                                                                          |
| Pitch                 | Terse-register output compression; README headline "cuts \~75% of output tokens"                                 |
| Form factor           | Claude Code plugin (skills + hooks); not a binary, not a proxy, not an MCP server                                |
| Latest seen           | `v1.9.0`                                                                                                         |
| Adoption (2026-06-18) | 74,446★ / 166 watchers — PR-inflated; see [evidence](/research/token-optimization-tools/07-evidence-and-claims/) |
| Bucket hit            | Visible output (\~17% of heavy-session dollars)                                                                  |
| Cache interaction     | **Neutral** — never touches the cached prefix                                                                    |

## The magic: a prompt, not a program [#the-magic-a-prompt-not-a-program]

The single most important fact about caveman is that &#x2A;*it has no compression engine.** There is no parser, no model, no subprocess, no byte ever transformed by caveman's own code. The "compression" happens inside the model's own decoding, for free, because a prompt instructed the model to talk shorter.

The engine is a markdown file — `SKILL.md` — that carries a register-shift instruction the model applies to its own generation: drop articles, filler, and hedging; keep code, errors, API names, and commit keywords verbatim; fragments are fine. That is the entire mechanism. Everything else in the plugin exists only to *deliver* that instruction reliably and to *measure* its effect.

```text
   CAVEMAN ACTIVATION + EFFECT  (no bytes transformed by caveman code)

   session opens
        │
        ▼
   ┌─────────────────────────┐   SessionStart hook
   │ caveman-activate.js      │   injects the mode prompt from SKILL.md
   └─────────────────────────┘   into the system context
        │
        ▼
   ┌─────────────────────────┐   the register-shift INSTRUCTION now
   │ SKILL.md rule-set        │   sits in context: "drop articles/filler/
   │ (lite/full/ultra/wenyan) │    hedging; keep code/errors verbatim"
   └─────────────────────────┘
        │
   user turn
        │
        ▼
   ┌─────────────────────────┐   UserPromptSubmit hook
   │ caveman-mode-tracker.js  │   re-asserts the active level every turn,
   └─────────────────────────┘   tallies tokens for /caveman-stats
        │
        ▼
   ┌─────────────────────────┐   the MODEL'S DECODER produces terser text
   │  model generation        │   because the prompt told it to.
   │  (compression happens     │   THIS is the "compression" — it is
   │   HERE, in decoding)      │   generation-time, not a transform.
   └─────────────────────────┘
        │
        ▼
   shorter visible output  ── fewer output tokens billed at the 5×-priced rate
```

Because the compression is just the model choosing shorter words, it is **unconditionally cache-safe** (it only affects newly generated output, never the cached input prefix) and it costs **zero runtime compute**. That is the whole appeal: the cheapest possible intervention on the most expensive token class.

## The six intensity levels [#the-six-intensity-levels]

The levels are not different programs — they are different strictness clauses in the same `SKILL.md`. Each ratchets the register harder.

| Level          | What changes                                                  | Measured token cut (local, Fable tokenizer)   |
| -------------- | ------------------------------------------------------------- | --------------------------------------------- |
| `lite`         | Drop pleasantries and obvious filler; grammar intact          | modest; quality-neutral                       |
| `full`         | Drop articles/filler/hedging; fragments allowed (the default) | \~50–60% of visible prose                     |
| `ultra`        | Adds prose-word abbreviation and causal arrows (`→`)          | **58.5%** (vs the README's "\~75%" headline)  |
| `wenyan-lite`  | Switch to Classical Chinese register, lighter                 | character cut ≫ token cut (see below)         |
| `wenyan-full`  | Classical Chinese, fuller                                     | 80.9% **character** cut = **56.6% token** cut |
| `wenyan-ultra` | Classical Chinese, maximal word-dropping                      | **74.5% token** cut — the measured ceiling    |

The `wenyan` levels are the cautionary tail of the ladder. Their headline "80–90%" is a **character** reduction; on the Fable tokenizer classical Chinese runs at roughly 0.9 characters per token (i.e. at or above one token per character), so the *token* saving is far smaller than the character saving implies, and on short phrases wenyan can cost **more** tokens than plain English ("凡測試皆過則提交" = 9 tokens vs "if all tests pass then commit" = 8). The savings that do exist come from extreme word-dropping in wenyan grammar, not from cheap characters — and they are bought with maximal, completely unmeasured comprehension and operator-legibility risk.

## Why deleting words is the only lever that works [#why-deleting-words-is-the-only-lever-that-works]

Caveman's effectiveness rests on one property of byte-pair encoding: &#x2A;*a BPE token count tracks word count, not character count.** This single fact decides what works and what is folklore.

* **Deleting words saves tokens** — the entire safe core of caveman. The local register ladder on identical content: polite/full 72 tok → plain 33 (−54%) → telegraphic 29 (−60%) → caveman 24 (−67%). Notably, on the reply side, **caveman grammar bought zero tokens over plain telegraphic English** — the savings come from removing words, not from breaking grammar.
* **Abbreviating common words does *not* save tokens, and usually costs them.** Measured per-word: `function` = 1 token but `fn` = 2; `without` = 1 but `w/o` = 3; `because` = 1 but `bc` = 2; `config` = 1 but `cfg` = 3. BPE already compressed the common words. Abbreviation only pays on genuinely multi-token words (`asynchronous`→`async`, `parameters`→`params`, `implementation`→`impl`).
* **Glyph/symbol DSLs are a trap on Claude.** Exotic glyphs cost 3.9–4.9 tokens *each*; a glyph DSL lost to plain telegraphic ASCII by \~26–28% in two independent local samples. The lone exception is the arrow `→` at 1.0 token (cheaper than ASCII `->` at 2.0), which is why caveman-ultra adopts causal arrows. Beyond that one symbol, symbolic notation is anti-recommended — and there is a comprehension failure layered on the cost one: Claude Haiku 4.5 parses symbolic instructions at 100% but *obeys* them at only 26% fidelity (silent disobedience).

The practical consequence baked into caveman's design: the safe rungs (filler-strip, telegraphic) capture essentially all the realizable saving; the aggressive rungs (caveman-ultra, wenyan) add lossiness and risk for diminishing token returns.

## The two hooks (and the host-state writes) [#the-two-hooks-and-the-host-state-writes]

Delivering the instruction reliably requires two Claude Code hooks, both of which write to the host's `~/.claude` configuration — a detail that matters because it is the exact surface that collides with RTK's hook (see [combining](/research/token-optimization-tools/06-combining/)).

* **`SessionStart` → `caveman-activate.js`** — injects the mode prompt at session open.
* **`UserPromptSubmit` → `caveman-mode-tracker.js`** — re-asserts the active level every turn (so it does not drift out of context over a long session) and tallies tokens for the `/caveman-stats` readout.

`plugin.json` declares the two hooks, and the installer (`bin/lib/settings.js`) writes them into `~/.claude/settings.json` idempotently, with a Zod-validated shape, an `hasCavemanHook` probe to avoid duplicate registration, and a `removeCavemanHooks` uninstall path. The `/caveman-stats` numbers are computed by the hook reading the session JSONL — the model does not compute them.

## The ecosystem: the same trick aimed elsewhere [#the-ecosystem-the-same-trick-aimed-elsewhere]

Caveman is the flagship of a family, and every sibling is the same "instruction, not engine" idea pointed at a different surface. This breadth is part of what distinguishes caveman from headroom and RTK — it is an *ecosystem* of small register-shift tools, not a single compressor.

```text
                       THE CAVEMAN FAMILY

   ┌──────────────────────────────────────────────────────────────┐
   │  OUTPUT side                                                   │
   │    caveman          terse register on the model's visible prose│
   │    caveman-commit   terse Conventional-Commit messages         │
   │    caveman-review   terse PR-review comments (one line each)   │
   │    caveman-code     "code" variant, "~2× fewer tokens vs Codex"│
   └──────────────────────────────────────────────────────────────┘
   ┌──────────────────────────────────────────────────────────────┐
   │  INPUT / memory side                                           │
   │    caveman-compress rewrites memory files (CLAUDE.md) in place,│
   │                     keeps a .original.md backup ("~46% input") │
   │    caveman-shrink   compresses MCP tool DESCRIPTIONS           │
   │    cavemem          pre-compressed persistent memory over MCP  │
   │                     (SQLite+FTS5, BM25+vector retrieval)       │
   └──────────────────────────────────────────────────────────────┘
   ┌──────────────────────────────────────────────────────────────┐
   │  MULTI-AGENT side                                              │
   │    cavecrew         investigator / builder / reviewer subagents│
   │                     return caveman-compressed reports to the   │
   │                     main thread (~43.9% smaller, not the       │
   │                     claimed 60%)                               │
   └──────────────────────────────────────────────────────────────┘
   ┌──────────────────────────────────────────────────────────────┐
   │  ORCHESTRATION side                                            │
   │    cavekit          spec-driven autonomous build loop;         │
   │                     caveman-encoded blueprints + durable        │
   │                     SPEC.md so /clear doesn't force re-derive   │
   └──────────────────────────────────────────────────────────────┘
```

Two members are worth singling out because they create the only genuine overlaps with headroom and RTK:

* **cavecrew** (compressed subagents) is caveman's answer to expensive agent-to-agent transfer. Investigator/builder/reviewer subagents do work and return reports that are caveman-compressed before re-entering the main context. This is the same *goal* as headroom's `SharedContext`, but cavecrew is Claude-Code-native and the compression is again just a register instruction on the subagent's output. The honest measured shrink is \~43.9%, not the claimed 60%.
* **cavemem** (compressed memory) is the only caveman-family tool that acts on input. It stores caveman-compressed (lossy prose) observations in SQLite+FTS5 with BM25/vector retrieval, single-agent. This is the member that competes head-on with headroom's cross-agent memory + CCR — and loses on architecture, because cavemem's compression is lossy with **no recovery path** (the "confidently-wrong recalled fact" failure mode), whereas headroom's memory is reversible. If you adopt headroom memory, you retire cavemem for that workflow; running both is pure overhead.

## What caveman has, and what it lacks [#what-caveman-has-and-what-it-lacks]

| Feature                                                  | Caveman                                                                                  |
| -------------------------------------------------------- | ---------------------------------------------------------------------------------------- |
| Compresses output (the 5×-priced class)                  | **Yes — its whole purpose**                                                              |
| Zero runtime compute                                     | **Yes** (compression is in the model's decoder)                                          |
| Unconditionally cache-safe                               | **Yes** (never touches the input prefix)                                                 |
| Cross-tool (works under any agent/model)                 | **Yes** — it is just a prompt; the register shift applies wherever the prompt is honored |
| Minutes to adopt                                         | **Yes**                                                                                  |
| Multi-surface ecosystem (commit/review/memory/subagents) | **Yes** — the broadest *family* of the three                                             |
| Compresses input (tool output, files, RAG)               | **No** — out of scope (cavemem covers only memory injection, lossily)                    |
| Reversible / recoverable compression                     | **No** — dropped caveats are gone; re-ask to recover                                     |
| Touches thinking (20% of dollars)                        | **No** — the README concedes it                                                          |
| Independent benchmark on agentic task success            | **No** — every quality datapoint is QA/MCQA on older models                              |
| A measurable whole-session telemetry stream              | **No** — `/caveman-stats` is a local tally, not a controlled measurement                 |

## Self-cost, eval method, and failure mode [#self-cost-eval-method-and-failure-mode]

**Self-cost.** The caveman skill listing carried in context costs \~940 tokens of prefix per session (about 0.5% of the modeled heavy day in cache-read rent before it saves anything), plus the two hooks' host-state writes. Runtime compute is essentially zero.

**Eval method (why the honest number is below the headline).** The plugin's `evals/measure.py` counts tokens with tiktoken `o200k_base` — an *approximation* of Claude's BPE, not Claude's own tokenizer — against an "Answer concisely." terse-control baseline over 10 prompts, reporting median/min/max/stdev. The README's "\~75%" is the *pooled* ratio of that table (1214 → 294 tokens); the same table's *per-task mean* is 65% (range 22–87%). A local re-measure on the Claude tokenizer lands caveman-ultra at &#x2A;*58.5%** of visible-prose tokens. The number is real and useful; it is simply smaller than the headline, and it measures what caveman adds *on top of* output that was already told to be concise.

**Failure mode.** Over-terse prose with no recovery — because the long form was never generated, recovering it means re-asking. The project's own issue tracker records the predictable version of this (issue #484 — caveman shrinking PR/issue titles it should have left verbatim). On the dollar axis, the deeper limit is structural: visible output is only \~17% of a heavy session, so even total muteness caps caveman's whole-bill effect at 17%, and the realistic figure is \~10% of session dollars — because the 20%-of-dollars thinking bucket is billed in full (and on Fable 5 cannot even be disabled), reachable only by the **effort** lever, never by any register instruction. That ceiling is the honest frame for everything caveman does; it is detailed in the dossier's [output-discipline](/research/token-optimization/15-output-discipline/) and [style-compression](/research/token-optimization/10-style-and-language-compression/) chapters.

## Evidence and claims to kill [#evidence-and-claims-to-kill]

* **"Caveman cuts \~75% of tokens."** The 75% is the pooled benchmark ratio; the per-task mean is 65%; the local Claude-tokenizer replication is 58.5% (ultra). And it targets visible prose only (\~17% of dollars), so the whole-bill effect is \~4–6% per day. The README itself calls the cost saving "a bonus."
* **"Wenyan/Classical-Chinese saves \~80%."** Character-token confusion: 80.9% character cut = 56.6% token cut; wenyan-ultra reaches 74.5% tokens only at maximum lossiness and on some short phrases costs *more* than English.
* **"A terse style cuts your Claude Code bill proportionally."** No — visible output is 17% of dollars; thinking (20%) is billed in full though displayed summarized.

Caveman's evidence tier is **T3, partially reproduced locally** for the output-token mechanism; there is no agentic-task quality benchmark of register-compressed output anywhere, which is the single largest open question hanging over it. The full per-technique records, the folklore ledger, and the tokenizer measurement battery live in the dossier's [style-and-language-compression chapter](/research/token-optimization/10-style-and-language-compression/) and the [prior-art scan](/research/token-optimization/03-prior-art-and-market-scan/).

***

Next: [02 — Headroom design](/research/token-optimization-tools/02-headroom-design/), a genuine runtime system rather than a prompt or a filter binary.