55 — Token observability and session visualization (token-optimizer and peers)

Volume III deep-dive on the observability layer, requested after the headroom/compression work (53, 54): analyze alexgreensh/token-optimizer and survey other tools that give full per-token visibility and session visualization — "see every token, all usage, visualize my latest session." This is a different problem from compression (53/54) and output brevity (caveman): it saves nothing directly but underwrites every other lever, which is exactly the dossier's standing position on measurement (record 18: "saves nothing, underwrites everything"). Research conducted 2026-06-15; sources in the ledger; stars are treated as noise in this niche per file 54 §A.

TL;DR

token-optimizer is the productized form of this dossier's own measurement method. It reads Claude Code's native JSONL session transcripts locally into SQLite (no proxy, no telemetry endpoint), and renders the exact decomposition the dossier built by hand — per-turn input / output / cache-read / cache-write with spike detection — as a single-file web dashboard (localhost:24842), a color-shifting status line, and a CLI audit. It is tools/session_cost.py plus a dashboard, quality grades, and a 30-day trend view.
It is cache-safe and zero-overhead by construction. JSONL-only, on-device, "external process, no context injection" — so unlike a compression proxy it cannot bust the prompt cache or add prefix rent. That makes the visibility half pure negative-cost: the dossier's "measure before optimizing" rule (file 47) with a UI.
The one token class it still cannot show is thinking, and neither can any JSONL tool. Claude Code redacts thinking from the transcript, so output_tokens is thinking and visible output fused. token-optimizer visualizes the input/output/cache split (the actionable ~80% of the bill) but cannot directly graph the thinking-vs-visible slice the dossier had to infer via count_tokens(visible) (file 02). "Full visibility for every token" therefore has one structural blind spot on hosted Claude. token-optimizer gets closest with a heuristic wasteful-thinking flag (it warns when extended thinking exceeds ~2× output on small edits), which is useful but is a flag, not a thinking-vs-visible split. A genuinely complete tool would pair JSONL parsing with a count_tokens pass over visible blocks to estimate the thinking slice; none surveyed does this (a gap worth a jackin' tool, given tools/count_tokens.py already exists). A second wall compounds it: JSONL records per-API-call usage totals, not per-token streams, so true token-by-token session visualization does not exist for Claude Code at all (see the landscape section below).
Its optimization half re-implements levers the dossier already ranked. Structure-map re-reads (95–99%) = the outline lever (file 51, locally −91%); read-cache dedup (180k→250-tok skeleton) = outline + observation masking (record 12); bash compression (~10%, lossy) = hook filtering (record 20). Its keep-warm pinger is the dossier's graveyard #3 ("keepalive pingers" killed for the live loop) — token-optimizer correctly scopes it to resumed/TTL-expired sessions on API billing, the one case the kill didn't cover, but that case is marginal on a Max subscription (file 41).
Two adoption caveats for jackin'. License is PolyForm Noncommercial 1.0.0 (free for personal/research and small teams <5 people / <$20k-mo, but not a permissive bundle into a product); and its dollar figures assume API/Vertex/Bedrock pricing, while the operator is on a Max subscription where dollars-below-cap are sunk and the metric is tasks-per-cap (file 41) — so for the operator the token/cache/quality views are the value, not the dollar totals.
Verdict: the JSONL-reading, no-proxy observability class (token-optimizer, ccusage, /usage, the dossier's tools/) is the right way to get session visibility — adopt it freely as the measurement front-end of the validation harness (31/51/53). Proxy- or OTel-based observability adds reach but also a moving part; prefer the local-transcript readers for a coding agent.

What token-optimizer is

Field	Value
Repository	github.com/alexgreensh/token-optimizer
Pitch	"Find the ghost tokens. Fix them. Survive compaction. Avoid context quality decay."
Created / activity	2026-02-26 (~4 months); pushed 2026-06-15
Adoption	1,341 stars / 8 watchers / 110 forks / 1 open issue (gh api 2026-06-15). The 168:1 star:watcher ratio is the same PR-inflation signal as the compression niche (file 54 §A) — modest here; rank by what it shows, not stars.
Language	Python (Claude Code/Codex hooks) + TypeScript (`openclaw/dist/*`)
License	PolyForm Noncommercial 1.0.0 (auto-commercial for teams <5 people / <$20k-mo)
Data source	Claude Code native JSONL transcripts, parsed locally to SQLite — CompletionStart/End events, tool invocations, compaction markers. No proxy, no telemetry endpoint.
Surfaces	Web dashboard (`localhost:24842/token-optimizer`); terminal status line (green→red on quality decay); CLI (`/token-optimizer` audit, `/token-coach` 30-day trends, `quick` 10-second check)
Platforms	Claude Code (CLI + VS Code), Codex (CLI + Desktop), OpenClaw, OpenCode, Hermes (beta), GitHub Copilot (beta)

What it makes visible (the part the operator wants)

Per-turn breakdown: input / output / cache-read / cache-write — with the cache-write line further split by TTL (5-minute vs 1-hour) and spike detection on context jumps. Numbers are exact, read from the API-response usage object in the JSONL (docs/METHODOLOGY.md: "the three input classes sum back to total billed input … this decomposition is exact"), not tokenizer estimates. This is the dossier's headline token-class split (32% cache-read / 29% cache-write / 20% thinking / 17% visible output / 2% uncached, file 00/02) rendered per turn — except thinking stays inside the output bar (see the blind spot below). The 5m/1h write split is the most granular cache decomposition of any tool surveyed.
Session metrics: cache hit rate + TTL mix; cost across four pricing tiers (Anthropic API, Vertex Global/Regional, Bedrock); per-message cost paired with response expense; subagent cost (orchestrator vs worker); top-5 costliest prompts by response expense.
Historical trends (30-day): quality degradation, session-duration creep, cache-hit-rate decline, cost-per-session climb.
Quality grades: an S–F composite of Resource Health (context fill %, compaction depth, absolute waste) and Session Efficiency (stale reads, bloated results, decision density), with green/yellow/orange/red bands. This is a heuristic dashboard on the same signal the dossier treats as online-quality governance (file 47) and context rot (file 46 / Chroma).
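
The per-turn breakdown and the cache hit rate above can be sketched in a few lines. This is a minimal illustration, not token-optimizer's parser: it assumes each transcript line is a JSON record, that assistant records carry the API usage object at message.usage, and that the usage object uses the Anthropic API field names (input_tokens, cache_read_input_tokens, cache_creation_input_tokens, a cache_creation object split by TTL, and output_tokens). The TurnUsage, parse_turns, and cache_hit_rate names are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class TurnUsage:
    uncached_input: int  # usage.input_tokens
    cache_read: int      # usage.cache_read_input_tokens
    cache_write_5m: int  # usage.cache_creation.ephemeral_5m_input_tokens
    cache_write_1h: int  # usage.cache_creation.ephemeral_1h_input_tokens
    output: int          # usage.output_tokens (thinking and visible, fused)

    @property
    def total_input(self) -> int:
        # Uncached + cache-read + cache-write (both TTLs) = total billed input.
        return (self.uncached_input + self.cache_read
                + self.cache_write_5m + self.cache_write_1h)


def parse_turns(jsonl_text: str) -> list[TurnUsage]:
    """Extract one TurnUsage per assistant record that carries a usage object."""
    turns = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        message = record.get("message")
        # Assumed layout: assistant records hold the API usage at message.usage.
        if record.get("type") != "assistant" or not isinstance(message, dict):
            continue
        usage = message.get("usage")
        if not usage:
            continue
        write_total = usage.get("cache_creation_input_tokens", 0)
        by_ttl = usage.get("cache_creation") or {}
        write_1h = by_ttl.get("ephemeral_1h_input_tokens", 0)
        # Without a TTL breakdown, attribute the whole write to the 5-minute tier.
        write_5m = by_ttl.get("ephemeral_5m_input_tokens", write_total - write_1h)
        turns.append(TurnUsage(
            uncached_input=usage.get("input_tokens", 0),
            cache_read=usage.get("cache_read_input_tokens", 0),
            cache_write_5m=write_5m,
            cache_write_1h=write_1h,
            output=usage.get("output_tokens", 0),
        ))
    return turns


def cache_hit_rate(turns: list[TurnUsage]) -> float:
    """Share of billed input tokens that were served from cache."""
    total = sum(t.total_input for t in turns)
    return sum(t.cache_read for t in turns) / total if total else 0.0
```

The sketch reads numbers straight from the usage object rather than estimating them with a tokenizer, which is what makes the decomposition exact. A real reader would also have to handle repeated writes of the same message before summing.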

Its optimization half (secondary to visibility)

token-optimizer also ships an active-compression layer (v5): structure-map re-reads (95–99% on large code files), delta-mode re-reads (~97%), read-cache dedup (a 180,000-token file becomes a ~250-token skeleton), bash compression (16 handlers, ~10%, "lossy by design"), smart-compaction decision checkpointing, quality nudges, loop detection, and an opt-in keep-warm cache pinger. Claimed monthly savings: $80–150 light / $300–600 heavy / $1,500–2,500 high-waste.
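
The read-cache dedup idea (return a short stand-in when an unchanged file is read again) can be illustrated as below. This is a sketch of the general lever only, under the assumption that content hashing is used to detect an unchanged file; it is not token-optimizer's implementation, and the ReadCache class and the stub format are hypothetical.

```python
import hashlib


class ReadCache:
    """Return full content on first read, a short stub on unchanged re-reads."""

    def __init__(self) -> None:
        # path -> sha256 of the content most recently returned in full
        self._seen: dict[str, str] = {}

    def read(self, path: str, content: str) -> str:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if self._seen.get(path) == digest:
            # Unchanged since the last full read: send a stub instead of the file.
            return f"[unchanged since last read: {path}, sha256 {digest[:12]}]"
        # New or modified file: remember the hash and send the whole content.
        self._seen[path] = digest
        return content
```

The stub here is only a marker; the skeleton the README describes (roughly 250 tokens for a 180,000-token file) carries structure as well, so the savings figure does not transfer to this sketch.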

How it maps onto the dossier

The important finding is that token-optimizer is almost entirely a productization of techniques the dossier already measured and ranked — which is a point in its favor (it is the dossier's instrument with a UI) and a reason to treat its novel claims skeptically.

token-optimizer feature	Dossier equivalent	Note
JSONL→SQLite per-turn token decomposition	Record 18 (ccusage/`/usage`/JSONL) + `tools/session_cost.py`	Same method, richer per-turn UI; cache-safe, no proxy
Input/output/cache-read/cache-write bars	File 00/02 decomposition (32/29/20/17/2)	Visualizes exactly what the dossier measured by hand
Structure-map re-reads 95–99%	File 51 (outline/symbol, local −91%/−98%)	Same outline lever, productized for re-reads
Read-cache dedup (skeleton on re-read)	Record 12 (observation masking) + file 51	Same "don't re-send the whole file" lever
Bash compression ~10% (lossy)	Record 20 (hook filtering, local −94.2% on logs)	Narrower than the dossier's grep-hook ceiling
Smart-compaction decision checkpoint	Record 06 (compaction) + file 46 (microcompact)	Preserves decisions across compaction — sound
Keep-warm cache pinger (opt-in, API billing)	Graveyard #3 (keepalive pingers, killed)	Killed for the live loop; scoped here to resumed/TTL-expired API-billed sessions — marginal on a Max subscription (file 41)
Quality grades / context-rot nudges	File 47 (online quality) + file 46 (Chroma context rot)	Heuristic dashboard on a real signal

The thinking blind spot (load-bearing)

The dossier's central measurement insight is that thinking tokens are invisible in the Claude Code transcript: they bill as output but are redacted, so the dossier had to infer them as output_tokens − count_tokens(visible) (file 02, 54.8% of output on the measured max-effort loop). Any tool that reads only the JSONL (token-optimizer, ccusage, /usage) inherits this blind spot: it can show the output bar but not split it into thinking vs visible. So "full visibility for every token" is true for input/cache classes and the output total, but the single largest lever the dossier found (thinking, the only one the effort parameter touches, file 09/15) is the one a transcript visualizer cannot directly graph. A genuinely complete token-visibility tool would pair JSONL parsing with a count_tokens pass over visible blocks to estimate the thinking slice; none of the surveyed tools does this today (a gap worth a jackin' tool, given tools/count_tokens.py already exists).
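
The inference output_tokens − count_tokens(visible) is simple arithmetic once a token counter for the visible blocks is available. The sketch below takes the counter as a parameter so it stays self-contained; in practice the counter would wrap the count_tokens API (as tools/count_tokens.py presumably does), and the estimate_thinking name and the return shape are hypothetical.

```python
from typing import Callable, Iterable


def estimate_thinking(
    output_tokens: int,
    visible_blocks: Iterable[str],
    count_tokens: Callable[[str], int],
) -> dict:
    """Estimate the redacted thinking slice of one turn's billed output."""
    visible = sum(count_tokens(block) for block in visible_blocks)
    # Clamp at zero: counting overhead can push the visible estimate
    # slightly above the billed output total.
    thinking = max(output_tokens - visible, 0)
    share = thinking / output_tokens if output_tokens else 0.0
    return {"visible": visible, "thinking": thinking, "thinking_share": share}
```

The result is an estimate, not a measurement: the counter's own framing overhead and any non-text visible blocks shift the visible total, so the thinking slice inherits that error.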

Similar tools — full-visibility / session-visualization landscape

First, the honest ceiling on "every token." No tool visualizes a session literally token-by-token, color-coded. That does not exist for Claude Code, because the transcript records per-API-call usage totals, not per-token streams. The only token-by-token tools are tokenizer playgrounds (e.g. Simon Willison's Claude token counter over the count_tokens API) that have no notion of a session. So "full token visibility" in practice means an exact per-message and per-session input / output / cache-read / cache-write breakdown, which the JSONL usage object gives exactly, plus the output total with thinking fused in (the blind spot above). That realistic best case is what the tools below deliver, ranked by what is actually visible and verifiable, not stars (inflation is rampant here: daaain/claude-code-log 547:1, the anchor 168:1, phuryn 114:1 star:watcher).

Tool	Stars/watch · license	Granularity	Cache / I/O split	Surface	Data source	Live?	CC-native?
nateherkai/token-dashboard	584/11 · MIT	message / session / day / project	Yes (input/output/cacheRead)	Web (`localhost:8080`)	JSONL `~/.claude/projects/`	Yes (30s)	Yes — pure observability; dedupes CC's 2–3× JSONL stream-writes
phuryn/claude-usage	1,826/16 · MIT	message / session / day / model	Yes (input/output/cache_creation/cache_read)	Web + VS Code sidebar + CLI	JSONL	Yes (30s)	Yes — discloses $ is API-equivalent, not subscription
alexgreensh/token-optimizer (anchor)	1,341/8 · PolyForm-NC	per-turn / message / session / 30-day	Yes + 5m/1h TTL write split	Web + status line + CLI	JSONL → SQLite	Yes (status line)	Yes — richest split; but an optimizer, not pure viz
ColeMurray/claude-code-otel + official Grafana	~441 · MIT	session / day / model (aggregated)	Yes (OTel `type` = input/output/cacheRead/cacheCreation)	Grafana + Prometheus	Claude Code native OTel (no proxy)	Yes	Yes — org-grade, counters only (no per-message drilldown)
delexw/claude-code-trace	311/1 · MIT	message + tool calls	partial (counts where available)	Desktop (Tauri) + Web + TUI	JSONL	Yes (live tail)	Yes — best session replay; token surface minimal
jhlee0409/claude-code-history-viewer	1,598/4 · MIT	message / session; cross-tool	token usage; cache split not emphasized	Desktop app	JSONL (CC + Codex/Cursor/Gemini/Cline/Aider/OpenCode)	Post-hoc	Yes (multi-agent) — browser, token secondary
dabitk/claude-code-token-visualizer (cctv)	0/0 · MIT	per-request → time buckets	input/output; cache hit-rate	Terminal TUI (live histograms)	tails `.jsonl`	Yes (live)	Yes — real but unproven adoption
ccusage	16,080★ · MIT	day / session / model totals	totals only	CLI	JSONL	Yes	Yes (record 18)
`/usage` + `tools/session_cost.py`	first-party / in-repo	by skill/subagent/plugin; the 32/29/20/17/2 split	yes	in-CLI / script	built-in / JSONL	—	Yes — the baseline instruments (record 18)
Langfuse / Helicone / Phoenix / OpenLLMetry	5.8k–29k★ · mixed (MIT/Apache/Elastic)	per-call / per-trace span	yes if you instrument	self-host web	you instrument (SDK/OTel); Helicone is a proxy	Yes	No — not CC-native

Verified OTel detail (no proxy): Claude Code natively emits claude_code.token.usage with a type attribute valued input / output / cacheRead / cacheCreation, plus model / user / team / skill.name / plugin.name / agent.name, and a separate claude_code.cost.usage (USD). So the Grafana path gets the cache-read/write split for free — but only as aggregated counters (no per-message, no per-token).
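
What "aggregated counters" means in practice can be shown with a small fold over exported data points. The metric name and the type attribute values come from the schema above; the data-point shape (a dict with metric, attributes, and value keys) and the split_by_type helper are hypothetical, chosen only to keep the sketch self-contained.

```python
from collections import defaultdict


def split_by_type(points: list[dict]) -> dict[str, int]:
    """Sum claude_code.token.usage data points by their `type` attribute."""
    totals: dict[str, int] = defaultdict(int)
    for point in points:
        if point["metric"] != "claude_code.token.usage":
            continue  # e.g. claude_code.cost.usage is a separate metric
        totals[point["attributes"]["type"]] += point["value"]
    return dict(totals)
```

The fold recovers the input / output / cacheRead / cacheCreation split, but nothing in a counter identifies the message it came from, which is why this path cannot offer a per-message drilldown.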

The pattern mirrors file 54's compression sweep: the JSONL-reading, no-proxy class is the safe default for a coding agent (token-optimizer, nateherkai, phuryn, ccusage, /usage, tools/) — it cannot perturb the request or add prefix rent. The native-OTel→Grafana path is the clean org-grade option (also no proxy) but coarser. The general platforms (Langfuse/Helicone/Phoenix/OpenLLMetry) are powerful but observe apps you instrument yourself, are not Claude-Code-native, and the proxy variants (Helicone, reportedly in maintenance mode) reroute the base URL — a caching and availability risk. Use those only if you are also building your own Claude API app.

Best for "visualize my latest session"

1. nateherkai/token-dashboard: the pure-observability pick for this brief. Per-prompt→session→day with input/output/cacheRead split, heatmaps, subagent attribution, local web UI, no proxy, no compression side-effects, and it correctly dedupes Claude Code's 2–3× JSONL stream-writes (an accuracy point most miss). MIT. Does only visibility, which is exactly what was asked.
2. phuryn/claude-usage: closest runner-up; cleanest cache_creation/cache_read separation plus a VS Code sidebar; honestly flags that its dollars are API-equivalent (the file-41 subscription caveat, disclosed by the tool itself). MIT.
3. token-optimizer (anchor), observability layer only: the richest numbers (cache-write 5m/1h TTL split, exact-from-usage-object, four pricing tiers, wasteful-thinking flag, live status line). Third for this brief only because it is an optimizer that also rewrites reads/compaction; #1 on raw capability if you want (or don't mind) the active features. Source-available (PolyForm-NC), not OSI.
4. claude-code-otel + official Grafana: best durable, queryable, org-grade dashboard on Claude Code's native OTel (no proxy); loses on grain (aggregated, no per-message).
5. delexw/claude-code-trace (replay across desktop/web/TUI, live tail) + jhlee0409/claude-code-history-viewer (multi-agent browser): best for navigating the latest session as a conversation with tool calls; pair with #1 for replay + cost.

Not for this goal: Langfuse/Helicone/Phoenix/OpenLLMetry (instrument-your-own-app, not CC-native; Helicone needs a proxy); tokenizer counters (Simon Willison's, claude-tokenizer, lunary — per-text totals, not sessions); quota monitors (per-token split is not their job).
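
The stream-write dedupe credited to token-dashboard matters because summing every transcript line would count the same API response two or three times. A minimal sketch of one way to do it follows, assuming repeated writes of a response share message.id and a top-level requestId; both field names are assumptions about the transcript layout, and dedupe_assistant_records is a hypothetical helper, not code from any surveyed tool.

```python
import json
from typing import Iterable


def dedupe_assistant_records(jsonl_lines: Iterable[str]) -> list[dict]:
    """Keep one assistant record per (message.id, requestId) pair."""
    latest: dict[tuple, dict] = {}
    for line in jsonl_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        message = record.get("message")
        if record.get("type") != "assistant" or not isinstance(message, dict):
            continue
        key = (message.get("id"), record.get("requestId"))
        if key == (None, None):
            # No identifiers to match on: treat the record as unique.
            key = ("unkeyed", len(latest))
        # A later write of the same response replaces the earlier one.
        latest[key] = record
    return list(latest.values())
```

Keeping the last write per key is a design choice for the sketch; if the repeated writes carry identical usage objects, keeping the first would give the same totals.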

jackin' fit

Adopt the local-transcript observability class as the measurement front-end of the validation harness (31, and the per-tool harnesses in 51/53). The pure-observability MIT options — nateherkai/token-dashboard and phuryn/claude-usage — and tools/session_cost.py answer the same question the harness needs ("where did the tokens go this session?") with no proxy and no prefix cost; token-optimizer's dashboard is richer but ships an optimizer alongside. All are pure negative-cost on the visibility axis — the file-47 "measure first" rule with a UI.
Mind the license and the metric. PolyForm Noncommercial means token-optimizer is fine for an operator/researcher to run but is not a permissive dependency to bundle into jackin-core; and its dollar dashboards assume API/Vertex/Bedrock pricing, while the operator's Max subscription makes tasks-per-cap the real objective (file 41) — read the token/cache/quality panels, discount the dollar totals.
Close the thinking blind spot in jackin' tooling. The highest-value local addition is not another dashboard but a count_tokens-backed thinking-vs-visible estimator layered on the JSONL reader (the dossier already ships tools/count_tokens.py) — the one token class no current visualizer shows.
Do not treat keep-warm as a default win. It is the dossier's killed keepalive lever (graveyard #3), correctly scoped here to resumed/TTL-expired API-billed sessions; on a live Claude Code loop and on a subscription it is marginal-to-irrelevant. Leave it opt-in and measure it before trusting the projected dollars (the README itself calls them "history-replay estimates, not yet-realized dollars").
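
The keep-warm caution can be made concrete with a rough break-even on API billing. The sketch assumes the commonly published cache multipliers (a cache read at 0.1× and a 5-minute cache write at 1.25× the base input price), ignores the output tokens each ping generates and any uncached suffix, and uses hypothetical function names; treat it as arithmetic under those assumptions, not a measurement.

```python
READ_PCT = 10       # assumed: cache read billed at 10% of base input price
WRITE_5M_PCT = 125  # assumed: 5-minute cache write billed at 125%


def keep_warm_cost(prefix_tokens: int, pings: int) -> float:
    """Input-token equivalents spent keeping the prefix cached, then resuming."""
    # Each ping re-reads the cached prefix; the eventual resume is one more read.
    return prefix_tokens * READ_PCT * (pings + 1) / 100


def cold_resume_cost(prefix_tokens: int) -> float:
    """Input-token equivalents spent if the TTL lapses and the prefix is re-written."""
    return prefix_tokens * WRITE_5M_PCT / 100


def max_worthwhile_pings() -> int:
    """Largest ping count for which keeping warm is still cheaper than a cold resume."""
    # Largest n with READ_PCT * (n + 1) < WRITE_5M_PCT; independent of prefix size.
    return (WRITE_5M_PCT - 1) // READ_PCT - 1
```

Under these assumptions the pinger stops paying for itself after 11 pings, which at one ping inside each 5-minute window is under an hour of idle time. On a Max subscription the dollar comparison does not apply at all, which is the file-41 point.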

Validation protocol

To accept any visibility tool as the harness front-end, point it at a set of archived Claude Code sessions and (a) reconcile its session token totals against tools/session_cost.py and ccusage to within 5% (the record-18 reconciliation bar), (b) confirm it reads transcripts only (no proxy, no cache_control mutation; diff usage fields with and without it running), and (c) verify that its cache-read/write/output split matches the JSONL usage object turn-for-turn. Treat its quality grades and projected savings as heuristics, not measurements, until A/B'd on the 31/51/53 harness at equal task success.
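
Checks (a) and (c) reduce to two comparisons, sketched below. The 5% tolerance is the bar stated above; the function names and the dict-per-token-class input shape are hypothetical, and how totals are extracted from each tool is left out.

```python
def relative_gap(a: float, b: float) -> float:
    """Absolute difference relative to the larger magnitude (0.0 if both are zero)."""
    denom = max(abs(a), abs(b))
    return abs(a - b) / denom if denom else 0.0


def reconcile_totals(candidate: dict, reference: dict, tolerance: float = 0.05) -> dict:
    """Return the token classes whose session totals differ by the tolerance or more."""
    failures = {}
    for token_class in sorted(set(candidate) | set(reference)):
        gap = relative_gap(candidate.get(token_class, 0), reference.get(token_class, 0))
        if gap >= tolerance:
            failures[token_class] = gap
    return failures


def turns_match(candidate_turns: list[dict], jsonl_turns: list[dict]) -> bool:
    """Check (c): the per-turn split must equal the JSONL usage values exactly."""
    return len(candidate_turns) == len(jsonl_turns) and all(
        c == j for c, j in zip(candidate_turns, jsonl_turns)
    )
```

An empty result from reconcile_totals means the tool passes (a) for that session; turns_match is deliberately exact, since (c) compares values that both sides claim to read from the same usage object.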

Source ledger

All accessed 2026-06-15.

token-optimizer repo + README (features, data source, UI, limitations, PolyForm-NC license): github.com/alexgreensh/token-optimizer
token-optimizer stats (1,341★ / 8 watchers / 110 forks / created 2026-02-26): gh api repos/alexgreensh/token-optimizer; source tree (session-parser.js, dashboard.js, jl-sketcher.js, pricing.js, quality.js, read-cache.js, smart-compact.js, drift.js) confirms the JSONL-parse → dashboard architecture; exactness + Measured-vs-Estimated split from docs/METHODOLOGY.md
similar visibility / session-visualization tools (verified via repo READMEs + gh api): nateherkai/token-dashboard, phuryn/claude-usage, ColeMurray/claude-code-otel, delexw/claude-code-trace, jhlee0409/claude-code-history-viewer, dabitk/claude-code-token-visualizer; general platforms langfuse, Helicone, Arize Phoenix, OpenLLMetry; tokenizer counters (per-text, not session) simonw/tools
Claude Code native OTel metric/attribute schema (claude_code.token.usage type = input/output/cacheRead/cacheCreation; claude_code.cost.usage): code.claude.com/docs/en/monitoring-usage; official Grafana dashboard 25255
dossier cross-references: measurement method, ccusage, and /usage: 03-prior-art-and-market-scan.md (record 18); the token-class decomposition and thinking-invisibility: 02-baseline-audit.md and 00-executive-summary.md; keepalive graveyard #3: 00/20; outline lever: 51-code-intelligence-tools.md; subscription/quota metric: 41-subscription-and-quota-economics.md; online quality + context rot: 47-meta-cost-governance-and-online-quality.md and 46-fresh-literature-and-market-delta.md; runnable instruments: tools/
compression companions: 53-headroom-and-context-compression.md, 54-context-compression-literature-and-market.md
