55 — Token observability and session visualization (token-optimizer and peers)
Volume III deep-dive on the observability layer, requested after the headroom/compression work (53, 54): analyze alexgreensh/token-optimizer and survey other tools that give full per-token visibility and session visualization — "see every token, all usage, visualize my latest session." This is a different problem from compression (53/54) and output brevity (caveman): it saves nothing directly but underwrites every other lever, which is exactly the dossier's standing position on measurement (record 18: "saves nothing, underwrites everything"). Research conducted 2026-06-15; sources in the ledger; stars are treated as noise in this niche per file 54 §A.
TL;DR
- token-optimizer is the productized form of this dossier's own measurement method. It reads Claude Code's native JSONL session transcripts locally into SQLite (no proxy, no telemetry endpoint) and renders the exact decomposition the dossier built by hand (per-turn input / output / cache-read / cache-write, with spike detection) as a single-file web dashboard (localhost:24842), a color-shifting status line, and a CLI audit. It is tools/session_cost.py plus a dashboard, quality grades, and a 30-day trend view.
- It is cache-safe and zero-overhead by construction. JSONL-only, on-device, "external process, no context injection": unlike a compression proxy, it cannot bust the prompt cache or add prefix rent. That makes the visibility half pure negative-cost: the dossier's "measure before optimizing" rule (file 47) with a UI.
- The one token class it still cannot show is thinking, and neither can any JSONL tool. Claude Code redacts thinking from the transcript; output_tokens is thinking + visible fused. token-optimizer visualizes the input/output/cache split (the actionable ~80% of the bill) but cannot directly graph the thinking-vs-visible slice the dossier had to infer via count_tokens(visible) (file 02). "Full visibility for every token" has this one structural blind spot on hosted Claude. token-optimizer gets closest with a heuristic wasteful-thinking flag (it warns when extended thinking exceeds ~2× output on small edits): useful, but a flag, not a thinking-vs-visible split. A genuinely complete tool would pair JSONL parsing with a count_tokens pass over visible blocks to estimate the thinking slice; none surveyed does this (a gap worth a jackin' tool, given tools/count_tokens.py already exists). A second wall compounds it: JSONL records per-API-call usage totals, not per-token streams, so true token-by-token session visualization does not exist for Claude Code at all (see the similar-tools landscape section below).
- Its optimization half re-implements levers the dossier already ranked. Structure-map re-reads (95–99%) = the outline lever (file 51, locally −91%); read-cache dedup (180k→250-tok skeleton) = outline + observation masking (record 12); bash compression (~10%, lossy) = hook filtering (record 20). Its keep-warm pinger is the dossier's graveyard #3 ("keepalive pingers", killed for the live loop); token-optimizer correctly scopes it to resumed/TTL-expired sessions on API billing, the one case the kill didn't cover, but that case is marginal on a Max subscription (file 41).
- Two adoption caveats for jackin'. License is PolyForm Noncommercial 1.0.0 (free for personal/research and small teams <5 people / <$20k-mo, but not a permissive bundle into a product); and its dollar figures assume API/Vertex/Bedrock pricing, while the operator is on a Max subscription where dollars-below-cap are sunk and the metric is tasks-per-cap (file 41) — so for the operator the token/cache/quality views are the value, not the dollar totals.
- Verdict: the JSONL-reading, no-proxy observability class (token-optimizer, ccusage, /usage, the dossier's tools/) is the right way to get session visibility; adopt it freely as the measurement front-end of the validation harness (31/51/53). Proxy- or OTel-based observability adds reach but also a moving part; prefer the local-transcript readers for a coding agent.
What token-optimizer is
| Field | Value |
|---|---|
| Repository | github.com/alexgreensh/token-optimizer |
| Pitch | "Find the ghost tokens. Fix them. Survive compaction. Avoid context quality decay." |
| Created / activity | 2026-02-26 (~4 months); pushed 2026-06-15 |
| Adoption | 1,341 stars / 8 watchers / 110 forks / 1 open issue (gh api 2026-06-15). The 168:1 star:watcher ratio is the same PR-inflation signal as the compression niche (file 54 §A) — modest here; rank by what it shows, not stars. |
| Language | Python (Claude Code/Codex hooks) + TypeScript (openclaw/dist/*) |
| License | PolyForm Noncommercial 1.0.0 (auto-commercial for teams <5 people / <$20k-mo) |
| Data source | Claude Code native JSONL transcripts, parsed locally to SQLite — CompletionStart/End events, tool invocations, compaction markers. No proxy, no telemetry endpoint. |
| Surfaces | Web dashboard (localhost:24842/token-optimizer); terminal status line (green→red on quality decay); CLI (/token-optimizer audit, /token-coach 30-day trends, quick 10-second check) |
| Platforms | Claude Code (CLI + VS Code), Codex (CLI + Desktop), OpenClaw, OpenCode, Hermes (beta), GitHub Copilot (beta) |
What it makes visible (the part the operator wants)
- Per-turn breakdown: input / output / cache-read / cache-write, with the cache-write line further split by TTL (5-minute vs 1-hour) and spike detection on context jumps. Numbers are exact, read from the API-response usage object in the JSONL (docs/METHODOLOGY.md: "the three input classes sum back to total billed input … this decomposition is exact"), not tokenizer estimates. This is the dossier's headline token-class split (32% cache-read / 29% cache-write / 20% thinking / 17% visible output / 2% uncached, file 00/02) rendered per turn, except that thinking stays inside the output bar (see the blind spot below). The 5m/1h write split is the most granular cache decomposition of any tool surveyed.
- Session metrics: cache hit rate + TTL mix; cost across four pricing tiers (Anthropic API, Vertex Global/Regional, Bedrock); per-message cost paired with response expense; subagent cost (orchestrator vs worker); top-5 costliest prompts by response expense.
- Historical trends (30-day): quality degradation, session-duration creep, cache-hit-rate decline, cost-per-session climb.
- Quality grades: an S–F composite of Resource Health (context fill %, compaction depth, absolute waste) and Session Efficiency (stale reads, bloated results, decision density), with green/yellow/orange/red bands. This is a heuristic dashboard on the same signal the dossier treats as online-quality governance (file 47) and context rot (file 46 / Chroma).
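The per-turn decomposition, spike detection, and cache hit rate described above can be sketched in a few lines. This is a minimal illustration, not token-optimizer's code: it assumes each transcript line is a JSON record whose message.usage object carries the Anthropic API usage fields (input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens, and an optional cache_creation TTL breakdown). The names TurnUsage, parse_turns, context_spikes, and cache_hit_rate are hypothetical, as are the 10,000-token spike threshold and the hit-rate definition (cache-read over total input). No deduplication of repeated records is attempted here.

```python
import json
from dataclasses import dataclass


@dataclass
class TurnUsage:
    uncached_input: int
    cache_read: int
    cache_write_5m: int
    cache_write_1h: int
    output: int  # thinking + visible, fused; usage alone cannot split it

    @property
    def total_input(self) -> int:
        # Uncached + cache-read + cache-write sum back to total billed input.
        return (self.uncached_input + self.cache_read
                + self.cache_write_5m + self.cache_write_1h)


def parse_turns(jsonl_text: str) -> list[TurnUsage]:
    """Extract one TurnUsage per transcript record that carries a usage object."""
    turns = []
    for line in jsonl_text.splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # blank or partially written line
        message = record.get("message") if isinstance(record, dict) else None
        usage = message.get("usage") if isinstance(message, dict) else None
        if not isinstance(usage, dict):
            continue
        write_total = usage.get("cache_creation_input_tokens") or 0
        ttl = usage.get("cache_creation")
        if not isinstance(ttl, dict):
            ttl = {}
        write_1h = ttl.get("ephemeral_1h_input_tokens") or 0
        # Without a TTL breakdown, attribute the whole write to the 5-minute tier.
        write_5m = ttl.get("ephemeral_5m_input_tokens", write_total - write_1h) or 0
        turns.append(TurnUsage(
            uncached_input=usage.get("input_tokens") or 0,
            cache_read=usage.get("cache_read_input_tokens") or 0,
            cache_write_5m=write_5m,
            cache_write_1h=write_1h,
            output=usage.get("output_tokens") or 0,
        ))
    return turns


def context_spikes(turns: list[TurnUsage], min_jump: int = 10_000) -> list[int]:
    """Indices of turns whose total input grew by at least min_jump tokens."""
    return [i for i in range(1, len(turns))
            if turns[i].total_input - turns[i - 1].total_input >= min_jump]


def cache_hit_rate(turns: list[TurnUsage]) -> float:
    """Share of all input tokens served from cache (assumed definition)."""
    total = sum(t.total_input for t in turns)
    return sum(t.cache_read for t in turns) / total if total else 0.0
```

Reading the numbers from the usage object rather than re-tokenizing text is what makes the split exact; the threshold and the hit-rate formula are the only judgment calls in the sketch and should be tuned against real sessions.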
Its optimization half (secondary to visibility)
token-optimizer also ships an active-compression layer (v5): structure-map re-reads (95–99% on large code files), delta-mode re-reads (~97%), read-cache dedup (a 180,000-token file becomes a ~250-token skeleton), bash compression (16 handlers, ~10%, "lossy by design"), smart-compaction decision checkpointing, quality nudges, loop detection, and an opt-in keep-warm cache pinger. Claimed monthly savings: $80–150 light / $300–600 heavy / $1,500–2,500 high-waste.
How it maps onto the dossier
The important finding is that token-optimizer is almost entirely a productization of techniques the dossier already measured and ranked — which is a point in its favor (it is the dossier's instrument with a UI) and a reason to treat its novel claims skeptically.
| token-optimizer feature | Dossier equivalent | Note |
|---|---|---|
| JSONL→SQLite per-turn token decomposition | Record 18 (ccusage, /usage, JSONL) + tools/session_cost.py | Same method, richer per-turn UI; cache-safe, no proxy |
| Input/output/cache-read/cache-write bars | File 00/02 decomposition (32/29/20/17/2) | Visualizes exactly what the dossier measured by hand |
| Structure-map re-reads 95–99% | File 51 (outline/symbol, local −91%/−98%) | Same outline lever, productized for re-reads |
| Read-cache dedup (skeleton on re-read) | Record 12 (observation masking) + file 51 | Same "don't re-send the whole file" lever |
| Bash compression ~10% (lossy) | Record 20 (hook filtering, local −94.2% on logs) | Narrower than the dossier's grep-hook ceiling |
| Smart-compaction decision checkpoint | Record 06 (compaction) + file 46 (microcompact) | Preserves decisions across compaction — sound |
| Keep-warm cache pinger (opt-in, API billing) | Graveyard #3 (keepalive pingers, killed) | Killed for the live loop; scoped here to resumed/TTL-expired API-billed sessions — marginal on a Max subscription (file 41) |
| Quality grades / context-rot nudges | File 47 (online quality) + file 46 (Chroma context rot) | Heuristic dashboard on a real signal |
The thinking blind spot (load-bearing)
The dossier's central measurement insight is that thinking tokens are invisible in the Claude Code transcript — they bill as output but are redacted, so the dossier had to infer them as output_tokens − count_tokens(visible) (file 02, 54.8% of output on the measured max-effort loop). Any tool that reads only the JSONL — token-optimizer, ccusage, /usage — inherits this blind spot: it can show the output bar but not split it into thinking vs visible. So "full visibility for every token" is true for input/cache classes and the output total, but the single largest lever the dossier found (thinking, the only one the effort parameter touches — file 09/15) is the one a transcript visualizer cannot directly graph. A genuinely complete token-visibility tool would pair JSONL parsing with a count_tokens pass over visible blocks to estimate the thinking slice; none of the surveyed tools does this today (a gap worth a jackin' tool, given tools/count_tokens.py already exists).
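The inference output_tokens − count_tokens(visible) is simple enough to sketch. The function below is a hypothetical illustration, not the interface of tools/count_tokens.py (which is not shown in this file): the token counter is passed in as a callable so the arithmetic can be exercised offline, and in practice that callable would wrap a real tokenizer or the count_tokens endpoint. The result is an estimate, because any counter applied to visible text may differ slightly from what was actually billed.

```python
from typing import Callable


def estimate_thinking(
    output_tokens: int,
    visible_blocks: list[str],
    count_tokens: Callable[[str], int],
) -> dict:
    """Estimate the redacted thinking slice of one response's output bar.

    output_tokens  -- the billed output total from the usage object
    visible_blocks -- the visible text blocks recorded in the transcript
    count_tokens   -- injected counter (hypothetical; supply a real one)
    """
    visible = sum(count_tokens(block) for block in visible_blocks)
    # Clamp at zero: a counter that overestimates visible tokens must not
    # produce a negative thinking estimate.
    thinking = max(output_tokens - visible, 0)
    share = thinking / output_tokens if output_tokens else 0.0
    return {"visible": visible, "thinking": thinking, "thinking_share": share}


# Illustrative run with a stand-in counter (whitespace words, NOT real tokens).
stand_in_counter = lambda text: len(text.split())
result = estimate_thinking(
    output_tokens=20,
    visible_blocks=["Edited the parser.", "All tests pass now."],
    count_tokens=stand_in_counter,
)
print(result)
```

Injecting the counter keeps the estimator independent of any one tokenizer or network call, which also makes it straightforward to layer onto an existing JSONL reader.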
Similar tools — full-visibility / session-visualization landscape
First, the honest ceiling on "every token." No tool visualizes a session literally token-by-token, color-coded. That does not exist for Claude Code, because the transcript records per-API-call usage totals, not per-token streams. The only token-by-token tools are tokenizer playgrounds (e.g. Simon Willison's Claude token counter over the count_tokens API) that have no notion of a session. So "full token visibility" in practice means an exact per-message and per-session input / output / cache-read / cache-write breakdown, which the JSONL usage object gives exactly, plus the output total with thinking fused in (the blind spot above). That realistic best case is what the tools below deliver; they are ranked by what is actually visible and verifiable, not stars (inflation is rampant here: daaain/claude-code-log 547:1, the anchor 168:1, phuryn 114:1 star:watcher).
| Tool | Stars/watch · license | Granularity | Cache / I/O split | Surface | Data source | Live? | CC-native? |
|---|---|---|---|---|---|---|---|
| nateherkai/token-dashboard | 584/11 · MIT | message / session / day / project | Yes (input/output/cacheRead) | Web (localhost:8080) | JSONL ~/.claude/projects/ | Yes (30s) | Yes — pure observability; dedupes CC's 2–3× JSONL stream-writes |
| phuryn/claude-usage | 1,826/16 · MIT | message / session / day / model | Yes (input/output/cache_creation/cache_read) | Web + VS Code sidebar + CLI | JSONL | Yes (30s) | Yes — discloses $ is API-equivalent, not subscription |
| alexgreensh/token-optimizer (anchor) | 1,341/8 · PolyForm-NC | per-turn / message / session / 30-day | Yes + 5m/1h TTL write split | Web + status line + CLI | JSONL → SQLite | Yes (status line) | Yes — richest split; but an optimizer, not pure viz |
| ColeMurray/claude-code-otel + official Grafana | ~441 · MIT | session / day / model (aggregated) | Yes (OTel type = input/output/cacheRead/cacheCreation) | Grafana + Prometheus | Claude Code native OTel (no proxy) | Yes | Yes — org-grade, counters only (no per-message drilldown) |
| delexw/claude-code-trace | 311/1 · MIT | message + tool calls | partial (counts where available) | Desktop (Tauri) + Web + TUI | JSONL | Yes (live tail) | Yes — best session replay; token surface minimal |
| jhlee0409/claude-code-history-viewer | 1,598/4 · MIT | message / session; cross-tool | token usage; cache split not emphasized | Desktop app | JSONL (CC + Codex/Cursor/Gemini/Cline/Aider/OpenCode) | Post-hoc | Yes (multi-agent) — browser, token secondary |
| dabitk/claude-code-token-visualizer (cctv) | 0/0 · MIT | per-request → time buckets | input/output; cache hit-rate | Terminal TUI (live histograms) | tails .jsonl | Yes (live) | Yes — real but unproven adoption |
| ccusage | 16,080★ · MIT | day / session / model totals | totals only | CLI | JSONL | Yes | Yes (record 18) |
| /usage + tools/session_cost.py | first-party / in-repo | by skill/subagent/plugin; the 32/29/20/17/2 split | yes | in-CLI / script | built-in / JSONL | n/a | Yes (the baseline instruments, record 18) |
| Langfuse / Helicone / Phoenix / OpenLLMetry | 5.8k–29k★ · mixed (MIT/Apache/Elastic) | per-call / per-trace span | yes if you instrument | self-host web | you instrument (SDK/OTel); Helicone is a proxy | Yes | No — not CC-native |
Verified OTel detail (no proxy): Claude Code natively emits claude_code.token.usage with a type attribute valued input / output / cacheRead / cacheCreation, plus model / user / team / skill.name / plugin.name / agent.name, and a separate claude_code.cost.usage (USD). So the Grafana path gets the cache-read/write split for free — but only as aggregated counters (no per-message, no per-token).
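To make the "aggregated counters" limit concrete, the sketch below rolls up exported data points by the type attribute. The metric name and the four type values come from the schema above; the data-point shape (a dict with metric, attributes, and value keys) and the function name rollup_by_type are assumptions for illustration, since the real shape depends on the exporter in use.

```python
from collections import defaultdict

TOKEN_METRIC = "claude_code.token.usage"


def rollup_by_type(points: list[dict]) -> dict[str, int]:
    """Sum token counter values per `type` attribute, ignoring other metrics."""
    totals: dict[str, int] = defaultdict(int)
    for point in points:
        if point.get("metric") != TOKEN_METRIC:
            continue  # e.g. claude_code.cost.usage is tracked separately
        token_type = point.get("attributes", {}).get("type", "unknown")
        totals[token_type] += point.get("value", 0)
    return dict(totals)


# Hypothetical exported points: counters carry no message identity,
# so the finest available grain is the attribute set, not the turn.
points = [
    {"metric": TOKEN_METRIC, "attributes": {"type": "input"}, "value": 15},
    {"metric": TOKEN_METRIC, "attributes": {"type": "cacheRead"}, "value": 9000},
    {"metric": TOKEN_METRIC, "attributes": {"type": "cacheRead"}, "value": 1000},
    {"metric": TOKEN_METRIC, "attributes": {"type": "cacheCreation"}, "value": 2000},
    {"metric": TOKEN_METRIC, "attributes": {"type": "output"}, "value": 300},
    {"metric": "claude_code.cost.usage", "attributes": {}, "value": 1},
]
print(rollup_by_type(points))
```

Nothing in a counter point identifies the message it came from, which is why this path yields the cache split but no per-message drilldown.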
The pattern mirrors file 54's compression sweep: the JSONL-reading, no-proxy class is the safe default for a coding agent (token-optimizer, nateherkai, phuryn, ccusage, /usage, tools/) — it cannot perturb the request or add prefix rent. The native-OTel→Grafana path is the clean org-grade option (also no proxy) but coarser. The general platforms (Langfuse/Helicone/Phoenix/OpenLLMetry) are powerful but observe apps you instrument yourself, are not Claude-Code-native, and the proxy variants (Helicone, reportedly in maintenance mode) reroute the base URL — a caching and availability risk. Use those only if you are also building your own Claude API app.
Best for "visualize my latest session"
- nateherkai/token-dashboard: the pure-observability pick for this brief. Per-prompt→session→day with input/output/cacheRead split, heatmaps, subagent attribution, local web UI, no proxy, no compression side-effects, and it correctly dedupes Claude Code's 2–3× JSONL stream-writes (an accuracy point most miss). MIT. Does only visibility, which is exactly what was asked.
- phuryn/claude-usage: closest runner-up; cleanest cache_creation/cache_read separation plus a VS Code sidebar; honestly flags that its dollars are API-equivalent (the file-41 subscription caveat, disclosed by the tool itself). MIT.
- token-optimizer (anchor), observability layer only: the richest numbers (cache-write 5m/1h TTL split, exact-from-usage-object, four pricing tiers, wasteful-thinking flag, live status line). Third for this brief only because it is an optimizer that also rewrites reads/compaction; #1 on raw capability if you want (or don't mind) the active features. Source-available (PolyForm-NC), not OSI.
- claude-code-otel + official Grafana: best durable, queryable, org-grade dashboard on Claude Code's native OTel (no proxy); loses on grain (aggregated, no per-message).
- delexw/claude-code-trace (replay across desktop/web/TUI, live tail) + jhlee0409/claude-code-history-viewer (multi-agent browser): best for navigating the latest session as a conversation with tool calls; pair with the first pick (token-dashboard) for replay + cost.
Not for this goal: Langfuse/Helicone/Phoenix/OpenLLMetry (instrument-your-own-app, not CC-native; Helicone needs a proxy); tokenizer counters (Simon Willison's, claude-tokenizer, lunary — per-text totals, not sessions); quota monitors (per-token split is not their job).
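The stream-write deduplication credited to token-dashboard above matters because summing every transcript line overcounts when one API response is written two or three times. A minimal sketch of the idea follows. It is not token-dashboard's implementation, and it assumes that repeated lines share a message.id and a top-level requestId and that the last write carries the final usage; both the field names and the keep-last policy are assumptions to verify against real transcripts.

```python
import json


def dedupe_usage(jsonl_text: str) -> list[dict]:
    """Keep one usage object per API response, preferring the last line written."""
    latest: dict[tuple, dict] = {}
    for line_no, line in enumerate(jsonl_text.splitlines()):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # blank or partially written line
        message = record.get("message") if isinstance(record, dict) else None
        usage = message.get("usage") if isinstance(message, dict) else None
        if not isinstance(usage, dict):
            continue
        key = (message.get("id"), record.get("requestId"))
        if key == (None, None):
            key = ("line", line_no)  # no identifiers: never merge with another line
        latest[key] = usage  # a later write for the same response replaces the earlier one
    return list(latest.values())


def total_output(usages: list[dict]) -> int:
    """Sum output_tokens across deduplicated usage objects."""
    return sum(u.get("output_tokens") or 0 for u in usages)
```

Comparing total_output over the deduplicated list against a naive per-line sum on an archived session is a quick way to see how large the overcount is for a given tool.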
jackin' fit
- Adopt the local-transcript observability class as the measurement front-end of the validation harness (31, and the per-tool harnesses in 51/53). The pure-observability MIT options (nateherkai/token-dashboard and phuryn/claude-usage) and tools/session_cost.py answer the same question the harness needs ("where did the tokens go this session?") with no proxy and no prefix cost; token-optimizer's dashboard is richer but ships an optimizer alongside. All are pure negative-cost on the visibility axis: the file-47 "measure first" rule with a UI.
- Mind the license and the metric. PolyForm Noncommercial means token-optimizer is fine for an operator/researcher to run but is not a permissive dependency to bundle into jackin-core; and its dollar dashboards assume API/Vertex/Bedrock pricing, while the operator's Max subscription makes tasks-per-cap the real objective (file 41). Read the token/cache/quality panels; discount the dollar totals.
- Close the thinking blind spot in jackin' tooling. The highest-value local addition is not another dashboard but a count_tokens-backed thinking-vs-visible estimator layered on the JSONL reader (the dossier already ships tools/count_tokens.py): the one token class no current visualizer shows.
- Do not treat keep-warm as a default win. It is the dossier's killed keepalive lever (graveyard #3), correctly scoped here to resumed/TTL-expired API-billed sessions; on a live Claude Code loop and on a subscription it is marginal-to-irrelevant. Leave it opt-in and measure it before trusting the projected dollars (the README itself calls them "history-replay estimates, not yet-realized dollars").
Validation protocol
To accept any visibility tool as the harness front-end, point it at a set of archived Claude Code sessions and (a) reconcile its session token totals against tools/session_cost.py and ccusage to within 5% (the record-18 reconciliation bar), (b) confirm it reads transcripts only (no proxy, no cache_control mutation; diff usage fields with and without it running), and (c) verify its cache-read/write/output split matches the JSONL usage object turn-for-turn. Treat its quality grades and projected savings as heuristics, not measurements, until A/B'd on the 31/51/53 harness at equal task success.
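Step (a) reduces to a relative-gap check per token class. The sketch below is one hedged way to express it; the function names and the dict-of-totals input shape are hypothetical, and it assumes the reference totals come from an instrument already trusted (for example tools/session_cost.py). A class passes when its gap is strictly below the tolerance.

```python
def relative_gap(candidate: int, reference: int) -> float:
    """Absolute difference as a fraction of the reference total."""
    if reference == 0:
        return 0.0 if candidate == 0 else float("inf")
    return abs(candidate - reference) / reference


def reconcile(
    candidate: dict[str, int],
    reference: dict[str, int],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Return the token classes that miss the tolerance, with their gaps.

    An empty result means every class reconciles to within the tolerance.
    A class missing from either side is treated as a zero total.
    """
    failures = {}
    for token_class in sorted(set(candidate) | set(reference)):
        gap = relative_gap(candidate.get(token_class, 0),
                           reference.get(token_class, 0))
        if gap >= tolerance:
            failures[token_class] = gap
    return failures


# Illustrative totals only (not measured data).
tool_totals = {"input": 1000, "output": 220, "cache_read": 5000}
reference_totals = {"input": 1000, "output": 200, "cache_read": 5200}
print(reconcile(tool_totals, reference_totals))
```

Checking each class separately, rather than only the grand total, keeps an overcount in one class from being masked by an undercount in another.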
Source ledger
All accessed 2026-06-15.
- token-optimizer repo + README (features, data source, UI, limitations, PolyForm-NC license): github.com/alexgreensh/token-optimizer
- token-optimizer stats (1,341★ / 8 watchers / 110 forks / created 2026-02-26): gh api repos/alexgreensh/token-optimizer; source tree (session-parser.js, dashboard.js, jl-sketcher.js, pricing.js, quality.js, read-cache.js, smart-compact.js, drift.js) confirms the JSONL-parse → dashboard architecture; exactness + Measured-vs-Estimated split from docs/METHODOLOGY.md
- similar visibility / session-visualization tools (verified via repo READMEs + gh api): nateherkai/token-dashboard, phuryn/claude-usage, ColeMurray/claude-code-otel, delexw/claude-code-trace, jhlee0409/claude-code-history-viewer, dabitk/claude-code-token-visualizer; general platforms langfuse, Helicone, Arize Phoenix, OpenLLMetry; tokenizer counters (per-text, not session) simonw/tools
- Claude Code native OTel metric/attribute schema (claude_code.token.usage type = input/output/cacheRead/cacheCreation; claude_code.cost.usage): code.claude.com/docs/en/monitoring-usage; official Grafana dashboard 25255
- dossier cross-references: measurement method and ccusage / /usage in 03-prior-art-and-market-scan.md (record 18); the token-class decomposition and thinking-invisibility in 02-baseline-audit.md and 00-executive-summary.md; keepalive graveyard #3 in 00/20; outline lever in 51-code-intelligence-tools.md; subscription/quota metric in 41-subscription-and-quota-economics.md; online quality + context rot in 47-meta-cost-governance-and-online-quality.md and 46-fresh-literature-and-market-delta.md; runnable instruments in tools/
- compression companions: 53-headroom-and-context-compression.md, 54-context-compression-literature-and-market.md