jackin'
ResearchToken Optimization Research

31 — Validation Harness: the No-Quality-Loss Proof Protocol

31 — Validation Harness: the No-Quality-Loss Proof Protocol

This file is a runnable protocol, not a literature review. Any technique or stack in this dossier earns "no quality loss" only by passing this harness.

TL;DR

  • Paired-task A/B: same ≥12 tasks per arm (baseline vs technique), headless claude -p runs, objective pass/fail checks, usage read from the JSON result envelope.
  • Decision rule: non-inferiority on success rate with a 5-percentage-point margin at n=12 (screening) or n=30 (confirmation); cost compared only between arms with statistically indistinguishable success.
  • Six canary classes target the known compression failure modes: negation, ordering, numeric precision, "don't do X" constraints, multi-step sequences, caveat retention — each with an exact string-level assertion.
  • Tokens by class come from usage in claude -p --output-format json (plus OTel or session JSONL for per-call decomposition); dollars are computed with the live price table in 01-economics-and-measurement.md.
  • A technique is NEGATIVE-COST only if it wins cost and its success rate is ≥ baseline at n≥30 — the bar folklore numbers never clear.

1. Design

Two arms, identical except for the technique under test:

  • Arm A (baseline): default Claude Code settings, no style compression, default context habits, fixed model + effort.
  • Arm B (technique): baseline + exactly one technique (or one composed stack — stacks are validated as a unit because savings interact).

Run all tasks in both arms, fresh session per task (no cross-task contamination), same repo fixture state per task (reset via git between runs). Order randomized; one run per task per arm per round to keep API nondeterminism comparable.

Measured per task: success (objective check), $ per task (usage × live prices), tokens by class (uncached input / cache write / cache read / output), turns (assistant API calls), wall time.

2. Task suite (n = 12, objective checks)

Tasks are repo-generic; instantiate against any pinned fixture repo (for this repo, pin a commit and use a worktree). Each has a deterministic checker — no judge needed for the primary metric.

#Task prompt (abridged)Objective check
1Fix the failing test <T> (introduce a known 1-line bug first)cargo nextest run <T> exits 0
2Add a unit test covering function <F>'s error branchnew test exists and fails when branch is broken, passes otherwise
3Rename symbol <S> to <S2> across the crategrep -rw <S> src/ empty AND cargo check exits 0
4Explain what <module> does and list its public functionsanswer contains all pub fn names (scripted set comparison)
5Find where config value <K> is parsed; report file:linereported location matches grep -n ground truth
6Apply a provided 20-line spec diff to <file>git diff equals expected patch modulo whitespace
7Write a CLI subcommand <cmd> per 10-line specscripted invocation produces specified stdout
8Summarize the last 5 commits' intent in ≤5 bulletsall 5 commit subjects' key nouns present (set check)
9Identify the bug causing provided failing-test outputnames the actual defective function (string match)
10Update a doc page to reflect a changed flag nameold flag absent, new flag present, page builds
11Add input validation rejecting negative <param>property test with negative input now errors; positive path unchanged
12Triage: which of 3 provided diffs introduced the regressionnames the correct diff

Success = checker exit 0. Tasks 4, 8, 9, 12 use scripted string/set assertions (still objective; no LLM judge). If a task needs judgment (rare), use the rubric judge in §5 — but never for the headline number.

3. Canary tasks (compression failure modes)

Run in both arms; any Arm-B-only failure is a quality regression regardless of the task suite score. Exact assertions:

CanaryPrompt coreAssertion (scripted)
C1 Negation"List the steps to deploy. Do NOT include the rollback step."output lacks "rollback" content step; includes deploy steps
C2 Ordering"Print migration steps; backup MUST precede schema drop."index("backup") < index("drop") in the answer
C3 Numeric precision"The timeout is 4.75 seconds; write the config snippet."literal 4.75 present (not 4.7, 5, or 4750ms unless converted correctly)
C4 Don't-do-X"Refactor this function. Do not change its public signature."signature line byte-identical post-run
C5 Multi-step7-step setup where step 5 depends on step 3's outputall 7 steps present, dependency order preserved
C6 Caveat retentionAsk for a command that has a destructive flag; reference answer carries a warningwarning semantics present in answer (keyword set)

Rationale: these are precisely what register compression (caveman/wenyan), terse-output prompts, summarization, and aggressive pruning are most likely to silently drop — short connectives, negations, and qualifiers carry disproportionate meaning per token (10-style-and-language-compression.md).

4. Runner (concrete)

Headless driver; one invocation per task per arm. claude -p returns a JSON envelope with the result text and cumulative usage (the same fields as the API usage object).

#!/usr/bin/env bash
# run_arm.sh &lt;arm-name&gt; &lt;settings-dir&gt; &lt;tasks.tsv&gt; &lt;out.jsonl&gt;
set -euo pipefail
ARM=$1; SETTINGS=$2; TASKS=$3; OUT=$4
while IFS=$'\t' read -r ID PROMPT CHECK; do
 git -C fixture reset --hard "$FIXTURE_SHA" -q
 R=$(CLAUDE_CONFIG_DIR="$SETTINGS" claude -p "$PROMPT" \
 --output-format json --max-turns 40 --cwd fixture 2>/dev/null)
 PASS=0; (cd fixture && bash -c "$CHECK") && PASS=1
 jq -nc --arg id "$ID" --arg arm "$ARM" --argjson pass "$PASS" \
 --argjson r "$R" \
 '{id:$id, arm:$arm, pass:$pass,
 usage:$r.usage, cost_usd:$r.total_cost_usd,
 turns:$r.num_turns, duration_ms:$r.duration_ms}' >> "$OUT"
done < "$TASKS"

Technique arms are realized as a separate CLAUDE_CONFIG_DIR (own settings.json, hooks, CLAUDE.md, plugins) so arms cannot leak into each other. For techniques that are session habits rather than config (e.g. "/clear between tasks"), encode the habit in the driver, not the prompt.

Dollars: prefer total_cost_usd from the envelope; recompute from usage × the price table (01-economics-and-measurement.md §1) as a cross-check. Per-call class decomposition when needed: enable OTel (claude_code.token.usage by type) or parse the session JSONL deduplicating by message.id (trap documented in 02-baseline-audit.md §1).

5. Judging

  • Primary: objective checkers only (above). Tests pass, diff correct, string assertions.
  • Secondary (only where unavoidable): LLM judge, fixed rubric, judge model ≠ tested model family where possible; rubric: task satisfied (0/1), constraints respected (0/1), factual errors (count), missing caveats (count). Judge sees both arms' outputs blinded and shuffled; ties broken by a second judge run with order swapped. Judge verdicts never override an objective checker.

6. The statistical bar for "no quality loss"

Success rates p_A, p_B over n tasks/arm. Declare no quality loss when Arm B is non-inferior with margin δ = 5 percentage points:

  • Screening (n = 12): require successes_B ≥ successes_A − 1 AND zero canary regressions. Power is honestly low at n=12 — this only filters out gross degradation (it will catch drops ≳25pp reliably, not 10pp).
  • Confirmation (n = 30, required before a technique enters a recommended stack): one-sided bootstrap (10,000 resamples of paired per-task outcomes) for Δ = p_A − p_B; require the 95th percentile of Δ < 5pp, plus zero canary regressions in 3 repetitions of the canary set.
  • Cost comparison is only valid between arms passing the bar: report median and mean $/task with bootstrap 95% CIs, and tokens by class so the saving's location is visible (a claimed output saving showing up as cache-read movement means the mechanism is misattributed — investigate before believing).

Paired bootstrap (tasks are paired across arms) in ~15 lines:

import json, random, sys
rows = [json.loads(l) for l in open(sys.argv[1])]
A = {r['id']: r for r in rows if r['arm']=='A'}
B = {r['id']: r for r in rows if r['arm']=='B'}
ids = sorted(set(A) & set(B)); n = len(ids)
deltas = []
for _ in range(10_000):
 s = [random.choice(ids) for _ in range(n)]
 deltas.append(sum(A[i]['pass'] for i in s)/n - sum(B[i]['pass'] for i in s)/n)
deltas.sort
print(f"Δ success (A−B) p95 = {deltas[int(.95*10_000)]:.3f} (non-inferior if < 0.05)")
print(f"cost/task A median {sorted(A[i]['cost_usd'] for i in ids)[n//2]:.3f} "
 f"B median {sorted(B[i]['cost_usd'] for i in ids)[n//2]:.3f}")

7. Per-technique falsification experiments

Every technique record in files 10–20 names its falsification experiment; they all reduce to an instantiation of this harness:

  • Style layers (caveman/wenyan): arms differ only in the style hook; canaries C1–C6 are the decisive part (register compression's failure mode is dropped qualifiers, not wrong code).
  • Context pruning/masking: Arm B evicts stale tool results; add task variants that require recalling early-session facts — tests whether eviction drops load-bearing history.
  • Model routing: Arm B routes named subtasks to a cheaper model; the task suite must include the routed task class (e.g. tasks 4, 5, 8 for Haiku exploration).
  • Caching techniques: quality-neutral by construction (identical tokens, different pricing) — validation is purely arithmetic: verify cache_read_input_tokens moved as predicted, success untouched.
  • Effort/thinking budget cuts: the highest-risk class (touches reasoning itself). Confirmation n=30 mandatory; include tasks 1, 9, 11, 12 (the reasoning-heavy ones).

8. Where to read the numbers (recap)

SurfaceUse
claude -p --output-format json envelopeper-run usage, total_cost_usd, turns — the harness's primary source
API usage object fieldsground truth class decomposition (01 §4)
/cost, /contextinteractive spot checks
OTel claude_code.token.usage / cost.usagefleet dashboards; per-agent.name attribution
Session JSONLpost-hoc per-call analysis (dedup by message.id; thinking is redacted — infer per 02 §6)

Verification ledger

ItemBasis
claude -p --output-format json envelope fields (usage, total_cost_usd, num_turns)Claude Code headless mode docs — code.claude.com/docs ; field names cross-checked against local CLI output format
Price table used for $ math01-economics-and-measurement.md (live-)
JSONL dedup + thinking-redaction trapsLocal measurement, 02-baseline-audit.md
δ=5pp, n=12/30, bootstrap procedureDesign choices (stated, not claimed as external standard); power statements are arithmetic over binomial outcomes — ESTIMATE

On this page