fix(eval): count cache tokens + per-harness model in variance probe by theagenticguy · Pull Request #271 · theagenticguy/opencodehub

theagenticguy · 2026-06-30T16:56:27Z

What

The first live Bedrock run of the Move 2 probe surfaced two bugs. Unit tests missed them because the stubbed spawn never produced real Claude usage blocks.

Bug 1 — token overhead undercounted

parseClaudeOutput read only usage.input_tokens and ignored the ~27K cached system tokens Claude Code injects per call (cache_creation_input_tokens + cache_read_input_tokens); Codex's cached_input_tokens was dropped too. So the token-overhead headline — the exact "~10% more tokens" cost the variance claim trades against — was wrong.

- RunTokens / ArmTokens gain a cacheTokens field.
- Both parsers capture it; totalTokens() counts it toward the overhead ratio.
- formatReport now prints the in + out + cache split per arm so the overhead is auditable.
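The accounting change can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: only `RunTokens`, `cacheTokens`, and `totalTokens` are names taken from the description, while `inputTokens`, `outputTokens`, `tokensFromClaudeUsage`, `overheadRatio`, and `formatSplit` are hypothetical, and the sample numbers are made up to mirror the "~27K cached tokens" case rather than measured.

```typescript
// Hypothetical shape: only `cacheTokens` is named in the PR description.
interface RunTokens {
  inputTokens: number;
  outputTokens: number;
  cacheTokens: number;
}

// Usage fields named in the description; optional so a missing field counts as 0.
interface ClaudeUsage {
  input_tokens?: number;
  output_tokens?: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

// Fold both cache counters into one cacheTokens value instead of dropping them.
function tokensFromClaudeUsage(usage: ClaudeUsage): RunTokens {
  return {
    inputTokens: usage.input_tokens ?? 0,
    outputTokens: usage.output_tokens ?? 0,
    cacheTokens:
      (usage.cache_creation_input_tokens ?? 0) +
      (usage.cache_read_input_tokens ?? 0),
  };
}

// Cache tokens count toward the total, so they also count toward the overhead ratio.
function totalTokens(t: RunTokens): number {
  return t.inputTokens + t.outputTokens + t.cacheTokens;
}

// Ratio of one arm's total to the other's (argument names are illustrative).
function overheadRatio(treatment: RunTokens, baseline: RunTokens): number {
  return totalTokens(treatment) / totalTokens(baseline);
}

// Render the in + out + cache split so the total can be audited by eye.
function formatSplit(t: RunTokens): string {
  return `${t.inputTokens} in + ${t.outputTokens} out + ${t.cacheTokens} cache = ${totalTokens(t)}`;
}

// Illustrative usage block, not real probe output.
const sample = tokensFromClaudeUsage({
  input_tokens: 12,
  output_tokens: 300,
  cache_creation_input_tokens: 2000,
  cache_read_input_tokens: 25000,
});
console.log(formatSplit(sample));
```

With these made-up numbers, reading only `input_tokens` would report 12 input tokens while the call actually carried 27,000 cached tokens, which is the kind of gap that skews an overhead ratio.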

Bug 2 — global `--model` can't serve two agents

Claude and Codex take different Bedrock ids (a us.-prefixed Anthropic profile vs an openai.* model), so one --model handed Claude's id to Codex (forcing per-harness runs in the live test).

- Replaced --model with --model-claude / --model-codex.
- The values are threaded through as a per-harness models: Partial<Record<Harness, string>> map that the runner factory reads by harness.
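The per-harness routing might look something like the sketch below. It is written under assumptions: the `Harness` union values, `parseModelFlags`, the `buildArgv` signature, and the idea that each agent CLI receives its model through a `--model` argument are all hypothetical here, and the model ids are placeholders rather than real Bedrock ids. Only the `--model-claude` / `--model-codex` flags, the `Partial<Record<Harness, string>>` type, and the `buildArgv` name come from the description.

```typescript
// Assumed harness names; the PR only says the type is called Harness.
type Harness = "claude" | "codex";
type HarnessModels = Partial<Record<Harness, string>>;

// Collect --model-claude / --model-codex values from an argv-style array.
function parseModelFlags(argv: string[]): HarnessModels {
  const models: HarnessModels = {};
  // Stop one short of the end so every flag we match has a value after it.
  for (let i = 0; i < argv.length - 1; i++) {
    if (argv[i] === "--model-claude") models.claude = argv[i + 1];
    if (argv[i] === "--model-codex") models.codex = argv[i + 1];
  }
  return models;
}

// Hypothetical argv builder: appends a model argument only when one is
// configured for this harness, so the other agent's id never leaks across.
function buildArgv(
  harness: Harness,
  baseArgs: string[],
  models: HarnessModels,
): string[] {
  const model = models[harness];
  return model === undefined ? [...baseArgs] : [...baseArgs, "--model", model];
}

// Placeholder ids, for illustration only.
const models = parseModelFlags([
  "--model-claude", "us.example-anthropic-profile",
  "--model-codex", "openai.example-model",
]);
console.log(buildArgv("claude", ["run"], models));
console.log(buildArgv("codex", ["run"], models));
```

Keeping the map `Partial` means a harness with no configured model simply gets no model argument, so a single invocation can drive both agents without either one inheriting the other's id.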

Why it matters

These are the two findings from the live dogfood run. The probe machinery worked end-to-end on Bedrock; these fixes make its cost number correct and let a single invocation drive both agents with their right models — prerequisites for a trustworthy with/without delta on the re-run.

Validation

- biome ci . ✓ (713 files) · tsc -b ✓ · full build ✓ (eval inlined into CLI) · all 18 packages pass with 0 failures
- +4 tests: cache tokens in parseClaude/parseCodex and in the overhead total; per-harness model in the buildArgv argv and in the command's per-harness routing
- No behavior change to the decision-equivalence / pack paths

🤖 Generated with Claude Code

Two bugs surfaced by the first live Bedrock run of the Move 2 probe.

Bug 1 — token overhead undercounted. parseClaudeOutput read only usage.input_tokens, ignoring the ~27K cached system tokens Claude Code injects per call (cache_creation_input_tokens + cache_read_input_tokens); Codex's cached_input_tokens was likewise dropped. The token-overhead headline — the exact "~10% more tokens" cost the variance claim trades against — was therefore wrong. RunTokens/ArmTokens gain a `cacheTokens` field, both parsers capture it, and totalTokens() counts it toward the overhead ratio. formatReport now prints the in/out/cache split per arm so the overhead is auditable.

Bug 2 — global --model can't serve two agents. Claude and Codex take different Bedrock ids (a us.-prefixed Anthropic profile vs an openai.* model), so one --model handed Claude's id to Codex. Replaced with --model-claude / --model-codex, threaded as a per-harness models: Partial<Record<Harness,string>> the runner factory reads by harness.

+4 tests (cache in parse + overhead, per-harness model in argv + routing); all 18 packages green. No behavior change to the decision-equivalence or pack paths.

🤖 Automated release via release-please

---

<details><summary>root: 0.10.6</summary>

## [0.10.6](root-v0.10.5...root-v0.10.6) (2026-06-30)

### Bug Fixes

* **eval:** count cache tokens + per-harness model in variance probe ([#271](#271)) ([df12cf9](df12cf9))

### Documentation

* finding 0001 — OCH pack cuts agent token usage 2–4× (live Move 2 data) ([#273](#273)) ([c712ebc](c712ebc))

</details>

<details><summary>cli: 0.10.6</summary>

## [0.10.6](cli-v0.10.5...cli-v0.10.6) (2026-06-30)

### Bug Fixes

* **eval:** count cache tokens + per-harness model in variance probe ([#271](#271)) ([df12cf9](df12cf9))

</details>

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

theagenticguy merged commit df12cf9 into main Jun 30, 2026
38 checks passed

theagenticguy deleted the fix/variance-probe-tokens-and-model branch June 30, 2026 17:00

github-actions Bot mentioned this pull request Jun 30, 2026

chore: release main #272

Merged

theagenticguy merged 1 commit into main from fix/variance-probe-tokens-and-model
