fix(eval): count cache tokens + per-harness model in variance probe #271

Merged
theagenticguy merged 1 commit into main from fix/variance-probe-tokens-and-model
Jun 30, 2026

@theagenticguy

Owner

What

The first live Bedrock run of the Move 2 probe surfaced two bugs. Unit tests missed them because the stubbed spawn never produced real Claude usage blocks.

Bug 1 — token overhead undercounted

parseClaudeOutput read only usage.input_tokens and ignored the ~27K cached system tokens Claude Code injects per call (cache_creation_input_tokens + cache_read_input_tokens); Codex's cached_input_tokens was dropped too. As a result the token-overhead headline was wrong, and that headline is the exact "~10% more tokens" cost the variance claim trades against.

  • RunTokens / ArmTokens gain a cacheTokens field.
  • Both parsers capture it; totalTokens() counts it toward the overhead ratio.
  • formatReport now prints the in + out + cache split per arm so the overhead is auditable.
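A minimal sketch of the accounting described above, not the actual implementation. The usage field names come from this PR description; the `RunTokens` field names other than `cacheTokens`, the helper names `tokensFromClaudeUsage`, `overheadRatio`, and `formatSplit`, and all numbers are hypothetical and chosen only for illustration.

```typescript
// Hypothetical shape: the PR names RunTokens and cacheTokens; the other
// field names here are assumptions.
interface RunTokens {
  inputTokens: number;
  outputTokens: number;
  cacheTokens: number;
}

// Usage fields as named in the PR description for Claude output.
interface ClaudeUsage {
  input_tokens?: number;
  output_tokens?: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

// Capture cache creation + cache read tokens instead of dropping them.
function tokensFromClaudeUsage(usage: ClaudeUsage): RunTokens {
  return {
    inputTokens: usage.input_tokens ?? 0,
    outputTokens: usage.output_tokens ?? 0,
    cacheTokens:
      (usage.cache_creation_input_tokens ?? 0) +
      (usage.cache_read_input_tokens ?? 0),
  };
}

// Cache tokens count toward the total that feeds the overhead ratio.
function totalTokens(t: RunTokens): number {
  return t.inputTokens + t.outputTokens + t.cacheTokens;
}

// Ratio of one arm's total to the baseline arm's total (hypothetical helper).
function overheadRatio(arm: RunTokens, baseline: RunTokens): number {
  return totalTokens(arm) / totalTokens(baseline);
}

// Print the in + out + cache split so the total can be audited by eye.
function formatSplit(label: string, t: RunTokens): string {
  return `${label}: in ${t.inputTokens} + out ${t.outputTokens} + cache ${t.cacheTokens} = ${totalTokens(t)}`;
}

// Made-up usage block for illustration only.
const sample = tokensFromClaudeUsage({
  input_tokens: 12,
  output_tokens: 300,
  cache_creation_input_tokens: 2000,
  cache_read_input_tokens: 25000,
});
console.log(formatSplit("with", sample));
// prints "with: in 12 + out 300 + cache 27000 = 27312"
```

With numbers like these, counting only `input_tokens` would report 12 input tokens for a call that actually processed about 27K, which is why the ratio was untrustworthy before the fix.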

Bug 2 — global --model can't serve two agents

Claude and Codex take different Bedrock ids (a us.-prefixed Anthropic profile vs an openai.* model), so a single --model flag handed Claude's id to Codex as well, which forced separate per-harness runs in the live test.

  • Replaced --model with --model-claude / --model-codex.
  • Threaded through as a per-harness map, models: Partial<Record<Harness, string>>, which the runner factory reads by harness.
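The routing above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the `Harness` string values, the helper names `parseModelFlags` and `modelFor`, and the placeholder model ids are all hypothetical. Only the two flag names and the `Partial<Record<Harness, string>>` type come from the description.

```typescript
// Assumed harness names; the PR only says the map is keyed by Harness.
type Harness = "claude" | "codex";
type Models = Partial<Record<Harness, string>>;

// Collect --model-claude / --model-codex values into a per-harness map.
// Hypothetical parser: the real command may use a CLI library instead.
function parseModelFlags(argv: string[]): Models {
  const models: Models = {};
  for (let i = 0; i < argv.length; i++) {
    const value = argv[i + 1];
    if (value === undefined) continue; // flag with no value: ignore it
    if (argv[i] === "--model-claude") models.claude = value;
    if (argv[i] === "--model-codex") models.codex = value;
  }
  return models;
}

// The runner factory looks up only the id that belongs to its own harness,
// so Claude's id can never leak into a Codex run.
function modelFor(harness: Harness, models: Models): string | undefined {
  return models[harness];
}

// Placeholder ids, not real Bedrock model ids.
const models = parseModelFlags([
  "--model-claude", "us.anthropic.EXAMPLE",
  "--model-codex", "openai.EXAMPLE",
]);
console.log(modelFor("claude", models), modelFor("codex", models));
```

Using a `Partial` map means either flag can be omitted; in this sketch a missing entry yields `undefined`, leaving the harness free to fall back to its own default.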

Why it matters

These are the two findings from the live dogfood run. The probe machinery worked end-to-end on Bedrock; these fixes make its cost number correct and let a single invocation drive both agents with their right models — prerequisites for a trustworthy with/without delta on the re-run.

Validation

  • biome ci . ✓ (713 files) · tsc -b ✓ · full build ✓ (eval inlined into CLI) · all 18 packages pass (0 failures)
  • +4 tests: cache tokens in parseClaude/parseCodex + the overhead total; per-harness model in buildArgv argv + the command's per-harness routing
  • No behavior change to the decision-equivalence / pack paths

🤖 Generated with Claude Code

Two bugs surfaced by the first live Bedrock run of the Move 2 probe.

Bug 1 — token overhead undercounted. parseClaudeOutput read only
usage.input_tokens, ignoring the ~27K cached system tokens Claude Code
injects per call (cache_creation_input_tokens + cache_read_input_tokens);
Codex's cached_input_tokens was likewise dropped. The token-overhead
headline — the exact "~10% more tokens" cost the variance claim trades
against — was therefore wrong. RunTokens/ArmTokens gain a `cacheTokens`
field, both parsers capture it, and totalTokens() counts it toward the
overhead ratio. formatReport now prints the in/out/cache split per arm so
the overhead is auditable.

Bug 2 — global --model can't serve two agents. Claude and Codex take
different Bedrock ids (a us.-prefixed Anthropic profile vs an openai.*
model), so one --model handed Claude's id to Codex. Replaced with
--model-claude / --model-codex, threaded as a per-harness
models: Partial<Record<Harness,string>> the runner factory reads by
harness.

+4 tests (cache in parse + overhead, per-harness model in argv + routing);
all 18 packages green. No behavior change to the decision-equivalence or
pack paths.
@theagenticguy theagenticguy merged commit df12cf9 into main Jun 30, 2026
38 checks passed
@theagenticguy theagenticguy deleted the fix/variance-probe-tokens-and-model branch June 30, 2026 17:00
@github-actions github-actions Bot mentioned this pull request Jun 30, 2026
theagenticguy pushed a commit that referenced this pull request Jun 30, 2026
🤖 Automated release via release-please
---


<details><summary>root: 0.10.6</summary>

## [0.10.6](root-v0.10.5...root-v0.10.6) (2026-06-30)


### Bug Fixes

* **eval:** count cache tokens + per-harness model in variance probe
([#271](#271))
([df12cf9](df12cf9))


### Documentation

* finding 0001 — OCH pack cuts agent token usage 2–4× (live Move 2 data)
([#273](#273))
([c712ebc](c712ebc))
</details>

<details><summary>cli: 0.10.6</summary>

## [0.10.6](cli-v0.10.5...cli-v0.10.6) (2026-06-30)


### Bug Fixes

* **eval:** count cache tokens + per-harness model in variance probe
([#271](#271))
([df12cf9](df12cf9))
</details>

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>