Defer undeclared tool-surface changes from check to verify#237
Conversation
`shipgate check` (and the MCP `shipgate.check` tool) returned `allow` on a capability-adding diff that touches a tool/capability surface the manifest does not declare — or when there is no manifest — even though the trigger already flags the diff in scope. That is a false clear: only `verify` compiles the capability delta, so an agent acting on a boundary `allow` could ship a change `verify` would block. main already defers manifest-declared tool sources (`capability_surfaces_changed` → `coverage_gap`). This extends that coverage to the undeclared / no-manifest case: - Fire only when a content-driven `run_shipgate` trigger rule matched (e.g. TRIGGER-MCP-EXPORT-CHANGED), never the blanket `force_run` rule — so a docs-only edit in an opted-in repo stays `allow` (no noise). - Require a changed file the boundary evaluator does not itself inspect (`not is_boundary_path`), so a clean `allow` over `.codex/config.toml`, `AGENTS.md`, or the Shipgate workflow stays authoritative. - Escalate `allow` → `warn` and route to `agents-shipgate detect` (declare-then-verify), because `verify` only gates *declared* surfaces; routing straight to verify would falsely reassure on a surface it never scans. Never downgrades a real boundary `block` / `require_review`. The MCP `shipgate.check` tool inherits the fix automatically via the shared `build_codex_agent_result` path. No schema or check-ID changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Addresses two review findings on the boundary coverage-gap defer. P1 — mixed declared + undeclared diffs only routed to `verify`. The undeclared gap was gated on `not coverage_surfaces`, making declared and undeclared mutually exclusive: a diff that changed a declared MCP export *and* an undeclared `api/openapi.yaml` routed to `verify`, which never scans the undeclared surface, so the new capability escaped the gate. Undeclared surfaces are now computed independently, per-file, and take precedence — a mixed diff routes to `detect` (declare-then-verify), which then covers both. P2 — `_trigger_in_scope` treated every `run_shipgate` rule as a tool-surface rule, so `shipgate.yaml` (TRIGGER-SHIPGATE-MANIFEST), `prompts/**` (TRIGGER-PROMPTS-OR-POLICIES), and the CI gate were mislabeled `undeclared_capability_surface` and routed to `detect`. Classification now keys on the actual tool-source rules only (MCP export, OpenAPI, static tool inventory, Codex plugin, n8n, @function_tool), excluding the manifest, prompts/policies, CI workflow, and `.codex` boundary config (which the boundary check itself evaluates). Implementation: `build_codex_agent_result` classifies each changed file per-file via the public trigger evaluator (glob rules match the path; diff_contains rules match the file's added lines), excluding boundary paths and already-declared files, and passes an explicit `undeclared_capability_surfaces` list to `evaluate_codex_boundary_result`. The coarse `_trigger_in_scope` + `is_boundary_path` heuristic is removed in favor of that explicit list. Tests: added mixed declared+undeclared cases (direct + via `check`) and manifest/prompts no-false-detect cases. Full suite green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Addressed both findings in P1 (mixed declared + undeclared → only verify). Undeclared surfaces are now computed independently and per-file, not gated on P2 ( Classification reuses the public trigger evaluator per file (glob rules match the path; |
`_undeclared_tool_surfaces_changed` treated any matched tool-source rule as a tool surface, ignoring the evaluator's winning verdict. A docs or test file that incidentally mentions `@tool` / `@function_tool` matches `TRIGGER-FUNCTION-TOOL-DECORATOR` but ALSO `TRIGGER-DOCS-ONLY-NEGATIVE`, which beats `run_shipgate` — so the catalog skips it. The classifier nonetheless flagged `README.md` / `tests/test_*.py` as an `undeclared_capability_surface` and routed to `detect`, contradicting the trigger catalog (docs/triggers.json: docs/tests that incidentally mention `@tool` must not run). Require the per-file trigger to actually run (`run_shipgate` true) in addition to a tool-source rule appearing in `matched_rules`. A real `@function_tool` source (run_shipgate true) still classifies; a docs/test mention (skip) no longer does. Tests: docs file and tests/ file that mention the decorator stay `allow`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Fixed in The per-file classifier now requires the evaluator's winning verdict ( Verified: |
Summary
shipgate check(and the MCPshipgate.checktool) returnedallowon a capability-adding diff that touches a tool/capability surface the manifest does not declare — or when there is no manifest — even though the trigger already flags the diff in scope. Onlyverifycompiles the capability delta, so an agent acting on that boundaryallowcould ship a changeverifywould block.mainalready defers manifest-declared tool sources (capability_surfaces_changed→coverage_gap). This PR extends that coverage to the undeclared / no-manifest case (the common "agent just added a new, not-yet-registered tool file"):run_shipgatetrigger rule matched (e.g.TRIGGER-MCP-EXPORT-CHANGED), never the blanketforce_runrule — so a docs-only edit in an opted-in repo staysallow(no noise).not is_boundary_path), so a cleanallowover.codex/config.toml,AGENTS.md, or the Shipgate workflow stays authoritative.allow→warnand routes toagents-shipgate detect(declare-then-verify), becauseverifyonly gates declared surfaces — routing straight toverifywould falsely reassure on a surface it never scans.block/require_review. The MCPshipgate.checktool inherits the fix automatically via the sharedbuild_codex_agent_resultpath.Behavior
tool_sourceswarn→ verifyallowwarn→ detectallowwarn→ detect.codex/config.tomlrepair (boundary-evaluated)allowallowType
Verification
CI is authoritative for
python -m ruff check .,python -m compileall -q src tests, andpython -m pytest.Additional local checks run:
pytest tests/test_codex_boundary_check.py tests/test_agent_protocol.py— green (incl. 5 new tests covering the undeclared/no-manifest gap, the never-downgrade-a-block invariant, and the no-noise-on-docs property).ruff checkon changed files — clean.test_latency_budget::test_scenarios_scale_sublinearly, a load-sensitive perf test that fails identically on the unmodified base and passes in isolation (unrelated to this change).shipgate.checktool returnswarn→detecton an undeclared MCP tool-add.Release-readiness notes
docs/checks.md— N/A: no new check IDs (adds aundeclared_capability_surfacediagnostic + coverage trace only)STABILITY.md— N/A: no schema change; uses existingAgentResultV1/CodexBoundaryResultV1fields🤖 Generated with Claude Code