Skip to content

Surface agent_workflow() + curated dir() for LLM discoverability (#460)#464

Merged
igerber merged 5 commits into
mainfrom
agent-workflow-discoverability
May 19, 2026
Merged

Surface agent_workflow() + curated dir() for LLM discoverability (#460)#464
igerber merged 5 commits into
mainfrom
agent-workflow-discoverability

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 18, 2026

Summary

  • Add diff_diff.agent_workflow(df, *, unit, time, treatment, outcome, first_treat=None) — stateless orchestrator that prints a copy-pasteable 5-step workflow (profile_panelget_llm_guide("autonomous")<Estimator>().fit(df, ...)practitioner_next_steps(result)BusinessReport(result).full_report()) with the caller's column names templated in via repr() (source-safe under hostile labels). Calls nothing internally; does not inspect df. Step 3 branches on first_treat so the emitted call always matches an actual fit() signature (CallawaySantAnna with first_treat= when supplied; DifferenceInDifferences with treatment= for the simple 2x2 case).
  • Rewrite top-level __doc__ so the first non-blank paragraph names agent_workflow(df, ...) as the recommended starting call. get_llm_guide("full") and get_llm_guide("practitioner") pointers preserved (the existing tests/test_guides.py::test_module_docstring_mentions_helper guard continues to hold).
  • Add module-level __dir__() override that surfaces agent-facing names (agent_workflow, profile_panel, get_llm_guide, practitioner_next_steps, BusinessReport, DiagnosticReport) at the head of dir(diff_diff). Implementation uses a small _OrderedName(str) subclass with custom __lt__ — CPython's dir() always sorts the result of __dir__(), but PyList_Sort respects custom comparison operators, so the priority head wins. __all__ membership and from diff_diff import * semantics are unchanged. inspect.getmembers(diff_diff) still returns 274 members including __doc__/__name__/__file__.
  • Update diff_diff/guides/llms.txt and diff_diff/guides/llms-autonomous.txt with one-line agent_workflow(...) pointers.

Closes #460.
Companion PR for #461 (snapshot/contract test) will follow after this merges.

Methodology references (required if estimator / math changes)

  • Method name(s): N/A — no estimator, math, variance, or default inference behavior changes.
  • Paper / source link(s): N/A
  • Any intentional deviations from the source (and why): None

Validation

  • Tests added/updated: tests/test_agent_workflow.py (new, 18 tests) covering canonical output keys, workflow primitive references, column-name templating, first_treat conditional, first_treat-driven Step 3 branching, no-overclaim contract on the no-first_treat path, AST-parseability of every emitted call line, adversarial column labels (6 parametrized cases: embedded quotes, backslashes, injection attempts, whitespace), fit_candidates importability, verbose toggle, no-df-inspection contract.
  • Existing test guards continue to hold: tests/test_guides.py (41 tests, including test_module_docstring_mentions_helper).
  • Backtest / simulation / notebook evidence (if applicable): N/A — orchestrator only emits string templates; no fits run.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions
Copy link
Copy Markdown

Overall Assessment

⚠️ Needs changes — no estimator math/SE/default-inference code changed in this diff, but the new agent_workflow() entrypoint introduces one unmitigated P1 methodology/routing issue.

Executive Summary

  • No changes in the diff modify estimator formulas, weighting, variance/SE logic, or documented defaults.
  • P1: agent_workflow() treats first_treat as if it implies the binary staggered CallawaySantAnna path, which conflicts with the library’s own documented ContinuousDiD and dose-based HAD contracts.
  • P2: the emitted "script" is not actually a copy-pasteable Python script as advertised; it contains prose/footer lines and the final report call does not print anything in script execution.
  • P3: the new __dir__() discoverability hack is not contract-tested in this PR.
  • Security surface looks fine; the repr()-based templating is the right shape for hostile column labels.

Runtime note: I could not execute the added tests in this sandbox because numpy is unavailable here, so this review is from static inspection.

Methodology

  • Severity: P1. Impact: first_treat is not equivalent to “binary staggered adoption, so start with Callaway-Sant’Anna.” The new Step 3 branch says exactly that and emits a CallawaySantAnna().fit(...) example whenever first_treat is present, but the library’s own contracts treat first_treat as a required input for ContinuousDiD as well, and dose-based HAD event-study also accepts first_treat_col. Because agent_workflow() is now advertised as the recommended starting call, an agent on a continuous-dose or graded-adoption panel can be steered into the wrong estimator with no warning. I found no REGISTRY.md note documenting this routing shortcut. Concrete fix: keep Step 3 conditional on treatment shape even when first_treat is supplied; if you keep a single example, label it explicitly as the binary staggered case and name ContinuousDiD / HAD alternatives for dose-based panels. Add regression tests for first_treat + continuous-dose and first_treat + HAD event-study scenarios. References: diff_diff/agent_workflow.py (around L148-L170), diff_diff/continuous_did.py, diff_diff/had.py, docs/methodology/REGISTRY.md, docs/methodology/REGISTRY.md, diff_diff/init.py.

Code Quality

  • Severity: P2. Impact: the returned "script" is described and tested as “copy-pasteable,” but the emitted block is not executable Python as printed: it contains prose/footer lines (Full reference: ..., Practitioner recipe: ...) and Step 5 evaluates BusinessReport(...).full_report() without printing or assigning the returned string. In a real .py execution the block will syntax-error, and even line-by-line execution discards the final report output. Concrete fix: either emit a real executable Python block (comment-prefix prose lines and print(diff_diff.BusinessReport(result).full_report())) or rename/document the output as workflow text rather than a script. Add a whole-output smoke test aligned to the intended contract. References: diff_diff/agent_workflow.py (around L189-L233), diff_diff/business_report.py, tests/test_agent_workflow.py.

Performance

No findings.

Maintainability

No findings beyond the test-gap note below.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

  • Severity: P3. Impact: the new public __dir__() discoverability surface relies on a custom _OrderedName(str) ordering trick, but this PR adds no contract test for head order, membership, or inspect.getmembers() compatibility. That is deferrable, but it leaves the most compatibility-sensitive new surface unpinned. Concrete fix: add a small discoverability contract test in this PR, or explicitly track the deferral in TODO.md before merge. References: diff_diff/init.py, tests/test_agent_workflow.py.

Path to Approval

  1. Remove the universal first_treat -> CallawaySantAnna implication in agent_workflow() and add regression tests covering first_treat on continuous-dose / dose-based panels. That resolves the blocking P1 and would move the assessment to .
  2. Recommended before merge: make the emitted "script" contract honest by either generating executable Python or relabeling it as prose workflow text.

igerber added a commit that referenced this pull request May 18, 2026
… script executability + P3 deferral note)

CI codex review on PR #464 (initial push) flagged 3 items:

P1 (Methodology): The first_treat-branch Step 3 example implied
binary staggered -> CallawaySantAnna without naming alternatives.
But ContinuousDiD.fit() and HeterogeneousAdoptionDiD event-study
BOTH take a `first_treat` column (via `first_treat=` and
`first_treat_col=` respectively); the same overclaim pattern I
already caught and fixed for the no-first_treat path in R3 was
present mirror-image on the with-first_treat path. The Step 3
comment block now enumerates the three estimator options for
first_treat panels (CS21 / ContinuousDiD / HAD event-study) and
names the HAD signature distinction (first_treat_col vs first_treat)
so an agent isn't silently steered to CS21 on a continuous-dose
panel. New test `test_first_treat_step3_names_non_binary_alternatives`
locks this.

P2 (Code Quality): The emitted "script" was tested for parseability
only at the line-of-call granularity, not at the whole-file level.
Section headers like "Step 1 - Describe the panel:" and footer
lines "Full reference: ..." would SyntaxError if the agent dumped
the script to a .py file and ran it. Step 5 also evaluated
`BusinessReport(...).full_report()` without printing — the report
str would be discarded on execution. Fix: every prose line in the
template now starts with `#` (valid Python comment); code lines
stand at column 0 and run as-is; Step 5 wraps the report call in
`print()`. The full script `ast.parse()`s as a Python module under
both first_treat branches. New tests
`test_emitted_script_parses_as_python_module` and
`test_emitted_script_prints_report` lock these contracts.

P3 (Tests): The new `__dir__()` discoverability surface
(`_OrderedName` ordering trick) has no contract test in this PR.
Added a TODO.md entry under "Tech Debt from Code Reviews" /
"Testing/Docs" deferring it to PR B (planned, addresses #461 with
the full snapshot/contract surface). The deferral is the documented
pattern the reviewer's "Concrete fix" suggested.

Tests: 18 → 21 in test_agent_workflow.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 27bb822f4d9a0adfb5185471f65c6b0edb7433bd


Overall Assessment

✅ Looks good — no unmitigated P0 or P1 findings in the changed surface.

Executive Summary

  • Re-review: the prior P1 is resolved. agent_workflow() no longer treats first_treat as uniquely implying CallawaySantAnna; Step 3 now explicitly says first_treat does not identify a single estimator and names ContinuousDiD / HeterogeneousAdoptionDiD alternatives, which matches the estimator contracts and registry notes at diff_diff/agent_workflow.py:L148-L176, diff_diff/continuous_did.py:L159-L169, diff_diff/had.py:L2782-L2841, docs/methodology/REGISTRY.md:L2423-L2424, docs/methodology/REGISTRY.md:L2572-L2574.
  • The prior script-contract issue is also resolved: the emitted workflow is now valid Python text and Step 5 prints the report, with regression tests at diff_diff/agent_workflow.py:L213-L242 and tests/test_agent_workflow.py:L203-L242.
  • No estimator formulas, weighting, variance/SE logic, identification checks, or default inference behavior change in this diff.
  • The new surface is hardened against hostile column labels via repr() templating and parser-safety tests at diff_diff/agent_workflow.py:L43-L61 and tests/test_agent_workflow.py:L291-L326.
  • The prior __dir__() test-gap is now explicitly tracked in TODO.md:L164-L164, so it is mitigated P3 rather than a blocker.

Methodology

  • No findings. The only methodology-relevant prior issue was the first_treat routing shortcut, and that is addressed by the revised Step 3 wording plus targeted regression tests at tests/test_agent_workflow.py:L145-L200.

Code Quality

  • No findings. The new orchestrator is simple string templating, does not inspect df, and the copy-pasteable-script contract is materially improved and tested at diff_diff/agent_workflow.py:L140-L145, diff_diff/agent_workflow.py:L213-L242, and tests/test_agent_workflow.py:L203-L242.

Performance

  • No findings. The added code is negligible string assembly plus a dir() ordering hook.

Maintainability

  • No unmitigated findings. The __dir__() ordering trick is unusual, but it is documented inline at diff_diff/__init__.py:L525-L589.

Tech Debt

  • Severity: P3-informational (mitigated). Impact: __dir__() head-order / inspect.getmembers contract coverage is still deferred, so regressions on that discoverability surface would not be caught by this PR’s tests alone. Concrete fix: land the planned tests/test_agent_discoverability.py follow-up already tracked in TODO.md:L164-L164. References: diff_diff/__init__.py:L525-L589, TODO.md:L164-L164.

Security

  • No findings. repr()-based kwarg rendering prevents code injection through column labels, and the adversarial-label tests cover the relevant parser-safety cases at diff_diff/agent_workflow.py:L43-L61 and tests/test_agent_workflow.py:L291-L326.

Documentation/Tests

  • Severity: P3-informational (mitigated). Impact: this PR changes diff_diff/guides/llms.txt and diff_diff/guides/llms-autonomous.txt, but .txt AI guides still sit outside dedicated snippet smoke validation in CI. That limitation predates this PR and is already tracked. Concrete fix: implement the existing .txt guide-validation follow-up tracked in TODO.md:L160-L160. References: diff_diff/guides/llms.txt:L13-L15, diff_diff/guides/llms-autonomous.txt:L13-L17, TODO.md:L160-L160.

Runtime note: I could not execute the package tests in this sandbox because importing diff_diff fails without numpy; this review is based on static inspection.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 18, 2026
igerber and others added 4 commits May 18, 2026 19:43
…lity

Closes #460 items 1, 2, 3:
- `__doc__` rewritten to lead with `agent_workflow(df, ...)` as the
  recommended starting call. Existing `get_llm_guide` mention preserved
  for the `test_module_docstring_mentions_helper` guard.
- New `diff_diff.agent_workflow(df, *, unit, time, treatment, outcome,
  first_treat=None, verbose=True)` — stateless orchestrator that prints
  a copy-pasteable 5-step workflow (profile_panel → get_llm_guide →
  <Estimator>.fit → practitioner_next_steps → BusinessReport) with the
  caller's column names templated in. Calls nothing internally; does not
  inspect df.
- New module-level `__dir__()` override paired with a `_OrderedName(str)`
  subclass that subverts CPython's unconditional alphabetic sort
  (PyList_Sort respects __lt__ on elements). Agent-facing names surface
  at the head of `dir(diff_diff)`; tail stays alphabetic via the
  `str.__lt__` fallback. `__all__` membership unchanged; `from diff_diff
  import *` semantics unaffected.

Tests: tests/test_agent_workflow.py (9 tests covering canonical keys,
script content, column templating, first_treat conditional, fit
candidate importability, verbose toggle, no-df-inspection contract).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+ tests)

Addresses local /ai-review-local findings (4 of 4):

P0 (Security): agent_workflow.py emitted column names via `f'{k}="{v}"'`,
which broke on labels containing `"` and could inject Python statements
into the "copy-pasteable" script. Replaced with `_safe_kwarg` /
`_join_kwargs` that use Python's `repr()` — produces source-safe string
literals for any str input (quotes, backslashes, newlines, etc.). Added
9 regression tests (test_emitted_calls_are_valid_python plus a
parametrize over 6 adversarial labels) that ast.parse() the emitted
script lines.

P1 (Contract break): Step 3 example previously emitted
`CallawaySantAnna(...).fit(df, ..., treatment=...)` but CS21's
`.fit()` signature is `(data, outcome, unit, time, first_treat, ...)`
— `treatment=` is rejected. The snippet would TypeError if anyone
copy-pasted. Step 3 now branches on `first_treat` presence:
- first_treat=None  -> DifferenceInDifferences with `treatment=`
- first_treat=<col> -> CallawaySantAnna with `first_treat=` (no treatment)
Each branch's call signature matches the actual public API. New test
`test_first_treat_switches_step3_estimator` locks the contract.

Also dropped PreTrendsPower / HonestDiD from `_WORKFLOW_PATTERNS` (their
`.fit()` takes `results`, not `df`+columns — pattern label "substitute
the candidate" was misleading for them); they now appear in a separate
"diagnostics" block in the templated script under Step 4.

P2 (Introspection regression): __dir__() returned only `__all__`
entries, which dropped module dunders (`__doc__`, `__name__`, `__file__`)
from `inspect.getmembers(diff_diff)` — contradicted the CHANGELOG
"compatible with inspect.getmembers" claim. __dir__() now returns
`[_OrderedName(n) for n in globals()]`, so the full module namespace
is enumerated while head-first ordering is preserved.

P2 (Test coverage): Added `test_emitted_calls_are_valid_python` (ast.parse
on emitted profile_call + fit example for both first_treat branches) and
`test_adversarial_column_labels_produce_valid_python` (parametrized over
quotes, backslashes, injection attempts, whitespace). Tests would have
caught both P0 and P1 if present.

Total tests: 9 → 17 in test_agent_workflow.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eat path

Codex R2 narrowed P1: the no-`first_treat` branch unconditionally emitted
`DifferenceInDifferences().fit(...)` and labeled it "matched to your data
shape". That overclaim doesn't hold for continuous-dose or heterogeneous-
adoption designs without first_treat — DiD validates and rejects non-
binary treatment / time at fit-time (`diff_diff/estimators.py:307-312`).

Fix in agent_workflow.py:
- With `first_treat`: keep CallawaySantAnna example, relabel as
  "your data has first_treat -> staggered structure; CS is canonical".
- Without `first_treat`: keep DiD example as the simple 2x2 case but
  reframe the label to explicitly condition on that shape, name the
  alternative candidates (ContinuousDiD, HeterogeneousAdoptionDiD),
  and reference DiD's fit-time validation so the agent knows when to
  switch. No claim of universal match.

New regression test `test_no_first_treat_step3_does_not_overclaim_match`:
- asserts "matched to your data shape" is NOT in the no-first_treat
  script (negative)
- asserts "2x2" and "substitute" appear (positive — substitution hint)
- asserts ContinuousDiD and HeterogeneousAdoptionDiD remain enumerated
  in Step 2's routing patterns

Total tests: 17 → 18 in test_agent_workflow.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… script executability + P3 deferral note)

CI codex review on PR #464 (initial push) flagged 3 items:

P1 (Methodology): The first_treat-branch Step 3 example implied
binary staggered -> CallawaySantAnna without naming alternatives.
But ContinuousDiD.fit() and HeterogeneousAdoptionDiD event-study
BOTH take a `first_treat` column (via `first_treat=` and
`first_treat_col=` respectively); the same overclaim pattern I
already caught and fixed for the no-first_treat path in R3 was
present mirror-image on the with-first_treat path. The Step 3
comment block now enumerates the three estimator options for
first_treat panels (CS21 / ContinuousDiD / HAD event-study) and
names the HAD signature distinction (first_treat_col vs first_treat)
so an agent isn't silently steered to CS21 on a continuous-dose
panel. New test `test_first_treat_step3_names_non_binary_alternatives`
locks this.

P2 (Code Quality): The emitted "script" was tested for parseability
only at the line-of-call granularity, not at the whole-file level.
Section headers like "Step 1 - Describe the panel:" and footer
lines "Full reference: ..." would SyntaxError if the agent dumped
the script to a .py file and ran it. Step 5 also evaluated
`BusinessReport(...).full_report()` without printing — the report
str would be discarded on execution. Fix: every prose line in the
template now starts with `#` (valid Python comment); code lines
stand at column 0 and run as-is; Step 5 wraps the report call in
`print()`. The full script `ast.parse()`s as a Python module under
both first_treat branches. New tests
`test_emitted_script_parses_as_python_module` and
`test_emitted_script_prints_report` lock these contracts.

P3 (Tests): The new `__dir__()` discoverability surface
(`_OrderedName` ordering trick) has no contract test in this PR.
Added a TODO.md entry under "Tech Debt from Code Reviews" /
"Testing/Docs" deferring it to PR B (planned, addresses #461 with
the full snapshot/contract surface). The deferral is the documented
pattern the reviewer's "Concrete fix" suggested.

Tests: 18 → 21 in test_agent_workflow.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the agent-workflow-discoverability branch from 27bb822 to e1d0f33 Compare May 18, 2026 23:43
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: e1d0f33e3cf4fa9f9e76c8ceac73a46c13327d44


Overall Assessment

⚠️ Needs changes — one unmitigated P1 remains in the new agent_workflow() surface.

Executive Summary

  • [Newly identified] agent_workflow() still hardcodes the symbol df into every emitted call/script, so the advertised “copy-pasteable” workflow fails with NameError whenever the caller’s dataframe is bound under any other local name. References: diff_diff/agent_workflow.py:L74-L85, diff_diff/agent_workflow.py:L142-L145, diff_diff/agent_workflow.py:L162-L181, diff_diff/agent_workflow.py:L217-L239, tests/test_agent_workflow.py:L203-L288.
  • Re-review: the prior methodology-routing issue is resolved. Step 3 now says first_treat does not uniquely identify a single estimator and explicitly names ContinuousDiD / HeterogeneousAdoptionDiD alternatives, which matches the actual public fit signatures and registry notes. References: diff_diff/agent_workflow.py:L148-L190, diff_diff/staggered.py:L1395-L1406, diff_diff/continuous_did.py:L159-L169, diff_diff/had.py:L2782-L2803, docs/methodology/REGISTRY.md:L2424-L2432, docs/methodology/REGISTRY.md:L2580-L2584.
  • No estimator math, weighting, variance/SE logic, identification checks, or default inference behavior change in this diff.
  • The custom __dir__() discoverability surface still lacks direct contract coverage in this PR, but that gap is explicitly tracked in TODO.md, so it is mitigated P3. References: diff_diff/__init__.py:L525-L589, TODO.md:L165-L165.
  • The .txt guide-validation gap also remains tracked rather than introduced here, so it is mitigated P3. References: diff_diff/guides/llms.txt:L13-L15, diff_diff/guides/llms-autonomous.txt:L13-L18, TODO.md:L161-L161.

Methodology

  • No findings. Re-review check passed: the changed agent_workflow() guidance no longer over-claims that first_treat uniquely implies Callaway-Sant’Anna, and the emitted wording is consistent with the documented CS / ContinuousDiD / HAD contracts. References: diff_diff/agent_workflow.py:L148-L190, diff_diff/staggered.py:L1395-L1406, diff_diff/continuous_did.py:L159-L169, diff_diff/had.py:L2782-L2803, docs/methodology/REGISTRY.md:L2424-L2432, docs/methodology/REGISTRY.md:L2580-L2584.

Code Quality

  • Severity: P1 [Newly identified]. Impact: agent_workflow() promises a “copy-pasteable” workflow and says df is present so the caller can “pass the same handle along,” but the implementation hardcodes df into profile_call, the Step 3 fit example, and the final script. A caller who does panel = ...; diff_diff.agent_workflow(panel, ...) receives output that will fail immediately when followed verbatim. The new tests only assert parseability and use df-named fixtures, so this contract break is not exercised. Concrete fix: add an explicit dataframe-symbol kwarg such as df_name="df" and template every emitted code line from it, then add a regression test that validates/executed output in a namespace where panel exists and df does not. References: diff_diff/agent_workflow.py:L74-L85, diff_diff/agent_workflow.py:L142-L145, diff_diff/agent_workflow.py:L162-L181, diff_diff/agent_workflow.py:L217-L239, CHANGELOG.md:L11-L13, diff_diff/__init__.py:L1-L4, tests/test_agent_workflow.py:L203-L288.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3-informational (mitigated). Impact: the _OrderedName / __dir__() ordering trick is subtle, and this PR still defers direct contract coverage for head order, full membership, and inspect.getmembers() parity. Concrete fix: land the already-tracked tests/test_agent_discoverability.py follow-up. References: diff_diff/__init__.py:L525-L589, TODO.md:L165-L165.

Security

  • No findings.

Documentation/Tests

  • Severity: P3-informational (mitigated). Impact: the new quickstart lines added to diff_diff/guides/llms.txt and diff_diff/guides/llms-autonomous.txt still sit outside dedicated snippet smoke validation in CI, so guide regressions on this surface would not be caught automatically. Concrete fix: implement the already-tracked .txt guide-validation follow-up. References: diff_diff/guides/llms.txt:L13-L15, diff_diff/guides/llms-autonomous.txt:L13-L18, TODO.md:L161-L161.

Path to Approval

  1. Add a caller-specified dataframe symbol to agent_workflow() (for example df_name="df") and use it consistently in profile_call, the emitted fit example, the returned call strings, and the final script so the generated workflow is actually runnable outside the special case where the local variable is named df.
  2. Add regression coverage for that contract: one test should assert the generated script uses the requested symbol, and one test should execute the emitted script in a namespace where panel exists and df does not.

Runtime Note

  • Static review only. I could not execute the package tests in this sandbox because numpy is unavailable and pytest is not installed.

@igerber igerber removed the ready-for-ci Triggers CI test workflows label May 18, 2026
CI codex review on the rebased SHA flagged a newly-identified P1:
agent_workflow() hardcoded the dataframe symbol `df` into every emitted
call. A caller who does `panel = pd.read_parquet(...)` and then calls
diff_diff.agent_workflow(panel, ...) gets back a script that references
`df` and NameErrors on first execution.

Fix: add `df_name: str = "df"` parameter, threaded into:
- profile_call (`diff_diff.profile_panel({df_name}, ...)`)
- both Step 3 fit example branches (CallawaySantAnna + DifferenceInDifferences)

Default `df_name="df"` preserves prior behavior verbatim. Caller binds
their dataframe to any identifier and passes the name; emitted script
runs in their namespace without renaming.

Tests added:
- `test_df_name_templates_into_script`: default vs `df_name="panel"`,
  static AST scan confirms no bare `df` Name node remains.
- `test_df_name_panel_script_executes_in_panel_namespace`: stubs
  `diff_diff` with MagicMock, exec's the panel-named script in a
  namespace where `panel` exists and `df` does not — NameError on
  the dataframe symbol would fail the test.

Total tests: 21 → 23 in test_agent_workflow.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 3968f0c232c96949e850add575498b29b197e0dd


Overall Assessment

✅ Looks good — no unmitigated P0 or P1 findings remain in this re-review.

Executive Summary

  • Re-review: the prior P1 is resolved. agent_workflow() now exposes df_name, templates it into the emitted calls, and has regression tests that parse and execute the script in a namespace containing panel but not df (diff_diff/agent_workflow.py:L64-L114, diff_diff/agent_workflow.py:L155-L196, tests/test_agent_workflow.py:L245-L311).
  • No estimator math, weighting, variance/SE, identification checks, or default inference behavior changes in this diff.
  • The methodology-facing Step 3 guidance is now consistent with the public fit signatures and registry notes for CallawaySantAnna, ContinuousDiD, and HeterogeneousAdoptionDiD (diff_diff/agent_workflow.py:L161-L205, diff_diff/staggered.py:L1395-L1406, diff_diff/continuous_did.py:L159-L169, diff_diff/had.py:L2782-L2803, docs/methodology/REGISTRY.md:L2424-L2433, docs/methodology/REGISTRY.md:L2579-L2585).
  • Remaining gaps are informational only: direct __dir__() contract coverage is explicitly deferred in TODO.md, and .txt AI-guide smoke coverage remains a previously tracked CI gap (diff_diff/__init__.py:L525-L589, TODO.md:L161-L165, diff_diff/guides/llms.txt:L13-L15, diff_diff/guides/llms-autonomous.txt:L13-L17).

Methodology

No findings. Affected methods: none in the estimation/inference sense. The only methodology-adjacent change is estimator-selection prose in agent_workflow(), and that prose now correctly distinguishes first_treat= on CallawaySantAnna / ContinuousDiD from first_treat_col= on HeterogeneousAdoptionDiD and no longer over-claims that first_treat uniquely implies a single estimator (diff_diff/agent_workflow.py:L161-L205, diff_diff/staggered.py:L1395-L1406, diff_diff/continuous_did.py:L159-L169, diff_diff/had.py:L2782-L2803, docs/methodology/REGISTRY.md:L2424-L2433, docs/methodology/REGISTRY.md:L2579-L2585).

Code Quality

No findings. Re-review check passed: the previous hardcoded-df contract break is fixed in both implementation and tests (diff_diff/agent_workflow.py:L64-L114, diff_diff/agent_workflow.py:L155-L196, tests/test_agent_workflow.py:L245-L311).

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity: P3-informational (mitigated). Impact: the _OrderedName / __dir__() discoverability surface is subtle and still lacks direct contract coverage for head order, membership, and inspect.getmembers() parity. Concrete fix: land the already-tracked tests/test_agent_discoverability.py follow-up. References: diff_diff/__init__.py:L525-L589, TODO.md:L165-L165.

Security

No findings in the changed code.

Documentation/Tests

  • Severity: P3-informational (mitigated). Impact: the new quickstart lines added to the .txt AI guides are still outside dedicated snippet-smoke validation in CI, so regressions there would not be caught automatically. Concrete fix: extend the existing doc-snippet smoke coverage to the .txt guide files. References: diff_diff/guides/llms.txt:L13-L15, diff_diff/guides/llms-autonomous.txt:L13-L17, TODO.md:L161-L161.

Static review only. I did not execute the test suite in this sandbox because pytest is not installed.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 19, 2026
@igerber igerber merged commit 4646f4d into main May 19, 2026
33 of 34 checks passed
@igerber igerber deleted the agent-workflow-discoverability branch May 19, 2026 01:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Surface get_llm_guide / profile_panel / practitioner_next_steps more prominently for agent discovery

1 participant