📖 Deep Agents Code (dcode) documentation: the coding agent this evaluator is designed for.
A LangSmith offline evaluator that analyzes Python code generated by coding agents and scores it against a suite of best-practice rules. Built to work with Deep Agents Code (dcode), LangChain's open source terminal coding agent.
curl -LsSf https://langch.in/dcode | bash
Or install with uv directly:
uv tool install deepagents-code
Sign up at smith.langchain.com and grab an API key. Then add it to ~/.deepagents/.env so it's available in every session:
LANGSMITH_TRACING=true
LANGSMITH_API_KEY=lsv2_...
LANGSMITH_PROJECT=my-coding-agent
Or export the variables in your shell:
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="your-key"
export LANGSMITH_PROJECT="my-coding-agent"
dcode
With tracing enabled, every agent run (tool calls, decisions, file edits) is recorded in your LangSmith project automatically. You can view traces in the LangSmith UI or use /trace inside dcode to open the current thread directly.
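Before launching, it can help to confirm that the tracing variables are actually visible to the process. The snippet below is a minimal sketch, not part of dcode or this repo: `tracing_problems` is a hypothetical helper, and it only inspects the mapping you pass in (for example `os.environ`), not the contents of ~/.deepagents/.env.

```python
import os
from typing import Mapping

# Variable names taken from the configuration shown above.
EXPECTED_VARS = ("LANGSMITH_TRACING", "LANGSMITH_API_KEY", "LANGSMITH_PROJECT")


def tracing_problems(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems found in the tracing configuration.

    Hypothetical helper for illustration only.
    """
    problems = [f"{name} is not set" for name in EXPECTED_VARS if not env.get(name)]
    tracing = env.get("LANGSMITH_TRACING")
    if tracing and tracing.lower() != "true":
        problems.append("LANGSMITH_TRACING is set but is not 'true'")
    return problems


# Check the current shell environment and report anything missing.
for problem in tracing_problems(os.environ):
    print(problem)
```

An empty result means all three variables are present in the environment that was checked.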
On first launch, use the /auth command inside dcode to configure an API key for your chosen LLM provider (OpenAI, Anthropic, Google, etc.). See Provider credentials for details.
Coding agents produce code non-deterministically. The same prompt can yield clean, well-structured code one run and spaghetti the next. This project gives you a quantitative feedback loop: trace your agent, score its output, and use the results to improve it.
- Trace your agent. Enable LangSmith tracing in dcode as described above. Every agent run is recorded automatically.
- Build a dataset from traces. In the LangSmith UI, go to your tracing project and set up a Rule to automatically add runs to a dataset. For example, create a rule that adds every completed root run (or only runs matching certain tags/metadata) to a dataset called coding-agent-traces. This keeps your evaluation dataset growing as you use your agent, with no manual curation needed.
- Run this evaluator. Point it at your dataset and get per-file, per-rule scores for every trace. The scores tell you exactly what your agent gets wrong consistently (missing docstrings? deeply nested logic? overly long functions?).
- Improve your agent. Use the patterns you find to:
  - Write skills: give your agent reusable instructions for the rules it fails most (e.g., a skill that enforces docstrings on public APIs). See dcode skills.
  - Tune your system prompt: if the agent consistently produces magic numbers or bare excepts, add explicit guidance.
  - Compare experiments: after making a change, re-run the eval on the same dataset and compare scores side-by-side in LangSmith.
  - Track regressions: run the eval in CI to catch quality drops before they ship.
The goal is a tight loop: trace → score → fix → re-score.
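The re-score step boils down to comparing per-rule scores from two runs. The following is a minimal sketch under stated assumptions: `find_regressions` and its `tolerance` parameter are hypothetical and not part of this repo, and the score values are made-up illustrative numbers, not real results.

```python
def find_regressions(baseline, candidate, tolerance=0.05):
    """Return {rule: (old, new)} for rules whose score dropped by more than tolerance.

    Hypothetical helper: both arguments map rule names to average scores in [0, 1].
    """
    regressions = {}
    for rule, old in baseline.items():
        new = candidate.get(rule)
        if new is not None and old - new > tolerance:
            regressions[rule] = (old, new)
    return regressions


# Made-up averages for two experiments over the same dataset.
baseline = {"has_docstrings": 0.40, "no_bare_except": 1.00, "no_magic_numbers": 0.55}
candidate = {"has_docstrings": 0.85, "no_bare_except": 0.90, "no_magic_numbers": 0.55}

regressions = find_regressions(baseline, candidate)
print(regressions)
```

In a CI job, a non-empty result could be turned into a failing exit code so that a prompt or skill change that improves one rule but degrades another is caught early.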
The evaluator processes real coding-agent traces — not simple text outputs, but full multi-turn threads where the agent calls tools like write_file, edit_file, and read_file. It:
- Reconstructs files from the agent's message history by replaying tool calls in order (handling writes, patches, reads, and failed attempts).
- Parses each Python file with the AST and runs 12 weighted checks.
- Returns per-file and aggregate scores plus a no_code_yet flag for early turns that haven't produced code.
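The file-reconstruction step can be pictured as a replay over the tool calls. This is a simplified sketch, not the logic in evals/evaluator.py: it assumes each call has already been normalized to a dict with `name` and `args`, and the `ok` and `result` keys are hypothetical. Replacing only the first occurrence of `old_string`, and using a read to seed a file the agent never wrote, are also assumptions.

```python
def replay_tool_calls(calls):
    """Rebuild {file_path: content} by applying tool calls in order."""
    files = {}
    for call in calls:
        if not call.get("ok", True):
            continue  # skip attempts the tool reported as failed
        name, args = call["name"], call["args"]
        path = args["file_path"]
        if name == "write_file":
            # A write replaces the whole file.
            files[path] = args["content"]
        elif name == "edit_file":
            # A patch only applies if the file is known and old_string is present.
            current = files.get(path)
            if current is not None and args["old_string"] in current:
                files[path] = current.replace(args["old_string"], args["new_string"], 1)
        elif name == "read_file" and "result" in call:
            # A read reveals content for a file the agent did not write itself.
            files.setdefault(path, call["result"])
    return files


calls = [
    {"name": "write_file",
     "args": {"file_path": "main.py", "content": "def add(a, b):\n    return a - b\n"}},
    {"name": "edit_file", "ok": False,
     "args": {"file_path": "main.py", "old_string": "a * b", "new_string": "a + b"}},
    {"name": "edit_file",
     "args": {"file_path": "main.py", "old_string": "a - b", "new_string": "a + b"}},
    {"name": "read_file", "args": {"file_path": "README.md"}, "result": "# demo\n"},
]
files = replay_tool_calls(calls)
print(files["main.py"])
```

Replaying in order matters: the final content of main.py reflects the successful edit, while the failed attempt leaves no trace.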
| Rule | Weight | What it catches |
|---|---|---|
| no_triple_nested_loops | 3 | ≥3 levels of nested for/while loops |
| no_bare_except | 3 | Bare except: swallowing all errors |
| no_mutable_defaults | 3 | def foo(x=[]) mutable default args |
| line_length_ok | 2 | Lines exceeding 120 characters |
| function_length_ok | 2 | Functions longer than 50 lines |
| param_count_ok | 2 | Functions with more than 5 parameters |
| no_wildcard_imports | 2 | from X import * |
| no_god_classes | 2 | Classes with more than 10 methods |
| nesting_depth_ok | 2 | Control flow nested more than 4 levels deep |
| has_docstrings | 1 | Public functions/classes missing docstrings |
| variable_names_ok | 1 | Variables shorter than 3 characters |
| no_magic_numbers | 1 | Hardcoded numeric literals |
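To give a feel for how AST-based rules like these work, here is a minimal sketch of two of them using Python's standard ast module. The function names are hypothetical and the 1.0/0.0 scoring is an assumption; the real checks in evals/evaluator.py may differ in detail.

```python
import ast


def has_bare_except(tree):
    """True if any handler is a bare `except:` (no exception type given)."""
    return any(
        isinstance(node, ast.ExceptHandler) and node.type is None
        for node in ast.walk(tree)
    )


def has_mutable_defaults(tree):
    """True if any function uses a list, dict, or set literal as a default value."""
    mutable = (ast.List, ast.Dict, ast.Set)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # kw_defaults holds None for keyword-only args without a default.
            defaults = node.args.defaults + [d for d in node.args.kw_defaults if d is not None]
            if any(isinstance(d, mutable) for d in defaults):
                return True
    return False


source = '''
def load(path, cache=[]):
    try:
        return open(path).read()
    except:
        return None
'''
tree = ast.parse(source)

# Assumed scoring: 1.0 when the file passes a rule, 0.0 when it violates it.
scores = {
    "no_bare_except": 0.0 if has_bare_except(tree) else 1.0,
    "no_mutable_defaults": 0.0 if has_mutable_defaults(tree) else 1.0,
}
print(scores)
```

Because the checks walk the parsed tree rather than matching text, a string containing "except:" or a comment mentioning `x=[]` does not trigger them.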
├── evals/
│ └── evaluator.py # Core evaluator logic (perform_eval + file extraction)
├── scripts/
│ ├── run_eval.py # Run the evaluator locally via the LangSmith SDK
│ ├── upload_evaluator.py # Upload/update the evaluator in LangSmith
│ ├── upload_dataset.py # Upload example_dataset.json to LangSmith
│ └── example_setup.py # One-command setup (dataset + evaluator)
├── example_dataset.json # Example dataset of coding-agent traces
└── README.md
uv sync
export LANGSMITH_API_KEY="your-key-here"
This uploads the dataset and evaluator to your LangSmith workspace:
uv run python scripts/example_setup.py
That's it. You can also run the individual steps separately:
uv run python scripts/upload_dataset.py # Upload the dataset only
uv run python scripts/upload_evaluator.py # Upload the evaluator only
uv run python scripts/run_eval.py
This runs the evaluator against the default dataset (dcode-dataset) and prints results to the console. A link to the experiment in LangSmith is also printed.
To target a different dataset:
uv run python scripts/run_eval.py --dataset-name "my-other-dataset"
Each dataset example is one turn of a coding-agent thread. Early turns (planning, dependency installation) produce no code; these get no_code_yet: 1.0 and no quality scores. Later turns where the agent has written Python files get the full rule breakdown.
Example output:
Example db6bca17... (39 msgs, files: ['main.py', 'test_main.py'])
no_code_yet: 0.0
code_quality_overall: 0.917
files_found: 2.0
no_triple_nested_loops: 1.0
no_bare_except: 1.0
has_docstrings: 0.0 ← agent didn't add docstrings
no_magic_numbers: 0.0 ← hardcoded numbers in test assertions
...
Example 15b158d9... (10 msgs, files: [])
no_code_yet: 1.0 ← agent was still setting up
files_found: 0.0
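The 0.917 in the first example is consistent with a weight-normalized average of the rule scores. The sketch below assumes that formula, and assumes that every rule elided by `...` in the output scored 1.0; neither is confirmed by the output itself. The weights are copied from the rules table, and `overall_score` is a hypothetical name.

```python
# Weights copied from the rules table.
WEIGHTS = {
    "no_triple_nested_loops": 3,
    "no_bare_except": 3,
    "no_mutable_defaults": 3,
    "line_length_ok": 2,
    "function_length_ok": 2,
    "param_count_ok": 2,
    "no_wildcard_imports": 2,
    "no_god_classes": 2,
    "nesting_depth_ok": 2,
    "has_docstrings": 1,
    "variable_names_ok": 1,
    "no_magic_numbers": 1,
}


def overall_score(rule_scores):
    """Weight-normalized mean of per-rule scores (assumed aggregation formula)."""
    total_weight = sum(WEIGHTS[rule] for rule in rule_scores)
    weighted_sum = sum(WEIGHTS[rule] * score for rule, score in rule_scores.items())
    return weighted_sum / total_weight


# Assumption: all rules pass except the two shown failing in the example.
scores = dict.fromkeys(WEIGHTS, 1.0)
scores["has_docstrings"] = 0.0
scores["no_magic_numbers"] = 0.0

print(round(overall_score(scores), 3))  # 22 / 24 -> 0.917
```

Under this weighting, failing two weight-1 rules costs 2 of 24 points, whereas a single bare except (weight 3) would cost more than both combined.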
The dataset is built from real agent traces. To add more:
- Run your coding agent with LangSmith tracing enabled.
- In the LangSmith UI, add the trace runs to a dataset.
- Export the dataset or point run_eval.py at it directly.
The evaluator works with any coding agent that uses write_file / edit_file / read_file tool calls with a partial_json or input field containing file_path and content/old_string/new_string.
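As an illustration of that contract, the sketch below pulls the arguments out of a single tool-call block. It assumes `input` holds an already-parsed dict and `partial_json` holds a JSON string (possibly truncated if streaming was cut off); the block shape and the `tool_call_args` name are hypothetical, not the evaluator's actual code.

```python
import json


def tool_call_args(block):
    """Return the argument dict of a tool-call block, or None if it cannot be recovered."""
    args = block.get("input")
    if isinstance(args, dict) and args:
        return args
    raw = block.get("partial_json")
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            return None  # the JSON string is incomplete or malformed
        return parsed if isinstance(parsed, dict) else None
    return None


write_call = {
    "name": "write_file",
    "input": {"file_path": "app.py", "content": "print('hi')\n"},
}
edit_call = {
    "name": "edit_file",
    "partial_json": '{"file_path": "app.py", "old_string": "hi", "new_string": "hello"}',
}
truncated_call = {"name": "write_file", "partial_json": '{"file_path": "app.p'}

print(tool_call_args(write_call)["file_path"])
print(tool_call_args(edit_call)["new_string"])
print(tool_call_args(truncated_call))
```

Returning None for unparseable calls lets a caller skip them instead of failing the whole trace.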