Code Best Practices Eval

📖 Deep Agents Code (dcode) documentation — the coding agent this evaluator is designed for.

A LangSmith offline evaluator that analyzes Python code generated by coding agents and scores it against a suite of best-practice rules. Built to work with Deep Agents Code (dcode), LangChain's open-source terminal coding agent.

Getting started with `dcode`

1. Install `dcode`

curl -LsSf https://langch.in/dcode | bash

Or install with uv directly:

uv tool install deepagents-code

2. Set up your LangSmith API key

Sign up at smith.langchain.com and grab an API key. Then add it to ~/.deepagents/.env so it's available in every session:

LANGSMITH_TRACING=true
LANGSMITH_API_KEY=lsv2_...
LANGSMITH_PROJECT=my-coding-agent

Or export the variables in your shell:

export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="your-key"
export LANGSMITH_PROJECT="my-coding-agent"
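
If tracing does not seem to start, a quick sanity check on the settings can help. The sketch below is illustrative and not part of dcode: `parse_env` and `missing_settings` are hypothetical helpers, and treating all three variables as required is an assumption made for the example.

```python
# Hypothetical helper: check that the LangSmith settings shown above are present.
REQUIRED = ("LANGSMITH_TRACING", "LANGSMITH_API_KEY", "LANGSMITH_PROJECT")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, tolerating comments and a leading `export`."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def missing_settings(values: dict[str, str]) -> list[str]:
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED if not values.get(name)]


sample = 'LANGSMITH_TRACING=true\nexport LANGSMITH_API_KEY="your-key"\n'
missing = missing_settings(parse_env(sample))
print(missing)  # ['LANGSMITH_PROJECT']
```

To check a real file, pass its contents to `parse_env`, for example the text of ~/.deepagents/.env.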

3. Launch `dcode` and start tracing

dcode

With tracing enabled, every agent run — tool calls, decisions, file edits — is recorded in your LangSmith project automatically. You can view traces in the LangSmith UI or use /trace inside dcode to open the current thread directly.

4. Set up a model provider

On first launch, use the /auth command inside dcode to configure an API key for your chosen LLM provider (OpenAI, Anthropic, Google, etc.). See Provider credentials for details.

Why this evaluator exists

Coding agents produce code non-deterministically. The same prompt can yield clean, well-structured code one run and spaghetti the next. This project gives you a quantitative feedback loop: trace your agent, score its output, and use the results to improve it.

The workflow

1. Trace your agent. Enable LangSmith tracing in dcode as described above. Every agent run is recorded automatically.
2. Build a dataset from traces. In the LangSmith UI, go to your tracing project and set up a Rule to automatically add runs to a dataset. For example, create a rule that adds every completed root run (or only runs matching certain tags/metadata) to a dataset called coding-agent-traces. This keeps your evaluation dataset growing as you use your agent, with no manual curation needed.
3. Run this evaluator. Point it at your dataset and get per-file, per-rule scores for every trace. The scores show which rules your agent fails consistently (missing docstrings? deeply nested logic? overly long functions?).
4. Improve your agent. Use the patterns you find to:
   - Write skills: give your agent reusable instructions for the rules it fails most (e.g., a skill that enforces docstrings on public APIs). See dcode skills.
   - Tune your system prompt: if the agent consistently produces magic numbers or bare excepts, add explicit guidance.
   - Compare experiments: after making a change, re-run the eval on the same dataset and compare scores side-by-side in LangSmith.
   - Track regressions: run the eval in CI to catch quality drops before they ship.

The goal is a tight loop: trace → score → fix → re-score.
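
The re-score step can be turned into a simple regression gate. The following is a minimal sketch, not something shipped with this project: `find_regressions` is a hypothetical helper, the 0.02 tolerance is arbitrary, and the per-rule averages are made-up numbers used only to show the shape of the comparison.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Map each rule whose score dropped by more than `tolerance` to the size of the drop."""
    drops: dict[str, float] = {}
    for rule, old_score in baseline.items():
        new_score = candidate.get(rule, 0.0)  # a missing rule counts as a full drop
        if old_score - new_score > tolerance:
            drops[rule] = round(old_score - new_score, 3)
    return drops


# Illustrative (invented) per-rule averages from two experiment runs.
baseline = {"has_docstrings": 0.40, "no_bare_except": 1.00, "no_magic_numbers": 0.55}
candidate = {"has_docstrings": 0.85, "no_bare_except": 0.90, "no_magic_numbers": 0.55}

regressions = find_regressions(baseline, candidate)
print(regressions)  # {'no_bare_except': 0.1}
```

In CI, a non-empty result could be used to fail the job, so an improvement in one rule (docstrings here) does not hide a drop in another.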

What it does

The evaluator processes real coding-agent traces — not simple text outputs, but full multi-turn threads where the agent calls tools like write_file, edit_file, and read_file. It:

1. Reconstructs files from the agent's message history by replaying tool calls in order (handling writes, patches, reads, and failed attempts).
2. Parses each Python file with the AST and runs 12 weighted checks.
3. Returns per-file and aggregate scores plus a no_code_yet flag for early turns that haven't produced code.
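
The reconstruction step can be pictured as a small replay loop. The sketch below is illustrative only and is not the code in evals/evaluator.py. It assumes the tool calls have already been normalized into dicts with the hypothetical keys `name`, `args`, `ok`, and `result`, and that an edit replaces the first occurrence of `old_string`.

```python
from typing import Any


def replay_tool_calls(calls: list[dict[str, Any]]) -> dict[str, str]:
    """Rebuild final file contents by applying tool calls in order."""
    files: dict[str, str] = {}
    for call in calls:
        if not call.get("ok", True):
            continue  # skip attempts the tool reported as failed
        args = call.get("args", {})
        path = args.get("file_path")
        if path is None:
            continue
        if call["name"] == "write_file":
            files[path] = args["content"]  # a write replaces the whole file
        elif call["name"] == "edit_file":
            current = files.get(path)
            old = args["old_string"]
            if current is not None and old in current:
                files[path] = current.replace(old, args["new_string"], 1)
        elif call["name"] == "read_file":
            # Assumption: a read result seeds a file the agent never wrote itself.
            if path not in files and "result" in call:
                files[path] = call["result"]
    return files


calls = [
    {"name": "write_file", "args": {"file_path": "main.py", "content": "x = 1\n"}},
    {"name": "edit_file", "args": {"file_path": "main.py", "old_string": "1", "new_string": "2"}},
    {"name": "edit_file", "ok": False,
     "args": {"file_path": "main.py", "old_string": "2", "new_string": "3"}},
    {"name": "read_file", "args": {"file_path": "util.py"}, "result": "y = 0\n"},
]
files = replay_tool_calls(calls)
print(files)  # {'main.py': 'x = 2\n', 'util.py': 'y = 0\n'}
```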

Rules checked

| Rule | Weight | What it catches |
| --- | --- | --- |
| `no_triple_nested_loops` | 3 | ≥3 levels of nested for/while loops |
| `no_bare_except` | 3 | Bare `except:` swallowing all errors |
| `no_mutable_defaults` | 3 | `def foo(x=[])` mutable default args |
| `line_length_ok` | 2 | Lines exceeding 120 characters |
| `function_length_ok` | 2 | Functions longer than 50 lines |
| `param_count_ok` | 2 | Functions with more than 5 parameters |
| `no_wildcard_imports` | 2 | `from X import *` |
| `no_god_classes` | 2 | Classes with more than 10 methods |
| `nesting_depth_ok` | 2 | Control flow nested more than 4 levels deep |
| `has_docstrings` | 1 | Public functions/classes missing docstrings |
| `variable_names_ok` | 1 | Variables shorter than 3 characters |
| `no_magic_numbers` | 1 | Hardcoded numeric literals |
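
To give a feel for what an AST-based check looks like, here is a minimal sketch of two of the rules using Python's standard `ast` module. The function names mirror the rule names, but the bodies are an assumption about how such checks could be written, not the implementation in evals/evaluator.py.

```python
import ast


def no_bare_except(tree: ast.AST) -> bool:
    """True when every `except` handler names an exception type."""
    return not any(
        isinstance(node, ast.ExceptHandler) and node.type is None
        for node in ast.walk(tree)
    )


def no_mutable_defaults(tree: ast.AST) -> bool:
    """True when no function uses a list, dict, or set literal as a default."""
    mutable = (ast.List, ast.Dict, ast.Set)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # kw_defaults holds None for keyword-only args without a default.
            defaults = list(node.args.defaults) + [
                d for d in node.args.kw_defaults if d is not None
            ]
            if any(isinstance(default, mutable) for default in defaults):
                return False
    return True


source = '''
def load(items=[]):
    try:
        return items[0]
    except:
        return None
'''
tree = ast.parse(source)
results = {
    "no_bare_except": no_bare_except(tree),
    "no_mutable_defaults": no_mutable_defaults(tree),
}
print(results)  # {'no_bare_except': False, 'no_mutable_defaults': False}
```

This version only catches literal defaults such as `[]` or `{}`; calls like `list()` would need an extra check.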

Project structure

├── evals/
│   └── evaluator.py          # Core evaluator logic (perform_eval + file extraction)
├── scripts/
│   ├── run_eval.py           # Run the evaluator locally via the LangSmith SDK
│   ├── upload_evaluator.py   # Upload/update the evaluator in LangSmith
│   ├── upload_dataset.py     # Upload example_dataset.json to LangSmith
│   └── example_setup.py      # One-command setup (dataset + evaluator)
├── example_dataset.json      # Example dataset of coding-agent traces
└── README.md

Setup

Prerequisites

- uv installed
- A LangSmith account and API key

1. Install dependencies

uv sync

2. Set your API key

export LANGSMITH_API_KEY="your-key-here"

3. Run setup

This uploads the dataset and evaluator to your LangSmith workspace:

uv run python scripts/example_setup.py

That's it. You can also run the individual steps separately:

uv run python scripts/upload_dataset.py     # Upload the dataset only
uv run python scripts/upload_evaluator.py   # Upload the evaluator only

Running the evaluation

uv run python scripts/run_eval.py

This runs the evaluator against the default dataset (dcode-dataset) and prints results to the console. A link to the experiment in LangSmith is also printed.

To target a different dataset:

uv run python scripts/run_eval.py --dataset-name "my-other-dataset"

Understanding the results

Each dataset example is one turn of a coding-agent thread. Early turns (planning, dependency installation) produce no code — these get no_code_yet: 1.0 and no quality scores. Later turns where the agent has written Python files get the full rule breakdown.

Example output:

Example db6bca17... (39 msgs, files: ['main.py', 'test_main.py'])
  no_code_yet: 0.0
  code_quality_overall: 0.917
  files_found: 2.0
  no_triple_nested_loops: 1.0
  no_bare_except: 1.0
  has_docstrings: 0.0         ← agent didn't add docstrings
  no_magic_numbers: 0.0       ← hardcoded numbers in test assertions
  ...

Example 15b158d9... (10 msgs, files: [])
  no_code_yet: 1.0            ← agent was still setting up
  files_found: 0.0
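
One way to read code_quality_overall is as a weight-normalized pass rate over the 12 rules. That aggregation is an assumption, since the README does not spell it out, but it matches the first example if every rule hidden behind the `...` passed: failing only the two weight-1 rules leaves 22 of 24 weight points, or about 0.917. A minimal sketch:

```python
# Weights copied from the "Rules checked" table.
WEIGHTS = {
    "no_triple_nested_loops": 3,
    "no_bare_except": 3,
    "no_mutable_defaults": 3,
    "line_length_ok": 2,
    "function_length_ok": 2,
    "param_count_ok": 2,
    "no_wildcard_imports": 2,
    "no_god_classes": 2,
    "nesting_depth_ok": 2,
    "has_docstrings": 1,
    "variable_names_ok": 1,
    "no_magic_numbers": 1,
}


def weighted_score(rule_scores: dict[str, float]) -> float:
    """Assumed aggregation: sum(weight * score) / sum(weight)."""
    total_weight = sum(WEIGHTS.values())
    earned = sum(weight * rule_scores.get(rule, 0.0) for rule, weight in WEIGHTS.items())
    return earned / total_weight


# Assumption: every rule not shown in the example output passed.
rule_scores = {rule: 1.0 for rule in WEIGHTS}
rule_scores["has_docstrings"] = 0.0
rule_scores["no_magic_numbers"] = 0.0

overall = round(weighted_score(rule_scores), 3)
print(overall)  # 0.917
```

Under this reading, a failure on a weight-3 rule such as `no_bare_except` costs three times as much as a missing docstring.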

Adding your own traces

The dataset is built from real agent traces. To add more:

1. Run your coding agent with LangSmith tracing enabled.
2. In the LangSmith UI, add the trace runs to a dataset.
3. Export the dataset or point run_eval.py at it directly.

The evaluator works with any coding agent that uses write_file / edit_file / read_file tool calls with a partial_json or input field containing file_path and content/old_string/new_string.
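
If you are adapting another agent's traces, the main job is getting the arguments out of each tool call. The sketch below shows one way to read them from either field. The surrounding block shape (`type`, `name`) and the `tool_call_args` helper are hypothetical; only the `input` and `partial_json` field names and the argument keys come from the description above.

```python
import json
from typing import Any


def tool_call_args(block: dict[str, Any]) -> dict[str, Any]:
    """Return tool arguments from `input` (a dict) or `partial_json` (a JSON string)."""
    raw_input = block.get("input")
    if isinstance(raw_input, dict) and raw_input:
        return raw_input
    partial = block.get("partial_json")
    if isinstance(partial, str) and partial.strip():
        try:
            parsed = json.loads(partial)
        except json.JSONDecodeError:
            return {}  # truncated or malformed stream: treat as unusable
        return parsed if isinstance(parsed, dict) else {}
    return {}


# Hypothetical tool-call blocks, one per field style.
write_block = {
    "type": "tool_use",
    "name": "write_file",
    "input": {"file_path": "main.py", "content": "print('hi')\n"},
}
edit_block = {
    "type": "tool_use",
    "name": "edit_file",
    "partial_json": '{"file_path": "main.py", "old_string": "hi", "new_string": "hello"}',
}

write_args = tool_call_args(write_block)
edit_args = tool_call_args(edit_block)
print(write_args["file_path"], edit_args["new_string"])  # main.py hello
```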
