danielgshea/code-evals

Code Best Practices Eval

📖 Deep Agents Code (dcode) documentation — the coding agent this evaluator is designed for.

A LangSmith offline evaluator that analyzes Python code generated by coding agents and scores it against a suite of best-practice rules. Built to work with Deep Agents Code (dcode), LangChain's open-source terminal coding agent.

Getting started with dcode

1. Install dcode

curl -LsSf https://langch.in/dcode | bash

Or install with uv directly:

uv tool install deepagents-code

2. Set up your LangSmith API key

Sign up at smith.langchain.com and grab an API key. Then add it to ~/.deepagents/.env so it's available in every session:

LANGSMITH_TRACING=true
LANGSMITH_API_KEY=lsv2_...
LANGSMITH_PROJECT=my-coding-agent

Or export the variables in your shell:

export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="your-key"
export LANGSMITH_PROJECT="my-coding-agent"
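
If traces do not show up in your project, a quick check that the variables are visible to the process can help. The snippet below is a minimal sketch: the variable names come from the steps above, while the helper `missing_langsmith_vars` is hypothetical and not part of dcode or this repository.

```python
import os

# Variable names as listed in this README; the helper itself is hypothetical.
REQUIRED_VARS = ("LANGSMITH_TRACING", "LANGSMITH_API_KEY", "LANGSMITH_PROJECT")


def missing_langsmith_vars(env=None):
    """Return the names of the listed variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


missing = missing_langsmith_vars()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("LangSmith tracing variables are set.")
```

Note that this only inspects the current shell environment; values stored in ~/.deepagents/.env are read by dcode itself and will not appear here unless you also export them.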

3. Launch dcode and start tracing

dcode

With tracing enabled, every agent run — tool calls, decisions, file edits — is recorded in your LangSmith project automatically. You can view traces in the LangSmith UI or use /trace inside dcode to open the current thread directly.

4. Set up a model provider

On first launch, use the /auth command inside dcode to configure an API key for your chosen LLM provider (OpenAI, Anthropic, Google, etc.). See Provider credentials for details.

Why this evaluator exists

Coding agents produce code non-deterministically. The same prompt can yield clean, well-structured code one run and spaghetti the next. This project gives you a quantitative feedback loop: trace your agent, score its output, and use the results to improve it.

The workflow

  1. Trace your agent. Enable LangSmith tracing in dcode as described above. Every agent run is recorded automatically.

  2. Build a dataset from traces. In the LangSmith UI, go to your tracing project and set up a Rule to automatically add runs to a dataset. For example, create a rule that adds every completed root run (or only runs matching certain tags/metadata) to a dataset called coding-agent-traces. This keeps your evaluation dataset growing as you use your agent — no manual curation needed.

  3. Run this evaluator. Point it at your dataset and get per-file, per-rule scores for every trace. The scores tell you exactly what your agent gets wrong consistently (missing docstrings? deeply nested logic? overly long functions?).

  4. Improve your agent. Use the patterns you find to:

    • Write skills — give your agent reusable instructions for the rules it fails most (e.g., a skill that enforces docstrings on public APIs). See dcode skills.
    • Tune your system prompt — if the agent consistently produces magic numbers or bare excepts, add explicit guidance.
    • Compare experiments — after making a change, re-run the eval on the same dataset and compare scores side-by-side in LangSmith.
    • Track regressions — run the eval in CI to catch quality drops before they ship.
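
For the regression-tracking step, one possible shape for a CI gate is to compare the latest scores with a stored baseline and flag any rule that dropped. This is a sketch under assumptions: the function `find_regressions`, the `tolerance` parameter, and the flat `{rule: score}` dictionaries are illustrative and not something this repository provides.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return {rule: (old, new)} for scores that dropped by more than tolerance."""
    regressions = {}
    for rule, old_score in baseline.items():
        new_score = current.get(rule)
        if new_score is None:
            continue  # rule not reported in this run; nothing to compare
        if old_score - new_score > tolerance:
            regressions[rule] = (old_score, new_score)
    return regressions


# Illustrative numbers only, not real experiment results.
baseline = {"code_quality_overall": 0.917, "has_docstrings": 0.0}
current = {"code_quality_overall": 0.85, "has_docstrings": 1.0}

for rule, (old, new) in find_regressions(baseline, current).items():
    print(f"{rule}: {old:.3f} -> {new:.3f}")
```

A CI job could exit with a non-zero status whenever the returned dictionary is non-empty. The tolerance exists because agent output is non-deterministic, so small score movements between runs are expected.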

The goal is a tight loop: trace → score → fix → re-score.

What it does

The evaluator processes real coding-agent traces — not simple text outputs, but full multi-turn threads where the agent calls tools like write_file, edit_file, and read_file. It:

  1. Reconstructs files from the agent's message history by replaying tool calls in order (handling writes, patches, reads, and failed attempts).
  2. Parses each Python file with the AST and runs 12 weighted checks.
  3. Returns per-file and aggregate scores plus a no_code_yet flag for early turns that haven't produced code.
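
The replay in step 1 can be pictured roughly as follows. This is a simplified sketch, not the code in evals/evaluator.py: it assumes each tool call has already been reduced to a dictionary with hypothetical keys `tool`, `file_path`, `ok`, and the relevant string fields.

```python
def replay_tool_calls(calls):
    """Rebuild {file_path: content} by applying tool calls in order."""
    files = {}
    for call in calls:
        if not call.get("ok", True):
            continue  # a failed attempt never changed the file
        tool, path = call["tool"], call["file_path"]
        if tool == "write_file":
            files[path] = call["content"]
        elif tool == "edit_file":
            current = files.get(path)
            old = call["old_string"]
            if current is None or old not in current:
                continue  # cannot patch text we have never seen
            files[path] = current.replace(old, call["new_string"], 1)
        elif tool == "read_file" and "content" in call:
            # A read reveals files the agent did not write itself.
            files.setdefault(path, call["content"])
    return files


calls = [
    {"tool": "write_file", "file_path": "main.py", "content": "x = 1\n"},
    {"tool": "edit_file", "file_path": "main.py", "ok": False,
     "old_string": "x = 1", "new_string": "x = 2"},
    {"tool": "edit_file", "file_path": "main.py",
     "old_string": "x = 1", "new_string": "x = 3"},
]
print(replay_tool_calls(calls))  # {'main.py': 'x = 3\n'}
```

Order matters here: the failed edit is skipped, so the later edit still finds the original text and applies cleanly.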

Rules checked

| Rule | Weight | What it catches |
| --- | --- | --- |
| no_triple_nested_loops | 3 | ≥3 levels of nested for/while loops |
| no_bare_except | 3 | Bare except: swallowing all errors |
| no_mutable_defaults | 3 | Mutable default arguments, e.g. def foo(x=[]) |
| line_length_ok | 2 | Lines exceeding 120 characters |
| function_length_ok | 2 | Functions longer than 50 lines |
| param_count_ok | 2 | Functions with more than 5 parameters |
| no_wildcard_imports | 2 | from X import * |
| no_god_classes | 2 | Classes with more than 10 methods |
| nesting_depth_ok | 2 | Control flow nested more than 4 levels deep |
| has_docstrings | 1 | Public functions/classes missing docstrings |
| variable_names_ok | 1 | Variables shorter than 3 characters |
| no_magic_numbers | 1 | Hardcoded numeric literals |
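
To give a feel for how AST-based rules like these can work, here is a minimal sketch of two of them using Python's standard `ast` module. The function bodies and the binary 1.0/0.0 scoring are assumptions made for illustration; the actual checks in evals/evaluator.py may differ.

```python
import ast


def no_bare_except(tree):
    """Return 1.0 unless some handler is a bare `except:`."""
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            return 0.0
    return 1.0


def no_mutable_defaults(tree):
    """Return 1.0 unless a function default is a list, dict, or set literal."""
    mutable = (ast.List, ast.Dict, ast.Set)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # kw_defaults holds None for keyword-only args without a default.
            keyword_defaults = [d for d in node.args.kw_defaults if d is not None]
            defaults = node.args.defaults + keyword_defaults
            if any(isinstance(d, mutable) for d in defaults):
                return 0.0
    return 1.0


source = '''
def load(path, cache=[]):
    try:
        return open(path).read()
    except:
        return None
'''
tree = ast.parse(source)
scores = {
    "no_bare_except": no_bare_except(tree),
    "no_mutable_defaults": no_mutable_defaults(tree),
}
print(scores)  # {'no_bare_except': 0.0, 'no_mutable_defaults': 0.0}
```

Working on the parsed tree rather than on raw text means the checks are not fooled by matching strings inside comments or string literals.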

Project structure

├── evals/
│   └── evaluator.py          # Core evaluator logic (perform_eval + file extraction)
├── scripts/
│   ├── run_eval.py           # Run the evaluator locally via the LangSmith SDK
│   ├── upload_evaluator.py   # Upload/update the evaluator in LangSmith
│   ├── upload_dataset.py     # Upload example_dataset.json to LangSmith
│   └── example_setup.py      # One-command setup (dataset + evaluator)
├── example_dataset.json      # Example dataset of coding-agent traces
└── README.md

Setup

Prerequisites

1. Install dependencies

uv sync

2. Set your API key

export LANGSMITH_API_KEY="your-key-here"

3. Run setup

This uploads the dataset and evaluator to your LangSmith workspace:

uv run python scripts/example_setup.py

That's it. You can also run the individual steps separately:

uv run python scripts/upload_dataset.py     # Upload the dataset only
uv run python scripts/upload_evaluator.py   # Upload the evaluator only

Running the evaluation

uv run python scripts/run_eval.py

This runs the evaluator against the default dataset (dcode-dataset) and prints results to the console. A link to the experiment in LangSmith is also printed.

To target a different dataset:

uv run python scripts/run_eval.py --dataset-name "my-other-dataset"

Understanding the results

Each dataset example is one turn of a coding-agent thread. Early turns (planning, dependency installation) produce no code — these get no_code_yet: 1.0 and no quality scores. Later turns where the agent has written Python files get the full rule breakdown.

Example output:

Example db6bca17... (39 msgs, files: ['main.py', 'test_main.py'])
  no_code_yet: 0.0
  code_quality_overall: 0.917
  files_found: 2.0
  no_triple_nested_loops: 1.0
  no_bare_except: 1.0
  has_docstrings: 0.0         ← agent didn't add docstrings
  no_magic_numbers: 0.0       ← hardcoded numbers in test assertions
  ...

Example 15b158d9... (10 msgs, files: [])
  no_code_yet: 1.0            ← agent was still setting up
  files_found: 0.0
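
One reading that is consistent with the first example is that code_quality_overall is a weighted mean of the rule scores, using the weights from the rules table. The sketch below assumes that formula and assumes every rule elided by "..." passed; neither is confirmed against evals/evaluator.py, and the name `weighted_overall` is hypothetical.

```python
# Weights copied from the rules table in this README.
RULE_WEIGHTS = {
    "no_triple_nested_loops": 3,
    "no_bare_except": 3,
    "no_mutable_defaults": 3,
    "line_length_ok": 2,
    "function_length_ok": 2,
    "param_count_ok": 2,
    "no_wildcard_imports": 2,
    "no_god_classes": 2,
    "nesting_depth_ok": 2,
    "has_docstrings": 1,
    "variable_names_ok": 1,
    "no_magic_numbers": 1,
}


def weighted_overall(rule_scores, weights=RULE_WEIGHTS):
    """Weighted mean of rule scores (assumed formula), or None if no rules."""
    total_weight = sum(weights[rule] for rule in rule_scores)
    if total_weight == 0:
        return None
    weighted_sum = sum(weights[rule] * score for rule, score in rule_scores.items())
    return weighted_sum / total_weight


# Assume every rule passed except the two that the example shows as 0.0.
example_scores = {rule: 1.0 for rule in RULE_WEIGHTS}
example_scores["has_docstrings"] = 0.0
example_scores["no_magic_numbers"] = 0.0

print(round(weighted_overall(example_scores), 3))  # 0.917
```

The weights sum to 24 and the two failed rules carry weight 1 each, so the result is 22/24 ≈ 0.917, which matches the reported value. Failing a weight-3 rule such as no_bare_except would cost three times as much.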

Adding your own traces

The dataset is built from real agent traces. To add more:

  1. Run your coding agent with LangSmith tracing enabled.
  2. In the LangSmith UI, add the trace runs to a dataset.
  3. Export the dataset or point run_eval.py at it directly.

The evaluator works with any coding agent that uses write_file / edit_file / read_file tool calls with a partial_json or input field containing file_path and content/old_string/new_string.
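
If you are adapting traces from another agent, the main thing to get right is where the tool arguments live. The sketch below shows one way to read them from either field. It assumes that `input` is already a dictionary and that `partial_json` is a JSON string; the helper name `tool_call_args` and the exact block layout are hypothetical.

```python
import json


def tool_call_args(block):
    """Return the argument dict of a tool-call block, or None if unreadable."""
    args = block.get("input")
    if isinstance(args, dict) and args:
        return args
    raw = block.get("partial_json")
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            return None  # incomplete JSON; skip rather than guess
        return parsed if isinstance(parsed, dict) else None
    return None


write_call = {"name": "write_file", "input": {"file_path": "main.py", "content": "x = 1\n"}}
edit_call = {
    "name": "edit_file",
    "partial_json": '{"file_path": "main.py", "old_string": "x = 1", "new_string": "x = 2"}',
}
print(tool_call_args(write_call)["file_path"])   # main.py
print(tool_call_args(edit_call)["new_string"])   # x = 2
```

Returning None for unparseable input lets the caller skip that tool call instead of reconstructing a file from partial data.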

About

Evaluating coding agents for better coding practices.
