CVEvolve

CVEvolve is a uv-managed, DeepAgents-based framework for agentic auto-research on open-ended computer vision tasks. It searches through iterative generate, tune, and evolve rounds, and lets the agent build a task-specific testbed inside a dedicated workspace.

What It Does

Reads a YAML config and a user-authored prompt markdown file.
Creates a dedicated session workspace and copies the prompt plus any configured user skill directories into it.
Runs a preparation stage where the agent is expected to:
- read the task prompt,
- create a separate uv environment inside the workspace,
- inspect example images,
- define the primary evaluation metric,
- set up an evaluation harness.
Runs iterative generate, tune, and evolve rounds.
Tracks candidates, metrics, rounds, and evolution edges in SQLite and CSV.
Converts tool-returned image paths into automatic multimodal follow-up context for the agent.

Installation

We recommend using uv to manage the environment. If you have uv installed, you can install CVEvolve with:

cd /path/to/CVEvolve
uv sync

Docker

Build the image:

docker build -t cvevolve .

To start an interactive container with a host workspace folder mounted at /workspace:

mkdir -p ./docker-workspace
docker run --rm -it \
  -e OPENAI_API_KEY \
  -e OPENAI_BASE_URL \
  -e SEMANTIC_SCHOLAR_API_KEY \
  -e TAVILY_API_KEY \
  -v "$(pwd)/docker-workspace:/workspace" \
  -w /workspace \
  cvevolve \
  bash

From inside the container, you can then run something like:

cvevolve run --config ./config.yaml --prompt ./task.md

Configuration

The runner expects a YAML file that matches cvevolve.config.CVEvolveConfig. Example:

name: translation-registration
model:
  model_name: gpt-4.1
  api_key_env_var: OPENAI_API_KEY
  api_base: null
  temperature: 0.2
  max_retries: 2
  rate_limit_resend_attempts: 3
  rate_limit_sleep_seconds: 60
workspace:
  root_dir: ./sessions
  data_dir: /path/to/data
  holdout_data_dir: null
  additional_skill_dirs:
    - /absolute/path/to/my-skill
    - /absolute/path/to/another-skill
  require_dangerous_command_approval: true
metric:
  name_hint: registration_accuracy
  direction_hint: maximize
  target_value: null
  description_hint: Let the agent refine the exact metric after inspecting the task.
branching:
  warmup_rounds: 3
  force_generate_every: 0
  generate_batch_size: 1
  tune_every: 3
  evolve_every: 2
  min_excellent_for_tune: 1
  min_successful_for_evolve: 2
  lineage_selection_temperature: 1.0
  crossover_candidates_per_lineage: 2
  crossover_same_lineage_penalty: 0.5
  exclude_poor_lineages: true
  exclude_lineages_with_failure_streak: 0
tracking:
  enabled: false
  mlflow_tracking_uri: ""
  mlflow_experiment_name: CVEvolve
hutch:
  enabled: false
  project: cvevolve
  run_id: null
  daemon_url: null
  db_path: null
  strict: false
num_workers_generate: 1
num_workers_tune: 1
cap_num_requests: null
stopping:
  max_rounds: 12
  patience_rounds: 4
  min_improvement: 0.0

Dataset Nomenclature

workspace.data_dir is the development set copied into workspace/data/ and available to the main agent for algorithm development and evaluation.
A validation set is an optional split the agent may create from the development set inside its own workflow, usually for data-driven algorithms that need an internal train-validation split.
workspace.holdout_data_dir is an optional user-provided holdout set. It is not copied into the main development workspace. When configured with --holdout-test-prompt, CVEvolve temporarily copies it into a per-candidate holdout test workspace after a round, runs the holdout test agent, records holdout_test_metrics, and then deletes that temporary workspace.

Workspace Skill Configuration

workspace.additional_skill_dirs: optional list of skill directories to copy into the sandbox.
Each configured path must point directly to a single skill directory that contains SKILL.md.
During preparation, CVEvolve copies those directories into workspace/skills/user/.
The agent receives a skill catalog in its system prompt and can then read the copied SKILL.md files from inside the sandbox.
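
The skill-directory rules above can be illustrated with a short sketch. It is not CVEvolve's actual code: the helper name copy_user_skills is hypothetical, and copying each skill under its own directory name is an assumption. It only shows the SKILL.md check and the copy into workspace/skills/user/.

```python
import shutil
from pathlib import Path


def copy_user_skills(skill_dirs: list[str], workspace: Path) -> list[Path]:
    """Copy each skill directory into workspace/skills/user/ (hypothetical helper)."""
    target_root = workspace / "skills" / "user"
    target_root.mkdir(parents=True, exist_ok=True)
    copied = []
    for raw in skill_dirs:
        src = Path(raw)
        # Each configured path must point directly at a directory holding SKILL.md.
        if not (src / "SKILL.md").is_file():
            raise ValueError(f"{src} does not contain SKILL.md")
        # Assumption: the copy keeps the source directory name.
        dest = target_root / src.name
        shutil.copytree(src, dest, dirs_exist_ok=True)
        copied.append(dest)
    return copied
```

A path that points at a parent folder holding several skills would fail the SKILL.md check, which matches the "single skill directory" requirement.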

Lineage-Aware Parent Selection

branching.lineage_selection_temperature: temperature for softmax-style lineage rank sampling. Lower values make selection greedier. 0 uses deterministic top-k behavior for tune and evolve parent selection.
branching.crossover_candidates_per_lineage: maximum number of candidates per lineage to include in the crossover candidate pool.
branching.crossover_same_lineage_penalty: multiplier used to downweight same-lineage candidates when choosing the second crossover parent.
branching.exclude_poor_lineages: whether to exclude lineages whose representative is classified as poor.
branching.exclude_lineages_with_failure_streak: exclude a lineage when its most recent N descendant attempts all failed. 0 disables this filter.
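
As an illustration of the failure-streak filter, the sketch below checks whether the most recent N attempts in a lineage all failed. The function name and the list-of-booleans representation are hypothetical. Treating a lineage with fewer than N attempts as not excluded is an assumption, since the text above does not cover that case.

```python
def has_failure_streak(attempt_succeeded: list[bool], streak: int) -> bool:
    """Return True when the most recent `streak` attempts all failed.

    `attempt_succeeded` is ordered oldest to newest. A streak of 0 disables
    the filter, mirroring exclude_lineages_with_failure_streak: 0.
    """
    if streak <= 0:
        return False
    # Assumption: a lineage with fewer than `streak` attempts is kept.
    if len(attempt_succeeded) < streak:
        return False
    return not any(attempt_succeeded[-streak:])


# Hypothetical lineages mapped to their descendant attempt outcomes.
lineages = {
    "lineage-a": [True, False, False, False],
    "lineage-b": [False, False, True],
}
eligible = [name for name, runs in lineages.items() if not has_failure_streak(runs, 3)]
```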

Parallel Worker Knobs

num_workers_generate: number of concurrent workers to launch for a generate round. Each worker is expected to submit one candidate.
num_workers_tune: number of concurrent workers to launch for a tune round. When branching.lineage_selection_temperature > 0, workers are assigned jointly sampled parents without replacement. When it is 0, worker 1 gets the best-ranked candidate by the primary metric, worker 2 gets the second-best, and so on.
When either value is set to 1, CVEvolve runs that action serially without creating a thread pool. This is the recommended setting if the runtime environment or any tools are not thread-safe.
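
The serial-versus-pooled behavior described above can be sketched as follows. This is a minimal illustration, not CVEvolve's runner: run_workers and the task callable are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_workers(task: Callable[[int], str], num_workers: int) -> list[str]:
    """Run `task` once per worker index and collect the results in order."""
    if num_workers <= 1:
        # A single worker runs in the calling thread, so no pool is created.
        return [task(0)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() preserves worker order even if tasks finish out of order.
        return list(pool.map(task, range(num_workers)))
```

With num_workers set to 1 the task never leaves the calling thread, which is why that setting is the safe choice for tools that are not thread-safe.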

Usage Cap

cap_num_requests: optional maximum number of model requests allowed in the current CVEvolve process. null means unlimited. When the cap is reached, CVEvolve raises an exception before sending the next model request.
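
A request cap of this kind can be sketched with a small counter that is checked before every model call. The class and exception names below are hypothetical; the source only says that CVEvolve raises an exception before sending the next request once the cap is reached.

```python
from typing import Optional


class RequestCapExceeded(RuntimeError):
    """Raised before a request that would exceed the cap (hypothetical name)."""


class RequestCounter:
    def __init__(self, cap: Optional[int] = None) -> None:
        self.cap = cap  # None means unlimited, like cap_num_requests: null
        self.count = 0

    def before_request(self) -> None:
        """Call right before each model request."""
        if self.cap is not None and self.count >= self.cap:
            raise RequestCapExceeded(f"request cap of {self.cap} reached")
        self.count += 1
```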

Tune And Evolve Parent Sampling Behavior

Overview:

When branching.lineage_selection_temperature > 0, tune and evolve use stochastic lineage-aware parent selection.
One or more candidates from each lineage are selected into a pool, and the candidates in that pool are ranked by the primary metric.
The rank-based sampling weights follow a softmax-style exponential decay. For rank starting at 1, the unnormalized weight is: exp(-(rank - 1) / temperature), with lower temperatures making selection more greedy.
When branching.lineage_selection_temperature == 0, tune and evolve use deterministic top-k behavior.
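
As a worked illustration of the rank-based weighting, the sketch below turns exp(-(rank - 1) / temperature) into normalized sampling probabilities. The function name rank_weights is hypothetical, and the normalization step is an assumption about how the weights are used.

```python
import math


def rank_weights(pool_size: int, temperature: float) -> list[float]:
    """Normalized sampling probabilities for ranks 1..pool_size."""
    if temperature <= 0:
        # Temperature 0 is the deterministic top-k path, so no weights apply.
        raise ValueError("temperature must be > 0 for stochastic selection")
    if pool_size <= 0:
        return []
    raw = [math.exp(-(rank - 1) / temperature) for rank in range(1, pool_size + 1)]
    total = sum(raw)
    return [w / total for w in raw]


# Lower temperatures concentrate probability on the top-ranked candidate.
for t in (0.25, 1.0, 4.0):
    print(t, [round(p, 3) for p in rank_weights(4, t)])
```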

Tune:

Tune builds its candidate pool from the best-scoring candidate in each eligible lineage.
Those lineage representatives are sorted by primary metric and sampled using the configured decay and temperature.
With a single worker, one parent candidate is sampled from that ranked lineage-representative pool.
With multiple workers, tune samples parent candidates jointly without replacement, so different workers are assigned different sampled parents whenever enough eligible lineages exist.
When branching.lineage_selection_temperature == 0, tune uses deterministic selection:
- one worker gets the single best candidate
- multiple workers get the top num_workers_tune candidates by primary metric
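
The tune selection described above can be sketched as successive weighted draws without replacement. This is an illustration under assumptions: sample_tune_parents is a hypothetical name, the input is assumed to be the lineage representatives already sorted best-first, and drawing one parent at a time with renormalized weights is one possible way to sample jointly without replacement.

```python
import math
import random


def sample_tune_parents(ranked: list[str], num_workers: int,
                        temperature: float, rng: random.Random) -> list[str]:
    """Pick up to num_workers distinct parents from a best-first ranked pool."""
    if temperature == 0:
        # Deterministic path: the top num_workers candidates by primary metric.
        return ranked[:num_workers]
    pool = list(enumerate(ranked, start=1))  # (rank, candidate) pairs
    chosen = []
    while pool and len(chosen) < num_workers:
        # Shifting by the best remaining rank leaves the probabilities
        # unchanged and avoids all-zero weights at very low temperatures.
        best = pool[0][0]
        weights = [math.exp(-(rank - best) / temperature) for rank, _ in pool]
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(idx)[1])
    return chosen
```

When fewer eligible lineages exist than workers, this sketch simply returns fewer parents; how CVEvolve handles that case is not stated here.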

Evolve:

Mutation evolve selects one parent from the lineage-aware ranked pool and behaves like tune parent selection.
Crossover evolve builds a candidate pool from up to crossover_candidates_per_lineage candidates per lineage, sorted by primary metric.
Crossover parent selection is sequential:
- first sample one candidate from the ranked pool
- then downweight candidates from the same lineage
- then sample the second parent without replacement
This bias reduces same-lineage crossover while still allowing it when the weighted pool strongly favors it.
When branching.lineage_selection_temperature == 0, evolve uses deterministic selection:
- use the top two candidates if available
- otherwise mutate the top candidate
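
The sequential crossover selection can be sketched as follows. All names are hypothetical, the pool is assumed to be (candidate_id, lineage_id) pairs sorted best-first, and the uniform fallback when every remaining weight is zero is an assumption added to keep the sketch robust.

```python
import math
import random


def _weights(ranks: list[int], temperature: float) -> list[float]:
    # Shift by the best remaining rank; normalized probabilities are unchanged.
    best = min(ranks)
    return [math.exp(-(r - best) / temperature) for r in ranks]


def pick_crossover_parents(pool: list[tuple[str, str]], temperature: float,
                           same_lineage_penalty: float,
                           rng: random.Random) -> list[str]:
    """Return one or two parent IDs from a best-first (candidate, lineage) pool."""
    if temperature == 0:
        # Deterministic path: top two if available, else only the top
        # candidate, which the caller would mutate instead.
        return [cid for cid, _ in pool[:2]]
    entries = [(rank, cid, lin) for rank, (cid, lin) in enumerate(pool, start=1)]
    if not entries:
        return []
    w = _weights([r for r, _, _ in entries], temperature)
    first = entries.pop(rng.choices(range(len(entries)), weights=w, k=1)[0])
    if not entries:
        return [first[1]]
    w = _weights([r for r, _, _ in entries], temperature)
    # Downweight candidates that share the first parent's lineage.
    w = [wi * same_lineage_penalty if lin == first[2] else wi
         for wi, (_, _, lin) in zip(w, entries)]
    if sum(w) <= 0:
        w = [1.0] * len(entries)  # assumption: fall back to uniform sampling
    second = entries.pop(rng.choices(range(len(entries)), weights=w, k=1)[0])
    return [first[1], second[1]]
```

With the example value crossover_same_lineage_penalty: 0.5, a same-lineage candidate keeps half of its rank weight, so it can still be chosen when it ranks far above the alternatives.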

Action Selection Rules

Every round with round_index <= warmup_rounds is a warmup round and always runs generate.
After warmup, if force_generate_every > 0, CVEvolve forces a generate round whenever (round_index - 1 - warmup_rounds) is divisible by force_generate_every. These forced rounds take priority over every other rule, including suggested actions.
Otherwise, if recent candidates whose performance level is at least honor_suggestion_min_level share a majority suggested_next_action, CVEvolve honors that suggestion.
If there is no such suggestion, CVEvolve computes an effective scheduled round index: round_index - 1 - warmup_rounds - forced_generate_rounds_so_far
On that effective index, tune is eligible when:
- min_excellent_for_tune excellent candidates exist, and
- tune_every > 0, and
- the effective index is divisible by tune_every
If tune is not selected, evolve is eligible when:
- min_successful_for_evolve moderate-or-better candidates exist, and
- evolve_every > 0, and
- the effective index is divisible by evolve_every
If neither scheduled action is eligible, CVEvolve falls back to branching.fallback_action, which defaults to generate.

Performance buckets come from each candidate's structured analysis when available, and otherwise from its primary-metric ranking. For scheduler gating, "successful" means candidates classified as excellent, good, or moderate.
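
The action selection rules can be condensed into one function, sketched below. It is an illustration rather than CVEvolve's scheduler: the Branching dataclass and choose_action are hypothetical names, the field values are copied from the example config rather than from real defaults, and the majority suggestion is assumed to be computed by the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Branching:
    # Values copied from the example config above, not confirmed defaults.
    warmup_rounds: int = 3
    force_generate_every: int = 0
    tune_every: int = 3
    evolve_every: int = 2
    min_excellent_for_tune: int = 1
    min_successful_for_evolve: int = 2
    fallback_action: str = "generate"


def choose_action(round_index: int, cfg: Branching, num_excellent: int,
                  num_successful: int, forced_generate_rounds_so_far: int = 0,
                  majority_suggestion: Optional[str] = None) -> str:
    if round_index <= cfg.warmup_rounds:
        return "generate"  # warmup
    offset = round_index - 1 - cfg.warmup_rounds
    if cfg.force_generate_every > 0 and offset % cfg.force_generate_every == 0:
        return "generate"  # forced rounds beat every other rule
    if majority_suggestion is not None:
        return majority_suggestion
    effective = offset - forced_generate_rounds_so_far
    if (num_excellent >= cfg.min_excellent_for_tune and cfg.tune_every > 0
            and effective % cfg.tune_every == 0):
        return "tune"
    if (num_successful >= cfg.min_successful_for_evolve and cfg.evolve_every > 0
            and effective % cfg.evolve_every == 0):
        return "evolve"
    return cfg.fallback_action
```

With the example values, round 4 has an effective index of 0, which is divisible by both tune_every and evolve_every, so tune wins whenever at least one excellent candidate exists.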

MLflow Tracking

tracking.enabled: enable MLflow tracking for the session.
tracking.mlflow_tracking_uri: MLflow tracking URI used to create or resume the run.
tracking.mlflow_experiment_name: MLflow experiment name for the session.
When enabled, CVEvolve creates one MLflow run per session and persists its run_id in mlflow_run.json under the session root.
On resume, CVEvolve reuses that saved run_id and continues logging to the same MLflow run.
After every completed round, CVEvolve logs the best candidate's primary metric to MLflow under the fixed metric name metric.
It also logs the best primary-metric value seen so far under best_metric_so_far.
It logs round_type as a numeric metric at each step:
- 0 = baseline
- 1 = generate
- 2 = tune
- 3 = evolve
Preparation is not logged as a round_type metric.
CVEvolve logs the session config as MLflow parameters, uploads config.snapshot.yaml as an artifact, and stores the actual primary metric name as MLflow metadata via the cvevolve.actual_metric_name tag.
When the session has completed, CVEvolve also logs the best candidate source file as an MLflow artifact and tags its candidate ID.
At session start, CVEvolve records the current CVEvolve git commit hash into workspace/cvevolve_git_commit.txt. When present, runtime tracking and upload also log it to MLflow via the cvevolve.git_commit tag.
The upload command does not require a stored git commit file.
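
For readers who post-process MLflow data, the sketch below restates the round_type codes as a lookup table and shows one way a run ID could be persisted and reloaded. The helper names and the {"run_id": ...} JSON layout are assumptions; the actual contents of mlflow_run.json may differ.

```python
import json
from pathlib import Path
from typing import Optional

# Numeric codes for the round_type metric, as listed above.
ROUND_TYPE_CODES = {"baseline": 0, "generate": 1, "tune": 2, "evolve": 3}


def save_run_id(session_root: Path, run_id: str) -> None:
    """Persist the run ID under the session root (assumed JSON layout)."""
    (session_root / "mlflow_run.json").write_text(json.dumps({"run_id": run_id}))


def load_run_id(session_root: Path) -> Optional[str]:
    """Return the saved run ID, or None when no file exists yet."""
    path = session_root / "mlflow_run.json"
    if not path.is_file():
        return None
    return json.loads(path.read_text()).get("run_id")
```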

Hutch Tracking

Hutch tracking is optional and disabled by default. The base CVEvolve install does not depend on Hutch.

To enable native Hutch events, install the optional extra once thehutch>=0.1.1 is available in your environment:

uv sync --extra hutch

Then either set hutch.enabled: true in the config or export CVEVOLVE_HUTCH=1.

hutch.project: Hutch project name. Defaults to cvevolve.
hutch.run_id: optional Hutch run ID override. Defaults to cvevolve-<session-name>.
hutch.daemon_url: optional Hutch daemon URL. Falls back to HUTCH_DAEMON_URL.
hutch.db_path: optional embedded Hutch DuckDB path. Falls back to HUTCH_DB_PATH.
hutch.strict: when true, Hutch tracking failures fail the CVEvolve run. The default is false, so tracking problems emit warnings and the search continues.

Native Hutch tracking emits run updates, candidates, operators, primary and holdout metrics, candidate failures, and the terminal run end. It uses the same deterministic event IDs as Hutch's CVEvolve polling adapter, so a watched/imported session and a native-instrumented session can be deduplicated by Hutch.

The task prompt is a separate markdown file. Example:

1. Introduction

Find the optimal image registration algorithm that attains the best accuracy on the test data.

2. Test data

The registration is translation-only. You can find the test data under `images/` and the labels under `labels/` relative to the configured `data_dir`.

When authoring the task prompt markdown:

Important: Do not include any absolute path. When describing test data, always use relative paths under the configured data_dir. If you want the agent to see a file, make sure it is copied under data_dir.
Avoid markdown section header markers such as # because your prompt is injected into a larger prompt template and those headings can interfere with the outer structure. Prefer numbered section titles such as 1. Introduction, 2. Test data, and 3. Evaluation.
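
The two authoring rules above can be checked before a run with a small script. This is a hypothetical helper, not part of CVEvolve, and the absolute-path pattern is a heuristic that may miss unusual paths or flag harmless text.

```python
import re

# Heuristic: a "/" that starts a path segment and is not part of a relative
# path or URL, or a Windows drive prefix such as C:\ (assumed patterns).
ABSOLUTE_PATH = re.compile(r"(?<![\w.:/])/[\w.-]+|\b[A-Za-z]:\\")
# Markdown ATX heading marker at the start of a line.
HEADING = re.compile(r"\s{0,3}#{1,6}(\s|$)")


def lint_prompt(text: str) -> list[str]:
    """Return human-readable problems found in a task prompt."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if HEADING.match(line):
            problems.append(f"line {number}: markdown heading marker")
        if ABSOLUTE_PATH.search(line):
            problems.append(f"line {number}: absolute path")
    return problems
```

Calling lint_prompt on the text of task.md returns an empty list when neither rule is violated.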

Running

Before running a session, export the API-related environment variables the agent may need:

export OPENAI_API_KEY=...
export OPENAI_BASE_URL=...              # optional, only for OpenAI-compatible providers
export SEMANTIC_SCHOLAR_API_KEY=...     # optional, improves Semantic Scholar rate limits
export TAVILY_API_KEY=...               # required if you want the Tavily tool available

If your config sets model.api_key_env_var to a different variable name, export that variable instead of OPENAI_API_KEY before starting the run.

uv run cvevolve run --config ./config.yaml --prompt ./task.md

To enable holdout testing, set workspace.holdout_data_dir in the config and pass a prompt describing that directory:

uv run cvevolve run \
  --config ./config.yaml \
  --prompt ./task.md \
  --holdout-test-prompt ./holdout_test.md

To enable tool-call logging for a run or resume, pass --enable-logging on the command line:

uv run cvevolve run --config ./config.yaml --prompt ./task.md --enable-logging

To resume an interrupted session from its existing session directory:

uv run cvevolve resume --session ./sessions/<name>

uv run cvevolve resume \
  --session ./sessions/<name> \
  --enable-logging

To upload an existing session's DB-backed history to MLflow without rerunning the search:

uv run cvevolve upload ./sessions/<name> \
  --mlflow-tracking-uri file:///tmp/mlruns \
  --mlflow-experiment-name CVEvolve

The upload command reads config.snapshot.yaml and history/search_history.sqlite, takes the MLflow destination from the command line, and works even if the session has not finished and without relying on files under reports/.

By default, outputs are written under ./sessions/<name>/:

workspace/prompt/task_prompt.md
workspace/cvevolve_git_commit.txt
workspace/skills/user/... for any configured user skills
workspace/candidates/...
config.snapshot.yaml
mlflow_run.json when MLflow tracking is enabled
history/search_history.sqlite
history/tool_calls.sqlite when --enable-logging is enabled
exports/candidates.csv
exports/metrics.csv
exports/evaluation_metrics.csv
exports/holdout_test_metrics.csv
exports/evolution_tree.csv
exports/rounds.csv
reports/final_report.md
reports/final_summary.json
reports/best_candidate.py
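
The SQLite history can be inspected without assuming its schema. The sketch below lists the tables in a database file using Python's standard sqlite3 module; the helper name list_tables is hypothetical, and no CVEvolve table names are assumed.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def list_tables(db_path: Path) -> list[str]:
    """Return the table names in a SQLite file, opened read-only."""
    # mode=ro prevents accidental writes to a session that is still running.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

For a session this would be called as list_tables(Path("./sessions/<name>/history/search_history.sqlite")), with <name> replaced by the session name.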

Environment Variables

Set the model provider key in OPENAI_API_KEY, or in whatever variable name you configured in model.api_key_env_var. This one is required.
Set OPENAI_BASE_URL if your OpenAI-compatible provider requires a custom base URL.
Set SEMANTIC_SCHOLAR_API_KEY if you want the Semantic Scholar tool to use the authenticated API.
Set TAVILY_API_KEY if you want the Tavily search tool to be usable during a run.
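
A quick pre-flight check for these variables might look like the sketch below. The function name is hypothetical; the variable names come from the list above, and the api_key_env_var argument mirrors model.api_key_env_var.

```python
import os
from typing import Mapping, Optional

OPTIONAL_VARS = ("OPENAI_BASE_URL", "SEMANTIC_SCHOLAR_API_KEY", "TAVILY_API_KEY")


def check_environment(api_key_env_var: str = "OPENAI_API_KEY",
                      environ: Optional[Mapping[str, str]] = None):
    """Return (missing_required, missing_optional) variable names."""
    env = os.environ if environ is None else environ
    # Only the model provider key is required; empty values count as missing.
    missing_required = [] if env.get(api_key_env_var) else [api_key_env_var]
    missing_optional = [name for name in OPTIONAL_VARS if not env.get(name)]
    return missing_required, missing_optional
```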

Notes

The agent is instructed to use the shell tools available in the runtime plus run_python_inline.
Shell commands are stateless: nothing such as an activated environment carries over from one command to the next. Commands that depend on an environment must therefore be written as one-shot commands, preferably through uv run or .venv/bin/python.
The agent is expected to stay inside the session workspace and copy any needed inputs into it before operating on them.
Dangerous shell commands trigger a terminal approval prompt.
Session resume is round-level. If interruption happens mid-round, resume reruns that round from the beginning rather than attempting mid-tool continuation.

Security

The agent works inside a sandboxed session workspace, and normal filesystem tools cannot read or write files outside that sandbox.
However, this is NOT a complete security boundary. An agent may still try to bypass the restriction by writing and executing a Python script that accesses files outside the sandbox.
If you need stronger isolation, run CVEvolve inside a container.

Acknowledgements

CVEvolve's generate-tune-evolve search strategy is adapted from Pty-Chi-Evolve.

References

Pty-Chi-Evolve paper: https://arxiv.org/html/2603.05696v1
