CVEvolve is a uv-managed, DeepAgents-based framework for agentic auto-research on open-ended computer vision tasks. It uses generate, tune, and evolve search rounds while allowing the agent to build a task-specific testbed inside a dedicated workspace.
- Reads a YAML config and a user-authored prompt markdown file.
- Creates a dedicated session workspace and copies the prompt plus any configured user skill directories into it.
- Runs a preparation stage where the agent is expected to:
  - read the task prompt,
  - create a separate uv environment inside the workspace,
  - inspect example images,
  - define the primary evaluation metric,
  - set up an evaluation harness.
- Runs iterative `generate`, `tune`, and `evolve` rounds.
- Tracks candidates, metrics, rounds, and evolution edges in SQLite and CSV.
- Converts tool-returned image paths into automatic multimodal follow-up context for the agent.
We recommend using uv to manage the environment. If you have uv installed, you can install CVEvolve with:
cd /path/to/CVEvolve
uv sync

Build the image:

docker build -t cvevolve .

To start an interactive container with a host workspace folder mounted at `/workspace`:

mkdir -p ./docker-workspace
docker run --rm -it \
  -e OPENAI_API_KEY \
  -e OPENAI_BASE_URL \
  -e SEMANTIC_SCHOLAR_API_KEY \
  -e TAVILY_API_KEY \
  -v "$(pwd)/docker-workspace:/workspace" \
  -w /workspace \
  cvevolve \
  bash

From inside the container, you can then run something like:

cvevolve run --config ./config.yaml --prompt ./task.md

The runner expects a YAML file that matches `cvevolve.config.CVEvolveConfig`. Example:

name: translation-registration
model:
  model_name: gpt-4.1
  api_key_env_var: OPENAI_API_KEY
  api_base: null
  temperature: 0.2
  max_retries: 2
  rate_limit_resend_attempts: 3
  rate_limit_sleep_seconds: 60
workspace:
  root_dir: ./sessions
  data_dir: /path/to/data
  holdout_data_dir: null
  additional_skill_dirs:
    - /absolute/path/to/my-skill
    - /absolute/path/to/another-skill
  require_dangerous_command_approval: true
metric:
  name_hint: registration_accuracy
  direction_hint: maximize
  target_value: null
  description_hint: Let the agent refine the exact metric after inspecting the task.
branching:
  warmup_rounds: 3
  force_generate_every: 0
  generate_batch_size: 1
  tune_every: 3
  evolve_every: 2
  min_excellent_for_tune: 1
  min_successful_for_evolve: 2
  lineage_selection_temperature: 1.0
  crossover_candidates_per_lineage: 2
  crossover_same_lineage_penalty: 0.5
  exclude_poor_lineages: true
  exclude_lineages_with_failure_streak: 0
tracking:
  enabled: false
  mlflow_tracking_uri: ""
  mlflow_experiment_name: CVEvolve
hutch:
  enabled: false
  project: cvevolve
  run_id: null
  daemon_url: null
  db_path: null
  strict: false
num_workers_generate: 1
num_workers_tune: 1
cap_num_requests: null
stopping:
  max_rounds: 12
  patience_rounds: 4
  min_improvement: 0.0

- `workspace.data_dir` is the development set copied into `workspace/data/` and available to the main agent for algorithm development and evaluation.
- A validation set is an optional split the agent may create from the development set inside its own workflow, usually for data-driven algorithms that need an internal train-validation split.
- `workspace.holdout_data_dir` is an optional user-provided holdout set. It is not copied into the main development workspace. When configured with `--holdout-test-prompt`, CVEvolve temporarily copies it into a per-candidate holdout test workspace after a round, runs the holdout test agent, records `holdout_test_metrics`, and then deletes that temporary workspace.
- `workspace.additional_skill_dirs`: optional list of skill directories to copy into the sandbox.
  - Each configured path must point directly to a single skill directory that contains `SKILL.md`.
  - During preparation, CVEvolve copies those directories into `workspace/skills/user/`.
  - The agent receives a skill catalog in its system prompt and can then read the copied `SKILL.md` files from inside the sandbox.
- `branching.lineage_selection_temperature`: temperature for softmax-style lineage rank sampling. Lower values make selection greedier. `0` uses deterministic top-k behavior for tune and evolve parent selection.
- `branching.crossover_candidates_per_lineage`: maximum number of candidates per lineage to include in the crossover candidate pool.
- `branching.crossover_same_lineage_penalty`: multiplier used to downweight same-lineage candidates when choosing the second crossover parent.
- `branching.exclude_poor_lineages`: whether to exclude lineages whose representative is classified as `poor`.
- `branching.exclude_lineages_with_failure_streak`: exclude a lineage when its most recent `N` descendant attempts all failed. `0` disables this filter.
- `num_workers_generate`: number of concurrent workers to launch for a `generate` round. Each worker is expected to submit one candidate.
- `num_workers_tune`: number of concurrent workers to launch for a `tune` round. When lineage-aware sampling is enabled, workers are assigned jointly sampled parents without replacement. When temperature is `0`, worker 1 gets the best-ranked candidate by the primary metric, worker 2 gets the second-best, and so on.
- When either value is set to `1`, CVEvolve runs that action serially without creating a thread pool. This is the recommended setting if the runtime environment or any tools are not thread-safe.
- `cap_num_requests`: optional maximum number of model requests allowed in the current CVEvolve process. `null` means unlimited. When the cap is reached, CVEvolve raises an exception before sending the next model request.
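The request cap described above can be pictured as a counter that is checked before each model request. The following is only a minimal sketch of that idea: `RequestCounter`, `RequestCapExceeded`, and `before_request` are hypothetical names chosen for illustration, not CVEvolve's actual classes.

```python
import threading


class RequestCapExceeded(RuntimeError):
    """Hypothetical exception raised before a request would exceed the cap."""


class RequestCounter:
    """Counts model requests and enforces an optional process-wide cap."""

    def __init__(self, cap=None):
        self.cap = cap  # None mirrors `cap_num_requests: null` (unlimited)
        self.count = 0
        self._lock = threading.Lock()  # concurrent workers may share one counter

    def before_request(self):
        """Call immediately before sending a model request."""
        with self._lock:
            if self.cap is not None and self.count >= self.cap:
                raise RequestCapExceeded(
                    f"request cap of {self.cap} reached; not sending another request"
                )
            self.count += 1


counter = RequestCounter(cap=2)
counter.before_request()
counter.before_request()
try:
    counter.before_request()  # third request is refused before it is sent
    cap_hit = False
except RequestCapExceeded:
    cap_hit = True
```

Checking the counter before the request, rather than after, matches the documented behavior that the exception is raised before the next request goes out.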
- When `branching.lineage_selection_temperature > 0`, tune and evolve use stochastic lineage-aware parent selection.
- One or more candidates from each lineage are selected and put in a pool. Candidates in the pool are ranked by their performance metric.
- The rank-based sampling weights follow a softmax-style exponential decay. For rank starting at `1`, the unnormalized weight is `exp(-(rank - 1) / temperature)`, with lower temperatures making selection more greedy.
- When `branching.lineage_selection_temperature == 0`, tune and evolve use deterministic top-k behavior.
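To make the weight formula concrete, here is a small sketch that computes the unnormalized weights for a best-first ranked pool and draws one candidate from it. The function names `rank_weights` and `sample_by_rank` are illustrative assumptions; only the formula itself comes from the description above.

```python
import math
import random


def rank_weights(num_candidates, temperature):
    """Unnormalized weights exp(-(rank - 1) / temperature) for ranks 1..n."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; 0 means deterministic top-k")
    return [
        math.exp(-(rank - 1) / temperature)
        for rank in range(1, num_candidates + 1)
    ]


def sample_by_rank(ranked_candidates, temperature, rng=random):
    """Pick one candidate from a best-first list using the rank weights."""
    weights = rank_weights(len(ranked_candidates), temperature)
    return rng.choices(ranked_candidates, weights=weights, k=1)[0]


weights_t1 = rank_weights(3, temperature=1.0)    # gentle decay across ranks
weights_cold = rank_weights(3, temperature=0.1)  # sharp decay, close to greedy
picked = sample_by_rank(["a", "b", "c"], temperature=1.0, rng=random.Random(0))
```

Rank 1 always gets weight 1, and the ratio between consecutive ranks is `exp(-1 / temperature)`, which is why lower temperatures concentrate the probability mass on the top-ranked candidates.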
- Tune builds its candidate pool from the best-scoring candidate in each eligible lineage.
- Those lineage representatives are sorted by primary metric and sampled using the configured decay and temperature.
- With a single worker, one parent candidate is sampled from that ranked lineage-representative pool.
- With multiple workers, tune samples parent candidates jointly without replacement, so different workers are assigned different sampled parents whenever enough eligible lineages exist.
- When `branching.lineage_selection_temperature == 0`, tune uses deterministic selection:
  - one worker gets the single best candidate
  - multiple workers get the top `num_workers_tune` candidates by primary metric
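The tune rules above can be sketched as follows. This is an illustration under several assumptions: candidates are plain `(candidate_id, lineage_id, metric)` tuples, a higher metric is better, lineage eligibility filtering is omitted, the deterministic path ranks all candidates rather than only lineage representatives, and joint sampling is done by repeated weighted draws with each candidate keeping its original rank weight. The function name `select_tune_parents` is hypothetical.

```python
import math
import random


def select_tune_parents(candidates, num_workers, temperature, rng=random):
    """Return up to `num_workers` parent IDs for a tune round."""
    if temperature == 0:
        # Deterministic path: top candidates by primary metric.
        ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
        return [c[0] for c in ranked[:num_workers]]

    # Stochastic path: keep only the best-scoring candidate of each lineage.
    best_per_lineage = {}
    for cand in candidates:
        current = best_per_lineage.get(cand[1])
        if current is None or cand[2] > current[2]:
            best_per_lineage[cand[1]] = cand
    pool = sorted(best_per_lineage.values(), key=lambda c: c[2], reverse=True)

    # Index i corresponds to rank i + 1, so the weight is exp(-(rank - 1) / T).
    weights = [math.exp(-i / temperature) for i in range(len(pool))]

    chosen = []
    while pool and len(chosen) < num_workers:
        if sum(weights) > 0:
            idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        else:
            idx = 0  # weights underflowed to zero; take the best remaining
        chosen.append(pool.pop(idx)[0])  # without replacement
        weights.pop(idx)
    return chosen


cands = [("c1", "L1", 0.9), ("c2", "L1", 0.8), ("c3", "L2", 0.7), ("c4", "L3", 0.6)]
greedy = select_tune_parents(cands, num_workers=2, temperature=0)
sampled = select_tune_parents(cands, 3, temperature=1.0, rng=random.Random(0))
```

With three lineages and three workers, the stochastic path hands out each lineage representative exactly once; only the assignment order depends on the random draws.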
- Mutation evolve selects one parent from the lineage-aware ranked pool and behaves like tune parent selection.
- Crossover evolve builds a candidate pool from up to `crossover_candidates_per_lineage` candidates per lineage, sorted by primary metric.
- Crossover parent selection is sequential:
  - first sample one candidate from the ranked pool
  - then downweight candidates from the same lineage
  - then sample the second parent without replacement
- This bias reduces same-lineage crossover while still allowing it when the weighted pool strongly favors it.
- When `branching.lineage_selection_temperature == 0`, evolve uses deterministic selection:
  - use the top two candidates if available
  - otherwise mutate the top candidate
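The sequential crossover selection can be sketched like this for the stochastic case (`temperature > 0`). The pool is assumed to be a best-first list of `(candidate_id, lineage_id)` tuples, and `select_crossover_parents` is a hypothetical name; the sketch only illustrates the three documented steps.

```python
import math
import random


def select_crossover_parents(pool, temperature, same_lineage_penalty, rng=random):
    """Return (first_parent, second_parent) from a best-first ranked pool."""
    if len(pool) < 2:
        raise ValueError("crossover needs at least two candidates")

    # Step 1: sample the first parent with rank-based weights.
    weights = [math.exp(-i / temperature) for i in range(len(pool))]
    first_idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
    first = pool[first_idx]

    # Step 2: drop the first parent and downweight its lineage mates.
    remaining, remaining_weights = [], []
    for i, cand in enumerate(pool):
        if i == first_idx:
            continue
        weight = weights[i]
        if cand[1] == first[1]:
            weight *= same_lineage_penalty
        remaining.append(cand)
        remaining_weights.append(weight)

    # Step 3: sample the second parent from what is left.
    second = rng.choices(remaining, weights=remaining_weights, k=1)[0]
    return first, second


pool = [("a", "L1"), ("b", "L1"), ("c", "L2")]
pairs = [
    select_crossover_parents(pool, 1.0, 0.0, rng=random.Random(seed))
    for seed in range(50)
]
```

A penalty of `0.0`, as in the usage lines, rules out same-lineage pairs entirely for this pool, while the example config value of `0.5` merely halves their weight, so same-lineage crossover stays possible.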
- Warmup rounds always run `generate` for `round_index <= warmup_rounds`.
- After warmup, if `force_generate_every > 0`, CVEvolve forces a `generate` round whenever `(round_index - 1 - warmup_rounds)` is divisible by `force_generate_every`. These forced rounds take priority over every other rule, including suggested actions.
- Otherwise, CVEvolve honors the majority recent `suggested_next_action` among candidates whose performance level is at least `honor_suggestion_min_level`.
- Otherwise, CVEvolve computes an effective scheduled round index: `round_index - 1 - warmup_rounds - forced_generate_rounds_so_far`
- On that effective index, `tune` is eligible when:
  - `min_excellent_for_tune` excellent candidates exist, and
  - `tune_every > 0`, and
  - the effective index is divisible by `tune_every`
- If `tune` is not selected, `evolve` is eligible when:
  - `min_successful_for_evolve` moderate-or-better candidates exist, and
  - `evolve_every > 0`, and
  - the effective index is divisible by `evolve_every`
- If neither scheduled action is eligible, CVEvolve falls back to `branching.fallback_action`, which defaults to `generate`.
Performance buckets come from each candidate's structured analysis when available, and otherwise from its primary-metric ranking. For scheduler gating, "successful" means candidates classified as excellent, good, or moderate.
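Read top to bottom, the scheduling rules form a priority chain. The sketch below encodes that chain as a single function so the ordering is easy to check. It is an illustration, not CVEvolve's scheduler: `choose_action` is a hypothetical name, the candidate counts are passed in precomputed, and `majority_suggestion` stands in for the already-filtered majority `suggested_next_action` (or `None` when there is none to honor).

```python
def choose_action(
    round_index,
    warmup_rounds,
    force_generate_every,
    forced_generate_rounds_so_far,
    tune_every,
    evolve_every,
    num_excellent,
    num_successful,
    min_excellent_for_tune,
    min_successful_for_evolve,
    majority_suggestion=None,
    fallback_action="generate",
):
    # 1. Warmup rounds always generate.
    if round_index <= warmup_rounds:
        return "generate"

    offset = round_index - 1 - warmup_rounds

    # 2. Forced generate rounds take priority over everything else.
    if force_generate_every > 0 and offset % force_generate_every == 0:
        return "generate"

    # 3. Honor the majority suggestion when one is available.
    if majority_suggestion is not None:
        return majority_suggestion

    # 4. Scheduled actions on the effective round index.
    effective = offset - forced_generate_rounds_so_far
    if (
        num_excellent >= min_excellent_for_tune
        and tune_every > 0
        and effective % tune_every == 0
    ):
        return "tune"
    if (
        num_successful >= min_successful_for_evolve
        and evolve_every > 0
        and effective % evolve_every == 0
    ):
        return "evolve"

    # 5. Nothing eligible: fall back.
    return fallback_action


# Values taken from the example config, plus assumed candidate counts.
example = dict(
    warmup_rounds=3,
    force_generate_every=0,
    forced_generate_rounds_so_far=0,
    tune_every=3,
    evolve_every=2,
    num_excellent=1,
    num_successful=2,
    min_excellent_for_tune=1,
    min_successful_for_evolve=2,
)
schedule = [choose_action(round_index=r, **example) for r in range(1, 8)]
```

Under these assumptions the first seven rounds come out as three warmup `generate` rounds, then `tune`, `generate`, `evolve`, `tune`.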
- `tracking.enabled`: enable MLflow tracking for the session.
- `tracking.mlflow_tracking_uri`: MLflow tracking URI used to create or resume the run.
- `tracking.mlflow_experiment_name`: MLflow experiment name for the session.
- When enabled, CVEvolve creates one MLflow run per session and persists its `run_id` in `mlflow_run.json` under the session root.
- On `resume`, CVEvolve reuses that saved `run_id` and continues logging to the same MLflow run.
- After every completed round, CVEvolve logs the best candidate's primary metric to MLflow under the fixed metric name `metric`.
- It also logs the best primary-metric value seen so far under `best_metric_so_far`.
- It logs `round_type` as a numeric metric at each step:
  - `0` = baseline
  - `1` = generate
  - `2` = tune
  - `3` = evolve
- Preparation is not logged as a `round_type` metric.
- CVEvolve logs the session config as MLflow parameters, uploads `config.snapshot.yaml` as an artifact, and stores the actual primary metric name as MLflow metadata via the `cvevolve.actual_metric_name` tag.
- When the session has completed, CVEvolve also logs the best candidate source file as an MLflow artifact and tags its candidate ID.
- At session start, CVEvolve records the current CVEvolve git commit hash into `workspace/cvevolve_git_commit.txt`. When present, runtime tracking and `upload` also log it to MLflow via the `cvevolve.git_commit` tag.
- The `upload` command does not require a stored git commit file.
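As a rough picture of the three per-round series (`metric`, `best_metric_so_far`, `round_type`), the sketch below builds them from a list of completed rounds without touching MLflow. It assumes that `metric` is the best value within each round, that steps are numbered from 0, and that the metric direction is known; the function name `build_round_series` and the input shape are hypothetical. Only the numeric `round_type` codes are taken from the list above.

```python
ROUND_TYPE_CODES = {"baseline": 0, "generate": 1, "tune": 2, "evolve": 3}


def build_round_series(rounds, direction="maximize"):
    """rounds: list of (round_type, best_metric_in_round), one per completed round."""
    better = max if direction == "maximize" else min
    series = []
    best_so_far = None
    for step, (round_type, value) in enumerate(rounds):
        # Running best across all rounds seen so far.
        best_so_far = value if best_so_far is None else better(best_so_far, value)
        series.append(
            {
                "step": step,
                "metric": value,
                "best_metric_so_far": best_so_far,
                "round_type": ROUND_TYPE_CODES[round_type],
            }
        )
    return series


series = build_round_series(
    [("baseline", 0.5), ("generate", 0.7), ("tune", 0.6)]
)
```

Plotting `metric` against `best_metric_so_far` makes it easy to see rounds that did not improve on the running best, and the `round_type` series shows which kind of round produced each step.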
Hutch tracking is optional and disabled by default. The base CVEvolve install does not depend on Hutch.
To enable native Hutch events, install the optional extra once `hutch>=0.1.1` is available in your environment:

uv sync --extra hutch

Then either set `hutch.enabled: true` in the config or export `CVEVOLVE_HUTCH=1`.
- `hutch.project`: Hutch project name. Defaults to `cvevolve`.
- `hutch.run_id`: optional Hutch run ID override. Defaults to `cvevolve-<session-name>`.
- `hutch.daemon_url`: optional Hutch daemon URL. Falls back to `HUTCH_DAEMON_URL`.
- `hutch.db_path`: optional embedded Hutch DuckDB path. Falls back to `HUTCH_DB_PATH`.
- `hutch.strict`: when true, Hutch tracking failures fail the CVEvolve run. The default is false, so tracking problems emit warnings and the search continues.
Native Hutch tracking emits run updates, candidates, operators, primary and holdout metrics, candidate failures, and the terminal run end. It uses the same deterministic event IDs as Hutch's CVEvolve polling adapter, so a watched/imported session and a native-instrumented session can be deduplicated by Hutch.
The task prompt is a separate markdown file. Example:
1. Introduction
Find the optimal image registration algorithm that attains the best accuracy on the test data.
2. Test data
The registration is translation-only. You can find the test data under `images/` and the labels under `labels/` relative to the configured `data_dir`.

When authoring the task prompt markdown:

- Important: Do not include any absolute path. When describing test data, always use relative paths under the configured `data_dir`. If you want the agent to see a file, make sure it is copied under `data_dir`.
- Avoid markdown section header markers such as `#` because your prompt is injected into a larger prompt template and those headings can interfere with the outer structure. Prefer numbered section titles such as `1. Introduction`, `2. Test data`, and `3. Evaluation`.
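The two authoring rules above are easy to check mechanically before starting a session. The helper below is a hypothetical pre-flight check, not part of CVEvolve: it flags lines that start with a markdown heading marker and lines that appear to contain a POSIX-style absolute path. The regular expressions are heuristics and will not catch every case (Windows paths, for example, are not covered).

```python
import re

# A line that starts with one to six '#' characters followed by space or end.
HEADING = re.compile(r"^ {0,3}#{1,6}(\s|$)")
# A '/' that starts a path, i.e. not preceded by a word character, '.', ':',
# '/', or '~' (this skips relative paths, URLs, and fractions like 1/2).
ABSOLUTE_PATH = re.compile(r"(?<![\w.:/~])/[\w.-]+")


def lint_task_prompt(text):
    """Return a list of (line_number, problem) tuples for a task prompt."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if HEADING.match(line):
            problems.append((lineno, "markdown heading marker"))
        if ABSOLUTE_PATH.search(line):
            problems.append((lineno, "absolute path"))
    return problems


good_prompt = (
    "1. Introduction\n"
    "Find the optimal image registration algorithm.\n"
    "2. Test data\n"
    "You can find the test data under `images/` and the labels under `labels/`.\n"
)
bad_prompt = "# Introduction\nThe data lives in /home/me/data\n"

good_report = lint_task_prompt(good_prompt)
bad_report = lint_task_prompt(bad_prompt)
```

Running the check on the example prompt reports nothing, while the second prompt is flagged once for the heading and once for the absolute path.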
Before running a session, export the API-related environment variables the agent may need:
export OPENAI_API_KEY=...
export OPENAI_BASE_URL=... # optional, only for OpenAI-compatible providers
export SEMANTIC_SCHOLAR_API_KEY=... # optional, improves Semantic Scholar rate limits
export TAVILY_API_KEY=... # required if you want the Tavily tool available

If your config sets `model.api_key_env_var` to a different variable name, export that variable instead of `OPENAI_API_KEY` before starting the run.

uv run cvevolve run --config ./config.yaml --prompt ./task.md

To enable holdout testing, set `workspace.holdout_data_dir` in the config and pass a prompt describing that directory:

uv run cvevolve run \
  --config ./config.yaml \
  --prompt ./task.md \
  --holdout-test-prompt ./holdout_test.md

To enable tool-call logging for a run or resume, pass `--enable-logging` on the command line:

uv run cvevolve run --config ./config.yaml --prompt ./task.md --enable-logging

To resume an interrupted session from its existing session directory:

uv run cvevolve resume --session ./sessions/<name>

uv run cvevolve resume \
  --session ./sessions/<name> \
  --enable-logging

To upload an existing session's DB-backed history to MLflow without rerunning the search:

uv run cvevolve upload ./sessions/<name> \
  --mlflow-tracking-uri file:///tmp/mlruns \
  --mlflow-experiment-name CVEvolve

The `upload` command reads `config.snapshot.yaml` and `history/search_history.sqlite`, takes the MLflow destination from the command line, and works even if the session has not finished and without relying on files under `reports/`.

By default, outputs are written under ./sessions/<name>/:
- `workspace/prompt/task_prompt.md`
- `workspace/cvevolve_git_commit.txt`
- `workspace/skills/user/...` for any configured user skills
- `workspace/candidates/...`
- `config.snapshot.yaml`
- `mlflow_run.json` when MLflow tracking is enabled
- `history/search_history.sqlite`
- `history/tool_calls.sqlite` when `--enable-logging` is enabled
- `exports/candidates.csv`
- `exports/metrics.csv`
- `exports/evaluation_metrics.csv`
- `exports/holdout_test_metrics.csv`
- `exports/evolution_tree.csv`
- `exports/rounds.csv`
- `reports/final_report.md`
- `reports/final_summary.json`
- `reports/best_candidate.py`
- Set the model provider key in `OPENAI_API_KEY`, or in whatever variable name you configured in `model.api_key_env_var`. This one is required.
- Set `OPENAI_BASE_URL` if your OpenAI-compatible provider requires a custom base URL.
- Set `SEMANTIC_SCHOLAR_API_KEY` if you want the Semantic Scholar tool to use the authenticated API.
- Set `TAVILY_API_KEY` if you want the Tavily search tool to be usable during a run.
- The agent is instructed to use the shell tools available in the runtime plus `run_python_inline`.
- Shell commands are stateless. Commands that depend on an environment must be written as one-shot commands, and `uv run` or `.venv/bin/python` should be preferred.
- The agent is expected to stay inside the session workspace and copy any needed inputs into it before operating on them.
- Dangerous shell commands trigger a terminal approval prompt.
- Session resume is round-level. If interruption happens mid-round, `resume` reruns that round from the beginning rather than attempting mid-tool continuation.
- The agent works inside a sandboxed session workspace, and normal filesystem tools cannot read or write files outside that sandbox.
- However, this is NOT a complete security boundary. An agent may still try to bypass the restriction by writing and executing a Python script that accesses files outside the sandbox.
- If you need stronger isolation, run CVEvolve inside a container.
CVEvolve's generate-tune-evolve search strategy is adapted from Pty-Chi-Evolve.
- Pty-Chi-Evolve paper: https://arxiv.org/html/2603.05696v1