Commit c7a2b59
Merge pull request #4 from renderinc/sync-skills
Sync skills from render-oss/skills
2 parents f928bb2 + 614706a commit c7a2b59

72 files changed

Lines changed: 7416 additions & 225 deletions

Lines changed: 115 additions & 0 deletions
---
name: render-background-workers
description: >-
  Sets up and configures background workers on Render for queue-based job
  processing. Use when the user needs to process async jobs, consume from a
  queue, run Celery/Sidekiq/BullMQ/Asynq/Oban workers, handle graceful
  shutdown with SIGTERM, wire a worker to Key Value (Redis), or choose between
  workers and cron jobs for background work.
  Trigger terms: background worker, async jobs, queue consumer, Celery,
  Sidekiq, BullMQ, Asynq, Oban, job processing, SIGTERM, graceful shutdown.
license: MIT
compatibility: Render background worker services
metadata:
  author: Render
  version: "1.0.0"
  category: compute
---

# Render Background Workers

This skill explains **worker** services on Render: processes that **consume jobs from a queue** instead of serving HTTP. Pair with **render-blueprints**, **render-env-vars**, and **render-networking** when wiring `render.yaml` and private connectivity.

## When to Use

- Designing or debugging **queue-backed workers** (Celery, Sidekiq, BullMQ, Asynq, etc.)
- Choosing between a **worker**, **Cron Job**, or **Workflow** for background work
- Configuring **Render Key Value** as a **broker** (not a cache) with the correct **eviction policy**
- Implementing **graceful shutdown** so in-flight jobs are not lost on deploy

Per-framework setup and signal-handling detail: `references/queue-framework-setup.md`, `references/graceful-shutdown.md`.

## How Workers Work

- **Long-running services** with **no inbound (HTTP) traffic**. Render does not expose a public URL or internal hostname for workers the way it does for web or private services—**workers cannot receive private network traffic directed at them**.
- The typical pattern is a **poll loop**: the process connects to a **queue backend** (often **Render Key Value**, Redis-compatible **Valkey 8**) and **pulls jobs**.
- Workers **can initiate outbound connections** on the private network—to **PostgreSQL**, **Key Value**, **private services**, **web services** (internal URLs), and the public internet—subject to your plan and firewall rules.

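In framework-free terms, the poll loop can be sketched as below. This is an illustrative sketch only: `pop_job` stands in for your framework's blocking fetch (for example a Redis `BRPOP`), and `max_idle_polls` exists only to make the demo terminate, since a real worker loops until told to stop.

```python
import time

def pop_job(queue):
    # Stands in for a blocking queue fetch (e.g. Redis BRPOP via your framework).
    return queue.pop(0) if queue else None

def worker_loop(queue, handle, max_idle_polls=3):
    # A real worker loops forever; max_idle_polls bounds this sketch for demos.
    idle = 0
    while idle < max_idle_polls:
        job = pop_job(queue)
        if job is None:
            idle += 1
            time.sleep(0.01)  # back off briefly when the queue is empty
            continue
        idle = 0
        handle(job)  # run the job to completion before fetching the next
```

Queue frameworks hide this loop behind their worker commands, but the shape is the same: fetch, execute, repeat.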
## Queue Framework Overview

| Framework | Language | Queue backend | Notes |
|-----------|----------|---------------|-------|
| Celery | Python | Redis / Key Value | Most common Python task queue |
| Sidekiq | Ruby | Redis / Key Value | Standard for Rails |
| BullMQ | Node.js | Redis / Key Value | Modern Node queue (Redis-based) |
| Asynq | Go | Redis / Key Value | Go async task processing |
| Oban | Elixir | **Postgres** (not Redis) | Queue stored in the database |

## Pairing with Key Value

- Use **Render Key Value** as the **job broker** when your framework expects Redis.
- Set the **maxmemory policy** to **`noeviction`**. **`allkeys-lru`** and similar policies are for **caches**; evicting queue keys **drops jobs**.
- Wire **`REDIS_URL`** (or your framework’s equivalent) via **`fromService`** with `type: keyvalue` and `property: connectionString` in the Blueprint.
- **Blueprints require `ipAllowList`** on Key Value—include the CIDRs that should reach the instance (often `[]` for private-network-only access; see **render-blueprints** / Key Value field reference).

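Most frameworks consume the injected connection string whole; if one ever needs the host and port separately, the standard library can split them. A minimal sketch, assuming `REDIS_URL` holds a `redis://` URL (the fallback value below is a placeholder, not a real instance):

```python
import os
from urllib.parse import urlparse

# Assumption: REDIS_URL is injected by the Blueprint's fromService wiring.
# The fallback here is a placeholder for local runs, not a real hostname.
url = urlparse(os.environ.get("REDIS_URL", "redis://placeholder-host:6379"))
broker_host = url.hostname
broker_port = url.port
```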
See `references/queue-framework-setup.md` for minimal app + YAML examples.

## Worker vs Cron vs Workflow

| Need | Use | Why |
|------|-----|-----|
| Always-on queue consumer | **Background Worker** | Polls continuously; long-lived process |
| Periodic scheduled task | **Cron Job** | Runs on a schedule, **exits**; **12h max** per run |
| Distributed parallel compute | **Workflow** | Each run gets its own instance; fan-out patterns |
| High-volume or bursty jobs | **Workflow** | Scales per run; **no idle instance cost** between runs |

## Graceful Shutdown

- Before stopping an instance, Render sends **`SIGTERM`**, then waits up to **`maxShutdownDelaySeconds`** (**1–300**, **default 30**) before **`SIGKILL`**.
- Workers should: **(1)** stop accepting new jobs, **(2)** finish the current job or **checkpoint** progress, **(3)** close connections, **(4)** exit **0**.
- Set **`maxShutdownDelaySeconds`** to at least your **longest safe job duration** (see Dashboard or Blueprint).

Language- and framework-specific handlers: `references/graceful-shutdown.md`.

## Blueprint Configuration

Minimal pattern: **`type: worker`**, **`runtime`**, **`buildCommand`**, **`startCommand`**, and **`envVars`** wired from Key Value.

```yaml
services:
  - type: keyvalue
    name: jobs
    plan: starter
    region: oregon
    ipAllowList: []

  - type: worker
    name: task-worker
    runtime: python
    region: oregon
    plan: starter
    buildCommand: pip install -r requirements.txt
    startCommand: celery -A tasks worker --loglevel=info
    envVars:
      - key: REDIS_URL
        fromService:
          name: jobs
          type: keyvalue
          property: connectionString
```

Optional: set **`maxShutdownDelaySeconds`** on the worker service for longer-draining jobs.

## References

| Topic | File |
|-------|------|
| Celery, Sidekiq, BullMQ, Asynq, Oban setup + YAML | `references/queue-framework-setup.md` |
| SIGTERM, `maxShutdownDelaySeconds`, per-language patterns | `references/graceful-shutdown.md` |

## Related Skills

- **render-deploy** — First deploy, CLI, service creation
- **render-blueprints** — Full `render.yaml` schema, `fromService`, projects
- **render-networking** — Private URLs, what can call what
- **render-scaling** — Worker plans, instance counts, limits
Lines changed: 57 additions & 0 deletions
{
  "skill_name": "render-background-workers",
  "models": [
    "claude-sonnet-4-6",
    "gpt-5.4",
    "gemini-3.1-pro-preview"
  ],
  "evals": [
    {
      "id": 1,
      "prompt": "I need to set up a BullMQ worker in Node.js on Render to process async jobs. I want it connected to a Redis-compatible queue. Can you give me a complete render.yaml blueprint and explain the Key Value eviction policy I should use?",
      "expected_output": "The agent provides a complete render.yaml with a Key Value service (ipAllowList, noeviction policy note) and a worker service (type: worker, runtime: node, buildCommand: npm ci, startCommand: node worker.js) with REDIS_URL wired via fromService. It explains that maxmemory-policy must be set to noeviction to prevent queue keys from being evicted.",
      "files": [],
      "integrations": {},
      "assertions": [
        "The response includes a render.yaml with type: keyvalue and type: worker services",
        "The worker service uses runtime: node and startCommand referencing node worker.js or equivalent",
        "REDIS_URL is wired using fromService with type: keyvalue and property: connectionString",
        "The response explicitly states the Key Value maxmemory policy must be set to noeviction",
        "The Key Value service includes an ipAllowList field",
        "The response explains that allkeys-lru or similar eviction policies are for caches and will drop jobs"
      ]
    },
    {
      "id": 2,
      "turns": [
        "I want to run background jobs on Render. Should I use a background worker or a cron job?",
        "The jobs are triggered by user actions and need to be processed as soon as possible. There could be bursts of hundreds of jobs at once."
      ],
      "expected_output": "The agent first asks clarifying questions about the job trigger pattern (scheduled vs event-driven) and volume. After the user clarifies that jobs are user-triggered and bursty, the agent recommends a Background Worker for always-on queue consumption, or a Workflow for high-volume bursty scenarios, explaining the tradeoffs. It explains cron jobs are for scheduled periodic tasks that exit, not queue consumers.",
      "files": [],
      "integrations": {},
      "assertions": [
        "In the first turn, the agent asks at least one clarifying question about the job trigger pattern or volume",
        "After the user clarifies bursty user-triggered jobs, the agent recommends a Background Worker or Workflow",
        "The agent explains that Cron Jobs are for scheduled periodic tasks and have a 12-hour max run limit",
        "The agent mentions that Workflows scale per run and have no idle instance cost, making them suitable for bursty workloads",
        "The agent distinguishes between always-on queue consumers (Background Worker) and scheduled tasks (Cron Job)"
      ]
    },
    {
      "id": 3,
      "prompt": "My Celery workers on Render are losing jobs during deploys. How do I implement graceful shutdown? I need jobs that can take up to 2 minutes to finish. Show me the signal handler code and the render.yaml config.",
      "expected_output": "The agent explains Render's SIGTERM -> wait -> SIGKILL sequence, provides a Python SIGTERM signal handler using signal.signal, recommends setting maxShutdownDelaySeconds to at least 120 (2 minutes) plus buffer, and shows the render.yaml worker config with maxShutdownDelaySeconds set appropriately. It also mentions Celery's worker_shutting_down lifecycle signal and the importance of idempotent tasks.",
      "files": [],
      "integrations": {},
      "assertions": [
        "The response explains that Render sends SIGTERM first, waits maxShutdownDelaySeconds, then sends SIGKILL",
        "The response recommends setting maxShutdownDelaySeconds to at least 120 seconds (2 minutes) to cover the longest job duration",
        "The response includes a Python signal handler using signal.signal(signal.SIGTERM, ...) or references Celery's worker_shutting_down signal",
        "The render.yaml snippet includes maxShutdownDelaySeconds on the worker service set to 120 or higher",
        "The response advises stopping the consumer loop (stop dequeuing) as the first step on SIGTERM",
        "The response mentions job idempotency or checkpointing to handle retries safely"
      ]
    }
  ]
}
Lines changed: 79 additions & 0 deletions
# Graceful shutdown for Render workers

Render stops worker instances during **deploys**, **manual restarts**, and **scale-in** events. Your process must **exit cleanly** within the configured window or work may be **lost** or **duplicated** (depending on your queue’s ack/retry semantics).

## Platform behavior

1. Render sends **`SIGTERM`** to your process.
2. The platform waits up to **`maxShutdownDelaySeconds`** (**1–300**, **default 30**).
3. If the process is still running, Render sends **`SIGKILL`** (not catchable).

Configure **`maxShutdownDelaySeconds`** in the **Dashboard** (service settings) or in **`render.yaml`** on the worker service. Set it to cover the **longest job** you are willing to let complete during shutdown (plus buffer for flushing metrics, closing DB pools, etc.).

## General pattern

1. **Stop accepting new jobs** — stop the consumer loop, pause polling, or drain the framework’s internal fetch.
2. **Finish the current job** or **checkpoint** durable progress so another worker can resume safely.
3. **Close connections** — Redis/Postgres pools, HTTP clients.
4. **Exit with code 0** when done.

## Python

**Low-level handler**

```python
import signal

shutting_down = False  # checked by the main loop

def handle_sigterm(signum, frame):
    # set a flag; the main loop checks it and stops dequeuing
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)
```

**Celery** — use lifecycle signals such as **`worker_shutting_down`** to run cleanup; ensure tasks honor a **soft time limit** or cooperative cancel flag so shutdown can finish within **`maxShutdownDelaySeconds`**.

## Ruby (Sidekiq)

Sidekiq **handles SIGTERM** by default: it stops fetching new work and waits for in-flight jobs up to a **configurable timeout** (`:timeout` in Sidekiq options, in seconds). Align that timeout with Render’s **`maxShutdownDelaySeconds`** (the Sidekiq timeout should be **at most** the platform delay minus a small margin).

## Node.js

```javascript
let accept = true;

process.on("SIGTERM", async () => {
  accept = false;
  await worker.close(); // BullMQ: stops accepting, waits for active jobs
  process.exit(0);
});
```

**BullMQ** — prefer **`worker.close()`** (and **`queue.close()`** where applicable) so active jobs complete per library defaults; tune **`stalledInterval`** / job locks if you need stricter bounds.

## Go

Use **`signal.NotifyContext`** (or `signal.Notify` + `context.WithCancel`) to cancel a root context passed into your consumer loop and job handlers; wait on **`sync.WaitGroup`** or channels until in-flight work finishes, then exit.

```go
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
defer stop()
// run consumer until ctx.Done(), then drain workers
```

## Anti-patterns

- **Ignoring SIGTERM** — the process survives until **`SIGKILL`**, often **mid-job**, causing **lost work** or **stuck** queue entries.
- **`maxShutdownDelaySeconds` too low** for your p95 job duration — frequent **hard kills** and retries.
- **No idempotency** — if a job is retried after an ambiguous failure at shutdown, **duplicate side effects** can occur.

## Checklist

| Item | Action |
|------|--------|
| Delay | Set **`maxShutdownDelaySeconds`** ≥ the longest graceful completion you need |
| Consumer | On SIGTERM, **stop dequeuing** first |
| Jobs | **Idempotent** handlers or explicit **checkpoints** |
| Framework | Use built-in **drain** / **close** APIs (Sidekiq, BullMQ, Celery signals) where available |