fix(memory): prune unbounded server-side caches causing heap growth#4651
fix(memory): prune unbounded server-side caches causing heap growth#4651waleedlatif1 wants to merge 2 commits into
Conversation
toolSchemaCache (lib/copilot/chat/payload.ts): module-level Map keyed by userId:workspaceId never deleted expired entries, only checked TTL on read. With 100K+ unique user/workspace pairs each holding 50-200KB of tool schemas, this was the primary driver of the 24MB -> 25GB heap growth observed in CloudWatch. Add a setInterval sweep every 30s (matching the TTL) with .unref() so it does not prevent graceful shutdown. semaphores (lib/core/async-jobs/backends/database.ts): acquireSlot created Semaphore entries that releaseSlot never deleted. With per-execution UUID keys (e.g. scheduleJobId), each scheduled workflow run would add a permanent entry. Store the concurrency limit on the Semaphore struct and delete the entry from the Map when all slots are free and no waiters remain. validatorCache (lib/copilot/tools/server/generated-schema.ts): validated as bounded (93 tools x 2 schema kinds = 186 max entries, ~2-9MB). No fix needed. isolated-vm nativeContexts: validated as deferred GC, self-healed by worker rotation at MAX_EXECUTIONS_PER_WORKER=200. externalMB spikes trace to concurrent isolate heaps at peak load (128MB limit x active isolates), not a reference leak. No fix needed.
…lemetry effectiveEnvCache (lib/environment/utils.ts): same unbounded accumulation pattern as toolSchemaCache — module-level Map keyed by userId:workspaceId with a 15s TTL that is only checked on read, never proactively evicted. Adds a periodic sweep matching the TTL interval with .unref(). cache-registry (lib/monitoring/cache-registry.ts): lightweight registry so modules can expose their cache sizes to telemetry without coupling. toolSchemaCache and effectiveEnvCache both register on module load. memory-telemetry: emits cacheSizes in every Memory snapshot log so CloudWatch can confirm the caches stay bounded post-deploy.
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
PR SummaryMedium Risk Overview Updates async-job inline concurrency bookkeeping to delete idle semaphore entries once fully released, and extends Reviewed by Cursor Bugbot for commit 6766bd2. Configure here. |
Greptile SummaryThis PR fixes three server-side memory leak patterns — unbounded TTL-only caches (
Confidence Score: 3/5The semaphore and effectiveEnvCache fixes are solid, but the toolSchemaCache sweep has a correctness gap that can cause duplicate concurrent schema fetches for the same user under slow builds. The apps/sim/lib/copilot/chat/payload.ts — the sweep interval needs the Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["Cache Miss"] --> B["Create entry with promise + expiresAt"]
B --> C["Promise in-flight"]
C --> D{"Sweep fires (setInterval)"}
D -->|"effectiveEnvCache !entry.promise guard"| E["Entry preserved until promise resolves"]
D -->|"toolSchemaCache no !entry.promise guard"| F["Entry deleted if expiresAt <= now"]
F --> G["New caller triggers duplicate parallel fetch"]
E --> H["Promise resolves - value entry written"]
H --> I{"Next sweep"}
I -->|"expiresAt <= now"| J["Entry evicted"]
I -->|"expiresAt > now"| K["Entry retained"]
Reviews (1): Last reviewed commit: "fix(memory): prune effectiveEnvCache and..." | Re-trigger Greptile |
| setInterval(() => { | ||
| const now = Date.now() | ||
| for (const [key, entry] of toolSchemaCache) { | ||
| if (entry.expiresAt <= now) toolSchemaCache.delete(key) | ||
| } | ||
| }, TOOL_SCHEMA_CACHE_TTL_MS).unref() |
There was a problem hiding this comment.
The
toolSchemaCache sweep lacks the !entry.promise guard that the parallel effectiveEnvCache sweep in utils.ts correctly includes. When a schema-build request is in-flight and the 30s interval fires before the async work completes (e.g. cold DB connection, slow permission lookup), the promise-only cache entry is deleted. Any new caller arriving after that deletion creates a second parallel fetch for the same key, causing a thundering-herd on exactly the large, expensive computation this cache is meant to deduplicate.
| setInterval(() => { | |
| const now = Date.now() | |
| for (const [key, entry] of toolSchemaCache) { | |
| if (entry.expiresAt <= now) toolSchemaCache.delete(key) | |
| } | |
| }, TOOL_SCHEMA_CACHE_TTL_MS).unref() | |
| setInterval(() => { | |
| const now = Date.now() | |
| for (const [key, entry] of toolSchemaCache) { | |
| if (!entry.promise && entry.expiresAt <= now) toolSchemaCache.delete(key) | |
| } | |
| }, TOOL_SCHEMA_CACHE_TTL_MS).unref() |
| export function getCacheSizes(): Record<string, number> { | ||
| const sizes: Record<string, number> = {} | ||
| for (const [name, getSize] of registry) { | ||
| sizes[name] = getSize() | ||
| } | ||
| return sizes | ||
| } |
There was a problem hiding this comment.
If any registered
getSize callback throws (e.g. a future cache with a non-trivial getter), getCacheSizes() bubbles the error up to the logger.info call in memory-telemetry.ts, preventing the entire memory snapshot from being written. Wrapping each call in a try/catch keeps telemetry working even if one getter is broken.
| export function getCacheSizes(): Record<string, number> { | |
| const sizes: Record<string, number> = {} | |
| for (const [name, getSize] of registry) { | |
| sizes[name] = getSize() | |
| } | |
| return sizes | |
| } | |
| export function getCacheSizes(): Record<string, number> { | |
| const sizes: Record<string, number> = {} | |
| for (const [name, getSize] of registry) { | |
| try { | |
| sizes[name] = getSize() | |
| } catch { | |
| sizes[name] = -1 | |
| } | |
| } | |
| return sizes | |
| } |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 6766bd2. Configure here.
| for (const [key, entry] of toolSchemaCache) { | ||
| if (entry.expiresAt <= now) toolSchemaCache.delete(key) | ||
| } | ||
| }, TOOL_SCHEMA_CACHE_TTL_MS).unref() |
There was a problem hiding this comment.
Missing promise guard in toolSchemaCache sweep causes deduplication bypass
Medium Severity
The toolSchemaCache sweep deletes entries solely based on expiresAt <= now, but the effectiveEnvCache sweep correctly guards with !entry.promise && entry.expiresAt <= now. The toolSchemaCache is missing this !entry.promise check. If the async tool-schema build takes longer than 30 seconds (possible under load with multiple DB queries and dynamic imports), the sweep will delete the in-flight promise entry, defeating request deduplication and causing redundant expensive computations.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 6766bd2. Configure here.
|
Closing to reopen with updated scope — lru-cache migration was added to the same branch after this PR was created. |


Summary
toolSchemaCache(lib/copilot/chat/payload.ts): module-level Map keyed byuserId:workspaceIdhad a 30s TTL that was only checked on read — expired entries were never deleted. With ~7.5MB per entry (2,612 tool schemas) and hundreds of thousands of unique user/workspace pairs, this was accumulating gigabytes of heap over a server instance lifetime. Adds asetIntervalsweep every 30s with.unref()to evict expired entrieseffectiveEnvCache(lib/environment/utils.ts): same pattern — 15s TTL, same unbounded accumulation, same fix appliedsemaphores(lib/core/async-jobs/backends/database.ts):releaseSlotnever deleted entries from the Map. Added idle-state deletion whenavailable === limit && waiters.length === 0. Safe under JS's single-threaded event loop; validated thatacquireSlotalways re-creates the entry with the correct limit if it's missingcache-registry(lib/monitoring/cache-registry.ts): new lightweight registry so any module can expose its cache size to telemetry without coupling. Both caches register on module loadmemory-telemetry: emitscacheSizesin every Memory snapshot log — confirms caches stay bounded in CloudWatch post-deployType of Change
Testing
Validated with CloudWatch data spanning 7 days of production memory snapshots. Math confirmed: ~7.5MB/entry × thousands of unique users = tens of GB of accumulated heap consistent with observed 24MB → 25GB growth pattern. All fixes independently reviewed for correctness (Map iteration safety,
.unref()behavior, semaphore race conditions under async scheduling).Checklist