Skip to content

fix(memory): prune unbounded server-side caches causing heap growth#4651

Closed
waleedlatif1 wants to merge 2 commits into
stagingfrom
fix/memory-leaks
Closed

fix(memory): prune unbounded server-side caches causing heap growth#4651
waleedlatif1 wants to merge 2 commits into
stagingfrom
fix/memory-leaks

Conversation

@waleedlatif1
Copy link
Copy Markdown
Collaborator

Summary

  • toolSchemaCache (lib/copilot/chat/payload.ts): module-level Map keyed by userId:workspaceId had a 30s TTL that was only checked on read — expired entries were never deleted. With ~7.5MB per entry (2,612 tool schemas) and hundreds of thousands of unique user/workspace pairs, this was accumulating gigabytes of heap over a server instance lifetime. Adds a setInterval sweep every 30s with .unref() to evict expired entries
  • effectiveEnvCache (lib/environment/utils.ts): same pattern — 15s TTL, same unbounded accumulation, same fix applied
  • semaphores (lib/core/async-jobs/backends/database.ts): releaseSlot never deleted entries from the Map. Added idle-state deletion when available === limit && waiters.length === 0. Safe under JS's single-threaded event loop; validated that acquireSlot always re-creates the entry with the correct limit if it's missing
  • cache-registry (lib/monitoring/cache-registry.ts): new lightweight registry so any module can expose its cache size to telemetry without coupling. Both caches register on module load
  • memory-telemetry: emits cacheSizes in every Memory snapshot log — confirms caches stay bounded in CloudWatch post-deploy

Type of Change

  • Bug fix

Testing

Validated with CloudWatch data spanning 7 days of production memory snapshots. Math confirmed: ~7.5MB/entry × thousands of unique users = tens of GB of accumulated heap consistent with observed 24MB → 25GB growth pattern. All fixes independently reviewed for correctness (Map iteration safety, .unref() behavior, semaphore race conditions under async scheduling).

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

toolSchemaCache (lib/copilot/chat/payload.ts): module-level Map keyed by
userId:workspaceId never deleted expired entries, only checked TTL on read.
With 100K+ unique user/workspace pairs each holding 50-200KB of tool schemas,
this was the primary driver of the 24MB -> 25GB heap growth observed in
CloudWatch. Add a setInterval sweep every 30s (matching the TTL) with .unref()
so it does not prevent graceful shutdown.

semaphores (lib/core/async-jobs/backends/database.ts): acquireSlot created
Semaphore entries that releaseSlot never deleted. With per-execution UUID keys
(e.g. scheduleJobId), each scheduled workflow run would add a permanent entry.
Store the concurrency limit on the Semaphore struct and delete the entry from
the Map when all slots are free and no waiters remain.

validatorCache (lib/copilot/tools/server/generated-schema.ts): validated as
bounded (93 tools x 2 schema kinds = 186 max entries, ~2-9MB). No fix needed.

isolated-vm nativeContexts: validated as deferred GC, self-healed by worker
rotation at MAX_EXECUTIONS_PER_WORKER=200. externalMB spikes trace to
concurrent isolate heaps at peak load (128MB limit x active isolates), not a
reference leak. No fix needed.
…lemetry

effectiveEnvCache (lib/environment/utils.ts): same unbounded accumulation
pattern as toolSchemaCache — module-level Map keyed by userId:workspaceId
with a 15s TTL that is only checked on read, never proactively evicted.
Adds a periodic sweep matching the TTL interval with .unref().

cache-registry (lib/monitoring/cache-registry.ts): lightweight registry
so modules can expose their cache sizes to telemetry without coupling.
toolSchemaCache and effectiveEnvCache both register on module load.

memory-telemetry: emits cacheSizes in every Memory snapshot log so
CloudWatch can confirm the caches stay bounded post-deploy.
@vercel
Copy link
Copy Markdown

vercel Bot commented May 18, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped May 18, 2026 5:55pm

Request Review

@cursor
Copy link
Copy Markdown

cursor Bot commented May 18, 2026

PR Summary

Medium Risk
Medium risk because it changes runtime cache eviction behavior and async-job concurrency semaphore lifecycle, which could affect performance or inline job scheduling if incorrect. Changes are localized and add observability to validate bounded cache growth in production.

Overview
Fixes unbounded growth of module-level in-memory caches by adding periodic TTL sweeps (with unref() timers) for toolSchemaCache and effectiveEnvCache, and exposes their sizes via a new cache-registry.

Updates async-job inline concurrency bookkeeping to delete idle semaphore entries once fully released, and extends memory-telemetry to log cacheSizes on each snapshot for ongoing heap-growth monitoring.

Reviewed by Cursor Bugbot for commit 6766bd2. Configure here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 18, 2026

Greptile Summary

This PR fixes three server-side memory leak patterns — unbounded TTL-only caches (toolSchemaCache, effectiveEnvCache), and an ever-growing semaphore Map — and adds a cache-registry + memory-telemetry integration to confirm they stay bounded in production.

  • toolSchemaCache / effectiveEnvCache: Both gain a setInterval sweep with .unref() to evict expired entries; effectiveEnvCache correctly skips in-flight promise entries, but toolSchemaCache is missing the same guard (see inline comment).
  • semaphores: releaseSlot now deletes idle entries when available === limit and no waiters remain; acquireSlot recreates the entry on demand, making this safe.
  • cache-registry + telemetry: New registerCache hook lets both caches expose their live size to the periodic memory snapshot log.

Confidence Score: 3/5

The semaphore and effectiveEnvCache fixes are solid, but the toolSchemaCache sweep has a correctness gap that can cause duplicate concurrent schema fetches for the same user under slow builds.

The toolSchemaCache sweep deletes entries whose only content is an in-flight promise if the async schema-build takes longer than 30 seconds before the interval fires. The effectiveEnvCache sweep was written with a !entry.promise guard for exactly this reason — its absence in payload.ts is a real defect on the changed path that undermines one of the two main cache fixes in this PR. The semaphore and telemetry changes are independently correct.

apps/sim/lib/copilot/chat/payload.ts — the sweep interval needs the !entry.promise guard before this is safe to deploy.

Important Files Changed

Filename Overview
apps/sim/lib/copilot/chat/payload.ts Adds a 30s sweep interval for toolSchemaCache, but is missing the !entry.promise guard present in the equivalent effectiveEnvCache sweep — in-flight entries can be evicted, triggering duplicate parallel fetches.
apps/sim/lib/core/async-jobs/backends/database.ts Adds limit field to Semaphore and deletes idle entries in releaseSlot; logic is correct — deletion only happens when available === limit and no waiters remain, and acquireSlot safely re-creates the entry if needed.
apps/sim/lib/environment/utils.ts Correctly adds sweep interval with !entry.promise guard and registers the cache in the registry; error path already deletes the cache entry so stuck promises are bounded to the duration of the underlying DB call.
apps/sim/lib/monitoring/cache-registry.ts New lightweight registry for cache size telemetry; getCacheSizes lacks per-getter error handling, so a throwing getter would silently drop the entire memory snapshot log.
apps/sim/lib/monitoring/memory-telemetry.ts Minimal change — adds cacheSizes: getCacheSizes() to the memory snapshot log; no logic concerns.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Cache Miss"] --> B["Create entry with promise + expiresAt"]
    B --> C["Promise in-flight"]
    C --> D{"Sweep fires (setInterval)"}
    D -->|"effectiveEnvCache !entry.promise guard"| E["Entry preserved until promise resolves"]
    D -->|"toolSchemaCache no !entry.promise guard"| F["Entry deleted if expiresAt <= now"]
    F --> G["New caller triggers duplicate parallel fetch"]
    E --> H["Promise resolves - value entry written"]
    H --> I{"Next sweep"}
    I -->|"expiresAt <= now"| J["Entry evicted"]
    I -->|"expiresAt > now"| K["Entry retained"]
Loading

Reviews (1): Last reviewed commit: "fix(memory): prune effectiveEnvCache and..." | Re-trigger Greptile

Comment on lines +25 to +30
setInterval(() => {
const now = Date.now()
for (const [key, entry] of toolSchemaCache) {
if (entry.expiresAt <= now) toolSchemaCache.delete(key)
}
}, TOOL_SCHEMA_CACHE_TTL_MS).unref()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 The toolSchemaCache sweep lacks the !entry.promise guard that the parallel effectiveEnvCache sweep in utils.ts correctly includes. When a schema-build request is in-flight and the 30s interval fires before the async work completes (e.g. cold DB connection, slow permission lookup), the promise-only cache entry is deleted. Any new caller arriving after that deletion creates a second parallel fetch for the same key, causing a thundering-herd on exactly the large, expensive computation this cache is meant to deduplicate.

Suggested change
setInterval(() => {
const now = Date.now()
for (const [key, entry] of toolSchemaCache) {
if (entry.expiresAt <= now) toolSchemaCache.delete(key)
}
}, TOOL_SCHEMA_CACHE_TTL_MS).unref()
setInterval(() => {
const now = Date.now()
for (const [key, entry] of toolSchemaCache) {
if (!entry.promise && entry.expiresAt <= now) toolSchemaCache.delete(key)
}
}, TOOL_SCHEMA_CACHE_TTL_MS).unref()

Comment on lines +7 to +13
export function getCacheSizes(): Record<string, number> {
const sizes: Record<string, number> = {}
for (const [name, getSize] of registry) {
sizes[name] = getSize()
}
return sizes
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 If any registered getSize callback throws (e.g. a future cache with a non-trivial getter), getCacheSizes() bubbles the error up to the logger.info call in memory-telemetry.ts, preventing the entire memory snapshot from being written. Wrapping each call in a try/catch keeps telemetry working even if one getter is broken.

Suggested change
export function getCacheSizes(): Record<string, number> {
const sizes: Record<string, number> = {}
for (const [name, getSize] of registry) {
sizes[name] = getSize()
}
return sizes
}
export function getCacheSizes(): Record<string, number> {
const sizes: Record<string, number> = {}
for (const [name, getSize] of registry) {
try {
sizes[name] = getSize()
} catch {
sizes[name] = -1
}
}
return sizes
}

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 6766bd2. Configure here.

for (const [key, entry] of toolSchemaCache) {
if (entry.expiresAt <= now) toolSchemaCache.delete(key)
}
}, TOOL_SCHEMA_CACHE_TTL_MS).unref()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing promise guard in toolSchemaCache sweep causes deduplication bypass

Medium Severity

The toolSchemaCache sweep deletes entries solely based on expiresAt <= now, but the effectiveEnvCache sweep correctly guards with !entry.promise && entry.expiresAt <= now. The toolSchemaCache is missing this !entry.promise check. If the async tool-schema build takes longer than 30 seconds (possible under load with multiple DB queries and dynamic imports), the sweep will delete the in-flight promise entry, defeating request deduplication and causing redundant expensive computations.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 6766bd2. Configure here.

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

Closing to reopen with updated scope — lru-cache migration was added to the same branch after this PR was created.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant