fix(memory): prune unbounded server-side caches causing heap growth by waleedlatif1 · Pull Request #4651 · simstudioai/sim

waleedlatif1 · 2026-05-18T17:55:22Z

Summary

toolSchemaCache (lib/copilot/chat/payload.ts): module-level Map keyed by userId:workspaceId had a 30s TTL that was only checked on read — expired entries were never deleted. With ~7.5MB per entry (2,612 tool schemas) and hundreds of thousands of unique user/workspace pairs, this was accumulating gigabytes of heap over a server instance lifetime. Adds a setInterval sweep every 30s with .unref() to evict expired entries
effectiveEnvCache (lib/environment/utils.ts): same pattern — 15s TTL, same unbounded accumulation, same fix applied
semaphores (lib/core/async-jobs/backends/database.ts): releaseSlot never deleted entries from the Map. Added idle-state deletion when available === limit && waiters.length === 0. Safe under JS's single-threaded event loop; validated that acquireSlot always re-creates the entry with the correct limit if it's missing
cache-registry (lib/monitoring/cache-registry.ts): new lightweight registry so any module can expose its cache size to telemetry without coupling. Both caches register on module load
memory-telemetry: emits cacheSizes in every Memory snapshot log — confirms caches stay bounded in CloudWatch post-deploy

Type of Change

Bug fix

Testing

Validated with CloudWatch data spanning 7 days of production memory snapshots. Math confirmed: ~7.5MB/entry × thousands of unique users = tens of GB of accumulated heap consistent with observed 24MB → 25GB growth pattern. All fixes independently reviewed for correctness (Map iteration safety, .unref() behavior, semaphore race conditions under async scheduling).

Checklist

Code follows project style guidelines
Self-reviewed my changes
Tests added/updated and passing
No new warnings introduced
I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

toolSchemaCache (lib/copilot/chat/payload.ts): module-level Map keyed by userId:workspaceId never deleted expired entries, only checked TTL on read. With 100K+ unique user/workspace pairs each holding 50-200KB of tool schemas, this was the primary driver of the 24MB -> 25GB heap growth observed in CloudWatch. Add a setInterval sweep every 30s (matching the TTL) with .unref() so it does not prevent graceful shutdown. semaphores (lib/core/async-jobs/backends/database.ts): acquireSlot created Semaphore entries that releaseSlot never deleted. With per-execution UUID keys (e.g. scheduleJobId), each scheduled workflow run would add a permanent entry. Store the concurrency limit on the Semaphore struct and delete the entry from the Map when all slots are free and no waiters remain. validatorCache (lib/copilot/tools/server/generated-schema.ts): validated as bounded (93 tools x 2 schema kinds = 186 max entries, ~2-9MB). No fix needed. isolated-vm nativeContexts: validated as deferred GC, self-healed by worker rotation at MAX_EXECUTIONS_PER_WORKER=200. externalMB spikes trace to concurrent isolate heaps at peak load (128MB limit x active isolates), not a reference leak. No fix needed.

…lemetry effectiveEnvCache (lib/environment/utils.ts): same unbounded accumulation pattern as toolSchemaCache — module-level Map keyed by userId:workspaceId with a 15s TTL that is only checked on read, never proactively evicted. Adds a periodic sweep matching the TTL interval with .unref(). cache-registry (lib/monitoring/cache-registry.ts): lightweight registry so modules can expose their cache sizes to telemetry without coupling. toolSchemaCache and effectiveEnvCache both register on module load. memory-telemetry: emits cacheSizes in every Memory snapshot log so CloudWatch can confirm the caches stay bounded post-deploy.

vercel · 2026-05-18T17:55:28Z

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
docs	Skipped		May 18, 2026 5:55pm

cursor · 2026-05-18T17:55:32Z

PR Summary

Medium Risk
Medium risk because it changes runtime cache eviction behavior and async-job concurrency semaphore lifecycle, which could affect performance or inline job scheduling if incorrect. Changes are localized and add observability to validate bounded cache growth in production.

Overview
Fixes unbounded growth of module-level in-memory caches by adding periodic TTL sweeps (with unref() timers) for toolSchemaCache and effectiveEnvCache, and exposes their sizes via a new cache-registry.

Updates async-job inline concurrency bookkeeping to delete idle semaphore entries once fully released, and extends memory-telemetry to log cacheSizes on each snapshot for ongoing heap-growth monitoring.

^{Reviewed by Cursor Bugbot for commit 6766bd2. Configure here.}

greptile-apps · 2026-05-18T17:58:30Z

Greptile Summary

This PR fixes three server-side memory leak patterns — unbounded TTL-only caches (toolSchemaCache, effectiveEnvCache), and an ever-growing semaphore Map — and adds a cache-registry + memory-telemetry integration to confirm they stay bounded in production.

toolSchemaCache / effectiveEnvCache: Both gain a setInterval sweep with .unref() to evict expired entries; effectiveEnvCache correctly skips in-flight promise entries, but toolSchemaCache is missing the same guard (see inline comment).
semaphores: releaseSlot now deletes idle entries when available === limit and no waiters remain; acquireSlot recreates the entry on demand, making this safe.
cache-registry + telemetry: New registerCache hook lets both caches expose their live size to the periodic memory snapshot log.

Confidence Score: 3/5

The semaphore and effectiveEnvCache fixes are solid, but the toolSchemaCache sweep has a correctness gap that can cause duplicate concurrent schema fetches for the same user under slow builds.

The toolSchemaCache sweep deletes entries whose only content is an in-flight promise if the async schema-build takes longer than 30 seconds before the interval fires. The effectiveEnvCache sweep was written with a !entry.promise guard for exactly this reason — its absence in payload.ts is a real defect on the changed path that undermines one of the two main cache fixes in this PR. The semaphore and telemetry changes are independently correct.

apps/sim/lib/copilot/chat/payload.ts — the sweep interval needs the !entry.promise guard before this is safe to deploy.

Important Files Changed

Filename	Overview
apps/sim/lib/copilot/chat/payload.ts	Adds a 30s sweep interval for toolSchemaCache, but is missing the `!entry.promise` guard present in the equivalent effectiveEnvCache sweep — in-flight entries can be evicted, triggering duplicate parallel fetches.
apps/sim/lib/core/async-jobs/backends/database.ts	Adds `limit` field to Semaphore and deletes idle entries in `releaseSlot`; logic is correct — deletion only happens when `available === limit` and no waiters remain, and `acquireSlot` safely re-creates the entry if needed.
apps/sim/lib/environment/utils.ts	Correctly adds sweep interval with `!entry.promise` guard and registers the cache in the registry; error path already deletes the cache entry so stuck promises are bounded to the duration of the underlying DB call.
apps/sim/lib/monitoring/cache-registry.ts	New lightweight registry for cache size telemetry; `getCacheSizes` lacks per-getter error handling, so a throwing getter would silently drop the entire memory snapshot log.
apps/sim/lib/monitoring/memory-telemetry.ts	Minimal change — adds `cacheSizes: getCacheSizes()` to the memory snapshot log; no logic concerns.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Cache Miss"] --> B["Create entry with promise + expiresAt"]
    B --> C["Promise in-flight"]
    C --> D{"Sweep fires (setInterval)"}
    D -->|"effectiveEnvCache !entry.promise guard"| E["Entry preserved until promise resolves"]
    D -->|"toolSchemaCache no !entry.promise guard"| F["Entry deleted if expiresAt <= now"]
    F --> G["New caller triggers duplicate parallel fetch"]
    E --> H["Promise resolves - value entry written"]
    H --> I{"Next sweep"}
    I -->|"expiresAt <= now"| J["Entry evicted"]
    I -->|"expiresAt > now"| K["Entry retained"]

_{Reviews (1): Last reviewed commit: "fix(memory): prune effectiveEnvCache and..." | Re-trigger Greptile}

greptile-apps · 2026-05-18T17:58:35Z

+setInterval(() => {
+  const now = Date.now()
+  for (const [key, entry] of toolSchemaCache) {
+    if (entry.expiresAt <= now) toolSchemaCache.delete(key)
+  }
+}, TOOL_SCHEMA_CACHE_TTL_MS).unref()


The toolSchemaCache sweep lacks the !entry.promise guard that the parallel effectiveEnvCache sweep in utils.ts correctly includes. When a schema-build request is in-flight and the 30s interval fires before the async work completes (e.g. cold DB connection, slow permission lookup), the promise-only cache entry is deleted. Any new caller arriving after that deletion creates a second parallel fetch for the same key, causing a thundering-herd on exactly the large, expensive computation this cache is meant to deduplicate.

Suggested change

setInterval(() => {

const now = Date.now()

for (const [key, entry] of toolSchemaCache) {

if (entry.expiresAt <= now) toolSchemaCache.delete(key)

}

}, TOOL_SCHEMA_CACHE_TTL_MS).unref()

setInterval(() => {

const now = Date.now()

for (const [key, entry] of toolSchemaCache) {

if (!entry.promise && entry.expiresAt <= now) toolSchemaCache.delete(key)

}

}, TOOL_SCHEMA_CACHE_TTL_MS).unref()

greptile-apps · 2026-05-18T17:58:39Z

+export function getCacheSizes(): Record<string, number> {
+  const sizes: Record<string, number> = {}
+  for (const [name, getSize] of registry) {
+    sizes[name] = getSize()
+  }
+  return sizes
+}


If any registered getSize callback throws (e.g. a future cache with a non-trivial getter), getCacheSizes() bubbles the error up to the logger.info call in memory-telemetry.ts, preventing the entire memory snapshot from being written. Wrapping each call in a try/catch keeps telemetry working even if one getter is broken.

Suggested change

export function getCacheSizes(): Record<string, number> {

const sizes: Record<string, number> = {}

for (const [name, getSize] of registry) {

sizes[name] = getSize()

}

return sizes

}

export function getCacheSizes(): Record<string, number> {

const sizes: Record<string, number> = {}

for (const [name, getSize] of registry) {

try {

sizes[name] = getSize()

} catch {

sizes[name] = -1

}

}

return sizes

}

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 6766bd2. Configure here.}

cursor · 2026-05-18T17:59:41Z

+  for (const [key, entry] of toolSchemaCache) {
+    if (entry.expiresAt <= now) toolSchemaCache.delete(key)
+  }
+}, TOOL_SCHEMA_CACHE_TTL_MS).unref()


Missing promise guard in toolSchemaCache sweep causes deduplication bypass

Medium Severity

The toolSchemaCache sweep deletes entries solely based on expiresAt <= now, but the effectiveEnvCache sweep correctly guards with !entry.promise && entry.expiresAt <= now. The toolSchemaCache is missing this !entry.promise check. If the async tool-schema build takes longer than 30 seconds (possible under load with multiple DB queries and dynamic imports), the sweep will delete the in-flight promise entry, defeating request deduplication and causing redundant expensive computations.

Additional Locations (1)

apps/sim/lib/environment/utils.ts#L26-L32

^{Reviewed by Cursor Bugbot for commit 6766bd2. Configure here.}

waleedlatif1 · 2026-05-18T18:09:21Z

Closing to reopen with updated scope — lru-cache migration was added to the same branch after this PR was created.

waleedlatif1 added 2 commits May 18, 2026 10:38

greptile-apps Bot reviewed May 18, 2026

View reviewed changes

cursor Bot reviewed May 18, 2026

View reviewed changes

waleedlatif1 closed this May 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(memory): prune unbounded server-side caches causing heap growth#4651

fix(memory): prune unbounded server-side caches causing heap growth#4651
waleedlatif1 wants to merge 2 commits into
stagingfrom
fix/memory-leaks

waleedlatif1 commented May 18, 2026

Uh oh!

vercel Bot commented May 18, 2026

Uh oh!

cursor Bot commented May 18, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 18, 2026

Uh oh!

greptile-apps Bot May 18, 2026

Uh oh!

greptile-apps Bot May 18, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot May 18, 2026

Uh oh!

waleedlatif1 commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

waleedlatif1 commented May 18, 2026

Summary

Type of Change

Testing

Checklist

Uh oh!

vercel Bot commented May 18, 2026

Uh oh!

cursor Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

greptile-apps Bot commented May 18, 2026

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 18, 2026

Choose a reason for hiding this comment

Missing promise guard in toolSchemaCache sweep causes deduplication bypass

Uh oh!

waleedlatif1 commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cursor Bot commented May 18, 2026 •

edited

Loading