Speed up listing VMs #225
Merged
Tracing on the previous list-path span PR showed single `instances.derive_state` spans of 10-37s on the dev fleet, with no children; all the time was inside `hv.GetVMInfo`. The cloud-hypervisor API socket serializes per VM, so a snapshot in flight from auto-standby parks every concurrent `/vm.info` call behind it, blocking the entire serial list loop.

Two changes:

1. Cache the last observed hypervisor VM state per instance with a 5s TTL. List calls within the TTL skip the HTTP round-trip entirely. Lifecycle event notifications (start/stop/standby/restore/delete/...) write through the cache so it stays current with state changes hypeman itself drives.
2. Wrap the underlying `GetVMInfo` call in a 500ms timeout. On timeout the instance is reported as Unknown (matching today's behavior when the hypervisor is unreachable) and the next list call retries.

Together these eliminate the 14-37s tails while leaving the hypervisor-of-last-resort check intact for crash detection via the TTL sweep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Firetiger deploy monitoring skipped
This PR didn't match the auto-monitor filter configured on your GitHub connection:
Reason: PR modifies instance state caching logic in the hypervisor layer, not API endpoints (packages/api/cmd/api/) or Temporal workflows (packages/api/lib/temporal) as specified in the kernel API filter. To monitor this PR anyway, reply with |
hiroTamada
approved these changes
May 11, 2026
Summary
Cache hypervisor state and bound /vm.info calls in list path
Builds on the list-path tracing PR (#208). Those spans showed single `instances.derive_state` spans of 10–37s on the dev fleet, with no children; all the time was inside the un-instrumented `hv.GetVMInfo` HTTP call. The cloud-hypervisor API socket serializes per VM, so a snapshot in flight from `auto_standby` parks every concurrent `/vm.info` call behind it, blocking the whole serial list loop.

This PR addresses both halves of the problem:

1. Cache the last observed hypervisor VM state per instance with a 5s TTL. List calls within the TTL skip the HTTP round-trip entirely. Lifecycle event notifications (`start` / `stop` / `standby` / `restore` / `delete` / …) write through the cache so it stays current with state changes hypeman itself drives.
2. Bound the underlying `GetVMInfo` call with a 500 ms timeout. On timeout the instance is reported as `Unknown` (matching today's behavior when the hypervisor is unreachable) and the next list call retries.

The TTL preserves the original purpose of the call (detecting guest-driven shutdowns the lifecycle bus can't see) while collapsing bursty list calls onto at most one `/vm.info` per instance per 5s. The timeout caps the worst case at 500 ms instead of 37 s.

Test plan

- `go vet ./lib/instances/...` clean
- `updateCachedHypervisorStateFromInstance`
- `instances.derive_state` p99 on dev hypemen drops; confirm `hypervisor_state.cache_hit=true` shows up on most list calls

🤖 Generated with Claude Code
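For readers who want a concrete picture of the caching half, here is a minimal sketch of a per-instance TTL cache with write-through and invalidation. It is not the code from this PR: `stateCache`, `Put`, `Get`, `Invalidate`, `fakeClock`, and the `State` values are hypothetical names chosen for illustration, and only the 5s TTL comes from the description above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// State is a hypothetical stand-in for the hypervisor VM state type.
type State string

const (
	StateRunning State = "Running"
	StateStandby State = "Standby"
)

// defaultTTL mirrors the 5s TTL named in the PR description.
const defaultTTL = 5 * time.Second

type entry struct {
	state      State
	observedAt time.Time
}

// stateCache keeps the last observed state per instance ID (assumed design).
type stateCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time // injectable clock so tests need not sleep
	entries map[string]entry
}

func newStateCache(ttl time.Duration, now func() time.Time) *stateCache {
	return &stateCache{ttl: ttl, now: now, entries: make(map[string]entry)}
}

// Get returns the cached state only if it was observed within the TTL.
func (c *stateCache) Get(id string) (State, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[id]
	if !ok || c.now().Sub(e.observedAt) >= c.ttl {
		return "", false
	}
	return e.state, true
}

// Put records a fresh observation. Both a successful hypervisor query and a
// lifecycle event (write-through) would call this.
func (c *stateCache) Put(id string, s State) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[id] = entry{state: s, observedAt: c.now()}
}

// Invalidate drops the entry, e.g. when an instance is deleted.
func (c *stateCache) Invalidate(id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, id)
}

// fakeClock is a manually advanced clock used only for demonstration.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

func main() {
	clk := &fakeClock{}
	c := newStateCache(defaultTTL, clk.Now)

	c.Put("vm-1", StateRunning) // first observation
	s, hit := c.Get("vm-1")
	fmt.Println(s, hit) // served from cache, no hypervisor call needed

	c.Put("vm-1", StateStandby) // lifecycle event writes through
	s, hit = c.Get("vm-1")
	fmt.Println(s, hit)

	clk.Advance(defaultTTL) // entry ages out; next list call must re-query
	_, hit = c.Get("vm-1")
	fmt.Println(hit)
}
```

Expiry is checked lazily on read in this sketch, so no background sweeper is needed; whether the real change does the same is not stated in the description.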
Note
Medium Risk
Changes instance state derivation to reuse cached hypervisor state and to time out `GetVMInfo`, which could alter list results (e.g., more `Unknown` states) or temporarily surface stale state within the TTL window.

Overview

Speeds up hot instance listing/state-derivation paths by adding an in-memory per-instance hypervisor state cache (5s TTL) and using it to avoid repeated `/vm.info` socket calls.

When a cache miss occurs, `deriveState` now wraps `hv.GetVMInfo` with a 500ms timeout and returns `StateUnknown` on failure/timeout rather than blocking for long periods. Lifecycle notifications write through to the cache, and deletions/no-socket states invalidate it. New unit tests cover TTL behavior and cache update/invalidation via lifecycle events.

Reviewed by Cursor Bugbot for commit 86d079f. Bugbot is set up for automated code reviews on this repo.
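The timeout half can be sketched as below. This is an illustration under stated assumptions, not the PR's `deriveState`: `infoFunc` is a hypothetical stand-in for `hv.GetVMInfo`, `delayedInfo`, `fastInfo`, `stuckInfo`, and `listStates` are invented for the demo, and the sketch assumes the underlying call honors context cancellation (a call that ignores its context would still block).

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// State is a hypothetical stand-in for the instance state type.
type State string

const (
	StateRunning State = "Running"
	StateUnknown State = "Unknown"
)

// infoTimeout mirrors the 500ms bound named in the PR description.
const infoTimeout = 500 * time.Millisecond

// infoFunc stands in for hv.GetVMInfo; the real signature is not shown here.
type infoFunc func(ctx context.Context) (State, error)

// deriveState bounds a single hypervisor query and maps any failure,
// including a deadline hit, to StateUnknown.
func deriveState(ctx context.Context, getInfo infoFunc) State {
	ctx, cancel := context.WithTimeout(ctx, infoTimeout)
	defer cancel()
	s, err := getInfo(ctx)
	if err != nil {
		return StateUnknown
	}
	return s
}

// delayedInfo simulates a query that answers after delay unless the
// context is cancelled first.
func delayedInfo(delay time.Duration) infoFunc {
	return func(ctx context.Context) (State, error) {
		timer := time.NewTimer(delay)
		defer timer.Stop()
		select {
		case <-timer.C:
			return StateRunning, nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

var (
	fastInfo  = delayedInfo(10 * time.Millisecond) // healthy VM
	stuckInfo = delayedInfo(30 * time.Second)      // parked behind a snapshot
)

// listStates walks instances serially, as the list loop in the description
// does. context.Background() stands in for the request context.
func listStates(infos map[string]infoFunc) map[string]State {
	ctx := context.Background()
	out := make(map[string]State, len(infos))
	for id, getInfo := range infos {
		out[id] = deriveState(ctx, getInfo)
	}
	return out
}

func main() {
	start := time.Now()
	states := listStates(map[string]infoFunc{
		"vm-fast":  fastInfo,
		"vm-stuck": stuckInfo,
	})
	fmt.Println(states)
	fmt.Println("finished in under 1s:", time.Since(start) < time.Second)
}
```

With the bound in place, one stuck instance costs the serial loop at most `infoTimeout` rather than the full duration of the blocking operation, and a later list call gets another chance to observe the real state.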