Speed up listing VMs #225

Merged
sjmiller609 merged 1 commit into main from hypeship/cache-vmm-state-for-list
May 11, 2026

@sjmiller609 sjmiller609 commented May 11, 2026

Summary

Cache hypervisor state and bound /vm.info calls in list path

Builds on the list-path tracing PR (#208). That tracing showed individual instances.derive_state spans of 10–37s on the dev fleet, with no children: all of the time was spent inside the un-instrumented hv.GetVMInfo HTTP call. The cloud-hypervisor API socket serializes requests per VM, so a snapshot in flight from auto_standby parks every concurrent /vm.info call behind it, blocking the whole serial list loop.

This PR addresses both halves of the problem:

  1. Cache the last observed hypervisor VM state per instance with a 5s TTL. List calls within the TTL skip the HTTP round-trip entirely. Lifecycle event notifications (start / stop / standby / restore / delete / …) write through the cache so it stays current with state changes hypeman itself drives.

  2. Bound the underlying GetVMInfo call with a 500 ms timeout. On timeout the instance is reported as Unknown (matching today's behavior when the hypervisor is unreachable) and the next list call retries.

The TTL preserves the original purpose of the call — detecting guest-driven shutdowns the lifecycle bus can't see — while collapsing bursty list calls onto at most one /vm.info per instance per 5s. The timeout caps the worst case at 500 ms instead of 37 s.

Test plan

  • go vet ./lib/instances/... clean
  • New unit tests cover TTL expiry, lifecycle event update, delete invalidation, and the live-socket vs no-socket mapping in updateCachedHypervisorStateFromInstance
  • Existing query / lifecycle / admission / parse tests pass
  • After deploy: confirm instances.derive_state p99 on dev hypemen drops; confirm hypervisor_state.cache_hit=true shows up on most list calls

🤖 Generated with Claude Code


Note

Medium Risk
Changes instance state derivation to reuse cached hypervisor state and to timeout GetVMInfo, which could alter list results (e.g., more Unknown states) or temporarily surface stale state within the TTL window.

Overview
Speeds up hot instance listing/state-derivation paths by adding an in-memory per-instance hypervisor state cache (5s TTL) and using it to avoid repeated /vm.info socket calls.

When a cache miss occurs, deriveState now wraps hv.GetVMInfo with a 500ms timeout and returns StateUnknown on failure or timeout rather than blocking for long periods. Lifecycle notifications write through to the cache, deletions and no-socket states invalidate it, and new unit tests cover TTL behavior and cache update/invalidation via lifecycle events.

Reviewed by Cursor Bugbot for commit 86d079f. Bugbot is set up for automated code reviews on this repo. Configure here.

Tracing on the previous list-path span PR showed single `instances.derive_state`
spans of 10-37s on the dev fleet, with no children — all the time was inside
`hv.GetVMInfo`. The cloud-hypervisor API socket serializes per VM, so a
snapshot in flight from auto-standby parks every concurrent `/vm.info` call
behind it, blocking the entire serial list loop.

Two changes:

1. Cache the last observed hypervisor VM state per instance with a 5s TTL.
   List calls within the TTL skip the HTTP round-trip entirely. Lifecycle
   event notifications (start/stop/standby/restore/delete/...) write through
   the cache so it stays current with state changes hypeman itself drives.

2. Wrap the underlying `GetVMInfo` call in a 500ms timeout. On timeout the
   instance is reported as Unknown (matching today's behavior when the
   hypervisor is unreachable) and the next list call retries.

Together these eliminate the 10-37s tails while keeping the hypervisor as
the check of last resort: once an entry's TTL expires, the next list call
queries `/vm.info` again, so crashes are still detected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@sjmiller609 sjmiller609 changed the title Cache hypervisor state and bound /vm.info calls in list path Speed up listing VMs May 11, 2026
@sjmiller609 sjmiller609 marked this pull request as ready for review May 11, 2026 19:33
@sjmiller609 sjmiller609 requested a review from hiroTamada May 11, 2026 19:33
@firetiger-agent

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

Any PR that changes the kernel API. Monitor changes to API endpoints (packages/api/cmd/api/) and Temporal workflows (packages/api/lib/temporal) in the kernel repo

Reason: PR modifies instance state caching logic in the hypervisor layer, not API endpoints (packages/api/cmd/api/) or Temporal workflows (packages/api/lib/temporal) as specified in the kernel API filter.

To monitor this PR anyway, reply with @firetiger monitor this.


@hiroTamada hiroTamada left a comment

Left one comment.

Comment thread lib/instances/query.go
@sjmiller609 sjmiller609 requested a review from hiroTamada May 11, 2026 20:40
@sjmiller609 sjmiller609 merged commit 0c9574c into main May 11, 2026
13 of 14 checks passed
@sjmiller609 sjmiller609 deleted the hypeship/cache-vmm-state-for-list branch May 11, 2026 22:12
2 participants