Three root causes addressed:
1. Go page allocator corruption: allocations between the pre-snapshot GC
and VM freeze leave the summary tree inconsistent. After restore, GC
reads corrupted metadata — either panicking (killing PID 1 → kernel
panic) or silently failing to collect, causing unbounded heap growth
until OOM. Fix: move GC to after all HTTP allocations in
PostSnapshotPrepare, then set GOMAXPROCS(1) so any remaining
allocations run sequentially with no concurrent page allocator access.
GOMAXPROCS is restored on first health check after restore.
2. PostInit timeout starvation: WaitUntilReady and PostInit shared a
single 30s context. If WaitUntilReady consumed most of it, PostInit
failed — RestoreAfterSnapshot never ran, leaving envd with keep-alives
disabled and zombie connections. Fix: separate timeout contexts.
3. CP HTTP server missing timeouts: no ReadHeaderTimeout or IdleTimeout
caused goroutine leaks from hung proxy connections. Fix: add both,
matching host agent values.
Also adds UFFD prefetch to proactively load all guest pages after restore,
eliminating on-demand page fault latency for subsequent RPC calls.
envd crashes with "fatal error: bad summary data" after Firecracker
snapshot/restore because the page allocator radix tree is inconsistent
when vCPUs are frozen mid-allocation. The port scanner goroutine
allocates heavily every second, making it the primary trigger.
Add POST /snapshot/prepare to envd — the host agent calls it before
vm.Pause to quiesce continuous goroutines and force GC. On restore,
PostInit restarts the port subsystem via the existing /init endpoint.
- New PortSubsystem abstraction with Start/Stop/Restart lifecycle
- Context-based goroutine cancellation (replaces irreversible channel close)
- Context-aware Signal to prevent scanner/forwarder deadlock
- Fix forwarder goroutine leak (was spinning forever on closed channel)
- Kill socat children on stop to prevent orphans across snapshots
- Fix double cmd.Wait panic (exec.Command instead of CommandContext)