Three root causes addressed:
1. Go page allocator corruption: allocations between the pre-snapshot GC
and VM freeze leave the summary tree inconsistent. After restore, GC
reads corrupted metadata — either panicking (killing PID 1 → kernel
panic) or silently failing to collect, causing unbounded heap growth
until OOM. Fix: move GC to after all HTTP allocations in
PostSnapshotPrepare, then set GOMAXPROCS(1) so any remaining
allocations run sequentially with no concurrent page allocator access.
GOMAXPROCS is restored on first health check after restore.
2. PostInit timeout starvation: WaitUntilReady and PostInit shared a
single 30s context. If WaitUntilReady consumed most of it, PostInit
failed — RestoreAfterSnapshot never ran, leaving envd with keep-alives
disabled and zombie connections. Fix: separate timeout contexts.
3. CP HTTP server missing timeouts: no ReadHeaderTimeout or IdleTimeout
caused goroutine leaks from hung proxy connections. Fix: add both,
matching host agent values.
Also adds UFFD prefetch to proactively load all guest pages after restore,
eliminating on-demand page fault latency for subsequent RPC calls.
- Scope WebSocket auth bypass to only WS endpoints by restructuring
routes into separate chi Groups. Non-WS routes no longer passthrough
unauthenticated requests with spoofed Upgrade headers. Added
optionalAPIKeyOrJWT middleware for WS routes (injects auth context
from API key/JWT if present, passes through otherwise) and
markAdminWS middleware for admin WS routes.
- Fix nil pointer dereference in envd Handler.Wait() — p.tty.Close()
was called unconditionally but p.tty is nil for non-PTY processes,
crashing every non-PTY process exit.
- Fix goroutine leak in sandbox Pause — stopSampler was never called,
leaking one sampler goroutine per successful pause operation.
- Decouple PTY WebSocket reads from RPC dispatch using a buffered
channel to prevent backpressure-induced connection drops under fast
typing. Includes input coalescing to reduce RPC call volume.