fix: prevent Go runtime memory corruption and sandbox halt after snapshot restore

Three root causes addressed:

1. Go page allocator corruption: allocations between the pre-snapshot GC
   and VM freeze leave the page allocator's summary tree inconsistent.
   After restore, GC reads corrupted metadata and either panics (killing
   PID 1, which triggers a kernel panic) or silently fails to collect,
   causing unbounded heap growth until OOM. Fix: move GC to after all
   HTTP allocations in PostSnapshotPrepare, then set GOMAXPROCS(1) so any
   remaining allocations run sequentially with no concurrent page
   allocator access. GOMAXPROCS is restored on the first health check
   after restore.

2. PostInit timeout starvation: WaitUntilReady and PostInit shared a
   single 30s context. If WaitUntilReady consumed most of it, PostInit
   failed, so RestoreAfterSnapshot never ran, leaving envd with
   keep-alives disabled and zombie connections. Fix: give each call its
   own timeout context.

3. CP HTTP server missing timeouts: with neither ReadHeaderTimeout nor
   IdleTimeout set, hung proxy connections leaked goroutines. Fix: set
   both, matching the host agent values.

Also adds UFFD prefetch to proactively load all guest pages after restore,
eliminating on-demand page fault latency for subsequent RPC calls.

Rafeed M. Bhuiyan

2026-05-02 17:22:51 +06:00

parent bb582deefa

commit 3deecbff89

13 changed files with 245 additions and 28 deletions

VERSION_AGENT

 @@ -1 +1 @@
 .1.1
 .1.2