fix: prevent Go runtime memory corruption and sandbox halt after snapshot restore

Three root causes addressed:

1. Go page allocator corruption: allocations between the pre-snapshot GC and VM freeze leave the summary tree inconsistent. After restore, GC reads corrupted metadata, either panicking (killing PID 1 → kernel panic) or silently failing to collect, causing unbounded heap growth until OOM. Fix: move GC to after all HTTP allocations in PostSnapshotPrepare, then set GOMAXPROCS(1) so any remaining allocations run sequentially with no concurrent page allocator access. GOMAXPROCS is restored on first health check after restore.

2. PostInit timeout starvation: WaitUntilReady and PostInit shared a single 30s context. If WaitUntilReady consumed most of it, PostInit failed, so RestoreAfterSnapshot never ran, leaving envd with keep-alives disabled and zombie connections. Fix: separate timeout contexts.

3. CP HTTP server missing timeouts: no ReadHeaderTimeout or IdleTimeout caused goroutine leaks from hung proxy connections. Fix: add both, matching host agent values.

Also adds UFFD prefetch to proactively load all guest pages after restore, eliminating on-demand page fault latency for subsequent RPC calls.
2026-05-02 17:22:51 +06:00
parent bb582deefa
commit 3deecbff89
13 changed files with 245 additions and 28 deletions
--- a/envd/internal/api/init.go
+++ b/envd/internal/api/init.go
@@ -150,15 +150,17 @@ func (a *API) PostInit(w http.ResponseWriter, r *http.Request) {
 		host.PollForMMDSOpts(ctx, a.mmdsChan, a.defaults.EnvVars)
 	}()

-	// Close zombie connections from before the snapshot and re-enable
-	// keep-alives. On first boot this is a no-op (no zombie connections).
+	// Safety net: if the health check's postRestoreRecovery didn't run yet
+	// (e.g. PostInit arrived before the first health check), re-enable GC
+	// here. On first boot needsRestore is false so CAS is a no-op.
+	if a.needsRestore.CompareAndSwap(true, false) {
+		a.postRestoreRecovery()
+	}
+	// RestoreAfterSnapshot is idempotent (clears preSnapshot set), and
+	// Start is a no-op if already running.
 	if a.connTracker != nil {
 		a.connTracker.RestoreAfterSnapshot()
 	}
-
-	// Start the port scanner and forwarder if they were stopped by a
-	// pre-snapshot prepare call. Start is a no-op if already running,
-	// so this is safe on first boot and only takes effect after restore.
 	if a.portSubsystem != nil {
 		a.portSubsystem.Start(a.rootCtx)
 	}