forked from wrenn/wrenn
fix: close stale TCP connections across snapshot/restore to prevent envd hangs
After Firecracker snapshot restore, zombie TCP sockets from the previous session cause Go runtime corruption inside the guest VM, making envd unresponsive. This manifests as infinite loading in the file browser and terminal timeouts (524) in production (HTTP/2 + Cloudflare) but not locally.

Four-part fix:

- Add ServerConnTracker to envd that tracks connections via a ConnState callback, closes idle connections and disables keep-alives before snapshot, then closes all pre-snapshot zombie connections on restore (while preserving post-restore connections such as the /init request)
- Split envdclient into timeout (2 min) and streaming (no timeout) HTTP clients; use the streaming client for file transfers and process RPCs
- Close host-side idle envdclient connections before PrepareSnapshot so FIN packets propagate during the 3 s quiesce window
- Add a StreamingHTTPClient() accessor; streaming file transfer handlers in hostagent use it instead of the timeout client
@@ -150,6 +150,12 @@ func (a *API) PostInit(w http.ResponseWriter, r *http.Request) {
 		host.PollForMMDSOpts(ctx, a.mmdsChan, a.defaults.EnvVars)
 	}()
 
+	// Close zombie connections from before the snapshot and re-enable
+	// keep-alives. On first boot this is a no-op (no zombie connections).
+	if a.connTracker != nil {
+		a.connTracker.RestoreAfterSnapshot()
+	}
+
 	// Start the port scanner and forwarder if they were stopped by a
 	// pre-snapshot prepare call. Start is a no-op if already running,
 	// so this is safe on first boot and only takes effect after restore.