forked from wrenn/wrenn
fix: close stale TCP connections across snapshot/restore to prevent envd hangs
After Firecracker snapshot restore, zombie TCP sockets from the previous session cause Go runtime corruption inside the guest VM, making envd unresponsive. This manifests as infinite loading in the file browser and terminal timeouts (524) in production (HTTP/2 + Cloudflare) but not locally.

Four-part fix:
- Add ServerConnTracker to envd that tracks connections via ConnState callback, closes idle connections and disables keep-alives before snapshot, then closes all pre-snapshot zombie connections on restore (while preserving post-restore connections like the /init request)
- Split envdclient into timeout (2min) and streaming (no timeout) HTTP clients; use streaming client for file transfers and process RPCs
- Close host-side idle envdclient connections before PrepareSnapshot so FIN packets propagate during the 3s quiesce window
- Add StreamingHTTPClient() accessor; streaming file transfer handlers in hostagent use it instead of the timeout client
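
The guest-side connection tracking described above could look roughly like the sketch below. It is not the actual ServerConnTracker, whose code is not part of this excerpt: the type `connTracker`, its methods, and the generation counter used to tell pre-snapshot connections from post-restore ones are all assumptions made for illustration. Only `http.Server.ConnState` and the `http.ConnState` values are real net/http API. A real server would also call `srv.SetKeepAlivesEnabled(false)` before the snapshot, which is omitted here because the sketch drives the tracker by hand.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"sync"
)

// connTracker is a hypothetical stand-in for ServerConnTracker.
type connTracker struct {
	mu    sync.Mutex
	gen   int                // bumped once per snapshot (assumed scheme)
	conns map[net.Conn]entry // live connections seen via ConnState
}

type entry struct {
	gen  int  // generation in which the connection was first seen
	idle bool // true while the connection is in http.StateIdle
}

func newConnTracker() *connTracker {
	return &connTracker{conns: make(map[net.Conn]entry)}
}

// Track matches the signature of http.Server.ConnState.
func (t *connTracker) Track(c net.Conn, s http.ConnState) {
	t.mu.Lock()
	defer t.mu.Unlock()
	switch s {
	case http.StateClosed, http.StateHijacked:
		delete(t.conns, c)
	default:
		e, ok := t.conns[c]
		if !ok {
			e.gen = t.gen
		}
		e.idle = s == http.StateIdle
		t.conns[c] = e
	}
}

// closeWhere closes and forgets every connection matching pred.
func (t *connTracker) closeWhere(pred func(entry) bool) int {
	t.mu.Lock()
	var victims []net.Conn
	for c, e := range t.conns {
		if pred(e) {
			victims = append(victims, c)
			delete(t.conns, c)
		}
	}
	t.mu.Unlock()
	for _, c := range victims {
		c.Close() // closed outside the lock
	}
	return len(victims)
}

// BeforeSnapshot closes idle connections and starts a new generation.
func (t *connTracker) BeforeSnapshot() int {
	n := t.closeWhere(func(e entry) bool { return e.idle })
	t.mu.Lock()
	t.gen++
	t.mu.Unlock()
	return n
}

// AfterRestore closes connections that predate the latest snapshot.
func (t *connTracker) AfterRestore() int {
	t.mu.Lock()
	cur := t.gen
	t.mu.Unlock()
	return t.closeWhere(func(e entry) bool { return e.gen < cur })
}

// simulate feeds the tracker by hand instead of running a real server.
func simulate() (idleClosed, staleClosed, remaining int) {
	t := newConnTracker()
	idleConn, p1 := net.Pipe()
	busyConn, p2 := net.Pipe()
	initConn, p3 := net.Pipe()
	defer func() {
		for _, c := range []net.Conn{p1, p2, p3, initConn} {
			c.Close()
		}
	}()

	t.Track(idleConn, http.StateActive)
	t.Track(idleConn, http.StateIdle)
	t.Track(busyConn, http.StateActive)
	idleClosed = t.BeforeSnapshot()

	// After restore: a fresh connection (e.g. the /init request) arrives.
	t.Track(initConn, http.StateActive)
	staleClosed = t.AfterRestore()

	t.mu.Lock()
	remaining = len(t.conns)
	t.mu.Unlock()
	return
}

func main() {
	idle, stale, left := simulate()
	fmt.Println(idle, stale, left) // prints: 1 1 1
}
```

In a real server the tracker would be wired in with `&http.Server{ConnState: t.Track}`. The generation counter is used here instead of timestamps because wall-clock time can jump across a snapshot/restore cycle; whether the actual implementation does the same is not stated in the commit message.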
@@ -387,9 +387,17 @@ func (m *Manager) Pause(ctx context.Context, sandboxID string) error {
 	sb.connTracker.Drain(2 * time.Second)
 	slog.Debug("pause: proxy connections drained", "id", sandboxID)
 
-	// Step 0b: Signal envd to quiesce continuous goroutines (port scanner,
-	// forwarder) and run GC before freezing vCPUs. This prevents Go runtime
-	// page allocator corruption ("bad summary data") on snapshot restore.
+	// Step 0b: Close host-side idle connections to envd. Done before
+	// PrepareSnapshot so FIN packets propagate to the guest during the
+	// PrepareSnapshot window (no extra sleep needed).
+	sb.client.CloseIdleConnections()
+	slog.Debug("pause: envd client idle connections closed", "id", sandboxID)
+
+	// Step 0c: Signal envd to quiesce continuous goroutines (port scanner,
+	// forwarder), close idle HTTP connections, and run GC before freezing
+	// vCPUs. This prevents Go runtime page allocator corruption ("bad
+	// summary data") on snapshot restore. The 3s timeout also gives time
+	// for the FINs from Step 0b to be processed by the guest kernel.
 	// Best-effort: a failure is logged but does not abort the pause.
 	func() {
 		prepCtx, prepCancel := context.WithTimeout(ctx, 3*time.Second)
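
For readers without the rest of the repository, the host-side pieces this hunk relies on can be sketched as follows. This is an illustration under assumptions, not the real envdclient: the `envdClient` struct, its fields, the shared transport, and the `prepareForSnapshot` and `runDemo` helpers are hypothetical names. `StreamingHTTPClient()` and `CloseIdleConnections()` are named in the commit, but their bodies here are guesses. The sketch shows a 2-minute client next to a no-timeout streaming client, and the Step 0b / Step 0c ordering with a best-effort 3s prepare call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	"net/http"
	"time"
)

// envdClient is a hypothetical sketch of a client with two HTTP clients.
type envdClient struct {
	transport *http.Transport
	rpc       *http.Client // bounded requests (2 minute timeout)
	stream    *http.Client // file transfers and process RPCs (no timeout)
}

func newEnvdClient() *envdClient {
	// Sharing one transport between both clients is an assumption.
	tr := http.DefaultTransport.(*http.Transport).Clone()
	return &envdClient{
		transport: tr,
		rpc:       &http.Client{Transport: tr, Timeout: 2 * time.Minute},
		stream:    &http.Client{Transport: tr}, // Timeout 0 means no timeout
	}
}

// StreamingHTTPClient returns the client without an overall timeout.
func (c *envdClient) StreamingHTTPClient() *http.Client { return c.stream }

// CloseIdleConnections drops pooled keep-alive connections so the peer
// receives FINs for them.
func (c *envdClient) CloseIdleConnections() {
	c.transport.CloseIdleConnections()
}

// prepareForSnapshot mirrors the order of Steps 0b and 0c: close idle
// connections first, then run a best-effort prepare call bounded by 3s.
// The prepare callback is a placeholder for the real envd request.
// It reports whether prepare succeeded; a failure is only logged.
func prepareForSnapshot(ctx context.Context, c *envdClient,
	prepare func(context.Context) error) bool {
	c.CloseIdleConnections()

	prepCtx, cancel := context.WithTimeout(ctx, 3*time.Second)
	defer cancel()
	if err := prepare(prepCtx); err != nil {
		slog.Warn("prepare snapshot failed (best-effort)", "error", err)
		return false
	}
	return true
}

// runDemo exercises the failure path with a prepare call that always errors.
func runDemo() bool {
	failing := func(ctx context.Context) error {
		return errors.New("envd unreachable")
	}
	return prepareForSnapshot(context.Background(), newEnvdClient(), failing)
}

func main() {
	c := newEnvdClient()
	fmt.Println(c.rpc.Timeout, c.StreamingHTTPClient().Timeout)
	fmt.Println("prepare ok:", runDemo())
}
```

With a zero `Timeout`, a streaming request is bounded only by its own context, which is why long file transfers would go through `StreamingHTTPClient()` rather than the 2-minute client.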