wrenn-releases

tksadik92/wrenn-releases

Fork 0

forked from wrenn/wrenn

Commit Graph

Author	SHA1	Message	Date
pptx704	3deecbff89	fix: prevent Go runtime memory corruption and sandbox halt after snapshot restore Three root causes addressed: 1. Go page allocator corruption: allocations between the pre-snapshot GC and VM freeze leave the summary tree inconsistent. After restore, GC reads corrupted metadata — either panicking (killing PID 1 → kernel panic) or silently failing to collect, causing unbounded heap growth until OOM. Fix: move GC to after all HTTP allocations in PostSnapshotPrepare, then set GOMAXPROCS(1) so any remaining allocations run sequentially with no concurrent page allocator access. GOMAXPROCS is restored on first health check after restore. 2. PostInit timeout starvation: WaitUntilReady and PostInit shared a single 30s context. If WaitUntilReady consumed most of it, PostInit failed — RestoreAfterSnapshot never ran, leaving envd with keep-alives disabled and zombie connections. Fix: separate timeout contexts. 3. CP HTTP server missing timeouts: no ReadHeaderTimeout or IdleTimeout caused goroutine leaks from hung proxy connections. Fix: add both, matching host agent values. Also adds UFFD prefetch to proactively load all guest pages after restore, eliminating on-demand page fault latency for subsequent RPC calls.	2026-05-02 17:22:51 +06:00
pptx704	7ef9a64613	fix: close stale TCP connections across snapshot/restore to prevent envd hangs After Firecracker snapshot restore, zombie TCP sockets from the previous session cause Go runtime corruption inside the guest VM, making envd unresponsive. This manifests as infinite loading in the file browser and terminal timeouts (524) in production (HTTP/2 + Cloudflare) but not locally. Four-part fix: - Add ServerConnTracker to envd that tracks connections via ConnState callback, closes idle connections and disables keep-alives before snapshot, then closes all pre-snapshot zombie connections on restore (while preserving post-restore connections like the /init request) - Split envdclient into timeout (2min) and streaming (no timeout) HTTP clients; use streaming client for file transfers and process RPCs - Close host-side idle envdclient connections before PrepareSnapshot so FIN packets propagate during the 3s quiesce window - Add StreamingHTTPClient() accessor; streaming file transfer handlers in hostagent use it instead of the timeout client	2026-05-02 05:19:37 +06:00
pptx704	962860ba74	Pre-pause snapshot signal to prevent Go runtime crash on restore envd crashes with "fatal error: bad summary data" after Firecracker snapshot/restore because the page allocator radix tree is inconsistent when vCPUs are frozen mid-allocation. The port scanner goroutine allocates heavily every second, making it the primary trigger. Add POST /snapshot/prepare to envd — the host agent calls it before vm.Pause to quiesce continuous goroutines and force GC. On restore, PostInit restarts the port subsystem via the existing /init endpoint. - New PortSubsystem abstraction with Start/Stop/Restart lifecycle - Context-based goroutine cancellation (replaces irreversible channel close) - Context-aware Signal to prevent scanner/forwarder deadlock - Fix forwarder goroutine leak (was spinning forever on closed channel) - Kill socat children on stop to prevent orphans across snapshots - Fix double cmd.Wait panic (exec.Command instead of CommandContext)	2026-04-13 05:21:10 +06:00

Author

SHA1

Message

Date

pptx704

3deecbff89

fix: prevent Go runtime memory corruption and sandbox halt after snapshot restore

Three root causes addressed:

1. Go page allocator corruption: allocations between the pre-snapshot GC
   and VM freeze leave the summary tree inconsistent. After restore, GC
   reads corrupted metadata — either panicking (killing PID 1 → kernel
   panic) or silently failing to collect, causing unbounded heap growth
   until OOM. Fix: move GC to after all HTTP allocations in
   PostSnapshotPrepare, then set GOMAXPROCS(1) so any remaining
   allocations run sequentially with no concurrent page allocator access.
   GOMAXPROCS is restored on first health check after restore.

2. PostInit timeout starvation: WaitUntilReady and PostInit shared a
   single 30s context. If WaitUntilReady consumed most of it, PostInit
   failed — RestoreAfterSnapshot never ran, leaving envd with keep-alives
   disabled and zombie connections. Fix: separate timeout contexts.

3. CP HTTP server missing timeouts: no ReadHeaderTimeout or IdleTimeout
   caused goroutine leaks from hung proxy connections. Fix: add both,
   matching host agent values.

Also adds UFFD prefetch to proactively load all guest pages after restore,
eliminating on-demand page fault latency for subsequent RPC calls.

2026-05-02 17:22:51 +06:00

pptx704

7ef9a64613

fix: close stale TCP connections across snapshot/restore to prevent envd hangs

After Firecracker snapshot restore, zombie TCP sockets from the previous
session cause Go runtime corruption inside the guest VM, making envd
unresponsive. This manifests as infinite loading in the file browser and
terminal timeouts (524) in production (HTTP/2 + Cloudflare) but not locally.

Four-part fix:
- Add ServerConnTracker to envd that tracks connections via ConnState callback,
  closes idle connections and disables keep-alives before snapshot, then closes
  all pre-snapshot zombie connections on restore (while preserving post-restore
  connections like the /init request)
- Split envdclient into timeout (2min) and streaming (no timeout) HTTP clients;
  use streaming client for file transfers and process RPCs
- Close host-side idle envdclient connections before PrepareSnapshot so FIN
  packets propagate during the 3s quiesce window
- Add StreamingHTTPClient() accessor; streaming file transfer handlers in
  hostagent use it instead of the timeout client

2026-05-02 05:19:37 +06:00

pptx704

962860ba74

Pre-pause snapshot signal to prevent Go runtime crash on restore

envd crashes with "fatal error: bad summary data" after Firecracker
snapshot/restore because the page allocator radix tree is inconsistent
when vCPUs are frozen mid-allocation. The port scanner goroutine
allocates heavily every second, making it the primary trigger.

Add POST /snapshot/prepare to envd — the host agent calls it before
vm.Pause to quiesce continuous goroutines and force GC. On restore,
PostInit restarts the port subsystem via the existing /init endpoint.

- New PortSubsystem abstraction with Start/Stop/Restart lifecycle
- Context-based goroutine cancellation (replaces irreversible channel close)
- Context-aware Signal to prevent scanner/forwarder deadlock
- Fix forwarder goroutine leak (was spinning forever on closed channel)
- Kill socat children on stop to prevent orphans across snapshots
- Fix double cmd.Wait panic (exec.Command instead of CommandContext)

2026-04-13 05:21:10 +06:00

3 Commits