wrenn-releases

Author	SHA1	Message	Date
pptx704	3deecbff89	fix: prevent Go runtime memory corruption and sandbox halt after snapshot restore. Three root causes addressed: (1) Go page allocator corruption: allocations between the pre-snapshot GC and VM freeze leave the summary tree inconsistent. After restore, GC reads corrupted metadata, either panicking (killing PID 1 → kernel panic) or silently failing to collect, causing unbounded heap growth until OOM. Fix: move GC to after all HTTP allocations in PostSnapshotPrepare, then set GOMAXPROCS(1) so any remaining allocations run sequentially with no concurrent page allocator access. GOMAXPROCS is restored on first health check after restore. (2) PostInit timeout starvation: WaitUntilReady and PostInit shared a single 30s context. If WaitUntilReady consumed most of it, PostInit failed: RestoreAfterSnapshot never ran, leaving envd with keep-alives disabled and zombie connections. Fix: separate timeout contexts. (3) CP HTTP server missing timeouts: no ReadHeaderTimeout or IdleTimeout caused goroutine leaks from hung proxy connections. Fix: add both, matching host agent values. Also adds UFFD prefetch to proactively load all guest pages after restore, eliminating on-demand page fault latency for subsequent RPC calls.	2026-05-02 17:22:51 +06:00
pptx704	962860ba74	Pre-pause snapshot signal to prevent Go runtime crash on restore. envd crashes with "fatal error: bad summary data" after Firecracker snapshot/restore because the page allocator radix tree is inconsistent when vCPUs are frozen mid-allocation. The port scanner goroutine allocates heavily every second, making it the primary trigger. Add POST /snapshot/prepare to envd; the host agent calls it before vm.Pause to quiesce continuous goroutines and force GC. On restore, PostInit restarts the port subsystem via the existing /init endpoint. Changes: new PortSubsystem abstraction with Start/Stop/Restart lifecycle; context-based goroutine cancellation (replaces irreversible channel close); context-aware Signal to prevent scanner/forwarder deadlock; fix forwarder goroutine leak (was spinning forever on closed channel); kill socat children on stop to prevent orphans across snapshots; fix double cmd.Wait panic (exec.Command instead of CommandContext).	2026-04-13 05:21:10 +06:00
pptx704	8b5fa3438e	Replace gopsutil port scanner with direct /proc/net/tcp reading. The envd port scanner used gopsutil's net.Connections(), which walks /proc/{pid}/fd to enumerate socket inodes. This corrupts Go runtime semaphore state when the VM is paused mid-operation and restored from a Firecracker snapshot. Replace with a direct /proc/net/tcp + /proc/net/tcp6 parser that reads a single file per address family: no /proc/{pid}/fd walk, no goroutines, no WaitGroups. Also replace concurrent-map (smap) in the scanner with a plain sync.RWMutex-protected map, since concurrent-map's Items() spawns goroutines with a WaitGroup internally, which is equally unsafe across snapshot boundaries. Use socket inode instead of PID for the port forwarding map key, since inode is available directly from /proc/net/tcp without the fd walk.	2026-04-01 15:47:28 +06:00
pptx704	34c89e814d	Added basic license information	2026-03-10 04:28:51 +06:00
pptx704	a3898d68fb	Port envd from e2b with internalized shared packages and Connect RPC. Changes: copy envd source from e2b-dev/infra, internalize shared dependencies into envd/internal/shared/ (keys, filesystem, id, smap, utils); switch from gRPC to Connect RPC for all envd services; update module paths to git.omukk.dev/wrenn/{sandbox,sandbox/envd}; add proto specs (process, filesystem) with buf-based code generation; implement full envd: process exec, filesystem ops, port forwarding, cgroup management, MMDS integration, and HTTP API; update main module dependencies (firecracker SDK, pgx, goose, etc.); remove placeholder .gitkeep files replaced by real implementations.	2026-03-09 21:03:19 +06:00
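
Commit 8b5fa3438e describes replacing the gopsutil scanner with a direct /proc/net/tcp parser keyed by socket inode. The following is a minimal sketch of that idea, not the code from the commit: the `Listener` type and the `parseTCPTable` function are hypothetical names, and the field positions follow the standard Linux /proc/net/tcp layout (local address in column 2, state in column 4, inode in column 10). It reads one file per address family and starts no goroutines.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Listener is a hypothetical record for one listening TCP socket.
type Listener struct {
	Port  uint16
	Inode uint64
}

// tcpListen is the TCP_LISTEN state as printed in /proc/net/tcp.
const tcpListen = "0A"

// parseTCPTable parses the text of /proc/net/tcp or /proc/net/tcp6 and
// returns the listening sockets. It is a sketch, not the envd parser.
func parseTCPTable(content string) ([]Listener, error) {
	var out []Listener
	for i, line := range strings.Split(content, "\n") {
		if i == 0 {
			continue // first line is the column header
		}
		fields := strings.Fields(line)
		if len(fields) < 10 || fields[3] != tcpListen {
			continue // blank/short line, or socket not in LISTEN state
		}
		// local_address is HEXADDR:HEXPORT; the port follows the last colon.
		idx := strings.LastIndexByte(fields[1], ':')
		if idx < 0 {
			return nil, fmt.Errorf("line %d: malformed local address %q", i+1, fields[1])
		}
		port, err := strconv.ParseUint(fields[1][idx+1:], 16, 16)
		if err != nil {
			return nil, fmt.Errorf("line %d: port: %w", i+1, err)
		}
		inode, err := strconv.ParseUint(fields[9], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("line %d: inode: %w", i+1, err)
		}
		out = append(out, Listener{Port: uint16(port), Inode: inode})
	}
	return out, nil
}

func main() {
	for _, path := range []string{"/proc/net/tcp", "/proc/net/tcp6"} {
		data, err := os.ReadFile(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		listeners, err := parseTCPTable(string(data))
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		for _, l := range listeners {
			fmt.Printf("%s port=%d inode=%d\n", path, l.Port, l.Inode)
		}
	}
}
```

Because the inode comes straight from the table row, a map keyed by `Inode` needs no per-process lookup, which is the property the commit message relies on.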

5 Commits
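
Commit 3deecbff89 mentions that WaitUntilReady and PostInit shared a single timeout context, so a slow first step starved the second. The sketch below illustrates the difference between a shared budget and separate budgets. It is an assumption-laden illustration rather than the project's code: `step`, `withOwnTimeout`, `sleepStep`, and `demo` are hypothetical names, and the durations are shortened from the 30s in the commit message so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// step is a hypothetical stand-in for calls such as WaitUntilReady or PostInit.
type step func(ctx context.Context) error

// withOwnTimeout runs s under a fresh deadline derived from parent, so time
// spent in earlier steps does not eat into this step's budget.
func withOwnTimeout(parent context.Context, d time.Duration, s step) error {
	ctx, cancel := context.WithTimeout(parent, d)
	defer cancel()
	return s(ctx)
}

// sleepStep returns a step that takes d to finish unless ctx expires first.
func sleepStep(d time.Duration) step {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// demo runs two 100ms steps twice: once with per-step 150ms budgets and once
// under a single shared 150ms budget. It returns the error from each variant.
func demo() (separateErr, sharedErr error) {
	waitUntilReady := sleepStep(100 * time.Millisecond)
	postInit := sleepStep(100 * time.Millisecond)
	budget := 150 * time.Millisecond

	// Shared budget: the second step only gets what the first left over.
	shared, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()
	if sharedErr = waitUntilReady(shared); sharedErr == nil {
		sharedErr = postInit(shared)
	}

	// Separate budgets: each step gets the full duration.
	bg := context.Background()
	if separateErr = withOwnTimeout(bg, budget, waitUntilReady); separateErr == nil {
		separateErr = withOwnTimeout(bg, budget, postInit)
	}
	return separateErr, sharedErr
}

func main() {
	separateErr, sharedErr := demo()
	fmt.Println("separate budgets:", separateErr)
	fmt.Println("shared budget:", sharedErr,
		"deadline exceeded:", errors.Is(sharedErr, context.DeadlineExceeded))
}
```

With a shared context the second step fails even though it would fit comfortably in its own budget, which mirrors the starvation the commit message describes.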