wrenn-releases

Author	SHA1	Message	Date
pptx704	aca43d51eb	fix: resolve process stream hangs, pause race, and PTY signal loss - Cache terminal EndEvent on ProcessHandle so connect() can detect already-exited processes instead of hanging forever on broadcast receivers that missed the event. Subscribe before checking cache to close the TOCTOU window. - Protect sb.Status writes in Pause with m.mu to prevent data race with concurrent readers (AcquireProxyConn, Exec, etc.). - Restart metrics sampler in restoreRunning so a failed pause attempt doesn't permanently kill sandbox metrics collection. - Return dequeued non-input messages from coalescePtyInput instead of dropping them, preventing silent loss of kill/resize signals during typing bursts.	2026-05-09 18:11:15 +06:00
pptx704	c93ad5e2db	fix: harden pause flow with connection isolation and UFFD event handling Restructure pause to: block new operations (StatusPausing), drain proxy connections with 5s grace, force-close remaining via context cancellation, drop page cache, inflate balloon, then freeze vCPUs. Previously connections could arrive during the pause window and API operations weren't blocked. Handle UFFD_EVENT_REMOVE/UNMAP/REMAP/FORK gracefully instead of crashing the UFFD server. These events fire during balloon deflation on snapshot restore, killing the page fault handler and preventing VM boot. Also adds ConnTracker.ForceClose() with cancellable context propagated through the proxy handler, so lingering proxy connections are actively terminated rather than left dangling.	2026-05-09 14:51:19 +06:00
pptx704	38799770db	fix: inflate balloon before snapshot to reduce memfile size Firecracker dumps the entire VM memory region regardless of guest usage. A 20GB VM using 500MB still produces a ~20GB memfile because freed pages retain stale data (non-zero blocks). Inflate the balloon device before snapshot to reclaim free guest memory. Balloon pages become zero from FC's perspective, allowing ProcessMemfile to skip them. This reduces memfile size from ~20GB to ~1-2GB for lightly-used VMs. - Pause: read guest memory usage, inflate balloon to reclaim free pages, wait 2s for guest kernel to process, then proceed - Resume: deflate balloon to 0 after PostInit so guest gets full memory back - createFromSnapshot: same deflation since template snapshots inherit inflated balloon state - All balloon ops are best-effort with debug logging on failure	2026-05-05 15:38:04 +06:00
pptx704	51b5d7b3ba	fix: resolve pause/snapshot failures and CoW exhaustion on large VMs Remove hard 10s timeout from Firecracker HTTP client — callers already pass context.Context with appropriate deadlines, and 20GB+ memfile writes easily exceed 10s. Ensure CoW file is at least as large as the origin rootfs. Previously, WRENN_DEFAULT_ROOTFS_SIZE=30Gi expanded the base image to 30GB but the default 5GB CoW could not hold all writes, causing dm-snapshot invalidation and EIO on all guest I/O. Destroy frozen VMs in resumeOnError instead of leaving zombies that report "running" but can't execute. Use fresh context for the resume attempt so a cancelled caller context doesn't falsely trigger destroy. Increase CP→Agent ResponseHeaderTimeout from 45s to 5min and PrepareSnapshot timeout from 3s to 30s for large-memory VMs. After failed pause, ping agent to detect destroyed sandboxes and mark DB status as "error" instead of reverting to "running".	2026-05-04 01:46:57 +06:00
pptx704	1178ab8b21	fix: accurate sandbox metrics and memory management Three issues fixed: 1. Memory metrics read host-side VmRSS of the Firecracker process, which includes guest page cache and never decreases. Replaced readMemRSS(fcPID) with readEnvdMemUsed(client) that queries envd's /metrics endpoint for guest-side total - MemAvailable. This matches neofetch and reflects actual process memory. 2. Added Firecracker balloon device (deflate_on_oom, 5s stats) and envd-side periodic page cache reclaimer (drop_caches when >80% used). Reclaimer is gated by snapshot_in_progress flag with sync() before freeze to prevent memory corruption during pause. 3. Sampling interval 500ms → 1s, ring buffer capacities adjusted to maintain same time windows. Reduces per-host HTTP load from 240 calls/sec to 120 calls/sec at 120 capsules. Also: maxDiffGenerations 8 → 1 (merge every re-pause since UFFD lazy-loads anyway), envd mem_used formula uses total - available.	2026-05-03 12:19:01 +06:00
pptx704	1143acd37a	refactor: remove Go envd module, update host agent for Rust envd The Go envd guest agent (`envd/`) is fully replaced by the Rust implementation (`envd-rs/`). This commit removes the Go module and updates all references across the codebase. Makefile: remove ENVD_DIR, VERSION_ENVD, build-envd-go, dev-envd-go, and Go envd from proto/fmt/vet/tidy/clean targets. Add static-link verification to build-envd. Host agent: rewrite snapshot quiesce comments that referenced Go GC and page allocator corruption — no longer applicable with Rust envd. Tighten envdclient to expect HTTP 200 (not 204) from health and file upload endpoints, and require JSON version response from FetchVersion. Remove NOTICE (no e2b-derived code remains). Update CLAUDE.md and README.md to reflect Rust envd architecture.	2026-05-03 03:12:25 +06:00
pptx704	3deecbff89	fix: prevent Go runtime memory corruption and sandbox halt after snapshot restore Three root causes addressed: 1. Go page allocator corruption: allocations between the pre-snapshot GC and VM freeze leave the summary tree inconsistent. After restore, GC reads corrupted metadata — either panicking (killing PID 1 → kernel panic) or silently failing to collect, causing unbounded heap growth until OOM. Fix: move GC to after all HTTP allocations in PostSnapshotPrepare, then set GOMAXPROCS(1) so any remaining allocations run sequentially with no concurrent page allocator access. GOMAXPROCS is restored on first health check after restore. 2. PostInit timeout starvation: WaitUntilReady and PostInit shared a single 30s context. If WaitUntilReady consumed most of it, PostInit failed — RestoreAfterSnapshot never ran, leaving envd with keep-alives disabled and zombie connections. Fix: separate timeout contexts. 3. CP HTTP server missing timeouts: no ReadHeaderTimeout or IdleTimeout caused goroutine leaks from hung proxy connections. Fix: add both, matching host agent values. Also adds UFFD prefetch to proactively load all guest pages after restore, eliminating on-demand page fault latency for subsequent RPC calls.	2026-05-02 17:22:51 +06:00
pptx704	bb582deefa	fix: prevent sandbox halt after resume by fixing HTTP/2 HOL blocking and adding timeouts Disable HTTP/2 on both host agent server and CP→agent transport — multiplexing caused head-of-line blocking when a slow sandbox RPC stalled the shared connection. Add ResponseHeaderTimeout to envd HTTP clients. Merge SetDefaults into Resume's PostInit call to eliminate an extra round-trip that could hang on a stale connection.	2026-05-02 13:48:51 +06:00
pptx704	7ef9a64613	fix: close stale TCP connections across snapshot/restore to prevent envd hangs After Firecracker snapshot restore, zombie TCP sockets from the previous session cause Go runtime corruption inside the guest VM, making envd unresponsive. This manifests as infinite loading in the file browser and terminal timeouts (524) in production (HTTP/2 + Cloudflare) but not locally. Four-part fix: - Add ServerConnTracker to envd that tracks connections via ConnState callback, closes idle connections and disables keep-alives before snapshot, then closes all pre-snapshot zombie connections on restore (while preserving post-restore connections like the /init request) - Split envdclient into timeout (2min) and streaming (no timeout) HTTP clients; use streaming client for file transfers and process RPCs - Close host-side idle envdclient connections before PrepareSnapshot so FIN packets propagate during the 3s quiesce window - Add StreamingHTTPClient() accessor; streaming file transfer handlers in hostagent use it instead of the timeout client	2026-05-02 05:19:37 +06:00
pptx704	f3572f7356	Fix empty WRENN_TEMPLATE_ID after resuming paused sandbox Resume() was building VMConfig without TemplateID, so Firecracker MMDS received an empty string. envd's PostInit then wrote that empty value to /run/wrenn/.WRENN_TEMPLATE_ID. Fix by persisting the template ID in snapshot metadata during Pause and reading it back during Resume.	2026-05-02 04:57:08 +06:00
pptx704	bd98610153	fix: sandbox network responsiveness under port-binding apps Running port-binding applications (Jupyter, http.server, NextJS) inside sandboxes caused severe PTY sluggishness and proxy navigation errors. Root cause: the CP sandbox proxy and Connect RPC pool shared a single HTTP transport. Heavy proxy traffic (Jupyter WebSocket, REST polling) interfered with PTY RPC streams via HTTP/2 flow control contention. Transport isolation (main fix): - Add dedicated proxy transport on CP (NewProxyTransport) with HTTP/2 disabled, separate from the RPC pool transport - Add dedicated proxy transport on host agent, replacing http.DefaultTransport - Add dedicated envdclient transport with tuned connection pooling - Replace http.DefaultClient in file streaming RPCs with per-sandbox envd client Proxy path rewriting (navigation fix): - Add ModifyResponse to rewrite Location headers with /proxy/{id}/{port} prefix, handling both root-relative and absolute-URL redirects - Strip prefix back out in CP subdomain proxy for correct browser behavior - Replace path.Join with string concat in CP Director to preserve trailing slashes (prevents redirect loops on directory listings) Proxy resilience: - Add dial retry with linear backoff (3 attempts) to handle socat startup delay when ports are first detected - Cache ReverseProxy instances per sandbox+port+host in sync.Map - Add EvictProxy callback wired into sandbox Manager.Destroy Buffer and server hardening: - Increase PTY and exec stream channel buffers from 16 to 256 - Add ReadHeaderTimeout (10s) and IdleTimeout (620s) to host agent HTTP server Network tuning: - Set TAP device TxQueueLen to 5000 (up from default 1000) - Add Firecracker tx_rate_limiter (200 MB/s sustained, 100 MB burst) to prevent guest traffic from saturating the TAP	2026-04-25 04:21:55 +06:00
pptx704	339cd7bee1	fix: security and stability fixes from code review - Scope WebSocket auth bypass to only WS endpoints by restructuring routes into separate chi Groups. Non-WS routes no longer pass through unauthenticated requests with spoofed Upgrade headers. Added optionalAPIKeyOrJWT middleware for WS routes (injects auth context from API key/JWT if present, passes through otherwise) and markAdminWS middleware for admin WS routes. - Fix nil pointer dereference in envd Handler.Wait() — p.tty.Close() was called unconditionally but p.tty is nil for non-PTY processes, crashing every non-PTY process exit. - Fix goroutine leak in sandbox Pause — stopSampler was never called, leaking one sampler goroutine per successful pause operation. - Decouple PTY WebSocket reads from RPC dispatch using a buffered channel to prevent backpressure-induced connection drops under fast typing. Includes input coalescing to reduce RPC call volume.	2026-04-24 15:48:38 +06:00
pptx704	a5ad3731f2	Refactored to maintain a separate cloud version Moves 12 packages from internal/ to pkg/ (config, id, validate, events, db, auth, lifecycle, scheduler, channels, audit, service) so they can be imported by the enterprise repo as a Go module dependency. Introduces pkg/cpextension (shared Extension interface + ServerContext) and pkg/cpserver (Run() entrypoint with functional options) so the enterprise main.go can call cpserver.Run(cpserver.WithExtensions(...)) without duplicating the 20-step server bootstrap. Adds db/migrations/embed.go for go:embed access to OSS SQL migrations from the enterprise module. cmd/control-plane/main.go is reduced to a 10-line wrapper around cpserver.Run.	2026-04-15 21:41:48 +06:00
pptx704	5b4fde055c	Fix build recipe execution and flatten reliability - Set HOME in bctx.EnvVars when USER switches so ~ expands correctly in subsequent RUN/WORKDIR steps instead of resolving to /root - Run /bin/sync inside the guest before FlattenRootfs destroys the VM, preventing pip-installed files from being captured as 0-byte due to unflushed page cache - Wrap healthcheck command with su <user> so it runs with the template's default user context (correct HOME, correct UID) - Export Shellescape from the recipe package for use in build service - Add code-runner-beta recipe (Jupyter server with ipykernel --sys-prefix) and replace old python-interpreter-v0-beta	2026-04-15 18:24:54 +06:00
pptx704	516890c49a	Add background process execution API Start long-running processes (web servers, daemons) without blocking the HTTP request. Leverages envd's existing background process support (context.Background(), List, Connect, SendSignal RPCs) and wires it through the host agent and control plane layers. New API surface: - POST /v1/capsules/{id}/exec with background:true → 202 {pid, tag} - GET /v1/capsules/{id}/processes → list running processes - DELETE /v1/capsules/{id}/processes/{selector} → kill by PID or tag - WS /v1/capsules/{id}/processes/{selector}/stream → reconnect to output The {selector} param auto-detects: numeric = PID, string = tag. Tags are auto-generated as "proc-" + 8 hex chars if not provided.	2026-04-14 03:57:01 +06:00
pptx704	962860ba74	Pre-pause snapshot signal to prevent Go runtime crash on restore envd crashes with "fatal error: bad summary data" after Firecracker snapshot/restore because the page allocator radix tree is inconsistent when vCPUs are frozen mid-allocation. The port scanner goroutine allocates heavily every second, making it the primary trigger. Add POST /snapshot/prepare to envd — the host agent calls it before vm.Pause to quiesce continuous goroutines and force GC. On restore, PostInit restarts the port subsystem via the existing /init endpoint. - New PortSubsystem abstraction with Start/Stop/Restart lifecycle - Context-based goroutine cancellation (replaces irreversible channel close) - Context-aware Signal to prevent scanner/forwarder deadlock - Fix forwarder goroutine leak (was spinning forever on closed channel) - Kill socat children on stop to prevent orphans across snapshots - Fix double cmd.Wait panic (exec.Command instead of CommandContext)	2026-04-13 05:21:10 +06:00
pptx704	25b5258841	COPY multi-source support, configurable rootfs size, build fixes - COPY now supports multiple sources: COPY a.txt b.txt /dest/ Last argument is always destination (matches Dockerfile semantics). - COPY resolves relative destinations against current WORKDIR. - WRENN_DEFAULT_ROOTFS_SIZE env var (e.g. 5G, 2Gi, 1000M, 512Mi) controls template rootfs expansion. Used both at agent startup (EnsureImageSizes) and after FlattenRootfs (shrink then re-expand). - Pre-build now sets WORKDIR /home/wrenn-user after USER switch. - Extracted archive files get chmod a+rX for readability. - Path traversal validation on COPY sources.	2026-04-12 03:39:17 +06:00
pptx704	75af2a4f66	Add USER, COPY, ENV persistence to template build system Implement three new recipe commands for the admin template builder: - USER <name>: creates the user (adduser + passwordless sudo), switches execution context so subsequent RUN/START commands run as that user via su wrapping. Last USER becomes the template's default_user. - COPY <src> <dst>: copies files from an uploaded build archive (tar/tar.gz/zip) into the sandbox. Source paths validated against traversal. Ownership set to the current USER. - ENV persistence: accumulated env vars stored in templates.default_env (JSONB) and injected via PostInit when sandboxes are created from the template, mirroring Docker's image metadata approach. Supporting changes: - Pre-build creates wrenn-user as default (via USER command) - WORKDIR now creates the directory if it doesn't exist (mkdir -p) - Per-step progress updates (ProgressFunc callback) for live UI - Multipart form support on POST /v1/admin/builds for archive upload - Proto: default_user/default_env fields on Create/ResumeSandboxRequest - Host agent: SetDefaults calls PostInitWithDefaults on envd - Control plane: reads template defaults, passes on sandbox create/resume - Frontend: file upload widget, recipe copy button, keyword colors for USER/COPY, fixed Svelte whitespace stripping in step display - Admin panel defaults to /admin/templates instead of /admin/hosts - Migration adds default_user and default_env to templates and template_builds tables	2026-04-12 02:10:01 +06:00
pptx704	ab3fc4a807	Add interactive PTY terminal sessions for sandboxes Wire envd's existing PTY process capabilities through the full stack: hostagent proto (4 new RPCs: PtyAttach, PtySendInput, PtyResize, PtyKill), envdclient, sandbox manager, and a new WebSocket endpoint at GET /v1/sandboxes/{id}/pty with bidirectional JSON message protocol. Sessions use tag-based identity for disconnect/reconnect support, base64-encoded PTY data for binary safety, and a 120s inactivity timeout.	2026-04-11 02:42:59 +06:00
pptx704	4ed17b2776	Fix stale WRENN_SANDBOX_ID and WRENN_TEMPLATE_ID after snapshot restore After restoring a VM from snapshot, envd had already completed its initial MMDS poll, so the metadata files in /run/wrenn/ and env vars retained values from the original sandbox. Call POST /init after WaitUntilReady on both resume and create-from-template paths to trigger envd to re-read MMDS.	2026-04-10 19:23:48 +06:00
pptx704	172413e91e	Made changes to accommodate repo url update (#15) Reviewed-on: wrenn/wrenn#15 Co-authored-by: pptx704 <rafeed@omukk.dev> Co-committed-by: pptx704 <rafeed@omukk.dev>	2026-04-09 21:02:44 +00:00
Rafeed M. Bhuiyan	d3e4812e46	v0.0.1 (#8) Co-authored-by: Tasnim Kabir Sadik <tksadik92@gmail.com> Reviewed-on: wrenn/sandbox#8	2026-04-09 19:24:49 +00:00
pptx704	32e5a5a715	Prototype with single host server and no admin panel (#2) Reviewed-on: wrenn/sandbox#2 Co-authored-by: pptx704 <rafeed@omukk.dev> Co-committed-by: pptx704 <rafeed@omukk.dev>	2026-03-22 21:01:23 +00:00
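
Commit aca43d51eb describes caching the terminal EndEvent on ProcessHandle and subscribing before checking the cache, so that connect() neither hangs on an already-exited process nor misses an event published in between. The sketch below is one possible shape of that ordering, not the repository's code: the field names, the Finish and Connect methods, and the one-slot buffered channel are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// EndEvent is a hypothetical stand-in for the terminal process event.
type EndEvent struct{ ExitCode int }

// ProcessHandle caches the end event and keeps a list of live subscribers.
// The layout is assumed; only the names ProcessHandle and EndEvent come
// from the commit message.
type ProcessHandle struct {
	mu   sync.Mutex
	end  *EndEvent
	subs []chan EndEvent
}

// subscribe registers a receiver. Capacity 1 lets Finish send without blocking.
func (h *ProcessHandle) subscribe() <-chan EndEvent {
	ch := make(chan EndEvent, 1)
	h.mu.Lock()
	h.subs = append(h.subs, ch)
	h.mu.Unlock()
	return ch
}

// cachedEnd reports the end event if the process has already exited.
func (h *ProcessHandle) cachedEnd() (EndEvent, bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.end == nil {
		return EndEvent{}, false
	}
	return *h.end, true
}

// Finish caches the event and takes the subscriber list in one critical
// section, then delivers the event to every subscriber registered so far.
func (h *ProcessHandle) Finish(ev EndEvent) {
	h.mu.Lock()
	h.end = &ev
	subs := h.subs
	h.subs = nil
	h.mu.Unlock()
	for _, ch := range subs {
		ch <- ev // each channel receives at most one event, so this never blocks
	}
}

// Connect subscribes first and checks the cache second. A caller that
// registers before Finish gets the broadcast; one that registers after
// Finish finds the cached event. Neither ordering can hang.
func (h *ProcessHandle) Connect() EndEvent {
	ch := h.subscribe()
	if ev, ok := h.cachedEnd(); ok {
		return ev
	}
	return <-ch
}

func main() {
	h := &ProcessHandle{}
	h.Finish(EndEvent{ExitCode: 3})   // process exits before anyone connects
	fmt.Println(h.Connect().ExitCode) // late connect still sees the event: prints 3
}
```

Checking the cache first and subscribing second would reopen the window: the process could exit between the two steps, leaving the receiver waiting on an event that was already broadcast.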

23 Commits