Replace synchronous RPC-based CP-host communication for sandbox
lifecycle operations (Create, Pause, Resume, Destroy) with an async
pattern. CP handlers now return 202 Accepted immediately, fire agent
RPCs in background goroutines, and publish state events to a Redis
Stream. A background consumer processes events as a fallback writer.
Agent-side auto-pause events are pushed to the CP via HTTP callback
(POST /v1/hosts/sandbox-events), keeping Redis internal to the CP.
All DB status transitions use conditional updates
(UpdateSandboxStatusIf, UpdateSandboxRunningIf) to prevent race
conditions between concurrent operations and background goroutines
(sketched below).
The HostMonitor reconciler is kept at 60s as a safety net, extended
to handle transient statuses (starting, pausing, resuming, stopping).
Frontend updated to handle 202 responses with empty bodies and render
transient statuses with blue indicators.
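
The conditional updates are what make the 202-plus-background-goroutine
pattern safe. A minimal sketch of the shape, assuming a Postgres-style
sandboxes table (schema and signature illustrative, not the actual code):

```go
package store

import (
	"context"
	"database/sql"
)

// UpdateSandboxStatusIf turns a status transition into a compare-and-swap:
// the WHERE clause only matches while the row is still in the expected
// `from` state, so a concurrent writer that already moved it causes zero
// rows to be affected and the caller knows it lost the race.
func UpdateSandboxStatusIf(ctx context.Context, db *sql.DB, id, from, to string) (bool, error) {
	res, err := db.ExecContext(ctx,
		`UPDATE sandboxes SET status = $1 WHERE id = $2 AND status = $3`,
		to, id, from)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n == 1, err
}
```

A writer that loses the race simply drops its update; the Redis Stream
consumer (fallback writer) and the HostMonitor reconciler converge the
row afterwards.
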
- Cache terminal EndEvent on ProcessHandle so connect() can detect
already-exited processes instead of hanging forever on broadcast
receivers that missed the event. Subscribe before checking the cache
to close the TOCTOU window (see the sketch after this list).
- Protect sb.Status writes in Pause with m.mu to prevent data race
with concurrent readers (AcquireProxyConn, Exec, etc.).
- Restart metrics sampler in restoreRunning so a failed pause attempt
doesn't permanently kill sandbox metrics collection.
- Return dequeued non-input messages from coalescePtyInput instead of
dropping them, preventing silent loss of kill/resize signals during
typing bursts.
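
The subscribe-before-check ordering in the first item generalizes beyond
the broadcast channels it was written for. A Go-flavored sketch of the
pattern (the actual fix is in the Rust process handle; Handle and
EndEvent here are illustrative types):

```go
package proc

import (
	"sync"
	"sync/atomic"
)

// EndEvent is the cached terminal event; Handle stands in for the real
// process handle.
type EndEvent struct{ ExitCode int }

type Handle struct {
	mu   sync.Mutex
	subs []chan EndEvent
	end  atomic.Pointer[EndEvent]
}

// Connect subscribes FIRST, then checks the cache. In the other order
// there is a window where the process exits between the cache miss and
// the subscription, and the end event is lost forever.
func (h *Handle) Connect() (<-chan EndEvent, *EndEvent) {
	ch := make(chan EndEvent, 1)
	h.mu.Lock()
	h.subs = append(h.subs, ch)
	h.mu.Unlock()
	return ch, h.end.Load()
}

// finish caches the terminal event before fanning it out, so a reader
// either sees the cache hit or receives on its already-registered channel.
func (h *Handle) finish(ev EndEvent) {
	h.end.Store(&ev)
	h.mu.Lock()
	subs := h.subs
	h.subs = nil
	h.mu.Unlock()
	for _, ch := range subs {
		ch <- ev
	}
}
```
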
Fast-exiting processes (e.g. echo) sent data/end events before
start() subscribed to the broadcast channels, causing the stream
to hang indefinitely and the exec RPC to time out with 502.
Move channel subscription into spawn_process, before reader/waiter
threads start, and return pre-subscribed receivers via SpawnedProcess.
The start() and connect() streaming RPCs blocked forever in the data
event loop because ProcessHandle retains a broadcast sender (needed for
reconnection via connect()), preventing the channel from closing.
Race data_rx against end_rx with tokio::select! so the stream terminates
when the process exits. Remaining buffered data is drained before
yielding the end event.
The /init handler's default_user mutation cloned the Defaults struct,
mutated the clone, then dropped it — the actual state was never updated.
This caused processes to always run as "root" regardless of the user
set via POST /init. Additionally, default_workdir was accepted in the
init request but never applied.
Wrap user and workdir fields in RwLock with accessor methods so mutations
propagate correctly through the shared AppState.
Restructure pause to: block new operations (StatusPausing), drain proxy
connections with 5s grace, force-close remaining via context cancellation,
drop page cache, inflate balloon, then freeze vCPUs. Previously connections
could arrive during the pause window and API operations weren't blocked.
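
A sketch of the drain-then-force-close step, with an illustrative
ConnTracker built from a WaitGroup plus a cancellable context (the 5s
grace period is from the commit; the tracker's real API differs):

```go
package pause

import (
	"context"
	"sync"
	"time"
)

// ConnTracker counts in-flight proxy connections and owns a context that
// the proxy handler propagates into each one.
type ConnTracker struct {
	wg     sync.WaitGroup
	ctx    context.Context
	cancel context.CancelFunc
}

func NewConnTracker() *ConnTracker {
	ctx, cancel := context.WithCancel(context.Background())
	return &ConnTracker{ctx: ctx, cancel: cancel}
}

// Track registers a connection; the proxy handler defers the returned
// func when the connection finishes.
func (t *ConnTracker) Track() func() { t.wg.Add(1); return t.wg.Done }

// Context is what the proxy handler passes down to each connection.
func (t *ConnTracker) Context() context.Context { return t.ctx }

// ForceClose cancels the shared context, actively tearing down whatever
// is still running.
func (t *ConnTracker) ForceClose() { t.cancel() }

// drainProxyConns gives in-flight connections a 5s grace period, then
// force-closes the remainder. New operations were already rejected by
// the StatusPausing transition before this runs.
func drainProxyConns(t *ConnTracker) {
	done := make(chan struct{})
	go func() { t.wg.Wait(); close(done) }()
	select {
	case <-done:
	case <-time.After(5 * time.Second):
		t.ForceClose()
	}
}
```
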
Handle UFFD_EVENT_REMOVE/UNMAP/REMAP/FORK gracefully instead of crashing
the UFFD server. These events fire during balloon deflation on snapshot
restore, killing the page fault handler and preventing VM boot.
Also adds ConnTracker.ForceClose() with cancellable context propagated
through the proxy handler, so lingering proxy connections are actively
terminated rather than left dangling.
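
A sketch of the tolerant event dispatch (the event constants are the real
values from linux/userfaultfd.h; the read loop, page serving, and any
per-event bookkeeping are elided):

```go
package uffd

import "log/slog"

// Event codes from linux/userfaultfd.h.
const (
	UFFD_EVENT_PAGEFAULT = 0x12
	UFFD_EVENT_FORK      = 0x13
	UFFD_EVENT_REMAP     = 0x14
	UFFD_EVENT_REMOVE    = 0x15
	UFFD_EVENT_UNMAP     = 0x16
)

// handleEvent keeps the fault handler alive across non-pagefault events.
// Balloon deflation on restore triggers REMOVE/UNMAP/REMAP on the guest
// memory mapping; treating them as fatal killed the server and the VM
// never booted. Skipping is safe here: touched pages simply fault again.
func handleEvent(eventType uint8, servePage func() error) error {
	switch eventType {
	case UFFD_EVENT_PAGEFAULT:
		return servePage()
	case UFFD_EVENT_REMOVE, UFFD_EVENT_UNMAP, UFFD_EVENT_REMAP, UFFD_EVENT_FORK:
		slog.Debug("uffd: ignoring non-pagefault event", "type", eventType)
		return nil
	default:
		slog.Warn("uffd: unknown event type", "type", eventType)
		return nil
	}
}
```
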
Firecracker dumps the entire VM memory region regardless of guest
usage. A 20GB VM using 500MB still produces a ~20GB memfile because
freed pages retain stale data (non-zero blocks).
Inflate the balloon device before snapshot to reclaim free guest
memory. Balloon pages become zero from FC's perspective, allowing
ProcessMemfile to skip them. This reduces memfile size from ~20GB
to ~1-2GB for lightly-used VMs.
- Pause: read guest memory usage, inflate balloon to reclaim free
pages, wait 2s for guest kernel to process, then proceed
- Resume: deflate balloon to 0 after PostInit so guest gets full
memory back
- createFromSnapshot: same deflation since template snapshots
inherit inflated balloon state
- All balloon ops are best-effort with debug logging on failure
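
Balloon sizing goes through Firecracker's PATCH /balloon API. A sketch of
the best-effort inflate call over the API socket (client wiring
simplified; note the client carries no hard timeout, so each call is
bounded only by its context):

```go
package balloon

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"strings"
)

// fcClient talks HTTP to the Firecracker API socket. No Timeout is set;
// callers bound each request with a context deadline.
func fcClient(socketPath string) *http.Client {
	return &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, "unix", socketPath)
		},
	}}
}

// inflateBalloon asks the guest balloon driver to claim amountMiB of
// guest memory. Callers treat failure as best-effort and only log it.
func inflateBalloon(ctx context.Context, c *http.Client, amountMiB int64) error {
	body := strings.NewReader(fmt.Sprintf(`{"amount_mib": %d}`, amountMiB))
	req, err := http.NewRequestWithContext(ctx, http.MethodPatch,
		"http://localhost/balloon", body)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusNoContent {
		return fmt.Errorf("balloon patch: unexpected status %s", resp.Status)
	}
	return nil
}
```

Deflating on resume is the same call with amount_mib set back to 0.
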
Remove hard 10s timeout from Firecracker HTTP client — callers already
pass context.Context with appropriate deadlines, and 20GB+ memfile
writes easily exceed 10s.
Ensure CoW file is at least as large as the origin rootfs. Previously,
WRENN_DEFAULT_ROOTFS_SIZE=30Gi expanded the base image to 30GB but the
default 5GB CoW could not hold all writes, causing dm-snapshot
invalidation and EIO on all guest I/O.
Destroy frozen VMs in resumeOnError instead of leaving zombies that
report "running" but can't execute. Use fresh context for the resume
attempt so a cancelled caller context doesn't falsely trigger destroy.
Increase CP→Agent ResponseHeaderTimeout from 45s to 5min and
PrepareSnapshot timeout from 3s to 30s for large-memory VMs.
After failed pause, ping agent to detect destroyed sandboxes and mark
DB status as "error" instead of reverting to "running".
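
The fresh-context rule in resumeOnError is easy to get wrong, so a sketch
(function shape and the 30s bound are illustrative):

```go
package sandbox

import (
	"context"
	"time"
)

// vm abstracts the two operations the sketch needs.
type vm interface {
	Resume(context.Context) error
	Destroy(context.Context) error
}

// resumeOnError deliberately ignores the possibly-cancelled caller
// context: reusing it would make a client disconnect look like a resume
// failure and destroy a healthy VM.
func resumeOnError(v vm) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := v.Resume(ctx); err != nil {
		// Frozen vCPUs can't serve anything; tear down rather than leave
		// a zombie that reports "running".
		_ = v.Destroy(ctx)
	}
}
```
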
Metrics data was only fetched after Chart.js dynamic import completed,
leaving graphs empty until the first poll interval fired. Now
loadMetrics() runs in parallel with the Chart.js import, and
initCharts() resets the dedup key so pre-fetched data populates
newly created chart instances.
Add Owner column to admin templates table, resolving team IDs to names
via admin teams API. Disable delete for non-platform templates and the
minimal template, with contextual tooltips explaining why.
Add PUT /v1/admin/users/{id}/admin endpoint and frontend UI for
granting and revoking platform admin status. Uses atomic conditional
SQL (RevokeUserAdmin) to prevent race conditions that could remove
the last admin. Includes idempotency check, audit logging, and
confirmation dialog with self-demotion warning.
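
A sketch of the conditional revoke (schema illustrative): the guard and
the update run as a single statement rather than a check-then-update pair:

```go
package admin

import (
	"context"
	"database/sql"
	"errors"
)

var ErrLastAdmin = errors.New("cannot revoke the last platform admin")

// RevokeUserAdmin only demotes the user while at least one other admin
// row exists. Zero rows affected means the user was not an admin (the
// idempotency check catches that earlier) or is the last one standing.
func RevokeUserAdmin(ctx context.Context, db *sql.DB, userID string) error {
	res, err := db.ExecContext(ctx, `
		UPDATE users SET is_admin = FALSE
		WHERE id = $1 AND is_admin = TRUE
		  AND (SELECT COUNT(*) FROM users WHERE is_admin = TRUE) > 1`,
		userID)
	if err != nil {
		return err
	}
	if n, _ := res.RowsAffected(); n == 0 {
		return ErrLastAdmin
	}
	return nil
}
```
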
Linux retains recently used file pages in the page cache, which
Firecracker snapshots as non-zero blocks. A 16GB VM with 12GB of stale
cache would write all
12GB to disk. Dropping pagecache (not dentries/inodes) in
/snapshot/prepare before blocking the reclaimer shrinks snapshots
to actual working set size with minimal resume latency impact.
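
A minimal sketch of that prepare step (the real handler also blocks the
periodic reclaimer first, as noted above):

```go
package quiesce

import (
	"os"
	"syscall"
)

// dropPageCache writes "1" to the kernel's drop_caches knob, which drops
// only the clean page cache ("2" would drop dentries/inodes, "3" both).
// sync() first so dirty pages are written back and become droppable.
func dropPageCache() error {
	syscall.Sync()
	f, err := os.OpenFile("/proc/sys/vm/drop_caches", os.O_WRONLY, 0)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.WriteString("1")
	return err
}
```
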
Three issues fixed:
1. Memory metrics read host-side VmRSS of the Firecracker process,
which includes guest page cache and never decreases. Replaced
readMemRSS(fcPID) with readEnvdMemUsed(client) that queries
envd's /metrics endpoint for guest-side total - MemAvailable.
This matches neofetch and reflects actual guest memory use (see the
sketch below).
2. Added Firecracker balloon device (deflate_on_oom, 5s stats) and
envd-side periodic page cache reclaimer (drop_caches when >80%
used). Reclaimer is gated by snapshot_in_progress flag with
sync() before freeze to prevent memory corruption during pause.
3. Sampling interval 500ms → 1s, ring buffer capacities adjusted
to maintain same time windows. Reduces per-host HTTP load from
240 calls/sec to 120 calls/sec at 120 sandboxes.
Also: maxDiffGenerations 8 → 1 (merge every re-pause since UFFD
lazy-loads anyway), envd mem_used formula uses total - available.
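
A sketch of the guest-side read from item 1 that replaced
readMemRSS(fcPID) (field names on envd's /metrics response are
illustrative):

```go
package metrics

import (
	"context"
	"encoding/json"
	"net/http"
)

// envdMetrics mirrors the two fields this sketch needs.
type envdMetrics struct {
	MemTotalKiB     uint64 `json:"mem_total_kib"`
	MemAvailableKiB uint64 `json:"mem_available_kib"`
}

// readEnvdMemUsed reports the guest's own view of memory pressure,
// total - MemAvailable, in bytes. Unlike the Firecracker process RSS it
// excludes reclaimable guest page cache, so it can go down again.
func readEnvdMemUsed(ctx context.Context, c *http.Client, baseURL string) (uint64, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/metrics", nil)
	if err != nil {
		return 0, err
	}
	resp, err := c.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var m envdMetrics
	if err := json.NewDecoder(resp.Body).Decode(&m); err != nil {
		return 0, err
	}
	return (m.MemTotalKiB - m.MemAvailableKiB) * 1024, nil
}
```
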
Show "[session disconnected]" in terminal when PTY websocket closes cleanly.
Map scheduler and agent unavailability errors to 503 with user-friendly
message instead of leaking internal details.
Three bugs fixed:
1. PTY connections failed because home directory was hardcoded as
/home/{username} instead of reading from /etc/passwd. For root,
this produced /home/root/ which doesn't exist — CWD validation
rejected every PTY Start request without explicit cwd. Fixed all
6 locations to use user.dir from nix::unistd::User.
2. MMDS polling silently failed to parse metadata because the
logs_collector_address field lacked #[serde(default)]. The host
agent only sends instanceID + envID — missing "address" field
caused every deserialize attempt to fail, so .WRENN_SANDBOX_ID
and .WRENN_TEMPLATE_ID were never written. Also added error
logging and create_dir_all before file writes.
3. Metrics CPU values were non-deterministic because a fresh
sysinfo::System was created per request with a 100ms sleep
between reads. Replaced with a background thread that samples
CPU at fixed 1-second intervals via a persistent System instance,
matching gopsutil's internal caching behavior. Metrics endpoint
now reads cached atomic values — no blocking, consistent window.
Also: close master PTY fd in child pre_exec, add process.Start
request logging, bump version to 0.2.0.
The Go envd guest agent (`envd/`) is fully replaced by the Rust
implementation (`envd-rs/`). This commit removes the Go module and
updates all references across the codebase.
Makefile: remove ENVD_DIR, VERSION_ENVD, build-envd-go, dev-envd-go,
and Go envd from proto/fmt/vet/tidy/clean targets. Add static-link
verification to build-envd.
Host agent: rewrite snapshot quiesce comments that referenced Go GC
and page allocator corruption — no longer applicable with Rust envd.
Tighten envdclient to expect HTTP 200 (not 204) from health and file
upload endpoints, and require JSON version response from FetchVersion.
Remove NOTICE (no e2b-derived code remains). Update CLAUDE.md and
README.md to reflect Rust envd architecture.
Three root causes addressed:
1. Go page allocator corruption: allocations between the pre-snapshot GC
and VM freeze leave the summary tree inconsistent. After restore, GC
reads corrupted metadata — either panicking (killing PID 1 → kernel
panic) or silently failing to collect, causing unbounded heap growth
until OOM. Fix: move GC to after all HTTP allocations in
PostSnapshotPrepare, then set GOMAXPROCS(1) so any remaining
allocations run sequentially with no concurrent page allocator access.
GOMAXPROCS is restored on first health check after restore.
2. PostInit timeout starvation: WaitUntilReady and PostInit shared a
single 30s context. If WaitUntilReady consumed most of it, PostInit
failed — RestoreAfterSnapshot never ran, leaving envd with keep-alives
disabled and zombie connections. Fix: separate timeout contexts
(sketched below).
3. CP HTTP server missing timeouts: no ReadHeaderTimeout or IdleTimeout
caused goroutine leaks from hung proxy connections. Fix: add both,
matching host agent values.
Also adds UFFD prefetch to proactively load all guest pages after restore,
eliminating on-demand page fault latency for subsequent RPC calls.
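
A sketch of the separated budgets from fix 2 (durations illustrative; the
client interface is a stand-in for the real envd client):

```go
package hostagent

import (
	"context"
	"time"
)

// envdClient abstracts the two calls this sketch needs.
type envdClient interface {
	WaitUntilReady(context.Context) error
	PostInit(context.Context) error
}

// initAfterRestore gives each phase its own deadline. With the old shared
// 30s context, a slow boot consumed the whole budget and PostInit failed,
// so RestoreAfterSnapshot never re-enabled keep-alives.
func initAfterRestore(parent context.Context, c envdClient) error {
	readyCtx, cancelReady := context.WithTimeout(parent, 30*time.Second)
	defer cancelReady()
	if err := c.WaitUntilReady(readyCtx); err != nil {
		return err
	}
	initCtx, cancelInit := context.WithTimeout(parent, 30*time.Second)
	defer cancelInit()
	return c.PostInit(initCtx)
}
```
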
Disable HTTP/2 on both host agent server and CP→agent transport — multiplexing
caused head-of-line blocking when a slow sandbox RPC stalled the shared connection.
Add ResponseHeaderTimeout to envd HTTP clients. Merge SetDefaults into Resume's
PostInit call to eliminate an extra round-trip that could hang on a stale connection.
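
A sketch of the HTTP/1.1-only client transport (pool values illustrative;
ResponseHeaderTimeout per the earlier change):

```go
package transport

import (
	"crypto/tls"
	"net/http"
	"time"
)

// newAgentTransport disables HTTP/2 so one slow sandbox RPC cannot
// head-of-line block unrelated requests on a shared multiplexed
// connection; each request uses an HTTP/1.1 connection from the pool.
func newAgentTransport() *http.Transport {
	return &http.Transport{
		// A non-nil empty TLSNextProto map switches off HTTP/2 negotiation.
		TLSNextProto:          map[string]func(string, *tls.Conn) http.RoundTripper{},
		ForceAttemptHTTP2:     false,
		ResponseHeaderTimeout: 5 * time.Minute,
		MaxIdleConnsPerHost:   16,
	}
}
```
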
After Firecracker snapshot restore, zombie TCP sockets from the previous
session cause Go runtime corruption inside the guest VM, making envd
unresponsive. This manifests as infinite loading in the file browser and
terminal timeouts (524) in production (HTTP/2 + Cloudflare) but not locally.
Four-part fix:
- Add ServerConnTracker to envd that tracks connections via ConnState callback,
closes idle connections and disables keep-alives before snapshot, then closes
all pre-snapshot zombie connections on restore (while preserving post-restore
connections like the /init request; see the tracker sketch after this list)
- Split envdclient into timeout (2min) and streaming (no timeout) HTTP clients;
use streaming client for file transfers and process RPCs
- Close host-side idle envdclient connections before PrepareSnapshot so FIN
packets propagate during the 3s quiesce window
- Add StreamingHTTPClient() accessor; streaming file transfer handlers in
hostagent use it instead of the timeout client
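
A sketch of the ConnState-based tracker from the first part (this was the
Go-era envd; names and details are simplified):

```go
package envd

import (
	"net"
	"net/http"
	"sync"
)

// ServerConnTracker watches every server connection's lifecycle.
type ServerConnTracker struct {
	mu     sync.Mutex
	conns  map[net.Conn]struct{}
	marked []net.Conn
}

func NewServerConnTracker() *ServerConnTracker {
	return &ServerConnTracker{conns: make(map[net.Conn]struct{})}
}

// ConnState is installed as http.Server.ConnState.
func (t *ServerConnTracker) ConnState(c net.Conn, s http.ConnState) {
	t.mu.Lock()
	defer t.mu.Unlock()
	switch s {
	case http.StateNew:
		t.conns[c] = struct{}{}
	case http.StateClosed, http.StateHijacked:
		delete(t.conns, c)
	}
}

// MarkPreSnapshot records the sockets open just before the VM freezes;
// keep-alives are disabled at the same point so no new requests ride them.
func (t *ServerConnTracker) MarkPreSnapshot() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.marked = t.marked[:0]
	for c := range t.conns {
		t.marked = append(t.marked, c)
	}
}

// CloseMarked runs on restore: the marked sockets are zombies whose peers
// vanished with the old session, while anything opened after restore
// (such as the incoming /init request) is left alone.
func (t *ServerConnTracker) CloseMarked() {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, c := range t.marked {
		c.Close()
		delete(t.conns, c)
	}
	t.marked = nil
}
```

Wired in via `&http.Server{ConnState: tracker.ConnState, ...}`, with
MarkPreSnapshot called on the snapshot-prepare path and CloseMarked on
restore.
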
Resume() was building VMConfig without TemplateID, so Firecracker MMDS
received an empty string. envd's PostInit then wrote that empty value to
/run/wrenn/.WRENN_TEMPLATE_ID. Fix by persisting the template ID in
snapshot metadata during Pause and reading it back during Resume.
Running port-binding applications (Jupyter, http.server, NextJS) inside
sandboxes caused severe PTY sluggishness and proxy navigation errors.
Root cause: the CP sandbox proxy and Connect RPC pool shared a single
HTTP transport. Heavy proxy traffic (Jupyter WebSocket, REST polling)
interfered with PTY RPC streams via HTTP/2 flow control contention.
Transport isolation (main fix):
- Add dedicated proxy transport on CP (NewProxyTransport) with HTTP/2
disabled, separate from the RPC pool transport
- Add dedicated proxy transport on host agent, replacing
http.DefaultTransport
- Add dedicated envdclient transport with tuned connection pooling
- Replace http.DefaultClient in file streaming RPCs with per-sandbox
envd client
Proxy path rewriting (navigation fix):
- Add ModifyResponse to rewrite Location headers with /proxy/{id}/{port}
prefix, handling both root-relative and absolute-URL redirects
(sketched after this list)
- Strip prefix back out in CP subdomain proxy for correct browser
behavior
- Replace path.Join with string concat in CP Director to preserve
trailing slashes (prevents redirect loops on directory listings)
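
A sketch of the Location rewrite (prefix format illustrative; the real
hook also verifies that absolute URLs point back at the sandbox before
touching them):

```go
package proxy

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// rewriteLocation returns a ModifyResponse hook that re-attaches the
// proxy prefix to redirects issued by the guest app, keeping the browser
// on the proxied path.
func rewriteLocation(sandboxID string, port int) func(*http.Response) error {
	prefix := fmt.Sprintf("/proxy/%s/%d", sandboxID, port)
	return func(resp *http.Response) error {
		loc := resp.Header.Get("Location")
		if loc == "" {
			return nil
		}
		u, err := url.Parse(loc)
		switch {
		case err != nil:
			// Leave unparseable values untouched.
		case u.IsAbs():
			// Absolute-URL redirect: keep path and query, re-prefix.
			resp.Header.Set("Location", prefix+u.RequestURI())
		case strings.HasPrefix(loc, "/") && !strings.HasPrefix(loc, prefix):
			// Root-relative redirect not already carrying the prefix.
			resp.Header.Set("Location", prefix+loc)
		}
		return nil
	}
}
```

Installed per cached ReverseProxy as
`rp.ModifyResponse = rewriteLocation(id, port)`.
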
Proxy resilience:
- Add dial retry with linear backoff (3 attempts) to handle socat
startup delay when ports are first detected (sketched after this list)
- Cache ReverseProxy instances per sandbox+port+host in sync.Map
- Add EvictProxy callback wired into sandbox Manager.Destroy
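
A sketch of the retry dial from the first resilience item (delays
illustrative):

```go
package proxy

import (
	"context"
	"net"
	"time"
)

// dialWithRetry covers the window where a port was just detected but
// socat is not listening yet: three attempts with linear backoff
// (100ms, then 200ms) between them.
func dialWithRetry(ctx context.Context, addr string) (net.Conn, error) {
	var d net.Dialer
	for attempt := 1; ; attempt++ {
		conn, err := d.DialContext(ctx, "tcp", addr)
		if err == nil {
			return conn, nil
		}
		if attempt == 3 {
			return nil, err
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(time.Duration(attempt) * 100 * time.Millisecond):
		}
	}
}
```
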
Buffer and server hardening:
- Increase PTY and exec stream channel buffers from 16 to 256
- Add ReadHeaderTimeout (10s) and IdleTimeout (620s) to host agent
HTTP server
Network tuning:
- Set TAP device TxQueueLen to 5000 (up from default 1000)
- Add Firecracker tx_rate_limiter (200 MB/s sustained, 100 MB burst)
to prevent guest traffic from saturating the TAP
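
The tx limiter uses Firecracker's token-bucket rate limiter schema. A
sketch of the config shape matching the stated numbers (field names
follow the Firecracker API; the MB-vs-MiB accounting is approximate):

```go
package netcfg

// TokenBucket mirrors Firecracker's rate limiter bucket: size and
// one_time_burst are in bytes, refill_time in milliseconds.
type TokenBucket struct {
	Size         int64 `json:"size"`
	OneTimeBurst int64 `json:"one_time_burst,omitempty"`
	RefillTime   int64 `json:"refill_time"`
}

// RateLimiter is attached to the network interface as tx_rate_limiter.
type RateLimiter struct {
	Bandwidth *TokenBucket `json:"bandwidth,omitempty"`
}

// Refilling 200 MB of tokens every 1000ms caps sustained egress at
// ~200 MB/s, with a one-time 100 MB burst allowance on top.
var txLimiter = RateLimiter{
	Bandwidth: &TokenBucket{
		Size:         200 << 20,
		OneTimeBurst: 100 << 20,
		RefillTime:   1000,
	},
}
```
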