Replace synchronous RPC-based CP-host communication for sandbox
lifecycle operations (Create, Pause, Resume, Destroy) with an async
pattern. CP handlers now return 202 Accepted immediately, fire agent
RPCs in background goroutines, and publish state events to a Redis
Stream. A background consumer processes events as a fallback writer.
Agent-side auto-pause events are pushed to the CP via HTTP callback
(POST /v1/hosts/sandbox-events), keeping Redis internal to the CP.
All DB status transitions use conditional updates
(UpdateSandboxStatusIf, UpdateSandboxRunningIf) to prevent race
conditions between concurrent operations and background goroutines
(sketched below).
The HostMonitor reconciler is kept at 60s as a safety net, extended
to handle transient statuses (starting, pausing, resuming, stopping).
Frontend updated to handle 202 responses with empty bodies and render
transient statuses with blue indicators.
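
The conditional updates are what make the 202-plus-background-goroutine
pattern safe. A minimal sketch of the shape, assuming a Postgres-style
sandboxes table (schema and signature illustrative, not the actual code):

```go
package store

import (
	"context"
	"database/sql"
)

// UpdateSandboxStatusIf turns a status transition into a compare-and-swap:
// the WHERE clause only matches while the row is still in the expected
// `from` state, so a concurrent writer that already moved it causes zero
// rows to be affected and the caller knows it lost the race.
func UpdateSandboxStatusIf(ctx context.Context, db *sql.DB, id, from, to string) (bool, error) {
	res, err := db.ExecContext(ctx,
		`UPDATE sandboxes SET status = $1 WHERE id = $2 AND status = $3`,
		to, id, from)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n == 1, err
}
```

A writer that loses the race simply drops its update; the Redis Stream
consumer (fallback writer) and the HostMonitor reconciler converge the
row afterwards.
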
- Cache terminal EndEvent on ProcessHandle so connect() can detect
already-exited processes instead of hanging forever on broadcast
receivers that missed the event. Subscribe before checking the cache
to close the TOCTOU window (see the sketch after this list).
- Protect sb.Status writes in Pause with m.mu to prevent data race
with concurrent readers (AcquireProxyConn, Exec, etc.).
- Restart metrics sampler in restoreRunning so a failed pause attempt
doesn't permanently kill sandbox metrics collection.
- Return dequeued non-input messages from coalescePtyInput instead of
dropping them, preventing silent loss of kill/resize signals during
typing bursts.
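
The subscribe-before-check ordering in the first item generalizes beyond
the broadcast channels it was written for. A Go-flavored sketch of the
pattern (the actual fix is in the Rust process handle; Handle and
EndEvent here are illustrative types):

```go
package proc

import (
	"sync"
	"sync/atomic"
)

// EndEvent is the cached terminal event; Handle stands in for the real
// process handle.
type EndEvent struct{ ExitCode int }

type Handle struct {
	mu   sync.Mutex
	subs []chan EndEvent
	end  atomic.Pointer[EndEvent]
}

// Connect subscribes FIRST, then checks the cache. In the other order
// there is a window where the process exits between the cache miss and
// the subscription, and the end event is lost forever.
func (h *Handle) Connect() (<-chan EndEvent, *EndEvent) {
	ch := make(chan EndEvent, 1)
	h.mu.Lock()
	h.subs = append(h.subs, ch)
	h.mu.Unlock()
	return ch, h.end.Load()
}

// finish caches the terminal event before fanning it out, so a reader
// either sees the cache hit or receives on its already-registered channel.
func (h *Handle) finish(ev EndEvent) {
	h.end.Store(&ev)
	h.mu.Lock()
	subs := h.subs
	h.subs = nil
	h.mu.Unlock()
	for _, ch := range subs {
		ch <- ev
	}
}
```
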
Fast-exiting processes (e.g. echo) sent data/end events before
start() subscribed to the broadcast channels, causing the stream
to hang indefinitely and the exec RPC to time out with 502.
Move channel subscription into spawn_process, before reader/waiter
threads start, and return pre-subscribed receivers via SpawnedProcess.
The start() and connect() streaming RPCs blocked forever in the data
event loop because ProcessHandle retains a broadcast sender (needed for
reconnection via connect()), preventing the channel from closing.
Race data_rx against end_rx with tokio::select! so the stream terminates
when the process exits. Remaining buffered data is drained before
yielding the end event.
The /init handler's default_user mutation cloned the Defaults struct,
mutated the clone, then dropped it — the actual state was never updated.
This caused processes to always run as "root" regardless of the user
set via POST /init. Additionally, default_workdir was accepted in the
init request but never applied.
Wrap user and workdir fields in RwLock with accessor methods so mutations
propagate correctly through the shared AppState.
Restructure pause to: block new operations (StatusPausing), drain proxy
connections with 5s grace, force-close remaining via context cancellation,
drop page cache, inflate balloon, then freeze vCPUs. Previously connections
could arrive during the pause window and API operations weren't blocked.
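
A sketch of the drain-then-force-close step, with an illustrative
ConnTracker built from a WaitGroup plus a cancellable context (the 5s
grace period is from the commit; the tracker's real API differs):

```go
package pause

import (
	"context"
	"sync"
	"time"
)

// ConnTracker counts in-flight proxy connections and owns a context that
// the proxy handler propagates into each one.
type ConnTracker struct {
	wg     sync.WaitGroup
	ctx    context.Context
	cancel context.CancelFunc
}

func NewConnTracker() *ConnTracker {
	ctx, cancel := context.WithCancel(context.Background())
	return &ConnTracker{ctx: ctx, cancel: cancel}
}

// Track registers a connection; the proxy handler defers the returned
// func when the connection finishes.
func (t *ConnTracker) Track() func() { t.wg.Add(1); return t.wg.Done }

// Context is what the proxy handler passes down to each connection.
func (t *ConnTracker) Context() context.Context { return t.ctx }

// ForceClose cancels the shared context, actively tearing down whatever
// is still running.
func (t *ConnTracker) ForceClose() { t.cancel() }

// drainProxyConns gives in-flight connections a 5s grace period, then
// force-closes the remainder. New operations were already rejected by
// the StatusPausing transition before this runs.
func drainProxyConns(t *ConnTracker) {
	done := make(chan struct{})
	go func() { t.wg.Wait(); close(done) }()
	select {
	case <-done:
	case <-time.After(5 * time.Second):
		t.ForceClose()
	}
}
```
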
Handle UFFD_EVENT_REMOVE/UNMAP/REMAP/FORK gracefully instead of crashing
the UFFD server. These events fire during balloon deflation on snapshot
restore, killing the page fault handler and preventing VM boot.
Also adds ConnTracker.ForceClose() with cancellable context propagated
through the proxy handler, so lingering proxy connections are actively
terminated rather than left dangling.
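
A sketch of the tolerant event dispatch (the event constants are the real
values from linux/userfaultfd.h; the read loop, page serving, and any
per-event bookkeeping are elided):

```go
package uffd

import "log/slog"

// Event codes from linux/userfaultfd.h.
const (
	UFFD_EVENT_PAGEFAULT = 0x12
	UFFD_EVENT_FORK      = 0x13
	UFFD_EVENT_REMAP     = 0x14
	UFFD_EVENT_REMOVE    = 0x15
	UFFD_EVENT_UNMAP     = 0x16
)

// handleEvent keeps the fault handler alive across non-pagefault events.
// Balloon deflation on restore triggers REMOVE/UNMAP/REMAP on the guest
// memory mapping; treating them as fatal killed the server and the VM
// never booted. Skipping is safe here: touched pages simply fault again.
func handleEvent(eventType uint8, servePage func() error) error {
	switch eventType {
	case UFFD_EVENT_PAGEFAULT:
		return servePage()
	case UFFD_EVENT_REMOVE, UFFD_EVENT_UNMAP, UFFD_EVENT_REMAP, UFFD_EVENT_FORK:
		slog.Debug("uffd: ignoring non-pagefault event", "type", eventType)
		return nil
	default:
		slog.Warn("uffd: unknown event type", "type", eventType)
		return nil
	}
}
```
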
Firecracker dumps the entire VM memory region regardless of guest
usage. A 20GB VM using 500MB still produces a ~20GB memfile because
freed pages retain stale data (non-zero blocks).
Inflate the balloon device before snapshot to reclaim free guest
memory. Balloon pages become zero from FC's perspective, allowing
ProcessMemfile to skip them. This reduces memfile size from ~20GB
to ~1-2GB for lightly-used VMs.
- Pause: read guest memory usage, inflate balloon to reclaim free
pages, wait 2s for guest kernel to process, then proceed
- Resume: deflate balloon to 0 after PostInit so guest gets full
memory back
- createFromSnapshot: same deflation since template snapshots
inherit inflated balloon state
- All balloon ops are best-effort with debug logging on failure
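
Balloon sizing goes through Firecracker's PATCH /balloon API. A sketch of
the best-effort inflate call over the API socket (client wiring
simplified; note the client carries no hard timeout, so each call is
bounded only by its context):

```go
package balloon

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"strings"
)

// fcClient talks HTTP to the Firecracker API socket. No Timeout is set;
// callers bound each request with a context deadline.
func fcClient(socketPath string) *http.Client {
	return &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, "unix", socketPath)
		},
	}}
}

// inflateBalloon asks the guest balloon driver to claim amountMiB of
// guest memory. Callers treat failure as best-effort and only log it.
func inflateBalloon(ctx context.Context, c *http.Client, amountMiB int64) error {
	body := strings.NewReader(fmt.Sprintf(`{"amount_mib": %d}`, amountMiB))
	req, err := http.NewRequestWithContext(ctx, http.MethodPatch,
		"http://localhost/balloon", body)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusNoContent {
		return fmt.Errorf("balloon patch: unexpected status %s", resp.Status)
	}
	return nil
}
```

Deflating on resume is the same call with amount_mib set back to 0.
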
Remove hard 10s timeout from Firecracker HTTP client — callers already
pass context.Context with appropriate deadlines, and 20GB+ memfile
writes easily exceed 10s.
Ensure CoW file is at least as large as the origin rootfs. Previously,
WRENN_DEFAULT_ROOTFS_SIZE=30Gi expanded the base image to 30GB but the
default 5GB CoW could not hold all writes, causing dm-snapshot
invalidation and EIO on all guest I/O.
Destroy frozen VMs in resumeOnError instead of leaving zombies that
report "running" but can't execute. Use fresh context for the resume
attempt so a cancelled caller context doesn't falsely trigger destroy.
Increase CP→Agent ResponseHeaderTimeout from 45s to 5min and
PrepareSnapshot timeout from 3s to 30s for large-memory VMs.
After failed pause, ping agent to detect destroyed sandboxes and mark
DB status as "error" instead of reverting to "running".
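
The fresh-context rule in resumeOnError is easy to get wrong, so a sketch
(function shape and the 30s bound are illustrative):

```go
package sandbox

import (
	"context"
	"time"
)

// vm abstracts the two operations the sketch needs.
type vm interface {
	Resume(context.Context) error
	Destroy(context.Context) error
}

// resumeOnError deliberately ignores the possibly-cancelled caller
// context: reusing it would make a client disconnect look like a resume
// failure and destroy a healthy VM.
func resumeOnError(v vm) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := v.Resume(ctx); err != nil {
		// Frozen vCPUs can't serve anything; tear down rather than leave
		// a zombie that reports "running".
		_ = v.Destroy(ctx)
	}
}
```
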
Metrics data was only fetched after Chart.js dynamic import completed,
leaving graphs empty until the first poll interval fired. Now
loadMetrics() runs in parallel with the Chart.js import, and
initCharts() resets the dedup key so pre-fetched data populates
newly created chart instances.
Add Owner column to admin templates table, resolving team IDs to names
via admin teams API. Disable delete for non-platform templates and the
minimal template, with contextual tooltips explaining why.
Add PUT /v1/admin/users/{id}/admin endpoint and frontend UI for
granting and revoking platform admin status. Uses atomic conditional
SQL (RevokeUserAdmin) to prevent race conditions that could remove
the last admin. Includes idempotency check, audit logging, and
confirmation dialog with self-demotion warning.
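
A sketch of the conditional revoke (schema illustrative): the guard and
the update run as a single statement rather than a check-then-update pair:

```go
package admin

import (
	"context"
	"database/sql"
	"errors"
)

var ErrLastAdmin = errors.New("cannot revoke the last platform admin")

// RevokeUserAdmin only demotes the user while at least one other admin
// row exists. Zero rows affected means the user was not an admin (the
// idempotency check catches that earlier) or is the last one standing.
func RevokeUserAdmin(ctx context.Context, db *sql.DB, userID string) error {
	res, err := db.ExecContext(ctx, `
		UPDATE users SET is_admin = FALSE
		WHERE id = $1 AND is_admin = TRUE
		  AND (SELECT COUNT(*) FROM users WHERE is_admin = TRUE) > 1`,
		userID)
	if err != nil {
		return err
	}
	if n, _ := res.RowsAffected(); n == 0 {
		return ErrLastAdmin
	}
	return nil
}
```
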
Linux retains recently used file pages in the page cache, which
Firecracker snapshots as non-zero blocks. A 16GB VM with 12GB of stale
cache would write all
12GB to disk. Dropping pagecache (not dentries/inodes) in
/snapshot/prepare before blocking the reclaimer shrinks snapshots
to actual working set size with minimal resume latency impact.
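
A minimal sketch of that prepare step (the real handler also blocks the
periodic reclaimer first, as noted above):

```go
package quiesce

import (
	"os"
	"syscall"
)

// dropPageCache writes "1" to the kernel's drop_caches knob, which drops
// only the clean page cache ("2" would drop dentries/inodes, "3" both).
// sync() first so dirty pages are written back and become droppable.
func dropPageCache() error {
	syscall.Sync()
	f, err := os.OpenFile("/proc/sys/vm/drop_caches", os.O_WRONLY, 0)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.WriteString("1")
	return err
}
```
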
Three issues fixed:
1. Memory metrics read host-side VmRSS of the Firecracker process,
which includes guest page cache and never decreases. Replaced
readMemRSS(fcPID) with readEnvdMemUsed(client) that queries
envd's /metrics endpoint for guest-side total - MemAvailable.
This matches neofetch and reflects actual guest memory use (see the
sketch below).
2. Added Firecracker balloon device (deflate_on_oom, 5s stats) and
envd-side periodic page cache reclaimer (drop_caches when >80%
used). Reclaimer is gated by snapshot_in_progress flag with
sync() before freeze to prevent memory corruption during pause.
3. Sampling interval 500ms → 1s, ring buffer capacities adjusted
to maintain same time windows. Reduces per-host HTTP load from
240 calls/sec to 120 calls/sec at 120 sandboxes.
Also: maxDiffGenerations 8 → 1 (merge every re-pause since UFFD
lazy-loads anyway), envd mem_used formula uses total - available.
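
A sketch of the guest-side read from item 1 that replaced
readMemRSS(fcPID) (field names on envd's /metrics response are
illustrative):

```go
package metrics

import (
	"context"
	"encoding/json"
	"net/http"
)

// envdMetrics mirrors the two fields this sketch needs.
type envdMetrics struct {
	MemTotalKiB     uint64 `json:"mem_total_kib"`
	MemAvailableKiB uint64 `json:"mem_available_kib"`
}

// readEnvdMemUsed reports the guest's own view of memory pressure,
// total - MemAvailable, in bytes. Unlike the Firecracker process RSS it
// excludes reclaimable guest page cache, so it can go down again.
func readEnvdMemUsed(ctx context.Context, c *http.Client, baseURL string) (uint64, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/metrics", nil)
	if err != nil {
		return 0, err
	}
	resp, err := c.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var m envdMetrics
	if err := json.NewDecoder(resp.Body).Decode(&m); err != nil {
		return 0, err
	}
	return (m.MemTotalKiB - m.MemAvailableKiB) * 1024, nil
}
```
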
Show "[session disconnected]" in terminal when PTY websocket closes cleanly.
Map scheduler and agent unavailability errors to 503 with user-friendly
message instead of leaking internal details.
Three bugs fixed:
1. PTY connections failed because home directory was hardcoded as
/home/{username} instead of reading from /etc/passwd. For root,
this produced /home/root/ which doesn't exist — CWD validation
rejected every PTY Start request without explicit cwd. Fixed all
6 locations to use user.dir from nix::unistd::User.
2. MMDS polling silently failed to parse metadata because the
logs_collector_address field lacked #[serde(default)]. The host
agent only sends instanceID + envID — missing "address" field
caused every deserialize attempt to fail, so .WRENN_SANDBOX_ID
and .WRENN_TEMPLATE_ID were never written. Also added error
logging and create_dir_all before file writes.
3. Metrics CPU values were non-deterministic because a fresh
sysinfo::System was created per request with a 100ms sleep
between reads. Replaced with a background thread that samples
CPU at fixed 1-second intervals via a persistent System instance,
matching gopsutil's internal caching behavior. Metrics endpoint
now reads cached atomic values — no blocking, consistent window.
Also: close master PTY fd in child pre_exec, add process.Start
request logging, bump version to 0.2.0.
The Go envd guest agent (`envd/`) is fully replaced by the Rust
implementation (`envd-rs/`). This commit removes the Go module and
updates all references across the codebase.
Makefile: remove ENVD_DIR, VERSION_ENVD, build-envd-go, dev-envd-go,
and Go envd from proto/fmt/vet/tidy/clean targets. Add static-link
verification to build-envd.
Host agent: rewrite snapshot quiesce comments that referenced Go GC
and page allocator corruption — no longer applicable with Rust envd.
Tighten envdclient to expect HTTP 200 (not 204) from health and file
upload endpoints, and require JSON version response from FetchVersion.
Remove NOTICE (no e2b-derived code remains). Update CLAUDE.md and
README.md to reflect Rust envd architecture.
Three root causes addressed:
1. Go page allocator corruption: allocations between the pre-snapshot GC
and VM freeze leave the summary tree inconsistent. After restore, GC
reads corrupted metadata — either panicking (killing PID 1 → kernel
panic) or silently failing to collect, causing unbounded heap growth
until OOM. Fix: move GC to after all HTTP allocations in
PostSnapshotPrepare, then set GOMAXPROCS(1) so any remaining
allocations run sequentially with no concurrent page allocator access.
GOMAXPROCS is restored on first health check after restore.
2. PostInit timeout starvation: WaitUntilReady and PostInit shared a
single 30s context. If WaitUntilReady consumed most of it, PostInit
failed — RestoreAfterSnapshot never ran, leaving envd with keep-alives
disabled and zombie connections. Fix: separate timeout contexts
(sketched below).
3. CP HTTP server missing timeouts: no ReadHeaderTimeout or IdleTimeout
caused goroutine leaks from hung proxy connections. Fix: add both,
matching host agent values.
Also adds UFFD prefetch to proactively load all guest pages after restore,
eliminating on-demand page fault latency for subsequent RPC calls.
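
A sketch of the separated budgets from fix 2 (durations illustrative; the
client interface is a stand-in for the real envd client):

```go
package hostagent

import (
	"context"
	"time"
)

// envdClient abstracts the two calls this sketch needs.
type envdClient interface {
	WaitUntilReady(context.Context) error
	PostInit(context.Context) error
}

// initAfterRestore gives each phase its own deadline. With the old shared
// 30s context, a slow boot consumed the whole budget and PostInit failed,
// so RestoreAfterSnapshot never re-enabled keep-alives.
func initAfterRestore(parent context.Context, c envdClient) error {
	readyCtx, cancelReady := context.WithTimeout(parent, 30*time.Second)
	defer cancelReady()
	if err := c.WaitUntilReady(readyCtx); err != nil {
		return err
	}
	initCtx, cancelInit := context.WithTimeout(parent, 30*time.Second)
	defer cancelInit()
	return c.PostInit(initCtx)
}
```
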
Disable HTTP/2 on both host agent server and CP→agent transport — multiplexing
caused head-of-line blocking when a slow sandbox RPC stalled the shared connection.
Add ResponseHeaderTimeout to envd HTTP clients. Merge SetDefaults into Resume's
PostInit call to eliminate an extra round-trip that could hang on a stale connection.
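
A sketch of the HTTP/1.1-only client transport (pool values illustrative;
ResponseHeaderTimeout per the earlier change):

```go
package transport

import (
	"crypto/tls"
	"net/http"
	"time"
)

// newAgentTransport disables HTTP/2 so one slow sandbox RPC cannot
// head-of-line block unrelated requests on a shared multiplexed
// connection; each request uses an HTTP/1.1 connection from the pool.
func newAgentTransport() *http.Transport {
	return &http.Transport{
		// A non-nil empty TLSNextProto map switches off HTTP/2 negotiation.
		TLSNextProto:          map[string]func(string, *tls.Conn) http.RoundTripper{},
		ForceAttemptHTTP2:     false,
		ResponseHeaderTimeout: 5 * time.Minute,
		MaxIdleConnsPerHost:   16,
	}
}
```
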
After Firecracker snapshot restore, zombie TCP sockets from the previous
session cause Go runtime corruption inside the guest VM, making envd
unresponsive. This manifests as infinite loading in the file browser and
terminal timeouts (524) in production (HTTP/2 + Cloudflare) but not locally.
Four-part fix:
- Add ServerConnTracker to envd that tracks connections via ConnState callback,
closes idle connections and disables keep-alives before snapshot, then closes
all pre-snapshot zombie connections on restore (while preserving post-restore
connections like the /init request; see the tracker sketch after this list)
- Split envdclient into timeout (2min) and streaming (no timeout) HTTP clients;
use streaming client for file transfers and process RPCs
- Close host-side idle envdclient connections before PrepareSnapshot so FIN
packets propagate during the 3s quiesce window
- Add StreamingHTTPClient() accessor; streaming file transfer handlers in
hostagent use it instead of the timeout client
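
A sketch of the ConnState-based tracker from the first part (this was the
Go-era envd; names and details are simplified):

```go
package envd

import (
	"net"
	"net/http"
	"sync"
)

// ServerConnTracker watches every server connection's lifecycle.
type ServerConnTracker struct {
	mu     sync.Mutex
	conns  map[net.Conn]struct{}
	marked []net.Conn
}

func NewServerConnTracker() *ServerConnTracker {
	return &ServerConnTracker{conns: make(map[net.Conn]struct{})}
}

// ConnState is installed as http.Server.ConnState.
func (t *ServerConnTracker) ConnState(c net.Conn, s http.ConnState) {
	t.mu.Lock()
	defer t.mu.Unlock()
	switch s {
	case http.StateNew:
		t.conns[c] = struct{}{}
	case http.StateClosed, http.StateHijacked:
		delete(t.conns, c)
	}
}

// MarkPreSnapshot records the sockets open just before the VM freezes;
// keep-alives are disabled at the same point so no new requests ride them.
func (t *ServerConnTracker) MarkPreSnapshot() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.marked = t.marked[:0]
	for c := range t.conns {
		t.marked = append(t.marked, c)
	}
}

// CloseMarked runs on restore: the marked sockets are zombies whose peers
// vanished with the old session, while anything opened after restore
// (such as the incoming /init request) is left alone.
func (t *ServerConnTracker) CloseMarked() {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, c := range t.marked {
		c.Close()
		delete(t.conns, c)
	}
	t.marked = nil
}
```

Wired in via `&http.Server{ConnState: tracker.ConnState, ...}`, with
MarkPreSnapshot called on the snapshot-prepare path and CloseMarked on
restore.
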
Resume() was building VMConfig without TemplateID, so Firecracker MMDS
received an empty string. envd's PostInit then wrote that empty value to
/run/wrenn/.WRENN_TEMPLATE_ID. Fix by persisting the template ID in
snapshot metadata during Pause and reading it back during Resume.
Running port-binding applications (Jupyter, http.server, NextJS) inside
sandboxes caused severe PTY sluggishness and proxy navigation errors.
Root cause: the CP sandbox proxy and Connect RPC pool shared a single
HTTP transport. Heavy proxy traffic (Jupyter WebSocket, REST polling)
interfered with PTY RPC streams via HTTP/2 flow control contention.
Transport isolation (main fix):
- Add dedicated proxy transport on CP (NewProxyTransport) with HTTP/2
disabled, separate from the RPC pool transport
- Add dedicated proxy transport on host agent, replacing
http.DefaultTransport
- Add dedicated envdclient transport with tuned connection pooling
- Replace http.DefaultClient in file streaming RPCs with per-sandbox
envd client
Proxy path rewriting (navigation fix):
- Add ModifyResponse to rewrite Location headers with /proxy/{id}/{port}
prefix, handling both root-relative and absolute-URL redirects
(sketched after this list)
- Strip prefix back out in CP subdomain proxy for correct browser
behavior
- Replace path.Join with string concat in CP Director to preserve
trailing slashes (prevents redirect loops on directory listings)
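
A sketch of the Location rewrite (prefix format illustrative; the real
hook also verifies that absolute URLs point back at the sandbox before
touching them):

```go
package proxy

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// rewriteLocation returns a ModifyResponse hook that re-attaches the
// proxy prefix to redirects issued by the guest app, keeping the browser
// on the proxied path.
func rewriteLocation(sandboxID string, port int) func(*http.Response) error {
	prefix := fmt.Sprintf("/proxy/%s/%d", sandboxID, port)
	return func(resp *http.Response) error {
		loc := resp.Header.Get("Location")
		if loc == "" {
			return nil
		}
		u, err := url.Parse(loc)
		switch {
		case err != nil:
			// Leave unparseable values untouched.
		case u.IsAbs():
			// Absolute-URL redirect: keep path and query, re-prefix.
			resp.Header.Set("Location", prefix+u.RequestURI())
		case strings.HasPrefix(loc, "/") && !strings.HasPrefix(loc, prefix):
			// Root-relative redirect not already carrying the prefix.
			resp.Header.Set("Location", prefix+loc)
		}
		return nil
	}
}
```

Installed per cached ReverseProxy as
`rp.ModifyResponse = rewriteLocation(id, port)`.
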
Proxy resilience:
- Add dial retry with linear backoff (3 attempts) to handle socat
startup delay when ports are first detected (sketched after this list)
- Cache ReverseProxy instances per sandbox+port+host in sync.Map
- Add EvictProxy callback wired into sandbox Manager.Destroy
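
A sketch of the retry dial from the first resilience item (delays
illustrative):

```go
package proxy

import (
	"context"
	"net"
	"time"
)

// dialWithRetry covers the window where a port was just detected but
// socat is not listening yet: three attempts with linear backoff
// (100ms, then 200ms) between them.
func dialWithRetry(ctx context.Context, addr string) (net.Conn, error) {
	var d net.Dialer
	for attempt := 1; ; attempt++ {
		conn, err := d.DialContext(ctx, "tcp", addr)
		if err == nil {
			return conn, nil
		}
		if attempt == 3 {
			return nil, err
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(time.Duration(attempt) * 100 * time.Millisecond):
		}
	}
}
```
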
Buffer and server hardening:
- Increase PTY and exec stream channel buffers from 16 to 256
- Add ReadHeaderTimeout (10s) and IdleTimeout (620s) to host agent
HTTP server
Network tuning:
- Set TAP device TxQueueLen to 5000 (up from default 1000)
- Add Firecracker tx_rate_limiter (200 MB/s sustained, 100 MB burst)
to prevent guest traffic from saturating the TAP
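
The tx limiter uses Firecracker's token-bucket rate limiter schema. A
sketch of the config shape matching the stated numbers (field names
follow the Firecracker API; the MB-vs-MiB accounting is approximate):

```go
package netcfg

// TokenBucket mirrors Firecracker's rate limiter bucket: size and
// one_time_burst are in bytes, refill_time in milliseconds.
type TokenBucket struct {
	Size         int64 `json:"size"`
	OneTimeBurst int64 `json:"one_time_burst,omitempty"`
	RefillTime   int64 `json:"refill_time"`
}

// RateLimiter is attached to the network interface as tx_rate_limiter.
type RateLimiter struct {
	Bandwidth *TokenBucket `json:"bandwidth,omitempty"`
}

// Refilling 200 MB of tokens every 1000ms caps sustained egress at
// ~200 MB/s, with a one-time 100 MB burst allowance on top.
var txLimiter = RateLimiter{
	Bandwidth: &TokenBucket{
		Size:         200 << 20,
		OneTimeBurst: 100 << 20,
		RefillTime:   1000,
	},
}
```
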