Fix resource leaks, race conditions, and error handling across host
agent and control plane: proper sparse file cleanup on close error,
connect error wrapping for MakeDir, CoW file cleanup on pause failure,
per-sandbox VM directories, deferred map deletion to avoid race in VM
destroy, and goroutine launch for extension background workers.
Disable HTTP/2 on both host agent server and CP→agent transport — multiplexing
caused head-of-line blocking when a slow sandbox RPC stalled the shared connection.
Add ResponseHeaderTimeout to envd HTTP clients. Merge SetDefaults into Resume's
PostInit call to eliminate an extra round-trip that could hang on a stale connection.
After Firecracker snapshot restore, zombie TCP sockets from the previous
session cause Go runtime corruption inside the guest VM, making envd
unresponsive. This manifests as infinite loading in the file browser and
terminal timeouts (524) in production (HTTP/2 + Cloudflare) but not locally.
Four-part fix:
- Add ServerConnTracker to envd that tracks connections via ConnState callback,
closes idle connections and disables keep-alives before snapshot, then closes
all pre-snapshot zombie connections on restore (while preserving post-restore
connections like the /init request)
- Split envdclient into timeout (2min) and streaming (no timeout) HTTP clients;
use streaming client for file transfers and process RPCs
- Close host-side idle envdclient connections before PrepareSnapshot so FIN
packets propagate during the 3s quiesce window
- Add StreamingHTTPClient() accessor; streaming file transfer handlers in
hostagent use it instead of the timeout client
Running port-binding applications (Jupyter, http.server, NextJS) inside
sandboxes caused severe PTY sluggishness and proxy navigation errors.
Root cause: the CP sandbox proxy and Connect RPC pool shared a single
HTTP transport. Heavy proxy traffic (Jupyter WebSocket, REST polling)
interfered with PTY RPC streams via HTTP/2 flow control contention.
Transport isolation (main fix):
- Add dedicated proxy transport on CP (NewProxyTransport) with HTTP/2
disabled, separate from the RPC pool transport
- Add dedicated proxy transport on host agent, replacing
http.DefaultTransport
- Add dedicated envdclient transport with tuned connection pooling
- Replace http.DefaultClient in file streaming RPCs with per-sandbox
envd client
Proxy path rewriting (navigation fix):
- Add ModifyResponse to rewrite Location headers with /proxy/{id}/{port}
prefix, handling both root-relative and absolute-URL redirects
- Strip prefix back out in CP subdomain proxy for correct browser
behavior
- Replace path.Join with string concat in CP Director to preserve
trailing slashes (prevents redirect loops on directory listings)
Proxy resilience:
- Add dial retry with linear backoff (3 attempts) to handle socat
startup delay when ports are first detected
- Cache ReverseProxy instances per sandbox+port+host in sync.Map
- Add EvictProxy callback wired into sandbox Manager.Destroy
Buffer and server hardening:
- Increase PTY and exec stream channel buffers from 16 to 256
- Add ReadHeaderTimeout (10s) and IdleTimeout (620s) to host agent
HTTP server
Network tuning:
- Set TAP device TxQueueLen to 5000 (up from default 1000)
- Add Firecracker tx_rate_limiter (200 MB/s sustained, 100 MB burst)
to prevent guest traffic from saturating the TAP
Moves 12 packages from internal/ to pkg/ (config, id, validate, events, db,
auth, lifecycle, scheduler, channels, audit, service) so they can be imported
by the enterprise repo as a Go module dependency.
Introduces pkg/cpextension (shared Extension interface + ServerContext) and
pkg/cpserver (Run() entrypoint with functional options) so the enterprise
main.go can call cpserver.Run(cpserver.WithExtensions(...)) without duplicating
the 20-step server bootstrap. Adds db/migrations/embed.go for go:embed access
to OSS SQL migrations from the enterprise module.
cmd/control-plane/main.go is reduced to a 10-line wrapper around cpserver.Run.
Start long-running processes (web servers, daemons) without blocking the
HTTP request. Leverages envd's existing background process support
(context.Background(), List, Connect, SendSignal RPCs) and wires it
through the host agent and control plane layers.
New API surface:
- POST /v1/capsules/{id}/exec with background:true → 202 {pid, tag}
- GET /v1/capsules/{id}/processes → list running processes
- DELETE /v1/capsules/{id}/processes/{selector} → kill by PID or tag
- WS /v1/capsules/{id}/processes/{selector}/stream → reconnect to output
The {selector} param auto-detects: numeric = PID, string = tag.
Tags are auto-generated as "proc-" + 8 hex chars if not provided.
- envd: reject non-regular files (devices, pipes, sockets) in GetFiles
to prevent infinite reads from /dev/zero, /dev/urandom etc.
- host agent: add context cancellation check in ReadFileStream loop
with proper Connect error codes
- frontend: abort in-flight file reads on file switch, directory
navigation, and component teardown via AbortController
- frontend: guard against abort errors surfacing in UI, use try/finally
for fileLoading state
Implement three new recipe commands for the admin template builder:
- USER <name>: creates the user (adduser + passwordless sudo), switches
execution context so subsequent RUN/START commands run as that user
via su wrapping. Last USER becomes the template's default_user.
- COPY <src> <dst>: copies files from an uploaded build archive
(tar/tar.gz/zip) into the sandbox. Source paths validated against
traversal. Ownership set to the current USER.
- ENV persistence: accumulated env vars stored in templates.default_env
(JSONB) and injected via PostInit when sandboxes are created from the
template, mirroring Docker's image metadata approach.
Supporting changes:
- Pre-build creates wrenn-user as default (via USER command)
- WORKDIR now creates the directory if it doesn't exist (mkdir -p)
- Per-step progress updates (ProgressFunc callback) for live UI
- Multipart form support on POST /v1/admin/builds for archive upload
- Proto: default_user/default_env fields on Create/ResumeSandboxRequest
- Host agent: SetDefaults calls PostInitWithDefaults on envd
- Control plane: reads template defaults, passes on sandbox create/resume
- Frontend: file upload widget, recipe copy button, keyword colors for
USER/COPY, fixed Svelte whitespace stripping in step display
- Admin panel defaults to /admin/templates instead of /admin/hosts
- Migration adds default_user and default_env to templates and
template_builds tables
Removed the `connect.NewError(connect.CodeInternal, ...)` wrapper in the
Server's MakeDir proxy handler. Previously, this wrapper was catching
specific agent errors (like CodeAlreadyExists) and casting them into
generic Code 13 (Internal) errors, stripping the gRPC metadata.
This change allows the control-plane to act as a transparent pipeline,
ensuring the API gateway can properly interpret and route specific
filesystem failures.
Wire envd's existing PTY process capabilities through the full stack:
hostagent proto (4 new RPCs: PtyAttach, PtySendInput, PtyResize, PtyKill),
envdclient, sandbox manager, and a new WebSocket endpoint at
GET /v1/sandboxes/{id}/pty with bidirectional JSON message protocol.
Sessions use tag-based identity for disconnect/reconnect support,
base64-encoded PTY data for binary safety, and a 120s inactivity timeout.
Plumb ListDir, MakeDir, and RemovePath through all layers:
REST API → host agent RPC → envdclient → envd. These endpoints
enable a web file browser for sandbox filesystem interaction.
New endpoints (all under requireAPIKeyOrJWT):
- POST /v1/sandboxes/{id}/files/list
- POST /v1/sandboxes/{id}/files/mkdir
- POST /v1/sandboxes/{id}/files/remove