After Firecracker snapshot restore, zombie TCP sockets from the previous
session cause Go runtime corruption inside the guest VM, making envd
unresponsive. This manifests as infinite loading in the file browser and
terminal timeouts (524) in production (HTTP/2 + Cloudflare) but not locally.
Four-part fix:
- Add ServerConnTracker to envd that tracks connections via ConnState callback,
closes idle connections and disables keep-alives before snapshot, then closes
all pre-snapshot zombie connections on restore (while preserving post-restore
connections like the /init request)
- Split envdclient into timeout (2min) and streaming (no timeout) HTTP clients;
use streaming client for file transfers and process RPCs
- Close host-side idle envdclient connections before PrepareSnapshot so FIN
packets propagate during the 3s quiesce window
- Add StreamingHTTPClient() accessor; streaming file transfer handlers in
hostagent use it instead of the timeout client
- Scope WebSocket auth bypass to only WS endpoints by restructuring
routes into separate chi Groups. Non-WS routes no longer passthrough
unauthenticated requests with spoofed Upgrade headers. Added
optionalAPIKeyOrJWT middleware for WS routes (injects auth context
from API key/JWT if present, passes through otherwise) and
markAdminWS middleware for admin WS routes.
- Fix nil pointer dereference in envd Handler.Wait() — p.tty.Close()
was called unconditionally but p.tty is nil for non-PTY processes,
crashing every non-PTY process exit.
- Fix goroutine leak in sandbox Pause — stopSampler was never called,
leaking one sampler goroutine per successful pause operation.
- Decouple PTY WebSocket reads from RPC dispatch using a buffered
channel to prevent backpressure-induced connection drops under fast
typing. Includes input coalescing to reduce RPC call volume.
Moves 12 packages from internal/ to pkg/ (config, id, validate, events, db,
auth, lifecycle, scheduler, channels, audit, service) so they can be imported
by the enterprise repo as a Go module dependency.
Introduces pkg/cpextension (shared Extension interface + ServerContext) and
pkg/cpserver (Run() entrypoint with functional options) so the enterprise
main.go can call cpserver.Run(cpserver.WithExtensions(...)) without duplicating
the 20-step server bootstrap. Adds db/migrations/embed.go for go:embed access
to OSS SQL migrations from the enterprise module.
cmd/control-plane/main.go is reduced to a 10-line wrapper around cpserver.Run.
envd crashes with "fatal error: bad summary data" after Firecracker
snapshot/restore because the page allocator radix tree is inconsistent
when vCPUs are frozen mid-allocation. The port scanner goroutine
allocates heavily every second, making it the primary trigger.
Add POST /snapshot/prepare to envd — the host agent calls it before
vm.Pause to quiesce continuous goroutines and force GC. On restore,
PostInit restarts the port subsystem via the existing /init endpoint.
- New PortSubsystem abstraction with Start/Stop/Restart lifecycle
- Context-based goroutine cancellation (replaces irreversible channel close)
- Context-aware Signal to prevent scanner/forwarder deadlock
- Fix forwarder goroutine leak (was spinning forever on closed channel)
- Kill socat children on stop to prevent orphans across snapshots
- Fix double cmd.Wait panic (exec.Command instead of CommandContext)
- envd: reject non-regular files (devices, pipes, sockets) in GetFiles
to prevent infinite reads from /dev/zero, /dev/urandom etc.
- host agent: add context cancellation check in ReadFileStream loop
with proper Connect error codes
- frontend: abort in-flight file reads on file switch, directory
navigation, and component teardown via AbortController
- frontend: guard against abort errors surfacing in UI, use try/finally
for fileLoading state
64-entry buffer was too small for high-throughput PTY output (e.g.
ls -laihR /). The consumer couldn't drain fast enough over the RPC
stream, causing the non-blocking send fallback to silently discard
data. 4096 entries (~64MB at 16KB/chunk) handles sustained output
without drops while still preventing deadlock on stuck consumers.
- Add multi-session Terminal tab with xterm.js (session tabs, close, reconnect)
- Keep terminal mounted across tab switches to preserve sessions
- Persist active tab in URL (?tab=terminal) so refresh stays on terminal
- Buffer keystrokes (50ms) to reduce per-character RPC overhead
- Add WebSocket auth via ?token= query param for browser WS connections
- Enable ws:true in Vite dev proxy for WebSocket support
envd fixes (pre-existing bugs exposed by multi-session terminals):
- Fix getProcess tag Range: inverted return values caused early stop when
multiple tagged processes existed, making SendInput fail with "not found"
- Fix multiplexer deadlock: blocking send to cancelled fork's unbuffered
channel prevented process cleanup. Now uses buffered channels (cap 64)
with non-blocking fallback
Replace Business Source License with Apache License Version 2.0 across
LICENSE, envd/LICENSE, and NOTICE. Update NOTICE to remove BSL-era
framing that singled out Apache-only portions.
- Add auth failure logging (login, API key, JWT) with IP/email/prefix
- Move OAuth JWT from URL params to short-lived cookies to prevent
token leakage via browser history, server logs, and Referer headers
- Pin Swagger UI to v5.18.2 with SRI integrity hashes
- Upgrade Go toolchain to 1.25.8 (fixes 5 called stdlib vulns)
- Fix unchecked error in host agent credential refresh
- Add .gstack to .gitignore for security report artifacts
The envd port scanner used gopsutil's net.Connections() which walks
/proc/{pid}/fd to enumerate socket inodes. This corrupts Go runtime
semaphore state when the VM is paused mid-operation and restored from
a Firecracker snapshot.
Replace with a direct /proc/net/tcp + /proc/net/tcp6 parser that reads
a single file per address family — no /proc/{pid}/fd walk, no goroutines,
no WaitGroups. Also replace concurrent-map (smap) in the scanner with a
plain sync.RWMutex-protected map, since concurrent-map's Items() spawns
goroutines with a WaitGroup internally, which is equally unsafe across
snapshot boundaries.
Use socket inode instead of PID for the port forwarding map key, since
inode is available directly from /proc/net/tcp without the fd walk.
Consolidate 16 migrations into one with UUID columns for all entity
IDs. TEXT is kept only for polymorphic fields (audit_logs.actor_id,
resource_id) and template names. The id package now generates UUIDs
via google/uuid, with Format*/Parse* helpers for the prefixed wire
format (sb-{uuid}, usr-{uuid}, etc.). Auth context, services, and
handlers pass pgtype.UUID internally; conversion to/from prefixed
strings happens at API and RPC boundaries. Adds PlatformTeamID
(all-zeros UUID) for shared resources.
Switch from the envd /init endpoint pushing host time via syscall to
chronyd reading the KVM PTP hardware clock (/dev/ptp0) continuously.
This fixes clock drift between init calls and handles snapshot resume
gracefully.
Changes:
- Add clocksource=kvm-clock kernel boot arg
- Start chronyd in wrenn-init.sh before tini (PHC /dev/ptp0, makestep 1.0 -1)
- Remove clock_settime logic from envd SetData and shouldSetSystemTime
- Remove client.Init() clock sync calls from sandbox manager (3 sites)
- Remove Init() method from envdclient (no longer needed)
- Simplify rootfs scripts: socat/chrony now come from apt in the container
image, only envd/wrenn-init/tini are injected by build scripts
Implement full snapshot lifecycle: pause (snapshot + free resources),
resume (UFFD-based lazy restore), and named snapshot templates that
can spawn new sandboxes from frozen VM state.
Key changes:
- Snapshot header system with generational diff mapping (inspired by e2b)
- UFFD server for lazy page fault handling during snapshot restore
- Stable rootfs symlink path (/tmp/fc-vm/) for snapshot compatibility
- Templates DB table and CRUD API endpoints (POST/GET/DELETE /v1/snapshots)
- CreateSnapshot/DeleteSnapshot RPCs in hostagent proto
- Reconciler excludes paused sandboxes (expected absent from host agent)
- Snapshot templates lock vcpus/memory to baked-in values
- Proper cleanup of uffd sockets and pause snapshot files on destroy
- REST API (chi router): sandbox CRUD, exec, pause/resume, file write/read
- PostgreSQL persistence via pgx/v5 + sqlc (sandboxes table with goose migration)
- Connect RPC client to host agent for all VM operations
- Reconciler syncs host agent state with DB every 30s (detects TTL-reaped sandboxes)
- OpenAPI 3.1 spec served at /openapi.yaml, Swagger UI at /docs
- Added WriteFile/ReadFile RPCs to hostagent proto and implementations
- File upload via multipart form, download via JSON body POST
- sandbox_id propagated from control plane to host agent on create
Implement the host agent as a Connect RPC server that orchestrates
sandbox creation, destruction, pause/resume, and command execution.
Includes sandbox manager with TTL-based reaper, network slot allocator,
rootfs cloning, hostagent proto definition with generated stubs, and
test/debug scripts. Fix Firecracker process lifetime bug where VM was
tied to HTTP request context instead of background context.
Remove duplicate proto files from envd/spec/ and update envd's
buf.gen.yaml to generate stubs from the canonical proto/envd/ location.
Both modules now generate their own Connect RPC stubs from the same
source protos.
- Copy envd source from e2b-dev/infra, internalize shared dependencies
into envd/internal/shared/ (keys, filesystem, id, smap, utils)
- Switch from gRPC to Connect RPC for all envd services
- Update module paths to git.omukk.dev/wrenn/{sandbox,sandbox/envd}
- Add proto specs (process, filesystem) with buf-based code generation
- Implement full envd: process exec, filesystem ops, port forwarding,
cgroup management, MMDS integration, and HTTP API
- Update main module dependencies (firecracker SDK, pgx, goose, etc.)
- Remove placeholder .gitkeep files replaced by real implementations