After Firecracker snapshot restore, zombie TCP sockets from the previous
session cause Go runtime corruption inside the guest VM, making envd
unresponsive. This manifests as infinite loading in the file browser and
terminal timeouts (524) in production (HTTP/2 + Cloudflare) but not locally.
Four-part fix:
- Add ServerConnTracker to envd that tracks connections via ConnState callback,
closes idle connections and disables keep-alives before snapshot, then closes
all pre-snapshot zombie connections on restore (while preserving post-restore
connections like the /init request)
- Split envdclient into timeout (2min) and streaming (no timeout) HTTP clients;
use streaming client for file transfers and process RPCs
- Close host-side idle envdclient connections before PrepareSnapshot so FIN
packets propagate during the 3s quiesce window
- Add StreamingHTTPClient() accessor; streaming file transfer handlers in
hostagent use it instead of the timeout client
Resume() was building VMConfig without TemplateID, so Firecracker MMDS
received an empty string. envd's PostInit then wrote that empty value to
/run/wrenn/.WRENN_TEMPLATE_ID. Fix by persisting the template ID in
snapshot metadata during Pause and reading it back during Resume.
Running port-binding applications (Jupyter, http.server, NextJS) inside
sandboxes caused severe PTY sluggishness and proxy navigation errors.
Root cause: the CP sandbox proxy and Connect RPC pool shared a single
HTTP transport. Heavy proxy traffic (Jupyter WebSocket, REST polling)
interfered with PTY RPC streams via HTTP/2 flow control contention.
Transport isolation (main fix):
- Add dedicated proxy transport on CP (NewProxyTransport) with HTTP/2
disabled, separate from the RPC pool transport
- Add dedicated proxy transport on host agent, replacing
http.DefaultTransport
- Add dedicated envdclient transport with tuned connection pooling
- Replace http.DefaultClient in file streaming RPCs with per-sandbox
envd client
Proxy path rewriting (navigation fix):
- Add ModifyResponse to rewrite Location headers with /proxy/{id}/{port}
prefix, handling both root-relative and absolute-URL redirects
- Strip prefix back out in CP subdomain proxy for correct browser
behavior
- Replace path.Join with string concat in CP Director to preserve
trailing slashes (prevents redirect loops on directory listings)
Proxy resilience:
- Add dial retry with linear backoff (3 attempts) to handle socat
startup delay when ports are first detected
- Cache ReverseProxy instances per sandbox+port+host in sync.Map
- Add EvictProxy callback wired into sandbox Manager.Destroy
Buffer and server hardening:
- Increase PTY and exec stream channel buffers from 16 to 256
- Add ReadHeaderTimeout (10s) and IdleTimeout (620s) to host agent
HTTP server
Network tuning:
- Set TAP device TxQueueLen to 5000 (up from default 1000)
- Add Firecracker tx_rate_limiter (200 MB/s sustained, 100 MB burst)
to prevent guest traffic from saturating the TAP
ConnectProvider computed HMAC over bare state, but Callback always
verifies HMAC(state+":"+intent). This caused the account-linking
flow to always fail with invalid_state.
- Scope WebSocket auth bypass to only WS endpoints by restructuring
routes into separate chi Groups. Non-WS routes no longer passthrough
unauthenticated requests with spoofed Upgrade headers. Added
optionalAPIKeyOrJWT middleware for WS routes (injects auth context
from API key/JWT if present, passes through otherwise) and
markAdminWS middleware for admin WS routes.
- Fix nil pointer dereference in envd Handler.Wait() — p.tty.Close()
was called unconditionally but p.tty is nil for non-PTY processes,
crashing every non-PTY process exit.
- Fix goroutine leak in sandbox Pause — stopSampler was never called,
leaking one sampler goroutine per successful pause operation.
- Decouple PTY WebSocket reads from RPC dispatch using a buffered
channel to prevent backpressure-induced connection drops under fast
typing. Includes input coalescing to reduce RPC call volume.
Log every admin-panel action (user activate/deactivate, team BYOC toggle,
team delete, template delete, build create/cancel) to the audit_logs table
under PlatformTeamID with scope "admin".
Add GET /v1/admin/audit-logs endpoint and /admin/audit frontend page with
infinite scroll and hierarchical filters. Expose audit.Entry + Log() for
cloud repo extensibility.
Fix seed_platform_team down-migration FK violation by deleting dependent
rows before the team row.
Block auto-account creation when signing in via GitHub from login mode.
Signup via GitHub now shows a name confirmation dialog before redirecting
to dashboard, letting users verify/edit their display name pulled from
GitHub.
- Add intent query param to OAuth redirect, persisted in HMAC-signed state cookie
- Block registration in callback when intent=login, return no_account error
- Set wrenn_oauth_new_signup cookie on new account creation
- Frontend callback shows name confirmation dialog for new signups
- Add no_account error message to login page
Introduce pre-computed daily usage rollups from sandbox_metrics_snapshots.
An hourly background worker aggregates completed days, while today's
usage is computed live from snapshots at query time for freshness.
Backend: new daily_usage table, rollup worker, UsageService, and
GET /v1/capsules/usage endpoint with date range filtering (up to 92 days).
Frontend: replace Usage page placeholder with bar charts (Chart.js),
summary total cards, and preset/custom date range controls.
Extracts Mailer interface, EmailData, and Button to pkg/email/types.go
so the cloud repo can use them via ServerContext. internal/email re-exports
the types as aliases so existing callers are unchanged. Also fixes
pre-existing lint errors (unchecked rollback and deadline calls).