Three root causes addressed:
1. Go page allocator corruption: allocations between the pre-snapshot GC
and VM freeze leave the summary tree inconsistent. After restore, GC
reads corrupted metadata — either panicking (killing PID 1 → kernel
panic) or silently failing to collect, causing unbounded heap growth
until OOM. Fix: move GC to after all HTTP allocations in
PostSnapshotPrepare, then set GOMAXPROCS(1) so any remaining
allocations run sequentially with no concurrent page allocator access.
GOMAXPROCS is restored on first health check after restore.
2. PostInit timeout starvation: WaitUntilReady and PostInit shared a
single 30s context. If WaitUntilReady consumed most of it, PostInit
failed — RestoreAfterSnapshot never ran, leaving envd with keep-alives
disabled and zombie connections. Fix: separate timeout contexts.
3. CP HTTP server missing timeouts: no ReadHeaderTimeout or IdleTimeout
caused goroutine leaks from hung proxy connections. Fix: add both,
matching host agent values.
Also adds UFFD prefetch to proactively load all guest pages after restore,
eliminating on-demand page fault latency for subsequent RPC calls.
Disable HTTP/2 on both host agent server and CP→agent transport — multiplexing
caused head-of-line blocking when a slow sandbox RPC stalled the shared connection.
Add ResponseHeaderTimeout to envd HTTP clients. Merge SetDefaults into Resume's
PostInit call to eliminate an extra round-trip that could hang on a stale connection.
Running port-binding applications (Jupyter, http.server, NextJS) inside
sandboxes caused severe PTY sluggishness and proxy navigation errors.
Root cause: the CP sandbox proxy and Connect RPC pool shared a single
HTTP transport. Heavy proxy traffic (Jupyter WebSocket, REST polling)
interfered with PTY RPC streams via HTTP/2 flow control contention.
Transport isolation (main fix):
- Add dedicated proxy transport on CP (NewProxyTransport) with HTTP/2
disabled, separate from the RPC pool transport
- Add dedicated proxy transport on host agent, replacing
http.DefaultTransport
- Add dedicated envdclient transport with tuned connection pooling
- Replace http.DefaultClient in file streaming RPCs with per-sandbox
envd client
Proxy path rewriting (navigation fix):
- Add ModifyResponse to rewrite Location headers with /proxy/{id}/{port}
prefix, handling both root-relative and absolute-URL redirects
- Strip prefix back out in CP subdomain proxy for correct browser
behavior
- Replace path.Join with string concat in CP Director to preserve
trailing slashes (prevents redirect loops on directory listings)
Proxy resilience:
- Add dial retry with linear backoff (3 attempts) to handle socat
startup delay when ports are first detected
- Cache ReverseProxy instances per sandbox+port+host in sync.Map
- Add EvictProxy callback wired into sandbox Manager.Destroy
Buffer and server hardening:
- Increase PTY and exec stream channel buffers from 16 to 256
- Add ReadHeaderTimeout (10s) and IdleTimeout (620s) to host agent
HTTP server
Network tuning:
- Set TAP device TxQueueLen to 5000 (up from default 1000)
- Add Firecracker tx_rate_limiter (200 MB/s sustained, 100 MB burst)
to prevent guest traffic from saturating the TAP
Notify users via email when their account is permanently deleted after
the 15-day soft-delete grace period. Query now returns email alongside
user ID so the notification can be sent after deletion.
Email failure is logged as a warning but does not block cleanup.
Replace repetitive actorFields + write boilerplate across all 25+ typed
Log methods with shared helpers: newEntry (general), newAdminEntry
(platform-level), resolveHostTeamID, and logSystemHostEvent.
Reduces logger.go from 665 to 374 lines with no behavior change.
Log every admin-panel action (user activate/deactivate, team BYOC toggle,
team delete, template delete, build create/cancel) to the audit_logs table
under PlatformTeamID with scope "admin".
Add GET /v1/admin/audit-logs endpoint and /admin/audit frontend page with
infinite scroll and hierarchical filters. Expose audit.Entry + Log() for
cloud repo extensibility.
Fix seed_platform_team down-migration FK violation by deleting dependent
rows before the team row.
Anonymize audit logs when soft-deleted users are purged after 15 days:
actor_name set to 'deleted-user', actor_id and resource_id nulled,
email stripped from member metadata. Per-user delete ensures no user
is removed without successful anonymization.
Frontend renders deleted-user as a styled red badge in audit log view.
Fix shared host create/delete audit logs landing in admin's personal
team — now correctly assigned to PlatformTeamID.
Introduce pre-computed daily usage rollups from sandbox_metrics_snapshots.
An hourly background worker aggregates completed days, while today's
usage is computed live from snapshots at query time for freshness.
Backend: new daily_usage table, rollup worker, UsageService, and
GET /v1/capsules/usage endpoint with date range filtering (up to 92 days).
Frontend: replace Usage page placeholder with bar charts (Chart.js),
summary total cards, and preset/custom date range controls.
Extracts Mailer interface, EmailData, and Button to pkg/email/types.go
so the cloud repo can use them via ServerContext. internal/email re-exports
the types as aliases so existing callers are unchanged. Also fixes
pre-existing lint errors (unchecked rollback and deadline calls).