wrenn-releases

Author	SHA1	Message	Date
pptx704	124e097e23	refactor: eliminate DRY violations across control plane and host agent Extract shared helpers to consolidate repeated patterns: - requireRunningSandbox: sandbox lookup + running check (10 call sites) - upgradeAndAuthenticate: WS upgrade + JWT/API-key auth (3 handlers) - updateLastActive: last_active_at update with background context (5 sites) - attachCowAndCreate: cow loop attach + dmsetup create (devicemapper) - issueRegistrationToken: token gen + Redis + audit (host service) - ErrNotFound sentinel: replaces string matching in hostagent server Also merges duplicate wsProcessOut/wsOutMsg types into one. Net: -208 lines, zero behavior change.	2026-05-17 02:03:06 +06:00
pptx704	a5425969ed	fix: assorted bug fixes for CH migration Fix resource leaks, race conditions, and error handling across host agent and control plane: proper sparse file cleanup on close error, connect error wrapping for MakeDir, CoW file cleanup on pause failure, per-sandbox VM directories, deferred map deletion to avoid race in VM destroy, and goroutine launch for extension background workers.	2026-05-17 01:47:56 +06:00
pptx704	eaa6b8576d	feat(vm): replace Firecracker with Cloud Hypervisor Migrate the entire VM layer from Firecracker to Cloud Hypervisor (CH). CH provides native snapshot/restore via its HTTP API, eliminating the need for custom UFFD handling, memfile processing, and snapshot header management that Firecracker required. Key changes: - Remove fc.go, jailer.go (FC process management) - Remove internal/uffd/ package (userfaultfd lazy page loading) - Remove snapshot/header.go, mapping.go, memfile.go (FC snapshot format) - Add ch.go (CH HTTP API client over Unix socket) - Add process.go (CH process lifecycle with unshare+netns) - Add chversion.go (CH version detection) - Refactor sandbox manager: remove UFFD socket tracking, snapshot parent/diff chaining, FC-specific balloon logic; add crash watcher - Simplify snapshot/local.go to CH's native snapshot format - Update VM config: FirecrackerBin → VMMBin, new CH-specific fields - Update envdclient, devicemapper, network for CH compatibility	2026-05-17 01:33:12 +06:00
pptx704	c2dc382787	Updated openapi schema	2026-05-16 18:32:37 +06:00
pptx704	3671af2498	feat: immediate sandbox reconciliation on host reconnect When a host transitions from unreachable → online via heartbeat, trigger ReconcileHost in a background goroutine so "missing" sandboxes are resolved instantly instead of waiting up to 60s for the next monitor tick. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-16 16:15:49 +06:00
pptx704	ff91ef3edf	Bump versions	2026-05-15 13:56:04 +06:00
pptx704	ba3a3db98c	Updated openapi specs	2026-05-15 12:39:06 +06:00
pptx704	6faad45a28	feat: async sandbox lifecycle with Redis Stream events Replace synchronous RPC-based CP-host communication for sandbox lifecycle operations (Create, Pause, Resume, Destroy) with an async pattern. CP handlers now return 202 Accepted immediately, fire agent RPCs in background goroutines, and publish state events to a Redis Stream. A background consumer processes events as a fallback writer. Agent-side auto-pause events are pushed to the CP via HTTP callback (POST /v1/hosts/sandbox-events), keeping Redis internal to the CP. All DB status transitions use conditional updates (UpdateSandboxStatusIf, UpdateSandboxRunningIf) to prevent race conditions between concurrent operations and background goroutines. The HostMonitor reconciler is kept at 60s as a safety net, extended to handle transient statuses (starting, pausing, resuming, stopping). Frontend updated to handle 202 responses with empty bodies and render transient statuses with blue indicators.	2026-05-15 12:25:16 +06:00
pptx704	aca43d51eb	fix: resolve process stream hangs, pause race, and PTY signal loss - Cache terminal EndEvent on ProcessHandle so connect() can detect already-exited processes instead of hanging forever on broadcast receivers that missed the event. Subscribe before checking cache to close the TOCTOU window. - Protect sb.Status writes in Pause with m.mu to prevent data race with concurrent readers (AcquireProxyConn, Exec, etc.). - Restart metrics sampler in restoreRunning so a failed pause attempt doesn't permanently kill sandbox metrics collection. - Return dequeued non-input messages from coalescePtyInput instead of dropping them, preventing silent loss of kill/resize signals during typing bursts.	2026-05-09 18:11:15 +06:00
pptx704	c93ad5e2db	fix: harden pause flow with connection isolation and UFFD event handling Restructure pause to: block new operations (StatusPausing), drain proxy connections with 5s grace, force-close remaining via context cancellation, drop page cache, inflate balloon, then freeze vCPUs. Previously connections could arrive during the pause window and API operations weren't blocked. Handle UFFD_EVENT_REMOVE/UNMAP/REMAP/FORK gracefully instead of crashing the UFFD server. These events fire during balloon deflation on snapshot restore, killing the page fault handler and preventing VM boot. Also adds ConnTracker.ForceClose() with cancellable context propagated through the proxy handler, so lingering proxy connections are actively terminated rather than left dangling.	2026-05-09 14:51:19 +06:00
pptx704	38799770db	fix: inflate balloon before snapshot to reduce memfile size Firecracker dumps the entire VM memory region regardless of guest usage. A 20GB VM using 500MB still produces a ~20GB memfile because freed pages retain stale data (non-zero blocks). Inflate the balloon device before snapshot to reclaim free guest memory. Balloon pages become zero from FC's perspective, allowing ProcessMemfile to skip them. This reduces memfile size from ~20GB to ~1-2GB for lightly-used VMs. - Pause: read guest memory usage, inflate balloon to reclaim free pages, wait 2s for guest kernel to process, then proceed - Resume: deflate balloon to 0 after PostInit so guest gets full memory back - createFromSnapshot: same deflation since template snapshots inherit inflated balloon state - All balloon ops are best-effort with debug logging on failure	2026-05-05 15:38:04 +06:00
pptx704	51b5d7b3ba	fix: resolve pause/snapshot failures and CoW exhaustion on large VMs Remove hard 10s timeout from Firecracker HTTP client — callers already pass context.Context with appropriate deadlines, and 20GB+ memfile writes easily exceed 10s. Ensure CoW file is at least as large as the origin rootfs. Previously, WRENN_DEFAULT_ROOTFS_SIZE=30Gi expanded the base image to 30GB but the default 5GB CoW could not hold all writes, causing dm-snapshot invalidation and EIO on all guest I/O. Destroy frozen VMs in resumeOnError instead of leaving zombies that report "running" but can't execute. Use fresh context for the resume attempt so a cancelled caller context doesn't falsely trigger destroy. Increase CP→Agent ResponseHeaderTimeout from 45s to 5min and PrepareSnapshot timeout from 3s to 30s for large-memory VMs. After failed pause, ping agent to detect destroyed sandboxes and mark DB status as "error" instead of reverting to "running".	2026-05-04 01:46:57 +06:00
pptx704	cac6fcd626	feat: admin grant/revoke from admin panel Add PUT /v1/admin/users/{id}/admin endpoint and frontend UI for granting and revoking platform admin status. Uses atomic conditional SQL (RevokeUserAdmin) to prevent race conditions that could remove the last admin. Includes idempotency check, audit logging, and confirmation dialog with self-demotion warning.	2026-05-03 15:24:34 +06:00
pptx704	1178ab8b21	fix: accurate sandbox metrics and memory management Three issues fixed: 1. Memory metrics read host-side VmRSS of the Firecracker process, which includes guest page cache and never decreases. Replaced readMemRSS(fcPID) with readEnvdMemUsed(client) that queries envd's /metrics endpoint for guest-side total - MemAvailable. This matches neofetch and reflects actual process memory. 2. Added Firecracker balloon device (deflate_on_oom, 5s stats) and envd-side periodic page cache reclaimer (drop_caches when >80% used). Reclaimer is gated by snapshot_in_progress flag with sync() before freeze to prevent memory corruption during pause. 3. Sampling interval 500ms → 1s, ring buffer capacities adjusted to maintain same time windows. Reduces per-host HTTP load from 240 calls/sec to 120 calls/sec at 120 capsules. Also: maxDiffGenerations 8 → 1 (merge every re-pause since UFFD lazy-loads anyway), envd mem_used formula uses total - available.	2026-05-03 12:19:01 +06:00
pptx704	ef5f223863	fix: improve error feedback for terminal disconnects and host unavailability Show "[session disconnected]" in terminal when PTY websocket closes cleanly. Map scheduler and agent unavailability errors to 503 with user-friendly message instead of leaking internal details.	2026-05-03 04:47:10 +06:00
pptx704	f328113a2a	rename guest hostname from "sandbox" to "capsule" Terminal prompt inside VMs now shows root@capsule instead of root@sandbox, aligning with user-facing "capsule" terminology.	2026-05-03 03:32:03 +06:00
pptx704	1143acd37a	refactor: remove Go envd module, update host agent for Rust envd The Go envd guest agent (`envd/`) is fully replaced by the Rust implementation (`envd-rs/`). This commit removes the Go module and updates all references across the codebase. Makefile: remove ENVD_DIR, VERSION_ENVD, build-envd-go, dev-envd-go, and Go envd from proto/fmt/vet/tidy/clean targets. Add static-link verification to build-envd. Host agent: rewrite snapshot quiesce comments that referenced Go GC and page allocator corruption — no longer applicable with Rust envd. Tighten envdclient to expect HTTP 200 (not 204) from health and file upload endpoints, and require JSON version response from FetchVersion. Remove NOTICE (no e2b-derived code remains). Update CLAUDE.md and README.md to reflect Rust envd architecture.	2026-05-03 03:12:25 +06:00
pptx704	3deecbff89	fix: prevent Go runtime memory corruption and sandbox halt after snapshot restore Three root causes addressed: 1. Go page allocator corruption: allocations between the pre-snapshot GC and VM freeze leave the summary tree inconsistent. After restore, GC reads corrupted metadata — either panicking (killing PID 1 → kernel panic) or silently failing to collect, causing unbounded heap growth until OOM. Fix: move GC to after all HTTP allocations in PostSnapshotPrepare, then set GOMAXPROCS(1) so any remaining allocations run sequentially with no concurrent page allocator access. GOMAXPROCS is restored on first health check after restore. 2. PostInit timeout starvation: WaitUntilReady and PostInit shared a single 30s context. If WaitUntilReady consumed most of it, PostInit failed — RestoreAfterSnapshot never ran, leaving envd with keep-alives disabled and zombie connections. Fix: separate timeout contexts. 3. CP HTTP server missing timeouts: no ReadHeaderTimeout or IdleTimeout caused goroutine leaks from hung proxy connections. Fix: add both, matching host agent values. Also adds UFFD prefetch to proactively load all guest pages after restore, eliminating on-demand page fault latency for subsequent RPC calls.	2026-05-02 17:22:51 +06:00
pptx704	bb582deefa	fix: prevent sandbox halt after resume by fixing HTTP/2 HOL blocking and adding timeouts Disable HTTP/2 on both host agent server and CP→agent transport — multiplexing caused head-of-line blocking when a slow sandbox RPC stalled the shared connection. Add ResponseHeaderTimeout to envd HTTP clients. Merge SetDefaults into Resume's PostInit call to eliminate an extra round-trip that could hang on a stale connection.	2026-05-02 13:48:51 +06:00
pptx704	7ef9a64613	fix: close stale TCP connections across snapshot/restore to prevent envd hangs After Firecracker snapshot restore, zombie TCP sockets from the previous session cause Go runtime corruption inside the guest VM, making envd unresponsive. This manifests as infinite loading in the file browser and terminal timeouts (524) in production (HTTP/2 + Cloudflare) but not locally. Four-part fix: - Add ServerConnTracker to envd that tracks connections via ConnState callback, closes idle connections and disables keep-alives before snapshot, then closes all pre-snapshot zombie connections on restore (while preserving post-restore connections like the /init request) - Split envdclient into timeout (2min) and streaming (no timeout) HTTP clients; use streaming client for file transfers and process RPCs - Close host-side idle envdclient connections before PrepareSnapshot so FIN packets propagate during the 3s quiesce window - Add StreamingHTTPClient() accessor; streaming file transfer handlers in hostagent use it instead of the timeout client	2026-05-02 05:19:37 +06:00
pptx704	f3572f7356	Fix empty WRENN_TEMPLATE_ID after resuming paused sandbox Resume() was building VMConfig without TemplateID, so Firecracker MMDS received an empty string. envd's PostInit then wrote that empty value to /run/wrenn/.WRENN_TEMPLATE_ID. Fix by persisting the template ID in snapshot metadata during Pause and reading it back during Resume.	2026-05-02 04:57:08 +06:00
pptx704	bd98610153	fix: sandbox network responsiveness under port-binding apps Running port-binding applications (Jupyter, http.server, NextJS) inside sandboxes caused severe PTY sluggishness and proxy navigation errors. Root cause: the CP sandbox proxy and Connect RPC pool shared a single HTTP transport. Heavy proxy traffic (Jupyter WebSocket, REST polling) interfered with PTY RPC streams via HTTP/2 flow control contention. Transport isolation (main fix): - Add dedicated proxy transport on CP (NewProxyTransport) with HTTP/2 disabled, separate from the RPC pool transport - Add dedicated proxy transport on host agent, replacing http.DefaultTransport - Add dedicated envdclient transport with tuned connection pooling - Replace http.DefaultClient in file streaming RPCs with per-sandbox envd client Proxy path rewriting (navigation fix): - Add ModifyResponse to rewrite Location headers with /proxy/{id}/{port} prefix, handling both root-relative and absolute-URL redirects - Strip prefix back out in CP subdomain proxy for correct browser behavior - Replace path.Join with string concat in CP Director to preserve trailing slashes (prevents redirect loops on directory listings) Proxy resilience: - Add dial retry with linear backoff (3 attempts) to handle socat startup delay when ports are first detected - Cache ReverseProxy instances per sandbox+port+host in sync.Map - Add EvictProxy callback wired into sandbox Manager.Destroy Buffer and server hardening: - Increase PTY and exec stream channel buffers from 16 to 256 - Add ReadHeaderTimeout (10s) and IdleTimeout (620s) to host agent HTTP server Network tuning: - Set TAP device TxQueueLen to 5000 (up from default 1000) - Add Firecracker tx_rate_limiter (200 MB/s sustained, 100 MB burst) to prevent guest traffic from saturating the TAP	2026-04-25 04:21:55 +06:00
pptx704	5e13879954	fix: OAuth ConnectProvider state HMAC format mismatch ConnectProvider computed HMAC over bare state, but Callback always verifies HMAC(state+":"+intent). This caused the account-linking flow to always fail with invalid_state.	2026-04-25 02:00:39 +06:00
pptx704	339cd7bee1	fix: security and stability fixes from code review - Scope WebSocket auth bypass to only WS endpoints by restructuring routes into separate chi Groups. Non-WS routes no longer passthrough unauthenticated requests with spoofed Upgrade headers. Added optionalAPIKeyOrJWT middleware for WS routes (injects auth context from API key/JWT if present, passes through otherwise) and markAdminWS middleware for admin WS routes. - Fix nil pointer dereference in envd Handler.Wait() — p.tty.Close() was called unconditionally but p.tty is nil for non-PTY processes, crashing every non-PTY process exit. - Fix goroutine leak in sandbox Pause — stopSampler was never called, leaking one sampler goroutine per successful pause operation. - Decouple PTY WebSocket reads from RPC dispatch using a buffered channel to prevent backpressure-induced connection drops under fast typing. Includes input coalescing to reduce RPC call volume.	2026-04-24 15:48:38 +06:00
pptx704	d270ab7752	Version bump	2026-04-21 15:54:04 +06:00
pptx704	7fd801c1eb	feat: add audit logging for all admin actions and admin audit page Log every admin-panel action (user activate/deactivate, team BYOC toggle, team delete, template delete, build create/cancel) to the audit_logs table under PlatformTeamID with scope "admin". Add GET /v1/admin/audit-logs endpoint and /admin/audit frontend page with infinite scroll and hierarchical filters. Expose audit.Entry + Log() for cloud repo extensibility. Fix seed_platform_team down-migration FK violation by deleting dependent rows before the team row.	2026-04-21 15:41:45 +06:00
pptx704	684c98b0fa	fix: admin capsule create audit log uses PlatformTeamID POST /v1/admin/capsules was outside the injectPlatformTeam middleware subrouter, so audit entries landed under the admin's personal team.	2026-04-21 14:54:52 +06:00
pptx704	6a6b489471	feat: separate GitHub OAuth login/signup flows with name confirmation Block auto-account creation when signing in via GitHub from login mode. Signup via GitHub now shows a name confirmation dialog before redirecting to dashboard, letting users verify/edit their display name pulled from GitHub. - Add intent query param to OAuth redirect, persisted in HMAC-signed state cookie - Block registration in callback when intent=login, return no_account error - Set wrenn_oauth_new_signup cookie on new account creation - Frontend callback shows name confirmation dialog for new signups - Add no_account error message to login page	2026-04-21 11:03:12 +06:00
pptx704	8f8638e6db	Bump version to 0.1.2	2026-04-18 14:47:25 +06:00
pptx704	92aab09104	Add daily usage metrics (CPU-minutes, RAM GB-minutes) Introduce pre-computed daily usage rollups from sandbox_metrics_snapshots. An hourly background worker aggregates completed days, while today's usage is computed live from snapshots at query time for freshness. Backend: new daily_usage table, rollup worker, UsageService, and GET /v1/capsules/usage endpoint with date range filtering (up to 92 days). Frontend: replace Usage page placeholder with bar charts (Chart.js), summary total cards, and preset/custom date range controls.	2026-04-18 14:29:09 +06:00
pptx704	5fa3529df9	Move email types to pkg/email for cloud repo access Extracts Mailer interface, EmailData, and Button to pkg/email/types.go so the cloud repo can use them via ServerContext. internal/email re-exports the types as aliases so existing callers are unchanged. Also fixes pre-existing lint errors (unchecked rollback and deadline calls).	2026-04-17 16:36:54 +06:00
Rafeed M. Bhuiyan	605ad666a0	v0.1.0 (#17)	2026-04-16 19:24:25 +00:00
pptx704	172413e91e	Made changes to accommodate repo url update (#15) Reviewed-on: wrenn/wrenn#15 Co-authored-by: pptx704 <rafeed@omukk.dev> Co-committed-by: pptx704 <rafeed@omukk.dev>	2026-04-09 21:02:44 +00:00
Rafeed M. Bhuiyan	d3e4812e46	v0.0.1 (#8) Co-authored-by: Tasnim Kabir Sadik <tksadik92@gmail.com> Reviewed-on: wrenn/sandbox#8	2026-04-09 19:24:49 +00:00
pptx704	32e5a5a715	Prototype with single host server and no admin panel (#2) Reviewed-on: wrenn/sandbox#2 Co-authored-by: pptx704 <rafeed@omukk.dev> Co-committed-by: pptx704 <rafeed@omukk.dev>	2026-03-22 21:01:23 +00:00
pptx704	bd78cc068c	Initial project structure for Wrenn Sandbox Set up directory layout, Makefiles, go.mod files, docker-compose, and empty placeholder files for all packages.	2026-03-09 17:22:47 +06:00
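
Commit 6faad45a28 in the log says all DB status transitions use conditional updates (UpdateSandboxStatusIf, UpdateSandboxRunningIf) so that concurrent operations and background goroutines cannot overwrite each other. The idea can be sketched in Go as a compare-and-set on the status field. This is an illustration only: the in-memory `statusStore` type, the `updateStatusIf` method, and the `sb-1` ID are hypothetical stand-ins, not the repository's database code.

```go
package main

import (
	"fmt"
	"sync"
)

// statusStore is a hypothetical in-memory stand-in for the sandbox table.
type statusStore struct {
	mu     sync.Mutex
	status map[string]string
}

// updateStatusIf moves a sandbox to next only if its current status equals
// expected. It reports whether the transition was applied, so a late or
// duplicate writer loses cleanly instead of clobbering newer state.
func (s *statusStore) updateStatusIf(id, expected, next string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.status[id] != expected {
		return false
	}
	s.status[id] = next
	return true
}

func main() {
	store := &statusStore{status: map[string]string{"sb-1": "pausing"}}

	// The first writer sees the expected status and wins.
	fmt.Println(store.updateStatusIf("sb-1", "pausing", "paused")) // true

	// A stale writer still expecting "pausing" is rejected.
	fmt.Println(store.updateStatusIf("sb-1", "pausing", "running")) // false

	fmt.Println(store.status["sb-1"]) // paused
}
```

A SQL version of the same check would typically put the expected status in the WHERE clause and inspect the affected-row count. Whether the repository does exactly that is not stated in the log.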

36 Commits
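
Commit 5e13879954 in the log describes a signing mismatch: ConnectProvider computed the HMAC over the bare state, while Callback verifies HMAC(state+":"+intent). One way to avoid that class of bug is to route both sides through a single helper, sketched below in Go. The names `signState` and `verifyState`, the SHA-256 hash, the hex encoding, and the example secret are assumptions for illustration; the log does not say which hash or encoding the project uses.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// signState returns a hex-encoded HMAC-SHA256 over state + ":" + intent.
// Signer and verifier both call this, so the signed payload cannot drift.
func signState(secret []byte, state, intent string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(state + ":" + intent))
	return hex.EncodeToString(mac.Sum(nil))
}

// verifyState recomputes the signature and compares it in constant time.
func verifyState(secret []byte, state, intent, sig string) bool {
	want := signState(secret, state, intent)
	return hmac.Equal([]byte(want), []byte(sig))
}

func main() {
	secret := []byte("example-secret") // hypothetical key for the demo
	sig := signState(secret, "abc123", "connect")

	fmt.Println(verifyState(secret, "abc123", "connect", sig)) // true
	fmt.Println(verifyState(secret, "abc123", "login", sig))   // false
}
```

Because the intent is part of the signed payload, a state issued for one flow does not verify under a different intent.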