wrenn-releases

Author	SHA1	Message	Date
pptx704	74f85ce4e9	refactor: polish control plane and host agent code - Decompose executeBuild (318 lines) into provisionBuildSandbox and finalizeBuild helpers for readability - Extract cleanupPauseFailure in sandbox manager to unify 3 inconsistent inline teardown paths (also fixes CoW file leak on rename failure) - Remove unused ctx parameter from startProcess/startProcessForRestore - Add missing MASQUERADE rollback entry in CreateNetwork for symmetry - Consolidate duplicate writeJSON for UTF-8/base64 exec response	2026-05-17 02:11:48 +06:00
pptx704	124e097e23	refactor: eliminate DRY violations across control plane and host agent Extract shared helpers to consolidate repeated patterns: - requireRunningSandbox: sandbox lookup + running check (10 call sites) - upgradeAndAuthenticate: WS upgrade + JWT/API-key auth (3 handlers) - updateLastActive: last_active_at update with background context (5 sites) - attachCowAndCreate: cow loop attach + dmsetup create (devicemapper) - issueRegistrationToken: token gen + Redis + audit (host service) - ErrNotFound sentinel: replaces string matching in hostagent server Also merges duplicate wsProcessOut/wsOutMsg types into one. Net: -208 lines, zero behavior change.	2026-05-17 02:03:06 +06:00
pptx704	a5425969ed	fix: assorted bug fixes for CH migration Fix resource leaks, race conditions, and error handling across host agent and control plane: proper sparse file cleanup on close error, connect error wrapping for MakeDir, CoW file cleanup on pause failure, per-sandbox VM directories, deferred map deletion to avoid race in VM destroy, and goroutine launch for extension background workers.	2026-05-17 01:47:56 +06:00
pptx704	eaa6b8576d	feat(vm): replace Firecracker with Cloud Hypervisor Migrate the entire VM layer from Firecracker to Cloud Hypervisor (CH). CH provides native snapshot/restore via its HTTP API, eliminating the need for custom UFFD handling, memfile processing, and snapshot header management that Firecracker required. Key changes: - Remove fc.go, jailer.go (FC process management) - Remove internal/uffd/ package (userfaultfd lazy page loading) - Remove snapshot/header.go, mapping.go, memfile.go (FC snapshot format) - Add ch.go (CH HTTP API client over Unix socket) - Add process.go (CH process lifecycle with unshare+netns) - Add chversion.go (CH version detection) - Refactor sandbox manager: remove UFFD socket tracking, snapshot parent/diff chaining, FC-specific balloon logic; add crash watcher - Simplify snapshot/local.go to CH's native snapshot format - Update VM config: FirecrackerBin → VMMBin, new CH-specific fields - Update envdclient, devicemapper, network for CH compatibility	2026-05-17 01:33:12 +06:00
pptx704	3671af2498	feat: immediate sandbox reconciliation on host reconnect When a host transitions from unreachable → online via heartbeat, trigger ReconcileHost in a background goroutine so "missing" sandboxes are resolved instantly instead of waiting up to 60s for the next monitor tick. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-16 16:15:49 +06:00
pptx704	6faad45a28	feat: async sandbox lifecycle with Redis Stream events Replace synchronous RPC-based CP-host communication for sandbox lifecycle operations (Create, Pause, Resume, Destroy) with an async pattern. CP handlers now return 202 Accepted immediately, fire agent RPCs in background goroutines, and publish state events to a Redis Stream. A background consumer processes events as a fallback writer. Agent-side auto-pause events are pushed to the CP via HTTP callback (POST /v1/hosts/sandbox-events), keeping Redis internal to the CP. All DB status transitions use conditional updates (UpdateSandboxStatusIf, UpdateSandboxRunningIf) to prevent race conditions between concurrent operations and background goroutines. The HostMonitor reconciler is kept at 60s as a safety net, extended to handle transient statuses (starting, pausing, resuming, stopping). Frontend updated to handle 202 responses with empty bodies and render transient statuses with blue indicators.	2026-05-15 12:25:16 +06:00
pptx704	51b5d7b3ba	fix: resolve pause/snapshot failures and CoW exhaustion on large VMs Remove hard 10s timeout from Firecracker HTTP client — callers already pass context.Context with appropriate deadlines, and 20GB+ memfile writes easily exceed 10s. Ensure CoW file is at least as large as the origin rootfs. Previously, WRENN_DEFAULT_ROOTFS_SIZE=30Gi expanded the base image to 30GB but the default 5GB CoW could not hold all writes, causing dm-snapshot invalidation and EIO on all guest I/O. Destroy frozen VMs in resumeOnError instead of leaving zombies that report "running" but can't execute. Use fresh context for the resume attempt so a cancelled caller context doesn't falsely trigger destroy. Increase CP→Agent ResponseHeaderTimeout from 45s to 5min and PrepareSnapshot timeout from 3s to 30s for large-memory VMs. After failed pause, ping agent to detect destroyed sandboxes and mark DB status as "error" instead of reverting to "running".	2026-05-04 01:46:57 +06:00
pptx704	cac6fcd626	feat: admin grant/revoke from admin panel Add PUT /v1/admin/users/{id}/admin endpoint and frontend UI for granting and revoking platform admin status. Uses atomic conditional SQL (RevokeUserAdmin) to prevent race conditions that could remove the last admin. Includes idempotency check, audit logging, and confirmation dialog with self-demotion warning.	2026-05-03 15:24:34 +06:00
pptx704	3deecbff89	fix: prevent Go runtime memory corruption and sandbox halt after snapshot restore Three root causes addressed: 1. Go page allocator corruption: allocations between the pre-snapshot GC and VM freeze leave the summary tree inconsistent. After restore, GC reads corrupted metadata — either panicking (killing PID 1 → kernel panic) or silently failing to collect, causing unbounded heap growth until OOM. Fix: move GC to after all HTTP allocations in PostSnapshotPrepare, then set GOMAXPROCS(1) so any remaining allocations run sequentially with no concurrent page allocator access. GOMAXPROCS is restored on first health check after restore. 2. PostInit timeout starvation: WaitUntilReady and PostInit shared a single 30s context. If WaitUntilReady consumed most of it, PostInit failed — RestoreAfterSnapshot never ran, leaving envd with keep-alives disabled and zombie connections. Fix: separate timeout contexts. 3. CP HTTP server missing timeouts: no ReadHeaderTimeout or IdleTimeout caused goroutine leaks from hung proxy connections. Fix: add both, matching host agent values. Also adds UFFD prefetch to proactively load all guest pages after restore, eliminating on-demand page fault latency for subsequent RPC calls.	2026-05-02 17:22:51 +06:00
pptx704	bb582deefa	fix: prevent sandbox halt after resume by fixing HTTP/2 HOL blocking and adding timeouts Disable HTTP/2 on both host agent server and CP→agent transport — multiplexing caused head-of-line blocking when a slow sandbox RPC stalled the shared connection. Add ResponseHeaderTimeout to envd HTTP clients. Merge SetDefaults into Resume's PostInit call to eliminate an extra round-trip that could hang on a stale connection.	2026-05-02 13:48:51 +06:00
pptx704	bd98610153	fix: sandbox network responsiveness under port-binding apps Running port-binding applications (Jupyter, http.server, NextJS) inside sandboxes caused severe PTY sluggishness and proxy navigation errors. Root cause: the CP sandbox proxy and Connect RPC pool shared a single HTTP transport. Heavy proxy traffic (Jupyter WebSocket, REST polling) interfered with PTY RPC streams via HTTP/2 flow control contention. Transport isolation (main fix): - Add dedicated proxy transport on CP (NewProxyTransport) with HTTP/2 disabled, separate from the RPC pool transport - Add dedicated proxy transport on host agent, replacing http.DefaultTransport - Add dedicated envdclient transport with tuned connection pooling - Replace http.DefaultClient in file streaming RPCs with per-sandbox envd client Proxy path rewriting (navigation fix): - Add ModifyResponse to rewrite Location headers with /proxy/{id}/{port} prefix, handling both root-relative and absolute-URL redirects - Strip prefix back out in CP subdomain proxy for correct browser behavior - Replace path.Join with string concat in CP Director to preserve trailing slashes (prevents redirect loops on directory listings) Proxy resilience: - Add dial retry with linear backoff (3 attempts) to handle socat startup delay when ports are first detected - Cache ReverseProxy instances per sandbox+port+host in sync.Map - Add EvictProxy callback wired into sandbox Manager.Destroy Buffer and server hardening: - Increase PTY and exec stream channel buffers from 16 to 256 - Add ReadHeaderTimeout (10s) and IdleTimeout (620s) to host agent HTTP server Network tuning: - Set TAP device TxQueueLen to 5000 (up from default 1000) - Add Firecracker tx_rate_limiter (200 MB/s sustained, 100 MB burst) to prevent guest traffic from saturating the TAP	2026-04-25 04:21:55 +06:00
pptx704	11928a172a	feat: send email notification on account hard-delete Notify users via email when their account is permanently deleted after the 15-day soft-delete grace period. Query now returns email alongside user ID so the notification can be sent after deletion. Email failure is logged as a warning but does not block cleanup.	2026-04-21 16:01:56 +06:00
pptx704	bb2146d838	refactor: deduplicate audit logger with shared entry builders Replace repetitive actorFields + write boilerplate across all 25+ typed Log methods with shared helpers: newEntry (general), newAdminEntry (platform-level), resolveHostTeamID, and logSystemHostEvent. Reduces logger.go from 665 to 374 lines with no behavior change.	2026-04-21 15:54:39 +06:00
pptx704	d270ab7752	Version bump	2026-04-21 15:54:04 +06:00
pptx704	7fd801c1eb	feat: add audit logging for all admin actions and admin audit page Log every admin-panel action (user activate/deactivate, team BYOC toggle, team delete, template delete, build create/cancel) to the audit_logs table under PlatformTeamID with scope "admin". Add GET /v1/admin/audit-logs endpoint and /admin/audit frontend page with infinite scroll and hierarchical filters. Expose audit.Entry + Log() for cloud repo extensibility. Fix seed_platform_team down-migration FK violation by deleting dependent rows before the team row.	2026-04-21 15:41:45 +06:00
pptx704	ebbbde9cd1	feat: anonymize audit logs on user hard-delete and fix host audit log team assignment Anonymize audit logs when soft-deleted users are purged after 15 days: actor_name set to 'deleted-user', actor_id and resource_id nulled, email stripped from member metadata. Per-user delete ensures no user is removed without successful anonymization. Frontend renders deleted-user as a styled red badge in audit log view. Fix shared host create/delete audit logs landing in admin's personal team — now correctly assigned to PlatformTeamID.	2026-04-21 14:42:09 +06:00
pptx704	47be1143fb	Add MiddlewareProvider interface for extension middleware Allows cloud extensions to inject middleware that wraps OSS routes (e.g. billing enforcement) before they are registered.	2026-04-18 14:47:29 +06:00
pptx704	92aab09104	Add daily usage metrics (CPU-minutes, RAM GB-minutes) Introduce pre-computed daily usage rollups from sandbox_metrics_snapshots. An hourly background worker aggregates completed days, while today's usage is computed live from snapshots at query time for freshness. Backend: new daily_usage table, rollup worker, UsageService, and GET /v1/capsules/usage endpoint with date range filtering (up to 92 days). Frontend: replace Usage page placeholder with bar charts (Chart.js), summary total cards, and preset/custom date range controls.	2026-04-18 14:29:09 +06:00
pptx704	5fa3529df9	Move email types to pkg/email for cloud repo access Extracts Mailer interface, EmailData, and Button to pkg/email/types.go so the cloud repo can use them via ServerContext. internal/email re-exports the types as aliases so existing callers are unchanged. Also fixes pre-existing lint errors (unchecked rollback and deadline calls).	2026-04-17 16:36:54 +06:00
Rafeed M. Bhuiyan	605ad666a0	v0.1.0 (#17)	2026-04-16 19:24:25 +00:00

20 Commits