wrenn-releases

Author	SHA1	Message	Date
pptx704	88f919c4ca	Rename sandbox prefix to cl-, add MMDS metadata, fix proxy port routing - Change sandbox ID prefix from sb- to cl- (capsule) throughout - Fix proxy URL regex character class: base36 uses 0-9a-z, not just hex - Add MMDS V2 config and metadata to VM boot flow so envd can read WRENN_SANDBOX_ID and WRENN_TEMPLATE_ID from inside the guest - Pass TemplateID through VMConfig into both fresh and snapshot boot paths	2026-03-30 17:12:05 +06:00
pptx704	8f06fc554a	Replace Full snapshot fallback with file-level diff merge Always use Firecracker Diff snapshots (fast, only changed pages) and merge diff files at the file level when the generation cap is reached. The previous approach used Firecracker's Full snapshot type which dumps all memory to disk and can timeout, losing all snapshot data on failure. Add snapshot.MergeDiffs() which reads each block from the appropriate generation's diff file via the header mapping and writes them into a single consolidated file with a fresh generation-0 header.	2026-03-29 02:33:33 +06:00
pptx704	1ca10230a9	Prefix network namespaces with wrenn-, add stale cleanup, lower diff cap Rename ns-{idx} to wrenn-ns-{idx} and veth-{idx} to wrenn-veth-{idx} to avoid collisions with other tools. Add CleanupStaleNamespaces() at agent startup to remove orphaned namespaces, veths, iptables rules, and routes from a previous crash. Lower maxDiffGenerations from 10 to 8 to prevent Go runtime memory corruption from snapshot/restore drift.	2026-03-29 02:14:30 +06:00
pptx704	75b28ed899	Add UUID-based template IDs and team-scoped template directory layout Introduces internal/layout package for centralized path construction, migrates templates from name-based TEXT primary keys to UUID PKs with team-scoped directories (WRENN_DIR/images/teams/{team_id}/{template_id}). The built-in minimal template uses sentinel zero UUIDs. Proto messages carry team_id + template_id alongside deprecated template name field. Team deletion now cleans up template files across all hosts.	2026-03-29 00:30:10 +06:00
pptx704	34af77e0d8	Fix snapshot race, delete auth, sparse dd, default disk to 5GB Snapshot race fix: - Pre-mark sandbox as "paused" in DB before issuing CreateSnapshot and PauseSandbox RPCs, preventing the reconciler from marking it "stopped" during the flatten window when the sandbox is gone from the host agent's in-memory map but DB still says "running" - Revert status to "running" on RPC failure - Check ctx.Err() before writing response to avoid writing to dead connections when client disconnects during long snapshot operations Delete auth fix: - Block non-admin deletion of platform templates (team_id = all-zeros) at DELETE /v1/snapshots/{name} with 403, preventing file deletion before the team ownership check fails Sparse dd: - Add conv=sparse to dd in FlattenSnapshot so flattened images preserve sparseness (~200MB actual vs 5GB logical) Default disk size: - Change default disk_size_mb from 20GB to 5GB across migration, manager, service, build, and EnsureImageSizes - Disable split-button dropdown arrow for platform templates in dashboard snapshots page (teams cannot delete platform templates)	2026-03-28 14:30:18 +06:00
pptx704	c0d6381bbe	Add disk_size_mb, auto-expand base images, admin templates endpoint Disk sizing: - Add disk_size_mb column to sandboxes table (default 20480 = 20GB) - Add disk_size_mb to CreateSandboxRequest proto, passed through the full chain: service → RPC → host agent → sandbox manager → devicemapper - devicemapper.CreateSnapshot takes separate cowSizeBytes param so the sparse CoW file can be sized independently from the origin - EnsureImageSizes() runs at host agent startup: expands any base image smaller than 20GB via truncate + resize2fs (sparse, no extra physical disk). Sandboxes then get the full 20GB via fast dm-snapshot path - FlattenRootfs shrinks output images with resize2fs -M so stored templates are compact; EnsureImageSizes re-expands on next startup Admin templates visibility: - Add GET /v1/admin/templates endpoint listing all templates across teams - Frontend admin templates page uses listAdminTemplates() instead of team-scoped listSnapshots() - Platform templates (team_id = all-zeros UUID) now visible to all teams: GetTemplateByTeam, ListTemplatesByTeam, ListTemplatesByTeamAndType queries include platform team_id in WHERE clause	2026-03-26 23:45:41 +06:00
pptx704	4ddd494160	Switch database IDs from TEXT to native UUID Consolidate 16 migrations into one with UUID columns for all entity IDs. TEXT is kept only for polymorphic fields (audit_logs.actor_id, resource_id) and template names. The id package now generates UUIDs via google/uuid, with Format/Parse helpers for the prefixed wire format (sb-{uuid}, usr-{uuid}, etc.). Auth context, services, and handlers pass pgtype.UUID internally; conversion to/from prefixed strings happens at API and RPC boundaries. Adds PlatformTeamID (all-zeros UUID) for shared resources.	2026-03-26 16:16:21 +06:00
pptx704	cdd89a7cee	Fix review issues: detached contexts, loop device leak, timer leak, size_bytes - Use context.Background() with timeout in destroySandbox/failBuild so cleanup and DB writes survive parent context cancellation on shutdown - Fix loop device refcount leak in FlattenRootfs when dmDevice is nil - Replace time.After with time.NewTimer in healthcheck polling to avoid goroutine leak when healthcheck passes early - Capture size_bytes from CreateSnapshot/FlattenRootfs RPC responses instead of hardcoding 0 in the templates table insert - Avoid leaking internal error details to API clients in build handler	2026-03-26 15:31:38 +06:00
pptx704	1ce62934b3	Add template build system with admin panel, async workers, and FlattenRootfs RPC Introduces an end-to-end template building pipeline: admins submit a recipe (list of shell commands) via the dashboard, a Redis-backed worker pool spins up a sandbox, executes each command, and produces either a full snapshot (with healthcheck) or an image-only template (rootfs flattened via a new FlattenRootfs host-agent RPC). Build progress and per-step logs are persisted to a new template_builds table and polled by the frontend. Backend: - New FlattenRootfs RPC (proto + host agent + sandbox manager) - BuildService with Redis queue (BLPOP) and configurable worker pool (default 2) - Admin-only REST endpoints: POST/GET /v1/admin/builds, GET /v1/admin/builds/{id} - Migration for template_builds table with JSONB logs and recipe columns - sqlc queries for build CRUD and progress updates Frontend: - /admin/templates page with Templates + Builds tabs - Create Template dialog with recipe textarea, healthcheck, specs - Build history with expandable per-step logs, status badges, progress bars - Auto-polling every 3s for active builds - AdminSidebar updated with Templates nav item	2026-03-26 15:27:21 +06:00
pptx704	6898528096	Replace one-shot clock_settime with chrony for continuous guest time sync Switch from the envd /init endpoint pushing host time via syscall to chronyd reading the KVM PTP hardware clock (/dev/ptp0) continuously. This fixes clock drift between init calls and handles snapshot resume gracefully. Changes: - Add clocksource=kvm-clock kernel boot arg - Start chronyd in wrenn-init.sh before tini (PHC /dev/ptp0, makestep 1.0 -1) - Remove clock_settime logic from envd SetData and shouldSetSystemTime - Remove client.Init() clock sync calls from sandbox manager (3 sites) - Remove Init() method from envdclient (no longer needed) - Simplify rootfs scripts: socat/chrony now come from apt in the container image, only envd/wrenn-init/tini are injected by build scripts	2026-03-26 04:47:44 +06:00
pptx704	ed7880bc6c	Add per-capsule stats detail page with live CPU/RAM charts - New detail page at /dashboard/capsules/[id] with Stats and Files tabs - Stats tab shows capsule info card (status, template, CPU, memory, disk, started, idle timeout) and two stacked Chart.js charts with live values - Metrics API client with 10s polling and moving-average smoothing - Capsule ID in list table is now a clickable link to the detail page - Layout breadcrumb header (Capsules > sb-xxx) with back navigation - Fix metrics sampler: use v.PID() directly as Firecracker PID since unshare -m execs (not forks) through the bash/ip-netns-exec/firecracker chain, so all share the same PID. Removes unused findChildPID. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 22:31:05 +06:00
pptx704	49b0b646a8	Add 5m, 1h, 6h, 12h range filters to metrics endpoint Maps each user-facing range to the appropriate underlying ring buffer tier and applies a time cutoff filter. No new ring buffers needed — 5m/10m read from the 10m tier, 1h/2h from the 2h tier, 6h/12h/24h from the 24h tier.	2026-03-25 20:44:28 +06:00
pptx704	9acdbb5ae9	Add per-sandbox CPU/memory/disk metrics collection Samples /proc/{fc_pid}/stat (CPU%), /proc/{fc_pid}/status (VmRSS), and stat() on CoW files at 500ms intervals per running sandbox. Three tiered ring buffers downsample into 30s and 5min averages for 10min/2h/24h retention. Metrics are flushed to DB on pause (all tiers) and destroy (24h only). New GetSandboxMetrics and FlushSandboxMetrics RPCs on the host agent, proxied through GET /v1/sandboxes/{id}/metrics?range= on the control plane. Returns live data for running sandboxes, DB data for paused, and 404 for stopped.	2026-03-25 20:10:33 +06:00
pptx704	9bf67aa7f7	Implement host registration, JWT refresh tokens, and multi-host scheduling Replaces the hardcoded CP_HOST_AGENT_ADDR single-agent setup with a DB-driven registration system supporting multiple host agents (BYOC). Key changes: - Host agents register via one-time token, receive a 7-day JWT + 60-day refresh token; heartbeat loop auto-refreshes on 401/403 and pauses all sandboxes if refresh fails - HostClientPool: lazy Connect RPC client cache keyed by host ID, replacing the single static agent client throughout the API and service layers - RoundRobinScheduler: picks an online host for each new sandbox via ListActiveHosts; extensible for future scheduling strategies - HostMonitor (replaces Reconciler): passive heartbeat staleness check marks hosts unreachable and sandboxes missing after 90s; active reconciliation per online host restores missing-but-alive sandboxes and stops orphans - Graceful host delete: returns 409 with affected sandbox list without ?force=true; force-delete destroys sandboxes then evicts pool client - Snapshot delete broadcasts to all online hosts (templates have no host_id) - sandbox.Manager.PauseAll: pauses all running VMs on CP connectivity loss - New migration: host_refresh_tokens table with token rotation (issue-then-revoke ordering to prevent lockout on mid-rotation crash) - New sandbox status 'missing' (reversible, unlike 'stopped') and host status 'unreachable'; both reflected in OpenAPI spec - Fix: refresh token auth failure now returns 401 (was 400 via generic 'invalid' substring match in serviceErrToHTTP)	2026-03-24 18:32:05 +06:00
pptx704	36782e1b4f	Add tini as PID 1, guest clock sync, and fix PATH in guest VMs - Use tini as PID 1 in wrenn-init.sh so zombie processes are reaped and signals are forwarded correctly to envd - Set standard PATH in wrenn-init.sh so child processes spawned by envd can find common binaries (fixes "nice: ls command not found") - Add envdclient.Init() to POST /init on envd after every boot/resume, syncing the guest clock via unix.ClockSettime — critical after snapshot resume where the guest clock is frozen - Run Init in a background goroutine so it doesn't block the CreateSandbox RPC response; a slow Init (vCPU busy with envd startup) was causing the RPC context to be canceled before the response reached the control plane - Update rootfs-from-container.sh and update-debug-rootfs.sh to inject tini into the rootfs, checking the container image and host first, downloading from GitHub releases as fallback	2026-03-23 02:45:27 +06:00
pptx704	477d4f8cf6	Add auto-pause TTL and ping endpoint for sandbox inactivity management Replace the existing auto-destroy TTL behavior with auto-pause: when a sandbox exceeds its timeout_sec of inactivity, the TTL reaper now pauses it (snapshot + teardown) instead of destroying it, preserving the ability to resume later. Key changes: - TTL reaper calls Pause instead of Destroy, with fallback to Destroy if pause fails (e.g. Firecracker process already gone) - New PingSandbox RPC resets the in-memory LastActiveAt timer - New POST /v1/sandboxes/{id}/ping REST endpoint resets both agent memory and DB last_active_at - ListSandboxes RPC now includes auto_paused_sandbox_ids so the reconciler can distinguish auto-paused sandboxes from crashed ones in a single call - Reconciler polls every 5s (was 30s) and marks auto-paused as "paused" vs orphaned as "stopped" - Resume RPC accepts timeout_sec from DB so TTL survives pause/resume cycles - Reaper checks every 2s (was 10s) and uses a detached context to avoid incomplete pauses on app shutdown - Default timeout_sec changed from 300 to 0 (no auto-pause unless requested)	2026-03-15 05:15:18 +06:00
pptx704	88246fac2b	Fix sandbox lifecycle cleanup and dmsetup remove reliability - Add retry with backoff to dmsetupRemove for transient "device busy" errors caused by kernel not releasing the device immediately after Firecracker exits. Only retries on "Device or resource busy"; other errors (not found, permission denied) return immediately. - Thread context.Context through RemoveSnapshot/RestoreSnapshot so retries respect cancellation. Use context.Background() in all error cleanup paths to prevent cancelled contexts from skipping cleanup and leaking dm devices on the host. - Resume vCPUs on pause failure: if snapshot creation or memfile processing fails after freezing the VM, unfreeze vCPUs so the sandbox stays usable instead of becoming a frozen zombie. - Fix resource leaks in Pause when CoW rename or metadata write fails: properly clean up network, slot, loop device, and remove from boxes map instead of leaving a dead sandbox with leaked host resources. - Fix Resume WaitUntilReady failure: roll back CoW file to the snapshot directory instead of deleting it, preserving the paused state so the user can retry. - Skip m.loops.Release when RemoveSnapshot fails during pause since the stale dm device still references the origin loop device. - Fix incorrect VCPUs placeholder in Resume VMConfig that used memory size instead of a sensible default.	2026-03-14 06:42:34 +06:00
pptx704	1846168736	Fix device-mapper "Device or resource busy" error on sandbox resume Pause was logging RemoveSnapshot failures as warnings and continuing, which left stale dm devices behind. Resume then failed trying to create a device with the same name. - Make RemoveSnapshot failure a hard error in Pause (clean up remaining resources and return error instead of silently proceeding) - Add defensive stale device cleanup in RestoreSnapshot before creating the new dm device	2026-03-14 03:57:14 +06:00
pptx704	80a99eec87	Add diff snapshots for re-pause to avoid UFFD fault-in storm Use Firecracker's Diff snapshot type when re-pausing a previously resumed sandbox, capturing only dirty pages instead of a full memory dump. Chains up to 10 incremental generations before collapsing back to a Full snapshot. Multi-generation diff files (memfile.{buildID}) are supported alongside the legacy single-file format in resume, template creation, and snapshot existence checks.	2026-03-13 09:41:58 +06:00
pptx704	a0d635ae5e	Fix path traversal in template/snapshot names and network cleanup leaks Add SafeName validator (allowlist regex) to reject directory traversal in user-supplied template and snapshot names. Validated at both API handlers (400 response) and sandbox manager (defense in depth). Refactor CreateNetwork with rollback slice so partially created resources (namespace, veth, routes, iptables rules) are cleaned up on any error. Refactor RemoveNetwork to collect and return errors instead of silently ignoring them.	2026-03-13 08:40:36 +06:00
pptx704	63e9132d38	Add device-mapper snapshots, test UI, fix pause ordering and lint errors - Replace reflink rootfs copy with device-mapper snapshots (shared read-only loop device per base template, per-sandbox sparse CoW file) - Add devicemapper package with create/restore/remove/flatten operations and refcounted LoopRegistry for base image loop devices - Fix pause ordering: destroy VM before removing dm-snapshot to avoid "device busy" error (FC must release the dm device first) - Add test UI at GET /test for sandbox lifecycle management (create, pause, resume, destroy, exec, snapshot create/list/delete) - Fix DirSize to report actual disk usage (stat.Blocks * 512) instead of apparent size, so sparse CoW files report correctly - Add timing logs to pause flow for performance diagnostics - Fix all lint errors across api, network, vm, uffd, and sandbox packages - Remove obsolete internal/filesystem package (replaced by devicemapper) - Update CLAUDE.md with device-mapper architecture documentation	2026-03-13 08:25:40 +06:00
pptx704	a1bd439c75	Add sandbox snapshot and restore with UFFD lazy memory loading Implement full snapshot lifecycle: pause (snapshot + free resources), resume (UFFD-based lazy restore), and named snapshot templates that can spawn new sandboxes from frozen VM state. Key changes: - Snapshot header system with generational diff mapping (inspired by e2b) - UFFD server for lazy page fault handling during snapshot restore - Stable rootfs symlink path (/tmp/fc-vm/) for snapshot compatibility - Templates DB table and CRUD API endpoints (POST/GET/DELETE /v1/snapshots) - CreateSnapshot/DeleteSnapshot RPCs in hostagent proto - Reconciler excludes paused sandboxes (expected absent from host agent) - Snapshot templates lock vcpus/memory to baked-in values - Proper cleanup of uffd sockets and pause snapshot files on destroy	2026-03-12 09:19:37 +06:00
pptx704	b4d8edb65b	Add streaming exec and file transfer endpoints Add WebSocket-based streaming exec endpoint and streaming file upload/download endpoints to the control plane API. Includes new host agent RPC methods (ExecStream, StreamWriteFile, StreamReadFile), envd client streaming support, and OpenAPI spec updates.	2026-03-11 05:42:42 +06:00
pptx704	ec3360d9ad	Add minimal control plane with REST API, database, and reconciler - REST API (chi router): sandbox CRUD, exec, pause/resume, file write/read - PostgreSQL persistence via pgx/v5 + sqlc (sandboxes table with goose migration) - Connect RPC client to host agent for all VM operations - Reconciler syncs host agent state with DB every 30s (detects TTL-reaped sandboxes) - OpenAPI 3.1 spec served at /openapi.yaml, Swagger UI at /docs - Added WriteFile/ReadFile RPCs to hostagent proto and implementations - File upload via multipart form, download via JSON body POST - sandbox_id propagated from control plane to host agent on create	2026-03-10 16:50:12 +06:00
pptx704	6f0c365d44	Add host agent RPC server with sandbox lifecycle management Implement the host agent as a Connect RPC server that orchestrates sandbox creation, destruction, pause/resume, and command execution. Includes sandbox manager with TTL-based reaper, network slot allocator, rootfs cloning, hostagent proto definition with generated stubs, and test/debug scripts. Fix Firecracker process lifetime bug where VM was tied to HTTP request context instead of background context.	2026-03-10 03:54:53 +06:00
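
Commit 49b0b646a8 describes mapping each user-facing metrics range onto one of three ring buffer tiers and then applying a time cutoff. The log does not show the implementation, so the following is only a minimal sketch of that idea. The names `Tier`, `rangeSpec`, `ResolveRange`, `Sample`, and `FilterSince` are hypothetical and chosen for illustration; only the range-to-tier pairs come from the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// Tier names a ring buffer retention tier (hypothetical type for this sketch).
type Tier string

const (
	Tier10m Tier = "10m"
	Tier2h  Tier = "2h"
	Tier24h Tier = "24h"
)

// rangeSpec pairs a backing tier with the cutoff window for one user-facing range.
type rangeSpec struct {
	tier   Tier
	window time.Duration
}

// rangeSpecs follows the mapping stated in the commit message:
// 5m/10m -> 10m tier, 1h/2h -> 2h tier, 6h/12h/24h -> 24h tier.
var rangeSpecs = map[string]rangeSpec{
	"5m":  {Tier10m, 5 * time.Minute},
	"10m": {Tier10m, 10 * time.Minute},
	"1h":  {Tier2h, time.Hour},
	"2h":  {Tier2h, 2 * time.Hour},
	"6h":  {Tier24h, 6 * time.Hour},
	"12h": {Tier24h, 12 * time.Hour},
	"24h": {Tier24h, 24 * time.Hour},
}

// ResolveRange returns the tier to read from and the cutoff window to apply.
func ResolveRange(r string) (Tier, time.Duration, error) {
	spec, ok := rangeSpecs[r]
	if !ok {
		return "", 0, fmt.Errorf("unsupported range %q", r)
	}
	return spec.tier, spec.window, nil
}

// Sample is a single metrics point (assumed shape, not taken from the repository).
type Sample struct {
	At  time.Time
	CPU float64
}

// FilterSince keeps the samples that are no older than window relative to now.
func FilterSince(samples []Sample, now time.Time, window time.Duration) []Sample {
	cutoff := now.Add(-window)
	out := make([]Sample, 0, len(samples))
	for _, s := range samples {
		if !s.At.Before(cutoff) {
			out = append(out, s)
		}
	}
	return out
}
```

A handler built this way would call `ResolveRange` on the `range` query parameter, read the returned tier's buffer, and pass the result through `FilterSince`, so a narrower range such as 5m needs no buffer of its own.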

25 Commits
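
Commit a0d635ae5e mentions a SafeName validator that uses an allowlist regex to reject directory traversal in template and snapshot names. The actual pattern is not shown in the log, so the sketch below assumes one: names must start with an alphanumeric character and may continue with letters, digits, dots, underscores, or hyphens, up to 64 characters in total. Both the pattern and the boolean signature are assumptions made for illustration.

```go
package main

import "regexp"

// safeNameRe is an assumed allowlist: the first character must be alphanumeric,
// the rest may be alphanumeric, '.', '_' or '-'. Path separators are never
// allowed, and a leading '.' is rejected, so "." and ".." cannot match.
var safeNameRe = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9._-]{0,63}$`)

// SafeName reports whether name is acceptable as a single path component.
func SafeName(name string) bool {
	return safeNameRe.MatchString(name)
}
```

With this pattern, `SafeName("ubuntu-22.04")` is accepted while `SafeName("../etc/passwd")`, `SafeName("a/b")`, and the empty string are rejected. An allowlist is used instead of a blocklist of dangerous sequences because anything not explicitly permitted fails by default.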