wrenn-releases

Author	SHA1	Message	Date
pptx704	a5425969ed	fix: assorted bug fixes for CH migration Fix resource leaks, race conditions, and error handling across host agent and control plane: proper sparse file cleanup on close error, connect error wrapping for MakeDir, CoW file cleanup on pause failure, per-sandbox VM directories, deferred map deletion to avoid race in VM destroy, and goroutine launch for extension background workers.	2026-05-17 01:47:56 +06:00
pptx704	eaa6b8576d	feat(vm): replace Firecracker with Cloud Hypervisor Migrate the entire VM layer from Firecracker to Cloud Hypervisor (CH). CH provides native snapshot/restore via its HTTP API, eliminating the need for custom UFFD handling, memfile processing, and snapshot header management that Firecracker required. Key changes: - Remove fc.go, jailer.go (FC process management) - Remove internal/uffd/ package (userfaultfd lazy page loading) - Remove snapshot/header.go, mapping.go, memfile.go (FC snapshot format) - Add ch.go (CH HTTP API client over Unix socket) - Add process.go (CH process lifecycle with unshare+netns) - Add chversion.go (CH version detection) - Refactor sandbox manager: remove UFFD socket tracking, snapshot parent/diff chaining, FC-specific balloon logic; add crash watcher - Simplify snapshot/local.go to CH's native snapshot format - Update VM config: FirecrackerBin → VMMBin, new CH-specific fields - Update envdclient, devicemapper, network for CH compatibility	2026-05-17 01:33:12 +06:00
pptx704	34af77e0d8	Fix snapshot race, delete auth, sparse dd, default disk to 5GB Snapshot race fix: - Pre-mark sandbox as "paused" in DB before issuing CreateSnapshot and PauseSandbox RPCs, preventing the reconciler from marking it "stopped" during the flatten window when the sandbox is gone from the host agent's in-memory map but DB still says "running" - Revert status to "running" on RPC failure - Check ctx.Err() before writing response to avoid writing to dead connections when client disconnects during long snapshot operations Delete auth fix: - Block non-admin deletion of platform templates (team_id = all-zeros) at DELETE /v1/snapshots/{name} with 403, preventing file deletion before the team ownership check fails Sparse dd: - Add conv=sparse to dd in FlattenSnapshot so flattened images preserve sparseness (~200MB actual vs 5GB logical) Default disk size: - Change default disk_size_mb from 20GB to 5GB across migration, manager, service, build, and EnsureImageSizes - Disable split-button dropdown arrow for platform templates in dashboard snapshots page (teams cannot delete platform templates)	2026-03-28 14:30:18 +06:00
pptx704	c0d6381bbe	Add disk_size_mb, auto-expand base images, admin templates endpoint Disk sizing: - Add disk_size_mb column to sandboxes table (default 20480 = 20GB) - Add disk_size_mb to CreateSandboxRequest proto, passed through the full chain: service → RPC → host agent → sandbox manager → devicemapper - devicemapper.CreateSnapshot takes separate cowSizeBytes param so the sparse CoW file can be sized independently from the origin - EnsureImageSizes() runs at host agent startup: expands any base image smaller than 20GB via truncate + resize2fs (sparse, no extra physical disk). Sandboxes then get the full 20GB via fast dm-snapshot path - FlattenRootfs shrinks output images with resize2fs -M so stored templates are compact; EnsureImageSizes re-expands on next startup Admin templates visibility: - Add GET /v1/admin/templates endpoint listing all templates across teams - Frontend admin templates page uses listAdminTemplates() instead of team-scoped listSnapshots() - Platform templates (team_id = all-zeros UUID) now visible to all teams: GetTemplateByTeam, ListTemplatesByTeam, ListTemplatesByTeamAndType queries include platform team_id in WHERE clause	2026-03-26 23:45:41 +06:00
pptx704	88246fac2b	Fix sandbox lifecycle cleanup and dmsetup remove reliability - Add retry with backoff to dmsetupRemove for transient "device busy" errors caused by kernel not releasing the device immediately after Firecracker exits. Only retries on "Device or resource busy"; other errors (not found, permission denied) return immediately. - Thread context.Context through RemoveSnapshot/RestoreSnapshot so retries respect cancellation. Use context.Background() in all error cleanup paths to prevent cancelled contexts from skipping cleanup and leaking dm devices on the host. - Resume vCPUs on pause failure: if snapshot creation or memfile processing fails after freezing the VM, unfreeze vCPUs so the sandbox stays usable instead of becoming a frozen zombie. - Fix resource leaks in Pause when CoW rename or metadata write fails: properly clean up network, slot, loop device, and remove from boxes map instead of leaving a dead sandbox with leaked host resources. - Fix Resume WaitUntilReady failure: roll back CoW file to the snapshot directory instead of deleting it, preserving the paused state so the user can retry. - Skip m.loops.Release when RemoveSnapshot fails during pause since the stale dm device still references the origin loop device. - Fix incorrect VCPUs placeholder in Resume VMConfig that used memory size instead of a sensible default.	2026-03-14 06:42:34 +06:00
pptx704	1846168736	Fix device-mapper "Device or resource busy" error on sandbox resume Pause was logging RemoveSnapshot failures as warnings and continuing, which left stale dm devices behind. Resume then failed trying to create a device with the same name. - Make RemoveSnapshot failure a hard error in Pause (clean up remaining resources and return error instead of silently proceeding) - Add defensive stale device cleanup in RestoreSnapshot before creating the new dm device	2026-03-14 03:57:14 +06:00
pptx704	63e9132d38	Add device-mapper snapshots, test UI, fix pause ordering and lint errors - Replace reflink rootfs copy with device-mapper snapshots (shared read-only loop device per base template, per-sandbox sparse CoW file) - Add devicemapper package with create/restore/remove/flatten operations and refcounted LoopRegistry for base image loop devices - Fix pause ordering: destroy VM before removing dm-snapshot to avoid "device busy" error (FC must release the dm device first) - Add test UI at GET /test for sandbox lifecycle management (create, pause, resume, destroy, exec, snapshot create/list/delete) - Fix DirSize to report actual disk usage (stat.Blocks * 512) instead of apparent size, so sparse CoW files report correctly - Add timing logs to pause flow for performance diagnostics - Fix all lint errors across api, network, vm, uffd, and sandbox packages - Remove obsolete internal/filesystem package (replaced by devicemapper) - Update CLAUDE.md with device-mapper architecture documentation	2026-03-13 08:25:40 +06:00

7 Commits