Disk sizing:
- Add disk_size_mb column to sandboxes table (default 20480 = 20GB)
- Add disk_size_mb to CreateSandboxRequest proto, passed through the
full chain: service → RPC → host agent → sandbox manager → devicemapper
- devicemapper.CreateSnapshot takes separate cowSizeBytes param so the
sparse CoW file can be sized independently from the origin
- EnsureImageSizes() runs at host agent startup: expands any base image
smaller than 20GB via truncate + resize2fs (sparse, so no extra physical
disk is used). Sandboxes then get the full 20GB via the fast dm-snapshot
path
- FlattenRootfs shrinks output images with resize2fs -M so stored
templates are compact; EnsureImageSizes re-expands on next startup
Admin templates visibility:
- Add GET /v1/admin/templates endpoint listing all templates across teams
- Frontend admin templates page uses listAdminTemplates() instead of
team-scoped listSnapshots()
- Platform templates (team_id = all-zeros UUID) now visible to all teams:
GetTemplateByTeam, ListTemplatesByTeam, ListTemplatesByTeamAndType
queries include platform team_id in WHERE clause
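
The visibility rule above can be sketched as a predicate. This is illustrative only: visibleToTeam and platformTeamID are hypothetical names, and the real change lives in the WHERE clauses of the SQL queries, not in Go code.

```go
package main

// platformTeamID is the all-zeros UUID that marks platform templates.
const platformTeamID = "00000000-0000-0000-0000-000000000000"

// visibleToTeam reports whether a template owned by ownerTeamID should be
// returned to callerTeamID. It mirrors a WHERE clause of the form
// "team_id = <caller> OR team_id = <platform>" (column name as given in
// the commit message; the exact SQL is an assumption).
func visibleToTeam(ownerTeamID, callerTeamID string) bool {
	return ownerTeamID == callerTeamID || ownerTeamID == platformTeamID
}
```

Templates owned by another team stay hidden; only the platform team's templates are shared.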
- Add retry with backoff to dmsetupRemove for transient "device busy"
errors caused by the kernel not releasing the device immediately after
Firecracker exits. Only retries on "Device or resource busy"; other
errors (not found, permission denied) return immediately.
- Thread context.Context through RemoveSnapshot/RestoreSnapshot so
retries respect cancellation. Use context.Background() in all error
cleanup paths to prevent cancelled contexts from skipping cleanup
and leaking dm devices on the host.
- Resume vCPUs on pause failure: if snapshot creation or memfile
processing fails after freezing the VM, unfreeze vCPUs so the
sandbox stays usable instead of becoming a frozen zombie.
- Fix resource leaks in Pause when CoW rename or metadata write fails:
properly clean up network, slot, loop device, and remove from boxes
map instead of leaving a dead sandbox with leaked host resources.
- Fix Resume WaitUntilReady failure: roll back CoW file to the snapshot
directory instead of deleting it, preserving the paused state so the
user can retry.
- Skip m.loops.Release when RemoveSnapshot fails during pause since
the stale dm device still references the origin loop device.
- Fix incorrect VCPUs placeholder in Resume VMConfig that used memory
size instead of a sensible default.
Pause was logging RemoveSnapshot failures as warnings and continuing,
which left stale dm devices behind. Resume then failed trying to create
a device with the same name.
- Make RemoveSnapshot failure a hard error in Pause (clean up remaining
resources and return error instead of silently proceeding)
- Add defensive stale device cleanup in RestoreSnapshot before creating
the new dm device
- Replace reflink rootfs copy with device-mapper snapshots (shared
read-only loop device per base template, per-sandbox sparse CoW file)
- Add devicemapper package with create/restore/remove/flatten operations
and refcounted LoopRegistry for base image loop devices
- Fix pause ordering: destroy VM before removing dm-snapshot to avoid
"device busy" error (Firecracker must release the dm device first)
- Add test UI at GET /test for sandbox lifecycle management (create,
pause, resume, destroy, exec, snapshot create/list/delete)
- Fix DirSize to report actual disk usage (stat.Blocks * 512) instead
of apparent size, so sparse CoW files report correctly
- Add timing logs to pause flow for performance diagnostics
- Fix all lint errors across api, network, vm, uffd, and sandbox packages
- Remove obsolete internal/filesystem package (replaced by devicemapper)
- Update CLAUDE.md with device-mapper architecture documentation