1
0
forked from wrenn/wrenn
Commit Graph

4 Commits

Author SHA1 Message Date
c0d6381bbe Add disk_size_mb, auto-expand base images, admin templates endpoint
Disk sizing:
- Add disk_size_mb column to sandboxes table (default 20480 = 20GB)
- Add disk_size_mb to CreateSandboxRequest proto, passed through the
  full chain: service → RPC → host agent → sandbox manager → devicemapper
- devicemapper.CreateSnapshot takes separate cowSizeBytes param so the
  sparse CoW file can be sized independently from the origin
- EnsureImageSizes() runs at host agent startup: expands any base image
  smaller than 20GB via truncate + resize2fs (sparse, no extra physical
  disk). Sandboxes then get the full 20GB via fast dm-snapshot path
- FlattenRootfs shrinks output images with resize2fs -M so stored
  templates are compact; EnsureImageSizes re-expands on next startup

Admin templates visibility:
- Add GET /v1/admin/templates endpoint listing all templates across teams
- Frontend admin templates page uses listAdminTemplates() instead of
  team-scoped listSnapshots()
- Platform templates (team_id = all-zeros UUID) now visible to all teams:
  GetTemplateByTeam, ListTemplatesByTeam, ListTemplatesByTeamAndType
  queries include platform team_id in WHERE clause
2026-03-26 23:45:41 +06:00
88246fac2b Fix sandbox lifecycle cleanup and dmsetup remove reliability
- Add retry with backoff to dmsetupRemove for transient "device busy"
  errors caused by kernel not releasing the device immediately after
  Firecracker exits. Only retries on "Device or resource busy"; other
  errors (not found, permission denied) return immediately.

- Thread context.Context through RemoveSnapshot/RestoreSnapshot so
  retries respect cancellation. Use context.Background() in all error
  cleanup paths to prevent cancelled contexts from skipping cleanup
  and leaking dm devices on the host.

- Resume vCPUs on pause failure: if snapshot creation or memfile
  processing fails after freezing the VM, unfreeze vCPUs so the
  sandbox stays usable instead of becoming a frozen zombie.

- Fix resource leaks in Pause when CoW rename or metadata write fails:
  properly clean up network, slot, loop device, and remove from boxes
  map instead of leaving a dead sandbox with leaked host resources.

- Fix Resume WaitUntilReady failure: roll back CoW file to the snapshot
  directory instead of deleting it, preserving the paused state so the
  user can retry.

- Skip m.loops.Release when RemoveSnapshot fails during pause since
  the stale dm device still references the origin loop device.

- Fix incorrect VCPUs placeholder in Resume VMConfig that used memory
  size instead of a sensible default.
2026-03-14 06:42:34 +06:00
1846168736 Fix device-mapper "Device or resource busy" error on sandbox resume
Pause was logging RemoveSnapshot failures as warnings and continuing,
which left stale dm devices behind. Resume then failed trying to create
a device with the same name.

- Make RemoveSnapshot failure a hard error in Pause (clean up remaining
  resources and return error instead of silently proceeding)
- Add defensive stale device cleanup in RestoreSnapshot before creating
  the new dm device
2026-03-14 03:57:14 +06:00
63e9132d38 Add device-mapper snapshots, test UI, fix pause ordering and lint errors
- Replace reflink rootfs copy with device-mapper snapshots (shared
  read-only loop device per base template, per-sandbox sparse CoW file)
- Add devicemapper package with create/restore/remove/flatten operations
  and refcounted LoopRegistry for base image loop devices
- Fix pause ordering: destroy VM before removing dm-snapshot to avoid
  "device busy" error (FC must release the dm device first)
- Add test UI at GET /test for sandbox lifecycle management (create,
  pause, resume, destroy, exec, snapshot create/list/delete)
- Fix DirSize to report actual disk usage (stat.Blocks * 512) instead
  of apparent size, so sparse CoW files report correctly
- Add timing logs to pause flow for performance diagnostics
- Fix all lint errors across api, network, vm, uffd, and sandbox packages
- Remove obsolete internal/filesystem package (replaced by devicemapper)
- Update CLAUDE.md with device-mapper architecture documentation
2026-03-13 08:25:40 +06:00