- Add retry with backoff to dmsetupRemove for transient "device busy"
errors, which occur when the kernel does not release the device
immediately after Firecracker exits. Only "Device or resource busy"
is retried; other errors (not found, permission denied) return
immediately.
- Thread context.Context through RemoveSnapshot/RestoreSnapshot so
retries respect cancellation. Use context.Background() in all error
cleanup paths to prevent cancelled contexts from skipping cleanup
and leaking dm devices on the host.
- Resume vCPUs on pause failure: if snapshot creation or memfile
processing fails after freezing the VM, unfreeze vCPUs so the
sandbox stays usable instead of becoming a frozen zombie.
- Fix resource leaks in Pause when the CoW rename or metadata write
fails: release the network, slot, and loop device, and remove the
sandbox from the boxes map instead of leaving a dead sandbox with
leaked host resources.
- Fix Resume when WaitUntilReady fails: move the CoW file back to the
snapshot directory instead of deleting it, preserving the paused state
so the user can retry.
- Skip m.loops.Release when RemoveSnapshot fails during pause, since
the stale dm device still references the origin loop device.
- Fix incorrect VCPUs placeholder in Resume VMConfig that used memory
size instead of a sensible default.

Previously, Pause logged RemoveSnapshot failures as warnings and
continued, which left stale dm devices behind. Resume then failed when
it tried to create a device with the same name.

- Make RemoveSnapshot failure a hard error in Pause (clean up remaining
resources and return error instead of silently proceeding)
- Add defensive stale device cleanup in RestoreSnapshot before creating
the new dm device
- Replace reflink rootfs copy with device-mapper snapshots (shared
read-only loop device per base template, per-sandbox sparse CoW file)
- Add devicemapper package with create/restore/remove/flatten operations
and refcounted LoopRegistry for base image loop devices
- Fix pause ordering: destroy the VM before removing the dm-snapshot to
avoid a "device busy" error (Firecracker must release the dm device
first)
- Add test UI at GET /test for sandbox lifecycle management (create,
pause, resume, destroy, exec, snapshot create/list/delete)
- Fix DirSize to report actual disk usage (stat.Blocks * 512) instead
of apparent size, so sparse CoW files report correctly
- Add timing logs to pause flow for performance diagnostics
- Fix all lint errors across api, network, vm, uffd, and sandbox packages
- Remove obsolete internal/filesystem package (replaced by devicemapper)
- Update CLAUDE.md with device-mapper architecture documentation
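
The DirSize fix above amounts to summing allocated blocks rather than
file lengths. A minimal sketch, assuming a Unix-like host where
FileInfo.Sys() returns *syscall.Stat_t; diskUsage and sparseDir are
hypothetical names for illustration, not the functions in this change:

```go
package main

import (
	"io/fs"
	"os"
	"path/filepath"
	"syscall"
)

// diskUsage sums the bytes actually allocated for regular files under
// root. st_blocks is counted in 512-byte units, so a sparse file
// contributes only the blocks that have been written.
func diskUsage(root string) (int64, error) {
	var total int64
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		if !info.Mode().IsRegular() {
			return nil
		}
		st, ok := info.Sys().(*syscall.Stat_t)
		if !ok {
			total += info.Size() // fall back to apparent size
			return nil
		}
		total += int64(st.Blocks) * 512
		return nil
	})
	return total, err
}

// sparseDir creates a temp directory holding one sparse file of the
// given apparent size. It exists only to demonstrate diskUsage.
func sparseDir(size int64) (string, error) {
	dir, err := os.MkdirTemp("", "cow-demo-")
	if err != nil {
		return "", err
	}
	f, err := os.Create(filepath.Join(dir, "cow.img"))
	if err != nil {
		return "", err
	}
	defer f.Close()
	return dir, f.Truncate(size) // extends the file without allocating data
}
```

On filesystems that support sparse files, diskUsage on such a directory
reports far less than the apparent size, which is the behavior wanted
for per-sandbox CoW files.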