Replace the existing auto-destroy TTL behavior with auto-pause: when a
sandbox exceeds its timeout_sec of inactivity, the TTL reaper now pauses
it (snapshot + teardown) instead of destroying it, preserving the ability
to resume later.
Key changes:
- TTL reaper calls Pause instead of Destroy, with fallback to Destroy if
pause fails (e.g. Firecracker process already gone)
- New PingSandbox RPC resets the in-memory LastActiveAt timer
- New POST /v1/sandboxes/{id}/ping REST endpoint resets both agent memory
and DB last_active_at
- ListSandboxes RPC now includes auto_paused_sandbox_ids so the reconciler
can distinguish auto-paused sandboxes from crashed ones in a single call
- Reconciler polls every 5s (was 30s) and marks auto-paused as "paused"
vs orphaned as "stopped"
- Resume RPC accepts timeout_sec from DB so TTL survives pause/resume cycles
- Reaper checks every 2s (was 10s) and uses a detached context to avoid
incomplete pauses on app shutdown
- Default timeout_sec changed from 300 to 0 (no auto-pause unless requested)
Implement email/password auth with JWT sessions and API key auth for
sandbox lifecycle. Users get a default team on signup; sandboxes,
snapshots, and API keys are scoped to teams.
- Add user, team, users_teams, and team_api_keys tables (goose migrations)
- Add JWT middleware (Bearer token) for user management endpoints
- Add API key middleware (X-API-Key header, SHA-256 hashed) for sandbox ops
- Add signup/login handlers with transactional user+team creation
- Add API key CRUD endpoints (create/list/delete)
- Replace owner_id with team_id on sandboxes and templates
- Update all handlers to use team-scoped queries
- Add godotenv for .env file loading
- Update OpenAPI spec and test UI with auth flows
- Replace reflink rootfs copy with device-mapper snapshots (shared
read-only loop device per base template, per-sandbox sparse CoW file)
- Add devicemapper package with create/restore/remove/flatten operations
and refcounted LoopRegistry for base image loop devices
- Fix pause ordering: destroy VM before removing dm-snapshot to avoid
"device busy" error (FC must release the dm device first)
- Add test UI at GET /test for sandbox lifecycle management (create,
pause, resume, destroy, exec, snapshot create/list/delete)
- Fix DirSize to report actual disk usage (stat.Blocks * 512) instead
of apparent size, so sparse CoW files report correctly
- Add timing logs to pause flow for performance diagnostics
- Fix all lint errors across api, network, vm, uffd, and sandbox packages
- Remove obsolete internal/filesystem package (replaced by devicemapper)
- Update CLAUDE.md with device-mapper architecture documentation