Moves business logic from API handlers into internal/service/ so that
both the REST API and the upcoming dashboard can share the same operations
without duplicating code. API handlers now delegate to the service layer
and only handle HTTP-specific concerns (request parsing, response formatting).
Implement OAuth 2.0 login via GitHub as an alternative to email/password.
Uses a provider registry pattern (internal/auth/oauth/) so adding Google
or other providers later requires only a new Provider implementation.
Flow: GET /v1/auth/oauth/github redirects to GitHub, callback exchanges
the code for a user profile, upserts the user + team atomically, and
redirects to the frontend with a JWT token.
Key changes:
- Migration: make password_hash nullable, add oauth_providers table
- Provider registry with GitHubProvider (profile + email fallback)
- CSRF state cookie with HMAC-SHA256 validation
- Race-safe registration (23505 collision retries as login)
- Startup validation: CP_PUBLIC_URL required when OAuth is configured
Not fully tested — needs integration tests with a real GitHub OAuth app
and end-to-end testing with the frontend callback page.
Replace the existing auto-destroy TTL behavior with auto-pause: when a
sandbox exceeds its timeout_sec of inactivity, the TTL reaper now pauses
it (snapshot + teardown) instead of destroying it, preserving the ability
to resume later.
Key changes:
- TTL reaper calls Pause instead of Destroy, with fallback to Destroy if
pause fails (e.g. Firecracker process already gone)
- New PingSandbox RPC resets the in-memory LastActiveAt timer
- New POST /v1/sandboxes/{id}/ping REST endpoint resets both agent memory
and DB last_active_at
- ListSandboxes RPC now includes auto_paused_sandbox_ids so the reconciler
can distinguish auto-paused sandboxes from crashed ones in a single call
- Reconciler polls every 5s (was 30s) and marks auto-paused as "paused"
vs orphaned as "stopped"
- Resume RPC accepts timeout_sec from DB so TTL survives pause/resume cycles
- Reaper checks every 2s (was 10s) and uses a detached context to avoid
incomplete pauses on app shutdown
- Default timeout_sec changed from 300 to 0 (no auto-pause unless requested)
- Add retry with backoff to dmsetupRemove for transient "device busy"
errors caused by kernel not releasing the device immediately after
Firecracker exits. Only retries on "Device or resource busy"; other
errors (not found, permission denied) return immediately.
- Thread context.Context through RemoveSnapshot/RestoreSnapshot so
retries respect cancellation. Use context.Background() in all error
cleanup paths to prevent cancelled contexts from skipping cleanup
and leaking dm devices on the host.
- Resume vCPUs on pause failure: if snapshot creation or memfile
processing fails after freezing the VM, unfreeze vCPUs so the
sandbox stays usable instead of becoming a frozen zombie.
- Fix resource leaks in Pause when CoW rename or metadata write fails:
properly clean up network, slot, loop device, and remove from boxes
map instead of leaving a dead sandbox with leaked host resources.
- Fix Resume WaitUntilReady failure: roll back CoW file to the snapshot
directory instead of deleting it, preserving the paused state so the
user can retry.
- Skip m.loops.Release when RemoveSnapshot fails during pause since
the stale dm device still references the origin loop device.
- Fix incorrect VCPUs placeholder in Resume VMConfig that used memory
size instead of a sensible default.
Pause was logging RemoveSnapshot failures as warnings and continuing,
which left stale dm devices behind. Resume then failed trying to create
a device with the same name.
- Make RemoveSnapshot failure a hard error in Pause (clean up remaining
resources and return error instead of silently proceeding)
- Add defensive stale device cleanup in RestoreSnapshot before creating
the new dm device
Implement email/password auth with JWT sessions and API key auth for
sandbox lifecycle. Users get a default team on signup; sandboxes,
snapshots, and API keys are scoped to teams.
- Add user, team, users_teams, and team_api_keys tables (goose migrations)
- Add JWT middleware (Bearer token) for user management endpoints
- Add API key middleware (X-API-Key header, SHA-256 hashed) for sandbox ops
- Add signup/login handlers with transactional user+team creation
- Add API key CRUD endpoints (create/list/delete)
- Replace owner_id with team_id on sandboxes and templates
- Update all handlers to use team-scoped queries
- Add godotenv for .env file loading
- Update OpenAPI spec and test UI with auth flows
Use Firecracker's Diff snapshot type when re-pausing a previously
resumed sandbox, capturing only dirty pages instead of a full memory
dump. Chains up to 10 incremental generations before collapsing back
to a Full snapshot. Multi-generation diff files (memfile.{buildID})
are supported alongside the legacy single-file format in resume,
template creation, and snapshot existence checks.
Add SafeName validator (allowlist regex) to reject directory traversal
in user-supplied template and snapshot names. Validated at both API
handlers (400 response) and sandbox manager (defense in depth).
Refactor CreateNetwork with rollback slice so partially created
resources (namespace, veth, routes, iptables rules) are cleaned up
on any error. Refactor RemoveNetwork to collect and return errors
instead of silently ignoring them.
- Replace reflink rootfs copy with device-mapper snapshots (shared
read-only loop device per base template, per-sandbox sparse CoW file)
- Add devicemapper package with create/restore/remove/flatten operations
and refcounted LoopRegistry for base image loop devices
- Fix pause ordering: destroy VM before removing dm-snapshot to avoid
"device busy" error (FC must release the dm device first)
- Add test UI at GET /test for sandbox lifecycle management (create,
pause, resume, destroy, exec, snapshot create/list/delete)
- Fix DirSize to report actual disk usage (stat.Blocks * 512) instead
of apparent size, so sparse CoW files report correctly
- Add timing logs to pause flow for performance diagnostics
- Fix all lint errors across api, network, vm, uffd, and sandbox packages
- Remove obsolete internal/filesystem package (replaced by devicemapper)
- Update CLAUDE.md with device-mapper architecture documentation
Implement full snapshot lifecycle: pause (snapshot + free resources),
resume (UFFD-based lazy restore), and named snapshot templates that
can spawn new sandboxes from frozen VM state.
Key changes:
- Snapshot header system with generational diff mapping (inspired by e2b)
- UFFD server for lazy page fault handling during snapshot restore
- Stable rootfs symlink path (/tmp/fc-vm/) for snapshot compatibility
- Templates DB table and CRUD API endpoints (POST/GET/DELETE /v1/snapshots)
- CreateSnapshot/DeleteSnapshot RPCs in hostagent proto
- Reconciler excludes paused sandboxes (expected absent from host agent)
- Snapshot templates lock vcpus/memory to baked-in values
- Proper cleanup of uffd sockets and pause snapshot files on destroy
Add resolv.conf to wrenn-init so guests can resolve DNS, and fix the
host MASQUERADE rule to match vpeerIP (the actual source after namespace
SNAT) instead of hostIP.
Add WebSocket-based streaming exec endpoint and streaming file
upload/download endpoints to the control plane API. Includes new
host agent RPC methods (ExecStream, StreamWriteFile, StreamReadFile),
envd client streaming support, and OpenAPI spec updates.
- REST API (chi router): sandbox CRUD, exec, pause/resume, file write/read
- PostgreSQL persistence via pgx/v5 + sqlc (sandboxes table with goose migration)
- Connect RPC client to host agent for all VM operations
- Reconciler syncs host agent state with DB every 30s (detects TTL-reaped sandboxes)
- OpenAPI 3.1 spec served at /openapi.yaml, Swagger UI at /docs
- Added WriteFile/ReadFile RPCs to hostagent proto and implementations
- File upload via multipart form, download via JSON body POST
- sandbox_id propagated from control plane to host agent on create
Implement the host agent as a Connect RPC server that orchestrates
sandbox creation, destruction, pause/resume, and command execution.
Includes sandbox manager with TTL-based reaper, network slot allocator,
rootfs cloning, hostagent proto definition with generated stubs, and
test/debug scripts. Fix Firecracker process lifetime bug where VM was
tied to HTTP request context instead of background context.
Remove duplicate proto files from envd/spec/ and update envd's
buf.gen.yaml to generate stubs from the canonical proto/envd/ location.
Both modules now generate their own Connect RPC stubs from the same
source protos.
Implements Phase 1: boot a Firecracker microVM, execute a command inside
it via envd, and get the output back. Uses raw Firecracker HTTP API via
Unix socket (not the Go SDK) for full control over the VM lifecycle.
- internal/vm: VM manager with create/pause/resume/destroy, Firecracker
HTTP client, process launcher with unshare + ip netns exec isolation
- internal/network: per-sandbox network namespace with veth pair, TAP
device, NAT rules, and IP forwarding
- internal/envdclient: Connect RPC client for envd process/filesystem
services with health check retry
- cmd/host-agent: demo binary that boots a VM, runs "echo hello", prints
output, and cleans up
- proto/envd: canonical proto files with buf + protoc-gen-connect-go
code generation
- images/wrenn-init.sh: minimal PID 1 init script for guest VMs
- CLAUDE.md: updated architecture to reflect TAP networking (not vsock)
and Firecracker HTTP API (not Go SDK)
- Copy envd source from e2b-dev/infra, internalize shared dependencies
into envd/internal/shared/ (keys, filesystem, id, smap, utils)
- Switch from gRPC to Connect RPC for all envd services
- Update module paths to git.omukk.dev/wrenn/{sandbox,sandbox/envd}
- Add proto specs (process, filesystem) with buf-based code generation
- Implement full envd: process exec, filesystem ops, port forwarding,
cgroup management, MMDS integration, and HTTP API
- Update main module dependencies (firecracker SDK, pgx, goose, etc.)
- Remove placeholder .gitkeep files replaced by real implementations