Implement full snapshot lifecycle: pause (snapshot + free resources), resume (UFFD-based lazy restore), and named snapshot templates that can spawn new sandboxes from frozen VM state.

Key changes:

- Snapshot header system with generational diff mapping (inspired by e2b)
- UFFD server for lazy page fault handling during snapshot restore
- Stable rootfs symlink path (/tmp/fc-vm/) for snapshot compatibility
- Templates DB table and CRUD API endpoints (POST/GET/DELETE /v1/snapshots)
- CreateSnapshot/DeleteSnapshot RPCs in hostagent proto
- Reconciler excludes paused sandboxes (expected absent from host agent)
- Snapshot templates lock vcpus/memory to baked-in values
- Proper cleanup of uffd sockets and pause snapshot files on destroy

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Wrenn Sandbox is a microVM-based code execution platform. Users create isolated sandboxes (Firecracker microVMs), run code inside them, and get output back via SDKs. Think E2B but with persistent sandboxes, pool-based pricing, and a single-binary deployment story.

## Build & Development Commands

All commands go through the Makefile. Never use raw `go build` or `go run`.

```bash
make build                    # Build all binaries → builds/
make build-cp                 # Control plane only
make build-agent              # Host agent only
make build-envd               # envd static binary (verified statically linked)

make dev                      # Full local dev: infra + migrate + control plane
make dev-infra                # Start PostgreSQL + Prometheus + Grafana (Docker)
make dev-down                 # Stop dev infra
make dev-cp                   # Control plane with hot reload (if air installed)
make dev-agent                # Host agent (sudo required)
make dev-envd                 # envd in TCP debug mode

make check                    # fmt + vet + lint + test (CI order)
make test                     # Unit tests: go test -race -v ./internal/...
make test-integration         # Integration tests (require host agent + Firecracker)
make fmt                      # gofmt both modules
make vet                      # go vet both modules
make lint                     # golangci-lint

make migrate-up               # Apply pending migrations
make migrate-down             # Rollback last migration
make migrate-create name=xxx  # Scaffold new goose migration (never create manually)
make migrate-reset            # Drop + re-apply all

make generate                 # Proto (buf) + sqlc codegen
make proto                    # buf generate for all proto dirs
make tidy                     # go mod tidy both modules
```

Run a single test: `go test -race -v -run TestName ./internal/path/...`

## Architecture

```
User SDK → HTTPS/WS → Control Plane → Connect RPC → Host Agent → HTTP/Connect RPC over TAP → envd (inside VM)
```

**Three binaries, two Go modules:**

| Binary | Module | Entry point | Runs as |
|--------|--------|-------------|---------|
| wrenn-cp | `git.omukk.dev/wrenn/sandbox` | `cmd/control-plane/main.go` | Unprivileged |
| wrenn-agent | `git.omukk.dev/wrenn/sandbox` | `cmd/host-agent/main.go` | Root (NET_ADMIN + /dev/kvm) |
| envd | `git.omukk.dev/wrenn/sandbox/envd` (standalone `envd/go.mod`) | `envd/main.go` | PID 1 inside guest VM |

envd is a **completely independent Go module**. It is never imported by the main module. The only connection is the protobuf contract. It compiles to a static binary baked into rootfs images.

**Key architectural invariant:** The host agent is **stateful** (in-memory `boxes` map is the source of truth for running VMs). The control plane is **stateless** (all persistent state in PostgreSQL). The reconciler (`internal/api/reconciler.go`) bridges the gap: it periodically compares DB records against the host agent's live state and marks orphaned sandboxes as "stopped".

### Control Plane

**Packages:** `internal/api/`, `internal/admin/`, `internal/auth/`, `internal/scheduler/`, `internal/lifecycle/`, `internal/config/`, `internal/db/`

Startup (`cmd/control-plane/main.go`) wires: config (env vars) → pgxpool → `db.Queries` (sqlc-generated) → Connect RPC client to host agent → `api.Server`. Everything flows through constructor injection.

- **API Server** (`internal/api/server.go`): chi router with middleware. Creates handler structs (`sandboxHandler`, `execHandler`, `filesHandler`, etc.) injected with `db.Queries` and the host agent Connect RPC client. Routes under `/v1/sandboxes/*`.
- **Reconciler** (`internal/api/reconciler.go`): background goroutine (every 30s) that compares DB records against `agent.ListSandboxes()` RPC. Marks orphaned DB entries as "stopped".
- **Admin UI** at `/admin/` (htmx + Go html/template, no SPA, no build step)
- **Database**: PostgreSQL via pgx/v5. Queries generated by sqlc from `db/queries/sandboxes.sql`. Migrations in `db/migrations/` (goose, plain SQL).
- **Config** (`internal/config/config.go`): purely environment variables (`DATABASE_URL`, `CP_LISTEN_ADDR`, `CP_HOST_AGENT_ADDR`), no YAML/file config.

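
The reconciler's comparison step amounts to a set difference between the sandbox IDs the database considers running and the IDs the host agent reports as live. The sketch below illustrates only that step; `findOrphans` and its plain string-slice inputs are hypothetical, and the real reconciler would obtain both lists from `db.Queries` and the `ListSandboxes` RPC.

```go
package main

import (
	"fmt"
	"sort"
)

// findOrphans returns the IDs that the database lists as running but that
// the host agent did not report as live. This helper is hypothetical and
// only illustrates the comparison, not the actual reconciler code.
func findOrphans(dbRunning, live []string) []string {
	liveSet := make(map[string]struct{}, len(live))
	for _, id := range live {
		liveSet[id] = struct{}{}
	}
	var orphans []string
	for _, id := range dbRunning {
		if _, ok := liveSet[id]; !ok {
			orphans = append(orphans, id)
		}
	}
	sort.Strings(orphans) // deterministic order for logging and tests
	return orphans
}

func main() {
	db := []string{"sb-00000001", "sb-00000002", "sb-00000003"}
	live := []string{"sb-00000002"}
	// These are the entries a reconciler pass would mark as "stopped".
	fmt.Println(findOrphans(db, live))
}
```

Building a set from the live list keeps each pass linear in the number of sandboxes, which matters little at 30s intervals but keeps the logic easy to test in isolation.
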
### Host Agent

**Packages:** `internal/hostagent/`, `internal/sandbox/`, `internal/vm/`, `internal/network/`, `internal/filesystem/`, `internal/envdclient/`, `internal/snapshot/`

Startup (`cmd/host-agent/main.go`) wires: root check → enable IP forwarding → `sandbox.Manager` (containing `vm.Manager` + `network.SlotAllocator`) → `hostagent.Server` (Connect RPC handler) → HTTP server.

- **RPC Server** (`internal/hostagent/server.go`): implements `hostagentv1connect.HostAgentServiceHandler`. Thin wrapper: every method delegates to `sandbox.Manager`. Maps Connect error codes on return.
- **Sandbox Manager** (`internal/sandbox/manager.go`): the core orchestration layer. Maintains in-memory state in `boxes map[string]*sandboxState` (protected by `sync.RWMutex`). Each `sandboxState` holds a `models.Sandbox`, a `*network.Slot`, and an `*envdclient.Client`. Runs a TTL reaper (every 10s) that auto-destroys timed-out sandboxes.
- **VM Manager** (`internal/vm/manager.go`, `fc.go`, `config.go`): manages Firecracker processes. Uses raw HTTP API over Unix socket (`/tmp/fc-{sandboxID}.sock`), not the firecracker-go-sdk Machine type. Launches Firecracker via `unshare -m` + `ip netns exec`. Configures VM via PUT to `/boot-source`, `/drives/rootfs`, `/network-interfaces/eth0`, `/machine-config`, then starts with PUT `/actions`.
- **Network** (`internal/network/setup.go`, `allocator.go`): per-sandbox network namespace with veth pair + TAP device. See Networking section below.
- **Filesystem** (`internal/filesystem/clone.go`): CoW rootfs clones via `cp --reflink=auto`.
- **envd Client** (`internal/envdclient/client.go`, `health.go`): dual interface to the guest agent. Connect RPC for streaming process exec (`process.Start()` bidirectional stream). Plain HTTP for file operations (POST/GET `/files?path=...&username=root`). Health check polls `GET /health` every 100ms until ready (30s timeout).

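
To make the "raw HTTP API over Unix socket" approach concrete, here is a minimal sketch of an `http.Client` that dials a Unix socket and issues a PUT to `/machine-config`. The helper names (`newUnixClient`, `putJSON`, `configureMachine`) and the sandbox ID in the socket path are hypothetical and not taken from `internal/vm/`; `vcpu_count` and `mem_size_mib` are the Firecracker machine-config field names.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net"
	"net/http"
	"time"
)

// newUnixClient returns an http.Client whose connections always dial the
// given Unix socket; the host part of the request URL is ignored.
func newUnixClient(socketPath string) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", socketPath)
			},
		},
	}
}

// putJSON sends body as JSON with PUT to the given API path.
func putJSON(ctx context.Context, c *http.Client, path string, body any) error {
	buf, err := json.Marshal(body)
	if err != nil {
		return fmt.Errorf("marshal body: %w", err)
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPut, "http://localhost"+path, bytes.NewReader(buf))
	if err != nil {
		return fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.Do(req)
	if err != nil {
		return fmt.Errorf("put %s: %w", path, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("put %s: unexpected status %s", path, resp.Status)
	}
	return nil
}

// configureMachine is a hypothetical helper that sets vCPU count and memory.
func configureMachine(socketPath string, vcpus, memMiB int) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return putJSON(ctx, newUnixClient(socketPath), "/machine-config", map[string]any{
		"vcpu_count":   vcpus,
		"mem_size_mib": memMiB,
	})
}

func main() {
	// Hypothetical sandbox ID; without a running Firecracker this prints a dial error.
	err := configureMachine("/tmp/fc-sb-00000000.sock", 1, 256)
	fmt.Println("machine-config:", err)
}
```

The same `putJSON` helper would cover the other configuration endpoints listed above, since each is a JSON PUT against the same socket.
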
### envd (Guest Agent)

**Module:** `envd/` with its own `go.mod` (`git.omukk.dev/wrenn/sandbox/envd`)

Runs as PID 1 inside the microVM via `wrenn-init.sh` (mounts procfs/sysfs/dev, sets hostname, writes resolv.conf, then execs envd). Extracted from E2B (Apache 2.0), with shared packages internalized into `envd/internal/shared/`. Listens on TCP `0.0.0.0:49983`.

- **ProcessService**: start processes, stream stdout/stderr, signal handling, PTY support
- **FilesystemService**: stat/list/mkdir/move/remove/watch files
- **Health**: GET `/health`

### Networking (per sandbox)

Each sandbox gets its own Linux network namespace (`ns-{idx}`). Slot index (1-based, up to 65534) determines all addressing:

```
Host Namespace                    Namespace "ns-{idx}"                 Guest VM
──────────────────────────────────────────────────────────────────────────────────────
veth-{idx}  ←──── veth pair ────→ eth0
10.12.0.{idx*2}/31                10.12.0.{idx*2+1}/31
                                    │
                                  tap0 (169.254.0.22/30) ←── TAP ──→   eth0 (169.254.0.21)
                                                                         ↑ kernel ip= boot arg
```

- **Host-reachable IP**: `10.11.0.{idx}/32`, routed through veth to namespace, DNAT'd to guest
- **Outbound NAT**: guest (169.254.0.21) → SNAT to vpeerIP inside namespace → MASQUERADE on host to default interface
- **Inbound NAT**: host traffic to 10.11.0.{idx} → DNAT to 169.254.0.21 inside namespace
- IP forwarding enabled inside each namespace
- All details in `internal/network/setup.go`

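
The slot-to-address mapping can be sketched as plain integer arithmetic on a base address. The code below assumes that the `{idx*2}` and `{idx}` notation carries into the third octet once the value exceeds 255 (the text does not say this explicitly), and the names `addrAt`, `slotAddrs`, and `slotAddresses` are hypothetical. `internal/network/setup.go` remains the authority on the real scheme.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
)

// addrAt returns base + offset, treating the IPv4 address as a 32-bit integer.
func addrAt(base netip.Addr, offset uint32) netip.Addr {
	b := base.As4()
	n := binary.BigEndian.Uint32(b[:]) + offset
	binary.BigEndian.PutUint32(b[:], n)
	return netip.AddrFrom4(b)
}

// slotAddrs groups the addresses derived from one slot index.
type slotAddrs struct {
	Veth  netip.Addr // host side of the veth pair
	Vpeer netip.Addr // namespace side of the veth pair
	Host  netip.Addr // host-reachable IP, DNAT'd to the guest
}

// slotAddresses derives all three addresses from the slot index.
// Assumption: offsets carry across octets (idx 300 → 10.11.1.44).
func slotAddresses(idx uint32) slotAddrs {
	vethBase := netip.MustParseAddr("10.12.0.0")
	hostBase := netip.MustParseAddr("10.11.0.0")
	return slotAddrs{
		Veth:  addrAt(vethBase, idx*2),
		Vpeer: addrAt(vethBase, idx*2+1),
		Host:  addrAt(hostBase, idx),
	}
}

func main() {
	for _, idx := range []uint32{1, 300} {
		a := slotAddresses(idx)
		fmt.Printf("slot %d: veth=%s/31 vpeer=%s/31 host=%s/32\n", idx, a.Veth, a.Vpeer, a.Host)
	}
}
```

Under this carry assumption, slot indices above 32767 would push the veth pair past 10.12.255.255, so the actual bounds are worth confirming against the allocator before relying on the full 65534 range.
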
### Sandbox State Machine

```
PENDING → STARTING → RUNNING → PAUSED → HIBERNATED
                        │         │
                        ↓         ↓
                     STOPPED   STOPPED → (destroyed)

Any state → ERROR (on crash/failure)
PAUSED → RUNNING (warm snapshot resume)
HIBERNATED → RUNNING (cold snapshot resume, slower)
```

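
One way to encode these transitions is a small lookup table, which also suits the table-driven tests the project prefers. This is a sketch, not the repository's implementation: the type and function names are hypothetical, and it assumes the two STOPPED arrows in the diagram leave RUNNING and PAUSED.

```go
package main

import "fmt"

// State is a sandbox lifecycle state.
type State string

const (
	Pending    State = "pending"
	Starting   State = "starting"
	Running    State = "running"
	Paused     State = "paused"
	Hibernated State = "hibernated"
	Stopped    State = "stopped"
	Error      State = "error"
)

// transitions lists the allowed non-error moves.
// Assumption: STOPPED is reachable from RUNNING and PAUSED only.
var transitions = map[State][]State{
	Pending:    {Starting},
	Starting:   {Running},
	Running:    {Paused, Stopped},
	Paused:     {Hibernated, Running, Stopped},
	Hibernated: {Running},
}

// canTransition reports whether moving from one state to another is allowed.
func canTransition(from, to State) bool {
	if to == Error {
		return true // any state may fail into ERROR
	}
	for _, s := range transitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(Paused, Running))  // warm snapshot resume
	fmt.Println(canTransition(Pending, Running)) // skips STARTING
}
```
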
### Key Request Flows

**Sandbox creation** (`POST /v1/sandboxes`):
1. API handler generates sandbox ID, inserts into DB as "pending"
2. RPC `CreateSandbox` → host agent → `sandbox.Manager.Create()`
3. Manager: resolve base rootfs → `cp --reflink` clone → allocate network slot → `CreateNetwork()` (netns + veth + tap + NAT) → `vm.Create()` (start Firecracker, configure via HTTP API, boot) → `envdclient.WaitUntilReady()` (poll /health) → store in-memory state
4. API handler updates DB to "running" with host_ip

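
The readiness wait in step 3 is a poll-until-healthy loop bounded by a context deadline. The sketch below shows one way to write such a loop; it is not the code in `internal/envdclient/health.go`. It assumes any 2xx response means ready, `waitUntilReady` and `probe` are hypothetical names, and `probe` substitutes a local test server for envd with a 5ms interval instead of the 100ms interval and 30s timeout described earlier.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// waitUntilReady polls url every interval until it answers with a 2xx
// status or ctx expires. Treating any 2xx as ready is an assumption.
func waitUntilReady(ctx context.Context, client *http.Client, url string, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return fmt.Errorf("build health request: %w", err)
		}
		resp, err := client.Do(req)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode >= 200 && resp.StatusCode < 300 {
				return nil
			}
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("wait for envd: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

// probe runs waitUntilReady against a stand-in server that fails the first
// `failures` requests, and returns how many requests were made in total.
func probe(failures int32) (int32, error) {
	var attempts atomic.Int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		if attempts.Add(1) <= failures {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	err := waitUntilReady(ctx, srv.Client(), srv.URL+"/health", 5*time.Millisecond)
	return attempts.Load(), err
}

func main() {
	n, err := probe(3)
	fmt.Println("attempts:", n, "err:", err)
}
```
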
**Command execution** (`POST /v1/sandboxes/{id}/exec`):
1. API handler verifies sandbox is "running" in DB
2. RPC `Exec` → host agent → `sandbox.Manager.Exec()` → `envdclient.Exec()`
3. envd client opens bidirectional Connect RPC stream (`process.Start`), collects stdout/stderr/exit_code
4. API handler checks UTF-8 validity (base64-encodes if binary), updates last_active_at, returns result

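
Step 4's encoding decision can be sketched with the standard library alone: output that is valid UTF-8 is returned as text, anything else is base64-encoded. The function name `encodeOutput` and the `"utf-8"` / `"base64"` labels are illustrative assumptions, not the handler's actual response fields.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"unicode/utf8"
)

// encodeOutput returns raw as a string when it is valid UTF-8, otherwise
// its standard base64 encoding. The second return value names the encoding
// used (hypothetical labels).
func encodeOutput(raw []byte) (string, string) {
	if utf8.Valid(raw) {
		return string(raw), "utf-8"
	}
	return base64.StdEncoding.EncodeToString(raw), "base64"
}

func main() {
	text, enc := encodeOutput([]byte("hello\n"))
	fmt.Printf("%q %s\n", text, enc)

	bin, enc := encodeOutput([]byte{0xff, 0xfe})
	fmt.Printf("%q %s\n", bin, enc)
}
```
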
**Streaming exec** (`WS /v1/sandboxes/{id}/exec/stream`):
1. WebSocket upgrade, read first message for cmd/args
2. RPC `ExecStream` → host agent → `sandbox.Manager.ExecStream()` → `envdclient.ExecStream()`
3. envd client returns a channel of events; host agent forwards events through the RPC stream
4. API handler forwards stream events to WebSocket as JSON messages (`{type: "stdout"|"stderr"|"exit", ...}`)

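
A possible shape for the WebSocket messages in step 4 is shown below. Only the `type` field and its three values come from the description above; the `data` and `exit_code` field names are assumptions made for illustration, as are the `streamEvent` and `marshalEvent` names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// streamEvent is a hypothetical JSON shape for one forwarded event.
type streamEvent struct {
	Type     string `json:"type"`                // "stdout", "stderr", or "exit"
	Data     string `json:"data,omitempty"`      // assumed field for output text
	ExitCode *int   `json:"exit_code,omitempty"` // assumed field, set on "exit" only
}

// marshalEvent encodes one event as a JSON text message.
func marshalEvent(e streamEvent) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", fmt.Errorf("marshal stream event: %w", err)
	}
	return string(b), nil
}

func main() {
	code := 0
	events := []streamEvent{
		{Type: "stdout", Data: "hi"},
		{Type: "exit", ExitCode: &code},
	}
	for _, e := range events {
		msg, err := marshalEvent(e)
		fmt.Println(msg, err)
	}
}
```
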
**File transfer**: Write uses multipart POST to envd `/files`; read uses GET. Streaming variants chunk in 64KB pieces through the RPC stream.

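
The chunking used by the streaming variants can be sketched as a read loop with a fixed-size buffer. This sketch assumes "64KB" means 64 KiB (65536 bytes); `streamChunks` and `chunkSizes` are hypothetical names, and the `send` callback stands in for writing to the RPC stream.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// chunkSize assumes "64KB" means 64 KiB.
const chunkSize = 64 * 1024

// streamChunks reads r in pieces of at most chunkSize bytes and hands each
// piece to send. The buffer is reused, so send must not retain the slice.
func streamChunks(r io.Reader, send func([]byte) error) error {
	buf := make([]byte, chunkSize)
	for {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			if sendErr := send(buf[:n]); sendErr != nil {
				return fmt.Errorf("send chunk: %w", sendErr)
			}
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return nil // input exhausted; a short final chunk was already sent
		}
		if err != nil {
			return fmt.Errorf("read chunk: %w", err)
		}
	}
}

// chunkSizes reports the size of each chunk produced for data.
func chunkSizes(data []byte) []int {
	var sizes []int
	_ = streamChunks(bytes.NewReader(data), func(p []byte) error {
		sizes = append(sizes, len(p))
		return nil
	})
	return sizes
}

func main() {
	fmt.Println(chunkSizes(make([]byte, 150*1024)))
}
```
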
## REST API

Routes defined in `internal/api/server.go`, handlers in `internal/api/handlers_*.go`. OpenAPI spec embedded via `//go:embed` and served at `/openapi.yaml` (Swagger UI at `/docs`). JSON request/response. API key auth via `X-API-Key` header. Error responses: `{"error": {"code": "...", "message": "..."}}`.

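
A handler helper that produces the error envelope above might look like the following sketch. The type and function names, the `not_found` code, and the message text are hypothetical; only the JSON shape is taken from the description.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// apiError and errorEnvelope mirror {"error": {"code": ..., "message": ...}}.
type apiError struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

type errorEnvelope struct {
	Error apiError `json:"error"`
}

// writeError writes the JSON error envelope with the given HTTP status.
func writeError(w http.ResponseWriter, status int, code, msg string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	// An encode failure here can only be a write error; there is nothing
	// useful left to send to the client at that point.
	_ = json.NewEncoder(w).Encode(errorEnvelope{Error: apiError{Code: code, Message: msg}})
}

// renderError runs writeError against an in-memory recorder so the output
// can be inspected without starting a server.
func renderError(status int, code, msg string) (int, string) {
	rec := httptest.NewRecorder()
	writeError(rec, status, code, msg)
	return rec.Code, rec.Body.String()
}

func main() {
	status, body := renderError(http.StatusNotFound, "not_found", "sandbox not found")
	fmt.Print(status, " ", body)
}
```

Centralizing the envelope in one helper keeps every handler's failure path consistent with the OpenAPI spec.
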
## Code Generation

### Proto (Connect RPC)

Proto source of truth is `proto/envd/*.proto` and `proto/hostagent/*.proto`. Run `make proto` to regenerate. Three `buf.gen.yaml` files control output:

| buf.gen.yaml location | Generates to | Used by |
|---|---|---|
| `proto/envd/buf.gen.yaml` | `proto/envd/gen/` | Main module (host agent's envd client) |
| `proto/hostagent/buf.gen.yaml` | `proto/hostagent/gen/` | Main module (control plane ↔ host agent) |
| `envd/spec/buf.gen.yaml` | `envd/internal/services/spec/` | envd module (guest agent server) |

The envd `buf.gen.yaml` reads from `../../proto/envd/` (same source protos) but generates into envd's own module. This means the same `.proto` files produce two independent sets of Go stubs, one for each Go module.

To add a new RPC method: edit the `.proto` file → `make proto` → implement the handler on both sides.

### sqlc

Config: `sqlc.yaml` (project root). Reads queries from `db/queries/*.sql`, reads schema from `db/migrations/`, outputs to `internal/db/`.

To add a new query: add it to the appropriate `.sql` file in `db/queries/` → `make generate` → use the new method on `*db.Queries`.

## Key Technical Decisions

- **Connect RPC** (not gRPC) for all RPC communication between components
- **Buf + protoc-gen-connect-go** for code generation (not protoc-gen-go-grpc)
- **Raw Firecracker HTTP API** via Unix socket (not firecracker-go-sdk Machine type)
- **TAP networking** (not vsock) for host-to-envd communication
- **PostgreSQL** via pgx/v5 + sqlc (type-safe query generation). Goose for migrations (plain SQL, up/down)
- **Admin UI**: htmx + Go html/template + chi router. No SPA, no React, no build step
- **Lago** for billing (external service, not in this codebase)

## Coding Conventions

- **Go style**: `gofmt`, `go vet`, `context.Context` everywhere, errors wrapped with `fmt.Errorf("action: %w", err)`, `slog` for logging, no global state
- **Naming**: Sandbox IDs `sb-` + 8 hex, API keys `wrn_` + 32 chars, Host IDs `host-` + 8 hex
- **Dependencies**: Use `go get` to add deps, never hand-edit go.mod. For envd deps: `cd envd && go get ...` (separate module)
- **Generated code**: Always commit generated code (proto stubs, sqlc). Never add generated code to .gitignore
- **Migrations**: Always use `make migrate-create name=xxx`, never create migration files manually
- **Testing**: Table-driven tests for handlers and state machine transitions

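
As an illustration of the naming and error-wrapping conventions together, the sketch below generates an ID of the form `sb-` + 8 hex characters and wraps the only possible failure with `%w`. The function name `newSandboxID` and the use of `crypto/rand` as the randomness source are assumptions; the repository may generate IDs differently.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newSandboxID returns "sb-" followed by 8 lowercase hex characters
// (4 random bytes). Hypothetical helper, not the repository's generator.
func newSandboxID() (string, error) {
	var b [4]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", fmt.Errorf("generate sandbox id: %w", err)
	}
	return "sb-" + hex.EncodeToString(b[:]), nil
}

func main() {
	id, err := newSandboxID()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(id)
}
```
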
### Two-module gotcha

The main module (`go.mod`) and envd (`envd/go.mod`) are fully independent. `make tidy`, `make fmt`, `make vet` already operate on both. But when adding dependencies manually, remember to target the correct module (`cd envd && go get ...` for envd deps). `make proto` also generates stubs for both modules from the same proto sources.

## Rootfs & Guest Init

- **wrenn-init** (`images/wrenn-init.sh`): the PID 1 init script baked into every rootfs. Mounts virtual filesystems, sets hostname, writes `/etc/resolv.conf`, then execs envd.
- **Updating the rootfs** after changing envd or wrenn-init: `bash scripts/update-debug-rootfs.sh [rootfs_path]`. This builds envd via `make build-envd`, mounts the rootfs image, copies in the new binaries, and unmounts. Defaults to `/var/lib/wrenn/images/minimal.ext4`.
- Rootfs images are minimal debootstrap: no systemd, no coreutils beyond busybox. Use `/bin/sh -c` for shell builtins inside the guest.

## Fixed Paths (on host machine)

- Kernel: `/var/lib/wrenn/kernels/vmlinux`
- Base rootfs images: `/var/lib/wrenn/images/{template}.ext4`
- Sandbox clones: `/var/lib/wrenn/sandboxes/`
- Firecracker: `/usr/local/bin/firecracker` (e2b's fork of firecracker)

## Web UI Styling

**Wrenn brand:**
Warm earthy developer tool with crafted organic character.

**Color palette (light/dark):**

- Background scale: #f8f6f1 → #f1eeea → #e8e5e0 → #dedbd5 (light); #090b0a → #0f1211 → #151918 → #1b201e → #222826 (dark)
- Text hierarchy: bright #2c2a26 / body #4a4740 / dim #7a766e / faint #a09b93 (light); #e8e5df / #c8c4bc / #8a867f / #5f5c57 (dark)
- Sage green brand accent: #5e8c58 (light) / #89a785 (dark), with glow variant rgba(94,140,88,0.08)
- Borders: #e2dfd9 (light) / #262c2a (dark)
- Semantic status colors: amber #9e7c2e (warning/building), red #b35544 (error/failed), blue #3d7aac (info/stopped), each with a color-dim transparent bg variant for badge backgrounds
- Destructive: #b35544 light / #c27b6d dark

**Typography:**
Four fonts. Manrope (variable, weights 300–700) for all UI labels, nav, body. Instrument Serif (400) for page titles, empty-state headings, large metric values. JetBrains Mono (400/500) for code, env var keys/values, deployment IDs, commit SHAs, log viewer, URL paths. Alice for the sidebar wordmark only. Base body size 14px. Headings: h1 24px serif, h2 20px, h3 18px, h4–h6 11px sans-serif uppercase wide-tracked. Metric card values 34px serif at `letter-spacing: -0.08em`. Section labels at 0.06–0.07em tracking, weight 550–600.

**Spacing:** 4px base unit (Tailwind scale). Page content p-8 (32px). Cards p-4–p-5. Sidebar nav items 7px 10px. Consistent, moderate density: functional but not cramped.

**Borders & depth:** Flat aesthetic: `--shadow-sm: 0 0 #0000`, no drop shadows. Depth is achieved through background color stepping (bg → bg-3 → bg-4 → bg-5), not shadows. Borders 1px solid in warm muted tones. Corner radii: cards/surfaces 12px, inputs/small buttons 6–8px, avatars 8px, dots 50%.

**Components:** Active sidebar nav items use a 3px left-border in sage green rather than filled backgrounds, with a sage glow bg (rgba(94,140,88,0.08)). Focus rings are double-ring: 0 0 0 2px background, 0 0 0 4px ring. Status system has four states (Live/sage, Building/amber+pulse, Failed/red, Stopped/faint) each with solid dot + transparent-bg badge pair. Buttons follow ghost → outline → filled hierarchy. Tables wrapped in rounded-xl border. Dialogs via native `<dialog>`. Toasts bottom-anchored.

**Animation:** Crisp 150ms transitions on all interactive elements. Sidebar width 250ms ease. Custom wrenn-pulse keyframe (2.5s ease infinite box-shadow bloom) on live/building status dots. Top-of-page loading bar (h-0.5, sage green) on navigation.

**Dark mode:** Full support. Very dark near-black-green backgrounds with warm off-white text and desaturated sage accent. Flat (no card shadows). System preference detection + localStorage persistence.

**Overall feel:** Warm, earthy, semi-flat. Avoids cold grays entirely; the palette leans slightly warm/brown-tinted throughout. The serif + mono + geometric sans type stack gives a designed but unfussy developer-tool character. Organic and considered, not sterile.