Files
sandbox/CLAUDE.md
pptx704 a1bd439c75 Add sandbox snapshot and restore with UFFD lazy memory loading
Implement full snapshot lifecycle: pause (snapshot + free resources),
resume (UFFD-based lazy restore), and named snapshot templates that
can spawn new sandboxes from frozen VM state.

Key changes:
- Snapshot header system with generational diff mapping (inspired by e2b)
- UFFD server for lazy page fault handling during snapshot restore
- Stable rootfs symlink path (/tmp/fc-vm/) for snapshot compatibility
- Templates DB table and CRUD API endpoints (POST/GET/DELETE /v1/snapshots)
- CreateSnapshot/DeleteSnapshot RPCs in hostagent proto
- Reconciler excludes paused sandboxes (expected absent from host agent)
- Snapshot templates lock vcpus/memory to baked-in values
- Proper cleanup of uffd sockets and pause snapshot files on destroy
2026-03-12 09:19:37 +06:00

235 lines
16 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Wrenn Sandbox is a microVM-based code execution platform. Users create isolated sandboxes (Firecracker microVMs), run code inside them, and get output back via SDKs. Think E2B but with persistent sandboxes, pool-based pricing, and a single-binary deployment story.
## Build & Development Commands
All commands go through the Makefile. Never use raw `go build` or `go run`.
```bash
make build # Build all binaries → builds/
make build-cp # Control plane only
make build-agent # Host agent only
make build-envd # envd static binary (verified statically linked)
make dev # Full local dev: infra + migrate + control plane
make dev-infra # Start PostgreSQL + Prometheus + Grafana (Docker)
make dev-down # Stop dev infra
make dev-cp # Control plane with hot reload (if air installed)
make dev-agent # Host agent (sudo required)
make dev-envd # envd in TCP debug mode
make check # fmt + vet + lint + test (CI order)
make test # Unit tests: go test -race -v ./internal/...
make test-integration # Integration tests (require host agent + Firecracker)
make fmt # gofmt both modules
make vet # go vet both modules
make lint # golangci-lint
make migrate-up # Apply pending migrations
make migrate-down # Rollback last migration
make migrate-create name=xxx # Scaffold new goose migration (never create manually)
make migrate-reset # Drop + re-apply all
make generate # Proto (buf) + sqlc codegen
make proto # buf generate for all proto dirs
make tidy # go mod tidy both modules
```
Run a single test: `go test -race -v -run TestName ./internal/path/...`
## Architecture
```
User SDK → HTTPS/WS → Control Plane → Connect RPC → Host Agent → HTTP/Connect RPC over TAP → envd (inside VM)
```
**Three binaries, two Go modules:**
| Binary | Module | Entry point | Runs as |
|--------|--------|-------------|---------|
| wrenn-cp | `git.omukk.dev/wrenn/sandbox` | `cmd/control-plane/main.go` | Unprivileged |
| wrenn-agent | `git.omukk.dev/wrenn/sandbox` | `cmd/host-agent/main.go` | Root (NET_ADMIN + /dev/kvm) |
| envd | `git.omukk.dev/wrenn/sandbox/envd` (standalone `envd/go.mod`) | `envd/main.go` | PID 1 inside guest VM |
envd is a **completely independent Go module**. It is never imported by the main module. The only connection is the protobuf contract. It compiles to a static binary baked into rootfs images.
**Key architectural invariant:** The host agent is **stateful** (in-memory `boxes` map is the source of truth for running VMs). The control plane is **stateless** (all persistent state in PostgreSQL). The reconciler (`internal/api/reconciler.go`) bridges the gap — it periodically compares DB records against the host agent's live state and marks orphaned sandboxes as "stopped".
### Control Plane
**Packages:** `internal/api/`, `internal/admin/`, `internal/auth/`, `internal/scheduler/`, `internal/lifecycle/`, `internal/config/`, `internal/db/`
Startup (`cmd/control-plane/main.go`) wires: config (env vars) → pgxpool → `db.Queries` (sqlc-generated) → Connect RPC client to host agent → `api.Server`. Everything flows through constructor injection.
- **API Server** (`internal/api/server.go`): chi router with middleware. Creates handler structs (`sandboxHandler`, `execHandler`, `filesHandler`, etc.) injected with `db.Queries` and the host agent Connect RPC client. Routes under `/v1/sandboxes/*`.
- **Reconciler** (`internal/api/reconciler.go`): background goroutine (every 30s) that compares DB records against `agent.ListSandboxes()` RPC. Marks orphaned DB entries as "stopped".
- **Admin UI** at `/admin/` (htmx + Go html/template, no SPA, no build step)
- **Database**: PostgreSQL via pgx/v5. Queries generated by sqlc from `db/queries/sandboxes.sql`. Migrations in `db/migrations/` (goose, plain SQL).
- **Config** (`internal/config/config.go`): purely environment variables (`DATABASE_URL`, `CP_LISTEN_ADDR`, `CP_HOST_AGENT_ADDR`), no YAML/file config.
### Host Agent
**Packages:** `internal/hostagent/`, `internal/sandbox/`, `internal/vm/`, `internal/network/`, `internal/filesystem/`, `internal/envdclient/`, `internal/snapshot/`
Startup (`cmd/host-agent/main.go`) wires: root check → enable IP forwarding → `sandbox.Manager` (containing `vm.Manager` + `network.SlotAllocator`) → `hostagent.Server` (Connect RPC handler) → HTTP server.
- **RPC Server** (`internal/hostagent/server.go`): implements `hostagentv1connect.HostAgentServiceHandler`. Thin wrapper — every method delegates to `sandbox.Manager`. Maps Connect error codes on return.
- **Sandbox Manager** (`internal/sandbox/manager.go`): the core orchestration layer. Maintains in-memory state in `boxes map[string]*sandboxState` (protected by `sync.RWMutex`). Each `sandboxState` holds a `models.Sandbox`, a `*network.Slot`, and an `*envdclient.Client`. Runs a TTL reaper (every 10s) that auto-destroys timed-out sandboxes.
- **VM Manager** (`internal/vm/manager.go`, `fc.go`, `config.go`): manages Firecracker processes. Uses raw HTTP API over Unix socket (`/tmp/fc-{sandboxID}.sock`), not the firecracker-go-sdk Machine type. Launches Firecracker via `unshare -m` + `ip netns exec`. Configures VM via PUT to `/boot-source`, `/drives/rootfs`, `/network-interfaces/eth0`, `/machine-config`, then starts with PUT `/actions`.
- **Network** (`internal/network/setup.go`, `allocator.go`): per-sandbox network namespace with veth pair + TAP device. See Networking section below.
- **Filesystem** (`internal/filesystem/clone.go`): CoW rootfs clones via `cp --reflink=auto`.
- **envd Client** (`internal/envdclient/client.go`, `health.go`): dual interface to the guest agent. Connect RPC for streaming process exec (`process.Start()` bidirectional stream). Plain HTTP for file operations (POST/GET `/files?path=...&username=root`). Health check polls `GET /health` every 100ms until ready (30s timeout).
### envd (Guest Agent)
**Module:** `envd/` with its own `go.mod` (`git.omukk.dev/wrenn/sandbox/envd`)
Runs as PID 1 inside the microVM via `wrenn-init.sh` (mounts procfs/sysfs/dev, sets hostname, writes resolv.conf, then execs envd). Extracted from E2B (Apache 2.0), with shared packages internalized into `envd/internal/shared/`. Listens on TCP `0.0.0.0:49983`.
- **ProcessService**: start processes, stream stdout/stderr, signal handling, PTY support
- **FilesystemService**: stat/list/mkdir/move/remove/watch files
- **Health**: GET `/health`
### Networking (per sandbox)
Each sandbox gets its own Linux network namespace (`ns-{idx}`). Slot index (1-based, up to 65534) determines all addressing:
```
Host Namespace Namespace "ns-{idx}" Guest VM
──────────────────────────────────────────────────────────────────────────────────────
veth-{idx} ←──── veth pair ────→ eth0
10.12.0.{idx*2}/31 10.12.0.{idx*2+1}/31
tap0 (169.254.0.22/30) ←── TAP ──→ eth0 (169.254.0.21)
↑ kernel ip= boot arg
```
- **Host-reachable IP**: `10.11.0.{idx}/32` — routed through veth to namespace, DNAT'd to guest
- **Outbound NAT**: guest (169.254.0.21) → SNAT to vpeerIP inside namespace → MASQUERADE on host to default interface
- **Inbound NAT**: host traffic to 10.11.0.{idx} → DNAT to 169.254.0.21 inside namespace
- IP forwarding enabled inside each namespace
- All details in `internal/network/setup.go`
### Sandbox State Machine
```
PENDING → STARTING → RUNNING → PAUSED → HIBERNATED
│ │
↓ ↓
STOPPED STOPPED → (destroyed)
Any state → ERROR (on crash/failure)
PAUSED → RUNNING (warm snapshot resume)
HIBERNATED → RUNNING (cold snapshot resume, slower)
```
### Key Request Flows
**Sandbox creation** (`POST /v1/sandboxes`):
1. API handler generates sandbox ID, inserts into DB as "pending"
2. RPC `CreateSandbox` → host agent → `sandbox.Manager.Create()`
3. Manager: resolve base rootfs → `cp --reflink` clone → allocate network slot → `CreateNetwork()` (netns + veth + tap + NAT) → `vm.Create()` (start Firecracker, configure via HTTP API, boot) → `envdclient.WaitUntilReady()` (poll /health) → store in-memory state
4. API handler updates DB to "running" with host_ip
**Command execution** (`POST /v1/sandboxes/{id}/exec`):
1. API handler verifies sandbox is "running" in DB
2. RPC `Exec` → host agent → `sandbox.Manager.Exec()``envdclient.Exec()`
3. envd client opens bidirectional Connect RPC stream (`process.Start`), collects stdout/stderr/exit_code
4. API handler checks UTF-8 validity (base64-encodes if binary), updates last_active_at, returns result
**Streaming exec** (`WS /v1/sandboxes/{id}/exec/stream`):
1. WebSocket upgrade, read first message for cmd/args
2. RPC `ExecStream` → host agent → `sandbox.Manager.ExecStream()``envdclient.ExecStream()`
3. envd client returns a channel of events; host agent forwards events through the RPC stream
4. API handler forwards stream events to WebSocket as JSON messages (`{type: "stdout"|"stderr"|"exit", ...}`)
**File transfer**: Write uses multipart POST to envd `/files`; read uses GET. Streaming variants chunk in 64KB pieces through the RPC stream.
## REST API
Routes defined in `internal/api/server.go`, handlers in `internal/api/handlers_*.go`. OpenAPI spec embedded via `//go:embed` and served at `/openapi.yaml` (Swagger UI at `/docs`). JSON request/response. API key auth via `X-API-Key` header. Error responses: `{"error": {"code": "...", "message": "..."}}`.
## Code Generation
### Proto (Connect RPC)
Proto source of truth is `proto/envd/*.proto` and `proto/hostagent/*.proto`. Run `make proto` to regenerate. Three `buf.gen.yaml` files control output:
| buf.gen.yaml location | Generates to | Used by |
|---|---|---|
| `proto/envd/buf.gen.yaml` | `proto/envd/gen/` | Main module (host agent's envd client) |
| `proto/hostagent/buf.gen.yaml` | `proto/hostagent/gen/` | Main module (control plane ↔ host agent) |
| `envd/spec/buf.gen.yaml` | `envd/internal/services/spec/` | envd module (guest agent server) |
The envd `buf.gen.yaml` reads from `../../proto/envd/` (same source protos) but generates into envd's own module. This means the same `.proto` files produce two independent sets of Go stubs — one for each Go module.
To add a new RPC method: edit the `.proto` file → `make proto` → implement the handler on both sides.
### sqlc
Config: `sqlc.yaml` (project root). Reads queries from `db/queries/*.sql`, reads schema from `db/migrations/`, outputs to `internal/db/`.
To add a new query: add it to the appropriate `.sql` file in `db/queries/``make generate` → use the new method on `*db.Queries`.
## Key Technical Decisions
- **Connect RPC** (not gRPC) for all RPC communication between components
- **Buf + protoc-gen-connect-go** for code generation (not protoc-gen-go-grpc)
- **Raw Firecracker HTTP API** via Unix socket (not firecracker-go-sdk Machine type)
- **TAP networking** (not vsock) for host-to-envd communication
- **PostgreSQL** via pgx/v5 + sqlc (type-safe query generation). Goose for migrations (plain SQL, up/down)
- **Admin UI**: htmx + Go html/template + chi router. No SPA, no React, no build step
- **Lago** for billing (external service, not in this codebase)
## Coding Conventions
- **Go style**: `gofmt`, `go vet`, `context.Context` everywhere, errors wrapped with `fmt.Errorf("action: %w", err)`, `slog` for logging, no global state
- **Naming**: Sandbox IDs `sb-` + 8 hex, API keys `wrn_` + 32 chars, Host IDs `host-` + 8 hex
- **Dependencies**: Use `go get` to add deps, never hand-edit go.mod. For envd deps: `cd envd && go get ...` (separate module)
- **Generated code**: Always commit generated code (proto stubs, sqlc). Never add generated code to .gitignore
- **Migrations**: Always use `make migrate-create name=xxx`, never create migration files manually
- **Testing**: Table-driven tests for handlers and state machine transitions
### Two-module gotcha
The main module (`go.mod`) and envd (`envd/go.mod`) are fully independent. `make tidy`, `make fmt`, `make vet` already operate on both. But when adding dependencies manually, remember to target the correct module (`cd envd && go get ...` for envd deps). `make proto` also generates stubs for both modules from the same proto sources.
## Rootfs & Guest Init
- **wrenn-init** (`images/wrenn-init.sh`): the PID 1 init script baked into every rootfs. Mounts virtual filesystems, sets hostname, writes `/etc/resolv.conf`, then execs envd.
- **Updating the rootfs** after changing envd or wrenn-init: `bash scripts/update-debug-rootfs.sh [rootfs_path]`. This builds envd via `make build-envd`, mounts the rootfs image, copies in the new binaries, and unmounts. Defaults to `/var/lib/wrenn/images/minimal.ext4`.
- Rootfs images are minimal debootstrap — no systemd, no coreutils beyond busybox. Use `/bin/sh -c` for shell builtins inside the guest.
## Fixed Paths (on host machine)
- Kernel: `/var/lib/wrenn/kernels/vmlinux`
- Base rootfs images: `/var/lib/wrenn/images/{template}.ext4`
- Sandbox clones: `/var/lib/wrenn/sandboxes/`
- Firecracker: `/usr/local/bin/firecracker` (e2b's fork of firecracker)
## Web UI Styling
**Wrenn brand:**
Warm earthy developer tool with crafted organic character.
**Color palette (light/dark):**
Background scale: #f8f6f1#f1eeea#e8e5e0#dedbd5 (light); #090b0a#0f1211#151918#1b201e#222826 (dark). Text hierarchy: bright #2c2a26 / body #4a4740 / dim #7a766e / faint #a09b93 (light); #e8e5df / #c8c4bc / #8a867f / #5f5c57 (dark). Sage green brand accent: #5e8c58 (light) / #89a785 (dark), with glow variant rgba(94,140,88,0.08). Borders: #e2dfd9 (light) / #262c2a (dark). Semantic status colors: amber #9e7c2e (warning/building), red #b35544 (error/failed), blue #3d7aac (info/stopped) — each with a color-dim transparent bg variant for badge backgrounds. Destructive: #b35544 light / #c27b6d dark.
**Typography:**
Four fonts. Manrope (variable, weights 300700) for all UI labels, nav, body. Instrument Serif (400) for page titles, empty-state headings, large metric values. JetBrains Mono (400/500) for code, env var keys/values, deployment IDs, commit SHAs, log viewer, URL paths. Alice for the sidebar wordmark only. Base body size 14px. Headings: h1 24px serif, h2 20px, h3 18px, h4h6 11px sans-serif uppercase wide-tracked. Metric card values 34px serif at letter-spacing: -0.08em. Section labels at 0.060.07em tracking, weight 550600.
Spacing: 4px base unit (Tailwind scale). Page content p-8 (32px). Cards p-4p-5. Sidebar nav items 7px 10px. Consistent, moderate density — functional but not cramped.
**Borders & depth:** Flat aesthetic — --shadow-sm: 0 0 #0000, no drop shadows. Depth is achieved through background color stepping (bg → bg-3 → bg-4 → bg-5), not shadows. Borders 1px solid in warm muted tones. Corner radii: cards/surfaces 12px, inputs/small buttons 68px, avatars 8px, dots 50%.
**Components:** Active sidebar nav items use a 3px left-border in sage green rather than filled backgrounds, with a sage glow bg (rgba(94,140,88,0.08)). Focus rings are double-ring: 0 0 0 2px background, 0 0 0 4px ring. Status system has four states (Live/sage, Building/amber+pulse, Failed/red, Stopped/faint) each with solid dot + transparent-bg badge pair. Buttons follow ghost → outline → filled hierarchy. Tables wrapped in rounded-xl border. Dialogs via native <dialog>. Toasts bottom-anchored.
**Animation:** Crisp 150ms transitions on all interactive elements. Sidebar width 250ms ease. Custom wrenn-pulse keyframe (2.5s ease infinite box-shadow bloom) on live/building status dots. Top-of-page loading bar (h-0.5, sage green) on navigation.
**Dark mode:** Full support. Very dark near-black-green backgrounds with warm off-white text and desaturated sage accent. Flat (no card shadows). System preference detection + localStorage persistence.
**Overall feel:** Warm, earthy, semi-flat. Avoids cold grays entirely — palette leans slightly warm/brown-tinted throughout. The serif + mono + geometric sans type stack gives a designed but unfussy developer-tool character. Organic and considered, not sterile.