diff --git a/.env.example b/.env.example index b9075b2..7446591 100644 --- a/.env.example +++ b/.env.example @@ -1,3 +1,7 @@ +# Shared (applies to both control plane and host agent) +WRENN_DIR=/var/lib/wrenn +LOG_LEVEL=info + # Database DATABASE_URL=postgres://wrenn:wrenn@localhost:5432/wrenn?sslmode=disable @@ -9,7 +13,6 @@ WRENN_CP_LISTEN_ADDR=:9725 # Host Agent WRENN_HOST_LISTEN_ADDR=:50051 -WRENN_DIR=/var/lib/wrenn WRENN_HOST_INTERFACE=eth0 WRENN_CP_URL=http://localhost:9725 WRENN_DEFAULT_ROOTFS_SIZE=5Gi diff --git a/CLAUDE.md b/CLAUDE.md index c104a7b..56fdbbc 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -12,10 +12,10 @@ All commands go through the Makefile. Never use raw `go build` or `go run`. ```bash make build # Build all binaries → builds/ -make build-cp # Control plane only (builds frontend first) +make build-cp # Control plane only make build-agent # Host agent only make build-envd # envd static binary (verified statically linked) -make build-frontend # SvelteKit dashboard → internal/dashboard/static/ +make build-frontend # SvelteKit dashboard → frontend/build/ (served by Caddy) make dev # Full local dev: infra + migrate + control plane make dev-infra # Start PostgreSQL + Prometheus + Grafana (Docker) @@ -55,7 +55,7 @@ User SDK → HTTPS/WS → Control Plane → Connect RPC → Host Agent → HTTP/ | Binary | Module | Entry point | Runs as | |--------|--------|-------------|---------| | wrenn-cp | `git.omukk.dev/wrenn/wrenn` | `cmd/control-plane/main.go` | Unprivileged | -| wrenn-agent | `git.omukk.dev/wrenn/wrenn` | `cmd/host-agent/main.go` | Root (NET_ADMIN + /dev/kvm) | +| wrenn-agent | `git.omukk.dev/wrenn/wrenn` | `cmd/host-agent/main.go` | `wrenn` user with capabilities (SYS_ADMIN, NET_ADMIN, NET_RAW, SYS_PTRACE, KILL, DAC_OVERRIDE, MKNOD) via setcap; also accepts root | | envd | `git.omukk.dev/wrenn/wrenn/envd` (standalone `envd/go.mod`) | `envd/main.go` | PID 1 inside guest VM | envd is a **completely independent Go module**. 
It is never imported by the main module. The only connection is the protobuf contract. It compiles to a static binary baked into rootfs images. @@ -64,7 +64,7 @@ envd is a **completely independent Go module**. It is never imported by the main ### Control Plane -**Internal packages:** `internal/api/`, `internal/dashboard/`, `internal/email/` +**Internal packages:** `internal/api/`, `internal/email/` **Public packages (importable by cloud repo):** `pkg/config/`, `pkg/db/`, `pkg/auth/`, `pkg/auth/oauth/`, `pkg/scheduler/`, `pkg/lifecycle/`, `pkg/channels/`, `pkg/audit/`, `pkg/service/`, `pkg/events/`, `pkg/id/`, `pkg/validate/` @@ -78,7 +78,7 @@ Startup (`cmd/control-plane/main.go`) is a thin wrapper: `cpserver.Run(cpserver. - **API Server** (`internal/api/server.go`): chi router with middleware. Creates handler structs (`sandboxHandler`, `execHandler`, `filesHandler`, etc.) injected with `db.Queries` and the host agent Connect RPC client. Routes under `/v1/capsules/*`. Accepts `[]cpextension.Extension` — each extension's `RegisterRoutes()` is called after all core routes are registered. - **Reconciler** (`internal/api/reconciler.go`): background goroutine (every 30s) that compares DB records against `agent.ListSandboxes()` RPC. Marks orphaned DB entries as "stopped". -- **Dashboard** (SvelteKit + Tailwind + Bits UI, statically built and embedded via `go:embed`, served as catch-all at root) +- **Dashboard** (SvelteKit + Tailwind + Bits UI, built to static files in `frontend/build/`, served by Caddy as a reverse proxy) - **Database**: PostgreSQL via pgx/v5. Queries generated by sqlc from `db/queries/*.sql` → `pkg/db/`. Migrations in `db/migrations/` (goose, plain SQL). `db/migrations/embed.go` exposes `migrations.FS` so the cloud repo can run OSS migrations via `go:embed`. - **Config** (`pkg/config/config.go`): purely environment variables (`DATABASE_URL`, `CP_LISTEN_ADDR`, `CP_HOST_AGENT_ADDR`), no YAML/file config. 
@@ -86,7 +86,9 @@ Startup (`cmd/control-plane/main.go`) is a thin wrapper: `cpserver.Run(cpserver. **Packages:** `internal/hostagent/`, `internal/sandbox/`, `internal/vm/`, `internal/network/`, `internal/devicemapper/`, `internal/envdclient/`, `internal/snapshot/` -Startup (`cmd/host-agent/main.go`) wires: root check → enable IP forwarding → clean up stale dm devices → `sandbox.Manager` (containing `vm.Manager` + `network.SlotAllocator` + `devicemapper.LoopRegistry`) → `hostagent.Server` (Connect RPC handler) → HTTP server. +**Production deployment:** `scripts/prepare-wrenn-user.sh` creates the `wrenn` system user, sets Linux capabilities (setcap) on wrenn-agent and all child binaries (iptables, losetup, dmsetup, etc.), installs an apt hook to restore capabilities after package updates, configures udev rules for `/dev/net/tun`, loads required kernel modules, and writes systemd unit files for both services. No sudo grants — all privilege is via capabilities. + +Startup (`cmd/host-agent/main.go`) wires: root/capabilities check → enable IP forwarding → clean up stale dm devices → `sandbox.Manager` (containing `vm.Manager` + `network.SlotAllocator` + `devicemapper.LoopRegistry`) → `hostagent.Server` (Connect RPC handler) → HTTP server. - **RPC Server** (`internal/hostagent/server.go`): implements `hostagentv1connect.HostAgentServiceHandler`. Thin wrapper — every method delegates to `sandbox.Manager`. Maps Connect error codes on return. - **Sandbox Manager** (`internal/sandbox/manager.go`): the core orchestration layer. Maintains in-memory state in `boxes map[string]*sandboxState` (protected by `sync.RWMutex`). Each `sandboxState` holds a `models.Sandbox`, a `*network.Slot`, and an `*envdclient.Client`. Runs a TTL reaper (every 10s) that auto-destroys timed-out sandboxes. 
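Reviewer note: the "root/capabilities check" startup step described above can be illustrated with a small standalone sketch. It assumes the check works by reading the `CapEff` hex bitmask from `/proc/self/status` and testing one bit per required capability, as the `checkPrivileges` change in this diff does. The helper names `parseCapEff` and `missingCaps` are hypothetical and are not part of the codebase; the bit numbers are the standard Linux capability numbers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// capBit pairs a capability name with its bit position in the CapEff mask.
type capBit struct {
	bit  uint
	name string
}

// requiredCaps mirrors the list used by the host agent in this diff.
var requiredCaps = []capBit{
	{1, "CAP_DAC_OVERRIDE"},
	{5, "CAP_KILL"},
	{12, "CAP_NET_ADMIN"},
	{13, "CAP_NET_RAW"},
	{19, "CAP_SYS_PTRACE"},
	{21, "CAP_SYS_ADMIN"},
	{27, "CAP_MKNOD"},
}

// parseCapEff (hypothetical helper) extracts the CapEff mask from the
// text of a /proc/<pid>/status file.
func parseCapEff(status string) (uint64, error) {
	for _, line := range strings.Split(status, "\n") {
		if hexStr, ok := strings.CutPrefix(line, "CapEff:"); ok {
			return strconv.ParseUint(strings.TrimSpace(hexStr), 16, 64)
		}
	}
	return 0, fmt.Errorf("CapEff not found")
}

// missingCaps (hypothetical helper) returns the names of required
// capabilities whose bit is not set in capEff.
func missingCaps(capEff uint64) []string {
	var missing []string
	for _, c := range requiredCaps {
		if capEff&(1<<c.bit) == 0 {
			missing = append(missing, c.name)
		}
	}
	return missing
}

func main() {
	// Sample status text with only NET_ADMIN, NET_RAW and SYS_ADMIN set.
	status := "Name:\twrenn-agent\nCapEff:\t0000000000203000\n"
	mask, err := parseCapEff(status)
	if err != nil {
		panic(err)
	}
	fmt.Println("missing:", missingCaps(mask))
}
```

Keeping the parsing separate from the file read makes the check easy to exercise with canned status text, which may be useful if a unit test is added for this path.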
@@ -113,8 +115,8 @@ Runs as PID 1 inside the microVM via `wrenn-init.sh` (mounts procfs/sysfs/dev, s - **Package manager**: pnpm - **Routing**: SvelteKit file-based routing under `frontend/src/routes/` - **Routing layout**: `/login` and `/signup` at root, authenticated pages under `/dashboard/*` (e.g. `/dashboard/capsules`, `/dashboard/keys`) -- **Build output**: `frontend/build/` → copied to `internal/dashboard/static/` → embedded via `go:embed` into the control plane binary -- **Serving**: `internal/dashboard/dashboard.go` registers a `NotFound` catch-all SPA handler with fallback to `index.html`. API routes (`/v1/*`, `/openapi.yaml`, `/docs`) are registered first and take priority +- **Build output**: `frontend/build/` — static files served by Caddy +- **Serving**: Caddy reverse-proxies API requests to the control plane and serves the SvelteKit SPA directly. The control plane does not serve frontend assets. - **Dev workflow**: `make dev-frontend` runs Vite dev server on port 5173 with HMR. API calls proxy to `http://localhost:8000` - **Fonts**: Manrope (UI), Instrument Serif (headings), JetBrains Mono (code), Alice (brand wordmark) — all self-hosted via `@fontsource` - **Dark mode**: class-based (`.dark` on `<html>`) with system preference detection + localStorage persistence @@ -209,7 +211,7 @@ To add a new query: add it to the appropriate `.sql` file in `db/queries/` → ` - **TAP networking** (not vsock) for host-to-envd communication - **Device-mapper snapshots** for rootfs CoW — shared read-only loop device per base template, per-sandbox sparse CoW file, Firecracker gets `/dev/mapper/wrenn-{id}` - **PostgreSQL** via pgx/v5 + sqlc (type-safe query generation). Goose for migrations (plain SQL, up/down) -- **Dashboard**: SvelteKit (Svelte 5, adapter-static) + Tailwind CSS v4 + Bits UI. Built to static files, embedded into the Go binary via `go:embed`, served as catch-all at root +- **Dashboard**: SvelteKit (Svelte 5, adapter-static) + Tailwind CSS v4 + Bits UI. 
Built to static files in `frontend/build/`, served by Caddy (not embedded in the Go binary) - **Lago** for billing (external service, not in this codebase) ## Coding Conventions diff --git a/README.md b/README.md index c312494..765be5a 100644 --- a/README.md +++ b/README.md @@ -2,16 +2,16 @@ Secure infrastructure for AI -## Deployment - -### Prerequisites +## Prerequisites - Linux host with `/dev/kvm` access (bare metal or nested virt) - Firecracker binary at `/usr/local/bin/firecracker` - PostgreSQL - Go 1.25+ +- pnpm (for frontend) +- Docker (for dev infra and rootfs builds) -### Build +## Build ```bash make build # outputs to builds/ @@ -19,30 +19,77 @@ make build # outputs to builds/ Produces three binaries: `wrenn-cp` (control plane), `wrenn-agent` (host agent), `envd` (guest agent). -### Host setup +## Host setup -The host agent machine needs: +The host agent needs a kernel, a minimal rootfs image, and working directories on the host machine. -```bash -# Kernel for guest VMs -mkdir -p /var/lib/wrenn/kernels -# Place a vmlinux kernel at /var/lib/wrenn/kernels/vmlinux +### Directory structure -# Rootfs images -mkdir -p /var/lib/wrenn/images -# Build or place .ext4 rootfs images (e.g., minimal.ext4) - -# Sandbox working directory -mkdir -p /var/lib/wrenn/sandboxes - -# Snapshots directory -mkdir -p /var/lib/wrenn/snapshots - -# Enable IP forwarding -sysctl -w net.ipv4.ip_forward=1 +``` +/var/lib/wrenn/ +├── kernels/ +│ └── vmlinux # uncompressed Linux kernel (not bzImage) +├── images/ +│ └── minimal/ +│ └── rootfs.ext4 # base rootfs (all other templates snapshot from this) +├── sandboxes/ # per-sandbox CoW files (created at runtime) +└── snapshots/ # pause/hibernate snapshot files (created at runtime) ``` -### Configure +Create the directories: + +```bash +sudo mkdir -p /var/lib/wrenn/{kernels,images/minimal,sandboxes,snapshots} +``` + +### Kernel + +Place an uncompressed `vmlinux` kernel at `/var/lib/wrenn/kernels/vmlinux`. 
Versioned kernels (`vmlinux-{semver}`) are also supported — the agent picks the latest by semver. + +### Minimal rootfs + +The minimal rootfs is the base image that all other templates (Python, Node, etc.) are built on top of via device-mapper snapshots. It must contain: + +| Package | Why | +|---------|-----| +| `socat` | Bidirectional relay for port forwarding | +| `chrony` | Time sync from KVM PTP clock (`/dev/ptp0`) | +| `tini` | PID 1 zombie reaper (injected by build script, not apt) | +| `sudo` | User privilege management inside the guest | +| `wget` | HTTP fetching | +| `curl` | HTTP client | +| `ca-certificates` | TLS certificate verification | + +**To build a rootfs from a Docker container:** + +1. Create and configure a container with the required packages: + ```bash + docker run -it --name wrenn-minimal debian:bookworm bash + # Inside the container: + apt update && apt install -y socat chrony sudo wget curl ca-certificates + exit + ``` + +2. Export to a rootfs image (builds envd, injects wrenn-init + tini, shrinks to minimum size): + ```bash + sudo bash scripts/rootfs-from-container.sh wrenn-minimal minimal + ``` + +**To update an existing rootfs** after changing envd or `wrenn-init.sh`: + +```bash +bash scripts/update-minimal-rootfs.sh +``` + +This rebuilds envd via `make build-envd` and copies the fresh binaries into the mounted rootfs image. 
+ +### IP forwarding + +```bash +sudo sysctl -w net.ipv4.ip_forward=1 +``` + +## Configure Copy `.env.example` to `.env` and edit: @@ -59,25 +106,21 @@ WRENN_HOST_LISTEN_ADDR=:50051 WRENN_DIR=/var/lib/wrenn ``` -### Run +## Development ```bash -# Apply database migrations -make migrate-up - -# Start control plane -./builds/wrenn-cp +make dev # Start PostgreSQL (Docker), run migrations, start control plane +make dev-agent # Start host agent (separate terminal, sudo) +make dev-frontend # Vite dev server with HMR (port 5173) +make check # fmt + vet + lint + test ``` -Control plane listens on `WRENN_CP_LISTEN_ADDR` (default `:8000`). - ### Host registration Hosts must be registered with the control plane before they can serve sandboxes. 1. **Create a host record** (via API or dashboard): ```bash - # As an admin (JWT auth) curl -X POST http://localhost:8000/v1/hosts \ -H "Authorization: Bearer $JWT_TOKEN" \ -H "Content-Type: application/json" \ @@ -87,17 +130,16 @@ Hosts must be registered with the control plane before they can serve sandboxes. 2. **Start the host agent** with the registration token and its externally-reachable address: ```bash - sudo WRENN_CP_URL=http://cp-host:8000 \ + sudo WRENN_CP_URL=http://localhost:8000 \ ./builds/wrenn-agent \ --register \ - --address 10.0.1.5:50051 + --address :50051 ``` On first startup the agent sends its specs (arch, CPU, memory, disk) to the control plane, receives a long-lived host JWT, and saves it to `$WRENN_DIR/host-token`. 3. **Subsequent startups** don't need `--register` — the agent loads the saved JWT automatically: ```bash - sudo WRENN_CP_URL=http://cp-host:8000 \ - ./builds/wrenn-agent --address 10.0.1.5:50051 + sudo ./builds/wrenn-agent --address :50051 ``` 4. **If registration fails** (e.g., network error after token was consumed), regenerate a token: @@ -107,23 +149,6 @@ Hosts must be registered with the control plane before they can serve sandboxes. ``` Then restart the agent with the new token. 
-The agent sends heartbeats to the control plane every 30 seconds. Host agent listens on `WRENN_HOST_LISTEN_ADDR` (default `:50051`). - -### Rootfs images - -envd must be baked into every rootfs image. After building: - -```bash -make build-envd -bash scripts/update-debug-rootfs.sh /var/lib/wrenn/images/minimal.ext4 -``` - -## Development - -```bash -make dev # Start PostgreSQL (Docker), run migrations, start control plane -make dev-agent # Start host agent (separate terminal, sudo) -make check # fmt + vet + lint + test -``` +The agent sends heartbeats to the control plane every 30 seconds. See `CLAUDE.md` for full architecture documentation. diff --git a/cmd/host-agent/main.go b/cmd/host-agent/main.go index 047d726..5896c2c 100644 --- a/cmd/host-agent/main.go +++ b/cmd/host-agent/main.go @@ -1,14 +1,18 @@ package main import ( + "bufio" "context" "crypto/tls" "flag" + "fmt" "log/slog" "net/http" "os" "os/signal" "path/filepath" + "strconv" + "strings" "sync" "syscall" "time" @@ -21,6 +25,7 @@ import ( "git.omukk.dev/wrenn/wrenn/internal/network" "git.omukk.dev/wrenn/wrenn/internal/sandbox" "git.omukk.dev/wrenn/wrenn/pkg/auth" + "git.omukk.dev/wrenn/wrenn/pkg/logging" "git.omukk.dev/wrenn/wrenn/proto/hostagent/gen/hostagentv1connect" ) @@ -38,18 +43,24 @@ func main() { advertiseAddr := flag.String("address", "", "Externally-reachable address (ip:port) for this host agent") flag.Parse() - slog.SetDefault(slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{ - Level: slog.LevelDebug, - }))) + rootDir := envOrDefault("WRENN_DIR", "/var/lib/wrenn") + cleanupLog := logging.Setup(filepath.Join(rootDir, "logs"), "host-agent") + defer cleanupLog() - if os.Geteuid() != 0 { - slog.Error("host agent must run as root") + if err := checkPrivileges(); err != nil { + slog.Error("insufficient privileges", "error", err) os.Exit(1) } - // Enable IP forwarding (required for NAT). + // Enable IP forwarding (required for NAT). 
The write may fail if running + // as non-root without DAC_OVERRIDE on this path — that's OK if the systemd + // unit's ExecStartPre already set it. We verify the value regardless. if err := os.WriteFile("/proc/sys/net/ipv4/ip_forward", []byte("1"), 0644); err != nil { - slog.Warn("failed to enable ip_forward", "error", err) + slog.Warn("failed to enable ip_forward (may have been set by systemd unit)", "error", err) + } + if b, err := os.ReadFile("/proc/sys/net/ipv4/ip_forward"); err != nil || strings.TrimSpace(string(b)) != "1" { + slog.Error("ip_forward is not enabled — sandbox networking will be broken", "error", err) + os.Exit(1) } // Clean up stale resources from a previous crash. @@ -57,7 +68,6 @@ func main() { network.CleanupStaleNamespaces() listenAddr := envOrDefault("WRENN_HOST_LISTEN_ADDR", ":50051") - rootDir := envOrDefault("WRENN_DIR", "/var/lib/wrenn") cpURL := os.Getenv("WRENN_CP_URL") credsFile := filepath.Join(rootDir, "host-credentials.json") @@ -170,6 +180,7 @@ func main() { shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 30*time.Second) defer shutdownCancel() mgr.Shutdown(shutdownCtx) + sandbox.ShrinkMinimalImage(rootDir) if err := httpServer.Shutdown(shutdownCtx); err != nil { slog.Error("http server shutdown error", "error", err) } @@ -245,3 +256,63 @@ func envOrDefault(key, def string) string { } return def } + +// checkPrivileges verifies the process has the required Linux capabilities. +// Always reads CapEff — even for root — because a root process inside a +// restricted container (e.g. docker --cap-drop=all) may not have all caps. +func checkPrivileges() error { + capEff, err := readEffectiveCaps() + if err != nil { + return fmt.Errorf("read capabilities: %w", err) + } + + // All capabilities required by the host agent at runtime. 
+ required := []struct { + bit uint + name string + }{ + {1, "CAP_DAC_OVERRIDE"}, // /dev/loop*, /dev/mapper/*, /dev/net/tun + {5, "CAP_KILL"}, // SIGTERM/SIGKILL to Firecracker processes + {12, "CAP_NET_ADMIN"}, // netlink, iptables, routing, TAP/veth + {13, "CAP_NET_RAW"}, // raw sockets (iptables) + {19, "CAP_SYS_PTRACE"}, // reading /proc/self/ns/net (netns.Get) + {21, "CAP_SYS_ADMIN"}, // netns, mount ns, losetup, dmsetup + {27, "CAP_MKNOD"}, // device-mapper node creation + } + + var missing []string + for _, cap := range required { + if capEff&(1<<cap.bit) == 0 { + missing = append(missing, cap.name) + } + } + + if len(missing) > 0 { + return fmt.Errorf("missing capabilities: %s — run as root or apply setcap to the binary", + strings.Join(missing, ", ")) + } + + return nil +} + +// readEffectiveCaps parses the CapEff bitmask from /proc/self/status. +func readEffectiveCaps() (uint64, error) { + f, err := os.Open("/proc/self/status") + if err != nil { + return 0, err + } + defer f.Close() + + scanner := bufio.NewScanner(f) + for scanner.Scan() { + line := scanner.Text() + if hexStr, ok := strings.CutPrefix(line, "CapEff:"); ok { + return strconv.ParseUint(strings.TrimSpace(hexStr), 16, 64) + } + } + + if err := scanner.Err(); err != nil { + return 0, fmt.Errorf("read /proc/self/status: %w", err) + } + return 0, fmt.Errorf("CapEff not found in /proc/self/status") +} diff --git a/deploy/logrotate/wrenn b/deploy/logrotate/wrenn new file mode 100644 index 0000000..f05a606 --- /dev/null +++ b/deploy/logrotate/wrenn @@ -0,0 +1,19 @@ +/var/lib/wrenn/logs/control-plane.log +/var/lib/wrenn/logs/host-agent.log +{ + daily + rotate 3 + missingok + notifempty + dateext + dateformat -%Y-%m-%d + compress + delaycompress + sharedscripts + postrotate + # Signal the processes to reopen their log files. + # Use SIGHUP — both binaries handle it gracefully. 
+ pkill -HUP -f wrenn-cp || true + pkill -HUP -f wrenn-agent || true + endscript +} diff --git a/frontend/src/routes/admin/capsules/[id]/+page.js b/frontend/src/routes/admin/capsules/[id]/+page.js new file mode 100644 index 0000000..d43d0cd --- /dev/null +++ b/frontend/src/routes/admin/capsules/[id]/+page.js @@ -0,0 +1 @@ +export const prerender = false; diff --git a/internal/api/server.go b/internal/api/server.go index 47a1b44..9e81340 100644 --- a/internal/api/server.go +++ b/internal/api/server.go @@ -28,6 +28,7 @@ var openapiYAML []byte type Server struct { router chi.Router BuildSvc *service.BuildService + version string } // New constructs the chi router and registers all routes. @@ -48,6 +49,7 @@ func New( mailer email.Mailer, extensions []cpextension.Extension, sctx cpextension.ServerContext, + version string, ) *Server { r := chi.NewRouter() r.Use(requestLogger()) @@ -86,6 +88,12 @@ func New( adminCapsules := newAdminCapsuleHandler(sandboxSvc, queries, pool, al) meH := newMeHandler(queries, pgPool, rdb, jwtSecret, mailer, oauthRegistry, oauthRedirectURL, teamSvc) + // Health check. + r.Get("/health", func(w http.ResponseWriter, r *http.Request) { + w.Header().Set("Content-Type", "application/json") + fmt.Fprintf(w, `{"status":"ok","version":%q}`, version) + }) + // OpenAPI spec and docs. r.Get("/openapi.yaml", serveOpenAPI) r.Get("/docs", serveDocs) @@ -270,7 +278,7 @@ func New( ext.RegisterRoutes(r, sctx) } - return &Server{router: r, BuildSvc: buildSvc} + return &Server{router: r, BuildSvc: buildSvc, version: version} } // Handler returns the HTTP handler. 
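Reviewer note on the `/health` handler added in `internal/api/server.go` above: it formats the version with `%q`, which applies Go string-literal quoting. For ordinary version strings that output is also valid JSON, but Go's escapes can differ from JSON's for unusual characters such as control bytes. A minimal sketch of an alternative, assuming the same response shape, is shown here; `healthResponse`, `healthHandler`, and `probe` are hypothetical names used only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// healthResponse is a hypothetical type matching the JSON shape in the diff.
type healthResponse struct {
	Status  string `json:"status"`
	Version string `json:"version"`
}

// healthHandler returns a handler that reports liveness and the build version.
func healthHandler(version string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		// encoding/json applies JSON escaping rules to the version string.
		_ = json.NewEncoder(w).Encode(healthResponse{Status: "ok", Version: version})
	}
}

// probe issues an in-memory GET /health and decodes the response body.
func probe(version string) (int, healthResponse, error) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/health", nil)
	healthHandler(version).ServeHTTP(rec, req)

	var body healthResponse
	err := json.NewDecoder(rec.Body).Decode(&body)
	return rec.Code, body, err
}

func main() {
	code, body, err := probe("v0.1.0")
	if err != nil {
		panic(err)
	}
	fmt.Println(code, body.Status, body.Version)
}
```

Whether this is worth the extra type is a judgment call; if the version string is always a plain tag set at build time, the `fmt.Fprintf` form in the diff behaves the same.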
diff --git a/internal/network/allocator.go b/internal/network/allocator.go index b7265e6..6a929d0 100644 --- a/internal/network/allocator.go +++ b/internal/network/allocator.go @@ -24,7 +24,7 @@ func (a *SlotAllocator) Allocate() (int, error) { a.mu.Lock() defer a.mu.Unlock() - for i := 1; i <= 65534; i++ { + for i := 1; i <= 32767; i++ { if !a.inUse[i] { a.inUse[i] = true return i, nil diff --git a/internal/sandbox/images.go b/internal/sandbox/images.go index 26634d3..b1a848d 100644 --- a/internal/sandbox/images.go +++ b/internal/sandbox/images.go @@ -104,6 +104,37 @@ func ParseSizeToMB(s string) (int, error) { } } +// ShrinkMinimalImage shrinks the built-in minimal rootfs back to its minimum +// size using resize2fs -M. This is the inverse of EnsureImageSizes and should +// be called during graceful shutdown so the image is stored compactly on disk. +func ShrinkMinimalImage(wrennDir string) { + minimalRootfs := layout.TemplateRootfs(wrennDir, id.PlatformTeamID, id.MinimalTemplateID) + shrinkImage(minimalRootfs) +} + +// shrinkImage shrinks a single rootfs image to its minimum size. +func shrinkImage(rootfs string) { + if _, err := os.Stat(rootfs); err != nil { + return + } + + slog.Info("shrinking base image", "path", rootfs) + + if out, err := exec.Command("e2fsck", "-fy", rootfs).CombinedOutput(); err != nil { + if exitErr, ok := err.(*exec.ExitError); ok && exitErr.ExitCode() > 1 { + slog.Warn("e2fsck before shrink failed", "path", rootfs, "output", string(out), "error", err) + return + } + } + + if out, err := exec.Command("resize2fs", "-M", rootfs).CombinedOutput(); err != nil { + slog.Warn("resize2fs -M failed", "path", rootfs, "output", string(out), "error", err) + return + } + + slog.Info("base image shrunk", "path", rootfs) +} + // expandImage expands a single rootfs image if it is smaller than targetBytes. 
func expandImage(rootfs string, targetBytes int64, targetMB int) error { info, err := os.Stat(rootfs) diff --git a/pkg/config/config.go b/pkg/config/config.go index 2274bb2..a695392 100644 --- a/pkg/config/config.go +++ b/pkg/config/config.go @@ -14,6 +14,7 @@ type Config struct { RedisURL string ListenAddr string JWTSecret string + WrennDir string // WRENN_DIR — base directory for wrenn data (logs, etc.) // mTLS — CP→Agent channel. Both must be set to enable mTLS; omitting either // disables cert issuance and leaves agent connections on plain HTTP (dev mode). @@ -48,6 +49,7 @@ func Load() Config { RedisURL: envOrDefault("REDIS_URL", "redis://localhost:6379/0"), ListenAddr: envOrDefault("WRENN_CP_LISTEN_ADDR", ":8080"), JWTSecret: os.Getenv("JWT_SECRET"), + WrennDir: envOrDefault("WRENN_DIR", "/var/lib/wrenn"), CACert: os.Getenv("WRENN_CA_CERT"), CAKey: os.Getenv("WRENN_CA_KEY"), diff --git a/pkg/cpserver/run.go b/pkg/cpserver/run.go index aeb21ac..32a819a 100644 --- a/pkg/cpserver/run.go +++ b/pkg/cpserver/run.go @@ -6,6 +6,7 @@ import ( "net/http" "os" "os/signal" + "path/filepath" "strings" "syscall" "time" @@ -22,6 +23,7 @@ import ( "git.omukk.dev/wrenn/wrenn/pkg/config" "git.omukk.dev/wrenn/wrenn/pkg/db" "git.omukk.dev/wrenn/wrenn/pkg/lifecycle" + "git.omukk.dev/wrenn/wrenn/pkg/logging" "git.omukk.dev/wrenn/wrenn/pkg/scheduler" ) @@ -39,11 +41,9 @@ func Run(opts ...Option) { opt(o) } - slog.SetDefault(slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{ - Level: slog.LevelDebug, - }))) - cfg := config.Load() + cleanupLog := logging.Setup(filepath.Join(cfg.WrennDir, "logs"), "control-plane") + defer cleanupLog() if len(cfg.JWTSecret) < 32 { slog.Error("JWT_SECRET must be at least 32 characters") @@ -175,7 +175,7 @@ func Run(opts ...Option) { } // API server. 
- srv := api.New(queries, hostPool, hostScheduler, pool, rdb, []byte(cfg.JWTSecret), oauthRegistry, cfg.OAuthRedirectURL, ca, al, channelSvc, mailer, o.extensions, sctx) + srv := api.New(queries, hostPool, hostScheduler, pool, rdb, []byte(cfg.JWTSecret), oauthRegistry, cfg.OAuthRedirectURL, ca, al, channelSvc, mailer, o.extensions, sctx, o.version) // Start template build workers (2 concurrent). stopBuildWorkers := srv.BuildSvc.StartWorkers(ctx, 2) diff --git a/pkg/logging/logging.go b/pkg/logging/logging.go new file mode 100644 index 0000000..6159a9c --- /dev/null +++ b/pkg/logging/logging.go @@ -0,0 +1,135 @@ +package logging + +import ( + "io" + "log/slog" + "os" + "os/signal" + "path/filepath" + "strings" + "sync" + "syscall" +) + +// Setup configures the global slog logger with dual output (stderr + rotating +// log file). logsDir is the directory where log files are written. binaryName +// is used as the log filename (e.g. "control-plane" → "control-plane.log"). +// +// If logsDir is empty or the directory cannot be created, Setup falls back to +// stderr-only logging and returns a no-op cleanup function. +// +// The returned cleanup function closes the log file and must be deferred. +// Setup also installs a SIGHUP handler that reopens the log file, allowing +// external log rotation tools (e.g. logrotate) to rotate files in place. +func Setup(logsDir, binaryName string) func() { + level := parseLevel(os.Getenv("LOG_LEVEL")) + + if logsDir == "" { + slog.SetDefault(slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{ + Level: level, + }))) + return func() {} + } + + if err := os.MkdirAll(logsDir, 0750); err != nil { + // Fall back to stderr-only; log the error so operators notice. 
+ slog.SetDefault(slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{ + Level: level, + }))) + slog.Warn("file logging unavailable: failed to create log directory", "dir", logsDir, "error", err) + return func() {} + } + + logPath := filepath.Join(logsDir, binaryName+".log") + rf, err := newReopenableFile(logPath) + if err != nil { + slog.SetDefault(slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{ + Level: level, + }))) + slog.Warn("file logging unavailable: failed to open log file", "path", logPath, "error", err) + return func() {} + } + + mw := io.MultiWriter(os.Stderr, rf) + slog.SetDefault(slog.New(slog.NewTextHandler(mw, &slog.HandlerOptions{ + Level: level, + }))) + + // SIGHUP reopens the log file so logrotate can rotate in place. + sigCh := make(chan os.Signal, 1) + signal.Notify(sigCh, syscall.SIGHUP) + go func() { + for range sigCh { + if err := rf.Reopen(); err != nil { + slog.Error("failed to reopen log file on SIGHUP", "path", logPath, "error", err) + } else { + slog.Info("log file reopened", "path", logPath) + } + } + }() + + return func() { + signal.Stop(sigCh) + close(sigCh) + rf.Close() + } +} + +func parseLevel(s string) slog.Level { + switch strings.ToLower(strings.TrimSpace(s)) { + case "debug": + return slog.LevelDebug + case "warn", "warning": + return slog.LevelWarn + case "error": + return slog.LevelError + default: + return slog.LevelInfo + } +} + +// reopenableFile is an io.Writer backed by an *os.File that can be atomically +// reopened (for log rotation via SIGHUP). All operations are goroutine-safe. 
+type reopenableFile struct { + path string + mu sync.Mutex + f *os.File +} + +func newReopenableFile(path string) (*reopenableFile, error) { + f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0640) + if err != nil { + return nil, err + } + return &reopenableFile{path: path, f: f}, nil +} + +func (r *reopenableFile) Write(p []byte) (int, error) { + r.mu.Lock() + defer r.mu.Unlock() + return r.f.Write(p) +} + +// Reopen closes the current file and opens a new one at the same path. +// This is the mechanism that makes logrotate's copytruncate-free rotation work: +// logrotate renames the old file, then sends SIGHUP, and the process opens a +// fresh file at the original path. +func (r *reopenableFile) Reopen() error { + r.mu.Lock() + defer r.mu.Unlock() + // Open the new file before closing the old one so a failed open doesn't + // leave the writer in a broken state with a closed fd. + f, err := os.OpenFile(r.path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0640) + if err != nil { + return err + } + r.f.Close() + r.f = f + return nil +} + +func (r *reopenableFile) Close() error { + r.mu.Lock() + defer r.mu.Unlock() + return r.f.Close() +} diff --git a/scripts/prepare-wrenn-user.sh b/scripts/prepare-wrenn-user.sh new file mode 100755 index 0000000..c48eef0 --- /dev/null +++ b/scripts/prepare-wrenn-user.sh @@ -0,0 +1,385 @@ +#!/usr/bin/env bash +# +# prepare-wrenn-user.sh — Create the wrenn system user and configure minimal privileges. +# +# Creates a locked-down 'wrenn' system user that can run wrenn-agent and wrenn-cp +# with only the privileges they need. The agent binary gets Linux capabilities +# via setcap — no sudo is configured for the wrenn user at all. If an attacker +# compromises the wrenn user, they cannot escalate via sudo. +# +# What this script does: +# 1. Creates the 'wrenn' system user (bash shell for debugging, no home dir) +# 2. Creates required directories with correct ownership +# 3. 
Sets Linux capabilities on wrenn-agent and all child binaries +# 4. Installs an apt hook to restore capabilities after package updates +# 5. Installs a sudoers drop-in (comment-only, no grants — absence is the cage) +# 6. Ensures required kernel modules are loaded +# 7. Writes systemd unit files for both wrenn-agent and wrenn-cp +# +# Usage: +# sudo bash scripts/prepare-wrenn-user.sh +# +# Prerequisites: +# - wrenn-agent binary at /usr/local/bin/wrenn-agent +# - wrenn-cp binary at /usr/local/bin/wrenn-cp +# - firecracker binary at /usr/local/bin/firecracker +# - libcap2-bin installed (for setcap) + +set -euo pipefail + +# ── Guard ──────────────────────────────────────────────────────────────────── + +if [[ $EUID -ne 0 ]]; then + echo "ERROR: This script must be run as root." + exit 1 +fi + +# ── Configuration ──────────────────────────────────────────────────────────── + +WRENN_USER="wrenn" +WRENN_GROUP="wrenn" +WRENN_DIR="/var/lib/wrenn" +AGENT_BIN="/usr/local/bin/wrenn-agent" +CP_BIN="/usr/local/bin/wrenn-cp" +FC_BIN="/usr/local/bin/firecracker" +RESTORE_CAPS_SCRIPT="/etc/wrenn/restore-caps.sh" + +# ── 1. Create system user ─────────────────────────────────────────────────── + +if id "${WRENN_USER}" &>/dev/null; then + echo "==> User '${WRENN_USER}' already exists, skipping creation." +else + echo "==> Creating system user '${WRENN_USER}'..." + useradd \ + --system \ + --no-create-home \ + --home-dir "${WRENN_DIR}" \ + --shell /bin/bash \ + "${WRENN_USER}" +fi + +# Add wrenn to kvm group for /dev/kvm access. +if getent group kvm &>/dev/null; then + usermod -aG kvm "${WRENN_USER}" + echo "==> Added '${WRENN_USER}' to 'kvm' group." +fi + +# ── 2. Create directories with correct ownership ──────────────────────────── + +echo "==> Setting up directories..." 
+ +directories=( + "${WRENN_DIR}" + "${WRENN_DIR}/images" + "${WRENN_DIR}/kernels" + "${WRENN_DIR}/sandboxes" + "${WRENN_DIR}/snapshots" + "${WRENN_DIR}/logs" + "/run/netns" +) + +for dir in "${directories[@]}"; do + mkdir -p "${dir}" +done + +# Only chown wrenn-owned dirs (not /run/netns which is system-managed). +for dir in "${WRENN_DIR}" "${WRENN_DIR}/images" "${WRENN_DIR}/kernels" \ + "${WRENN_DIR}/sandboxes" "${WRENN_DIR}/snapshots" "${WRENN_DIR}/logs"; do + chown "${WRENN_USER}:${WRENN_GROUP}" "${dir}" + chmod 750 "${dir}" +done + +# ── 3. Set capabilities on binaries ───────────────────────────────────────── +# +# These capabilities replace full root access. The wrenn-agent binary gets +# exactly the capabilities it needs for: +# +# CAP_SYS_ADMIN — network namespaces (netns create/enter), mount namespaces +# (unshare -m), losetup, dmsetup, mount/umount +# CAP_NET_ADMIN — veth/TAP creation (netlink), iptables rules, IP forwarding, +# routing table manipulation +# CAP_NET_RAW — raw socket access (needed by iptables internally) +# CAP_SYS_PTRACE — reading /proc/self/ns/net (netns.Get) +# CAP_KILL — sending SIGTERM/SIGKILL to Firecracker processes +# CAP_DAC_OVERRIDE — accessing /dev/loop*, /dev/mapper/*, /dev/net/tun, +# /proc/sys/net/ipv4/ip_forward +# CAP_MKNOD — creating device nodes (dm-snapshot) +# +# The 'ep' suffix means Effective + Permitted (granted at exec time). + +echo "==> Setting capabilities on wrenn-agent..." + +if [[ ! -f "${AGENT_BIN}" ]]; then + echo "WARNING: ${AGENT_BIN} not found, skipping setcap. Install the binary first." +else + setcap \ + cap_sys_admin,cap_net_admin,cap_net_raw,cap_sys_ptrace,cap_kill,cap_dac_override,cap_mknod+ep \ + "${AGENT_BIN}" + + echo " Capabilities set on ${AGENT_BIN}:" + getcap "${AGENT_BIN}" +fi + +# Firecracker also needs capabilities when spawned by a non-root parent. +# CAP_NET_ADMIN is required for network device access inside the netns. 
+if [[ -f "${FC_BIN}" ]]; then + setcap cap_net_admin,cap_sys_admin,cap_dac_override+ep "${FC_BIN}" + echo " Capabilities set on ${FC_BIN}:" + getcap "${FC_BIN}" +fi + +# ── Helper: resolve binary path and apply setcap ──────────────────────────── +# +# Uses `command -v` to find the binary in PATH (handles /usr/bin vs /usr/sbin +# differences across distros), then `readlink -f` to resolve symlinks so that +# setcap hits the real inode (important for iptables-nft/alternatives). + +setcap_binary() { + local name="$1" caps="$2" + local bin + bin=$(command -v "$name" 2>/dev/null) || { + echo " WARNING: ${name} not found in PATH, skipping." + return 0 + } + bin=$(readlink -f "$bin") + setcap "$caps" "$bin" + echo " $(getcap "$bin")" +} + +# The child binaries invoked by wrenn-agent (iptables, losetup, dmsetup, etc.) +# also need capabilities since they'll be exec'd by a non-root user. +echo "==> Setting capabilities on child binaries..." + +setcap_binary iptables "cap_net_admin,cap_net_raw+ep" +setcap_binary iptables-save "cap_net_admin,cap_net_raw+ep" +setcap_binary ip "cap_sys_admin,cap_net_admin+ep" +setcap_binary sysctl "cap_net_admin+ep" +setcap_binary losetup "cap_sys_admin,cap_dac_override+ep" +setcap_binary blockdev "cap_sys_admin,cap_dac_override+ep" +setcap_binary dmsetup "cap_sys_admin,cap_dac_override,cap_mknod+ep" +setcap_binary e2fsck "cap_sys_admin,cap_dac_override+ep" +setcap_binary resize2fs "cap_sys_admin,cap_dac_override+ep" +setcap_binary dd "cap_dac_override+ep" +setcap_binary unshare "cap_sys_admin+ep" +setcap_binary mount "cap_sys_admin,cap_dac_override+ep" + +# ── 4. Persist capabilities across package updates ────────────────────────── +# +# apt/dpkg overwrites binaries on package updates, which strips the xattr-based +# capabilities set by setcap. 
This installs: +# - /etc/wrenn/restore-caps.sh: re-applies setcap to all child binaries +# - /etc/apt/apt.conf.d/99-wrenn-setcap: apt post-invoke hook that calls it + +echo "==> Installing capability restore hook..." + +mkdir -p /etc/wrenn + +cat > "${RESTORE_CAPS_SCRIPT}" << 'RESTORE' +#!/usr/bin/env bash +# +# restore-caps.sh — Re-apply Linux capabilities to wrenn child binaries. +# Called automatically by apt after package updates (see /etc/apt/apt.conf.d/99-wrenn-setcap). +# Can also be run manually: sudo /etc/wrenn/restore-caps.sh + +set -euo pipefail + +setcap_binary() { + local name="$1" caps="$2" + local bin + bin=$(command -v "$name" 2>/dev/null) || return 0 + bin=$(readlink -f "$bin") + setcap "$caps" "$bin" 2>/dev/null || true +} + +# wrenn-agent and firecracker (only if present — they aren't package-managed). +[[ -f /usr/local/bin/wrenn-agent ]] && \ + setcap cap_sys_admin,cap_net_admin,cap_net_raw,cap_sys_ptrace,cap_kill,cap_dac_override,cap_mknod+ep \ + /usr/local/bin/wrenn-agent 2>/dev/null || true +[[ -f /usr/local/bin/firecracker ]] && \ + setcap cap_net_admin,cap_sys_admin,cap_dac_override+ep \ + /usr/local/bin/firecracker 2>/dev/null || true + +# Child binaries (these are the ones wiped by apt). 
+setcap_binary iptables "cap_net_admin,cap_net_raw+ep" +setcap_binary iptables-save "cap_net_admin,cap_net_raw+ep" +setcap_binary ip "cap_sys_admin,cap_net_admin+ep" +setcap_binary sysctl "cap_net_admin+ep" +setcap_binary losetup "cap_sys_admin,cap_dac_override+ep" +setcap_binary blockdev "cap_sys_admin,cap_dac_override+ep" +setcap_binary dmsetup "cap_sys_admin,cap_dac_override,cap_mknod+ep" +setcap_binary e2fsck "cap_sys_admin,cap_dac_override+ep" +setcap_binary resize2fs "cap_sys_admin,cap_dac_override+ep" +setcap_binary dd "cap_dac_override+ep" +setcap_binary unshare "cap_sys_admin+ep" +setcap_binary mount "cap_sys_admin,cap_dac_override+ep" +RESTORE + +chmod 755 "${RESTORE_CAPS_SCRIPT}" + +cat > /etc/apt/apt.conf.d/99-wrenn-setcap << 'APT' +// Re-apply Linux capabilities to wrenn child binaries after any package update. +// Capabilities (xattr) are stripped when dpkg overwrites a binary. +DPkg::Post-Invoke { "/etc/wrenn/restore-caps.sh"; }; +APT + +echo " Installed ${RESTORE_CAPS_SCRIPT} and apt post-invoke hook." + +# ── 5. Device access ──────────────────────────────────────────────────────── +# +# /dev/kvm — handled by kvm group membership above +# /dev/net/tun — needs to be accessible by wrenn user + +echo "==> Configuring device access..." + +# Ensure /dev/net/tun is accessible (udev rule for persistence across reboots). +cat > /etc/udev/rules.d/99-wrenn.rules << 'UDEV' +# Allow wrenn user access to TUN device for TAP networking. +SUBSYSTEM=="misc", KERNEL=="tun", GROUP="wrenn", MODE="0660" +UDEV + +udevadm control --reload-rules 2>/dev/null || true +echo " Installed udev rule for /dev/net/tun." + +# ── 6. Kernel modules ─────────────────────────────────────────────────────── + +echo "==> Ensuring kernel modules are loaded..." + +modules=(dm_snapshot dm_mod loop tun) +for mod in "${modules[@]}"; do + if ! 
grep -q "^${mod} " /proc/modules; then + modprobe "${mod}" 2>/dev/null && echo " Loaded ${mod}" || echo " WARNING: Could not load ${mod}" + else + echo " ${mod} already loaded." + fi +done + +# Persist across reboots. +for mod in "${modules[@]}"; do + grep -qxF "${mod}" /etc/modules-load.d/wrenn.conf 2>/dev/null || echo "${mod}" >> /etc/modules-load.d/wrenn.conf +done +echo " Module persistence written to /etc/modules-load.d/wrenn.conf." + +# ── 7. Sudoers ────────────────────────────────────────────────────────────── +# +# The wrenn user has no sudo grants. The absence of a grant is the cage; an +# explicit "!ALL" deny is weaker due to known bypasses (CVE-2019-14287). +# This file exists purely as documentation for operators running `sudo -l`. + +echo "==> Writing sudoers drop-in..." + +cat > /etc/sudoers.d/wrenn << 'SUDOERS' +# Wrenn system user: no sudo access permitted. +# All privilege is granted via Linux capabilities on specific binaries (setcap). +# This file contains no active rules. The absence of any grant is intentional +# and is the strongest way to deny escalation. +# +# Do not add rules here. If the wrenn user needs new privileges, use setcap +# on the specific binary instead. +SUDOERS + +chmod 440 /etc/sudoers.d/wrenn +visudo -c -f /etc/sudoers.d/wrenn +echo " /etc/sudoers.d/wrenn installed and validated." + +# ── 8. Systemd units ──────────────────────────────────────────────────────── + +echo "==> Writing systemd service files..." + +cat > /etc/systemd/system/wrenn-agent.service << 'UNIT' +[Unit] +Description=Wrenn Host Agent +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +User=wrenn +Group=wrenn +EnvironmentFile=-/etc/wrenn/agent.env + +# The binary has capabilities set via setcap. These systemd directives ensure +# the capabilities are inherited into the process at exec time.
+AmbientCapabilities=CAP_SYS_ADMIN CAP_NET_ADMIN CAP_NET_RAW CAP_SYS_PTRACE CAP_KILL CAP_DAC_OVERRIDE CAP_MKNOD +CapabilityBoundingSet=CAP_SYS_ADMIN CAP_NET_ADMIN CAP_NET_RAW CAP_SYS_PTRACE CAP_KILL CAP_DAC_OVERRIDE CAP_MKNOD + +# IMPORTANT: must be false — child binaries (iptables, losetup, dmsetup, etc.) +# have their own file capabilities via setcap which must be honored at exec time. +NoNewPrivileges=false + +# Enable IP forwarding before the agent starts. The "+" prefix runs this +# directive as root (bypassing User=wrenn) so it can write to procfs. +ExecStartPre=+/bin/sh -c 'sysctl -w net.ipv4.ip_forward=1' + +ExecStart=/usr/local/bin/wrenn-agent --address ${WRENN_ADVERTISE_ADDR} + +Restart=on-failure +RestartSec=5 + +# File descriptor limits (Firecracker + loop devices + sockets). +LimitNOFILE=65536 +LimitNPROC=4096 + +# Protect host filesystem — only allow access to what's needed. +ProtectHome=true +ReadWritePaths=/var/lib/wrenn /tmp /run/netns /dev/mapper +ReadOnlyPaths=/usr/local/bin/firecracker + +[Install] +WantedBy=multi-user.target +UNIT + +cat > /etc/systemd/system/wrenn-cp.service << 'UNIT' +[Unit] +Description=Wrenn Control Plane +After=network-online.target postgresql.service +Wants=network-online.target + +[Service] +Type=simple +User=wrenn +Group=wrenn +EnvironmentFile=-/etc/wrenn/cp.env + +# Control plane is fully unprivileged — no capabilities needed. +NoNewPrivileges=true +CapabilityBoundingSet= + +ExecStart=/usr/local/bin/wrenn-cp + +Restart=on-failure +RestartSec=5 + +ProtectHome=true +ProtectSystem=strict +ReadWritePaths=/tmp + +[Install] +WantedBy=multi-user.target +UNIT + +mkdir -p /etc/wrenn +touch /etc/wrenn/agent.env /etc/wrenn/cp.env +chmod 640 /etc/wrenn/agent.env /etc/wrenn/cp.env +chown root:${WRENN_GROUP} /etc/wrenn/agent.env /etc/wrenn/cp.env + +systemctl daemon-reload +echo " wrenn-agent.service and wrenn-cp.service installed." 
+# ── Done ───────────────────────────────────────────────────────────────────── + +echo "" +echo "=== Setup complete ===" +echo "" +echo "Next steps:" +echo " 1. Copy wrenn-agent and wrenn-cp binaries to /usr/local/bin/ if not already present, then run ${RESTORE_CAPS_SCRIPT} to apply capabilities" +echo " 2. Edit /etc/wrenn/agent.env with WRENN_CP_URL and WRENN_ADVERTISE_ADDR" +echo " 3. Edit /etc/wrenn/cp.env with DATABASE_URL and other control plane config" +echo " 4. systemctl enable --now wrenn-agent" +echo " 5. systemctl enable --now wrenn-cp" +echo "" +echo "Security summary:" +echo " - wrenn user: bash shell (for debugging), home ${WRENN_DIR}, no sudo (no grants in sudoers)" +echo " - wrenn-agent: runs as wrenn with 7 capabilities via setcap (not root)" +echo " - wrenn-cp: runs as wrenn with zero capabilities" +echo " - Capabilities auto-restored after apt upgrades via /etc/wrenn/restore-caps.sh" +echo ""
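The `setcap_binary` helper in the script above resolves each tool through `command -v` and then `readlink -f` before calling `setcap`, so the capability lands on the real file rather than on a symlink. The sketch below isolates only that lookup as a dry run: it makes no `setcap` call and needs no root. The function name `resolve_real_binary` is hypothetical and not part of the script, and the three tool names in the loop are just a sample of the child binaries listed above.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the path lookup performed by setcap_binary.
# resolve_real_binary is a hypothetical name, not part of the script above.
set -euo pipefail

resolve_real_binary() {
    local name="$1" bin
    # command -v searches PATH, so /usr/bin vs /usr/sbin does not matter.
    bin=$(command -v "$name" 2>/dev/null) || return 1
    # readlink -f follows every symlink (e.g. /etc/alternatives) to the real file.
    readlink -f "$bin"
}

# Show where a few of the child binaries actually live on this host.
for name in iptables ip losetup; do
    if real=$(resolve_real_binary "$name"); then
        echo "${name} -> ${real}"
    else
        echo "${name}: not found in PATH"
    fi
done
```

Running this as the `wrenn` user before and after a package upgrade is one way to see whether an alternatives switch (for example between iptables backends) has moved the target that `setcap` needs to hit.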