# Wrenn

Secure infrastructure for AI

## Prerequisites

- Linux host with `/dev/kvm` access (bare metal or nested virt)
- Cloud Hypervisor binary at `/usr/local/bin/cloud-hypervisor`
- PostgreSQL
- Go 1.25+
- Rust 1.88+ with `x86_64-unknown-linux-musl` target (`rustup target add x86_64-unknown-linux-musl`)
- Bun (for frontend)
- Docker (for dev infra and rootfs builds)

## Build

```bash
make build    # outputs to builds/
```

Produces three binaries: `wrenn-cp` (control plane), `wrenn-agent` (host agent), `envd` (guest agent).

## Host setup

The host agent needs a kernel, the system base rootfs images, and working directories on the host machine.

### Directory structure

```
/var/lib/wrenn/
├── kernels/
│   └── vmlinux              # uncompressed Linux kernel (not bzImage)
├── images/
│   └── teams/
│       └── 0000000000000000000000000/   # platform team (base36 all-zeros)
│           ├── 0000000000000000000000000/rootfs.ext4   # minimal-ubuntu (id 0)
│           ├── 0000000000000000000000001/rootfs.ext4   # minimal-alpine (id 1)
│           ├── 0000000000000000000000002/rootfs.ext4   # minimal-arch   (id 2)
│           └── 0000000000000000000000003/rootfs.ext4   # minimal-fedora (id 3)
├── sandboxes/               # per-sandbox CoW files (created at runtime)
└── snapshots/               # pause/hibernate snapshot files (created at runtime)
```

Create the base directories (the per-template image dirs are created by the build scripts):

```bash
sudo mkdir -p /var/lib/wrenn/{kernels,images,sandboxes,snapshots}
```
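
The layout above can be turned into a path helper. The following is a minimal sketch, not the agent's actual code: it assumes template IDs are base36-encoded and zero-padded to the same width as the platform team directory, which matches the four listed directories but is not confirmed beyond them. The names `encodeID` and `systemRootfsPath` are hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
	"strings"
)

// platformTeamID is the platform team directory name from the layout above.
const platformTeamID = "0000000000000000000000000"

// encodeID renders id in base36, left-padded with zeros to the width of
// platformTeamID. The padding width is an assumption taken from the listing.
func encodeID(id uint64) string {
	s := strconv.FormatUint(id, 36)
	if pad := len(platformTeamID) - len(s); pad > 0 {
		s = strings.Repeat("0", pad) + s
	}
	return s
}

// systemRootfsPath returns the rootfs path of a system base template.
func systemRootfsPath(wrennDir string, templateID uint64) string {
	return filepath.Join(wrennDir, "images", "teams", platformTeamID,
		encodeID(templateID), "rootfs.ext4")
}

func main() {
	// Template ID 1 is minimal-alpine in the layout above.
	fmt.Println(systemRootfsPath("/var/lib/wrenn", 1))
}
```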

### Kernel

Place an uncompressed `vmlinux` kernel at `/var/lib/wrenn/kernels/vmlinux`. Versioned kernels (`vmlinux-{semver}`) are also supported — the agent picks the latest by semver.
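
To illustrate "latest by semver", here is a minimal sketch of how such a selection could work. It is not the agent's implementation: it handles only plain `MAJOR.MINOR.PATCH` versions, and the fallback to an unversioned `vmlinux` when no versioned kernel exists is an assumption. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseKernelVersion extracts MAJOR.MINOR.PATCH from "vmlinux-MAJOR.MINOR.PATCH".
// Pre-release and build suffixes are not handled in this sketch.
func parseKernelVersion(name string) (v [3]int, ok bool) {
	rest, found := strings.CutPrefix(name, "vmlinux-")
	if !found {
		return v, false
	}
	parts := strings.Split(rest, ".")
	if len(parts) != 3 {
		return v, false
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return v, false
		}
		v[i] = n
	}
	return v, true
}

// newer reports whether a is a strictly higher version than b.
func newer(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] > b[i]
		}
	}
	return false
}

// latestKernel picks the highest versioned kernel from a directory listing.
// Falling back to the unversioned "vmlinux" is an assumption of this sketch.
func latestKernel(names []string) (string, bool) {
	var best [3]int
	bestName, hasPlain := "", false
	for _, name := range names {
		if name == "vmlinux" {
			hasPlain = true
			continue
		}
		if v, ok := parseKernelVersion(name); ok && (bestName == "" || newer(v, best)) {
			best, bestName = v, name
		}
	}
	if bestName != "" {
		return bestName, true
	}
	if hasPlain {
		return "vmlinux", true
	}
	return "", false
}

func main() {
	name, ok := latestKernel([]string{"vmlinux", "vmlinux-6.9.0", "vmlinux-6.10.0"})
	fmt.Println(name, ok) // prints "vmlinux-6.10.0 true"
}
```

Comparing the components numerically matters here: a plain string sort would rank `vmlinux-6.9.0` above `vmlinux-6.10.0`.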

### System base rootfs images

There are four built-in **system base templates**, one per distro, from which all other
templates are snapshotted via device-mapper. They are platform-owned (visible to every
team) and protected from deletion (template IDs 0–1024 are reserved):

| Template | Distro | ID |
|----------|--------|----|
| `minimal-ubuntu` | `ubuntu:26.04` | 0 |
| `minimal-alpine` | `alpine:3.22` | 1 |
| `minimal-arch` | `archlinux:base` | 2 |
| `minimal-fedora` | `fedora:45` | 3 |

`minimal-ubuntu` is the default template for new sandboxes and builds. The same
statically-linked `envd` + `tini` run on all four regardless of the distro's libc
(glibc on Ubuntu/Arch/Fedora, musl on Alpine).

Each image contains these packages plus a `wrenn-user` account with passwordless `sudo`:

| Package | Why |
|---------|-----|
| `socat` | Bidirectional relay for port forwarding |
| `chrony` | Time sync from KVM PTP clock (`/dev/ptp0`) |
| `iproute2` (`iproute` on Fedora) | `ip` for guest network setup in `wrenn-init` |
| `tini` | PID 1 zombie reaper |
| `sudo` | User privilege management inside the guest |
| `wget` | HTTP fetching |
| `curl` | HTTP client |
| `ca-certificates` | TLS certificate verification |
| `git` | Version control |

**To build all four images** (requires Docker and sudo), run the target below. Each build
spawns a distro container, installs the packages and `wrenn-user`, builds `envd`, injects
`wrenn-init` and `tini`, and exports the image to the team-scoped path:

```bash
make images
```

Or build a single distro: `make rootfs-ubuntu` / `rootfs-alpine` / `rootfs-arch` / `rootfs-fedora`.

**To update the images** after changing `envd` or `wrenn-init.sh` (rebuilds `envd` once,
then re-injects `envd` + `wrenn-init` + `tini` into every system base image):

```bash
bash scripts/update-minimal-rootfs.sh
```

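### IP forwarding

Enable IPv4 forwarding on the host. Note that `sysctl -w` changes the running kernel only and does not persist across reboots: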

```bash
sudo sysctl -w net.ipv4.ip_forward=1
```

## Configure

Copy `.env.example` to `.env` and edit:

```bash
# Required
DATABASE_URL=postgres://wrenn:wrenn@localhost:5432/wrenn?sslmode=disable

# Control plane
WRENN_CP_LISTEN_ADDR=:8000
CP_HOST_AGENT_ADDR=http://localhost:50051

# Host agent
WRENN_HOST_LISTEN_ADDR=:50051
WRENN_DIR=/var/lib/wrenn
```
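
For reference, a minimal sketch of how a Go program could read these variables. This is not the project's config loader: the `config` struct and `loadConfig` are hypothetical, and using the example values as defaults is an assumption. Only `DATABASE_URL` is treated as mandatory, following the `# Required` comment above.

```go
package main

import (
	"errors"
	"fmt"
)

// config mirrors the variables shown in the example .env (hypothetical type).
type config struct {
	DatabaseURL    string
	CPListenAddr   string
	HostAgentAddr  string
	HostListenAddr string
	WrennDir       string
}

// loadConfig reads settings through getenv so it can be tested without
// touching the process environment. Defaults are assumptions of this sketch.
func loadConfig(getenv func(string) string) (config, error) {
	orDefault := func(key, def string) string {
		if v := getenv(key); v != "" {
			return v
		}
		return def
	}
	cfg := config{
		DatabaseURL:    getenv("DATABASE_URL"),
		CPListenAddr:   orDefault("WRENN_CP_LISTEN_ADDR", ":8000"),
		HostAgentAddr:  orDefault("CP_HOST_AGENT_ADDR", "http://localhost:50051"),
		HostListenAddr: orDefault("WRENN_HOST_LISTEN_ADDR", ":50051"),
		WrennDir:       orDefault("WRENN_DIR", "/var/lib/wrenn"),
	}
	if cfg.DatabaseURL == "" {
		return config{}, errors.New("DATABASE_URL is required")
	}
	return cfg, nil
}

func main() {
	// A map stands in for the environment; a real program would pass os.Getenv.
	env := map[string]string{
		"DATABASE_URL": "postgres://wrenn:wrenn@localhost:5432/wrenn?sslmode=disable",
	}
	cfg, err := loadConfig(func(k string) string { return env[k] })
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println(cfg.CPListenAddr, cfg.WrennDir)
}
```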

## Development

```bash
make dev          # Start PostgreSQL (Docker), run migrations, start control plane
make dev-agent    # Start host agent (separate terminal, sudo)
make dev-frontend # Vite dev server with HMR (port 5173)
make check        # fmt + vet + lint + test
```

### Host registration

Hosts must be registered with the control plane before they can serve sandboxes.

1. **Create a host record** in the dashboard (admin only — host management is not exposed over the SDK / API keys). Sign in at `/login`, open the admin hosts page, and click **Add host**. The dashboard returns a `registration_token` valid for 1 hour.

2. **Start the host agent** with the registration token and its externally-reachable address:
   ```bash
   sudo WRENN_CP_URL=http://localhost:8000 \
        ./builds/wrenn-agent \
        --register <token-from-step-1> \
        --address <host-ip>:50051
   ```
   On first startup the agent sends its specs (arch, CPU, memory, disk) to the control plane, receives a long-lived host JWT, and saves it to `$WRENN_DIR/host-token`.

3. **Subsequent startups** don't need `--register` — the agent loads the saved JWT automatically:
   ```bash
   sudo ./builds/wrenn-agent --address <host-ip>:50051
   ```

4. **If registration fails** (e.g., network error after token was consumed), regenerate a token from the dashboard host detail page, then restart the agent with the new token.

The agent sends heartbeats to the control plane every 30 seconds.

## Notification channels

Teams can subscribe to lifecycle events via webhook, Discord, Slack, Teams, Google Chat, Telegram, or Matrix. All providers consume the same event stream (durable Redis stream `wrenn:events`, consumer group `wrenn-channels-v1`, at-least-once delivery with two retries at 10s / 30s).

### Subscribable event types

| Event | Emitted on | Has outcome |
|-------|-----------|-------------|
| `capsule.create` | First boot of a sandbox | yes |
| `capsule.pause` | Manual pause, TTL auto-pause, or reconciler-detected pause | yes |
| `capsule.resume` | Unpause (any subsequent boot after `capsule.create`) | yes |
| `capsule.destroy` | Stop / destroy, including system cleanup-on-error | yes |
| `template.snapshot.create` | Snapshot taken from a running sandbox | yes |
| `template.snapshot.delete` | Snapshot deletion (including cleanup-on-error) | yes |
| `host.up` | Host agent comes online | no |
| `host.down` | Host agent crashes or misses heartbeats | no |

Subscribing to an event type delivers **both success and failure** events. The payload's `outcome` field (`success` or `error`) distinguishes them, and events with `outcome` set to `error` also carry an `error` string with the failure reason.

The transient `capsule.state.changed` event (intermediate transitions like `starting`, `pausing`, `resuming`) is **not** subscribable — it is delivered to the dashboard via SSE only and never written to the durable stream.

### Event payload

All channels receive the same canonical JSON shape:

```json
{
  "event": "capsule.pause",
  "outcome": "success",
  "timestamp": "2026-05-19T14:23:01Z",
  "team_id": "tm_...",
  "actor": {
    "type": "user",
    "id": "usr_...",
    "name": "alice@example.com"
  },
  "resource": {
    "id": "sb_a1b2c3d4",
    "type": "sandbox"
  },
  "metadata": {
    "reason": "ttl_expired"
  },
  "error": ""
}
```

| Field | Type | Notes |
|-------|------|-------|
| `event` | string | Event type (see table above) |
| `outcome` | `"success"` \| `"error"` \| `""` | Empty for `host.up` / `host.down`, which carry no outcome |
| `timestamp` | RFC3339 UTC | When the event was published |
| `team_id` | string | Owning team |
| `actor.type` | `"user"` \| `"api_key"` \| `"system"` | System = TTL reaper, reconciler, cleanup-on-error |
| `actor.id` | string | User ID, API key ID, or empty for system |
| `actor.name` | string | Display name (email for user, label for api_key) |
| `resource.id` | string | Sandbox ID, snapshot ID, or host ID |
| `resource.type` | `"sandbox"` \| `"snapshot"` \| `"host"` | |
| `metadata` | object\<string,string\> | Event-specific context (e.g., `reason`, `from`/`to`, `inferred`) |
| `error` | string | Failure reason when `outcome == "error"` |

`metadata` keys you may observe:

- `reason` — `ttl_expired` (auto-pause), `orphaned` (reconciler cleanup), `cleanup_after_create_error`, `restored_after_host_recovery`, `host_state_sync`, `transient_timeout`, `transient_timeout_inferred`
- `inferred` — `"true"` when the reconciler derived the event from host state, not a direct host callback

### Webhook delivery

Webhook channels receive a raw `POST` with the JSON payload as the body.

Headers:

| Header | Value |
|--------|-------|
| `Content-Type` | `application/json` |
| `X-Wrenn-Delivery` | UUID, unique per delivery attempt |
| `X-Wrenn-Timestamp` | RFC3339 UTC, used for signature verification |
| `X-Wrenn-Signature` | `sha256=<hex>` HMAC over `<timestamp>.<body>` using the channel's signing secret |

The signing secret is shown **once** at channel creation. Verify signatures by computing `HMAC-SHA256(secret, timestamp + "." + body)` and comparing to the header (constant-time compare). Reject deliveries where `X-Wrenn-Timestamp` is outside your acceptable clock skew window. Redirects are not followed.

Any non-2xx response triggers a retry (10s, then 30s). After three total failures the event is dropped (logged on the control plane).

### Other providers

Discord, Slack, Teams, Google Chat, Telegram, and Matrix receive a formatted text message — the same fields, rendered as human-readable text — not the JSON payload. Use webhook if you need the structured event.

## Extending the control plane

The OSS control plane is designed to be embedded by a private cloud distribution without forking. Import this module, implement the `Extension` interface from `pkg/cpextension`, and pass it to `cpserver.Run`:

```go
import (
    "git.omukk.dev/wrenn/wrenn/pkg/cpextension"
    "git.omukk.dev/wrenn/wrenn/pkg/cpserver"
)

func main() {
    cpserver.Run(
        cpserver.WithVersion("cloud-1.0.0"),
        cpserver.WithExtensions(&myExtension{}),
    )
}
```

Every extension implements two methods:

```go
RegisterRoutes(r chi.Router, sctx cpextension.ServerContext)
BackgroundWorkers(sctx cpextension.ServerContext) []func(context.Context)
```

`ServerContext` exposes the initialized OSS services so extensions never re-implement them: `Queries`, `PgPool`, `Redis`, `HostPool`, `Scheduler`, `CA`, `Audit`, `Mailer`, `OAuthRegistry`, `Channels`, `ChannelPub`, `JWTSecret`, `Sessions`, `Config`.

### Optional hook interfaces

An extension can also implement any subset of these — the OSS server type-asserts at startup:

| Interface | When it fires | Failure semantics |
|---|---|---|
| `MiddlewareProvider` | Wraps every OSS route before registration | n/a |
| `AuthHook.OnSignup(ctx, userID, teamID, email)` | After team provisioning on email-activate or OAuth-new-signup | Error aborts signup with 500 `signup_hook_failed` (billing customer creation must succeed) |
| `AuthHook.OnLogin(ctx, userID)` | After a successful login or OAuth callback | Error logged, login still succeeds |
| `AuthHook.OnAccountSoftDelete(ctx, userID)` | After `DELETE /v1/me` commits | Error logged, request still succeeds |
| `AuthHook.OnAccountHardDelete(ctx, userID)` | After the 15-day cleanup goroutine purges a soft-deleted account | Error logged, cleanup continues |
| `SandboxEventHook.OnSandboxEvent(ctx, ev)` | Capsule create/pause/resume/destroy success, from the Redis stream consumer | Error leaves the message un-acked — hooks **must** be idempotent |
| `LimitsProvider.EffectiveLimits(ctx, teamID)` | Consulted by `POST /v1/capsules` before scheduling | Returns 402 (`concurrent_sandbox_limit` / `vcpu_limit` / `memory_limit`) when over |
| `UsageProvider.CurrentUsage(ctx, teamID)` | Feeds `LimitsProvider` checks; falls back to OSS DB-backed default | Error → 402 `usage_unavailable` |

### Auth middleware helpers

For extensions that gate their own routes:

```go
r.With(cpextension.RequireSession(sctx)).Get("/billing", handler)
r.With(cpextension.RequireSessionOrAPIKey(sctx)).Get("/usage", handler)
r.With(cpextension.RequireSession(sctx), cpextension.RequireAdmin(sctx)).Get("/admin/exports", handler)

// Issue a session from a custom flow (e.g. invite-accept):
sess, err := cpextension.IssueSession(w, r, sctx, userID, teamID)
```

Cookie/header names are exported as `cpextension.SessionCookieName`, `CSRFCookieName`, `CSRFHeaderName`.

See `CLAUDE.md` for full architecture documentation.