1
0
forked from wrenn/wrenn
Co-authored-by: Tasnim Kabir Sadik <tksadik@omukk.dev>

Reviewed-on: wrenn/wrenn#50
This commit is contained in:
2026-05-24 21:10:37 +00:00
parent 4707f16c76
commit 05ddf62399
203 changed files with 15815 additions and 9344 deletions

225
README.md
View File

@ -5,11 +5,11 @@ Secure infrastructure for AI
## Prerequisites
- Linux host with `/dev/kvm` access (bare metal or nested virt)
- Firecracker binary at `/usr/local/bin/firecracker`
- Cloud Hypervisor binary at `/usr/local/bin/cloud-hypervisor`
- PostgreSQL
- Go 1.25+
- Rust 1.88+ with `x86_64-unknown-linux-musl` target (`rustup target add x86_64-unknown-linux-musl`)
- pnpm (for frontend)
- Bun (for frontend)
- Docker (for dev infra and rootfs builds)
## Build
@ -22,7 +22,7 @@ Produces three binaries: `wrenn-cp` (control plane), `wrenn-agent` (host agent),
## Host setup
The host agent needs a kernel, a minimal rootfs image, and working directories on the host machine.
The host agent needs a kernel, the system base rootfs images, and working directories on the host machine.
### Directory structure
@ -31,59 +31,74 @@ The host agent needs a kernel, a minimal rootfs image, and working directories o
├── kernels/
│ └── vmlinux # uncompressed Linux kernel (not bzImage)
├── images/
│ └── minimal/
│ └── rootfs.ext4 # base rootfs (all other templates snapshot from this)
│ └── teams/
│ └── 0000000000000000000000000/ # platform team (base36 all-zeros)
│ ├── 0000000000000000000000000/rootfs.ext4 # minimal-ubuntu (id 0)
│ ├── 0000000000000000000000001/rootfs.ext4 # minimal-alpine (id 1)
│ ├── 0000000000000000000000002/rootfs.ext4 # minimal-arch (id 2)
│ └── 0000000000000000000000003/rootfs.ext4 # minimal-fedora (id 3)
├── sandboxes/ # per-sandbox CoW files (created at runtime)
└── snapshots/ # pause/hibernate snapshot files (created at runtime)
```
Create the directories:
Create the base directories (the per-template image dirs are created by the build scripts):
```bash
sudo mkdir -p /var/lib/wrenn/{kernels,images/minimal,sandboxes,snapshots}
sudo mkdir -p /var/lib/wrenn/{kernels,images,sandboxes,snapshots}
```
### Kernel
Place an uncompressed `vmlinux` kernel at `/var/lib/wrenn/kernels/vmlinux`. Versioned kernels (`vmlinux-{semver}`) are also supported — the agent picks the latest by semver.
### Minimal rootfs
### System base rootfs images
The minimal rootfs is the base image that all other templates (Python, Node, etc.) are built on top of via device-mapper snapshots. It must contain:
There are four built-in **system base templates** — one per distro — that all other
templates snapshot from via device-mapper. They are platform-owned (visible to every
team) and protected from deletion (reserved template IDs 01024):
| Template | Distro | ID |
|----------|--------|----|
| `minimal-ubuntu` | `ubuntu:26.04` | 0 |
| `minimal-alpine` | `alpine:3.22` | 1 |
| `minimal-arch` | `archlinux:base` | 2 |
| `minimal-fedora` | `fedora:45` | 3 |
`minimal-ubuntu` is the default template for new sandboxes and builds. The same
statically-linked `envd` + `tini` run on all four regardless of the distro's libc
(glibc on Ubuntu/Arch/Fedora, musl on Alpine).
Each image contains these packages plus a `wrenn-user` account with passwordless `sudo`:
| Package | Why |
|---------|-----|
| `socat` | Bidirectional relay for port forwarding |
| `chrony` | Time sync from KVM PTP clock (`/dev/ptp0`) |
| `tini` | PID 1 zombie reaper (injected by build script, not apt) |
| `iproute2` (`iproute` on Fedora) | `ip` for guest network setup in `wrenn-init` |
| `tini` | PID 1 zombie reaper |
| `sudo` | User privilege management inside the guest |
| `wget` | HTTP fetching |
| `curl` | HTTP client |
| `ca-certificates` | TLS certificate verification |
| `git` | Version control |
**To build a rootfs from a Docker container:**
**To build all four images** (each spawns a distro container, installs the packages +
`wrenn-user`, builds `envd`, injects `wrenn-init` + `tini`, and exports to the
team-scoped path). Requires Docker + sudo:
1. Create and configure a container with the required packages:
```bash
docker run -it --name wrenn-minimal debian:bookworm bash
# Inside the container:
apt update && apt install -y socat chrony sudo wget curl ca-certificates
exit
```
```bash
make images
```
2. Export to a rootfs image (builds envd, injects wrenn-init + tini, shrinks to minimum size):
```bash
sudo bash scripts/rootfs-from-container.sh wrenn-minimal minimal
```
Or build a single distro: `make rootfs-ubuntu` / `rootfs-alpine` / `rootfs-arch` / `rootfs-fedora`.
**To update an existing rootfs** after changing envd or `wrenn-init.sh`:
**To update the images** after changing `envd` or `wrenn-init.sh` (rebuilds `envd` once,
then re-injects `envd` + `wrenn-init` + `tini` into every system base image):
```bash
bash scripts/update-minimal-rootfs.sh
```
This rebuilds envd via `make build-envd` and copies the fresh binaries into the mounted rootfs image.
### IP forwarding
```bash
@ -120,14 +135,7 @@ make check # fmt + vet + lint + test
Hosts must be registered with the control plane before they can serve sandboxes.
1. **Create a host record** (via API or dashboard):
```bash
curl -X POST http://localhost:8000/v1/hosts \
-H "Authorization: Bearer $JWT_TOKEN" \
-H "Content-Type: application/json" \
-d '{"type": "regular"}'
```
This returns a `registration_token` (valid for 1 hour).
1. **Create a host record** in the dashboard (admin only — host management is not exposed over the SDK / API keys). Sign in at `/login`, open the admin hosts page, and click **Add host**. The dashboard returns a `registration_token` valid for 1 hour.
2. **Start the host agent** with the registration token and its externally-reachable address:
```bash
@ -143,13 +151,152 @@ Hosts must be registered with the control plane before they can serve sandboxes.
sudo ./builds/wrenn-agent --address <host-ip>:50051
```
4. **If registration fails** (e.g., network error after token was consumed), regenerate a token:
```bash
curl -X POST http://localhost:8000/v1/hosts/$HOST_ID/token \
-H "Authorization: Bearer $JWT_TOKEN"
```
Then restart the agent with the new token.
4. **If registration fails** (e.g., network error after token was consumed), regenerate a token from the dashboard host detail page, then restart the agent with the new token.
The agent sends heartbeats to the control plane every 30 seconds.
## Notification channels
Teams can subscribe to lifecycle events via webhook, Discord, Slack, Teams, Google Chat, Telegram, or Matrix. All providers consume the same event stream (durable Redis stream `wrenn:events`, consumer group `wrenn-channels-v1`, at-least-once delivery with two retries at 10s / 30s).
### Subscribable event types
| Event | Emitted on | Has outcome |
|-------|-----------|-------------|
| `capsule.create` | First boot of a sandbox | yes |
| `capsule.pause` | Manual pause, TTL auto-pause, or reconciler-detected pause | yes |
| `capsule.resume` | Unpause (any subsequent boot after `capsule.create`) | yes |
| `capsule.destroy` | Stop / destroy, including system cleanup-on-error | yes |
| `template.snapshot.create` | Snapshot taken from a running sandbox | yes |
| `template.snapshot.delete` | Snapshot deletion (including cleanup-on-error) | yes |
| `host.up` | Host agent comes online | no |
| `host.down` | Host agent crashes or misses heartbeats | no |
Subscribing to an event type delivers **both success and failure**. The `outcome` field on the payload (`success` or `error`) distinguishes them. `error` events carry an `error` string with the failure reason.
The transient `capsule.state.changed` event (intermediate transitions like `starting`, `pausing`, `resuming`) is **not** subscribable — it is delivered to the dashboard via SSE only and never written to the durable stream.
### Event payload
All channels receive the same canonical JSON shape:
```json
{
"event": "capsule.pause",
"outcome": "success",
"timestamp": "2026-05-19T14:23:01Z",
"team_id": "tm_...",
"actor": {
"type": "user",
"id": "usr_...",
"name": "alice@example.com"
},
"resource": {
"id": "sb_a1b2c3d4",
"type": "sandbox"
},
"metadata": {
"reason": "ttl_expired"
},
"error": ""
}
```
| Field | Type | Notes |
|-------|------|-------|
| `event` | string | Event type (see table above) |
| `outcome` | `"success"` \| `"error"` \| `""` | Omitted for host.up/host.down |
| `timestamp` | RFC3339 UTC | When the event was published |
| `team_id` | string | Owning team |
| `actor.type` | `"user"` \| `"api_key"` \| `"system"` | System = TTL reaper, reconciler, cleanup-on-error |
| `actor.id` | string | User ID, API key ID, or empty for system |
| `actor.name` | string | Display name (email for user, label for api_key) |
| `resource.id` | string | Sandbox ID, snapshot ID, or host ID |
| `resource.type` | `"sandbox"` \| `"snapshot"` \| `"host"` | |
| `metadata` | object\<string,string\> | Event-specific context (e.g., `reason`, `from`/`to`, `inferred`) |
| `error` | string | Failure reason when `outcome == "error"` |
`metadata` keys you may observe:
- `reason` — `ttl_expired` (auto-pause), `orphaned` (reconciler cleanup), `cleanup_after_create_error`, `restored_after_host_recovery`, `host_state_sync`, `transient_timeout`, `transient_timeout_inferred`
- `inferred` — `"true"` when the reconciler derived the event from host state, not a direct host callback
### Webhook delivery
Webhook channels receive a raw `POST` with the JSON payload as the body.
Headers:
| Header | Value |
|--------|-------|
| `Content-Type` | `application/json` |
| `X-Wrenn-Delivery` | UUID, unique per delivery attempt |
| `X-Wrenn-Timestamp` | RFC3339 UTC, used for signature verification |
| `X-WRENN-SIGNATURE` | `sha256=<hex>` HMAC over `<timestamp>.<body>` using the channel's signing secret |
The signing secret is shown **once** at channel creation. Verify signatures by computing `HMAC-SHA256(secret, timestamp + "." + body)` and comparing to the header (constant-time compare). Reject deliveries where `X-Wrenn-Timestamp` is outside your acceptable clock skew window. Redirects are not followed.
Any non-2xx response triggers retry (10s, then 30s). After three total failures the event is dropped (logged on the control plane).
### Other providers
Discord, Slack, Teams, Google Chat, Telegram, and Matrix receive a formatted text message — the same fields, rendered as human-readable text — not the JSON payload. Use webhook if you need the structured event.
## Extending the control plane
The OSS control plane is designed to be embedded by a private cloud distribution without forking. Import this module, implement the `Extension` interface from `pkg/cpextension`, and pass it to `cpserver.Run`:
```go
import (
"git.omukk.dev/wrenn/wrenn/pkg/cpextension"
"git.omukk.dev/wrenn/wrenn/pkg/cpserver"
)
func main() {
cpserver.Run(
cpserver.WithVersion("cloud-1.0.0"),
cpserver.WithExtensions(&myExtension{}),
)
}
```
Every extension implements two methods:
```go
RegisterRoutes(r chi.Router, sctx cpextension.ServerContext)
BackgroundWorkers(sctx cpextension.ServerContext) []func(context.Context)
```
`ServerContext` exposes the initialized OSS services so extensions never re-implement them: `Queries`, `PgPool`, `Redis`, `HostPool`, `Scheduler`, `CA`, `Audit`, `Mailer`, `OAuthRegistry`, `Channels`, `ChannelPub`, `JWTSecret`, `Sessions`, `Config`.
### Optional hook interfaces
An extension can also implement any subset of these — the OSS server type-asserts at startup:
| Interface | When it fires | Failure semantics |
|---|---|---|
| `MiddlewareProvider` | Wraps every OSS route before registration | n/a |
| `AuthHook.OnSignup(ctx, userID, teamID, email)` | After team provisioning on email-activate or OAuth-new-signup | Error aborts signup with 500 `signup_hook_failed` (billing customer creation must succeed) |
| `AuthHook.OnLogin(ctx, userID)` | After a successful login or OAuth callback | Error logged, login still succeeds |
| `AuthHook.OnAccountSoftDelete(ctx, userID)` | After `DELETE /v1/me` commits | Error logged, request still succeeds |
| `AuthHook.OnAccountHardDelete(ctx, userID)` | After the 15-day cleanup goroutine purges a soft-deleted account | Error logged, cleanup continues |
| `SandboxEventHook.OnSandboxEvent(ctx, ev)` | Capsule create/pause/resume/destroy success, from the Redis stream consumer | Error leaves the message un-acked — hooks **must** be idempotent |
| `LimitsProvider.EffectiveLimits(ctx, teamID)` | `POST /v1/capsules` consults before scheduling | Returns 402 (`concurrent_sandbox_limit` / `vcpu_limit` / `memory_limit`) when over |
| `UsageProvider.CurrentUsage(ctx, teamID)` | Feeds `LimitsProvider` checks; falls back to OSS DB-backed default | Error → 402 `usage_unavailable` |
### Auth middleware helpers
For extensions that gate their own routes:
```go
r.With(cpextension.RequireSession(sctx)).Get("/billing", handler)
r.With(cpextension.RequireSessionOrAPIKey(sctx)).Get("/usage", handler)
r.With(cpextension.RequireSession(sctx), cpextension.RequireAdmin(sctx)).Get("/admin/exports", handler)
// Issue a session from a custom flow (e.g. invite-accept):
sess, err := cpextension.IssueSession(w, r, sctx, userID, teamID)
```
Cookie/header names are exported as `cpextension.SessionCookieName`, `CSRFCookieName`, `CSRFHeaderName`.
See `CLAUDE.md` for full architecture documentation.