Implement a channels system for notifying teams via external providers
(Discord, Slack, Teams, Google Chat, Telegram, Matrix, webhook) when
lifecycle events occur (capsule/template/host state changes).
- Channel CRUD API under /v1/channels (JWT-only auth)
- Test endpoint to verify config before saving (POST /v1/channels/test)
- Secret rotation endpoint (PUT /v1/channels/{id}/config)
- AES-256-GCM encryption for provider secrets (WRENN_ENCRYPTION_KEY)
- Redis stream event publishing from audit logger
- Background dispatcher with consumer group and retry (10s, 30s)
- Webhook delivery with HMAC-SHA256 signing (X-WRENN-SIGNATURE)
- shoutrrr integration for chat providers
- Secrets never exposed in API responses
Replace Business Source License with Apache License Version 2.0 across
LICENSE, envd/LICENSE, and NOTICE. Update NOTICE to remove BSL-era
framing that singled out Apache-only portions.
Change host marked_down/marked_up audit log scope from "admin" to "team" so
BYOC team members can see when their hosts go unreachable or recover. Rename
BYOC sidebar entry to Hosts, add placeholder billing/usage pages, disable
unimplemented notifications/settings links, and point docs to external site.
- Add auth failure logging (login, API key, JWT) with IP/email/prefix
- Move OAuth JWT from URL params to short-lived cookies to prevent
token leakage via browser history, server logs, and Referer headers
- Pin Swagger UI to v5.18.2 with SRI integrity hashes
- Upgrade Go toolchain to 1.25.8 (fixes 5 called stdlib vulns)
- Fix unchecked error in host agent credential refresh
- Add .gstack to .gitignore for security report artifacts
Both the control plane and host agent now refuse to start without valid
mTLS configuration, closing the unauthenticated proxy/RPC attack surface
that existed when running in plain HTTP fallback mode.
- Fix envRegex: remove spurious (\$)? group that swallowed $$$, handle ${}
- wrenn-init.sh: add || true to networking commands under set -e, remove dead code
- waitForHealthcheck: use context deadline for unlimited retries instead of implicit 100 cap
- Make parseSandboxEnv a package-level function (unused receiver)
- Fix WrappedCommand test: map iteration order dependency, pre-expand env values
- Fix error wrapping: %v → %w per project conventions
- test-jupyter-kernel.py: move import to top-level, fix misleading comment
healthchecks
Fix ENV instructions to expand $VAR references at set time using the
current env state, preventing self-referencing values like
PATH=/opt/venv/bin:$PATH from producing recursive expansions. Remove
expandEnv from shellPrefix to avoid double expansion.
Fetch sandbox environment variables via `env` before recipe execution
so ENV steps resolve against actual runtime values from the base
template image.
Replace hardcoded healthcheck timing with a Dockerfile-like flag parser
supporting --interval, --timeout, --start-period, and --retries. Add
start-period grace window and bounded retry counting to
waitForHealthcheck.
Add python-interpreter-v0-beta recipe and healthcheck files.
The envd port scanner used gopsutil's net.Connections() which walks
/proc/{pid}/fd to enumerate socket inodes. This corrupts Go runtime
semaphore state when the VM is paused mid-operation and restored from
a Firecracker snapshot.
Replace with a direct /proc/net/tcp + /proc/net/tcp6 parser that reads
a single file per address family — no /proc/{pid}/fd walk, no goroutines,
no WaitGroups. Also replace concurrent-map (smap) in the scanner with a
plain sync.RWMutex-protected map, since concurrent-map's Items() spawns
goroutines with a WaitGroup internally, which is equally unsafe across
snapshot boundaries.
Use socket inode instead of PID for the port forwarding map key, since
inode is available directly from /proc/net/tcp without the fd walk.
Introduce ConnTracker (atomic.Bool + WaitGroup) to track in-flight proxy
connections per sandbox. Before pausing a VM, the manager drains active
connections with a 2s grace period, preventing Go runtime corruption
inside the guest caused by stale TCP state surviving Firecracker
snapshot/restore.
Also add:
- AcquireProxyConn on Manager for atomic lookup + connection tracking
- Proxy cache (120s TTL) on CP SandboxProxyWrapper with single-query
DB lookup (GetSandboxProxyTarget) to avoid two round-trips
- Reset() on ConnTracker to re-enable connections if pause fails
- skip_pre_post flag on builds bypasses apt update/clean pre/post steps for
faster iteration when the recipe handles its own environment setup
- POST /v1/admin/builds/{id}/cancel endpoint marks an in-progress build as
cancelled; UpdateBuildStatus now also sets completed_at for 'cancelled'
- internal/recipe: typed recipe parser and executor (RUN/ENV/COPY steps)
replacing the raw string slice approach in the build worker
- pre/post build commands prefixed with RUN to match recipe step format
- Internal ECDSA P-256 CA (WRENN_CA_CERT/WRENN_CA_KEY env vars); when absent
the system falls back to plain HTTP so dev mode works without certificates
- Host leaf cert (7-day TTL, IP SAN) issued at registration and renewed on
every JWT refresh; fingerprint + expiry stored in DB (cert_expires_at column
replaces the removed mtls_enabled flag)
- CP ephemeral client cert (24-hour TTL) via CPCertStore with atomic hot-swap;
background goroutine renews it every 12 hours without restarting the server
- Host agent uses tls.Listen + httpServer.Serve so GetCertificate callback is
respected (ListenAndServeTLS always reads cert from disk)
- Sandbox reverse proxy now uses pool.Transport() so it shares the same TLS
config as the Connect RPC clients instead of http.DefaultTransport
- Credentials file renamed host-credentials.json with cert_pem/key_pem/
ca_cert_pem fields; duplicate register/refresh response structs collapsed
to authResponse
- Change sandbox ID prefix from sb- to cl- (capsule) throughout
- Fix proxy URL regex character class: base36 uses 0-9a-z, not just hex
- Add MMDS V2 config and metadata to VM boot flow so envd can read
WRENN_SANDBOX_ID and WRENN_TEMPLATE_ID from inside the guest
- Pass TemplateID through VMConfig into both fresh and snapshot boot paths
Always use Firecracker Diff snapshots (fast, only changed pages) and
merge diff files at the file level when the generation cap is reached.
The previous approach used Firecracker's Full snapshot type which dumps
all memory to disk and can timeout, losing all snapshot data on failure.
Add snapshot.MergeDiffs() which reads each block from the appropriate
generation's diff file via the header mapping and writes them into a
single consolidated file with a fresh generation-0 header.
Rename ns-{idx} to wrenn-ns-{idx} and veth-{idx} to wrenn-veth-{idx}
to avoid collisions with other tools. Add CleanupStaleNamespaces() at
agent startup to remove orphaned namespaces, veths, iptables rules, and
routes from a previous crash. Lower maxDiffGenerations from 10 to 8 to
prevent Go runtime memory corruption from snapshot/restore drift.
Insert a minimal template row (all-zeros UUID) so it appears in both
team and admin template listings. Guard delete endpoints to prevent
removal of the minimal template.
Introduces internal/layout package for centralized path construction,
migrates templates from name-based TEXT primary keys to UUID PKs with
team-scoped directories (WRENN_DIR/images/teams/{team_id}/{template_id}).
The built-in minimal template uses sentinel zero UUIDs. Proto messages
carry team_id + template_id alongside deprecated template name field.
Team deletion now cleans up template files across all hosts.
Snapshot race fix:
- Pre-mark sandbox as "paused" in DB before issuing CreateSnapshot and
PauseSandbox RPCs, preventing the reconciler from marking it "stopped"
during the flatten window when the sandbox is gone from the host
agent's in-memory map but DB still says "running"
- Revert status to "running" on RPC failure
- Check ctx.Err() before writing response to avoid writing to dead
connections when client disconnects during long snapshot operations
Delete auth fix:
- Block non-admin deletion of platform templates (team_id = all-zeros)
at DELETE /v1/snapshots/{name} with 403, preventing file deletion
before the team ownership check fails
Sparse dd:
- Add conv=sparse to dd in FlattenSnapshot so flattened images preserve
sparseness (~200MB actual vs 5GB logical)
Default disk size:
- Change default disk_size_mb from 20GB to 5GB across migration,
manager, service, build, and EnsureImageSizes
- Disable split-button dropdown arrow for platform templates in
dashboard snapshots page (teams cannot delete platform templates)
DB stays native UUID; the format/parse layer now encodes 16 UUID bytes
as 25-char lowercase alphanumeric (base36) strings instead of the
standard 36-char hex-with-dashes format. e.g. sb-2e5glxi4g3qnhwci95qev0cg0
- DELETE /v1/admin/templates/{name} endpoint (admin-only)
- Broadcasts DeleteSnapshot RPC to all online hosts before removing DB record
- Frontend admin templates page uses deleteAdminTemplate() instead of
team-scoped deleteSnapshot()
- Delete button shown for all template types, not just snapshots
Disk sizing:
- Add disk_size_mb column to sandboxes table (default 20480 = 20GB)
- Add disk_size_mb to CreateSandboxRequest proto, passed through the
full chain: service → RPC → host agent → sandbox manager → devicemapper
- devicemapper.CreateSnapshot takes separate cowSizeBytes param so the
sparse CoW file can be sized independently from the origin
- EnsureImageSizes() runs at host agent startup: expands any base image
smaller than 20GB via truncate + resize2fs (sparse, no extra physical
disk). Sandboxes then get the full 20GB via fast dm-snapshot path
- FlattenRootfs shrinks output images with resize2fs -M so stored
templates are compact; EnsureImageSizes re-expands on next startup
Admin templates visibility:
- Add GET /v1/admin/templates endpoint listing all templates across teams
- Frontend admin templates page uses listAdminTemplates() instead of
team-scoped listSnapshots()
- Platform templates (team_id = all-zeros UUID) now visible to all teams:
GetTemplateByTeam, ListTemplatesByTeam, ListTemplatesByTeamAndType
queries include platform team_id in WHERE clause
Consolidate 16 migrations into one with UUID columns for all entity
IDs. TEXT is kept only for polymorphic fields (audit_logs.actor_id,
resource_id) and template names. The id package now generates UUIDs
via google/uuid, with Format*/Parse* helpers for the prefixed wire
format (sb-{uuid}, usr-{uuid}, etc.). Auth context, services, and
handlers pass pgtype.UUID internally; conversion to/from prefixed
strings happens at API and RPC boundaries. Adds PlatformTeamID
(all-zeros UUID) for shared resources.
- Use context.Background() with timeout in destroySandbox/failBuild so
cleanup and DB writes survive parent context cancellation on shutdown
- Fix loop device refcount leak in FlattenRootfs when dmDevice is nil
- Replace time.After with time.NewTimer in healthcheck polling to avoid
goroutine leak when healthcheck passes early
- Capture size_bytes from CreateSnapshot/FlattenRootfs RPC responses
instead of hardcoding 0 in the templates table insert
- Avoid leaking internal error details to API clients in build handler
Introduces an end-to-end template building pipeline: admins submit a recipe
(list of shell commands) via the dashboard, a Redis-backed worker pool spins
up a sandbox, executes each command, and produces either a full snapshot
(with healthcheck) or an image-only template (rootfs flattened via a new
FlattenRootfs host-agent RPC). Build progress and per-step logs are persisted
to a new template_builds table and polled by the frontend.
Backend:
- New FlattenRootfs RPC (proto + host agent + sandbox manager)
- BuildService with Redis queue (BLPOP) and configurable worker pool (default 2)
- Admin-only REST endpoints: POST/GET /v1/admin/builds, GET /v1/admin/builds/{id}
- Migration for template_builds table with JSONB logs and recipe columns
- sqlc queries for build CRUD and progress updates
Frontend:
- /admin/templates page with Templates + Builds tabs
- Create Template dialog with recipe textarea, healthcheck, specs
- Build history with expandable per-step logs, status badges, progress bars
- Auto-polling every 3s for active builds
- AdminSidebar updated with Templates nav item
Switch from the envd /init endpoint pushing host time via syscall to
chronyd reading the KVM PTP hardware clock (/dev/ptp0) continuously.
This fixes clock drift between init calls and handles snapshot resume
gracefully.
Changes:
- Add clocksource=kvm-clock kernel boot arg
- Start chronyd in wrenn-init.sh before tini (PHC /dev/ptp0, makestep 1.0 -1)
- Remove clock_settime logic from envd SetData and shouldSetSystemTime
- Remove client.Init() clock sync calls from sandbox manager (3 sites)
- Remove Init() method from envdclient (no longer needed)
- Simplify rootfs scripts: socat/chrony now come from apt in the container
image, only envd/wrenn-init/tini are injected by build scripts
Add Caddy to docker-compose as the single entry point on port 8000:
- localhost -> /api/* stripped and proxied to CP:8080, /* to frontend:5173
- *.localhost -> proxied to CP:8080 (sandbox proxy catch-all)
- Direct /v1/*, /auth/*, /docs routes proxied to CP
Move CP from :8000 to :8080 (its default). Caddy takes :8000.
Update .env.example, vite proxy target (kept as fallback), and Makefile
dev targets (pg_isready via docker exec, frontend binds 0.0.0.0).
This is an intermediate state — needs further work for the full code
interpreter feature.
Add SandboxProxyWrapper that intercepts requests with Host headers
matching {port}-{sandbox_id}.{domain} and proxies them through the
owning host agent's /proxy endpoint.
Authentication is via X-API-Key only (no JWT). The API key's team must
own the sandbox. Export EnsureScheme from lifecycle package for reuse.
Request flow: SDK -> Caddy -> CP catch-all -> Host Agent -> sandbox VM.
This is an intermediate state — needs further work for the full code
interpreter feature.
Add /proxy/{sandbox_id}/{port}/* handler that reverse-proxies HTTP
requests to services running inside sandbox VMs. The sandbox's host IP
(10.11.0.{idx}) is used as the upstream target.
Includes port validation (1-65535) and shared HTTP transport for
connection pooling. Supports WebSocket upgrades for protocols like
Jupyter's streaming API.
This is an intermediate state — needs further work for the full code
interpreter feature.
Inject a statically-linked socat binary into rootfs images. envd's
port forwarder requires socat to bridge localhost-listening services
(e.g. Jupyter kernel) to the guest TAP interface.
Both scripts follow the same 3-step resolution: check rootfs, check
host, build from source (http://www.dest-unreach.org/socat/ v1.8.1.1).
Static linkage is verified before injection.
This is an intermediate state — needs further work for the full code
interpreter feature.