v0.0.1 #8

Merged
pptx704 merged 118 commits from dev into main 2026-04-09 19:24:50 +00:00
Owner

This PR merges the full v2 development cycle into main. It represents a complete platform buildout on top of the initial Firecracker prototype: multi-tenant auth, a production-grade control plane, a polished dashboard, and a host of reliability and security improvements.

Summary

Infrastructure & Security

  • Enforce mandatory mTLS for CP↔agent communication
  • Security hardening from CSO audit (path traversal fixes, LIKE injection fix, network cleanup leaks)
  • Fixed IP collision, pause race condition, proxy path routing, and connection drain on pause

Host & Multi-Host

  • Host registration, heartbeat, and multi-host scheduling
  • BYOC (Bring Your Own Compute) team support with admin panel and JWT refresh tokens
  • Host up/down events surfaced in audit log and exposed to BYOC teams
  • Switched Redis dependency to KeyDB

Sandbox Capabilities

  • HTTP proxy endpoint: per-sandbox reverse proxy with Caddy integration and port routing
  • Environment variable expansion inside guest VMs + sandbox env fetching
  • Configurable healthchecks; fixed clock sync via chrony (replaced one-shot clock_settime)
  • Pre/post build stages for template builds; async build workers

Templates & Snapshots

  • Device-mapper CoW snapshots with UFFD lazy memory loading
  • Diff snapshots for re-pause (avoids UFFD fault-in storm)
  • UUID-based template IDs with team-scoped directory layout
  • Admin template creation, deletion, and broadcast to all hosts
  • Seed minimal template in DB; protect it from deletion

Observability

  • Per-sandbox CPU/memory/disk metrics with time-range filtering (5m–12h)
  • Audit log infrastructure with frontend page
  • Per-capsule live stats page with CPU/RAM charts

Notification Channels

  • Notification channel system with provider integrations (webhook, etc.) and retry logic
  • Channel audit logging, name cleaning, and message formatting
  • Dashboard UI for managing channels

Auth & Teams

  • GitHub OAuth login with provider registry
  • Team management: create/invite/remove members, team-scoped sandbox access control
  • Admin users, API key management, BYOC team visibility gating

Frontend (Dashboard)

  • SvelteKit dashboard with full page coverage: capsules, templates, metrics, audit logs, channels, team settings, admin panel
  • Dark-first design system: Manrope/Instrument Serif/JetBrains Mono type stack, warm-black palette
  • Live background polling without table blink; Bits UI accessible components

License

  • Relicensed from BSL 1.1 → Apache 2.0

(PR Description generated by Claude Sonnet 4.6)

This PR merges the full v2 development cycle into main. It represents a complete platform buildout on top of the initial Firecracker prototype: multi-tenant auth, a production-grade control plane, a polished dashboard, and a host of reliability and security improvements. ## Summary **Infrastructure & Security** - Enforce mandatory mTLS for CP↔agent communication - Security hardening from CSO audit (path traversal fixes, LIKE injection fix, network cleanup leaks) - Fixed IP collision, pause race condition, proxy path routing, and connection drain on pause **Host & Multi-Host** - Host registration, heartbeat, and multi-host scheduling - BYOC (Bring Your Own Compute) team support with admin panel and JWT refresh tokens - Host up/down events surfaced in audit log and exposed to BYOC teams - Switched Redis dependency to KeyDB **Sandbox Capabilities** - HTTP proxy endpoint: per-sandbox reverse proxy with Caddy integration and port routing - Environment variable expansion inside guest VMs + sandbox env fetching - Configurable healthchecks; fixed clock sync via chrony (replaced one-shot `clock_settime`) - Pre/post build stages for template builds; async build workers **Templates & Snapshots** - Device-mapper CoW snapshots with UFFD lazy memory loading - Diff snapshots for re-pause (avoids UFFD fault-in storm) - UUID-based template IDs with team-scoped directory layout - Admin template creation, deletion, and broadcast to all hosts - Seed minimal template in DB; protect it from deletion **Observability** - Per-sandbox CPU/memory/disk metrics with time-range filtering (5m–12h) - Audit log infrastructure with frontend page - Per-capsule live stats page with CPU/RAM charts **Notification Channels** - Notification channel system with provider integrations (webhook, etc.) and retry logic - Channel audit logging, name cleaning, and message formatting - Dashboard UI for managing channels **Auth & Teams** - GitHub OAuth login with provider registry - Team management: create/invite/remove members, team-scoped sandbox access control - Admin users, API key management, BYOC team visibility gating **Frontend (Dashboard)** - SvelteKit dashboard with full page coverage: capsules, templates, metrics, audit logs, channels, team settings, admin panel - Dark-first design system: Manrope/Instrument Serif/JetBrains Mono type stack, warm-black palette - Live background polling without table blink; Bits UI accessible components **License** - Relicensed from BSL 1.1 → Apache 2.0 (PR Description generated by Claude Sonnet 4.6)
pptx704 added 52 commits 2026-03-24 23:32:00 +00:00
- Copy envd source from e2b-dev/infra, internalize shared dependencies
  into envd/internal/shared/ (keys, filesystem, id, smap, utils)
- Switch from gRPC to Connect RPC for all envd services
- Update module paths to git.omukk.dev/wrenn/{sandbox,sandbox/envd}
- Add proto specs (process, filesystem) with buf-based code generation
- Implement full envd: process exec, filesystem ops, port forwarding,
  cgroup management, MMDS integration, and HTTP API
- Update main module dependencies (firecracker SDK, pgx, goose, etc.)
- Remove placeholder .gitkeep files replaced by real implementations
Implements Phase 1: boot a Firecracker microVM, execute a command inside
it via envd, and get the output back. Uses raw Firecracker HTTP API via
Unix socket (not the Go SDK) for full control over the VM lifecycle.

- internal/vm: VM manager with create/pause/resume/destroy, Firecracker
  HTTP client, process launcher with unshare + ip netns exec isolation
- internal/network: per-sandbox network namespace with veth pair, TAP
  device, NAT rules, and IP forwarding
- internal/envdclient: Connect RPC client for envd process/filesystem
  services with health check retry
- cmd/host-agent: demo binary that boots a VM, runs "echo hello", prints
  output, and cleans up
- proto/envd: canonical proto files with buf + protoc-gen-connect-go
  code generation
- images/wrenn-init.sh: minimal PID 1 init script for guest VMs
- CLAUDE.md: updated architecture to reflect TAP networking (not vsock)
  and Firecracker HTTP API (not Go SDK)
Remove duplicate proto files from envd/spec/ and update envd's
buf.gen.yaml to generate stubs from the canonical proto/envd/ location.
Both modules now generate their own Connect RPC stubs from the same
source protos.
Implement the host agent as a Connect RPC server that orchestrates
sandbox creation, destruction, pause/resume, and command execution.
Includes sandbox manager with TTL-based reaper, network slot allocator,
rootfs cloning, hostagent proto definition with generated stubs, and
test/debug scripts. Fix Firecracker process lifetime bug where VM was
tied to HTTP request context instead of background context.
- REST API (chi router): sandbox CRUD, exec, pause/resume, file write/read
- PostgreSQL persistence via pgx/v5 + sqlc (sandboxes table with goose migration)
- Connect RPC client to host agent for all VM operations
- Reconciler syncs host agent state with DB every 30s (detects TTL-reaped sandboxes)
- OpenAPI 3.1 spec served at /openapi.yaml, Swagger UI at /docs
- Added WriteFile/ReadFile RPCs to hostagent proto and implementations
- File upload via multipart form, download via JSON body POST
- sandbox_id propagated from control plane to host agent on create
Add WebSocket-based streaming exec endpoint and streaming file
upload/download endpoints to the control plane API. Includes new
host agent RPC methods (ExecStream, StreamWriteFile, StreamReadFile),
envd client streaming support, and OpenAPI spec updates.
Add resolv.conf to wrenn-init so guests can resolve DNS, and fix the
host MASQUERADE rule to match vpeerIP (the actual source after namespace
SNAT) instead of hostIP.
CLAUDE.md: replace bloated 850-line version with focused 230-line
guide. Fix inaccuracies (module path, build dir, Connect RPC vs gRPC,
buf vs protoc). Add detailed architecture with request flows, code
generation workflow, rootfs update process, and two-module gotchas.

README.md: add core deployment instructions (prerequisites, build,
host setup, configuration, running, rootfs workflow).
Implement full snapshot lifecycle: pause (snapshot + free resources),
resume (UFFD-based lazy restore), and named snapshot templates that
can spawn new sandboxes from frozen VM state.

Key changes:
- Snapshot header system with generational diff mapping (inspired by e2b)
- UFFD server for lazy page fault handling during snapshot restore
- Stable rootfs symlink path (/tmp/fc-vm/) for snapshot compatibility
- Templates DB table and CRUD API endpoints (POST/GET/DELETE /v1/snapshots)
- CreateSnapshot/DeleteSnapshot RPCs in hostagent proto
- Reconciler excludes paused sandboxes (expected absent from host agent)
- Snapshot templates lock vcpus/memory to baked-in values
- Proper cleanup of uffd sockets and pause snapshot files on destroy
- Replace reflink rootfs copy with device-mapper snapshots (shared
  read-only loop device per base template, per-sandbox sparse CoW file)
- Add devicemapper package with create/restore/remove/flatten operations
  and refcounted LoopRegistry for base image loop devices
- Fix pause ordering: destroy VM before removing dm-snapshot to avoid
  "device busy" error (FC must release the dm device first)
- Add test UI at GET /test for sandbox lifecycle management (create,
  pause, resume, destroy, exec, snapshot create/list/delete)
- Fix DirSize to report actual disk usage (stat.Blocks * 512) instead
  of apparent size, so sparse CoW files report correctly
- Add timing logs to pause flow for performance diagnostics
- Fix all lint errors across api, network, vm, uffd, and sandbox packages
- Remove obsolete internal/filesystem package (replaced by devicemapper)
- Update CLAUDE.md with device-mapper architecture documentation
Add SafeName validator (allowlist regex) to reject directory traversal
in user-supplied template and snapshot names. Validated at both API
handlers (400 response) and sandbox manager (defense in depth).

Refactor CreateNetwork with rollback slice so partially created
resources (namespace, veth, routes, iptables rules) are cleaned up
on any error. Refactor RemoveNetwork to collect and return errors
instead of silently ignoring them.
Use Firecracker's Diff snapshot type when re-pausing a previously
resumed sandbox, capturing only dirty pages instead of a full memory
dump. Chains up to 10 incremental generations before collapsing back
to a Full snapshot. Multi-generation diff files (memfile.{buildID})
are supported alongside the legacy single-file format in resume,
template creation, and snapshot existence checks.
Implement email/password auth with JWT sessions and API key auth for
sandbox lifecycle. Users get a default team on signup; sandboxes,
snapshots, and API keys are scoped to teams.

- Add user, team, users_teams, and team_api_keys tables (goose migrations)
- Add JWT middleware (Bearer token) for user management endpoints
- Add API key middleware (X-API-Key header, SHA-256 hashed) for sandbox ops
- Add signup/login handlers with transactional user+team creation
- Add API key CRUD endpoints (create/list/delete)
- Replace owner_id with team_id on sandboxes and templates
- Update all handlers to use team-scoped queries
- Add godotenv for .env file loading
- Update OpenAPI spec and test UI with auth flows
Pause was logging RemoveSnapshot failures as warnings and continuing,
which left stale dm devices behind. Resume then failed trying to create
a device with the same name.

- Make RemoveSnapshot failure a hard error in Pause (clean up remaining
  resources and return error instead of silently proceeding)
- Add defensive stale device cleanup in RestoreSnapshot before creating
  the new dm device
- Add retry with backoff to dmsetupRemove for transient "device busy"
  errors caused by kernel not releasing the device immediately after
  Firecracker exits. Only retries on "Device or resource busy"; other
  errors (not found, permission denied) return immediately.

- Thread context.Context through RemoveSnapshot/RestoreSnapshot so
  retries respect cancellation. Use context.Background() in all error
  cleanup paths to prevent cancelled contexts from skipping cleanup
  and leaking dm devices on the host.

- Resume vCPUs on pause failure: if snapshot creation or memfile
  processing fails after freezing the VM, unfreeze vCPUs so the
  sandbox stays usable instead of becoming a frozen zombie.

- Fix resource leaks in Pause when CoW rename or metadata write fails:
  properly clean up network, slot, loop device, and remove from boxes
  map instead of leaving a dead sandbox with leaked host resources.

- Fix Resume WaitUntilReady failure: roll back CoW file to the snapshot
  directory instead of deleting it, preserving the paused state so the
  user can retry.

- Skip m.loops.Release when RemoveSnapshot fails during pause since
  the stale dm device still references the origin loop device.

- Fix incorrect VCPUs placeholder in Resume VMConfig that used memory
  size instead of a sensible default.
Replace the existing auto-destroy TTL behavior with auto-pause: when a
sandbox exceeds its timeout_sec of inactivity, the TTL reaper now pauses
it (snapshot + teardown) instead of destroying it, preserving the ability
to resume later.

Key changes:
- TTL reaper calls Pause instead of Destroy, with fallback to Destroy if
  pause fails (e.g. Firecracker process already gone)
- New PingSandbox RPC resets the in-memory LastActiveAt timer
- New POST /v1/sandboxes/{id}/ping REST endpoint resets both agent memory
  and DB last_active_at
- ListSandboxes RPC now includes auto_paused_sandbox_ids so the reconciler
  can distinguish auto-paused sandboxes from crashed ones in a single call
- Reconciler polls every 5s (was 30s) and marks auto-paused as "paused"
  vs orphaned as "stopped"
- Resume RPC accepts timeout_sec from DB so TTL survives pause/resume cycles
- Reaper checks every 2s (was 10s) and uses a detached context to avoid
  incomplete pauses on app shutdown
- Default timeout_sec changed from 300 to 0 (no auto-pause unless requested)
Implement OAuth 2.0 login via GitHub as an alternative to email/password.
Uses a provider registry pattern (internal/auth/oauth/) so adding Google
or other providers later requires only a new Provider implementation.

Flow: GET /v1/auth/oauth/github redirects to GitHub, callback exchanges
the code for a user profile, upserts the user + team atomically, and
redirects to the frontend with a JWT token.

Key changes:
- Migration: make password_hash nullable, add oauth_providers table
- Provider registry with GitHubProvider (profile + email fallback)
- CSRF state cookie with HMAC-SHA256 validation
- Race-safe registration (23505 collision retries as login)
- Startup validation: CP_PUBLIC_URL required when OAuth is configured

Not fully tested — needs integration tests with a real GitHub OAuth app
and end-to-end testing with the frontend callback page.
Moves business logic from API handlers into internal/service/ so that
both the REST API and the upcoming dashboard can share the same operations
without duplicating code. API handlers now delegate to the service layer
and only handle HTTP-specific concerns (request parsing, response formatting).
The internal/admin/ package was never imported or mounted — just
placeholder files. Removing to avoid confusion before the real
dashboard is built.
Introduce three migrations: admin permissions (is_admin + permissions table),
BYOC team tracking, and multi-host support (hosts, host_tokens, host_tags).
Add Redis to dev infra and wire up client in control plane for ephemeral
host registration tokens. Add go-redis dependency.
Implements the full host ↔ control plane connection flow:

- Host CRUD endpoints (POST/GET/DELETE /v1/hosts) with role-based access:
  regular hosts admin-only, BYOC hosts for admins and team owners
- One-time registration token flow: admin creates host → gets token (1hr TTL
  in Redis + Postgres audit trail) → host agent registers with specs → gets
  long-lived JWT (1yr)
- Host agent registration client with automatic spec detection (arch, CPU,
  memory, disk) and token persistence to disk
- Periodic heartbeat (30s) via POST /v1/hosts/{id}/heartbeat with X-Host-Token
  auth and host ID cross-check
- Token regeneration endpoint (POST /v1/hosts/{id}/token) for retry after
  failed registration
- Tag management (add/remove/list) with team-scoped access control
- Host JWT with typ:"host" claim, cross-use prevention in both VerifyJWT and
  VerifyHostJWT
- requireHostToken middleware for host agent authentication
- DB-level race protection: RegisterHost uses AND status='pending' with
  rows-affected check; Redis GetDel for atomic token consume
- Migration for future mTLS support (cert_fingerprint, mtls_enabled columns)
- Host agent flags: --register (one-time token), --address (required ip:port)
- serviceErrToHTTP extended with "forbidden" → 403 mapping
- OpenAPI spec, .env.example, and README updated
Replace AGENT_KERNEL_PATH, AGENT_IMAGES_PATH, AGENT_SANDBOXES_PATH,
AGENT_SNAPSHOTS_PATH, and AGENT_TOKEN_FILE with a single
AGENT_FILES_ROOTDIR (default /var/lib/wrenn) that derives all
subdirectory paths automatically.
Reviewed-on: wrenn/sandbox#1
Co-authored-by: pptx704 <rafeed@omukk.dev>
Co-committed-by: pptx704 <rafeed@omukk.dev>
- Use tini as PID 1 in wrenn-init.sh so zombie processes are reaped
  and signals are forwarded correctly to envd
- Set standard PATH in wrenn-init.sh so child processes spawned by envd
  can find common binaries (fixes "nice: ls command not found")
- Add envdclient.Init() to POST /init on envd after every boot/resume,
  syncing the guest clock via unix.ClockSettime — critical after snapshot
  resume where the guest clock is frozen
- Run Init in a background goroutine so it doesn't block the CreateSandbox
  RPC response; a slow Init (vCPU busy with envd startup) was causing the
  RPC context to be canceled before the response reached the control plane
- Update rootfs-from-container.sh and update-debug-rootfs.sh to inject
  tini into the rootfs, checking the container image and host first,
  downloading from GitHub releases as fallback
- Snapshot delete: make agent RPC failure a hard error so DB record is
  not removed when files cannot be deleted from disk
- Snapshot overwrite: call agent to delete old files before removing the
  DB record, preventing stale memfile.{uuid} generations from accumulating
  on disk across repeated overwrites
- Sandbox destroy: only swallow CodeNotFound from the agent (sandbox
  already gone / TTL-reaped); any other error now propagates to the caller
  instead of being silently ignored
- Increase content padding (p-7→p-8) and table cell padding (px-4→px-5,
  py-3→py-4 for data rows) across capsules, keys, and snapshots pages
- Improve animation performance: wrenn-glow uses opacity instead of
  box-shadow (compositor-only, no paint cost)
- Add prefers-reduced-motion media query covering inline style animations
- Fix OAuth error display on login page (read ?error= param on mount)
- Harden clipboard copy with try-catch and toast fallback
- Improve empty state copy, dialog microcopy, and error messages
- Add retry button to error banners on keys page
- Replace "All systems operational" footer bar with a clean 1px divider
- Fix text truncation on long capsule/snapshot names (min-w-0 + truncate)
Reviewed-on: wrenn/sandbox#3
- Three-role model (owner/admin/member) with owner protection invariants
- Team CRUD: create, rename (admin+), soft-delete with VM cleanup (owner only)
- Member management: add by email, remove, role updates (admin+), leave
- Switch-team endpoint re-issues JWT after DB membership verification
- User email prefix search for add-member UI autocomplete
- JWT carries role as a hint; all authorization decisions verified from DB
- Team slug: immutable 12-char hex (e.g. a1b2c3-d1e2f3), reserved on soft-delete
- Migration adds slug + deleted_at to teams; backfills existing rows
- New /dashboard/team page with inline team name editing, slug/ID copy,
  members table with split-button (remove + make admin/member), add member
  typeahead, and danger zone (delete/leave) with confirmation dialogs
- Sidebar now fetches real teams from API, supports team switching and
  team creation via dialog
- Rename nav item Members → Team, route /dashboard/members → /dashboard/team
- New src/lib/api/team.ts with typed functions for all team endpoints
- Allow hyphens, @, and apostrophes in team names (backend regex)
- After delete/leave, switch to next available team instead of logging
  out; if no teams remain, show a toast prompting to create one
- Disable delete/leave button when user has only one team, with
  explanatory hint to create another team first
- Show empty state on /dashboard/team when auth has no team context,
  pointing user to the sidebar to create a team
- Fetch all teams in parallel with team detail on page load to power
  the isLastTeam guard
The anti-enumeration guard required @ in the email prefix, causing the
typeahead to silently return nothing until the user typed @. Replace with
a minimum 3-character length check to match the frontend trigger condition.
Teams list was fetched on every Sidebar mount (each page navigation),
causing a flash from '…' to the real name on every tab switch. Move teams
into a module-level reactive store (teams.svelte.ts) that fetches once per
session and is shared between Sidebar and the team page.
- Slug + Team ID rows collapsed into a 2-column grid for better density
- "you" badge moved inline with email instead of stacked below it
- Copy checkmark draws itself via SVG stroke-dashoffset animation
- New member row flashes accent-green on entry
- Removed member row slides out smoothly (fly transition)
- Member rows use staggered fly-in on page load
- Team name briefly highlights accent color after a successful rename
- Search result avatars get colorized initials based on email character
Reviewed-on: wrenn/sandbox#4
Delight (keys page):
- Animated checkmark draw + circle pop on key reveal dialog open
- Key display area pulses accent glow on open to draw eye to "copy this"
- Copy button spring-bounces on successful copy (re-triggers on repeat)
- Empty state key icon floats (iconFloat, now global)
- Row hover uses scaleY left-accent stripe (matches capsules pattern)
- New key row flashes accent on reveal dialog dismiss (matches capsule-born)

Audit fixes (all dashboard pages):
- Page titles standardized to em dash: "Wrenn — X" across all four pages
- formatDate/timeAgo extracted to src/lib/utils/format.ts (string | undefined
  signatures); keys and snapshots now import from there instead of duplicating
- team formatDate gains undefined guard (kept local, date-only format differs)
- spin-once and iconFloat keyframes moved to app.css as globals; scoped copies
  removed from capsules and keys
- Snapshots empty state icon was referencing undefined @keyframes float; fixed
  to iconFloat

Normalization:
- Snapshots table rows: replaced ::before pseudo-element accent (opacity-only,
  single color) with DOM row-stripe element using scaleY transition, type-keyed
  color (green for snapshots, blue for images) — matches capsules pattern
- Create Key dialog: max-w-[400px] → max-w-[420px] to align with form dialogs
- Snapshots count and empty-state heading are now terminology-aware: shows
  "templates/snapshots/images" based on active filter; empty heading for all
  filter reads "No templates yet" instead of "No snapshots yet"

Not done (documented in audit, deferred):
- Sidebar nav items pointing to unimplemented routes (audit, usage, billing,
  notifications, settings) — left as-is, needs product decision
- Dialog max-widths fully normalized beyond Create Key — minor, deferred
- capsules timeAgo not imported from shared util (formatTime differs intentionally)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reviewed-on: wrenn/sandbox#5
- Add name column to users (migration + sqlc regen); propagate through JWT
  claims, auth context, all auth/OAuth handlers, service layer, and frontend
- Sidebar and team page show name instead of email; team page splits Name/Email
  into separate columns
- Block sandbox creation in UI and API when user has no active team context
- loginTeam helper falls back to first active team when no default is set,
  fixing login for invited users with no is_default membership
- Exclude soft-deleted teams from GetDefaultTeamForUser, GetBYOCTeams queries
- Guard host creation against soft-deleted teams in service/host.go
- SwitchTeam re-fetches name from DB instead of trusting stale JWT claim
- Reset teams store on login so stale data from a previous session never persists
- Update openapi.yaml: add name to SignupRequest and AuthResponse schemas
Replaces the hardcoded CP_HOST_AGENT_ADDR single-agent setup with a
DB-driven registration system supporting multiple host agents (BYOC).

Key changes:
- Host agents register via one-time token, receive a 7-day JWT + 60-day
  refresh token; heartbeat loop auto-refreshes on 401/403 and pauses all
  sandboxes if refresh fails
- HostClientPool: lazy Connect RPC client cache keyed by host ID, replacing
  the single static agent client throughout the API and service layers
- RoundRobinScheduler: picks an online host for each new sandbox via
  ListActiveHosts; extensible for future scheduling strategies
- HostMonitor (replaces Reconciler): passive heartbeat staleness check marks
  hosts unreachable and sandboxes missing after 90s; active reconciliation
  per online host restores missing-but-alive sandboxes and stops orphans
- Graceful host delete: returns 409 with affected sandbox list without
  ?force=true; force-delete destroys sandboxes then evicts pool client
- Snapshot delete broadcasts to all online hosts (templates have no host_id)
- sandbox.Manager.PauseAll: pauses all running VMs on CP connectivity loss
- New migration: host_refresh_tokens table with token rotation (issue-then-
  revoke ordering to prevent lockout on mid-rotation crash)
- New sandbox status 'missing' (reversible, unlike 'stopped') and host
  status 'unreachable'; both reflected in OpenAPI spec
- Fix: refresh token auth failure now returns 401 (was 400 via generic
  'invalid' substring match in serviceErrToHTTP)
- Frontend: BYOC hosts page (/dashboard/byoc) with register/delete flows,
  shimmer loading, pulsing online status, animated token reveal checkmark
- Frontend: Admin section (/admin/hosts) with platform + BYOC tabs, stat
  pills, skeleton loading, slide-in animations for new rows
- Frontend: AdminSidebar component with accent top bar and admin pill badge
- Frontend: BYOC nav item shown only when team.is_byoc is true (derived
  from teams store, not JWT); disabled for members
- Frontend: Admin shield button in Sidebar, visible only to platform admins
- Backend: is_admin in JWT claims + requireAdmin middleware (DB-validated)
- Backend: is_byoc added to teamResponse so frontend derives visibility
  from fresh team data rather than stale JWT fields
- Backend: SetBYOC admin endpoint (PUT /v1/admin/teams/{id}/byoc)
- Backend: Admin hosts list enriches BYOC entries with team_name
- Host agent: load .env file via godotenv on startup
Reviewed-on: wrenn/sandbox#6
Introduces an append-only audit trail for all user and system actions:
sandbox lifecycle (create/pause/resume/destroy/auto-pause), snapshots,
team rename, API key create/revoke, member add/remove/leave/role_update,
and BYOC host add/delete/marked_down/marked_up.

- New audit_logs table (migration) with team_id, actor, resource,
  action, scope (team|admin), status (success|info|warning|error),
  metadata, and created_at
- AuditLogger (internal/audit) with named fire-and-forget methods per
  event; system actor used for background events (HostMonitor, TTL reaper)
- GET /v1/audit-logs: JWT-only, cursor pagination (max 200), multi-value
  filters for resource_type and action (comma-sep or repeated params);
  members see team-scoped events only, admins/owners see all
- AuthContext extended with APIKeyID + APIKeyName so API key requests
  record meaningful actor identity
- HostMonitor wired with AuditLogger for auto-pause and host marked_down
Infinite-scroll table with hierarchical filter dropdown, expandable
metadata rows, and status-coded visual signals per event severity.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reviewed-on: wrenn/sandbox#7
pptx704 added 17 commits 2026-03-25 16:40:07 +00:00
- app.css: replace flat --shadow-sm token with real shadows; add
  --shadow-card and --shadow-dialog tokens; add @keyframes status-ping
  and .animate-status-ping utility (outward ring ripple, GPU-composited
  via will-change) for live running status dots
- login: headline 5rem → 6.5rem with tighter leading/tracking; expand
  container to 460px; add sage-green dot grid texture layer beneath the
  mouse-reactive glow for industrial depth
- capsules: upgrade all running dots (header chip + row indicators +
  status bar) from opacity-fade to ring ripple; apply --shadow-dialog
  to Launch and Snapshot dialogs
- keys: apply --shadow-dialog to all three dialogs
- audit: remove duplicate @keyframes fadeUp and iconFloat (redundant
  with app.css definitions, audit's fadeUp also subtly diverged)
- sidebar: active indicator bar taller and thicker (h-5 w-[3px] → h-6
  w-1); active bg more vivid (accent/12%); label font-medium →
  font-semibold; team dialog gets --shadow-dialog
- New sandbox_metrics_snapshots table sampled every 10s (60-day retention)
- Background MetricsSampler goroutine wired into control plane startup
- GET /v1/sandboxes/stats?range=5m|1h|6h|24h|30d endpoint with adaptive
  polling intervals; reserved CPU/RAM uses ceil(paused/2) formula
- StatsPanel component: 4 stat cards + 2 Chart.js line charts (straight
  lines, integer y-axis for running count, dual-axis for CPU/RAM)
- Range filter persisted in URL query param; polls update data silently
  (no blink — loading state only shown on initial mount)
- Split /dashboard/capsules into /list and /stats sub-routes with shared
  layout; capsuleRunningCount store syncs badge across routes
- CreateCapsuleDialog extracted as reusable component
- Replace stale snapshot read (GetCurrentMetrics) with live query
  (GetLiveMetrics) against sandboxes table — always returns correct
  zeros when no capsules are running
- Fix CPU reserved formula: running + starting only; paused VMs no
  longer contribute vCPUs (RAM reservation for paused unchanged)
- Merge top cards into 3 paired Now/Peak cards with colored accent
  borders (green/blue/amber matching chart colors)
- Move Live badge from Running Capsules card to page-level header
- Add colored category dots to card and chart headers
- Charts stacked vertically, flex-1 to fill remaining page height
- vCPUs chart color changed to blue (#5a9fd4), RAM stays amber
- Add Metrics nav item to sidebar with bar chart icon
- Create /dashboard/metrics page wrapping StatsPanel
- Remove tabs from capsules page (list is now the only view)
- Flatten capsules route: /capsules directly shows the list,
  removing the /list and /stats sub-routes
- Strip redundant title/subtitle from StatsPanel (page header
  provides context)
SampleSandboxMetrics previously filtered WHERE status IN ('running',
'starting', 'paused'), which returned no rows when all capsules were
stopped. This caused zero snapshots to be skipped, leaving the
time-series charts with no trailing data points instead of showing
the expected zero values.

Remove the WHERE filter so the query groups by all teams that have
any sandbox row. The per-status FILTER clauses on the aggregates
already produce correct zero counts for stopped capsules.

Also includes the per-VM RAM ceiling formula change (sum(ceil(each/2))
instead of ceil(sum/2)).
CPU (vCPUs) and RAM (GB) use different units and scales, so combining
them on a dual-axis chart was misleading. Each now has its own chart
card, laid out side-by-side.
- Accent stripes: 3px → 5px; indicator dots: 6px → 8px
- Peak values step down to text-[1.714rem]/text-secondary so Now values read as the clear hero
- Now labels: semibold + uppercase for weight parity with the metric
- Cell padding py-5 → py-6; outer gap-7/pt-4 → gap-8/pt-6 for breathing room
- Chart fills: 7-8% → 11-13% opacity; lines: 1.5 → 2px
- Tick labels brighter (#635f5c), grid lines slightly more visible
- Running capsules chart: min-height 220 → 260px
Poll fetches now silently update data without triggering loading
states, spinner animations, or row fadeUp re-animations. Only manual
refresh shows the spin indicator.
Samples /proc/{fc_pid}/stat (CPU%), /proc/{fc_pid}/status (VmRSS), and
stat() on CoW files at 500ms intervals per running sandbox. Three tiered
ring buffers downsample into 30s and 5min averages for 10min/2h/24h
retention. Metrics are flushed to DB on pause (all tiers) and destroy
(24h only). New GetSandboxMetrics and FlushSandboxMetrics RPCs on the
host agent, proxied through GET /v1/sandboxes/{id}/metrics?range= on
the control plane. Returns live data for running sandboxes, DB data for
paused, and 404 for stopped.
Maps each user-facing range to the appropriate underlying ring buffer
tier and applies a time cutoff filter. No new ring buffers needed —
5m/10m read from the 10m tier, 1h/2h from the 2h tier, 6h/12h/24h
from the 24h tier.
Escape LIKE metacharacters (% and _) in the email prefix before passing
to the SQL query, and enforce the documented '@' requirement to prevent
broad user enumeration. Move search logic out of TeamService into
usersHandler since it is a site-wide lookup, not team-scoped.
The query was fetching all rows for a (sandbox_id, tier) pair and
filtering by timestamp in Go. For repeatedly-paused sandboxes the
24h tier can accumulate up to 30 days of data, causing up to 120x
over-fetching for a 6h range request.

Add AND ts >= $3 to the query so Postgres filters on the primary key
(sandbox_id, tier, ts) directly. Drop the redundant Go-side loop.
- New detail page at /dashboard/capsules/[id] with Stats and Files tabs
- Stats tab shows capsule info card (status, template, CPU, memory, disk,
  started, idle timeout) and two stacked Chart.js charts with live values
- Metrics API client with 10s polling and moving-average smoothing
- Capsule ID in list table is now a clickable link to the detail page
- Layout breadcrumb header (Capsules > sb-xxx) with back navigation
- Fix metrics sampler: use v.PID() directly as Firecracker PID since
  unshare -m execs (not forks) through the bash/ip-netns-exec/firecracker
  chain, so all share the same PID. Removes unused findChildPID.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reviewed-on: wrenn/sandbox#9
pptx704 added 29 commits 2026-04-02 17:41:09 +00:00
Inject a statically-linked socat binary into rootfs images. envd's
port forwarder requires socat to bridge localhost-listening services
(e.g. Jupyter kernel) to the guest TAP interface.

Both scripts follow the same 3-step resolution: check rootfs, check
host, build from source (http://www.dest-unreach.org/socat/ v1.8.1.1).
Static linkage is verified before injection.

This is an intermediate state — needs further work for the full code
interpreter feature.
Add /proxy/{sandbox_id}/{port}/* handler that reverse-proxies HTTP
requests to services running inside sandbox VMs. The sandbox's host IP
(10.11.0.{idx}) is used as the upstream target.

Includes port validation (1-65535) and shared HTTP transport for
connection pooling. Supports WebSocket upgrades for protocols like
Jupyter's streaming API.

This is an intermediate state — needs further work for the full code
interpreter feature.
Add SandboxProxyWrapper that intercepts requests with Host headers
matching {port}-{sandbox_id}.{domain} and proxies them through the
owning host agent's /proxy endpoint.

Authentication is via X-API-Key only (no JWT). The API key's team must
own the sandbox. Export EnsureScheme from lifecycle package for reuse.

Request flow: SDK -> Caddy -> CP catch-all -> Host Agent -> sandbox VM.

This is an intermediate state — needs further work for the full code
interpreter feature.
Add Caddy to docker-compose as the single entry point on port 8000:
- localhost -> /api/* stripped and proxied to CP:8080, /* to frontend:5173
- *.localhost -> proxied to CP:8080 (sandbox proxy catch-all)
- Direct /v1/*, /auth/*, /docs routes proxied to CP

Move CP from :8000 to :8080 (its default). Caddy takes :8000.
Update .env.example, vite proxy target (kept as fallback), and Makefile
dev targets (pg_isready via docker exec, frontend binds 0.0.0.0).

This is an intermediate state — needs further work for the full code
interpreter feature.
The [id] route cannot be prerendered at build time since IDs are unknown.
With adapter-static's index.html fallback, the route is handled client-side.
Switch from the envd /init endpoint pushing host time via syscall to
chronyd reading the KVM PTP hardware clock (/dev/ptp0) continuously.
This fixes clock drift between init calls and handles snapshot resume
gracefully.

Changes:
- Add clocksource=kvm-clock kernel boot arg
- Start chronyd in wrenn-init.sh before tini (PHC /dev/ptp0, makestep 1.0 -1)
- Remove clock_settime logic from envd SetData and shouldSetSystemTime
- Remove client.Init() clock sync calls from sandbox manager (3 sites)
- Remove Init() method from envdclient (no longer needed)
- Simplify rootfs scripts: socat/chrony now come from apt in the container
  image, only envd/wrenn-init/tini are injected by build scripts
Introduces an end-to-end template building pipeline: admins submit a recipe
(list of shell commands) via the dashboard, a Redis-backed worker pool spins
up a sandbox, executes each command, and produces either a full snapshot
(with healthcheck) or an image-only template (rootfs flattened via a new
FlattenRootfs host-agent RPC). Build progress and per-step logs are persisted
to a new template_builds table and polled by the frontend.

Backend:
- New FlattenRootfs RPC (proto + host agent + sandbox manager)
- BuildService with Redis queue (BLPOP) and configurable worker pool (default 2)
- Admin-only REST endpoints: POST/GET /v1/admin/builds, GET /v1/admin/builds/{id}
- Migration for template_builds table with JSONB logs and recipe columns
- sqlc queries for build CRUD and progress updates

Frontend:
- /admin/templates page with Templates + Builds tabs
- Create Template dialog with recipe textarea, healthcheck, specs
- Build history with expandable per-step logs, status badges, progress bars
- Auto-polling every 3s for active builds
- AdminSidebar updated with Templates nav item
- Use context.Background() with timeout in destroySandbox/failBuild so
  cleanup and DB writes survive parent context cancellation on shutdown
- Fix loop device refcount leak in FlattenRootfs when dmDevice is nil
- Replace time.After with time.NewTimer in healthcheck polling to avoid
  goroutine leak when healthcheck passes early
- Capture size_bytes from CreateSnapshot/FlattenRootfs RPC responses
  instead of hardcoding 0 in the templates table insert
- Avoid leaking internal error details to API clients in build handler
Consolidate 16 migrations into one with UUID columns for all entity
IDs. TEXT is kept only for polymorphic fields (audit_logs.actor_id,
resource_id) and template names. The id package now generates UUIDs
via google/uuid, with Format*/Parse* helpers for the prefixed wire
format (sb-{uuid}, usr-{uuid}, etc.). Auth context, services, and
handlers pass pgtype.UUID internally; conversion to/from prefixed
strings happens at API and RPC boundaries. Adds PlatformTeamID
(all-zeros UUID) for shared resources.
Disk sizing:
- Add disk_size_mb column to sandboxes table (default 20480 = 20GB)
- Add disk_size_mb to CreateSandboxRequest proto, passed through the
  full chain: service → RPC → host agent → sandbox manager → devicemapper
- devicemapper.CreateSnapshot takes separate cowSizeBytes param so the
  sparse CoW file can be sized independently from the origin
- EnsureImageSizes() runs at host agent startup: expands any base image
  smaller than 20GB via truncate + resize2fs (sparse, no extra physical
  disk). Sandboxes then get the full 20GB via fast dm-snapshot path
- FlattenRootfs shrinks output images with resize2fs -M so stored
  templates are compact; EnsureImageSizes re-expands on next startup

Admin templates visibility:
- Add GET /v1/admin/templates endpoint listing all templates across teams
- Frontend admin templates page uses listAdminTemplates() instead of
  team-scoped listSnapshots()
- Platform templates (team_id = all-zeros UUID) now visible to all teams:
  GetTemplateByTeam, ListTemplatesByTeam, ListTemplatesByTeamAndType
  queries include platform team_id in WHERE clause
- DELETE /v1/admin/templates/{name} endpoint (admin-only)
- Broadcasts DeleteSnapshot RPC to all online hosts before removing DB record
- Frontend admin templates page uses deleteAdminTemplate() instead of
  team-scoped deleteSnapshot()
- Delete button shown for all template types, not just snapshots
Pre-build: apt update
Post-build: apt clean, apt autoremove, rm apt lists

Total steps count includes pre/post commands for accurate progress bars.
Build phases:
- Pre-build (apt update) and post-build (apt clean, autoremove, rm lists)
  run with 10-minute timeout; user recipe commands keep 30s timeout
- Log entries include phase field for UI grouping
- Always send explicit TimeoutSec to host agent (0 defaulted to 30s)

Frontend:
- Pre-build/post-build steps show phase label without exposing commands
- Recipe steps numbered independently starting from 1

Guest PATH:
- Add /usr/games:/usr/local/games to wrenn-init.sh PATH export
  (standard Ubuntu paths, needed for packages like cowsay)
DB stays native UUID; the format/parse layer now encodes 16 UUID bytes
as 25-char lowercase alphanumeric (base36) strings instead of the
standard 36-char hex-with-dashes format. e.g. sb-2e5glxi4g3qnhwci95qev0cg0
Snapshot race fix:
- Pre-mark sandbox as "paused" in DB before issuing CreateSnapshot and
  PauseSandbox RPCs, preventing the reconciler from marking it "stopped"
  during the flatten window when the sandbox is gone from the host
  agent's in-memory map but DB still says "running"
- Revert status to "running" on RPC failure
- Check ctx.Err() before writing response to avoid writing to dead
  connections when client disconnects during long snapshot operations

Delete auth fix:
- Block non-admin deletion of platform templates (team_id = all-zeros)
  at DELETE /v1/snapshots/{name} with 403, preventing file deletion
  before the team ownership check fails

Sparse dd:
- Add conv=sparse to dd in FlattenSnapshot so flattened images preserve
  sparseness (~200MB actual vs 5GB logical)

Default disk size:
- Change default disk_size_mb from 20GB to 5GB across migration,
  manager, service, build, and EnsureImageSizes
- Disable split-button dropdown arrow for platform templates in
  dashboard snapshots page (teams cannot delete platform templates)
Introduces internal/layout package for centralized path construction,
migrates templates from name-based TEXT primary keys to UUID PKs with
team-scoped directories (WRENN_DIR/images/teams/{team_id}/{template_id}).
The built-in minimal template uses sentinel zero UUIDs. Proto messages
carry team_id + template_id alongside deprecated template name field.
Team deletion now cleans up template files across all hosts.
AGENT_FILES_ROOTDIR → WRENN_DIR, AGENT_LISTEN_ADDR → WRENN_HOST_LISTEN_ADDR,
AGENT_CP_URL → WRENN_CP_URL, AGENT_HOST_INTERFACE → WRENN_HOST_INTERFACE,
CP_LISTEN_ADDR → WRENN_CP_LISTEN_ADDR. Consolidates all env vars under a
consistent WRENN_ namespace.
Insert a minimal template row (all-zeros UUID) so it appears in both
team and admin template listings. Guard delete endpoints to prevent
removal of the minimal template.
Rename ns-{idx} to wrenn-ns-{idx} and veth-{idx} to wrenn-veth-{idx}
to avoid collisions with other tools. Add CleanupStaleNamespaces() at
agent startup to remove orphaned namespaces, veths, iptables rules, and
routes from a previous crash. Lower maxDiffGenerations from 10 to 8 to
prevent Go runtime memory corruption from snapshot/restore drift.
Always use Firecracker Diff snapshots (fast, only changed pages) and
merge diff files at the file level when the generation cap is reached.
The previous approach used Firecracker's Full snapshot type which dumps
all memory to disk and can timeout, losing all snapshot data on failure.

Add snapshot.MergeDiffs() which reads each block from the appropriate
generation's diff file via the header mapping and writes them into a
single consolidated file with a fresh generation-0 header.
- Change sandbox ID prefix from sb- to cl- (capsule) throughout
- Fix proxy URL regex character class: base36 uses 0-9a-z, not just hex
- Add MMDS V2 config and metadata to VM boot flow so envd can read
  WRENN_SANDBOX_ID and WRENN_TEMPLATE_ID from inside the guest
- Pass TemplateID through VMConfig into both fresh and snapshot boot paths
- Internal ECDSA P-256 CA (WRENN_CA_CERT/WRENN_CA_KEY env vars); when absent
  the system falls back to plain HTTP so dev mode works without certificates
- Host leaf cert (7-day TTL, IP SAN) issued at registration and renewed on
  every JWT refresh; fingerprint + expiry stored in DB (cert_expires_at column
  replaces the removed mtls_enabled flag)
- CP ephemeral client cert (24-hour TTL) via CPCertStore with atomic hot-swap;
  background goroutine renews it every 12 hours without restarting the server
- Host agent uses tls.Listen + httpServer.Serve so GetCertificate callback is
  respected (ListenAndServeTLS always reads cert from disk)
- Sandbox reverse proxy now uses pool.Transport() so it shares the same TLS
  config as the Connect RPC clients instead of http.DefaultTransport
- Credentials file renamed host-credentials.json with cert_pem/key_pem/
  ca_cert_pem fields; duplicate register/refresh response structs collapsed
  to authResponse
- skip_pre_post flag on builds bypasses apt update/clean pre/post steps for
  faster iteration when the recipe handles its own environment setup
- POST /v1/admin/builds/{id}/cancel endpoint marks an in-progress build as
  cancelled; UpdateBuildStatus now also sets completed_at for 'cancelled'
- internal/recipe: typed recipe parser and executor (RUN/ENV/COPY steps)
  replacing the raw string slice approach in the build worker
- pre/post build commands prefixed with RUN to match recipe step format
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduce ConnTracker (atomic.Bool + WaitGroup) to track in-flight proxy
connections per sandbox. Before pausing a VM, the manager drains active
connections with a 2s grace period, preventing Go runtime corruption
inside the guest caused by stale TCP state surviving Firecracker
snapshot/restore.

Also add:
- AcquireProxyConn on Manager for atomic lookup + connection tracking
- Proxy cache (120s TTL) on CP SandboxProxyWrapper with single-query
  DB lookup (GetSandboxProxyTarget) to avoid two round-trips
- Reset() on ConnTracker to re-enable connections if pause fails
The envd port scanner used gopsutil's net.Connections() which walks
/proc/{pid}/fd to enumerate socket inodes. This corrupts Go runtime
semaphore state when the VM is paused mid-operation and restored from
a Firecracker snapshot.

Replace with a direct /proc/net/tcp + /proc/net/tcp6 parser that reads
a single file per address family — no /proc/{pid}/fd walk, no goroutines,
no WaitGroups. Also replace concurrent-map (smap) in the scanner with a
plain sync.RWMutex-protected map, since concurrent-map's Items() spawns
goroutines with a WaitGroup internally, which is equally unsafe across
snapshot boundaries.

Use socket inode instead of PID for the port forwarding map key, since
inode is available directly from /proc/net/tcp without the fd walk.
Reviewed-on: wrenn/sandbox#10
pptx704 added 2 commits 2026-04-04 07:11:51 +00:00
pptx704 added 1 commit 2026-04-07 19:35:29 +00:00
pptx704 added 6 commits 2026-04-07 20:18:08 +00:00
healthchecks

Fix ENV instructions to expand $VAR references at set time using the
current env state, preventing self-referencing values like
PATH=/opt/venv/bin:$PATH from producing recursive expansions. Remove
expandEnv from shellPrefix to avoid double expansion.

Fetch sandbox environment variables via `env` before recipe execution
so ENV steps resolve against actual runtime values from the base
template image.

Replace hardcoded healthcheck timing with a Dockerfile-like flag parser
supporting --interval, --timeout, --start-period, and --retries. Add
start-period grace window and bounded retry counting to
waitForHealthcheck.

Add python-interpreter-v0-beta recipe and healthcheck files.
Updated recipefile with test script to check code execution with state
management
- Fix envRegex: remove spurious (\$)? group that swallowed $$$, handle ${}
- wrenn-init.sh: add || true to networking commands under set -e, remove dead code
- waitForHealthcheck: use context deadline for unlimited retries instead of implicit 100 cap
- Make parseSandboxEnv a package-level function (unused receiver)
- Fix WrappedCommand test: map iteration order dependency, pre-expand env values
- Fix error wrapping: %v → %w per project conventions
- test-jupyter-kernel.py: move import to top-level, fix misleading comment
Reviewed-on: wrenn/sandbox#12
pptx704 added 1 commit 2026-04-07 20:26:03 +00:00
Both the control plane and host agent now refuse to start without valid
mTLS configuration, closing the unauthenticated proxy/RPC attack surface
that existed when running in plain HTTP fallback mode.
pptx704 added 2 commits 2026-04-07 21:46:40 +00:00
- Add auth failure logging (login, API key, JWT) with IP/email/prefix
- Move OAuth JWT from URL params to short-lived cookies to prevent
  token leakage via browser history, server logs, and Referer headers
- Pin Swagger UI to v5.18.2 with SRI integrity hashes
- Upgrade Go toolchain to 1.25.8 (fixes 5 called stdlib vulns)
- Fix unchecked error in host agent credential refresh
- Add .gstack to .gitignore for security report artifacts
pptx704 added 1 commit 2026-04-07 22:34:00 +00:00
- Fix IP address collision at slot 32768+ by using bitwise shifts instead of
  byte-truncating division in network slot addressing
- Add per-sandbox lifecycleMu to serialize concurrent Pause/Destroy calls
- Sanitize proxy forwarding path with path.Clean
- Sort ENV keys in recipe shell preamble for deterministic ordering
- Fix ConnTracker goroutine leak by adding cancel channel to Drain/Reset
- Update context_test to assert deterministic ENV ordering
pptx704 added 1 commit 2026-04-08 18:47:28 +00:00
pptx704 added 3 commits 2026-04-09 08:28:49 +00:00
Change host marked_down/marked_up audit log scope from "admin" to "team" so
BYOC team members can see when their hosts go unreachable or recover. Rename
BYOC sidebar entry to Hosts, add placeholder billing/usage pages, disable
unimplemented notifications/settings links, and point docs to external site.
Replace Business Source License with Apache License Version 2.0 across
LICENSE, envd/LICENSE, and NOTICE. Update NOTICE to remove BSL-era
framing that singled out Apache-only portions.
pptx704 added 3 commits 2026-04-09 19:20:38 +00:00
Implement a channels system for notifying teams via external providers
(Discord, Slack, Teams, Google Chat, Telegram, Matrix, webhook) when
lifecycle events occur (capsule/template/host state changes).

- Channel CRUD API under /v1/channels (JWT-only auth)
- Test endpoint to verify config before saving (POST /v1/channels/test)
- Secret rotation endpoint (PUT /v1/channels/{id}/config)
- AES-256-GCM encryption for provider secrets (WRENN_ENCRYPTION_KEY)
- Redis stream event publishing from audit logger
- Background dispatcher with consumer group and retry (10s, 30s)
- Webhook delivery with HMAC-SHA256 signing (X-WRENN-SIGNATURE)
- shoutrrr integration for chat providers
- Secrets never exposed in API responses
- Add audit log entries for channel create, update, rotate_config, delete
- Clean channel names on create/update (trim, lowercase, spaces → hyphens,
  SafeName validation)
- Format chat notifications with full event details (resource, actor, team,
  timestamp) instead of one-liners
- Fix Discord split-line embeds by setting splitLines=No on shoutrrr URL
- Add channels dashboard page and sidebar navigation
Reviewed-on: wrenn/sandbox#13
pptx704 changed title from WIP: v0.1.0 to v0.0.1 2026-04-09 19:23:12 +00:00
pptx704 merged commit d3e4812e46 into main 2026-04-09 19:24:50 +00:00
pptx704 referenced this issue from a commit 2026-04-09 19:24:51 +00:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: wrenn/wrenn#8
No description provided.