v0.1.6 #45

pptx704 · 2026-05-13T05:05:17Z

What's New?

Performance improvements for large capsules, admin panel enhancements, and bug fixes

Envd

Fixed a bug in sandbox metrics calculation
Page cache drop and balloon inflation to reduce memfile snapshot size
Updated RPC timeout logic for better control
Added tests

Admin Panel

Add/remove platform admins
Updated template deletion logic for fine-grained permissions

Others

Minor frontend visual improvements
Minor bug fixes
Version bump

pptx704 added 277 commits 2026-05-13 05:05:17 +00:00

Port envd from e2b with internalized shared packages and Connect RPC a3898d68fb

- Copy envd source from e2b-dev/infra, internalize shared dependencies
  into envd/internal/shared/ (keys, filesystem, id, smap, utils)
- Switch from gRPC to Connect RPC for all envd services
- Update module paths to git.omukk.dev/wrenn/{sandbox,sandbox/envd}
- Add proto specs (process, filesystem) with buf-based code generation
- Implement full envd: process exec, filesystem ops, port forwarding,
  cgroup management, MMDS integration, and HTTP API
- Update main module dependencies (firecracker SDK, pgx, goose, etc.)
- Remove placeholder .gitkeep files replaced by real implementations

Add host agent with VM lifecycle, TAP networking, and envd client 7753938044

Implements Phase 1: boot a Firecracker microVM, execute a command inside
it via envd, and get the output back. Uses raw Firecracker HTTP API via
Unix socket (not the Go SDK) for full control over the VM lifecycle.

- internal/vm: VM manager with create/pause/resume/destroy, Firecracker
  HTTP client, process launcher with unshare + ip netns exec isolation
- internal/network: per-sandbox network namespace with veth pair, TAP
  device, NAT rules, and IP forwarding
- internal/envdclient: Connect RPC client for envd process/filesystem
  services with health check retry
- cmd/host-agent: demo binary that boots a VM, runs "echo hello", prints
  output, and cleans up
- proto/envd: canonical proto files with buf + protoc-gen-connect-go
  code generation
- images/wrenn-init.sh: minimal PID 1 init script for guest VMs
- CLAUDE.md: updated architecture to reflect TAP networking (not vsock)
  and Firecracker HTTP API (not Go SDK)

Centralize envd proto source of truth to proto/envd/ c31ce90306

Remove duplicate proto files from envd/spec/ and update envd's
buf.gen.yaml to generate stubs from the canonical proto/envd/ location.
Both modules now generate their own Connect RPC stubs from the same
source protos.

Add host agent RPC server with sandbox lifecycle management 6f0c365d44

Implement the host agent as a Connect RPC server that orchestrates
sandbox creation, destruction, pause/resume, and command execution.
Includes sandbox manager with TTL-based reaper, network slot allocator,
rootfs cloning, hostagent proto definition with generated stubs, and
test/debug scripts. Fix Firecracker process lifetime bug where VM was
tied to HTTP request context instead of background context.

Added basic license information 34c89e814d

updated license structure d7b25b0891

Add minimal control plane with REST API, database, and reconciler ec3360d9ad

- REST API (chi router): sandbox CRUD, exec, pause/resume, file write/read
- PostgreSQL persistence via pgx/v5 + sqlc (sandboxes table with goose migration)
- Connect RPC client to host agent for all VM operations
- Reconciler syncs host agent state with DB every 30s (detects TTL-reaped sandboxes)
- OpenAPI 3.1 spec served at /openapi.yaml, Swagger UI at /docs
- Added WriteFile/ReadFile RPCs to hostagent proto and implementations
- File upload via multipart form, download via JSON body POST
- sandbox_id propagated from control plane to host agent on create

Add streaming exec and file transfer endpoints b4d8edb65b

Add WebSocket-based streaming exec endpoint and streaming file
upload/download endpoints to the control plane API. Includes new
host agent RPC methods (ExecStream, StreamWriteFile, StreamReadFile),
envd client streaming support, and OpenAPI spec updates.

Fix guest VM outbound networking and DNS resolution 0c245e9e1c

Add resolv.conf to wrenn-init so guests can resolve DNS, and fix the
host MASQUERADE rule to match vpeerIP (the actual source after namespace
SNAT) instead of hostIP.

Rewrite CLAUDE.md and README.md 9b94df7f56

CLAUDE.md: replace bloated 850-line version with focused 230-line
guide. Fix inaccuracies (module path, build dir, Connect RPC vs gRPC,
buf vs protoc). Add detailed architecture with request flows, code
generation workflow, rootfs update process, and two-module gotchas.

README.md: add core deployment instructions (prerequisites, build,
host setup, configuration, running, rootfs workflow).

Add sandbox snapshot and restore with UFFD lazy memory loading a1bd439c75

Implement full snapshot lifecycle: pause (snapshot + free resources),
resume (UFFD-based lazy restore), and named snapshot templates that
can spawn new sandboxes from frozen VM state.

Key changes:
- Snapshot header system with generational diff mapping (inspired by e2b)
- UFFD server for lazy page fault handling during snapshot restore
- Stable rootfs symlink path (/tmp/fc-vm/) for snapshot compatibility
- Templates DB table and CRUD API endpoints (POST/GET/DELETE /v1/snapshots)
- CreateSnapshot/DeleteSnapshot RPCs in hostagent proto
- Reconciler excludes paused sandboxes (expected absent from host agent)
- Snapshot templates lock vcpus/memory to baked-in values
- Proper cleanup of uffd sockets and pause snapshot files on destroy

Made license related changes 778894b488

Add device-mapper snapshots, test UI, fix pause ordering and lint errors 63e9132d38

- Replace reflink rootfs copy with device-mapper snapshots (shared
  read-only loop device per base template, per-sandbox sparse CoW file)
- Add devicemapper package with create/restore/remove/flatten operations
  and refcounted LoopRegistry for base image loop devices
- Fix pause ordering: destroy VM before removing dm-snapshot to avoid
  "device busy" error (FC must release the dm device first)
- Add test UI at GET /test for sandbox lifecycle management (create,
  pause, resume, destroy, exec, snapshot create/list/delete)
- Fix DirSize to report actual disk usage (stat.Blocks * 512) instead
  of apparent size, so sparse CoW files report correctly
- Add timing logs to pause flow for performance diagnostics
- Fix all lint errors across api, network, vm, uffd, and sandbox packages
- Remove obsolete internal/filesystem package (replaced by devicemapper)
- Update CLAUDE.md with device-mapper architecture documentation

Fix path traversal in template/snapshot names and network cleanup leaks a0d635ae5e

Add SafeName validator (allowlist regex) to reject directory traversal
in user-supplied template and snapshot names. Validated at both API
handlers (400 response) and sandbox manager (defense in depth).
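
As an illustration of the allowlist approach described above, here is a minimal sketch. The pattern, the 64-character limit, and the error text are assumptions made for the example; the commit does not show the regex the project actually uses.

```go
package main

import (
	"fmt"
	"regexp"
)

// safeNamePattern is a hypothetical allowlist: the name must start with a
// letter or digit and may continue with letters, digits, '.', '_' or '-'.
// Path separators are never allowed, so "../x" and "a/b" cannot match.
var safeNamePattern = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9._-]{0,63}$`)

// SafeName returns an error unless name matches the allowlist.
func SafeName(name string) error {
	if !safeNamePattern.MatchString(name) {
		return fmt.Errorf("invalid name %q", name)
	}
	return nil
}
```

Calling the same function from both the HTTP handler and the sandbox manager gives the two layers of validation the commit mentions without duplicating the pattern.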

Refactor CreateNetwork with rollback slice so partially created
resources (namespace, veth, routes, iptables rules) are cleaned up
on any error. Refactor RemoveNetwork to collect and return errors
instead of silently ignoring them.

Add diff snapshots for re-pause to avoid UFFD fault-in storm 80a99eec87

Use Firecracker's Diff snapshot type when re-pausing a previously
resumed sandbox, capturing only dirty pages instead of a full memory
dump. Chains up to 10 incremental generations before collapsing back
to a Full snapshot. Multi-generation diff files (memfile.{buildID})
are supported alongside the legacy single-file format in resume,
template creation, and snapshot existence checks.
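
A minimal sketch of the generation rule described above. The function name and its parameters are assumptions for illustration; only the limit of 10 generations and the Full/Diff snapshot types come from the commit message and the Firecracker API.

```go
package main

// maxDiffGenerations mirrors the limit of 10 incremental generations named in
// the commit message.
const maxDiffGenerations = 10

// SnapshotType uses the two values Firecracker accepts for snapshot_type.
type SnapshotType string

const (
	SnapshotFull SnapshotType = "Full"
	SnapshotDiff SnapshotType = "Diff"
)

// nextSnapshot decides which snapshot type to take on pause.
// resumed reports whether the sandbox was previously resumed from a snapshot;
// generations is the number of diff generations already chained on top of the
// last full snapshot. It returns the type to use and the new generation count.
func nextSnapshot(resumed bool, generations int) (SnapshotType, int) {
	if !resumed || generations >= maxDiffGenerations {
		return SnapshotFull, 0 // first pause, or chain is full: collapse to Full
	}
	return SnapshotDiff, generations + 1
}
```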

Add script to create rootfs from Docker container 712b77b01c

Add authentication, authorization, and team-scoped access control c92cc29b88

Implement email/password auth with JWT sessions and API key auth for
sandbox lifecycle. Users get a default team on signup; sandboxes,
snapshots, and API keys are scoped to teams.

- Add user, team, users_teams, and team_api_keys tables (goose migrations)
- Add JWT middleware (Bearer token) for user management endpoints
- Add API key middleware (X-API-Key header, SHA-256 hashed) for sandbox ops
- Add signup/login handlers with transactional user+team creation
- Add API key CRUD endpoints (create/list/delete)
- Replace owner_id with team_id on sandboxes and templates
- Update all handlers to use team-scoped queries
- Add godotenv for .env file loading
- Update OpenAPI spec and test UI with auth flows
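
For the SHA-256 hashed API keys, a sketch along these lines shows the idea of storing only the digest. The function names are hypothetical, and the real middleware presumably looks the hash up in team_api_keys rather than comparing two strings in memory.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
)

// hashAPIKey returns the hex-encoded SHA-256 digest of the raw key. Only this
// digest would be stored; the raw key is shown to the user once.
func hashAPIKey(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// keyMatches reports whether the key presented in the X-API-Key header hashes
// to the stored digest, using a constant-time comparison.
func keyMatches(presented, storedHash string) bool {
	got := hashAPIKey(presented)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}
```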

Fix device-mapper "Device or resource busy" error on sandbox resume 1846168736

Pause was logging RemoveSnapshot failures as warnings and continuing,
which left stale dm devices behind. Resume then failed trying to create
a device with the same name.

- Make RemoveSnapshot failure a hard error in Pause (clean up remaining
  resources and return error instead of silently proceeding)
- Add defensive stale device cleanup in RestoreSnapshot before creating
  the new dm device

Fix sandbox lifecycle cleanup and dmsetup remove reliability 88246fac2b

- Add retry with backoff to dmsetupRemove for transient "device busy"
  errors caused by kernel not releasing the device immediately after
  Firecracker exits. Only retries on "Device or resource busy"; other
  errors (not found, permission denied) return immediately.

- Thread context.Context through RemoveSnapshot/RestoreSnapshot so
  retries respect cancellation. Use context.Background() in all error
  cleanup paths to prevent cancelled contexts from skipping cleanup
  and leaking dm devices on the host.

- Resume vCPUs on pause failure: if snapshot creation or memfile
  processing fails after freezing the VM, unfreeze vCPUs so the
  sandbox stays usable instead of becoming a frozen zombie.

- Fix resource leaks in Pause when CoW rename or metadata write fails:
  properly clean up network, slot, loop device, and remove from boxes
  map instead of leaving a dead sandbox with leaked host resources.

- Fix Resume WaitUntilReady failure: roll back CoW file to the snapshot
  directory instead of deleting it, preserving the paused state so the
  user can retry.

- Skip m.loops.Release when RemoveSnapshot fails during pause since
  the stale dm device still references the origin loop device.

- Fix incorrect VCPUs placeholder in Resume VMConfig that used memory
  size instead of a sensible default.

Add auto-pause TTL and ping endpoint for sandbox inactivity management 477d4f8cf6

Replace the existing auto-destroy TTL behavior with auto-pause: when a
sandbox exceeds its timeout_sec of inactivity, the TTL reaper now pauses
it (snapshot + teardown) instead of destroying it, preserving the ability
to resume later.

Key changes:
- TTL reaper calls Pause instead of Destroy, with fallback to Destroy if
  pause fails (e.g. Firecracker process already gone)
- New PingSandbox RPC resets the in-memory LastActiveAt timer
- New POST /v1/sandboxes/{id}/ping REST endpoint resets both agent memory
  and DB last_active_at
- ListSandboxes RPC now includes auto_paused_sandbox_ids so the reconciler
  can distinguish auto-paused sandboxes from crashed ones in a single call
- Reconciler polls every 5s (was 30s) and marks auto-paused as "paused"
  vs orphaned as "stopped"
- Resume RPC accepts timeout_sec from DB so TTL survives pause/resume cycles
- Reaper checks every 2s (was 10s) and uses a detached context to avoid
  incomplete pauses on app shutdown
- Default timeout_sec changed from 300 to 0 (no auto-pause unless requested)

Add GitHub OAuth login with provider registry 931b7d54b3

Implement OAuth 2.0 login via GitHub as an alternative to email/password.
Uses a provider registry pattern (internal/auth/oauth/) so adding Google
or other providers later requires only a new Provider implementation.

Flow: GET /v1/auth/oauth/github redirects to GitHub, callback exchanges
the code for a user profile, upserts the user + team atomically, and
redirects to the frontend with a JWT token.

Key changes:
- Migration: make password_hash nullable, add oauth_providers table
- Provider registry with GitHubProvider (profile + email fallback)
- CSRF state cookie with HMAC-SHA256 validation
- Race-safe registration (23505 collision retries as login)
- Startup validation: CP_PUBLIC_URL required when OAuth is configured
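
One way to build an HMAC-SHA256 state cookie is sketched here. The "nonce.signature" cookie layout and both function names are assumptions for illustration; the commit does not describe the actual cookie format.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// signState returns a cookie value of the form "nonce.signature", where the
// signature is the hex-encoded HMAC-SHA256 of the nonce under secret.
func signState(secret []byte, nonce string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(nonce))
	return nonce + "." + hex.EncodeToString(mac.Sum(nil))
}

// verifyState checks that the cookie was signed with secret and that its
// nonce equals the state parameter echoed back by the OAuth provider.
func verifyState(secret []byte, cookie, stateParam string) bool {
	nonce, _, ok := strings.Cut(cookie, ".")
	if !ok {
		return false
	}
	want := signState(secret, nonce)
	return hmac.Equal([]byte(cookie), []byte(want)) &&
		hmac.Equal([]byte(nonce), []byte(stateParam))
}
```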

Not fully tested — needs integration tests with a real GitHub OAuth app
and end-to-end testing with the frontend callback page.

Extract shared service layer for sandbox, API key, and template operations f38d5812d1

Moves business logic from API handlers into internal/service/ so that
both the REST API and the upcoming dashboard can share the same operations
without duplicating code. API handlers now delegate to the service layer
and only handle HTTP-specific concerns (request parsing, response formatting).

Remove empty admin UI stubs 1d59b50e49

The internal/admin/ package was never imported or mounted — just
placeholder files. Removing to avoid confusion before the real
dashboard is built.

Add admin users, BYOC teams, hosts schema, and Redis for host registration e4ead076e3

Introduce three migrations: admin permissions (is_admin + permissions table),
BYOC team tracking, and multi-host support (hosts, host_tokens, host_tags).
Add Redis to dev infra and wire up client in control plane for ephemeral
host registration tokens. Add go-redis dependency.

Add host registration, heartbeat, and multi-host management 2c66959b92

Implements the full host ↔ control plane connection flow:

- Host CRUD endpoints (POST/GET/DELETE /v1/hosts) with role-based access:
  regular hosts admin-only, BYOC hosts for admins and team owners
- One-time registration token flow: admin creates host → gets token (1hr TTL
  in Redis + Postgres audit trail) → host agent registers with specs → gets
  long-lived JWT (1yr)
- Host agent registration client with automatic spec detection (arch, CPU,
  memory, disk) and token persistence to disk
- Periodic heartbeat (30s) via POST /v1/hosts/{id}/heartbeat with X-Host-Token
  auth and host ID cross-check
- Token regeneration endpoint (POST /v1/hosts/{id}/token) for retry after
  failed registration
- Tag management (add/remove/list) with team-scoped access control
- Host JWT with typ:"host" claim, cross-use prevention in both VerifyJWT and
  VerifyHostJWT
- requireHostToken middleware for host agent authentication
- DB-level race protection: RegisterHost uses AND status='pending' with
  rows-affected check; Redis GetDel for atomic token consume
- Migration for future mTLS support (cert_fingerprint, mtls_enabled columns)
- Host agent flags: --register (one-time token), --address (required ip:port)
- serviceErrToHTTP extended with "forbidden" → 403 mapping
- OpenAPI spec, .env.example, and README updated

Consolidate host agent path env vars into single AGENT_FILES_ROOTDIR 866f3ac012

Replace AGENT_KERNEL_PATH, AGENT_IMAGES_PATH, AGENT_SANDBOXES_PATH,
AGENT_SNAPSHOTS_PATH, and AGENT_TOKEN_FILE with a single
AGENT_FILES_ROOTDIR (default /var/lib/wrenn) that derives all
subdirectory paths automatically.
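
A sketch of how the paths might be derived from the single root. The subdirectory and file names below are guesses based on the removed variable names, not the layout the host agent actually uses; only the variable name and the /var/lib/wrenn default come from the commit.

```go
package main

import (
	"os"
	"path/filepath"
)

// agentPaths groups the locations that used to have their own env vars.
type agentPaths struct {
	Kernel, Images, Sandboxes, Snapshots, TokenFile string
}

// derivePaths builds every path from root, falling back to the default named
// in the commit message. The subdirectory names are hypothetical.
func derivePaths(root string) agentPaths {
	if root == "" {
		root = "/var/lib/wrenn"
	}
	return agentPaths{
		Kernel:    filepath.Join(root, "kernel"),
		Images:    filepath.Join(root, "images"),
		Sandboxes: filepath.Join(root, "sandboxes"),
		Snapshots: filepath.Join(root, "snapshots"),
		TokenFile: filepath.Join(root, "host-token"),
	}
}

// pathsFromEnv reads AGENT_FILES_ROOTDIR and derives the rest.
func pathsFromEnv() agentPaths {
	return derivePaths(os.Getenv("AGENT_FILES_ROOTDIR"))
}
```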

Added basic frontend (#1) 97292ba0bf

Reviewed-on: wrenn/sandbox#1
Co-authored-by: pptx704 <rafeed@omukk.dev>
Co-committed-by: pptx704 <rafeed@omukk.dev>

Add tini as PID 1, guest clock sync, and fix PATH in guest VMs 36782e1b4f

- Use tini as PID 1 in wrenn-init.sh so zombie processes are reaped
  and signals are forwarded correctly to envd
- Set standard PATH in wrenn-init.sh so child processes spawned by envd
  can find common binaries (fixes "nice: ls command not found")
- Add envdclient.Init() to POST /init on envd after every boot/resume,
  syncing the guest clock via unix.ClockSettime — critical after snapshot
  resume where the guest clock is frozen
- Run Init in a background goroutine so it doesn't block the CreateSandbox
  RPC response; a slow Init (vCPU busy with envd startup) was causing the
  RPC context to be canceled before the response reached the control plane
- Update rootfs-from-container.sh and update-debug-rootfs.sh to inject
  tini into the rootfs, checking the container image and host first,
  downloading from GitHub releases as fallback

Fix snapshot and sandbox delete consistency 5f0dbadea6

- Snapshot delete: make agent RPC failure a hard error so DB record is
  not removed when files cannot be deleted from disk
- Snapshot overwrite: call agent to delete old files before removing the
  DB record, preventing stale memfile.{uuid} generations from accumulating
  on disk across repeated overwrites
- Sandbox destroy: only swallow CodeNotFound from the agent (sandbox
  already gone / TTL-reaped); any other error now propagates to the caller
  instead of being silently ignored

Merge branch 'main' of git.omukk.dev:wrenn/sandbox into dev 71564b202e

Polish dashboard frontend: spacing, copy, resilience b786a825d4

- Increase content padding (p-7→p-8) and table cell padding (px-4→px-5,
  py-3→py-4 for data rows) across capsules, keys, and snapshots pages
- Improve animation performance: wrenn-glow uses opacity instead of
  box-shadow (compositor-only, no paint cost)
- Add prefers-reduced-motion media query covering inline style animations
- Fix OAuth error display on login page (read ?error= param on mount)
- Harden clipboard copy with try-catch and toast fallback
- Improve empty state copy, dialog microcopy, and error messages
- Add retry button to error banners on keys page
- Replace "All systems operational" footer bar with a clean 1px divider
- Fix text truncation on long capsule/snapshot names (min-w-0 + truncate)

Updated design docs 79eba782fb

Merge pull request 'Minor frontend enhancement' (#3) from frontend into dev 4e26d7a292

Reviewed-on: wrenn/sandbox#3

Add team management endpoints 8e5d426638

- Three-role model (owner/admin/member) with owner protection invariants
- Team CRUD: create, rename (admin+), soft-delete with VM cleanup (owner only)
- Member management: add by email, remove, role updates (admin+), leave
- Switch-team endpoint re-issues JWT after DB membership verification
- User email prefix search for add-member UI autocomplete
- JWT carries role as a hint; all authorization decisions verified from DB
- Team slug: immutable 12-char hex (e.g. a1b2c3-d1e2f3), reserved on soft-delete
- Migration adds slug + deleted_at to teams; backfills existing rows
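
A slug in the shape of the example above could be generated like this. The function name and the use of crypto/rand are assumptions; the commit only specifies 12 hex characters and shows the hyphenated form.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
)

// newTeamSlug returns 12 random hex characters split by a hyphen, for example
// "a1b2c3-d1e2f3". Six random bytes encode to exactly 12 hex characters.
func newTeamSlug() (string, error) {
	buf := make([]byte, 6)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	h := hex.EncodeToString(buf)
	return h[:6] + "-" + h[6:], nil
}
```

Because slugs stay reserved after a soft-delete, a caller would still need to retry on a uniqueness violation; that part is left out of the sketch.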

Add team management frontend 1e681da738

- New /dashboard/team page with inline team name editing, slug/ID copy,
  members table with split-button (remove + make admin/member), add member
  typeahead, and danger zone (delete/leave) with confirmation dialogs
- Sidebar now fetches real teams from API, supports team switching and
  team creation via dialog
- Rename nav item Members → Team, route /dashboard/members → /dashboard/team
- New src/lib/api/team.ts with typed functions for all team endpoints

Refine team management: name chars, danger zone, no-team state b3e8bdd171

- Allow hyphens, @, and apostrophes in team names (backend regex)
- After delete/leave, switch to next available team instead of logging
  out; if no teams remain, show a toast prompting to create one
- Disable delete/leave button when user has only one team, with
  explanatory hint to create another team first
- Show empty state on /dashboard/team when auth has no team context,
  pointing user to the sidebar to create a team
- Fetch all teams in parallel with team detail on page load to power
  the isLastTeam guard

Fix user search to trigger on 3 characters without requiring @ 71a7fdb76f

The anti-enumeration guard required @ in the email prefix, causing the
typeahead to silently return nothing until the user typed @. Replace with
a minimum 3-character length check to match the frontend trigger condition.

Fix team name blink on navigation by lifting teams into a singleton store bf494f73fc

Teams list was fetched on every Sidebar mount (each page navigation),
causing a flash from '…' to the real name on every tab switch. Move teams
into a module-level reactive store (teams.svelte.ts) that fetches once per
session and is shared between Sidebar and the team page.

Polish team page: delight micro-interactions and layout improvements 90c296f5e1

- Slug + Team ID rows collapsed into a 2-column grid for better density
- "you" badge moved inline with email instead of stacked below it
- Copy checkmark draws itself via SVG stroke-dashoffset animation
- New member row flashes accent-green on entry
- Removed member row slides out smoothly (fly transition)
- Member rows use staggered fly-in on page load
- Team name briefly highlights accent color after a successful rename
- Search result avatars get colorized initials based on email character

Merge pull request 'Added team related functionalities' (#4) from team-management into dev 336080bb6d

Reviewed-on: wrenn/sandbox#4

Frontend consistency pass: delight, audit, and normalization 915d934c26

Delight (keys page):
- Animated checkmark draw + circle pop on key reveal dialog open
- Key display area pulses accent glow on open to draw eye to "copy this"
- Copy button spring-bounces on successful copy (re-triggers on repeat)
- Empty state key icon floats (iconFloat, now global)
- Row hover uses scaleY left-accent stripe (matches capsules pattern)
- New key row flashes accent on reveal dialog dismiss (matches capsule-born)

Audit fixes (all dashboard pages):
- Page titles standardized to em dash: "Wrenn — X" across all four pages
- formatDate/timeAgo extracted to src/lib/utils/format.ts (string | undefined
  signatures); keys and snapshots now import from there instead of duplicating
- team formatDate gains undefined guard (kept local, date-only format differs)
- spin-once and iconFloat keyframes moved to app.css as globals; scoped copies
  removed from capsules and keys
- Snapshots empty state icon was referencing undefined @keyframes float; fixed
  to iconFloat

Normalization:
- Snapshots table rows: replaced ::before pseudo-element accent (opacity-only,
  single color) with DOM row-stripe element using scaleY transition, type-keyed
  color (green for snapshots, blue for images) — matches capsules pattern
- Create Key dialog: max-w-[400px] → max-w-[420px] to align with form dialogs
- Snapshots count and empty-state heading are now terminology-aware: shows
  "templates/snapshots/images" based on active filter; empty heading for all
  filter reads "No templates yet" instead of "No snapshots yet"

Not done (documented in audit, deferred):
- Sidebar nav items pointing to unimplemented routes (audit, usage, billing,
  notifications, settings) — left as-is, needs product decision
- Dialog max-widths fully normalized beyond Create Key — minor, deferred
- capsules timeAgo not imported from shared util (formatTime differs intentionally)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge pull request 'Frontend consistency and improvements' (#5) from frontend-enhancement into dev aaeccd32ce

Reviewed-on: wrenn/sandbox#5

Add user names, team-scoped sandbox guard, and login robustness fixes 3932bc056e

- Add name column to users (migration + sqlc regen); propagate through JWT
  claims, auth context, all auth/OAuth handlers, service layer, and frontend
- Sidebar and team page show name instead of email; team page splits Name/Email
  into separate columns
- Block sandbox creation in UI and API when user has no active team context
- loginTeam helper falls back to first active team when no default is set,
  fixing login for invited users with no is_default membership
- Exclude soft-deleted teams from GetDefaultTeamForUser, GetBYOCTeams queries
- Guard host creation against soft-deleted teams in service/host.go
- SwitchTeam re-fetches name from DB instead of trusting stale JWT claim
- Reset teams store on login so stale data from a previous session never persists
- Update openapi.yaml: add name to SignupRequest and AuthResponse schemas

Minor frontend enhancements f968da9768

Implement host registration, JWT refresh tokens, and multi-host scheduling 9bf67aa7f7

Replaces the hardcoded CP_HOST_AGENT_ADDR single-agent setup with a
DB-driven registration system supporting multiple host agents (BYOC).

Key changes:
- Host agents register via one-time token, receive a 7-day JWT + 60-day
  refresh token; heartbeat loop auto-refreshes on 401/403 and pauses all
  sandboxes if refresh fails
- HostClientPool: lazy Connect RPC client cache keyed by host ID, replacing
  the single static agent client throughout the API and service layers
- RoundRobinScheduler: picks an online host for each new sandbox via
  ListActiveHosts; extensible for future scheduling strategies
- HostMonitor (replaces Reconciler): passive heartbeat staleness check marks
  hosts unreachable and sandboxes missing after 90s; active reconciliation
  per online host restores missing-but-alive sandboxes and stops orphans
- Graceful host delete: returns 409 with affected sandbox list without
  ?force=true; force-delete destroys sandboxes then evicts pool client
- Snapshot delete broadcasts to all online hosts (templates have no host_id)
- sandbox.Manager.PauseAll: pauses all running VMs on CP connectivity loss
- New migration: host_refresh_tokens table with token rotation (issue-then-
  revoke ordering to prevent lockout on mid-rotation crash)
- New sandbox status 'missing' (reversible, unlike 'stopped') and host
  status 'unreachable'; both reflected in OpenAPI spec
- Fix: refresh token auth failure now returns 401 (was 400 via generic
  'invalid' substring match in serviceErrToHTTP)
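
The round-robin selection can be sketched as follows. The Host type, the Online field, and the Pick signature are hypothetical; the real RoundRobinScheduler gets its candidates from ListActiveHosts, which is not modelled here.

```go
package main

import (
	"errors"
	"sync"
)

// Host is a simplified stand-in for a registered host agent.
type Host struct {
	ID     string
	Online bool
}

// ErrNoHosts is returned when no online host is available.
var ErrNoHosts = errors.New("no online hosts")

// RoundRobin hands out online hosts in rotation. It is safe for concurrent use.
type RoundRobin struct {
	mu   sync.Mutex
	next int
}

// Pick returns the next online host from hosts, skipping offline entries.
func (r *RoundRobin) Pick(hosts []Host) (Host, error) {
	r.mu.Lock()
	defer r.mu.Unlock()

	var online []Host
	for _, h := range hosts {
		if h.Online {
			online = append(online, h)
		}
	}
	if len(online) == 0 {
		return Host{}, ErrNoHosts
	}
	h := online[r.next%len(online)]
	r.next++
	return h, nil
}
```

Keeping the strategy behind a single Pick method is one way to leave room for the other scheduling strategies the commit mentions.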

Add BYOC page, admin section, and is_byoc team visibility gating e069b3e679

- Frontend: BYOC hosts page (/dashboard/byoc) with register/delete flows,
  shimmer loading, pulsing online status, animated token reveal checkmark
- Frontend: Admin section (/admin/hosts) with platform + BYOC tabs, stat
  pills, skeleton loading, slide-in animations for new rows
- Frontend: AdminSidebar component with accent top bar and admin pill badge
- Frontend: BYOC nav item shown only when team.is_byoc is true (derived
  from teams store, not JWT); disabled for members
- Frontend: Admin shield button in Sidebar, visible only to platform admins
- Backend: is_admin in JWT claims + requireAdmin middleware (DB-validated)
- Backend: is_byoc added to teamResponse so frontend derives visibility
  from fresh team data rather than stale JWT fields
- Backend: SetBYOC admin endpoint (PUT /v1/admin/teams/{id}/byoc)
- Backend: Admin hosts list enriches BYOC entries with team_name
- Host agent: load .env file via godotenv on startup

Merge pull request 'Set up working host registration (including BYOC) with the CP' (#6) from host-registration into dev 9878156798

Reviewed-on: wrenn/sandbox#6

Add audit log infrastructure and GET /v1/audit-logs endpoint 1be30034bd

Introduces an append-only audit trail for all user and system actions:
sandbox lifecycle (create/pause/resume/destroy/auto-pause), snapshots,
team rename, API key create/revoke, member add/remove/leave/role_update,
and BYOC host add/delete/marked_down/marked_up.

- New audit_logs table (migration) with team_id, actor, resource,
  action, scope (team|admin), status (success|info|warning|error),
  metadata, and created_at
- AuditLogger (internal/audit) with named fire-and-forget methods per
  event; system actor used for background events (HostMonitor, TTL reaper)
- GET /v1/audit-logs: JWT-only, cursor pagination (max 200), multi-value
  filters for resource_type and action (comma-sep or repeated params);
  members see team-scoped events only, admins/owners see all
- AuthContext extended with APIKeyID + APIKeyName so API key requests
  record meaningful actor identity
- HostMonitor wired with AuditLogger for auto-pause and host marked_down

Add audit logs frontend page 3ce8fdcb02

Infinite-scroll table with hierarchical filter dropdown, expandable
metadata rows, and status-coded visual signals per event severity.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove expandable metadata from audit log rows 6b76abe38e

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge pull request 'Added audit logs for users' (#7) from audit-logs into dev 0414fbe733

Reviewed-on: wrenn/sandbox#7

Added snapshot name dialogue on the UI d4eb24be7e

Bolder, more delightful frontend across all pages 2349f585ae

- app.css: replace flat --shadow-sm token with real shadows; add
  --shadow-card and --shadow-dialog tokens; add @keyframes status-ping
  and .animate-status-ping utility (outward ring ripple, GPU-composited
  via will-change) for live running status dots
- login: headline 5rem → 6.5rem with tighter leading/tracking; expand
  container to 460px; add sage-green dot grid texture layer beneath the
  mouse-reactive glow for industrial depth
- capsules: upgrade all running dots (header chip + row indicators +
  status bar) from opacity-fade to ring ripple; apply --shadow-dialog
  to Launch and Snapshot dialogs
- keys: apply --shadow-dialog to all three dialogs
- audit: remove duplicate @keyframes fadeUp and iconFloat (redundant
  with app.css definitions, audit's fadeUp also subtly diverged)
- sidebar: active indicator bar taller and thicker (h-5 w-[3px] → h-6
  w-1); active bg more vivid (accent/12%); label font-medium →
  font-semibold; team dialog gets --shadow-dialog

Add live stats page with metrics sampling and route split fee66bda50

- New sandbox_metrics_snapshots table sampled every 10s (60-day retention)
- Background MetricsSampler goroutine wired into control plane startup
- GET /v1/sandboxes/stats?range=5m|1h|6h|24h|30d endpoint with adaptive
  polling intervals; reserved CPU/RAM uses ceil(paused/2) formula
- StatsPanel component: 4 stat cards + 2 Chart.js line charts (straight
  lines, integer y-axis for running count, dual-axis for CPU/RAM)
- Range filter persisted in URL query param; polls update data silently
  (no blink — loading state only shown on initial mount)
- Split /dashboard/capsules into /list and /stats sub-routes with shared
  layout; capsuleRunningCount store syncs badge across routes
- CreateCapsuleDialog extracted as reusable component

Fix metrics correctness, redesign stats page 47b0ed5b52

- Replace stale snapshot read (GetCurrentMetrics) with live query
  (GetLiveMetrics) against sandboxes table — always returns correct
  zeros when no capsules are running
- Fix CPU reserved formula: running + starting only; paused VMs no
  longer contribute vCPUs (RAM reservation for paused unchanged)
- Merge top cards into 3 paired Now/Peak cards with colored accent
  borders (green/blue/amber matching chart colors)
- Move Live badge from Running Capsules card to page-level header
- Add colored category dots to card and chart headers
- Charts stacked vertically, flex-1 to fill remaining page height
- vCPUs chart color changed to blue (#5a9fd4), RAM stays amber

Move metrics to dedicated nav item, simplify capsules page 930da8a578

- Add Metrics nav item to sidebar with bar chart icon
- Create /dashboard/metrics page wrapping StatsPanel
- Remove tabs from capsules page (list is now the only view)
- Flatten capsules route: /capsules directly shows the list,
  removing the /list and /stats sub-routes
- Strip redundant title/subtitle from StatsPanel (page header
  provides context)

Fix metrics sampler to record zero-value snapshots when idle e3750f79f9

SampleSandboxMetrics previously filtered WHERE status IN ('running',
'starting', 'paused'), which returned no rows when all capsules were
stopped. This caused zero snapshots to be skipped, leaving the
time-series charts with no trailing data points instead of showing
the expected zero values.

Remove the WHERE filter so the query groups by all teams that have
any sandbox row. The per-status FILTER clauses on the aggregates
already produce correct zero counts for stopped capsules.

Also includes the per-VM RAM ceiling formula change (sum(ceil(each/2))
instead of ceil(sum/2)).
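
The difference between the two roundings can be sketched as follows. This is only an illustration: the function names and the sample sizes are hypothetical, and it assumes the value being halved is a whole number per VM.

```go
package main

import "fmt"

// ceilHalf returns ceil(n/2) for a non-negative integer n.
func ceilHalf(n int) int {
	return (n + 1) / 2
}

// perVMReserved rounds up per VM and then sums: sum(ceil(each/2)).
func perVMReserved(sizes []int) int {
	total := 0
	for _, s := range sizes {
		total += ceilHalf(s)
	}
	return total
}

// aggregateReserved sums first and rounds once: ceil(sum/2).
func aggregateReserved(sizes []int) int {
	sum := 0
	for _, s := range sizes {
		sum += s
	}
	return ceilHalf(sum)
}

func main() {
	sizes := []int{3, 3, 3} // hypothetical per-VM values
	fmt.Println(perVMReserved(sizes), aggregateReserved(sizes))
}
```

With odd per-VM values the per-VM form is larger, because each VM's half is rounded up before summing.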

Move metrics to after templates in sidebar nav 45793e181c

Split CPU and RAM into separate side-by-side charts a69b0f579c

CPU (vCPUs) and RAM (GB) use different units and scales, so combining
them on a dual-axis chart was misleading. Each now has its own chart
card, laid out side-by-side.

Bolder stats page layout with stronger visual hierarchy b0e6f5ffb3

- Accent stripes: 3px → 5px; indicator dots: 6px → 8px
- Peak values step down to text-[1.714rem]/text-secondary so Now values read as the clear hero
- Now labels: semibold + uppercase for weight parity with the metric
- Cell padding py-5 → py-6; outer gap-7/pt-4 → gap-8/pt-6 for breathing room
- Chart fills: 7-8% → 11-13% opacity; lines: 1.5 → 2px
- Tick labels brighter (#635f5c), grid lines slightly more visible
- Running capsules chart: min-height 220 → 260px

Fix capsules table blink on background poll refresh 8d5ba3873a

Poll fetches now silently update data without triggering loading
states, spinner animations, or row fadeUp re-animations. Only manual
refresh shows the spin indicator.

Bugfix: cgroup2 related error inside the sandbox 7473c15f52

Add per-sandbox CPU/memory/disk metrics collection 9acdbb5ae9

Samples /proc/{fc_pid}/stat (CPU%), /proc/{fc_pid}/status (VmRSS), and
stat() on CoW files at 500ms intervals per running sandbox. Three tiered
ring buffers downsample into 30s and 5min averages for 10min/2h/24h
retention. Metrics are flushed to DB on pause (all tiers) and destroy
(24h only). New GetSandboxMetrics and FlushSandboxMetrics RPCs on the
host agent, proxied through GET /v1/sandboxes/{id}/metrics?range= on
the control plane. Returns live data for running sandboxes, DB data for
paused, and 404 for stopped.

Add 5m, 1h, 6h, 12h range filters to metrics endpoint 49b0b646a8

Maps each user-facing range to the appropriate underlying ring buffer
tier and applies a time cutoff filter. No new ring buffers needed —
5m/10m read from the 10m tier, 1h/2h from the 2h tier, 6h/12h/24h
from the 24h tier.
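
A minimal sketch of that mapping, assuming the tiers are identified by the labels used above; the function name `tierFor` and the return shape are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// tierFor maps a user-facing range to the ring buffer tier that backs it
// and the window used for the time cutoff.
func tierFor(rng string) (tier string, window time.Duration, err error) {
	switch rng {
	case "5m":
		return "10m", 5 * time.Minute, nil
	case "10m":
		return "10m", 10 * time.Minute, nil
	case "1h":
		return "2h", time.Hour, nil
	case "2h":
		return "2h", 2 * time.Hour, nil
	case "6h":
		return "24h", 6 * time.Hour, nil
	case "12h":
		return "24h", 12 * time.Hour, nil
	case "24h":
		return "24h", 24 * time.Hour, nil
	}
	return "", 0, fmt.Errorf("unsupported range %q", rng)
}

func main() {
	tier, window, err := tierFor("6h")
	if err != nil {
		panic(err)
	}
	// Points older than the cutoff are dropped from the tier's data.
	cutoff := time.Now().Add(-window)
	fmt.Println(tier, window, cutoff.Before(time.Now()))
}
```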

Minor improvement 88cb24bb86

Fix LIKE pattern injection in user email search 6eacf0f735

Escape LIKE metacharacters (% and _) in the email prefix before passing
to the SQL query, and enforce the documented '@' requirement to prevent
broad user enumeration. Move search logic out of TeamService into
usersHandler since it is a site-wide lookup, not team-scoped.
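
The escaping step might look like the sketch below. The helper name is hypothetical, and it assumes the query relies on backslash as the LIKE escape character (the PostgreSQL default).

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// likeEscaper escapes the LIKE metacharacters % and _, plus the escape
// character itself, so user input can only match literally.
var likeEscaper = strings.NewReplacer(`\`, `\\`, `%`, `\%`, `_`, `\_`)

// emailPrefixPattern builds a prefix-match pattern from user input.
// It rejects input without '@' so a short prefix cannot enumerate users.
func emailPrefixPattern(prefix string) (string, error) {
	if !strings.Contains(prefix, "@") {
		return "", errors.New("email prefix must contain '@'")
	}
	return likeEscaper.Replace(prefix) + "%", nil
}

func main() {
	p, err := emailPrefixPattern("a_b%@ex")
	fmt.Println(p, err)
	// The pattern is then bound as a parameter, e.g. WHERE email LIKE $1.
}
```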

Push GetSandboxMetricPoints time filter into SQL 27ff828e60

The query was fetching all rows for a (sandbox_id, tier) pair and
filtering by timestamp in Go. For repeatedly-paused sandboxes the
24h tier can accumulate up to 30 days of data, causing up to 120x
over-fetching for a 6h range request.

Add AND ts >= $3 to the query so Postgres filters on the primary key
(sandbox_id, tier, ts) directly. Drop the redundant Go-side loop.

Add per-capsule stats detail page with live CPU/RAM charts ed7880bc6c

- New detail page at /dashboard/capsules/[id] with Stats and Files tabs
- Stats tab shows capsule info card (status, template, CPU, memory, disk,
  started, idle timeout) and two stacked Chart.js charts with live values
- Metrics API client with 10s polling and moving-average smoothing
- Capsule ID in list table is now a clickable link to the detail page
- Layout breadcrumb header (Capsules > sb-xxx) with back navigation
- Fix metrics sampler: use v.PID() directly as Firecracker PID since
  unshare -m execs (not forks) through the bash/ip-netns-exec/firecracker
  chain, so all share the same PID. Removes unused findChildPID.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge pull request 'Added metrics' (#9) from metrics into dev 8cdf91d895

Reviewed-on: wrenn/sandbox#9

WIP: Add socat injection to rootfs build scripts 602ee470d9

Inject a statically-linked socat binary into rootfs images. envd's
port forwarder requires socat to bridge localhost-listening services
(e.g. Jupyter kernel) to the guest TAP interface.

Both scripts follow the same 3-step resolution: check rootfs, check
host, build from source (http://www.dest-unreach.org/socat/ v1.8.1.1).
Static linkage is verified before injection.

This is an intermediate state — needs further work for the full code
interpreter feature.

WIP: Add HTTP proxy endpoint to host agent f4675ebfc0

Add /proxy/{sandbox_id}/{port}/* handler that reverse-proxies HTTP
requests to services running inside sandbox VMs. The sandbox's host IP
(10.11.0.{idx}) is used as the upstream target.

Includes port validation (1-65535) and shared HTTP transport for
connection pooling. Supports WebSocket upgrades for protocols like
Jupyter's streaming API.

This is an intermediate state — needs further work for the full code
interpreter feature.

WIP: Add sandbox proxy catch-all to control plane 4be65b0abb

Add SandboxProxyWrapper that intercepts requests with Host headers
matching {port}-{sandbox_id}.{domain} and proxies them through the
owning host agent's /proxy endpoint.
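
A sketch of the Host header match, under stated assumptions: the regular expression, the function name, and the ID character class are illustrative guesses rather than the wrapper's actual code. The 1-65535 port check mirrors the validation described for the host agent proxy.

```go
package main

import (
	"fmt"
	"net"
	"regexp"
	"strconv"
)

// hostRe matches {port}-{sandbox_id}.{domain}, where the sandbox ID is
// assumed to be a short prefix, a hyphen, and lowercase alphanumerics.
var hostRe = regexp.MustCompile(`^(\d{1,5})-([a-z]+-[0-9a-z]+)\.(.+)$`)

// parseProxyHost splits a Host header into port, sandbox ID, and domain.
func parseProxyHost(host string) (port int, sandboxID, domain string, ok bool) {
	// Drop a trailing ":8000"-style listener port if present.
	if h, _, err := net.SplitHostPort(host); err == nil {
		host = h
	}
	m := hostRe.FindStringSubmatch(host)
	if m == nil {
		return 0, "", "", false
	}
	p, err := strconv.Atoi(m[1])
	if err != nil || p < 1 || p > 65535 {
		return 0, "", "", false
	}
	return p, m[2], m[3], true
}

func main() {
	fmt.Println(parseProxyHost("8888-sb-abc123.localhost:8000"))
	fmt.Println(parseProxyHost("localhost:8000"))
}
```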

Authentication is via X-API-Key only (no JWT). The API key's team must
own the sandbox. Export EnsureScheme from lifecycle package for reuse.

Request flow: SDK -> Caddy -> CP catch-all -> Host Agent -> sandbox VM.

This is an intermediate state — needs further work for the full code
interpreter feature.

WIP: Add Caddy reverse proxy for dev environment b0a8b498a8

Add Caddy to docker-compose as the single entry point on port 8000:
- localhost -> /api/* stripped and proxied to CP:8080, /* to frontend:5173
- *.localhost -> proxied to CP:8080 (sandbox proxy catch-all)
- Direct /v1/*, /auth/*, /docs routes proxied to CP

Move CP from :8000 to :8080 (its default). Caddy takes :8000.
Update .env.example, vite proxy target (kept as fallback), and Makefile
dev targets (pg_isready via docker exec, frontend binds 0.0.0.0).

This is an intermediate state — needs further work for the full code
interpreter feature.

Fix static build: disable prerender for dynamic capsule detail route 139f86bf9c

The [id] route cannot be prerendered at build time since IDs are unknown.
With adapter-static's index.html fallback, the route is handled client-side.

Minor UI copy updates across capsules and templates pages 12d1e356fa

Replace one-shot clock_settime with chrony for continuous guest time sync 6898528096

Switch from the envd /init endpoint pushing host time via syscall to
chronyd reading the KVM PTP hardware clock (/dev/ptp0) continuously.
This fixes clock drift between init calls and handles snapshot resume
gracefully.

Changes:
- Add clocksource=kvm-clock kernel boot arg
- Start chronyd in wrenn-init.sh before tini (PHC /dev/ptp0, makestep 1.0 -1)
- Remove clock_settime logic from envd SetData and shouldSetSystemTime
- Remove client.Init() clock sync calls from sandbox manager (3 sites)
- Remove Init() method from envdclient (no longer needed)
- Simplify rootfs scripts: socat/chrony now come from apt in the container
  image, only envd/wrenn-init/tini are injected by build scripts

Add template build system with admin panel, async workers, and FlattenRootfs RPC 1ce62934b3

Introduces an end-to-end template building pipeline: admins submit a recipe
(list of shell commands) via the dashboard, a Redis-backed worker pool spins
up a sandbox, executes each command, and produces either a full snapshot
(with healthcheck) or an image-only template (rootfs flattened via a new
FlattenRootfs host-agent RPC). Build progress and per-step logs are persisted
to a new template_builds table and polled by the frontend.

Backend:
- New FlattenRootfs RPC (proto + host agent + sandbox manager)
- BuildService with Redis queue (BLPOP) and configurable worker pool (default 2)
- Admin-only REST endpoints: POST/GET /v1/admin/builds, GET /v1/admin/builds/{id}
- Migration for template_builds table with JSONB logs and recipe columns
- sqlc queries for build CRUD and progress updates

Frontend:
- /admin/templates page with Templates + Builds tabs
- Create Template dialog with recipe textarea, healthcheck, specs
- Build history with expandable per-step logs, status badges, progress bars
- Auto-polling every 3s for active builds
- AdminSidebar updated with Templates nav item

Fix review issues: detached contexts, loop device leak, timer leak, size_bytes cdd89a7cee

- Use context.Background() with timeout in destroySandbox/failBuild so
  cleanup and DB writes survive parent context cancellation on shutdown
- Fix loop device refcount leak in FlattenRootfs when dmDevice is nil
- Replace time.After with time.NewTimer in healthcheck polling so the
  timer is stopped and released when the healthcheck passes early
- Capture size_bytes from CreateSnapshot/FlattenRootfs RPC responses
  instead of hardcoding 0 in the templates table insert
- Avoid leaking internal error details to API clients in build handler
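
The timer change in the list above can be sketched as a polling loop that owns its timer and stops it on every exit path. The function names are hypothetical and the loop is a simplified stand-in for the real healthcheck polling.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pollUntil calls check every interval until it returns true, the timeout
// elapses, or ctx is cancelled.
func pollUntil(ctx context.Context, interval, timeout time.Duration, check func() bool) error {
	deadline := time.NewTimer(timeout)
	defer deadline.Stop() // released right away when check passes early
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if check() {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-deadline.C:
			return errors.New("healthcheck timed out")
		case <-ticker.C:
		}
	}
}

// demo runs pollUntil against a check that succeeds on its third call.
func demo() (calls int, err error) {
	err = pollUntil(context.Background(), time.Millisecond, time.Second, func() bool {
		calls++
		return calls >= 3
	})
	return calls, err
}

func main() {
	calls, err := demo()
	fmt.Println(calls, err)
}
```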

Switch database IDs from TEXT to native UUID 4ddd494160

Consolidate 16 migrations into one with UUID columns for all entity
IDs. TEXT is kept only for polymorphic fields (audit_logs.actor_id,
resource_id) and template names. The id package now generates UUIDs
via google/uuid, with Format*/Parse* helpers for the prefixed wire
format (sb-{uuid}, usr-{uuid}, etc.). Auth context, services, and
handlers pass pgtype.UUID internally; conversion to/from prefixed
strings happens at API and RPC boundaries. Adds PlatformTeamID
(all-zeros UUID) for shared resources.

Add disk_size_mb, auto-expand base images, admin templates endpoint c0d6381bbe

Disk sizing:
- Add disk_size_mb column to sandboxes table (default 20480 = 20GB)
- Add disk_size_mb to CreateSandboxRequest proto, passed through the
  full chain: service → RPC → host agent → sandbox manager → devicemapper
- devicemapper.CreateSnapshot takes separate cowSizeBytes param so the
  sparse CoW file can be sized independently from the origin
- EnsureImageSizes() runs at host agent startup: expands any base image
  smaller than 20GB via truncate + resize2fs (sparse, no extra physical
  disk). Sandboxes then get the full 20GB via fast dm-snapshot path
- FlattenRootfs shrinks output images with resize2fs -M so stored
  templates are compact; EnsureImageSizes re-expands on next startup

Admin templates visibility:
- Add GET /v1/admin/templates endpoint listing all templates across teams
- Frontend admin templates page uses listAdminTemplates() instead of
  team-scoped listSnapshots()
- Platform templates (team_id = all-zeros UUID) now visible to all teams:
  GetTemplateByTeam, ListTemplatesByTeam, ListTemplatesByTeamAndType
  queries include platform team_id in WHERE clause

Add admin template deletion with broadcast to all hosts 5cb37bf2a0

- DELETE /v1/admin/templates/{name} endpoint (admin-only)
- Broadcasts DeleteSnapshot RPC to all online hosts before removing DB record
- Frontend admin templates page uses deleteAdminTemplate() instead of
  team-scoped deleteSnapshot()
- Delete button shown for all template types, not just snapshots

Add pre/post build stages to template builds c8acac92cc

Pre-build: apt update
Post-build: apt clean, apt autoremove, rm apt lists

Total steps count includes pre/post commands for accurate progress bars.

Add pre/post build stages, fix exec timeout, expand guest PATH 3509ca90e8

Build phases:
- Pre-build (apt update) and post-build (apt clean, autoremove, rm lists)
  run with 10-minute timeout; user recipe commands keep 30s timeout
- Log entries include phase field for UI grouping
- Always send explicit TimeoutSec to host agent (0 defaulted to 30s)

Frontend:
- Pre-build/post-build steps show phase label without exposing commands
- Recipe steps numbered independently starting from 1

Guest PATH:
- Add /usr/games:/usr/local/games to wrenn-init.sh PATH export
  (standard Ubuntu paths, needed for packages like cowsay)

Switch API ID format from UUID to base36 for compact, E2B-style IDs c89a664a37

DB stays native UUID; the format/parse layer now encodes 16 UUID bytes
as 25-char lowercase alphanumeric (base36) strings instead of the
standard 36-char hex-with-dashes format. e.g. sb-2e5glxi4g3qnhwci95qev0cg0
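
One way to get a fixed 25-character encoding is to treat the 16 bytes as a big-endian integer and left-pad its base36 form with zeros. Whether the id package does exactly this is an assumption; the function names below are hypothetical.

```go
package main

import (
	"fmt"
	"math/big"
	"strings"
)

// idLen is the fixed width: 36^25 exceeds 2^128, so 25 digits always fit.
const idLen = 25

// encodeBase36 renders 16 bytes as a zero-padded lowercase base36 string.
func encodeBase36(id [16]byte) string {
	s := new(big.Int).SetBytes(id[:]).Text(36)
	return strings.Repeat("0", idLen-len(s)) + s
}

// decodeBase36 reverses encodeBase36 and rejects malformed input.
func decodeBase36(s string) ([16]byte, error) {
	var out [16]byte
	if len(s) != idLen {
		return out, fmt.Errorf("want %d chars, got %d", idLen, len(s))
	}
	n, ok := new(big.Int).SetString(s, 36)
	if !ok || n.Sign() < 0 || n.BitLen() > 128 {
		return out, fmt.Errorf("invalid base36 id %q", s)
	}
	n.FillBytes(out[:]) // big-endian, left-padded with zero bytes
	return out, nil
}

func main() {
	var id [16]byte
	id[15] = 1
	s := encodeBase36(id)
	back, err := decodeBase36(s)
	fmt.Println("sb-"+s, back == id, err)
}
```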

Fix snapshot race, delete auth, sparse dd, default disk to 5GB 34af77e0d8

Snapshot race fix:
- Pre-mark sandbox as "paused" in DB before issuing CreateSnapshot and
  PauseSandbox RPCs, preventing the reconciler from marking it "stopped"
  during the flatten window when the sandbox is gone from the host
  agent's in-memory map but DB still says "running"
- Revert status to "running" on RPC failure
- Check ctx.Err() before writing response to avoid writing to dead
  connections when client disconnects during long snapshot operations

Delete auth fix:
- Block non-admin deletion of platform templates (team_id = all-zeros)
  at DELETE /v1/snapshots/{name} with 403, preventing file deletion
  before the team ownership check fails

Sparse dd:
- Add conv=sparse to dd in FlattenSnapshot so flattened images preserve
  sparseness (~200MB actual vs 5GB logical)

Default disk size:
- Change default disk_size_mb from 20GB to 5GB across migration,
  manager, service, build, and EnsureImageSizes
- Disable split-button dropdown arrow for platform templates in
  dashboard snapshots page (teams cannot delete platform templates)

Remove slug from team page UI 03e96629c7

Add UUID-based template IDs and team-scoped template directory layout 75b28ed899

Introduces internal/layout package for centralized path construction,
migrates templates from name-based TEXT primary keys to UUID PKs with
team-scoped directories (WRENN_DIR/images/teams/{team_id}/{template_id}).
The built-in minimal template uses sentinel zero UUIDs. Proto messages
carry team_id + template_id alongside deprecated template name field.
Team deletion now cleans up template files across all hosts.

Rename AGENT_*/CP_LISTEN_ADDR env vars to WRENN_* prefix 906cc42d13

AGENT_FILES_ROOTDIR → WRENN_DIR, AGENT_LISTEN_ADDR → WRENN_HOST_LISTEN_ADDR,
AGENT_CP_URL → WRENN_CP_URL, AGENT_HOST_INTERFACE → WRENN_HOST_INTERFACE,
CP_LISTEN_ADDR → WRENN_CP_LISTEN_ADDR. Consolidates all env vars under a
consistent WRENN_ namespace.

Seed minimal template in DB and protect it from deletion 46d60fc5a5

Insert a minimal template row (all-zeros UUID) so it appears in both
team and admin template listings. Guard delete endpoints to prevent
removal of the minimal template.

Prefix network namespaces with wrenn-, add stale cleanup, lower diff cap 1ca10230a9

Rename ns-{idx} to wrenn-ns-{idx} and veth-{idx} to wrenn-veth-{idx}
to avoid collisions with other tools. Add CleanupStaleNamespaces() at
agent startup to remove orphaned namespaces, veths, iptables rules, and
routes from a previous crash. Lower maxDiffGenerations from 10 to 8 to
prevent Go runtime memory corruption from snapshot/restore drift.

Replace Full snapshot fallback with file-level diff merge 8f06fc554a

Always use Firecracker Diff snapshots (fast, only changed pages) and
merge diff files at the file level when the generation cap is reached.
The previous approach used Firecracker's Full snapshot type which dumps
all memory to disk and can timeout, losing all snapshot data on failure.

Add snapshot.MergeDiffs() which reads each block from the appropriate
generation's diff file via the header mapping and writes them into a
single consolidated file with a fresh generation-0 header.

Rename sandbox prefix to cl-, add MMDS metadata, fix proxy port routing 88f919c4ca

- Change sandbox ID prefix from sb- to cl- (capsule) throughout
- Fix proxy URL regex character class: base36 uses 0-9a-z, not just hex
- Add MMDS V2 config and metadata to VM boot flow so envd can read
  WRENN_SANDBOX_ID and WRENN_TEMPLATE_ID from inside the guest
- Pass TemplateID through VMConfig into both fresh and snapshot boot paths

Add mTLS to CP→agent channel 25ce0729d5

- Internal ECDSA P-256 CA (WRENN_CA_CERT/WRENN_CA_KEY env vars); when absent
  the system falls back to plain HTTP so dev mode works without certificates
- Host leaf cert (7-day TTL, IP SAN) issued at registration and renewed on
  every JWT refresh; fingerprint + expiry stored in DB (cert_expires_at column
  replaces the removed mtls_enabled flag)
- CP ephemeral client cert (24-hour TTL) via CPCertStore with atomic hot-swap;
  background goroutine renews it every 12 hours without restarting the server
- Host agent uses tls.Listen + httpServer.Serve so GetCertificate callback is
  respected (ListenAndServeTLS loads the cert from disk when given file paths)
- Sandbox reverse proxy now uses pool.Transport() so it shares the same TLS
  config as the Connect RPC clients instead of http.DefaultTransport
- Credentials file renamed host-credentials.json with cert_pem/key_pem/
  ca_cert_pem fields; duplicate register/refresh response structs collapsed
  to authResponse
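
The hot-swap idea in the list above can be sketched with an atomic pointer behind a GetCertificate callback. The type and method names are hypothetical, and the empty tls.Certificate is a placeholder for a real leaf issued by the CA.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"sync/atomic"
)

// certStore holds the current certificate and allows lock-free replacement.
type certStore struct {
	cur atomic.Pointer[tls.Certificate]
}

// Swap installs a renewed certificate; in-flight handshakes keep the old one.
func (s *certStore) Swap(c *tls.Certificate) { s.cur.Store(c) }

// GetCertificate has the signature expected by tls.Config.GetCertificate.
func (s *certStore) GetCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	c := s.cur.Load()
	if c == nil {
		return nil, errors.New("no certificate loaded")
	}
	return c, nil
}

// demo shows the callback failing while empty and succeeding after a swap.
func demo() (emptyFails, swappedWorks bool) {
	store := &certStore{}
	cfg := &tls.Config{GetCertificate: store.GetCertificate, MinVersion: tls.VersionTLS12}
	_, err := cfg.GetCertificate(nil)
	emptyFails = err != nil
	store.Swap(&tls.Certificate{}) // placeholder certificate
	c, err := cfg.GetCertificate(nil)
	swappedWorks = err == nil && c != nil
	return emptyFails, swappedWorks
}

func main() {
	fmt.Println(demo())
}
```

A config like this would be passed to tls.Listen, so each new handshake picks up whichever certificate was stored last.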

Add skip_pre_post build option, cancel endpoint, and recipe package 948db13bed

- skip_pre_post flag on builds bypasses apt update/clean pre/post steps for
  faster iteration when the recipe handles its own environment setup
- POST /v1/admin/builds/{id}/cancel endpoint marks an in-progress build as
  cancelled; UpdateBuildStatus now also sets completed_at for 'cancelled'
- internal/recipe: typed recipe parser and executor (RUN/ENV/COPY steps)
  replacing the raw string slice approach in the build worker
- pre/post build commands prefixed with RUN to match recipe step format
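
A typed recipe parser along these lines could look like the sketch below. Only the RUN/ENV/COPY keywords come from the commit; the `step` type, the one-instruction-per-line syntax, and the error messages are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// step is one parsed recipe instruction.
type step struct {
	Kind string // "RUN", "ENV", or "COPY"
	Args string // everything after the keyword
}

// parseRecipe turns raw recipe lines into typed steps, skipping blank lines.
func parseRecipe(lines []string) ([]step, error) {
	var steps []step
	for i, line := range lines {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		kind, args, _ := strings.Cut(line, " ")
		switch kind {
		case "RUN", "ENV", "COPY":
			args = strings.TrimSpace(args)
			if args == "" {
				return nil, fmt.Errorf("line %d: %s needs arguments", i+1, kind)
			}
			steps = append(steps, step{Kind: kind, Args: args})
		default:
			return nil, fmt.Errorf("line %d: unknown instruction %q", i+1, kind)
		}
	}
	return steps, nil
}

func main() {
	steps, err := parseRecipe([]string{
		"RUN apt update",
		"ENV PATH=/opt/venv/bin:$PATH",
	})
	fmt.Println(steps, err)
}
```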

Fix lint warnings: drop deprecated Name field from snapshot response, check errcheck in benchmark 377e856c8f

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add pre-pause proxy connection drain and sandbox proxy caching 2b4c5e0176

Introduce ConnTracker (atomic.Bool + WaitGroup) to track in-flight proxy
connections per sandbox. Before pausing a VM, the manager drains active
connections with a 2s grace period, preventing Go runtime corruption
inside the guest caused by stale TCP state surviving Firecracker
snapshot/restore.

Also add:
- AcquireProxyConn on Manager for atomic lookup + connection tracking
- Proxy cache (120s TTL) on CP SandboxProxyWrapper with single-query
  DB lookup (GetSandboxProxyTarget) to avoid two round-trips
- Reset() on ConnTracker to re-enable connections if pause fails
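
The drain behaviour can be sketched as below. Note one deliberate swap: this version guards the draining flag with a mutex instead of an atomic.Bool, so that no Add can race with Wait. Method names follow the commit where it names them; everything else is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ConnTracker counts in-flight proxy connections for one sandbox.
type ConnTracker struct {
	mu       sync.Mutex
	draining bool
	wg       sync.WaitGroup
}

// Acquire registers a connection, or returns false once draining has begun.
func (c *ConnTracker) Acquire() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.draining {
		return false
	}
	c.wg.Add(1)
	return true
}

// Release marks a connection as finished.
func (c *ConnTracker) Release() { c.wg.Done() }

// Drain refuses new connections and waits up to grace for active ones.
// It reports whether all connections finished in time. On timeout the
// waiter goroutine lingers until the remaining connections are released.
func (c *ConnTracker) Drain(grace time.Duration) bool {
	c.mu.Lock()
	c.draining = true
	c.mu.Unlock()
	done := make(chan struct{})
	go func() { c.wg.Wait(); close(done) }()
	timer := time.NewTimer(grace)
	defer timer.Stop()
	select {
	case <-done:
		return true
	case <-timer.C:
		return false
	}
}

// Reset re-enables connections, for example after a failed pause.
func (c *ConnTracker) Reset() {
	c.mu.Lock()
	c.draining = false
	c.mu.Unlock()
}

func main() {
	tr := &ConnTracker{}
	fmt.Println(tr.Acquire()) // connection opens
	tr.Release()              // connection closes
	fmt.Println(tr.Drain(2 * time.Second))
	fmt.Println(tr.Acquire()) // refused while draining
	tr.Reset()
	fmt.Println(tr.Acquire())
	tr.Release()
}
```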

Replace gopsutil port scanner with direct /proc/net/tcp reading 8b5fa3438e

The envd port scanner used gopsutil's net.Connections() which walks
/proc/{pid}/fd to enumerate socket inodes. This corrupts Go runtime
semaphore state when the VM is paused mid-operation and restored from
a Firecracker snapshot.

Replace with a direct /proc/net/tcp + /proc/net/tcp6 parser that reads
a single file per address family — no /proc/{pid}/fd walk, no goroutines,
no WaitGroups. Also replace concurrent-map (smap) in the scanner with a
plain sync.RWMutex-protected map, since concurrent-map's Items() spawns
goroutines with a WaitGroup internally, which is equally unsafe across
snapshot boundaries.

Use socket inode instead of PID for the port forwarding map key, since
inode is available directly from /proc/net/tcp without the fd walk.
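
A minimal sketch of such a parser, assuming the standard /proc/net/tcp column layout (local address in column 2, state in column 4, inode in column 10). The type and function names are hypothetical, and the sample row is made up.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// listener is one listening socket found in /proc/net/tcp or tcp6.
type listener struct {
	Port  uint16
	Inode uint64
}

// parseListeners extracts listening sockets from the file content.
// The header and non-listening rows fail the state check and are skipped.
func parseListeners(content string) ([]listener, error) {
	var out []listener
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) < 10 || f[3] != "0A" { // 0A is TCP_LISTEN
			continue
		}
		_, portHex, ok := strings.Cut(f[1], ":")
		if !ok {
			continue
		}
		port, err := strconv.ParseUint(portHex, 16, 16)
		if err != nil {
			return nil, fmt.Errorf("bad port %q: %w", portHex, err)
		}
		inode, err := strconv.ParseUint(f[9], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("bad inode %q: %w", f[9], err)
		}
		out = append(out, listener{Port: uint16(port), Inode: inode})
	}
	return out, sc.Err()
}

const sample = `  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:22B8 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 31337 1
   1: 0100007F:A1B2 0100007F:22B8 01 00000000:00000000 00:00000000 00000000  1000        0 31400 1
`

func main() {
	ls, err := parseListeners(sample)
	fmt.Println(ls, err)
}
```

Keying the forwarding map by `Inode` needs nothing beyond what this single read provides.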

Merge pull request 'Feature: HTTP communication with sandbox' (#10) from code-interpreter into dev ab38c8372c

Reviewed-on: wrenn/sandbox#10

Minor temporary fix for sitewide metrics 9a52b47786

Merge pull request 'Minor temporary fix for sitewide metrics' (#11) from patch/analytics into dev f57fe85492

Reviewed-on: wrenn/sandbox#11

feat: add env expansion, sandbox env fetching, and configurable healthchecks 4f340b8847

Fix ENV instructions to expand $VAR references at set time using the
current env state, preventing self-referencing values like
PATH=/opt/venv/bin:$PATH from producing recursive expansions. Remove
expandEnv from shellPrefix to avoid double expansion.
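
Expanding at set time can be illustrated with the standard library's os.Expand standing in for the project's expandEnv; the `setEnv` helper and the sample values are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// setEnv stores key with every $VAR or ${VAR} reference in value resolved
// against the current env state, so the stored value holds no references.
func setEnv(env map[string]string, key, value string) {
	env[key] = os.Expand(value, func(name string) string {
		return env[name]
	})
}

func main() {
	env := map[string]string{"PATH": "/usr/bin:/bin"}
	setEnv(env, "PATH", "/opt/venv/bin:$PATH")
	fmt.Println(env["PATH"])
}
```

Because the stored value is already fully resolved, a later expansion pass has nothing left to expand, which is why expanding again in the shell prefix would be redundant.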

Fetch sandbox environment variables via `env` before recipe execution
so ENV steps resolve against actual runtime values from the base
template image.

Replace hardcoded healthcheck timing with a Dockerfile-like flag parser
supporting --interval, --timeout, --start-period, and --retries. Add
start-period grace window and bounded retry counting to
waitForHealthcheck.
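
A flag parser of this shape can be sketched with the standard flag package. The struct, the function name, and the default values (borrowed from Docker's HEALTHCHECK defaults) are assumptions, and splitting on whitespace ignores shell quoting.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"strings"
	"time"
)

// healthcheck holds the parsed timing flags and the command to run.
type healthcheck struct {
	Interval    time.Duration
	Timeout     time.Duration
	StartPeriod time.Duration
	Retries     int
	Cmd         string
}

// parseHealthcheck reads leading --flags and treats the rest as the command.
func parseHealthcheck(s string) (healthcheck, error) {
	var hc healthcheck
	fs := flag.NewFlagSet("healthcheck", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // report errors via the return value only
	fs.DurationVar(&hc.Interval, "interval", 30*time.Second, "time between checks")
	fs.DurationVar(&hc.Timeout, "timeout", 30*time.Second, "limit for one check")
	fs.DurationVar(&hc.StartPeriod, "start-period", 0, "grace window before failures count")
	fs.IntVar(&hc.Retries, "retries", 3, "consecutive failures allowed")
	if err := fs.Parse(strings.Fields(s)); err != nil {
		return hc, err
	}
	// Flag parsing stops at the first non-flag word, so the command's own
	// options (such as "-f") are left untouched.
	hc.Cmd = strings.Join(fs.Args(), " ")
	if hc.Cmd == "" {
		return hc, errors.New("healthcheck command is empty")
	}
	return hc, nil
}

func main() {
	hc, err := parseHealthcheck("--interval=5s --start-period=10s --retries 3 curl -f http://localhost:8888/")
	fmt.Printf("%+v %v\n", hc, err)
}
```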

Add python-interpreter-v0-beta recipe and healthcheck files.

Merge branch 'dev' into feat/python-code-interpreter bf05677bef

Modified expandEnv to use regex. 9852f96127

Updated recipefile with test script to check code execution with state
management

Removed incorrect example cert format 4dc8cc3867

Merge branch 'dev' into feat/python-code-interpreter 11e08e5b96

Fix expandEnv regex, init script crash, healthcheck deadline, and test issues 0ea0e7cc70

- Fix envRegex: remove spurious (\$)? group that swallowed $$$, handle ${}
- wrenn-init.sh: add || true to networking commands under set -e, remove dead code
- waitForHealthcheck: use context deadline for unlimited retries instead of implicit 100 cap
- Make parseSandboxEnv a package-level function (unused receiver)
- Fix WrappedCommand test: map iteration order dependency, pre-expand env values
- Fix error wrapping: %v → %w per project conventions
- test-jupyter-kernel.py: move import to top-level, fix misleading comment

Merge pull request 'Changes for a python code interpreter' (#12) from feat/python-code-interpreter into dev 2737288a2b

Reviewed-on: wrenn/sandbox#12

Enforce mandatory mTLS for CP↔agent communication c8615466be

Both the control plane and host agent now refuse to start without valid
mTLS configuration, closing the unauthenticated proxy/RPC attack surface
that existed when running in plain HTTP fallback mode.

chore: add gstack skill routing rules to CLAUDE.md 3675ecba65

fix: security hardening from CSO audit dd50cfdcb1

- Add auth failure logging (login, API key, JWT) with IP/email/prefix
- Move OAuth JWT from URL params to short-lived cookies to prevent
  token leakage via browser history, server logs, and Referer headers
- Pin Swagger UI to v5.18.2 with SRI integrity hashes
- Upgrade Go toolchain to 1.25.8 (fixes 5 called stdlib vulns)
- Fix unchecked error in host agent credential refresh
- Add .gstack to .gitignore for security report artifacts

Fix review findings: IP collision, pause race, proxy path, ENV ordering, conn drain e3ffa576ce

- Fix IP address collision at slot 32768+ by using bitwise shifts instead of
  byte-truncating division in network slot addressing
- Add per-sandbox lifecycleMu to serialize concurrent Pause/Destroy calls
- Sanitize proxy forwarding path with path.Clean
- Sort ENV keys in recipe shell preamble for deterministic ordering
- Fix ConnTracker goroutine leak by adding cancel channel to Drain/Reset
- Update context_test to assert deterministic ENV ordering
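
The deterministic ENV ordering from the list above comes down to sorting map keys before emitting them, since Go map iteration order is random. The helper names and the single-quote shell quoting below are illustrative assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// shellQuote wraps v in single quotes, escaping embedded single quotes.
func shellQuote(v string) string {
	return "'" + strings.ReplaceAll(v, "'", `'\''`) + "'"
}

// envPreamble renders export lines in sorted key order so the output is
// identical on every run.
func envPreamble(env map[string]string) string {
	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "export %s=%s\n", k, shellQuote(env[k]))
	}
	return b.String()
}

func main() {
	fmt.Print(envPreamble(map[string]string{"PATH": "/opt/venv/bin:/usr/bin", "LANG": "C.UTF-8"}))
}
```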

Changed redis dependency to keydb a9ca13b238

Expose host up/down audit events to BYOC teams and refresh dashboard navigation e2beef817d

Change host marked_down/marked_up audit log scope from "admin" to "team" so
BYOC team members can see when their hosts go unreachable or recover. Rename
BYOC sidebar entry to Hosts, add placeholder billing/usage pages, disable
unimplemented notifications/settings links, and point docs to external site.

chore: relicense from BSL 1.1 to Apache 2.0 37d85ec998

Replace Business Source License with Apache License Version 2.0 across
LICENSE, envd/LICENSE, and NOTICE. Update NOTICE to remove BSL-era
framing that singled out Apache-only portions.

Updated CLAUDE.md 5148b5dd64

feat: add notification channels with provider integrations and retry 84dd15d22b

Implement a channels system for notifying teams via external providers
(Discord, Slack, Teams, Google Chat, Telegram, Matrix, webhook) when
lifecycle events occur (capsule/template/host state changes).

- Channel CRUD API under /v1/channels (JWT-only auth)
- Test endpoint to verify config before saving (POST /v1/channels/test)
- Secret rotation endpoint (PUT /v1/channels/{id}/config)
- AES-256-GCM encryption for provider secrets (WRENN_ENCRYPTION_KEY)
- Redis stream event publishing from audit logger
- Background dispatcher with consumer group and retry (10s, 30s)
- Webhook delivery with HMAC-SHA256 signing (X-WRENN-SIGNATURE)
- shoutrrr integration for chat providers
- Secrets never exposed in API responses
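
Webhook signing with HMAC-SHA256 can be sketched as follows. Only the algorithm and the X-WRENN-SIGNATURE header name come from the list above; the hex encoding and the function names are assumptions.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign returns the hex-encoded HMAC-SHA256 of body under secret.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes the signature on the receiver side and compares it
// in constant time.
func verify(secret, body []byte, signature string) bool {
	got, err := hex.DecodeString(signature)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	secret := []byte("example-secret") // placeholder value
	body := []byte(`{"event":"capsule.paused"}`)
	sig := sign(secret, body)
	// The sender would attach sig as the X-WRENN-SIGNATURE header.
	fmt.Println(len(sig), verify(secret, body, sig))
}
```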

feat: channel audit logging, name cleaning, message formatting, and dashboard UI 0f78982186

- Add audit log entries for channel create, update, rotate_config, delete
- Clean channel names on create/update (trim, lowercase, spaces → hyphens,
  SafeName validation)
- Format chat notifications with full event details (resource, actor, team,
  timestamp) instead of one-liners
- Fix Discord split-line embeds by setting splitLines=No on shoutrrr URL
- Add channels dashboard page and sidebar navigation

Merge pull request 'Added channels for external notifications' (#13) from feat/channels into dev 831c898b71

Reviewed-on: wrenn/sandbox#13

Merge branch 'main' of git.omukk.dev:wrenn/wrenn into dev 2b31af8fde

Merge branch 'main' of git.omukk.dev:wrenn/wrenn into dev c1987b0bda

Add filesystem operations (list, mkdir, remove) across full stack c9283cac70

Plumb ListDir, MakeDir, and RemovePath through all layers:
REST API → host agent RPC → envdclient → envd. These endpoints
enable a web file browser for sandbox filesystem interaction.

New endpoints (all under requireAPIKeyOrJWT):
- POST /v1/sandboxes/{id}/files/list
- POST /v1/sandboxes/{id}/files/mkdir
- POST /v1/sandboxes/{id}/files/remove

Add Files tab to capsule detail page with file browser and preview 82531b735c

Implements a split-panel file browser: directory tree on the left with
path input and breadcrumb navigation, file preview on the right with
line numbers. Binary/large files (>10MB) show a download prompt instead.

Also adds CopyButton component across capsule, snapshot, and template
pages, and fixes pre-existing type errors in StatsPanel and admin
templates page.

Fix file browser: use ~ as default path, support tilde expansion 0e6daaabe0

- Default to ~ instead of hardcoded /home/user — envd resolves it
  to the actual home dir of the configured user
- Pass ~ and ~/... paths through to envd for server-side expansion
- Resolve actual absolute path from response entries for breadcrumbs
- Fall back to / if home dir is empty or doesn't exist
- Fix leftover label prop on admin templates CopyButton
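
Server-side tilde expansion with the fallback described above might look like this sketch. The function name is hypothetical, and it only handles `~` and `~/...`, not `~otheruser`.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// expandTilde resolves "~" and "~/..." against home, falling back to "/"
// when the home directory is empty. Other paths are returned unchanged.
func expandTilde(p, home string) string {
	if home == "" {
		home = "/"
	}
	switch {
	case p == "~":
		return home
	case strings.HasPrefix(p, "~/"):
		return path.Join(home, p[2:])
	}
	return p
}

func main() {
	fmt.Println(expandTilde("~", "/home/user"))
	fmt.Println(expandTilde("~/work/data", "/home/user"))
	fmt.Println(expandTilde("~", ""))
}
```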

Fix stale WRENN_SANDBOX_ID and WRENN_TEMPLATE_ID after snapshot restore 4ed17b2776

After restoring a VM from snapshot, envd had already completed its initial
MMDS poll, so the metadata files in /run/wrenn/ and env vars retained values
from the original sandbox. Call POST /init after WaitUntilReady on both
resume and create-from-template paths to trigger envd to re-read MMDS.

Polish file browser: add up button, normalize design, improve UX 851f54a9e1

Add parent directory button in breadcrumb bar, remove redundant ..
row from file list. Normalize styles to use design system tokens
(accent glow, iconFloat, fadeUp). Improve empty states, add staggered
row entrance animation, file extension badge, and clearer UX copy.

Merge pull request 'Added browser based filesystem interactions' (#16) from feat/file-interactions into dev 43c15c86de

Reviewed-on: #16

Replace file browser not-running state with centered empty state 09f030d202

The small bordered card looked broken and misaligned — now uses a
full-width centered layout with floating icon, matching the app's
empty-state pattern.

Add interactive PTY terminal sessions for sandboxes ab3fc4a807

Wire envd's existing PTY process capabilities through the full stack:
hostagent proto (4 new RPCs: PtyAttach, PtySendInput, PtyResize, PtyKill),
envdclient, sandbox manager, and a new WebSocket endpoint at
GET /v1/sandboxes/{id}/pty with bidirectional JSON message protocol.

Sessions use tag-based identity for disconnect/reconnect support,
base64-encoded PTY data for binary safety, and a 120s inactivity timeout.

Add terminal tab to capsule detail page and fix envd process lookup bugs 4b2ff279f7

- Add multi-session Terminal tab with xterm.js (session tabs, close, reconnect)
- Keep terminal mounted across tab switches to preserve sessions
- Persist active tab in URL (?tab=terminal) so refresh stays on terminal
- Buffer keystrokes (50ms) to reduce per-character RPC overhead
- Add WebSocket auth via ?token= query param for browser WS connections
- Enable ws:true in Vite dev proxy for WebSocket support

envd fixes (pre-existing bugs exposed by multi-session terminals):
- Fix getProcess tag Range: inverted return values caused early stop when
  multiple tagged processes existed, making SendInput fail with "not found"
- Fix multiplexer deadlock: blocking send to cancelled fork's unbuffered
  channel prevented process cleanup. Now uses buffered channels (cap 64)
  with non-blocking fallback
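
The buffered channel with non-blocking fallback is the usual select/default pattern. This sketch uses hypothetical names and tiny buffers to show the fallback path; it is not the multiplexer's actual code.

```go
package main

import "fmt"

// fanOut delivers chunk to every fork without ever blocking. A fork whose
// buffer is full is skipped, and the number of skipped forks is returned.
func fanOut(forks []chan []byte, chunk []byte) (dropped int) {
	for _, ch := range forks {
		select {
		case ch <- chunk:
		default:
			dropped++ // consumer is stuck or too slow; drop instead of deadlocking
		}
	}
	return dropped
}

func main() {
	healthy := make(chan []byte, 64)
	stuck := make(chan []byte, 1)
	stuck <- []byte("unread") // buffer is already full
	fmt.Println(fanOut([]chan []byte{healthy, stuck}, []byte("output")))
}
```

The trade-off is that a full buffer means data loss rather than a stall, so the buffer size decides how much burst output survives.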

Polish terminal tab: merge status bar into tab strip, normalize sizing acc721526d

- Merge separate status bar into unified tab bar (one row of chrome instead of two)
- Bump font/button/icon sizes to match rest of capsule page
- VS Code-style tab separators with intelligent hiding around active tab
- Hide tab bar when no sessions exist (empty state has its own CTA)
- Fix xterm background gaps by painting viewport/screen backgrounds
- Increase terminal font from 13px to 14px

Increase multiplexer fork buffer to 4096 to prevent output drops 1826af37a5

64-entry buffer was too small for high-throughput PTY output (e.g.
ls -laihR /). The consumer couldn't drain fast enough over the RPC
stream, causing the non-blocking send fallback to silently discard
data. 4096 entries (~64MB at 16KB/chunk) handles sustained output
without drops while still preventing deadlock on stuck consumers.

Harden terminal: binary-safe base64, auto-reconnect, session limits d2202c4f49

- Replace btoa/atob with TextEncoder/TextDecoder for binary-safe base64
  encoding — fixes crash on multi-byte UTF-8 input (emoji, CJK, accents)
- Auto-reconnect on abnormal WebSocket close while session is live
- Cap concurrent sessions at 8 with disabled "+" button at limit
- Guard all ws.send() calls with try/catch via wsSend() wrapper
- Clean up input flush timer on session close and component destroy
- Close all sessions when capsule stops running (isRunning → false)
- Clean up orphaned display entry if DOM container fails to render

Harden file browser: cap preview lines, fix race conditions, download UX cf191ca821

- Cap text preview at 5,000 lines with truncation footer and download link
  to prevent browser freeze on large files (300k+ DOM nodes)
- Add request generation counters to discard stale API responses from
  rapid directory/file clicking
- Guard initial $effect with hasInitiallyLoaded to prevent double-load
- Add download loading state with spinner and disabled button
- Delay URL.revokeObjectURL by 5s so browser can start download

Merge pull request 'Terminal connection (PTY)' (#18) from feat/ssh-connection into dev 9332f4ac18

Reviewed-on: #18

Extract SnapshotDialog and DestroyDialog into reusable components 2bad843069

Add lifecycle buttons (pause, resume, snapshot, destroy) to the
individual capsule detail page and refactor both the list and detail
pages to share the new dialog components.

Harden channels page: deduplicate dropdowns, add missing provider logos dbad418093

Consolidate three identical click-outside $effect blocks into a reusable
useClickOutside helper. Extract duplicated events checkbox list into an
eventsDropdownItems snippet shared by create and edit dialogs. Add brand
SVG icons for Teams, Google Chat, and Matrix providers.

Optimize frontend polling: visibility API, range-based intervals, skip redundant redraws 21b82c2283

Adds Page Visibility API to StatsPanel, templates, and capsule detail
pages so polling pauses when the browser tab is hidden. Capsule metrics
now use range-appropriate poll intervals (10s for 5m/10m, up to 120s for
24h) instead of a flat 10s. Chart updates are skipped when the data
fingerprint hasn't changed, avoiding unnecessary Canvas redraws.

Minor textual change e2f869bfc2

Skip row fly-transitions on template filter change to prevent visual flicker 11ca6935a6

After initial page load animations complete, subsequent filter switches
render instantly (duration: 0) instead of replaying staggered fly-in/out
transitions that caused all rows to flash before filtering took effect.

Replace template text input with searchable combobox, lock specs for snapshots 0807946d45

Template field is now a filterable dropdown that fetches available
templates on dialog open. Selecting a snapshot auto-fills and disables
vCPU/memory inputs since they must match the original capsule config.

Add per-provider brand colors to channels page 430fb9e70e

Give each provider (Discord, Slack, Teams, Google Chat, Telegram,
Matrix, Webhook) its own distinctive color for badges, row hover
stripes, and dialog tags. Move channel count into the header as a
serif numeral for stronger typographic hierarchy.

Add syntax highlighting to file browser, harden capsules list 26917d432d

File browser:
- Add shiki-based syntax highlighting (lazy-loaded, zero initial bundle
  impact) with support for 30+ languages
- Cap highlighting at 2000 lines to avoid freezing on large files
- Pre-compute preview lines as derived state instead of re-splitting
  on every render
- Add content-visibility: auto on code lines for off-screen skip
- Remove per-line CSS transitions (unnecessary paint on 5000 elements)
- Cap row entrance animations to first 30 entries

Capsules list:
- Pause auto-refresh polling when browser tab is hidden
- Add empty state for search with no results
- Fix error state not clearing on successful refresh
- Fix action menu positioning near viewport edges
- Disable create button when no template selected

Merge pull request 'Visual optimizations for the web UI' (#19) from fix/optimizations into dev 7d0a21644f

Reviewed-on: #19

Remove API key auth requirement for sandbox port proxy connections c3c9ced9dd

Sandbox URLs ({port}-{sandbox_id}.{domain}) are now accessible without
authentication. The sandbox ID in the hostname is sufficient for routing.
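
As a rough illustration of routing on the hostname alone, the sketch below splits a `{port}-{sandbox_id}.{domain}` host into its parts. The function name `parseSandboxHost`, the error texts, and the example domain are assumptions made for this sketch, not the proxy's actual code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSandboxHost splits a host of the form {port}-{sandbox_id}.{domain}.
// Hypothetical helper: the real proxy may parse the hostname differently.
func parseSandboxHost(host, domain string) (port int, sandboxID string, err error) {
	label, ok := strings.CutSuffix(host, "."+domain)
	if !ok {
		return 0, "", fmt.Errorf("host %q is not under %q", host, domain)
	}
	// Split at the first hyphen so the sandbox ID may itself contain hyphens.
	portStr, id, ok := strings.Cut(label, "-")
	if !ok || id == "" {
		return 0, "", fmt.Errorf("host %q has no {port}-{sandbox_id} label", host)
	}
	port, err = strconv.Atoi(portStr)
	if err != nil || port < 1 || port > 65535 {
		return 0, "", fmt.Errorf("invalid port %q", portStr)
	}
	return port, id, nil
}

func main() {
	port, id, err := parseSandboxHost("8080-abc123.example.test", "example.test")
	fmt.Println(port, id, err)
}
```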

fix: stop overwriting agent gRPC errors with CodeInternal 8d0356e372

Removed the `connect.NewError(connect.CodeInternal, ...)` wrapper in the
Server's MakeDir proxy handler. Previously, this wrapper was catching
specific agent errors (like CodeAlreadyExists) and casting them into
generic Code 13 (Internal) errors, stripping the gRPC metadata.

This change allows the control-plane to act as a transparent pipeline,
ensuring the API gateway can properly interpret and route specific
filesystem failures.

fix: map CodeAlreadyExists to HTTP 409 Conflict f5a9a1209f

Updated the `agentErrToHTTP` switch statement to explicitly catch
`connect.CodeAlreadyExists` (as well as `connect.CodeFailedPrecondition`)
and return `http.StatusConflict` (409) instead of falling through to the
default 502 Bad Gateway.
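
The mapping described above can be sketched with only the standard library. The `code` type below is a local stand-in for `connect.Code` (its values follow the gRPC status code numbering), and `agentCodeToHTTP` is a hypothetical reduction of `agentErrToHTTP` to the two branches this commit message mentions.

```go
package main

import (
	"fmt"
	"net/http"
)

// code is a local stand-in for connect.Code so the sketch has no
// third-party imports. Values follow the gRPC status code numbering.
type code uint32

const (
	codeAlreadyExists      code = 6
	codeFailedPrecondition code = 9
	codeInternal           code = 13
)

// agentCodeToHTTP is a hypothetical, reduced version of agentErrToHTTP.
// Only the 409 branch and the 502 default come from the commit message.
func agentCodeToHTTP(c code) int {
	switch c {
	case codeAlreadyExists, codeFailedPrecondition:
		return http.StatusConflict
	default:
		return http.StatusBadGateway
	}
}

func main() {
	fmt.Println(agentCodeToHTTP(codeAlreadyExists)) // 409
	fmt.Println(agentCodeToHTTP(codeInternal))      // 502
}
```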

Merge pull request 'bugfix: preserve agent gRPC status codes and map AlreadyExists to 409 Conflict' (#20) from bugfix/mkdir-already-exists-409 into dev f6c3dc0801

Reviewed-on: #20

Add USER, COPY, ENV persistence to template build system 75af2a4f66

Implement three new recipe commands for the admin template builder:

- USER <name>: creates the user (adduser + passwordless sudo), switches
  execution context so subsequent RUN/START commands run as that user
  via su wrapping. Last USER becomes the template's default_user.

- COPY <src> <dst>: copies files from an uploaded build archive
  (tar/tar.gz/zip) into the sandbox. Source paths validated against
  traversal. Ownership set to the current USER.

- ENV persistence: accumulated env vars stored in templates.default_env
  (JSONB) and injected via PostInit when sandboxes are created from the
  template, mirroring Docker's image metadata approach.

Supporting changes:
- Pre-build creates wrenn-user as default (via USER command)
- WORKDIR now creates the directory if it doesn't exist (mkdir -p)
- Per-step progress updates (ProgressFunc callback) for live UI
- Multipart form support on POST /v1/admin/builds for archive upload
- Proto: default_user/default_env fields on Create/ResumeSandboxRequest
- Host agent: SetDefaults calls PostInitWithDefaults on envd
- Control plane: reads template defaults, passes on sandbox create/resume
- Frontend: file upload widget, recipe copy button, keyword colors for
  USER/COPY, fixed Svelte whitespace stripping in step display
- Admin panel defaults to /admin/templates instead of /admin/hosts
- Migration adds default_user and default_env to templates and
  template_builds tables

Rename /dashboard/snapshots to /dashboard/templates, show specs for all template types f5eeb0ffcc

- Rename snapshots route to templates for consistency with sidebar label
- Show vCPU and Memory values for base templates (not just snapshots),
  with tooltip distinguishing "Required" vs "Recommended"
- Show recipe copy button in admin build logs
- Admin panel defaults to /admin/templates on entry
- WORKDIR creates directory if not present (mkdir -p)
- Use USER command in pre-build instead of raw adduser
- Fix Svelte whitespace stripping in step keyword display

Fix runtime env leaking into templates, add hostname to /etc/hosts 000318f77e

- Filter out user-specific env vars (HOME, USER, LOGNAME, SHELL, etc.)
  from template default_env so they don't override envd's per-user
  resolution. Fixes bash sourcing /root/.bashrc as wrenn-user.
- Keep WRENN_SANDBOX (a legitimate runtime flag); filter only the
  per-sandbox IDs (WRENN_SANDBOX_ID, WRENN_TEMPLATE_ID).
- Add "127.0.0.1 sandbox" to /etc/hosts in wrenn-init.sh so sudo can
  resolve the hostname. Fixes "unable to resolve host sandbox" error.
- Move capsule lifecycle buttons (Pause/Resume/Snapshot/Destroy) to the
  same row as Stats/Files/Terminal tabs.
- Show vCPU/Memory for all template types with Required/Recommended
  tooltips on the user templates page.
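
The env filtering in the first two bullets might look roughly like the sketch below. The exclusion set contains only the names spelled out in this commit message; the real list is longer (the message says "etc.") and is not reproduced here. `filterTemplateEnv` is a hypothetical name.

```go
package main

import "fmt"

// excluded holds variables that should not be persisted into a template's
// default_env. Only names quoted in the commit message are listed; the
// actual set is an unknown superset of this one.
var excluded = map[string]bool{
	"HOME": true, "USER": true, "LOGNAME": true, "SHELL": true,
	"WRENN_SANDBOX_ID": true, "WRENN_TEMPLATE_ID": true,
}

// filterTemplateEnv returns a copy of env without the excluded keys.
// WRENN_SANDBOX is deliberately not excluded.
func filterTemplateEnv(env map[string]string) map[string]string {
	out := make(map[string]string, len(env))
	for k, v := range env {
		if !excluded[k] {
			out[k] = v
		}
	}
	return out
}

func main() {
	env := map[string]string{
		"HOME":             "/root",
		"WRENN_SANDBOX":    "true",
		"WRENN_SANDBOX_ID": "sb-123",
		"PATH":             "/usr/bin",
	}
	fmt.Println(filterTemplateEnv(env))
}
```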

Visual polish 46c43b95c2

COPY multi-source support, configurable rootfs size, build fixes 25b5258841

- COPY now supports multiple sources: COPY a.txt b.txt /dest/
  Last argument is always destination (matches Dockerfile semantics).
- COPY resolves relative destinations against current WORKDIR.
- WRENN_DEFAULT_ROOTFS_SIZE env var (e.g. 5G, 2Gi, 1000M, 512Mi)
  controls template rootfs expansion. Used both at agent startup
  (EnsureImageSizes) and after FlattenRootfs (shrink then re-expand).
- Pre-build now sets WORKDIR /home/wrenn-user after USER switch.
- Extracted archive files get chmod a+rX for readability.
- Path traversal validation on COPY sources.
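
A minimal sketch of the COPY argument handling described above: the last argument is the destination, relative destinations resolve against the current WORKDIR, and sources that would escape the archive are rejected. `splitCopyArgs` and its exact validation rules are assumptions for illustration only.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

// splitCopyArgs treats the last argument as the destination and everything
// before it as sources. Hypothetical helper, not the recipe package's code.
func splitCopyArgs(args []string, workdir string) (srcs []string, dst string, err error) {
	if len(args) < 2 {
		return nil, "", errors.New("COPY needs at least one source and a destination")
	}
	srcs, dst = args[:len(args)-1], args[len(args)-1]
	for _, s := range srcs {
		clean := path.Clean(s)
		// Reject absolute paths and anything that climbs above the archive root.
		if path.IsAbs(clean) || clean == ".." || strings.HasPrefix(clean, "../") {
			return nil, "", fmt.Errorf("source %q escapes the build archive", s)
		}
	}
	if !path.IsAbs(dst) {
		// Note: path.Join also drops a trailing slash from dst.
		dst = path.Join(workdir, dst)
	}
	return srcs, dst, nil
}

func main() {
	srcs, dst, err := splitCopyArgs([]string{"a.txt", "b.txt", "dest/"}, "/home/wrenn-user")
	fmt.Println(srcs, dst, err)
}
```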

Merge pull request 'Completed template build for admins' (#21) from feat/admin-template-build into dev ea65fb584c

Reviewed-on: #21

Rename API routes /v1/sandboxes → /v1/capsules 565817273d

Updated gitignore 108b68c3fa

Update pgx/v5 from v5.8.0 to v5.9.1 7b853a05ba

Picks up timestamp scan optimizations, ContextWatcher goroutine leak
fix, and stdlib ResetSession connection pool fix.

Bump frontend and Go x/ dependencies 0189d030bb

- vite 7→8, @sveltejs/vite-plugin-svelte 6→7, typescript 5→6
- golang.org/x/crypto v0.49→v0.50, golang.org/x/sys v0.42→v0.43 (both modules)

Update CP listen port to 9725 and public URL to app.wrenn.dev 9ad704c12b

Bump netlink v1.3.1 and netns v0.0.5 0e7b198768

Fixes resource leaks in named namespace handlers, adds IFF_RUNNING
flag deserialization and RouteGetWithOptions.

Merge pull request 'Updated dependencies and fixed breaking changes' (#22) from fix/dependency-updates into dev 0d5007089e

Reviewed-on: #22

Fix file browser crash on non-regular files and connection leaks da06ecb97b

- envd: reject non-regular files (devices, pipes, sockets) in GetFiles
  to prevent infinite reads from /dev/zero, /dev/urandom etc.
- host agent: add context cancellation check in ReadFileStream loop
  with proper Connect error codes
- frontend: abort in-flight file reads on file switch, directory
  navigation, and component teardown via AbortController
- frontend: guard against abort errors surfacing in UI, use try/finally
  for fileLoading state

Updated env.example b1595baa19

Merge pull request 'Fixed crash on non-regular files and connection leaks' (#23) from hotfix/file-browsing-error-for-dev into dev eb47e22496

Reviewed-on: #23

Explicit write when mounting rootfs for updates 5633957b51

Normalize dialog styles across capsules and templates pages 19ddb1ab8b

Aligned all dialog boxes to a consistent pattern: same shadow
(--shadow-dialog), animation (fadeUp 0.2s ease), button sizing
(py-2, duration-150), and hover effects. Added template type
indicator dot to CreateCapsuleDialog combobox. Removed accent
gradient bars from templates page inline dialogs.

Block download for non-regular files in file browser f920023ecf

Disable the download button for symlinks and show a dedicated
preview pane that explains the symlink target and suggests
navigating to the target file instead. Guard handleDownload against
non-file types as a safety net.

Add admin capsule management, fix file browser for special files, normalize dialog styles 90bea52ccd

- Admin capsule CRUD: list, create (platform templates), get detail with
  terminal/files/metrics, snapshot, destroy
- First signup auto-promotes to platform admin
- JWT auth via query param for WebSocket connections
- File browser: handle non-regular files (devices, pipes, sockets) gracefully
  instead of showing raw backend errors
- Normalize admin template dialogs to match established dialog patterns:
  remove accent bars, unify animation/shadow/button styles

Extract MetricsPanel component and use it in admin capsule detail page 60c0de670c

Moves all Chart.js metrics logic (polling, smoothing, chart init/update)
into a reusable MetricsPanel component with 'full' and 'compact' layout
modes. The admin capsule detail page now reuses MetricsPanel, TerminalTab,
and FilesTab — no duplicated code.

Polish admin capsule pages and improve shared components 784fe5c7a8

- Admin list: remove redundant Open button, normalize with dashboard
  patterns (sorting, search highlight, auto-refresh, animations)
- Admin detail: breadcrumb header, status bar, visibility polling
- FilesTab: add treeOnly prop, compact mode uses 2/7 tree + 5/7 preview
  split, expand tree to full width when no file selected, improve copy
- MetricsPanel: hide Live badge in compact layout (redundant with status)
- DestroyDialog: accept destroyFn prop for admin capsule deletion

Merge pull request 'Added manual template building' (#24) from feat/admin-panel into dev bbdb44afee

Reviewed-on: #24

Normalize dashboard page headers: add divider line and align button layout d828a6be08

Add consistent mt-6 border-b divider to Capsules, Metrics, and Templates
headers. Align Channels header to match Keys page pattern (items-center,
description inside the title group).

Fix: Auto-admin didn't work for oauth users 117c46a386

Pre-pause snapshot signal to prevent Go runtime crash on restore 962860ba74

envd crashes with "fatal error: bad summary data" after Firecracker
snapshot/restore because the page allocator radix tree is inconsistent
when vCPUs are frozen mid-allocation. The port scanner goroutine
allocates heavily every second, making it the primary trigger.

Add POST /snapshot/prepare to envd — the host agent calls it before
vm.Pause to quiesce continuous goroutines and force GC. On restore,
PostInit restarts the port subsystem via the existing /init endpoint.

- New PortSubsystem abstraction with Start/Stop/Restart lifecycle
- Context-based goroutine cancellation (replaces irreversible channel close)
- Context-aware Signal to prevent scanner/forwarder deadlock
- Fix forwarder goroutine leak (was spinning forever on closed channel)
- Kill socat children on stop to prevent orphans across snapshots
- Fix double cmd.Wait panic (exec.Command instead of CommandContext)

Add background process execution API 516890c49a

Start long-running processes (web servers, daemons) without blocking the
HTTP request. Leverages envd's existing background process support
(context.Background(), List, Connect, SendSignal RPCs) and wires it
through the host agent and control plane layers.

New API surface:
- POST /v1/capsules/{id}/exec with background:true → 202 {pid, tag}
- GET /v1/capsules/{id}/processes → list running processes
- DELETE /v1/capsules/{id}/processes/{selector} → kill by PID or tag
- WS /v1/capsules/{id}/processes/{selector}/stream → reconnect to output

The {selector} param auto-detects: numeric = PID, string = tag.
Tags are auto-generated as "proc-" + 8 hex chars if not provided.

Remove redundant comments from login page glow animation 71b87020c9

Removed unused env vars from env example 17d5d07b3a

Implement least-loaded host scheduler with bottleneck-first strategy 82d281b5b5

Replace round-robin scheduling with resource-aware host selection that
picks the host with the most headroom at its tightest resource. Extends
the HostScheduler interface with memory/disk params for admission control.
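
One way to read "most headroom at its tightest resource" is: for each host, compute the free fraction of every resource after placing the request, take the minimum, and pick the host whose minimum is largest. The sketch below follows that reading; the `host` struct, the choice of fractions measured after placement, and the function names are assumptions, not the HostScheduler code.

```go
package main

import "fmt"

// host is a hypothetical view of a host's capacity.
type host struct {
	Name                string
	FreeCPU, TotalCPU   float64 // vCPUs
	FreeMem, TotalMem   float64 // MiB
	FreeDisk, TotalDisk float64 // MiB
}

// headroom returns the smallest free fraction across CPU, memory and disk
// after placing the request, and whether the request fits at all.
func headroom(h host, cpu, mem, disk float64) (float64, bool) {
	if h.TotalCPU <= 0 || h.TotalMem <= 0 || h.TotalDisk <= 0 {
		return 0, false
	}
	tightest := (h.FreeCPU - cpu) / h.TotalCPU
	for _, f := range []float64{
		(h.FreeMem - mem) / h.TotalMem,
		(h.FreeDisk - disk) / h.TotalDisk,
	} {
		if f < tightest {
			tightest = f
		}
	}
	return tightest, tightest >= 0
}

// pickHost performs admission control (hosts that cannot fit the request
// are skipped) and returns the host with the largest bottleneck headroom.
func pickHost(hosts []host, cpu, mem, disk float64) (host, bool) {
	var best host
	bestScore, found := -1.0, false
	for _, h := range hosts {
		score, fits := headroom(h, cpu, mem, disk)
		if fits && score > bestScore {
			best, bestScore, found = h, score, true
		}
	}
	return best, found
}

func main() {
	hosts := []host{
		{"a", 14, 16, 4096, 32768, 90000, 100000},  // plenty of CPU, tight on memory
		{"b", 6, 16, 16384, 32768, 50000, 100000},  // balanced
		{"c", 16, 16, 1024, 32768, 100000, 100000}, // cannot fit 2048 MiB
	}
	h, ok := pickHost(hosts, 2, 2048, 5000)
	fmt.Println(h.Name, ok)
}
```

In this example host "a" has the most free CPU but only 6.25% memory headroom after placement, so host "b" (25% at its tightest resource) is chosen.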

Merge pull request 'Implemented least-loaded host scheduler with bottleneck-first strategy' (#25) from feat/host-scheduler into dev 587f6ed8ad

Reviewed-on: #25

Add admin teams management page d332630267

Admin panel now includes a Teams page with paginated listing of all teams
(including soft-deleted), BYOC enable with confirmation dialog, and team
deletion with active capsule warnings. Shows member count, owner info,
active capsules, and channel count per team.

Add admin user management with is_active enforcement a265c15c4d

Admin users page at /admin/users with paginated user list showing name,
email, team counts, role, join date, and active status toggle. Inactive
users are blocked from all authenticated endpoints immediately via DB
check in JWT middleware. OAuth login errors now show human-readable
messages on the login page.

Merge pull request 'Added teams and users pages to admin panel' (#26) from feat/admin-panel into dev 59507d7553

Reviewed-on: #26

Fix build recipe execution and flatten reliability 5b4fde055c

- Set HOME in bctx.EnvVars when USER switches so ~ expands correctly in
  subsequent RUN/WORKDIR steps instead of resolving to /root
- Run /bin/sync inside the guest before FlattenRootfs destroys the VM,
  preventing pip-installed files from being captured as 0-byte due to
  unflushed page cache
- Wrap healthcheck command with su <user> so it runs with the template's
  default user context (correct HOME, correct UID)
- Export Shellescape from the recipe package for use in build service
- Add code-runner-beta recipe (Jupyter server with ipykernel --sys-prefix)
  and replace old python-interpreter-v0-beta

Remove PTY inactivity timeout to keep terminal sessions alive indefinitely 5f877afb9e

Sessions now only end on process exit or explicit kill, not idle time.
The keepalive ping every 30s remains to prevent network-level disconnects.

Merge pull request 'Fixed issues with code interpreter' (#27) from fix/code-interpreter into dev 11d746dcfc

Reviewed-on: #27

Refactored to maintain a separate cloud version a5ad3731f2

Moves 12 packages from internal/ to pkg/ (config, id, validate, events, db,
auth, lifecycle, scheduler, channels, audit, service) so they can be imported
by the enterprise repo as a Go module dependency.

Introduces pkg/cpextension (shared Extension interface + ServerContext) and
pkg/cpserver (Run() entrypoint with functional options) so the enterprise
main.go can call cpserver.Run(cpserver.WithExtensions(...)) without duplicating
the 20-step server bootstrap. Adds db/migrations/embed.go for go:embed access
to OSS SQL migrations from the enterprise module.

cmd/control-plane/main.go is reduced to a 10-line wrapper around cpserver.Run.

Merge pull request 'Added metadata tracking for binaries and refactored to maintain a separate cloud version' (#28) from feat/meta-versioning into dev d1975089f1

Reviewed-on: #28

Updated letter-spacing 700512b627

Add transactional email system via SMTP 9d68eb5f00

Introduce internal/email package with SMTP sending, embedded HTML/text
templates, and multipart MIME assembly. Emails use a generic EmailData
struct (recipient name, message, optional button, optional closing) so
new email types can be added without code changes.

Wired into signup (welcome email), team creation, and team member
addition. No-op mailer when SMTP_HOST is not configured.

minor changes ded9c15f06

Updated email template for optional name 970ae2b6b2

Merge pull request 'Added transactional email sending' (#29) from feat/email-transaction into dev 2f0e7fcdc2

Reviewed-on: #29

Removed unnecessary files and renamed minimal update script d705f83b68

Updated claude md 81715947bb

Add DB queries for account self-service bc8348b199

New queries: UpdateUserPassword, SoftDeleteUser, HardDeleteExpiredUsers,
CountUserOwnedTeamsWithOtherMembers, GetOAuthProvidersByUserID, DeleteOAuthProvider.

Add /v1/me account management endpoints f69fa8cded

Adds self-service endpoints: GET/PATCH/DELETE /v1/me, POST /v1/me/password,
POST /v1/me/password/reset{/confirm}, GET/DELETE /v1/me/providers/{provider}.
Includes OAuth account-linking flow via cookie, hard-delete cleanup goroutine
(24h ticker, 15-day grace period), and OpenAPI spec for all new routes.

Add Wrenn wordmark to email template and improve spacing 93e6fe8160

Add settings page, forgot/reset password flows, and me API client e8a2217247

Adds /dashboard/settings route with profile/password/OAuth/account-deletion
management. Adds /forgot-password and /reset-password routes. Enables sidebar
settings link. Adds typed me.ts API client.

Add email activation flow and replace is_active with status column a3f75300a9

Email signup now creates inactive users who must activate via a 30-minute
email token before signing in. Team creation is deferred to first login
after activation, while OAuth users continue to get teams immediately.

- Replace boolean is_active with status column (inactive/active/disabled/deleted)
- Add POST /v1/auth/activate endpoint with Redis-backed token consumption
- Signup returns message instead of JWT, sends activation email
- Login differentiates error messages by user status
- Add confirm password field to signup form
- Add /activate frontend page that auto-logs in on success
- Handle inactive user cleanup on re-signup (30-min cooldown) and OAuth collision

Updated claude md with better design e1b23f3d79

Fix cascading deletion gaps for user and team cleanup 43e838c55c

- Add ON DELETE CASCADE to users_teams, oauth_providers, admin_permissions
  and ON DELETE SET NULL (with nullable columns) to team_api_keys.created_by,
  hosts.created_by, host_tokens.created_by so HardDeleteExpiredUsers no longer
  fails with FK violations
- User account deletion now cascades to sole-owned teams via DeleteTeamInternal,
  preventing orphaned teams with live sandboxes after account removal
- ListActiveSandboxesByTeam now includes hibernated sandboxes so their disk
  snapshots are cleaned up during team deletion
- Team soft-delete now hard-deletes sandbox metric points, metric snapshots,
  API keys, and channels to prevent data accumulation on deleted teams
- Extract deleteTeamCore() to deduplicate shared logic across DeleteTeam,
  AdminDeleteTeam, and DeleteTeamInternal
- Fix ListAPIKeysByTeamWithCreator to use LEFT JOIN after created_by became
  nullable, and update handler to read pgtype.Text.String for creator_email

Redirect authenticated users away from login page 084c6caa7d

Merge pull request 'Added settings for users and proper email flow for authentication' (#30) from feat/user-onboarding into dev 451d0819cc

Reviewed-on: #30

Fix API key cleanup on user deactivation and build archive race condition e91109d69c

Delete all API keys created by a user when their account is disabled,
deleted, or soft-deleted. Store build archives before enqueuing to Redis
so workers never dequeue a build with missing files.

Move sidebar into layout files and fix timer cleanup across frontend ed2222c80c

Sidebar and AdminSidebar were re-instantiated on every page navigation
(17 pages total), causing unnecessary DOM teardown/rebuild and redundant
localStorage reads. Now each lives in its respective +layout.svelte as a
single persistent instance.

Also adds onDestroy cleanup for leaked timers (settings, team, login RAF
loop) and CSS containment on <main> to isolate layout recalculations.

Fix concurrency, security, and correctness issues across backend and frontend 9ea847923c

- C1: Add sync.RWMutex to vm.Manager to protect concurrent vms map access
- H1: Fix IP arithmetic overflow in network slot addressing (byte truncation)
- H5: Fix MultiplexedChannel.Fork() TOCTOU race (move exited check inside lock)
- H8: Remove snapshot overwrite — return template_name_taken conflict instead
- H9: Wrap DeleteAccount DB ops in a transaction, make team deletion fatal
- H10: Sanitize serviceErrToHTTP to stop leaking internal error messages
- H11: Add deleted_at IS NULL to GetUserByEmail/GetUserByID queries
- H12: Add id DESC to audit log composite index for cursor pagination
- H15: Delete dead AuthModal.svelte component
- H17: Move JWT from WebSocket URL query param to first WS message
- H18: Fix $derived to $derived.by in FilesTab breadcrumbs

Destroy owned sandboxes on user disable and fix OAuth login resilience fb4b67adb3

When an admin disables a user, all active sandboxes (running, paused,
hibernated) for teams they own are now destroyed and their API keys
are deleted. User queries now filter by status column instead of
deleted_at, so re-enabling a user always works. OAuth login paths
use ensureDefaultTeam to auto-create a team if the user has none,
matching the email/password login behavior.

Merge pull request 'Bug fixes and optimizations' (#31) from fix/optimizations into dev b9aa444472

Reviewed-on: #31

Cap network slot allocator at 32767 to match veth IP space 44c32587e3

The veth addressing uses 10.12.0.0/16 with 2 IPs per slot. At slot
index 32768, vethOffset=65536 overflows byte arithmetic and wraps back
to 10.12.0.0, causing silent IP collisions with existing sandboxes.
Cap the allocator at 32767, which is the actual addressable limit.
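
The wrap-around can be reproduced in a few lines. A /16 holds 65536 addresses, so with 2 IPs per slot only indices 0 through 32767 are addressable. The sketch assumes the first veth IP of a slot sits at offset `slot*2` from 10.12.0.0; the function names and the exact offset formula are illustrative, not taken from the allocator.

```go
package main

import (
	"fmt"
	"net"
)

// maxSlotIndex is the highest slot whose two IPs still fit in 10.12.0.0/16.
const maxSlotIndex = 32767

// vethIPTruncating reproduces the bug: the offset is split into two bytes
// without a range check, so offset 65536 wraps to 0.0.
func vethIPTruncating(slot int) net.IP {
	off := slot * 2
	return net.IPv4(10, 12, byte(off>>8), byte(off))
}

// vethIP is the capped variant: out-of-range slots are rejected instead of
// silently colliding with slot 0.
func vethIP(slot int) (net.IP, error) {
	if slot < 0 || slot > maxSlotIndex {
		return nil, fmt.Errorf("slot %d outside 0..%d", slot, maxSlotIndex)
	}
	return vethIPTruncating(slot), nil
}

func main() {
	fmt.Println(vethIPTruncating(32767)) // 10.12.255.254
	fmt.Println(vethIPTruncating(32768)) // 10.12.0.0, same as slot 0
	_, err := vethIP(32768)
	fmt.Println(err)
}
```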

Add production file logging with logrotate support bba5f80294

Both control plane and host agent now write structured slog output to
$WRENN_DIR/logs/ in addition to stderr. Log level is configurable via
LOG_LEVEL env var (default: info). SIGHUP reopens the log file so
logrotate can rotate without copytruncate.

Add unauthenticated /health endpoint to control plane e6e3975426

Returns JSON with status and build version for monitoring and
load balancer health checks.

Shrink minimal rootfs on graceful host agent shutdown 977c3a466a

On startup EnsureImageSizes expands the minimal rootfs to the configured
disk size. This adds the inverse: ShrinkMinimalImage runs e2fsck + resize2fs -M
during graceful shutdown so the image is stored compactly on disk.

Added host preparation script and updated claude md 9c4fea93bc

Minor patch cc63ed2197

Add +page.js to disable prerendering for admin capsule detail page 24f904fa74

Merge branch 'dev' into chore/hardening ab034062d3

Merge pull request 'Improved codebase to prepare for production' (#32) from chore/hardening into dev ce452c3d11

Reviewed-on: #32

Merge branch 'main' of git.omukk.dev:wrenn/wrenn into dev 955aa09780

Merge branch 'main' of git.omukk.dev:wrenn/wrenn into dev e7670e4449

Add daily usage metrics (CPU-minutes, RAM GB-minutes) 92aab09104

Introduce pre-computed daily usage rollups from sandbox_metrics_snapshots.
An hourly background worker aggregates completed days, while today's
usage is computed live from snapshots at query time for freshness.

Backend: new daily_usage table, rollup worker, UsageService, and
GET /v1/capsules/usage endpoint with date range filtering (up to 92 days).

Frontend: replace Usage page placeholder with bar charts (Chart.js),
summary total cards, and preset/custom date range controls.

Normalize usage page layout and clarify copy 003453fa3c

Separate summary cards with proper surface hierarchy, add staggered
entrance animations, tighten padding, and rewrite labels/descriptions
to be specific and actionable rather than generic.

Bump version to 0.1.2 8f8638e6db

Add MiddlewareProvider interface for extension middleware 47be1143fb

Allows cloud extensions to inject middleware that wraps OSS routes
(e.g. billing enforcement) before they are registered.

Clean up dashboard page headers for consistency aa96557d1c

Remove unnecessary wrapper divs around h1/subtitle pairs in audit,
channels, settings, and templates pages. Drop inline count from
channels header.

Merge pull request 'Feat: Added daily usage page' (#34) from feat/usage into dev 9ee6e3e1a8

Reviewed-on: #34

Merge branch 'main' of git.omukk.dev:wrenn/wrenn into dev dbc6030c17

feat: separate GitHub OAuth login/signup flows with name confirmation 6a6b489471

Block auto-account creation when signing in via GitHub from login mode.
Signup via GitHub now shows a name confirmation dialog before redirecting
to dashboard, letting users verify/edit their display name pulled from
GitHub.

- Add intent query param to OAuth redirect, persisted in HMAC-signed state cookie
- Block registration in callback when intent=login, return no_account error
- Set wrenn_oauth_new_signup cookie on new account creation
- Frontend callback shows name confirmation dialog for new signups
- Add no_account error message to login page

feat: anonymize audit logs on user hard-delete and fix host audit log team assignment ebbbde9cd1

Anonymize audit logs when soft-deleted users are purged after 15 days:
actor_name set to 'deleted-user', actor_id and resource_id nulled,
email stripped from member metadata. Per-user delete ensures no user
is removed without successful anonymization.

Frontend renders deleted-user as a styled red badge in audit log view.

Fix shared host create/delete audit logs landing in admin's personal
team — now correctly assigned to PlatformTeamID.

fix: admin capsule create audit log uses PlatformTeamID 684c98b0fa

POST /v1/admin/capsules was outside the injectPlatformTeam middleware
subrouter, so audit entries landed under the admin's personal team.

fix: remove accent gradient bars from admin host dialogs edec170652

Normalize admin host page dialogs to match design system pattern:
border + shadow only, no colored gradient strips. Align animation
timing and shadow to reference components (DestroyDialog, etc).

feat: add audit logging for all admin actions and admin audit page 7fd801c1eb

Log every admin-panel action (user activate/deactivate, team BYOC toggle,
team delete, template delete, build create/cancel) to the audit_logs table
under PlatformTeamID with scope "admin".

Add GET /v1/admin/audit-logs endpoint and /admin/audit frontend page with
infinite scroll and hierarchical filters. Expose audit.Entry + Log() for
cloud repo extensibility.

Fix seed_platform_team down-migration FK violation by deleting dependent
rows before the team row.

Version bump d270ab7752

refactor: deduplicate audit logger with shared entry builders bb2146d838

Replace repetitive actorFields + write boilerplate across all 25+ typed
Log methods with shared helpers: newEntry (general), newAdminEntry
(platform-level), resolveHostTeamID, and logSystemHostEvent.

Reduces logger.go from 665 to 374 lines with no behavior change.

feat: send email notification on account hard-delete 11928a172a

Notify users via email when their account is permanently deleted after
the 15-day soft-delete grace period. Query now returns email alongside
user ID so the notification can be sent after deletion.

Email failure is logged as a warning but does not block cleanup.

Merge pull request 'Audit logging, Data anonymization, and OAuth flow improvements' (#35) from feat/compliance into dev c3afd0c8a0

Reviewed-on: #35

Merge branch 'main' of git.omukk.dev:wrenn/wrenn into dev 153a54fdcd

fix: security and stability fixes from code review 339cd7bee1

- Scope WebSocket auth bypass to only WS endpoints by restructuring
  routes into separate chi Groups. Non-WS routes no longer pass
  unauthenticated requests through when they carry spoofed Upgrade
  headers. Added optionalAPIKeyOrJWT middleware for WS routes (injects
  auth context from API key/JWT if present, passes through otherwise)
  and markAdminWS middleware for admin WS routes.

- Fix nil pointer dereference in envd Handler.Wait() — p.tty.Close()
  was called unconditionally but p.tty is nil for non-PTY processes,
  crashing every non-PTY process exit.

- Fix goroutine leak in sandbox Pause — stopSampler was never called,
  leaking one sampler goroutine per successful pause operation.

- Decouple PTY WebSocket reads from RPC dispatch using a buffered
  channel to prevent backpressure-induced connection drops under fast
  typing. Includes input coalescing to reduce RPC call volume.
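
Input coalescing over a buffered channel, as mentioned in the last item, can be sketched like this: the WebSocket reader only enqueues chunks, and the dispatcher drains whatever is already queued into one payload before issuing a single RPC. The `coalesce` function, the buffer size, and the size limit are assumptions for illustration.

```go
package main

import "fmt"

// coalesce appends every chunk already waiting in the channel to first,
// without blocking, until the queue is empty or the payload reaches max
// bytes. The final payload may exceed max by at most one chunk.
func coalesce(first []byte, in <-chan []byte, max int) []byte {
	buf := append([]byte(nil), first...)
	for len(buf) < max {
		select {
		case more, ok := <-in:
			if !ok {
				return buf // channel closed, send what we have
			}
			buf = append(buf, more...)
		default:
			return buf // nothing else queued right now
		}
	}
	return buf
}

func main() {
	// The reader side only enqueues, so a slow RPC never blocks reads
	// until the buffer is full.
	in := make(chan []byte, 64)
	for _, key := range []string{"l", "s", "\n"} {
		in <- []byte(key)
	}
	first := <-in
	fmt.Printf("%q\n", coalesce(first, in, 4096)) // one RPC payload instead of three
}
```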

fix: OAuth ConnectProvider state HMAC format mismatch 5e13879954

ConnectProvider computed HMAC over bare state, but Callback always
verifies HMAC(state+":"+intent). This caused the account-linking
flow to always fail with invalid_state.
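
The mismatch is easy to reproduce: a signature over the bare state never matches a verifier that expects `state+":"+intent`. The sketch assumes HMAC-SHA256 with hex encoding and uses hypothetical names (`signState`, `verifyCallback`) and a hypothetical intent value; the commit does not specify these details.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// signState returns a hex HMAC-SHA256 over payload (algorithm and encoding
// are assumptions for this sketch).
func signState(secret []byte, payload string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(payload))
	return hex.EncodeToString(mac.Sum(nil))
}

// verifyCallback mirrors the callback side: it always checks the signature
// over state + ":" + intent.
func verifyCallback(secret []byte, state, intent, sig string) bool {
	want := signState(secret, state+":"+intent)
	return hmac.Equal([]byte(want), []byte(sig))
}

func main() {
	secret := []byte("example-secret")
	state, intent := "random-state", "connect"

	bare := signState(secret, state)            // what ConnectProvider used to sign
	full := signState(secret, state+":"+intent) // what Callback verifies

	fmt.Println(verifyCallback(secret, state, intent, bare)) // false
	fmt.Println(verifyCallback(secret, state, intent, full)) // true
}
```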

fix: sandbox network responsiveness under port-binding apps bd98610153

Running port-binding applications (Jupyter, http.server, NextJS) inside
sandboxes caused severe PTY sluggishness and proxy navigation errors.

Root cause: the CP sandbox proxy and Connect RPC pool shared a single
HTTP transport. Heavy proxy traffic (Jupyter WebSocket, REST polling)
interfered with PTY RPC streams via HTTP/2 flow control contention.

Transport isolation (main fix):
- Add dedicated proxy transport on CP (NewProxyTransport) with HTTP/2
  disabled, separate from the RPC pool transport
- Add dedicated proxy transport on host agent, replacing
  http.DefaultTransport
- Add dedicated envdclient transport with tuned connection pooling
- Replace http.DefaultClient in file streaming RPCs with per-sandbox
  envd client

Proxy path rewriting (navigation fix):
- Add ModifyResponse to rewrite Location headers with /proxy/{id}/{port}
  prefix, handling both root-relative and absolute-URL redirects
- Strip prefix back out in CP subdomain proxy for correct browser
  behavior
- Replace path.Join with string concat in CP Director to preserve
  trailing slashes (prevents redirect loops on directory listings)
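
The trailing-slash point in the last item follows from how `path.Join` works: it cleans its result, which removes a trailing slash, whereas plain concatenation keeps it. The base and request paths below are made up for the example.

```go
package main

import (
	"fmt"
	"path"
)

// joinWithPathJoin cleans the result, so "/docs/" loses its trailing slash.
func joinWithPathJoin(base, reqPath string) string {
	return path.Join(base, reqPath)
}

// joinWithConcat keeps the request path exactly as the client sent it.
// It assumes base has no trailing slash and reqPath starts with "/".
func joinWithConcat(base, reqPath string) string {
	return base + reqPath
}

func main() {
	base, reqPath := "/proxy/sb-1/8000", "/docs/"
	fmt.Println(joinWithPathJoin(base, reqPath)) // /proxy/sb-1/8000/docs
	fmt.Println(joinWithConcat(base, reqPath))   // /proxy/sb-1/8000/docs/
}
```

A server that redirects `/docs` to `/docs/` for directory listings would see the slash stripped again on the next request, which is the redirect loop the commit describes.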

Proxy resilience:
- Add dial retry with linear backoff (3 attempts) to handle socat
  startup delay when ports are first detected
- Cache ReverseProxy instances per sandbox+port+host in sync.Map
- Add EvictProxy callback wired into sandbox Manager.Destroy

Buffer and server hardening:
- Increase PTY and exec stream channel buffers from 16 to 256
- Add ReadHeaderTimeout (10s) and IdleTimeout (620s) to host agent
  HTTP server

Network tuning:
- Set TAP device TxQueueLen to 5000 (up from default 1000)
- Add Firecracker tx_rate_limiter (200 MB/s sustained, 100 MB burst)
  to prevent guest traffic from saturating the TAP

Merge pull request 'Fixed network throttle when an application is running' (#37) from fix/network-throttle-on-load into dev cdacc12a48

Reviewed-on: #37

Version bump f4733e2f7a

Envd version bump f3ec626d58

Merge branch 'main' of git.omukk.dev:wrenn/wrenn into dev 2e998a26a2

Fix empty WRENN_TEMPLATE_ID after resuming paused sandbox f3572f7356

Resume() was building VMConfig without TemplateID, so Firecracker MMDS
received an empty string. envd's PostInit then wrote that empty value to
/run/wrenn/.WRENN_TEMPLATE_ID. Fix by persisting the template ID in
snapshot metadata during Pause and reading it back during Resume.

fix: close stale TCP connections across snapshot/restore to prevent envd hangs 7ef9a64613

After Firecracker snapshot restore, zombie TCP sockets from the previous
session cause Go runtime corruption inside the guest VM, making envd
unresponsive. This manifests as infinite loading in the file browser and
terminal timeouts (524) in production (HTTP/2 + Cloudflare) but not locally.

Four-part fix:
- Add ServerConnTracker to envd that tracks connections via ConnState callback,
  closes idle connections and disables keep-alives before snapshot, then closes
  all pre-snapshot zombie connections on restore (while preserving post-restore
  connections like the /init request)
- Split envdclient into timeout (2min) and streaming (no timeout) HTTP clients;
  use streaming client for file transfers and process RPCs
- Close host-side idle envdclient connections before PrepareSnapshot so FIN
  packets propagate during the 3s quiesce window
- Add StreamingHTTPClient() accessor; streaming file transfer handlers in
  hostagent use it instead of the timeout client

fix: prevent sandbox halt after resume by fixing HTTP/2 HOL blocking and adding timeouts bb582deefa

Disable HTTP/2 on both host agent server and CP→agent transport — multiplexing
caused head-of-line blocking when a slow sandbox RPC stalled the shared connection.
Add ResponseHeaderTimeout to envd HTTP clients. Merge SetDefaults into Resume's
PostInit call to eliminate an extra round-trip that could hang on a stale connection.

fix: prevent Go runtime memory corruption and sandbox halt after snapshot restore 3deecbff89

Three root causes addressed:

1. Go page allocator corruption: allocations between the pre-snapshot GC
   and VM freeze leave the summary tree inconsistent. After restore, GC
   reads corrupted metadata — either panicking (killing PID 1 → kernel
   panic) or silently failing to collect, causing unbounded heap growth
   until OOM. Fix: move GC to after all HTTP allocations in
   PostSnapshotPrepare, then set GOMAXPROCS(1) so any remaining
   allocations run sequentially with no concurrent page allocator access.
   GOMAXPROCS is restored on first health check after restore.

2. PostInit timeout starvation: WaitUntilReady and PostInit shared a
   single 30s context. If WaitUntilReady consumed most of it, PostInit
   failed — RestoreAfterSnapshot never ran, leaving envd with keep-alives
   disabled and zombie connections. Fix: separate timeout contexts.

3. CP HTTP server missing timeouts: no ReadHeaderTimeout or IdleTimeout
   caused goroutine leaks from hung proxy connections. Fix: add both,
   matching host agent values.
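
The "separate timeout contexts" fix for root cause 2 can be sketched as follows. Function names, durations, and the simulated phases are hypothetical; the point is only that each phase derives its own deadline from the parent context instead of sharing one.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runPhases gives waitReady and postInit independent deadlines, so a slow
// readiness wait cannot eat into postInit's budget.
func runPhases(parent context.Context, readyTimeout, initTimeout time.Duration,
	waitReady, postInit func(context.Context) error) error {

	readyCtx, cancelReady := context.WithTimeout(parent, readyTimeout)
	err := waitReady(readyCtx)
	cancelReady()
	if err != nil {
		return fmt.Errorf("wait until ready: %w", err)
	}

	initCtx, cancelInit := context.WithTimeout(parent, initTimeout)
	defer cancelInit()
	if err := postInit(initCtx); err != nil {
		return fmt.Errorf("post init: %w", err)
	}
	return nil
}

// postInitBudget simulates a readiness wait that uses most of its own timeout
// and reports how much time postInit had left on its deadline when it started.
func postInitBudget() (time.Duration, error) {
	var remaining time.Duration
	err := runPhases(context.Background(), 300*time.Millisecond, 300*time.Millisecond,
		func(ctx context.Context) error {
			select {
			case <-time.After(200 * time.Millisecond): // slow but successful
				return nil
			case <-ctx.Done():
				return ctx.Err()
			}
		},
		func(ctx context.Context) error {
			if deadline, ok := ctx.Deadline(); ok {
				remaining = time.Until(deadline)
			}
			return nil
		})
	return remaining, err
}

func main() {
	remaining, err := postInitBudget()
	// With a shared 300ms context, only about 100ms would be left here.
	fmt.Println(remaining > 200*time.Millisecond, err)
}
```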

Also adds UFFD prefetch to proactively load all guest pages after restore,
eliminating on-demand page fault latency for subsequent RPC calls.

feat: rewrite envd guest agent in Rust (envd-rs) 0b53d34417

Complete Rust rewrite of the Go envd guest daemon that runs as PID 1
inside Firecracker microVMs. Feature-complete across all 8 phases:

- Health, metrics, and env var endpoints
- Crypto (SHA-256/512, HMAC), auth (secure token, signing), init/snapshot
- Connect RPC via connectrpc + buffa (process + filesystem services)
- File transfer (GET/POST /files) with gzip, multipart, chown, ENOSPC
- Port subsystem (/proc/net/tcp scanner, socat forwarder)
- Cgroup2 manager with noop fallback
- Snapshot/restore lifecycle (conntracker, port subsystem stop/restart)
- SIGTERM graceful shutdown, --cmd initial process spawn
- MMDS metadata polling for Firecracker mode

42 source files, ~4200 LOC, 4.1MB stripped release binary.
Makefile updated: build-envd now targets Rust (musl static),
build-envd-go preserved for Go builds.

refactor: remove Go envd module, update host agent for Rust envd 1143acd37a

The Go envd guest agent (`envd/`) is fully replaced by the Rust
implementation (`envd-rs/`). This commit removes the Go module and
updates all references across the codebase.

Makefile: remove ENVD_DIR, VERSION_ENVD, build-envd-go, dev-envd-go,
and Go envd from proto/fmt/vet/tidy/clean targets. Add static-link
verification to build-envd.

Host agent: rewrite snapshot quiesce comments that referenced Go GC
and page allocator corruption — no longer applicable with Rust envd.
Tighten envdclient to expect HTTP 200 (not 204) from health and file
upload endpoints, and require JSON version response from FetchVersion.

Remove NOTICE (no e2b-derived code remains). Update CLAUDE.md and
README.md to reflect Rust envd architecture.

rename guest hostname from "sandbox" to "capsule" f328113a2a

Terminal prompt inside VMs now shows root@capsule instead of
root@sandbox, aligning with user-facing "capsule" terminology.

Updated static link check for envd bbcde17d49

fix: resolve PTY failure, MMDS file writes, and metrics instability in envd-rs 31456fd169

Three bugs fixed:

1. PTY connections failed because home directory was hardcoded as
   /home/{username} instead of reading from /etc/passwd. For root,
   this produced /home/root/ which doesn't exist — CWD validation
   rejected every PTY Start request without explicit cwd. Fixed all
   6 locations to use user.dir from nix::unistd::User.

2. MMDS polling silently failed to parse metadata because the
   logs_collector_address field lacked #[serde(default)]. The host
   agent only sends instanceID + envID — missing "address" field
   caused every deserialize attempt to fail, so .WRENN_SANDBOX_ID
   and .WRENN_TEMPLATE_ID were never written. Also added error
   logging and create_dir_all before file writes.

3. Metrics CPU values were non-deterministic because a fresh
   sysinfo::System was created per request with a 100ms sleep
   between reads. Replaced with a background thread that samples
   CPU at fixed 1-second intervals via a persistent System instance,
   matching gopsutil's internal caching behavior. Metrics endpoint
   now reads cached atomic values — no blocking, consistent window.

Also: close master PTY fd in child pre_exec, add process.Start
request logging, bump version to 0.2.0.

fix: improve error feedback for terminal disconnects and host unavailability ef5f223863

Show "[session disconnected]" in terminal when PTY websocket closes cleanly.
Map scheduler and agent unavailability errors to 503 with user-friendly
message instead of leaking internal details.

Merge pull request 'Rewritten envd with rust to improve reliability during pause and resume operations' (#39) from feat/envd-rewrite into dev 20a228eb8d

Reviewed-on: #39

Merge branch 'main' of git.omukk.dev:wrenn/wrenn into dev 233e747d5d

fix: accurate sandbox metrics and memory management 1178ab8b21

Three issues fixed:

1. Memory metrics read host-side VmRSS of the Firecracker process,
   which includes guest page cache and never decreases. Replaced
   readMemRSS(fcPID) with readEnvdMemUsed(client) that queries
   envd's /metrics endpoint for guest-side total - MemAvailable.
   This matches neofetch and reflects actual process memory.

2. Added Firecracker balloon device (deflate_on_oom, 5s stats) and
   envd-side periodic page cache reclaimer (drop_caches when >80%
   used). Reclaimer is gated by snapshot_in_progress flag with
   sync() before freeze to prevent memory corruption during pause.

3. Sampling interval 500ms → 1s, ring buffer capacities adjusted
   to maintain same time windows. Reduces per-host HTTP load from
   240 calls/sec to 120 calls/sec at 120 capsules.

Also: maxDiffGenerations 8 → 1 (merge every re-pause since UFFD
lazy-loads anyway), envd mem_used formula uses total - available.
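
As an illustration of the `total - available` formula, here is a small sketch that computes used memory from `/proc/meminfo`-style text. It is written in Go for consistency with the host agent; envd itself is Rust, and `memUsedKiB` is a hypothetical name, not the project's function.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// memUsedKiB returns MemTotal - MemAvailable (in KiB) from text in the
// /proc/meminfo format. Page cache counts as available, so it is excluded.
func memUsedKiB(meminfo string) (uint64, error) {
	var total, avail uint64
	var haveTotal, haveAvail bool
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		switch fields[0] {
		case "MemTotal:":
			total, haveTotal = v, true
		case "MemAvailable:":
			avail, haveAvail = v, true
		}
	}
	if !haveTotal || !haveAvail {
		return 0, errors.New("meminfo: MemTotal or MemAvailable missing")
	}
	if avail > total {
		return 0, nil
	}
	return total - avail, nil
}

func main() {
	sample := "MemTotal:       16384000 kB\n" +
		"MemFree:         1000000 kB\n" +
		"MemAvailable:   12000000 kB\n" +
		"Cached:         11000000 kB\n"
	used, err := memUsedKiB(sample)
	fmt.Println(used, err) // 4384000 <nil>
}
```

In the sample, a host-side RSS reading would include the roughly 11 GB of cache, while the guest-side figure reports about 4.4 GB in use.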

Merge pull request 'fix: accurate sandbox metrics and memory management' (#41) from bugfix/sandbox-metrics-calculations into dev cb28f7759d

Reviewed-on: #41

fix: drop page cache before snapshot to reduce memory dump size 01819642cc

Linux keeps freed memory as page cache, which Firecracker snapshots
as non-zero blocks. A 16GB VM with 12GB stale cache would write all
12GB to disk. Dropping the page cache (not dentries/inodes) in
/snapshot/prepare before blocking the reclaimer shrinks snapshots
to actual working set size with minimal resume latency impact.

fix: merge capsule data in-place to prevent visual refresh on poll 4954b19d7c

Replaces full array assignment with granular merge that reuses existing
Svelte proxy objects, so only rows with actual data changes re-render.

feat: admin grant/revoke from admin panel cac6fcd626

Add PUT /v1/admin/users/{id}/admin endpoint and frontend UI for
granting and revoking platform admin status. Uses atomic conditional
SQL (RevokeUserAdmin) to prevent race conditions that could remove
the last admin. Includes idempotency check, audit logging, and
confirmation dialog with self-demotion warning.

feat: show template owner and restrict delete in admin panel 021d709de2

Add Owner column to admin templates table, resolving team IDs to names
via admin teams API. Disable delete for non-platform templates and the
minimal template, with contextual tooltips explaining why.

fix: fetch sandbox metrics immediately on page load 1244c08e42

Metrics data was only fetched after Chart.js dynamic import completed,
leaving graphs empty until the first poll interval fired. Now
loadMetrics() runs in parallel with the Chart.js import, and
initCharts() resets the dedup key so pre-fetched data populates
newly created chart instances.

Merge pull request 'Enhanced frontend ux' (#42) from enhance/frontend into dev fd5fa28205

Reviewed-on: #42

fix: resolve pause/snapshot failures and CoW exhaustion on large VMs 51b5d7b3ba

Remove hard 10s timeout from Firecracker HTTP client — callers already
pass context.Context with appropriate deadlines, and 20GB+ memfile
writes easily exceed 10s.

Ensure CoW file is at least as large as the origin rootfs. Previously,
WRENN_DEFAULT_ROOTFS_SIZE=30Gi expanded the base image to 30GB but the
default 5GB CoW could not hold all writes, causing dm-snapshot
invalidation and EIO on all guest I/O.
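
The sizing rule amounts to taking the larger of the two sizes. A minimal sketch, with a hypothetical function name and the 5 GiB / 30 GiB figures from the message used as inputs:

```go
package main

import "fmt"

const gib = int64(1) << 30

// cowSizeBytes returns the CoW file size to allocate: the configured default,
// raised to the origin rootfs size when the origin is larger, so that even a
// full rewrite of the origin cannot overflow the snapshot store.
func cowSizeBytes(defaultCoW, originRootfs int64) int64 {
	if originRootfs > defaultCoW {
		return originRootfs
	}
	return defaultCoW
}

func main() {
	fmt.Println(cowSizeBytes(5*gib, 30*gib) / gib) // 30
	fmt.Println(cowSizeBytes(5*gib, 2*gib) / gib)  // 5
}
```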

Destroy frozen VMs in resumeOnError instead of leaving zombies that
report "running" but can't execute. Use fresh context for the resume
attempt so a cancelled caller context doesn't falsely trigger destroy.

Increase CP→Agent ResponseHeaderTimeout from 45s to 5min and
PrepareSnapshot timeout from 3s to 30s for large-memory VMs.

After failed pause, ping agent to detect destroyed sandboxes and mark
DB status as "error" instead of reverting to "running".

fix: inflate balloon before snapshot to reduce memfile size 38799770db

Firecracker dumps the entire VM memory region regardless of guest
usage. A 20GB VM using 500MB still produces a ~20GB memfile because
freed pages retain stale data (non-zero blocks).

Inflate the balloon device before snapshot to reclaim free guest
memory. Balloon pages become zero from FC's perspective, allowing
ProcessMemfile to skip them. This reduces memfile size from ~20GB
to ~1-2GB for lightly-used VMs.

- Pause: read guest memory usage, inflate balloon to reclaim free
  pages, wait 2s for guest kernel to process, then proceed
- Resume: deflate balloon to 0 after PostInit so guest gets full
  memory back
- createFromSnapshot: same deflation since template snapshots
  inherit inflated balloon state
- All balloon ops are best-effort with debug logging on failure
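
One way to picture the inflation step is as a target-size calculation. The sketch below is an assumption about how a target could be derived from guest memory usage; the headroom parameter in particular is invented for illustration and does not appear in the commit. Firecracker's balloon takes its target in MiB (`amount_mib`).

```go
package main

import "fmt"

// balloonTargetMiB returns how many MiB the balloon could claim before a
// snapshot: everything the guest is not using, minus a safety headroom so the
// guest is not pushed into OOM. All three inputs are in MiB.
func balloonTargetMiB(totalMiB, usedMiB, headroomMiB int64) int64 {
	target := totalMiB - usedMiB - headroomMiB
	if target < 0 {
		return 0 // guest is already tight on memory; do not inflate
	}
	return target
}

func main() {
	// A 20 GiB VM using 500 MiB, keeping 256 MiB of headroom (assumed value).
	fmt.Println(balloonTargetMiB(20480, 500, 256)) // 19724
	// On resume the balloon is deflated back to 0 so the guest regains memory.
	fmt.Println(balloonTargetMiB(1024, 1000, 256)) // 0
}
```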

fix: harden pause flow with connection isolation and UFFD event handling c93ad5e2db

Restructure pause to: block new operations (StatusPausing), drain proxy
connections with 5s grace, force-close remaining via context cancellation,
drop page cache, inflate balloon, then freeze vCPUs. Previously connections
could arrive during the pause window and API operations weren't blocked.

Handle UFFD_EVENT_REMOVE/UNMAP/REMAP/FORK gracefully instead of crashing
the UFFD server. These events fire during balloon deflation on snapshot
restore and previously killed the page fault handler, preventing VM boot.

Also adds ConnTracker.ForceClose() with cancellable context propagated
through the proxy handler, so lingering proxy connections are actively
terminated rather than left dangling.

fix: use RwLock for envd Defaults to fix silent mutation loss 2af8412cdc

The /init handler's default_user mutation cloned the Defaults struct,
mutated the clone, then dropped it — the actual state was never updated.
This caused processes to always run as "root" regardless of the user
set via POST /init. Additionally, default_workdir was accepted in the
init request but never applied.

Wrap user and workdir fields in RwLock with accessor methods so mutations
propagate correctly through the shared AppState.

fix: resolve exec 502 by terminating process streams on exit d1d316f35c

The start() and connect() streaming RPCs blocked forever in the data
event loop because ProcessHandle retains a broadcast sender (needed for
reconnection via connect()), preventing the channel from closing.

Race data_rx against end_rx with tokio::select! so the stream terminates
when the process exits. Remaining buffered data is drained before
yielding the end event.

fix: subscribe to process channels before spawning threads to prevent event loss 522e1c5e90

Fast-exiting processes (e.g. echo) sent data/end events before
start() subscribed to the broadcast channels, causing the stream
to hang indefinitely and the exec RPC to time out with 502.

Move channel subscription into spawn_process, before reader/waiter
threads start, and return pre-subscribed receivers via SpawnedProcess.

fix: resolve process stream hangs, pause race, and PTY signal loss aca43d51eb

- Cache terminal EndEvent on ProcessHandle so connect() can detect
  already-exited processes instead of hanging forever on broadcast
  receivers that missed the event. Subscribe before checking cache
  to close the TOCTOU window.

- Protect sb.Status writes in Pause with m.mu to prevent data race
  with concurrent readers (AcquireProxyConn, Exec, etc.).

- Restart metrics sampler in restoreRunning so a failed pause attempt
  doesn't permanently kill sandbox metrics collection.

- Return dequeued non-input messages from coalescePtyInput instead of
  dropping them, preventing silent loss of kill/resize signals during
  typing bursts.
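
The last item can be sketched as a coalescing loop that hands back whatever non-input message it dequeues. The `ptyMsg` type, its `Kind` values, and `coalesceInput` are hypothetical stand-ins; the real `coalescePtyInput` signature is not shown in the commit.

```go
package main

import "fmt"

// ptyMsg is a hypothetical message on the PTY stream.
type ptyMsg struct {
	Kind string // "input", "resize", or "kill"
	Data []byte
}

// coalesceInput merges first with any input messages already queued on ch.
// If it dequeues a non-input message, it stops and returns that message as
// pending, so the caller can process it instead of silently losing it.
func coalesceInput(first ptyMsg, ch <-chan ptyMsg) (merged ptyMsg, pending *ptyMsg) {
	merged = ptyMsg{Kind: "input", Data: append([]byte(nil), first.Data...)}
	for {
		select {
		case m, ok := <-ch:
			if !ok {
				return merged, nil // channel closed
			}
			if m.Kind != "input" {
				return merged, &m // hand the signal back to the caller
			}
			merged.Data = append(merged.Data, m.Data...)
		default:
			return merged, nil // nothing else queued right now
		}
	}
}

func main() {
	ch := make(chan ptyMsg, 4)
	ch <- ptyMsg{Kind: "input", Data: []byte("b")}
	ch <- ptyMsg{Kind: "input", Data: []byte("c")}
	ch <- ptyMsg{Kind: "resize"}
	ch <- ptyMsg{Kind: "input", Data: []byte("d")}

	merged, pending := coalesceInput(ptyMsg{Kind: "input", Data: []byte("a")}, ch)
	fmt.Printf("%s %v\n", merged.Data, pending != nil && pending.Kind == "resize")
}
```

Stopping at the first non-input message also preserves ordering: the later input `d` stays queued behind the resize rather than being merged ahead of it.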

Changed the commands that check whether envd is statically linked 8c34388fc2

Rootfs script updated 6a0fea30a6

Merge branch 'dev' into fix/large-operations 1472d77b52

Merge pull request 'fix: resolve large operation reliability — stream hangs, pause races, and memory bloat' (#43) from fix/large-operations into dev ead406bdac

Reviewed-on: #43

test(envd): add 136 unit tests across 12 modules 485be22a16

Cover all pure-function modules with inline #[cfg(test)] blocks:
crypto (NIST/RFC 4231 known-answer vectors), auth (SecureToken ops,
signature generation/validation), conntracker (snapshot lifecycle),
execcontext, util (AtomicMax concurrent correctness), http/encoding
(RFC 7231 negotiation), port/conn (/proc/net/tcp parsing),
rpc/entry (format_permissions), and permissions/path (tilde expansion,
ensure_dirs). Add tempfile dev-dep for filesystem tests. Update
Makefile test target to include cargo test.

Merge pull request 'test (envd): add 136 unit tests across 12 modules' (#44) from testing/envd into dev 0bfda08f47