1
0
forked from wrenn/wrenn
Commit Graph

14 Commits

Author SHA1 Message Date
906cc42d13 Rename AGENT_*/CP_LISTEN_ADDR env vars to WRENN_* prefix
AGENT_FILES_ROOTDIR → WRENN_DIR, AGENT_LISTEN_ADDR → WRENN_HOST_LISTEN_ADDR,
AGENT_CP_URL → WRENN_CP_URL, AGENT_HOST_INTERFACE → WRENN_HOST_INTERFACE,
CP_LISTEN_ADDR → WRENN_CP_LISTEN_ADDR. Consolidates all env vars under a
consistent WRENN_ namespace.
2026-03-29 00:30:20 +06:00
75b28ed899 Add UUID-based template IDs and team-scoped template directory layout
Introduces internal/layout package for centralized path construction,
migrates templates from name-based TEXT primary keys to UUID PKs with
team-scoped directories (WRENN_DIR/images/teams/{team_id}/{template_id}).
The built-in minimal template uses sentinel zero UUIDs. Proto messages
carry team_id + template_id alongside deprecated template name field.
Team deletion now cleans up template files across all hosts.
2026-03-29 00:30:10 +06:00
c0d6381bbe Add disk_size_mb, auto-expand base images, admin templates endpoint
Disk sizing:
- Add disk_size_mb column to sandboxes table (default 20480 = 20GB)
- Add disk_size_mb to CreateSandboxRequest proto, passed through the
  full chain: service → RPC → host agent → sandbox manager → devicemapper
- devicemapper.CreateSnapshot takes separate cowSizeBytes param so the
  sparse CoW file can be sized independently from the origin
- EnsureImageSizes() runs at host agent startup: expands any base image
  smaller than 20GB via truncate + resize2fs (sparse, no extra physical
  disk). Sandboxes then get the full 20GB via fast dm-snapshot path
- FlattenRootfs shrinks output images with resize2fs -M so stored
  templates are compact; EnsureImageSizes re-expands on next startup

Admin templates visibility:
- Add GET /v1/admin/templates endpoint listing all templates across teams
- Frontend admin templates page uses listAdminTemplates() instead of
  team-scoped listSnapshots()
- Platform templates (team_id = all-zeros UUID) now visible to all teams:
  GetTemplateByTeam, ListTemplatesByTeam, ListTemplatesByTeamAndType
  queries include platform team_id in WHERE clause
2026-03-26 23:45:41 +06:00
1ce62934b3 Add template build system with admin panel, async workers, and FlattenRootfs RPC
Introduces an end-to-end template building pipeline: admins submit a recipe
(list of shell commands) via the dashboard, a Redis-backed worker pool spins
up a sandbox, executes each command, and produces either a full snapshot
(with healthcheck) or an image-only template (rootfs flattened via a new
FlattenRootfs host-agent RPC). Build progress and per-step logs are persisted
to a new template_builds table and polled by the frontend.

Backend:
- New FlattenRootfs RPC (proto + host agent + sandbox manager)
- BuildService with Redis queue (BLPOP) and configurable worker pool (default 2)
- Admin-only REST endpoints: POST/GET /v1/admin/builds, GET /v1/admin/builds/{id}
- Migration for template_builds table with JSONB logs and recipe columns
- sqlc queries for build CRUD and progress updates

Frontend:
- /admin/templates page with Templates + Builds tabs
- Create Template dialog with recipe textarea, healthcheck, specs
- Build history with expandable per-step logs, status badges, progress bars
- Auto-polling every 3s for active builds
- AdminSidebar updated with Templates nav item
2026-03-26 15:27:21 +06:00
f4675ebfc0 WIP: Add HTTP proxy endpoint to host agent
Add /proxy/{sandbox_id}/{port}/* handler that reverse-proxies HTTP
requests to services running inside sandbox VMs. The sandbox's host IP
(10.11.0.{idx}) is used as the upstream target.

Includes port validation (1-65535) and shared HTTP transport for
connection pooling. Supports WebSocket upgrades for protocols like
Jupyter's streaming API.

This is an intermediate state — needs further work for the full code
interpreter feature.
2026-03-26 02:12:01 +06:00
9acdbb5ae9 Add per-sandbox CPU/memory/disk metrics collection
Samples /proc/{fc_pid}/stat (CPU%), /proc/{fc_pid}/status (VmRSS), and
stat() on CoW files at 500ms intervals per running sandbox. Three tiered
ring buffers downsample into 30s and 5min averages for 10min/2h/24h
retention. Metrics are flushed to DB on pause (all tiers) and destroy
(24h only). New GetSandboxMetrics and FlushSandboxMetrics RPCs on the
host agent, proxied through GET /v1/sandboxes/{id}/metrics?range= on
the control plane. Returns live data for running sandboxes, DB data for
paused, and 404 for stopped.
2026-03-25 20:10:33 +06:00
e069b3e679 Add BYOC page, admin section, and is_byoc team visibility gating
- Frontend: BYOC hosts page (/dashboard/byoc) with register/delete flows,
  shimmer loading, pulsing online status, animated token reveal checkmark
- Frontend: Admin section (/admin/hosts) with platform + BYOC tabs, stat
  pills, skeleton loading, slide-in animations for new rows
- Frontend: AdminSidebar component with accent top bar and admin pill badge
- Frontend: BYOC nav item shown only when team.is_byoc is true (derived
  from teams store, not JWT); disabled for members
- Frontend: Admin shield button in Sidebar, visible only to platform admins
- Backend: is_admin in JWT claims + requireAdmin middleware (DB-validated)
- Backend: is_byoc added to teamResponse so frontend derives visibility
  from fresh team data rather than stale JWT fields
- Backend: SetBYOC admin endpoint (PUT /v1/admin/teams/{id}/byoc)
- Backend: Admin hosts list enriches BYOC entries with team_name
- Host agent: load .env file via godotenv on startup
2026-03-25 03:10:41 +06:00
9bf67aa7f7 Implement host registration, JWT refresh tokens, and multi-host scheduling
Replaces the hardcoded CP_HOST_AGENT_ADDR single-agent setup with a
DB-driven registration system supporting multiple host agents (BYOC).

Key changes:
- Host agents register via one-time token, receive a 7-day JWT + 60-day
  refresh token; heartbeat loop auto-refreshes on 401/403 and pauses all
  sandboxes if refresh fails
- HostClientPool: lazy Connect RPC client cache keyed by host ID, replacing
  the single static agent client throughout the API and service layers
- RoundRobinScheduler: picks an online host for each new sandbox via
  ListActiveHosts; extensible for future scheduling strategies
- HostMonitor (replaces Reconciler): passive heartbeat staleness check marks
  hosts unreachable and sandboxes missing after 90s; active reconciliation
  per online host restores missing-but-alive sandboxes and stops orphans
- Graceful host delete: returns 409 with affected sandbox list without
  ?force=true; force-delete destroys sandboxes then evicts pool client
- Snapshot delete broadcasts to all online hosts (templates have no host_id)
- sandbox.Manager.PauseAll: pauses all running VMs on CP connectivity loss
- New migration: host_refresh_tokens table with token rotation (issue-then-
  revoke ordering to prevent lockout on mid-rotation crash)
- New sandbox status 'missing' (reversible, unlike 'stopped') and host
  status 'unreachable'; both reflected in OpenAPI spec
- Fix: refresh token auth failure now returns 401 (was 400 via generic
  'invalid' substring match in serviceErrToHTTP)
2026-03-24 18:32:05 +06:00
2c66959b92 Add host registration, heartbeat, and multi-host management
Implements the full host ↔ control plane connection flow:

- Host CRUD endpoints (POST/GET/DELETE /v1/hosts) with role-based access:
  regular hosts admin-only, BYOC hosts for admins and team owners
- One-time registration token flow: admin creates host → gets token (1hr TTL
  in Redis + Postgres audit trail) → host agent registers with specs → gets
  long-lived JWT (1yr)
- Host agent registration client with automatic spec detection (arch, CPU,
  memory, disk) and token persistence to disk
- Periodic heartbeat (30s) via POST /v1/hosts/{id}/heartbeat with X-Host-Token
  auth and host ID cross-check
- Token regeneration endpoint (POST /v1/hosts/{id}/token) for retry after
  failed registration
- Tag management (add/remove/list) with team-scoped access control
- Host JWT with typ:"host" claim, cross-use prevention in both VerifyJWT and
  VerifyHostJWT
- requireHostToken middleware for host agent authentication
- DB-level race protection: RegisterHost uses AND status='pending' with
  rows-affected check; Redis GetDel for atomic token consume
- Migration for future mTLS support (cert_fingerprint, mtls_enabled columns)
- Host agent flags: --register (one-time token), --address (required ip:port)
- serviceErrToHTTP extended with "forbidden" → 403 mapping
- OpenAPI spec, .env.example, and README updated
2026-03-17 05:51:28 +06:00
477d4f8cf6 Add auto-pause TTL and ping endpoint for sandbox inactivity management
Replace the existing auto-destroy TTL behavior with auto-pause: when a
sandbox exceeds its timeout_sec of inactivity, the TTL reaper now pauses
it (snapshot + teardown) instead of destroying it, preserving the ability
to resume later.

Key changes:
- TTL reaper calls Pause instead of Destroy, with fallback to Destroy if
  pause fails (e.g. Firecracker process already gone)
- New PingSandbox RPC resets the in-memory LastActiveAt timer
- New POST /v1/sandboxes/{id}/ping REST endpoint resets both agent memory
  and DB last_active_at
- ListSandboxes RPC now includes auto_paused_sandbox_ids so the reconciler
  can distinguish auto-paused sandboxes from crashed ones in a single call
- Reconciler polls every 5s (was 30s) and marks auto-paused as "paused"
  vs orphaned as "stopped"
- Resume RPC accepts timeout_sec from DB so TTL survives pause/resume cycles
- Reaper checks every 2s (was 10s) and uses a detached context to avoid
  incomplete pauses on app shutdown
- Default timeout_sec changed from 300 to 0 (no auto-pause unless requested)
2026-03-15 05:15:18 +06:00
a1bd439c75 Add sandbox snapshot and restore with UFFD lazy memory loading
Implement full snapshot lifecycle: pause (snapshot + free resources),
resume (UFFD-based lazy restore), and named snapshot templates that
can spawn new sandboxes from frozen VM state.

Key changes:
- Snapshot header system with generational diff mapping (inspired by e2b)
- UFFD server for lazy page fault handling during snapshot restore
- Stable rootfs symlink path (/tmp/fc-vm/) for snapshot compatibility
- Templates DB table and CRUD API endpoints (POST/GET/DELETE /v1/snapshots)
- CreateSnapshot/DeleteSnapshot RPCs in hostagent proto
- Reconciler excludes paused sandboxes (expected absent from host agent)
- Snapshot templates lock vcpus/memory to baked-in values
- Proper cleanup of uffd sockets and pause snapshot files on destroy
2026-03-12 09:19:37 +06:00
b4d8edb65b Add streaming exec and file transfer endpoints
Add WebSocket-based streaming exec endpoint and streaming file
upload/download endpoints to the control plane API. Includes new
host agent RPC methods (ExecStream, StreamWriteFile, StreamReadFile),
envd client streaming support, and OpenAPI spec updates.
2026-03-11 05:42:42 +06:00
ec3360d9ad Add minimal control plane with REST API, database, and reconciler
- REST API (chi router): sandbox CRUD, exec, pause/resume, file write/read
- PostgreSQL persistence via pgx/v5 + sqlc (sandboxes table with goose migration)
- Connect RPC client to host agent for all VM operations
- Reconciler syncs host agent state with DB every 30s (detects TTL-reaped sandboxes)
- OpenAPI 3.1 spec served at /openapi.yaml, Swagger UI at /docs
- Added WriteFile/ReadFile RPCs to hostagent proto and implementations
- File upload via multipart form, download via JSON body POST
- sandbox_id propagated from control plane to host agent on create
2026-03-10 16:50:12 +06:00
6f0c365d44 Add host agent RPC server with sandbox lifecycle management
Implement the host agent as a Connect RPC server that orchestrates
sandbox creation, destruction, pause/resume, and command execution.
Includes sandbox manager with TTL-based reaper, network slot allocator,
rootfs cloning, hostagent proto definition with generated stubs, and
test/debug scripts. Fix Firecracker process lifetime bug where VM was
tied to HTTP request context instead of background context.
2026-03-10 03:54:53 +06:00