1
0
forked from wrenn/wrenn
Commit Graph

13 Commits

Author SHA1 Message Date
1ca10230a9 Prefix network namespaces with wrenn-, add stale cleanup, lower diff cap
Rename ns-{idx} to wrenn-ns-{idx} and veth-{idx} to wrenn-veth-{idx}
to avoid collisions with other tools. Add CleanupStaleNamespaces() at
agent startup to remove orphaned namespaces, veths, iptables rules, and
routes from a previous crash. Lower maxDiffGenerations from 10 to 8 to
prevent Go runtime memory corruption from snapshot/restore drift.
2026-03-29 02:14:30 +06:00
906cc42d13 Rename AGENT_*/CP_LISTEN_ADDR env vars to WRENN_* prefix
AGENT_FILES_ROOTDIR → WRENN_DIR, AGENT_LISTEN_ADDR → WRENN_HOST_LISTEN_ADDR,
AGENT_CP_URL → WRENN_CP_URL, AGENT_HOST_INTERFACE → WRENN_HOST_INTERFACE,
CP_LISTEN_ADDR → WRENN_CP_LISTEN_ADDR. Consolidates all env vars under a
consistent WRENN_ namespace.
2026-03-29 00:30:20 +06:00
c0d6381bbe Add disk_size_mb, auto-expand base images, admin templates endpoint
Disk sizing:
- Add disk_size_mb column to sandboxes table (default 20480 = 20GB)
- Add disk_size_mb to CreateSandboxRequest proto, passed through the
  full chain: service → RPC → host agent → sandbox manager → devicemapper
- devicemapper.CreateSnapshot takes separate cowSizeBytes param so the
  sparse CoW file can be sized independently from the origin
- EnsureImageSizes() runs at host agent startup: expands any base image
  smaller than 20GB via truncate + resize2fs (sparse, no extra physical
  disk). Sandboxes then get the full 20GB via fast dm-snapshot path
- FlattenRootfs shrinks output images with resize2fs -M so stored
  templates are compact; EnsureImageSizes re-expands on next startup

Admin templates visibility:
- Add GET /v1/admin/templates endpoint listing all templates across teams
- Frontend admin templates page uses listAdminTemplates() instead of
  team-scoped listSnapshots()
- Platform templates (team_id = all-zeros UUID) now visible to all teams:
  GetTemplateByTeam, ListTemplatesByTeam, ListTemplatesByTeamAndType
  queries include platform team_id in WHERE clause
2026-03-26 23:45:41 +06:00
f4675ebfc0 WIP: Add HTTP proxy endpoint to host agent
Add /proxy/{sandbox_id}/{port}/* handler that reverse-proxies HTTP
requests to services running inside sandbox VMs. The sandbox's host IP
(10.11.0.{idx}) is used as the upstream target.

Includes port validation (1-65535) and shared HTTP transport for
connection pooling. Supports WebSocket upgrades for protocols like
Jupyter's streaming API.

This is an intermediate state — needs further work for the full code
interpreter feature.
2026-03-26 02:12:01 +06:00
e069b3e679 Add BYOC page, admin section, and is_byoc team visibility gating
- Frontend: BYOC hosts page (/dashboard/byoc) with register/delete flows,
  shimmer loading, pulsing online status, animated token reveal checkmark
- Frontend: Admin section (/admin/hosts) with platform + BYOC tabs, stat
  pills, skeleton loading, slide-in animations for new rows
- Frontend: AdminSidebar component with accent top bar and admin pill badge
- Frontend: BYOC nav item shown only when team.is_byoc is true (derived
  from teams store, not JWT); disabled for members
- Frontend: Admin shield button in Sidebar, visible only to platform admins
- Backend: is_admin in JWT claims + requireAdmin middleware (DB-validated)
- Backend: is_byoc added to teamResponse so frontend derives visibility
  from fresh team data rather than stale JWT fields
- Backend: SetBYOC admin endpoint (PUT /v1/admin/teams/{id}/byoc)
- Backend: Admin hosts list enriches BYOC entries with team_name
- Host agent: load .env file via godotenv on startup
2026-03-25 03:10:41 +06:00
9bf67aa7f7 Implement host registration, JWT refresh tokens, and multi-host scheduling
Replaces the hardcoded CP_HOST_AGENT_ADDR single-agent setup with a
DB-driven registration system supporting multiple host agents (BYOC).

Key changes:
- Host agents register via one-time token, receive a 7-day JWT + 60-day
  refresh token; heartbeat loop auto-refreshes on 401/403 and pauses all
  sandboxes if refresh fails
- HostClientPool: lazy Connect RPC client cache keyed by host ID, replacing
  the single static agent client throughout the API and service layers
- RoundRobinScheduler: picks an online host for each new sandbox via
  ListActiveHosts; extensible for future scheduling strategies
- HostMonitor (replaces Reconciler): passive heartbeat staleness check marks
  hosts unreachable and sandboxes missing after 90s; active reconciliation
  per online host restores missing-but-alive sandboxes and stops orphans
- Graceful host delete: returns 409 with affected sandbox list without
  ?force=true; force-delete destroys sandboxes then evicts pool client
- Snapshot delete broadcasts to all online hosts (templates have no host_id)
- sandbox.Manager.PauseAll: pauses all running VMs on CP connectivity loss
- New migration: host_refresh_tokens table with token rotation (issue-then-
  revoke ordering to prevent lockout on mid-rotation crash)
- New sandbox status 'missing' (reversible, unlike 'stopped') and host
  status 'unreachable'; both reflected in OpenAPI spec
- Fix: refresh token auth failure now returns 401 (was 400 via generic
  'invalid' substring match in serviceErrToHTTP)
2026-03-24 18:32:05 +06:00
866f3ac012 Consolidate host agent path env vars into single AGENT_FILES_ROOTDIR
Replace AGENT_KERNEL_PATH, AGENT_IMAGES_PATH, AGENT_SANDBOXES_PATH,
AGENT_SNAPSHOTS_PATH, and AGENT_TOKEN_FILE with a single
AGENT_FILES_ROOTDIR (default /var/lib/wrenn) that derives all
subdirectory paths automatically.
2026-03-17 05:59:26 +06:00
2c66959b92 Add host registration, heartbeat, and multi-host management
Implements the full host ↔ control plane connection flow:

- Host CRUD endpoints (POST/GET/DELETE /v1/hosts) with role-based access:
  regular hosts admin-only, BYOC hosts for admins and team owners
- One-time registration token flow: admin creates host → gets token (1hr TTL
  in Redis + Postgres audit trail) → host agent registers with specs → gets
  long-lived JWT (1yr)
- Host agent registration client with automatic spec detection (arch, CPU,
  memory, disk) and token persistence to disk
- Periodic heartbeat (30s) via POST /v1/hosts/{id}/heartbeat with X-Host-Token
  auth and host ID cross-check
- Token regeneration endpoint (POST /v1/hosts/{id}/token) for retry after
  failed registration
- Tag management (add/remove/list) with team-scoped access control
- Host JWT with typ:"host" claim, cross-use prevention in both VerifyJWT and
  VerifyHostJWT
- requireHostToken middleware for host agent authentication
- DB-level race protection: RegisterHost uses AND status='pending' with
  rows-affected check; Redis GetDel for atomic token consume
- Migration for future mTLS support (cert_fingerprint, mtls_enabled columns)
- Host agent flags: --register (one-time token), --address (required ip:port)
- serviceErrToHTTP extended with "forbidden" → 403 mapping
- OpenAPI spec, .env.example, and README updated
2026-03-17 05:51:28 +06:00
63e9132d38 Add device-mapper snapshots, test UI, fix pause ordering and lint errors
- Replace reflink rootfs copy with device-mapper snapshots (shared
  read-only loop device per base template, per-sandbox sparse CoW file)
- Add devicemapper package with create/restore/remove/flatten operations
  and refcounted LoopRegistry for base image loop devices
- Fix pause ordering: destroy VM before removing dm-snapshot to avoid
  "device busy" error (FC must release the dm device first)
- Add test UI at GET /test for sandbox lifecycle management (create,
  pause, resume, destroy, exec, snapshot create/list/delete)
- Fix DirSize to report actual disk usage (stat.Blocks * 512) instead
  of apparent size, so sparse CoW files report correctly
- Add timing logs to pause flow for performance diagnostics
- Fix all lint errors across api, network, vm, uffd, and sandbox packages
- Remove obsolete internal/filesystem package (replaced by devicemapper)
- Update CLAUDE.md with device-mapper architecture documentation
2026-03-13 08:25:40 +06:00
a1bd439c75 Add sandbox snapshot and restore with UFFD lazy memory loading
Implement full snapshot lifecycle: pause (snapshot + free resources),
resume (UFFD-based lazy restore), and named snapshot templates that
can spawn new sandboxes from frozen VM state.

Key changes:
- Snapshot header system with generational diff mapping (inspired by e2b)
- UFFD server for lazy page fault handling during snapshot restore
- Stable rootfs symlink path (/tmp/fc-vm/) for snapshot compatibility
- Templates DB table and CRUD API endpoints (POST/GET/DELETE /v1/snapshots)
- CreateSnapshot/DeleteSnapshot RPCs in hostagent proto
- Reconciler excludes paused sandboxes (expected absent from host agent)
- Snapshot templates lock vcpus/memory to baked-in values
- Proper cleanup of uffd sockets and pause snapshot files on destroy
2026-03-12 09:19:37 +06:00
6f0c365d44 Add host agent RPC server with sandbox lifecycle management
Implement the host agent as a Connect RPC server that orchestrates
sandbox creation, destruction, pause/resume, and command execution.
Includes sandbox manager with TTL-based reaper, network slot allocator,
rootfs cloning, hostagent proto definition with generated stubs, and
test/debug scripts. Fix Firecracker process lifetime bug where VM was
tied to HTTP request context instead of background context.
2026-03-10 03:54:53 +06:00
7753938044 Add host agent with VM lifecycle, TAP networking, and envd client
Implements Phase 1: boot a Firecracker microVM, execute a command inside
it via envd, and get the output back. Uses raw Firecracker HTTP API via
Unix socket (not the Go SDK) for full control over the VM lifecycle.

- internal/vm: VM manager with create/pause/resume/destroy, Firecracker
  HTTP client, process launcher with unshare + ip netns exec isolation
- internal/network: per-sandbox network namespace with veth pair, TAP
  device, NAT rules, and IP forwarding
- internal/envdclient: Connect RPC client for envd process/filesystem
  services with health check retry
- cmd/host-agent: demo binary that boots a VM, runs "echo hello", prints
  output, and cleans up
- proto/envd: canonical proto files with buf + protoc-gen-connect-go
  code generation
- images/wrenn-init.sh: minimal PID 1 init script for guest VMs
- CLAUDE.md: updated architecture to reflect TAP networking (not vsock)
  and Firecracker HTTP API (not Go SDK)
2026-03-10 00:06:47 +06:00
bd78cc068c Initial project structure for Wrenn Sandbox
Set up directory layout, Makefiles, go.mod files, docker-compose,
and empty placeholder files for all packages.
2026-03-09 17:22:47 +06:00