1
0
forked from wrenn/wrenn
Commit Graph

11 Commits

Author SHA1 Message Date
75b28ed899 Add UUID-based template IDs and team-scoped template directory layout
Introduces internal/layout package for centralized path construction,
migrates templates from name-based TEXT primary keys to UUID PKs with
team-scoped directories (WRENN_DIR/images/teams/{team_id}/{template_id}).
The built-in minimal template uses sentinel zero UUIDs. Proto messages
carry team_id + template_id alongside deprecated template name field.
Team deletion now cleans up template files across all hosts.
2026-03-29 00:30:10 +06:00
34af77e0d8 Fix snapshot race, delete auth, sparse dd, default disk to 5GB
Snapshot race fix:
- Pre-mark sandbox as "paused" in DB before issuing CreateSnapshot and
  PauseSandbox RPCs, preventing the reconciler from marking it "stopped"
  during the flatten window when the sandbox is gone from the host
  agent's in-memory map but DB still says "running"
- Revert status to "running" on RPC failure
- Check ctx.Err() before writing response to avoid writing to dead
  connections when client disconnects during long snapshot operations

Delete auth fix:
- Block non-admin deletion of platform templates (team_id = all-zeros)
  at DELETE /v1/snapshots/{name} with 403, preventing file deletion
  before the team ownership check fails

Sparse dd:
- Add conv=sparse to dd in FlattenSnapshot so flattened images preserve
  sparseness (~200MB actual vs 5GB logical)

Default disk size:
- Change default disk_size_mb from 20GB to 5GB across migration,
  manager, service, build, and EnsureImageSizes
- Disable split-button dropdown arrow for platform templates in
  dashboard snapshots page (teams cannot delete platform templates)
2026-03-28 14:30:18 +06:00
4ddd494160 Switch database IDs from TEXT to native UUID
Consolidate 16 migrations into one with UUID columns for all entity
IDs. TEXT is kept only for polymorphic fields (audit_logs.actor_id,
resource_id) and template names. The id package now generates UUIDs
via google/uuid, with Format*/Parse* helpers for the prefixed wire
format (sb-{uuid}, usr-{uuid}, etc.). Auth context, services, and
handlers pass pgtype.UUID internally; conversion to/from prefixed
strings happens at API and RPC boundaries. Adds PlatformTeamID
(all-zeros UUID) for shared resources.
2026-03-26 16:16:21 +06:00
1be30034bd Add audit log infrastructure and GET /v1/audit-logs endpoint
Introduces an append-only audit trail for all user and system actions:
sandbox lifecycle (create/pause/resume/destroy/auto-pause), snapshots,
team rename, API key create/revoke, member add/remove/leave/role_update,
and BYOC host add/delete/marked_down/marked_up.

- New audit_logs table (migration) with team_id, actor, resource,
  action, scope (team|admin), status (success|info|warning|error),
  metadata, and created_at
- AuditLogger (internal/audit) with named fire-and-forget methods per
  event; system actor used for background events (HostMonitor, TTL reaper)
- GET /v1/audit-logs: JWT-only, cursor pagination (max 200), multi-value
  filters for resource_type and action (comma-sep or repeated params);
  members see team-scoped events only, admins/owners see all
- AuthContext extended with APIKeyID + APIKeyName so API key requests
  record meaningful actor identity
- HostMonitor wired with AuditLogger for auto-pause and host marked_down
2026-03-25 05:15:16 +06:00
9bf67aa7f7 Implement host registration, JWT refresh tokens, and multi-host scheduling
Replaces the hardcoded CP_HOST_AGENT_ADDR single-agent setup with a
DB-driven registration system supporting multiple host agents (BYOC).

Key changes:
- Host agents register via one-time token, receive a 7-day JWT + 60-day
  refresh token; heartbeat loop auto-refreshes on 401/403 and pauses all
  sandboxes if refresh fails
- HostClientPool: lazy Connect RPC client cache keyed by host ID, replacing
  the single static agent client throughout the API and service layers
- RoundRobinScheduler: picks an online host for each new sandbox via
  ListActiveHosts; extensible for future scheduling strategies
- HostMonitor (replaces Reconciler): passive heartbeat staleness check marks
  hosts unreachable and sandboxes missing after 90s; active reconciliation
  per online host restores missing-but-alive sandboxes and stops orphans
- Graceful host delete: returns 409 with affected sandbox list without
  ?force=true; force-delete destroys sandboxes then evicts pool client
- Snapshot delete broadcasts to all online hosts (templates have no host_id)
- sandbox.Manager.PauseAll: pauses all running VMs on CP connectivity loss
- New migration: host_refresh_tokens table with token rotation (issue-then-
  revoke ordering to prevent lockout on mid-rotation crash)
- New sandbox status 'missing' (reversible, unlike 'stopped') and host
  status 'unreachable'; both reflected in OpenAPI spec
- Fix: refresh token auth failure now returns 401 (was 400 via generic
  'invalid' substring match in serviceErrToHTTP)
2026-03-24 18:32:05 +06:00
5f0dbadea6 Fix snapshot and sandbox delete consistency
- Snapshot delete: make agent RPC failure a hard error so DB record is
  not removed when files cannot be deleted from disk
- Snapshot overwrite: call agent to delete old files before removing the
  DB record, preventing stale memfile.{uuid} generations from accumulating
  on disk across repeated overwrites
- Sandbox destroy: only swallow CodeNotFound from the agent (sandbox
  already gone / TTL-reaped); any other error now propagates to the caller
  instead of being silently ignored
2026-03-23 02:59:30 +06:00
f38d5812d1 Extract shared service layer for sandbox, API key, and template operations
Moves business logic from API handlers into internal/service/ so that
both the REST API and the upcoming dashboard can share the same operations
without duplicating code. API handlers now delegate to the service layer
and only handle HTTP-specific concerns (request parsing, response formatting).
2026-03-16 05:39:30 +06:00
c92cc29b88 Add authentication, authorization, and team-scoped access control
Implement email/password auth with JWT sessions and API key auth for
sandbox lifecycle. Users get a default team on signup; sandboxes,
snapshots, and API keys are scoped to teams.

- Add user, team, users_teams, and team_api_keys tables (goose migrations)
- Add JWT middleware (Bearer token) for user management endpoints
- Add API key middleware (X-API-Key header, SHA-256 hashed) for sandbox ops
- Add signup/login handlers with transactional user+team creation
- Add API key CRUD endpoints (create/list/delete)
- Replace owner_id with team_id on sandboxes and templates
- Update all handlers to use team-scoped queries
- Add godotenv for .env file loading
- Update OpenAPI spec and test UI with auth flows
2026-03-14 03:57:06 +06:00
a0d635ae5e Fix path traversal in template/snapshot names and network cleanup leaks
Add SafeName validator (allowlist regex) to reject directory traversal
in user-supplied template and snapshot names. Validated at both API
handlers (400 response) and sandbox manager (defense in depth).

Refactor CreateNetwork with rollback slice so partially created
resources (namespace, veth, routes, iptables rules) are cleaned up
on any error. Refactor RemoveNetwork to collect and return errors
instead of silently ignoring them.
2026-03-13 08:40:36 +06:00
63e9132d38 Add device-mapper snapshots, test UI, fix pause ordering and lint errors
- Replace reflink rootfs copy with device-mapper snapshots (shared
  read-only loop device per base template, per-sandbox sparse CoW file)
- Add devicemapper package with create/restore/remove/flatten operations
  and refcounted LoopRegistry for base image loop devices
- Fix pause ordering: destroy VM before removing dm-snapshot to avoid
  "device busy" error (FC must release the dm device first)
- Add test UI at GET /test for sandbox lifecycle management (create,
  pause, resume, destroy, exec, snapshot create/list/delete)
- Fix DirSize to report actual disk usage (stat.Blocks * 512) instead
  of apparent size, so sparse CoW files report correctly
- Add timing logs to pause flow for performance diagnostics
- Fix all lint errors across api, network, vm, uffd, and sandbox packages
- Remove obsolete internal/filesystem package (replaced by devicemapper)
- Update CLAUDE.md with device-mapper architecture documentation
2026-03-13 08:25:40 +06:00
a1bd439c75 Add sandbox snapshot and restore with UFFD lazy memory loading
Implement full snapshot lifecycle: pause (snapshot + free resources),
resume (UFFD-based lazy restore), and named snapshot templates that
can spawn new sandboxes from frozen VM state.

Key changes:
- Snapshot header system with generational diff mapping (inspired by e2b)
- UFFD server for lazy page fault handling during snapshot restore
- Stable rootfs symlink path (/tmp/fc-vm/) for snapshot compatibility
- Templates DB table and CRUD API endpoints (POST/GET/DELETE /v1/snapshots)
- CreateSnapshot/DeleteSnapshot RPCs in hostagent proto
- Reconciler excludes paused sandboxes (expected absent from host agent)
- Snapshot templates lock vcpus/memory to baked-in values
- Proper cleanup of uffd sockets and pause snapshot files on destroy
2026-03-12 09:19:37 +06:00