v0.0.1 #8
This PR merges the full v2 development cycle into main. It represents a complete platform buildout on top of the initial Firecracker prototype: multi-tenant auth, a production-grade control plane, a polished dashboard, and a host of reliability and security improvements.
Summary
Infrastructure & Security
Host & Multi-Host
Sandbox Capabilities
clock_settime)
Templates & Snapshots
Observability
Notification Channels
Auth & Teams
Frontend (Dashboard)
License
(PR Description generated by Claude Sonnet 4.6)
- Copy envd source from e2b-dev/infra, internalize shared dependencies into envd/internal/shared/ (keys, filesystem, id, smap, utils)
- Switch from gRPC to Connect RPC for all envd services
- Update module paths to git.omukk.dev/wrenn/{sandbox,sandbox/envd}
- Add proto specs (process, filesystem) with buf-based code generation
- Implement full envd: process exec, filesystem ops, port forwarding, cgroup management, MMDS integration, and HTTP API
- Update main module dependencies (firecracker SDK, pgx, goose, etc.)
- Remove placeholder .gitkeep files replaced by real implementations

Use Firecracker's Diff snapshot type when re-pausing a previously resumed sandbox, capturing only dirty pages instead of a full memory dump. Chains up to 10 incremental generations before collapsing back to a Full snapshot. Multi-generation diff files (memfile.{buildID}) are supported alongside the legacy single-file format in resume, template creation, and snapshot existence checks.

Replace the existing auto-destroy TTL behavior with auto-pause: when a sandbox exceeds its timeout_sec of inactivity, the TTL reaper now pauses it (snapshot + teardown) instead of destroying it, preserving the ability to resume later.

Key changes:
- TTL reaper calls Pause instead of Destroy, with fallback to Destroy if pause fails (e.g. Firecracker process already gone)
- New PingSandbox RPC resets the in-memory LastActiveAt timer
- New POST /v1/sandboxes/{id}/ping REST endpoint resets both agent memory and DB last_active_at
- ListSandboxes RPC now includes auto_paused_sandbox_ids so the reconciler can distinguish auto-paused sandboxes from crashed ones in a single call
- Reconciler polls every 5s (was 30s) and marks auto-paused as "paused" vs orphaned as "stopped"
- Resume RPC accepts timeout_sec from DB so TTL survives pause/resume cycles
- Reaper checks every 2s (was 10s) and uses a detached context to avoid incomplete pauses on app shutdown
- Default timeout_sec changed from 300 to 0 (no auto-pause unless requested)

Implements the full host ↔ control plane connection flow:
- Host CRUD endpoints (POST/GET/DELETE /v1/hosts) with role-based access: regular hosts admin-only, BYOC hosts for admins and team owners
- One-time registration token flow: admin creates host → gets token (1hr TTL in Redis + Postgres audit trail) → host agent registers with specs → gets long-lived JWT (1yr)
- Host agent registration client with automatic spec detection (arch, CPU, memory, disk) and token persistence to disk
- Periodic heartbeat (30s) via POST /v1/hosts/{id}/heartbeat with X-Host-Token auth and host ID cross-check
- Token regeneration endpoint (POST /v1/hosts/{id}/token) for retry after failed registration
- Tag management (add/remove/list) with team-scoped access control
- Host JWT with typ:"host" claim, cross-use prevention in both VerifyJWT and VerifyHostJWT
- requireHostToken middleware for host agent authentication
- DB-level race protection: RegisterHost uses AND status='pending' with rows-affected check; Redis GetDel for atomic token consume
- Migration for future mTLS support (cert_fingerprint, mtls_enabled columns)
- Host agent flags: --register (one-time token), --address (required ip:port)
- serviceErrToHTTP extended with "forbidden" → 403 mapping
- OpenAPI spec, .env.example, and README updated
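The one-time registration token in the host flow above relies on an atomic read-and-delete, so that two concurrent registrations cannot both succeed with the same token. A minimal sketch of that idea follows, using an in-memory map guarded by a mutex as a stand-in for Redis GetDel. The `TokenStore` type, its methods, and `ErrTokenInvalid` are hypothetical names for illustration and are not taken from this PR.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrTokenInvalid is returned when a token is unknown, expired, or already used.
var ErrTokenInvalid = errors.New("registration token invalid or already consumed")

type tokenEntry struct {
	hostID    string
	expiresAt time.Time
}

// TokenStore is a hypothetical in-memory stand-in for the Redis-backed store.
type TokenStore struct {
	mu     sync.Mutex
	tokens map[string]tokenEntry
}

func NewTokenStore() *TokenStore {
	return &TokenStore{tokens: make(map[string]tokenEntry)}
}

// Put stores a token for hostID that expires after ttl.
func (s *TokenStore) Put(token, hostID string, ttl time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.tokens[token] = tokenEntry{hostID: hostID, expiresAt: time.Now().Add(ttl)}
}

// Consume looks up and deletes the token inside one critical section,
// mirroring GETDEL semantics: at most one caller can ever get the host ID.
func (s *TokenStore) Consume(token string) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.tokens[token]
	if !ok {
		return "", ErrTokenInvalid
	}
	delete(s.tokens, token) // remove even if expired, so it cannot be retried
	if time.Now().After(e.expiresAt) {
		return "", ErrTokenInvalid
	}
	return e.hostID, nil
}

func main() {
	store := NewTokenStore()
	store.Put("tok-123", "host-a", time.Hour)

	id, err := store.Consume("tok-123")
	fmt.Println(id, err) // first use succeeds

	_, err = store.Consume("tok-123")
	fmt.Println(err) // second use is rejected
}
```

In a real deployment the TTL and deletion would be handled by Redis itself; the mutex here only models the atomicity guarantee.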
- Snapshot delete: make agent RPC failure a hard error so DB record is not removed when files cannot be deleted from disk
- Snapshot overwrite: call agent to delete old files before removing the DB record, preventing stale memfile.{uuid} generations from accumulating on disk across repeated overwrites
- Sandbox destroy: only swallow CodeNotFound from the agent (sandbox already gone / TTL-reaped); any other error now propagates to the caller instead of being silently ignored

- Frontend: BYOC hosts page (/dashboard/byoc) with register/delete flows, shimmer loading, pulsing online status, animated token reveal checkmark
- Frontend: Admin section (/admin/hosts) with platform + BYOC tabs, stat pills, skeleton loading, slide-in animations for new rows
- Frontend: AdminSidebar component with accent top bar and admin pill badge
- Frontend: BYOC nav item shown only when team.is_byoc is true (derived from teams store, not JWT); disabled for members
- Frontend: Admin shield button in Sidebar, visible only to platform admins
- Backend: is_admin in JWT claims + requireAdmin middleware (DB-validated)
- Backend: is_byoc added to teamResponse so frontend derives visibility from fresh team data rather than stale JWT fields
- Backend: SetBYOC admin endpoint (PUT /v1/admin/teams/{id}/byoc)
- Backend: Admin hosts list enriches BYOC entries with team_name
- Host agent: load .env file via godotenv on startup

SampleSandboxMetrics previously filtered WHERE status IN ('running', 'starting', 'paused'), which returned no rows when all capsules were stopped. This caused zero-value snapshots to be skipped, leaving the time-series charts with no trailing data points instead of showing the expected zero values. Remove the WHERE filter so the query groups by all teams that have any sandbox row. The per-status FILTER clauses on the aggregates already produce correct zero counts for stopped capsules.
Also includes the per-VM RAM ceiling formula change (sum(ceil(each/2)) instead of ceil(sum/2)).

Samples /proc/{fc_pid}/stat (CPU%), /proc/{fc_pid}/status (VmRSS), and stat() on CoW files at 500ms intervals per running sandbox. Three tiered ring buffers downsample into 30s and 5min averages for 10min/2h/24h retention. Metrics are flushed to DB on pause (all tiers) and destroy (24h only). New GetSandboxMetrics and FlushSandboxMetrics RPCs on the host agent, proxied through GET /v1/sandboxes/{id}/metrics?range= on the control plane. Returns live data for running sandboxes, DB data for paused, and 404 for stopped.

Add /proxy/{sandbox_id}/{port}/* handler that reverse-proxies HTTP requests to services running inside sandbox VMs. The sandbox's host IP (10.11.0.{idx}) is used as the upstream target. Includes port validation (1-65535) and shared HTTP transport for connection pooling. Supports WebSocket upgrades for protocols like Jupyter's streaming API. This is an intermediate state; it needs further work for the full code interpreter feature.

Add SandboxProxyWrapper that intercepts requests with Host headers matching {port}-{sandbox_id}.{domain} and proxies them through the owning host agent's /proxy endpoint. Authentication is via X-API-Key only (no JWT). The API key's team must own the sandbox. Export EnsureScheme from lifecycle package for reuse. Request flow: SDK -> Caddy -> CP catch-all -> Host Agent -> sandbox VM. This is an intermediate state; it needs further work for the full code interpreter feature.

Introduces an end-to-end template building pipeline: admins submit a recipe (list of shell commands) via the dashboard, a Redis-backed worker pool spins up a sandbox, executes each command, and produces either a full snapshot (with healthcheck) or an image-only template (rootfs flattened via a new FlattenRootfs host-agent RPC). Build progress and per-step logs are persisted to a new template_builds table and polled by the frontend.
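The build pipeline described above (a queue feeding a fixed-size worker pool that runs recipe steps in order and records per-step results) can be sketched roughly as follows. This is an illustration only: a Go channel stands in for the Redis queue, `RunStep` stands in for executing a command inside the build sandbox, and `Build`, `StepLog`, `RunBuilds`, and `ErrStepFailed` are hypothetical names, not the PR's actual types.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrStepFailed is a hypothetical error a step runner may return.
var ErrStepFailed = errors.New("step failed")

// Build is a hypothetical job: an ID plus an ordered list of recipe steps.
type Build struct {
	ID    string
	Steps []string
}

// StepLog records the outcome of one recipe step.
type StepLog struct {
	Step string
	OK   bool
}

// RunStep stands in for executing one command inside the build sandbox.
type RunStep func(step string) error

// RunBuilds drains queue with the given number of workers. Each build stops
// at its first failing step. It returns the per-step logs keyed by build ID.
func RunBuilds(queue <-chan Build, workers int, run RunStep) map[string][]StepLog {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		logs = make(map[string][]StepLog)
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range queue { // blocks until a job arrives or the queue closes
				var steps []StepLog
				for _, s := range b.Steps {
					err := run(s)
					steps = append(steps, StepLog{Step: s, OK: err == nil})
					if err != nil {
						break // later steps are never attempted
					}
				}
				mu.Lock()
				logs[b.ID] = steps
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return logs
}

func main() {
	queue := make(chan Build, 2)
	queue <- Build{ID: "b1", Steps: []string{"apt-get install -y curl", "false", "echo unreachable"}}
	queue <- Build{ID: "b2", Steps: []string{"echo ok"}}
	close(queue)

	run := func(step string) error {
		if step == "false" {
			return ErrStepFailed
		}
		return nil
	}
	logs := RunBuilds(queue, 2, run)
	fmt.Println(len(logs["b1"]), len(logs["b2"])) // prints: 2 1
}
```

A long-running service would keep the workers alive and persist each `StepLog` as it is produced rather than collecting them in memory.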
Backend:
- New FlattenRootfs RPC (proto + host agent + sandbox manager)
- BuildService with Redis queue (BLPOP) and configurable worker pool (default 2)
- Admin-only REST endpoints: POST/GET /v1/admin/builds, GET /v1/admin/builds/{id}
- Migration for template_builds table with JSONB logs and recipe columns
- sqlc queries for build CRUD and progress updates

Frontend:
- /admin/templates page with Templates + Builds tabs
- Create Template dialog with recipe textarea, healthcheck, specs
- Build history with expandable per-step logs, status badges, progress bars
- Auto-polling every 3s for active builds
- AdminSidebar updated with Templates nav item

Consolidate 16 migrations into one with UUID columns for all entity IDs. TEXT is kept only for polymorphic fields (audit_logs.actor_id, resource_id) and template names. The id package now generates UUIDs via google/uuid, with Format*/Parse* helpers for the prefixed wire format (sb-{uuid}, usr-{uuid}, etc.). Auth context, services, and handlers pass pgtype.UUID internally; conversion to/from prefixed strings happens at API and RPC boundaries.
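To illustrate the prefixed wire format described above, here is a rough sketch of a format/parse pair that converts between a raw 16-byte UUID and strings such as sb-{uuid}. It uses only the standard library as a stand-in for google/uuid and pgtype.UUID, and the `UUID` type, `Format`, `Parse`, and `ErrBadID` are hypothetical names rather than the id package's real API.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// ErrBadID is returned for a wrong prefix or a malformed UUID.
var ErrBadID = errors.New("malformed prefixed ID")

// UUID is a raw 16-byte identifier, as it might be passed around internally.
type UUID [16]byte

// NewUUID returns a random version 4 UUID.
func NewUUID() (UUID, error) {
	var u UUID
	if _, err := rand.Read(u[:]); err != nil {
		return u, err
	}
	u[6] = (u[6] & 0x0f) | 0x40 // version 4
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant
	return u, nil
}

// String renders the canonical 8-4-4-4-12 lowercase hex form.
func (u UUID) String() string {
	h := hex.EncodeToString(u[:])
	return h[0:8] + "-" + h[8:12] + "-" + h[12:16] + "-" + h[16:20] + "-" + h[20:32]
}

// Format builds the wire form, e.g. Format("sb", u) -> "sb-{uuid}".
func Format(prefix string, u UUID) string {
	return prefix + "-" + u.String()
}

// Parse checks the expected prefix and decodes the UUID that follows it.
func Parse(prefix, s string) (UUID, error) {
	var u UUID
	if !strings.HasPrefix(s, prefix+"-") {
		return u, ErrBadID
	}
	rest := s[len(prefix)+1:]
	if len(rest) != 36 || rest[8] != '-' || rest[13] != '-' || rest[18] != '-' || rest[23] != '-' {
		return u, ErrBadID
	}
	raw, err := hex.DecodeString(strings.ReplaceAll(rest, "-", ""))
	if err != nil || len(raw) != 16 {
		return u, ErrBadID
	}
	copy(u[:], raw)
	return u, nil
}

func main() {
	u, err := NewUUID()
	if err != nil {
		panic(err)
	}
	wire := Format("sb", u)
	back, err := Parse("sb", wire)
	fmt.Println(wire, back == u, err)
}
```

Keeping the prefix check inside the parser means a user ID passed where a sandbox ID is expected is rejected at the boundary instead of reaching a query.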
Adds PlatformTeamID (all-zeros UUID) for shared resources.

- DELETE /v1/admin/templates/{name} endpoint (admin-only)
- Broadcasts DeleteSnapshot RPC to all online hosts before removing DB record
- Frontend admin templates page uses deleteAdminTemplate() instead of team-scoped deleteSnapshot()
- Delete button shown for all template types, not just snapshots

Snapshot race fix:
- Pre-mark sandbox as "paused" in DB before issuing CreateSnapshot and PauseSandbox RPCs, preventing the reconciler from marking it "stopped" during the flatten window when the sandbox is gone from the host agent's in-memory map but DB still says "running"
- Revert status to "running" on RPC failure
- Check ctx.Err() before writing response to avoid writing to dead connections when client disconnects during long snapshot operations

Delete auth fix:
- Block non-admin deletion of platform templates (team_id = all-zeros) at DELETE /v1/snapshots/{name} with 403, preventing file deletion before the team ownership check fails

Sparse dd:
- Add conv=sparse to dd in FlattenSnapshot so flattened images preserve sparseness (~200MB actual vs 5GB logical)

Default disk size:
- Change default disk_size_mb from 20GB to 5GB across migration, manager, service, build, and EnsureImageSizes
- Disable split-button dropdown arrow for platform templates in dashboard snapshots page (teams cannot delete platform templates)

Introduces internal/layout package for centralized path construction, migrates templates from name-based TEXT primary keys to UUID PKs with team-scoped directories (WRENN_DIR/images/teams/{team_id}/{template_id}). The built-in minimal template uses sentinel zero UUIDs. Proto messages carry team_id + template_id alongside deprecated template name field. Team deletion now cleans up template files across all hosts.

Rename ns-{idx} to wrenn-ns-{idx} and veth-{idx} to wrenn-veth-{idx} to avoid collisions with other tools.
Add CleanupStaleNamespaces() at agent startup to remove orphaned namespaces, veths, iptables rules, and routes from a previous crash. Lower maxDiffGenerations from 10 to 8 to prevent Go runtime memory corruption from snapshot/restore drift.

- skip_pre_post flag on builds bypasses apt update/clean pre/post steps for faster iteration when the recipe handles its own environment setup
- POST /v1/admin/builds/{id}/cancel endpoint marks an in-progress build as cancelled; UpdateBuildStatus now also sets completed_at for 'cancelled'
- internal/recipe: typed recipe parser and executor (RUN/ENV/COPY steps) replacing the raw string slice approach in the build worker
- pre/post build commands prefixed with RUN to match recipe step format

The envd port scanner used gopsutil's net.Connections() which walks /proc/{pid}/fd to enumerate socket inodes. This corrupts Go runtime semaphore state when the VM is paused mid-operation and restored from a Firecracker snapshot. Replace with a direct /proc/net/tcp + /proc/net/tcp6 parser that reads a single file per address family: no /proc/{pid}/fd walk, no goroutines, no WaitGroups. Also replace concurrent-map (smap) in the scanner with a plain sync.RWMutex-protected map, since concurrent-map's Items() spawns goroutines with a WaitGroup internally, which is equally unsafe across snapshot boundaries. Use socket inode instead of PID for the port forwarding map key, since inode is available directly from /proc/net/tcp without the fd walk.

expandEnv to use regex. 9852f96127

- Fix envRegex: remove spurious (\$)? group that swallowed $$$, handle ${}
- wrenn-init.sh: add || true to networking commands under set -e, remove dead code
- waitForHealthcheck: use context deadline for unlimited retries instead of implicit 100 cap
- Make parseSandboxEnv a package-level function (unused receiver)
- Fix WrappedCommand test: map iteration order dependency, pre-expand env values
- Fix error wrapping: %v → %w per project conventions
- test-jupyter-kernel.py: move import to top-level, fix misleading comment

Implement a channels system for notifying teams via external providers (Discord, Slack, Teams, Google Chat, Telegram, Matrix, webhook) when lifecycle events occur (capsule/template/host state changes).

- Channel CRUD API under /v1/channels (JWT-only auth)
- Test endpoint to verify config before saving (POST /v1/channels/test)
- Secret rotation endpoint (PUT /v1/channels/{id}/config)
- AES-256-GCM encryption for provider secrets (WRENN_ENCRYPTION_KEY)
- Redis stream event publishing from audit logger
- Background dispatcher with consumer group and retry (10s, 30s)
- Webhook delivery with HMAC-SHA256 signing (X-WRENN-SIGNATURE)
- shoutrrr integration for chat providers
- Secrets never exposed in API responses

WIP: v0.1.0 to v0.0.1
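For the webhook delivery item in the channels change above, a receiver would typically verify the X-WRENN-SIGNATURE header by recomputing the HMAC over the raw request body. The sketch below shows one way to do that. The hex encoding of the signature, the `Sign` and `Verify` helper names, and the example event payload are assumptions for illustration; the PR text only states that HMAC-SHA256 signing is used.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Sign returns the hex-encoded HMAC-SHA256 of body under secret.
// Hex encoding is an assumption; the actual header format may differ.
func Sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// Verify recomputes the MAC and compares it in constant time.
func Verify(secret, body []byte, signature string) bool {
	want, err := hex.DecodeString(signature)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(mac.Sum(nil), want)
}

func main() {
	secret := []byte("example-secret")
	body := []byte(`{"event":"capsule.paused"}`) // hypothetical payload

	sig := Sign(secret, body)
	fmt.Println("X-WRENN-SIGNATURE:", sig)
	fmt.Println(Verify(secret, body, sig))                            // true
	fmt.Println(Verify(secret, []byte(`{"event":"tampered"}`), sig)) // false
}
```

Using `hmac.Equal` rather than a plain string comparison avoids leaking how many leading bytes of a forged signature were correct.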