v0.1.6 #45
What's New?
Performance updates for large capsules, admin panel enhancement and bug fixes
Envd
Admin Panel
Others
- Copy envd source from e2b-dev/infra, internalize shared dependencies into envd/internal/shared/ (keys, filesystem, id, smap, utils)
- Switch from gRPC to Connect RPC for all envd services
- Update module paths to git.omukk.dev/wrenn/{sandbox,sandbox/envd}
- Add proto specs (process, filesystem) with buf-based code generation
- Implement full envd: process exec, filesystem ops, port forwarding, cgroup management, MMDS integration, and HTTP API
- Update main module dependencies (firecracker SDK, pgx, goose, etc.)
- Remove placeholder .gitkeep files replaced by real implementations

Use Firecracker's Diff snapshot type when re-pausing a previously resumed sandbox, capturing only dirty pages instead of a full memory dump. Chains up to 10 incremental generations before collapsing back to a Full snapshot. Multi-generation diff files (memfile.{buildID}) are supported alongside the legacy single-file format in resume, template creation, and snapshot existence checks.

Replace the existing auto-destroy TTL behavior with auto-pause: when a sandbox exceeds its timeout_sec of inactivity, the TTL reaper now pauses it (snapshot + teardown) instead of destroying it, preserving the ability to resume later.

Key changes:
- TTL reaper calls Pause instead of Destroy, with fallback to Destroy if pause fails (e.g. Firecracker process already gone)
- New PingSandbox RPC resets the in-memory LastActiveAt timer
- New POST /v1/sandboxes/{id}/ping REST endpoint resets both agent memory and DB last_active_at
- ListSandboxes RPC now includes auto_paused_sandbox_ids so the reconciler can distinguish auto-paused sandboxes from crashed ones in a single call
- Reconciler polls every 5s (was 30s) and marks auto-paused as "paused" vs orphaned as "stopped"
- Resume RPC accepts timeout_sec from DB so TTL survives pause/resume cycles
- Reaper checks every 2s (was 10s) and uses a detached context to avoid incomplete pauses on app shutdown
- Default timeout_sec changed from 300 to 0 (no auto-pause unless requested)

Implements the full host ↔ control plane connection flow:
- Host CRUD endpoints (POST/GET/DELETE /v1/hosts) with role-based access: regular hosts admin-only, BYOC hosts for admins and team owners
- One-time registration token flow: admin creates host → gets token (1hr TTL in Redis + Postgres audit trail) → host agent registers with specs → gets long-lived JWT (1yr)
- Host agent registration client with automatic spec detection (arch, CPU, memory, disk) and token persistence to disk
- Periodic heartbeat (30s) via POST /v1/hosts/{id}/heartbeat with X-Host-Token auth and host ID cross-check
- Token regeneration endpoint (POST /v1/hosts/{id}/token) for retry after failed registration
- Tag management (add/remove/list) with team-scoped access control
- Host JWT with typ:"host" claim, cross-use prevention in both VerifyJWT and VerifyHostJWT
- requireHostToken middleware for host agent authentication
- DB-level race protection: RegisterHost uses AND status='pending' with rows-affected check; Redis GetDel for atomic token consume
- Migration for future mTLS support (cert_fingerprint, mtls_enabled columns)
- Host agent flags: --register (one-time token), --address (required ip:port)
- serviceErrToHTTP extended with "forbidden" → 403 mapping
- OpenAPI spec, .env.example, and README updated
- Snapshot delete: make agent RPC failure a hard error so DB record is not removed when files cannot be deleted from disk
- Snapshot overwrite: call agent to delete old files before removing the DB record, preventing stale memfile.{uuid} generations from accumulating on disk across repeated overwrites
- Sandbox destroy: only swallow CodeNotFound from the agent (sandbox already gone / TTL-reaped); any other error now propagates to the caller instead of being silently ignored

- Frontend: BYOC hosts page (/dashboard/byoc) with register/delete flows, shimmer loading, pulsing online status, animated token reveal checkmark
- Frontend: Admin section (/admin/hosts) with platform + BYOC tabs, stat pills, skeleton loading, slide-in animations for new rows
- Frontend: AdminSidebar component with accent top bar and admin pill badge
- Frontend: BYOC nav item shown only when team.is_byoc is true (derived from teams store, not JWT); disabled for members
- Frontend: Admin shield button in Sidebar, visible only to platform admins
- Backend: is_admin in JWT claims + requireAdmin middleware (DB-validated)
- Backend: is_byoc added to teamResponse so frontend derives visibility from fresh team data rather than stale JWT fields
- Backend: SetBYOC admin endpoint (PUT /v1/admin/teams/{id}/byoc)
- Backend: Admin hosts list enriches BYOC entries with team_name
- Host agent: load .env file via godotenv on startup

SampleSandboxMetrics previously filtered WHERE status IN ('running', 'starting', 'paused'), which returned no rows when all capsules were stopped. As a result, zero-value snapshots were skipped, leaving the time-series charts with no trailing data points instead of showing the expected zero values. Remove the WHERE filter so the query groups by all teams that have any sandbox row. The per-status FILTER clauses on the aggregates already produce correct zero counts for stopped capsules.
Also includes the per-VM RAM ceiling formula change (sum(ceil(each/2)) instead of ceil(sum/2)).

Samples /proc/{fc_pid}/stat (CPU%), /proc/{fc_pid}/status (VmRSS), and stat() on CoW files at 500ms intervals per running sandbox. Three tiered ring buffers downsample into 30s and 5min averages for 10min/2h/24h retention. Metrics are flushed to DB on pause (all tiers) and destroy (24h only). New GetSandboxMetrics and FlushSandboxMetrics RPCs on the host agent, proxied through GET /v1/sandboxes/{id}/metrics?range= on the control plane. Returns live data for running sandboxes, DB data for paused, and 404 for stopped.

Add /proxy/{sandbox_id}/{port}/* handler that reverse-proxies HTTP requests to services running inside sandbox VMs. The sandbox's host IP (10.11.0.{idx}) is used as the upstream target. Includes port validation (1-65535) and shared HTTP transport for connection pooling. Supports WebSocket upgrades for protocols like Jupyter's streaming API. This is an intermediate state that needs further work for the full code interpreter feature.

Add SandboxProxyWrapper that intercepts requests with Host headers matching {port}-{sandbox_id}.{domain} and proxies them through the owning host agent's /proxy endpoint. Authentication is via X-API-Key only (no JWT). The API key's team must own the sandbox. Export EnsureScheme from lifecycle package for reuse. Request flow: SDK -> Caddy -> CP catch-all -> Host Agent -> sandbox VM. This is an intermediate state that needs further work for the full code interpreter feature.

Introduces an end-to-end template building pipeline: admins submit a recipe (list of shell commands) via the dashboard, a Redis-backed worker pool spins up a sandbox, executes each command, and produces either a full snapshot (with healthcheck) or an image-only template (rootfs flattened via a new FlattenRootfs host-agent RPC). Build progress and per-step logs are persisted to a new template_builds table and polled by the frontend.
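
The Host-header routing used by the sandbox proxy ({port}-{sandbox_id}.{domain}, with ports limited to 1-65535) could be parsed roughly as follows. This is a sketch under assumptions, not the SandboxProxyWrapper code: `parseSandboxHost` and the example domain are hypothetical, and the split on the first hyphen assumes the port comes first and the sandbox ID may itself contain hyphens.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// parseSandboxHost splits a Host header of the form
// {port}-{sandbox_id}.{domain} into the target port and sandbox ID.
func parseSandboxHost(host, domain string) (int, string, error) {
	// Drop an optional ":port" suffix carried by the Host header itself.
	if i := strings.LastIndex(host, ":"); i != -1 {
		host = host[:i]
	}
	suffix := "." + domain
	if !strings.HasSuffix(host, suffix) {
		return 0, "", errors.New("host is not under the sandbox domain")
	}
	label := strings.TrimSuffix(host, suffix)

	// Split on the first hyphen only: everything after it is the sandbox ID.
	portStr, sandboxID, found := strings.Cut(label, "-")
	if !found || sandboxID == "" {
		return 0, "", errors.New("expected a {port}-{sandbox_id} label")
	}

	// A 16-bit unsigned parse rejects signs and values above 65535;
	// port 0 is rejected explicitly.
	port, err := strconv.ParseUint(portStr, 10, 16)
	if err != nil || port == 0 {
		return 0, "", fmt.Errorf("invalid port %q", portStr)
	}
	return int(port), sandboxID, nil
}

func main() {
	port, id, err := parseSandboxHost("8888-sb-1234.sandboxes.example.test", "sandboxes.example.test")
	fmt.Println(port, id, err)

	_, _, err = parseSandboxHost("70000-sb-1234.sandboxes.example.test", "sandboxes.example.test")
	fmt.Println(err)
}
```

Requests that fail to parse would simply fall through to the normal control plane routes, which is consistent with the catch-all placement in the request flow described above.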
Backend:
- New FlattenRootfs RPC (proto + host agent + sandbox manager)
- BuildService with Redis queue (BLPOP) and configurable worker pool (default 2)
- Admin-only REST endpoints: POST/GET /v1/admin/builds, GET /v1/admin/builds/{id}
- Migration for template_builds table with JSONB logs and recipe columns
- sqlc queries for build CRUD and progress updates

Frontend:
- /admin/templates page with Templates + Builds tabs
- Create Template dialog with recipe textarea, healthcheck, specs
- Build history with expandable per-step logs, status badges, progress bars
- Auto-polling every 3s for active builds
- AdminSidebar updated with Templates nav item

Consolidate 16 migrations into one with UUID columns for all entity IDs. TEXT is kept only for polymorphic fields (audit_logs.actor_id, resource_id) and template names. The id package now generates UUIDs via google/uuid, with Format*/Parse* helpers for the prefixed wire format (sb-{uuid}, usr-{uuid}, etc.). Auth context, services, and handlers pass pgtype.UUID internally; conversion to/from prefixed strings happens at API and RPC boundaries.
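
The prefixed wire format (sb-{uuid}, usr-{uuid}) can be illustrated with a stdlib-only sketch. The helper names `formatSandboxID` and `parseSandboxID` are hypothetical stand-ins for the Format*/Parse* helpers, and a plain [16]byte replaces the google/uuid and pgtype.UUID types so the example has no external dependencies.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

const sandboxPrefix = "sb-"

var errBadID = errors.New("malformed prefixed ID")

// formatUUID renders 16 raw bytes in the canonical 8-4-4-4-12 hex layout.
func formatUUID(u [16]byte) string {
	h := hex.EncodeToString(u[:])
	return h[0:8] + "-" + h[8:12] + "-" + h[12:16] + "-" + h[16:20] + "-" + h[20:32]
}

// parseUUID accepts only the canonical 36-character layout.
func parseUUID(s string) ([16]byte, error) {
	var u [16]byte
	if len(s) != 36 || s[8] != '-' || s[13] != '-' || s[18] != '-' || s[23] != '-' {
		return u, errBadID
	}
	raw, err := hex.DecodeString(strings.ReplaceAll(s, "-", ""))
	if err != nil || len(raw) != 16 {
		return u, errBadID
	}
	copy(u[:], raw)
	return u, nil
}

// formatSandboxID converts an internal UUID to its wire form at the API boundary.
func formatSandboxID(u [16]byte) string {
	return sandboxPrefix + formatUUID(u)
}

// parseSandboxID converts a wire-format ID back to the internal UUID,
// rejecting IDs that carry a different entity prefix.
func parseSandboxID(s string) ([16]byte, error) {
	if !strings.HasPrefix(s, sandboxPrefix) {
		return [16]byte{}, errBadID
	}
	return parseUUID(strings.TrimPrefix(s, sandboxPrefix))
}

func main() {
	id := [16]byte{0xde, 0xad, 0xbe, 0xef}
	wire := formatSandboxID(id)
	fmt.Println(wire)

	back, err := parseSandboxID(wire)
	fmt.Println(back == id, err)

	_, err = parseSandboxID("usr-" + formatUUID(id))
	fmt.Println(err)
}
```

Keeping the prefix out of the database means an all-zeros UUID stays a plain sentinel internally and only gains a prefix when it crosses an API or RPC boundary.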
Adds PlatformTeamID (all-zeros UUID) for shared resources.

- DELETE /v1/admin/templates/{name} endpoint (admin-only)
- Broadcasts DeleteSnapshot RPC to all online hosts before removing DB record
- Frontend admin templates page uses deleteAdminTemplate() instead of team-scoped deleteSnapshot()
- Delete button shown for all template types, not just snapshots

Snapshot race fix:
- Pre-mark sandbox as "paused" in DB before issuing CreateSnapshot and PauseSandbox RPCs, preventing the reconciler from marking it "stopped" during the flatten window when the sandbox is gone from the host agent's in-memory map but DB still says "running"
- Revert status to "running" on RPC failure
- Check ctx.Err() before writing response to avoid writing to dead connections when client disconnects during long snapshot operations

Delete auth fix:
- Block non-admin deletion of platform templates (team_id = all-zeros) at DELETE /v1/snapshots/{name} with 403, preventing file deletion before the team ownership check fails

Sparse dd:
- Add conv=sparse to dd in FlattenSnapshot so flattened images preserve sparseness (~200MB actual vs 5GB logical)

Default disk size:
- Change default disk_size_mb from 20GB to 5GB across migration, manager, service, build, and EnsureImageSizes
- Disable split-button dropdown arrow for platform templates in dashboard snapshots page (teams cannot delete platform templates)

Introduces internal/layout package for centralized path construction, migrates templates from name-based TEXT primary keys to UUID PKs with team-scoped directories (WRENN_DIR/images/teams/{team_id}/{template_id}). The built-in minimal template uses sentinel zero UUIDs. Proto messages carry team_id + template_id alongside deprecated template name field. Team deletion now cleans up template files across all hosts.

Rename ns-{idx} to wrenn-ns-{idx} and veth-{idx} to wrenn-veth-{idx} to avoid collisions with other tools.
Add CleanupStaleNamespaces() at agent startup to remove orphaned namespaces, veths, iptables rules, and routes from a previous crash. Lower maxDiffGenerations from 10 to 8 to prevent Go runtime memory corruption from snapshot/restore drift.

- skip_pre_post flag on builds bypasses apt update/clean pre/post steps for faster iteration when the recipe handles its own environment setup
- POST /v1/admin/builds/{id}/cancel endpoint marks an in-progress build as cancelled; UpdateBuildStatus now also sets completed_at for 'cancelled'
- internal/recipe: typed recipe parser and executor (RUN/ENV/COPY steps) replacing the raw string slice approach in the build worker
- pre/post build commands prefixed with RUN to match recipe step format

The envd port scanner used gopsutil's net.Connections() which walks /proc/{pid}/fd to enumerate socket inodes. This corrupts Go runtime semaphore state when the VM is paused mid-operation and restored from a Firecracker snapshot. Replace with a direct /proc/net/tcp + /proc/net/tcp6 parser that reads a single file per address family: no /proc/{pid}/fd walk, no goroutines, no WaitGroups. Also replace concurrent-map (smap) in the scanner with a plain sync.RWMutex-protected map, since concurrent-map's Items() spawns goroutines with a WaitGroup internally, which is equally unsafe across snapshot boundaries. Use socket inode instead of PID for the port forwarding map key, since inode is available directly from /proc/net/tcp without the fd walk.

expandEnv to use regex. 9852f96127

- Fix envRegex: remove spurious (\$)? group that swallowed $$$, handle ${}
- wrenn-init.sh: add || true to networking commands under set -e, remove dead code
- waitForHealthcheck: use context deadline for unlimited retries instead of implicit 100 cap
- Make parseSandboxEnv a package-level function (unused receiver)
- Fix WrappedCommand test: map iteration order dependency, pre-expand env values
- Fix error wrapping: %v → %w per project conventions
- test-jupyter-kernel.py: move import to top-level, fix misleading comment

Implement a channels system for notifying teams via external providers (Discord, Slack, Teams, Google Chat, Telegram, Matrix, webhook) when lifecycle events occur (capsule/template/host state changes).
- Channel CRUD API under /v1/channels (JWT-only auth)
- Test endpoint to verify config before saving (POST /v1/channels/test)
- Secret rotation endpoint (PUT /v1/channels/{id}/config)
- AES-256-GCM encryption for provider secrets (WRENN_ENCRYPTION_KEY)
- Redis stream event publishing from audit logger
- Background dispatcher with consumer group and retry (10s, 30s)
- Webhook delivery with HMAC-SHA256 signing (X-WRENN-SIGNATURE)
- shoutrrr integration for chat providers
- Secrets never exposed in API responses

Plumb ListDir, MakeDir, and RemovePath through all layers: REST API → host agent RPC → envdclient → envd. These endpoints enable a web file browser for sandbox filesystem interaction.

New endpoints (all under requireAPIKeyOrJWT):
- POST /v1/sandboxes/{id}/files/list
- POST /v1/sandboxes/{id}/files/mkdir
- POST /v1/sandboxes/{id}/files/remove

Wire envd's existing PTY process capabilities through the full stack: hostagent proto (4 new RPCs: PtyAttach, PtySendInput, PtyResize, PtyKill), envdclient, sandbox manager, and a new WebSocket endpoint at GET /v1/sandboxes/{id}/pty with bidirectional JSON message protocol.
Sessions use tag-based identity for disconnect/reconnect support, base64-encoded PTY data for binary safety, and a 120s inactivity timeout.

Sandbox URLs ({port}-{sandbox_id}.{domain}) are now accessible without authentication. The sandbox ID in the hostname is sufficient for routing.

Start long-running processes (web servers, daemons) without blocking the HTTP request. Leverages envd's existing background process support (context.Background(), List, Connect, SendSignal RPCs) and wires it through the host agent and control plane layers.

New API surface:
- POST /v1/capsules/{id}/exec with background:true → 202 {pid, tag}
- GET /v1/capsules/{id}/processes → list running processes
- DELETE /v1/capsules/{id}/processes/{selector} → kill by PID or tag
- WS /v1/capsules/{id}/processes/{selector}/stream → reconnect to output

The {selector} param auto-detects: numeric = PID, string = tag. Tags are auto-generated as "proc-" + 8 hex chars if not provided.

Adds self-service endpoints: GET/PATCH/DELETE /v1/me, POST /v1/me/password, POST /v1/me/password/reset{/confirm}, GET/DELETE /v1/me/providers/{provider}. Includes OAuth account-linking flow via cookie, hard-delete cleanup goroutine (24h ticker, 15-day grace period), and OpenAPI spec for all new routes.

Running port-binding applications (Jupyter, http.server, NextJS) inside sandboxes caused severe PTY sluggishness and proxy navigation errors. Root cause: the CP sandbox proxy and Connect RPC pool shared a single HTTP transport. Heavy proxy traffic (Jupyter WebSocket, REST polling) interfered with PTY RPC streams via HTTP/2 flow control contention.
Transport isolation (main fix):
- Add dedicated proxy transport on CP (NewProxyTransport) with HTTP/2 disabled, separate from the RPC pool transport
- Add dedicated proxy transport on host agent, replacing http.DefaultTransport
- Add dedicated envdclient transport with tuned connection pooling
- Replace http.DefaultClient in file streaming RPCs with per-sandbox envd client

Proxy path rewriting (navigation fix):
- Add ModifyResponse to rewrite Location headers with /proxy/{id}/{port} prefix, handling both root-relative and absolute-URL redirects
- Strip prefix back out in CP subdomain proxy for correct browser behavior
- Replace path.Join with string concat in CP Director to preserve trailing slashes (prevents redirect loops on directory listings)

Proxy resilience:
- Add dial retry with linear backoff (3 attempts) to handle socat startup delay when ports are first detected
- Cache ReverseProxy instances per sandbox+port+host in sync.Map
- Add EvictProxy callback wired into sandbox Manager.Destroy

Buffer and server hardening:
- Increase PTY and exec stream channel buffers from 16 to 256
- Add ReadHeaderTimeout (10s) and IdleTimeout (620s) to host agent HTTP server

Network tuning:
- Set TAP device TxQueueLen to 5000 (up from default 1000)
- Add Firecracker tx_rate_limiter (200 MB/s sustained, 100 MB burst) to prevent guest traffic from saturating the TAP

Three bugs fixed:

1. PTY connections failed because home directory was hardcoded as /home/{username} instead of reading from /etc/passwd. For root, this produced /home/root/, which doesn't exist, so CWD validation rejected every PTY Start request without explicit cwd. Fixed all 6 locations to use user.dir from nix::unistd::User.

2. MMDS polling silently failed to parse metadata because the logs_collector_address field lacked #[serde(default)]. The host agent only sends instanceID + envID; the missing "address" field caused every deserialize attempt to fail, so .WRENN_SANDBOX_ID and .WRENN_TEMPLATE_ID were never written.
Also added error logging and create_dir_all before file writes.

3. Metrics CPU values were non-deterministic because a fresh sysinfo::System was created per request with a 100ms sleep between reads. Replaced with a background thread that samples CPU at fixed 1-second intervals via a persistent System instance, matching gopsutil's internal caching behavior. Metrics endpoint now reads cached atomic values: no blocking, consistent window.

Also: close master PTY fd in child pre_exec, add process.Start request logging, bump version to 0.2.0.

Add PUT /v1/admin/users/{id}/admin endpoint and frontend UI for granting and revoking platform admin status. Uses atomic conditional SQL (RevokeUserAdmin) to prevent race conditions that could remove the last admin. Includes idempotency check, audit logging, and confirmation dialog with self-demotion warning.
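
The last-admin guard can be illustrated with an in-memory analog. This is not the RevokeUserAdmin query: `adminStore` and `revokeAdmin` are hypothetical, and a mutex stands in for the atomicity that a single conditional SQL statement provides. The point it shows is that the "is another admin left?" check and the write must happen in one indivisible step; doing them as two separate operations would let two concurrent revocations each see the other admin and remove both.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errLastAdmin = errors.New("cannot revoke the last platform admin")

// adminStore is a hypothetical in-memory set of platform admin user IDs.
type adminStore struct {
	mu     sync.Mutex
	admins map[string]struct{}
}

func newAdminStore(ids ...string) *adminStore {
	s := &adminStore{admins: make(map[string]struct{})}
	for _, id := range ids {
		s.admins[id] = struct{}{}
	}
	return s
}

// revokeAdmin checks the remaining admin count and removes the user
// inside one critical section, mirroring a conditional UPDATE.
func (s *adminStore) revokeAdmin(userID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, isAdmin := s.admins[userID]; !isAdmin {
		return nil // idempotent: revoking a non-admin is a no-op
	}
	if len(s.admins) <= 1 {
		return errLastAdmin
	}
	delete(s.admins, userID)
	return nil
}

func main() {
	store := newAdminStore("usr-a", "usr-b")
	fmt.Println(store.revokeAdmin("usr-a")) // another admin remains, so this succeeds
	fmt.Println(store.revokeAdmin("usr-b")) // refused: usr-b is now the only admin
	fmt.Println(store.revokeAdmin("usr-a")) // already revoked, treated as a no-op
}
```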