1
0
forked from wrenn/wrenn
Commit Graph

290 Commits

Author SHA1 Message Date
c08884fa2c Merge branch 'main' of git.omukk.dev:wrenn/wrenn into dev 2026-05-13 11:05:49 +06:00
4707f16c76 v0.1.6 (#45)
## What's New?
Performance updates for large capsules, admin panel enhancement and bug fixes

### Envd
- Fixed bug with sandbox metrics calculation
- Page cache drop and balloon inflation to reduce memfile snapshot
- Updated rpc timeout logic for better control
- Added tests

### Admin Panel
- Add/Remove platform admin
- Updated template deletion logic for fine grained permission

### Others
- Minor frontend visual improvement
- Minor bugfixes
- Version bump

Co-authored-by: Tasnim Kabir Sadik <tksadik92@gmail.com>
Reviewed-on: wrenn/wrenn#45
Co-authored-by: pptx704 <rafeed@omukk.dev>
Co-committed-by: pptx704 <rafeed@omukk.dev>
2026-05-13 05:05:35 +00:00
6164d7cae3 version bump 2026-05-13 10:58:54 +06:00
dc6776cc8f fix(agent): register with CP before inflating rootfs images 2026-05-13 10:52:22 +06:00
0bfda08f47 Merge pull request 'test (envd): add 136 unit tests across 12 modules' (#44) from testing/envd into dev
Reviewed-on: wrenn/wrenn#44
2026-05-13 04:42:06 +00:00
485be22a16 test(envd): add 136 unit tests across 12 modules
Cover all pure-function modules with inline #[cfg(test)] blocks:
crypto (NIST/RFC 4231 known-answer vectors), auth (SecureToken ops,
signature generation/validation), conntracker (snapshot lifecycle),
execcontext, util (AtomicMax concurrent correctness), http/encoding
(RFC 7231 negotiation), port/conn (/proc/net/tcp parsing),
rpc/entry (format_permissions), and permissions/path (tilde expansion,
ensure_dirs). Add tempfile dev-dep for filesystem tests. Update
Makefile test target to include cargo test.
2026-05-13 10:39:54 +06:00
ead406bdac Merge pull request 'fix: resolve large operation reliability — stream hangs, pause races, and memory bloat' (#43) from fix/large-operations into dev
Reviewed-on: wrenn/wrenn#43
2026-05-13 03:44:41 +00:00
1472d77b52 Merge branch 'dev' into fix/large-operations 2026-05-13 03:44:19 +00:00
6a0fea30a6 Rootfs script updated 2026-05-13 09:35:06 +06:00
8c34388fc2 Changed commands to check if envd is statically linked or not 2026-05-12 23:19:30 +06:00
aca43d51eb fix: resolve process stream hangs, pause race, and PTY signal loss
- Cache terminal EndEvent on ProcessHandle so connect() can detect
  already-exited processes instead of hanging forever on broadcast
  receivers that missed the event. Subscribe before checking cache
  to close the TOCTOU window.

- Protect sb.Status writes in Pause with m.mu to prevent data race
  with concurrent readers (AcquireProxyConn, Exec, etc.).

- Restart metrics sampler in restoreRunning so a failed pause attempt
  doesn't permanently kill sandbox metrics collection.

- Return dequeued non-input messages from coalescePtyInput instead of
  dropping them, preventing silent loss of kill/resize signals during
  typing bursts.
2026-05-09 18:11:15 +06:00
522e1c5e90 fix: subscribe to process channels before spawning threads to prevent event loss
Fast-exiting processes (e.g. echo) sent data/end events before
start() subscribed to the broadcast channels, causing the stream
to hang indefinitely and the exec RPC to time out with 502.

Move channel subscription into spawn_process, before reader/waiter
threads start, and return pre-subscribed receivers via SpawnedProcess.
2026-05-09 17:28:37 +06:00
d1d316f35c fix: resolve exec 502 by terminating process streams on exit
The start() and connect() streaming RPCs blocked forever in the data
event loop because ProcessHandle retains a broadcast sender (needed for
reconnection via connect()), preventing the channel from closing.

Race data_rx against end_rx with tokio::select! so the stream terminates
when the process exits. Remaining buffered data is drained before
yielding the end event.
2026-05-09 16:36:33 +06:00
2af8412cdc fix: use RwLock for envd Defaults to fix silent mutation loss
The /init handler's default_user mutation cloned the Defaults struct,
mutated the clone, then dropped it — the actual state was never updated.
This caused processes to always run as "root" regardless of the user
set via POST /init. Additionally, default_workdir was accepted in the
init request but never applied.

Wrap user and workdir fields in RwLock with accessor methods so mutations
propagate correctly through the shared AppState.
2026-05-09 15:28:09 +06:00
c93ad5e2db fix: harden pause flow with connection isolation and UFFD event handling
Restructure pause to: block new operations (StatusPausing), drain proxy
connections with 5s grace, force-close remaining via context cancellation,
drop page cache, inflate balloon, then freeze vCPUs. Previously connections
could arrive during the pause window and API operations weren't blocked.

Handle UFFD_EVENT_REMOVE/UNMAP/REMAP/FORK gracefully instead of crashing
the UFFD server. These events fire during balloon deflation on snapshot
restore, killing the page fault handler and preventing VM boot.

Also adds ConnTracker.ForceClose() with cancellable context propagated
through the proxy handler, so lingering proxy connections are actively
terminated rather than left dangling.
2026-05-09 14:51:19 +06:00
38799770db fix: inflate balloon before snapshot to reduce memfile size
Firecracker dumps the entire VM memory region regardless of guest
usage. A 20GB VM using 500MB still produces a ~20GB memfile because
freed pages retain stale data (non-zero blocks).

Inflate the balloon device before snapshot to reclaim free guest
memory. Balloon pages become zero from FC's perspective, allowing
ProcessMemfile to skip them. This reduces memfile size from ~20GB
to ~1-2GB for lightly-used VMs.

- Pause: read guest memory usage, inflate balloon to reclaim free
  pages, wait 2s for guest kernel to process, then proceed
- Resume: deflate balloon to 0 after PostInit so guest gets full
  memory back
- createFromSnapshot: same deflation since template snapshots
  inherit inflated balloon state
- All balloon ops are best-effort with debug logging on failure
2026-05-05 15:38:04 +06:00
51b5d7b3ba fix: resolve pause/snapshot failures and CoW exhaustion on large VMs
Remove hard 10s timeout from Firecracker HTTP client — callers already
pass context.Context with appropriate deadlines, and 20GB+ memfile
writes easily exceed 10s.

Ensure CoW file is at least as large as the origin rootfs. Previously,
WRENN_DEFAULT_ROOTFS_SIZE=30Gi expanded the base image to 30GB but the
default 5GB CoW could not hold all writes, causing dm-snapshot
invalidation and EIO on all guest I/O.

Destroy frozen VMs in resumeOnError instead of leaving zombies that
report "running" but can't execute. Use fresh context for the resume
attempt so a cancelled caller context doesn't falsely trigger destroy.

Increase CP→Agent ResponseHeaderTimeout from 45s to 5min and
PrepareSnapshot timeout from 3s to 30s for large-memory VMs.

After failed pause, ping agent to detect destroyed sandboxes and mark
DB status as "error" instead of reverting to "running".
2026-05-04 01:46:57 +06:00
fd5fa28205 Merge pull request 'Enhanced frontend ux' (#42) from enhance/frontend into dev
Reviewed-on: wrenn/wrenn#42
2026-05-03 11:08:48 +00:00
1244c08e42 fix: fetch sandbox metrics immediately on page load
Metrics data was only fetched after Chart.js dynamic import completed,
leaving graphs empty until the first poll interval fired. Now
loadMetrics() runs in parallel with the Chart.js import, and
initCharts() resets the dedup key so pre-fetched data populates
newly created chart instances.
2026-05-03 16:43:26 +06:00
021d709de2 feat: show template owner and restrict delete in admin panel
Add Owner column to admin templates table, resolving team IDs to names
via admin teams API. Disable delete for non-platform templates and the
minimal template, with contextual tooltips explaining why.
2026-05-03 15:51:20 +06:00
cac6fcd626 feat: admin grant/revoke from admin panel
Add PUT /v1/admin/users/{id}/admin endpoint and frontend UI for
granting and revoking platform admin status. Uses atomic conditional
SQL (RevokeUserAdmin) to prevent race conditions that could remove
the last admin. Includes idempotency check, audit logging, and
confirmation dialog with self-demotion warning.
2026-05-03 15:24:34 +06:00
4954b19d7c fix: merge capsule data in-place to prevent visual refresh on poll
Replaces full array assignment with granular merge that reuses existing
Svelte proxy objects, so only rows with actual data changes re-render.
2026-05-03 15:09:21 +06:00
01819642cc fix: drop page cache before snapshot to reduce memory dump size
Linux keeps freed memory as page cache, which Firecracker snapshots
as non-zero blocks. A 16GB VM with 12GB stale cache would write all
12GB to disk. Dropping pagecache (not dentries/inodes) in
/snapshot/prepare before blocking the reclaimer shrinks snapshots
to actual working set size with minimal resume latency impact.
2026-05-03 14:27:49 +06:00
cb28f7759d Merge pull request 'fix: accurate sandbox metrics and memory management' (#41) from bugfix/sandbox-metrics-calculations into dev
Reviewed-on: wrenn/wrenn#41
2026-05-03 06:41:41 +00:00
1178ab8b21 fix: accurate sandbox metrics and memory management
Three issues fixed:

1. Memory metrics read host-side VmRSS of the Firecracker process,
   which includes guest page cache and never decreases. Replaced
   readMemRSS(fcPID) with readEnvdMemUsed(client) that queries
   envd's /metrics endpoint for guest-side total - MemAvailable.
   This matches neofetch and reflects actual process memory.

2. Added Firecracker balloon device (deflate_on_oom, 5s stats) and
   envd-side periodic page cache reclaimer (drop_caches when >80%
   used). Reclaimer is gated by snapshot_in_progress flag with
   sync() before freeze to prevent memory corruption during pause.

3. Sampling interval 500ms → 1s, ring buffer capacities adjusted
   to maintain same time windows. Reduces per-host HTTP load from
   240 calls/sec to 120 calls/sec at 120 capsules.

Also: maxDiffGenerations 8 → 1 (merge every re-pause since UFFD
lazy-loads anyway), envd mem_used formula uses total - available.
2026-05-03 12:19:01 +06:00
233e747d5d Merge branch 'main' of git.omukk.dev:wrenn/wrenn into dev 2026-05-03 04:56:14 +06:00
f5a23c1fa0 v0.1.5 (#40)
Reviewed-on: wrenn/wrenn#40
v0.1.5
2026-05-02 22:56:00 +00:00
20a228eb8d Merge pull request 'Rewritten envd with rust to improve reliability during pause and resume operations' (#39) from feat/envd-rewrite into dev
Reviewed-on: wrenn/wrenn#39
2026-05-02 22:49:36 +00:00
ef5f223863 fix: improve error feedback for terminal disconnects and host unavailability
Show "[session disconnected]" in terminal when PTY websocket closes cleanly.
Map scheduler and agent unavailability errors to 503 with user-friendly
message instead of leaking internal details.
2026-05-03 04:47:10 +06:00
31456fd169 fix: resolve PTY failure, MMDS file writes, and metrics instability in envd-rs
Three bugs fixed:

1. PTY connections failed because home directory was hardcoded as
   /home/{username} instead of reading from /etc/passwd. For root,
   this produced /home/root/ which doesn't exist — CWD validation
   rejected every PTY Start request without explicit cwd. Fixed all
   6 locations to use user.dir from nix::unistd::User.

2. MMDS polling silently failed to parse metadata because the
   logs_collector_address field lacked #[serde(default)]. The host
   agent only sends instanceID + envID — missing "address" field
   caused every deserialize attempt to fail, so .WRENN_SANDBOX_ID
   and .WRENN_TEMPLATE_ID were never written. Also added error
   logging and create_dir_all before file writes.

3. Metrics CPU values were non-deterministic because a fresh
   sysinfo::System was created per request with a 100ms sleep
   between reads. Replaced with a background thread that samples
   CPU at fixed 1-second intervals via a persistent System instance,
   matching gopsutil's internal caching behavior. Metrics endpoint
   now reads cached atomic values — no blocking, consistent window.

Also: close master PTY fd in child pre_exec, add process.Start
request logging, bump version to 0.2.0.
2026-05-03 04:28:10 +06:00
bbcde17d49 Updated static link check for envd 2026-05-03 03:32:41 +06:00
f328113a2a rename guest hostname from "sandbox" to "capsule"
Terminal prompt inside VMs now shows root@capsule instead of
root@sandbox, aligning with user-facing "capsule" terminology.
2026-05-03 03:32:03 +06:00
1143acd37a refactor: remove Go envd module, update host agent for Rust envd
The Go envd guest agent (`envd/`) is fully replaced by the Rust
implementation (`envd-rs/`). This commit removes the Go module and
updates all references across the codebase.

Makefile: remove ENVD_DIR, VERSION_ENVD, build-envd-go, dev-envd-go,
and Go envd from proto/fmt/vet/tidy/clean targets. Add static-link
verification to build-envd.

Host agent: rewrite snapshot quiesce comments that referenced Go GC
and page allocator corruption — no longer applicable with Rust envd.
Tighten envdclient to expect HTTP 200 (not 204) from health and file
upload endpoints, and require JSON version response from FetchVersion.

Remove NOTICE (no e2b-derived code remains). Update CLAUDE.md and
README.md to reflect Rust envd architecture.
2026-05-03 03:12:25 +06:00
0b53d34417 feat: rewrite envd guest agent in Rust (envd-rs)
Complete Rust rewrite of the Go envd guest daemon that runs as PID 1
inside Firecracker microVMs. Feature-complete across all 8 phases:

- Health, metrics, and env var endpoints
- Crypto (SHA-256/512, HMAC), auth (secure token, signing), init/snapshot
- Connect RPC via connectrpc + buffa (process + filesystem services)
- File transfer (GET/POST /files) with gzip, multipart, chown, ENOSPC
- Port subsystem (/proc/net/tcp scanner, socat forwarder)
- Cgroup2 manager with noop fallback
- Snapshot/restore lifecycle (conntracker, port subsystem stop/restart)
- SIGTERM graceful shutdown, --cmd initial process spawn
- MMDS metadata polling for Firecracker mode

42 source files, ~4200 LOC, 4.1MB stripped release binary.
Makefile updated: build-envd now targets Rust (musl static),
build-envd-go preserved for Go builds.
2026-05-03 02:47:15 +06:00
3deecbff89 fix: prevent Go runtime memory corruption and sandbox halt after snapshot restore
Three root causes addressed:

1. Go page allocator corruption: allocations between the pre-snapshot GC
   and VM freeze leave the summary tree inconsistent. After restore, GC
   reads corrupted metadata — either panicking (killing PID 1 → kernel
   panic) or silently failing to collect, causing unbounded heap growth
   until OOM. Fix: move GC to after all HTTP allocations in
   PostSnapshotPrepare, then set GOMAXPROCS(1) so any remaining
   allocations run sequentially with no concurrent page allocator access.
   GOMAXPROCS is restored on first health check after restore.

2. PostInit timeout starvation: WaitUntilReady and PostInit shared a
   single 30s context. If WaitUntilReady consumed most of it, PostInit
   failed — RestoreAfterSnapshot never ran, leaving envd with keep-alives
   disabled and zombie connections. Fix: separate timeout contexts.

3. CP HTTP server missing timeouts: no ReadHeaderTimeout or IdleTimeout
   caused goroutine leaks from hung proxy connections. Fix: add both,
   matching host agent values.

Also adds UFFD prefetch to proactively load all guest pages after restore,
eliminating on-demand page fault latency for subsequent RPC calls.
2026-05-02 17:22:51 +06:00
bb582deefa fix: prevent sandbox halt after resume by fixing HTTP/2 HOL blocking and adding timeouts
Disable HTTP/2 on both host agent server and CP→agent transport — multiplexing
caused head-of-line blocking when a slow sandbox RPC stalled the shared connection.
Add ResponseHeaderTimeout to envd HTTP clients. Merge SetDefaults into Resume's
PostInit call to eliminate an extra round-trip that could hang on a stale connection.
2026-05-02 13:48:51 +06:00
7ef9a64613 fix: close stale TCP connections across snapshot/restore to prevent envd hangs
After Firecracker snapshot restore, zombie TCP sockets from the previous
session cause Go runtime corruption inside the guest VM, making envd
unresponsive. This manifests as infinite loading in the file browser and
terminal timeouts (524) in production (HTTP/2 + Cloudflare) but not locally.

Four-part fix:
- Add ServerConnTracker to envd that tracks connections via ConnState callback,
  closes idle connections and disables keep-alives before snapshot, then closes
  all pre-snapshot zombie connections on restore (while preserving post-restore
  connections like the /init request)
- Split envdclient into timeout (2min) and streaming (no timeout) HTTP clients;
  use streaming client for file transfers and process RPCs
- Close host-side idle envdclient connections before PrepareSnapshot so FIN
  packets propagate during the 3s quiesce window
- Add StreamingHTTPClient() accessor; streaming file transfer handlers in
  hostagent use it instead of the timeout client
2026-05-02 05:19:37 +06:00
f3572f7356 Fix empty WRENN_TEMPLATE_ID after resuming paused sandbox
Resume() was building VMConfig without TemplateID, so Firecracker MMDS
received an empty string. envd's PostInit then wrote that empty value to
/run/wrenn/.WRENN_TEMPLATE_ID. Fix by persisting the template ID in
snapshot metadata during Pause and reading it back during Resume.
2026-05-02 04:57:08 +06:00
2e998a26a2 Merge branch 'main' of git.omukk.dev:wrenn/wrenn into dev 2026-05-01 15:01:32 +06:00
4fcc19e91f v0.1.4 (#38)
Reviewed-on: wrenn/wrenn#38
Co-authored-by: pptx704 <rafeed@omukk.dev>
Co-committed-by: pptx704 <rafeed@omukk.dev>
2026-05-01 09:01:08 +00:00
f3ec626d58 Envd version bump 2026-05-01 14:59:37 +06:00
f4733e2f7a Version bump 2026-04-25 04:49:17 +06:00
cdacc12a48 Merge pull request 'Fixed network throttle when an application is running' (#37) from fix/network-throttle-on-load into dev
Reviewed-on: wrenn/wrenn#37
2026-04-24 22:43:31 +00:00
bd98610153 fix: sandbox network responsiveness under port-binding apps
Running port-binding applications (Jupyter, http.server, NextJS) inside
sandboxes caused severe PTY sluggishness and proxy navigation errors.

Root cause: the CP sandbox proxy and Connect RPC pool shared a single
HTTP transport. Heavy proxy traffic (Jupyter WebSocket, REST polling)
interfered with PTY RPC streams via HTTP/2 flow control contention.

Transport isolation (main fix):
- Add dedicated proxy transport on CP (NewProxyTransport) with HTTP/2
  disabled, separate from the RPC pool transport
- Add dedicated proxy transport on host agent, replacing
  http.DefaultTransport
- Add dedicated envdclient transport with tuned connection pooling
- Replace http.DefaultClient in file streaming RPCs with per-sandbox
  envd client

Proxy path rewriting (navigation fix):
- Add ModifyResponse to rewrite Location headers with /proxy/{id}/{port}
  prefix, handling both root-relative and absolute-URL redirects
- Strip prefix back out in CP subdomain proxy for correct browser
  behavior
- Replace path.Join with string concat in CP Director to preserve
  trailing slashes (prevents redirect loops on directory listings)

Proxy resilience:
- Add dial retry with linear backoff (3 attempts) to handle socat
  startup delay when ports are first detected
- Cache ReverseProxy instances per sandbox+port+host in sync.Map
- Add EvictProxy callback wired into sandbox Manager.Destroy

Buffer and server hardening:
- Increase PTY and exec stream channel buffers from 16 to 256
- Add ReadHeaderTimeout (10s) and IdleTimeout (620s) to host agent
  HTTP server

Network tuning:
- Set TAP device TxQueueLen to 5000 (up from default 1000)
- Add Firecracker tx_rate_limiter (200 MB/s sustained, 100 MB burst)
  to prevent guest traffic from saturating the TAP
2026-04-25 04:21:55 +06:00
5e13879954 fix: OAuth ConnectProvider state HMAC format mismatch
ConnectProvider computed HMAC over bare state, but Callback always
verifies HMAC(state+":"+intent). This caused the account-linking
flow to always fail with invalid_state.
2026-04-25 02:00:39 +06:00
339cd7bee1 fix: security and stability fixes from code review
- Scope WebSocket auth bypass to only WS endpoints by restructuring
  routes into separate chi Groups. Non-WS routes no longer passthrough
  unauthenticated requests with spoofed Upgrade headers. Added
  optionalAPIKeyOrJWT middleware for WS routes (injects auth context
  from API key/JWT if present, passes through otherwise) and
  markAdminWS middleware for admin WS routes.

- Fix nil pointer dereference in envd Handler.Wait() — p.tty.Close()
  was called unconditionally but p.tty is nil for non-PTY processes,
  crashing every non-PTY process exit.

- Fix goroutine leak in sandbox Pause — stopSampler was never called,
  leaking one sampler goroutine per successful pause operation.

- Decouple PTY WebSocket reads from RPC dispatch using a buffered
  channel to prevent backpressure-induced connection drops under fast
  typing. Includes input coalescing to reduce RPC call volume.
2026-04-24 15:48:38 +06:00
153a54fdcd Merge branch 'main' of git.omukk.dev:wrenn/wrenn into dev 2026-04-21 16:11:59 +06:00
52ad21c339 v0.1.3 (#36)
## What's new

Compliance, audit, and account lifecycle improvements — admin actions are now fully auditable, user data is properly anonymized on deletion, and OAuth signup flow gives users control over their profile.

### Audit

- Added audit logging for all admin actions (user activate/deactivate, team BYOC toggle, team delete, template delete, build create/cancel)
- Added admin audit page with infinite scroll and hierarchical filters
- Fixed audit log team assignment — admin/host actions now correctly land under PlatformTeamID
- Anonymize audit logs on user hard-delete (actor name, IDs, emails stripped)
- Deduplicated audit logger internals (665 → 374 lines, no behavior change)

### Authentication

- Separated GitHub OAuth login/signup flows — login no longer auto-creates accounts
- Added name confirmation dialog for new GitHub signups

### Account Lifecycle

- Email notification sent when account is permanently deleted after grace period
- Audit log anonymization tied to user purge (per-user transactional)

### UX

- Removed accent gradient bars from admin host dialogs (border + shadow only)
- Frontend renders deleted users as styled badge in audit log view

### Others

- Version bump
- Bug fixes

Reviewed-on: wrenn/wrenn#36
v0.1.3
2026-04-21 10:11:49 +00:00
c3afd0c8a0 Merge pull request 'Audit logging, Data anonymization, and OAuth flow improvements' (#35) from feat/compliance into dev
Reviewed-on: wrenn/wrenn#35
2026-04-21 10:09:37 +00:00
11928a172a feat: send email notification on account hard-delete
Notify users via email when their account is permanently deleted after
the 15-day soft-delete grace period. Query now returns email alongside
user ID so the notification can be sent after deletion.

Email failure is logged as a warning but does not block cleanup.
2026-04-21 16:01:56 +06:00