1
0
forked from wrenn/wrenn
Commit Graph

16 Commits

Author SHA1 Message Date
7ef9a64613 fix: close stale TCP connections across snapshot/restore to prevent envd hangs
After Firecracker snapshot restore, zombie TCP sockets from the previous
session cause Go runtime corruption inside the guest VM, making envd
unresponsive. This manifests as infinite loading in the file browser and
terminal timeouts (524) in production (HTTP/2 + Cloudflare) but not locally.

Four-part fix:
- Add ServerConnTracker to envd that tracks connections via ConnState callback,
  closes idle connections and disables keep-alives before snapshot, then closes
  all pre-snapshot zombie connections on restore (while preserving post-restore
  connections like the /init request)
- Split envdclient into timeout (2min) and streaming (no timeout) HTTP clients;
  use streaming client for file transfers and process RPCs
- Close host-side idle envdclient connections before PrepareSnapshot so FIN
  packets propagate during the 3s quiesce window
- Add StreamingHTTPClient() accessor; streaming file transfer handlers in
  hostagent use it instead of the timeout client
2026-05-02 05:19:37 +06:00
f3572f7356 Fix empty WRENN_TEMPLATE_ID after resuming paused sandbox
Resume() was building VMConfig without TemplateID, so Firecracker MMDS
received an empty string. envd's PostInit then wrote that empty value to
/run/wrenn/.WRENN_TEMPLATE_ID. Fix by persisting the template ID in
snapshot metadata during Pause and reading it back during Resume.
2026-05-02 04:57:08 +06:00
bd98610153 fix: sandbox network responsiveness under port-binding apps
Running port-binding applications (Jupyter, http.server, NextJS) inside
sandboxes caused severe PTY sluggishness and proxy navigation errors.

Root cause: the CP sandbox proxy and Connect RPC pool shared a single
HTTP transport. Heavy proxy traffic (Jupyter WebSocket, REST polling)
interfered with PTY RPC streams via HTTP/2 flow control contention.

Transport isolation (main fix):
- Add dedicated proxy transport on CP (NewProxyTransport) with HTTP/2
  disabled, separate from the RPC pool transport
- Add dedicated proxy transport on host agent, replacing
  http.DefaultTransport
- Add dedicated envdclient transport with tuned connection pooling
- Replace http.DefaultClient in file streaming RPCs with per-sandbox
  envd client

Proxy path rewriting (navigation fix):
- Add ModifyResponse to rewrite Location headers with /proxy/{id}/{port}
  prefix, handling both root-relative and absolute-URL redirects
- Strip prefix back out in CP subdomain proxy for correct browser
  behavior
- Replace path.Join with string concat in CP Director to preserve
  trailing slashes (prevents redirect loops on directory listings)

Proxy resilience:
- Add dial retry with linear backoff (3 attempts) to handle socat
  startup delay when ports are first detected
- Cache ReverseProxy instances per sandbox+port+host in sync.Map
- Add EvictProxy callback wired into sandbox Manager.Destroy

Buffer and server hardening:
- Increase PTY and exec stream channel buffers from 16 to 256
- Add ReadHeaderTimeout (10s) and IdleTimeout (620s) to host agent
  HTTP server

Network tuning:
- Set TAP device TxQueueLen to 5000 (up from default 1000)
- Add Firecracker tx_rate_limiter (200 MB/s sustained, 100 MB burst)
  to prevent guest traffic from saturating the TAP
2026-04-25 04:21:55 +06:00
339cd7bee1 fix: security and stability fixes from code review
- Scope WebSocket auth bypass to only WS endpoints by restructuring
  routes into separate chi Groups. Non-WS routes no longer passthrough
  unauthenticated requests with spoofed Upgrade headers. Added
  optionalAPIKeyOrJWT middleware for WS routes (injects auth context
  from API key/JWT if present, passes through otherwise) and
  markAdminWS middleware for admin WS routes.

- Fix nil pointer dereference in envd Handler.Wait() — p.tty.Close()
  was called unconditionally but p.tty is nil for non-PTY processes,
  crashing every non-PTY process exit.

- Fix goroutine leak in sandbox Pause — stopSampler was never called,
  leaking one sampler goroutine per successful pause operation.

- Decouple PTY WebSocket reads from RPC dispatch using a buffered
  channel to prevent backpressure-induced connection drops under fast
  typing. Includes input coalescing to reduce RPC call volume.
2026-04-24 15:48:38 +06:00
977c3a466a Shrink minimal rootfs on graceful host agent shutdown
On startup EnsureImageSizes expands the minimal rootfs to the configured
disk size. This adds the inverse: ShrinkMinimalImage runs e2fsck + resize2fs -M
during graceful shutdown so the image is stored compactly on disk.
2026-04-16 16:26:50 +06:00
a5ad3731f2 Refactored to maintain a separate cloud version
Moves 12 packages from internal/ to pkg/ (config, id, validate, events, db,
auth, lifecycle, scheduler, channels, audit, service) so they can be imported
by the enterprise repo as a Go module dependency.

Introduces pkg/cpextension (shared Extension interface + ServerContext) and
pkg/cpserver (Run() entrypoint with functional options) so the enterprise
main.go can call cpserver.Run(cpserver.WithExtensions(...)) without duplicating
the 20-step server bootstrap. Adds db/migrations/embed.go for go:embed access
to OSS SQL migrations from the enterprise module.

cmd/control-plane/main.go is reduced to a 10-line wrapper around cpserver.Run.
2026-04-15 21:41:48 +06:00
5b4fde055c Fix build recipe execution and flatten reliability
- Set HOME in bctx.EnvVars when USER switches so ~ expands correctly in
  subsequent RUN/WORKDIR steps instead of resolving to /root
- Run /bin/sync inside the guest before FlattenRootfs destroys the VM,
  preventing pip-installed files from being captured as 0-byte due to
  unflushed page cache
- Wrap healthcheck command with su <user> so it runs with the template's
  default user context (correct HOME, correct UID)
- Export Shellescape from the recipe package for use in build service
- Add code-runner-beta recipe (Jupyter server with ipykernel --sys-prefix)
  and replace old python-interpreter-v0-beta
2026-04-15 18:24:54 +06:00
516890c49a Add background process execution API
Start long-running processes (web servers, daemons) without blocking the
HTTP request. Leverages envd's existing background process support
(context.Background(), List, Connect, SendSignal RPCs) and wires it
through the host agent and control plane layers.

New API surface:
- POST /v1/capsules/{id}/exec with background:true → 202 {pid, tag}
- GET /v1/capsules/{id}/processes → list running processes
- DELETE /v1/capsules/{id}/processes/{selector} → kill by PID or tag
- WS /v1/capsules/{id}/processes/{selector}/stream → reconnect to output

The {selector} param auto-detects: numeric = PID, string = tag.
Tags are auto-generated as "proc-" + 8 hex chars if not provided.
2026-04-14 03:57:01 +06:00
962860ba74 Pre-pause snapshot signal to prevent Go runtime crash on restore
envd crashes with "fatal error: bad summary data" after Firecracker
snapshot/restore because the page allocator radix tree is inconsistent
when vCPUs are frozen mid-allocation. The port scanner goroutine
allocates heavily every second, making it the primary trigger.

Add POST /snapshot/prepare to envd — the host agent calls it before
vm.Pause to quiesce continuous goroutines and force GC. On restore,
PostInit restarts the port subsystem via the existing /init endpoint.

- New PortSubsystem abstraction with Start/Stop/Restart lifecycle
- Context-based goroutine cancellation (replaces irreversible channel close)
- Context-aware Signal to prevent scanner/forwarder deadlock
- Fix forwarder goroutine leak (was spinning forever on closed channel)
- Kill socat children on stop to prevent orphans across snapshots
- Fix double cmd.Wait panic (exec.Command instead of CommandContext)
2026-04-13 05:21:10 +06:00
25b5258841 COPY multi-source support, configurable rootfs size, build fixes
- COPY now supports multiple sources: COPY a.txt b.txt /dest/
  Last argument is always destination (matches Dockerfile semantics).
- COPY resolves relative destinations against current WORKDIR.
- WRENN_DEFAULT_ROOTFS_SIZE env var (e.g. 5G, 2Gi, 1000M, 512Mi)
  controls template rootfs expansion. Used both at agent startup
  (EnsureImageSizes) and after FlattenRootfs (shrink then re-expand).
- Pre-build now sets WORKDIR /home/wrenn-user after USER switch.
- Extracted archive files get chmod a+rX for readability.
- Path traversal validation on COPY sources.
2026-04-12 03:39:17 +06:00
75af2a4f66 Add USER, COPY, ENV persistence to template build system
Implement three new recipe commands for the admin template builder:

- USER <name>: creates the user (adduser + passwordless sudo), switches
  execution context so subsequent RUN/START commands run as that user
  via su wrapping. Last USER becomes the template's default_user.

- COPY <src> <dst>: copies files from an uploaded build archive
  (tar/tar.gz/zip) into the sandbox. Source paths validated against
  traversal. Ownership set to the current USER.

- ENV persistence: accumulated env vars stored in templates.default_env
  (JSONB) and injected via PostInit when sandboxes are created from the
  template, mirroring Docker's image metadata approach.

Supporting changes:
- Pre-build creates wrenn-user as default (via USER command)
- WORKDIR now creates the directory if it doesn't exist (mkdir -p)
- Per-step progress updates (ProgressFunc callback) for live UI
- Multipart form support on POST /v1/admin/builds for archive upload
- Proto: default_user/default_env fields on Create/ResumeSandboxRequest
- Host agent: SetDefaults calls PostInitWithDefaults on envd
- Control plane: reads template defaults, passes on sandbox create/resume
- Frontend: file upload widget, recipe copy button, keyword colors for
  USER/COPY, fixed Svelte whitespace stripping in step display
- Admin panel defaults to /admin/templates instead of /admin/hosts
- Migration adds default_user and default_env to templates and
  template_builds tables
2026-04-12 02:10:01 +06:00
ab3fc4a807 Add interactive PTY terminal sessions for sandboxes
Wire envd's existing PTY process capabilities through the full stack:
hostagent proto (4 new RPCs: PtyAttach, PtySendInput, PtyResize, PtyKill),
envdclient, sandbox manager, and a new WebSocket endpoint at
GET /v1/sandboxes/{id}/pty with bidirectional JSON message protocol.

Sessions use tag-based identity for disconnect/reconnect support,
base64-encoded PTY data for binary safety, and a 120s inactivity timeout.
2026-04-11 02:42:59 +06:00
4ed17b2776 Fix stale WRENN_SANDBOX_ID and WRENN_TEMPLATE_ID after snapshot restore
After restoring a VM from snapshot, envd had already completed its initial
MMDS poll, so the metadata files in /run/wrenn/ and env vars retained values
from the original sandbox. Call POST /init after WaitUntilReady on both
resume and create-from-template paths to trigger envd to re-read MMDS.
2026-04-10 19:23:48 +06:00
172413e91e Made changes to accomodate repo url update (#15)
Reviewed-on: wrenn/wrenn#15
Co-authored-by: pptx704 <rafeed@omukk.dev>
Co-committed-by: pptx704 <rafeed@omukk.dev>
2026-04-09 21:02:44 +00:00
d3e4812e46 v0.0.1 (#8)
Co-authored-by: Tasnim Kabir Sadik <tksadik92@gmail.com>
Reviewed-on: wrenn/sandbox#8
2026-04-09 19:24:49 +00:00
32e5a5a715 Prototype with single host server and no admin panel (#2)
Reviewed-on: wrenn/sandbox#2
Co-authored-by: pptx704 <rafeed@omukk.dev>
Co-committed-by: pptx704 <rafeed@omukk.dev>
2026-03-22 21:01:23 +00:00