1
0
forked from wrenn/wrenn

12 Commits

Author SHA1 Message Date
bd98610153 fix: sandbox network responsiveness under port-binding apps
Running port-binding applications (Jupyter, http.server, NextJS) inside
sandboxes caused severe PTY sluggishness and proxy navigation errors.

Root cause: the CP sandbox proxy and Connect RPC pool shared a single
HTTP transport. Heavy proxy traffic (Jupyter WebSocket, REST polling)
interfered with PTY RPC streams via HTTP/2 flow control contention.

Transport isolation (main fix):
- Add dedicated proxy transport on CP (NewProxyTransport) with HTTP/2
  disabled, separate from the RPC pool transport
- Add dedicated proxy transport on host agent, replacing
  http.DefaultTransport
- Add dedicated envdclient transport with tuned connection pooling
- Replace http.DefaultClient in file streaming RPCs with per-sandbox
  envd client

Proxy path rewriting (navigation fix):
- Add ModifyResponse to rewrite Location headers with /proxy/{id}/{port}
  prefix, handling both root-relative and absolute-URL redirects
- Strip prefix back out in CP subdomain proxy for correct browser
  behavior
- Replace path.Join with string concat in CP Director to preserve
  trailing slashes (prevents redirect loops on directory listings)

Proxy resilience:
- Add dial retry with linear backoff (3 attempts) to handle socat
  startup delay when ports are first detected
- Cache ReverseProxy instances per sandbox+port+host in sync.Map
- Add EvictProxy callback wired into sandbox Manager.Destroy

Buffer and server hardening:
- Increase PTY and exec stream channel buffers from 16 to 256
- Add ReadHeaderTimeout (10s) and IdleTimeout (620s) to host agent
  HTTP server

Network tuning:
- Set TAP device TxQueueLen to 5000 (up from default 1000)
- Add Firecracker tx_rate_limiter (200 MB/s sustained, 100 MB burst)
  to prevent guest traffic from saturating the TAP
2026-04-25 04:21:55 +06:00
ab034062d3 Merge branch 'dev' into chore/hardening 2026-04-16 12:58:48 +00:00
44c32587e3 Cap network slot allocator at 32767 to match veth IP space
The veth addressing uses 10.12.0.0/16 with 2 IPs per slot. At slot
index 32768, vethOffset=65536 overflows byte arithmetic and wraps back
to 10.12.0.0, causing silent IP collisions with existing sandboxes.
Cap the allocator at 32767, which is the actual addressable limit.
2026-04-16 14:57:44 +06:00
9ea847923c Fix concurrency, security, and correctness issues across backend and frontend
- C1: Add sync.RWMutex to vm.Manager to protect concurrent vms map access
- H1: Fix IP arithmetic overflow in network slot addressing (byte truncation)
- H5: Fix MultiplexedChannel.Fork() TOCTOU race (move exited check inside lock)
- H8: Remove snapshot overwrite — return template_name_taken conflict instead
- H9: Wrap DeleteAccount DB ops in a transaction, make team deletion fatal
- H10: Sanitize serviceErrToHTTP to stop leaking internal error messages
- H11: Add deleted_at IS NULL to GetUserByEmail/GetUserByID queries
- H12: Add id DESC to audit log composite index for cursor pagination
- H15: Delete dead AuthModal.svelte component
- H17: Move JWT from WebSocket URL query param to first WS message
- H18: Fix $derived to $derived.by in FilesTab breadcrumbs
2026-04-16 06:11:42 +06:00
e3ffa576ce Fix review findings: IP collision, pause race, proxy path, ENV ordering, conn drain
- Fix IP address collision at slot 32768+ by using bitwise shifts instead of
  byte-truncating division in network slot addressing
- Add per-sandbox lifecycleMu to serialize concurrent Pause/Destroy calls
- Sanitize proxy forwarding path with path.Clean
- Sort ENV keys in recipe shell preamble for deterministic ordering
- Fix ConnTracker goroutine leak by adding cancel channel to Drain/Reset
- Update context_test to assert deterministic ENV ordering
2026-04-08 04:32:41 +06:00
1ca10230a9 Prefix network namespaces with wrenn-, add stale cleanup, lower diff cap
Rename ns-{idx} to wrenn-ns-{idx} and veth-{idx} to wrenn-veth-{idx}
to avoid collisions with other tools. Add CleanupStaleNamespaces() at
agent startup to remove orphaned namespaces, veths, iptables rules, and
routes from a previous crash. Lower maxDiffGenerations from 10 to 8 to
prevent Go runtime memory corruption from snapshot/restore drift.
2026-03-29 02:14:30 +06:00
a0d635ae5e Fix path traversal in template/snapshot names and network cleanup leaks
Add SafeName validator (allowlist regex) to reject directory traversal
in user-supplied template and snapshot names. Validated at both API
handlers (400 response) and sandbox manager (defense in depth).

Refactor CreateNetwork with rollback slice so partially created
resources (namespace, veth, routes, iptables rules) are cleaned up
on any error. Refactor RemoveNetwork to collect and return errors
instead of silently ignoring them.
2026-03-13 08:40:36 +06:00
63e9132d38 Add device-mapper snapshots, test UI, fix pause ordering and lint errors
- Replace reflink rootfs copy with device-mapper snapshots (shared
  read-only loop device per base template, per-sandbox sparse CoW file)
- Add devicemapper package with create/restore/remove/flatten operations
  and refcounted LoopRegistry for base image loop devices
- Fix pause ordering: destroy VM before removing dm-snapshot to avoid
  "device busy" error (FC must release the dm device first)
- Add test UI at GET /test for sandbox lifecycle management (create,
  pause, resume, destroy, exec, snapshot create/list/delete)
- Fix DirSize to report actual disk usage (stat.Blocks * 512) instead
  of apparent size, so sparse CoW files report correctly
- Add timing logs to pause flow for performance diagnostics
- Fix all lint errors across api, network, vm, uffd, and sandbox packages
- Remove obsolete internal/filesystem package (replaced by devicemapper)
- Update CLAUDE.md with device-mapper architecture documentation
2026-03-13 08:25:40 +06:00
0c245e9e1c Fix guest VM outbound networking and DNS resolution
Add resolv.conf to wrenn-init so guests can resolve DNS, and fix the
host MASQUERADE rule to match vpeerIP (the actual source after namespace
SNAT) instead of hostIP.
2026-03-11 06:02:31 +06:00
6f0c365d44 Add host agent RPC server with sandbox lifecycle management
Implement the host agent as a Connect RPC server that orchestrates
sandbox creation, destruction, pause/resume, and command execution.
Includes sandbox manager with TTL-based reaper, network slot allocator,
rootfs cloning, hostagent proto definition with generated stubs, and
test/debug scripts. Fix Firecracker process lifetime bug where VM was
tied to HTTP request context instead of background context.
2026-03-10 03:54:53 +06:00
7753938044 Add host agent with VM lifecycle, TAP networking, and envd client
Implements Phase 1: boot a Firecracker microVM, execute a command inside
it via envd, and get the output back. Uses raw Firecracker HTTP API via
Unix socket (not the Go SDK) for full control over the VM lifecycle.

- internal/vm: VM manager with create/pause/resume/destroy, Firecracker
  HTTP client, process launcher with unshare + ip netns exec isolation
- internal/network: per-sandbox network namespace with veth pair, TAP
  device, NAT rules, and IP forwarding
- internal/envdclient: Connect RPC client for envd process/filesystem
  services with health check retry
- cmd/host-agent: demo binary that boots a VM, runs "echo hello", prints
  output, and cleans up
- proto/envd: canonical proto files with buf + protoc-gen-connect-go
  code generation
- images/wrenn-init.sh: minimal PID 1 init script for guest VMs
- CLAUDE.md: updated architecture to reflect TAP networking (not vsock)
  and Firecracker HTTP API (not Go SDK)
2026-03-10 00:06:47 +06:00
bd78cc068c Initial project structure for Wrenn Sandbox
Set up directory layout, Makefiles, go.mod files, docker-compose,
and empty placeholder files for all packages.
2026-03-09 17:22:47 +06:00