1
0
forked from wrenn/wrenn
Commit Graph

12 Commits

Author SHA1 Message Date
3deecbff89 fix: prevent Go runtime memory corruption and sandbox halt after snapshot restore
Three root causes addressed:

1. Go page allocator corruption: allocations between the pre-snapshot GC
   and VM freeze leave the summary tree inconsistent. After restore, GC
   reads corrupted metadata — either panicking (killing PID 1 → kernel
   panic) or silently failing to collect, causing unbounded heap growth
   until OOM. Fix: move GC to after all HTTP allocations in
   PostSnapshotPrepare, then set GOMAXPROCS(1) so any remaining
   allocations run sequentially with no concurrent page allocator access.
   GOMAXPROCS is restored on first health check after restore.

2. PostInit timeout starvation: WaitUntilReady and PostInit shared a
   single 30s context. If WaitUntilReady consumed most of it, PostInit
   failed — RestoreAfterSnapshot never ran, leaving envd with keep-alives
   disabled and zombie connections. Fix: separate timeout contexts.

3. CP HTTP server missing timeouts: no ReadHeaderTimeout or IdleTimeout
   caused goroutine leaks from hung proxy connections. Fix: add both,
   matching host agent values.

Also adds UFFD prefetch to proactively load all guest pages after restore,
eliminating on-demand page fault latency for subsequent RPC calls.
2026-05-02 17:22:51 +06:00
bb582deefa fix: prevent sandbox halt after resume by fixing HTTP/2 HOL blocking and adding timeouts
Disable HTTP/2 on both host agent server and CP→agent transport — multiplexing
caused head-of-line blocking when a slow sandbox RPC stalled the shared connection.
Add ResponseHeaderTimeout to envd HTTP clients. Merge SetDefaults into Resume's
PostInit call to eliminate an extra round-trip that could hang on a stale connection.
2026-05-02 13:48:51 +06:00
bd98610153 fix: sandbox network responsiveness under port-binding apps
Running port-binding applications (Jupyter, http.server, NextJS) inside
sandboxes caused severe PTY sluggishness and proxy navigation errors.

Root cause: the CP sandbox proxy and Connect RPC pool shared a single
HTTP transport. Heavy proxy traffic (Jupyter WebSocket, REST polling)
interfered with PTY RPC streams via HTTP/2 flow control contention.

Transport isolation (main fix):
- Add dedicated proxy transport on CP (NewProxyTransport) with HTTP/2
  disabled, separate from the RPC pool transport
- Add dedicated proxy transport on host agent, replacing
  http.DefaultTransport
- Add dedicated envdclient transport with tuned connection pooling
- Replace http.DefaultClient in file streaming RPCs with per-sandbox
  envd client

Proxy path rewriting (navigation fix):
- Add ModifyResponse to rewrite Location headers with /proxy/{id}/{port}
  prefix, handling both root-relative and absolute-URL redirects
- Strip prefix back out in CP subdomain proxy for correct browser
  behavior
- Replace path.Join with string concat in CP Director to preserve
  trailing slashes (prevents redirect loops on directory listings)

Proxy resilience:
- Add dial retry with linear backoff (3 attempts) to handle socat
  startup delay when ports are first detected
- Cache ReverseProxy instances per sandbox+port+host in sync.Map
- Add EvictProxy callback wired into sandbox Manager.Destroy

Buffer and server hardening:
- Increase PTY and exec stream channel buffers from 16 to 256
- Add ReadHeaderTimeout (10s) and IdleTimeout (620s) to host agent
  HTTP server

Network tuning:
- Set TAP device TxQueueLen to 5000 (up from default 1000)
- Add Firecracker tx_rate_limiter (200 MB/s sustained, 100 MB burst)
  to prevent guest traffic from saturating the TAP
2026-04-25 04:21:55 +06:00
11928a172a feat: send email notification on account hard-delete
Notify users via email when their account is permanently deleted after
the 15-day soft-delete grace period. Query now returns email alongside
user ID so the notification can be sent after deletion.

Email failure is logged as a warning but does not block cleanup.
2026-04-21 16:01:56 +06:00
bb2146d838 refactor: deduplicate audit logger with shared entry builders
Replace repetitive actorFields + write boilerplate across all 25+ typed
Log methods with shared helpers: newEntry (general), newAdminEntry
(platform-level), resolveHostTeamID, and logSystemHostEvent.

Reduces logger.go from 665 to 374 lines with no behavior change.
2026-04-21 15:54:39 +06:00
d270ab7752 Version bump 2026-04-21 15:54:04 +06:00
7fd801c1eb feat: add audit logging for all admin actions and admin audit page
Log every admin-panel action (user activate/deactivate, team BYOC toggle,
team delete, template delete, build create/cancel) to the audit_logs table
under PlatformTeamID with scope "admin".

Add GET /v1/admin/audit-logs endpoint and /admin/audit frontend page with
infinite scroll and hierarchical filters. Expose audit.Entry + Log() for
cloud repo extensibility.

Fix seed_platform_team down-migration FK violation by deleting dependent
rows before the team row.
2026-04-21 15:41:45 +06:00
ebbbde9cd1 feat: anonymize audit logs on user hard-delete and fix host audit log team assignment
Anonymize audit logs when soft-deleted users are purged after 15 days:
actor_name set to 'deleted-user', actor_id and resource_id nulled,
email stripped from member metadata. Per-user delete ensures no user
is removed without successful anonymization.

Frontend renders deleted-user as a styled red badge in audit log view.

Fix shared host create/delete audit logs landing in admin's personal
team — now correctly assigned to PlatformTeamID.
2026-04-21 14:42:09 +06:00
47be1143fb Add MiddlewareProvider interface for extension middleware
Allows cloud extensions to inject middleware that wraps OSS routes
(e.g. billing enforcement) before they are registered.
2026-04-18 14:47:29 +06:00
92aab09104 Add daily usage metrics (CPU-minutes, RAM GB-minutes)
Introduce pre-computed daily usage rollups from sandbox_metrics_snapshots.
An hourly background worker aggregates completed days, while today's
usage is computed live from snapshots at query time for freshness.

Backend: new daily_usage table, rollup worker, UsageService, and
GET /v1/capsules/usage endpoint with date range filtering (up to 92 days).

Frontend: replace Usage page placeholder with bar charts (Chart.js),
summary total cards, and preset/custom date range controls.
2026-04-18 14:29:09 +06:00
5fa3529df9 Move email types to pkg/email for cloud repo access
Extracts Mailer interface, EmailData, and Button to pkg/email/types.go
so the cloud repo can use them via ServerContext. internal/email re-exports
the types as aliases so existing callers are unchanged. Also fixes
pre-existing lint errors (unchecked rollback and deadline calls).
2026-04-17 16:36:54 +06:00
605ad666a0 v0.1.0 (#17) 2026-04-16 19:24:25 +00:00