Cloud-repo extensions could not build on the cookie-session migration —
they still called the removed JWT helpers and duplicated middleware. Move
the session/CSRF middleware and cookie helpers into pkg/auth/session/middleware
as the single source of truth, with thin re-exports on cpextension.
Add hook interfaces so cloud can plug in billing without forking OSS:
- AuthHook (OnSignup fail-fast; OnLogin / OnAccount*Delete log+ignore)
- SandboxEventHook (un-acks on error so messages redeliver; idempotent)
- LimitsProvider / UsageProvider (402 on overage; DB-backed usage default)
ServerContext gains OAuthRegistry, Channels, ChannelPub so extensions stop
reimplementing them.
- channels dispatcher: drop capsule.{create,pause,resume,destroy} events
with system actor and no reason metadata. Suppresses the goroutine /
host-callback follow-up that duplicated every user-initiated action in
notification channels (Telegram, webhooks). Genuinely system-only
emitters (TTL auto-pause, host monitor reconciler, host failures) all
set reason, so they continue to notify.
- CreateCapsuleDialog: wrap submit in try/finally so the creating flag
always clears, and close the dialog before invoking oncreated to avoid
the parent receiving the new capsule while the dialog is still open.
- capsules page: guard against double-insertion of the same capsule when
the SSE event arrives before the dialog's oncreated callback resolves.
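
The dispatcher rule from the first item above can be sketched as follows. This is a minimal illustration, not the actual dispatcher: the Event struct, its field names, the "system" actor encoding, and shouldDispatch are all hypothetical stand-ins; only the event names and the "system actor + no reason" rule come from the description.

```go
package main

import "fmt"

// Event is a hypothetical, trimmed-down stand-in for the canonical event.
type Event struct {
	Type     string            // e.g. "capsule.pause"
	Actor    string            // "system" or a user identifier (assumed encoding)
	Metadata map[string]string // genuinely system-only emitters set "reason"
}

// lifecycle lists the event types the drop rule applies to.
var lifecycle = map[string]bool{
	"capsule.create":  true,
	"capsule.pause":   true,
	"capsule.resume":  true,
	"capsule.destroy": true,
}

// shouldDispatch reports whether an event should reach notification channels.
// Lifecycle events from the system actor with no reason are treated as
// follow-ups to a user-initiated action and dropped.
func shouldDispatch(e Event) bool {
	if !lifecycle[e.Type] || e.Actor != "system" {
		return true
	}
	return e.Metadata["reason"] != ""
}

func main() {
	followUp := Event{Type: "capsule.pause", Actor: "system"}
	ttl := Event{Type: "capsule.pause", Actor: "system",
		Metadata: map[string]string{"reason": "ttl"}}
	fmt.Println(shouldDispatch(followUp), shouldDispatch(ttl)) // false true
}
```
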
User authentication moves from short-lived JWT bearer tokens to opaque
session cookies (wrenn_sid) backed by a Postgres sessions table and a
Redis hot cache. Browsers get a paired wrenn_csrf cookie; all mutating
requests must echo it via X-CSRF-Token (double-submit).
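
A rough sketch of the double-submit check, assuming safe methods (GET/HEAD/OPTIONS) are exempt and a mismatch returns 403; neither detail is stated above. The cookie and header names come from the description; csrfOK is a hypothetical helper and the requireCSRF body is illustrative only, not the shipped middleware.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
)

// csrfOK implements the double-submit comparison: a mutating request passes
// only when the cookie value and the echoed header value are both present
// and equal.
func csrfOK(method, cookieVal, headerVal string) bool {
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodOptions:
		return true // assumption: safe methods are exempt
	}
	if cookieVal == "" || headerVal == "" {
		return false
	}
	// Constant-time compare avoids leaking the token through timing.
	return subtle.ConstantTimeCompare([]byte(cookieVal), []byte(headerVal)) == 1
}

// requireCSRF shows how the check could be wired as net/http middleware.
func requireCSRF(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var cookieVal string
		if c, err := r.Cookie("wrenn_csrf"); err == nil {
			cookieVal = c.Value
		}
		if !csrfOK(r.Method, cookieVal, r.Header.Get("X-CSRF-Token")) {
			http.Error(w, "csrf token mismatch", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	fmt.Println(csrfOK("POST", "abc", "abc")) // true
	fmt.Println(csrfOK("POST", "abc", "xyz")) // false
	fmt.Println(csrfOK("GET", "", ""))        // true
}
```
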
- New pkg/auth/session service: issue/revoke, idle (6h) + absolute
(24h) lifetimes, switch-team rotation, RevokeAllForUser on password
events, per-user listing for self-service.
- Middleware: requireSession + requireCSRF replace requireJWT and the
WS first-message JWT exchange. SSE/WS endpoints rely on the cookie
flowing on the upgrade — SSE ticket store deleted.
- API keys (wrn_<32hex>) remain for SDK/server use; capsule routes
accept either via requireSessionOrAPIKey.
- Host-agent JWTs (signed by JWT_SECRET) are unchanged — that channel
is wrenn-cp ↔ wrenn-agent and unrelated to user identity.
- Frontend client drops bearer-token plumbing, sends credentials and
the CSRF header on every mutating call.
- OpenAPI + dashboard host-registration docs updated.
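
The idle + absolute lifetime rule from the session service item can be illustrated like this. The 6h and 24h values come from the description; the Session struct, its field names, and the valid method are assumptions for the sketch, and the real service may track these timestamps differently (for example in Redis).

```go
package main

import (
	"fmt"
	"time"
)

const (
	idleTimeout     = 6 * time.Hour  // session dies after 6h without activity
	absoluteTimeout = 24 * time.Hour // session dies 24h after issue regardless
)

// Session is a hypothetical minimal session record.
type Session struct {
	CreatedAt  time.Time
	LastSeenAt time.Time
}

// valid reports whether the session is still usable at the given instant.
// Both limits must hold: activity cannot extend a session past 24h.
func (s Session) valid(now time.Time) bool {
	if now.Sub(s.CreatedAt) >= absoluteTimeout {
		return false
	}
	return now.Sub(s.LastSeenAt) < idleTimeout
}

func main() {
	start := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	s := Session{CreatedAt: start, LastSeenAt: start.Add(20 * time.Hour)}
	fmt.Println(s.valid(start.Add(23 * time.Hour))) // true: active 3h ago, under 24h
	fmt.Println(s.valid(start.Add(25 * time.Hour))) // false: past absolute lifetime
}
```
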
Rename CapsuleCreated→CapsuleCreate (and its sibling events) to action
verbs; add Outcome (success/error), Metadata, and Error fields to the
canonical Event. Introduce PublishTransient for ephemeral SSE-only
signals (capsule.state.changed) so dashboard transitions don't reach
webhook/telegram subscribers.
Audit logger now publishes the canonical event itself with the derived
outcome, collapsing the old "audit then separately publish" split.
Sandbox event consumer rebuilt around the unified stream: host-agent
callbacks are translated once into canonical events, then fan out via
DB reconciler, channel dispatcher, and SSE relay independently.
Document the channels subscription model in the README.
Introduce CapsuleStatus union + RESUMABLE_STATUSES / TRANSIENT_STATUSES
sets that mirror the backend state machine; the routes and SnapshotDialog
now derive button enablement from the sets instead of ad-hoc string
checks. Add disk_size_mb + metadata to the Capsule shape.
Add SSEEventKind union + isSSEEvent guard so malformed wire payloads can't
reach handlers via blind casts. Event stream reconnect now:
- retries with backoff when the ticket fetch itself fails (previously
gave up on a single 401/network blip),
- reconnects immediately on window 'online' and document visibilitychange
(back to visible) when the EventSource is not OPEN,
- subscribes to capsule.error.
openapi.yaml: align OAuth paths (/v1/auth/oauth → /auth/oauth to match
the actual mount point), document bearerAuth on capsule routes, fix
'capsulees' typos, and expand schemas for the new state machine surfaces
the frontend now consumes.
NewSSETicketStore now takes a context so its cleanup goroutine exits on
server shutdown instead of leaking for the process lifetime. Threaded
through api.New and pkg/cpserver/run.go.
SandboxEventConsumer learns sandbox.pause_failed / sandbox.resume_failed
event types and forwards TeamID from the publisher; server.go propagates
TeamID into the SSE broker so per-team subscribers receive failure
events. resumeInBackground now rolls resuming → paused on failure (was
resuming → error) so the user can retry without manual intervention.
pkg/service/sandbox: mirror internal/sandbox.MinTimeoutSec + clampTimeout
on the control plane so the DB row's timeout_sec agrees with what the
agent runs after its own silent clamp.
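
A minimal sketch of the clamp, using the MinTimeoutSec=60 value mentioned elsewhere in this log. It assumes every value below the minimum is raised to it; whether the real helper special-cases 0 (for example as "no timeout") is not stated here.

```go
package main

import "fmt"

// MinTimeoutSec mirrors the agent-side minimum so both sides agree.
const MinTimeoutSec = 60

// clampTimeout raises any requested timeout below the minimum.
// Assumption: no special meaning for 0 or negative values.
func clampTimeout(sec int) int {
	if sec < MinTimeoutSec {
		return MinTimeoutSec
	}
	return sec
}

func main() {
	fmt.Println(clampTimeout(10), clampTimeout(60), clampTimeout(300)) // 60 60 300
}
```

Applying the same function on the control plane before writing timeout_sec keeps the stored value identical to what the agent enforces.
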
Move CoW from sandboxes/{id}.cow to sandboxes/{id}/rootfs.cow so every
per-sandbox artifact lives under one parent. PauseSnapshotDir now aliases
SandboxDir; Pause stages the CoW into the staging dir before swapDir so
the swap carries it through.
Publish sb.client via atomic.Pointer so Exec/Pty/Process callers can load
without holding lifecycleMu; Pause's releaseRuntime stores nil, Resume
stores a fresh client. Funnel every caller through new activeClient()
that nil-checks after Load to close the pause-vs-exec race.
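
The lock-free client publication can be sketched as below (requires Go 1.19+ for atomic.Pointer). Client and Sandbox are placeholder types, and returning ErrNotRunning from activeClient is an assumption; the description only says the accessor nil-checks after Load.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var ErrNotRunning = errors.New("sandbox not running")

// Client is a placeholder for the per-sandbox guest client.
type Client struct{ name string }

// Sandbox publishes its client through an atomic pointer so readers
// never need the lifecycle mutex.
type Sandbox struct {
	client atomic.Pointer[Client]
}

// activeClient loads the pointer once and nil-checks the loaded value.
// Callers keep using the returned pointer, so a concurrent Pause that
// stores nil cannot turn it into a nil dereference mid-call.
func (sb *Sandbox) activeClient() (*Client, error) {
	c := sb.client.Load()
	if c == nil {
		return nil, ErrNotRunning
	}
	return c, nil
}

func main() {
	sb := &Sandbox{}
	sb.client.Store(&Client{name: "fresh"}) // what Resume would do
	_, err := sb.activeClient()
	fmt.Println(err) // <nil>

	sb.client.Store(nil) // what Pause's releaseRuntime would do
	_, err = sb.activeClient()
	fmt.Println(err) // sandbox not running
}
```
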
Replace string-sniffing for "not found" / "not running" with sentinel
errors (ErrNotFound, ErrNotRunning, ErrNotPaused, ErrInvalidRange) and a
single mapSandboxError switch in the hostagent server. Add
parseSandboxIDs helper for the repeated team+template UUID parse.
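
A sketch of the sentinel-error pattern. The four sentinel names come from the description; the error messages, the wrap helper, and the choice of HTTP status codes are illustrative assumptions (the hostagent server presumably maps to its own RPC codes).

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

var (
	ErrNotFound     = errors.New("sandbox not found")
	ErrNotRunning   = errors.New("sandbox not running")
	ErrNotPaused    = errors.New("sandbox not paused")
	ErrInvalidRange = errors.New("invalid range")
)

// wrap adds context while keeping the sentinel reachable via errors.Is.
func wrap(op string, err error) error {
	return fmt.Errorf("%s: %w", op, err)
}

// mapSandboxError is the single place that turns sentinels into codes.
// The status codes here are assumed for illustration.
func mapSandboxError(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrNotFound):
		return http.StatusNotFound
	case errors.Is(err, ErrNotRunning), errors.Is(err, ErrNotPaused):
		return http.StatusConflict
	case errors.Is(err, ErrInvalidRange):
		return http.StatusBadRequest
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	err := wrap("exec", ErrNotRunning)
	fmt.Println(err, mapSandboxError(err)) // exec: sandbox not running 409
}
```

Because errors.Is follows the %w chain, callers can add context freely without breaking the mapping, which string matching could not guarantee.
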
Rewrite ConnTracker off sync.WaitGroup onto an explicit counter + zeroCh
so Drain/ForceClose can select on cancellation and timeout without
leaking the waiter goroutine on repeated pause failures.
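
One way to build such a tracker is shown below. The field names, the "channel is closed while the count is zero" convention, and the method set are assumptions for this sketch; ForceClose is omitted. It also assumes new connections are blocked before Drain is called, as the pause flow does.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// ConnTracker counts in-flight connections without sync.WaitGroup, so
// waiting can be combined with cancellation in a single select.
type ConnTracker struct {
	mu     sync.Mutex
	count  int
	zeroCh chan struct{} // closed whenever count == 0
}

func NewConnTracker() *ConnTracker {
	ch := make(chan struct{})
	close(ch) // starts at zero connections
	return &ConnTracker{zeroCh: ch}
}

func (c *ConnTracker) Acquire() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.count == 0 {
		c.zeroCh = make(chan struct{}) // leaving zero: arm a fresh channel
	}
	c.count++
}

func (c *ConnTracker) Release() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.count == 0 {
		return // unbalanced Release: ignore rather than go negative
	}
	c.count--
	if c.count == 0 {
		close(c.zeroCh)
	}
}

// Drain waits for the count to reach zero or for ctx to end. No helper
// goroutine is spawned, so a timed-out Drain leaves nothing behind.
func (c *ConnTracker) Drain(ctx context.Context) bool {
	c.mu.Lock()
	ch := c.zeroCh
	c.mu.Unlock()
	select {
	case <-ch:
		return true
	case <-ctx.Done():
		return false
	}
}

func main() {
	tr := NewConnTracker()
	tr.Acquire()
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	fmt.Println(tr.Drain(ctx)) // false: one connection is still open
	tr.Release()
	fmt.Println(tr.Drain(context.Background())) // true
}
```
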
Add internal/sandbox/punch.go: post-snapshot SEEK_DATA scan that
fallocate-punches any 4 KiB block of zeros in CH memory-* files (guest
dirty-then-free pages CH writes verbatim). Run after both pause snapshot
and CreateSnapshot. Bump envd quiesce sleep 500ms → 1s so the kernel
fully flushes before CH dumps memory.
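
The zero-block detection step can be sketched portably as below. This only finds the ranges; the actual punching (fallocate with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE on Linux) and the SEEK_DATA walk that skips existing holes are left out. The hole type and zeroRuns name are hypothetical, and the sketch scans an in-memory buffer rather than a file.

```go
package main

import "fmt"

const blockSize = 4096 // 4 KiB, matching the punch granularity described above

// hole is a byte range that contains only zeros and could be punched.
type hole struct{ off, length int64 }

func isZero(b []byte) bool {
	for _, v := range b {
		if v != 0 {
			return false
		}
	}
	return true
}

// zeroRuns returns maximal runs of block-aligned, all-zero 4 KiB blocks.
// A trailing partial block is ignored.
func zeroRuns(data []byte) []hole {
	var runs []hole
	for off := 0; off+blockSize <= len(data); off += blockSize {
		if !isZero(data[off : off+blockSize]) {
			continue
		}
		// Extend the previous run when this block is adjacent to it.
		if n := len(runs); n > 0 && runs[n-1].off+runs[n-1].length == int64(off) {
			runs[n-1].length += blockSize
			continue
		}
		runs = append(runs, hole{off: int64(off), length: blockSize})
	}
	return runs
}

func main() {
	data := make([]byte, 4*blockSize)
	data[0] = 1              // block 0 holds data
	data[3*blockSize+10] = 7 // block 3 holds data
	fmt.Println(zeroRuns(data)) // [{4096 8192}]
}
```

Merging adjacent blocks into one run keeps the number of punch calls proportional to the number of zero regions rather than the number of blocks.
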
Add sandboxDirOverride threaded through snapshotMeta + restoreVMConfig:
sandboxes launched from snapshot templates carry the original source
sandbox's tmpfs path in CH's saved config.json, so every subsequent
restore must reuse it.
New createFromSnapshotTemplate path branches off Manager.Create when the
template directory contains a CH memory snapshot (state.json + config.json
+ rootfs.ext4). Mirrors the pause/resume restore mechanics — same UFFD
lazy memory + post-restore memory loader — but produces a fresh sandbox
per call (new ID, new slot, new CoW on the shared flattened rootfs).
Shared restore primitives extracted to restore.go (buildRestoreVMConfig,
launchRestoredVM, initAndStartMemoryLoader) and reused by resumeFromMeta.
Chain correctness: descendants of snapshot templates start the memory
loader so subsequent CreateSnapshot from them is self-contained.
Defensive guards:
- CreateSnapshot refuses to overwrite an existing template dir.
- DeleteSnapshot refuses when running sandboxes still reference it.
- TimeoutSec clamped to MinTimeoutSec=60 to keep TTL reaper well clear of
the post-create startup window.
- Snapshot routing skips minimal template even if a stray state.json lands.
vm.SandboxTmpDir / vm.SandboxSocketPath extracted so launchers don't
re-derive CH disk paths independently.
Pause and live-snapshot share one CH primitive (ch.pause + ch.snapshot +
ch.destroy/resume). Pause writes artefacts to a staging dir and
atomically swaps to avoid CH re-reading a memory-ranges file mid-rewrite
across pause-resume-pause chains. Resume uses
memory_restore_mode=ondemand backed by userfaultfd; CH lazily faults
pages from the source file. A new envd /memory/preload endpoint
materialises every physical page (one byte per page via /dev/mem,
fallback /proc/kcore) so a subsequent snapshot writes a self-contained
file instead of holes.
Sandbox manager refactor: lifecycle / pause / resume code extracted to
internal/sandbox/pause.go, leaving manager.go focused on the in-memory
state map and orchestration entry points (-871 / +72). Stale CH process
and dm-snapshot cleanup runs at agent startup (internal/vm/cleanup.go)
and via scripts/cleanup-stale.sh for operator use.
Host monitor honors the agent's reported per-sandbox status when
reconciling missing rows (so an agent-side pause during a CP
disconnect isn't silently promoted back to running). New
BulkRestoreMissingToStatus query replaces the running-only path.
Transient statuses (pausing/resuming/starting/stopping) defer
reconciliation to the next tick.
In-process broker fans out sandbox state events (created/paused/running/
destroyed) to connected SSE clients, filtered by team. Backend publishes
through the channels Publisher; an SSE relay subscribes to Redis Pub/Sub
and dispatches to subscribers. Browser auth uses short-lived tickets
issued via /v1/events/token; SDKs use header auth. Admin routes get a
parallel stream that sees all teams. Frontend dashboard and admin
capsule pages subscribe to push state changes instead of polling.
Sandbox event publishing moved out of AuditLogger into the service layer
so callbacks from the host agent and direct state changes share one
path.
Fix resource leaks, race conditions, and error handling across host
agent and control plane: proper sparse file cleanup on close error,
connect error wrapping for MakeDir, CoW file cleanup on pause failure,
per-sandbox VM directories, deferred map deletion to avoid race in VM
destroy, and goroutine launch for extension background workers.
Remove Firecracker-specific MMDS metadata fetching and the metrics host
module. CH communicates with the guest purely over TAP networking,
so MMDS (Firecracker's guest-facing metadata service) is no longer
needed.
- Remove src/host/ module (mmds.rs, metrics.rs)
- Remove reqwest dependency (was only used for MMDS HTTP calls)
- Remove --isnotfc CLI flag (no longer dual-mode)
- Simplify health endpoint and init handler
- Update state management for CH snapshot lifecycle
- Bump version to 0.3.0
When a host transitions from unreachable → online via heartbeat, trigger
ReconcileHost in a background goroutine so "missing" sandboxes are
resolved instantly instead of waiting up to 60s for the next monitor tick.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace synchronous RPC-based CP-host communication for sandbox
lifecycle operations (Create, Pause, Resume, Destroy) with an async
pattern. CP handlers now return 202 Accepted immediately, fire agent
RPCs in background goroutines, and publish state events to a Redis
Stream. A background consumer processes events as a fallback writer.
Agent-side auto-pause events are pushed to the CP via HTTP callback
(POST /v1/hosts/sandbox-events), keeping Redis internal to the CP.
All DB status transitions use conditional updates
(UpdateSandboxStatusIf, UpdateSandboxRunningIf) to prevent race
conditions between concurrent operations and background goroutines.
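
The conditional-update idea can be illustrated with an in-memory stand-in for the database. The Store type and updateStatusIf are hypothetical; the real queries presumably express the same compare-and-set as an UPDATE whose WHERE clause includes the expected status and report whether a row changed.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is an in-memory stand-in for the sandboxes table.
type Store struct {
	mu     sync.Mutex
	status map[string]string
}

// updateStatusIf changes the status only when the current value matches
// expected, and reports whether anything changed.
func (s *Store) updateStatusIf(id, expected, next string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur, ok := s.status[id]; !ok || cur != expected {
		return false
	}
	s.status[id] = next
	return true
}

func main() {
	s := &Store{status: map[string]string{"sb-1": "pausing"}}

	// The background goroutine finishes the pause first.
	fmt.Println(s.updateStatusIf("sb-1", "pausing", "paused")) // true

	// The fallback stream consumer replays the same transition: no-op.
	fmt.Println(s.updateStatusIf("sb-1", "pausing", "paused")) // false

	// A stale failure handler cannot clobber the newer state.
	fmt.Println(s.updateStatusIf("sb-1", "pausing", "error")) // false
}
```
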
The HostMonitor reconciler is kept at 60s as a safety net and extended
to handle transient statuses (starting, pausing, resuming, stopping).
Frontend updated to handle 202 responses with empty bodies and render
transient statuses with blue indicators.
- Cache terminal EndEvent on ProcessHandle so connect() can detect
already-exited processes instead of hanging forever on broadcast
receivers that missed the event. Subscribe before checking cache
to close the TOCTOU window.
- Protect sb.Status writes in Pause with m.mu to prevent data race
with concurrent readers (AcquireProxyConn, Exec, etc.).
- Restart metrics sampler in restoreRunning so a failed pause attempt
doesn't permanently kill sandbox metrics collection.
- Return dequeued non-input messages from coalescePtyInput instead of
dropping them, preventing silent loss of kill/resize signals during
typing bursts.
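
The coalescePtyInput fix in the last item can be sketched as follows. The ptyMsg type, its kind strings, and the function signature are assumptions; the point is only that a dequeued non-input message is handed back to the caller instead of being discarded.

```go
package main

import "fmt"

// ptyMsg is a hypothetical message type for a PTY session.
type ptyMsg struct {
	kind string // "input", "resize", "kill" (assumed kinds)
	data []byte
}

// coalescePtyInput merges first with any input messages that are already
// queued. If it dequeues a non-input message, it stops and returns that
// message as pending so the caller can still process it.
func coalescePtyInput(first []byte, ch <-chan ptyMsg) (merged []byte, pending *ptyMsg) {
	merged = append(merged, first...)
	for {
		select {
		case m, ok := <-ch:
			if !ok {
				return merged, nil // channel closed
			}
			if m.kind != "input" {
				return merged, &m // hand back instead of dropping
			}
			merged = append(merged, m.data...)
		default:
			return merged, nil // nothing else queued right now
		}
	}
}

func main() {
	ch := make(chan ptyMsg, 4)
	ch <- ptyMsg{kind: "input", data: []byte("s")}
	ch <- ptyMsg{kind: "resize"}
	ch <- ptyMsg{kind: "input", data: []byte("x")}

	merged, pending := coalescePtyInput([]byte("l"), ch)
	fmt.Printf("%q %s %d\n", merged, pending.kind, len(ch)) // "ls" resize 1
}
```

Stopping at the first non-input message also preserves ordering: input queued after a resize is not merged ahead of it.
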
Fast-exiting processes (e.g. echo) sent data/end events before
start() subscribed to the broadcast channels, causing the stream
to hang indefinitely and the exec RPC to time out with 502.
Move channel subscription into spawn_process, before reader/waiter
threads start, and return pre-subscribed receivers via SpawnedProcess.
The start() and connect() streaming RPCs blocked forever in the data
event loop because ProcessHandle retains a broadcast sender (needed for
reconnection via connect()), preventing the channel from closing.
Race data_rx against end_rx with tokio::select! so the stream terminates
when the process exits. Remaining buffered data is drained before
yielding the end event.
The /init handler's default_user mutation cloned the Defaults struct,
mutated the clone, then dropped it — the actual state was never updated.
This caused processes to always run as "root" regardless of the user
set via POST /init. Additionally, default_workdir was accepted in the
init request but never applied.
Wrap user and workdir fields in RwLock with accessor methods so mutations
propagate correctly through the shared AppState.
Restructure pause to: block new operations (StatusPausing), drain proxy
connections with 5s grace, force-close remaining via context cancellation,
drop page cache, inflate balloon, then freeze vCPUs. Previously connections
could arrive during the pause window and API operations weren't blocked.
Handle UFFD_EVENT_REMOVE/UNMAP/REMAP/FORK gracefully instead of crashing
the UFFD server. These events fire during balloon deflation on snapshot
restore, killing the page fault handler and preventing VM boot.
Also add ConnTracker.ForceClose() with cancellable context propagated
through the proxy handler, so lingering proxy connections are actively
terminated rather than left dangling.
Firecracker dumps the entire VM memory region regardless of guest
usage. A 20GB VM using 500MB still produces a ~20GB memfile because
freed pages retain stale data (non-zero blocks).
Inflate the balloon device before snapshot to reclaim free guest
memory. Balloon pages become zero from FC's perspective, allowing
ProcessMemfile to skip them. This reduces memfile size from ~20GB
to ~1-2GB for lightly-used VMs.
- Pause: read guest memory usage, inflate balloon to reclaim free
pages, wait 2s for guest kernel to process, then proceed
- Resume: deflate balloon to 0 after PostInit so guest gets full
memory back
- createFromSnapshot: same deflation since template snapshots
inherit inflated balloon state
- All balloon ops are best-effort with debug logging on failure
Remove hard 10s timeout from Firecracker HTTP client — callers already
pass context.Context with appropriate deadlines, and 20GB+ memfile
writes easily exceed 10s.
Ensure CoW file is at least as large as the origin rootfs. Previously,
WRENN_DEFAULT_ROOTFS_SIZE=30Gi expanded the base image to 30GB but the
default 5GB CoW could not hold all writes, causing dm-snapshot
invalidation and EIO on all guest I/O.
Destroy frozen VMs in resumeOnError instead of leaving zombies that
report "running" but can't execute. Use fresh context for the resume
attempt so a cancelled caller context doesn't falsely trigger destroy.
Increase CP→Agent ResponseHeaderTimeout from 45s to 5min and
PrepareSnapshot timeout from 3s to 30s for large-memory VMs.
After failed pause, ping agent to detect destroyed sandboxes and mark
DB status as "error" instead of reverting to "running".
Metrics data was only fetched after Chart.js dynamic import completed,
leaving graphs empty until the first poll interval fired. Now
loadMetrics() runs in parallel with the Chart.js import, and
initCharts() resets the dedup key so pre-fetched data populates
newly created chart instances.
Add Owner column to admin templates table, resolving team IDs to names
via admin teams API. Disable delete for non-platform templates and the
minimal template, with contextual tooltips explaining why.
Add PUT /v1/admin/users/{id}/admin endpoint and frontend UI for
granting and revoking platform admin status. Uses atomic conditional
SQL (RevokeUserAdmin) to prevent race conditions that could remove
the last admin. Includes idempotency check, audit logging, and
confirmation dialog with self-demotion warning.
Linux keeps freed memory as page cache, which Firecracker snapshots
as non-zero blocks. A 16GB VM with 12GB stale cache would write all
12GB to disk. Dropping pagecache (not dentries/inodes) in
/snapshot/prepare before blocking the reclaimer shrinks snapshots
to actual working set size with minimal resume latency impact.