Remove Firecracker-specific MMDS metadata fetching and metrics host
module. CH communicates with the guest purely over TAP networking,
so MMDS (Firecracker's metadata service via MMDS address) is no longer
needed.
- Remove src/host/ module (mmds.rs, metrics.rs)
- Remove reqwest dependency (was only used for MMDS HTTP calls)
- Remove --isnotfc CLI flag (no longer dual-mode)
- Simplify health endpoint and init handler
- Update state management for CH snapshot lifecycle
- Bump version to 0.3.0
- Cache terminal EndEvent on ProcessHandle so connect() can detect
already-exited processes instead of hanging forever on broadcast
receivers that missed the event. Subscribe before checking cache
to close the TOCTOU window.
- Protect sb.Status writes in Pause with m.mu to prevent data race
with concurrent readers (AcquireProxyConn, Exec, etc.).
- Restart metrics sampler in restoreRunning so a failed pause attempt
doesn't permanently kill sandbox metrics collection.
- Return dequeued non-input messages from coalescePtyInput instead of
dropping them, preventing silent loss of kill/resize signals during
typing bursts.
Fast-exiting processes (e.g. echo) sent data/end events before
start() subscribed to the broadcast channels, causing the stream
to hang indefinitely and the exec RPC to time out with 502.
Move channel subscription into spawn_process, before reader/waiter
threads start, and return pre-subscribed receivers via SpawnedProcess.
The start() and connect() streaming RPCs blocked forever in the data
event loop because ProcessHandle retains a broadcast sender (needed for
reconnection via connect()), preventing the channel from closing.
Race data_rx against end_rx with tokio::select! so the stream terminates
when the process exits. Remaining buffered data is drained before
yielding the end event.
The /init handler's default_user mutation cloned the Defaults struct,
mutated the clone, then dropped it — the actual state was never updated.
This caused processes to always run as "root" regardless of the user
set via POST /init. Additionally, default_workdir was accepted in the
init request but never applied.
Wrap user and workdir fields in RwLock with accessor methods so mutations
propagate correctly through the shared AppState.
Linux keeps freed memory as page cache, which Firecracker snapshots
as non-zero blocks. A 16GB VM with 12GB stale cache would write all
12GB to disk. Dropping pagecache (not dentries/inodes) in
/snapshot/prepare before blocking the reclaimer shrinks snapshots
to actual working set size with minimal resume latency impact.
Three issues fixed:
1. Memory metrics read host-side VmRSS of the Firecracker process,
which includes guest page cache and never decreases. Replaced
readMemRSS(fcPID) with readEnvdMemUsed(client) that queries
envd's /metrics endpoint for guest-side total - MemAvailable.
This matches neofetch and reflects actual process memory.
2. Added Firecracker balloon device (deflate_on_oom, 5s stats) and
envd-side periodic page cache reclaimer (drop_caches when >80%
used). Reclaimer is gated by snapshot_in_progress flag with
sync() before freeze to prevent memory corruption during pause.
3. Sampling interval 500ms → 1s, ring buffer capacities adjusted
to maintain same time windows. Reduces per-host HTTP load from
240 calls/sec to 120 calls/sec at 120 capsules.
Also: maxDiffGenerations 8 → 1 (merge every re-pause since UFFD
lazy-loads anyway), envd mem_used formula uses total - available.
Three bugs fixed:
1. PTY connections failed because home directory was hardcoded as
/home/{username} instead of reading from /etc/passwd. For root,
this produced /home/root/ which doesn't exist — CWD validation
rejected every PTY Start request without explicit cwd. Fixed all
6 locations to use user.dir from nix::unistd::User.
2. MMDS polling silently failed to parse metadata because the
logs_collector_address field lacked #[serde(default)]. The host
agent only sends instanceID + envID — missing "address" field
caused every deserialize attempt to fail, so .WRENN_SANDBOX_ID
and .WRENN_TEMPLATE_ID were never written. Also added error
logging and create_dir_all before file writes.
3. Metrics CPU values were non-deterministic because a fresh
sysinfo::System was created per request with a 100ms sleep
between reads. Replaced with a background thread that samples
CPU at fixed 1-second intervals via a persistent System instance,
matching gopsutil's internal caching behavior. Metrics endpoint
now reads cached atomic values — no blocking, consistent window.
Also: close master PTY fd in child pre_exec, add process.Start
request logging, bump version to 0.2.0.