Linux keeps freed memory as page cache, which Firecracker snapshots
as non-zero blocks. A 16GB VM with 12GB stale cache would write all
12GB to disk. Dropping pagecache (not dentries/inodes) in
/snapshot/prepare before blocking the reclaimer shrinks snapshots
to actual working set size with minimal resume latency impact.
Three issues fixed:
1. Memory metrics read host-side VmRSS of the Firecracker process,
which includes guest page cache and never decreases. Replaced
readMemRSS(fcPID) with readEnvdMemUsed(client) that queries
envd's /metrics endpoint for guest-side total - MemAvailable.
This matches neofetch and reflects actual process memory.
2. Added Firecracker balloon device (deflate_on_oom, 5s stats) and
envd-side periodic page cache reclaimer (drop_caches when >80%
used). Reclaimer is gated by snapshot_in_progress flag with
sync() before freeze to prevent memory corruption during pause.
3. Sampling interval 500ms → 1s, ring buffer capacities adjusted
to maintain same time windows. Reduces per-host HTTP load from
240 calls/sec to 120 calls/sec at 120 capsules.
Also: maxDiffGenerations 8 → 1 (merge every re-pause since UFFD
lazy-loads anyway), envd mem_used formula uses total - available.
Three bugs fixed:
1. PTY connections failed because home directory was hardcoded as
/home/{username} instead of reading from /etc/passwd. For root,
this produced /home/root/ which doesn't exist — CWD validation
rejected every PTY Start request without explicit cwd. Fixed all
6 locations to use user.dir from nix::unistd::User.
2. MMDS polling silently failed to parse metadata because the
logs_collector_address field lacked #[serde(default)]. The host
agent only sends instanceID + envID — missing "address" field
caused every deserialize attempt to fail, so .WRENN_SANDBOX_ID
and .WRENN_TEMPLATE_ID were never written. Also added error
logging and create_dir_all before file writes.
3. Metrics CPU values were non-deterministic because a fresh
sysinfo::System was created per request with a 100ms sleep
between reads. Replaced with a background thread that samples
CPU at fixed 1-second intervals via a persistent System instance,
matching gopsutil's internal caching behavior. Metrics endpoint
now reads cached atomic values — no blocking, consistent window.
Also: close master PTY fd in child pre_exec, add process.Start
request logging, bump version to 0.2.0.
The Go envd guest agent (`envd/`) is fully replaced by the Rust
implementation (`envd-rs/`). This commit removes the Go module and
updates all references across the codebase.
Makefile: remove ENVD_DIR, VERSION_ENVD, build-envd-go, dev-envd-go,
and Go envd from proto/fmt/vet/tidy/clean targets. Add static-link
verification to build-envd.
Host agent: rewrite snapshot quiesce comments that referenced Go GC
and page allocator corruption — no longer applicable with Rust envd.
Tighten envdclient to expect HTTP 200 (not 204) from health and file
upload endpoints, and require JSON version response from FetchVersion.
Remove NOTICE (no e2b-derived code remains). Update CLAUDE.md and
README.md to reflect Rust envd architecture.