1
0
forked from wrenn/wrenn
Files
wrenn-releases/envd/internal/api/snapshot.go
pptx704 962860ba74 Pre-pause snapshot signal to prevent Go runtime crash on restore
envd crashes with "fatal error: bad summary data" after Firecracker
snapshot/restore because the page allocator radix tree is inconsistent
when vCPUs are frozen mid-allocation. The port scanner goroutine
allocates heavily every second, making it the primary trigger.

Add POST /snapshot/prepare to envd — the host agent calls it before
vm.Pause to quiesce continuous goroutines and force GC. On restore,
PostInit restarts the port subsystem via the existing /init endpoint.

- New PortSubsystem abstraction with Start/Stop/Restart lifecycle
- Context-based goroutine cancellation (replaces irreversible channel close)
- Context-aware Signal to prevent scanner/forwarder deadlock
- Fix forwarder goroutine leak (was spinning forever on closed channel)
- Kill socat children on stop to prevent orphans across snapshots
- Fix double cmd.Wait panic (exec.Command instead of CommandContext)
2026-04-13 05:21:10 +06:00

26 lines
730 B
Go

// SPDX-License-Identifier: Apache-2.0
// Modifications by M/S Omukk
package api
import (
"net/http"
)
// PostSnapshotPrepare quiesces continuous goroutines (port scanner, forwarder)
// and forces a GC cycle before Firecracker takes a VM snapshot. This ensures
// the Go runtime's page allocator is in a consistent state when vCPUs are frozen.
//
// Called by the host agent as a best-effort signal before vm.Pause().
func (a *API) PostSnapshotPrepare(w http.ResponseWriter, r *http.Request) {
defer r.Body.Close()
if a.portSubsystem != nil {
a.portSubsystem.Stop()
a.logger.Info().Msg("snapshot/prepare: port subsystem quiesced")
}
w.Header().Set("Cache-Control", "no-store")
w.WriteHeader(http.StatusNoContent)
}