Files
certctl/cmd
shankar0123 183c56f6c5 fix(agent): SCALE-006 — startup + recurring jitter on heartbeat and poll loops
Sprint 2 unified-master-audit closure. Pre-fix the agent started
its heartbeat + poll loops on bare time.NewTicker cadence with no
startup jitter:

    heartbeatTicker := time.NewTicker(a.heartbeatInterval)
    pollTicker := time.NewTicker(a.pollInterval)
    a.sendHeartbeat(ctx)   // fires immediately, in lockstep
    a.pollForWork(ctx)     // ditto

A mass restart (rolling K8s deploy, control-plane reboot, scheduled
fleet bounce) produced a thundering herd — 5K agents booting in a
10-second window all hit /heartbeat in lockstep, then /poll, every
interval forever afterward.

Fix:
  - Per-agent startup jitter ∈ [0, interval) drawn fresh from
    math/rand/v2 (no cryptographic strength needed) before the first
    heartbeat and first poll. Heartbeat and poll jitters are drawn
    independently so a single seed doesn't create a secondary
    correlation pattern.
  - time.NewTicker swapped for the existing in-tree
    internal/scheduler.JitteredTicker primitive (±10% per-tick
    envelope, fresh draw per tick to prevent drift compounding).
    Same pattern as every server-side scheduler.go loop.
  - Startup-jitter Sleeps are ctx-aware so a sigint-during-startup
    exits cleanly rather than hanging.

The select cases that read heartbeatTicker.C / pollTicker.C are
unchanged — JitteredTicker.C is a chan time.Time, identical shape
to time.Ticker.C.

Discovery ticker is left as bare time.NewTicker (audit didn't cite
it; changing it would expand scope).

Closes SCALE-006.
2026-05-16 04:01:59 +00:00
..