feat(ratelimit): Phase 13 Sprint 13.3 — wire backend selector + scheduler janitor + docs + helm (ARCH-M1 closure complete)

Phase 13 Sprint 13.3: the completion half of the ARCH-M1 substantive
close. Sprint 13.2 shipped the Postgres-backed sliding-window limiter +
multi-replica integration test; Sprint 13.3 wires the 6 call sites in
cmd/server/main.go through the operator-chosen backend selector, adds
the rate_limit_buckets scheduler janitor sweep, rewrites the
observability doc, exposes the env-var in the helm chart, and promotes
the multi-replica integration test to a required CI status check.

Signature ground-truth (sprint 13.2 + 13.3)
===========================================

Prompt-template signatures: `Allow(key string) error` and "5 call
sites." Actual repo: `Allow(key string, now time.Time) error` and 6
NewSlidingWindowLimiter call sites in cmd/server/main.go (the prompt
miscounted the second EST per-principal arm). Per CLAUDE.md "the repo
is truth," matched the live shape.

What changed
============

internal/config/server.go (+40 LOC):
- Added `SlidingWindowBackend string` +
  `SlidingWindowJanitorInterval time.Duration` to RateLimitConfig with
  full operator-facing documentation of the two valid values
  (memory|postgres) + when-to-use-which decision tree.

internal/config/config.go (+27 LOC):
- Load() reads CERTCTL_RATE_LIMIT_BACKEND (default "memory") +
  CERTCTL_RATE_LIMIT_JANITOR_INTERVAL (default 5m).
- Validate() rejects anything other than ""/"memory"/"postgres"
  (empty = memory equivalence for test-built Configs that bypass
  Load()). Janitor interval must be ≥ 1 minute when set.
- Failure modes return clear ::error:: with the env-var name + the
  valid values, so an operator typo ("postgress" → memory in a
  3-replica cluster) fails fast at startup.

internal/ratelimit/factory.go (NEW, 67 LOC):
- NewLimiter(backend, db, maxN, window, mapCap) Limiter: single
  factory the 6 cmd/server/main.go call sites route through.
- Drop-in signature: same maxN/window/mapCap as
  NewSlidingWindowLimiter (mapCap accepted + ignored for postgres; the
  rate_limit_buckets table grows until the janitor sweeps).
- Defensive panic on unknown backend (config.Validate is SoT; this is
  belt-and-suspenders).

internal/ratelimit/postgres_gc.go (NEW, 73 LOC):
- PostgresGC struct + NewPostgresGC + GarbageCollect.
- Single-statement DELETE FROM rate_limit_buckets WHERE
  updated_at < NOW() - maxWindow. Idempotent.
- maxWindow <= 0 is a no-op (operator opt-out).

internal/scheduler/scheduler.go (+90 LOC):
- New RateLimitGarbageCollector interface (mirrors the
  ACMEGarbageCollector / SessionGarbageCollector contracts).
- rateLimitGC field + rateLimitGCInterval + rateLimitGCRunning on
  Scheduler.
- SetRateLimitGarbageCollector(gc) + SetRateLimitGCInterval(d)
  setters following the existing acmeGC/sessionGC pattern.
- rateLimitGCLoop(): JitteredTicker + atomic.Bool guard + per-tick
  context.WithTimeout(1m). Logs the row count at Debug when a sweep
  deleted at least one row.
- Loop counted in the Start() WaitGroup only when the GC is non-nil;
  cmd/server/main.go skips SetRateLimitGarbageCollector when
  backend=memory so the loop never launches for that case.

cmd/server/main.go (35 LOC diff):
- All 6 ratelimit.NewSlidingWindowLimiter call sites now route through
  ratelimit.NewLimiter(cfg.RateLimit.SlidingWindowBackend, db, ...).
  Grep verification post-fix returns ZERO hits.
- Six sites: breakglass loginLimiter (580), ocspLimiter (1003),
  exportLimiter (1068), EST failed-basic (1535), EST per-principal
  SCEP-mTLS arm (1591), EST per-principal SCEP arm (1613). The
  intune.NewPerDeviceRateLimiter site at line 1823 stays unmoved; its
  inner type-alias wrapper is outside the prompt's scope
  (cmd/server/*.go only).
- Conditionally constructs PostgresGC + wires the scheduler janitor
  when backend=postgres; logs the wiring decision either way so
  operators see "rate-limit GC sweep enabled (postgres backend)" or
  "in-memory backend self-prunes" in the boot log.

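The postgres_gc.go sweep described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `bucketGC`, `execer`, `fakeDB`, `demoSweep`, and the exact SQL text (including the use of `make_interval`) are assumptions made for the example. Only the table name, the `updated_at` comparison, the `(int64, error)` return shape, and the "maxWindow <= 0 is a no-op" rule come from the commit message.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// execer is the subset of *sql.DB the sweep needs; *sql.DB satisfies it.
type execer interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
}

// bucketGC is a hypothetical stand-in for the commit's PostgresGC.
type bucketGC struct {
	db        execer
	maxWindow time.Duration
}

// sweepSQL is an assumed phrasing of the single-statement DELETE.
const sweepSQL = `DELETE FROM rate_limit_buckets
WHERE updated_at < NOW() - make_interval(secs => $1)`

// GarbageCollect deletes stale buckets and reports how many rows went away.
// Running it twice in a row is harmless: the second call finds nothing to delete.
func (g *bucketGC) GarbageCollect(ctx context.Context) (int64, error) {
	if g.maxWindow <= 0 {
		return 0, nil // operator opt-out: never touch the database
	}
	res, err := g.db.ExecContext(ctx, sweepSQL, g.maxWindow.Seconds())
	if err != nil {
		return 0, fmt.Errorf("rate-limit gc: %w", err)
	}
	return res.RowsAffected()
}

// fakeResult and fakeDB let the sketch run without a real Postgres.
type fakeResult int64

func (r fakeResult) LastInsertId() (int64, error) { return 0, nil }
func (r fakeResult) RowsAffected() (int64, error) { return int64(r), nil }

type fakeDB struct {
	calls int
	rows  int64
}

func (f *fakeDB) ExecContext(_ context.Context, _ string, _ ...any) (sql.Result, error) {
	f.calls++
	return fakeResult(f.rows), nil
}

// demoSweep runs one sweep against the fake and reports rows deleted
// plus how many statements reached the database.
func demoSweep(staleRows int64, window time.Duration) (deleted int64, calls int) {
	db := &fakeDB{rows: staleRows}
	gc := &bucketGC{db: db, maxWindow: window}
	deleted, _ = gc.GarbageCollect(context.Background())
	return deleted, db.calls
}

func main() {
	fmt.Println(demoSweep(3, 24*time.Hour)) // sweep enabled
	fmt.Println(demoSweep(3, 0))            // opt-out: no statement issued
}
```

Taking a narrow interface instead of `*sql.DB` is a choice made here for testability; the real constructor may accept the concrete type.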
internal/api/handler/{est,export,certificates,auth_breakglass}.go:
- Replaced 5 *ratelimit.SlidingWindowLimiter field/Setter types with
  ratelimit.Limiter (the interface). Allow() satisfies the same call
  shape on both backends; the in-memory tests that construct
  *SlidingWindowLimiter still compile because the concrete type
  satisfies the interface (compile-time check in
  internal/ratelimit/limiter.go pins this).

docs/operator/observability.md (176 LOC diff):
- Replaced the "per-process, in-memory, reset-on-restart, not shared
  across replicas" paragraph with the new configurable-backend
  section: operator decision tree, backend internals (memory vs
  postgres), janitor description, falsifiable closure proof (the
  Sprint 13.2 integration test name + invocation), helm chart wiring
  example.
- Updated inventory to reflect the actual handler file paths + actual
  cap configurations (the prior doc said "60s window" for several
  limiters that actually use 60m / 24h windows).
- Doc smoke confirmed: grep -c 'per-process, in-memory,
  reset-on-restart' docs/operator/observability.md = 0.

deploy/helm/certctl/values.yaml + templates/server-configmap.yaml +
templates/server-deployment.yaml:
- Exposed server.rateLimiting.backend (default "memory") +
  server.rateLimiting.janitorInterval (default "5m") under the
  existing rateLimiting block.
- ConfigMap renders both as rate-limit-backend +
  rate-limit-janitor-interval keys.
- Deployment wires CERTCTL_RATE_LIMIT_BACKEND +
  CERTCTL_RATE_LIMIT_JANITOR_INTERVAL env vars from the configmap.
- Helm render: `helm template deploy/helm/certctl --set
  server.rateLimiting.backend=postgres` shows the env-var on the
  server-deployment.yaml output.

.github/workflows/ci.yml (+12 LOC):
- Added a new step in the Go Build & Test job that runs the Sprint
  13.2 multi-replica integration test
  (TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas) with
  -tags=integration -race -timeout=300s. Fails the CI status check if
  the cross-replica row lock ever stops arbitrating across replicas:
  the ARCH-M1 closure regression gate.

Verification (all green locally; postgres integration via CI)
=============================================================

$ grep -nE 'NewSlidingWindowLimiter' cmd/server/*.go
(zero hits; Sprint 13.3 receipt)

$ go test -short -count=1 \
    ./internal/config/... ./internal/ratelimit/... \
    ./internal/scheduler/... ./internal/api/handler/... \
    ./cmd/server/...
ok internal/config 1.177s
ok internal/ratelimit 0.007s
ok internal/scheduler 9.165s
ok internal/api/handler 6.245s
ok cmd/server 0.390s

$ staticcheck ./internal/ratelimit/... ./internal/scheduler/... \
    ./internal/config/... ./internal/api/handler/... ./cmd/server/...
(clean)

$ gofmt -l internal/ cmd/server/
(clean)

$ grep -c 'per-process, in-memory, reset-on-restart' \
    docs/operator/observability.md
0
(doc smoke; the audit's verbatim phrasing is gone)

$ bash scripts/ci-guards/G-3-env-docs-drift.sh
G-3 env-docs-drift: clean.

$ bash scripts/ci-guards/complete-path-config-coverage.sh
OK — every CERTCTL_* env var (197) has at least one
non-config-package consumer.

Selector contract verified: config.Validate() rejects any value other
than ""/memory/postgres at startup with a clear error message.

Sprint 13.4 next (ARCH-H1 OpenAPI authoring batch 1) is on a different
axis; ARCH-M1 closure is complete with this commit modulo the Sprint
13.7 audit-HTML flip + zero-floor pin.

Closes: ARCH-M1 substantive remediation. The cross-replica
rate-limit-cap-enforcement gap that the audit recommended deferring to
v3 is closed; operators with server.replicas > 1 flip
CERTCTL_RATE_LIMIT_BACKEND=postgres and get exactly-cap enforcement
across the cluster (proved by the multi-replica integration test now
gating CI).
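The selector contract stated in this message (accept ""/memory/postgres, reject everything else, janitor interval at least 1 minute when set) might look something like the sketch below. Function names, constants, and error wording are hypothetical and are not taken from internal/config; treating a zero interval as "unset" is also an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed constant names for the two documented backend values.
const (
	backendMemory   = "memory"
	backendPostgres = "postgres"
)

// validateBackend accepts "", "memory", or "postgres". The empty string is
// allowed so Configs built directly in tests (bypassing Load) still validate.
func validateBackend(v string) error {
	switch v {
	case "", backendMemory, backendPostgres:
		return nil
	default:
		return fmt.Errorf(
			"CERTCTL_RATE_LIMIT_BACKEND=%q is invalid; valid values: %s|%s",
			v, backendMemory, backendPostgres)
	}
}

// effectiveBackend maps the empty string to the memory default.
func effectiveBackend(v string) string {
	if v == "" {
		return backendMemory
	}
	return v
}

// validateJanitorInterval treats zero as "unset" (an assumption) and
// otherwise requires at least one minute.
func validateJanitorInterval(d time.Duration) error {
	if d != 0 && d < time.Minute {
		return fmt.Errorf(
			"CERTCTL_RATE_LIMIT_JANITOR_INTERVAL=%s is too short; minimum is 1m", d)
	}
	return nil
}

func main() {
	for _, v := range []string{"", "memory", "postgres", "postgress"} {
		fmt.Printf("%q -> backend=%q err=%v\n", v, effectiveBackend(v), validateBackend(v))
	}
	fmt.Println(validateJanitorInterval(30 * time.Second))
	fmt.Println(validateJanitorInterval(5 * time.Minute))
}
```

Rejecting the typo at startup matters because a silent fallback to memory would leave each replica enforcing its own cap.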
2026-06-07 13:51:36 +00:00 · 2026-05-14 11:52:13 +00:00
parent c8347d742d
commit a41fc2d75c
15 changed files with 516 additions and 61 deletions
@@ -103,6 +103,21 @@ type BCLReplayGarbageCollector interface {
 	SweepExpired(ctx context.Context, now time.Time) (int, error)
 }

+// RateLimitGarbageCollector sweeps stale rows from the
+// rate_limit_buckets table introduced in migration 000046. Phase 13
+// Sprint 13.3 (ARCH-M1 closure completion) — wired only when
+// CERTCTL_RATE_LIMIT_BACKEND=postgres. Concrete impl is
+// *ratelimit.PostgresGC. Mirrors the ACMEGarbageCollector +
+// SessionGarbageCollector contracts so the scheduler reuses the same
+// atomic.Bool + WithTimeout + ticker pattern as the existing GC loops.
+//
+// Returns the row count to surface via observability logs (matches
+// SessionGarbageCollector's shape — the operator wants to see
+// "how many buckets did the sweep delete" in steady-state monitoring).
+type RateLimitGarbageCollector interface {
+	GarbageCollect(ctx context.Context) (int64, error)
+}
+
 // JobReaperService defines the interface for job timeout reaping used by the scheduler.
 type JobReaperService interface {
 	ReapTimedOutJobs(ctx context.Context, csrTTL, approvalTTL time.Duration) error
@@ -130,6 +145,7 @@ type Scheduler struct {
 	acmeGC                ACMEGarbageCollector
 	sessionGC             SessionGarbageCollector
 	bclReplayGC           BCLReplayGarbageCollector
+	rateLimitGC           RateLimitGarbageCollector
 	jobReaper             JobReaperService
 	logger                *slog.Logger

@@ -149,6 +165,7 @@ type Scheduler struct {
 	jobTimeoutInterval            time.Duration
 	acmeGCInterval                time.Duration
 	sessionGCInterval             time.Duration
+	rateLimitGCInterval           time.Duration
 	// agentOfflineJobTTL: per-tick threshold for reaping Running jobs whose
 	// owning agent has been silent. Bundle C / Audit M-016. Defaults below.
 	agentOfflineJobTTL      time.Duration
@@ -171,6 +188,7 @@ type Scheduler struct {
 	jobTimeoutRunning            atomic.Bool
 	acmeGCRunning                atomic.Bool
 	sessionGCRunning             atomic.Bool
+	rateLimitGCRunning           atomic.Bool

 	// Graceful shutdown: wait for in-flight work to complete
 	wg sync.WaitGroup
@@ -209,6 +227,7 @@ func NewScheduler(
 		jobTimeoutInterval:            10 * time.Minute,
 		acmeGCInterval:                1 * time.Minute,
 		sessionGCInterval:             1 * time.Hour,
+		rateLimitGCInterval:           5 * time.Minute,
 		// 5 minutes is 5×agentHealthCheckInterval default of 1m; an agent
 		// must miss multiple heartbeats before its in-flight jobs are reaped.
 		agentOfflineJobTTL: 5 * time.Minute,
@@ -365,6 +384,29 @@ func (s *Scheduler) SetSessionGCInterval(d time.Duration) {
 	s.sessionGCInterval = d
 }

+// SetRateLimitGarbageCollector wires the Phase 13 Sprint 13.3 rate-
+// limit bucket GC. Optional; nil disables the loop (which is the
+// correct behavior when CERTCTL_RATE_LIMIT_BACKEND=memory — the
+// in-memory backend's prune-on-Allow path keeps buckets short-lived
+// without a separate sweep).
+//
+// Concrete impl is *ratelimit.PostgresGC, constructed in
+// cmd/server/main.go only when the postgres backend is selected.
+func (s *Scheduler) SetRateLimitGarbageCollector(gc RateLimitGarbageCollector) {
+	s.rateLimitGC = gc
+}
+
+// SetRateLimitGCInterval configures the interval at which the rate-
+// limit GC sweep runs. Default 5m. Wire:
+// CERTCTL_RATE_LIMIT_JANITOR_INTERVAL. Zero or negative values are
+// ignored.
+func (s *Scheduler) SetRateLimitGCInterval(d time.Duration) {
+	if d <= 0 {
+		return
+	}
+	s.rateLimitGCInterval = d
+}
+
 // SetAgentOfflineJobTTL sets the threshold past which a Running job whose
 // owning agent has gone silent is reaped to Failed. Bundle C / Audit M-016.
 // Zero or negative values are ignored (the default of 5 minutes is kept).
@@ -426,6 +468,9 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
 		if s.sessionGC != nil {
 			loopCount++
 		}
+		if s.rateLimitGC != nil {
+			loopCount++
+		}
 		s.wg.Add(loopCount)

 		go func() { defer s.wg.Done(); s.renewalCheckLoop(ctx) }()
@@ -457,6 +502,9 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
 		if s.sessionGC != nil {
 			go func() { defer s.wg.Done(); s.sessionGCLoop(ctx) }()
 		}
+		if s.rateLimitGC != nil {
+			go func() { defer s.wg.Done(); s.rateLimitGCLoop(ctx) }()
+		}

 		// Signal that all loops are launched
 		close(startedChan)
@@ -1247,3 +1295,45 @@ func (s *Scheduler) sessionGCLoop(ctx context.Context) {
 		}
 	}
 }
+
+// rateLimitGCLoop runs every rateLimitGCInterval and invokes
+// RateLimitGarbageCollector.GarbageCollect, which sweeps stale rows
+// from the rate_limit_buckets table introduced in Phase 13 Sprint
+// 13.2's migration 000046.
+//
+// Wired only when CERTCTL_RATE_LIMIT_BACKEND=postgres (the in-memory
+// backend's prune-on-Allow path keeps buckets short-lived without a
+// separate sweep — cmd/server/main.go skips SetRateLimitGarbageCollector
+// for that case so this loop never launches).
+//
+// Phase 13 Sprint 13.3 closure. The atomic.Bool guard + per-tick
+// context.WithTimeout match every other GC loop's pattern.
+func (s *Scheduler) rateLimitGCLoop(ctx context.Context) {
+	ticker := NewJitteredTicker(s.rateLimitGCInterval, DefaultSchedulerJitter)
+	defer ticker.Stop()
+
+	for {
+		select {
+		case <-ctx.Done():
+			return
+		case <-ticker.C:
+			if !s.rateLimitGCRunning.CompareAndSwap(false, true) {
+				s.logger.Warn("rate-limit GC sweep still running, skipping tick")
+				continue
+			}
+			s.wg.Add(1)
+			go func() {
+				defer s.wg.Done()
+				defer s.rateLimitGCRunning.Store(false)
+				// 1-minute timeout matches acme + session GC loops.
+				opCtx, cancel := context.WithTimeout(ctx, time.Minute)
+				defer cancel()
+				if n, err := s.rateLimitGC.GarbageCollect(opCtx); err != nil {
+					s.logger.Warn("rate-limit gc sweep failed (next tick will retry)", "error", err)
+				} else if n > 0 {
+					s.logger.Debug("rate-limit gc swept stale buckets", "rows", n)
+				}
+			}()
+		}
+	}
+}
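The overlap guard used by rateLimitGCLoop can be exercised in isolation. The sketch below is a simplified, synchronous variant under stated assumptions: `guardedSweep`, `tryRun`, `blockingGC`, and `demoOverlap` are hypothetical names, and the real loop launches each sweep in its own goroutine from a jittered ticker rather than calling it inline. What it keeps from the diff is the `atomic.Bool` CompareAndSwap gate and the per-sweep `context.WithTimeout`.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// sweeper matches the shape of RateLimitGarbageCollector in the diff.
type sweeper interface {
	GarbageCollect(ctx context.Context) (int64, error)
}

// guardedSweep lets at most one sweep run at a time.
type guardedSweep struct {
	running atomic.Bool
	gc      sweeper
	timeout time.Duration
}

// tryRun reports ran=false when a previous sweep is still in flight,
// which corresponds to the "skipping tick" branch in the loop.
func (g *guardedSweep) tryRun(ctx context.Context) (ran bool, n int64, err error) {
	if !g.running.CompareAndSwap(false, true) {
		return false, 0, nil
	}
	defer g.running.Store(false)
	opCtx, cancel := context.WithTimeout(ctx, g.timeout)
	defer cancel()
	n, err = g.gc.GarbageCollect(opCtx)
	return true, n, err
}

// blockingGC is a test double that stays "in flight" until released.
type blockingGC struct {
	started chan struct{} // buffered; signals that a sweep has begun
	release chan struct{} // closed to let sweeps finish
}

func (b *blockingGC) GarbageCollect(ctx context.Context) (int64, error) {
	select {
	case b.started <- struct{}{}:
	default:
	}
	select {
	case <-b.release:
		return 7, nil
	case <-ctx.Done():
		return 0, ctx.Err()
	}
}

// demoOverlap starts one sweep, attempts a second while the first is still
// running, then attempts a third after the first has finished.
func demoOverlap() (firstRan, secondRan, thirdRan bool) {
	gc := &blockingGC{started: make(chan struct{}, 1), release: make(chan struct{})}
	g := &guardedSweep{gc: gc, timeout: time.Minute}
	done := make(chan bool)
	go func() {
		ran, _, _ := g.tryRun(context.Background())
		done <- ran
	}()
	<-gc.started // the first sweep now holds the guard
	secondRan, _, _ = g.tryRun(context.Background())
	close(gc.release)
	firstRan = <-done // guard is released before this send happens
	thirdRan, _, _ = g.tryRun(context.Background())
	return firstRan, secondRan, thirdRan
}

func main() {
	fmt.Println(demoOverlap())
}
```

Skipping an overlapping tick rather than queueing it means a slow DELETE delays the next sweep by at most one interval and never stacks up concurrent statements against the same table.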