feat(ratelimit): Phase 13 Sprint 13.3 — wire backend selector + scheduler janitor + docs + helm (ARCH-M1 closure complete)

Phase 13 Sprint 13.3: the completion half of the ARCH-M1 substantive
close. Sprint 13.2 shipped the Postgres-backed sliding-window limiter +
multi-replica integration test; Sprint 13.3 wires the 6 call sites in
cmd/server/main.go through the operator-chosen backend selector, adds
the rate_limit_buckets scheduler janitor sweep, rewrites the
observability doc, exposes the env-var in the helm chart, and promotes
the multi-replica integration test to a required CI status check.

Signature ground-truth (sprint 13.2 + 13.3)
===========================================

Prompt-template signatures: `Allow(key string) error` and "5 call
sites." Actual repo: `Allow(key string, now time.Time) error` and 6
NewSlidingWindowLimiter call sites in cmd/server/main.go (the prompt
miscounted the second EST per-principal arm). Per CLAUDE.md "the repo
is truth," matched the live shape.

What changed
============

internal/config/server.go (+40 LOC):
- Added `SlidingWindowBackend string` + `SlidingWindowJanitorInterval
  time.Duration` to RateLimitConfig with full operator-facing
  documentation of the two valid values (memory|postgres) +
  when-to-use-which decision tree.

internal/config/config.go (+27 LOC):
- Load() reads CERTCTL_RATE_LIMIT_BACKEND (default "memory") +
  CERTCTL_RATE_LIMIT_JANITOR_INTERVAL (default 5m).
- Validate() rejects anything other than ""/"memory"/"postgres"
  (empty = memory equivalence for test-built Configs that bypass
  Load()). Janitor interval must be ≥ 1 minute when set.
- Failure modes return clear ::error:: with the env-var name + the
  valid values, so an operator typo ("postgress" → memory in a
  3-replica cluster) fails fast at startup.

internal/ratelimit/factory.go (NEW, 67 LOC):
- NewLimiter(backend, db, maxN, window, mapCap) Limiter: single
  factory the 6 cmd/server/main.go call sites route through.
- Drop-in signature: same maxN/window/mapCap as
  NewSlidingWindowLimiter (mapCap accepted + ignored for postgres; the
  rate_limit_buckets table grows until the janitor sweeps).
- Defensive panic on unknown backend (config.Validate is SoT; this is
  belt-and-suspenders).

internal/ratelimit/postgres_gc.go (NEW, 73 LOC):
- PostgresGC struct + NewPostgresGC + GarbageCollect.
- Single-statement DELETE FROM rate_limit_buckets WHERE updated_at <
  NOW() - maxWindow. Idempotent.
- maxWindow <= 0 is a no-op (operator opt-out).

internal/scheduler/scheduler.go (+90 LOC):
- New RateLimitGarbageCollector interface (mirrors the
  ACMEGarbageCollector / SessionGarbageCollector contracts).
- rateLimitGC field + rateLimitGCInterval + rateLimitGCRunning on
  Scheduler.
- SetRateLimitGarbageCollector(gc) + SetRateLimitGCInterval(d) Setters
  following the existing acmeGC/sessionGC pattern.
- rateLimitGCLoop(): JitteredTicker + atomic.Bool guard + per-tick
  context.WithTimeout(1m). Logs row count at Debug.
- Loop counted in the Start() WaitGroup only when the GC is non-nil;
  cmd/server/main.go skips SetRateLimitGarbageCollector when
  backend=memory so the loop never launches for that case.

cmd/server/main.go (35 LOC diff):
- All 6 ratelimit.NewSlidingWindowLimiter call sites now route through
  ratelimit.NewLimiter(cfg.RateLimit.SlidingWindowBackend, db, ...).
  Grep verification post-fix returns ZERO hits.
- Six sites: breakglass loginLimiter (580), ocspLimiter (1003),
  exportLimiter (1068), EST failed-basic (1535), EST per-principal
  SCEP-mTLS arm (1591), EST per-principal SCEP arm (1613). The
  intune.NewPerDeviceRateLimiter site at line 1823 stays unmoved; its
  inner type-alias wrapper is the prompt's out-of-scope
  (cmd/server/*.go only).
- Conditionally constructs PostgresGC + wires the scheduler janitor
  when backend=postgres; logs the wiring decision either way so
  operators see "rate-limit GC sweep enabled (postgres backend)" or
  "in-memory backend self-prunes" in the boot log.

internal/api/handler/{est,export,certificates,auth_breakglass}.go:
- Replaced 5 *ratelimit.SlidingWindowLimiter field/Setter types with
  ratelimit.Limiter (the interface). Allow() satisfies the same call
  shape on both backends; the in-memory tests that construct
  *SlidingWindowLimiter still compile because the concrete type
  satisfies the interface (compile-time check in
  internal/ratelimit/limiter.go pins this).

docs/operator/observability.md (176 LOC diff):
- Replaced the "per-process, in-memory, reset-on-restart, not shared
  across replicas" paragraph with the new configurable-backend
  section: operator decision tree, backend internals (memory vs
  postgres), janitor description, falsifiable closure proof (the
  Sprint 13.2 integration test name + invocation), helm chart wiring
  example.
- Updated inventory to reflect the actual handler file paths + actual
  cap configurations (the prior doc said "60s window" for several
  limiters that actually use 60m / 24h windows).
- Doc smoke confirmed: grep -c 'per-process, in-memory,
  reset-on-restart' docs/operator/observability.md = 0.

deploy/helm/certctl/values.yaml + templates/server-configmap.yaml +
templates/server-deployment.yaml:
- Exposed server.rateLimiting.backend (default "memory") +
  server.rateLimiting.janitorInterval (default "5m") under the
  existing rateLimiting block.
- ConfigMap renders both as rate-limit-backend +
  rate-limit-janitor-interval keys.
- Deployment wires CERTCTL_RATE_LIMIT_BACKEND +
  CERTCTL_RATE_LIMIT_JANITOR_INTERVAL env vars from the configmap.
- Helm render: `helm template deploy/helm/certctl --set
  server.rateLimiting.backend=postgres` shows the env-var on the
  server-deployment.yaml output.

.github/workflows/ci.yml (+12 LOC):
- Added a new step in the Go Build & Test job that runs the Sprint
  13.2 multi-replica integration test
  (TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas) with
  -tags=integration -race -timeout=300s. Fails the CI status check if
  the cross-replica row lock ever stops arbitrating across replicas
  (the ARCH-M1 closure regression gate).

Verification (all green locally; postgres integration via CI)
============================================================

  $ grep -nE 'NewSlidingWindowLimiter' cmd/server/*.go
  (zero hits; Sprint 13.3 receipt)

  $ go test -short -count=1 \
      ./internal/config/... ./internal/ratelimit/... \
      ./internal/scheduler/... ./internal/api/handler/... \
      ./cmd/server/...
  ok  internal/config       1.177s
  ok  internal/ratelimit    0.007s
  ok  internal/scheduler    9.165s
  ok  internal/api/handler  6.245s
  ok  cmd/server            0.390s

  $ staticcheck ./internal/ratelimit/... ./internal/scheduler/... \
      ./internal/config/... ./internal/api/handler/... ./cmd/server/...
  (clean)

  $ gofmt -l internal/ cmd/server/
  (clean)

  $ grep -c 'per-process, in-memory, reset-on-restart' \
      docs/operator/observability.md
  0  (doc smoke: the audit's verbatim phrasing is gone)

  $ bash scripts/ci-guards/G-3-env-docs-drift.sh
  G-3 env-docs-drift: clean.

  $ bash scripts/ci-guards/complete-path-config-coverage.sh
  OK — every CERTCTL_* env var (197) has at least one
  non-config-package consumer.

Selector contract verified: config.Validate() rejects any value other
than ""/memory/postgres at startup with a clear error message.

Sprint 13.4 next (ARCH-H1 OpenAPI authoring batch 1) is on a different
axis; ARCH-M1 closure is complete with this commit modulo the Sprint
13.7 audit-HTML flip + zero-floor pin.

Closes: ARCH-M1 substantive remediation. The cross-replica
rate-limit-cap-enforcement gap that the audit recommended deferring to
v3 is closed; operators with server.replicas > 1 flip
CERTCTL_RATE_LIMIT_BACKEND=postgres and get exactly-cap enforcement
across the cluster (proved by the multi-replica integration test now
gating CI).
2026-07-26 13:58:13 +00:00 · 2026-05-14 11:52:13 +00:00
parent c8347d742d
commit a41fc2d75c
15 changed files with 516 additions and 61 deletions
@@ -121,52 +121,142 @@ explicitly scrubs the password before it reaches the audit subsystem
 (see [`docs/operator/auth-threat-model.md`](auth-threat-model.md) §
 "Break-glass token leak").

-## Rate-limit behavior under restarts and replicas
+## Rate-limit behavior — configurable backend (memory or postgres)

-Where rate limits exist, they are **per-process, in-memory,
-reset-on-restart, and not shared across replicas**. This matters for
-multi-replica deployments and for any compliance posture that asks
-"what limits apply globally vs per-pod."
+The sliding-window-log rate limiters used across certctl's
+authenticated-but-shared-credential code paths (break-glass login,
+OCSP per-IP, cert-export per-actor, EST per-principal, EST
+failed-basic source-IP) carry a **configurable backend**. The
+operator picks between two implementations via
+`CERTCTL_RATE_LIMIT_BACKEND`:
+
+| Value      | When to use                                          |
+|------------|------------------------------------------------------|
+| `memory`   | Default. Single-replica deploys; sketchpad / dev.    |
+| `postgres` | HA deploys (`server.replicas > 1`). Cross-replica-consistent. |
+
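As a rough illustration of the selector rules (this sketch is not part of the commit), the startup check the commit message describes could look something like the following. The function name `validateRateLimit`, the error wording, and the choice to treat a zero interval as "not set" are all assumptions made for the example; the real check lives in `config.Validate()`.

```go
package main

import (
	"fmt"
	"time"
)

// validateRateLimit is a hypothetical stand-in for the selector check:
// "" and "memory" select the in-memory backend, "postgres" selects the
// shared-table backend, and any other value is rejected at startup.
func validateRateLimit(backend string, janitorInterval time.Duration) error {
	switch backend {
	case "", "memory", "postgres":
		// Valid. The empty string is treated as equivalent to "memory".
	default:
		return fmt.Errorf("CERTCTL_RATE_LIMIT_BACKEND=%q is invalid (valid values: memory, postgres)", backend)
	}
	// Assumption: zero means "not set". A set interval must be at least 1m.
	if janitorInterval != 0 && janitorInterval < time.Minute {
		return fmt.Errorf("CERTCTL_RATE_LIMIT_JANITOR_INTERVAL=%s is below the 1m minimum", janitorInterval)
	}
	return nil
}

func main() {
	fmt.Println(validateRateLimit("postgres", 5*time.Minute)) // <nil>
	fmt.Println(validateRateLimit("postgress", 5*time.Minute))
}
```

Rejecting the typo outright, rather than falling back to `memory`, is what keeps a multi-replica cluster from silently running with per-process limits.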
+Phase 13 Sprint 13.2/13.3 (architecture diligence audit ARCH-M1
+closure) removed the prior single-process limitation. When the
+operator opts into `postgres`, all replicas share the same
+`rate_limit_buckets` table (migration 000046) and per-key access is
+arbitrated via `SELECT FOR UPDATE` row locks. A 3-replica cluster
+hitting one rate-limited endpoint concurrently sees exactly the
+configured cap succeed across the cluster, not the 3× the cap that
+the old per-process backend would have allowed.
+
+### Operator decision tree
+
+```
+Single replica (server.replicas = 1, the helm chart default)?
+  └─ Use CERTCTL_RATE_LIMIT_BACKEND=memory (the default; no action
+     required). Bucket lookups stay in-process; zero DB round-trips
+     on the hot path.
+
+Two or more replicas?
+  └─ Use CERTCTL_RATE_LIMIT_BACKEND=postgres. Two extra DB round-trips
+     per Allow call (BEGIN ... SELECT FOR UPDATE ... UPDATE ... COMMIT);
+     acceptable on the gated hot path. The Sprint 13.2 multi-replica
+     integration test pins exactly-cap enforcement across N replicas
+     as the closure proof.
+```

 ### Inventory

-| Limiter                                              | Scope                | Window | Cap                            | Survives restart? | Shared across replicas? |
-|---|---|---|---|---|---|
-| Break-glass login (per source-IP)                    | `internal/api/handler/auth_breakglass.go` | 60s   | 5 attempts                     | No                | No                      |
-| SCEP/Intune per-device challenge                     | `internal/scep/intune/`                   | 60s   | configurable (`*_PER_MINUTE`)  | No                | No                      |
-| EST per-principal CSR enrollment                     | `internal/est/`                           | 60s   | configurable                   | No                | No                      |
-| EST HTTP-Basic source-IP failed-auth                 | `internal/est/`                           | 60s   | configurable                   | No                | No                      |
-| ACME per-account orders / key-change / challenge-respond | `internal/service/acme.go`            | 1h    | configurable                   | No                | No                      |
+| Limiter                                              | Scope                | Window | Cap                            |
+|---|---|---|---|
+| Break-glass login (per source-IP)                    | `internal/api/handler/auth_breakglass.go` | 60s   | 5 attempts                     |
+| OCSP query (per source-IP)                           | `internal/api/handler/certificates.go`    | 60s   | configurable (`CERTCTL_OCSP_RATE_LIMIT_PER_IP_MIN`) |
+| Cert export (per actor)                              | `internal/api/handler/export.go`          | 1h    | configurable (`CERTCTL_CERT_EXPORT_RATE_LIMIT_PER_ACTOR_HR`) |
+| EST per-principal CSR enrollment                     | `internal/api/handler/est.go`             | 24h   | configurable (per-profile `RateLimitPerPrincipal24h`) |
+| EST HTTP-Basic source-IP failed-auth                 | `internal/api/handler/est.go`             | 60m   | 10 attempts                    |
+| SCEP/Intune per-device challenge                     | `internal/scep/intune/`                   | 60s   | configurable (`*_PER_MINUTE`)  |
+| ACME per-account orders / key-change / challenge-respond | `internal/service/acme.go`            | 1h    | configurable                   |

-All five use the shared `internal/ratelimit/sliding_window.go`
-primitive. Buckets live in a single per-process map guarded by a
-mutex; the package-level cap prevents unbounded growth under
-adversarial key cardinality (default 100,000 keys; oldest-by-newest-
-timestamp evicted under pressure).
+The `CERTCTL_RATE_LIMIT_BACKEND` selector applies to the first five
+(the cmd/server-wired limiters). The SCEP/Intune wrapper + the ACME
+per-account limiter ride their own internal accounting today; both
+are tracked as follow-ups in WORKSPACE-ROADMAP.md.

-### Implications for multi-replica deployments
+### Backend internals

- **Effective per-replica cap is the documented cap.** A 2-replica
-  deployment lets through up to 2× the per-key window cap before
-  either replica rejects.
- **Restart resets the bucket.** A `kubectl rollout restart` empties
-  the in-memory windows on every replica. An attacker who notices
-  this could in principle re-issue burst attempts after every roll;
-  the threat model accepts this because rollouts are operator-driven
-  and the relevant endpoints already require credentials.
- **No cross-replica fan-out.** Rate-limit decisions on replica A
-  are not visible to replica B. Sticky-session ingress routing (with
-  `service.spec.sessionAffinity: ClientIP` on Kubernetes or the
-  equivalent on your load balancer) tightens the effective cap to
-  per-replica + per-source-IP rather than per-replica + per-source-IP
-  for whichever pod the request happened to land on.
+Both backends share the algorithm: sliding-window log + per-key
+bucket + prune-on-Allow.

-If your threat model requires globally-enforced rate limits across
-replicas, the implementation surface is roughly: swap the per-process
-map for a database-backed sliding window (or a Redis-backed equivalent
-if you already run Redis). This is on the
-[WORKSPACE-ROADMAP.md](../../WORKSPACE-ROADMAP.md) as a v3 item;
-nothing in the certctl threat model today requires it.
+**Memory backend (`memory`)**: per-process map keyed by bucket key;
+mutex-guarded; a package-level LRU cap prevents unbounded growth
+under adversarial key cardinality (default 100,000 keys per limiter
+instance; under pressure the bucket whose newest timestamp is oldest
+is evicted first). Implemented at
+`internal/ratelimit/sliding_window.go`.
+
+**Postgres backend (`postgres`)**: the same algorithm, run against the
+`rate_limit_buckets` table:
+
+```sql
+CREATE TABLE rate_limit_buckets (
+    bucket_key TEXT          PRIMARY KEY,
+    timestamps TIMESTAMPTZ[] NOT NULL DEFAULT '{}',
+    updated_at TIMESTAMPTZ   NOT NULL DEFAULT NOW()
+);
+```
+
+`Allow(key, now)` opens a transaction, ensures the row exists
+(`INSERT ... ON CONFLICT DO NOTHING`), acquires the row lock
+(`SELECT ... FOR UPDATE`), prunes timestamps older than `now-window`,
+compares the post-prune count against `maxN`, conditionally appends
+`now`, persists, and commits. The row lock is what arbitrates across
+replicas: replicas A and B firing simultaneous `Allow("k")` never
+race because Postgres serializes the per-key row update across the
+cluster. Implemented at
+`internal/ratelimit/postgres_sliding_window.go`.
+
+### Janitor sweep (postgres backend only)
+
+The scheduler runs a `rate_limit_buckets` janitor every
+`CERTCTL_RATE_LIMIT_JANITOR_INTERVAL` (default 5m, minimum 1m). The
+sweep deletes rows whose `updated_at` is older than the longest
+configured window any limiter uses (24h today, matching the EST
+per-principal limiter). The sweep is idempotent: an immediate second
+sweep finds no rows to delete. The memory backend's prune-on-Allow
+path keeps buckets short-lived without a separate sweep, so the
+scheduler does not start the janitor loop when `backend=memory`.
+
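A single sweep could be sketched roughly as follows (not part of the commit). The `execer` interface, `garbageCollect`, `sweepQuery`, and `recordingExec` are hypothetical names, and expressing the window through Postgres's `make_interval(secs => ...)` is one possible way to bind a Go duration, not necessarily what `PostgresGC.GarbageCollect` does. The stand-in executor lets the demo run without a database.

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"fmt"
	"time"
)

// execer is the slice of *sql.DB the sweep needs; *sql.DB satisfies it.
type execer interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
}

// sweepQuery deletes buckets untouched for longer than the longest window.
const sweepQuery = `DELETE FROM rate_limit_buckets
WHERE updated_at < NOW() - make_interval(secs => $1)`

// garbageCollect runs one sweep and reports how many rows were removed.
// A non-positive maxWindow is treated as an opt-out and touches nothing.
func garbageCollect(ctx context.Context, db execer, maxWindow time.Duration) (int64, error) {
	if maxWindow <= 0 {
		return 0, nil
	}
	res, err := db.ExecContext(ctx, sweepQuery, maxWindow.Seconds())
	if err != nil {
		return 0, fmt.Errorf("rate-limit GC sweep: %w", err)
	}
	return res.RowsAffected()
}

// recordingExec is a stand-in executor used only by this demo.
type recordingExec struct {
	calls   int
	lastArg any
}

func (r *recordingExec) ExecContext(_ context.Context, _ string, args ...any) (sql.Result, error) {
	r.calls++
	r.lastArg = args[0]
	return driver.RowsAffected(3), nil // pretend three stale rows were deleted
}

func main() {
	db := &recordingExec{}
	n, err := garbageCollect(context.Background(), db, 24*time.Hour)
	fmt.Println(n, err, db.lastArg) // 3 <nil> 86400
}
```

In a real scheduler loop the context would typically carry a per-tick timeout, so that a slow DELETE cannot stall the next sweep.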
+### Falsifiable closure proof
+
+The Phase 13 Sprint 13.2 integration test
+`internal/integration/ratelimit_multi_replica_test.go`
+(`//go:build integration`) fires 100 concurrent `Allow("test-key")`
+calls round-robined across 3 independent `PostgresSlidingWindowLimiter`
+instances sharing one Postgres database (`cap=10`, `window=1m`) and
+asserts exactly 10 succeed + 90 return `ErrRateLimited`. If the
+cross-replica row lock weren't arbitrating, concurrent calls could
+read the same pre-update bucket and more than 10 would succeed, so
+the exact-count assertion would fail. Re-run:
+
+```
+go test -tags=integration -count=1 \
+    -run TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas \
+    ./internal/integration/...
+```
+
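The shape of that test can be imitated without a database (this sketch is not part of the commit). Here a mutex-guarded `sharedStore` stands in for the `rate_limit_buckets` table and its row lock, and `replica`, `runReplicaDemo`, and `errRateLimited` are hypothetical names. It only illustrates why serialized access to one shared bucket yields an exact count; it does not replace the real integration test against Postgres.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var errRateLimited = errors.New("rate limited")

// sharedStore stands in for the rate_limit_buckets table; its mutex plays
// the role of the per-key row lock.
type sharedStore struct {
	mu      sync.Mutex
	buckets map[string][]time.Time
}

// replica is one limiter instance; several replicas point at one store.
type replica struct {
	store  *sharedStore
	maxN   int
	window time.Duration
}

func (r *replica) Allow(key string, now time.Time) error {
	r.store.mu.Lock() // analogous to SELECT ... FOR UPDATE
	defer r.store.mu.Unlock()

	cutoff := now.Add(-r.window)
	var kept []time.Time
	for _, ts := range r.store.buckets[key] {
		if ts.After(cutoff) {
			kept = append(kept, ts)
		}
	}
	if len(kept) >= r.maxN {
		r.store.buckets[key] = kept
		return errRateLimited
	}
	r.store.buckets[key] = append(kept, now)
	return nil
}

// runReplicaDemo fires `calls` concurrent Allow calls round-robined across
// `replicas` instances and returns how many were allowed and limited.
func runReplicaDemo(replicas, calls, maxN int) (allowed, limited int) {
	store := &sharedStore{buckets: make(map[string][]time.Time)}
	pool := make([]*replica, replicas)
	for i := range pool {
		pool[i] = &replica{store: store, maxN: maxN, window: time.Minute}
	}

	now := time.Now()
	var wg sync.WaitGroup
	var mu sync.Mutex // guards the two counters
	for i := 0; i < calls; i++ {
		wg.Add(1)
		go func(r *replica) {
			defer wg.Done()
			err := r.Allow("test-key", now)
			mu.Lock()
			defer mu.Unlock()
			if err == nil {
				allowed++
			} else {
				limited++
			}
		}(pool[i%replicas])
	}
	wg.Wait()
	return allowed, limited
}

func main() {
	fmt.Println(runReplicaDemo(3, 100, 10)) // 10 90
}
```

Giving each replica its own private store instead of the shared one would let every instance admit up to `maxN` calls on its own, which is the per-process behavior the postgres backend exists to remove.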
+### Helm chart wiring
+
+The helm chart at `deploy/helm/certctl/` exposes the backend via
+`server.rateLimiting.backend` (default `memory`). To opt into the
+postgres backend for an HA deploy:
+
+```
+helm upgrade --install certctl deploy/helm/certctl \
+    --set server.replicas=3 \
+    --set server.rateLimiting.backend=postgres \
+    --set server.rateLimiting.janitorInterval=5m
+```
+
+`server.replicas > 1` without flipping `backend` to `postgres` still
+works, but the limits stay per-process, so the effective cap is N×
+the configured cap for N replicas. The chart does NOT
+auto-flip on `replicas > 1` because some HA deploys deliberately want
+per-process limits (sticky-session ingress + tight per-replica caps
+to detect bot traffic at the edge before it hits the application).

 ### Where these numbers live