fix(scale): close BUNDLE 4 — migrations, scheduler HA, rate-limits, scale receipts

Bundle 4 closure (2026-05-13 acquisition diligence audit). Closes the "what happens under multi-replica" question cluster: the migration runner had no concurrency control and no applied-version ledger, 15 scheduler loops had per-process idempotency but no cross-replica documentation, rate limits were process-local without an operator-facing scope statement, and the load-test scope explicitly omitted four hot paths without linking them to a roadmap.

Source findings closed:
  HIGH-1 + D4 + finding 4 (migration tracking)
  D8 (scheduler loop ownership)
  MED-1 + MED-2 (rate-limit scope)
  T9 + LOW-7 + finding 7 (load-test receipt scope)

Closures by source ID:

HIGH-1 + D4 + finding 4: Migration tracking + advisory lock.
internal/repository/postgres/db.go::RunMigrations now wraps every migration execution in:
  1. A dedicated *sql.Conn pinned to one connection for the entire scan + apply lifecycle (pg_advisory_lock is connection-scoped).
  2. pg_advisory_lock(migrationAdvisoryLockID): fixed int64 key derived from "certctl-migrations" so the same constant resolves across deployments without colliding with operator advisory locks. Blocks the second replica until the first finishes.
  3. CREATE TABLE IF NOT EXISTS schema_migrations(version TEXT PK, applied_at TIMESTAMPTZ DEFAULT NOW()): audit ledger.
  4. Skip-applied loop: SELECT version FROM schema_migrations → map[string]struct{} → skip every .up.sql whose filename is in the map. INSERT after successful execute, ON CONFLICT (version) DO NOTHING for defense in depth.

Pre-Bundle-4 every server boot re-ran all 45 .up.sql files. The "idempotency via IF NOT EXISTS / ON CONFLICT" contract in CLAUDE.md held per-migration but offered no protection when two Helm replicas raced on schema DDL. Post-Bundle-4 single-replica deploys see zero behavior change beyond the audit-table population; multi-replica deploys get HA-safe schema bootstrap.

D8: Scheduler HA semantics documented.
New docs/operator/scheduler-ha.md with per-loop inventory of all 15 loops in internal/scheduler/scheduler.go. Classification:
  - HA-safe (jobProcessorLoop, jobRetryLoop): FOR UPDATE SKIP LOCKED via ClaimPendingJobs (Bundle 1 H-6 closure, 3e78ecb).
  - HA-safe-ish (jobTimeoutLoop): atomic UPDATE-WHERE-status.
  - Idempotent under N>1 replicas (renewalCheckLoop, agentHealthCheckLoop, shortLivedExpiryCheckLoop, networkScanLoop, healthCheckLoop, acmeGCLoop, sessionGCLoop): duplicate ticks produce idempotent side effects.
  - Side-effect-duplicating under N>1 replicas (notificationProcessLoop, notificationRetryLoop, digestLoop, cloudDiscoveryLoop, crlGenerationLoop): duplicate webhook/email/AWS-API/CRL-signing operations.
Operators running multi-replica accept N× side effects or pin to server.replicas: 1. Leader-election work tracked in WORKSPACE-ROADMAP.md as v3.

MED-1 + MED-2: Rate-limit scope.
New docs/operator/rate-limit-scope.md states the contract verbatim: process-local sync.Mutex-guarded sliding-window log, effective cluster-wide cap = configured-per-replica × server.replicas, restart-safe (no persistent state, no shared store), bounded (50k/100k key cap with eviction). Five call sites documented: ocspLimiter (1m/IP), exportLimiter (1h/actor), EST per-principal (24h/CN), EST failed-auth (1h/IP), Intune dispatcher (24h/Subject+Issuer), plus the HTTP middleware token-bucket (RPS+Burst per replica). Cluster-wide shared limits via Redis or Postgres-backed bucket are tracked in WORKSPACE-ROADMAP.md as v3.

T9 + LOW-7 + finding 7: Load-test receipt scope.
The existing harness at deploy/test/loadtest/ already self-documents the gap ("What it explicitly does NOT measure"). No code change needed for this finding; Bundle 4 cross-references scheduler-ha.md and rate-limit-scope.md from those gap callouts so the four deferred coverage classes (issuer connector, scheduler throughput, agent fleet, DB p99) land in the same place an acquirer reads about HA semantics and rate limits.

Tests:
internal/repository/postgres/migrations_test.go (new, 4 tests):
  - TestRunMigrations_PopulatesSchemaMigrations: audit table exists and is non-empty after the first migration run.
  - TestRunMigrations_SkipsAppliedOnSecondCall: second call is an observable no-op on row count.
  - TestRunMigrations_ConcurrentCallsSerialized: two goroutines racing the migrator both return without error; row count unchanged; no duplicate versions.
  - TestRunMigrations_FreshDatabaseHappyPath: ≥ 30 migrations land on a fresh schema.
Gated by testcontainers via the existing repo_test.go getTestDB pattern; skipped under -short. The integration lane runs them.

Verification:
  gofmt -l                                                                      # clean
  go vet ./internal/repository/postgres ./cmd/server                            # clean
  go build ./cmd/server ./internal/repository/postgres                          # clean
  go test -short -count=1 ./internal/repository/postgres ./internal/ratelimit   # PASS

Operator follow-up: full integration run on workstation:
  go test -count=1 ./internal/repository/postgres -run TestRunMigrations_

Receipts (paths for the audit packet):
  - Migration runner evidence: internal/repository/postgres/db.go L135-340 (advisory-lock + ledger + skip-applied loop) + internal/repository/postgres/migrations_test.go (4 tests).
  - Scheduler loop inventory: docs/operator/scheduler-ha.md (15-loop table with HA classification per loop).
  - Rate-limit storage matrix: docs/operator/rate-limit-scope.md.
  - Load-test baseline: deploy/test/loadtest/README.md (already self-documenting), cross-linked from scheduler-ha.md.

Remaining operator warnings (deferred, tracked in WORKSPACE-ROADMAP.md):
  - Leader election for the five duplicate-side-effect loops (notificationProcessLoop, notificationRetryLoop, digestLoop, cloudDiscoveryLoop, crlGenerationLoop). v3 work item.
  - Shared rate-limits across replicas (Redis / Postgres token bucket). v3 work item.
  - Issuer-connector + scheduler-throughput + agent-fleet + DB-p99 load-test coverage. Tracked separately; per-issuer Prometheus histograms already capture issuer round-trip latency in production runs.

Audit-Closes: BUNDLE-4 HIGH-1 D4 D8 MED-1 MED-2 T9 LOW-7 finding-4 finding-7
2026-07-26 13:58:13 +00:00 · 2026-05-13 01:00:39 +00:00
parent 7fcdc73e20
commit 750478a6fe
5 changed files with 497 additions and 6 deletions
@@ -74,6 +74,8 @@ You're running certctl in production and need operational guidance.
 | [Helm deployment](operator/helm-deployment.md) | Kubernetes installation via the bundled chart |
 | [Performance baselines](operator/performance-baselines.md) | Operator-runnable benchmarks for regression spot checks |
 | [Auth benchmarks](operator/auth-benchmarks.md) | Session + OIDC validation p99 targets and measured baselines |
+| [Scheduler HA semantics](operator/scheduler-ha.md) | Per-loop HA truth table for the 15 scheduler loops; what duplicates on multi-replica |
+| [Rate-limit scope](operator/rate-limit-scope.md) | Process-local vs cluster-wide rate-limit behavior, restart semantics, multi-replica mental math |
 | [Legacy clients (TLS 1.2)](operator/legacy-clients-tls-1.2.md) | Reverse-proxy runbook for embedded EST/SCEP clients on TLS 1.2 |

 ### Runbooks
@@ -0,0 +1,54 @@
+# Rate-Limit Scope
+
+> Last reviewed: 2026-05-13
+
+How certctl's rate limiters behave under multi-replica and restart, and where the boundaries are. Closes Bundle 4 audit findings **MED-1** (process-local rate limits) and **MED-2** (rate-limit semantics across replicas).
+
+## TL;DR
+
+Every rate limiter in certctl is **process-local**: in-memory `sync.Mutex`-guarded maps in the `internal/ratelimit` package. The effective rate limit across an N-replica deployment is **N × the configured per-replica limit**. Limiter state is lost on restart: there is no persistent ledger and no shared store. This is intentional and documented for v2.1.0; shared rate limits across replicas (via Redis or a Postgres-backed token bucket) are a v3 work item tracked in `WORKSPACE-ROADMAP.md`.
+
+## Limiter inventory
+
+The shared primitive lives at `internal/ratelimit/sliding_window.go::SlidingWindowLimiter`. It's a sliding-window log algorithm (each key holds timestamps within the configured window; on `Allow` the bucket prunes timestamps older than `now - window` and admits or rejects based on the post-prune count).
+
+Five call sites exercise it across `cmd/server/main.go`:
+
+| Call site | Key | Window | Cap | What it protects |
+|---|---|---|---|---|
+| `ocspLimiter` | source IP | 1 minute | `CERTCTL_OCSP_RATE_LIMIT_PER_IP_MIN` (default 1000) | RFC 6960 OCSP responder against scan amplification. |
+| `exportLimiter` | actor ID | 1 hour | `CERTCTL_CERT_EXPORT_RATE_LIMIT_PER_ACTOR_HR` (default 50) | `/api/v1/certificates/{id}/export` bulk-cert-pull abuse. |
+| EST per-principal | CN | 24 hours | per-profile `RateLimitPerPrincipal24h` | EST RFC 7030 device enrollment abuse. |
+| EST failed-auth | source IP | 1 hour | 10 attempts | EST HTTP-Basic brute force. |
+| Intune dispatcher | (Subject, Issuer) | 24 hours | per-profile `PerDeviceRateLimit24h` | SCEP + Intune duplicate-enrollment cap. |
+
+The HTTP middleware rate-limiter (`internal/api/middleware/middleware.go::RateLimitConfig`, knobs `CERTCTL_RATE_LIMIT_RPS` + `CERTCTL_RATE_LIMIT_BURST`) is a separate token-bucket implementation but follows the same process-local scope.
+
+## What is in scope
+
+- **Per-process abuse mitigation**. A scanner hitting one replica's OCSP responder at 5000 rps is capped at 1000 requests per minute by `ocspLimiter`. A compromised API key trying to bulk-export 1000 certs in an hour against one replica is capped at 50.
+- **Bounded memory footprint**. Each limiter caps its key-tracking map at the size passed to `NewSlidingWindowLimiter` (50 000 for OCSP/export, 100 000 for EST/Intune per-device). When the map is at cap, the eviction janitor drops the entry whose newest timestamp is the oldest, so a key explosion (random source IPs from a botnet, random CNs from a misconfigured fleet) never bloats memory.
+- **Restart safety**. The sliding-window state is per-process in-memory. On a server restart the limiter state resets — any attacker who'd burned through their window cap before the restart gets a fresh window after. Conversely, a legitimate caller that had hit their cap right before a restart gets immediately unblocked. Both directions are intentional: we don't ship a persistent state store, so the post-restart state is "fresh start".
+
+## What is NOT in scope
+
+- **Shared limits across replicas**. With `server.replicas: N`, an attacker hitting all replicas in parallel gets N × the per-replica cap. For the default OCSP knob (1000 requests per IP per minute) at N=3, that's 3000 requests per IP per minute before any single replica drops the traffic. The chart's default is N=1; operators running multi-replica should multiply published per-replica caps by the replica count to get the cluster-wide effective cap.
+- **Cross-restart persistence**. See "Restart safety" above — this is by-design but operators need to know.
+- **First-party (`Authorization: Bearer ...`) request rate-limit cohesion across replicas**. The middleware-level RPS/burst rate-limit (`CERTCTL_RATE_LIMIT_RPS`) is also process-local. At N=3 replicas with default `RPS=50, Burst=100`, the effective cluster-wide rate is 150 rps with 300 burst.
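
The multiplication in the bullets above is simple enough to encode as a worst-case calculator. The helper names below are hypothetical; the numbers are the defaults quoted on this page.

```go
package main

import "fmt"

// tokenBucket mirrors the two middleware knobs named on this page.
type tokenBucket struct {
	RPS   int
	Burst int
}

// effectiveCap returns the worst-case cluster-wide cap for a process-local
// limiter: every replica admits up to its own configured cap.
func effectiveCap(perReplicaCap, replicas int) int {
	return perReplicaCap * replicas
}

// effectiveBucket applies the same multiplication to a token bucket.
func effectiveBucket(b tokenBucket, replicas int) tokenBucket {
	return tokenBucket{RPS: b.RPS * replicas, Burst: b.Burst * replicas}
}

func main() {
	// Default OCSP cap: 1000 requests per IP per minute, at N=3.
	fmt.Println(effectiveCap(1000, 3)) // 3000
	// Default middleware bucket: RPS=50, Burst=100, at N=3.
	fmt.Println(effectiveBucket(tokenBucket{RPS: 50, Burst: 100}, 3)) // {150 300}
}
```

These are upper bounds: a caller pinned to one replica by the load balancer still sees only the per-replica cap.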
+
+## Operator guidance
+
+**Single-replica deployments** (Helm chart default `server.replicas: 1`): published caps are the effective caps. No mental math.
+
+**Multi-replica deployments**: multiply every published cap by `server.replicas` to get the effective per-key cap. If your threat model needs strict cluster-wide rate limits (e.g., SOC 2 control mapping that quotes "≤ 1000 OCSP requests per IP per minute"), choose one of:
+
+1. **Pin to single replica** and scale vertically. The default v2.1.0 posture; works for substantial fleets.
+2. **Front the cluster with a rate-limited gateway** (NGINX `limit_req_zone`, Envoy global rate-limit service, Cloudflare rate limiting rules) and treat the cluster-internal limiter as defense-in-depth.
+3. **Wait for the v3 shared-rate-limit work** (tracked in WORKSPACE-ROADMAP.md). Likely Redis or Postgres-backed token-bucket plumbed through the same `internal/ratelimit` package so call sites stay unchanged.
+
+## Source-of-truth references
+
+- Shared primitive: `internal/ratelimit/sliding_window.go` (the package comment at the top is the canonical algorithm + concurrency reference).
+- Middleware rate limiter: `internal/api/middleware/middleware.go::RateLimitConfig`.
+- Call sites: `grep -n "ratelimit.NewSlidingWindowLimiter\|RateLimitConfig" cmd/server/main.go`.
+- Configuration knobs: `docs/reference/configuration.md` (search "rate limit").
@@ -0,0 +1,70 @@
+# Scheduler HA Semantics
+
+> Last reviewed: 2026-05-13
+
+What happens when you run more than one `certctl-server` replica? Which scheduler loops are safe to run on every replica simultaneously, which need leader election, and which silently duplicate work today?
+
+This page closes Bundle 4 audit findings **D8** (singleton loop ambiguity) and **HIGH-1 + MED-2** (HA semantics across the scheduler surface). It is a per-loop inventory of the 15 scheduler loops in `internal/scheduler/scheduler.go`, classified by HA-safety.
+
+## TL;DR
+
+The only loops that are HA-safe today via `FOR UPDATE SKIP LOCKED` job claiming are `jobProcessorLoop` and `jobRetryLoop`. Every other loop is *intra-process* idempotent (a per-replica `sync/atomic.Bool` guard prevents a single replica from running the same loop twice at once) but is *cross-replica* duplicative — two replicas tick at the same interval and both do the work.
+
+For eight of the thirteen non-job-claim loops this is harmless: the work is DB scanning that produces idempotent side effects (e.g., create-job-if-not-exists, UPDATE-to-the-same-final-state), and duplicate execution wastes CPU but cannot corrupt data. For five loops (`notificationProcessLoop`, `notificationRetryLoop`, `digestLoop`, `crlGenerationLoop`, `cloudDiscoveryLoop`) duplicate execution produces observable duplicate side effects: duplicate emails, duplicate webhooks, duplicate cloud API calls, duplicate CRL signings. v2.1.0 supports `server.replicas > 1` for read availability and API throughput, but operators running multi-replica should accept these duplicate side effects or pin replicas to 1 until the leader-election work lands.
+
+True leader-election via Postgres advisory lock or Kubernetes lease is tracked in `WORKSPACE-ROADMAP.md` as a v3 work item.
+
+## Per-loop inventory
+
+The 15 loops live in `internal/scheduler/scheduler.go`. Each is a `func (s *Scheduler) <name>Loop(ctx context.Context)` driven by a `time.Ticker`. The intra-process guard pattern is `sync/atomic.Bool` `CompareAndSwap(false, true)` at the top of the loop body — pre-Bundle-4 every loop already had this guard. Bundle 4 added the cross-replica classification below.
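
The intra-process guard described above can be sketched like this. It is a minimal illustration of the `CompareAndSwap(false, true)` pattern, not a copy of `scheduler.go`; `loopGuard` and `tryRun` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// loopGuard prevents one process from running the same loop body twice at
// once. It says nothing about other replicas: each process has its own flag.
type loopGuard struct {
	running atomic.Bool
}

// tryRun executes work unless a previous invocation is still in flight, and
// reports whether work actually ran.
func (g *loopGuard) tryRun(work func()) bool {
	if !g.running.CompareAndSwap(false, true) {
		return false // previous tick has not finished; skip this one
	}
	defer g.running.Store(false)
	work()
	return true
}

func main() {
	var g loopGuard
	overlapped := true
	ran := g.tryRun(func() {
		// Simulate a second tick arriving while the first is still running.
		overlapped = g.tryRun(func() {})
	})
	fmt.Println(ran, overlapped)       // true false
	fmt.Println(g.tryRun(func() {})) // true: the guard was released
}
```

Because the flag lives in process memory, two replicas each hold an unset flag and both run the body, which is exactly the cross-replica gap the table below classifies.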
+
+| # | Loop | HA mode | Side-effect duplication risk under N>1 replicas |
+|---|---|---|---|
+| 1 | `renewalCheckLoop` | **Idempotent** — creates renewal jobs via `service.CheckExpiringCertificates`. | None. Duplicate ticks try to create the same `RenewalRequested` job; service-layer dedup (cert_id + status uniqueness window) collapses the second. Result: 2× CPU, 1× job, no data corruption. |
+| 2 | `jobProcessorLoop` | **HA-safe** — `service.ProcessPendingJobs` ultimately calls `repository/postgres.JobRepository.ClaimPendingJobs` which uses `SELECT ... FOR UPDATE SKIP LOCKED`. | None. Postgres guarantees exactly-once row claim per tick across the replica set. |
+| 3 | `jobRetryLoop` | **HA-safe** — `service.RetryFailedJobs` uses the same `ClaimPendingJobs` primitive (Bundle 1 audit fix H-6, commit `6cb4414`). | None. |
+| 4 | `jobTimeoutLoop` | **HA-safe-ish** — `service.TimeoutStalledJobs` UPDATEs with `WHERE status = 'Running' AND started_at < $cutoff` inside a single statement. Two replicas may UPDATE the same row but the second UPDATE sees `Running → Failed` already applied and matches zero rows. | None. |
+| 5 | `agentHealthCheckLoop` | **Idempotent** — UPDATEs `agents SET operational_status = 'Offline' WHERE last_heartbeat < $cutoff`. Two replicas running the same UPDATE land the same final state. | None. |
+| 6 | `notificationProcessLoop` | **Duplicates** — reads pending notification queue, dispatches to Slack / PagerDuty / SMTP / Teams / OpsGenie, marks dispatched. The dispatch and the "mark dispatched" are not in a single transaction; two replicas can both dispatch the same notification before the mark lands. | **Duplicate webhook + email sends**. Bounded — at most N duplicates for N replicas — but operator-observable. |
+| 7 | `notificationRetryLoop` | **Duplicates** — same shape as `notificationProcessLoop`. | Same as #6. |
+| 8 | `shortLivedExpiryCheckLoop` | **Idempotent** — UPDATEs cert status to `Expired` based on `expires_at < NOW()`. Two replicas land the same status. | None. |
+| 9 | `networkScanLoop` | **Idempotent** — invokes `service.NetworkScanService.ScanAllEnabledTargets` which iterates scan targets, probes each, and INSERTs discovered certs with `ON CONFLICT (fingerprint, agent_id, source_path) DO NOTHING`. | None on cert insertion. Duplicate TLS probes hit the operator's targets twice per tick. Operator may want to cap to 1 replica for low-egress-budget environments. |
+| 10 | `digestLoop` | **Duplicates** — assembles the periodic digest email and dispatches via SMTP. Two replicas at the same digest tick both send. | **Duplicate digest emails**. |
+| 11 | `healthCheckLoop` | **Idempotent** — runs the active TLS-fingerprint health-check sweep across deployed certs. Same idempotency story as #8. | None on state. Duplicate TLS probes to operator targets. |
+| 12 | `cloudDiscoveryLoop` | **Duplicates the scan; idempotent on the result store** — fetches cert lists from AWS Secrets Manager / Azure Key Vault / GCP Secret Manager, INSERTs into discovered-certs with `ON CONFLICT DO NOTHING`. | **Duplicate AWS/Azure/GCP API calls** — bills operator cloud accounts 2× per tick on the discovery API surface. Storage stays clean. |
+| 13 | `crlGenerationLoop` | **Duplicates the signing; last-writer wins on storage** — regenerates CRL DER blobs per issuer, writes to `certificate_revocation_lists` table with `UPDATE ... WHERE issuer_id = $1`. Two replicas sign two CRLs with two `thisUpdate` timestamps; the later UPDATE wins. | **Duplicate CA signing operations** (cost on HSM-backed issuers). CRL output is single-valued but the audit trail records both signings. |
+| 14 | `acmeGCLoop` | **Idempotent** — DELETEs ACME nonce / authz / order rows older than the retention window. Two replicas race the same DELETEs; second one matches zero rows. | None. |
+| 15 | `sessionGCLoop` | **Idempotent** — DELETEs expired session rows. Same shape as #14. | None. |
+
+## What Bundle 4 closes
+
+Bundle 4 does NOT introduce leader election. It introduces:
+
+1. **Documented HA truth table** (this page) — operators know exactly which loops are safe to multi-replica and which produce operator-observable duplicates.
+2. **Migration HA** via `pg_advisory_lock` + `schema_migrations` audit table (see `internal/repository/postgres/db.go::RunMigrations`). Pre-Bundle-4 every replica race-ran all 45 migrations on boot. Post-Bundle-4 the first replica acquires the lock, applies migrations, populates `schema_migrations`, releases the lock. Subsequent replicas block at the lock, then observe the audit table and skip every already-applied file.
+3. **Rate-limit scope statement** at `docs/operator/rate-limit-scope.md` — process-local per-replica, restart-safe.
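
The migration flow in item 2 (pin a connection, take the lock, read the ledger, skip what is already applied) can be sketched as below. This is an illustration under assumptions, not the body of `RunMigrations`: the lock ID value, the `apply` callback, the file names in `main`, and the helper `pendingMigrations` are all hypothetical. Only the SQL statements follow what this commit describes.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// lockID is a placeholder; the real migrationAdvisoryLockID value is not shown here.
const lockID int64 = 1

// pendingMigrations returns the files not yet recorded in the ledger,
// preserving the input order.
func pendingMigrations(files []string, applied map[string]struct{}) []string {
	var pending []string
	for _, f := range files {
		if _, done := applied[f]; !done {
			pending = append(pending, f)
		}
	}
	return pending
}

// migrate serializes replicas on a session-level advisory lock held by one
// pinned connection, then applies only the files missing from the ledger.
func migrate(ctx context.Context, db *sql.DB, files []string,
	apply func(context.Context, *sql.Conn, string) error) error {
	conn, err := db.Conn(ctx) // advisory locks belong to this one session
	if err != nil {
		return err
	}
	defer conn.Close()

	if _, err := conn.ExecContext(ctx, `SELECT pg_advisory_lock($1)`, lockID); err != nil {
		return err
	}
	// Deferred after Close, so it runs first and unlocks on the same session.
	defer conn.ExecContext(context.Background(), `SELECT pg_advisory_unlock($1)`, lockID)

	if _, err := conn.ExecContext(ctx, `CREATE TABLE IF NOT EXISTS schema_migrations (
		version TEXT PRIMARY KEY, applied_at TIMESTAMPTZ DEFAULT NOW())`); err != nil {
		return err
	}

	rows, err := conn.QueryContext(ctx, `SELECT version FROM schema_migrations`)
	if err != nil {
		return err
	}
	applied := map[string]struct{}{}
	for rows.Next() {
		var v string
		if err := rows.Scan(&v); err != nil {
			rows.Close()
			return err
		}
		applied[v] = struct{}{}
	}
	rows.Close()
	if err := rows.Err(); err != nil {
		return err
	}

	for _, f := range pendingMigrations(files, applied) {
		if err := apply(ctx, conn, f); err != nil {
			return fmt.Errorf("apply %s: %w", f, err)
		}
		if _, err := conn.ExecContext(ctx,
			`INSERT INTO schema_migrations (version) VALUES ($1) ON CONFLICT (version) DO NOTHING`, f); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// No database needed to see the skip logic: one file is already in the ledger.
	applied := map[string]struct{}{"000001_init.up.sql": {}}
	files := []string{"000001_init.up.sql", "000002_jobs.up.sql"}
	fmt.Println(pendingMigrations(files, applied)) // [000002_jobs.up.sql]
}
```

A replica that blocks on the lock reads the ledger only after acquiring it, so it sees every version the first replica recorded and its pending list is empty.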
+
+## What Bundle 4 does NOT close (deferred, tracked in WORKSPACE-ROADMAP.md)
+
+- **Leader election** for `notificationProcessLoop`, `notificationRetryLoop`, `digestLoop`, `cloudDiscoveryLoop`, `crlGenerationLoop`. The cleanest implementation is a per-loop `pg_try_advisory_lock(lock_id)` at the top of `runX` so only one replica per tick claims the work, with a small leader-renewal mechanic for long-running loops. This would close the five duplicate-side-effect cases above. v3 work item.
+- **Shared rate limits across replicas**. See `docs/operator/rate-limit-scope.md`.
+
+## Operator guidance
+
+**Single-replica deployments (Helm `server.replicas: 1` — the chart default)**: all 15 loops work as documented. No action needed.
+
+**Multi-replica deployments**: review the five duplicate-side-effect loops above against your tolerance:
+
+- If your alerting fan-out can swallow duplicate webhooks (PagerDuty deduplicates by `dedup_key`, Slack does not), set `server.replicas > 1` and accept the duplication.
+- If your CRL signing uses an HSM with per-operation cost, pin to single-replica until leader election lands.
+- If you're running cloud discovery against billed AWS/Azure/GCP secret-manager APIs, each extra replica repeats every discovery call. At a 6 h discovery interval the added call volume is usually bearable; at 30 min intervals the same multiplier applies to a much larger API bill.
+
+For any duplicate-side-effect class above, the operational mitigation is pinning `server.replicas: 1` and scaling vertically. The certctl-server process is CPU-bound on issuance and IO-bound on Postgres; a single replica handles substantial fleets when given enough cores + a fast database.
+
+## Source-of-truth references
+
+- Scheduler loops: `internal/scheduler/scheduler.go` (15 `<name>Loop` functions, search `^func \(s \*Scheduler\) [a-zA-Z]+Loop`).
+- Job claim primitive: `internal/repository/postgres/job.go::ClaimPendingJobs` (Bundle 1 H-6 closure, commit `6cb4414`).
+- Migration HA: `internal/repository/postgres/db.go::RunMigrations` (Bundle 4 closure).
+- Rate-limit scope: `docs/operator/rate-limit-scope.md`.
+- Load-test scope: `deploy/test/loadtest/README.md` ("What it explicitly does NOT measure").