scheduler+db: close Phase 6 — scale hardening across pool, jitter, ETag, asyncpoll

Phase 6 of the certctl architecture diligence remediation. Five findings
across the same scheduler-and-DB-pool surface.

SCALE-M1 (Med): DB pool default bumped 25 → 50

  internal/config/config.go line 1972:
    MaxConnections: getEnvInt("CERTCTL_DATABASE_MAX_CONNS", 50)

  Postgres default max_connections is 100; 50 leaves headroom for
  pg_dump + ad-hoc psql + a server replica without exhausting the
  DB-side cap. Operator override env var unchanged.

  Operator-tune ladder for larger fleets (5K / 50K certs) lives in
  docs/operator/scale.md as starter values pending Phase 8 load tests,
  explicitly marked TBD.

SCALE-M3 (Med): async-CA poll budget operator-configurable

  Live state was partially-already-shipped: all 4 async-CA connectors
  (digicert, entrust, globalsign, sectigo) already have per-connector
  CERTCTL_<NAME>_POLL_MAX_WAIT_SECONDS (Audit fix #5 closed
  pre-Phase-6). What was missing: a global package-default override.

  Shipped:
  - internal/connector/issuer/asyncpoll/asyncpoll.go gains
    SetDefaultMaxWait(d) + effectiveDefaultMaxWait var + the
    currentDefaultMaxWait() priority resolver.
  - cmd/server/main.go reads CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS at
    boot and calls SetDefaultMaxWait.
  - deploy/ENVIRONMENTS.md documents the new env var (G-3 guard green).

  Naming deviation from the prompt's CERTCTL_ASYNC_POLL_MAX_ATTEMPTS:
  the live code tracks wall-clock time (MaxWait), not attempt count.
  Matched the existing per-connector nomenclature
  (_POLL_MAX_WAIT_SECONDS) so the priority chain reads naturally.

SCALE-M5 (Med): JitteredTicker wrapper for all 15 scheduler loops

  internal/scheduler/jitter.go ships NewJitteredTicker(interval,
  jitterPct) + DefaultSchedulerJitter (±10%). All 15 sites in
  internal/scheduler/scheduler.go migrated from bare time.NewTicker to
  NewJitteredTicker(interval, DefaultSchedulerJitter).

  Base intervals unchanged; only the per-tick envelope adds ±10%
  randomized delay so multiple loops with the same nominal cadence
  don't co-fire and spike CPU + DB at wall-clock boundaries.

  internal/scheduler/jitter_test.go pins:
  - Bounded envelope (each tick within ±jitterPct of interval)
  - Mean drift < 30% of nominal (sign-bug detector)
  - Stop() releases the goroutine + closes C
  - Stop() idempotent (no panic on repeat)
  - Zero-jitter behaves like time.NewTicker
  - Negative and >=1 jitterPct values clamped defensively

  CI guard scripts/ci-guards/no-bare-newticker-in-scheduler.sh blocks
  any future bare time.NewTicker in scheduler.go.

SCALE-L1 (Low): renewal-sweep semaphore behavior documented

  docs/operator/scale.md "Scheduler tick budgets" section explains the
  per-tick concurrency semaphore (CERTCTL_RENEWAL_CONCURRENCY=25
  default), the ctx-cancellation drain on tick-budget overrun, and
  operator tuning advice (raise concurrency + DB pool together).

  No code change; the behavior is defensible as-is per the audit.

SCALE-L2 (Low): ETag middleware for top-5 read endpoints

  internal/api/middleware/etag.go computes SHA-256 ETag over the
  buffered response body, respects If-None-Match, short-circuits to
  304 Not Modified on match. GET/HEAD only; non-2xx responses pass
  through unchanged. 64 KiB buffer cap degrades gracefully on
  oversized responses (no caching, body still flushes intact).

  Wired around the top-5 read endpoints via etagged() helper in
  internal/api/router/router.go:
    GET /api/v1/certificates
    GET /api/v1/agents
    GET /api/v1/jobs
    GET /api/v1/audit
    GET /api/v1/discovered-certificates

  internal/api/middleware/etag_test.go pins 11 behaviors including
  304-on-repeat, 200-after-mutation-with-new-ETag, POST bypass,
  4xx/5xx pass-through, oversized-response degradation, wildcard
  match, HEAD-treated-like-GET, byte-equal pass-through.

Cross-cutting fixes:
- internal/config/config_test.go::TestLoad_DefaultValues updated to
  assert the new 50 default (was 25).
- deploy/helm/certctl/values.yaml comment corrected: agent
  pollInterval is hardcoded 30s, not env-configurable; the Phase 4
  comment mistakenly referenced CERTCTL_AGENT_POLL_INTERVAL which G-3
  caught as a phantom env var.
- asyncpoll.go reformatted by gofmt; functionally unchanged.

Verification (all pass):

  grep -nE 'SetMaxOpenConns' internal/repository/postgres/db.go
    # finds 1 site
  grep -nE 'CERTCTL_DATABASE_MAX_CONNS.*50' internal/config/config.go
    # config default is 50
  grep -rnE 'CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS' internal/ deploy/ENVIRONMENTS.md
    # wired
  grep -cE 'time\.NewTicker\(' internal/scheduler/scheduler.go
    # 0 (all migrated)
  grep -cE 'JitteredTicker' internal/scheduler/scheduler.go
    # 15
  ls internal/scheduler/jitter.go internal/api/middleware/etag.go
    # both exist
  ls docs/operator/scale.md
    # exists
  bash scripts/ci-guards/no-bare-newticker-in-scheduler.sh
    # clean
  bash scripts/ci-guards/G-3-env-docs-drift.sh
    # clean
  go test ./internal/scheduler/ ./internal/api/middleware/ \
    ./internal/connector/issuer/asyncpoll/ ./internal/config/
    # 4/4 packages green

Closes:
  cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M1
  cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M3
  cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M5
  cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L1
  cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L2
2026-06-07 14:51:30 +00:00 · 2026-05-14 01:23:03 +00:00
parent d6f4d5c5e8
commit 8191b1ee64
14 changed files with 1159 additions and 27 deletions
@@ -0,0 +1,140 @@
+# Operator scale guide
+
+> Last reviewed: 2026-05-14
+
+Use this when:
+- You're sizing a new certctl deployment for a target fleet count.
+- You're scaling an existing deployment up from demo (15 certs / 1
+  agent) to production (1K+ certs / 100+ agents).
+- An auditor asks "what does this scale to?" and you want a documented
+  answer that isn't "we haven't measured."
+
+## DB connection pool
+
+certctl's PostgreSQL connection pool is the single largest scale lever.
+Pool exhaustion looks like 503s + agent poll timeouts + scheduler
+falling behind on its loops. The default ships at 50 max open
+connections (`CERTCTL_DATABASE_MAX_CONNS=50`), with idle = max/5 = 10
+under the existing `internal/repository/postgres/db.go::NewDBWithMaxConns`
+contract.
+
+Operator-tune ladder:
+
+| Fleet size                  | `CERTCTL_DATABASE_MAX_CONNS` | Postgres `max_connections` | Notes |
+|---|---|---|---|
+| ≤ 500 certs / 100 agents    | `50` (default)               | `100` (PG default)         | Demo + small deployments. Pool default sized for this. |
+| 5K certs / 1K agents        | `100`                        | `200`                      | Postgres needs an explicit bump from the 100 default; reload required. |
+| 50K certs / 10K agents      | `200`                        | `400`                      | Plus dedicated Postgres VM (separate from server host); shared_buffers ≥ 1Gi. |
+
+Always leave headroom in Postgres's `max_connections` for backups
+(`pg_dump` opens its own connection), ad-hoc psql sessions, and
+additional certctl server replicas. A safe floor for Postgres's
+`max_connections` is `(server pool size × server replicas) + 20`.
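
The floor above is simple arithmetic, and a small helper makes it easy to check a planned deployment. This is a minimal sketch: the function name `minPostgresMaxConns` and the idea of passing the server replica count as an argument are assumptions for illustration, not part of certctl.

```go
package main

import "fmt"

// minPostgresMaxConns returns the suggested floor for Postgres's
// max_connections: (server pool size * server replicas) + 20.
// The function name is hypothetical; only the formula comes from the guide.
func minPostgresMaxConns(poolSize, serverReplicas int) int {
	const headroom = 20 // room for pg_dump, ad-hoc psql sessions, and similar
	return poolSize*serverReplicas + headroom
}

func main() {
	// Default pool (50) on a single server.
	fmt.Println(minPostgresMaxConns(50, 1)) // 70
	// Two servers, each with a 100-connection pool.
	fmt.Println(minPostgresMaxConns(100, 2)) // 220
}
```

With a single server, every row of the ladder sits above this floor. Running two servers at the 100-connection pool size would put the floor at 220, above the 200 listed for that row, so the Postgres value needs to rise with the replica count.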
+
+**Numbers above the small-fleet row are operator-tuning starting
+points, not validated ceilings.** Phase 8 of the architecture diligence
+remediation will replace these with measured values from synthetic
+fleets; until then, capture your own observations in a loadtest log
+and tune against them.
+
+## Scheduler tick budgets
+
+certctl has 15 scheduler loops, each with its own cadence
+(`internal/scheduler/scheduler.go`). The renewal scan is the hottest
+loop on large fleets: it pulls every managed certificate, applies
+each profile's renewal policy, and dispatches an issuance job per
+cert that meets the threshold. The default cadence is `1h`
+(`CERTCTL_SCHEDULER_RENEWAL_CHECK_INTERVAL`).
+
+Phase 6 SCALE-M5 closure (2026-05-14) added per-ticker jitter via the
+`internal/scheduler.JitteredTicker` wrapper. Each loop's interval is
+unchanged; the wrapper adds ±10% randomized delay per tick so multiple
+loops with the same nominal cadence don't co-fire and cause hour-
+boundary CPU + DB spikes. For most fleets the visible effect is a
+smoother CPU graph during the renewal scan.
+
+**Renewal-sweep semaphore (SCALE-L1).** The renewal loop dispatches
+concurrent issuance work behind a per-tick semaphore (default
+`CERTCTL_RENEWAL_CONCURRENCY=25`). Under tick-budget pressure (a tick
+that exceeds the loop interval), the semaphore can hold the entire
+concurrency cap until the context cancels at next-tick boundary —
+which is intentional. The drain happens via context cancellation; new
+work isn't started past the deadline. Tests in
+`internal/scheduler/` pin this drain behavior. Operators on large
+fleets should:
+
+1. Bump `CERTCTL_RENEWAL_CONCURRENCY` to 50 or 100 if the renewal scan
+   consistently exceeds tick budget.
+2. Also bump `CERTCTL_DATABASE_MAX_CONNS` proportionally — each
+   concurrent renewal task opens its own pool connection during
+   issuance / deployment.
+3. Watch for the "renewal scan complete" log line per tick. If it's
+   consistently late, you're under-provisioned.
+
+## Async CA polling budgets (SCALE-M3)
+
+DigiCert, Entrust, GlobalSign, and Sectigo are async issuers — they
+accept a CSR, queue it on the CA side, and return a polling token.
+The certctl server polls the CA's status endpoint until the cert is
+ready or the deadline expires. The default poll-deadline is 10
+minutes wall-clock (`asyncpoll.DefaultMaxWait`); after that the
+issuance returns `StillPending` and the scheduler re-enqueues the
+job for the next tick.
+
+Priority chain when picking the actual deadline (highest → lowest):
+
+1. Per-connector env: `CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS`,
+   `CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS`,
+   `CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS`,
+   `CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS`.
+2. Global env: `CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS` (sets the
+   process-wide default for all async-CA connectors that didn't set
+   their per-connector value).
+3. Package const: `asyncpoll.DefaultMaxWait = 10 * time.Minute`.
+
+Operators with slow async CAs (Entrust certificate-mode in
+particular can take 15-30 minutes during business hours) should
+raise the per-connector value rather than the global; that way fast
+issuers don't pay the polling cost.
+
+## Cursor pagination caching (SCALE-L2)
+
+Phase 6 SCALE-L2 closure (2026-05-14) added an ETag middleware at
+`internal/api/middleware/etag.go` covering the top-5 read endpoints:
+`/api/v1/certificates`, `/api/v1/jobs`, `/api/v1/agents`,
+`/api/v1/audit`, `/api/v1/discovered-certificates`. The ETag is a
+SHA-256 hash of the buffered response body; a repeated `GET` or
+`HEAD` whose `If-None-Match` header carries the current ETag returns
+`304 Not Modified` when the response is byte-identical. The
+dashboard benefits most: its polling loop on the certificates page
+is the single largest read-traffic source on most deployments.
+
+A 304 saves response bandwidth and client-side parsing; the handler
+and its queries still run so that the body can be hashed. Any
+mutation that changes the response body yields a new ETag, so the
+next request gets a fresh 200. Non-2xx responses and responses
+larger than the 64 KiB buffer cap pass through without caching.
+
+Operators don't need to do anything to opt in — the middleware is
+wired around the top-5 endpoints unconditionally. If you want to
+verify it's working, check the `ETag:` response header on a list
+endpoint and repeat the request with the same value in an
+`If-None-Match:` header — the second request should return 304 with
+an empty body.
+
+## Profiling production
+
+When the above ladder doesn't fit your shape, profile against your
+specific workload. The
+[performance-baselines.md](performance-baselines.md) runbook has
+single-endpoint, inventory-walk, and renewal-scan recipes you can
+adapt.
+
+## Related reading
+
+- [`docs/operator/performance-baselines.md`](performance-baselines.md) —
+  per-endpoint baselines + how to re-baseline after upgrades.
+- [`docs/operator/runbooks/postgres-backup.md`](runbooks/postgres-backup.md) —
+  Postgres-side backup discipline (necessary precondition for any
+  scale tuning).
+- [`deploy/ENVIRONMENTS.md`](../../deploy/ENVIRONMENTS.md) — the
+  full env-var inventory the values referenced above come from.