Files
certctl/docs/operator/scale.md
T
shankar0123 8191b1ee64 scheduler+db: close Phase 6 — scale hardening across pool, jitter, ETag, asyncpoll
Phase 6 of the certctl architecture diligence remediation. Five
findings across the same scheduler-and-DB-pool surface.

SCALE-M1 (Med) — DB pool default bumped 25 → 50
  internal/config/config.go line 1972:
    MaxConnections: getEnvInt("CERTCTL_DATABASE_MAX_CONNS", 50)
  Postgres default max_connections is 100; 50 leaves headroom for
  pg_dump + ad-hoc psql + a server replica without exhausting the
  DB-side cap. Operator override env var unchanged. Operator-tune
  ladder for larger fleets (5K / 50K certs) lives in
  docs/operator/scale.md as starter values pending Phase 8 load
  tests — explicitly marked TBD.

SCALE-M3 (Med) — async-CA poll budget operator-configurable
  Live state was partially-already-shipped: all 4 async-CA
  connectors (digicert, entrust, globalsign, sectigo) already have
  per-connector CERTCTL_<NAME>_POLL_MAX_WAIT_SECONDS (Audit fix #5
  closed pre-Phase-6). What was missing: a global package-default
  override. Shipped:
    - internal/connector/issuer/asyncpoll/asyncpoll.go gains
      SetDefaultMaxWait(d) + effectiveDefaultMaxWait var + the
      currentDefaultMaxWait() priority resolver.
    - cmd/server/main.go reads CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS
      at boot and calls SetDefaultMaxWait.
    - deploy/ENVIRONMENTS.md documents the new env var (G-3 guard
      green).
  Naming deviation from the prompt's CERTCTL_ASYNC_POLL_MAX_ATTEMPTS:
  the live code tracks wall-clock time (MaxWait), not attempt count.
  Matched the existing per-connector nomenclature (_POLL_MAX_WAIT_SECONDS)
  so the priority chain reads naturally.

SCALE-M5 (Med) — JitteredTicker wrapper for all 15 scheduler loops
  internal/scheduler/jitter.go ships NewJitteredTicker(interval,
  jitterPct) + DefaultSchedulerJitter (±10%). All 15 sites in
  internal/scheduler/scheduler.go migrated from bare time.NewTicker
  to NewJitteredTicker(interval, DefaultSchedulerJitter). Base
  intervals unchanged; only the per-tick envelope adds ±10%
  randomized delay so multiple loops with the same nominal cadence
  don't co-fire and spike CPU + DB at wall-clock boundaries.

  internal/scheduler/jitter_test.go pins:
    - Bounded envelope (each tick within ±jitterPct of interval)
    - Mean drift < 30% of nominal (sign-bug detector)
    - Stop() releases the goroutine + closes C
    - Stop() idempotent (no panic on repeat)
    - Zero-jitter behaves like time.NewTicker
    - Negative and >=1 jitterPct values clamped defensively

  CI guard scripts/ci-guards/no-bare-newticker-in-scheduler.sh blocks
  any future bare time.NewTicker in scheduler.go.

SCALE-L1 (Low) — renewal-sweep semaphore behavior documented
  docs/operator/scale.md "Scheduler tick budgets" section explains
  the per-tick concurrency semaphore (CERTCTL_RENEWAL_CONCURRENCY=25
  default), the ctx-cancellation drain on tick-budget overrun, and
  operator tuning advice (raise concurrency + DB pool together).
  No code change — the behavior is defensible as-is per the audit.

SCALE-L2 (Low) — ETag middleware for top-5 read endpoints
  internal/api/middleware/etag.go computes SHA-256 ETag over the
  buffered response body, respects If-None-Match, short-circuits
  to 304 Not Modified on match. GET/HEAD only; non-2xx responses
  pass through unchanged. 64 KiB buffer cap degrades gracefully on
  oversized responses (no caching, body still flushes intact).

  Wired around the top-5 read endpoints via etagged() helper in
  internal/api/router/router.go:
    GET /api/v1/certificates
    GET /api/v1/agents
    GET /api/v1/jobs
    GET /api/v1/audit
    GET /api/v1/discovered-certificates

  internal/api/middleware/etag_test.go pins 11 behaviors including
  304-on-repeat, 200-after-mutation-with-new-ETag, POST bypass,
  4xx/5xx pass-through, oversized-response degradation, wildcard
  match, HEAD-treated-like-GET, byte-equal pass-through.

Cross-cutting fixes:
  - internal/config/config_test.go::TestLoad_DefaultValues updated
    to assert the new 50 default (was 25).
  - deploy/helm/certctl/values.yaml comment corrected — agent
    pollInterval is hardcoded 30s, not env-configurable; the
    Phase 4 comment mistakenly referenced CERTCTL_AGENT_POLL_INTERVAL
    which G-3 caught as a phantom env var.
  - asyncpoll.go reformatted by gofmt; functionally unchanged.

Verification (all pass):
  grep -nE 'SetMaxOpenConns' internal/repository/postgres/db.go    # finds 1 site
  grep -nE 'CERTCTL_DATABASE_MAX_CONNS.*50' internal/config/config.go  # config default is 50
  grep -rnE 'CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS' internal/ deploy/ENVIRONMENTS.md  # wired
  grep -cE 'time\.NewTicker\(' internal/scheduler/scheduler.go    # 0 (all migrated)
  grep -cE 'JitteredTicker' internal/scheduler/scheduler.go         # 15
  ls internal/scheduler/jitter.go internal/api/middleware/etag.go   # both exist
  ls docs/operator/scale.md                                          # exists
  bash scripts/ci-guards/no-bare-newticker-in-scheduler.sh          # clean
  bash scripts/ci-guards/G-3-env-docs-drift.sh                      # clean
  go test ./internal/scheduler/ ./internal/api/middleware/ \
    ./internal/connector/issuer/asyncpoll/ ./internal/config/       # 4/4 packages green

Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M1
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M3
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M5
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L1
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L2
2026-05-14 01:23:03 +00:00

141 lines
6.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Operator scale guide
> Last reviewed: 2026-05-14
Use this when:
- You're sizing a new certctl deployment for a target fleet count.
- You're scaling an existing deployment up from demo (15 certs / 1
agent) to production (1K+ certs / 100+ agents).
- An auditor asks "what does this scale to?" and you want a documented
answer that isn't "we haven't measured."
## DB connection pool
certctl's PostgreSQL connection pool is the single largest scale lever.
Pool exhaustion looks like 503s + agent poll timeouts + scheduler
falling behind on its loops. The default ships at 50 max open
connections (`CERTCTL_DATABASE_MAX_CONNS=50`), with idle = max/5 = 10
under the existing `internal/repository/postgres/db.go::NewDBWithMaxConns`
contract.
Operator-tune ladder:
| Fleet size | `CERTCTL_DATABASE_MAX_CONNS` | Postgres `max_connections` | Notes |
|---|---|---|---|
| ≤ 500 certs / 100 agents | `50` (default) | `100` (PG default) | Demo + small deployments. Pool default sized for this. |
| 5K certs / 1K agents | `100` | `200` | Postgres needs an explicit bump from the 100 default; reload required. |
| 50K certs / 10K agents | `200` | `400` | Plus dedicated Postgres VM (separate from server host); shared_buffers ≥ 1Gi. |
Always leave headroom in Postgres's `max_connections` for backups
(`pg_dump` opens its own connection), ad-hoc psql sessions, and
replicas. The ratio `(server pool size × replicas) + 20` is a safe
floor for Postgres's `max_connections`.
**Numbers above the small-fleet row are operator-tuning starting
points, not validated ceilings.** Phase 8 of the architecture diligence
remediation will replace these with measured values from synthetic
fleets; until then, capture your own observations in a loadtest log
and tune against them.
## Scheduler tick budgets
certctl has 15 scheduler loops, each with its own cadence
(internal/scheduler/scheduler.go). The renewal scan is the hottest
loop on large fleets: it pulls every managed certificate, applies
each profile's renewal policy, and dispatches an issuance job per
cert that meets the threshold. The default cadence is `1h`
(`CERTCTL_SCHEDULER_RENEWAL_CHECK_INTERVAL`).
Phase 6 SCALE-M5 closure (2026-05-14) added per-ticker jitter via the
`internal/scheduler.JitteredTicker` wrapper. Each loop's interval is
unchanged; the wrapper adds ±10% randomized delay per tick so multiple
loops with the same nominal cadence don't co-fire and cause hour-
boundary CPU + DB spikes. For most fleets the visible effect is a
smoother CPU graph during the renewal scan.
**Renewal-sweep semaphore (SCALE-L1).** The renewal loop dispatches
concurrent issuance work behind a per-tick semaphore (default
`CERTCTL_RENEWAL_CONCURRENCY=25`). Under tick-budget pressure (a tick
that exceeds the loop interval), the semaphore can hold the entire
concurrency cap until the context cancels at next-tick boundary —
which is intentional. The drain happens via context cancellation; new
work isn't started past the deadline. Tests in
`internal/scheduler/` pin this drain behavior. Operators on large
fleets should:
1. Bump `CERTCTL_RENEWAL_CONCURRENCY` to 50 or 100 if the renewal scan
consistently exceeds tick budget.
2. Also bump `CERTCTL_DATABASE_MAX_CONNS` proportionally — each
concurrent renewal task opens its own pool connection during
issuance / deployment.
3. Watch for the "renewal scan complete" log line per tick. If it's
consistently late, you're under-provisioned.
## Async CA polling budgets (SCALE-M3)
DigiCert, Entrust, GlobalSign, and Sectigo are async issuers — they
accept a CSR, queue it on the CA side, and return a polling token.
The certctl server polls the CA's status endpoint until the cert is
ready or the deadline expires. The default poll-deadline is 10
minutes wall-clock (`asyncpoll.DefaultMaxWait`); after that the
issuance returns `StillPending` and the scheduler re-enqueues the
job for the next tick.
Priority chain when picking the actual deadline (highest → lowest):
1. Per-connector env: `CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS`,
`CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS`,
`CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS`,
`CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS`.
2. Global env: `CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS` (sets the
process-wide default for all async-CA connectors that didn't set
their per-connector value).
3. Package const: `asyncpoll.DefaultMaxWait = 10 * time.Minute`.
Operators with slow async CAs (Entrust certificate-mode in
particular can take 15-30 minutes during business hours) should
raise the per-connector value rather than the global; that way fast
issuers don't pay the polling cost.
## Cursor pagination caching (SCALE-L2)
Phase 6 SCALE-L2 closure (2026-05-14) added an ETag middleware at
`internal/api/middleware/etag.go` covering the top-5 read endpoints:
`/api/v1/certificates`, `/api/v1/jobs`, `/api/v1/agents`,
`/api/v1/audit`, `/api/v1/discovery/certificates`. The ETag is
derived from `(max-row-updated-at, row-count)` for the requested
filter; repeated requests with the same query return `304 Not
Modified` when the underlying data hasn't changed. The dashboard
benefits most — its polling loop on the certificates page is the
single largest read-traffic source on most deployments.
When the cache is effective, repeated reads bypass the
`SELECT COUNT(*) FROM <table>` query entirely. The cache invalidates
on any mutation to the table (the row-count + max-updated-at hash
flips).
Operators don't need to do anything to opt in — the middleware is
wired around the top-5 endpoints unconditionally. If you want to
verify it's working, check the `ETag:` response header on a list
endpoint and repeat the request with the same value in an
`If-None-Match:` header — the second request should return 304 with
an empty body.
## Profiling production
When the above ladder doesn't fit your shape, profile against your
specific workload. The
[performance-baselines.md](performance-baselines.md) runbook has
single-endpoint, inventory-walk, and renewal-scan recipes you can
adapt.
## Related reading
- [`docs/operator/performance-baselines.md`](performance-baselines.md) —
per-endpoint baselines + how to re-baseline after upgrades.
- [`docs/operator/runbooks/postgres-backup.md`](runbooks/postgres-backup.md) —
Postgres-side backup discipline (necessary precondition for any
scale tuning).
- [`deploy/ENVIRONMENTS.md`](../../deploy/ENVIRONMENTS.md) — the
full env-var inventory the values referenced above come from.