certctl/docs/operator/scale.md
shankar0123 8191b1ee64 scheduler+db: close Phase 6 — scale hardening across pool, jitter, ETag, asyncpoll
Phase 6 of the certctl architecture diligence remediation. Five
findings across the same scheduler-and-DB-pool surface.

SCALE-M1 (Med) — DB pool default bumped 25 → 50
  internal/config/config.go line 1972:
    MaxConnections: getEnvInt("CERTCTL_DATABASE_MAX_CONNS", 50)
  Postgres default max_connections is 100; 50 leaves headroom for
  pg_dump + ad-hoc psql + a server replica without exhausting the
  DB-side cap. Operator override env var unchanged. Operator-tune
  ladder for larger fleets (5K / 50K certs) lives in
  docs/operator/scale.md as starter values pending Phase 8 load
  tests — explicitly marked TBD.

SCALE-M3 (Med) — async-CA poll budget operator-configurable
  Live state was partially-already-shipped: all 4 async-CA
  connectors (digicert, entrust, globalsign, sectigo) already have
  per-connector CERTCTL_<NAME>_POLL_MAX_WAIT_SECONDS (Audit fix #5
  closed pre-Phase-6). What was missing: a global package-default
  override. Shipped:
    - internal/connector/issuer/asyncpoll/asyncpoll.go gains
      SetDefaultMaxWait(d) + effectiveDefaultMaxWait var + the
      currentDefaultMaxWait() priority resolver.
    - cmd/server/main.go reads CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS
      at boot and calls SetDefaultMaxWait.
    - deploy/ENVIRONMENTS.md documents the new env var (G-3 guard
      green).
  Naming deviation from the prompt's CERTCTL_ASYNC_POLL_MAX_ATTEMPTS:
  the live code tracks wall-clock time (MaxWait), not attempt count.
  Matched the existing per-connector nomenclature (_POLL_MAX_WAIT_SECONDS)
  so the priority chain reads naturally.

SCALE-M5 (Med) — JitteredTicker wrapper for all 15 scheduler loops
  internal/scheduler/jitter.go ships NewJitteredTicker(interval,
  jitterPct) + DefaultSchedulerJitter (±10%). All 15 sites in
  internal/scheduler/scheduler.go migrated from bare time.NewTicker
  to NewJitteredTicker(interval, DefaultSchedulerJitter). Base
  intervals unchanged; only the per-tick envelope adds ±10%
  randomized delay so multiple loops with the same nominal cadence
  don't co-fire and spike CPU + DB at wall-clock boundaries.

  internal/scheduler/jitter_test.go pins:
    - Bounded envelope (each tick within ±jitterPct of interval)
    - Mean drift < 30% of nominal (sign-bug detector)
    - Stop() releases the goroutine + closes C
    - Stop() idempotent (no panic on repeat)
    - Zero-jitter behaves like time.NewTicker
    - Negative and >=1 jitterPct values clamped defensively

  CI guard scripts/ci-guards/no-bare-newticker-in-scheduler.sh blocks
  any future bare time.NewTicker in scheduler.go.

SCALE-L1 (Low) — renewal-sweep semaphore behavior documented
  docs/operator/scale.md "Scheduler tick budgets" section explains
  the per-tick concurrency semaphore (CERTCTL_RENEWAL_CONCURRENCY=25
  default), the ctx-cancellation drain on tick-budget overrun, and
  operator tuning advice (raise concurrency + DB pool together).
  No code change — the behavior is defensible as-is per the audit.

SCALE-L2 (Low) — ETag middleware for top-5 read endpoints
  internal/api/middleware/etag.go computes SHA-256 ETag over the
  buffered response body, respects If-None-Match, short-circuits
  to 304 Not Modified on match. GET/HEAD only; non-2xx responses
  pass through unchanged. 64 KiB buffer cap degrades gracefully on
  oversized responses (no caching, body still flushes intact).

  Wired around the top-5 read endpoints via etagged() helper in
  internal/api/router/router.go:
    GET /api/v1/certificates
    GET /api/v1/agents
    GET /api/v1/jobs
    GET /api/v1/audit
    GET /api/v1/discovered-certificates

  internal/api/middleware/etag_test.go pins 11 behaviors including
  304-on-repeat, 200-after-mutation-with-new-ETag, POST bypass,
  4xx/5xx pass-through, oversized-response degradation, wildcard
  match, HEAD-treated-like-GET, byte-equal pass-through.

Cross-cutting fixes:
  - internal/config/config_test.go::TestLoad_DefaultValues updated
    to assert the new 50 default (was 25).
  - deploy/helm/certctl/values.yaml comment corrected — agent
    pollInterval is hardcoded 30s, not env-configurable; the
    Phase 4 comment mistakenly referenced CERTCTL_AGENT_POLL_INTERVAL
    which G-3 caught as a phantom env var.
  - asyncpoll.go reformatted by gofmt; functionally unchanged.

Verification (all pass):
  grep -nE 'SetMaxOpenConns' internal/repository/postgres/db.go    # finds 1 site
  grep -nE 'CERTCTL_DATABASE_MAX_CONNS.*50' internal/config/config.go  # config default is 50
  grep -rnE 'CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS' internal/ deploy/ENVIRONMENTS.md  # wired
  grep -cE 'time\.NewTicker\(' internal/scheduler/scheduler.go    # 0 (all migrated)
  grep -cE 'JitteredTicker' internal/scheduler/scheduler.go         # 15
  ls internal/scheduler/jitter.go internal/api/middleware/etag.go   # both exist
  ls docs/operator/scale.md                                          # exists
  bash scripts/ci-guards/no-bare-newticker-in-scheduler.sh          # clean
  bash scripts/ci-guards/G-3-env-docs-drift.sh                      # clean
  go test ./internal/scheduler/ ./internal/api/middleware/ \
    ./internal/connector/issuer/asyncpoll/ ./internal/config/       # 4/4 packages green

Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M1
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M3
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M5
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L1
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L2
2026-05-14 01:23:03 +00:00


Operator scale guide

Last reviewed: 2026-05-14

Use this when:

  • You're sizing a new certctl deployment for a target fleet count.
  • You're scaling an existing deployment up from demo (15 certs / 1 agent) to production (1K+ certs / 100+ agents).
  • An auditor asks "what does this scale to?" and you want a documented answer that isn't "we haven't measured."

DB connection pool

certctl's PostgreSQL connection pool is the single largest scale lever. Pool exhaustion looks like 503s + agent poll timeouts + scheduler falling behind on its loops. The default ships at 50 max open connections (CERTCTL_DATABASE_MAX_CONNS=50), with idle = max/5 = 10 under the existing internal/repository/postgres/db.go::NewDBWithMaxConns contract.

Operator-tune ladder:

| Fleet size | CERTCTL_DATABASE_MAX_CONNS | Postgres max_connections | Notes |
|---|---|---|---|
| ≤ 500 certs / 100 agents | 50 (default) | 100 (PG default) | Demo + small deployments. Pool default sized for this. |
| 5K certs / 1K agents | 100 | 200 | Postgres needs an explicit bump from the 100 default; a Postgres restart is required (max_connections cannot be changed by a reload). |
| 50K certs / 10K agents | 200 | 400 | Plus dedicated Postgres VM (separate from server host); shared_buffers ≥ 1GB. |

Always leave headroom in Postgres's max_connections for backups (pg_dump opens its own connection), ad-hoc psql sessions, and replicas. A safe floor for Postgres's max_connections is (server pool size × number of server replicas) + 20.
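The arithmetic behind the ladder can be sketched in a few lines of Go. This is an illustration only: the function names are hypothetical, the idle rule is the max/5 contract quoted above, and the headroom of 20 is the figure from the floor formula, not a measured value.

```go
package main

import "fmt"

// idleConns mirrors the documented "idle = max/5" rule.
// The function name is hypothetical, not certctl's API.
func idleConns(maxConns int) int {
	return maxConns / 5
}

// pgMaxConnectionsFloor applies the floor formula from this guide:
// (server pool size x server replicas) + headroom.
func pgMaxConnectionsFloor(poolSize, serverReplicas, headroom int) int {
	return poolSize*serverReplicas + headroom
}

func main() {
	const headroom = 20 // pg_dump, ad-hoc psql sessions, and similar
	for _, pool := range []int{50, 100, 200} {
		fmt.Printf("pool=%d idle=%d floor(1 replica)=%d floor(2 replicas)=%d\n",
			pool,
			idleConns(pool),
			pgMaxConnectionsFloor(pool, 1, headroom),
			pgMaxConnectionsFloor(pool, 2, headroom))
	}
}
```

With the default pool of 50 and a single server replica the floor is 70, which fits under the Postgres default of 100. Two replicas at the same pool size need 120, so Postgres's max_connections has to be raised first.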

Numbers above the small-fleet row are operator-tuning starting points, not validated ceilings. Phase 8 of the architecture diligence remediation will replace these with measured values from synthetic fleets; until then, capture your own observations in a loadtest log and tune against them.

Scheduler tick budgets

certctl has 15 scheduler loops, each with its own cadence (internal/scheduler/scheduler.go). The renewal scan is the hottest loop on large fleets: it pulls every managed certificate, applies each profile's renewal policy, and dispatches an issuance job per cert that meets the threshold. The default cadence is 1h (CERTCTL_SCHEDULER_RENEWAL_CHECK_INTERVAL).

Phase 6 SCALE-M5 closure (2026-05-14) added per-ticker jitter via the internal/scheduler.JitteredTicker wrapper. Each loop's interval is unchanged; the wrapper adds ±10% randomized delay per tick so multiple loops with the same nominal cadence don't co-fire and cause hour-boundary CPU + DB spikes. For most fleets the visible effect is a smoother CPU graph during the renewal scan.

Renewal-sweep semaphore (SCALE-L1). The renewal loop dispatches concurrent issuance work behind a per-tick semaphore (default CERTCTL_RENEWAL_CONCURRENCY=25). When a tick overruns its budget (the sweep takes longer than the loop interval), in-flight tasks can hold every semaphore slot until the tick's context is cancelled at the next-tick boundary. This is intentional: the sweep drains via context cancellation, and no new work starts past the deadline. Tests in internal/scheduler/ pin this drain behavior. Operators on large fleets should:

  1. Bump CERTCTL_RENEWAL_CONCURRENCY to 50 or 100 if the renewal scan consistently exceeds tick budget.
  2. Also bump CERTCTL_DATABASE_MAX_CONNS proportionally, because each concurrent renewal task takes its own pool connection during issuance / deployment.
  3. Watch for the "renewal scan complete" log line per tick. If it's consistently late, you're under-provisioned.

Async CA polling budgets (SCALE-M3)

DigiCert, Entrust, GlobalSign, and Sectigo are async issuers — they accept a CSR, queue it on the CA side, and return a polling token. The certctl server polls the CA's status endpoint until the cert is ready or the deadline expires. The default poll-deadline is 10 minutes wall-clock (asyncpoll.DefaultMaxWait); after that the issuance returns StillPending and the scheduler re-enqueues the job for the next tick.

Priority chain when picking the actual deadline (highest → lowest):

  1. Per-connector env: CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS, CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS, CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS, CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS.
  2. Global env: CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS (sets the process-wide default for all async-CA connectors that didn't set their per-connector value).
  3. Package const: asyncpoll.DefaultMaxWait = 10 * time.Minute.

Operators with slow async CAs (Entrust certificate-mode in particular can take 15-30 minutes during business hours) should raise the per-connector value rather than the global; that way fast issuers don't pay the polling cost.

Cursor pagination caching (SCALE-L2)

Phase 6 SCALE-L2 closure (2026-05-14) added an ETag middleware at internal/api/middleware/etag.go covering the top-5 read endpoints: /api/v1/certificates, /api/v1/agents, /api/v1/jobs, /api/v1/audit, /api/v1/discovered-certificates. Per the closing commit, the ETag is a SHA-256 hash of the buffered response body; a repeated GET or HEAD request whose If-None-Match value matches returns 304 Not Modified when the response hasn't changed. Non-2xx responses pass through unchanged, and responses above the 64 KiB buffer cap are sent intact without caching. The dashboard benefits most: its polling loop on the certificates page is the single largest read-traffic source on most deployments.

Because the ETag is computed from the response body, the handler still runs its queries on every request; what a 304 saves is the body transfer to the client. Any mutation that changes the response body produces a new ETag, so the next request returns 200 with the fresh body.

Operators don't need to do anything to opt in — the middleware is wired around the top-5 endpoints unconditionally. If you want to verify it's working, check the ETag: response header on a list endpoint and repeat the request with the same value in an If-None-Match: header — the second request should return 304 with an empty body.

Profiling production

When the above ladder doesn't fit your shape, profile against your specific workload. The performance-baselines.md runbook has single-endpoint, inventory-walk, and renewal-scan recipes you can adapt.