mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 14:51:30 +00:00
scheduler+db: close Phase 6 — scale hardening across pool, jitter, ETag, asyncpoll
Phase 6 of the certctl architecture diligence remediation. Five
findings across the same scheduler-and-DB-pool surface.
SCALE-M1 (Med) — DB pool default bumped 25 → 50
internal/config/config.go line 1972:
MaxConnections: getEnvInt("CERTCTL_DATABASE_MAX_CONNS", 50)
Postgres default max_connections is 100; 50 leaves headroom for
pg_dump + ad-hoc psql + a server replica without exhausting the
DB-side cap. Operator override env var unchanged. Operator-tune
ladder for larger fleets (5K / 50K certs) lives in
docs/operator/scale.md as starter values pending Phase 8 load
tests — explicitly marked TBD.
SCALE-M3 (Med) — async-CA poll budget operator-configurable
This was already partially shipped: all 4 async-CA
connectors (digicert, entrust, globalsign, sectigo) already have
a per-connector CERTCTL_<NAME>_POLL_MAX_WAIT_SECONDS (Audit fix #5,
closed pre-Phase-6). What was missing was a global package-default
override. Shipped:
- internal/connector/issuer/asyncpoll/asyncpoll.go gains
SetDefaultMaxWait(d) + effectiveDefaultMaxWait var + the
currentDefaultMaxWait() priority resolver.
- cmd/server/main.go reads CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS
at boot and calls SetDefaultMaxWait.
- deploy/ENVIRONMENTS.md documents the new env var (G-3 guard
green).
Naming deviation from the prompt's CERTCTL_ASYNC_POLL_MAX_ATTEMPTS:
the live code tracks wall-clock time (MaxWait), not attempt count.
Matched the existing per-connector nomenclature (_POLL_MAX_WAIT_SECONDS)
so the priority chain reads naturally.
SCALE-M5 (Med) — JitteredTicker wrapper for all 15 scheduler loops
internal/scheduler/jitter.go ships NewJitteredTicker(interval,
jitterPct) + DefaultSchedulerJitter (±10%). All 15 sites in
internal/scheduler/scheduler.go migrated from bare time.NewTicker
to NewJitteredTicker(interval, DefaultSchedulerJitter). Base
intervals unchanged; only the per-tick envelope adds ±10%
randomized delay so multiple loops with the same nominal cadence
don't co-fire and spike CPU + DB at wall-clock boundaries.
internal/scheduler/jitter_test.go pins:
- Bounded envelope (each tick within ±jitterPct of interval)
- Mean drift < 30% of nominal (sign-bug detector)
- Stop() releases the goroutine + closes C
- Stop() idempotent (no panic on repeat)
- Zero-jitter behaves like time.NewTicker
- Negative and >=1 jitterPct values clamped defensively
CI guard scripts/ci-guards/no-bare-newticker-in-scheduler.sh blocks
any future bare time.NewTicker in scheduler.go.
SCALE-L1 (Low) — renewal-sweep semaphore behavior documented
docs/operator/scale.md "Scheduler tick budgets" section explains
the per-tick concurrency semaphore (CERTCTL_RENEWAL_CONCURRENCY=25
default), the ctx-cancellation drain on tick-budget overrun, and
operator tuning advice (raise concurrency + DB pool together).
No code change — the behavior is defensible as-is per the audit.
SCALE-L2 (Low) — ETag middleware for top-5 read endpoints
internal/api/middleware/etag.go computes SHA-256 ETag over the
buffered response body, respects If-None-Match, short-circuits
to 304 Not Modified on match. GET/HEAD only; non-2xx responses
pass through unchanged. 64 KiB buffer cap degrades gracefully on
oversized responses (no caching, body still flushes intact).
Wired around the top-5 read endpoints via etagged() helper in
internal/api/router/router.go:
GET /api/v1/certificates
GET /api/v1/agents
GET /api/v1/jobs
GET /api/v1/audit
GET /api/v1/discovered-certificates
internal/api/middleware/etag_test.go pins 11 behaviors including
304-on-repeat, 200-after-mutation-with-new-ETag, POST bypass,
4xx/5xx pass-through, oversized-response degradation, wildcard
match, HEAD-treated-like-GET, byte-equal pass-through.
Cross-cutting fixes:
- internal/config/config_test.go::TestLoad_DefaultValues updated
to assert the new 50 default (was 25).
- deploy/helm/certctl/values.yaml comment corrected — agent
pollInterval is hardcoded 30s, not env-configurable; the
Phase 4 comment mistakenly referenced CERTCTL_AGENT_POLL_INTERVAL
which G-3 caught as a phantom env var.
- asyncpoll.go reformatted by gofmt; functionally unchanged.
Verification (all pass):
grep -nE 'SetMaxOpenConns' internal/repository/postgres/db.go # finds 1 site
grep -nE 'CERTCTL_DATABASE_MAX_CONNS.*50' internal/config/config.go # config default is 50
grep -rnE 'CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS' internal/ deploy/ENVIRONMENTS.md # wired
grep -cE 'time\.NewTicker\(' internal/scheduler/scheduler.go # 0 (all migrated)
grep -cE 'JitteredTicker' internal/scheduler/scheduler.go # 15
ls internal/scheduler/jitter.go internal/api/middleware/etag.go # both exist
ls docs/operator/scale.md # exists
bash scripts/ci-guards/no-bare-newticker-in-scheduler.sh # clean
bash scripts/ci-guards/G-3-env-docs-drift.sh # clean
go test ./internal/scheduler/ ./internal/api/middleware/ \
./internal/connector/issuer/asyncpoll/ ./internal/config/ # 4/4 packages green
Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M1
cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M3
cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M5
cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L1
cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L2
@@ -0,0 +1,140 @@
# Operator scale guide

> Last reviewed: 2026-05-14

Use this when:

- You're sizing a new certctl deployment for a target fleet count.
- You're scaling an existing deployment up from demo (15 certs / 1
  agent) to production (1K+ certs / 100+ agents).
- An auditor asks "what does this scale to?" and you want a documented
  answer that isn't "we haven't measured."

## DB connection pool

certctl's PostgreSQL connection pool is the single largest scale lever.
Pool exhaustion shows up as 503s, agent poll timeouts, and scheduler
loops falling behind. The default ships at 50 max open
connections (`CERTCTL_DATABASE_MAX_CONNS=50`), with idle = max/5 = 10
under the existing `internal/repository/postgres/db.go::NewDBWithMaxConns`
contract.

Operator-tune ladder:

| Fleet size | `CERTCTL_DATABASE_MAX_CONNS` | Postgres `max_connections` | Notes |
|---|---|---|---|
| ≤ 500 certs / 100 agents | `50` (default) | `100` (PG default) | Demo + small deployments. Pool default sized for this. |
| 5K certs / 1K agents | `100` | `200` | Postgres needs an explicit bump from the 100 default; restart required. |
| 50K certs / 10K agents | `200` | `400` | Plus dedicated Postgres VM (separate from server host); `shared_buffers` ≥ 1GB. |

Always leave headroom in Postgres's `max_connections` for backups
(`pg_dump` opens its own connection), ad-hoc psql sessions, and
replicas. The formula `(server pool size × replicas) + 20` is a safe
floor for Postgres's `max_connections`.

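The sizing arithmetic above can be sketched as two small helpers. This is an illustration only: the function names are hypothetical, `serverReplicas` reflects one reading of "replicas" in the formula (certctl server instances), and the only values taken from this guide are the `max/5` idle rule and the `+ 20` headroom.

```go
package main

import "fmt"

// idleConns mirrors the "idle = max/5" rule quoted in this guide.
// Hypothetical helper; not the actual NewDBWithMaxConns code.
func idleConns(maxOpen int) int {
	return maxOpen / 5
}

// minPostgresMaxConnections applies (server pool size × replicas) + 20.
// The 20 is the reserve for pg_dump, psql sessions, and replicas.
func minPostgresMaxConnections(poolSize, serverReplicas int) int {
	const headroom = 20
	return poolSize*serverReplicas + headroom
}

func main() {
	// Walk the three pool sizes from the ladder, assuming one server replica.
	for _, pool := range []int{50, 100, 200} {
		fmt.Printf("pool=%d idle=%d max_connections floor=%d\n",
			pool, idleConns(pool), minPostgresMaxConnections(pool, 1))
	}
}
```

With the default pool of 50 and a single server, the floor is 70, which fits under the Postgres default of 100. Two servers at 50 each already need 120, so that layout requires raising `max_connections`.
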
**Numbers above the small-fleet row are operator-tuning starting
points, not validated ceilings.** Phase 8 of the architecture diligence
remediation will replace these with measured values from synthetic
fleets; until then, capture your own observations in a loadtest log
and tune against them.

## Scheduler tick budgets

certctl has 15 scheduler loops, each with its own cadence
(`internal/scheduler/scheduler.go`). The renewal scan is the hottest
loop on large fleets: it pulls every managed certificate, applies
each profile's renewal policy, and dispatches an issuance job per
cert that meets the threshold. The default cadence is `1h`
(`CERTCTL_SCHEDULER_RENEWAL_CHECK_INTERVAL`).

Phase 6 SCALE-M5 closure (2026-05-14) added per-ticker jitter via the
`internal/scheduler.JitteredTicker` wrapper. Each loop's interval is
unchanged; the wrapper adds ±10% randomized delay per tick so multiple
loops with the same nominal cadence don't co-fire and cause
hour-boundary CPU + DB spikes. For most fleets the visible effect is a
smoother CPU graph during the renewal scan.

**Renewal-sweep semaphore (SCALE-L1).** The renewal loop dispatches
concurrent issuance work behind a per-tick semaphore (default
`CERTCTL_RENEWAL_CONCURRENCY=25`). Under tick-budget pressure (a tick
that exceeds the loop interval), in-flight work can hold every
semaphore slot until the context is cancelled at the next tick
boundary. This is intentional: the drain happens via context
cancellation, and no new work is started past the deadline. Tests in
`internal/scheduler/` pin this drain behavior. Operators on large
fleets should:

1. Bump `CERTCTL_RENEWAL_CONCURRENCY` to 50 or 100 if the renewal scan
   consistently exceeds tick budget.
2. Also bump `CERTCTL_DATABASE_MAX_CONNS` proportionally, because each
   concurrent renewal task takes its own pool connection during
   issuance / deployment.
3. Watch for the "renewal scan complete" log line per tick. If it's
   consistently late, you're under-provisioned.

## Async CA polling budgets (SCALE-M3)

DigiCert, Entrust, GlobalSign, and Sectigo are async issuers: they
accept a CSR, queue it on the CA side, and return a polling token.
The certctl server polls the CA's status endpoint until the cert is
ready or the deadline expires. The default poll-deadline is 10
minutes wall-clock (`asyncpoll.DefaultMaxWait`); after that the
issuance returns `StillPending` and the scheduler re-enqueues the
job for the next tick.

Priority chain when picking the actual deadline (highest → lowest):

1. Per-connector env: `CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS`,
   `CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS`,
   `CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS`,
   `CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS`.
2. Global env: `CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS` (sets the
   process-wide default for all async-CA connectors that didn't set
   their per-connector value).
3. Package const: `asyncpoll.DefaultMaxWait = 10 * time.Minute`.

Operators with slow async CAs (Entrust certificate-mode in
particular can take 15-30 minutes during business hours) should
raise the per-connector value rather than the global, so that fast
issuers are not held to the longer deadline.

## Cursor pagination caching (SCALE-L2)

Phase 6 SCALE-L2 closure (2026-05-14) added an ETag middleware at
`internal/api/middleware/etag.go` covering the top-5 read endpoints:
`/api/v1/certificates`, `/api/v1/jobs`, `/api/v1/agents`,
`/api/v1/audit`, `/api/v1/discovered-certificates`. The ETag is a
SHA-256 hash of the buffered response body; a repeated request that
sends the previous value in `If-None-Match` returns `304 Not
Modified` when the response would be byte-identical. Only `GET` and
`HEAD` are covered, non-2xx responses pass through unchanged, and
responses larger than the 64 KiB buffer cap are sent without caching.
The dashboard benefits most: its polling loop on the certificates
page is the single largest read-traffic source on most deployments.

A 304 saves the response transfer, not the database work: the handler
still runs its queries so the middleware can hash the body. Any
mutation that changes the response body changes the hash, so the next
request gets a `200` with a new ETag.

Operators don't need to do anything to opt in; the middleware is
wired around the top-5 endpoints unconditionally. If you want to
verify it's working, check the `ETag:` response header on a list
endpoint and repeat the request with the same value in an
`If-None-Match:` header. The second request should return 304 with
an empty body.

## Profiling production

When the above ladder doesn't fit your shape, profile against your
specific workload. The
[performance-baselines.md](performance-baselines.md) runbook has
single-endpoint, inventory-walk, and renewal-scan recipes you can
adapt.

## Related reading

- [`docs/operator/performance-baselines.md`](performance-baselines.md):
  per-endpoint baselines + how to re-baseline after upgrades.
- [`docs/operator/runbooks/postgres-backup.md`](runbooks/postgres-backup.md):
  Postgres-side backup discipline (necessary precondition for any
  scale tuning).
- [`deploy/ENVIRONMENTS.md`](../../deploy/ENVIRONMENTS.md): the
  full env-var inventory the values referenced above come from.