Sprint 5 unified-master-audit closure. Pre-fix:
- docs/operator/scale.md L163-185 held a TBD-laden table with 5
scenario rows. The Phase 8 scenarios shipped 2026-05-14; baseline
capture on canonical hardware was 'the next operational step'
that had not been taken.
- Acquirers + operators asking 'what's the scale ceiling?' got
'TBD' as the in-tree answer.
The audit's fix wanted three things:
1. Capture p50/p95/p99 + error rate + memory profile on a fixed-
spec runner.
2. Replace the scale.md TBD rows with real numbers.
3. Archive k6 artifacts under deploy/test/loadtest-artifacts/.
The actual capture is a workflow_dispatch run the operator triggers
on a real Linux runner — it can't happen from a sandbox without
Docker. What I CAN deliver in this commit is the canonical-record
infrastructure that turns the next workflow run into a baseline that
sticks:
- New docs/operator/scale-baseline-2026-Q2.md is the canonical
record. Documents the three scenarios, the methodology, the
capture procedure, and a 'Latest capture' table with
placeholder rows ready to receive the workflow_dispatch run's
numbers. The doc explicitly defends the 'ubuntu-latest runner'
choice (reproducibility > paid-AWS-account specificity).
- docs/operator/scale.md L163-185 — the TBD table — replaced with
a pointer paragraph to the new baseline file. Per the
canonical-doc-pointer pattern: the operator-posture doc changes
when scenarios change; the baseline doc changes on every
capture. Splitting them avoids review-noise on per-capture
commits.
- New deploy/test/loadtest-artifacts/ directory with a README
documenting the long-term-archive contract (the GHA artifact
retention is 90 days; numbers acquisition reviewers look at
months later need a committed home).
Operator next steps to fill the placeholders:
1. Trigger Actions → loadtest → Run workflow.
2. Download the three matrix-leg artifacts.
3. Update the baseline doc's 'Latest capture' rows.
4. Commit the raw artifacts (or git-lfs for >100 MB archives) to
deploy/test/loadtest-artifacts/.
Closes TEST-005 (infrastructure side). Numbers land on the next
canonical-runner workflow_dispatch capture.
Operator scale guide
Last reviewed: 2026-05-16
Use this when:
- You're sizing a new certctl deployment for a target fleet count.
- You're scaling an existing deployment up from demo (15 certs / 1 agent) to production (1K+ certs / 100+ agents).
- An auditor asks "what does this scale to?" and you want a documented answer that isn't "we haven't measured."
DB connection pool
certctl's PostgreSQL connection pool is the single largest scale lever.
Pool exhaustion looks like 503s + agent poll timeouts + scheduler
falling behind on its loops. The default ships at 50 max open
connections (CERTCTL_DATABASE_MAX_CONNS=50), with idle = max/5 = 10
under the existing internal/repository/postgres/db.go::NewDBWithMaxConns
contract.
Operator-tune ladder:
| Fleet size | CERTCTL_DATABASE_MAX_CONNS | Postgres max_connections | Notes |
|---|---|---|---|
| ≤ 500 certs / 100 agents | 50 (default) | 100 (PG default) | Demo + small deployments. Pool default sized for this. |
| 5K certs / 1K agents | 100 | 200 | Postgres needs an explicit bump from the 100 default; restart required (max_connections cannot be changed by a reload). |
| 50K certs / 10K agents | 200 | 400 | Plus dedicated Postgres VM (separate from server host); shared_buffers ≥ 1 GiB. |
Always leave headroom in Postgres's max_connections for backups
(pg_dump opens its own connection), ad-hoc psql sessions, and
replicas. The formula (server pool size × replicas) + 20 gives a safe
floor for Postgres's max_connections.
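The floor formula can be checked against the ladder with a few lines of Python. This is a minimal sketch, not part of certctl: the function name is hypothetical, and it assumes "replicas" means the number of certctl server instances sharing one Postgres.

```python
def min_pg_max_connections(pool_size: int, server_replicas: int = 1, headroom: int = 20) -> int:
    """Floor for Postgres max_connections: (pool size x replicas) + headroom.

    Assumption: server_replicas counts certctl server instances, each with
    its own pool of pool_size connections.
    """
    if pool_size < 1 or server_replicas < 1:
        raise ValueError("pool_size and server_replicas must be >= 1")
    return pool_size * server_replicas + headroom


# Ladder rows as (CERTCTL_DATABASE_MAX_CONNS, documented Postgres max_connections).
LADDER = [(50, 100), (100, 200), (200, 400)]

for pool, documented in LADDER:
    floor = min_pg_max_connections(pool)
    print(f"pool={pool}: floor={floor}, documented={documented}, ok={documented >= floor}")
```

With a single server instance, every ladder row sits above the floor. Running several server replicas against one database raises the floor linearly, so recompute it before adding a replica.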
Numbers above the small-fleet row are operator-tuning starting points, not validated ceilings. Phase 8 of the architecture diligence remediation will replace these with measured values from synthetic fleets; until then, capture your own observations in a loadtest log and tune against them.
Scheduler tick budgets
certctl has 15 scheduler loops, each with its own cadence
(internal/scheduler/scheduler.go). The renewal scan is the hottest
loop on large fleets: it pulls every managed certificate, applies
each profile's renewal policy, and dispatches an issuance job per
cert that meets the threshold. The default cadence is 1h
(CERTCTL_SCHEDULER_RENEWAL_CHECK_INTERVAL).
Phase 6 SCALE-M5 closure (2026-05-14) added per-ticker jitter via the
internal/scheduler.JitteredTicker wrapper. Each loop's interval is
unchanged; the wrapper adds ±10% randomized delay per tick so multiple
loops with the same nominal cadence don't co-fire and cause hour-
boundary CPU + DB spikes. For most fleets the visible effect is a
smoother CPU graph during the renewal scan.
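To illustrate the idea, here is a minimal Python sketch of ±10% per-tick jitter. It is not the JitteredTicker implementation (which is Go and not shown here); the function name and the uniform distribution are assumptions made for the example.

```python
import random
from typing import Optional


def jittered_interval(base_seconds: float, fraction: float = 0.10,
                      rng: Optional[random.Random] = None) -> float:
    """Return base_seconds scaled by a random factor in [1 - fraction, 1 + fraction]."""
    rng = rng or random.Random()
    return base_seconds * (1.0 + rng.uniform(-fraction, fraction))


# Two loops with the same nominal 1h cadence drift apart instead of co-firing.
rng = random.Random(42)  # seeded only so the demo is repeatable
for loop_name in ("renewal-scan", "other-hourly-loop"):
    delay = jittered_interval(3600, rng=rng)
    print(f"{loop_name}: next tick in {delay:.0f}s")
```

Because the jitter is re-drawn on every tick, the average cadence stays at the nominal interval while individual ticks land anywhere in the 54 to 66 minute window for a 1h loop.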
Renewal-sweep semaphore (SCALE-L1). The renewal loop dispatches
concurrent issuance work behind a per-tick semaphore (default
CERTCTL_RENEWAL_CONCURRENCY=25). Under tick-budget pressure (a tick
that exceeds the loop interval), the semaphore can hold the entire
concurrency cap until the context cancels at next-tick boundary —
which is intentional. The drain happens via context cancellation; new
work isn't started past the deadline. Tests in
internal/scheduler/ pin this drain behavior. Operators on large
fleets should:
- Bump CERTCTL_RENEWAL_CONCURRENCY to 50 or 100 if the renewal scan consistently exceeds tick budget.
- Also bump CERTCTL_DATABASE_MAX_CONNS proportionally; each concurrent renewal task opens its own pool connection during issuance / deployment.
- Watch for the "renewal scan complete" log line per tick. If it's consistently late, you're under-provisioned.
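The semaphore-plus-deadline behavior described above can be sketched in Python. This is an illustration under stated assumptions, not certctl's Go code: run_tick, its parameters, and the use of a wall-clock deadline in place of Go context cancellation are all hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_tick(tasks, concurrency, budget_seconds):
    """Run (name, fn) tasks with at most `concurrency` in flight.

    Work that has not started when the tick budget expires is skipped,
    mirroring "new work isn't started past the deadline".
    """
    deadline = time.monotonic() + budget_seconds
    slots = threading.BoundedSemaphore(concurrency)
    lock = threading.Lock()
    started, skipped = [], []

    def guarded(name, fn):
        with slots:  # wait for a free concurrency slot
            if time.monotonic() >= deadline:
                with lock:
                    skipped.append(name)  # past the deadline: do not start
                return
            with lock:
                started.append(name)
            fn()

    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        for name, fn in tasks:
            pool.submit(guarded, name, fn)
    return started, skipped


tasks = [(f"cert-{i}", lambda: time.sleep(0.01)) for i in range(20)]
started, skipped = run_tick(tasks, concurrency=5, budget_seconds=5.0)
print(f"started={len(started)} skipped={len(skipped)}")
```

In this sketch, skipped work is simply reported back to the caller; the point is that a slow tick holds its slots until the deadline and then stops starting new work rather than overrunning into the next tick.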
Async CA polling budgets (SCALE-M3)
DigiCert, Entrust, GlobalSign, and Sectigo are async issuers — they
accept a CSR, queue it on the CA side, and return a polling token.
The certctl server polls the CA's status endpoint until the cert is
ready or the deadline expires. The default poll-deadline is 10
minutes wall-clock (asyncpoll.DefaultMaxWait); after that the
issuance returns StillPending and the scheduler re-enqueues the
job for the next tick.
Priority chain when picking the actual deadline (highest → lowest):
- Per-connector env: CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS, CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS, CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS, CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS.
- Global env: CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS (sets the process-wide default for all async-CA connectors that didn't set their per-connector value).
- Package const: asyncpoll.DefaultMaxWait = 10 * time.Minute.
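The priority chain reads naturally as a three-step lookup. The Python sketch below uses the env-var names listed above; the function itself is hypothetical, and treating an empty value as "unset" is an assumption rather than documented certctl behavior.

```python
DEFAULT_MAX_WAIT_SECONDS = 10 * 60  # mirrors asyncpoll.DefaultMaxWait = 10 * time.Minute


def resolve_poll_max_wait(connector: str, env: dict) -> int:
    """Pick the poll deadline: per-connector env, then global env, then the default."""
    per_connector = env.get(f"CERTCTL_{connector.upper()}_POLL_MAX_WAIT_SECONDS", "")
    if per_connector.strip():  # assumption: empty string counts as unset
        return int(per_connector)
    global_value = env.get("CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS", "")
    if global_value.strip():
        return int(global_value)
    return DEFAULT_MAX_WAIT_SECONDS


env = {
    "CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS": "1800",  # slow CA gets 30 minutes
    "CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS": "900",
}
for name in ("entrust", "digicert"):
    print(name, resolve_poll_max_wait(name, env))
```

In this example Entrust resolves to its own 1800 s value while DigiCert falls through to the 900 s global; with neither variable set, both would use the 600 s package default.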
Operators with slow async CAs (Entrust certificate-mode in particular can take 15-30 minutes during business hours) should raise the per-connector value rather than the global; that way fast issuers don't pay the polling cost.
Cursor pagination caching (SCALE-L2)
Phase 6 SCALE-L2 closure (2026-05-14) added an ETag middleware at
internal/api/middleware/etag.go covering the top-5 read endpoints:
/api/v1/certificates, /api/v1/jobs, /api/v1/agents,
/api/v1/audit, /api/v1/discovery/certificates. The ETag is
derived from (max-row-updated-at, row-count) for the requested
filter; repeated requests with the same query return 304 Not Modified when the underlying data hasn't changed. The dashboard
benefits most — its polling loop on the certificates page is the
single largest read-traffic source on most deployments.
When the cache is effective, repeated reads bypass the
SELECT COUNT(*) FROM <table> query entirely. The cache invalidates
on any mutation to the table (the row-count + max-updated-at hash
flips).
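As a rough model of why any mutation invalidates the cache, the sketch below derives a validator from (max-row-updated-at, row-count) and compares it with If-None-Match. The hashing scheme, tag format, and function names are assumptions for illustration; the actual middleware in internal/api/middleware/etag.go may differ.

```python
import hashlib
from datetime import datetime, timezone


def compute_etag(max_updated_at: datetime, row_count: int) -> str:
    """Derive a quoted ETag from the newest update timestamp and the row count."""
    raw = f"{max_updated_at.isoformat()}|{row_count}".encode()
    return '"' + hashlib.sha256(raw).hexdigest()[:16] + '"'


def status_for(if_none_match, current_etag: str) -> int:
    """Return 304 when the client's validator still matches, else 200."""
    return 304 if if_none_match == current_etag else 200


t0 = datetime(2026, 5, 14, 12, 0, tzinfo=timezone.utc)
first = compute_etag(t0, 10_000)
print(status_for(first, compute_etag(t0, 10_000)))  # unchanged table
print(status_for(first, compute_etag(t0, 10_001)))  # one row inserted
```

An insert or delete changes the row count and an update advances the newest timestamp, so any of them produces a different tag and the next conditional request gets a full 200 response.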
Operators don't need to do anything to opt in — the middleware is
wired around the top-5 endpoints unconditionally. If you want to
verify it's working, check the ETag: response header on a list
endpoint and repeat the request with the same value in an
If-None-Match: header — the second request should return 304 with
an empty body.
Scale-tier scenarios (SCALE-H2, Phase 8)
Phase 8 (2026-05-14) extended the k6 load-test harness with three new
scenarios that exercise the scale-relevant load surfaces the original
API tier left uncovered. They live behind a compose profile gate
(docker compose --profile scale) so the default make loadtest
stays focused on per-PR regression scope. The full set runs weekly on
the same loadtest.yml cron as the API + connector tier.
| Scenario | k6 file | Seed fixture | Sustained load |
|---|---|---|---|
| Bulk-renewal under load | deploy/test/loadtest/k6/bulk_renewal.js | 10,000 managed_certificates (seed/01_bulk_renewal_certs.sql) | 5 req/s POST /api/v1/certificates/bulk-renew × 5 min |
| ACME enrollment burst | deploy/test/loadtest/k6/acme_burst.js | (none; unauthenticated surface) | 200 concurrent VUs × directory/nonce/ARI × 5 min |
| Agent heartbeat storm | deploy/test/loadtest/k6/agent_storm.js | 5,000 agents (seed/02_agent_fleet.sql) | 167 req/s POST /api/v1/agents/{id}/heartbeat × 5 min |
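The sustained-load figures can be sanity-checked with simple arithmetic. The sketch below assumes a 30-second heartbeat interval, which is an inference from 5,000 agents producing about 167 req/s and is not stated in this guide; the helper names are hypothetical.

```python
import math


def steady_state_rps(agents: int, interval_seconds: float) -> float:
    """Average request rate when every agent calls once per interval."""
    return agents / interval_seconds


def total_requests(rps: float, minutes: float) -> int:
    """Requests issued at a constant rate over the given duration."""
    return round(rps * minutes * 60)


heartbeat_rps = math.ceil(steady_state_rps(5000, 30))  # assumed 30 s interval
print("agent storm rate:", heartbeat_rps, "req/s")
print("agent storm total:", total_requests(heartbeat_rps, 5))
print("bulk-renewal total:", total_requests(5, 5))
```

Under that assumption the agent storm issues roughly 50,000 heartbeats per run and the bulk-renewal scenario about 1,500 POSTs, which helps when judging whether a percentile in the summary rests on enough samples.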
Threshold contracts (regression guards, NOT measured baselines)
| Scenario | Metric | Threshold |
|---|---|---|
| Bulk-renewal | http_req_duration{scenario:bulk_renewal} p99 | < 5 s |
| Bulk-renewal | http_req_duration{scenario:bulk_renewal} p95 | < 2 s |
| Bulk-renewal | http_req_failed{scenario:bulk_renewal} | < 1% |
| ACME burst | acme_directory_duration p95 | < 500 ms |
| ACME burst | acme_new_nonce_duration p95 | < 300 ms |
| ACME burst | acme_renewal_info_duration p95 | < 800 ms |
| ACME burst | http_req_failed{server_error:true} 5xx-only | < 0.1% |
| Agent storm | http_req_duration{scenario:agent_storm} p99 | < 1 s |
| Agent storm | http_req_duration{scenario:agent_storm} p95 | < 500 ms |
| Agent storm | http_req_failed{scenario:agent_storm} | < 0.1% |
429 rate-limit responses on the ACME burst are EXPECTED — Phase 5's
per-account rate limiter SHOULD fire at sustained 200-VU pressure.
The custom acme_rate_limited_count Counter tracks how often it
fires; acme_rate_limit_shape_ok Counter verifies every 429 returns
the RFC 7807 application/problem+json shape with the
urn:ietf:params:acme:error:rateLimited type. A regression that
returned plain-text 429 or a different problem type would surface as
(rate_limited_count - shape_ok_count) > 0 in the summary.
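The shape check amounts to three conditions on each 429 response. The Python sketch below restates them outside k6 for clarity; the function names are hypothetical, and how the counters are extracted from a k6 summary is deliberately left out because the summary layout is not described here.

```python
import json

RATE_LIMITED_TYPE = "urn:ietf:params:acme:error:rateLimited"


def is_rate_limit_shape_ok(status: int, content_type: str, body: str) -> bool:
    """True when a 429 carries an RFC 7807 problem document of the rateLimited type."""
    if status != 429:
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "application/problem+json":
        return False
    try:
        problem = json.loads(body)
    except ValueError:  # malformed or plain-text body
        return False
    return isinstance(problem, dict) and problem.get("type") == RATE_LIMITED_TYPE


def shape_regressions(rate_limited_count: int, shape_ok_count: int) -> int:
    """Number of 429s that did not match the expected shape; 0 means healthy."""
    return rate_limited_count - shape_ok_count


good = json.dumps({"type": RATE_LIMITED_TYPE, "detail": "too many requests"})
print(is_rate_limit_shape_ok(429, "application/problem+json", good))
print(is_rate_limit_shape_ok(429, "text/plain", "slow down"))
print(shape_regressions(rate_limited_count=120, shape_ok_count=120))
```

A non-zero result from shape_regressions is the signal to investigate, regardless of how many 429s fired in total.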
Measured baseline
TEST-005 closure (Sprint 5, 2026-05-16) moved the baseline table out
of this file into its own canonical record:
docs/operator/scale-baseline-2026-Q2.md.
That doc owns the capture procedure, the methodology, and the
per-scenario rows; this page links to it as the authoritative
source.
The split exists because the baseline table is mutable on every loadtest workflow_dispatch run, while this page (the operator-facing scale posture doc) changes only when the underlying scenarios or thresholds change. Keeping them in separate files avoids review-noise on per-capture commits.
Long-term k6 NDJSON artifacts beyond GHA's 90-day retention live at
deploy/test/loadtest-artifacts/.
How to run the scale tier locally
# All three scenarios serially (~18 min total):
make loadtest-scale
# Individual scenarios (each ~6 min):
make loadtest-scale-bulk # 10K cert bulk-renew
make loadtest-scale-acme # 200 VU ACME burst
make loadtest-scale-agent # 5K agent heartbeat storm
Each scenario boots its own copy of the loadtest compose stack
(postgres + tls-init + certctl-server) plus the scale-seed init
container that runs the SQL fixtures from deploy/test/loadtest/seed/.
The seed is idempotent (ON CONFLICT … DO NOTHING) so re-running a
scenario against the same compose stack is cheap.
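The idempotency property is easy to demonstrate in isolation. The sketch below uses Python's built-in sqlite3 (SQLite 3.24 or newer) as a stand-in for Postgres, with a hypothetical two-column agents table; the real fixtures in deploy/test/loadtest/seed/ are SQL files with their own schema.

```python
import sqlite3


def seed_agents(conn: sqlite3.Connection, count: int) -> None:
    """Insert `count` agents; rows whose id already exists are left untouched."""
    conn.executemany(
        "INSERT INTO agents (id, name) VALUES (?, ?) ON CONFLICT (id) DO NOTHING",
        [(f"agent-{i:05d}", f"loadtest agent {i}") for i in range(count)],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agents (id TEXT PRIMARY KEY, name TEXT NOT NULL)")

seed_agents(conn, 5000)  # first run populates the table
seed_agents(conn, 5000)  # second run is a no-op, not an error
print(conn.execute("SELECT COUNT(*) FROM agents").fetchone()[0])
```

The second call neither fails on duplicate keys nor adds rows, which is what makes re-running a scenario against a warm stack safe.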
Documented limitations of the scale tier
- JWS-signed ACME flows are not measured. The ACME burst scenario hits the unauthenticated directory + new-nonce + ARI surface only. Measuring the JWS-signed POST hot path (new-account / new-order / finalize) requires bundling a JWS signer into the k6 driver (k6 doesn't ship JWS). End-to-end JWS conformance is gated by make acme-rfc-conformance-test, which drives lego against the same stack.
- Scheduler renewal scan throughput. The bulk-renewal scenario measures the inbound POST throughput; the scheduler's jobProcessorLoop drains the enqueued jobs at a fixed per-tick budget (CERTCTL_RENEWAL_CONCURRENCY=25 default), and the throughput of that path is not amplified by adding more inbound bulk-renew calls. A future scenario could pull /api/v1/jobs?status=pending and measure drain time.
- Production-sized Postgres. The compose stack runs postgres:16-alpine with default config on a CI runner. Production deploys with shared_buffers >= 1 GiB + dedicated Postgres VM will have different query plans for the 10K-cert scan. The captured numbers translate directionally but the absolute ceiling is workload-specific; see the operator-tune ladder above for production sizing.
- Pull-only deployment model. Agent CSR submit, work-poll, and deploy-verify paths are intentionally out of scope. The heartbeat storm exercises the highest-frequency call on a typical fleet; the work-poll path runs at the same cadence but is cheap (empty set returned 99% of the time).
Profiling production
When the above ladder doesn't fit your shape, profile against your specific workload. The performance-baselines.md runbook has single-endpoint, inventory-walk, and renewal-scan recipes you can adapt.
Related reading
- docs/operator/performance-baselines.md: per-endpoint baselines + how to re-baseline after upgrades.
- docs/operator/runbooks/postgres-backup.md: Postgres-side backup discipline (necessary precondition for any scale tuning).
- deploy/ENVIRONMENTS.md: the full env-var inventory the values referenced above come from.