mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 14:41:31 +00:00
6acf3559a3
Sprint 5 unified-master-audit closure. Pre-fix:
- docs/operator/scale.md L163-185 held a TBD-laden table with 5
scenario rows. The Phase 8 scenarios shipped 2026-05-14; baseline
capture on canonical hardware was 'the next operational step'
that had not been taken.
- Acquirers + operators asking 'what's the scale ceiling?' got
'TBD' as the in-tree answer.
The audit's fix wanted three things:
1. Capture p50/p95/p99 + error rate + memory profile on a fixed-
spec runner.
2. Replace the scale.md TBD rows with real numbers.
3. Archive k6 artifacts under deploy/test/loadtest-artifacts/.
The actual capture is a workflow_dispatch run the operator triggers
on a real Linux runner — it can't happen from a sandbox without
Docker. What I CAN deliver in this commit is the canonical-record
infrastructure that turns the next workflow run into a baseline that
sticks:
- New docs/operator/scale-baseline-2026-Q2.md is the canonical
record. Documents the three scenarios, the methodology, the
capture procedure, and a 'Latest capture' table with
placeholder rows ready to receive the workflow_dispatch run's
numbers. The doc explicitly defends the 'ubuntu-latest runner'
choice (reproducibility > paid-AWS-account specificity).
- docs/operator/scale.md L163-185 — the TBD table — replaced with
a pointer paragraph to the new baseline file. Per the
canonical-doc-pointer pattern: the operator-posture doc changes
when scenarios change; the baseline doc changes on every
capture. Splitting them avoids review-noise on per-capture
commits.
- New deploy/test/loadtest-artifacts/ directory with a README
documenting the long-term-archive contract (the GHA artifact
retention is 90 days; numbers acquisition reviewers look at
months later need a committed home).
Operator next steps to fill the placeholders:
1. Trigger Actions → loadtest → Run workflow.
2. Download the three matrix-leg artifacts.
3. Update the baseline doc's 'Latest capture' rows.
4. Commit the raw artifacts (or git-lfs for >100 MB archives) to
deploy/test/loadtest-artifacts/.
Closes TEST-005 (infrastructure side). Numbers land on the next
canonical-runner workflow_dispatch capture.
245 lines
12 KiB
Markdown
# Operator scale guide

> Last reviewed: 2026-05-16

Use this when:

- You're sizing a new certctl deployment for a target fleet count.
- You're scaling an existing deployment up from demo (15 certs / 1
  agent) to production (1K+ certs / 100+ agents).
- An auditor asks "what does this scale to?" and you want a documented
  answer that isn't "we haven't measured."

## DB connection pool

certctl's PostgreSQL connection pool is the single largest scale lever.
Pool exhaustion looks like 503s + agent poll timeouts + scheduler
falling behind on its loops. The default ships at 50 max open
connections (`CERTCTL_DATABASE_MAX_CONNS=50`), with idle = max/5 = 10
under the existing `internal/repository/postgres/db.go::NewDBWithMaxConns`
contract.

Operator-tune ladder:

| Fleet size | `CERTCTL_DATABASE_MAX_CONNS` | Postgres `max_connections` | Notes |
|---|---|---|---|
| ≤ 500 certs / 100 agents | `50` (default) | `100` (PG default) | Demo + small deployments. Pool default sized for this. |
| 5K certs / 1K agents | `100` | `200` | Postgres needs an explicit bump from the 100 default; restart required (`max_connections` is only read at server start, so a reload is not enough). |
| 50K certs / 10K agents | `200` | `400` | Plus dedicated Postgres VM (separate from server host); `shared_buffers` ≥ 1 GiB. |

Always leave headroom in Postgres's `max_connections` for backups
(`pg_dump` opens its own connection), ad-hoc psql sessions, and
replicas. The formula `(server pool size × replicas) + 20` is a safe
floor for Postgres's `max_connections`.

**Numbers above the small-fleet row are operator-tuning starting
points, not validated ceilings.** Phase 8 of the architecture diligence
remediation will replace these with measured values from synthetic
fleets; until then, capture your own observations in a loadtest log
and tune against them.
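
The `(server pool size × replicas) + 20` rule of thumb from this section can be checked with a few lines of shell. This is a minimal sketch: the function name `pg_max_connections_floor` is made up for illustration, and it assumes "replicas" means the number of certctl server processes sharing one database, which the guide does not spell out.

```sh
# Sketch: floor for Postgres max_connections, per the rule of thumb above.
#   $1 = CERTCTL_DATABASE_MAX_CONNS on each certctl server
#   $2 = number of certctl server processes sharing the database
#        (assumption: the guide does not define "replicas")
pg_max_connections_floor() {
    pool=$1
    replicas=$2
    echo $(( pool * replicas + 20 ))
}

pg_max_connections_floor 50 1    # default pool, one server: prints 70
pg_max_connections_floor 100 2   # 5K-cert tier pool, two servers: prints 220
```

Under that assumption, the second call shows that running more than one server process can push the floor above the ladder's value for the same tier, so re-run the arithmetic whenever the server count changes.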

## Scheduler tick budgets

certctl has 15 scheduler loops, each with its own cadence
(`internal/scheduler/scheduler.go`). The renewal scan is the hottest
loop on large fleets: it pulls every managed certificate, applies
each profile's renewal policy, and dispatches an issuance job per
cert that meets the threshold. The default cadence is `1h`
(`CERTCTL_SCHEDULER_RENEWAL_CHECK_INTERVAL`).

Phase 6 SCALE-M5 closure (2026-05-14) added per-ticker jitter via the
`internal/scheduler.JitteredTicker` wrapper. Each loop's interval is
unchanged; the wrapper adds ±10% randomized delay per tick so multiple
loops with the same nominal cadence don't co-fire and cause
hour-boundary CPU + DB spikes. For most fleets the visible effect is a
smoother CPU graph during the renewal scan.

**Renewal-sweep semaphore (SCALE-L1).** The renewal loop dispatches
concurrent issuance work behind a per-tick semaphore (default
`CERTCTL_RENEWAL_CONCURRENCY=25`). Under tick-budget pressure (a tick
that exceeds the loop interval), the semaphore can hold the entire
concurrency cap until the context cancels at the next-tick boundary,
which is intentional. The drain happens via context cancellation; new
work isn't started past the deadline. Tests in
`internal/scheduler/` pin this drain behavior. Operators on large
fleets should:

1. Bump `CERTCTL_RENEWAL_CONCURRENCY` to 50 or 100 if the renewal scan
   consistently exceeds tick budget.
2. Also bump `CERTCTL_DATABASE_MAX_CONNS` proportionally: each
   concurrent renewal task opens its own pool connection during
   issuance / deployment.
3. Watch for the "renewal scan complete" log line per tick. If it's
   consistently late, you're under-provisioned.
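
One way to read "proportionally" in step 2 is to keep the default pool-to-concurrency ratio (50 connections for 25 concurrent renewals, so 2:1). That ratio is an assumption made for this sketch, not a rule the guide states, and `scaled_db_max_conns` is a hypothetical helper name.

```sh
# Sketch: scale the DB pool with renewal concurrency, keeping the default
# 50:25 (2:1) pool-to-concurrency ratio. The ratio is an assumption; the
# guide only says "proportionally".
scaled_db_max_conns() {
    concurrency=$1
    echo $(( concurrency * 50 / 25 ))
}

export CERTCTL_RENEWAL_CONCURRENCY=50
export CERTCTL_DATABASE_MAX_CONNS="$(scaled_db_max_conns "$CERTCTL_RENEWAL_CONCURRENCY")"
echo "$CERTCTL_DATABASE_MAX_CONNS"   # prints 100
```

Treat the result as a starting point and confirm it against the "renewal scan complete" log line, since other request traffic draws on the same pool.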

## Async CA polling budgets (SCALE-M3)

DigiCert, Entrust, GlobalSign, and Sectigo are async issuers: they
accept a CSR, queue it on the CA side, and return a polling token.
The certctl server polls the CA's status endpoint until the cert is
ready or the deadline expires. The default poll-deadline is 10
minutes wall-clock (`asyncpoll.DefaultMaxWait`); after that the
issuance returns `StillPending` and the scheduler re-enqueues the
job for the next tick.

Priority chain when picking the actual deadline (highest → lowest):

1. Per-connector env: `CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS`,
   `CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS`,
   `CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS`,
   `CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS`.
2. Global env: `CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS` (sets the
   process-wide default for all async-CA connectors that didn't set
   their per-connector value).
3. Package const: `asyncpoll.DefaultMaxWait = 10 * time.Minute`.

Operators with slow async CAs (Entrust certificate-mode in
particular can take 15-30 minutes during business hours) should
raise the per-connector value rather than the global; that way fast
issuers don't pay the polling cost.
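
To preview which deadline a given environment would produce, the priority chain above can be mirrored in shell. This is a sketch of the documented precedence only, not the server's resolution code: `effective_poll_max_wait` is a hypothetical helper, and treating an empty variable the same as an unset one is an assumption.

```sh
# Sketch of the documented precedence: per-connector env, then global env,
# then the 10-minute package default (600 seconds).
effective_poll_max_wait() {
    case "$1" in
        digicert|entrust|globalsign|sectigo) ;;
        *) echo "unknown async connector: $1" >&2; return 1 ;;
    esac
    name=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    # Indirect lookup of CERTCTL_<NAME>_POLL_MAX_WAIT_SECONDS via eval;
    # safe here because the connector name was validated above.
    eval "per_connector=\${CERTCTL_${name}_POLL_MAX_WAIT_SECONDS:-}"
    if [ -n "$per_connector" ]; then
        echo "$per_connector"
    elif [ -n "${CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS:-}" ]; then
        echo "$CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS"
    else
        echo 600   # asyncpoll.DefaultMaxWait = 10 * time.Minute
    fi
}

# Slow Entrust raised to 30 minutes, everything else left on the default.
unset CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS
CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS=1800
effective_poll_max_wait entrust    # prints 1800
effective_poll_max_wait digicert   # prints 600
```

Setting only the Entrust variable leaves DigiCert on the 10-minute default, which is the outcome the per-connector recommendation is after.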

## Cursor pagination caching (SCALE-L2)

Phase 6 SCALE-L2 closure (2026-05-14) added an ETag middleware at
`internal/api/middleware/etag.go` covering the top-5 read endpoints:
`/api/v1/certificates`, `/api/v1/jobs`, `/api/v1/agents`,
`/api/v1/audit`, `/api/v1/discovery/certificates`. The ETag is
derived from `(max-row-updated-at, row-count)` for the requested
filter; repeated requests with the same query return
`304 Not Modified` when the underlying data hasn't changed. The
dashboard benefits most: its polling loop on the certificates page is
the single largest read-traffic source on most deployments.

When the cache is effective, repeated reads bypass the
`SELECT COUNT(*) FROM <table>` query entirely. The cache invalidates
on any mutation to the table (the row-count + max-updated-at hash
flips).

Operators don't need to do anything to opt in; the middleware is
wired around the top-5 endpoints unconditionally. If you want to
verify it's working, check the `ETag:` response header on a list
endpoint and repeat the request with the same value in an
`If-None-Match:` header; the second request should return 304 with
an empty body.
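
The manual check described above can be scripted with `curl`. This is a minimal sketch: `check_etag` is a hypothetical helper, and the host name and `Authorization` header in the usage comment are placeholders, since authentication depends on how your deployment is configured.

```sh
# Sketch: confirm a list endpoint honors If-None-Match.
#   $1     = full endpoint URL
#   $2...  = extra curl arguments (for example an auth header)
check_etag() {
    url=$1
    shift
    # First request: dump the response headers and pull out the ETag value.
    etag=$(curl -sS -o /dev/null -D - "$@" "$url" |
        awk 'tolower($1) == "etag:" { print $2 }' | tr -d '\r')
    if [ -z "$etag" ]; then
        echo "no ETag header on $url" >&2
        return 1
    fi
    # Second request: replay the validator and capture only the status code.
    status=$(curl -sS -o /dev/null -w '%{http_code}' \
        -H "If-None-Match: $etag" "$@" "$url")
    echo "etag=$etag status=$status"
    [ "$status" = "304" ]
}

# Usage (placeholder host and token):
#   check_etag https://certctl.example.com/api/v1/certificates \
#       -H "Authorization: Bearer $TOKEN"
```

The function exits non-zero when the replayed request returns anything other than 304. A mutation landing between the two requests legitimately produces a 200, so retry once before treating a failure as a regression.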

## Scale-tier scenarios (SCALE-H2, Phase 8)

Phase 8 (2026-05-14) extended the k6 load-test harness with three new
scenarios that exercise the scale-relevant load surfaces the original
API tier left uncovered. They live behind a compose profile gate
(`docker compose --profile scale`) so the default `make loadtest`
stays focused on per-PR regression scope. The full set runs weekly on
the same `loadtest.yml` cron as the API + connector tier.

| Scenario | k6 file | Seed fixture | Sustained load |
|---|---|---|---|
| Bulk-renewal under load | `deploy/test/loadtest/k6/bulk_renewal.js` | 10,000 managed_certificates (`seed/01_bulk_renewal_certs.sql`) | 5 req/s POST `/api/v1/certificates/bulk-renew` × 5 min |
| ACME enrollment burst | `deploy/test/loadtest/k6/acme_burst.js` | (none; unauthenticated surface) | 200 concurrent VUs × directory/nonce/ARI × 5 min |
| Agent heartbeat storm | `deploy/test/loadtest/k6/agent_storm.js` | 5,000 agents (`seed/02_agent_fleet.sql`) | 167 req/s POST `/api/v1/agents/{id}/heartbeat` × 5 min |

### Threshold contracts (regression guards, NOT measured baselines)

| Scenario | Metric | Threshold |
|---|---|---|
| Bulk-renewal | `http_req_duration{scenario:bulk_renewal}` p99 | < 5 s |
| Bulk-renewal | `http_req_duration{scenario:bulk_renewal}` p95 | < 2 s |
| Bulk-renewal | `http_req_failed{scenario:bulk_renewal}` | < 1% |
| ACME burst | `acme_directory_duration` p95 | < 500 ms |
| ACME burst | `acme_new_nonce_duration` p95 | < 300 ms |
| ACME burst | `acme_renewal_info_duration` p95 | < 800 ms |
| ACME burst | `http_req_failed{server_error:true}` 5xx-only | < 0.1% |
| Agent storm | `http_req_duration{scenario:agent_storm}` p99 | < 1 s |
| Agent storm | `http_req_duration{scenario:agent_storm}` p95 | < 500 ms |
| Agent storm | `http_req_failed{scenario:agent_storm}` | < 0.1% |

429 rate-limit responses on the ACME burst are EXPECTED: Phase 5's
per-account rate limiter SHOULD fire at sustained 200-VU pressure.
The custom `acme_rate_limited_count` Counter tracks how often it
fires; the `acme_rate_limit_shape_ok` Counter counts the 429s that
return the RFC 7807 `application/problem+json` shape with the
`urn:ietf:params:acme:error:rateLimited` type. A regression that
returned plain-text 429 or a different problem type would surface as
`(rate_limited_count - shape_ok_count) > 0` in the summary.

### Measured baseline

TEST-005 closure (Sprint 5, 2026-05-16) moved the baseline table out
of this file into its own canonical record:
[`docs/operator/scale-baseline-2026-Q2.md`](scale-baseline-2026-Q2.md).
That doc owns the capture procedure, the methodology, and the
per-scenario rows; this page links to it as the authoritative
source.

The split exists because the baseline table is mutable on every
loadtest workflow_dispatch run, while this page (the operator-facing
scale posture doc) changes only when the underlying scenarios or
thresholds change. Keeping them in separate files avoids
review-noise on per-capture commits.

Long-term k6 NDJSON artifacts beyond GHA's 90-day retention live at
[`deploy/test/loadtest-artifacts/`](../../deploy/test/loadtest-artifacts/).

### How to run the scale tier locally

```sh
# All three scenarios serially (~18 min total):
make loadtest-scale

# Individual scenarios (each ~6 min):
make loadtest-scale-bulk     # 10K cert bulk-renew
make loadtest-scale-acme     # 200 VU ACME burst
make loadtest-scale-agent    # 5K agent heartbeat storm
```

Each scenario boots its own copy of the loadtest compose stack
(postgres + tls-init + certctl-server) plus the `scale-seed` init
container that runs the SQL fixtures from `deploy/test/loadtest/seed/`.
The seed is idempotent (`ON CONFLICT … DO NOTHING`) so re-running a
scenario against the same compose stack is cheap.

### Documented limitations of the scale tier

- **JWS-signed ACME flows are not measured.** The ACME burst scenario
  hits the unauthenticated directory + new-nonce + ARI surface only.
  Measuring the JWS-signed POST hot path (new-account / new-order /
  finalize) requires bundling a JWS signer into the k6 driver (k6
  doesn't ship JWS). End-to-end JWS conformance is gated by
  `make acme-rfc-conformance-test` which drives `lego` against the
  same stack.
- **Scheduler renewal scan throughput.** The bulk-renewal scenario
  measures the inbound POST throughput; the scheduler's
  `jobProcessorLoop` drains the enqueued jobs at a fixed per-tick
  budget (`CERTCTL_RENEWAL_CONCURRENCY=25` default), and the
  throughput of that path is not amplified by adding more inbound
  bulk-renew calls. A future scenario could pull
  `/api/v1/jobs?status=pending` and measure drain time.
- **Production-sized Postgres.** The compose stack runs
  `postgres:16-alpine` with default config on a CI runner.
  Production deploys with `shared_buffers >= 1 GiB` + dedicated
  Postgres VM will have different query plans for the 10K-cert
  scan. The captured numbers translate directionally but the
  absolute ceiling is workload-specific; see the operator-tune
  ladder above for production sizing.
- **Pull-only deployment model.** Agent CSR submit, work-poll, and
  deploy-verify paths are intentionally out of scope. The heartbeat
  storm exercises the highest-frequency call on a typical fleet;
  the work-poll path runs at the same cadence but is cheap (empty
  set returned 99% of the time).

## Profiling production

When the above ladder doesn't fit your deployment's shape, profile
against your specific workload. The
[performance-baselines.md](performance-baselines.md) runbook has
single-endpoint, inventory-walk, and renewal-scan recipes you can
adapt.

## Related reading

- [`docs/operator/performance-baselines.md`](performance-baselines.md):
  per-endpoint baselines + how to re-baseline after upgrades.
- [`docs/operator/runbooks/postgres-backup.md`](runbooks/postgres-backup.md):
  Postgres-side backup discipline (necessary precondition for any
  scale tuning).
- [`deploy/ENVIRONMENTS.md`](../../deploy/ENVIRONMENTS.md): the
  full env-var inventory the values referenced above come from.