mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 18:21:32 +00:00
1279172e9b
Phase 8 of the certctl architecture diligence remediation closes
SCALE-H2 by adding three new k6 scenarios that exercise the scale-
relevant load surfaces the API tier + connector tier left uncovered:
fleet-scale bulk renewal, ACME enrollment burst, and agent heartbeat
storm.
Audit miscount + path correction (live-grep at Phase 8 audit time)
==================================================================
- The Phase 8 prompt referenced both `deploy/test/load/` and
`deploy/test/loadtest/`. Repo truth: the existing harness lives at
`deploy/test/loadtest/`. New scenarios land there.
- The audit's prior framing "k6 covers the API tier at 50 req/s
only" omitted Bundle 10 (2026-05-02) which added four connector-
tier handshake scenarios (nginx/apache/haproxy/f5) at 100 conns/min
each, plus the Phase 5 ACME directory/nonce/ARI scenario at 100 VUs
in `k6/acme_flow.js`. Phase 8 appends to what's there rather than
rewriting.
What ships
==========
Three new k6 scenario files under deploy/test/loadtest/k6/:
  bulk_renewal.js - 10K-cert seed + 5 req/s POST /bulk-renew × 5min
                    p99 < 5s, p95 < 2s, errors < 1%
  acme_burst.js   - 200 VU sustained × directory/nonce/ARI × 5min
                    directory p95 < 500ms, nonce p95 < 300ms,
                    renewal-info p95 < 800ms, 5xx-only < 0.1%
                    Pins RFC 7807 rate-limit response shape via
                    acme_rate_limit_shape_ok Counter.
  agent_storm.js  - 5K-agent seed + 167 req/s POST /heartbeat × 5min
                    p99 < 1s, p95 < 500ms, errors < 0.1%
Two seed SQL fixtures under deploy/test/loadtest/seed/:
  01_bulk_renewal_certs.sql - 10,000 managed_certificates rows
    linked to seed_demo.sql FKs (iss-local, o-alice, t-platform,
    rp-standard). status='active', expires_at distributed across
    next 30 days, name prefix `loadtest-bulk-` so the scenario
    can scope its criteria. Idempotent via
    ON CONFLICT (name) DO NOTHING.
  02_agent_fleet.sql - 5,000 agents rows with name prefix
    `loadtest-agent-`. status='Online', last_heartbeat_at
    staggered across prior 60s, OS distribution 80%/10%/10%
    linux/windows/darwin. Idempotent via
    ON CONFLICT (id) DO NOTHING.
Plus seed/README.md documenting the opt-in profile + when these
run vs the default `make loadtest` fast path.
Compose + Makefile + CI wiring
==============================
deploy/test/loadtest/docker-compose.yml gains four new services,
all gated behind the `scale` compose profile so the default
`make loadtest` is unchanged:
  scale-seed     - one-shot postgres:16-alpine container that runs
                   every ./seed/*.sql in lexical order against the
                   same postgres the server uses. Depends on
                   postgres healthy + certctl-server healthy (so
                   migrations + seed_demo.sql have already run).
  k6-scale-bulk  - grafana/k6:0.54.0 driver running bulk_renewal.js
  k6-scale-acme  - grafana/k6:0.54.0 driver running acme_burst.js
  k6-scale-agent - grafana/k6:0.54.0 driver running agent_storm.js
Each driver depends_on scale-seed (condition:
service_completed_successfully) so the scenarios never run against
an unseeded DB (the acme scenario doesn't need the seed itself but
uses the same dependency chain for ordering predictability).
Makefile gains four new phony targets:
  loadtest-scale-bulk  - runs bulk_renewal.js via compose --profile scale
  loadtest-scale-acme  - runs acme_burst.js
  loadtest-scale-agent - runs agent_storm.js
  loadtest-scale       - all three serially
.github/workflows/loadtest.yml gains a new k6-scale matrix job that
runs after the existing k6 job (needs: k6) with a matrix on the
three scenarios — fail-fast: false so a regression in one scenario
doesn't cancel the others. Same workflow_dispatch + weekly cron
cadence as the existing API + connector tier job.
Documentation
=============
docs/operator/scale.md gains a new "Scale-tier scenarios (SCALE-H2,
Phase 8)" section between the cursor-pagination subsection and the
profiling-production subsection. Documents:
- Scenario + seed + sustained load table
- Threshold contract (regression guards, NOT measured baselines)
- Measured-baseline table with TBD placeholders + the canonical-
hardware capture procedure
- How to run the scale tier locally
- Four documented limitations (JWS-signed ACME, scheduler renewal
scan throughput, production-sized Postgres, pull-only deployment
model)
deploy/test/loadtest/README.md gains a short "Scale tier (Phase 8
SCALE-H2, 2026-05-14)" section pointing at scale.md as the canonical
operator-facing baseline source. Avoids duplication; the README
remains the harness-mechanics doc.
Deliberate deviations from the prompt
======================================
The Phase 8 prompt's "concrete deliverables" section referenced
`deploy/test/load/` (no -test) for the new k6 files. The actual
harness lives at `deploy/test/loadtest/` — the new files land there
to match existing convention. The prompt's audit-questions section
also referenced `deploy/test/loadtest/` so the prompt was internally
inconsistent on this; repo truth wins.
The prompt described the ACME burst as "200 concurrent ACME orders
against /acme/profile/<id>/new-order ... pin the rate-limit response
shape." new-order is JWS-signed (RFC 8555 §6.2 requires every POST
body to be a JWS; §7.4 defines new-order itself). k6 doesn't
ship JWS and bundling a signer (e.g. lego) into the k6 container
would obscure the server-side latency the scenario is trying to
measure. Same trade-off the existing Phase 5 acme_flow.js made.
Phase 8's acme_burst.js measures the unauthenticated
directory + nonce + ARI surface at burst rate AND pins the 429
rate-limit response shape via a custom Counter that increments only
when the response is `application/problem+json` with the
`urn:ietf:params:acme:error:rateLimited` type. End-to-end JWS
conformance under load remains a follow-up; the canonical JWS
correctness gate is `make acme-rfc-conformance-test` (lego-based,
non-load).
Deferred (operator-side, not engineering)
==========================================
Canonical-hardware baseline capture. The TBD placeholders in
docs/operator/scale.md's measured-baseline table are intentional —
sandbox-captured numbers from a developer laptop are misleading
(same anti-pattern the original loadtest README guards against).
Operator triggers loadtest.yml from the Actions tab, waits for the
k6-scale matrix jobs to complete, downloads the per-scenario
summary artifacts, copies p50/p95/p99 into the table, commits the
captured numbers alongside the date + commit SHA.
Files changed (11):
  .github/workflows/loadtest.yml (+72 -1)
  Makefile (+47 -1)
  deploy/test/loadtest/README.md (+28 -1)
  deploy/test/loadtest/docker-compose.yml (+108 -1)
  deploy/test/loadtest/k6/bulk_renewal.js (new, 106 lines)
  deploy/test/loadtest/k6/acme_burst.js (new, 192 lines)
  deploy/test/loadtest/k6/agent_storm.js (new, 124 lines)
  deploy/test/loadtest/seed/01_bulk_renewal_certs.sql (new, 95 lines)
  deploy/test/loadtest/seed/02_agent_fleet.sql (new, 92 lines)
  deploy/test/loadtest/seed/README.md (new, 86 lines)
  docs/operator/scale.md (+109 -0)
Verification (sandbox-runnable):
  python3 -c 'import yaml; yaml.safe_load(open("deploy/test/loadtest/docker-compose.yml"))'
    → compose YAML OK
  python3 -c 'import yaml; yaml.safe_load(open(".github/workflows/loadtest.yml"))'
    → workflow YAML OK
  grep -E 'bulk_renewal|acme_burst|agent_storm' deploy/test/loadtest/k6/*.js
    → all three scenarios + tags present
  grep loadtest-scale Makefile
    → 4 new targets registered in .PHONY + 3 recipes + 1 aggregate
Runtime verification (deferred; requires docker on canonical hardware):
  make loadtest-scale-bulk   # 10K cert fixture + 5 req/s × 5min
  make loadtest-scale-acme   # 200 VU × 5min
  make loadtest-scale-agent  # 5K agent fixture + 167 req/s × 5min
  make loadtest-scale        # all three serially
Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-H2
251 lines
12 KiB
Markdown
# Operator scale guide

> Last reviewed: 2026-05-14

Use this when:
- You're sizing a new certctl deployment for a target fleet count.
- You're scaling an existing deployment up from demo (15 certs / 1
  agent) to production (1K+ certs / 100+ agents).
- An auditor asks "what does this scale to?" and you want a documented
  answer that isn't "we haven't measured."

## DB connection pool

certctl's PostgreSQL connection pool is the single largest scale lever.
Pool exhaustion looks like 503s + agent poll timeouts + scheduler
falling behind on its loops. The default ships at 50 max open
connections (`CERTCTL_DATABASE_MAX_CONNS=50`), with idle = max/5 = 10
under the existing `internal/repository/postgres/db.go::NewDBWithMaxConns`
contract.

Operator-tune ladder:

| Fleet size | `CERTCTL_DATABASE_MAX_CONNS` | Postgres `max_connections` | Notes |
|---|---|---|---|
| ≤ 500 certs / 100 agents | `50` (default) | `100` (PG default) | Demo + small deployments. Pool default sized for this. |
| 5K certs / 1K agents | `100` | `200` | Postgres needs an explicit bump from the 100 default; `max_connections` only takes effect after a server restart, not a reload. |
| 50K certs / 10K agents | `200` | `400` | Plus dedicated Postgres VM (separate from server host); `shared_buffers` ≥ 1 GiB. |

Always leave headroom in Postgres's `max_connections` for backups
(`pg_dump` opens its own connection), ad-hoc psql sessions, and
replicas. The formula `(server pool size × replicas) + 20` is a safe
floor for Postgres's `max_connections`.

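The sizing arithmetic above can be sketched in a few lines of Go. This is only an illustration: `MaxIdleConns` and `MinPostgresMaxConnections` are hypothetical helper names, not functions from certctl, and "replicas" is read here as the number of certctl server instances sharing one Postgres.

```go
package main

import "fmt"

// MaxIdleConns mirrors the "idle = max/5" rule quoted above.
// Integer division is an assumption about how the remainder is handled.
func MaxIdleConns(maxOpen int) int { return maxOpen / 5 }

// MinPostgresMaxConnections applies (server pool size × replicas) + headroom.
// The guide uses a headroom of 20 for backups, psql sessions, and replicas.
func MinPostgresMaxConnections(poolSize, replicas, headroom int) int {
	return poolSize*replicas + headroom
}

func main() {
	// Pool sizes taken from the operator-tune ladder, one server instance each.
	for _, pool := range []int{50, 100, 200} {
		fmt.Printf("pool=%d idle=%d pg_floor=%d\n",
			pool, MaxIdleConns(pool), MinPostgresMaxConnections(pool, 1, 20))
	}
}
```

With one server instance the floors come out at 70, 120, and 220, so the ladder's `max_connections` values of 100, 200, and 400 all sit above the floor.
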
**Numbers above the small-fleet row are operator-tuning starting
points, not validated ceilings.** The Phase 8 scale-tier scenarios
described later in this guide are meant to replace them with measured
values from synthetic fleets, but those baselines are still TBD; until
they land, capture your own observations in a loadtest log and tune
against them.

## Scheduler tick budgets

certctl has 15 scheduler loops, each with its own cadence
(`internal/scheduler/scheduler.go`). The renewal scan is the hottest
loop on large fleets: it pulls every managed certificate, applies
each profile's renewal policy, and dispatches an issuance job per
cert that meets the threshold. The default cadence is `1h`
(`CERTCTL_SCHEDULER_RENEWAL_CHECK_INTERVAL`).

Phase 6 SCALE-M5 closure (2026-05-14) added per-ticker jitter via the
`internal/scheduler.JitteredTicker` wrapper. Each loop's interval is
unchanged; the wrapper adds ±10% randomized delay per tick so multiple
loops with the same nominal cadence don't co-fire and cause
hour-boundary CPU + DB spikes. For most fleets the visible effect is a
smoother CPU graph during the renewal scan.

**Renewal-sweep semaphore (SCALE-L1).** The renewal loop dispatches
concurrent issuance work behind a per-tick semaphore (default
`CERTCTL_RENEWAL_CONCURRENCY=25`). Under tick-budget pressure (a tick
that runs longer than the loop interval), in-flight work can hold the
entire concurrency cap until the context cancels at the next-tick
boundary. This is intentional: the drain happens via context
cancellation, and no new work is started past the deadline. Tests in
`internal/scheduler/` pin this drain behavior. Operators on large
fleets should:

1. Bump `CERTCTL_RENEWAL_CONCURRENCY` to 50 or 100 if the renewal scan
   consistently exceeds tick budget.
2. Also bump `CERTCTL_DATABASE_MAX_CONNS` proportionally, because each
   concurrent renewal task takes its own pool connection during
   issuance / deployment.
3. Watch for the "renewal scan complete" log line per tick. If it's
   consistently late, you're under-provisioned.

## Async CA polling budgets (SCALE-M3)

DigiCert, Entrust, GlobalSign, and Sectigo are async issuers: they
accept a CSR, queue it on the CA side, and return a polling token.
The certctl server polls the CA's status endpoint until the cert is
ready or the deadline expires. The default poll-deadline is 10
minutes wall-clock (`asyncpoll.DefaultMaxWait`); after that the
issuance returns `StillPending` and the scheduler re-enqueues the
job for the next tick.

Priority chain when picking the actual deadline (highest → lowest):

1. Per-connector env: `CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS`,
   `CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS`,
   `CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS`,
   `CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS`.
2. Global env: `CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS` (sets the
   process-wide default for all async-CA connectors that didn't set
   their per-connector value).
3. Package const: `asyncpoll.DefaultMaxWait = 10 * time.Minute`.

Operators with slow async CAs (Entrust certificate-mode in
particular can take 15-30 minutes during business hours) should
raise the per-connector value rather than the global; that way fast
issuers don't pay the polling cost.

## Cursor pagination caching (SCALE-L2)

Phase 6 SCALE-L2 closure (2026-05-14) added an ETag middleware at
`internal/api/middleware/etag.go` covering the top-5 read endpoints:
`/api/v1/certificates`, `/api/v1/jobs`, `/api/v1/agents`,
`/api/v1/audit`, `/api/v1/discovery/certificates`. The ETag is
derived from `(max-row-updated-at, row-count)` for the requested
filter; repeated requests with the same query return `304 Not
Modified` when the underlying data hasn't changed. The dashboard
benefits most: its polling loop on the certificates page is the
single largest read-traffic source on most deployments.

When the cache is effective, repeated reads bypass the
`SELECT COUNT(*) FROM <table>` query entirely. The cache invalidates
on any mutation to the table (the row-count + max-updated-at hash
flips).

Operators don't need to do anything to opt in; the middleware is
wired around the top-5 endpoints unconditionally. If you want to
verify it's working, check the `ETag:` response header on a list
endpoint and repeat the request with the same value in an
`If-None-Match:` header. The second request should return 304 with
an empty body.

## Scale-tier scenarios (SCALE-H2, Phase 8)

Phase 8 (2026-05-14) extended the k6 load-test harness with three new
scenarios that exercise the scale-relevant load surfaces the original
API tier left uncovered. They live behind a compose profile gate
(`docker compose --profile scale`) so the default `make loadtest`
stays focused on per-PR regression scope. The full set runs weekly on
the same `loadtest.yml` cron as the API + connector tier.

| Scenario | k6 file | Seed fixture | Sustained load |
|---|---|---|---|
| Bulk-renewal under load | `deploy/test/loadtest/k6/bulk_renewal.js` | 10,000 managed_certificates (`seed/01_bulk_renewal_certs.sql`) | 5 req/s POST `/api/v1/certificates/bulk-renew` × 5 min |
| ACME enrollment burst | `deploy/test/loadtest/k6/acme_burst.js` | (none; unauth surface) | 200 concurrent VUs × directory/nonce/ARI × 5 min |
| Agent heartbeat storm | `deploy/test/loadtest/k6/agent_storm.js` | 5,000 agents (`seed/02_agent_fleet.sql`) | 167 req/s POST `/api/v1/agents/{id}/heartbeat` × 5 min |

### Threshold contracts (regression guards, NOT measured baselines)

| Scenario | Metric | Threshold |
|---|---|---|
| Bulk-renewal | `http_req_duration{scenario:bulk_renewal}` p99 | < 5 s |
| Bulk-renewal | `http_req_duration{scenario:bulk_renewal}` p95 | < 2 s |
| Bulk-renewal | `http_req_failed{scenario:bulk_renewal}` | < 1% |
| ACME burst | `acme_directory_duration` p95 | < 500 ms |
| ACME burst | `acme_new_nonce_duration` p95 | < 300 ms |
| ACME burst | `acme_renewal_info_duration` p95 | < 800 ms |
| ACME burst | `http_req_failed{server_error:true}` 5xx-only | < 0.1% |
| Agent storm | `http_req_duration{scenario:agent_storm}` p99 | < 1 s |
| Agent storm | `http_req_duration{scenario:agent_storm}` p95 | < 500 ms |
| Agent storm | `http_req_failed{scenario:agent_storm}` | < 0.1% |

429 rate-limit responses on the ACME burst are EXPECTED: Phase 5's
per-account rate limiter SHOULD fire at sustained 200-VU pressure.
The custom `acme_rate_limited_count` Counter tracks how often it
fires; the `acme_rate_limit_shape_ok` Counter increments only when a
429 returns the RFC 7807 `application/problem+json` shape with the
`urn:ietf:params:acme:error:rateLimited` type. A regression that
returned plain-text 429 or a different problem type would surface as
`(acme_rate_limited_count - acme_rate_limit_shape_ok) > 0` in the
summary.

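The shape check behind the two counters can be expressed as a small predicate. The sketch below is written in Go for illustration and is not the k6 script's code; `RateLimitShapeOK` is a hypothetical name, and the sample responses are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
	"mime"
	"net/http"
)

const rateLimitedType = "urn:ietf:params:acme:error:rateLimited"

// RateLimitShapeOK reports whether a response is a 429 whose media type is
// application/problem+json and whose problem "type" is ACME rateLimited.
func RateLimitShapeOK(status int, contentType string, body []byte) bool {
	if status != http.StatusTooManyRequests {
		return false
	}
	// ParseMediaType strips parameters such as "; charset=utf-8".
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil || mediaType != "application/problem+json" {
		return false
	}
	var problem struct {
		Type string `json:"type"`
	}
	return json.Unmarshal(body, &problem) == nil && problem.Type == rateLimitedType
}

func main() {
	// Made-up samples: one success, one well-formed 429, one plain-text 429.
	samples := []struct {
		status int
		ct     string
		body   string
	}{
		{200, "application/json", `{}`},
		{429, "application/problem+json", `{"type":"` + rateLimitedType + `","detail":"slow down"}`},
		{429, "text/plain", "too many requests"},
	}
	var rateLimited, shapeOK int
	for _, s := range samples {
		if s.status == http.StatusTooManyRequests {
			rateLimited++
		}
		if RateLimitShapeOK(s.status, s.ct, []byte(s.body)) {
			shapeOK++
		}
	}
	fmt.Println("malformed 429s:", rateLimited-shapeOK) // prints "malformed 429s: 1"
}
```

A non-zero difference is the regression signal: the server rate-limited a request but answered in a shape an ACME client would not recognize.
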
### Measured baseline — TBD pending canonical-hardware capture

The Phase 8 scenarios shipped 2026-05-14. Baseline capture on a
canonical `ubuntu-latest` GitHub runner is the next operational step;
until then, the table below holds TBD placeholders. **Do NOT publish
sandbox-captured numbers here**: the same anti-pattern the original
loadtest README guards against (sandbox-aggregate placeholder vs
canonical hardware) applies to Phase 8.

| Scenario | p50 | p95 | p99 | Error rate | Date measured | Commit |
|---|---|---|---|---|---|---|
| **bulk_renewal** | TBD | TBD | TBD | TBD | TBD | TBD |
| **acme_burst** directory | TBD | TBD | TBD | TBD | TBD | TBD |
| **acme_burst** new-nonce | TBD | TBD | TBD | TBD | TBD | TBD |
| **acme_burst** renewal-info | TBD | TBD | TBD | TBD | TBD | TBD |
| **agent_storm** | TBD | TBD | TBD | TBD | TBD | TBD |

Capture procedure: trigger `loadtest.yml` from the Actions tab against
the current `master` SHA; wait for the `k6-scale` matrix jobs to
complete; download the per-scenario summary artifacts; copy p50/p95/p99
from `summary-<scenario>.json` into the table; commit the
captured numbers alongside the date + SHA. Replace this paragraph
with the captured-on row when the first canonical run lands.

### How to run the scale tier locally

```sh
# All three scenarios serially (~18 min total):
make loadtest-scale

# Individual scenarios (each ~6 min):
make loadtest-scale-bulk    # 10K cert bulk-renew
make loadtest-scale-acme    # 200 VU ACME burst
make loadtest-scale-agent   # 5K agent heartbeat storm
```

Each scenario boots its own copy of the loadtest compose stack
(postgres + tls-init + certctl-server) plus the `scale-seed` init
container that runs the SQL fixtures from `deploy/test/loadtest/seed/`.
The seed is idempotent (`ON CONFLICT … DO NOTHING`) so re-running a
scenario against the same compose stack is cheap.

### Documented limitations of the scale tier

- **JWS-signed ACME flows are not measured.** The ACME burst scenario
  hits the unauthenticated directory + new-nonce + ARI surface only.
  Measuring the JWS-signed POST hot path (new-account / new-order /
  finalize) requires bundling a JWS signer into the k6 driver (k6
  doesn't ship JWS). End-to-end JWS conformance is gated by
  `make acme-rfc-conformance-test` which drives `lego` against the
  same stack.
- **Scheduler renewal scan throughput.** The bulk-renewal scenario
  measures the inbound POST throughput; the scheduler's
  `jobProcessorLoop` drains the enqueued jobs at a fixed per-tick
  budget (`CERTCTL_RENEWAL_CONCURRENCY=25` default), and the
  throughput of that path is not amplified by adding more inbound
  bulk-renew calls. A future scenario could pull
  `/api/v1/jobs?status=pending` and measure drain time.
- **Production-sized Postgres.** The compose stack runs
  `postgres:16-alpine` with default config on a CI runner.
  Production deploys with `shared_buffers >= 1 GiB` + dedicated
  Postgres VM will have different query plans for the 10K-cert
  scan. The captured numbers translate directionally but the
  absolute ceiling is workload-specific; see the operator-tune
  ladder above for production sizing.
- **Pull-only deployment model.** Agent CSR submit, work-poll, and
  deploy-verify paths are intentionally out of scope. The heartbeat
  storm exercises the highest-frequency call on a typical fleet;
  the work-poll path runs at the same cadence but is cheap (empty
  set returned 99% of the time).

## Profiling production

When the above ladder doesn't fit your shape, profile against your
specific workload. The
[performance-baselines.md](performance-baselines.md) runbook has
single-endpoint, inventory-walk, and renewal-scan recipes you can
adapt.

## Related reading

- [`docs/operator/performance-baselines.md`](performance-baselines.md):
  per-endpoint baselines + how to re-baseline after upgrades.
- [`docs/operator/runbooks/postgres-backup.md`](runbooks/postgres-backup.md):
  Postgres-side backup discipline (necessary precondition for any
  scale tuning).
- [`deploy/ENVIRONMENTS.md`](../../deploy/ENVIRONMENTS.md): the
  full env-var inventory the values referenced above come from.