certctl/docs/operator/scale.md
shankar0123 6acf3559a3 docs(scale): TEST-005 — split scale baseline into its own canonical record
Sprint 5 unified-master-audit closure. Pre-fix:

  - docs/operator/scale.md L163-185 held a TBD-laden table with 5
    scenario rows. The Phase 8 scenarios shipped 2026-05-14; baseline
    capture on canonical hardware was 'the next operational step'
    that had not been taken.
  - Acquirers + operators asking 'what's the scale ceiling?' got
    'TBD' as the in-tree answer.

The audit's fix wanted three things:
  1. Capture p50/p95/p99 + error rate + memory profile on a fixed-
     spec runner.
  2. Replace the scale.md TBD rows with real numbers.
  3. Archive k6 artifacts under deploy/test/loadtest-artifacts/.

The actual capture is a workflow_dispatch run the operator triggers
on a real Linux runner — it can't happen from a sandbox without
Docker. What I CAN deliver in this commit is the canonical-record
infrastructure that turns the next workflow run into a baseline that
sticks:

  - New docs/operator/scale-baseline-2026-Q2.md is the canonical
    record. Documents the three scenarios, the methodology, the
    capture procedure, and a 'Latest capture' table with
    placeholder rows ready to receive the workflow_dispatch run's
    numbers. The doc explicitly defends the 'ubuntu-latest runner'
    choice (reproducibility > paid-AWS-account specificity).
  - docs/operator/scale.md L163-185 — the TBD table — replaced with
    a pointer paragraph to the new baseline file. Per the
    canonical-doc-pointer pattern: the operator-posture doc changes
    when scenarios change; the baseline doc changes on every
    capture. Splitting them avoids review-noise on per-capture
    commits.
  - New deploy/test/loadtest-artifacts/ directory with a README
    documenting the long-term-archive contract (the GHA artifact
    retention is 90 days; numbers acquisition reviewers look at
    months later need a committed home).

Operator next steps to fill the placeholders:
  1. Trigger Actions → loadtest → Run workflow.
  2. Download the three matrix-leg artifacts.
  3. Update the baseline doc's 'Latest capture' rows.
  4. Commit the raw artifacts (or git-lfs for >100 MB archives) to
     deploy/test/loadtest-artifacts/.

Closes TEST-005 (infrastructure side). Numbers land on the next
canonical-runner workflow_dispatch capture.
2026-05-16 05:19:57 +00:00


Operator scale guide

Last reviewed: 2026-05-16

Use this when:

  • You're sizing a new certctl deployment for a target fleet count.
  • You're scaling an existing deployment up from demo (15 certs / 1 agent) to production (1K+ certs / 100+ agents).
  • An auditor asks "what does this scale to?" and you want a documented answer that isn't "we haven't measured."

DB connection pool

certctl's PostgreSQL connection pool is the single largest scale lever. Pool exhaustion looks like 503s + agent poll timeouts + scheduler falling behind on its loops. The default ships at 50 max open connections (CERTCTL_DATABASE_MAX_CONNS=50), with idle = max/5 = 10 under the existing internal/repository/postgres/db.go::NewDBWithMaxConns contract.

Operator-tune ladder:

| Fleet size | CERTCTL_DATABASE_MAX_CONNS | Postgres max_connections | Notes |
| --- | --- | --- | --- |
| ≤ 500 certs / 100 agents | 50 (default) | 100 (PG default) | Demo + small deployments. Pool default sized for this. |
| 5K certs / 1K agents | 100 | 200 | Postgres needs an explicit bump from the 100 default; a Postgres restart is required (max_connections cannot be changed by a reload). |
| 50K certs / 10K agents | 200 | 400 | Plus dedicated Postgres VM (separate from server host); shared_buffers ≥ 1 GiB. |

Always leave headroom in Postgres's max_connections for backups (pg_dump opens its own connection), ad-hoc psql sessions, and replicas. A safe floor for max_connections is (server pool size × number of certctl server replicas) + 20.
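The floor formula above can be checked against the ladder with a few lines of Python. This is a minimal sketch, not part of certctl: the function name is hypothetical, and it assumes "replicas" means the number of certctl server instances sharing one Postgres.

```python
def min_pg_max_connections(pool_size: int, server_replicas: int = 1, headroom: int = 20) -> int:
    """Floor for Postgres max_connections: (pool size x server replicas) + headroom.

    Assumption: each server replica runs its own pool of `pool_size` connections.
    """
    if pool_size < 1 or server_replicas < 1 or headroom < 0:
        raise ValueError("pool_size and server_replicas must be >= 1, headroom >= 0")
    return pool_size * server_replicas + headroom


# (CERTCTL_DATABASE_MAX_CONNS, Postgres max_connections) rows from the ladder.
LADDER = [(50, 100), (100, 200), (200, 400)]

for pool, pg_max in LADDER:
    floor = min_pg_max_connections(pool)
    print(f"pool={pool} floor={floor} ladder={pg_max} ok={pg_max >= floor}")
```

With a single server replica every ladder row clears the floor. With two replicas at a pool size of 100 the floor becomes 220, which is above the ladder's 200, so multi-replica deployments may need a larger max_connections than the table suggests.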

Numbers above the small-fleet row are operator-tuning starting points, not validated ceilings. Phase 8 of the architecture diligence remediation will replace these with measured values from synthetic fleets; until then, capture your own observations in a loadtest log and tune against them.

Scheduler tick budgets

certctl has 15 scheduler loops, each with its own cadence (internal/scheduler/scheduler.go). The renewal scan is the hottest loop on large fleets: it pulls every managed certificate, applies each profile's renewal policy, and dispatches an issuance job per cert that meets the threshold. The default cadence is 1h (CERTCTL_SCHEDULER_RENEWAL_CHECK_INTERVAL).

Phase 6 SCALE-M5 closure (2026-05-14) added per-ticker jitter via the internal/scheduler.JitteredTicker wrapper. Each loop's interval is unchanged; the wrapper adds ±10% randomized delay per tick so multiple loops with the same nominal cadence don't co-fire and cause hour-boundary CPU + DB spikes. For most fleets the visible effect is a smoother CPU graph during the renewal scan.
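To illustrate what ±10% jitter means for a 1h cadence, here is a small Python sketch. It is not the Go JitteredTicker implementation; the function name is hypothetical and the uniform distribution is an assumption, since the source only states the ±10% bound.

```python
import random


def jittered_interval(base_seconds: float, jitter_fraction: float = 0.10, rng=None) -> float:
    """Return base_seconds scaled by a random factor in [1 - jitter, 1 + jitter].

    Assumption: the offset is drawn uniformly; the real wrapper may differ.
    """
    rng = rng if rng is not None else random.Random()
    offset = rng.uniform(-jitter_fraction, jitter_fraction)
    return base_seconds * (1.0 + offset)


# A 1h loop fires somewhere between 54 and 66 minutes after the previous tick.
rng = random.Random(42)
samples = [jittered_interval(3600.0, rng=rng) for _ in range(5)]
print([round(s / 60.0, 1) for s in samples])
```

Because each loop draws its own offset on every tick, two loops with the same nominal 1h interval drift apart instead of firing together on the hour.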

Renewal-sweep semaphore (SCALE-L1). The renewal loop dispatches concurrent issuance work behind a per-tick semaphore (default CERTCTL_RENEWAL_CONCURRENCY=25). Under tick-budget pressure (a tick that exceeds the loop interval), the semaphore can hold the entire concurrency cap until the context cancels at the next-tick boundary. This is intentional: the drain happens via context cancellation, and no new work is started past the deadline. Tests in internal/scheduler/ pin this drain behavior. Operators on large fleets should:

  1. Bump CERTCTL_RENEWAL_CONCURRENCY to 50 or 100 if the renewal scan consistently exceeds tick budget.
  2. Also bump CERTCTL_DATABASE_MAX_CONNS proportionally — each concurrent renewal task opens its own pool connection during issuance / deployment.
  3. Watch for the "renewal scan complete" log line per tick. If it's consistently late, you're under-provisioned.
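The semaphore-plus-deadline pattern described above can be sketched in Python with asyncio. This is an illustration under stated assumptions, not the scheduler's Go code: the function and parameter names are hypothetical, and the sketch only models "no new work past the deadline"; it does not cancel work already in flight.

```python
import asyncio
import time


async def renewal_sweep(cert_ids, issue, concurrency=25, budget_seconds=3600.0):
    """Run issue(cert_id) for each cert with at most `concurrency` in flight.

    Once the tick budget is spent, remaining certs are skipped rather than
    started. Returns (started, skipped) lists of cert ids.
    """
    sem = asyncio.Semaphore(concurrency)
    deadline = time.monotonic() + budget_seconds
    started, skipped = [], []

    async def worker(cert_id):
        async with sem:  # blocks while `concurrency` tasks hold the semaphore
            if time.monotonic() >= deadline:
                skipped.append(cert_id)  # budget spent: leave for the next tick
                return
            started.append(cert_id)
            await issue(cert_id)

    await asyncio.gather(*(worker(c) for c in cert_ids))
    return started, skipped
```

Calling `asyncio.run(renewal_sweep(ids, issue_fn, concurrency=50))` mirrors raising CERTCTL_RENEWAL_CONCURRENCY to 50. A non-empty `skipped` list corresponds to the under-provisioned case in step 3, where the scan cannot finish within its tick.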

Async CA polling budgets (SCALE-M3)

DigiCert, Entrust, GlobalSign, and Sectigo are async issuers — they accept a CSR, queue it on the CA side, and return a polling token. The certctl server polls the CA's status endpoint until the cert is ready or the deadline expires. The default poll-deadline is 10 minutes wall-clock (asyncpoll.DefaultMaxWait); after that the issuance returns StillPending and the scheduler re-enqueues the job for the next tick.

Priority chain when picking the actual deadline (highest → lowest):

  1. Per-connector env: CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS, CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS, CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS, CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS.
  2. Global env: CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS (sets the process-wide default for all async-CA connectors that didn't set their per-connector value).
  3. Package const: asyncpoll.DefaultMaxWait = 10 * time.Minute.

Operators with slow async CAs (Entrust certificate-mode in particular can take 15-30 minutes during business hours) should raise the per-connector value rather than the global; that way fast issuers don't pay the polling cost.
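The priority chain can be expressed as a short lookup. The sketch below is Python rather than the server's Go, and the function name is hypothetical. The environment variable names and the 10-minute default come from the list above; treating unparsable or non-positive values as "unset" is an assumption about how the real parser behaves.

```python
import os

# Mirrors asyncpoll.DefaultMaxWait = 10 * time.Minute.
DEFAULT_MAX_WAIT_SECONDS = 10 * 60


def resolve_poll_max_wait(connector: str, env=None) -> int:
    """Resolve the poll deadline in seconds: per-connector env, then global env, then default."""
    env = os.environ if env is None else env
    candidates = (
        f"CERTCTL_{connector.upper()}_POLL_MAX_WAIT_SECONDS",
        "CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS",
    )
    for name in candidates:
        raw = env.get(name, "").strip()
        try:
            value = int(raw)
        except ValueError:
            continue  # assumption: unparsable or empty values fall through
        if value > 0:
            return value
    return DEFAULT_MAX_WAIT_SECONDS


env = {
    "CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS": "1800",  # 30 min for a slow CA
    "CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS": "900",
}
print(resolve_poll_max_wait("entrust", env))   # per-connector value wins
print(resolve_poll_max_wait("digicert", env))  # falls back to the global value
print(resolve_poll_max_wait("sectigo", {}))    # falls back to the package default
```

Raising only the Entrust value, as recommended above, leaves the other connectors on the shorter deadline.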

Cursor pagination caching (SCALE-L2)

Phase 6 SCALE-L2 closure (2026-05-14) added an ETag middleware at internal/api/middleware/etag.go covering the top-5 read endpoints: /api/v1/certificates, /api/v1/jobs, /api/v1/agents, /api/v1/audit, /api/v1/discovery/certificates. The ETag is derived from (max-row-updated-at, row-count) for the requested filter; repeated requests with the same query return 304 Not Modified when the underlying data hasn't changed. The dashboard benefits most — its polling loop on the certificates page is the single largest read-traffic source on most deployments.

When the cache is effective, repeated reads bypass the SELECT COUNT(*) FROM <table> query entirely. The cache invalidates on any mutation to the table (the row-count + max-updated-at hash flips).

Operators don't need to do anything to opt in — the middleware is wired around the top-5 endpoints unconditionally. If you want to verify it's working, check the ETag: response header on a list endpoint and repeat the request with the same value in an If-None-Match: header — the second request should return 304 with an empty body.
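The conditional-request flow can be modeled in a few lines. This Python sketch is not the middleware in internal/api/middleware/etag.go: the function names, the SHA-256 hash, and the response body are all assumptions chosen for illustration. It also simplifies If-None-Match handling to a single exact-match value.

```python
import hashlib
from datetime import datetime, timezone


def list_etag(max_updated_at: datetime, row_count: int) -> str:
    """Derive a quoted ETag from (max-row-updated-at, row-count)."""
    payload = f"{max_updated_at.isoformat()}|{row_count}".encode("utf-8")
    return '"' + hashlib.sha256(payload).hexdigest()[:32] + '"'


def respond(if_none_match, max_updated_at: datetime, row_count: int):
    """Return (status, etag, body) for a list request."""
    etag = list_etag(max_updated_at, row_count)
    if if_none_match is not None and if_none_match.strip() == etag:
        return 304, etag, b""  # client copy is current: empty body
    return 200, etag, b'{"data": []}'  # placeholder body


last_update = datetime(2026, 5, 14, 12, 0, tzinfo=timezone.utc)
status, etag, _ = respond(None, last_update, 10_000)
print(status, etag)                                 # first request: 200 plus an ETag
print(respond(etag, last_update, 10_000)[0])        # same data: 304
print(respond(etag, last_update, 10_001)[0])        # a row was added: 200 again
```

Any mutation changes either the row count or the maximum updated-at timestamp, so the derived tag changes and the next request receives a full 200 response.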

Scale-tier scenarios (SCALE-H2, Phase 8)

Phase 8 (2026-05-14) extended the k6 load-test harness with three new scenarios that exercise the scale-relevant load surfaces the original API tier left uncovered. They live behind a compose profile gate (docker compose --profile scale) so the default make loadtest stays focused on per-PR regression scope. The full set runs weekly on the same loadtest.yml cron as the API + connector tier.

| Scenario | k6 file | Seed fixture | Sustained load |
| --- | --- | --- | --- |
| Bulk-renewal under load | deploy/test/loadtest/k6/bulk_renewal.js | 10,000 managed_certificates (seed/01_bulk_renewal_certs.sql) | 5 req/s POST /api/v1/certificates/bulk-renew × 5 min |
| ACME enrollment burst | deploy/test/loadtest/k6/acme_burst.js | (none; unauth surface) | 200 concurrent VUs × directory/nonce/ARI × 5 min |
| Agent heartbeat storm | deploy/test/loadtest/k6/agent_storm.js | 5,000 agents (seed/02_agent_fleet.sql) | 167 req/s POST /api/v1/agents/{id}/heartbeat × 5 min |
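The sustained-load figures imply total request volumes per run and, for the heartbeat storm, an average per-agent interval. A quick sketch of that arithmetic in Python (the helper names are illustrative and not part of the harness):

```python
def total_requests(rate_per_second: float, minutes: float) -> int:
    """Requests issued at a constant arrival rate over the given duration."""
    return round(rate_per_second * minutes * 60)


def implied_interval_seconds(agents: int, rate_per_second: float) -> float:
    """Average seconds between heartbeats per agent if load is spread evenly."""
    return agents / rate_per_second


print(total_requests(5, 5))                           # bulk-renewal POSTs per run
print(total_requests(167, 5))                         # heartbeats per run
print(round(implied_interval_seconds(5000, 167), 2))  # seconds between heartbeats per agent
```

The bulk-renewal scenario issues 1,500 requests and the heartbeat storm 50,100. Spreading 167 req/s across 5,000 agents works out to roughly one heartbeat per agent every 30 seconds, though the source does not state the modeled heartbeat interval explicitly.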

Threshold contracts (regression guards, NOT measured baselines)

| Scenario | Metric | Threshold |
| --- | --- | --- |
| Bulk-renewal | http_req_duration{scenario:bulk_renewal} | p99 < 5 s |
| Bulk-renewal | http_req_duration{scenario:bulk_renewal} | p95 < 2 s |
| Bulk-renewal | http_req_failed{scenario:bulk_renewal} | < 1% |
| ACME burst | acme_directory_duration | p95 < 500 ms |
| ACME burst | acme_new_nonce_duration | p95 < 300 ms |
| ACME burst | acme_renewal_info_duration | p95 < 800 ms |
| ACME burst | http_req_failed{server_error:true} (5xx-only) | < 0.1% |
| Agent storm | http_req_duration{scenario:agent_storm} | p99 < 1 s |
| Agent storm | http_req_duration{scenario:agent_storm} | p95 < 500 ms |
| Agent storm | http_req_failed{scenario:agent_storm} | < 0.1% |
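When comparing your own run's summary against these contracts outside of k6, the table can be encoded as data. The sketch below is Python, not the k6 threshold configuration; the dictionary keys and the `violations` helper are hypothetical names, and only the numeric limits come from the table.

```python
# Limits from the threshold table; durations in milliseconds, rates as fractions.
THRESHOLDS = {
    "bulk_renewal": {"p99_ms": 5000.0, "p95_ms": 2000.0, "failed_rate": 0.01},
    "acme_burst": {
        "directory_p95_ms": 500.0,
        "new_nonce_p95_ms": 300.0,
        "renewal_info_p95_ms": 800.0,
        "server_error_rate": 0.001,
    },
    "agent_storm": {"p99_ms": 1000.0, "p95_ms": 500.0, "failed_rate": 0.001},
}


def violations(scenario: str, observed: dict) -> list:
    """Return the metric names whose observed value is not strictly below its limit.

    A metric missing from `observed` is reported as a violation.
    """
    limits = THRESHOLDS[scenario]
    return sorted(
        name for name, limit in limits.items()
        if observed.get(name, float("inf")) >= limit
    )


run = {"p99_ms": 4200.0, "p95_ms": 2100.0, "failed_rate": 0.002}
print(violations("bulk_renewal", run))  # p95 is over its 2 s limit
```

The comparisons are strict, matching the "<" in the table: a p95 of exactly 500 ms on the agent storm counts as a failure.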

429 rate-limit responses on the ACME burst are EXPECTED — Phase 5's per-account rate limiter SHOULD fire at sustained 200-VU pressure. The custom acme_rate_limited_count Counter tracks how often it fires; acme_rate_limit_shape_ok Counter verifies every 429 returns the RFC 7807 application/problem+json shape with the urn:ietf:params:acme:error:rateLimited type. A regression that returned plain-text 429 or a different problem type would surface as (rate_limited_count - shape_ok_count) > 0 in the summary.
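The shape check behind the two counters can be sketched as follows. This is Python, not the k6 script's JavaScript; the function names are hypothetical, while the content type and problem type are the ones named above.

```python
import json

RATE_LIMITED_TYPE = "urn:ietf:params:acme:error:rateLimited"


def rate_limit_shape_ok(status: int, content_type: str, body: bytes) -> bool:
    """True if a 429 carries an application/problem+json body with the rateLimited type."""
    if status != 429:
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "application/problem+json":
        return False
    try:
        doc = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return False
    return isinstance(doc, dict) and doc.get("type") == RATE_LIMITED_TYPE


def shape_regressions(responses) -> int:
    """Count 429s that fail the shape check: rate_limited_count - shape_ok_count."""
    rate_limited = [r for r in responses if r[0] == 429]
    shape_ok = [r for r in rate_limited if rate_limit_shape_ok(*r)]
    return len(rate_limited) - len(shape_ok)


good = (429, "application/problem+json", b'{"type": "urn:ietf:params:acme:error:rateLimited"}')
plain = (429, "text/plain", b"Too Many Requests")
print(shape_regressions([good, good, plain]))  # one 429 has the wrong shape
```

A non-zero result corresponds to the regression signal described above: at least one 429 was returned without the expected problem document.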

Measured baseline

TEST-005 closure (Sprint 5, 2026-05-16) moved the baseline table out of this file into its own canonical record: docs/operator/scale-baseline-2026-Q2.md. That doc owns the capture procedure, the methodology, and the per-scenario rows; this page links to it as the authoritative source.

The split exists because the baseline table is mutable on every loadtest workflow_dispatch run, while this page (the operator-facing scale posture doc) changes only when the underlying scenarios or thresholds change. Keeping them in separate files avoids review-noise on per-capture commits.

Long-term k6 NDJSON artifacts beyond GHA's 90-day retention live at deploy/test/loadtest-artifacts/.

How to run the scale tier locally

# All three scenarios serially (~18 min total):
make loadtest-scale

# Individual scenarios (each ~6 min):
make loadtest-scale-bulk     # 10K cert bulk-renew
make loadtest-scale-acme     # 200 VU ACME burst
make loadtest-scale-agent    # 5K agent heartbeat storm

Each scenario boots its own copy of the loadtest compose stack (postgres + tls-init + certctl-server) plus the scale-seed init container that runs the SQL fixtures from deploy/test/loadtest/seed/. The seed is idempotent (ON CONFLICT … DO NOTHING) so re-running a scenario against the same compose stack is cheap.
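The idempotency property of the seed can be demonstrated with a stand-in. The sketch below uses Python's built-in sqlite3 module instead of Postgres, and the `agents` table with its two columns is a hypothetical schema, not the one in deploy/test/loadtest/seed/. It only shows why ON CONFLICT … DO NOTHING makes a second seeding pass insert nothing.

```python
import sqlite3


def seed_agents(conn: sqlite3.Connection, agent_ids) -> int:
    """Insert agents, skipping ids that already exist. Returns rows actually added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS agents (id TEXT PRIMARY KEY, name TEXT NOT NULL)"
    )
    before = conn.execute("SELECT COUNT(*) FROM agents").fetchone()[0]
    conn.executemany(
        "INSERT INTO agents (id, name) VALUES (?, ?) ON CONFLICT (id) DO NOTHING",
        [(agent_id, f"agent-{agent_id}") for agent_id in agent_ids],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM agents").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
print(seed_agents(conn, ["a1", "a2", "a3"]))  # first pass inserts every row
print(seed_agents(conn, ["a1", "a2", "a3"]))  # second pass inserts nothing
```

The second pass still has to evaluate each row against the primary key, but it writes nothing, which is why re-running a scenario against an already-seeded stack is cheap.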

Documented limitations of the scale tier

  • JWS-signed ACME flows are not measured. The ACME burst scenario hits the unauthenticated directory + new-nonce + ARI surface only. Measuring the JWS-signed POST hot path (new-account / new-order / finalize) requires bundling a JWS signer into the k6 driver (k6 doesn't ship JWS). End-to-end JWS conformance is gated by make acme-rfc-conformance-test which drives lego against the same stack.
  • Scheduler renewal scan throughput is not measured. The bulk-renewal scenario measures the inbound POST throughput; the scheduler's jobProcessorLoop drains the enqueued jobs at a fixed per-tick budget (CERTCTL_RENEWAL_CONCURRENCY=25 default), so adding more inbound bulk-renew calls does not increase the throughput of that path. A future scenario could poll /api/v1/jobs?status=pending and measure drain time.
  • Production-sized Postgres. The compose stack runs postgres:16-alpine with default config on a CI runner. Production deploys with shared_buffers >= 1 GiB + dedicated Postgres VM will have different query plans for the 10K-cert scan. The captured numbers translate directionally but the absolute ceiling is workload-specific — see the operator-tune ladder above for production sizing.
  • Pull-only deployment model. Agent CSR submit, work-poll, and deploy-verify paths are intentionally out of scope. The heartbeat storm exercises the highest-frequency call on a typical fleet; the work-poll path runs at the same cadence but is cheap (empty set returned 99% of the time).

Profiling production

When the above ladder doesn't fit your shape, profile against your specific workload. The performance-baselines.md runbook has single-endpoint, inventory-walk, and renewal-scan recipes you can adapt.