mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 14:01:36 +00:00

shankar0123 6acf3559a3 docs(scale): TEST-005 — split scale baseline into its own canonical record

Sprint 5 unified-master-audit closure. Pre-fix:

  - docs/operator/scale.md L163-185 held a TBD-laden table with 5
    scenario rows. The Phase 8 scenarios shipped 2026-05-14; baseline
    capture on canonical hardware was 'the next operational step'
    that had not been taken.
  - Acquirers + operators asking 'what's the scale ceiling?' got
    'TBD' as the in-tree answer.

The audit's fix wanted three things:
  1. Capture p50/p95/p99 + error rate + memory profile on a fixed-
     spec runner.
  2. Replace the scale.md TBD rows with real numbers.
  3. Archive k6 artifacts under deploy/test/loadtest-artifacts/.

The actual capture is a workflow_dispatch run the operator triggers
on a real Linux runner — it can't happen from a sandbox without
Docker. What I CAN deliver in this commit is the canonical-record
infrastructure that turns the next workflow run into a baseline that
sticks:

  - New docs/operator/scale-baseline-2026-Q2.md is the canonical
    record. Documents the three scenarios, the methodology, the
    capture procedure, and a 'Latest capture' table with
    placeholder rows ready to receive the workflow_dispatch run's
    numbers. The doc explicitly defends the 'ubuntu-latest runner'
    choice (reproducibility > paid-AWS-account specificity).
  - docs/operator/scale.md L163-185 — the TBD table — replaced with
    a pointer paragraph to the new baseline file. Per the
    canonical-doc-pointer pattern: the operator-posture doc changes
    when scenarios change; the baseline doc changes on every
    capture. Splitting them avoids review-noise on per-capture
    commits.
  - New deploy/test/loadtest-artifacts/ directory with a README
    documenting the long-term-archive contract (the GHA artifact
    retention is 90 days; numbers acquisition reviewers look at
    months later need a committed home).

Operator next steps to fill the placeholders:
  1. Trigger Actions → loadtest → Run workflow.
  2. Download the three matrix-leg artifacts.
  3. Update the baseline doc's 'Latest capture' rows.
  4. Commit the raw artifacts (or git-lfs for >100 MB archives) to
     deploy/test/loadtest-artifacts/.

Closes TEST-005 (infrastructure side). Numbers land on the next
canonical-runner workflow_dispatch capture.

2026-05-16 05:19:57 +00:00

Operator scale guide

Last reviewed: 2026-05-16

Use this when:

- You're sizing a new certctl deployment for a target fleet count.
- You're scaling an existing deployment up from demo (15 certs / 1 agent) to production (1K+ certs / 100+ agents).
- An auditor asks "what does this scale to?" and you want a documented answer that isn't "we haven't measured."

DB connection pool

certctl's PostgreSQL connection pool is the single largest scale lever. Pool exhaustion looks like 503s + agent poll timeouts + scheduler falling behind on its loops. The default ships at 50 max open connections (CERTCTL_DATABASE_MAX_CONNS=50), with idle = max/5 = 10 under the existing internal/repository/postgres/db.go::NewDBWithMaxConns contract.

Operator-tune ladder:

| Fleet size | `CERTCTL_DATABASE_MAX_CONNS` | Postgres `max_connections` | Notes |
| --- | --- | --- | --- |
| ≤ 500 certs / 100 agents | `50` (default) | `100` (PG default) | Demo + small deployments. Pool default sized for this. |
| 5K certs / 1K agents | `100` | `200` | Postgres needs an explicit bump from the 100 default; reload required. |
| 50K certs / 10K agents | `200` | `400` | Plus dedicated Postgres VM (separate from server host); shared_buffers ≥ 1 GiB. |

Always leave headroom in Postgres's max_connections for backups (pg_dump opens its own connection), ad-hoc psql sessions, and replicas. As a rule of thumb, (server pool size × server replicas) + 20 is a safe floor for Postgres's max_connections, where "server replicas" counts the certctl server instances sharing the database.
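The sizing rules above reduce to two small pieces of arithmetic. The sketch below is illustrative only: `idleConns` and `pgMaxConnectionsFloor` are hypothetical helper names, not functions from the certctl codebase, and reading "replicas" as the number of certctl server instances is an assumption.

```go
package main

import "fmt"

// idleConns mirrors the "idle = max/5" rule quoted above (integer division).
// Hypothetical helper, not part of certctl.
func idleConns(maxConns int) int {
	return maxConns / 5
}

// pgMaxConnectionsFloor applies (server pool size × server replicas) + 20.
// The +20 is the headroom for pg_dump, ad-hoc psql sessions, and replicas.
func pgMaxConnectionsFloor(poolSize, serverReplicas int) int {
	return poolSize*serverReplicas + 20
}

func main() {
	fmt.Println(idleConns(50))                 // 10
	fmt.Println(pgMaxConnectionsFloor(50, 1))  // 70
	fmt.Println(pgMaxConnectionsFloor(100, 3)) // 320
}
```

A single server at the default pool size gives a floor of 70, which fits under the Postgres default of 100. Three servers at a pool size of 100 give 320, which would not fit under the 200 suggested in the middle row of the ladder.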

Numbers above the small-fleet row are operator-tuning starting points, not validated ceilings. The Phase 8 scale-tier scenarios of the architecture diligence remediation (described below) are meant to replace them with measured values from synthetic fleets once a baseline capture lands; until then, capture your own observations in a loadtest log and tune against them.

Scheduler tick budgets

certctl has 15 scheduler loops, each with its own cadence (internal/scheduler/scheduler.go). The renewal scan is the hottest loop on large fleets: it pulls every managed certificate, applies each profile's renewal policy, and dispatches an issuance job per cert that meets the threshold. The default cadence is 1h (CERTCTL_SCHEDULER_RENEWAL_CHECK_INTERVAL).

Phase 6 SCALE-M5 closure (2026-05-14) added per-ticker jitter via the internal/scheduler.JitteredTicker wrapper. Each loop's interval is unchanged; the wrapper adds ±10% randomized delay per tick so multiple loops with the same nominal cadence don't co-fire and cause hour-boundary CPU + DB spikes. For most fleets the visible effect is a smoother CPU graph during the renewal scan.

Renewal-sweep semaphore (SCALE-L1). The renewal loop dispatches concurrent issuance work behind a per-tick semaphore (default CERTCTL_RENEWAL_CONCURRENCY=25). Under tick-budget pressure (a tick that exceeds the loop interval), the semaphore can hold the entire concurrency cap until the context cancels at the next tick boundary. This is intentional: the drain happens via context cancellation, and no new work is started past the deadline. Tests in internal/scheduler/ pin this drain behavior. Operators on large fleets should:

- Bump CERTCTL_RENEWAL_CONCURRENCY to 50 or 100 if the renewal scan consistently exceeds tick budget.
- Also bump CERTCTL_DATABASE_MAX_CONNS proportionally, because each concurrent renewal task opens its own pool connection during issuance / deployment.
- Watch for the "renewal scan complete" log line per tick. If it's consistently late, you're under-provisioned.

Async CA polling budgets (SCALE-M3)

DigiCert, Entrust, GlobalSign, and Sectigo are async issuers — they accept a CSR, queue it on the CA side, and return a polling token. The certctl server polls the CA's status endpoint until the cert is ready or the deadline expires. The default poll-deadline is 10 minutes wall-clock (asyncpoll.DefaultMaxWait); after that the issuance returns StillPending and the scheduler re-enqueues the job for the next tick.

Priority chain when picking the actual deadline (highest → lowest):

1. Per-connector env: CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS, CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS, CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS, CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS.
2. Global env: CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS (sets the process-wide default for all async-CA connectors that didn't set their per-connector value).
3. Package const: asyncpoll.DefaultMaxWait = 10 * time.Minute.

Operators with slow async CAs (Entrust certificate-mode in particular can take 15-30 minutes during business hours) should raise the per-connector value rather than the global; that way fast issuers don't pay the polling cost.

Cursor pagination caching (SCALE-L2)

Phase 6 SCALE-L2 closure (2026-05-14) added an ETag middleware at internal/api/middleware/etag.go covering the top-5 read endpoints: /api/v1/certificates, /api/v1/jobs, /api/v1/agents, /api/v1/audit, /api/v1/discovery/certificates. The ETag is derived from (max-row-updated-at, row-count) for the requested filter; repeated requests with the same query return 304 Not Modified when the underlying data hasn't changed. The dashboard benefits most — its polling loop on the certificates page is the single largest read-traffic source on most deployments.

When the cache is effective, repeated reads bypass the SELECT COUNT(*) FROM <table> query entirely. The cache invalidates on any mutation to the table (the row-count + max-updated-at hash flips).

Operators don't need to do anything to opt in — the middleware is wired around the top-5 endpoints unconditionally. If you want to verify it's working, check the ETag: response header on a list endpoint and repeat the request with the same value in an If-None-Match: header — the second request should return 304 with an empty body.

Scale-tier scenarios (SCALE-H2, Phase 8)

Phase 8 (2026-05-14) extended the k6 load-test harness with three new scenarios that exercise the scale-relevant load surfaces the original API tier left uncovered. They live behind a compose profile gate (docker compose --profile scale) so the default make loadtest stays focused on per-PR regression scope. The full set runs weekly on the same loadtest.yml cron as the API + connector tier.

| Scenario | k6 file | Seed fixture | Sustained load |
| --- | --- | --- | --- |
| Bulk-renewal under load | `deploy/test/loadtest/k6/bulk_renewal.js` | 10,000 managed_certificates (`seed/01_bulk_renewal_certs.sql`) | 5 req/s POST `/api/v1/certificates/bulk-renew` × 5 min |
| ACME enrollment burst | `deploy/test/loadtest/k6/acme_burst.js` | (none; unauthenticated surface) | 200 concurrent VUs × directory/nonce/ARI × 5 min |
| Agent heartbeat storm | `deploy/test/loadtest/k6/agent_storm.js` | 5,000 agents (`seed/02_agent_fleet.sql`) | 167 req/s POST `/api/v1/agents/{id}/heartbeat` × 5 min |

Threshold contracts (regression guards, NOT measured baselines)

| Scenario | Metric | Threshold |
| --- | --- | --- |
| Bulk-renewal | `http_req_duration{scenario:bulk_renewal}` p99 | < 5 s |
| Bulk-renewal | `http_req_duration{scenario:bulk_renewal}` p95 | < 2 s |
| Bulk-renewal | `http_req_failed{scenario:bulk_renewal}` | < 1% |
| ACME burst | `acme_directory_duration` p95 | < 500 ms |
| ACME burst | `acme_new_nonce_duration` p95 | < 300 ms |
| ACME burst | `acme_renewal_info_duration` p95 | < 800 ms |
| ACME burst | `http_req_failed{server_error:true}` 5xx-only | < 0.1% |
| Agent storm | `http_req_duration{scenario:agent_storm}` p99 | < 1 s |
| Agent storm | `http_req_duration{scenario:agent_storm}` p95 | < 500 ms |
| Agent storm | `http_req_failed{scenario:agent_storm}` | < 0.1% |

429 rate-limit responses on the ACME burst are EXPECTED — Phase 5's per-account rate limiter SHOULD fire at sustained 200-VU pressure. The custom acme_rate_limited_count Counter tracks how often it fires; acme_rate_limit_shape_ok Counter verifies every 429 returns the RFC 7807 application/problem+json shape with the urn:ietf:params:acme:error:rateLimited type. A regression that returned plain-text 429 or a different problem type would surface as (rate_limited_count - shape_ok_count) > 0 in the summary.
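The shape check behind the two counters can be sketched as follows. The real check lives in the k6 scenario script; this Go version only illustrates the same rule, and `shapeOK` plus the sample responses are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"mime"
)

const rateLimitedType = "urn:ietf:params:acme:error:rateLimited"

// shapeOK reports whether a 429 response carries the problem+json media type
// and the ACME rateLimited problem type.
func shapeOK(contentType string, body []byte) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil || mediaType != "application/problem+json" {
		return false
	}
	var problem struct {
		Type string `json:"type"`
	}
	if err := json.Unmarshal(body, &problem); err != nil {
		return false
	}
	return problem.Type == rateLimitedType
}

func main() {
	// Two made-up 429 responses: one well-formed, one plain-text regression.
	samples := []struct {
		contentType string
		body        string
	}{
		{"application/problem+json", `{"type":"urn:ietf:params:acme:error:rateLimited","status":429}`},
		{"text/plain", "slow down"},
	}
	rateLimited, shapeOKCount := 0, 0
	for _, s := range samples {
		rateLimited++
		if shapeOK(s.contentType, []byte(s.body)) {
			shapeOKCount++
		}
	}
	// A positive difference flags at least one malformed 429.
	fmt.Println(rateLimited-shapeOKCount > 0) // true
}
```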

Measured baseline

TEST-005 closure (Sprint 5, 2026-05-16) moved the baseline table out of this file into its own canonical record: docs/operator/scale-baseline-2026-Q2.md. That doc owns the capture procedure, the methodology, and the per-scenario rows; this page links to it as the authoritative source.

The split exists because the baseline table is mutable on every loadtest workflow_dispatch run, while this page (the operator-facing scale posture doc) changes only when the underlying scenarios or thresholds change. Keeping them in separate files avoids review-noise on per-capture commits.

Long-term k6 NDJSON artifacts beyond GHA's 90-day retention live at deploy/test/loadtest-artifacts/.

How to run the scale tier locally

```shell
# All three scenarios serially (~18 min total):
make loadtest-scale

# Individual scenarios (each ~6 min):
make loadtest-scale-bulk     # 10K cert bulk-renew
make loadtest-scale-acme     # 200 VU ACME burst
make loadtest-scale-agent    # 5K agent heartbeat storm
```

Each scenario boots its own copy of the loadtest compose stack (postgres + tls-init + certctl-server) plus the scale-seed init container that runs the SQL fixtures from deploy/test/loadtest/seed/. The seed is idempotent (ON CONFLICT … DO NOTHING) so re-running a scenario against the same compose stack is cheap.

Documented limitations of the scale tier

- JWS-signed ACME flows are not measured. The ACME burst scenario hits the unauthenticated directory + new-nonce + ARI surface only. Measuring the JWS-signed POST hot path (new-account / new-order / finalize) requires bundling a JWS signer into the k6 driver (k6 doesn't ship JWS). End-to-end JWS conformance is gated by make acme-rfc-conformance-test which drives lego against the same stack.
- Scheduler renewal scan throughput. The bulk-renewal scenario measures the inbound POST throughput; the scheduler's jobProcessorLoop drains the enqueued jobs at a fixed per-tick budget (CERTCTL_RENEWAL_CONCURRENCY=25 default), and the throughput of that path is not amplified by adding more inbound bulk-renew calls. A future scenario could pull /api/v1/jobs?status=pending and measure drain time.
- Production-sized Postgres. The compose stack runs postgres:16-alpine with default config on a CI runner. Production deploys with shared_buffers >= 1 GiB + dedicated Postgres VM will have different query plans for the 10K-cert scan. The captured numbers translate directionally but the absolute ceiling is workload-specific; see the operator-tune ladder above for production sizing.
- Pull-only deployment model. Agent CSR submit, work-poll, and deploy-verify paths are intentionally out of scope. The heartbeat storm exercises the highest-frequency call on a typical fleet; the work-poll path runs at the same cadence but is cheap (empty set returned 99% of the time).

Profiling production

When the above ladder doesn't fit your shape, profile against your specific workload. The performance-baselines.md runbook has single-endpoint, inventory-walk, and renewal-scan recipes you can adapt.

- docs/operator/performance-baselines.md: per-endpoint baselines + how to re-baseline after upgrades.
- docs/operator/runbooks/postgres-backup.md: Postgres-side backup discipline (necessary precondition for any scale tuning).
- deploy/ENVIRONMENTS.md: the full env-var inventory the values referenced above come from.
