Phase 8 of the certctl architecture diligence remediation closes
SCALE-H2 by adding three new k6 scenarios that exercise the scale-
relevant load surfaces the API tier + connector tier left uncovered:
fleet-scale bulk renewal, ACME enrollment burst, and agent heartbeat
storm.
Audit miscount + path correction (live-grep at Phase 8 audit time)
==================================================================
- The Phase 8 prompt referenced both `deploy/test/load/` and
`deploy/test/loadtest/`. Repo truth: the existing harness lives at
`deploy/test/loadtest/`. New scenarios land there.
- The audit's prior framing "k6 covers the API tier at 50 req/s
only" omitted Bundle 10 (2026-05-02) which added four connector-
tier handshake scenarios (nginx/apache/haproxy/f5) at 100 conns/min
each, plus the Phase 5 ACME directory/nonce/ARI scenario at 100 VUs
in `k6/acme_flow.js`. Phase 8 appends to what's there rather than
rewriting.
What ships
==========
Three new k6 scenario files under deploy/test/loadtest/k6/:
bulk_renewal.js — 10K-cert seed + 5 req/s POST /bulk-renew × 5min
p99 < 5s, p95 < 2s, errors < 1%
acme_burst.js — 200 VU sustained × directory/nonce/ARI × 5min
directory p95 < 500ms, nonce p95 < 300ms,
renewal-info p95 < 800ms, 5xx-only < 0.1%
Pins RFC 7807 rate-limit response shape via
acme_rate_limit_shape_ok Counter.
agent_storm.js — 5K-agent seed + 167 req/s POST /heartbeat × 5min
p99 < 1s, p95 < 500ms, errors < 0.1%
Two seed SQL fixtures under deploy/test/loadtest/seed/:
01_bulk_renewal_certs.sql — 10,000 managed_certificates rows
linked to seed_demo.sql FKs (iss-local, o-alice, t-platform,
rp-standard). status='active', expires_at distributed across
next 30 days, name prefix `loadtest-bulk-` so the scenario
can scope its criteria. Idempotent via
ON CONFLICT (name) DO NOTHING.
02_agent_fleet.sql — 5,000 agents rows with name prefix
`loadtest-agent-`. status='Online', last_heartbeat_at
staggered across prior 60s, OS distribution 80%/10%/10%
linux/windows/darwin. Idempotent via
ON CONFLICT (id) DO NOTHING.
Plus seed/README.md documenting the opt-in profile + when these
run vs the default `make loadtest` fast path.
Compose + Makefile + CI wiring
==============================
deploy/test/loadtest/docker-compose.yml gains four new services,
all gated behind the `scale` compose profile so the default
`make loadtest` is unchanged:
scale-seed — one-shot postgres:16-alpine container that runs
every ./seed/*.sql in lexical order against the
same postgres the server uses. Depends on
postgres healthy + certctl-server healthy (so
migrations + seed_demo.sql have already run).
k6-scale-bulk — grafana/k6:0.54.0 driver running bulk_renewal.js
k6-scale-acme — grafana/k6:0.54.0 driver running acme_burst.js
k6-scale-agent — grafana/k6:0.54.0 driver running agent_storm.js
Each driver depends_on scale-seed completed_successfully so the
scenarios never run against an unseeded DB (the acme scenario
doesn't need the seed itself but uses the same dependency chain for
ordering predictability).
Makefile gains four new phony targets:
loadtest-scale-bulk - runs bulk_renewal.js via compose --profile scale
loadtest-scale-acme - runs acme_burst.js
loadtest-scale-agent - runs agent_storm.js
loadtest-scale - all three serially
.github/workflows/loadtest.yml gains a new k6-scale matrix job that
runs after the existing k6 job (needs: k6) with a matrix on the
three scenarios — fail-fast: false so a regression in one scenario
doesn't cancel the others. Same workflow_dispatch + weekly cron
cadence as the existing API + connector tier job.
Documentation
=============
docs/operator/scale.md gains a new "Scale-tier scenarios (SCALE-H2,
Phase 8)" section between the cursor-pagination subsection and the
profiling-production subsection. Documents:
- Scenario + seed + sustained load table
- Threshold contract (regression guards, NOT measured baselines)
- Measured-baseline table with TBD placeholders + the canonical-
hardware capture procedure
- How to run the scale tier locally
- Four documented limitations (JWS-signed ACME, scheduler renewal
scan throughput, production-sized Postgres, pull-only deployment
model)
deploy/test/loadtest/README.md gains a short "Scale tier (Phase 8
SCALE-H2, 2026-05-14)" section pointing at scale.md as the canonical
operator-facing baseline source. Avoids duplication; the README
remains the harness-mechanics doc.
Deliberate deviations from the prompt
======================================
The Phase 8 prompt's "concrete deliverables" section referenced
`deploy/test/load/` (no -test) for the new k6 files. The actual
harness lives at `deploy/test/loadtest/` — the new files land there
to match existing convention. The prompt's audit-questions section
also referenced `deploy/test/loadtest/` so the prompt was internally
inconsistent on this; repo truth wins.
The prompt described the ACME burst as "200 concurrent ACME orders
against /acme/profile/<id>/new-order ... pin the rate-limit response
shape." new-order is JWS-signed (RFC 8555 §7.4 requires JWS for
every POST except newAccount-pre-account-key flows). k6 doesn't
ship JWS and bundling a signer (e.g. lego) into the k6 container
would obscure the server-side latency the scenario is trying to
measure. Same trade-off the existing Phase 5 acme_flow.js made.
Phase 8's acme_burst.js measures the unauthenticated
directory + nonce + ARI surface at burst rate AND pins the 429
rate-limit response shape via a custom Counter that increments only
when the response is `application/problem+json` with the
`urn:ietf:params:acme:error:rateLimited` type. End-to-end JWS
conformance under load remains a follow-up; the canonical JWS
correctness gate is `make acme-rfc-conformance-test` (lego-based,
non-load).
Deferred (operator-side, not engineering)
==========================================
Canonical-hardware baseline capture. The TBD placeholders in
docs/operator/scale.md's measured-baseline table are intentional —
sandbox-captured numbers from a developer laptop are misleading
(same anti-pattern the original loadtest README guards against).
Operator triggers loadtest.yml from the Actions tab, waits for the
k6-scale matrix jobs to complete, downloads the per-scenario
summary artifacts, copies p50/p95/p99 into the table, commits the
captured numbers alongside the date + commit SHA.
Files changed (10):
.github/workflows/loadtest.yml (+72 -1)
Makefile (+47 -1)
deploy/test/loadtest/README.md (+28 -1)
deploy/test/loadtest/docker-compose.yml (+108 -1)
deploy/test/loadtest/k6/bulk_renewal.js (new, 106 lines)
deploy/test/loadtest/k6/acme_burst.js (new, 192 lines)
deploy/test/loadtest/k6/agent_storm.js (new, 124 lines)
deploy/test/loadtest/seed/01_bulk_renewal_certs.sql (new, 95 lines)
deploy/test/loadtest/seed/02_agent_fleet.sql (new, 92 lines)
deploy/test/loadtest/seed/README.md (new, 86 lines)
docs/operator/scale.md (+109 -0)
Verification (sandbox-runnable):
python3 -c 'import yaml; yaml.safe_load(open("deploy/test/loadtest/docker-compose.yml"))'
→ compose YAML OK
python3 -c 'import yaml; yaml.safe_load(open(".github/workflows/loadtest.yml"))'
→ workflow YAML OK
grep -E 'bulk_renewal|acme_burst|agent_storm' deploy/test/loadtest/k6/*.js
→ all three scenarios + tags present
grep loadtest-scale Makefile
→ 4 new targets registered in .PHONY + 3 recipes + 1 aggregate
Runtime verification (deferred — requires docker on canonical hardware):
make loadtest-scale-bulk # 10K cert fixture + 5 req/s × 5min
make loadtest-scale-acme # 200 VU × 5min
make loadtest-scale-agent # 5K agent fixture + 167 req/s × 5min
make loadtest-scale # all three serially
Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-H2
certctl Documentation
Last reviewed: 2026-05-12
The full docs index, organized by audience. Pick the section that matches what you need to do; each link below opens a focused doc rather than a wall of text.
For the elevator pitch and quickstart commands, see the repo README.md at the root. For the marketing site, see certctl.io.
Getting Started
You're new to certctl, just cloned the repo, or want to understand what it does before installing.
| Doc | What it covers |
|---|---|
| Concepts | TLS certificates explained for beginners — CAs, ACME, EST, private keys, the full glossary |
| Quickstart | Five-minute setup with Docker Compose, dashboard tour, API tour |
| Examples | Five turnkey scenarios — ACME+NGINX, wildcard DNS-01, private CA+Traefik, step-ca+HAProxy, multi-issuer |
| Advanced demo | End-to-end certificate lifecycle with technical depth at each step |
| Why certctl | Positioning vs ACME clients, agent-based SaaS, enterprise platforms; when to look elsewhere |
Reference
You're operating certctl in production or building integrations and need authoritative technical detail.
| Doc | What it covers |
|---|---|
| Architecture | System design, data flow, security model, deployment topologies |
| Profiles | CertificateProfile policy object — issuer wiring, EKUs, RequiresApproval gate (with profile-edit closure) |
| API | OpenAPI 3.1 spec, integration patterns, client SDK generation |
| CLI | certctl-cli command reference and CI/CD integration patterns |
| Configuration | CERTCTL_* environment variable reference (scheduler, rate limits, deploy verify, audit, agent) |
| MCP server | Model Context Protocol integration for AI assistants |
| Release verification | Cosign / SLSA / SBOM verification procedure |
| Intermediate CA hierarchy | Multi-level CA tree management — RFC 5280 §3.2/§4.2.1.9/§4.2.1.10 enforcement |
| Auth standards implemented | RFC + CWE evidence for the API-key + RBAC + OIDC + sessions + break-glass surface (NOT a compliance-mapping doc) |
| Deployment model | Atomic write, post-deploy verify, rollback semantics across all targets |
| Vendor matrix | Tested vendor versions per target connector |
Connectors
The connector index is the canonical catalog (interfaces, registry, scanners, plus an inline reference per built-in). Per-connector deep-dive siblings cover operator-grade material — vendor edges, troubleshooting, rotation playbooks, when-to-use vs alternatives.
Issuers (13 deep-dives): ACME · ADCS · AWS ACM Private CA · DigiCert · EJBCA / Keyfactor · Entrust · GlobalSign Atlas HVCA · Google CAS · Local CA · OpenSSL / Custom CA · Sectigo SCM · step-ca / Smallstep · Vault PKI
Targets (15 deep-dives): Apache · AWS Certificate Manager · Azure Key Vault · Caddy · Envoy · F5 BIG-IP · HAProxy · IIS · Java Keystore · Kubernetes Secrets · NGINX · Postfix / Dovecot · SSH (agentless) · Traefik · Windows Certificate Store
Protocols
| Doc | What it covers |
|---|---|
| ACME server | Run certctl as an RFC 8555 + RFC 9773 ARI ACME server |
| ACME server threat model | Security posture for the ACME server endpoint |
| SCEP server | RFC 8894 native SCEP server — RA cert config, multi-profile dispatch, must-staple, mTLS sibling route |
| SCEP for Microsoft Intune | Intune-specific deployment guide — NDES replacement playbook |
| EST server | RFC 7030 EST server — 802.1X / Wi-Fi enrollment, IoT bootstrap, channel binding |
| CRL & OCSP | RFC 5280 CRL + RFC 6960 OCSP responder for relying parties |
| Async CA polling | Bounded polling for async-CA issuer connectors |
Operator
You're running certctl in production and need operational guidance.
| Doc | What it covers |
|---|---|
| Security posture | Auth, rate limits, encryption at rest, key rotation, RBAC + OIDC + sessions + break-glass, bootstrap |
| Secret custody | Where private keys live; FileDriver vs HSM/KMS; encryption wire format; env-seeded vs DB-seeded plaintext policy |
| Observability | Metrics surface, Prometheus exposition vs client_golang, tracing scope, log structure, rate-limit semantics across restarts/replicas |
| RBAC operator reference | Roles, permissions, scopes, scope-down + day-0 bootstrap |
| Auth threat model | API-key + RBAC + OIDC + sessions + break-glass — token forgery, session hijacking, IdP compromise, role-grant abuse, bootstrap-token leak, audit-mutation |
| OIDC / SSO runbooks | Per-IdP setup guides — Keycloak, Authentik, Okta, Auth0, Entra ID, Google Workspace |
| Control plane TLS | Self-signed bootstrap, operator-supplied Secret, cert-manager Certificate CR |
| Database TLS | PostgreSQL transport encryption |
| Approval workflow | Two-person integrity gate for high-stakes issuance + profile-edit closure |
| Helm deployment | Kubernetes installation via the bundled chart |
| Performance baselines | Operator-runnable benchmarks for regression spot checks |
| Auth benchmarks | Session + OIDC validation p99 targets and measured baselines |
| Legacy clients (TLS 1.2) | Reverse-proxy runbook for embedded EST/SCEP clients on TLS 1.2 |
Runbooks
| Runbook | When |
|---|---|
| Cloud targets | AWS ACM + Azure Key Vault deployment, debugging, rollback |
| Expiry alerts | Per-policy multi-channel routing matrix, severity tiers |
| Disaster recovery | CRL cache, OCSP responder cert, CA private-key rotation, Postgres restore |
| Config-encryption upgrade | Force v1/v2 → v3 re-seal across the database; passphrase rotation procedure |
| PostgreSQL backup | Operator-run backup recipe (docker-compose + Kubernetes); recommended cadence; quarterly DR dry-run |
Migration
You're moving from another cert-management tool to certctl, or running both in parallel.
| From | Doc |
|---|---|
| Certbot | migration/from-certbot.md |
| acme.sh | migration/from-acmesh.md |
| cert-manager (coexistence, not replacement) | migration/cert-manager-coexistence.md |
| Caddy ACME (point Caddy at certctl) | migration/acme-from-caddy.md |
| cert-manager ACME (point cert-manager at certctl) | migration/acme-from-cert-manager.md |
| Traefik ACME (point Traefik at certctl) | migration/acme-from-traefik.md |
| API keys → RBAC (v2.0.x → v2.1.0) | migration/api-keys-to-rbac.md — AUDIT YOUR API KEYS post-upgrade |
| Enable OIDC SSO | migration/oidc-enable.md — step-by-step OIDC onboarding for an existing API-key + RBAC deployment |
Contributor
You're contributing to certctl, running tests locally, or trying to understand the CI pipeline.
| Doc | What it covers |
|---|---|
| Testing strategy | What we test and why; per-PR fast gates vs daily deep-scan |
| Test environment | Local environment with real CAs (Pebble, step-ca, etc.) |
| QA prerequisites | Before running QA: stack boot, demo data baseline, env vars |
| QA test suite | qa_test.go reference for release QA |
| GUI QA checklist | Manual GUI verification pass for release |
| Release sign-off | Release-day checklist — code state, automated gates, manual QA, artefact verification |
| CI pipeline | CI shape, regression guards, adding new checks |
| CI guards | Per-class CI guards (code-shape, contract-parity, build/dep, operational); how to add one |
Archive
Historical docs preserved for reference. Most operators don't need these.
| Doc | Why archived |
|---|---|
| Upgrade to TLS (v2.2) | Pre-v2.2 HTTPS-everywhere upgrade procedure |
| Upgrade past v2 JWT removal | G-1 milestone JWT auth removal procedure |
Reading order by role
First-time operator: Concepts → Quickstart → Examples. About 90 minutes end to end.
Production operator: Architecture → Security posture → Control plane TLS → Disaster recovery runbook. About 4 hours end to end.
PKI engineer: ACME server → SCEP server → EST server → Intermediate CA hierarchy. About 6 hours end to end.
Contributor: Architecture → Testing strategy → Test environment → CI pipeline. About 3 hours end to end.