Bundle 4 closure (2026-05-13 acquisition diligence audit). Closes the
"what happens under multi-replica" question cluster: the migration
runner had no concurrency control and no applied-version ledger; 15
scheduler loops had per-process idempotency but no cross-replica
documentation; rate limits were process-local without an
operator-facing scope statement; and the load-test scope explicitly
omitted four hot paths without linking them to a roadmap.
Source findings closed:
HIGH-1 + D4 + finding 4 (migration tracking)
D8 (scheduler loop ownership)
MED-1 + MED-2 (rate-limit scope)
T9 + LOW-7 + finding 7 (load-test receipt scope)
Closures by source ID:
HIGH-1 + D4 + finding 4 — Migration tracking + advisory lock.
internal/repository/postgres/db.go::RunMigrations now wraps every
migration execution in:
1. A dedicated *sql.Conn pinned to one connection for the entire
scan + apply lifecycle (pg_advisory_lock is connection-scoped).
2. pg_advisory_lock(migrationAdvisoryLockID) — fixed int64 key
derived from "certctl-migrations" so the same constant resolves
across deployments without colliding with operator advisory
locks. Blocks the second replica until the first finishes.
3. CREATE TABLE IF NOT EXISTS schema_migrations(version TEXT PK,
applied_at TIMESTAMPTZ DEFAULT NOW()) — audit ledger.
4. Skip-applied loop: SELECT version FROM schema_migrations →
map[string]struct{} → skip every .up.sql whose filename is in
the map. INSERT after successful execute, ON CONFLICT
(version) DO NOTHING for defense in depth.
Pre-Bundle-4, every server boot re-ran all 45 .up.sql files. The
"idempotency via IF NOT EXISTS / ON CONFLICT" contract in CLAUDE.md
held per-migration but offered no protection when two Helm replicas
raced on schema DDL. Post-Bundle-4, single-replica deploys see zero
behavior change beyond the audit-table population; multi-replica
deploys get HA-safe schema bootstrap.
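A minimal sketch of the four steps above, for orientation only. It is not
the code in db.go: the FNV-1a key derivation, the names lockKey, pending,
and applyAll, and the filename-to-SQL map are assumptions made for
illustration. The commit only states that the key is a fixed int64 derived
from "certctl-migrations".

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// lockKey derives a stable int64 from a name. FNV-1a is an assumption here;
// the real constant may be derived differently.
func lockKey(name string) int64 {
	h := fnv.New64a()
	h.Write([]byte(name))
	return int64(h.Sum64())
}

// pending returns the .up.sql files that are not yet in the ledger, sorted.
func pending(files []string, applied map[string]struct{}) []string {
	var out []string
	for _, f := range files {
		if !strings.HasSuffix(f, ".up.sql") {
			continue
		}
		if _, done := applied[f]; done {
			continue
		}
		out = append(out, f)
	}
	sort.Strings(out)
	return out
}

// applyAll is a hypothetical runner; files maps filename to SQL text.
func applyAll(ctx context.Context, db *sql.DB, files map[string]string) error {
	// Step 1: pin one connection, because the advisory lock is held by the session.
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close()

	// Step 2: block until no other replica holds the lock.
	key := lockKey("certctl-migrations")
	if _, err := conn.ExecContext(ctx, "SELECT pg_advisory_lock($1)", key); err != nil {
		return err
	}
	// Runs before conn.Close (deferred calls run last-in, first-out).
	defer conn.ExecContext(context.Background(), "SELECT pg_advisory_unlock($1)", key)

	// Step 3: make sure the ledger exists.
	if _, err := conn.ExecContext(ctx, `CREATE TABLE IF NOT EXISTS schema_migrations (
		version TEXT PRIMARY KEY, applied_at TIMESTAMPTZ DEFAULT NOW())`); err != nil {
		return err
	}

	// Step 4: load applied versions, then apply only what is missing.
	rows, err := conn.QueryContext(ctx, "SELECT version FROM schema_migrations")
	if err != nil {
		return err
	}
	applied := map[string]struct{}{}
	for rows.Next() {
		var v string
		if err := rows.Scan(&v); err != nil {
			rows.Close()
			return err
		}
		applied[v] = struct{}{}
	}
	rows.Close()
	if err := rows.Err(); err != nil {
		return err
	}

	names := make([]string, 0, len(files))
	for n := range files {
		names = append(names, n)
	}
	for _, n := range pending(names, applied) {
		if _, err := conn.ExecContext(ctx, files[n]); err != nil {
			return fmt.Errorf("migration %s: %w", n, err)
		}
		if _, err := conn.ExecContext(ctx,
			"INSERT INTO schema_migrations (version) VALUES ($1) ON CONFLICT (version) DO NOTHING", n); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	applied := map[string]struct{}{"000001_init.up.sql": {}}
	files := []string{"000002_jobs.up.sql", "000001_init.up.sql", "000001_init.down.sql"}
	fmt.Println("lock key:", lockKey("certctl-migrations"))
	fmt.Println("pending:", pending(files, applied))
}
```

The second replica blocks in pg_advisory_lock, then reads a fully
populated ledger and finds nothing pending, which is why a racing boot
becomes a no-op rather than a DDL collision.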
D8 — Scheduler HA semantics documented.
New docs/operator/scheduler-ha.md with per-loop inventory of all 15
loops in internal/scheduler/scheduler.go. Classification:
- HA-safe (jobProcessorLoop, jobRetryLoop) — FOR UPDATE SKIP
LOCKED via ClaimPendingJobs (Bundle 1 H-6 closure, 3e78ecb).
- HA-safe-ish (jobTimeoutLoop) — atomic UPDATE-WHERE-status.
- Idempotent under N>1 replicas (renewalCheckLoop,
agentHealthCheckLoop, shortLivedExpiryCheckLoop, networkScanLoop,
healthCheckLoop, acmeGCLoop, sessionGCLoop) — duplicate ticks
produce idempotent side effects.
- Side-effect-duplicating under N>1 replicas
(notificationProcessLoop, notificationRetryLoop, digestLoop,
cloudDiscoveryLoop, crlGenerationLoop) — duplicate
webhook/email/AWS-API/CRL-signing operations. Operators
running multi-replica accept N× side effects or pin to
server.replicas: 1.
Leader-election work tracked in WORKSPACE-ROADMAP.md as v3.
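The classification above can be read as a lookup table. A minimal sketch,
assuming a hypothetical haClass type and helper names (loopClass,
duplicatingLoops, sideEffectCopies) that do not exist in scheduler.go; only
the loop names and their grouping come from this commit.

```go
package main

import (
	"fmt"
	"sort"
)

// haClass is a hypothetical label for the four groups in scheduler-ha.md.
type haClass int

const (
	haSafe      haClass = iota // rows claimed via FOR UPDATE SKIP LOCKED
	haSafeIsh                  // atomic UPDATE ... WHERE status
	idempotent                 // duplicate ticks, idempotent side effects
	duplicating                // duplicate ticks, duplicate side effects
)

// loopClass mirrors the 15-loop inventory described in the commit message.
var loopClass = map[string]haClass{
	"jobProcessorLoop":          haSafe,
	"jobRetryLoop":              haSafe,
	"jobTimeoutLoop":            haSafeIsh,
	"renewalCheckLoop":          idempotent,
	"agentHealthCheckLoop":      idempotent,
	"shortLivedExpiryCheckLoop": idempotent,
	"networkScanLoop":           idempotent,
	"healthCheckLoop":           idempotent,
	"acmeGCLoop":                idempotent,
	"sessionGCLoop":             idempotent,
	"notificationProcessLoop":   duplicating,
	"notificationRetryLoop":     duplicating,
	"digestLoop":                duplicating,
	"cloudDiscoveryLoop":        duplicating,
	"crlGenerationLoop":         duplicating,
}

// duplicatingLoops lists the loops an operator should worry about at N>1.
func duplicatingLoops() []string {
	var out []string
	for name, c := range loopClass {
		if c == duplicating {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

// sideEffectCopies reports how many times one tick's external side effect
// is expected to happen across the given number of replicas.
func sideEffectCopies(loop string, replicas int) (int, bool) {
	c, ok := loopClass[loop]
	if !ok {
		return 0, false
	}
	if c == duplicating {
		return replicas, true
	}
	return 1, true
}

func main() {
	fmt.Println("duplicate under N>1:", duplicatingLoops())
	n, _ := sideEffectCopies("digestLoop", 3)
	fmt.Println("digestLoop copies at 3 replicas:", n)
}
```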
MED-1 + MED-2 — Rate-limit scope.
New docs/operator/rate-limit-scope.md states the contract verbatim:
process-local sync.Mutex-guarded sliding-window log, effective
cluster-wide cap = configured-per-replica × server.replicas,
restart-safe (no persistent state, no shared store), bounded
(50k/100k key cap with eviction). Five call sites documented:
ocspLimiter (1m/IP), exportLimiter (1h/actor), EST per-principal
(24h/CN), EST failed-auth (1h/IP), Intune dispatcher
(24h/Subject+Issuer), plus the HTTP middleware token-bucket
(RPS+Burst per replica). Cluster-wide shared limits via Redis or
Postgres-backed bucket are tracked in WORKSPACE-ROADMAP.md as v3.
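To make the contract concrete, here is a minimal sketch of a process-local,
mutex-guarded sliding-window log with a key cap, plus the multi-replica
arithmetic. It is not the code in internal/ratelimit: the type and function
names, the injected clock, and the "evict an arbitrary key" policy are
assumptions for illustration; the real eviction policy is not stated here.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const demoWindow = time.Minute

// at builds a deterministic timestamp so the example needs no real clock.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// slidingWindowLimiter keeps a per-key log of recent hits in process memory.
// Nothing is persisted or shared, so a restart or a second replica starts empty.
type slidingWindowLimiter struct {
	mu      sync.Mutex
	limit   int
	window  time.Duration
	maxKeys int
	hits    map[string][]time.Time
}

func newLimiter(limit int, window time.Duration, maxKeys int) *slidingWindowLimiter {
	return &slidingWindowLimiter{
		limit: limit, window: window, maxKeys: maxKeys,
		hits: map[string][]time.Time{},
	}
}

// allow records a hit for key at time now if the key is under its limit.
func (l *slidingWindowLimiter) allow(key string, now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	// Drop timestamps that have aged out of the window.
	cutoff := now.Add(-l.window)
	stamps := l.hits[key]
	i := 0
	for i < len(stamps) && !stamps[i].After(cutoff) {
		i++
	}
	stamps = stamps[i:]

	if len(stamps) >= l.limit {
		l.hits[key] = stamps
		return false
	}
	// Bound memory: evict some other key before admitting a new one.
	// Assumption: the real limiter's eviction policy may differ.
	if _, known := l.hits[key]; !known && len(l.hits) >= l.maxKeys {
		for k := range l.hits {
			delete(l.hits, k)
			break
		}
	}
	l.hits[key] = append(stamps, now)
	return true
}

// keys reports how many keys are currently tracked.
func (l *slidingWindowLimiter) keys() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return len(l.hits)
}

// effectiveClusterCap is the multi-replica mental math from the doc.
func effectiveClusterCap(perReplica, replicas int) int {
	return perReplica * replicas
}

func main() {
	l := newLimiter(2, demoWindow, 10)
	fmt.Println(l.allow("10.0.0.1", at(0)), l.allow("10.0.0.1", at(1)), l.allow("10.0.0.1", at(2)))
	fmt.Println("cluster cap:", effectiveClusterCap(100, 3))
}
```

Because each replica holds its own map, a client spread across replicas by
the load balancer can consume up to the per-replica limit on every one of
them, which is where the configured-per-replica × server.replicas figure
comes from.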
T9 + LOW-7 + finding 7 — Load-test receipt scope.
The existing harness at deploy/test/loadtest/ already
self-documents the gap ("What it explicitly does NOT measure"). No
code change needed for this finding; Bundle 4 cross-references
scheduler-ha.md and rate-limit-scope.md from those gap callouts so
the four deferred coverage classes (issuer connector, scheduler
throughput, agent fleet, DB p99) land in the same place an
acquirer reads about HA semantics and rate limits.
Tests:
internal/repository/postgres/migrations_test.go (new, 4 tests):
- TestRunMigrations_PopulatesSchemaMigrations: audit table
exists and is non-empty after the first migration run.
- TestRunMigrations_SkipsAppliedOnSecondCall: the second call is
an observable no-op on row count.
- TestRunMigrations_ConcurrentCallsSerialized: two goroutines
racing the migrator both return without error; row count
unchanged; no duplicate versions.
- TestRunMigrations_FreshDatabaseHappyPath: ≥ 30 migrations
land on a fresh schema.
Gated by testcontainers via the existing repo_test.go getTestDB
pattern; skipped under -short. The integration lane runs them.
Verification:
gofmt -l # clean
go vet ./internal/repository/postgres ./cmd/server # clean
go build ./cmd/server ./internal/repository/postgres # clean
go test -short -count=1 ./internal/repository/postgres
./internal/ratelimit # PASS
Operator follow-up: full integration run on workstation:
go test -count=1 ./internal/repository/postgres -run TestRunMigrations_
Receipts (paths for the audit packet):
Migration runner evidence: internal/repository/postgres/db.go
L135-340 (advisory-lock + ledger + skip-applied loop) +
internal/repository/postgres/migrations_test.go (4 tests).
Scheduler loop inventory: docs/operator/scheduler-ha.md (15-loop
table with HA classification per loop).
Rate-limit storage matrix: docs/operator/rate-limit-scope.md.
Load-test baseline: deploy/test/loadtest/README.md (already
self-documenting), cross-linked from scheduler-ha.md.
Remaining operator warnings (deferred, tracked in WORKSPACE-ROADMAP.md):
- Leader election for the five duplicate-side-effect loops
(notificationProcessLoop, notificationRetryLoop, digestLoop,
cloudDiscoveryLoop, crlGenerationLoop). v3 work item.
- Shared rate-limits across replicas (Redis / Postgres token
bucket). v3 work item.
- Issuer-connector + scheduler-throughput + agent-fleet + DB-p99
load-test coverage. Tracked separately; per-issuer Prometheus
histograms already capture issuer round-trip latency in
production runs.
Audit-Closes: BUNDLE-4 HIGH-1 D4 D8 MED-1 MED-2 T9 LOW-7 finding-4 finding-7
certctl Documentation
Last reviewed: 2026-05-12
The full docs index, organized by audience. Pick the section that matches what you need to do; each link below opens a focused doc rather than a wall of text.
For the elevator pitch and quickstart commands, see the repo README.md at the root. For the marketing site, see certctl.io.
Getting Started
You're new to certctl, just cloned the repo, or want to understand what it does before installing.
| Doc | What it covers |
|---|---|
| Concepts | TLS certificates explained for beginners — CAs, ACME, EST, private keys, the full glossary |
| Quickstart | Five-minute setup with Docker Compose, dashboard tour, API tour |
| Examples | Five turnkey scenarios — ACME+NGINX, wildcard DNS-01, private CA+Traefik, step-ca+HAProxy, multi-issuer |
| Advanced demo | End-to-end certificate lifecycle with technical depth at each step |
| Why certctl | Positioning vs ACME clients, agent-based SaaS, enterprise platforms; when to look elsewhere |
Reference
You're operating certctl in production or building integrations and need authoritative technical detail.
| Doc | What it covers |
|---|---|
| Architecture | System design, data flow, security model, deployment topologies |
| Profiles | CertificateProfile policy object — issuer wiring, EKUs, RequiresApproval gate (with profile-edit closure) |
| API | OpenAPI 3.1 spec, integration patterns, client SDK generation |
| CLI | certctl-cli command reference and CI/CD integration patterns |
| Configuration | CERTCTL_* environment variable reference (scheduler, rate limits, deploy verify, audit, agent) |
| MCP server | Model Context Protocol integration for AI assistants |
| Release verification | Cosign / SLSA / SBOM verification procedure |
| Intermediate CA hierarchy | Multi-level CA tree management — RFC 5280 §3.2/§4.2.1.9/§4.2.1.10 enforcement |
| Auth standards implemented | RFC + CWE evidence for the API-key + RBAC + OIDC + sessions + break-glass surface (NOT a compliance-mapping doc) |
| Deployment model | Atomic write, post-deploy verify, rollback semantics across all targets |
| Vendor matrix | Tested vendor versions per target connector |
Connectors
The connector index is the canonical catalog (interfaces, registry, scanners, plus an inline reference per built-in). Per-connector deep-dive siblings cover operator-grade material — vendor edges, troubleshooting, rotation playbooks, when-to-use vs alternatives.
Issuers (13 deep-dives): ACME · ADCS · AWS ACM Private CA · DigiCert · EJBCA / Keyfactor · Entrust · GlobalSign Atlas HVCA · Google CAS · Local CA · OpenSSL / Custom CA · Sectigo SCM · step-ca / Smallstep · Vault PKI
Targets (15 deep-dives): Apache · AWS Certificate Manager · Azure Key Vault · Caddy · Envoy · F5 BIG-IP · HAProxy · IIS · Java Keystore · Kubernetes Secrets · NGINX · Postfix / Dovecot · SSH (agentless) · Traefik · Windows Certificate Store
Protocols
| Doc | What it covers |
|---|---|
| ACME server | Run certctl as an RFC 8555 + RFC 9773 ARI ACME server |
| ACME server threat model | Security posture for the ACME server endpoint |
| SCEP server | RFC 8894 native SCEP server — RA cert config, multi-profile dispatch, must-staple, mTLS sibling route |
| SCEP for Microsoft Intune | Intune-specific deployment guide — NDES replacement playbook |
| EST server | RFC 7030 EST server — 802.1X / Wi-Fi enrollment, IoT bootstrap, channel binding |
| CRL & OCSP | RFC 5280 CRL + RFC 6960 OCSP responder for relying parties |
| Async CA polling | Bounded polling for async-CA issuer connectors |
Operator
You're running certctl in production and need operational guidance.
| Doc | What it covers |
|---|---|
| Security posture | Auth, rate limits, encryption at rest, key rotation, RBAC + OIDC + sessions + break-glass, bootstrap |
| RBAC operator reference | Roles, permissions, scopes, scope-down + day-0 bootstrap |
| Auth threat model | API-key + RBAC + OIDC + sessions + break-glass — token forgery, session hijacking, IdP compromise, role-grant abuse, bootstrap-token leak, audit-mutation |
| OIDC / SSO runbooks | Per-IdP setup guides — Keycloak, Authentik, Okta, Auth0, Entra ID, Google Workspace |
| Control plane TLS | Self-signed bootstrap, operator-supplied Secret, cert-manager Certificate CR |
| Database TLS | PostgreSQL transport encryption |
| Approval workflow | Two-person integrity gate for high-stakes issuance + profile-edit closure |
| Helm deployment | Kubernetes installation via the bundled chart |
| Performance baselines | Operator-runnable benchmarks for regression spot checks |
| Auth benchmarks | Session + OIDC validation p99 targets and measured baselines |
| Scheduler HA semantics | Per-loop HA truth table for the 15 scheduler loops; what duplicates on multi-replica |
| Rate-limit scope | Process-local vs cluster-wide rate-limit behavior, restart semantics, multi-replica mental math |
| Legacy clients (TLS 1.2) | Reverse-proxy runbook for embedded EST/SCEP clients on TLS 1.2 |
Runbooks
| Runbook | When |
|---|---|
| Cloud targets | AWS ACM + Azure Key Vault deployment, debugging, rollback |
| Expiry alerts | Per-policy multi-channel routing matrix, severity tiers |
| Disaster recovery | CRL cache, OCSP responder cert, CA private-key rotation, Postgres restore |
Migration
You're moving from another cert-management tool to certctl, or running both in parallel.
| From | Doc |
|---|---|
| Certbot | migration/from-certbot.md |
| acme.sh | migration/from-acmesh.md |
| cert-manager (coexistence, not replacement) | migration/cert-manager-coexistence.md |
| Caddy ACME (point Caddy at certctl) | migration/acme-from-caddy.md |
| cert-manager ACME (point cert-manager at certctl) | migration/acme-from-cert-manager.md |
| Traefik ACME (point Traefik at certctl) | migration/acme-from-traefik.md |
| API keys → RBAC (v2.0.x → v2.1.0) | migration/api-keys-to-rbac.md — AUDIT YOUR API KEYS post-upgrade |
| Enable OIDC SSO | migration/oidc-enable.md — step-by-step OIDC onboarding for an existing API-key + RBAC deployment |
Contributor
You're contributing to certctl, running tests locally, or trying to understand the CI pipeline.
| Doc | What it covers |
|---|---|
| Testing strategy | What we test and why; per-PR fast gates vs daily deep-scan |
| Test environment | Local environment with real CAs (Pebble, step-ca, etc.) |
| QA prerequisites | Before running QA: stack boot, demo data baseline, env vars |
| QA test suite | qa_test.go reference for release QA |
| GUI QA checklist | Manual GUI verification pass for release |
| Release sign-off | Release-day checklist — code state, automated gates, manual QA, artefact verification |
| CI pipeline | CI shape, regression guards, adding new checks |
| CI guards | Per-class CI guards (code-shape, contract-parity, build/dep, operational); how to add one |
Archive
Historical docs preserved for reference. Most operators don't need these.
| Doc | Why archived |
|---|---|
| Upgrade to TLS (v2.2) | Pre-v2.2 HTTPS-everywhere upgrade procedure |
| Upgrade past v2 JWT removal | G-1 milestone JWT auth removal procedure |
Reading order by role
First-time operator: Concepts → Quickstart → Examples. About 90 minutes end to end.
Production operator: Architecture → Security posture → Control plane TLS → Disaster recovery runbook. About 4 hours end to end.
PKI engineer: ACME server → SCEP server → EST server → Intermediate CA hierarchy. About 6 hours end to end.
Contributor: Architecture → Testing strategy → Test environment → CI pipeline. About 3 hours end to end.