Phase 13 Sprint 13.3 — the completion half of the ARCH-M1
substantive close. Sprint 13.2 shipped the Postgres-backed
sliding-window limiter + multi-replica integration test; Sprint 13.3
wires the 6 call sites in cmd/server/main.go through the operator-
chosen backend selector, adds the rate_limit_buckets scheduler
janitor sweep, rewrites the observability doc, exposes the env-var
in the helm chart, and promotes the multi-replica integration test
to a required CI status check.
Signature ground-truth (sprint 13.2 + 13.3)
===========================================
Prompt-template signatures: `Allow(key string) error` and "5 call
sites." Actual repo: `Allow(key string, now time.Time) error` and 6
NewSlidingWindowLimiter call sites in cmd/server/main.go (the prompt
miscounted the second EST per-principal arm). Per CLAUDE.md "the repo
is truth," matched the live shape.
What changed
============
internal/config/server.go (+40 LOC):
- Added `SlidingWindowBackend string` + `SlidingWindowJanitorInterval
time.Duration` to RateLimitConfig with full operator-facing
documentation of the two valid values (memory|postgres) +
when-to-use-which decision tree.
internal/config/config.go (+27 LOC):
- Load() reads CERTCTL_RATE_LIMIT_BACKEND (default "memory") +
CERTCTL_RATE_LIMIT_JANITOR_INTERVAL (default 5m).
- Validate() rejects anything other than ""/"memory"/"postgres"
(empty = memory equivalence for test-built Configs that bypass
Load()). Janitor interval must be ≥ 1 minute when set.
- Failure modes return clear ::error:: with the env-var name + the
valid values, so an operator typo ("postgress" → memory in a
3-replica cluster) fails fast at startup.
internal/ratelimit/factory.go (NEW, 67 LOC):
- NewLimiter(backend, db, maxN, window, mapCap) Limiter — single
factory the 6 cmd/server/main.go call sites route through.
- Drop-in signature: same maxN/window/mapCap as
NewSlidingWindowLimiter (mapCap accepted + ignored for postgres
— the rate_limit_buckets table grows until the janitor sweeps).
- Defensive panic on unknown backend (config.Validate is SoT;
this is belt-and-suspenders).
internal/ratelimit/postgres_gc.go (NEW, 73 LOC):
- PostgresGC struct + NewPostgresGC + GarbageCollect.
- Single-statement DELETE FROM rate_limit_buckets WHERE
updated_at < NOW() - maxWindow. Idempotent.
- maxWindow <= 0 is a no-op (operator opt-out).
internal/scheduler/scheduler.go (+90 LOC):
- New RateLimitGarbageCollector interface (mirrors the
ACMEGarbageCollector / SessionGarbageCollector contracts).
- rateLimitGC field + rateLimitGCInterval + rateLimitGCRunning
on Scheduler.
- SetRateLimitGarbageCollector(gc) + SetRateLimitGCInterval(d)
Setters following the existing acmeGC/sessionGC pattern.
- rateLimitGCLoop() — JitteredTicker + atomic.Bool guard +
per-tick context.WithTimeout(1m). Logs row count at Debug.
- Loop counted in the Start() WaitGroup only when the GC is
non-nil; cmd/server/main.go skips SetRateLimitGarbageCollector
when backend=memory so the loop never launches for that case.
cmd/server/main.go (35 LOC diff):
- All 6 ratelimit.NewSlidingWindowLimiter call sites now route
through ratelimit.NewLimiter(cfg.RateLimit.SlidingWindowBackend,
db, ...). Grep verification post-fix returns ZERO hits.
- Six sites: breakglass loginLimiter (580), ocspLimiter (1003),
exportLimiter (1068), EST failed-basic (1535), EST per-principal
SCEP-mTLS arm (1591), EST per-principal SCEP arm (1613). The
intune.NewPerDeviceRateLimiter site at line 1823 stays unmoved
— its inner type-alias wrapper is the prompt's
out-of-scope (cmd/server/*.go only).
- Conditionally constructs PostgresGC + wires the scheduler janitor
when backend=postgres; logs the wiring decision either way so
operators see "rate-limit GC sweep enabled (postgres backend)"
or "in-memory backend self-prunes" in the boot log.
internal/api/handler/{est,export,certificates,auth_breakglass}.go:
- Replaced 5 *ratelimit.SlidingWindowLimiter field/Setter types
with ratelimit.Limiter (the interface). Allow() satisfies the
same call shape on both backends; the in-memory tests that
construct *SlidingWindowLimiter still compile because the
concrete type satisfies the interface (compile-time check in
internal/ratelimit/limiter.go pins this).
docs/operator/observability.md (176 LOC diff):
- Replaced the "per-process, in-memory, reset-on-restart, not
shared across replicas" paragraph with the new
configurable-backend section: operator decision tree,
backend internals (memory vs postgres), janitor description,
falsifiable closure proof (the Sprint 13.2 integration test
name + invocation), helm chart wiring example.
- Updated inventory to reflect the actual handler file paths +
actual cap configurations (the prior doc said "60s window" for
several limiters that actually use 60m / 24h windows).
- Doc smoke confirmed: grep -c 'per-process, in-memory,
reset-on-restart' docs/operator/observability.md = 0.
deploy/helm/certctl/values.yaml + templates/server-configmap.yaml +
templates/server-deployment.yaml:
- Exposed server.rateLimiting.backend (default "memory") +
server.rateLimiting.janitorInterval (default "5m") under the
existing rateLimiting block.
- ConfigMap renders both as rate-limit-backend +
rate-limit-janitor-interval keys.
- Deployment wires CERTCTL_RATE_LIMIT_BACKEND +
CERTCTL_RATE_LIMIT_JANITOR_INTERVAL env vars from the configmap.
- Helm render: `helm template deploy/helm/certctl --set
server.rateLimiting.backend=postgres` shows the env-var on the
server-deployment.yaml output.
.github/workflows/ci.yml (+12 LOC):
- Added a new step in the Go Build & Test job that runs the
Sprint 13.2 multi-replica integration test
(TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas) with
-tags=integration -race -timeout=300s. Fails the CI status check
if the cross-replica row lock ever stops arbitrating across
replicas — the ARCH-M1 closure regression gate.
Verification (all green locally; postgres integration via CI)
============================================================
$ grep -nE 'NewSlidingWindowLimiter' cmd/server/*.go
(zero hits — Sprint 13.3 receipt)
$ go test -short -count=1 \
./internal/config/... ./internal/ratelimit/... \
./internal/scheduler/... ./internal/api/handler/... \
./cmd/server/...
ok internal/config 1.177s
ok internal/ratelimit 0.007s
ok internal/scheduler 9.165s
ok internal/api/handler 6.245s
ok cmd/server 0.390s
$ staticcheck ./internal/ratelimit/... ./internal/scheduler/... \
./internal/config/... ./internal/api/handler/... ./cmd/server/...
(clean)
$ gofmt -l internal/ cmd/server/
(clean)
$ grep -c 'per-process, in-memory, reset-on-restart' \
docs/operator/observability.md
0 (doc smoke — the audit's verbatim phrasing is gone)
$ bash scripts/ci-guards/G-3-env-docs-drift.sh
G-3 env-docs-drift: clean.
$ bash scripts/ci-guards/complete-path-config-coverage.sh
OK — every CERTCTL_* env var (197) has at least one non-config-
package consumer.
Selector contract verified — config.Validate() rejects any value
other than ""/memory/postgres at startup with a clear error message.
Sprint 13.4 next (ARCH-H1 OpenAPI authoring batch 1) is on a
different axis; ARCH-M1 closure is complete with this commit
modulo the Sprint 13.7 audit-HTML flip + zero-floor pin.
Closes: ARCH-M1 substantive remediation. The cross-replica rate-
limit-cap-enforcement gap that the audit recommended deferring to
v3 is closed; operators with server.replicas > 1 flip
CERTCTL_RATE_LIMIT_BACKEND=postgres and get exactly-cap enforcement
across the cluster (proved by the multi-replica integration test now
gating CI).
certctl Documentation
Last reviewed: 2026-05-12
The full docs index, organized by audience. Pick the section that matches what you need to do; each link below opens a focused doc rather than a wall of text.
For the elevator pitch and quickstart commands, see the repo README.md at the root. For the marketing site, see certctl.io.
Getting Started
You're new to certctl, just cloned the repo, or want to understand what it does before installing.
| Doc | What it covers |
|---|---|
| Concepts | TLS certificates explained for beginners — CAs, ACME, EST, private keys, the full glossary |
| Quickstart | Five-minute setup with Docker Compose, dashboard tour, API tour |
| Examples | Five turnkey scenarios — ACME+NGINX, wildcard DNS-01, private CA+Traefik, step-ca+HAProxy, multi-issuer |
| Advanced demo | End-to-end certificate lifecycle with technical depth at each step |
| Why certctl | Positioning vs ACME clients, agent-based SaaS, enterprise platforms; when to look elsewhere |
Reference
You're operating certctl in production or building integrations and need authoritative technical detail.
| Doc | What it covers |
|---|---|
| Architecture | System design, data flow, security model, deployment topologies |
| Profiles | CertificateProfile policy object — issuer wiring, EKUs, RequiresApproval gate (with profile-edit closure) |
| API | OpenAPI 3.1 spec, integration patterns, client SDK generation |
| CLI | certctl-cli command reference and CI/CD integration patterns |
| Configuration | CERTCTL_* environment variable reference (scheduler, rate limits, deploy verify, audit, agent) |
| MCP server | Model Context Protocol integration for AI assistants |
| Release verification | Cosign / SLSA / SBOM verification procedure |
| Intermediate CA hierarchy | Multi-level CA tree management — RFC 5280 §3.2/§4.2.1.9/§4.2.1.10 enforcement |
| Auth standards implemented | RFC + CWE evidence for the API-key + RBAC + OIDC + sessions + break-glass surface (NOT a compliance-mapping doc) |
| Deployment model | Atomic write, post-deploy verify, rollback semantics across all targets |
| Vendor matrix | Tested vendor versions per target connector |
Connectors
The connector index is the canonical catalog (interfaces, registry, scanners, plus an inline reference per built-in). Per-connector deep-dive siblings cover operator-grade material — vendor edges, troubleshooting, rotation playbooks, when-to-use vs alternatives.
Issuers (13 deep-dives): ACME · ADCS · AWS ACM Private CA · DigiCert · EJBCA / Keyfactor · Entrust · GlobalSign Atlas HVCA · Google CAS · Local CA · OpenSSL / Custom CA · Sectigo SCM · step-ca / Smallstep · Vault PKI
Targets (15 deep-dives): Apache · AWS Certificate Manager · Azure Key Vault · Caddy · Envoy · F5 BIG-IP · HAProxy · IIS · Java Keystore · Kubernetes Secrets · NGINX · Postfix / Dovecot · SSH (agentless) · Traefik · Windows Certificate Store
Protocols
| Doc | What it covers |
|---|---|
| ACME server | Run certctl as an RFC 8555 + RFC 9773 ARI ACME server |
| ACME server threat model | Security posture for the ACME server endpoint |
| SCEP server | RFC 8894 native SCEP server — RA cert config, multi-profile dispatch, must-staple, mTLS sibling route |
| SCEP for Microsoft Intune | Intune-specific deployment guide — NDES replacement playbook |
| EST server | RFC 7030 EST server — 802.1X / Wi-Fi enrollment, IoT bootstrap, channel binding |
| CRL & OCSP | RFC 5280 CRL + RFC 6960 OCSP responder for relying parties |
| Async CA polling | Bounded polling for async-CA issuer connectors |
Operator
You're running certctl in production and need operational guidance.
| Doc | What it covers |
|---|---|
| Security posture | Auth, rate limits, encryption at rest, key rotation, RBAC + OIDC + sessions + break-glass, bootstrap |
| Secret custody | Where private keys live; FileDriver vs HSM/KMS; encryption wire format; env-seeded vs DB-seeded plaintext policy |
| Observability | Metrics surface, Prometheus exposition vs client_golang, tracing scope, log structure, rate-limit semantics across restarts/replicas |
| RBAC operator reference | Roles, permissions, scopes, scope-down + day-0 bootstrap |
| Auth threat model | API-key + RBAC + OIDC + sessions + break-glass — token forgery, session hijacking, IdP compromise, role-grant abuse, bootstrap-token leak, audit-mutation |
| OIDC / SSO runbooks | Per-IdP setup guides — Keycloak, Authentik, Okta, Auth0, Entra ID, Google Workspace |
| Control plane TLS | Self-signed bootstrap, operator-supplied Secret, cert-manager Certificate CR |
| Database TLS | PostgreSQL transport encryption |
| Approval workflow | Two-person integrity gate for high-stakes issuance + profile-edit closure |
| Helm deployment | Kubernetes installation via the bundled chart |
| Performance baselines | Operator-runnable benchmarks for regression spot checks |
| Auth benchmarks | Session + OIDC validation p99 targets and measured baselines |
| Legacy clients (TLS 1.2) | Reverse-proxy runbook for embedded EST/SCEP clients on TLS 1.2 |
Runbooks
| Runbook | When |
|---|---|
| Cloud targets | AWS ACM + Azure Key Vault deployment, debugging, rollback |
| Expiry alerts | Per-policy multi-channel routing matrix, severity tiers |
| Disaster recovery | CRL cache, OCSP responder cert, CA private-key rotation, Postgres restore |
| Config-encryption upgrade | Force v1/v2 → v3 re-seal across the database; passphrase rotation procedure |
| PostgreSQL backup | Operator-run backup recipe (docker-compose + Kubernetes); recommended cadence; quarterly DR dry-run |
Migration
You're moving from another cert-management tool to certctl, or running both in parallel.
| From | Doc |
|---|---|
| Certbot | migration/from-certbot.md |
| acme.sh | migration/from-acmesh.md |
| cert-manager (coexistence, not replacement) | migration/cert-manager-coexistence.md |
| Caddy ACME (point Caddy at certctl) | migration/acme-from-caddy.md |
| cert-manager ACME (point cert-manager at certctl) | migration/acme-from-cert-manager.md |
| Traefik ACME (point Traefik at certctl) | migration/acme-from-traefik.md |
| API keys → RBAC (v2.0.x → v2.1.0) | migration/api-keys-to-rbac.md — AUDIT YOUR API KEYS post-upgrade |
| Enable OIDC SSO | migration/oidc-enable.md — step-by-step OIDC onboarding for an existing API-key + RBAC deployment |
Contributor
You're contributing to certctl, running tests locally, or trying to understand the CI pipeline.
| Doc | What it covers |
|---|---|
| Testing strategy | What we test and why; per-PR fast gates vs daily deep-scan |
| Test environment | Local environment with real CAs (Pebble, step-ca, etc.) |
| QA prerequisites | Before running QA: stack boot, demo data baseline, env vars |
| QA test suite | qa_test.go reference for release QA |
| GUI QA checklist | Manual GUI verification pass for release |
| Release sign-off | Release-day checklist — code state, automated gates, manual QA, artefact verification |
| CI pipeline | CI shape, regression guards, adding new checks |
| CI guards | Per-class CI guards (code-shape, contract-parity, build/dep, operational); how to add one |
Archive
Historical docs preserved for reference. Most operators don't need these.
| Doc | Why archived |
|---|---|
| Upgrade to TLS (v2.2) | Pre-v2.2 HTTPS-everywhere upgrade procedure |
| Upgrade past v2 JWT removal | G-1 milestone JWT auth removal procedure |
Reading order by role
First-time operator: Concepts → Quickstart → Examples. About 90 minutes end to end.
Production operator: Architecture → Security posture → Control plane TLS → Disaster recovery runbook. About 4 hours end to end.
PKI engineer: ACME server → SCEP server → EST server → Intermediate CA hierarchy. About 6 hours end to end.
Contributor: Architecture → Testing strategy → Test environment → CI pipeline. About 3 hours end to end.