certctl/docs/acme-server.md
shankar0123 bee47f0318 acme-server: cert-manager integration test + production hardening (Phase 5/7)
Closes the production-readiness loop on the ACME surface. After this
commit, certctl ships per-account rate limits + a GC sweeper for
expired ACME state + a kind-driven cert-manager 1.15 integration test
+ a lego-driven RFC conformance harness + a k6 loadtest scenario for
the unauthenticated ACME path.

Architecture:
  - Rate limits live in-memory + per-replica. Restart wipes the
    counters; orders/hour caps are eventual-consistency anyway. A
    3-replica certctl-server fleet behind an LB effectively has 3x
    the configured throughput per account; persistent rate limiting
    is a follow-up if production telemetry shows abuse patterns we
    can't catch in a single restart cycle. Per-key + per-action
    isolation: ActionNewOrder/acc-1, ActionKeyChange/acc-1, and
    ActionChallengeRespond/<challenge-id> are independent buckets.
  - GC loop follows the existing scheduler-loop pattern (atomic.Bool
    + sync.WaitGroup; see crlGenerationLoop for shape). Three
    independent SQL sweeps per tick (DELETE expired nonces; UPDATE
    pending authzs whose expires_at < now() to expired; UPDATE
    pending/ready/processing orders whose expires_at < now() to
    invalid). Each sweep is a single statement; failures are logged-
    and-continued so a failing nonces sweep doesn't block authzs.
    Per-sweep 1m timeout bounds a stuck Postgres.
  - cert-manager integration test is gated on KIND_AVAILABLE so CI
    skips it cleanly (kind is too heavy for per-PR). Operators run
    locally via 'make acme-cert-manager-test'; the harness brings up
    a fresh cluster each run + tears it down on Cleanup.
  - lego conformance harness drives a real ACME client through
    register → run → cert-PEM-landed against a hermetic certctl
    stack. Catches RFC-shape regressions third-party clients would
    hit before they ship.
  - k6 ACME-flow scenario hammers the unauthenticated surface
    (directory + new-nonce + ARI synthetic-id) at 100 VUs × 5m. JWS-
    signed flows are out of scope for k6 (no JWS support); they're
    covered by the lego harness above.

What ships:
  - internal/api/acme/ratelimit.go (+ ratelimit_test.go: 7 cases —
    disable-when-perHour-zero, capacity, per-key isolation, per-
    action isolation, refill-over-time, RetryAfter, concurrent-access
    with -race + 200 goroutines × 200 calls).
  - internal/repository/postgres/acme.go: 4 new methods —
    CountActiveOrdersByAccount + GCExpiredNonces + GCExpireAuthorizations
    + GCInvalidateExpiredOrders. Each a single SQL statement.
  - internal/service/acme.go: SetRateLimiter + GarbageCollect +
    rate-limit gates at 3 entry points (CreateOrder + RotateAccountKey
    + RespondToChallenge) + concurrent-orders gate at CreateOrder.
    2 new sentinels (ErrACMERateLimited, ErrACMEConcurrentOrdersExceeded);
    5 new GC metrics (gc_runs / gc_run_failures / gc_nonces_reaped /
    gc_authzs_expired / gc_orders_invalidated).
  - internal/scheduler/scheduler.go: ACMEGarbageCollector interface +
    acmeGCRunning atomic.Bool + acmeGCInterval + 2 setters (SetACME-
    GarbageCollector + SetACMEGCInterval) + acmeGCLoop following the
    crlGenerationLoop shape.
  - internal/api/handler/acme.go: writeServiceError gains rateLimited
    (429 + RFC 8555 §6.7) + concurrent-orders-exceeded mappings.
  - internal/config/config.go: 5 new env vars
    (CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR=100,
    CERTCTL_ACME_SERVER_RATE_LIMIT_CONCURRENT_ORDERS=5,
    CERTCTL_ACME_SERVER_RATE_LIMIT_KEY_CHANGE_PER_HOUR=5,
    CERTCTL_ACME_SERVER_RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR=60,
    CERTCTL_ACME_SERVER_GC_INTERVAL=1m).
  - cmd/server/main.go: NewRateLimiter() + SetRateLimiter() at
    startup; conditional SetACMEGarbageCollector(acmeService) +
    SetACMEGCInterval(cfg.ACMEServer.GCInterval) when Enabled+
    GCInterval > 0.
  - deploy/test/acme-integration/: kind-config.yaml + cert-manager-
    install.sh + clusterissuer-trust-authenticated.yaml +
    clusterissuer-challenge.yaml + certificate-test.yaml + conformance-
    lego.sh + certmanager_test.go (//go:build integration + KIND_AVAILABLE
    gate).
  - deploy/test/loadtest/k6/acme_flow.js + README ACME-flows section.
  - Makefile: 2 new PHONY targets (acme-cert-manager-test +
    acme-rfc-conformance-test).
  - docs/acme-server.md: status flipped to Phase 5; Configuration
    table grows 5 rows; new 'Phase 5 — operational guidance' section
    explaining rate-limit math + GC sweeper semantics + cert-manager
    integration + lego conformance + k6 baseline.

Tests:
  - 'go vet ./...' clean across the repo.
  - 'go test -short -count=1 ./internal/...' green across every
    affected package (service / acme / handler / scheduler / repo /
    config).
  - 'go vet -tags=integration ./deploy/test/acme-integration/' clean
    (the integration test compiles cleanly with the build tag).
  - The kind/cert-manager harness is gated behind KIND_AVAILABLE so
    CI skips by default; operators run locally via 'make acme-cert-
    manager-test'.

Engineering history: cowork/WORKSPACE-CHANGELOG.md 'ACME-Server-5'.
2026-05-03 19:42:03 +00:00

# certctl ACME Server (Built-in)
certctl ships an RFC 8555 + RFC 9773 ARI ACME server endpoint at
`/acme/profile/<profile-id>/*`. Any RFC 8555 client (cert-manager 1.15+,
Caddy, Traefik, win-acme, certbot, Posh-ACME) can integrate with certctl
as an ACME issuer with no certctl-side modification. That removes the
"deploy a certctl agent on every K8s node" friction that loses deals to
external PKI vendors today.
> **Phase status (2026-05-03):** Phase 5 — production hardening +
> cert-manager integration test. Per-account rate limits applied at
> 3 entry points (orders/hour, key-change/hour, challenge-respond/hour)
> + a per-account concurrent-orders cap; a 1-minute scheduler loop
> sweeps expired nonces / authzs / orders. A kind-driven cert-manager
> integration test (gated by `KIND_AVAILABLE`) verifies the full
> happy-path against a real cert-manager 1.15+ deployment. RFC
> conformance is verified via lego against the same stack. Track
> shipped phases via `git log --grep='acme-server:'`.
## Configuration
All ACME-server config uses the `CERTCTL_ACME_SERVER_*` env-var prefix
(distinct from `CERTCTL_ACME_*` which configures the consumer-side
issuer connector). The struct definition lives in
`internal/config/config.go::ACMEServerConfig`.
| Env var | Default | Phase | Description |
|--------------------------------------------------|------------------------|-------|-------------|
| `CERTCTL_ACME_SERVER_ENABLED` | `false` | 1a | Master enable flag. Phase 1a's handler is constructed unconditionally so the registry shape stays stable; routes are registered in `internal/api/router/router.go::RegisterHandlers` regardless. Operators flip this on after configuring per-profile auth_mode. |
| `CERTCTL_ACME_SERVER_DEFAULT_AUTH_MODE` | `trust_authenticated` | 1a | Default value for `certificate_profiles.acme_auth_mode` on newly-created profiles. Existing profiles retain their stored value. Per-profile column is the source of truth at request time. |
| `CERTCTL_ACME_SERVER_DEFAULT_PROFILE_ID` | `""` | 1a | When set, `/acme/*` shorthand mirrors `/acme/profile/<DefaultProfileID>/*` for single-profile deployments. When empty, requests to the shorthand return RFC 7807 + RFC 8555 §6.7 `userActionRequired`. |
| `CERTCTL_ACME_SERVER_NONCE_TTL` | `5m` | 1a | How long an issued ACME nonce remains valid before the JWS verifier (Phase 1b) returns `urn:ietf:params:acme:error:badNonce` per RFC 8555 §6.5.1. Tune up if cert-manager + certctl clocks frequently skew. |
| `CERTCTL_ACME_SERVER_TOS_URL` | `""` | 1a | Optional `meta.termsOfService` URL in the directory document. |
| `CERTCTL_ACME_SERVER_WEBSITE` | `""` | 1a | Optional `meta.website` URL in the directory document. |
| `CERTCTL_ACME_SERVER_CAA_IDENTITIES` | (empty) | 1a | Comma-separated `meta.caaIdentities` list. |
| `CERTCTL_ACME_SERVER_EAB_REQUIRED` | `false` | 1a | `meta.externalAccountRequired` advertisement. EAB enforcement is a follow-up; Phase 1a only advertises. |
| `CERTCTL_ACME_SERVER_ORDER_TTL` | `24h` | 2 | Reserved field, parsed in Phase 1a so operators can set it ahead of Phase 2's order endpoints. |
| `CERTCTL_ACME_SERVER_AUTHZ_TTL` | `24h` | 2 | Reserved. |
| `CERTCTL_ACME_SERVER_HTTP01_CONCURRENCY` | `10` | 3 | Reserved. |
| `CERTCTL_ACME_SERVER_DNS01_RESOLVER` | `8.8.8.8:53` | 3 | Reserved. |
| `CERTCTL_ACME_SERVER_DNS01_CONCURRENCY` | `10` | 3 | Reserved. |
| `CERTCTL_ACME_SERVER_TLSALPN01_CONCURRENCY` | `10` | 3 | Reserved. |
| `CERTCTL_ACME_SERVER_ARI_ENABLED` | `true` | 4 | Toggles the RFC 9773 ARI surface — both the `renewalInfo` URL in the directory document and the GET `/renewal-info/<cert-id>` handler. Set to `false` to drop ARI from the directory; ACME clients fall back to static renewal scheduling. |
| `CERTCTL_ACME_SERVER_ARI_POLL_INTERVAL` | `6h` | 4 | Server-policy `Retry-After` value the ARI handler emits on a 200 response. RFC 9773 §4.2 leaves this server-policy. Tighten to `1h` for short-lived certs; loosen to `24h` for standard 90-day certs. |
| `CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR` | `100` | 5 | Per-account orders/hour cap. `0` disables. Hits return RFC 7807 + RFC 8555 §6.7 `urn:ietf:params:acme:error:rateLimited` with `Retry-After`. In-memory token-bucket; restart wipes the counter (eventual-consistency caps are acceptable). |
| `CERTCTL_ACME_SERVER_RATE_LIMIT_CONCURRENT_ORDERS` | `5` | 5 | Per-account cap on simultaneously-active orders (status in pending/ready/processing). `0` disables. Same RFC 7807 + RFC 8555 §6.7 problem shape as the per-hour cap. |
| `CERTCTL_ACME_SERVER_RATE_LIMIT_KEY_CHANGE_PER_HOUR` | `5` | 5 | Per-account key-rollover cap. `0` disables. Default 5/hour: rollovers should be rare; a flood is an attack signal. |
| `CERTCTL_ACME_SERVER_RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR` | `60` | 5 | Per-challenge-id respond cap. `0` disables. Defends against retry storms from a misbehaving client. Keyed by challenge-id (not account-id) so a flood against one challenge doesn't drain the account's whole budget. |
| `CERTCTL_ACME_SERVER_GC_INTERVAL` | `1m` | 5 | Tick interval for the ACME GC scheduler loop. On each tick: (1) DELETE used / expired nonces; (2) UPDATE pending authzs whose `expires_at < NOW()` to `expired`; (3) UPDATE pending/ready/processing orders whose `expires_at < NOW()` to `invalid`. Each sweep is a single SQL statement; the loop is idempotent + bounded by a 1m per-sweep timeout. `0` disables the loop. |
## Per-profile auth mode
Two modes per `certificate_profiles.acme_auth_mode`:
- **`trust_authenticated`** (default for internal PKI). The
  JWS-authenticated ACME account is trusted to issue certs for any
  identifier the profile policy allows; there is no per-identifier
  ownership proof. The most common certctl use case.
- **`challenge`**. Full HTTP-01 + DNS-01 + TLS-ALPN-01 validation per
  RFC 8555 §8. Required when certctl is exposing public-trust-style PKI.
A single certctl-server can serve both modes simultaneously — the mode
is read from the bound profile's column at request time, not cached at
server start. Operators can flip a profile's mode via SQL and the next
order picks up the new mode without restart.
The `CERTCTL_ACME_SERVER_DEFAULT_AUTH_MODE` env var sets the default
value for newly-created profiles (e.g. via the certctl API). Existing
profile rows retain whatever value they were created with.
## TLS trust bootstrap (read this before configuring cert-manager)
When certctl-server uses a self-signed TLS bootstrap cert
(`deploy/test/certs/server.crt` is the demo default; see
[`docs/tls.md`](./tls.md)), cert-manager 1.15+ will refuse to talk to
the directory URL unless the certctl root is trusted. The fix lives in
`ClusterIssuer.spec.acme.caBundle`:
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: certctl-test
spec:
  acme:
    server: https://certctl.example.com:8443/acme/profile/prof-corp/directory
    email: ops@example.com
    # base64-encoded PEM of certctl's self-signed root
    caBundle: |
      LS0tLS1CRUdJTi...
    privateKeySecretRef:
      name: certctl-test-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
```
The `caBundle` value is the base64-encoded PEM of the root that signed
your certctl-server's TLS certificate. Extract it from your operator
bootstrap (e.g. `cat deploy/test/certs/ca.crt | base64 -w0`).
This is the single biggest first-time-deploy footgun on the cert-manager
integration path. The full cert-manager walkthrough lands in Phase 6;
the `caBundle` requirement is flagged here in Phase 1a's docs because
operators hit it the moment they try to point a real ACME client at
certctl.
## Endpoints
Routes registered in `internal/api/router/router.go::RegisterHandlers`:
| Method | Path | RFC ref | Auth | Description |
|--------|-------------------------------------------------------|-----------------|----------|-------------|
| GET | `/acme/profile/{id}/directory` | RFC 8555 §7.1.1 | unauth | Per-profile directory document. |
| HEAD | `/acme/profile/{id}/new-nonce` | RFC 8555 §7.2 | unauth | Returns 200 + Replay-Nonce header. |
| GET | `/acme/profile/{id}/new-nonce` | RFC 8555 §7.2 | unauth | Returns 204 + Replay-Nonce header. |
| POST | `/acme/profile/{id}/new-account` | RFC 8555 §7.3 | JWS jwk | Register a new account; idempotent re-registration of an existing JWK returns the existing row. |
| POST | `/acme/profile/{id}/account/{acc_id}` | RFC 8555 §7.3.2 + §7.3.6 | JWS kid | Update contact list, deactivate, or POST-as-GET (RFC 8555 §6.3) to fetch the account. |
| POST | `/acme/profile/{id}/new-order` | RFC 8555 §7.4 | JWS kid | Submit an order; identifier validation runs before order creation. |
| POST | `/acme/profile/{id}/order/{ord_id}` | RFC 8555 §7.4 | JWS kid | POST-as-GET fetch of an order's current state. |
| POST | `/acme/profile/{id}/order/{ord_id}/finalize` | RFC 8555 §7.4 | JWS kid | Submit the CSR + finalize. Issues + persists managed cert row + version. |
| POST | `/acme/profile/{id}/authz/{authz_id}` | RFC 8555 §7.5 | JWS kid | POST-as-GET fetch of an authorization. |
| POST | `/acme/profile/{id}/challenge/{chall_id}` | RFC 8555 §7.5.1 | JWS kid | Submit a challenge for validation. Dispatches to a bounded-concurrency worker pool; clients poll authz for the eventual result. |
| POST | `/acme/profile/{id}/cert/{cert_id}` | RFC 8555 §7.4.2 | JWS kid | POST-as-GET cert chain download (PEM). |
| POST | `/acme/profile/{id}/key-change` | RFC 8555 §7.3.5 | JWS kid (outer) + jwk (inner) | Doubly-signed account-key rollover. |
| POST | `/acme/profile/{id}/revoke-cert` | RFC 8555 §7.6 | JWS kid OR jwk | Revoke a cert via the issuing account's key OR the cert's own private key. Routes through the certctl revocation pipeline. |
| GET | `/acme/profile/{id}/renewal-info/{cert_id}` | RFC 9773 | unauth | Fetch the suggested renewal window for a cert (cert-id is `base64url(AKI).base64url(serial)` per RFC 9773 §4.1). Response carries `Retry-After`. |
| GET | `/acme/directory` | RFC 8555 §7.1.1 | unauth | Shorthand path; mirrors per-profile when `CERTCTL_ACME_SERVER_DEFAULT_PROFILE_ID` is set. |
| HEAD | `/acme/new-nonce` | RFC 8555 §7.2 | unauth | Shorthand. |
| GET | `/acme/new-nonce` | RFC 8555 §7.2 | unauth | Shorthand. |
| POST | `/acme/new-account` | RFC 8555 §7.3 | JWS jwk | Shorthand. |
| POST | `/acme/account/{acc_id}` | RFC 8555 §7.3.2 + §7.3.6 | JWS kid | Shorthand. |
| POST | `/acme/new-order` | RFC 8555 §7.4 | JWS kid | Shorthand. |
| POST | `/acme/order/{ord_id}` | RFC 8555 §7.4 | JWS kid | Shorthand. |
| POST | `/acme/order/{ord_id}/finalize` | RFC 8555 §7.4 | JWS kid | Shorthand. |
| POST | `/acme/authz/{authz_id}` | RFC 8555 §7.5 | JWS kid | Shorthand. |
| POST | `/acme/cert/{cert_id}` | RFC 8555 §7.4.2 | JWS kid | Shorthand. |
| POST | `/acme/key-change` | RFC 8555 §7.3.5 | JWS kid (outer) + jwk (inner) | Shorthand. |
| POST | `/acme/revoke-cert` | RFC 8555 §7.6 | JWS kid OR jwk | Shorthand. |
| GET | `/acme/renewal-info/{cert_id}` | RFC 9773 | unauth | Shorthand. |
After Phase 4, the full RFC 8555 + RFC 9773 surface is live. RFC 8739
(short-lived certs) and EAB enforcement remain follow-up work;
cert-manager + boulder-tested clients work today against the surface above.
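
The HEAD/GET split on `new-nonce` in the table above can be illustrated with a minimal sketch. `newNonceHandler` and `statusFor` are hypothetical names, not certctl's handler; the sketch only generates a random nonce in memory, whereas certctl persists nonces in `acme_nonces`.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newNonce returns 16 random bytes, base64url-encoded without padding.
func newNonce() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(b), nil
}

// newNonceHandler (hypothetical) answers HEAD with 200 and GET with 204,
// both carrying Replay-Nonce, as RFC 8555 §7.2 describes.
func newNonceHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodHead && r.Method != http.MethodGet {
		w.Header().Set("Allow", "GET, HEAD")
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	nonce, err := newNonce()
	if err != nil {
		http.Error(w, "nonce generation failed", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Replay-Nonce", nonce)
	w.Header().Set("Cache-Control", "no-store")
	if r.Method == http.MethodGet {
		w.WriteHeader(http.StatusNoContent)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// statusFor issues one in-memory request and reports the status and nonce.
func statusFor(method string) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(method, "/acme/profile/prof-corp/new-nonce", nil)
	newNonceHandler(rec, req)
	return rec.Code, rec.Header().Get("Replay-Nonce")
}

func main() {
	for _, m := range []string{http.MethodHead, http.MethodGet} {
		code, nonce := statusFor(m)
		fmt.Printf("%s -> %d (nonce length %d)\n", m, code, len(nonce))
	}
}
```
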
## Finalize routing through `CertificateService.Create` (Phase 2 architecture)
The finalize path mirrors how every other certctl issuance surface
(EST, SCEP, agent, REST API) routes through the canonical pipeline:
1. JWS-verify the request (`internal/api/acme/jws.go`).
2. Validate the CSR's DNS-name set equals the order's identifier set
exactly (case-folded). Mismatches return RFC 8555
`urn:ietf:params:acme:error:badCSR`.
3. Update the order row to `status=processing` (`s.tx.WithinTx` +
`auditService.RecordEventWithTx` — atomic with audit row).
4. Issue the cert via the bound profile's `IssuerConnector` adapter
(same `IssueCertificate(ctx, commonName, sans, csrPEM, ekus,
maxTTLSeconds, mustStaple)` call EST/SCEP/agent take).
5. Insert the `managed_certificates` row via
`service.CertificateService.Create(ctx, *ManagedCertificate, actor)`.
Source is stamped `domain.CertificateSourceACME` so operators can
bulk-revoke ACME-issued certs by filtering on `Source=ACME`.
6. Insert the `certificate_versions` row +
transition the order to `status=valid` with `certificate_id` set
(one final `WithinTx` covering both writes + the audit row).
This means RenewalPolicy, CertificateProfile, per-issuer-type
Prometheus metrics, audit rows, and revocation-pipeline integration
all apply uniformly to ACME-issued certs via the same code path that
already serves EST/SCEP/agent/REST issuance.
The atomicity boundary: there is a brief window between step 5 (cert
exists) and step 6 (order shows valid) where the order row still says
`processing`. If finalize is interrupted inside that window, Phase 5's
GC sweeper reconciles by marking the stuck order `invalid` once its
`expires_at` passes. The actor string on audit rows is `acme:<account-id>`.
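
Step 2 of the finalize path (CSR names must equal the order's identifier set exactly, case-folded) could be sketched as below. `sameIdentifierSet` and `normalize` are hypothetical helpers; in a real flow the CSR names would come from the parsed certificate request, and lower-casing is used here as an ASCII approximation of case folding.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// normalize lower-cases, de-duplicates, and sorts a list of DNS names.
func normalize(names []string) []string {
	seen := make(map[string]struct{}, len(names))
	out := make([]string, 0, len(names))
	for _, n := range names {
		n = strings.ToLower(strings.TrimSpace(n))
		if _, dup := seen[n]; dup || n == "" {
			continue
		}
		seen[n] = struct{}{}
		out = append(out, n)
	}
	sort.Strings(out)
	return out
}

// sameIdentifierSet reports whether the CSR names equal the order's
// identifiers exactly, ignoring case and ordering.
func sameIdentifierSet(csrNames, orderIdentifiers []string) bool {
	a, b := normalize(csrNames), normalize(orderIdentifiers)
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	order := []string{"a.example.com", "b.example.com"}
	fmt.Println(sameIdentifierSet([]string{"B.example.com", "A.example.com"}, order))
	fmt.Println(sameIdentifierSet([]string{"a.example.com"}, order))
}
```

A `false` result corresponds to the `badCSR` error named in step 2.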
## JWS verification (Phase 1b)
Every JWS-authenticated POST runs through the verifier at
`internal/api/acme/jws.go::VerifyJWS`. The verifier enforces:
1. The JWS parses as a flattened single-signature object (multi-sig is
rejected per RFC 8555 §6.2).
2. The signature algorithm is in the closed allow-list `{RS256, ES256,
EdDSA}` per RFC 8555 §6.2 — `none`, `HS256`, and every other alg
are refused at parse time.
3. The protected header carries exactly one of `kid` (registered
account) or `jwk` (new-account flow); endpoints declare which they
require.
4. The protected header `url` matches the inbound request URL exactly.
5. The protected header `nonce` is consumed against the
`acme_nonces` store; missing / replayed / expired nonces return
`urn:ietf:params:acme:error:badNonce` per RFC 8555 §6.5.1.
6. On the `kid` path: the kid URL round-trips against the canonical
per-profile shape, the referenced account exists, and its status
is `valid`. Deactivated / revoked accounts cannot authenticate.
7. The signature verifies against the resolved key (registered
account's stored JWK on the kid path; embedded jwk on the jwk path).
Every state-mutating account operation (create, contact update,
deactivate) writes its `acme_accounts` row and an `audit_events` row
inside one `repository.Transactor.WithinTx` call — the canonical
certctl atomicity contract (matches `service.CertificateService.Create`
at `internal/service/certificate.go:131`).
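
As an illustration of checks 2 to 5 above, here is a minimal sketch of the protected-header validation alone. `checkProtected`, `protectedHeader`, and `b64` are hypothetical names and this is not the code in `jws.go`; signature verification, nonce consumption against `acme_nonces`, and the account lookup are deliberately left out.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// protectedHeader holds the JWS protected-header fields the checks need.
type protectedHeader struct {
	Alg   string          `json:"alg"`
	Kid   string          `json:"kid"`
	JWK   json.RawMessage `json:"jwk"`
	Nonce string          `json:"nonce"`
	URL   string          `json:"url"`
}

// allowedAlgs is the closed allow-list from the text.
var allowedAlgs = map[string]bool{"RS256": true, "ES256": true, "EdDSA": true}

// checkProtected validates a base64url protected header against the request
// URL. requireKid says whether the endpoint expects kid (true) or jwk (false).
func checkProtected(protectedB64, requestURL string, requireKid bool) error {
	raw, err := base64.RawURLEncoding.DecodeString(protectedB64)
	if err != nil {
		return fmt.Errorf("protected header is not base64url: %w", err)
	}
	var h protectedHeader
	if err := json.Unmarshal(raw, &h); err != nil {
		return fmt.Errorf("protected header is not JSON: %w", err)
	}
	if !allowedAlgs[h.Alg] {
		return fmt.Errorf("alg %q is not in the allow-list", h.Alg)
	}
	hasKid := h.Kid != ""
	hasJWK := len(h.JWK) > 0 && string(h.JWK) != "null"
	if hasKid == hasJWK {
		return errors.New("protected header must carry exactly one of kid or jwk")
	}
	if hasKid != requireKid {
		return errors.New("endpoint requires the other key reference (kid vs jwk)")
	}
	if h.URL != requestURL {
		return errors.New("protected url does not match the request URL")
	}
	if h.Nonce == "" {
		return errors.New("missing nonce")
	}
	return nil
}

// b64 base64url-encodes a raw JSON string, for building test inputs.
func b64(s string) string { return base64.RawURLEncoding.EncodeToString([]byte(s)) }

func main() {
	u := "https://certctl.example.com:8443/acme/profile/prof-corp/new-order"
	good := b64(`{"alg":"ES256","kid":"acc-1","nonce":"abc","url":"` + u + `"}`)
	fmt.Println(checkProtected(good, u, true))
	bad := b64(`{"alg":"none","kid":"acc-1","nonce":"abc","url":"` + u + `"}`)
	fmt.Println(checkProtected(bad, u, true))
}
```
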
## Phases (cross-reference)
| Phase | Status | Surface |
|-------|-------------|---------|
| 1a | live | directory + new-nonce + per-profile routing |
| 1b | live | new-account + account/{id} + JWS verifier (RFC 7515 + go-jose v4) |
| 2 | live | orders + authzs + finalize + cert download (trust_authenticated mode end-to-end) |
| 3 | live | HTTP-01 + DNS-01 + TLS-ALPN-01 challenge validation (challenge mode end-to-end) |
| 4 | live | key rollover (RFC 8555 §7.3.5) + revoke-cert (§7.6) + ARI (RFC 9773) |
| 5 | live | rate limits + GC sweeper + kind-driven cert-manager integration test + lego conformance harness + k6 ACME-flow scenario |
| 6 | not yet | full operator-facing reference + walkthroughs + threat model |
Track shipped phases via `git log --grep='acme-server:' --oneline`.
## Operational notes (Phase 1a)
- **Schema:** `migrations/000025_acme_server.up.sql` adds 5 ACME tables
+ the `certificate_profiles.acme_auth_mode` column. Phase 1a actively
uses only `acme_nonces`. The full schema ships now so the migration
is stable and Phases 1b-4 don't need additional `CREATE TABLE`
migrations.
- **Replay protection:** nonces are persisted in `acme_nonces` (NOT
in-memory). They survive server restart, which is required for the
RFC 8555 §6.5 replay defense to hold against a multi-replica
certctl-server fleet behind a load balancer.
- **Metrics:** the service layer exposes per-op atomic counters via
  `service.ACMEService.Metrics().Snapshot()`:
  - `certctl_acme_directory_total`
  - `certctl_acme_directory_failures_total`
  - `certctl_acme_new_nonce_total`
  - `certctl_acme_new_nonce_failures_total`

  Phase 1b will extend with `new_account` counters; Phase 2 with order
  / finalize / cert; Phase 3 with per-challenge-type counters.
- **Audit:** Phase 1a is read-mostly (directory + nonce). Phase 1b's
  account operations route through the canonical
  `s.tx.WithinTx(...)` + `auditService.RecordEventWithTx(...)` pattern
  so every account state mutation is paired with an `audit_events`
  row.
## Phase 4 — key rollover, revocation, ARI
### How do I rotate my ACME account key?
RFC 8555 §7.3.5 defines a doubly-signed JWS for the rollover. The OUTER
JWS is signed by the OLD account key (kid path); its payload IS the
INNER JWS, which is signed by the NEW account key (jwk path).
cert-manager and lego do this for you transparently via `lego renew --key-rotate`
or the cert-manager `Issuer.spec.acme.privateKeySecretRef` rollover.
Server-side validation:
1. Outer JWS verifies against the registered account's current key.
2. Inner JWS verifies against the embedded NEW jwk (proves possession).
3. Inner payload `account` matches outer `kid`.
4. Inner payload `oldKey` thumbprint-equals the registered key.
5. Inner protected `url` equals outer protected `url`.
6. New JWK thumbprint not already registered against the same profile.
7. `SELECT … FOR UPDATE` on the account row serializes concurrent
rollovers; the loser sees the winner's new thumbprint and is told
to retry (409).
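
Checks 3 to 6 compare identifiers and key thumbprints, which can be sketched as below. This assumes RFC 7638 SHA-256 thumbprints over P-256 keys; `ecThumbprint`, `rollover`, and `newKey` are hypothetical names, and the two signature verifications plus the `SELECT … FOR UPDATE` serialization are not shown.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
)

// ecThumbprint computes the RFC 7638 SHA-256 thumbprint of a P-256 key.
func ecThumbprint(pub *ecdsa.PublicKey) string {
	enc := base64.RawURLEncoding
	x := enc.EncodeToString(pub.X.FillBytes(make([]byte, 32)))
	y := enc.EncodeToString(pub.Y.FillBytes(make([]byte, 32)))
	// RFC 7638: required members only, lexicographic order, no whitespace.
	canonical := fmt.Sprintf(`{"crv":"P-256","kty":"EC","x":"%s","y":"%s"}`, x, y)
	sum := sha256.Sum256([]byte(canonical))
	return enc.EncodeToString(sum[:])
}

// rollover carries the already-parsed fields of the outer and inner JWS.
type rollover struct {
	OuterKid, OuterURL     string
	InnerAccount, InnerURL string
	InnerOldKeyTP          string // thumbprint of the inner payload's oldKey
	NewKeyTP               string // thumbprint of the inner JWS's embedded jwk
}

// validate applies checks 3 to 6; taken reports whether a thumbprint is
// already registered under the same profile.
func (r rollover) validate(registeredTP string, taken func(tp string) bool) error {
	switch {
	case r.InnerAccount != r.OuterKid:
		return errors.New("inner account does not match outer kid")
	case r.InnerOldKeyTP != registeredTP:
		return errors.New("oldKey does not match the registered key")
	case r.InnerURL != r.OuterURL:
		return errors.New("inner url does not match outer url")
	case taken(r.NewKeyTP):
		return errors.New("new key is already registered")
	}
	return nil
}

// newKey generates a throwaway P-256 public key for the demo.
func newKey() *ecdsa.PublicKey {
	k, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	return &k.PublicKey
}

func main() {
	oldTP, newTP := ecThumbprint(newKey()), ecThumbprint(newKey())
	r := rollover{
		OuterKid: "acc-1", OuterURL: "/key-change",
		InnerAccount: "acc-1", InnerURL: "/key-change",
		InnerOldKeyTP: oldTP, NewKeyTP: newTP,
	}
	fmt.Println(r.validate(oldTP, func(string) bool { return false }))
}
```
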
### How do I revoke an ACME-issued cert?
Two auth paths per RFC 8555 §7.6:
- **kid path:** sign with your account key. The server checks the
account "owns" the cert via `acme_orders.certificate_id` lookup.
- **jwk path:** sign with the cert's own private key. The server
extracts the cert's public key, computes the JWK, and asserts it
matches the embedded jwk thumbprint.
Either path routes through `service.RevocationSvc.RevokeCertificateWithActor`
— the same pipeline the GUI revoke button, bulk-revocation, and the
ACME-consumer issuer use. So the cert-row update + revocation row + audit
row are all atomic in one `WithinTx`, the issuer is best-effort
notified, and the OCSP response cache is invalidated.
Reason codes follow RFC 5280 §5.3.1; codes 8 (removeFromCRL) and 10
(aACompromise) are not in certctl's `domain.ValidRevocationReasons`
set so they clamp to `unspecified`.
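
The clamping rule can be sketched as a lookup with a fallback. The `validReasons` map below is an assumed stand-in for `domain.ValidRevocationReasons`; the text only says that 8 and 10 are absent, so treating every other RFC 5280 code as present (and the unassigned value 7 as absent) is an assumption of this sketch.

```go
package main

import "fmt"

// validReasons is an assumed stand-in for domain.ValidRevocationReasons.
var validReasons = map[int]string{
	0: "unspecified",
	1: "keyCompromise",
	2: "cACompromise",
	3: "affiliationChanged",
	4: "superseded",
	5: "cessationOfOperation",
	6: "certificateHold",
	9: "privilegeWithdrawn",
}

// clampReason returns the code unchanged when it is in the valid set and
// falls back to 0 (unspecified) otherwise.
func clampReason(code int) (int, string) {
	if name, ok := validReasons[code]; ok {
		return code, name
	}
	return 0, validReasons[0]
}

func main() {
	for _, c := range []int{1, 8, 10} {
		code, name := clampReason(c)
		fmt.Printf("%d -> %d (%s)\n", c, code, name)
	}
}
```
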
### What is ARI?
RFC 9773 ACME Renewal Information. Clients GET
`/acme/profile/<id>/renewal-info/<cert-id>` (unauthenticated) and
receive a JSON document with `suggestedWindow.start` and `.end` —
the server's recommendation for when to renew. The response also
carries `Retry-After` (RFC 9773 §4.2) hinting at the next-poll cadence.
Cert-id format is `base64url(authorityKeyIdentifier).base64url(serial)`
per RFC 9773 §4.1.
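
A minimal sketch of the cert-id construction follows. `ariCertID` and `ariCertIDFromHex` are hypothetical helpers, and the sketch assumes a non-negative serial; it prepends a zero byte when the serial's high bit is set so the bytes match the DER INTEGER content that RFC 9773 specifies.

```go
package main

import (
	"encoding/base64"
	"encoding/hex"
	"fmt"
	"math/big"
)

// ariCertID joins base64url(AKI) and base64url(DER serial bytes) with a dot.
func ariCertID(aki []byte, serial *big.Int) string {
	b := serial.Bytes()
	// DER INTEGER content: keep the sign bit clear for positive values.
	if len(b) == 0 || b[0]&0x80 != 0 {
		b = append([]byte{0x00}, b...)
	}
	enc := base64.RawURLEncoding
	return enc.EncodeToString(aki) + "." + enc.EncodeToString(b)
}

// ariCertIDFromHex is a convenience wrapper taking hex-encoded inputs.
func ariCertIDFromHex(akiHex, serialHex string) (string, error) {
	aki, err := hex.DecodeString(akiHex)
	if err != nil {
		return "", fmt.Errorf("bad AKI hex: %w", err)
	}
	serial, ok := new(big.Int).SetString(serialHex, 16)
	if !ok {
		return "", fmt.Errorf("bad serial hex %q", serialHex)
	}
	return ariCertID(aki, serial), nil
}

func main() {
	// Inputs taken from the example in RFC 9773.
	id, err := ariCertIDFromHex("69885b6b87464041e1b37b847ba0ae2cde01c8d4", "87654321")
	if err != nil {
		panic(err)
	}
	fmt.Println(id) // aYhba4dGQEHhs3uEe6CuLN4ByNQ.AIdlQyE
}
```
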
Window math:
- Cert with a bound renewal policy: window starts at
  `notAfter - RenewalWindowDays`, ends at `notAfter - RenewalWindowDays/2`.
  For example, a cert with a 30-day renewal window and notAfter
  2026-06-30 gets start=2026-05-31, end=2026-06-15. This is the
  Boulder-shape default, which lets cert-manager schedule inside our
  renewal window.
- No policy: window is the last 33% of validity.
- Past expiry: window is "now" → "now + 24h" (renew immediately).
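
The three window rules above can be written out as a small function. `suggestedWindow` and `day` are hypothetical names; the no-policy branch reads "the last 33% of validity" as a window ending at `notAfter`, which is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"time"
)

// suggestedWindow applies the three rules: past expiry, bound policy,
// and no policy (renewalWindowDays <= 0).
func suggestedWindow(notBefore, notAfter, now time.Time, renewalWindowDays int) (start, end time.Time) {
	if !now.Before(notAfter) {
		// Past expiry: renew immediately.
		return now, now.Add(24 * time.Hour)
	}
	if renewalWindowDays > 0 {
		w := time.Duration(renewalWindowDays) * 24 * time.Hour
		return notAfter.Add(-w), notAfter.Add(-w / 2)
	}
	// No policy: assumed to mean the final 33% of the validity period.
	validity := notAfter.Sub(notBefore)
	return notAfter.Add(-validity * 33 / 100), notAfter
}

// day parses a YYYY-MM-DD date as midnight UTC.
func day(s string) time.Time {
	t, err := time.Parse("2006-01-02", s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	start, end := suggestedWindow(day("2026-04-01"), day("2026-06-30"), day("2026-05-01"), 30)
	fmt.Println(start.Format("2006-01-02"), end.Format("2006-01-02")) // 2026-05-31 2026-06-15
}
```
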
Disable ARI globally with `CERTCTL_ACME_SERVER_ARI_ENABLED=false`. The
URL drops out of the directory; the route is still registered but
returns 404 — clients fall back to static renewal scheduling.
## Phase 5 — operational guidance
### Rate limiting
Production deployments serving multiple ACME profiles or fleets should
keep the default rate limits in place. The four caps:
- `RATE_LIMIT_ORDERS_PER_HOUR` (100) — per-account new-order cap. A
cert-manager Certificate that auto-renews at the 1/3 mark of its
validity (90-day cert → ~30-day renewal) consumes ~12 orders/year
per managed Certificate. 100/hour is generous for any plausible
fleet.
- `RATE_LIMIT_CONCURRENT_ORDERS` (5) — per-account cap on
pending/ready/processing orders. Stops a runaway client from
starving DB-row throughput. Tune up only if you observe legitimate
bursts.
- `RATE_LIMIT_KEY_CHANGE_PER_HOUR` (5) — rollovers are rare; a flood
is an attack signal. Tune down to 1/hour if your operator
procedure mandates manual rollovers only.
- `RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR` (60) — per-challenge cap,
defends against retry storms.
Hits return RFC 8555 §6.7 `rateLimited` Problem with a `Retry-After`
header. cert-manager 1.15+ honors the header; lego too. Older clients
may not — that's the client's problem, not certctl's.
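
For reference, a rate-limited response of the shape described above might be produced like this. `writeRateLimited`, `problem`, and `render` are hypothetical names, not certctl's `writeServiceError`; the detail string is illustrative only.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// problem is a minimal RFC 7807 document with the fields ACME clients read.
type problem struct {
	Type   string `json:"type"`
	Detail string `json:"detail"`
	Status int    `json:"status"`
}

// writeRateLimited emits 429 with Retry-After and an ACME rateLimited problem.
func writeRateLimited(w http.ResponseWriter, retryAfterSeconds int) {
	if retryAfterSeconds < 1 {
		retryAfterSeconds = 1
	}
	w.Header().Set("Content-Type", "application/problem+json")
	w.Header().Set("Retry-After", strconv.Itoa(retryAfterSeconds))
	w.WriteHeader(http.StatusTooManyRequests)
	_ = json.NewEncoder(w).Encode(problem{
		Type:   "urn:ietf:params:acme:error:rateLimited",
		Detail: "rate limit exceeded; retry after the indicated delay",
		Status: http.StatusTooManyRequests,
	})
}

// render runs the writer against an in-memory recorder and returns what a
// client would observe.
func render(retryAfterSeconds int) (status int, retryAfter, contentType, problemType string) {
	rec := httptest.NewRecorder()
	writeRateLimited(rec, retryAfterSeconds)
	var p problem
	_ = json.Unmarshal(rec.Body.Bytes(), &p)
	return rec.Code, rec.Header().Get("Retry-After"), rec.Header().Get("Content-Type"), p.Type
}

func main() {
	status, retryAfter, contentType, problemType := render(90)
	fmt.Println(status, retryAfter, contentType, problemType)
}
```
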
The buckets are **in-memory + per-replica**. A 3-replica certctl-server
fleet behind a load balancer effectively has 3× the configured
throughput (each replica's bucket fills independently). For
deployments where this matters operationally, the right answer is a
shared rate-limit store. That is a follow-up and not blocking for the
current threat model, where same-account requests typically pin to
the same replica via session affinity.
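
To make the bucket behavior concrete, here is a minimal in-memory token-bucket sketch with per-action and per-key isolation. `Limiter`, `NewLimiter`, and `Allow` are hypothetical names and not the API in `ratelimit.go`; the sketch assumes bucket capacity equals the per-hour value and a continuous refill.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)

type bucket struct {
	tokens float64
	last   time.Time
}

// Limiter keeps one bucket per (action, key) pair, in memory only.
type Limiter struct {
	mu      sync.Mutex
	perHour float64
	buckets map[string]*bucket
}

// NewLimiter builds a limiter; perHour <= 0 disables limiting.
func NewLimiter(perHour int) *Limiter {
	return &Limiter{perHour: float64(perHour), buckets: make(map[string]*bucket)}
}

// Allow reports whether one request may proceed. When it may not,
// retryAfter is the time until one token has refilled.
func (l *Limiter) Allow(action, key string) (ok bool, retryAfter time.Duration) {
	if l.perHour <= 0 {
		return true, 0
	}
	l.mu.Lock()
	defer l.mu.Unlock()

	now := time.Now()
	id := action + "/" + key
	b, found := l.buckets[id]
	if !found {
		b = &bucket{tokens: l.perHour, last: now}
		l.buckets[id] = b
	}
	// Refill in proportion to elapsed time, capped at capacity.
	b.tokens = math.Min(l.perHour, b.tokens+now.Sub(b.last).Hours()*l.perHour)
	b.last = now

	if b.tokens >= 1 {
		b.tokens--
		return true, 0
	}
	missing := 1 - b.tokens
	return false, time.Duration(missing / l.perHour * float64(time.Hour))
}

func main() {
	l := NewLimiter(2)
	for i := 1; i <= 3; i++ {
		ok, wait := l.Allow("new-order", "acc-1")
		fmt.Printf("request %d: allowed=%v retryAfter=%s\n", i, ok, wait.Round(time.Second))
	}
}
```

Because each replica would hold its own `Limiter`, three replicas with a 100/hour cap could admit up to 300 orders/hour for one account in the worst case.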
### GC sweeper
The scheduler runs the GC every `GC_INTERVAL` (default 1m). Each tick
runs three independent sweeps, one SQL statement each:
1. `DELETE FROM acme_nonces WHERE used = TRUE OR expires_at < NOW()`.
2. `UPDATE acme_authorizations SET status='expired' WHERE status='pending' AND expires_at < NOW()`.
3. `UPDATE acme_orders SET status='invalid', error=... WHERE status IN ('pending','ready','processing') AND expires_at < NOW()`.

Each sweep is bounded by its own 1-minute timeout. A failing sweep is
logged and retried on the next tick without blocking the sweeps after
it; a tick that fires while the previous one is still running is
skipped (the `atomic.Bool` guard prevents overlap). Counts are exposed
via `certctl_acme_gc_*` Prometheus metrics.
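
The tick behavior (overlap guard, per-sweep timeout, log-and-continue) can be sketched as follows. `gcLoop`, `sweep`, and `demo` are hypothetical names, and the sweeps here are stub functions rather than the SQL statements certctl runs.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"sync/atomic"
	"time"
)

// sweep is one GC step; run returns the number of rows it touched.
type sweep struct {
	name string
	run  func(ctx context.Context) (int64, error)
}

type gcLoop struct {
	running         atomic.Bool
	sweeps          []sweep
	perSweepTimeout time.Duration
}

// tick runs every sweep once. It returns false without doing any work
// when a previous tick is still in flight.
func (g *gcLoop) tick(ctx context.Context) bool {
	if !g.running.CompareAndSwap(false, true) {
		return false
	}
	defer g.running.Store(false)
	for _, s := range g.sweeps {
		sctx, cancel := context.WithTimeout(ctx, g.perSweepTimeout)
		n, err := s.run(sctx)
		cancel()
		if err != nil {
			// Log and continue so one failing sweep does not block the rest.
			log.Printf("acme gc: %s failed: %v", s.name, err)
			continue
		}
		log.Printf("acme gc: %s touched %d rows", s.name, n)
	}
	return true
}

// run drives tick on a fixed interval until ctx is cancelled.
func (g *gcLoop) run(ctx context.Context, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			g.tick(ctx)
		}
	}
}

// demo runs one tick with a failing first sweep, then simulates an overlap.
func demo() (completed []string, firstTick, secondTick bool) {
	g := &gcLoop{perSweepTimeout: time.Minute}
	ok := func(name string) sweep {
		return sweep{name: name, run: func(context.Context) (int64, error) {
			completed = append(completed, name)
			return 1, nil
		}}
	}
	g.sweeps = []sweep{
		{name: "nonces", run: func(context.Context) (int64, error) {
			return 0, errors.New("simulated failure")
		}},
		ok("authzs"),
		ok("orders"),
	}
	firstTick = g.tick(context.Background())
	g.running.Store(true) // pretend a previous tick is still in flight
	secondTick = g.tick(context.Background())
	return completed, firstTick, secondTick
}

func main() {
	fmt.Println(demo())
}
```
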
### cert-manager integration test
`make acme-cert-manager-test` brings up a kind cluster, installs
cert-manager 1.15.0, helm-deploys certctl-server with
`acmeServer.enabled=true`, and verifies a Certificate resource issues
end-to-end. Skipped in CI by default (kind is too heavy for per-PR);
operators run it locally on a workstation. See
`deploy/test/acme-integration/` for the YAML + Go test harness.
### lego RFC conformance harness
`make acme-rfc-conformance-test` drives lego v4 against a hermetic
certctl-server stack, exercising register → new-order → finalize.
Operators run this when shipping behavior changes to the ACME surface
to confirm a real third-party client still works.
### k6 ACME flows scenario
`deploy/test/loadtest/k6/acme_flow.js` exercises the unauthenticated
surface (directory + new-nonce + ARI) at 100 VUs × 5m. JWS-signed
flows are out of scope for k6 (no JWS support); they're covered by
the lego conformance harness above. Baseline numbers + thresholds in
`deploy/test/loadtest/README.md`.