diff --git a/docs/acme-caddy-walkthrough.md b/docs/acme-caddy-walkthrough.md
new file mode 100644
index 0000000..5af0fc1
--- /dev/null
+++ b/docs/acme-caddy-walkthrough.md
@@ -0,0 +1,172 @@
+# Caddy Integration Walkthrough
+
+End-to-end recipe for issuing certs from a certctl-server deployment
+through Caddy 2.7+. Target audience: operator running Caddy on a VM
+or container who wants Caddy to ACME-issue from certctl instead of
+Let's Encrypt.
+
+## Prereqs
+
+- A reachable certctl-server with `CERTCTL_ACME_SERVER_ENABLED=true`
+  and at least one profile whose `acme_auth_mode` is set. Profile
+  setup is identical to the cert-manager walkthrough; see
+  [`docs/acme-cert-manager-walkthrough.md`](./acme-cert-manager-walkthrough.md)
+  Step 2.
+- Caddy 2.7.x or later. `caddy version` should show 2.7.0+.
+- Network reachability: Caddy → certctl-server's HTTPS listener (port
+  8443 by default).
+- The certctl bootstrap CA, in PEM form, captured for the trust
+  configuration below. Capture it exactly as in Step 3 of the
+  cert-manager walkthrough: `cat deploy/test/certs/ca.crt`.
+
+## Step 1 — Configure Caddy
+
+Caddy's ACME issuer is configured globally via the `acme_ca` global
+option, or per site via the `dir` option of the `acme` issuer inside a
+`tls` block (the ACME issuer module's `ca` field in JSON config).
+Either form points at the directory URL:
+
+```
+{
+  email ops@example.com
+}
+
+example.com {
+  tls {
+    issuer acme {
+      dir https://certctl.example.com:8443/acme/profile/prof-test/directory
+    }
+  }
+  reverse_proxy localhost:8080
+}
+```
+
+Notes:
+
+- `dir` must point at the directory URL (ending in `/directory`),
+  not just the base. Caddy uses the directory document to discover
+  the new-account / new-order URLs, exactly the same way cert-manager
+  does.
+- `issuer acme` is the default issuer; it is spelled out here because
+  the per-site `dir` option lives inside its block. Caddy can also be
+  configured with `issuer zerossl` or `issuer internal`; for certctl
+  integration, `acme` is the correct issuer.
+- Caddy auto-discovers `tls-alpn-01` first when port 443 is bound to
+  Caddy, then falls back to HTTP-01. For `trust_authenticated` mode
+  profiles, both work without solver round-trips.
+
+## Step 2 — Trust the certctl bootstrap CA
+
+Caddy validates the certctl-server's TLS chain before any ACME call,
+the same way cert-manager does. Two options for trust:
+
+### Option A — OS trust store (preferred for VMs)
+
+```
+sudo cp deploy/test/certs/ca.crt /usr/local/share/ca-certificates/certctl-bootstrap.crt
+sudo update-ca-certificates
+sudo systemctl restart caddy
+```
+
+Caddy honors the system trust store via the Go runtime's
+`crypto/x509` defaults. After `update-ca-certificates`, Caddy's HTTPS
+client trusts certctl's self-signed root and the directory call
+succeeds.
+
+### Option B — Caddy `trusted_roots` (for containerized deployments)
+
+```
+example.com {
+  tls {
+    issuer acme {
+      dir https://certctl.example.com:8443/acme/profile/prof-test/directory
+      trusted_roots /etc/caddy/certctl-bootstrap.crt
+    }
+  }
+  reverse_proxy localhost:8080
+}
+```
+
+The `trusted_roots` line names the PEM file Caddy's ACME client trusts
+when it connects to the directory URL. Because it sits inside the
+site's `issuer acme` block, that trust is scoped to ACME calls for
+this site only. This is the right pattern for multi-tenant Caddy
+deployments where some sites trust certctl and others do not.
+
+## Step 3 — Reload Caddy
+
+```
+caddy validate --config /etc/caddy/Caddyfile
+sudo systemctl reload caddy
+```
+
+Caddy reloads atomically; in-flight requests complete on the old
+config while new requests are served under the new one. Once the new
+config is loaded, Caddy hits certctl's directory URL, registers an
+account, submits a new-order, and finalizes, typically completing
+in under 5 seconds for `trust_authenticated` mode.
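The trust path from Step 2 can also be checked independently of Caddy. The following is a minimal Go sketch, not part of certctl or Caddy: `clientTrusting` is a hypothetical helper that builds an HTTP client trusting only the supplied PEM bundle, so fetching the directory URL with it succeeds only if the server certificate chains to that bundle. The file path and URL are passed on the command line and are whatever your deployment uses.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
	"os"
	"time"
)

// clientTrusting (hypothetical helper) returns an HTTP client whose only
// trusted roots are the certificates found in pemBytes.
func clientTrusting(pemBytes []byte) (*http.Client, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pemBytes) {
		return nil, errors.New("no certificates found in PEM input")
	}
	return &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool, MinVersion: tls.VersionTLS12},
		},
	}, nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: trustcheck CA_PEM DIRECTORY_URL")
		return
	}
	pemBytes, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "read CA file:", err)
		os.Exit(1)
	}
	client, err := clientTrusting(pemBytes)
	if err != nil {
		fmt.Fprintln(os.Stderr, "build client:", err)
		os.Exit(1)
	}
	// A TLS verification failure here corresponds to the
	// "certificate signed by unknown authority" error Caddy would log.
	resp, err := client.Get(os.Args[2])
	if err != nil {
		fmt.Fprintln(os.Stderr, "directory fetch failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	fmt.Println("directory reachable, status:", resp.Status)
}
```

Run it with the bootstrap CA file and the same directory URL used in the Caddyfile; a failure at the fetch step points at the trust configuration rather than at Caddy itself.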
+
+## Step 4 — Verify
+
+```
+openssl s_client -connect example.com:443 -servername example.com < /dev/null 2>/dev/null \
+  | openssl x509 -noout -subject -issuer
+# the issuer should be the certctl internal CA, not Let's Encrypt
+```
+
+The cert is stored under Caddy's data directory
+(`$XDG_DATA_HOME/caddy/certificates/` by default), in a subdirectory
+named after the ACME directory URL. Inspect:
+
+```
+openssl x509 -in ~/.local/share/caddy/certificates/certctl.example.com-*/example.com/example.com.crt -noout -subject -issuer -dates
+# subject= CN=example.com
+# issuer= CN=certctl test internal CA
+```
+
+(Path layout is Caddy-version-dependent; check `caddy environ` for the
+canonical data dir.)
+
+On the certctl side, the operator's audit log captures the issuance
+event:
+
+```
+psql -c "SELECT actor, action, resource_id FROM audit_events
+         WHERE actor LIKE 'acme:%' ORDER BY created_at DESC LIMIT 5;"
+```
+
+## Common failure modes
+
+- **Caddy logs `tls: failed to verify certificate: x509: certificate
+  signed by unknown authority`** → certctl bootstrap CA is not in
+  Caddy's trust path. Re-do Step 2; verify with `curl --cacert
+  /etc/caddy/certctl-bootstrap.crt https://certctl.example.com:8443/acme/profile/prof-test/directory`.
+- **Caddy logs `urn:ietf:params:acme:error:rateLimited`** → certctl
+  per-account orders/hour limit hit (default 100/hr). Tune via
+  `CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR` if you have
+  legitimately high throughput.
+- **Caddy logs `urn:ietf:params:acme:error:rejectedIdentifier`** →
+  the SAN list includes an identifier the certctl profile policy
+  rejects. Cross-reference [`docs/acme-server.md` § Troubleshooting](./acme-server.md#certificate-readyfalse-with-rejectedidentifier).
+- **`badNonce` in Caddy logs** → clock skew or multi-replica certctl
+  without sticky sessions; same fix as the cert-manager walkthrough.
+
+## Cleanup
+
+```
+# remove the certctl-specific block from your Caddyfile, then:
+sudo systemctl reload caddy
+# Optional: delete cached certs from the certctl directory namespace.
+rm -rf ~/.local/share/caddy/certificates/certctl.example.com-* +``` + +## See also + +- [`docs/acme-server.md`](./acme-server.md) — canonical reference. +- [`docs/acme-cert-manager-walkthrough.md`](./acme-cert-manager-walkthrough.md) — + K8s-native equivalent. +- [Caddy upstream ACME docs](https://caddyserver.com/docs/automatic-https#acme-issuer) + — verify behavior pinned here against Caddy 2.7.x semantics. diff --git a/docs/acme-cert-manager-walkthrough.md b/docs/acme-cert-manager-walkthrough.md new file mode 100644 index 0000000..d41ee0c --- /dev/null +++ b/docs/acme-cert-manager-walkthrough.md @@ -0,0 +1,254 @@ +# cert-manager Integration Walkthrough + +End-to-end recipe for issuing certs from a certctl-server deployment +through cert-manager 1.15+. Target audience: Kubernetes operator who +has never deployed certctl before and wants a working +`Certificate` → `Secret` flow on their cluster in under 30 minutes. + +The Phase 5 integration test (`make acme-cert-manager-test`) automates +exactly the recipe below. The YAML snippets in this doc are byte-equal +to the files under `deploy/test/acme-integration/` — re-running the +test from a fresh clone produces the same results documented here. + +## Prereqs + +- A Kubernetes cluster (kind / k3d / EKS / GKE / AKS / on-prem). For + local trial, `kind v0.20+` works exactly the way the Phase 5 test + uses it. The kind config lives at + [`deploy/test/acme-integration/kind-config.yaml`](../deploy/test/acme-integration/kind-config.yaml). +- `kubectl` v1.27+, `helm` v3.13+. +- `cert-manager` v1.15.0 installed in the `cert-manager` namespace. + If absent, run: + + ``` + bash deploy/test/acme-integration/cert-manager-install.sh + ``` + + which is the same idempotent installer the integration test uses. +- A certctl Helm chart published to a registry your cluster can pull + from. The Phase 5 test uses an `image.tag=test` placeholder; production + deployments use the actual image tag for your release line. 
+ +## Step 1 — Deploy certctl-server + +``` +helm install certctl-test deploy/helm/certctl/ \ + --set acmeServer.enabled=true \ + --set acmeServer.defaultProfileId=prof-test \ + --set image.tag=test +kubectl wait --for=condition=Available --timeout=3m deployment/certctl-test +``` + +`acmeServer.enabled=true` flips the `CERTCTL_ACME_SERVER_ENABLED` +env var which gates the ACME route registration. +`acmeServer.defaultProfileId` sets `CERTCTL_ACME_SERVER_DEFAULT_PROFILE_ID` +so the `/acme/*` shorthand path mirrors the per-profile path family. + +## Step 2 — Create the certctl profile + +The ACME server requires a `certificate_profiles` row to bind issuance +to. Create one via the certctl API or GUI; for the simplest case set +`acme_auth_mode='trust_authenticated'`: + +``` +curl -X POST https://certctl-test.default.svc.cluster.local:8443/api/profiles \ + -H 'Content-Type: application/json' \ + -H "Authorization: Bearer $CERTCTL_API_KEY" \ + -d '{ + "id": "prof-test", + "name": "ACME test profile", + "issuer_id": "iss-internal-ca", + "max_ttl_seconds": 7776000, + "acme_auth_mode": "trust_authenticated" + }' +``` + +Auth-mode tradeoffs are covered in +[`docs/acme-server.md` § Auth-mode decision tree](./acme-server.md#auth-mode-decision-tree). +For first-time deployments, `trust_authenticated` is the right default. + +## Step 3 — Capture the certctl bootstrap CA + +cert-manager validates the certctl-server's TLS chain before sending +any account / order / finalize JWS. With certctl's self-signed +bootstrap cert (the demo default at `deploy/test/certs/server.crt`), +cert-manager rejects the directory URL with +`x509: certificate signed by unknown authority` unless you feed the +bootstrap CA in. + +``` +cat deploy/test/certs/ca.crt | base64 -w0 +``` + +Capture the output for Step 4. This is **the** single biggest first- +time-deploy footgun on the cert-manager integration path. 
The reference +recipe lives in +[`docs/acme-server.md` § TLS trust bootstrap](./acme-server.md#tls-trust-bootstrap-read-this-before-configuring-cert-manager). + +## Step 4 — Apply the ClusterIssuer + +```yaml +# Phase 5 — sample ClusterIssuer for the certctl trust_authenticated +# auth mode (RFC 8555 §6 + certctl auth_mode=trust_authenticated, where +# the JWS-authenticated ACME account is trusted to issue any identifier +# the profile policy permits — no per-identifier ownership challenges). +# +# Use this as the starting template for any internal-PKI rollout. +# Replace the caBundle placeholder with the base64-encoded PEM of the +# certctl-server's self-signed bootstrap root, then `kubectl apply`. +# +# Generate the caBundle via: +# cat deploy/test/certs/ca.crt | base64 -w0 +# (See certctl/docs/acme-server.md "TLS trust bootstrap" section for the +# end-to-end walkthrough — this is the single biggest first-time-deploy +# footgun on cert-manager, captured as audit fix #9.) +apiVersion: cert-manager.io/v1 +kind: ClusterIssuer +metadata: + name: certctl-test-trust +spec: + acme: + email: test@example.com + # Replace 'certctl-test' with your release name + adjust the + # profile path segment. Default profile path: + # https://..svc.cluster.local:8443/acme/profile//directory + server: https://certctl-test.default.svc.cluster.local:8443/acme/profile/prof-test/directory + # caBundle: Audit fix #9. cert-manager validates the ACME server's + # TLS chain before submitting any account/order/finalize. With a + # self-signed bootstrap root, the ClusterIssuer MUST carry the root + # explicitly via this field. + caBundle: | + LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCi4uLgotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg== + privateKeySecretRef: + name: certctl-test-trust-account-key + solvers: + # In trust_authenticated mode the solver is unused at the + # validation step but cert-manager still requires at least one + # solver in the spec. 
http01-via-ingress-nginx is the cheapest + # placeholder shape that round-trips correctly through cert- + # manager's validation webhooks. + - http01: + ingress: + class: nginx +``` + +This block is byte-equal to +[`deploy/test/acme-integration/clusterissuer-trust-authenticated.yaml`](../deploy/test/acme-integration/clusterissuer-trust-authenticated.yaml). +Replace the `caBundle` placeholder with the base64 string from Step 3. +The full reference YAML lives at +[`deploy/test/acme-integration/clusterissuer-trust-authenticated.yaml`](../deploy/test/acme-integration/clusterissuer-trust-authenticated.yaml). + +``` +kubectl apply -f deploy/test/acme-integration/clusterissuer-trust-authenticated.yaml +kubectl wait --for=condition=Ready --timeout=2m clusterissuer/certctl-test-trust +``` + +The solver block is a placeholder under `trust_authenticated` mode — +cert-manager 1.15 still requires at least one solver in the spec, but +certctl auto-resolves authzs without a solver round-trip. The +http01-ingress-nginx shape validates against cert-manager's webhook +without needing an actual ingress controller deployed. + +For `challenge` mode profiles, swap to +[`deploy/test/acme-integration/clusterissuer-challenge.yaml`](../deploy/test/acme-integration/clusterissuer-challenge.yaml) +— same shape, but the solver is now load-bearing and you need +ingress-nginx (or your chosen ingress class) actually deployed for +HTTP-01 to work. + +## Step 5 — Apply the Certificate + +```yaml +# Phase 5 — Certificate resource the integration test applies and +# waits for. The certctl-test-trust ClusterIssuer (trust_authenticated +# mode) issues the cert without any solver round-trip; the resulting +# Secret 'test-com-tls' is asserted to carry tls.crt + tls.key. 
+apiVersion: cert-manager.io/v1 +kind: Certificate +metadata: + name: test-com + namespace: default +spec: + secretName: test-com-tls + commonName: test.example.com + dnsNames: + - test.example.com + - www.test.example.com + issuerRef: + name: certctl-test-trust + kind: ClusterIssuer + duration: 720h # 30d + renewBefore: 240h # 10d +``` + +This block is byte-equal to +[`deploy/test/acme-integration/certificate-test.yaml`](../deploy/test/acme-integration/certificate-test.yaml). + +``` +kubectl apply -f deploy/test/acme-integration/certificate-test.yaml +kubectl wait --for=condition=Ready --timeout=3m certificate/test-com +``` + +cert-manager creates an `Order`, the ACME flow runs against certctl, +and the resulting Secret is populated. + +## Step 6 — Verify + +``` +kubectl get certificate test-com -o wide +# NAME READY SECRET ISSUER STATUS AGE +# test-com True test-com-tls certctl-test-trust Certificate is up to date and has not expired 42s + +kubectl get secret test-com-tls -o yaml | yq '.data."tls.crt"' | base64 -d | openssl x509 -noout -subject -issuer -dates +# subject= CN=test.example.com +# issuer= CN=certctl test internal CA +# notBefore=... notAfter=... +``` + +Both the cert-manager `Certificate` resource and the underlying Secret +are populated. The actor on the certctl side is `acme:`, +which you can correlate via the `audit_events` table: + +``` +psql -c "SELECT created_at, action, resource_type, resource_id + FROM audit_events + WHERE actor LIKE 'acme:%' + ORDER BY created_at DESC LIMIT 10;" +``` + +## Common failure modes + +These are operator-side; full troubleshooting reference is in +[`docs/acme-server.md` § Troubleshooting](./acme-server.md#troubleshooting). + +- `400 Bad Request: badNonce` → clock skew between certctl-server and + cert-manager, or a multi-replica certctl fleet without sticky + sessions. +- `x509: certificate signed by unknown authority` → missing or stale + `caBundle`. Re-run Step 3, paste the fresh value. 
+- `connection refused` from the HTTP-01 validator → ingress controller + not deployed, OR your network blocks port 80 inbound to the solver + Ingress. +- `Ready=False` with `rejectedIdentifier` → CSR has a SAN your profile + policy doesn't permit. Decode the `subproblems` array of the RFC + 7807 problem doc. + +## Cleanup + +``` +kubectl delete -f deploy/test/acme-integration/certificate-test.yaml +kubectl delete -f deploy/test/acme-integration/clusterissuer-trust-authenticated.yaml +helm uninstall certctl-test +# Optional: delete the certctl profile via API. +``` + +## See also + +- [`docs/acme-server.md`](./acme-server.md) — canonical reference. +- [`docs/acme-server-threat-model.md`](./acme-server-threat-model.md) — + security posture. +- [`docs/acme-caddy-walkthrough.md`](./acme-caddy-walkthrough.md) — + Caddy-side recipe. +- [`docs/acme-traefik-walkthrough.md`](./acme-traefik-walkthrough.md) — + Traefik-side recipe. +- [`deploy/test/acme-integration/`](../deploy/test/acme-integration/) — + Phase 5 integration test (the same recipe, automated). diff --git a/docs/acme-server-threat-model.md b/docs/acme-server-threat-model.md new file mode 100644 index 0000000..05b570b --- /dev/null +++ b/docs/acme-server-threat-model.md @@ -0,0 +1,278 @@ +# ACME Server — Threat Model + +Security posture for the certctl ACME server endpoint +(`/acme/profile//*`). Read this before opening a PR that changes +the JWS verifier, the challenge validators, the rate limiter, or the +GC sweeper. + +The threat model lives in this dedicated doc (rather than `docs/acme-server.md`) +because security-review reviewers want a single concentrated reference. +Production deployments under audit should treat this doc as the +canonical answer to "how does certctl resist X?" + +## Threat surface map + +The ACME server has four ingress surfaces: + +1. **JWS-authenticated POST endpoints** — new-account, new-order, + finalize, key-change, revoke-cert, account update, order POST-as-GET. 
+ Authenticated by an ECDSA / RSA / EdDSA signature over the request. +2. **Unauthenticated GET endpoints** — directory, new-nonce, ARI + (renewal-info). Read-only; no authn. +3. **Outbound challenge validators** — HTTP-01, DNS-01, TLS-ALPN-01. + The certctl-server initiates outbound calls to operator-provided + identifiers (the SAN list of the requested cert). +4. **Scheduler-driven GC sweeper** — internal-only; no inbound surface. + +Threat actors: + +- **External Internet attacker** — no certctl credentials; can hit + unauthenticated endpoints + observe TLS metadata. +- **Authenticated ACME account holder (low-trust)** — has a valid + account on a profile but should be bounded by profile policy + + rate limits. +- **On-path attacker** between certctl-server and a challenge target + (HTTP-01 / DNS-01 / TLS-ALPN-01). +- **Compromised cert holder** — has the private key of a previously- + issued cert and wants to revoke/exfiltrate. +- **Malicious operator with profile-write access** — can change a + profile's `acme_auth_mode` or policy, but is the trusted boundary + per certctl's threat model. Out of scope here; covered by certctl's + RBAC + audit log. + +## JWS forgery resistance + +The verifier (`internal/api/acme/jws.go`) accepts only the closed +allow-list `{RS256, ES256, EdDSA}`. The allow-list is passed to +`jose.ParseSigned` so go-jose rejects every other algorithm at parse +time, before any signature work. + +Specific attacks blocked: + +- **Algorithm confusion (`alg: none`)** — RFC 7515 §6.1's classic + unauthenticated-fallback. Not in allow-list; rejected at parse. +- **HS256 substitution (alg-confusion via symmetric)** — symmetric + algs aren't in the allow-list; rejected at parse. +- **Replayed nonce** — every JWS carries a nonce consumed via + `acme_nonces.UPDATE … WHERE used = FALSE` (a single statement; + Postgres row-locking serializes the writes). A second consume of + the same nonce sees `RowsAffected=0` and the verifier returns + `badNonce`. 
+- **URL spoofing** — the protected-header `url` field MUST match the + request URL exactly (RFC 8555 §6.4); a JWS signed for one URL + cannot be replayed against another. +- **Multi-signature JWS** — RFC 8555 §6.2 forbids; the verifier + rejects `len(jws.Signatures) != 1` explicitly. +- **kid-vs-jwk confusion** — exactly one MUST be present per RFC 8555 + §6.2; both-present and neither-present are rejected. +- **kid round-trip mismatch** — the verifier's `AccountKID` closure + computes the canonical kid URL for the resolved account-id and + compares to the inbound `kid`; cross-profile replay is rejected + because the canonical URL differs. + +The doubly-signed key-rollover JWS (RFC 8555 §7.3.5, Phase 4) gets +its own dedicated verifier in `internal/api/acme/keychange.go`. +Inner-only invariants enforced: MUST use `jwk` not `kid`, payload +`account` MUST equal outer `kid`, payload `oldKey` MUST canonicalize- +equal the registered key (RFC 7638 thumbprint, constant-time +compare), inner `url` MUST equal outer `url`. + +## Nonce store integrity + +Nonces are persisted in PostgreSQL (`acme_nonces` table; migration +000025) with a TTL set by `CERTCTL_ACME_SERVER_NONCE_TTL` (default +5 min). The Phase 5 GC sweeper deletes used / expired rows every 1 +minute by default. + +Why DB-backed and not in-memory: + +- **Survives restart** — a multi-replica certctl-server fleet behind + a load balancer can issue a nonce on replica A and consume it on + replica B. In-memory state would force sticky sessions globally, + which the operator can't guarantee in all topologies. +- **Atomic consume** — a single `UPDATE ... WHERE used = FALSE` + statement is the consume primitive; Postgres row-locking guarantees + exactly one of two concurrent consumes wins. +- **Expiry-bounded** — even if the GC sweeper were disabled, the + nonce TTL is enforced at consume time + (`AND expires_at > NOW()` in the UPDATE). + +A nonce-store-side compromise would let an attacker forge nonces. 
+Mitigation: the nonce table is in the same Postgres instance certctl +already trusts; a DB compromise is broader than ACME-specific. + +## HTTP-01 SSRF resistance + +The HTTP-01 validator (Phase 3, `internal/api/acme/validators.go`) +fetches `http:///.well-known/acme-challenge/` +where the identifier is operator/client-controlled. Without +mitigation, this is a textbook SSRF surface — internal services on +RFC1918 / link-local / cloud-metadata addresses would be reachable. + +Mitigations (defense in depth): + +1. **Pre-dial check** — `validation.ValidateSafeURL` rejects URLs + whose host parses as a literal reserved IP. Cheap early bail. +2. **Per-dial check** — `validation.SafeHTTPDialContext` is installed + on the `http.Transport`. Every dial re-resolves DNS, rejects + reserved IPs, and **pins the resolved IP** (`net.JoinHostPort(ips[0], + port)`) so a racing DNS rebinding cannot substitute a different IP + between resolve and connect. +3. **Per-redirect check** — Go's HTTP client re-dials on 3xx; the + `DialContext` runs again, applying the same SSRF guards. +4. **Body cap** — the validator's `io.LimitReader` caps response + bodies at 16 KiB. A misbehaving target cannot DoS the validator + pool with a multi-GB response. +5. **Bounded redirects** — the validator caps redirects at 10 (Go + default). A redirect-loop target is bounded. + +Reserved IP set: loopback (127.0.0.0/8 + ::1), link-local +(169.254.0.0/16 + fe80::/10), all RFC1918 (10/8, 172.16/12, 192.168/16), +cloud-metadata literals (169.254.169.254 explicitly), broadcast, +multicast, IPv4-mapped-IPv6 to a reserved IPv4. See +`internal/validation/ssrf.go::isReservedIPForDial` for the full set. + +CodeQL alert #23 flags `client.Do(req)` in the SCEP-probe call site +as `go/request-forgery` despite the dial-time guard; the analyzer +can't trace through a custom `Transport.DialContext`. Operator- +acknowledged false positive (CLAUDE.md task #10) — see the SCEP +probe's same-shaped defense for the audit trail. 
+ +## DNS-01 cache poisoning posture + +The DNS-01 validator queries +`_acme-challenge.` against a single resolver configured by +`CERTCTL_ACME_SERVER_DNS01_RESOLVER` (default `8.8.8.8:53`). + +Threat: an operator running a private resolver (typical in air-gapped +deployments) inherits that resolver's cache-poisoning posture. A +poisoned resolver could attest a TXT record the legitimate domain +owner never published, allowing an attacker who controls the +resolver to forge ACME challenges. + +Mitigation: + +- Default `8.8.8.8:53` is Google Public DNS — DNSSEC-validating, + operationally hardened, well-monitored. +- Operators choosing a private resolver own the cache-poisoning + posture. The doc explicitly flags this in + `docs/acme-server.md` § Configuration. +- DNSSEC-validation is **not** enforced by the validator itself — + the validator trusts the resolver's answer. Operators wanting + strict DNSSEC validation should use a DNSSEC-validating resolver + (e.g. `1.1.1.1` or a self-hosted Unbound). + +## TLS-ALPN-01 challenge interception + +RFC 8737 §3 explicitly says the validator MUST NOT verify the +challenge target's certificate chain — the proof lives in the +embedded `id-pe-acmeIdentifier` extension (OID 1.3.6.1.5.5.7.1.31) +of the cert presented during the TLS handshake, not in the chain +itself. + +Implementation: `internal/api/acme/validators.go::TLSALPN01Validator` +sets `tls.Config.InsecureSkipVerify = true` with a dedicated +`//nolint:gosec` annotation citing RFC 8737 §3 and the L-001 +documentation row in `docs/tls.md`. + +What this means for on-path attackers: + +- An on-path attacker between certctl-server and the challenge target + CAN intercept the TLS handshake and present a forged cert. The + proof is the embedded extension byte-equality, which the attacker + cannot generate without the account key — so interception alone + doesn't grant cert issuance. 
+- An attacker who has the account key already controls the account + per RFC 8555; the TLS-ALPN-01 validator's interception window adds + no incremental capability. + +The integrity property TLS-ALPN-01 actually provides: the challenge +target proves possession of the account-key-derived key authorization +on a TLS connection bound to the requested identifier (port 443 of +the SAN). Operators wanting CA/Browser-Forum-style WebPKI strictness +should run a dedicated public-trust CA, not certctl. + +## Rate-limit tuning + +Phase 5 in-memory token buckets with per-(action, key) isolation. +Defaults: + +- `RATE_LIMIT_ORDERS_PER_HOUR=100` per account. +- `RATE_LIMIT_CONCURRENT_ORDERS=5` per account (pending/ready/processing). +- `RATE_LIMIT_KEY_CHANGE_PER_HOUR=5` per account. +- `RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR=60` per challenge-id. + +Tuning: + +- **Too loose** → enables abuse vectors. A compromised account could + burn DB-row throughput; a runaway client could fill the validator + pool. +- **Too tight** → legitimate flake-out. cert-manager's exponential + backoff after a `rateLimited` problem is conservative; a 1-hour + cooldown is a long time for an operator hitting an unexpected limit. + +Defaults are intentionally conservative on the loose-side — 100/hour +is generous for any plausible per-account fleet (a 50k-cert +deployment renewing at the 1/3-validity mark consumes ~12 +orders/year/cert ≈ 600k orders/year ≈ 70 orders/hour even spread +evenly across accounts). Tighter limits are appropriate for +deployments with many low-trust accounts. + +The buckets are in-memory + per-replica. A 3-replica certctl-server +fleet effectively has 3× the configured per-account throughput +because each replica's bucket fills independently. 
For deployments +where this matters operationally, the right answer is a shared rate- +limit store (Redis / Postgres-backed); not blocking for current +threat model where same-account requests typically pin to the same +replica via session affinity. + +## Audit trail + +Every ACME state mutation writes a row to `audit_events`. Actor strings +distinguish the auth path: + +- `acme:` — kid-path requests (the requesting account + signed the JWS). +- `acme-cert-key:` — jwk-path revoke (the cert's own private + key signed the JWS). +- `acme-system:gc` — scheduler-driven sweeps (no client request). + +Operators querying by actor prefix can reconstruct the full history +of any ACME-issued cert. See +`docs/acme-server.md` § FAQ "What audit-log events fire" for the +event-name catalog. + +## Out-of-scope threats + +Documented to set scope expectations for security reviewers: + +- **DDoS at the TLS layer** — the certctl-server's TLS listener + + upstream load balancer / WAF handle this. The ACME-specific rate + limits don't substitute for upstream DDoS protection. +- **cert-manager-side compromise** — if cert-manager is compromised, + it has both the account key and the private keys of every issued + cert. Out of certctl's trust boundary; operators run cert-manager + with the same care they'd run any other secret-bearing operator. +- **Compromised certctl-server filesystem** — the bootstrap CA key + lives at `deploy/test/certs/ca.key` (or the operator-managed + equivalent). A filesystem compromise is broader than ACME-specific + and is covered by certctl's HSM / signer-driver architecture (see + `docs/architecture.md` "Signer abstraction"). +- **Postgres compromise** — the nonce table, account JWKs, and + audit log all live in the same Postgres instance. A DB compromise + is broader than ACME-specific and is the operator's responsibility + to mitigate via standard DB-hardening practices. 
+- **Supply-chain attacks against go-jose / lib/pq** — handled by + Dependabot + the `make verify` security gate; not ACME-specific. + +## See also + +- [`docs/acme-server.md`](./acme-server.md) — operator-facing reference. +- [`docs/tls.md`](./tls.md) — TLS posture, including the L-001 + table of `InsecureSkipVerify` justifications (TLS-ALPN-01 row). +- [`internal/api/acme/jws.go`](../internal/api/acme/jws.go) — verifier + source. +- [`internal/api/acme/validators.go`](../internal/api/acme/validators.go) + — challenge validator pool. +- [`internal/validation/ssrf.go`](../internal/validation/ssrf.go) — + SSRF-defense primitives. diff --git a/docs/acme-server.md b/docs/acme-server.md index c5a2794..3f3d363 100644 --- a/docs/acme-server.md +++ b/docs/acme-server.md @@ -7,15 +7,16 @@ as an ACME issuer with no certctl-side modification — closing the "deploy a certctl agent on every K8s node" friction that costs deals to external PKI vendors today. -> **Phase status (2026-05-03):** Phase 5 — production hardening + -> cert-manager integration test. Per-account rate limits applied at -> 3 entry points (orders/hour, key-change/hour, challenge-respond/hour) -> + a per-account concurrent-orders cap; a 1-minute scheduler loop -> sweeps expired nonces / authzs / orders. A kind-driven cert-manager -> integration test (gated by `KIND_AVAILABLE`) verifies the full -> happy-path against a real cert-manager 1.15+ deployment. RFC -> conformance is verified via lego against the same stack. Track -> shipped phases via `git log --grep='acme-server:'`. +> **Phase status (2026-05-03):** Phase 6 — full operator-facing +> reference. The functional surface is complete (Phases 1a-5); this +> doc is the canonical procurement-readability reference. 
New: client- +> walkthrough docs for [cert-manager](./acme-cert-manager-walkthrough.md), +> [Caddy](./acme-caddy-walkthrough.md), and +> [Traefik](./acme-traefik-walkthrough.md); a dedicated +> [threat model](./acme-server-threat-model.md); a section-by-section +> RFC 8555 + RFC 9773 conformance statement; a 5-failure-mode +> troubleshooting playbook; a tested-clients version pinning table. +> Track shipped phases via `git log --grep='acme-server:'`. ## Configuration @@ -105,6 +106,41 @@ the `caBundle` requirement is flagged here in Phase 1a's docs because operators hit it the moment they try to point a real ACME client at certctl. +## Auth-mode decision tree + +Use `trust_authenticated` when: + +- The certctl deployment serves **internal-only PKI** (intranet certs, + service-mesh certs, IoT bootstrap). Identifiers in your CSRs are + controlled by your infrastructure, not by the public Internet. +- You don't have HTTP/DNS reachability **from certctl-server back to + the ACME client's solver** (e.g., the client lives in an isolated + network segment certctl-server can't reach). +- You want the simplest cert-manager integration: cert-manager submits + a CSR, certctl issues; no out-of-band ownership proof. +- You're issuing under your own root CA whose trust is operator-managed + (NOT WebPKI). Public CAs cannot use this mode — RFC 8555 §8 ownership + proof is non-negotiable for public-trust roots. + +Use `challenge` when: + +- The deployment is **public-trust-style PKI** — even if your root is + privately operated, you want CA/Browser Forum-style ownership-proof + semantics so a stolen account key can't be used to issue for arbitrary + identifiers. +- You have HTTP-01 / DNS-01 / TLS-ALPN-01 reachability from the + certctl-server to the ACME client's solver. (HTTP-01 needs port 80 + ingress to the client; DNS-01 needs DNS recursion; TLS-ALPN-01 needs + port 443 ingress.) 
+- You want defense-in-depth: an account-key compromise gains the + attacker nothing without also compromising the solver-side + infrastructure. + +A single certctl-server can run both modes simultaneously — the auth +mode is a per-profile column on `certificate_profiles.acme_auth_mode`, +read at request time. Operators flip a profile's mode via SQL or the +profile API, and the next order picks up the new mode without restart. + ## Endpoints Routes registered in `internal/api/router/router.go::RegisterHandlers`: @@ -143,6 +179,49 @@ After Phase 4, the full RFC 8555 + RFC 9773 surface is live. RFC 8739 (short-lived certs) and EAB enforcement remain follow-up work; cert- manager + boulder-tested clients work today against the surface above. +## RFC 8555 + RFC 9773 conformance statement + +Honest disclosure of what's implemented, where, and what's not. Procurement +engineers running gap analyses against cert-manager + Let's Encrypt's +conformance posture should read this section before anything else.
+ +### Implemented + +| Section | Surface | Phase | First commit | +|---------|---------|-------|--------------| +| RFC 8555 §6.2 | JWS auth + RS256/ES256/EdDSA allow-list | 1b | `27bd660` | +| RFC 8555 §6.3 | POST-as-GET | 1b | `27bd660` | +| RFC 8555 §6.4 | URL-header binding to request URL | 1b | `27bd660` | +| RFC 8555 §6.5 | Replay-Nonce + DB-backed nonce store | 1a | `e146b00` | +| RFC 8555 §6.7 | RFC 7807 problem documents | 1a | `e146b00` | +| RFC 8555 §7.1 | Directory | 1a | `e146b00` | +| RFC 8555 §7.2 | new-nonce HEAD + GET | 1a | `e146b00` | +| RFC 8555 §7.3 | new-account + idempotent re-registration | 1b | `27bd660` | +| RFC 8555 §7.3.2 + §7.3.6 | account update + deactivation | 1b | `27bd660` | +| RFC 8555 §7.3.5 | doubly-signed key rollover | 4 | `0299e4a` | +| RFC 8555 §7.4 | new-order + finalize + cert download | 2 | `4ee486e` | +| RFC 8555 §7.5 | authz POST-as-GET | 2 | `4ee486e` | +| RFC 8555 §7.5.1 | challenge response | 3 | `7e22204` | +| RFC 8555 §7.6 | revoke-cert (kid + jwk paths) | 4 | `0299e4a` | +| RFC 8555 §8.3 | HTTP-01 challenge validator | 3 | `7e22204` | +| RFC 8555 §8.4 | DNS-01 challenge validator | 3 | `7e22204` | +| RFC 8737 | TLS-ALPN-01 challenge validator | 3 | `7e22204` | +| RFC 9773 | ACME Renewal Information (ARI) | 4 | `0299e4a` | + +### Not implemented (procurement-honest) + +| Spec area | Status | Notes | +|-----------|--------|-------| +| RFC 8555 §7.3.4 — External Account Binding (EAB) | **Not implemented.** | Advertised in directory `meta.externalAccountRequired` but enforcement is a follow-up. Operators relying on EAB for account-creation gating should layer an upstream WAF. | +| RFC 8555 §8.4 + §7.4 — Wildcard with `*.` prefix > 1 level | **Not implemented.** | Single-level wildcards (e.g. `*.example.com`) work end-to-end. Multi-level wildcards (`*.*.example.com`) are RFC-spec-ambiguous and rejected at the identifier-validation layer. 
| +| RFC 8739 — Short-lived certs | **Not implemented.** | Operators wanting <7-day validity tune the bound issuer's TTL directly via `CertificateProfile.MaxTTLSeconds`; the ACME wire shape doesn't expose a separate notion. | +| Cross-CA proxying | **Not implemented.** | Each profile binds to one issuer. Multi-CA federation (one ACME account → multi-CA selection per identifier) is on the roadmap. | +| RFC 8555 §6.7 — `accountDoesNotExist` problem with hint URL | Partial. | Sentinel returns `accountDoesNotExist`; the optional hint URL embedding the `kid` is not emitted. cert-manager doesn't consume it. | + +If a procurement-side gap analysis turns up something not in either +table above, the answer is "we don't know yet" — operator-side issues +welcome. + ## Finalize routing through `CertificateService.Create` (Phase 2 architecture) The finalize path mirrors how every other certctl issuance surface @@ -214,7 +293,7 @@ at `internal/service/certificate.go:131`). | 3 | live | HTTP-01 + DNS-01 + TLS-ALPN-01 challenge validation (challenge mode end-to-end) | | 4 | live | key rollover (RFC 8555 §7.3.5) + revoke-cert (§7.6) + ARI (RFC 9773) | | 5 | live | rate limits + GC sweeper + kind-driven cert-manager integration test + lego conformance harness + k6 ACME-flow scenario | -| 6 | not yet | full operator-facing reference + walkthroughs + threat model | +| 6 | live | full operator-facing reference + walkthroughs (cert-manager / Caddy / Traefik) + threat model + RFC-8555 conformance statement + troubleshooting + version pinning | Track shipped phases via `git log --grep='acme-server:' --oneline`. @@ -386,3 +465,182 @@ surface (directory + new-nonce + ARI) at 100 VUs × 5m. JWS-signed flows are out of scope for k6 (no JWS support); they're covered by the lego conformance harness above. Baseline numbers + thresholds in `deploy/test/loadtest/README.md`. + +## Troubleshooting + +The five failure modes operators hit most often + the canonical fix +for each.
+ +### `cert-manager logs: 400 Bad Request: badNonce` + +**Cause:** Either a nonce was replayed (a buggy client retries the +same JWS), the cert-manager + certctl-server clocks differ by more +than `CERTCTL_ACME_SERVER_NONCE_TTL` (default 5 min), or the +nonce-store row was reaped between issuance and use. + +**Fix:** First check NTP on both sides. If clocks are healthy, +lengthen `CERTCTL_ACME_SERVER_NONCE_TTL` to 10m or 15m. If the +problem persists, check for a multi-replica certctl-server fleet +without sticky session affinity — the nonce DB row lives on one +replica; if the JWS POST hits a different replica before replication +catches up, you observe spurious `badNonce`. Solution: pin client +sessions to a single replica via load-balancer cookie / `kid`-hash +routing, OR shorten replication lag if your DB is the bottleneck. + +### `cert-manager logs: x509: certificate signed by unknown authority` + +**Cause:** cert-manager refuses to talk to the directory URL because +its TLS chain doesn't terminate at a root in cert-manager's trust +store. certctl-server's bootstrap cert (Phase 1a, `deploy/test/certs/server.crt`) +is self-signed. + +**Fix:** Add the `caBundle` field to your `ClusterIssuer.spec.acme` — +see the [TLS trust bootstrap](#tls-trust-bootstrap-read-this-before-configuring-cert-manager) +section above for the 3-step recipe. This is **the** single biggest +first-time-deploy footgun on the cert-manager integration path. + +### HTTP-01 validator returns `connection refused` + +**Cause:** The HTTP-01 solver's Ingress / Service is not reachable +from certctl-server's network. Common subcases: (a) the cert-manager +http-solver pod is on a private network certctl-server can't reach; +(b) a firewall blocks port 80 inbound to the solver's address; (c) +the Ingress class annotation doesn't match an installed ingress +controller; (d) your DNS still points at an old IP. 
+ +**Fix:** From the certctl-server pod, `curl -v +http:///.well-known/acme-challenge/` and read the +network error. If the curl fails the same way, the network path is +the issue. If curl works but the validator fails, check the validator +log lines — the SSRF guard rejects reserved IPs (RFC1918, link-local, +cloud-metadata 169.254.169.254). Public-trust style profiles that +need to reach RFC1918 solvers must be moved to `trust_authenticated` +mode OR the solver must be exposed on a routable address. + +### DNS-01 validator returns `NXDOMAIN` + +**Cause:** DNS provider hasn't propagated the `_acme-challenge.` +TXT record yet. Most providers have a 30s-2m propagation lag. cert-manager +retries by default, but Phase-5 rate limits (default 60/hour per +challenge-id) can truncate the retry budget. + +**Fix:** Verify TXT propagation with `dig +short TXT _acme-challenge. +@`. If the answer is empty, the issue is upstream. If +it's populated but certctl reports NXDOMAIN, check +`CERTCTL_ACME_SERVER_DNS01_RESOLVER` (default `8.8.8.8:53`) is +reachable from certctl-server's network egress. Operators on isolated +networks need a private resolver; configure accordingly + own the +cache-poisoning posture (see [threat +model](./acme-server-threat-model.md)). + +### Certificate Ready=False with `rejectedIdentifier` + +**Cause:** The CSR includes an identifier (CommonName or SAN) that the +bound certificate profile's policy rejects. certctl runs syntactic + +profile-policy validation **before** order creation; the order never +reaches the database. + +**Fix:** The reject reason is in the `subproblems` array of the RFC +8555 §6.7 problem document. Decode the JSON, look at `subproblems[].detail`, +and adjust either the CSR or the profile policy. Common causes: +SAN-not-in-`AllowedIdentifierWildcards`, EKU-not-in-`AllowedEKUs`, +TTL-exceeds-`MaxTTLSeconds`. 
Validation logic lives in +`internal/api/acme/identifier.go::ValidateIdentifiers` + +`internal/domain/profile.go` — read those if the profile-policy rule +isn't obvious. + +## Version pinning + tested clients + +certctl's ACME server is tested against the following client versions. +Other versions probably work; these are the ones the integration suite +exercises end-to-end. + +| Client | Tested version | Where it's pinned | +|--------|----------------|-------------------| +| cert-manager | 1.15.0 | `deploy/test/acme-integration/cert-manager-install.sh::CERT_MANAGER_VERSION` | +| lego (RFC 8555 conformance harness) | v4.x latest | `deploy/test/acme-integration/conformance-lego.sh` (operator installs via `go install github.com/go-acme/lego/v4/cmd/lego@latest`) | +| kind (cluster bootstrap) | v0.20+ | `deploy/test/acme-integration/kind-config.yaml` schema requirement | +| Caddy | 2.7.x | Phase 6 walkthrough (`docs/acme-caddy-walkthrough.md`) | +| Traefik | 3.0+ | Phase 6 walkthrough (`docs/acme-traefik-walkthrough.md`) | + +Operators reporting issues with untested-version clients should include +the client version + the precise wire-level error (curl-captured request ++ response body) so we can pin a regression test if applicable. + +## FAQ + +### Why two auth modes? Isn't `challenge` strictly more secure? + +`challenge` is strictly more secure for **public-trust** PKI — RFC 8555 +§8 ownership proof is the entire point of cert-manager + Let's Encrypt. +For **internal PKI**, the threat model is different: the network itself +is the security boundary (mTLS service mesh, firewalled VPC, identifier- +namespace controlled by the operator). Forcing every internal cert to +go through a solver round-trip adds operational toil with no security +gain. `trust_authenticated` is the certctl-specific mode that +acknowledges this — the ACME account is the proof, not the solver. + +### How does this differ from `cert-manager → Let's Encrypt with certctl as a separate step`? 
+ +Two integrations vs one. With certctl as the ACME endpoint, cert-manager +does its native flow (Certificate → Order → CSR → Secret) and certctl +mints the cert directly, recording it under its own +`managed_certificates` table with full audit + renewal-policy + bulk- +revocation surface. With Let's Encrypt as the ACME endpoint, you have +to run a separate cert-manager-uploads-to-certctl webhook OR maintain +two parallel cert tracks. The native-ACME-server path is operationally +simpler. + +### Can I use ACME endpoints from outside the K8s cluster? + +Yes. The endpoints are HTTPS over the certctl-server's listener (port +8443 by default). Caddy on a VM, win-acme on a Windows server, or +Posh-ACME on a Mac all integrate against +`https://:8443/acme/profile//directory`. +The TLS-trust-bootstrap requirement applies the same way — see the +[Caddy walkthrough](./acme-caddy-walkthrough.md) for the OS-trust-store +recipe. + +### How do I migrate manually-issued certs to ACME-issued ones? + +Not yet automatic. Operators migrating: keep the old `managed_certificates` +rows; create new ones via the ACME flow; flip targets one by one. A +dedicated bulk-migration tool is on the roadmap (post-2.1.0). Track +via the master prompt's roadmap section in +`cowork/acme-server-endpoint-prompt.md`. + +### What audit-log events fire on each ACME operation? + +Every state mutation writes an `audit_events` row. Actor strings: +`acme:` for kid-path requests; `acme-cert-key:` +for jwk-path revoke; `acme-system:gc` for scheduler-driven sweeps. 
+Event-name catalog: + +| Event name | Fired by | Resource type | +|------------|----------|---------------| +| `acme_account_created` | new-account | `acme_account` | +| `acme_account_contact_updated` | account update | `acme_account` | +| `acme_account_deactivated` | account deactivate | `acme_account` | +| `acme_account_key_rolled` | key-change | `acme_account` | +| `acme_order_created` | new-order | `acme_order` | +| `acme_order_finalized` | finalize | `acme_order` | +| `acme_challenge_processing` | challenge-respond (dispatch) | `acme_challenge` | +| `acme_challenge_completed` | validator callback | `acme_challenge` | +| `certificate_revoked` | revoke-cert (routes through `RevocationSvc`) | `certificate` | + +Querying by actor prefix (`actor LIKE 'acme:%'`) reconstructs the full +history of any ACME-issued cert. + +### Is there a threat model document? + +Yes — [`docs/acme-server-threat-model.md`](./acme-server-threat-model.md). +Read before writing a security review. + +## See also + +- [cert-manager integration walkthrough](./acme-cert-manager-walkthrough.md) +- [Caddy integration walkthrough](./acme-caddy-walkthrough.md) +- [Traefik integration walkthrough](./acme-traefik-walkthrough.md) +- [Threat model](./acme-server-threat-model.md) +- [TLS trust bootstrap reference](./tls.md) +- [Architecture (control-plane)](./architecture.md) diff --git a/docs/acme-traefik-walkthrough.md b/docs/acme-traefik-walkthrough.md new file mode 100644 index 0000000..7543f58 --- /dev/null +++ b/docs/acme-traefik-walkthrough.md @@ -0,0 +1,198 @@ +# Traefik Integration Walkthrough + +End-to-end recipe for issuing certs from a certctl-server deployment +through Traefik 3.0+. Target audience: operator running Traefik (in +Kubernetes or on a VM) who wants to use certctl as their ACME source +of truth instead of Let's Encrypt. + +## Prereqs + +- A reachable certctl-server with `CERTCTL_ACME_SERVER_ENABLED=true` + and at least one profile whose `acme_auth_mode` is set. 
Profile + setup is identical to the cert-manager walkthrough — see + [`docs/acme-cert-manager-walkthrough.md`](./acme-cert-manager-walkthrough.md) + Step 2. +- Traefik 3.0+ (the v2 API surface for ACME is also supported but the + `serversTransport.rootCAs` reference below is v3-shaped). +- The certctl bootstrap CA, in PEM form, captured the same way as the + cert-manager walkthrough Step 3. + +## Step 1 — Configure Traefik static config + +Traefik's ACME issuer is a `certificatesResolver` in the static config +(file or CLI flags or env vars). The relevant fields: + +```yaml +# /etc/traefik/traefik.yml (or wherever your static config lives) + +certificatesResolvers: + certctl: + acme: + caServer: https://certctl.example.com:8443/acme/profile/prof-test/directory + email: ops@example.com + storage: /etc/traefik/acme-certctl.json + httpChallenge: + entryPoint: web + # OR for trust_authenticated mode profiles: + # tlsChallenge: {} + +# certctl uses a self-signed bootstrap cert; Traefik needs the CA +# explicitly via serversTransport.rootCAs to call the directory URL. +serversTransports: + default: + rootCAs: + - /etc/traefik/certctl-bootstrap.crt + +# Apply the serversTransport globally so every outbound HTTPS call — +# including ACME directory + finalize — trusts the certctl CA. +api: + insecure: false + +entryPoints: + web: + address: ":80" + websecure: + address: ":443" +``` + +Notes: + +- `caServer` must point at the directory URL (ending in `/directory`). +- `httpChallenge.entryPoint: web` requires Traefik's `web` entryPoint + (port 80) to be reachable from certctl-server's HTTP-01 validator. + For `trust_authenticated` mode profiles, this is a no-op formality — + certctl auto-resolves authzs, so the solver round-trip never happens. +- `tlsChallenge: {}` is the alternative that uses TLS-ALPN-01 (RFC 8737) + via Traefik's `websecure` (port 443) entryPoint. 
Either works under + `challenge` mode; only the default-of-`tlsChallenge` is recommended + for `trust_authenticated` mode. + +## Step 2 — Trust the certctl bootstrap CA + +Two options: + +### Option A — `serversTransport.rootCAs` (preferred) + +``` +sudo cp deploy/test/certs/ca.crt /etc/traefik/certctl-bootstrap.crt +sudo systemctl reload traefik +``` + +`serversTransports.default.rootCAs` (shown in Step 1 above) tells +Traefik's outbound HTTPS client to trust the supplied PEM in addition +to the system trust store. This is the right pattern for containerized +Traefik where you don't want to install OS-level trust roots. + +### Option B — OS trust store + +For Traefik running directly on a VM, `update-ca-certificates`-style +installation works the same way as the Caddy walkthrough Option A. +The `serversTransport.rootCAs` field is unnecessary in that case. + +## Step 3 — Reference the resolver from a router + +Per-router (dynamic config): + +```yaml +# /etc/traefik/dynamic/example-com.yml + +http: + routers: + example-com: + rule: "Host(`example.com`)" + entryPoints: [websecure] + tls: + certResolver: certctl + service: example-com-backend + services: + example-com-backend: + loadBalancer: + servers: + - url: "http://localhost:8080" +``` + +Or, in Kubernetes via `IngressRoute` (Traefik CRD): + +```yaml +apiVersion: traefik.io/v1alpha1 +kind: IngressRoute +metadata: + name: example-com +spec: + entryPoints: [websecure] + routes: + - match: Host(`example.com`) + kind: Rule + services: + - name: example-com-backend + port: 8080 + tls: + certResolver: certctl +``` + +## Step 4 — Reload Traefik + +``` +sudo systemctl reload traefik +# OR kubectl rollout restart deployment/traefik (if you changed the static config via ConfigMap). +``` + +On the first request to `example.com`, Traefik hits certctl's directory +URL, registers an account, submits a new-order, and finalizes. The cert +is persisted to `/etc/traefik/acme-certctl.json` (or its in-cluster +PVC equivalent). 
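
Before relying on the first request to trigger issuance, it can help to confirm that the directory URL is reachable when only the bootstrap CA is trusted. The following is a minimal Go sketch under stated assumptions, not part of certctl or Traefik: the helper names `poolFromPEM` and `fetchDirectory` and the file name `dircheck.go` are hypothetical, and the CA path and directory URL are the ones used in this walkthrough.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"os"
	"time"
)

// poolFromPEM (hypothetical helper) returns a pool holding only the roots
// found in pemBytes, so a successful handshake shows that CA is sufficient.
func poolFromPEM(pemBytes []byte) (*x509.CertPool, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pemBytes) {
		return nil, errors.New("no PEM certificates found in input")
	}
	return pool, nil
}

// fetchDirectory (hypothetical helper) GETs the ACME directory document and
// decodes it into a generic map of resource name to value.
func fetchDirectory(url string, roots *x509.CertPool) (map[string]any, error) {
	client := &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: roots},
		},
	}
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	var dir map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&dir); err != nil {
		return nil, err
	}
	return dir, nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: dircheck <ca.pem> <directory-url>")
		return
	}
	pemBytes, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "read CA file:", err)
		os.Exit(1)
	}
	roots, err := poolFromPEM(pemBytes)
	if err != nil {
		fmt.Fprintln(os.Stderr, "parse CA file:", err)
		os.Exit(1)
	}
	dir, err := fetchDirectory(os.Args[2], roots)
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch directory:", err)
		os.Exit(1)
	}
	// Print the endpoints an ACME client needs to start an order.
	for _, key := range []string{"newNonce", "newAccount", "newOrder"} {
		fmt.Printf("%s: %v\n", key, dir[key])
	}
}
```

Run it as `go run dircheck.go /etc/traefik/certctl-bootstrap.crt https://certctl.example.com:8443/acme/profile/prof-test/directory`. An `x509: certificate signed by unknown authority` error here suggests the CA file is not the one that signed the certctl-server listener certificate.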
+ +## Step 5 — Verify + +``` +curl -kvI https://example.com 2>&1 | grep -E 'subject|issuer' +# subject: CN=example.com +# issuer: CN=certctl test internal CA +``` + +The cert is signed by certctl's bound issuer (per the `prof-test` +profile's `issuer_id`). + +On the certctl side, the audit log captures the issuance: + +``` +psql -c "SELECT actor, action, resource_id FROM audit_events + WHERE actor LIKE 'acme:%' ORDER BY created_at DESC LIMIT 5;" +``` + +## Common failure modes + +- **Traefik logs `unable to obtain ACME certificate ... x509: certificate + signed by unknown authority`** → `serversTransport.rootCAs` is not + pointing at the certctl bootstrap CA, OR the file was rotated and + Traefik hasn't reloaded. Verify with + `curl --cacert /etc/traefik/certctl-bootstrap.crt + https://certctl.example.com:8443/acme/profile/prof-test/directory`. +- **Traefik logs `urn:ietf:params:acme:error:rateLimited`** → tune + `CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR` on the certctl + side, OR reduce Traefik's parallel-cert-acquisition concurrency. +- **`acme: error: 400 :: POST :: ... :: badNonce`** → clock skew or + multi-replica certctl without sticky sessions; same fix as the + cert-manager walkthrough. +- **Storage file `acme-certctl.json` shows persistent failures** — + Traefik retains failed-acquisition state. After fixing the + underlying cause, delete the storage file and reload: + `rm /etc/traefik/acme-certctl.json && systemctl reload traefik`. + +## Cleanup + +``` +# Remove the certResolver from any router / IngressRoute consuming it. +sudo systemctl reload traefik +# Delete the persisted ACME storage: +sudo rm /etc/traefik/acme-certctl.json +# Or in K8s: drop the resolver from the static-config ConfigMap. +``` + +## See also + +- [`docs/acme-server.md`](./acme-server.md) — canonical reference. +- [`docs/acme-cert-manager-walkthrough.md`](./acme-cert-manager-walkthrough.md) — + cert-manager equivalent. 
+- [Traefik upstream ACME docs](https://doc.traefik.io/traefik/https/acme/#caserver) — + verify behavior pinned here against Traefik 3.0+ semantics. diff --git a/docs/connectors.md b/docs/connectors.md index df290cb..36c477b 100644 --- a/docs/connectors.md +++ b/docs/connectors.md @@ -19,7 +19,8 @@ Connectors extend certctl to integrate with external systems for certificate iss - [Revocation Across Issuers](#revocation-across-issuers) - [EST Integration (GetCACertPEM)](#est-integration-getcacertpem) - [Building a Custom Issuer](#building-a-custom-issuer) -3. [Target Connector](#target-connector) +3. [ACME Server (Built-in)](#acme-server-built-in) +4. [Target Connector](#target-connector) - [Interface](#interface-1) - [Built-in: NGINX](#built-in-nginx) - [Built-in: Apache httpd](#built-in-apache-httpd) @@ -34,28 +35,28 @@ Connectors extend certctl to integrate with external systems for certificate iss - [Windows Certificate Store](#windows-certificate-store) - [Java Keystore (JKS / PKCS#12)](#java-keystore-jks--pkcs12) - [Kubernetes Secrets](#kubernetes-secrets) -4. [Notifier Connector](#notifier-connector) +5. [Notifier Connector](#notifier-connector) - [Interface](#interface-2) -5. [Registering a Connector](#registering-a-connector) +6. [Registering a Connector](#registering-a-connector) - [IssuerConnectorAdapter](#issuerconnectoradapter) - [Notifier Registration](#notifier-registration) -6. [Testing Connectors](#testing-connectors) +7. [Testing Connectors](#testing-connectors) - [Unit Tests](#unit-tests) - [Integration Tests](#integration-tests) -7. [Best Practices](#best-practices) -8. [Agent Discovery Scanner](#agent-discovery-scanner) +8. [Best Practices](#best-practices) +9. [Agent Discovery Scanner](#agent-discovery-scanner) - [Configuration](#configuration) - [How It Works](#how-it-works) - [API Endpoints](#api-endpoints) - [Use Cases](#use-cases) -9. 
[Network Certificate Scanner (M21)](#network-certificate-scanner-m21) - - [Configuration](#configuration-1) - - [Creating Scan Targets](#creating-scan-targets) - - [How It Works](#how-it-works-1) - - [API Endpoints](#api-endpoints-1) - - [Scheduler Integration](#scheduler-integration) - - [Use Cases](#use-cases-1) -10. [What's Next](#whats-next) +10. [Network Certificate Scanner (M21)](#network-certificate-scanner-m21) + - [Configuration](#configuration-1) + - [Creating Scan Targets](#creating-scan-targets) + - [How It Works](#how-it-works-1) + - [API Endpoints](#api-endpoints-1) + - [Scheduler Integration](#scheduler-integration) + - [Use Cases](#use-cases-1) +11. [What's Next](#whats-next) ## Overview @@ -712,6 +713,56 @@ func (v *VaultIssuer) IssueCertificate(ctx context.Context, req issuer.IssuanceR // ... implement RenewCertificate, RevokeCertificate, GetOrderStatus ``` +## ACME Server (Built-in) + +certctl ships a built-in RFC 8555 + RFC 9773 ARI ACME **server** +endpoint at `/acme/profile//*`. Any RFC 8555 client +(cert-manager 1.15+, Caddy, Traefik, win-acme, certbot, Posh-ACME) +integrates with certctl as an ACME issuer with no certctl-side +modification — closing the "deploy a certctl agent on every K8s node" +friction that costs deals to external PKI vendors. + +This is **distinct** from the [ACME consumer +connector](#built-in-acme-v2-lets-encrypt-sectigo-zerossl) above. The +consumer side is `certctl → external CA over ACME`; the server side +is `external client → certctl over ACME`. Operators deploying both +should namespace env vars carefully: consumer uses `CERTCTL_ACME_*` +(`DIRECTORY_URL`, `EMAIL`, `CHALLENGE_TYPE`); server uses +`CERTCTL_ACME_SERVER_*` (`ENABLED`, `DEFAULT_PROFILE_ID`, `NONCE_TTL`, +…). + +Two auth modes per profile (`certificate_profiles.acme_auth_mode`): + +- **`trust_authenticated`** (default for internal PKI). 
The JWS- + authenticated ACME account is trusted to issue for any identifier + the profile policy permits; no out-of-band ownership proof. The + most common certctl use case — internal-PKI fleets where the + network itself is the trust boundary. +- **`challenge`**. Full HTTP-01 + DNS-01 + TLS-ALPN-01 validation per + RFC 8555 §8 + RFC 8737. Required for public-trust-style PKI where + account-key compromise must not confer issuance authority. + +Routes through `service.CertificateService.Create` so policy + audit ++ metrics + bulk-revocation + cloud-discovery all apply uniformly to +ACME-issued certs (just as they do to API-issued, agent-issued, EST- +issued, SCEP-issued certs). + +See: + +- [ACME Server Reference](./acme-server.md) — env-var reference, + endpoints, auth-mode decision tree, RFC 8555 conformance statement, + troubleshooting, FAQ. +- [cert-manager Walkthrough](./acme-cert-manager-walkthrough.md) — kind + → cert-manager → certctl-server → Certificate flow. +- [Caddy Walkthrough](./acme-caddy-walkthrough.md) — Caddyfile `acme_ca` + + trust configuration. +- [Traefik Walkthrough](./acme-traefik-walkthrough.md) — `certificatesResolvers` + + `serversTransport.rootCAs`. +- [Threat Model](./acme-server-threat-model.md) — JWS forgery + resistance, nonce store integrity, HTTP-01 SSRF, DNS-01 cache + posture, TLS-ALPN-01 chain-not-validated rationale, rate-limit + tuning, audit trail. + ## Target Connector Target connectors deploy certificates to infrastructure systems. They run on agents, not on the control plane.