ACME Server — Threat Model

Security posture for the certctl ACME server endpoint (/acme/profile/<id>/*). Read this before opening a PR that changes the JWS verifier, the challenge validators, the rate limiter, or the GC sweeper.

The threat model lives in this dedicated doc (rather than docs/acme-server.md) because security reviewers want a single concentrated reference. Production deployments under audit should treat this doc as the canonical answer to "how does certctl resist X?"

Threat surface map

The ACME server has four ingress surfaces:

  1. JWS-authenticated POST endpoints — new-account, new-order, finalize, key-change, revoke-cert, account update, order POST-as-GET. Authenticated by an ECDSA / RSA / EdDSA signature over the request.
  2. Unauthenticated GET endpoints — directory, new-nonce, ARI (renewal-info). Read-only; no authn.
  3. Outbound challenge validators — HTTP-01, DNS-01, TLS-ALPN-01. The certctl-server initiates outbound calls to operator-provided identifiers (the SAN list of the requested cert).
  4. Scheduler-driven GC sweeper — internal-only; no inbound surface.

Threat actors:

  • External Internet attacker — no certctl credentials; can hit unauthenticated endpoints + observe TLS metadata.
  • Authenticated ACME account holder (low-trust) — has a valid account on a profile but should be bounded by profile policy + rate limits.
  • On-path attacker between certctl-server and a challenge target (HTTP-01 / DNS-01 / TLS-ALPN-01).
  • Compromised cert holder — has the private key of a previously issued cert and wants to revoke/exfiltrate.
  • Malicious operator with profile-write access — can change a profile's acme_auth_mode or policy, but sits inside the trust boundary of certctl's threat model. Out of scope here; covered by certctl's RBAC + audit log.

JWS forgery resistance

The verifier (internal/api/acme/jws.go) accepts only the closed allow-list {RS256, ES256, EdDSA}. The allow-list is passed to jose.ParseSigned so go-jose rejects every other algorithm at parse time, before any signature work.

Specific attacks blocked:

  • Algorithm confusion (alg: none) — the classic unauthenticated fallback via the unsecured "none" algorithm (RFC 7518 §3.6). Not in allow-list; rejected at parse.
  • HS256 substitution (alg-confusion via symmetric) — symmetric algs aren't in the allow-list; rejected at parse.
  • Replayed nonce — every JWS carries a nonce consumed via acme_nonces.UPDATE … WHERE used = FALSE (a single statement; Postgres row-locking serializes the writes). A second consume of the same nonce sees RowsAffected=0 and the verifier returns badNonce.
  • URL spoofing — the protected-header url field MUST match the request URL exactly (RFC 8555 §6.4); a JWS signed for one URL cannot be replayed against another.
  • Multi-signature JWS — RFC 8555 §6.2 forbids; the verifier rejects len(jws.Signatures) != 1 explicitly.
  • kid-vs-jwk confusion — exactly one MUST be present per RFC 8555 §6.2; both-present and neither-present are rejected.
  • kid round-trip mismatch — the verifier's AccountKID closure computes the canonical kid URL for the resolved account-id and compares to the inbound kid; cross-profile replay is rejected because the canonical URL differs.
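The structural checks in the list above (closed alg allow-list, exactly one of kid/jwk, url match) can be sketched with the standard library alone. This is a minimal illustration, not the code in internal/api/acme/jws.go: the names protectedHeader, checkProtectedHeader, and encodeHeader are hypothetical, the example URLs are made up, and signature verification and nonce consumption are deliberately left out.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// protectedHeader holds the JWS protected-header fields the checks need.
type protectedHeader struct {
	Alg   string          `json:"alg"`
	Nonce string          `json:"nonce"`
	URL   string          `json:"url"`
	KID   string          `json:"kid"`
	JWK   json.RawMessage `json:"jwk"`
}

// allowedAlgs mirrors the closed allow-list named in this document.
var allowedAlgs = map[string]bool{"RS256": true, "ES256": true, "EdDSA": true}

// checkProtectedHeader (hypothetical helper) decodes a base64url protected
// header and applies structural checks only. It does not verify the
// signature and does not consume the nonce.
func checkProtectedHeader(protectedB64, requestURL string) error {
	raw, err := base64.RawURLEncoding.DecodeString(protectedB64)
	if err != nil {
		return fmt.Errorf("malformed: %w", err)
	}
	var h protectedHeader
	if err := json.Unmarshal(raw, &h); err != nil {
		return fmt.Errorf("malformed: %w", err)
	}
	if !allowedAlgs[h.Alg] {
		return errors.New("badSignatureAlgorithm: " + h.Alg)
	}
	hasKID := h.KID != ""
	hasJWK := len(h.JWK) > 0 && string(h.JWK) != "null"
	if hasKID == hasJWK { // both present or both absent
		return errors.New("malformed: exactly one of kid or jwk is required")
	}
	if h.URL != requestURL {
		return errors.New("unauthorized: header url does not match request URL")
	}
	if h.Nonce == "" {
		return errors.New("badNonce: missing nonce")
	}
	return nil
}

// encodeHeader builds a base64url header from JSON text, for the demo only.
func encodeHeader(jsonText string) string {
	return base64.RawURLEncoding.EncodeToString([]byte(jsonText))
}

func main() {
	const reqURL = "https://ca.example/acme/profile/p1/new-order" // made-up URL
	good := encodeHeader(`{"alg":"ES256","nonce":"n1","url":"` + reqURL + `","kid":"https://ca.example/acme/profile/p1/account/1"}`)
	none := encodeHeader(`{"alg":"none","nonce":"n1","url":"` + reqURL + `","kid":"k"}`)
	fmt.Println("ES256 header:", checkProtectedHeader(good, reqURL))
	fmt.Println("none header: ", checkProtectedHeader(none, reqURL))
}
```

Rejecting on the header before touching any key material matches the ordering described above, where unsupported algorithms fail at parse time.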

The doubly-signed key-rollover JWS (RFC 8555 §7.3.5, Phase 4) gets its own dedicated verifier in internal/api/acme/keychange.go. Inner-only invariants enforced: MUST use jwk not kid, payload account MUST equal outer kid, payload oldKey MUST canonicalize-equal the registered key (RFC 7638 thumbprint, constant-time compare), inner url MUST equal outer url.

Nonce store integrity

Nonces are persisted in PostgreSQL (acme_nonces table; migration 000025) with a TTL set by CERTCTL_ACME_SERVER_NONCE_TTL (default 5 min). The Phase 5 GC sweeper deletes used / expired rows every 1 minute by default.

Why DB-backed and not in-memory:

  • Survives restart — a multi-replica certctl-server fleet behind a load balancer can issue a nonce on replica A and consume it on replica B. In-memory state would force sticky sessions globally, which the operator can't guarantee in all topologies.
  • Atomic consume — a single UPDATE ... WHERE used = FALSE statement is the consume primitive; Postgres row-locking guarantees exactly one of two concurrent consumes wins.
  • Expiry-bounded — even if the GC sweeper were disabled, the nonce TTL is enforced at consume time (AND expires_at > NOW() in the UPDATE).
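The atomic consume described in the list above can be sketched as a single statement plus a RowsAffected check. The table name and the used / expires_at columns come from this document; the nonce column name, the execer interface, and the in-memory fakeDB (which only stands in for Postgres so the sketch runs without a database) are assumptions for illustration.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// execer is the subset of *sql.DB / *sql.Tx that the consume step needs.
type execer interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
}

// consumeSQL marks a nonce used only if it is still unused and unexpired.
// The "nonce" column name is an assumption.
const consumeSQL = `UPDATE acme_nonces SET used = TRUE
WHERE nonce = $1 AND used = FALSE AND expires_at > NOW()`

var errBadNonce = errors.New("badNonce")

// consumeNonce succeeds for exactly one caller per nonce: every other
// attempt matches zero rows and is reported as badNonce.
func consumeNonce(ctx context.Context, db execer, nonce string) error {
	res, err := db.ExecContext(ctx, consumeSQL, nonce)
	if err != nil {
		return err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if n != 1 {
		return errBadNonce
	}
	return nil
}

// rows is a minimal sql.Result for the fake below.
type rows int64

func (r rows) LastInsertId() (int64, error) { return 0, errors.New("unsupported") }
func (r rows) RowsAffected() (int64, error) { return int64(r), nil }

// fakeDB emulates the single-use semantics in memory (demo only).
type fakeDB struct{ unused map[string]bool }

func (f *fakeDB) ExecContext(_ context.Context, _ string, args ...any) (sql.Result, error) {
	nonce, _ := args[0].(string)
	if f.unused[nonce] {
		delete(f.unused, nonce)
		return rows(1), nil
	}
	return rows(0), nil
}

// tryConsume is a convenience wrapper for the demo.
func tryConsume(db *fakeDB, nonce string) error {
	return consumeNonce(context.Background(), db, nonce)
}

func main() {
	db := &fakeDB{unused: map[string]bool{"abc": true}}
	fmt.Println("first consume: ", tryConsume(db, "abc"))
	fmt.Println("second consume:", tryConsume(db, "abc"))
}
```

Because the expiry predicate sits in the same WHERE clause, an expired nonce fails the same way as a replayed one, independent of whether the GC sweeper has run.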

A nonce-store-side compromise would let an attacker forge nonces. Mitigation: the nonce table is in the same Postgres instance certctl already trusts; a DB compromise is broader than ACME-specific.

HTTP-01 SSRF resistance

The HTTP-01 validator (Phase 3, internal/api/acme/validators.go) fetches http://<identifier>/.well-known/acme-challenge/<token> where the identifier is operator/client-controlled. Without mitigation, this is a textbook SSRF surface — internal services on RFC1918 / link-local / cloud-metadata addresses would be reachable.

Mitigations (defense in depth):

  1. Pre-dial check — validation.ValidateSafeURL rejects URLs whose host parses as a literal reserved IP. Cheap early bail.
  2. Per-dial check — validation.SafeHTTPDialContext is installed on the http.Transport. Every dial re-resolves DNS, rejects reserved IPs, and pins the resolved IP (net.JoinHostPort(ips[0], port)) so a racing DNS rebinding cannot substitute a different IP between resolve and connect.
  3. Per-redirect check — Go's HTTP client re-dials on 3xx; the DialContext runs again, applying the same SSRF guards.
  4. Body cap — the validator's io.LimitReader caps response bodies at 16 KiB. A misbehaving target cannot DoS the validator pool with a multi-GB response.
  5. Bounded redirects — the validator caps redirects at 10 (Go default). A redirect-loop target is bounded.
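Mitigations 4 and 5 above map onto two small standard-library patterns. The sketch below is illustrative only: readCapped, newBoundedClient, and the helper names are hypothetical, and treating an oversized body as an error (rather than silently truncating it) is an assumption about the validator's behavior.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
)

const maxBody = 16 << 10 // 16 KiB, the cap named in this document

var errTooLarge = errors.New("challenge response exceeds body cap")

// readCapped reads at most limit bytes. It asks for one extra byte so that
// a body exactly at the cap can be told apart from one that exceeds it.
func readCapped(r io.Reader, limit int64) ([]byte, error) {
	b, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(b)) > limit {
		return nil, errTooLarge
	}
	return b, nil
}

// newBoundedClient returns a client that stops following redirects after
// maxRedirects requests, the same shape as Go's default policy of 10.
func newBoundedClient(maxRedirects int) *http.Client {
	return &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if len(via) >= maxRedirects {
				return fmt.Errorf("stopped after %d redirects", maxRedirects)
			}
			return nil
		},
	}
}

// bodyOfSize and readCappedString exist only to exercise readCapped.
func bodyOfSize(n int) string { return strings.Repeat("a", n) }

func readCappedString(s string, limit int64) (int, error) {
	b, err := readCapped(strings.NewReader(s), limit)
	return len(b), err
}

func main() {
	n, err := readCappedString(bodyOfSize(maxBody), maxBody)
	fmt.Println("at cap:  ", n, err)
	n, err = readCappedString(bodyOfSize(maxBody+1), maxBody)
	fmt.Println("over cap:", n, err)
	_ = newBoundedClient(10)
}
```

Either way the read stops after at most limit+1 bytes, so a multi-GB response costs the validator a bounded amount of memory.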

Reserved IP set: loopback (127.0.0.0/8 + ::1), link-local (169.254.0.0/16 + fe80::/10), all RFC1918 (10/8, 172.16/12, 192.168/16), cloud-metadata literals (169.254.169.254 explicitly), broadcast, multicast, IPv4-mapped-IPv6 to a reserved IPv4. See internal/validation/ssrf.go::isReservedIPForDial for the full set.
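A rough approximation of the reserved-IP check and the resolve-then-pin dial can be written with the net package. This is a sketch under assumptions, not a copy of internal/validation/ssrf.go: isReservedIP and safeDialContext are hypothetical names, rejecting the dial when any resolved address is reserved is an assumed policy, and net.IP.IsPrivate also covers IPv6 unique-local addresses (fc00::/7), which the list above does not mention.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
)

// isReservedIP approximates the categories listed in this document.
func isReservedIP(ip net.IP) bool {
	if v4 := ip.To4(); v4 != nil {
		ip = v4 // also unwraps IPv4-mapped IPv6 such as ::ffff:10.0.0.1
		if v4.Equal(net.IPv4bcast) {
			return true
		}
	}
	return ip.IsLoopback() || ip.IsPrivate() || ip.IsUnspecified() ||
		ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast() || ip.IsMulticast()
}

// isReservedAddr parses a textual IP; unparseable input is treated as unsafe.
func isReservedAddr(s string) bool {
	ip := net.ParseIP(s)
	return ip == nil || isReservedIP(ip)
}

// safeDialContext has the signature http.Transport.DialContext expects.
// It resolves the host itself, refuses reserved addresses, and dials the
// resolved IP rather than the hostname, so a second DNS answer cannot be
// swapped in between the check and the connect.
func safeDialContext(ctx context.Context, network, addr string) (net.Conn, error) {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		return nil, err
	}
	ips, err := net.DefaultResolver.LookupIP(ctx, "ip", host)
	if err != nil {
		return nil, err
	}
	if len(ips) == 0 {
		return nil, errors.New("no addresses for " + host)
	}
	for _, ip := range ips {
		if isReservedIP(ip) {
			return nil, fmt.Errorf("refusing to dial reserved address %s", ip)
		}
	}
	var d net.Dialer
	return d.DialContext(ctx, network, net.JoinHostPort(ips[0].String(), port))
}

func main() {
	for _, a := range []string{"127.0.0.1", "169.254.169.254", "::ffff:192.168.1.1", "8.8.8.8"} {
		fmt.Printf("%-20s reserved=%v\n", a, isReservedAddr(a))
	}
	_ = safeDialContext // would be assigned to http.Transport.DialContext
}
```

The cloud-metadata literal 169.254.169.254 needs no special case here because it already falls inside the link-local range.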

CodeQL alert #23 flags client.Do(req) in the SCEP-probe call site as go/request-forgery despite the dial-time guard; the analyzer can't trace through a custom Transport.DialContext. Operator-acknowledged false positive (CLAUDE.md task #10) — see the SCEP probe's same-shaped defense for the audit trail.

DNS-01 cache poisoning posture

The DNS-01 validator queries _acme-challenge.<domain> against a single resolver configured by CERTCTL_ACME_SERVER_DNS01_RESOLVER (default 8.8.8.8:53).

Threat: an operator running a private resolver (typical in air-gapped deployments) inherits that resolver's cache-poisoning posture. A poisoned resolver could attest a TXT record the legitimate domain owner never published, allowing an attacker who controls the resolver to forge ACME challenges.

Mitigation:

  • Default 8.8.8.8:53 is Google Public DNS — DNSSEC-validating, operationally hardened, well-monitored.
  • Operators choosing a private resolver own the cache-poisoning posture; docs/acme-server.md § Configuration flags this explicitly.
  • DNSSEC-validation is not enforced by the validator itself — the validator trusts the resolver's answer. Operators wanting strict DNSSEC validation should use a DNSSEC-validating resolver (e.g. 1.1.1.1 or a self-hosted Unbound).
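To make the trust relationship concrete, the sketch below shows a DNS-01 check that sends every query to one configured resolver and accepts whatever TXT answer comes back. The TXT value format (base64url of the SHA-256 of the key authorization) is from RFC 8555 §8.4; the function names, the 5-second dial timeout, and the example domain are assumptions, and the real validator may be structured differently.

```go
package main

import (
	"context"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"net"
	"time"
)

// dns01Value computes the TXT record value expected for a key authorization.
func dns01Value(keyAuth string) string {
	sum := sha256.Sum256([]byte(keyAuth))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

// resolverFor returns a resolver that ignores the system configuration and
// sends every query to addr (for example "8.8.8.8:53").
func resolverFor(addr string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second} // assumed timeout
			return d.DialContext(ctx, network, addr)
		},
	}
}

// txtMatches reports whether any returned TXT record equals the expected value.
func txtMatches(records []string, want string) bool {
	for _, r := range records {
		if r == want {
			return true
		}
	}
	return false
}

// validateDNS01 trusts the resolver's answer as-is: no DNSSEC validation
// happens here, which is the posture this section describes.
func validateDNS01(ctx context.Context, r *net.Resolver, domain, keyAuth string) (bool, error) {
	records, err := r.LookupTXT(ctx, "_acme-challenge."+domain)
	if err != nil {
		return false, err
	}
	return txtMatches(records, dns01Value(keyAuth)), nil
}

func main() {
	want := dns01Value("token.thumbprint") // placeholder key authorization
	fmt.Println("expected TXT value:", want)
	fmt.Println("match:", txtMatches([]string{"unrelated", want}, want))
	_ = resolverFor("8.8.8.8:53") // pass to validateDNS01 for a live lookup
	_ = validateDNS01
}
```

Nothing in this path can distinguish a genuine TXT record from one injected into the resolver's cache, so the choice of resolver is the whole defense.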

TLS-ALPN-01 challenge interception

RFC 8737 §3 explicitly says the validator MUST NOT verify the challenge target's certificate chain — the proof lives in the embedded id-pe-acmeIdentifier extension (OID 1.3.6.1.5.5.7.1.31) of the cert presented during the TLS handshake, not in the chain itself.

Implementation: internal/api/acme/validators.go::TLSALPN01Validator sets tls.Config.InsecureSkipVerify = true with a dedicated //nolint:gosec annotation citing RFC 8737 §3 and the L-001 documentation row in docs/tls.md.

What this means for on-path attackers:

  • An on-path attacker between certctl-server and the challenge target CAN intercept the TLS handshake and present a forged cert. The proof is the embedded extension byte-equality, which the attacker cannot generate without the account key — so interception alone doesn't grant cert issuance.
  • An attacker who has the account key already controls the account per RFC 8555; the TLS-ALPN-01 validator's interception window adds no incremental capability.
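The extension byte-equality check referred to above can be sketched as follows. Per RFC 8737 §3 the acmeIdentifier extension is critical and carries a DER OCTET STRING holding the SHA-256 digest of the key authorization. The function names are hypothetical; in a real validator the extensions would come from the leaf certificate presented in the handshake (for example its Extensions field), and the SAN and ALPN checks that RFC 8737 also requires are omitted here.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"crypto/x509/pkix"
	"encoding/asn1"
	"fmt"
)

// oidACMEIdentifier is id-pe-acmeIdentifier (1.3.6.1.5.5.7.1.31).
var oidACMEIdentifier = asn1.ObjectIdentifier{1, 3, 6, 1, 5, 5, 7, 1, 31}

// expectedExtensionValue DER-encodes the SHA-256 digest of the key
// authorization as an OCTET STRING (asn1.Marshal does this for []byte).
func expectedExtensionValue(keyAuth string) ([]byte, error) {
	sum := sha256.Sum256([]byte(keyAuth))
	return asn1.Marshal(sum[:])
}

// hasValidACMEIdentifier reports whether exts contains a critical
// acmeIdentifier extension whose value matches keyAuth byte for byte.
func hasValidACMEIdentifier(exts []pkix.Extension, keyAuth string) bool {
	want, err := expectedExtensionValue(keyAuth)
	if err != nil {
		return false
	}
	for _, e := range exts {
		if e.Id.Equal(oidACMEIdentifier) {
			return e.Critical && bytes.Equal(e.Value, want)
		}
	}
	return false
}

// extensionFor builds the extension list a conforming challenge responder
// would present; used only to exercise the check.
func extensionFor(keyAuth string, critical bool) []pkix.Extension {
	val, err := expectedExtensionValue(keyAuth)
	if err != nil {
		return nil
	}
	return []pkix.Extension{{Id: oidACMEIdentifier, Critical: critical, Value: val}}
}

func main() {
	exts := extensionFor("token.thumbprint", true) // placeholder key authorization
	fmt.Println("matching key authorization:", hasValidACMEIdentifier(exts, "token.thumbprint"))
	fmt.Println("different key authorization:", hasValidACMEIdentifier(exts, "other.thumbprint"))
}
```

Because the check never consults the certificate chain, setting InsecureSkipVerify for this one connection does not weaken it.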

The integrity property TLS-ALPN-01 actually provides: the challenge target proves possession of the account-key-derived key authorization on a TLS connection bound to the requested identifier (port 443 of the SAN). Operators wanting CA/Browser-Forum-style WebPKI strictness should run a dedicated public-trust CA, not certctl.

Rate-limit tuning

Rate limiting (Phase 5) uses in-memory token buckets with per-(action, key) isolation. Defaults:

  • RATE_LIMIT_ORDERS_PER_HOUR=100 per account.
  • RATE_LIMIT_CONCURRENT_ORDERS=5 per account (pending/ready/processing).
  • RATE_LIMIT_KEY_CHANGE_PER_HOUR=5 per account.
  • RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR=60 per challenge-id.

Tuning:

  • Too loose → enables abuse vectors. A compromised account could burn DB-row throughput; a runaway client could fill the validator pool.
  • Too tight → legitimate flake-out. cert-manager's exponential backoff after a rateLimited problem is conservative; a 1-hour cooldown is a long time for an operator hitting an unexpected limit.

The defaults deliberately err on the loose side: 100/hour is generous for any plausible per-account fleet. For scale, a 50k-cert deployment renewing at the 1/3-validity mark consumes ~12 orders/year/cert, i.e. 50,000 × 12 = 600k orders/year, or roughly 70 orders/hour (600,000 / 8,760 ≈ 68) for the whole deployment when renewals are spread evenly over the year. Even a single account carrying all of that load stays under 100/hour. Tighter limits are appropriate for deployments with many low-trust accounts.

The buckets are in-memory and per-replica. A 3-replica certctl-server fleet effectively has 3× the configured per-account throughput because each replica's bucket fills independently. For deployments where this matters operationally, the right answer is a shared rate-limit store (Redis / Postgres-backed); this is not blocking for the current threat model, where same-account requests typically pin to the same replica via session affinity.
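The per-(action, key) isolation and the per-replica behavior described in this section can be illustrated with a small token-bucket sketch. It is not the Phase 5 implementation: the type and method names are hypothetical, the action label "key-change" is made up, and using the hourly limit as both bucket capacity and refill rate is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// bucketKey isolates state per (action, key), e.g. ("key-change", account ID).
type bucketKey struct{ action, key string }

type bucket struct {
	tokens float64
	last   float64 // time of last update, in seconds
}

// limiter holds one bucket per (action, key). Each process has its own
// limiter, so N replicas allow up to N times the configured rate.
type limiter struct {
	mu      sync.Mutex
	perHour map[string]float64 // action -> hourly limit (assumed = capacity)
	buckets map[bucketKey]*bucket
}

func newLimiter(perHour map[string]float64) *limiter {
	return &limiter{perHour: perHour, buckets: make(map[bucketKey]*bucket)}
}

// allowAt takes the current time as a parameter so the logic is testable.
func (l *limiter) allowAt(action, key string, nowSec float64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	limit, ok := l.perHour[action]
	if !ok {
		return true // assumption: actions without a configured limit pass
	}
	k := bucketKey{action, key}
	b := l.buckets[k]
	if b == nil {
		b = &bucket{tokens: limit, last: nowSec}
		l.buckets[k] = b
	}
	// Refill in proportion to elapsed time, never above capacity.
	b.tokens += (nowSec - b.last) * limit / 3600
	if b.tokens > limit {
		b.tokens = limit
	}
	b.last = nowSec
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// Allow is the wall-clock entry point.
func (l *limiter) Allow(action, key string) bool {
	return l.allowAt(action, key, float64(time.Now().UnixNano())/1e9)
}

func main() {
	l := newLimiter(map[string]float64{"key-change": 5})
	for i := 1; i <= 6; i++ {
		fmt.Printf("acct-1 request %d allowed=%v\n", i, l.Allow("key-change", "acct-1"))
	}
	fmt.Println("acct-2 allowed:", l.Allow("key-change", "acct-2"))
}
```

Since the map lives in process memory, a second replica starts with full buckets for every account, which is the 3× effect noted above.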

Audit trail

Every ACME state mutation writes a row to audit_events. Actor strings distinguish the auth path:

  • acme:<account-id> — kid-path requests (the requesting account signed the JWS).
  • acme-cert-key:<serial> — jwk-path revoke (the cert's own private key signed the JWS).
  • acme-system:gc — scheduler-driven sweeps (no client request).

Operators querying by actor prefix can reconstruct the full history of any ACME-issued cert. See docs/acme-server.md § FAQ "What audit-log events fire" for the event-name catalog.

Out-of-scope threats

Documented to set scope expectations for security reviewers:

  • DDoS at the TLS layer — the certctl-server's TLS listener + upstream load balancer / WAF handle this. The ACME-specific rate limits don't substitute for upstream DDoS protection.
  • cert-manager-side compromise — if cert-manager is compromised, it has both the account key and the private keys of every issued cert. Out of certctl's trust boundary; operators run cert-manager with the same care they'd run any other secret-bearing operator.
  • Compromised certctl-server filesystem — the bootstrap CA key lives at deploy/test/certs/ca.key (or the operator-managed equivalent). A filesystem compromise is broader than ACME-specific and is covered by certctl's HSM / signer-driver architecture (see docs/architecture.md "Signer abstraction").
  • Postgres compromise — the nonce table, account JWKs, and audit log all live in the same Postgres instance. A DB compromise is broader than ACME-specific and is the operator's responsibility to mitigate via standard DB-hardening practices.
  • Supply-chain attacks against go-jose / lib/pq — handled by Dependabot + the make verify security gate; not ACME-specific.

See also