mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-08 12:58:51 +00:00
docs: Phase 2 mechanical file moves to subdirectory structure

Pure git mv operations; no content edits. Internal links remain pointing
at old paths and will be fixed in Phase 11. Per the Phase 1 audit
recommendations at cowork/docs-overhaul-phase-1-audit-2026-05-04/.

35 files moved across 8 audience-organized subdirectories:

  docs/getting-started/ (5):
    quickstart.md, concepts.md, examples.md, advanced-demo.md (was
    demo-advanced.md), why-certctl.md
  docs/reference/ (6):
    architecture.md, api.md (was openapi.md), mcp.md,
    intermediate-ca-hierarchy.md, deployment-model.md (was
    deployment-atomicity.md), vendor-matrix.md (was
    deployment-vendor-matrix.md)
  docs/reference/protocols/ (6):
    acme-server.md, acme-server-threat-model.md, scep-intune.md,
    est.md, crl-ocsp.md, async-ca-polling.md (was async-polling.md)
  docs/operator/ (4):
    security.md, tls.md, database-tls.md, approval-workflow.md
  docs/operator/runbooks/ (3):
    cloud-targets.md (was runbook-cloud-targets.md), expiry-alerts.md
    (was runbook-expiry-alerts.md), disaster-recovery.md
  docs/migration/ (3):
    from-certbot.md (was migrate-from-certbot.md), from-acmesh.md
    (was migrate-from-acmesh.md), cert-manager-coexistence.md (was
    certctl-for-cert-manager-users.md)
  docs/compliance/ (4):
    index.md (was compliance.md), soc2.md (was compliance-soc2.md),
    pci-dss.md (was compliance-pci-dss.md), nist-sp-800-57.md (was
    compliance-nist.md)
  docs/contributor/ (4):
    testing-strategy.md, test-environment.md (was test-env.md),
    ci-pipeline.md, qa-test-suite.md (was qa-test-guide.md)

Deferred to later Phase 2 sub-phases:

  - connectors.md split (Phase 4): docs/connectors.md +
    docs/connector-{apache,f5,iis,k8s,nginx}.md still at top level
  - testing-guide.md prune (Phase 5): docs/testing-guide.md still
    at top level
  - features.md disperse (Phase 6): docs/features.md still at top
    level
  - legacy-est-scep.md split (Phase 7): docs/legacy-est-scep.md
    still at top level
  - ACME walkthrough re-homing (Phase 8): three
    docs/acme-*-walkthrough.md still at top level
  - Upgrade docs archive (Phase 3): two docs/upgrade-*.md still
    at top level

Cross-reference updates (Phase 11) will happen after all moves and
content edits land. Internal links to docs/* paths are temporarily
broken until that phase completes.
@@ -0,0 +1,278 @@
# ACME Server — Threat Model

Security posture for the certctl ACME server endpoint
(`/acme/profile/<id>/*`). Read this before opening a PR that changes
the JWS verifier, the challenge validators, the rate limiter, or the
GC sweeper.

The threat model lives in this dedicated doc (rather than `docs/acme-server.md`)
because security reviewers want a single concentrated reference.
Production deployments under audit should treat this doc as the
canonical answer to "how does certctl resist X?"

## Threat surface map

The ACME server has four ingress surfaces:

1. **JWS-authenticated POST endpoints** — new-account, new-order,
   finalize, key-change, revoke-cert, account update, order POST-as-GET.
   Authenticated by an ECDSA / RSA / EdDSA signature over the request.
2. **Unauthenticated GET endpoints** — directory, new-nonce, ARI
   (renewal-info). Read-only; no authn.
3. **Outbound challenge validators** — HTTP-01, DNS-01, TLS-ALPN-01.
   The certctl-server initiates outbound calls to operator-provided
   identifiers (the SAN list of the requested cert).
4. **Scheduler-driven GC sweeper** — internal-only; no inbound surface.

Threat actors:

- **External Internet attacker** — no certctl credentials; can hit
  unauthenticated endpoints + observe TLS metadata.
- **Authenticated ACME account holder (low-trust)** — has a valid
  account on a profile but should be bounded by profile policy +
  rate limits.
- **On-path attacker** between certctl-server and a challenge target
  (HTTP-01 / DNS-01 / TLS-ALPN-01).
- **Compromised cert holder** — has the private key of a previously-
  issued cert and wants to revoke/exfiltrate.
- **Malicious operator with profile-write access** — can change a
  profile's `acme_auth_mode` or policy, but is the trusted boundary
  per certctl's threat model. Out of scope here; covered by certctl's
  RBAC + audit log.

## JWS forgery resistance

The verifier (`internal/api/acme/jws.go`) accepts only the closed
allow-list `{RS256, ES256, EdDSA}`. The allow-list is passed to
`jose.ParseSigned` so go-jose rejects every other algorithm at parse
time, before any signature work.

Specific attacks blocked:

- **Algorithm confusion (`alg: none`)** — the classic
  unauthenticated fallback (the `none` algorithm, RFC 7518 §3.6).
  Not in allow-list; rejected at parse.
- **HS256 substitution (alg-confusion via symmetric)** — symmetric
  algs aren't in the allow-list; rejected at parse.
- **Replayed nonce** — every JWS carries a nonce consumed via
  `acme_nonces.UPDATE … WHERE used = FALSE` (a single statement;
  Postgres row-locking serializes the writes). A second consume of
  the same nonce sees `RowsAffected=0` and the verifier returns
  `badNonce`.
- **URL spoofing** — the protected-header `url` field MUST match the
  request URL exactly (RFC 8555 §6.4); a JWS signed for one URL
  cannot be replayed against another.
- **Multi-signature JWS** — RFC 8555 §6.2 forbids; the verifier
  rejects `len(jws.Signatures) != 1` explicitly.
- **kid-vs-jwk confusion** — exactly one MUST be present per RFC 8555
  §6.2; both-present and neither-present are rejected.
- **kid round-trip mismatch** — the verifier's `AccountKID` closure
  computes the canonical kid URL for the resolved account-id and
  compares to the inbound `kid`; cross-profile replay is rejected
  because the canonical URL differs.

The doubly-signed key-rollover JWS (RFC 8555 §7.3.5, Phase 4) gets
its own dedicated verifier in `internal/api/acme/keychange.go`.
Inner-only invariants enforced: MUST use `jwk` not `kid`, payload
`account` MUST equal outer `kid`, payload `oldKey` MUST canonicalize-
equal the registered key (RFC 7638 thumbprint, constant-time
compare), inner `url` MUST equal outer `url`.
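
The structural checks in the list above (algorithm allow-list, exactly one of `kid` / `jwk`, `url` binding) can be sketched with the standard library alone. This is an illustration, not the code in `internal/api/acme/jws.go`: the real verifier delegates the allow-list to go-jose at parse time, and the type, function, and error strings below are hypothetical. Signature verification, the single-signature check, and nonce consumption are deliberately left out.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// protectedHeader holds the RFC 8555 §6.2 protected-header fields this sketch inspects.
type protectedHeader struct {
	Alg   string          `json:"alg"`
	Nonce string          `json:"nonce"`
	URL   string          `json:"url"`
	Kid   string          `json:"kid"`
	JWK   json.RawMessage `json:"jwk"`
}

// allowedAlgs is the closed allow-list named in this document.
var allowedAlgs = map[string]bool{"RS256": true, "ES256": true, "EdDSA": true}

// checkProtectedHeader (hypothetical helper) applies structural checks only.
func checkProtectedHeader(raw []byte, requestURL string) error {
	var h protectedHeader
	if err := json.Unmarshal(raw, &h); err != nil {
		return fmt.Errorf("malformed header: %w", err)
	}
	// "none", HS256, and anything else outside the allow-list stop here.
	if !allowedAlgs[h.Alg] {
		return errors.New("badSignatureAlgorithm")
	}
	// Exactly one of kid / jwk: both-present and neither-present are rejected.
	hasKid, hasJWK := h.Kid != "", len(h.JWK) > 0
	if hasKid == hasJWK {
		return errors.New("malformed: exactly one of kid or jwk is required")
	}
	// The signed url must equal the URL the request actually arrived on.
	if h.URL != requestURL {
		return errors.New("unauthorized: url header does not match request URL")
	}
	if h.Nonce == "" {
		return errors.New("badNonce")
	}
	return nil
}

func main() {
	const reqURL = "https://certctl.example.com/acme/profile/prof-corp/new-order"
	good := []byte(`{"alg":"ES256","nonce":"n1","url":"` + reqURL + `","kid":"acct-1"}`)
	none := []byte(`{"alg":"none","nonce":"n1","url":"` + reqURL + `","kid":"acct-1"}`)
	fmt.Println(checkProtectedHeader(good, reqURL))
	fmt.Println(checkProtectedHeader(none, reqURL))
}
```

Running the checks before any cryptographic work mirrors the ordering described above: a request with a disallowed algorithm never reaches signature verification.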

## Nonce store integrity

Nonces are persisted in PostgreSQL (`acme_nonces` table; migration
000025) with a TTL set by `CERTCTL_ACME_SERVER_NONCE_TTL` (default
5 min). The Phase 5 GC sweeper deletes used / expired rows every
minute by default.

Why DB-backed and not in-memory:

- **Survives restart and spans replicas** — a multi-replica
  certctl-server fleet behind a load balancer can issue a nonce on
  replica A and consume it on replica B. In-memory state would force
  sticky sessions globally, which the operator can't guarantee in all
  topologies.
- **Atomic consume** — a single `UPDATE ... WHERE used = FALSE`
  statement is the consume primitive; Postgres row-locking guarantees
  exactly one of two concurrent consumes wins.
- **Expiry-bounded** — even if the GC sweeper were disabled, the
  nonce TTL is enforced at consume time
  (`AND expires_at > NOW()` in the UPDATE).

A nonce-store-side compromise would let an attacker forge nonces.
Mitigation: the nonce table is in the same Postgres instance certctl
already trusts; a DB compromise is broader than ACME-specific.

## HTTP-01 SSRF resistance

The HTTP-01 validator (Phase 3, `internal/api/acme/validators.go`)
fetches `http://<identifier>/.well-known/acme-challenge/<token>`
where the identifier is operator/client-controlled. Without
mitigation, this is a textbook SSRF surface — internal services on
RFC1918 / link-local / cloud-metadata addresses would be reachable.

Mitigations (defense in depth):

1. **Pre-dial check** — `validation.ValidateSafeURL` rejects URLs
   whose host parses as a literal reserved IP. Cheap early bail.
2. **Per-dial check** — `validation.SafeHTTPDialContext` is installed
   on the `http.Transport`. Every dial re-resolves DNS, rejects
   reserved IPs, and **pins the resolved IP**
   (`net.JoinHostPort(ips[0], port)`) so a racing DNS rebinding cannot
   substitute a different IP between resolve and connect.
3. **Per-redirect check** — Go's HTTP client re-dials on 3xx; the
   `DialContext` runs again, applying the same SSRF guards.
4. **Body cap** — the validator's `io.LimitReader` caps response
   bodies at 16 KiB. A misbehaving target cannot DoS the validator
   pool with a multi-GB response.
5. **Bounded redirects** — the validator caps redirects at 10 (Go
   default). A redirect-loop target is bounded.

Reserved IP set: loopback (127.0.0.0/8 + ::1), link-local
(169.254.0.0/16 + fe80::/10), all RFC1918 (10/8, 172.16/12, 192.168/16),
cloud-metadata literals (169.254.169.254 explicitly), broadcast,
multicast, IPv4-mapped-IPv6 to a reserved IPv4. See
`internal/validation/ssrf.go::isReservedIPForDial` for the full set.

CodeQL alert #23 flags `client.Do(req)` in the SCEP-probe call site
as `go/request-forgery` despite the dial-time guard; the analyzer
can't trace through a custom `Transport.DialContext`. Operator-
acknowledged false positive (CLAUDE.md task #10) — see the SCEP
probe's same-shaped defense for the audit trail.

## DNS-01 cache poisoning posture

The DNS-01 validator queries
`_acme-challenge.<domain>` against a single resolver configured by
`CERTCTL_ACME_SERVER_DNS01_RESOLVER` (default `8.8.8.8:53`).

Threat: an operator running a private resolver (typical in air-gapped
deployments) inherits that resolver's cache-poisoning posture. A
poisoned resolver could attest a TXT record the legitimate domain
owner never published, allowing an attacker who controls the
resolver to forge ACME challenges.

Mitigation:

- Default `8.8.8.8:53` is Google Public DNS — DNSSEC-validating,
  operationally hardened, well-monitored.
- Operators choosing a private resolver own the cache-poisoning
  posture. The doc explicitly flags this in
  `docs/acme-server.md` § Configuration.
- DNSSEC-validation is **not** enforced by the validator itself —
  the validator trusts the resolver's answer. Operators wanting
  strict DNSSEC validation should use a DNSSEC-validating resolver
  (e.g. `1.1.1.1` or a self-hosted Unbound).
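
To make the trust relationship concrete, here is a hedged sketch of a DNS-01 lookup pinned to one resolver. It is not the certctl validator: `resolverFor`, `dns01Value`, and `validateDNS01` are hypothetical names, and the only facts taken from elsewhere are the RFC 8555 §8.4 rule (the TXT value is the unpadded base64url SHA-256 digest of the key authorization) and the `_acme-challenge.` prefix.

```go
package main

import (
	"context"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"net"
	"time"
)

// dns01Value computes the TXT value RFC 8555 §8.4 expects for a key authorization.
func dns01Value(keyAuthorization string) string {
	sum := sha256.Sum256([]byte(keyAuthorization))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

// resolverFor returns a resolver that sends every query to resolverAddr,
// ignoring the system resolver configuration.
func resolverFor(resolverAddr string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			return d.DialContext(ctx, network, resolverAddr)
		},
	}
}

// validateDNS01 trusts whatever the resolver returns: no DNSSEC check happens here.
func validateDNS01(ctx context.Context, r *net.Resolver, domain, keyAuth string) (bool, error) {
	txts, err := r.LookupTXT(ctx, "_acme-challenge."+domain)
	if err != nil {
		return false, err
	}
	want := dns01Value(keyAuth)
	for _, t := range txts {
		if t == want {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	r := resolverFor("8.8.8.8:53") // the documented default
	_ = r                          // a validator would call validateDNS01(ctx, r, domain, keyAuth)
	fmt.Println(dns01Value("example-token.example-thumbprint"))
}
```

The comparison is the whole proof, which is why the resolver's integrity matters: whoever controls the answer to the TXT query controls the outcome.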

## TLS-ALPN-01 challenge interception

RFC 8737 §3 explicitly says the validator MUST NOT verify the
challenge target's certificate chain — the proof lives in the
embedded `id-pe-acmeIdentifier` extension (OID 1.3.6.1.5.5.7.1.31)
of the cert presented during the TLS handshake, not in the chain
itself.

Implementation: `internal/api/acme/validators.go::TLSALPN01Validator`
sets `tls.Config.InsecureSkipVerify = true` with a dedicated
`//nolint:gosec` annotation citing RFC 8737 §3 and the L-001
documentation row in `docs/tls.md`.

What this means for on-path attackers:

- An on-path attacker between certctl-server and the challenge target
  CAN intercept the TLS handshake and present a forged cert. The
  proof is the embedded extension byte-equality, which the attacker
  cannot generate without the account key — so interception alone
  doesn't grant cert issuance.
- An attacker who has the account key already controls the account
  per RFC 8555; the TLS-ALPN-01 validator's interception window adds
  no incremental capability.

The integrity property TLS-ALPN-01 actually provides: the challenge
target proves possession of the account-key-derived key authorization
on a TLS connection bound to the requested identifier (port 443 of
the SAN). Operators wanting CA/Browser-Forum-style WebPKI strictness
should run a dedicated public-trust CA, not certctl.

## Rate-limit tuning

Rate limiting (Phase 5) uses in-memory token buckets with
per-(action, key) isolation. Defaults:

- `RATE_LIMIT_ORDERS_PER_HOUR=100` per account.
- `RATE_LIMIT_CONCURRENT_ORDERS=5` per account (pending/ready/processing).
- `RATE_LIMIT_KEY_CHANGE_PER_HOUR=5` per account.
- `RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR=60` per challenge-id.

Tuning:

- **Too loose** → enables abuse vectors. A compromised account could
  burn DB-row throughput; a runaway client could fill the validator
  pool.
- **Too tight** → legitimate flake-out. cert-manager's exponential
  backoff after a `rateLimited` problem is conservative; a 1-hour
  cooldown is a long time for an operator hitting an unexpected limit.

Defaults intentionally err on the loose side — 100/hour
is generous for any plausible per-account fleet (a 50k-cert
deployment renewing at the 1/3-validity mark consumes ~12
orders/year/cert ≈ 600k orders/year ≈ 70 orders/hour for the whole
fleet, before it is even spread across accounts). Tighter limits are
appropriate for deployments with many low-trust accounts.

The buckets are in-memory + per-replica. A 3-replica certctl-server
fleet effectively has 3× the configured per-account throughput
because each replica's bucket fills independently. For deployments
where this matters operationally, the right answer is a shared rate-
limit store (Redis / Postgres-backed); this is not blocking for the
current threat model, where same-account requests typically pin to
the same replica via session affinity.

## Audit trail

Every ACME state mutation writes a row to `audit_events`. Actor strings
distinguish the auth path:

- `acme:<account-id>` — kid-path requests (the requesting account
  signed the JWS).
- `acme-cert-key:<serial>` — jwk-path revoke (the cert's own private
  key signed the JWS).
- `acme-system:gc` — scheduler-driven sweeps (no client request).

Operators querying by actor prefix can reconstruct the full history
of any ACME-issued cert. See
`docs/acme-server.md` § FAQ "What audit-log events fire" for the
event-name catalog.

## Out-of-scope threats

Documented to set scope expectations for security reviewers:

- **DDoS at the TLS layer** — the certctl-server's TLS listener +
  upstream load balancer / WAF handle this. The ACME-specific rate
  limits don't substitute for upstream DDoS protection.
- **cert-manager-side compromise** — if cert-manager is compromised,
  it has both the account key and the private keys of every issued
  cert. Out of certctl's trust boundary; operators run cert-manager
  with the same care they'd run any other secret-bearing operator.
- **Compromised certctl-server filesystem** — the bootstrap CA key
  lives at `deploy/test/certs/ca.key` (or the operator-managed
  equivalent). A filesystem compromise is broader than ACME-specific
  and is covered by certctl's HSM / signer-driver architecture (see
  `docs/architecture.md` "Signer abstraction").
- **Postgres compromise** — the nonce table, account JWKs, and
  audit log all live in the same Postgres instance. A DB compromise
  is broader than ACME-specific and is the operator's responsibility
  to mitigate via standard DB-hardening practices.
- **Supply-chain attacks against go-jose / lib/pq** — handled by
  Dependabot + the `make verify` security gate; not ACME-specific.

## See also

- [`docs/acme-server.md`](./acme-server.md) — operator-facing reference.
- [`docs/tls.md`](./tls.md) — TLS posture, including the L-001
  table of `InsecureSkipVerify` justifications (TLS-ALPN-01 row).
- [`internal/api/acme/jws.go`](../internal/api/acme/jws.go) — verifier
  source.
- [`internal/api/acme/validators.go`](../internal/api/acme/validators.go)
  — challenge validator pool.
- [`internal/validation/ssrf.go`](../internal/validation/ssrf.go) —
  SSRF-defense primitives.
@@ -0,0 +1,646 @@
# certctl ACME Server (Built-in)

certctl ships an RFC 8555 + RFC 9773 ARI ACME server endpoint at
`/acme/profile/<profile-id>/*`. Any RFC 8555 client (cert-manager 1.15+,
Caddy, Traefik, win-acme, certbot, Posh-ACME) can integrate with certctl
as an ACME issuer with no certctl-side modification — closing the
"deploy a certctl agent on every K8s node" friction that costs deals to
external PKI vendors today.

> **Phase status (2026-05-03):** Phase 6 — full operator-facing
> reference. The functional surface is complete (Phases 1a-5); this
> doc is the canonical procurement-readability reference. New: client-
> walkthrough docs for [cert-manager](./acme-cert-manager-walkthrough.md),
> [Caddy](./acme-caddy-walkthrough.md), and
> [Traefik](./acme-traefik-walkthrough.md); a dedicated
> [threat model](./acme-server-threat-model.md); a section-by-section
> RFC 8555 + RFC 9773 conformance statement; a 5-failure-mode
> troubleshooting playbook; a tested-clients version pinning table.
> Track shipped phases via `git log --grep='acme-server:'`.

## Configuration

All ACME-server config uses the `CERTCTL_ACME_SERVER_*` env-var prefix
(distinct from `CERTCTL_ACME_*` which configures the consumer-side
issuer connector). The struct definition lives in
`internal/config/config.go::ACMEServerConfig`.

| Env var | Default | Phase | Description |
|--------------------------------------------------|------------------------|-------|-------------|
| `CERTCTL_ACME_SERVER_ENABLED` | `false` | 1a | Master enable flag. Phase 1a's handler is constructed unconditionally so the registry shape stays stable; routes are registered in `internal/api/router/router.go::RegisterHandlers` regardless. Operators flip this on after configuring per-profile auth_mode. |
| `CERTCTL_ACME_SERVER_DEFAULT_AUTH_MODE` | `trust_authenticated` | 1a | Default value for `certificate_profiles.acme_auth_mode` on newly-created profiles. Existing profiles retain their stored value. Per-profile column is the source of truth at request time. |
| `CERTCTL_ACME_SERVER_DEFAULT_PROFILE_ID` | `""` | 1a | When set, `/acme/*` shorthand mirrors `/acme/profile/<DefaultProfileID>/*` for single-profile deployments. When empty, requests to the shorthand return RFC 7807 + RFC 8555 §6.7 `userActionRequired`. |
| `CERTCTL_ACME_SERVER_NONCE_TTL` | `5m` | 1a | How long an issued ACME nonce remains valid before the JWS verifier (Phase 1b) returns `urn:ietf:params:acme:error:badNonce` per RFC 8555 §6.5.1. Tune up if cert-manager + certctl clocks frequently skew. |
| `CERTCTL_ACME_SERVER_TOS_URL` | `""` | 1a | Optional `meta.termsOfService` URL in the directory document. |
| `CERTCTL_ACME_SERVER_WEBSITE` | `""` | 1a | Optional `meta.website` URL in the directory document. |
| `CERTCTL_ACME_SERVER_CAA_IDENTITIES` | (empty) | 1a | Comma-separated `meta.caaIdentities` list. |
| `CERTCTL_ACME_SERVER_EAB_REQUIRED` | `false` | 1a | `meta.externalAccountRequired` advertisement. EAB enforcement is a follow-up; Phase 1a only advertises. |
| `CERTCTL_ACME_SERVER_ORDER_TTL` | `24h` | 2 | Reserved field, parsed in Phase 1a so operators can set it ahead of Phase 2's order endpoints. |
| `CERTCTL_ACME_SERVER_AUTHZ_TTL` | `24h` | 2 | Reserved. |
| `CERTCTL_ACME_SERVER_HTTP01_CONCURRENCY` | `10` | 3 | Reserved. |
| `CERTCTL_ACME_SERVER_DNS01_RESOLVER` | `8.8.8.8:53` | 3 | Reserved. |
| `CERTCTL_ACME_SERVER_DNS01_CONCURRENCY` | `10` | 3 | Reserved. |
| `CERTCTL_ACME_SERVER_TLSALPN01_CONCURRENCY` | `10` | 3 | Reserved. |
| `CERTCTL_ACME_SERVER_ARI_ENABLED` | `true` | 4 | Toggles the RFC 9773 ARI surface — both the `renewalInfo` URL in the directory document and the GET `/renewal-info/<cert-id>` handler. Set to `false` to drop ARI from the directory; ACME clients fall back to static renewal scheduling. |
| `CERTCTL_ACME_SERVER_ARI_POLL_INTERVAL` | `6h` | 4 | Server-policy `Retry-After` value the ARI handler emits on a 200 response. RFC 9773 §4.2 leaves this server-policy. Tighten to `1h` for short-lived certs; loosen to `24h` for standard 90-day certs. |
| `CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR` | `100` | 5 | Per-account orders/hour cap. `0` disables. Hits return RFC 7807 + RFC 8555 §6.7 `urn:ietf:params:acme:error:rateLimited` with `Retry-After`. In-memory token-bucket; restart wipes the counter (eventual-consistency caps are acceptable). |
| `CERTCTL_ACME_SERVER_RATE_LIMIT_CONCURRENT_ORDERS` | `5` | 5 | Per-account cap on simultaneously-active orders (status in pending/ready/processing). `0` disables. Same RFC 7807 + RFC 8555 §6.7 problem shape as the per-hour cap. |
| `CERTCTL_ACME_SERVER_RATE_LIMIT_KEY_CHANGE_PER_HOUR` | `5` | 5 | Per-account key-rollover cap. `0` disables. Default 5/hour: rollovers should be rare; a flood is an attack signal. |
| `CERTCTL_ACME_SERVER_RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR` | `60` | 5 | Per-challenge-id respond cap. `0` disables. Defends against retry storms from a misbehaving client. Keyed by challenge-id (not account-id) so a flood against one challenge doesn't drain the account's whole budget. |
| `CERTCTL_ACME_SERVER_GC_INTERVAL` | `1m` | 5 | Tick interval for the ACME GC scheduler loop. On each tick: (1) DELETE used / expired nonces; (2) UPDATE pending authzs whose `expires_at < NOW()` to `expired`; (3) UPDATE pending/ready/processing orders whose `expires_at < NOW()` to `invalid`. Each sweep is a single SQL statement; the loop is idempotent + bounded by a 1m per-sweep timeout. `0` disables the loop. |

## Per-profile auth mode

Two modes per `certificate_profiles.acme_auth_mode`:

- **`trust_authenticated`** (default for internal PKI). The JWS-
  authenticated ACME account is trusted to issue certs for any
  identifier the profile policy allows; there is no per-identifier
  ownership proof. The most common certctl use case.
- **`challenge`**. Full HTTP-01 + DNS-01 + TLS-ALPN-01 validation per
  RFC 8555 §8. Required when certctl is exposing public-trust-style PKI.

A single certctl-server can serve both modes simultaneously — the mode
is read from the bound profile's column at request time, not cached at
server start. Operators can flip a profile's mode via SQL and the next
order picks up the new mode without restart.

The `CERTCTL_ACME_SERVER_DEFAULT_AUTH_MODE` env var sets the default
value for newly-created profiles (e.g. via the certctl API). Existing
profile rows retain whatever value they were created with.

## TLS trust bootstrap (read this before configuring cert-manager)

When certctl-server uses a self-signed TLS bootstrap cert
(`deploy/test/certs/server.crt` is the demo default; see
[`docs/tls.md`](./tls.md)), cert-manager 1.15+ will refuse to talk to
the directory URL unless the certctl root is trusted. The fix lives in
`ClusterIssuer.spec.acme.caBundle`:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: certctl-test
spec:
  acme:
    server: https://certctl.example.com:8443/acme/profile/prof-corp/directory
    email: ops@example.com
    # base64-encoded PEM of certctl's self-signed root
    # (kept on its own line: inside a `|` block scalar a trailing
    # `# ...` would become part of the value, not a comment)
    caBundle: LS0tLS1CRUdJTi...
    privateKeySecretRef:
      name: certctl-test-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
```

The `caBundle` value is the base64-encoded PEM of the root that signed
your certctl-server's TLS certificate. Extract it from your operator
bootstrap (e.g. `cat deploy/test/certs/ca.crt | base64 -w0`).

This is the single biggest first-time-deploy footgun on the cert-manager
integration path. The full cert-manager walkthrough lands in Phase 6;
the `caBundle` requirement is flagged here in Phase 1a's docs because
operators hit it the moment they try to point a real ACME client at
certctl.

## Auth-mode decision tree

Use `trust_authenticated` when:

- The certctl deployment serves **internal-only PKI** (intranet certs,
  service-mesh certs, IoT bootstrap). Identifiers in your CSRs are
  controlled by your infrastructure, not by the public Internet.
- You don't have HTTP/DNS reachability **from certctl-server back to
  the ACME client's solver** (e.g., the client lives in an isolated
  network segment certctl-server can't reach).
- You want the simplest cert-manager integration: cert-manager submits
  a CSR, certctl issues; no out-of-band ownership proof.
- You're issuing under your own root CA whose trust is operator-managed
  (NOT WebPKI). Public CAs cannot use this mode — RFC 8555 §8 ownership
  proof is non-negotiable for public-trust roots.

Use `challenge` when:

- The deployment is **public-trust-style PKI** — even if your root is
  privately operated, you want CA/Browser Forum-style ownership-proof
  semantics so a stolen account key can't be used to issue for arbitrary
  identifiers.
- You have HTTP-01 / DNS-01 / TLS-ALPN-01 reachability from the
  certctl-server to the ACME client's solver. (HTTP-01 needs port 80
  ingress to the client; DNS-01 needs DNS recursion; TLS-ALPN-01 needs
  port 443 ingress.)
- You want defense-in-depth: an account-key compromise gains the
  attacker nothing without also compromising the solver-side
  infrastructure.

A single certctl-server can run both modes simultaneously — the auth
mode is a per-profile column on `certificate_profiles.acme_auth_mode`,
read at request time. Operators flip a profile's mode via SQL or the
profile API, and the next order picks up the new mode without restart.

## Endpoints

Routes registered in `internal/api/router/router.go::RegisterHandlers`:

| Method | Path | RFC ref | Auth | Description |
|--------|-------------------------------------------------------|-----------------|----------|-------------|
| GET | `/acme/profile/{id}/directory` | RFC 8555 §7.1.1 | unauth | Per-profile directory document. |
| HEAD | `/acme/profile/{id}/new-nonce` | RFC 8555 §7.2 | unauth | Returns 200 + Replay-Nonce header. |
| GET | `/acme/profile/{id}/new-nonce` | RFC 8555 §7.2 | unauth | Returns 204 + Replay-Nonce header. |
| POST | `/acme/profile/{id}/new-account` | RFC 8555 §7.3 | JWS jwk | Register a new account; idempotent re-registration of an existing JWK returns the existing row. |
| POST | `/acme/profile/{id}/account/{acc_id}` | RFC 8555 §7.3.2 + §7.3.6 | JWS kid | Update contact list, deactivate, or POST-as-GET (RFC 8555 §6.3) to fetch the account. |
| POST | `/acme/profile/{id}/new-order` | RFC 8555 §7.4 | JWS kid | Submit an order; identifier validation runs before order creation. |
| POST | `/acme/profile/{id}/order/{ord_id}` | RFC 8555 §7.4 | JWS kid | POST-as-GET fetch of an order's current state. |
| POST | `/acme/profile/{id}/order/{ord_id}/finalize` | RFC 8555 §7.4 | JWS kid | Submit the CSR + finalize. Issues + persists managed cert row + version. |
| POST | `/acme/profile/{id}/authz/{authz_id}` | RFC 8555 §7.5 | JWS kid | POST-as-GET fetch of an authorization. |
| POST | `/acme/profile/{id}/challenge/{chall_id}` | RFC 8555 §7.5.1 | JWS kid | Submit a challenge for validation. Dispatches to a bounded-concurrency worker pool; clients poll authz for the eventual result. |
| POST | `/acme/profile/{id}/cert/{cert_id}` | RFC 8555 §7.4.2 | JWS kid | POST-as-GET cert chain download (PEM). |
| POST | `/acme/profile/{id}/key-change` | RFC 8555 §7.3.5 | JWS kid (outer) + jwk (inner) | Doubly-signed account-key rollover. |
| POST | `/acme/profile/{id}/revoke-cert` | RFC 8555 §7.6 | JWS kid OR jwk | Revoke a cert via the issuing account's key OR the cert's own private key. Routes through the certctl revocation pipeline. |
| GET | `/acme/profile/{id}/renewal-info/{cert_id}` | RFC 9773 | unauth | Fetch the suggested renewal window for a cert (cert-id is `base64url(AKI).base64url(serial)` per RFC 9773 §4.1). Response carries `Retry-After`. |
| GET | `/acme/directory` | RFC 8555 §7.1.1 | unauth | Shorthand path; mirrors per-profile when `CERTCTL_ACME_SERVER_DEFAULT_PROFILE_ID` is set. |
| HEAD | `/acme/new-nonce` | RFC 8555 §7.2 | unauth | Shorthand. |
| GET | `/acme/new-nonce` | RFC 8555 §7.2 | unauth | Shorthand. |
| POST | `/acme/new-account` | RFC 8555 §7.3 | JWS jwk | Shorthand. |
| POST | `/acme/account/{acc_id}` | RFC 8555 §7.3.2 + §7.3.6 | JWS kid | Shorthand. |
| POST | `/acme/new-order` | RFC 8555 §7.4 | JWS kid | Shorthand. |
| POST | `/acme/order/{ord_id}` | RFC 8555 §7.4 | JWS kid | Shorthand. |
| POST | `/acme/order/{ord_id}/finalize` | RFC 8555 §7.4 | JWS kid | Shorthand. |
| POST | `/acme/authz/{authz_id}` | RFC 8555 §7.5 | JWS kid | Shorthand. |
| POST | `/acme/cert/{cert_id}` | RFC 8555 §7.4.2 | JWS kid | Shorthand. |
| POST | `/acme/key-change` | RFC 8555 §7.3.5 | JWS kid (outer) + jwk (inner) | Shorthand. |
| POST | `/acme/revoke-cert` | RFC 8555 §7.6 | JWS kid OR jwk | Shorthand. |
| GET | `/acme/renewal-info/{cert_id}` | RFC 9773 | unauth | Shorthand. |

After Phase 4, the full RFC 8555 + RFC 9773 surface is live. RFC 8739
(short-lived certs) and EAB enforcement remain follow-up work; cert-
manager + boulder-tested clients work today against the surface above.

## RFC 8555 + RFC 9773 conformance statement

Honest disclosure of what's implemented, where, and what's not. Procurement
engineers running gap analyses against cert-manager + Let's Encrypt's
conformance posture should read this section before anything else.

### Implemented

| Section | Surface | Phase | First commit |
|---------|---------|-------|--------------|
| RFC 8555 §6.2 | JWS auth + RS256/ES256/EdDSA allow-list | 1b | `27bd660` |
| RFC 8555 §6.3 | POST-as-GET | 1b | `27bd660` |
| RFC 8555 §6.4 | URL-header binding to request URL | 1b | `27bd660` |
| RFC 8555 §6.5 | Replay-Nonce + DB-backed nonce store | 1a | `e146b00` |
| RFC 8555 §6.7 | RFC 7807 problem documents | 1a | `e146b00` |
| RFC 8555 §7.1 | Directory | 1a | `e146b00` |
| RFC 8555 §7.2 | new-nonce HEAD + GET | 1a | `e146b00` |
| RFC 8555 §7.3 | new-account + idempotent re-registration | 1b | `27bd660` |
| RFC 8555 §7.3.2 + §7.3.6 | account update + deactivation | 1b | `27bd660` |
| RFC 8555 §7.3.5 | doubly-signed key rollover | 4 | `0299e4a` |
| RFC 8555 §7.4 | new-order + finalize + cert download | 2 | `4ee486e` |
| RFC 8555 §7.5 | authz POST-as-GET | 2 | `4ee486e` |
| RFC 8555 §7.5.1 | challenge response | 3 | `7e22204` |
| RFC 8555 §7.6 | revoke-cert (kid + jwk paths) | 4 | `0299e4a` |
| RFC 8555 §8.3 | HTTP-01 challenge validator | 3 | `7e22204` |
| RFC 8555 §8.4 | DNS-01 challenge validator | 3 | `7e22204` |
| RFC 8737 | TLS-ALPN-01 challenge validator | 3 | `7e22204` |
| RFC 9773 | ACME Renewal Information (ARI) | 4 | `0299e4a` |
|
||||
|
||||
### Not implemented (procurement-honest)
|
||||
|
||||
| Spec area | Status | Notes |
|-----------|--------|-------|
| RFC 8555 §7.3.4: External Account Binding (EAB) | **Not implemented.** | Advertised in directory `meta.externalAccountRequired` but enforcement is a follow-up. Operators relying on EAB for account-creation gating should layer an upstream WAF. |
| RFC 8555 §8.4 + §7.4: Wildcard with `*.` prefix > 1 level | **Not implemented.** | Single-level wildcards (e.g. `*.example.com`) work end-to-end. Multi-level wildcards (`*.*.example.com`) are RFC-spec-ambiguous and rejected at the identifier-validation layer. |
| RFC 8739: Short-lived (STAR) certs | **Not implemented.** | Operators wanting <7-day validity tune the bound issuer's TTL directly via `CertificateProfile.MaxTTLSeconds`; the ACME wire shape doesn't expose a separate notion. |
| Cross-CA proxying | **Not implemented.** | Each profile binds to one issuer. Multi-CA federation (one ACME account → multi-CA selection per identifier) is roadmap. |
| RFC 8555 §6.7: `accountDoesNotExist` problem with hint URL | Partial. | Sentinel returns `accountDoesNotExist`; the optional hint URL embedding the `kid` is not emitted. cert-manager doesn't consume it. |
|
||||
|
||||
If a procurement-side gap analysis turns up something not in either
|
||||
table above, the answer is "we don't know yet" — operator-side issues
|
||||
welcome.
|
||||
|
||||
## Finalize routing through `CertificateService.Create` (Phase 2 architecture)
|
||||
|
||||
The finalize path mirrors how every other certctl issuance surface
|
||||
(EST, SCEP, agent, REST API) routes through the canonical pipeline:
|
||||
|
||||
1. JWS-verify the request (`internal/api/acme/jws.go`).
|
||||
2. Validate the CSR's DNS-name set equals the order's identifier set
|
||||
exactly (case-folded). Mismatches return RFC 8555
|
||||
`urn:ietf:params:acme:error:badCSR`.
|
||||
3. Update the order row to `status=processing` (`s.tx.WithinTx` +
|
||||
`auditService.RecordEventWithTx` — atomic with audit row).
|
||||
4. Issue the cert via the bound profile's `IssuerConnector` adapter
|
||||
(same `IssueCertificate(ctx, commonName, sans, csrPEM, ekus,
|
||||
maxTTLSeconds, mustStaple)` call EST/SCEP/agent take).
|
||||
5. Insert the `managed_certificates` row via
|
||||
`service.CertificateService.Create(ctx, *ManagedCertificate, actor)`.
|
||||
Source is stamped `domain.CertificateSourceACME` so operators can
|
||||
bulk-revoke ACME-issued certs by filtering on `Source=ACME`.
|
||||
6. Insert the `certificate_versions` row +
|
||||
transition the order to `status=valid` with `certificate_id` set
|
||||
(one final `WithinTx` covering both writes + the audit row).
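
Step 2's exact-set comparison can be sketched as follows. This is a minimal illustration, not certctl's code: the function names `normalize` and `sameIdentifierSet` are hypothetical, and case folding is assumed to be plain ASCII lower-casing of DNS names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// normalize lower-cases, de-duplicates, and sorts a list of DNS names so two
// lists can be compared as sets. (Hypothetical helper.)
func normalize(names []string) []string {
	seen := make(map[string]struct{}, len(names))
	out := make([]string, 0, len(names))
	for _, n := range names {
		n = strings.ToLower(strings.TrimSpace(n))
		if _, dup := seen[n]; dup || n == "" {
			continue
		}
		seen[n] = struct{}{}
		out = append(out, n)
	}
	sort.Strings(out)
	return out
}

// sameIdentifierSet reports whether the CSR's DNS names and the order's
// identifiers are the same set after case folding. A false result is where
// a server would answer with urn:ietf:params:acme:error:badCSR.
func sameIdentifierSet(csrNames, orderIDs []string) bool {
	a, b := normalize(csrNames), normalize(orderIDs)
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	order := []string{"example.com", "www.example.com"}
	fmt.Println(sameIdentifierSet([]string{"WWW.example.com", "example.com"}, order)) // true
	fmt.Println(sameIdentifierSet([]string{"example.com"}, order))                    // false
}
```

Comparing sorted, de-duplicated slices keeps the check order-independent, so a CSR that lists the same names in a different order still finalizes.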
|
||||
|
||||
This means RenewalPolicy, CertificateProfile, per-issuer-type
|
||||
Prometheus metrics, audit rows, and revocation-pipeline integration
|
||||
all apply uniformly to ACME-issued certs via the same code path that
|
||||
already serves EST/SCEP/agent/REST issuance.
|
||||
|
||||
The atomicity boundary: there is a brief window between step 5 (cert
|
||||
exists) and step 6 (order shows valid) where the order row still says
|
||||
`processing`. Phase 5's GC scheduler reconciles. The actor string on
|
||||
audit rows is `acme:<account-id>`.
|
||||
|
||||
## JWS verification (Phase 1b)
|
||||
|
||||
Every JWS-authenticated POST runs through the verifier at
|
||||
`internal/api/acme/jws.go::VerifyJWS`. The verifier enforces:
|
||||
|
||||
1. The JWS parses as a flattened single-signature object (multi-sig is
|
||||
rejected per RFC 8555 §6.2).
|
||||
2. The signature algorithm is in the closed allow-list `{RS256, ES256,
|
||||
EdDSA}` per RFC 8555 §6.2 — `none`, `HS256`, and every other alg
|
||||
are refused at parse time.
|
||||
3. The protected header carries exactly one of `kid` (registered
|
||||
account) or `jwk` (new-account flow); endpoints declare which they
|
||||
require.
|
||||
4. The protected header `url` matches the inbound request URL exactly.
|
||||
5. The protected header `nonce` is consumed against the
|
||||
`acme_nonces` store; missing / replayed / expired nonces return
|
||||
`urn:ietf:params:acme:error:badNonce` per RFC 8555 §6.5.1.
|
||||
6. On the `kid` path: the kid URL round-trips against the canonical
|
||||
per-profile shape, the referenced account exists, and its status
|
||||
is `valid`. Deactivated / revoked accounts cannot authenticate.
|
||||
7. The signature verifies against the resolved key (registered
|
||||
account's stored JWK on the kid path; embedded jwk on the jwk path).
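
Checks 2 through 5 operate on the decoded protected header. The sketch below shows one way those header rules could look, assuming a hypothetical `checkProtectedHeader` helper; it only tests that a nonce is present (the real check consumes it from `acme_nonces`), it does not verify signatures, and it is not the actual `VerifyJWS` code.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// protectedHeader holds only the fields the checks below need.
type protectedHeader struct {
	Alg   string          `json:"alg"`
	Kid   string          `json:"kid"`
	JWK   json.RawMessage `json:"jwk"`
	Nonce string          `json:"nonce"`
	URL   string          `json:"url"`
}

// Closed allow-list from the text above.
var allowedAlgs = map[string]bool{"RS256": true, "ES256": true, "EdDSA": true}

// encodeHeader base64url-encodes a JSON header the way a JWS carries it.
func encodeHeader(jsonText string) string {
	return base64.RawURLEncoding.EncodeToString([]byte(jsonText))
}

// checkProtectedHeader applies the alg, kid/jwk, url, and nonce-presence
// rules. The returned messages are illustrative, not certctl's wording.
func checkProtectedHeader(protectedB64, requestURL string) error {
	raw, err := base64.RawURLEncoding.DecodeString(protectedB64)
	if err != nil {
		return fmt.Errorf("malformed protected header: %w", err)
	}
	var h protectedHeader
	if err := json.Unmarshal(raw, &h); err != nil {
		return fmt.Errorf("malformed protected header: %w", err)
	}
	if !allowedAlgs[h.Alg] {
		return errors.New("badSignatureAlgorithm")
	}
	hasKid, hasJWK := h.Kid != "", len(h.JWK) > 0
	if hasKid == hasJWK { // both present or both absent
		return errors.New("malformed: need exactly one of kid or jwk")
	}
	if h.URL != requestURL {
		return errors.New("unauthorized: url header does not match request URL")
	}
	if h.Nonce == "" {
		return errors.New("badNonce")
	}
	return nil
}

func main() {
	hdr := encodeHeader(`{"alg":"ES256","kid":"https://ca.example/acme/acct/1","nonce":"n1","url":"https://ca.example/acme/new-order"}`)
	fmt.Println(checkProtectedHeader(hdr, "https://ca.example/acme/new-order"))
	fmt.Println(checkProtectedHeader(hdr, "https://ca.example/acme/other"))
}
```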
|
||||
|
||||
Every state-mutating account operation (create, contact update,
|
||||
deactivate) writes its `acme_accounts` row and an `audit_events` row
|
||||
inside one `repository.Transactor.WithinTx` call — the canonical
|
||||
certctl atomicity contract (matches `service.CertificateService.Create`
|
||||
at `internal/service/certificate.go:131`).
|
||||
|
||||
## Phases (cross-reference)
|
||||
|
||||
| Phase | Status | Surface |
|-------|-------------|---------|
| 1a | live | directory + new-nonce + per-profile routing |
| 1b | live | new-account + account/{id} + JWS verifier (RFC 7515 + go-jose v4) |
| 2 | live | orders + authzs + finalize + cert download (trust_authenticated mode end-to-end) |
| 3 | live | HTTP-01 + DNS-01 + TLS-ALPN-01 challenge validation (challenge mode end-to-end) |
| 4 | live | key rollover (RFC 8555 §7.3.5) + revoke-cert (§7.6) + ARI (RFC 9773) |
| 5 | live | rate limits + GC sweeper + kind-driven cert-manager integration test + lego conformance harness + k6 ACME-flow scenario |
| 6 | live | full operator-facing reference + walkthroughs (cert-manager / Caddy / Traefik) + threat model + RFC-8555 conformance statement + troubleshooting + version pinning |
|
||||
|
||||
Track shipped phases via `git log --grep='acme-server:' --oneline`.
|
||||
|
||||
## Operational notes (Phase 1a)
|
||||
|
||||
- **Schema:** `migrations/000025_acme_server.up.sql` adds 5 ACME tables
|
||||
+ the `certificate_profiles.acme_auth_mode` column. Phase 1a actively
|
||||
uses only `acme_nonces`. The full schema ships now so the migration
|
||||
is stable and Phases 1b-4 don't need additional `CREATE TABLE`
|
||||
migrations.
|
||||
|
||||
- **Replay protection:** nonces are persisted in `acme_nonces` (NOT
|
||||
in-memory). They survive server restart, which is required for the
|
||||
RFC 8555 §6.5 replay defense to hold against a multi-replica
|
||||
certctl-server fleet behind a load balancer.
|
||||
|
||||
- **Metrics:** the service layer exposes per-op atomic counters via
|
||||
`service.ACMEService.Metrics().Snapshot()`:
|
||||
- `certctl_acme_directory_total`
|
||||
- `certctl_acme_directory_failures_total`
|
||||
- `certctl_acme_new_nonce_total`
|
||||
- `certctl_acme_new_nonce_failures_total`
|
||||
|
||||
Phase 1b will extend with `new_account` counters; Phase 2 with order
|
||||
/ finalize / cert; Phase 3 with per-challenge-type counters.
|
||||
|
||||
- **Audit:** Phase 1a is read-mostly (directory + nonce). Phase 1b's
|
||||
account-creation path will route through the canonical
|
||||
`s.tx.WithinTx(...)` + `auditService.RecordEventWithTx(...)` pattern
|
||||
so every account state mutation is paired with an `audit_events`
|
||||
row.
|
||||
|
||||
## Phase 4 — key rollover, revocation, ARI
|
||||
|
||||
### How do I rotate my ACME account key?
|
||||
|
||||
RFC 8555 §7.3.5 defines a doubly-signed JWS for the rollover. The OUTER
|
||||
JWS is signed by the OLD account key (kid path); its payload IS the
|
||||
INNER JWS, which is signed by the NEW account key (jwk path).
cert-manager and lego do this for you transparently: `lego renew --key-rotate`
|
||||
or the cert-manager `Issuer.spec.acme.privateKeySecretRef` rollover.
|
||||
|
||||
Server-side validation:
|
||||
|
||||
1. Outer JWS verifies against the registered account's current key.
|
||||
2. Inner JWS verifies against the embedded NEW jwk (proves possession).
|
||||
3. Inner payload `account` matches outer `kid`.
|
||||
4. Inner payload `oldKey` thumbprint-equals the registered key.
|
||||
5. Inner protected `url` equals outer protected `url`.
|
||||
6. New JWK thumbprint not already registered against the same profile.
|
||||
7. `SELECT … FOR UPDATE` on the account row serializes concurrent
|
||||
rollovers; the loser sees the winner's new thumbprint and is told
|
||||
to retry (409).
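
The thumbprint comparisons in checks 4 and 6 rely on RFC 7638 JWK thumbprints. The sketch below computes one for an EC key and applies the non-cryptographic consistency checks 3 through 6. The `rollover` struct, its field names, and the placeholder coordinate strings are hypothetical; signature verification (checks 1 and 2) and the row lock (check 7) are out of scope here.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
)

// ecThumbprint computes the RFC 7638 thumbprint of an EC JWK: SHA-256 over
// the required members (crv, kty, x, y) in lexicographic order, no whitespace.
// Inputs are curve names or base64url strings, so %q quoting matches JSON.
func ecThumbprint(crv, x, y string) string {
	canonical := fmt.Sprintf(`{"crv":%q,"kty":"EC","x":%q,"y":%q}`, crv, x, y)
	sum := sha256.Sum256([]byte(canonical))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

// rollover carries the already-parsed fields of a key-change request.
// The struct and its field names are hypothetical.
type rollover struct {
	OuterKid, OuterURL     string
	InnerAccount, InnerURL string
	InnerOldKeyThumb       string // thumbprint of the inner payload's oldKey
	NewKeyThumb            string // thumbprint of the inner header's jwk
}

// validate applies checks 3 through 6 from the list above.
func (r rollover) validate(registeredThumb string, inUse map[string]bool) error {
	switch {
	case r.InnerAccount != r.OuterKid:
		return errors.New("inner account does not match outer kid")
	case r.InnerOldKeyThumb != registeredThumb:
		return errors.New("oldKey does not match the registered key")
	case r.InnerURL != r.OuterURL:
		return errors.New("inner url does not match outer url")
	case inUse[r.NewKeyThumb]:
		return errors.New("new key is already registered")
	}
	return nil
}

func main() {
	// Placeholder coordinates, not real curve points.
	oldT := ecThumbprint("P-256", "b2xkLXg", "b2xkLXk")
	newT := ecThumbprint("P-256", "bmV3LXg", "bmV3LXk")
	r := rollover{
		OuterKid: "https://ca.example/acme/acct/1", OuterURL: "https://ca.example/acme/key-change",
		InnerAccount: "https://ca.example/acme/acct/1", InnerURL: "https://ca.example/acme/key-change",
		InnerOldKeyThumb: oldT, NewKeyThumb: newT,
	}
	fmt.Println(r.validate(oldT, map[string]bool{oldT: true}))
}
```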
|
||||
|
||||
### How do I revoke an ACME-issued cert?
|
||||
|
||||
Two auth paths per RFC 8555 §7.6:
|
||||
|
||||
- **kid path:** sign with your account key. The server checks the
|
||||
account "owns" the cert via `acme_orders.certificate_id` lookup.
|
||||
- **jwk path:** sign with the cert's own private key. The server
|
||||
extracts the cert's public key, computes the JWK, and asserts it
|
||||
matches the embedded jwk thumbprint.
|
||||
|
||||
Either path routes through `service.RevocationSvc.RevokeCertificateWithActor`
|
||||
— the same pipeline the GUI revoke button, bulk-revocation, and the
|
||||
ACME-consumer issuer use. So the cert-row update + revocation row + audit
|
||||
row are all atomic in one `WithinTx`, the issuer is best-effort
|
||||
notified, and the OCSP response cache is invalidated.
|
||||
|
||||
Reason codes follow RFC 5280 §5.3.1; codes 8 (removeFromCRL) and 10
|
||||
(aACompromise) are not in certctl's `domain.ValidRevocationReasons`
|
||||
set so they clamp to `unspecified`.
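
The clamping rule can be illustrated as below. The contents of `validReasons` are an assumption inferred from the sentence above (every RFC 5280 code except 8 and 10); the real `domain.ValidRevocationReasons` set may differ, and `clampReason` is a hypothetical name.

```go
package main

import "fmt"

// validReasons is an assumed stand-in for domain.ValidRevocationReasons:
// the RFC 5280 §5.3.1 codes minus 8 (removeFromCRL) and 10 (aACompromise).
// Code 7 is unassigned in RFC 5280 and is therefore absent as well.
var validReasons = map[int]string{
	0: "unspecified",
	1: "keyCompromise",
	2: "cACompromise",
	3: "affiliationChanged",
	4: "superseded",
	5: "cessationOfOperation",
	6: "certificateHold",
	9: "privilegeWithdrawn",
}

// clampReason returns code unchanged when it is in the valid set and
// falls back to 0 (unspecified) otherwise.
func clampReason(code int) int {
	if _, ok := validReasons[code]; ok {
		return code
	}
	return 0
}

func main() {
	for _, c := range []int{1, 8, 10} {
		fmt.Printf("%d -> %s\n", c, validReasons[clampReason(c)])
	}
}
```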
|
||||
|
||||
### What is ARI?
|
||||
|
||||
RFC 9773 ACME Renewal Information. Clients GET
|
||||
`/acme/profile/<id>/renewal-info/<cert-id>` (unauthenticated) and
|
||||
receive a JSON document with `suggestedWindow.start` and `.end` —
|
||||
the server's recommendation for when to renew. The response also
|
||||
carries `Retry-After` (RFC 9773 §4.2) hinting at the next-poll cadence.
|
||||
|
||||
Cert-id format is `base64url(authorityKeyIdentifier).base64url(serial)`
|
||||
per RFC 9773 §4.1.
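
A minimal sketch of building that identifier, assuming the caller already has the Authority Key Identifier's `keyIdentifier` bytes and the serial number's big-endian bytes. The function name `ariCertID` and the sample bytes are illustrative. One detail that is easy to miss: the serial is encoded as its DER integer content, so a leading `0x00` is kept when the high bit is set.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// ariCertID builds the RFC 9773 identifier from the AKI keyIdentifier bytes
// and the serial number's big-endian magnitude (for example the result of
// cert.SerialNumber.Bytes()).
func ariCertID(aki, serial []byte) string {
	// DER integers carry a leading 0x00 when the first byte has its high
	// bit set; zero itself is encoded as a single 0x00 byte.
	if len(serial) == 0 || serial[0]&0x80 != 0 {
		serial = append([]byte{0x00}, serial...)
	}
	enc := base64.RawURLEncoding // base64url without padding
	return enc.EncodeToString(aki) + "." + enc.EncodeToString(serial)
}

func main() {
	aki := []byte{0x69, 0x88, 0x5b, 0x6b, 0x87, 0x46, 0x40, 0x41, 0xe1, 0xb3,
		0x7b, 0x84, 0x7b, 0xa0, 0xae, 0x2c, 0xde, 0x01, 0xc8, 0xd4}
	serial := []byte{0x87, 0x65, 0x43, 0x21}
	fmt.Println(ariCertID(aki, serial)) // aYhba4dGQEHhs3uEe6CuLN4ByNQ.AIdlQyE
}
```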
|
||||
|
||||
Window math:
|
||||
|
||||
- Cert with a bound renewal policy: window starts at
|
||||
`notAfter - RenewalWindowDays`, ends at `notAfter - RenewalWindowDays/2`.
|
||||
So a 30-day window cert with notAfter 2026-06-30 emits start=2026-05-31,
|
||||
end=2026-06-15. Boulder-shape default that lets cert-manager schedule
|
||||
inside our renewal window.
|
||||
- No policy: window is the last 33% of validity.
|
||||
- Past expiry: window is "now" → "now + 24h" (renew immediately).
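
The three cases above can be worked through in code. This is a sketch under stated assumptions: `suggestedWindow` and `windowDates` are hypothetical names, a non-positive `renewalWindowDays` stands for "no policy", and the no-policy window is assumed to end at `notAfter`.

```go
package main

import (
	"fmt"
	"time"
)

const day = 24 * time.Hour

// suggestedWindow returns the ARI window for the three cases in the list.
func suggestedWindow(notBefore, notAfter, now time.Time, renewalWindowDays int) (start, end time.Time) {
	switch {
	case !now.Before(notAfter): // past expiry: renew immediately
		return now, now.Add(day)
	case renewalWindowDays > 0: // bound renewal policy
		w := time.Duration(renewalWindowDays) * day
		return notAfter.Add(-w), notAfter.Add(-w / 2)
	default: // no policy: last 33% of the validity period
		lifetime := notAfter.Sub(notBefore)
		return notAfter.Add(-(lifetime / 100 * 33)), notAfter
	}
}

// windowDates is a convenience wrapper for the policy case that takes and
// returns RFC 3339 strings; it assumes a 90-day cert checked at issuance.
func windowDates(notAfter string, renewalWindowDays int) (string, string, error) {
	na, err := time.Parse(time.RFC3339, notAfter)
	if err != nil {
		return "", "", err
	}
	nb := na.Add(-90 * day)
	s, e := suggestedWindow(nb, na, nb, renewalWindowDays)
	return s.Format(time.RFC3339), e.Format(time.RFC3339), nil
}

func main() {
	s, e, err := windowDates("2026-06-30T00:00:00Z", 30)
	if err != nil {
		panic(err)
	}
	fmt.Println(s, e) // 2026-05-31T00:00:00Z 2026-06-15T00:00:00Z
}
```

The output matches the worked example in the first bullet: a 30-day window on a cert expiring 2026-06-30 starts on 2026-05-31 and ends on 2026-06-15.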
|
||||
|
||||
Disable ARI globally with `CERTCTL_ACME_SERVER_ARI_ENABLED=false`. The
|
||||
URL drops out of the directory; the route is still registered but
|
||||
returns 404 — clients fall back to static renewal scheduling.
|
||||
|
||||
## Phase 5 — operational guidance
|
||||
|
||||
### Rate limiting
|
||||
|
||||
Production deployments serving multiple ACME profiles or fleets should
|
||||
keep the default rate limits in place. The four caps:
|
||||
|
||||
- `RATE_LIMIT_ORDERS_PER_HOUR` (100) — per-account new-order cap. A
|
||||
cert-manager Certificate that renews with 1/3 of its validity remaining
(the cert-manager default; a 90-day cert renews ~30 days before expiry,
about every 60 days) consumes ~6 orders/year
per managed Certificate. 100/hour is generous for any plausible
|
||||
fleet.
|
||||
- `RATE_LIMIT_CONCURRENT_ORDERS` (5) — per-account cap on
|
||||
pending/ready/processing orders. Stops a runaway client from
|
||||
starving DB-row throughput. Tune up only if you observe legitimate
|
||||
bursts.
|
||||
- `RATE_LIMIT_KEY_CHANGE_PER_HOUR` (5) — rollovers are rare; a flood
|
||||
is an attack signal. Tune down to 1/hour if your operator
|
||||
procedure mandates manual rollovers only.
|
||||
- `RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR` (60) — per-challenge cap,
|
||||
defends against retry storms.
|
||||
|
||||
Hits return RFC 8555 §6.7 `rateLimited` Problem with a `Retry-After`
|
||||
header. cert-manager 1.15+ honors the header; lego too. Older clients
|
||||
may not — that's the client's problem, not certctl's.
|
||||
|
||||
The buckets are **in-memory + per-replica**. A 3-replica certctl-server
fleet behind a load balancer effectively has 3× the configured
|
||||
throughput (each replica's bucket fills independently). For
|
||||
deployments where this matters operationally, the right answer is a
|
||||
shared rate-limit store — that's a follow-up; not blocking for the
|
||||
current threat model where same-account requests typically pin to
|
||||
the same replica via session affinity.
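
To make the per-replica behaviour concrete, here is a minimal in-memory limiter. It is a sketch only: it uses a sliding-window log rather than whatever bucket algorithm certctl actually implements, the type and method names are hypothetical, and a cap of 0 is treated as "unlimited" for simplicity. Because the state lives in a process-local map, each replica running this code counts independently, which is the 3× effect described above.

```go
package main

import (
	"fmt"
	"sync"
)

const windowSec = 3600 // one hour

// hourlyLimiter is an in-memory, per-process sliding-window limiter.
type hourlyLimiter struct {
	mu    sync.Mutex
	limit int
	hits  map[string][]int64 // key -> Unix seconds of accepted requests
}

func newHourlyLimiter(limit int) *hourlyLimiter {
	return &hourlyLimiter{limit: limit, hits: make(map[string][]int64)}
}

// allow reports whether key may proceed at nowUnix. When it may not,
// retryAfterSec is a value suitable for a Retry-After header.
func (l *hourlyLimiter) allow(key string, nowUnix int64) (ok bool, retryAfterSec int64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	kept := l.hits[key][:0]
	for _, t := range l.hits[key] {
		if t > nowUnix-windowSec { // still inside the last hour
			kept = append(kept, t)
		}
	}
	l.hits[key] = kept
	if l.limit > 0 && len(kept) >= l.limit {
		// The oldest hit in the window ages out first.
		return false, kept[0] + windowSec - nowUnix
	}
	l.hits[key] = append(kept, nowUnix)
	return true, 0
}

func main() {
	l := newHourlyLimiter(2)
	for _, now := range []int64{0, 10, 20} {
		ok, retry := l.allow("acct-1", now)
		fmt.Println(ok, retry)
	}
}
```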
|
||||
|
||||
### GC sweeper
|
||||
|
||||
The scheduler runs the GC sweep every `GC_INTERVAL` (default 1m). Each
|
||||
sweep is three independent SQL statements:
|
||||
|
||||
1. `DELETE FROM acme_nonces WHERE used = TRUE OR expires_at < NOW()`.
|
||||
2. `UPDATE acme_authorizations SET status='expired' WHERE status='pending' AND expires_at < NOW()`.
|
||||
3. `UPDATE acme_orders SET status='invalid', error=... WHERE status IN ('pending','ready','processing') AND expires_at < NOW()`.
|
||||
|
||||
Each statement is bounded by a 1-minute per-sweep timeout. A failing
|
||||
sweep is logged + retried on the next tick; a tick that overruns its
|
||||
budget is skipped (the existing-tick atomic-Bool guard prevents
|
||||
overlap). Counts are exposed via `certctl_acme_gc_*` Prometheus
|
||||
metrics.
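
The overlap guard mentioned above can be sketched with `sync/atomic`. The `sweeper` type and its `tick` method are hypothetical names, and the skip counter stands in for whatever the real `certctl_acme_gc_*` metrics record.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sweeper guards a periodic job so that ticks never overlap.
type sweeper struct {
	running atomic.Bool  // true while a sweep is in flight
	skipped atomic.Int64 // number of ticks dropped by the guard
}

// tick runs sweep unless a previous tick is still in flight, in which case
// the tick is skipped and false is returned.
func (s *sweeper) tick(sweep func()) bool {
	if !s.running.CompareAndSwap(false, true) {
		s.skipped.Add(1)
		return false
	}
	defer s.running.Store(false)
	sweep()
	return true
}

func main() {
	var s sweeper
	// Simulate a tick firing while another sweep is still running.
	s.tick(func() {
		fmt.Println("overlapping tick ran:", s.tick(func() {}))
	})
	fmt.Println("skipped:", s.skipped.Load())
}
```

`CompareAndSwap` makes the check-and-set a single atomic step, so two ticks racing on the flag cannot both proceed.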
|
||||
|
||||
### cert-manager integration test
|
||||
|
||||
`make acme-cert-manager-test` brings up a kind cluster, installs
|
||||
cert-manager 1.15.0, helm-deploys certctl-server with
|
||||
`acmeServer.enabled=true`, and verifies a Certificate resource issues
|
||||
end-to-end. Skipped in CI by default (kind is too heavy for per-PR);
|
||||
operators run locally on workstation. See
|
||||
`deploy/test/acme-integration/` for the YAML + Go test harness.
|
||||
|
||||
### lego RFC conformance harness
|
||||
|
||||
`make acme-rfc-conformance-test` drives lego v4 against a hermetic
|
||||
certctl-server stack, exercising register → new-order → finalize.
|
||||
Operators run this when shipping behavior changes to the ACME surface
|
||||
to confirm a real third-party client still works.
|
||||
|
||||
### k6 ACME flows scenario
|
||||
|
||||
`deploy/test/loadtest/k6/acme_flow.js` exercises the unauthenticated
|
||||
surface (directory + new-nonce + ARI) at 100 VUs × 5m. JWS-signed
|
||||
flows are out of scope for k6 (no JWS support); they're covered by
|
||||
the lego conformance harness above. Baseline numbers + thresholds in
|
||||
`deploy/test/loadtest/README.md`.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
The five failure modes operators hit most often + the canonical fix
|
||||
for each.
|
||||
|
||||
### `cert-manager logs: 400 Bad Request: badNonce`
|
||||
|
||||
**Cause:** Either a nonce was replayed (a buggy client retries the
|
||||
same JWS), the cert-manager + certctl-server clocks differ by more
|
||||
than `CERTCTL_ACME_SERVER_NONCE_TTL` (default 5 min), or the
|
||||
nonce-store row was reaped between issuance and use.
|
||||
|
||||
**Fix:** First check NTP on both sides. If clocks are healthy,
|
||||
lengthen `CERTCTL_ACME_SERVER_NONCE_TTL` to 10m or 15m. If the
|
||||
problem persists, check for a multi-replica certctl-server fleet
|
||||
without sticky session affinity — the nonce DB row lives on one
|
||||
replica; if the JWS POST hits a different replica before replication
|
||||
catches up, you observe spurious `badNonce`. Solution: pin client
|
||||
sessions to a single replica via load-balancer cookie / `kid`-hash
|
||||
routing, OR shorten replication lag if your DB is the bottleneck.
|
||||
|
||||
### `cert-manager logs: x509: certificate signed by unknown authority`
|
||||
|
||||
**Cause:** cert-manager refuses to talk to the directory URL because
|
||||
its TLS chain doesn't terminate at a root in cert-manager's trust
|
||||
store. certctl-server's bootstrap cert (Phase 1a, `deploy/test/certs/server.crt`)
|
||||
is self-signed.
|
||||
|
||||
**Fix:** Add the `caBundle` field to your `ClusterIssuer.spec.acme` —
|
||||
see the [TLS trust bootstrap](#tls-trust-bootstrap-read-this-before-configuring-cert-manager)
|
||||
section above for the 3-step recipe. This is **the** single biggest
|
||||
first-time-deploy footgun on the cert-manager integration path.
|
||||
|
||||
### HTTP-01 validator returns `connection refused`
|
||||
|
||||
**Cause:** The HTTP-01 solver's Ingress / Service is not reachable
|
||||
from certctl-server's network. Common subcases: (a) the cert-manager
|
||||
http-solver pod is on a private network certctl-server can't reach;
|
||||
(b) a firewall blocks port 80 inbound to the solver's address; (c)
|
||||
the Ingress class annotation doesn't match an installed ingress
|
||||
controller; (d) your DNS still points at an old IP.
|
||||
|
||||
**Fix:** From the certctl-server pod, `curl -v
|
||||
http://<identifier>/.well-known/acme-challenge/<token>` and read the
|
||||
network error. If the curl fails the same way, the network path is
|
||||
the issue. If curl works but the validator fails, check the validator
|
||||
log lines — the SSRF guard rejects reserved IPs (RFC1918, link-local,
|
||||
cloud-metadata 169.254.169.254). Public-trust style profiles that
|
||||
need to reach RFC1918 solvers must be moved to `trust_authenticated`
|
||||
mode OR the solver must be exposed on a routable address.
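
For a quick check of whether a solver address would be treated as reserved, something like the following can help. It is an approximation built on the standard library's address classes (RFC 1918 and unique-local, loopback, link-local, unspecified); the exact list certctl's SSRF guard enforces is not reproduced here, and `isReservedAddr` is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/netip"
)

// isReservedAddr reports whether s is an address class that an SSRF guard
// of the kind described above would typically refuse to dial.
func isReservedAddr(s string) (bool, error) {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return false, err
	}
	addr = addr.Unmap() // treat ::ffff:10.0.0.1 like 10.0.0.1
	reserved := addr.IsPrivate() || // 10/8, 172.16/12, 192.168/16, fc00::/7
		addr.IsLoopback() ||
		addr.IsLinkLocalUnicast() || // includes 169.254.169.254
		addr.IsUnspecified()
	return reserved, nil
}

func main() {
	for _, ip := range []string{"10.1.2.3", "169.254.169.254", "8.8.8.8"} {
		r, err := isReservedAddr(ip)
		fmt.Println(ip, r, err)
	}
}
```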
|
||||
|
||||
### DNS-01 validator returns `NXDOMAIN`
|
||||
|
||||
**Cause:** DNS provider hasn't propagated the `_acme-challenge.<domain>`
|
||||
TXT record yet. Most providers have a 30s-2m propagation lag. cert-manager
|
||||
retries by default, but Phase-5 rate limits (default 60/hour per
|
||||
challenge-id) can truncate the retry budget.
|
||||
|
||||
**Fix:** Verify TXT propagation with `dig +short TXT _acme-challenge.<domain>
|
||||
@<your-resolver>`. If the answer is empty, the issue is upstream. If
|
||||
it's populated but certctl reports NXDOMAIN, check
|
||||
`CERTCTL_ACME_SERVER_DNS01_RESOLVER` (default `8.8.8.8:53`) is
|
||||
reachable from certctl-server's network egress. Operators on isolated
|
||||
networks need a private resolver; configure accordingly + own the
|
||||
cache-poisoning posture (see [threat
|
||||
model](./acme-server-threat-model.md)).
|
||||
|
||||
### Certificate Ready=False with `rejectedIdentifier`
|
||||
|
||||
**Cause:** The CSR includes an identifier (CommonName or SAN) that the
|
||||
bound certificate profile's policy rejects. certctl runs syntactic +
|
||||
profile-policy validation **before** order creation; the order never
|
||||
reaches the database.
|
||||
|
||||
**Fix:** The reject reason is in the `subproblems` array of the RFC
|
||||
8555 §6.7 problem document. Decode the JSON, look at `subproblems[].detail`,
|
||||
and adjust either the CSR or the profile policy. Common causes:
|
||||
SAN-not-in-`AllowedIdentifierWildcards`, EKU-not-in-`AllowedEKUs`,
|
||||
TTL-exceeds-`MaxTTLSeconds`. Validation logic lives in
|
||||
`internal/api/acme/identifier.go::ValidateIdentifiers` +
|
||||
`internal/domain/profile.go` — read those if the profile-policy rule
|
||||
isn't obvious.
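
Decoding the problem document takes only a few lines. The sketch below follows the RFC 8555 §6.7.1 subproblem shape (`type`, `detail`, `identifier`); the sample JSON body and the `rejectReasons` helper are illustrative, not captured certctl output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type identifier struct {
	Type  string `json:"type"`
	Value string `json:"value"`
}

type subproblem struct {
	Type       string      `json:"type"`
	Detail     string      `json:"detail"`
	Identifier *identifier `json:"identifier,omitempty"`
}

type problem struct {
	Type        string       `json:"type"`
	Detail      string       `json:"detail"`
	Subproblems []subproblem `json:"subproblems"`
}

// rejectReasons returns one "identifier: detail" line per subproblem.
func rejectReasons(body string) ([]string, error) {
	var p problem
	if err := json.Unmarshal([]byte(body), &p); err != nil {
		return nil, err
	}
	reasons := make([]string, 0, len(p.Subproblems))
	for _, sp := range p.Subproblems {
		name := "(no identifier)"
		if sp.Identifier != nil {
			name = sp.Identifier.Value
		}
		reasons = append(reasons, name+": "+sp.Detail)
	}
	return reasons, nil
}

func main() {
	// Illustrative body only.
	body := `{"type":"urn:ietf:params:acme:error:rejectedIdentifier",
	  "detail":"one or more identifiers were rejected",
	  "subproblems":[{"type":"urn:ietf:params:acme:error:rejectedIdentifier",
	    "detail":"SAN not allowed by profile",
	    "identifier":{"type":"dns","value":"bad.example.com"}}]}`
	reasons, err := rejectReasons(body)
	fmt.Println(reasons, err)
}
```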
|
||||
|
||||
## Version pinning + tested clients
|
||||
|
||||
certctl's ACME server is tested against the following client versions.
|
||||
Other versions probably work; these are the ones the integration suite
|
||||
exercises end-to-end.
|
||||
|
||||
| Client | Tested version | Where it's pinned |
|--------|----------------|-------------------|
| cert-manager | 1.15.0 | `deploy/test/acme-integration/cert-manager-install.sh::CERT_MANAGER_VERSION` |
| lego (RFC 8555 conformance harness) | v4.x latest | `deploy/test/acme-integration/conformance-lego.sh` (operator installs via `go install github.com/go-acme/lego/v4/cmd/lego@latest`) |
| kind (cluster bootstrap) | v0.20+ | `deploy/test/acme-integration/kind-config.yaml` schema requirement |
| Caddy | 2.7.x | Phase 6 walkthrough (`docs/acme-caddy-walkthrough.md`) |
| Traefik | 3.0+ | Phase 6 walkthrough (`docs/acme-traefik-walkthrough.md`) |
|
||||
|
||||
Operators reporting issues with untested-version clients should include
|
||||
the client version + the precise wire-level error (curl-captured request
|
||||
+ response body) so we can pin a regression test if applicable.
|
||||
|
||||
## FAQ
|
||||
|
||||
### Why two auth modes? Isn't `challenge` strictly more secure?
|
||||
|
||||
`challenge` is strictly more secure for **public-trust** PKI — RFC 8555
|
||||
§8 ownership proof is the entire point of cert-manager + Let's Encrypt.
|
||||
For **internal PKI**, the threat model is different: the network itself
|
||||
is the security boundary (mTLS service mesh, firewalled VPC,
identifier-namespace controlled by the operator). Forcing every internal cert to
|
||||
go through a solver round-trip adds operational toil with no security
|
||||
gain. `trust_authenticated` is the certctl-specific mode that
|
||||
acknowledges this — the ACME account is the proof, not the solver.
|
||||
|
||||
### How does this differ from `cert-manager → Let's Encrypt with certctl as a separate step`?
|
||||
|
||||
Two integrations vs one. With certctl as the ACME endpoint, cert-manager
|
||||
does its native flow (Certificate → Order → CSR → Secret) and certctl
|
||||
mints the cert directly, recording it under its own
|
||||
`managed_certificates` table with full audit + renewal-policy +
bulk-revocation surface. With Let's Encrypt as the ACME endpoint, you have
|
||||
to run a separate cert-manager-uploads-to-certctl webhook OR maintain
|
||||
two parallel cert tracks. The native-ACME-server path is operationally
|
||||
simpler.
|
||||
|
||||
### Can I use ACME endpoints from outside the K8s cluster?
|
||||
|
||||
Yes. The endpoints are HTTPS over the certctl-server's listener (port
|
||||
8443 by default). Caddy on a VM, win-acme on a Windows server, or
|
||||
Posh-ACME on a Mac all integrate against
|
||||
`https://<certctl-server>:8443/acme/profile/<profile-id>/directory`.
|
||||
The TLS-trust-bootstrap requirement applies the same way — see the
|
||||
[Caddy walkthrough](./acme-caddy-walkthrough.md) for the OS-trust-store
|
||||
recipe.
|
||||
|
||||
### How do I migrate manually-issued certs to ACME-issued ones?
|
||||
|
||||
Not yet automatic. Operators migrating: keep the old `managed_certificates`
|
||||
rows; create new ones via the ACME flow; flip targets one by one. A
|
||||
dedicated bulk-migration tool is on the roadmap (post-2.1.0). Track
|
||||
via the master prompt's roadmap section in
|
||||
`cowork/acme-server-endpoint-prompt.md`.
|
||||
|
||||
### What audit-log events fire on each ACME operation?
|
||||
|
||||
Every state mutation writes an `audit_events` row. Actor strings:
|
||||
`acme:<account-id>` for kid-path requests; `acme-cert-key:<serial>`
|
||||
for jwk-path revoke; `acme-system:gc` for scheduler-driven sweeps.
|
||||
Event-name catalog:
|
||||
|
||||
| Event name | Fired by | Resource type |
|------------|----------|---------------|
| `acme_account_created` | new-account | `acme_account` |
| `acme_account_contact_updated` | account update | `acme_account` |
| `acme_account_deactivated` | account deactivate | `acme_account` |
| `acme_account_key_rolled` | key-change | `acme_account` |
| `acme_order_created` | new-order | `acme_order` |
| `acme_order_finalized` | finalize | `acme_order` |
| `acme_challenge_processing` | challenge-respond (dispatch) | `acme_challenge` |
| `acme_challenge_completed` | validator callback | `acme_challenge` |
| `certificate_revoked` | revoke-cert (routes through `RevocationSvc`) | `certificate` |
|
||||
|
||||
Querying by actor prefix (`actor LIKE 'acme:%'`) reconstructs the full
|
||||
history of any ACME-issued cert.
|
||||
|
||||
### Is there a threat model document?
|
||||
|
||||
Yes — [`docs/acme-server-threat-model.md`](./acme-server-threat-model.md).
|
||||
Read before writing a security review.
|
||||
|
||||
## See also
|
||||
|
||||
- [cert-manager integration walkthrough](./acme-cert-manager-walkthrough.md)
|
||||
- [Caddy integration walkthrough](./acme-caddy-walkthrough.md)
|
||||
- [Traefik integration walkthrough](./acme-traefik-walkthrough.md)
|
||||
- [Threat model](./acme-server-threat-model.md)
|
||||
- [TLS trust bootstrap reference](./tls.md)
|
||||
- [Architecture (control-plane)](./architecture.md)
|
||||
@@ -0,0 +1,118 @@
|
||||
# Async-CA Polling — Operator Reference
|
||||
|
||||
Closes audit fix #5 from the 2026-05-01 issuer-coverage acquisition-readiness audit.
|
||||
|
||||
## What this is
|
||||
|
||||
Four issuer connectors talk to Certificate Authorities that issue
|
||||
certificates **asynchronously** — `IssueCertificate` returns an order
|
||||
ID immediately, and the caller (or scheduler) must call
|
||||
`GetOrderStatus` later to retrieve the issued cert:
|
||||
|
||||
- **DigiCert** (CertCentral)
|
||||
- **Sectigo** (Certificate Manager)
|
||||
- **Entrust** (Certificate Services / CA Gateway)
|
||||
- **GlobalSign** (Atlas HVCA)
|
||||
|
||||
Pre-fix, each connector's `GetOrderStatus` made one HTTP call per
|
||||
invocation with no exponential backoff, no retry cap, and no deadline.
|
||||
Under a renewal sweep, certctl would hammer the upstream CA's
|
||||
rate-limit budget. A 429 response was treated as a hard error,
|
||||
which then caused the scheduler to retry on the next tick — re-fanning
|
||||
out the same call that just got rate-limited.
|
||||
|
||||
Post-fix, `GetOrderStatus` blocks for up to `PollMaxWait` (default
|
||||
10 minutes) doing **bounded internal polling**:
|
||||
|
||||
```
attempt 1 → wait 5s → attempt 2 → wait 15s → attempt 3 → wait 45s →
attempt 4 → wait 2m → attempt 5 → wait 5m → ... (capped at 5m)
```
|
||||
|
||||
±20% jitter applied at every wait so multiple certctl instances
|
||||
never synchronize on the upstream CA's rate-limit window. The
|
||||
`PollMaxWait` deadline is a hard cap; if the upstream still hasn't
|
||||
completed by then, `GetOrderStatus` returns `StillPending` and the
|
||||
scheduler can re-enqueue the job for a future tick.
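
The schedule and jitter can be reproduced with a few lines. This sketch assumes a ×3 multiplier, which matches 5s → 15s → 45s and yields 2m15s for the fourth wait (shown as 2m in the diagram above); the actual constants in `asyncpoll` may differ, and `waits` and `jitter` are hypothetical names.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// waits returns the first n un-jittered waits: 5s, then x3 per step
// (an assumed multiplier), capped at 5m.
func waits(n int) []time.Duration {
	const (
		initial = 5 * time.Second
		maxWait = 5 * time.Minute
		factor  = 3
	)
	out := make([]time.Duration, 0, n)
	w := initial
	for i := 0; i < n; i++ {
		out = append(out, w)
		w *= factor
		if w > maxWait {
			w = maxWait
		}
	}
	return out
}

// jitter scales d by a factor in [0.8, 1.2) given r in [0, 1).
func jitter(d time.Duration, r float64) time.Duration {
	return time.Duration(float64(d) * (0.8 + 0.4*r))
}

func main() {
	for i, w := range waits(6) {
		fmt.Printf("wait %d: base %v, jittered %v\n", i+1, w, jitter(w, rand.Float64()))
	}
}
```

Passing the random value into `jitter` rather than drawing it inside keeps the function deterministic and easy to test at its bounds.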
|
||||
|
||||
## Status-code triage
|
||||
|
||||
Each connector classifies HTTP responses to drive polling decisions:
|
||||
|
||||
| Response | Meaning | Decision |
|---|---|---|
| 2xx + status="issued"/"completed" | Cert ready | Done: return the cert |
| 2xx + status="pending"/"processing" | Still working | StillPending: keep polling |
| 2xx + status="rejected"/"denied"/"failed" | Permanent | Done: return `OrderStatus{Status:"failed"}` |
| 2xx + parse failure | Body is broken | Failed: return error |
| 4xx (404/400/401/403) | Permanent client error | Failed: return error |
| 429 (rate limited) | Transient | StillPending: keep polling with backoff |
| 5xx | Transient | StillPending: keep polling with backoff |
| Network / TLS error | Transient | StillPending: keep polling with backoff |
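
The table translates directly into a classifier. The sketch below is a generic rendering, not any connector's code: the `decision` type, the `classify` signature, and the handling of cases the table does not cover (1xx/3xx responses, unknown status strings) are assumptions.

```go
package main

import "fmt"

type decision int

const (
	done decision = iota
	stillPending
	failed
)

// classify maps one poll response to a decision plus, for done, the
// normalized order outcome ("issued" or "failed"). netErr marks a network or
// TLS failure with no HTTP response; parseErr marks an unparseable 2xx body.
func classify(httpStatus int, orderStatus string, netErr, parseErr bool) (decision, string) {
	switch {
	case netErr:
		return stillPending, ""
	case httpStatus == 429 || httpStatus >= 500:
		return stillPending, "" // transient: keep polling with backoff
	case httpStatus >= 400:
		return failed, "" // permanent client error
	case httpStatus < 200 || httpStatus >= 300:
		return failed, "" // assumption: 1xx/3xx are not in the table
	case parseErr:
		return failed, ""
	}
	switch orderStatus {
	case "issued", "completed":
		return done, "issued"
	case "pending", "processing":
		return stillPending, ""
	case "rejected", "denied", "failed":
		return done, "failed" // permanent, reported as OrderStatus{Status:"failed"}
	default:
		return failed, "" // assumption: unknown status strings are errors
	}
}

func main() {
	fmt.Println(classify(200, "issued", false, false))
	fmt.Println(classify(429, "", false, false))
	fmt.Println(classify(404, "", false, false))
}
```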
|
||||
|
||||
## Operator tuning
|
||||
|
||||
Each connector exposes a `PollMaxWaitSeconds` config field and
|
||||
matching env var:
|
||||
|
||||
| Connector | Env var | Default |
|---|---|---|
| DigiCert | `CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS` | 600 (10m) |
| Sectigo | `CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS` | 600 (10m) |
| Entrust | `CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS` | 600 (10m) |
| GlobalSign | `CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS` | 600 (10m) |
|
||||
|
||||
Tune up (e.g., `86400` = 24 hours) for **Entrust approval-pending
|
||||
workflows** where humans manually approve enrollments. Tune down (e.g.,
|
||||
`60`) for high-throughput environments that prefer to recycle the
|
||||
scheduler tick rather than block one renewal goroutine for minutes.
|
||||
|
||||
A value of 0 (or unset) falls back to the package default in
|
||||
`internal/connector/issuer/asyncpoll`.
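
The fallback rule can be sketched as follows. The helper name `pollMaxWait` is hypothetical, and treating negative or unparseable values like "unset" is an assumption; the real config loader may reject them instead.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// defaultPollMaxWait mirrors the documented 600-second default.
const defaultPollMaxWait = 10 * time.Minute

// pollMaxWait parses a *_POLL_MAX_WAIT_SECONDS value. Empty or zero input
// falls back to the default; so do negative and unparseable values
// (an assumption made for this sketch).
func pollMaxWait(raw string) time.Duration {
	secs, err := strconv.Atoi(raw)
	if err != nil || secs <= 0 {
		return defaultPollMaxWait
	}
	return time.Duration(secs) * time.Second
}

func main() {
	fmt.Println(pollMaxWait(os.Getenv("CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS")))
	fmt.Println(pollMaxWait("86400")) // 24h0m0s
}
```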
|
||||
|
||||
## Failure modes
|
||||
|
||||
**Upstream returns 429 forever.** The Poller respects the backoff
|
||||
(5s → 15s → 45s → 2m → 5m), so a sustained 429 stream burns through
|
||||
the full `PollMaxWait` budget with at most 7-8 attempts (instead of
|
||||
~600 attempts at 1/sec). After `PollMaxWait` expires, `GetOrderStatus`
|
||||
returns `StillPending`; the scheduler re-enqueues for the next tick.
|
||||
The total request volume against the upstream is bounded by `tick
|
||||
interval / minimum backoff` — typically 1-2 requests per minute even
|
||||
under heavy load.
|
||||
|
||||
**Sectigo `collectNotReady` sentinel.** When the SCM status endpoint
|
||||
reports `Issued` but the cert collect endpoint isn't yet ready, the
|
||||
old code branched into a special "pending" return. Now that branch
|
||||
returns `StillPending` from the poll closure, so the cert collection
|
||||
rides the same backoff schedule.
|
||||
|
||||
**Entrust approval-pending.** The `AWAITING_APPROVAL` status maps to
|
||||
`StillPending`. With the default `PollMaxWait=10m`, the scheduler
|
||||
will re-enqueue once per tick if approval hasn't happened yet; with
|
||||
`PollMaxWait=24h` the same renewal goroutine waits the full approval
|
||||
window. Pick the latter when you have many approval-pending
|
||||
enrollments per tick.
|
||||
|
||||
## Where the implementation lives
|
||||
|
||||
- `internal/connector/issuer/asyncpoll/asyncpoll.go` — shared `Poller`
|
||||
with backoff math, jitter, deadline, and ctx-aware cancellation.
|
||||
- `internal/connector/issuer/digicert/digicert.go` —
|
||||
`pollOrderOnce` + `GetOrderStatus` orchestrator.
|
||||
- `internal/connector/issuer/sectigo/sectigo.go` —
|
||||
`pollEnrollmentOnce` + status-code permanence triage
|
||||
(`isPermanentStatusError`).
|
||||
- `internal/connector/issuer/entrust/entrust.go` —
|
||||
`pollEnrollmentOnce` + approval-pending mapping.
|
||||
- `internal/connector/issuer/globalsign/globalsign.go` —
|
||||
`pollCertificateOnce` (serial-number tracking).
|
||||
- `internal/connector/issuer/asyncpoll/asyncpoll_test.go` — 11 unit
|
||||
tests covering happy path, transient-then-success, Failed
|
||||
termination, MaxWait timeout, last-error wrap, ctx cancel,
|
||||
multiplicative backoff, jitter bounds, defaults.
|
||||
|
||||
## Audit blocker reference
|
||||
|
||||
cowork/issuer-coverage-audit-2026-05-01/RESULTS.md, Top-10 fix #5
|
||||
(Part 1.5 finding #4: "No polling backoff for async CAs").
|
||||
@@ -0,0 +1,411 @@
|
||||
# CRL & OCSP — Revocation Status for Relying Parties
|
||||
|
||||
This guide is the operator + relying-party reference for certctl's revocation
|
||||
status surfaces. It covers the wire format, endpoint URLs, configuration knobs,
|
||||
the OCSP responder cert lifecycle, and how to point common consumers
|
||||
(cert-manager, Firefox, OpenSSL) at the endpoints.
|
||||
|
||||
If you're looking for the higher-level architecture, see
|
||||
[`architecture.md` § Security Model](architecture.md#security-model). If you're
|
||||
looking for the revocation policy / reason codes the API accepts, see
|
||||
[`api/openapi.yaml` § /certificates/{id}/revoke](../api/openapi.yaml).
|
||||
|
||||
---
|
||||
|
||||
## Conceptual overview
|
||||
|
||||
**Why two formats.** RFC 5280 §5 defines a Certificate Revocation List (CRL)
|
||||
— a periodically-published, signed list of every revoked certificate for an
|
||||
issuer. RFC 6960 defines the Online Certificate Status Protocol (OCSP) — a
|
||||
request/response protocol that returns the status of a single certificate by
|
||||
serial number. CRLs are batch-friendly and cacheable; OCSP is point-query and
|
||||
fresh. Production PKI deployments serve both because different relying parties
|
||||
prefer different trade-offs:
|
||||
|
||||
- Browsers (Firefox / Safari) prefer OCSP for freshness; some pin OCSP
|
||||
stapling.
|
||||
- cert-manager and most Linux TLS clients fall back to CRL when OCSP is
|
||||
unreachable.
|
||||
- Microsoft Intune / corporate device-state validators do periodic CRL pulls.
|
||||
- OpenSSL `s_client -status` exercises OCSP via the `Certificate Status
|
||||
Request` extension during the handshake.
|
||||
|
||||
certctl's local issuer publishes both, with a pre-generation cache so a busy
|
||||
CA does not DoS itself rebuilding the CRL on every fetch.
|
||||
|
||||
**Why a separate OCSP responder cert.** RFC 6960 §2.6 + §4.2.2.2 strongly
|
||||
recommend that OCSP responses be signed by a delegated "OCSP responder cert"
|
||||
issued by the CA, NOT by the CA private key directly. The responder cert
|
||||
carries the `id-pkix-ocsp-nocheck` extension (RFC 6960 §4.2.2.2.1) so OCSP
|
||||
clients do not recursively check the responder cert's revocation status. This
|
||||
keeps the CA private key cold (an HSM operation per OCSP request would be
|
||||
prohibitive at scale) and lets the responder key live on disk, on a separate
|
||||
HSM partition, or rotate frequently while the CA key stays untouched.
|
||||
|
||||
---
|
||||
|
||||
## Endpoints
|
||||
|
||||
All revocation endpoints live under `/.well-known/pki/` per RFC 8615 and run
|
||||
**unauthenticated** — relying parties without certctl API credentials must be
|
||||
able to validate revocation status. The HTTPS-only TLS 1.3 control plane
|
||||
applies; there is no plaintext fallback.
|
||||
|
||||
### CRL — Certificate Revocation List
|
||||
|
||||
```
GET https://<host>/.well-known/pki/crl/{issuer_id}
```
|
||||
|
||||
| Field | Value |
| --- | --- |
| Method | `GET` |
| Auth | None (unauthenticated, RFC 5280 §5 distribution semantics) |
| Response Content-Type | `application/pkix-crl` |
| Response body | DER-encoded X.509 CRL signed by the issuer's CA |
| Cache | Pre-generated by the scheduler; configurable interval |
|
||||
|
||||
Example:
|
||||
|
||||
```bash
curl --cacert ca.crt \
  -o crl.der \
  https://localhost:8443/.well-known/pki/crl/iss-local

openssl crl -inform DER -in crl.der -text -noout
```
|
||||
|
||||
### OCSP — Online Certificate Status Protocol
|
||||
|
||||
certctl serves both the GET form (RFC 6960 §A.1.1, simple URL-path lookup)
|
||||
and the POST form (RFC 6960 §A.1.1, binary OCSPRequest body). Most
|
||||
production OCSP clients (Firefox, OpenSSL `s_client -status`, cert-manager,
|
||||
Intune) use POST. The GET form is preserved for ops curl-debugging.
|
||||
|
||||
#### GET form
|
||||
|
||||
```
GET https://<host>/.well-known/pki/ocsp/{issuer_id}/{serial_hex}
```
|
||||
|
||||
| Field | Value |
| --- | --- |
| Method | `GET` |
| Auth | None |
| Response Content-Type | `application/ocsp-response` |
| Response body | DER-encoded OCSPResponse signed by the **OCSP responder cert** (NOT the CA cert) |
|
||||
|
||||
Example:
|
||||
|
||||
```bash
curl --cacert ca.crt \
  -o response.der \
  https://localhost:8443/.well-known/pki/ocsp/iss-local/a1b2c3d4

openssl ocsp -respin response.der -text -CAfile ca.crt
```
|
||||
|
||||
#### POST form (the standard one)
|
||||
|
||||
```
POST https://<host>/.well-known/pki/ocsp/{issuer_id}
Content-Type: application/ocsp-request
Body: <DER-encoded OCSPRequest>
```
|
||||
|
||||
| Field | Value |
| --- | --- |
| Method | `POST` |
| Auth | None |
| Request Content-Type | `application/ocsp-request` |
| Response Content-Type | `application/ocsp-response` |
|
||||
|
||||
Example with OpenSSL building the request:
|
||||
|
||||
```bash
openssl ocsp -issuer ca.crt -cert leaf.crt -reqout request.der

curl --cacert ca.crt \
  -X POST \
  -H "Content-Type: application/ocsp-request" \
  --data-binary @request.der \
  -o response.der \
  https://localhost:8443/.well-known/pki/ocsp/iss-local

openssl ocsp -respin response.der -text -CAfile ca.crt
```
|
||||
|
||||
The body-size limit applies (`http.MaxBytesReader` from middleware,
|
||||
default 1MB, configurable via `CERTCTL_MAX_BODY_SIZE`); a typical OCSPRequest
|
||||
is ~200 bytes so this is a generous cap.
|
||||
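The cap described above can be sketched in Go with the standard library's `http.MaxBytesReader`. This is a minimal illustration, not certctl's middleware: the handler `ocspPost`, the helper `post`, the constant `defaultMaxBody`, and the choice of HTTP 413 for oversized bodies are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// defaultMaxBody mirrors the documented 1MB default. The name is illustrative.
const defaultMaxBody = 1 << 20

// ocspPost is a stand-in for the POST handler: it caps the body, then
// reports how many bytes it read instead of parsing an OCSPRequest.
func ocspPost(limit int64) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, limit)
		body, err := io.ReadAll(r.Body)
		var tooBig *http.MaxBytesError
		if errors.As(err, &tooBig) {
			// Assumed status code; the real middleware may answer differently.
			http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
			return
		}
		if err != nil {
			http.Error(w, "cannot read body", http.StatusBadRequest)
			return
		}
		fmt.Fprintf(w, "read %d bytes\n", len(body))
	}
}

// post drives the handler in-process and returns the HTTP status code.
func post(limit int64, body string) int {
	req := httptest.NewRequest(http.MethodPost, "/.well-known/pki/ocsp/iss-local", strings.NewReader(body))
	rec := httptest.NewRecorder()
	ocspPost(limit)(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(post(defaultMaxBody, strings.Repeat("x", 200))) // typical request size
	fmt.Println(post(64, strings.Repeat("x", 200)))             // artificially small cap
}
```

With the default cap a 200-byte request passes untouched; only the artificially small 64-byte cap trips the limit.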
|
||||
### Admin observability endpoint
|
||||
|
||||
```
GET https://<host>/api/v1/admin/crl/cache
Authorization: Bearer <token-with-admin-flag>
```
|
||||
|
||||
Returns the per-issuer cache state — for ops dashboards, GUI badges, or
|
||||
"is the scheduler keeping up?" diagnostics. Admin-gated (M-008 admin-gated
|
||||
handler allowlist; non-admin Bearer callers receive HTTP 403). Response shape:
|
||||
|
||||
```json
{
  "cache_rows": [
    {
      "issuer_id": "iss-local",
      "cache_present": true,
      "crl_number": 42,
      "this_update": "2026-04-29T10:00:00Z",
      "next_update": "2026-04-29T11:00:00Z",
      "generated_at": "2026-04-29T10:00:00Z",
      "generation_duration_ms": 87,
      "revoked_count": 13,
      "is_stale": false,
      "recent_events": [
        {
          "started_at": "2026-04-29T10:00:00Z",
          "duration_ms": 87,
          "succeeded": true,
          "crl_number": 42,
          "revoked_count": 13
        }
      ]
    }
  ],
  "row_count": 1,
  "generated_at": "2026-04-29T10:30:00Z"
}
```
|
||||
|
||||
Issuers that have not yet had a CRL generated appear with `cache_present:
|
||||
false` so the GUI can render a "Not yet generated" pill rather than 404.
|
||||
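A dashboard or health check could consume this response roughly as follows. The sketch decodes only a subset of the fields shown above and flags issuers whose cache is missing or stale; the Go type names (`cacheRow`, `cacheReport`) and the helper `needsAttention` are invented for the example and are not taken from the certctl codebase.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cacheRow and cacheReport cover a subset of the documented response shape.
// The Go type names are hypothetical.
type cacheRow struct {
	IssuerID     string `json:"issuer_id"`
	CachePresent bool   `json:"cache_present"`
	CRLNumber    int64  `json:"crl_number"`
	RevokedCount int    `json:"revoked_count"`
	IsStale      bool   `json:"is_stale"`
}

type cacheReport struct {
	CacheRows []cacheRow `json:"cache_rows"`
	RowCount  int        `json:"row_count"`
}

// needsAttention returns the issuer IDs whose CRL cache is absent or stale.
func needsAttention(body []byte) ([]string, error) {
	var rep cacheReport
	if err := json.Unmarshal(body, &rep); err != nil {
		return nil, err
	}
	var out []string
	for _, row := range rep.CacheRows {
		if !row.CachePresent || row.IsStale {
			out = append(out, row.IssuerID)
		}
	}
	return out, nil
}

// sample is a trimmed, made-up response with one healthy and one
// not-yet-generated issuer.
const sample = `{"cache_rows":[
  {"issuer_id":"iss-local","cache_present":true,"crl_number":42,"revoked_count":13,"is_stale":false},
  {"issuer_id":"iss-new","cache_present":false}
],"row_count":2}`

func main() {
	ids, err := needsAttention([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Println(ids)
}
```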
|
||||
---
|
||||
|
||||
## Configuration
|
||||
|
||||
| Env var | Default | Meaning |
| --- | --- | --- |
| `CERTCTL_CRL_GENERATION_INTERVAL` | `1h` | How often the scheduler walks every CRL-supporting issuer and rebuilds. The HTTP handler reads from the cache, not from a per-request rebuild. |
| `CERTCTL_OCSP_RESPONDER_KEY_DIR` | unset | **Operator MUST set in production.** Directory where the FileDriver persists each issuer's OCSP responder key (`ocsp-responder-<issuer_id>.key`). When unset, the responder service uses a temporary directory that does NOT survive restarts — fine for dev, NEVER for prod. |
| `CERTCTL_OCSP_RESPONDER_ROTATION_GRACE` | `7d` | When the responder cert's `NotAfter` falls within this window, `EnsureResponder` rotates to a fresh cert+key on the next OCSP request or scheduler tick. |
| `CERTCTL_OCSP_RESPONDER_VALIDITY` | `30d` | How long each newly-issued responder cert is valid for. Short by design — relying parties cache OCSP responses, not the responder cert chain, and `id-pkix-ocsp-nocheck` blocks recursive revocation checking on the responder itself. |
|
||||
|
||||
The issuer-level CRL `nextUpdate` is derived from the generation timestamp +
|
||||
the configured CRL validity (currently a build-time constant in the
|
||||
`CRLCacheService`; configurable knob deferred until an operator asks).
|
||||
|
||||
---
|
||||
|
||||
## OCSP responder cert lifecycle
|
||||
|
||||
1. **First OCSP request for an issuer (or scheduler tick).** The local
|
||||
issuer's `SignOCSPResponse` calls into `OCSPResponderService.EnsureResponder`.
|
||||
2. **Cache lookup.** `EnsureResponder` queries the `ocsp_responders` table for
|
||||
a row keyed by `issuer_id`.
|
||||
3. **Disk lookup.** If a row exists, the FileDriver reads the persisted key
|
||||
from `<keydir>/ocsp-responder-<issuer_id>.key`. **Self-healing:** if the
|
||||
row exists but the file is missing (operator pruned the keydir without
|
||||
pruning the DB), the service treats this as "rotate now" rather than
|
||||
crashing.
|
||||
4. **Rotation check.** If `cert.NotAfter < now + RotationGrace`, the service
|
||||
generates a fresh ECDSA-P256 key, builds a `*x509.CertificateRequest`,
|
||||
and asks the local issuer's existing `IssueCertificate` flow to sign it.
|
||||
The signing template carries:
|
||||
- `KeyUsage: x509.KeyUsageDigitalSignature` (signing OCSP responses)
|
||||
- `ExtKeyUsage: x509.ExtKeyUsageOCSPSigning` (RFC 6960 §4.2.2.2)
|
||||
- The `id-pkix-ocsp-nocheck` extension (OID `1.3.6.1.5.5.7.48.1.5`,
|
||||
DER value `NULL`, RFC 6960 §4.2.2.2.1) wired through
|
||||
`Certificate.ExtraExtensions`.
|
||||
5. **Persistence.** The new cert + key path are written to `ocsp_responders`
|
||||
via an idempotent `INSERT … ON CONFLICT DO UPDATE`.
|
||||
6. **Response signing.** `ocsp.CreateResponse(caCert, responderCert,
|
||||
template, responderSigner)` produces the response bytes; the responder
|
||||
cert is included in the response chain so relying parties can validate
|
||||
without a separate fetch.
|
||||
|
||||
The race between scheduler-driven cache refresh and on-demand cache miss is
|
||||
collapsed by the `CRLCacheService`'s in-tree singleflight (a `sync.Map` of
|
||||
`*flightEntry` keyed by `issuer_id`). Concurrent generation requests for the
|
||||
same issuer wait on the in-flight result rather than each rebuilding from
|
||||
scratch.
|
||||
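The singleflight pattern described above can be sketched as follows. Only the `sync.Map` of `*flightEntry` keyed by `issuer_id` comes from the text; the struct fields, the `crlFlights` type, and the `do` method are guesses at a plausible shape rather than the actual `CRLCacheService` code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// flightEntry holds the result of one in-flight generation.
// Field names are hypothetical.
type flightEntry struct {
	done chan struct{} // closed once val and err are set
	val  []byte
	err  error
}

type crlFlights struct {
	inflight sync.Map // issuer_id -> *flightEntry
}

// do runs build at most once per issuer at a time. Callers that arrive
// while a build is in flight wait for it and share its result.
func (f *crlFlights) do(issuerID string, build func() ([]byte, error)) ([]byte, error) {
	entry := &flightEntry{done: make(chan struct{})}
	if existing, loaded := f.inflight.LoadOrStore(issuerID, entry); loaded {
		e := existing.(*flightEntry)
		<-e.done
		return e.val, e.err
	}
	entry.val, entry.err = build()
	f.inflight.Delete(issuerID) // later callers start a fresh build
	close(entry.done)           // wake every waiter
	return entry.val, entry.err
}

func main() {
	var flights crlFlights
	var builds atomic.Int32
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			flights.do("iss-local", func() ([]byte, error) {
				builds.Add(1)
				time.Sleep(50 * time.Millisecond) // simulate an expensive rebuild
				return []byte("crl"), nil
			})
		}()
	}
	wg.Wait()
	fmt.Println("builds:", builds.Load()) // far fewer than 8 when callers overlap
}
```

Closing the `done` channel after the result fields are written gives waiters a happens-before guarantee, so they can read `val` and `err` without extra locking.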
|
||||
---
|
||||
|
||||
## Pointing common consumers at the endpoints
|
||||
|
||||
### cert-manager (Kubernetes)
|
||||
|
||||
cert-manager's certificate-validation logic checks both the AIA OCSP URI
|
||||
embedded in the leaf and the CDP CRL URI. Both are populated automatically
|
||||
by the local issuer's certificate template — relying parties should NOT
|
||||
need any additional configuration. To verify:
|
||||
|
||||
```bash
openssl x509 -in leaf.crt -text -noout | grep -A1 "Authority Information Access"
openssl x509 -in leaf.crt -text -noout | grep -A2 "CRL Distribution Points"
```
|
||||
|
||||
If your cert-manager pods cannot reach `https://<certctl-host>:8443/.well-known/pki/`,
|
||||
add a NetworkPolicy egress rule or expose the certctl service via the
|
||||
appropriate ingress class.
|
||||
|
||||
### Firefox
|
||||
|
||||
Firefox honors the AIA OCSP URI by default. To force-refresh the local
|
||||
revocation cache after revoking a cert in dev:
|
||||
|
||||
```
about:preferences#privacy → Certificates → Query OCSP responder servers
```
|
||||
|
||||
If Firefox reports `SEC_ERROR_OCSP_INVALID_SIGNING_CERT`, verify that the
responder cert chain is reachable from the system trust store. Firefox is
strict about the `id-pkix-ocsp-nocheck` extension, which certctl sets
automatically on every responder cert it issues.
|
||||
|
||||
### OpenSSL
|
||||
|
||||
```bash
# OCSP via stand-alone request
openssl ocsp -issuer ca.crt -cert leaf.crt -url https://localhost:8443/.well-known/pki/ocsp/iss-local -CAfile ca.crt -text

# OCSP via TLS Certificate Status Request extension
openssl s_client -connect example.com:443 -status -CAfile ca.crt
```
|
||||
|
||||
### Intune (corporate device state)
|
||||
|
||||
Intune device-compliance validators pull the CRL on a schedule (configured in
|
||||
the Intune admin console, default 24h). Configure the CRL distribution point
|
||||
to `https://<certctl-host>:8443/.well-known/pki/crl/<issuer_id>` and Intune
|
||||
will pull on its own cadence.
|
||||
|
||||
---
|
||||
|
||||
## Production hardening II additions (post-2026-04-30)
|
||||
|
||||
The following capabilities were folded into V2 (free) by the production
|
||||
hardening II bundle. Each closes a real procurement-team checklist gap
|
||||
without requiring a paid tier.
|
||||
|
||||
### OCSP nonce extension (RFC 6960 §4.4.1)
|
||||
|
||||
The POST OCSP handler echoes the request's nonce extension (OID
|
||||
`1.3.6.1.5.5.7.48.1.2`) in the response. Defends against replay attacks
|
||||
where a relying party's cached response is replayed against a now-revoked
|
||||
cert. Always-on; no operator opt-out.
|
||||
|
||||
Failure modes:
|
||||
|
||||
- **No nonce in request** — back-compat; response omits the extension.
|
||||
- **Well-formed nonce ≤ 32 bytes** — response echoes it; tracked in
|
||||
`certctl_ocsp_counter_total{label="nonce_echoed"}`.
|
||||
- **Empty or oversized nonce (> 32 bytes per CA/B Forum BR §4.10.2)** —
|
||||
responder returns the canonical "unauthorized" status (RFC 6960 §2.3
|
||||
status 6); tracked in `certctl_ocsp_counter_total{label="nonce_malformed"}`.
|
||||
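The three failure modes above reduce to a small classification function. The sketch below assumes the nonce extension has already been extracted from the request; the names `classifyNonce`, `nonceOutcome`, and `counterLabel` are illustrative, and only the 32-byte bound and the two counter labels come from the list above.

```go
package main

import "fmt"

type nonceOutcome int

const (
	nonceAbsent    nonceOutcome = iota // response omits the extension
	nonceEchoed                        // response echoes the nonce
	nonceMalformed                     // responder answers "unauthorized"
)

// maxNonceLen is the documented upper bound in bytes.
const maxNonceLen = 32

// classifyNonce decides how the responder treats a request nonce.
// present reports whether the request carried the extension at all;
// value is the nonce payload when it did.
func classifyNonce(present bool, value []byte) nonceOutcome {
	switch {
	case !present:
		return nonceAbsent
	case len(value) == 0 || len(value) > maxNonceLen:
		return nonceMalformed
	default:
		return nonceEchoed
	}
}

// counterLabel maps an outcome to the documented metric label.
// An absent nonce has no dedicated label in the text, so it maps to "".
func counterLabel(o nonceOutcome) string {
	switch o {
	case nonceEchoed:
		return "nonce_echoed"
	case nonceMalformed:
		return "nonce_malformed"
	default:
		return ""
	}
}

func main() {
	fmt.Println(counterLabel(classifyNonce(true, make([]byte, 16))))
	fmt.Println(counterLabel(classifyNonce(true, make([]byte, 33))))
}
```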
|
||||
### OCSP pre-signed response cache
|
||||
|
||||
Mirrors the existing CRL cache. Per-(issuer, serial) entries pre-signed
|
||||
and stored in `ocsp_response_cache`; the read-through facade in
|
||||
`CAOperationsSvc.GetOCSPResponseWithNonce` consults the cache for
|
||||
nil-nonce requests and falls through to live signing on miss + writes
|
||||
the result back. Nonce-bearing requests always live-sign because the
|
||||
cache stores nil-nonce blobs.
|
||||
|
||||
**Load-bearing security wire:** `RevocationSvc.RevokeCertificateWithActor`
|
||||
calls `InvalidateOnRevoke` after a successful revocation so the next
|
||||
OCSP fetch returns the revoked status. There is no stale-good window
|
||||
after revoke.
|
||||
|
||||
### Per-source-IP OCSP rate limit + per-actor cert-export rate limit
|
||||
|
||||
Defaults: 1000 req/min/IP for OCSP; 50 exports/hr/operator for the
|
||||
cert-export endpoints. Configurable via
|
||||
`CERTCTL_OCSP_RATE_LIMIT_PER_IP_MIN` and
|
||||
`CERTCTL_CERT_EXPORT_RATE_LIMIT_PER_ACTOR_HR`; zero disables.
|
||||
|
||||
OCSP rate-limit trip: canonical "unauthorized" OCSP blob plus
|
||||
`Retry-After: 60`. Cert-export trip: HTTP 429 + JSON
|
||||
`{"error":"rate_limit_exceeded","retry_after_seconds":3600}`.
|
||||
|
||||
The OCSP limiter does NOT honor `X-Forwarded-For` because OCSP is
|
||||
publicly reachable and untrusted intermediaries could spoof the header
|
||||
to bypass the cap.
|
||||
|
||||
### CRL HTTP caching headers (RFC 7232)
|
||||
|
||||
`GET /.well-known/pki/crl/{issuer_id}` now returns weak-form ETag,
|
||||
`Cache-Control: public, max-age=3600, must-revalidate`, and respects
|
||||
`If-None-Match` for HTTP 304 short-circuits. Lets CDNs and reverse
|
||||
proxies serve repeated fetches from edge cache.
|
||||
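The conditional-request behavior can be sketched with `net/http` as below. How certctl derives its ETag is not stated here, so hashing the DER bytes is an assumption, as are the names `weakETag`, `serveCRL`, and `fetch`. The sketch also compares `If-None-Match` by exact string equality only, while a complete implementation would handle entity-tag lists and `*`.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// weakETag derives a weak validator from the CRL bytes (assumed scheme).
func weakETag(crlDER []byte) string {
	sum := sha256.Sum256(crlDER)
	return `W/"` + hex.EncodeToString(sum[:8]) + `"`
}

// serveCRL serves a cached CRL with the documented caching headers and
// answers 304 when the client already holds the current version.
func serveCRL(crlDER []byte) http.HandlerFunc {
	etag := weakETag(crlDER)
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("ETag", etag)
		w.Header().Set("Cache-Control", "public, max-age=3600, must-revalidate")
		if r.Header.Get("If-None-Match") == etag {
			w.WriteHeader(http.StatusNotModified)
			return
		}
		w.Header().Set("Content-Type", "application/pkix-crl")
		w.Write(crlDER)
	}
}

// fetch issues one in-process GET and returns the status code and ETag.
func fetch(h http.Handler, ifNoneMatch string) (int, string) {
	req := httptest.NewRequest(http.MethodGet, "/.well-known/pki/crl/iss-local", nil)
	if ifNoneMatch != "" {
		req.Header.Set("If-None-Match", ifNoneMatch)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Header().Get("ETag")
}

func main() {
	h := serveCRL([]byte("placeholder CRL bytes"))
	code, etag := fetch(h, "")
	fmt.Println(code, etag)
	code, _ = fetch(h, etag)
	fmt.Println(code)
}
```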
|
||||
### CRL DistributionPoint auto-injection
|
||||
|
||||
Local issuer config field `CRLDistributionPointURLs []string`; when
|
||||
non-empty, every issued cert carries the RFC 5280 §4.2.1.13
|
||||
`id-ce-cRLDistributionPoints` extension pointing at certctl's CRL
|
||||
endpoint. Refusing to silently inject an empty CDP is deliberate —
|
||||
silent-empty fails relying-party validation worse than no CDP.
|
||||
|
||||
### Cert-export typed audit codes + Prometheus per-area metrics
|
||||
|
||||
Audit emission now carries typed action constants
|
||||
(`cert_export_pem`, `cert_export_pkcs12`, `cert_export_failed`)
|
||||
alongside legacy bare codes. Detail map enriched with
|
||||
`has_private_key` (always false in V2) and `cipher`
|
||||
(`AES-256-CBC-PBE2-SHA256` — pinned).
|
||||
|
||||
`GET /api/v1/metrics/prometheus` surfaces the new per-area counters
|
||||
under the `certctl_<area>_counter_total{label=...}` family. OCSP
|
||||
shipped in this bundle; alert recommendations:
|
||||
|
||||
- `{label="rate_limited"}` rate > 0 sustained > 5m → notify (limiter
|
||||
is doing its job; investigate source IP).
|
||||
- `{label="nonce_malformed"}` > 0 → notify (legitimate clients don't
|
||||
send malformed nonces).
|
||||
- `{label="signing_failed"}` > 0 → page on-call (issuer connector
|
||||
failing).
|
||||
|
||||
## What this release does NOT include (V3-Pro)
|
||||
|
||||
Still out of scope for V2; tracked for V3-Pro:
|
||||
|
||||
- **Delta CRLs (RFC 5280 §5.2.4).** Useful for very large CRLs (10k+
|
||||
revoked certs); the data model accommodates the Base CRL Number
|
||||
reference but the pipeline only emits Base CRLs in V2.
|
||||
- **OCSP stapling at SCEP/EST CertRep response time.** Server-side
|
||||
pre-staple into the TLS handshake context.
|
||||
- **OCSP request signature verification (RFC 6960 §4.1.1).** Optional
|
||||
per-spec; certctl currently ignores the signature.
|
||||
- **OCSP responder HA / multi-region replication.** Active-active
|
||||
OCSP cache with Postgres logical replication.
|
||||
- **CRL Issuing Distribution Point (IDP) extension** (RFC 5280
|
||||
§5.2.5) — for sharded CRL deployments.
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**`pki/crl/<issuer_id>` returns 404.** The issuer either does not support
|
||||
CRL signing (Vault, EJBCA, DigiCert serve their own CRL infrastructure;
|
||||
certctl's connectors return `nil` from `GenerateCRL` for these) or the
|
||||
issuer ID is wrong. Verify with `GET /api/v1/issuers`.
|
||||
|
||||
**`pki/ocsp/<issuer_id>/<serial>` returns 200 but `openssl ocsp -text`
shows "unauthorized".** Check that the serial in the URL is hex-encoded:
lowercase, no `0x` prefix, leading zeros preserved. Mismatched serials
return an OCSP response with status `unauthorized` per RFC 6960 §2.3.
|
||||
|
||||
**Admin cache endpoint returns 403.** The Bearer key does not carry the
|
||||
admin flag. M-008 gates this endpoint server-side; the GUI also gates the
|
||||
fetch on `useAuth().admin`. Either escalate the key (`certctl admin
|
||||
keys promote <key-id>`) or use a different identity.
|
||||
|
||||
**Cache shows `is_stale: true` repeatedly.** The scheduler is not running
|
||||
(or not getting scheduled often enough). Check `CERTCTL_CRL_GENERATION_INTERVAL`
|
||||
and confirm the scheduler started: `grep crlGenerationLoop` in the server
|
||||
logs at startup.
|
||||
@@ -0,0 +1,809 @@
|
||||
# EST (RFC 7030) — Operator Guide
|
||||
|
||||
> **Status (this document):** EST RFC 7030 hardening master bundle Phases
|
||||
> 1–11 shipped on `master`; this guide is the Phase-12 deliverable
|
||||
> against the bundle. Every behavior described here is exercised by the
|
||||
> tests at `internal/api/handler/est*_test.go`,
|
||||
> `internal/service/est*_test.go`, and (for the libest interop layer)
|
||||
> `deploy/test/est_e2e_test.go` under `//go:build integration`. The
|
||||
> bundle is **V2-free**; per-tenant CA isolation, Conditional-Access
|
||||
> compliance gating, and EST cert-bound usage analytics are documented
|
||||
> as V3-Pro deferrals in [V3-Pro deferrals](#v3-pro-deferrals).
|
||||
|
||||
## Contents
|
||||
|
||||
1. [Concepts](#concepts)
|
||||
2. [Quick start](#quick-start)
|
||||
3. [Multi-profile dispatch](#multi-profile-dispatch)
|
||||
4. [Authentication modes](#authentication-modes)
|
||||
5. [RFC 9266 channel binding](#rfc-9266-channel-binding)
|
||||
6. [WiFi / 802.1X recipe (FreeRADIUS)](#wifi--8021x-recipe-freeradius)
|
||||
7. [IoT bootstrap recipe](#iot-bootstrap-recipe)
|
||||
8. [`serverkeygen` for resource-constrained devices](#serverkeygen-for-resource-constrained-devices)
|
||||
9. [HSM-backed CA signing for EST](#hsm-backed-ca-signing-for-est)
|
||||
10. [Operator GUI (EST Admin tabs)](#operator-gui-est-admin-tabs)
|
||||
11. [CLI + MCP tools](#cli--mcp-tools)
|
||||
12. [Renewal: device-driven model](#renewal-device-driven-model)
|
||||
13. [Troubleshooting matrix](#troubleshooting-matrix)
|
||||
14. [TLS 1.2 reverse-proxy runbook](#tls-12-reverse-proxy-runbook)
|
||||
15. [Threat model](#threat-model)
|
||||
16. [V3-Pro deferrals](#v3-pro-deferrals)
|
||||
17. [Appendix A: libest reference client](#appendix-a-libest-reference-client)
|
||||
18. [Appendix B: RFC 7030 wire-format quirks](#appendix-b-rfc-7030-wire-format-quirks)
|
||||
19. [Related docs](#related-docs)
|
||||
|
||||
## Concepts
|
||||
|
||||
EST (RFC 7030) is the IETF-standardized successor to SCEP for device
|
||||
enrollment over HTTPS. certctl ships a native EST server that handles
|
||||
all six RFC 7030 endpoints — `cacerts`, `simpleenroll`,
|
||||
`simplereenroll`, `csrattrs`, `serverkeygen`, and (proxy-pass)
|
||||
`fullcmc` — out of a single binary, with per-profile dispatch so a
|
||||
single deploy can serve multiple device fleets from the same control
|
||||
plane.
|
||||
|
||||
**EST is a handler-level protocol, not a connector.** The
|
||||
`ESTHandler` parses the wire format, enforces auth, and delegates
|
||||
issuance to whichever `IssuerConnector` the profile binds. EST does
|
||||
not replace your CA — it sits in front of the local CA, Vault PKI,
|
||||
EJBCA, ADCS, step-ca, or anything else certctl already knows how to
|
||||
issue against. Devices submit a CSR; certctl validates, gates, signs,
|
||||
and returns a PKCS#7 certs-only response.
|
||||
|
||||
**Two enrollment models, one server.**
|
||||
|
||||
- **Host enrollment** — a long-lived device or laptop boots, generates
|
||||
its own keypair locally, and enrolls via `simpleenroll` (initial)
|
||||
then `simplereenroll` (renewal) over the device's TLS-pinned
|
||||
channel. Private keys never leave the device.
|
||||
- **User enrollment** — a network supplicant (corporate WiFi, VPN
|
||||
client) drives `simpleenroll` against certctl on behalf of the user
|
||||
identity. The CSR carries the user UPN as a SAN; the FreeRADIUS or
|
||||
VPN policy gates session establishment on cert validity.
|
||||
|
||||
**Profile-driven policy.** Every EST profile carries its own:
|
||||
|
||||
- Issuer binding (`CERTCTL_EST_PROFILE_<NAME>_ISSUER_ID`)
|
||||
- Optional `CertificateProfile` (`_PROFILE_ID`) that constrains
|
||||
allowed key algorithms, key sizes, EKUs, SANs, max TTL, and
|
||||
must-staple
|
||||
- Auth mode mix: mTLS only, HTTP Basic only, both, or none (for
|
||||
back-compat with anonymous deploys — strongly discouraged)
|
||||
- Optional RFC 9266 `tls-exporter` channel binding
|
||||
- Optional per-(CN, sourceIP) sliding-window rate limit
|
||||
- Optional server-side keygen
|
||||
|
||||
The per-profile family is documented exhaustively in
|
||||
[`features.md`](features.md).
|
||||
|
||||
**Multi-profile dispatch.** `CERTCTL_EST_PROFILES=corp,iot,wifi`
|
||||
publishes three independent endpoint groups under
|
||||
`/.well-known/est/<pathID>/`. Each profile's auth, trust anchor, and
|
||||
issuer binding is isolated; a compromise of one profile's enrollment
|
||||
password does not affect any other profile.
|
||||
|
||||
## Quick start
|
||||
|
||||
The five-minute single-profile setup runs EST anonymously over
|
||||
HTTPS-only. **Use this only on a private network during evaluation;**
|
||||
production deploys MUST set an auth mode (see
|
||||
[Authentication modes](#authentication-modes)).
|
||||
|
||||
1. Have certctl running with TLS configured per [`tls.md`](tls.md).
|
||||
The control plane listens on `:8443`; EST shares the same listener
|
||||
under `/.well-known/est/`.
|
||||
2. Set the legacy single-profile env vars in your compose file or
|
||||
Helm values:
|
||||
|
||||
```
CERTCTL_EST_ENABLED=true
CERTCTL_EST_ISSUER_ID=iss-local
```
|
||||
|
||||
3. Restart certctl. The startup log line `EST server enabled` should
|
||||
surface; the routes `/.well-known/est/{cacerts,simpleenroll,simplereenroll,csrattrs}`
|
||||
are now live.
|
||||
4. Ground-truth check from a client host:
|
||||
|
||||
```bash
curl -sS --cacert /path/to/ca.crt \
  https://certctl.example.com:8443/.well-known/est/cacerts \
  | base64 -d | openssl pkcs7 -inform DER -print_certs -noout
```
|
||||
|
||||
You should see your CA cert subject and `NotAfter`. This is the
|
||||
`/cacerts` endpoint serving the PKCS#7 SignedData certs-only
|
||||
response per RFC 7030 §4.1.
|
||||
|
||||
5. Generate a CSR and enroll:
|
||||
|
||||
```bash
openssl ecparam -name prime256v1 -genkey -noout -out device.key
openssl req -new -key device.key -subj "/CN=device-001.example.com" -out device.csr
curl -sS --cacert /path/to/ca.crt \
  -H "Content-Type: application/pkcs10" \
  --data-binary @<(openssl req -in device.csr -outform DER | base64 -w0) \
  https://certctl.example.com:8443/.well-known/est/simpleenroll \
  | base64 -d | openssl pkcs7 -inform DER -print_certs > device.crt
```
|
||||
|
||||
The response is a PKCS#7 certs-only blob; the issued cert lands in
|
||||
`device.crt`.
|
||||
|
||||
If the curl fails with a TLS error, walk through [`tls.md`](tls.md);
the EST handler uses the same listener as the REST API and shares no
trust policy with the legacy plaintext `:8080` listener of pre-v2.2
deploys (removed when the HTTPS-only policy landed).
|
||||
|
||||
## Multi-profile dispatch
|
||||
|
||||
A single certctl binary publishes one EST endpoint group per name in
|
||||
`CERTCTL_EST_PROFILES`. Set the comma-separated list, then a matching
|
||||
set of `CERTCTL_EST_PROFILE_<NAME>_*` env vars per profile:
|
||||
|
||||
```
CERTCTL_EST_ENABLED=true
CERTCTL_EST_PROFILES=corp,iot,wifi

# per-profile config — `<NAME>` placeholder gets replaced by the
# uppercased name from the list (so "corp" → CORP, "iot" → IOT,
# "wifi" → WIFI). The URL path uses the lowercased form.
CERTCTL_EST_PROFILE_<NAME>_ISSUER_ID=iss-local
CERTCTL_EST_PROFILE_<NAME>_PROFILE_ID=cp-corp-laptops
CERTCTL_EST_PROFILE_<NAME>_ENROLLMENT_PASSWORD=<random>
CERTCTL_EST_PROFILE_<NAME>_ALLOWED_AUTH_MODES=basic
```
|
||||
|
||||
This publishes:
|
||||
|
||||
- `/.well-known/est/corp/{cacerts,simpleenroll,simplereenroll,csrattrs,serverkeygen}`
|
||||
- `/.well-known/est/iot/...`
|
||||
- `/.well-known/est/wifi/...`
|
||||
|
||||
Each profile is independently validated at startup (see
|
||||
`internal/config/config.go::Validate`). Per-profile failures log the
|
||||
offending PathID and refuse the boot. The legacy single-profile
|
||||
shape (`CERTCTL_EST_ENABLED` + `CERTCTL_EST_ISSUER_ID` without
|
||||
`CERTCTL_EST_PROFILES`) continues to work — the back-compat shim in
|
||||
`loadESTProfilesFromEnv` synthesises a single profile bound to the
|
||||
empty PathID, which the router serves at `/.well-known/est/` (no
|
||||
path component).
|
||||
|
||||
PathID rules (enforced at boot):
|
||||
|
||||
- Lowercased ASCII `[a-z0-9-]+` only, no leading/trailing hyphen.
|
||||
- Distinct PathIDs per profile (no duplicates).
|
||||
- Reserved name `est` rejected (would collide with the legacy root).
|
||||
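The PathID rules above can be expressed as a short validator. This is a sketch of the stated rules only, not the code in `internal/config/config.go`; `validatePathIDs`, `envPrefix`, and the regular expression are written for this example. The text does not say how a hyphen in a PathID maps into an environment variable name, so `envPrefix` simply uppercases the name and should be read with that caveat.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// pathIDPattern encodes: lowercase ASCII letters, digits and hyphens,
// with no leading or trailing hyphen.
var pathIDPattern = regexp.MustCompile(`^[a-z0-9]([a-z0-9-]*[a-z0-9])?$`)

// validatePathIDs applies the three documented boot-time rules.
func validatePathIDs(ids []string) error {
	seen := make(map[string]bool)
	for _, id := range ids {
		switch {
		case !pathIDPattern.MatchString(id):
			return fmt.Errorf("invalid PathID %q", id)
		case id == "est":
			return fmt.Errorf("PathID %q is reserved", id)
		case seen[id]:
			return fmt.Errorf("duplicate PathID %q", id)
		}
		seen[id] = true
	}
	return nil
}

// envPrefix builds the per-profile env var prefix from a PathID.
// Hyphen handling is an open assumption; see the lead-in.
func envPrefix(id string) string {
	return "CERTCTL_EST_PROFILE_" + strings.ToUpper(id) + "_"
}

func main() {
	ids := strings.Split("corp,iot,wifi", ",")
	if err := validatePathIDs(ids); err != nil {
		panic(err)
	}
	for _, id := range ids {
		fmt.Println(envPrefix(id)+"ISSUER_ID", "->", "/.well-known/est/"+id+"/")
	}
}
```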
|
||||
Mirrors the SCEP `CERTCTL_SCEP_PROFILES` family from the SCEP RFC
|
||||
8894 master bundle — see [`legacy-est-scep.md`](legacy-est-scep.md)
|
||||
for the SCEP equivalent.
|
||||
|
||||
## Authentication modes
|
||||
|
||||
certctl supports three EST authentication topologies per profile,
|
||||
mixed and matched via `CERTCTL_EST_PROFILE_<NAME>_ALLOWED_AUTH_MODES`:
|
||||
|
||||
| Mode | Endpoint | When to use |
|---------|--------------------------------------|-------------|
| `mtls` | `/.well-known/est-mtls/<pathID>/...` | The device already has a bootstrap cert (factory-provisioned, previous-cert renewal, or out-of-band onboarding). Enterprise procurement teams almost always require this for production fleets — shared-password auth is a checkbox-fail regardless of password strength. |
| `basic` | `/.well-known/est/<pathID>/...` | First-cert bootstrap when no prior cert exists. The `_ENROLLMENT_PASSWORD` is a per-profile shared secret; constant-time comparison via `crypto/subtle.ConstantTimeCompare`. Pair with the source-IP failed-auth rate limit (see below). |
| both | both routes published | Migration window: existing devices renew via mTLS, new devices bootstrap via Basic. Same profile config, just both routes registered. |
| (empty) | `/.well-known/est/<pathID>/...` | Anonymous; no auth required at the EST layer. Back-compat for pre-Phase-1 deploys. Hardened-deployment best practice is to set this explicitly to `basic` or `mtls` — a future bundle may flip the default. |
|
||||
|
||||
Per-profile cross-check enforced at boot:
|
||||
|
||||
- `mtls` in the list requires `_MTLS_ENABLED=true` AND
|
||||
`_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH` non-empty.
|
||||
- `basic` in the list requires `_ENROLLMENT_PASSWORD` non-empty.
|
||||
- Unknown auth modes refused at boot with the offending token in the
|
||||
error message.
|
||||
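The boot-time cross-check and the constant-time password comparison mentioned in the table can be sketched together. The `profile` struct and its field names are stand-ins invented for this example; only the three rules and the use of `crypto/subtle.ConstantTimeCompare` come from the text.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// profile holds the per-profile settings the cross-check needs.
// Field names are hypothetical.
type profile struct {
	PathID              string
	AllowedAuthModes    []string
	MTLSEnabled         bool
	MTLSTrustBundlePath string
	EnrollmentPassword  string
}

// validateAuth applies the documented per-profile rules. An empty mode
// list passes, matching the anonymous back-compat row of the table.
func validateAuth(p profile) error {
	for _, mode := range p.AllowedAuthModes {
		switch mode {
		case "mtls":
			if !p.MTLSEnabled || p.MTLSTrustBundlePath == "" {
				return fmt.Errorf("profile %q: mtls requires MTLS_ENABLED=true and a client CA trust bundle path", p.PathID)
			}
		case "basic":
			if p.EnrollmentPassword == "" {
				return fmt.Errorf("profile %q: basic requires an enrollment password", p.PathID)
			}
		default:
			return fmt.Errorf("profile %q: unknown auth mode %q", p.PathID, mode)
		}
	}
	return nil
}

// passwordMatches compares secrets without an early exit on the first
// differing byte. ConstantTimeCompare still returns immediately when the
// lengths differ, so the length of the secret is not hidden.
func passwordMatches(got, want string) bool {
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

func main() {
	p := profile{PathID: "corp", AllowedAuthModes: []string{"basic"}, EnrollmentPassword: "example-secret"}
	fmt.Println(validateAuth(p))
	fmt.Println(passwordMatches("example-secret", p.EnrollmentPassword))
	p.AllowedAuthModes = []string{"basic", "digest"}
	fmt.Println(validateAuth(p))
}
```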
|
||||
**Source-IP failed-auth rate limit.** When `_ENROLLMENT_PASSWORD` is
|
||||
set and the Basic-auth gate trips, the handler increments a sliding-
|
||||
window counter keyed on the source IP. After 10 consecutive failures
|
||||
in an hour, the source is locked out (HTTP 429-equivalent failure
|
||||
code) for the rest of the window. The limiter is process-local
|
||||
(50k-IP cap, sliding 1h window — defaults; tunable in a follow-up).
|
||||
This is independent of the per-(CN, sourceIP) per-principal limiter
|
||||
discussed under [Renewal](#renewal-device-driven-model).
|
||||
|
||||
## RFC 9266 channel binding
|
||||
|
||||
When `CERTCTL_EST_PROFILE_<NAME>_CHANNEL_BINDING_REQUIRED=true`, the
|
||||
EST handler enforces RFC 9266 `tls-exporter` channel binding. The
|
||||
client must include an `id-aa-channelBindings` attribute in the CSR
|
||||
whose value matches the server's
|
||||
`r.TLS.ExportKeyingMaterial("EXPORTER-Channel-Binding", nil, 32)`
|
||||
output, computed independently at request time.
|
||||
|
||||
What this defends against: an attacker that bridges two TLS
|
||||
connections (one client → attacker, another attacker → certctl) and
|
||||
forwards the device's CSR through the attacker's TLS session. Without
|
||||
channel binding, certctl sees a valid CSR submitted over a TLS
|
||||
session authenticated by the attacker's cert; with channel binding,
|
||||
the CSR's binding bytes only match if the CSR was signed against
|
||||
THIS TLS session's exporter material.
|
||||
|
||||
Failure mode mapping:
|
||||
|
||||
| Server-side error | HTTP status | Meaning |
|-----------------------------|-------------|---------|
| `ErrChannelBindingMissing` | 400 | `_CHANNEL_BINDING_REQUIRED=true` but the CSR's attribute is absent. Bad client config (or a non-RFC-9266 EST client). |
| `ErrChannelBindingMismatch` | 409 | Attribute present but doesn't match the live exporter — MITM signal. Treat as a security event, log the source IP. |
| `ErrChannelBindingNotTLS13` | 426 | Client connected over TLS 1.2 — `tls-exporter` requires TLS 1.3. Upgrade client OR rely on the TLS-1.2 reverse-proxy runbook. |
|
||||
|
||||
Cross-check at boot: setting `_CHANNEL_BINDING_REQUIRED=true` on a
|
||||
profile with `_MTLS_ENABLED=false` is refused — channel binding is
|
||||
meaningful only when mTLS is in use (otherwise the binding has no
|
||||
client identity to bind to).
|
||||
|
||||
**libest support.** Cisco libest v3.0+ supports the RFC 9266
|
||||
`--tls-exporter` flag. Older builds (commonly distros' packaged
|
||||
versions through 2024) do not; per-profile opt-out via leaving the
|
||||
env var `false` is the migration path. The libest sidecar in
|
||||
`deploy/test/libest/Dockerfile` builds v3.2.0-2 from source and
|
||||
includes the flag.
|
||||
|
||||
## WiFi / 802.1X recipe (FreeRADIUS)
|
||||
|
||||
This recipe stands up an EAP-TLS-authenticated corporate WiFi network
|
||||
where certctl issues every device certificate via EST. End-to-end
|
||||
flow:
|
||||
|
||||
```mermaid
flowchart LR
    Laptop["Laptop / supplicant<br/>(wpa_supplicant / iwd / Apple WiFi)"]
    AP["WiFi access point (NAS)"]
    Radius["FreeRADIUS<br/>(validate cert chain)"]
    CA["certctl CA<br/>(EST profile 'wifi')"]
    Laptop -->|EAP| AP
    AP -->|Radius| Radius
    Radius -.->|trusts| CA
    Laptop -->|"EST: /simpleenroll, /simplereenroll<br/>(one-time, then renewal)"| CA
```
|
||||
|
||||
### certctl-side: EST profile config for 802.1X

```
CERTCTL_EST_ENABLED=true
CERTCTL_EST_PROFILES=wifi
CERTCTL_EST_PROFILE_<NAME>_ISSUER_ID=iss-local
CERTCTL_EST_PROFILE_<NAME>_PROFILE_ID=cp-wifi-eap-tls
CERTCTL_EST_PROFILE_<NAME>_MTLS_ENABLED=true
CERTCTL_EST_PROFILE_<NAME>_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH=/etc/certctl/wifi-bootstrap-ca.pem
CERTCTL_EST_PROFILE_<NAME>_ALLOWED_AUTH_MODES=mtls
CERTCTL_EST_PROFILE_<NAME>_CHANNEL_BINDING_REQUIRED=true
CERTCTL_EST_PROFILE_<NAME>_RATE_LIMIT_PER_PRINCIPAL_24H=3
```

In the block above, `<NAME>` is a placeholder for the profile name
listed in `CERTCTL_EST_PROFILES` (the `wifi` profile in this recipe).

The matching `CertificateProfile` (`cp-wifi-eap-tls`) configured via
the API or GUI:

- `AllowedKeyAlgorithms`: ECDSA P-256 (covers Apple, Android, modern
  laptop supplicants) plus optional RSA 2048+ for legacy clients.
- `AllowedEKUs`: `clientAuth` only (`1.3.6.1.5.5.7.3.2`). Drops
  `serverAuth` so a device cert can't be reused as a TLS server cert.
  EAP-TLS requires `clientAuth`; FreeRADIUS will reject certs without
  it when `eap_chain_check_eku` is on.
- `RequiredCSRAttributes`: `["deviceSerialNumber"]` so the device's
  serial appears in the issued cert (operators correlate WiFi grants
  back to inventory).
- `MaxTTLSeconds`: 31536000 (1 year). Long enough for laptop fleets
  that don't renew daily; short enough to limit the cert's blast
  radius on key compromise.

### Device-side: drive `simpleenroll` from the supplicant

For Linux/embedded laptops:

```bash
# Fail a pipeline when curl or base64 fails, not only its last stage.
set -o pipefail

# Bootstrap once (factory bootstrap cert presented over mTLS):
openssl ecparam -name prime256v1 -genkey -noout -out /etc/wifi/eap.key
openssl req -new -key /etc/wifi/eap.key \
  -subj "/CN=laptop-001/serialNumber=ABC123" \
  -out /etc/wifi/eap.csr
# -f makes curl exit non-zero on an HTTP error instead of piping the error body on.
curl -fsS --cacert /etc/certctl/ca.crt \
  --cert /etc/wifi/bootstrap.crt \
  --key /etc/wifi/bootstrap.key \
  -H "Content-Type: application/pkcs10" \
  --data-binary @<(openssl req -in /etc/wifi/eap.csr -outform DER | base64 -w0) \
  https://certctl.example.com:8443/.well-known/est-mtls/wifi/simpleenroll \
  | base64 -d | openssl pkcs7 -inform DER -print_certs > /etc/wifi/eap.crt

# Renewal cycle (cron, 10 days before NotAfter). The subject keeps the
# serialNumber so the CSR still carries the device serial the profile requires.
curl -fsS --cacert /etc/certctl/ca.crt \
  --cert /etc/wifi/eap.crt \
  --key /etc/wifi/eap.key \
  -H "Content-Type: application/pkcs10" \
  --data-binary @<(openssl req -new -key /etc/wifi/eap.key -subj "/CN=laptop-001/serialNumber=ABC123" -outform DER | base64 -w0) \
  https://certctl.example.com:8443/.well-known/est-mtls/wifi/simplereenroll \
  | base64 -d | openssl pkcs7 -inform DER -print_certs > /etc/wifi/eap.crt.new && \
  mv /etc/wifi/eap.crt.new /etc/wifi/eap.crt
```

For Apple-managed devices the equivalent flow is wrapped by an MDM
profile that drives EST. For ChromeOS the Admin Console SCEP profile
remains the easier path until Google's EST support stabilises (track
the [SCEP+ChromeOS guide](legacy-est-scep.md#scep-rfc-8894-native-implementation-post-2026-04-29)).

### FreeRADIUS-side: EAP-TLS configuration

In `mods-available/eap`:

```
eap {
    default_eap_type = tls
    tls-config tls-common {
        # The CA bundle that signed certctl's EST-issued device certs.
        # Save the certctl issuer's CA chain to this path; the
        # FreeRADIUS daemon reloads on HUP.
        ca_file = /etc/freeradius/certs/certctl-ca.pem

        # Server cert presented to the supplicant for tunnel TLS.
        # Separate cert chain: FreeRADIUS's own cert, NOT a certctl-
        # issued client cert.
        certificate_file = /etc/freeradius/certs/freeradius-server.pem
        private_key_file = /etc/freeradius/certs/freeradius-server.key

        # Require the supplicant cert's issuer DN to match the certctl CA.
        check_cert_issuer = "/CN=certctl-corp-ca"

        # Require the supplicant cert's CN to match the RADIUS User-Name.
        # (This directive checks the CN only; it does not check the EKU.)
        check_cert_cn = "%{User-Name}"
    }
    tls {
        tls = tls-common
    }
}
```

The matching `sites-available/default` authorize block invokes
`eap` and rejects on cert-chain failure. CRL/OCSP validation against
certctl's CRL endpoint (`/.well-known/pki/crls/<issuerID>.crl`) is
configured under `tls-common.crl_dir`; see [`crl-ocsp.md`](crl-ocsp.md)
for the certctl-side CRL distribution endpoint and refresh cadence.

### End-to-end flow

1. Laptop boots, supplicant starts EAP-TLS handshake against the AP.
2. AP forwards the EAP frames to FreeRADIUS over RADIUS.
3. FreeRADIUS validates the supplicant cert chain against
   `certctl-ca.pem`, checks revocation against the certctl CRL, and
   pins the EKU to `clientAuth`.
4. On valid cert, FreeRADIUS returns Access-Accept; the AP grants
   network access.
5. ~10 days before the cert's `NotAfter`, the device's renewal cron
   hits `simplereenroll`, authenticating over mTLS with its EXISTING
   cert, with no operator interaction.

What can go wrong (operator playbook):

| Symptom | Diagnostic | Fix |
|---|---|---|
| Supplicant rejected at TLS handshake | `tcpdump` on AP shows TLS-1.2 hello | Update supplicant to TLS 1.3 OR ensure FreeRADIUS's cert is signed under a chain the supplicant trusts. |
| FreeRADIUS rejects with "expired CRL" | `freeradius -X` log surfaces stale CRL | certctl regenerates per-issuer CRLs hourly (see [`crl-ocsp.md`](crl-ocsp.md)); tighten `crl_dir` reload cadence in FreeRADIUS. |
| Renewal fails with HTTP 429 | certctl audit log shows `est_rate_limited` for this device | Per-(CN, sourceIP) limit tripped; either widen `_RATE_LIMIT_PER_PRINCIPAL_24H` or investigate why the device is renewing >3x/24h. |
| Renewal fails with HTTP 401 | certctl audit log shows `est_auth_failed_mtls` | Bootstrap cert chain doesn't trace to `_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH`. Re-issue or rotate. |
| Sustained `est_auth_failed_basic` from one IP | certctl audit log + IP reverse lookup | Likely brute-force; the source-IP limiter will lock the IP after 10 fails/hr. Block at firewall. |

## IoT bootstrap recipe

Long-running devices in the field (sensors, gateways, kiosks)
typically follow this lifecycle:

1. **Factory provisioning.** Bake one of:
   - A **bootstrap enrollment password** into the device firmware
     (per-fleet shared secret; pair with the source-IP rate limit)
   - A **factory-installed bootstrap cert** signed by the operator's
     factory CA, suitable for mTLS on first enroll
2. **First boot.** The device generates an ECDSA P-256 keypair locally,
   builds a CSR with its serial in `deviceSerialNumber`, and POSTs to
   `/.well-known/est/<pathID>/simpleenroll` (with HTTP Basic) or
   `/.well-known/est-mtls/<pathID>/simpleenroll` (with the bootstrap
   cert). On success, the device persists the issued cert and the
   bootstrap material can be discarded.
3. **Steady state.** The device drives `simplereenroll` over mTLS with
   the issued cert once roughly 10–25% of the cert's lifetime remains
   before `NotAfter`. The re-enrollment uses the issued cert as the
   client cert; no shared secrets in the renewal path.
4. **Compromise / decommission.** The operator hits the bulk-revoke
   endpoint:

   ```bash
   curl -sS -X POST \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer $CERTCTL_API_KEY" \
     --cacert /path/to/ca.crt \
     https://certctl.example.com:8443/api/v1/est/certificates/bulk-revoke \
     -d '{"reason":"keyCompromise","profile_id":"cp-iot-sensors"}'
   ```

   The endpoint is M-008 admin-gated; non-admin Bearer callers receive
   HTTP 403. Source is auto-pinned to `EST` server-side, so the
   operation only revokes EST-issued certs even if the criteria match
   non-EST sources too. The CRL/OCSP responder picks up the revocations
   on the next refresh cycle (`CERTCTL_CRL_GENERATION_INTERVAL`,
   default 1h); see [`crl-ocsp.md`](crl-ocsp.md).

**Recommended cert lifetimes for IoT.** Set `MaxTTLSeconds = 7776000`
(90 days) on the IoT `CertificateProfile`. Long enough to absorb
multi-day network outages without losing the device; short enough to
limit exposure on key compromise. Combined with bulk revoke + CRL
refresh, the worst-case window from revocation to relying-party
rejection is `1h + crl_refresh_interval`: the default CRL generation
interval plus the relying party's own CRL refresh interval.

**Renewal trigger ratio for IoT.** Set the device's renewal cron to
fire at 25% remaining lifetime, which gives ~22 days of buffer (25% of
90 days) for a device that's offline at expiry-time to reconnect,
retry, and re-enroll before the cert hard-expires. Mirrors the
renewal-trigger ratio for laptops at 50% (laptops are online more
often, so the buffer can be tighter relative to lifetime).

## `serverkeygen` for resource-constrained devices

RFC 7030 §4.4 lets the server generate the keypair on behalf of the
client when the device lacks a hardware RNG, which is typical of
ultra-low-power IoT or embedded modules without a TRNG. certctl
supports this via `CERTCTL_EST_PROFILE_<NAME>_SERVERKEYGEN_ENABLED=true`.

Wire format: `POST /.well-known/est/<pathID>/serverkeygen` with the
device's CSR as the request body. The handler:

1. Parses the CSR; the CSR's pubkey is treated as the **recipient
   key** for CMS EnvelopedData wrapping (RFC 7030 §4.4.2). The CSR's
   pubkey must support keyTrans (RSA-only at this revision; ECDH
   is deferred to a follow-up bundle). Non-RSA CSRs return HTTP 400
   with `ErrServerKeygenRequiresKeyEncipherment`.
2. Resolves the per-profile key algorithm from
   `CertificateProfile.AllowedKeyAlgorithms` (default RSA-2048).
3. Generates a fresh keypair in process memory.
4. Re-builds the CSR with the server-generated pubkey (so the issuer
   sees a CSR that matches the cert it's signing).
5. Runs the existing issuer pipeline.
6. Marshals the private key as PKCS#8 DER, then wraps it in CMS
   EnvelopedData encrypted to the device's CSR pubkey via AES-256-CBC
   with a per-call random IV.
7. Returns the response as `multipart/mixed` per RFC 7030 §4.4.2:
   first part is the cert chain (PKCS#7), second part is the
   EnvelopedData blob (`application/pkcs8`).
8. **Zeroizes** the plaintext key + PKCS#8 bytes before return
   (`internal/service/est.go::zeroizeKey` + `zeroizeBytes`). The
   private key never persists to disk on the certctl side.

Cross-check at boot: setting `_SERVERKEYGEN_ENABLED=true` on a
profile with empty `_PROFILE_ID` is refused, because server-keygen
needs a `CertificateProfile` to pin `AllowedKeyAlgorithms` (the server
has to decide what key to generate, and a profile-less default would be
arbitrary).
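
Both boot-time cross-checks described in this document (channel binding requires mTLS; server-keygen requires a bound `CertificateProfile`) amount to a small validation pass over the parsed profile config. A minimal sketch follows, assuming a hypothetical `estProfile` struct whose fields mirror the per-profile env vars; certctl's actual config types and error text may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// estProfile is a hypothetical view of one CERTCTL_EST_PROFILE_<NAME>_* group.
type estProfile struct {
	Name                   string
	ProfileID              string // _PROFILE_ID
	MTLSEnabled            bool   // _MTLS_ENABLED
	ChannelBindingRequired bool   // _CHANNEL_BINDING_REQUIRED
	ServerKeygenEnabled    bool   // _SERVERKEYGEN_ENABLED
}

// validate returns nil when the profile is consistent, or a joined error
// listing every refused combination.
func validate(p estProfile) error {
	var errs []error
	if p.ChannelBindingRequired && !p.MTLSEnabled {
		errs = append(errs, fmt.Errorf("profile %q: _CHANNEL_BINDING_REQUIRED=true needs _MTLS_ENABLED=true", p.Name))
	}
	if p.ServerKeygenEnabled && p.ProfileID == "" {
		errs = append(errs, fmt.Errorf("profile %q: _SERVERKEYGEN_ENABLED=true needs a non-empty _PROFILE_ID", p.Name))
	}
	return errors.Join(errs...) // nil when errs is empty
}

func main() {
	bad := estProfile{Name: "iot", ChannelBindingRequired: true, ServerKeygenEnabled: true}
	fmt.Println(validate(bad))
}
```

Reporting every violation at once, rather than stopping at the first, saves the operator a restart per mistake.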

**Security caveats.**

- **Trust transitivity.** Server-keygen breaks the cardinal property
  of agent-based key management: that the private key never leaves
  the device. The CMS wrap protects the key in transit, but the
  device still trusts certctl with the key material at generation
  time. Use only when the device cannot generate its own keypair,
  not as a convenience.
- **Heap residency window.** The plaintext key lives in process heap
  between generation and CMS encryption. The zeroize step closes the
  obvious leakage leg, but a Go runtime that GC-relocates the buffer
  before zeroize fires could leave a copy. The threat-model carve-out
  is documented in [Threat model](#threat-model); use HSM-backed
  signing for highest-assurance fleets.
- **No audit-log trail of the key bytes.** The audit row records
  the issuance (cert serial, subject, issuer) but never the key
  bytes; the operator cannot recover a key after issuance. This is
  by design: the key bytes only exist for the duration of the
  request.

## HSM-backed CA signing for EST

EST signs certs using whatever issuer connector the profile binds.
The `internal/crypto/signer/` interface (post-2026-04-28) means a
future HSM/PKCS#11 driver bundle (parking-lot at
`cowork/hsm-pkcs11-driver-prompt.md`) plugs in transparently; the
EST handler doesn't change. EST-issued certs benefit from HSM-backed
signing automatically once the HSM bundle ships and the operator
swaps the local issuer's `FileDriver` for a `PKCS11Driver`.

For deploys that need HSM-backed CA signing today, use the local
issuer's `FileDriver` with the CA key on a read-only TPM-protected
tmpfs; the L-014 file-on-disk threat-model carve-out in
`internal/connector/issuer/local/local.go` documents the
defense-in-depth steps.

## Operator GUI (EST Admin tabs)

The EST Admin surface lives at `/est` (route `web/src/main.tsx`,
nav link `web/src/components/Layout.tsx::EST Admin`). The page is
admin-gated at the top level: non-admin Bearer callers see an
"Admin access required" banner, and the underlying admin endpoints
(`/api/v1/admin/est/*`) are M-008 protected server-side independently.

Three tabs:

- **Profiles** (default): per-profile lean cards with auth-mode
  badges, mTLS trust-anchor expiry countdown (green ≥30d / amber
  7–30d / red <7d / EXPIRED), the 12-cell live counter grid (every
  `est_*` failure mode), and a "Reload trust anchor" modal that
  hits `POST /api/v1/admin/est/reload-trust` (the SIGHUP-equivalent;
  bad reloads keep the OLD pool in place per the
  [Threat model](#threat-model) reload semantics).
- **Recent Activity**: merges the four EST audit-action prefixes
  (`est_simple_enroll`, `est_simple_reenroll`, `est_server_keygen`,
  `est_auth_failed`) across four parallel queries with chip filters
  (All / Enrollment / Re-enrollment / ServerKeygen / AuthFailure).
  Polled every 60s.
- **Trust Bundle**: per-mTLS-profile cert subjects + expiries
  surfaced from the trust holder snapshot. Used during rotation:
  operator extracts the new bundle, overwrites the on-disk file,
  hits Reload, then reloads this tab to confirm the new subjects.

All three admin paths (`GET /api/v1/admin/est/profiles`,
`POST /api/v1/admin/est/reload-trust`, plus the audit-query merge in
the GUI) are M-008 admin-gated. Hiding the page from non-admins is
only a UX hint; the server-side gate is the security boundary.

## CLI + MCP tools

The `certctl-cli est` subcommand family (`internal/cli/est.go`):

```
certctl-cli est cacerts --profile <name>
certctl-cli est csrattrs --profile <name>
certctl-cli est enroll --profile <name> --csr <path|-> [--out <path>]
certctl-cli est reenroll --profile <name> --csr <path|-> [--out <path>]
certctl-cli est serverkeygen --profile <name> --csr <path> --out <prefix>
certctl-cli est test --profile <name>
```

`--profile` is the lowercased PathID (matches the URL path). Empty
profile string maps to the legacy `/.well-known/est/` root; use it only
during a back-compat migration. Server-keygen writes
`<prefix>.cert.pem` plus `<prefix>.key.enveloped` (the EnvelopedData
blob, decryptable with `openssl smime`).

The MCP server (`internal/mcp/tools_est.go`) exposes six tools that
mirror the CLI surface for AI-orchestrated workflows:

- `est_list_profiles`: every configured EST profile + its auth modes
  + counters
- `est_admin_stats`: alias of the above; matches the
  `scep_admin_stats` naming convention
- `est_get_cacerts`: base64 PKCS#7 cert chain
- `est_get_csrattrs`: base64 DER attributes blob (per-profile when
  `RequiredCSRAttributes` is set)
- `est_enroll`: body carries the CSR PEM; returns the issued cert
- `est_reenroll`: same but uses the previous-cert mTLS path

All six are gated by the standard MCP Bearer auth + the page-level
admin gate where applicable (`est_list_profiles`, `est_admin_stats`).

## Renewal: device-driven model

RFC 7030 §4.2.2 mandates the renewal model: the **device** decides
when to renew and drives `simplereenroll` over its existing cert.
There is no server-initiated push; certctl never reaches out to a
device fleet to force renewal.

Practical implications:

- A device offline at expiry-time **loses its cert**. Mitigation:
  pick a renewal-trigger ratio with enough buffer (50% remaining
  lifetime for laptops, 25% for IoT; see
  [IoT bootstrap recipe](#iot-bootstrap-recipe)). On chronically
  offline fleets, lengthen `MaxTTLSeconds`.
- The "operator wants to push renewal" case is handled via the
  notification webhook surface (`internal/connector/notifier/webhook/`):
  the operator publishes an event on a topic the device fleet
  subscribes to (or the operator's MDM picks up); the device's MDM
  agent triggers the renewal cron out-of-band. certctl emits a
  `cert.expiring_soon` event on the standard 30/7/1-day pre-expiry
  schedule (`internal/scheduler/scheduler.go::expiryNotificationLoop`).
- Per-(CN, sourceIP) sliding-window cap keeps a misbehaving device
  from hammering the server. Default is `0` (disabled, back-compat);
  production deploys set
  `CERTCTL_EST_PROFILE_<NAME>_RATE_LIMIT_PER_PRINCIPAL_24H=3`.
  Mirrors the SCEP/Intune per-device limit pattern from
  [`scep-intune.md`](scep-intune.md).

## Troubleshooting matrix

The handler emits a typed audit-action code per failure mode. Filter
the GUI Recent Activity tab on the action prefix to find the
offending requests, and use the table below to map back to root
cause + fix.

| Audit action | Symptom | Root cause + fix |
|---|---|---|
| `est_simple_enroll_success` | (success counter) | No action needed. |
| `est_simple_enroll_failed` | An enrollment failed; the bare `_failed` codes give the typed reason | The audit row's `details` carries the inner reason; cross-reference one of the rows below. |
| `est_simple_reenroll_success` | (success counter) | No action needed. |
| `est_simple_reenroll_failed` | A renewal failed | Same as `est_simple_enroll_failed`; cross-reference inner reason. |
| `est_server_keygen_success` | (success counter) | No action needed. |
| `est_server_keygen_failed` | Server-keygen failed | Most common: device CSR carries a non-RSA pubkey (the keyTrans wrap requires RSA at this revision). Switch the device to an RSA CSR or wait for ECDH support. |
| `est_auth_failed_basic` | HTTP Basic gate tripped | Wrong password OR the password env var rotated and the device wasn't re-provisioned. Watch the source-IP for sustained failures; the limiter locks out after 10 fails/hr. |
| `est_auth_failed_mtls` | mTLS gate tripped | Client cert doesn't chain to the trust anchor OR the cert is past `NotAfter` OR the cert presented is for a different EST profile (cross-profile bleed defense). Check `details.subject` against `_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH`. |
| `est_auth_failed_channel_binding` | RFC 9266 channel-binding gate tripped | One of: missing `id-aa-channelBindings` attribute on the CSR (libest <v3.0); mismatch (MITM signal: log + escalate); TLS 1.2 client (channel binding requires TLS 1.3). Map the inner error to the [channel-binding table](#rfc-9266-channel-binding). |
| `est_rate_limited` | Per-(CN, sourceIP) cap tripped | If legitimate (recovery + first-cert + post-wipe in 24h), bump `_RATE_LIMIT_PER_PRINCIPAL_24H`. If suspicious, the limiter is doing its job; investigate the device. |
| `est_csr_policy_violation` | CSR violates the bound `CertificateProfile` rules | Inner detail names the dimension (key alg, key size, EKU, SAN, max TTL). Either fix the device CSR or relax the policy; never silently accept. |
| `est_bulk_revoke` | Operator-initiated bulk revoke | Audit-only signal; no failure. Cross-reference the operator's identity in `details.actor`. |
| `est_trust_anchor_reloaded` | Operator-initiated SIGHUP-equivalent reload | Audit-only signal; no failure. Failed reloads do NOT emit this code (the OLD pool stays in place; check the GUI Reload modal's error message + the `details.path_id`). |

The bare action codes (without the `_success`/`_failed` suffix) are
also emitted for back-compat with the GUI activity-tab filter chips,
which match by prefix (`startsWith()`). The split-emit pattern
preserves both the legacy-grep and the new typed-counter use cases.
See `internal/service/est_audit_actions.go` for the constant
definitions; the per-action emission sites are in
`internal/service/est.go::processEnrollment`.

## TLS 1.2 reverse-proxy runbook

Some embedded EST clients only speak TLS 1.2: older OpenWRT routers,
some industrial PLCs, IoT firmware that can't be field-upgraded.
certctl's control plane is TLS 1.3 only (pinned at
`cmd/server/tls.go::buildServerTLSConfig`). The migration path is the
TLS 1.2 reverse-proxy pattern documented in
[`legacy-est-scep.md`](legacy-est-scep.md):

- nginx / HAProxy terminates TLS 1.2 from the legacy client
- Forwards the EST request body unchanged to certctl on TLS 1.3
- Optionally forwards the client cert via `X-SSL-Client-Cert` for the
  proxy-side mTLS trust pin

Important caveat: **RFC 9266 channel binding cannot work through a
reverse proxy.** The channel binding bytes are derived from the
client↔proxy TLS session, NOT the proxy↔certctl session. Disable
`_CHANNEL_BINDING_REQUIRED` for profiles that serve via the proxy
runbook.

## Threat model

The EST hardening bundle's threat model rests on these load-bearing
properties; deviations need explicit operator awareness:

- **Trust anchor reload is fail-safe.** A SIGHUP that hits a
  half-rotated bundle (parse error, expired cert) keeps the OLD pool
  in place. The validator never accepts an unparseable bundle. The
  GUI reload modal surfaces the error so the operator can correct
  the file and retry without taking the EST endpoint down.
- **Per-profile counter isolation.** Each ESTService instance has
  its own `estCounterTab` (sync/atomic-backed). A future shared-counter
  refactor would fail at the compile-time pointer-identity
  check in `internal/service/est_profile_counter_isolation_test.go`.
  This means the Recent Activity tab's per-profile filter is a real
  filter, not a fan-out display of one shared counter.
- **mTLS cross-profile bleed is blocked.** A client cert presented
  to profile A's mTLS endpoint must chain to A's trust bundle, not
  any other profile's. The per-handler re-verify enforces this even
  when both profiles share a TLS listener union pool (see
  `cmd/server/tls.go::buildServerTLSConfigWithMTLS`).
- **Source-IP failed-Basic limiter is process-local.** The 10/hr
  cap is enforced in-process; a load-balanced multi-pod deploy where
  request distribution is round-robin can amplify the effective
  per-IP rate by the pod count. Mitigation: use sticky-source-IP
  load balancing for `/.well-known/est/` if this is in scope.
- **Server-keygen has a heap-residency window.** The plaintext
  private key lives in process memory between generation and CMS
  EnvelopedData encryption. The zeroize step closes the obvious
  leakage leg, but a GC-relocation between generation and zeroize
  could leave a copy. Use HSM-backed signing for highest-assurance
  fleets where this matters.
- **HTTP Basic password is in-process only.** Stored in
  `ESTHandler.basicPassword`, never logged, never written to disk by
  certctl. Operators ARE responsible for the env-var injection path
  (Helm secret, Docker secret, Vault); see `tls.md` for the
  recommended secret-mount conventions.
- **The legacy unauthenticated default exists for back-compat.**
  Pre-Phase-1 deploys had no `_ALLOWED_AUTH_MODES` env var; the
  default is empty (anonymous) so existing deploys continue to work.
  A future bundle MAY flip the default to require explicit opt-in;
  production deploys should set `_ALLOWED_AUTH_MODES` explicitly
  today regardless.

## V3-Pro deferrals

These capabilities are deferred to V3-Pro (paid tier). They're not
oversights; they're the natural follow-on bundles after v2.X.0 GA:

- **Conditional Access / device-posture gating.** The per-profile
  ESTService exposes a nil-default compliance-hook seam (mirrors the
  SCEP/Intune `ComplianceCheck` pattern). V3-Pro plugs in a
  Microsoft Graph or other posture-check callback before issuance;
  non-compliant devices fail with a typed `est_compliance_failed`
  reason.
- **Multi-tenant CA isolation.** V2 has one trust anchor pool per
  EST profile and one issuer binding. V3-Pro ships per-tenant root
  + per-tenant audit isolation for MSPs running shared certctl
  deployments across customers.
- **EST cert-bound usage analytics.** Forward device-side handshake
  logs into certctl for cert-bound session analytics. Deferred to
  V3-Pro (or delegate to a real session-management product like
  Teleport for TLS sessions).
- **EST-cert-manager-style controller for K8s host fleets.**
  External-issuer pattern that lets cert-manager use certctl's EST
  server as a backend. Parking-lot per
  `WORKSPACE-ROADMAP.md::Cloud and Kubernetes`.
- **Standalone `certctl-est` CLI binary.** All EST ops route through
  the certctl server in V2; a standalone binary that an operator can
  run on a laptop without the full server is deferred (as is the
  similar SCEP probe CLI binary). V2 ships the `certctl-cli est`
  subcommand family, which solves the same operator workflow at a
  lower packaging cost.
- **`fullcmc` (RFC 7030 §4.3) implementation.** Rare in practice;
  only Cisco IOS and a few financial-PKI vendors use it. Deferred
  until a customer asks.

## Appendix A: libest reference client

certctl's CI exercises the EST endpoints against Cisco's libest
reference implementation via the sidecar at
`deploy/test/libest/Dockerfile`. The build reproduces v3.2.0-2 from
source on `debian:bookworm-slim` (digest-pinned per the H-001 guard).

To reproduce locally:

```bash
# From the repo root.
docker compose --profile est-e2e -f deploy/docker-compose.test.yml build libest-client
docker compose --profile est-e2e -f deploy/docker-compose.test.yml up -d libest-client
docker exec -it certctl-libest-client estclient --help
```

The integration test suite (`deploy/test/est_e2e_test.go`, build
tag `integration`) drives the live certctl server through the
sidecar via `docker exec` for these scenarios:

- `TestEST_LibESTClient_Enrollment_Integration`: `cacerts`
  → `simpleenroll` → cert assertion
- `TestEST_LibESTClient_MTLSEnrollment_Integration`: mTLS sibling
  route
- `TestEST_LibESTClient_ServerKeygen_Integration`: RFC 7030 §4.4
  multipart/mixed
- `TestEST_LibESTClient_RateLimited_Integration`: exhausts the
  per-principal cap and asserts the 429-shaped error
- `TestEST_LibESTClient_ChannelBinding_Integration`: RFC 9266
  `--tls-exporter` (skipped when libest build lacks the flag)

Run the suite via `INTEGRATION=1 go test -tags integration ./deploy/test/... -run EST`.

## Appendix B: RFC 7030 wire-format quirks

certctl's EST handler ships with quirk-tolerance for documented EST
client populations. The fixtures + unit tests live at
`internal/api/handler/cisco_ios_quirks_test.go` +
`internal/api/handler/testdata/cisco_ios_*.txt`.

| Vendor / version | Quirk | certctl behavior |
|---|---|---|
| Cisco IOS 15.x | Some images send the CSR as `application/x-pem-file` (not the spec'd `application/pkcs10`) | The handler dispatches on the body prefix (`-----BEGIN`) rather than the Content-Type header, so the body is accepted as PEM-encoded PKCS#10. |
| Cisco IOS 16.x | Trailing newlines on the base64 body (variable count) | `strings.TrimSpace` pass before base64 decode; bodies tolerated regardless of trailing whitespace. |
| Apple MDM (some firmware) | CRLF line wrapping inside the base64 body | `base64.StdEncoding` handles both LF and CRLF. |
| OpenWRT (older builds) | TLS 1.2 only | Use the [TLS 1.2 reverse-proxy runbook](#tls-12-reverse-proxy-runbook); disable channel binding for affected profiles. |
| libest <v3.0 | No RFC 9266 `--tls-exporter` flag | Set `_CHANNEL_BINDING_REQUIRED=false` for affected profiles; the server still validates everything else. |

If you find a new wire-format quirk in a real device, file an issue
with a base64 dump of the failing request and we'll add a fixture +
the matching tolerance pass.

## Related docs

- [`legacy-est-scep.md`](legacy-est-scep.md): TLS 1.2 reverse-proxy
  runbook + the SCEP RFC 8894 native implementation parallels.
- [`scep-intune.md`](scep-intune.md): the SCEP/Intune master bundle
  that established the multi-profile dispatch + admin GUI + golden
  fixture patterns this EST bundle mirrors.
- [`crl-ocsp.md`](crl-ocsp.md): the per-issuer CRL distribution
  endpoint and OCSP responder that EST-issued certs are revoked
  through.
- [`features.md`](features.md): every `CERTCTL_*` env var,
  including the per-profile `CERTCTL_EST_PROFILE_<NAME>_*` family
  documented here.
- [`architecture.md`](architecture.md): overall control-plane
  architecture; EST Server section + Security Model trust-anchor
  rotation discussion.
- [`tls.md`](tls.md): TLS bootstrap for the certctl control plane;
  prerequisite for any production EST deploy.
- [`connectors.md`](connectors.md): issuer connectors that EST
  delegates to.

@@ -0,0 +1,385 @@
# Microsoft Intune SCEP enrollment via certctl

> **Status (this document):** Phase 11 of the SCEP RFC 8894 + Intune master
> bundle. The behavior described here is shipped on `master` and exercised
> end-to-end by `internal/api/handler/scep_intune_e2e_test.go`. The
> bundle is V2-free (community edition); Conditional-Access compliance
> gating, native Microsoft Graph integration, and per-tenant trust
> anchors are documented under [Limitations](#limitations) as V3-Pro
> features.

## TL;DR

certctl is a **drop-in NDES replacement** for Microsoft Intune SCEP fleets.
Intune-managed devices keep using the existing Intune Certificate Connector;
only the SCEP server URL changes. certctl validates the Connector's
signed challenge using its installation signing cert (no Microsoft API
calls; the Connector already did that), binds the device claim to the
inbound CSR, and issues through whichever certctl issuer connector you
have configured (local CA, Vault, EJBCA, ADCS, etc.).

What you get over NDES:

- Per-profile SCEP endpoints (`/scep/corp` vs. `/scep/iot` etc.) so a
  single certctl deploy serves multiple device fleets with distinct
  challenge passwords + trust anchors.
- Audit log entries with the device GUID, claim subject, and CSR
  binding details, which gives much better forensics than NDES + IIS logs.
- Trust anchor reload via `SIGHUP` (no service restart) when the
  Connector signing cert rotates.
- A built-in admin GUI tab (Intune Monitoring) showing per-profile
  enrollment counters, trust-anchor expiry countdowns, and the recent
  failures table.
- Per-device rate limit (sliding window log keyed by Subject + Issuer)
  that catches a compromised Connector signing key issuing many
  different valid challenges for the same device.

## Architecture

```mermaid
flowchart LR
    Cloud["Intune cloud<br/>(Microsoft)"]
    Connector["Intune Certificate Connector<br/>(customer infra)"]
    Server["certctl SCEP server<br/>(you)"]
    Issuer["issuer connector<br/>(local CA / Vault / EJBCA / …)"]
    Cloud --> Connector --> Server --> Issuer
```

**certctl replaces NDES, not the Connector.** The Intune Certificate
Connector is the bridge between the Intune cloud and your on-prem PKI;
Microsoft installs and maintains it. What you replace is the
**Network Device Enrollment Service** (NDES), the SCEP server
historically deployed on a Windows host, sitting between the Connector
and an Active Directory Certificate Services CA. certctl sits in
exactly that slot and speaks SCEP RFC 8894 to the Connector.

### What certctl validates per request

For every Intune-flavored SCEP request the dispatcher in
`internal/service/scep.go::dispatchIntuneChallenge` walks the
following gates in order. A failure on any gate produces a CertRep
PKIMessage with the documented `pkiStatus`/`failInfo` codes (per RFC
8894 §3.2.1.4.5) and increments the corresponding metric counter.

1. **Shape pre-check**: `looksIntuneShaped(challengePassword)`
   requires length > 200 and exactly two dots. False positives are
   fine; false negatives on real Intune challenges would route them to
   the static compare and reject. The pre-check just decides whether to
   invoke the full validator.
2. **JWS signature**: `intune.ValidateChallenge` re-derives the
   signing input from the raw on-wire bytes (per RFC 7515 §3.1, NOT
   re-base64-encoded segments) and verifies against every cert in the
   trust anchor pool. Supports RS256 and ES256 (both fixed-width
   r||s and ASN.1-DER form). Explicitly rejects `alg=none` and
   HMAC algs.
3. **Version dispatch**: extracts the `version` claim from the
   payload prelude. v1 (current Connector format, no `version` key)
   routes to `unmarshalChallengeV1`. A future v2 plugs in a sibling
   parser without touching the validator.
4. **Time bounds**: `now+tolerance ≥ iat AND now-tolerance < exp`.
   The `±tolerance` window is configurable per profile via
   `INTUNE_CLOCK_SKEW_TOLERANCE` (default 60s, covers modest clock
   drift between the Connector host and certctl). A configurable cap
   applies on top via `INTUNE_CHALLENGE_VALIDITY` (defense-in-depth
   against a Connector that mints long-validity challenges). The
   validator refuses `tolerance ≥ ChallengeValidity` at
   startup-validation time to keep the cap meaningful.
5. **Audience pin**: `claim.aud == INTUNE_AUDIENCE` (skipped when
   `INTUNE_AUDIENCE` is empty for proxy/load-balancer scenarios).
6. **CSR binding**: `claim.DeviceMatchesCSR(csr)` checks
   set-equality between the claim's `device_name` / `san_dns` /
   `san_rfc822` / `san_upn` and the CSR's CN + SANs. Set-equality
   means the CSR carries EXACTLY the claim's values, no extras and
   no missing.
7. **Replay**: `intune.ReplayCache.CheckAndInsert` rejects
   duplicates within the configured TTL. Sized for 100k entries
   (covers a ~25 RPS Intune fleet's steady-state).
8. **Per-device rate limit**: sliding window log keyed by
   `(claim.Subject, claim.Issuer)`. Catches a compromised Connector
   issuing many DIFFERENT valid challenges for the same device. The
   default of 3 enrollments per 24h covers legitimate first-cert +
   recovery + post-wipe.
9. **Optional compliance check**: V3-Pro plug-in seam (nil-default
   no-op). When set, the gate calls Microsoft Graph's compliance API
   and short-circuits non-compliant devices with FAILURE+BadRequest.

A request that passes all nine gates flows to
`processEnrollment`, which builds the issuance request, calls the
configured issuer connector, and emits a CertRep PKIMessage with the
issued cert encrypted to the device's transient signing cert per RFC
8894 §3.3.2.

## Migration from NDES + EJBCA (or NDES + ADCS)

The migration plan below is conservative: install certctl alongside
your existing NDES so you can flip Intune profiles fleet-by-fleet
without a flag day. Validated against a fresh `docker compose up`
stack; the `docker-compose.test.yml` stack does not currently bake
Intune in (Phase 10.2 ships a hermetic in-process e2e test instead),
so the production validation step is a manual run-book item.

1. **Install certctl alongside existing NDES.** Stand up the certctl
   server on a separate host (or as a Kubernetes deployment) reachable
   from the Connector host. Use the existing operator-run-book in
   `docs/tls.md` for the TLS bootstrap.
2. **Configure a per-profile SCEP endpoint.** Pick a path id (e.g.
   `corp`, referenced as `<NAME>` below; the value gets uppercased
   for the env-var key and lowercased for the URL path) and set:

   ```
   CERTCTL_SCEP_ENABLED=true
   CERTCTL_SCEP_PROFILES=corp
   CERTCTL_SCEP_PROFILE_<NAME>_ISSUER_ID=iss-local         # or your existing issuer
   CERTCTL_SCEP_PROFILE_<NAME>_CHALLENGE_PASSWORD=<random> # Intune still requires this
   CERTCTL_SCEP_PROFILE_<NAME>_RA_CERT_PATH=/etc/certctl/ra-corp.pem
   CERTCTL_SCEP_PROFILE_<NAME>_RA_KEY_PATH=/etc/certctl/ra-corp.key
   ```

   The endpoint will be served at `https://certctl.example.com/scep/corp`.
   The URL path uses the lowercased name and the env-var keys use
   the uppercased form. Concrete env-var name mappings are listed in
   [`features.md`](features.md).
3. **Extract the Intune Connector's signing cert.** On the Connector
   host (Windows), the Connector's installation creates a self-signed
   cert in the local machine's `Personal` cert store with subject
   `CN=Microsoft Intune Certificate Connector` (path documented by
   Microsoft; see the Microsoft Learn link in the
   [Microsoft support statement](#microsoft-support-statement) below).
   Export the public cert (no private key) as a base64 `.cer` file.
4. **Configure the trust anchor.** Copy the `.cer` to the certctl host
   (or mount via your secret manager) and set:

   ```
   CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_ENABLED=true
   CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CONNECTOR_CERT_PATH=/etc/certctl/intune-corp.pem
   CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_AUDIENCE=https://certctl.example.com/scep/corp
   CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CHALLENGE_VALIDITY=60m
   CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CLOCK_SKEW_TOLERANCE=60s  # ±tolerance on iat/exp; raise on poorly-NTP-synced fleets, lower to enforce strict time
   CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_PER_DEVICE_RATE_LIMIT_24H=3
   ```

   Restart certctl. The startup preflight refuses to boot if the
   trust anchor file is missing, unparseable, or contains an expired
   cert, so failure is loud at boot rather than silent at request time.
5. **Configure the issuer connector.** If you're keeping EJBCA,
   point `CERTCTL_SCEP_PROFILE_<NAME>_ISSUER_ID` at your EJBCA issuer
   profile (see `docs/connectors.md`). For a clean cut-over to the
   built-in local CA, follow `docs/tls.md` to bootstrap a sub-CA cert.
6. **Migrate one Intune SCEP profile to certctl.** In the Intune
   admin center, edit the SCEP profile for a small canary device
   group and update the SCEP server URL to
   `https://certctl.example.com/scep/corp`. Push the profile and
   wait for the canary devices to rotate (24-48h).
7. **Verify enrollment.** Open the certctl admin GUI's
   [SCEP Intune Monitoring tab](#operational-monitoring) and watch
   the `success` counter tick on the `corp` profile card. The
   `recent failures` table surfaces any rejected enrollments with
   the exact reason (e.g. `signature_invalid`, `claim_mismatch`).
8. **Roll out the rest of the fleet.** Once the canary is clean,
   migrate the remaining Intune SCEP profiles in batches.
9. **Decommission NDES.** After all fleets are migrated and a few
   renewal cycles have completed cleanly, take down the NDES role
   and the IIS site. The existing certs continue to chain to your
   issuer; only the enrollment path changes.

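
Step 2's naming rule (uppercase for the env-var key, lowercase for the URL path) can be written down as two small helpers. The function names are hypothetical and the sketch only handles simple names like `corp`; whether certctl transforms other characters (hyphens, for example) is not stated here, so check [`features.md`](features.md) for the authoritative mapping.

```go
package main

import (
	"fmt"
	"strings"
)

// envKey builds the per-profile env-var name for a given suffix,
// e.g. ("corp", "ISSUER_ID") -> CERTCTL_SCEP_PROFILE_CORP_ISSUER_ID.
func envKey(profile, suffix string) string {
	return "CERTCTL_SCEP_PROFILE_" + strings.ToUpper(profile) + "_" + suffix
}

// endpointURL builds the SCEP endpoint URL that Intune profiles point at.
func endpointURL(baseURL, profile string) string {
	return strings.TrimRight(baseURL, "/") + "/scep/" + strings.ToLower(profile)
}

func main() {
	fmt.Println(envKey("corp", "INTUNE_AUDIENCE"))
	fmt.Println(endpointURL("https://certctl.example.com", "corp"))
}
```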
## Intune SCEP profile fields → certctl behavior

The Intune admin center's SCEP profile editor exposes a fixed set of
fields. The mapping below is what each field controls relative to
certctl's behavior.

| Intune profile field | certctl behavior |
|----------------------|------------------|
| Certificate type | Treated as device or user; surfaces in the claim's `subject` field (device GUID vs. user UPN). certctl doesn't gate on type; the issuer's certificate profile decides. |
| Subject name format | Drives the CSR's CN. The Intune Connector sets `claim.device_name` from this value; certctl's CSR-binding gate enforces equality. |
| Subject alternative name | Drives the CSR's SAN list. Intune supports DNS / RFC 822 / UPN; certctl's claim binding checks set-equality per dimension. Mismatches surface as `ErrClaimSANDNSMismatch` / `_SANRFC822Mismatch` / `_SANUPNMismatch`. |
| Certificate validity period | Honored by the issuer connector. certctl caps via the per-profile `CertificateProfile.MaxTTLSeconds`; the smaller of the two wins. |
| Key storage provider | Device-side concern (the Connector negotiates with the device's TPM / Software KSP). certctl never sees the device's private key; it only signs the CSR. |
| Key usage / Extended key usage | Honored by the issuer connector via the bound `CertificateProfile.AllowedEKUs`. CSRs requesting an EKU outside the allowed set are rejected by the crypto-policy gate (`ValidateCSRAgainstProfile`). |
| Hash algorithm | The CSR's signature hash (SHA-256 typical). The SCEP `GetCACaps` advertises SHA-256 + SHA-512; the device picks. |
| SCEP server URL | The endpoint URL the Connector posts to. Set to `https://certctl.example.com/scep/<profile-name>`. |

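
The per-dimension set-equality check referenced in the table can be sketched as follows. The struct and function names are hypothetical stand-ins for the claim and CSR types, and the comparison is exact-string with no case folding, which is an assumption; the real `DeviceMatchesCSR` may normalize values differently.

```go
package main

import (
	"errors"
	"fmt"
)

// deviceClaim and csrNames are illustrative shapes, not certctl's types.
type deviceClaim struct {
	DeviceName                string
	SANDNS, SANRFC822, SANUPN []string
}

type csrNames struct {
	CommonName      string
	DNS, Email, UPN []string
}

// setEqual reports whether a and b hold the same values, ignoring order
// and duplicates.
func setEqual(a, b []string) bool {
	toSet := func(values []string) map[string]struct{} {
		set := make(map[string]struct{}, len(values))
		for _, v := range values {
			set[v] = struct{}{}
		}
		return set
	}
	sa, sb := toSet(a), toSet(b)
	if len(sa) != len(sb) {
		return false
	}
	for v := range sa {
		if _, ok := sb[v]; !ok {
			return false
		}
	}
	return true
}

// claimMatchesCSR returns nil only when every dimension matches exactly:
// no extra values in the CSR and none missing.
func claimMatchesCSR(c deviceClaim, r csrNames) error {
	switch {
	case c.DeviceName != r.CommonName:
		return errors.New("device_name does not match CSR CN")
	case !setEqual(c.SANDNS, r.DNS):
		return errors.New("san_dns mismatch")
	case !setEqual(c.SANRFC822, r.Email):
		return errors.New("san_rfc822 mismatch")
	case !setEqual(c.SANUPN, r.UPN):
		return errors.New("san_upn mismatch")
	}
	return nil
}

func main() {
	claim := deviceClaim{DeviceName: "laptop-01", SANDNS: []string{"laptop-01.corp.example"}}
	csr := csrNames{CommonName: "laptop-01", DNS: []string{"laptop-01.corp.example", "extra.example"}}
	fmt.Println(claimMatchesCSR(claim, csr)) // the extra DNS SAN breaks set-equality
}
```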
## Trust anchor extraction

The Intune Certificate Connector self-signs an installation cert at
install time. To configure certctl, extract this cert (PUBLIC ONLY,
no private key) as PEM:

1. On the Connector host (Windows), open `certlm.msc` (Local Machine
   Certificate Manager).
2. Navigate to `Personal` → `Certificates`. Find the cert with
   subject `CN=Microsoft Intune Certificate Connector`.
3. Right-click → All Tasks → Export. Choose **No, do not export
   the private key**. Format: **Base-64 encoded X.509 (.CER)**.
4. Copy the resulting `.cer` file to the certctl host. Rename to
   `.pem` (the bytes are identical; certctl's PEM loader accepts
   either extension).
5. Set `CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CONNECTOR_CERT_PATH` to
   the file path.
6. If you have multiple Connectors in HA, repeat steps 1-3 on each
   and concatenate the PEM blocks into one bundle file.

When the operator rotates the Connector signing cert (typically once
every few years per Microsoft's Connector lifecycle), repeat the
extraction, overwrite the on-disk file, then send `SIGHUP` to the
certctl process. The trust holder swaps atomically; bad files (parse
error, expired cert) keep the OLD pool in place so a half-rotation
doesn't take Intune enrollment down.

## Troubleshooting

The dispatcher emits a typed metric label per failure mode plus a
matching audit-log entry. The table below maps the label to the most
common root cause and the operator action.

| Counter label | Symptom | Root cause + fix |
|---------------|---------|------------------|
| `signature_invalid` | Every enrollment from a specific profile failing | Trust anchor mismatch: the Connector's signing cert was rotated and certctl wasn't reloaded. Re-extract the cert ([trust anchor extraction](#trust-anchor-extraction)), overwrite the file, send `SIGHUP`. |
| `claim_mismatch` | Some enrollments from one Intune SCEP profile failing | The Intune SCEP profile's SAN config doesn't match what the device CSR actually has. Compare the `recent failures` table's claim row to the device's CSR; usually a SAN format mismatch (e.g. claim wants UPN, CSR has DNS). |
| `expired` | All enrollments failing on a date boundary | Either clock skew between the Connector host and certctl (NTP both ends) OR the Connector's signing cert is past `NotAfter`. The certctl preflight catches an expired trust anchor at boot; check the Monitoring tab's expiry countdown. |
| `not_yet_valid` | All enrollments failing | Reverse clock skew (certctl's clock is BEHIND the Connector's). Sync via NTP. |
| `wrong_audience` | All enrollments from a profile failing | `INTUNE_AUDIENCE` doesn't match the URL the Connector is configured to call. Either fix `INTUNE_AUDIENCE` to match the operator URL, or unset it (this disables the defense-in-depth check; the claim's exp + sig still gate). |
| `replay` | Sporadic per-device failures, mostly during retries | The device retried the SAME challenge after the first one failed. The replay cache TTL is `INTUNE_CHALLENGE_VALIDITY` (default 60m). Either widen the device's retry window (Intune-side) or shorten validity. |
| `rate_limited` | A specific device hitting `429`-equivalent failures | The device exceeded `INTUNE_PER_DEVICE_RATE_LIMIT_24H` (default 3). If legitimate (post-wipe + recovery + first-cert all in 24h), bump the cap. If suspicious, this is the limiter doing its job; investigate the device. |
| `unknown_version` | Sudden onset of failures across the entire fleet | Microsoft shipped a new Connector version with a `version` claim certctl doesn't understand. Open an issue on the certctl repo with the failing claim payload (anonymized); the parser dispatcher accepts new versions in ~30 LoC. |
| `malformed` | Sporadic, low-volume | Malformed challenge bytes, almost always a network proxy mangling the request body or the Connector logging itself out mid-handshake. Capture a packet trace; the Connector should re-emit on the next device retry. |
| `compliance_failed` | V3-Pro only | The pluggable compliance check returned non-compliant. The audit-log details carry the reason string from Microsoft Graph. V2 deployments never see this counter tick. |

## Operational monitoring (SCEP Administration → Intune Monitoring tab)

The admin GUI surface for SCEP lives at `/scep` and is structured as
three tabs: **Profiles** (default landing: every configured SCEP
profile, lean cards with always-present fields), **Intune Monitoring**
(the Intune-specific deep-dive described below), and **Recent Activity**
(full SCEP audit log filter). Operators monitoring an Intune deployment
spend most of their time on the Intune Monitoring tab, deep-linkable via
`/scep?tab=intune` or the legacy alias `/scep/intune`. The Profiles tab
gives the at-a-glance per-profile health (RA cert expiry, mTLS status,
Intune enabled/disabled badge, challenge-password-set indicator) and a
"View Intune details →" link from each Intune-enabled card that switches
into this tab filtered to that profile.

The Intune Monitoring tab shows:

- **Per-profile cards**: one card per SCEP profile, with the trust
  anchor expiry countdown badge:
  - `green` ≥ 30 days remaining
  - `amber` 7-30 days remaining (rotate soon)
  - `red` < 7 days remaining
  - `EXPIRED` past `NotAfter`
- **Live counters**: the per-status enrollment counts polled every
  30s. The order in the grid puts `success` first (vanity) and
  failure modes after.
- **Recent failures table**: the last 50 audit-log events with
  action `scep_pkcsreq_intune` or `scep_renewalreq_intune`, sorted
  by timestamp descending. Polled every 60s.
- **Trust anchor reload button**: confirms via modal then issues
  `POST /api/v1/admin/scep/intune/reload-trust` (the SIGHUP-equivalent).
  Bad reloads keep the OLD pool in place; the modal stays open with
  the underlying error so the operator can correct the file and retry.

Three admin endpoints back the page:

- `GET /api/v1/admin/scep/profiles`: per-profile snapshot for the
  Profiles tab; surfaces RA cert subject + NotAfter + days-to-expiry,
  mTLS sibling-route status + bundle path, challenge-password-set flag,
  and an optional `intune` sub-block for Intune-enabled profiles.
- `GET /api/v1/admin/scep/intune/stats`: Intune-specific deep-dive
  for the Intune Monitoring tab; per-status counters + trust anchor
  pool details. Backward-compat shape preserved from Phase 9.
- `POST /api/v1/admin/scep/intune/reload-trust`: SIGHUP-equivalent
  trust anchor reload, body `{"path_id": "<pathID>"}`.

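
For automation that rotates the Connector cert and then triggers a reload, the request to the reload endpoint can be built as below. The path and the `path_id` body field come from the endpoint list above; the base URL, the token value, the `Content-Type` header, and the helper name are assumptions for illustration. The sketch builds the request without sending it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newReloadTrustRequest prepares the SIGHUP-equivalent reload call for
// one SCEP profile. The caller sends it with an http.Client of its choice.
func newReloadTrustRequest(baseURL, token, pathID string) (*http.Request, error) {
	body, err := json.Marshal(map[string]string{"path_id": pathID})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost,
		baseURL+"/api/v1/admin/scep/intune/reload-trust", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	// Admin-gated endpoint: the token must belong to an admin caller.
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newReloadTrustRequest("https://certctl.example.com", "example-admin-token", "corp")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	// To send: resp, err := http.DefaultClient.Do(req), then check resp.StatusCode.
}
```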
All three are M-008 admin-gated. Non-admin Bearer callers get HTTP 403
and a clear message; the GUI hides the page entirely for non-admin users
(UX hint; server-side enforcement is independent).

### Recommended alert thresholds

The counters are exposed in the GUI as snapshots; if you wrap them
in a Prometheus exporter (V3-Pro plug-in seam; V2 doesn't ship a
`/metrics` surface today), reasonable starting thresholds:

- `signature_invalid` rate > 0 for > 5 minutes → page on-call. The
  trust anchor is stale; the operator missed a SIGHUP after a
  Connector rotation.
- `claim_mismatch` rate > 0 sustained > 1 hour → notify (not page).
  An Intune SCEP profile is misconfigured; an admin needs to fix
  the SAN definition or the operator's CertificateProfile.
- `replay` rate climbing → notify. Either an aggressive retry policy
  on the device side OR active replay attempts. Cross-reference
  source IPs in the audit log.
- `rate_limited` for a single device > 1 per hour → notify. Either
  legitimate enrollment storm (post-wipe scenarios) or a compromised
  Connector signing key.
- Trust anchor `days_to_expiry` < 30 on any profile → notify; rotate
  the Connector's signing cert before the cliff.

## Limitations

This bundle is V2-free. The following capabilities are deferred to
V3-Pro:

- **Native Microsoft Graph integration.** certctl validates the
  Connector's signed challenge but doesn't call Microsoft's API
  directly; the Connector already did that. V3-Pro could ship a
  Graph client that pulls device-compliance state in addition to
  the challenge claim.
- **Conditional Access compliance gating.** The dispatcher exposes a
  nil-default `ComplianceCheck` hook. V3-Pro plugs in a Microsoft
  Graph compliance lookup before issuance; non-compliant devices
  fail with a typed `compliance_failed` failInfo.
- **Per-tenant trust anchors.** V2 has one trust anchor pool per
  SCEP profile; V3-Pro could support per-AAD-tenant anchor scoping
  for MSPs running shared certctl deployments across customers.
- **OCSP stapling at SCEP-response time.** The CertRep doesn't carry
  a stapled OCSP response today; certificate validators look up OCSP
  via the `id-pkix-ocsp` access method in the Authority Information
  Access extension on the issued cert. V3-Pro could staple inline.
- **Auto-discovery of the Connector signing cert.** V2 requires the
  operator to extract the cert manually and configure the path.
  V3-Pro could pull from a Microsoft-published endpoint (with the
  appropriate trust constraints).

These deferrals are deliberate, not oversights. The V2 surface
covers every operationally-required path for a single-tenant
enterprise replacing NDES; V3-Pro adds the multi-tenant + native-API
features procurement teams sometimes ask for.

## Microsoft support statement

Microsoft documents the Intune Certificate Connector as
**RFC-8894-compliant** and supports its use against any RFC 8894
SCEP server. The relevant Microsoft Learn pages:

- [Intune Certificate Connector overview](https://learn.microsoft.com/en-us/mem/intune/protect/certificate-connector-overview):
  documents the Connector's architecture and explicitly notes it
  speaks RFC-8894-compliant SCEP.
- [Use SCEP certificate profiles in Intune](https://learn.microsoft.com/en-us/mem/intune/protect/certificates-scep-configure):
  the operator-facing setup guide, with the SCEP server URL field
  the migration playbook above edits.
- [Validate setup of Intune Certificate Connector](https://learn.microsoft.com/en-us/mem/intune/protect/certificate-connector-install):
  the install-validation checklist; useful when troubleshooting
  Connector-side failures vs. certctl-side failures.

certctl's role per Microsoft's framing: a third-party SCEP server
that the Connector posts to. Microsoft supports this topology; only
certctl's own RFC 8894 implementation is in scope for certctl
support. The end-to-end Connector → certctl → issuer flow is
exercised in `internal/api/handler/scep_intune_e2e_test.go` and
the golden-file fixtures in `internal/scep/intune/testdata/`.

## Related docs

- [`legacy-est-scep.md`](legacy-est-scep.md): the per-profile SCEP
  setup guide + RFC 8894 reference + mTLS sibling route. Read this
  first if you're not already running certctl SCEP for non-Intune
  fleets.
- [`architecture.md`](architecture.md): overall control-plane
  architecture; Security Model section calls out the Intune trust
  anchor as a sensitive operator-configured surface.
- [`features.md`](features.md): every `CERTCTL_*` env var,
  including the per-profile `CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_*`
  family.
- [`tls.md`](tls.md): TLS bootstrap for the certctl control plane;
  prerequisite for any production deploy.