docs: Phase 2 mechanical file moves to subdirectory structure

Pure git mv operations; no content edits. Internal links remain pointing
at old paths and will be fixed in Phase 11. Per the Phase 1 audit
recommendations at cowork/docs-overhaul-phase-1-audit-2026-05-04/.

35 files moved across 8 audience-organized subdirectories:

  docs/getting-started/ (5):
    quickstart.md, concepts.md, examples.md, advanced-demo.md (was
    demo-advanced.md), why-certctl.md

  docs/reference/ (6):
    architecture.md, api.md (was openapi.md), mcp.md,
    intermediate-ca-hierarchy.md, deployment-model.md (was
    deployment-atomicity.md), vendor-matrix.md (was
    deployment-vendor-matrix.md)

  docs/reference/protocols/ (6):
    acme-server.md, acme-server-threat-model.md, scep-intune.md,
    est.md, crl-ocsp.md, async-ca-polling.md (was async-polling.md)

  docs/operator/ (4):
    security.md, tls.md, database-tls.md, approval-workflow.md

  docs/operator/runbooks/ (3):
    cloud-targets.md (was runbook-cloud-targets.md), expiry-alerts.md
    (was runbook-expiry-alerts.md), disaster-recovery.md

  docs/migration/ (3):
    from-certbot.md (was migrate-from-certbot.md), from-acmesh.md
    (was migrate-from-acmesh.md), cert-manager-coexistence.md (was
    certctl-for-cert-manager-users.md)

  docs/compliance/ (4):
    index.md (was compliance.md), soc2.md (was compliance-soc2.md),
    pci-dss.md (was compliance-pci-dss.md), nist-sp-800-57.md (was
    compliance-nist.md)

  docs/contributor/ (4):
    testing-strategy.md, test-environment.md (was test-env.md),
    ci-pipeline.md, qa-test-suite.md (was qa-test-guide.md)

Deferred to later Phase 2 sub-phases:
  - connectors.md split (Phase 4): docs/connectors.md +
    docs/connector-{apache,f5,iis,k8s,nginx}.md still at top level
  - testing-guide.md prune (Phase 5): docs/testing-guide.md still
    at top level
  - features.md disperse (Phase 6): docs/features.md still at top
    level
  - legacy-est-scep.md split (Phase 7): docs/legacy-est-scep.md
    still at top level
  - ACME walkthrough re-homing (Phase 8): three
    docs/acme-*-walkthrough.md still at top level
  - Upgrade docs archive (Phase 3): two docs/upgrade-*.md still
    at top level

Cross-reference updates (Phase 11) will happen after all moves and
content edits land. Internal links to docs/* paths are temporarily
broken until that phase completes.
This commit is contained in:
shankar0123
2026-05-05 02:49:28 +00:00
parent f347811cfb
commit b375df767e
35 changed files with 0 additions and 0 deletions
+196
View File
@@ -0,0 +1,196 @@
# OpenAPI Specification Guide
certctl ships with a complete OpenAPI 3.1 specification at `api/openapi.yaml`. This spec documents all 78 API operations currently specified, every request/response schema, pagination conventions, authentication requirements, and error formats. It's the single source of truth for the documented REST API. (Note: The spec will be updated to include 7 additional certificate discovery endpoints from M18b.)
This guide covers how to use the spec for API exploration, client SDK generation, and integration testing.
## Where to Find It
The spec lives at `api/openapi.yaml` in the repository root. It's versioned alongside the code and updated with every API change.
```bash
# View the spec
cat api/openapi.yaml
# Count operations
grep "operationId:" api/openapi.yaml | wc -l
# 78 (includes health + ready, 7 discovery endpoints pending spec update)
```
## Viewing with Swagger UI
The fastest way to explore the API interactively is Swagger UI. Run it as a Docker container pointing at the spec:
```bash
# From the certctl repo root
docker run -p 8080:8080 \
-e SWAGGER_JSON=/spec/openapi.yaml \
-v $(pwd)/api:/spec \
swaggerapi/swagger-ui
```
Open http://localhost:8080 to see the full API reference with "Try it out" buttons for every endpoint.
Alternatively, use Redoc for a cleaner read-only view:
```bash
docker run -p 8080:80 \
-e SPEC_URL=/spec/openapi.yaml \
-v $(pwd)/api:/usr/share/nginx/html/spec \
redocly/redoc
```
## API Structure
The spec organizes endpoints into 16 tags:
| Tag | Endpoints | Description |
|-----|-----------|-------------|
| Certificates | 12 | CRUD, versions, renewal, deployment, revocation, deployments |
| CRL & OCSP | 3 | JSON CRL, DER CRL per issuer, OCSP responder |
| Issuers | 5 | CA connector management |
| Targets | 5 | Deployment target management |
| Agents | 7 | Registration, heartbeat, CSR submission, work polling |
| Jobs | 5 | Job queue with approve/reject |
| Policies | 5 | Policy rules and violations |
| Profiles | 5 | Certificate enrollment profiles |
| Teams | 5 | Team management |
| Owners | 5 | Certificate owners |
| Agent Groups | 5 | Dynamic agent grouping |
| Audit | 2 | Immutable audit trail |
| Notifications | 3 | Notification events |
| Stats | 5 | Dashboard statistics |
| Metrics | 1 | System metrics |
| Health | 3 | Health, readiness, auth info |
## Authentication
The spec declares a `bearerAuth` security scheme applied globally. All endpoints under `/api/v1/` require a Bearer token by default:
```bash
# The default compose stack uses a self-signed cert; pin with --cacert
curl --cacert ./deploy/test/certs/ca.crt \
-H "Authorization: Bearer your-api-key" \
https://localhost:8443/api/v1/certificates
```
Three endpoints are exempt from auth (declared with `security: []` in the spec): `/health`, `/ready`, and `/api/v1/auth/info`. The auth info endpoint tells clients whether authentication is enabled and what type is required — useful for GUIs that need to show/hide a login screen.
## Pagination Convention
All list endpoints follow the same pagination pattern:
**Request parameters:**
- `page` (integer, default 1) — page number
- `per_page` (integer, default 50, max 500) — results per page
**Response envelope:**
```json
{
"data": [...],
"total": 150,
"page": 1,
"per_page": 50
}
```
Certificates also support cursor-based pagination for large datasets:
- `cursor` (string) — opaque cursor token from previous response
- `page_size` (integer) — results per page when using cursor mode
## Generating Client SDKs
The OpenAPI spec can generate typed client libraries for any language. Here are examples using common generators:
### TypeScript (openapi-typescript-codegen)
```bash
npx openapi-typescript-codegen \
--input api/openapi.yaml \
--output src/generated/certctl \
--client axios
```
### Python (openapi-python-client)
```bash
pip install openapi-python-client
openapi-python-client generate --path api/openapi.yaml
```
### Go (oapi-codegen)
```bash
go install github.com/oapi-codegen/oapi-codegen/v2/cmd/oapi-codegen@latest
oapi-codegen -generate types,client -package certctl api/openapi.yaml > certctl_client.go
```
### Java (OpenAPI Generator)
```bash
npx @openapitools/openapi-generator-cli generate \
-i api/openapi.yaml \
-g java \
-o generated/java-client
```
## Validating the Spec
Verify the spec is valid OpenAPI 3.1:
```bash
# Using spectral (recommended)
npx @stoplight/spectral-cli lint api/openapi.yaml
# Using swagger-cli
npx @apidevtools/swagger-cli validate api/openapi.yaml
```
## Using with Postman
Import the spec directly into Postman:
1. Open Postman → Import → File → select `api/openapi.yaml`
2. Postman creates a collection with all 78 documented operations organized by tag
3. Set the `baseUrl` variable to `https://localhost:8443` (HTTPS-only as of v2.2)
4. Add an `Authorization: Bearer your-api-key` header to the collection
5. Import the demo stack CA bundle (`deploy/test/certs/ca.crt`) into Postman's Settings → Certificates → CA Certificates, or disable certificate verification for the `localhost` host (Settings → General → SSL certificate verification)
## Key Schemas
The spec defines typed schemas for all domain objects. Key schemas to know:
| Schema | Description |
|--------|-------------|
| `ManagedCertificate` | Core certificate record with status, expiry, owner, tags, profile |
| `CertificateVersion` | Individual cert version with PEM, serial, fingerprint, validity |
| `Agent` | Agent with heartbeat, metadata (OS, arch, IP, version), capabilities |
| `Job` | Job record with type, status (7 states), certificate/target references |
| `PolicyRule` | Policy with type (5 types), config, severity, enabled state |
| `CertificateProfile` | Enrollment profile with allowed key types, max TTL, constraints |
| `AuditEvent` | Immutable audit record with actor, action, resource, timestamp |
| `RevocationReason` | RFC 5280 reason code enum (8 values) |
| `DashboardSummary` | Aggregate stats (total certs, expiring, agents, jobs) |
## Integration Testing
Use the spec to generate contract tests that verify the API matches the spec:
```bash
# Using schemathesis for fuzz testing against the spec
pip install schemathesis
# The default compose stack uses a self-signed cert — export a CA bundle or set REQUESTS_CA_BUNDLE
export REQUESTS_CA_BUNDLE=$(pwd)/deploy/test/certs/ca.crt
schemathesis run api/openapi.yaml \
--base-url https://localhost:8443 \
--header "Authorization: Bearer your-api-key"
```
This sends randomized valid requests to every endpoint and verifies the responses match the declared schemas.
## What's Next
- [MCP Server Guide](mcp.md) — AI-native access to the certctl API
- [Quick Start](quickstart.md) — Get certctl running locally
- [Connector Guide](connectors.md) — Build custom issuer and target connectors
- [Architecture](architecture.md) — System design deep dive
File diff suppressed because it is too large Load Diff
+359
View File
@@ -0,0 +1,359 @@
# Deployment Atomicity, Post-Deploy Verification, and Rollback
> Deploy-hardening I master bundle (v2.X.0). Operator + integrator
> reference for the atomic-write + post-deploy TLS verify +
> rollback pipeline that closes the procurement-checklist gap with
> commercial competitors (Venafi, DigiCert Certificate Manager,
> Sectigo).
## 1. Overview
Before deploy-hardening I, certctl's target connectors used
duplicated `os.WriteFile` flows. A failure mid-deploy could leave
a target with a renewed cert but no chain (or vice versa); a
reload-fail produced a half-deployed state that required manual
rollback; a wrong-vhost cert was silent until users reported it.
Deploy-hardening I closes three procurement-checklist gaps in
a single shared primitive:
| Gap | Pre-bundle | Post-bundle |
|---|---|---|
| **Atomic deploy with rollback** | F5 only (transactional API) | 12 of 13 connectors via `deploy.Apply` (K8s pending Bundle 2 — see [Section 1.5](#15-audit-closure-status-2026-05-02-deployment-target-audit)) |
| **Post-deploy TLS verification** | None | NGINX/Apache/HAProxy/Traefik/Caddy/Envoy/Postfix all do TLS handshake + SHA-256 fingerprint compare; fail → rollback |
| **Vendor-specific deployment recipes** | Light docs | (Bundle II — `cowork/deploy-hardening-ii-prompt.md`) |
This document describes the operator-visible surface. The Go-level
contract lives at `internal/deploy/doc.go`.
## 1.5. Audit closure status (2026-05-02 deployment-target audit)
The 2026-05-02 deployment-target coverage audit
(`cowork/deployment-target-audit-2026-05-02/RESULTS.md`) tightened the
atomic + rollback contract on the connectors below. All bundles in the
table are committed to `master` as of this section's last edit; commit
hashes pin to the canonical landing commit for each piece of work.
| Connector | Bundle | Commit | Closes |
|-----------------|-----------|-----------|--------|
| envoy | Bundle 3 | `d8cd981` | atomic SDS JSON write + post-deploy watcher pickup poll |
| traefik | Bundle 4 | `37634e6` | single `deploy.Apply` Plan + all-files atomicity + rollback |
| iis | Bundle 5 | `223f279` | pre-deploy `Get-WebBinding` snapshot + on-failure binding rollback |
| ssh | Bundle 6 | `eb39059` | pre-deploy SFTP snapshot + reload-failure rollback |
| wincertstore | Bundle 7 | `1dd1dd4` | `Get-ChildItem` snapshot + on-import-failure rollback |
| javakeystore | Bundle 8 | `87e0009` | `keytool -exportkeystore` snapshot + on-import-failure rollback + operator playbook for argv password |
| caddy | Bundle 9 | `8cda860` | duration metric fix + file-mode PEM validate + api-mode SHA-256 idempotency |
| postfix/dovecot | Bundle 11 | `88e8881` | applyDefaults + verify-fails-rollback test pin under Mode=dovecot |
**Outstanding from the same audit:**
- **Bundle 2 (k8ssecret).** The production `realK8sClient` is still a
stub (see Section 3 / row `k8ssecret` below). Replacing it with a
real `k8s.io/client-go` implementation + `ResourceVersion` plumbing
+ post-deploy SHA-256 verify + kubelet sync poll is the remaining
V2 P0 blocker. Tracking prompt:
`cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md`.
Bundle 10 (per-connector loadtest harness, commit `6286cd4`) does not
modify the per-connector contract table; it's a CI / observability
addition documented separately at `deploy/test/loadtest/README.md`.
The original Bundle 1 audit spec read "soften the IIS / SSH /
WinCertStore / JavaKeystore rollback claims first while bundles 58
catch the implementation up". Execution order inverted that loop —
Bundles 311 shipped before the doc-realignment commit, so the rows
in Section 3 below are honest as-shipped without ever needing a
softening pass. The K8s row is the one exception, and Section 3's
notes call it out explicitly.
## 2. The atomic-write primitive — `Plan` / `Apply`
`internal/deploy.Apply(ctx, plan)` is the load-bearing entry
point. Connectors build a `Plan` describing one or more files +
their PreCommit (validate) and PostCommit (reload) hooks; Apply
executes them all-or-nothing.
```go
plan := deploy.Plan{
Files: []deploy.File{
{Path: "/etc/nginx/certs/cert.pem", Bytes: certPEM, Mode: 0644},
{Path: "/etc/nginx/certs/chain.pem", Bytes: chainPEM, Mode: 0644},
{Path: "/etc/nginx/certs/key.pem", Bytes: keyPEM, Mode: 0640},
},
PreCommit: func(ctx context.Context, tempPaths map[string]string) error {
// Run `nginx -t` against the staged config — bytes already
// written to <path>.certctl-tmp.<unix-nanos>.
return runValidate(ctx, "nginx -t")
},
PostCommit: func(ctx context.Context) error {
return runReload(ctx, "nginx -s reload")
},
}
res, err := deploy.Apply(ctx, plan)
```
Apply's algorithm:
1. Per-file mutex acquired (sync.Map; coarse-grained per-path
serialization).
2. SHA-256 idempotency short-circuit. If every File's destination
already matches, return `Result.SkippedAsIdempotent=true`
without firing PreCommit/PostCommit.
3. Pre-deploy backup: copy each existing destination to
`<path>.certctl-bak.<unix-nanos>`.
4. Write each File's bytes to `<path>.certctl-tmp.<unix-nanos>`
in the destination directory (same-filesystem rename).
5. Apply ownership (chown + chmod) to each temp file BEFORE
rename so the swap is atomic with the right perms.
6. Call `PreCommit(ctx, tempPaths)`. On error: clean up temps;
return `ErrValidateFailed`.
7. `os.Rename` each temp → final. POSIX guarantees atomic.
8. Call `PostCommit(ctx)`. On error: restore each backup; re-call
PostCommit. If second PostCommit also fails: return
`ErrRollbackFailed` (operator-actionable).
9. Janitor: prune backups beyond `Plan.BackupRetention`
(default 3, -1 to disable).
## 3. Per-connector atomic contract
| Connector | PreCommit (validate) | PostCommit (reload) | Post-deploy verify | Quirks |
|---|---|---|---|---|
| nginx | `nginx -t` | `nginx -s reload` | TLS handshake to `host:443` | Default key mode 0640 (worker reads via group) |
| apache | `apachectl configtest` | `apachectl graceful` | TLS handshake | Default key mode 0600; per-distro user (apache2/apache/httpd) |
| haproxy | `haproxy -c -f <cfg>` | `systemctl reload haproxy` | TLS handshake | Combined PEM (cert+chain+key in one file); default mode 0600 |
| traefik | (none — file watcher) | (none — file watcher auto-reloads) | TLS handshake | atomic-write only; ValidateOnly returns sentinel |
| caddy (file mode) | (none) | (none — file watcher) | TLS handshake | atomic-write replaces os.WriteFile |
| caddy (api mode) | Probe admin /config/ | POST /load (already atomic at admin server) | (admin server confirms) | ValidateOnly real impl probes admin API |
| envoy | (none — SDS file watcher) | (none — SDS file watcher) | TLS handshake | atomic-write replaces os.WriteFile |
| postfix | `postfix check` | `postfix reload` | TLS handshake to port 25 | Chain appended to cert if no ChainPath |
| dovecot | `doveconf -n` | `doveadm reload` | TLS handshake to port 993 | Same code path as postfix |
| f5 | (Authenticate probe) | (Transactional commit) | TLS handshake to VS | Already transactional; rollback automatic via failed commit |
| iis | (Get-WebSite probe) | (PowerShell cert install) | TLS handshake | Already explicit pre-deploy backup + post-rollback re-import |
| ssh | (Connect probe) | (SCP upload + remote chmod) | `tls.Dial` to remote TLS port | Pre-deploy SCP backup of remote files |
| wincertstore | (Get-ChildItem Cert:\) | (Import-PfxCertificate) | (admin probe) | Get-ChildItem snapshot for rollback |
| javakeystore | (`keytool -list`) | (`keytool -importkeystore`) | (admin probe) | keytool snapshot; rollback via `keytool -delete` + re-import |
| k8ssecret | (V2 blocker — see note below) | (V2 blocker — see note below) | (V2 blocker — see note below) | **V2 blocker — Bundle 2 of the 2026-05-02 deployment-target audit.** Production `realK8sClient` at `internal/connector/target/k8ssecret/k8ssecret.go:397-420` is a stub (every method returns `"real Kubernetes client not implemented — use NewWithClient for tests"`). The SHA-256 post-deploy verify and kubelet sync poll are designed but not yet implemented; production deploys to a real cluster fail with "not implemented" until Bundle 2 lands. Test mocks via `NewWithClient` work today. Tracking prompt: `cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md`. |
> **Postfix vs Dovecot mode**: see "Choosing Mode=postfix vs Mode=dovecot" in
> `docs/connectors.md` for the per-mode defaults (cert/key paths, validate +
> reload commands), the dual-deploy guidance for mail servers running both
> daemons, and the test-pin reference (Bundle 11 commit `88e8881`).
## 4. Post-deploy TLS verification
Frozen decision 0.3 (deploy-hardening I): post-deploy verify is
**ON by default** when the operator configures
`PostDeployVerify.Endpoint`. Per-target opt-out via
`PostDeployVerify.Enabled = false`.
The connector-side flow:
```go
// After Apply returns successfully, the connector dials the
// configured endpoint, pulls the leaf cert SHA-256, and compares.
res := tlsprobe.ProbeTLS(ctx, "nginx-test:443", 10*time.Second)
if res.Fingerprint != certPEMToFingerprint(deployedCertPEM) {
// Mismatch — wrong vhost, NGINX serving cached cert,
// load-balanced target hit a different pod, etc.
rollbackToBackups(ctx, applyResult.BackupPaths)
emitAlert("post-deploy verify SHA-256 mismatch")
}
```
Retry with **exponential backoff** (default 3 attempts; 1s initial, 16s cap) defends
against load-balanced targets where the verify might hit a
different pod that hasn't picked up the new cert yet. Backoff grows 1s → 2s → 4s → 8s → 16s,
giving the LB fleet time to converge before giving up. Operators preserving V2 linear semantics
(every attempt waits the same interval) set `post_deploy_verify_max_backoff` equal to
`post_deploy_verify_backoff`.
```yaml
post_deploy_verify:
enabled: true
endpoint: "nginx.svc.cluster.local:443"
timeout: 10s
post_deploy_verify_attempts: 3
post_deploy_verify_backoff: 1s
post_deploy_verify_max_backoff: 16s
```
## 5. Rollback semantics
Rollback fires automatically on three triggers:
1. **PostCommit (reload) fails** → Apply restores backups + retries
reload. Returns `ErrReloadFailed` on success (degraded
no-op) or `ErrRollbackFailed` if the second reload also fails.
2. **Post-deploy verify fails** → Connector manually triggers
rollback (Apply already returned successfully). Backups are
restored + reload is invoked again. Same escalation path on
second failure.
3. **Mid-loop rename fails** (rare; only with cross-filesystem
misuse) → Apply rolls back the renames that already
succeeded.
`ErrRollbackFailed` is operator-actionable. The destination is in
a known-bad state; operators must either:
- Restore from `Result.BackupPaths` manually + run `<reload command>`
- Push a fresh known-good cert via the next deploy cycle
The `certctl_deploy_rollback_total{outcome="also_failed"}` metric
is the alert target.
## 6. ValidateOnly — dry-run mode
`target.Connector.ValidateOnly(ctx, request)` runs the validate
step without touching the live cert. Connectors that can't
dry-run (Traefik / Envoy / Caddy file mode) return
`target.ErrValidateOnlyNotSupported`.
| Connector | ValidateOnly |
|---|---|
| nginx | `nginx -t` |
| apache | `apachectl configtest` |
| haproxy | `haproxy -c -f <cfg>` |
| postfix/dovecot | `postfix check` / `doveconf -n` |
| caddy (api) | GET /config/ probe |
| caddy (file) / traefik / envoy | `ErrValidateOnlyNotSupported` |
| f5 | `client.Authenticate()` probe |
| iis | `Get-WebSite -Name <SiteName>` |
| ssh | `client.Connect()` probe |
| wincertstore | `Get-ChildItem Cert:\<loc>\<store>` |
| javakeystore | `keytool -list -keystore <path>` |
| k8ssecret | `client.GetSecret()` RBAC probe |
Operators preview a deploy via the agent's `--dry-run` flag (or
the equivalent CLI invocation).
## 7. File ownership + mode preservation
The single most common silent-failure mode pre-bundle: agent runs
as root, calls `os.WriteFile(path, bytes, 0600)`, locks NGINX out
of the existing nginx:nginx 0640 key file.
Per frozen decision 0.7, `deploy.Apply` resolves ownership via
this precedence:
1. Explicit `File.Mode` / `File.Owner` / `File.Group` (per-target
config) → use as given.
2. Existing destination file → preserve its `chown` + `chmod`.
3. `Plan.Defaults.Mode` / `.Owner` / `.Group` → use as fallback
for new files.
4. Nothing set → `os.WriteFile` default (0644) for new files;
preserved for existing.
Per-connector defaults (cross-distro, fall back to no-chown if
no candidate user exists):
| Connector | Default user | Default group | Default cert mode | Default key mode |
|---|---|---|---|---|
| nginx | nginx → www-data | nginx → www-data | 0644 | 0640 |
| apache | apache → www-data → httpd | same | 0644 | 0600 |
| haproxy | haproxy | haproxy | n/a (combined PEM) | 0600 |
| postfix | postfix → dovecot → _postfix | same | 0644 | 0600 |
| traefik | (none) | (none) | 0644 | 0600 |
| envoy | (none) | (none) | 0644 | 0600 |
| caddy | (none) | (none) | 0644 | 0600 |
## 8. Per-target deploy mutex
Phase 2 of the master bundle: the agent (`cmd/agent/main.go`)
serializes concurrent deploys to the same target ID via a
`sync.Map[targetID]*sync.Mutex`. Granularity per frozen decision
0.5: one mutex per target, NOT per (target, cert).
Cert deploy throughput is operator-grade tens-per-minute. Coarse
serialization is fine and simplifies reasoning about reload-side
race windows.
## 9. Idempotency via SHA-256
Every `deploy.Apply` short-circuits when all File destinations
already match SHA-256 of the new bytes. PreCommit + PostCommit do
not fire; backups are not created; the result reports
`SkippedAsIdempotent = true`.
Defends against agent-restart retry storms that would otherwise
hammer targets with no-op reloads. Operator-visible signal:
`certctl_deploy_idempotent_skip_total{target_type="..."}`.
## 10. Troubleshooting matrix
| Symptom | Root cause | Operator action |
|---|---|---|
| `ErrValidateFailed: nginx -t failed` | Validate command rejected the staged config | Read PreCommit's wrapped error for the nginx stderr; fix config |
| `ErrReloadFailed: nginx -s reload failed; rolled back` | Reload command failed; rollback succeeded; serving the OLD cert | Investigate why reload failed; re-deploy when fixed |
| `ErrRollbackFailed` | Reload AND rollback both failed; in known-bad state | Restore from `Result.BackupPaths` manually; run reload command directly; check disk space + ownership |
| `post-deploy TLS verify SHA-256 mismatch` | New cert deployed but a different cert is being served (cached, wrong vhost, stale pod in load balancer) | Check NGINX SSL session cache TTL; verify SNI; bump verify retries via `PostDeployVerifyAttempts` |
| `chown ... permission denied` (in agent log) | Non-root agent OR target user doesn't exist on host | Verify agent runs as root in production; check distro user (Debian: www-data, RHEL: nginx) |
| Backups accumulating in cert dir | BackupRetention misconfigured | Set `BackupRetention: 3` (default) or higher on per-target config |
| File world-readable after deploy | Default mode 0644 applied to new key file | Set explicit `KeyFileMode: 0640` (NGINX) or `KeyFileMode: 0600` (Apache) |
## 11. V3-Pro deferrals
Out of scope for the V2-free deploy-hardening I bundle:
- **Multi-region deployment coordination** — orchestration of N
data-center deploys with operator approval gates per stage.
- **Cert-pinning verification against mobile-app pin manifests**.
- **SOC 2 evidence-report generator** — auto-export of the
deploy audit trail in the format SOC 2 auditors expect.
- **Customer-paid validation matrices** — vendor-version certified
quirks (e.g. "tested on F5 v15.1 + v17.0 + v17.5"). See
`cowork/deploy-hardening-ii-prompt.md` for the per-vendor
edge-case audit + integration test sidecars.
## 12. Per-connector quick reference
Paste-able config snippets for the most-used connectors. Full
field reference at `docs/connectors.md`.
### NGINX
```yaml
target_type: nginx
target_config:
cert_path: /etc/nginx/certs/cert.pem
chain_path: /etc/nginx/certs/chain.pem
key_path: /etc/nginx/certs/key.pem
reload_command: "nginx -s reload"
validate_command: "nginx -t"
cert_file_mode: 0644
key_file_mode: 0640
post_deploy_verify:
enabled: true
endpoint: "nginx.example.com:443"
timeout: 10s
backup_retention: 3
```
### HAProxy
```yaml
target_type: haproxy
target_config:
pem_path: /etc/haproxy/certs/cert.pem
reload_command: "systemctl reload haproxy"
validate_command: "haproxy -c -f /etc/haproxy/haproxy.cfg"
pem_file_mode: 0600
post_deploy_verify:
enabled: true
endpoint: "haproxy.example.com:443"
```
### Traefik (file watcher; no reload command)
```yaml
target_type: traefik
target_config:
cert_dir: /etc/traefik/certs
cert_file: cert.pem
key_file: key.pem
post_deploy_verify:
enabled: true
endpoint: "traefik.example.com:443"
```
See per-connector tests at
`internal/connector/target/<name>/<name>_atomic_test.go` for the
full failure-mode matrix each connector handles.
+231
View File
@@ -0,0 +1,231 @@
# Intermediate CA hierarchy — operator runbook
Rank 8 of the 2026-05-03 deep-research deliverable. This page is the
canonical reference for operators running certctl as a multi-level
internal PKI.
The default `single`-mode flow (one operator-supplied sub-CA loaded
from disk at boot) is unchanged and will keep working byte-for-byte
forever. This page is for operators who need a real CA tree:
- FedRAMP boundary-CA deployments where the regulator requires
separation of policy and issuing authorities.
- Financial-services policy-CA deployments (one root, one policy CA
per business unit, one issuing CA per environment).
- OT / industrial control networks where the air-gapped root signs
online sub-CAs that go in and out of service on a rotation.
## Concepts
`Issuer.HierarchyMode` is a per-issuer column on the `issuers` table.
Two values are valid (the database default is `"single"` — back-compat
byte-identical for unmigrated rows):
- `single` — pre-Rank-8 historical flow. The local connector loads a
pre-signed CA cert+key from disk via `local.Config.CACertPath` /
`local.Config.CAKeyPath`. Existing operators upgrade with no
behavior change.
- `tree` — the issuer's CAs are managed via the `intermediate_cas`
table. Chain assembly walks the `parent_ca_id` foreign key from the
issuing leaf CA up to the root and attaches the assembled chain to
every `IssuanceResult`.
Each row in `intermediate_cas` is one CA cert (root, policy, issuing).
The lifecycle is `created``active``retiring``retired`. The
state column is a closed enum and validates at the service layer; the
postgres CHECK constraint enforces it at the database layer too.
A CA's private key bytes are NEVER persisted on the row. The
`key_driver_id` column is a reference (filesystem path / KMS key ID /
HSM slot) that the `signer.Driver` resolves at sign time. A SQL
injection or a row-leak surface MUST NEVER expose key bytes; only the
reference can leak.
## Lifecycle states
```mermaid
stateDiagram-v2
[*] --> created : CreateRoot / CreateChild
created --> active : registration completes
active --> retiring : Retire(confirm=false)
retiring --> retired : Retire(confirm=true)
retired --> [*]
note right of retiring
Drain start. CA stops issuing
NEW children; existing children
keep issuing until they retire.
end note
note right of retired
Terminal. Refused if active children
remain (ErrCAStillHasActiveChildren
→ HTTP 409). OCSP keeps responding
for already-issued leaves until expiry.
end note
```
Drain-first semantics: a CA in `retiring` state cannot terminalize to
`retired` while it still has active children. The service layer
returns `ErrCAStillHasActiveChildren`; the API surfaces HTTP 409. Drain
the children first.
## Common deployment patterns
### Pattern A — 4-level FedRAMP boundary CA
```mermaid
flowchart TD
Root["Acme Root CA<br/>path_len=3<br/>offline air-gapped"]
Policy["Acme Policy CA<br/>path_len=2<br/>FedRAMP-Moderate boundary"]
IssA["Acme Issuing A<br/>path_len=0<br/>prod workload leaves"]
IssB["Acme Issuing B<br/>path_len=0<br/>ephemeral pod identity"]
Root --> Policy --> IssA --> IssB
```
Operator workflow:
1. Mint the root cert+key on the offline workstation. Move the cert
PEM (no key) to the online operator workstation.
2. `POST /api/v1/issuers/{id}/intermediates` with the empty
`parent_ca_id` and `root_cert_pem` + `key_driver_id` populated
(the operator pre-positions the root key file at the path the
`key_driver_id` points to). The service validates RFC 5280 §3.2
self-signed semantics + cross-checks the operator-supplied key
matches the cert (rejects mismatched bundles at registration time
with `ErrCAKeyMismatch`).
3. `POST /api/v1/issuers/{id}/intermediates` with `parent_ca_id`
pointing at the root for the Policy CA. The service generates the
child key via `signer.Driver.Generate`, signs the child cert via
the parent's signer (loaded from the parent's `key_driver_id`),
and persists the new row with the next `path_len` value (parent's
- 1 if unset). Repeat for each lower level.
4. Set `Issuer.HierarchyMode = "tree"` on the issuer row + set the
`treeIssuingCAID` connector field to point at the deepest CA
(Acme Issuing B in the example above) — issued leaves chain via
`AssembleChain` from B up to the root.
### Pattern B — 3-level financial-services policy CA
```mermaid
flowchart TD
Root["FinCo Root CA<br/>path_len=2"]
Pol["FinCo Trading Policy CA<br/>path_len=1<br/>permitted DNS = trading.finco.example"]
Iss["FinCo Trading Issuing CA<br/>path_len=0"]
Root --> Pol --> Iss
```
Per business-unit name constraints: each policy CA carries a
`PermittedDNSDomains` list scoped to the business unit (RFC 5280
§4.2.1.10). The service enforces subset semantics — a child policy CA
cannot widen the parent's permitted set, and cannot remove an
excluded subtree. Operators submit `name_constraints` on the
`POST /api/v1/issuers/{id}/intermediates` body.
### Pattern C — 2-level internal PKI
```mermaid
flowchart TD
Root["Internal Root CA<br/>path_len=0"]
Iss["Internal Issuing CA<br/>path_len=0<br/>issues leaves directly"]
Root --> Iss
```
The simplest tree-mode deployment. Roughly equivalent to single mode
in terms of operator overhead, but provides one extra layer of
indirection so the root key can stay offline while only the issuing
CA's key sits on the certctl host.
## RFC 5280 enforcement
All enforcement happens at the service layer. The local connector
trusts the service's contract; the API layer translates errors to
HTTP codes.
- §3.2 self-signed root validation: `cert.CheckSignatureFrom(cert)` +
subject == issuer DN. Rejected with `ErrCANotSelfSigned`
HTTP 400.
- §4.2.1.9 path-length tightening: child's `PathLenConstraint` must
be strictly less than parent's. Default to `parent - 1` when unset.
Rejected with `ErrPathLenExceeded` → HTTP 400.
- §4.2.1.10 NameConstraints subset: child's `Permitted` set must be a
subset of parent's; child's `Excluded` set must be a superset of
parent's. Rejected with `ErrNameConstraintExceeded` → HTTP 400.
- §4.1.2.5 validity capping: child's `notAfter` capped to parent's
`notAfter` automatically (chain breaks at parent's expiry
regardless).
## Migrating a single-mode issuer to tree mode
Pre-flight: the load-bearing pin
`TestLocal_HierarchyMode_SingleVsTree_ByteIdentical` guarantees that
a 1-level tree wired around the same on-disk root cert+key produces
byte-identical issuance bundles to single mode. Migration is therefore
a no-downtime operation if done carefully:
1. Register the existing single-mode CA cert as an `intermediate_cas`
row via `CreateRoot` (with the existing on-disk key referenced as
`key_driver_id`).
2. Update the issuer row's `hierarchy_mode` to `"tree"` and set the
connector's `SetTreeIssuingCAID` to the new row's ID. Restart the
server (no new code path activates until the connector reads the
updated mode at boot).
3. Issue a test cert. The byte-equivalence pin guarantees the wire
bytes match the pre-migration output for a 1-level tree.
4. Build out the child CAs via `CreateChild` calls. Update
`treeIssuingCAID` to the new leaf CA. Test, then ramp.
If the pin breaks during migration, abort: roll back the
`hierarchy_mode` flip and investigate. The byte-equivalence pin is
the canary — if it goes red, deeper bugs lurk.
## API reference
All endpoints under `/api/v1/issuers/{id}/intermediates` and
`/api/v1/intermediates/{id}` are admin-gated. Non-admin Bearer callers
get HTTP 403.
| Method | Path | Purpose |
|--------|------|---------|
| POST | `/api/v1/issuers/{id}/intermediates` | Register root OR sign child (body discriminator) |
| GET | `/api/v1/issuers/{id}/intermediates` | List flat hierarchy for issuer |
| GET | `/api/v1/intermediates/{id}` | Single-row detail |
| POST | `/api/v1/intermediates/{id}/retire` | Two-phase retirement |
See `api/openapi.yaml` for full request/response schemas.
## Observability
`IntermediateCAMetrics` ships counters dimensioned by `(issuer_id,
kind)`:
- `create_root` — successful CreateRoot calls.
- `create_child` — successful CreateChild calls.
- `retire_retiring``active → retiring` transitions.
- `retire_retired``retiring → retired` transitions.
The Prometheus exposer reads the snapshot via
`SnapshotIntermediateCA()` from a single instance constructed in
`cmd/server/main.go` (the snapshotter is the single source of truth
between the service-side recording path and the metrics-side exposing
path).
The audit table receives one row per CreateRoot / CreateChild /
Retire transition, scoped to the actor extracted from the API
request's auth context.
## Known limitations
The following are tracked in `WORKSPACE-ROADMAP.md` as Rank-8 follow-on
work — none are required for the v2.1.0 acquisition gate:
- HSM-backed roots beyond `signer.FileDriver` (PKCS#11 / cloud KMS
drivers).
- Automated rotation: scheduled re-issuance of sub-CAs ahead of
expiry with parallel-validity windows.
- Intra-hierarchy CRL chaining: each non-leaf CA publishes a CRL
covering its direct children's revocations.
- NameConstraints policy templates: declarative templates an operator
can pick from instead of hand-rolling the JSON.
- D3 dendrogram visualization on the GUI page (today's render is a
recursive `<ul>` nested list).
+197
View File
@@ -0,0 +1,197 @@
# MCP Server Guide
certctl ships with an MCP (Model Context Protocol) server that lets AI assistants manage your certificate infrastructure through natural language. Ask Claude to "show me all expiring certificates," "revoke the VPN cert," or "what agents are offline?" and the MCP server translates that into API calls against your certctl instance.
This guide covers setup, configuration, and usage with Claude, Cursor, and other MCP-compatible tools.
## What Is MCP?
MCP is an open protocol that connects AI assistants to external tools and data sources. Instead of copying and pasting API responses into a chat window, MCP lets the AI call your tools directly. The certctl MCP server exposes all 78 API endpoints as MCP tools — the AI sees typed schemas describing what each tool does, what parameters it accepts, and what it returns.
The MCP server is a separate binary (`cmd/mcp-server/`) that communicates via stdio transport. It's a stateless HTTP proxy: every MCP tool call becomes an HTTP request to the certctl REST API. No new state, no new database tables, no new attack surface beyond what the API already exposes.
## Prerequisites
You need:
1. A running certctl server (see [Quick Start](quickstart.md))
2. The MCP server binary — either built from source or from a Docker image
3. An MCP-compatible AI client (Claude Desktop, Cursor, VS Code with Copilot, etc.)
## Building the MCP Server
```bash
cd certctl
go build -o certctl-mcp ./cmd/mcp-server/
```
The binary has zero runtime dependencies beyond the certctl server it connects to.
## Configuration
The MCP server reads three environment variables:
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `CERTCTL_SERVER_URL` | No | `https://localhost:8443` | URL of the certctl REST API (HTTPS-only as of v2.2) |
| `CERTCTL_API_KEY` | No | (empty) | API key for authentication (passed as `Bearer` token) |
| `CERTCTL_SERVER_CA_BUNDLE_PATH` | Yes (for self-signed / internal CA) | (empty) | Path to PEM CA bundle that signed the server cert. Required when the server cert isn't rooted in the system trust store (the default compose stack ships a self-signed cert at `deploy/test/certs/ca.crt`). |
If your certctl server has auth enabled (the default), you must provide the API key. The MCP server passes it through to every HTTP request.
Since v2.2 the certctl control plane is HTTPS-only. If the server cert is self-signed or chained to an internal CA, set `CERTCTL_SERVER_CA_BUNDLE_PATH` so the MCP server can verify the TLS handshake. Never set `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true` outside local development — it disables all certificate validation.
## Setting Up with Claude Desktop
Add this to your Claude Desktop MCP configuration file (`~/Library/Application Support/Claude/claude_desktop_config.json` on macOS, `%APPDATA%\Claude\claude_desktop_config.json` on Windows):
```json
{
"mcpServers": {
"certctl": {
"command": "/path/to/certctl-mcp",
"env": {
"CERTCTL_SERVER_URL": "https://localhost:8443",
"CERTCTL_SERVER_CA_BUNDLE_PATH": "/path/to/certctl/deploy/test/certs/ca.crt",
"CERTCTL_API_KEY": "your-api-key-here"
}
}
}
}
```
Restart Claude Desktop. You should see "certctl" appear in the MCP tools list with 78 available tools.
## Setting Up with Cursor
In Cursor, go to Settings → MCP Servers and add:
```json
{
"certctl": {
"command": "/path/to/certctl-mcp",
"env": {
"CERTCTL_SERVER_URL": "https://localhost:8443",
"CERTCTL_SERVER_CA_BUNDLE_PATH": "/path/to/certctl/deploy/test/certs/ca.crt",
"CERTCTL_API_KEY": "your-api-key-here"
}
}
}
```
## Setting Up with Claude Code
Add certctl as an MCP server in your project's `.mcp.json`:
```json
{
"mcpServers": {
"certctl": {
"command": "/path/to/certctl-mcp",
"env": {
"CERTCTL_SERVER_URL": "https://localhost:8443",
"CERTCTL_SERVER_CA_BUNDLE_PATH": "/path/to/certctl/deploy/test/certs/ca.crt",
"CERTCTL_API_KEY": "your-api-key-here"
}
}
}
}
```
## Available Tools
The MCP server exposes the full REST API organized across 16 resource domains:
| Domain | Tools | Examples |
|--------|-------|---------|
| Certificates | 9 | List, get, create, update, archive, versions, renew, deploy, revoke |
| CRL & OCSP | 3 | Get JSON CRL, get DER CRL by issuer, check OCSP status |
| Issuers | 6 | List, get, create, update, delete, test connection |
| Targets | 5 | List, get, create, update, delete |
| Agents | 8 | List, get, register, heartbeat, CSR submit, certificate pickup, get work, report job status |
| Jobs | 5 | List, get, cancel, approve, reject |
| Policies | 6 | List, get, create, update, delete, list violations |
| Profiles | 5 | List, get, create, update, delete |
| Teams | 5 | List, get, create, update, delete |
| Owners | 5 | List, get, create, update, delete |
| Agent Groups | 6 | List, get, create, update, delete, list members |
| Audit | 2 | List events (with filters), get event by ID |
| Notifications | 3 | List, get, mark as read |
| Stats | 5 | Summary, certs by status, expiration timeline, job trends, issuance rate |
| Metrics | 1 | System metrics (gauges, counters, uptime) |
| Health | 4 | Health check, readiness probe, auth info, auth check |
Every tool has typed input parameters with `jsonschema` descriptions, so the AI knows exactly what arguments to provide and what each field means.
## Example Conversations
Once configured, you can interact with certctl through natural language:
**"Show me all certificates expiring in the next 14 days"**
The AI calls `certctl_list_certificates` with `status=Expiring` and interprets the results.
**"Renew the API production certificate"**
The AI calls `certctl_trigger_renewal` with `id=mc-api-prod`.
**"Who owns the payments gateway cert?"**
The AI calls `certctl_get_certificate` with `id=mc-payments-prod` and reads the `owner_id` and `team_id` fields.
**"Are any agents offline?"**
The AI calls `certctl_list_agents` and checks the heartbeat timestamps.
**"Revoke the old VPN cert — the key was compromised"**
The AI calls `certctl_revoke_certificate` with `id=mc-vpn-old` and `reason=keyCompromise`.
**"Give me a summary of the certificate fleet"**
The AI calls `certctl_dashboard_summary` for aggregate stats, then optionally `certctl_certificates_by_status` for the breakdown.
**"Create a new cert for staging.api.example.com owned by the platform team"**
The AI calls `certctl_create_certificate` with the common name, team ID, and owner ID.
## Architecture
```mermaid
flowchart LR
AI["AI Assistant\n(Claude, Cursor)"]
MCP["certctl MCP\ncmd/mcp-server/"]
SERVER["certctl Server\n:8443"]
AI <-->|"stdio"| MCP
MCP -->|"HTTP + Bearer token"| SERVER
MCP ~~~ TOOLS["REST API via MCP · 16 domains\nTyped input structs"]
```
The MCP server is intentionally thin:
- **No state** — every request is a pass-through HTTP call. Restart it anytime.
- **No new auth** — uses the same API key as the REST API.
- **No new dependencies** — just the official MCP Go SDK (`modelcontextprotocol/go-sdk`).
- **No new attack surface** — the AI can only do what the API key allows.
## Security Considerations
The MCP server inherits the security properties of the REST API:
- **API key scoping**: The MCP server uses whatever API key you configure. If certctl gets API key scoping in a future release (per-resource or per-action permissions), the MCP server will automatically respect those restrictions.
- **Audit trail**: Every tool call results in an HTTP request that's logged in the API audit middleware — actor, method, path, status, and latency are all recorded.
- **Read-only usage**: For read-only AI access, you could configure a restricted API key (when key scoping ships). Until then, be aware that the AI can call write endpoints (create, update, delete, revoke) if the API key permits it.
- **No private key exposure**: The MCP server never sees or transmits private keys — the same architectural guarantee as the REST API.
## Troubleshooting
**"MCP server not connecting"**
Check that `CERTCTL_SERVER_URL` is reachable from where the MCP binary runs. Try `curl $CERTCTL_SERVER_URL/health` to verify.
**"401 Unauthorized on every tool call"**
Your `CERTCTL_API_KEY` is missing or wrong. Check the key matches what the certctl server expects.
**"Tool calls return empty results"**
The certctl server might have no data. Run the demo seed (`docker compose up`) to populate demo data, or check that your database has records.
## What's Next
- [Quick Start](quickstart.md) — Get certctl running locally
- [OpenAPI Spec](openapi.md) — Full API reference and SDK generation
- [Architecture](architecture.md) — System design deep dive
- [Concepts](concepts.md) — Certificate lifecycle fundamentals
@@ -0,0 +1,278 @@
# ACME Server — Threat Model
Security posture for the certctl ACME server endpoint
(`/acme/profile/<id>/*`). Read this before opening a PR that changes
the JWS verifier, the challenge validators, the rate limiter, or the
GC sweeper.
The threat model lives in this dedicated doc (rather than `docs/acme-server.md`)
because security-review reviewers want a single concentrated reference.
Production deployments under audit should treat this doc as the
canonical answer to "how does certctl resist X?"
## Threat surface map
The ACME server has four ingress surfaces:
1. **JWS-authenticated POST endpoints** — new-account, new-order,
finalize, key-change, revoke-cert, account update, order POST-as-GET.
Authenticated by an ECDSA / RSA / EdDSA signature over the request.
2. **Unauthenticated GET endpoints** — directory, new-nonce, ARI
(renewal-info). Read-only; no authn.
3. **Outbound challenge validators** — HTTP-01, DNS-01, TLS-ALPN-01.
The certctl-server initiates outbound calls to operator-provided
identifiers (the SAN list of the requested cert).
4. **Scheduler-driven GC sweeper** — internal-only; no inbound surface.
Threat actors:
- **External Internet attacker** — no certctl credentials; can hit
unauthenticated endpoints + observe TLS metadata.
- **Authenticated ACME account holder (low-trust)** — has a valid
account on a profile but should be bounded by profile policy +
rate limits.
- **On-path attacker** between certctl-server and a challenge target
(HTTP-01 / DNS-01 / TLS-ALPN-01).
- **Compromised cert holder** — has the private key of a previously-
issued cert and wants to revoke/exfiltrate.
- **Malicious operator with profile-write access** — can change a
profile's `acme_auth_mode` or policy, but is the trusted boundary
per certctl's threat model. Out of scope here; covered by certctl's
RBAC + audit log.
## JWS forgery resistance
The verifier (`internal/api/acme/jws.go`) accepts only the closed
allow-list `{RS256, ES256, EdDSA}`. The allow-list is passed to
`jose.ParseSigned` so go-jose rejects every other algorithm at parse
time, before any signature work.
Specific attacks blocked:
- **Algorithm confusion (`alg: none`)** — RFC 7515 §6.1's classic
unauthenticated-fallback. Not in allow-list; rejected at parse.
- **HS256 substitution (alg-confusion via symmetric)** — symmetric
algs aren't in the allow-list; rejected at parse.
- **Replayed nonce** — every JWS carries a nonce consumed via
`acme_nonces.UPDATE … WHERE used = FALSE` (a single statement;
Postgres row-locking serializes the writes). A second consume of
the same nonce sees `RowsAffected=0` and the verifier returns
`badNonce`.
- **URL spoofing** — the protected-header `url` field MUST match the
request URL exactly (RFC 8555 §6.4); a JWS signed for one URL
cannot be replayed against another.
- **Multi-signature JWS** — RFC 8555 §6.2 forbids; the verifier
rejects `len(jws.Signatures) != 1` explicitly.
- **kid-vs-jwk confusion** — exactly one MUST be present per RFC 8555
§6.2; both-present and neither-present are rejected.
- **kid round-trip mismatch** — the verifier's `AccountKID` closure
computes the canonical kid URL for the resolved account-id and
compares to the inbound `kid`; cross-profile replay is rejected
because the canonical URL differs.
The doubly-signed key-rollover JWS (RFC 8555 §7.3.5, Phase 4) gets
its own dedicated verifier in `internal/api/acme/keychange.go`.
Inner-only invariants enforced: MUST use `jwk` not `kid`, payload
`account` MUST equal outer `kid`, payload `oldKey` MUST canonicalize-
equal the registered key (RFC 7638 thumbprint, constant-time
compare), inner `url` MUST equal outer `url`.
## Nonce store integrity
Nonces are persisted in PostgreSQL (`acme_nonces` table; migration
000025) with a TTL set by `CERTCTL_ACME_SERVER_NONCE_TTL` (default
5 min). The Phase 5 GC sweeper deletes used / expired rows every 1
minute by default.
Why DB-backed and not in-memory:
- **Survives restart** — a multi-replica certctl-server fleet behind
a load balancer can issue a nonce on replica A and consume it on
replica B. In-memory state would force sticky sessions globally,
which the operator can't guarantee in all topologies.
- **Atomic consume** — a single `UPDATE ... WHERE used = FALSE`
statement is the consume primitive; Postgres row-locking guarantees
exactly one of two concurrent consumes wins.
- **Expiry-bounded** — even if the GC sweeper were disabled, the
nonce TTL is enforced at consume time
(`AND expires_at > NOW()` in the UPDATE).
A nonce-store-side compromise would let an attacker forge nonces.
Mitigation: the nonce table is in the same Postgres instance certctl
already trusts; a DB compromise is broader than ACME-specific.
## HTTP-01 SSRF resistance
The HTTP-01 validator (Phase 3, `internal/api/acme/validators.go`)
fetches `http://<identifier>/.well-known/acme-challenge/<token>`
where the identifier is operator/client-controlled. Without
mitigation, this is a textbook SSRF surface — internal services on
RFC1918 / link-local / cloud-metadata addresses would be reachable.
Mitigations (defense in depth):
1. **Pre-dial check**`validation.ValidateSafeURL` rejects URLs
whose host parses as a literal reserved IP. Cheap early bail.
2. **Per-dial check**`validation.SafeHTTPDialContext` is installed
on the `http.Transport`. Every dial re-resolves DNS, rejects
reserved IPs, and **pins the resolved IP** (`net.JoinHostPort(ips[0],
port)`) so a racing DNS rebinding cannot substitute a different IP
between resolve and connect.
3. **Per-redirect check** — Go's HTTP client re-dials on 3xx; the
`DialContext` runs again, applying the same SSRF guards.
4. **Body cap** — the validator's `io.LimitReader` caps response
bodies at 16 KiB. A misbehaving target cannot DoS the validator
pool with a multi-GB response.
5. **Bounded redirects** — the validator caps redirects at 10 (Go
default). A redirect-loop target is bounded.
Reserved IP set: loopback (127.0.0.0/8 + ::1), link-local
(169.254.0.0/16 + fe80::/10), all RFC1918 (10/8, 172.16/12, 192.168/16),
cloud-metadata literals (169.254.169.254 explicitly), broadcast,
multicast, IPv4-mapped-IPv6 to a reserved IPv4. See
`internal/validation/ssrf.go::isReservedIPForDial` for the full set.
CodeQL alert #23 flags `client.Do(req)` in the SCEP-probe call site
as `go/request-forgery` despite the dial-time guard; the analyzer
can't trace through a custom `Transport.DialContext`. Operator-
acknowledged false positive (CLAUDE.md task #10) — see the SCEP
probe's same-shaped defense for the audit trail.
## DNS-01 cache poisoning posture
The DNS-01 validator queries
`_acme-challenge.<domain>` against a single resolver configured by
`CERTCTL_ACME_SERVER_DNS01_RESOLVER` (default `8.8.8.8:53`).
Threat: an operator running a private resolver (typical in air-gapped
deployments) inherits that resolver's cache-poisoning posture. A
poisoned resolver could attest a TXT record the legitimate domain
owner never published, allowing an attacker who controls the
resolver to forge ACME challenges.
Mitigation:
- Default `8.8.8.8:53` is Google Public DNS — DNSSEC-validating,
operationally hardened, well-monitored.
- Operators choosing a private resolver own the cache-poisoning
posture. The doc explicitly flags this in
`docs/acme-server.md` § Configuration.
- DNSSEC-validation is **not** enforced by the validator itself —
the validator trusts the resolver's answer. Operators wanting
strict DNSSEC validation should use a DNSSEC-validating resolver
(e.g. `1.1.1.1` or a self-hosted Unbound).
## TLS-ALPN-01 challenge interception
RFC 8737 §3 explicitly says the validator MUST NOT verify the
challenge target's certificate chain — the proof lives in the
embedded `id-pe-acmeIdentifier` extension (OID 1.3.6.1.5.5.7.1.31)
of the cert presented during the TLS handshake, not in the chain
itself.
Implementation: `internal/api/acme/validators.go::TLSALPN01Validator`
sets `tls.Config.InsecureSkipVerify = true` with a dedicated
`//nolint:gosec` annotation citing RFC 8737 §3 and the L-001
documentation row in `docs/tls.md`.
What this means for on-path attackers:
- An on-path attacker between certctl-server and the challenge target
CAN intercept the TLS handshake and present a forged cert. The
proof is the embedded extension byte-equality, which the attacker
cannot generate without the account key — so interception alone
doesn't grant cert issuance.
- An attacker who has the account key already controls the account
per RFC 8555; the TLS-ALPN-01 validator's interception window adds
no incremental capability.
The integrity property TLS-ALPN-01 actually provides: the challenge
target proves possession of the account-key-derived key authorization
on a TLS connection bound to the requested identifier (port 443 of
the SAN). Operators wanting CA/Browser-Forum-style WebPKI strictness
should run a dedicated public-trust CA, not certctl.
## Rate-limit tuning
Phase 5 in-memory token buckets with per-(action, key) isolation.
Defaults:
- `RATE_LIMIT_ORDERS_PER_HOUR=100` per account.
- `RATE_LIMIT_CONCURRENT_ORDERS=5` per account (pending/ready/processing).
- `RATE_LIMIT_KEY_CHANGE_PER_HOUR=5` per account.
- `RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR=60` per challenge-id.
Tuning:
- **Too loose** → enables abuse vectors. A compromised account could
burn DB-row throughput; a runaway client could fill the validator
pool.
- **Too tight** → legitimate flake-out. cert-manager's exponential
backoff after a `rateLimited` problem is conservative; a 1-hour
cooldown is a long time for an operator hitting an unexpected limit.
Defaults are intentionally conservative on the loose-side — 100/hour
is generous for any plausible per-account fleet (a 50k-cert
deployment renewing at the 1/3-validity mark consumes ~12
orders/year/cert ≈ 600k orders/year ≈ 70 orders/hour even spread
evenly across accounts). Tighter limits are appropriate for
deployments with many low-trust accounts.
The buckets are in-memory + per-replica. A 3-replica certctl-server
fleet effectively has 3× the configured per-account throughput
because each replica's bucket fills independently. For deployments
where this matters operationally, the right answer is a shared rate-
limit store (Redis / Postgres-backed); not blocking for current
threat model where same-account requests typically pin to the same
replica via session affinity.
## Audit trail
Every ACME state mutation writes a row to `audit_events`. Actor strings
distinguish the auth path:
- `acme:<account-id>` — kid-path requests (the requesting account
signed the JWS).
- `acme-cert-key:<serial>` — jwk-path revoke (the cert's own private
key signed the JWS).
- `acme-system:gc` — scheduler-driven sweeps (no client request).
Operators querying by actor prefix can reconstruct the full history
of any ACME-issued cert. See
`docs/acme-server.md` § FAQ "What audit-log events fire" for the
event-name catalog.
## Out-of-scope threats
Documented to set scope expectations for security reviewers:
- **DDoS at the TLS layer** — the certctl-server's TLS listener +
upstream load balancer / WAF handle this. The ACME-specific rate
limits don't substitute for upstream DDoS protection.
- **cert-manager-side compromise** — if cert-manager is compromised,
it has both the account key and the private keys of every issued
cert. Out of certctl's trust boundary; operators run cert-manager
with the same care they'd run any other secret-bearing operator.
- **Compromised certctl-server filesystem** — the bootstrap CA key
lives at `deploy/test/certs/ca.key` (or the operator-managed
equivalent). A filesystem compromise is broader than ACME-specific
and is covered by certctl's HSM / signer-driver architecture (see
`docs/architecture.md` "Signer abstraction").
- **Postgres compromise** — the nonce table, account JWKs, and
audit log all live in the same Postgres instance. A DB compromise
is broader than ACME-specific and is the operator's responsibility
to mitigate via standard DB-hardening practices.
- **Supply-chain attacks against go-jose / lib/pq** — handled by
Dependabot + the `make verify` security gate; not ACME-specific.
## See also
- [`docs/acme-server.md`](./acme-server.md) — operator-facing reference.
- [`docs/tls.md`](./tls.md) — TLS posture, including the L-001
table of `InsecureSkipVerify` justifications (TLS-ALPN-01 row).
- [`internal/api/acme/jws.go`](../internal/api/acme/jws.go) — verifier
source.
- [`internal/api/acme/validators.go`](../internal/api/acme/validators.go)
— challenge validator pool.
- [`internal/validation/ssrf.go`](../internal/validation/ssrf.go) —
SSRF-defense primitives.
+646
View File
@@ -0,0 +1,646 @@
# certctl ACME Server (Built-in)
certctl ships an RFC 8555 + RFC 9773 ARI ACME server endpoint at
`/acme/profile/<profile-id>/*`. Any RFC 8555 client (cert-manager 1.15+,
Caddy, Traefik, win-acme, certbot, Posh-ACME) can integrate with certctl
as an ACME issuer with no certctl-side modification — closing the
"deploy a certctl agent on every K8s node" friction that costs deals to
external PKI vendors today.
> **Phase status (2026-05-03):** Phase 6 — full operator-facing
> reference. The functional surface is complete (Phases 1a-5); this
> doc is the canonical procurement-readability reference. New: client-
> walkthrough docs for [cert-manager](./acme-cert-manager-walkthrough.md),
> [Caddy](./acme-caddy-walkthrough.md), and
> [Traefik](./acme-traefik-walkthrough.md); a dedicated
> [threat model](./acme-server-threat-model.md); a section-by-section
> RFC 8555 + RFC 9773 conformance statement; a 5-failure-mode
> troubleshooting playbook; a tested-clients version pinning table.
> Track shipped phases via `git log --grep='acme-server:'`.
## Configuration
All ACME-server config uses the `CERTCTL_ACME_SERVER_*` env-var prefix
(distinct from `CERTCTL_ACME_*` which configures the consumer-side
issuer connector). The struct definition lives in
`internal/config/config.go::ACMEServerConfig`.
| Env var | Default | Phase | Description |
|--------------------------------------------------|------------------------|-------|-------------|
| `CERTCTL_ACME_SERVER_ENABLED` | `false` | 1a | Master enable flag. Phase 1a's handler is constructed unconditionally so the registry shape stays stable; routes are registered in `internal/api/router/router.go::RegisterHandlers` regardless. Operators flip this on after configuring per-profile auth_mode. |
| `CERTCTL_ACME_SERVER_DEFAULT_AUTH_MODE` | `trust_authenticated` | 1a | Default value for `certificate_profiles.acme_auth_mode` on newly-created profiles. Existing profiles retain their stored value. Per-profile column is the source of truth at request time. |
| `CERTCTL_ACME_SERVER_DEFAULT_PROFILE_ID` | `""` | 1a | When set, `/acme/*` shorthand mirrors `/acme/profile/<DefaultProfileID>/*` for single-profile deployments. When empty, requests to the shorthand return RFC 7807 + RFC 8555 §6.7 `userActionRequired`. |
| `CERTCTL_ACME_SERVER_NONCE_TTL` | `5m` | 1a | How long an issued ACME nonce remains valid before the JWS verifier (Phase 1b) returns `urn:ietf:params:acme:error:badNonce` per RFC 8555 §6.5.1. Tune up if cert-manager + certctl clocks frequently skew. |
| `CERTCTL_ACME_SERVER_TOS_URL` | `""` | 1a | Optional `meta.termsOfService` URL in the directory document. |
| `CERTCTL_ACME_SERVER_WEBSITE` | `""` | 1a | Optional `meta.website` URL in the directory document. |
| `CERTCTL_ACME_SERVER_CAA_IDENTITIES` | (empty) | 1a | Comma-separated `meta.caaIdentities` list. |
| `CERTCTL_ACME_SERVER_EAB_REQUIRED` | `false` | 1a | `meta.externalAccountRequired` advertisement. EAB enforcement is a follow-up; Phase 1a only advertises. |
| `CERTCTL_ACME_SERVER_ORDER_TTL` | `24h` | 2 | Reserved field, parsed in Phase 1a so operators can set it ahead of Phase 2's order endpoints. |
| `CERTCTL_ACME_SERVER_AUTHZ_TTL` | `24h` | 2 | Reserved. |
| `CERTCTL_ACME_SERVER_HTTP01_CONCURRENCY` | `10` | 3 | Reserved. |
| `CERTCTL_ACME_SERVER_DNS01_RESOLVER` | `8.8.8.8:53` | 3 | Reserved. |
| `CERTCTL_ACME_SERVER_DNS01_CONCURRENCY` | `10` | 3 | Reserved. |
| `CERTCTL_ACME_SERVER_TLSALPN01_CONCURRENCY` | `10` | 3 | Reserved. |
| `CERTCTL_ACME_SERVER_ARI_ENABLED` | `true` | 4 | Toggles the RFC 9773 ARI surface — both the `renewalInfo` URL in the directory document and the GET `/renewal-info/<cert-id>` handler. Set to `false` to drop ARI from the directory; ACME clients fall back to static renewal scheduling. |
| `CERTCTL_ACME_SERVER_ARI_POLL_INTERVAL` | `6h` | 4 | Server-policy `Retry-After` value the ARI handler emits on a 200 response. RFC 9773 §4.2 leaves this server-policy. Tighten to `1h` for short-lived certs; loosen to `24h` for standard 90-day certs. |
| `CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR` | `100` | 5 | Per-account orders/hour cap. `0` disables. Hits return RFC 7807 + RFC 8555 §6.7 `urn:ietf:params:acme:error:rateLimited` with `Retry-After`. In-memory token-bucket; restart wipes the counter (eventual-consistency caps are acceptable). |
| `CERTCTL_ACME_SERVER_RATE_LIMIT_CONCURRENT_ORDERS` | `5` | 5 | Per-account cap on simultaneously-active orders (status in pending/ready/processing). `0` disables. Same RFC 7807 + RFC 8555 §6.7 problem shape as the per-hour cap. |
| `CERTCTL_ACME_SERVER_RATE_LIMIT_KEY_CHANGE_PER_HOUR` | `5` | 5 | Per-account key-rollover cap. `0` disables. Default 5/hour: rollovers should be rare; a flood is an attack signal. |
| `CERTCTL_ACME_SERVER_RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR` | `60` | 5 | Per-challenge-id respond cap. `0` disables. Defends against retry storms from a misbehaving client. Keyed by challenge-id (not account-id) so a flood against one challenge doesn't drain the account's whole budget. |
| `CERTCTL_ACME_SERVER_GC_INTERVAL` | `1m` | 5 | Tick interval for the ACME GC scheduler loop. On each tick: (1) DELETE used / expired nonces; (2) UPDATE pending authzs whose `expires_at < NOW()` to `expired`; (3) UPDATE pending/ready/processing orders whose `expires_at < NOW()` to `invalid`. Each sweep is a single SQL statement; the loop is idempotent + bounded by a 1m per-sweep timeout. `0` disables the loop. |
## Per-profile auth mode
Two modes per `certificate_profiles.acme_auth_mode`:
- **`trust_authenticated`** (default for internal PKI). The JWS-
authenticated ACME account is trusted to issue certs for any
identifier the profile policy allows; there is no per-identifier
ownership proof. The most common certctl use case.
- **`challenge`**. Full HTTP-01 + DNS-01 + TLS-ALPN-01 validation per
RFC 8555 §8. Required when certctl is exposing public-trust-style PKI.
A single certctl-server can serve both modes simultaneously — the mode
is read from the bound profile's column at request time, not cached at
server start. Operators can flip a profile's mode via SQL and the next
order picks up the new mode without restart.
The `CERTCTL_ACME_SERVER_DEFAULT_AUTH_MODE` env var sets the default
value for newly-created profiles (e.g. via the certctl API). Existing
profile rows retain whatever value they were created with.
## TLS trust bootstrap (read this before configuring cert-manager)
When certctl-server uses a self-signed TLS bootstrap cert
(`deploy/test/certs/server.crt` is the demo default; see
[`docs/tls.md`](./tls.md)), cert-manager 1.15+ will refuse to talk to
the directory URL unless the certctl root is trusted. The fix lives in
`ClusterIssuer.spec.acme.caBundle`:
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: certctl-test
spec:
acme:
server: https://certctl.example.com:8443/acme/profile/prof-corp/directory
email: ops@example.com
caBundle: |
LS0tLS1CRUdJTi... # base64-encoded PEM of certctl's self-signed root
privateKeySecretRef:
name: certctl-test-account-key
solvers:
- http01:
ingress:
class: nginx
```
The `caBundle` value is the base64-encoded PEM of the root that signed
your certctl-server's TLS certificate. Extract it from your operator
bootstrap (e.g. `cat deploy/test/certs/ca.crt | base64 -w0`).
This is the single biggest first-time-deploy footgun on the cert-manager
integration path. The full cert-manager walkthrough lands in Phase 6;
the `caBundle` requirement is flagged here in Phase 1a's docs because
operators hit it the moment they try to point a real ACME client at
certctl.
## Auth-mode decision tree
Use `trust_authenticated` when:
- The certctl deployment serves **internal-only PKI** (intranet certs,
service-mesh certs, IoT bootstrap). Identifiers in your CSRs are
controlled by your infrastructure, not by the public Internet.
- You don't have HTTP/DNS reachability **from certctl-server back to
the ACME client's solver** (e.g., the client lives in an isolated
network segment certctl-server can't reach).
- You want the simplest cert-manager integration: cert-manager submits
a CSR, certctl issues; no out-of-band ownership proof.
- You're issuing under your own root CA whose trust is operator-managed
(NOT WebPKI). Public CAs cannot use this mode — RFC 8555 §8 ownership
proof is non-negotiable for public-trust roots.
Use `challenge` when:
- The deployment is **public-trust-style PKI** — even if your root is
privately operated, you want CA/Browser Forum-style ownership-proof
semantics so a stolen account key can't be used to issue for arbitrary
identifiers.
- You have HTTP-01 / DNS-01 / TLS-ALPN-01 reachability from the
certctl-server to the ACME client's solver. (HTTP-01 needs port 80
ingress to the client; DNS-01 needs DNS recursion; TLS-ALPN-01 needs
port 443 ingress.)
- You want defense-in-depth: an account-key compromise costs the
attacker nothing without also compromising the solver-side
infrastructure.
A single certctl-server can run both modes simultaneously — the auth
mode is a per-profile column on `certificate_profiles.acme_auth_mode`,
read at request time. Operators flip a profile's mode via SQL or the
profile API, and the next order picks up the new mode without restart.
## Endpoints
Routes registered in `internal/api/router/router.go::RegisterHandlers`:
| Method | Path | RFC ref | Auth | Description |
|--------|-------------------------------------------------------|-----------------|----------|-------------|
| GET | `/acme/profile/{id}/directory` | RFC 8555 §7.1.1 | unauth | Per-profile directory document. |
| HEAD | `/acme/profile/{id}/new-nonce` | RFC 8555 §7.2 | unauth | Returns 200 + Replay-Nonce header. |
| GET | `/acme/profile/{id}/new-nonce` | RFC 8555 §7.2 | unauth | Returns 204 + Replay-Nonce header. |
| POST | `/acme/profile/{id}/new-account` | RFC 8555 §7.3 | JWS jwk | Register a new account; idempotent re-registration of an existing JWK returns the existing row. |
| POST | `/acme/profile/{id}/account/{acc_id}` | RFC 8555 §7.3.2 + §7.3.6 | JWS kid | Update contact list, deactivate, or POST-as-GET (RFC 8555 §6.3) to fetch the account. |
| POST | `/acme/profile/{id}/new-order` | RFC 8555 §7.4 | JWS kid | Submit an order; identifier validation runs before order creation. |
| POST | `/acme/profile/{id}/order/{ord_id}` | RFC 8555 §7.4 | JWS kid | POST-as-GET fetch of an order's current state. |
| POST | `/acme/profile/{id}/order/{ord_id}/finalize` | RFC 8555 §7.4 | JWS kid | Submit the CSR + finalize. Issues + persists managed cert row + version. |
| POST | `/acme/profile/{id}/authz/{authz_id}` | RFC 8555 §7.5 | JWS kid | POST-as-GET fetch of an authorization. |
| POST | `/acme/profile/{id}/challenge/{chall_id}` | RFC 8555 §7.5.1 | JWS kid | Submit a challenge for validation. Dispatches to a bounded-concurrency worker pool; clients poll authz for the eventual result. |
| POST | `/acme/profile/{id}/cert/{cert_id}` | RFC 8555 §7.4.2 | JWS kid | POST-as-GET cert chain download (PEM). |
| POST | `/acme/profile/{id}/key-change` | RFC 8555 §7.3.5 | JWS kid (outer) + jwk (inner) | Doubly-signed account-key rollover. |
| POST | `/acme/profile/{id}/revoke-cert` | RFC 8555 §7.6 | JWS kid OR jwk | Revoke a cert via the issuing account's key OR the cert's own private key. Routes through the certctl revocation pipeline. |
| GET | `/acme/profile/{id}/renewal-info/{cert_id}` | RFC 9773 | unauth | Fetch the suggested renewal window for a cert (cert-id is `base64url(AKI).base64url(serial)` per RFC 9773 §4.1). Response carries `Retry-After`. |
| GET | `/acme/directory` | RFC 8555 §7.1.1 | unauth | Shorthand path; mirrors per-profile when `CERTCTL_ACME_SERVER_DEFAULT_PROFILE_ID` is set. |
| HEAD | `/acme/new-nonce` | RFC 8555 §7.2 | unauth | Shorthand. |
| GET | `/acme/new-nonce` | RFC 8555 §7.2 | unauth | Shorthand. |
| POST | `/acme/new-account` | RFC 8555 §7.3 | JWS jwk | Shorthand. |
| POST | `/acme/account/{acc_id}` | RFC 8555 §7.3.2 + §7.3.6 | JWS kid | Shorthand. |
| POST | `/acme/new-order` | RFC 8555 §7.4 | JWS kid | Shorthand. |
| POST | `/acme/order/{ord_id}` | RFC 8555 §7.4 | JWS kid | Shorthand. |
| POST | `/acme/order/{ord_id}/finalize` | RFC 8555 §7.4 | JWS kid | Shorthand. |
| POST | `/acme/authz/{authz_id}` | RFC 8555 §7.5 | JWS kid | Shorthand. |
| POST | `/acme/cert/{cert_id}` | RFC 8555 §7.4.2 | JWS kid | Shorthand. |
| POST | `/acme/key-change` | RFC 8555 §7.3.5 | JWS kid (outer) + jwk (inner) | Shorthand. |
| POST | `/acme/revoke-cert` | RFC 8555 §7.6 | JWS kid OR jwk | Shorthand. |
| GET | `/acme/renewal-info/{cert_id}` | RFC 9773 | unauth | Shorthand. |
After Phase 4, the full RFC 8555 + RFC 9773 surface is live. RFC 8739
(short-lived certs) and EAB enforcement remain follow-up work; cert-
manager + boulder-tested clients work today against the surface above.
## RFC 8555 + RFC 9773 conformance statement
Honest disclosure of what's implemented, where, and what's not. Procurement
engineers running gap analyses against cert-manager + Let's Encrypt's
conformance posture should read this section before anything else.
### Implemented
| Section | Surface | Phase | First commit |
|---------|---------|-------|--------------|
| RFC 8555 §6.2 | JWS auth + RS256/ES256/EdDSA allow-list | 1b | `27bd660` |
| RFC 8555 §6.3 | POST-as-GET | 1b | `27bd660` |
| RFC 8555 §6.4 | URL-header binding to request URL | 1b | `27bd660` |
| RFC 8555 §6.5 | Replay-Nonce + DB-backed nonce store | 1a | `e146b00` |
| RFC 8555 §6.7 | RFC 7807 problem documents | 1a | `e146b00` |
| RFC 8555 §7.1 | Directory | 1a | `e146b00` |
| RFC 8555 §7.2 | new-nonce HEAD + GET | 1a | `e146b00` |
| RFC 8555 §7.3 | new-account + idempotent re-registration | 1b | `27bd660` |
| RFC 8555 §7.3.2 + §7.3.6 | account update + deactivation | 1b | `27bd660` |
| RFC 8555 §7.3.5 | doubly-signed key rollover | 4 | `0299e4a` |
| RFC 8555 §7.4 | new-order + finalize + cert download | 2 | `4ee486e` |
| RFC 8555 §7.5 | authz POST-as-GET | 2 | `4ee486e` |
| RFC 8555 §7.5.1 | challenge response | 3 | `7e22204` |
| RFC 8555 §7.6 | revoke-cert (kid + jwk paths) | 4 | `0299e4a` |
| RFC 8555 §8.3 | HTTP-01 challenge validator | 3 | `7e22204` |
| RFC 8555 §8.4 | DNS-01 challenge validator | 3 | `7e22204` |
| RFC 8737 | TLS-ALPN-01 challenge validator | 3 | `7e22204` |
| RFC 9773 | ACME Renewal Information (ARI) | 4 | `0299e4a` |
### Not implemented (procurement-honest)
| Spec area | Status | Notes |
|-----------|--------|-------|
| RFC 8555 §7.3.4 — External Account Binding (EAB) | **Not implemented.** | Advertised in directory `meta.externalAccountRequired` but enforcement is a follow-up. Operators relying on EAB for account-creation gating should layer an upstream WAF. |
| RFC 8555 §8.4 + §7.4 — Wildcard with `*.` prefix > 1 level | **Not implemented.** | Single-level wildcards (e.g. `*.example.com`) work end-to-end. Multi-level wildcards (`*.*.example.com`) are RFC-spec-ambiguous and rejected at the identifier-validation layer. |
| RFC 8738 — Short-lived certs | **Not implemented.** | Operators wanting <7-day validity tune the bound issuer's TTL directly via `CertificateProfile.MaxTTLSeconds`; the ACME wire shape doesn't expose a separate notion. |
| Cross-CA proxying | **Not implemented.** | Each profile binds to one issuer. Multi-CA federation (one ACME account → multi-CA selection per identifier) is roadmap. |
| RFC 8555 §6.7 — `accountDoesNotExist` problem with hint URL | Partial. | Sentinel returns `accountDoesNotExist`; the optional hint URL embedding the `kid` is not emitted. cert-manager doesn't consume it. |
If a procurement-side gap analysis turns up something not in either
table above, the answer is "we don't know yet" — operator-side issues
welcome.
## Finalize routing through `CertificateService.Create` (Phase 2 architecture)
The finalize path mirrors how every other certctl issuance surface
(EST, SCEP, agent, REST API) routes through the canonical pipeline:
1. JWS-verify the request (`internal/api/acme/jws.go`).
2. Validate the CSR's DNS-name set equals the order's identifier set
exactly (case-folded). Mismatches return RFC 8555
`urn:ietf:params:acme:error:badCSR`.
3. Update the order row to `status=processing` (`s.tx.WithinTx` +
`auditService.RecordEventWithTx` — atomic with audit row).
4. Issue the cert via the bound profile's `IssuerConnector` adapter
(same `IssueCertificate(ctx, commonName, sans, csrPEM, ekus,
maxTTLSeconds, mustStaple)` call EST/SCEP/agent take).
5. Insert the `managed_certificates` row via
`service.CertificateService.Create(ctx, *ManagedCertificate, actor)`.
Source is stamped `domain.CertificateSourceACME` so operators can
bulk-revoke ACME-issued certs by filtering on `Source=ACME`.
6. Insert the `certificate_versions` row +
transition the order to `status=valid` with `certificate_id` set
(one final `WithinTx` covering both writes + the audit row).
This means RenewalPolicy, CertificateProfile, per-issuer-type
Prometheus metrics, audit rows, and revocation-pipeline integration
all apply uniformly to ACME-issued certs via the same code path that
already serves EST/SCEP/agent/REST issuance.
The atomicity boundary: there is a brief window between step 5 (cert
exists) and step 6 (order shows valid) where the order row still says
`processing`. Phase 5's GC scheduler reconciles. The actor string on
audit rows is `acme:<account-id>`.
## JWS verification (Phase 1b)
Every JWS-authenticated POST runs through the verifier at
`internal/api/acme/jws.go::VerifyJWS`. The verifier enforces:
1. The JWS parses as a flattened single-signature object (multi-sig is
rejected per RFC 8555 §6.2).
2. The signature algorithm is in the closed allow-list `{RS256, ES256,
EdDSA}` per RFC 8555 §6.2 — `none`, `HS256`, and every other alg
are refused at parse time.
3. The protected header carries exactly one of `kid` (registered
account) or `jwk` (new-account flow); endpoints declare which they
require.
4. The protected header `url` matches the inbound request URL exactly.
5. The protected header `nonce` is consumed against the
`acme_nonces` store; missing / replayed / expired nonces return
`urn:ietf:params:acme:error:badNonce` per RFC 8555 §6.5.1.
6. On the `kid` path: the kid URL round-trips against the canonical
per-profile shape, the referenced account exists, and its status
is `valid`. Deactivated / revoked accounts cannot authenticate.
7. The signature verifies against the resolved key (registered
account's stored JWK on the kid path; embedded jwk on the jwk path).
Every state-mutating account operation (create, contact update,
deactivate) writes its `acme_accounts` row and an `audit_events` row
inside one `repository.Transactor.WithinTx` call — the canonical
certctl atomicity contract (matches `service.CertificateService.Create`
at `internal/service/certificate.go:131`).
## Phases (cross-reference)
| Phase | Status | Surface |
|-------|-------------|---------|
| 1a | live | directory + new-nonce + per-profile routing |
| 1b | live | new-account + account/{id} + JWS verifier (RFC 7515 + go-jose v4) |
| 2 | live | orders + authzs + finalize + cert download (trust_authenticated mode end-to-end) |
| 3 | live | HTTP-01 + DNS-01 + TLS-ALPN-01 challenge validation (challenge mode end-to-end) |
| 4 | live | key rollover (RFC 8555 §7.3.5) + revoke-cert (§7.6) + ARI (RFC 9773) |
| 5 | live | rate limits + GC sweeper + kind-driven cert-manager integration test + lego conformance harness + k6 ACME-flow scenario |
| 6 | live | full operator-facing reference + walkthroughs (cert-manager / Caddy / Traefik) + threat model + RFC-8555 conformance statement + troubleshooting + version pinning |
Track shipped phases via `git log --grep='acme-server:' --oneline`.
## Operational notes (Phase 1a)
- **Schema:** `migrations/000025_acme_server.up.sql` adds 5 ACME tables
+ the `certificate_profiles.acme_auth_mode` column. Phase 1a actively
uses only `acme_nonces`. The full schema ships now so the migration
is stable and Phases 1b-4 don't need additional `CREATE TABLE`
migrations.
- **Replay protection:** nonces are persisted in `acme_nonces` (NOT
in-memory). They survive server restart, which is required for the
RFC 8555 §6.5 replay defense to hold against a multi-replica
certctl-server fleet behind a load balancer.
- **Metrics:** the service layer exposes per-op atomic counters via
`service.ACMEService.Metrics().Snapshot()`:
- `certctl_acme_directory_total`
- `certctl_acme_directory_failures_total`
- `certctl_acme_new_nonce_total`
- `certctl_acme_new_nonce_failures_total`
Phase 1b will extend with `new_account` counters; Phase 2 with order
/ finalize / cert; Phase 3 with per-challenge-type counters.
- **Audit:** Phase 1a is read-mostly (directory + nonce). Phase 1b's
account-creation path will route through the canonical
`s.tx.WithinTx(...)` + `auditService.RecordEventWithTx(...)` pattern
so every account state mutation is paired with an `audit_events`
row.
## Phase 4 — key rollover, revocation, ARI
### How do I rotate my ACME account key?
RFC 8555 §7.3.5 defines a doubly-signed JWS for the rollover. The OUTER
JWS is signed by the OLD account key (kid path); its payload IS the
INNER JWS, which is signed by the NEW account key (jwk path). cert-
manager and lego do this for you transparently — `lego renew --key-rotate`
or the cert-manager `Issuer.spec.acme.privateKeySecretRef` rollover.
Server-side validation:
1. Outer JWS verifies against the registered account's current key.
2. Inner JWS verifies against the embedded NEW jwk (proves possession).
3. Inner payload `account` matches outer `kid`.
4. Inner payload `oldKey` thumbprint-equals the registered key.
5. Inner protected `url` equals outer protected `url`.
6. New JWK thumbprint not already registered against the same profile.
7. `SELECT … FOR UPDATE` on the account row serializes concurrent
rollovers; the loser sees the winner's new thumbprint and is told
to retry (409).
### How do I revoke an ACME-issued cert?
Two auth paths per RFC 8555 §7.6:
- **kid path:** sign with your account key. The server checks the
account "owns" the cert via `acme_orders.certificate_id` lookup.
- **jwk path:** sign with the cert's own private key. The server
extracts the cert's public key, computes the JWK, and asserts it
matches the embedded jwk thumbprint.
Either path routes through `service.RevocationSvc.RevokeCertificateWithActor`
— the same pipeline the GUI revoke button, bulk-revocation, and the
ACME-consumer issuer use. So the cert-row update + revocation row + audit
row are all atomic in one `WithinTx`, the issuer is best-effort
notified, and the OCSP response cache is invalidated.
Reason codes follow RFC 5280 §5.3.1; codes 8 (removeFromCRL) and 10
(aACompromise) are not in certctl's `domain.ValidRevocationReasons`
set so they clamp to `unspecified`.
### What is ARI?
RFC 9773 ACME Renewal Information. Clients GET
`/acme/profile/<id>/renewal-info/<cert-id>` (unauthenticated) and
receive a JSON document with `suggestedWindow.start` and `.end` —
the server's recommendation for when to renew. The response also
carries `Retry-After` (RFC 9773 §4.2) hinting at the next-poll cadence.
Cert-id format is `base64url(authorityKeyIdentifier).base64url(serial)`
per RFC 9773 §4.1.
Window math:
- Cert with a bound renewal policy: window starts at
`notAfter - RenewalWindowDays`, ends at `notAfter - RenewalWindowDays/2`.
So a 30-day window cert with notAfter 2026-06-30 emits start=2026-05-31,
end=2026-06-15. Boulder-shape default that lets cert-manager schedule
inside our renewal window.
- No policy: window is the last 33% of validity.
- Past expiry: window is "now" → "now + 24h" (renew immediately).
Disable ARI globally with `CERTCTL_ACME_SERVER_ARI_ENABLED=false`. The
URL drops out of the directory; the route is still registered but
returns 404 — clients fall back to static renewal scheduling.
## Phase 5 — operational guidance
### Rate limiting
Production deployments serving multiple ACME profiles or fleets should
keep the default rate limits in place. The four caps:
- `RATE_LIMIT_ORDERS_PER_HOUR` (100) — per-account new-order cap. A
cert-manager Certificate that auto-renews at the 1/3 mark of its
validity (90-day cert → ~30-day renewal) consumes ~12 orders/year
per managed Certificate. 100/hour is generous for any plausible
fleet.
- `RATE_LIMIT_CONCURRENT_ORDERS` (5) — per-account cap on
pending/ready/processing orders. Stops a runaway client from
starving DB-row throughput. Tune up only if you observe legitimate
bursts.
- `RATE_LIMIT_KEY_CHANGE_PER_HOUR` (5) — rollovers are rare; a flood
is an attack signal. Tune down to 1/hour if your operator
procedure mandates manual rollovers only.
- `RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR` (60) — per-challenge cap,
defends against retry storms.
Hits return RFC 8555 §6.7 `rateLimited` Problem with a `Retry-After`
header. cert-manager 1.15+ honors the header; lego too. Older clients
may not — that's the client's problem, not certctl's.
The buckets are **in-memory + per-replica**. A 3-replica certctl-
server fleet behind a load balancer effectively has 3× the configured
throughput (each replica's bucket fills independently). For
deployments where this matters operationally, the right answer is a
shared rate-limit store — that's a follow-up; not blocking for the
current threat model where same-account requests typically pin to
the same replica via session affinity.
### GC sweeper
The scheduler runs the GC sweep every `GC_INTERVAL` (default 1m). Each
sweep is three independent SQL statements:
1. `DELETE FROM acme_nonces WHERE used = TRUE OR expires_at < NOW()`.
2. `UPDATE acme_authorizations SET status='expired' WHERE status='pending' AND expires_at < NOW()`.
3. `UPDATE acme_orders SET status='invalid', error=... WHERE status IN ('pending','ready','processing') AND expires_at < NOW()`.
Each statement is bounded by a 1-minute per-sweep timeout. A failing
sweep is logged + retried on the next tick; a tick that overruns its
budget is skipped (the existing-tick atomic-Bool guard prevents
overlap). Counts are exposed via `certctl_acme_gc_*` Prometheus
metrics.
### cert-manager integration test
`make acme-cert-manager-test` brings up a kind cluster, installs
cert-manager 1.15.0, helm-deploys certctl-server with
`acmeServer.enabled=true`, and verifies a Certificate resource issues
end-to-end. Skipped in CI by default (kind is too heavy for per-PR);
operators run locally on workstation. See
`deploy/test/acme-integration/` for the YAML + Go test harness.
### lego RFC conformance harness
`make acme-rfc-conformance-test` drives lego v4 against a hermetic
certctl-server stack, exercising register → new-order → finalize.
Operators run this when shipping behavior changes to the ACME surface
to confirm a real third-party client still works.
### k6 ACME flows scenario
`deploy/test/loadtest/k6/acme_flow.js` exercises the unauthenticated
surface (directory + new-nonce + ARI) at 100 VUs × 5m. JWS-signed
flows are out of scope for k6 (no JWS support); they're covered by
the lego conformance harness above. Baseline numbers + thresholds in
`deploy/test/loadtest/README.md`.
## Troubleshooting
The five failure modes operators hit most often + the canonical fix
for each.
### `cert-manager logs: 400 Bad Request: badNonce`
**Cause:** Either a nonce was replayed (a buggy client retries the
same JWS), the cert-manager + certctl-server clocks differ by more
than `CERTCTL_ACME_SERVER_NONCE_TTL` (default 5 min), or the
nonce-store row was reaped between issuance and use.
**Fix:** First check NTP on both sides. If clocks are healthy,
lengthen `CERTCTL_ACME_SERVER_NONCE_TTL` to 10m or 15m. If the
problem persists, check for a multi-replica certctl-server fleet
without sticky session affinity — the nonce DB row lives on one
replica; if the JWS POST hits a different replica before replication
catches up, you observe spurious `badNonce`. Solution: pin client
sessions to a single replica via load-balancer cookie / `kid`-hash
routing, OR shorten replication lag if your DB is the bottleneck.
### `cert-manager logs: x509: certificate signed by unknown authority`
**Cause:** cert-manager refuses to talk to the directory URL because
its TLS chain doesn't terminate at a root in cert-manager's trust
store. certctl-server's bootstrap cert (Phase 1a, `deploy/test/certs/server.crt`)
is self-signed.
**Fix:** Add the `caBundle` field to your `ClusterIssuer.spec.acme` —
see the [TLS trust bootstrap](#tls-trust-bootstrap-read-this-before-configuring-cert-manager)
section above for the 3-step recipe. This is **the** single biggest
first-time-deploy footgun on the cert-manager integration path.
### HTTP-01 validator returns `connection refused`
**Cause:** The HTTP-01 solver's Ingress / Service is not reachable
from certctl-server's network. Common subcases: (a) the cert-manager
http-solver pod is on a private network certctl-server can't reach;
(b) a firewall blocks port 80 inbound to the solver's address; (c)
the Ingress class annotation doesn't match an installed ingress
controller; (d) your DNS still points at an old IP.
**Fix:** From the certctl-server pod, `curl -v
http://<identifier>/.well-known/acme-challenge/<token>` and read the
network error. If the curl fails the same way, the network path is
the issue. If curl works but the validator fails, check the validator
log lines — the SSRF guard rejects reserved IPs (RFC1918, link-local,
cloud-metadata 169.254.169.254). Public-trust style profiles that
need to reach RFC1918 solvers must be moved to `trust_authenticated`
mode OR the solver must be exposed on a routable address.
### DNS-01 validator returns `NXDOMAIN`
**Cause:** DNS provider hasn't propagated the `_acme-challenge.<domain>`
TXT record yet. Most providers have a 30s-2m propagation lag. cert-manager
retries by default, but Phase-5 rate limits (default 60/hour per
challenge-id) can truncate the retry budget.
**Fix:** Verify TXT propagation with `dig +short TXT _acme-challenge.<domain>
@<your-resolver>`. If the answer is empty, the issue is upstream. If
it's populated but certctl reports NXDOMAIN, check
`CERTCTL_ACME_SERVER_DNS01_RESOLVER` (default `8.8.8.8:53`) is
reachable from certctl-server's network egress. Operators on isolated
networks need a private resolver; configure accordingly + own the
cache-poisoning posture (see [threat
model](./acme-server-threat-model.md)).
### Certificate Ready=False with `rejectedIdentifier`
**Cause:** The CSR includes an identifier (CommonName or SAN) that the
bound certificate profile's policy rejects. certctl runs syntactic +
profile-policy validation **before** order creation; the order never
reaches the database.
**Fix:** The reject reason is in the `subproblems` array of the RFC
8555 §6.7 problem document. Decode the JSON, look at `subproblems[].detail`,
and adjust either the CSR or the profile policy. Common causes:
SAN-not-in-`AllowedIdentifierWildcards`, EKU-not-in-`AllowedEKUs`,
TTL-exceeds-`MaxTTLSeconds`. Validation logic lives in
`internal/api/acme/identifier.go::ValidateIdentifiers` +
`internal/domain/profile.go` — read those if the profile-policy rule
isn't obvious.
## Version pinning + tested clients
certctl's ACME server is tested against the following client versions.
Other versions probably work; these are the ones the integration suite
exercises end-to-end.
| Client | Tested version | Where it's pinned |
|--------|----------------|-------------------|
| cert-manager | 1.15.0 | `deploy/test/acme-integration/cert-manager-install.sh::CERT_MANAGER_VERSION` |
| lego (RFC 8555 conformance harness) | v4.x latest | `deploy/test/acme-integration/conformance-lego.sh` (operator installs via `go install github.com/go-acme/lego/v4/cmd/lego@latest`) |
| kind (cluster bootstrap) | v0.20+ | `deploy/test/acme-integration/kind-config.yaml` schema requirement |
| Caddy | 2.7.x | Phase 6 walkthrough (`docs/acme-caddy-walkthrough.md`) |
| Traefik | 3.0+ | Phase 6 walkthrough (`docs/acme-traefik-walkthrough.md`) |
Operators reporting issues with untested-version clients should include
the client version + the precise wire-level error (curl-captured request
+ response body) so we can pin a regression test if applicable.
## FAQ
### Why two auth modes? Isn't `challenge` strictly more secure?
`challenge` is strictly more secure for **public-trust** PKI — RFC 8555
§8 ownership proof is the entire point of cert-manager + Let's Encrypt.
For **internal PKI**, the threat model is different: the network itself
is the security boundary (mTLS service mesh, firewalled VPC, identifier-
namespace controlled by the operator). Forcing every internal cert to
go through a solver round-trip adds operational toil with no security
gain. `trust_authenticated` is the certctl-specific mode that
acknowledges this — the ACME account is the proof, not the solver.
### How does this differ from `cert-manager → Let's Encrypt with certctl as a separate step`?
Two integrations vs one. With certctl as the ACME endpoint, cert-manager
does its native flow (Certificate → Order → CSR → Secret) and certctl
mints the cert directly, recording it under its own
`managed_certificates` table with full audit + renewal-policy + bulk-
revocation surface. With Let's Encrypt as the ACME endpoint, you have
to run a separate cert-manager-uploads-to-certctl webhook OR maintain
two parallel cert tracks. The native-ACME-server path is operationally
simpler.
### Can I use ACME endpoints from outside the K8s cluster?
Yes. The endpoints are HTTPS over the certctl-server's listener (port
8443 by default). Caddy on a VM, win-acme on a Windows server, or
Posh-ACME on a Mac all integrate against
`https://<certctl-server>:8443/acme/profile/<profile-id>/directory`.
The TLS-trust-bootstrap requirement applies the same way — see the
[Caddy walkthrough](./acme-caddy-walkthrough.md) for the OS-trust-store
recipe.
### How do I migrate manually-issued certs to ACME-issued ones?
Not yet automatic. Operators migrating: keep the old `managed_certificates`
rows; create new ones via the ACME flow; flip targets one by one. A
dedicated bulk-migration tool is on the roadmap (post-2.1.0). Track
via the master prompt's roadmap section in
`cowork/acme-server-endpoint-prompt.md`.
### What audit-log events fire on each ACME operation?
Every state mutation writes an `audit_events` row. Actor strings:
`acme:<account-id>` for kid-path requests; `acme-cert-key:<serial>`
for jwk-path revoke; `acme-system:gc` for scheduler-driven sweeps.
Event-name catalog:
| Event name | Fired by | Resource type |
|------------|----------|---------------|
| `acme_account_created` | new-account | `acme_account` |
| `acme_account_contact_updated` | account update | `acme_account` |
| `acme_account_deactivated` | account deactivate | `acme_account` |
| `acme_account_key_rolled` | key-change | `acme_account` |
| `acme_order_created` | new-order | `acme_order` |
| `acme_order_finalized` | finalize | `acme_order` |
| `acme_challenge_processing` | challenge-respond (dispatch) | `acme_challenge` |
| `acme_challenge_completed` | validator callback | `acme_challenge` |
| `certificate_revoked` | revoke-cert (routes through `RevocationSvc`) | `certificate` |
Querying by actor prefix (`actor LIKE 'acme:%'`) reconstructs the full
history of any ACME-issued cert.
### Is there a threat model document?
Yes — [`docs/acme-server-threat-model.md`](./acme-server-threat-model.md).
Read before writing a security review.
## See also
- [cert-manager integration walkthrough](./acme-cert-manager-walkthrough.md)
- [Caddy integration walkthrough](./acme-caddy-walkthrough.md)
- [Traefik integration walkthrough](./acme-traefik-walkthrough.md)
- [Threat model](./acme-server-threat-model.md)
- [TLS trust bootstrap reference](./tls.md)
- [Architecture (control-plane)](./architecture.md)
@@ -0,0 +1,118 @@
# Async-CA Polling — Operator Reference
Closes audit fix #5 from the 2026-05-01 issuer-coverage acquisition-readiness audit.
## What this is
Four issuer connectors talk to Certificate Authorities that issue
certificates **asynchronously**`IssueCertificate` returns an order
ID immediately, and the caller (or scheduler) must call
`GetOrderStatus` later to retrieve the issued cert:
- **DigiCert** (CertCentral)
- **Sectigo** (Certificate Manager)
- **Entrust** (Certificate Services / CA Gateway)
- **GlobalSign** (Atlas HVCA)
Pre-fix, each connector's `GetOrderStatus` made one HTTP call per
invocation with no exponential backoff, no retry cap, and no deadline.
Under a renewal sweep, certctl would hammer the upstream CA's
rate-limit budget. A 429 response was treated as a hard error,
which then caused the scheduler to retry on the next tick — re-fanning
out the same call that just got rate-limited.
Post-fix, `GetOrderStatus` blocks for up to `PollMaxWait` (default
10 minutes) doing **bounded internal polling**:
```
attempt 1 → wait 5s → attempt 2 → wait 15s → attempt 3 → wait 45s →
attempt 4 → wait 2m → attempt 5 → wait 5m → ... (capped at 5m)
```
±20% jitter applied at every wait so multiple certctl instances
never synchronize on the upstream CA's rate-limit window. The
`PollMaxWait` deadline is a hard cap; if the upstream still hasn't
completed by then, `GetOrderStatus` returns `StillPending` and the
scheduler can re-enqueue the job for a future tick.
## Status-code triage
Each connector classifies HTTP responses to drive polling decisions:
| Response | Meaning | Decision |
|---|---|---|
| 2xx + status="issued"/"completed" | Cert ready | Done — return the cert |
| 2xx + status="pending"/"processing" | Still working | StillPending — keep polling |
| 2xx + status="rejected"/"denied"/"failed" | Permanent | Done — return `OrderStatus{Status:"failed"}` |
| 2xx + parse failure | Body is broken | Failed — return error |
| 4xx (404/400/401/403) | Permanent client error | Failed — return error |
| 429 (rate limited) | Transient | StillPending — keep polling with backoff |
| 5xx | Transient | StillPending — keep polling with backoff |
| Network / TLS error | Transient | StillPending — keep polling with backoff |
## Operator tuning
Each connector exposes a `PollMaxWaitSeconds` config field and
matching env var:
| Connector | Env var | Default |
|---|---|---|
| DigiCert | `CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS` | 600 (10m) |
| Sectigo | `CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS` | 600 (10m) |
| Entrust | `CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS` | 600 (10m) |
| GlobalSign | `CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS` | 600 (10m) |
Tune up (e.g., `86400` = 24 hours) for **Entrust approval-pending
workflows** where humans manually approve enrollments. Tune down (e.g.,
`60`) for high-throughput environments that prefer to recycle the
scheduler tick rather than block one renewal goroutine for minutes.
A value of 0 (or unset) falls back to the package default in
`internal/connector/issuer/asyncpoll`.
## Failure modes
**Upstream returns 429 forever.** The Poller respects the backoff
(5s → 15s → 45s → 2m → 5m), so a sustained 429 stream burns through
the full `PollMaxWait` budget with at most 7-8 attempts (instead of
~600 attempts at 1/sec). After `PollMaxWait` expires, `GetOrderStatus`
returns `StillPending`; the scheduler re-enqueues for the next tick.
The total request volume against the upstream is bounded by `tick
interval / minimum backoff` — typically 1-2 requests per minute even
under heavy load.
**Sectigo `collectNotReady` sentinel.** When the SCM status endpoint
reports `Issued` but the cert collect endpoint isn't yet ready, the
old code branched into a special "pending" return. Now that branch
returns `StillPending` from the poll closure, so the cert collection
rides the same backoff schedule.
**Entrust approval-pending.** The `AWAITING_APPROVAL` status maps to
`StillPending`. With the default `PollMaxWait=10m`, the scheduler
will re-enqueue once per tick if approval hasn't happened yet; with
`PollMaxWait=24h` the same renewal goroutine waits the full approval
window. Pick the latter when you have many approval-pending
enrollments per tick.
## Where the implementation lives
- `internal/connector/issuer/asyncpoll/asyncpoll.go` — shared `Poller`
with backoff math, jitter, deadline, and ctx-aware cancellation.
- `internal/connector/issuer/digicert/digicert.go`
`pollOrderOnce` + `GetOrderStatus` orchestrator.
- `internal/connector/issuer/sectigo/sectigo.go`
`pollEnrollmentOnce` + status-code permanence triage
(`isPermanentStatusError`).
- `internal/connector/issuer/entrust/entrust.go`
`pollEnrollmentOnce` + approval-pending mapping.
- `internal/connector/issuer/globalsign/globalsign.go`
`pollCertificateOnce` (serial-number tracking).
- `internal/connector/issuer/asyncpoll/asyncpoll_test.go` — 11 unit
tests covering happy path, transient-then-success, Failed
termination, MaxWait timeout, last-error wrap, ctx cancel,
multiplicative backoff, jitter bounds, defaults.
## Audit blocker reference
cowork/issuer-coverage-audit-2026-05-01/RESULTS.md, Top-10 fix #5
(Part 1.5 finding #4: "No polling backoff for async CAs").
+411
View File
@@ -0,0 +1,411 @@
# CRL & OCSP — Revocation Status for Relying Parties
This guide is the operator + relying-party reference for certctl's revocation
status surfaces. It covers the wire format, endpoint URLs, configuration knobs,
the OCSP responder cert lifecycle, and how to point common consumers
(cert-manager, Firefox, OpenSSL) at the endpoints.
If you're looking for the higher-level architecture, see
[`architecture.md` § Security Model](architecture.md#security-model). If you're
looking for the revocation policy / reason codes the API accepts, see
[`api/openapi.yaml` § /certificates/{id}/revoke](../api/openapi.yaml).
---
## Conceptual overview
**Why two formats.** RFC 5280 §5 defines a Certificate Revocation List (CRL)
— a periodically-published, signed list of every revoked certificate for an
issuer. RFC 6960 defines the Online Certificate Status Protocol (OCSP) — a
request/response protocol that returns the status of a single certificate by
serial number. CRLs are batch-friendly and cacheable; OCSP is point-query and
fresh. Production PKI deployments serve both because different relying parties
prefer different trade-offs:
- Browsers (Firefox / Safari) prefer OCSP for freshness; some pin OCSP
stapling.
- cert-manager and most Linux TLS clients fall back to CRL when OCSP is
unreachable.
- Microsoft Intune / corporate device-state validators do periodic CRL pulls.
- OpenSSL `s_client -status` exercises OCSP via the `Certificate Status
Request` extension during the handshake.
certctl's local issuer publishes both, with a pre-generation cache so a busy
CA does not DOS itself rebuilding the CRL on every fetch.
**Why a separate OCSP responder cert.** RFC 6960 §2.6 + §4.2.2.2 strongly
recommend that OCSP responses be signed by a delegated "OCSP responder cert"
issued by the CA, NOT by the CA private key directly. The responder cert
carries the `id-pkix-ocsp-nocheck` extension (RFC 6960 §4.2.2.2.1) so OCSP
clients do not recursively check the responder cert's revocation status. This
keeps the CA private key cold (an HSM operation per OCSP request would be
prohibitive at scale) and lets the responder key live on disk, on a separate
HSM partition, or rotate frequently while the CA key stays untouched.
---
## Endpoints
All revocation endpoints live under `/.well-known/pki/` per RFC 8615 and run
**unauthenticated** — relying parties without certctl API credentials must be
able to validate revocation status. The HTTPS-only TLS 1.3 control plane
applies; there is no plaintext fallback.
### CRL — Certificate Revocation List
```
GET https://<host>/.well-known/pki/crl/{issuer_id}
```
| Field | Value |
| --- | --- |
| Method | `GET` |
| Auth | None (unauthenticated, RFC 5280 §5 distribution semantics) |
| Response Content-Type | `application/pkix-crl` |
| Response body | DER-encoded X.509 CRL signed by the issuer's CA |
| Cache | Pre-generated by the scheduler; configurable interval |
Example:
```bash
curl --cacert ca.crt \
-o crl.der \
https://localhost:8443/.well-known/pki/crl/iss-local
openssl crl -inform DER -in crl.der -text -noout
```
### OCSP — Online Certificate Status Protocol
certctl serves both the GET form (RFC 6960 §A.1.1, simple URL-path lookup)
and the POST form (RFC 6960 §A.1.1, binary OCSPRequest body). Most
production OCSP clients (Firefox, OpenSSL `s_client -status`, cert-manager,
Intune) use POST. The GET form is preserved for ops curl-debugging.
#### GET form
```
GET https://<host>/.well-known/pki/ocsp/{issuer_id}/{serial_hex}
```
| Field | Value |
| --- | --- |
| Method | `GET` |
| Auth | None |
| Response Content-Type | `application/ocsp-response` |
| Response body | DER-encoded OCSPResponse signed by the **OCSP responder cert** (NOT the CA cert) |
Example:
```bash
curl --cacert ca.crt \
-o response.der \
https://localhost:8443/.well-known/pki/ocsp/iss-local/a1b2c3d4
openssl ocsp -respin response.der -text -CAfile ca.crt
```
#### POST form (the standard one)
```
POST https://<host>/.well-known/pki/ocsp/{issuer_id}
Content-Type: application/ocsp-request
Body: <DER-encoded OCSPRequest>
```
| Field | Value |
| --- | --- |
| Method | `POST` |
| Auth | None |
| Request Content-Type | `application/ocsp-request` |
| Response Content-Type | `application/ocsp-response` |
Example with OpenSSL building the request:
```bash
openssl ocsp -issuer ca.crt -cert leaf.crt -reqout request.der
curl --cacert ca.crt \
-X POST \
-H "Content-Type: application/ocsp-request" \
--data-binary @request.der \
-o response.der \
https://localhost:8443/.well-known/pki/ocsp/iss-local
openssl ocsp -respin response.der -text -CAfile ca.crt
```
The body-size limit applies (`http.MaxBytesReader` from middleware,
default 1MB, configurable via `CERTCTL_MAX_BODY_SIZE`); a typical OCSPRequest
is ~200 bytes so this is a generous cap.
### Admin observability endpoint
```
GET https://<host>/api/v1/admin/crl/cache
Authorization: Bearer <token-with-admin-flag>
```
Returns the per-issuer cache state — for ops dashboards, GUI badges, or
"is the scheduler keeping up?" diagnostics. Admin-gated (M-008 admin-gated
handler allowlist; non-admin Bearer callers receive HTTP 403). Response shape:
```json
{
"cache_rows": [
{
"issuer_id": "iss-local",
"cache_present": true,
"crl_number": 42,
"this_update": "2026-04-29T10:00:00Z",
"next_update": "2026-04-29T11:00:00Z",
"generated_at": "2026-04-29T10:00:00Z",
"generation_duration_ms": 87,
"revoked_count": 13,
"is_stale": false,
"recent_events": [
{
"started_at": "2026-04-29T10:00:00Z",
"duration_ms": 87,
"succeeded": true,
"crl_number": 42,
"revoked_count": 13
}
]
}
],
"row_count": 1,
"generated_at": "2026-04-29T10:30:00Z"
}
```
Issuers that have not yet had a CRL generated appear with `cache_present:
false` so the GUI can render a "Not yet generated" pill rather than 404.
---
## Configuration
| Env var | Default | Meaning |
| --- | --- | --- |
| `CERTCTL_CRL_GENERATION_INTERVAL` | `1h` | How often the scheduler walks every CRL-supporting issuer and rebuilds. The HTTP handler reads from the cache, not from a per-request rebuild. |
| `CERTCTL_OCSP_RESPONDER_KEY_DIR` | unset | **Operator MUST set in production.** Directory where the FileDriver persists each issuer's OCSP responder key (`ocsp-responder-<issuer_id>.key`). When unset, the responder service uses a temporary directory that does NOT survive restarts — fine for dev, NEVER for prod. |
| `CERTCTL_OCSP_RESPONDER_ROTATION_GRACE` | `7d` | When the responder cert's `NotAfter` falls within this window, `EnsureResponder` rotates to a fresh cert+key on the next OCSP request or scheduler tick. |
| `CERTCTL_OCSP_RESPONDER_VALIDITY` | `30d` | How long each newly-issued responder cert is valid for. Short by design — relying parties cache OCSP responses, not the responder cert chain, and `id-pkix-ocsp-nocheck` blocks recursive revocation checking on the responder itself. |
The issuer-level CRL `nextUpdate` is derived from the generation timestamp +
the configured CRL validity (currently a build-time constant in the
`CRLCacheService`; configurable knob deferred until an operator asks).
---
## OCSP responder cert lifecycle
1. **First OCSP request for an issuer (or scheduler tick).** The local
issuer's `SignOCSPResponse` calls into `OCSPResponderService.EnsureResponder`.
2. **Cache lookup.** `EnsureResponder` queries the `ocsp_responders` table for
a row keyed by `issuer_id`.
3. **Disk lookup.** If a row exists, the FileDriver reads the persisted key
from `<keydir>/ocsp-responder-<issuer_id>.key`. **Self-healing:** if the
row exists but the file is missing (operator pruned the keydir without
pruning the DB), the service treats this as "rotate now" rather than
crashing.
4. **Rotation check.** If `cert.NotAfter < now + RotationGrace`, the service
generates a fresh ECDSA-P256 key, builds a `*x509.CertificateRequest`,
and asks the local issuer's existing `IssueCertificate` flow to sign it.
The signing template carries:
- `KeyUsage: x509.KeyUsageDigitalSignature` (signing OCSP responses)
- `ExtKeyUsage: x509.ExtKeyUsageOCSPSigning` (RFC 6960 §4.2.2.2)
- The `id-pkix-ocsp-nocheck` extension (OID `1.3.6.1.5.5.7.48.1.5`,
DER value `NULL`, RFC 6960 §4.2.2.2.1) wired through
`Certificate.ExtraExtensions`.
5. **Persistence.** The new cert + key path are written to `ocsp_responders`
via an idempotent `INSERT … ON CONFLICT DO UPDATE`.
6. **Response signing.** `ocsp.CreateResponse(caCert, responderCert,
template, responderSigner)` produces the response bytes; the responder
cert is included in the response chain so relying parties can validate
without a separate fetch.
The race between scheduler-driven cache refresh and on-demand cache miss is
collapsed by the `CRLCacheService`'s in-tree singleflight (a `sync.Map` of
`*flightEntry` keyed by `issuer_id`). Concurrent generation requests for the
same issuer wait on the in-flight result rather than each rebuilding from
scratch.
---
## Pointing common consumers at the endpoints
### cert-manager (Kubernetes)
cert-manager's certificate-validation logic checks both the AIA OCSP URI
embedded in the leaf and the CDP CRL URI. Both are populated automatically
by the local issuer's certificate template — relying parties should NOT
need any additional configuration. To verify:
```bash
openssl x509 -in leaf.crt -text -noout | grep -A1 "Authority Information Access"
openssl x509 -in leaf.crt -text -noout | grep -A2 "CRL Distribution Points"
```
If your cert-manager pods cannot reach `https://<certctl-host>:8443/.well-known/pki/`,
add a NetworkPolicy egress rule or expose the certctl service via the
appropriate ingress class.
### Firefox
Firefox honors the AIA OCSP URI by default. To force-refresh the local
revocation cache after revoking a cert in dev:
```
about:preferences#privacy → Certificates → Query OCSP responder servers
```
If Firefox reports `SEC_ERROR_OCSP_INVALID_SIGNING_CERT`, verify that the
responder cert chain is reachable from the system trust store —
`id-pkix-ocsp-nocheck` is a Firefox-strict extension and is set automatically
on every responder cert certctl issues.
### OpenSSL
```bash
# OCSP via stand-alone request
openssl ocsp -issuer ca.crt -cert leaf.crt -url https://localhost:8443/.well-known/pki/ocsp/iss-local -CAfile ca.crt -text
# OCSP via TLS Certificate Status Request extension
openssl s_client -connect example.com:443 -status -CAfile ca.crt
```
### Intune (corporate device state)
Intune device-compliance validators pull the CRL on a schedule (configured in
the Intune admin console, default 24h). Configure the CRL distribution point
to `https://<certctl-host>:8443/.well-known/pki/crl/<issuer_id>` and Intune
will pull on its own cadence.
---
## Production hardening II additions (post-2026-04-30)
The following capabilities were folded into V2 (free) by the production
hardening II bundle. Each closes a real procurement-team checklist gap
without requiring a paid tier.
### OCSP nonce extension (RFC 6960 §4.4.1)
The POST OCSP handler echoes the request's nonce extension (OID
`1.3.6.1.5.5.7.48.1.2`) in the response. Defends against replay attacks
where a relying party's cached response is replayed against a now-revoked
cert. Always-on; no operator opt-out.
Failure modes:
- **No nonce in request** — back-compat; response omits the extension.
- **Well-formed nonce ≤ 32 bytes** — response echoes it; tracked in
`certctl_ocsp_counter_total{label="nonce_echoed"}`.
- **Empty or oversized nonce (> 32 bytes per CA/B Forum BR §4.10.2)** —
responder returns the canonical "unauthorized" status (RFC 6960 §2.3
status 6); tracked in `certctl_ocsp_counter_total{label="nonce_malformed"}`.
### OCSP pre-signed response cache
Mirrors the existing CRL cache. Per-(issuer, serial) entries pre-signed
and stored in `ocsp_response_cache`; the read-through facade in
`CAOperationsSvc.GetOCSPResponseWithNonce` consults the cache for
nil-nonce requests and falls through to live signing on miss + writes
the result back. Nonce-bearing requests always live-sign because the
cache stores nil-nonce blobs.
**Load-bearing security wire:** `RevocationSvc.RevokeCertificateWithActor`
calls `InvalidateOnRevoke` after a successful revocation so the next
OCSP fetch returns the revoked status. There is no stale-good window
after revoke.
### Per-source-IP OCSP rate limit + per-actor cert-export rate limit
Defaults: 1000 req/min/IP for OCSP; 50 exports/hr/operator for the
cert-export endpoints. Configurable via
`CERTCTL_OCSP_RATE_LIMIT_PER_IP_MIN` and
`CERTCTL_CERT_EXPORT_RATE_LIMIT_PER_ACTOR_HR`; zero disables.
OCSP rate-limit trip: canonical "unauthorized" OCSP blob plus
`Retry-After: 60`. Cert-export trip: HTTP 429 + JSON
`{"error":"rate_limit_exceeded","retry_after_seconds":3600}`.
The OCSP limiter does NOT honor `X-Forwarded-For` because OCSP is
publicly reachable and untrusted intermediaries could spoof the header
to bypass the cap.
### CRL HTTP caching headers (RFC 7232)
`GET /.well-known/pki/crl/{issuer_id}` now returns weak-form ETag,
`Cache-Control: public, max-age=3600, must-revalidate`, and respects
`If-None-Match` for HTTP 304 short-circuits. Lets CDNs and reverse
proxies serve repeated fetches from edge cache.
### CRL DistributionPoint auto-injection
Local issuer config field `CRLDistributionPointURLs []string`; when
non-empty, every issued cert carries the RFC 5280 §4.2.1.13
`id-ce-cRLDistributionPoints` extension pointing at certctl's CRL
endpoint. Refusing to silently inject an empty CDP is deliberate —
silent-empty fails relying-party validation worse than no CDP.
### Cert-export typed audit codes + Prometheus per-area metrics
Audit emission now carries typed action constants
(`cert_export_pem`, `cert_export_pkcs12`, `cert_export_failed`)
alongside legacy bare codes. Detail map enriched with
`has_private_key` (always false in V2) and `cipher`
(`AES-256-CBC-PBE2-SHA256` — pinned).
`GET /api/v1/metrics/prometheus` surfaces the new per-area counters
under the `certctl_<area>_counter_total{label=...}` family. OCSP
shipped in this bundle; alert recommendations:
- `{label="rate_limited"}` rate > 0 sustained > 5m → notify (limiter
is doing its job; investigate source IP).
- `{label="nonce_malformed"}` > 0 → notify (legitimate clients don't
send malformed nonces).
- `{label="signing_failed"}` > 0 → page on-call (issuer connector
failing).
## What this release does NOT include (V3-Pro)
Still out of scope for V2; tracked for V3-Pro:
- **Delta CRLs (RFC 5280 §5.2.4).** Useful for very large CRLs (10k+
revoked certs); the data model accommodates the Base CRL Number
reference but the pipeline only emits Base CRLs in V2.
- **OCSP stapling at SCEP/EST CertRep response time.** Server-side
pre-staple into the TLS handshake context.
- **OCSP request signature verification (RFC 6960 §4.1.1).** Optional
per-spec; certctl currently ignores the signature.
- **OCSP responder HA / multi-region replication.** Active-active
OCSP cache with Postgres logical replication.
- **CRL Issuing Distribution Point (IDP) extension** (RFC 5280
§5.2.5) — for sharded CRL deployments.
---
## Troubleshooting
**`pki/crl/<issuer_id>` returns 404.** The issuer either does not support
CRL signing (Vault, EJBCA, DigiCert serve their own CRL infrastructure;
certctl's connectors return `nil` from `GenerateCRL` for these) or the
issuer ID is wrong. Verify with `GET /api/v1/issuers`.
**`pki/ocsp/<issuer_id>/<serial>` returns 200 but `openssl ocsp -text`
shows "unauthorized".** Check that the serial in the URL is hex-encoded (no
`0x` prefix, no leading zeros stripped, lowercase). Mismatched serials
return an OCSP response with status `unauthorized` per RFC 6960 §2.3.
**Admin cache endpoint returns 403.** The Bearer key does not carry the
admin flag. M-008 gates this endpoint server-side; the GUI also gates the
fetch on `useAuth().admin`. Either escalate the key (`certctl admin
keys promote <key-id>`) or use a different identity.
**Cache shows `is_stale: true` repeatedly.** The scheduler is not running
(or not getting scheduled often enough). Check `CERTCTL_CRL_GENERATION_INTERVAL`
and confirm the scheduler started: `grep crlGenerationLoop` in the server
logs at startup.
+809
View File
@@ -0,0 +1,809 @@
# EST (RFC 7030) — Operator Guide
> **Status (this document):** EST RFC 7030 hardening master bundle Phases
> 111 shipped on `master`; this guide is the Phase-12 deliverable
> against the bundle. Every behavior described here is exercised by the
> tests at `internal/api/handler/est*_test.go`,
> `internal/service/est*_test.go`, and (for the libest interop layer)
> `deploy/test/est_e2e_test.go` under `//go:build integration`. The
> bundle is **V2-free**; per-tenant CA isolation, Conditional-Access
> compliance gating, and EST cert-bound usage analytics are documented
> as V3-Pro deferrals in [V3-Pro deferrals](#v3-pro-deferrals).
## Contents
1. [Concepts](#concepts)
2. [Quick start](#quick-start)
3. [Multi-profile dispatch](#multi-profile-dispatch)
4. [Authentication modes](#authentication-modes)
5. [RFC 9266 channel binding](#rfc-9266-channel-binding)
6. [WiFi / 802.1X recipe (FreeRADIUS)](#wifi--8021x-recipe-freeradius)
7. [IoT bootstrap recipe](#iot-bootstrap-recipe)
8. [`serverkeygen` for resource-constrained devices](#serverkeygen-for-resource-constrained-devices)
9. [HSM-backed CA signing for EST](#hsm-backed-ca-signing-for-est)
10. [Operator GUI (EST Admin tabs)](#operator-gui-est-admin-tabs)
11. [CLI + MCP tools](#cli--mcp-tools)
12. [Renewal: device-driven model](#renewal-device-driven-model)
13. [Troubleshooting matrix](#troubleshooting-matrix)
14. [TLS 1.2 reverse-proxy runbook](#tls-12-reverse-proxy-runbook)
15. [Threat model](#threat-model)
16. [V3-Pro deferrals](#v3-pro-deferrals)
17. [Appendix A: libest reference client](#appendix-a-libest-reference-client)
18. [Appendix B: RFC 7030 wire-format quirks](#appendix-b-rfc-7030-wire-format-quirks)
19. [Related docs](#related-docs)
## Concepts
EST (RFC 7030) is the IETF-standardized successor to SCEP for device
enrollment over HTTPS. certctl ships a native EST server that handles
all six RFC 7030 endpoints — `cacerts`, `simpleenroll`,
`simplereenroll`, `csrattrs`, `serverkeygen`, and (proxy-pass)
`fullcmc` — out of a single binary, with per-profile dispatch so a
single deploy can serve multiple device fleets from the same control
plane.
**EST is a handler-level protocol, not a connector.** The
`ESTHandler` parses the wire format, enforces auth, and delegates
issuance to whichever `IssuerConnector` the profile binds. EST does
not replace your CA — it sits in front of the local CA, Vault PKI,
EJBCA, ADCS, step-ca, or anything else certctl already knows how to
issue against. Devices submit a CSR; certctl validates, gates, signs,
and returns a PKCS#7 certs-only response.
**Two enrollment models, one server.**
- **Host enrollment** — a long-lived device or laptop boots, generates
its own keypair locally, and enrolls via `simpleenroll` (initial)
then `simplereenroll` (renewal) over the device's TLS-pinned
channel. Private keys never leave the device.
- **User enrollment** — a network supplicant (corporate WiFi, VPN
client) drives `simpleenroll` against certctl on behalf of the user
identity. The CSR carries the user UPN as a SAN; the FreeRADIUS or
VPN policy gates session establishment on cert validity.
**Profile-driven policy.** Every EST profile carries its own:
- Issuer binding (`CERTCTL_EST_PROFILE_<NAME>_ISSUER_ID`)
- Optional `CertificateProfile` (`_PROFILE_ID`) that constrains
allowed key algorithms, key sizes, EKUs, SANs, max TTL, and
must-staple
- Auth mode mix: mTLS only, HTTP Basic only, both, or none (for
back-compat with anonymous deploys — strongly discouraged)
- Optional RFC 9266 `tls-exporter` channel binding
- Optional per-(CN, sourceIP) sliding-window rate limit
- Optional server-side keygen
The per-profile family is documented exhaustively in
[`features.md`](features.md).
**Multi-profile dispatch.** `CERTCTL_EST_PROFILES=corp,iot,wifi`
publishes three independent endpoint groups under
`/.well-known/est/<pathID>/`. Each profile's auth, trust anchor, and
issuer binding is isolated; a compromise of one profile's enrollment
password does not affect any other profile.
## Quick start
The five-minute single-profile setup runs EST anonymously over
HTTPS-only. **Use this only on a private network during evaluation;**
production deploys MUST set an auth mode (see
[Authentication modes](#authentication-modes)).
1. Have certctl running with TLS configured per [`tls.md`](tls.md).
The control plane listens on `:8443`; EST shares the same listener
under `/.well-known/est/`.
2. Set the legacy single-profile env vars in your compose file or
Helm values:
```
CERTCTL_EST_ENABLED=true
CERTCTL_EST_ISSUER_ID=iss-local
```
3. Restart certctl. The startup log line `EST server enabled` should
surface; the routes `/.well-known/est/{cacerts,simpleenroll,simplereenroll,csrattrs}`
are now live.
4. Ground-truth check from a client host:
```bash
curl -sS --cacert /path/to/ca.crt \
https://certctl.example.com:8443/.well-known/est/cacerts \
| base64 -d | openssl pkcs7 -inform DER -print_certs -noout
```
You should see your CA cert subject and `NotAfter`. This is the
`/cacerts` endpoint serving the PKCS#7 SignedData certs-only
response per RFC 7030 §4.1.
5. Generate a CSR and enroll:
```bash
openssl ecparam -name prime256v1 -genkey -noout -out device.key
openssl req -new -key device.key -subj "/CN=device-001.example.com" -out device.csr
curl -sS --cacert /path/to/ca.crt \
-H "Content-Type: application/pkcs10" \
--data-binary @<(openssl req -in device.csr -outform DER | base64 -w0) \
https://certctl.example.com:8443/.well-known/est/simpleenroll \
| base64 -d | openssl pkcs7 -inform DER -print_certs > device.crt
```
The response is a PKCS#7 certs-only blob; the issued cert lands in
`device.crt`.
If the curl fails with a TLS error, walk through [`tls.md`](tls.md);
the EST handler relies on the same listener as the REST API and
SHARES NO TRUST POLICY with the legacy plaintext :8080 of pre-v2.2
deploys (which was removed when the HTTPS-only policy landed).
## Multi-profile dispatch
A single certctl binary publishes one EST endpoint group per name in
`CERTCTL_EST_PROFILES`. Set the comma-separated list, then a matching
set of `CERTCTL_EST_PROFILE_<NAME>_*` env vars per profile:
```
CERTCTL_EST_ENABLED=true
CERTCTL_EST_PROFILES=corp,iot,wifi
# per-profile config — `<NAME>` placeholder gets replaced by the
# uppercased name from the list (so "corp" → CORP, "iot" → IOT,
# "wifi" → WIFI). The URL path uses the lowercased form.
CERTCTL_EST_PROFILE_<NAME>_ISSUER_ID=iss-local
CERTCTL_EST_PROFILE_<NAME>_PROFILE_ID=cp-corp-laptops
CERTCTL_EST_PROFILE_<NAME>_ENROLLMENT_PASSWORD=<random>
CERTCTL_EST_PROFILE_<NAME>_ALLOWED_AUTH_MODES=basic
```
This publishes:
- `/.well-known/est/corp/{cacerts,simpleenroll,simplereenroll,csrattrs,serverkeygen}`
- `/.well-known/est/iot/...`
- `/.well-known/est/wifi/...`
Each profile is independently validated at startup (see
`internal/config/config.go::Validate`). Per-profile failures log the
offending PathID and refuse the boot. The legacy single-profile
shape (`CERTCTL_EST_ENABLED` + `CERTCTL_EST_ISSUER_ID` without
`CERTCTL_EST_PROFILES`) continues to work — the back-compat shim in
`loadESTProfilesFromEnv` synthesises a single profile bound to the
empty PathID, which the router serves at `/.well-known/est/` (no
path component).
PathID rules (enforced at boot):
- Lowercased ASCII `[a-z0-9-]+` only, no leading/trailing hyphen.
- Distinct PathIDs per profile (no duplicates).
- Reserved name `est` rejected (would collide with the legacy root).
Mirrors the SCEP `CERTCTL_SCEP_PROFILES` family from the SCEP RFC
8894 master bundle — see [`legacy-est-scep.md`](legacy-est-scep.md)
for the SCEP equivalent.
## Authentication modes
certctl supports three EST authentication topologies per profile,
mixed and matched via `CERTCTL_EST_PROFILE_<NAME>_ALLOWED_AUTH_MODES`:
| Mode | Endpoint | When to use |
|---------|-------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| `mtls` | `/.well-known/est-mtls/<pathID>/...` | The device already has a bootstrap cert (factory-provisioned, previous-cert renewal, or out-of-band onboarding). Enterprise procurement teams almost always require this for production fleets — shared-password auth is a checkbox-fail regardless of password strength. |
| `basic` | `/.well-known/est/<pathID>/...` | First-cert bootstrap when no prior cert exists. The `_ENROLLMENT_PASSWORD` is a per-profile shared secret; constant-time comparison via `crypto/subtle.ConstantTimeCompare`. Pair with the source-IP failed-auth rate limit (see below). |
| both | both routes published | Migration window: existing devices renew via mTLS, new devices bootstrap via Basic. Same profile config, just both routes registered. |
| (empty) | `/.well-known/est/<pathID>/...` | Anonymous; no auth required at the EST layer. Back-compat for pre-Phase-1 deploys. Hardened-deployment best practice is to set this explicitly to `basic` or `mtls` — a future bundle may flip the default. |
Per-profile cross-check enforced at boot:
- `mtls` in the list requires `_MTLS_ENABLED=true` AND
`_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH` non-empty.
- `basic` in the list requires `_ENROLLMENT_PASSWORD` non-empty.
- Unknown auth modes refused at boot with the offending token in the
error message.
**Source-IP failed-auth rate limit.** When `_ENROLLMENT_PASSWORD` is
set and the Basic-auth gate trips, the handler increments a sliding-
window counter keyed on the source IP. After 10 consecutive failures
in an hour, the source is locked out (HTTP 429-equivalent failure
code) for the rest of the window. The limiter is process-local
(50k-IP cap, sliding 1h window — defaults; tunable in a follow-up).
This is independent of the per-(CN, sourceIP) per-principal limiter
discussed under [Renewal](#renewal-device-driven-model).
## RFC 9266 channel binding
When `CERTCTL_EST_PROFILE_<NAME>_CHANNEL_BINDING_REQUIRED=true`, the
EST handler enforces RFC 9266 `tls-exporter` channel binding. The
client must include an `id-aa-channelBindings` attribute in the CSR
whose value matches the server's
`r.TLS.ConnectionState().ExportKeyingMaterial("EXPORTER-Channel-Binding", nil, 32)`
output, computed independently at request time.
What this defends against: an attacker that bridges two TLS
connections (one client → attacker, another attacker → certctl) and
forwards the device's CSR through the attacker's TLS session. Without
channel binding, certctl sees a valid CSR submitted over a TLS
session authenticated by the attacker's cert; with channel binding,
the CSR's binding bytes only match if the CSR was signed against
THIS TLS session's exporter material.
Failure mode mapping:
| Server-side error | HTTP status | Meaning |
|-------------------------------------|-------------|----------------------------------------------------------------------------------------------------------------------|
| `ErrChannelBindingMissing` | 400 | `_CHANNEL_BINDING_REQUIRED=true` but the CSR's attribute is absent. Bad client config (or a non-RFC-9266 EST client). |
| `ErrChannelBindingMismatch` | 409 | Attribute present but doesn't match the live exporter — MITM signal. Treat as a security event, log the source IP. |
| `ErrChannelBindingNotTLS13` | 426 | Client connected over TLS 1.2 — `tls-exporter` requires TLS 1.3. Upgrade client OR rely on the TLS-1.2 reverse-proxy runbook. |
Cross-check at boot: setting `_CHANNEL_BINDING_REQUIRED=true` on a
profile with `_MTLS_ENABLED=false` is refused — channel binding is
meaningful only when mTLS is in use (otherwise the binding has no
client identity to bind to).
**libest support.** Cisco libest v3.0+ supports the RFC 9266
`--tls-exporter` flag. Older builds (commonly distros' packaged
versions through 2024) do not; per-profile opt-out via leaving the
env var `false` is the migration path. The libest sidecar in
`deploy/test/libest/Dockerfile` builds v3.2.0-2 from source and
includes the flag.
## WiFi / 802.1X recipe (FreeRADIUS)
This recipe stands up an EAP-TLS-authenticated corporate WiFi network
where certctl issues every device certificate via EST. End-to-end
flow:
```mermaid
flowchart LR
Laptop["Laptop / supplicant<br/>(wpa_supplicant / iwd / Apple WiFi)"]
AP["WiFi access point (NAS)"]
Radius["FreeRADIUS<br/>(validate cert chain)"]
CA["certctl CA<br/>(EST profile 'wifi')"]
Laptop -->|EAP| AP
AP -->|Radius| Radius
Radius -.->|trusts| CA
Laptop -->|"EST: /simpleenroll, /simplereenroll<br/>(one-time, then renewal)"| CA
```
### certctl-side: EST profile config for 802.1X
```
CERTCTL_EST_ENABLED=true
CERTCTL_EST_PROFILES=wifi
CERTCTL_EST_PROFILE_<NAME>_ISSUER_ID=iss-local
CERTCTL_EST_PROFILE_<NAME>_PROFILE_ID=cp-wifi-eap-tls
CERTCTL_EST_PROFILE_<NAME>_MTLS_ENABLED=true
CERTCTL_EST_PROFILE_<NAME>_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH=/etc/certctl/wifi-bootstrap-ca.pem
CERTCTL_EST_PROFILE_<NAME>_ALLOWED_AUTH_MODES=mtls
CERTCTL_EST_PROFILE_<NAME>_CHANNEL_BINDING_REQUIRED=true
CERTCTL_EST_PROFILE_<NAME>_RATE_LIMIT_PER_PRINCIPAL_24H=3
```
The matching `CertificateProfile` (`cp-wifi-eap-tls`) configured via
the API or GUI:
- `AllowedKeyAlgorithms`: ECDSA P-256 (covers Apple, Android, modern
laptop supplicants) plus optional RSA 2048+ for legacy clients.
- `AllowedEKUs`: `clientAuth` only (`1.3.6.1.5.5.7.3.2`). Drops
`serverAuth` so a device cert can't be reused as a TLS server cert.
EAP-TLS requires `clientAuth`; FreeRADIUS will reject certs without
it when `eap_chain_check_eku` is on.
- `RequiredCSRAttributes`: `["deviceSerialNumber"]` so the device's
serial appears in the issued cert (operators correlate WiFi grants
back to inventory).
- `MaxTTLSeconds`: 31536000 (1 year). Long enough for laptop fleets
that don't renew daily; short enough to limit the cert's blast
radius on key compromise.
### Device-side: drive `simpleenroll` from the supplicant
For Linux/embedded laptops:
```bash
# Bootstrap once (factory bootstrap cert presented over mTLS):
openssl ecparam -name prime256v1 -genkey -noout -out /etc/wifi/eap.key
openssl req -new -key /etc/wifi/eap.key \
-subj "/CN=laptop-001/serialNumber=ABC123" \
-out /etc/wifi/eap.csr
curl -sS --cacert /etc/certctl/ca.crt \
--cert /etc/wifi/bootstrap.crt \
--key /etc/wifi/bootstrap.key \
-H "Content-Type: application/pkcs10" \
--data-binary @<(openssl req -in /etc/wifi/eap.csr -outform DER | base64 -w0) \
https://certctl.example.com:8443/.well-known/est-mtls/wifi/simpleenroll \
| base64 -d | openssl pkcs7 -inform DER -print_certs > /etc/wifi/eap.crt
# Renewal cycle (cron, 10 days before NotAfter):
curl -sS --cacert /etc/certctl/ca.crt \
--cert /etc/wifi/eap.crt \
--key /etc/wifi/eap.key \
-H "Content-Type: application/pkcs10" \
--data-binary @<(openssl req -new -key /etc/wifi/eap.key -subj "/CN=laptop-001" -outform DER | base64 -w0) \
https://certctl.example.com:8443/.well-known/est-mtls/wifi/simplereenroll \
| base64 -d | openssl pkcs7 -inform DER -print_certs > /etc/wifi/eap.crt.new && \
mv /etc/wifi/eap.crt.new /etc/wifi/eap.crt
```
For Apple-managed devices the equivalent flow is wrapped by an MDM
profile that drives EST. For ChromeOS the Admin Console SCEP profile
remains the easier path until Google's EST support stabilises (track
the [SCEP+ChromeOS guide](legacy-est-scep.md#scep-rfc-8894-native-implementation-post-2026-04-29)).
### FreeRADIUS-side: EAP-TLS configuration
In `mods-available/eap`:
```
eap {
default_eap_type = tls
tls-config tls-common {
# The CA bundle that signed certctl's EST-issued device certs.
# Save the certctl issuer's CA chain to this path; the
# FreeRADIUS daemon reloads on HUP.
ca_file = /etc/freeradius/certs/certctl-ca.pem
# Server cert presented to the supplicant for tunnel TLS.
# Separate cert chain — FreeRADIUS's own cert, NOT a certctl-
# issued client cert.
certificate_file = /etc/freeradius/certs/freeradius-server.pem
private_key_file = /etc/freeradius/certs/freeradius-server.key
# Validate the supplicant's cert chain to certctl-ca.pem.
check_cert_issuer = "/CN=certctl-corp-ca"
# Pin the supplicant's EKU to clientAuth.
check_cert_cn = "%{User-Name}"
}
tls {
tls = tls-common
}
}
```
The matching `sites-available/default` authorize block invokes
`eap` and rejects on cert-chain failure. CRL/OCSP validation against
certctl's CRL endpoint (`/.well-known/pki/crls/<issuerID>.crl`) is
configured under `tls-common.crl_dir` — see [`crl-ocsp.md`](crl-ocsp.md)
for the certctl-side CRL distribution endpoint and refresh cadence.
### End-to-end flow
1. Laptop boots, supplicant starts EAP-TLS handshake against the AP.
2. AP forwards the EAP frames to FreeRADIUS over RADIUS.
3. FreeRADIUS validates the supplicant cert chain against
`certctl-ca.pem`, checks revocation against the certctl CRL, and
pins the EKU to `clientAuth`.
4. On valid cert, FreeRADIUS returns Access-Accept; the AP grants
network access.
5. ~10 days before the cert's `NotAfter`, the device's renewal cron
hits `simplereenroll` over the EXISTING mTLS-authenticated session
— no operator interaction.
What can go wrong (operator playbook):
| Symptom | Diagnostic | Fix |
|----------------------------------------|------------------------------------------------------------------|------------------------------------------------------------------------------------------------|
| Supplicant rejected at TLS handshake | `tcpdump` on AP shows TLS-1.2 hello | Update supplicant to TLS 1.3 OR ensure FreeRADIUS's cert is signed under a chain it trusts. |
| FreeRADIUS rejects with "expired CRL" | `freeradius -X` log surfaces stale CRL | certctl regenerates per-issuer CRLs hourly (see [`crl-ocsp.md`](crl-ocsp.md)); tighten `crl_dir` reload cadence in FreeRADIUS. |
| Renewal fails with HTTP 429 | certctl audit log shows `est_rate_limited` for this device | Per-(CN, sourceIP) limit tripped; either widen `_RATE_LIMIT_PER_PRINCIPAL_24H` or investigate why the device is renewing >3x/24h. |
| Renewal fails with HTTP 401 | certctl audit log shows `est_auth_failed_mtls` | Bootstrap cert chain doesn't trace to `_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH`. Re-issue or rotate. |
| Sustained `est_auth_failed_basic` from one IP | certctl audit log + IP reverse lookup | Likely brute-force; the source-IP limiter will lock the IP after 10 fails/hr. Block at firewall.|
## IoT bootstrap recipe
Long-running devices in the field — sensors, gateways, kiosks —
typically follow this lifecycle:
1. **Factory provisioning** — bake one of:
- A **bootstrap enrollment password** into the device firmware
(per-fleet shared secret; pair with the source-IP rate limit)
- A **factory-installed bootstrap cert** signed by the operator's
factory CA, suitable for mTLS on first enroll
2. **First boot** — device generates an ECDSA P-256 keypair locally,
builds a CSR with its serial in `deviceSerialNumber`, and POSTs to
`/.well-known/est/<pathID>/simpleenroll` (with HTTP Basic) or
`/.well-known/est-mtls/<pathID>/simpleenroll` (with the bootstrap
cert). On success, the device persists the issued cert and the
bootstrap material can be discarded.
3. **Steady state** — device drives `simplereenroll` over the
issued cert's mTLS session ~1025% before `NotAfter`. The
re-enrollment uses the issued cert as the client cert; no shared
secrets in the renewal path.
4. **Compromise / decommission** — operator hits the bulk-revoke
endpoint:
```bash
curl -sS -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $CERTCTL_API_KEY" \
--cacert /path/to/ca.crt \
https://certctl.example.com:8443/api/v1/est/certificates/bulk-revoke \
-d '{"reason":"keyCompromise","profile_id":"cp-iot-sensors"}'
```
The endpoint is M-008 admin-gated; non-admin Bearer callers receive
HTTP 403. Source is auto-pinned to `EST` server-side, so the
operation only revokes EST-issued certs even if the criteria match
non-EST sources too. The CRL/OCSP responder picks up the revocations
on the next refresh cycle (`CERTCTL_CRL_GENERATION_INTERVAL`,
default 1h) — see [`crl-ocsp.md`](crl-ocsp.md).
**Recommended cert lifetimes for IoT.** Set `MaxTTLSeconds = 7776000`
(90 days) on the IoT `CertificateProfile`. Long enough to absorb
multi-day network outages without losing the device; short enough to
limit exposure on key compromise (combined with bulk revoke + CRL
refresh, the worst-case window is `1h + crl_refresh_interval` from
revocation to relying-party rejection).
**Renewal trigger ratio for IoT.** Set the device's renewal cron to
fire at 25% remaining lifetime — that gives ~22 days of buffer for a
device that's offline at expiry-time to reconnect, retry, and
re-enroll before the cert hard-expires. Mirrors the renewal-trigger
ratio for laptops at 50% (laptops are online more often, so the
buffer can be tighter relative to lifetime).
## `serverkeygen` for resource-constrained devices
RFC 7030 §4.4 lets the server generate the keypair on behalf of the
client when the device lacks a hardware RNG — typical of ultra-low-
power IoT or embedded modules without a TRNG. certctl supports this
via `CERTCTL_EST_PROFILE_<NAME>_SERVERKEYGEN_ENABLED=true`.
Wire format: `POST /.well-known/est/<pathID>/serverkeygen` with the
device's CSR as the request body. The handler:
1. Parses the CSR; the CSR's pubkey is treated as the **recipient
key** for CMS EnvelopedData wrapping (RFC 7030 §4.4.2). The CSR's
pubkey must support keyTrans (RSA-only at this revision; ECDH
defer to a follow-up bundle) — non-RSA CSRs return HTTP 400 with
`ErrServerKeygenRequiresKeyEncipherment`.
2. Resolves the per-profile key algorithm from
`CertificateProfile.AllowedKeyAlgorithms` (default RSA-2048).
3. Generates a fresh keypair in process memory.
4. Re-builds the CSR with the server-generated pubkey (so the issuer
sees a CSR that matches the cert it's signing).
5. Runs the existing issuer pipeline.
6. Marshals the private key as PKCS#8 DER, then wraps it in CMS
EnvelopedData encrypted to the device's CSR pubkey via AES-256-CBC
with a per-call random IV.
7. Returns the response as `multipart/mixed` per RFC 7030 §4.4.2:
first part is the cert chain (PKCS#7), second part is the
EnvelopedData blob (`application/pkcs8`).
8. **Zeroizes** the plaintext key + PKCS#8 bytes before return —
`internal/service/est.go::zeroizeKey` + `zeroizeBytes`. The
private key never persists to disk on the certctl side.
Cross-check at boot: setting `_SERVERKEYGEN_ENABLED=true` on a
profile with empty `_PROFILE_ID` is refused — server-keygen needs a
`CertificateProfile` to pin `AllowedKeyAlgorithms` (the server has
to decide what key to generate, and a profile-less default would be
arbitrary).
**Security caveats.**
- **Trust transitivity.** Server-keygen breaks the cardinal property
of agent-based key management: that the private key never leaves
the device. The CMS wrap protects the key in transit, but the
device still trusts certctl with the key material at generation
time. Use only when the device cannot generate its own keypair —
not as a convenience.
- **Heap residency window.** The plaintext key lives in process heap
between generation and CMS encryption. The zeroize step closes the
obvious leakage leg, but a Go runtime that GC-relocates the buffer
before zeroize fires could leave a copy. The threat-model carve-out
is documented in [Threat model](#threat-model); use HSM-backed
signing for highest-assurance fleets.
- **No audit-log trail of the key bytes.** The audit row records
the issuance (cert serial, subject, issuer) but never the key
bytes; the operator cannot recover a key after issuance. This is
by design — the key bytes only exist for the duration of the
request.
## HSM-backed CA signing for EST
EST signs certs using whatever issuer connector the profile binds.
The `internal/crypto/signer/` interface (post-2026-04-28) means a
future HSM/PKCS#11 driver bundle (parking-lot at
`cowork/hsm-pkcs11-driver-prompt.md`) plugs in transparently — the
EST handler doesn't change. EST-issued certs benefit from HSM-backed
signing automatically once the HSM bundle ships and the operator
swaps the local issuer's `FileDriver` for a `PKCS11Driver`.
For deploys that need HSM-backed CA signing today, use the local
issuer's `FileDriver` with the CA key on a read-only TPM-protected
tmpfs; the L-014 file-on-disk threat-model carve-out in
`internal/connector/issuer/local/local.go` documents the
defense-in-depth steps.
## Operator GUI (EST Admin tabs)
The EST Admin surface lives at `/est` (route `web/src/main.tsx`,
nav link `web/src/components/Layout.tsx::EST Admin`). The page is
admin-gated at the top level — non-admin Bearer callers see an
"Admin access required" banner, and the underlying admin endpoints
(`/api/v1/admin/est/*`) are M-008 protected server-side independently.
Three tabs:
- **Profiles** (default) — per-profile lean cards with auth-mode
badges, mTLS trust-anchor expiry countdown (green ≥30d / amber
730d / red <7d / EXPIRED), the 12-cell live counter grid (every
`est_*` failure mode), and a "Reload trust anchor" modal that
hits `POST /api/v1/admin/est/reload-trust` (the SIGHUP-equivalent;
bad reloads keep the OLD pool in place per the
[Threat model](#threat-model) reload semantics).
- **Recent Activity** — merges the four EST audit-action prefixes
(`est_simple_enroll`, `est_simple_reenroll`, `est_server_keygen`,
`est_auth_failed`) across four parallel queries with chip filters
(All / Enrollment / Re-enrollment / ServerKeygen / AuthFailure).
Polled every 60s.
- **Trust Bundle** — per-mTLS-profile cert subjects + expiries
surfaced from the trust holder snapshot. Used during rotation:
operator extracts the new bundle, overwrites the on-disk file,
hits Reload, then reloads this tab to confirm the new subjects.
All three admin endpoints (`GET /api/v1/admin/est/profiles`,
`POST /api/v1/admin/est/reload-trust`, plus the audit-query merge in
the GUI) are M-008 admin-gated. The page itself hides (UX hint) and
the server-side gate enforces (security boundary).
## CLI + MCP tools
The `certctl-cli est` subcommand family (`internal/cli/est.go`):
```
certctl-cli est cacerts --profile <name>
certctl-cli est csrattrs --profile <name>
certctl-cli est enroll --profile <name> --csr <path|-> [--out <path>]
certctl-cli est reenroll --profile <name> --csr <path|-> [--out <path>]
certctl-cli est serverkeygen --profile <name> --csr <path> --out <prefix>
certctl-cli est test --profile <name>
```
`--profile` is the lowercased PathID (matches the URL path). Empty
profile string maps to the legacy `/.well-known/est/` root — use only
during a back-compat migration. Server-keygen writes
`<prefix>.cert.pem` plus `<prefix>.key.enveloped` (the EnvelopedData
blob, decryptable with `openssl smime`).
The MCP server (`internal/mcp/tools_est.go`) exposes six tools that
mirror the CLI surface for AI-orchestrated workflows:
- `est_list_profiles` — every configured EST profile + its auth modes
+ counters
- `est_admin_stats` — alias of the above; matches the
`scep_admin_stats` naming convention
- `est_get_cacerts` — base64 PKCS#7 cert chain
- `est_get_csrattrs` — base64 DER attributes blob (per-profile when
`RequiredCSRAttributes` is set)
- `est_enroll` — body carries the CSR PEM; returns the issued cert
- `est_reenroll` — same but uses the previous-cert mTLS path
All six are gated by the standard MCP Bearer auth + the page-level
admin gate where applicable (`est_list_profiles`, `est_admin_stats`).
## Renewal: device-driven model
RFC 7030 §4.2.2 mandates the renewal model: the **device** decides
when to renew and drives `simplereenroll` over its existing cert.
There is no server-initiated push — certctl never reaches out to a
device fleet to force renewal.
Practical implications:
- A device offline at expiry-time **loses its cert**. Mitigation:
pick a renewal-trigger ratio with enough buffer (50% remaining
lifetime for laptops, 25% for IoT — see
[IoT bootstrap recipe](#iot-bootstrap-recipe)). On chronically
offline fleets, lengthen `MaxTTLSeconds`.
- The "operator wants to push renewal" case is handled via the
notification webhook surface (`internal/connector/notifier/webhook/`)
— operator publishes an event on a topic the device fleet
subscribes to (or the operator's MDM picks up); the device's MDM
agent triggers the renewal cron out-of-band. certctl emits a
`cert.expiring_soon` event on the standard 30/7/1-day pre-expiry
schedule (`internal/scheduler/scheduler.go::expiryNotificationLoop`).
- Per-(CN, sourceIP) sliding-window cap keeps a misbehaving device
from hammering the server. Default is `0` (disabled, back-compat);
production deploys set `3` per `CERTCTL_EST_PROFILE_<NAME>_RATE_LIMIT_PER_PRINCIPAL_24H`.
Mirrors the SCEP/Intune per-device limit pattern from
[`scep-intune.md`](scep-intune.md).
## Troubleshooting matrix
The handler emits a typed audit-action code per failure mode. Filter
the GUI Recent Activity tab on the action prefix to find the
offending requests, and use the table below to map back to root
cause + fix.
| Audit action | Symptom | Root cause + fix |
|--------------------------------------|-------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `est_simple_enroll_success` | (success counter) | No action needed. |
| `est_simple_enroll_failed` | An enrollment failed — the bare `_failed` codes give the typed reason | The audit row's `details` carries the inner reason; cross-reference one of the rows below. |
| `est_simple_reenroll_success` | (success counter) | No action needed. |
| `est_simple_reenroll_failed` | A renewal failed | Same as `est_simple_enroll_failed`; cross-reference inner reason. |
| `est_server_keygen_success` | (success counter) | No action needed. |
| `est_server_keygen_failed` | Server-keygen failed | Most common: device CSR carries a non-RSA pubkey (the keyTrans wrap requires RSA at this revision). Switch the device to an RSA CSR or wait for ECDH support. |
| `est_auth_failed_basic` | HTTP Basic gate tripped | Wrong password OR the password env var rotated and the device wasn't re-provisioned. Watch the source-IP for sustained failures — the limiter locks out after 10 fails/hr. |
| `est_auth_failed_mtls` | mTLS gate tripped | Client cert doesn't chain to the trust anchor OR the cert is past `NotAfter` OR the cert presented is for a different EST profile (cross-profile bleed defense). Check `details.subject` against `_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH`. |
| `est_auth_failed_channel_binding` | RFC 9266 channel-binding gate tripped | One of: missing `id-aa-channelBindings` attribute on the CSR (libest <v3.0); mismatch (MITM signal — log + escalate); TLS 1.2 client (channel binding requires TLS 1.3). Map the inner error to the [channel-binding table](#rfc-9266-channel-binding). |
| `est_rate_limited` | Per-(CN, sourceIP) cap tripped | If legitimate (recovery + first-cert + post-wipe in 24h), bump `_RATE_LIMIT_PER_PRINCIPAL_24H`. If suspicious, the limiter is doing its job — investigate the device. |
| `est_csr_policy_violation` | CSR violates the bound `CertificateProfile` rules | Inner detail names the dimension (key alg, key size, EKU, SAN, max TTL). Either fix the device CSR or relax the policy — never silently accept. |
| `est_bulk_revoke` | Operator-initiated bulk revoke | Audit-only signal; no failure. Cross-reference the operator's identity in `details.actor`. |
| `est_trust_anchor_reloaded` | Operator-initiated SIGHUP-equivalent reload | Audit-only signal; no failure. Failed reloads do NOT emit this code (the OLD pool stays in place; check the GUI Reload modal's error message + the `details.path_id`). |
The bare action codes (without the `_success`/`_failed` suffix) are
also emitted for back-compat with the GUI activity-tab filter chips
which match by exact-string `startsWith()` — the split-emit pattern
preserves both the legacy-grep and the new typed-counter use cases.
See `internal/service/est_audit_actions.go` for the constant
definitions; the per-action emission sites are in
`internal/service/est.go::processEnrollment`.
## TLS 1.2 reverse-proxy runbook
Some embedded EST clients only speak TLS 1.2 — older OpenWRT routers,
some industrial PLCs, IoT firmware that can't be field-upgraded.
certctl's control plane is TLS 1.3 only (pinned at
`cmd/server/tls.go::buildServerTLSConfig`). The migration path is the
TLS 1.2 reverse-proxy pattern documented in
[`legacy-est-scep.md`](legacy-est-scep.md):
- nginx / HAProxy terminates TLS 1.2 from the legacy client
- Forwards the EST request body unchanged to certctl on TLS 1.3
- Optionally forwards the client cert via `X-SSL-Client-Cert` for the
proxy-side mTLS trust pin
Important caveat: **RFC 9266 channel binding cannot work through a
reverse proxy.** The channel binding bytes are derived from the
client↔proxy TLS session, NOT the proxy↔certctl session. Disable
`_CHANNEL_BINDING_REQUIRED` for profiles that serve via the proxy
runbook.
## Threat model
The EST hardening bundle's threat model rests on these load-bearing
properties; deviations need explicit operator awareness:
- **Trust anchor reload is fail-safe.** A SIGHUP that hits a
half-rotated bundle (parse error, expired cert) keeps the OLD pool
in place. The validator never accepts an unparseable bundle. The
GUI reload modal surfaces the error so the operator can correct
the file and retry without taking the EST endpoint down.
- **Per-profile counter isolation.** Each ESTService instance has
its own `estCounterTab` (sync/atomic-backed). A future shared-
counter refactor would fail at the compile-time pointer-identity
check in `internal/service/est_profile_counter_isolation_test.go`.
This means the Recent Activity tab's per-profile filter is a real
filter, not a fan-out display of one shared counter.
- **mTLS cross-profile bleed is blocked.** A client cert presented
to profile A's mTLS endpoint must chain to A's trust bundle, not
any other profile's. The per-handler re-verify enforces this even
when both profiles share a TLS listener union pool (see
`cmd/server/tls.go::buildServerTLSConfigWithMTLS`).
- **Source-IP failed-Basic limiter is process-local.** The 10/hr
cap is enforced in-process; a load-balanced multi-pod deploy where
request distribution is round-robin can amplify the effective
per-IP rate by the pod count. Mitigation: use sticky-source-IP
load balancing for `/.well-known/est/` if this is in scope.
- **Server-keygen has a heap-residency window.** The plaintext
private key lives in process memory between generation and CMS
EnvelopedData encryption. The zeroize step closes the obvious
leakage leg, but a GC-relocation between generation and zeroize
could leave a copy. Use HSM-backed signing for highest-assurance
fleets where this matters.
- **HTTP Basic password is in-process only.** Stored in
`ESTHandler.basicPassword`, never logged, never written to disk by
certctl. Operators ARE responsible for the env-var injection path
(Helm secret, Docker secret, Vault) — see `tls.md` for the
recommended secret-mount conventions.
- **The legacy unauthenticated default exists for back-compat.**
Pre-Phase-1 deploys had no `_ALLOWED_AUTH_MODES` env var; the
default is empty (anonymous) so existing deploys continue to work.
A future bundle MAY flip the default to require explicit opt-in;
production deploys should set `_ALLOWED_AUTH_MODES` explicitly
today regardless.
## V3-Pro deferrals
These capabilities are deferred to V3-Pro (paid tier). They're not
oversights — they're the natural follow-on bundles after v2.X.0 GA:
- **Conditional Access / device-posture gating.** The per-profile
ESTService exposes a nil-default compliance-hook seam (mirrors the
SCEP/Intune `ComplianceCheck` pattern). V3-Pro plugs in a
Microsoft Graph or other posture-check callback before issuance;
non-compliant devices fail with a typed `est_compliance_failed`
reason.
- **Multi-tenant CA isolation.** V2 has one trust anchor pool per
EST profile and one issuer binding. V3-Pro ships per-tenant root
+ per-tenant audit isolation for MSPs running shared certctl
deployments across customers.
- **EST cert-bound usage analytics.** Forward device-side handshake
logs into certctl for cert-bound session analytics. V3-Pro (or
delegate to a real session-management product like Teleport for
TLS sessions).
- **EST-cert-manager-style controller for K8s host fleets.**
External-issuer pattern that lets cert-manager use certctl's EST
server as a backend. Parking-lot per `WORKSPACE-ROADMAP.md::Cloud
and Kubernetes`.
- **Standalone `certctl-est` CLI binary.** All EST ops route through
the certctl server in V2; a standalone binary that an operator can
run on a laptop without the full server (similar to the SCEP probe
deferred CLI binary). V2 ships the `certctl-cli est` subcommand
family which solves the same operator workflow at a lower
packaging cost.
- **`fullcmc` (RFC 7030 §4.3) implementation.** Rare in practice;
only Cisco IOS and a few financial-PKI vendors use it. Defer
until a customer asks.
## Appendix A: libest reference client
certctl's CI exercises the EST endpoints against Cisco's libest
reference implementation via the sidecar at
`deploy/test/libest/Dockerfile`. The build reproduces v3.2.0-2 from
source on `debian:bookworm-slim` (digest-pinned per the H-001 guard).
To reproduce locally:
```bash
# From the repo root.
docker compose --profile est-e2e -f deploy/docker-compose.test.yml build libest-client
docker compose --profile est-e2e -f deploy/docker-compose.test.yml up -d libest-client
docker exec -it certctl-libest-client estclient --help
```
The integration test suite (`deploy/test/est_e2e_test.go`, build
tag `integration`) drives the live certctl server through the
sidecar via `docker exec` for these scenarios:
- `TestEST_LibESTClient_Enrollment_Integration` — `cacerts`
→ `simpleenroll` → cert assertion
- `TestEST_LibESTClient_MTLSEnrollment_Integration` — mTLS sibling
route
- `TestEST_LibESTClient_ServerKeygen_Integration` — RFC 7030 §4.4
multipart/mixed
- `TestEST_LibESTClient_RateLimited_Integration` — exhausts the
per-principal cap and asserts the 429-shaped error
- `TestEST_LibESTClient_ChannelBinding_Integration` — RFC 9266
`--tls-exporter` (skipped when libest build lacks the flag)
Run the suite via `INTEGRATION=1 go test -tags integration ./deploy/test/... -run EST`.
## Appendix B: RFC 7030 wire-format quirks
certctl's EST handler ships with quirk-tolerance for documented EST
client populations. The fixtures + unit tests live at
`internal/api/handler/cisco_ios_quirks_test.go` +
`internal/api/handler/testdata/cisco_ios_*.txt`.
| Vendor / version | Quirk | certctl behavior |
|-----------------------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Cisco IOS 15.x | Some images send the CSR as `application/x-pem-file` (not the spec'd `application/pkcs10`) | The handler dispatches on the body prefix (`-----BEGIN`) rather than the Content-Type header — accepted as PEM-encoded PKCS#10. |
| Cisco IOS 16.x | Trailing newlines on the base64 body (variable count) | `strings.TrimSpace` pass before base64 decode; bodies tolerated regardless of trailing whitespace. |
| Apple MDM (some firmware) | CRLF line wrapping inside the base64 body | `base64.StdEncoding` handles both LF and CRLF. |
| OpenWRT (older builds) | TLS 1.2 only | Use the [TLS 1.2 reverse-proxy runbook](#tls-12-reverse-proxy-runbook); disable channel binding for affected profiles. |
| libest <v3.0 | No RFC 9266 `--tls-exporter` flag | Set `_CHANNEL_BINDING_REQUIRED=false` for affected profiles; the server still validates everything else. |
If you find a new wire-format quirk in a real device, file an issue
with a base64 dump of the failing request — we'll add a fixture +
the matching tolerance pass.
## Related docs
- [`legacy-est-scep.md`](legacy-est-scep.md) — TLS 1.2 reverse-proxy
runbook + the SCEP RFC 8894 native implementation parallels.
- [`scep-intune.md`](scep-intune.md) — the SCEP/Intune master bundle
that established the multi-profile dispatch + admin GUI + golden
fixture patterns this EST bundle mirrors.
- [`crl-ocsp.md`](crl-ocsp.md) — the per-issuer CRL distribution
endpoint and OCSP responder that EST-issued certs are revoked
through.
- [`features.md`](features.md) — every `CERTCTL_*` env var,
including the per-profile `CERTCTL_EST_PROFILE_<NAME>_*` family
documented here.
- [`architecture.md`](architecture.md) — overall control-plane
architecture; EST Server section + Security Model trust-anchor
rotation discussion.
- [`tls.md`](tls.md) — TLS bootstrap for the certctl control plane;
prerequisite for any production EST deploy.
- [`connectors.md`](connectors.md) — issuer connectors that EST
delegates to.
+385
View File
@@ -0,0 +1,385 @@
# Microsoft Intune SCEP enrollment via certctl
> **Status (this document):** Phase 11 of the SCEP RFC 8894 + Intune master
> bundle. The behavior described here is shipped on `master` and exercised
> end-to-end by `internal/api/handler/scep_intune_e2e_test.go`. The
> bundle is V2-free (community edition) — Conditional-Access compliance
> gating, native Microsoft Graph integration, and per-tenant trust
> anchors are documented under [Limitations](#limitations) as V3-Pro
> features.
## TL;DR
certctl is a **drop-in NDES replacement** for Microsoft Intune SCEP fleets.
Intune-managed devices keep using the existing Intune Certificate Connector;
only the SCEP server URL changes. certctl validates the Connector's
signed challenge using its installation signing cert (no Microsoft API
calls — the Connector already did that), binds the device claim to the
inbound CSR, and issues through whichever certctl issuer connector you
have configured (local CA, Vault, EJBCA, ADCS, etc.).
What you get over NDES:
- Per-profile SCEP endpoints (`/scep/corp` vs. `/scep/iot` etc.) so a
single certctl deploy serves multiple device fleets with distinct
challenge passwords + trust anchors.
- Audit log entries with the device GUID, claim subject, and CSR
binding details — much better forensics than NDES + IIS logs.
- Trust anchor reload via `SIGHUP` (no service restart) when the
Connector signing cert rotates.
- A built-in admin GUI tab (Intune Monitoring) showing per-profile
enrollment counters, trust-anchor expiry countdowns, and the recent
failures table.
- Per-device rate limit (sliding window log keyed by Subject + Issuer)
that catches a compromised Connector signing key issuing many
different valid challenges for the same device.
## Architecture
```mermaid
flowchart LR
Cloud["Intune cloud<br/>(Microsoft)"]
Connector["Intune Certificate Connector<br/>(customer infra)"]
Server["certctl SCEP server<br/>(you)"]
Issuer["issuer connector<br/>(local CA / Vault / EJBCA / …)"]
Cloud --> Connector --> Server --> Issuer
```
**certctl replaces NDES, not the Connector.** The Intune Certificate
Connector is the bridge between the Intune cloud and your on-prem PKI;
Microsoft installs and maintains it. What you replace is the
**Network Device Enrollment Service** (NDES) — the SCEP server
historically deployed on a Windows host, sitting between the Connector
and an Active Directory Certificate Services CA. certctl sits in
exactly that slot and speaks SCEP RFC 8894 to the Connector.
### What certctl validates per request
For every Intune-flavored SCEP request the dispatcher in
`internal/service/scep.go::dispatchIntuneChallenge` walks the
following gates in order. A failure on any gate produces a CertRep
PKIMessage with the documented `pkiStatus`/`failInfo` codes (per RFC
8894 §3.2.1.4.5) and increments the corresponding metric counter.
1. **Shape pre-check**`looksIntuneShaped(challengePassword)`:
length > 200 + exactly two dots. False positives are fine; false
negatives on real Intune challenges would route them to the static
compare and reject. The pre-check just decides whether to invoke
the full validator.
2. **JWS signature**`intune.ValidateChallenge` re-derives the
signing input from the raw on-wire bytes (per RFC 7515 §3.1, NOT
re-base64-encoded segments) and verifies against every cert in the
trust anchor pool. Supports RS256 and ES256 (both fixed-width
r||s and ASN.1-DER form). Explicitly rejects `alg=none` and
HMAC algs.
3. **Version dispatch** — extracts the `version` claim from the
payload prelude. v1 (current Connector format, no `version` key)
routes to `unmarshalChallengeV1`. Future v2 plugs in a sibling
parser without touching the validator.
4. **Time bounds**`now+tolerance ≥ iat AND now-tolerance < exp`.
The `±tolerance` window is configurable per profile via
`INTUNE_CLOCK_SKEW_TOLERANCE` (default 60s, covers modest clock
drift between the Connector host and certctl). Configurable cap on
top via `INTUNE_CHALLENGE_VALIDITY` (defense-in-depth against a
Connector that mints long-validity challenges). The validator
refuses `tolerance ≥ ChallengeValidity` at startup-validation time
to keep the cap meaningful.
5. **Audience pin**`claim.aud == INTUNE_AUDIENCE` (skipped when
`INTUNE_AUDIENCE` is empty for proxy/load-balancer scenarios).
6. **CSR binding**`claim.DeviceMatchesCSR(csr)` checks
set-equality between the claim's `device_name` / `san_dns` /
`san_rfc822` / `san_upn` and the CSR's CN + SANs. Set-equality
means the CSR carries EXACTLY the claim's values, no extras and
no missing.
7. **Replay**`intune.ReplayCache.CheckAndInsert` rejects
duplicates within the configured TTL. Sized for 100k entries
(covers a ~25 RPS Intune fleet's steady-state).
8. **Per-device rate limit** — sliding window log keyed by
`(claim.Subject, claim.Issuer)`. Catches a compromised Connector
issuing many DIFFERENT valid challenges for the same device. Default
3 enrollments per 24h covers legitimate first-cert + recovery +
post-wipe.
9. **Optional compliance check** — V3-Pro plug-in seam (nil-default
no-op). When set, the gate calls Microsoft Graph's compliance API
and short-circuits non-compliant devices with FAILURE+BadRequest.
A request that passes all nine gates flows to
`processEnrollment`, which builds the issuance request, calls the
configured issuer connector, and emits a CertRep PKIMessage with the
issued cert encrypted to the device's transient signing cert per RFC
8894 §3.3.2.
## Migration from NDES + EJBCA (or NDES + ADCS)
The migration plan below is conservative — install certctl alongside
your existing NDES so you can flip Intune profiles fleet-by-fleet
without a flag day. Validated against a fresh `docker compose up`
stack; the docker-compose.test.yml stack does not currently bake
Intune in (Phase 10.2 ships a hermetic in-process e2e test instead),
so the production validation step is a manual run-book item.
1. **Install certctl alongside existing NDES.** Stand up the certctl
server on a separate host (or as a Kubernetes deployment) reachable
from the Connector host. Use the existing operator-run-book in
`docs/tls.md` for the TLS bootstrap.
2. **Configure a per-profile SCEP endpoint.** Pick a path id (e.g.
`corp` — referenced as `<NAME>` below; the value gets uppercased
for the env-var key and lowercased for the URL path) and set:
```
CERTCTL_SCEP_ENABLED=true
CERTCTL_SCEP_PROFILES=corp
CERTCTL_SCEP_PROFILE_<NAME>_ISSUER_ID=iss-local # or your existing issuer
CERTCTL_SCEP_PROFILE_<NAME>_CHALLENGE_PASSWORD=<random> # Intune still requires this
CERTCTL_SCEP_PROFILE_<NAME>_RA_CERT_PATH=/etc/certctl/ra-corp.pem
CERTCTL_SCEP_PROFILE_<NAME>_RA_KEY_PATH=/etc/certctl/ra-corp.key
```
The endpoint will be served at `https://certctl.example.com/scep/corp`
— the URL path uses the lowercased name and the env-var keys use
the uppercased form. Concrete env-var name mappings are listed in
[`features.md`](features.md).
3. **Extract the Intune Connector's signing cert.** On the Connector
host (Windows), the Connector's installation creates a self-signed
cert in the local machine's `Personal` cert store with subject
`CN=Microsoft Intune Certificate Connector` (path documented by
Microsoft — see Microsoft Learn link in the
[Microsoft support statement](#microsoft-support-statement) below).
Export the public cert (no private key) as a base64 `.cer` file.
4. **Configure the trust anchor.** Copy the `.cer` to the certctl host
(or mount via your secret manager) and set:
```
CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_ENABLED=true
CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CONNECTOR_CERT_PATH=/etc/certctl/intune-corp.pem
CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_AUDIENCE=https://certctl.example.com/scep/corp
CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CHALLENGE_VALIDITY=60m
CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CLOCK_SKEW_TOLERANCE=60s # ±tolerance on iat/exp; raise on poorly-NTP-synced fleets, lower to enforce strict time
CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_PER_DEVICE_RATE_LIMIT_24H=3
```
Restart certctl. The startup preflight refuses to boot if the
trust anchor file is missing, unparseable, or contains an expired
cert — failure is loud at boot rather than silent at request time.
5. **Configure the issuer connector.** If you're keeping EJBCA,
point `CERTCTL_SCEP_PROFILE_<NAME>_ISSUER_ID` at your EJBCA issuer
profile (see `docs/connectors.md`). For a clean cut-over to the
built-in local CA, follow `docs/tls.md` to bootstrap a sub-CA cert.
6. **Migrate one Intune SCEP profile to certctl.** In the Intune
admin center, edit the SCEP profile for a small canary device
group and update the SCEP server URL to
`https://certctl.example.com/scep/corp`. Push the profile and
wait for the canary devices to rotate (24-48h).
7. **Verify enrollment.** Open the certctl admin GUI's
[SCEP Intune Monitoring tab](#operational-monitoring) and watch
the `success` counter tick on the `corp` profile card. The
`recent failures` table surfaces any rejected enrollments with
the exact reason (e.g. `signature_invalid`, `claim_mismatch`).
8. **Roll out the rest of the fleet.** Once the canary is clean,
migrate the remaining Intune SCEP profiles in batches.
9. **Decommission NDES.** After all fleets are migrated and a few
renewal cycles have completed cleanly, take down the NDES role
and the IIS site. The existing certs continue to chain to your
issuer; only the enrollment path changes.
## Intune SCEP profile fields → certctl behavior
The Intune admin center's SCEP profile editor exposes a fixed set of
fields. The mapping below is what each field controls relative to
certctl's behavior.
| Intune profile field | certctl behavior |
|-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Certificate type | Treated as device or user; surfaces in the claim's `subject` field (device GUID vs. user UPN). certctl doesn't gate on type; the issuer's certificate profile decides. |
| Subject name format | Drives the CSR's CN. The Intune Connector sets `claim.device_name` from this value; certctl's CSR-binding gate enforces equality. |
| Subject alternative name | Drives the CSR's SAN list. Intune supports DNS / RFC 822 / UPN; certctl's claim binding checks set-equality per dimension. Mismatches surface as `ErrClaimSANDNSMismatch` / `_SANRFC822Mismatch` / `_SANUPNMismatch`. |
| Certificate validity period | Honored by the issuer connector. certctl caps via the per-profile `CertificateProfile.MaxTTLSeconds`; the smaller of the two wins. |
| Key storage provider | Device-side concern (the Connector negotiates with the device's TPM / Software KSP). certctl never sees the device's private key — it only signs the CSR. |
| Key usage / Extended key usage | Honored by the issuer connector via the bound `CertificateProfile.AllowedEKUs`. CSRs requesting an EKU outside the allowed set are rejected by the crypto-policy gate (`ValidateCSRAgainstProfile`). |
| Hash algorithm | The CSR's signature hash (SHA-256 typical). The SCEP `GetCACaps` advertises SHA-256 + SHA-512; the device picks. |
| SCEP server URL | The endpoint URL the Connector posts to. Set to `https://certctl.example.com/scep/<profile-name>`. |
## Trust anchor extraction
The Intune Certificate Connector self-signs an installation cert at
install time. To configure certctl, extract this cert (PUBLIC ONLY,
no private key) as PEM:
1. On the Connector host (Windows), open `certlm.msc` (Local Machine
Certificate Manager).
2. Navigate to `Personal` → `Certificates`. Find the cert with
subject `CN=Microsoft Intune Certificate Connector`.
3. Right-click → All Tasks → Export. Choose **No, do not export
the private key**. Format: **Base-64 encoded X.509 (.CER)**.
4. Copy the resulting `.cer` file to the certctl host. Rename to
`.pem` (the bytes are identical; certctl's PEM loader accepts
either extension).
5. Set `CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CONNECTOR_CERT_PATH` to
the file path.
6. If you have multiple Connectors in HA, repeat steps 1-3 on each
and concatenate the PEM blocks into one bundle file.
When the operator rotates the Connector signing cert (typically once
every few years per Microsoft's Connector lifecycle), repeat the
extraction, overwrite the on-disk file, then send `SIGHUP` to the
certctl process. The trust holder swaps atomically; bad files (parse
error, expired cert) keep the OLD pool in place so a half-rotation
doesn't take Intune enrollment down.
## Troubleshooting
The dispatcher emits a typed metric label per failure mode plus a
matching audit-log entry. The table below maps the label to the most
common root cause and the operator action.
| Counter label | Symptom | Root cause + fix |
|----------------------|------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `signature_invalid` | Every enrollment from a specific profile failing | Trust anchor mismatch — the Connector's signing cert was rotated and certctl wasn't reloaded. Re-extract the cert ([trust anchor extraction](#trust-anchor-extraction)), overwrite the file, send `SIGHUP`. |
| `claim_mismatch` | Some enrollments from one Intune SCEP profile failing | The Intune SCEP profile's SAN config doesn't match what the device CSR actually has. Compare the `recent failures` table's claim row to the device's CSR; usually a SAN format mismatch (e.g. claim wants UPN, CSR has DNS). |
| `expired` | All enrollments failing on a date boundary | Either clock skew between the Connector host and certctl (NTP both ends) OR the Connector's signing cert is past `NotAfter`. The certctl preflight catches an expired trust anchor at boot; check the Monitoring tab's expiry countdown. |
| `not_yet_valid` | All enrollments failing | Reverse clock skew (certctl's clock is BEHIND the Connector's). Sync via NTP. |
| `wrong_audience` | All enrollments from a profile failing | `INTUNE_AUDIENCE` doesn't match the URL the Connector is configured to call. Either fix `INTUNE_AUDIENCE` to match the operator URL, or unset it (defense-in-depth then disabled — the claim's exp + sig still gate). |
| `replay` | Sporadic per-device failures, mostly during retries | The device retried the SAME challenge after the first one failed. The replay cache TTL is `INTUNE_CHALLENGE_VALIDITY` (default 60m). Either widen the device's retry window (Intune-side) or shorten validity. |
| `rate_limited` | A specific device hitting `429`-equivalent failures | The device exceeded `INTUNE_PER_DEVICE_RATE_LIMIT_24H` (default 3). If legitimate (post-wipe + recovery + first-cert all in 24h), bump the cap. If suspicious, this is the limiter doing its job — investigate the device. |
| `unknown_version` | Sudden onset of failures across the entire fleet | Microsoft shipped a new Connector version with a `version` claim certctl doesn't understand. Open an issue on the certctl repo with the failing claim payload (anonymized); the parser dispatcher accepts new versions in ~30 LoC. |
| `malformed` | Sporadic, low-volume | Malformed challenge bytes — almost always a network proxy mangling the request body, or the Connector logging itself out mid-handshake. Capture a packet trace; the Connector should re-emit on the next device retry. |
| `compliance_failed` | V3-Pro only | The pluggable compliance check returned non-compliant. The audit-log details carries the reason string from Microsoft Graph. V2 deployments never see this counter tick. |
## Operational monitoring (SCEP Administration → Intune Monitoring tab)
The admin GUI surface for SCEP lives at `/scep` and is structured as
three tabs: **Profiles** (default landing — every configured SCEP
profile, lean cards with always-present fields), **Intune Monitoring**
(the Intune-specific deep-dive described below), and **Recent Activity**
(full SCEP audit log filter). Operators monitoring an Intune deployment
spend most of their time on the Intune Monitoring tab, deep-linkable via
`/scep?tab=intune` or the legacy alias `/scep/intune`. The Profiles tab
gives the at-a-glance per-profile health (RA cert expiry, mTLS status,
Intune enabled/disabled badge, challenge-password-set indicator) and a
"View Intune details →" link from each Intune-enabled card that switches
into this tab filtered to that profile.
The Intune Monitoring tab shows:
- **Per-profile cards** — one card per SCEP profile, with the trust
anchor expiry countdown badge:
- `green` ≥ 30 days remaining
- `amber` 7-30 days remaining (rotate soon)
- `red` < 7 days remaining
- `EXPIRED` past `NotAfter`
- **Live counters** — the per-status enrollment counts polled every
30s. The order in the grid puts `success` first (vanity) and
failure modes after.
- **Recent failures table** — the last 50 audit-log events with
action `scep_pkcsreq_intune` or `scep_renewalreq_intune`, sorted
by timestamp descending. Polled every 60s.
- **Trust anchor reload button** — confirms via modal then issues
`POST /api/v1/admin/scep/intune/reload-trust` (the SIGHUP-equivalent).
Bad reloads keep the OLD pool in place; the modal stays open with
the underlying error so the operator can correct the file and retry.
Three admin endpoints back the page:
- `GET /api/v1/admin/scep/profiles` — per-profile snapshot for the
Profiles tab; surfaces RA cert subject + NotAfter + days-to-expiry,
mTLS sibling-route status + bundle path, challenge-password-set flag,
and an optional `intune` sub-block for Intune-enabled profiles.
- `GET /api/v1/admin/scep/intune/stats` — Intune-specific deep-dive
for the Intune Monitoring tab; per-status counters + trust anchor
pool details. Backward-compat shape preserved from Phase 9.
- `POST /api/v1/admin/scep/intune/reload-trust` — SIGHUP-equivalent
trust anchor reload, body `{"path_id": "<pathID>"}`.
All three are M-008 admin-gated. Non-admin Bearer callers get HTTP 403
+ a clear message; the GUI hides the page entirely for non-admin users
(UX hint; server-side enforcement is independent).
### Recommended alert thresholds
The counters are exposed in the GUI as snapshots; if you wrap them
in a Prometheus exporter (V3-Pro plug-in seam — V2 doesn't ship a
`/metrics` surface today), reasonable starting thresholds:
- `signature_invalid` rate > 0 for > 5 minutes → page on-call. The
trust anchor is stale; the operator missed a SIGHUP after a
Connector rotation.
- `claim_mismatch` rate > 0 sustained > 1 hour → notify (not page).
An Intune SCEP profile is misconfigured; an admin needs to fix
the SAN definition or the operator's CertificateProfile.
- `replay` rate climbing → notify. Either an aggressive retry policy
on the device side OR active replay attempts. Cross-reference
source IPs in the audit log.
- `rate_limited` for a single device > 1 per hour → notify. Either
legitimate enrollment storm (post-wipe scenarios) or a compromised
Connector signing key.
- Trust anchor `days_to_expiry` < 30 on any profile → notify; rotate
the Connector's signing cert before the cliff.
## Limitations
This bundle is V2-free. The following capabilities are deferred to
V3-Pro:
- **Native Microsoft Graph integration.** certctl validates the
Connector's signed challenge but doesn't call Microsoft's API
directly — the Connector already did that. V3-Pro could ship a
Graph client that pulls device-compliance state in addition to
the challenge claim.
- **Conditional Access compliance gating.** The dispatcher exposes a
nil-default `ComplianceCheck` hook. V3-Pro plugs in a Microsoft
Graph compliance lookup before issuance; non-compliant devices
fail with a typed `compliance_failed` failInfo.
- **Per-tenant trust anchors.** V2 has one trust anchor pool per
SCEP profile; V3-Pro could support per-AAD-tenant anchor scoping
for MSPs running shared certctl deployments across customers.
- **OCSP stapling at SCEP-response time.** The CertRep doesn't carry
a stapled OCSP response today; certificate validators look up OCSP
via the `id-pkix-ocsp` extension on the issued cert. V3-Pro could
staple inline.
- **Auto-discovery of the Connector signing cert.** V2 requires the
operator to extract the cert manually and configure the path.
V3-Pro could pull from a Microsoft-published endpoint (with the
appropriate trust constraints).
These deferrals are deliberate, not oversights. The V2 surface
covers every operationally-required path for a single-tenant
enterprise replacing NDES; V3-Pro adds the multi-tenant + native-API
features procurement teams sometimes ask for.
## Microsoft support statement
Microsoft documents the Intune Certificate Connector as
**RFC-8894-compliant** and supports its use against any RFC 8894
SCEP server. The relevant Microsoft Learn pages:
- [Intune Certificate Connector overview](https://learn.microsoft.com/en-us/mem/intune/protect/certificate-connector-overview) —
documents the Connector's architecture and explicitly notes it
speaks RFC-8894-compliant SCEP.
- [Use SCEP certificate profiles in Intune](https://learn.microsoft.com/en-us/mem/intune/protect/certificates-scep-configure) —
the operator-facing setup guide, with the SCEP server URL field
the migration playbook above edits.
- [Validate setup of Intune Certificate Connector](https://learn.microsoft.com/en-us/mem/intune/protect/certificate-connector-install) —
the install-validation checklist; useful when troubleshooting
Connector-side failures vs. certctl-side failures.
certctl's role per Microsoft's framing: a third-party SCEP server
that the Connector posts to. Microsoft supports this topology; only
certctl's own RFC 8894 implementation is in scope for certctl
support. The end-to-end Connector → certctl → issuer flow is
exercised in `internal/api/handler/scep_intune_e2e_test.go` and
the golden-file fixtures in `internal/scep/intune/testdata/`.
## Related docs
- [`legacy-est-scep.md`](legacy-est-scep.md) — the per-profile SCEP
setup guide + RFC 8894 reference + mTLS sibling route. Read this
first if you're not already running certctl SCEP for non-Intune
fleets.
- [`architecture.md`](architecture.md) — overall control-plane
architecture; Security Model section calls out the Intune trust
anchor as a sensitive operator-configured surface.
- [`features.md`](features.md) — every `CERTCTL_*` env var,
including the per-profile `CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_*`
family.
- [`tls.md`](tls.md) — TLS bootstrap for the certctl control plane;
prerequisite for any production deploy.
+91
View File
@@ -0,0 +1,91 @@
# Deployment Vendor Compatibility Matrix
> Deploy-hardening II master bundle deliverable. The procurement-team
> headline doc — SOC 2 / PCI auditors paste this into evidence packs.
> Per frozen decision 0.14: a (connector × vendor-version) cell is
> "verified" only when ALL apply: ≥1 happy-path e2e passes against
> the real sidecar; ≥1 specific-quirk test for that version passes;
> operator manual smoke completed at least once on a real (non-CI)
> instance of that vendor version.
## Status legend
- **✓** — verified per the three-criterion bar above
- **CI** — happy-path + quirk e2e green in CI; operator manual smoke
pending (the third criterion)
- **mock** — verified against the in-tree mock; real-vendor validation
is the operator's tier above
- **pending** — planned; tests written; sidecar not yet wired
- **n/a** — combination not applicable
Per frozen decision 0.1: only LTS + current-stable versions per
vendor. EOL versions explicitly excluded.
## Matrix
| Connector | Vendor | Version | Status | Known Issues | Workaround | E2E Test Name(s) |
|---|---|---|---|---|---|---|
| **NGINX** | nginx.org | 1.25 LTS | CI | SSL session cache holds old cert ~5min | `ssl_session_timeout 5m;` (default) — operator-tunable | `TestVendorEdge_NGINX_SSLSessionCacheHoldsOldCert_E2E` |
| NGINX | nginx.org | 1.27 stable | CI | (same) | (same) | (same) |
| **Apache httpd** | httpd.apache.org | 2.4 LTS | CI | mod_ssl multi-vhost ownership | per-vhost cert config; SSLCertificateFile per `<VirtualHost>` | `TestVendorEdge_Apache_MultiVhostCertByVhost_E2E` |
| **HAProxy** | haproxy.org | 2.6 LTS | CI | reload vs restart semantics | use `systemctl reload haproxy` not `restart` | `TestVendorEdge_HAProxy_ReloadPreservesConnectionsViaSocketActivation_E2E` |
| HAProxy | haproxy.org | 2.8 | CI | (same) | (same) | (same) |
| HAProxy | haproxy.org | 3.0 | CI | (same) | (same) | (same) |
| **Traefik** | traefik.io | 2.x | CI | static-config cert paths require restart | use dynamic file-provider config | `TestVendorEdge_Traefik_StaticConfigRequiresRestart_DocumentedAsLimitation_E2E` |
| Traefik | traefik.io | 3.x | CI | (same) | (same) | (same) |
| **Caddy** | caddyserver.com | 2.x | CI | admin API auth lockdown breaks default deploy | set `Caddy.AdminAuthorizationHeader` per-target | `TestVendorEdge_Caddy_AdminAPILockedDownWithAuth_DeployUsesConfiguredAuthHeaders_E2E` |
| **Envoy** | envoyproxy.io | 1.30 | CI | file-mode SDS only in V2; gRPC SDS V3-Pro | use SDS=file (default) | `TestVendorEdge_Envoy_SDSFileMode_DeployRewritesYAML_EnvoyHotReloads_E2E` |
| Envoy | envoyproxy.io | 1.32 | CI | (same) | (same) | (same) |
| **Postfix** | postfix.org | 3.6 | CI | per-listener cert binding | configure cert per-listener block | `TestVendorEdge_Postfix_MultiListenerCertBinding_DeployUpdatesCorrectListener_E2E` |
| Postfix | postfix.org | 3.8 | CI | (same) | (same) | (same) |
| **Dovecot** | dovecot.org | 2.3 | CI | submission/submissions port variants | configure both inet_listener blocks | `TestVendorEdge_Dovecot_SubmissionSubmissionsPortVariants_E2E` |
| **IIS** | microsoft.com | IIS 10 (Server 2019) | operator-playbook | Windows-host-only validation per [operator playbook](connector-iis.md#operator-validation-playbook-windows-host); app-pool recycle opt-in | `AppPoolRecycle: true` per-target if needed | `TestVendorEdge_IIS_AppPoolRecycle_OptInForCertChange_E2E` |
| IIS | microsoft.com | IIS 10 (Server 2022) | operator-playbook | (same) | (same) | (same) |
| **F5 BIG-IP** | f5.com | v15.1 LTS | mock | larger cert chain (>4 links) historical issue | use cert chain ≤4 links OR upgrade to v17 | `TestVendorEdge_F5_LargeCertChainHandling_E2E` |
| F5 BIG-IP | f5.com | v17.0 | mock | (chain limit lifted) | n/a | (same) |
| F5 BIG-IP | f5.com | v17.5 | mock | (same) | n/a | (same) |
| **SSH** | openssh.com | OpenSSH 8.x | CI | sftp subsystem may be disabled | connector falls back to scp | `TestVendorEdge_SSH_SFTPSubsystemAbsent_FallsBackToSCP_E2E` |
| SSH | openssh.com | OpenSSH 9.x | CI | (same) | (same) | (same) |
| **WinCertStore** | microsoft.com | Windows Server 2019 | operator-playbook | Windows-host-only validation per [operator playbook](connector-iis.md#operator-validation-playbook-windows-host); cert store ACL: NS vs IIS_IUSRS | configure store ACL per IIS app-pool identity | `TestVendorEdge_WinCertStore_CertStoreACL_NetworkServiceAccess_E2E` |
| WinCertStore | microsoft.com | Windows Server 2022 | operator-playbook | (same) | (same) | (same) |
| **JavaKeystore** | adoptium.net | JDK 11 LTS | pending | keytool `-importkeystore` semantics | use `KeytoolPath` config to pin to JDK | `TestVendorEdge_JavaKeystore_JDK11_vs_17_vs_21_KeytoolBehavior_E2E` |
| JavaKeystore | adoptium.net | JDK 17 LTS | pending | (same) | (same) | (same) |
| JavaKeystore | adoptium.net | JDK 21 LTS | pending | (same) | (same) | (same) |
| **Kubernetes** | kubernetes.io | 1.28 LTS | CI | kubelet sync ~60s for pod-mounted Secrets | `CERTCTL_K8S_DEPLOY_KUBELET_SYNC_TIMEOUT=60s` (default) | `TestVendorEdge_K8s_KubeletSyncWaitContract_DefaultTimeout60s_E2E` |
| Kubernetes | kubernetes.io | 1.30 | CI | (same) | (same) | (same) |
| Kubernetes | kubernetes.io | 1.31 current | CI | (same) | (same) | (same) |
## Quarterly re-pin cadence
Every sidecar `FROM` in `deploy/docker-compose.test.yml` carries a
SHA-256 digest pin per the H-001 CI guard. Operator re-pins
quarterly:
1. Pull the latest tag of each sidecar image.
2. Run the per-vendor e2e matrix against the new digest.
3. If green, update the digest in `docker-compose.test.yml` + this
matrix's "Status" column.
4. If red, file an issue against the connector + leave the digest
pinned to the last-known-good.
## How to add a new vendor version
1. Add a new sidecar entry to `deploy/docker-compose.test.yml` with
the new image digest.
2. Add a row to this matrix marking status as "pending".
3. Write `TestVendorEdge_<connector>_<edge>_E2E` test(s) that
exercise the vendor's known quirks against the new sidecar.
4. Once tests pass in CI, mark status "CI".
5. After operator manual smoke, mark status "✓".
## Per-connector deep-dive docs
For the top 5 most-deployed connectors:
- [NGINX deep-dive](connector-nginx.md)
- [Kubernetes deep-dive](connector-k8s.md)
- [IIS deep-dive](connector-iis.md)
- [Apache deep-dive](connector-apache.md)
- [F5 deep-dive](connector-f5.md)
Other connector docs live in [docs/connectors.md](connectors.md).