mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 17:22:07 +00:00
docs: Phase 2 mechanical file moves to subdirectory structure
Pure git mv operations; no content edits. Internal links remain pointing
at old paths and will be fixed in Phase 11. Per the Phase 1 audit
recommendations at cowork/docs-overhaul-phase-1-audit-2026-05-04/.
35 files moved across 8 audience-organized subdirectories:
docs/getting-started/ (5):
quickstart.md, concepts.md, examples.md, advanced-demo.md (was
demo-advanced.md), why-certctl.md
docs/reference/ (6):
architecture.md, api.md (was openapi.md), mcp.md,
intermediate-ca-hierarchy.md, deployment-model.md (was
deployment-atomicity.md), vendor-matrix.md (was
deployment-vendor-matrix.md)
docs/reference/protocols/ (6):
acme-server.md, acme-server-threat-model.md, scep-intune.md,
est.md, crl-ocsp.md, async-ca-polling.md (was async-polling.md)
docs/operator/ (4):
security.md, tls.md, database-tls.md, approval-workflow.md
docs/operator/runbooks/ (3):
cloud-targets.md (was runbook-cloud-targets.md), expiry-alerts.md
(was runbook-expiry-alerts.md), disaster-recovery.md
docs/migration/ (3):
from-certbot.md (was migrate-from-certbot.md), from-acmesh.md
(was migrate-from-acmesh.md), cert-manager-coexistence.md (was
certctl-for-cert-manager-users.md)
docs/compliance/ (4):
index.md (was compliance.md), soc2.md (was compliance-soc2.md),
pci-dss.md (was compliance-pci-dss.md), nist-sp-800-57.md (was
compliance-nist.md)
docs/contributor/ (4):
testing-strategy.md, test-environment.md (was test-env.md),
ci-pipeline.md, qa-test-suite.md (was qa-test-guide.md)
Deferred to later Phase 2 sub-phases:
- connectors.md split (Phase 4): docs/connectors.md +
docs/connector-{apache,f5,iis,k8s,nginx}.md still at top level
- testing-guide.md prune (Phase 5): docs/testing-guide.md still
at top level
- features.md disperse (Phase 6): docs/features.md still at top
level
- legacy-est-scep.md split (Phase 7): docs/legacy-est-scep.md
still at top level
- ACME walkthrough re-homing (Phase 8): three
docs/acme-*-walkthrough.md still at top level
- Upgrade docs archive (Phase 3): two docs/upgrade-*.md still
at top level
Cross-reference updates (Phase 11) will happen after all moves and
content edits land. Internal links to docs/* paths are temporarily
broken until that phase completes.
@@ -0,0 +1,196 @@
|
||||
# OpenAPI Specification Guide
|
||||
|
||||
certctl ships with a complete OpenAPI 3.1 specification at `api/openapi.yaml`. This spec documents all 78 API operations currently specified, every request/response schema, pagination conventions, authentication requirements, and error formats. It's the single source of truth for the documented REST API. (Note: The spec will be updated to include 7 additional certificate discovery endpoints from M18b.)
|
||||
|
||||
This guide covers how to use the spec for API exploration, client SDK generation, and integration testing.
|
||||
|
||||
## Where to Find It
|
||||
|
||||
The spec lives at `api/openapi.yaml` in the repository root. It's versioned alongside the code and updated with every API change.
|
||||
|
||||
```bash
# View the spec
cat api/openapi.yaml

# Count operations
grep "operationId:" api/openapi.yaml | wc -l
# 78 (includes health + ready, 7 discovery endpoints pending spec update)
```
|
||||
|
||||
## Viewing with Swagger UI
|
||||
|
||||
The fastest way to explore the API interactively is Swagger UI. Run it as a Docker container pointing at the spec:
|
||||
|
||||
```bash
# From the certctl repo root
docker run -p 8080:8080 \
  -e SWAGGER_JSON=/spec/openapi.yaml \
  -v "$(pwd)/api:/spec" \
  swaggerapi/swagger-ui
```
|
||||
|
||||
Open http://localhost:8080 to see the full API reference with "Try it out" buttons for every endpoint.
|
||||
|
||||
Alternatively, use Redoc for a cleaner read-only view:
|
||||
|
||||
```bash
docker run -p 8080:80 \
  -e SPEC_URL=/spec/openapi.yaml \
  -v "$(pwd)/api:/usr/share/nginx/html/spec" \
  redocly/redoc
```
|
||||
|
||||
## API Structure
|
||||
|
||||
The spec organizes endpoints into 16 tags:
|
||||
|
||||
| Tag | Endpoints | Description |
|-----|-----------|-------------|
| Certificates | 12 | CRUD, versions, renewal, deployment, revocation, deployments |
| CRL & OCSP | 3 | JSON CRL, DER CRL per issuer, OCSP responder |
| Issuers | 5 | CA connector management |
| Targets | 5 | Deployment target management |
| Agents | 7 | Registration, heartbeat, CSR submission, work polling |
| Jobs | 5 | Job queue with approve/reject |
| Policies | 5 | Policy rules and violations |
| Profiles | 5 | Certificate enrollment profiles |
| Teams | 5 | Team management |
| Owners | 5 | Certificate owners |
| Agent Groups | 5 | Dynamic agent grouping |
| Audit | 2 | Immutable audit trail |
| Notifications | 3 | Notification events |
| Stats | 5 | Dashboard statistics |
| Metrics | 1 | System metrics |
| Health | 3 | Health, readiness, auth info |
|
||||
|
||||
## Authentication
|
||||
|
||||
The spec declares a `bearerAuth` security scheme applied globally. All endpoints under `/api/v1/` require a Bearer token by default:
|
||||
|
||||
```bash
# The default compose stack uses a self-signed cert; pin with --cacert
curl --cacert ./deploy/test/certs/ca.crt \
  -H "Authorization: Bearer your-api-key" \
  https://localhost:8443/api/v1/certificates
```
|
||||
|
||||
Three endpoints are exempt from auth (declared with `security: []` in the spec): `/health`, `/ready`, and `/api/v1/auth/info`. The auth info endpoint tells clients whether authentication is enabled and what type is required — useful for GUIs that need to show/hide a login screen.
|
||||
|
||||
## Pagination Convention
|
||||
|
||||
All list endpoints follow the same pagination pattern:
|
||||
|
||||
**Request parameters:**
|
||||
- `page` (integer, default 1) — page number
|
||||
- `per_page` (integer, default 50, max 500) — results per page
|
||||
|
||||
**Response envelope:**
|
||||
```json
{
  "data": [...],
  "total": 150,
  "page": 1,
  "per_page": 50
}
```
|
||||
|
||||
Certificates also support cursor-based pagination for large datasets:
|
||||
- `cursor` (string) — opaque cursor token from previous response
|
||||
- `page_size` (integer) — results per page when using cursor mode
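
The page-based convention above can be sketched in Go as follows. This is a minimal illustration, not part of certctl: the helper names `pageCount` and `pageURL` are hypothetical, and the values come from the example envelope (`total` 150, `per_page` 50).

```go
package main

import (
	"fmt"
	"net/url"
)

// pageCount returns how many pages a client must fetch to see `total`
// items at `perPage` items per page (ceiling division).
func pageCount(total, perPage int) int {
	if total <= 0 || perPage <= 0 {
		return 0
	}
	return (total + perPage - 1) / perPage
}

// pageURL builds the URL for one page of a list endpoint using the
// page / per_page query parameters described in this section.
func pageURL(base, path string, page, perPage int) string {
	q := url.Values{}
	q.Set("page", fmt.Sprint(page))
	q.Set("per_page", fmt.Sprint(perPage))
	return base + path + "?" + q.Encode()
}

func main() {
	// Walk every page implied by the example envelope.
	for p := 1; p <= pageCount(150, 50); p++ {
		fmt.Println(pageURL("https://localhost:8443", "/api/v1/certificates", p, 50))
	}
}
```

A real client would read `total` from the first response and stop once `page` reaches the computed page count; each request still needs the `Authorization: Bearer` header.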
|
||||
|
||||
## Generating Client SDKs
|
||||
|
||||
The OpenAPI spec can generate typed client libraries for any language. Here are examples using common generators:
|
||||
|
||||
### TypeScript (openapi-typescript-codegen)
|
||||
|
||||
```bash
npx openapi-typescript-codegen \
  --input api/openapi.yaml \
  --output src/generated/certctl \
  --client axios
```
|
||||
|
||||
### Python (openapi-python-client)
|
||||
|
||||
```bash
pip install openapi-python-client
openapi-python-client generate --path api/openapi.yaml
```
|
||||
|
||||
### Go (oapi-codegen)
|
||||
|
||||
```bash
go install github.com/oapi-codegen/oapi-codegen/v2/cmd/oapi-codegen@latest
oapi-codegen -generate types,client -package certctl api/openapi.yaml > certctl_client.go
```
|
||||
|
||||
### Java (OpenAPI Generator)
|
||||
|
||||
```bash
npx @openapitools/openapi-generator-cli generate \
  -i api/openapi.yaml \
  -g java \
  -o generated/java-client
```
|
||||
|
||||
## Validating the Spec
|
||||
|
||||
Verify the spec is valid OpenAPI 3.1:
|
||||
|
||||
```bash
# Using spectral (recommended)
npx @stoplight/spectral-cli lint api/openapi.yaml

# Using swagger-cli
npx @apidevtools/swagger-cli validate api/openapi.yaml
```
|
||||
|
||||
## Using with Postman
|
||||
|
||||
Import the spec directly into Postman:
|
||||
|
||||
1. Open Postman → Import → File → select `api/openapi.yaml`
2. Postman creates a collection with all 78 documented operations organized by tag
3. Set the `baseUrl` variable to `https://localhost:8443` (HTTPS-only as of v2.2)
4. Add an `Authorization: Bearer your-api-key` header to the collection
5. Import the demo stack CA bundle (`deploy/test/certs/ca.crt`) into Postman's Settings → Certificates → CA Certificates, or disable certificate verification for the `localhost` host (Settings → General → SSL certificate verification)
|
||||
|
||||
## Key Schemas
|
||||
|
||||
The spec defines typed schemas for all domain objects. Key schemas to know:
|
||||
|
||||
| Schema | Description |
|--------|-------------|
| `ManagedCertificate` | Core certificate record with status, expiry, owner, tags, profile |
| `CertificateVersion` | Individual cert version with PEM, serial, fingerprint, validity |
| `Agent` | Agent with heartbeat, metadata (OS, arch, IP, version), capabilities |
| `Job` | Job record with type, status (7 states), certificate/target references |
| `PolicyRule` | Policy with type (5 types), config, severity, enabled state |
| `CertificateProfile` | Enrollment profile with allowed key types, max TTL, constraints |
| `AuditEvent` | Immutable audit record with actor, action, resource, timestamp |
| `RevocationReason` | RFC 5280 reason code enum (8 values) |
| `DashboardSummary` | Aggregate stats (total certs, expiring, agents, jobs) |
|
||||
|
||||
## Integration Testing
|
||||
|
||||
Use the spec to generate contract tests that verify the API matches the spec:
|
||||
|
||||
```bash
# Using schemathesis for fuzz testing against the spec
pip install schemathesis
# The default compose stack uses a self-signed cert; point REQUESTS_CA_BUNDLE at the demo CA bundle
export REQUESTS_CA_BUNDLE=$(pwd)/deploy/test/certs/ca.crt
schemathesis run api/openapi.yaml \
  --base-url https://localhost:8443 \
  --header "Authorization: Bearer your-api-key"
```
|
||||
|
||||
This sends randomized valid requests to every endpoint and verifies the responses match the declared schemas.
|
||||
|
||||
## What's Next
|
||||
|
||||
- [MCP Server Guide](mcp.md) — AI-native access to the certctl API
|
||||
- [Quick Start](quickstart.md) — Get certctl running locally
|
||||
- [Connector Guide](connectors.md) — Build custom issuer and target connectors
|
||||
- [Architecture](architecture.md) — System design deep dive
|
||||
File diff suppressed because it is too large
@@ -0,0 +1,359 @@
|
||||
# Deployment Atomicity, Post-Deploy Verification, and Rollback
|
||||
|
||||
> Deploy-hardening I master bundle (v2.X.0). Operator + integrator
> reference for the atomic-write + post-deploy TLS verify +
> rollback pipeline that closes the procurement-checklist gap with
> commercial competitors (Venafi, DigiCert Certificate Manager,
> Sectigo).
|
||||
|
||||
## 1. Overview
|
||||
|
||||
Before deploy-hardening I, certctl's target connectors used
|
||||
duplicated `os.WriteFile` flows. A failure mid-deploy could leave
|
||||
a target with a renewed cert but no chain (or vice versa); a
|
||||
reload-fail produced a half-deployed state that required manual
|
||||
rollback; a wrong-vhost cert was silent until users reported it.
|
||||
|
||||
Deploy-hardening I closes three procurement-checklist gaps in
|
||||
a single shared primitive:
|
||||
|
||||
| Gap | Pre-bundle | Post-bundle |
|---|---|---|
| **Atomic deploy with rollback** | F5 only (transactional API) | 12 of 13 connectors via `deploy.Apply` (K8s pending Bundle 2; see [Section 1.5](#15-audit-closure-status-2026-05-02-deployment-target-audit)) |
| **Post-deploy TLS verification** | None | NGINX/Apache/HAProxy/Traefik/Caddy/Envoy/Postfix all do TLS handshake + SHA-256 fingerprint compare; fail → rollback |
| **Vendor-specific deployment recipes** | Light docs | (Bundle II: `cowork/deploy-hardening-ii-prompt.md`) |
|
||||
|
||||
This document describes the operator-visible surface. The Go-level
|
||||
contract lives at `internal/deploy/doc.go`.
|
||||
|
||||
## 1.5. Audit closure status (2026-05-02 deployment-target audit)
|
||||
|
||||
The 2026-05-02 deployment-target coverage audit
|
||||
(`cowork/deployment-target-audit-2026-05-02/RESULTS.md`) tightened the
|
||||
atomic + rollback contract on the connectors below. All bundles in the
|
||||
table are committed to `master` as of this section's last edit; commit
|
||||
hashes pin to the canonical landing commit for each piece of work.
|
||||
|
||||
| Connector | Bundle | Commit | Closes |
|-----------------|-----------|-----------|--------|
| envoy | Bundle 3 | `d8cd981` | atomic SDS JSON write + post-deploy watcher pickup poll |
| traefik | Bundle 4 | `37634e6` | single `deploy.Apply` Plan + all-files atomicity + rollback |
| iis | Bundle 5 | `223f279` | pre-deploy `Get-WebBinding` snapshot + on-failure binding rollback |
| ssh | Bundle 6 | `eb39059` | pre-deploy SFTP snapshot + reload-failure rollback |
| wincertstore | Bundle 7 | `1dd1dd4` | `Get-ChildItem` snapshot + on-import-failure rollback |
| javakeystore | Bundle 8 | `87e0009` | `keytool -exportkeystore` snapshot + on-import-failure rollback + operator playbook for argv password |
| caddy | Bundle 9 | `8cda860` | duration metric fix + file-mode PEM validate + api-mode SHA-256 idempotency |
| postfix/dovecot | Bundle 11 | `88e8881` | applyDefaults + verify-fails-rollback test pin under Mode=dovecot |
|
||||
|
||||
**Outstanding from the same audit:**
|
||||
|
||||
- **Bundle 2 (k8ssecret).** The production `realK8sClient` is still a
|
||||
stub (see Section 3 / row `k8ssecret` below). Replacing it with a
|
||||
real `k8s.io/client-go` implementation + `ResourceVersion` plumbing
|
||||
+ post-deploy SHA-256 verify + kubelet sync poll is the remaining
|
||||
V2 P0 blocker. Tracking prompt:
|
||||
`cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md`.
|
||||
|
||||
Bundle 10 (per-connector loadtest harness, commit `6286cd4`) does not
|
||||
modify the per-connector contract table; it's a CI / observability
|
||||
addition documented separately at `deploy/test/loadtest/README.md`.
|
||||
|
||||
The original Bundle 1 audit spec read "soften the IIS / SSH /
|
||||
WinCertStore / JavaKeystore rollback claims first while bundles 5–8
|
||||
catch the implementation up". Execution order inverted that loop —
|
||||
Bundles 3–11 shipped before the doc-realignment commit, so the rows
|
||||
in Section 3 below are honest as-shipped without ever needing a
|
||||
softening pass. The K8s row is the one exception, and Section 3's
|
||||
notes call it out explicitly.
|
||||
|
||||
## 2. The atomic-write primitive — `Plan` / `Apply`
|
||||
|
||||
`internal/deploy.Apply(ctx, plan)` is the load-bearing entry
|
||||
point. Connectors build a `Plan` describing one or more files +
|
||||
their PreCommit (validate) and PostCommit (reload) hooks; Apply
|
||||
executes them all-or-nothing.
|
||||
|
||||
```go
plan := deploy.Plan{
	Files: []deploy.File{
		{Path: "/etc/nginx/certs/cert.pem", Bytes: certPEM, Mode: 0644},
		{Path: "/etc/nginx/certs/chain.pem", Bytes: chainPEM, Mode: 0644},
		{Path: "/etc/nginx/certs/key.pem", Bytes: keyPEM, Mode: 0640},
	},
	PreCommit: func(ctx context.Context, tempPaths map[string]string) error {
		// Run `nginx -t` against the staged config; the bytes are already
		// written to <path>.certctl-tmp.<unix-nanos>.
		return runValidate(ctx, "nginx -t")
	},
	PostCommit: func(ctx context.Context) error {
		return runReload(ctx, "nginx -s reload")
	},
}
res, err := deploy.Apply(ctx, plan)
```
|
||||
|
||||
Apply's algorithm:
|
||||
|
||||
1. Per-file mutex acquired (sync.Map; coarse-grained per-path
   serialization).
2. SHA-256 idempotency short-circuit. If every File's destination
   already matches, return `Result.SkippedAsIdempotent=true`
   without firing PreCommit/PostCommit.
3. Pre-deploy backup: copy each existing destination to
   `<path>.certctl-bak.<unix-nanos>`.
4. Write each File's bytes to `<path>.certctl-tmp.<unix-nanos>`
   in the destination directory (same-filesystem rename).
5. Apply ownership (chown + chmod) to each temp file BEFORE
   rename so the swap is atomic with the right perms.
6. Call `PreCommit(ctx, tempPaths)`. On error: clean up temps;
   return `ErrValidateFailed`.
7. `os.Rename` each temp → final. POSIX guarantees the rename is
   atomic because the temp file sits on the same filesystem as its
   destination (step 4).
8. Call `PostCommit(ctx)`. On error: restore each backup; re-call
   PostCommit. If second PostCommit also fails: return
   `ErrRollbackFailed` (operator-actionable).
9. Janitor: prune backups beyond `Plan.BackupRetention`
   (default 3, -1 to disable).
|
||||
|
||||
## 3. Per-connector atomic contract
|
||||
|
||||
| Connector | PreCommit (validate) | PostCommit (reload) | Post-deploy verify | Quirks |
|---|---|---|---|---|
| nginx | `nginx -t` | `nginx -s reload` | TLS handshake to `host:443` | Default key mode 0640 (worker reads via group) |
| apache | `apachectl configtest` | `apachectl graceful` | TLS handshake | Default key mode 0600; per-distro user (apache2/apache/httpd) |
| haproxy | `haproxy -c -f <cfg>` | `systemctl reload haproxy` | TLS handshake | Combined PEM (cert+chain+key in one file); default mode 0600 |
| traefik | (none; file watcher) | (none; file watcher auto-reloads) | TLS handshake | atomic-write only; ValidateOnly returns sentinel |
| caddy (file mode) | (none) | (none; file watcher) | TLS handshake | atomic-write replaces os.WriteFile |
| caddy (api mode) | Probe admin /config/ | POST /load (already atomic at admin server) | (admin server confirms) | ValidateOnly real impl probes admin API |
| envoy | (none; SDS file watcher) | (none; SDS file watcher) | TLS handshake | atomic-write replaces os.WriteFile |
| postfix | `postfix check` | `postfix reload` | TLS handshake to port 25 | Chain appended to cert if no ChainPath |
| dovecot | `doveconf -n` | `doveadm reload` | TLS handshake to port 993 | Same code path as postfix |
| f5 | (Authenticate probe) | (Transactional commit) | TLS handshake to VS | Already transactional; rollback automatic via failed commit |
| iis | (Get-WebSite probe) | (PowerShell cert install) | TLS handshake | Already explicit pre-deploy backup + post-rollback re-import |
| ssh | (Connect probe) | (SCP upload + remote chmod) | `tls.Dial` to remote TLS port | Pre-deploy SCP backup of remote files |
| wincertstore | (Get-ChildItem Cert:\) | (Import-PfxCertificate) | (admin probe) | Get-ChildItem snapshot for rollback |
| javakeystore | (`keytool -list`) | (`keytool -importkeystore`) | (admin probe) | keytool snapshot; rollback via `keytool -delete` + re-import |
| k8ssecret | (V2 blocker; see note below) | (V2 blocker; see note below) | (V2 blocker; see note below) | **V2 blocker: Bundle 2 of the 2026-05-02 deployment-target audit.** Production `realK8sClient` at `internal/connector/target/k8ssecret/k8ssecret.go:397-420` is a stub (every method returns `"real Kubernetes client not implemented — use NewWithClient for tests"`). The SHA-256 post-deploy verify and kubelet sync poll are designed but not yet implemented; production deploys to a real cluster fail with "not implemented" until Bundle 2 lands. Test mocks via `NewWithClient` work today. Tracking prompt: `cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md`. |
|
||||
|
||||
> **Postfix vs Dovecot mode**: see "Choosing Mode=postfix vs Mode=dovecot" in
|
||||
> `docs/connectors.md` for the per-mode defaults (cert/key paths, validate +
|
||||
> reload commands), the dual-deploy guidance for mail servers running both
|
||||
> daemons, and the test-pin reference (Bundle 11 commit `88e8881`).
|
||||
|
||||
## 4. Post-deploy TLS verification
|
||||
|
||||
Frozen decision 0.3 (deploy-hardening I): post-deploy verify is
|
||||
**ON by default** when the operator configures
|
||||
`PostDeployVerify.Endpoint`. Per-target opt-out via
|
||||
`PostDeployVerify.Enabled = false`.
|
||||
|
||||
The connector-side flow:
|
||||
|
||||
```go
// After Apply returns successfully, the connector dials the
// configured endpoint, pulls the leaf cert SHA-256, and compares.
res := tlsprobe.ProbeTLS(ctx, "nginx-test:443", 10*time.Second)
if res.Fingerprint != certPEMToFingerprint(deployedCertPEM) {
	// Mismatch: wrong vhost, NGINX serving cached cert,
	// load-balanced target hit a different pod, etc.
	rollbackToBackups(ctx, applyResult.BackupPaths)
	emitAlert("post-deploy verify SHA-256 mismatch")
}
```
|
||||
|
||||
Retries use **exponential backoff** (default 3 attempts; 1s initial, 16s cap). This defends
against load-balanced targets where the verify might hit a
different pod that hasn't picked up the new cert yet. The wait doubles on each retry
(1s → 2s → 4s → 8s → 16s) until it reaches the cap, so with the default 3 attempts only the
first steps of that sequence are reached; the growing delay gives the LB fleet time to
converge before giving up. Operators preserving V2 linear semantics
(every attempt waits the same interval) set `post_deploy_verify_max_backoff` equal to
`post_deploy_verify_backoff`.
|
||||
|
||||
```yaml
post_deploy_verify:
  enabled: true
  endpoint: "nginx.svc.cluster.local:443"
  timeout: 10s
post_deploy_verify_attempts: 3
post_deploy_verify_backoff: 1s
post_deploy_verify_max_backoff: 16s
```
|
||||
|
||||
## 5. Rollback semantics
|
||||
|
||||
Rollback fires automatically on three triggers:
|
||||
|
||||
1. **PostCommit (reload) fails** → Apply restores backups and retries
   the reload. It returns `ErrReloadFailed` if that second reload
   succeeds (the deploy is rolled back and the target keeps serving
   the old cert) or `ErrRollbackFailed` if the second reload also fails.
2. **Post-deploy verify fails** → Connector manually triggers
   rollback (Apply already returned successfully). Backups are
   restored + reload is invoked again. Same escalation path on
   second failure.
3. **Mid-loop rename fails** (rare; only with cross-filesystem
   misuse) → Apply rolls back the renames that already
   succeeded.
|
||||
|
||||
`ErrRollbackFailed` is operator-actionable. The destination is in
|
||||
a known-bad state; operators must either:
|
||||
- Restore from `Result.BackupPaths` manually + run `<reload command>`
|
||||
- Push a fresh known-good cert via the next deploy cycle
|
||||
|
||||
The `certctl_deploy_rollback_total{outcome="also_failed"}` metric
|
||||
is the alert target.
|
||||
|
||||
## 6. ValidateOnly — dry-run mode
|
||||
|
||||
`target.Connector.ValidateOnly(ctx, request)` runs the validate
|
||||
step without touching the live cert. Connectors that can't
|
||||
dry-run (Traefik / Envoy / Caddy file mode) return
|
||||
`target.ErrValidateOnlyNotSupported`.
|
||||
|
||||
| Connector | ValidateOnly |
|---|---|
| nginx | `nginx -t` |
| apache | `apachectl configtest` |
| haproxy | `haproxy -c -f <cfg>` |
| postfix/dovecot | `postfix check` / `doveconf -n` |
| caddy (api) | GET /config/ probe |
| caddy (file) / traefik / envoy | `ErrValidateOnlyNotSupported` |
| f5 | `client.Authenticate()` probe |
| iis | `Get-WebSite -Name <SiteName>` |
| ssh | `client.Connect()` probe |
| wincertstore | `Get-ChildItem Cert:\<loc>\<store>` |
| javakeystore | `keytool -list -keystore <path>` |
| k8ssecret | `client.GetSecret()` RBAC probe |
|
||||
|
||||
Operators preview a deploy via the agent's `--dry-run` flag (or
|
||||
the equivalent CLI invocation).
|
||||
|
||||
## 7. File ownership + mode preservation
|
||||
|
||||
The single most common silent-failure mode pre-bundle: agent runs
|
||||
as root, calls `os.WriteFile(path, bytes, 0600)`, locks NGINX out
|
||||
of the existing nginx:nginx 0640 key file.
|
||||
|
||||
Per frozen decision 0.7, `deploy.Apply` resolves ownership via
|
||||
this precedence:
|
||||
|
||||
1. Explicit `File.Mode` / `File.Owner` / `File.Group` (per-target
|
||||
config) → use as given.
|
||||
2. Existing destination file → preserve its `chown` + `chmod`.
|
||||
3. `Plan.Defaults.Mode` / `.Owner` / `.Group` → use as fallback
|
||||
for new files.
|
||||
4. Nothing set → `os.WriteFile` default (0644) for new files;
|
||||
preserved for existing.
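
The precedence list above amounts to "first value that is set wins". A minimal sketch for the mode component, assuming unset values are modelled as nil pointers (the `resolveMode` and `mode` helpers are hypothetical; owner and group would follow the same shape):

```go
package main

import (
	"fmt"
	"os"
)

// resolveMode applies the precedence: explicit per-file mode, then the
// existing destination's mode, then the plan default, then 0644.
func resolveMode(explicit, existing, planDefault *os.FileMode) os.FileMode {
	for _, m := range []*os.FileMode{explicit, existing, planDefault} {
		if m != nil {
			return *m
		}
	}
	return 0o644
}

// mode is a small helper that returns a pointer to a file mode.
func mode(m os.FileMode) *os.FileMode { return &m }

func main() {
	// No explicit mode: the existing 0640 key file keeps its mode even
	// though the plan default is 0600.
	fmt.Printf("%04o\n", uint32(resolveMode(nil, mode(0o640), mode(0o600))))
}
```

This ordering is what protects the NGINX case described at the top of this section: an existing group-readable key keeps its mode unless the operator overrides it explicitly.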
|
||||
|
||||
Per-connector defaults (cross-distro, fall back to no-chown if
|
||||
no candidate user exists):
|
||||
|
||||
| Connector | Default user | Default group | Default cert mode | Default key mode |
|---|---|---|---|---|
| nginx | nginx → www-data | nginx → www-data | 0644 | 0640 |
| apache | apache → www-data → httpd | same | 0644 | 0600 |
| haproxy | haproxy | haproxy | n/a (combined PEM) | 0600 |
| postfix | postfix → dovecot → _postfix | same | 0644 | 0600 |
| traefik | (none) | (none) | 0644 | 0600 |
| envoy | (none) | (none) | 0644 | 0600 |
| caddy | (none) | (none) | 0644 | 0600 |
|
||||
|
||||
## 8. Per-target deploy mutex
|
||||
|
||||
Phase 2 of the master bundle: the agent (`cmd/agent/main.go`)
|
||||
serializes concurrent deploys to the same target ID via a
|
||||
`sync.Map[targetID]*sync.Mutex`. Granularity per frozen decision
|
||||
0.5: one mutex per target, NOT per (target, cert).
|
||||
|
||||
Cert deploy throughput at operator scale is on the order of tens
per minute, so coarse serialization is fine and simplifies
reasoning about reload-side race windows.
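
The `sync.Map` of per-target mutexes can be sketched as follows. This illustrates the pattern and is not the code in `cmd/agent/main.go`; the `targetLocks` type and `lockFor` method are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// targetLocks hands out exactly one mutex per target ID.
type targetLocks struct{ m sync.Map }

// lockFor returns the mutex for targetID, creating it on first use.
// LoadOrStore makes concurrent first calls agree on a single mutex.
func (t *targetLocks) lockFor(targetID string) *sync.Mutex {
	mu, _ := t.m.LoadOrStore(targetID, &sync.Mutex{})
	return mu.(*sync.Mutex)
}

func main() {
	var locks targetLocks
	var wg sync.WaitGroup
	deploys := 0

	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu := locks.lockFor("target-1")
			mu.Lock()
			defer mu.Unlock()
			deploys++ // stands in for one serialized deploy
		}()
	}
	wg.Wait()
	fmt.Println(deploys)
}
```

Because the key is the target ID alone, two certificates bound for the same target queue behind each other, while deploys to different targets proceed in parallel.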
|
||||
|
||||
## 9. Idempotency via SHA-256
|
||||
|
||||
Every `deploy.Apply` short-circuits when all File destinations
|
||||
already match SHA-256 of the new bytes. PreCommit + PostCommit do
|
||||
not fire; backups are not created; the result reports
|
||||
`SkippedAsIdempotent = true`.
|
||||
|
||||
This defends against agent-restart retry storms that would otherwise
hammer targets with no-op reloads. The operator-visible signal is
`certctl_deploy_idempotent_skip_total{target_type="..."}`.
|
||||
|
||||
## 10. Troubleshooting matrix
|
||||
|
||||
| Symptom | Root cause | Operator action |
|---|---|---|
| `ErrValidateFailed: nginx -t failed` | Validate command rejected the staged config | Read PreCommit's wrapped error for the nginx stderr; fix config |
| `ErrReloadFailed: nginx -s reload failed; rolled back` | Reload command failed; rollback succeeded; serving the OLD cert | Investigate why reload failed; re-deploy when fixed |
| `ErrRollbackFailed` | Reload AND rollback both failed; in known-bad state | Restore from `Result.BackupPaths` manually; run reload command directly; check disk space + ownership |
| `post-deploy TLS verify SHA-256 mismatch` | New cert deployed but a different cert is being served (cached, wrong vhost, stale pod in load balancer) | Check NGINX SSL session cache TTL; verify SNI; bump verify retries via `PostDeployVerifyAttempts` |
| `chown ... permission denied` (in agent log) | Non-root agent OR target user doesn't exist on host | Verify agent runs as root in production; check distro user (Debian: www-data, RHEL: nginx) |
| Backups accumulating in cert dir | BackupRetention misconfigured | Set `BackupRetention: 3` (default) or higher on per-target config |
| File world-readable after deploy | Default mode 0644 applied to new key file | Set explicit `KeyFileMode: 0640` (NGINX) or `KeyFileMode: 0600` (Apache) |
|
||||
|
||||
## 11. V3-Pro deferrals
|
||||
|
||||
Out of scope for the V2-free deploy-hardening I bundle:
|
||||
|
||||
- **Multi-region deployment coordination** — orchestration of N
|
||||
data-center deploys with operator approval gates per stage.
|
||||
- **Cert-pinning verification against mobile-app pin manifests**.
|
||||
- **SOC 2 evidence-report generator** — auto-export of the
|
||||
deploy audit trail in the format SOC 2 auditors expect.
|
||||
- **Customer-paid validation matrices** — vendor-version certified
|
||||
quirks (e.g. "tested on F5 v15.1 + v17.0 + v17.5"). See
|
||||
`cowork/deploy-hardening-ii-prompt.md` for the per-vendor
|
||||
edge-case audit + integration test sidecars.
|
||||
|
||||
## 12. Per-connector quick reference
|
||||
|
||||
Paste-able config snippets for the most-used connectors. Full
|
||||
field reference at `docs/connectors.md`.
|
||||
|
||||
### NGINX
|
||||
|
||||
```yaml
target_type: nginx
target_config:
  cert_path: /etc/nginx/certs/cert.pem
  chain_path: /etc/nginx/certs/chain.pem
  key_path: /etc/nginx/certs/key.pem
  reload_command: "nginx -s reload"
  validate_command: "nginx -t"
  cert_file_mode: 0644
  key_file_mode: 0640
  post_deploy_verify:
    enabled: true
    endpoint: "nginx.example.com:443"
    timeout: 10s
  backup_retention: 3
```
|
||||
|
||||
### HAProxy
|
||||
|
||||
```yaml
target_type: haproxy
target_config:
  pem_path: /etc/haproxy/certs/cert.pem
  reload_command: "systemctl reload haproxy"
  validate_command: "haproxy -c -f /etc/haproxy/haproxy.cfg"
  pem_file_mode: 0600
  post_deploy_verify:
    enabled: true
    endpoint: "haproxy.example.com:443"
```
|
||||
|
||||
### Traefik (file watcher; no reload command)
|
||||
|
||||
```yaml
target_type: traefik
target_config:
  cert_dir: /etc/traefik/certs
  cert_file: cert.pem
  key_file: key.pem
  post_deploy_verify:
    enabled: true
    endpoint: "traefik.example.com:443"
```
|
||||
|
||||
See per-connector tests at
|
||||
`internal/connector/target/<name>/<name>_atomic_test.go` for the
|
||||
full failure-mode matrix each connector handles.
|
||||
@@ -0,0 +1,231 @@
|
||||
# Intermediate CA hierarchy — operator runbook
|
||||
|
||||
Rank 8 of the 2026-05-03 deep-research deliverable. This page is the
|
||||
canonical reference for operators running certctl as a multi-level
|
||||
internal PKI.
|
||||
|
||||
The default `single`-mode flow (one operator-supplied sub-CA loaded
|
||||
from disk at boot) is unchanged and will keep working byte-for-byte
|
||||
forever. This page is for operators who need a real CA tree:
|
||||
|
||||
- FedRAMP boundary-CA deployments where the regulator requires
|
||||
separation of policy and issuing authorities.
|
||||
- Financial-services policy-CA deployments (one root, one policy CA
|
||||
per business unit, one issuing CA per environment).
|
||||
- OT / industrial control networks where the air-gapped root signs
|
||||
online sub-CAs that go in and out of service on a rotation.
|
||||
|
||||
## Concepts
|
||||
|
||||
`Issuer.HierarchyMode` is a per-issuer column on the `issuers` table.
|
||||
Two values are valid (the database default is `"single"` — back-compat
|
||||
byte-identical for unmigrated rows):
|
||||
|
||||
- `single` — pre-Rank-8 historical flow. The local connector loads a
|
||||
pre-signed CA cert+key from disk via `local.Config.CACertPath` /
|
||||
`local.Config.CAKeyPath`. Existing operators upgrade with no
|
||||
behavior change.
|
||||
- `tree` — the issuer's CAs are managed via the `intermediate_cas`
|
||||
table. Chain assembly walks the `parent_ca_id` foreign key from the
|
||||
issuing leaf CA up to the root and attaches the assembled chain to
|
||||
every `IssuanceResult`.
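
The `tree`-mode chain walk can be pictured with an in-memory stand-in for the `intermediate_cas` rows. This is a sketch only: the `ca` struct, its field names and `chainTo` are hypothetical, and the real implementation reads `parent_ca_id` from the database and assembles certificates rather than IDs.

```go
package main

import (
	"errors"
	"fmt"
)

// ca is a hypothetical in-memory stand-in for one intermediate_cas row.
type ca struct {
	ID       string
	ParentID string // empty for a root
}

// chainTo follows ParentID links from the issuing CA up to the root
// and returns the IDs in leaf-to-root order.
func chainTo(cas map[string]ca, issuingID string) ([]string, error) {
	var chain []string
	seen := map[string]bool{}
	for id := issuingID; id != ""; {
		c, ok := cas[id]
		if !ok {
			return nil, fmt.Errorf("unknown CA %q", id)
		}
		if seen[id] {
			return nil, errors.New("cycle in parent links")
		}
		seen[id] = true
		chain = append(chain, id)
		id = c.ParentID
	}
	return chain, nil
}

func main() {
	cas := map[string]ca{
		"root":    {ID: "root"},
		"policy":  {ID: "policy", ParentID: "root"},
		"issuing": {ID: "issuing", ParentID: "policy"},
	}
	fmt.Println(chainTo(cas, "issuing"))
}
```

The cycle guard is defensive; a foreign key alone does not rule out a row that points back at one of its own descendants.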
|
||||
|
||||
Each row in `intermediate_cas` is one CA cert (root, policy, issuing).
|
||||
The lifecycle is `created` → `active` → `retiring` → `retired`. The
|
||||
state column is a closed enum and validates at the service layer; the
|
||||
postgres CHECK constraint enforces it at the database layer too.
|
||||
|
||||
A CA's private key bytes are NEVER persisted on the row. The
|
||||
`key_driver_id` column is a reference (filesystem path / KMS key ID /
|
||||
HSM slot) that the `signer.Driver` resolves at sign time. A SQL
|
||||
injection or a row-leak surface MUST NEVER expose key bytes; only the
|
||||
reference can leak.
|
||||
|
||||
## Lifecycle states
|
||||
|
||||
```mermaid
stateDiagram-v2
    [*] --> created : CreateRoot / CreateChild
    created --> active : registration completes
    active --> retiring : Retire(confirm=false)
    retiring --> retired : Retire(confirm=true)
    retired --> [*]

    note right of retiring
        Drain start. CA stops issuing
        NEW children; existing children
        keep issuing until they retire.
    end note

    note right of retired
        Terminal. Refused if active children
        remain (ErrCAStillHasActiveChildren
        → HTTP 409). OCSP keeps responding
        for already-issued leaves until expiry.
    end note
```
|
||||
|
||||
Drain-first semantics: a CA in `retiring` state cannot terminalize to
|
||||
`retired` while it still has active children. The service layer
|
||||
returns `ErrCAStillHasActiveChildren`; the API surfaces HTTP 409. Drain
|
||||
the children first.
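
The two-phase retirement can be sketched as a small state function. This is an illustration under assumptions, not the service's code: the `retire` signature, the `activeChildren` count, and `errInvalidTransition` are hypothetical; the state names and `ErrCAStillHasActiveChildren` come from the text.

```go
package main

import (
	"errors"
	"fmt"
)

type State string

const (
	Created  State = "created"
	Active   State = "active"
	Retiring State = "retiring"
	Retired  State = "retired"
)

// ErrCAStillHasActiveChildren is named in the text; the API maps it to HTTP 409.
var ErrCAStillHasActiveChildren = errors.New("CA still has active children")

// errInvalidTransition is a hypothetical error for any other request.
var errInvalidTransition = errors.New("invalid state transition")

// retire models the two-phase call: confirm=false starts the drain,
// confirm=true terminalizes, but only once no active children remain.
func retire(current State, confirm bool, activeChildren int) (State, error) {
	switch {
	case current == Active && !confirm:
		return Retiring, nil
	case current == Retiring && confirm:
		if activeChildren > 0 {
			return current, ErrCAStillHasActiveChildren
		}
		return Retired, nil
	default:
		return current, errInvalidTransition
	}
}

func main() {
	s, err := retire(Active, false, 2)
	fmt.Println(s, err) // drain starts
	s, err = retire(s, true, 2)
	fmt.Println(s, err) // refused while children are active
	s, err = retire(s, true, 0)
	fmt.Println(s, err) // terminal
}
```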
|
||||
|
||||
## Common deployment patterns
|
||||
|
||||
### Pattern A — 4-level FedRAMP boundary CA
|
||||
|
||||
```mermaid
flowchart TD
    Root["Acme Root CA<br/>path_len=3<br/>offline air-gapped"]
    Policy["Acme Policy CA<br/>path_len=2<br/>FedRAMP-Moderate boundary"]
    IssA["Acme Issuing A<br/>path_len=1<br/>prod workload leaves"]
    IssB["Acme Issuing B<br/>path_len=0<br/>ephemeral pod identity"]
    Root --> Policy --> IssA --> IssB
```
|
||||
|
||||
Operator workflow:
|
||||
|
||||
1. Mint the root cert+key on the offline workstation. Move the cert
|
||||
PEM (no key) to the online operator workstation.
|
||||
2. `POST /api/v1/issuers/{id}/intermediates` with the empty
|
||||
`parent_ca_id` and `root_cert_pem` + `key_driver_id` populated
|
||||
(the operator pre-positions the root key file at the path the
|
||||
`key_driver_id` points to). The service validates RFC 5280 §3.2
|
||||
self-signed semantics + cross-checks the operator-supplied key
|
||||
matches the cert (rejects mismatched bundles at registration time
|
||||
with `ErrCAKeyMismatch`).
|
||||
3. `POST /api/v1/issuers/{id}/intermediates` with `parent_ca_id`
|
||||
pointing at the root for the Policy CA. The service generates the
|
||||
child key via `signer.Driver.Generate`, signs the child cert via
|
||||
the parent's signer (loaded from the parent's `key_driver_id`),
|
||||
and persists the new row with the next `path_len` value (the parent's
value minus 1 if unset). Repeat for each lower level.
|
||||
4. Set `Issuer.HierarchyMode = "tree"` on the issuer row + set the
|
||||
`treeIssuingCAID` connector field to point at the deepest CA
|
||||
(Acme Issuing B in the example above) — issued leaves chain via
|
||||
`AssembleChain` from B up to the root.
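
The root-registration checks in step 2 can be sketched with the standard library. This is an assumption-laden illustration: `validateRoot` and `newTestRoot` are hypothetical names, and the key is a raw in-process ECDSA key, whereas certctl resolves keys through `signer.Driver`. Only the error names and the self-signed / key-match checks come from the text.

```go
package main

import (
	"bytes"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// Error names follow the text; the values themselves are illustrative.
var (
	ErrCANotSelfSigned = errors.New("CA cert is not self-signed")
	ErrCAKeyMismatch   = errors.New("key does not match CA cert")
)

// validateRoot checks self-signed semantics (subject == issuer and the cert
// verifies under its own key), then that the supplied key matches the cert.
func validateRoot(cert *x509.Certificate, key *ecdsa.PrivateKey) error {
	if !bytes.Equal(cert.RawSubject, cert.RawIssuer) || cert.CheckSignatureFrom(cert) != nil {
		return ErrCANotSelfSigned
	}
	pub, ok := cert.PublicKey.(*ecdsa.PublicKey)
	if !ok || !pub.Equal(key.Public()) {
		return ErrCAKeyMismatch
	}
	return nil
}

// newTestRoot mints a throwaway self-signed ECDSA root for the demo.
func newTestRoot() (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "Example Root CA"},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageCRLSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

func main() {
	cert, key, err := newTestRoot()
	if err != nil {
		panic(err)
	}
	fmt.Println("matching key:", validateRoot(cert, key))
	other, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	fmt.Println("wrong key:", validateRoot(cert, other))
}
```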
|
||||
|
||||
### Pattern B — 3-level financial-services policy CA
|
||||
|
||||
```mermaid
flowchart TD
    Root["FinCo Root CA<br/>path_len=2"]
    Pol["FinCo Trading Policy CA<br/>path_len=1<br/>permitted DNS = trading.finco.example"]
    Iss["FinCo Trading Issuing CA<br/>path_len=0"]
    Root --> Pol --> Iss
```
|
||||
|
||||
Per business-unit name constraints: each policy CA carries a
|
||||
`PermittedDNSDomains` list scoped to the business unit (RFC 5280
|
||||
§4.2.1.10). The service enforces subset semantics — a child policy CA
|
||||
cannot widen the parent's permitted set, and cannot remove an
|
||||
excluded subtree. Operators submit `name_constraints` on the
|
||||
`POST /api/v1/issuers/{id}/intermediates` body.
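
A simplified sketch of the subset semantics for DNS name constraints. The function names are hypothetical, and the matching rule (equal or a subdomain) is a simplification of RFC 5280 §4.2.1.10; the service's actual comparison may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// within reports whether DNS constraint child equals parent or is a
// subdomain of it.
func within(child, parent string) bool {
	child, parent = strings.ToLower(child), strings.ToLower(parent)
	return child == parent || strings.HasSuffix(child, "."+parent)
}

// permittedIsSubset: every child permitted domain must fall under some parent
// permitted domain. An empty parent list means "unconstrained" in this sketch.
func permittedIsSubset(child, parent []string) bool {
	if len(parent) == 0 {
		return true
	}
	if len(child) == 0 {
		return false // an unconstrained child would widen the parent's set
	}
	for _, c := range child {
		ok := false
		for _, p := range parent {
			if within(c, p) {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	return true
}

// excludedIsSuperset: every subtree the parent excludes must still be
// covered by some exclusion on the child.
func excludedIsSuperset(child, parent []string) bool {
	for _, p := range parent {
		covered := false
		for _, c := range child {
			if within(p, c) {
				covered = true
				break
			}
		}
		if !covered {
			return false
		}
	}
	return true
}

func main() {
	parent := []string{"trading.finco.example"}
	fmt.Println(permittedIsSubset([]string{"eu.trading.finco.example"}, parent)) // narrower: allowed
	fmt.Println(permittedIsSubset([]string{"finco.example"}, parent))            // wider: rejected
	fmt.Println(excludedIsSuperset(nil, []string{"hr.finco.example"}))           // drops an exclusion: rejected
}
```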
|
||||
|
||||
### Pattern C — 2-level internal PKI
|
||||
|
||||
```mermaid
flowchart TD
    Root["Internal Root CA<br/>path_len=1"]
    Iss["Internal Issuing CA<br/>path_len=0<br/>issues leaves directly"]
    Root --> Iss
```
|
||||
|
||||
The simplest tree-mode deployment. Roughly equivalent to single mode
|
||||
in terms of operator overhead, but provides one extra layer of
|
||||
indirection so the root key can stay offline while only the issuing
|
||||
CA's key sits on the certctl host.
|
||||
|
||||
## RFC 5280 enforcement
|
||||
|
||||
All enforcement happens at the service layer. The local connector
|
||||
trusts the service's contract; the API layer translates errors to
|
||||
HTTP codes.
|
||||
|
||||
- §3.2 self-signed root validation: `cert.CheckSignatureFrom(cert)` +
|
||||
subject == issuer DN. Rejected with `ErrCANotSelfSigned` →
|
||||
HTTP 400.
|
||||
- §4.2.1.9 path-length tightening: child's `PathLenConstraint` must
|
||||
be strictly less than parent's. Default to `parent - 1` when unset.
|
||||
Rejected with `ErrPathLenExceeded` → HTTP 400.
|
||||
- §4.2.1.10 NameConstraints subset: child's `Permitted` set must be a
|
||||
subset of parent's; child's `Excluded` set must be a superset of
|
||||
parent's. Rejected with `ErrNameConstraintExceeded` → HTTP 400.
|
||||
- §4.1.2.5 validity capping: child's `notAfter` capped to parent's
|
||||
`notAfter` automatically (chain breaks at parent's expiry
|
||||
regardless).
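
The path-length and validity rules above can be sketched as two small functions. The names `childPathLen` and `capNotAfter` are hypothetical; treating a parent with `path_len` 0 as unable to sign another CA follows RFC 5280 §4.2.1.9, and mapping that case to `ErrPathLenExceeded` is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrPathLenExceeded is named in the text; the API maps it to HTTP 400.
var ErrPathLenExceeded = errors.New("child path length must be strictly less than parent's")

// childPathLen applies the tightening rule. requested == nil means "unset",
// in which case the child defaults to parent-1.
func childPathLen(parent int, requested *int) (int, error) {
	if parent <= 0 {
		return 0, ErrPathLenExceeded // a path_len=0 CA cannot sign another CA
	}
	if requested == nil {
		return parent - 1, nil
	}
	if *requested < 0 || *requested >= parent {
		return 0, ErrPathLenExceeded
	}
	return *requested, nil
}

// capNotAfter clamps the child's notAfter to the parent's.
func capNotAfter(child, parent time.Time) time.Time {
	if child.After(parent) {
		return parent
	}
	return child
}

func main() {
	n, err := childPathLen(3, nil)
	fmt.Println(n, err) // unset: defaults to parent-1

	same := 3
	_, err = childPathLen(3, &same)
	fmt.Println(err) // not strictly less: rejected

	parentEnd := time.Date(2030, 1, 1, 0, 0, 0, 0, time.UTC)
	childEnd := parentEnd.AddDate(1, 0, 0)
	fmt.Println(capNotAfter(childEnd, parentEnd).Equal(parentEnd)) // capped to parent
}
```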
|
||||
|
||||
## Migrating a single-mode issuer to tree mode
|
||||
|
||||
Pre-flight: the load-bearing pin
|
||||
`TestLocal_HierarchyMode_SingleVsTree_ByteIdentical` guarantees that
|
||||
a 1-level tree wired around the same on-disk root cert+key produces
|
||||
byte-identical issuance bundles to single mode. Migration is therefore
|
||||
a no-downtime operation if done carefully:
|
||||
|
||||
1. Register the existing single-mode CA cert as an `intermediate_cas`
|
||||
row via `CreateRoot` (with the existing on-disk key referenced as
|
||||
`key_driver_id`).
|
||||
2. Update the issuer row's `hierarchy_mode` to `"tree"` and call the
connector's `SetTreeIssuingCAID` with the new row's ID. Restart the
|
||||
server (no new code path activates until the connector reads the
|
||||
updated mode at boot).
|
||||
3. Issue a test cert. The byte-equivalence pin guarantees the wire
|
||||
bytes match the pre-migration output for a 1-level tree.
|
||||
4. Build out the child CAs via `CreateChild` calls. Update
|
||||
`treeIssuingCAID` to the new leaf CA. Test, then ramp.
|
||||
|
||||
If the pin breaks during migration, abort: roll back the
|
||||
`hierarchy_mode` flip and investigate. The byte-equivalence pin is
|
||||
the canary — if it goes red, deeper bugs lurk.
|
||||
|
||||
## API reference
|
||||
|
||||
All endpoints under `/api/v1/issuers/{id}/intermediates` and
|
||||
`/api/v1/intermediates/{id}` are admin-gated. Non-admin Bearer callers
|
||||
get HTTP 403.
|
||||
|
||||
| Method | Path | Purpose |
|--------|------|---------|
| POST | `/api/v1/issuers/{id}/intermediates` | Register root OR sign child (body discriminator) |
| GET | `/api/v1/issuers/{id}/intermediates` | List flat hierarchy for issuer |
| GET | `/api/v1/intermediates/{id}` | Single-row detail |
| POST | `/api/v1/intermediates/{id}/retire` | Two-phase retirement |
|
||||
|
||||
See `api/openapi.yaml` for full request/response schemas.
|
||||
|
||||
## Observability
|
||||
|
||||
`IntermediateCAMetrics` ships counters dimensioned by `(issuer_id,
|
||||
kind)`:
|
||||
|
||||
- `create_root` — successful CreateRoot calls.
|
||||
- `create_child` — successful CreateChild calls.
|
||||
- `retire_retiring` — `active → retiring` transitions.
|
||||
- `retire_retired` — `retiring → retired` transitions.
|
||||
|
||||
The Prometheus exposer reads the snapshot via
|
||||
`SnapshotIntermediateCA()` from a single instance constructed in
|
||||
`cmd/server/main.go` (the snapshotter is the single source of truth
|
||||
between the service-side recording path and the metrics-side exposing
|
||||
path).
|
||||
|
||||
The audit table receives one row per CreateRoot / CreateChild /
|
||||
Retire transition, scoped to the actor extracted from the API
|
||||
request's auth context.
|
||||
|
||||
## Known limitations
|
||||
|
||||
The following are tracked in `WORKSPACE-ROADMAP.md` as Rank-8 follow-on
|
||||
work — none are required for the v2.1.0 acquisition gate:
|
||||
|
||||
- HSM-backed roots beyond `signer.FileDriver` (PKCS#11 / cloud KMS
|
||||
drivers).
|
||||
- Automated rotation: scheduled re-issuance of sub-CAs ahead of
|
||||
expiry with parallel-validity windows.
|
||||
- Intra-hierarchy CRL chaining: each non-leaf CA publishes a CRL
|
||||
covering its direct children's revocations.
|
||||
- NameConstraints policy templates: declarative templates an operator
|
||||
can pick from instead of hand-rolling the JSON.
|
||||
- D3 dendrogram visualization on the GUI page (today's render is a
|
||||
recursive `<ul>` nested list).
|
||||
@@ -0,0 +1,197 @@
|
||||
# MCP Server Guide
|
||||
|
||||
certctl ships with an MCP (Model Context Protocol) server that lets AI assistants manage your certificate infrastructure through natural language. Ask Claude to "show me all expiring certificates," "revoke the VPN cert," or "what agents are offline?" and the MCP server translates that into API calls against your certctl instance.
|
||||
|
||||
This guide covers setup, configuration, and usage with Claude, Cursor, and other MCP-compatible tools.
|
||||
|
||||
## What Is MCP?
|
||||
|
||||
MCP is an open protocol that connects AI assistants to external tools and data sources. Instead of copying and pasting API responses into a chat window, MCP lets the AI call your tools directly. The certctl MCP server exposes all 78 API endpoints as MCP tools — the AI sees typed schemas describing what each tool does, what parameters it accepts, and what it returns.
|
||||
|
||||
The MCP server is a separate binary (`cmd/mcp-server/`) that communicates via stdio transport. It's a stateless HTTP proxy: every MCP tool call becomes an HTTP request to the certctl REST API. No new state, no new database tables, no new attack surface beyond what the API already exposes.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
You need:
|
||||
|
||||
1. A running certctl server (see [Quick Start](quickstart.md))
|
||||
2. The MCP server binary — either built from source or from a Docker image
|
||||
3. An MCP-compatible AI client (Claude Desktop, Cursor, VS Code with Copilot, etc.)
|
||||
|
||||
## Building the MCP Server
|
||||
|
||||
```bash
cd certctl
go build -o certctl-mcp ./cmd/mcp-server/
```
|
||||
|
||||
The binary has zero runtime dependencies beyond the certctl server it connects to.
|
||||
|
||||
## Configuration
|
||||
|
||||
The MCP server is configured through the following environment variables:
|
||||
|
||||
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `CERTCTL_SERVER_URL` | No | `https://localhost:8443` | URL of the certctl REST API (HTTPS-only as of v2.2) |
| `CERTCTL_API_KEY` | No | (empty) | API key for authentication (passed as `Bearer` token) |
| `CERTCTL_SERVER_CA_BUNDLE_PATH` | Yes (for self-signed / internal CA) | (empty) | Path to PEM CA bundle that signed the server cert. Required when the server cert isn't rooted in the system trust store (the default compose stack ships a self-signed cert at `deploy/test/certs/ca.crt`). |
|
||||
|
||||
If your certctl server has auth enabled (the default), you must provide the API key. The MCP server passes it through to every HTTP request.
|
||||
|
||||
Since v2.2 the certctl control plane is HTTPS-only. If the server cert is self-signed or chained to an internal CA, set `CERTCTL_SERVER_CA_BUNDLE_PATH` so the MCP server can verify the TLS handshake. Never set `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true` outside local development — it disables all certificate validation.
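
For readers who want to see what these variables amount to on the wire, here is a rough sketch of a Go client that trusts a PEM bundle and sends the API key as a Bearer token. It is not the MCP server's code: `newClient` and `newRequest` are hypothetical helpers, and trusting only the bundle (rather than the bundle plus the system store) is an assumption.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
	"os"
)

// newClient trusts the PEM bundle at caPath; an empty caPath falls back to
// the system trust store.
func newClient(caPath string) (*http.Client, error) {
	if caPath == "" {
		return &http.Client{}, nil
	}
	pem, err := os.ReadFile(caPath)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pem) {
		return nil, errors.New("no certificates found in CA bundle")
	}
	return &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{RootCAs: pool, MinVersion: tls.VersionTLS12},
	}}, nil
}

// newRequest attaches the API key as a Bearer token when one is set.
func newRequest(method, url, apiKey string) (*http.Request, error) {
	req, err := http.NewRequest(method, url, nil)
	if err != nil {
		return nil, err
	}
	if apiKey != "" {
		req.Header.Set("Authorization", "Bearer "+apiKey)
	}
	return req, nil
}

func main() {
	base := os.Getenv("CERTCTL_SERVER_URL")
	if base == "" {
		base = "https://localhost:8443" // documented default
	}
	client, err := newClient(os.Getenv("CERTCTL_SERVER_CA_BUNDLE_PATH"))
	if err != nil {
		fmt.Println("client setup failed:", err)
		return
	}
	req, err := newRequest(http.MethodGet, base+"/health", os.Getenv("CERTCTL_API_KEY"))
	if err != nil {
		fmt.Println("request setup failed:", err)
		return
	}
	_ = client // the request is built but not sent in this sketch
	fmt.Println(req.Method, req.URL, "authenticated:", req.Header.Get("Authorization") != "")
}
```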
|
||||
|
||||
## Setting Up with Claude Desktop
|
||||
|
||||
Add this to your Claude Desktop MCP configuration file (`~/Library/Application Support/Claude/claude_desktop_config.json` on macOS, `%APPDATA%\Claude\claude_desktop_config.json` on Windows):
|
||||
|
||||
```json
{
  "mcpServers": {
    "certctl": {
      "command": "/path/to/certctl-mcp",
      "env": {
        "CERTCTL_SERVER_URL": "https://localhost:8443",
        "CERTCTL_SERVER_CA_BUNDLE_PATH": "/path/to/certctl/deploy/test/certs/ca.crt",
        "CERTCTL_API_KEY": "your-api-key-here"
      }
    }
  }
}
```
|
||||
|
||||
Restart Claude Desktop. You should see "certctl" appear in the MCP tools list with 78 available tools.
|
||||
|
||||
## Setting Up with Cursor
|
||||
|
||||
In Cursor, go to Settings → MCP Servers and add:
|
||||
|
||||
```json
{
  "certctl": {
    "command": "/path/to/certctl-mcp",
    "env": {
      "CERTCTL_SERVER_URL": "https://localhost:8443",
      "CERTCTL_SERVER_CA_BUNDLE_PATH": "/path/to/certctl/deploy/test/certs/ca.crt",
      "CERTCTL_API_KEY": "your-api-key-here"
    }
  }
}
```
|
||||
|
||||
## Setting Up with Claude Code
|
||||
|
||||
Add certctl as an MCP server in your project's `.mcp.json`:
|
||||
|
||||
```json
{
  "mcpServers": {
    "certctl": {
      "command": "/path/to/certctl-mcp",
      "env": {
        "CERTCTL_SERVER_URL": "https://localhost:8443",
        "CERTCTL_SERVER_CA_BUNDLE_PATH": "/path/to/certctl/deploy/test/certs/ca.crt",
        "CERTCTL_API_KEY": "your-api-key-here"
      }
    }
  }
}
```
|
||||
|
||||
## Available Tools
|
||||
|
||||
The MCP server exposes the full REST API organized across 16 resource domains:
|
||||
|
||||
| Domain | Tools | Examples |
|--------|-------|---------|
| Certificates | 9 | List, get, create, update, archive, versions, renew, deploy, revoke |
| CRL & OCSP | 3 | Get JSON CRL, get DER CRL by issuer, check OCSP status |
| Issuers | 6 | List, get, create, update, delete, test connection |
| Targets | 5 | List, get, create, update, delete |
| Agents | 8 | List, get, register, heartbeat, CSR submit, certificate pickup, get work, report job status |
| Jobs | 5 | List, get, cancel, approve, reject |
| Policies | 6 | List, get, create, update, delete, list violations |
| Profiles | 5 | List, get, create, update, delete |
| Teams | 5 | List, get, create, update, delete |
| Owners | 5 | List, get, create, update, delete |
| Agent Groups | 6 | List, get, create, update, delete, list members |
| Audit | 2 | List events (with filters), get event by ID |
| Notifications | 3 | List, get, mark as read |
| Stats | 5 | Summary, certs by status, expiration timeline, job trends, issuance rate |
| Metrics | 1 | System metrics (gauges, counters, uptime) |
| Health | 4 | Health check, readiness probe, auth info, auth check |
|
||||
|
||||
Every tool has typed input parameters with `jsonschema` descriptions, so the AI knows exactly what arguments to provide and what each field means.
|
||||
|
||||
## Example Conversations
|
||||
|
||||
Once configured, you can interact with certctl through natural language:
|
||||
|
||||
**"Show me all certificates expiring in the next 14 days"**
|
||||
The AI calls `certctl_list_certificates` with `status=Expiring` and interprets the results.
|
||||
|
||||
**"Renew the API production certificate"**
|
||||
The AI calls `certctl_trigger_renewal` with `id=mc-api-prod`.
|
||||
|
||||
**"Who owns the payments gateway cert?"**
|
||||
The AI calls `certctl_get_certificate` with `id=mc-payments-prod` and reads the `owner_id` and `team_id` fields.
|
||||
|
||||
**"Are any agents offline?"**
|
||||
The AI calls `certctl_list_agents` and checks the heartbeat timestamps.
|
||||
|
||||
**"Revoke the old VPN cert — the key was compromised"**
|
||||
The AI calls `certctl_revoke_certificate` with `id=mc-vpn-old` and `reason=keyCompromise`.
|
||||
|
||||
**"Give me a summary of the certificate fleet"**
|
||||
The AI calls `certctl_dashboard_summary` for aggregate stats, then optionally `certctl_certificates_by_status` for the breakdown.
|
||||
|
||||
**"Create a new cert for staging.api.example.com owned by the platform team"**
|
||||
The AI calls `certctl_create_certificate` with the common name, team ID, and owner ID.
|
||||
|
||||
## Architecture
|
||||
|
||||
```mermaid
flowchart LR
    AI["AI Assistant\n(Claude, Cursor)"]
    MCP["certctl MCP\ncmd/mcp-server/"]
    SERVER["certctl Server\n:8443"]

    AI <-->|"stdio"| MCP
    MCP -->|"HTTPS + Bearer token"| SERVER

    MCP ~~~ TOOLS["REST API via MCP · 16 domains\nTyped input structs"]
```
|
||||
|
||||
The MCP server is intentionally thin:
|
||||
|
||||
- **No state** — every request is a pass-through HTTP call. Restart it anytime.
|
||||
- **No new auth** — uses the same API key as the REST API.
|
||||
- **No new dependencies** — just the official MCP Go SDK (`modelcontextprotocol/go-sdk`).
|
||||
- **No new attack surface** — the AI can only do what the API key allows.
|
||||
|
||||
## Security Considerations
|
||||
|
||||
The MCP server inherits the security properties of the REST API:
|
||||
|
||||
- **API key scoping**: The MCP server uses whatever API key you configure. If certctl gets API key scoping in a future release (per-resource or per-action permissions), the MCP server will automatically respect those restrictions.
|
||||
- **Audit trail**: Every tool call results in an HTTP request that's logged in the API audit middleware — actor, method, path, status, and latency are all recorded.
|
||||
- **Read-only usage**: For read-only AI access, you could configure a restricted API key (when key scoping ships). Until then, be aware that the AI can call write endpoints (create, update, delete, revoke) if the API key permits it.
|
||||
- **No private key exposure**: The MCP server never sees or transmits private keys — the same architectural guarantee as the REST API.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**"MCP server not connecting"**
|
||||
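Check that `CERTCTL_SERVER_URL` is reachable from where the MCP binary runs. Try `curl --cacert "$CERTCTL_SERVER_CA_BUNDLE_PATH" "$CERTCTL_SERVER_URL/health"` to verify (drop `--cacert` if the server cert chains to the system trust store).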
|
||||
|
||||
**"401 Unauthorized on every tool call"**
|
||||
Your `CERTCTL_API_KEY` is missing or wrong. Check the key matches what the certctl server expects.
|
||||
|
||||
**"Tool calls return empty results"**
|
||||
The certctl server might have no data. Run the demo seed (`docker compose up`) to populate demo data, or check that your database has records.
|
||||
|
||||
## What's Next
|
||||
|
||||
- [Quick Start](quickstart.md) — Get certctl running locally
|
||||
- [OpenAPI Spec](openapi.md) — Full API reference and SDK generation
|
||||
- [Architecture](architecture.md) — System design deep dive
|
||||
- [Concepts](concepts.md) — Certificate lifecycle fundamentals
|
||||
@@ -0,0 +1,278 @@
|
||||
# ACME Server — Threat Model
|
||||
|
||||
Security posture for the certctl ACME server endpoint
|
||||
(`/acme/profile/<id>/*`). Read this before opening a PR that changes
|
||||
the JWS verifier, the challenge validators, the rate limiter, or the
|
||||
GC sweeper.
|
||||
|
||||
The threat model lives in this dedicated doc (rather than `docs/acme-server.md`)
because security reviewers want a single concentrated reference.
|
||||
Production deployments under audit should treat this doc as the
|
||||
canonical answer to "how does certctl resist X?"
|
||||
|
||||
## Threat surface map
|
||||
|
||||
The ACME server has four ingress surfaces:
|
||||
|
||||
1. **JWS-authenticated POST endpoints** — new-account, new-order,
|
||||
finalize, key-change, revoke-cert, account update, order POST-as-GET.
|
||||
Authenticated by an ECDSA / RSA / EdDSA signature over the request.
|
||||
2. **Unauthenticated GET endpoints** — directory, new-nonce, ARI
|
||||
(renewal-info). Read-only; no authn.
|
||||
3. **Outbound challenge validators** — HTTP-01, DNS-01, TLS-ALPN-01.
|
||||
The certctl-server initiates outbound calls to operator-provided
|
||||
identifiers (the SAN list of the requested cert).
|
||||
4. **Scheduler-driven GC sweeper** — internal-only; no inbound surface.
|
||||
|
||||
Threat actors:
|
||||
|
||||
- **External Internet attacker** — no certctl credentials; can hit
|
||||
unauthenticated endpoints + observe TLS metadata.
|
||||
- **Authenticated ACME account holder (low-trust)** — has a valid
|
||||
account on a profile but should be bounded by profile policy +
|
||||
rate limits.
|
||||
- **On-path attacker** between certctl-server and a challenge target
|
||||
(HTTP-01 / DNS-01 / TLS-ALPN-01).
|
||||
- **Compromised cert holder** — has the private key of a previously-
|
||||
issued cert and wants to revoke/exfiltrate.
|
||||
- **Malicious operator with profile-write access** — can change a
|
||||
profile's `acme_auth_mode` or policy, but is the trusted boundary
|
||||
per certctl's threat model. Out of scope here; covered by certctl's
|
||||
RBAC + audit log.
|
||||
|
||||
## JWS forgery resistance
|
||||
|
||||
The verifier (`internal/api/acme/jws.go`) accepts only the closed
|
||||
allow-list `{RS256, ES256, EdDSA}`. The allow-list is passed to
|
||||
`jose.ParseSigned` so go-jose rejects every other algorithm at parse
|
||||
time, before any signature work.
|
||||
|
||||
Specific attacks blocked:
|
||||
|
||||
- **Algorithm confusion (`alg: none`)** — the classic unauthenticated
fallback via the unsecured `none` algorithm (RFC 7518 §3.6). Not in
allow-list; rejected at parse.
|
||||
- **HS256 substitution (alg-confusion via symmetric)** — symmetric
|
||||
algs aren't in the allow-list; rejected at parse.
|
||||
- **Replayed nonce** — every JWS carries a nonce consumed via
|
||||
`acme_nonces.UPDATE … WHERE used = FALSE` (a single statement;
|
||||
Postgres row-locking serializes the writes). A second consume of
|
||||
the same nonce sees `RowsAffected=0` and the verifier returns
|
||||
`badNonce`.
|
||||
- **URL spoofing** — the protected-header `url` field MUST match the
|
||||
request URL exactly (RFC 8555 §6.4); a JWS signed for one URL
|
||||
cannot be replayed against another.
|
||||
- **Multi-signature JWS** — RFC 8555 §6.2 forbids; the verifier
|
||||
rejects `len(jws.Signatures) != 1` explicitly.
|
||||
- **kid-vs-jwk confusion** — exactly one MUST be present per RFC 8555
|
||||
§6.2; both-present and neither-present are rejected.
|
||||
- **kid round-trip mismatch** — the verifier's `AccountKID` closure
|
||||
computes the canonical kid URL for the resolved account-id and
|
||||
compares to the inbound `kid`; cross-profile replay is rejected
|
||||
because the canonical URL differs.
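
The structural checks in the list above can be sketched with the standard library alone. This is not the verifier in `internal/api/acme/jws.go`, which the text says delegates algorithm filtering to go-jose; `checkHeader` and `encodeHeader` are hypothetical helpers, and the sketch performs no signature verification or nonce consumption.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// protectedHeader holds the JWS protected-header fields ACME cares about.
type protectedHeader struct {
	Alg   string          `json:"alg"`
	Kid   string          `json:"kid"`
	JWK   json.RawMessage `json:"jwk"`
	Nonce string          `json:"nonce"`
	URL   string          `json:"url"`
}

// allowedAlgs mirrors the closed allow-list named in the text.
var allowedAlgs = map[string]bool{"RS256": true, "ES256": true, "EdDSA": true}

// checkHeader runs structural checks only; it does NOT verify the signature.
func checkHeader(protectedB64, requestURL string) error {
	raw, err := base64.RawURLEncoding.DecodeString(protectedB64)
	if err != nil {
		return fmt.Errorf("malformed protected header: %w", err)
	}
	var h protectedHeader
	if err := json.Unmarshal(raw, &h); err != nil {
		return fmt.Errorf("malformed protected header: %w", err)
	}
	if !allowedAlgs[h.Alg] {
		return errors.New("badSignatureAlgorithm: " + h.Alg)
	}
	hasJWK := len(h.JWK) > 0 && string(h.JWK) != "null"
	if (h.Kid != "") == hasJWK {
		return errors.New("malformed: exactly one of kid or jwk is required")
	}
	if h.URL != requestURL {
		return errors.New("unauthorized: url header does not match request URL")
	}
	if h.Nonce == "" {
		return errors.New("badNonce: missing nonce")
	}
	return nil
}

// encodeHeader builds a base64url protected header for the demo.
func encodeHeader(fields map[string]any) string {
	b, _ := json.Marshal(fields)
	return base64.RawURLEncoding.EncodeToString(b)
}

func main() {
	url := "https://ca.example/acme/profile/p1/new-order"
	good := encodeHeader(map[string]any{"alg": "ES256", "kid": "https://ca.example/acct/1", "nonce": "n1", "url": url})
	none := encodeHeader(map[string]any{"alg": "none", "kid": "https://ca.example/acct/1", "nonce": "n1", "url": url})
	fmt.Println(checkHeader(good, url))
	fmt.Println(checkHeader(none, url))
	fmt.Println(checkHeader(good, "https://ca.example/acme/profile/p2/new-order"))
}
```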
|
||||
|
||||
The doubly-signed key-rollover JWS (RFC 8555 §7.3.5, Phase 4) gets
|
||||
its own dedicated verifier in `internal/api/acme/keychange.go`.
|
||||
Inner-only invariants enforced: MUST use `jwk` not `kid`, payload
|
||||
`account` MUST equal outer `kid`, payload `oldKey` MUST canonicalize-
|
||||
equal the registered key (RFC 7638 thumbprint, constant-time
|
||||
compare), inner `url` MUST equal outer `url`.
|
||||
|
||||
## Nonce store integrity
|
||||
|
||||
Nonces are persisted in PostgreSQL (`acme_nonces` table; migration
|
||||
000025) with a TTL set by `CERTCTL_ACME_SERVER_NONCE_TTL` (default
|
||||
5 min). The Phase 5 GC sweeper deletes used / expired rows every 1
|
||||
minute by default.
|
||||
|
||||
Why DB-backed and not in-memory:
|
||||
|
||||
- **Survives restart and spans replicas** — a multi-replica certctl-server fleet behind
|
||||
a load balancer can issue a nonce on replica A and consume it on
|
||||
replica B. In-memory state would force sticky sessions globally,
|
||||
which the operator can't guarantee in all topologies.
|
||||
- **Atomic consume** — a single `UPDATE ... WHERE used = FALSE`
|
||||
statement is the consume primitive; Postgres row-locking guarantees
|
||||
exactly one of two concurrent consumes wins.
|
||||
- **Expiry-bounded** — even if the GC sweeper were disabled, the
|
||||
nonce TTL is enforced at consume time
|
||||
(`AND expires_at > NOW()` in the UPDATE).
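
The consume primitive can be illustrated with an in-memory stand-in, where a mutex plays the role of the Postgres row lock. The `store` type is hypothetical, and the `nonce` column name in the SQL constant is an assumption; `used` and `expires_at` come from the text.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// consumeSQL shows the shape of the statement the text describes.
// The `nonce` column name is an assumption.
const consumeSQL = `UPDATE acme_nonces SET used = TRUE
WHERE nonce = $1 AND used = FALSE AND expires_at > NOW()`

type nonceRow struct {
	used      bool
	expiresAt int64 // Unix seconds
}

// store is an in-memory stand-in for the acme_nonces table.
type store struct {
	mu   sync.Mutex
	rows map[string]*nonceRow
}

// consume returns the equivalent of RowsAffected: 1 on the first valid
// consume, 0 for replays, unknown nonces, and expired rows.
func (s *store) consume(nonce string, nowUnix int64) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	r, ok := s.rows[nonce]
	if !ok || r.used || r.expiresAt <= nowUnix {
		return 0
	}
	r.used = true
	return 1
}

func main() {
	now := time.Now().Unix()
	s := &store{rows: map[string]*nonceRow{"abc": {expiresAt: now + 300}}}

	var wg sync.WaitGroup
	var mu sync.Mutex
	wins := 0
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			n := s.consume("abc", now)
			mu.Lock()
			wins += n
			mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println("winning consumes:", wins) // prints "winning consumes: 1"
	fmt.Println(consumeSQL)
}
```

Any caller that sees 0 maps it to `badNonce`, whether the nonce was replayed, unknown, or expired.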
|
||||
|
||||
A nonce-store-side compromise would let an attacker forge nonces.
|
||||
Mitigation: the nonce table is in the same Postgres instance certctl
|
||||
already trusts; a DB compromise is broader than ACME-specific.
|
||||
|
||||
## HTTP-01 SSRF resistance
|
||||
|
||||
The HTTP-01 validator (Phase 3, `internal/api/acme/validators.go`)
|
||||
fetches `http://<identifier>/.well-known/acme-challenge/<token>`
|
||||
where the identifier is operator/client-controlled. Without
|
||||
mitigation, this is a textbook SSRF surface — internal services on
|
||||
RFC1918 / link-local / cloud-metadata addresses would be reachable.
|
||||
|
||||
Mitigations (defense in depth):
|
||||
|
||||
1. **Pre-dial check** — `validation.ValidateSafeURL` rejects URLs
|
||||
whose host parses as a literal reserved IP. Cheap early bail.
|
||||
2. **Per-dial check** — `validation.SafeHTTPDialContext` is installed
|
||||
on the `http.Transport`. Every dial re-resolves DNS, rejects
|
||||
reserved IPs, and **pins the resolved IP** (`net.JoinHostPort(ips[0],
|
||||
port)`) so a racing DNS rebinding cannot substitute a different IP
|
||||
between resolve and connect.
|
||||
3. **Per-redirect check** — Go's HTTP client re-dials on 3xx; the
|
||||
`DialContext` runs again, applying the same SSRF guards.
|
||||
4. **Body cap** — the validator's `io.LimitReader` caps response
|
||||
bodies at 16 KiB. A misbehaving target cannot DoS the validator
|
||||
pool with a multi-GB response.
|
||||
5. **Bounded redirects** — the validator caps redirects at 10 (Go
|
||||
default). A redirect-loop target is bounded.
|
||||
|
||||
Reserved IP set: loopback (127.0.0.0/8 + ::1), link-local
|
||||
(169.254.0.0/16 + fe80::/10), all RFC1918 (10/8, 172.16/12, 192.168/16),
|
||||
cloud-metadata literals (169.254.169.254 explicitly), broadcast,
|
||||
multicast, IPv4-mapped-IPv6 to a reserved IPv4. See
|
||||
`internal/validation/ssrf.go::isReservedIPForDial` for the full set.
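
An approximation of the reserved-IP check and the pin-the-first-IP step, using `net/netip`. This is not the code in `internal/validation/ssrf.go`: the helper names are hypothetical, the set is approximated with the standard library's classification methods (which also cover IPv6 unique-local addresses), and rejecting unspecified addresses and unparseable input is an added assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
)

var broadcast = netip.MustParseAddr("255.255.255.255")

// isReserved reports whether ip falls in the reserved set sketched here.
func isReserved(ip netip.Addr) bool {
	ip = ip.Unmap() // treat ::ffff:10.0.0.1 as 10.0.0.1
	return ip.IsLoopback() || ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast() ||
		ip.IsPrivate() || ip.IsMulticast() || ip.IsUnspecified() || ip == broadcast
}

// isReservedString parses s first and fails closed on unparseable input.
func isReservedString(s string) bool {
	ip, err := netip.ParseAddr(s)
	if err != nil {
		return true
	}
	return isReserved(ip)
}

// pickDialTarget rejects the answer if any resolved IP is reserved, then
// pins the first IP so the connect uses exactly the address that was checked.
func pickDialTarget(ips []netip.Addr, port uint16) (string, error) {
	if len(ips) == 0 {
		return "", errors.New("no addresses resolved")
	}
	for _, ip := range ips {
		if isReserved(ip) {
			return "", fmt.Errorf("refusing to dial reserved address %s", ip)
		}
	}
	return netip.AddrPortFrom(ips[0].Unmap(), port).String(), nil
}

func main() {
	for _, s := range []string{"127.0.0.1", "169.254.169.254", "::ffff:10.0.0.1", "93.184.216.34"} {
		fmt.Println(s, "reserved:", isReservedString(s))
	}
	target, err := pickDialTarget([]netip.Addr{netip.MustParseAddr("93.184.216.34")}, 80)
	fmt.Println(target, err)
}
```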
|
||||
|
||||
CodeQL alert #23 flags `client.Do(req)` in the SCEP-probe call site
|
||||
as `go/request-forgery` despite the dial-time guard; the analyzer
|
||||
can't trace through a custom `Transport.DialContext`. Operator-
|
||||
acknowledged false positive (CLAUDE.md task #10) — see the SCEP
|
||||
probe's same-shaped defense for the audit trail.
|
||||
|
||||
## DNS-01 cache poisoning posture
|
||||
|
||||
The DNS-01 validator queries
|
||||
`_acme-challenge.<domain>` against a single resolver configured by
|
||||
`CERTCTL_ACME_SERVER_DNS01_RESOLVER` (default `8.8.8.8:53`).
|
||||
|
||||
Threat: an operator running a private resolver (typical in air-gapped
|
||||
deployments) inherits that resolver's cache-poisoning posture. A
|
||||
poisoned resolver could attest a TXT record the legitimate domain
|
||||
owner never published, allowing an attacker who controls the
|
||||
resolver to forge ACME challenges.
|
||||
|
||||
Mitigation:
|
||||
|
||||
- Default `8.8.8.8:53` is Google Public DNS — DNSSEC-validating,
|
||||
operationally hardened, well-monitored.
|
||||
- Operators choosing a private resolver own the cache-poisoning
posture. This is flagged explicitly in `docs/acme-server.md`
§ Configuration.
|
||||
- DNSSEC-validation is **not** enforced by the validator itself —
|
||||
the validator trusts the resolver's answer. Operators wanting
|
||||
strict DNSSEC validation should use a DNSSEC-validating resolver
|
||||
(e.g. `1.1.1.1` or a self-hosted Unbound).
|
||||
|
||||
## TLS-ALPN-01 challenge interception
|
||||
|
||||
RFC 8737 §3 explicitly says the validator MUST NOT verify the
|
||||
challenge target's certificate chain — the proof lives in the
|
||||
embedded `id-pe-acmeIdentifier` extension (OID 1.3.6.1.5.5.7.1.31)
|
||||
of the cert presented during the TLS handshake, not in the chain
|
||||
itself.
|
||||
|
||||
Implementation: `internal/api/acme/validators.go::TLSALPN01Validator`
|
||||
sets `tls.Config.InsecureSkipVerify = true` with a dedicated
|
||||
`//nolint:gosec` annotation citing RFC 8737 §3 and the L-001
|
||||
documentation row in `docs/tls.md`.
|
||||
|
||||
What this means for on-path attackers:
|
||||
|
||||
- An on-path attacker between certctl-server and the challenge target
|
||||
CAN intercept the TLS handshake and present a forged cert. The
|
||||
proof is the embedded extension byte-equality, which the attacker
|
||||
cannot generate without the account key — so interception alone
|
||||
doesn't grant cert issuance.
|
||||
- An attacker who has the account key already controls the account
|
||||
per RFC 8555; the TLS-ALPN-01 validator's interception window adds
|
||||
no incremental capability.
|
||||
|
||||
The integrity property TLS-ALPN-01 actually provides: the challenge
|
||||
target proves possession of the account-key-derived key authorization
|
||||
on a TLS connection bound to the requested identifier (port 443 of
|
||||
the SAN). Operators wanting CA/Browser-Forum-style WebPKI strictness
|
||||
should run a dedicated public-trust CA, not certctl.
|
||||
|
||||
## Rate-limit tuning
|
||||
|
||||
Phase 5 in-memory token buckets with per-(action, key) isolation.
|
||||
Defaults:
|
||||
|
||||
- `RATE_LIMIT_ORDERS_PER_HOUR=100` per account.
|
||||
- `RATE_LIMIT_CONCURRENT_ORDERS=5` per account (pending/ready/processing).
|
||||
- `RATE_LIMIT_KEY_CHANGE_PER_HOUR=5` per account.
|
||||
- `RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR=60` per challenge-id.
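
A minimal token-bucket sketch with per-(action, key) isolation. The `limiter` type and its fields are hypothetical, and setting the burst capacity equal to the hourly rate is an assumption; the real limiter may size and refill its buckets differently.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)

type bucket struct {
	tokens float64
	last   int64 // Unix seconds of the last refill
}

// limiter keeps one bucket per (action, key) pair, so one account's orders
// never drain another account's budget or its own key-change budget.
type limiter struct {
	mu      sync.Mutex
	perHour float64
	buckets map[string]*bucket
}

func newLimiter(perHour int) *limiter {
	return &limiter{perHour: float64(perHour), buckets: map[string]*bucket{}}
}

// allow refills the bucket for elapsed time, then spends one token if available.
func (l *limiter) allow(action, key string, nowUnix int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	id := action + "|" + key
	b, ok := l.buckets[id]
	if !ok {
		b = &bucket{tokens: l.perHour, last: nowUnix}
		l.buckets[id] = b
	}
	if elapsed := float64(nowUnix-b.last) / 3600; elapsed > 0 {
		b.tokens = math.Min(l.perHour, b.tokens+elapsed*l.perHour)
		b.last = nowUnix
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	l := newLimiter(5) // e.g. key changes: 5 per hour per account
	now := time.Now().Unix()
	for i := 1; i <= 6; i++ {
		fmt.Println("attempt", i, l.allow("key-change", "acct-1", now))
	}
	fmt.Println("other account:", l.allow("key-change", "acct-2", now))
}
```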
|
||||
|
||||
Tuning:
|
||||
|
||||
- **Too loose** → enables abuse vectors. A compromised account could
|
||||
burn DB-row throughput; a runaway client could fill the validator
|
||||
pool.
|
||||
- **Too tight** → legitimate flake-out. cert-manager's exponential
|
||||
backoff after a `rateLimited` problem is conservative; a 1-hour
|
||||
cooldown is a long time for an operator hitting an unexpected limit.
|
||||
|
||||
Defaults intentionally err on the loose side: 100/hour per account
is generous for any plausible fleet. A 50k-cert deployment renewing
at the 1/3-validity mark consumes ~12 orders/year/cert ≈ 600k
orders/year ≈ 70 orders/hour in total, before that load is spread
across accounts. Tighter limits are appropriate for
deployments with many low-trust accounts.
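
The fleet-wide estimate can be reproduced with a few lines of arithmetic. The function name is hypothetical, and a 365-day year is assumed.

```go
package main

import "fmt"

// fleetOrdersPerHour spreads a year's worth of renewal orders evenly over
// the hours of a 365-day year.
func fleetOrdersPerHour(certs, ordersPerCertPerYear int) float64 {
	const hoursPerYear = 365 * 24
	return float64(certs*ordersPerCertPerYear) / hoursPerYear
}

func main() {
	certs, perCert := 50_000, 12
	fmt.Printf("%d orders/year, %.1f orders/hour fleet-wide\n",
		certs*perCert, fleetOrdersPerHour(certs, perCert))
	// prints "600000 orders/year, 68.5 orders/hour fleet-wide"
}
```

Because the limit applies per account, the whole fleet's average load would fit under a single account's default budget.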
|
||||
|
||||
The buckets are in-memory + per-replica. A 3-replica certctl-server
|
||||
fleet effectively has 3× the configured per-account throughput
|
||||
because each replica's bucket fills independently. For deployments
|
||||
where this matters operationally, the right answer is a shared
rate-limit store (Redis / Postgres-backed). That is not blocking for
the current threat model, where same-account requests typically pin
to the same replica via session affinity.
|
||||
|
||||
## Audit trail
|
||||
|
||||
Every ACME state mutation writes a row to `audit_events`. Actor strings
|
||||
distinguish the auth path:
|
||||
|
||||
- `acme:<account-id>` — kid-path requests (the requesting account
|
||||
signed the JWS).
|
||||
- `acme-cert-key:<serial>` — jwk-path revoke (the cert's own private
|
||||
key signed the JWS).
|
||||
- `acme-system:gc` — scheduler-driven sweeps (no client request).
|
||||
|
||||
Operators querying by actor prefix can reconstruct the full history
|
||||
of any ACME-issued cert. See
|
||||
`docs/acme-server.md` § FAQ "What audit-log events fire" for the
|
||||
event-name catalog.
|
||||
|
||||
## Out-of-scope threats
|
||||
|
||||
Documented to set scope expectations for security reviewers:
|
||||
|
||||
- **DDoS at the TLS layer** — the certctl-server's TLS listener +
|
||||
upstream load balancer / WAF handle this. The ACME-specific rate
|
||||
limits don't substitute for upstream DDoS protection.
|
||||
- **cert-manager-side compromise** — if cert-manager is compromised,
|
||||
it has both the account key and the private keys of every issued
|
||||
cert. Out of certctl's trust boundary; operators run cert-manager
|
||||
with the same care they'd run any other secret-bearing operator.
|
||||
- **Compromised certctl-server filesystem** — the bootstrap CA key
|
||||
lives at `deploy/test/certs/ca.key` (or the operator-managed
|
||||
equivalent). A filesystem compromise is broader than ACME-specific
|
||||
and is covered by certctl's HSM / signer-driver architecture (see
|
||||
`docs/architecture.md` "Signer abstraction").
|
||||
- **Postgres compromise** — the nonce table, account JWKs, and
|
||||
audit log all live in the same Postgres instance. A DB compromise
|
||||
is broader than ACME-specific and is the operator's responsibility
|
||||
to mitigate via standard DB-hardening practices.
|
||||
- **Supply-chain attacks against go-jose / lib/pq** — handled by
|
||||
Dependabot + the `make verify` security gate; not ACME-specific.
|
||||
|
||||
## See also
|
||||
|
||||
- [`docs/acme-server.md`](./acme-server.md) — operator-facing reference.
|
||||
- [`docs/tls.md`](./tls.md) — TLS posture, including the L-001
|
||||
table of `InsecureSkipVerify` justifications (TLS-ALPN-01 row).
|
||||
- [`internal/api/acme/jws.go`](../internal/api/acme/jws.go) — verifier
|
||||
source.
|
||||
- [`internal/api/acme/validators.go`](../internal/api/acme/validators.go)
|
||||
— challenge validator pool.
|
||||
- [`internal/validation/ssrf.go`](../internal/validation/ssrf.go) —
|
||||
SSRF-defense primitives.
|
||||
@@ -0,0 +1,646 @@
|
||||
# certctl ACME Server (Built-in)
|
||||
|
||||
certctl ships an RFC 8555 + RFC 9773 ARI ACME server endpoint at
|
||||
`/acme/profile/<profile-id>/*`. Any RFC 8555 client (cert-manager 1.15+,
|
||||
Caddy, Traefik, win-acme, certbot, Posh-ACME) can integrate with certctl
|
||||
as an ACME issuer with no certctl-side modification — closing the
|
||||
"deploy a certctl agent on every K8s node" friction that costs deals to
|
||||
external PKI vendors today.
|
||||
|
||||
> **Phase status (2026-05-03):** Phase 6 — full operator-facing
|
||||
> reference. The functional surface is complete (Phases 1a-5); this
|
||||
> doc is the canonical procurement-readability reference. New: client-
|
||||
> walkthrough docs for [cert-manager](./acme-cert-manager-walkthrough.md),
|
||||
> [Caddy](./acme-caddy-walkthrough.md), and
|
||||
> [Traefik](./acme-traefik-walkthrough.md); a dedicated
|
||||
> [threat model](./acme-server-threat-model.md); a section-by-section
|
||||
> RFC 8555 + RFC 9773 conformance statement; a 5-failure-mode
|
||||
> troubleshooting playbook; a tested-clients version pinning table.
|
||||
> Track shipped phases via `git log --grep='acme-server:'`.
|
||||
|
||||
## Configuration
|
||||
|
||||
All ACME-server config uses the `CERTCTL_ACME_SERVER_*` env-var prefix
|
||||
(distinct from `CERTCTL_ACME_*` which configures the consumer-side
|
||||
issuer connector). The struct definition lives in
|
||||
`internal/config/config.go::ACMEServerConfig`.
|
||||
|
||||
| Env var | Default | Phase | Description |
|---------|---------|-------|-------------|
| `CERTCTL_ACME_SERVER_ENABLED` | `false` | 1a | Master enable flag. Phase 1a's handler is constructed unconditionally so the registry shape stays stable; routes are registered in `internal/api/router/router.go::RegisterHandlers` regardless. Operators flip this on after configuring per-profile auth_mode. |
| `CERTCTL_ACME_SERVER_DEFAULT_AUTH_MODE` | `trust_authenticated` | 1a | Default value for `certificate_profiles.acme_auth_mode` on newly-created profiles. Existing profiles retain their stored value. Per-profile column is the source of truth at request time. |
| `CERTCTL_ACME_SERVER_DEFAULT_PROFILE_ID` | `""` | 1a | When set, `/acme/*` shorthand mirrors `/acme/profile/<DefaultProfileID>/*` for single-profile deployments. When empty, requests to the shorthand return RFC 7807 + RFC 8555 §6.7 `userActionRequired`. |
| `CERTCTL_ACME_SERVER_NONCE_TTL` | `5m` | 1a | How long an issued ACME nonce remains valid before the JWS verifier (Phase 1b) returns `urn:ietf:params:acme:error:badNonce` per RFC 8555 §6.5.1. Tune up if cert-manager + certctl clocks frequently skew. |
| `CERTCTL_ACME_SERVER_TOS_URL` | `""` | 1a | Optional `meta.termsOfService` URL in the directory document. |
| `CERTCTL_ACME_SERVER_WEBSITE` | `""` | 1a | Optional `meta.website` URL in the directory document. |
| `CERTCTL_ACME_SERVER_CAA_IDENTITIES` | (empty) | 1a | Comma-separated `meta.caaIdentities` list. |
| `CERTCTL_ACME_SERVER_EAB_REQUIRED` | `false` | 1a | `meta.externalAccountRequired` advertisement. EAB enforcement is a follow-up; Phase 1a only advertises. |
| `CERTCTL_ACME_SERVER_ORDER_TTL` | `24h` | 2 | Reserved field, parsed in Phase 1a so operators can set it ahead of Phase 2's order endpoints. |
| `CERTCTL_ACME_SERVER_AUTHZ_TTL` | `24h` | 2 | Reserved. |
| `CERTCTL_ACME_SERVER_HTTP01_CONCURRENCY` | `10` | 3 | Reserved. |
| `CERTCTL_ACME_SERVER_DNS01_RESOLVER` | `8.8.8.8:53` | 3 | Reserved. |
| `CERTCTL_ACME_SERVER_DNS01_CONCURRENCY` | `10` | 3 | Reserved. |
| `CERTCTL_ACME_SERVER_TLSALPN01_CONCURRENCY` | `10` | 3 | Reserved. |
| `CERTCTL_ACME_SERVER_ARI_ENABLED` | `true` | 4 | Toggles the RFC 9773 ARI surface: both the `renewalInfo` URL in the directory document and the GET `/renewal-info/<cert-id>` handler. Set to `false` to drop ARI from the directory; ACME clients fall back to static renewal scheduling. |
| `CERTCTL_ACME_SERVER_ARI_POLL_INTERVAL` | `6h` | 4 | Server-policy `Retry-After` value the ARI handler emits on a 200 response. RFC 9773 §4.2 leaves this server-policy. Tighten to `1h` for short-lived certs; loosen to `24h` for standard 90-day certs. |
| `CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR` | `100` | 5 | Per-account orders/hour cap. `0` disables. Hits return RFC 7807 + RFC 8555 §6.7 `urn:ietf:params:acme:error:rateLimited` with `Retry-After`. In-memory token-bucket; restart wipes the counter (eventual-consistency caps are acceptable). |
| `CERTCTL_ACME_SERVER_RATE_LIMIT_CONCURRENT_ORDERS` | `5` | 5 | Per-account cap on simultaneously-active orders (status in pending/ready/processing). `0` disables. Same RFC 7807 + RFC 8555 §6.7 problem shape as the per-hour cap. |
| `CERTCTL_ACME_SERVER_RATE_LIMIT_KEY_CHANGE_PER_HOUR` | `5` | 5 | Per-account key-rollover cap. `0` disables. Default 5/hour: rollovers should be rare; a flood is an attack signal. |
| `CERTCTL_ACME_SERVER_RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR` | `60` | 5 | Per-challenge-id respond cap. `0` disables. Defends against retry storms from a misbehaving client. Keyed by challenge-id (not account-id) so a flood against one challenge doesn't drain the account's whole budget. |
| `CERTCTL_ACME_SERVER_GC_INTERVAL` | `1m` | 5 | Tick interval for the ACME GC scheduler loop. On each tick: (1) DELETE used / expired nonces; (2) UPDATE pending authzs whose `expires_at < NOW()` to `expired`; (3) UPDATE pending/ready/processing orders whose `expires_at < NOW()` to `invalid`. Each sweep is a single SQL statement; the loop is idempotent + bounded by a 1m per-sweep timeout. `0` disables the loop. |
|
||||
|
||||
## Per-profile auth mode
|
||||
|
||||
Two modes per `certificate_profiles.acme_auth_mode`:
|
||||
|
||||
- **`trust_authenticated`** (default for internal PKI). The JWS-
|
||||
authenticated ACME account is trusted to issue certs for any
|
||||
identifier the profile policy allows; there is no per-identifier
|
||||
ownership proof. The most common certctl use case.
|
||||
- **`challenge`**. Full HTTP-01 + DNS-01 + TLS-ALPN-01 validation per
|
||||
RFC 8555 §8. Required when certctl is exposing public-trust-style PKI.
|
||||
|
||||
A single certctl-server can serve both modes simultaneously — the mode
|
||||
is read from the bound profile's column at request time, not cached at
|
||||
server start. Operators can flip a profile's mode via SQL and the next
|
||||
order picks up the new mode without restart.
|
||||
|
||||
The `CERTCTL_ACME_SERVER_DEFAULT_AUTH_MODE` env var sets the default
|
||||
value for newly-created profiles (e.g. via the certctl API). Existing
|
||||
profile rows retain whatever value they were created with.
|
||||
|
||||
## TLS trust bootstrap (read this before configuring cert-manager)
|
||||
|
||||
When certctl-server uses a self-signed TLS bootstrap cert
|
||||
(`deploy/test/certs/server.crt` is the demo default; see
|
||||
[`docs/tls.md`](./tls.md)), cert-manager 1.15+ will refuse to talk to
|
||||
the directory URL unless the certctl root is trusted. The fix lives in
|
||||
`ClusterIssuer.spec.acme.caBundle`:
|
||||
|
||||
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: certctl-test
spec:
  acme:
    server: https://certctl.example.com:8443/acme/profile/prof-corp/directory
    email: ops@example.com
    # base64-encoded PEM of certctl's self-signed root
    caBundle: |
      LS0tLS1CRUdJTi...
    privateKeySecretRef:
      name: certctl-test-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
```
|
||||
|
||||
The `caBundle` value is the base64-encoded PEM of the root that signed
|
||||
your certctl-server's TLS certificate. Extract it from your operator
|
||||
bootstrap (e.g. `cat deploy/test/certs/ca.crt | base64 -w0`).
|
||||
|
||||
This is the single biggest first-time-deploy footgun on the cert-manager
|
||||
integration path. The full cert-manager walkthrough lands in Phase 6;
|
||||
the `caBundle` requirement is flagged here in Phase 1a's docs because
|
||||
operators hit it the moment they try to point a real ACME client at
|
||||
certctl.
|
||||
|
||||
## Auth-mode decision tree
|
||||
|
||||
Use `trust_authenticated` when:
|
||||
|
||||
- The certctl deployment serves **internal-only PKI** (intranet certs,
|
||||
service-mesh certs, IoT bootstrap). Identifiers in your CSRs are
|
||||
controlled by your infrastructure, not by the public Internet.
|
||||
- You don't have HTTP/DNS reachability **from certctl-server back to
|
||||
the ACME client's solver** (e.g., the client lives in an isolated
|
||||
network segment certctl-server can't reach).
|
||||
- You want the simplest cert-manager integration: cert-manager submits
|
||||
a CSR, certctl issues; no out-of-band ownership proof.
|
||||
- You're issuing under your own root CA whose trust is operator-managed
|
||||
(NOT WebPKI). Public CAs cannot use this mode — RFC 8555 §8 ownership
|
||||
proof is non-negotiable for public-trust roots.
|
||||
|
||||
Use `challenge` when:
|
||||
|
||||
- The deployment is **public-trust-style PKI** — even if your root is
|
||||
privately operated, you want CA/Browser Forum-style ownership-proof
|
||||
semantics so a stolen account key can't be used to issue for arbitrary
|
||||
identifiers.
|
||||
- You have HTTP-01 / DNS-01 / TLS-ALPN-01 reachability from the
|
||||
certctl-server to the ACME client's solver. (HTTP-01 needs port 80
|
||||
ingress to the client; DNS-01 needs DNS recursion; TLS-ALPN-01 needs
|
||||
port 443 ingress.)
|
||||
- You want defense-in-depth: an account-key compromise costs the
|
||||
attacker nothing without also compromising the solver-side
|
||||
infrastructure.
|
||||
|
||||
A single certctl-server can run both modes simultaneously — the auth
|
||||
mode is a per-profile column on `certificate_profiles.acme_auth_mode`,
|
||||
read at request time. Operators flip a profile's mode via SQL or the
|
||||
profile API, and the next order picks up the new mode without restart.
|
||||
|
||||
## Endpoints
|
||||
|
||||
Routes registered in `internal/api/router/router.go::RegisterHandlers`:
|
||||
|
||||
| Method | Path | RFC ref | Auth | Description |
|--------|------|---------|------|-------------|
| GET | `/acme/profile/{id}/directory` | RFC 8555 §7.1.1 | unauth | Per-profile directory document. |
| HEAD | `/acme/profile/{id}/new-nonce` | RFC 8555 §7.2 | unauth | Returns 200 + Replay-Nonce header. |
| GET | `/acme/profile/{id}/new-nonce` | RFC 8555 §7.2 | unauth | Returns 204 + Replay-Nonce header. |
| POST | `/acme/profile/{id}/new-account` | RFC 8555 §7.3 | JWS jwk | Register a new account; idempotent re-registration of an existing JWK returns the existing row. |
| POST | `/acme/profile/{id}/account/{acc_id}` | RFC 8555 §7.3.2 + §7.3.6 | JWS kid | Update contact list, deactivate, or POST-as-GET (RFC 8555 §6.3) to fetch the account. |
| POST | `/acme/profile/{id}/new-order` | RFC 8555 §7.4 | JWS kid | Submit an order; identifier validation runs before order creation. |
| POST | `/acme/profile/{id}/order/{ord_id}` | RFC 8555 §7.4 | JWS kid | POST-as-GET fetch of an order's current state. |
| POST | `/acme/profile/{id}/order/{ord_id}/finalize` | RFC 8555 §7.4 | JWS kid | Submit the CSR + finalize. Issues + persists managed cert row + version. |
| POST | `/acme/profile/{id}/authz/{authz_id}` | RFC 8555 §7.5 | JWS kid | POST-as-GET fetch of an authorization. |
| POST | `/acme/profile/{id}/challenge/{chall_id}` | RFC 8555 §7.5.1 | JWS kid | Submit a challenge for validation. Dispatches to a bounded-concurrency worker pool; clients poll authz for the eventual result. |
| POST | `/acme/profile/{id}/cert/{cert_id}` | RFC 8555 §7.4.2 | JWS kid | POST-as-GET cert chain download (PEM). |
| POST | `/acme/profile/{id}/key-change` | RFC 8555 §7.3.5 | JWS kid (outer) + jwk (inner) | Doubly-signed account-key rollover. |
| POST | `/acme/profile/{id}/revoke-cert` | RFC 8555 §7.6 | JWS kid OR jwk | Revoke a cert via the issuing account's key OR the cert's own private key. Routes through the certctl revocation pipeline. |
| GET | `/acme/profile/{id}/renewal-info/{cert_id}` | RFC 9773 | unauth | Fetch the suggested renewal window for a cert (cert-id is `base64url(AKI).base64url(serial)` per RFC 9773 §4.1). Response carries `Retry-After`. |
| GET | `/acme/directory` | RFC 8555 §7.1.1 | unauth | Shorthand path; mirrors per-profile when `CERTCTL_ACME_SERVER_DEFAULT_PROFILE_ID` is set. |
| HEAD | `/acme/new-nonce` | RFC 8555 §7.2 | unauth | Shorthand. |
| GET | `/acme/new-nonce` | RFC 8555 §7.2 | unauth | Shorthand. |
| POST | `/acme/new-account` | RFC 8555 §7.3 | JWS jwk | Shorthand. |
| POST | `/acme/account/{acc_id}` | RFC 8555 §7.3.2 + §7.3.6 | JWS kid | Shorthand. |
| POST | `/acme/new-order` | RFC 8555 §7.4 | JWS kid | Shorthand. |
| POST | `/acme/order/{ord_id}` | RFC 8555 §7.4 | JWS kid | Shorthand. |
| POST | `/acme/order/{ord_id}/finalize` | RFC 8555 §7.4 | JWS kid | Shorthand. |
| POST | `/acme/authz/{authz_id}` | RFC 8555 §7.5 | JWS kid | Shorthand. |
| POST | `/acme/cert/{cert_id}` | RFC 8555 §7.4.2 | JWS kid | Shorthand. |
| POST | `/acme/key-change` | RFC 8555 §7.3.5 | JWS kid (outer) + jwk (inner) | Shorthand. |
| POST | `/acme/revoke-cert` | RFC 8555 §7.6 | JWS kid OR jwk | Shorthand. |
| GET | `/acme/renewal-info/{cert_id}` | RFC 9773 | unauth | Shorthand. |
|
||||
|
||||
After Phase 4, the full RFC 8555 + RFC 9773 surface is live. RFC 8739
|
||||
(short-lived certs) and EAB enforcement remain follow-up work; cert-
|
||||
manager + boulder-tested clients work today against the surface above.
|
||||
|
||||
## RFC 8555 + RFC 9773 conformance statement
|
||||
|
||||
Honest disclosure of what's implemented, where, and what's not. Procurement
|
||||
engineers running gap analyses against cert-manager + Let's Encrypt's
|
||||
conformance posture should read this section before anything else.
|
||||
|
||||
### Implemented
|
||||
|
||||
| Section | Surface | Phase | First commit |
|---------|---------|-------|--------------|
| RFC 8555 §6.2 | JWS auth + RS256/ES256/EdDSA allow-list | 1b | `27bd660` |
| RFC 8555 §6.3 | POST-as-GET | 1b | `27bd660` |
| RFC 8555 §6.4 | URL-header binding to request URL | 1b | `27bd660` |
| RFC 8555 §6.5 | Replay-Nonce + DB-backed nonce store | 1a | `e146b00` |
| RFC 8555 §6.7 | RFC 7807 problem documents | 1a | `e146b00` |
| RFC 8555 §7.1 | Directory | 1a | `e146b00` |
| RFC 8555 §7.2 | new-nonce HEAD + GET | 1a | `e146b00` |
| RFC 8555 §7.3 | new-account + idempotent re-registration | 1b | `27bd660` |
| RFC 8555 §7.3.2 + §7.3.6 | account update + deactivation | 1b | `27bd660` |
| RFC 8555 §7.3.5 | doubly-signed key rollover | 4 | `0299e4a` |
| RFC 8555 §7.4 | new-order + finalize + cert download | 2 | `4ee486e` |
| RFC 8555 §7.5 | authz POST-as-GET | 2 | `4ee486e` |
| RFC 8555 §7.5.1 | challenge response | 3 | `7e22204` |
| RFC 8555 §7.6 | revoke-cert (kid + jwk paths) | 4 | `0299e4a` |
| RFC 8555 §8.3 | HTTP-01 challenge validator | 3 | `7e22204` |
| RFC 8555 §8.4 | DNS-01 challenge validator | 3 | `7e22204` |
| RFC 8737 | TLS-ALPN-01 challenge validator | 3 | `7e22204` |
| RFC 9773 | ACME Renewal Information (ARI) | 4 | `0299e4a` |
|
||||
|
||||
### Not implemented (procurement-honest)
|
||||
|
||||
| Spec area | Status | Notes |
|-----------|--------|-------|
| RFC 8555 §7.3.4: External Account Binding (EAB) | **Not implemented.** | Advertised in directory `meta.externalAccountRequired` but enforcement is a follow-up. Operators relying on EAB for account-creation gating should layer an upstream WAF. |
| RFC 8555 §8.4 + §7.4: Wildcard with `*.` prefix > 1 level | **Not implemented.** | Single-level wildcards (e.g. `*.example.com`) work end-to-end. Multi-level wildcards (`*.*.example.com`) are RFC-spec-ambiguous and rejected at the identifier-validation layer. |
| RFC 8739: Short-lived (STAR) certs | **Not implemented.** | Operators wanting <7-day validity tune the bound issuer's TTL directly via `CertificateProfile.MaxTTLSeconds`; the ACME wire shape doesn't expose a separate notion. |
| Cross-CA proxying | **Not implemented.** | Each profile binds to one issuer. Multi-CA federation (one ACME account → multi-CA selection per identifier) is roadmap. |
| RFC 8555 §6.7: `accountDoesNotExist` problem with hint URL | Partial. | Sentinel returns `accountDoesNotExist`; the optional hint URL embedding the `kid` is not emitted. cert-manager doesn't consume it. |
|
||||
|
||||
If a procurement-side gap analysis turns up something not in either
|
||||
table above, the answer is "we don't know yet" — operator-side issues
|
||||
welcome.
|
||||
|
||||
## Finalize routing through `CertificateService.Create` (Phase 2 architecture)
|
||||
|
||||
The finalize path mirrors how every other certctl issuance surface
|
||||
(EST, SCEP, agent, REST API) routes through the canonical pipeline:
|
||||
|
||||
1. JWS-verify the request (`internal/api/acme/jws.go`).
|
||||
2. Validate the CSR's DNS-name set equals the order's identifier set
|
||||
exactly (case-folded). Mismatches return RFC 8555
|
||||
`urn:ietf:params:acme:error:badCSR`.
|
||||
3. Update the order row to `status=processing` (`s.tx.WithinTx` +
|
||||
`auditService.RecordEventWithTx` — atomic with audit row).
|
||||
4. Issue the cert via the bound profile's `IssuerConnector` adapter
|
||||
(same `IssueCertificate(ctx, commonName, sans, csrPEM, ekus,
|
||||
maxTTLSeconds, mustStaple)` call EST/SCEP/agent take).
|
||||
5. Insert the `managed_certificates` row via
|
||||
`service.CertificateService.Create(ctx, *ManagedCertificate, actor)`.
|
||||
Source is stamped `domain.CertificateSourceACME` so operators can
|
||||
bulk-revoke ACME-issued certs by filtering on `Source=ACME`.
|
||||
6. Insert the `certificate_versions` row +
|
||||
transition the order to `status=valid` with `certificate_id` set
|
||||
(one final `WithinTx` covering both writes + the audit row).
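
Step 2's exact-set comparison can be sketched as follows. This is an illustrative helper, not the code in `internal/api/acme`: the name `sameIdentifierSet` is hypothetical, and treating duplicate names as a single entry is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// sameIdentifierSet reports whether the CSR's DNS names and the order's
// identifiers form the same set after case-folding. Duplicates collapse
// into one entry (an assumption of this sketch).
func sameIdentifierSet(csrDNSNames, orderIdentifiers []string) bool {
	want := make(map[string]struct{}, len(orderIdentifiers))
	for _, id := range orderIdentifiers {
		want[strings.ToLower(id)] = struct{}{}
	}
	got := make(map[string]struct{}, len(csrDNSNames))
	for _, name := range csrDNSNames {
		got[strings.ToLower(name)] = struct{}{}
	}
	if len(got) != len(want) {
		return false
	}
	for name := range got {
		if _, ok := want[name]; !ok {
			return false
		}
	}
	return true
}

func main() {
	order := []string{"api.example.com", "www.example.com"}
	fmt.Println(sameIdentifierSet([]string{"WWW.example.com", "api.example.com"}, order)) // true
	fmt.Println(sameIdentifierSet([]string{"api.example.com"}, order))                    // false: a name is missing
}
```

A `false` result corresponds to the `badCSR` problem described in step 2: both a missing name and an extra name fail the comparison.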
|
||||
|
||||
This means RenewalPolicy, CertificateProfile, per-issuer-type
|
||||
Prometheus metrics, audit rows, and revocation-pipeline integration
|
||||
all apply uniformly to ACME-issued certs via the same code path that
|
||||
already serves EST/SCEP/agent/REST issuance.
|
||||
|
||||
The atomicity boundary: there is a brief window between step 5 (cert
|
||||
exists) and step 6 (order shows valid) where the order row still says
|
||||
`processing`. Phase 5's GC scheduler reconciles. The actor string on
|
||||
audit rows is `acme:<account-id>`.
|
||||
|
||||
## JWS verification (Phase 1b)
|
||||
|
||||
Every JWS-authenticated POST runs through the verifier at
|
||||
`internal/api/acme/jws.go::VerifyJWS`. The verifier enforces:
|
||||
|
||||
1. The JWS parses as a flattened single-signature object (multi-sig is
|
||||
rejected per RFC 8555 §6.2).
|
||||
2. The signature algorithm is in the closed allow-list `{RS256, ES256,
|
||||
EdDSA}` per RFC 8555 §6.2 — `none`, `HS256`, and every other alg
|
||||
are refused at parse time.
|
||||
3. The protected header carries exactly one of `kid` (registered
|
||||
account) or `jwk` (new-account flow); endpoints declare which they
|
||||
require.
|
||||
4. The protected header `url` matches the inbound request URL exactly.
|
||||
5. The protected header `nonce` is consumed against the
|
||||
`acme_nonces` store; missing / replayed / expired nonces return
|
||||
`urn:ietf:params:acme:error:badNonce` per RFC 8555 §6.5.1.
|
||||
6. On the `kid` path: the kid URL round-trips against the canonical
|
||||
per-profile shape, the referenced account exists, and its status
|
||||
is `valid`. Deactivated / revoked accounts cannot authenticate.
|
||||
7. The signature verifies against the resolved key (registered
|
||||
account's stored JWK on the kid path; embedded jwk on the jwk path).
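
Checks 2 to 4 above are plain comparisons on the protected header and can be sketched without a JOSE library. The `protectedHeader` type and `checkHeader` function are hypothetical and are not the shape of `VerifyJWS`. The error strings reuse RFC 8555 §6.7 problem-type names, but which type certctl returns for each case is an assumption. Endpoints that accept either key reference (such as revoke-cert) are not modeled.

```go
package main

import (
	"errors"
	"fmt"
)

// protectedHeader holds the already-decoded protected-header fields this
// sketch needs (hypothetical type for illustration).
type protectedHeader struct {
	Alg    string
	Kid    string // account URL; empty when absent
	HasJWK bool   // true when an embedded jwk is present
	URL    string
}

// allowedAlgs is the closed allow-list named in check 2.
var allowedAlgs = map[string]bool{"RS256": true, "ES256": true, "EdDSA": true}

// checkHeader applies checks 2 to 4: algorithm allow-list, exactly one of
// kid/jwk (and the one this endpoint requires), and exact URL binding.
func checkHeader(h protectedHeader, requestURL string, requireKid bool) error {
	if !allowedAlgs[h.Alg] {
		return errors.New("badSignatureAlgorithm")
	}
	hasKid := h.Kid != ""
	if hasKid == h.HasJWK {
		return errors.New("malformed: need exactly one of kid or jwk")
	}
	if hasKid != requireKid {
		return errors.New("malformed: wrong key reference for this endpoint")
	}
	if h.URL != requestURL {
		return errors.New("unauthorized: url header does not match request URL")
	}
	return nil
}

func main() {
	// Example URLs are placeholders built from this document's sample host.
	url := "https://certctl.example.com:8443/acme/profile/prof-corp/new-order"
	h := protectedHeader{
		Alg: "ES256",
		Kid: "https://certctl.example.com:8443/acme/profile/prof-corp/account/1",
		URL: url,
	}
	fmt.Println(checkHeader(h, url, true)) // <nil>
	h.Alg = "HS256"
	fmt.Println(checkHeader(h, url, true)) // badSignatureAlgorithm
}
```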
|
||||
|
||||
Every state-mutating account operation (create, contact update,
|
||||
deactivate) writes its `acme_accounts` row and an `audit_events` row
|
||||
inside one `repository.Transactor.WithinTx` call — the canonical
|
||||
certctl atomicity contract (matches `service.CertificateService.Create`
|
||||
at `internal/service/certificate.go:131`).
|
||||
|
||||
## Phases (cross-reference)
|
||||
|
||||
| Phase | Status | Surface |
|-------|--------|---------|
| 1a | live | directory + new-nonce + per-profile routing |
| 1b | live | new-account + account/{id} + JWS verifier (RFC 7515 + go-jose v4) |
| 2 | live | orders + authzs + finalize + cert download (trust_authenticated mode end-to-end) |
| 3 | live | HTTP-01 + DNS-01 + TLS-ALPN-01 challenge validation (challenge mode end-to-end) |
| 4 | live | key rollover (RFC 8555 §7.3.5) + revoke-cert (§7.6) + ARI (RFC 9773) |
| 5 | live | rate limits + GC sweeper + kind-driven cert-manager integration test + lego conformance harness + k6 ACME-flow scenario |
| 6 | live | full operator-facing reference + walkthroughs (cert-manager / Caddy / Traefik) + threat model + RFC-8555 conformance statement + troubleshooting + version pinning |
|
||||
|
||||
Track shipped phases via `git log --grep='acme-server:' --oneline`.
|
||||
|
||||
## Operational notes (Phase 1a)
|
||||
|
||||
- **Schema:** `migrations/000025_acme_server.up.sql` adds 5 ACME tables
|
||||
+ the `certificate_profiles.acme_auth_mode` column. Phase 1a actively
|
||||
uses only `acme_nonces`. The full schema ships now so the migration
|
||||
is stable and Phases 1b-4 don't need additional `CREATE TABLE`
|
||||
migrations.
|
||||
|
||||
- **Replay protection:** nonces are persisted in `acme_nonces` (NOT
|
||||
in-memory). They survive server restart, which is required for the
|
||||
RFC 8555 §6.5 replay defense to hold against a multi-replica
|
||||
certctl-server fleet behind a load balancer.
|
||||
|
||||
- **Metrics:** the service layer exposes per-op atomic counters via
|
||||
`service.ACMEService.Metrics().Snapshot()`:
|
||||
- `certctl_acme_directory_total`
|
||||
- `certctl_acme_directory_failures_total`
|
||||
- `certctl_acme_new_nonce_total`
|
||||
- `certctl_acme_new_nonce_failures_total`
|
||||
|
||||
Phase 1b will extend with `new_account` counters; Phase 2 with order
|
||||
/ finalize / cert; Phase 3 with per-challenge-type counters.
|
||||
|
||||
- **Audit:** Phase 1a is read-mostly (directory + nonce). Phase 1b's
|
||||
account-creation path will route through the canonical
|
||||
`s.tx.WithinTx(...)` + `auditService.RecordEventWithTx(...)` pattern
|
||||
so every account state mutation is paired with an `audit_events`
|
||||
row.
|
||||
|
||||
## Phase 4 — key rollover, revocation, ARI
|
||||
|
||||
### How do I rotate my ACME account key?
|
||||
|
||||
RFC 8555 §7.3.5 defines a doubly-signed JWS for the rollover. The OUTER
|
||||
JWS is signed by the OLD account key (kid path); its payload IS the
|
||||
INNER JWS, which is signed by the NEW account key (jwk path). cert-
|
||||
manager and lego do this for you transparently — `lego renew --key-rotate`
|
||||
or the cert-manager `Issuer.spec.acme.privateKeySecretRef` rollover.
|
||||
|
||||
Server-side validation:
|
||||
|
||||
1. Outer JWS verifies against the registered account's current key.
|
||||
2. Inner JWS verifies against the embedded NEW jwk (proves possession).
|
||||
3. Inner payload `account` matches outer `kid`.
|
||||
4. Inner payload `oldKey` thumbprint-equals the registered key.
|
||||
5. Inner protected `url` equals outer protected `url`.
|
||||
6. New JWK thumbprint not already registered against the same profile.
|
||||
7. `SELECT … FOR UPDATE` on the account row serializes concurrent
|
||||
rollovers; the loser sees the winner's new thumbprint and is told
|
||||
to retry (409).
|
||||
|
||||
### How do I revoke an ACME-issued cert?
|
||||
|
||||
Two auth paths per RFC 8555 §7.6:
|
||||
|
||||
- **kid path:** sign with your account key. The server checks the
|
||||
account "owns" the cert via `acme_orders.certificate_id` lookup.
|
||||
- **jwk path:** sign with the cert's own private key. The server
|
||||
extracts the cert's public key, computes the JWK, and asserts it
|
||||
matches the embedded jwk thumbprint.
|
||||
|
||||
Either path routes through `service.RevocationSvc.RevokeCertificateWithActor`
|
||||
— the same pipeline the GUI revoke button, bulk-revocation, and the
|
||||
ACME-consumer issuer use. So the cert-row update + revocation row + audit
|
||||
row are all atomic in one `WithinTx`, the issuer is best-effort
|
||||
notified, and the OCSP response cache is invalidated.
|
||||
|
||||
Reason codes follow RFC 5280 §5.3.1; codes 8 (removeFromCRL) and 10
|
||||
(aACompromise) are not in certctl's `domain.ValidRevocationReasons`
|
||||
set so they clamp to `unspecified`.
|
||||
|
||||
### What is ARI?
|
||||
|
||||
RFC 9773 ACME Renewal Information. Clients GET
|
||||
`/acme/profile/<id>/renewal-info/<cert-id>` (unauthenticated) and
|
||||
receive a JSON document with `suggestedWindow.start` and `.end` —
|
||||
the server's recommendation for when to renew. The response also
|
||||
carries `Retry-After` (RFC 9773 §4.2) hinting at the next-poll cadence.
|
||||
|
||||
Cert-id format is `base64url(authorityKeyIdentifier).base64url(serial)`
|
||||
per RFC 9773 §4.1.
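
A minimal sketch of building that identifier in Go, assuming the caller already has the certificate's Authority Key Identifier bytes and the big-endian bytes of its serial number (for example from `cert.AuthorityKeyId` and `cert.SerialNumber.Bytes()` on a parsed `x509.Certificate`). The function name `ariCertID` is hypothetical. The serial is encoded as a DER INTEGER value, so a leading zero byte is added when the high bit is set, and both parts use unpadded base64url.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// ariCertID joins base64url(AKI) and base64url(serial) with a dot.
// serial is the big-endian magnitude of a non-negative serial number;
// a 0x00 byte is prepended when needed to keep the DER INTEGER positive.
func ariCertID(aki, serial []byte) string {
	ser := serial
	if len(ser) == 0 || ser[0]&0x80 != 0 {
		ser = append([]byte{0x00}, ser...)
	}
	enc := base64.RawURLEncoding // base64url without '=' padding
	return enc.EncodeToString(aki) + "." + enc.EncodeToString(ser)
}

func main() {
	// Sample values matching the worked example in RFC 9773 §4.1.
	aki := []byte{
		0x69, 0x88, 0x5B, 0x6B, 0x87, 0x46, 0x40, 0x41, 0xE1, 0xB3,
		0x7B, 0x84, 0x7B, 0xA0, 0xAE, 0x2C, 0xDE, 0x01, 0xC8, 0xD4,
	}
	serial := []byte{0x87, 0x65, 0x43, 0x21}
	fmt.Println(ariCertID(aki, serial)) // aYhba4dGQEHhs3uEe6CuLN4ByNQ.AIdlQyE
}
```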
|
||||
|
||||
Window math:
|
||||
|
||||
- Cert with a bound renewal policy: window starts at
|
||||
`notAfter - RenewalWindowDays`, ends at `notAfter - RenewalWindowDays/2`.
|
||||
So a 30-day window cert with notAfter 2026-06-30 emits start=2026-05-31,
|
||||
end=2026-06-15. Boulder-shape default that lets cert-manager schedule
|
||||
inside our renewal window.
|
||||
- No policy: window is the last 33% of validity.
|
||||
- Past expiry: window is "now" → "now + 24h" (renew immediately).
|
||||
|
||||
Disable ARI globally with `CERTCTL_ACME_SERVER_ARI_ENABLED=false`. The
|
||||
URL drops out of the directory; the route is still registered but
|
||||
returns 404 — clients fall back to static renewal scheduling.
|
||||
|
||||
## Phase 5 — operational guidance
|
||||
|
||||
### Rate limiting
|
||||
|
||||
Production deployments serving multiple ACME profiles or fleets should
|
||||
keep the default rate limits in place. The four caps:
|
||||
|
||||
- `RATE_LIMIT_ORDERS_PER_HOUR` (100) — per-account new-order cap. A
|
||||
cert-manager Certificate that auto-renews at the 1/3 mark of its
|
||||
validity (90-day cert → ~30-day renewal) consumes ~12 orders/year
|
||||
per managed Certificate. 100/hour is generous for any plausible
|
||||
fleet.
|
||||
- `RATE_LIMIT_CONCURRENT_ORDERS` (5) — per-account cap on
|
||||
pending/ready/processing orders. Stops a runaway client from
|
||||
starving DB-row throughput. Tune up only if you observe legitimate
|
||||
bursts.
|
||||
- `RATE_LIMIT_KEY_CHANGE_PER_HOUR` (5) — rollovers are rare; a flood
|
||||
is an attack signal. Tune down to 1/hour if your operator
|
||||
procedure mandates manual rollovers only.
|
||||
- `RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR` (60) — per-challenge cap,
|
||||
defends against retry storms.
|
||||
|
||||
Hits return RFC 8555 §6.7 `rateLimited` Problem with a `Retry-After`
|
||||
header. cert-manager 1.15+ honors the header; lego too. Older clients
|
||||
may not — that's the client's problem, not certctl's.
|
||||
|
||||
The buckets are **in-memory + per-replica**. A 3-replica certctl-
|
||||
server fleet behind a load balancer effectively has 3× the configured
|
||||
throughput (each replica's bucket fills independently). For
|
||||
deployments where this matters operationally, the right answer is a
|
||||
shared rate-limit store — that's a follow-up; not blocking for the
|
||||
current threat model where same-account requests typically pin to
|
||||
the same replica via session affinity.
|
||||
|
||||
### GC sweeper
|
||||
|
||||
The scheduler runs the GC sweep every `GC_INTERVAL` (default 1m). Each
tick runs three independent sweeps, one SQL statement apiece:
|
||||
|
||||
1. `DELETE FROM acme_nonces WHERE used = TRUE OR expires_at < NOW()`.
|
||||
2. `UPDATE acme_authorizations SET status='expired' WHERE status='pending' AND expires_at < NOW()`.
|
||||
3. `UPDATE acme_orders SET status='invalid', error=... WHERE status IN ('pending','ready','processing') AND expires_at < NOW()`.
|
||||
|
||||
Each statement is bounded by a 1-minute per-sweep timeout. A failing
|
||||
sweep is logged + retried on the next tick; a tick that overruns its
|
||||
budget is skipped (the existing-tick atomic-Bool guard prevents
|
||||
overlap). Counts are exposed via `certctl_acme_gc_*` Prometheus
|
||||
metrics.
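
The overlap guard mentioned above follows a common Go pattern, sketched here with `sync/atomic.Bool` (Go 1.19+). The `sweeper` type and `tryRun` method are hypothetical names, not the scheduler's real API.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sweeper guards a sweep function so that at most one run is in flight.
type sweeper struct {
	running atomic.Bool
}

// tryRun executes sweep unless a previous run is still in progress, in
// which case it returns false and the tick is skipped.
func (s *sweeper) tryRun(sweep func()) bool {
	if !s.running.CompareAndSwap(false, true) {
		return false // previous tick still running: skip this one
	}
	defer s.running.Store(false)
	sweep()
	return true
}

func main() {
	s := &sweeper{}
	ran := s.tryRun(func() {
		// A tick that fires while the sweep is still running is skipped.
		fmt.Println("overlapping tick ran:", s.tryRun(func() {})) // false
	})
	fmt.Println("first tick ran:", ran)                    // true
	fmt.Println("next tick ran:", s.tryRun(func() {})) // true
}
```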
|
||||
|
||||
### cert-manager integration test
|
||||
|
||||
`make acme-cert-manager-test` brings up a kind cluster, installs
|
||||
cert-manager 1.15.0, helm-deploys certctl-server with
|
||||
`acmeServer.enabled=true`, and verifies a Certificate resource issues
|
||||
end-to-end. Skipped in CI by default (kind is too heavy for per-PR);
|
||||
operators run locally on workstation. See
|
||||
`deploy/test/acme-integration/` for the YAML + Go test harness.
|
||||
|
||||
### lego RFC conformance harness
|
||||
|
||||
`make acme-rfc-conformance-test` drives lego v4 against a hermetic
|
||||
certctl-server stack, exercising register → new-order → finalize.
|
||||
Operators run this when shipping behavior changes to the ACME surface
|
||||
to confirm a real third-party client still works.
|
||||
|
||||
### k6 ACME flows scenario
|
||||
|
||||
`deploy/test/loadtest/k6/acme_flow.js` exercises the unauthenticated
|
||||
surface (directory + new-nonce + ARI) at 100 VUs × 5m. JWS-signed
|
||||
flows are out of scope for k6 (no JWS support); they're covered by
|
||||
the lego conformance harness above. Baseline numbers + thresholds in
|
||||
`deploy/test/loadtest/README.md`.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
The five failure modes operators hit most often + the canonical fix
|
||||
for each.
|
||||
|
||||
### `cert-manager logs: 400 Bad Request: badNonce`
|
||||
|
||||
**Cause:** Either a nonce was replayed (a buggy client retries the
|
||||
same JWS), the cert-manager + certctl-server clocks differ by more
|
||||
than `CERTCTL_ACME_SERVER_NONCE_TTL` (default 5 min), or the
|
||||
nonce-store row was reaped between issuance and use.
|
||||
|
||||
**Fix:** First check NTP on both sides. If clocks are healthy,
|
||||
lengthen `CERTCTL_ACME_SERVER_NONCE_TTL` to 10m or 15m. If the
|
||||
problem persists, check for a multi-replica certctl-server fleet
|
||||
without sticky session affinity — the nonce DB row lives on one
|
||||
replica; if the JWS POST hits a different replica before replication
|
||||
catches up, you observe spurious `badNonce`. Solution: pin client
|
||||
sessions to a single replica via load-balancer cookie / `kid`-hash
|
||||
routing, OR shorten replication lag if your DB is the bottleneck.
|
||||
|
||||
### `cert-manager logs: x509: certificate signed by unknown authority`
|
||||
|
||||
**Cause:** cert-manager refuses to talk to the directory URL because
|
||||
its TLS chain doesn't terminate at a root in cert-manager's trust
|
||||
store. certctl-server's bootstrap cert (Phase 1a, `deploy/test/certs/server.crt`)
|
||||
is self-signed.
|
||||
|
||||
**Fix:** Add the `caBundle` field to your `ClusterIssuer.spec.acme` —
|
||||
see the [TLS trust bootstrap](#tls-trust-bootstrap-read-this-before-configuring-cert-manager)
|
||||
section above for the 3-step recipe. This is **the** single biggest
|
||||
first-time-deploy footgun on the cert-manager integration path.
|
||||
|
||||
### HTTP-01 validator returns `connection refused`
|
||||
|
||||
**Cause:** The HTTP-01 solver's Ingress / Service is not reachable
|
||||
from certctl-server's network. Common subcases: (a) the cert-manager
|
||||
http-solver pod is on a private network certctl-server can't reach;
|
||||
(b) a firewall blocks port 80 inbound to the solver's address; (c)
|
||||
the Ingress class annotation doesn't match an installed ingress
|
||||
controller; (d) your DNS still points at an old IP.
|
||||
|
||||
**Fix:** From the certctl-server pod, `curl -v
|
||||
http://<identifier>/.well-known/acme-challenge/<token>` and read the
|
||||
network error. If the curl fails the same way, the network path is
|
||||
the issue. If curl works but the validator fails, check the validator
|
||||
log lines — the SSRF guard rejects reserved IPs (RFC1918, link-local,
|
||||
cloud-metadata 169.254.169.254). Public-trust style profiles that
|
||||
need to reach RFC1918 solvers must be moved to `trust_authenticated`
|
||||
mode OR the solver must be exposed on a routable address.
|
||||
|
||||
### DNS-01 validator returns `NXDOMAIN`
|
||||
|
||||
**Cause:** DNS provider hasn't propagated the `_acme-challenge.<domain>`
|
||||
TXT record yet. Most providers have a 30s-2m propagation lag. cert-manager
|
||||
retries by default, but Phase-5 rate limits (default 60/hour per
|
||||
challenge-id) can truncate the retry budget.
|
||||
|
||||
**Fix:** Verify TXT propagation with `dig +short TXT _acme-challenge.<domain>
|
||||
@<your-resolver>`. If the answer is empty, the issue is upstream. If
|
||||
it's populated but certctl reports NXDOMAIN, check
|
||||
`CERTCTL_ACME_SERVER_DNS01_RESOLVER` (default `8.8.8.8:53`) is
|
||||
reachable from certctl-server's network egress. Operators on isolated
|
||||
networks need a private resolver; configure accordingly + own the
|
||||
cache-poisoning posture (see [threat
|
||||
model](./acme-server-threat-model.md)).
|
||||
|
||||
### Certificate Ready=False with `rejectedIdentifier`
|
||||
|
||||
**Cause:** The CSR includes an identifier (CommonName or SAN) that the
|
||||
bound certificate profile's policy rejects. certctl runs syntactic +
|
||||
profile-policy validation **before** order creation; the order never
|
||||
reaches the database.
|
||||
|
||||
**Fix:** The reject reason is in the `subproblems` array of the RFC
|
||||
8555 §6.7 problem document. Decode the JSON, look at `subproblems[].detail`,
|
||||
and adjust either the CSR or the profile policy. Common causes:
|
||||
SAN-not-in-`AllowedIdentifierWildcards`, EKU-not-in-`AllowedEKUs`,
|
||||
TTL-exceeds-`MaxTTLSeconds`. Validation logic lives in
|
||||
`internal/api/acme/identifier.go::ValidateIdentifiers` +
|
||||
`internal/domain/profile.go` — read those if the profile-policy rule
|
||||
isn't obvious.
|
||||
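Pulling the reject reasons out of the problem document can be sketched as follows. The struct shapes follow the RFC 8555 subproblem layout; the function name `rejectReasons` and the sample detail text are assumptions for illustration, not output captured from certctl.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type identifier struct {
	Type  string `json:"type"`
	Value string `json:"value"`
}

type subproblem struct {
	Type       string      `json:"type"`
	Detail     string      `json:"detail"`
	Identifier *identifier `json:"identifier,omitempty"`
}

type problem struct {
	Type        string       `json:"type"`
	Detail      string       `json:"detail"`
	Subproblems []subproblem `json:"subproblems"`
}

// rejectReasons returns one "identifier: detail" line per subproblem.
func rejectReasons(body []byte) ([]string, error) {
	var p problem
	if err := json.Unmarshal(body, &p); err != nil {
		return nil, fmt.Errorf("decode problem document: %w", err)
	}
	reasons := make([]string, 0, len(p.Subproblems))
	for _, sp := range p.Subproblems {
		name := "(no identifier)"
		if sp.Identifier != nil {
			name = sp.Identifier.Value
		}
		reasons = append(reasons, name+": "+sp.Detail)
	}
	return reasons, nil
}

func main() {
	// Sample body with made-up detail text.
	body := []byte(`{
	  "type": "urn:ietf:params:acme:error:rejectedIdentifier",
	  "detail": "one or more identifiers were rejected",
	  "subproblems": [{
	    "type": "urn:ietf:params:acme:error:rejectedIdentifier",
	    "detail": "identifier not permitted by profile policy",
	    "identifier": {"type": "dns", "value": "db.internal.example.com"}
	  }]
	}`)
	reasons, err := rejectReasons(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range reasons {
		fmt.Println(r)
	}
}
```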
|
||||
## Version pinning + tested clients
|
||||
|
||||
certctl's ACME server is tested against the following client versions.
|
||||
Other versions probably work; these are the ones the integration suite
|
||||
exercises end-to-end.
|
||||
|
||||
| Client | Tested version | Where it's pinned |
|--------|----------------|-------------------|
| cert-manager | 1.15.0 | `deploy/test/acme-integration/cert-manager-install.sh::CERT_MANAGER_VERSION` |
| lego (RFC 8555 conformance harness) | v4.x latest | `deploy/test/acme-integration/conformance-lego.sh` (operator installs via `go install github.com/go-acme/lego/v4/cmd/lego@latest`) |
| kind (cluster bootstrap) | v0.20+ | `deploy/test/acme-integration/kind-config.yaml` schema requirement |
| Caddy | 2.7.x | Phase 6 walkthrough (`docs/acme-caddy-walkthrough.md`) |
| Traefik | 3.0+ | Phase 6 walkthrough (`docs/acme-traefik-walkthrough.md`) |
|
||||
|
||||
Operators reporting issues with untested-version clients should include
|
||||
the client version + the precise wire-level error (curl-captured request
|
||||
+ response body) so we can pin a regression test if applicable.
|
||||
|
||||
## FAQ
|
||||
|
||||
### Why two auth modes? Isn't `challenge` strictly more secure?
|
||||
|
||||
`challenge` is strictly more secure for **public-trust** PKI — RFC 8555
|
||||
§8 ownership proof is the entire point of cert-manager + Let's Encrypt.
|
||||
For **internal PKI**, the threat model is different: the network itself
|
||||
is the security boundary (mTLS service mesh, firewalled VPC, identifier-
|
||||
namespace controlled by the operator). Forcing every internal cert to
|
||||
go through a solver round-trip adds operational toil with no security
|
||||
gain. `trust_authenticated` is the certctl-specific mode that
|
||||
acknowledges this — the ACME account is the proof, not the solver.
|
||||
|
||||
### How does this differ from `cert-manager → Let's Encrypt with certctl as a separate step`?
|
||||
|
||||
Two integrations vs one. With certctl as the ACME endpoint, cert-manager
|
||||
does its native flow (Certificate → Order → CSR → Secret) and certctl
|
||||
mints the cert directly, recording it under its own
|
||||
`managed_certificates` table with full audit + renewal-policy + bulk-
|
||||
revocation surface. With Let's Encrypt as the ACME endpoint, you have
|
||||
to run a separate cert-manager-uploads-to-certctl webhook OR maintain
|
||||
two parallel cert tracks. The native-ACME-server path is operationally
|
||||
simpler.
|
||||
|
||||
### Can I use ACME endpoints from outside the K8s cluster?
|
||||
|
||||
Yes. The endpoints are HTTPS over the certctl-server's listener (port
|
||||
8443 by default). Caddy on a VM, win-acme on a Windows server, or
|
||||
Posh-ACME on a Mac all integrate against
|
||||
`https://<certctl-server>:8443/acme/profile/<profile-id>/directory`.
|
||||
The TLS-trust-bootstrap requirement applies the same way — see the
|
||||
[Caddy walkthrough](./acme-caddy-walkthrough.md) for the OS-trust-store
|
||||
recipe.
|
||||
|
||||
### How do I migrate manually-issued certs to ACME-issued ones?
|
||||
|
||||
Not yet automatic. Operators migrating: keep the old `managed_certificates`
|
||||
rows; create new ones via the ACME flow; flip targets one by one. A
|
||||
dedicated bulk-migration tool is on the roadmap (post-2.1.0). Track
|
||||
via the master prompt's roadmap section in
|
||||
`cowork/acme-server-endpoint-prompt.md`.
|
||||
|
||||
### What audit-log events fire on each ACME operation?
|
||||
|
||||
Every state mutation writes an `audit_events` row. Actor strings:
|
||||
`acme:<account-id>` for kid-path requests; `acme-cert-key:<serial>`
|
||||
for jwk-path revoke; `acme-system:gc` for scheduler-driven sweeps.
|
||||
Event-name catalog:
|
||||
|
||||
| Event name | Fired by | Resource type |
|------------|----------|---------------|
| `acme_account_created` | new-account | `acme_account` |
| `acme_account_contact_updated` | account update | `acme_account` |
| `acme_account_deactivated` | account deactivate | `acme_account` |
| `acme_account_key_rolled` | key-change | `acme_account` |
| `acme_order_created` | new-order | `acme_order` |
| `acme_order_finalized` | finalize | `acme_order` |
| `acme_challenge_processing` | challenge-respond (dispatch) | `acme_challenge` |
| `acme_challenge_completed` | validator callback | `acme_challenge` |
| `certificate_revoked` | revoke-cert (routes through `RevocationSvc`) | `certificate` |
|
||||
|
||||
Querying by actor prefix (`actor LIKE 'acme:%'`) reconstructs the full
|
||||
history of any ACME-issued cert.
|
||||
|
||||
### Is there a threat model document?
|
||||
|
||||
Yes — [`docs/acme-server-threat-model.md`](./acme-server-threat-model.md).
|
||||
Read before writing a security review.
|
||||
|
||||
## See also
|
||||
|
||||
- [cert-manager integration walkthrough](./acme-cert-manager-walkthrough.md)
|
||||
- [Caddy integration walkthrough](./acme-caddy-walkthrough.md)
|
||||
- [Traefik integration walkthrough](./acme-traefik-walkthrough.md)
|
||||
- [Threat model](./acme-server-threat-model.md)
|
||||
- [TLS trust bootstrap reference](./tls.md)
|
||||
- [Architecture (control-plane)](./architecture.md)
|
||||
@@ -0,0 +1,118 @@
|
||||
# Async-CA Polling — Operator Reference
|
||||
|
||||
Closes audit fix #5 from the 2026-05-01 issuer-coverage acquisition-readiness audit.
|
||||
|
||||
## What this is
|
||||
|
||||
Four issuer connectors talk to Certificate Authorities that issue
|
||||
certificates **asynchronously** — `IssueCertificate` returns an order
|
||||
ID immediately, and the caller (or scheduler) must call
|
||||
`GetOrderStatus` later to retrieve the issued cert:
|
||||
|
||||
- **DigiCert** (CertCentral)
|
||||
- **Sectigo** (Certificate Manager)
|
||||
- **Entrust** (Certificate Services / CA Gateway)
|
||||
- **GlobalSign** (Atlas HVCA)
|
||||
|
||||
Pre-fix, each connector's `GetOrderStatus` made one HTTP call per
|
||||
invocation with no exponential backoff, no retry cap, and no deadline.
|
||||
Under a renewal sweep, certctl would hammer the upstream CA's
|
||||
rate-limit budget. A 429 response was treated as a hard error,
|
||||
which then caused the scheduler to retry on the next tick — re-fanning
|
||||
out the same call that just got rate-limited.
|
||||
|
||||
Post-fix, `GetOrderStatus` blocks for up to `PollMaxWait` (default
|
||||
10 minutes) doing **bounded internal polling**:
|
||||
|
||||
```
attempt 1 → wait 5s → attempt 2 → wait 15s → attempt 3 → wait 45s →
attempt 4 → wait 2m → attempt 5 → wait 5m → ... (capped at 5m)
```
|
||||
|
||||
±20% jitter applied at every wait so multiple certctl instances
|
||||
never synchronize on the upstream CA's rate-limit window. The
|
||||
`PollMaxWait` deadline is a hard cap; if the upstream still hasn't
|
||||
completed by then, `GetOrderStatus` returns `StillPending` and the
|
||||
scheduler can re-enqueue the job for a future tick.
|
||||
|
||||
## Status-code triage
|
||||
|
||||
Each connector classifies HTTP responses to drive polling decisions:
|
||||
|
||||
| Response | Meaning | Decision |
|---|---|---|
| 2xx + status="issued"/"completed" | Cert ready | Done — return the cert |
| 2xx + status="pending"/"processing" | Still working | StillPending — keep polling |
| 2xx + status="rejected"/"denied"/"failed" | Permanent | Done — return `OrderStatus{Status:"failed"}` |
| 2xx + parse failure | Body is broken | Failed — return error |
| 4xx (404/400/401/403) | Permanent client error | Failed — return error |
| 429 (rate limited) | Transient | StillPending — keep polling with backoff |
| 5xx | Transient | StillPending — keep polling with backoff |
| Network / TLS error | Transient | StillPending — keep polling with backoff |
|
||||
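The triage table translates naturally into a small classifier. The sketch below is one possible reading of it; the `decision` type and `classify` function are hypothetical names, and two choices go beyond the table and are assumptions: every 4xx other than 429 is treated as permanent, and an unrecognised status string is treated like a parse failure.

```go
package main

import (
	"fmt"
	"net/http"
)

type decision int

const (
	issued       decision = iota // Done: return the cert
	rejected                     // Done: return OrderStatus{Status:"failed"}
	stillPending                 // keep polling with backoff
	failed                       // return error
)

func (d decision) String() string {
	return [...]string{"issued", "rejected", "still-pending", "failed"}[d]
}

// classify maps one poll result to a decision. netErr is true when the
// HTTP call itself failed; orderStatus is the parsed status field and is
// empty when the body could not be parsed.
func classify(netErr bool, code int, orderStatus string) decision {
	switch {
	case netErr:
		return stillPending
	case code == http.StatusTooManyRequests, code >= 500:
		return stillPending
	case code < 200 || code >= 300:
		return failed // 4xx per the table; 1xx/3xx by assumption
	}
	switch orderStatus {
	case "issued", "completed":
		return issued
	case "pending", "processing":
		return stillPending
	case "rejected", "denied", "failed":
		return rejected
	default:
		return failed // parse failure or unknown status (assumption)
	}
}

func main() {
	fmt.Println(classify(false, 200, "issued"))
	fmt.Println(classify(false, 429, ""))
	fmt.Println(classify(false, 404, ""))
	fmt.Println(classify(true, 0, ""))
}
```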
|
||||
## Operator tuning
|
||||
|
||||
Each connector exposes a `PollMaxWaitSeconds` config field and
|
||||
matching env var:
|
||||
|
||||
| Connector | Env var | Default |
|---|---|---|
| DigiCert | `CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS` | 600 (10m) |
| Sectigo | `CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS` | 600 (10m) |
| Entrust | `CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS` | 600 (10m) |
| GlobalSign | `CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS` | 600 (10m) |
|
||||
|
||||
Tune up (e.g., `86400` = 24 hours) for **Entrust approval-pending
|
||||
workflows** where humans manually approve enrollments. Tune down (e.g.,
|
||||
`60`) for high-throughput environments that prefer to recycle the
|
||||
scheduler tick rather than block one renewal goroutine for minutes.
|
||||
|
||||
A value of 0 (or unset) falls back to the package default in
|
||||
`internal/connector/issuer/asyncpoll`.
|
||||
|
||||
## Failure modes
|
||||
|
||||
**Upstream returns 429 forever.** The Poller respects the backoff
|
||||
(5s → 15s → 45s → 2m → 5m), so a sustained 429 stream burns through
|
||||
the full `PollMaxWait` budget with at most 7-8 attempts (instead of
|
||||
~600 attempts at 1/sec). After `PollMaxWait` expires, `GetOrderStatus`
|
||||
returns `StillPending`; the scheduler re-enqueues for the next tick.
|
||||
The total request volume against the upstream is bounded by `tick
|
||||
interval / minimum backoff` — typically 1-2 requests per minute even
|
||||
under heavy load.
|
||||
|
||||
**Sectigo `collectNotReady` sentinel.** When the SCM status endpoint
|
||||
reports `Issued` but the cert collect endpoint isn't yet ready, the
|
||||
old code branched into a special "pending" return. Now that branch
|
||||
returns `StillPending` from the poll closure, so the cert collection
|
||||
rides the same backoff schedule.
|
||||
|
||||
**Entrust approval-pending.** The `AWAITING_APPROVAL` status maps to
|
||||
`StillPending`. With the default `PollMaxWait=10m`, the scheduler
|
||||
will re-enqueue once per tick if approval hasn't happened yet; with
|
||||
`PollMaxWait=24h` the same renewal goroutine waits the full approval
|
||||
window. Pick the latter when you have many approval-pending
|
||||
enrollments per tick.
|
||||
|
||||
## Where the implementation lives
|
||||
|
||||
- `internal/connector/issuer/asyncpoll/asyncpoll.go` — shared `Poller`
|
||||
with backoff math, jitter, deadline, and ctx-aware cancellation.
|
||||
- `internal/connector/issuer/digicert/digicert.go` —
|
||||
`pollOrderOnce` + `GetOrderStatus` orchestrator.
|
||||
- `internal/connector/issuer/sectigo/sectigo.go` —
|
||||
`pollEnrollmentOnce` + status-code permanence triage
|
||||
(`isPermanentStatusError`).
|
||||
- `internal/connector/issuer/entrust/entrust.go` —
|
||||
`pollEnrollmentOnce` + approval-pending mapping.
|
||||
- `internal/connector/issuer/globalsign/globalsign.go` —
|
||||
`pollCertificateOnce` (serial-number tracking).
|
||||
- `internal/connector/issuer/asyncpoll/asyncpoll_test.go` — 11 unit
|
||||
tests covering happy path, transient-then-success, Failed
|
||||
termination, MaxWait timeout, last-error wrap, ctx cancel,
|
||||
multiplicative backoff, jitter bounds, defaults.
|
||||
|
||||
## Audit blocker reference
|
||||
|
||||
cowork/issuer-coverage-audit-2026-05-01/RESULTS.md, Top-10 fix #5
|
||||
(Part 1.5 finding #4: "No polling backoff for async CAs").
|
||||
@@ -0,0 +1,411 @@
|
||||
# CRL & OCSP — Revocation Status for Relying Parties
|
||||
|
||||
This guide is the operator + relying-party reference for certctl's revocation
|
||||
status surfaces. It covers the wire format, endpoint URLs, configuration knobs,
|
||||
the OCSP responder cert lifecycle, and how to point common consumers
|
||||
(cert-manager, Firefox, OpenSSL) at the endpoints.
|
||||
|
||||
If you're looking for the higher-level architecture, see
|
||||
[`architecture.md` § Security Model](architecture.md#security-model). If you're
|
||||
looking for the revocation policy / reason codes the API accepts, see
|
||||
[`api/openapi.yaml` § /certificates/{id}/revoke](../api/openapi.yaml).
|
||||
|
||||
---
|
||||
|
||||
## Conceptual overview
|
||||
|
||||
**Why two formats.** RFC 5280 §5 defines a Certificate Revocation List (CRL)
|
||||
— a periodically-published, signed list of every revoked certificate for an
|
||||
issuer. RFC 6960 defines the Online Certificate Status Protocol (OCSP) — a
|
||||
request/response protocol that returns the status of a single certificate by
|
||||
serial number. CRLs are batch-friendly and cacheable; OCSP is point-query and
|
||||
fresh. Production PKI deployments serve both because different relying parties
|
||||
prefer different trade-offs:
|
||||
|
||||
- Browsers (Firefox / Safari) prefer OCSP for freshness; some pin OCSP
|
||||
stapling.
|
||||
- cert-manager and most Linux TLS clients fall back to CRL when OCSP is
|
||||
unreachable.
|
||||
- Microsoft Intune / corporate device-state validators do periodic CRL pulls.
|
||||
- OpenSSL `s_client -status` exercises OCSP via the `Certificate Status
|
||||
Request` extension during the handshake.
|
||||
|
||||
certctl's local issuer publishes both, with a pre-generation cache so a busy
CA does not DoS itself by rebuilding the CRL on every fetch.
|
||||
|
||||
**Why a separate OCSP responder cert.** RFC 6960 §2.6 + §4.2.2.2 strongly
|
||||
recommend that OCSP responses be signed by a delegated "OCSP responder cert"
|
||||
issued by the CA, NOT by the CA private key directly. The responder cert
|
||||
carries the `id-pkix-ocsp-nocheck` extension (RFC 6960 §4.2.2.2.1) so OCSP
|
||||
clients do not recursively check the responder cert's revocation status. This
|
||||
keeps the CA private key cold (an HSM operation per OCSP request would be
|
||||
prohibitive at scale) and lets the responder key live on disk, on a separate
|
||||
HSM partition, or rotate frequently while the CA key stays untouched.
|
||||
|
||||
---
|
||||
|
||||
## Endpoints
|
||||
|
||||
All revocation endpoints live under `/.well-known/pki/` per RFC 8615 and run
|
||||
**unauthenticated** — relying parties without certctl API credentials must be
|
||||
able to validate revocation status. The HTTPS-only TLS 1.3 control plane
|
||||
applies; there is no plaintext fallback.
|
||||
|
||||
### CRL — Certificate Revocation List
|
||||
|
||||
```
GET https://<host>/.well-known/pki/crl/{issuer_id}
```
|
||||
|
||||
| Field | Value |
| --- | --- |
| Method | `GET` |
| Auth | None (unauthenticated, RFC 5280 §5 distribution semantics) |
| Response Content-Type | `application/pkix-crl` |
| Response body | DER-encoded X.509 CRL signed by the issuer's CA |
| Cache | Pre-generated by the scheduler; configurable interval |
|
||||
|
||||
Example:
|
||||
|
||||
```bash
curl --cacert ca.crt \
     -o crl.der \
     https://localhost:8443/.well-known/pki/crl/iss-local

openssl crl -inform DER -in crl.der -text -noout
```
|
||||
|
||||
### OCSP — Online Certificate Status Protocol
|
||||
|
||||
certctl serves both the GET form (RFC 6960 §A.1, simple URL-path lookup)
and the POST form (RFC 6960 §A.1, binary OCSPRequest body). Most
production OCSP clients (Firefox, OpenSSL `s_client -status`, cert-manager,
Intune) use POST. The GET form is preserved for ops curl-debugging.
|
||||
|
||||
#### GET form
|
||||
|
||||
```
GET https://<host>/.well-known/pki/ocsp/{issuer_id}/{serial_hex}
```
|
||||
|
||||
| Field | Value |
| --- | --- |
| Method | `GET` |
| Auth | None |
| Response Content-Type | `application/ocsp-response` |
| Response body | DER-encoded OCSPResponse signed by the **OCSP responder cert** (NOT the CA cert) |
|
||||
|
||||
Example:
|
||||
|
||||
```bash
curl --cacert ca.crt \
     -o response.der \
     https://localhost:8443/.well-known/pki/ocsp/iss-local/a1b2c3d4

openssl ocsp -respin response.der -text -CAfile ca.crt
```
|
||||
|
||||
#### POST form (the standard one)
|
||||
|
||||
```
POST https://<host>/.well-known/pki/ocsp/{issuer_id}
Content-Type: application/ocsp-request
Body: <DER-encoded OCSPRequest>
```
|
||||
|
||||
| Field | Value |
| --- | --- |
| Method | `POST` |
| Auth | None |
| Request Content-Type | `application/ocsp-request` |
| Response Content-Type | `application/ocsp-response` |
|
||||
|
||||
Example with OpenSSL building the request:
|
||||
|
||||
```bash
openssl ocsp -issuer ca.crt -cert leaf.crt -reqout request.der

curl --cacert ca.crt \
     -X POST \
     -H "Content-Type: application/ocsp-request" \
     --data-binary @request.der \
     -o response.der \
     https://localhost:8443/.well-known/pki/ocsp/iss-local

openssl ocsp -respin response.der -text -CAfile ca.crt
```
|
||||
|
||||
The body-size limit applies (`http.MaxBytesReader` from middleware,
|
||||
default 1MB, configurable via `CERTCTL_MAX_BODY_SIZE`); a typical OCSPRequest
|
||||
is ~200 bytes so this is a generous cap.
|
||||
|
||||
### Admin observability endpoint
|
||||
|
||||
```
GET https://<host>/api/v1/admin/crl/cache
Authorization: Bearer <token-with-admin-flag>
```
|
||||
|
||||
Returns the per-issuer cache state for ops dashboards, GUI badges, or
"is the scheduler keeping up?" diagnostics. The endpoint is admin-gated
(M-008 handler allowlist); non-admin Bearer callers receive HTTP 403.
Response shape:
|
||||
|
||||
```json
{
  "cache_rows": [
    {
      "issuer_id": "iss-local",
      "cache_present": true,
      "crl_number": 42,
      "this_update": "2026-04-29T10:00:00Z",
      "next_update": "2026-04-29T11:00:00Z",
      "generated_at": "2026-04-29T10:00:00Z",
      "generation_duration_ms": 87,
      "revoked_count": 13,
      "is_stale": false,
      "recent_events": [
        {
          "started_at": "2026-04-29T10:00:00Z",
          "duration_ms": 87,
          "succeeded": true,
          "crl_number": 42,
          "revoked_count": 13
        }
      ]
    }
  ],
  "row_count": 1,
  "generated_at": "2026-04-29T10:30:00Z"
}
```
|
||||
|
||||
Issuers that have not yet had a CRL generated appear with `cache_present:
|
||||
false` so the GUI can render a "Not yet generated" pill rather than 404.
|
||||
|
||||
---
|
||||
|
||||
## Configuration
|
||||
|
||||
| Env var | Default | Meaning |
| --- | --- | --- |
| `CERTCTL_CRL_GENERATION_INTERVAL` | `1h` | How often the scheduler walks every CRL-supporting issuer and rebuilds. The HTTP handler reads from the cache, not from a per-request rebuild. |
| `CERTCTL_OCSP_RESPONDER_KEY_DIR` | unset | **Operator MUST set in production.** Directory where the FileDriver persists each issuer's OCSP responder key (`ocsp-responder-<issuer_id>.key`). When unset, the responder service uses a temporary directory that does NOT survive restarts — fine for dev, NEVER for prod. |
| `CERTCTL_OCSP_RESPONDER_ROTATION_GRACE` | `7d` | When the responder cert's `NotAfter` falls within this window, `EnsureResponder` rotates to a fresh cert+key on the next OCSP request or scheduler tick. |
| `CERTCTL_OCSP_RESPONDER_VALIDITY` | `30d` | How long each newly-issued responder cert is valid for. Short by design — relying parties cache OCSP responses, not the responder cert chain, and `id-pkix-ocsp-nocheck` blocks recursive revocation checking on the responder itself. |
|
||||
|
||||
The issuer-level CRL `nextUpdate` is derived from the generation timestamp +
|
||||
the configured CRL validity (currently a build-time constant in the
|
||||
`CRLCacheService`; configurable knob deferred until an operator asks).
|
||||
|
||||
---
|
||||
|
||||
## OCSP responder cert lifecycle
|
||||
|
||||
1. **First OCSP request for an issuer (or scheduler tick).** The local
   issuer's `SignOCSPResponse` calls into `OCSPResponderService.EnsureResponder`.
2. **Cache lookup.** `EnsureResponder` queries the `ocsp_responders` table for
   a row keyed by `issuer_id`.
3. **Disk lookup.** If a row exists, the FileDriver reads the persisted key
   from `<keydir>/ocsp-responder-<issuer_id>.key`. **Self-healing:** if the
   row exists but the file is missing (operator pruned the keydir without
   pruning the DB), the service treats this as "rotate now" rather than
   crashing.
4. **Rotation check.** If `cert.NotAfter < now + RotationGrace`, the service
   generates a fresh ECDSA-P256 key, builds a `*x509.CertificateRequest`,
   and asks the local issuer's existing `IssueCertificate` flow to sign it.
   The signing template carries:
   - `KeyUsage: x509.KeyUsageDigitalSignature` (signing OCSP responses)
   - `ExtKeyUsage: x509.ExtKeyUsageOCSPSigning` (RFC 6960 §4.2.2.2)
   - The `id-pkix-ocsp-nocheck` extension (OID `1.3.6.1.5.5.7.48.1.5`,
     DER value `NULL`, RFC 6960 §4.2.2.2.1) wired through
     `Certificate.ExtraExtensions`.
5. **Persistence.** The new cert + key path are written to `ocsp_responders`
   via an idempotent `INSERT … ON CONFLICT DO UPDATE`.
6. **Response signing.** `ocsp.CreateResponse(caCert, responderCert,
   template, responderSigner)` produces the response bytes; the responder
   cert is included in the response chain so relying parties can validate
   without a separate fetch.
|
||||
|
||||
The race between scheduler-driven cache refresh and on-demand cache miss is
|
||||
collapsed by the `CRLCacheService`'s in-tree singleflight (a `sync.Map` of
|
||||
`*flightEntry` keyed by `issuer_id`). Concurrent generation requests for the
|
||||
same issuer wait on the in-flight result rather than each rebuilding from
|
||||
scratch.
|
||||
|
||||
---
|
||||
|
||||
## Pointing common consumers at the endpoints
|
||||
|
||||
### cert-manager (Kubernetes)
|
||||
|
||||
cert-manager's certificate-validation logic checks both the AIA OCSP URI
|
||||
embedded in the leaf and the CDP CRL URI. Both are populated automatically
|
||||
by the local issuer's certificate template — relying parties should NOT
|
||||
need any additional configuration. To verify:
|
||||
|
||||
```bash
openssl x509 -in leaf.crt -text -noout | grep -A1 "Authority Information Access"
openssl x509 -in leaf.crt -text -noout | grep -A2 "CRL Distribution Points"
```
|
||||
|
||||
If your cert-manager pods cannot reach `https://<certctl-host>:8443/.well-known/pki/`,
|
||||
add a NetworkPolicy egress rule or expose the certctl service via the
|
||||
appropriate ingress class.
|
||||
|
||||
### Firefox
|
||||
|
||||
Firefox honors the AIA OCSP URI by default. To force-refresh the local
|
||||
revocation cache after revoking a cert in dev:
|
||||
|
||||
```
about:preferences#privacy → Certificates → Query OCSP responder servers
```
|
||||
|
||||
If Firefox reports `SEC_ERROR_OCSP_INVALID_SIGNING_CERT`, verify that the
|
||||
responder cert chain is reachable from the system trust store —
|
||||
`id-pkix-ocsp-nocheck` is a Firefox-strict extension and is set automatically
|
||||
on every responder cert certctl issues.
|
||||
|
||||
### OpenSSL
|
||||
|
||||
```bash
# OCSP via stand-alone request
openssl ocsp -issuer ca.crt -cert leaf.crt -url https://localhost:8443/.well-known/pki/ocsp/iss-local -CAfile ca.crt -text

# OCSP via TLS Certificate Status Request extension
openssl s_client -connect example.com:443 -status -CAfile ca.crt
```
|
||||
|
||||
### Intune (corporate device state)
|
||||
|
||||
Intune device-compliance validators pull the CRL on a schedule (configured in
|
||||
the Intune admin console, default 24h). Configure the CRL distribution point
|
||||
to `https://<certctl-host>:8443/.well-known/pki/crl/<issuer_id>` and Intune
|
||||
will pull on its own cadence.
|
||||
|
||||
---
|
||||
|
||||
## Production hardening II additions (post-2026-04-30)
|
||||
|
||||
The following capabilities were folded into V2 (free) by the production
|
||||
hardening II bundle. Each closes a real procurement-team checklist gap
|
||||
without requiring a paid tier.
|
||||
|
||||
### OCSP nonce extension (RFC 6960 §4.4.1)
|
||||
|
||||
The POST OCSP handler echoes the request's nonce extension (OID
|
||||
`1.3.6.1.5.5.7.48.1.2`) in the response. Defends against replay attacks
|
||||
where a relying party's cached response is replayed against a now-revoked
|
||||
cert. Always-on; no operator opt-out.
|
||||
|
||||
Failure modes:
|
||||
|
||||
- **No nonce in request** — back-compat; response omits the extension.
- **Well-formed nonce ≤ 32 bytes** — response echoes it; tracked in
  `certctl_ocsp_counter_total{label="nonce_echoed"}`.
- **Empty or oversized nonce (> 32 bytes, the limit set by RFC 8954)** —
  responder returns the canonical "unauthorized" status (RFC 6960 §2.3
  status 6); tracked in `certctl_ocsp_counter_total{label="nonce_malformed"}`.
|
||||
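The three nonce outcomes reduce to a length check. A minimal sketch, assuming the handler has already extracted the nonce bytes from the request extension; `classifyNonce` and the outcome names are hypothetical.

```go
package main

import "fmt"

// maxNonceLen is the 32-byte cap described above.
const maxNonceLen = 32

type nonceOutcome int

const (
	nonceAbsent    nonceOutcome = iota // response omits the extension
	nonceEchoed                        // response echoes the nonce
	nonceMalformed                     // responder answers "unauthorized"
)

// classifyNonce decides how to treat a request nonce. present is false
// when the request carried no nonce extension at all.
func classifyNonce(present bool, nonce []byte) nonceOutcome {
	switch {
	case !present:
		return nonceAbsent
	case len(nonce) == 0 || len(nonce) > maxNonceLen:
		return nonceMalformed
	default:
		return nonceEchoed
	}
}

func main() {
	fmt.Println(classifyNonce(false, nil) == nonceAbsent)
	fmt.Println(classifyNonce(true, make([]byte, 16)) == nonceEchoed)
	fmt.Println(classifyNonce(true, make([]byte, 33)) == nonceMalformed)
}
```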
|
||||
### OCSP pre-signed response cache
|
||||
|
||||
Mirrors the existing CRL cache. Per-(issuer, serial) entries pre-signed
|
||||
and stored in `ocsp_response_cache`; the read-through facade in
|
||||
`CAOperationsSvc.GetOCSPResponseWithNonce` consults the cache for
|
||||
nil-nonce requests and falls through to live signing on miss + writes
|
||||
the result back. Nonce-bearing requests always live-sign because the
|
||||
cache stores nil-nonce blobs.
|
||||
|
||||
**Load-bearing security wire:** `RevocationSvc.RevokeCertificateWithActor`
|
||||
calls `InvalidateOnRevoke` after a successful revocation so the next
|
||||
OCSP fetch returns the revoked status. There is no stale-good window
|
||||
after revoke.
|
||||
|
||||
### Per-source-IP OCSP rate limit + per-actor cert-export rate limit
|
||||
|
||||
Defaults: 1000 req/min/IP for OCSP; 50 exports/hr/operator for the
|
||||
cert-export endpoints. Configurable via
|
||||
`CERTCTL_OCSP_RATE_LIMIT_PER_IP_MIN` and
|
||||
`CERTCTL_CERT_EXPORT_RATE_LIMIT_PER_ACTOR_HR`; zero disables.
|
||||
|
||||
OCSP rate-limit trip: canonical "unauthorized" OCSP blob plus
|
||||
`Retry-After: 60`. Cert-export trip: HTTP 429 + JSON
|
||||
`{"error":"rate_limit_exceeded","retry_after_seconds":3600}`.
|
||||
|
||||
The OCSP limiter does NOT honor `X-Forwarded-For` because OCSP is
|
||||
publicly reachable and untrusted intermediaries could spoof the header
|
||||
to bypass the cap.
|
||||
|
||||
### CRL HTTP caching headers (RFC 7232)
|
||||
|
||||
`GET /.well-known/pki/crl/{issuer_id}` now returns weak-form ETag,
|
||||
`Cache-Control: public, max-age=3600, must-revalidate`, and respects
|
||||
`If-None-Match` for HTTP 304 short-circuits. Lets CDNs and reverse
|
||||
proxies serve repeated fetches from edge cache.
|
||||
|
||||
### CRL DistributionPoint auto-injection
|
||||
|
||||
Local issuer config field `CRLDistributionPointURLs []string`; when
|
||||
non-empty, every issued cert carries the RFC 5280 §4.2.1.13
|
||||
`id-ce-cRLDistributionPoints` extension pointing at certctl's CRL
|
||||
endpoint. Refusing to silently inject an empty CDP is deliberate —
|
||||
silent-empty fails relying-party validation worse than no CDP.
|
||||
|
||||
### Cert-export typed audit codes + Prometheus per-area metrics
|
||||
|
||||
Audit emission now carries typed action constants
|
||||
(`cert_export_pem`, `cert_export_pkcs12`, `cert_export_failed`)
|
||||
alongside legacy bare codes. Detail map enriched with
|
||||
`has_private_key` (always false in V2) and `cipher`
|
||||
(`AES-256-CBC-PBE2-SHA256` — pinned).
|
||||
|
||||
`GET /api/v1/metrics/prometheus` surfaces the new per-area counters
|
||||
under the `certctl_<area>_counter_total{label=...}` family. OCSP
|
||||
shipped in this bundle; alert recommendations:
|
||||
|
||||
- `{label="rate_limited"}` rate > 0 sustained > 5m → notify (limiter
|
||||
is doing its job; investigate source IP).
|
||||
- `{label="nonce_malformed"}` > 0 → notify (legitimate clients don't
|
||||
send malformed nonces).
|
||||
- `{label="signing_failed"}` > 0 → page on-call (issuer connector
|
||||
failing).
|
||||
|
||||
## What this release does NOT include (V3-Pro)
|
||||
|
||||
Still out of scope for V2; tracked for V3-Pro:
|
||||
|
||||
- **Delta CRLs (RFC 5280 §5.2.4).** Useful for very large CRLs (10k+
|
||||
revoked certs); the data model accommodates the Base CRL Number
|
||||
reference but the pipeline only emits Base CRLs in V2.
|
||||
- **OCSP stapling at SCEP/EST CertRep response time.** Server-side
|
||||
pre-staple into the TLS handshake context.
|
||||
- **OCSP request signature verification (RFC 6960 §4.1.1).** Optional
|
||||
per-spec; certctl currently ignores the signature.
|
||||
- **OCSP responder HA / multi-region replication.** Active-active
|
||||
OCSP cache with Postgres logical replication.
|
||||
- **CRL Issuing Distribution Point (IDP) extension** (RFC 5280
|
||||
§5.2.5) — for sharded CRL deployments.
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**`pki/crl/<issuer_id>` returns 404.** The issuer either does not support
|
||||
CRL signing (Vault, EJBCA, DigiCert serve their own CRL infrastructure;
|
||||
certctl's connectors return `nil` from `GenerateCRL` for these) or the
|
||||
issuer ID is wrong. Verify with `GET /api/v1/issuers`.
|
||||
|
||||
**`pki/ocsp/<issuer_id>/<serial>` returns 200 but `openssl ocsp -text`
|
||||
shows "unauthorized".** Check that the serial in the URL is hex-encoded (no
|
||||
`0x` prefix, no leading zeros stripped, lowercase). Mismatched serials
|
||||
return an OCSP response with status `unauthorized` per RFC 6960 §2.3.
|
||||
|
||||
**Admin cache endpoint returns 403.** The Bearer key does not carry the
|
||||
admin flag. M-008 gates this endpoint server-side; the GUI also gates the
|
||||
fetch on `useAuth().admin`. Either escalate the key (`certctl admin
|
||||
keys promote <key-id>`) or use a different identity.
|
||||
|
||||
**Cache shows `is_stale: true` repeatedly.** The scheduler is not running
|
||||
(or not getting scheduled often enough). Check `CERTCTL_CRL_GENERATION_INTERVAL`
|
||||
and confirm the scheduler started: `grep crlGenerationLoop` in the server
|
||||
logs at startup.
|
||||
@@ -0,0 +1,809 @@
|
||||
# EST (RFC 7030) — Operator Guide
|
||||
|
||||
> **Status (this document):** EST RFC 7030 hardening master bundle Phases
|
||||
> 1–11 shipped on `master`; this guide is the Phase-12 deliverable
|
||||
> against the bundle. Every behavior described here is exercised by the
|
||||
> tests at `internal/api/handler/est*_test.go`,
|
||||
> `internal/service/est*_test.go`, and (for the libest interop layer)
|
||||
> `deploy/test/est_e2e_test.go` under `//go:build integration`. The
|
||||
> bundle is **V2-free**; per-tenant CA isolation, Conditional-Access
|
||||
> compliance gating, and EST cert-bound usage analytics are documented
|
||||
> as V3-Pro deferrals in [V3-Pro deferrals](#v3-pro-deferrals).
|
||||
|
||||
## Contents
|
||||
|
||||
1. [Concepts](#concepts)
|
||||
2. [Quick start](#quick-start)
|
||||
3. [Multi-profile dispatch](#multi-profile-dispatch)
|
||||
4. [Authentication modes](#authentication-modes)
|
||||
5. [RFC 9266 channel binding](#rfc-9266-channel-binding)
|
||||
6. [WiFi / 802.1X recipe (FreeRADIUS)](#wifi--8021x-recipe-freeradius)
|
||||
7. [IoT bootstrap recipe](#iot-bootstrap-recipe)
|
||||
8. [`serverkeygen` for resource-constrained devices](#serverkeygen-for-resource-constrained-devices)
|
||||
9. [HSM-backed CA signing for EST](#hsm-backed-ca-signing-for-est)
|
||||
10. [Operator GUI (EST Admin tabs)](#operator-gui-est-admin-tabs)
|
||||
11. [CLI + MCP tools](#cli--mcp-tools)
|
||||
12. [Renewal: device-driven model](#renewal-device-driven-model)
|
||||
13. [Troubleshooting matrix](#troubleshooting-matrix)
|
||||
14. [TLS 1.2 reverse-proxy runbook](#tls-12-reverse-proxy-runbook)
|
||||
15. [Threat model](#threat-model)
|
||||
16. [V3-Pro deferrals](#v3-pro-deferrals)
|
||||
17. [Appendix A: libest reference client](#appendix-a-libest-reference-client)
|
||||
18. [Appendix B: RFC 7030 wire-format quirks](#appendix-b-rfc-7030-wire-format-quirks)
|
||||
19. [Related docs](#related-docs)
|
||||
|
||||
## Concepts
|
||||
|
||||
EST (RFC 7030) is the IETF-standardized successor to SCEP for device
|
||||
enrollment over HTTPS. certctl ships a native EST server that handles
|
||||
all six RFC 7030 endpoints — `cacerts`, `simpleenroll`,
|
||||
`simplereenroll`, `csrattrs`, `serverkeygen`, and (proxy-pass)
|
||||
`fullcmc` — out of a single binary, with per-profile dispatch so a
|
||||
single deploy can serve multiple device fleets from the same control
|
||||
plane.
|
||||
|
||||
**EST is a handler-level protocol, not a connector.** The
|
||||
`ESTHandler` parses the wire format, enforces auth, and delegates
|
||||
issuance to whichever `IssuerConnector` the profile binds. EST does
|
||||
not replace your CA — it sits in front of the local CA, Vault PKI,
|
||||
EJBCA, ADCS, step-ca, or anything else certctl already knows how to
|
||||
issue against. Devices submit a CSR; certctl validates, gates, signs,
|
||||
and returns a PKCS#7 certs-only response.
|
||||
|
||||
**Two enrollment models, one server.**
|
||||
|
||||
- **Host enrollment** — a long-lived device or laptop boots, generates
|
||||
its own keypair locally, and enrolls via `simpleenroll` (initial)
|
||||
then `simplereenroll` (renewal) over the device's TLS-pinned
|
||||
channel. Private keys never leave the device.
|
||||
- **User enrollment** — a network supplicant (corporate WiFi, VPN
|
||||
client) drives `simpleenroll` against certctl on behalf of the user
|
||||
identity. The CSR carries the user UPN as a SAN; the FreeRADIUS or
|
||||
VPN policy gates session establishment on cert validity.
|
||||
|
||||
**Profile-driven policy.** Every EST profile carries its own:
|
||||
|
||||
- Issuer binding (`CERTCTL_EST_PROFILE_<NAME>_ISSUER_ID`)
|
||||
- Optional `CertificateProfile` (`_PROFILE_ID`) that constrains
|
||||
allowed key algorithms, key sizes, EKUs, SANs, max TTL, and
|
||||
must-staple
|
||||
- Auth mode mix: mTLS only, HTTP Basic only, both, or none (for
|
||||
back-compat with anonymous deploys — strongly discouraged)
|
||||
- Optional RFC 9266 `tls-exporter` channel binding
|
||||
- Optional per-(CN, sourceIP) sliding-window rate limit
|
||||
- Optional server-side keygen
|
||||
|
||||
The per-profile env-var family is documented exhaustively in
|
||||
[`features.md`](features.md).
|
||||
|
||||
**Multi-profile dispatch.** `CERTCTL_EST_PROFILES=corp,iot,wifi`
|
||||
publishes three independent endpoint groups under
|
||||
`/.well-known/est/<pathID>/`. Each profile's auth, trust anchor, and
|
||||
issuer binding is isolated; a compromise of one profile's enrollment
|
||||
password does not affect any other profile.
|
||||
|
||||
## Quick start
|
||||
|
||||
The five-minute single-profile setup runs EST anonymously over
|
||||
HTTPS-only. **Use this only on a private network during evaluation;**
|
||||
production deploys MUST set an auth mode (see
|
||||
[Authentication modes](#authentication-modes)).
|
||||
|
||||
1. Have certctl running with TLS configured per [`tls.md`](tls.md).
|
||||
The control plane listens on `:8443`; EST shares the same listener
|
||||
under `/.well-known/est/`.
|
||||
2. Set the legacy single-profile env vars in your compose file or
|
||||
Helm values:
|
||||
|
||||
```
CERTCTL_EST_ENABLED=true
CERTCTL_EST_ISSUER_ID=iss-local
```
|
||||
|
||||
3. Restart certctl. The startup log line `EST server enabled` should
|
||||
surface; the routes `/.well-known/est/{cacerts,simpleenroll,simplereenroll,csrattrs}`
|
||||
are now live.
|
||||
4. Ground-truth check from a client host:
|
||||
|
||||
```bash
curl -sS --cacert /path/to/ca.crt \
  https://certctl.example.com:8443/.well-known/est/cacerts \
  | base64 -d | openssl pkcs7 -inform DER -print_certs -noout
```
|
||||
|
||||
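You should see your CA cert's subject and issuer; with `-noout`,
`-print_certs` prints only those two names. To check `NotAfter`, drop
`-noout` and pipe the PEM output into `openssl x509 -noout -enddate`.
This is the `/cacerts` endpoint serving the PKCS#7 SignedData
certs-only response per RFC 7030 §4.1.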
|
||||
|
||||
5. Generate a CSR and enroll:
|
||||
|
||||
```bash
openssl ecparam -name prime256v1 -genkey -noout -out device.key
openssl req -new -key device.key -subj "/CN=device-001.example.com" -out device.csr
curl -sS --cacert /path/to/ca.crt \
  -H "Content-Type: application/pkcs10" \
  --data-binary @<(openssl req -in device.csr -outform DER | base64 -w0) \
  https://certctl.example.com:8443/.well-known/est/simpleenroll \
  | base64 -d | openssl pkcs7 -inform DER -print_certs > device.crt
```
|
||||
|
||||
The response is a PKCS#7 certs-only blob; the issued cert lands in
|
||||
`device.crt`.
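
For clients that build the request in code rather than with `openssl`, the same body (base64 of the DER-encoded PKCS#10 request) can be produced with Go's standard library. This is a sketch under assumptions: `buildEnrollBody` and `decodeEnrollBody` are hypothetical helper names, and the HTTP POST itself is left out.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/base64"
	"fmt"
)

// buildEnrollBody generates a P-256 key and returns it together with the
// base64-encoded DER PKCS#10 request for the given common name.
func buildEnrollBody(commonName string) (*ecdsa.PrivateKey, string, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, "", err
	}
	tmpl := &x509.CertificateRequest{Subject: pkix.Name{CommonName: commonName}}
	der, err := x509.CreateCertificateRequest(rand.Reader, tmpl, key)
	if err != nil {
		return nil, "", err
	}
	return key, base64.StdEncoding.EncodeToString(der), nil
}

// decodeEnrollBody reverses the encoding and verifies the CSR's
// self-signature, roughly what a server does before issuing.
func decodeEnrollBody(body string) (*x509.CertificateRequest, error) {
	der, err := base64.StdEncoding.DecodeString(body)
	if err != nil {
		return nil, err
	}
	csr, err := x509.ParseCertificateRequest(der)
	if err != nil {
		return nil, err
	}
	if err := csr.CheckSignature(); err != nil {
		return nil, err
	}
	return csr, nil
}

func main() {
	key, body, err := buildEnrollBody("device-001.example.com")
	if err != nil {
		panic(err)
	}
	csr, err := decodeEnrollBody(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(key.Curve.Params().Name, csr.Subject.CommonName)
}
```

The returned string is what would be sent as the `application/pkcs10` request body; the private key stays with the caller.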
|
||||
|
||||
If the curl fails with a TLS error, walk through [`tls.md`](tls.md);
the EST handler uses the same HTTPS listener as the REST API. The
legacy plaintext `:8080` listener of pre-v2.2 deploys was removed when
the HTTPS-only policy landed, so EST has no plaintext fallback.
|
||||
|
||||
## Multi-profile dispatch
|
||||
|
||||
A single certctl binary publishes one EST endpoint group per name in
|
||||
`CERTCTL_EST_PROFILES`. Set the comma-separated list, then a matching
|
||||
set of `CERTCTL_EST_PROFILE_<NAME>_*` env vars per profile:
|
||||
|
||||
```
CERTCTL_EST_ENABLED=true
CERTCTL_EST_PROFILES=corp,iot,wifi

# per-profile config: the `<NAME>` placeholder gets replaced by the
# uppercased name from the list (so "corp" → CORP, "iot" → IOT,
# "wifi" → WIFI). The URL path uses the lowercased form.
CERTCTL_EST_PROFILE_<NAME>_ISSUER_ID=iss-local
CERTCTL_EST_PROFILE_<NAME>_PROFILE_ID=cp-corp-laptops
CERTCTL_EST_PROFILE_<NAME>_ENROLLMENT_PASSWORD=<random>
CERTCTL_EST_PROFILE_<NAME>_ALLOWED_AUTH_MODES=basic
```
|
||||
|
||||
This publishes:
|
||||
|
||||
- `/.well-known/est/corp/{cacerts,simpleenroll,simplereenroll,csrattrs,serverkeygen}`
|
||||
- `/.well-known/est/iot/...`
|
||||
- `/.well-known/est/wifi/...`
|
||||
|
||||
Each profile is independently validated at startup (see
|
||||
`internal/config/config.go::Validate`). Per-profile failures log the
|
||||
offending PathID and refuse the boot. The legacy single-profile
|
||||
shape (`CERTCTL_EST_ENABLED` + `CERTCTL_EST_ISSUER_ID` without
|
||||
`CERTCTL_EST_PROFILES`) continues to work — the back-compat shim in
|
||||
`loadESTProfilesFromEnv` synthesises a single profile bound to the
|
||||
empty PathID, which the router serves at `/.well-known/est/` (no
|
||||
path component).
|
||||
|
||||
PathID rules (enforced at boot):
|
||||
|
||||
- Lowercased ASCII `[a-z0-9-]+` only, no leading/trailing hyphen.
|
||||
- Distinct PathIDs per profile (no duplicates).
|
||||
- Reserved name `est` rejected (would collide with the legacy root).
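
The three rules above can be expressed as a short validator. The sketch below is an illustration, not the code in `internal/config/config.go`; `validatePathIDs` and the exact error wording are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// pathIDPattern accepts lowercase ASCII letters, digits, and hyphens,
// with no hyphen in the first or last position.
var pathIDPattern = regexp.MustCompile(`^[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$`)

// validatePathIDs applies the boot-time rules to a profile list and
// returns an error naming the first offending PathID.
func validatePathIDs(ids []string) error {
	seen := make(map[string]bool, len(ids))
	for _, id := range ids {
		switch {
		case id == "est":
			return fmt.Errorf("path ID %q is reserved", id)
		case !pathIDPattern.MatchString(id):
			return fmt.Errorf("path ID %q is malformed", id)
		case seen[id]:
			return fmt.Errorf("path ID %q is duplicated", id)
		}
		seen[id] = true
	}
	return nil
}

func main() {
	fmt.Println(validatePathIDs([]string{"corp", "iot", "wifi"}))
	fmt.Println(validatePathIDs([]string{"corp", "Corp"}))
	fmt.Println(validatePathIDs([]string{"wifi", "wifi"}))
}
```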
|
||||
|
||||
This mirrors the SCEP `CERTCTL_SCEP_PROFILES` family from the SCEP RFC
|
||||
8894 master bundle — see [`legacy-est-scep.md`](legacy-est-scep.md)
|
||||
for the SCEP equivalent.
|
||||
|
||||
## Authentication modes
|
||||
|
||||
certctl supports three EST authentication topologies per profile,
|
||||
mixed and matched via `CERTCTL_EST_PROFILE_<NAME>_ALLOWED_AUTH_MODES`:
|
||||
|
||||
| Mode | Endpoint | When to use |
|---------|--------------------------------------|-------------|
| `mtls` | `/.well-known/est-mtls/<pathID>/...` | The device already has a bootstrap cert (factory-provisioned, previous-cert renewal, or out-of-band onboarding). Enterprise procurement teams almost always require this for production fleets; shared-password auth is a checkbox-fail regardless of password strength. |
| `basic` | `/.well-known/est/<pathID>/...` | First-cert bootstrap when no prior cert exists. The `_ENROLLMENT_PASSWORD` is a per-profile shared secret; constant-time comparison via `crypto/subtle.ConstantTimeCompare`. Pair with the source-IP failed-auth rate limit (see below). |
| both | both routes published | Migration window: existing devices renew via mTLS, new devices bootstrap via Basic. Same profile config, just both routes registered. |
| (empty) | `/.well-known/est/<pathID>/...` | Anonymous; no auth required at the EST layer. Back-compat for pre-Phase-1 deploys. Hardened-deployment best practice is to set this explicitly to `basic` or `mtls`; a future bundle may flip the default. |
|
||||
|
||||
Per-profile cross-check enforced at boot:
|
||||
|
||||
- `mtls` in the list requires `_MTLS_ENABLED=true` AND
|
||||
`_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH` non-empty.
|
||||
- `basic` in the list requires `_ENROLLMENT_PASSWORD` non-empty.
|
||||
- Unknown auth modes refused at boot with the offending token in the
|
||||
error message.
|
||||
|
||||
**Source-IP failed-auth rate limit.** When `_ENROLLMENT_PASSWORD` is
|
||||
set and the Basic-auth gate trips, the handler increments a sliding-
|
||||
window counter keyed on the source IP. After 10 consecutive failures
|
||||
in an hour, the source is locked out (HTTP 429-equivalent failure
|
||||
code) for the rest of the window. The limiter is process-local
|
||||
(50k-IP cap, sliding 1h window — defaults; tunable in a follow-up).
|
||||
This is independent of the per-(CN, sourceIP) per-principal limiter
|
||||
discussed under [Renewal](#renewal-device-driven-model).
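
A sliding-window failure counter of the kind described above might look like the following Go sketch. It is an assumption-laden illustration: `failureLimiter` and `simulate` are hypothetical names, the sketch is not concurrency-safe, it omits the 50k-IP cap, and it counts every failure in the window rather than resetting on a successful login (this section does not say how "consecutive" is tracked).

```go
package main

import (
	"fmt"
	"time"
)

// failureLimiter tracks failed-auth timestamps per source IP.
type failureLimiter struct {
	window   time.Duration
	maxFails int
	fails    map[string][]time.Time
}

func newFailureLimiter(maxFails int, window time.Duration) *failureLimiter {
	return &failureLimiter{window: window, maxFails: maxFails, fails: map[string][]time.Time{}}
}

// prune drops timestamps that have slid out of the window and returns
// the ones that remain.
func (l *failureLimiter) prune(ip string, now time.Time) []time.Time {
	kept := l.fails[ip][:0]
	for _, t := range l.fails[ip] {
		if now.Sub(t) < l.window {
			kept = append(kept, t)
		}
	}
	if len(kept) == 0 {
		delete(l.fails, ip)
		return nil
	}
	l.fails[ip] = kept
	return kept
}

// RecordFailure notes one failed Basic-auth attempt from ip.
func (l *failureLimiter) RecordFailure(ip string, now time.Time) {
	l.fails[ip] = append(l.prune(ip, now), now)
}

// Locked reports whether ip has reached the failure cap within the window.
func (l *failureLimiter) Locked(ip string, now time.Time) bool {
	return len(l.prune(ip, now)) >= l.maxFails
}

// simulate records the given number of failures one second apart and
// reports the lock state minutesLater after the last one.
func simulate(failures, minutesLater int) bool {
	l := newFailureLimiter(10, time.Hour)
	t := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	for i := 0; i < failures; i++ {
		t = t.Add(time.Second)
		l.RecordFailure("203.0.113.7", t)
	}
	return l.Locked("203.0.113.7", t.Add(time.Duration(minutesLater)*time.Minute))
}

func main() {
	fmt.Println(simulate(9, 0), simulate(10, 0), simulate(10, 61))
}
```

Passing the clock in as a parameter keeps the limiter deterministic under test; a production version would also need a mutex and an eviction policy.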
|
||||
|
||||
## RFC 9266 channel binding
|
||||
|
||||
When `CERTCTL_EST_PROFILE_<NAME>_CHANNEL_BINDING_REQUIRED=true`, the
|
||||
EST handler enforces RFC 9266 `tls-exporter` channel binding. The
|
||||
client must include an `id-aa-channelBindings` attribute in the CSR
|
||||
whose value matches the server's
|
||||
`r.TLS.ExportKeyingMaterial("EXPORTER-Channel-Binding", nil, 32)`
|
||||
output, computed independently at request time.
|
||||
|
||||
What this defends against: an attacker that bridges two TLS
|
||||
connections (one client → attacker, another attacker → certctl) and
|
||||
forwards the device's CSR through the attacker's TLS session. Without
|
||||
channel binding, certctl sees a valid CSR submitted over a TLS
|
||||
session authenticated by the attacker's cert; with channel binding,
|
||||
the CSR's binding bytes only match if the CSR was signed against
|
||||
THIS TLS session's exporter material.
|
||||
|
||||
Failure mode mapping:
|
||||
|
||||
| Server-side error | HTTP status | Meaning |
|-----------------------------|-------------|---------|
| `ErrChannelBindingMissing` | 400 | `_CHANNEL_BINDING_REQUIRED=true` but the CSR's attribute is absent. Bad client config (or a non-RFC-9266 EST client). |
| `ErrChannelBindingMismatch` | 409 | Attribute present but doesn't match the live exporter: a MITM signal. Treat as a security event, log the source IP. |
| `ErrChannelBindingNotTLS13` | 426 | Client connected over TLS 1.2; `tls-exporter` requires TLS 1.3. Upgrade client OR rely on the TLS-1.2 reverse-proxy runbook. |
|
||||
|
||||
Cross-check at boot: setting `_CHANNEL_BINDING_REQUIRED=true` on a
|
||||
profile with `_MTLS_ENABLED=false` is refused — channel binding is
|
||||
meaningful only when mTLS is in use (otherwise the binding has no
|
||||
client identity to bind to).
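
The server-side check and the three failure modes from the table above can be sketched with Go's `crypto/tls`. This is a hedged illustration: the function and error names are hypothetical stand-ins for the handler's own, the order of the checks is an assumption, and extracting the attribute bytes from the CSR is out of scope here.

```go
package main

import (
	"crypto/subtle"
	"crypto/tls"
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the three handler errors.
var (
	errBindingMissing  = errors.New("channel binding missing")       // HTTP 400
	errBindingMismatch = errors.New("channel binding mismatch")      // HTTP 409
	errBindingNotTLS13 = errors.New("channel binding needs TLS 1.3") // HTTP 426
)

// compareBinding checks the CSR-supplied bytes against the exporter
// value in constant time.
func compareBinding(exported, fromCSR []byte) error {
	if len(fromCSR) == 0 {
		return errBindingMissing
	}
	if subtle.ConstantTimeCompare(exported, fromCSR) != 1 {
		return errBindingMismatch
	}
	return nil
}

// verifyChannelBinding derives the tls-exporter value for this
// connection and compares it with the bytes carried in the CSR.
func verifyChannelBinding(cs *tls.ConnectionState, fromCSR []byte) error {
	if cs == nil || cs.Version != tls.VersionTLS13 {
		return errBindingNotTLS13
	}
	exported, err := cs.ExportKeyingMaterial("EXPORTER-Channel-Binding", nil, 32)
	if err != nil {
		return err
	}
	return compareBinding(exported, fromCSR)
}

func main() {
	exported := []byte("0123456789abcdef0123456789abcdef")
	fmt.Println(compareBinding(exported, exported))
	fmt.Println(compareBinding(exported, nil))
	fmt.Println(verifyChannelBinding(&tls.ConnectionState{Version: tls.VersionTLS12}, exported))
}
```

In an HTTP handler the connection state would come from `r.TLS`, which is already a `*tls.ConnectionState`.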
|
||||
|
||||
**libest support.** Cisco libest v3.0+ supports the RFC 9266
|
||||
`--tls-exporter` flag. Older builds (commonly distros' packaged
|
||||
versions through 2024) do not; per-profile opt-out via leaving the
|
||||
env var `false` is the migration path. The libest sidecar in
|
||||
`deploy/test/libest/Dockerfile` builds v3.2.0-2 from source and
|
||||
includes the flag.
|
||||
|
||||
## WiFi / 802.1X recipe (FreeRADIUS)
|
||||
|
||||
This recipe stands up an EAP-TLS-authenticated corporate WiFi network
|
||||
where certctl issues every device certificate via EST. End-to-end
|
||||
flow:
|
||||
|
||||
```mermaid
flowchart LR
    Laptop["Laptop / supplicant<br/>(wpa_supplicant / iwd / Apple WiFi)"]
    AP["WiFi access point (NAS)"]
    Radius["FreeRADIUS<br/>(validate cert chain)"]
    CA["certctl CA<br/>(EST profile 'wifi')"]
    Laptop -->|EAP| AP
    AP -->|Radius| Radius
    Radius -.->|trusts| CA
    Laptop -->|"EST: /simpleenroll, /simplereenroll<br/>(one-time, then renewal)"| CA
```
|
||||
|
||||
### certctl-side: EST profile config for 802.1X
|
||||
|
||||
```
CERTCTL_EST_ENABLED=true
CERTCTL_EST_PROFILES=wifi
CERTCTL_EST_PROFILE_WIFI_ISSUER_ID=iss-local
CERTCTL_EST_PROFILE_WIFI_PROFILE_ID=cp-wifi-eap-tls
CERTCTL_EST_PROFILE_WIFI_MTLS_ENABLED=true
CERTCTL_EST_PROFILE_WIFI_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH=/etc/certctl/wifi-bootstrap-ca.pem
CERTCTL_EST_PROFILE_WIFI_ALLOWED_AUTH_MODES=mtls
CERTCTL_EST_PROFILE_WIFI_CHANNEL_BINDING_REQUIRED=true
CERTCTL_EST_PROFILE_WIFI_RATE_LIMIT_PER_PRINCIPAL_24H=3
```
|
||||
|
||||
The matching `CertificateProfile` (`cp-wifi-eap-tls`) configured via
|
||||
the API or GUI:
|
||||
|
||||
- `AllowedKeyAlgorithms`: ECDSA P-256 (covers Apple, Android, modern
|
||||
laptop supplicants) plus optional RSA 2048+ for legacy clients.
|
||||
- `AllowedEKUs`: `clientAuth` only (`1.3.6.1.5.5.7.3.2`). Drops
|
||||
`serverAuth` so a device cert can't be reused as a TLS server cert.
|
||||
EAP-TLS requires `clientAuth`; FreeRADIUS will reject certs without
|
||||
it when `eap_chain_check_eku` is on.
|
||||
- `RequiredCSRAttributes`: `["deviceSerialNumber"]` so the device's
|
||||
serial appears in the issued cert (operators correlate WiFi grants
|
||||
back to inventory).
|
||||
- `MaxTTLSeconds`: 31536000 (1 year). Long enough for laptop fleets
|
||||
that don't renew daily; short enough to limit the cert's blast
|
||||
radius on key compromise.
|
||||
|
||||
### Device-side: drive `simpleenroll` from the supplicant
|
||||
|
||||
For Linux/embedded laptops:
|
||||
|
||||
```bash
# Bootstrap once (factory bootstrap cert presented over mTLS):
openssl ecparam -name prime256v1 -genkey -noout -out /etc/wifi/eap.key
openssl req -new -key /etc/wifi/eap.key \
  -subj "/CN=laptop-001/serialNumber=ABC123" \
  -out /etc/wifi/eap.csr
curl -sS --cacert /etc/certctl/ca.crt \
  --cert /etc/wifi/bootstrap.crt \
  --key /etc/wifi/bootstrap.key \
  -H "Content-Type: application/pkcs10" \
  --data-binary @<(openssl req -in /etc/wifi/eap.csr -outform DER | base64 -w0) \
  https://certctl.example.com:8443/.well-known/est-mtls/wifi/simpleenroll \
  | base64 -d | openssl pkcs7 -inform DER -print_certs > /etc/wifi/eap.crt

# Renewal cycle (cron, 10 days before NotAfter). The CSR reuses the
# bootstrap subject, serialNumber included, so it matches the cert
# being renewed:
curl -sS --cacert /etc/certctl/ca.crt \
  --cert /etc/wifi/eap.crt \
  --key /etc/wifi/eap.key \
  -H "Content-Type: application/pkcs10" \
  --data-binary @<(openssl req -new -key /etc/wifi/eap.key -subj "/CN=laptop-001/serialNumber=ABC123" -outform DER | base64 -w0) \
  https://certctl.example.com:8443/.well-known/est-mtls/wifi/simplereenroll \
  | base64 -d | openssl pkcs7 -inform DER -print_certs > /etc/wifi/eap.crt.new && \
  mv /etc/wifi/eap.crt.new /etc/wifi/eap.crt
```
|
||||
|
||||
For Apple-managed devices the equivalent flow is wrapped by an MDM
|
||||
profile that drives EST. For ChromeOS the Admin Console SCEP profile
|
||||
remains the easier path until Google's EST support stabilises (track
|
||||
the [SCEP+ChromeOS guide](legacy-est-scep.md#scep-rfc-8894-native-implementation-post-2026-04-29)).
|
||||
|
||||
### FreeRADIUS-side: EAP-TLS configuration
|
||||
|
||||
In `mods-available/eap`:
|
||||
|
||||
```
eap {
    default_eap_type = tls
    tls-config tls-common {
        # The CA bundle that signed certctl's EST-issued device certs.
        # Save the certctl issuer's CA chain to this path; the
        # FreeRADIUS daemon reloads on HUP.
        ca_file = /etc/freeradius/certs/certctl-ca.pem

        # Server cert presented to the supplicant for tunnel TLS.
        # Separate cert chain: FreeRADIUS's own cert, NOT a
        # certctl-issued client cert.
        certificate_file = /etc/freeradius/certs/freeradius-server.pem
        private_key_file = /etc/freeradius/certs/freeradius-server.key

        # Require the supplicant cert's issuer DN to match the certctl CA.
        check_cert_issuer = "/CN=certctl-corp-ca"

        # Require the supplicant cert's CN to match the RADIUS User-Name.
        check_cert_cn = "%{User-Name}"
    }
    tls {
        tls = tls-common
    }
}
```
|
||||
|
||||
The matching `sites-available/default` authorize block invokes
|
||||
`eap` and rejects on cert-chain failure. CRL/OCSP validation against
|
||||
certctl's CRL endpoint (`/.well-known/pki/crl/<issuer_id>`) is
|
||||
configured under `tls-common.crl_dir` — see [`crl-ocsp.md`](crl-ocsp.md)
|
||||
for the certctl-side CRL distribution endpoint and refresh cadence.
|
||||
|
||||
### End-to-end flow
|
||||
|
||||
1. Laptop boots, supplicant starts EAP-TLS handshake against the AP.
|
||||
2. AP forwards the EAP frames to FreeRADIUS over RADIUS.
|
||||
3. FreeRADIUS validates the supplicant cert chain against
|
||||
`certctl-ca.pem`, checks revocation against the certctl CRL, and
|
||||
pins the EKU to `clientAuth`.
|
||||
4. On valid cert, FreeRADIUS returns Access-Accept; the AP grants
|
||||
network access.
|
||||
5. ~10 days before the cert's `NotAfter`, the device's renewal cron
|
||||
hits `simplereenroll` over the EXISTING mTLS-authenticated session
|
||||
— no operator interaction.
|
||||
|
||||
What can go wrong (operator playbook):
|
||||
|
||||
| Symptom | Diagnostic | Fix |
|---------|------------|-----|
| Supplicant rejected at TLS handshake | `tcpdump` on AP shows TLS-1.2 hello | Update supplicant to TLS 1.3 OR ensure FreeRADIUS's server cert is signed under a chain the supplicant trusts. |
| FreeRADIUS rejects with "expired CRL" | `freeradius -X` log surfaces stale CRL | certctl regenerates per-issuer CRLs hourly (see [`crl-ocsp.md`](crl-ocsp.md)); tighten `crl_dir` reload cadence in FreeRADIUS. |
| Renewal fails with HTTP 429 | certctl audit log shows `est_rate_limited` for this device | Per-(CN, sourceIP) limit tripped; either widen `_RATE_LIMIT_PER_PRINCIPAL_24H` or investigate why the device is renewing >3x/24h. |
| Renewal fails with HTTP 401 | certctl audit log shows `est_auth_failed_mtls` | Bootstrap cert chain doesn't trace to `_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH`. Re-issue or rotate. |
| Sustained `est_auth_failed_basic` from one IP | certctl audit log + IP reverse lookup | Likely brute-force; the source-IP limiter will lock the IP after 10 fails/hr. Block at firewall. |
|
||||
|
||||
## IoT bootstrap recipe
|
||||
|
||||
Long-running devices in the field — sensors, gateways, kiosks —
|
||||
typically follow this lifecycle:
|
||||
|
||||
1. **Factory provisioning** — bake one of:
|
||||
- A **bootstrap enrollment password** into the device firmware
|
||||
(per-fleet shared secret; pair with the source-IP rate limit)
|
||||
- A **factory-installed bootstrap cert** signed by the operator's
|
||||
factory CA, suitable for mTLS on first enroll
|
||||
2. **First boot** — device generates an ECDSA P-256 keypair locally,
|
||||
builds a CSR with its serial in `deviceSerialNumber`, and POSTs to
|
||||
`/.well-known/est/<pathID>/simpleenroll` (with HTTP Basic) or
|
||||
`/.well-known/est-mtls/<pathID>/simpleenroll` (with the bootstrap
|
||||
cert). On success, the device persists the issued cert and the
|
||||
bootstrap material can be discarded.
|
||||
3. **Steady state** — device drives `simplereenroll` over the
|
||||
issued cert's mTLS session once ~10–25% of its lifetime remains. The
|
||||
re-enrollment uses the issued cert as the client cert; no shared
|
||||
secrets in the renewal path.
|
||||
4. **Compromise / decommission** — operator hits the bulk-revoke
|
||||
endpoint:
|
||||
|
||||
```bash
curl -sS -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $CERTCTL_API_KEY" \
  --cacert /path/to/ca.crt \
  https://certctl.example.com:8443/api/v1/est/certificates/bulk-revoke \
  -d '{"reason":"keyCompromise","profile_id":"cp-iot-sensors"}'
```
|
||||
|
||||
The endpoint is M-008 admin-gated; non-admin Bearer callers receive
|
||||
HTTP 403. Source is auto-pinned to `EST` server-side, so the
|
||||
operation only revokes EST-issued certs even if the criteria match
|
||||
non-EST sources too. The CRL/OCSP responder picks up the revocations
|
||||
on the next refresh cycle (`CERTCTL_CRL_GENERATION_INTERVAL`,
|
||||
default 1h) — see [`crl-ocsp.md`](crl-ocsp.md).
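
For operators who script decommissioning, the same call can be assembled with Go's `net/http`. The sketch below only builds the request; `bulkRevokeCriteria` and `newBulkRevokeRequest` are hypothetical names, and the two JSON fields are the ones shown in the curl example, not a complete schema.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// bulkRevokeCriteria carries the two fields used in the curl example.
type bulkRevokeCriteria struct {
	Reason    string `json:"reason"`
	ProfileID string `json:"profile_id"`
}

// newBulkRevokeRequest builds the POST against the bulk-revoke endpoint
// with the JSON body and Bearer credentials set.
func newBulkRevokeRequest(baseURL, apiKey string, c bulkRevokeCriteria) (*http.Request, error) {
	body, err := json.Marshal(c)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost,
		baseURL+"/api/v1/est/certificates/bulk-revoke", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+apiKey)
	return req, nil
}

func main() {
	req, err := newBulkRevokeRequest("https://certctl.example.com:8443", "example-key",
		bulkRevokeCriteria{Reason: "keyCompromise", ProfileID: "cp-iot-sensors"})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL)
}
```

Sending it would additionally need an `http.Client` whose TLS config trusts the certctl CA, the equivalent of curl's `--cacert`.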
|
||||
|
||||
**Recommended cert lifetimes for IoT.** Set `MaxTTLSeconds = 7776000`
|
||||
(90 days) on the IoT `CertificateProfile`. Long enough to absorb
|
||||
multi-day network outages without losing the device; short enough to
|
||||
limit exposure on key compromise (combined with bulk revoke + CRL
|
||||
refresh, the worst-case window is `1h + crl_refresh_interval` from
|
||||
revocation to relying-party rejection).
|
||||
|
||||
**Renewal trigger ratio for IoT.** Set the device's renewal cron to
|
||||
fire at 25% remaining lifetime — that gives ~22 days of buffer for a
|
||||
device that's offline at expiry-time to reconnect, retry, and
|
||||
re-enroll before the cert hard-expires. Mirrors the renewal-trigger
|
||||
ratio for laptops at 50% (laptops are online more often, so the
|
||||
buffer can be tighter relative to lifetime).
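
The arithmetic behind the renewal trigger is simple enough to sketch. In the Go fragment below, `renewAt` and `bufferDays` are hypothetical helpers; for a 90-day cert renewed at 25% remaining lifetime the buffer works out to 22.5 days, which the text above rounds to ~22.

```go
package main

import (
	"fmt"
	"time"
)

// renewAt returns the instant at which the given fraction of the
// certificate's total lifetime remains before NotAfter.
func renewAt(notBefore, notAfter time.Time, remainingFraction float64) time.Time {
	lifetime := notAfter.Sub(notBefore)
	buffer := time.Duration(float64(lifetime) * remainingFraction)
	return notAfter.Add(-buffer)
}

// bufferDays reports how many days before expiry the renewal fires for
// a certificate of the given lifetime.
func bufferDays(lifetimeDays int, remainingFraction float64) float64 {
	notBefore := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	notAfter := notBefore.AddDate(0, 0, lifetimeDays)
	return notAfter.Sub(renewAt(notBefore, notAfter, remainingFraction)).Hours() / 24
}

func main() {
	fmt.Println(bufferDays(90, 0.25)) // IoT profile: 90-day cert, 25% remaining
	fmt.Println(bufferDays(365, 0.5)) // laptop profile: 1-year cert, 50% remaining
}
```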
|
||||
|
||||
## `serverkeygen` for resource-constrained devices
|
||||
|
||||
RFC 7030 §4.4 lets the server generate the keypair on behalf of the
|
||||
client when the device lacks a hardware RNG — typical of ultra-low-
|
||||
power IoT or embedded modules without a TRNG. certctl supports this
|
||||
via `CERTCTL_EST_PROFILE_<NAME>_SERVERKEYGEN_ENABLED=true`.
|
||||
|
||||
Wire format: `POST /.well-known/est/<pathID>/serverkeygen` with the
|
||||
device's CSR as the request body. The handler:
|
||||
|
||||
1. Parses the CSR; the CSR's pubkey is treated as the **recipient
|
||||
key** for CMS EnvelopedData wrapping (RFC 7030 §4.4.2). The CSR's
|
||||
pubkey must support keyTrans (RSA-only at this revision; ECDH is
deferred to a follow-up bundle). Non-RSA CSRs return HTTP 400 with
|
||||
`ErrServerKeygenRequiresKeyEncipherment`.
|
||||
2. Resolves the per-profile key algorithm from
|
||||
`CertificateProfile.AllowedKeyAlgorithms` (default RSA-2048).
|
||||
3. Generates a fresh keypair in process memory.
|
||||
4. Re-builds the CSR with the server-generated pubkey (so the issuer
|
||||
sees a CSR that matches the cert it's signing).
|
||||
5. Runs the existing issuer pipeline.
|
||||
6. Marshals the private key as PKCS#8 DER, then wraps it in CMS
EnvelopedData: the content is encrypted with AES-256-CBC under a
per-call random key and IV, and that key is encrypted to the device's
CSR pubkey (keyTrans).
|
||||
7. Returns the response as `multipart/mixed` per RFC 7030 §4.4.2:
|
||||
first part is the cert chain (PKCS#7), second part is the
|
||||
EnvelopedData blob (`application/pkcs8`).
|
||||
8. **Zeroizes** the plaintext key + PKCS#8 bytes before return —
|
||||
`internal/service/est.go::zeroizeKey` + `zeroizeBytes`. The
|
||||
private key never persists to disk on the certctl side.
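
Steps 3, 6 (the PKCS#8 marshalling half), and 8 can be sketched with the standard library alone. This is not the code in `internal/service/est.go`: `generatePKCS8`, `zeroizeBytes`, and `allZero` are hypothetical names, the CMS EnvelopedData wrap is left as a comment because Go's standard library has no CMS package, and the sketch wipes only the DER buffer, not the in-memory `rsa.PrivateKey`.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"fmt"
)

// generatePKCS8 creates a fresh RSA-2048 key in memory and returns its
// PKCS#8 DER encoding.
func generatePKCS8() ([]byte, error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, err
	}
	return x509.MarshalPKCS8PrivateKey(key)
}

// zeroizeBytes overwrites the buffer in place so the plaintext encoding
// does not linger in this slice after use.
func zeroizeBytes(b []byte) {
	for i := range b {
		b[i] = 0
	}
}

// allZero reports whether every byte of b is zero.
func allZero(b []byte) bool {
	for _, v := range b {
		if v != 0 {
			return false
		}
	}
	return true
}

func main() {
	der, err := generatePKCS8()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(der) > 0, allZero(der))

	// The CMS EnvelopedData wrap of der would happen here (omitted).

	zeroizeBytes(der)
	fmt.Println(allZero(der))
}
```

Overwriting the slice clears this one buffer; it cannot reach copies the runtime or earlier code may have made, which is the residual risk the caveats below describe.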
|
||||
|
||||
Cross-check at boot: setting `_SERVERKEYGEN_ENABLED=true` on a
|
||||
profile with empty `_PROFILE_ID` is refused — server-keygen needs a
|
||||
`CertificateProfile` to pin `AllowedKeyAlgorithms` (the server has
|
||||
to decide what key to generate, and a profile-less default would be
|
||||
arbitrary).
|
||||
|
||||
**Security caveats.**
|
||||
|
||||
- **Trust transitivity.** Server-keygen breaks the cardinal property
|
||||
of agent-based key management: that the private key never leaves
|
||||
the device. The CMS wrap protects the key in transit, but the
|
||||
device still trusts certctl with the key material at generation
|
||||
time. Use only when the device cannot generate its own keypair —
|
||||
not as a convenience.
|
||||
- **Heap residency window.** The plaintext key lives in process heap
|
||||
between generation and CMS encryption. The zeroize step closes the
|
||||
obvious leakage leg, but a Go runtime that GC-relocates the buffer
|
||||
before zeroize fires could leave a copy. The threat-model carve-out
|
||||
is documented in [Threat model](#threat-model); use HSM-backed
|
||||
signing for highest-assurance fleets.
|
||||
- **No audit-log trail of the key bytes.** The audit row records
|
||||
the issuance (cert serial, subject, issuer) but never the key
|
||||
bytes; the operator cannot recover a key after issuance. This is
|
||||
by design — the key bytes only exist for the duration of the
|
||||
request.
|
||||
|
||||
## HSM-backed CA signing for EST
|
||||
|
||||
EST signs certs using whatever issuer connector the profile binds.
|
||||
The `internal/crypto/signer/` interface (post-2026-04-28) means a
|
||||
future HSM/PKCS#11 driver bundle (parking-lot at
|
||||
`cowork/hsm-pkcs11-driver-prompt.md`) plugs in transparently — the
|
||||
EST handler doesn't change. EST-issued certs benefit from HSM-backed
|
||||
signing automatically once the HSM bundle ships and the operator
|
||||
swaps the local issuer's `FileDriver` for a `PKCS11Driver`.
|
||||
|
||||
For deploys that need hardware-backed protection of the CA key before
that bundle ships, the interim option is the local issuer's
`FileDriver` with the CA key on a read-only TPM-protected tmpfs; the
L-014 file-on-disk threat-model carve-out in
`internal/connector/issuer/local/local.go` documents the
defense-in-depth steps.
|
||||
|
||||
## Operator GUI (EST Admin tabs)

The EST Admin surface lives at `/est` (route `web/src/main.tsx`,
nav link `web/src/components/Layout.tsx::EST Admin`). The page is
admin-gated at the top level: non-admin Bearer callers see an
"Admin access required" banner, and the underlying admin endpoints
(`/api/v1/admin/est/*`) are M-008 protected server-side independently.

Three tabs:

- **Profiles** (default): per-profile lean cards with auth-mode
  badges, mTLS trust-anchor expiry countdown (green ≥30d / amber
  7–30d / red <7d / EXPIRED), the 12-cell live counter grid (every
  `est_*` failure mode), and a "Reload trust anchor" modal that
  hits `POST /api/v1/admin/est/reload-trust` (the SIGHUP-equivalent;
  bad reloads keep the OLD pool in place per the
  [Threat model](#threat-model) reload semantics).
- **Recent Activity**: merges the four EST audit-action prefixes
  (`est_simple_enroll`, `est_simple_reenroll`, `est_server_keygen`,
  `est_auth_failed`) across four parallel queries with chip filters
  (All / Enrollment / Re-enrollment / ServerKeygen / AuthFailure).
  Polled every 60s.
- **Trust Bundle**: per-mTLS-profile cert subjects + expiries
  surfaced from the trust holder snapshot. Used during rotation:
  operator extracts the new bundle, overwrites the on-disk file,
  hits Reload, then reloads this tab to confirm the new subjects.
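The expiry-countdown bands on the Profiles tab can be sketched as a small classifier. This is an illustrative Python sketch, not the GUI's actual code: the function name `expiry_badge` is hypothetical, and treating the exact `NotAfter` instant as still valid is an assumption.

```python
from datetime import datetime, timedelta, timezone


def expiry_badge(not_after: datetime, now: datetime) -> str:
    """Map a trust anchor's NotAfter to the badge bands listed above.

    Assumption: the cert is still valid at the exact NotAfter instant,
    so EXPIRED only applies strictly after it.
    """
    remaining = not_after - now
    if remaining < timedelta(0):
        return "EXPIRED"
    if remaining < timedelta(days=7):
        return "red"      # fewer than 7 days left
    if remaining < timedelta(days=30):
        return "amber"    # 7 to 30 days left
    return "green"        # 30 days or more left


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(expiry_badge(now + timedelta(days=12), now))
```

A cert with exactly 7 days left lands in the amber band and one with exactly 30 days left in the green band, matching the `≥30d` / `7–30d` / `<7d` boundaries in the tab description.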
|
||||
|
||||
The two admin endpoints (`GET /api/v1/admin/est/profiles` and
`POST /api/v1/admin/est/reload-trust`) and the audit queries the GUI
merges are all M-008 admin-gated. Hiding the page from non-admins is
only a UX hint; the server-side gate is the security boundary.
|
||||
|
||||
## CLI + MCP tools

The `certctl-cli est` subcommand family (`internal/cli/est.go`):

```
certctl-cli est cacerts --profile <name>
certctl-cli est csrattrs --profile <name>
certctl-cli est enroll --profile <name> --csr <path|-> [--out <path>]
certctl-cli est reenroll --profile <name> --csr <path|-> [--out <path>]
certctl-cli est serverkeygen --profile <name> --csr <path> --out <prefix>
certctl-cli est test --profile <name>
```

`--profile` is the lowercased PathID (matches the URL path). An empty
profile string maps to the legacy `/.well-known/est/` root; use it only
during a back-compat migration. Server-keygen writes
`<prefix>.cert.pem` plus `<prefix>.key.enveloped` (the EnvelopedData
blob, decryptable with `openssl smime`).

The MCP server (`internal/mcp/tools_est.go`) exposes six tools that
mirror the CLI surface for AI-orchestrated workflows:

- `est_list_profiles`: every configured EST profile plus its auth modes
  and counters
- `est_admin_stats`: alias of the above; matches the
  `scep_admin_stats` naming convention
- `est_get_cacerts`: base64 PKCS#7 cert chain
- `est_get_csrattrs`: base64 DER attributes blob (per-profile when
  `RequiredCSRAttributes` is set)
- `est_enroll`: body carries the CSR PEM; returns the issued cert
- `est_reenroll`: same but uses the previous-cert mTLS path

All six are gated by the standard MCP Bearer auth, plus the same admin
gate as the EST Admin page where applicable (`est_list_profiles`,
`est_admin_stats`).
|
||||
|
||||
## Renewal: device-driven model

RFC 7030 §4.2.2 defines the renewal model: the **device** decides
when to renew and drives `simplereenroll` over its existing cert.
There is no server-initiated push; certctl never reaches out to a
device fleet to force renewal.

Practical implications:

- A device offline at expiry-time **loses its cert**. Mitigation:
  pick a renewal-trigger ratio with enough buffer (50% remaining
  lifetime for laptops, 25% for IoT; see
  [IoT bootstrap recipe](#iot-bootstrap-recipe)). On chronically
  offline fleets, lengthen `MaxTTLSeconds`.
- The "operator wants to push renewal" case is handled via the
  notification webhook surface (`internal/connector/notifier/webhook/`):
  the operator publishes an event on a topic the device fleet
  subscribes to (or that the operator's MDM picks up), and the device's
  MDM agent triggers the renewal cron out-of-band. certctl emits a
  `cert.expiring_soon` event on the standard 30/7/1-day pre-expiry
  schedule (`internal/scheduler/scheduler.go::expiryNotificationLoop`).
- A per-(CN, sourceIP) sliding-window cap keeps a misbehaving device
  from hammering the server. The default is `0` (disabled, back-compat);
  production deploys set it to `3` via
  `CERTCTL_EST_PROFILE_<NAME>_RATE_LIMIT_PER_PRINCIPAL_24H`.
  This mirrors the SCEP/Intune per-device limit pattern from
  [`scep-intune.md`](scep-intune.md).
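To make the renewal-trigger ratio concrete, here is a minimal device-side sketch in Python. It assumes the ratio is applied to the cert's full `NotBefore` to `NotAfter` lifetime; the function name `renew_at` is hypothetical and is not part of certctl or any EST client.

```python
from datetime import datetime, timedelta, timezone


def renew_at(not_before: datetime, not_after: datetime,
             remaining_ratio: float) -> datetime:
    """Return the instant at which `remaining_ratio` of the lifetime is left.

    remaining_ratio=0.5 renews at the halfway point (the laptop guidance
    above); remaining_ratio=0.25 renews when a quarter remains (IoT).
    """
    if not 0 < remaining_ratio < 1:
        raise ValueError("remaining_ratio must be between 0 and 1")
    lifetime = not_after - not_before
    return not_after - lifetime * remaining_ratio


if __name__ == "__main__":
    issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
    expires = issued + timedelta(days=90)
    print("laptop renews at:", renew_at(issued, expires, 0.5))
    print("IoT renews at:   ", renew_at(issued, expires, 0.25))
```

The trade-off is visible in the arithmetic: a larger remaining ratio leaves more time for an offline device to come back before expiry, at the cost of more frequent `simplereenroll` calls.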
|
||||
|
||||
## Troubleshooting matrix

The handler emits a typed audit-action code per failure mode. Filter
the GUI Recent Activity tab on the action prefix to find the
offending requests, and use the table below to map back to root
cause + fix.

| Audit action | Symptom | Root cause + fix |
|---|---|---|
| `est_simple_enroll_success` | (success counter) | No action needed. |
| `est_simple_enroll_failed` | An enrollment failed; the typed reason is in the audit row's `details` | The audit row's `details` carries the inner reason; cross-reference one of the rows below. |
| `est_simple_reenroll_success` | (success counter) | No action needed. |
| `est_simple_reenroll_failed` | A renewal failed | Same as `est_simple_enroll_failed`; cross-reference inner reason. |
| `est_server_keygen_success` | (success counter) | No action needed. |
| `est_server_keygen_failed` | Server-keygen failed | Most common: device CSR carries a non-RSA pubkey (the keyTrans wrap requires RSA at this revision). Switch the device to an RSA CSR or wait for ECDH support. |
| `est_auth_failed_basic` | HTTP Basic gate tripped | Wrong password OR the password env var rotated and the device wasn't re-provisioned. Watch the source-IP for sustained failures; the limiter locks out after 10 fails/hr. |
| `est_auth_failed_mtls` | mTLS gate tripped | Client cert doesn't chain to the trust anchor OR the cert is past `NotAfter` OR the cert presented is for a different EST profile (cross-profile bleed defense). Check `details.subject` against `_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH`. |
| `est_auth_failed_channel_binding` | RFC 9266 channel-binding gate tripped | One of: missing `id-aa-channelBindings` attribute on the CSR (libest <v3.0); mismatch (MITM signal: log + escalate); TLS 1.2 client (channel binding requires TLS 1.3). Map the inner error to the [channel-binding table](#rfc-9266-channel-binding). |
| `est_rate_limited` | Per-(CN, sourceIP) cap tripped | If legitimate (recovery + first-cert + post-wipe in 24h), bump `_RATE_LIMIT_PER_PRINCIPAL_24H`. If suspicious, the limiter is doing its job; investigate the device. |
| `est_csr_policy_violation` | CSR violates the bound `CertificateProfile` rules | Inner detail names the dimension (key alg, key size, EKU, SAN, max TTL). Either fix the device CSR or relax the policy; never silently accept. |
| `est_bulk_revoke` | Operator-initiated bulk revoke | Audit-only signal; no failure. Cross-reference the operator's identity in `details.actor`. |
| `est_trust_anchor_reloaded` | Operator-initiated SIGHUP-equivalent reload | Audit-only signal; no failure. Failed reloads do NOT emit this code (the OLD pool stays in place; check the GUI Reload modal's error message + the `details.path_id`). |

The bare action codes (without the `_success`/`_failed` suffix) are
also emitted for back-compat with the GUI activity-tab filter chips,
which match by exact-string `startsWith()`; the split-emit pattern
preserves both the legacy-grep and the new typed-counter use cases.
See `internal/service/est_audit_actions.go` for the constant
definitions; the per-action emission sites are in
`internal/service/est.go::processEnrollment`.
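As a rough illustration of how prefix-based chips group the suffixed and bare codes together, here is a Python sketch. The chip names and prefixes come from the Recent Activity tab description; the chip-to-prefix mapping and the `chip_for` helper are assumptions, not the GUI's actual TypeScript.

```python
from typing import Optional

# Assumed mapping from filter chip to audit-action prefix.
CHIP_PREFIXES = {
    "Enrollment": "est_simple_enroll",
    "Re-enrollment": "est_simple_reenroll",
    "ServerKeygen": "est_server_keygen",
    "AuthFailure": "est_auth_failed",
}


def chip_for(action: str) -> Optional[str]:
    """Return the chip whose prefix the action starts with, if any."""
    for chip, prefix in CHIP_PREFIXES.items():
        if action.startswith(prefix):
            return chip
    return None


if __name__ == "__main__":
    for action in ("est_simple_enroll", "est_simple_enroll_failed",
                   "est_auth_failed_mtls", "est_rate_limited"):
        print(action, "->", chip_for(action))
```

Note that `est_simple_enroll` is not a prefix of `est_simple_reenroll`, so the Enrollment and Re-enrollment chips stay disjoint under this scheme.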
|
||||
|
||||
## TLS 1.2 reverse-proxy runbook

Some embedded EST clients only speak TLS 1.2: older OpenWRT routers,
some industrial PLCs, IoT firmware that can't be field-upgraded.
certctl's control plane is TLS 1.3 only (pinned at
`cmd/server/tls.go::buildServerTLSConfig`). The migration path is the
TLS 1.2 reverse-proxy pattern documented in
[`legacy-est-scep.md`](legacy-est-scep.md):

- nginx / HAProxy terminates TLS 1.2 from the legacy client
- Forwards the EST request body unchanged to certctl on TLS 1.3
- Optionally forwards the client cert via `X-SSL-Client-Cert` for the
  proxy-side mTLS trust pin

Important caveat: **RFC 9266 channel binding cannot work through a
reverse proxy.** The client derives its channel-binding bytes from the
client↔proxy TLS session, while certctl only sees the proxy↔certctl
session, so the two values never match. Disable
`_CHANNEL_BINDING_REQUIRED` for profiles served through the proxy
runbook.
|
||||
|
||||
## Threat model

The EST hardening bundle's threat model rests on these load-bearing
properties; deviations need explicit operator awareness:

- **Trust anchor reload is fail-safe.** A SIGHUP that hits a
  half-rotated bundle (parse error, expired cert) keeps the OLD pool
  in place. The validator never accepts an unparseable bundle. The
  GUI reload modal surfaces the error so the operator can correct
  the file and retry without taking the EST endpoint down.
- **Per-profile counter isolation.** Each ESTService instance has
  its own `estCounterTab` (sync/atomic-backed). A future
  shared-counter refactor would fail at the compile-time pointer-identity
  check in `internal/service/est_profile_counter_isolation_test.go`.
  This means the Recent Activity tab's per-profile filter is a real
  filter, not a fan-out display of one shared counter.
- **mTLS cross-profile bleed is blocked.** A client cert presented
  to profile A's mTLS endpoint must chain to A's trust bundle, not
  any other profile's. The per-handler re-verify enforces this even
  when both profiles share a TLS listener union pool (see
  `cmd/server/tls.go::buildServerTLSConfigWithMTLS`).
- **Source-IP failed-Basic limiter is process-local.** The 10/hr
  cap is enforced in-process; a load-balanced multi-pod deploy where
  request distribution is round-robin can amplify the effective
  per-IP rate by the pod count. Mitigation: use sticky-source-IP
  load balancing for `/.well-known/est/` if this is in scope.
- **Server-keygen has a heap-residency window.** The plaintext
  private key lives in process memory between generation and CMS
  EnvelopedData encryption. The zeroize step closes the obvious
  leakage leg, but a GC-relocation between generation and zeroize
  could leave a copy. Use HSM-backed signing for highest-assurance
  fleets where this matters.
- **HTTP Basic password is in-process only.** Stored in
  `ESTHandler.basicPassword`, never logged, never written to disk by
  certctl. Operators ARE responsible for the env-var injection path
  (Helm secret, Docker secret, Vault); see `tls.md` for the
  recommended secret-mount conventions.
- **The legacy unauthenticated default exists for back-compat.**
  Pre-Phase-1 deploys had no `_ALLOWED_AUTH_MODES` env var; the
  default is empty (anonymous) so existing deploys continue to work.
  A future bundle MAY flip the default to require explicit opt-in;
  production deploys should set `_ALLOWED_AUTH_MODES` explicitly
  today regardless.
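The fail-safe reload property in the first bullet can be sketched as a holder that only swaps its pool after the candidate bundle has loaded cleanly. This is a simplified Python illustration under stated assumptions: `TrustHolder`, `snapshot`, and `reload` are hypothetical names, the pool is modeled as a tuple of subject strings rather than parsed X.509 certs, and the caller-supplied `load_bundle` stands in for the real parse-and-validate step.

```python
import threading
from typing import Callable, Iterable, Tuple


class TrustHolder:
    """Holds the active trust pool; a failed reload never replaces it."""

    def __init__(self, initial_pool: Iterable[str]) -> None:
        self._lock = threading.Lock()
        self._pool: Tuple[str, ...] = tuple(initial_pool)

    def snapshot(self) -> Tuple[str, ...]:
        with self._lock:
            return self._pool

    def reload(self, load_bundle: Callable[[], Iterable[str]]) -> Tuple[bool, str]:
        """Try to load a new bundle; return (ok, error message)."""
        try:
            candidate = tuple(load_bundle())  # parse + validate off to the side
            if not candidate:
                raise ValueError("bundle contains no certificates")
        except Exception as exc:
            return False, str(exc)            # OLD pool stays in place
        with self._lock:
            self._pool = candidate            # swap only after success
        return True, ""


if __name__ == "__main__":
    holder = TrustHolder(["CN=old-anchor"])

    def half_rotated():
        raise ValueError("parse error")

    print(holder.reload(half_rotated), holder.snapshot())
    print(holder.reload(lambda: ["CN=new-anchor"]), holder.snapshot())
```

The point of the design is ordering: all work that can fail happens before the lock is taken, so readers calling `snapshot()` only ever see the old pool or a fully validated new one.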
|
||||
|
||||
## V3-Pro deferrals

These capabilities are deferred to V3-Pro (paid tier). They're not
oversights; they're the natural follow-on bundles after v2.X.0 GA:

- **Conditional Access / device-posture gating.** The per-profile
  ESTService exposes a nil-default compliance-hook seam (mirrors the
  SCEP/Intune `ComplianceCheck` pattern). V3-Pro plugs in a
  Microsoft Graph or other posture-check callback before issuance;
  non-compliant devices fail with a typed `est_compliance_failed`
  reason.
- **Multi-tenant CA isolation.** V2 has one trust anchor pool per
  EST profile and one issuer binding. V3-Pro ships per-tenant root
  and per-tenant audit isolation for MSPs running shared certctl
  deployments across customers.
- **EST cert-bound usage analytics.** Forward device-side handshake
  logs into certctl for cert-bound session analytics. Deferred to
  V3-Pro (or delegated to a real session-management product like
  Teleport for TLS sessions).
- **EST-cert-manager-style controller for K8s host fleets.**
  External-issuer pattern that lets cert-manager use certctl's EST
  server as a backend. Parking-lot per
  `WORKSPACE-ROADMAP.md::Cloud and Kubernetes`.
- **Standalone `certctl-est` CLI binary.** All EST ops route through
  the certctl server in V2. The deferred item is a standalone binary
  that an operator can run on a laptop without the full server
  (similar to the deferred SCEP probe CLI binary). V2 ships the
  `certctl-cli est` subcommand family, which solves the same operator
  workflow at a lower packaging cost.
- **`fullcmc` (RFC 7030 §4.3) implementation.** Rare in practice;
  only Cisco IOS and a few financial-PKI vendors use it. Deferred
  until a customer asks.
|
||||
|
||||
## Appendix A: libest reference client

certctl's CI exercises the EST endpoints against Cisco's libest
reference implementation via the sidecar at
`deploy/test/libest/Dockerfile`. The build reproduces v3.2.0-2 from
source on `debian:bookworm-slim` (digest-pinned per the H-001 guard).

To reproduce locally:

```bash
# From the repo root.
docker compose --profile est-e2e -f deploy/docker-compose.test.yml build libest-client
docker compose --profile est-e2e -f deploy/docker-compose.test.yml up -d libest-client
docker exec -it certctl-libest-client estclient --help
```

The integration test suite (`deploy/test/est_e2e_test.go`, build
tag `integration`) drives the live certctl server through the
sidecar via `docker exec` for these scenarios:

- `TestEST_LibESTClient_Enrollment_Integration`: `cacerts`
  → `simpleenroll` → cert assertion
- `TestEST_LibESTClient_MTLSEnrollment_Integration`: mTLS sibling
  route
- `TestEST_LibESTClient_ServerKeygen_Integration`: RFC 7030 §4.4
  multipart/mixed
- `TestEST_LibESTClient_RateLimited_Integration`: exhausts the
  per-principal cap and asserts the 429-shaped error
- `TestEST_LibESTClient_ChannelBinding_Integration`: RFC 9266
  `--tls-exporter` (skipped when libest build lacks the flag)

Run the suite via `INTEGRATION=1 go test -tags integration ./deploy/test/... -run EST`.
|
||||
|
||||
## Appendix B: RFC 7030 wire-format quirks

certctl's EST handler ships with quirk-tolerance for documented EST
client populations. The fixtures + unit tests live at
`internal/api/handler/cisco_ios_quirks_test.go` +
`internal/api/handler/testdata/cisco_ios_*.txt`.

| Vendor / version | Quirk | certctl behavior |
|---|---|---|
| Cisco IOS 15.x | Some images send the CSR as `application/x-pem-file` (not the spec'd `application/pkcs10`) | The handler dispatches on the body prefix (`-----BEGIN`) rather than the Content-Type header; the body is accepted as PEM-encoded PKCS#10. |
| Cisco IOS 16.x | Trailing newlines on the base64 body (variable count) | `strings.TrimSpace` pass before base64 decode; bodies tolerated regardless of trailing whitespace. |
| Apple MDM (some firmware) | CRLF line wrapping inside the base64 body | `base64.StdEncoding` handles both LF and CRLF. |
| OpenWRT (older builds) | TLS 1.2 only | Use the [TLS 1.2 reverse-proxy runbook](#tls-12-reverse-proxy-runbook); disable channel binding for affected profiles. |
| libest <v3.0 | No RFC 9266 `--tls-exporter` flag | Set `_CHANNEL_BINDING_REQUIRED=false` for affected profiles; the server still validates everything else. |

If you find a new wire-format quirk in a real device, file an issue
with a base64 dump of the failing request; we'll add a fixture +
the matching tolerance pass.
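The first three tolerance passes in the table (dispatch on body prefix, whitespace trimming, CRLF-wrapped base64) can be mimicked in a few lines. The following is a Python sketch of the idea only; certctl's handler is Go, and `decode_csr_body` is a hypothetical name.

```python
import base64

PEM_PREFIX = b"-----BEGIN"


def decode_csr_body(body: bytes) -> bytes:
    """Return DER bytes from a PEM or bare-base64 request body.

    Dispatches on the body prefix rather than on Content-Type, and drops
    all ASCII whitespace (trailing newlines, LF or CRLF wrapping) first.
    """
    trimmed = body.strip()
    if trimmed.startswith(PEM_PREFIX):
        # Drop the BEGIN/END armor lines, keep the base64 payload lines.
        lines = [line.strip() for line in trimmed.splitlines()]
        payload = b"".join(
            line for line in lines if line and not line.startswith(b"-----")
        )
    else:
        payload = b"".join(trimmed.split())
    return base64.b64decode(payload, validate=True)


if __name__ == "__main__":
    der = b"\x30\x82\x01\x00placeholder"   # stand-in bytes, not a real CSR
    b64 = base64.b64encode(der)
    wrapped = b64[:8] + b"\r\n" + b64[8:] + b"\n\n"
    print(decode_csr_body(wrapped) == der)
```

Using `validate=True` after stripping whitespace keeps the tolerance narrow: line endings are forgiven, but stray non-base64 bytes still fail loudly.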
|
||||
|
||||
## Related docs

- [`legacy-est-scep.md`](legacy-est-scep.md): TLS 1.2 reverse-proxy
  runbook + the SCEP RFC 8894 native implementation parallels.
- [`scep-intune.md`](scep-intune.md): the SCEP/Intune master bundle
  that established the multi-profile dispatch + admin GUI + golden
  fixture patterns this EST bundle mirrors.
- [`crl-ocsp.md`](crl-ocsp.md): the per-issuer CRL distribution
  endpoint and OCSP responder that EST-issued certs are revoked
  through.
- [`features.md`](features.md): every `CERTCTL_*` env var,
  including the per-profile `CERTCTL_EST_PROFILE_<NAME>_*` family
  documented here.
- [`architecture.md`](architecture.md): overall control-plane
  architecture; EST Server section + Security Model trust-anchor
  rotation discussion.
- [`tls.md`](tls.md): TLS bootstrap for the certctl control plane;
  prerequisite for any production EST deploy.
- [`connectors.md`](connectors.md): issuer connectors that EST
  delegates to.
|
||||
@@ -0,0 +1,385 @@
|
||||
# Microsoft Intune SCEP enrollment via certctl

> **Status (this document):** Phase 11 of the SCEP RFC 8894 + Intune master
> bundle. The behavior described here is shipped on `master` and exercised
> end-to-end by `internal/api/handler/scep_intune_e2e_test.go`. The
> bundle is V2-free (community edition); Conditional-Access compliance
> gating, native Microsoft Graph integration, and per-tenant trust
> anchors are documented under [Limitations](#limitations) as V3-Pro
> features.

## TL;DR

certctl is a **drop-in NDES replacement** for Microsoft Intune SCEP fleets.
Intune-managed devices keep using the existing Intune Certificate Connector;
only the SCEP server URL changes. certctl validates the Connector's
signed challenge using its installation signing cert (no Microsoft API
calls; the Connector already did that), binds the device claim to the
inbound CSR, and issues through whichever certctl issuer connector you
have configured (local CA, Vault, EJBCA, ADCS, etc.).

What you get over NDES:

- Per-profile SCEP endpoints (`/scep/corp` vs. `/scep/iot` etc.) so a
  single certctl deploy serves multiple device fleets with distinct
  challenge passwords + trust anchors.
- Audit log entries with the device GUID, claim subject, and CSR
  binding details, which gives much better forensics than NDES + IIS logs.
- Trust anchor reload via `SIGHUP` (no service restart) when the
  Connector signing cert rotates.
- A built-in admin GUI tab (Intune Monitoring) showing per-profile
  enrollment counters, trust-anchor expiry countdowns, and the recent
  failures table.
- Per-device rate limit (sliding window log keyed by Subject + Issuer)
  that catches a compromised Connector signing key issuing many
  different valid challenges for the same device.
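For readers unfamiliar with the term, a sliding window log keeps one timestamp per accepted request and counts only those inside the trailing window. The Python sketch below illustrates the technique with the `(Subject, Issuer)` key from the last bullet; the class name, method names, and the choice not to record rejected attempts are assumptions, not certctl's Go implementation.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class SlidingWindowLimiter:
    """Sliding-window-log limiter keyed by (subject, issuer)."""

    def __init__(self, limit: int, window_seconds: float = 24 * 3600) -> None:
        self.limit = limit
        self.window = window_seconds
        self._log: Dict[Tuple[str, str], Deque[float]] = defaultdict(deque)

    def allow(self, subject: str, issuer: str, now: float) -> bool:
        log = self._log[(subject, issuer)]
        # Evict timestamps that have aged out of the trailing window.
        while log and now - log[0] >= self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False          # assumption: rejected attempts are not logged
        log.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(limit=3)
    results = [limiter.allow("device-guid", "connector", t) for t in (0, 10, 20, 30)]
    print(results)
```

Because the key is the device identity rather than the challenge itself, many different valid challenges for one device all land in the same log, which is what lets the limiter catch the compromised-signing-key case.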
|
||||
|
||||
## Architecture

```mermaid
flowchart LR
    Cloud["Intune cloud<br/>(Microsoft)"]
    Connector["Intune Certificate Connector<br/>(customer infra)"]
    Server["certctl SCEP server<br/>(you)"]
    Issuer["issuer connector<br/>(local CA / Vault / EJBCA / …)"]
    Cloud --> Connector --> Server --> Issuer
```

**certctl replaces NDES, not the Connector.** The Intune Certificate
Connector is the bridge between the Intune cloud and your on-prem PKI;
Microsoft installs and maintains it. What you replace is the
**Network Device Enrollment Service** (NDES), the SCEP server
historically deployed on a Windows host, sitting between the Connector
and an Active Directory Certificate Services CA. certctl sits in
exactly that slot and speaks SCEP RFC 8894 to the Connector.
|
||||
|
||||
### What certctl validates per request

For every Intune-flavored SCEP request the dispatcher in
`internal/service/scep.go::dispatchIntuneChallenge` walks the
following gates in order. A failure on any gate produces a CertRep
PKIMessage with the documented `pkiStatus`/`failInfo` codes (per RFC
8894 §3.2.1.4.5) and increments the corresponding metric counter.

1. **Shape pre-check**: `looksIntuneShaped(challengePassword)` tests
   for length > 200 plus exactly two dots. False positives are fine;
   false negatives on real Intune challenges would route them to the
   static compare and reject. The pre-check just decides whether to
   invoke the full validator.
2. **JWS signature**: `intune.ValidateChallenge` re-derives the
   signing input from the raw on-wire bytes (per RFC 7515 §3.1, NOT
   re-base64-encoded segments) and verifies against every cert in the
   trust anchor pool. Supports RS256 and ES256 (both fixed-width
   r||s and ASN.1-DER form). Explicitly rejects `alg=none` and
   HMAC algs.
3. **Version dispatch**: extracts the `version` claim from the
   payload prelude. v1 (current Connector format, no `version` key)
   routes to `unmarshalChallengeV1`. Future v2 plugs in a sibling
   parser without touching the validator.
4. **Time bounds**: `now+tolerance ≥ iat AND now-tolerance < exp`.
   The `±tolerance` window is configurable per profile via
   `INTUNE_CLOCK_SKEW_TOLERANCE` (default 60s, covers modest clock
   drift between the Connector host and certctl). On top of that,
   `INTUNE_CHALLENGE_VALIDITY` sets a configurable cap
   (defense-in-depth against a Connector that mints long-validity
   challenges). The validator refuses `tolerance ≥ ChallengeValidity`
   at startup-validation time to keep the cap meaningful.
5. **Audience pin**: `claim.aud == INTUNE_AUDIENCE` (skipped when
   `INTUNE_AUDIENCE` is empty for proxy/load-balancer scenarios).
6. **CSR binding**: `claim.DeviceMatchesCSR(csr)` checks
   set-equality between the claim's `device_name` / `san_dns` /
   `san_rfc822` / `san_upn` and the CSR's CN + SANs. Set-equality
   means the CSR carries EXACTLY the claim's values, no extras and
   no missing.
7. **Replay**: `intune.ReplayCache.CheckAndInsert` rejects
   duplicates within the configured TTL. Sized for 100k entries
   (covers a ~25 RPS Intune fleet's steady-state: at the default 60m
   TTL, 25 RPS × 3600 s is roughly 90,000 live entries).
8. **Per-device rate limit**: sliding window log keyed by
   `(claim.Subject, claim.Issuer)`. Catches a compromised Connector
   issuing many DIFFERENT valid challenges for the same device. Default
   3 enrollments per 24h covers legitimate first-cert + recovery +
   post-wipe.
9. **Optional compliance check**: V3-Pro plug-in seam (nil-default
   no-op). When set, the gate calls Microsoft Graph's compliance API
   and short-circuits non-compliant devices with FAILURE+BadRequest.
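Gates 1 and 4 are simple enough to restate as code. The sketch below is a Python paraphrase of the conditions given in the list, not the Go implementation: the function names are hypothetical, times are plain epoch seconds, and the result strings reuse the `not_yet_valid` / `expired` counter labels from the troubleshooting table.

```python
def looks_intune_shaped(challenge_password: str) -> bool:
    """Gate 1: longer than 200 characters and exactly two dots."""
    return len(challenge_password) > 200 and challenge_password.count(".") == 2


def validate_tolerance(tolerance_s: float, challenge_validity_s: float) -> None:
    """Startup check: a tolerance >= validity would make the cap meaningless."""
    if tolerance_s >= challenge_validity_s:
        raise ValueError("clock-skew tolerance must be below challenge validity")


def check_time_bounds(now: float, iat: float, exp: float, tolerance: float) -> str:
    """Gate 4: accept iff now+tolerance >= iat AND now-tolerance < exp."""
    if now + tolerance < iat:
        return "not_yet_valid"
    if now - tolerance >= exp:
        return "expired"
    return "ok"


if __name__ == "__main__":
    validate_tolerance(60, 3600)
    print(looks_intune_shaped("h" * 120 + "." + "p" * 120 + "." + "s" * 40))
    print(check_time_bounds(now=1000, iat=1030, exp=4600, tolerance=60))
```

How the `INTUNE_CHALLENGE_VALIDITY` cap itself is applied to a claim is not spelled out in the list, so the sketch deliberately stops at the startup check.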
|
||||
|
||||
A request that passes all nine gates flows to
`processEnrollment`, which builds the issuance request, calls the
configured issuer connector, and emits a CertRep PKIMessage with the
issued cert encrypted to the device's transient signing cert per RFC
8894 §3.3.2.
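Of the nine gates, the CSR-binding check (gate 6) is the one most often misread, so here is a small Python sketch of set-equality per dimension. The `Names` type and `binding_mismatch` helper are hypothetical, and exact, case-sensitive string comparison is an assumption; the real check is `claim.DeviceMatchesCSR(csr)` in the Go code.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Names:
    """Identity values carried by either the claim or the CSR."""
    cn: str
    dns: FrozenSet[str] = frozenset()
    rfc822: FrozenSet[str] = frozenset()
    upn: FrozenSet[str] = frozenset()


def binding_mismatch(claim: Names, csr: Names) -> Optional[str]:
    """Return None when the CSR carries exactly the claim's values,
    otherwise the name of the first dimension that differs."""
    if claim.cn != csr.cn:
        return "device_name"
    for dimension in ("dns", "rfc822", "upn"):
        # Set equality: no extras in the CSR and nothing missing.
        if getattr(claim, dimension) != getattr(csr, dimension):
            return f"san_{dimension}"
    return None


if __name__ == "__main__":
    claim = Names(cn="laptop-42", dns=frozenset({"laptop-42.corp.example"}))
    extra = Names(cn="laptop-42",
                  dns=frozenset({"laptop-42.corp.example", "admin.corp.example"}))
    print(binding_mismatch(claim, claim), binding_mismatch(claim, extra))
```

The second call shows why subset checks are not enough: a CSR that adds one extra SAN on top of the claimed values is rejected.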
|
||||
|
||||
## Migration from NDES + EJBCA (or NDES + ADCS)

The migration plan below is conservative: install certctl alongside
your existing NDES so you can flip Intune profiles fleet-by-fleet
without a flag day. Validated against a fresh `docker compose up`
stack; the `docker-compose.test.yml` stack does not currently bake
Intune in (Phase 10.2 ships a hermetic in-process e2e test instead),
so the production validation step is a manual run-book item.

1. **Install certctl alongside existing NDES.** Stand up the certctl
   server on a separate host (or as a Kubernetes deployment) reachable
   from the Connector host. Use the existing operator-run-book in
   `docs/tls.md` for the TLS bootstrap.
2. **Configure a per-profile SCEP endpoint.** Pick a path id (e.g.
   `corp`, referenced as `<NAME>` below; the value gets uppercased
   for the env-var key and lowercased for the URL path) and set:

   ```
   CERTCTL_SCEP_ENABLED=true
   CERTCTL_SCEP_PROFILES=corp
   CERTCTL_SCEP_PROFILE_<NAME>_ISSUER_ID=iss-local # or your existing issuer
   CERTCTL_SCEP_PROFILE_<NAME>_CHALLENGE_PASSWORD=<random> # Intune still requires this
   CERTCTL_SCEP_PROFILE_<NAME>_RA_CERT_PATH=/etc/certctl/ra-corp.pem
   CERTCTL_SCEP_PROFILE_<NAME>_RA_KEY_PATH=/etc/certctl/ra-corp.key
   ```

   The endpoint will be served at `https://certctl.example.com/scep/corp`;
   the URL path uses the lowercased name and the env-var keys use
   the uppercased form. Concrete env-var name mappings are listed in
   [`features.md`](features.md).
3. **Extract the Intune Connector's signing cert.** On the Connector
   host (Windows), the Connector's installation creates a self-signed
   cert in the local machine's `Personal` cert store with subject
   `CN=Microsoft Intune Certificate Connector` (path documented by
   Microsoft; see the Microsoft Learn link in the
   [Microsoft support statement](#microsoft-support-statement) below).
   Export the public cert (no private key) as a base64 `.cer` file.
4. **Configure the trust anchor.** Copy the `.cer` to the certctl host
   (or mount via your secret manager) and set:

   ```
   CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_ENABLED=true
   CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CONNECTOR_CERT_PATH=/etc/certctl/intune-corp.pem
   CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_AUDIENCE=https://certctl.example.com/scep/corp
   CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CHALLENGE_VALIDITY=60m
   CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CLOCK_SKEW_TOLERANCE=60s # ±tolerance on iat/exp; raise on poorly-NTP-synced fleets, lower to enforce strict time
   CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_PER_DEVICE_RATE_LIMIT_24H=3
   ```

   Restart certctl. The startup preflight refuses to boot if the
   trust anchor file is missing, unparseable, or contains an expired
   cert; failure is loud at boot rather than silent at request time.
5. **Configure the issuer connector.** If you're keeping EJBCA,
   point `CERTCTL_SCEP_PROFILE_<NAME>_ISSUER_ID` at your EJBCA issuer
   profile (see `docs/connectors.md`). For a clean cut-over to the
   built-in local CA, follow `docs/tls.md` to bootstrap a sub-CA cert.
6. **Migrate one Intune SCEP profile to certctl.** In the Intune
   admin center, edit the SCEP profile for a small canary device
   group and update the SCEP server URL to
   `https://certctl.example.com/scep/corp`. Push the profile and
   wait for the canary devices to rotate (24-48h).
7. **Verify enrollment.** Open the certctl admin GUI's
   [SCEP Intune Monitoring tab](#operational-monitoring) and watch
   the `success` counter tick on the `corp` profile card. The
   `recent failures` table surfaces any rejected enrollments with
   the exact reason (e.g. `signature_invalid`, `claim_mismatch`).
8. **Roll out the rest of the fleet.** Once the canary is clean,
   migrate the remaining Intune SCEP profiles in batches.
9. **Decommission NDES.** After all fleets are migrated and a few
   renewal cycles have completed cleanly, take down the NDES role
   and the IIS site. The existing certs continue to chain to your
   issuer; only the enrollment path changes.
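The naming rule from step 2 (uppercase the path id for env-var keys, lowercase it for the URL path) is easy to get wrong when scripting a rollout, so here is a small Python helper sketch. Both function names are hypothetical, and the sketch only covers simple alphanumeric path ids like `corp`; how certctl treats other characters is not covered here.

```python
def scep_env_key(path_id: str, setting: str) -> str:
    """Build the per-profile env-var key, e.g. (corp, ISSUER_ID)."""
    return f"CERTCTL_SCEP_PROFILE_{path_id.upper()}_{setting.upper()}"


def scep_endpoint(base_url: str, path_id: str) -> str:
    """Build the SCEP server URL to paste into the Intune profile."""
    return f"{base_url.rstrip('/')}/scep/{path_id.lower()}"


if __name__ == "__main__":
    for setting in ("ISSUER_ID", "INTUNE_ENABLED", "INTUNE_AUDIENCE"):
        print(scep_env_key("corp", setting))
    print(scep_endpoint("https://certctl.example.com", "corp"))
```

Generating both strings from the same `path_id` keeps the `INTUNE_AUDIENCE` value and the URL configured in Intune from drifting apart, which is the mismatch behind the `wrong_audience` failures.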
|
||||
|
||||
## Intune SCEP profile fields → certctl behavior

The Intune admin center's SCEP profile editor exposes a fixed set of
fields. The mapping below is what each field controls relative to
certctl's behavior.

| Intune profile field | certctl behavior |
|---|---|
| Certificate type | Treated as device or user; surfaces in the claim's `subject` field (device GUID vs. user UPN). certctl doesn't gate on type; the issuer's certificate profile decides. |
| Subject name format | Drives the CSR's CN. The Intune Connector sets `claim.device_name` from this value; certctl's CSR-binding gate enforces equality. |
| Subject alternative name | Drives the CSR's SAN list. Intune supports DNS / RFC 822 / UPN; certctl's claim binding checks set-equality per dimension. Mismatches surface as `ErrClaimSANDNSMismatch` / `_SANRFC822Mismatch` / `_SANUPNMismatch`. |
| Certificate validity period | Honored by the issuer connector. certctl caps via the per-profile `CertificateProfile.MaxTTLSeconds`; the smaller of the two wins. |
| Key storage provider | Device-side concern (the Connector negotiates with the device's TPM / Software KSP). certctl never sees the device's private key; it only signs the CSR. |
| Key usage / Extended key usage | Honored by the issuer connector via the bound `CertificateProfile.AllowedEKUs`. CSRs requesting an EKU outside the allowed set are rejected by the crypto-policy gate (`ValidateCSRAgainstProfile`). |
| Hash algorithm | The CSR's signature hash (SHA-256 typical). The SCEP `GetCACaps` advertises SHA-256 + SHA-512; the device picks. |
| SCEP server URL | The endpoint URL the Connector posts to. Set to `https://certctl.example.com/scep/<profile-name>`. |
|
||||
|
||||
## Trust anchor extraction

The Intune Certificate Connector self-signs an installation cert at
install time. To configure certctl, extract this cert (PUBLIC ONLY,
no private key) as PEM:

1. On the Connector host (Windows), open `certlm.msc` (Local Machine
   Certificate Manager).
2. Navigate to `Personal` → `Certificates`. Find the cert with
   subject `CN=Microsoft Intune Certificate Connector`.
3. Right-click → All Tasks → Export. Choose **No, do not export
   the private key**. Format: **Base-64 encoded X.509 (.CER)**.
4. Copy the resulting `.cer` file to the certctl host. Rename to
   `.pem` (the bytes are identical; certctl's PEM loader accepts
   either extension).
5. Set `CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CONNECTOR_CERT_PATH` to
   the file path.
6. If you have multiple Connectors in HA, repeat steps 1-3 on each
   and concatenate the PEM blocks into one bundle file.

When the operator rotates the Connector signing cert (typically once
every few years per Microsoft's Connector lifecycle), repeat the
extraction, overwrite the on-disk file, then send `SIGHUP` to the
certctl process. The trust holder swaps atomically; bad files (parse
error, expired cert) keep the OLD pool in place so a half-rotation
doesn't take Intune enrollment down.
|
||||
|
||||
## Troubleshooting

The dispatcher emits a typed metric label per failure mode plus a
matching audit-log entry. The table below maps the label to the most
common root cause and the operator action.

| Counter label        | Symptom                                                          | Root cause + fix                                                                                                                                                                                                  |
|----------------------|------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `signature_invalid`  | Every enrollment from a specific profile failing                 | Trust anchor mismatch — the Connector's signing cert was rotated and certctl wasn't reloaded. Re-extract the cert ([trust anchor extraction](#trust-anchor-extraction)), overwrite the file, send `SIGHUP`.       |
| `claim_mismatch`     | Some enrollments from one Intune SCEP profile failing            | The Intune SCEP profile's SAN config doesn't match what the device CSR actually has. Compare the `recent failures` table's claim row to the device's CSR; usually a SAN format mismatch (e.g. claim wants UPN, CSR has DNS). |
| `expired`            | All enrollments failing on a date boundary                       | Either clock skew between the Connector host and certctl (NTP both ends) OR the Connector's signing cert is past `NotAfter`. The certctl preflight catches an expired trust anchor at boot; check the Monitoring tab's expiry countdown. |
| `not_yet_valid`      | All enrollments failing                                          | Reverse clock skew (certctl's clock is BEHIND the Connector's). Sync via NTP.                                                                                                                                     |
| `wrong_audience`     | All enrollments from a profile failing                           | `INTUNE_AUDIENCE` doesn't match the URL the Connector is configured to call. Either fix `INTUNE_AUDIENCE` to match the operator URL, or unset it (defense-in-depth then disabled — the claim's exp + sig still gate). |
| `replay`             | Sporadic per-device failures, mostly during retries              | The device retried the SAME challenge after the first one failed. The replay cache TTL is `INTUNE_CHALLENGE_VALIDITY` (default 60m). Either widen the device's retry window (Intune-side) or shorten validity.    |
| `rate_limited`       | A specific device hitting `429`-equivalent failures              | The device exceeded `INTUNE_PER_DEVICE_RATE_LIMIT_24H` (default 3). If legitimate (post-wipe + recovery + first-cert all in 24h), bump the cap. If suspicious, this is the limiter doing its job — investigate the device. |
| `unknown_version`    | Sudden onset of failures across the entire fleet                 | Microsoft shipped a new Connector version with a `version` claim certctl doesn't understand. Open an issue on the certctl repo with the failing claim payload (anonymized); the parser dispatcher accepts new versions in ~30 LoC. |
| `malformed`          | Sporadic, low-volume                                             | Malformed challenge bytes — almost always a network proxy mangling the request body, or the Connector logging itself out mid-handshake. Capture a packet trace; the Connector should re-emit on the next device retry. |
| `compliance_failed`  | V3-Pro only                                                      | The pluggable compliance check returned non-compliant. The audit-log details carry the reason string from Microsoft Graph. V2 deployments never see this counter tick.                                            |

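
The `replay` and `rate_limited` rows interact with two time windows: the challenge validity (default 60m) and the per-device 24-hour cap (default 3). The Go sketch below shows one plausible way such checks behave, so the symptoms in the table are easier to reason about. All type and function names (`ReplayCache`, `DeviceLimiter`, `demoStart`) are hypothetical; certctl's real data structures may differ.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// demoStart is an arbitrary fixed instant used by the demo in main.
var demoStart = time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)

// ReplayCache remembers each challenge until its validity window ends.
type ReplayCache struct {
	mu   sync.Mutex
	ttl  time.Duration
	seen map[string]time.Time // challenge ID -> time the entry expires
}

func NewReplayCache(ttl time.Duration) *ReplayCache {
	return &ReplayCache{ttl: ttl, seen: make(map[string]time.Time)}
}

// Admit returns false when the same challenge was already seen within ttl.
func (c *ReplayCache) Admit(challengeID string, now time.Time) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	for id, exp := range c.seen { // lazily evict expired entries
		if !now.Before(exp) {
			delete(c.seen, id)
		}
	}
	if _, dup := c.seen[challengeID]; dup {
		return false
	}
	c.seen[challengeID] = now.Add(c.ttl)
	return true
}

// DeviceLimiter allows at most limit enrollments per device per window.
type DeviceLimiter struct {
	mu     sync.Mutex
	limit  int
	window time.Duration
	hits   map[string][]time.Time
}

func NewDeviceLimiter(limit int, window time.Duration) *DeviceLimiter {
	return &DeviceLimiter{limit: limit, window: window, hits: make(map[string][]time.Time)}
}

// Allow records the attempt and returns true while the device is under the cap.
func (l *DeviceLimiter) Allow(deviceID string, now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	cutoff := now.Add(-l.window)
	kept := l.hits[deviceID][:0]
	for _, t := range l.hits[deviceID] {
		if t.After(cutoff) { // keep only attempts inside the sliding window
			kept = append(kept, t)
		}
	}
	if len(kept) >= l.limit {
		l.hits[deviceID] = kept
		return false
	}
	l.hits[deviceID] = append(kept, now)
	return true
}

func main() {
	cache := NewReplayCache(60 * time.Minute)
	fmt.Println("first use:", cache.Admit("challenge-1", demoStart))
	fmt.Println("retry after 10m:", cache.Admit("challenge-1", demoStart.Add(10*time.Minute)))

	limiter := NewDeviceLimiter(3, 24*time.Hour)
	for i := 1; i <= 4; i++ {
		fmt.Printf("attempt %d allowed: %v\n", i, limiter.Allow("device-a", demoStart))
	}
}
```

In this sketch a retry with the same challenge inside the validity window is rejected, which matches the `replay` symptom of sporadic failures during device retries.
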
## Operational monitoring (SCEP Administration → Intune Monitoring tab)

The admin GUI surface for SCEP lives at `/scep` and is structured as
three tabs: **Profiles** (default landing — every configured SCEP
profile, lean cards with always-present fields), **Intune Monitoring**
(the Intune-specific deep-dive described below), and **Recent Activity**
(full SCEP audit log filter). Operators monitoring an Intune deployment
spend most of their time on the Intune Monitoring tab, deep-linkable via
`/scep?tab=intune` or the legacy alias `/scep/intune`. The Profiles tab
gives the at-a-glance per-profile health (RA cert expiry, mTLS status,
Intune enabled/disabled badge, challenge-password-set indicator) and a
"View Intune details →" link from each Intune-enabled card that switches
into this tab filtered to that profile.

The Intune Monitoring tab shows:

- **Per-profile cards** — one card per SCEP profile, with the trust
  anchor expiry countdown badge:
  - `green` ≥ 30 days remaining
  - `amber` 7-30 days remaining (rotate soon)
  - `red` < 7 days remaining
  - `EXPIRED` past `NotAfter`
- **Live counters** — the per-status enrollment counts polled every
  30s. The order in the grid puts `success` first (vanity) and
  failure modes after.
- **Recent failures table** — the last 50 audit-log events with
  action `scep_pkcsreq_intune` or `scep_renewalreq_intune`, sorted
  by timestamp descending. Polled every 60s.
- **Trust anchor reload button** — confirms via modal then issues
  `POST /api/v1/admin/scep/intune/reload-trust` (the SIGHUP-equivalent).
  Bad reloads keep the OLD pool in place; the modal stays open with
  the underlying error so the operator can correct the file and retry.

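
The expiry countdown badge thresholds listed above can be expressed as a small Go function. This is a sketch under stated assumptions: the function names are hypothetical, a value of exactly 30 days is treated as `green` (the list gives both "≥ 30" and "7-30"), and partial days are rounded down.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// daysUntil returns whole days from now until notAfter, rounded down.
// The result is negative once now is past notAfter.
func daysUntil(notAfter, now time.Time) int {
	return int(math.Floor(notAfter.Sub(now).Hours() / 24))
}

// badgeFor maps days remaining to the badge shown on a profile card.
func badgeFor(days int) string {
	switch {
	case days < 0:
		return "EXPIRED"
	case days >= 30:
		return "green"
	case days >= 7:
		return "amber"
	default:
		return "red"
	}
}

func main() {
	now := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	for _, offset := range []int{45, 30, 10, 3, -1} {
		notAfter := now.AddDate(0, 0, offset)
		fmt.Printf("%+d days -> %s\n", offset, badgeFor(daysUntil(notAfter, now)))
	}
}
```
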
Three admin endpoints back the page:

- `GET /api/v1/admin/scep/profiles` — per-profile snapshot for the
  Profiles tab; surfaces RA cert subject + NotAfter + days-to-expiry,
  mTLS sibling-route status + bundle path, challenge-password-set flag,
  and an optional `intune` sub-block for Intune-enabled profiles.
- `GET /api/v1/admin/scep/intune/stats` — Intune-specific deep-dive
  for the Intune Monitoring tab; per-status counters + trust anchor
  pool details. Backward-compat shape preserved from Phase 9.
- `POST /api/v1/admin/scep/intune/reload-trust` — SIGHUP-equivalent
  trust anchor reload, body `{"path_id": "<pathID>"}`.

All three are M-008 admin-gated. Non-admin Bearer callers get HTTP 403
+ a clear message; the GUI hides the page entirely for non-admin users
(UX hint; server-side enforcement is independent).

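
For operators who script the reload instead of using the GUI button, the request to the reload endpoint might be built as follows. This is a hedged sketch: the endpoint path and the `path_id` body field come from the list above, while the base URL, the token value, the `corp` path ID, and the helper name `newReloadTrustRequest` are placeholders invented for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newReloadTrustRequest builds the POST that asks certctl to reload the
// Intune trust anchor for one SCEP profile path. It does not send it.
func newReloadTrustRequest(baseURL, token, pathID string) (*http.Request, error) {
	body, err := json.Marshal(map[string]string{"path_id": pathID})
	if err != nil {
		return nil, fmt.Errorf("encode body: %w", err)
	}
	req, err := http.NewRequest(http.MethodPost,
		baseURL+"/api/v1/admin/scep/intune/reload-trust", bytes.NewReader(body))
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	// The caller must hold an admin-scoped Bearer token; non-admin callers
	// receive HTTP 403 according to the text above.
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Placeholder host, token, and path ID; substitute real values.
	req, err := newReloadTrustRequest("https://certctl.example.internal", "REDACTED", "corp")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	// To send it: resp, err := http.DefaultClient.Do(req)
}
```

A 403 response here indicates the token lacks admin rights rather than a problem with the trust anchor file.
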
### Recommended alert thresholds

The counters are exposed in the GUI as snapshots; if you wrap them
in a Prometheus exporter (V3-Pro plug-in seam — V2 doesn't ship a
`/metrics` surface today), reasonable starting thresholds:

- `signature_invalid` rate > 0 for > 5 minutes → page on-call. The
  trust anchor is stale; the operator missed a SIGHUP after a
  Connector rotation.
- `claim_mismatch` rate > 0 sustained > 1 hour → notify (not page).
  An Intune SCEP profile is misconfigured; an admin needs to fix
  the SAN definition or the operator's CertificateProfile.
- `replay` rate climbing → notify. Either an aggressive retry policy
  on the device side OR active replay attempts. Cross-reference
  source IPs in the audit log.
- `rate_limited` for a single device > 1 per hour → notify. Either
  legitimate enrollment storm (post-wipe scenarios) or a compromised
  Connector signing key.
- Trust anchor `days_to_expiry` < 30 on any profile → notify; rotate
  the Connector's signing cert before the cliff.

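
The thresholds above can be encoded as plain functions, which is useful when prototyping an exporter or a polling script before committing to alert rules. The sketch below is illustrative only: the function and constant names are hypothetical, it covers the thresholds that have concrete numbers, and it leaves out the "`replay` rate climbing" rule because that needs a trend over time.

```go
package main

import (
	"fmt"
	"time"
)

// Severity is the action an alert should trigger.
type Severity string

const (
	None   Severity = "none"
	Notify Severity = "notify"
	Page   Severity = "page"
)

// Starting thresholds taken from the list above.
const (
	pageAfter      = 5 * time.Minute // signature_invalid sustained
	notifyAfter    = time.Hour       // claim_mismatch sustained
	expiryWarnDays = 30              // trust anchor days_to_expiry
)

// classifySustained takes a failure label and how long its rate has stayed
// above zero, and returns the suggested severity.
func classifySustained(label string, nonZeroFor time.Duration) Severity {
	switch label {
	case "signature_invalid":
		if nonZeroFor > pageAfter {
			return Page
		}
	case "claim_mismatch":
		if nonZeroFor > notifyAfter {
			return Notify
		}
	}
	return None
}

// classifyRateLimited flags a single device with more than one
// rate_limited failure in the last hour.
func classifyRateLimited(hitsLastHour int) Severity {
	if hitsLastHour > 1 {
		return Notify
	}
	return None
}

// classifyExpiry flags a trust anchor that is inside the 30-day window.
func classifyExpiry(daysToExpiry int) Severity {
	if daysToExpiry < expiryWarnDays {
		return Notify
	}
	return None
}

func main() {
	fmt.Println(classifySustained("signature_invalid", 6*time.Minute))
	fmt.Println(classifySustained("claim_mismatch", 30*time.Minute))
	fmt.Println(classifyRateLimited(2))
	fmt.Println(classifyExpiry(12))
}
```
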
## Limitations

This bundle is V2-free. The following capabilities are deferred to
V3-Pro:

- **Native Microsoft Graph integration.** certctl validates the
  Connector's signed challenge but doesn't call Microsoft's API
  directly — the Connector already did that. V3-Pro could ship a
  Graph client that pulls device-compliance state in addition to
  the challenge claim.
- **Conditional Access compliance gating.** The dispatcher exposes a
  nil-default `ComplianceCheck` hook. V3-Pro plugs in a Microsoft
  Graph compliance lookup before issuance; non-compliant devices
  fail with a typed `compliance_failed` failInfo.
- **Per-tenant trust anchors.** V2 has one trust anchor pool per
  SCEP profile; V3-Pro could support per-AAD-tenant anchor scoping
  for MSPs running shared certctl deployments across customers.
- **OCSP stapling at SCEP-response time.** The CertRep doesn't carry
  a stapled OCSP response today; certificate validators look up OCSP
  via the `id-pkix-ocsp` access method in the issued cert's Authority
  Information Access extension. V3-Pro could staple inline.
- **Auto-discovery of the Connector signing cert.** V2 requires the
  operator to extract the cert manually and configure the path.
  V3-Pro could pull from a Microsoft-published endpoint (with the
  appropriate trust constraints).

These deferrals are deliberate, not oversights. The V2 surface
covers every operationally-required path for a single-tenant
enterprise replacing NDES; V3-Pro adds the multi-tenant + native-API
features procurement teams sometimes ask for.

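
The nil-default `ComplianceCheck` hook mentioned above follows a common Go pattern: an optional function field that is skipped when unset. The sketch below shows the pattern only. The signature of `ComplianceCheck`, the `Dispatcher` struct, `Gate`, and `ErrComplianceFailed` are all assumptions made for illustration and are not taken from certctl's source.

```go
package main

import (
	"errors"
	"fmt"
)

// ComplianceCheck is a hypothetical hook signature: given a device ID it
// reports whether the device is compliant and, if not, why.
type ComplianceCheck func(deviceID string) (compliant bool, reason string, err error)

// ErrComplianceFailed stands in for the typed compliance_failed failure.
var ErrComplianceFailed = errors.New("compliance_failed")

// Dispatcher holds the optional hook. The zero value has no hook set.
type Dispatcher struct {
	Compliance ComplianceCheck
}

// Gate runs the hook if one is configured. With a nil hook every device
// passes, so the compliance failure can never be produced.
func (d *Dispatcher) Gate(deviceID string) error {
	if d.Compliance == nil {
		return nil
	}
	ok, reason, err := d.Compliance(deviceID)
	if err != nil {
		return fmt.Errorf("compliance lookup: %w", err)
	}
	if !ok {
		return fmt.Errorf("%w: %s", ErrComplianceFailed, reason)
	}
	return nil
}

// IsComplianceFailed reports whether err wraps ErrComplianceFailed.
func IsComplianceFailed(err error) bool {
	return errors.Is(err, ErrComplianceFailed)
}

func main() {
	var d Dispatcher
	fmt.Println("no hook:", d.Gate("device-1"))

	d.Compliance = func(deviceID string) (bool, string, error) {
		return false, "device is not marked compliant", nil
	}
	fmt.Println("with hook:", d.Gate("device-1"))
}
```

Keeping the hook nil by default means a deployment without the plug-in pays no cost and never emits the compliance failure, which is consistent with the statement that V2 deployments never see that counter tick.
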
## Microsoft support statement

Microsoft documents the Intune Certificate Connector as
**RFC-8894-compliant** and supports its use against any RFC 8894
SCEP server. The relevant Microsoft Learn pages:

- [Intune Certificate Connector overview](https://learn.microsoft.com/en-us/mem/intune/protect/certificate-connector-overview) —
  documents the Connector's architecture and explicitly notes it
  speaks RFC-8894-compliant SCEP.
- [Use SCEP certificate profiles in Intune](https://learn.microsoft.com/en-us/mem/intune/protect/certificates-scep-configure) —
  the operator-facing setup guide, with the SCEP server URL field
  the migration playbook above edits.
- [Validate setup of Intune Certificate Connector](https://learn.microsoft.com/en-us/mem/intune/protect/certificate-connector-install) —
  the install-validation checklist; useful when troubleshooting
  Connector-side failures vs. certctl-side failures.

certctl's role per Microsoft's framing: a third-party SCEP server
that the Connector posts to. Microsoft supports this topology; only
certctl's own RFC 8894 implementation is in scope for certctl
support. The end-to-end Connector → certctl → issuer flow is
exercised in `internal/api/handler/scep_intune_e2e_test.go` and
the golden-file fixtures in `internal/scep/intune/testdata/`.

## Related docs

- [`legacy-est-scep.md`](legacy-est-scep.md) — the per-profile SCEP
  setup guide + RFC 8894 reference + mTLS sibling route. Read this
  first if you're not already running certctl SCEP for non-Intune
  fleets.
- [`architecture.md`](architecture.md) — overall control-plane
  architecture; Security Model section calls out the Intune trust
  anchor as a sensitive operator-configured surface.
- [`features.md`](features.md) — every `CERTCTL_*` env var,
  including the per-profile `CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_*`
  family.
- [`tls.md`](tls.md) — TLS bootstrap for the certctl control plane;
  prerequisite for any production deploy.

@@ -0,0 +1,91 @@
# Deployment Vendor Compatibility Matrix

> Deploy-hardening II master bundle deliverable. The procurement-team
> headline doc — SOC 2 / PCI auditors paste this into evidence packs.
> Per frozen decision 0.14: a (connector × vendor-version) cell is
> "verified" only when ALL apply: ≥1 happy-path e2e passes against
> the real sidecar; ≥1 specific-quirk test for that version passes;
> operator manual smoke completed at least once on a real (non-CI)
> instance of that vendor version.

## Status legend

- **✓** — verified per the three-criterion bar above
- **CI** — happy-path + quirk e2e green in CI; operator manual smoke
  pending (the third criterion)
- **mock** — verified against the in-tree mock; real-vendor validation
  is the operator's tier above
- **pending** — planned; tests written; sidecar not yet wired
- **operator-playbook** — validated on a real Windows host by following
  the linked operator playbook
- **n/a** — combination not applicable

Per frozen decision 0.1: only LTS + current-stable versions per
vendor. EOL versions explicitly excluded.

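
The three-criterion bar and the legend can be read as a small decision function. The Go sketch below derives a status from the three pieces of evidence. It is an illustration with hypothetical names (`Evidence`, `statusFor`) and models only the `✓`, `CI`, and `pending` states; `mock`, `n/a`, and the `operator-playbook` value used in the matrix depend on how a cell is validated rather than on these three flags.

```go
package main

import "fmt"

// Evidence records which of the three verification criteria a
// (connector, vendor-version) cell has met.
type Evidence struct {
	HappyPathE2E bool // at least one happy-path e2e passes against the real sidecar
	QuirkE2E     bool // at least one version-specific quirk test passes
	ManualSmoke  bool // operator manual smoke done on a real, non-CI instance
}

// statusFor maps evidence to the status shown in the matrix.
func statusFor(e Evidence) string {
	switch {
	case e.HappyPathE2E && e.QuirkE2E && e.ManualSmoke:
		return "✓"
	case e.HappyPathE2E && e.QuirkE2E:
		return "CI"
	default:
		return "pending"
	}
}

func main() {
	fmt.Println(statusFor(Evidence{HappyPathE2E: true, QuirkE2E: true, ManualSmoke: true}))
	fmt.Println(statusFor(Evidence{HappyPathE2E: true, QuirkE2E: true}))
	fmt.Println(statusFor(Evidence{HappyPathE2E: true}))
}
```
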
## Matrix

| Connector | Vendor | Version | Status | Known Issues | Workaround | E2E Test Name(s) |
|---|---|---|---|---|---|---|
| **NGINX** | nginx.org | 1.25 LTS | CI | SSL session cache holds old cert ~5min | `ssl_session_timeout 5m;` (default) — operator-tunable | `TestVendorEdge_NGINX_SSLSessionCacheHoldsOldCert_E2E` |
| NGINX | nginx.org | 1.27 stable | CI | (same) | (same) | (same) |
| **Apache httpd** | httpd.apache.org | 2.4 LTS | CI | mod_ssl multi-vhost ownership | per-vhost cert config; SSLCertificateFile per `<VirtualHost>` | `TestVendorEdge_Apache_MultiVhostCertByVhost_E2E` |
| **HAProxy** | haproxy.org | 2.6 LTS | CI | reload vs restart semantics | use `systemctl reload haproxy` not `restart` | `TestVendorEdge_HAProxy_ReloadPreservesConnectionsViaSocketActivation_E2E` |
| HAProxy | haproxy.org | 2.8 | CI | (same) | (same) | (same) |
| HAProxy | haproxy.org | 3.0 | CI | (same) | (same) | (same) |
| **Traefik** | traefik.io | 2.x | CI | static-config cert paths require restart | use dynamic file-provider config | `TestVendorEdge_Traefik_StaticConfigRequiresRestart_DocumentedAsLimitation_E2E` |
| Traefik | traefik.io | 3.x | CI | (same) | (same) | (same) |
| **Caddy** | caddyserver.com | 2.x | CI | admin API auth lockdown breaks default deploy | set `Caddy.AdminAuthorizationHeader` per-target | `TestVendorEdge_Caddy_AdminAPILockedDownWithAuth_DeployUsesConfiguredAuthHeaders_E2E` |
| **Envoy** | envoyproxy.io | 1.30 | CI | file-mode SDS only in V2; gRPC SDS V3-Pro | use SDS=file (default) | `TestVendorEdge_Envoy_SDSFileMode_DeployRewritesYAML_EnvoyHotReloads_E2E` |
| Envoy | envoyproxy.io | 1.32 | CI | (same) | (same) | (same) |
| **Postfix** | postfix.org | 3.6 | CI | per-listener cert binding | configure cert per-listener block | `TestVendorEdge_Postfix_MultiListenerCertBinding_DeployUpdatesCorrectListener_E2E` |
| Postfix | postfix.org | 3.8 | CI | (same) | (same) | (same) |
| **Dovecot** | dovecot.org | 2.3 | CI | submission/submissions port variants | configure both inet_listener blocks | `TestVendorEdge_Dovecot_SubmissionSubmissionsPortVariants_E2E` |
| **IIS** | microsoft.com | IIS 10 (Server 2019) | operator-playbook | Windows-host-only validation per [operator playbook](connector-iis.md#operator-validation-playbook-windows-host); app-pool recycle opt-in | `AppPoolRecycle: true` per-target if needed | `TestVendorEdge_IIS_AppPoolRecycle_OptInForCertChange_E2E` |
| IIS | microsoft.com | IIS 10 (Server 2022) | operator-playbook | (same) | (same) | (same) |
| **F5 BIG-IP** | f5.com | v15.1 LTS | mock | larger cert chain (>4 links) historical issue | use cert chain ≤4 links OR upgrade to v17 | `TestVendorEdge_F5_LargeCertChainHandling_E2E` |
| F5 BIG-IP | f5.com | v17.0 | mock | (chain limit lifted) | n/a | (same) |
| F5 BIG-IP | f5.com | v17.5 | mock | (same) | n/a | (same) |
| **SSH** | openssh.com | OpenSSH 8.x | CI | sftp subsystem may be disabled | connector falls back to scp | `TestVendorEdge_SSH_SFTPSubsystemAbsent_FallsBackToSCP_E2E` |
| SSH | openssh.com | OpenSSH 9.x | CI | (same) | (same) | (same) |
| **WinCertStore** | microsoft.com | Windows Server 2019 | operator-playbook | Windows-host-only validation per [operator playbook](connector-iis.md#operator-validation-playbook-windows-host); cert store ACL: NS vs IIS_IUSRS | configure store ACL per IIS app-pool identity | `TestVendorEdge_WinCertStore_CertStoreACL_NetworkServiceAccess_E2E` |
| WinCertStore | microsoft.com | Windows Server 2022 | operator-playbook | (same) | (same) | (same) |
| **JavaKeystore** | adoptium.net | JDK 11 LTS | pending | keytool `-importkeystore` semantics | use `KeytoolPath` config to pin to JDK | `TestVendorEdge_JavaKeystore_JDK11_vs_17_vs_21_KeytoolBehavior_E2E` |
| JavaKeystore | adoptium.net | JDK 17 LTS | pending | (same) | (same) | (same) |
| JavaKeystore | adoptium.net | JDK 21 LTS | pending | (same) | (same) | (same) |
| **Kubernetes** | kubernetes.io | 1.28 LTS | CI | kubelet sync ~60s for pod-mounted Secrets | `CERTCTL_K8S_DEPLOY_KUBELET_SYNC_TIMEOUT=60s` (default) | `TestVendorEdge_K8s_KubeletSyncWaitContract_DefaultTimeout60s_E2E` |
| Kubernetes | kubernetes.io | 1.30 | CI | (same) | (same) | (same) |
| Kubernetes | kubernetes.io | 1.31 current | CI | (same) | (same) | (same) |

## Quarterly re-pin cadence

Every sidecar `FROM` in `deploy/docker-compose.test.yml` carries a
SHA-256 digest pin per the H-001 CI guard. The operator re-pins
quarterly:

1. Pull the latest tag of each sidecar image.
2. Run the per-vendor e2e matrix against the new digest.
3. If green, update the digest in `docker-compose.test.yml` + this
   matrix's "Status" column.
4. If red, file an issue against the connector + leave the digest
   pinned to the last-known-good.

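
A digest-pin check of the kind the H-001 guard enforces can be approximated in a few lines of Go. This sketch does not reproduce the actual guard: the function name `unpinned`, the regular expression, and the example image names are assumptions, and it only checks that each image reference ends in a well-formed `@sha256:` digest.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// digestPin matches image references that end in a full SHA-256 digest,
// for example "nginx@sha256:<64 lowercase hex characters>".
var digestPin = regexp.MustCompile(`@sha256:[0-9a-f]{64}$`)

// exampleDigest is a placeholder digest of the right shape, not a real image.
var exampleDigest = strings.Repeat("a", 64)

// unpinned returns every image reference that lacks a digest pin.
func unpinned(images []string) []string {
	var bad []string
	for _, img := range images {
		if !digestPin.MatchString(img) {
			bad = append(bad, img)
		}
	}
	return bad
}

func main() {
	images := []string{
		"nginx@sha256:" + exampleDigest, // pinned
		"haproxy:2.8",                   // tag only, not pinned
	}
	for _, img := range unpinned(images) {
		fmt.Println("missing digest pin:", img)
	}
}
```

Pinning by digest rather than tag is what makes step 2 meaningful: the e2e matrix runs against exactly the bytes that were reviewed, even if the upstream tag later moves.
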
## How to add a new vendor version

1. Add a new sidecar entry to `deploy/docker-compose.test.yml` with
   the new image digest.
2. Add a row to this matrix marking status as "pending".
3. Write `TestVendorEdge_<connector>_<edge>_E2E` test(s) that
   exercise the vendor's known quirks against the new sidecar.
4. Once tests pass in CI, mark status "CI".
5. After operator manual smoke, mark status "✓".

## Per-connector deep-dive docs

For the top 5 most-deployed connectors:

- [NGINX deep-dive](connector-nginx.md)
- [Kubernetes deep-dive](connector-k8s.md)
- [IIS deep-dive](connector-iis.md)
- [Apache deep-dive](connector-apache.md)
- [F5 deep-dive](connector-f5.md)

Other connector docs live in [docs/connectors.md](connectors.md).