Production hardening II Phase 10 — operator-facing documentation
that codifies the new V2 surfaces shipped in Phases 1-8.
NEW docs/disaster-recovery.md (8 sections, ~280 lines):
- Overview of automatic fail-safes already in code
- CRL cache recovery (delete row + scheduler regenerates)
- OCSP responder cert recovery (delete row + ensureOCSPResponder
re-bootstraps on next request)
- OCSP response cache recovery (delete row + read-through fallback)
- CA private-key rotation procedure (9-step playbook)
- Postgres restore (with explicit list of operator-managed
artifacts NOT in DB)
- Trust-bundle reload semantics (SCEP / EST / Intune SIGHUP-
equivalent fail-safe behavior)
- DR checklist (printable; pin near on-call)
This is the SOC 2 / PCI procurement-team deliverable. Auditors and
on-call operators get a single document that tells them what to do
when state corrupts, when keys need rotation, when Postgres needs
restoring. Nothing in the runbook requires new code — it codifies
behaviors already in the codebase.
UPDATED docs/crl-ocsp.md:
- New "Production hardening II additions" section: OCSP nonce
extension, OCSP pre-signed cache (with the load-bearing security
wire called out), per-source-IP OCSP rate limit, per-actor cert-
export rate limit, CRL HTTP caching headers (RFC 7232), CRL
DistributionPoints auto-injection, cert-export typed audit
codes, per-area Prometheus metrics with operator alert
recommendations.
- Pruned the V3-Pro deferral list to remove items that this
bundle SHIPPED (OCSP rate-limiting moved out; remaining V3-Pro:
delta CRLs, OCSP stapling, OCSP request signature verification,
HA / multi-region replication, IDP extension for sharded CRLs).
UPDATED docs/features.md:
- CERTCTL_OCSP_RATE_LIMIT_PER_IP_MIN row (default 1000)
- CERTCTL_CERT_EXPORT_RATE_LIMIT_PER_ACTOR_HR row (default 50)
G-3 docs-drift CI guard reproduced clean: every new CERTCTL_* env
var documented in features.md AND consumed in Go source. S-1 stale-
counts guard clean (no literal-number prose for current-state
counts in README/docs).
CRL & OCSP — Revocation Status for Relying Parties
This guide is the operator + relying-party reference for certctl's revocation status surfaces. It covers the wire format, endpoint URLs, configuration knobs, the OCSP responder cert lifecycle, and how to point common consumers (cert-manager, Firefox, OpenSSL) at the endpoints.
If you're looking for the higher-level architecture, see
architecture.md § Security Model. If you're
looking for the revocation policy / reason codes the API accepts, see
api/openapi.yaml § /certificates/{id}/revoke.
Conceptual overview
Why two formats. RFC 5280 §5 defines a Certificate Revocation List (CRL) — a periodically-published, signed list of every revoked certificate for an issuer. RFC 6960 defines the Online Certificate Status Protocol (OCSP) — a request/response protocol that returns the status of a single certificate by serial number. CRLs are batch-friendly and cacheable; OCSP is point-query and fresh. Production PKI deployments serve both because different relying parties prefer different trade-offs:
- Browsers (Firefox / Safari) prefer OCSP for freshness; some pin OCSP stapling.
- cert-manager and most Linux TLS clients fall back to CRL when OCSP is unreachable.
- Microsoft Intune / corporate device-state validators do periodic CRL pulls.
- OpenSSL s_client -status exercises OCSP via the Certificate Status Request extension during the handshake.
certctl's local issuer publishes both, with a pre-generation cache so a busy CA does not DoS itself by rebuilding the CRL on every fetch.
Why a separate OCSP responder cert. RFC 6960 §2.6 + §4.2.2.2 strongly
recommend that OCSP responses be signed by a delegated "OCSP responder cert"
issued by the CA, NOT by the CA private key directly. The responder cert
carries the id-pkix-ocsp-nocheck extension (RFC 6960 §4.2.2.2.1) so OCSP
clients do not recursively check the responder cert's revocation status. This
keeps the CA private key cold (an HSM operation per OCSP request would be
prohibitive at scale) and lets the responder key live on disk, on a separate
HSM partition, or rotate frequently while the CA key stays untouched.
Endpoints
All revocation endpoints live under /.well-known/pki/ per RFC 8615 and run
unauthenticated — relying parties without certctl API credentials must be
able to validate revocation status. The HTTPS-only TLS 1.3 control plane
applies; there is no plaintext fallback.
CRL — Certificate Revocation List
GET https://<host>/.well-known/pki/crl/{issuer_id}
| Field | Value |
|---|---|
| Method | GET |
| Auth | None (unauthenticated, RFC 5280 §5 distribution semantics) |
| Response Content-Type | application/pkix-crl |
| Response body | DER-encoded X.509 CRL signed by the issuer's CA |
| Cache | Pre-generated by the scheduler; configurable interval |
Example:
curl --cacert ca.crt \
-o crl.der \
https://localhost:8443/.well-known/pki/crl/iss-local
openssl crl -inform DER -in crl.der -text -noout
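The same check can be scripted. The following is a minimal Go sketch, assuming Go 1.21 or later (for RevokedCertificateEntries). The helper names isRevokedHex and demoCRL are hypothetical, and the demo builds a throwaway CA and CRL in memory as a stand-in for the bytes fetched from the endpoint above.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// isRevokedHex parses a DER CRL, verifies that ca signed it, and reports
// whether the hex-encoded serial is listed. (Hypothetical helper.)
func isRevokedHex(crlDER []byte, ca *x509.Certificate, serialHex string) (bool, error) {
	serial, ok := new(big.Int).SetString(serialHex, 16)
	if !ok {
		return false, fmt.Errorf("bad hex serial %q", serialHex)
	}
	rl, err := x509.ParseRevocationList(crlDER)
	if err != nil {
		return false, err
	}
	if err := rl.CheckSignatureFrom(ca); err != nil {
		return false, err
	}
	for _, e := range rl.RevokedCertificateEntries {
		if e.SerialNumber.Cmp(serial) == 0 {
			return true, nil
		}
	}
	return false, nil
}

// demoCRL builds an in-memory CA and a CRL that revokes serial a1b2c3d4.
func demoCRL() ([]byte, *x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	now := time.Now()
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "demo CA"},
		NotBefore:             now.Add(-time.Minute),
		NotAfter:              now.Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageCRLSign,
	}
	caDER, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	ca, err := x509.ParseCertificate(caDER)
	if err != nil {
		return nil, nil, err
	}
	crl, err := x509.CreateRevocationList(rand.Reader, &x509.RevocationList{
		Number:     big.NewInt(42),
		ThisUpdate: now,
		NextUpdate: now.Add(time.Hour),
		RevokedCertificateEntries: []x509.RevocationListEntry{
			{SerialNumber: big.NewInt(0xa1b2c3d4), RevocationTime: now},
		},
	}, ca, key)
	return crl, ca, err
}

func main() {
	crl, ca, err := demoCRL()
	if err != nil {
		panic(err)
	}
	revoked, err := isRevokedHex(crl, ca, "a1b2c3d4")
	fmt.Println(revoked, err)
}
```

In a real client, the crlDER bytes would come from the HTTP response body and ca from the issuer certificate the relying party already trusts.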
OCSP — Online Certificate Status Protocol
certctl serves both a GET form (simple URL-path lookup by hex serial)
and the POST form (binary OCSPRequest body, RFC 6960 Appendix A.1). Most
production OCSP clients (Firefox, OpenSSL s_client -status, cert-manager,
Intune) use POST. The GET form is preserved for ops curl-debugging.
GET form
GET https://<host>/.well-known/pki/ocsp/{issuer_id}/{serial_hex}
| Field | Value |
|---|---|
| Method | GET |
| Auth | None |
| Response Content-Type | application/ocsp-response |
| Response body | DER-encoded OCSPResponse signed by the OCSP responder cert (NOT the CA cert) |
Example:
curl --cacert ca.crt \
-o response.der \
https://localhost:8443/.well-known/pki/ocsp/iss-local/a1b2c3d4
openssl ocsp -respin response.der -text -CAfile ca.crt
POST form (the standard one)
POST https://<host>/.well-known/pki/ocsp/{issuer_id}
Content-Type: application/ocsp-request
Body: <DER-encoded OCSPRequest>
| Field | Value |
|---|---|
| Method | POST |
| Auth | None |
| Request Content-Type | application/ocsp-request |
| Response Content-Type | application/ocsp-response |
Example with OpenSSL building the request:
openssl ocsp -issuer ca.crt -cert leaf.crt -reqout request.der
curl --cacert ca.crt \
-X POST \
-H "Content-Type: application/ocsp-request" \
--data-binary @request.der \
-o response.der \
https://localhost:8443/.well-known/pki/ocsp/iss-local
openssl ocsp -respin response.der -text -CAfile ca.crt
The body-size limit applies (http.MaxBytesReader from middleware,
default 1MB, configurable via CERTCTL_MAX_BODY_SIZE); a typical OCSPRequest
is ~200 bytes so this is a generous cap.
Admin observability endpoint
GET https://<host>/api/v1/admin/crl/cache
Authorization: Bearer <token-with-admin-flag>
Returns the per-issuer cache state — for ops dashboards, GUI badges, or "is the scheduler keeping up?" diagnostics. Admin-gated (M-008 admin-gated handler allowlist; non-admin Bearer callers receive HTTP 403). Response shape:
{
  "cache_rows": [
    {
      "issuer_id": "iss-local",
      "cache_present": true,
      "crl_number": 42,
      "this_update": "2026-04-29T10:00:00Z",
      "next_update": "2026-04-29T11:00:00Z",
      "generated_at": "2026-04-29T10:00:00Z",
      "generation_duration_ms": 87,
      "revoked_count": 13,
      "is_stale": false,
      "recent_events": [
        {
          "started_at": "2026-04-29T10:00:00Z",
          "duration_ms": 87,
          "succeeded": true,
          "crl_number": 42,
          "revoked_count": 13
        }
      ]
    }
  ],
  "row_count": 1,
  "generated_at": "2026-04-29T10:30:00Z"
}
Issuers that have not yet had a CRL generated appear with cache_present: false so the GUI can render a "Not yet generated" pill rather than 404.
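A dashboard or health check might consume this response roughly as follows. This is a minimal Go sketch that decodes only a subset of the fields shown above; the type names, the needsAttention helper, and the second sample issuer (iss-new) are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cacheRow mirrors a subset of the per-issuer fields in the response.
type cacheRow struct {
	IssuerID     string `json:"issuer_id"`
	CachePresent bool   `json:"cache_present"`
	CRLNumber    int64  `json:"crl_number"`
	RevokedCount int    `json:"revoked_count"`
	IsStale      bool   `json:"is_stale"`
}

type cacheResponse struct {
	CacheRows []cacheRow `json:"cache_rows"`
	RowCount  int        `json:"row_count"`
}

// sampleBody is an abbreviated stand-in for a real response body.
const sampleBody = `{
  "cache_rows": [
    {"issuer_id": "iss-local", "cache_present": true, "crl_number": 42, "revoked_count": 13, "is_stale": false},
    {"issuer_id": "iss-new", "cache_present": false}
  ],
  "row_count": 2
}`

// needsAttention returns the issuer IDs whose CRL is missing or stale.
func needsAttention(body []byte) ([]string, error) {
	var resp cacheResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	var ids []string
	for _, r := range resp.CacheRows {
		if !r.CachePresent || r.IsStale {
			ids = append(ids, r.IssuerID)
		}
	}
	return ids, nil
}

func main() {
	ids, err := needsAttention([]byte(sampleBody))
	fmt.Println(ids, err)
}
```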
Configuration
| Env var | Default | Meaning |
|---|---|---|
| CERTCTL_CRL_GENERATION_INTERVAL | 1h | How often the scheduler walks every CRL-supporting issuer and rebuilds. The HTTP handler reads from the cache, not from a per-request rebuild. |
| CERTCTL_OCSP_RESPONDER_KEY_DIR | unset | Operator MUST set in production. Directory where the FileDriver persists each issuer's OCSP responder key (ocsp-responder-<issuer_id>.key). When unset, the responder service uses a temporary directory that does NOT survive restarts: fine for dev, NEVER for prod. |
| CERTCTL_OCSP_RESPONDER_ROTATION_GRACE | 7d | When the responder cert's NotAfter falls within this window, EnsureResponder rotates to a fresh cert+key on the next OCSP request or scheduler tick. |
| CERTCTL_OCSP_RESPONDER_VALIDITY | 30d | How long each newly-issued responder cert is valid for. Short by design: relying parties cache OCSP responses, not the responder cert chain, and id-pkix-ocsp-nocheck blocks recursive revocation checking on the responder itself. |
The issuer-level CRL nextUpdate is derived from the generation timestamp +
the configured CRL validity (currently a build-time constant in the
CRLCacheService; configurable knob deferred until an operator asks).
OCSP responder cert lifecycle
- First OCSP request for an issuer (or scheduler tick). The local issuer's SignOCSPResponse calls into OCSPResponderService.EnsureResponder.
- Cache lookup. EnsureResponder queries the ocsp_responders table for a row keyed by issuer_id.
- Disk lookup. If a row exists, the FileDriver reads the persisted key from <keydir>/ocsp-responder-<issuer_id>.key. Self-healing: if the row exists but the file is missing (operator pruned the keydir without pruning the DB), the service treats this as "rotate now" rather than crashing.
- Rotation check. If cert.NotAfter < now + RotationGrace, the service generates a fresh ECDSA-P256 key, builds a *x509.CertificateRequest, and asks the local issuer's existing IssueCertificate flow to sign it. The signing template carries:
  - KeyUsage: x509.KeyUsageDigitalSignature (signing OCSP responses)
  - ExtKeyUsage: x509.ExtKeyUsageOCSPSigning (RFC 6960 §4.2.2.2)
  - The id-pkix-ocsp-nocheck extension (OID 1.3.6.1.5.5.7.48.1.5, DER value NULL, RFC 6960 §4.2.2.2.1) wired through Certificate.ExtraExtensions.
- Persistence. The new cert + key path are written to ocsp_responders via an idempotent INSERT … ON CONFLICT DO UPDATE.
- Response signing. ocsp.CreateResponse(caCert, responderCert, template, responderSigner) produces the response bytes; the responder cert is included in the response chain so relying parties can validate without a separate fetch.
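The rotation check and the signing template can be sketched in Go with the standard library alone. This is an illustration under stated assumptions, not certctl's actual code: needsRotation, responderTemplate, issueDemoResponder, and hasNoCheck are hypothetical names, the CA is a throwaway in-memory key, and the real flow goes through a CertificateRequest and IssueCertificate rather than calling x509.CreateCertificate directly.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/asn1"
	"fmt"
	"math/big"
	"time"
)

const day = 24 * time.Hour

// id-pkix-ocsp-nocheck, RFC 6960 §4.2.2.2.1.
var oidOCSPNoCheck = asn1.ObjectIdentifier{1, 3, 6, 1, 5, 5, 7, 48, 1, 5}

// needsRotation mirrors the check NotAfter < now + grace.
func needsRotation(notAfter, now time.Time, grace time.Duration) bool {
	return notAfter.Before(now.Add(grace))
}

// responderTemplate carries the key usage, EKU, and nocheck extension
// listed above. The extension value is the DER encoding of NULL.
func responderTemplate(now time.Time, validity time.Duration) *x509.Certificate {
	return &x509.Certificate{
		SerialNumber:    big.NewInt(2),
		Subject:         pkix.Name{CommonName: "demo OCSP responder"},
		NotBefore:       now.Add(-time.Minute),
		NotAfter:        now.Add(validity),
		KeyUsage:        x509.KeyUsageDigitalSignature,
		ExtKeyUsage:     []x509.ExtKeyUsage{x509.ExtKeyUsageOCSPSigning},
		ExtraExtensions: []pkix.Extension{{Id: oidOCSPNoCheck, Value: asn1.NullBytes}},
	}
}

// hasNoCheck reports whether a parsed cert carries id-pkix-ocsp-nocheck.
func hasNoCheck(cert *x509.Certificate) bool {
	for _, ext := range cert.Extensions {
		if ext.Id.Equal(oidOCSPNoCheck) {
			return true
		}
	}
	return false
}

// issueDemoResponder signs a 30-day responder cert with a throwaway CA.
func issueDemoResponder() (*x509.Certificate, error) {
	caKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	respKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	now := time.Now()
	// Unparsed CA template used as the parent; enough for a demo issuer name.
	caTmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "demo CA"},
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, responderTemplate(now, 30*day), caTmpl, &respKey.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	cert, err := issueDemoResponder()
	if err != nil {
		panic(err)
	}
	fmt.Println("nocheck present:", hasNoCheck(cert))
	fmt.Println("rotate now:", needsRotation(cert.NotAfter, time.Now(), 7*day))
}
```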
The race between scheduler-driven cache refresh and on-demand cache miss is
collapsed by the CRLCacheService's in-tree singleflight (a sync.Map of
*flightEntry keyed by issuer_id). Concurrent generation requests for the
same issuer wait on the in-flight result rather than each rebuilding from
scratch.
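The singleflight pattern described above can be sketched as follows. The source only states that a sync.Map of *flightEntry is keyed by issuer_id; the struct fields, the crlFlights type, and the do method here are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// flightEntry holds the result of one in-flight CRL generation.
type flightEntry struct {
	done chan struct{} // closed once crl and err are set
	crl  []byte
	err  error
}

// crlFlights maps issuer_id to the *flightEntry currently being built.
type crlFlights struct{ m sync.Map }

// do runs build at most once per issuer at a time; concurrent callers for
// the same issuer block on the leader's result instead of rebuilding.
func (f *crlFlights) do(issuerID string, build func() ([]byte, error)) ([]byte, error) {
	mine := &flightEntry{done: make(chan struct{})}
	if v, loaded := f.m.LoadOrStore(issuerID, mine); loaded {
		other := v.(*flightEntry)
		<-other.done
		return other.crl, other.err
	}
	mine.crl, mine.err = build()
	f.m.Delete(issuerID) // later callers start a fresh flight
	close(mine.done)
	return mine.crl, mine.err
}

// demo fires n concurrent requests for one issuer and returns how many
// times the slow build function actually ran.
func demo(n int) int64 {
	var f crlFlights
	var builds int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			f.do("iss-local", func() ([]byte, error) {
				atomic.AddInt64(&builds, 1)
				time.Sleep(100 * time.Millisecond) // simulate CRL signing
				return []byte("crl"), nil
			})
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&builds)
}

func main() {
	fmt.Println("builds:", demo(8))
}
```

Deleting the map entry before closing the channel means a caller that arrives after the build has finished starts a new generation rather than receiving the previous result.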
Pointing common consumers at the endpoints
cert-manager (Kubernetes)
cert-manager's certificate-validation logic checks both the AIA OCSP URI embedded in the leaf and the CDP CRL URI. Both are populated automatically by the local issuer's certificate template — relying parties should NOT need any additional configuration. To verify:
openssl x509 -in leaf.crt -text -noout | grep -A1 "Authority Information Access"
openssl x509 -in leaf.crt -text -noout | grep -A2 "CRL Distribution Points"
If your cert-manager pods cannot reach https://<certctl-host>:8443/.well-known/pki/,
add a NetworkPolicy egress rule or expose the certctl service via the
appropriate ingress class.
Firefox
Firefox honors the AIA OCSP URI by default. To force-refresh the local revocation cache after revoking a cert in dev:
about:preferences#privacy → Certificates → Query OCSP responder servers
If Firefox reports SEC_ERROR_OCSP_INVALID_SIGNING_CERT, verify that the
responder cert chain is reachable from the system trust store —
id-pkix-ocsp-nocheck is a Firefox-strict extension and is set automatically
on every responder cert certctl issues.
OpenSSL
# OCSP via stand-alone request
openssl ocsp -issuer ca.crt -cert leaf.crt -url https://localhost:8443/.well-known/pki/ocsp/iss-local -CAfile ca.crt -text
# OCSP via TLS Certificate Status Request extension
openssl s_client -connect example.com:443 -status -CAfile ca.crt
Intune (corporate device state)
Intune device-compliance validators pull the CRL on a schedule (configured in
the Intune admin console, default 24h). Configure the CRL distribution point
to https://<certctl-host>:8443/.well-known/pki/crl/<issuer_id> and Intune
will pull on its own cadence.
Production hardening II additions (post-2026-04-30)
The following capabilities were folded into V2 (free) by the production hardening II bundle. Each closes a real procurement-team checklist gap without requiring a paid tier.
OCSP nonce extension (RFC 6960 §4.4.1)
The POST OCSP handler echoes the request's nonce extension (OID
1.3.6.1.5.5.7.48.1.2) in the response. Defends against replay attacks
where a relying party's cached response is replayed against a now-revoked
cert. Always-on; no operator opt-out.
Failure modes:
- No nonce in request: back-compat; response omits the extension.
- Well-formed nonce ≤ 32 bytes: response echoes it; tracked in certctl_ocsp_counter_total{label="nonce_echoed"}.
- Empty or oversized nonce (> 32 bytes; RFC 8954 §2.1 limits the nonce to 1-32 octets): responder returns the canonical "unauthorized" status (RFC 6960 §2.3 status 6); tracked in certctl_ocsp_counter_total{label="nonce_malformed"}.
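The three failure modes above reduce to a small classification step. A minimal Go sketch, with hypothetical names (classifyNonce, nonceOutcome) and a present flag assumed to distinguish "no extension" from "extension with an empty value":

```go
package main

import "fmt"

const maxNonceLen = 32 // upper bound stated above

type nonceOutcome int

const (
	nonceAbsent    nonceOutcome = iota // omit the extension in the response
	nonceEcho                          // echo the nonce back
	nonceMalformed                     // return the "unauthorized" status
)

// classifyNonce maps a request's nonce extension to the responder action.
func classifyNonce(nonce []byte, present bool) nonceOutcome {
	switch {
	case !present:
		return nonceAbsent
	case len(nonce) == 0 || len(nonce) > maxNonceLen:
		return nonceMalformed
	default:
		return nonceEcho
	}
}

func main() {
	fmt.Println(classifyNonce(nil, false) == nonceAbsent)
	fmt.Println(classifyNonce(make([]byte, 16), true) == nonceEcho)
	fmt.Println(classifyNonce(make([]byte, 33), true) == nonceMalformed)
}
```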
OCSP pre-signed response cache
Mirrors the existing CRL cache. Per-(issuer, serial) entries pre-signed
and stored in ocsp_response_cache; the read-through facade in
CAOperationsSvc.GetOCSPResponseWithNonce consults the cache for
nil-nonce requests and falls through to live signing on miss + writes
the result back. Nonce-bearing requests always live-sign because the
cache stores nil-nonce blobs.
Load-bearing security wire: RevocationSvc.RevokeCertificateWithActor
calls InvalidateOnRevoke after a successful revocation so the next
OCSP fetch returns the revoked status. There is no stale-good window
after revoke.
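The read-through and invalidate-on-revoke behavior can be illustrated with an in-memory stand-in. This is a sketch, not the certctl implementation: the real cache lives in the ocsp_response_cache table, and the ocspCache type, its methods, and the stub signer below are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

type cacheKey struct{ issuerID, serialHex string }

// ocspCache is an in-memory stand-in for the pre-signed response cache.
type ocspCache struct {
	mu      sync.Mutex
	entries map[cacheKey][]byte
	sign    func(issuerID, serialHex string, nonce []byte) []byte // live signer (stub)
}

// get serves nil-nonce requests from the cache, signing and writing back on
// a miss. Nonce-bearing requests always live-sign and are never cached.
func (c *ocspCache) get(issuerID, serialHex string, nonce []byte) []byte {
	if nonce != nil {
		return c.sign(issuerID, serialHex, nonce)
	}
	k := cacheKey{issuerID, serialHex}
	c.mu.Lock()
	defer c.mu.Unlock() // held across signing for simplicity
	if resp, ok := c.entries[k]; ok {
		return resp
	}
	resp := c.sign(issuerID, serialHex, nil)
	c.entries[k] = resp
	return resp
}

// invalidateOnRevoke drops the cached entry so the next fetch re-signs.
func (c *ocspCache) invalidateOnRevoke(issuerID, serialHex string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, cacheKey{issuerID, serialHex})
}

// demo returns the status seen before and after revocation, plus the number
// of live signing calls made along the way.
func demo() (before, after string, signs int) {
	revoked := map[string]bool{}
	c := &ocspCache{entries: map[cacheKey][]byte{}}
	c.sign = func(issuerID, serialHex string, nonce []byte) []byte {
		signs++
		if revoked[serialHex] {
			return []byte("revoked")
		}
		return []byte("good")
	}
	before = string(c.get("iss-local", "a1b2c3d4", nil))
	c.get("iss-local", "a1b2c3d4", nil) // served from cache, no signing
	revoked["a1b2c3d4"] = true
	c.invalidateOnRevoke("iss-local", "a1b2c3d4")
	after = string(c.get("iss-local", "a1b2c3d4", nil))
	return before, after, signs
}

func main() {
	fmt.Println(demo())
}
```

Without the invalidateOnRevoke call, the third fetch would be served from the cache and still report the pre-revocation status, which is the stale-good window the wire is there to prevent.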
Per-source-IP OCSP rate limit + per-actor cert-export rate limit
Defaults: 1000 req/min/IP for OCSP; 50 exports/hr/operator for the
cert-export endpoints. Configurable via
CERTCTL_OCSP_RATE_LIMIT_PER_IP_MIN and
CERTCTL_CERT_EXPORT_RATE_LIMIT_PER_ACTOR_HR; zero disables.
OCSP rate-limit trip: canonical "unauthorized" OCSP blob plus
Retry-After: 60. Cert-export trip: HTTP 429 + JSON
{"error":"rate_limit_exceeded","retry_after_seconds":3600}.
The OCSP limiter does NOT honor X-Forwarded-For because OCSP is
publicly reachable and untrusted intermediaries could spoof the header
to bypass the cap.
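A per-source-IP limiter that keys on the connection's remote address and ignores X-Forwarded-For might look like the sketch below. The fixed-window algorithm, the ipLimiter type, and the demo addresses are assumptions; the source does not say which limiting algorithm certctl uses.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

type bucket struct {
	start time.Time
	count int
}

// ipLimiter is a fixed-window counter keyed by source IP (an assumption).
type ipLimiter struct {
	mu     sync.Mutex
	limit  int // requests per window; zero disables
	window time.Duration
	now    func() time.Time
	seen   map[string]*bucket
}

// allow keys on r.RemoteAddr only; X-Forwarded-For is deliberately ignored.
func (l *ipLimiter) allow(r *http.Request) bool {
	if l.limit == 0 {
		return true
	}
	ip, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		ip = r.RemoteAddr
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	now := l.now()
	b := l.seen[ip]
	if b == nil || now.Sub(b.start) >= l.window {
		b = &bucket{start: now}
		l.seen[ip] = b
	}
	b.count++
	return b.count <= l.limit
}

// demo sends requests from one peer that spoofs a different
// X-Forwarded-For value each time, and counts how many are allowed.
func demo(limit, requests int) (allowed int) {
	l := &ipLimiter{limit: limit, window: time.Minute, now: time.Now, seen: map[string]*bucket{}}
	for i := 0; i < requests; i++ {
		r := httptest.NewRequest(http.MethodPost, "/.well-known/pki/ocsp/iss-local", nil)
		r.RemoteAddr = "203.0.113.7:40000"
		r.Header.Set("X-Forwarded-For", fmt.Sprintf("198.51.100.%d", i))
		if l.allow(r) {
			allowed++
		}
	}
	return allowed
}

func main() {
	fmt.Println(demo(3, 5), demo(0, 5))
}
```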
CRL HTTP caching headers (RFC 7232)
GET /.well-known/pki/crl/{issuer_id} now returns weak-form ETag,
Cache-Control: public, max-age=3600, must-revalidate, and respects
If-None-Match for HTTP 304 short-circuits. Lets CDNs and reverse
proxies serve repeated fetches from edge cache.
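The conditional-request flow can be sketched with net/http. This is an illustration only: deriving the weak ETag from a SHA-256 prefix of the CRL bytes is an assumption (the source does not say how certctl computes it), and the If-None-Match comparison is simplified to a single exact match rather than full list or wildcard handling.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// crlHandler serves a cached CRL with a weak ETag and Cache-Control, and
// answers 304 when the client already holds the current version.
func crlHandler(crlDER []byte) http.HandlerFunc {
	sum := sha256.Sum256(crlDER)
	etag := `W/"` + hex.EncodeToString(sum[:8]) + `"` // assumed derivation
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("ETag", etag)
		w.Header().Set("Cache-Control", "public, max-age=3600, must-revalidate")
		if r.Header.Get("If-None-Match") == etag {
			w.WriteHeader(http.StatusNotModified)
			return
		}
		w.Header().Set("Content-Type", "application/pkix-crl")
		w.Write(crlDER)
	}
}

// demo performs an initial fetch and a revalidation, returning both codes.
func demo() (first, second int) {
	h := crlHandler([]byte("stand-in CRL bytes"))

	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/.well-known/pki/crl/iss-local", nil))

	req := httptest.NewRequest(http.MethodGet, "/.well-known/pki/crl/iss-local", nil)
	req.Header.Set("If-None-Match", rec.Header().Get("ETag"))
	rec2 := httptest.NewRecorder()
	h(rec2, req)

	return rec.Code, rec2.Code
}

func main() {
	fmt.Println(demo())
}
```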
CRL DistributionPoint auto-injection
Local issuer config field CRLDistributionPointURLs []string; when
non-empty, every issued cert carries the RFC 5280 §4.2.1.13
id-ce-cRLDistributionPoints extension pointing at certctl's CRL
endpoint. Refusing to silently inject an empty CDP is deliberate —
silent-empty fails relying-party validation worse than no CDP.
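In Go's crypto/x509, the CDP extension is emitted from the template's CRLDistributionPoints field, so the "inject only when non-empty" rule can be sketched as below. applyCDP and issueWithCDP are hypothetical helpers, the certificate is a throwaway self-signed one, and the example host name is a placeholder.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// applyCDP sets the distribution points only when URLs are configured, so
// no empty CDP extension is ever emitted.
func applyCDP(tmpl *x509.Certificate, urls []string) {
	if len(urls) > 0 {
		tmpl.CRLDistributionPoints = urls
	}
}

// issueWithCDP self-signs a throwaway cert and returns the CDP URLs found
// in the parsed result.
func issueWithCDP(urls []string) ([]string, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	now := time.Now()
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(3),
		Subject:      pkix.Name{CommonName: "demo leaf"},
		NotBefore:    now.Add(-time.Minute),
		NotAfter:     now.Add(time.Hour),
	}
	applyCDP(tmpl, urls)
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		return nil, err
	}
	return cert.CRLDistributionPoints, nil
}

func main() {
	got, err := issueWithCDP([]string{"https://certctl.example:8443/.well-known/pki/crl/iss-local"})
	fmt.Println(got, err)
}
```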
Cert-export typed audit codes + Prometheus per-area metrics
Audit emission now carries typed action constants
(cert_export_pem, cert_export_pkcs12, cert_export_failed)
alongside legacy bare codes. Detail map enriched with
has_private_key (always false in V2) and cipher
(AES-256-CBC-PBE2-SHA256 — pinned).
GET /api/v1/metrics/prometheus surfaces the new per-area counters
under the certctl_<area>_counter_total{label=...} family. OCSP
shipped in this bundle; alert recommendations:
- {label="rate_limited"} rate > 0 sustained > 5m → notify (limiter is doing its job; investigate source IP).
- {label="nonce_malformed"} > 0 → notify (legitimate clients don't send malformed nonces).
- {label="signing_failed"} > 0 → page on-call (issuer connector failing).
What this release does NOT include (V3-Pro)
Still out of scope for V2; tracked for V3-Pro:
- Delta CRLs (RFC 5280 §5.2.4). Useful for very large CRLs (10k+ revoked certs); the data model accommodates the Base CRL Number reference but the pipeline only emits Base CRLs in V2.
- OCSP stapling at SCEP/EST CertRep response time. Server-side pre-staple into the TLS handshake context.
- OCSP request signature verification (RFC 6960 §4.1.1). Optional per-spec; certctl currently ignores the signature.
- OCSP responder HA / multi-region replication. Active-active OCSP cache with Postgres logical replication.
- CRL Issuing Distribution Point (IDP) extension (RFC 5280 §5.2.5) — for sharded CRL deployments.
Troubleshooting
pki/crl/<issuer_id> returns 404. The issuer either does not support
CRL signing (Vault, EJBCA, DigiCert serve their own CRL infrastructure;
certctl's connectors return nil from GenerateCRL for these) or the
issuer ID is wrong. Verify with GET /api/v1/issuers.
pki/ocsp/<issuer_id>/<serial> returns 200 but openssl ocsp -text
shows "unauthorized". Check that the serial in the URL is hex-encoded (no
0x prefix, no leading zeros stripped, lowercase). Mismatched serials
return an OCSP response with status unauthorized per RFC 6960 §2.3.
Admin cache endpoint returns 403. The Bearer key does not carry the
admin flag. M-008 gates this endpoint server-side; the GUI also gates the
fetch on useAuth().admin. Either escalate the key (certctl admin keys promote <key-id>) or use a different identity.
Cache shows is_stale: true repeatedly. The scheduler is not running
(or not getting scheduled often enough). Check CERTCTL_CRL_GENERATION_INTERVAL
and confirm the scheduler started: grep crlGenerationLoop in the server
logs at startup.