From faf580aa109d3600b9a145caf81e15590a601ef0 Mon Sep 17 00:00:00 2001
From: shankar0123 <skreddy040@gmail.com>
Date: Thu, 30 Apr 2026 05:19:56 +0000
Subject: [PATCH] =?UTF-8?q?docs:=20production=20hardening=20II=20=E2=80=94?=
 =?UTF-8?q?=20DR=20runbook=20+=20crl-ocsp=20updates=20+=20features.md=20en?=
 =?UTF-8?q?v=20vars=20(Phase=2010)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Production hardening II Phase 10 — operator-facing documentation
that codifies the new V2 surfaces shipped in Phases 1-8.

NEW docs/disaster-recovery.md (8 sections, ~280 lines):
  - Overview of automatic fail-safes already in code
  - CRL cache recovery (delete row + scheduler regenerates)
  - OCSP responder cert recovery (delete row + ensureOCSPResponder
    re-bootstraps on next request)
  - OCSP response cache recovery (delete row + read-through fallback)
  - CA private-key rotation procedure (9-step playbook)
  - Postgres restore (with explicit list of operator-managed
    artifacts NOT in DB)
  - Trust-bundle reload semantics (SCEP / EST / Intune SIGHUP-
    equivalent fail-safe behavior)
  - DR checklist (printable; pin near on-call)

This is the SOC 2 / PCI procurement-team deliverable. Auditors and
on-call operators get a single document that tells them what to do
when state corrupts, when keys need rotation, when Postgres needs
restoring. Nothing in the runbook requires new code — it codifies
behaviors already in the codebase.

UPDATED docs/crl-ocsp.md:
  - New "Production hardening II additions" section: OCSP nonce
    extension, OCSP pre-signed cache (with the load-bearing security
    wire called out), per-source-IP OCSP rate limit, per-actor cert-
    export rate limit, CRL HTTP caching headers (RFC 7232), CRL
    DistributionPoints auto-injection, cert-export typed audit
    codes, per-area Prometheus metrics with operator alert
    recommendations.
  - Pruned the V3-Pro deferral list to remove items that this
    bundle SHIPPED (OCSP rate-limiting moved out; remaining V3-Pro:
    delta CRLs, OCSP stapling, OCSP request signature verification,
    HA / multi-region replication, IDP extension for sharded CRLs).

UPDATED docs/features.md:
  - CERTCTL_OCSP_RATE_LIMIT_PER_IP_MIN row (default 1000)
  - CERTCTL_CERT_EXPORT_RATE_LIMIT_PER_ACTOR_HR row (default 50)

G-3 docs-drift CI guard reproduced clean: every new CERTCTL_* env
var documented in features.md AND consumed in Go source. S-1 stale-
counts guard clean (no literal-number prose for current-state
counts in README/docs).
---
 docs/crl-ocsp.md          | 108 ++++++++++--
 docs/disaster-recovery.md | 348 ++++++++++++++++++++++++++++++++++++++
 docs/features.md          |   2 +
 3 files changed, 445 insertions(+), 13 deletions(-)
 create mode 100644 docs/disaster-recovery.md

diff --git a/docs/crl-ocsp.md b/docs/crl-ocsp.md
index 624e17b..c79c44d 100644
--- a/docs/crl-ocsp.md
+++ b/docs/crl-ocsp.md
@@ -285,24 +285,106 @@ will pull on its own cadence.
 
 ---
 
+## Production hardening II additions (post-2026-04-30)
+
+The following capabilities were folded into V2 (free) by the production
+hardening II bundle. Each closes a real procurement-team checklist gap
+without requiring a paid tier.
+
+### OCSP nonce extension (RFC 6960 §4.4.1)
+
+The POST OCSP handler echoes the request's nonce extension (OID
+`1.3.6.1.5.5.7.48.1.2`) in the response. Defends against replay attacks
+where a relying party's cached response is replayed against a now-revoked
+cert. Always-on; no operator opt-out.
+
+Failure modes:
+
+- **No nonce in request** — back-compat; response omits the extension.
+- **Well-formed nonce ≤ 32 bytes** — response echoes it; tracked in
+  `certctl_ocsp_counter_total{label="nonce_echoed"}`.
+- **Empty or oversized nonce (> 32 bytes per CA/B Forum BR §4.10.2)** —
+  responder returns the canonical "unauthorized" status (RFC 6960 §2.3
+  status 6); tracked in `certctl_ocsp_counter_total{label="nonce_malformed"}`.
+
+### OCSP pre-signed response cache
+
+Mirrors the existing CRL cache. Per-(issuer, serial) entries pre-signed
+and stored in `ocsp_response_cache`; the read-through facade in
+`CAOperationsSvc.GetOCSPResponseWithNonce` consults the cache for
+nil-nonce requests and falls through to live signing on miss + writes
+the result back. Nonce-bearing requests always live-sign because the
+cache stores nil-nonce blobs.
+
+**Load-bearing security wire:** `RevocationSvc.RevokeCertificateWithActor`
+calls `InvalidateOnRevoke` after a successful revocation so the next
+OCSP fetch returns the revoked status. There is no stale-good window
+after revoke.
+
+### Per-source-IP OCSP rate limit + per-actor cert-export rate limit
+
+Defaults: 1000 req/min/IP for OCSP; 50 exports/hr/operator for the
+cert-export endpoints. Configurable via
+`CERTCTL_OCSP_RATE_LIMIT_PER_IP_MIN` and
+`CERTCTL_CERT_EXPORT_RATE_LIMIT_PER_ACTOR_HR`; zero disables.
+
+OCSP rate-limit trip: canonical "unauthorized" OCSP blob plus
+`Retry-After: 60`. Cert-export trip: HTTP 429 + JSON
+`{"error":"rate_limit_exceeded","retry_after_seconds":3600}`.
+
+The OCSP limiter does NOT honor `X-Forwarded-For` because OCSP is
+publicly reachable and untrusted intermediaries could spoof the header
+to bypass the cap.
+
+### CRL HTTP caching headers (RFC 7232)
+
+`GET /.well-known/pki/crl/{issuer_id}` now returns weak-form ETag,
+`Cache-Control: public, max-age=3600, must-revalidate`, and respects
+`If-None-Match` for HTTP 304 short-circuits. Lets CDNs and reverse
+proxies serve repeated fetches from edge cache.
+
+### CRL DistributionPoint auto-injection
+
+Local issuer config field `CRLDistributionPointURLs []string`; when
+non-empty, every issued cert carries the RFC 5280 §4.2.1.13
+`id-ce-cRLDistributionPoints` extension pointing at certctl's CRL
+endpoint. Refusing to silently inject an empty CDP is deliberate —
+silent-empty fails relying-party validation worse than no CDP.
+
+### Cert-export typed audit codes + Prometheus per-area metrics
+
+Audit emission now carries typed action constants
+(`cert_export_pem`, `cert_export_pkcs12`, `cert_export_failed`)
+alongside legacy bare codes. Detail map enriched with
+`has_private_key` (always false in V2) and `cipher`
+(`AES-256-CBC-PBE2-SHA256` — pinned).
+
+`GET /api/v1/metrics/prometheus` surfaces the new per-area counters
+under the `certctl_<area>_counter_total{label=...}` family. OCSP
+shipped in this bundle; alert recommendations:
+
+- `{label="rate_limited"}` rate > 0 sustained > 5m → notify (limiter
+  is doing its job; investigate source IP).
+- `{label="nonce_malformed"}` > 0 → notify (legitimate clients don't
+  send malformed nonces).
+- `{label="signing_failed"}` > 0 → page on-call (issuer connector
+  failing).
+
 ## What this release does NOT include (V3-Pro)
 
-The following are explicitly out of scope for the V2 (free) bundle and are
-tracked for the certctl Pro release:
+Still out of scope for V2; tracked for V3-Pro:
 
 - **Delta CRLs (RFC 5280 §5.2.4).** Useful for very large CRLs (10k+
-  revoked certs); the data model already accommodates the Base CRL Number
+  revoked certs); the data model accommodates the Base CRL Number
   reference but the pipeline only emits Base CRLs in V2.
-- **OCSP rate-limiting per relying party.** Per-IP token bucket on the OCSP
-  endpoint — V3-Pro because it justifies per-seat pricing for high-traffic
-  responders.
-- **OCSP stapling.** Server-side: cache pre-fetched OCSP responses + serve
-  in TLS handshake. Client-side: a "stapling fetcher" agent for non-stapling
-  origins.
-
-The MaxBytesReader cap is the only request-level guard in V2; the
-unauthenticated-by-design relying-party endpoints are intentionally not
-rate-limited per IP.
+- **OCSP stapling at SCEP/EST CertRep response time.** Server-side
+  pre-staple into the TLS handshake context.
+- **OCSP request signature verification (RFC 6960 §4.1.1).** Optional
+  per-spec; certctl currently ignores the signature.
+- **OCSP responder HA / multi-region replication.** Active-active
+  OCSP cache with Postgres logical replication.
+- **CRL Issuing Distribution Point (IDP) extension** (RFC 5280
+  §5.2.5) — for sharded CRL deployments.
 
 ---
 
diff --git a/docs/disaster-recovery.md b/docs/disaster-recovery.md
new file mode 100644
index 0000000..8957fd0
--- /dev/null
+++ b/docs/disaster-recovery.md
@@ -0,0 +1,348 @@
+# Disaster recovery runbook
+
+> **Status (this document):** Production hardening II Phase 10
+> deliverable. Codifies the fail-safe behaviors that already exist in
+> the codebase and the operator procedures for recovering from
+> common failure modes. Nothing in this runbook requires new code —
+> if a procedure here doesn't work as documented, that's a bug in
+> docs (file an issue).
+
+This runbook is the SOC 2 / PCI procurement-team deliverable: it tells
+auditors and on-call operators what to do when a piece of certctl's
+state corrupts, when a CA key needs rotation, or when Postgres needs
+a point-in-time restore. Read it once when you set up certctl; print
+the [DR checklist](#dr-checklist) and pin it near your on-call rotation.
+
+## Contents
+
+1. [Overview — what's already automatic](#overview)
+2. [CRL cache recovery](#crl-cache-recovery)
+3. [OCSP responder cert recovery](#ocsp-responder-cert-recovery)
+4. [OCSP response cache recovery](#ocsp-response-cache-recovery)
+5. [CA private-key rotation](#ca-private-key-rotation)
+6. [Postgres restore](#postgres-restore)
+7. [Trust-bundle reload semantics (SCEP / EST / Intune)](#trust-bundle-reload-semantics)
+8. [DR checklist](#dr-checklist)
+
+## Overview
+
+certctl is engineered so most failure modes are auto-recoverable
+without operator action. The fail-safes in the codebase:
+
+- **CRL cache corruption** — the scheduler's `crlGenerationLoop`
+  regenerates the CRL for every issuer on its tick (default 1h via
+  `CERTCTL_CRL_GENERATION_INTERVAL`). A corrupt or missing
+  `crl_cache` row causes the next HTTP fetch to fall through to the
+  live-signing path; the scheduler then writes the fresh CRL back to
+  cache.
+- **OCSP responder cert missing** — `ensureOCSPResponder` lazily
+  bootstraps the responder cert on the first OCSP request after a
+  missing row. The CA-key signing operation is rare (only at
+  bootstrap / 7-day rotation cycle), so this is fast even on a
+  cold cache.
+- **OCSP response cache corruption** — the read-through facade in
+  `CAOperationsSvc.GetOCSPResponseWithNonce` falls through to live
+  signing on cache miss + writes the fresh response back. Operators
+  can `DELETE FROM ocsp_response_cache;` and the cache rebuilds
+  organically as relying parties query.
+- **Trust anchor reload after a half-rotation** — `TrustAnchorHolder`
+  (used by SCEP/Intune + EST mTLS) keeps the OLD pool in place when
+  a SIGHUP-triggered reload fails (parse error, expired cert). The
+  GUI reload modal surfaces the typed error so the operator can
+  correct the file and retry without taking the EST/SCEP endpoint
+  down.
+
+These fail-safes mean most of this runbook is "delete the corrupt
+row + wait for the next tick" rather than "restore from backup +
+manually re-issue." The runbook documents the full procedures
+anyway because compliance auditors need to see them written down.
+
+## CRL cache recovery
+
+**Symptom:** `GET /.well-known/pki/crl/{issuer_id}` returns 500, or
+the CRL it returns has the wrong revocations / wrong signature, or
+parses as garbage.
+
+**Diagnosis:**
+
+```bash
+# 1. Look at the cached row directly:
+psql -c "SELECT issuer_id, length(crl_der), this_update, next_update,
+                generated_at, generation_duration_ms, revoked_count
+         FROM crl_cache WHERE issuer_id = 'iss-local';"
+
+# 2. Look at recent generation events:
+psql -c "SELECT started_at, succeeded, error, duration_ms
+         FROM crl_generation_events
+         WHERE issuer_id = 'iss-local'
+         ORDER BY started_at DESC LIMIT 10;"
+```
+
+**Recovery:**
+
+```bash
+# Force regeneration on next request by deleting the cache row.
+# The next HTTP fetch falls through to the live-signing path AND the
+# next crlGenerationLoop tick (≤1h by default) writes a fresh row.
+psql -c "DELETE FROM crl_cache WHERE issuer_id = 'iss-local';"
+
+# Verify:
+curl -sS --cacert /path/to/ca.crt \
+    https://certctl.example.com:8443/.well-known/pki/crl/iss-local \
+  | openssl crl -inform DER -noout -text \
+  | head -20
+```
+
+**Worst case** — if the underlying revocation data in
+`certificate_revocations` is also corrupt, restore Postgres
+(see [Postgres restore](#postgres-restore)) and the CRL regenerates
+from the restored data on the next tick.
+
+## OCSP responder cert recovery
+
+**Symptom:** OCSP requests return 500 with errors like "responder
+not configured" or "failed to load responder key."
+
+**Diagnosis:**
+
+```bash
+psql -c "SELECT issuer_id, cert_subject, not_before, not_after,
+                created_at, key_path
+         FROM ocsp_responder_certs
+         WHERE issuer_id = 'iss-local';"
+
+# Check the on-disk responder key file (path from the row above):
+ls -la /etc/certctl/ocsp-responder-keys/iss-local.key
+```
+
+**Recovery:**
+
+```bash
+# Delete the responder row. The next OCSP request triggers
+# ensureOCSPResponder which generates a fresh keypair, signs a new
+# responder cert with the CA key (rare CA-key use), and persists
+# the new row + the on-disk key file (mode 0600 enforced).
+psql -c "DELETE FROM ocsp_responder_certs WHERE issuer_id = 'iss-local';"
+
+# If the on-disk key file is also corrupt, delete it first:
+rm -f /etc/certctl/ocsp-responder-keys/iss-local.key
+
+# Trigger the bootstrap by issuing one OCSP request:
+curl -sS --cacert /path/to/ca.crt \
+    https://certctl.example.com:8443/.well-known/pki/ocsp/iss-local/00 \
+  > /dev/null
+
+# Verify the new row + file:
+psql -c "SELECT * FROM ocsp_responder_certs WHERE issuer_id = 'iss-local';"
+ls -la /etc/certctl/ocsp-responder-keys/iss-local.key
+```
+
+The new responder cert carries the same `id-pkix-ocsp-nocheck`
+extension as the original (per RFC 6960 §4.2.2.2.1) so relying
+parties accept it without recursing through OCSP for the responder
+itself.
+
+## OCSP response cache recovery
+
+**Symptom:** an OCSP request returns a stale response (e.g. "good"
+for a cert you just revoked). This usually means the
+`InvalidateOnRevoke` wire failed to fire — see the warning logs from
+`RevocationSvc.RevokeCertificateWithActor`.
+
+**Recovery:**
+
+```bash
+# Delete the stale cache entry. The next OCSP request falls through
+# to live signing which reads the now-current revocation_status.
+psql -c "DELETE FROM ocsp_response_cache
+         WHERE issuer_id = 'iss-local' AND serial_hex = 'deadbeef...';"
+
+# Verify the next fetch returns "revoked":
+curl -sS --cacert /path/to/ca.crt \
+    https://certctl.example.com:8443/.well-known/pki/ocsp/iss-local/deadbeef... \
+  | openssl ocsp -respin /dev/stdin -resp_text -CAfile /path/to/ca.crt \
+  | grep "Cert Status"
+```
+
+For a fleet-wide invalidation (e.g. you rotated the CA key — see
+next section), nuke the whole cache:
+
+```bash
+psql -c "TRUNCATE ocsp_response_cache;"
+```
+
+The cache rebuilds organically as relying parties query. There's no
+service-degradation window because the live-sign fallback is always
+available; only the per-request CPU cost goes up until the cache
+warms back up.
+
+## CA private-key rotation
+
+**Symptom:** scheduled rotation cycle (annual or longer), or
+emergency rotation due to suspected compromise.
+
+This procedure rotates the CA private key for the local issuer.
+After rotation, every existing cert chains to the OLD CA cert which
+remains trusted by relying parties until its `notAfter` (typical
+10y); newly-issued certs chain to the NEW CA cert.
+
+**Procedure:**
+
+1. **Backup the current CA cert + key.** The on-disk paths are
+   `CERTCTL_CA_CERT_PATH` / `CERTCTL_CA_KEY_PATH` (typically
+   `/etc/certctl/ca.crt` + `/etc/certctl/ca.key`). Copy both to
+   a secure offline location with at least 2y retention (relying
+   parties may still send OCSP requests against certs the OLD CA
+   issued).
+2. **Generate a new keypair + cert.** For self-signed mode:
+   ```bash
+   openssl ecparam -name prime256v1 -genkey -noout -out new-ca.key
+   openssl req -x509 -key new-ca.key -days 3650 \
+       -subj "/CN=certctl Local CA" -out new-ca.crt
+   ```
+   For sub-CA mode, generate a CSR and have your enterprise root
+   sign it instead.
+3. **Stop certctl.** `kill -TERM <pid>` or `docker stop certctl`.
+4. **Move the new files into place + back up the old:**
+   ```bash
+   mv /etc/certctl/ca.crt /etc/certctl/ca.crt.old-rotated-20XX-XX-XX
+   mv /etc/certctl/ca.key /etc/certctl/ca.key.old-rotated-20XX-XX-XX
+   mv new-ca.crt /etc/certctl/ca.crt
+   mv new-ca.key /etc/certctl/ca.key
+   chmod 0600 /etc/certctl/ca.key
+   ```
+5. **Truncate the OCSP responder cert table** so the responder
+   bootstrap re-fires against the new CA:
+   ```bash
+   psql -c "DELETE FROM ocsp_responder_certs;"
+   ```
+6. **Truncate the CRL cache** so the next `crlGenerationLoop` tick
+   regenerates the CRL signed by the new CA:
+   ```bash
+   psql -c "TRUNCATE crl_cache;"
+   ```
+7. **Truncate the OCSP response cache** so future OCSP requests
+   live-sign with the new CA's responder cert:
+   ```bash
+   psql -c "TRUNCATE ocsp_response_cache;"
+   ```
+8. **Start certctl.** The startup preflight loads the new CA cert +
+   key. The next HTTP request bootstraps a new responder cert.
+9. **Verify:**
+   ```bash
+   # Issue a test cert
+   curl ... new-cert
+   # Confirm chain to the new CA
+   openssl x509 -in new-cert -noout -issuer
+   ```
+
+**Future:** when the HSM/PKCS#11 driver bundle (`cowork/hsm-pkcs11-
+driver-prompt.md`) ships, this rotation procedure changes
+substantially — the HSM-backed key never moves, only the cert wrap
+rotates. The signer interface seam is the load-bearing prerequisite
+for that.
+
+## Postgres restore
+
+certctl's full state lives in Postgres. The on-disk artifacts (CA
+cert/key, RA cert/key for SCEP, responder keys for OCSP, trust
+bundles for SCEP/Intune/EST mTLS) are operator-managed; everything
+else is in DB rows.
+
+**Restore procedure:**
+
+1. Stop certctl. `kill -TERM <pid>` or `docker stop certctl`.
+2. Restore the Postgres database from your point-in-time backup
+   (`pg_restore` or your managed-DB equivalent).
+3. Run any migrations newer than the backup's snapshot:
+   ```bash
+   migrate -path migrations/ -database "$DATABASE_URL" up
+   ```
+4. **Truncate the caches** that may now hold stale data referencing
+   pre-restore rows:
+   ```bash
+   psql -c "TRUNCATE crl_cache;"
+   psql -c "TRUNCATE ocsp_response_cache;"
+   ```
+5. Start certctl. The schedulers regenerate caches on their next
+   ticks.
+
+**Recoverable from DB only:** managed certificates, revocations,
+audit log, jobs, agents, owners, teams, profiles, issuer/target/
+notifier configs, scheduled tasks, network scan results.
+
+**Operator-managed (NOT in DB):**
+- CA cert + key (`CERTCTL_CA_CERT_PATH` / `CERTCTL_CA_KEY_PATH`)
+- SCEP RA cert + key per profile
+- OCSP responder keys per issuer (`CERTCTL_OCSP_RESPONDER_KEY_DIR`)
+- SCEP/Intune trust anchor PEM bundles
+- EST mTLS client CA trust bundles
+- `CERTCTL_API_KEY`, `CERTCTL_AGENT_BOOTSTRAP_TOKEN`,
+  `CERTCTL_CONFIG_ENCRYPTION_KEY`
+
+Back these up out-of-band on the same cadence as your Postgres
+backups. Without them, a restored DB is unusable.
+
+## Trust-bundle reload semantics
+
+This section codifies the fail-safe behavior that's already in code,
+for compliance auditors who need to see the procedure documented.
+
+**Pattern:** every trust-bundle holder (`internal/trustanchor.Holder`,
+used by SCEP/Intune dispatcher + EST mTLS sibling route) implements
+the same SIGHUP-equivalent reload semantics:
+
+- A bad reload (parse error, expired cert, empty bundle) keeps the
+  OLD pool in place. The endpoint stays up; the operator sees the
+  typed error in the GUI Reload modal.
+- The reload is atomic. There's no window where the holder is
+  empty or pointing at a half-loaded bundle.
+- In-flight requests use a snapshot taken at request-start. A
+  request that crosses a SIGHUP uses the OLD pool — no mid-request
+  validation drift.
+
+**Operator workflow:**
+
+1. Receive the new trust bundle (e.g., rotated Intune Connector
+   signing cert, rotated EST mTLS client CA).
+2. Overwrite the on-disk PEM file at the configured path.
+3. Trigger reload via the GUI (`/scep` Profiles tab → Reload trust
+   anchor; `/est` Profiles tab → same) OR send `kill -HUP <certctl-pid>`
+   directly.
+4. The Reload modal returns success or shows the typed error. On
+   error, fix the file (`openssl x509 -in trust.pem -noout -text`
+   to validate) and retry; the OLD pool stays in place between
+   attempts.
+
+## DR checklist
+
+Print this. Pin it near your on-call rotation.
+
+```
+☐ Backups: Postgres backup runs nightly + retention ≥ 30 days
+☐ Backups: CA cert + key offsite + retention ≥ NotAfter + 2y
+☐ Backups: OCSP responder keys offsite (or accept rotate-from-CA on restore)
+☐ Backups: Trust anchor PEMs offsite
+☐ Backups: Operator-managed env vars (API_KEY, BOOTSTRAP_TOKEN,
+  CONFIG_ENCRYPTION_KEY) in a separate secret manager
+
+☐ Quarterly: dry-run a Postgres restore into a staging environment
+☐ Quarterly: verify CA cert NotAfter > 1y
+☐ Quarterly: rotate the OCSP responder cert (auto-handled by
+  ensureOCSPResponder; verify the rotation actually fires by
+  diffing the responder row's serial_number quarter-over-quarter)
+
+☐ Annually: dry-run a full DR — restore Postgres + CA + responders
+  into a clean environment + issue + revoke a test cert end-to-end
+☐ Annually: rotate API_KEY, AGENT_BOOTSTRAP_TOKEN
+☐ Every 5y: rotate the CA private key (see CA rotation section above)
+```
+
+## Related docs
+
+- [`crl-ocsp.md`](crl-ocsp.md) — CRL/OCSP responder operator guide.
+- [`tls.md`](tls.md) — control-plane TLS bootstrap.
+- [`security.md`](security.md) — production-grade security posture.
+- [`scep-intune.md`](scep-intune.md) — SCEP/Intune trust-anchor
+  rotation specifics.
+- [`est.md`](est.md) — EST mTLS trust-bundle rotation specifics.
diff --git a/docs/features.md b/docs/features.md
index ef4c74c..fd5da0b 100644
--- a/docs/features.md
+++ b/docs/features.md
@@ -413,6 +413,8 @@ Self-signed or sub-CA mode using `crypto/x509`.
 | `CERTCTL_OCSP_RESPONDER_KEY_DIR` | (none) | **Operator MUST set in production.** Directory where the FileDriver persists each issuer's OCSP responder key (`ocsp-responder-<issuer_id>.key`). When unset, the responder service uses a temporary directory that does NOT survive restarts — fine for dev, NEVER for prod. |
 | `CERTCTL_OCSP_RESPONDER_ROTATION_GRACE` | `7d` | When the responder cert's `NotAfter` falls within this window, `EnsureResponder` rotates to a fresh cert+key on the next OCSP request or scheduler tick. |
 | `CERTCTL_OCSP_RESPONDER_VALIDITY` | `30d` | How long each newly-issued responder cert is valid for. Short by design: relying parties cache OCSP responses, not the responder cert chain, and `id-pkix-ocsp-nocheck` blocks recursive revocation checking on the responder itself. |
+| `CERTCTL_OCSP_RATE_LIMIT_PER_IP_MIN` | `1000` | **Production hardening II Phase 3.** Per-source-IP cap on OCSP requests per minute. Zero disables the limit. Trip returns the canonical OCSP "unauthorized" status (RFC 6960 §2.3) plus `Retry-After: 60`. The limiter does NOT honor `X-Forwarded-For` (OCSP is publicly reachable; spoofed headers would bypass the cap). |
+| `CERTCTL_CERT_EXPORT_RATE_LIMIT_PER_ACTOR_HR` | `50` | **Production hardening II Phase 3.** Per-actor cap on cert-export requests (PEM + PKCS#12) per hour. Zero disables. Trip returns HTTP 429 + JSON `{"error":"rate_limit_exceeded","retry_after_seconds":3600}` plus `Retry-After: 3600`. Defends against bulk-export from a compromised admin token. |
 
 Sub-CA mode validates `IsCA=true` and `KeyUsageCertSign` on the loaded certificate. Falls back to self-signed when paths are not set. Supports CRL generation (`GenerateCRL`) and OCSP response signing (`SignOCSPResponse`). All CA-key signing flows through the `signer.Signer` interface (`internal/crypto/signer/`); the OCSP responder cert is signed by the CA via the existing issuance pipeline and OCSP responses are signed by the responder key (NOT the CA key directly) per RFC 6960 §2.6.