docs: Phase 2 mechanical file moves to subdirectory structure

Pure git mv operations; no content edits. Internal links remain pointing at old paths and will be fixed in Phase 11. Per the Phase 1 audit recommendations at cowork/docs-overhaul-phase-1-audit-2026-05-04/. 35 files moved across 8 audience-organized subdirectories: docs/getting-started/ (5): quickstart.md, concepts.md, examples.md, advanced-demo.md (was demo-advanced.md), why-certctl.md docs/reference/ (6): architecture.md, api.md (was openapi.md), mcp.md, intermediate-ca-hierarchy.md, deployment-model.md (was deployment-atomicity.md), vendor-matrix.md (was deployment-vendor-matrix.md) docs/reference/protocols/ (6): acme-server.md, acme-server-threat-model.md, scep-intune.md, est.md, crl-ocsp.md, async-ca-polling.md (was async-polling.md) docs/operator/ (4): security.md, tls.md, database-tls.md, approval-workflow.md docs/operator/runbooks/ (3): cloud-targets.md (was runbook-cloud-targets.md), expiry-alerts.md (was runbook-expiry-alerts.md), disaster-recovery.md docs/migration/ (3): from-certbot.md (was migrate-from-certbot.md), from-acmesh.md (was migrate-from-acmesh.md), cert-manager-coexistence.md (was certctl-for-cert-manager-users.md) docs/compliance/ (4): index.md (was compliance.md), soc2.md (was compliance-soc2.md), pci-dss.md (was compliance-pci-dss.md), nist-sp-800-57.md (was compliance-nist.md) docs/contributor/ (4): testing-strategy.md, test-environment.md (was test-env.md), ci-pipeline.md, qa-test-suite.md (was qa-test-guide.md) Deferred to later Phase 2 sub-phases: - connectors.md split (Phase 4): docs/connectors.md + docs/connector-{apache,f5,iis,k8s,nginx}.md still at top level - testing-guide.md prune (Phase 5): docs/testing-guide.md still at top level - features.md disperse (Phase 6): docs/features.md still at top level - legacy-est-scep.md split (Phase 7): docs/legacy-est-scep.md still at top level - ACME walkthrough re-homing (Phase 8): three docs/acme-*-walkthrough.md still at top level - Upgrade docs archive (Phase 3): two docs/upgrade-*.md still at top level Cross-reference updates (Phase 11) will happen after all moves and content edits land. Internal links to docs/* paths are temporarily broken until that phase completes.
2026-06-07 13:51:36 +00:00 · 2026-05-05 02:49:28 +00:00
parent cda957f302
commit 3a807ae37e
35 changed files with 0 additions and 0 deletions
@@ -0,0 +1,334 @@
+# Runbook: cloud-target deployment connectors (AWS ACM + Azure Key Vault)
+
+This runbook covers the SDK-driven cloud target connectors that ship in
+certctl post-2026-05-03 (Rank 5 of the Infisical deep-research
+deliverable). It complements the operator-facing
+[AWS Certificate Manager](connectors.md#aws-certificate-manager-acm) and
+[Azure Key Vault](connectors.md#azure-key-vault) sections in
+`docs/connectors.md`.
+
+Audience: a platform sysadmin or SRE who needs to configure, debug, or
+audit certctl's cloud-target deploys. Not a walkthrough of how to
+install certctl.
+
+---
+
+## End-to-end flow (cloud targets)
+
+```mermaid
+flowchart TD
+    Renew["cert renewed → renewal job created"]
+    Pick["agent picks up DeployCertificate work item"]
+    Dispatch["target.Connector.DeployCertificate(ctx, request)"]
+
+    Renew --> Pick --> Dispatch
+    Dispatch --> AWS
+    Dispatch --> AZ
+
+    subgraph AWS["AWS ACM path"]
+        A1["1. rotate-in-place only:<br/>DescribeCertificate(arn)"]
+        A2["2. GetCertificate(arn) —<br/>capture snapshot bytes for rollback"]
+        A3["3. ImportCertificate(arn, new_bytes) —<br/>fresh ARN OR rotate-in-place"]
+        A4["4. AddTagsToCertificate(arn, provenance) —<br/>ACM strips on re-import; we re-apply"]
+        A5["5. DescribeCertificate(arn) —<br/>verify serial matches expected"]
+        A6["6. ON MISMATCH: rollback<br/>ImportCertificate(arn, snapshot_bytes)"]
+        A1 --> A2 --> A3 --> A4 --> A5 --> A6
+    end
+
+    subgraph AZ["Azure Key Vault path"]
+        Z1["1. GetCertificate(name, '' = latest) —<br/>capture snapshot CER bytes"]
+        Z2["2. Build PFX from cert+chain+key<br/>(PKCS#12 via go-pkcs12)"]
+        Z3["3. ImportCertificate(name, PFX, tags) —<br/>ALWAYS creates a new version"]
+        Z4["4. Tags carried forward automatically"]
+        Z5["5. GetCertificate(name, '' = latest) —<br/>verify serial matches expected"]
+        Z6["6. ON MISMATCH: rollback<br/>ImportCertificate(name, snapshot_PFX) —<br/>new version"]
+        Z1 --> Z2 --> Z3 --> Z4 --> Z5 --> Z6
+    end
+
+    A6 --> Audit
+    Z6 --> Audit
+    Audit["7. Audit row + Prometheus counters<br/>certctl_deploy_attempts_total{target_type, result}<br/>certctl_deploy_rollback_total{target_type, outcome}"]
+```
+
+---
+
+## Configuring an AWS ACM target
+
+### Minimum config
+
+```bash
+curl -X POST https://certctl.example.com/api/v1/targets \
+  -H 'Authorization: Bearer ${TOKEN}' \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "name": "Production ALB cert",
+    "type": "AWSACM",
+    "agent_id": "ag-server",
+    "config": {
+      "region": "us-east-1",
+      "tags": {"env": "production"}
+    }
+  }'
+```
+
+Empty `certificate_arn` on first deploy = ACM creates a fresh ARN; the
+deployment record's Metadata captures it. Update the
+`deployment_targets.config.certificate_arn` field via the GUI / API /
+direct SQL to pin the ARN for subsequent renewals.
+
+### Minimum IAM policy
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [{
+    "Effect": "Allow",
+    "Action": [
+      "acm:ImportCertificate",
+      "acm:GetCertificate",
+      "acm:DescribeCertificate",
+      "acm:ListCertificates",
+      "acm:AddTagsToCertificate"
+    ],
+    "Resource": "arn:aws:acm:us-east-1:*:certificate/*"
+  }]
+}
+```
+
+Pin `Resource` to the specific region / account where the ALB lives.
+Cross-account deploys use AssumeRole — configure the agent's role with
+`sts:AssumeRole` against the target account's role ARN.
+
+### Auth: IRSA (recommended for EKS-hosted agents)
+
+```yaml
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: certctl-agent
+  namespace: certctl-system
+  annotations:
+    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/certctl-acm-deployer
+```
+
+Trust policy on `certctl-acm-deployer`:
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [{
+    "Effect": "Allow",
+    "Principal": {
+      "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"
+    },
+    "Action": "sts:AssumeRoleWithWebIdentity",
+    "Condition": {
+      "StringEquals": {
+        "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE:sub": "system:serviceaccount:certctl-system:certctl-agent"
+      }
+    }
+  }]
+}
+```
+
+---
+
+## Configuring an Azure Key Vault target
+
+### Minimum config
+
+```bash
+curl -X POST https://certctl.example.com/api/v1/targets \
+  -H 'Authorization: Bearer ${TOKEN}' \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "name": "Production AGW cert",
+    "type": "AzureKeyVault",
+    "agent_id": "ag-server",
+    "config": {
+      "vault_url": "https://prod-vault.vault.azure.net",
+      "certificate_name": "api-prod",
+      "credential_mode": "managed_identity",
+      "tags": {"env": "production"}
+    }
+  }'
+```
+
+### Minimum RBAC role
+
+Off-the-shelf builtin: **Key Vault Certificates Officer** (assigns at
+the vault scope).
+
+Custom minimum-permission role:
+
+```json
+{
+  "properties": {
+    "roleName": "certctl-keyvault-deployer",
+    "description": "Minimum permissions for certctl Key Vault target",
+    "assignableScopes": [
+      "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<vault-name>"
+    ],
+    "permissions": [{
+      "actions": [],
+      "notActions": [],
+      "dataActions": [
+        "Microsoft.KeyVault/vaults/certificates/import/action",
+        "Microsoft.KeyVault/vaults/certificates/read",
+        "Microsoft.KeyVault/vaults/certificates/listversions/read"
+      ],
+      "notDataActions": []
+    }]
+  }
+}
+```
+
+### Auth: AKS workload identity (recommended for AKS-hosted agents)
+
+Annotate the agent's ServiceAccount:
+
+```yaml
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: certctl-agent
+  namespace: certctl-system
+  annotations:
+    azure.workload.identity/client-id: <app-registration-client-id>
+  labels:
+    azure.workload.identity/use: "true"
+```
+
+Federated credential on the app registration:
+
+```json
+{
+  "name": "certctl-agent-federated",
+  "issuer": "https://<oidc-issuer-url>",
+  "subject": "system:serviceaccount:certctl-system:certctl-agent",
+  "audiences": ["api://AzureADTokenExchange"]
+}
+```
+
+Set `credential_mode: workload_identity` on the deployment_target
+config.
+
+---
+
+## Operator playbook
+
+### "Did the cert get imported to ACM / Key Vault?"
+
+**AWS:**
+
+```bash
+aws acm describe-certificate \
+  --certificate-arn arn:aws:acm:us-east-1:...:certificate/<id> \
+  --query 'Certificate.{Status:Status,Serial:Serial,Issued:IssuedAt,NotAfter:NotAfter,Tags:[Tags]}'
+```
+
+**Azure:**
+
+```bash
+az keyvault certificate show \
+  --vault-name prod-vault \
+  --name api-prod \
+  --query '{Serial:x509ThumbprintHex, Version:id, NotAfter:attributes.expires}'
+```
+
+In both cases, the `certctl-managed-by` tag confirms the cert was
+imported by certctl (and not someone running aws-cli directly).
+
+### "Why did the rollback fail?"
+
+The Prometheus counter
+`certctl_deploy_rollback_total{outcome="also_failed"}` ticks when the
+rollback's own ImportCertificate / Set call also returns an error. Look
+at the agent's slog at ERROR level for the per-call diagnostic; the
+underlying cloud SDK error message tells you whether it was IAM
+denial, throttling, or a structural input problem.
+
+Manual recovery:
+
+**AWS ACM:**
+
+```bash
+# Get the snapshot of a known-good cert from S3 / Vault / wherever the
+# operator stores backup PEMs:
+aws acm import-certificate \
+  --certificate fileb://known-good.crt \
+  --private-key  fileb://known-good.key \
+  --certificate-chain fileb://known-good.chain \
+  --certificate-arn  arn:aws:acm:us-east-1:...:certificate/<id> \
+  --tags Key=certctl-managed-by,Value=manual-recovery
+```
+
+**Azure Key Vault:**
+
+```bash
+# Import a fresh PFX as a new version under the same name:
+az keyvault certificate import \
+  --vault-name prod-vault \
+  --name api-prod \
+  --file known-good.pfx \
+  --tags certctl-managed-by=manual-recovery
+```
+
+After the manual recovery, certctl's next renewal-loop tick re-verifies
+the live cert via `ValidateDeployment` and resumes normal operation.
+
+### "How do I know certctl is the only one writing to this ARN / vault cert?"
+
+**AWS — via CloudTrail:**
+
+```
+EventName = "ImportCertificate"
+Resources.ARN = "arn:aws:acm:us-east-1:...:certificate/<id>"
+```
+
+Filter by user identity to see which principal made each call. The
+certctl agent's IAM role / IRSA-bound role should be the only writer.
+
+**Azure — via Activity Log:**
+
+```bash
+az monitor activity-log list \
+  --resource-id /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<vault>/certificates/<name> \
+  --offset 30d \
+  --query "[?operationName.value=='Microsoft.KeyVault/vaults/certificates/import/action'].{caller:caller, time:eventTimestamp}"
+```
+
+---
+
+## Cardinality + cost
+
+- Per-target-type Prometheus counters: 2 new
+  `certctl_deploy_attempts_total` series (AWSACM + AzureKeyVault) ×
+  2 results = 4 series. Comfortable.
+- AWS ACM costs: ImportCertificate is free; CloudTrail logs at $2 per
+  GB. Renewing 100 certs/month adds ~10 KB to CloudTrail.
+- Azure Key Vault costs: certificate operations $0.03 per 10K
+  operations (V2 pricing as of 2026-05). 100 certs/month = $0.0009 in
+  cert-op spend. Activity Log retention is configurable (default 90
+  days, free).
+
+---
+
+## V3-Pro forward path
+
+Tracked at `cowork/WORKSPACE-ROADMAP.md` under "Adapter hardening":
+
+- **AWS CloudFront direct-attach** — UpdateDistribution after an ACM
+  ImportCertificate so the CloudFront edge picks up the new cert
+  without operator intervention. Requires `cloudfront:UpdateDistribution`
+  IAM permission on top of the ACM minimum.
+- **Azure Front Door direct-attach** — UpdateRoutingConfig equivalent.
+- **AWS ALB / Azure App Gateway auto-bind** — currently operators
+  attach the ARN / KID URI to the LB out-of-band (Terraform);
+  V3-Pro adds the auto-attach step.
+- **Soft-delete recovery for Azure Key Vault** — V2 always
+  re-imports as a new version; V3 detects soft-deleted prior
+  versions and offers operator-confirmed recovery.
+- **GCP Certificate Manager target** — Google Cloud's equivalent to
+  ACM; mirrors the AWS ACM connector shape. Separate cloud,
+  separate connector.
@@ -0,0 +1,348 @@
+# Disaster recovery runbook
+
+> **Status (this document):** Production hardening II Phase 10
+> deliverable. Codifies the fail-safe behaviors that already exist in
+> the codebase and the operator procedures for recovering from
+> common failure modes. Nothing in this runbook requires new code —
+> if a procedure here doesn't work as documented, that's a bug in
+> docs (file an issue).
+
+This runbook is the SOC 2 / PCI procurement-team deliverable: it tells
+auditors and on-call operators what to do when a piece of certctl's
+state corrupts, when a CA key needs rotation, or when Postgres needs
+a point-in-time restore. Read it once when you set up certctl; print
+the [DR checklist](#dr-checklist) and pin it near your on-call rotation.
+
+## Contents
+
+1. [Overview — what's already automatic](#overview)
+2. [CRL cache recovery](#crl-cache-recovery)
+3. [OCSP responder cert recovery](#ocsp-responder-cert-recovery)
+4. [OCSP response cache recovery](#ocsp-response-cache-recovery)
+5. [CA private-key rotation](#ca-private-key-rotation)
+6. [Postgres restore](#postgres-restore)
+7. [Trust-bundle reload semantics (SCEP / EST / Intune)](#trust-bundle-reload-semantics)
+8. [DR checklist](#dr-checklist)
+
+## Overview
+
+certctl is engineered so most failure modes are auto-recoverable
+without operator action. The fail-safes in the codebase:
+
+- **CRL cache corruption** — the scheduler's `crlGenerationLoop`
+  regenerates the CRL for every issuer on its tick (default 1h via
+  `CERTCTL_CRL_GENERATION_INTERVAL`). A corrupt or missing
+  `crl_cache` row causes the next HTTP fetch to fall through to the
+  live-signing path; the scheduler then writes the fresh CRL back to
+  cache.
+- **OCSP responder cert missing** — `ensureOCSPResponder` lazily
+  bootstraps the responder cert on the first OCSP request after a
+  missing row. The CA-key signing operation is rare (only at
+  bootstrap / 7-day rotation cycle), so this is fast even on a
+  cold cache.
+- **OCSP response cache corruption** — the read-through facade in
+  `CAOperationsSvc.GetOCSPResponseWithNonce` falls through to live
+  signing on cache miss + writes the fresh response back. Operators
+  can `DELETE FROM ocsp_response_cache;` and the cache rebuilds
+  organically as relying parties query.
+- **Trust anchor reload after a half-rotation** — `TrustAnchorHolder`
+  (used by SCEP/Intune + EST mTLS) keeps the OLD pool in place when
+  a SIGHUP-triggered reload fails (parse error, expired cert). The
+  GUI reload modal surfaces the typed error so the operator can
+  correct the file and retry without taking the EST/SCEP endpoint
+  down.
+
+These fail-safes mean most of this runbook is "delete the corrupt
+row + wait for the next tick" rather than "restore from backup +
+manually re-issue." The runbook documents the full procedures
+anyway because compliance auditors need to see them written down.
+
+## CRL cache recovery
+
+**Symptom:** `GET /.well-known/pki/crl/{issuer_id}` returns 500, or
+the CRL it returns has the wrong revocations / wrong signature, or
+parses as garbage.
+
+**Diagnosis:**
+
+```bash
+# 1. Look at the cached row directly:
+psql -c "SELECT issuer_id, length(crl_der), this_update, next_update,
+                generated_at, generation_duration_ms, revoked_count
+         FROM crl_cache WHERE issuer_id = 'iss-local';"
+
+# 2. Look at recent generation events:
+psql -c "SELECT started_at, succeeded, error, duration_ms
+         FROM crl_generation_events
+         WHERE issuer_id = 'iss-local'
+         ORDER BY started_at DESC LIMIT 10;"
+```
+
+**Recovery:**
+
+```bash
+# Force regeneration on next request by deleting the cache row.
+# The next HTTP fetch falls through to the live-signing path AND the
+# next crlGenerationLoop tick (≤1h by default) writes a fresh row.
+psql -c "DELETE FROM crl_cache WHERE issuer_id = 'iss-local';"
+
+# Verify:
+curl -sS --cacert /path/to/ca.crt \
+    https://certctl.example.com:8443/.well-known/pki/crl/iss-local \
+  | openssl crl -inform DER -noout -text \
+  | head -20
+```
+
+**Worst case** — if the underlying revocation data in
+`certificate_revocations` is also corrupt, restore Postgres
+(see [Postgres restore](#postgres-restore)) and the CRL regenerates
+from the restored data on the next tick.
+
+## OCSP responder cert recovery
+
+**Symptom:** OCSP requests return 500 with errors like "responder
+not configured" or "failed to load responder key."
+
+**Diagnosis:**
+
+```bash
+psql -c "SELECT issuer_id, cert_subject, not_before, not_after,
+                created_at, key_path
+         FROM ocsp_responder_certs
+         WHERE issuer_id = 'iss-local';"
+
+# Check the on-disk responder key file (path from the row above):
+ls -la /etc/certctl/ocsp-responder-keys/iss-local.key
+```
+
+**Recovery:**
+
+```bash
+# Delete the responder row. The next OCSP request triggers
+# ensureOCSPResponder which generates a fresh keypair, signs a new
+# responder cert with the CA key (rare CA-key use), and persists
+# the new row + the on-disk key file (mode 0600 enforced).
+psql -c "DELETE FROM ocsp_responder_certs WHERE issuer_id = 'iss-local';"
+
+# If the on-disk key file is also corrupt, delete it first:
+rm -f /etc/certctl/ocsp-responder-keys/iss-local.key
+
+# Trigger the bootstrap by issuing one OCSP request:
+curl -sS --cacert /path/to/ca.crt \
+    https://certctl.example.com:8443/.well-known/pki/ocsp/iss-local/00 \
+  > /dev/null
+
+# Verify the new row + file:
+psql -c "SELECT * FROM ocsp_responder_certs WHERE issuer_id = 'iss-local';"
+ls -la /etc/certctl/ocsp-responder-keys/iss-local.key
+```
+
+The new responder cert carries the same `id-pkix-ocsp-nocheck`
+extension as the original (per RFC 6960 §4.2.2.2.1) so relying
+parties accept it without recursing through OCSP for the responder
+itself.
+
+## OCSP response cache recovery
+
+**Symptom:** an OCSP request returns a stale response (e.g. "good"
+for a cert you just revoked). This usually means the
+`InvalidateOnRevoke` wire failed to fire — see the warning logs from
+`RevocationSvc.RevokeCertificateWithActor`.
+
+**Recovery:**
+
+```bash
+# Delete the stale cache entry. The next OCSP request falls through
+# to live signing which reads the now-current revocation_status.
+psql -c "DELETE FROM ocsp_response_cache
+         WHERE issuer_id = 'iss-local' AND serial_hex = 'deadbeef...';"
+
+# Verify the next fetch returns "revoked":
+curl -sS --cacert /path/to/ca.crt \
+    https://certctl.example.com:8443/.well-known/pki/ocsp/iss-local/deadbeef... \
+  | openssl ocsp -respin /dev/stdin -resp_text -CAfile /path/to/ca.crt \
+  | grep "Cert Status"
+```
+
+For a fleet-wide invalidation (e.g. you rotated the CA key — see
+next section), nuke the whole cache:
+
+```bash
+psql -c "TRUNCATE ocsp_response_cache;"
+```
+
+The cache rebuilds organically as relying parties query. There's no
+service-degradation window because the live-sign fallback is always
+available; only the per-request CPU cost goes up until the cache
+warms back up.
+
+## CA private-key rotation
+
+**Symptom:** scheduled rotation cycle (annual or longer), or
+emergency rotation due to suspected compromise.
+
+This procedure rotates the CA private key for the local issuer.
+After rotation, every existing cert chains to the OLD CA cert which
+remains trusted by relying parties until its `notAfter` (typical
+10y); newly-issued certs chain to the NEW CA cert.
+
+**Procedure:**
+
+1. **Backup the current CA cert + key.** The on-disk paths are
+   `CERTCTL_CA_CERT_PATH` / `CERTCTL_CA_KEY_PATH` (typically
+   `/etc/certctl/ca.crt` + `/etc/certctl/ca.key`). Copy both to
+   a secure offline location with at least 2y retention (relying
+   parties may still send OCSP requests against certs the OLD CA
+   issued).
+2. **Generate a new keypair + cert.** For self-signed mode:
+   ```bash
+   openssl ecparam -name prime256v1 -genkey -noout -out new-ca.key
+   openssl req -x509 -key new-ca.key -days 3650 \
+       -subj "/CN=certctl Local CA" -out new-ca.crt
+   ```
+   For sub-CA mode, generate a CSR and have your enterprise root
+   sign it instead.
+3. **Stop certctl.** `kill -TERM <pid>` or `docker stop certctl`.
+4. **Move the new files into place + back up the old:**
+   ```bash
+   mv /etc/certctl/ca.crt /etc/certctl/ca.crt.old-rotated-20XX-XX-XX
+   mv /etc/certctl/ca.key /etc/certctl/ca.key.old-rotated-20XX-XX-XX
+   mv new-ca.crt /etc/certctl/ca.crt
+   mv new-ca.key /etc/certctl/ca.key
+   chmod 0600 /etc/certctl/ca.key
+   ```
+5. **Truncate the OCSP responder cert table** so the responder
+   bootstrap re-fires against the new CA:
+   ```bash
+   psql -c "DELETE FROM ocsp_responder_certs;"
+   ```
+6. **Truncate the CRL cache** so the next `crlGenerationLoop` tick
+   regenerates the CRL signed by the new CA:
+   ```bash
+   psql -c "TRUNCATE crl_cache;"
+   ```
+7. **Truncate the OCSP response cache** so future OCSP requests
+   live-sign with the new CA's responder cert:
+   ```bash
+   psql -c "TRUNCATE ocsp_response_cache;"
+   ```
+8. **Start certctl.** The startup preflight loads the new CA cert +
+   key. The next HTTP request bootstraps a new responder cert.
+9. **Verify:**
+   ```bash
+   # Issue a test cert
+   curl ... new-cert
+   # Confirm chain to the new CA
+   openssl x509 -in new-cert -noout -issuer
+   ```
+
+**Future:** when the HSM/PKCS#11 driver bundle (`cowork/hsm-pkcs11-
+driver-prompt.md`) ships, this rotation procedure changes
+substantially — the HSM-backed key never moves, only the cert wrap
+rotates. The signer interface seam is the load-bearing prerequisite
+for that.
+
+## Postgres restore
+
+certctl's full state lives in Postgres. The on-disk artifacts (CA
+cert/key, RA cert/key for SCEP, responder keys for OCSP, trust
+bundles for SCEP/Intune/EST mTLS) are operator-managed; everything
+else is in DB rows.
+
+**Restore procedure:**
+
+1. Stop certctl. `kill -TERM <pid>` or `docker stop certctl`.
+2. Restore the Postgres database from your point-in-time backup
+   (`pg_restore` or your managed-DB equivalent).
+3. Run any migrations newer than the backup's snapshot:
+   ```bash
+   migrate -path migrations/ -database "$DATABASE_URL" up
+   ```
+4. **Truncate the caches** that may now hold stale data referencing
+   pre-restore rows:
+   ```bash
+   psql -c "TRUNCATE crl_cache;"
+   psql -c "TRUNCATE ocsp_response_cache;"
+   ```
+5. Start certctl. The schedulers regenerate caches on their next
+   ticks.
+
+**Recoverable from DB only:** managed certificates, revocations,
+audit log, jobs, agents, owners, teams, profiles, issuer/target/
+notifier configs, scheduled tasks, network scan results.
+
+**Operator-managed (NOT in DB):**
+- CA cert + key (`CERTCTL_CA_CERT_PATH` / `CERTCTL_CA_KEY_PATH`)
+- SCEP RA cert + key per profile
+- OCSP responder keys per issuer (`CERTCTL_OCSP_RESPONDER_KEY_DIR`)
+- SCEP/Intune trust anchor PEM bundles
+- EST mTLS client CA trust bundles
+- `CERTCTL_API_KEY`, `CERTCTL_AGENT_BOOTSTRAP_TOKEN`,
+  `CERTCTL_CONFIG_ENCRYPTION_KEY`
+
+Back these up out-of-band on the same cadence as your Postgres
+backups. Without them, a restored DB is unusable.
+
+## Trust-bundle reload semantics
+
+This section codifies the fail-safe behavior that's already in code,
+for compliance auditors who need to see the procedure documented.
+
+**Pattern:** every trust-bundle holder (`internal/trustanchor.Holder`,
+used by SCEP/Intune dispatcher + EST mTLS sibling route) implements
+the same SIGHUP-equivalent reload semantics:
+
+- A bad reload (parse error, expired cert, empty bundle) keeps the
+  OLD pool in place. The endpoint stays up; the operator sees the
+  typed error in the GUI Reload modal.
+- The reload is atomic. There's no window where the holder is
+  empty or pointing at a half-loaded bundle.
+- In-flight requests use a snapshot taken at request-start. A
+  request that crosses a SIGHUP uses the OLD pool — no mid-request
+  validation drift.
+
+**Operator workflow:**
+
+1. Receive the new trust bundle (e.g., rotated Intune Connector
+   signing cert, rotated EST mTLS client CA).
+2. Overwrite the on-disk PEM file at the configured path.
+3. Trigger reload via the GUI (`/scep` Profiles tab → Reload trust
+   anchor; `/est` Profiles tab → same) OR send `kill -HUP <certctl-pid>`
+   directly.
+4. The Reload modal returns success or shows the typed error. On
+   error, fix the file (`openssl x509 -in trust.pem -noout -text`
+   to validate) and retry; the OLD pool stays in place between
+   attempts.
+
+## DR checklist
+
+Print this. Pin it near your on-call rotation.
+
+```
+☐ Backups: Postgres backup runs nightly + retention ≥ 30 days
+☐ Backups: CA cert + key offsite + retention ≥ NotAfter + 2y
+☐ Backups: OCSP responder keys offsite (or accept rotate-from-CA on restore)
+☐ Backups: Trust anchor PEMs offsite
+☐ Backups: Operator-managed env vars (API_KEY, BOOTSTRAP_TOKEN,
+  CONFIG_ENCRYPTION_KEY) in a separate secret manager
+
+☐ Quarterly: dry-run a Postgres restore into a staging environment
+☐ Quarterly: verify CA cert NotAfter > 1y
+☐ Quarterly: rotate the OCSP responder cert (auto-handled by
+  ensureOCSPResponder; verify the rotation actually fires by
+  diffing the responder row's serial_number quarter-over-quarter)
+
+☐ Annually: dry-run a full DR — restore Postgres + CA + responders
+  into a clean environment + issue + revoke a test cert end-to-end
+☐ Annually: rotate API_KEY, AGENT_BOOTSTRAP_TOKEN
+☐ Every 5y: rotate the CA private key (see CA rotation section above)
+```
+
+## Related docs
+
+- [`crl-ocsp.md`](crl-ocsp.md) — CRL/OCSP responder operator guide.
+- [`tls.md`](tls.md) — control-plane TLS bootstrap.
+- [`security.md`](security.md) — production-grade security posture.
+- [`scep-intune.md`](scep-intune.md) — SCEP/Intune trust-anchor
+  rotation specifics.
+- [`est.md`](est.md) — EST mTLS trust-bundle rotation specifics.
@@ -0,0 +1,226 @@
+# Runbook: certificate-expiry alerts (multi-channel)
+
+This runbook covers the per-policy multi-channel expiry-alert dispatch
+path that ships in certctl post-2026-05-03 (Rank 4 of the Infisical
+deep-research deliverable). It complements the operator-facing
+[Routing expiry alerts across channels](connectors.md#routing-expiry-alerts-across-channels)
+section in `docs/connectors.md`.
+
+Audience: a platform sysadmin or on-call engineer who needs to
+configure, debug, or audit certctl's expiry-alert routing. Not a
+walkthrough of how to install certctl — that lives in the README.
+
+---
+
+## End-to-end flow
+
+```mermaid
+flowchart TD
+    Tick["daily ticker (renewalCheckLoop)"]
+    Check["RenewalService.CheckExpiringCertificates"]
+
+    Tick --> Check --> Loop
+
+    subgraph Loop["for cert in expiring (≤30 days)"]
+        L1["1. Resolve RenewalPolicy"]
+        L2["2. Compute daysUntil"]
+        L3["3. updateCertExpiryStatus"]
+        L4["4. sendThresholdAlerts"]
+        L5["5. Create renewal job<br/>(if issuer registered +<br/>ARI allows)"]
+        L1 --> L2 --> L3 --> L4 --> L5
+    end
+
+    L4 --> Threshold
+
+    subgraph Threshold["per threshold"]
+        T1["a. resolve severity tier<br/>via AlertSeverityMap"]
+        T2["b. resolve channel set<br/>via AlertChannels[tier]"]
+        T1 --> T2 --> Channel
+    end
+
+    subgraph Channel["for each channel (fault-isolating)"]
+        C1["i. dedup via notification_events<br/>(cert, threshold, channel)"]
+        C2["ii. SendThresholdAlertOnChannel<br/>→ notifierRegistry[channel]<br/>→ Send(recipient, subj, body)"]
+        C3["iii. record audit row<br/>event_type=expiration_alert_sent<br/>metadata.channel, metadata.severity_tier"]
+        C4["iv. bump Prometheus counter<br/>certctl_expiry_alerts_total<br/>{channel, threshold, result}"]
+        C1 --> C2 --> C3 --> C4
+    end
+```
+
+The dispatch loop's per-channel error handling is
+**fault-isolating**: PagerDuty's failure does NOT skip Slack/Email
+at the same threshold. Each channel runs independently, with its
+own dedup row + audit row + metric increment.
+
+---
+
+## Configuring the per-policy channel matrix
+
+The matrix is a property of `RenewalPolicy`. Two new JSONB columns
+on the `renewal_policies` table back it (migration 000026):
+
+- `alert_channels JSONB` — `map[severity_tier][]channel_name`. Default `{}`
+  → fall through to `DefaultAlertChannels` (Email-only at every tier).
+- `alert_severity_map JSONB` — `map[threshold_days]severity_tier`. Default
+  `{}` → fall through to `DefaultAlertSeverityMap` (`30→informational,
+  14→warning, 7→warning, 0→critical`).
+
+### Example: production-grade routing
+
+```bash
+curl -X PUT https://certctl.example.com/api/v1/renewal-policies/rp-production \
+  -H 'Authorization: Bearer ${TOKEN}' \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "name": "Production CDN renewal policy",
+    "renewal_window_days": 30,
+    "auto_renew": true,
+    "max_retries": 3,
+    "retry_interval_seconds": 300,
+    "alert_thresholds_days": [30, 14, 7, 0],
+    "alert_channels": {
+      "informational": ["Slack"],
+      "warning":       ["Slack", "Email"],
+      "critical":      ["PagerDuty", "OpsGenie", "Email"]
+    },
+    "alert_severity_map": {
+      "30": "informational",
+      "14": "warning",
+      "7":  "warning",
+      "0":  "critical"
+    }
+  }'
+```
+
+After this PUT, the next renewal-loop tick that finds a cert under
+this policy will fan out alerts as documented above.
+
+### Example: opt out of informational alerts
+
+If your team doesn't want T-30 informational alerts (you'd rather
+hear about a cert only at warning tier and beyond):
+
+```json
+"alert_channels": {
+  "informational": [],
+  "warning":       ["Email"],
+  "critical":      ["PagerDuty", "Email"]
+}
+```
+
+The empty `informational` list causes the dispatch loop to record
+an `expiration_alert_skipped_no_channels` audit row at T-30 and
+skip the dispatch. Other tiers still fire.
+
+---
+
+## Operator playbook
+
+### "Did the on-call team get paged?"
+
+```sql
+SELECT created_at,
+       metadata->>'channel'        AS channel,
+       metadata->>'threshold_days' AS threshold,
+       metadata->>'severity_tier'  AS severity
+FROM audit_events
+WHERE event_type = 'expiration_alert_sent'
+  AND resource_id = '<cert-id>'
+ORDER BY created_at DESC;
+```
+
+One row per (channel, threshold) attempt. If you see a row with
+`channel = 'PagerDuty'` and `severity = 'critical'`, the page went
+out (or was at least dispatched to the notifier).
+
+### "Why didn't I get an alert at T-7?"
+
+Three places to look:
+
+1. **Audit log** — `SELECT FROM audit_events WHERE event_type IN
+   ('expiration_alert_sent','expiration_alert_skipped_no_channels',
+   'expiration_alert_skipped_invalid_channel') AND resource_id =
+   '<cert-id>'`. If `expiration_alert_skipped_no_channels` appears,
+   your policy's tier list is empty for the resolved tier. If
+   `expiration_alert_skipped_invalid_channel` appears, your matrix
+   has a typo (the `metadata->>'invalid_channel'` field tells you
+   which value).
+
+2. **Notifications table** —
+   `SELECT FROM notification_events WHERE certificate_id = '<cert-id>'
+   AND type = 'ExpirationWarning' ORDER BY created_at DESC`. If
+   rows exist with `channel = 'Slack'` and `status = 'failed'`,
+   the dispatch reached the channel but the channel rejected the
+   send. Look at the `error` column for the upstream message.
+
+3. **Prometheus counters** —
+   `curl /api/v1/metrics/prometheus | grep certctl_expiry_alerts_total`.
+   Sustained `{result="failure"}` counts indicate a notifier
+   connector misconfiguration (bad webhook URL, expired API key,
+   etc.).
+
+### "How do I test the matrix without waiting for a real expiry?"
+
+certctl ships an admin endpoint for this:
+
+```bash
+curl -X POST https://certctl.example.com/api/v1/admin/notifications/test \
+  -H 'Authorization: Bearer ${TOKEN}' \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "certificate_id": "mc-test-cert",
+    "threshold_days": 0,
+    "channel": "PagerDuty"
+  }'
+```
+
+This calls `NotificationService.SendThresholdAlertOnChannel`
+directly and bypasses the renewal loop's threshold check. Useful
+for "did I configure PagerDuty correctly?" without having to set
+up a deliberately-expiring cert. The admin endpoint requires
+`role=admin` (V3-Pro RBAC); V2 deploys gate it on the bearer
+token only.
+
+### "How do I rotate a notifier credential without downtime?"
+
+1. Update the `CERTCTL_PAGERDUTY_ROUTING_KEY` (or equivalent) env
+   var in your deployment.
+2. Restart `certctl-server`. The notifier registry rebuilds
+   with the new credential.
+3. Confirm with the admin-test endpoint above against the cert
+   you most care about.
+
+The renewal loop is idempotent — a missed tick during the restart
+window does NOT cause double-dispatch on the next tick (per-channel
+dedup on the `notification_events` table guards against that).
+
+---
+
+## Cardinality + cost
+
+- Default 6 channels × 4 thresholds × 3 results = **72 Prometheus series**.
+- Custom-thresholds policies (e.g. `[60, 45, 30, 14, 7, 3, 1, 0]`)
+  expand the threshold dimension proportionally — 6 × 8 × 3 = 144 series.
+- Closed-enum discipline at the dispatch site means typos in
+  `alert_channels` do NOT grow this count.
+- A daily renewal-loop tick over 10K certs each policy-bound to the
+  matrix above produces O(channels × thresholds × certs) audit rows
+  + notification rows in the worst case (every cert has crossed
+  every threshold and no dedup applies). Operators sizing
+  Postgres should plan for an `audit_events` row count on the
+  order of `unique_certs × channels_per_critical_tier` per fan-out
+  batch — which is ~3-5× the pre-Rank-4 row count.
+
+---
+
+## V3-Pro forward path
+
+Tracked at `cowork/WORKSPACE-ROADMAP.md` under "Adapter hardening":
+
+- Per-owner / per-team / per-tenant channel routing (the matrix is
+  per-policy today, not per-owner).
+- Calendar-aware suppression (no T-30 alerts on weekends for non-
+  on-call teams).
+- Escalation chains (T-1 unanswered for 30m → escalate to
+  manager's PagerDuty).
+- Per-channel rate limiting (downstream of I-005's retry+DLQ).