docs: Phase 2 mechanical file moves to subdirectory structure

Pure git mv operations; no content edits. Internal links remain pointing at old paths and will be fixed in Phase 11. Per the Phase 1 audit recommendations at cowork/docs-overhaul-phase-1-audit-2026-05-04/.

35 files moved across 8 audience-organized subdirectories:

docs/getting-started/ (5): quickstart.md, concepts.md, examples.md, advanced-demo.md (was demo-advanced.md), why-certctl.md
docs/reference/ (6): architecture.md, api.md (was openapi.md), mcp.md, intermediate-ca-hierarchy.md, deployment-model.md (was deployment-atomicity.md), vendor-matrix.md (was deployment-vendor-matrix.md)
docs/reference/protocols/ (6): acme-server.md, acme-server-threat-model.md, scep-intune.md, est.md, crl-ocsp.md, async-ca-polling.md (was async-polling.md)
docs/operator/ (4): security.md, tls.md, database-tls.md, approval-workflow.md
docs/operator/runbooks/ (3): cloud-targets.md (was runbook-cloud-targets.md), expiry-alerts.md (was runbook-expiry-alerts.md), disaster-recovery.md
docs/migration/ (3): from-certbot.md (was migrate-from-certbot.md), from-acmesh.md (was migrate-from-acmesh.md), cert-manager-coexistence.md (was certctl-for-cert-manager-users.md)
docs/compliance/ (4): index.md (was compliance.md), soc2.md (was compliance-soc2.md), pci-dss.md (was compliance-pci-dss.md), nist-sp-800-57.md (was compliance-nist.md)
docs/contributor/ (4): testing-strategy.md, test-environment.md (was test-env.md), ci-pipeline.md, qa-test-suite.md (was qa-test-guide.md)

Deferred to later Phase 2 sub-phases:
- connectors.md split (Phase 4): docs/connectors.md + docs/connector-{apache,f5,iis,k8s,nginx}.md still at top level
- testing-guide.md prune (Phase 5): docs/testing-guide.md still at top level
- features.md disperse (Phase 6): docs/features.md still at top level
- legacy-est-scep.md split (Phase 7): docs/legacy-est-scep.md still at top level
- ACME walkthrough re-homing (Phase 8): three docs/acme-*-walkthrough.md still at top level
- Upgrade docs archive (Phase 3): two docs/upgrade-*.md still at top level

Cross-reference updates (Phase 11) will happen after all moves and content edits land. Internal links to docs/* paths are temporarily broken until that phase completes.
2026-06-07 14:21:37 +00:00 · 2026-05-05 02:49:28 +00:00
parent cda957f302
commit 3a807ae37e
35 changed files with 0 additions and 0 deletions
@@ -0,0 +1,226 @@
+# Runbook: certificate-expiry alerts (multi-channel)
+
+This runbook covers the per-policy multi-channel expiry-alert dispatch
+path that ships in certctl post-2026-05-03 (Rank 4 of the Infisical
+deep-research deliverable). It complements the operator-facing
+[Routing expiry alerts across channels](connectors.md#routing-expiry-alerts-across-channels)
+section in `docs/connectors.md`.
+
+Audience: a platform sysadmin or on-call engineer who needs to
+configure, debug, or audit certctl's expiry-alert routing. Not a
+walkthrough of how to install certctl — that lives in the README.
+
+---
+
+## End-to-end flow
+
+```mermaid
+flowchart TD
+    Tick["daily ticker (renewalCheckLoop)"]
+    Check["RenewalService.CheckExpiringCertificates"]
+
+    Tick --> Check --> Loop
+
+    subgraph Loop["for cert in expiring (≤30 days)"]
+        L1["1. Resolve RenewalPolicy"]
+        L2["2. Compute daysUntil"]
+        L3["3. updateCertExpiryStatus"]
+        L4["4. sendThresholdAlerts"]
+        L5["5. Create renewal job<br/>(if issuer registered +<br/>ARI allows)"]
+        L1 --> L2 --> L3 --> L4 --> L5
+    end
+
+    L4 --> Threshold
+
+    subgraph Threshold["per threshold"]
+        T1["a. resolve severity tier<br/>via AlertSeverityMap"]
+        T2["b. resolve channel set<br/>via AlertChannels[tier]"]
+        T1 --> T2 --> Channel
+    end
+
+    subgraph Channel["for each channel (fault-isolating)"]
+        C1["i. dedup via notification_events<br/>(cert, threshold, channel)"]
+        C2["ii. SendThresholdAlertOnChannel<br/>→ notifierRegistry[channel]<br/>→ Send(recipient, subj, body)"]
+        C3["iii. record audit row<br/>event_type=expiration_alert_sent<br/>metadata.channel, metadata.severity_tier"]
+        C4["iv. bump Prometheus counter<br/>certctl_expiry_alerts_total<br/>{channel, threshold, result}"]
+        C1 --> C2 --> C3 --> C4
+    end
+```
+
+The dispatch loop's per-channel error handling is
+**fault-isolating**: PagerDuty's failure does NOT skip Slack/Email
+at the same threshold. Each channel runs independently, with its
+own dedup row + audit row + metric increment.
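A minimal Python sketch of the fault-isolating pattern follows. It is illustrative only, not certctl's code: `dispatch_threshold`, the `notifiers` dict, and the in-memory `audit` and `metrics` objects are hypothetical stand-ins for `notifierRegistry`, the `audit_events` table, and `certctl_expiry_alerts_total`. The `success` result label is an assumption (this runbook only names `failure`), and dedup is left out for brevity.

```python
from collections import Counter


def dispatch_threshold(cert_id, threshold_days, channels, notifiers, audit, metrics):
    """Try every channel; one channel's failure never skips the others."""
    for channel in channels:
        try:
            notifiers[channel](cert_id, threshold_days)
            result = "success"  # assumed label name
        except Exception:
            # Isolate the fault to this channel and keep iterating.
            result = "failure"
        # One audit entry and one counter increment per channel attempt.
        audit.append((cert_id, threshold_days, channel, result))
        metrics[(channel, threshold_days, result)] += 1


def broken_pagerduty(cert_id, threshold_days):
    # Hypothetical notifier that simulates an upstream outage.
    raise ConnectionError("simulated PagerDuty outage")


def working_notifier(cert_id, threshold_days):
    # Hypothetical notifier that always succeeds.
    return None


audit = []
metrics = Counter()
notifiers = {
    "PagerDuty": broken_pagerduty,
    "Slack": working_notifier,
    "Email": working_notifier,
}
dispatch_threshold(
    "mc-test-cert", 0, ["PagerDuty", "Slack", "Email"], notifiers, audit, metrics
)
```

After the call, `audit` holds three entries in channel order, and only the PagerDuty entry carries the `failure` result; Slack and Email were still attempted.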
+
+---
+
+## Configuring the per-policy channel matrix
+
+The matrix is a property of `RenewalPolicy`. Two new JSONB columns
+on the `renewal_policies` table back it (migration 000026):
+
+- `alert_channels JSONB` — `map[severity_tier][]channel_name`. Default `{}`
+  → fall through to `DefaultAlertChannels` (Email-only at every tier).
+- `alert_severity_map JSONB` — `map[threshold_days]severity_tier`. Default
+  `{}` → fall through to `DefaultAlertSeverityMap` (`30→informational,
+  14→warning, 7→warning, 0→critical`).
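The fall-through rules can be sketched in Python as below. `resolve_channels` is a hypothetical helper, not a certctl function; the default values mirror the ones listed above. How certctl treats a tier or threshold that is missing from a non-empty map is not stated in this runbook, so the sketch assumes it yields no channels.

```python
DEFAULT_ALERT_SEVERITY_MAP = {
    30: "informational",
    14: "warning",
    7: "warning",
    0: "critical",
}
DEFAULT_ALERT_CHANNELS = {
    "informational": ["Email"],
    "warning": ["Email"],
    "critical": ["Email"],
}


def resolve_channels(threshold_days, alert_severity_map, alert_channels):
    """Return (tier, channels) for a threshold, applying the documented defaults."""
    # An empty map ({}), the column default, falls through to the built-in default.
    severity_map = alert_severity_map or DEFAULT_ALERT_SEVERITY_MAP
    channels_by_tier = alert_channels or DEFAULT_ALERT_CHANNELS
    # JSON object keys are strings ("30"), so normalise them before the lookup.
    normalised = {int(key): tier for key, tier in severity_map.items()}
    tier = normalised.get(threshold_days)
    # Assumption: a tier missing from a non-empty matrix yields no channels.
    return tier, list(channels_by_tier.get(tier, []))


default_result = resolve_channels(7, {}, {})
custom_result = resolve_channels(
    0, {"0": "critical"}, {"critical": ["PagerDuty", "Email"]}
)
```

With both columns left at `{}`, the T-7 threshold resolves to the `warning` tier and the Email-only default; with the custom maps, T-0 resolves to `critical` with PagerDuty and Email.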
+
+### Example: production-grade routing
+
+```bash
+curl -X PUT https://certctl.example.com/api/v1/renewal-policies/rp-production \
+  -H "Authorization: Bearer ${TOKEN}" \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "name": "Production CDN renewal policy",
+    "renewal_window_days": 30,
+    "auto_renew": true,
+    "max_retries": 3,
+    "retry_interval_seconds": 300,
+    "alert_thresholds_days": [30, 14, 7, 0],
+    "alert_channels": {
+      "informational": ["Slack"],
+      "warning":       ["Slack", "Email"],
+      "critical":      ["PagerDuty", "OpsGenie", "Email"]
+    },
+    "alert_severity_map": {
+      "30": "informational",
+      "14": "warning",
+      "7":  "warning",
+      "0":  "critical"
+    }
+  }'
+```
+
+After this PUT, the next renewal-loop tick that finds a cert under
+this policy will fan out alerts as documented above.
+
+### Example: opt out of informational alerts
+
+If your team doesn't want T-30 informational alerts (you'd rather
+hear about a cert only at warning tier and beyond):
+
+```json
+"alert_channels": {
+  "informational": [],
+  "warning":       ["Email"],
+  "critical":      ["PagerDuty", "Email"]
+}
+```
+
+The empty `informational` list causes the dispatch loop to record
+an `expiration_alert_skipped_no_channels` audit row at T-30 and
+skip the dispatch. Other tiers still fire.
+
+---
+
+## Operator playbook
+
+### "Did the on-call team get paged?"
+
+```sql
+SELECT created_at,
+       metadata->>'channel'        AS channel,
+       metadata->>'threshold_days' AS threshold,
+       metadata->>'severity_tier'  AS severity
+FROM audit_events
+WHERE event_type = 'expiration_alert_sent'
+  AND resource_id = '<cert-id>'
+ORDER BY created_at DESC;
+```
+
+One row per (channel, threshold) attempt. If you see a row with
+`channel = 'PagerDuty'` and `severity = 'critical'`, the page went
+out (or was at least dispatched to the notifier).
+
+### "Why didn't I get an alert at T-7?"
+
+Three places to look:
+
+1. **Audit log**: `SELECT * FROM audit_events WHERE event_type IN
+   ('expiration_alert_sent','expiration_alert_skipped_no_channels',
+   'expiration_alert_skipped_invalid_channel') AND resource_id =
+   '<cert-id>'`. If `expiration_alert_skipped_no_channels` appears,
+   your policy's tier list is empty for the resolved tier. If
+   `expiration_alert_skipped_invalid_channel` appears, your matrix
+   has a typo (the `metadata->>'invalid_channel'` field tells you
+   which value).
+
+2. **Notifications table**:
+   `SELECT * FROM notification_events WHERE certificate_id = '<cert-id>'
+   AND type = 'ExpirationWarning' ORDER BY created_at DESC`. If
+   rows exist with `channel = 'Slack'` and `status = 'failed'`,
+   the dispatch reached the channel but the channel rejected the
+   send. Look at the `error` column for the upstream message.
+
+3. **Prometheus counters**:
+   `curl https://certctl.example.com/api/v1/metrics/prometheus | grep certctl_expiry_alerts_total`.
+   Sustained `{result="failure"}` counts indicate a notifier
+   connector misconfiguration (bad webhook URL, expired API key,
+   etc.).
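As a triage aid, the mapping from a tier's channel list to the audit `event_type` values described above can be sketched in Python. `expected_audit_events` is a hypothetical helper, not part of certctl. `KNOWN_CHANNELS` holds only the four channel names that appear in this runbook (the real closed enum is larger), and exact, case-sensitive matching with per-channel skipping is an assumption.

```python
# Assumption: only the channel names mentioned in this runbook.
KNOWN_CHANNELS = {"Slack", "Email", "PagerDuty", "OpsGenie"}


def expected_audit_events(tier_channels):
    """Predict the audit event_type values for one threshold's channel list."""
    if not tier_channels:
        # Empty tier list: the dispatch is skipped as a whole.
        return ["expiration_alert_skipped_no_channels"]
    events = []
    for channel in tier_channels:
        if channel in KNOWN_CHANNELS:
            events.append("expiration_alert_sent")
        else:
            # A typo such as "PagerDutty" is rejected by the closed enum.
            events.append("expiration_alert_skipped_invalid_channel")
    return events


empty_tier = expected_audit_events([])
typo_tier = expected_audit_events(["Slack", "PagerDutty"])
```

If the rows in `audit_events` for a threshold do not match what this kind of check predicts from the policy's matrix, the policy may have changed since the tick ran.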
+
+### "How do I test the matrix without waiting for a real expiry?"
+
+certctl ships an admin endpoint for this:
+
+```bash
+curl -X POST https://certctl.example.com/api/v1/admin/notifications/test \
+  -H "Authorization: Bearer ${TOKEN}" \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "certificate_id": "mc-test-cert",
+    "threshold_days": 0,
+    "channel": "PagerDuty"
+  }'
+```
+
+This calls `NotificationService.SendThresholdAlertOnChannel`
+directly and bypasses the renewal loop's threshold check. Useful
+for "did I configure PagerDuty correctly?" without having to set
+up a deliberately-expiring cert. The admin endpoint requires
+`role=admin` (V3-Pro RBAC); V2 deploys gate it on the bearer
+token only.
+
+### "How do I rotate a notifier credential without downtime?"
+
+1. Update the `CERTCTL_PAGERDUTY_ROUTING_KEY` (or equivalent) env
+   var in your deployment.
+2. Restart `certctl-server`. The notifier registry rebuilds
+   with the new credential.
+3. Confirm with the admin-test endpoint above against the cert
+   you most care about.
+
+The renewal loop is idempotent — a missed tick during the restart
+window does NOT cause double-dispatch on the next tick (per-channel
+dedup on the `notification_events` table guards against that).
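The dedup guarantee amounts to keying every send on (cert, threshold, channel). A minimal Python sketch, with an in-memory set standing in for the `notification_events` table and a hypothetical `run_tick` helper (neither is certctl's actual implementation):

```python
def run_tick(cert_id, crossed_thresholds, channels, sent_keys):
    """Return the (threshold, channel) pairs dispatched on this tick."""
    dispatched = []
    for threshold in crossed_thresholds:
        for channel in channels:
            key = (cert_id, threshold, channel)
            if key in sent_keys:
                # Already alerted for this key: skip, so no double-dispatch.
                continue
            sent_keys.add(key)
            dispatched.append((threshold, channel))
    return dispatched


sent_keys = set()  # stands in for rows in notification_events
first = run_tick("mc-test-cert", [30, 14], ["Slack", "Email"], sent_keys)
second = run_tick("mc-test-cert", [30, 14], ["Slack", "Email"], sent_keys)
```

The first tick dispatches all four (threshold, channel) pairs; the second tick, run against the same state, dispatches nothing.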
+
+---
+
+## Cardinality + cost
+
+- Default 6 channels × 4 thresholds × 3 results = **72 Prometheus series**.
+- Custom-thresholds policies (e.g. `[60, 45, 30, 14, 7, 3, 1, 0]`)
+  expand the threshold dimension proportionally — 6 × 8 × 3 = 144 series.
+- Closed-enum discipline at the dispatch site means typos in
+  `alert_channels` do NOT grow this count.
+- A daily renewal-loop tick over 10K certs each policy-bound to the
+  matrix above produces O(channels × thresholds × certs) audit rows
+  and notification rows in the worst case (every cert has crossed
+  every threshold and no dedup applies). Operators sizing
+  Postgres should plan for an `audit_events` row count on the
+  order of `unique_certs × channels_per_critical_tier` per fan-out
+  batch — which is ~3-5× the pre-Rank-4 row count.
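The series arithmetic above can be checked with a few lines of Python. `series_count` is a hypothetical helper; the default of three result values comes from the first bullet.

```python
def series_count(channels, thresholds, results=3):
    """Upper bound on certctl_expiry_alerts_total series: one per label combination."""
    return channels * thresholds * results


default_series = series_count(channels=6, thresholds=4)
custom_thresholds = [60, 45, 30, 14, 7, 3, 1, 0]
custom_series = series_count(channels=6, thresholds=len(custom_thresholds))
```

Here `default_series` is 72 and `custom_series` is 144, matching the figures in the list.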
+
+---
+
+## V3-Pro forward path
+
+Tracked at `cowork/WORKSPACE-ROADMAP.md` under "Adapter hardening":
+
+- Per-owner / per-team / per-tenant channel routing (the matrix is
+  per-policy today, not per-owner).
+- Calendar-aware suppression (no T-30 alerts on weekends for
+  non-on-call teams).
+- Escalation chains (T-1 unanswered for 30m → escalate to
+  manager's PagerDuty).
+- Per-channel rate limiting (downstream of I-005's retry+DLQ).