notifications: per-policy multi-channel expiry-alert routing

Closes Rank 4 of the 2026-05-03 Infisical deep-research deliverable (see cowork/infisical-deep-research-results.md Part 5).

Pre-fix, RenewalService.CheckExpiringCertificates already ran daily, RenewalPolicy.AlertThresholdsDays drove per-cert thresholds, and NotificationService.SendThresholdAlert deduped per (cert, threshold), but the channel was hardcoded to Email (internal/service/notification.go:118 pre-fix). Operators who configured PagerDuty / Slack / Teams / OpsGenie via CERTCTL_PAGERDUTY_ROUTING_KEY etc. got nothing at any threshold unless SMTP was also wired. Their first signal of an expired cert was a 3 AM outage.

This commit lands the routing matrix on top of the existing infrastructure:

1. RenewalPolicy gains AlertChannels (per-tier channel list) + AlertSeverityMap (per-threshold tier assignment) + EffectiveAlertChannels / EffectiveAlertSeverity accessors. Default*() helpers preserve the back-compat Email-only behaviour for operators who haven't touched their policies post-upgrade. Migration 000026 adds the JSONB columns idempotently.

2. NotificationService.SendThresholdAlertOnChannel: the new per-channel dispatch helper. Old SendThresholdAlert stays as an Email-only alias so non-policy callers (admin "send test alert" surfaces) keep working byte-for-byte.

3. NotificationService.HasThresholdNotificationOnChannel: per-(cert, threshold, channel) deduplication so a transient PagerDuty 5xx today does NOT suppress today's Slack alert and tomorrow's PagerDuty retry will still fire.

4. RenewalService.sendThresholdAlerts walks the resolved channel set per threshold tier, fans out to every configured channel, handles per-channel failures independently, defensively drops off-enum channels with an audit row trail, and records a per-channel audit event with metadata.channel + metadata.severity_tier.

5. service.ExpiryAlertMetrics: atomic counter table mirrored on the VaultRenewalMetrics shape from the 2026-05-03 audit fix #5 (commit ceca364). Three labels: channel × threshold × result (success / failure / deduped). Cardinality bound: 6 × 4 × 3 = 72 series for the standard 4-threshold matrix.

6. handler.MetricsHandler.SetExpiryAlerts wires the Prometheus exposer for certctl_expiry_alerts_total{channel,threshold,result}. Pre-sorted snapshot for byte-stable emission.

7. cmd/server/main.go threads ONE service.ExpiryAlertMetrics instance through both the recording side (notificationService.SetExpiryAlertMetrics) and the exposing side (metricsHandler.SetExpiryAlerts).

Dispatch flow (post-fix, per renewal-loop tick):

  cert ages past T-30
    → daily renewal-loop fires
    → policy lookup
    → for each crossed threshold:
        - resolve severity tier (informational/warning/critical) via AlertSeverityMap
        - look up channel set in AlertChannels[tier]
        - for each channel: dedup → SendThresholdAlertOnChannel → notifierRegistry[channel] → audit row → Prometheus counter increment

Tests (internal/service/renewal_expiry_alerts_test.go):

  TestExpiryAlerts_DefaultMatrix_EmailOnly
  TestExpiryAlerts_PerTierFanOut
  TestExpiryAlerts_PerChannelDedup
  TestExpiryAlerts_OneChannelFails_OthersStillFire
  TestExpiryAlerts_OffEnumChannelDropped
  TestExpiryAlerts_MetricCounterIncrements
  TestExpiryAlerts_NilPolicy_FallsToDefault
  TestExpiryAlerts_OperatorOptOutOfTier

The PerTierFanOut test wires 6 mock notifiers, drives a cert at 0 days through the canonical 4 thresholds with the matrix {informational:[Slack], warning:[Slack,Email], critical:[PagerDuty,OpsGenie,Email]}, and asserts the exact recipient counts: Slack=3, Email=3, PagerDuty=1, OpsGenie=1, no Teams, no Webhook. The OneChannelFails test pins that PagerDuty returning a 503 does NOT skip Slack/Email at the same threshold.

Drive-by fix (internal/service/testutil_test.go): the existing mockNotifRepo.List ignored its filter and returned all rows, which let legacy tests pass on dedup-via-substring even though the postgres repo actually applied the filter. Updated the mock to honour CertificateID / Type / Status / Channel / MessageLike filters in the same shape as the postgres implementation (internal/repository/postgres/notification.go). All pre-existing service tests still pass; the legacy test suite happened to be robust to the mock filter doing nothing.

Documentation:

- docs/connectors.md Notifier section gains "Routing expiry alerts across channels": operator-facing, JSON example, procurement playbook ("How do I make sure PagerDuty pages on the T-1 alert?"), debug recipe via SQL on audit_events + notification_events + Prometheus.
- docs/runbook-expiry-alerts.md: sysadmin-grade flowchart, per-policy channel-matrix configuration recipes, "did the on-call team get paged?" SQL queries, cardinality budget, V3-Pro forward path.
- cowork/WORKSPACE-ROADMAP.md gains "Multi-channel expiry alerts: per-owner routing" V3-Pro entry under Adapter hardening.

Out of scope (intentional, flagged in V3-Pro forward path):

- Per-owner / per-team / per-tenant channel routing (matrix is per-policy today, not per-owner).
- Calendar-aware suppression (no T-30 alerts on weekends).
- Escalation chains (T-1 unanswered for 30m → escalate).
- Per-channel rate limiting (downstream of I-005 retry+DLQ).

CHANGELOG.md is intentionally not hand-edited per CHANGELOG.md itself ("no longer maintains a hand-edited per-version changelog; per-release notes are auto-generated from commit messages between consecutive tags").

Verified locally:

- gofmt clean.
- go vet ./internal/domain/... ./internal/service/... ./internal/api/handler/... ./cmd/server/... clean. (./internal/repository/postgres/... vet failed on transitive testcontainers/docker module download: sandbox disk pressure, not a code issue; postgres-repo build succeeds and tests pass.)
- go test -short -count=1 ./internal/domain/... ./internal/service/... ./internal/api/handler/... green.
- go test -race -count=10 -run 'TestExpiryAlerts' ./internal/service/... green (per-channel dedup race-free).

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 4.
Acquisition prompt: cowork/rank-4-multichannel-expiry-alerts-prompt.md.
2026-06-07 19:11:30 +00:00 · 2026-05-03 22:12:32 +00:00
parent 5fd1e71477
commit 6af95ccf5f
13 changed files with 1694 additions and 37 deletions
@@ -1440,6 +1440,54 @@ type Connector interface {

 Built-in notifiers: **Email** (SMTP), **Webhook** (HTTP POST), **Slack** (incoming webhook), **Microsoft Teams** (MessageCard webhook), **PagerDuty** (Events API v2), and **OpsGenie** (Alert API v2).

+### Routing expiry alerts across channels
+
+certctl-server runs a daily renewal-check loop that scans for managed certificates approaching expiry. For each cert that has crossed a configured threshold (default `[30, 14, 7, 0]` days), an `ExpirationWarning` notification is dispatched. **Pre-2026-05-03**, dispatch went exclusively via the `Email` channel — operators with PagerDuty / Slack / Teams / OpsGenie wired up received nothing at any threshold unless SMTP was also configured. Rank 4 of the 2026-05-03 Infisical deep-research deliverable closed that gap with a per-policy channel-matrix.
+
+**The matrix lives on `RenewalPolicy`:**
+
+```json
+{
+  "id": "rp-production",
+  "name": "Production CDN renewal policy",
+  "renewal_window_days": 30,
+  "alert_thresholds_days": [30, 14, 7, 0],
+  "alert_channels": {
+    "informational": ["Slack"],
+    "warning":       ["Slack", "Email"],
+    "critical":      ["PagerDuty", "OpsGenie", "Email"]
+  },
+  "alert_severity_map": {
+    "30": "informational",
+    "14": "warning",
+    "7":  "warning",
+    "0":  "critical"
+  }
+}
+```
+
+The runtime resolves the threshold's severity tier (via `alert_severity_map`, falling back to the default `30→informational, 14→warning, 7→warning, 0→critical` when unset), then dispatches one notification per channel listed under that tier in `alert_channels`. Each (cert, threshold, channel) triple is independently deduplicated via the `notification_events` table — a transient PagerDuty 5xx today does NOT suppress today's Slack alert, and tomorrow's renewal-loop tick will re-attempt the failed PagerDuty page.
+
+**Backwards compatibility.** A policy with `alert_channels` unset (or empty) falls through to `DefaultAlertChannels` which routes every tier to `["Email"]`. Operators who haven't touched their renewal-policy configs see exactly the pre-2026-05-03 behaviour, and SMTP-only deployments keep working as before.
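
The resolution and fallback rules described above can be sketched in Go. This is a minimal illustration under assumptions, not the certctl source: the `RenewalPolicy` struct here models only the two matrix fields, the helper names (`effectiveChannels`, `effectiveSeverity`, `channelsFor`) are hypothetical, and returning no channels for a threshold that is absent from the severity map is an assumed behaviour.

```go
package main

import "fmt"

// RenewalPolicy is a hypothetical stand-in that models only the matrix fields.
type RenewalPolicy struct {
	AlertChannels    map[string][]string // severity tier -> channel names
	AlertSeverityMap map[int]string      // threshold (days) -> severity tier
}

// Defaults as documented: Email at every tier, 30/14/7/0 tier mapping.
var defaultChannels = map[string][]string{
	"informational": {"Email"},
	"warning":       {"Email"},
	"critical":      {"Email"},
}

var defaultSeverity = map[int]string{
	30: "informational", 14: "warning", 7: "warning", 0: "critical",
}

// effectiveChannels falls back to Email-only when the policy is nil
// or its channel matrix is unset/empty.
func effectiveChannels(p *RenewalPolicy) map[string][]string {
	if p == nil || len(p.AlertChannels) == 0 {
		return defaultChannels
	}
	return p.AlertChannels
}

// effectiveSeverity falls back to the default tier mapping when unset.
func effectiveSeverity(p *RenewalPolicy) map[int]string {
	if p == nil || len(p.AlertSeverityMap) == 0 {
		return defaultSeverity
	}
	return p.AlertSeverityMap
}

// channelsFor resolves threshold -> tier -> channel list.
// Assumption: a threshold with no tier assignment yields no channels.
func channelsFor(p *RenewalPolicy, threshold int) []string {
	tier, ok := effectiveSeverity(p)[threshold]
	if !ok {
		return nil
	}
	return effectiveChannels(p)[tier]
}

func main() {
	custom := &RenewalPolicy{AlertChannels: map[string][]string{
		"informational": {"Slack"},
		"warning":       {"Slack", "Email"},
		"critical":      {"PagerDuty", "OpsGenie", "Email"},
	}}
	for _, t := range []int{30, 14, 7, 0} {
		fmt.Printf("T-%d default=%v custom=%v\n", t, channelsFor(nil, t), channelsFor(custom, t))
	}
}
```

Note that a policy which sets `AlertChannels` but leaves `AlertSeverityMap` empty still gets the default tier mapping, since the two fallbacks are independent in this sketch.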
+
+**Validation.** Off-enum severity tiers (anything other than `informational` / `warning` / `critical`) and off-enum channels (anything other than `Email` / `Webhook` / `Slack` / `Teams` / `PagerDuty` / `OpsGenie`) are dropped at the dispatch site without failing the rest of the dispatch, and the drop is recorded in the audit log as `expiration_alert_skipped_invalid_channel` so an operator can search for typos. The `RenewalPolicyService.Create`/`Update` paths also reject these values at write time, so a fresh policy with bad values never persists.
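
A write-time check of the two closed enums might look like the following Go sketch. The function name `validateMatrix`, the string-keyed severity map, and the case-sensitive matching are assumptions made for illustration; the actual `RenewalPolicyService` validation may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// Closed enums as listed in the documentation above.
var validTiers = map[string]bool{
	"informational": true, "warning": true, "critical": true,
}

var validChannels = map[string]bool{
	"Email": true, "Webhook": true, "Slack": true,
	"Teams": true, "PagerDuty": true, "OpsGenie": true,
}

// validateMatrix returns one message per off-enum value, sorted so the
// output is stable across runs (Go map iteration order is random).
// Assumption: matching is case-sensitive, so "Pagerduty" is rejected.
func validateMatrix(channels map[string][]string, severity map[string]string) []string {
	var problems []string
	for tier, list := range channels {
		if !validTiers[tier] {
			problems = append(problems, fmt.Sprintf("alert_channels: unknown tier %q", tier))
		}
		for _, ch := range list {
			if !validChannels[ch] {
				problems = append(problems, fmt.Sprintf("alert_channels[%s]: unknown channel %q", tier, ch))
			}
		}
	}
	for threshold, tier := range severity {
		if !validTiers[tier] {
			problems = append(problems, fmt.Sprintf("alert_severity_map[%s]: unknown tier %q", threshold, tier))
		}
	}
	sort.Strings(problems)
	return problems
}

func main() {
	problems := validateMatrix(
		map[string][]string{"critical": {"Pagerduty", "Email"}}, // typo in channel name
		map[string]string{"0": "critical", "7": "urgent"},       // "urgent" is not a tier
	)
	for _, p := range problems {
		fmt.Println(p)
	}
}
```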
+
+**Procurement playbook: "I want PagerDuty when a cert is 24h from expiry."** Configure your renewal policy with `alert_severity_map["0"] = "critical"` (already the default) and `alert_channels.critical = ["PagerDuty", "Email"]`. Set the `CERTCTL_PAGERDUTY_ROUTING_KEY` env var on the server and restart it. The next renewal-loop tick that finds a cert at ≤0 days will create a PagerDuty incident via the Events API v2 AND email the cert owner. Confirm with `curl https://certctl.example.com/api/v1/metrics/prometheus | grep certctl_expiry_alerts_total` (substituting your own server host): the `{channel="PagerDuty",threshold="0",result="success"}` series increments once per successful PagerDuty dispatch at that threshold.
+
+**Operator runbook for "did the on-call team get paged?"** Run:
+
+```sql
+SELECT created_at, metadata->>'channel' AS channel, metadata->>'threshold_days' AS threshold
+FROM audit_events
+WHERE event_type = 'expiration_alert_sent'
+  AND resource_id = '<cert-id>'
+ORDER BY created_at DESC;
+```
+
+Each row corresponds to one fired alert. The `channel` metadata field tells you which notifier ran. Cross-check against the Prometheus `certctl_expiry_alerts_total{result="failure"}` counter to see whether any dispatch attempts failed.
+
+**V3-Pro forward path.** Per-owner / per-team channel routing (route the Production-CDN cert's alerts to its dedicated owner's PagerDuty service, the Internal-API cert's alerts to a different one), calendar-aware suppression (no T-30 informational alerts on weekends for non-on-call teams), and escalation chains (T-1 unanswered for 30m → escalate to manager) are tracked on `cowork/WORKSPACE-ROADMAP.md` under "Adapter hardening" → "Multi-channel expiry alerts: per-owner routing".
+
 ### Email (SMTP) Notifier

 The Email notifier sends transactional alerts and scheduled digests via SMTP. It bridges the connector-layer SMTP connector to the service-layer `Notifier` interface via the `NotifierAdapter`. Supports both plain text and HTML emails.
@@ -0,0 +1,225 @@
+# Runbook: certificate-expiry alerts (multi-channel)
+
+This runbook covers the per-policy multi-channel expiry-alert dispatch
+path that ships in certctl post-2026-05-03 (Rank 4 of the Infisical
+deep-research deliverable). It complements the operator-facing
+[Routing expiry alerts across channels](connectors.md#routing-expiry-alerts-across-channels)
+section in `docs/connectors.md`.
+
+Audience: a platform sysadmin or on-call engineer who needs to
+configure, debug, or audit certctl's expiry-alert routing. Not a
+walkthrough of how to install certctl — that lives in the README.
+
+---
+
+## End-to-end flow
+
+```
+                          daily ticker (renewalCheckLoop)
+                                        │
+                                        ▼
+                       RenewalService.CheckExpiringCertificates
+                                        │
+                       ┌────────────────┴────────────────┐
+                       │  for cert in expiring (≤30 days):│
+                       │    1. Resolve RenewalPolicy      │
+                       │    2. Compute daysUntil          │
+                       │    3. updateCertExpiryStatus     │
+                       │    4. sendThresholdAlerts ──────►│  per threshold:
+                       │    5. Create renewal job (if     │    a. resolve severity tier
+                       │       issuer registered + ARI    │       via AlertSeverityMap
+                       │       allows)                    │    b. resolve channel set
+                       └──────────────────────────────────┘       via AlertChannels[tier]
+                                                                  c. for each channel:
+                                                                     i.  dedup via
+                                                                         notification_events
+                                                                         (cert,threshold,channel)
+                                                                     ii. SendThresholdAlertOnChannel
+                                                                         → notifierRegistry[channel]
+                                                                         → Send(recipient,subj,body)
+                                                                     iii. record audit row
+                                                                          (event_type=expiration_alert_sent,
+                                                                           metadata.channel,
+                                                                           metadata.severity_tier)
+                                                                     iv.  bump Prometheus counter
+                                                                          certctl_expiry_alerts_total
+                                                                          {channel,threshold,result}
+```
+
+The dispatch loop's per-channel error handling is
+**fault-isolating**: PagerDuty's failure does NOT skip Slack/Email
+at the same threshold. Each channel runs independently, with its
+own dedup row + audit row + metric increment.
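
The fault-isolating behaviour can be illustrated with a small Go sketch. It is a simplified model under stated assumptions, not the `sendThresholdAlerts` implementation: the in-memory `sent` map stands in for the `notification_events` dedup lookup, `sendFunc` and `fanOut` are hypothetical names, and treating an unregistered channel as a failure is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// sendFunc is a hypothetical notifier signature used only in this sketch.
type sendFunc func(certID string, threshold int) error

// dedupKey mirrors the (cert, threshold, channel) triple.
type dedupKey struct {
	certID    string
	threshold int
	channel   string
}

var errUpstream = errors.New("upstream returned 503")

// fanOut attempts every channel independently. A failure on one channel
// never short-circuits the loop, and only successful sends are marked in
// `sent`, so a failed channel is retried on the next tick.
func fanOut(certID string, threshold int, channels []string,
	notifiers map[string]sendFunc, sent map[dedupKey]bool) map[string]string {

	results := make(map[string]string, len(channels))
	for _, ch := range channels {
		key := dedupKey{certID, threshold, ch}
		if sent[key] {
			results[ch] = "deduped"
			continue
		}
		send, ok := notifiers[ch]
		if !ok {
			results[ch] = "failure" // assumption: no registered notifier counts as failure
			continue
		}
		if err := send(certID, threshold); err != nil {
			results[ch] = "failure" // record and move on to the next channel
			continue
		}
		sent[key] = true
		results[ch] = "success"
	}
	return results
}

func main() {
	sent := map[dedupKey]bool{}
	channels := []string{"PagerDuty", "Slack", "Email"}
	ok := func(string, int) error { return nil }
	down := func(string, int) error { return errUpstream }

	// Tick 1: PagerDuty is down; Slack and Email still go out.
	fmt.Println(fanOut("cert-1", 0, channels,
		map[string]sendFunc{"PagerDuty": down, "Slack": ok, "Email": ok}, sent))
	// Tick 2: PagerDuty has recovered; only it is re-sent, the others are deduped.
	fmt.Println(fanOut("cert-1", 0, channels,
		map[string]sendFunc{"PagerDuty": ok, "Slack": ok, "Email": ok}, sent))
}
```

The result strings match the three `result` label values of `certctl_expiry_alerts_total`, which is why the sketch returns them per channel rather than a single error.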
+
+---
+
+## Configuring the per-policy channel matrix
+
+The matrix is a property of `RenewalPolicy`. Two new JSONB columns
+on the `renewal_policies` table back it (migration 000026):
+
+- `alert_channels JSONB` — `map[severity_tier][]channel_name`. Default `{}`
+  → fall through to `DefaultAlertChannels` (Email-only at every tier).
+- `alert_severity_map JSONB` — `map[threshold_days]severity_tier`. Default
+  `{}` → fall through to `DefaultAlertSeverityMap` (`30→informational,
+  14→warning, 7→warning, 0→critical`).
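
Because JSON object keys are always strings, the `alert_severity_map` thresholds arrive as `"30"`, `"14"` and so on. A reader of the column could convert them as in the Go sketch below. The function name `parseSeverityMap` and the int-keyed result type are assumptions for illustration; the repository layer may represent the map differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// parseSeverityMap decodes the JSONB payload of alert_severity_map and
// converts its string keys ("30") into integer thresholds (30).
// An empty object yields an empty map, which callers can treat as
// "fall through to the defaults".
func parseSeverityMap(raw []byte) (map[int]string, error) {
	var byString map[string]string
	if err := json.Unmarshal(raw, &byString); err != nil {
		return nil, fmt.Errorf("decode alert_severity_map: %w", err)
	}
	out := make(map[int]string, len(byString))
	for key, tier := range byString {
		days, err := strconv.Atoi(key)
		if err != nil {
			return nil, fmt.Errorf("alert_severity_map key %q is not an integer: %w", key, err)
		}
		out[days] = tier
	}
	return out, nil
}

func main() {
	raw := []byte(`{"30": "informational", "14": "warning", "7": "warning", "0": "critical"}`)
	m, err := parseSeverityMap(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, t := range []int{30, 14, 7, 0} {
		fmt.Printf("T-%d -> %s\n", t, m[t])
	}
}
```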
+
+### Example: production-grade routing
+
+```bash
+curl -X PUT https://certctl.example.com/api/v1/renewal-policies/rp-production \
+  -H "Authorization: Bearer ${TOKEN}" \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "name": "Production CDN renewal policy",
+    "renewal_window_days": 30,
+    "auto_renew": true,
+    "max_retries": 3,
+    "retry_interval_seconds": 300,
+    "alert_thresholds_days": [30, 14, 7, 0],
+    "alert_channels": {
+      "informational": ["Slack"],
+      "warning":       ["Slack", "Email"],
+      "critical":      ["PagerDuty", "OpsGenie", "Email"]
+    },
+    "alert_severity_map": {
+      "30": "informational",
+      "14": "warning",
+      "7":  "warning",
+      "0":  "critical"
+    }
+  }'
+```
+
+After this PUT, the next renewal-loop tick that finds a cert under
+this policy will fan out alerts as documented above.
+
+### Example: opt out of informational alerts
+
+If your team doesn't want T-30 informational alerts (you'd rather
+hear about a cert only at warning tier and beyond):
+
+```json
+"alert_channels": {
+  "informational": [],
+  "warning":       ["Email"],
+  "critical":      ["PagerDuty", "Email"]
+}
+```
+
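+The empty `informational` list causes the dispatch loop to record
+an `expiration_alert_skipped_no_channels` audit row at T-30 and
+skip the dispatch. Other tiers still fire. This differs from leaving
+`alert_channels` itself unset or `{}`, which falls through to the
+Email-only `DefaultAlertChannels`.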
+
+---
+
+## Operator playbook
+
+### "Did the on-call team get paged?"
+
+```sql
+SELECT created_at,
+       metadata->>'channel'        AS channel,
+       metadata->>'threshold_days' AS threshold,
+       metadata->>'severity_tier'  AS severity
+FROM audit_events
+WHERE event_type = 'expiration_alert_sent'
+  AND resource_id = '<cert-id>'
+ORDER BY created_at DESC;
+```
+
+One row per (channel, threshold) attempt. If you see a row with
+`channel = 'PagerDuty'` and `severity = 'critical'`, the page went
+out (or was at least dispatched to the notifier).
+
+### "Why didn't I get an alert at T-7?"
+
+Three places to look:
+
+1. **Audit log**: `SELECT * FROM audit_events WHERE event_type IN
+   ('expiration_alert_sent','expiration_alert_skipped_no_channels',
+   'expiration_alert_skipped_invalid_channel') AND resource_id =
+   '<cert-id>'`. If `expiration_alert_skipped_no_channels` appears,
+   your policy's tier list is empty for the resolved tier. If
+   `expiration_alert_skipped_invalid_channel` appears, your matrix
+   has a typo (the `metadata->>'invalid_channel'` field tells you
+   which value).
+
+2. **Notifications table**:
+   `SELECT * FROM notification_events WHERE certificate_id = '<cert-id>'
+   AND type = 'ExpirationWarning' ORDER BY created_at DESC`. If
+   rows exist with `channel = 'Slack'` and `status = 'failed'`,
+   the dispatch reached the channel but the channel rejected the
+   send. Look at the `error` column for the upstream message.
+
+3. **Prometheus counters**:
+   `curl https://certctl.example.com/api/v1/metrics/prometheus | grep certctl_expiry_alerts_total`.
+   Sustained `{result="failure"}` counts indicate a notifier
+   connector misconfiguration (bad webhook URL, expired API key,
+   etc.).
+
+### "How do I test the matrix without waiting for a real expiry?"
+
+certctl ships an admin endpoint for this:
+
+```bash
+curl -X POST https://certctl.example.com/api/v1/admin/notifications/test \
+  -H "Authorization: Bearer ${TOKEN}" \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "certificate_id": "mc-test-cert",
+    "threshold_days": 0,
+    "channel": "PagerDuty"
+  }'
+```
+
+This calls `NotificationService.SendThresholdAlertOnChannel`
+directly and bypasses the renewal loop's threshold check. Useful
+for "did I configure PagerDuty correctly?" without having to set
+up a deliberately-expiring cert. The admin endpoint requires
+`role=admin` (V3-Pro RBAC); V2 deploys gate it on the bearer
+token only.
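
For teams that script this check, the same request can be built from Go. The sketch below only constructs the request and does not send it. The endpoint path and JSON field names are taken from the curl example above, while `testAlertRequest`, `newTestAlertRequest`, the base URL, and the token value are hypothetical names and placeholders.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// testAlertRequest mirrors the JSON body of the curl example.
type testAlertRequest struct {
	CertificateID string `json:"certificate_id"`
	ThresholdDays int    `json:"threshold_days"`
	Channel       string `json:"channel"`
}

// newTestAlertRequest builds (but does not send) the admin test call.
func newTestAlertRequest(baseURL, token string, body testAlertRequest) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, fmt.Errorf("encode body: %w", err)
	}
	req, err := http.NewRequest(http.MethodPost,
		baseURL+"/api/v1/admin/notifications/test", bytes.NewReader(payload))
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newTestAlertRequest("https://certctl.example.com", "example-token",
		testAlertRequest{CertificateID: "mc-test-cert", ThresholdDays: 0, Channel: "PagerDuty"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	// To send it for real: resp, err := http.DefaultClient.Do(req)
}
```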
+
+### "How do I rotate a notifier credential without downtime?"
+
+1. Update the `CERTCTL_PAGERDUTY_ROUTING_KEY` (or equivalent) env
+   var in your deployment.
+2. Restart `certctl-server`. The notifier registry rebuilds
+   with the new credential.
+3. Confirm with the admin-test endpoint above against the cert
+   you most care about.
+
+The renewal loop is idempotent — a missed tick during the restart
+window does NOT cause double-dispatch on the next tick (per-channel
+dedup on the `notification_events` table guards against that).
+
+---
+
+## Cardinality + cost
+
+- Default 6 channels × 4 thresholds × 3 results = **72 Prometheus series**.
+- Custom-thresholds policies (e.g. `[60, 45, 30, 14, 7, 3, 1, 0]`)
+  expand the threshold dimension proportionally — 6 × 8 × 3 = 144 series.
+- Closed-enum discipline at the dispatch site means typos in
+  `alert_channels` do NOT grow this count.
+- A daily renewal-loop tick over 10K certs, each bound to a policy
+  with the matrix above, produces O(channels × thresholds × certs)
+  audit rows and notification rows in the worst case (every cert has
+  crossed every threshold and no dedup applies). Operators sizing
+  Postgres should plan for an `audit_events` row count on the
+  order of `unique_certs × channels_per_critical_tier` per fan-out
+  batch, which is ~3-5× the pre-Rank-4 row count.
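
The series arithmetic and the sorted-snapshot idea can be sketched in Go as follows. This is an illustration under assumptions rather than the `service.ExpiryAlertMetrics` code: it guards a plain map with a mutex instead of the atomic counter table the commit describes, and the type and method names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"sync"
)

// labelKey is one (channel, threshold, result) label combination.
type labelKey struct{ channel, threshold, result string }

// expiryAlertCounters is a mutex-guarded counter table (a simplification
// of an atomic counter table, chosen here for brevity).
type expiryAlertCounters struct {
	mu     sync.Mutex
	counts map[labelKey]uint64
}

func newExpiryAlertCounters() *expiryAlertCounters {
	return &expiryAlertCounters{counts: make(map[labelKey]uint64)}
}

// inc bumps the counter for one dispatch outcome.
func (c *expiryAlertCounters) inc(channel string, thresholdDays int, result string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[labelKey{channel, strconv.Itoa(thresholdDays), result}]++
}

// snapshot renders the counters as Prometheus-style text lines, sorted so
// that repeated scrapes emit the series in the same order.
func (c *expiryAlertCounters) snapshot() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	lines := make([]string, 0, len(c.counts))
	for k, v := range c.counts {
		lines = append(lines, fmt.Sprintf(
			"certctl_expiry_alerts_total{channel=%q,threshold=%q,result=%q} %d",
			k.channel, k.threshold, k.result, v))
	}
	sort.Strings(lines)
	return lines
}

// maxSeries is the upper bound on distinct series for the three labels.
func maxSeries(channels, thresholds, results int) int {
	return channels * thresholds * results
}

func main() {
	fmt.Println("default bound:", maxSeries(6, 4, 3))
	fmt.Println("8-threshold bound:", maxSeries(6, 8, 3))

	c := newExpiryAlertCounters()
	c.inc("PagerDuty", 0, "success")
	c.inc("Slack", 30, "deduped")
	for _, line := range c.snapshot() {
		fmt.Println(line)
	}
}
```

Only label combinations that have actually been recorded appear in the snapshot, so the bound from `maxSeries` is a ceiling, not the typical series count.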
+
+---
+
+## V3-Pro forward path
+
+Tracked at `cowork/WORKSPACE-ROADMAP.md` under "Adapter hardening":
+
+- Per-owner / per-team / per-tenant channel routing (the matrix is
+  per-policy today, not per-owner).
+- Calendar-aware suppression (no T-30 alerts on weekends for
+  non-on-call teams).
+- Escalation chains (T-1 unanswered for 30m → escalate to
+  manager's PagerDuty).
+- Per-channel rate limiting (downstream of I-005's retry+DLQ).