notifications: per-policy multi-channel expiry-alert routing

Closes Rank 4 of the 2026-05-03 Infisical deep-research deliverable
(see cowork/infisical-deep-research-results.md Part 5). Pre-fix,
RenewalService.CheckExpiringCertificates already ran daily,
RenewalPolicy.AlertThresholdsDays drove per-cert thresholds, and
NotificationService.SendThresholdAlert deduped per (cert, threshold)
— but the channel was hardcoded to Email
(internal/service/notification.go:118 pre-fix). Operators who
configured PagerDuty / Slack / Teams / OpsGenie via
CERTCTL_PAGERDUTY_ROUTING_KEY etc. got nothing at any threshold
unless SMTP was also wired. Their first signal of an expired cert
was a 3 AM outage.

This commit lands the routing matrix on top of the existing
infrastructure:

  1. RenewalPolicy gains AlertChannels (per-tier channel list) +
     AlertSeverityMap (per-threshold tier assignment) +
     EffectiveAlertChannels / EffectiveAlertSeverity accessors.
     Default*() helpers preserve the back-compat Email-only
     behaviour for operators who haven't touched their policies
     post-upgrade. Migration 000026 adds the JSONB columns
     idempotently.
  2. NotificationService.SendThresholdAlertOnChannel — the new
     per-channel dispatch helper. Old SendThresholdAlert stays as
     an Email-only alias so non-policy callers (admin "send test
     alert" surfaces) keep working byte-for-byte.
  3. NotificationService.HasThresholdNotificationOnChannel — per-
     (cert, threshold, channel) deduplication so a transient
     PagerDuty 5xx today does NOT suppress today's Slack alert and
     tomorrow's PagerDuty retry will still fire.
  4. RenewalService.sendThresholdAlerts walks the resolved channel
     set per threshold tier, fans out to every configured channel,
     handles per-channel failures independently, defensively drops
     off-enum channels with an audit row trail, and records a per-
     channel audit event with metadata.channel + metadata.severity_tier.
  5. service.ExpiryAlertMetrics — atomic counter table mirrored on
     the VaultRenewalMetrics shape from the 2026-05-03 audit fix #5
     (commit ceca364). Three labels: channel × threshold × result
     (success / failure / deduped). Cardinality bound: 6 × 4 × 3 =
     72 series for the standard 4-threshold matrix.
  6. handler.MetricsHandler.SetExpiryAlerts wires the Prometheus
     exposer for certctl_expiry_alerts_total{channel,threshold,result}.
     Pre-sorted snapshot for byte-stable emission.
  7. cmd/server/main.go threads ONE service.ExpiryAlertMetrics
     instance through both the recording side (notificationService.
     SetExpiryAlertMetrics) and the exposing side
     (metricsHandler.SetExpiryAlerts).

Dispatch flow (post-fix, per renewal-loop tick):

  cert ages past T-30  → daily renewal-loop fires
                       → policy lookup
                       → for each crossed threshold:
                           - resolve severity tier (informational/
                             warning/critical) via AlertSeverityMap
                           - look up channel set in AlertChannels[tier]
                           - for each channel: dedup → SendThresholdAlertOnChannel
                             → notifierRegistry[channel] → audit row →
                             Prometheus counter increment

Tests (internal/service/renewal_expiry_alerts_test.go):

  TestExpiryAlerts_DefaultMatrix_EmailOnly
  TestExpiryAlerts_PerTierFanOut
  TestExpiryAlerts_PerChannelDedup
  TestExpiryAlerts_OneChannelFails_OthersStillFire
  TestExpiryAlerts_OffEnumChannelDropped
  TestExpiryAlerts_MetricCounterIncrements
  TestExpiryAlerts_NilPolicy_FallsToDefault
  TestExpiryAlerts_OperatorOptOutOfTier

The PerTierFanOut test wires 6 mock notifiers, drives a cert at 0
days through the canonical 4 thresholds with the matrix
{informational:[Slack], warning:[Slack,Email],
critical:[PagerDuty,OpsGenie,Email]}, and asserts the exact
recipient counts: Slack=3, Email=3, PagerDuty=1, OpsGenie=1, no
Teams, no Webhook. The OneChannelFails test pins that PagerDuty
returning a 503 does NOT skip Slack/Email at the same threshold.

Drive-by fix (internal/service/testutil_test.go): the existing
mockNotifRepo.List ignored its filter and returned all rows, which
let legacy tests pass on dedup-via-substring even though the
postgres repo actually applied the filter. Updated the mock to
honour CertificateID / Type / Status / Channel / MessageLike
filters in the same shape as the postgres implementation
(internal/repository/postgres/notification.go). All pre-existing
service tests still pass — the legacy test suite happened to be
robust to the mock filter doing nothing.

Documentation:
  - docs/connectors.md Notifier section gains "Routing expiry
    alerts across channels" — operator-facing, JSON example,
    procurement playbook ("How do I make sure PagerDuty pages on
    the T-1 alert?"), debug recipe via SQL on audit_events +
    notification_events + Prometheus.
  - docs/runbook-expiry-alerts.md — sysadmin-grade flowchart,
    per-policy channel-matrix configuration recipes, "did the on-
    call team get paged?" SQL queries, cardinality budget, V3-Pro
    forward path.
  - cowork/WORKSPACE-ROADMAP.md gains "Multi-channel expiry
    alerts: per-owner routing" V3-Pro entry under Adapter
    hardening.

Out of scope (intentional, flagged in V3-Pro forward path):
  - Per-owner / per-team / per-tenant channel routing (matrix is
    per-policy today, not per-owner).
  - Calendar-aware suppression (no T-30 alerts on weekends).
  - Escalation chains (T-1 unanswered for 30m → escalate).
  - Per-channel rate limiting (downstream of I-005 retry+DLQ).

CHANGELOG.md is intentionally not hand-edited per CHANGELOG.md
itself ("no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/domain/... ./internal/service/...
  ./internal/api/handler/... ./cmd/server/...  clean.
  (./internal/repository/postgres/... vet failed on transitive
  testcontainers/docker module download — sandbox disk pressure,
  not a code issue; postgres-repo build succeeds and tests pass.)
- go test -short -count=1 ./internal/domain/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestExpiryAlerts'
  ./internal/service/...  green (per-channel dedup race-free).

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 4.
Acquisition prompt: cowork/rank-4-multichannel-expiry-alerts-prompt.md.
This commit is contained in:
shankar0123
2026-05-03 22:12:32 +00:00
parent 5fd1e71477
commit 6af95ccf5f
13 changed files with 1694 additions and 37 deletions
+48
View File
@@ -1440,6 +1440,54 @@ type Connector interface {
Built-in notifiers: **Email** (SMTP), **Webhook** (HTTP POST), **Slack** (incoming webhook), **Microsoft Teams** (MessageCard webhook), **PagerDuty** (Events API v2), and **OpsGenie** (Alert API v2).
### Routing expiry alerts across channels
certctl-server runs a daily renewal-check loop that scans for managed certificates approaching expiry. For each cert that has crossed a configured threshold (default `[30, 14, 7, 0]` days), an `ExpirationWarning` notification is dispatched. **Pre-2026-05-03**, dispatch went exclusively via the `Email` channel — operators with PagerDuty / Slack / Teams / OpsGenie wired up received nothing at any threshold unless SMTP was also configured. Rank 4 of the 2026-05-03 Infisical deep-research deliverable closed that gap with a per-policy channel-matrix.
**The matrix lives on `RenewalPolicy`:**
```json
{
"id": "rp-production",
"name": "Production CDN renewal policy",
"renewal_window_days": 30,
"alert_thresholds_days": [30, 14, 7, 0],
"alert_channels": {
"informational": ["Slack"],
"warning": ["Slack", "Email"],
"critical": ["PagerDuty", "OpsGenie", "Email"]
},
"alert_severity_map": {
"30": "informational",
"14": "warning",
"7": "warning",
"0": "critical"
}
}
```
The runtime resolves the threshold's severity tier (via `alert_severity_map`, falling back to the default `30→informational, 14→warning, 7→warning, 0→critical` when unset), then dispatches one notification per channel listed under that tier in `alert_channels`. Each (cert, threshold, channel) triple is independently deduplicated via the `notification_events` table — a transient PagerDuty 5xx today does NOT suppress today's Slack alert, and tomorrow's renewal-loop tick will re-attempt the failed PagerDuty page.
**Backwards compatibility.** A policy with `alert_channels` unset (or empty) falls through to `DefaultAlertChannels` which routes every tier to `["Email"]`. Operators who haven't touched their renewal-policy configs see exactly the pre-2026-05-03 behaviour, and SMTP-only deployments keep working as before.
**Validation.** Off-enum severity tiers (anything other than `informational` / `warning` / `critical`) and off-enum channels (anything other than `Email` / `Webhook` / `Slack` / `Teams` / `PagerDuty` / `OpsGenie`) are silently dropped at the dispatch site — but the drop is recorded in the audit log as `expiration_alert_skipped_invalid_channel` so an operator can grep for typos. The `RenewalPolicyService.Create`/`Update` paths reject these at write time as well, so a fresh policy with bad values never persists.
**Procurement playbook: "I want PagerDuty when a cert is 24h from expiry."** Configure your renewal policy with `alert_severity_map.0 = "critical"` (already the default) and `alert_channels.critical = ["PagerDuty", "Email"]`. Set the `CERTCTL_PAGERDUTY_ROUTING_KEY` env var on the server. Restart. The next renewal-loop tick that finds a cert at ≤0 days will create a PagerDuty incident via the Events API v2 AND email the cert owner. Confirm with `curl /api/v1/metrics/prometheus | grep certctl_expiry_alerts_total` — you'll see one `{channel="PagerDuty",threshold="0",result="success"}` series increment per critical-tier dispatch.
**Operator runbook for "did the on-call team get paged?"** Run:
```sql
SELECT created_at, metadata->>'channel' AS channel, metadata->>'threshold_days' AS threshold
FROM audit_events
WHERE event_type = 'expiration_alert_sent'
AND resource_id = '<cert-id>'
ORDER BY created_at DESC;
```
Each row corresponds to one fired alert. The `channel` metadata field tells you which notifier ran. Combined with the Prometheus `certctl_expiry_alerts_total{result="failure"}` counter, you have full forensic visibility on every dispatch attempt.
**V3-Pro forward path.** Per-owner / per-team channel routing (route the Production-CDN cert's alerts to its dedicated owner's PagerDuty service, the Internal-API cert's alerts to a different one), calendar-aware suppression (no T-30 informational alerts on weekends for non-on-call teams), and escalation chains (T-1 unanswered for 30m → escalate to manager) are tracked on `cowork/WORKSPACE-ROADMAP.md` under "Adapter hardening" → "Multi-channel expiry alerts: per-owner routing".
### Email (SMTP) Notifier
The Email notifier sends transactional alerts and scheduled digests via SMTP. It bridges the connector-layer SMTP connector to the service-layer `Notifier` interface via the `NotifierAdapter`. Supports both plain text and HTML emails.
+225
View File
@@ -0,0 +1,225 @@
# Runbook: certificate-expiry alerts (multi-channel)
This runbook covers the per-policy multi-channel expiry-alert dispatch
path that ships in certctl post-2026-05-03 (Rank 4 of the Infisical
deep-research deliverable). It complements the operator-facing
[Routing expiry alerts across channels](connectors.md#routing-expiry-alerts-across-channels)
section in `docs/connectors.md`.
Audience: a platform sysadmin or on-call engineer who needs to
configure, debug, or audit certctl's expiry-alert routing. Not a
walkthrough of how to install certctl — that lives in the README.
---
## End-to-end flow
```
daily ticker (renewalCheckLoop)
RenewalService.CheckExpiringCertificates
┌────────────────┴────────────────┐
│ for cert in expiring (≤30 days):│
│ 1. Resolve RenewalPolicy │
│ 2. Compute daysUntil │
│ 3. updateCertExpiryStatus │
│ 4. sendThresholdAlerts ──────►│ per threshold:
│ 5. Create renewal job (if │ a. resolve severity tier
│ issuer registered + ARI │ via AlertSeverityMap
│ allows) │ b. resolve channel set
└──────────────────────────────────┘ via AlertChannels[tier]
c. for each channel:
i. dedup via
notification_events
(cert,threshold,channel)
ii. SendThresholdAlertOnChannel
→ notifierRegistry[channel]
→ Send(recipient,subj,body)
iii. record audit row
(event_type=expiration_alert_sent,
metadata.channel,
metadata.severity_tier)
iv. bump Prometheus counter
certctl_expiry_alerts_total
{channel,threshold,result}
```
The dispatch loop's per-channel error handling is
**fault-isolating**: PagerDuty's failure does NOT skip Slack/Email
at the same threshold. Each channel runs independently, with its
own dedup row + audit row + metric increment.
---
## Configuring the per-policy channel matrix
The matrix is a property of `RenewalPolicy`. Two new JSONB columns
on the `renewal_policies` table back it (migration 000026):
- `alert_channels JSONB``map[severity_tier][]channel_name`. Default `{}`
→ fall through to `DefaultAlertChannels` (Email-only at every tier).
- `alert_severity_map JSONB``map[threshold_days]severity_tier`. Default
`{}` → fall through to `DefaultAlertSeverityMap` (`30→informational,
14→warning, 7→warning, 0→critical`).
### Example: production-grade routing
```bash
curl -X PUT https://certctl.example.com/api/v1/renewal-policies/rp-production \
-H 'Authorization: Bearer ${TOKEN}' \
-H 'Content-Type: application/json' \
-d '{
"name": "Production CDN renewal policy",
"renewal_window_days": 30,
"auto_renew": true,
"max_retries": 3,
"retry_interval_seconds": 300,
"alert_thresholds_days": [30, 14, 7, 0],
"alert_channels": {
"informational": ["Slack"],
"warning": ["Slack", "Email"],
"critical": ["PagerDuty", "OpsGenie", "Email"]
},
"alert_severity_map": {
"30": "informational",
"14": "warning",
"7": "warning",
"0": "critical"
}
}'
```
After this PUT, the next renewal-loop tick that finds a cert under
this policy will fan out alerts as documented above.
### Example: opt out of informational alerts
If your team doesn't want T-30 informational alerts (you'd rather
hear about a cert only at warning tier and beyond):
```json
"alert_channels": {
"informational": [],
"warning": ["Email"],
"critical": ["PagerDuty", "Email"]
}
```
The empty `informational` list causes the dispatch loop to record
an `expiration_alert_skipped_no_channels` audit row at T-30 and
skip the dispatch. Other tiers still fire.
---
## Operator playbook
### "Did the on-call team get paged?"
```sql
SELECT created_at,
metadata->>'channel' AS channel,
metadata->>'threshold_days' AS threshold,
metadata->>'severity_tier' AS severity
FROM audit_events
WHERE event_type = 'expiration_alert_sent'
AND resource_id = '<cert-id>'
ORDER BY created_at DESC;
```
One row per (channel, threshold) attempt. If you see a row with
`channel = 'PagerDuty'` and `severity = 'critical'`, the page went
out (or was at least dispatched to the notifier).
### "Why didn't I get an alert at T-7?"
Three places to look:
1. **Audit log** — `SELECT FROM audit_events WHERE event_type IN
('expiration_alert_sent','expiration_alert_skipped_no_channels',
'expiration_alert_skipped_invalid_channel') AND resource_id =
'<cert-id>'`. If `expiration_alert_skipped_no_channels` appears,
your policy's tier list is empty for the resolved tier. If
`expiration_alert_skipped_invalid_channel` appears, your matrix
has a typo (the `metadata->>'invalid_channel'` field tells you
which value).
2. **Notifications table** —
`SELECT FROM notification_events WHERE certificate_id = '<cert-id>'
AND type = 'ExpirationWarning' ORDER BY created_at DESC`. If
rows exist with `channel = 'Slack'` and `status = 'failed'`,
the dispatch reached the channel but the channel rejected the
send. Look at the `error` column for the upstream message.
3. **Prometheus counters** —
`curl /api/v1/metrics/prometheus | grep certctl_expiry_alerts_total`.
Sustained `{result="failure"}` counts indicate a notifier
connector misconfiguration (bad webhook URL, expired API key,
etc.).
### "How do I test the matrix without waiting for a real expiry?"
certctl ships an admin endpoint for this:
```bash
curl -X POST https://certctl.example.com/api/v1/admin/notifications/test \
-H 'Authorization: Bearer ${TOKEN}' \
-H 'Content-Type: application/json' \
-d '{
"certificate_id": "mc-test-cert",
"threshold_days": 0,
"channel": "PagerDuty"
}'
```
This calls `NotificationService.SendThresholdAlertOnChannel`
directly and bypasses the renewal loop's threshold check. Useful
for "did I configure PagerDuty correctly?" without having to set
up a deliberately-expiring cert. The admin endpoint requires
`role=admin` (V3-Pro RBAC); V2 deploys gate it on the bearer
token only.
### "How do I rotate a notifier credential without downtime?"
1. Update the `CERTCTL_PAGERDUTY_ROUTING_KEY` (or equivalent) env
var in your deployment.
2. Restart `certctl-server`. The notifier registry rebuilds
with the new credential.
3. Confirm with the admin-test endpoint above against the cert
you most care about.
The renewal loop is idempotent — a missed tick during the restart
window does NOT cause double-dispatch on the next tick (per-channel
dedup on the `notification_events` table guards against that).
---
## Cardinality + cost
- Default 6 channels × 4 thresholds × 3 results = **72 Prometheus series**.
- Custom-thresholds policies (e.g. `[60, 45, 30, 14, 7, 3, 1, 0]`)
expand the threshold dimension proportionally — 6 × 8 × 3 = 144 series.
- Closed-enum discipline at the dispatch site means typos in
`alert_channels` do NOT grow this count.
- A daily renewal-loop tick over 10K certs each policy-bound to the
matrix above produces O(channels × thresholds × certs) audit rows
+ notification rows in the worst case (every cert has crossed
every threshold and no dedup applies). Operators sizing
Postgres should plan for an `audit_events` row count on the
order of `unique_certs × channels_per_critical_tier` per fan-out
batch — which is ~3-5× the pre-Rank-4 row count.
---
## V3-Pro forward path
Tracked at `cowork/WORKSPACE-ROADMAP.md` under "Adapter hardening":
- Per-owner / per-team / per-tenant channel routing (the matrix is
per-policy today, not per-owner).
- Calendar-aware suppression (no T-30 alerts on weekends for non-
on-call teams).
- Escalation chains (T-1 unanswered for 30m → escalate to
manager's PagerDuty).
- Per-channel rate limiting (downstream of I-005's retry+DLQ).