certctl/docs/runbook-expiry-alerts.md
shankar0123 6af95ccf5f notifications: per-policy multi-channel expiry-alert routing
Closes Rank 4 of the 2026-05-03 Infisical deep-research deliverable
(see cowork/infisical-deep-research-results.md Part 5). Pre-fix,
RenewalService.CheckExpiringCertificates already ran daily,
RenewalPolicy.AlertThresholdsDays drove per-cert thresholds, and
NotificationService.SendThresholdAlert deduped per (cert, threshold)
— but the channel was hardcoded to Email
(internal/service/notification.go:118 pre-fix). Operators who
configured PagerDuty / Slack / Teams / OpsGenie via
CERTCTL_PAGERDUTY_ROUTING_KEY etc. got nothing at any threshold
unless SMTP was also wired. Their first signal of an expired cert
was a 3 AM outage.

This commit lands the routing matrix on top of the existing
infrastructure:

  1. RenewalPolicy gains AlertChannels (per-tier channel list) +
     AlertSeverityMap (per-threshold tier assignment) +
     EffectiveAlertChannels / EffectiveAlertSeverity accessors.
     Default*() helpers preserve the back-compat Email-only
     behaviour for operators who haven't touched their policies
     post-upgrade. Migration 000026 adds the JSONB columns
     idempotently.
  2. NotificationService.SendThresholdAlertOnChannel — the new
     per-channel dispatch helper. Old SendThresholdAlert stays as
     an Email-only alias so non-policy callers (admin "send test
     alert" surfaces) keep working byte-for-byte.
  3. NotificationService.HasThresholdNotificationOnChannel — per-
     (cert, threshold, channel) deduplication so a transient
     PagerDuty 5xx today does NOT suppress today's Slack alert and
     tomorrow's PagerDuty retry will still fire.
  4. RenewalService.sendThresholdAlerts walks the resolved channel
     set per threshold tier, fans out to every configured channel,
     handles per-channel failures independently, defensively drops
     off-enum channels with an audit row trail, and records a per-
     channel audit event with metadata.channel + metadata.severity_tier.
  5. service.ExpiryAlertMetrics — atomic counter table mirrored on
     the VaultRenewalMetrics shape from the 2026-05-03 audit fix #5
     (commit ceca364). Three labels: channel × threshold × result
     (success / failure / deduped). Cardinality bound: 6 × 4 × 3 =
     72 series for the standard 4-threshold matrix.
  6. handler.MetricsHandler.SetExpiryAlerts wires the Prometheus
     exposer for certctl_expiry_alerts_total{channel,threshold,result}.
     Pre-sorted snapshot for byte-stable emission.
  7. cmd/server/main.go threads ONE service.ExpiryAlertMetrics
     instance through both the recording side (notificationService.
     SetExpiryAlertMetrics) and the exposing side
     (metricsHandler.SetExpiryAlerts).

Dispatch flow (post-fix, per renewal-loop tick):

  cert ages past T-30  → daily renewal-loop fires
                       → policy lookup
                       → for each crossed threshold:
                           - resolve severity tier (informational/
                             warning/critical) via AlertSeverityMap
                           - look up channel set in AlertChannels[tier]
                           - for each channel: dedup → SendThresholdAlertOnChannel
                             → notifierRegistry[channel] → audit row →
                             Prometheus counter increment

Tests (internal/service/renewal_expiry_alerts_test.go):

  TestExpiryAlerts_DefaultMatrix_EmailOnly
  TestExpiryAlerts_PerTierFanOut
  TestExpiryAlerts_PerChannelDedup
  TestExpiryAlerts_OneChannelFails_OthersStillFire
  TestExpiryAlerts_OffEnumChannelDropped
  TestExpiryAlerts_MetricCounterIncrements
  TestExpiryAlerts_NilPolicy_FallsToDefault
  TestExpiryAlerts_OperatorOptOutOfTier

The PerTierFanOut test wires 6 mock notifiers, drives a cert at 0
days through the canonical 4 thresholds with the matrix
{informational:[Slack], warning:[Slack,Email],
critical:[PagerDuty,OpsGenie,Email]}, and asserts the exact
recipient counts: Slack=3, Email=3, PagerDuty=1, OpsGenie=1, no
Teams, no Webhook. The OneChannelFails test pins that PagerDuty
returning a 503 does NOT skip Slack/Email at the same threshold.

Drive-by fix (internal/service/testutil_test.go): the existing
mockNotifRepo.List ignored its filter and returned all rows, which
let legacy tests pass on dedup-via-substring even though the
postgres repo actually applied the filter. Updated the mock to
honour CertificateID / Type / Status / Channel / MessageLike
filters in the same shape as the postgres implementation
(internal/repository/postgres/notification.go). All pre-existing
service tests still pass — the legacy test suite happened to be
robust to the mock filter doing nothing.

Documentation:
  - docs/connectors.md Notifier section gains "Routing expiry
    alerts across channels" — operator-facing, JSON example,
    procurement playbook ("How do I make sure PagerDuty pages on
    the T-1 alert?"), debug recipe via SQL on audit_events +
    notification_events + Prometheus.
  - docs/runbook-expiry-alerts.md — sysadmin-grade flowchart,
    per-policy channel-matrix configuration recipes, "did the on-
    call team get paged?" SQL queries, cardinality budget, V3-Pro
    forward path.
  - cowork/WORKSPACE-ROADMAP.md gains "Multi-channel expiry
    alerts: per-owner routing" V3-Pro entry under Adapter
    hardening.

Out of scope (intentional, flagged in V3-Pro forward path):
  - Per-owner / per-team / per-tenant channel routing (matrix is
    per-policy today, not per-owner).
  - Calendar-aware suppression (no T-30 alerts on weekends).
  - Escalation chains (T-1 unanswered for 30m → escalate).
  - Per-channel rate limiting (downstream of I-005 retry+DLQ).

CHANGELOG.md is intentionally not hand-edited per CHANGELOG.md
itself ("no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/domain/... ./internal/service/...
  ./internal/api/handler/... ./cmd/server/...  clean.
  (./internal/repository/postgres/... vet failed on transitive
  testcontainers/docker module download — sandbox disk pressure,
  not a code issue; postgres-repo build succeeds and tests pass.)
- go test -short -count=1 ./internal/domain/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestExpiryAlerts'
  ./internal/service/...  green (per-channel dedup race-free).

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 4.
Acquisition prompt: cowork/rank-4-multichannel-expiry-alerts-prompt.md.
2026-05-03 22:12:32 +00:00


Runbook: certificate-expiry alerts (multi-channel)

This runbook covers the per-policy multi-channel expiry-alert dispatch path that ships in certctl post-2026-05-03 (Rank 4 of the Infisical deep-research deliverable). It complements the operator-facing Routing expiry alerts across channels section in docs/connectors.md.

Audience: a platform sysadmin or on-call engineer who needs to configure, debug, or audit certctl's expiry-alert routing. Not a walkthrough of how to install certctl — that lives in the README.


End-to-end flow

                          daily ticker (renewalCheckLoop)
                                        │
                                        ▼
                       RenewalService.CheckExpiringCertificates
                                        │
                       ┌────────────────┴────────────────┐
                       │  for cert in expiring (≤30 days):│
                       │    1. Resolve RenewalPolicy      │
                       │    2. Compute daysUntil          │
                       │    3. updateCertExpiryStatus     │
                       │    4. sendThresholdAlerts ──────►│  per threshold:
                       │    5. Create renewal job (if     │    a. resolve severity tier
                       │       issuer registered + ARI    │       via AlertSeverityMap
                       │       allows)                    │    b. resolve channel set
                       └──────────────────────────────────┘       via AlertChannels[tier]
                                                                  c. for each channel:
                                                                     i.  dedup via
                                                                         notification_events
                                                                         (cert,threshold,channel)
                                                                     ii. SendThresholdAlertOnChannel
                                                                         → notifierRegistry[channel]
                                                                         → Send(recipient,subj,body)
                                                                     iii. record audit row
                                                                          (event_type=expiration_alert_sent,
                                                                           metadata.channel,
                                                                           metadata.severity_tier)
                                                                     iv.  bump Prometheus counter
                                                                          certctl_expiry_alerts_total
                                                                          {channel,threshold,result}

The dispatch loop's per-channel error handling is fault-isolating: PagerDuty's failure does NOT skip Slack/Email at the same threshold. Each channel runs independently, with its own dedup row + audit row + metric increment.


Configuring the per-policy channel matrix

The matrix is a property of RenewalPolicy. Two new JSONB columns on the renewal_policies table back it (migration 000026):

  • alert_channels (JSONB): map[severity_tier][]channel_name. Default {} → fall through to DefaultAlertChannels (Email-only at every tier).
  • alert_severity_map (JSONB): map[threshold_days]severity_tier. Default {} → fall through to DefaultAlertSeverityMap (30→informational, 14→warning, 7→warning, 0→critical).
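
The fall-through to defaults can be pictured with the sketch below. It is an illustration under stated assumptions, not the real accessor code: the `policy` struct and the `effectiveChannels`, `effectiveSeverity`, and `resolve` names are hypothetical, and the sketch assumes defaults apply only when a map is empty as a whole (a tier that is present with an empty list is treated as an explicit opt-out).

```go
package main

import "fmt"

// Defaults as described in this runbook: Email-only at every tier,
// and the 30/14/7/0 threshold-to-tier mapping.
var defaultAlertChannels = map[string][]string{
	"informational": {"Email"},
	"warning":       {"Email"},
	"critical":      {"Email"},
}

var defaultAlertSeverityMap = map[int]string{
	30: "informational", 14: "warning", 7: "warning", 0: "critical",
}

// policy is a hypothetical, trimmed-down stand-in for RenewalPolicy.
type policy struct {
	AlertChannels    map[string][]string
	AlertSeverityMap map[int]string
}

// effectiveChannels returns the defaults for a nil policy or an empty matrix.
func (p *policy) effectiveChannels() map[string][]string {
	if p == nil || len(p.AlertChannels) == 0 {
		return defaultAlertChannels
	}
	return p.AlertChannels
}

// effectiveSeverity returns the defaults for a nil policy or an empty map.
func (p *policy) effectiveSeverity() map[int]string {
	if p == nil || len(p.AlertSeverityMap) == 0 {
		return defaultAlertSeverityMap
	}
	return p.AlertSeverityMap
}

// resolve maps a threshold to its tier, then the tier to its channel list.
func resolve(p *policy, thresholdDays int) (string, []string) {
	tier := p.effectiveSeverity()[thresholdDays]
	return tier, p.effectiveChannels()[tier]
}

func main() {
	var untouched *policy // policy never edited after the upgrade
	tier, chans := resolve(untouched, 7)
	fmt.Println(tier, chans) // warning [Email]

	optOut := &policy{AlertChannels: map[string][]string{
		"informational": {},
		"warning":       {"Email"},
		"critical":      {"PagerDuty", "Email"},
	}}
	tier, chans = resolve(optOut, 30)
	fmt.Println(tier, len(chans)) // informational 0
}
```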

Example: production-grade routing

curl -X PUT https://certctl.example.com/api/v1/renewal-policies/rp-production \
  -H "Authorization: Bearer ${TOKEN}" \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "Production CDN renewal policy",
    "renewal_window_days": 30,
    "auto_renew": true,
    "max_retries": 3,
    "retry_interval_seconds": 300,
    "alert_thresholds_days": [30, 14, 7, 0],
    "alert_channels": {
      "informational": ["Slack"],
      "warning":       ["Slack", "Email"],
      "critical":      ["PagerDuty", "OpsGenie", "Email"]
    },
    "alert_severity_map": {
      "30": "informational",
      "14": "warning",
      "7":  "warning",
      "0":  "critical"
    }
  }'

After this PUT, the next renewal-loop tick that finds a cert under this policy will fan out alerts as documented above.

Example: opt out of informational alerts

If your team doesn't want T-30 informational alerts (you'd rather hear about a cert only at warning tier and beyond):

"alert_channels": {
  "informational": [],
  "warning":       ["Email"],
  "critical":      ["PagerDuty", "Email"]
}

The empty informational list causes the dispatch loop to record an expiration_alert_skipped_no_channels audit row at T-30 and skip the dispatch. Other tiers still fire.


Operator playbook

"Did the on-call team get paged?"

SELECT created_at,
       metadata->>'channel'        AS channel,
       metadata->>'threshold_days' AS threshold,
       metadata->>'severity_tier'  AS severity
FROM audit_events
WHERE event_type = 'expiration_alert_sent'
  AND resource_id = '<cert-id>'
ORDER BY created_at DESC;

One row per (channel, threshold) attempt. A row with channel = 'PagerDuty' and severity = 'critical' means the alert was dispatched to the PagerDuty notifier. It does not by itself prove that the page was delivered or acknowledged.

"Why didn't I get an alert at T-7?"

Three places to look:

  1. Audit log: SELECT * FROM audit_events WHERE event_type IN ('expiration_alert_sent', 'expiration_alert_skipped_no_channels', 'expiration_alert_skipped_invalid_channel') AND resource_id = '<cert-id>'. If expiration_alert_skipped_no_channels appears, your policy's channel list is empty for the resolved tier. If expiration_alert_skipped_invalid_channel appears, your matrix has a typo (the metadata->>'invalid_channel' field tells you which value).

  2. Notifications table: SELECT * FROM notification_events WHERE certificate_id = '<cert-id>' AND type = 'ExpirationWarning' ORDER BY created_at DESC. If rows exist with channel = 'Slack' and status = 'failed', the dispatch reached the channel but the channel rejected the send. Look at the error column for the upstream message.

  3. Prometheus counters: curl https://certctl.example.com/api/v1/metrics/prometheus | grep certctl_expiry_alerts_total. Sustained {result="failure"} counts indicate a notifier connector misconfiguration (bad webhook URL, expired API key, etc.).
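
For the third check, a small program can total the failure samples per channel instead of eyeballing the grep output. The sketch below assumes the endpoint emits the standard Prometheus text format without per-sample timestamps. The sample input and the `threshold` label values are made up for illustration, and `failuresByChannel` is a hypothetical helper, not part of certctl.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

var (
	channelRe = regexp.MustCompile(`channel="([^"]*)"`)
	resultRe  = regexp.MustCompile(`result="([^"]*)"`)
)

// failuresByChannel sums certctl_expiry_alerts_total samples whose
// result label is "failure", grouped by the channel label.
func failuresByChannel(exposition string) map[string]float64 {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Skips "# HELP" / "# TYPE" comments and unrelated metrics.
		if !strings.HasPrefix(line, "certctl_expiry_alerts_total{") {
			continue
		}
		ch := channelRe.FindStringSubmatch(line)
		res := resultRe.FindStringSubmatch(line)
		idx := strings.LastIndex(line, " ")
		if ch == nil || res == nil || idx < 0 || res[1] != "failure" {
			continue
		}
		// Assumption: the sample value is the last space-separated field.
		v, err := strconv.ParseFloat(line[idx+1:], 64)
		if err != nil {
			continue
		}
		out[ch[1]] += v
	}
	return out
}

func main() {
	// Hypothetical scrape output, for illustration only.
	sample := `# TYPE certctl_expiry_alerts_total counter
certctl_expiry_alerts_total{channel="Email",threshold="7",result="success"} 12
certctl_expiry_alerts_total{channel="Slack",threshold="7",result="failure"} 3
certctl_expiry_alerts_total{channel="Slack",threshold="0",result="failure"} 2
`
	for ch, n := range failuresByChannel(sample) {
		fmt.Printf("%s failures: %.0f\n", ch, n)
	}
}
```

A channel whose failure total keeps growing between scrapes is the one whose connector configuration deserves a closer look.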

"How do I test the matrix without waiting for a real expiry?"

certctl ships an admin endpoint for this:

curl -X POST https://certctl.example.com/api/v1/admin/notifications/test \
  -H "Authorization: Bearer ${TOKEN}" \
  -H 'Content-Type: application/json' \
  -d '{
    "certificate_id": "mc-test-cert",
    "threshold_days": 0,
    "channel": "PagerDuty"
  }'

This calls NotificationService.SendThresholdAlertOnChannel directly and bypasses the renewal loop's threshold check. Useful for "did I configure PagerDuty correctly?" without having to set up a deliberately-expiring cert. The admin endpoint requires role=admin (V3-Pro RBAC); V2 deploys gate it on the bearer token only.

"How do I rotate a notifier credential without downtime?"

  1. Update the CERTCTL_PAGERDUTY_ROUTING_KEY (or equivalent) env var in your deployment.
  2. Restart certctl-server. The notifier registry rebuilds with the new credential.
  3. Confirm with the admin-test endpoint above against the cert you most care about.

The renewal loop is idempotent — a missed tick during the restart window does NOT cause double-dispatch on the next tick (per-channel dedup on the notification_events table guards against that).


Cardinality + cost

  • Default 6 channels × 4 thresholds × 3 results = 72 Prometheus series.
  • Custom-thresholds policies (e.g. [60, 45, 30, 14, 7, 3, 1, 0]) expand the threshold dimension proportionally — 6 × 8 × 3 = 144 series.
  • Closed-enum discipline at the dispatch site means typos in alert_channels do NOT grow this count.
  • A daily renewal-loop tick over 10K certs each policy-bound to the matrix above produces O(channels × thresholds × certs) audit rows + notification rows in the worst case (every cert has crossed every threshold and no dedup applies). Operators sizing Postgres should plan for an audit_events row count on the order of unique_certs × channels_per_critical_tier per fan-out batch, which is ~3-5× the pre-Rank-4 row count.
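
The arithmetic behind these figures is easy to reproduce. The sketch below uses two hypothetical helpers, `seriesCount` and `dispatchesPerCert`. The second counts how many channel dispatches one certificate generates if it crosses every threshold exactly once under the production matrix shown earlier in this runbook.

```go
package main

import "fmt"

// seriesCount is the Prometheus cardinality bound: channel × threshold × result.
func seriesCount(channels, thresholds, results int) int {
	return channels * thresholds * results
}

// dispatchesPerCert sums, over every threshold, the number of channels
// configured for the tier that threshold maps to.
func dispatchesPerCert(severity map[int]string, channels map[string][]string) int {
	total := 0
	for _, tier := range severity {
		total += len(channels[tier])
	}
	return total
}

func main() {
	fmt.Println(seriesCount(6, 4, 3)) // 72
	fmt.Println(seriesCount(6, 8, 3)) // 144

	severity := map[int]string{30: "informational", 14: "warning", 7: "warning", 0: "critical"}
	channels := map[string][]string{
		"informational": {"Slack"},
		"warning":       {"Slack", "Email"},
		"critical":      {"PagerDuty", "OpsGenie", "Email"},
	}
	perCert := dispatchesPerCert(severity, channels)
	fmt.Println(perCert, perCert*10000) // 8 80000
}
```

Under that matrix each certificate produces 1 + 2 + 2 + 3 = 8 dispatches across its four thresholds, so 10K certificates yield up to 80,000 audit rows over their alerting lifetime, before counting skipped or failed attempts.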

V3-Pro forward path

Tracked at cowork/WORKSPACE-ROADMAP.md under "Adapter hardening":

  • Per-owner / per-team / per-tenant channel routing (the matrix is per-policy today, not per-owner).
  • Calendar-aware suppression (no T-30 alerts on weekends for non-on-call teams).
  • Escalation chains (T-1 unanswered for 30m → escalate to manager's PagerDuty).
  • Per-channel rate limiting (downstream of I-005's retry+DLQ).