notifications: per-policy multi-channel expiry-alert routing

Closes Rank 4 of the 2026-05-03 Infisical deep-research deliverable
(see cowork/infisical-deep-research-results.md Part 5). Pre-fix,
RenewalService.CheckExpiringCertificates already ran daily,
RenewalPolicy.AlertThresholdsDays drove per-cert thresholds, and
NotificationService.SendThresholdAlert deduped per (cert, threshold)
— but the channel was hardcoded to Email
(internal/service/notification.go:118 pre-fix). Operators who
configured PagerDuty / Slack / Teams / OpsGenie via
CERTCTL_PAGERDUTY_ROUTING_KEY etc. got nothing at any threshold
unless SMTP was also wired. Their first signal of an expired cert
was a 3 AM outage.

This commit lands the routing matrix on top of the existing
infrastructure:

  1. RenewalPolicy gains AlertChannels (per-tier channel list) +
     AlertSeverityMap (per-threshold tier assignment) +
     EffectiveAlertChannels / EffectiveAlertSeverity accessors.
     Default*() helpers preserve the back-compat Email-only
     behaviour for operators who haven't touched their policies
     post-upgrade. Migration 000026 adds the JSONB columns
     idempotently.
  2. NotificationService.SendThresholdAlertOnChannel — the new
     per-channel dispatch helper. Old SendThresholdAlert stays as
     an Email-only alias so non-policy callers (admin "send test
     alert" surfaces) keep working byte-for-byte.
  3. NotificationService.HasThresholdNotificationOnChannel dedups
     per (cert, threshold, channel), so a transient PagerDuty 5xx
     today does NOT suppress today's Slack alert, and tomorrow's
     PagerDuty retry still fires.
  4. RenewalService.sendThresholdAlerts walks the resolved channel
     set per threshold tier, fans out to every configured channel,
     handles per-channel failures independently, defensively drops
     off-enum channels (leaving an audit row for each drop), and
     records a per-channel audit event with metadata.channel +
     metadata.severity_tier.
  5. service.ExpiryAlertMetrics — atomic counter table mirrored on
     the VaultRenewalMetrics shape from the 2026-05-03 audit fix #5
     (commit 0792271). Three labels: channel × threshold × result
     (success / failure / deduped). Cardinality bound: 6 × 4 × 3 =
     72 series for the standard 4-threshold matrix.
  6. handler.MetricsHandler.SetExpiryAlerts wires the Prometheus
     exposer for certctl_expiry_alerts_total{channel,threshold,result}.
     Pre-sorted snapshot for byte-stable emission.
  7. cmd/server/main.go threads ONE service.ExpiryAlertMetrics
     instance through both the recording side (notificationService.
     SetExpiryAlertMetrics) and the exposing side
     (metricsHandler.SetExpiryAlerts).
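
A minimal sketch of the item-1 accessors, to make the back-compat fallback concrete. Every name here (Tier, Policy, defaultChannels, defaultSeverity) is a hypothetical stand-in rather than the real domain package, and the 30/14/7/1 threshold-to-tier split is an assumption made only for illustration. The real accessors may differ in detail.

```go
package main

import "fmt"

// Tier is an illustrative stand-in for the severity tier type.
type Tier string

const (
	Informational Tier = "informational"
	Warning       Tier = "warning"
	Critical      Tier = "critical"
)

// Policy is a hypothetical, trimmed-down RenewalPolicy.
type Policy struct {
	AlertChannels    map[Tier][]string // per-tier channel list
	AlertSeverityMap map[int]Tier      // threshold in days -> tier
}

// defaultChannels is the back-compat matrix: every tier routes to Email only.
func defaultChannels() map[Tier][]string {
	return map[Tier][]string{
		Informational: {"Email"},
		Warning:       {"Email"},
		Critical:      {"Email"},
	}
}

// defaultSeverity uses ASSUMED thresholds; the real defaults are not shown here.
func defaultSeverity() map[int]Tier {
	return map[int]Tier{30: Informational, 14: Warning, 7: Warning, 1: Critical}
}

// EffectiveChannels falls back to the default matrix when the policy is
// nil or its matrix is empty, so untouched policies keep Email-only alerts.
func (p *Policy) EffectiveChannels() map[Tier][]string {
	if p == nil || len(p.AlertChannels) == 0 {
		return defaultChannels()
	}
	return p.AlertChannels
}

// EffectiveSeverity resolves a threshold to a tier: policy map first, then
// the default map, then "informational" so an unknown threshold still alerts.
func (p *Policy) EffectiveSeverity(threshold int) Tier {
	if p != nil {
		if t, ok := p.AlertSeverityMap[threshold]; ok {
			return t
		}
	}
	if t, ok := defaultSeverity()[threshold]; ok {
		return t
	}
	return Informational
}

func main() {
	var untouched *Policy // nil policy: cert with no RenewalPolicy attached
	fmt.Println(untouched.EffectiveChannels()[Critical]) // [Email]

	custom := &Policy{AlertChannels: map[Tier][]string{Critical: {"PagerDuty", "Email"}}}
	fmt.Println(custom.EffectiveChannels()[Critical])     // [PagerDuty Email]
	fmt.Println(len(custom.EffectiveChannels()[Warning])) // 0: tier opted out
	fmt.Println(custom.EffectiveSeverity(1))              // critical
}
```

In this sketch a tier missing from a non-empty matrix resolves to an empty channel list, which is how an operator opt-out of a tier would surface to the dispatch loop.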

Dispatch flow (post-fix, per renewal-loop tick):

  cert ages past T-30  → daily renewal-loop fires
                       → policy lookup
                       → for each crossed threshold:
                           - resolve severity tier (informational/
                             warning/critical) via AlertSeverityMap
                           - look up channel set in AlertChannels[tier]
                           - for each channel: dedup → SendThresholdAlertOnChannel
                             → notifierRegistry[channel] → audit row →
                             Prometheus counter increment

Tests (internal/service/renewal_expiry_alerts_test.go):

  TestExpiryAlerts_DefaultMatrix_EmailOnly
  TestExpiryAlerts_PerTierFanOut
  TestExpiryAlerts_PerChannelDedup
  TestExpiryAlerts_OneChannelFails_OthersStillFire
  TestExpiryAlerts_OffEnumChannelDropped
  TestExpiryAlerts_MetricCounterIncrements
  TestExpiryAlerts_NilPolicy_FallsToDefault
  TestExpiryAlerts_OperatorOptOutOfTier

The PerTierFanOut test wires 6 mock notifiers, drives a cert at 0
days through the canonical 4 thresholds with the matrix
{informational:[Slack], warning:[Slack,Email],
critical:[PagerDuty,OpsGenie,Email]}, and asserts the exact
recipient counts: Slack=3, Email=3, PagerDuty=1, OpsGenie=1, no
Teams, no Webhook. The OneChannelFails test pins that PagerDuty
returning a 503 does NOT skip Slack/Email at the same threshold.

Drive-by fix (internal/service/testutil_test.go): the existing
mockNotifRepo.List ignored its filter and returned all rows, which
let legacy tests pass on dedup-via-substring even though the
postgres repo actually applied the filter. Updated the mock to
honour CertificateID / Type / Status / Channel / MessageLike
filters in the same shape as the postgres implementation
(internal/repository/postgres/notification.go). All pre-existing
service tests still pass: none of the legacy tests turned out to
depend on the mock ignoring its filter.
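
A rough sketch of what a filter-honouring mock looks like. The Notification and Filter shapes, the mockRepo name and the List signature are hypothetical, and treating MessageLike as a plain substring match is an assumption about how the postgres query behaves. Empty filter fields are treated as "no constraint".

```go
package main

import (
	"fmt"
	"strings"
)

// Notification and Filter are hypothetical, trimmed-down shapes.
type Notification struct {
	CertificateID, Type, Status, Channel, Message string
}

type Filter struct {
	CertificateID, Type, Status, Channel, MessageLike string
}

// matches applies every non-empty filter field; all set fields must match.
func matches(n Notification, f Filter) bool {
	if f.CertificateID != "" && n.CertificateID != f.CertificateID {
		return false
	}
	if f.Type != "" && n.Type != f.Type {
		return false
	}
	if f.Status != "" && n.Status != f.Status {
		return false
	}
	if f.Channel != "" && n.Channel != f.Channel {
		return false
	}
	// ASSUMPTION: MessageLike behaves like a substring match.
	if f.MessageLike != "" && !strings.Contains(n.Message, f.MessageLike) {
		return false
	}
	return true
}

type mockRepo struct{ rows []Notification }

// List returns only the rows that satisfy the filter, instead of all rows.
func (m *mockRepo) List(f Filter) []Notification {
	var out []Notification
	for _, n := range m.rows {
		if matches(n, f) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	repo := &mockRepo{rows: []Notification{
		{"cert-1", "expiry", "sent", "Slack", "expires in 7 days"},
		{"cert-1", "expiry", "failed", "PagerDuty", "expires in 7 days"},
		{"cert-2", "expiry", "sent", "Email", "expires in 30 days"},
	}}
	// Per-channel dedup lookup: was the 7-day PagerDuty alert already sent?
	got := repo.List(Filter{
		CertificateID: "cert-1", Channel: "PagerDuty",
		Status: "sent", MessageLike: "7 days",
	})
	fmt.Println(len(got)) // 0: the PagerDuty row is "failed", so a retry is allowed
}
```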

Documentation:
  - docs/connectors.md Notifier section gains "Routing expiry
    alerts across channels" — operator-facing, JSON example,
    procurement playbook ("How do I make sure PagerDuty pages on
    the T-1 alert?"), debug recipe via SQL on audit_events +
    notification_events + Prometheus.
  - docs/runbook-expiry-alerts.md — sysadmin-grade flowchart,
    per-policy channel-matrix configuration recipes, "did the on-
    call team get paged?" SQL queries, cardinality budget, V3-Pro
    forward path.
  - cowork/WORKSPACE-ROADMAP.md gains "Multi-channel expiry
    alerts: per-owner routing" V3-Pro entry under Adapter
    hardening.

Out of scope (intentional, flagged in V3-Pro forward path):
  - Per-owner / per-team / per-tenant channel routing (matrix is
    per-policy today, not per-owner).
  - Calendar-aware suppression (no T-30 alerts on weekends).
  - Escalation chains (T-1 unanswered for 30m → escalate).
  - Per-channel rate limiting (downstream of I-005 retry+DLQ).

CHANGELOG.md is intentionally not hand-edited per CHANGELOG.md
itself ("no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/domain/... ./internal/service/...
  ./internal/api/handler/... ./cmd/server/...  clean.
  (./internal/repository/postgres/... vet failed on transitive
  testcontainers/docker module download — sandbox disk pressure,
  not a code issue; postgres-repo build succeeds and tests pass.)
- go test -short -count=1 ./internal/domain/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestExpiryAlerts'
  ./internal/service/...  green (per-channel dedup race-free).

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 4.
Acquisition prompt: cowork/rank-4-multichannel-expiry-alerts-prompt.md.
shankar0123
2026-05-03 22:12:32 +00:00
parent 022caf39b4
commit 109f32ff41
13 changed files with 1694 additions and 37 deletions
+131 -25
@@ -200,8 +200,16 @@ func (s *RenewalService) CheckExpiringCertificates(ctx context.Context) error {
 		// Update certificate status based on expiry
 		s.updateCertExpiryStatus(ctx, cert, daysUntil)
-		// Send threshold-based alerts with deduplication
-		s.sendThresholdAlerts(ctx, cert, int(daysUntil), thresholds)
+		// Send threshold-based alerts with per-channel deduplication. The
+		// policy pointer (nil-safe) drives the per-(threshold) channel
+		// matrix; nil policy or empty AlertChannels falls through to the
+		// back-compat Email-only default. Rank 4 of the 2026-05-03
+		// Infisical deep-research deliverable.
+		var policyPtr *domain.RenewalPolicy
+		if cert.RenewalPolicyID != "" {
+			policyPtr = policyCache[cert.RenewalPolicyID]
+		}
+		s.sendThresholdAlerts(ctx, cert, int(daysUntil), thresholds, policyPtr)
 		// Only create renewal job if an issuer connector is registered for this cert's issuer
 		connector, hasIssuer := s.issuerRegistry.Get(cert.IssuerID)
@@ -289,40 +297,138 @@ func (s *RenewalService) CheckExpiringCertificates(ctx context.Context) error {
 	return nil
 }
-// sendThresholdAlerts sends deduplicated expiration notifications based on configured thresholds.
-// For each threshold that the certificate has crossed (e.g., ≤30 days, ≤14 days), it checks
-// whether a notification for that threshold was already sent. Only new threshold crossings
-// trigger notifications.
-func (s *RenewalService) sendThresholdAlerts(ctx context.Context, cert *domain.ManagedCertificate, daysUntil int, thresholds []int) {
+// sendThresholdAlerts sends deduplicated expiration notifications based on
+// configured thresholds AND the per-policy channel matrix. For each
+// threshold that the certificate has crossed (e.g., ≤30 days, ≤14 days),
+// the dispatch loop:
+//
+//  1. Resolves the threshold's severity tier from the policy's
+//     AlertSeverityMap (or DefaultAlertSeverityMap if unset / off-map).
+//  2. Looks up the channel set for that tier in the policy's AlertChannels
+//     (or DefaultAlertChannels — Email-only — if unset / empty).
+//  3. For each resolved channel, defensively re-validates against the
+//     closed-enum NotificationChannel set (off-enum values silently drop
+//     with an audit row so an operator can grep + fix the typo without
+//     us silently dynamic-cardinality-growing the Prometheus counter).
+//  4. Per-(cert, threshold, channel) dedup via
+//     HasThresholdNotificationOnChannel — a successful PagerDuty page
+//     yesterday won't fire again today, but a transient PagerDuty 5xx
+//     today does NOT suppress today's Slack and tomorrow's PagerDuty
+//     retry will still fire (the failed row stays "failed" in the DB,
+//     not "sent").
+//  5. SendThresholdAlertOnChannel persists the notification row (channel
+//     column populated), reports the metric, and dispatches.
+//  6. Per-channel audit row so an operator can SQL-grep
+//     audit_events WHERE event_type='expiration_alert_sent'
+//     AND metadata->>'channel' = 'PagerDuty' to answer "did the on-call
+//     team get paged?".
+//
+// Rank 4 of the 2026-05-03 Infisical deep-research deliverable
+// (cowork/infisical-deep-research-results.md Part 5). The policy
+// argument is nil-safe — a cert with no RenewalPolicy attached gets the
+// back-compat Email-only default matrix.
+func (s *RenewalService) sendThresholdAlerts(
+	ctx context.Context, cert *domain.ManagedCertificate, daysUntil int,
+	thresholds []int, policy *domain.RenewalPolicy,
+) {
+	channelMatrix := domain.DefaultAlertChannels()
+	if policy != nil {
+		channelMatrix = policy.EffectiveAlertChannels()
+	}
 	for _, threshold := range thresholds {
 		// Only alert if the cert has crossed this threshold (days remaining ≤ threshold)
 		if daysUntil > threshold {
 			continue
 		}
-		// Check if we already sent a notification for this threshold (deduplication)
-		alreadySent, err := s.notificationSvc.HasThresholdNotification(ctx, cert.ID, threshold)
-		if err != nil {
-			slog.Error("failed to check notification dedup", "cert_id", cert.ID, "threshold", threshold, "error", err)
-			continue
+		tier := domain.AlertSeverityInformational
+		if policy != nil {
+			tier = policy.EffectiveAlertSeverity(threshold)
+		} else if t, ok := domain.DefaultAlertSeverityMap()[threshold]; ok {
+			tier = t
 		}
-		if alreadySent {
+		// Defensive: an unknown tier (operator typo that survived
+		// validation, or a future tier name added in a later schema)
+		// drops to "informational" so we still alert on SOMETHING
+		// rather than silently swallowing the threshold.
+		if !domain.IsValidAlertSeverityTier(tier) {
+			tier = domain.AlertSeverityInformational
+		}
+		channels := channelMatrix[tier]
+		if len(channels) == 0 {
+			// Operator opted out of this tier (or matrix has no entry
+			// for the tier). Skip silently — record-empty audit row to
+			// surface the opt-out in the audit log.
+			_ = s.auditService.RecordEvent(ctx, "system", domain.ActorTypeSystem,
+				"expiration_alert_skipped_no_channels", "certificate", cert.ID,
+				map[string]interface{}{
+					"threshold_days":    threshold,
+					"days_until_expiry": daysUntil,
+					"severity_tier":     tier,
+				})
 			continue
 		}
-		// Send the threshold alert
-		if err := s.notificationSvc.SendThresholdAlert(ctx, cert, daysUntil, threshold); err != nil {
-			slog.Error("failed to send threshold alert for cert", "cert_id", cert.ID, "threshold", threshold, "error", err)
-		}
+		for _, ch := range channels {
+			// Defensive validation: the policy validation path rejects
+			// off-enum values at write time, but a stored row could
+			// drift across a schema change. Drop off-enum values here
+			// rather than letting them through to a dispatch site that
+			// would either fail the Send call or grow Prometheus
+			// cardinality. Audit the drop so operators see the typo.
+			if !domain.IsValidNotificationChannel(ch) {
+				_ = s.auditService.RecordEvent(ctx, "system", domain.ActorTypeSystem,
+					"expiration_alert_skipped_invalid_channel", "certificate", cert.ID,
+					map[string]interface{}{
+						"threshold_days":    threshold,
+						"days_until_expiry": daysUntil,
+						"severity_tier":     tier,
+						"invalid_channel":   ch,
+					})
+				continue
+			}
-		// Record audit event for the alert
-		if auditErr := s.auditService.RecordEvent(ctx, "system", domain.ActorTypeSystem,
-			"expiration_alert_sent", "certificate", cert.ID,
-			map[string]interface{}{
-				"threshold_days":    threshold,
-				"days_until_expiry": daysUntil,
-			}); auditErr != nil {
-			slog.Error("failed to record audit event", "error", auditErr)
+			channel := domain.NotificationChannel(ch)
+			alreadySent, err := s.notificationSvc.HasThresholdNotificationOnChannel(
+				ctx, cert.ID, threshold, channel,
+			)
+			if err != nil {
+				slog.Error("failed to check notification dedup",
+					"cert_id", cert.ID, "threshold", threshold,
+					"channel", ch, "error", err)
+				continue
+			}
+			if alreadySent {
+				s.notificationSvc.RecordExpiryAlertDeduped(ch, threshold)
+				continue
+			}
+			if err := s.notificationSvc.SendThresholdAlertOnChannel(
+				ctx, cert, daysUntil, threshold, channel,
+			); err != nil {
+				slog.Error("failed to send threshold alert",
+					"cert_id", cert.ID, "threshold", threshold,
+					"channel", ch, "error", err)
+				// continue — other channels still fire
+			}
+			// Per-(cert, threshold, channel) audit row. Operators alert
+			// on the channel-labelled row to confirm a specific pager
+			// went out.
+			if auditErr := s.auditService.RecordEvent(ctx, "system",
+				domain.ActorTypeSystem, "expiration_alert_sent",
+				"certificate", cert.ID,
+				map[string]interface{}{
+					"threshold_days":    threshold,
+					"days_until_expiry": daysUntil,
+					"channel":           ch,
+					"severity_tier":     tier,
+				}); auditErr != nil {
+				slog.Error("failed to record audit event", "error", auditErr)
+			}
 		}
 	}
 }