mirror of https://github.com/shankar0123/certctl.git (synced 2026-06-12 10:48:58 +00:00)
notifications: per-policy multi-channel expiry-alert routing
Closes Rank 4 of the 2026-05-03 Infisical deep-research deliverable
(see cowork/infisical-deep-research-results.md Part 5). Pre-fix,
RenewalService.CheckExpiringCertificates already ran daily,
RenewalPolicy.AlertThresholdsDays drove per-cert thresholds, and
NotificationService.SendThresholdAlert deduped per (cert, threshold)
— but the channel was hardcoded to Email
(internal/service/notification.go:118 pre-fix). Operators who
configured PagerDuty / Slack / Teams / OpsGenie via
CERTCTL_PAGERDUTY_ROUTING_KEY etc. got nothing at any threshold
unless SMTP was also wired. Their first signal of an expired cert
was a 3 AM outage.
This commit lands the routing matrix on top of the existing
infrastructure:
1. RenewalPolicy gains AlertChannels (per-tier channel list) +
   AlertSeverityMap (per-threshold tier assignment) +
   EffectiveAlertChannels / EffectiveAlertSeverity accessors.
   Default*() helpers preserve the back-compat Email-only
   behaviour for operators who haven't touched their policies
   post-upgrade. Migration 000026 adds the JSONB columns
   idempotently.
2. NotificationService.SendThresholdAlertOnChannel: the new
   per-channel dispatch helper. Old SendThresholdAlert stays as
   an Email-only alias so non-policy callers (admin "send test
   alert" surfaces) keep working byte-for-byte.
3. NotificationService.HasThresholdNotificationOnChannel:
   per-(cert, threshold, channel) deduplication, so a transient
   PagerDuty 5xx today does NOT suppress today's Slack alert and
   tomorrow's PagerDuty retry will still fire.
4. RenewalService.sendThresholdAlerts walks the resolved channel
   set per threshold tier, fans out to every configured channel,
   handles per-channel failures independently, defensively drops
   off-enum channels (leaving an audit row), and records a
   per-channel audit event with metadata.channel +
   metadata.severity_tier.
5. service.ExpiryAlertMetrics: atomic counter table modelled on
   the VaultRenewalMetrics shape from the 2026-05-03 audit fix #5
   (commit 0792271). Three labels: channel × threshold × result
   (success / failure / deduped). Cardinality bound: 6 channels ×
   4 thresholds × 3 results = 72 series for the standard
   4-threshold matrix.
6. handler.MetricsHandler.SetExpiryAlerts wires the Prometheus
   exposer for certctl_expiry_alerts_total{channel,threshold,result}.
   Pre-sorted snapshot for byte-stable emission.
7. cmd/server/main.go threads ONE service.ExpiryAlertMetrics
   instance through both the recording side
   (notificationService.SetExpiryAlertMetrics) and the exposing
   side (metricsHandler.SetExpiryAlerts).
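
Item 1 can be illustrated with a minimal sketch of how the accessors might
resolve defaults. Only the names AlertChannels, AlertSeverityMap,
EffectiveAlertChannels and EffectiveAlertSeverity come from this commit
message. The Channel and Tier types, the constants, and the fallback rules
(Email-only when the matrix is unset, an empty list meaning opt-out,
unmapped thresholds treated as informational) are assumptions made for
illustration and may differ from the real code in internal/domain.

```go
package main

import "fmt"

// Channel and Tier are illustrative stand-ins for the real domain types.
type Channel string
type Tier string

const (
	Email     Channel = "Email"
	Slack     Channel = "Slack"
	PagerDuty Channel = "PagerDuty"

	Informational Tier = "informational"
	Warning       Tier = "warning"
	Critical      Tier = "critical"
)

type RenewalPolicy struct {
	AlertChannels    map[Tier][]Channel // per-tier channel list
	AlertSeverityMap map[int]Tier       // per-threshold tier assignment
}

// EffectiveAlertChannels returns the channels for a tier. Assumption: a nil
// policy, an unset matrix, or a missing tier falls back to Email-only, while
// a tier present with an empty list is an explicit opt-out.
func (p *RenewalPolicy) EffectiveAlertChannels(t Tier) []Channel {
	if p == nil || p.AlertChannels == nil {
		return []Channel{Email}
	}
	chans, ok := p.AlertChannels[t]
	if !ok {
		return []Channel{Email}
	}
	return chans
}

// EffectiveAlertSeverity returns the tier for a threshold. Assumption:
// thresholds without an explicit mapping are treated as informational.
func (p *RenewalPolicy) EffectiveAlertSeverity(threshold int) Tier {
	if p == nil {
		return Informational
	}
	if t, ok := p.AlertSeverityMap[threshold]; ok {
		return t
	}
	return Informational
}

func main() {
	var untouched *RenewalPolicy // operator never configured the matrix
	fmt.Println(untouched.EffectiveAlertChannels(Critical))

	custom := &RenewalPolicy{
		AlertChannels: map[Tier][]Channel{
			Critical:      {PagerDuty, Email},
			Informational: {}, // opted out of informational alerts
		},
		AlertSeverityMap: map[int]Tier{1: Critical},
	}
	fmt.Println(custom.EffectiveAlertChannels(custom.EffectiveAlertSeverity(1)))
	fmt.Println(len(custom.EffectiveAlertChannels(Informational)))
}
```

Keeping the fallback inside the accessors means call sites never branch on
whether a policy was migrated, which is one way to get the back-compat
behaviour described above.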
Dispatch flow (post-fix, per renewal-loop tick):

  cert ages past T-30 → daily renewal-loop fires
    → policy lookup
    → for each crossed threshold:
        - resolve severity tier (informational/warning/critical)
          via AlertSeverityMap
        - look up channel set in AlertChannels[tier]
        - for each channel: dedup → SendThresholdAlertOnChannel
          → notifierRegistry[channel] → audit row →
          Prometheus counter increment
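
The per-channel part of this flow can be sketched as a small self-contained
loop. Everything here (dispatcher, sendFunc, errDown, the in-memory dedup
set) is hypothetical scaffolding; the real path goes through
SendThresholdAlertOnChannel, the notification repository and audit rows.
The sketch assumes a channel is marked as delivered only after a successful
send, which is one way to get the "failed today, retried tomorrow"
behaviour.

```go
package main

import (
	"errors"
	"fmt"
)

// sendFunc is a hypothetical stand-in for one registered notifier.
type sendFunc func(certID string, threshold int) error

// errDown simulates a transient upstream failure such as a PagerDuty 503.
var errDown = errors.New("channel unavailable")

type dispatcher struct {
	notifiers map[string]sendFunc
	delivered map[string]bool // dedup set keyed by cert|threshold|channel
	counts    map[string]int  // channel|threshold|result -> count
}

func newDispatcher(notifiers map[string]sendFunc) *dispatcher {
	return &dispatcher{
		notifiers: notifiers,
		delivered: map[string]bool{},
		counts:    map[string]int{},
	}
}

func (d *dispatcher) record(channel string, threshold int, result string) {
	d.counts[fmt.Sprintf("%s|%d|%s", channel, threshold, result)]++
}

// fanOut sends one crossed threshold to every channel in the tier's list.
// Each channel is handled independently: a failure or an unknown channel
// never stops the remaining channels from being tried.
func (d *dispatcher) fanOut(certID string, threshold int, channels []string) {
	for _, ch := range channels {
		key := fmt.Sprintf("%s|%d|%s", certID, threshold, ch)
		if d.delivered[key] {
			d.record(ch, threshold, "deduped")
			continue
		}
		send, ok := d.notifiers[ch]
		if !ok {
			d.record(ch, threshold, "failure") // unknown channel: drop and move on
			continue
		}
		if err := send(certID, threshold); err != nil {
			d.record(ch, threshold, "failure")
			continue
		}
		d.delivered[key] = true
		d.record(ch, threshold, "success")
	}
}

func main() {
	pagerDutyDown := true
	d := newDispatcher(map[string]sendFunc{
		"Slack": func(string, int) error { return nil },
		"PagerDuty": func(string, int) error {
			if pagerDutyDown {
				return errDown
			}
			return nil
		},
	})
	tier := []string{"PagerDuty", "Slack"}

	d.fanOut("cert-1", 7, tier) // day 1: PagerDuty fails, Slack still fires
	pagerDutyDown = false
	d.fanOut("cert-1", 7, tier) // day 2: PagerDuty is retried, Slack is deduped

	fmt.Println(d.counts)
}
```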
Tests (internal/service/renewal_expiry_alerts_test.go):
TestExpiryAlerts_DefaultMatrix_EmailOnly
TestExpiryAlerts_PerTierFanOut
TestExpiryAlerts_PerChannelDedup
TestExpiryAlerts_OneChannelFails_OthersStillFire
TestExpiryAlerts_OffEnumChannelDropped
TestExpiryAlerts_MetricCounterIncrements
TestExpiryAlerts_NilPolicy_FallsToDefault
TestExpiryAlerts_OperatorOptOutOfTier
The PerTierFanOut test wires 6 mock notifiers, drives a cert at 0
days through the canonical 4 thresholds with the matrix
{informational:[Slack], warning:[Slack,Email],
critical:[PagerDuty,OpsGenie,Email]}, and asserts the exact
per-channel send counts: Slack=3, Email=3, PagerDuty=1, OpsGenie=1,
and none for Teams or Webhook. The OneChannelFails test pins that
PagerDuty returning a 503 does NOT skip Slack/Email at the same
threshold.
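
The expected counts can be reproduced with a few lines of arithmetic. The
channel matrix below is the one quoted above. The threshold values
(30/14/7/1) and their tier assignment are assumptions chosen to be
consistent with the asserted totals, since this message does not spell them
out.

```go
package main

import "fmt"

// countSends tallies how many alerts each channel receives when every
// threshold in the list has been crossed exactly once.
func countSends(thresholds []int, severity map[int]string, matrix map[string][]string) map[string]int {
	counts := map[string]int{}
	for _, t := range thresholds {
		for _, ch := range matrix[severity[t]] {
			counts[ch]++
		}
	}
	return counts
}

func main() {
	// Assumed thresholds and tier assignment (not stated in the commit).
	thresholds := []int{30, 14, 7, 1}
	severity := map[int]string{
		30: "informational",
		14: "warning",
		7:  "warning",
		1:  "critical",
	}
	// Channel matrix as quoted in the test description.
	matrix := map[string][]string{
		"informational": {"Slack"},
		"warning":       {"Slack", "Email"},
		"critical":      {"PagerDuty", "OpsGenie", "Email"},
	}
	// prints map[Email:3 OpsGenie:1 PagerDuty:1 Slack:3]
	fmt.Println(countSends(thresholds, severity, matrix))
}
```

Under these assumptions Slack fires for one informational and two warning
thresholds, Email for two warning and one critical, and the paging channels
only once, matching the asserted counts.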
Drive-by fix (internal/service/testutil_test.go): the existing
mockNotifRepo.List ignored its filter and returned all rows, so
legacy dedup-via-substring tests passed without exercising the
filtering that the postgres repo actually applies. Updated the
mock to honour CertificateID / Type / Status / Channel /
MessageLike filters in the same shape as the postgres
implementation (internal/repository/postgres/notification.go).
All pre-existing service tests still pass with the filter
enforced; none of them depended on the mock returning unfiltered
rows.
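
A filter-honouring mock might look roughly like the sketch below. The
notification and filter structs are simplified stand-ins, the type value
"ExpirationWarning" is illustrative, and MessageLike handling is an
assumption that covers only the "%substring%" form of LIKE used by the
threshold tag.

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified, hypothetical stand-ins for the repository types. The field
// names follow the filter fields listed in the commit message.
type notification struct {
	CertificateID, Type, Status, Channel, Message string
}

type filter struct {
	CertificateID, Type, Status, Channel, MessageLike string
}

// matches applies every non-empty filter field; all of them must agree.
func matches(n notification, f filter) bool {
	if f.CertificateID != "" && n.CertificateID != f.CertificateID {
		return false
	}
	if f.Type != "" && n.Type != f.Type {
		return false
	}
	if f.Status != "" && n.Status != f.Status {
		return false
	}
	if f.Channel != "" && n.Channel != f.Channel {
		return false
	}
	if f.MessageLike != "" {
		needle := strings.Trim(f.MessageLike, "%") // "%x%" -> "x"
		if !strings.Contains(n.Message, needle) {
			return false
		}
	}
	return true
}

// hasAlert mimics a per-(cert, threshold, channel) dedup lookup.
func hasAlert(rows []notification, certID string, threshold int, channel string) bool {
	f := filter{
		CertificateID: certID,
		Type:          "ExpirationWarning",
		Channel:       channel,
		MessageLike:   fmt.Sprintf("%%[threshold:%d]%%", threshold),
	}
	for _, n := range rows {
		if matches(n, f) {
			return true
		}
	}
	return false
}

func main() {
	rows := []notification{{
		CertificateID: "cert-1",
		Type:          "ExpirationWarning",
		Status:        "sent",
		Channel:       "Slack",
		Message:       "certificate expires in 14 days [threshold:14]",
	}}
	fmt.Println(hasAlert(rows, "cert-1", 14, "Slack"))     // same triple
	fmt.Println(hasAlert(rows, "cert-1", 14, "PagerDuty")) // other channel
	fmt.Println(hasAlert(rows, "cert-1", 1, "Slack"))      // other threshold
}
```

The closing bracket in the tag matters: "[threshold:1]" is not a substring
of "[threshold:14]", so a 1-day lookup cannot be satisfied by a 14-day row.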
Documentation:
- docs/connectors.md Notifier section gains "Routing expiry
  alerts across channels": operator-facing, JSON example,
  procurement playbook ("How do I make sure PagerDuty pages on
  the T-1 alert?"), debug recipe via SQL on audit_events +
  notification_events + Prometheus.
- docs/runbook-expiry-alerts.md: sysadmin-grade flowchart,
  per-policy channel-matrix configuration recipes, "did the
  on-call team get paged?" SQL queries, cardinality budget,
  V3-Pro forward path.
- cowork/WORKSPACE-ROADMAP.md gains "Multi-channel expiry
  alerts: per-owner routing" V3-Pro entry under Adapter
  hardening.
Out of scope (intentional, flagged in V3-Pro forward path):
- Per-owner / per-team / per-tenant channel routing (matrix is
per-policy today, not per-owner).
- Calendar-aware suppression (no T-30 alerts on weekends).
- Escalation chains (T-1 unanswered for 30m → escalate).
- Per-channel rate limiting (downstream of I-005 retry+DLQ).
CHANGELOG.md is intentionally not hand-edited per CHANGELOG.md
itself ("no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").
Verified locally:
- gofmt clean.
- go vet ./internal/domain/... ./internal/service/...
  ./internal/api/handler/... ./cmd/server/... clean.
  (./internal/repository/postgres/... vet failed on transitive
  testcontainers/docker module download; sandbox disk pressure,
  not a code issue. postgres-repo build succeeds and tests pass.)
- go test -short -count=1 ./internal/domain/...
  ./internal/service/... ./internal/api/handler/... green.
- go test -race -count=10 -run 'TestExpiryAlerts'
  ./internal/service/... green (per-channel dedup race-free).
Reference: cowork/infisical-deep-research-results.md Part 5 Rank 4.
Acquisition prompt: cowork/rank-4-multichannel-expiry-alerts-prompt.md.
@@ -49,6 +49,53 @@ type NotificationService struct {
 	notifRepo repository.NotificationRepository
 	ownerRepo repository.OwnerRepository
 	notifierRegistry map[string]Notifier
+
+	// expiryAlertMetrics — when set via SetExpiryAlertMetrics, every call
+	// to SendThresholdAlertOnChannel reports its outcome (success / failure)
+	// to the metric sink so the Prometheus exposer surfaces
+	// certctl_expiry_alerts_total{channel,threshold,result}. Rank 4 of the
+	// 2026-05-03 Infisical deep-research deliverable. Nil leaves the
+	// dispatch path unchanged (no metric emission, but alerts still fire).
+	expiryAlertMetrics ExpiryAlertRecorder
+}
+
+// ExpiryAlertRecorder is the metric-sink surface SendThresholdAlertOnChannel
+// uses. result is one of: "success", "failure", "deduped". Implementations
+// MUST be goroutine-safe — RecordExpiryAlert is called from the renewal
+// loop's own goroutine on every threshold-channel tick.
+//
+// service.ExpiryAlertMetrics satisfies this interface. cmd/server wires
+// the same instance into the service (recording side) and into
+// MetricsHandler (exposing side, for the Prometheus emitter).
+type ExpiryAlertRecorder interface {
+	RecordExpiryAlert(channel string, threshold int, result string)
+}
+
+// SetExpiryAlertMetrics wires the per-(channel, threshold, result) counter
+// table for expiry-alert dispatch. Pass nil to disable recording. Safe to
+// call before any SendThresholdAlertOnChannel call; calling later just
+// means earlier calls didn't increment the counters.
+func (s *NotificationService) SetExpiryAlertMetrics(r ExpiryAlertRecorder) {
+	s.expiryAlertMetrics = r
+}
+
+// recordExpiryAlert is the internal hook used by SendThresholdAlertOnChannel
+// to report per-(channel, threshold, result) counts. Nil-safe.
+func (s *NotificationService) recordExpiryAlert(channel string, threshold int, result string) {
+	if s == nil || s.expiryAlertMetrics == nil {
+		return
+	}
+	s.expiryAlertMetrics.RecordExpiryAlert(channel, threshold, result)
+}
+
+// RecordExpiryAlertDeduped is the public hook RenewalService uses to report
+// (channel, threshold, "deduped") — dedup happens before
+// SendThresholdAlertOnChannel runs, so the call site is in the caller, not
+// the dispatch helper. Kept on NotificationService rather than exposed on
+// the recorder directly so callers don't need to know whether the recorder
+// is wired.
+func (s *NotificationService) RecordExpiryAlertDeduped(channel string, threshold int) {
+	s.recordExpiryAlert(channel, threshold, "deduped")
 }
 
 // Notifier defines the interface for notification channels (email, Slack, webhooks, etc.).
@@ -94,9 +141,48 @@ func (s *NotificationService) SendExpirationWarning(ctx context.Context, cert *d
 	return s.SendThresholdAlert(ctx, cert, daysUntilExpiry, daysUntilExpiry)
 }
 
-// SendThresholdAlert sends an expiration alert for a specific threshold (e.g., 30-day, 14-day, expired).
-// The threshold parameter indicates which configured threshold triggered the alert.
+// SendThresholdAlert sends an expiration alert for a specific threshold via
+// the Email channel. Preserved for backwards-compat with non-policy callers
+// (admin "send test alert" surfaces in the GUI, etc.); equivalent to
+// SendThresholdAlertOnChannel(ctx, cert, days, threshold,
+// domain.NotificationChannelEmail).
+//
+// Policy-driven dispatch in RenewalService.sendThresholdAlerts uses
+// SendThresholdAlertOnChannel directly with the channel resolved from the
+// per-policy AlertChannels matrix. Rank 4 of the 2026-05-03 Infisical
+// deep-research deliverable.
 func (s *NotificationService) SendThresholdAlert(ctx context.Context, cert *domain.ManagedCertificate, daysUntilExpiry int, threshold int) error {
+	return s.SendThresholdAlertOnChannel(ctx, cert, daysUntilExpiry, threshold, domain.NotificationChannelEmail)
+}
+
+// SendThresholdAlertOnChannel sends an expiration alert for a specific
+// (cert, threshold, channel) triple. The channel must be one of the
+// closed-enum NotificationChannel values; off-enum channels surface as a
+// failure metric increment + ERROR log + a wrapped error so the caller can
+// react (typically: log and continue with the next channel in the
+// policy's tier list — see RenewalService.sendThresholdAlerts).
+//
+// The notification record is persisted with the channel field set to the
+// requested value, and the message body carries the [threshold:N] tag for
+// dedup at HasThresholdNotification's substring filter. Combined with the
+// repository.NotificationFilter.Channel field, this gives us per-(cert,
+// threshold, channel) dedup so a transient PagerDuty 5xx today does NOT
+// suppress today's Slack delivery and tomorrow's PagerDuty retry will
+// still fire.
+//
+// Result is reported to expiryAlertMetrics (when wired): "success" on
+// successful send, "failure" on send error or persistence error.
+// "deduped" results are reported by the caller (sendThresholdAlerts) since
+// dedup happens before this method runs.
+func (s *NotificationService) SendThresholdAlertOnChannel(
+	ctx context.Context, cert *domain.ManagedCertificate, daysUntilExpiry int,
+	threshold int, channel domain.NotificationChannel,
+) error {
+	if !domain.IsValidNotificationChannel(string(channel)) {
+		s.recordExpiryAlert(string(channel), threshold, "failure")
+		return fmt.Errorf("invalid notification channel %q for threshold %d", channel, threshold)
+	}
+
 	var body string
 	if threshold <= 0 {
 		body = fmt.Sprintf(
@@ -110,12 +196,11 @@ func (s *NotificationService) SendThresholdAlert(ctx context.Context, cert *doma
 		)
 	}
 
-	// Create notification record — resolve owner email if possible
 	notif := &domain.NotificationEvent{
 		ID: generateID("notif"),
 		CertificateID: &cert.ID,
 		Type: domain.NotificationTypeExpirationWarning,
-		Channel: domain.NotificationChannelEmail,
+		Channel: channel,
 		Recipient: s.resolveRecipient(ctx, cert.OwnerID),
 		Message: body,
 		Status: "pending",
@@ -123,20 +208,52 @@ func (s *NotificationService) SendThresholdAlert(ctx context.Context, cert *doma
 	}
 
 	if err := s.notifRepo.Create(ctx, notif); err != nil {
+		s.recordExpiryAlert(string(channel), threshold, "failure")
 		return fmt.Errorf("failed to create notification: %w", err)
 	}
 
-	// Attempt immediate send
-	return s.sendNotification(ctx, notif)
+	if err := s.sendNotification(ctx, notif); err != nil {
+		s.recordExpiryAlert(string(channel), threshold, "failure")
+		return err
+	}
+	s.recordExpiryAlert(string(channel), threshold, "success")
+	return nil
 }
 
-// HasThresholdNotification checks whether an expiration warning has already been sent
-// for a specific certificate and threshold combination. Used for deduplication.
+// HasThresholdNotification checks whether an expiration warning has already
+// been sent for a specific (cert, threshold) pair via the Email channel.
+// Preserved for backwards-compat. Equivalent to
+// HasThresholdNotificationOnChannel(ctx, certID, threshold, "Email").
+//
+// New callers driven by the per-policy channel matrix should use
+// HasThresholdNotificationOnChannel directly with the explicit channel —
+// see RenewalService.sendThresholdAlerts.
 func (s *NotificationService) HasThresholdNotification(ctx context.Context, certID string, threshold int) (bool, error) {
+	return s.HasThresholdNotificationOnChannel(ctx, certID, threshold, domain.NotificationChannelEmail)
+}
+
+// HasThresholdNotificationOnChannel reports whether an ExpirationWarning
+// notification has already been persisted for a specific (cert, threshold,
+// channel) triple. Used to dedupe per-channel fan-out so a successful
+// PagerDuty page today doesn't fire again tomorrow when the renewal loop
+// re-checks the same threshold (and so a transient PagerDuty 5xx today
+// doesn't suppress tomorrow's successful retry).
+//
+// The match is on the substring "[threshold:N]" in the stored message body
+// (the same dedup pattern used by HasThresholdNotification pre-2026-05-03)
+// AND the channel column. Both filters apply; a match requires both.
+//
+// channel == "" preserves the legacy (cert, threshold) dedup for the same
+// reason HasThresholdNotification kept its old shape — admin-surface
+// callers still need that behaviour.
+func (s *NotificationService) HasThresholdNotificationOnChannel(
+	ctx context.Context, certID string, threshold int, channel domain.NotificationChannel,
+) (bool, error) {
 	filter := &repository.NotificationFilter{
 		CertificateID: certID,
 		Type: string(domain.NotificationTypeExpirationWarning),
 		MessageLike: fmt.Sprintf("%%[threshold:%d]%%", threshold),
+		Channel: string(channel),
 		PerPage: 1,
 	}
 
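
For illustration, a goroutine-safe type satisfying the ExpiryAlertRecorder
interface from the diff above could look like the sketch below. It is not
service.ExpiryAlertMetrics: the commit message describes an atomic counter
table, whereas this stand-in uses a mutex-guarded map for brevity. The
names counterTable, alertKey, newCounterTable and Snapshot are hypothetical.
The sorted snapshot shows one way to get byte-stable emission of
certctl_expiry_alerts_total.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// ExpiryAlertRecorder is copied from the diff so the sketch compiles alone.
type ExpiryAlertRecorder interface {
	RecordExpiryAlert(channel string, threshold int, result string)
}

type alertKey struct {
	channel   string
	threshold int
	result    string
}

// counterTable is a hypothetical mutex-guarded recorder (the real
// implementation is described as atomic-counter based).
type counterTable struct {
	mu     sync.Mutex
	counts map[alertKey]uint64
}

// Compile-time check that counterTable satisfies the interface.
var _ ExpiryAlertRecorder = (*counterTable)(nil)

func newCounterTable() *counterTable {
	return &counterTable{counts: map[alertKey]uint64{}}
}

func (c *counterTable) RecordExpiryAlert(channel string, threshold int, result string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[alertKey{channel, threshold, result}]++
}

// Snapshot renders one exposition-style line per series, sorted by
// (channel, threshold, result) so repeated scrapes emit a stable order.
func (c *counterTable) Snapshot() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	keys := make([]alertKey, 0, len(c.counts))
	for k := range c.counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		a, b := keys[i], keys[j]
		if a.channel != b.channel {
			return a.channel < b.channel
		}
		if a.threshold != b.threshold {
			return a.threshold < b.threshold
		}
		return a.result < b.result
	})
	lines := make([]string, 0, len(keys))
	for _, k := range keys {
		lines = append(lines, fmt.Sprintf(
			`certctl_expiry_alerts_total{channel=%q,threshold="%d",result=%q} %d`,
			k.channel, k.threshold, k.result, c.counts[k]))
	}
	return lines
}

func main() {
	table := newCounterTable()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			table.RecordExpiryAlert("Slack", 14, "success")
		}()
	}
	wg.Wait()
	table.RecordExpiryAlert("PagerDuty", 1, "failure")
	for _, line := range table.Snapshot() {
		fmt.Println(line)
	}
}
```

With 6 channels, 4 thresholds and 3 results the map holds at most 72 keys,
which matches the cardinality bound quoted in the commit message.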