mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 14:51:30 +00:00
notifications: per-policy multi-channel expiry-alert routing

Closes Rank 4 of the 2026-05-03 Infisical deep-research deliverable
(see cowork/infisical-deep-research-results.md Part 5). Pre-fix,
RenewalService.CheckExpiringCertificates already ran daily,
RenewalPolicy.AlertThresholdsDays drove per-cert thresholds, and
NotificationService.SendThresholdAlert deduped per (cert, threshold),
but the channel was hardcoded to Email
(internal/service/notification.go:118 pre-fix). Operators who
configured PagerDuty / Slack / Teams / OpsGenie via
CERTCTL_PAGERDUTY_ROUTING_KEY etc. got nothing at any threshold
unless SMTP was also wired. Their first signal of an expired cert
was a 3 AM outage.

This commit lands the routing matrix on top of the existing
infrastructure:

1. RenewalPolicy gains AlertChannels (per-tier channel list),
   AlertSeverityMap (per-threshold tier assignment), and the
   EffectiveAlertChannels / EffectiveAlertSeverity accessors.
   Default*() helpers preserve the back-compat Email-only
   behaviour for operators who haven't touched their policies
   post-upgrade. Migration 000026 adds the JSONB columns
   idempotently.
2. NotificationService.SendThresholdAlertOnChannel is the new
   per-channel dispatch helper. The old SendThresholdAlert stays as
   an Email-only alias so non-policy callers (admin "send test
   alert" surfaces) keep working byte-for-byte.
3. NotificationService.HasThresholdNotificationOnChannel dedups
   per (cert, threshold, channel), so a transient PagerDuty 5xx
   today does NOT suppress today's Slack alert, and tomorrow's
   PagerDuty retry will still fire.
4. RenewalService.sendThresholdAlerts walks the resolved channel
   set per threshold tier, fans out to every configured channel,
   handles per-channel failures independently, defensively drops
   off-enum channels (leaving an audit row), and records a
   per-channel audit event with metadata.channel +
   metadata.severity_tier.
5. service.ExpiryAlertMetrics is an atomic counter table mirrored
   on the VaultRenewalMetrics shape from the 2026-05-03 audit fix #5
   (commit 0792271). Three labels: channel × threshold × result
   (success / failure / deduped). Cardinality bound: 6 × 4 × 3 =
   72 series for the standard 4-threshold matrix.
6. handler.MetricsHandler.SetExpiryAlerts wires the Prometheus
   exposer for certctl_expiry_alerts_total{channel,threshold,result}.
   The snapshot is pre-sorted for byte-stable emission.
7. cmd/server/main.go threads ONE service.ExpiryAlertMetrics
   instance through both the recording side
   (notificationService.SetExpiryAlertMetrics) and the exposing side
   (metricsHandler.SetExpiryAlerts).
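
A minimal sketch of the counter table described in items 5 to 7, assuming a mutex-guarded map of atomic counters. ExpiryAlertCounters, alertKey, Inc, and Snapshot are hypothetical names; the real service.ExpiryAlertMetrics API and its exact label values are not shown on this page. The sort in Snapshot illustrates one way a pre-sorted, byte-stable emission could work.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"sync/atomic"
)

// alertKey is a hypothetical label tuple: channel x threshold x result.
type alertKey struct {
	Channel   string
	Threshold int
	Result    string // "success", "failure" or "deduped"
}

// ExpiryAlertCounters is an illustrative stand-in for
// service.ExpiryAlertMetrics; the real type is not shown in this commit page.
type ExpiryAlertCounters struct {
	mu       sync.RWMutex
	counters map[alertKey]*atomic.Uint64
}

func NewExpiryAlertCounters() *ExpiryAlertCounters {
	return &ExpiryAlertCounters{counters: make(map[alertKey]*atomic.Uint64)}
}

// Inc bumps one series, creating it on first use.
func (m *ExpiryAlertCounters) Inc(channel string, threshold int, result string) {
	k := alertKey{Channel: channel, Threshold: threshold, Result: result}
	m.mu.RLock()
	c, ok := m.counters[k]
	m.mu.RUnlock()
	if !ok {
		m.mu.Lock()
		if c, ok = m.counters[k]; !ok {
			c = new(atomic.Uint64)
			m.counters[k] = c
		}
		m.mu.Unlock()
	}
	c.Add(1)
}

// Snapshot renders every series in sorted order, so two scrapes with no
// counter change in between produce identical bytes.
func (m *ExpiryAlertCounters) Snapshot() []string {
	m.mu.RLock()
	defer m.mu.RUnlock()
	keys := make([]alertKey, 0, len(m.counters))
	for k := range m.counters {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		a, b := keys[i], keys[j]
		if a.Channel != b.Channel {
			return a.Channel < b.Channel
		}
		if a.Threshold != b.Threshold {
			return a.Threshold < b.Threshold
		}
		return a.Result < b.Result
	})
	lines := make([]string, 0, len(keys))
	for _, k := range keys {
		lines = append(lines, fmt.Sprintf(
			"certctl_expiry_alerts_total{channel=%q,threshold=%q,result=%q} %d",
			k.Channel, fmt.Sprint(k.Threshold), k.Result, m.counters[k].Load()))
	}
	return lines
}

func main() {
	m := NewExpiryAlertCounters() // one instance shared by recorder and exposer
	m.Inc("Slack", 30, "success")
	m.Inc("PagerDuty", 1, "failure")
	m.Inc("Slack", 30, "success")
	for _, line := range m.Snapshot() {
		fmt.Println(line)
	}
}
```

Because the label space is a fixed product (6 channels × 4 thresholds × 3 results), a table like this cannot grow past 72 entries for the standard matrix, which is what keeps the cardinality bound honest.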

Dispatch flow (post-fix, per renewal-loop tick):

  cert ages past T-30 → daily renewal-loop fires
    → policy lookup
    → for each crossed threshold:
        - resolve severity tier (informational/warning/critical)
          via AlertSeverityMap
        - look up channel set in AlertChannels[tier]
        - for each channel: dedup → SendThresholdAlertOnChannel
          → notifierRegistry[channel] → audit row →
          Prometheus counter increment
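
The flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in RenewalService.sendThresholdAlerts: Dispatcher, Notifier, record, and the "dropped" result string are hypothetical, and an in-memory map stands in for the notification repository used for dedup.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnavailable simulates a transient 5xx from one channel.
var errUnavailable = errors.New("503 Service Unavailable")

// Notifier is a hypothetical per-channel sender.
type Notifier func(certID string, threshold int) error

// Dispatcher holds the pieces the flow names. All field names are
// assumptions made for this sketch.
type Dispatcher struct {
	SeverityMap map[int]string      // threshold days -> tier
	Channels    map[string][]string // tier -> channels
	Registry    map[string]Notifier // channel -> sender
	sent        map[string]bool     // dedup on (cert, threshold, channel)
	Results     []string            // stand-in for audit rows and counters
}

func (d *Dispatcher) record(ch string, threshold int, result string) {
	d.Results = append(d.Results, fmt.Sprintf("%s/%d/%s", ch, threshold, result))
}

// Dispatch fans one crossed threshold out to every channel of its tier.
func (d *Dispatcher) Dispatch(certID string, threshold int) {
	tier := d.SeverityMap[threshold]
	for _, ch := range d.Channels[tier] {
		send, known := d.Registry[ch]
		if !known {
			d.record(ch, threshold, "dropped") // off-enum channel: skip, leave a trail
			continue
		}
		key := fmt.Sprintf("%s|%d|%s", certID, threshold, ch)
		if d.sent[key] {
			d.record(ch, threshold, "deduped")
			continue
		}
		if err := send(certID, threshold); err != nil {
			// Not marked as sent, so a later tick retries this channel;
			// the loop carries on with the remaining channels.
			d.record(ch, threshold, "failure")
			continue
		}
		d.sent[key] = true
		d.record(ch, threshold, "success")
	}
}

func main() {
	d := &Dispatcher{
		SeverityMap: map[int]string{1: "critical"},
		Channels:    map[string][]string{"critical": {"PagerDuty", "Slack", "Fax"}},
		Registry: map[string]Notifier{
			"PagerDuty": func(string, int) error { return errUnavailable },
			"Slack":     func(string, int) error { return nil },
		},
		sent: map[string]bool{},
	}
	d.Dispatch("cert-1", 1) // first tick
	d.Dispatch("cert-1", 1) // next tick: Slack is deduped, PagerDuty is retried
	fmt.Println(d.Results)
}
```

Keying the dedup on the channel as well as the (cert, threshold) pair is what lets a failed PagerDuty send be retried without re-sending the Slack alert that already went out.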

Tests (internal/service/renewal_expiry_alerts_test.go):

  TestExpiryAlerts_DefaultMatrix_EmailOnly
  TestExpiryAlerts_PerTierFanOut
  TestExpiryAlerts_PerChannelDedup
  TestExpiryAlerts_OneChannelFails_OthersStillFire
  TestExpiryAlerts_OffEnumChannelDropped
  TestExpiryAlerts_MetricCounterIncrements
  TestExpiryAlerts_NilPolicy_FallsToDefault
  TestExpiryAlerts_OperatorOptOutOfTier

The PerTierFanOut test wires 6 mock notifiers, drives a cert at 0
days through the canonical 4 thresholds with the matrix
{informational:[Slack], warning:[Slack,Email],
critical:[PagerDuty,OpsGenie,Email]}, and asserts the exact
recipient counts: Slack=3, Email=3, PagerDuty=1, OpsGenie=1, no
Teams, no Webhook. The OneChannelFails test pins that PagerDuty
returning a 503 does NOT skip Slack/Email at the same threshold.
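
The asserted counts can be checked by hand. Slack=3, Email=3, and PagerDuty=1 together imply that the four thresholds split into one informational, two warning, and one critical tier. The sketch below reproduces the arithmetic; the threshold days 30/14/7/1 are an assumption (only T-30 and T-1 are named in this message), and fanOutCounts is a hypothetical helper, not part of the test file.

```go
package main

import "fmt"

// fanOutCounts returns how many alerts each channel receives when every
// threshold in severity has been crossed exactly once.
func fanOutCounts(severity map[int]string, channels map[string][]string) map[string]int {
	counts := map[string]int{}
	for _, tier := range severity {
		for _, ch := range channels[tier] {
			counts[ch]++
		}
	}
	return counts
}

func main() {
	// Assumed threshold days; only the 1/2/1 tier split matters here.
	severity := map[int]string{
		30: "informational",
		14: "warning",
		7:  "warning",
		1:  "critical",
	}
	// The matrix quoted in the test description.
	channels := map[string][]string{
		"informational": {"Slack"},
		"warning":       {"Slack", "Email"},
		"critical":      {"PagerDuty", "OpsGenie", "Email"},
	}
	fmt.Println(fanOutCounts(severity, channels))
}
```

Slack appears in the informational and warning tiers (1 + 2 = 3), Email in warning and critical (2 + 1 = 3), and PagerDuty and OpsGenie only in critical (1 each). Teams and Webhook appear in no tier, so their counts stay at zero.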

Drive-by fix (internal/service/testutil_test.go): the existing
mockNotifRepo.List ignored its filter and returned all rows, which
let legacy tests pass on dedup-via-substring even though the
postgres repo actually applied the filter. Updated the mock to
honour CertificateID / Type / Status / Channel / MessageLike
filters in the same shape as the postgres implementation
(internal/repository/postgres/notification.go). All pre-existing
service tests still pass: none of them turned out to depend on the
mock ignoring its filter.
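
For illustration, a filter-honouring mock could look like the sketch below. Only the five filter field names come from this message; NotificationRow, NotificationFilter, matches, and mockList are hypothetical shapes, and MessageLike is approximated with a plain substring match rather than SQL LIKE semantics.

```go
package main

import (
	"fmt"
	"strings"
)

// NotificationRow is a trimmed-down, hypothetical notification record.
type NotificationRow struct {
	CertificateID, Type, Status, Channel, Message string
}

// NotificationFilter carries the five filter fields named in the commit
// message; an empty field means "no constraint".
type NotificationFilter struct {
	CertificateID, Type, Status, Channel, MessageLike string
}

// matches reports whether a row satisfies every non-empty filter field.
func (f NotificationFilter) matches(r NotificationRow) bool {
	eq := func(want, got string) bool { return want == "" || want == got }
	return eq(f.CertificateID, r.CertificateID) &&
		eq(f.Type, r.Type) &&
		eq(f.Status, r.Status) &&
		eq(f.Channel, r.Channel) &&
		(f.MessageLike == "" || strings.Contains(r.Message, f.MessageLike))
}

// mockList applies the filter instead of returning every row.
func mockList(rows []NotificationRow, f NotificationFilter) []NotificationRow {
	var out []NotificationRow
	for _, r := range rows {
		if f.matches(r) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	rows := []NotificationRow{
		{"c1", "ExpirationWarning", "sent", "Slack", "expires in 30 days"},
		{"c1", "ExpirationWarning", "sent", "PagerDuty", "expires in 1 days"},
		{"c2", "ExpirationWarning", "sent", "Slack", "expires in 30 days"},
	}
	got := mockList(rows, NotificationFilter{CertificateID: "c1", Channel: "Slack"})
	fmt.Println(len(got), got[0].Channel)
}
```

With a mock that ignores the filter, a per-channel dedup check would see the PagerDuty row when asking about Slack, which is the kind of false pass the drive-by fix closes off.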

Documentation:

- docs/connectors.md Notifier section gains "Routing expiry
  alerts across channels": operator-facing, JSON example,
  procurement playbook ("How do I make sure PagerDuty pages on
  the T-1 alert?"), debug recipe via SQL on audit_events +
  notification_events + Prometheus.
- docs/runbook-expiry-alerts.md: sysadmin-grade flowchart,
  per-policy channel-matrix configuration recipes, "did the
  on-call team get paged?" SQL queries, cardinality budget, V3-Pro
  forward path.
- cowork/WORKSPACE-ROADMAP.md gains "Multi-channel expiry
  alerts: per-owner routing" V3-Pro entry under Adapter
  hardening.

Out of scope (intentional, flagged in V3-Pro forward path):

- Per-owner / per-team / per-tenant channel routing (matrix is
  per-policy today, not per-owner).
- Calendar-aware suppression (no T-30 alerts on weekends).
- Escalation chains (T-1 unanswered for 30m → escalate).
- Per-channel rate limiting (downstream of I-005 retry+DLQ).

CHANGELOG.md is intentionally not hand-edited per CHANGELOG.md
itself ("no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:

- gofmt clean.
- go vet ./internal/domain/... ./internal/service/...
  ./internal/api/handler/... ./cmd/server/... clean.
  (./internal/repository/postgres/... vet failed on transitive
  testcontainers/docker module download: sandbox disk pressure,
  not a code issue; postgres-repo build succeeds and tests pass.)
- go test -short -count=1 ./internal/domain/...
  ./internal/service/... ./internal/api/handler/... green.
- go test -race -count=10 -run 'TestExpiryAlerts'
  ./internal/service/... green (per-channel dedup race-free).

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 4.
Acquisition prompt: cowork/rank-4-multichannel-expiry-alerts-prompt.md.
@@ -34,24 +34,43 @@ func NewRenewalPolicyRepository(db *sql.DB) *RenewalPolicyRepository {
 // pre-existing drift is out of G-1's minimum-viable-delta and is tracked in
 // the design doc §8. Introducing them would change struct shapes / JSON tags
 // and require domain-layer churn we're not taking on in this change.
+//
+// alert_channels / alert_severity_map (migration 000026) ARE read here —
+// they're the per-policy channel matrix that drives multi-channel expiry
+// alert routing (Rank 4 of the 2026-05-03 Infisical deep-research
+// deliverable). Both default to '{}' at the DB level; scanRenewalPolicy
+// unmarshals an empty map into nil so domain.EffectiveAlertChannels /
+// EffectiveAlertSeverityMap fall through to the back-compat defaults.
 const renewalPolicyColumns = `
 	id, name, renewal_window_days, auto_renew, max_retries,
-	retry_interval_seconds, alert_thresholds_days, created_at, updated_at
+	retry_interval_seconds, alert_thresholds_days,
+	alert_channels, alert_severity_map,
+	created_at, updated_at
 `

 // scanRenewalPolicy decodes one renewal_policies row from a Row or Rows
 // scanner, unmarshaling alert_thresholds_days JSONB into the domain slice.
 // Malformed JSONB silently falls back to DefaultAlertThresholds() — same
 // behavior as the pre-G-1 code so we don't change observable semantics.
+//
+// alert_channels + alert_severity_map (migration 000026) follow the same
+// "malformed → fall through to default" rule. The default-fallthrough
+// happens at read time in domain.EffectiveAlertChannels /
+// EffectiveAlertSeverity, so populating these fields with nil on parse
+// failure is the correct shape — the runtime still gets the back-compat
+// Email-only matrix.
 func scanRenewalPolicy(scanner interface {
 	Scan(dest ...any) error
 }) (*domain.RenewalPolicy, error) {
 	var policy domain.RenewalPolicy
 	var thresholdsJSON []byte
+	var channelsJSON []byte
+	var severityJSON []byte

 	if err := scanner.Scan(
 		&policy.ID, &policy.Name, &policy.RenewalWindowDays, &policy.AutoRenew,
 		&policy.MaxRetries, &policy.RetryInterval, &thresholdsJSON,
+		&channelsJSON, &severityJSON,
 		&policy.CreatedAt, &policy.UpdatedAt,
 	); err != nil {
 		return nil, err
@@ -63,9 +82,56 @@ func scanRenewalPolicy(scanner interface {
 		}
 	}

+	if len(channelsJSON) > 0 && string(channelsJSON) != "{}" {
+		if err := json.Unmarshal(channelsJSON, &policy.AlertChannels); err != nil {
+			policy.AlertChannels = nil // EffectiveAlertChannels falls through to default
+		}
+	}
+
+	if len(severityJSON) > 0 && string(severityJSON) != "{}" {
+		// JSONB stores int keys as string; unmarshal via a string-keyed map
+		// then convert. JSON does not support non-string object keys, so
+		// the wire representation is e.g. {"30":"informational"}.
+		stringKeyed := map[string]string{}
+		if err := json.Unmarshal(severityJSON, &stringKeyed); err == nil {
+			converted := make(map[int]string, len(stringKeyed))
+			for k, v := range stringKeyed {
+				var threshold int
+				if _, scanErr := fmt.Sscanf(k, "%d", &threshold); scanErr == nil {
+					converted[threshold] = v
+				}
+			}
+			policy.AlertSeverityMap = converted
+		}
+	}
+
 	return &policy, nil
 }

+// marshalSeverityMap converts the domain's int-keyed map into the
+// string-keyed form Postgres JSONB stores. Mirror of the inverse
+// conversion in scanRenewalPolicy. Returns "{}" for nil/empty maps so
+// the DB never sees null where NOT NULL is required.
+func marshalSeverityMap(m map[int]string) ([]byte, error) {
+	if len(m) == 0 {
+		return []byte("{}"), nil
+	}
+	stringKeyed := make(map[string]string, len(m))
+	for k, v := range m {
+		stringKeyed[fmt.Sprintf("%d", k)] = v
+	}
+	return json.Marshal(stringKeyed)
+}
+
+// marshalAlertChannels marshals the channel matrix as JSONB. nil/empty
+// returns "{}" so the DB NOT NULL constraint is satisfied.
+func marshalAlertChannels(m map[string][]string) ([]byte, error) {
+	if len(m) == 0 {
+		return []byte("{}"), nil
+	}
+	return json.Marshal(m)
+}
+
 // Get retrieves a renewal policy by ID.
 func (r *RenewalPolicyRepository) Get(ctx context.Context, id string) (*domain.RenewalPolicy, error) {
 	row := r.db.QueryRowContext(ctx, `SELECT `+renewalPolicyColumns+` FROM renewal_policies WHERE id = $1`, id)
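
A note on the int-key conversion in this hunk, offered as a sketch rather than a requested change. Go's encoding/json already writes integer map keys as JSON strings and reads them back, so the manual string-keyed detour is one of two workable shapes. Separately, fmt.Sscanf with %d stops at the first non-digit, so a key such as "30d" would be accepted as 30; strconv.Atoi rejects it. decodeSeverity below is a hypothetical helper showing the stricter variant.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// decodeSeverity mirrors the string-keyed decode in scanRenewalPolicy but
// uses strconv.Atoi, which rejects keys with trailing garbage.
func decodeSeverity(raw []byte) (map[int]string, error) {
	stringKeyed := map[string]string{}
	if err := json.Unmarshal(raw, &stringKeyed); err != nil {
		return nil, err
	}
	out := make(map[int]string, len(stringKeyed))
	for k, v := range stringKeyed {
		n, err := strconv.Atoi(k)
		if err != nil {
			continue // skip non-integer keys, as the diff does on scan failure
		}
		out[n] = v
	}
	return out, nil
}

func main() {
	// encoding/json writes integer map keys as JSON strings.
	wire, err := json.Marshal(map[int]string{30: "informational", 1: "critical"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(wire))

	back, err := decodeSeverity(wire)
	if err != nil {
		panic(err)
	}
	fmt.Println(back[30], back[1])

	// The standard library also decodes string keys straight into an
	// int-keyed map, with no intermediate map[string]string.
	var direct map[int]string
	if err := json.Unmarshal(wire, &direct); err != nil {
		panic(err)
	}
	fmt.Println(direct[30])
}
```

One behavioural difference worth weighing: decoding directly into map[int]string makes the whole unmarshal fail on a single non-numeric key, whereas the per-key loop keeps the valid entries and drops the bad one.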
@@ -158,6 +224,16 @@ func (r *RenewalPolicyRepository) Create(ctx context.Context, policy *domain.Ren
 		return fmt.Errorf("failed to marshal alert thresholds: %w", err)
 	}

+	channelsJSON, err := marshalAlertChannels(policy.AlertChannels)
+	if err != nil {
+		return fmt.Errorf("failed to marshal alert channels: %w", err)
+	}
+
+	severityJSON, err := marshalSeverityMap(policy.AlertSeverityMap)
+	if err != nil {
+		return fmt.Errorf("failed to marshal alert severity map: %w", err)
+	}
+
 	// ID auto-generation with collision retry. We attempt up to 10 suffix
 	// variants (rp-foo, rp-foo-2, ..., rp-foo-10) before giving up — the
 	// 23505 error the caller gets back past that point is on Name (their
@@ -170,8 +246,10 @@ func (r *RenewalPolicyRepository) Create(ctx context.Context, policy *domain.Ren
 	insertSQL := `
 		INSERT INTO renewal_policies (
 			id, name, renewal_window_days, auto_renew, max_retries,
-			retry_interval_seconds, alert_thresholds_days, created_at, updated_at
-		) VALUES ($1, $2, $3, $4, $5, $6, $7, NOW(), NOW())
+			retry_interval_seconds, alert_thresholds_days,
+			alert_channels, alert_severity_map,
+			created_at, updated_at
+		) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, NOW(), NOW())
 		RETURNING ` + renewalPolicyColumns

 	maxAttempts := 10
@@ -189,6 +267,7 @@ func (r *RenewalPolicyRepository) Create(ctx context.Context, policy *domain.Ren
 		row := r.db.QueryRowContext(ctx, insertSQL,
 			candidateID, policy.Name, policy.RenewalWindowDays, policy.AutoRenew,
 			policy.MaxRetries, policy.RetryInterval, thresholdsJSON,
+			channelsJSON, severityJSON,
 		)

 		inserted, scanErr := scanRenewalPolicy(row)
@@ -234,6 +313,16 @@ func (r *RenewalPolicyRepository) Update(ctx context.Context, id string, policy
 		return fmt.Errorf("failed to marshal alert thresholds: %w", err)
 	}

+	channelsJSON, err := marshalAlertChannels(policy.AlertChannels)
+	if err != nil {
+		return fmt.Errorf("failed to marshal alert channels: %w", err)
+	}
+
+	severityJSON, err := marshalSeverityMap(policy.AlertSeverityMap)
+	if err != nil {
+		return fmt.Errorf("failed to marshal alert severity map: %w", err)
+	}
+
 	row := r.db.QueryRowContext(ctx, `
 		UPDATE renewal_policies SET
 			name = $2,
@@ -242,11 +331,14 @@ func (r *RenewalPolicyRepository) Update(ctx context.Context, id string, policy
 			max_retries = $5,
 			retry_interval_seconds = $6,
 			alert_thresholds_days = $7,
+			alert_channels = $8,
+			alert_severity_map = $9,
 			updated_at = NOW()
 		WHERE id = $1
 		RETURNING `+renewalPolicyColumns,
 		id, policy.Name, policy.RenewalWindowDays, policy.AutoRenew,
 		policy.MaxRetries, policy.RetryInterval, thresholdsJSON,
+		channelsJSON, severityJSON,
 	)

 	updated, err := scanRenewalPolicy(row)