certctl/internal/domain/certificate.go
shankar0123 109f32ff41 notifications: per-policy multi-channel expiry-alert routing
Closes Rank 4 of the 2026-05-03 Infisical deep-research deliverable
(see cowork/infisical-deep-research-results.md Part 5). Pre-fix,
RenewalService.CheckExpiringCertificates already ran daily,
RenewalPolicy.AlertThresholdsDays drove per-cert thresholds, and
NotificationService.SendThresholdAlert deduped per (cert, threshold)
— but the channel was hardcoded to Email
(internal/service/notification.go:118 pre-fix). Operators who
configured PagerDuty / Slack / Teams / OpsGenie via
CERTCTL_PAGERDUTY_ROUTING_KEY etc. got nothing at any threshold
unless SMTP was also wired. Their first signal of an expired cert
was a 3 AM outage.

This commit lands the routing matrix on top of the existing
infrastructure:

  1. RenewalPolicy gains AlertChannels (per-tier channel list) +
     AlertSeverityMap (per-threshold tier assignment) +
     EffectiveAlertChannels / EffectiveAlertSeverity accessors.
     Default*() helpers preserve the back-compat Email-only
     behaviour for operators who haven't touched their policies
     post-upgrade. Migration 000026 adds the JSONB columns
     idempotently.
  2. NotificationService.SendThresholdAlertOnChannel — the new
     per-channel dispatch helper. Old SendThresholdAlert stays as
     an Email-only alias so non-policy callers (admin "send test
     alert" surfaces) keep working byte-for-byte.
  3. NotificationService.HasThresholdNotificationOnChannel — per-
     (cert, threshold, channel) deduplication so a transient
     PagerDuty 5xx today does NOT suppress today's Slack alert and
     tomorrow's PagerDuty retry will still fire.
  4. RenewalService.sendThresholdAlerts walks the resolved channel
     set per threshold tier, fans out to every configured channel,
     handles per-channel failures independently, defensively drops
     off-enum channels while leaving an audit row, and records a
     per-channel audit event with metadata.channel and
     metadata.severity_tier.
  5. service.ExpiryAlertMetrics — atomic counter table mirrored on
     the VaultRenewalMetrics shape from the 2026-05-03 audit fix #5
     (commit 0792271). Three labels: channel × threshold × result
     (success / failure / deduped). Cardinality bound: 6 × 4 × 3 =
     72 series for the standard 4-threshold matrix.
  6. handler.MetricsHandler.SetExpiryAlerts wires the Prometheus
     exposer for certctl_expiry_alerts_total{channel,threshold,result}.
     Pre-sorted snapshot for byte-stable emission.
  7. cmd/server/main.go threads ONE service.ExpiryAlertMetrics
     instance through both the recording side (notificationService.
     SetExpiryAlertMetrics) and the exposing side
     (metricsHandler.SetExpiryAlerts).
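
Items 5 and 6 can be pictured with a minimal sketch. It assumes a
mutex-guarded map rather than whatever atomic layout the real
service.ExpiryAlertMetrics uses, and every name in it (alertKey,
alertCounters, Inc, Snapshot) is hypothetical. It only illustrates the
label tuple and why a sorted snapshot gives byte-stable output.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// alertKey is a hypothetical label tuple: channel x threshold x result.
type alertKey struct {
	Channel   string
	Threshold int
	Result    string // "success", "failure", or "deduped"
}

// alertCounters is an illustrative stand-in for a counter table; it is
// not the real service.ExpiryAlertMetrics type.
type alertCounters struct {
	mu     sync.Mutex
	counts map[alertKey]uint64
}

func newAlertCounters() *alertCounters {
	return &alertCounters{counts: make(map[alertKey]uint64)}
}

// Inc bumps the counter for one label tuple.
func (c *alertCounters) Inc(channel string, threshold int, result string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[alertKey{channel, threshold, result}]++
}

// Snapshot renders one line per series, sorted so that repeated reads
// of unchanged counters produce identical bytes.
func (c *alertCounters) Snapshot() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	lines := make([]string, 0, len(c.counts))
	for k, v := range c.counts {
		lines = append(lines, fmt.Sprintf(
			"certctl_expiry_alerts_total{channel=%q,threshold=\"%d\",result=%q} %d",
			k.Channel, k.Threshold, k.Result, v))
	}
	sort.Strings(lines)
	return lines
}

func main() {
	m := newAlertCounters()
	m.Inc("Slack", 30, "success")
	m.Inc("PagerDuty", 0, "failure")
	m.Inc("Slack", 30, "success")
	for _, line := range m.Snapshot() {
		fmt.Println(line)
	}
}
```

With 6 channels, 4 thresholds, and 3 results, a table keyed this way
can hold at most 6 × 4 × 3 = 72 distinct series, which is the bound
quoted in item 5.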

Dispatch flow (post-fix, per renewal-loop tick):

  cert ages past T-30  → daily renewal-loop fires
                       → policy lookup
                       → for each crossed threshold:
                           - resolve severity tier (informational/
                             warning/critical) via AlertSeverityMap
                           - look up channel set in AlertChannels[tier]
                           - for each channel: dedup → SendThresholdAlertOnChannel
                             → notifierRegistry[channel] → audit row →
                             Prometheus counter increment

Tests (internal/service/renewal_expiry_alerts_test.go):

  TestExpiryAlerts_DefaultMatrix_EmailOnly
  TestExpiryAlerts_PerTierFanOut
  TestExpiryAlerts_PerChannelDedup
  TestExpiryAlerts_OneChannelFails_OthersStillFire
  TestExpiryAlerts_OffEnumChannelDropped
  TestExpiryAlerts_MetricCounterIncrements
  TestExpiryAlerts_NilPolicy_FallsToDefault
  TestExpiryAlerts_OperatorOptOutOfTier

The PerTierFanOut test wires 6 mock notifiers, drives a cert with 0
days remaining through the canonical 4 thresholds with the matrix
{informational:[Slack], warning:[Slack,Email],
critical:[PagerDuty,OpsGenie,Email]}, and asserts the exact
recipient counts: Slack=3, Email=3, PagerDuty=1, OpsGenie=1, no
Teams, no Webhook. The OneChannelFails test pins that PagerDuty
returning a 503 does NOT skip Slack/Email at the same threshold.
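
The recipient counts asserted by PerTierFanOut follow from the matrix
and the default 30/14/7/0 severity mapping. The arithmetic can be
checked with a small standalone sketch; countRecipients is a
hypothetical helper written for this note, not part of the test file.

```go
package main

import "fmt"

// countRecipients tallies how many alerts each channel receives when a
// certificate has crossed every threshold in the list.
func countRecipients(thresholds []int, severity map[int]string,
	matrix map[string][]string) map[string]int {
	counts := make(map[string]int)
	for _, t := range thresholds {
		for _, ch := range matrix[severity[t]] {
			counts[ch]++
		}
	}
	return counts
}

func main() {
	thresholds := []int{30, 14, 7, 0}
	severity := map[int]string{
		30: "informational",
		14: "warning",
		7:  "warning",
		0:  "critical",
	}
	matrix := map[string][]string{
		"informational": {"Slack"},
		"warning":       {"Slack", "Email"},
		"critical":      {"PagerDuty", "OpsGenie", "Email"},
	}
	// Prints: map[Email:3 OpsGenie:1 PagerDuty:1 Slack:3]
	fmt.Println(countRecipients(thresholds, severity, matrix))
}
```

Slack fires at 30, 14, and 7; Email fires at 14, 7, and 0; PagerDuty
and OpsGenie fire only at 0. Teams and Webhook never appear in the
matrix, so they receive nothing.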

Drive-by fix (internal/service/testutil_test.go): the existing
mockNotifRepo.List ignored its filter and returned every row, so
legacy dedup tests were matching by substring over unfiltered data
even though the postgres repo applies the filter. The mock now
honours the CertificateID / Type / Status / Channel / MessageLike
filters in the same shape as the postgres implementation
(internal/repository/postgres/notification.go). All pre-existing
service tests still pass; the legacy suite did not depend on the
mock ignoring its filter.
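
A filter-honouring mock along these lines might look like the sketch
below. The notification and filter structs are hypothetical,
trimmed-down shapes, and treating MessageLike as a plain substring
match is an assumption; the real mock and the postgres repo may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// notification and filter are hypothetical shapes for illustration; the
// real repository types carry more fields.
type notification struct {
	CertificateID, Type, Status, Channel, Message string
}

type filter struct {
	CertificateID, Type, Status, Channel, MessageLike string
}

// matches treats an empty filter field as "no constraint" and applies
// MessageLike as a substring match (an assumption about its semantics).
func matches(n notification, f filter) bool {
	if f.CertificateID != "" && n.CertificateID != f.CertificateID {
		return false
	}
	if f.Type != "" && n.Type != f.Type {
		return false
	}
	if f.Status != "" && n.Status != f.Status {
		return false
	}
	if f.Channel != "" && n.Channel != f.Channel {
		return false
	}
	if f.MessageLike != "" && !strings.Contains(n.Message, f.MessageLike) {
		return false
	}
	return true
}

// list returns only the rows that satisfy the filter, instead of
// returning everything and leaving the caller to sift through it.
func list(rows []notification, f filter) []notification {
	var out []notification
	for _, n := range rows {
		if matches(n, f) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	rows := []notification{
		{"cert-1", "ExpiryWarning", "sent", "Slack", "expires in 30 days"},
		{"cert-1", "ExpiryWarning", "sent", "Email", "expires in 30 days"},
		{"cert-2", "ExpiryWarning", "sent", "Slack", "expires in 7 days"},
	}
	got := list(rows, filter{CertificateID: "cert-1", Channel: "Slack"})
	fmt.Println(len(got))
}
```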

Documentation:
  - docs/connectors.md Notifier section gains "Routing expiry
    alerts across channels" — operator-facing, JSON example,
    procurement playbook ("How do I make sure PagerDuty pages on
    the T-1 alert?"), debug recipe via SQL on audit_events +
    notification_events + Prometheus.
  - docs/runbook-expiry-alerts.md — sysadmin-grade flowchart,
    per-policy channel-matrix configuration recipes, "did the on-
    call team get paged?" SQL queries, cardinality budget, V3-Pro
    forward path.
  - cowork/WORKSPACE-ROADMAP.md gains "Multi-channel expiry
    alerts: per-owner routing" V3-Pro entry under Adapter
    hardening.

Out of scope (intentional, flagged in V3-Pro forward path):
  - Per-owner / per-team / per-tenant channel routing (matrix is
    per-policy today, not per-owner).
  - Calendar-aware suppression (no T-30 alerts on weekends).
  - Escalation chains (T-1 unanswered for 30m → escalate).
  - Per-channel rate limiting (downstream of I-005 retry+DLQ).

CHANGELOG.md is intentionally not hand-edited per CHANGELOG.md
itself ("no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/domain/... ./internal/service/...
  ./internal/api/handler/... ./cmd/server/...  clean.
  (./internal/repository/postgres/... vet failed on transitive
  testcontainers/docker module download — sandbox disk pressure,
  not a code issue; postgres-repo build succeeds and tests pass.)
- go test -short -count=1 ./internal/domain/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestExpiryAlerts'
  ./internal/service/...  green (per-channel dedup race-free).

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 4.
Acquisition prompt: cowork/rank-4-multichannel-expiry-alerts-prompt.md.
2026-05-03 22:12:32 +00:00

247 lines | 10 KiB | Go

package domain

import (
	"time"
)

// ManagedCertificate represents a certificate managed by the control plane.
type ManagedCertificate struct {
	ID                   string            `json:"id"`
	Name                 string            `json:"name"`
	CommonName           string            `json:"common_name"`
	SANs                 []string          `json:"sans"`
	Environment          string            `json:"environment"`
	OwnerID              string            `json:"owner_id"`
	TeamID               string            `json:"team_id"`
	IssuerID             string            `json:"issuer_id"`
	TargetIDs            []string          `json:"target_ids"`
	RenewalPolicyID      string            `json:"renewal_policy_id"`
	CertificateProfileID string            `json:"certificate_profile_id,omitempty"`
	Status               CertificateStatus `json:"status"`
	ExpiresAt            time.Time         `json:"expires_at"`
	Tags                 map[string]string `json:"tags"`
	LastRenewalAt        *time.Time        `json:"last_renewal_at,omitempty"`
	LastDeploymentAt     *time.Time        `json:"last_deployment_at,omitempty"`
	RevokedAt            *time.Time        `json:"revoked_at,omitempty"`
	RevocationReason     string            `json:"revocation_reason,omitempty"`
	CreatedAt            time.Time         `json:"created_at"`
	UpdatedAt            time.Time         `json:"updated_at"`

	// Source tags how this managed certificate was created. EST RFC 7030
	// hardening master bundle Phase 11.1 — operators bulk-revoke
	// EST-issued certs by filtering on Source=EST. Empty value preserves
	// the v2.X.0 behavior (the bulk-revoke handler treats empty as
	// equivalent to legacy/manual; new EST issuances stamp Source=EST,
	// new SCEP issuances will eventually stamp Source=SCEP under a
	// future bundle).
	Source CertificateSource `json:"source,omitempty"`
}

// CertificateSource is the enum of provenance values stamped on each
// managed-certificate row when it's created. The empty string is the
// back-compat default — pre-Phase-11 rows have it set to "" by the
// migration's DEFAULT clause; the bulk-revoke filter treats empty as
// "any source" so existing call paths see no behavior change.
//
// EST RFC 7030 hardening master bundle Phase 11.1.
type CertificateSource string

const (
	// CertificateSourceUnspecified preserves the v2.X.0 default ("").
	CertificateSourceUnspecified CertificateSource = ""

	// CertificateSourceEST stamps every cert issued through one of the
	// EST endpoints (simpleenroll / simplereenroll / serverkeygen).
	CertificateSourceEST CertificateSource = "EST"

	// CertificateSourceSCEP / API / Agent reserve future provenance
	// values — not stamped today; SCEP-issued certs continue to land
	// with Source="" until a follow-up bundle wires the stamp at the
	// SCEP service layer.
	CertificateSourceSCEP  CertificateSource = "SCEP"
	CertificateSourceAPI   CertificateSource = "API"
	CertificateSourceAgent CertificateSource = "Agent"

	// CertificateSourceACME stamps every cert issued through the
	// built-in ACME server endpoint (RFC 8555 finalize → cert
	// download). The ACME service (internal/service/acme.go)
	// pins this on every managed_certificates row it inserts at
	// finalize time. Operators bulk-revoke ACME-issued certs by
	// filtering on Source=ACME.
	CertificateSourceACME CertificateSource = "ACME"
)

// CertificateVersion represents a specific version of a certificate.
type CertificateVersion struct {
	ID                string    `json:"id"`
	CertificateID     string    `json:"certificate_id"`
	SerialNumber      string    `json:"serial_number"`
	NotBefore         time.Time `json:"not_before"`
	NotAfter          time.Time `json:"not_after"`
	FingerprintSHA256 string    `json:"fingerprint_sha256"`
	PEMChain          string    `json:"pem_chain"`
	CSRPEM            string    `json:"csr_pem"`
	KeyAlgorithm      string    `json:"key_algorithm,omitempty"`
	KeySize           int       `json:"key_size,omitempty"`
	CreatedAt         time.Time `json:"created_at"`
}

// CertificateStatus represents the lifecycle status of a managed certificate.
type CertificateStatus string

const (
	CertificateStatusPending           CertificateStatus = "Pending"
	CertificateStatusActive            CertificateStatus = "Active"
	CertificateStatusExpiring          CertificateStatus = "Expiring"
	CertificateStatusExpired           CertificateStatus = "Expired"
	CertificateStatusRenewalInProgress CertificateStatus = "RenewalInProgress"
	CertificateStatusFailed            CertificateStatus = "Failed"
	CertificateStatusRevoked           CertificateStatus = "Revoked"
	CertificateStatusArchived          CertificateStatus = "Archived"
)

// RenewalPolicy defines renewal parameters for a managed certificate.
type RenewalPolicy struct {
	ID                   string    `json:"id"`
	Name                 string    `json:"name"`
	RenewalWindowDays    int       `json:"renewal_window_days"`
	AutoRenew            bool      `json:"auto_renew"`
	MaxRetries           int       `json:"max_retries"`
	RetryInterval        int       `json:"retry_interval_seconds"`
	AlertThresholdsDays  []int     `json:"alert_thresholds_days"`
	CertificateProfileID string    `json:"certificate_profile_id,omitempty"`
	CreatedAt            time.Time `json:"created_at"`
	UpdatedAt            time.Time `json:"updated_at"`

	// AlertChannels is the per-policy channel-matrix that maps each
	// severity tier ("informational" / "warning" / "critical") to the
	// set of NotificationChannel values that receive expiry alerts at
	// that tier. Values are slices of channel-name strings matching
	// the NotificationChannel constants ("Email", "Slack", "Teams",
	// "PagerDuty", "OpsGenie", "Webhook"). nil or empty falls back to
	// DefaultAlertChannels (Email-only across all tiers, the pre-2026-05-03
	// behaviour preserved as the safe default for operators who have
	// not yet opted into multi-channel routing).
	//
	// Off-enum severity keys or channel values are silently dropped at
	// the dispatch site (closed-enum discipline; we do NOT dynamically
	// grow Prometheus cardinality on a typo).
	//
	// Rank 4 of the 2026-05-03 Infisical deep-research deliverable
	// (cowork/infisical-deep-research-results.md Part 5).
	AlertChannels map[string][]string `json:"alert_channels,omitempty"`

	// AlertSeverityMap maps each threshold-day value to its severity
	// tier. Off-map thresholds default to "informational". Operators
	// with non-default AlertThresholdsDays values supply their own
	// severity mapping; operators on the canonical 30/14/7/0 thresholds
	// can leave this empty to inherit DefaultAlertSeverityMap which
	// maps:
	//
	//	30 → informational
	//	14 → warning
	//	 7 → warning
	//	 0 → critical
	AlertSeverityMap map[int]string `json:"alert_severity_map,omitempty"`
}

// DefaultAlertThresholds returns the standard alert thresholds when none are configured.
func DefaultAlertThresholds() []int {
	return []int{30, 14, 7, 0}
}

// EffectiveAlertThresholds returns the configured thresholds or defaults if empty.
// A nil receiver also yields the defaults, matching the nil handling in
// EffectiveAlertChannels and EffectiveAlertSeverity.
func (p *RenewalPolicy) EffectiveAlertThresholds() []int {
	if p != nil && len(p.AlertThresholdsDays) > 0 {
		return p.AlertThresholdsDays
	}
	return DefaultAlertThresholds()
}

// Severity-tier names for the channel matrix. Closed-enum to keep
// Prometheus cardinality bounded and operator typos surfaceable in
// audit logs (off-enum tier values are dropped at dispatch).
const (
	AlertSeverityInformational = "informational"
	AlertSeverityWarning       = "warning"
	AlertSeverityCritical      = "critical"
)

// DefaultAlertChannels returns the back-compat default channel matrix
// — Email only at every tier. This preserves the pre-2026-05-03
// behaviour for operators who have not yet opted into multi-channel
// routing. Nil or empty AlertChannels on a RenewalPolicy is read as
// "use this default."
func DefaultAlertChannels() map[string][]string {
	return map[string][]string{
		AlertSeverityInformational: {string(NotificationChannelEmail)},
		AlertSeverityWarning:       {string(NotificationChannelEmail)},
		AlertSeverityCritical:      {string(NotificationChannelEmail)},
	}
}

// DefaultAlertSeverityMap returns the canonical threshold-to-tier
// mapping for the standard 30/14/7/0 thresholds. Operators with
// custom thresholds supply their own mapping.
func DefaultAlertSeverityMap() map[int]string {
	return map[int]string{
		30: AlertSeverityInformational,
		14: AlertSeverityWarning,
		7:  AlertSeverityWarning,
		0:  AlertSeverityCritical,
	}
}

// EffectiveAlertChannels returns the configured channel matrix on
// the policy, or the default if unset. Used by the dispatch site in
// RenewalService.sendThresholdAlerts to resolve the channel set for
// a given tier.
//
// The default-path branch returns a fresh map on every call. The
// configured-path branch returns the policy's own AlertChannels map
// without copying, so mutating the result also mutates the policy;
// callers that need an independent copy should clone it first.
func (p *RenewalPolicy) EffectiveAlertChannels() map[string][]string {
	if p == nil || len(p.AlertChannels) == 0 {
		return DefaultAlertChannels()
	}
	return p.AlertChannels
}

// EffectiveAlertSeverity returns the severity tier for a given
// threshold. Off-map thresholds resolve to "informational" so a
// custom-thresholds policy without an explicit severity map still
// gets dispatch (just at the lowest tier).
func (p *RenewalPolicy) EffectiveAlertSeverity(threshold int) string {
	if p != nil {
		if tier, ok := p.AlertSeverityMap[threshold]; ok {
			return tier
		}
	}
	if tier, ok := DefaultAlertSeverityMap()[threshold]; ok {
		return tier
	}
	return AlertSeverityInformational
}

// IsValidAlertSeverityTier reports whether tier is one of the closed-enum
// severity values. Used by the policy validation path in
// service.RenewalPolicyService to reject typos at write time.
func IsValidAlertSeverityTier(tier string) bool {
	switch tier {
	case AlertSeverityInformational, AlertSeverityWarning, AlertSeverityCritical:
		return true
	}
	return false
}

// IsValidNotificationChannel reports whether channel is one of the
// closed-enum NotificationChannel values. Used by the policy
// validation path to reject typos at write time AND by the dispatch
// site to defensively drop off-enum values that survived a migration.
func IsValidNotificationChannel(channel string) bool {
	switch NotificationChannel(channel) {
	case NotificationChannelEmail, NotificationChannelWebhook,
		NotificationChannelSlack, NotificationChannelTeams,
		NotificationChannelPagerDuty, NotificationChannelOpsGenie:
		return true
	}
	return false
}