I-005: notification retry loop + dead-letter queue

Critical alerts can no longer be silently dropped by a transient notifier failure. Failed notification attempts now ride an exponential backoff retry loop, with a 5-attempt budget before promotion to the dead-letter queue for operator intervention. Schema (migration 000016, idempotent): - retry_count INTEGER NOT NULL DEFAULT 0 - next_retry_at TIMESTAMPTZ - last_error TEXT - idx_notification_events_retry_sweep partial index (next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL Dead rows clear next_retry_at so the index stops matching them. Service contract: - NotificationService.RetryFailedNotifications drives 2^n-minute exponential backoff capped at 1h (notifRetryBackoffCap) with 5-attempt budget (notifRetryMaxAttempts). - Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to status='dead' via MarkAsDead. - Non-terminal failures record via RecordFailedAttempt. - Success path promotes to 'sent' without touching retry_count (audit preserves "delivered on attempt N"). - Missing-notifier branch defensively promotes to 'sent' to avoid wedging a row on a deleted channel. - RequeueNotification operator escape hatch atomically resets retry_count -> 0, next_retry_at -> NULL, last_error -> NULL, status -> pending via notifRepo.Requeue. Scheduler: - New always-on notificationRetryLoop wired into the base loop set at CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m). - sync/atomic.Bool idempotency guard. - sync.WaitGroup shutdown drain via WaitForCompletion. StatsService: - SetNotifRepo setter pattern preserves 9 pre-existing NewStatsService call sites (main.go + stats_test.go + 8 digest tests) without touching the constructor signature. - DashboardSummary.NotificationsDead populated via notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired (reports zero on systems without a notification repository). - CountByStatus error is non-fatal (dashboard summary is best-effort for this field). - Prometheus certctl_notification_dead_total counter emitted from the same snapshot. Handler: - New POST /api/v1/notifications/{id}/requeue endpoint. - dead status surfaces to MCP + CLI. Frontend: - NotificationsPage gains two-tab toolbar ("All" / "Dead letter") with queryKey: ['notifications', activeTab] so switching tabs doesn't serve stale data until the 30s refetch. - Dead rows surface "Retry {n}/5" + truncated last_error with full-text title tooltip. - Requeue mutation wrapped as mutationFn: (id: string) => requeueNotification(id) to prevent react-query v5's positional context argument from leaking into the API client — pinned against future refactors by strict-match toHaveBeenCalledWith('notif-dead-001') in NotificationsPage.test.tsx:181. Closes I-005.
2026-07-26 13:58:13 +00:00 · 2026-04-19 15:17:27 +00:00
parent 707d8de4fb
commit 675b87ba63
33 changed files with 3758 additions and 228 deletions
@@ -0,0 +1,12 @@
+-- Rollback for migration 000016 (I-005 notification retry + DLQ).
+-- Drops the retry-sweep partial index first, then the three columns added to
+-- notification_events. No status-rewriting: rows that were promoted to 'dead'
+-- during retry exhaustion remain in that status (rollback is opt-in, and
+-- clobbering terminal states on rollback would erase the audit trail of which
+-- alerts were never delivered).
+
+DROP INDEX IF EXISTS idx_notification_events_retry_sweep;
+
+ALTER TABLE notification_events DROP COLUMN IF EXISTS last_error;
+ALTER TABLE notification_events DROP COLUMN IF EXISTS next_retry_at;
+ALTER TABLE notification_events DROP COLUMN IF EXISTS retry_count;
@@ -0,0 +1,55 @@
+-- Migration 000016: Notification retry + dead-letter queue (I-005 coverage-gap fix).
+--
+-- Adds retry bookkeeping to notification_events so transient webhook / SMTP
+-- failures no longer silently drop critical alerts, and introduces a terminal
+-- "dead" status that an operator can triage from the UI.
+--
+-- Rationale (audit finding I-005):
+--   Today `internal/service/notification.go:282-288` flips status to 'failed'
+--   with a zero-valued sent_at and returns. `ProcessPendingNotifications`
+--   (line 243) only lists rows whose status='pending', so a failed row is
+--   orphaned: no retry, no backoff, no escalation, no dead-letter. The only
+--   way an operator learns about the drop is by reading the server log.
+--
+--   The fix mirrors the I-001 job retry loop: a sibling scheduler loop sweeps
+--   notification_events for rows whose (status='failed', next_retry_at <= now())
+--   and, while retry_count < max_attempts, requeues them to 'pending'. Once
+--   retry_count crosses max_attempts the row is promoted to 'dead' and a
+--   Prometheus counter is bumped for alerting. The UI exposes a manual Requeue
+--   button on dead rows for when the operator has resolved the underlying
+--   notifier outage.
+--
+-- Column design mirrors migration 000015 (agent_retire) style:
+--   * retry_count INTEGER NOT NULL DEFAULT 0 — explicit NOT NULL + default so
+--     existing rows backfill cleanly and the service layer never needs to
+--     nil-check the counter.
+--   * next_retry_at TIMESTAMPTZ NULL — nullable because the field is only
+--     meaningful while a row is in 'failed' state; 'sent', 'pending', 'dead'
+--     and 'read' rows all leave it NULL. The partial index below is what makes
+--     the retry sweep O(retry-eligible) rather than O(total).
+--   * last_error TEXT NULL — preserves the most recent transient failure
+--     string for operator triage. TEXT (not VARCHAR(N)) because notifier
+--     errors can include full HTTP bodies, stack traces, or stringified
+--     TLS handshake diagnostics without truncation risk.
+--
+-- Idempotency guarantees (enforced by notification repository integration tests):
+--   * ADD COLUMN IF NOT EXISTS → re-running is a no-op
+--   * CREATE INDEX IF NOT EXISTS → re-running is a no-op
+
+-- Retry counter. DEFAULT 0 backfills every existing row at zero attempts.
+ALTER TABLE notification_events ADD COLUMN IF NOT EXISTS retry_count INTEGER NOT NULL DEFAULT 0;
+
+-- Next-retry timestamp. Populated by the service layer on the failed→pending
+-- transition using exponential backoff (2^retry_count minutes, capped at 1h).
+ALTER TABLE notification_events ADD COLUMN IF NOT EXISTS next_retry_at TIMESTAMPTZ;
+
+-- Last transient error preserved for operator triage and dashboard display.
+ALTER TABLE notification_events ADD COLUMN IF NOT EXISTS last_error TEXT;
+
+-- Partial index for the retry-sweep hot path. Only rows in 'failed' state with
+-- a scheduled next_retry_at participate in the index; everything else (sent,
+-- pending, dead, read, and unscheduled failures) is excluded. Keeps the index
+-- tiny in healthy fleets where transient failures are rare.
+CREATE INDEX IF NOT EXISTS idx_notification_events_retry_sweep
+  ON notification_events(next_retry_at)
+  WHERE status = 'failed' AND next_retry_at IS NOT NULL;