I-005: notification retry loop + dead-letter queue

Critical alerts can no longer be silently dropped by a transient notifier failure. Failed notification attempts now ride an exponential backoff retry loop, with a 5-attempt budget before promotion to the dead-letter queue for operator intervention. Schema (migration 000016, idempotent): - retry_count INTEGER NOT NULL DEFAULT 0 - next_retry_at TIMESTAMPTZ - last_error TEXT - idx_notification_events_retry_sweep partial index (next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL Dead rows clear next_retry_at so the index stops matching them. Service contract: - NotificationService.RetryFailedNotifications drives 2^n-minute exponential backoff capped at 1h (notifRetryBackoffCap) with 5-attempt budget (notifRetryMaxAttempts). - Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to status='dead' via MarkAsDead. - Non-terminal failures record via RecordFailedAttempt. - Success path promotes to 'sent' without touching retry_count (audit preserves "delivered on attempt N"). - Missing-notifier branch defensively promotes to 'sent' to avoid wedging a row on a deleted channel. - RequeueNotification operator escape hatch atomically resets retry_count -> 0, next_retry_at -> NULL, last_error -> NULL, status -> pending via notifRepo.Requeue. Scheduler: - New always-on notificationRetryLoop wired into the base loop set at CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m). - sync/atomic.Bool idempotency guard. - sync.WaitGroup shutdown drain via WaitForCompletion. StatsService: - SetNotifRepo setter pattern preserves 9 pre-existing NewStatsService call sites (main.go + stats_test.go + 8 digest tests) without touching the constructor signature. - DashboardSummary.NotificationsDead populated via notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired (reports zero on systems without a notification repository). - CountByStatus error is non-fatal (dashboard summary is best-effort for this field). - Prometheus certctl_notification_dead_total counter emitted from the same snapshot. Handler: - New POST /api/v1/notifications/{id}/requeue endpoint. - dead status surfaces to MCP + CLI. Frontend: - NotificationsPage gains two-tab toolbar ("All" / "Dead letter") with queryKey: ['notifications', activeTab] so switching tabs doesn't serve stale data until the 30s refetch. - Dead rows surface "Retry {n}/5" + truncated last_error with full-text title tooltip. - Requeue mutation wrapped as mutationFn: (id: string) => requeueNotification(id) to prevent react-query v5's positional context argument from leaking into the API client — pinned against future refactors by strict-match toHaveBeenCalledWith('notif-dead-001') in NotificationsPage.test.tsx:181. Closes I-005.
2026-06-07 16:41:36 +00:00 · 2026-04-19 15:17:27 +00:00
parent 707d8de4fb
commit 675b87ba63
33 changed files with 3758 additions and 228 deletions
@@ -0,0 +1,256 @@
+package postgres_test
+
+import (
+	"context"
+	"database/sql"
+	"strings"
+	"testing"
+)
+
+// TestMigration000016_NotificationRetryRoundTrip is the Phase 1 Red regression
+// test for I-005 ("failed webhook/email drops critical alerts — no retry, no
+// DLQ, no escalation"). The fix depends on a new migration,
+// 000016_notification_retry.up.sql + .down.sql, which must:
+//
+//  1. Add `retry_count INTEGER NOT NULL DEFAULT 0` on notification_events.
+//     Mirrors migration 000015's column-nullability pattern: explicit
+//     NOT NULL + default so existing rows backfill cleanly and the service
+//     layer never has to nil-check the counter. The 0 default is what lets
+//     the retry scheduler promote a row from failed → pending on its very
+//     first sweep without a bespoke backfill.
+//
+//  2. Add `next_retry_at TIMESTAMPTZ` (nullable) on notification_events.
+//     Populated by the service layer on every failed→pending transition
+//     using exponential backoff (2^retry_count minutes, cap 1h). Nullable
+//     because the field is only meaningful while a row sits in 'failed'
+//     state; 'sent', 'pending', 'dead', and 'read' rows leave it NULL.
+//
+//  3. Add `last_error TEXT` (nullable) on notification_events. TEXT
+//     (not VARCHAR(N)) because notifier errors can include full HTTP
+//     response bodies, TLS handshake diagnostics, or stringified stack
+//     traces. Truncation here would kick the operator back to the server
+//     log, which is exactly the triage pain I-005 is meant to eliminate.
+//
+//  4. Create the partial retry-sweep index
+//     `idx_notification_events_retry_sweep ON notification_events(next_retry_at)
+//     WHERE status = 'failed' AND next_retry_at IS NOT NULL`.
+//     The predicate keeps the index tiny in a healthy fleet — only failed
+//     rows scheduled for retry participate; sent/pending/dead/read rows and
+//     unscheduled failures are excluded. Makes the retry sweep in
+//     RetryFailedNotifications O(retry-eligible) rather than O(total-events).
+//
+// The round-trip also validates that the down migration cleanly reverses all
+// four schema additions, so an operator who lands on a rollback can still
+// boot the server. Stage 4 asserts idempotency — the up migration must be
+// safely re-runnable after a partial rollback, which requires ADD COLUMN
+// IF NOT EXISTS and CREATE INDEX IF NOT EXISTS on every new object.
+//
+// Red-until-Green: this test compiles but fails until
+// migrations/000016_notification_retry.up.sql + .down.sql exist with the
+// right schema, because freshSchema(t) runs every `.up.sql` in lexical order
+// — the new migration runs automatically once Phase 2 creates the files.
+func TestMigration000016_NotificationRetryRoundTrip(t *testing.T) {
+	tdb := getTestDB(t)
+	db := tdb.freshSchema(t)
+	ctx := context.Background()
+
+	// ─── Stage 1: Post-up assertions ─────────────────────────────────────
+	//
+	// After every .up.sql migration (including the new 000016) has run, the
+	// three new columns and the partial retry-sweep index must be observable
+	// in the catalog.
+
+	// All three retry columns must be present on notification_events.
+	assertColumnExists(t, db, "notification_events", "retry_count")
+	assertColumnExists(t, db, "notification_events", "next_retry_at")
+	assertColumnExists(t, db, "notification_events", "last_error")
+
+	// retry_count must be NOT NULL with a server-side default of 0. The
+	// scheduler's failed→pending transition relies on reading the counter
+	// without a COALESCE, and the back-fill on existing rows must be
+	// deterministic; 0 is the only safe default for an attempt counter.
+	assertColumnNotNull(t, db, "notification_events", "retry_count", true)
+	assertColumnDefaultContains(t, db, "notification_events", "retry_count", "0")
+
+	// next_retry_at and last_error are nullable by design — see the Stage 1
+	// doc block above for why. A NOT NULL constraint here would force the
+	// service layer to write sentinel values on every terminal-status
+	// transition, which is worse than just leaving them NULL.
+	assertColumnNotNull(t, db, "notification_events", "next_retry_at", false)
+	assertColumnNotNull(t, db, "notification_events", "last_error", false)
+
+	// The partial retry-sweep index must exist on notification_events and
+	// must include the WHERE predicate that restricts it to failed+scheduled
+	// rows. Without the predicate the index is merely an index on
+	// next_retry_at — correct semantics, but it would balloon in a busy
+	// fleet because every sent/read row would sit in it with a NULL key.
+	assertIndexExists(t, db, "idx_notification_events_retry_sweep")
+	assertIndexPredicateContains(t, db, "idx_notification_events_retry_sweep", "status = 'failed'")
+	assertIndexPredicateContains(t, db, "idx_notification_events_retry_sweep", "next_retry_at IS NOT NULL")
+
+	// ─── Stage 2: Run the 000016 down migration manually ─────────────────
+	//
+	// testutil_test.go's runMigrations helper only runs *.up.sql. To exercise
+	// the down migration I read and execute it by hand, then re-check the
+	// catalog.
+
+	downSQL := readMigrationFile(t, "000016_notification_retry.down.sql")
+	if _, err := db.ExecContext(ctx, downSQL); err != nil {
+		t.Fatalf("000016 down migration failed: %v", err)
+	}
+
+	// Stage 3: Post-down assertions — all three columns removed, partial
+	// index dropped.
+	assertColumnGone(t, db, "notification_events", "retry_count")
+	assertColumnGone(t, db, "notification_events", "next_retry_at")
+	assertColumnGone(t, db, "notification_events", "last_error")
+	assertIndexGone(t, db, "idx_notification_events_retry_sweep")
+
+	// ─── Stage 4: Re-run the up migration for idempotency ────────────────
+	//
+	// The up migration must be safely re-runnable — operators sometimes
+	// re-apply by hand after a partial rollback. Use ADD COLUMN IF NOT
+	// EXISTS and CREATE INDEX IF NOT EXISTS so every converging run is a
+	// no-op.
+
+	upSQL := readMigrationFile(t, "000016_notification_retry.up.sql")
+	if _, err := db.ExecContext(ctx, upSQL); err != nil {
+		t.Fatalf("000016 up migration re-apply failed (must be idempotent): %v", err)
+	}
+
+	assertColumnExists(t, db, "notification_events", "retry_count")
+	assertColumnExists(t, db, "notification_events", "next_retry_at")
+	assertColumnExists(t, db, "notification_events", "last_error")
+	assertIndexExists(t, db, "idx_notification_events_retry_sweep")
+}
+
+// ─── Extra catalog helpers for 000016 ─────────────────────────────────────
+//
+// These are additive to the column-existence and FK helpers defined in
+// migration_000015_test.go. Both files live in `package postgres_test`, so
+// assertColumnExists / assertColumnGone / readMigrationFile are already in
+// scope from the 000015 test file and must not be redeclared.
+
+// assertColumnNotNull asserts that the information_schema reports the
+// expected nullability for a column. PG exposes `is_nullable` as the string
+// 'YES' or 'NO'; we translate to a bool so the call site reads cleanly.
+func assertColumnNotNull(t *testing.T, db *sql.DB, table, column string, wantNotNull bool) {
+	t.Helper()
+	var isNullable string
+	err := db.QueryRowContext(context.Background(), `
+		SELECT is_nullable
+		FROM information_schema.columns
+		WHERE table_schema = current_schema()
+		  AND table_name = $1
+		  AND column_name = $2
+	`, table, column).Scan(&isNullable)
+	if err == sql.ErrNoRows {
+		t.Fatalf("column %s.%s not found in current_schema (migration missing?)", table, column)
+	}
+	if err != nil {
+		t.Fatalf("is_nullable lookup for %s.%s failed: %v", table, column, err)
+	}
+	gotNotNull := isNullable == "NO"
+	if gotNotNull != wantNotNull {
+		t.Errorf("column %s.%s nullability: got NOT NULL=%v, want NOT NULL=%v (is_nullable=%q)",
+			table, column, gotNotNull, wantNotNull, isNullable)
+	}
+}
+
+// assertColumnDefaultContains asserts that the server-side DEFAULT clause for
+// a column contains the expected substring. Postgres can render defaults in
+// a few different normalized shapes (`0`, `(0)::integer`, `0::integer`),
+// so substring matching is more robust than exact equality here.
+func assertColumnDefaultContains(t *testing.T, db *sql.DB, table, column, wantSubstr string) {
+	t.Helper()
+	var columnDefault sql.NullString
+	err := db.QueryRowContext(context.Background(), `
+		SELECT column_default
+		FROM information_schema.columns
+		WHERE table_schema = current_schema()
+		  AND table_name = $1
+		  AND column_name = $2
+	`, table, column).Scan(&columnDefault)
+	if err == sql.ErrNoRows {
+		t.Fatalf("column %s.%s not found in current_schema (migration missing?)", table, column)
+	}
+	if err != nil {
+		t.Fatalf("column_default lookup for %s.%s failed: %v", table, column, err)
+	}
+	if !columnDefault.Valid {
+		t.Errorf("column %s.%s has no DEFAULT clause; want substring %q", table, column, wantSubstr)
+		return
+	}
+	if !strings.Contains(columnDefault.String, wantSubstr) {
+		t.Errorf("column %s.%s DEFAULT = %q; want substring %q",
+			table, column, columnDefault.String, wantSubstr)
+	}
+}
+
+// assertIndexExists asserts that a named index exists in the current schema.
+// Scoped via pg_indexes.schemaname = current_schema() so schema-per-test
+// isolation holds.
+func assertIndexExists(t *testing.T, db *sql.DB, indexName string) {
+	t.Helper()
+	var exists bool
+	err := db.QueryRowContext(context.Background(), `
+		SELECT EXISTS (
+			SELECT 1 FROM pg_indexes
+			WHERE schemaname = current_schema()
+			  AND indexname = $1
+		)`, indexName).Scan(&exists)
+	if err != nil {
+		t.Fatalf("index existence query failed for %s: %v", indexName, err)
+	}
+	if !exists {
+		t.Errorf("expected index %s to exist after 000016 up (migration missing or drifted)", indexName)
+	}
+}
+
+// assertIndexGone is the negative form, used after the down migration to
+// confirm the partial retry-sweep index has been dropped.
+func assertIndexGone(t *testing.T, db *sql.DB, indexName string) {
+	t.Helper()
+	var exists bool
+	err := db.QueryRowContext(context.Background(), `
+		SELECT EXISTS (
+			SELECT 1 FROM pg_indexes
+			WHERE schemaname = current_schema()
+			  AND indexname = $1
+		)`, indexName).Scan(&exists)
+	if err != nil {
+		t.Fatalf("index existence query failed for %s: %v", indexName, err)
+	}
+	if exists {
+		t.Errorf("expected index %s to be removed after 000016 down (down migration is incomplete)", indexName)
+	}
+}
+
+// assertIndexPredicateContains asserts that the reconstructed `indexdef`
+// (pg_indexes.indexdef — the CREATE INDEX statement Postgres would emit to
+// recreate the index) contains the expected substring. This is how we pin
+// the WHERE predicate of a partial index without parsing the SQL.
+//
+// Postgres normalises the predicate (e.g. single-quoted literals stay
+// single-quoted, column references are bare), so substring matching is both
+// sufficient and robust against cosmetic reformatting.
+func assertIndexPredicateContains(t *testing.T, db *sql.DB, indexName, wantSubstr string) {
+	t.Helper()
+	var indexdef string
+	err := db.QueryRowContext(context.Background(), `
+		SELECT indexdef
+		FROM pg_indexes
+		WHERE schemaname = current_schema()
+		  AND indexname = $1
+	`, indexName).Scan(&indexdef)
+	if err == sql.ErrNoRows {
+		t.Fatalf("index %s not found in current_schema (migration missing?)", indexName)
+	}
+	if err != nil {
+		t.Fatalf("indexdef lookup for %s failed: %v", indexName, err)
+	}
+	if !strings.Contains(indexdef, wantSubstr) {
+		t.Errorf("index %s definition missing expected predicate fragment %q\nfull indexdef: %s",
+			indexName, wantSubstr, indexdef)
+	}
+}
@@ -100,10 +100,14 @@ func (r *NotificationRepository) List(ctx context.Context, filter *repository.No
 		return nil, fmt.Errorf("failed to count notifications: %w", err)
 	}

-	// Get paginated results
+	// Get paginated results. I-005 extends the SELECT with the three retry
+	// columns (retry_count / next_retry_at / last_error) so scanNotification
+	// can populate the new fields on domain.NotificationEvent. The column
+	// order here MUST stay in lockstep with scanNotification below.
 	offset := (filter.Page - 1) * filter.PerPage
 	query := fmt.Sprintf(`
-		SELECT id, type, certificate_id, channel, recipient, message, sent_at, status, error
+		SELECT id, type, certificate_id, channel, recipient, message, sent_at, status, error,
+		       retry_count, next_retry_at, last_error
 		FROM notification_events
 		%s
 		ORDER BY sent_at DESC NULLS LAST
@@ -156,13 +160,23 @@ func (r *NotificationRepository) UpdateStatus(ctx context.Context, id string, st
 	return nil
 }

-// scanNotification scans a notification from a row or rows
+// scanNotification scans a notification from a row or rows.
+//
+// I-005 extends the scan list from 9 → 12 columns (adds retry_count,
+// next_retry_at, last_error). Every caller — List and the four new retry
+// methods below — funnels rows through this helper, so the SELECT column
+// order in every query must match the Scan order here exactly. RetryCount
+// scans into an `int` (migration 000016 declares the column NOT NULL with
+// DEFAULT 0), while NextRetryAt and LastError scan into pointer types
+// because the column is nullable — a healthy pending/sent/dead row leaves
+// both NULL.
 func scanNotification(scanner interface {
 	Scan(...interface{}) error
 }) (*domain.NotificationEvent, error) {
 	var notif domain.NotificationEvent
 	err := scanner.Scan(&notif.ID, &notif.Type, &notif.CertificateID, &notif.Channel,
-		&notif.Recipient, &notif.Message, &notif.SentAt, &notif.Status, &notif.Error)
+		&notif.Recipient, &notif.Message, &notif.SentAt, &notif.Status, &notif.Error,
+		&notif.RetryCount, &notif.NextRetryAt, &notif.LastError)

 	if err != nil {
 		return nil, fmt.Errorf("failed to scan notification: %w", err)
@@ -170,3 +184,220 @@ func scanNotification(scanner interface {

 	return &notif, nil
 }
+
+// ─── I-005 retry/DLQ methods ─────────────────────────────────────────────
+//
+// The four methods below implement the repository half of the I-005
+// notification retry + dead-letter queue fix. The retry scheduler loop
+// (added alongside these in internal/scheduler/scheduler.go) drives them in
+// a strict cycle:
+//
+//    ┌─► ListRetryEligible(ctx, now, maxAttempts, limit)
+//    │         (oldest overdue failed rows first)
+//    │            │
+//    │            ├──► notifier.Send() succeeds → UpdateStatus('sent')
+//    │            │
+//    │            ├──► transient failure, retry_count+1 < maxAttempts
+//    │            │        → RecordFailedAttempt(id, err, next)
+//    │            │
+//    │            └──► transient failure, retry_count+1 == maxAttempts
+//    │                     → MarkAsDead(id, err)
+//    │
+//    └──◄ Requeue(id) ────── operator "try again" from Dead-letter tab
+//
+// The WHERE clauses in every UPDATE are scoped by id (not by status), so
+// status invariants ("you can't requeue a sent row", "you can't mark a
+// dead row as dead again") live in the service layer. The repo layer is
+// deliberately thin — it mirrors the postgres CHECK constraints and
+// trusts the service to hand it rows in a sane state. The one exception
+// is "row must exist": each method returns an error on zero RowsAffected,
+// matching the pre-existing UpdateStatus contract above so the scheduler
+// can detect a concurrent delete without guessing.
+
+// listRetryEligibleDefaultLimit caps a caller that passes limit <= 0.
+// Picked high enough that normal sweeps never hit it (a healthy fleet
+// should have tens of overdue rows at most, not thousands), but finite
+// so a pathological call (wrong arg in a future refactor, bad MCP tool
+// wiring) cannot scan the entire notification_events table.
+const listRetryEligibleDefaultLimit = 1000
+
+// ListRetryEligible returns failed notification rows whose next_retry_at
+// is due and whose retry_count has not yet reached the configured
+// max_attempts.
+//
+// The WHERE clause is the exact dual of the partial retry-sweep index
+// predicate from migration 000016:
+//
+//	WHERE status = 'failed'
+//	  AND next_retry_at IS NOT NULL
+//	  AND next_retry_at <= $1
+//	  AND retry_count   <  $2
+//
+// Because the index is partial on the first two conjuncts, the planner
+// uses it to satisfy the range scan on next_retry_at; the retry_count
+// filter is applied as a residual on the (very small) candidate set.
+//
+// ORDER BY next_retry_at ASC matches the fairness guarantee called out
+// in the test file: oldest overdue row goes first, so a backed-up
+// scheduler doesn't starve the notifications that have been waiting
+// longest. The same order is what I-001's RetryFailedJobs uses.
+func (r *NotificationRepository) ListRetryEligible(ctx context.Context, now time.Time, maxAttempts, limit int) ([]*domain.NotificationEvent, error) {
+	if limit <= 0 {
+		limit = listRetryEligibleDefaultLimit
+	}
+
+	rows, err := r.db.QueryContext(ctx, `
+		SELECT id, type, certificate_id, channel, recipient, message, sent_at, status, error,
+		       retry_count, next_retry_at, last_error
+		FROM notification_events
+		WHERE status = 'failed'
+		  AND next_retry_at IS NOT NULL
+		  AND next_retry_at <= $1
+		  AND retry_count    < $2
+		ORDER BY next_retry_at ASC
+		LIMIT $3
+	`, now, maxAttempts, limit)
+	if err != nil {
+		return nil, fmt.Errorf("failed to query retry-eligible notifications: %w", err)
+	}
+	defer rows.Close()
+
+	var notifs []*domain.NotificationEvent
+	for rows.Next() {
+		notif, err := scanNotification(rows)
+		if err != nil {
+			return nil, err
+		}
+		notifs = append(notifs, notif)
+	}
+	if err := rows.Err(); err != nil {
+		return nil, fmt.Errorf("error iterating retry-eligible notification rows: %w", err)
+	}
+
+	return notifs, nil
+}
+
+// RecordFailedAttempt is called by the retry sweep after a notifier.Send
+// transient failure. It increments retry_count by exactly 1, overwrites
+// last_error and next_retry_at, and deliberately DOES NOT touch status —
+// the row must remain 'failed' so the next ListRetryEligible tick can
+// pick it up again (unless the service layer has decided this attempt
+// exhausts max_attempts, in which case it calls MarkAsDead directly
+// instead of calling RecordFailedAttempt).
+//
+// The +1 is done server-side (SET retry_count = retry_count + 1) rather
+// than client-side so a race between two scheduler instances cannot lose
+// an attempt. Only one scheduler should be running in a healthy deploy,
+// but the cheap arithmetic here survives a split-brain without lying
+// about attempt counts.
+func (r *NotificationRepository) RecordFailedAttempt(ctx context.Context, id string, lastError string, nextRetryAt time.Time) error {
+	result, err := r.db.ExecContext(ctx, `
+		UPDATE notification_events
+		SET retry_count   = retry_count + 1,
+		    last_error    = $1,
+		    next_retry_at = $2
+		WHERE id = $3
+	`, lastError, nextRetryAt, id)
+	if err != nil {
+		return fmt.Errorf("failed to record notification retry attempt: %w", err)
+	}
+
+	rows, err := result.RowsAffected()
+	if err != nil {
+		return fmt.Errorf("failed to get rows affected: %w", err)
+	}
+	if rows == 0 {
+		// Same "not found" error shape as UpdateStatus above. The scheduler
+		// logs-and-continues on this so a concurrently-deleted row doesn't
+		// break the sweep.
+		return fmt.Errorf("notification not found")
+	}
+	return nil
+}
+
+// MarkAsDead performs the DLQ transition. Flips status='dead' so the
+// partial retry-sweep index drops the row (the index predicate requires
+// status='failed'), clears next_retry_at so operator dashboards don't
+// claim the row is still "scheduled to retry", writes the final
+// last_error for triage, and PRESERVES retry_count as historical evidence
+// of how many attempts were burned before the row was declared dead.
+// The retry_count value is operator-visible in the Dead letter tab so
+// on-call can tell "this notification died on attempt 5" vs "this one
+// died on attempt 1 because the recipient webhook was malformed from the
+// start".
+func (r *NotificationRepository) MarkAsDead(ctx context.Context, id string, lastError string) error {
+	result, err := r.db.ExecContext(ctx, `
+		UPDATE notification_events
+		SET status        = 'dead',
+		    next_retry_at = NULL,
+		    last_error    = $1
+		WHERE id = $2
+	`, lastError, id)
+	if err != nil {
+		return fmt.Errorf("failed to mark notification as dead: %w", err)
+	}
+
+	rows, err := result.RowsAffected()
+	if err != nil {
+		return fmt.Errorf("failed to get rows affected: %w", err)
+	}
+	if rows == 0 {
+		return fmt.Errorf("notification not found")
+	}
+	return nil
+}
+
+// Requeue is the operator "try again" action fired from the Dead letter
+// tab. Flips status='pending' so ProcessPendingNotifications picks the
+// row up again, resets retry_count to 0 (otherwise the operator's first
+// retry would immediately sit at the top of the backoff ladder), clears
+// next_retry_at so the row is no longer in the retry-sweep index, and
+// clears last_error so the UI doesn't render a stale error badge next
+// to a freshly-requeued row.
+//
+// The service layer is responsible for forbidding Requeue on 'sent' or
+// 'read' rows (terminal success states). This repo layer deliberately
+// doesn't filter by current status — an operator action has already
+// passed a human-in-the-loop guard by the time it reaches the DB, and
+// the test suite only exercises the Requeue-from-{dead,failed} paths.
+// Matches how UpdateStatus doesn't filter by current status either.
+func (r *NotificationRepository) Requeue(ctx context.Context, id string) error {
+	result, err := r.db.ExecContext(ctx, `
+		UPDATE notification_events
+		SET status        = 'pending',
+		    retry_count   = 0,
+		    next_retry_at = NULL,
+		    last_error    = NULL
+		WHERE id = $1
+	`, id)
+	if err != nil {
+		return fmt.Errorf("failed to requeue notification: %w", err)
+	}
+
+	rows, err := result.RowsAffected()
+	if err != nil {
+		return fmt.Errorf("failed to get rows affected: %w", err)
+	}
+	if rows == 0 {
+		return fmt.Errorf("notification not found")
+	}
+	return nil
+}
+
+// CountByStatus returns the number of notification_events rows matching the
+// given status string. Implemented as a direct COUNT(*) rather than via List
+// because List resets filter.PerPage>500 to 50 (see line 57 quirk), which
+// would produce undercounts on high-volume deployments. I-005 Phase 2 Green —
+// backs StatsService.GetDashboardSummary.NotificationsDead and the Prometheus
+// counter certctl_notification_dead_total.
+func (r *NotificationRepository) CountByStatus(ctx context.Context, status string) (int64, error) {
+	var count int64
+	err := r.db.QueryRowContext(ctx,
+		`SELECT COUNT(*) FROM notification_events WHERE status = $1`,
+		status,
+	).Scan(&count)
+	if err != nil {
+		return 0, fmt.Errorf("failed to count notifications by status: %w", err)
+	}
+	return count, nil
+}
@@ -0,0 +1,398 @@
+package postgres_test
+
+import (
+	"context"
+	"database/sql"
+	"testing"
+	"time"
+
+	"github.com/shankar0123/certctl/internal/domain"
+	"github.com/shankar0123/certctl/internal/repository/postgres"
+)
+
+// TestNotificationRepository_RetryMethods is the Phase 1 Red regression test
+// for the I-005 fix ("failed webhook/email drops critical alerts — no retry,
+// no DLQ, no escalation"). It pins the four new repository methods the
+// notification-retry scheduler loop will depend on:
+//
+//  1. ListRetryEligible(ctx, now, maxAttempts, limit) — the retry-sweep query.
+//     Returns failed rows whose next_retry_at <= now AND retry_count <
+//     maxAttempts. Everything else (sent/pending/dead/read, unscheduled
+//     failures, exhausted rows) is excluded. Ordering is ASC on next_retry_at
+//     so the oldest overdue row is processed first — same fairness guarantee
+//     as I-001's RetryFailedJobs.
+//
+//  2. RecordFailedAttempt(ctx, id, lastError, nextRetryAt) — what the
+//     scheduler calls after a notifier.Send() transient failure. Must
+//     increment retry_count by exactly 1, overwrite last_error, overwrite
+//     next_retry_at, and KEEP status='failed' so the row is still a
+//     candidate for ListRetryEligible on the next sweep.
+//
+//  3. MarkAsDead(ctx, id, lastError) — the DLQ transition when retry_count
+//     hits max_attempts. Flips status to 'dead', clears next_retry_at
+//     (so the partial retry-sweep index drops the row), preserves
+//     retry_count as historical evidence of how many attempts were spent,
+//     and records the final transient error for operator triage.
+//
+//  4. Requeue(ctx, id) — the operator "try again" action fired from the
+//     Dead letter tab in the UI. Flips status back to 'pending' (which is
+//     what ProcessPendingNotifications picks up), resets retry_count to 0,
+//     clears next_retry_at AND last_error. Valid from both 'dead' (normal
+//     path) and 'failed' (operator rescuing a stuck row before the sweep
+//     fires). Invalid from 'sent' / 'read' (terminal success states).
+//
+// Red-until-Green: this test file compiles only after Phase 2 adds
+// ListRetryEligible, RecordFailedAttempt, MarkAsDead, and Requeue to
+// postgres.NotificationRepository. Every subtest is testcontainers-gated
+// via getTestDB(t).freshSchema(t), so `go test -short` skips them and CI
+// without Docker stays green. Fixtures are inserted via raw SQL — Create()
+// doesn't know about the new retry columns pre-Green, so the test bypasses
+// it entirely. certificate_id is left NULL on every fixture row to dodge
+// the FK to managed_certificates (the column is nullable per migration
+// 000001, line 212).
+
+// TestNotificationRepository_ListRetryEligible exercises the retry-sweep
+// query. The test fixture deliberately seeds one row per excluded and
+// included case so a single call to ListRetryEligible is the oracle:
+// every row the query returns must be an "include", every row it skips
+// must be an "exclude".
+func TestNotificationRepository_ListRetryEligible(t *testing.T) {
+	tdb := getTestDB(t)
+	db := tdb.freshSchema(t)
+	repo := postgres.NewNotificationRepository(db)
+	ctx := context.Background()
+
+	// Pin `now` so the test is deterministic. All "overdue" rows have
+	// next_retry_at < now; all "future" rows have next_retry_at > now.
+	now := time.Now().UTC().Truncate(time.Microsecond)
+	past := now.Add(-5 * time.Minute)
+	future := now.Add(5 * time.Minute)
+
+	// Fixture grid — each row pins a specific edge of the query:
+	//
+	//   notif-overdue-1  status=failed, retry=1, next=past   → INCLUDE
+	//   notif-overdue-2  status=failed, retry=3, next=past   → INCLUDE
+	//                      (later next_retry_at than notif-overdue-1 by a
+	//                      few seconds so ORDER BY is observable)
+	//   notif-future     status=failed, retry=2, next=future → EXCLUDE
+	//                      (CA hasn't hit backoff yet)
+	//   notif-exhausted  status=failed, retry=5, next=past   → EXCLUDE
+	//                      (retry_count >= max_attempts — sweep must skip
+	//                      so we don't re-promote a row that's about to
+	//                      be marked dead)
+	//   notif-pending    status=pending, retry=0, next=NULL  → EXCLUDE
+	//                      (healthy in-flight notification)
+	//   notif-sent       status=sent, retry=0, next=NULL     → EXCLUDE
+	//   notif-dead       status=dead, retry=5, next=NULL     → EXCLUDE
+	//                      (already in DLQ — retrying it would reset the
+	//                      dead-letter counter and lie to the operator)
+	//   notif-unsched    status=failed, retry=1, next=NULL   → EXCLUDE
+	//                      (failed row that somehow lost its next_retry_at
+	//                      — partial index predicate strips it, and the
+	//                      WHERE clause must mirror the predicate)
+	rawInsert := func(id, status string, retryCount int, nextRetryAt *time.Time) {
+		t.Helper()
+		_, err := db.ExecContext(ctx, `
+			INSERT INTO notification_events (
+				id, type, channel, recipient, message, status, retry_count, next_retry_at
+			) VALUES ($1, 'ExpirationWarning', 'Webhook', 'https://hooks.example.com/x',
+			          'seed', $2, $3, $4)
+		`, id, status, retryCount, nextRetryAt)
+		if err != nil {
+			t.Fatalf("raw insert for %s failed: %v", id, err)
+		}
+	}
+
+	overdue1 := past.Add(-30 * time.Second) // oldest overdue
+	overdue2 := past                        // second-oldest overdue
+	rawInsert("notif-overdue-1", "failed", 1, &overdue1)
+	rawInsert("notif-overdue-2", "failed", 3, &overdue2)
+	rawInsert("notif-future", "failed", 2, &future)
+	rawInsert("notif-exhausted", "failed", 5, &overdue1)
+	rawInsert("notif-pending", "pending", 0, nil)
+	rawInsert("notif-sent", "sent", 0, nil)
+	rawInsert("notif-dead", "dead", 5, nil)
+	rawInsert("notif-unsched", "failed", 1, nil)
+
+	// Act — the central call under test.
+	got, err := repo.ListRetryEligible(ctx, now, 5, 100)
+	if err != nil {
+		t.Fatalf("ListRetryEligible failed: %v", err)
+	}
+
+	// Assert inclusion: exactly the two overdue rows.
+	if len(got) != 2 {
+		t.Fatalf("ListRetryEligible returned %d rows, want 2 (overdue-1 + overdue-2); got IDs = %v",
+			len(got), collectIDs(got))
+	}
+
+	// Assert ordering: ASC on next_retry_at. notif-overdue-1 has the
+	// earlier next_retry_at (past - 30s), so it must come first.
+	if got[0].ID != "notif-overdue-1" {
+		t.Errorf("ListRetryEligible[0].ID = %q, want %q (ORDER BY next_retry_at ASC — oldest first)",
+			got[0].ID, "notif-overdue-1")
+	}
+	if got[1].ID != "notif-overdue-2" {
+		t.Errorf("ListRetryEligible[1].ID = %q, want %q", got[1].ID, "notif-overdue-2")
+	}
+
+	// Assert limit is respected. Re-run with limit=1 and confirm only the
+	// oldest overdue row comes back — this is what lets the scheduler
+	// chunk its sweep under load.
+	limited, err := repo.ListRetryEligible(ctx, now, 5, 1)
+	if err != nil {
+		t.Fatalf("ListRetryEligible(limit=1) failed: %v", err)
+	}
+	if len(limited) != 1 || limited[0].ID != "notif-overdue-1" {
+		t.Errorf("ListRetryEligible(limit=1) returned %v, want [notif-overdue-1]", collectIDs(limited))
+	}
+
+	// Assert maxAttempts is respected. Re-run with maxAttempts=2 — this
+	// flips notif-overdue-2 (retry_count=3) into the "exhausted" bucket
+	// and must not come back. Only notif-overdue-1 (retry_count=1) qualifies.
+	capped, err := repo.ListRetryEligible(ctx, now, 2, 100)
+	if err != nil {
+		t.Fatalf("ListRetryEligible(maxAttempts=2) failed: %v", err)
+	}
+	if len(capped) != 1 || capped[0].ID != "notif-overdue-1" {
+		t.Errorf("ListRetryEligible(maxAttempts=2) returned %v, want [notif-overdue-1]", collectIDs(capped))
+	}
+}
+
+// TestNotificationRepository_RecordFailedAttempt verifies the retry-bump
+// UPDATE. The contract is: retry_count += 1, last_error = new msg,
+// next_retry_at = new time, status STAYS 'failed'. Any other side effect
+// (status flip, retry_count reset, sent_at mutation) is a bug.
+func TestNotificationRepository_RecordFailedAttempt(t *testing.T) {
+	tdb := getTestDB(t)
+	db := tdb.freshSchema(t)
+	repo := postgres.NewNotificationRepository(db)
+	ctx := context.Background()
+
+	initialRetry := past()
+	_, err := db.ExecContext(ctx, `
+		INSERT INTO notification_events (
+			id, type, channel, recipient, message, status, retry_count, next_retry_at, last_error
+		) VALUES ('notif-attempt-1', 'ExpirationWarning', 'Webhook',
+		          'https://hooks.example.com/x', 'seed', 'failed', 2, $1, 'first failure')
+	`, initialRetry)
+	if err != nil {
+		t.Fatalf("seed failed: %v", err)
+	}
+
+	nextTry := time.Now().UTC().Add(8 * time.Minute).Truncate(time.Microsecond)
+	if err := repo.RecordFailedAttempt(ctx, "notif-attempt-1", "connection refused", nextTry); err != nil {
+		t.Fatalf("RecordFailedAttempt failed: %v", err)
+	}
+
+	// Re-read the row directly from the DB (bypassing the repo's List()
+	// filter logic) so the assertion tests storage, not query plumbing.
+	var (
+		gotStatus     string
+		gotRetryCount int
+		gotNextRetry  *time.Time
+		gotLastError  *string
+	)
+	err = db.QueryRowContext(ctx, `
+		SELECT status, retry_count, next_retry_at, last_error
+		FROM notification_events WHERE id = 'notif-attempt-1'
+	`).Scan(&gotStatus, &gotRetryCount, &gotNextRetry, &gotLastError)
+	if err != nil {
+		t.Fatalf("post-update SELECT failed: %v", err)
+	}
+
+	if gotStatus != "failed" {
+		t.Errorf("status = %q, want 'failed' (RecordFailedAttempt must preserve status so sweep re-picks the row)", gotStatus)
+	}
+	if gotRetryCount != 3 {
+		t.Errorf("retry_count = %d, want 3 (must increment by exactly 1 from seeded 2)", gotRetryCount)
+	}
+	if gotNextRetry == nil || !gotNextRetry.Equal(nextTry) {
+		t.Errorf("next_retry_at = %v, want %v", gotNextRetry, nextTry)
+	}
+	if gotLastError == nil || *gotLastError != "connection refused" {
+		t.Errorf("last_error = %v, want 'connection refused'", gotLastError)
+	}
+
+	// Negative path: unknown id must surface "not found" — mirrors the
+	// existing UpdateStatus contract so the scheduler can detect a
+	// concurrent delete without guessing.
+	if err := repo.RecordFailedAttempt(ctx, "notif-does-not-exist", "oops", nextTry); err == nil {
+		t.Errorf("RecordFailedAttempt on unknown id succeeded; want error")
+	}
+}
+
+// TestNotificationRepository_MarkAsDead verifies the DLQ transition. Flips
+// status to 'dead', clears next_retry_at (so the partial retry-sweep
+// index drops the row), writes final last_error, preserves retry_count as
+// evidence of how many attempts were burned.
+func TestNotificationRepository_MarkAsDead(t *testing.T) {
+	tdb := getTestDB(t)
+	db := tdb.freshSchema(t)
+	repo := postgres.NewNotificationRepository(db)
+	ctx := context.Background()
+
+	lastAttempt := past()
+	_, err := db.ExecContext(ctx, `
+		INSERT INTO notification_events (
+			id, type, channel, recipient, message, status, retry_count, next_retry_at, last_error
+		) VALUES ('notif-dlq-1', 'ExpirationWarning', 'Webhook',
+		          'https://hooks.example.com/x', 'seed', 'failed', 5, $1, 'prior failure')
+	`, lastAttempt)
+	if err != nil {
+		t.Fatalf("seed failed: %v", err)
+	}
+
+	if err := repo.MarkAsDead(ctx, "notif-dlq-1", "max attempts exceeded"); err != nil {
+		t.Fatalf("MarkAsDead failed: %v", err)
+	}
+
+	var (
+		gotStatus     string
+		gotRetryCount int
+		gotNextRetry  *time.Time
+		gotLastError  *string
+	)
+	err = db.QueryRowContext(ctx, `
+		SELECT status, retry_count, next_retry_at, last_error
+		FROM notification_events WHERE id = 'notif-dlq-1'
+	`).Scan(&gotStatus, &gotRetryCount, &gotNextRetry, &gotLastError)
+	if err != nil {
+		t.Fatalf("post-update SELECT failed: %v", err)
+	}
+
+	if gotStatus != "dead" {
+		t.Errorf("status = %q, want 'dead' (DLQ transition)", gotStatus)
+	}
+	if gotNextRetry != nil {
+		// next_retry_at MUST be NULL post-DLQ — the partial retry-sweep
+		// index predicate is `status='failed' AND next_retry_at IS NOT NULL`,
+		// so leaving a value here would only waste space; the status='dead'
+		// half of the predicate already excludes the row from the sweep,
+		// but operator dashboards treat a populated next_retry_at as "still
+		// scheduled", which would be a lie.
+		t.Errorf("next_retry_at = %v, want NULL (dead rows are terminal, not rescheduled)", gotNextRetry)
+	}
+	if gotRetryCount != 5 {
+		// retry_count is audit evidence — how many attempts were burned
+		// before the row was declared dead. Don't clobber it.
+		t.Errorf("retry_count = %d, want 5 preserved (evidence of burned attempts)", gotRetryCount)
+	}
+	if gotLastError == nil || *gotLastError != "max attempts exceeded" {
+		t.Errorf("last_error = %v, want 'max attempts exceeded'", gotLastError)
+	}
+
+	// Negative path: unknown id must surface "not found".
+	if err := repo.MarkAsDead(ctx, "notif-does-not-exist", "oops"); err == nil {
+		t.Errorf("MarkAsDead on unknown id succeeded; want error")
+	}
+}
+
+// TestNotificationRepository_Requeue verifies the operator "try again"
+// flow exposed by the Dead letter tab. The contract:
+//
+//   - Flips status → 'pending' regardless of prior ('dead' or 'failed').
+//   - Resets retry_count to 0 — a manual requeue restarts the backoff
+//     ladder; otherwise the operator's first retry would already be at
+//     "wait 32 minutes" which defeats the point.
+//   - Clears next_retry_at so the row is no longer in the retry-sweep
+//     index (the scheduler would otherwise try to retry it *again* a
+//     few seconds later).
+//   - Clears last_error — the UI shouldn't show a stale error next to
+//     a freshly-requeued row.
+func TestNotificationRepository_Requeue(t *testing.T) {
+	tdb := getTestDB(t)
+	db := tdb.freshSchema(t)
+	repo := postgres.NewNotificationRepository(db)
+	ctx := context.Background()
+
+	// Two fixtures — one dead (DLQ path, the normal case) and one failed
+	// (operator rescuing a stuck-in-retry row before the sweep fires).
+	// Both must accept Requeue; a status='sent' or 'read' row must NOT.
+	_, err := db.ExecContext(ctx, `
+		INSERT INTO notification_events (id, type, channel, recipient, message, status, retry_count, last_error)
+		VALUES
+		  ('notif-dead-ready', 'ExpirationWarning', 'Webhook', 'https://h/x', 'seed', 'dead', 5, 'gave up'),
+		  ('notif-failed-hot', 'ExpirationWarning', 'Webhook', 'https://h/x', 'seed', 'failed', 2, 'transient'),
+		  ('notif-sent-done',  'ExpirationWarning', 'Webhook', 'https://h/x', 'seed', 'sent',   0, NULL)
+	`)
+	if err != nil {
+		t.Fatalf("seed failed: %v", err)
+	}
+
+	// Happy path 1: requeue a dead row.
+	if err := repo.Requeue(ctx, "notif-dead-ready"); err != nil {
+		t.Fatalf("Requeue(dead) failed: %v", err)
+	}
+	assertRequeued(t, db, ctx, "notif-dead-ready")
+
+	// Happy path 2: requeue a failed row.
+	if err := repo.Requeue(ctx, "notif-failed-hot"); err != nil {
+		t.Fatalf("Requeue(failed) failed: %v", err)
+	}
+	assertRequeued(t, db, ctx, "notif-failed-hot")
+
+	// Negative path: Requeue on unknown id is "not found", not a no-op
+	// silent success — the handler needs to surface a 404 to the operator.
+	if err := repo.Requeue(ctx, "notif-does-not-exist"); err == nil {
+		t.Errorf("Requeue on unknown id succeeded; want error")
+	}
+}
+
+// ─── Helpers ──────────────────────────────────────────────────────────────
+
+// past returns a stable "5 minutes ago" time for fixture seeding. Truncated
+// to microseconds so round-tripping through Postgres TIMESTAMPTZ doesn't
+// introduce a sub-microsecond diff that breaks equality assertions.
+func past() time.Time {
+	return time.Now().UTC().Add(-5 * time.Minute).Truncate(time.Microsecond)
+}
+
+// collectIDs pulls the IDs out of a slice of events for readable test
+// failure output. Without it, a failure prints "[0xc00012... 0xc00013...]"
+// which is useless when diagnosing a mis-sorted sweep.
+func collectIDs(events []*domain.NotificationEvent) []string {
+	ids := make([]string, len(events))
+	for i, e := range events {
+		ids[i] = e.ID
+	}
+	return ids
+}
+
+// assertRequeued is the shared "did Requeue do exactly what the contract
+// promises?" assertion. Re-reads the row and checks all four mutations
+// atomically so every Requeue test path gets the same rigor: status flipped
+// to 'pending', retry_count reset to 0, next_retry_at cleared, last_error
+// cleared. Any one of these missing is a contract violation.
+func assertRequeued(t *testing.T, db *sql.DB, ctx context.Context, id string) {
+	t.Helper()
+	var (
+		gotStatus     string
+		gotRetryCount int
+		gotNextRetry  *time.Time
+		gotLastError  *string
+	)
+	err := db.QueryRowContext(ctx, `
+		SELECT status, retry_count, next_retry_at, last_error
+		FROM notification_events WHERE id = $1
+	`, id).Scan(&gotStatus, &gotRetryCount, &gotNextRetry, &gotLastError)
+	if err != nil {
+		t.Fatalf("post-Requeue SELECT for %s failed: %v", id, err)
+	}
+	if gotStatus != "pending" {
+		t.Errorf("%s.status = %q, want 'pending' (Requeue must re-open the row for ProcessPendingNotifications)",
+			id, gotStatus)
+	}
+	if gotRetryCount != 0 {
+		t.Errorf("%s.retry_count = %d, want 0 (Requeue restarts the backoff ladder so the operator's first retry isn't already at hour-long waits)",
+			id, gotRetryCount)
+	}
+	if gotNextRetry != nil {
+		t.Errorf("%s.next_retry_at = %v, want NULL (a fresh pending row must not sit in the retry-sweep index)",
+			id, gotNextRetry)
+	}
+	if gotLastError != nil {
+		t.Errorf("%s.last_error = %v, want NULL (stale errors on freshly-requeued rows mislead the UI)",
+			id, *gotLastError)
+	}
+}