I-005: notification retry loop + dead-letter queue

Critical alerts can no longer be silently dropped by a transient notifier failure. Failed notification attempts now ride an exponential backoff retry loop, with a 5-attempt budget before promotion to the dead-letter queue for operator intervention. Schema (migration 000016, idempotent): - retry_count INTEGER NOT NULL DEFAULT 0 - next_retry_at TIMESTAMPTZ - last_error TEXT - idx_notification_events_retry_sweep partial index (next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL Dead rows clear next_retry_at so the index stops matching them. Service contract: - NotificationService.RetryFailedNotifications drives 2^n-minute exponential backoff capped at 1h (notifRetryBackoffCap) with 5-attempt budget (notifRetryMaxAttempts). - Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to status='dead' via MarkAsDead. - Non-terminal failures record via RecordFailedAttempt. - Success path promotes to 'sent' without touching retry_count (audit preserves "delivered on attempt N"). - Missing-notifier branch defensively promotes to 'sent' to avoid wedging a row on a deleted channel. - RequeueNotification operator escape hatch atomically resets retry_count -> 0, next_retry_at -> NULL, last_error -> NULL, status -> pending via notifRepo.Requeue. Scheduler: - New always-on notificationRetryLoop wired into the base loop set at CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m). - sync/atomic.Bool idempotency guard. - sync.WaitGroup shutdown drain via WaitForCompletion. StatsService: - SetNotifRepo setter pattern preserves 9 pre-existing NewStatsService call sites (main.go + stats_test.go + 8 digest tests) without touching the constructor signature. - DashboardSummary.NotificationsDead populated via notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired (reports zero on systems without a notification repository). - CountByStatus error is non-fatal (dashboard summary is best-effort for this field). - Prometheus certctl_notification_dead_total counter emitted from the same snapshot. Handler: - New POST /api/v1/notifications/{id}/requeue endpoint. - dead status surfaces to MCP + CLI. Frontend: - NotificationsPage gains two-tab toolbar ("All" / "Dead letter") with queryKey: ['notifications', activeTab] so switching tabs doesn't serve stale data until the 30s refetch. - Dead rows surface "Retry {n}/5" + truncated last_error with full-text title tooltip. - Requeue mutation wrapped as mutationFn: (id: string) => requeueNotification(id) to prevent react-query v5's positional context argument from leaking into the API client — pinned against future refactors by strict-match toHaveBeenCalledWith('notif-dead-001') in NotificationsPage.test.tsx:181. Closes I-005.
2026-06-07 13:41:30 +00:00 · 2026-04-19 15:17:27 +00:00
parent 707d8de4fb
commit 675b87ba63
33 changed files with 3758 additions and 228 deletions
@@ -61,7 +61,7 @@ flowchart TB
        API["REST API\n(Go net/http, :8443)"]
        SVC["Service Layer"]
        REPO["Repository Layer\n(database/sql + lib/pq)"]
-        SCHED["Background Scheduler\n7 loops"]
+        SCHED["Background Scheduler\n8 always-on + 4 optional loops"]
        DASH["Web Dashboard\n(React SPA)"]
    end

@@ -285,6 +285,9 @@ erDiagram
        text channel
        text recipient
        text status
+        int retry_count
+        timestamptz next_retry_at
+        text last_error
    }
    certificate_profiles {
        text id PK
@@ -483,40 +486,55 @@ For compliance events requiring fleet-wide revocation (key compromise, CA distru

 ### 4. Automatic Renewal

-The control plane runs a scheduler with seven background loops:
+The control plane runs a scheduler with 8 always-on loops plus up to 4 optional loops (enabled by configuration). `internal/scheduler/scheduler.go:262-265` is the authoritative count.

 ```mermaid
 flowchart LR
    subgraph "Scheduler (Background Goroutines)"
        R["Renewal Checker\n⏱ every 1h"]
        J["Job Processor\n⏱ every 30s"]
+        JR["Job Retry\n⏱ every 5m"]
+        JT["Job Timeout\n⏱ every 10m"]
        H["Agent Health\n⏱ every 2m"]
        N["Notification Processor\n⏱ every 1m"]
+        NR["Notification Retry\n⏱ every 2m"]
        SL["Short-Lived Expiry\n⏱ every 30s"]
        NS["Network Scanner\n⏱ every 6h"]
        DG["Certificate Digest\n⏱ every 24h"]
+        HC["Endpoint Health\n⏱ every 60s"]
+        CD["Cloud Discovery\n⏱ every 6h"]
    end

    R -->|"Find expiring certs\nCreate renewal jobs"| DB[("PostgreSQL")]
    J -->|"Process pending jobs\nCoordinate issuance"| DB
+    JR -->|"Retry Failed jobs\nFailed→Pending"| DB
+    JT -->|"Reap stalled AwaitingCSR / AwaitingApproval jobs"| DB
    H -->|"Check heartbeat staleness\nMark agents offline"| DB
    N -->|"Send pending notifications\nEmail / Webhook / Slack"| DB
+    NR -->|"Retry failed notifications\n2^n-min backoff, DLQ after 5 attempts"| DB
    SL -->|"Expire short-lived certs\nMark as Expired"| DB
    NS -->|"Probe TLS endpoints\nStore discovered certs"| DB
    DG -->|"Generate & send HTML digest\nEmail to recipients"| DB
+    HC -->|"Probe deployed TLS endpoints\nState machine + mismatch"| DB
+    CD -->|"AWS SM / Azure KV / GCP SM\nFeed discovery pipeline"| DB
 ```

-| Loop | Interval | Timeout | Purpose |
-|------|----------|---------|---------|
-| Renewal checker | 1 hour | 5 minutes | Finds certificates approaching expiry, creates renewal jobs |
-| Job processor | 30 seconds | 2 minutes | Processes pending jobs (issuance, renewal, deployment) |
-| Agent health check | 2 minutes | 1 minute | Marks agents as offline if heartbeat is stale |
-| Notification processor | 1 minute | 1 minute | Sends pending notifications via configured channels |
-| Short-lived expiry | 30 seconds | 30 seconds | Marks expired short-lived certificates (profile TTL < 1 hour) |
-| Network scanner | 6 hours | 30 minutes | Probes TLS endpoints on configured CIDR ranges, stores discovered certs (M21, opt-in via `CERTCTL_NETWORK_SCAN_ENABLED`). CIDR size validated at API level — max /20 (4096 IPs) per range. |
-| Certificate digest | 24 hours | 5 minutes | Generates HTML email with certificate stats, expiration timeline, job health, agent count. Does NOT run on startup — waits for first scheduled tick. Configurable interval and recipients via `CERTCTL_DIGEST_INTERVAL` and `CERTCTL_DIGEST_RECIPIENTS`. Falls back to certificate owner emails if no explicit recipients configured. |
+| Loop | Interval | Always-on? | Purpose |
+|------|----------|------------|---------|
+| Renewal checker | 1 hour | Yes | Finds certificates approaching expiry (threshold-based or ARI-directed), creates renewal jobs |
+| Job processor | 30 seconds | Yes | Processes pending jobs (issuance, renewal, deployment) |
+| Job retry | 5 minutes (`CERTCTL_SCHEDULER_RETRY_INTERVAL`) | Yes | Transitions `Failed` jobs back to `Pending` for re-dispatch (I-001) |
+| Job timeout | 10 minutes (`CERTCTL_JOB_TIMEOUT_INTERVAL`) | Yes | Reaps `AwaitingCSR` jobs older than 24h and `AwaitingApproval` jobs older than 7d to `Failed`, feeding the retry loop (I-003) |
+| Agent health check | 2 minutes | Yes | Marks agents as offline if heartbeat is stale |
+| Notification processor | 1 minute | Yes | Sends pending notifications via configured channels |
+| Notification retry | 2 minutes (`CERTCTL_NOTIFICATION_RETRY_INTERVAL`) | Yes | Re-dispatches `Failed` notifications whose `next_retry_at` has elapsed; exponential backoff (2^n minutes, capped at 1h), 5-attempt budget, terminal `dead` status after exhaustion (I-005) |
+| Short-lived expiry | 30 seconds | Yes | Marks expired short-lived certificates (profile TTL < 1 hour) |
+| Network scanner | 6 hours | Opt-in (`CERTCTL_NETWORK_SCAN_ENABLED`) | Probes TLS endpoints on configured CIDR ranges, stores discovered certs (M21). CIDR size validated at API level — max /20 (4096 IPs) per range. |
+| Certificate digest | 24 hours (`CERTCTL_DIGEST_INTERVAL`) | Opt-in (digest service) | Generates HTML email with certificate stats, expiration timeline, job health, agent count. Does NOT run on startup — waits for first scheduled tick. Falls back to certificate owner emails if no explicit recipients configured. |
+| Endpoint health | 60 seconds (`CERTCTL_HEALTH_CHECK_INTERVAL`) | Opt-in (health check service) | Probes deployed TLS endpoints, drives the healthy/degraded/down/cert_mismatch state machine (M48) |
+| Cloud discovery | 6 hours | Opt-in (at least one cloud source configured) | Walks AWS Secrets Manager / Azure Key Vault / GCP Secret Manager, feeds discovery pipeline (M50) |

-Each loop uses `sync/atomic.Bool` idempotency guards to prevent concurrent tick execution — if a loop iteration is still running when the next tick fires, the tick is skipped with a warning log. All loops (including short-lived expiry check) run immediately on startup before entering their ticker interval, ensuring no gap between scheduler start and first execution. The certificate digest loop is the exception — it does NOT run on startup, only on scheduled ticks. Graceful shutdown uses `sync.WaitGroup` with `WaitForCompletion()` to drain all in-flight work before process exit.
+Each loop uses `sync/atomic.Bool` idempotency guards to prevent concurrent tick execution — if a loop iteration is still running when the next tick fires, the tick is skipped with a warning log. Most loops (including short-lived expiry, job retry, job timeout, and notification retry) run immediately on startup before entering their ticker interval, ensuring no gap between scheduler start and first execution. The certificate digest loop is the exception — it does NOT run on startup, only on scheduled ticks. Graceful shutdown uses `sync.WaitGroup` with `WaitForCompletion()` to drain all in-flight work before process exit.

 Each operation has a context timeout to prevent indefinite hangs if external services become unresponsive.

@@ -658,6 +676,16 @@ Built-in notifiers: **Email** (SMTP), **Webhook** (HTTP POST), **Slack** (incomi

 See the [Connector Development Guide](connectors.md) for details on building custom connectors.

+### Notification Retry & Dead-Letter Queue
+
+A transient notifier failure (SMTP timeout, 5xx webhook response, Slack rate-limit) must not silently drop a critical alert. Migration `000016_notification_retry` adds three columns to `notification_events` — `retry_count INTEGER NOT NULL DEFAULT 0`, `next_retry_at TIMESTAMPTZ` (nullable — only meaningful while a row is in `failed` state), and `last_error TEXT` (the most recent transient error, preserved for operator triage) — together with a partial index `idx_notification_events_retry_sweep ON notification_events(next_retry_at) WHERE status = 'failed' AND next_retry_at IS NOT NULL` so the retry hot path scales with the retry-eligible slice rather than the full notification history.
+
+The scheduler's notification-retry loop (see the scheduler section above) calls `NotificationService.RetryFailedNotifications(ctx)` every `CERTCTL_NOTIFICATION_RETRY_INTERVAL` (default `2m`). Each tick pulls up to 1000 rows via `notifRepo.ListRetryEligible(ctx, now, maxAttempts, sweepLimit)` — a partial-index-driven query that filters on `status='failed' AND next_retry_at <= now() AND retry_count < 5` — and redispatches them through the same notifier registry used by `ProcessPendingNotifications`. A successful redispatch transitions the row directly to `sent` without incrementing `retry_count`, so the audit trail preserves "delivered on attempt N". A failed redispatch re-arms `next_retry_at` using exponential backoff — `wait = min(2^retry_count minutes, 1h)` — bumps `retry_count`, and stamps `last_error`. When `retry_count >= 4` (the fifth attempt has just failed) the row is promoted to the terminal `dead` status via `notifRepo.MarkAsDead`, which clears `next_retry_at` so the partial retry-sweep index stops matching and the row cannot be re-entered into the retry rotation without operator action.
+
+`NotificationService.RequeueNotification(ctx, id)` is the operator-driven escape hatch from `dead`. It atomically resets `retry_count → 0`, `next_retry_at → NULL`, `last_error → NULL`, and `status → pending`, handing the row back to `ProcessPendingNotifications` on the next 1m tick. This is the correct response to "the notifier outage is resolved, redeliver the queue"; it is not a retry, which is why the retry counter is reset rather than incremented.
+
+The dead-letter depth is surfaced in two places. First, `DashboardSummary.NotificationsDead` is populated by `StatsService.GetDashboardSummary` via `notifRepo.CountByStatus(ctx, "dead")`. The injection uses a `SetNotifRepo` setter pattern (mirroring `CertificateService.SetTargetRepo`) rather than a new positional argument to `NewStatsService`, which keeps all nine existing `NewStatsService` call sites (main.go plus eight digest tests and stats_test.go) signature-stable — when the notification repository has not been wired in, `NotificationsDead` falls through to zero. Second, the `/api/v1/metrics/prometheus` endpoint emits `certctl_notification_dead_total` as a counter (operator alert thresholds per the I-005 spec: `> 0` warning, `> 10` critical) using the same `DashboardSummary` snapshot so the dashboard card and the Prometheus counter cannot skew. The web dashboard exposes a two-tab toolbar on `/notifications` — "All" (the pre-I-005 inbox) and "Dead letter" (threads `?status=dead` into the list query, surfaces `Retry N/5` and the truncated `last_error` with a full-text tooltip per row, and binds a Requeue button to `POST /api/v1/notifications/{id}/requeue`).
+
 ### EST Server (RFC 7030)

 The EST (Enrollment over Secure Transport) server provides an industry-standard enrollment interface for devices that need certificates without using the REST API. It runs under `/.well-known/est/` per RFC 7030 and supports four operations: CA certificate distribution (`/cacerts`), initial enrollment (`/simpleenroll`), re-enrollment (`/simplereenroll`), and CSR attributes (`/csrattrs`).
@@ -865,7 +893,7 @@ The HTTP middleware stack processes requests in the following order (see `cmd/se

 ### Concurrency Safety

-The background scheduler uses `sync/atomic.Bool` idempotency guards on all 7 loops — if a tick fires while the previous iteration is still running, it skips. A `sync.WaitGroup` tracks all in-flight goroutines. `WaitForCompletion(timeout)` blocks during shutdown until all work finishes or the timeout expires, preventing state corruption from mid-flight database operations during process exit.
+The background scheduler uses `sync/atomic.Bool` idempotency guards on every loop (8 always-on plus up to 4 optional) — if a tick fires while the previous iteration is still running, it skips. A `sync.WaitGroup` tracks all in-flight goroutines. `WaitForCompletion(timeout)` blocks during shutdown until all work finishes or the timeout expires, preventing state corruption from mid-flight database operations during process exit.

 ### Logging

@@ -562,6 +562,7 @@ This requirement covers key generation, storage, rotation, and destruction. Cert
 - **Alert Notifications** (M3, M16a) — Configurable escalation:
  - Email alerts: certificate approaching expiration, renewal failure, revocation notification.
  - Webhook: custom HTTP POST to your monitoring system (Slack, Teams, PagerDuty, OpsGenie, custom webhook).
+  - **Retry & Dead-Letter Queue** (I-005) — Transient notifier failures (SMTP timeout, webhook 5xx) are retried with exponential backoff (`2^n` minutes capped at 1h, 5-attempt budget) before landing in the terminal `dead` status. Operators monitor DLQ depth via the `certctl_notification_dead_total` Prometheus counter and requeue via the Notifications page Dead letter tab once the underlying outage is resolved. Closes the pre-I-005 silent-drop gap where a single 5xx could lose a compliance-relevant alert without evidence.
  - Deduplication: one alert per threshold/certificate per day (avoid alert fatigue).

 - **Audit Trail Filtering and Export** (M13) — Compliance reporting:
@@ -6812,6 +6812,194 @@ print('OpenAPI I-004 contract: OK')

 ---

+## Part 56: Notification Retry & Dead-Letter Queue (I-005)
+
+**What this validates:** The full retry lifecycle for `notification_events` rows — transient notifier failures are re-armed with exponential backoff (`2^retry_count` minutes capped at 1h, 5-attempt budget), rows that exhaust the budget land in the terminal `dead` status, the dead-letter depth is surfaced both on the dashboard and via a Prometheus counter, and operators can requeue dead rows once the underlying outage is resolved.
+
+**Why it matters:** Before I-005, a failed notification was a silent drop. `internal/service/notification.go` flipped `status` to `failed` and never came back to it, because `ProcessPendingNotifications` only lists rows whose `status='pending'`. A 5xx from Slack, a 30-second SMTP stall, or a misrouted webhook URL could each lose a critical alert (cert expiry, CA compromise, approval-rejected) with no trace beyond a single log line. Part 56 pins the replacement contract (retry loop + DLQ + dashboard surface + Prometheus metric + operator requeue) so regressions show up here rather than as a post-incident "why didn't we get paged?" review.
+
+### 56.1 Migration 000016 Columns Applied
+
+```bash
+docker compose -f deploy/docker-compose.yml exec postgres \
+  psql -U certctl -d certctl -c \
+  "SELECT column_name FROM information_schema.columns WHERE table_name='notification_events' AND column_name IN ('retry_count','next_retry_at','last_error') ORDER BY column_name;"
+```
+
+**What:** Confirms migration 000016 added the retry bookkeeping columns to `notification_events`.
+**PASS if** all three rows (`last_error`, `next_retry_at`, `retry_count`) are returned. **FAIL** if any is missing — the migration did not apply and the retry loop will error on every tick.
+
+---
+
+### 56.2 Partial Retry-Sweep Index Present
+
+```bash
+docker compose -f deploy/docker-compose.yml exec postgres \
+  psql -U certctl -d certctl -c \
+  "SELECT indexdef FROM pg_indexes WHERE tablename='notification_events' AND indexname='idx_notification_events_retry_sweep';"
+```
+
+**What:** Confirms the partial index `idx_notification_events_retry_sweep ON notification_events(next_retry_at) WHERE status = 'failed' AND next_retry_at IS NOT NULL` exists and has the expected predicate.
+**PASS if** the returned `indexdef` includes `WHERE ((status = 'failed'::text) AND (next_retry_at IS NOT NULL))`. **FAIL** if the index is missing or unpartialed — the retry sweep will scan the full notification history instead of the small retry-eligible slice.
+
+---
+
+### 56.3 Failed Notification Retries On Next Tick
+
+```bash
+# Seed a failed notification with next_retry_at in the past
+docker compose -f deploy/docker-compose.yml exec postgres \
+  psql -U certctl -d certctl -c \
+  "UPDATE notification_events SET status='failed', retry_count=0, next_retry_at=NOW() - INTERVAL '1 minute', last_error='transient SMTP timeout' WHERE id='notif-demo-1';"
+
+# Wait for the retry loop to sweep (default CERTCTL_NOTIFICATION_RETRY_INTERVAL=2m)
+sleep 130
+
+# Observe the post-sweep state
+docker compose -f deploy/docker-compose.yml exec postgres \
+  psql -U certctl -d certctl -c \
+  "SELECT id, status, retry_count, next_retry_at IS NOT NULL AS has_next_retry FROM notification_events WHERE id='notif-demo-1';"
+```
+
+**What:** Exercises the retry loop's failure path. The seeded row is re-dispatched through the notifier registry; in the demo environment the notifier does not exist for `email` so the sweep either delivers (`status='sent'`) or records a failed attempt (`retry_count=1`, `next_retry_at` re-armed).
+**PASS if** either `status='sent'` (delivered on retry) or the row is still `failed` with `retry_count >= 1` and `has_next_retry=t`. **FAIL** if the row is still `failed` with `retry_count=0` and `next_retry_at` in the past — the retry loop is not actually running.
+
+---
+
+### 56.4 Exhausted Notification Transitions To Dead
+
+```bash
+# Seed a row one failure shy of exhaustion — retry_count=4 means the next
+# tick's failure is the 5th attempt (notifRetryMaxAttempts-1 check at
+# internal/service/notification.go:531).
+docker compose -f deploy/docker-compose.yml exec postgres \
+  psql -U certctl -d certctl -c \
+  "UPDATE notification_events SET status='failed', retry_count=4, next_retry_at=NOW() - INTERVAL '1 minute', last_error='persistent outage', channel='channel-that-does-not-exist' WHERE id='notif-demo-2';"
+
+sleep 130
+
+docker compose -f deploy/docker-compose.yml exec postgres \
+  psql -U certctl -d certctl -c \
+  "SELECT id, status, retry_count, last_error FROM notification_events WHERE id='notif-demo-2';"
+```
+
+**What:** The row at `retry_count=4` enters the sweep, the notifier lookup fails (channel unknown), the exhaustion branch fires, and `MarkAsDead` flips the row. Note: the "notifier unknown" branch at notification.go:494-503 promotes to `sent` for demo parity, so for a strict DLQ assertion seed a row whose channel is a known registered notifier that will reject delivery — alternatively run against the integration test fixture where the retry-exhaustion path is deterministic.
+**PASS if** `status='dead'` and `last_error` reflects the send failure. **FAIL** if the row is still `failed` with `retry_count >= 5` — the exhaustion branch did not fire and the row will retry forever.
+
+---
+
+### 56.5 Dead Row Has Null next_retry_at
+
+```bash
+docker compose -f deploy/docker-compose.yml exec postgres \
+  psql -U certctl -d certctl -c \
+  "SELECT COUNT(*) FROM notification_events WHERE status='dead' AND next_retry_at IS NOT NULL;"
+```
+
+**What:** `MarkAsDead` must clear `next_retry_at` so the partial retry-sweep index stops matching the row. If this invariant breaks, a dead row keeps appearing in `ListRetryEligible` and the exhaustion branch fires on every sweep.
+**PASS if** the count is `0`. **FAIL** if any dead rows still carry a non-null `next_retry_at` — the DLQ is leaky and the row will re-enter the retry rotation on the next tick.
+
+---
+
+### 56.6 DashboardSummary Populates NotificationsDead
+
+```bash
+# Seed a dead row so the count is observable
+docker compose -f deploy/docker-compose.yml exec postgres \
+  psql -U certctl -d certctl -c \
+  "UPDATE notification_events SET status='dead', next_retry_at=NULL, last_error='demo DLQ fixture' WHERE id='notif-demo-3';"
+
+curl -sS "http://localhost:8443/api/v1/stats/summary" \
+  -H "Authorization: Bearer ${CERTCTL_API_KEY}" \
+  | python3 -c "import sys,json; s=json.load(sys.stdin); assert 'notifications_dead' in s, 'missing notifications_dead field'; assert s['notifications_dead'] >= 1, s['notifications_dead']; print('notifications_dead:', s['notifications_dead'])"
+```
+
+**What:** Confirms `DashboardSummary.NotificationsDead` (`internal/service/stats.go:66`) is populated by `notifRepo.CountByStatus(ctx, "dead")` (stats.go:137-142) and surfaced in the dashboard summary JSON.
+**PASS if** the field is present and reflects at least the seeded dead row. **FAIL** if the field is missing (`SetNotifRepo` was not called on StatsService) or stuck at zero despite seeded dead rows (repository `CountByStatus` is broken).
+
+---
+
+### 56.7 Prometheus Counter Emits certctl_notification_dead_total
+
+```bash
+curl -sS "http://localhost:8443/api/v1/metrics/prometheus" \
+  -H "Authorization: Bearer ${CERTCTL_API_KEY}" \
+  | grep -E '^# (HELP|TYPE) certctl_notification_dead_total|^certctl_notification_dead_total '
+```
+
+**What:** The Prometheus endpoint (`internal/api/handler/metrics.go:217-219`) emits three lines: `# HELP certctl_notification_dead_total Number of notifications in the dead-letter queue.`, `# TYPE certctl_notification_dead_total counter`, and a bare `certctl_notification_dead_total <value>` value line. Operator alert thresholds per the I-005 spec: `> 0` warning, `> 10` critical.
+**PASS if** all three lines are present and the value is `>= 1` when dead rows exist. **FAIL** if any of the three lines is missing — the metric name is misspelled, the `# TYPE` is wrong, or `DashboardSummary.NotificationsDead` is not wired into the metrics handler.
+
+---
+
+### 56.8 Requeue Resets Retry Bookkeeping
+
+```bash
+# Confirm the row is in 'dead' with the full retry history
+docker compose -f deploy/docker-compose.yml exec postgres \
+  psql -U certctl -d certctl -c \
+  "SELECT id, status, retry_count, next_retry_at, last_error FROM notification_events WHERE id='notif-demo-3';"
+
+# Requeue via the operator endpoint
+curl -sS -X POST "http://localhost:8443/api/v1/notifications/notif-demo-3/requeue" \
+  -H "Authorization: Bearer ${CERTCTL_API_KEY}" \
+  -w "\nHTTP %{http_code}\n"
+
+# Confirm the atomic reset
+docker compose -f deploy/docker-compose.yml exec postgres \
+  psql -U certctl -d certctl -c \
+  "SELECT id, status, retry_count, next_retry_at, last_error FROM notification_events WHERE id='notif-demo-3';"
+```
+
+**What:** Exercises the operator-driven escape hatch (`POST /api/v1/notifications/{id}/requeue`). The repository's `Requeue` must atomically flip `status → pending`, reset `retry_count → 0`, clear `next_retry_at → NULL`, and clear `last_error → NULL` — see `internal/service/notification.go:571-576` and the pinning test at `notification_handler_test.go:307-347`.
+**PASS if** HTTP `200` with JSON body `{"status":"requeued"}` AND the post-requeue row has `status='pending'`, `retry_count=0`, `next_retry_at IS NULL`, `last_error IS NULL`. **FAIL** if any of the four fields is not reset — `ProcessPendingNotifications` will not treat this as a fresh attempt and the audit trail will be ambiguous.
+
+---
+
+### 56.9 GUI Dead Letter Tab Threads ?status=dead
+
+```bash
+cd web
+npx vitest run src/pages/NotificationsPage.test.tsx -t 'Dead letter tab fetches notifications with status=dead'
+```
+
+**What:** The two-tab toolbar on `/notifications` routes the "Dead letter" tab's query through `getNotifications({ status: 'dead', per_page: '100' })`. This test verifies the React Query's `queryKey: ['notifications', activeTab]` (`NotificationsPage.tsx:31`) actually translates the tab click into the server-side filter — not client-side filtering of the full inbox.
+**PASS if** the Vitest assertion at `NotificationsPage.test.tsx:104-128` passes. **FAIL** if the Dead letter tab is merely a client-side filter on the `all` response — the DLQ-only code path (`NotificationRepository.ListByStatus`) is not exercised, which matters for pagination correctness once the inbox grows beyond 100 rows.
+
+---
+
+### 56.10 Requeue Button MutationFn Wrapper
+
+```bash
+cd web
+npx vitest run src/pages/NotificationsPage.test.tsx -t 'clicking Requeue invokes requeueNotification'
+```
+
+**What:** `react-query` v5's `mutate(id)` passes a second positional argument (the mutation context object) to the `mutationFn`. If `mutationFn: requeueNotification` is used directly, the API client receives `(id, { client })` — an extra argument that the strict-match `toHaveBeenCalledWith('notif-dead-001')` assertion at `NotificationsPage.test.tsx:181` rejects. The fix is an explicit single-arg arrow: `mutationFn: (id: string) => requeueNotification(id)` at `NotificationsPage.tsx:64`.
+**PASS if** the Vitest assertion passes (the API client was called with exactly one argument). **FAIL** if the wrapper is inadvertently removed — silent success in runtime, loud failure in this contract.
+
+---
+
+### 56.11 HEAD-State OpenAPI Contract
+
+```bash
+npx --yes @redocly/cli lint api/openapi.yaml \
+  --config '{"rules":{"operation-4xx-response":"error","no-invalid-media-type-examples":"error"}}'
+python3 -c "
+import yaml
+spec = yaml.safe_load(open('api/openapi.yaml'))
+post = spec['paths']['/api/v1/notifications/{id}/requeue']['post']
+assert post['operationId'] == 'requeueNotification', post['operationId']
+assert set(post['responses'].keys()) >= {'200','400','404','405','500'}, post['responses'].keys()
+print('OpenAPI I-005 contract: OK')
+"
+```
+
+**What:** Two-part check. Redocly lint confirms the spec is structurally valid; the Python assertions pin the requeue endpoint's `operationId` and the five minimum response codes (200/400/404/405/500).
+**PASS if** redocly prints no errors and the Python script prints `OpenAPI I-005 contract: OK`. **FAIL** if the `operationId` changed or any of the five responses is missing — downstream MCP/CLI clients rely on the contract.
+
+---
+
 ## Release Sign-Off

 All tests below must pass before tagging v2.1.0. Each row is one individual test from the guide above. The **Method** column indicates whether `qa-smoke-test.sh` covers the test automatically (**Auto**) or requires hands-on verification (**Manual**).
@@ -7813,10 +8001,10 @@ These must be green before starting manual QA:
 | Category | Count |
 |----------|-------|
 | ☑ Auto (passed in `qa-smoke-test.sh`) | 144 |
-| ☐ Auto (not yet run) | 129 |
+| ☐ Auto (not yet run) | 136 |
 | — Skipped (preconditions not met in demo) | 5 |
-| ☐ Manual (requires hands-on verification) | 282 |
-| **Total** | **560** |
+| ☐ Manual (requires hands-on verification) | 286 |
+| **Total** | **571** |

 **Automated tests must also be green.** CI passing is necessary but not sufficient — this manual QA catches integration issues that isolated unit tests miss.