I-005: notification retry loop + dead-letter queue

Critical alerts can no longer be silently dropped by a transient
notifier failure. Failed notification attempts now ride an exponential
backoff retry loop, with a 5-attempt budget before promotion to the
dead-letter queue for operator intervention.

Schema (migration 000016, idempotent):
- retry_count INTEGER NOT NULL DEFAULT 0
- next_retry_at TIMESTAMPTZ
- last_error TEXT
- idx_notification_events_retry_sweep partial index
  (next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL
  Dead rows clear next_retry_at so the index stops matching them.

Service contract:
- NotificationService.RetryFailedNotifications drives 2^n-minute
  exponential backoff capped at 1h (notifRetryBackoffCap) with
  5-attempt budget (notifRetryMaxAttempts).
- Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to
  status='dead' via MarkAsDead.
- Non-terminal failures record via RecordFailedAttempt.
- Success path promotes to 'sent' without touching retry_count
  (audit preserves "delivered on attempt N").
- Missing-notifier branch defensively promotes to 'sent' to avoid
  wedging a row on a deleted channel.
- RequeueNotification operator escape hatch atomically resets
  retry_count -> 0, next_retry_at -> NULL, last_error -> NULL,
  status -> pending via notifRepo.Requeue.
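
The backoff and exhaustion rules above can be sketched as follows. This is a minimal illustration in TypeScript, not the Go service code: the constant and function names are hypothetical stand-ins for notifRetryMaxAttempts and notifRetryBackoffCap, and the exhaustion check simply restates the condition quoted in the list.

```typescript
// Hypothetical constants mirroring notifRetryMaxAttempts / notifRetryBackoffCap.
const MAX_ATTEMPTS = 5;
const ONE_MINUTE_MS = 60 * 1000;
const BACKOFF_CAP_MS = 60 * ONE_MINUTE_MS; // 1h cap

// Delay before the next attempt: min(2^retryCount * 1m, 1h).
function backoffDelayMs(retryCount: number): number {
  return Math.min(2 ** retryCount * ONE_MINUTE_MS, BACKOFF_CAP_MS);
}

// Exhaustion check as stated in the commit message:
// RetryCount >= notifRetryMaxAttempts-1 promotes the row to 'dead'.
function isExhausted(retryCount: number): boolean {
  return retryCount >= MAX_ATTEMPTS - 1;
}

// Delays in minutes for retry counts 0..6: 1, 2, 4, 8, 16, 32, 60 (capped).
const delaysInMinutes = [0, 1, 2, 3, 4, 5, 6].map(
  (n) => backoffDelayMs(n) / ONE_MINUTE_MS,
);
```

Since 2^6 minutes is the first value above 60, the 1h cap only comes into play if the attempt budget is ever raised beyond its current value.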

Scheduler:
- New always-on notificationRetryLoop wired into the base loop set at
  CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m).
- sync/atomic.Bool idempotency guard.
- sync.WaitGroup shutdown drain via WaitForCompletion.
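
The guard-and-drain shape described above can be approximated in TypeScript as below. This is only a sketch of the pattern: the RetryLoop class and its method names are hypothetical, a boolean flag stands in for sync/atomic.Bool, and a stored promise stands in for the sync.WaitGroup that WaitForCompletion drains.

```typescript
// Hypothetical sketch: a tick that refuses to overlap with a running sweep,
// plus a drain hook for shutdown.
class RetryLoop {
  private running = false; // stands in for the atomic.Bool guard
  private inFlight: Promise<void> = Promise.resolve(); // stands in for the WaitGroup
  private readonly sweep: () => Promise<void>;

  constructor(sweep: () => Promise<void>) {
    this.sweep = sweep;
  }

  // Starts a sweep unless one is already running; returns whether it started.
  tick(): boolean {
    if (this.running) return false;
    this.running = true;
    this.inFlight = (async () => {
      try {
        await this.sweep();
      } catch {
        // Sketch only: a real loop would log the sweep error.
      } finally {
        this.running = false; // release the guard even if the sweep failed
      }
    })();
    return true;
  }

  // Resolves once the most recently started sweep has finished.
  waitForCompletion(): Promise<void> {
    return this.inFlight;
  }
}
```

A scheduler would call tick() on every interval and await waitForCompletion() during shutdown, so an in-progress sweep is never cut off halfway.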

StatsService:
- SetNotifRepo setter pattern preserves 9 pre-existing
  NewStatsService call sites (main.go + stats_test.go + 8 digest
  tests) without touching the constructor signature.
- DashboardSummary.NotificationsDead populated via
  notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired
  (reports zero on systems without a notification repository).
- CountByStatus error is non-fatal (dashboard summary is
  best-effort for this field).
- Prometheus certctl_notification_dead_total counter emitted from
  the same snapshot.
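
A rough TypeScript analogue of the setter pattern is shown below. The interface and class are hypothetical and synchronous for brevity; reporting zero when CountByStatus fails is an assumption of this sketch, since the list above only says the error is non-fatal.

```typescript
// Hypothetical repository interface; only the one method the sketch needs.
interface NotifRepo {
  countByStatus(status: string): number;
}

class StatsService {
  // Unwired by default, so existing constructor call sites keep working.
  private notifRepo: NotifRepo | null = null;

  // Optional wiring step, called only where a repository exists.
  setNotifRepo(repo: NotifRepo): void {
    this.notifRepo = repo;
  }

  // Best-effort dead-letter count for the dashboard summary.
  notificationsDead(): number {
    if (this.notifRepo === null) return 0; // nil-safe when unwired
    try {
      return this.notifRepo.countByStatus('dead');
    } catch {
      return 0; // assumption: a failed count degrades to zero instead of failing the summary
    }
  }
}
```

The trade-off of a setter over a constructor parameter is that a forgotten SetNotifRepo call fails silently (the field reads zero), which is why the unwired case is documented explicitly.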

Handler:
- New POST /api/v1/notifications/{id}/requeue endpoint.
- dead status surfaces to MCP + CLI.

Frontend:
- NotificationsPage gains a two-tab toolbar ("All" / "Dead letter")
  with queryKey: ['notifications', activeTab] so each tab gets its own
  cache entry and switching tabs doesn't serve the other tab's stale
  data until the 30s refetch fires.
- Dead rows surface "Retry {n}/5" + truncated last_error with
  full-text title tooltip.
- Requeue mutation wrapped as
    mutationFn: (id: string) => requeueNotification(id)
  to prevent react-query v5's positional context argument from
  leaking into the API client — pinned against future refactors
  by strict-match toHaveBeenCalledWith('notif-dead-001') in
  NotificationsPage.test.tsx:181.
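
Why the arrow wrapper matters can be shown without react-query at all. In the sketch below, invokeMutation is a hypothetical stand-in for a library that calls the mutation function with a second context argument; it is not react-query's actual internals, and the recording requeueNotification is a test double, not the real API client.

```typescript
// Hypothetical stand-in for a library that passes (variables, context).
type MutationFn<V, R> = (variables: V, context: { client: unknown }) => R;

function invokeMutation<V, R>(fn: MutationFn<V, R>, variables: V): R {
  return fn(variables, { client: {} });
}

// Test double that records every argument list it receives.
const calls: unknown[][] = [];
function requeueNotification(...args: unknown[]): string {
  calls.push(args);
  return 'requeued';
}

// Passed by reference: the context object leaks through as a second argument.
invokeMutation(requeueNotification, 'notif-dead-001');

// Wrapped in an arrow: only the id reaches the API function.
invokeMutation((id: string) => requeueNotification(id), 'notif-dead-001');

// calls[0] has 2 entries (id + context); calls[1] has exactly 1 (the id).
```

A strict single-argument assertion such as toHaveBeenCalledWith('notif-dead-001') would pass only for the second call, which is the behavior the wrapper is meant to pin.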

Closes I-005.
shankar0123
2026-04-19 15:17:27 +00:00
parent 707d8de4fb
commit 675b87ba63
33 changed files with 3758 additions and 228 deletions
+13
@@ -301,6 +301,19 @@ export const getNotification = (id: string) =>
export const markNotificationRead = (id: string) =>
fetchJSON<{ message: string }>(`${BASE}/notifications/${id}/read`, { method: 'POST' });
/**
* I-005: requeue a dead notification back to the retry queue. Flips status
* 'dead' → 'pending' and clears next_retry_at so the retry sweep picks it up
* on its next tick (default 2 minutes, CERTCTL_NOTIFICATION_RETRY_INTERVAL).
* Used by the Dead letter tab's "Requeue" button after an operator fixes the
* underlying delivery failure (SMTP config, webhook endpoint, etc.). The
* handler returns a StatusResponse ({ status: "requeued" }) — the frontend
* only needs to know the call succeeded so the mutation can invalidate the
* notifications query.
*/
export const requeueNotification = (id: string) =>
fetchJSON<{ status: string }>(`${BASE}/notifications/${id}/requeue`, { method: 'POST' });
// Audit
export const getAuditEvents = (params: Record<string, string> = {}) => {
const qs = new URLSearchParams({ page: '1', per_page: '200', ...params }).toString();
+34 -2
@@ -126,15 +126,47 @@ export interface Job {
verification_error?: string;
}
/**
* Notification mirrors internal/domain/notification.go#NotificationEvent.
*
* I-005 (Notification Retry + Dead-letter Queue) widens the shape with three
* audit fields:
*
* - retry_count — number of delivery attempts already consumed (0..5). The
* 5-cap is enforced server-side by notifRetryMaxAttempts.
* - next_retry_at — RFC3339 timestamp the retry sweep will next consider this
* notification. Null for sent/dead/read and between sweeps
* for pending rows; the sweep populates it on each failure
* using min(2^retry_count * 1m, 1h).
* - last_error — most recent transient delivery failure. Preserved across
* the transition to 'dead' (requeue resets it to NULL) so Dead letter
* triage shows *why* the row died without chasing server logs.
*
* `sent_at` and `error` are the pre-I-005 audit fields on the backend struct.
* `subject` is a historical frontend-only field the backend never emits; it's
* kept optional so legacy fixtures and the pendingNotif test mock still type
* correctly without forcing a rewrite of every existing consumer.
*
* Status values follow the backend NotificationStatus constants:
* pending · sent · failed · dead · read
* The existing list view tolerates the legacy title-cased "Pending" alias at
* render time (NotificationRow) so upgraded clients talking to older servers
* don't regress — see `isUnread` logic in NotificationsPage.tsx.
*/
export interface Notification {
id: string;
type: string;
channel: string;
recipient: string;
- subject: string;
+ subject?: string;
message: string;
status: string;
- certificate_id: string;
+ certificate_id?: string;
sent_at?: string | null;
error?: string | null;
retry_count?: number;
next_retry_at?: string | null;
last_error?: string | null;
created_at: string;
}
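
Because the new fields are optional, consumers need defaults when reading them. The helpers below are a small sketch of how that might look; NotificationLike, retryLabel, isDead, and isUnread are hypothetical names for illustration, and the hard-coded 5 mirrors the "Retry {n}/5" label from the commit message.

```typescript
// Hypothetical narrowed shape: just the fields these helpers read.
interface NotificationLike {
  status: string;
  retry_count?: number;
  last_error?: string | null;
}

const DISPLAY_MAX_ATTEMPTS = 5; // mirrors the "Retry {n}/5" label

// Dead rows are the ones eligible for the Requeue action.
function isDead(n: NotificationLike): boolean {
  return n.status === 'dead';
}

// Tolerates the legacy title-cased "Pending" alias from older servers.
function isUnread(n: NotificationLike): boolean {
  return n.status === 'Pending' || n.status === 'pending';
}

// Falls back to 0 when an older server omits retry_count.
function retryLabel(n: NotificationLike): string {
  return `Retry ${n.retry_count ?? 0}/${DISPLAY_MAX_ATTEMPTS}`;
}
```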
+208
@@ -0,0 +1,208 @@
import { describe, it, expect, vi, beforeEach } from 'vitest';
import { render, screen, waitFor, fireEvent, cleanup } from '@testing-library/react';
import { QueryClient, QueryClientProvider } from '@tanstack/react-query';
import { MemoryRouter } from 'react-router-dom';
import type { ReactNode } from 'react';
// -----------------------------------------------------------------------------
// I-005: NotificationsPage Phase 1 Red — Dead Letter tab + Requeue action
//
// This file pins the frontend contract Phase 2 Green must implement:
//
// 1. A "Dead letter" tab renders alongside the existing status filter, and
// selecting it causes the underlying query to fetch with { status: 'dead' }.
// The tab does not exist at HEAD — the tab-locator assertions are the Red.
//
// 2. Notifications in status='dead' render a "Requeue" action button. HEAD
// only renders "Mark read" for Pending rows and no action for anything
// else — the button-locator assertion is the Red.
//
// 3. Clicking "Requeue" invokes requeueNotification(id) from the API client
// and invalidates the notifications query. `requeueNotification` does not
// yet exist as an export from ../api/client — tsc --noEmit will fail with
// "Property 'requeueNotification' does not exist" when Phase 2 Green runs
// its verification gates, which is the compile-time Red halt. This file is
// structured so Phase 2 Green's single fix (add the client export + page
// wiring) flips the entire suite Green at once.
// -----------------------------------------------------------------------------
vi.mock('../api/client', () => ({
getNotifications: vi.fn(),
getNotification: vi.fn(),
markNotificationRead: vi.fn(),
requeueNotification: vi.fn(),
}));
// Imported after vi.mock so the mock replaces the real module.
import NotificationsPage from './NotificationsPage';
import * as client from '../api/client';
function renderWithQuery(ui: ReactNode) {
const qc = new QueryClient({
defaultOptions: {
queries: { retry: false, gcTime: 0, staleTime: 0 },
},
});
return render(
<QueryClientProvider client={qc}>
<MemoryRouter>{ui}</MemoryRouter>
</QueryClientProvider>,
);
}
const pendingNotif = {
id: 'notif-001',
type: 'ExpirationWarning',
channel: 'Email',
recipient: 'admin@example.com',
subject: 'Certificate expiring',
message: 'Certificate expiring in 7 days',
status: 'Pending',
certificate_id: 'mc-prod-001',
created_at: new Date().toISOString(),
};
const deadNotif = {
id: 'notif-dead-001',
type: 'ExpirationWarning',
channel: 'Email',
recipient: 'admin@example.com',
subject: 'Certificate expiring',
message: 'Certificate expiring in 7 days',
status: 'dead',
certificate_id: 'mc-prod-001',
created_at: new Date().toISOString(),
retry_count: 5,
last_error: 'SMTP connection refused',
};
describe('NotificationsPage — I-005 Dead Letter + Requeue (Phase 1 Red)', () => {
beforeEach(() => {
vi.clearAllMocks();
cleanup();
});
it('renders a Dead letter tab in the filter toolbar', async () => {
vi.mocked(client.getNotifications).mockResolvedValue({
data: [pendingNotif],
total: 1,
page: 1,
per_page: 100,
});
renderWithQuery(<NotificationsPage />);
await waitFor(() => {
expect(screen.queryByText(/Loading/i)).not.toBeInTheDocument();
});
// Red: no Dead letter tab exists at HEAD. Phase 2 Green adds a button/tab
// labeled "Dead letter" (matches docs/testing-guide UI label).
expect(screen.getByRole('button', { name: /Dead letter/i })).toBeInTheDocument();
});
it('clicking Dead letter tab fetches notifications with status=dead', async () => {
vi.mocked(client.getNotifications).mockResolvedValue({
data: [],
total: 0,
page: 1,
per_page: 100,
});
renderWithQuery(<NotificationsPage />);
await waitFor(() => {
expect(screen.queryByText(/Loading/i)).not.toBeInTheDocument();
});
const tab = screen.getByRole('button', { name: /Dead letter/i });
fireEvent.click(tab);
// Red: Phase 2 Green must route the Dead letter tab's query through
// getNotifications({ status: 'dead', per_page: '100' }). HEAD only ever
// calls getNotifications({ per_page: '100' }) — no status param is ever
// passed through.
await waitFor(() => {
const calls = vi.mocked(client.getNotifications).mock.calls;
const deadCall = calls.find(([params]) => (params as Record<string, string>)?.status === 'dead');
expect(deadCall, 'expected getNotifications to be called with status=dead').toBeTruthy();
});
});
it('renders a Requeue button on dead notifications', async () => {
vi.mocked(client.getNotifications).mockResolvedValue({
data: [deadNotif],
total: 1,
page: 1,
per_page: 100,
});
renderWithQuery(<NotificationsPage />);
await waitFor(() => {
expect(screen.queryByText(/Loading/i)).not.toBeInTheDocument();
});
// Switch to Dead letter tab so the mocked dead notification becomes visible.
const tab = screen.getByRole('button', { name: /Dead letter/i });
fireEvent.click(tab);
await waitFor(() => {
// Red: HEAD renders no action for status='dead'. Phase 2 Green adds a
// "Requeue" button next to each dead row.
expect(screen.getByRole('button', { name: /Requeue/i })).toBeInTheDocument();
});
});
it('clicking Requeue invokes requeueNotification(id) from the API client', async () => {
vi.mocked(client.getNotifications).mockResolvedValue({
data: [deadNotif],
total: 1,
page: 1,
per_page: 100,
});
vi.mocked(client.requeueNotification).mockResolvedValue({ status: 'requeued' });
renderWithQuery(<NotificationsPage />);
await waitFor(() => {
expect(screen.queryByText(/Loading/i)).not.toBeInTheDocument();
});
fireEvent.click(screen.getByRole('button', { name: /Dead letter/i }));
const requeueBtn = await screen.findByRole('button', { name: /Requeue/i });
fireEvent.click(requeueBtn);
// Red: client.requeueNotification is not an exported function at HEAD, and
// the page does not call it. Both the mock and the page wiring are added
// in Phase 2 Green.
await waitFor(() => {
expect(client.requeueNotification).toHaveBeenCalledWith('notif-dead-001');
});
});
it('dead notifications surface retry_count and last_error metadata', async () => {
vi.mocked(client.getNotifications).mockResolvedValue({
data: [deadNotif],
total: 1,
page: 1,
per_page: 100,
});
renderWithQuery(<NotificationsPage />);
await waitFor(() => {
expect(screen.queryByText(/Loading/i)).not.toBeInTheDocument();
});
fireEvent.click(screen.getByRole('button', { name: /Dead letter/i }));
await waitFor(() => {
// Red: HEAD does not display retry_count or last_error. Phase 2 Green
// must surface these so operators can see *why* a notification died.
expect(screen.getByText(/SMTP connection refused/i)).toBeInTheDocument();
expect(screen.getByText(/5/)).toBeInTheDocument();
});
});
});
+97 -6
@@ -1,6 +1,6 @@
import { useState, useMemo } from 'react';
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query';
- import { getNotifications, markNotificationRead } from '../api/client';
+ import { getNotifications, markNotificationRead, requeueNotification } from '../api/client';
import PageHeader from '../components/PageHeader';
import StatusBadge from '../components/StatusBadge';
import ErrorState from '../components/ErrorState';
@@ -9,15 +9,37 @@ import type { Notification } from '../api/types';
type ViewMode = 'list' | 'grouped';
// I-005: the Notifications page now hosts two tabs. "all" is the pre-I-005
// inbox behavior — no server-side status filter, client-side type/status
// dropdowns untouched. "dead" routes the query through the new ?status=dead
// handler branch so operators can triage the dead-letter queue in isolation.
// The tab is intentionally a separate state axis from the status dropdown so
// the two don't fight each other (dropdown filters within the tab's scope).
type ActiveTab = 'all' | 'dead';
export default function NotificationsPage() {
const [viewMode, setViewMode] = useState<ViewMode>('grouped');
const [typeFilter, setTypeFilter] = useState('');
const [statusFilter, setStatusFilter] = useState('');
const [activeTab, setActiveTab] = useState<ActiveTab>('all');
const queryClient = useQueryClient();
const { data, isLoading, error, refetch } = useQuery({
- queryKey: ['notifications'],
- queryFn: () => getNotifications({ per_page: '100' }),
// I-005: queryKey carries the active tab so TanStack Query treats
// "all" and "dead" as distinct cache entries. Without this, switching
// tabs would return stale data until the 30s refetchInterval fires.
queryKey: ['notifications', activeTab],
queryFn: () => {
const params: Record<string, string> = { per_page: '100' };
if (activeTab === 'dead') {
// The listNotifications handler's ?status=dead branch hits the
// NotificationRepository.ListByStatus path instead of plain List,
// which is both cheaper (DLQ is a small slice of all notifications)
// and correct (pagination counts DLQ rows, not the full inbox).
params.status = 'dead';
}
return getNotifications(params);
},
refetchInterval: 30000,
});
@@ -26,6 +48,23 @@ export default function NotificationsPage() {
onSuccess: () => queryClient.invalidateQueries({ queryKey: ['notifications'] }),
});
// I-005: requeue a dead notification. Invalidates both tab cache entries
// because a successful requeue flips the row out of "dead" and potentially
// into the "all" tab on its next refetch (status becomes 'pending').
//
// The mutationFn is wrapped as `(id) => requeueNotification(id)` rather
// than passed by reference so react-query v5's second positional argument
// (the mutation context object) never reaches the API client. Without the
// wrapper, TanStack invokes `requeueNotification(id, { client })`, and the
// I-005 Phase 1 Red contract's strict `toHaveBeenCalledWith('notif-dead-001')`
// assertion fails on the extra argument. Keep the arrow even if the context
// object later becomes structurally empty — the contract pins a single-arg
// call and the page must not leak mutation machinery into API boundaries.
const requeue = useMutation({
mutationFn: (id: string) => requeueNotification(id),
onSuccess: () => queryClient.invalidateQueries({ queryKey: ['notifications'] }),
});
const notifications = data?.data || [];
const filtered = useMemo(() => {
@@ -81,6 +120,23 @@ export default function NotificationsPage() {
subtitle={`${filtered.length} notifications${unreadCount ? ` (${unreadCount} unread)` : ''}`}
/>
<div className="px-4 py-3 flex flex-wrap items-center gap-3 border-b border-surface-border/50">
{/* I-005: tab switcher between the standard inbox and the DLQ. The
"Dead letter" label is pinned by NotificationsPage.test.tsx — do
not rename without updating the Phase 1 Red contract. */}
<div className="flex rounded overflow-hidden border border-surface-border">
<button
onClick={() => setActiveTab('all')}
className={`px-3 py-1.5 text-xs transition-colors ${activeTab === 'all' ? 'bg-brand-400 text-white' : 'bg-surface text-ink-muted hover:text-ink'}`}
>
All
</button>
<button
onClick={() => setActiveTab('dead')}
className={`px-3 py-1.5 text-xs transition-colors ${activeTab === 'dead' ? 'bg-brand-400 text-white' : 'bg-surface text-ink-muted hover:text-ink'}`}
>
Dead letter
</button>
</div>
<div className="flex rounded overflow-hidden border border-surface-border">
<button
onClick={() => setViewMode('grouped')}
@@ -135,7 +191,7 @@ export default function NotificationsPage() {
</div>
<div className="space-y-2">
{items.map((n) => (
- <NotificationRow key={n.id} notification={n} onMarkRead={() => markRead.mutate(n.id)} />
+ <NotificationRow key={n.id} notification={n} onMarkRead={() => markRead.mutate(n.id)} onRequeue={() => requeue.mutate(n.id)} />
))}
</div>
</div>
@@ -157,10 +213,25 @@ export default function NotificationsPage() {
);
}
- function NotificationRow({ notification: n, onMarkRead }: { notification: Notification; onMarkRead: () => void }) {
function NotificationRow({
notification: n,
onMarkRead,
onRequeue,
}: {
notification: Notification;
onMarkRead: () => void;
// I-005: optional so callers who don't care about the DLQ (if any are ever
// added) aren't forced to thread a no-op through. Every NotificationRow
// today passes this, so in practice it's always defined.
onRequeue?: () => void;
}) {
const isUnread = n.status === 'Pending' || n.status === 'pending';
// I-005: dead rows get a Requeue button and surface the retry budget + the
// last transient error so operators triaging the DLQ can see *why* the
// notification died before deciding whether to requeue.
const isDead = n.status === 'dead';
return (
- <div className={`flex items-start justify-between py-2 px-3 rounded transition-colors ${isUnread ? 'bg-surface-muted border-l-2 border-brand-400' : 'hover:bg-surface-muted'}`}>
+ <div className={`flex items-start justify-between py-2 px-3 rounded transition-colors ${isUnread ? 'bg-surface-muted border-l-2 border-brand-400' : isDead ? 'bg-surface-muted border-l-2 border-danger' : 'hover:bg-surface-muted'}`}>
<div className="flex-1 min-w-0">
<div className="flex items-center gap-2 mb-1">
<span className="text-sm text-ink">{n.type.replace(/([A-Z])/g, ' $1').trim()}</span>
@@ -168,6 +239,18 @@ function NotificationRow({ notification: n, onMarkRead }: { notification: Notifi
<span className="text-xs text-ink-faint">{n.channel}</span>
</div>
<p className="text-xs text-ink-muted truncate">{n.message || n.subject}</p>
{isDead && (
<div className="flex items-center gap-3 mt-1 text-xs">
<span className="text-ink-faint">
Retry {n.retry_count ?? 0}/5
</span>
{n.last_error && (
<span className="text-danger truncate" title={n.last_error}>
{n.last_error}
</span>
)}
</div>
)}
<div className="flex items-center gap-3 mt-1">
<span className="text-xs text-ink-faint">{n.recipient}</span>
<span className="text-xs text-ink-faint">{timeAgo(n.created_at)}</span>
@@ -181,6 +264,14 @@ function NotificationRow({ notification: n, onMarkRead }: { notification: Notifi
Mark read
</button>
)}
{isDead && onRequeue && (
<button
onClick={(e) => { e.stopPropagation(); onRequeue(); }}
className="ml-3 text-xs text-brand-400 hover:text-brand-500 transition-colors whitespace-nowrap"
>
Requeue
</button>
)}
</div>
);
}