certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-08 07:38:57 +00:00

Author	SHA1	Message	Date
shankar0123	efce2363f7	feat(metrics): extend /metrics/prometheus with per-area OCSP counters (Phase 8) Production hardening II Phase 8: surface the OCSP per-event counters shipped in Phase 1+2 through the existing /api/v1/metrics/prometheus endpoint. Operators can now alert on certctl_ocsp_counter_total{label="rate_limited"} (Phase 3 trip), {label="nonce_malformed"} (Phase 1 reject), {label="signing_failed"} (issuer connector fails), etc. NEW interface CounterSnapshotter (handler/metrics.go): the minimum surface the Prometheus exposer needs from any per-area counter table, just Snapshot() map[string]uint64. service.OCSPCounters.Snapshot (Phase 1) satisfies it; future per-area counters (CRL, cert-export, EST per-profile, SCEP per-profile, Intune per-profile) plug in the same way as separate SetXxxCounters setters. Naming convention per frozen decision 0.10: `certctl_<area>_counter_total{label="<event>"} <value>`. This commit ships only the OCSP block. The remaining areas (CRL, cert-export, EST, SCEP, Intune) plug in via the same SetXxxCounters pattern in follow-up commits; the wire-up cost per area is one new field + one setter + one block of fmt.Fprintf lines. The bundle's S-1 docs-count guard means we don't claim a specific total in prose; operators run `curl /api/v1/metrics/prometheus | grep certctl_` to enumerate. Wired in cmd/server/main.go: a single shared *service.OCSPCounters instance is created once and passed to BOTH the ocspResponseCacheService (so the cache hot path ticks counters) AND metricsHandler.SetOCSPCounters (so the Prometheus exposer reads them). Existing dashboard metrics (certctl_certificate_total, certctl_agent_total, etc.) remain unchanged at the same line offsets, so back-compat is preserved. Pre-commit verification: go build ./... clean; go test -short -count=1 green for handler/ + service/.
The existing TestGetPrometheusMetrics_Success tests still pass (the new counter block is additive at the END of the response body, after the existing dashboard metrics + uptime line).	2026-04-30 05:15:05 +00:00
Shankar	15daf008aa	I-005: notification retry loop + dead-letter queue Critical alerts can no longer be silently dropped by a transient notifier failure. Failed notification attempts now ride an exponential backoff retry loop, with a 5-attempt budget before promotion to the dead-letter queue for operator intervention. Schema (migration 000016, idempotent): - retry_count INTEGER NOT NULL DEFAULT 0 - next_retry_at TIMESTAMPTZ - last_error TEXT - idx_notification_events_retry_sweep partial index (next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL Dead rows clear next_retry_at so the index stops matching them. Service contract: - NotificationService.RetryFailedNotifications drives 2^n-minute exponential backoff capped at 1h (notifRetryBackoffCap) with 5-attempt budget (notifRetryMaxAttempts). - Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to status='dead' via MarkAsDead. - Non-terminal failures record via RecordFailedAttempt. - Success path promotes to 'sent' without touching retry_count (audit preserves "delivered on attempt N"). - Missing-notifier branch defensively promotes to 'sent' to avoid wedging a row on a deleted channel. - RequeueNotification operator escape hatch atomically resets retry_count -> 0, next_retry_at -> NULL, last_error -> NULL, status -> pending via notifRepo.Requeue. Scheduler: - New always-on notificationRetryLoop wired into the base loop set at CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m). - sync/atomic.Bool idempotency guard. - sync.WaitGroup shutdown drain via WaitForCompletion. StatsService: - SetNotifRepo setter pattern preserves 9 pre-existing NewStatsService call sites (main.go + stats_test.go + 8 digest tests) without touching the constructor signature. - DashboardSummary.NotificationsDead populated via notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired (reports zero on systems without a notification repository). 
- CountByStatus error is non-fatal (dashboard summary is best-effort for this field). - Prometheus certctl_notification_dead_total counter emitted from the same snapshot. Handler: - New POST /api/v1/notifications/{id}/requeue endpoint. - dead status surfaces to MCP + CLI. Frontend: - NotificationsPage gains two-tab toolbar ("All" / "Dead letter") with queryKey: ['notifications', activeTab] so switching tabs doesn't serve stale data until the 30s refetch. - Dead rows surface "Retry {n}/5" + truncated last_error with full-text title tooltip. - Requeue mutation wrapped as mutationFn: (id: string) => requeueNotification(id) to prevent react-query v5's positional context argument from leaking into the API client — pinned against future refactors by strict-match toHaveBeenCalledWith('notif-dead-001') in NotificationsPage.test.tsx:181. Closes I-005.	2026-04-19 15:17:27 +00:00
Shankar	be85fbd77e	feat: add network certificate discovery (M21) and Prometheus metrics (M22) M21 adds server-side active TLS scanning of CIDR ranges with concurrent probing, sentinel agent pattern for pipeline reuse, and full CRUD API for scan targets. M22 adds Prometheus exposition format endpoint alongside existing JSON metrics. Comprehensive documentation audit updates all docs to reflect 91 endpoints, 19 tables, 6 scheduler loops, and 900+ tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 23:37:47 -04:00
Shankar	45d6e2f7b2	feat: M14 — Observability (dashboard charts, agent fleet, stats API, metrics, structured logging, rollback) Backend: StatsService with 5 aggregation methods, JSON metrics endpoint, slog-based structured logging middleware. Stats API: dashboard summary, certificates-by-status, expiration timeline, job trends, issuance rate. 23 new backend tests. Frontend: Recharts-powered dashboard with 4 charts (status pie, expiration heatmap, job trends line, issuance bar), agent fleet overview page with OS/arch grouping and version breakdown, deployment rollback buttons on version history. 7 new frontend tests. 78 API endpoints, 744+ total tests (658 Go + 86 Vitest). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-22 19:46:13 -04:00
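
The per-area counter pattern described in commit efce2363f7 can be sketched roughly as follows. Only the CounterSnapshotter interface shape (Snapshot() map[string]uint64) and the certctl_<area>_counter_total{label="<event>"} naming come from the commit message; the eventCounters type, its Inc method, and the renderCounterBlock helper are hypothetical stand-ins, not the repository's actual code. Sorting the labels is also an assumption, made here so the output is deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// CounterSnapshotter has the shape named in the commit message: the only
// thing the exposer needs from a per-area counter table.
type CounterSnapshotter interface {
	Snapshot() map[string]uint64
}

// eventCounters is a hypothetical stand-in for service.OCSPCounters.
type eventCounters struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newEventCounters() *eventCounters {
	return &eventCounters{counts: make(map[string]uint64)}
}

// Inc ticks the counter for one event label (hypothetical method name).
func (c *eventCounters) Inc(event string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[event]++
}

// Snapshot returns a copy so the caller can read it without holding the lock.
func (c *eventCounters) Snapshot() map[string]uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string]uint64, len(c.counts))
	for k, v := range c.counts {
		out[k] = v
	}
	return out
}

// renderCounterBlock (hypothetical helper) formats one area's counters as
// certctl_<area>_counter_total{label="<event>"} <value> lines.
func renderCounterBlock(area string, s CounterSnapshotter) string {
	if s == nil {
		return "" // area not wired: emit nothing
	}
	snap := s.Snapshot()
	labels := make([]string, 0, len(snap))
	for l := range snap {
		labels = append(labels, l)
	}
	sort.Strings(labels) // map order is random; sort for stable output

	var sb strings.Builder
	for _, l := range labels {
		fmt.Fprintf(&sb, "certctl_%s_counter_total{label=%q} %d\n", area, l, snap[l])
	}
	return sb.String()
}

func main() {
	ocsp := newEventCounters()
	ocsp.Inc("rate_limited")
	ocsp.Inc("rate_limited")
	ocsp.Inc("nonce_malformed")
	fmt.Print(renderCounterBlock("ocsp", ocsp))
}
```

Because the exposer depends only on the one-method interface, a second area would need nothing beyond another value that implements Snapshot, which matches the "one field + one setter + one block" cost the commit describes.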

4 Commits