certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 14:11:31 +00:00

Author	SHA1	Message	Date
shankar0123	109f32ff41	notifications: per-policy multi-channel expiry-alert routing

Closes Rank 4 of the 2026-05-03 Infisical deep-research deliverable (see cowork/infisical-deep-research-results.md Part 5).

Pre-fix, RenewalService.CheckExpiringCertificates already ran daily, RenewalPolicy.AlertThresholdsDays drove per-cert thresholds, and NotificationService.SendThresholdAlert deduped per (cert, threshold), but the channel was hardcoded to Email (internal/service/notification.go:118 pre-fix). Operators who configured PagerDuty / Slack / Teams / OpsGenie via CERTCTL_PAGERDUTY_ROUTING_KEY etc. got nothing at any threshold unless SMTP was also wired. Their first signal of an expired cert was a 3 AM outage.

This commit lands the routing matrix on top of the existing infrastructure:

1. RenewalPolicy gains AlertChannels (per-tier channel list) + AlertSeverityMap (per-threshold tier assignment) + EffectiveAlertChannels / EffectiveAlertSeverity accessors. Default*() helpers preserve the back-compat Email-only behaviour for operators who haven't touched their policies post-upgrade. Migration 000026 adds the JSONB columns idempotently.
2. NotificationService.SendThresholdAlertOnChannel: the new per-channel dispatch helper. Old SendThresholdAlert stays as an Email-only alias so non-policy callers (admin "send test alert" surfaces) keep working byte-for-byte.
3. NotificationService.HasThresholdNotificationOnChannel: per-(cert, threshold, channel) deduplication, so a transient PagerDuty 5xx today does NOT suppress today's Slack alert and tomorrow's PagerDuty retry will still fire.
4. RenewalService.sendThresholdAlerts walks the resolved channel set per threshold tier, fans out to every configured channel, handles per-channel failures independently, defensively drops off-enum channels with an audit row trail, and records a per-channel audit event with metadata.channel + metadata.severity_tier.
5. service.ExpiryAlertMetrics: atomic counter table mirrored on the VaultRenewalMetrics shape from the 2026-05-03 audit fix #5 (commit `0792271`). Three labels: channel × threshold × result (success / failure / deduped). Cardinality bound: 6 × 4 × 3 = 72 series for the standard 4-threshold matrix.
6. handler.MetricsHandler.SetExpiryAlerts wires the Prometheus exposer for certctl_expiry_alerts_total{channel,threshold,result}. Pre-sorted snapshot for byte-stable emission.
7. cmd/server/main.go threads ONE service.ExpiryAlertMetrics instance through both the recording side (notificationService.SetExpiryAlertMetrics) and the exposing side (metricsHandler.SetExpiryAlerts).

Dispatch flow (post-fix, per renewal-loop tick):
  cert ages past T-30 → daily renewal-loop fires → policy lookup → for each crossed threshold:
    - resolve severity tier (informational/warning/critical) via AlertSeverityMap
    - look up channel set in AlertChannels[tier]
    - for each channel: dedup → SendThresholdAlertOnChannel → notifierRegistry[channel] → audit row → Prometheus counter increment

Tests (internal/service/renewal_expiry_alerts_test.go):
  TestExpiryAlerts_DefaultMatrix_EmailOnly
  TestExpiryAlerts_PerTierFanOut
  TestExpiryAlerts_PerChannelDedup
  TestExpiryAlerts_OneChannelFails_OthersStillFire
  TestExpiryAlerts_OffEnumChannelDropped
  TestExpiryAlerts_MetricCounterIncrements
  TestExpiryAlerts_NilPolicy_FallsToDefault
  TestExpiryAlerts_OperatorOptOutOfTier

The PerTierFanOut test wires 6 mock notifiers, drives a cert at 0 days through the canonical 4 thresholds with the matrix {informational:[Slack], warning:[Slack,Email], critical:[PagerDuty,OpsGenie,Email]}, and asserts the exact recipient counts: Slack=3, Email=3, PagerDuty=1, OpsGenie=1, no Teams, no Webhook. The OneChannelFails test pins that PagerDuty returning a 503 does NOT skip Slack/Email at the same threshold.

Drive-by fix (internal/service/testutil_test.go): the existing mockNotifRepo.List ignored its filter and returned all rows, which let legacy tests pass on dedup-via-substring even though the postgres repo actually applied the filter. Updated the mock to honour CertificateID / Type / Status / Channel / MessageLike filters in the same shape as the postgres implementation (internal/repository/postgres/notification.go). All pre-existing service tests still pass; the legacy test suite happened to be robust to the mock filter doing nothing.

Documentation:
- docs/connectors.md Notifier section gains "Routing expiry alerts across channels": operator-facing, JSON example, procurement playbook ("How do I make sure PagerDuty pages on the T-1 alert?"), debug recipe via SQL on audit_events + notification_events + Prometheus.
- docs/runbook-expiry-alerts.md: sysadmin-grade flowchart, per-policy channel-matrix configuration recipes, "did the on-call team get paged?" SQL queries, cardinality budget, V3-Pro forward path.
- cowork/WORKSPACE-ROADMAP.md gains "Multi-channel expiry alerts: per-owner routing" V3-Pro entry under Adapter hardening.

Out of scope (intentional, flagged in V3-Pro forward path):
- Per-owner / per-team / per-tenant channel routing (matrix is per-policy today, not per-owner).
- Calendar-aware suppression (no T-30 alerts on weekends).
- Escalation chains (T-1 unanswered for 30m → escalate).
- Per-channel rate limiting (downstream of I-005 retry+DLQ).

CHANGELOG.md is intentionally not hand-edited per CHANGELOG.md itself ("no longer maintains a hand-edited per-version changelog; per-release notes are auto-generated from commit messages between consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/domain/... ./internal/service/... ./internal/api/handler/... ./cmd/server/... clean. (./internal/repository/postgres/... vet failed on transitive testcontainers/docker module download: sandbox disk pressure, not a code issue; postgres-repo build succeeds and tests pass.)
- go test -short -count=1 ./internal/domain/... ./internal/service/... ./internal/api/handler/... green.
- go test -race -count=10 -run 'TestExpiryAlerts' ./internal/service/... green (per-channel dedup race-free).

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 4.
Acquisition prompt: cowork/rank-4-multichannel-expiry-alerts-prompt.md.	2026-05-03 22:12:32 +00:00
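Annotation on 109f32ff41: the threshold → severity tier → channel-set resolution described in that commit can be sketched as below. This is a minimal illustration, not certctl's code: the `Policy` type, `channelsFor`, and `fanOutCounts` are hypothetical names, and the threshold-to-tier assignment (30 → informational, 14 and 7 → warning, 0 → critical) is an assumption chosen because it reproduces the recipient counts quoted for the PerTierFanOut test.

```go
package main

import "fmt"

// Tier is a hypothetical stand-in for the severity tiers named in the commit.
type Tier string

const (
	Informational Tier = "informational"
	Warning       Tier = "warning"
	Critical      Tier = "critical"
)

// Policy is a hypothetical, trimmed-down view of a renewal policy.
type Policy struct {
	AlertChannels    map[Tier][]string // per-tier channel list
	AlertSeverityMap map[int]Tier      // per-threshold tier assignment
}

// channelsFor resolves threshold -> tier -> channels. A nil policy or an
// empty matrix falls back to Email only (the back-compat default).
func channelsFor(p *Policy, threshold int) []string {
	if p == nil || len(p.AlertChannels) == 0 {
		return []string{"Email"}
	}
	tier, ok := p.AlertSeverityMap[threshold]
	if !ok {
		tier = Informational // assumption: unmapped thresholds are informational
	}
	return p.AlertChannels[tier] // an empty list means the tier is opted out
}

// fanOutCounts counts alerts per channel for every threshold the cert has
// crossed (daysLeft <= threshold). Dedup and delivery are out of scope here.
func fanOutCounts(p *Policy, daysLeft int, thresholds []int) map[string]int {
	counts := map[string]int{}
	for _, t := range thresholds {
		if daysLeft > t {
			continue
		}
		for _, ch := range channelsFor(p, t) {
			counts[ch]++
		}
	}
	return counts
}

func main() {
	p := &Policy{
		AlertChannels: map[Tier][]string{
			Informational: {"Slack"},
			Warning:       {"Slack", "Email"},
			Critical:      {"PagerDuty", "OpsGenie", "Email"},
		},
		AlertSeverityMap: map[int]Tier{30: Informational, 14: Warning, 7: Warning, 0: Critical},
	}
	// prints: map[Email:3 OpsGenie:1 PagerDuty:1 Slack:3]
	fmt.Println(fanOutCounts(p, 0, []int{30, 14, 7, 0}))
}
```

Keeping the resolution a pure function of (policy, threshold) makes the fan-out easy to test separately from dedup and delivery, which the commit handles per channel.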
shankar0123	675b87ba63	I-005: notification retry loop + dead-letter queue

Critical alerts can no longer be silently dropped by a transient notifier failure. Failed notification attempts now ride an exponential backoff retry loop, with a 5-attempt budget before promotion to the dead-letter queue for operator intervention.

Schema (migration 000016, idempotent):
- retry_count INTEGER NOT NULL DEFAULT 0
- next_retry_at TIMESTAMPTZ
- last_error TEXT
- idx_notification_events_retry_sweep partial index (next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL
Dead rows clear next_retry_at so the index stops matching them.

Service contract:
- NotificationService.RetryFailedNotifications drives 2^n-minute exponential backoff capped at 1h (notifRetryBackoffCap) with 5-attempt budget (notifRetryMaxAttempts).
- Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to status='dead' via MarkAsDead.
- Non-terminal failures record via RecordFailedAttempt.
- Success path promotes to 'sent' without touching retry_count (audit preserves "delivered on attempt N").
- Missing-notifier branch defensively promotes to 'sent' to avoid wedging a row on a deleted channel.
- RequeueNotification operator escape hatch atomically resets retry_count -> 0, next_retry_at -> NULL, last_error -> NULL, status -> pending via notifRepo.Requeue.

Scheduler:
- New always-on notificationRetryLoop wired into the base loop set at CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m).
- sync/atomic.Bool idempotency guard.
- sync.WaitGroup shutdown drain via WaitForCompletion.

StatsService:
- SetNotifRepo setter pattern preserves 9 pre-existing NewStatsService call sites (main.go + stats_test.go + 8 digest tests) without touching the constructor signature.
- DashboardSummary.NotificationsDead populated via notifRepo.CountByStatus(ctx, "dead"), nil-safe when unwired (reports zero on systems without a notification repository).
- CountByStatus error is non-fatal (dashboard summary is best-effort for this field).
- Prometheus certctl_notification_dead_total counter emitted from the same snapshot.

Handler:
- New POST /api/v1/notifications/{id}/requeue endpoint.
- dead status surfaces to MCP + CLI.

Frontend:
- NotificationsPage gains two-tab toolbar ("All" / "Dead letter") with queryKey: ['notifications', activeTab] so switching tabs doesn't serve stale data until the 30s refetch.
- Dead rows surface "Retry {n}/5" + truncated last_error with full-text title tooltip.
- Requeue mutation wrapped as mutationFn: (id: string) => requeueNotification(id) to prevent react-query v5's positional context argument from leaking into the API client; pinned against future refactors by strict-match toHaveBeenCalledWith('notif-dead-001') in NotificationsPage.test.tsx:181.

Closes I-005.	2026-04-19 15:17:27 +00:00
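Annotation on 675b87ba63: the retry schedule stated in that commit (2^n-minute backoff capped at 1h, 5-attempt budget, exhaustion at RetryCount >= max-1) can be sketched as follows. The constant and function names are hypothetical stand-ins, and whether n is the retry count before or after the failed attempt is an assumption; the commit does not say.

```go
package main

import (
	"fmt"
	"time"
)

const (
	retryMaxAttempts = 5         // stand-in for notifRetryMaxAttempts
	retryBackoffCap  = time.Hour // stand-in for notifRetryBackoffCap
)

// backoff returns 2^n minutes, capped at one hour.
// Assumption: n is the number of failed attempts recorded so far.
func backoff(retryCount int) time.Duration {
	if retryCount < 0 {
		retryCount = 0
	}
	if retryCount >= 6 { // 2^6 min = 64 min already exceeds the cap; also avoids overflow
		return retryBackoffCap
	}
	d := time.Duration(1<<uint(retryCount)) * time.Minute
	if d > retryBackoffCap {
		return retryBackoffCap
	}
	return d
}

// exhausted mirrors the exhaustion rule quoted in the commit message:
// RetryCount >= notifRetryMaxAttempts-1 promotes the row to 'dead'.
func exhausted(retryCount int) bool {
	return retryCount >= retryMaxAttempts-1
}

func main() {
	for n := 0; n <= retryMaxAttempts; n++ {
		fmt.Printf("retry_count=%d backoff=%v dead=%v\n", n, backoff(n), exhausted(n))
	}
}
```

Under this reading the delays for counts 0 to 3 are 1, 2, 4, and 8 minutes, and the row is marked dead once the count reaches 4.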
shankar0123	ccd89c348f	fix(m2-pr-d): thread ctx through Job/Notification/Audit services

Collapse CancelJobWithContext into CancelJob; eliminate 10 context.Background() hits across the Job+Notification+Audit service cluster by threading ctx through their handler-facing service interfaces.

Services (ctx-first):
- service/job.go: ListJobs, GetJob, CancelJob, ApproveJob, RejectJob now accept ctx; the CancelJobWithContext wrapper is removed (handler callers continue to invoke CancelJob, now ctx-aware).
- service/notification.go: ListNotifications, GetNotification, MarkAsRead accept ctx.
- service/audit.go: ListAuditEvents, GetAuditEvent accept ctx.

Handlers (interface + callsites):
- handler/jobs.go, handler/notifications.go, handler/audit.go: local service interfaces updated, r.Context() threaded at every callsite.

Tests:
- Mock services updated to match the new interfaces (ctx accepted and ignored via '_ context.Context' first parameter; Fn closure fields unchanged).
- job_test.go / notification_test.go callsites thread context.Background() to match production shape.

Verification:
  go build ./... ok
  go vet ./... ok
  go test -short ./... ok
  go test -race -short ./... ok
  golangci-lint run ./... 0 issues

Locked decisions from the M-2 plan:
  D-1 ctx-only signatures (no dual forms)
  D-4 preserve handler method names facing the router
  D-5 domain types stay ctx-free

Audit complete. Commit: `1f6cf0eafa`. Sections: 12. Findings: 2/7/10/4/6.	2026-04-18 01:20:46 +00:00
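Annotation on ccd89c348f: a minimal sketch of the ctx-first pattern that commit describes, where the handler passes r.Context() into a service method instead of the service calling context.Background(). The interface shape, the stub service, the route, and the 503 status on cancellation are all illustrative assumptions, not certctl's actual handler code.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// JobService is a hypothetical handler-facing interface with a ctx-first method.
type JobService interface {
	CancelJob(ctx context.Context, id string) error
}

// stubJobs stands in for the real service; it only honours cancellation.
type stubJobs struct{}

func (stubJobs) CancelJob(ctx context.Context, id string) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("cancel job %s: %w", id, err)
	}
	return nil
}

// cancelHandler threads the request context into the service call.
func cancelHandler(svc JobService) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if err := svc.CancelJob(r.Context(), r.URL.Query().Get("id")); err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	}
}

// statusFor runs the handler once and returns the HTTP status code,
// optionally with the request context cancelled before the call.
func statusFor(cancelled bool) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelled {
		cancel()
	}
	req := httptest.NewRequest(http.MethodPost, "/cancel?id=job-1", nil).WithContext(ctx)
	rec := httptest.NewRecorder()
	cancelHandler(stubJobs{}).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(false), statusFor(true)) // prints: 204 503
}
```

With a single ctx-aware method there is no need for a separate "WithContext" variant, which matches decision D-1 (ctx-only signatures, no dual forms).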
shankar0123	5d98e373e3	feat: M15a — certificate revocation API, CRL endpoint, and revocation notifications Implements core revocation infrastructure: POST /api/v1/certificates/{id}/revoke with all 8 RFC 5280 reason codes, JSON-formatted CRL at GET /api/v1/crl, webhook and email revocation notifications, best-effort issuer notification, and immutable revocation audit trail. Includes 48 new tests across service, handler, integration, and domain layers (600+ total). Fixes 3 pre-existing test bugs (team_test error matching, agent_group delete status code, team handler per_page validation). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-22 10:59:18 -04:00
shankar0123	b0549e6f05	feat: M11b — ownership tracking, agent groups, interactive renewal approval Ownership: owners/teams GUI pages, notification email resolution via resolveRecipient (owner_id → owner.email lookup). Agent groups: dynamic device grouping by OS/arch/IP CIDR/version with manual include/exclude membership, migration 000004, full CRUD stack (domain → repo → service → handler → frontend). Interactive approval: AwaitingApproval job state, approve/reject API endpoints with reason tracking. Tests: 12 agent group handler tests, 8 approve/reject job handler tests, integration tests updated for 13-param RegisterHandlers. Docs updated across architecture, concepts, and seed data. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 21:02:35 -04:00
shankar0123	e03a75ed9a	fix: replace fmt.Printf with structured slog logging across all services All 10 service files now use slog.Error for failure logging instead of fmt.Printf. Audit event recording errors are checked and logged rather than silently discarded. Adds consistent structured context (resource IDs, operation names) to all error log statements. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 01:20:03 -04:00
shankar0123	66f04f7afe	style: run gofmt -s across all Go files Fixes Go Report Card gofmt score from 52% to 100%. Pure formatting changes — no logic modifications. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 19:32:29 -04:00
shankar0123	1d1b89c9b5	Implement M3: expiration threshold alerting with dedup and status transitions

- Add alert_thresholds_days JSONB column to renewal_policies (default [30,14,7,0])
- Add RenewalPolicy.AlertThresholdsDays field + EffectiveAlertThresholds() helper
- Add RenewalPolicyRepository interface + postgres implementation
- Rewrite CheckExpiringCertificates with per-policy threshold alerting
- Add SendThresholdAlert + HasThresholdNotification for deduplication via [threshold:N] tags
- Add Type and MessageLike filters to NotificationFilter + postgres query support
- Auto-transition certs to Expiring (>0 days) or Expired (<=0 days) status
- Record expiration_alert_sent audit events per threshold crossing
- Fix .gitignore: allow SQL migration files, scope server/agent build artifact rules
- Track previously untracked cmd/ and migrations/ directories
- Update docs (README, architecture, demo-advanced) for threshold alerting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-15 00:03:43 -04:00
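Annotation on 1d1b89c9b5: the threshold-crossing check, the [threshold:N] dedup tag, and the Expiring/Expired transition from that commit can be sketched as below. The function names are hypothetical, and the in-memory slice of previously sent messages is an assumption standing in for the MessageLike repository filter.

```go
package main

import (
	"fmt"
	"strings"
)

// thresholdTag builds the dedup marker embedded in an alert message.
func thresholdTag(days int) string {
	return fmt.Sprintf("[threshold:%d]", days)
}

// crossed returns the thresholds the certificate has reached (daysLeft <= t).
func crossed(daysLeft int, thresholds []int) []int {
	var out []int
	for _, t := range thresholds {
		if daysLeft <= t {
			out = append(out, t)
		}
	}
	return out
}

// alreadySent mimics a substring (MessageLike-style) lookup over prior alerts.
func alreadySent(sent []string, days int) bool {
	tag := thresholdTag(days)
	for _, msg := range sent {
		if strings.Contains(msg, tag) {
			return true
		}
	}
	return false
}

// pending returns crossed thresholds that have not been alerted on yet.
func pending(daysLeft int, thresholds []int, sent []string) []int {
	var out []int
	for _, t := range crossed(daysLeft, thresholds) {
		if !alreadySent(sent, t) {
			out = append(out, t)
		}
	}
	return out
}

// statusFor applies the transition rule: Expiring (>0 days) or Expired (<=0 days).
func statusFor(daysLeft int) string {
	if daysLeft > 0 {
		return "Expiring"
	}
	return "Expired"
}

func main() {
	sent := []string{"certificate expires soon [threshold:30]"}
	// prints: [14] Expiring
	fmt.Println(pending(10, []int{30, 14, 7, 0}, sent), statusFor(10))
}
```

The closing bracket in the tag matters for substring matching: "[threshold:3]" is not a substring of "[threshold:30]", so tags for different thresholds cannot be confused.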
shankar0123	9b4122b159	Fix runtime bugs, implement service layer, and overhaul documentation

Runtime fixes:
- Fix env var mismatch (CERTCTL_DB_URL → CERTCTL_DATABASE_URL)
- Fix table name mismatches (certificates → managed_certificates, notifications → notification_events)
- Add renewal_policy_id to certificate queries
- Remove non-existent created_at from notification queries
- Add env var fallback for agent CLI flags
- Graceful degradation for missing notifiers/issuers in demo mode
- Copy web/ directory in Dockerfile for dashboard serving

Service layer:
- Implement handler-service interface pattern across all services
- Wire up certificate, agent, job, policy, team, owner, audit, notification services

Documentation:
- Add concepts.md: beginner-friendly guide to TLS, CAs, private keys
- Rewrite quickstart.md with accurate API examples matching actual handlers
- Add demo-advanced.md: interactive demo with cert issuance and automated script
- Update architecture.md with correct table names and connector interfaces
- Update connectors.md to match actual Go interface signatures
- Update demo-guide.md with cross-references to new docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 21:38:11 -04:00
shankar0123	d395776a95	Initial scaffold: certificate control plane v0.1.0	2026-03-14 08:22:17 -04:00

10 Commits