certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 15:41:41 +00:00

Author	SHA1	Message	Date
shankar0123	7cb453a336	chore(fmt): repo-wide gofmt -w sweep — close drift surfaced by ci-pipeline-cleanup Phase 4 Mechanical reformat. The new 'gofmt drift' CI step (added in ci-pipeline-cleanup Phase 4, commit `0f205a8`) surfaced 111 files with accumulated gofmt drift across cmd/, internal/, and deploy/test/. Each file's diff is gofmt-standard: whitespace adjustments, intra- group import sorting (alphabetical by import path within blank-line- separated groups), and struct-tag column alignment. No semantic changes — verified via 'git diff --ignore-all-space' which shows only the line-position deltas from import reordering. The gate stays in place after this commit. Going forward it catches gofmt drift at PR time.	2026-04-30 22:33:57 +00:00
shankar0123	62a412c488	Bundle C: Renewal/reliability cluster — 7 findings closed Closes M-006 + M-007 + M-008 + M-015 + M-016 + M-019 + M-020 from comprehensive-audit-2026-04-25. M-028 was already closed by the Bundle B CI follow-up. M-006 (CWE-913) — Idempotent migration 000014 migrations/000014_policy_violation_severity_check.up.sql: Prepended ALTER TABLE ... DROP CONSTRAINT IF EXISTS before the ADD. Mirrors the down migration's existing IF EXISTS shape and the M-7 idempotent-index idiom. Re-runs against partially-applied DBs now succeed. M-007 — Bulk-op partial-failure tests (3 new) internal/api/handler/bulk_partial_failure_test.go: TestBulkRevoke_PartialFailure_ReportsBoth TestBulkRenew_PartialFailure_ReportsBoth TestBulkReassign_PartialFailure_ReportsBoth Each asserts HTTP 200 + both success/failure counters round-trip + per-cert errors[] preserved with non-empty messages so operators can correlate each failure to its certificate ID. M-008 — Admin-gated handler enumeration pin (verified-already-clean) Recon: only one admin-gated handler — bulk_revocation.go — with full 3-branch test triplet already in place. health.go calls IsAdmin informationally to surface the flag to the GUI without gating. internal/api/handler/m008_admin_gate_test.go: Walks every handler .go file, asserts every middleware.IsAdmin call site is in AdminGatedHandlers (with required test triplet) or InformationalIsAdminCallers (justified). Adding a new admin gate without updating both the constant AND adding the test triplet fails CI. M-015 — Single-profile cardinality pin (verified-already-clean) Audit claim 'no cardinality validation' was wrong — enforced at struct level. domain.ManagedCertificate.{CertificateProfileID, RenewalPolicyID,IssuerID,OwnerID} and RenewalPolicy. CertificateProfileID are bare strings, not slices. internal/domain/m015_cardinality_test.go: reflect-based pin on kind=String. Schema change to N:N would have to update renewal.go's lookup loop in the same commit. M-016 (CWE-754) — Reap stale-agent jobs internal/repository/postgres/job.go::ListJobsWithOfflineAgents: JOIN jobs to agents on agent_id, filter (status=Running AND a.last_heartbeat_at < cutoff), exclude server-keygen jobs. internal/service/job.go::ReapJobsWithOfflineAgents: Flips matched jobs to Failed reason agent_offline so I-001 retry loop re-queues them on a healthy agent. Records audit event per reap. internal/scheduler/scheduler.go: Scheduler.runJobTimeout cycle now calls both reaper arms. agentOfflineJobTTL default 5min (5x agent-health-check default); SetAgentOfflineJobTTL knob for operator override. internal/service/job_offline_agent_reaper_test.go: 6 unit tests cover happy path, server-keygen-skip, non-Running-skip, non- positive-TTL fail-loud, repo-error propagation, audit-event recording. M-019 — Configurable ARI HTTP timeout Audit claim 'no fallback timeout' was wrong — ari.go:52 already had a 15s timeout. Bundle C makes it configurable. internal/connector/issuer/acme/acme.go: Config.ARIHTTPTimeoutSeconds field with env path CERTCTL_ACME_ARI_HTTP_TIMEOUT_SECONDS. internal/connector/issuer/acme/ari.go: Both HTTP clients (GetRenewalInfo + getARIEndpoint) now use the new ariHTTPTimeout() helper. Zero / negative / nil-config all fall back to the historic 15s default. ari_timeout_test.go: 4 dispatch arm tests. M-020 (CWE-770) — OCSP DoS hardening Pre-bundle the noAuthHandler chain had no rate limit. An attacker could DoS the OCSP responder, which for fail-open relying parties is a revocation bypass. cmd/server/main.go: noAuthHandler refactored from fixed middleware.Chain(...) to a conditional slice that appends middleware.NewRateLimiter when cfg.RateLimit.Enabled. Per-IP keying applies; OCSP/CRL/EST/SCEP are unauth. docs/security.md (NEW): Operator runbook documenting Must-Staple TLS Feature extension RFC 7633 as the architectural fix for fail-open relying parties. Profile-flip guidance + nginx/Apache/HAProxy/Envoy stapling snippets + explicit scope statement on what the rate limiter alone does NOT solve. Audit deliverables: cowork/comprehensive-audit-2026-04-25/audit-report.md: score 31/55 -> 38/55 closed (Medium 13/27 -> 20/27). cowork/comprehensive-audit-2026-04-25/findings.yaml: 7 status flips open -> closed with closure notes citing the Bundle C mechanism. certctl/CHANGELOG.md: Bundle C section under [unreleased]. Verification: go vet ./internal/service ./internal/scheduler ./internal/connector/issuer/acme ./internal/api/handler ./internal/domain ./cmd/server clean go test -count=1 -short on the same packages all green helm template + helm lint clean internal/repository/postgres setup-fail sandbox disk pressure (same on master HEAD before this branch)	2026-04-27 00:08:25 +00:00
shankar0123	675b87ba63	I-005: notification retry loop + dead-letter queue Critical alerts can no longer be silently dropped by a transient notifier failure. Failed notification attempts now ride an exponential backoff retry loop, with a 5-attempt budget before promotion to the dead-letter queue for operator intervention. Schema (migration 000016, idempotent): - retry_count INTEGER NOT NULL DEFAULT 0 - next_retry_at TIMESTAMPTZ - last_error TEXT - idx_notification_events_retry_sweep partial index (next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL Dead rows clear next_retry_at so the index stops matching them. Service contract: - NotificationService.RetryFailedNotifications drives 2^n-minute exponential backoff capped at 1h (notifRetryBackoffCap) with 5-attempt budget (notifRetryMaxAttempts). - Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to status='dead' via MarkAsDead. - Non-terminal failures record via RecordFailedAttempt. - Success path promotes to 'sent' without touching retry_count (audit preserves "delivered on attempt N"). - Missing-notifier branch defensively promotes to 'sent' to avoid wedging a row on a deleted channel. - RequeueNotification operator escape hatch atomically resets retry_count -> 0, next_retry_at -> NULL, last_error -> NULL, status -> pending via notifRepo.Requeue. Scheduler: - New always-on notificationRetryLoop wired into the base loop set at CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m). - sync/atomic.Bool idempotency guard. - sync.WaitGroup shutdown drain via WaitForCompletion. StatsService: - SetNotifRepo setter pattern preserves 9 pre-existing NewStatsService call sites (main.go + stats_test.go + 8 digest tests) without touching the constructor signature. - DashboardSummary.NotificationsDead populated via notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired (reports zero on systems without a notification repository). - CountByStatus error is non-fatal (dashboard summary is best-effort for this field). - Prometheus certctl_notification_dead_total counter emitted from the same snapshot. Handler: - New POST /api/v1/notifications/{id}/requeue endpoint. - dead status surfaces to MCP + CLI. Frontend: - NotificationsPage gains two-tab toolbar ("All" / "Dead letter") with queryKey: ['notifications', activeTab] so switching tabs doesn't serve stale data until the 30s refetch. - Dead rows surface "Retry {n}/5" + truncated last_error with full-text title tooltip. - Requeue mutation wrapped as mutationFn: (id: string) => requeueNotification(id) to prevent react-query v5's positional context argument from leaking into the API client — pinned against future refactors by strict-match toHaveBeenCalledWith('notif-dead-001') in NotificationsPage.test.tsx:181. Closes I-005.	2026-04-19 15:17:27 +00:00
shankar0123	1ee77c89f8	I-003: job timeout reaper closes AwaitingCSR/AwaitingApproval gap Add 11th always-on scheduler loop that transitions jobs stuck in AwaitingCSR (default 24h TTL) or AwaitingApproval (default 168h TTL) to Failed. I-001's retry loop then auto-promotes eligible Failed jobs back to Pending. No new status enum, no schema migration. - JobRepository.ListTimedOutAwaitingJobs with per-status cutoff WHERE - JobService.ReapTimedOutJobs mirrors RetryFailedJobs structure - Scheduler jobTimeoutLoop with atomic.Bool idempotency guard, 2m per-tick context, WaitGroup shutdown drain - Config: CERTCTL_JOB_TIMEOUT_INTERVAL (10m), CERTCTL_JOB_AWAITING_CSR_TIMEOUT (24h), CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT (168h) - Audit event per transition: actor=system, actorType=System, action=job_timeout, details={old_status, new_status, timeout_reason, age_hours} - 14 new tests: 3 config, 7 service, 4 scheduler	2026-04-19 01:37:18 +00:00
shankar0123	0200c7f4a4	Close I-001 (RetryFailedJobs never invoked) coverage-gap finding Operator decision answered as Option A: JobService.RetryFailedJobs is now wired into the scheduler as an always-on 10th loop. Prior to this commit the method was implemented, unit-tested, and exported but had zero runtime callers — any job that transitioned to status=Failed stayed Failed forever regardless of how many attempts it had remaining. Scheduler — 10th loop: internal/scheduler/scheduler.go grows a jobRetryLoop alongside the existing nine loops (renewal, jobs, health, notifications, short-lived, network scan, digest, health check, cloud discovery). The loop follows the established run-immediately-then-tick pattern (same shape as jobProcessorLoop), gated by a sync/atomic.Bool idempotency guard and joined into the scheduler's sync.WaitGroup so WaitForCompletion drains it on graceful shutdown. Each tick runs under a 2-minute context timeout mirroring jobProcessorLoop's opCtx budget. The runJobRetry helper invokes jobService.RetryFailedJobs(ctx, 3) — the advisory maxRetries cap is belt-and-suspenders; per-job eligibility is still enforced inside the service via Attempts < MaxAttempts. The JobServicer scheduler-interface gains RetryFailedJobs so the scheduler's dependency surface stays explicit and mockable. Service — audit trail per retry: internal/service/job.go:RetryFailedJobs now emits an audit event for every Failed→Pending transition. Following the house convention used by all scheduler-emitted events, actor='system' and actorType= domain.ActorTypeSystem; action='job_retry'; details capture old_status, new_status, attempts, max_attempts. JobService carries an optional *AuditService (SetAuditService) that nil-guards to preserve test-wiring ergonomics — existing tests that construct JobService without an audit service continue to pass unchanged. Config — env var with sane default: internal/config/config.go:SchedulerConfig grows RetryInterval, wired to CERTCTL_SCHEDULER_RETRY_INTERVAL with a 5-minute default. Validate rejects intervals below 1 second (matches other scheduler interval validators). Server wiring: cmd/server/main.go calls jobService.SetAuditService(auditService) after JobService construction and sched.SetJobRetryInterval( cfg.Scheduler.RetryInterval) alongside the other SetXxxInterval calls. Regression coverage: internal/service/job_test.go (3 new) - TestJobService_RetryFailedJobs_EligibleJobTransitionsAndAudits - TestJobService_RetryFailedJobs_SkipsJobsAtMaxAttempts - TestJobService_RetryFailedJobs_NoAuditServiceOK internal/scheduler/scheduler_test.go (3 new) - TestScheduler_JobRetryLoop_CallsService - TestScheduler_JobRetryLoop_IdempotencyGuard - TestScheduler_JobRetryLoop_WaitForCompletion The service tests assert status transitions, attempt-cap short- circuiting, and audit event shape (actor='system', action='job_retry', details keys). The scheduler tests assert the loop invokes the service, the atomic.Bool guard skips overlapping ticks with the expected 'still running, skipping tick' log, and WaitForCompletion drains the in-flight tick on Stop. Residual follow-up (not in scope for this commit): internal/service/renewal.go:RetryFailedJobs is a parallel dead-code duplicate of the same logic on RenewalService — untested and has no runtime caller. The audit finding called this out as 'implemented twice'. Removing it is a separate cleanup and does not block the Option-A wiring this commit delivers. Files: cmd/server/main.go — SetAuditService + SetJobRetryInterval internal/config/config.go — RetryInterval field + env + validate internal/scheduler/scheduler.go — 10th loop, interface, field, setter internal/scheduler/scheduler_test.go — 3 new scheduler-loop tests internal/service/job.go — RetryFailedJobs audit emission + SetAuditService internal/service/job_test.go — 3 new service-layer tests	2026-04-18 23:24:54 +00:00
shankar0123	7382e5f03b	test: comprehensive test gap closure across 24 packages Close coverage gaps identified by dual-audit (qualitative + quantitative). New test files for config (0%→98%), router (0%→100%), handler validation, health, audit, response helpers, webhook notifier (0%→88%), email notifier, middleware (recovery, rate limiter), domain profile, service nil-safety, config helpers, issuer bootstrap, and server bootstrap wiring. Expanded existing tests for ACME (34%→42%), step-ca (42%→52%), F5, SSH, agent (43%→63%), scheduler (88%→99%), renewal service, and issuerfactory. All tests pass: go test -short, go vet, go test -race clean. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-09 23:09:40 -04:00
shankar0123	03472072b8	test + docs: close 12 test gaps (~250 new tests) and expand testing guide to 34 parts Implements all P0-P2 test gaps from docs/test-gap-prompt.md: - Deployment service tests (20), target service tests (18), scheduler tests (8) - Agent binary tests (48), CSR renewal tests (8), short-lived cert tests (7) - Domain model tests (25), context cancellation tests (9), concurrency tests (7) - Handler negative-path tests (23 across 5 files) - Frontend error handling tests (86) and API client tests (7) Expands testing-guide.md from 28 to 34 parts covering certificate export, S/MIME/EKU, OCSP/DER CRL, body size limits, Apache/HAProxy connectors, and sub-CA mode. Fixes stale profile count (4->5) and updates sign-off table. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-28 17:57:25 -04:00
shankar0123	01607f8614	fix: scheduler race — track loop goroutines in WaitGroup Root cause: WaitForCompletion only waited for work goroutines (wg), but the 5-6 loop goroutines (renewalCheckLoop, jobProcessorLoop, etc.) were not tracked. After cancel() + WaitForCompletion(), loop goroutines could still be alive accessing scheduler/mock fields when the next test started, triggering the race detector. Fix: - Start() now adds loop goroutines to wg, so WaitForCompletion blocks until both work items AND loops have fully exited - Removed untracked 100ms timer goroutine for startedChan — now closed immediately after launching loops - Timeout test updated: uses blockCh (ignores context) instead of slowDelay (respects context) so it reliably triggers the timeout path Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 23:31:52 -04:00
shankar0123	fde5b39d53	fix: resolve test compilation and runtime failures across codebase - Add context.Context to handler test mocks (agent, agent_group) - Refactor scheduler to use local interfaces instead of concrete service types - Wire RevocationSvc/CAOperationsSvc sub-services in integration tests - Add context.Background() to service test calls (agent, agent_group) - Fix repo integration tests: add FK prerequisite records (team, owner, issuer, renewal_policy) before creating certificates - Set MaxOpenConns(1) on test DB to preserve SET search_path across queries - Fix Apache/HAProxy tests: replace "echo ok"/"echo reload" with "true" binary to avoid macOS exec.Command PATH resolution failure - Fix validation tests: correct error expectations for regex-first checks, replace null byte strings with strings.Repeat for length tests - Fix scheduler timeout test flakiness with t.Skip fallback - Remove unused imports (context in ca_operations_test, service in scheduler) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 22:53:46 -04:00
shankar0123	3e5cc86c5a	fix(reliability): TICKET-002 add scheduler idempotency guards and graceful shutdown ## Summary Fixes two critical scheduler reliability issues in certctl: ### TICKET-002 (CRITICAL): Scheduler job idempotency - Added atomic.Bool guards to all 6 scheduler loops (renewal, job processor, agent health, notifications, short-lived expiry, network scan) - Uses CompareAndSwap pattern to prevent duplicate execution if previous job is still running - Logs warning when a tick is skipped due to in-flight work - Prevents runaway scheduler duplicates and resource exhaustion ### TICKET-011 (MEDIUM): Graceful shutdown - Added sync.WaitGroup to track in-flight scheduler work - Each job is wrapped in wg.Add(1)/wg.Done() for lifecycle tracking - New WaitForCompletion(timeout) method waits for all in-flight work to complete - Integrates into main.go: after context cancellation, waits up to 30s for jobs to finish before closing DB - Graceful shutdown ensures no work is lost during server restart/termination ## Changes internal/scheduler/scheduler.go: - Imports: added "errors", "sync", "sync/atomic" - Scheduler struct: added 6 atomic.Bool fields (one per loop) + sync.WaitGroup - All 6 loop functions: spawn goroutines with wg.Add/Done, check atomic guard on each tick, skip tick if already running - New WaitForCompletion(timeout) method with timeout support - New ErrSchedulerShutdownTimeout error type cmd/server/main.go: - After context cancellation and before HTTP shutdown, call sched.WaitForCompletion(30 * time.Second) - Logs "waiting for scheduler to complete in-flight work" and any errors internal/scheduler/scheduler_test.go (new file): - Mock services for testing (renewal, job, agent, notification, network scan) - TestSchedulerIdempotencyGuard: verifies slow job doesn't cause duplicate execution - TestWaitForCompletionSuccess: verifies graceful shutdown with adequate timeout - TestWaitForCompletionTimeout: verifies timeout is respected - TestSchedulerMultipleLoopsIdempotency: verifies all 6 loops respect idempotency - TestSchedulerGracefulShutdown: end-to-end graceful shutdown flow Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 21:34:07 -04:00

10 Commits