certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 15:32:02 +00:00

Author	SHA1	Message	Date
shankar0123	52248be717	v2.0.47: HTTPS Everywhere — TLS-only control plane, agents/CLI/MCP Breaking change release. Plaintext HTTP listener removed. The certctl control plane now terminates TLS 1.3 on :8443 via http.Server.ListenAndServeTLS. No CERTCTL_TLS_ENABLED=false escape hatch. No dual-listener mode. One-step cutover per docs/upgrade-to-tls.md. Server - cmd/server/tls.go: certHolder with SIGHUP hot-reload + atomic cert swap, buildServerTLSConfig (TLS 1.3 min, GetCertificate callback), preflightServerTLS validation - cmd/server/main.go: ListenAndServeTLS in place of ListenAndServe, watchSIGHUP wiring, cert/key path config threading - tls_test.go: 418-line regression coverage of reload, preflight, callback behavior, SAN validation Config - CERTCTL_TLS_CERT_PATH / CERTCTL_TLS_KEY_PATH (required) - Plaintext rejection: agents/CLI/MCP pre-flight-fail on http:// URLs with a pointer to docs/upgrade-to-tls.md Agents, CLI, MCP - All three pre-flight-reject http:// URLs with fail-loud diagnostic - CERTCTL_SERVER_CA_BUNDLE_PATH for private-CA trust - CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY for dev-only bypass (loud warning on startup) - install-agent.sh emits both vars as commented template lines docker-compose - certctl-tls-init sidecar generates SAN-valid self-signed cert into deploy/test/certs/ on first boot - All demo-stack curls pin against ca.crt with --cacert Helm chart - Three TLS provisioning modes, exactly one required: - server.tls.existingSecret (operator-supplied) - server.tls.certManager.enabled (cert-manager integration) - server.tls.selfSigned.enabled (eval only — not for production) - server-certificate.yaml template for cert-manager mode - helm install without a TLS source fails at template render with a pointer to docs/tls.md CI - .github/workflows/ci.yml Helm Chart Validation step renders the chart in both existingSecret and cert-manager modes, plus an inverse guard-regression test that asserts helm template MUST refuse to render when no TLS source is 
configured. Previously the single `helm template` invocation hit the certctl.tls.required fail-loud guard and exit-1'd CI. Four invocations now: lint (existingSecret), template (existingSecret), template (cert-manager), template (no args — must fail). Integration tests - deploy/test/integration_test.go stands up the Compose stack over HTTPS, extracts the CA bundle, and exercises every certctl API over https://localhost:8443 - All 34 integration subtests green (per Phase 8 local CI-parity) Documentation - New: docs/tls.md (provisioning patterns, rotation, SIGHUP reload) - New: docs/upgrade-to-tls.md (one-step cutover, no-downgrade warnings, fleet-roll sequencing) - CHANGELOG.md: v2.2.0 "HTTPS Everywhere — The Irony" entry (file heading unchanged; release tag is v2.0.47) - All curls in docs/, examples/, deploy/helm/ guides use https://localhost:8443 --cacert Verification - grep -rn "ListenAndServe[^T]" cmd/ internal/ → 0 hits - grep -rn "\"http://" cmd/ internal/ → 2 benign hits (Caddy admin API default, SSRF doc comment) — zero certctl endpoints - Tasks #197–#206 (Phases 0–8) all closed in the tracker Files: 65 changed, 3489 insertions, 372 deletions (pre-CI-fix).	2026-04-20 03:43:10 +00:00
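A minimal Go sketch of the hot-reload pattern the v2.0.47 entry describes: a holder that swaps the certificate atomically, a TLS 1.3-only config that reads it through the GetCertificate callback, and a pre-flight check that rejects http:// URLs. The names certHolder and buildServerTLSConfig come from the commit message; the method bodies, the requireHTTPS helper, and all error strings are assumptions for illustration, not the project's code.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"net/url"
	"sync/atomic"
)

// certHolder keeps the active certificate behind an atomic pointer so a
// reload never blocks handshakes that are already in flight.
type certHolder struct {
	current atomic.Pointer[tls.Certificate]
}

// Reload parses the key pair from disk and swaps it in only on success,
// so a bad file on disk leaves the previous certificate serving.
func (h *certHolder) Reload(certPath, keyPath string) error {
	cert, err := tls.LoadX509KeyPair(certPath, keyPath)
	if err != nil {
		return fmt.Errorf("reload TLS key pair: %w", err)
	}
	h.current.Store(&cert)
	return nil
}

// GetCertificate matches the signature of tls.Config.GetCertificate.
func (h *certHolder) GetCertificate(_ *tls.ClientHelloInfo) (*tls.Certificate, error) {
	cert := h.current.Load()
	if cert == nil {
		return nil, errors.New("no TLS certificate loaded")
	}
	return cert, nil
}

// buildServerTLSConfig pins TLS 1.3 as the minimum and defers certificate
// selection to the holder on every handshake.
func buildServerTLSConfig(h *certHolder) *tls.Config {
	return &tls.Config{
		MinVersion:     tls.VersionTLS13,
		GetCertificate: h.GetCertificate,
	}
}

// requireHTTPS is a hypothetical client-side pre-flight: it fails loudly
// on any scheme other than https.
func requireHTTPS(rawURL string) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return fmt.Errorf("parse server URL: %w", err)
	}
	if u.Scheme != "https" {
		return fmt.Errorf("server URL %q must use https://", rawURL)
	}
	return nil
}

func main() {
	h := &certHolder{}
	cfg := buildServerTLSConfig(h)
	fmt.Println("TLS 1.3 minimum:", cfg.MinVersion == tls.VersionTLS13)
	fmt.Println("plaintext check:", requireHTTPS("http://localhost:8443"))
}
```

In a server, a SIGHUP handler would call Reload with the configured cert and key paths, and http.Server.ListenAndServeTLS can be called with empty file arguments because the config supplies GetCertificate.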
shankar0123	6e646e0fe8	M-001/M-006: strip HTTP auth from EST/SCEP + fail-loud SCEP preflight Closes CWE-306 (missing authentication for critical function) for SCEP via a fail-loud startup gate, and aligns EST/SCEP HTTP dispatch with their respective RFCs. CRL/OCSP remain unauthenticated under .well-known/pki/* per RFC 5280 §5 / RFC 6960 / RFC 8615. Option (D): no mTLS in this milestone. - RFC 7030 §3.2.3 (EST auth is deployment-specific) and §4.1.1 (/cacerts explicitly anonymous): EST paths served unauthenticated; CSR-signature + profile policy enforce identity inside ESTService. - RFC 8894 §3.2: SCEP authenticates via the challengePassword PKCS#10 attribute (OID 1.2.840.113549.1.9.7), not an HTTP credential. HTTP dispatch is unauthenticated; preflightSCEPChallengePassword refuses to start when CERTCTL_SCEP_ENABLED=true without CERTCTL_SCEP_CHALLENGE_PASSWORD. SCEPService.PKCSReq enforces the same invariant defense-in-depth and compares with crypto/subtle.ConstantTimeCompare. cmd/server/main.go: - Extract buildFinalHandler(apiHandler, noAuthHandler, webDir, dashboardEnabled); route /.well-known/est/, /scep, /scep/, /.well-known/pki/crl/{id}, /.well-known/pki/ocsp/{id}/{serial}, and health probes through noAuthHandler (RequestID + structuredLogger + Recovery only). - Add preflightSCEPChallengePassword fail-loud gate; startup log emits challenge_password_set boolean for operator visibility. cmd/server/finalhandler_test.go (new, 314 lines, 27 subtests): - TestBuildFinalHandler_Dispatch (20) + TestBuildFinalHandler_NoDashboard (7) pin the dispatch surface: EST 4-endpoint, SCEP exact + trailing-slash + query-string, PKI CRL+OCSP, health, /api/v1/* authenticated, /assets/* file server, SPA fallback. internal/api/router/router.go, internal/config/config.go: - Router-level comments explain why EST/SCEP/PKI dispatchers sit outside the authenticated mux; SCEP challenge password config plumbed through. 
docs/architecture.md: - New EST Authentication subsection (RFC 7030 §3.2.3 + §4.1.1, buildFinalHandler + noAuthHandler references). - Rewrite SCEP Authentication subsection; replaces pre-existing factually-incorrect "any value accepted" claim with CWE-306 preflight, service-layer defense-in-depth, and crypto/subtle.ConstantTimeCompare. - Top-level Authentication section: qualify /api/v1/* scope on API clients bullet; add standards-based-endpoints bullet referencing the 27-subtest regression harness. docs/compliance-soc2.md: - CC6.1: scope API Key Authentication to /api/v1/; add standards-based endpoints bullet citing RFCs and CWE-306 closure. - CC6.3: scope API Key Policy to /api/v1/ with cross-reference to CC6.1. - Evidence Locations augmented with buildFinalHandler, preflightSCEPChallengePassword, scep.go defense path, regression harness, and OpenAPI security:[] overrides. api/openapi.yaml: verified already correct (global bearerAuth default overridden with security:[] on /cacerts, /simpleenroll, /simplereenroll, /csrattrs, /scep GET+POST, /crl/{issuer_id}, /ocsp/{issuer_id}/{serial}); no edits needed.	2026-04-19 17:20:05 +00:00
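The SCEP gate in the commit above can be sketched in Go as a startup check plus a constant-time comparison. preflightSCEPChallengePassword is named in the commit message, but its signature here is an assumption, and challengeMatches and the error text are hypothetical helpers written only for this example.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
)

// errSCEPChallengeMissing is an illustrative error; the real message may differ.
var errSCEPChallengeMissing = errors.New(
	"CERTCTL_SCEP_ENABLED=true requires CERTCTL_SCEP_CHALLENGE_PASSWORD")

// preflightSCEPChallengePassword refuses to start when SCEP is enabled
// without a configured challenge password.
func preflightSCEPChallengePassword(enabled bool, password string) error {
	if enabled && password == "" {
		return errSCEPChallengeMissing
	}
	return nil
}

// challengeMatches repeats the invariant at the service layer: an empty
// configured password never matches, and the comparison itself does not
// short-circuit on the first differing byte.
func challengeMatches(configured, presented string) bool {
	if configured == "" {
		return false
	}
	return subtle.ConstantTimeCompare([]byte(configured), []byte(presented)) == 1
}

func main() {
	fmt.Println(preflightSCEPChallengePassword(true, ""))
	fmt.Println(challengeMatches("s3cret", "s3cret"), challengeMatches("s3cret", "guess"))
}
```

Note that subtle.ConstantTimeCompare returns immediately when the two slices differ in length, so only the content comparison is constant-time.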
shankar0123	675b87ba63	I-005: notification retry loop + dead-letter queue Critical alerts can no longer be silently dropped by a transient notifier failure. Failed notification attempts now ride an exponential backoff retry loop, with a 5-attempt budget before promotion to the dead-letter queue for operator intervention. Schema (migration 000016, idempotent): - retry_count INTEGER NOT NULL DEFAULT 0 - next_retry_at TIMESTAMPTZ - last_error TEXT - idx_notification_events_retry_sweep partial index (next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL Dead rows clear next_retry_at so the index stops matching them. Service contract: - NotificationService.RetryFailedNotifications drives 2^n-minute exponential backoff capped at 1h (notifRetryBackoffCap) with 5-attempt budget (notifRetryMaxAttempts). - Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to status='dead' via MarkAsDead. - Non-terminal failures record via RecordFailedAttempt. - Success path promotes to 'sent' without touching retry_count (audit preserves "delivered on attempt N"). - Missing-notifier branch defensively promotes to 'sent' to avoid wedging a row on a deleted channel. - RequeueNotification operator escape hatch atomically resets retry_count -> 0, next_retry_at -> NULL, last_error -> NULL, status -> pending via notifRepo.Requeue. Scheduler: - New always-on notificationRetryLoop wired into the base loop set at CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m). - sync/atomic.Bool idempotency guard. - sync.WaitGroup shutdown drain via WaitForCompletion. StatsService: - SetNotifRepo setter pattern preserves 9 pre-existing NewStatsService call sites (main.go + stats_test.go + 8 digest tests) without touching the constructor signature. - DashboardSummary.NotificationsDead populated via notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired (reports zero on systems without a notification repository). 
- CountByStatus error is non-fatal (dashboard summary is best-effort for this field). - Prometheus certctl_notification_dead_total counter emitted from the same snapshot. Handler: - New POST /api/v1/notifications/{id}/requeue endpoint. - dead status surfaces to MCP + CLI. Frontend: - NotificationsPage gains two-tab toolbar ("All" / "Dead letter") with queryKey: ['notifications', activeTab] so switching tabs doesn't serve stale data until the 30s refetch. - Dead rows surface "Retry {n}/5" + truncated last_error with full-text title tooltip. - Requeue mutation wrapped as mutationFn: (id: string) => requeueNotification(id) to prevent react-query v5's positional context argument from leaking into the API client — pinned against future refactors by strict-match toHaveBeenCalledWith('notif-dead-001') in NotificationsPage.test.tsx:181. Closes I-005.	2026-04-19 15:17:27 +00:00
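The retry arithmetic in the I-005 entry can be made concrete with a short Go sketch. The constant names notifRetryMaxAttempts and notifRetryBackoffCap come from the commit message; nextBackoff and isExhausted are hypothetical helpers, and the choice of n as the retry count recorded before the failing attempt is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

const (
	notifRetryMaxAttempts = 5         // attempt budget before dead-lettering
	notifRetryBackoffCap  = time.Hour // upper bound on any single delay
)

// nextBackoff returns 2^retryCount minutes, capped at one hour.
func nextBackoff(retryCount int) time.Duration {
	if retryCount < 0 {
		retryCount = 0
	}
	if retryCount >= 6 { // 2^6 = 64 minutes already exceeds the cap
		return notifRetryBackoffCap
	}
	return time.Duration(1<<retryCount) * time.Minute
}

// isExhausted mirrors the stated rule RetryCount >= notifRetryMaxAttempts-1:
// a failure at this count promotes the row to status 'dead'.
func isExhausted(retryCount int) bool {
	return retryCount >= notifRetryMaxAttempts-1
}

func main() {
	for n := 0; n < notifRetryMaxAttempts; n++ {
		fmt.Printf("retry_count=%d next_delay=%s dead_on_failure=%t\n",
			n, nextBackoff(n), isExhausted(n))
	}
}
```

Under these assumptions the delays within the five-attempt budget are 1, 2, 4, 8, and 16 minutes, so the one-hour cap only matters if the budget is raised.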
shankar0123	0725713e19	Close I-004 (agent hard-delete cascades targets) coverage-gap finding Operator decision answered as full soft-delete with optional forced cascade — hard-delete is not reachable from any public surface. Prior to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents` whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id silently wiped every target, orphaning certs and aborting in-flight jobs. The finding closure reshapes the agent-removal contract around soft retirement with explicit preflight counts, an opt-in cascade gated by a mandatory reason, and unconditional protection for the four reserved sentinel agents used by discovery sources. Schema — migration 000015: migrations/000015_agent_retire.up.sql flips deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE RESTRICT, so a stray `DELETE FROM agents` now errors at the DB boundary instead of quietly destroying targets. Both `agents` and `deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason TEXT pair (TEXT not VARCHAR so operator comments are never truncated), indexed via partial indexes WHERE retired_at IS NOT NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT EXISTS) so repeated runs against partially-migrated databases converge. migrations/000015_agent_retire.down.sql restores CASCADE and drops the new columns for clean rollback. A dedicated repository-layer testcontainers test (internal/repository/postgres/migration_000015_test.go) asserts the before/after FK action, column presence, index presence, and round-trip idempotency under up→down→up. 
Domain — sentinel guard + dependency counts: internal/domain/connector.go gains IsRetired() on Agent, the exported SentinelAgentIDs slice listing server-scanner, cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the four reserved IDs documented in CLAUDE.md and created at startup in cmd/server/main.go), IsSentinelAgent(id string) predicate, AgentDependencyCounts{ActiveTargets, ActiveCertificates, PendingJobs} with a HasDependencies() method, and ActorTypeAgent / ActorTypeSystem enum values used by audit emission downstream. Coverage locked down by internal/domain/connector_test.go. Service — 8-step ordered contract: internal/service/agent_retire.go:RetireAgent(ctx, id, actor, opts{Force, Reason}) enforces a fixed execution order: (1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel unconditionally; force=true does NOT bypass it. (2) fetch — ErrAgentNotFound on miss. (3) idempotency — if IsRetired() already, return AgentRetirementResult{AlreadyRetired: true} with no new audit event and no state change (safe to replay from flaky clients). (4) preflight counts — collectAgentDependencyCounts runs ActiveTargets, ActiveCertificates, PendingJobs sequentially (not in parallel; keeps the per-query timeout predictable and matches the repo's existing call-chain shape). (5) force-reason guard — opts.Force=true with empty Reason returns ErrForceReasonRequired (wired into the 400 status surface). (6) dependency guard — HasDependencies() with opts.Force=false returns BlockedByDependenciesError{Counts} (wired into the 409 body with per-bucket counts). (7) mutation — single pinned retiredAt := time.Now(); agent retirement first, then cascade target retirement if opts.Force, all under the repo's single transaction so the two retired_at stamps match to the second. (8) best-effort audit — agent_retired always; agent_retirement_ cascaded additionally on the force path. 
Actor is whatever the handler resolves from the request; actor type is mapped by resolveActorType (system/agent-prefix→Agent/else→User). Audit emission failures are logged via slog.Error but do not abort the retirement (matches the house convention used by every other scheduler-emitted event). BlockedByDependenciesError implements Error() as "active_targets=%d, active_certificates=%d, pending_jobs=%d" and Unwrap() → ErrBlockedByDependencies. The single struct satisfies errors.Is via Unwrap (used by scheduler-level tests) and errors.As via the concrete type (used by the handler to fish out Counts for the 409 body). ListRetiredAgents(page, perPage) adds a separate paginated accessor with page<1→1 and perPage<1→50 normalization so retired rows are queryable without polluting the default agent listing. Sentinel guard coverage is asymmetric by design: all four reserved IDs are protected, and force=true cannot override. Regression tests in internal/service/agent_retire_test.go assert each of the eight steps in order, plus sentinel bypass attempts and idempotency replay. Handler + router — status-code surface: internal/api/handler/agents.go:RetireAgent exposes seven status codes on DELETE /agents/{id}: 200 on a fresh retirement (body echoes AgentRetirementResult). 204 on idempotent replay (AlreadyRetired=true; no new audit). 400 on ErrForceReasonRequired. 403 on ErrAgentIsSentinel. 404 on ErrAgentNotFound. 409 on BlockedByDependenciesError, with a custom body shape {error, counts{active_targets, active_certificates, pending_jobs}} that bypasses the default ErrorWithRequestID envelope so callers get the per-bucket numbers directly. 500 on any other error. Heartbeat HandleHeartbeat returns 410 Gone when the agent is retired (ErrAgentRetired), signalling the agent to shut down. Query params `force=true` and `reason=<text>` drive the cascade path; both are forwarded as url.Values through the new MCP transport. 
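The guard ordering and status-code surface described in this commit message can be sketched in Go as follows. The error and type names are taken from the message; the error strings, the checkRetireGuards and statusFor helpers, and the omission of the fetch, idempotency, and counting steps are simplifications assumed for this example.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

var (
	ErrAgentIsSentinel       = errors.New("agent is a reserved sentinel")
	ErrAgentNotFound         = errors.New("agent not found")
	ErrForceReasonRequired   = errors.New("force requires a reason")
	ErrBlockedByDependencies = errors.New("blocked by dependencies")
)

type AgentDependencyCounts struct {
	ActiveTargets, ActiveCertificates, PendingJobs int
}

func (c AgentDependencyCounts) HasDependencies() bool {
	return c.ActiveTargets > 0 || c.ActiveCertificates > 0 || c.PendingJobs > 0
}

// BlockedByDependenciesError carries the counts for the 409 body and
// unwraps to the sentinel so errors.Is also works.
type BlockedByDependenciesError struct{ Counts AgentDependencyCounts }

func (e *BlockedByDependenciesError) Error() string {
	return fmt.Sprintf("active_targets=%d, active_certificates=%d, pending_jobs=%d",
		e.Counts.ActiveTargets, e.Counts.ActiveCertificates, e.Counts.PendingJobs)
}

func (e *BlockedByDependenciesError) Unwrap() error { return ErrBlockedByDependencies }

var sentinelAgentIDs = []string{"server-scanner", "cloud-aws-sm", "cloud-azure-kv", "cloud-gcp-sm"}

func isSentinelAgent(id string) bool {
	for _, s := range sentinelAgentIDs {
		if s == id {
			return true
		}
	}
	return false
}

// checkRetireGuards applies guards (1), (5), and (6) in their stated order.
func checkRetireGuards(id string, counts AgentDependencyCounts, force bool, reason string) error {
	if isSentinelAgent(id) {
		return ErrAgentIsSentinel // force never bypasses this
	}
	if force && reason == "" {
		return ErrForceReasonRequired
	}
	if counts.HasDependencies() && !force {
		return &BlockedByDependenciesError{Counts: counts}
	}
	return nil
}

// statusFor maps service errors to the HTTP codes listed for DELETE /agents/{id}.
func statusFor(err error) int {
	var blocked *BlockedByDependenciesError
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrForceReasonRequired):
		return http.StatusBadRequest
	case errors.Is(err, ErrAgentIsSentinel):
		return http.StatusForbidden
	case errors.Is(err, ErrAgentNotFound):
		return http.StatusNotFound
	case errors.As(err, &blocked):
		return http.StatusConflict
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	err := checkRetireGuards("agent-1", AgentDependencyCounts{ActiveTargets: 2}, false, "")
	fmt.Println(statusFor(err), err)
}
```

Returning one struct that supports both errors.Is and errors.As lets callers that only need the category and callers that need the counts share a single error value.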
internal/api/router/router.go registers GET /api/v1/agents/retired literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's literal-beats-pattern-var precedence routes "retired" to the paginated retired-agents listing instead of fetching a hypothetical agent named "retired". Agent binary — clean shutdown on 410: cmd/agent/main.go gains the ErrAgentRetired sentinel, a retiredOnce sync.Once, and a retiredSignal chan struct{}. A markRetired(source, statusCode, body) helper closes the channel exactly once; the Run() select loop observes the close and returns ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired) and exits cleanly instead of spinning in the heartbeat retry loop. The 410 Gone surface is therefore terminal for the agent process. MCP transport: internal/mcp/client.go adds Client.DeleteWithQuery(path, query), a new additive transport method. Client.Delete is path-only; without this method the retire tool would silently drop `force` and `reason`, turning every cascade retire into a default soft-retire. The new method shares do()'s 204 normalization and 4xx/5xx error propagation so tool authors get one contract. internal/mcp/tools.go + internal/mcp/types.go expose the retire_agent tool with Force+Reason inputs wired through DeleteWithQuery. CLI: cmd/cli/main.go + internal/cli/client.go add two CLI surfaces: `agents list --retired` (client-side strip of --retired then delegation to ListRetiredAgents, sharing --page/--per-page parsing with the default listing) and `agents retire <id> [--force --reason "…"]` (mirrors ErrForceReasonRequired — force without reason is rejected client-side before the request is sent). JSON + table output modes both honor the new columns. Frontend: web/src/pages/AgentsPage.tsx surfaces retired/retire affordances. web/src/api/client.ts + web/src/api/types.ts expose the retire endpoint and the retired-listing. 4 new Vitest regression cases. 
OpenAPI: api/openapi.yaml documents DELETE /agents/{id} with all seven status codes, 410 on heartbeat, and the 409 per-bucket body shape. Regression coverage (six new test files, all green): internal/service/agent_retire_test.go — 8-step contract + sentinel guards internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat internal/mcp/retire_agent_test.go — DeleteWithQuery wire-through internal/cli/agent_retire_test.go — --retired listing + --force/--reason pairing internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down internal/domain/connector_test.go — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies Files: api/openapi.yaml — DELETE + 410 + 409 body shape cmd/agent/main.go — ErrAgentRetired, markRetired, retiredSignal cmd/cli/main.go — handleAgents list/get/retire dispatch docs/architecture.md, docs/concepts.md, docs/testing-guide.md — retirement contract narrative internal/api/handler/agents.go — RetireAgent, status surface, 410 on heartbeat internal/api/handler/agent_handler_test.go — extended coverage internal/api/handler/agent_retire_handler_test.go — new internal/api/router/router.go — /agents/retired before /agents/{id} internal/cli/agent_retire_test.go — new internal/cli/client.go — ListRetiredAgents + RetireAgent internal/domain/connector.go — IsRetired, SentinelAgentIDs, IsSentinelAgent, AgentDependencyCounts, ActorTypeAgent/System internal/domain/connector_test.go — new internal/integration/lifecycle_test.go — retirement fixture internal/mcp/client.go — DeleteWithQuery additive transport internal/mcp/retire_agent_test.go — new internal/mcp/tools.go, internal/mcp/types.go — retire_agent tool + Force/Reason inputs internal/repository/interfaces.go — AgentRepository retirement methods internal/repository/postgres/agent.go — retire + cascade target retire + counts internal/repository/postgres/migration_000015_test.go — new internal/service/agent.go — wire into AgentService 
surface internal/service/agent_retire.go — new 8-step contract internal/service/agent_retire_test.go — new internal/service/deployment.go — skip retired agents internal/service/target.go — skip retired agents internal/service/testutil_test.go — shared mocks extended migrations/000015_agent_retire.up.sql — new migrations/000015_agent_retire.down.sql — new web/src/api/client.ts, types.ts + tests — retire endpoint wiring web/src/pages/AgentsPage.tsx — retire UI	2026-04-19 05:24:00 +00:00
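The agent-side shutdown on 410 Gone described in the I-004 entry can be sketched in Go with a sync.Once guarding a channel close. ErrAgentRetired, retiredOnce, retiredSignal, and markRetired are named in the commit message; the agent struct, the simplified markRetired signature, and the channel of status codes standing in for the heartbeat loop are assumptions made for this example.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync"
)

var ErrAgentRetired = errors.New("agent retired")

type agent struct {
	retiredOnce   sync.Once
	retiredSignal chan struct{}
}

func newAgent() *agent {
	return &agent{retiredSignal: make(chan struct{})}
}

// markRetired closes the signal channel exactly once, so repeated 410
// responses cannot trigger a double-close panic.
func (a *agent) markRetired() {
	a.retiredOnce.Do(func() { close(a.retiredSignal) })
}

// handleHeartbeat treats 410 Gone as terminal; other codes are ignored here.
func (a *agent) handleHeartbeat(statusCode int) {
	if statusCode == http.StatusGone {
		a.markRetired()
	}
}

// Run consumes heartbeat status codes until the agent is retired or the
// input channel is closed. The retired signal is checked first on every
// iteration so retirement wins over any queued heartbeats.
func (a *agent) Run(heartbeats <-chan int) error {
	for {
		select {
		case <-a.retiredSignal:
			return ErrAgentRetired
		default:
		}
		select {
		case <-a.retiredSignal:
			return ErrAgentRetired
		case code, ok := <-heartbeats:
			if !ok {
				return nil
			}
			a.handleHeartbeat(code)
		}
	}
}

func main() {
	a := newAgent()
	statuses := make(chan int, 2)
	statuses <- http.StatusOK
	statuses <- http.StatusGone
	err := a.Run(statuses)
	fmt.Println("retired:", errors.Is(err, ErrAgentRetired))
}
```

A caller would match the result with errors.Is(err, ErrAgentRetired) and exit cleanly rather than re-entering a retry loop.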
shankar0123	1ee77c89f8	I-003: job timeout reaper closes AwaitingCSR/AwaitingApproval gap Add 11th always-on scheduler loop that transitions jobs stuck in AwaitingCSR (default 24h TTL) or AwaitingApproval (default 168h TTL) to Failed. I-001's retry loop then auto-promotes eligible Failed jobs back to Pending. No new status enum, no schema migration. - JobRepository.ListTimedOutAwaitingJobs with per-status cutoff WHERE - JobService.ReapTimedOutJobs mirrors RetryFailedJobs structure - Scheduler jobTimeoutLoop with atomic.Bool idempotency guard, 2m per-tick context, WaitGroup shutdown drain - Config: CERTCTL_JOB_TIMEOUT_INTERVAL (10m), CERTCTL_JOB_AWAITING_CSR_TIMEOUT (24h), CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT (168h) - Audit event per transition: actor=system, actorType=System, action=job_timeout, details={old_status, new_status, timeout_reason, age_hours} - 14 new tests: 3 config, 7 service, 4 scheduler	2026-04-19 01:37:18 +00:00
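The per-status cutoff used by the timeout reaper in the I-003 entry can be sketched in Go as below. The 24h and 168h defaults come from the commit message; the JobStatus string values, the constant names, and the isTimedOut helper are assumptions for illustration (the commit describes the cutoff as a SQL WHERE clause, which this mirrors in memory).

```go
package main

import (
	"fmt"
	"time"
)

type JobStatus string

const (
	StatusAwaitingCSR      JobStatus = "AwaitingCSR"
	StatusAwaitingApproval JobStatus = "AwaitingApproval"
	StatusRunning          JobStatus = "Running"
)

const (
	awaitingCSRTimeout      = 24 * time.Hour  // default for CERTCTL_JOB_AWAITING_CSR_TIMEOUT
	awaitingApprovalTimeout = 168 * time.Hour // default for CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT
)

// isTimedOut reports whether a job that has sat in the given status for
// the given age should be moved to Failed. Only the two awaiting states
// are ever reaped.
func isTimedOut(status JobStatus, age time.Duration) bool {
	switch status {
	case StatusAwaitingCSR:
		return age > awaitingCSRTimeout
	case StatusAwaitingApproval:
		return age > awaitingApprovalTimeout
	default:
		return false
	}
}

func main() {
	fmt.Println(isTimedOut(StatusAwaitingCSR, 25*time.Hour))      // past the CSR cutoff
	fmt.Println(isTimedOut(StatusAwaitingApproval, 25*time.Hour)) // still within 168h
	fmt.Println(isTimedOut(StatusRunning, 1000*time.Hour))        // never reaped
}
```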
shankar0123	4bc8b3e723	fix(config): add RetryInterval to TestValidate_ValidConfig + TestValidate_AuthTypeNone fixtures (I-001 follow-up) Problem: TestValidate_ValidConfig and TestValidate_AuthTypeNone construct a SchedulerConfig without RetryInterval, so Validate() fails the 'retry interval must be at least 1 second' check at config.go:1086 with 'retry interval must be at least 1 second'. Both tests expect success, so they fail whenever run. Root cause (re-derived from source, not inherited from memory): git log -S 'retry interval must be at least' --source --all shows the validation was introduced in `0200c7f` (I-001, RetryFailedJobs scheduler wiring). git log -- internal/config/config_test.go shows the test file was last touched in `7382e5f`, which predates `0200c7f`. I-001 added a new Validate() rule without updating the two positive test fixtures — a gap in I-001's verification pass. This is NOT C-001 fallout. The config_test.go file was untouched by the C-001 closure commits `91642e2` and `4696116`. The failure surfaced during the full test suite run after C-001 landed because no one had run 'go test ./internal/config/...' since I-001. Scope: - internal/config/config_test.go (2 fixtures: TestValidate_ValidConfig, TestValidate_AuthTypeNone). Implementation: Added 'RetryInterval: 5 * time.Minute' to both SchedulerConfig literals. 5 minutes matches the I-001 default at config.go:818: RetryInterval: getEnvDuration("CERTCTL_SCHEDULER_RETRY_INTERVAL", 5*time.Minute) The other two TestValidate_* tests (InvalidAuthType, APIKeyAuth_MissingSecret) are unaffected because they expect Validate() to error at the auth-type check (line 1052) or auth-secret check (line 1057), both of which fire before the RetryInterval check at line 1086. Verification: - go test -count=1 -run 'TestValidate_' ./internal/config/...: PASS - go test -short -count=1 ./...: all packages PASS - go vet ./...: exit 0 Residual: None. This is a pure test-fixture fix — production code is unchanged.
Commit: `0200c7f` (I-001) should have included this edit. Attributed here for traceability.	2026-04-19 00:33:22 +00:00
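The config path this fix touches can be sketched in Go as an environment lookup with a default, followed by the one-second validation. getEnvDuration is named in the commit message, but its body here is an assumption (including the choice to fall back to the default on a parse error), and validateRetryInterval is a hypothetical stand-in for the relevant branch of Validate().

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

const defaultRetryInterval = 5 * time.Minute

// getEnvDuration returns the parsed duration from the environment, or the
// default when the variable is unset, empty, or not a valid duration.
func getEnvDuration(key string, def time.Duration) time.Duration {
	raw, ok := os.LookupEnv(key)
	if !ok || raw == "" {
		return def
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return def
	}
	return d
}

// validateRetryInterval reproduces the rule that tripped the two fixtures:
// a zero-value RetryInterval is below one second and is rejected.
func validateRetryInterval(d time.Duration) error {
	if d < time.Second {
		return errors.New("retry interval must be at least 1 second")
	}
	return nil
}

func main() {
	interval := getEnvDuration("CERTCTL_SCHEDULER_RETRY_INTERVAL", defaultRetryInterval)
	fmt.Println(interval, validateRetryInterval(interval))

	var unset time.Duration // what a struct literal without RetryInterval carries
	fmt.Println(validateRetryInterval(unset))
}
```

The second call shows why a SchedulerConfig literal that omits the field fails validation: the zero value of time.Duration is 0s.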
shankar0123	469611650c	fix(cli): add missing os + path/filepath imports to client_test.go Follow-up to `91642e2`. TestClient_ImportCertificates_SixFieldPayload uses filepath.Join(t.TempDir(), ...) and os.WriteFile to stage a test PEM, but the import block only listed encoding/json, encoding/pem, net/http, etc. — neither os nor path/filepath was imported. go vet rejected the package with 'undefined: filepath' (and would have caught 'undefined: os' next). Add both imports. No behavioral change — the referenced symbols are the standard library's usual names for their respective packages, so the test compiles and runs exactly as intended. CI should now pass go build + go vet on the cli package.	2026-04-19 00:27:11 +00:00
shankar0123	91642e2860	C-001 scope expansion: tighten parallel POST /api/v1/certificates call sites to six-field contract Problem: `a53a4b8` closed C-001 at the handler boundary by tightening the ValidateRequired contract on POST /api/v1/certificates to require six fields: name, common_name, renewal_policy_id, issuer_id, owner_id, team_id. (Correction re-derived from source: the handler ValidateRequired calls on owner_id/team_id/renewal_policy_id were actually installed in `3287e17` under M-002/M-003/M-006 auth unification — a53a4b8's commit message overstates scope.) Post-audit on 2026-04-18 found three parallel call sites still shipping three-to-four-field payloads that the newly strict handler would reject with HTTP 400: - GUI: OnboardingWizard CertificateStep (common_name + sans + issuer_id + environment only) - CLI: certctl-cli import (common_name + issuer_id + status only; no required-flag gating) - Tests: deploy/test/qa_test.go Part03 positive paths Scope: Bring every POST /api/v1/certificates caller to six-field parity. No handler changes — the contract is authoritative; the callers must conform. Implementation: GUI — OnboardingWizard CertificateStep expansion: web/src/pages/OnboardingWizard.tsx adds name/owner_id/team_id/ renewal_policy_id state. React Query hooks for getOwners/ getTeams/getPolicies use per_page: '500' to populate dropdowns without pagination-driven truncation. Payload ships all six required fields plus sans/certificate_profile_id/environment. nextDisabled gate enforces all six before the Continue button activates. CLI — ImportCertificates rewrite: internal/cli/client.go rewrites ImportCertificates with flag.NewFlagSet("import", flag.ContinueOnError). Required flags: --owner-id, --team-id, --renewal-policy-id, --issuer-id. Optional: --name-template (default {cn}, templated via strings.ReplaceAll against cert.Subject.CommonName), --environment (default imported). Missing required flags fail pre-HTTP with a clear error. 
Request map ships all six required fields plus sans/ environment/status/optional serial_number. cmd/cli/main.go — usage string updated to document the new required/optional flags. Tests — qa_test.go Part03 positive paths: deploy/test/qa_test.go Part03 Create_Minimal and Create_Full updated to include all six fields. Uses seed_demo.sql-supplied IDs (o-alice, t-platform, rp-standard) — docker-compose.demo.yml is the run context. C-001 explanatory comment added above Create_Minimal so future readers understand why the minimal payload is no longer minimal. MCP parity: Verified no-op. internal/mcp/types.go:28 CreateCertificateInput already declares all six fields; internal/mcp/tools.go:102 forwards the typed struct unchanged. Verification: Go CLI regression tests (internal/cli/client_test.go): * TestClient_ImportCertificates_MissingRequiredFlags — 5 subtests, one per missing required flag, confirms flag.ContinueOnError rejects with non-nil error before any HTTP call is attempted. * TestClient_ImportCertificates_MissingPositionalArgs — confirms the "usage: import <file>" error path when no PEM file is supplied after the flags. * TestClient_ImportCertificates_SixFieldPayload — uses httptest to decode the POST body and assert all six required fields plus sans/environment are present on the wire. Frontend regression test (web/src/api/client.test.ts): 'createCertificate accepts and transmits all six required fields' pins the wire shape for both GUI call sites (OnboardingWizard CertificateStep + CertificatesPage CreateCertificateModal). If either UI surface accidentally drops a field, this assertion fails in CI rather than surfacing as a 400 at runtime. Grep-based call-site sweep: Enumerated every POST /api/v1/certificates create caller. Four total: OnboardingWizard, CertificatesPage, MCP tools, CLI import. All four now ship six-field payloads. Claim path (internal/service/discovery.go) updates existing rows and does not POST. 
EST/SCEP handlers invoke internal certService.CreateVersion, not the public API. Negative-path tests (qa_test.go:1085/1267/1274/1288/1298) remain valid: they assert 400/non-500 on oversized/malformed/missing-CN/UTF-8/empty bodies, and these properties still hold under the stricter handler. Static gates: go build ./..., go vet ./..., go test ./internal/cli/..., and cd web && npm run test deferred to operator pre-push — the Go toolchain is not available in the session sandbox. Grep-based verification confirms the syntactic shape of every changed file. Residual: None. Every POST /api/v1/certificates call site now conforms to the six-field contract; the wire shape is pinned by both Go and TypeScript regression tests. Commit: TBD-SHA (audit doc + CLAUDE.md carry TBD-SHA placeholders to be amended after commit)	2026-04-19 00:25:10 +00:00
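The CLI side of the six-field contract can be sketched in Go with the standard flag package. The flag names, the {cn} default template, and the "usage: import <file>" error come from the commit message; the importOptions struct, the parseImportFlags and certName helpers, the other error strings, and the iss-example issuer ID are hypothetical.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"strings"
)

type importOptions struct {
	OwnerID, TeamID, RenewalPolicyID, IssuerID string
	NameTemplate, Environment, File            string
}

// parseImportFlags validates everything before any HTTP request is built.
func parseImportFlags(args []string) (importOptions, error) {
	var o importOptions
	fs := flag.NewFlagSet("import", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // return errors instead of printing usage
	fs.StringVar(&o.OwnerID, "owner-id", "", "owner ID (required)")
	fs.StringVar(&o.TeamID, "team-id", "", "team ID (required)")
	fs.StringVar(&o.RenewalPolicyID, "renewal-policy-id", "", "renewal policy ID (required)")
	fs.StringVar(&o.IssuerID, "issuer-id", "", "issuer ID (required)")
	fs.StringVar(&o.NameTemplate, "name-template", "{cn}", "certificate name template")
	fs.StringVar(&o.Environment, "environment", "imported", "environment label")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	required := []struct{ name, value string }{
		{"owner-id", o.OwnerID},
		{"team-id", o.TeamID},
		{"renewal-policy-id", o.RenewalPolicyID},
		{"issuer-id", o.IssuerID},
	}
	for _, r := range required {
		if r.value == "" {
			return o, fmt.Errorf("missing required flag --%s", r.name)
		}
	}
	if fs.NArg() < 1 {
		return o, errors.New("usage: import <file>")
	}
	o.File = fs.Arg(0)
	return o, nil
}

// certName fills the template from the certificate's subject common name.
func certName(template, commonName string) string {
	return strings.ReplaceAll(template, "{cn}", commonName)
}

func main() {
	opts, err := parseImportFlags([]string{
		"--owner-id", "o-alice", "--team-id", "t-platform",
		"--renewal-policy-id", "rp-standard", "--issuer-id", "iss-example",
		"certs.pem",
	})
	fmt.Println(opts.File, certName(opts.NameTemplate, "example.com"), err)
}
```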
shankar0123	0200c7f4a4	Close I-001 (RetryFailedJobs never invoked) coverage-gap finding Operator decision answered as Option A: JobService.RetryFailedJobs is now wired into the scheduler as an always-on 10th loop. Prior to this commit the method was implemented, unit-tested, and exported but had zero runtime callers — any job that transitioned to status=Failed stayed Failed forever regardless of how many attempts it had remaining. Scheduler — 10th loop: internal/scheduler/scheduler.go grows a jobRetryLoop alongside the existing nine loops (renewal, jobs, health, notifications, short-lived, network scan, digest, health check, cloud discovery). The loop follows the established run-immediately-then-tick pattern (same shape as jobProcessorLoop), gated by a sync/atomic.Bool idempotency guard and joined into the scheduler's sync.WaitGroup so WaitForCompletion drains it on graceful shutdown. Each tick runs under a 2-minute context timeout mirroring jobProcessorLoop's opCtx budget. The runJobRetry helper invokes jobService.RetryFailedJobs(ctx, 3) — the advisory maxRetries cap is belt-and-suspenders; per-job eligibility is still enforced inside the service via Attempts < MaxAttempts. The JobServicer scheduler-interface gains RetryFailedJobs so the scheduler's dependency surface stays explicit and mockable. Service — audit trail per retry: internal/service/job.go:RetryFailedJobs now emits an audit event for every Failed→Pending transition. Following the house convention used by all scheduler-emitted events, actor='system' and actorType= domain.ActorTypeSystem; action='job_retry'; details capture old_status, new_status, attempts, max_attempts. JobService carries an optional *AuditService (SetAuditService) that nil-guards to preserve test-wiring ergonomics — existing tests that construct JobService without an audit service continue to pass unchanged. 
Config — env var with sane default: internal/config/config.go:SchedulerConfig grows RetryInterval, wired to CERTCTL_SCHEDULER_RETRY_INTERVAL with a 5-minute default. Validate rejects intervals below 1 second (matches other scheduler interval validators). Server wiring: cmd/server/main.go calls jobService.SetAuditService(auditService) after JobService construction and sched.SetJobRetryInterval(cfg.Scheduler.RetryInterval) alongside the other SetXxxInterval calls. Regression coverage: internal/service/job_test.go (3 new) - TestJobService_RetryFailedJobs_EligibleJobTransitionsAndAudits - TestJobService_RetryFailedJobs_SkipsJobsAtMaxAttempts - TestJobService_RetryFailedJobs_NoAuditServiceOK internal/scheduler/scheduler_test.go (3 new) - TestScheduler_JobRetryLoop_CallsService - TestScheduler_JobRetryLoop_IdempotencyGuard - TestScheduler_JobRetryLoop_WaitForCompletion The service tests assert status transitions, attempt-cap short-circuiting, and audit event shape (actor='system', action='job_retry', details keys). The scheduler tests assert the loop invokes the service, the atomic.Bool guard skips overlapping ticks with the expected 'still running, skipping tick' log, and WaitForCompletion drains the in-flight tick on Stop. Residual follow-up (not in scope for this commit): internal/service/renewal.go:RetryFailedJobs is a parallel dead-code duplicate of the same logic on RenewalService — untested and has no runtime caller. The audit finding called this out as 'implemented twice'. Removing it is a separate cleanup and does not block the Option-A wiring this commit delivers.
Files: cmd/server/main.go — SetAuditService + SetJobRetryInterval internal/config/config.go — RetryInterval field + env + validate internal/scheduler/scheduler.go — 10th loop, interface, field, setter internal/scheduler/scheduler_test.go — 3 new scheduler-loop tests internal/service/job.go — RetryFailedJobs audit emission + SetAuditService internal/service/job_test.go — 3 new service-layer tests	2026-04-18 23:24:54 +00:00
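The idempotency guard and shutdown drain that the I-001 entry attributes to the scheduler loops can be sketched in Go as below. The commit names sync/atomic.Bool, sync.WaitGroup, and WaitForCompletion; the loopGuard type and its runTick method are hypothetical, and the real scheduler's loop structure is not shown in the source.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// loopGuard lets at most one tick of a loop run at a time and lets a
// shutdown path wait for the tick that is in flight.
type loopGuard struct {
	running atomic.Bool
	wg      sync.WaitGroup
}

// runTick runs work unless a previous tick is still executing, in which
// case it returns false without calling work ("still running, skipping tick").
func (g *loopGuard) runTick(work func()) bool {
	if !g.running.CompareAndSwap(false, true) {
		return false
	}
	g.wg.Add(1)
	defer func() {
		g.running.Store(false)
		g.wg.Done()
	}()
	work()
	return true
}

// WaitForCompletion blocks until any in-flight tick has finished.
func (g *loopGuard) WaitForCompletion() { g.wg.Wait() }

func main() {
	var g loopGuard
	overlapped := true
	ran := g.runTick(func() {
		// A tick arriving while this one is running is skipped.
		overlapped = g.runTick(func() {})
	})
	g.WaitForCompletion()
	fmt.Println("ran:", ran, "overlapping tick ran:", overlapped)
}
```

CompareAndSwap makes the check and the claim a single atomic step, which a separate Load followed by Store would not.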
shankar0123	fe7e766510	Close M-004 (OCSP issuer binding) and M-005 (discovery actor propagation) coverage-gap findings M-004 — OCSP issuer binding (composite key): The OCSP lookup path now binds (issuer_id, serial) as a composite key rather than resolving by serial alone. CertificateRepository and RevocationRepository gain GetByIssuerAndSerial methods; ca_operations.go scopes both lookups by the issuer_id path param. When no managed cert binds to that (issuer, serial) tuple, GetOCSPResponse constructs an RFC 6960 §2.2 'unknown' response (CertStatus=2) instead of the prior default 'good'. Short-lived cert exemption (profile TTL < 1h) is preserved. Real repo errors (non-sql.ErrNoRows) fail closed with a log. Regression coverage: internal/service/ca_operations_test.go - TestCAOperationsSvc_GetOCSPResponse_Unknown_CrossIssuer - TestCAOperationsSvc_GetOCSPResponse_Unknown_UnknownSerial M-005 — Discovery Claim/Dismiss actor propagation: DiscoveryService.ClaimDiscovered and DismissDiscovered now accept an explicit 'actor string' parameter (propagation pattern mirrors bulk_revocation.go / revocation_svc.go). The handler layer passes resolveActor(r.Context()) — the named-key identity established by the M-002 auth unification — and the service falls back to 'api' (the same safe sentinel resolveActor uses when no auth context is present) only when the caller passes an empty string. Never falls back to 'operator'. Regression coverage: internal/service/discovery_test.go - TestDiscoveryService_ClaimDiscovered_AuditActor - TestDiscoveryService_DismissDiscovered_AuditActor - TestDiscoveryService_ClaimDiscovered_EmptyActorFallsBackToAPI - TestDiscoveryService_DismissDiscovered_EmptyActorFallsBackToAPI Each new test asserts event.Actor matches the caller-supplied string (or 'api' on empty input) and explicitly asserts event.Actor != 'operator' to lock in the historical fix intent. 
Files: internal/api/handler/discovery.go — pass resolveActor(ctx) internal/api/handler/discovery_handler_test.go — updated call sites internal/integration/lifecycle_test.go — updated mock wiring internal/repository/interfaces.go — GetByIssuerAndSerial on CertificateRepository + RevocationRepository internal/repository/postgres/certificate.go — composite key lookup internal/service/ca_operations.go — (issuer_id, serial) scoping internal/service/ca_operations_test.go — 2 new M-004 tests internal/service/discovery.go — actor parameter + 'api' fallback internal/service/discovery_test.go — 4 new M-005 tests internal/service/shortlived_test.go — mock signature update internal/service/testutil_test.go — mock GetByIssuerAndSerial	2026-04-18 22:20:25 +00:00
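The M-004 lookup rule above can be sketched as follows. This is a minimal illustration, assuming an in-memory map in place of the real repositories; `store`, `issuerSerial`, `ocspStatus`, and `newDemoStore` are hypothetical names, not the project's API. It shows only the composite-key behavior: a serial that is not bound to the requested issuer yields 'unknown' rather than 'good'.

```go
package main

import "fmt"

// CertStatus follows the RFC 6960 CertStatus choices: good (0), revoked (1), unknown (2).
type CertStatus int

const (
	StatusGood CertStatus = iota
	StatusRevoked
	StatusUnknown
)

// issuerSerial is the composite lookup key (hypothetical name).
type issuerSerial struct {
	IssuerID string
	Serial   string
}

// store is an in-memory stand-in for the certificate and revocation repositories.
type store struct {
	issued  map[issuerSerial]bool
	revoked map[issuerSerial]bool
}

// ocspStatus resolves status by (issuer_id, serial), never by serial alone.
func (s store) ocspStatus(issuerID, serial string) CertStatus {
	key := issuerSerial{IssuerID: issuerID, Serial: serial}
	if !s.issued[key] {
		return StatusUnknown // no managed cert binds to this tuple
	}
	if s.revoked[key] {
		return StatusRevoked
	}
	return StatusGood
}

// newDemoStore seeds two certificates under issuer "iss-a", one of them revoked.
func newDemoStore() store {
	good := issuerSerial{IssuerID: "iss-a", Serial: "01"}
	bad := issuerSerial{IssuerID: "iss-a", Serial: "02"}
	return store{
		issued:  map[issuerSerial]bool{good: true, bad: true},
		revoked: map[issuerSerial]bool{bad: true},
	}
}

func main() {
	s := newDemoStore()
	fmt.Println(s.ocspStatus("iss-a", "01")) // same issuer, known serial
	fmt.Println(s.ocspStatus("iss-b", "01")) // cross-issuer query for the same serial
}
```

The short-lived certificate exemption and the fail-closed handling of repository errors described in the commit are left out of this sketch.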
shankar0123	ff7357f889	fix(lint): godoc comment on NewAuthWithNamedKeys must lead with function name (ST1020) CI failure on master (commit `3287e17`) — staticcheck ST1020: internal/api/middleware/middleware.go:125:1: ST1020: comment on exported function NewAuthWithNamedKeys should be of the form "NewAuthWithNamedKeys ..." (staticcheck) When NewAuth was renamed to NewAuthWithNamedKeys during the M-002 auth unification, the leading godoc sentence was left pointing at the old name. Rewrite the comment so its first sentence starts with the new function name, and expand the body to describe the named-key + admin-flag contract introduced in `3287e17`. Also gitignore /.gopath/ — session-scoped tool install cache, same category as /.gocache/ and /.gomodcache/. Verification: go vet ./internal/api/middleware/... — clean go build ./internal/api/middleware/... — clean go test ./internal/api/middleware/... — PASS (0.245s) staticcheck -checks=all,<project exclusions> — clean across middleware, handler, service, domain, cmd/server, scheduler Closes: CI failure on `3287e17`.	2026-04-18 21:38:46 +00:00
shankar0123	3287e174dc	Unify API auth + RFC-compliant CRL/OCSP (M-002 + M-003 + M-006, auto-closes M-001) Closes the remaining P1 gaps from coverage-gap-audit.md (M-001/M-002/M-003/M-006) on top of the C-001/C-002 ownership + agent-FK contract fixes landed in `a53a4b8`. The work lands as a single commit spanning server, docs, tests, and the React client. M-002 — Named API keys with per-key actor propagation * Migration 000014 adds the 'api_keys' table (id, name, hash, principal, role, created_at, last_used_at, disabled_at) so every credential carries an identifiable principal instead of the opaque 'anonymous'/'api-key' sentinel. * Auth middleware now rotates through configured keys, performs constant-time hash comparison, stamps 'last_used_at', and emits an actor struct via contextWithActor(). The audit middleware, bulk-revocation handler, approval handlers, and MCP tool layer now read the principal off the context and persist it on every audit_events row. * Regression coverage: - internal/api/middleware/audit_test.go — actor propagation, principal redaction for disabled keys, anonymous fallback for unauthenticated endpoints. - internal/api/handler/bulk_revocation_handler_test.go, job_handler_test.go — principal-on-audit assertions. M-003 — Authorization gates (Phase B) * Approval handler rejects self-approval / self-rejection with 403 when the actor principal equals the job's requested_by field. * Bulk revocation is gated behind the 'admin' role; operators and viewers receive 403. * Regression coverage: - internal/service/job_test.go — TestApproveJob_NotSelf, TestRejectJob_NotSelf. - internal/api/handler/bulk_revocation_handler_test.go — TestBulkRevoke_RequiresAdmin, TestBulkRevoke_AdminSucceeds. M-006 — RFC-compliant CRL/OCSP on the unauthenticated .well-known mux * Per RFC 8615, relying parties cannot reasonably be asked to authenticate against the issuing certctl instance to retrieve revocation material. 
CRL and OCSP move off the authenticated '/api/v1/crl' and '/api/v1/ocsp/' paths onto: GET /.well-known/pki/crl/{issuer_id} Content-Type: application/pkix-crl (RFC 5280 §5) GET /.well-known/pki/ocsp/{issuer_id}/{serial} Content-Type: application/ocsp-response (RFC 6960) * Non-standard JSON CRL shape is removed; only DER is served. * Short-lived certificate exemption (profile TTL < 1h → skip CRL/OCSP) is preserved; the response simply omits the serial. * Routes are registered on the unauthenticated 'finalHandler' mux in cmd/server/main.go alongside EST ('/.well-known/est/') and SCEP ('/scep'). Legacy authenticated paths return 404. Regression coverage: - internal/api/handler/certificate_handler_test.go — content type, DER parseability, 404 for unknown issuer. - internal/api/handler/adversarial_path_test.go — unauthenticated access asserted for CRL, OCSP, EST, SCEP. - internal/api/router/router_test.go — route-table assertion that '.well-known/pki/', '.well-known/est/', and '/scep' are mounted on the unauthenticated branch. M-001 — Auto-closed by M-002 EST and SCEP were already registered on the unauthenticated 'finalHandler' mux; the router comment at internal/api/router/router.go:247 now matches reality. The adversarial-path tests above lock the behavior in. Verification (all gates green): * go vet ./... — clean * go build ./... — ok * go test -short ./... (55+ packages) — all pass * web/ : npm test (225 Vitest tests) — all pass * web/ : npx tsc --noEmit — clean * grep sweep for '/api/v1/(crl\|ocsp)' — 13 surviving hits, all intentional M-006 tombstone/relocation comments. Documentation: * coverage-gap-audit.md — status flips M-001/M-002/M-003/M-006 → Fixed, with per-finding resolution paragraphs citing regression test IDs. (Audit file lives outside this repo; see cowork root.) * CLAUDE.md Project Status line updated with the auth-unification closure note. 
* docs/features.md, docs/architecture.md, docs/quickstart.md, docs/concepts.md, docs/connectors.md, docs/test-env.md, docs/testing-guide.md, docs/compliance-.md, docs/demo-advanced.md — refreshed for the new '.well-known/pki/' namespace and named API keys. * api/openapi.yaml — documents the new unauthenticated endpoints and removes the legacy '/api/v1/crl' + '/api/v1/ocsp/' paths. .gitignore: adds '/.gocache/' and '/.gomodcache/' for the session- scoped Go caches so they never enter the tree.	2026-04-18 18:17:41 +00:00
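The M-002 key check described above (iterate over the configured keys, compare hashes in constant time, return a named principal) might look roughly like this. It is a sketch under assumptions: `namedKey`, `newNamedKey`, `matchKey`, and `demoKeys` are hypothetical names, SHA-256 is an assumed hash choice, and the `last_used_at` stamping and context propagation are omitted.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
)

// namedKey pairs a stored key hash with the principal it identifies.
type namedKey struct {
	Name  string
	Hash  [sha256.Size]byte
	Admin bool
}

// newNamedKey hashes the secret so only the digest is kept in memory.
func newNamedKey(name, secret string, admin bool) namedKey {
	return namedKey{Name: name, Hash: sha256.Sum256([]byte(secret)), Admin: admin}
}

// matchKey returns the principal whose hash matches the presented secret.
// Each comparison is constant-time, and the loop visits every key even
// after a match so timing does not reveal which slot matched.
func matchKey(keys []namedKey, presented string) (string, bool) {
	sum := sha256.Sum256([]byte(presented))
	name, found := "", false
	for _, k := range keys {
		if subtle.ConstantTimeCompare(sum[:], k.Hash[:]) == 1 {
			name, found = k.Name, true
		}
	}
	return name, found
}

// demoKeys returns two illustrative keys; the names and secrets are made up.
func demoKeys() []namedKey {
	return []namedKey{
		newNamedKey("ops-admin", "s3cret-ops", true),
		newNamedKey("ci-bot", "s3cret-ci", false),
	}
}

func main() {
	name, ok := matchKey(demoKeys(), "s3cret-ci")
	fmt.Println(name, ok)
}
```

Returning the principal name rather than a bare boolean is what lets downstream audit code record who acted instead of an opaque sentinel.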
shankar0123	a53a4b845b	fix(gui,api): close C-001 + C-002 — ownership + agent FK contract C-001 — CreateCertificate was server-accepted with null owner_id, team_id, renewal_policy_id because the GUI neither collected the fields nor enforced them, even though the backend's ManagedCertificate schema and handler contract treat them as required. Fix the contract at all four layers: - web/src/pages/CertificatesPage.tsx: replace owner_id/team_id free- text inputs with <select> elements fed by getOwners/getTeams/ getPolicies queries; mark all three required; gate the Create button on owner_id + team_id + renewal_policy_id being set. - internal/api/handler/certificates.go: ValidateRequired for owner_id, team_id, renewal_policy_id on CreateCertificate so the handler returns HTTP 400 with the offending field name before the service layer is reached. - internal/mcp/types.go: drop ',omitempty' from CreateCertificateInput.RenewalPolicyID so the MCP schema reflects the required contract; Update inputs keep partial-update semantics. - api/openapi.yaml: 'required: [name, common_name, renewal_policy_id, issuer_id, owner_id, team_id]' was already present on the Create schema; clarified DeploymentTarget.agent_id description to note the FK contract. C-002 — CreateTargetWizard accepted an empty or bogus agent_id and the service inserted directly, producing a Postgres 23503 FK-violation that bubbled out as a generic HTTP 500. The FK itself (migration 000001 line 104: agent_id TEXT NOT NULL REFERENCES agents(id)) is correct; we keep the schema strict and add validation at three layers: - internal/service/target.go: introduce ErrAgentNotFound sentinel and pre-validate agent_id in TargetService.CreateTarget — empty string returns 'agent_id is required'; a nonexistent id returns the full 'referenced agent does not exist: <id>' error. Both wrap ErrAgentNotFound via fmt.Errorf %w so callers can use errors.Is. 
- internal/api/handler/targets.go: ValidateRequired on agent_id; map errors.Is(err, service.ErrAgentNotFound) to HTTP 400 instead of letting it fall through to the generic 500 branch. - internal/mcp/types.go: drop ',omitempty' from CreateTargetInput.AgentID to match the required contract. - web/src/pages/TargetsPage.tsx: replace the free-text Agent ID input with a <select> populated from getAgents(); include agent in the canProceedToReview gate so Next is disabled until an agent is chosen. Regression coverage (21 new subtests total): - TestCreateCertificate_MissingRequiredField_Returns400 — 6 subtests, one per required field, each proves the handler guard fires before the mock service is called. - TestCreateTarget_MissingAgentID_Returns400 — handler guard. - TestCreateTarget_NonexistentAgent_Returns400 — pins the ErrAgentNotFound -> 400 translation. - TestTargetService_CreateTarget_MissingAgentID — errors.Is sentinel. - TestTargetService_CreateTarget_NonexistentAgentID — errors.Is. - The existing TestTargetService_CreateTarget_Success, along with TestCreateTarget_{MissingName,MissingType,NameTooLong}_* handler tests, were updated to seed a real agent or include agent_id in the request body so the happy paths still run cleanly. Gates (Phase 4): - go build/vet/test/race: green - go test -cover: internal/service 68.7% (gate 55%), internal/api/handler 78.9% (gate 60%) - golangci-lint on service+handler+mcp: 0 issues - govulncheck: no reachable vulns - tsc --noEmit: clean - vitest: 223/223 passing See cowork/certctl-coverage-gap-audit.md entries C-001 and C-002.	2026-04-18 16:01:40 +00:00
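The C-002 error contract above (wrap a sentinel with `%w`, then translate it with `errors.Is` in the handler) can be sketched as below. Only `ErrAgentNotFound` and the two error messages come from the commit text; `createTarget`, `statusFor`, `errOther`, and the map-based agent lookup are hypothetical stand-ins.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// ErrAgentNotFound is the sentinel callers test for with errors.Is.
var ErrAgentNotFound = errors.New("agent not found")

// errOther stands in for any unrelated failure, such as a database outage.
var errOther = errors.New("db unavailable")

// createTarget pre-validates agent_id before any insert would happen.
// The known map is a stand-in for an agent repository lookup.
func createTarget(agentID string, known map[string]bool) error {
	if agentID == "" {
		return fmt.Errorf("agent_id is required: %w", ErrAgentNotFound)
	}
	if !known[agentID] {
		return fmt.Errorf("referenced agent does not exist: %s: %w", agentID, ErrAgentNotFound)
	}
	return nil
}

// statusFor maps service errors to HTTP status codes, as a handler would.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusCreated
	case errors.Is(err, ErrAgentNotFound):
		return http.StatusBadRequest
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	known := map[string]bool{"agent-1": true}
	fmt.Println(statusFor(createTarget("agent-1", known)))
	fmt.Println(statusFor(createTarget("ghost", known)))
}
```

Because both validation failures wrap the same sentinel, the handler needs a single `errors.Is` branch and the generic 500 path is reserved for genuinely unexpected errors.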
shankar0123	b3cc7cbdb2	fix(policies): close the D-006 loop — TitleCase seed canonicals + severity-aware, config-consuming rule engine (D-008) D-008 was a three-part drift in the policy engine that made the D-005/D-006 remediation cosmetic below the DB layer: (a) migrations/seed.sql INSERTed rules with pre-D-005 lowercase types ('ownership', 'environment', 'lifetime', 'renewal_window') that the handler validator rejects on Create/Update but that raw SQL INSERTs bypassed entirely. At runtime evaluateRule's switch fell through to the default "unknown policy rule type" error branch on every demo rule × every cert × every cycle, flooding logs while emitting zero violations. (b) migrations/seed_demo.sql persisted lowercase severity values ('critical', 'error', 'warning') on policy_violations rows. INSERT succeeded because that column had no CHECK, but any frontend comparing against the canonical PolicySeverity enum mis-categorized every seeded violation. (c) evaluateRule hardcoded Severity: PolicySeverityWarning on every emitted violation and ignored rule.Config entirely — so the D-006 per-rule severity column (000013) and every per-arm Config JSON ({allowed_issuer_ids, allowed_domains, required_keys, allowed, lead_time_days, max_days}) was dead data below the evaluation layer. This commit lands (a)+(b)+(c) atomically. Shipping any subset leaves the feature half-working. ## Changes Domain (internal/domain/policy.go): * Add PolicyTypeCertificateLifetime as the 6th TitleCase canonical. Pre-D-008 the seeded "max-certificate-lifetime" rule had no engine arm — routing it through RenewalLeadTime would conflate "how close to expiry before we renew" with "how long can the cert possibly be", two distinct semantics. The new type accepts config {"max_days": int} and flags certs whose NotAfter - NotBefore exceeds the cap. 
Handler validator (internal/api/handler/validation.go): * ValidatePolicyType allowlist grown to 6 canonicals (AllowedIssuers, AllowedDomains, RequiredMetadata, AllowedEnvironments, RenewalLeadTime, CertificateLifetime). OpenAPI (api/openapi.yaml): * PolicyType enum grown to match domain. Frontend (web/src/api/types.ts, types.test.ts): * POLICY_TYPES tuple gains CertificateLifetime; pin test asserts all 6 canonicals and rejects casing drift. Migration 000014 (policy_violations severity CHECK): * Named CHECK constraint (policy_violations_severity_check) mirroring 000013's allowlist, defense-in-depth at the DB layer against future drift from bypassed writes (migrations, psql sessions, future callers). Symmetric down migration drops by name. Seed data: * migrations/seed.sql rewritten to emit TitleCase canonicals with per-arm config JSON that actually exercises the config-consuming paths (not the missing-field backstops): - pr-require-owner → RequiredMetadata {"required_keys":["owner"]} Warning - pr-allowed-environments → AllowedEnvironments {"allowed":["production","staging","development"]} Error - pr-max-certificate-lifetime → CertificateLifetime {"max_days":90} Critical - pr-min-renewal-window → RenewalLeadTime {"lead_time_days":14} Warning Severities are now differentiated per rule (D-006 intent). * migrations/seed_demo.sql violation rows flipped to TitleCase severity ('Critical', 'Error', 'Warning') so migration 000014 applies cleanly on upgrade paths. Engine rewrite (internal/service/policy.go): * evaluateRule rewritten. All six arms now: 1. Parse rule.Config into the per-arm typed struct. 2. Bad JSON → log at ValidateCertificate boundary and skip this rule (no co-located poisoning of other rules in the same batch). 3. Empty/null Config → emit the pre-D-008 missing-field violation (backwards compat invariant — operators who haven't reconfigured still see the same output). 4. 
Violations emitted carry rule.Severity (no more hardcoded Warning); D-006 column is now load-bearing. * CertificateLifetime arm reads NotBefore/NotAfter from the certificate's latest version via CertRepo. Injected via PolicyService.SetCertRepo() setter — avoids churning ~36 NewPolicyService call sites while keeping the lifetime arm optional (degrades to a log+skip if the setter is not wired). Server wiring (cmd/server/main.go): * policyService.SetCertRepo(certRepo) wired after construction. Tests (internal/service/policy_test.go): * 25 new subtests across 5 groups: - TestEvaluateRule_SeverityPassThrough (6): every rule type emits violations carrying rule.Severity, not hardcoded. - TestEvaluateRule_ConfigConsumed (12): every per-arm Config path exercised positive + negative. - TestEvaluateRule_EmptyConfig_BackCompat (3): empty/null Config still emits pre-D-008 missing-field violations. - TestEvaluateRule_BadConfig_SkipsRule: malformed JSON logs and skips cleanly without poisoning neighbors. - TestEvaluateRule_CertificateLifetime_RepoScenarios (3): ok when repo wired, log+skip when not, handles missing NotBefore/NotAfter edges. Provenance: D-008 surfaced during D-005/D-006 remediation review in `eef1db0`. That commit added persistence and CI pins for the severity field but did not re-verify the evaluation layer consumed it; this finding and fix close the audit-process gap.	2026-04-18 14:55:56 +00:00
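To make the CertificateLifetime arm concrete, here is a minimal sketch of the behavior the commit describes: parse `{"max_days": int}` from the rule's Config, compare it with NotAfter minus NotBefore, and emit a violation that carries the rule's own severity. The types `rule`, `violation`, and `lifetimeConfig` and the helpers `evalLifetime`, `checkDays`, and `demoRule` are hypothetical; the empty-Config back-compat path and the CertRepo lookup are not modeled.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// rule is a trimmed stand-in for a policy rule row.
type rule struct {
	Name     string
	Severity string
	Config   json.RawMessage
}

// violation is what the engine emits for a failing certificate.
type violation struct {
	Rule, Severity, Message string
}

// lifetimeConfig is the typed form of the per-arm config JSON.
type lifetimeConfig struct {
	MaxDays int `json:"max_days"`
}

// evalLifetime returns a violation when the validity period exceeds max_days.
// A config that fails to parse is returned as an error so the caller can log
// it and skip this rule without affecting other rules in the batch.
func evalLifetime(r rule, notBefore, notAfter time.Time) (*violation, error) {
	var cfg lifetimeConfig
	if err := json.Unmarshal(r.Config, &cfg); err != nil {
		return nil, fmt.Errorf("rule %s: bad config: %w", r.Name, err)
	}
	limit := time.Duration(cfg.MaxDays) * 24 * time.Hour
	if notAfter.Sub(notBefore) <= limit {
		return nil, nil
	}
	return &violation{
		Rule:     r.Name,
		Severity: r.Severity, // pass the rule's severity through, no hardcoded default
		Message:  fmt.Sprintf("lifetime exceeds %d days", cfg.MaxDays),
	}, nil
}

// checkDays evaluates a certificate valid for the given number of days.
func checkDays(r rule, days int) (*violation, error) {
	notBefore := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	return evalLifetime(r, notBefore, notBefore.AddDate(0, 0, days))
}

// demoRule mirrors the seeded pr-max-certificate-lifetime rule.
func demoRule() rule {
	return rule{
		Name:     "pr-max-certificate-lifetime",
		Severity: "Critical",
		Config:   json.RawMessage(`{"max_days":90}`),
	}
}

func main() {
	v, err := checkDays(demoRule(), 365)
	fmt.Println(v, err)
}
```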
shankar0123	eef1db0f0a	fix(policies): stop 400ing the "+ New Policy" button + add per-rule severity (D-005, D-006) Coverage Gap Audit findings D-005 (P0) + D-006 (P1) fixed together in a single commit because they share the same root cause — policy CRUD sending values the backend silently rejects — and splitting them would leave a half-working UI between commits. ## D-005 (P0): PoliciesPage dropdown 400s every Create Policy Root cause ---------- `web/src/pages/PoliciesPage.tsx` populated the Type `<select>` from a hardcoded `['key_algorithm', 'ownership', 'allowed_issuers', ...]` array. The backend's `internal/api/handler/validators.go::ValidatePolicyType` enforces the TitleCase allowlist `AllowedIssuers`, `AllowedDomains`, `RequiredMetadata`, `AllowedEnvironments`, `RenewalLeadTime` — defined in `internal/domain/policy.go`. Every Create Policy request was rejected with `400 invalid policy type`. The error surfaced only as a transient toast; the modal closed anyway. Silent user-visible failure. Fix --- - `web/src/api/types.ts`: added `POLICY_TYPES` and `POLICY_SEVERITIES` tuples with `as const` and narrowed `PolicyRule.type`, `.severity`, and `PolicyViolation.severity` to the literal-union types. Dropdown is now sourced from the tuple; casing drift becomes a compile error. - `web/src/pages/PoliciesPage.tsx`: rekeyed `severityStyles` / `severityDots` to the TitleCase values, added `humanize()` for display (AllowedIssuers → "Allowed Issuers"), removed the `badge-neutral` fallback that was papering over the mismatch. - `web/src/api/types.test.ts` (new): pins both tuples exactly. If anyone edits one side of the frontend/backend contract without the other, CI fails with a clear assertion. Pure-TS vitest, no RTL dependency. ## D-006 (P1): `severity` field silently dropped on create/update Root cause ---------- `PolicyRule` had no `Severity` field in `internal/domain/policy.go`. 
The frontend has always sent `severity` on create/update, but Go's `json.Decoder` (default settings, no `DisallowUnknownFields`) silently dropped it. The value never reached PostgreSQL. Every rule rendered with the same severity because there was no severity — just a display computation downstream. Fix: option (b), full-stack schema add (not delete-the-field) ------------------------------------------------------------- - Migration `000013_policy_rule_severity` (up + down): adds `severity VARCHAR(50) NOT NULL DEFAULT 'Warning'` to `policy_rules` with CHECK constraint `severity IN ('Warning', 'Error', 'Critical')`. No index — three-value column on a low-thousands-rows table, planner will seq-scan regardless. PG 11+ metadata-only ADD COLUMN, safe on live data. - `internal/domain/policy.go`: added `Severity PolicySeverity` field. - `internal/repository/postgres/policy.go`: plumbed `severity` through ListRules SELECT + Scan, GetRule SELECT + Scan, CreateRule INSERT, UpdateRule UPDATE (4 queries). - `internal/service/policy.go::UpdatePolicy`: if the client omits severity on a PUT (zero-value empty string), fetch the existing rule and preserve its severity. Without this, partial updates would trip the NOT NULL CHECK and 500. Preserves pre-existing behavior for Name/Type (out of scope). - `internal/api/handler/policies.go::CreatePolicy`: default empty severity to `'Warning'`, then validate via `ValidatePolicySeverity`. 400 with clear message instead of 500 on CHECK violation. `UpdatePolicy`: validates severity only when provided. - `internal/mcp/types.go` + `internal/mcp/tools.go`: added optional `severity` on the MCP `create_policy` / `update_policy` tool inputs so LLM callers stay in sync with the wire contract. - `api/openapi.yaml`: added `severity` to the `PolicyRule` schema with the enum and default. 
Acceptance criterion (user-defined) ----------------------------------- "Create a rule with severity=Critical, reload the page, and still see Critical — no silent drops." Verified end-to-end: frontend sends `severity: "Critical"`, handler validates, service persists, DB stores, GET returns, React renders the correct badge. Seed data --------- `migrations/seed.sql`: four demo rules now have differentiated severities — `pr-require-owner` → Warning, `pr-allowed-environments` → Error, `pr-max-certificate-lifetime` → Critical, `pr-min-renewal-window` → Warning. The user called out that seeding all four at the same severity makes the feature look decorative; differentiation demonstrates the column carries real signal. ## Integration test fix (side effect of D-006) `internal/integration/e2e_test.go::TestCrossResourceWorkflow/CreatePolicy` was sending `"severity": "High"` — a value from the pre-audit severity vocabulary that the new `ValidatePolicySeverity` correctly rejects with 400. Changed to `"Error"` (closest semantic match in the new TitleCase allowlist). Only severity reference in the integration/ directory; verified via grep. ## Out of scope, logged for follow-up (d/D-008) Three policy-engine drift issues orthogonal to D-005 + D-006, explicitly deferred per direction: 1. `migrations/seed.sql` policy_rules INSERTs use lowercase TYPE values (`'ownership'`, `'environment'`, `'lifetime'`, `'renewal_window'`). These are load-bearing on `internal/service/policy.go::evaluateRule`'s `switch rule.Type` (which also uses the lowercase strings). Migrating requires coordinated changes across seed + evaluation engine. 2. `migrations/seed_demo.sql:482-483` contains lowercase `'critical'` severity — will now fail the new CHECK constraint. Separate fix. 3. `evaluateRule` hardcodes `Severity: domain.PolicySeverityWarning` on emitted violations and ignores the configured `rule.Config`. The new severity column is read correctly on the CRUD path but not yet consulted during evaluation. 
## Verification Backend: - `go build ./...` — clean - `go vet ./...` — clean - `go test -short ./...` — all packages green, including `internal/service` (policy service), `internal/api/handler` (policy + MCP handler tests), `internal/integration` (e2e_test.go after fix), `internal/domain`, `internal/repository/postgres`. Frontend: - `tsc --noEmit` — clean - `vitest run` — 223/223 passing (4 new assertions in types.test.ts) - `vite build` — clean (only the pre-existing chunk-size warning)	2026-04-18 13:02:04 +00:00
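The D-006 root cause, a JSON field silently dropped because the target struct has no matching member, is easy to reproduce. The sketch below uses a hypothetical `policyRuleV1` struct without a severity field; `decodeLenient` and `decodeStrict` are illustrative helpers. Note that the commit fixed the problem by adding the field end to end, not by enabling `DisallowUnknownFields`; the strict decoder appears here only to show the contrast.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// policyRuleV1 models the pre-fix shape: there is no severity member.
type policyRuleV1 struct {
	Name string `json:"name"`
	Type string `json:"type"`
}

// decodeLenient uses default decoder settings, which ignore unknown fields.
func decodeLenient(body string) (policyRuleV1, error) {
	var r policyRuleV1
	err := json.NewDecoder(strings.NewReader(body)).Decode(&r)
	return r, err
}

// decodeStrict rejects any field the struct does not declare.
func decodeStrict(body string) (policyRuleV1, error) {
	var r policyRuleV1
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields()
	err := dec.Decode(&r)
	return r, err
}

func main() {
	body := `{"name":"max-lifetime","type":"AllowedIssuers","severity":"Critical"}`

	r, err := decodeLenient(body)
	fmt.Printf("lenient: %+v err=%v\n", r, err) // severity is gone, no error

	_, err = decodeStrict(body)
	fmt.Printf("strict: err=%v\n", err) // the unknown field is reported
}
```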
shankar0123	ef670fa6da	fix(m-9): aggregate per-endpoint scan errors in NetworkScanService Before this fix, RunScan declared `scanErrors []string` but never appended to it. As a result: - the summary Info log ("network target scan completed") always reported `"errors": 0`, regardless of how many endpoints failed - the DiscoveryReport's `Errors` field — stored on the scan record and surfaced in the GUI scan history — was always nil Operators who needed to understand scan failures had to enable Debug logging and grep through the noise of expected sweep-scan connection refusals. The per-endpoint log level (Debug) is deliberate and correct — scanning a /24 typically produces 200+ connection-refused results, and logging each at Warn would create massive log spam at default verbosity. The bug was the silent loss of the aggregate count. This commit: - extracts the partitioning logic into `collectScanResults`, a pure method that splits per-endpoint results into discovered certificate entries and a list of endpoint error strings - populates the errors list with "<address>: <error>" so the scan record correlates failures back to specific endpoints - preserves the existing Debug-level per-endpoint log (sweep noise discipline) — no change to default-verbosity log output The summary Info log's "errors" field and the DiscoveryReport's Errors field now reflect the true failure count. Debug detail remains available for operators diagnosing specific endpoints. Audit scope note: the M-9 finding narrative implied broad Debug-level hiding of real errors across AWS SM, Azure KV, GCP SM, and network scan sentinel agents. On investigation, the three cloud-discovery connectors (awssm, azurekv, gcpsm) already use appropriate Warn/Error discipline for per-item and root-level failures. Only the network scanner had a silent observability gap, and it was a missed append rather than a misapplied log level. See audit resolution log for full details. 
CWE: CWE-778 (Insufficient Logging) — aggregate failure count lost. Tests: 4 new unit tests on collectScanResults covering the aggregation path (success + failure mix), all-success, all-failed, and empty-input degenerate cases. All tests pass with -race. Verification: - go build ./cmd/server/... ./cmd/agent/... ./cmd/mcp-server/... ./cmd/cli/... exit 0 - go vet ./... exit 0 - go test -race -count=1 -timeout 300s [full CI race path] exit 0 - golangci-lint run ./... --timeout 5m (v2.11.4) 0 issues - govulncheck ./... (@latest) 0 in-code vulnerabilities - go test -count=1 -cover ./internal/service/... 68.0% (> 55% threshold) Invariants preserved: - collectScanResults signature: method on *NetworkScanService, input []domain.NetworkScanResult, return ([]DiscoveredCertEntry, []string) - Debug log key names unchanged ("address", "error") - DiscoveryReport schema unchanged (Errors field already existed) - Sentinel agent ID "server-scanner" unchanged - No migration, no API, no wire-format change Refs: M-9 Medium finding; audit resolution log appended in follow-up commit on workspace-level audit report.	2026-04-18 02:34:14 +00:00
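The partitioning step described in this commit can be illustrated with a small pure function. This is a sketch, assuming simplified stand-in types (`scanResult`, `certEntry`) rather than the project's `domain.NetworkScanResult` and `DiscoveredCertEntry`, and written as a plain function rather than a method on `*NetworkScanService`. The `"<address>: <error>"` string format follows the commit text.

```go
package main

import (
	"errors"
	"fmt"
)

// scanResult is a simplified per-endpoint result (hypothetical shape).
type scanResult struct {
	Address     string
	Fingerprint string
	Err         error
}

// certEntry is a simplified discovered-certificate record (hypothetical shape).
type certEntry struct {
	Address     string
	Fingerprint string
}

// collectScanResults splits results into discovered entries and a list of
// "<address>: <error>" strings, so the aggregate failure count is never lost.
func collectScanResults(results []scanResult) ([]certEntry, []string) {
	var entries []certEntry
	var scanErrors []string
	for _, r := range results {
		if r.Err != nil {
			scanErrors = append(scanErrors, fmt.Sprintf("%s: %v", r.Address, r.Err))
			continue
		}
		entries = append(entries, certEntry{Address: r.Address, Fingerprint: r.Fingerprint})
	}
	return entries, scanErrors
}

// demoResults mixes one success with two failures.
func demoResults() []scanResult {
	return []scanResult{
		{Address: "10.0.0.1:443", Fingerprint: "ab:cd"},
		{Address: "10.0.0.2:443", Err: errors.New("connection refused")},
		{Address: "10.0.0.3:443", Err: errors.New("i/o timeout")},
	}
}

func main() {
	entries, errs := collectScanResults(demoResults())
	fmt.Println(len(entries), "discovered;", len(errs), "errors")
	for _, e := range errs {
		fmt.Println(" ", e)
	}
}
```

Keeping the function free of logging and I/O is what makes the four degenerate cases listed in the commit (mixed, all-success, all-failed, empty) cheap to unit test.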
shankar0123	e3196e7b50	M-2 PR-F: Middleware/ACME ctx-propagation + contextcheck linter + audit closeout Final PR in the six-commit M-2 sequence (PR-A: CertificateService cluster `cdc9d03`, PR-B: IssuerService+TargetService `eb14236`, PR-C: Policy/Profile/ Owner/Team `2497be4`, PR-D: Job/Notification/Audit `ccd89c3`, PR-E: AgentService `283ec27`, PR-F: this commit). PR-A through PR-E collapsed the service-layer shim methods and deleted every in-production context.Background() / context.TODO() call from internal/service/; this PR completes the sweep across the non-service tiers (HTTP middleware + ACME connector) and wires the contextcheck linter so regressions fail CI. Three narrow edits land the D-3 pattern (context.WithoutCancel for subsidiary async writes and deferred shutdown contexts): - internal/api/middleware/audit.go -- async audit goroutine now runs on auditCtx := context.WithoutCancel(r.Context()) instead of context.Background(). Preserves request-scoped values (trace ID, auth) while detaching from the request's cancellation so the audit write does not get killed when the response completes. Goroutine is still tracked via a.wg (M-1 shutdown drain) so Flush(ctx) behaviour is unchanged. CWE-770 Missing Release (goroutine leak potential) + CWE-400 Resource Exhaustion (missed cancellation propagation). - internal/api/middleware/middleware.go -- Recovery panic path now logs via slog.ErrorContext(ctx, ...) instead of log.Printf. Request- scoped trace/auth metadata now carries through the panic log, matching every other request log. D-3 non-bypass: the context is r.Context() captured before the defer, so even a panic mid-handler propagates the ctx's trace ID into the ERROR log line. - internal/connector/issuer/acme/acme.go (HTTP-01 challenge server shutdown) -- defer shutdown context derived from context.WithTimeout(context.WithoutCancel(ctx), 5s) instead of context.Background(). 
Preserves parent ctx values, detaches from parent cancellation so Shutdown always gets its full 5-second budget even when the parent was cancelled. Matches the same pattern applied in ACME's solveAuthorizationsDNS01 and solveAuthorizationsDNSPersist01. Linter wiring: .golangci.yml adds `contextcheck` to the enabled set. golangci-lint v2.11.4 now fails CI on any function that takes a context.Context parameter but calls into context.Background() or context.TODO() instead of propagating -- regression guard for all five prior PRs. Verification (CI parity, GOCACHE=/tmp/gocache GOMODCACHE=/tmp/gomodcache GOLANGCI_LINT_CACHE=/tmp/lintcache): - go build ./... -> 0 - go vet ./... -> 0 - golangci-lint run (contextcheck enabled) -> 0 issues - go test -race -short ./internal/api/middleware/... -> PASS - go test -race -short ./internal/scheduler/... -> PASS - go test -race -short ./internal/connector/issuer/acme/... -> PASS - go test -race -short ./internal/service/... -> PASS - rg "context\.(Background\|TODO)" internal/service/ internal/scheduler/ internal/connector/ internal/api/middleware/ -> 0 non-test hits (one pedagogical godoc reference in audit.go documenting why context.Background() would be wrong remains intentional) Wire-format invariants preserved: 0 API routes, 0 SQL migrations, 0 frontend bytes, 0 OpenAPI bytes, 0 connector interface signature changes, 0 new env vars, 0 new external dependencies (pure context stdlib). The AuditRecorder interface signature, the body-hash algorithm (SHA-256 16 hex chars), the excluded-path short-circuit, the actor-extraction path, the responseWriter status-capture wrapper, the AuditServiceAdapter, and all 116 API routes under /api/v1/, /.well-known/est/, /scep, /health, /auth are byte-identical. M-2 aggregate across PR-A through PR-F: 57 files, +635 / -613 (PR-A 12f +227/-237, PR-B 9f +150/-146, PR-C 17f +156/-148, PR-D 11f +67/-63, PR-E 4f +9/-15, PR-F 4f +26/-4). 
With M-2 closed, 8 of 10 Medium findings are resolved; M-9, M-10, L-1..L-4, and I-1..I-8 remain for the post-v2.1.0 hardening batch. Audit complete. Commit: `1f6cf0eafa`. Sections: 12. Findings: 2/7/10/4/6.	2026-04-18 01:43:47 +00:00
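The D-3 pattern referenced in this commit relies on `context.WithoutCancel` (standard library, Go 1.21+): the derived context keeps the parent's values but ignores its cancellation. The sketch below demonstrates that property in isolation; `traceKey` and `demo` are hypothetical names, and the trace ID value is made up.

```go
package main

import (
	"context"
	"fmt"
)

// traceKey is an unexported key type for a request-scoped value.
type traceKey struct{}

// demo cancels a request-like parent context and reports what a detached
// child still sees: the trace ID survives, the cancellation does not carry over.
func demo() (traceID string, parentErr, detachedErr error) {
	base := context.WithValue(context.Background(), traceKey{}, "trace-123")
	parent, cancel := context.WithCancel(base)

	// Detach before the parent is cancelled, as an async audit write would.
	auditCtx := context.WithoutCancel(parent)

	cancel() // simulates the response completing or the client disconnecting

	traceID, _ = auditCtx.Value(traceKey{}).(string)
	return traceID, parent.Err(), auditCtx.Err()
}

func main() {
	id, pErr, dErr := demo()
	fmt.Println("trace id:", id)
	fmt.Println("parent err:", pErr)
	fmt.Println("detached err:", dErr)
}
```

A detached context has no deadline of its own, which is why the ACME shutdown path in the commit wraps it in `context.WithTimeout` to restore a bounded budget.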
shankar0123	283ec27ca4	fix(m2-pr-e): collapse AgentService.HeartbeatWithContext into Heartbeat PR-E of 6 in the M-2 end-to-end remediation sequence. Collapses the HeartbeatWithContext wrapper into a single ctx-first Heartbeat method, matching D-1 (ctx-only signatures, no dual forms). The handler-facing method name is preserved (D-4) — internal/api/handler/agents.go already declares `Heartbeat(ctx, ...)` on its local service interface, and the handler mock at internal/api/handler/agent_handler_test.go already takes `_ context.Context` as its first param, so no handler churn. Changes ------- internal/service/agent.go - Delete the zero-body Heartbeat wrapper that forwarded to HeartbeatWithContext with context.Background(). - Rename HeartbeatWithContext → Heartbeat (ctx-bearing body folded directly into the canonical method). internal/service/agent_test.go - TestHeartbeat (L95) and TestHeartbeat_NotFound (L128): agentService.HeartbeatWithContext(ctx, ...) → .Heartbeat(ctx, ...). internal/service/concurrent_test.go - L162: agentSvc.HeartbeatWithContext(ctx, agentID, metadata) → .Heartbeat(ctx, agentID, metadata). internal/service/context_test.go - L179 + L232: agentSvc.HeartbeatWithContext(ctx, ...) → .Heartbeat(...) - L185 + L238 t.Logf strings: "HeartbeatWithContext with ..." → "Heartbeat with ..." to match the collapsed method name. Verification (Go 1.25.9 linux/arm64, CI-parity caches) ------------------------------------------------------ go build ./... clean go vet ./... clean go test -short ./internal/service/... ./internal/api/handler/... \ ./internal/integration/... all ok go test -race -short same set all ok go test -short ./... all packages ok golangci-lint run ./... 0 issues Locked decisions from the M-2 plan: D-1 ctx-only signatures (no dual forms) D-4 preserve handler method names facing the router D-5 domain types stay ctx-free Audit complete. Commit: `1f6cf0eafa`. Sections: 12. Findings: 2/7/10/4/6.	2026-04-18 01:25:20 +00:00
shankar0123	ccd89c348f	fix(m2-pr-d): thread ctx through Job/Notification/Audit services Collapse CancelJobWithContext into CancelJob; eliminate 10 context.Background() hits across the Job+Notification+Audit service cluster by threading ctx through their handler-facing service interfaces. Services (ctx-first): - service/job.go: ListJobs, GetJob, CancelJob, ApproveJob, RejectJob now accept ctx; the CancelJobWithContext wrapper is removed (handler callers continue to invoke CancelJob, now ctx-aware). - service/notification.go: ListNotifications, GetNotification, MarkAsRead accept ctx. - service/audit.go: ListAuditEvents, GetAuditEvent accept ctx. Handlers (interface + callsites): - handler/jobs.go, handler/notifications.go, handler/audit.go: local service interfaces updated, r.Context() threaded at every callsite. Tests: - Mock services updated to match the new interfaces (ctx accepted and ignored via '_ context.Context' first parameter; Fn closure fields unchanged). - job_test.go / notification_test.go callsites thread context.Background() to match production shape. Verification: go build ./... ok go vet ./... ok go test -short ./... ok go test -race -short ./... ok golangci-lint run ./... 0 issues Locked decisions from the M-2 plan: D-1 ctx-only signatures (no dual forms) D-4 preserve handler method names facing the router D-5 domain types stay ctx-free Audit complete. Commit: `1f6cf0eafa`. Sections: 12. Findings: 2/7/10/4/6.	2026-04-18 01:20:46 +00:00
shankar0123	2497be496d	M-2 PR-C: Collapse Policy/Profile/Owner/Team services to ctx-first signatures - Add ctx first param to 21 service-layer handler-interface methods across policy.go (6), profile.go (5), owner.go (5), team.go (5) - Replace 24 context.Background() call sites with received ctx; use context.WithoutCancel(ctx) for subsidiary audit-recording ops to preserve fire-and-forget audit semantics without inheriting caller cancellation - Add ctx first param to 21 handler-interface method signatures across policies.go (6), profiles.go (5), owners.go (5), teams.go (5) - Thread r.Context() through 21 HTTP handler sites (ListPolicies, GetPolicy, CreatePolicy, UpdatePolicy, DeletePolicy, ListViolations, ListProfiles, GetProfile, CreateProfile, UpdateProfile, DeleteProfile, ListOwners, GetOwner, CreateOwner, UpdateOwner, DeleteOwner, ListTeams, GetTeam, CreateTeam, UpdateTeam, DeleteTeam) - Update MockPolicyService/MockProfileService/MockOwnerService/ MockTeamService mock method impls with _ context.Context first param (Fn fields unchanged — closures do not need ctx); update mock impls in integration/lifecycle_test.go for all four services - Update 12 service-layer test callsites (policy_test.go ×2, owner_test.go ×5, team_test.go ×5, profile_test.go ×13) to pass context.Background() at the call site Audit complete. Commit: `1f6cf0eafa`. Sections: 12. Findings: 2/7/10/4/6.	2026-04-18 01:10:06 +00:00
shankar0123	eb14236166	M-2 PR-B: Collapse IssuerService + TargetService to ctx-first signatures - Delete bare TestConnection wrapper in IssuerService; rename TestConnectionWithContext → TestConnection - Delete TestTargetConnection delegate shim in TargetService (canonical TestConnection already ctx-first) - Add ctx first param to 10 handler-interface methods (ListIssuers/GetIssuer/CreateIssuer/UpdateIssuer/DeleteIssuer and ListTargets/GetTarget/CreateTarget/UpdateTarget/DeleteTarget) - Replace 16 context.Background() call sites with received ctx - Thread r.Context() through 12 HTTP handler sites in issuers.go and targets.go (outer TargetHandler.TestTargetConnection HTTP method name preserved for router compatibility) - Update MockIssuerService, MockTargetService, and mockTargetService (integration) for ctx-first forwarding; update test callsite literals	2026-04-18 00:46:58 +00:00
shankar0123	cdc9d03d5b	fix(m-2): thread context through CertificateService cluster Collapses CertificateService, RevocationSvc, and CAOperationsSvc to ctx-accepting method signatures. Removes context.Background() synthesis at 24 internal call sites across certificate.go, revocation_svc.go, and ca_operations.go. - Primary repo calls inherit request cancellation via the passed ctx. - Audit and notification dispatches use context.WithoutCancel(ctx) so they survive client disconnect. - Collapses TriggerRenewal/TriggerRenewalWithActor, TriggerDeployment/TriggerDeploymentWithActor, and RevokeCertificate/RevokeCertificateWithActor sibling pairs into single canonical ctx-accepting methods (decisions D-1, D-2). Handlers pass r.Context(). Mocks and tests updated to match new signatures. No HTTP surface change, no OpenAPI change. PR 1 of 6 in the M-2 remediation chain. Master green at this commit. Refs: certctl-audit-report.md M-2 (L143, L224)	2026-04-18 00:29:37 +00:00
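The split described above (primary calls inherit request cancellation, audit dispatches survive client disconnect) can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `auditLog`, `revokeCertificate`, and `demoClientDisconnect` are hypothetical names, and the only library behaviour relied on is `context.WithoutCancel` from the standard library (Go 1.21+).

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// auditLog is a hypothetical in-memory audit sink used only for this sketch.
type auditLog struct{ events []string }

// record stores an event unless the context it was given is already done.
func (a *auditLog) record(ctx context.Context, event string) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	a.events = append(a.events, event)
	return nil
}

// revokeCertificate is a hypothetical ctx-first service method.
// The primary store call sees the request ctx; the audit write uses a
// context that keeps the request's values but drops its cancellation.
func revokeCertificate(ctx context.Context, a *auditLog, certID string, store func(context.Context) error) error {
	storeErr := store(ctx)

	outcome := "revoked"
	if storeErr != nil {
		outcome = "revoke-failed"
	}
	auditCtx := context.WithoutCancel(ctx)
	if err := a.record(auditCtx, outcome+":"+certID); err != nil {
		return err
	}
	return storeErr
}

// demoClientDisconnect simulates a request whose client has already gone away.
// It reports whether the primary call observed cancellation and how many
// audit events were still recorded.
func demoClientDisconnect() (primaryCancelled bool, auditEvents int) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // client disconnected before the service call ran

	a := &auditLog{}
	err := revokeCertificate(ctx, a, "cert-1", func(c context.Context) error {
		return c.Err() // a real repository call would abort the query here
	})
	return errors.Is(err, context.Canceled), len(a.events)
}

func main() {
	cancelled, events := demoClientDisconnect()
	fmt.Println("primary cancelled:", cancelled, "audit events:", events)
}
```

The design point is that a single received `ctx` replaces every synthesized `context.Background()`, while the audit path opts out of cancellation explicitly instead of by accident.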
shankar0123	d14a45401b	fix(audit): drain in-flight recording goroutines on shutdown (M-1) Audit events spawned from the HTTP middleware ran in detached goroutines using context.Background(). On SIGTERM the DB pool was closed before those goroutines finished writing, silently dropping audit events (CWE-662 Improper Synchronization / CWE-400 Uncontrolled Resource Consumption). NewAuditLog now returns an *AuditMiddleware struct that tracks every spawned goroutine with sync.WaitGroup. Callers wire the middleware via its Middleware method value (preserves the existing func(http.Handler) http.Handler shape) and drain the WaitGroup with Flush(ctx), which blocks until in-flight recordings complete or the provided context is cancelled — mirroring scheduler.WaitForCompletion. Flush is invoked in cmd/server/main.go between http.Server.Shutdown (no new requests accepted) and db.Close (pool torn down), with a timeout returning ErrAuditFlushTimeout wrapping ctx.Err(). Request-derived inputs (method, path, status) are snapshotted before the goroutine spawn so the worker does not race with http.Server reusing r after the handler returns. Tests: TestAuditLog_FlushDrainsInFlightGoroutines TestAuditLog_FlushTimeoutReturnsErrAuditFlushTimeout Verification: go build ./... : 0 go vet ./... : 0 go test -race -short ./... : 0 (all packages) go test -cover ./internal/api/middleware : 81.4% golangci-lint run : 0 issues govulncheck ./... : 0 vulns in called code	2026-04-17 17:29:48 +00:00
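A rough sketch of the drain pattern described above: every recording goroutine is tracked with a `sync.WaitGroup`, and a flush call waits for them or gives up when its context ends. The names `auditTracker`, `spawn`, `flush`, and `errAuditFlushTimeout` are stand-ins chosen for this example; the real `AuditMiddleware` API may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// errAuditFlushTimeout is a stand-in for the sentinel named in the commit.
var errAuditFlushTimeout = errors.New("audit flush timed out")

// auditTracker counts in-flight recording goroutines.
type auditTracker struct {
	wg sync.WaitGroup
}

// spawn runs record in a tracked goroutine.
func (t *auditTracker) spawn(record func()) {
	t.wg.Add(1)
	go func() {
		defer t.wg.Done()
		record()
	}()
}

// flush blocks until all tracked goroutines finish or ctx is done.
// On timeout the error wraps both the sentinel and ctx.Err().
func (t *auditTracker) flush(ctx context.Context) error {
	done := make(chan struct{})
	go func() {
		t.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return fmt.Errorf("%w: %w", errAuditFlushTimeout, ctx.Err())
	}
}

// demoFlushDrains spawns three short recordings and waits for all of them.
func demoFlushDrains() (int, error) {
	var (
		t        auditTracker
		mu       sync.Mutex
		recorded int
	)
	for i := 0; i < 3; i++ {
		t.spawn(func() {
			time.Sleep(10 * time.Millisecond) // simulated DB write
			mu.Lock()
			recorded++
			mu.Unlock()
		})
	}
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	err := t.flush(ctx)
	mu.Lock()
	defer mu.Unlock()
	return recorded, err
}

// demoFlushTimeout keeps one recording stuck and checks the timeout error shape.
func demoFlushTimeout() bool {
	var t auditTracker
	release := make(chan struct{})
	defer close(release) // let the stuck goroutine finish afterwards
	t.spawn(func() { <-release })

	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	err := t.flush(ctx)
	return errors.Is(err, errAuditFlushTimeout) && errors.Is(err, context.DeadlineExceeded)
}

func main() {
	n, err := demoFlushDrains()
	fmt.Println("recorded:", n, "err:", err)
	fmt.Println("timeout surfaced:", demoFlushTimeout())
}
```

In a shutdown sequence the flush would sit between stopping the HTTP server and closing the database pool, as the commit describes.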
shankar0123	27afa4463d	fix(repository): idempotent sentinel agent creation via ON CONFLICT (M-6) Sentinel agents (server-scanner, cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm) were created on startup with a plain INSERT whose duplicate-key error was swallowed unconditionally. That silenced every other DB failure too (connectivity drop, permissions change, unrelated constraint violation) — a restart after the first boot quietly de-fanged cloud discovery and the network scanner (CWE-662, CWE-209-adjacent). Shape A: add AgentRepository.CreateIfNotExists using ON CONFLICT (id) DO NOTHING RETURNING id + sql.ErrNoRows discrimination. This keeps the strict Create semantics (duplicate-key is an error) intact for real agent registration and gives sentinels their own idempotent path. - repo: CreateIfNotExists returns (created bool, err error); false,nil on pre-existing row; false,wrapped err on anything else. - interface: CreateIfNotExists added to AgentRepository. - main.go: 4 sentinel sites log Error/Info/Debug distinctly. - mocks: service + integration mocks implement the new method. - tests: 4 new testcontainers integration tests cover first-insert, idempotent second-call, concurrent 16-goroutine race (exactly one creator, no duplicate-key panic), and pre-cancelled context surfacing. Coverage gates (go test -cover): service 67.6%/55, handler 78.6%/60, domain 92.7%/40, middleware 80.0%/30, crypto 86.7%/85. Race/vet/golangci-lint v2.11.4 (0 issues)/govulncheck v1.2.0 clean across all touched packages.	2026-04-17 16:32:07 +00:00
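The `ON CONFLICT ... RETURNING id` plus `sql.ErrNoRows` discrimination described above might look roughly like this. It is a sketch under assumptions: the table and column names (`agents`, `id`, `name`) and the helper names are hypothetical, and `fakeRow` exists only so the three outcomes can be shown without a database.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// insertSentinelSQL uses assumed table and column names.
const insertSentinelSQL = `INSERT INTO agents (id, name) VALUES ($1, $2)
ON CONFLICT (id) DO NOTHING RETURNING id`

// rowScanner is satisfied by *sql.Row; it lets the outcome logic be exercised
// without a live database.
type rowScanner interface {
	Scan(dest ...any) error
}

// createdFromRow maps the Scan result onto (created, err):
//   - a returned row means the INSERT happened,
//   - sql.ErrNoRows means the conflict branch fired (row already existed),
//   - anything else is a real failure and is wrapped, never swallowed.
func createdFromRow(row rowScanner) (bool, error) {
	var id string
	err := row.Scan(&id)
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, sql.ErrNoRows):
		return false, nil
	default:
		return false, fmt.Errorf("create sentinel agent: %w", err)
	}
}

// createIfNotExists shows how the helper would be wired to database/sql.
func createIfNotExists(ctx context.Context, db *sql.DB, id, name string) (bool, error) {
	return createdFromRow(db.QueryRowContext(ctx, insertSentinelSQL, id, name))
}

// fakeRow simulates the three possible Scan outcomes.
type fakeRow struct{ err error }

func (f fakeRow) Scan(dest ...any) error {
	if f.err == nil {
		if p, ok := dest[0].(*string); ok {
			*p = "server-scanner"
		}
	}
	return f.err
}

// demoOutcomes runs first insert, repeated insert, and an unrelated DB failure.
func demoOutcomes() (created, existedQuietly, failureSurfaced bool) {
	created, _ = createdFromRow(fakeRow{})
	second, err := createdFromRow(fakeRow{err: sql.ErrNoRows})
	existedQuietly = !second && err == nil
	_, err = createdFromRow(fakeRow{err: errors.New("connection refused")})
	failureSurfaced = err != nil
	return
}

func main() {
	fmt.Println(demoOutcomes())
}
```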
shankar0123	80450c7180	fix(repository): populate TargetIDs in certificate scan helper (M-7) scanCertificate never queried the certificate_target_mappings junction table, so Certificate.TargetIDs was always nil on reads. This silently broke deployment lookups, bulk revocation filters, cert detail pages, and any code path that iterated TargetIDs to dispatch target work. Fix: - Convert scanCertificate to a receiver method (r *CertificateRepository) so it has access to the DB for the secondary junction query. - Get(): scan the row, then call r.getTargetIDs(ctx, certID) to populate TargetIDs with a single targeted query. - List() and GetExpiringCertificates(): inline the scan loop so we can collect all certIDs first, then call getTargetIDsForCertificates once with pq.Array(certIDs) to avoid N+1 round-trips. Build a map and attach TargetIDs to each certificate in the result set. - Default TargetIDs to []string{} (not nil) when a cert has no mappings so JSON marshals as [] rather than null. Tests: - New integration test file certificate_targetids_test.go with 5 subtests exercising Get / List / GetExpiringCertificates single and multi-target cases plus the empty-slice vs nil contract. - Uses the shared testcontainers-go setupTestDB infrastructure and skips under 'go test -short' so CI (which excludes ./internal/repository/... from coverage paths anyway) stays green. Addresses M-7 from certctl-audit-report.md.	2026-04-17 15:41:08 +00:00
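To illustrate the batch-attach step and the empty-slice contract described above, here is a small sketch. The `certificate` struct, its JSON field names, and `attachTargetIDs` are assumptions for the example; the map stands in for the result of the single batched junction-table query.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// certificate is a trimmed, hypothetical version of the domain type.
type certificate struct {
	ID        string   `json:"id"`
	TargetIDs []string `json:"target_ids"`
}

// attachTargetIDs copies target IDs onto each certificate from a map that
// would be built from one batched junction query (certificate ID -> target IDs).
// Certificates without mappings get an empty, non-nil slice.
func attachTargetIDs(certs []certificate, mappings map[string][]string) {
	for i := range certs {
		ids := mappings[certs[i].ID]
		if ids == nil {
			ids = []string{} // marshals as [] rather than null
		}
		certs[i].TargetIDs = ids
	}
}

// demoJSON shows the marshalled shape for a mapped and an unmapped certificate.
func demoJSON() string {
	certs := []certificate{{ID: "c1"}, {ID: "c2"}}
	attachTargetIDs(certs, map[string][]string{"c1": {"t1", "t2"}})
	out, err := json.Marshal(certs)
	if err != nil {
		return ""
	}
	return string(out)
}

func main() {
	fmt.Println(demoJSON())
}
```

Building the map once per result set is what avoids one junction query per certificate.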
shankar0123	c655e0f8c5	fix(crypto/local-ca): reject expired or not-yet-valid sub-CA certificates on disk load (M-5) loadCAFromDisk now validates the upstream sub-CA certificate's NotBefore and NotAfter fields before accepting it, returning a fail-closed error at server startup instead of silently loading an out-of-window CA. Before this fix, loadCAFromDisk checked BasicConstraints.IsCA and KeyUsage=CertSign but not the validity window. An expired enterprise sub-CA (e.g. an ADCS subordinate whose rollover slipped) would load without warning and the scheduler would mint child certs that every RFC 5280 path validator rejects — outages show up at relying parties, not at certctl, and only after thresholds trip. CWE-672 (Operation on a Resource after Expiration or Release); secondary CWE-295 (Improper Certificate Validation). Error strings include the CA subject CommonName and both RFC3339 timestamps so the log line is actionable in a 3am incident. Tests: TestSubCAMode gains three subtests exercising the new gate — SubCA_ExpiredCert_IsRejected (CA expired 1h ago → error mentions 'expired' and the CN), SubCA_NotYetValid_IsRejected (CA valid +1h → error mentions 'not yet valid' and the CN), and SubCA_BarelyValid_IsAccepted (CA valid [now-1m, now+1h] → issuance succeeds, proving no over-rejection). Adds generateTestSubCAWithValidity helper; the original generateTestSubCA wrapper preserves the [now, now+5y] default for existing tests. Package coverage: 67.7% -> 68.3%. Verification: go build, go vet, go test -race, go test -cover all green locally; golangci-lint v2.11.4 clean; govulncheck clean. All CI coverage floors met with margin (service 67.6/55, handler 78.6/60, domain 92.7/40, middleware 80.0/30, crypto 86.7/85). Parent: `5abeeb8` (M-8 per-ciphertext salt). Closes: audit finding M-5 in certctl-audit-report.md.	2026-04-17 14:10:23 +00:00
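The validity-window gate described above can be sketched as a small check on a parsed certificate. `checkCAValidity` is a hypothetical name and the error wording is illustrative; the sketch only assumes the standard `crypto/x509` fields `NotBefore`, `NotAfter`, and `Subject.CommonName`.

```go
package main

import (
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"strings"
	"time"
)

// checkCAValidity rejects a CA certificate whose validity window does not
// contain now. The error names the CA and both timestamps in RFC 3339 form.
func checkCAValidity(cert *x509.Certificate, now time.Time) error {
	cn := cert.Subject.CommonName
	nb := cert.NotBefore.Format(time.RFC3339)
	na := cert.NotAfter.Format(time.RFC3339)
	if now.Before(cert.NotBefore) {
		return fmt.Errorf("sub-CA %q is not yet valid (NotBefore=%s, NotAfter=%s)", cn, nb, na)
	}
	if now.After(cert.NotAfter) {
		return fmt.Errorf("sub-CA %q has expired (NotBefore=%s, NotAfter=%s)", cn, nb, na)
	}
	return nil
}

// demoValidity mirrors the three subtests named in the commit message.
func demoValidity() (expiredRejected, notYetValidRejected, barelyValidAccepted bool) {
	now := time.Now()
	mk := func(notBefore, notAfter time.Time) *x509.Certificate {
		return &x509.Certificate{
			Subject:   pkix.Name{CommonName: "Example Sub-CA"},
			NotBefore: notBefore,
			NotAfter:  notAfter,
		}
	}
	mentions := func(err error, words ...string) bool {
		if err == nil {
			return false
		}
		for _, w := range words {
			if !strings.Contains(err.Error(), w) {
				return false
			}
		}
		return true
	}

	expired := checkCAValidity(mk(now.Add(-48*time.Hour), now.Add(-time.Hour)), now)
	early := checkCAValidity(mk(now.Add(time.Hour), now.Add(48*time.Hour)), now)
	ok := checkCAValidity(mk(now.Add(-time.Minute), now.Add(time.Hour)), now)

	return mentions(expired, "expired", "Example Sub-CA"),
		mentions(early, "not yet valid", "Example Sub-CA"),
		ok == nil
}

func main() {
	fmt.Println(demoValidity())
}
```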
shankar0123	5abeeb882b	fix(crypto): per-ciphertext PBKDF2 salt + v2 versioned format with v1 fallback (M-8)	2026-04-17 05:36:29 +00:00
shankar0123	89b910a8f1	security: atomic pending-job claim with FOR UPDATE SKIP LOCKED (H-6) Fixes H-6 (CWE-362) — GetPendingJobs returned pending rows without row locks, so two scheduler replicas in an HA deployment could both read the same row, both decide it was theirs, and race on UpdateStatus, producing duplicate Running jobs and duplicate certificate issuances. Remediation: a claim-style repository API that selects + transitions Pending -> Running in one transaction with SELECT ... FOR UPDATE SKIP LOCKED. Concurrent claimants observe disjoint row sets; no worker ever sees another worker's claimed row. Repository changes (internal/repository/postgres/job.go): - New ClaimPendingJobs(ctx, jobType, limit): BEGIN; SELECT id,... FROM jobs WHERE status='Pending' (optional type filter, optional LIMIT) FOR UPDATE SKIP LOCKED; UPDATE jobs SET status='Running', updated_at=NOW() WHERE id = ANY($ids); COMMIT. Returns the claimed rows with status already flipped. - New ClaimPendingByAgentID(ctx, agentID): mirrors M31 UNION ALL semantics (direct agent_id match, target->agent JOIN fallback, certificate->target->agent chain for AwaitingCSR) but wraps each branch in FOR UPDATE SKIP LOCKED and flips Deployment/Renewal rows to Running. AwaitingCSR rows are returned in place (state transition deferred until SubmitCSR, consistent with M8 semantics). - Existing GetPendingJobs / ListPendingByAgentID retained for legacy compatibility; their godoc now directs production callers to the Claim* variants. Production caller switches: - internal/service/job.go ProcessPendingJobs: ListByStatus(Pending) -> ClaimPendingJobs(ctx, "", 0). Eliminates the real scheduler race between two replicas tick-firing simultaneously. - internal/service/agent.go GetPendingWork: ListPendingByAgentID -> ClaimPendingByAgentID. Eliminates the race between two pollers for the same agent (e.g. brief network blip causing duplicate poll) and between a scheduler tick and an agent poll. 
Safety argument for pre-flipping Pending -> Running inside the claim transaction: ProcessRenewalJob and ProcessDeploymentJob both call UpdateStatus(Running) unconditionally on entry, so an early flip is idempotent. On panic, the scheduler's panic recovery leaves the job in Running which the existing stale-running reaper handles. Tests (internal/repository/postgres/repo_test.go, skipped in -short): - TestJobRepository_ClaimPendingJobs_FlipsToRunning: seed 5 Pending, claim once, assert all 5 returned + DB rows Running, residual claim returns 0. - TestJobRepository_ClaimPendingJobs_ConcurrentDisjoint: seed M=40 Pending Renewals, spawn N=8 goroutines each calling ClaimPendingJobs(_, JobTypeRenewal, 1) in a loop. Invariants: (a) no job ID claimed by more than one worker, (b) sum of claims == 40, (c) all 40 rows in Running state in the DB. Bounded empty-streak guard (20 iterations) covers SKIP LOCKED transient zeros under contention. - TestJobRepository_ClaimPendingByAgentID_TransitionsDeployments: seeds 2 Pending Deployment + 1 AwaitingCSR for agent A plus 1 Pending Renewal for agent B (scope check). Asserts deployments flip to Running, AwaitingCSR is returned but preserved, agent B's renewal never appears. Mock updates: testutil_test.go, lifecycle_test.go, verification_test.go gained ClaimPendingJobs/ClaimPendingByAgentID on their mock job repos mirroring the real Pending -> Running semantics. Mocks intentionally do NOT write to StatusUpdates (that map tracks UpdateStatus() call history specifically; the real claim path uses a bulk UPDATE, not UpdateStatus). Verification (CI-scope): - go build ./cmd/...: ok - go vet ./...: ok - go test -race -short on service, api/handler, api/middleware, scheduler, connector/..., domain, validation, tlsprobe: ok - Coverage gates: service 67.6% (>=55), handler 78.6% (>=60), middleware 80.0% (>=30), domain 92.7% (>=40). All hold. 
- golangci-lint 2.11.4: 0 issues - govulncheck: no vulnerabilities in call graph - Frontend: tsc clean, 218 vitest tests pass, vite build ok - helm lint + helm template: ok - Invariant sweeps: FOR UPDATE SKIP LOCKED present in job.go; H-1 through H-5 fixtures unchanged. Refs: H-6 in certctl-audit-report.md	2026-04-17 02:34:56 +00:00
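The claim transaction described in this commit can be sketched with `database/sql` as below. This is an illustration under assumptions, not the repository's implementation: the column names `type` and `created_at`, the `ORDER BY`, and the per-row `UPDATE` (used here to avoid a driver-specific array parameter in place of `id = ANY($ids)`) are choices made for the example.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// buildClaimSelect builds the locking SELECT. An empty jobType or a
// non-positive limit disables the corresponding clause.
func buildClaimSelect(jobType string, limit int) (string, []any) {
	q := "SELECT id FROM jobs WHERE status = 'Pending'"
	var args []any
	if jobType != "" {
		args = append(args, jobType)
		q += fmt.Sprintf(" AND type = $%d", len(args))
	}
	q += " ORDER BY created_at"
	if limit > 0 {
		args = append(args, limit)
		q += fmt.Sprintf(" LIMIT $%d", len(args))
	}
	// SKIP LOCKED makes concurrent claimants see disjoint row sets.
	return q + " FOR UPDATE SKIP LOCKED", args
}

// claimPendingJobs selects and flips Pending -> Running in one transaction.
func claimPendingJobs(ctx context.Context, db *sql.DB, jobType string, limit int) ([]string, error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return nil, err
	}
	defer tx.Rollback() // returns sql.ErrTxDone after a successful Commit; ignored

	q, args := buildClaimSelect(jobType, limit)
	rows, err := tx.QueryContext(ctx, q, args...)
	if err != nil {
		return nil, err
	}
	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			rows.Close()
			return nil, err
		}
		ids = append(ids, id)
	}
	if err := rows.Err(); err != nil {
		rows.Close()
		return nil, err
	}
	rows.Close() // release the connection before issuing UPDATEs on the same tx

	for _, id := range ids {
		if _, err := tx.ExecContext(ctx,
			"UPDATE jobs SET status = 'Running', updated_at = NOW() WHERE id = $1", id); err != nil {
			return nil, err
		}
	}
	if err := tx.Commit(); err != nil {
		return nil, err
	}
	return ids, nil
}

func main() {
	q, args := buildClaimSelect("Renewal", 1)
	fmt.Println(q, args)
}
```

The row locks taken by the SELECT are held until commit, so a second replica running the same query during that window skips those rows instead of blocking on them.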
shankar0123	6315ef102a	security(globalsign): remove InsecureSkipVerify and pin CA pool (H-5) The GlobalSign Atlas HVCA connector previously used InsecureSkipVerify:true on its mTLS TLS config, disabling server certificate validation and defeating the purpose of the client-side mTLS handshake. This was a CWE-295 Improper Certificate Validation vulnerability silently degrading trust on every production call to GlobalSign's signing API. Remediation (per H-5 audit finding, Lens 4.4): - Remove InsecureSkipVerify from all three http.Client construction sites (ValidateConfig, getHTTPClient, and legacy initialisation path). - Introduce buildServerTLSConfig() helper that constructs tls.Config with MinVersion: tls.VersionTLS12 (addresses adjacent L-1 recommendation). - New optional config field `server_ca_path` (env: CERTCTL_GLOBALSIGN_SERVER_CA_PATH). When unset the connector trusts the system root CA bundle (correct default for GlobalSign's publicly-trusted HVCA endpoints). When set the bundle is loaded via x509.NewCertPool() + AppendCertsFromPEM, and only those roots are trusted (supports private HVCA deployments and defence-in-depth root pinning). - Error wrapping chain: "failed to read server CA bundle at %s" and "no valid PEM certificates found in server CA bundle at %s" surface config problems at ValidateConfig time instead of silently failing at request time. Docs, config, service env-seed, and GUI issuer type definition updated to expose the new field. Tests: 9 dead `InsecureSkipVerify: true` client TLSClientConfig blocks (no-ops against httptest.NewServer plain-HTTP) replaced with bare http.Client; new TestGlobalSign_ServerTLSConfig covers pinned-CA trust, untrusted-server rejection, missing-file and invalid-PEM error paths. Verification: - go build ./... clean - go vet ./... clean - go test -race ./internal/connector/issuer/globalsign/... ./internal/config/... ./internal/service/... ok - go test ./... 
(excluding testcontainers-gated repo layer) ok - golangci-lint run ./... 0 issues - govulncheck ./... 0 reachable vulns - Per-layer coverage: service 68.7% (≥55), handler 83.6% (≥60), domain 82.0% (≥40), middleware 63.8% (≥30) - globalsign package coverage: 75.9% - Invariant sweep: 0 InsecureSkipVerify references remain in globalsign package (only a test-file comment documenting the removal).	2026-04-17 01:40:58 +00:00
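A helper along the lines of the `buildServerTLSConfig` described in this commit might look like the sketch below. The signature is assumed (the commit does not show it); the behaviour follows the message: TLS 1.2 minimum, system roots when no path is set, and only the supplied bundle trusted when a path is set.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// buildServerTLSConfig returns a client-side TLS config for talking to the CA.
// The parameter name and signature are assumptions for this sketch.
func buildServerTLSConfig(serverCAPath string) (*tls.Config, error) {
	cfg := &tls.Config{MinVersion: tls.VersionTLS12}
	if serverCAPath == "" {
		// A nil RootCAs makes crypto/tls use the host's root CA set.
		return cfg, nil
	}
	pemBytes, err := os.ReadFile(serverCAPath)
	if err != nil {
		return nil, fmt.Errorf("failed to read server CA bundle at %s: %w", serverCAPath, err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pemBytes) {
		return nil, fmt.Errorf("no valid PEM certificates found in server CA bundle at %s", serverCAPath)
	}
	cfg.RootCAs = pool // only these roots are trusted
	return cfg, nil
}

// demoTLS exercises the default path and both configuration error paths.
func demoTLS() (systemRoots, missingRejected, invalidRejected bool) {
	cfg, err := buildServerTLSConfig("")
	systemRoots = err == nil && cfg.RootCAs == nil && cfg.MinVersion == tls.VersionTLS12

	_, err = buildServerTLSConfig("/nonexistent/ca.pem")
	missingRejected = err != nil

	f, err := os.CreateTemp("", "bad-ca-*.pem")
	if err != nil {
		return
	}
	defer os.Remove(f.Name())
	f.WriteString("not a certificate")
	f.Close()
	_, err = buildServerTLSConfig(f.Name())
	invalidRejected = err != nil
	return
}

func main() {
	fmt.Println(demoTLS())
}
```

Note that `InsecureSkipVerify` is never set, so the zero value (verification on) applies.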
shankar0123	119986fa7e	security: add SSRF defence-in-depth for webhook notifier (fixes H-4) The webhook notifier would previously accept any operator-configured URL and hand it to http.Client without validation. That exposed two SSRF classes (CWE-918): * Reserved-address reachability — a misconfigured or adversarial webhook URL pointing at 127.0.0.1, ::1, 169.254.169.254 (cloud metadata), or 0.0.0.0 would succeed, exfiltrating request bodies to local services or leaking short-lived cloud credentials. * DNS rebinding — a hostname resolving to a public IP at validation time and to a reserved IP at dial time would bypass any URL-string-only check. Fix installs two independent layers: * validation.ValidateSafeURL runs at config-ingest time and before every outbound POST. It rejects non-HTTP(S) schemes, empty hosts, and literal reserved-IP hosts with a clear operator-facing error. This is a fast early diagnostic. * validation.SafeHTTPDialContext is installed on the webhook http.Transport. It re-resolves the host at dial time, rejects any resolved address whose address lies in a reserved range (loopback, link-local, multicast, broadcast, unspecified, IPv6 link-local/multicast), and pins the resolved IP into the final dial address so the TLS handshake targets the exact IP the guard approved. This is the authoritative, TOCTOU-safe defence against DNS rebinding. The two layers are complementary — validateURL fails fast on obvious misconfiguration; SafeHTTPDialContext fails closed when DNS changes between validation and dial. The existing unexported isReservedIP helper in internal/service/network_scan.go is extracted into internal/validation.IsReservedIP with byte-identical behaviour so the webhook notifier and the network scanner share a single authoritative reserved-address list. RFC 1918 ranges remain intentionally allowed (certctl's self-hosted design). 
Broader unspecified / IPv6 link-local coverage lives only in the stricter dial-time policy, where it belongs for outbound HTTP egress. Test seam: Connector gains an unexported validateURL func field and a same-package newForTest constructor that installs a permissive validator and the stdlib default transport. Production callers cannot reach this constructor because it is unexported; only same-package tests (package webhook) can use it. Same-package happy-path tests call newForTest so they can point at httptest loopback servers without being blocked by the production guard. The four SSRF-rejection tests that verify the guard itself still call New so they exercise the real, strict validator. This keeps the production SSRF defence unconditionally on in real code while preserving legitimate unit-test coverage. Tests ----- * internal/validation/ssrf_test.go (new) — 16-subtest pin on IsReservedIP that is byte-identical with the original network-scanner behaviour; ValidateSafeURL accept/reject matrix covering HTTPS/HTTP, reserved-literal IPv4/IPv6, dangerous schemes (file/gopher/ftp/javascript/data/ldap/dict/jar), missing hosts, and malformed inputs; SafeHTTPDialContext rejects literal reserved addresses and hosts resolving to reserved addresses (DNS-rebinding coverage via localhost). * internal/connector/notifier/webhook/webhook_test.go — happy-path tests switched to newForTest; production-guard SSRF-rejection tests (TestValidateConfig_RejectsReservedURLs, TestValidateConfig_RejectsDangerousScheme, TestPostWebhook_RejectsReservedURL, TestPostWebhook_RejectsDangerousScheme) continue to call New so they exercise the unconditionally-installed production validator. Wire-format invariants preserved -------------------------------- * Outbound HTTP request shape (method, headers, body, HMAC signature) unchanged. * network_scan.go behaviour unchanged — validation.IsReservedIP is byte-identical with the deleted helper. 
* RFC 1918 (10/8, 172.16/12, 192.168/16) remain allowed for both outbound webhook and CIDR expansion, matching the self-hosted design. Verification ------------ * go test -race ./internal/validation/... ./internal/connector/notifier/webhook/... ./internal/service/... — green. * Full-suite go test -race ./... — green (GOTMPDIR=/dev/shm to sidestep full /tmp on the sandbox host). * Coverage gates pass: service 68.8% >= 55%, handler 83.6% >= 60%, domain 82.0% >= 40%, middleware 63.8% >= 30%. Overall 67.8%. Webhook package 91.5% line coverage; validation package ValidateSafeURL/SafeHTTPDialContext 78-100% per function. * govulncheck ./... — no vulnerabilities found. * golangci-lint run on touched H-4 production code — clean. Pre-existing errcheck/gosimple warnings in scope-adjacent files (webhook_test.go:270 w.Write, network_scan.go:120/173/265/305) verified against `3853b74` to predate this commit; left alone per scope guard. Operational notes ----------------- * No migration needed. The guard is pure Go code; existing webhook configs continue to work unless they point at reserved addresses, in which case they now fail closed with a clear error. * Existing operators who rely on webhook POST to 127.0.0.1 or ::1 (e.g., local receivers on the same host as certctl-server) must expose their receiver on an RFC 1918 address or public IP. This is deliberate — the threat model for webhook notifiers includes untrusted operator-supplied URLs. Scope guard: H-4 only. H-5, H-6, M-*, L-*, and I-* findings remain open and are tracked separately. No drive-by refactors.	2026-04-17 00:34:47 +00:00
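The two-layer guard in this commit (a URL check at config time plus a dial-time check that pins the approved IP) can be sketched as follows. The function names are lower-cased stand-ins, and the reserved-address list is a simplification: the commit notes that the real early check and dial-time policy differ slightly, whereas this sketch uses one list for both. RFC 1918 ranges are deliberately not blocked, matching the message.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/url"
	"time"
)

// isReservedIP reports loopback, link-local, multicast, unspecified, and the
// IPv4 broadcast address.
func isReservedIP(ip net.IP) bool {
	return ip.IsLoopback() || ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast() ||
		ip.IsMulticast() || ip.IsUnspecified() || ip.Equal(net.IPv4bcast)
}

// validateSafeURL is the fast, string-level check run at config time.
func validateSafeURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("invalid webhook URL: %w", err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("webhook URL scheme %q is not allowed", u.Scheme)
	}
	host := u.Hostname()
	if host == "" {
		return errors.New("webhook URL has no host")
	}
	if ip := net.ParseIP(host); ip != nil && isReservedIP(ip) {
		return fmt.Errorf("webhook URL host %s is a reserved address", host)
	}
	return nil
}

// safeDialContext re-resolves the host at dial time, rejects reserved results,
// and dials the exact IP it approved so DNS cannot change in between.
func safeDialContext(ctx context.Context, network, addr string) (net.Conn, error) {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		return nil, err
	}
	ips, err := net.DefaultResolver.LookupIPAddr(ctx, host)
	if err != nil {
		return nil, err
	}
	if len(ips) == 0 {
		return nil, fmt.Errorf("no addresses found for %s", host)
	}
	for _, ip := range ips {
		if isReservedIP(ip.IP) {
			return nil, fmt.Errorf("refusing to dial %s: resolves to reserved address %s", host, ip.IP)
		}
	}
	d := net.Dialer{Timeout: 10 * time.Second}
	return d.DialContext(ctx, network, net.JoinHostPort(ips[0].IP.String(), port))
}

// newWebhookClient installs the dial-time guard on the transport.
func newWebhookClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{DialContext: safeDialContext},
		Timeout:   15 * time.Second,
	}
}

// dialRejectsLoopback checks the dial-time layer without touching the network.
func dialRejectsLoopback() bool {
	conn, err := safeDialContext(context.Background(), "tcp", "127.0.0.1:80")
	if conn != nil {
		conn.Close()
	}
	return err != nil
}

func main() {
	_ = newWebhookClient()
	fmt.Println(validateSafeURL("http://169.254.169.254/latest/meta-data"))
	fmt.Println(dialRejectsLoopback())
}
```

Because the transport still sends the original hostname for TLS server name verification, pinning the IP in the dial address does not weaken certificate checks.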
shankar0123	3853b7460c	security: reject CRLF/NUL in email headers to prevent SMTP injection (fixes H-3) H-3 in certctl-audit-report.md: caller-supplied From/To/Subject were interpolated directly into the SMTP DATA payload and handed to client.Mail / client.Rcpt with no sanitization, allowing an attacker who controls any of those values to inject extra headers (Bcc:, Reply-To:), split the message body (CRLFCRLF), or tamper with the SMTP envelope. CWE-113. Fix: - New package helper internal/validation.ValidateHeaderValue(field, value). Rejects CR ("\r"), LF ("\n"), and NUL ("\x00") with an error that names the offending field but does NOT echo the raw value, so log readers cannot be attacked with injected content. Silent stripping was considered and rejected: authentication-relevant headers must fail visibly. - Two-layer defense in internal/connector/notifier/email/email.go: (1) primary guard at the top of sendEmail / sendHTMLEmail, which blocks tampering of the SMTP envelope (client.Mail, client.Rcpt) since net/smtp does not sanitize those arguments; and (2) defense-in-depth guard inside formatEmailMessage / formatHTMLEmailMessage, catching any future caller that bypasses sendEmail. Both format functions now return an error. - Body content is intentionally NOT validated — CR/LF in body is legal RFC 5322 content and net/smtp handles dot-stuffing. Tests: - internal/validation/headers_test.go: 3 functions (AcceptsSafeInput, RejectsControlCharacters, DefaultFieldName) covering plain ASCII, UTF-8 multibyte, tabs, typical email addresses, CRLF injection, lone CR, lone LF, NUL, CRLFCRLF body split, trailing CR, leading LF. Each reject case asserts the field name IS in the error and the raw offending value IS NOT (anti-log-injection). - internal/connector/notifier/email/email_test.go: added TestEmail_FormatEmailMessage_RejectsCRLFInjection and TestEmail_FormatHTMLEmailMessage_RejectsCRLFInjection. Existing format tests updated for the new (bytes, error) signature. 
Wire-format invariants preserved: - SMTP DATA headers still use CRLF separators and RFC 1123Z Date (unchanged). - Content-Type headers unchanged (text/plain for plain, text/html + MIME-Version: 1.0 for HTML). - No change to message encoding or transport. Verification (Go 1.25.9 linux-arm64, parent `e9947dc`): - go build ./... clean - go vet ./... clean - go test -race ./internal/validation/... ok - go test -race ./internal/connector/notifier/email/... ok - go test -race ./internal/connector/notifier/webhook/... ok - Per-layer coverage gates all pass: validation 95.1% (+0.7 vs baseline 94.4%) email 39.7% (+1.4 vs baseline 38.3%) service 67.8% (unchanged) handler 78.6% (unchanged) middleware 80.0% (unchanged) domain 92.7% (unchanged) - govulncheck ./... No vulnerabilities found - golangci-lint run ./internal/validation/... ./internal/connector/notifier/email/... 0 issues Operational note: SMTP sends that would previously deliver a tampered message now fail fast at the notifier with a clear error. Operators who were relying on header-injection-shaped inputs (there should be none in practice — all callers are internal certctl code) will see "failed to format message: <field> contains disallowed control character" in logs. Scope: H-3 only. H-4 (webhook SSRF) follows in a separate commit.	2026-04-17 00:08:20 +00:00
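A header check in the spirit of `ValidateHeaderValue` could be as small as the sketch below. The lower-cased name and the `"header"` fallback for an empty field name are assumptions; the error text follows the log line quoted in the operational note, and the offending value is intentionally left out of the error.

```go
package main

import (
	"fmt"
	"strings"
)

// validateHeaderValue rejects CR, LF, and NUL in a header value.
// The error names the field but never echoes the value, so injected
// content cannot reach the logs through the error string.
func validateHeaderValue(field, value string) error {
	if field == "" {
		field = "header" // assumed default name for this sketch
	}
	if strings.ContainsAny(value, "\r\n\x00") {
		return fmt.Errorf("%s contains disallowed control character", field)
	}
	return nil
}

func main() {
	fmt.Println(validateHeaderValue("Subject", "Certificate renewal due"))
	fmt.Println(validateHeaderValue("To", "ops@example.com\r\nBcc: attacker@example.com"))
}
```

Calling it once before the SMTP envelope commands and again inside the message formatter gives the two layers the commit describes.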
shankar0123	b813660c74	security: require SCEP challenge password when SCEP enabled (fixes H-2) Problem (CWE-306 Missing Authentication for Critical Function): internal/service/scep.go PKCSReq skipped the shared-secret check when s.challengePassword was empty. An unconfigured-but-enabled SCEP server accepted any unauthenticated client reaching /scep and issued a certificate against the configured issuer for any CSR with a valid signature. No audit trail distinguished authenticated from unauthenticated enrollments. This matches the two-layer fail-closed pattern already used for C-2 (`f549a7a`): reject at startup AND reject at the service boundary. Fix (two layers, defense-in-depth): Layer 1 — startup pre-flight in cmd/server/main.go: preflightSCEPChallengePassword returns a non-nil error when SCEP is enabled and CERTCTL_SCEP_CHALLENGE_PASSWORD is empty. main logs and os.Exit(1)s before the SCEP service is constructed. Disabled SCEP is unaffected. The helper is unit-testable in isolation. Layer 2 — service-layer rejection in internal/service/scep.go: PKCSReq refuses enrollment when s.challengePassword == "" even though main already blocks this state — protects future call sites (tests, library reuse, a REST-over-HTTPS wrapper). When a secret is configured, the comparison now uses crypto/subtle.ConstantTimeCompare so response time does not leak the configured secret through a short-circuiting byte compare. Files: - cmd/server/main.go: preflightSCEPChallengePassword helper; call site inside the `if cfg.SCEP.Enabled` block before issuer lookup; fatal slog error references CWE-306 and names the env var so operators can diagnose the startup failure without reading code. - cmd/server/main_test.go: TestPreflightSCEPChallengePassword with five table-driven subtests (disabled empty, disabled set, enabled empty rejected, enabled set, single-char boundary). 
The enabled-empty case asserts the error string contains both CERTCTL_SCEP_CHALLENGE_PASSWORD and CWE-306 so the log message remains actionable. - internal/config/config.go: SCEPConfig.ChallengePassword godoc now states the field is REQUIRED when SCEP.Enabled and cross-references preflightSCEPChallengePassword. - internal/service/scep.go: imports crypto/subtle; PKCSReq rewritten with the two-layer check; comment block cites H-2 / CWE-306 and the constant-time rationale. - internal/service/scep_test.go: existing tests that relied on the vulnerable empty-password path now configure a secret on both sides. TestSCEPService_PKCSReq_ChallengePassword_NotRequired is replaced by TestSCEPService_PKCSReq_ChallengePassword_EmptyServerConfigRejected which iterates ["", "any-value", "guess"] against an unconfigured server and asserts "not configured" in the error. A new TestSCEPService_PKCSReq_ChallengePassword_ConstantTimeLengthIndependence exercises same-prefix-longer and wrong-case inputs to guard against a regression from ConstantTimeCompare to a short-circuiting byte compare. - internal/service/m11c_crypto_enforcement_test.go: four tests (RejectsWeakKey, AcceptsStrongKey, MaxTTL_ForwardedToIssuer, NoProfileRepo_PassesThrough) constructed NewSCEPService with an empty challenge password and exercised PKCSReq through the now-rejected vulnerable path. All four now configure "secret123" on both sides with an inline H-2 comment; the crypto/MaxTTL/profile behavior they assert is unchanged. Wire-format / behavioral invariants preserved: - RFC 8894 SCEP handler is untouched (internal/api/handler/scep.go and internal/pkcs7/): GetCACaps/GetCACert responses, PKIOperation request parsing, and the PKCS#7 certs-only response format are byte-identical. - RFC 7030 EST handler is untouched (internal/api/handler/est.go + internal/pkcs7/). - Revocation idempotency composite key (H-1, migration 000012) untouched. - AES-256-GCM config encryption (C-2) untouched. 
- CRL DER bytes and OCSP response bytes unchanged. Verification: - go build ./... silent success - go vet ./... silent success - go test -race -count=1 ./internal/service/ ./cmd/server/ ./internal/api/handler/ ./internal/integration/ all OK - Coverage with comfortable headroom over CI gates: service 67.8% (gate 55%) handler 79.0% (gate 60%) domain 92.7% (gate 40%) middleware 80.0% (gate 30%) cmd/server 1.6% (preflightSCEPChallengePassword: 100%) internal/service/scep.go PKCSReq statement coverage: 100%. - rg sweeps: no `s.challengePassword != ""` remains; no `challengePassword != s.challengePassword` remains. Operational note: operators with SCEP enabled but no challenge password set will see a fatal startup error and a log line citing CERTCTL_SCEP_CHALLENGE_PASSWORD and CWE-306 after upgrading. This is the intended fail-closed behavior. Fix by either setting the env var to a non-empty shared secret or setting CERTCTL_SCEP_ENABLED=false. Audit report: certctl-audit-report.md (revision 5) logs this under H-2 Resolution Log.	2026-04-16 22:22:51 +00:00
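The two fail-closed layers in this commit can be sketched as a startup check and a service-level comparison. The signatures are assumptions (the real `preflightSCEPChallengePassword` may take a config struct), and `checkChallenge` is a hypothetical stand-in for the relevant part of `PKCSReq`.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
)

// preflightSCEPChallengePassword is the startup-time layer: an enabled SCEP
// server without a shared secret is a configuration error.
func preflightSCEPChallengePassword(enabled bool, password string) error {
	if enabled && password == "" {
		return errors.New("SCEP is enabled but CERTCTL_SCEP_CHALLENGE_PASSWORD is empty (CWE-306)")
	}
	return nil
}

// checkChallenge is the service-level layer. It refuses every request when no
// secret is configured and compares secrets without short-circuiting on the
// first differing byte.
func checkChallenge(configured, supplied string) error {
	if configured == "" {
		return errors.New("SCEP challenge password not configured")
	}
	// ConstantTimeCompare returns 0 immediately when lengths differ;
	// for equal lengths its running time does not depend on the contents.
	if subtle.ConstantTimeCompare([]byte(configured), []byte(supplied)) != 1 {
		return errors.New("invalid SCEP challenge password")
	}
	return nil
}

func main() {
	fmt.Println(preflightSCEPChallengePassword(true, ""))
	fmt.Println(checkChallenge("", "guess"))
	fmt.Println(checkChallenge("secret123", "secret123"))
}
```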
shankar0123	387fb555ac	security: scope revocation unique index to (issuer_id, serial_number) (fixes H-1) RFC 5280 §4.1.2.2 defines certificate serial number uniqueness per issuing CA, not globally. The prior unique index on `certificate_revocations.serial_number` enforced a stricter invariant than the spec: with 12 issuer connectors (Local CA, ACME, Vault, step-ca, OpenSSL, DigiCert, Sectigo, Google CAS, AWS ACM PCA, Entrust, GlobalSign, EJBCA), two distinct certificates legitimately issued by different CAs can share a serial number. Recording a revocation for the second certificate in such a collision was silently dropped via `ON CONFLICT DO NOTHING`, leaving the second cert persistently absent from OCSP/CRL responses. Changes: - Migration 000012 drops `idx_certificate_revocations_serial` and creates `idx_certificate_revocations_issuer_serial` UNIQUE ON (issuer_id, serial_number). Adds a non-unique `idx_certificate_revocations_serial_lookup` to preserve the serial-only fast path for OCSP/CRL probes that already know the issuer scope. - `CertificateRevocationRepository.Create` targets the new composite key in `ON CONFLICT` — same-issuer idempotency preserved, cross-issuer collisions now recorded as distinct rows. - `GetBySerial(serial)` renamed `GetByIssuerAndSerial(issuerID, serial)` on the interface and Postgres impl. All callers (OCSP responder, CRL generator, short-lived-cert exemption check) already have `issuerID` in scope because the protocol paths carry it (`/api/v1/ocsp/{issuer_id}/{serial}`, `/api/v1/crl/{issuer_id}`). - Repository integration test added: `TestRevocationRepository_CrossIssuerSerialCollision` asserts that serial `CAFEBABE01` can be stored under two issuers simultaneously, that lookups return the correct row per (issuer, serial), and that same-issuer idempotency still works (re-inserting (issuer, serial) does not error and does not duplicate). - Existing tests and service/integration mocks updated for the rename. 
Wire-format invariants preserved: CRL DER bytes, OCSP response bytes, and AES-256-GCM config encryption are unaffected — this change touches only revocation-record uniqueness scope. CWE-664.	2026-04-16 21:49:59 +00:00
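The effect of moving the uniqueness scope from the serial alone to (issuer, serial) can be illustrated with an in-memory stand-in for the table. Everything here is hypothetical (`revocationStore`, `create`, `getByIssuerAndSerial`, the issuer IDs); the map key plays the role of the composite unique index and `create` mimics `ON CONFLICT DO NOTHING`.

```go
package main

import "fmt"

// revocationKey plays the role of the composite unique index.
type revocationKey struct {
	IssuerID string
	Serial   string
}

// revocationStore is an in-memory stand-in for the revocations table.
type revocationStore struct {
	rows map[revocationKey]string // value: revocation reason
}

func newRevocationStore() *revocationStore {
	return &revocationStore{rows: make(map[revocationKey]string)}
}

// create inserts a revocation and reports whether a new row was written.
// A repeat for the same (issuer, serial) is a silent no-op.
func (s *revocationStore) create(issuerID, serial, reason string) bool {
	k := revocationKey{IssuerID: issuerID, Serial: serial}
	if _, exists := s.rows[k]; exists {
		return false
	}
	s.rows[k] = reason
	return true
}

// getByIssuerAndSerial looks up a revocation within one issuer's scope.
func (s *revocationStore) getByIssuerAndSerial(issuerID, serial string) (string, bool) {
	reason, ok := s.rows[revocationKey{IssuerID: issuerID, Serial: serial}]
	return reason, ok
}

// demoCollision stores the same serial under two issuers, then repeats one insert.
func demoCollision() (bothStored, repeatIgnored bool, rowCount int) {
	s := newRevocationStore()
	a := s.create("iss-local", "CAFEBABE01", "keyCompromise")
	b := s.create("iss-vault", "CAFEBABE01", "superseded")
	repeat := s.create("iss-local", "CAFEBABE01", "keyCompromise")
	return a && b, !repeat, len(s.rows)
}

func main() {
	fmt.Println(demoCollision())
}
```

With a serial-only key, the second `create` would have been the silent no-op, which is the failure mode the migration removes.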
shankar0123	f549a7aa79	security: fail closed when CERTCTL_CONFIG_ENCRYPTION_KEY is unset (fixes C-2) EncryptIfKeySet/DecryptIfKeySet in internal/crypto/encryption.go previously returned plaintext + wasEncrypted=false when the operator had not configured CERTCTL_CONFIG_ENCRYPTION_KEY. That produced a data-at-rest confidentiality bypass (CWE-311): sensitive fields on dynamically-configured issuer and target rows (source='database') were persisted to PostgreSQL without any encryption, and no caller could distinguish the encrypted from the plaintext branch at runtime. The only visible signal was a single warning log line emitted once at startup. Fail closed instead: - EncryptIfKeySet / DecryptIfKeySet now return crypto.ErrEncryptionKeyRequired (a new exported sentinel, errors.Is-unwrappable) when the key is empty or nil, rather than silently emitting plaintext. The (result, wasEncrypted, err) tuple signature is preserved for source compatibility; only the semantics of the no-key branch changed. - cmd/server/main.go grows a startup pre-flight check: if no encryption key is configured the server lists issuers and targets, counts rows with source='database', and refuses to start (os.Exit(1)) if any exist. Operators must either configure CERTCTL_CONFIG_ENCRYPTION_KEY or remove the exposed rows before the control plane can boot. The warning-only path is retained for the clean-slate case (no database rows). - internal/service/issuer.go's SeedFromEnvVars now guards the encryption call with len(s.encryptionKey) > 0 so env-seeded rows (source='env', which are reconstructable on every boot from process env) continue to persist as plaintext in the 'config' column when no key is configured. Registry load already falls through to cfg.Config when EncryptedConfig is nil. GUI/API write paths (source='database') remain fail-closed via propagation of ErrEncryptionKeyRequired. 
- Integration tests that exercise CreateIssuer via the handler layer now supply a real 32-byte AES-256 test key so the encrypt path runs instead of returning ErrEncryptionKeyRequired. Same pattern in internal/service/ testutil_test.go for consolidated service-layer tests. - internal/crypto/encryption_test.go grows regression guards: TestEncryptIfKeySet_EmptyKeyFailsClosed (nil_key + empty_key subtests), TestDecryptIfKeySet_EmptyKeyFailsClosed (nil_key + empty_key subtests), TestEncryptDecryptIfKeySet_RoundTripProducesDifferentCiphertext, TestDecryptIfKeySet_RejectsTamperedCiphertext, and TestEncryptIfKeySet_PreservesErrEncryptionKeyRequiredSentinel (verifies the sentinel unwraps through fmt.Errorf(%w)-style wrapping). Wire format is unchanged: AES-256-GCM Encrypt/Decrypt/DeriveKey, the 12-byte nonce prefix, the GCM auth tag, the PBKDF2 salt ('certctl-config-encryption-v1'), and the 100,000 iteration count are all byte-identical. Ciphertexts produced before this change remain decryptable. Verified: - go build ./... : clean - go vet ./... : clean - go test -race ./internal/crypto/... ./internal/service/... \ ./internal/integration/... ./cmd/server/... : pass - golangci-lint run ./... : 0 issues - govulncheck ./... : 0 reachable vulnerabilities - rg 'return plaintext, false, nil' internal/ : no matches - Coverage: crypto 85.0% (unchanged), service 67.8% (was 67.9%, noise), cmd/server 0.0% (unchanged baseline). All above CI thresholds. See certctl-audit-report.md for the full finding record and resolution log.	2026-04-16 21:10:40 +00:00
shankar0123	b219e5d68a	security: use crypto/rand for agent API keys (fixes C-1) Replaces math/rand-based agent API key generation in internal/service/agent.go with crypto/rand.Read over a 32-byte buffer encoded with base64.RawURLEncoding, yielding a 43-character URL-safe unpadded ASCII string (256 bits of entropy). generateAPIKey now returns (string, error); Register and RegisterAgent propagate entropy-source failures. hashAPIKey is unchanged — the SHA-256 hashed-at-rest invariant is preserved. Fixes C-1 (CWE-338: Use of Cryptographically Weak Pseudo-Random Number Generator) from certctl-audit-report.md. Changes: - internal/service/agent.go: new imports (crypto/rand, encoding/base64); generateAPIKey rewritten to return (string, error); Register and RegisterAgent updated to propagate the error. - internal/service/agent_test.go: TestGenerateAPIKey_Properties regression test (non-empty, length 43, valid base64url, 32 decoded bytes, no collisions over 64 calls). No entropy-failure test — Go 1.24+ (issue #66821) makes crypto/rand errors fatal, so that branch is defensively unreachable. Verification: - go build ./cmd/server/... ./cmd/agent/... ./cmd/mcp-server/... ./cmd/cli/... → pass - go vet ./... → pass - go test -race (CI scope, 43 packages) → pass - golangci-lint v2.11.4 run ./... → 0 issues - govulncheck ./... → 0 vulnerabilities in certctl code - Coverage: service 68.9% / handler 83.6% / domain 82.0% / middleware 63.8% (all above CI gates 55/60/40/30) - grep math/rand in internal/ and cmd/ → zero production hits - No caller assumes the old 32-char length or legacy charset	2026-04-16 19:43:19 +00:00
shankar0123	13cd4d98ba	feat(V2.2): bulk revocation — filter-based fleet-wide certificate revocation Add POST /api/v1/certificates/bulk-revoke with filter criteria (profile_id, owner_id, agent_id, issuer_id, team_id, certificate_ids), partial-failure tolerance, and audit trail. Includes MCP tool, CLI command (certs bulk-revoke), server-side bulk modal in GUI replacing client-side sequential loop, OpenAPI spec, compliance mapping updates, and 21 new tests (12 service, 7 handler, 1 CLI, 1 frontend). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 00:06:34 -04:00
shankar0123	84bc1245a1	fix: case-insensitive issuer type validation + missing M49 types (#7) Backend rejected lowercase type strings (e.g., "acme") sent by older cached frontends. Add normalizeIssuerType() with alias map for case-insensitive lookup, wire into both Create paths. Add missing Entrust/GlobalSign/EJBCA to validIssuerTypes. Add lowercase fallbacks to issuer factory switch. 39 new test subtests covering normalization, lowercase create flows, and M49 type acceptance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 23:20:32 -04:00
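A case-insensitive alias lookup of the kind commit 84bc1245a1 describes might look like the sketch below. Only the function name `normalizeIssuerType` and the idea of an alias map come from the commit; the return signature, the map contents, and the canonical spellings of the Entrust, GlobalSign, and EJBCA types are assumptions. The ACME, GenericCA, StepCA, and OpenSSL IDs follow the alignment noted in commit 93e1dc598c.

```go
package main

import (
	"fmt"
	"strings"
)

// issuerTypeAliases maps lowercase input to an assumed canonical type ID.
var issuerTypeAliases = map[string]string{
	"acme":       "ACME",
	"genericca":  "GenericCA",
	"local":      "GenericCA", // legacy frontend alias
	"stepca":     "StepCA",
	"openssl":    "OpenSSL",
	"entrust":    "Entrust",
	"globalsign": "GlobalSign",
	"ejbca":      "EJBCA",
}

// normalizeIssuerType trims and lowercases the input before the lookup,
// so "acme", "ACME", and " Acme " all resolve to the same canonical ID.
func normalizeIssuerType(raw string) (string, error) {
	canonical, ok := issuerTypeAliases[strings.ToLower(strings.TrimSpace(raw))]
	if !ok {
		return "", fmt.Errorf("unsupported issuer type %q", raw)
	}
	return canonical, nil
}

func main() {
	for _, in := range []string{"acme", "StepCA", "local", "bogus"} {
		out, err := normalizeIssuerType(in)
		fmt.Println(in, out, err)
	}
}
```

Normalizing once at the API boundary means the factory switch and validation set only ever see canonical IDs.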
shankar0123	e1bcde4cf1	feat(M50): cloud secret manager discovery — AWS SM, Azure KV, GCP SM Extend certificate discovery from filesystem + network to cloud secret managers. Three pluggable DiscoverySource connectors feed into the existing discovery pipeline via sentinel agent pattern, with a 9th scheduler loop for periodic cloud scanning. - AWS Secrets Manager: aws-sdk-go-v2, tag/prefix filtering, 10 tests - Azure Key Vault: stdlib HTTP + OAuth2, base64 DER/PEM, 16 tests - GCP Secret Manager: stdlib HTTP + JWT OAuth2, label filter, 14 tests - CloudDiscoveryService orchestrator with 9 tests - 9th scheduler loop (6h default, atomic.Bool idempotency) - Discovery page: color-coded source type badges - 14 new env vars across CloudDiscoveryConfig structs - Docs: connectors.md, architecture.md, features.md, README updated 49 new tests. All CI checks pass (go vet, race, lint, coverage). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 23:01:00 -04:00
shankar0123	3f619bcaac	feat(M49): Entrust, GlobalSign & EJBCA issuer connectors Add three new issuer connectors completing commercial and open-source CA coverage. Entrust uses mTLS client certificate auth with sync/async issuance. GlobalSign Atlas uses mTLS + API key/secret dual auth with serial-based tracking. EJBCA supports dual auth (mTLS or OAuth2) for self-hosted Keyfactor CAs. Each connector implements the full issuer.Connector interface (9 methods), includes httptest-based unit tests (~14 each), and follows established patterns (injectable HTTP clients, RFC 5280 revocation reason mapping, CRL/OCSP delegated to CA). Also includes: issuer factory cases, env var seeding, config structs, domain types, seed data (3 rows, all disabled), OpenAPI enum updates, frontend issuer catalog entries with config fields, and full docs (connectors.md, architecture.md, features.md, README). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 22:24:12 -04:00
shankar0123	f3a85d6b08	fix: remove unused createTestCert function in tlsprobe tests golangci-lint (unused linter) flagged createTestCert as dead code — only createTestCertWithKey is called by the actual tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 21:54:38 -04:00
shankar0123	596d86a206	feat(M48): continuous TLS health monitoring — endpoint state machine, shared tlsprobe, 8 API endpoints, GUI Adds continuous TLS endpoint health monitoring that closes the deploy→verify→monitor loop. After M25 verifies a deployment succeeded once, M48 continuously confirms it stays healthy. Key components: - Shared `internal/tlsprobe/` package extracted from network scanner for reuse - Health status state machine: healthy → degraded (2 failures) → down (5 failures), plus cert_mismatch when served fingerprint differs from expected - 8th scheduler loop (60s tick, per-endpoint configurable intervals) - PostgreSQL migration 000011: endpoint_health_checks + endpoint_health_history tables - 8 REST API endpoints (CRUD, history, acknowledge, summary) - Health Monitor GUI page with summary bar, status table, create modal, auto-refresh - 38 new tests (5 tlsprobe + 11 domain + 10 service + 8 handler + 4 frontend) - All coverage thresholds maintained (service 68%, handler 83%, domain 87%, middleware 63%) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 21:45:45 -04:00
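The health status state machine in commit 596d86a206 can be illustrated with the sketch below. The four status names and the thresholds of 2 and 5 failures come from the commit message. Everything else is an assumption: the `Endpoint` type, the `Record` method, counting failures consecutively, resetting the counter on a successful probe, and leaving the status unchanged after a single failure.

```go
package main

import "fmt"

type HealthStatus string

const (
	StatusHealthy      HealthStatus = "healthy"
	StatusDegraded     HealthStatus = "degraded"
	StatusDown         HealthStatus = "down"
	StatusCertMismatch HealthStatus = "cert_mismatch"
)

const (
	degradedAfter = 2 // failures before healthy -> degraded
	downAfter     = 5 // failures before degraded -> down
)

// Endpoint is a hypothetical in-memory view of one monitored endpoint.
type Endpoint struct {
	ExpectedFingerprint string
	Failures            int
	Status              HealthStatus
}

// Record applies one probe result to the endpoint state.
func (e *Endpoint) Record(probeOK bool, servedFingerprint string) {
	if !probeOK {
		e.Failures++
		switch {
		case e.Failures >= downAfter:
			e.Status = StatusDown
		case e.Failures >= degradedAfter:
			e.Status = StatusDegraded
		}
		return // a single failure leaves the status as it was
	}
	e.Failures = 0
	if e.ExpectedFingerprint != "" && servedFingerprint != e.ExpectedFingerprint {
		e.Status = StatusCertMismatch
		return
	}
	e.Status = StatusHealthy
}

// String renders the state for logging.
func (e *Endpoint) String() string {
	return fmt.Sprintf("%s (failures=%d)", e.Status, e.Failures)
}

func main() {
	ep := &Endpoint{ExpectedFingerprint: "aa:bb", Status: StatusHealthy}
	for i := 0; i < 5; i++ {
		ep.Record(false, "")
		fmt.Println(ep)
	}
	ep.Record(true, "cc:dd") // reachable, but serving the wrong certificate
	fmt.Println(ep)
	ep.Record(true, "aa:bb")
	fmt.Println(ep)
}
```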
shankar0123	f2e60b93a3	feat(M11c): crypto policy enforcement — CSR validation, MaxTTL caps, key metadata Enforce certificate profile crypto constraints across all 5 issuance paths (renewal, agent CSR, EST, SCEP). ValidateCSRAgainstProfile() rejects CSRs with key algorithm/size that don't match profile rules. MaxTTL enforcement caps certificate validity per issuer connector (Local CA, Vault, step-ca enforce directly; ACME/DigiCert/Sectigo pass through). Key algorithm and size are now persisted in certificate_versions for audit compliance. 16 new tests (12 service-layer + 4 Local CA connector). Removes hardcoded version number from GUI sidebar. Documentation updated across architecture, features, connectors, and README. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 21:05:14 -04:00
shankar0123	bcefb11e65	feat(M51): add SCEP server (RFC 8894) for MDM and network device enrollment Implements Simple Certificate Enrollment Protocol with single-endpoint operation-based dispatch (GetCACaps, GetCACert, PKIOperation), PKCS#7 SignedData CSR extraction with fallback for raw/base64 CSR, challenge password authentication via CSR attributes, and shared internal/pkcs7 package extracted from EST handler to eliminate code duplication. 24 new tests (11 service + 13 handler) plus 5 shared pkcs7 package tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 16:47:18 -04:00
shankar0123	c015cab2f4	docs: rewrite features.md, audit README + architecture against repo Rewrote docs/features.md from scratch as authoritative feature inventory (1255 lines, every claim verified against source files). Audited README.md and architecture.md against repo — fixed 19 stale references: K8s Secrets status, issuer counts, dashboard page counts, CI thresholds, missing connectors in Mermaid diagrams, OpenAPI operation count, GetCACertPEM behavior, and V2/V4 roadmap accuracy. Also includes related fixes discovered during audit: - Scheduler skips expired/failed/revoked certs from auto-renewal - Seed demo expiry dates moved outside 31-day scheduler query window - Agent pages use correct last_heartbeat_at field name Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 00:22:57 -04:00
shankar0123	68f6fd474b	fix: return 409 on duplicate issuer name, improve error handling and onboarding defaults Closes #7. The issuer create/update handlers swallowed all service errors as generic 500s. Now differentiates: 409 for UNIQUE constraint violations, 400 for unsupported issuer type, 404 for not-found on update, 500 for unknown errors. Adds structured error logging via slog. OnboardingWizard now pre-populates config field defaults when a type is selected (matching IssuersPage behavior), preventing empty required fields from causing silent failures. install-agent.sh hardened for curl|bash usage: --agent-id flag, =value syntax, /dev/tty stdin reopening, proper stderr routing in download_binary, non-interactive install examples in help text, and updated wizard commands. Adds adversarial security tests for EST, path traversal, and query injection handlers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-12 19:18:32 -04:00
shankar0123	370f856725	fix: resolve 8 staticcheck lint errors in test files SA1029: use typed context key instead of string in main_test.go S1039: remove unnecessary fmt.Sprintf in validation_test.go SA4023: fix unreachable nil check on concrete error type SA4006: fix unused variable assignments in stepca_test.go (4 occurrences) SA4000: fix duplicate expression in ssh_test.go (BEGIN vs END CERTIFICATE) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-09 23:27:57 -04:00
shankar0123	7382e5f03b	test: comprehensive test gap closure across 24 packages Close coverage gaps identified by dual-audit (qualitative + quantitative). New test files for config (0%→98%), router (0%→100%), handler validation, health, audit, response helpers, webhook notifier (0%→88%), email notifier, middleware (recovery, rate limiter), domain profile, service nil-safety, config helpers, issuer bootstrap, and server bootstrap wiring. Expanded existing tests for ACME (34%→42%), step-ca (42%→52%), F5, SSH, agent (43%→63%), scheduler (88%→99%), renewal service, and issuerfactory. All tests pass: go test -short, go vet, go test -race clean. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-09 23:09:40 -04:00
shankar0123	5567d4b411	feat(M47): add Kubernetes Secrets target + AWS ACM PCA issuer connectors Implement both M47 connectors with full cross-layer wiring: Kubernetes Secrets target: DNS-1123 validation, kubernetes.io/tls Secret create-or-update, chain concatenation, serial number validation, Helm RBAC gating. 18 tests. AWS ACM Private CA issuer: synchronous issuance (like Vault), ARN regex validation, RFC 5280 revocation reason mapping, CA cert retrieval, factory + env var seeding. 23 tests. Cross-cutting: domain types, service validation, config, factory, agent dispatch, frontend (TargetsPage, issuerTypes), OpenAPI, seed data, Helm chart, connectors docs, README. Testing docs (testing-guide, qa-test-guide, qa_test.go) with Parts thematically integrated near related connectors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-07 20:21:09 -04:00
shankar0123	93e1dc598c	fix: resolve frontend-to-backend mapping gaps across API types, config fields, and issuer IDs Full audit of all ~100 backend API endpoints against frontend client functions and TypeScript interfaces. Fixes field name mismatches, missing client functions, phantom interface fields, type coercion for Go bool/int config fields, and issuer type ID alignment with backend domain constants. Backend: - issuer.go/target.go: GUI-created entities default enabled=true (Go bool zero value was overriding DB DEFAULT) Frontend types (types.ts): - Certificate: fingerprint→fingerprint_sha256, phantom fields made optional - CertificateVersion: fingerprint→fingerprint_sha256, chain_pem→pem_chain, removed phantom version/cert_pem fields - Job: error_message→last_error (matches Go json tag) Frontend client (client.ts): - Added getNotification(id) and getAuditEvent(id) for existing backend routes Frontend pages: - CertificateDetailPage: derives serial/fingerprint/issuedAt from latest CertificateVersion instead of empty Certificate fields - JobsPage/JobDetailPage: error_message→last_error - TargetsPage: reload_cmd→reload_command, validate_cmd→validate_command, added missing config fields per backend structs (validate_command for NGINX/Apache, hostname/winrm_timeout for IIS, private_key/passphrase/ cert_mode/key_mode for SSH, winrm_https/winrm_insecure for WinCertStore, create_keystore for JavaKeystore, mode for Dovecot), type coercion via buildConfigPayload() with BOOL_FIELDS/INT_FIELDS sets, IIS WinRM nesting - TargetDetailPage: added passphrase to sensitiveKeys redaction - issuerTypes.ts: type IDs aligned to backend constants (acme→ACME, local→GenericCA, stepca→StepCA, openssl→OpenSSL), backward compat aliases preserved, step-ca config fields updated to match backend struct Utilities (utils.ts): - formatDate/formatDateTime accept string|undefined|null Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-05 21:09:48 -04:00
shankar0123	25f33b830f	fix: resolve golangci-lint issues in wincertstore connector Remove unnecessary fmt.Sprintf wrapping a string literal (staticcheck S1039), remove unused tempFileForPFX function, and clean up unused os import. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-05 19:16:34 -04:00

154 Commits