certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 22:51:30 +00:00

Author	SHA1	Message	Date
Shankar	345bafe5aa	Bundle C: Renewal/reliability cluster — 7 findings closed Closes M-006 + M-007 + M-008 + M-015 + M-016 + M-019 + M-020 from comprehensive-audit-2026-04-25. M-028 was already closed by the Bundle B CI follow-up. M-006 (CWE-913) — Idempotent migration 000014 migrations/000014_policy_violation_severity_check.up.sql: Prepended ALTER TABLE ... DROP CONSTRAINT IF EXISTS before the ADD. Mirrors the down migration's existing IF EXISTS shape and the M-7 idempotent-index idiom. Re-runs against partially-applied DBs now succeed. M-007 — Bulk-op partial-failure tests (3 new) internal/api/handler/bulk_partial_failure_test.go: TestBulkRevoke_PartialFailure_ReportsBoth TestBulkRenew_PartialFailure_ReportsBoth TestBulkReassign_PartialFailure_ReportsBoth Each asserts HTTP 200 + both success/failure counters round-trip + per-cert errors[] preserved with non-empty messages so operators can correlate each failure to its certificate ID. M-008 — Admin-gated handler enumeration pin (verified-already-clean) Recon: only one admin-gated handler — bulk_revocation.go — with full 3-branch test triplet already in place. health.go calls IsAdmin informationally to surface the flag to the GUI without gating. internal/api/handler/m008_admin_gate_test.go: Walks every handler .go file, asserts every middleware.IsAdmin call site is in AdminGatedHandlers (with required test triplet) or InformationalIsAdminCallers (justified). Adding a new admin gate without updating both the constant AND adding the test triplet fails CI. M-015 — Single-profile cardinality pin (verified-already-clean) Audit claim 'no cardinality validation' was wrong — enforced at struct level. domain.ManagedCertificate.{CertificateProfileID,RenewalPolicyID,IssuerID,OwnerID} and RenewalPolicy.CertificateProfileID are bare strings, not slices. internal/domain/m015_cardinality_test.go: reflect-based pin on kind=String. Schema change to N:N would have to update renewal.go's lookup loop in the same commit. 
M-016 (CWE-754) — Reap stale-agent jobs internal/repository/postgres/job.go::ListJobsWithOfflineAgents: JOIN jobs to agents on agent_id, filter (status=Running AND a.last_heartbeat_at < cutoff), exclude server-keygen jobs. internal/service/job.go::ReapJobsWithOfflineAgents: Flips matched jobs to Failed reason agent_offline so I-001 retry loop re-queues them on a healthy agent. Records audit event per reap. internal/scheduler/scheduler.go: Scheduler.runJobTimeout cycle now calls both reaper arms. agentOfflineJobTTL default 5min (5x agent-health-check default); SetAgentOfflineJobTTL knob for operator override. internal/service/job_offline_agent_reaper_test.go: 6 unit tests cover happy path, server-keygen-skip, non-Running-skip, non-positive-TTL fail-loud, repo-error propagation, audit-event recording. M-019 — Configurable ARI HTTP timeout Audit claim 'no fallback timeout' was wrong — ari.go:52 already had a 15s timeout. Bundle C makes it configurable. internal/connector/issuer/acme/acme.go: Config.ARIHTTPTimeoutSeconds field with env path CERTCTL_ACME_ARI_HTTP_TIMEOUT_SECONDS. internal/connector/issuer/acme/ari.go: Both HTTP clients (GetRenewalInfo + getARIEndpoint) now use the new ariHTTPTimeout() helper. Zero / negative / nil-config all fall back to the historic 15s default. ari_timeout_test.go: 4 dispatch arm tests. M-020 (CWE-770) — OCSP DoS hardening Pre-bundle the noAuthHandler chain had no rate limit. An attacker could DoS the OCSP responder, which for fail-open relying parties is a revocation bypass. cmd/server/main.go: noAuthHandler refactored from fixed middleware.Chain(...) to a conditional slice that appends middleware.NewRateLimiter when cfg.RateLimit.Enabled. Per-IP keying applies; OCSP/CRL/EST/SCEP are unauth. docs/security.md (NEW): Operator runbook documenting Must-Staple TLS Feature extension RFC 7633 as the architectural fix for fail-open relying parties. 
Profile-flip guidance + nginx/Apache/HAProxy/Envoy stapling snippets + explicit scope statement on what the rate limiter alone does NOT solve. Audit deliverables: cowork/comprehensive-audit-2026-04-25/audit-report.md: score 31/55 -> 38/55 closed (Medium 13/27 -> 20/27). cowork/comprehensive-audit-2026-04-25/findings.yaml: 7 status flips open -> closed with closure notes citing the Bundle C mechanism. certctl/CHANGELOG.md: Bundle C section under [unreleased]. Verification: go vet ./internal/service ./internal/scheduler ./internal/connector/issuer/acme ./internal/api/handler ./internal/domain ./cmd/server clean go test -count=1 -short on the same packages all green helm template + helm lint clean internal/repository/postgres setup-fail sandbox disk pressure (same on master HEAD before this branch)	2026-04-27 00:08:25 +00:00
Shankar	2933395730	Bundle B CI follow-up: G-3 env-var docs + M-028 closure (final 5 SA1019 sites) Two CI failures on master after Bundle B merge: 1. Frontend Build / G-3 env-var docs guardrail Bundle B introduced CERTCTL_RATE_LIMIT_PER_USER_RPS and CERTCTL_RATE_LIMIT_PER_USER_BURST without adding them to docs/features.md. The guardrail step that scans Go source for getEnv* calls and asserts each appears in a doc page failed. Fix: docs/features.md rate-limit section extended with both new env vars + a paragraph explaining the per-key keying contract from M-025. 2. Go Build & Test / staticcheck SA1019 hits (6 errors) The CI workflow runs staticcheck without continue-on-error. Bundle 7 opened M-028 to track 6 deprecated-API sites; Bundle 9 closed 1 of them (the elliptic.Marshal in local.go) but kept a deliberate regression-oracle reference in bundle9_coverage_test.go protected only by golangci-lint's //nolint comment — staticcheck-as-CLI does not honor that, only its native //lint:ignore directive. Closure of remaining 5 sites: cmd/server/main_test.go:47, 163, 192, 465 — 4 × middleware.NewAuth migrated to middleware.NewAuthWithNamedKeys with explicit NamedAPIKey entries. The auth=none case at line 465 maps to a nil NamedAPIKey slice (no-op pass-through, matches the NewAuthWithNamedKeys contract for empty input). Audit count was 3; recon found a 4th at line 465 that was missed. internal/api/handler/scep.go:266 — csr.Attributes is a real RFC 2985 §5.4.1 challengePassword carve-out. Go's stdlib deprecation note explicitly applies only to OID 1.2.840.113549.1.9.14 (requestedExtensions), NOT to OID 1.2.840.113549.1.9.7 (challengePassword), for which there is no non-deprecated stdlib API. Suppressed with native //lint:ignore SA1019 + comment block citing the RFC. internal/connector/issuer/local/bundle9_coverage_test.go:342 — deliberate regression-oracle that calls elliptic.Marshal to prove the new crypto/ecdh path is byte-identical. 
Comment converted from //nolint:staticcheck to native //lint:ignore SA1019 so staticcheck-as-CLI honors the suppression. Audit deliverables: cowork/comprehensive-audit-2026-04-25/audit-report.md: M-028 box flipped [x]; score 30/55 -> 31/55 (Medium 12/27 -> 13/27). cowork/comprehensive-audit-2026-04-25/findings.yaml: M-028 status partial_closed -> closed with closure note. Verification: go test -count=1 -short ./cmd/server ./internal/api/handler ./internal/connector/issuer/local ./internal/api/middleware ./internal/config — all green. staticcheck on each changed package — 0 SA1019 hits. Bundle C had M-028 in scope; this CI-fix lift moves it forward so master CI goes green immediately. Bundle C scope adjusts to remove M-028 and focuses on M-006 / M-015 / M-016 / M-019 / M-020 plus the M-007 / M-008 coverage gaps.	2026-04-26 23:35:13 +00:00
Shankar	e8f5ecf3c9	Bundle B: Auth & transport surface tightening — 5 findings closed Closes M-001 + M-002 + M-013 + M-018 + M-025 from comprehensive-audit-2026-04-25. M-001 (CWE-916) — PBKDF2 100k -> 600k via v3 blob format internal/crypto/encryption.go: - New v3Magic (0x03), pbkdf2IterationsV3 (600,000 — OWASP 2024 Password Storage Cheat Sheet floor), v3SaltSize (16 bytes), deriveKeyWithSaltV3 helper. - EncryptIfKeySet now unconditionally writes v3: magic(0x03) || salt(16) || nonce(12) || ciphertext+tag - DecryptIfKeySet falls through v3 -> v2 -> v1 with AEAD verification at each step. Wrong-passphrase v3 reads cannot be silently misattributed to v2/v1. - IsLegacyFormat updated to recognize 0x03 as non-legacy. internal/crypto/encryption_v3_test.go (NEW, 7 tests): V3 round-trip / V2 read-fallback against deterministic v2 fixture / V3 wrong-passphrase fails / V3-vs-V2 dispatch order / V2 vs V3 keys differ for same (passphrase, salt) / iteration-count pin at OWASP 2024 floor / IsLegacyFormat-recognises-V3. Coverage internal/crypto: 86.7% -> 88.2%. M-002 (CWE-862) — Auth-exempt allowlist constants + AST regression test Recon found auth-exempt surface spans TWO layers (audit's claim was incomplete): Layer 1 (router.go direct r.mux.Handle): GET /health, GET /ready, GET /api/v1/auth/info, GET /api/v1/version Layer 2 (cmd/server/main.go::buildFinalHandler URL-prefix dispatch): /.well-known/pki/, /.well-known/est/, /scep[/...]* internal/api/router/router.go: - New AuthExemptRouterRoutes constant with per-entry justifications. - New AuthExemptDispatchPrefixes constant. internal/api/router/auth_exempt_test.go (NEW, 2 tests): AST-walks router.go for every direct mux.Handle call and asserts set equals AuthExemptRouterRoutes; reads source bytes of Register / RegisterFunc and asserts they still wrap with middleware.Chain. 
cmd/server/auth_exempt_test.go (NEW, 2 tests): 14-case table test on buildFinalHandler asserting documented prefixes route to noAuthHandler and authenticated routes route to apiHandler; inverse-overlap pin proves no documented bypass shadows an authenticated prefix. M-013 (CWE-942) — CORS deny-by-default verified-already-clean + pin Audit claim 'default allows all origins if env-var unset' was WRONG. internal/api/middleware/middleware.go::NewCORS already denies cross-origin requests when len(cfg.AllowedOrigins) == 0 (no Access-Control-Allow-Origin header is emitted, same-origin policy applies). internal/api/middleware/cors_test.go: +TestNewCORS_NilOriginsDeniesAll + TestNewCORS_M013_ContractDocumentedInOrder (5-case table test pinning the 3-arm dispatch contract). M-018 (CWE-319 / PCI-DSS Req 4) — Postgres TLS opt-in toggle deploy/helm/certctl/values.yaml: new postgresql.tls.{mode,caSecretRef} operator-facing knobs. Default 'disable' preserves in-cluster pod-network behavior; PCI-scoped operators set verify-full. deploy/helm/certctl/templates/_helpers.tpl: certctl.databaseURL helper pipes postgresql.tls.mode into ?sslmode=. deploy/helm/certctl/templates/server-secret.yaml: uses the helper instead of hardcoded sslmode=disable. deploy/docker-compose.yml: CERTCTL_DATABASE_URL is now ${CERTCTL_DATABASE_URL:-...} so operators override without editing. docs/database-tls.md (NEW): operator runbook covering 4 deployment shapes, RDS verify-full example with PGSSLROOTCERT mount, and pg_stat_ssl verification query. helm template + helm lint clean. M-025 (OWASP ASVS L2 §11.2.1) — Per-key rate limiting internal/api/middleware/middleware.go::NewRateLimiter rewritten from a single global tokenBucket to a keyedRateLimiter map keyed on 'user:'+GetUser(ctx) for authenticated callers 'ip:'+RemoteAddr-host for unauthenticated - Empty UserKey strings treated as unauthenticated. - X-Forwarded-For intentionally NOT consulted (header-spoofing risk). 
- Create-on-demand bucket allocation under sync.RWMutex with double-check pattern. RateLimitConfig.PerUserRPS / PerUserBurstSize fields with env vars CERTCTL_RATE_LIMIT_PER_USER_RPS / CERTCTL_RATE_LIMIT_PER_USER_BURST allow per-user budgets distinct from per-IP. internal/api/middleware/ratelimit_keyed_test.go (NEW, 5 tests): TwoIPsHaveIndependentBuckets / SameUserDifferentIPsShareBucket / TwoUsersHaveIndependentBuckets / PerUserBudgetOverride / EmptyUserKeyTreatedAsAnonymous. Coverage internal/api/middleware: 82.1% -> 83.7%. Audit deliverables: cowork/comprehensive-audit-2026-04-25/audit-report.md: score 25/55 -> 30/55 closed (High 7/9, Medium 7/27 -> 12/27, Low 8/19). cowork/comprehensive-audit-2026-04-25/findings.yaml: 5 status flips open -> closed with closure notes citing the Bundle B mechanism. certctl/CHANGELOG.md: Bundle B section under [unreleased]. Verification: go test -count=1 -short ./... all green staticcheck on changed packages no new SA/ST hits (the 4 pre-existing SA1019 sites in cmd/server/main_test.go are Bundle 9 / M-028 partial closure leftovers tracked in Bundle C) helm template + helm lint clean internal/repository/postgres setup-fail sandbox disk pressure, same on master HEAD before this branch — environmental, not Bundle B	2026-04-26 23:09:10 +00:00
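The M-025 keying contract in the Bundle B commit above ('user:' keys for authenticated callers, 'ip:' keys otherwise, buckets created on demand under a sync.RWMutex with a double-check) might look roughly like the sketch below. All type and function names are hypothetical, and the `bucket` type is a placeholder rather than a real token bucket; only the keying and locking pattern is illustrated.

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

// bucket is a placeholder for a token bucket; only its identity matters here.
type bucket struct{ key string }

// keyedLimiter holds one bucket per caller key (hypothetical type).
type keyedLimiter struct {
	mu      sync.RWMutex
	buckets map[string]*bucket
}

func newKeyedLimiter() *keyedLimiter {
	return &keyedLimiter{buckets: make(map[string]*bucket)}
}

// limiterKey derives the bucket key. An empty user is treated as
// unauthenticated, and X-Forwarded-For is deliberately not consulted.
func limiterKey(user, remoteAddr string) string {
	if user != "" {
		return "user:" + user
	}
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		host = remoteAddr // no port present; use the address as-is
	}
	return "ip:" + host
}

// bucketFor returns the bucket for key, creating it on demand.
func (l *keyedLimiter) bucketFor(key string) *bucket {
	// Fast path: read lock only.
	l.mu.RLock()
	b, ok := l.buckets[key]
	l.mu.RUnlock()
	if ok {
		return b
	}
	// Slow path: re-check under the write lock, because another
	// goroutine may have created the bucket in between.
	l.mu.Lock()
	defer l.mu.Unlock()
	if existing, ok := l.buckets[key]; ok {
		return existing
	}
	b = &bucket{key: key}
	l.buckets[key] = b
	return b
}

func main() {
	l := newKeyedLimiter()
	a := l.bucketFor(limiterKey("alice", "10.0.0.1:5555"))
	b := l.bucketFor(limiterKey("alice", "10.0.0.2:6666"))
	fmt.Println(a == b) // true: one user on two IPs shares a bucket
	fmt.Println(limiterKey("", "10.0.0.1:5555"))
}
```

The second lookup under the write lock is what makes concurrent first requests from the same key converge on a single bucket instead of each allocating its own.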
Shankar	1d0733f170	Bundle 9 follow-up: ST1018 ESC sweep + make verify pre-commit gate CI on the bundle-9 merge (run #24962543332) failed golangci-lint with 16 staticcheck ST1018 'string literal contains the Unicode format character U+202X, consider using the \u202X escape sequence' hits — across the two test files we added (internal/validation/unicode_test.go + internal/connector/issuer/local/bundle9_coverage_test.go). Mechanical sweep, byte-identical at runtime: internal/validation/unicode_test.go (13 + 1 hits cleared) RTL/LTR overrides U+202A..U+202E + U+2066..U+2069 (lines 39-47) zero-width U+200B..U+200D + U+2060 (lines 67-70) additional U+202E in TestValidateUnicodeSafe_ErrorMentionsByteOffset internal/connector/issuer/local/bundle9_coverage_test.go (3 hits) U+202E in TestValidateCSRUnicode_RejectsDNSNameRTL U+200B in TestValidateCSRUnicode_RejectsEmailZeroWidth U+202E in TestValidateCSRUnicode_RejectsAdditionalSAN The strings now use Go \uXXXX escape sequences. Identical UTF-8 bytes hit ValidateUnicodeSafe at runtime — every test passes unchanged locally. The file-header comment in unicode_test.go that promised this convention is now actually honored. Verification: staticcheck -checks=ST1018 returns clean across the two packages. go test -count=1 -short still green. Pre-commit gate added to prevent recurrence: Makefile: new 'verify' aggregate target runs gofmt + go vet + golangci-lint run + go test -short — same set CI enforces. Run 'make verify' before every commit going forward. cowork/CLAUDE.md: new 'Pre-commit verification gate' paragraph in Operating Rules. Documents make verify as the canonical gate; explains WHY (Bundle-9 shipped green-on-vet / red-on-CI because ST1018 only fires under golangci-lint's staticcheck, not vet); documents the staticcheck-only fallback for disk-constrained sandboxes. 
This commit changes only: - 2 test source files (\uXXXX escapes, no behavior change) - Makefile (1 new target, 1 .PHONY entry, 1 help line) - cowork/CLAUDE.md (1 new operating-rule paragraph)	2026-04-26 21:17:12 +00:00
Shankar	d588bb898a	Bundle 9: Local-issuer hardening — 5 findings closed + 1 partial Closes H-010 + L-002 + L-003 + L-012 + L-014 from comprehensive-audit-2026-04-25; partial-closes M-028 (the local.go:682 elliptic.Marshal site only). H-010 (CWE-1257) — local-issuer coverage 68.3% -> 86.7% * internal/connector/issuer/local/bundle9_coverage_test.go (NEW) Adds ~30 subtests across CSR-acceptance failure paths, parsePrivateKey four-format coverage, resolveEKUsAndKeyUsage all-EKU + fallback, hashPublicKey RSA + ECDSA P-256/P-384/P-521 + unsupported curve, ecdsaToECDH byte-identical round-trip pin, loadCAFromDisk expired/non-CA/missing/happy, validateCSRUnicode all rejection arms, marshalPrivateKeyAndZeroize / ensureKeyDirSecure all branches, ValidateConfig 5 arms, MaxTTLSeconds cap. * .github/workflows/ci.yml — flips local-issuer floor 60% -> 85% hard with explicit "add tests, do not lower the gate" comment. L-002 (CWE-226) — agent + local-CA private-key zeroization * internal/connector/issuer/local/keymem.go (NEW) * cmd/agent/keymem.go (NEW) marshalPrivateKeyAndZeroize wraps x509.MarshalECPrivateKey with defer clear(der). Agent additionally defer clear(privKeyPEM) on the encoded buffer. Bounds heap-resident exposure of the private scalar to the duration of PEM-encode + os.WriteFile. L-003 (CWE-732) — 0700 key-directory hardening * internal/connector/issuer/local/keystore.go (NEW) * cmd/agent/keymem.go (NEW) ensureKeyDirSecure / ensureAgentKeyDirSecure create dir tree at 0700, accept owner-only modes, chmod-tighten permissive leaves with re-stat verification, refuse empty/root/dot. Wired ahead of every os.WriteFile(keyPath, ..., 0600) site in cmd/agent/main.go. 
L-012 (CWE-1007 + CWE-176) — Unicode safety in CN/SAN * internal/validation/unicode.go (NEW) * internal/validation/unicode_test.go (NEW, 8 test functions) ValidateUnicodeSafe rejects RTL/LTR overrides U+202A..U+202E + U+2066..U+2069, zero-width U+200B..U+200D + U+2060 + U+FEFF, control chars <0x20 + 0x7F..0x9F, and per-DNS-label Latin+non-Latin-letter mixes (Cyrillic-а-in-apple homograph). Pure-IDN labels allowed. Errors cite codepoint + byte offset. Wired into IssueCertificate + RenewCertificate via validateCSRUnicode covering CSR Subject CommonName + DNSNames + EmailAddresses + request-side additional SANs. L-014 — CA-key-in-process threat-model documentation * internal/connector/issuer/local/local.go file-header doc comment Documents what the bundled defense-in-depth measures DO and DO NOT protect against; directs operators with stricter requirements to HSM/PKCS#11/cloud-KMS-backed signing (V3 Pro KMS-issuance roadmap entry as the source-of-truth fix). M-028 (CWE-477) PARTIAL — 1 of 6 SA1019 sites * internal/connector/issuer/local/local.go::ecdsaToECDH (NEW helper) Replaces deprecated elliptic.Marshal(k.Curve, k.X, k.Y) inside hashPublicKey with crypto/ecdh.PublicKey.Bytes(). Dispatches on Curve.Params().Name to avoid importing crypto/elliptic for sentinel comparisons. Supports P-256/P-384/P-521; P-224 returns unsupported-curve error and the caller falls back to a stable X+Y big.Int.Bytes() hash (so SKI generation never panics). * TestHashPublicKey_ECDSA_RoundTripPin — byte-identical regression oracle that pins the new output to the legacy elliptic.Marshal output across all three supported curves (with explicit //nolint:staticcheck on the SA1019 reference). Migration cannot silently change the SubjectKeyId of every previously-issued cert. * 5 SA1019 sites still open (test-file middleware.NewAuth × 3 + scep.go csr.Attributes). 
Audit deliverables updated: * cowork/comprehensive-audit-2026-04-25/audit-report.md — score 20/55 -> 25/55 closed (High 6/9 -> 7/9; Low 4/19 -> 8/19). * cowork/comprehensive-audit-2026-04-25/findings.yaml — H-010 + L-002 + L-003 + L-012 + L-014 status open -> closed; M-028 status open -> partial_closed; closure notes cite the Bundle-9 mechanism. * certctl/CHANGELOG.md — Bundle-9 section under [unreleased].	2026-04-26 17:18:00 +00:00
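The L-012 rejection classes listed in the Bundle 9 commit above (bidi overrides and isolates, zero-width characters, control characters, with errors citing codepoint and byte offset) can be approximated with the sketch below. It is an assumption-laden simplification: the function names are hypothetical, the error wording is invented for illustration, and the per-DNS-label mixed-script (homograph) check described in the commit is intentionally left out.

```go
package main

import "fmt"

// isUnsafeRune reports whether r falls in one of the rejected classes
// listed in the commit message. The mixed-script check is omitted.
func isUnsafeRune(r rune) bool {
	switch {
	case r >= 0x202A && r <= 0x202E: // bidi embeddings and overrides
		return true
	case r >= 0x2066 && r <= 0x2069: // bidi isolates
		return true
	case r >= 0x200B && r <= 0x200D, r == 0x2060, r == 0xFEFF: // zero-width
		return true
	case r < 0x20, r >= 0x7F && r <= 0x9F: // control characters
		return true
	}
	return false
}

// validateUnicodeSafe returns an error naming the first unsafe codepoint
// and its byte offset, or nil when the string is clean.
func validateUnicodeSafe(s string) error {
	// Ranging over a string yields the byte offset of each rune.
	for offset, r := range s {
		if isUnsafeRune(r) {
			return fmt.Errorf("unsafe codepoint U+%04X at byte offset %d", r, offset)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateUnicodeSafe("example.com"))
	fmt.Println(validateUnicodeSafe("evil\u202Ecom"))
}
```

Writing the test input with a `\u202E` escape rather than the raw character also matches the ST1018 convention adopted in the Bundle 9 follow-up commit.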
Shankar	90f0cab204	fix(bundle-6): Audit Integrity + Privacy — 3 audit findings closed Closes Audit-2026-04-25 H-008 (High), M-017 (Medium), M-022 (Medium). Hardens audit-trail tamper-resistance + minimizes PII leakage in one cohesive change, with both controls applying automatically and no operator action required at install time. What changed - internal/service/audit_redact.go (NEW) — RedactDetailsForAudit: * credentialKeys deny-list (api_key, password, _pem, eab_secret, ...) piiKeys deny-list (email, phone, ssn, name, address, ip_address, ...) * case-insensitive key match; recurses into nested maps + arrays * mutation-free; surfaces redacted_keys array for operator visibility * nil/empty input → nil out (preserves pre-Bundle-6 behaviour) - internal/service/audit.go — RecordEvent now routes details through RedactDetailsForAudit BEFORE marshaling. No call-site changes required. - internal/service/audit_redact_test.go (NEW) — full coverage: * credential keys (~30 entries) * PII keys (~20 entries) * nested maps + arrays * case-insensitivity * mutation-free invariant * JSON round-trip (catches type-assertion regressions) * scalar pass-through (no panic on int/bool/nil) - migrations/000018_audit_events_worm.up.sql (NEW) — DB-level WORM: * BEFORE UPDATE OR DELETE trigger raises check_violation with diagnostic citing the rationale + compliance-superuser hint * REVOKE UPDATE,DELETE ON audit_events FROM certctl (defence-in-depth) * REVOKE wrapped in pg_roles existence check so test fixtures without the certctl role stay idempotent - migrations/000018_audit_events_worm.down.sql (NEW) — clean teardown for dev resets; not for production use. - internal/repository/postgres/audit_worm_test.go (NEW, testcontainers, -short gated) — INSERT succeeds; UPDATE + DELETE fail with check_violation; second INSERT after blocked modification still succeeds (no trigger-state corruption). 
- docs/compliance.md — new section "Audit-Trail Integrity & Privacy (Bundle 6)" with verification psql snippet, compliance-superuser pattern (NOT auto-created), redactor before/after example, and a maintenance note for adding new credential keys. Compliance mapping - H-008 (CWE-532 Insertion of Sensitive Information into Log File) - M-017 (HIPAA Technical Safeguards §164.312(b) — audit controls) - M-022 (GDPR Art. 32 — data minimization) Threat model: TB-3 (audit log tampering), TB-1 (operator/orchestrator). Verification - go vet ./... → clean - go build ./... → clean - go test -short -count=1 ./... → all packages pass - go test -count=1 -run TestRedactDetailsForAudit ./internal/service/... → all pass - (testcontainers, gated by -short) audit_worm_test.go pins WORM contract - npx tsc --noEmit (web) → clean (no frontend changes) - python3 yaml.safe_load(api/openapi.yaml) → 89 paths Backward compatibility - Trigger applies forward only — existing rows unchanged. - nil/empty details from RecordEvent callers → nil out (preserves prior behaviour for the many existing call sites that pass nil). - Compliance superusers (provisioned out-of-band) bypass the trigger. Bundle 6 of the 2026-04-25 comprehensive audit.	2026-04-26 00:26:44 +00:00
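The redactor behaviour described in the Bundle 6 commit above (case-insensitive key match, recursion into nested maps and arrays, no mutation of the input, a redacted_keys array, nil out for nil/empty input) can be sketched as below. This is an illustration under assumptions: the deny-list is a three-entry hypothetical subset matched exactly, the `[REDACTED]` placeholder is invented, and the function names do not come from the repository.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// denyList is a tiny hypothetical subset of the credential and PII keys.
var denyList = map[string]bool{"api_key": true, "password": true, "email": true}

// redactValue returns a redacted copy of v, recording matched keys in hit.
func redactValue(v any, hit map[string]bool) any {
	switch t := v.(type) {
	case map[string]any:
		out := make(map[string]any, len(t))
		for k, val := range t {
			if denyList[strings.ToLower(k)] { // case-insensitive match
				out[k] = "[REDACTED]"
				hit[k] = true
				continue
			}
			out[k] = redactValue(val, hit)
		}
		return out
	case []any:
		out := make([]any, len(t))
		for i, val := range t {
			out[i] = redactValue(val, hit)
		}
		return out
	default:
		return v // scalars pass through unchanged
	}
}

// redactDetails builds a new map and leaves the input untouched.
func redactDetails(details map[string]any) map[string]any {
	if len(details) == 0 {
		return nil
	}
	hit := map[string]bool{}
	out := redactValue(details, hit).(map[string]any)
	if len(hit) > 0 {
		keys := make([]string, 0, len(hit))
		for k := range hit {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		out["redacted_keys"] = keys
	}
	return out
}

func main() {
	in := map[string]any{
		"Password": "hunter2",
		"issuer":   map[string]any{"api_key": "k", "type": "acme"},
	}
	fmt.Println(redactDetails(in)["Password"], in["Password"])
}
```

Building a fresh map at every level is what gives the mutation-free property: callers can keep using their original details map after the audit event is recorded.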
Shankar	17ef377edb	fix(bundle-5): CI green-up — drop unused sync.Once + document new env vars Two CI gate failures from the Bundle 5 push: 1. golangci-lint (unused) — agent_bootstrap.go declared `var bootstrapWarnOnce sync.Once` but never called .Do(). The one-shot WARN actually lives in cmd/server/main.go (per-process at startup, not per-request) so the handler-side variable was dead code. Dropped the var + sync import; left a comment explaining where the WARN lives. 2. G-3 env-var docs guardrail — Bundle 5 added two new env vars (CERTCTL_AGENT_BOOTSTRAP_TOKEN, CERTCTL_AUDIT_FLUSH_TIMEOUT_SECONDS) but the G-3 closure CI step asserts every CERTCTL_* env defined in internal/config/config.go is mentioned in docs/features.md. Added three new sub-sections to docs/features.md after the Body Size Limits block: * Agent Bootstrap Token (H-007 contract + generation guidance) * Graceful Shutdown Audit Flush (M-011 timeout knob) * Liveness vs Readiness Probes (H-006 /health vs /ready table) No production behaviour change; pure CI-gate fix. Verification - go vet ./internal/api/handler/... → clean - go test -count=1 -run 'TestVerifyBootstrapToken|TestRegisterAgent_BootstrapToken' ./internal/api/handler/... → all pass - grep CERTCTL_AGENT_BOOTSTRAP_TOKEN docs/features.md → present - grep CERTCTL_AUDIT_FLUSH_TIMEOUT_SECONDS docs/features.md → present	2026-04-26 00:03:03 +00:00
Shankar	5c8d37e0f0	fix(bundle-5): Operational Liveness + Bootstrap — 4 audit findings closed Closes Audit-2026-04-25 H-006 (High), H-007 (High), M-011 (Medium), L-006 (Low — verified-already-closed via C-1 master closure in v2.0.54). Hardens the orchestrator-facing surface — k8s probes, agent enrollment, shutdown audit drain, scheduler config plumbing. What changed - internal/api/handler/health.go — split contract: * /health stays shallow 200 (k8s liveness — process alive) * /ready accepts sql.DB; runs db.PingContext(2s); 503 on failure Nil DB path returns 200 + db=not_configured (test fixtures) - internal/api/handler/agent_bootstrap.go (NEW) — verifyBootstrapToken: * empty expected = warn-mode pass-through * non-empty = `Authorization: Bearer <token>` required * crypto/subtle.ConstantTimeCompare; length-mismatch path runs dummy compare to keep timing uniform * ErrBootstrapTokenInvalid sentinel - internal/api/handler/agents.go — RegisterAgent calls verifyBootstrapToken BEFORE body parse so unauth probes don't even allocate a JSON decoder - internal/config/config.go — two new env vars: * CERTCTL_AGENT_BOOTSTRAP_TOKEN (Auth.AgentBootstrapToken) * CERTCTL_AUDIT_FLUSH_TIMEOUT_SECONDS (Server.AuditFlushTimeoutSeconds) - cmd/server/main.go — 3 changes: * pass sql.DB into NewHealthHandler (H-006) pass cfg.Auth.AgentBootstrapToken into NewAgentHandler (H-007) * configurable shutdown audit-flush timeout (M-011) * one-shot startup WARN when bootstrap token unset (deprecation) - new tests: agent_bootstrap_test.go (full deny/accept/warn-mode coverage, constant-time compare path, length-mismatch); health_test.go extended with /ready DB-probe failure (503), nil-DB pass-through, /health-shallow L-006 verified - cmd/server/main.go:557 already calls sched.SetShortLivedExpiryCheckInterval(cfg.Scheduler.ShortLivedExpiryCheckInterval) per the C-1 master closure in v2.0.54. Bundle 5 confirms; no code change. Threat model: TB-1 (operator/orchestrator), TB-2 (Agent↔Server). 
- CWE-754 (Improper Check for Unusual or Exceptional Conditions) for H-006 - CWE-306 + CWE-288 (Missing Authentication for Critical Function) for H-007 Verification - go vet ./... → clean - go build ./... → clean - go test -short -count=1 ./... → all packages pass - targeted Bundle-5 regressions → all pass - npx tsc --noEmit (web) → clean - npx vitest run (web) → in-flight (sandbox 45s ceiling exceeded; no failure markers in dot stream; no frontend changes in this bundle so no regression risk) - python3 yaml.safe_load(api/openapi.yaml) → 89 paths Backward compatibility - Bootstrap token defaults to empty (warn-mode) — existing demo deployments unaffected. Server logs deprecation WARN; v2.2.0 will require it. - Audit flush timeout default 30s preserves prior behaviour. - Helm chart already routes readiness probe to /ready (no chart change needed); now /ready actually probes the DB. Bundle 5 of the 2026-04-25 comprehensive audit.	2026-04-25 23:54:18 +00:00
Shankar	2c5383da9f	fix(bundle-3): MCP Trust-Boundary Fencing — 5 audit findings closed Closes Audit-2026-04-25 H-002, H-003, M-003, M-004, M-005 (all CWE-1039 LLM Prompt Injection at the MCP↔consumer trust boundary, TB-7). Strategy: wrapper-layer fencing. All 87 MCP tools route their success path through textResult and their failure path through errorResult. By fencing at those two wrappers we cover every existing tool AND every future tool with a single change — no per-tool wiring required. What changed - internal/mcp/fence.go (new) — FenceUntrusted helper with strategy doc + per-finding rationale. Both fenceMCPResponse and fenceMCPError use it internally. - internal/mcp/tools.go — textResult wraps response body via fenceMCPResponse; errorResult wraps error string via fenceMCPError. - internal/mcp/tools_test.go — TestTextResult / TestErrorResult updated to assert fenced shape (start marker + end marker + inner body). - internal/mcp/injection_regression_test.go (new) — 5 regression test functions, one per audit finding, each replays 5 classic LLM injection payloads (instruction_override, system_role_spoofing, delimiter_break_attempt, markdown_link_phishing, data_exfil_via_url) and asserts the planted payload appears VERBATIM (preservation, operator visibility) INSIDE the fence boundaries. - internal/mcp/fence_guardrail_test.go (new) — CI guardrail that walks every non-test .go file in the mcp package and fails if it finds a bare gomcp.CallToolResult literal outside tools.go. Prevents future tools from silently bypassing the fence. Delimiter-forgery defense The naive constant fence (--- UNTRUSTED MCP_RESPONSE END ---) is forgeable: an attacker who controls a field value can plant the literal end marker and "break out" of the fence. Defense: every fence call generates a 6-byte crypto/rand nonce, hex-encoded, and embeds it in BOTH the START and END markers. An attacker would need to predict the nonce (2^48 search per fence) to forge a matching END inside the payload. 
The delimiter_break_attempt regression test exercises this. Per-finding mapping - H-002 Cert Subject DN injection (CSR submitter controlled) → TestMCP_PromptInjection_H002_CertSubjectDN - H-003 Discovered cert metadata injection (cert owner controlled) → TestMCP_PromptInjection_H003_DiscoveredCertMetadata - M-003 Agent heartbeat injection (agent self-reports hostname/OS/IP) → TestMCP_PromptInjection_M003_AgentHeartbeat - M-004 Upstream CA error injection (CA controls error string) → TestMCP_PromptInjection_M004_UpstreamCAError - M-005 Audit details + notification body injection (downstream actors control these) → TestMCP_PromptInjection_M005_AuditDetailsAndNotifications Verification gates - go vet ./... → clean - go build ./... → clean - go test -short -count=1 ./... → all packages pass - go test -count=1 ./internal/mcp/... → all packages pass - npx tsc --noEmit (web) → clean - npx vitest run (web) → 337 passed - python3 yaml.safe_load(api/openapi.yaml) → 89 paths, 56 schemas Threat-model placement: TB-7 (MCP↔LLM consumer). certctl owns the boundary; consumer-side prompt engineering is recommended but not relied upon. Defense-in-depth: per-call nonce closes the delimiter-forgery edge case that constant fences would have left exposed. Bundle 3 of the 2026-04-25 comprehensive audit (88 findings).	2026-04-25 22:44:33 +00:00
cowork	84d0aa2d5a	fix(bundle-4): EST/SCEP Attack Surface Hardening — 3 audit findings closed Closes 3 findings (1 High + 1 Medium + 1 Low) from /Users/shankar/Desktop/cowork/comprehensive-audit-2026-04-25/. Bundle 4 hardens the only attack surface reachable by an anonymous network attacker in certctl: the unauthenticated EST + SCEP enrollment endpoints. Findings closed: - H-004 (High): Hand-rolled ASN.1 parser had no fuzz target. The audit's original framing pointed at internal/pkcs7/, but recon confirmed that package is an ASN.1 ENCODER (BuildCertsOnlyPKCS7, ASN1Wrap, ASN1EncodeLength) — not a parser. The actual hand-rolled PKCS#7 PARSING reachable via anonymous network is in internal/api/handler/scep.go::extractCSRFromPKCS7 + parseSignedDataForCSR. Added native go fuzz targets: internal/api/handler/scep_fuzz_test.go::FuzzExtractCSRFromPKCS7 * internal/api/handler/scep_fuzz_test.go::FuzzParseSignedDataForCSR * internal/pkcs7/pkcs7_fuzz_test.go::FuzzPEMToDERChain (defense-in-depth) * internal/pkcs7/pkcs7_fuzz_test.go::FuzzASN1EncodeLength (defense-in-depth) Local 15s fuzz session: 150k execs on FuzzExtractCSRFromPKCS7, 937k on FuzzPEMToDERChain, 925k on FuzzASN1EncodeLength — zero panics. - M-021 (Medium): EST TLS-Unique channel binding (RFC 7030 §3.2.3). Added internal/api/handler/est.go::verifyESTTransport — defense-in-depth TLS pre-conditions (r.TLS != nil; HandshakeComplete; TLS ≥ 1.2). The full §3.2.3 channel binding only applies when EST mTLS is in use; certctl does not currently support EST mTLS, so the §3.2.3 requirement is moot today. RFC 9266 (TLS 1.3 tls-exporter) and EST mTLS are documented as deferred follow-ups in the verifyESTTransport doc comment. - L-005 (Low): EST/SCEP issuer-binding fail-loud at startup. Pre-Bundle-4 cmd/server/main.go validated that CERTCTL_EST_ISSUER_ID and CERTCTL_SCEP_ISSUER_ID existed in the registry but did NOT validate the issuer TYPE could emit a CA cert. 
An operator binding EST to an ACME issuer (whose GetCACertPEM returns explicit error) booted successfully and only failed at first /est/cacerts request. Post-Bundle-4: new preflightEnrollmentIssuer helper calls GetCACertPEM(ctx) at startup with a 10s timeout. Failure logs the connector error + the candidate issuer types and os.Exit(1). Tests added/modified: - internal/api/handler/est_transport_test.go (new) — 5 verifyESTTransport table cases covering plaintext-rejected, incomplete-handshake-rejected, TLS 1.0 rejected, TLS 1.2/1.3 accepted - cmd/server/preflight_test.go (new) — TestPreflightEnrollmentIssuer covering nil-connector, error-from-issuer, empty-PEM, valid cases - internal/api/handler/est_handler_test.go (modified) — 7 POST sites now stamp r.TLS to satisfy the new transport pre-condition - internal/integration/negative_test.go (modified) — setupTestServer wraps the test handler with a fake-TLS-state injector so the EST handler receives r.TLS != nil; production paths still rely on the real TLS listener Threat model reference: TB-11 (EST/SCEP client ↔ Server) per cowork/comprehensive-audit-2026-04-25/threat-model.md. Standards: RFC 7030 §3.2.3, RFC 8894 §3, RFC 5652, RFC 9266 (deferred).	2026-04-25 21:14:41 +00:00
cowork	1752d261b3	test: triage 37 skipped-test sites — closure comments pinning rationale (Q-1) Closes Q-1 (cat-s3-58ce7e9840be) — 37 t.Skip / testing.Short() sites across 9 test files audited. Per-site verdict matrix: - cmd/agent/verify_test.go (1 site): defensive guard against unreachable httptest.NewTLSServer code path. Document-skip with closure comment. - deploy/test/qa_test.go (11 sites): file already gated by `//go:build qa` tag. The 11 t.Skip("Requires X — manual test") markers are runtime second-line guards for operators who run -tags qa against a stack missing the required external service. File-level header comment block added explaining the manual-test convention. - deploy/test/healthcheck_test.go (5 sites): 3 docker-availability + 1 testing.Short + 1 hard-skip for not-yet-wired runtime probe (image-spec contract above already covers the audit-flagged regression). All correctly gated; file-level header comment block added explaining each. - deploy/test/integration_test.go (5 sites): in-flight-state guards (poll-with-skip after 90s polling for agent-online, inter-test Phase04→Phase07 ordering, scheduler-tick race for discovered certs, inter-test issuer fallthrough, defensive PEM-empty assertion). Each site now has a closure comment explaining why skip is the right choice rather than fail (upstream phase already surfaces the real failure; skipping prevents masking root cause behind cascading noise). - internal/repository/postgres/{testutil,seed,repo}_test.go (5 sites): testing.Short() gates for testcontainers-backed live PostgreSQL integration tests. All correctly gated; closure comments added naming the run command. - internal/connector/notifier/email/email_test.go (2 sites): anti-fixture assertions (test asserts SMTP dial fails; if a captive portal black-holes the call to success, skip rather than false-pass). Closure comments added explaining the fixture assumption. 
- internal/connector/target/iis/iis_test.go (2 sites): platform-gated skip for powershell.exe absence on non-Windows hosts. Mirrors the production iis_connector.go LookPath guard. Closure comments added. Total: 17 closure comments anchor the 37 skip sites (some sites share a single block-level comment). All skips remain in place; the change is purely documentation. The audit recommendation was "audit each skip and decide" — for these 37, the decision is uniformly document-skip: the gating is correct, the t.Skip messages name the missing precondition, and the closure comments now pin the rationale for future readers. See coverage-gap-audit-2026-04-24-v5/unified-audit.md cat-s3-58ce7e9840be for closure rationale.	2026-04-25 18:44:36 +00:00
certctl-bot	a86816451a	refactor(handler,repo): replace strings.Contains error dispatch with typed sentinels (S-2) Closes one 2026-04-24 audit finding (P2): - cat-s6-efc7f6f6bd50: 30 strings.Contains(err.Error(), ...) sites in internal/api/handler/ — brittle to repository-layer message changes, untyped against the actual failure mode. Approach (Option B from prompt design notes): - New typed sentinels in internal/repository/errors.go: ErrNotFound, ErrForeignKeyConstraint IsForeignKeyError(err) helper (the only place substring matching at the lib/pq boundary is allowed; isolates the DB-driver string knowledge to one function). - New typed sentinel in internal/domain/errors.go: ErrValidation (reserved for future per-entity validation wrappers; not yet used by all handlers). - 49 sites in internal/repository/postgres/*.go updated to wrap sql.ErrNoRows-derived errors via fmt.Errorf("...: %w", repository.ErrNotFound). - 18 not-found handler sites + 2 FK-constraint handler sites refactored to errors.Is(err, repository.ErrNotFound) / repository.IsForeignKeyError(err). - 23 inline `fmt.Errorf("X not found")` test fixtures across handler tests rewrapped to wrap repository.ErrNotFound. - test_utils.go::ErrMockNotFound rewrapped to wrap repository.ErrNotFound; renewal_policy.go closure docblock updated to reflect the new convention. - integration test mockJobRepository.Get wraps repository.ErrNotFound. CI regression guardrail: - .github/workflows/ci.yml::"Forbidden strings.Contains(err.Error()) regression guard (S-2)" greps for the three patterns ("not found", "violates foreign key", "RESTRICT") under internal/api/handler/ and fails the build on regression. Verification: - go build ./... — clean - go vet ./... — clean - go test ./... -short -count=1 — all packages pass (handler + repository + service + integration) - golangci-lint v2.11.4 run ./... 
— 0 issues - S-2 guardrail dry-run on post-fix tree → empty (good) - All sibling guardrails (S-1, G-3, D-1+D-2, B-1, L-1, H-1, C-1, F-1, P-1) pass Audit findings closed: - cat-s6-efc7f6f6bd50 (P2) Deferred follow-ups: - 6 domain-specific substring patterns still inline in handlers ("cannot approve", "cannot reject", "cannot be parsed", "no certificates found", "challenge password", "invalid"/ "required" validation chains in profiles + agent_groups). Each needs its own typed sentinel, scoped per service. Documented by the S-2 CI guardrail's allowlist for closure-comments only. - Per-entity not-found sentinels (Option A — ErrCertificateNotFound, ErrAgentNotFound, etc.) deferred. Generic ErrNotFound covers the current dispatch needs; per-entity precision would let handlers return entity-aware error bodies without a domain.Type field, but not blocking.	2026-04-25 17:54:14 +00:00
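The S-2 move from substring matching to typed sentinels can be illustrated with a small sketch. Everything here is declared locally for the example: `ErrNotFound` mirrors the sentinel the commit names in internal/repository/errors.go, while `getCertificate` and `statusFor` are hypothetical helpers, not functions from the repository.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound plays the role of repository.ErrNotFound (declared locally
// so the example is self-contained).
var ErrNotFound = errors.New("not found")

// getCertificate is a hypothetical repository lookup. It wraps the
// sentinel with %w so callers can match it regardless of message text.
func getCertificate(id string) error {
	if id != "cert-1" {
		return fmt.Errorf("certificate %q: %w", id, ErrNotFound)
	}
	return nil
}

// statusFor maps an error to an HTTP status code using errors.Is instead
// of strings.Contains(err.Error(), "not found").
func statusFor(err error) int {
	switch {
	case err == nil:
		return 200
	case errors.Is(err, ErrNotFound):
		return 404
	default:
		return 500
	}
}

func main() {
	fmt.Println(statusFor(getCertificate("cert-1")))
	fmt.Println(statusFor(getCertificate("missing")))
	fmt.Println(statusFor(errors.New("connection refused")))
}
```

Because the match goes through the wrap chain rather than the message, rewording the repository-layer error text no longer changes which status the handler returns.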
certctl-bot	ffc982bfbc	chore(cleanup,docs): vite proxy + dead scheduler setter wired + registerAgent/CLI docs (C-1 master) Closes six 2026-04-24 audit findings (3 P2 + 3 P3) — a cleanup-and-doc tail bundle that drains the smallest remaining leaves of the audit: - cat-u-vite_dev_proxy_plaintext_drift (P2): web/vite.config.ts proxied dev requests to http://localhost:8443 against an HTTPS-only backend (HTTPS-only since v2.0.47). Every dev-server API call 502'd. Fix: targets are now object-form `{target: 'https://...', secure: false, changeOrigin: true}` — the dev cert is self-signed by the deploy/test bootstrap and changes per-checkout. - cat-g-7e38f9708e20 (P3): Scheduler.SetShortLivedExpiryCheckInterval was defined + tested but never called from cmd/server/main.go. Operators tuning CERTCTL_SHORT_LIVED_EXPIRY_CHECK_INTERVAL got no effect — the 30s default in scheduler.NewScheduler was effectively hardcoded. Fix: added Config.Scheduler.ShortLivedExpiryCheckInterval + getEnvDuration in Load() reading the env var with a 30s default, + sched.SetShortLivedExpiryCheckInterval(...) call in main.go alongside the other scheduler-interval setters. - diff-10xmain-2bf4a0a60388 (P3): same root cause as cat-g-7e38f9708e20; closes as ride-along. - cat-b-6177f36636fb (P2): registerAgent client fn orphan. By-design per pull-only deployment model. Fix (audit recommendation: "document"): added a closure docblock above the export in client.ts + a new "Registration is by-design pull-only" paragraph in docs/architecture.md::Agents section explaining when/why a future GUI-driven enrollment feature might reach the endpoint (proxy-agent topologies for network appliances). - cat-i-7c8b28936e3d (P2): CLI scope intentionally narrow but undocumented. Fix: new "Scope (intentionally narrow)" subsection in docs/features.md::CLI capturing the SSH-into-prod / day-to-day GUI / AI-automation MCP three-way split. Verification: - go build ./... — clean - go vet ./... 
— clean - go test ./internal/scheduler/... ./internal/config/... — pass - golangci-lint v2.11.4 run ./... — 0 issues - tsc --noEmit (frontend) — clean - All sibling guardrails (S-1 / G-3 / D-1+D-2 / B-1 / L-1 / H-1) still pass Audit findings closed: - cat-u-vite_dev_proxy_plaintext_drift (P2) - cat-g-7e38f9708e20 (P3) - diff-10xmain-2bf4a0a60388 (P3) - cat-b-6177f36636fb (P2) - cat-i-7c8b28936e3d (P2) - (audit-bookkeeping ride-along: ensures every closed-bundle row has a non-empty merge SHA) Deferred follow-ups: none from this bundle. The remaining audit backlog (frontend test campaign, F-1 CertificatesPage UX, P-1 orphan-fn sweep, S-2 handler error-mapping refactor) is sibling sub-bundles in this mega-prompt.	2026-04-25 17:34:59 +00:00
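The cat-g-7e38f9708e20 fix (read the env var with a 30s default, then hand the value to the scheduler setter) might look roughly like the sketch below. The signatures are assumptions: the commit names getEnvDuration and SetShortLivedExpiryCheckInterval but does not show them, and `parseDurationOr` plus the `scheduler` struct are hypothetical stand-ins. Falling back to the default on unparseable or non-positive input is also an assumption of this sketch.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// parseDurationOr returns def when raw is empty, unparseable, or
// non-positive (the fallback policy is an assumption of this sketch).
func parseDurationOr(raw string, def time.Duration) time.Duration {
	if raw == "" {
		return def
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return def
	}
	return d
}

// getEnvDuration reads a duration such as "45s" or "2m" from the environment.
func getEnvDuration(key string, def time.Duration) time.Duration {
	return parseDurationOr(os.Getenv(key), def)
}

// scheduler is a hypothetical stand-in for the real scheduler type.
type scheduler struct {
	shortLivedExpiryCheckInterval time.Duration
}

// SetShortLivedExpiryCheckInterval overrides the constructor default.
func (s *scheduler) SetShortLivedExpiryCheckInterval(d time.Duration) {
	s.shortLivedExpiryCheckInterval = d
}

func main() {
	sched := &scheduler{shortLivedExpiryCheckInterval: 30 * time.Second}
	// Without this call the setter exists but the env var has no effect,
	// which is the defect the commit describes.
	sched.SetShortLivedExpiryCheckInterval(
		getEnvDuration("CERTCTL_SHORT_LIVED_EXPIRY_CHECK_INTERVAL", 30*time.Second))
	fmt.Println("short-lived expiry check interval:", sched.shortLivedExpiryCheckInterval)
}
```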
certctl-bot	6cb4414690	feat(security): bodyLimit on noAuth + security headers + encryption-key validation (H-1 master) Closes three 2026-04-24 audit findings (all P2): - cat-s5-4936a1cf0118: noAuthHandler chain accepted arbitrary-size bodies (EST simpleenroll, SCEP, PKI CRL/OCSP, /health, /ready). Memory exhaustion vector without HTTP-layer auth gatekeeping. - cat-s11-missing_security_headers: zero security headers on any response. Clickjacking, MIME-sniffing, untrusted-origin resource loads against the dashboard and API. - cat-r-encryption_key_no_length_validation: CERTCTL_CONFIG_ENCRYPTION_KEY accepted with any non-empty value including a single character. PBKDF2-SHA256 (100k rounds) does not compensate for low-entropy passphrases at scale (CWE-916, CWE-329). Changes: - cmd/server/main.go::noAuthHandler chain — added bodyLimitMiddleware + securityHeadersMiddleware. Same default cap as authed surface (1MB via CERTCTL_MAX_BODY_SIZE), same 413 on overflow. - cmd/server/main.go::middlewareStack (authed) — added securityHeadersMiddleware before corsMiddleware. - internal/api/middleware/securityheaders.go (new) — SecurityHeaders middleware + SecurityHeadersDefaults() with conservative defaults: HSTS 1y+includeSubDomains, X-Frame-Options DENY, X-Content-Type- Options nosniff, Referrer-Policy no-referrer-when-downgrade, CSP default-src 'self' + img/data + style 'unsafe-inline' (Tailwind/Vite needs it; scripts still 'self' only) + connect 'self' + frame- ancestors 'none'. Operators behind a customising reverse proxy can disable any header by setting its config field to empty. - internal/config/config.go::Validate() — enforce minEncryptionKeyLength = 32 bytes when CERTCTL_CONFIG_ENCRYPTION_KEY is set. Empty stays accepted (downstream fail-closed sentinel handles it). Structured error names the env var, the actual length, the required minimum, and the canonical generation command (`openssl rand -base64 32`). 
Tests: - internal/api/middleware/securityheaders_test.go (new) — 4 cases (defaults present, empty value disables single header, override applied, headers on 4xx/5xx). - internal/config/config_test.go — 5 new cases for the encryption-key length check (empty accepted, 1-byte rejected, 31-byte rejected at boundary, 32-byte accepted, 44-byte realistic operator key accepted). Documentation: - CHANGELOG.md — H-1 section above D-2 under [unreleased] with Breaking-change callout (operators with low-entropy keys must rotate before upgrade). - coverage-gap-audit-2026-04-24-v5/unified-audit.md — Live Tracker 25/47 → 33/47, P1 14/14 (zero remaining), P2 11/27 → 16/27. Three H-1 findings flipped + closed-bundle row added. Verification: - go build ./... — clean - go vet ./... — clean - golangci-lint v2.11.4 run ./... — 0 issues - go test ./internal/api/middleware/... — pass (incl. 4 new SecurityHeaders cases) - go test ./internal/config/... — pass (incl. 5 new EncryptionKey cases) - tsc --noEmit (frontend) — clean - All sibling guardrails (S-1 / G-3 / D-1 / D-2 / B-1 / L-1) still pass Audit findings closed: - cat-s5-4936a1cf0118 (P2) - cat-s11-missing_security_headers (P2) - cat-r-encryption_key_no_length_validation (P2) Breaking change: - Operators with CERTCTL_CONFIG_ENCRYPTION_KEY shorter than 32 bytes must rotate before upgrade. Generate via `openssl rand -base64 32`. Deferred follow-ups: - Weak-key dictionary check (reject password123, common ASCII patterns) — adds operational friction with low marginal entropy gain at the 32-byte minimum. - CSP 'unsafe-inline' for styles — required for Tailwind/Vite per-component <style> blocks; removing requires HTML report or component refactor outside H-1 scope. - Permissions-Policy header — dashboard uses no advanced browser APIs (camera, mic, geolocation); deferred until a real consumer needs it.	2026-04-25 16:40:21 +00:00
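A security-headers middleware in the spirit of the H-1 change, where an empty config field disables that header, could be sketched as follows. The type and function names (`headerConfig`, `defaultHeaders`, `securityHeaders`, `headerFor`) are hypothetical, and the Content-Security-Policy header is left out because the commit does not give its exact string. The four values shown are the ones the commit lists as defaults.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// headerConfig holds one field per header; an empty field disables it.
type headerConfig struct {
	HSTS               string
	FrameOptions       string
	ContentTypeOptions string
	ReferrerPolicy     string
}

// defaultHeaders returns the conservative defaults named in the commit.
func defaultHeaders() headerConfig {
	return headerConfig{
		HSTS:               "max-age=31536000; includeSubDomains",
		FrameOptions:       "DENY",
		ContentTypeOptions: "nosniff",
		ReferrerPolicy:     "no-referrer-when-downgrade",
	}
}

// securityHeaders sets each configured header before calling next, so the
// headers are present on error responses as well.
func securityHeaders(cfg headerConfig, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		set := func(name, value string) {
			if value != "" {
				w.Header().Set(name, value)
			}
		}
		set("Strict-Transport-Security", cfg.HSTS)
		set("X-Frame-Options", cfg.FrameOptions)
		set("X-Content-Type-Options", cfg.ContentTypeOptions)
		set("Referrer-Policy", cfg.ReferrerPolicy)
		next.ServeHTTP(w, r)
	})
}

// headerFor serves one request through the middleware in front of a
// handler that always answers 404, and returns the named response header.
func headerFor(cfg headerConfig, name string) string {
	notFound := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNotFound)
	})
	rec := httptest.NewRecorder()
	securityHeaders(cfg, notFound).ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	return rec.Header().Get(name)
}

func main() {
	fmt.Println(headerFor(defaultHeaders(), "X-Frame-Options"))
	cfg := defaultHeaders()
	cfg.FrameOptions = "" // operator behind a reverse proxy opts out of one header
	fmt.Printf("%q\n", headerFor(cfg, "X-Frame-Options"))
}
```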
certctl-bot	c2329b3603	feat(mcp): add claim_discovered + dismiss_discovered MCP tools (I-2 closure) Closes the LAST P1 in the 2026-04-24 audit (cat-i-b0924b6675f8). Pre-I-2 the README claimed "all API endpoints are exposed via MCP" but the discovered-certificate lifecycle (HTTP handlers ClaimDiscovered + DismissDiscovered at internal/api/handler/discovery.go:125,162) had zero MCP tool wrappers — operators using Claude / Cursor / similar MCP clients had no path to bring an out-of-band cert under management or to mark a benign discovery as not-of-interest without dropping to the REST API directly. The audit's count of 0 MCP discovery tools was correct: `grep -niE 'discover\|claim\|dismiss' internal/mcp/tools.go` returned only the pre-existing agent-retire tool's description text mentioning sentinel discovery agents — no actual discovery-tool registrations. Added in internal/mcp/types.go: - ClaimDiscoveredCertificateInput (id + managed_certificate_id) - DismissDiscoveredCertificateInput (id) Both follow the existing Go-doc / staticcheck convention (lead with the type name + brief; closure-rationale prose follows). Pinned by the existing L-1 staticcheck-fix lesson. Added in internal/mcp/tools.go (slotted at end of file, after certctl_auth_check): - certctl_claim_discovered_certificate — POST /api/v1/discovered-certificates/{id}/claim - certctl_dismiss_discovered_certificate — POST /api/v1/discovered-certificates/{id}/dismiss Both wrap the existing HTTP handlers via the generic c.Post helper. No backend changes; no openapi.yaml changes (both ops were already in the spec from earlier work). The audit's third name "acknowledge" is NOT closed: at recon, no notification-acknowledge HTTP handler exists in the API surface (grep across internal/api/handler/ returned zero hits for "acknowledge"). The audit appears to have mis-quoted; "acknowledge" isn't a real backend endpoint to wrap. If a future feature adds notification acknowledgement, register it in the same shape. 
Verification: - go build ./... — clean - go vet ./internal/mcp/... — clean - go test ./internal/mcp/... -count=1 — pass - golangci-lint v2.11.4 run ./... — 0 issues - MCP tool count went from 85 → 87 (verify via `grep -cE 'gomcp\.AddTool\(' internal/mcp/tools.go`) - S-1 + G-3 + D-1 + D-2 + B-1 + L-1 CI guardrails all still pass Audit findings closed: - cat-i-b0924b6675f8 (P1, MCP discovery completeness — last P1 in audit) This brings the audit to ZERO REMAINING P1s. Deferred follow-ups: - Notification acknowledge MCP tool — add when a notification-ack HTTP handler exists. Currently no such handler exists in the API surface; treat as a separate feature, not an MCP gap.	2026-04-25 16:33:56 +00:00
certctl-bot	45ae7716a8	fix(mcp): close staticcheck ST1021 on BulkRenew/BulkReassign input docstrings CI on the B-1 merge (`d2ebe1b`) failed at the golangci-lint step on two ST1021 errors against internal/mcp/types.go — both pre-existed L-1 but weren't caught locally because the linter wasn't installed during the L-1 verification gates. The convention staticcheck enforces is "comment on exported type X should be of the form 'X ...'" — i.e. the doc-comment must lead with the type name (with optional article) so godoc renders correctly. Before: // L-1 master closure (cat-l-fa0c1ac07ab5): bulk-renew MCP tool input. After: // BulkRenewCertificatesInput is the MCP tool input for bulk-renew (L-1 // master closure, cat-l-fa0c1ac07ab5). Mirrors BulkRevokeCertificatesInput // field-for-field minus Reason. Same shape applied to BulkReassignCertificatesInput. The L-1 / L-2 closure rationale is preserved verbatim — only the lead-in is restructured to satisfy the godoc convention. Verification: - golangci-lint v2.11.4 (matching CI) installed locally at /dev/shm/bin - golangci-lint run ./... --timeout 5m → 0 issues - internal/mcp/... package targeted lint → 0 issues This unblocks the B-1 CI run on master. No behavioral change; doc-only edit.	2026-04-25 15:48:39 +00:00
Shankar Reddy	fb4362e534	fix(api,web,mcp): add bulk-renew + bulk-reassign endpoints, drop client-side N×HTTP loops (L-1 master) Two audit findings, both category cat-l, both rooted in web/src/pages/CertificatesPage.tsx. Pre-L-1 the GUI looped per-cert HTTP calls — 100 selected certs = 100 sequential round-trips × ~50–200 ms each = a 5–20-second wedge during which the operator stared at a progress bar. Post-L-1 each workflow is a single POST. cat-l-fa0c1ac07ab5 [P1, primary] — bulk renew loop handleBulkRenewal: for/await triggerRenewal(id) cat-l-8a1fb258a38a [P2] — bulk reassign loop handleReassign: for/await updateCertificate(id, {owner_id}) The bulk-revoke endpoint (POST /api/v1/certificates/bulk-revoke + BulkRevocationCriteria/Result) already existed as the canonical shape in v2.0.x — L-1 ports that pattern to renew + reassign with per-action twists. Backend (Go) - internal/domain/bulk_renewal.go: BulkRenewalCriteria mirrors BulkRevocationCriteria (criteria + IDs modes); BulkRenewalResult envelope adds EnqueuedJobs[] for per-cert {certificate_id, job_id}; shared BulkOperationError type for all bulk paths. - internal/domain/bulk_reassignment.go: narrower shape — IDs-only, owner_id required, team_id optional. - internal/service/bulk_renewal.go::BulkRenewalService.BulkRenew: resolves criteria → status filter (Archived/Revoked/Expired/ RenewalInProgress all silent-skip) → per-cert status flip + job create. Keygen-mode-aware so jobs land in the same initial status as single-cert TriggerRenewal. Single bulk audit event per call, not N. - internal/service/bulk_reassignment.go::BulkReassignmentService. BulkReassign: validates owner_id upfront via the ErrBulkReassignOwnerNotFound typed sentinel — non-existent owner returns 400 before any cert is touched. Already-owned-by-target is silent-skip. Single bulk audit event. - internal/api/handler/{bulk_renewal,bulk_reassignment}.go: HTTP shape mirrors bulk_revocation.go. 
NOT admin-gated (renew is non- destructive; reassign is a common-case workflow). Sentinel-error → 400 mapping for OwnerNotFound. - internal/api/router/router.go: three bulk-* routes registered as a block before the {id} routes. HandlerRegistry gains BulkRenewal + BulkReassignment fields. - cmd/server/main.go: NewBulkRenewalService threads cfg.Keygen.Mode so bulk-renew jobs land in same initial state as single-cert path. Frontend - web/src/api/client.ts: bulkRenewCertificates(criteria) + bulkReassignCertificates(request) functions with full TS types. - web/src/pages/CertificatesPage.tsx: handleBulkRenewal + handleReassign rewritten from N-call loops to single calls. Result envelope drives progress UI; first-error message surfaced when total_failed > 0. Stale triggerRenewal + updateCertificate imports removed. MCP - internal/mcp/types.go: BulkRenewCertificatesInput + BulkReassignCertificatesInput. - internal/mcp/tools.go: certctl_bulk_renew_certificates + certctl_bulk_reassign_certificates tools mirroring the existing certctl_bulk_revoke_certificates pattern. OpenAPI - api/openapi.yaml: two new operations (bulkRenewCertificates, bulkReassignCertificates) under Certificates tag. Four new schemas (BulkRenewRequest, BulkRenewResult, BulkEnqueuedJob, BulkReassignRequest, BulkReassignResult). Tests - Domain: BulkRenewalCriteria.IsEmpty + BulkReassignmentRequest.IsEmpty IsEmpty contracts; JSON round-trip shape pinning. - Service: 7 BulkRenew tests (happy/criteria-mode/skips-RenewalInProgress/ skips-revoked-archived/empty-criteria-error/partial-failure/ audit-event-emitted) + 8 BulkReassign tests (happy/skips-already- owned/owner-required/empty-IDs/owner-not-found-sentinel/team-id- optional/team-id-provided/partial-failure/audit-event-emitted). 
- Handler: 5 BulkRenew handler tests (happy/empty-body-400/wrong- method-405/actor-attribution/service-error-500) + 6 BulkReassign handler tests (happy/empty-IDs-400/missing-owner-400/owner-not- found-400-via-sentinel/wrong-method-405/generic-error-500). CI guardrail - .github/workflows/ci.yml: 'Forbidden client-side bulk-action loop regression guard (L-1)'. Greps web/src/pages/CertificatesPage.tsx for 'for(...) await triggerRenewal(...)' and 'for(...) await updateCertificate(...)' patterns; comment lines exempt; test files exempt. Verified locally (passes against post-fix tree, fires against synthetic regression). Counts (deltas) - Routes: 119 → 121 (+2) - OpenAPI operations: 123 → 125 (+2) - MCP tools: 83 → 85 (+2) Performance - 100-cert bulk-renew: ~10s of sequential HTTP → ~100ms (99% latency reduction on the canonical operator workflow). - Audit event volume: 1 + N per operation → 1. Out of scope (deferred follow-ups) - cat-b-31ceb6aaa9f1: updateOwner/updateTeam/updateAgentGroup orphan (different shape — wire existing PUT to GUI, not new bulk endpoint). - cat-k-e85d1099b2d7: CertificatesPage no pagination UI. - cat-i-b0924b6675f8: MCP missing claim/dismiss/acknowledge (L-1 added 2 new tools but does not close that finding). Verification - go build / vet / test -short / test -short -race all clean. - web tsc --noEmit + vitest run all clean (296 tests passing). - OpenAPI YAML parses (89 paths, 125 ops). - L-1 CI guardrail passes against post-fix tree, fires against synthetic regression. No push.	2026-04-25 14:33:02 +00:00
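The single-POST bulk shape described for L-1 (one request, one result envelope carrying success and failure counters plus per-certificate errors) can be sketched as below. This is an illustration under assumptions: only `total_failed` and the idea of a per-cert error list appear in the commit text; the other field names, `bulkRenew`, and `demoRenew` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// bulkError ties one failure to its certificate ID.
type bulkError struct {
	CertificateID string `json:"certificate_id"`
	Message       string `json:"error"`
}

// bulkResult is a hypothetical result envelope for a bulk operation.
type bulkResult struct {
	TotalMatched   int         `json:"total_matched"`
	TotalSucceeded int         `json:"total_succeeded"`
	TotalFailed    int         `json:"total_failed"`
	Errors         []bulkError `json:"errors,omitempty"`
}

// bulkRenew applies renewOne to every ID server-side. A failure on one
// certificate is recorded and the loop continues, so the caller gets a
// partial-success report instead of an all-or-nothing error.
func bulkRenew(ids []string, renewOne func(id string) error) bulkResult {
	res := bulkResult{TotalMatched: len(ids)}
	for _, id := range ids {
		if err := renewOne(id); err != nil {
			res.TotalFailed++
			res.Errors = append(res.Errors, bulkError{CertificateID: id, Message: err.Error()})
			continue
		}
		res.TotalSucceeded++
	}
	return res
}

// demoRenew is a stand-in for the per-certificate renewal step; it fails
// for one well-known ID so the partial-failure path is exercised.
func demoRenew(id string) error {
	if id == "cert-bad" {
		return errors.New("issuer unavailable")
	}
	return nil
}

func main() {
	res := bulkRenew([]string{"cert-a", "cert-bad", "cert-c"}, demoRenew)
	fmt.Printf("matched=%d succeeded=%d failed=%d\n", res.TotalMatched, res.TotalSucceeded, res.TotalFailed)
	for _, e := range res.Errors {
		fmt.Printf("  %s: %s\n", e.CertificateID, e.Message)
	}
}
```

A client can then drive its progress UI from the counters and surface the first entry of the error list when the failure count is non-zero, which is the behavior the commit describes for CertificatesPage.tsx.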
Shankar Reddy	df4b56798c	fix(deploy,db,handler): close fresh-clone postgres init failure + 4 ride-along audit findings (U-3 master) GitHub #10 reopened: operator mikeakasully cloned v2.0.50 fresh and ran the canonical quickstart (docker compose -f deploy/docker-compose.yml up -d --build); postgres reported unhealthy indefinitely, dependent containers never started. Root cause: deploy/docker-compose.yml mounted a hand-curated subset of migrations/*.up.sql + seed.sql into postgres /docker-entrypoint-initdb.d/. Postgres applied them at initdb time. Once seed.sql referenced columns added by migrations after the mounted cutoff (e.g., policy_rules.severity from migration 000013), initdb crashed mid-seed and the container loop wedged. Two sources of truth (compose mount list vs in-tree migration ladder) diverged the moment a seed-touching migration shipped, and the only thing that fixed it was hand-editing the compose file every release. Fix: remove the dual source. Postgres boots empty; the server applies migrations + seed at startup via RunMigrations + RunSeed. Helm has used this pattern since day one (postgres-init emptyDir); compose now matches. Bundled with four ride-along audit findings whose fixes share the same schema/db code surface, so operators take the schema-change pain only once: cat-u-seed_initdb_schema_drift [P1, primary] — initdb-mount fix cat-o-retry_interval_unit_mismatch [P1] — column rename minutes→seconds cat-o-notification_created_at_dead_field [P2] — add column + populate cat-o-health_check_column_orphans [P1] — drop unwired columns cat-u-no_version_endpoint [P2] — add /api/v1/version Single migration (000017_db_coupling_cleanup) bundles the three schema changes under a DO $$ guard so re-application is safe; reduces operator-visible 'schema-change releases' from four to one. Backend - internal/repository/postgres/db.go: add RunSeed (baseline) + RunDemoSeed (gated by CERTCTL_DEMO_SEED). 
Both idempotent (ON CONFLICT DO NOTHING in every shipped INSERT) so repeated boots are safe; missing-file is no-op so custom packaging that strips seeds still boots cleanly. - cmd/server/main.go: invoke RunSeed (always) + RunDemoSeed (when flag set) immediately after RunMigrations. - internal/repository/postgres/notification.go: NotificationRepository.Create now sets created_at (with time.Now() fallback when caller leaves it zero); scanNotification reads it back; List + ListRetryEligible SELECT extended. - internal/repository/postgres/renewal_policy.go: column references updated to retry_interval_seconds across SELECT/INSERT/UPDATE sites. - internal/api/handler/version.go: new VersionHandler exposes {version, commit, modified, build_time, go_version} from runtime/debug.ReadBuildInfo() with ldflags-supplied Version override. - internal/api/router/router.go: register GET /api/v1/version through the no-auth chain (CORS + ContentType) alongside /health, /ready, /api/v1/auth/info. - cmd/server/main.go: add /api/v1/version to no-auth dispatch + audit ExcludePaths so rollout polling doesn't dominate the audit trail. - internal/config/config.go: add DatabaseConfig.DemoSeed + CERTCTL_DEMO_SEED env var. Migration - migrations/000017_db_coupling_cleanup.up.sql + .down.sql: (1) renewal_policies.retry_interval_minutes → retry_interval_seconds (DO \$\$ guard, idempotent re-application) (2) notification_events ADD COLUMN created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() (3) network_scan_targets DROP orphan health_check_enabled + health_check_interval_seconds - migrations/seed.sql: column reference updated to retry_interval_seconds. - migrations/seed_demo.sql: same column rename + applied at runtime now via RunDemoSeed (no longer initdb-mounted). Compose - deploy/docker-compose.yml: drop ALL initdb mounts (10 migration files + seed.sql); add start_period: 30s to postgres + certctl-server healthchecks to absorb the runtime migration + seed application window on first boot. 
- deploy/docker-compose.test.yml: same drop (+ ghost seed_test.sql mount removed; that file never existed); same healthcheck start_period. - deploy/docker-compose.demo.yml: replace seed_demo.sql initdb mount with CERTCTL_DEMO_SEED=true env var on certctl-server. Tests - internal/api/handler/version_handler_test.go: TestVersion_ReturnsBuildInfo, TestVersion_RejectsNonGet, TestVersion_LdflagsOverride. - internal/repository/postgres/seed_test.go: TestRunSeed_AppliesIdempotently, TestRunSeed_MissingFileIsNoOp, TestRunDemoSeed_AppliesIdempotently, TestMigration000017_RetryIntervalRename, TestMigration000017_NotificationCreatedAt, TestMigration000017_HealthCheckOrphansDropped (testcontainers, -short skips). - internal/repository/postgres/notification_test.go: TestNotificationRepository_CreatedAt_IsPersisted + TestNotificationRepository_CreatedAt_DefaultsToNow. CI guardrail - .github/workflows/ci.yml: new 'Forbidden migration mount in compose initdb (U-3)' step grep-fails the build if any migrations/.sql or seed.sql re-appears in /docker-entrypoint-initdb.d in any compose file. Catches future drift before a fresh-clone operator hits it. Spec / Docs - api/openapi.yaml: add /api/v1/version operation under Health tag. - docs/architecture.md: replace the 'initdb may run the same SQL' paragraph with a post-U-3 single-source-of-truth explanation. - CHANGELOG.md: full unreleased-section entry covering all 5 closures, breaking changes, and the new env var. Audit doc - coverage-gap-audit-2026-04-24-v5/unified-audit.md: add new P1 #14 cat-u-seed_initdb_schema_drift; flip the 4 ride-along findings to ✅ RESOLVED with closure prose pointing at this commit. Verification: build/vet/test -short -race all clean across all touched packages locally; govulncheck reports 0 vulnerabilities affecting our code; OpenAPI YAML parses; CI U-3 grep guardrail clears against the post-fix tree.	2026-04-25 13:29:23 +00:00
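The /api/v1/version handler described in U-3 (build metadata from runtime/debug.ReadBuildInfo with an ldflags-supplied Version override, GET only) might be sketched like this. The JSON keys follow the list in the commit; the field types, the `collectVersion` and `statusFor` helpers, and the exact override variable are assumptions of this sketch rather than the repository's code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"runtime/debug"
)

// Version may be set at link time, e.g. -ldflags "-X main.Version=v1.2.3".
// When empty, the module version from the build info is used instead.
var Version = ""

type versionInfo struct {
	Version   string `json:"version"`
	Commit    string `json:"commit"`
	Modified  bool   `json:"modified"`
	BuildTime string `json:"build_time"`
	GoVersion string `json:"go_version"`
}

// collectVersion gathers build metadata embedded by the Go toolchain.
func collectVersion() versionInfo {
	info := versionInfo{Version: Version}
	bi, ok := debug.ReadBuildInfo()
	if !ok {
		return info
	}
	info.GoVersion = bi.GoVersion
	if info.Version == "" {
		info.Version = bi.Main.Version
	}
	for _, s := range bi.Settings {
		switch s.Key {
		case "vcs.revision":
			info.Commit = s.Value
		case "vcs.modified":
			info.Modified = s.Value == "true"
		case "vcs.time":
			info.BuildTime = s.Value
		}
	}
	return info
}

// versionHandler serves the metadata as JSON and rejects non-GET methods.
func versionHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		w.Header().Set("Allow", http.MethodGet)
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(collectVersion())
}

// statusFor issues one in-memory request and returns the response status.
func statusFor(method string) int {
	rec := httptest.NewRecorder()
	versionHandler(rec, httptest.NewRequest(method, "/api/v1/version", nil))
	return rec.Code
}

func main() {
	fmt.Println("GET:", statusFor(http.MethodGet))
	fmt.Println("POST:", statusFor(http.MethodPost))
}
```

The vcs.* settings are only populated when the binary is built from a version-controlled checkout, so a sketch like this should tolerate empty commit and build-time values.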
Shankar	f9258e3ba6	fix(security,domain): redact Agent.APIKeyHash from JSON wire shape (G-2) Pre-G-2 internal/domain/connector.go::Agent::APIKeyHash was tagged `json:"api_key_hash"` and shipped on every wire surface that returned domain.Agent — GET /api/v1/agents (PagedResponse{Data: agents}), GET /api/v1/agents/{id}, GET /api/v1/agents/retired, and the POST /api/v1/agents registration response. Every authenticated client (browser, CLI --json, MCP tool calls) received the SHA-256-of-the-API-key string. The browser silently dropped it because web/src/api/types.ts omits the field, but CLI and MCP consumers print full JSON so the hash was visible there. Even though the value is a hash and not the plaintext key, shipping it gives an attacker an offline brute-force target if the API-key entropy is low (certctl doesn't enforce a minimum on operator- supplied keys), and there's no business reason for any client to ever receive it — the value is server-internal, used only for the lookup at internal/repository/postgres/agent.go::GetByAPIKey. (Audit: cat-s5-apikey_leak in coverage-gap-audit-2026-04-24-v5/unified-audit.md.) We chose the audit's recommended fix (json:"-") plus a defense-in-depth MarshalJSON plus a CI guardrail. Three layers because struct-tag redaction alone is one rebase away from being silently reverted, the custom MarshalJSON catches the case where a parent struct embeds Agent under a different tag, and the CI grep blocks reintroduction at the spec or frontend boundary even without a code review catching it. Files changed: Phase 1 — Domain redaction: - internal/domain/connector.go: APIKeyHash tag flipped from `json:"api_key_hash"` to `json:"-"`. New Agent.MarshalJSON with value receiver + type-alias-recursion-break that explicitly zeroes APIKeyHash on the marshal-time copy. 
Long-form docblock explaining the G-2 closure rationale + cross-references to service.RegisterAgent (populator), repository.AgentRepository:: GetByAPIKey (consumer), docs/architecture.md (DB-shape vs API-shape distinction), and the audit finding. Phase 2 — Domain tests (5 test functions): - internal/domain/connector_test.go: TestAgent_MarshalJSON_RedactsAPIKeyHash pins the marshal-boundary contract on a value receiver. ...RedactsViaPointer pins the Agent path. ...RedactsInSlice pins the []Agent path that the ListAgents handler actually emits via PagedResponse. ...DoesNotMutateReceiver pins the by-value-receiver contract so a future refactor that switches to pointer-receiver gets caught. ...RoundTrip pins the wire-shape guarantee that APIKeyHash is dropped on encode and cannot reappear on decode. Single sentinel value ("sha256:LEAKED-CREDENTIAL-DERIVATIVE- SENTINEL") flows through every fixture for grep-ability on regression. Phase 3 — Handler tests (4 test functions): - internal/api/handler/agent_handler_test.go: TestListAgents_DoesNotLeakAPIKeyHash, TestGetAgent_DoesNotLeakAPIKeyHash, TestRegisterAgent_DoesNotLeakAPIKeyHash, TestListRetiredAgents_DoesNotLeakAPIKeyHash. Each asserts (a) the literal substring "api_key_hash" is absent from the httptest-captured body, (b) the leak sentinel value is absent, (c) the non-leaked fields ARE present (sanity that the handler is serving real data, not just empty payloads). Shared sentinel "sha256:LEAKED-CREDENTIAL-DERIVATIVE- HANDLER-SENTINEL" so a single grep over a failing test's output identifies the leak surface immediately. Phase 4 — Spec / docs: - api/openapi.yaml: api_key_hash property REMOVED from Agent schema (was at line 3690). Inline G-2 comment naming the closure + the database-vs-API-shape distinction so a future spec edit doesn't silently re-introduce the field. - docs/architecture.md: ER-diagram block already documents the agents table including api_key_hash (DB shape — correct). 
Added a sibling note paragraph immediately below the diagram explaining that several columns are intentionally server-internal (api_key_hash redaction + issuers.config / deployment_targets.config encrypted shadow), with cross-references to the redaction enforcement site, the OpenAPI schema, the frontend interface, and the CI guardrail. - web/src/api/types.ts: Agent interface unchanged in shape (already omitted the field) but added a leading comment block explaining WHY the omission is intentional — stops a future frontend dev from "completing" the interface from the OpenAPI spec or the Go struct. Phase 5 — CI guardrail: - .github/workflows/ci.yml: new "Forbidden api_key_hash JSON-shape regression guard (G-2)" step. Scoped patterns catch the actual regression shapes — Go struct tag (json:"api_key_hash"), frontend interface declaration, OpenAPI schema property, YAML enum/array membership. Repository / migration / seed / service / integration / unit-test / comment lines exempt. Verified locally on the real tree (passes) and against 4 synthetic regression patterns (each fires the guardrail). Mirrors the G-1 pattern from .github/workflows/ ci.yml lines 47-108. Phase 5b — Sweep verification (no changes, results documented for the next reader): - internal/api/middleware/audit.go: doesn't serialize Agent struct; records request body only. No leak. - service.RegisterAgent audit-event payload: `map[string]interface{}{ "name": name, "hostname": hostname}` — name + hostname only, no APIKeyHash. No leak. - All 9 slog sites that mention agent: scalar attrs only ("agent_id", "error", "agent_hostname"), never the full struct. No leak. - internal/mcp, internal/cli, cmd/cli, cmd/mcp-server: zero matches for APIKeyHash / api_key_hash. Both pass server JSON verbatim, so the wire-side fix transitively closes them. Verification (all gates pass): - go build ./... - go vet ./... - go test -short ./... — every package green - go test -short -race ./internal/domain/... 
./internal/api/handler/... — clean - govulncheck ./... — no vulnerabilities in our code - helm lint deploy/helm/certctl/ — clean - helm template smoke render — succeeds - python3 yaml.safe_load on api/openapi.yaml — parses - OpenAPI Agent schema scan: no api_key_hash property - CI guardrail mirror: clean on real tree, fires on all 4 synthetic regression patterns - Domain pkg coverage: Agent.MarshalJSON 100%, connector.go total 87.5% - Handler pkg coverage: 79.2% Sample response body (httptest captured during verification, GET /api/v1/agents/{id} via the new handler test): {"id":"agent-demo","name":"demo-agent","hostname":"demo.host", "status":"Online","last_heartbeat_at":"2026-04-24T11:59:30Z", "registered_at":"2026-04-24T12:00:00Z","os":"linux", "architecture":"amd64","ip_address":"10.0.0.42", "version":"v2.0.49"} Note the absence of any api_key_hash key, even though the in-memory struct passed to the handler had APIKeyHash set to a sentinel. Out of scope (intentionally untouched): - internal/repository/postgres/agent.go SELECT/INSERT/UPDATE/scan paths and GetByAPIKey lookup — DB column stays, repo still populates the struct, auth lookup still works. The redaction is a marshal-boundary concern. - migrations/000001_initial_schema.up.sql + migrations/seed_.sql — DB schema and seed data unchanged. - internal/service/agent.go::RegisterAgent — service-side hashing and persistence unchanged. - Other domain types with potential credential-derivative fields (Issuer.Config, DeploymentTarget.Config, notifier configs). Not flagged by the audit; some are already protected (e.g., DeploymentTarget.EncryptedConfig []byte `json:"-"`). File a separate audit pass if recon surfaces additional leaks. - Per-resource DTO layer across every handler. Single audit finding, single domain type. - A separate possible follow-up: the v2 RegisterAgent endpoint doesn't return the plaintext API key to the agent, which may mean self-bootstrap via POST /api/v1/agents is broken. 
Verified during recon; out of scope for G-2; should be its own ticket. Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md §2 P1 cluster, cat-s5-apikey_leak Audit recommendation: 'json:"-" or API-response DTO excluding APIKeyHash' — went with the json:"-" + MarshalJSON defense-in-depth pair plus CI guardrail and structural docs.	2026-04-25 01:56:26 +00:00
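The json:"-" + MarshalJSON defense-in-depth pair described in the G-2 commit above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the Agent struct is trimmed to three fields, and the wire alias and encode helper are hypothetical names introduced for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Agent is a trimmed, hypothetical stand-in for the domain type.
type Agent struct {
	ID         string `json:"id"`
	Name       string `json:"name"`
	APIKeyHash string `json:"-"` // layer 1: the struct tag drops the field
}

// MarshalJSON uses a value receiver, so Agent, *Agent and []Agent
// all pass through it when encoded.
func (a Agent) MarshalJSON() ([]byte, error) {
	// wire has the same fields and tags but no MarshalJSON method,
	// which avoids infinite recursion.
	type wire Agent
	w := wire(a)
	w.APIKeyHash = "" // layer 2: clear the copy; the caller's value is untouched
	return json.Marshal(w)
}

// encode is a hypothetical helper that returns the JSON text of v.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	a := Agent{ID: "agent-demo", Name: "demo-agent", APIKeyHash: "sha256:SENTINEL"}
	fmt.Println(encode(a))          // value path
	fmt.Println(encode(&a))         // pointer path
	fmt.Println(encode([]Agent{a})) // slice path
}
```

Because the receiver is a value, clearing the field inside MarshalJSON cannot mutate the caller's struct, which is the property the DoesNotMutateReceiver test is described as pinning.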
Shankar	54a41603de	fix(security,config): remove unimplemented JWT auth-type, close silent downgrade (G-1) The pre-G-1 config validator accepted CERTCTL_AUTH_TYPE=jwt and the startup log faithfully echoed 'authentication enabled type=jwt'. Reasonable people read that and concluded JWT auth was on. It wasn't. The auth-middleware wiring at cmd/server/main.go unconditionally routed every request through the api-key bearer middleware regardless of cfg.Auth.Type. So CERTCTL_AUTH_TYPE=jwt quietly compared the incoming 'Authorization: Bearer <token>' against whatever string the operator put in CERTCTL_AUTH_SECRET — real JWT clients got 401, and operators who treated CERTCTL_AUTH_SECRET as a signing secret (because they thought they were configuring JWT) had effectively handed an attacker an api-key. A security finding masquerading as a config option. We chose the audit-recommended structural fix: remove the option, fail fast at startup, and add the gateway-fronting pattern as the documented forward path. Implementing JWT middleware would have meant jwks vs static-secret rotation, claim mapping, expiry enforcement, audience and issuer validation, key rollover semantics, and regression coverage at the same depth as the existing api-key path — a feature, not a fix. Operators who genuinely need JWT/OIDC front certctl with an authenticating gateway (oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium / Authelia) and run the upstream certctl with CERTCTL_AUTH_TYPE=none. Same shape works on docker-compose and Helm. The change is comprehensive across 7 phases — every surface that mentioned 'jwt' as a certctl-auth-type is updated, plus structural backstops (typed enum, runtime guard, helm template validation, CI grep guard) so the lie can't reappear. Files changed: Phase 1 — production code (typed enum + jwt removal): - internal/config/config.go: AuthType typed alias + AuthTypeAPIKey / AuthTypeNone constants + ValidAuthTypes() helper. 
Validate() routes literal 'jwt' through a dedicated multi-line diagnostic naming the authenticating-gateway pattern, then cross-checks against ValidAuthTypes(). Secret-required branch simplified to api-key-only. Field comment on AuthConfig.Type rewritten to drop jwt and point at the gateway pattern. - internal/api/middleware/middleware.go: AuthConfig.Type field comment references the typed config.AuthType constants. - internal/api/handler/health.go: same treatment for HealthHandler.AuthType. - cmd/server/main.go: defense-in-depth runtime switch immediately after config.Load() — exits 1 on any unsupported auth-type that bypassed the validator. Auth-disabled startup log explicitly names the authenticating-gateway pattern. Phase 2 — tests (Red→Green, contract pinning): - internal/config/config_test.go: TestValidate_JWTAuth_RejectedDedicated (two table rows pinning the dedicated G-1 error fires regardless of whether Secret is set), TestValidAuthTypesDoesNotContainJWT (property guard against future re-introduction), TestValidAuthTypesIsExactly_APIKey_None (allowed-set contract), TestValidate_GenericInvalidAuthType (pins non-jwt invalid values still hit the generic invalid-auth-type error). Removed the prior TestValidate_JWTAuth_MissingSecret happy-path since its premise is inverted post-G-1. - internal/api/handler/health_test.go: removed TestAuthInfo_ReturnsAuthType_JWT (which baked the silent-downgrade lie into the regression suite). Pre-existing _APIKey test continues to cover the api-key happy path. Phase 3 — spec, docs, env templates: - api/openapi.yaml: auth_type enum dropped to [api-key, none] with inline comment naming the G-1 closure. - .env.example (root): CERTCTL_AUTH_TYPE comment block rewritten to drop jwt and point at the gateway pattern; secret-required conditional simplified to api-key-only. 
- docs/architecture.md: middleware-stack bullet rewritten to drop the JWT mention; new H3 'Authenticating-gateway pattern (JWT, OIDC, mTLS)' section explaining the design rationale and listing oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium / Authelia / Caddy forward_auth / Apache mod_auth_openidc / nginx auth_request as the standard fronting options. - docs/upgrade-to-v2-jwt-removal.md (new ~125 lines): migration guide with preconditions, what-changes, both recovery paths, complete docker-compose oauth2-proxy walkthrough, Traefik ForwardAuth and Envoy ext_authz patterns, rollback posture. Phase 4 — Helm chart (template validation + docs): - deploy/helm/certctl/templates/_helpers.tpl: new certctl.validateAuthType helper mirroring the existing certctl.tls.required pattern. Fails template render on any server.auth.type outside {api-key, none} with a multi-line diagnostic. - deploy/helm/certctl/templates/server-deployment.yaml, server-configmap.yaml, server-secret.yaml: invoke the helper at the top of each template that depends on .Values.server.auth.type. - deploy/helm/certctl/values.yaml: auth: block comment expanded with the G-1 rationale and gateway-pattern cross-reference. - deploy/helm/CHART_SUMMARY.md: server.auth.type table row now surfaces the allowed set and points at the upgrade doc. - deploy/helm/certctl/README.md: new 'JWT / OIDC via authenticating gateway' section with a Kubernetes-flavored oauth2-proxy + certctl walkthrough. Phase 5 — release surface: - CHANGELOG.md: new [unreleased] top entry with Breaking / Removed / Added / Changed sections; explicit pointer at docs/upgrade-to-v2-jwt-removal.md from the Breaking subsection. Phase 6 — CI guardrail: - .github/workflows/ci.yml: new 'Forbidden auth-type literal regression guard (G-1)' step. Scoped patterns catch the actual regression shapes (map literal, slice literal, switch case, OpenAPI enum, env-file default, AuthType('jwt') cast). 
Comments and the dedicated rejection branch are intentionally exempt; connector-package JWT references (Google OAuth2 / step-ca) are exempt as out-of-scope external protocols. Verified locally: the guard passes on the actual tree and fires on all 4 synthetic regression patterns. Out of scope (explicitly untouched): - internal/connector/discovery/gcpsm/gcpsm.go — Google OAuth2 service-account JWT (external protocol). - internal/connector/issuer/googlecas/googlecas.go — same. - internal/connector/issuer/stepca/stepca.go — step-ca's provisioner one-time-token JWT for /sign API. - docs/test-env.md, docs/connectors.md, docs/features.md — describe external CAs' use of JWT, not certctl's auth shape. - Implementing actual JWT middleware. Feature, not a fix. Verification (all gates pass): - go build ./... — clean - go vet ./... — clean - go test -short ./... — every package green - go test -short -race ./internal/config/... ./internal/api/... — clean - govulncheck ./... — no vulnerabilities in our code - helm lint deploy/helm/certctl/ — clean - helm template with auth.type=api-key — renders OK - helm template with auth.type=none — renders OK - helm template with auth.type=jwt — fails with validateAuthType diagnostic (exit 1) - python3 yaml.safe_load on api/openapi.yaml — parses - CI guardrail mirror — clean on real tree, fires on all 4 synthetic regression patterns - Smoke test: 'CERTCTL_AUTH_TYPE=jwt ./certctl-server' exits non-zero with: 'Failed to load configuration: CERTCTL_AUTH_TYPE=jwt is no longer accepted (G-1 silent auth downgrade): no JWT middleware ships with certctl. To use JWT/OIDC, run an authenticating gateway (oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium) in front of certctl and set CERTCTL_AUTH_TYPE=none on the upstream. See docs/architecture.md "Authenticating-gateway pattern" and docs/upgrade-to-v2-jwt-removal.md for the migration walkthrough' config pkg coverage: ValidAuthTypes 100%, Validate 94.7%, total 75.5%. 
Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md §2 P1 cluster, cat-g-jwt_silent_auth_downgrade Audit recommendation followed verbatim: 'Remove jwt from validAuthTypes until middleware ships'.	2026-04-25 00:22:23 +00:00
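The typed-enum validation described in the G-1 commit above might look roughly like the sketch below. AuthType, AuthTypeAPIKey, AuthTypeNone and ValidAuthTypes are names taken from the commit message; validateAuth, ErrJWTRemoved and the exact error wording are assumptions made for illustration, and the real Validate() likely differs in shape.

```go
package main

import (
	"errors"
	"fmt"
)

// AuthType is a typed alias so stray string literals stand out in review.
type AuthType string

const (
	AuthTypeAPIKey AuthType = "api-key"
	AuthTypeNone   AuthType = "none"
)

// ValidAuthTypes returns the complete allowed set.
func ValidAuthTypes() []AuthType {
	return []AuthType{AuthTypeAPIKey, AuthTypeNone}
}

// ErrJWTRemoved is a hypothetical sentinel for the dedicated rejection branch.
var ErrJWTRemoved = errors.New("CERTCTL_AUTH_TYPE=jwt is no longer accepted: " +
	"run an authenticating gateway in front of certctl and set CERTCTL_AUTH_TYPE=none")

// validateAuth is a hypothetical, simplified version of the config check.
func validateAuth(t AuthType, secret string) error {
	// The removed value gets its own diagnostic, regardless of the secret.
	if t == "jwt" {
		return ErrJWTRemoved
	}
	for _, v := range ValidAuthTypes() {
		if t != v {
			continue
		}
		// Only the api-key mode requires a secret.
		if t == AuthTypeAPIKey && secret == "" {
			return errors.New("auth secret is required when auth type is api-key")
		}
		return nil
	}
	return fmt.Errorf("invalid auth type %q", t)
}

func main() {
	fmt.Println(validateAuth("jwt", "s3cret"))
	fmt.Println(validateAuth(AuthTypeNone, ""))
	fmt.Println(validateAuth("oidc", ""))
}
```

Checking the removed value before the allowed-set lookup keeps the migration hint separate from the generic invalid-value error, which matches the two distinct error paths the commit's tests are said to pin.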
Shankar	67f352db69	fix(db): emit volume-state guidance on postgres auth failure (U-1, #10) The shipped quickstart instructs operators to copy deploy/.env.example to deploy/.env, edit POSTGRES_PASSWORD, and run docker compose up. On the first boot of a fresh checkout this works. On the second boot — i.e., when an operator first booted with the default POSTGRES_PASSWORD=certctl, then edited .env and re-ran up — the certctl-server container picks up the new password (env interpolated at every container start) but postgres does not. The postgres docker-entrypoint runs initdb only when the data dir is empty; on subsequent boots the persistent named volume postgres_data is non-empty so pg_authid retains the password baked in on first boot. The server connects with the new credentials, postgres rejects them, and the operator sees an opaque `pq: password authentication failed for user "certctl"` in the server log with no pointer to the actual cause. New-operator onboarding gets blocked on the documented production path. Why a doc fix alone is not sufficient. Operators don't reread the docs after a successful first boot — the trap fires on the second up, when they think they've already learned the system. The opaque pq error is indistinguishable in the log from a typo'd password or a misconfigured secret store. The diagnostic has to fire at the moment the failure is observed. Why we don't try to fix the bootstrap. The env-vs-pg_authid divergence is intrinsic to how the official postgres image bootstraps (see docker-entrypoint.sh: initdb runs only if PGDATA is empty). Switching to a bind mount or ephemeral volume breaks the production path; switching to POSTGRES_PASSWORD_FILE + ALTER ROLE adds operator surface without eliminating the divergence. The ergonomic fix is to surface the failure mode loudly, with both remediation paths, at the exact log line where it becomes visible. Two remediation paths, surfaced together. 
Destructive: `docker compose -f deploy/docker-compose.yml down -v && up -d --build` — wipes the postgres volume so initdb re-runs with the new env value. Use this on demos / first-time setup where data loss is acceptable. Non-destructive: `docker compose exec postgres psql -U certctl -c "ALTER ROLE certctl PASSWORD '<new>';"` followed by a server restart with the matching POSTGRES_PASSWORD. Use this on any environment that holds data you want to keep. Surfacing both means the operator can pick based on their environment without us assuming. Files changed: - internal/repository/postgres/db.go — extract wrapPingError(err) helper. errors.As against pq.Error; on SQLSTATE 28P01 (invalid_password) emit the multi-line guidance preserving the %w wrap chain. Non-28P01 errors retain the original `failed to ping database: %w` shape so transient connection-refused / timeout paths don't get noisy. Add pgErrInvalidPassword = "28P01" constant. Convert blank `_ "github.com/lib/pq"` import to direct import (driver registration still works via init()) so we can name the pq.Error type at compile time. NewDB now calls wrapPingError(err) instead of inlining the wrap. - internal/repository/postgres/db_test.go (new) — 4 internal-package unit tests covering wrapPingError. AuthFailureGuidance pins the contract substrings ("SQLSTATE 28P01", "POSTGRES_PASSWORD", "first boot", "down -v", "ALTER ROLE"). NonAuthErrorPreservesOriginalWrap pins the no-leak contract for SQLSTATE 08006 (connection_failure). NonPqErrorPreservesOriginalWrap pins the network-level path. NilReturnsNil pins defensive contract. All run in -short without testcontainers — package postgres (internal) so the unexported helper is callable directly. - docs/quickstart.md — `> Warning:` callout immediately after the `cp deploy/.env.example deploy/.env` block at lines 56-61. Names the trap, names the SQLSTATE, gives both remediation paths. Uses the in-file `> Note:` blockquote convention. 
- deploy/ENVIRONMENTS.md — `Stateful volume — first-boot password binding (U-1)` paragraph appended to the Postgres expert-note block. Explains the env-vs-pg_authid divergence, points at wrapPingError as the runtime diagnostic, lists both remediation paths. Uses the in-file `Expert note:` convention. Out of scope (separate follow-ups): - deploy/helm/certctl/templates/postgres-statefulset.yaml has the same root cause via PVC retention. The wrapPingError diagnostic covers the Helm path because the same NewDB code runs at server startup; the Helm-specific doc warning lands separately. - /.env.example at repo root (line 16 hardcodes the password literally inside CERTCTL_DATABASE_URL rather than interpolating) — adjacent trap, separate fix. - examples/{acme-nginx,private-ca-traefik,step-ca-haproxy,multi-issuer,acme-wildcard-dns01}/docker-compose.yml all carry the pattern. The diagnostic covers them; targeted doc warnings are scoped to the canonical quickstart + ENVIRONMENTS docs. Out of consideration: - Switch to bind mount / ephemeral volume — breaks the production path. - POSTGRES_PASSWORD_FILE + Docker secret + ALTER ROLE rotation — adds operator surface without fixing the env-vs-pg_authid divergence. Verification (all passing): - go build ./... - go vet ./... - go test -short -race ./internal/repository/postgres/ — 4/4 new tests pass plus existing tests - go test -short ./... — every package green - govulncheck ./... — no vulnerabilities in our code - wrapPingError coverage 100%; postgres pkg total unchanged in shape (NewDB/RunMigrations were 0% pre-fix, still 0% post-fix; new helper adds 100%-covered statements) Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md §2 P1 cluster, cat-u-quickstart_postgres_password_volume_trap GitHub Issue #10 (mikeakasully)	2026-04-24 23:21:26 +00:00
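The wrapPingError helper described in the U-1 commit above could be sketched along these lines. To keep the example free of third-party imports, pgError is a hypothetical stand-in for the driver's error type (the commit names pq.Error), and the guidance text is paraphrased rather than copied from the repository.

```go
package main

import (
	"errors"
	"fmt"
)

// pgError is a hypothetical stand-in for the driver error type; only the
// SQLSTATE code and message matter for this sketch.
type pgError struct {
	Code    string
	Message string
}

func (e *pgError) Error() string { return "pq: " + e.Message }

// SQLSTATE for invalid_password, as named in the commit message.
const pgErrInvalidPassword = "28P01"

// wrapPingError adds volume-state guidance only for auth failures and keeps
// the %w chain so callers can still inspect the underlying error.
func wrapPingError(err error) error {
	if err == nil {
		return nil
	}
	var pe *pgError
	if errors.As(err, &pe) && pe.Code == pgErrInvalidPassword {
		return fmt.Errorf("failed to ping database (SQLSTATE 28P01): the postgres "+
			"volume keeps the password set on first boot, so a changed POSTGRES_PASSWORD "+
			"is not applied; either wipe the volume (down -v) or run ALTER ROLE to "+
			"match the new value: %w", err)
	}
	// Every other failure keeps the original, quieter shape.
	return fmt.Errorf("failed to ping database: %w", err)
}

func main() {
	fmt.Println(wrapPingError(&pgError{Code: "28P01", Message: "password authentication failed"}))
	fmt.Println(wrapPingError(&pgError{Code: "08006", Message: "connection failure"}))
	fmt.Println(wrapPingError(errors.New("dial tcp: connection refused")))
}
```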
Shankar	8b6415035c	test(repository): close L-1 integration-coverage gap for HealthCheck + RenewalPolicy The coverage-gap audit flagged L-1 (P2): `HealthCheckRepository` (453 LOC, 11 methods) and `RenewalPolicyRepository` (289 LOC, 5 methods post-G-1 — the audit's "92 lines, 2 methods" figure was stale) ship to production with zero live-DB integration coverage. The existing `repo_test.go` header self-documents the gap: "15 of 17 PostgreSQL repository files". Operationally load-bearing piece: M48's scheduler calls `HealthCheckRepository.ListDueForCheck` every tick to drive continuous TLS health monitoring. A silent SQL regression there — wrong INTERVAL math, NULL-handling slip, lost ORDER BY — would fail open: operator adds endpoint → scheduler never picks it up → endpoint degrades in production → no alert. The loop continues ticking and logs "processed 0 endpoints" normally, so the failure mode is operationally invisible. Closure shape (test-only; no production code touched): - internal/repository/postgres/health_check_test.go (new file, 7 tests) · TestHealthCheckRepository_CRUD · TestHealthCheckRepository_GetByEndpoint · TestHealthCheckRepository_List_Filters · TestHealthCheckRepository_ListDueForCheck (the load-bearing one — seeds four rows with differing last_checked_at+interval relationships to NOW() plus one NULL-last_checked_at row, asserts the correct subset returns and ORDER BY last_checked_at ASC NULLS FIRST holds) · TestHealthCheckRepository_RecordHistory_GetHistory · TestHealthCheckRepository_PurgeHistory · TestHealthCheckRepository_GetSummary - internal/repository/postgres/renewal_policy_test.go (new file, 3 tests) · TestRenewalPolicyRepository_CRUD (exercises auto-generated rp-<slug(name)> PK, JSONB round-trip of [30,14,7,0] thresholds, UpdatedAt monotonic advance, ORDER BY name for List) · TestRenewalPolicyRepository_DuplicateName (asserts errors.Is(err, repository.ErrRenewalPolicyDuplicateName) on both Create-name-unique and Update-name-unique 
collision paths, the pg 23505 sentinel mapping) · TestRenewalPolicyRepository_DeleteInUse (raw-INSERTs a managed_certificates row FK'ing the policy, asserts errors.Is(err, repository.ErrRenewalPolicyInUse) from pg 23503 ON DELETE RESTRICT, cleans up, then asserts not-found surfaces distinctly) - internal/repository/postgres/repo_test.go (one-line header flip) "covering 15 of 17 ... repository files" → "17 of 17"; added cross-reference pointing readers at the two sibling files. Both new files use the existing getTestDB(t) + schema-per-test-isolation convention and skip via testing.Short() in CI, matching M26 TICKET-003 scaffolding byte-for-byte. Repository/postgres is not in the CI coverage-gate path (grep -nE "internal/repository/postgres" .github/workflows/ci.yml → no hits), so adding test-only files cannot regress gated coverage elsewhere. Verification gates run locally (sandbox without Docker, so the -short skip gate itself is what's exercised; operator runs the testcontainer path locally): 1. go vet ./... — clean 2. go build ./... — clean 3. go test -short -count=1 ./... — clean 4. go test -race -short ./internal/repository/postgres/... — clean 5. staticcheck — absent; CI checkset holds 6. govulncheck — skipped; test-only, no deps 7. per-layer coverage no-regression — N/A; repo/pg not gated 8. tsc --noEmit — N/A; no frontend change 9. vitest run — N/A; no frontend change 10. vite build — N/A; no frontend change 11. OpenAPI lint — N/A; no spec change No migration, no interface change, no production code diff. The RenewalPolicyRepository drift between audit ("92 lines, 2 methods") and HEAD (289 lines, 5 methods post-G-1) is documented honestly in the audit report's Resolution Log, not papered over. Closes: coverage-gap-audit L-1 (P2)	2026-04-20 20:39:06 +00:00
Shankar	dfa02764a5	D-1: correct certctl-cli status endpoint path (/api/v1/health -> /health) The CLI's GetStatus() was issuing GET /api/v1/health, but the real liveness route is GET /health at internal/api/router/router.go:76 (mounted at root, not under /api/v1/). Every 'certctl-cli status' invocation 404'd since M16b. The regression was masked because TestClient_GetStatus encoded the same wrong path on both sides of the contract -- the mock server also dispatched on /api/v1/health -- so the production request matched the test's buggy dispatch and the green bar hid the bug. Two-line fix: - internal/cli/client.go:615: "/api/v1/health" -> "/health" - internal/cli/client_test.go:296: mock dispatch to match Red receipt captured before the green fix: with the test fixture corrected but production still wrong, TestClient_GetStatus fails 'parsing response: unexpected end of JSON input' (the client falls through the mock's if/else to the default 200 OK empty body and the JSON decoder chokes). After the production edit the test passes. GetStatus()'s response decoder is already compatible with the real /health shape (graceful 'ok' check on health["status"], optional health["timestamp"]). No interface change. No migration. No frontend change. No OpenAPI delta -- /health is a root-level liveness probe, not part of the /api/v1/ surface.	2026-04-20 19:40:58 +00:00
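The masking effect described in the D-1 commit above (a mock that shares the client's wrong path) can be avoided with a fixture that serves only the real route and returns 404 for anything else. The sketch below is an assumption-laden illustration: muxTransport, newStrictClient and the getStatus signature are hypothetical, and the in-memory transport is used only so the example opens no sockets.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// muxTransport answers requests from an in-memory mux instead of the network.
type muxTransport struct{ mux *http.ServeMux }

func (t muxTransport) RoundTrip(r *http.Request) (*http.Response, error) {
	rec := httptest.NewRecorder()
	t.mux.ServeHTTP(rec, r)
	return rec.Result(), nil
}

// newStrictClient registers only the root-level /health route; any other
// path falls through to the mux's 404 handler rather than a default 200.
func newStrictClient() *http.Client {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"status":"ok"}`)
	})
	return &http.Client{Transport: muxTransport{mux: mux}}
}

// getStatus is a hypothetical, simplified status call; the path is a
// parameter here only so both the right and the wrong route can be shown.
func getStatus(c *http.Client, baseURL, path string) (string, error) {
	resp, err := c.Get(baseURL + path)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d for %s", resp.StatusCode, path)
	}
	var body map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return "", fmt.Errorf("parsing response: %w", err)
	}
	status, _ := body["status"].(string)
	return status, nil
}

func main() {
	c := newStrictClient()
	fmt.Println(getStatus(c, "https://certctl.invalid", "/health"))
	fmt.Println(getStatus(c, "https://certctl.invalid", "/api/v1/health"))
}
```

With a fixture like this, a client that drifts back to the /api/v1/ prefix fails on the status check instead of passing against a mock that encodes the same mistake.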
shankar	e9bbf33193	G-1: renewal-policies API + frontend FK-drift fix Three frontend call sites (OnboardingWizard.tsx:603, CertificatesPage.tsx:52, CertificateDetailPage.tsx:169) populated the renewal_policy_id dropdown from getPolicies() — the compliance-rule endpoint returning pol-* IDs — which violated the FK managed_certificates.renewal_policy_id REFERENCES renewal_policies(id) ON DELETE RESTRICT. Create would fail pg 23503 at insert. Backend (new): - RenewalPolicyRepository CRUD + ListAll/ExistsByID (pg 23503 → ErrRenewalPolicyInUse → HTTP 409; pg 23505 → ErrRenewalPolicyDuplicateName → HTTP 409) - RenewalPolicyService with repo-only constructor. Service sentinels var-alias the repo sentinels so errors.Is walks across layers. - RenewalPolicyHandler with validation bounds: name 1–255; renewal_window_days [1,365] default 30; max_retries [0,10] not defaulted; retry_interval_seconds [60,86400] default 3600; alert_thresholds_days [0,365] default [30,14,7,0]. Auto-generated IDs rp-<slug(name)>. - Router registers 5 routes under /api/v1/renewal-policies[/{id}]. Frontend: - CertificatesPage/CertificateDetailPage/OnboardingWizard now call getRenewalPolicies() and render rp-* IDs. - client.ts adds getRenewalPolicies/createRenewalPolicy/updateRenewalPolicy/deleteRenewalPolicy. types.ts adds the RenewalPolicy shape. OpenAPI: RenewalPolicies tag + 5 operations + 3 schemas (RenewalPolicy, RenewalPolicyCreateRequest, RenewalPolicyUpdateRequest). 409 responses on create/update duplicate-name and delete FK-in-use. No migration — renewal_policies table already exists from the initial schema (000001). Tests: - internal/service/renewal_policy_test.go: CRUD + validation + sentinel error wrapping. - internal/api/handler/renewal_policy_handler_test.go: handler endpoint contracts including 400/404/409. - web/src/api/client.test.ts: 4 subtests covering the 4 new API functions. 
Phase 3 gates all green: go vet, build, short tests, race tests (service/handler/router/scheduler), staticcheck (G-1 packages), govulncheck (0 reachable), coverage (service 69.7%, handler 79.0%, domain 86.9%, middleware 80.6% — all above thresholds), tsc, vitest (256 passed), vite build, OpenAPI structural validation.	2026-04-20 18:53:01 +00:00
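The sentinel chain in the renewal-policies commit above (pg 23503/23505 mapped to repository sentinels, service sentinels var-aliased to them, handler answering 409) can be illustrated with the sketch below. ErrRenewalPolicyInUse and ErrRenewalPolicyDuplicateName are names from the commit; mapPgCode, statusFor and the svc-prefixed aliases are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Repository-layer sentinels (names from the commit message).
var (
	ErrRenewalPolicyInUse         = errors.New("renewal policy is referenced by certificates")
	ErrRenewalPolicyDuplicateName = errors.New("renewal policy name already exists")
)

// Service-layer aliases: same error values, so errors.Is matches across layers.
var (
	svcErrInUse         = ErrRenewalPolicyInUse
	svcErrDuplicateName = ErrRenewalPolicyDuplicateName
)

// mapPgCode is a hypothetical repository helper translating a SQLSTATE code
// into a sentinel while keeping the driver message for the logs.
func mapPgCode(code, driverMsg string) error {
	switch code {
	case "23503": // foreign_key_violation
		return fmt.Errorf("%w: %s", ErrRenewalPolicyInUse, driverMsg)
	case "23505": // unique_violation
		return fmt.Errorf("%w: %s", ErrRenewalPolicyDuplicateName, driverMsg)
	default:
		return errors.New(driverMsg)
	}
}

// statusFor is a hypothetical handler helper choosing the HTTP status.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, svcErrInUse), errors.Is(err, svcErrDuplicateName):
		return http.StatusConflict
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	fmt.Println(statusFor(mapPgCode("23503", "fk violation")))
	fmt.Println(statusFor(mapPgCode("23505", "unique violation")))
	fmt.Println(statusFor(mapPgCode("08006", "connection failure")))
}
```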
certctl	4dc0e5c44e	F-001/F-002/F-003: CRL prefix-scan, digest error sanitization, ctx-aware sleeps F-001 (P3): GenerateDERCRL scoped to issuer via composite index - Add RevocationRepository.ListByIssuer leveraging migration 000012's idx_certificate_revocations_issuer_serial composite index as a prefix-scan target. Previously CAOperationsSvc.GenerateDERCRL called ListAll() and filtered by IssuerID in Go — O(total revocations) regardless of how many revocations belonged to the target issuer. - Rewrite GenerateDERCRL to call ListByIssuer(ctx, issuerID) so PostgreSQL drives a prefix scan of the composite index. Drops the in-memory filter. - New regression test in ca_operations_test.go asserts the CRL hot path invokes ListByIssuer exactly once and never ListAll, and that the issuerID is threaded through correctly. F-002 (P3): digest.go admin-auth endpoints no longer leak internal errors - PreviewDigest (GET /api/v1/digest/preview) and SendDigest (POST /api/v1/digest/send) previously wrote err.Error() into the HTTP response body on 500s. Replace with slog.Error server-side logging plus a generic "internal error" response body, matching the house pattern in certificates.go and export.go. F-003 (P4): three blocking time.Sleep sites now honor ctx cancellation - internal/connector/issuer/acme/acme.go:672 (DNS-01 propagation wait) now runs under a select{case <-ctx.Done(): CleanUp + return ctx.Err(); case <-time.After(d):} so graceful shutdown doesn't get stuck behind the propagation delay. - internal/connector/issuer/acme/acme.go:786 (dns-persist-01 propagation wait) same pattern, returns ctx.Err() on cancel. - cmd/agent/main.go:272 (polling backoff inside the heartbeat loop) now wraps the sleep in select{case <-ctx.Done(): continue; case <-time.After(backoff):} so the outer <-ctx.Done() case on the parent loop fires cleanly. Verification: build, vet, and race-enabled short tests green across all 55+ packages. govulncheck reports zero vulnerabilities in the code path. 
No migration needed — F-001 reuses the existing 000012 composite index. No frontend changes.	2026-04-20 16:51:52 +00:00
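The ctx-aware sleep pattern from F-003 above can be sketched as a small helper. sleepCtx and runDemo are hypothetical names; the commit describes an inline select with time.After, while this sketch uses time.NewTimer so the timer is released promptly on cancellation, which is a substitution made here and not something the commit states.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sleepCtx waits for d or until ctx is cancelled, whichever comes first.
// It returns ctx.Err() on cancellation and nil when the full wait elapsed.
func sleepCtx(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop() // release the timer if we leave early
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return nil
	}
}

// runDemo exercises both outcomes: an already-cancelled context and a
// short wait that is allowed to finish.
func runDemo() (cancelled, completed error) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	cancelled = sleepCtx(ctx, time.Hour) // returns at once, not after an hour
	completed = sleepCtx(context.Background(), 10*time.Millisecond)
	return cancelled, completed
}

func main() {
	cancelled, completed := runDemo()
	fmt.Println("cancelled wait:", cancelled)
	fmt.Println("completed wait:", completed)
}
```

A caller such as a DNS propagation wait would run its cleanup before returning the error, as the commit describes for the DNS-01 path.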
Shankar	3155b9475f	v2.0.47: HTTPS Everywhere — TLS-only control plane, agents/CLI/MCP Breaking change release. Plaintext HTTP listener removed. The certctl control plane now terminates TLS 1.3 on :8443 via http.Server.ListenAndServeTLS. No CERTCTL_TLS_ENABLED=false escape hatch. No dual-listener mode. One-step cutover per docs/upgrade-to-tls.md. Server - cmd/server/tls.go: certHolder with SIGHUP hot-reload + atomic cert swap, buildServerTLSConfig (TLS 1.3 min, GetCertificate callback), preflightServerTLS validation - cmd/server/main.go: ListenAndServeTLS in place of ListenAndServe, watchSIGHUP wiring, cert/key path config threading - tls_test.go: 418-line regression coverage of reload, preflight, callback behavior, SAN validation Config - CERTCTL_TLS_CERT_PATH / CERTCTL_TLS_KEY_PATH (required) - Plaintext rejection: agents/CLI/MCP pre-flight-fail on http:// URLs with a pointer to docs/upgrade-to-tls.md Agents, CLI, MCP - All three pre-flight-reject http:// URLs with fail-loud diagnostic - CERTCTL_SERVER_CA_BUNDLE_PATH for private-CA trust - CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY for dev-only bypass (loud warning on startup) - install-agent.sh emits both vars as commented template lines docker-compose - certctl-tls-init sidecar generates SAN-valid self-signed cert into deploy/test/certs/ on first boot - All demo-stack curls pin against ca.crt with --cacert Helm chart - Three TLS provisioning modes, exactly one required: - server.tls.existingSecret (operator-supplied) - server.tls.certManager.enabled (cert-manager integration) - server.tls.selfSigned.enabled (eval only — not for production) - server-certificate.yaml template for cert-manager mode - helm install without a TLS source fails at template render with a pointer to docs/tls.md CI - .github/workflows/ci.yml Helm Chart Validation step renders the chart in both existingSecret and cert-manager modes, plus an inverse guard-regression test that asserts helm template MUST refuse to render when no TLS source is 
configured. Previously the single `helm template` invocation hit the certctl.tls.required fail-loud guard and exit-1'd CI. Four invocations now: lint (existingSecret), template (existingSecret), template (cert-manager), template (no args — must fail). Integration tests - deploy/test/integration_test.go stands up the Compose stack over HTTPS, extracts the CA bundle, and exercises every certctl API over https://localhost:8443 - All 34 integration subtests green (per Phase 8 local CI-parity) Documentation - New: docs/tls.md (provisioning patterns, rotation, SIGHUP reload) - New: docs/upgrade-to-tls.md (one-step cutover, no-downgrade warnings, fleet-roll sequencing) - CHANGELOG.md: v2.2.0 "HTTPS Everywhere — The Irony" entry (file heading unchanged; release tag is v2.0.47) - All curls in docs/, examples/, deploy/helm/ guides use https://localhost:8443 --cacert Verification - grep -rn "ListenAndServe[^T]" cmd/ internal/ → 0 hits - grep -rn "\"http://" cmd/ internal/ → 2 benign hits (Caddy admin API default, SSRF doc comment) — zero certctl endpoints - Tasks #197–#206 (Phases 0–8) all closed in the tracker Files: 65 changed, 3489 insertions, 372 deletions (pre-CI-fix).	2026-04-20 03:43:10 +00:00
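The certHolder with atomic swap and GetCertificate callback named in the v2.0.47 commit above might be shaped roughly like this. certHolder and buildServerTLSConfig are names from the commit; the field layout, method signatures, and swapDemo are assumptions, and the SIGHUP wiring and file loading are omitted.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"sync/atomic"
)

// certHolder keeps the currently served certificate behind an atomic pointer
// so a reload can swap it without locking the handshake path.
type certHolder struct {
	cur atomic.Pointer[tls.Certificate]
}

// Store publishes a new certificate; a reload handler would call this after
// re-reading the cert and key files.
func (h *certHolder) Store(c *tls.Certificate) { h.cur.Store(c) }

// GetCertificate matches the tls.Config.GetCertificate callback signature.
func (h *certHolder) GetCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	c := h.cur.Load()
	if c == nil {
		return nil, errors.New("no certificate loaded")
	}
	return c, nil
}

// buildServerTLSConfig pins TLS 1.3 as the minimum and defers certificate
// selection to the holder on every handshake.
func buildServerTLSConfig(h *certHolder) *tls.Config {
	return &tls.Config{
		MinVersion:     tls.VersionTLS13,
		GetCertificate: h.GetCertificate,
	}
}

// swapDemo reports whether the callback serves whichever certificate was
// stored most recently.
func swapDemo() bool {
	h := &certHolder{}
	first, second := &tls.Certificate{}, &tls.Certificate{}
	h.Store(first)
	a, _ := h.GetCertificate(nil)
	h.Store(second)
	b, _ := h.GetCertificate(nil)
	return a == first && b == second
}

func main() {
	_, err := (&certHolder{}).GetCertificate(nil)
	fmt.Println("before first load:", err)
	fmt.Println("swap visible to callback:", swapDemo())
}
```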
Shankar	25131a377d	M-001/M-006: strip HTTP auth from EST/SCEP + fail-loud SCEP preflight Closes CWE-306 (missing authentication for critical function) for SCEP via a fail-loud startup gate, and aligns EST/SCEP HTTP dispatch with their respective RFCs. CRL/OCSP remain unauthenticated under .well-known/pki/* per RFC 5280 §5 / RFC 6960 / RFC 8615. Option (D): no mTLS in this milestone. - RFC 7030 §3.2.3 (EST auth is deployment-specific) and §4.1.1 (/cacerts explicitly anonymous): EST paths served unauthenticated; CSR-signature + profile policy enforce identity inside ESTService. - RFC 8894 §3.2: SCEP authenticates via the challengePassword PKCS#10 attribute (OID 1.2.840.113549.1.9.7), not an HTTP credential. HTTP dispatch is unauthenticated; preflightSCEPChallengePassword refuses to start when CERTCTL_SCEP_ENABLED=true without CERTCTL_SCEP_CHALLENGE_PASSWORD. SCEPService.PKCSReq enforces the same invariant defense-in-depth and compares with crypto/subtle.ConstantTimeCompare. cmd/server/main.go: - Extract buildFinalHandler(apiHandler, noAuthHandler, webDir, dashboardEnabled); route /.well-known/est/, /scep, /scep/, /.well-known/pki/crl/{id}, /.well-known/pki/ocsp/{id}/{serial}, and health probes through noAuthHandler (RequestID + structuredLogger + Recovery only). - Add preflightSCEPChallengePassword fail-loud gate; startup log emits challenge_password_set boolean for operator visibility. cmd/server/finalhandler_test.go (new, 314 lines, 27 subtests): - TestBuildFinalHandler_Dispatch (20) + TestBuildFinalHandler_NoDashboard (7) pin the dispatch surface: EST 4-endpoint, SCEP exact + trailing-slash + query-string, PKI CRL+OCSP, health, /api/v1/* authenticated, /assets/* file server, SPA fallback. internal/api/router/router.go, internal/config/config.go: - Router-level comments explain why EST/SCEP/PKI dispatchers sit outside the authenticated mux; SCEP challenge password config plumbed through. 
docs/architecture.md: - New EST Authentication subsection (RFC 7030 §3.2.3 + §4.1.1, buildFinalHandler + noAuthHandler references). - Rewrite SCEP Authentication subsection; replaces pre-existing factually-incorrect "any value accepted" claim with CWE-306 preflight, service-layer defense-in-depth, and crypto/subtle.ConstantTimeCompare. - Top-level Authentication section: qualify /api/v1/* scope on API clients bullet; add standards-based-endpoints bullet referencing the 27-subtest regression harness. docs/compliance-soc2.md: - CC6.1: scope API Key Authentication to /api/v1/; add standards-based endpoints bullet citing RFCs and CWE-306 closure. - CC6.3: scope API Key Policy to /api/v1/ with cross-reference to CC6.1. - Evidence Locations augmented with buildFinalHandler, preflightSCEPChallengePassword, scep.go defense path, regression harness, and OpenAPI security:[] overrides. api/openapi.yaml: verified already correct (global bearerAuth default overridden with security:[] on /cacerts, /simpleenroll, /simplereenroll, /csrattrs, /scep GET+POST, /crl/{issuer_id}, /ocsp/{issuer_id}/{serial}); no edits needed.	2026-04-19 17:20:05 +00:00
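The SCEP fail-loud gate and constant-time comparison described in the M-001/M-006 commit above can be sketched as two small functions. preflightSCEPChallengePassword is a name from the commit, but its signature here is assumed; challengeMatches is a hypothetical helper standing in for the check inside SCEPService.PKCSReq.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
)

// preflightSCEPChallengePassword refuses a configuration where SCEP is
// enabled but no challenge password is set (signature is an assumption).
func preflightSCEPChallengePassword(enabled bool, password string) error {
	if enabled && password == "" {
		return errors.New("CERTCTL_SCEP_ENABLED=true requires CERTCTL_SCEP_CHALLENGE_PASSWORD")
	}
	return nil
}

// challengeMatches is a hypothetical service-layer check. It rejects an
// unset password outright (defense in depth) and otherwise compares in
// constant time with respect to the contents.
func challengeMatches(configured, presented string) bool {
	if configured == "" {
		return false
	}
	return subtle.ConstantTimeCompare([]byte(configured), []byte(presented)) == 1
}

func main() {
	fmt.Println(preflightSCEPChallengePassword(true, ""))
	fmt.Println(preflightSCEPChallengePassword(true, "s3cret"))
	fmt.Println(challengeMatches("s3cret", "s3cret"), challengeMatches("s3cret", "guess"))
}
```

Note that subtle.ConstantTimeCompare returns 0 immediately when the two slices differ in length, so only the contents, not the length, are protected from timing observation.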
Shankar	15daf008aa	I-005: notification retry loop + dead-letter queue Critical alerts can no longer be silently dropped by a transient notifier failure. Failed notification attempts now ride an exponential backoff retry loop, with a 5-attempt budget before promotion to the dead-letter queue for operator intervention. Schema (migration 000016, idempotent): - retry_count INTEGER NOT NULL DEFAULT 0 - next_retry_at TIMESTAMPTZ - last_error TEXT - idx_notification_events_retry_sweep partial index (next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL Dead rows clear next_retry_at so the index stops matching them. Service contract: - NotificationService.RetryFailedNotifications drives 2^n-minute exponential backoff capped at 1h (notifRetryBackoffCap) with 5-attempt budget (notifRetryMaxAttempts). - Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to status='dead' via MarkAsDead. - Non-terminal failures record via RecordFailedAttempt. - Success path promotes to 'sent' without touching retry_count (audit preserves "delivered on attempt N"). - Missing-notifier branch defensively promotes to 'sent' to avoid wedging a row on a deleted channel. - RequeueNotification operator escape hatch atomically resets retry_count -> 0, next_retry_at -> NULL, last_error -> NULL, status -> pending via notifRepo.Requeue. Scheduler: - New always-on notificationRetryLoop wired into the base loop set at CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m). - sync/atomic.Bool idempotency guard. - sync.WaitGroup shutdown drain via WaitForCompletion. StatsService: - SetNotifRepo setter pattern preserves 9 pre-existing NewStatsService call sites (main.go + stats_test.go + 8 digest tests) without touching the constructor signature. - DashboardSummary.NotificationsDead populated via notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired (reports zero on systems without a notification repository). 
- CountByStatus error is non-fatal (dashboard summary is best-effort for this field). - Prometheus certctl_notification_dead_total counter emitted from the same snapshot. Handler: - New POST /api/v1/notifications/{id}/requeue endpoint. - dead status surfaces to MCP + CLI. Frontend: - NotificationsPage gains two-tab toolbar ("All" / "Dead letter") with queryKey: ['notifications', activeTab] so switching tabs doesn't serve stale data until the 30s refetch. - Dead rows surface "Retry {n}/5" + truncated last_error with full-text title tooltip. - Requeue mutation wrapped as mutationFn: (id: string) => requeueNotification(id) to prevent react-query v5's positional context argument from leaking into the API client — pinned against future refactors by strict-match toHaveBeenCalledWith('notif-dead-001') in NotificationsPage.test.tsx:181. Closes I-005.	2026-04-19 15:17:27 +00:00
Shankar Reddy	49002c8cba	Close I-004 (agent hard-delete cascades targets) coverage-gap finding Operator decision answered as full soft-delete with optional forced cascade — hard-delete is not reachable from any public surface. Prior to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents` whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id silently wiped every target, orphaning certs and aborting in-flight jobs. The finding closure reshapes the agent-removal contract around soft retirement with explicit preflight counts, an opt-in cascade gated by a mandatory reason, and unconditional protection for the four reserved sentinel agents used by discovery sources. Schema — migration 000015: migrations/000015_agent_retire.up.sql flips deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE RESTRICT, so a stray `DELETE FROM agents` now errors at the DB boundary instead of quietly destroying targets. Both `agents` and `deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason TEXT pair (TEXT not VARCHAR so operator comments are never truncated), indexed via partial indexes WHERE retired_at IS NOT NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT EXISTS) so repeated runs against partially-migrated databases converge. migrations/000015_agent_retire.down.sql restores CASCADE and drops the new columns for clean rollback. A dedicated repository-layer testcontainers test (internal/repository/postgres/migration_000015_test.go) asserts the before/after FK action, column presence, index presence, and round-trip idempotency under up→down→up. 
Domain — sentinel guard + dependency counts: internal/domain/connector.go gains IsRetired() on Agent, the exported SentinelAgentIDs slice listing server-scanner, cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the four reserved IDs documented in CLAUDE.md and created at startup in cmd/server/main.go), IsSentinelAgent(id string) predicate, AgentDependencyCounts{ActiveTargets, ActiveCertificates, PendingJobs} with a HasDependencies() method, and ActorTypeAgent / ActorTypeSystem enum values used by audit emission downstream. Coverage locked down by internal/domain/connector_test.go. Service — 8-step ordered contract: internal/service/agent_retire.go:RetireAgent(ctx, id, actor, opts{Force, Reason}) enforces a fixed execution order: (1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel unconditionally; force=true does NOT bypass it. (2) fetch — ErrAgentNotFound on miss. (3) idempotency — if IsRetired() already, return AgentRetirementResult{AlreadyRetired: true} with no new audit event and no state change (safe to replay from flaky clients). (4) preflight counts — collectAgentDependencyCounts runs ActiveTargets, ActiveCertificates, PendingJobs sequentially (not in parallel; keeps the per-query timeout predictable and matches the repo's existing call-chain shape). (5) force-reason guard — opts.Force=true with empty Reason returns ErrForceReasonRequired (wired into the 400 status surface). (6) dependency guard — HasDependencies() with opts.Force=false returns BlockedByDependenciesError{Counts} (wired into the 409 body with per-bucket counts). (7) mutation — single pinned retiredAt := time.Now(); agent retirement first, then cascade target retirement if opts.Force, all under the repo's single transaction so the two retired_at stamps match to the second. (8) best-effort audit — agent_retired always; agent_retirement_cascaded additionally on the force path. 
Actor is whatever the handler resolves from the request; actor type is mapped by resolveActorType (system/agent-prefix→Agent/else→User). Audit emission failures are logged via slog.Error but do not abort the retirement (matches the house convention used by every other scheduler-emitted event). BlockedByDependenciesError implements Error() as "active_targets=%d, active_certificates=%d, pending_jobs=%d" and Unwrap() → ErrBlockedByDependencies. The single struct satisfies errors.Is via Unwrap (used by scheduler-level tests) and errors.As via the concrete type (used by the handler to fish out Counts for the 409 body). ListRetiredAgents(page, perPage) adds a separate paginated accessor with page<1→1 and perPage<1→50 normalization so retired rows are queryable without polluting the default agent listing. Sentinel guard coverage is asymmetric by design: all four reserved IDs are protected, and force=true cannot override. Regression tests in internal/service/agent_retire_test.go assert each of the eight steps in order, plus sentinel bypass attempts and idempotency replay. Handler + router — status-code surface: internal/api/handler/agents.go:RetireAgent exposes seven status codes on DELETE /agents/{id}: 200 on a fresh retirement (body echoes AgentRetirementResult). 204 on idempotent replay (AlreadyRetired=true; no new audit). 400 on ErrForceReasonRequired. 403 on ErrAgentIsSentinel. 404 on ErrAgentNotFound. 409 on BlockedByDependenciesError, with a custom body shape {error, counts{active_targets, active_certificates, pending_jobs}} that bypasses the default ErrorWithRequestID envelope so callers get the per-bucket numbers directly. 500 on any other error. Heartbeat HandleHeartbeat returns 410 Gone when the agent is retired (ErrAgentRetired), signalling the agent to shut down. Query params `force=true` and `reason=<text>` drive the cascade path; both are forwarded as url.Values through the new MCP transport. 
internal/api/router/router.go registers GET /api/v1/agents/retired literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's literal-beats-pattern-var precedence routes "retired" to the paginated retired-agents listing instead of fetching a hypothetical agent named "retired". Agent binary — clean shutdown on 410: cmd/agent/main.go gains the ErrAgentRetired sentinel, a retiredOnce sync.Once, and a retiredSignal chan struct{}. A markRetired(source, statusCode, body) helper closes the channel exactly once; the Run() select loop observes the close and returns ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired) and exits cleanly instead of spinning in the heartbeat retry loop. The 410 Gone surface is therefore terminal for the agent process. MCP transport: internal/mcp/client.go adds Client.DeleteWithQuery(path, query), a new additive transport method. Client.Delete is path-only; without this method the retire tool would silently drop `force` and `reason`, turning every cascade retire into a default soft-retire. The new method shares do()'s 204 normalization and 4xx/5xx error propagation so tool authors get one contract. internal/mcp/tools.go + internal/mcp/types.go expose the retire_agent tool with Force+Reason inputs wired through DeleteWithQuery. CLI: cmd/cli/main.go + internal/cli/client.go add two CLI surfaces: `agents list --retired` (client-side strip of --retired then delegation to ListRetiredAgents, sharing --page/--per-page parsing with the default listing) and `agents retire <id> [--force --reason "…"]` (mirrors ErrForceReasonRequired — force without reason is rejected client-side before the request is sent). JSON + table output modes both honor the new columns. Frontend: web/src/pages/AgentsPage.tsx surfaces retired/retire affordances. web/src/api/client.ts + web/src/api/types.ts expose the retire endpoint and the retired-listing. 4 new Vitest regression cases. 
OpenAPI: api/openapi.yaml documents DELETE /agents/{id} with all seven status codes, 410 on heartbeat, and the 409 per-bucket body shape. Regression coverage (six new test files, all green): internal/service/agent_retire_test.go — 8-step contract + sentinel guards internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat internal/mcp/retire_agent_test.go — DeleteWithQuery wire-through internal/cli/agent_retire_test.go — --retired listing + --force/--reason pairing internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down internal/domain/connector_test.go — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies Files: api/openapi.yaml — DELETE + 410 + 409 body shape cmd/agent/main.go — ErrAgentRetired, markRetired, retiredSignal cmd/cli/main.go — handleAgents list/get/retire dispatch docs/architecture.md, docs/concepts.md, docs/testing-guide.md — retirement contract narrative internal/api/handler/agents.go — RetireAgent, status surface, 410 on heartbeat internal/api/handler/agent_handler_test.go — extended coverage internal/api/handler/agent_retire_handler_test.go — new internal/api/router/router.go — /agents/retired before /agents/{id} internal/cli/agent_retire_test.go — new internal/cli/client.go — ListRetiredAgents + RetireAgent internal/domain/connector.go — IsRetired, SentinelAgentIDs, IsSentinelAgent, AgentDependencyCounts, ActorTypeAgent/System internal/domain/connector_test.go — new internal/integration/lifecycle_test.go — retirement fixture internal/mcp/client.go — DeleteWithQuery additive transport internal/mcp/retire_agent_test.go — new internal/mcp/tools.go, internal/mcp/types.go — retire_agent tool + Force/Reason inputs internal/repository/interfaces.go — AgentRepository retirement methods internal/repository/postgres/agent.go — retire + cascade target retire + counts internal/repository/postgres/migration_000015_test.go — new internal/service/agent.go — wire into AgentService 
surface internal/service/agent_retire.go — new 8-step contract internal/service/agent_retire_test.go — new internal/service/deployment.go — skip retired agents internal/service/target.go — skip retired agents internal/service/testutil_test.go — shared mocks extended migrations/000015_agent_retire.up.sql — new migrations/000015_agent_retire.down.sql — new web/src/api/client.ts, types.ts + tests — retire endpoint wiring web/src/pages/AgentsPage.tsx — retire UI	2026-04-19 05:24:00 +00:00
Shankar	c17ea577e7	I-003: job timeout reaper closes AwaitingCSR/AwaitingApproval gap Add 11th always-on scheduler loop that transitions jobs stuck in AwaitingCSR (default 24h TTL) or AwaitingApproval (default 168h TTL) to Failed. I-001's retry loop then auto-promotes eligible Failed jobs back to Pending. No new status enum, no schema migration. - JobRepository.ListTimedOutAwaitingJobs with per-status cutoff WHERE - JobService.ReapTimedOutJobs mirrors RetryFailedJobs structure - Scheduler jobTimeoutLoop with atomic.Bool idempotency guard, 2m per-tick context, WaitGroup shutdown drain - Config: CERTCTL_JOB_TIMEOUT_INTERVAL (10m), CERTCTL_JOB_AWAITING_CSR_TIMEOUT (24h), CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT (168h) - Audit event per transition: actor=system, actorType=System, action=job_timeout, details={old_status, new_status, timeout_reason, age_hours} - 14 new tests: 3 config, 7 service, 4 scheduler	2026-04-19 01:37:18 +00:00
Shankar	0d7d933e91	fix(config): add RetryInterval to TestValidate_ValidConfig + TestValidate_AuthTypeNone fixtures (I-001 follow-up) Problem: TestValidate_ValidConfig and TestValidate_AuthTypeNone construct a SchedulerConfig without RetryInterval, so Validate() fails the check at config.go:1086 with 'retry interval must be at least 1 second'. Both tests expect success, so they fail whenever run. Root cause (re-derived from source, not inherited from memory): git log -S 'retry interval must be at least' --source --all shows the validation was introduced in `8665b16` (I-001, RetryFailedJobs scheduler wiring). git log -- internal/config/config_test.go shows the test file was last touched in `43e1c89`, which predates `8665b16`. I-001 added a new Validate() rule without updating the two positive test fixtures — a gap in I-001's verification pass. This is NOT C-001 fallout. The config_test.go file was untouched by the C-001 closure commits `0fb7d46` and `d23c268`. The failure surfaced during the full test suite run after C-001 landed because no one had run 'go test ./internal/config/...' since I-001. Scope: - internal/config/config_test.go (2 fixtures: TestValidate_ValidConfig, TestValidate_AuthTypeNone). Implementation: Added 'RetryInterval: 5 * time.Minute' to both SchedulerConfig literals. 5 minutes matches the I-001 default at config.go:818: RetryInterval: getEnvDuration("CERTCTL_SCHEDULER_RETRY_INTERVAL", 5*time.Minute) The other two TestValidate_ tests (InvalidAuthType, APIKeyAuth_MissingSecret) are unaffected because they expect Validate() to error at the auth-type check (line 1052) or auth-secret check (line 1057), both of which fire before the RetryInterval check at line 1086. Verification: - go test -count=1 -run 'TestValidate_' ./internal/config/...: PASS - go test -short -count=1 ./...: all packages PASS - go vet ./...: exit 0 Residual: None. This is a pure test-fixture fix — production code is unchanged. 
Commit: `8665b16` (I-001) should have included this edit. Attributed here for traceability.	2026-04-19 00:33:22 +00:00
Shankar	d23c268e9d	fix(cli): add missing os + path/filepath imports to client_test.go Follow-up to `0fb7d46`. TestClient_ImportCertificates_SixFieldPayload uses filepath.Join(t.TempDir(), ...) and os.WriteFile to stage a test PEM, but the import block only listed encoding/json, encoding/pem, net/http, etc. — neither os nor path/filepath was imported. go vet rejected the package with 'undefined: filepath' (and would have caught 'undefined: os' next). Add both imports. No behavioral change — the referenced symbols are the standard library's usual names for their respective packages, so the test compiles and runs exactly as intended. CI should now pass go build + go vet on the cli package.	2026-04-19 00:27:11 +00:00
Shankar	0fb7d46019	C-001 scope expansion: tighten parallel POST /api/v1/certificates call sites to six-field contract Problem: `5c01c7f` closed C-001 at the handler boundary by tightening the ValidateRequired contract on POST /api/v1/certificates to require six fields: name, common_name, renewal_policy_id, issuer_id, owner_id, team_id. (Correction re-derived from source: the handler ValidateRequired calls on owner_id/team_id/renewal_policy_id were actually installed in `4536147` under M-002/M-003/M-006 auth unification — 5c01c7f's commit message overstates scope.) Post-audit on 2026-04-18 found three parallel call sites still shipping three-to-four-field payloads that the newly strict handler would reject with HTTP 400: - GUI: OnboardingWizard CertificateStep (common_name + sans + issuer_id + environment only) - CLI: certctl-cli import (common_name + issuer_id + status only; no required-flag gating) - Tests: deploy/test/qa_test.go Part03 positive paths Scope: Bring every POST /api/v1/certificates caller to six-field parity. No handler changes — the contract is authoritative; the callers must conform. Implementation: GUI — OnboardingWizard CertificateStep expansion: web/src/pages/OnboardingWizard.tsx adds name/owner_id/team_id/renewal_policy_id state. React Query hooks for getOwners/getTeams/getPolicies use per_page: '500' to populate dropdowns without pagination-driven truncation. Payload ships all six required fields plus sans/certificate_profile_id/environment. nextDisabled gate enforces all six before the Continue button activates. CLI — ImportCertificates rewrite: internal/cli/client.go rewrites ImportCertificates with flag.NewFlagSet("import", flag.ContinueOnError). Required flags: --owner-id, --team-id, --renewal-policy-id, --issuer-id. Optional: --name-template (default {cn}, templated via strings.ReplaceAll against cert.Subject.CommonName), --environment (default imported). Missing required flags fail pre-HTTP with a clear error. 
Request map ships all six required fields plus sans/environment/status/optional serial_number. cmd/cli/main.go — usage string updated to document the new required/optional flags. Tests — qa_test.go Part03 positive paths: deploy/test/qa_test.go Part03 Create_Minimal and Create_Full updated to include all six fields. Uses seed_demo.sql-supplied IDs (o-alice, t-platform, rp-standard) — docker-compose.demo.yml is the run context. C-001 explanatory comment added above Create_Minimal so future readers understand why the minimal payload is no longer minimal. MCP parity: Verified no-op. internal/mcp/types.go:28 CreateCertificateInput already declares all six fields; internal/mcp/tools.go:102 forwards the typed struct unchanged. Verification: Go CLI regression tests (internal/cli/client_test.go): * TestClient_ImportCertificates_MissingRequiredFlags — 5 subtests, one per missing required flag, confirms flag.ContinueOnError rejects with non-nil error before any HTTP call is attempted. * TestClient_ImportCertificates_MissingPositionalArgs — confirms the "usage: import <file>" error path when no PEM file is supplied after the flags. * TestClient_ImportCertificates_SixFieldPayload — uses httptest to decode the POST body and assert all six required fields plus sans/environment are present on the wire. Frontend regression test (web/src/api/client.test.ts): 'createCertificate accepts and transmits all six required fields' pins the wire shape for both GUI call sites (OnboardingWizard CertificateStep + CertificatesPage CreateCertificateModal). If either UI surface accidentally drops a field, this assertion fails in CI rather than surfacing as a 400 at runtime. Grep-based call-site sweep: Enumerated every POST /api/v1/certificates create caller. Four total: OnboardingWizard, CertificatesPage, MCP tools, CLI import. All four now ship six-field payloads. Claim path (internal/service/discovery.go) updates existing rows and does not POST. 
EST/SCEP handlers invoke internal certService.CreateVersion, not the public API. Negative-path tests (qa_test.go:1085/1267/1274/1288/1298) remain valid: they assert 400/non-500 on oversized/malformed/missing-CN/UTF-8/empty bodies, and these properties still hold under the stricter handler. Static gates: go build ./..., go vet ./..., go test ./internal/cli/..., and cd web && npm run test deferred to operator pre-push — the Go toolchain is not available in the session sandbox. Grep-based verification confirms the syntactic shape of every changed file. Residual: None. Every POST /api/v1/certificates call site now conforms to the six-field contract; the wire shape is pinned by both Go and TypeScript regression tests. Commit: TBD-SHA (audit doc + CLAUDE.md carry TBD-SHA placeholders to be amended after commit)	2026-04-19 00:25:10 +00:00
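The CLI flag handling described in the C-001 scope-expansion commit above (flag.ContinueOnError, four required flags, a {cn} name template) can be sketched roughly as follows. Flag names and defaults come from the commit message; importOpts, parseImportArgs, renderName, the error wording for missing flags, and the iss-local ID are hypothetical.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"strings"
)

// importOpts is a hypothetical container for the parsed flags.
type importOpts struct {
	OwnerID, TeamID, RenewalPolicyID, IssuerID string
	NameTemplate, Environment, File            string
}

// parseImportArgs validates flags before any HTTP request would be made.
func parseImportArgs(args []string) (importOpts, error) {
	var o importOpts
	fs := flag.NewFlagSet("import", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr in this sketch
	fs.StringVar(&o.OwnerID, "owner-id", "", "owner ID (required)")
	fs.StringVar(&o.TeamID, "team-id", "", "team ID (required)")
	fs.StringVar(&o.RenewalPolicyID, "renewal-policy-id", "", "renewal policy ID (required)")
	fs.StringVar(&o.IssuerID, "issuer-id", "", "issuer ID (required)")
	fs.StringVar(&o.NameTemplate, "name-template", "{cn}", "certificate name template")
	fs.StringVar(&o.Environment, "environment", "imported", "environment label")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	required := []struct{ name, value string }{
		{"owner-id", o.OwnerID}, {"team-id", o.TeamID},
		{"renewal-policy-id", o.RenewalPolicyID}, {"issuer-id", o.IssuerID},
	}
	for _, r := range required {
		if r.value == "" {
			return o, fmt.Errorf("missing required flag --%s", r.name)
		}
	}
	if fs.NArg() < 1 {
		return o, errors.New("usage: import <file>")
	}
	o.File = fs.Arg(0)
	return o, nil
}

// renderName substitutes the certificate's common name into the template.
func renderName(tmpl, commonName string) string {
	return strings.ReplaceAll(tmpl, "{cn}", commonName)
}

func main() {
	o, err := parseImportArgs([]string{
		"--owner-id", "o-alice", "--team-id", "t-platform",
		"--renewal-policy-id", "rp-standard", "--issuer-id", "iss-local", "certs.pem",
	})
	fmt.Println(o.File, renderName(o.NameTemplate, "example.com"), err)
}
```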
Shankar	8665b1648d	Close I-001 (RetryFailedJobs never invoked) coverage-gap finding Operator decision answered as Option A: JobService.RetryFailedJobs is now wired into the scheduler as an always-on 10th loop. Prior to this commit the method was implemented, unit-tested, and exported but had zero runtime callers — any job that transitioned to status=Failed stayed Failed forever regardless of how many attempts it had remaining. Scheduler — 10th loop: internal/scheduler/scheduler.go grows a jobRetryLoop alongside the existing nine loops (renewal, jobs, health, notifications, short-lived, network scan, digest, health check, cloud discovery). The loop follows the established run-immediately-then-tick pattern (same shape as jobProcessorLoop), gated by a sync/atomic.Bool idempotency guard and joined into the scheduler's sync.WaitGroup so WaitForCompletion drains it on graceful shutdown. Each tick runs under a 2-minute context timeout mirroring jobProcessorLoop's opCtx budget. The runJobRetry helper invokes jobService.RetryFailedJobs(ctx, 3) — the advisory maxRetries cap is belt-and-suspenders; per-job eligibility is still enforced inside the service via Attempts < MaxAttempts. The JobServicer scheduler-interface gains RetryFailedJobs so the scheduler's dependency surface stays explicit and mockable. Service — audit trail per retry: internal/service/job.go:RetryFailedJobs now emits an audit event for every Failed→Pending transition. Following the house convention used by all scheduler-emitted events, actor='system' and actorType=domain.ActorTypeSystem; action='job_retry'; details capture old_status, new_status, attempts, max_attempts. JobService carries an optional *AuditService (SetAuditService) that nil-guards to preserve test-wiring ergonomics — existing tests that construct JobService without an audit service continue to pass unchanged. 
Config — env var with sane default: internal/config/config.go:SchedulerConfig grows RetryInterval, wired to CERTCTL_SCHEDULER_RETRY_INTERVAL with a 5-minute default. Validate rejects intervals below 1 second (matches other scheduler interval validators). Server wiring: cmd/server/main.go calls jobService.SetAuditService(auditService) after JobService construction and sched.SetJobRetryInterval(cfg.Scheduler.RetryInterval) alongside the other SetXxxInterval calls. Regression coverage: internal/service/job_test.go (3 new) - TestJobService_RetryFailedJobs_EligibleJobTransitionsAndAudits - TestJobService_RetryFailedJobs_SkipsJobsAtMaxAttempts - TestJobService_RetryFailedJobs_NoAuditServiceOK internal/scheduler/scheduler_test.go (3 new) - TestScheduler_JobRetryLoop_CallsService - TestScheduler_JobRetryLoop_IdempotencyGuard - TestScheduler_JobRetryLoop_WaitForCompletion The service tests assert status transitions, attempt-cap short-circuiting, and audit event shape (actor='system', action='job_retry', details keys). The scheduler tests assert the loop invokes the service, the atomic.Bool guard skips overlapping ticks with the expected 'still running, skipping tick' log, and WaitForCompletion drains the in-flight tick on Stop. Residual follow-up (not in scope for this commit): internal/service/renewal.go:RetryFailedJobs is a parallel dead-code duplicate of the same logic on RenewalService — untested and has no runtime caller. The audit finding called this out as 'implemented twice'. Removing it is a separate cleanup and does not block the Option-A wiring this commit delivers. 
Files: cmd/server/main.go — SetAuditService + SetJobRetryInterval internal/config/config.go — RetryInterval field + env + validate internal/scheduler/scheduler.go — 10th loop, interface, field, setter internal/scheduler/scheduler_test.go — 3 new scheduler-loop tests internal/service/job.go — RetryFailedJobs audit emission + SetAuditService internal/service/job_test.go — 3 new service-layer tests	2026-04-18 23:24:54 +00:00
Shankar	2cad4d7ade	Close M-004 (OCSP issuer binding) and M-005 (discovery actor propagation) coverage-gap findings M-004 — OCSP issuer binding (composite key): The OCSP lookup path now binds (issuer_id, serial) as a composite key rather than resolving by serial alone. CertificateRepository and RevocationRepository gain GetByIssuerAndSerial methods; ca_operations.go scopes both lookups by the issuer_id path param. When no managed cert binds to that (issuer, serial) tuple, GetOCSPResponse constructs an RFC 6960 §2.2 'unknown' response (CertStatus=2) instead of the prior default 'good'. Short-lived cert exemption (profile TTL < 1h) is preserved. Real repo errors (non-sql.ErrNoRows) fail closed with a log. Regression coverage: internal/service/ca_operations_test.go - TestCAOperationsSvc_GetOCSPResponse_Unknown_CrossIssuer - TestCAOperationsSvc_GetOCSPResponse_Unknown_UnknownSerial M-005 — Discovery Claim/Dismiss actor propagation: DiscoveryService.ClaimDiscovered and DismissDiscovered now accept an explicit 'actor string' parameter (propagation pattern mirrors bulk_revocation.go / revocation_svc.go). The handler layer passes resolveActor(r.Context()) — the named-key identity established by the M-002 auth unification — and the service falls back to 'api' (the same safe sentinel resolveActor uses when no auth context is present) only when the caller passes an empty string. Never falls back to 'operator'. Regression coverage: internal/service/discovery_test.go - TestDiscoveryService_ClaimDiscovered_AuditActor - TestDiscoveryService_DismissDiscovered_AuditActor - TestDiscoveryService_ClaimDiscovered_EmptyActorFallsBackToAPI - TestDiscoveryService_DismissDiscovered_EmptyActorFallsBackToAPI Each new test asserts event.Actor matches the caller-supplied string (or 'api' on empty input) and explicitly asserts event.Actor != 'operator' to lock in the historical fix intent. 
Files: internal/api/handler/discovery.go — pass resolveActor(ctx) internal/api/handler/discovery_handler_test.go — updated call sites internal/integration/lifecycle_test.go — updated mock wiring internal/repository/interfaces.go — GetByIssuerAndSerial on CertificateRepository + RevocationRepository internal/repository/postgres/certificate.go — composite key lookup internal/service/ca_operations.go — (issuer_id, serial) scoping internal/service/ca_operations_test.go — 2 new M-004 tests internal/service/discovery.go — actor parameter + 'api' fallback internal/service/discovery_test.go — 4 new M-005 tests internal/service/shortlived_test.go — mock signature update internal/service/testutil_test.go — mock GetByIssuerAndSerial	2026-04-18 22:20:25 +00:00
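The M-004 composite-key lookup described above can be sketched with in-memory maps. The status numbering follows RFC 6960 (good=0, revoked=1, unknown=2); the store type, the issuerSerial key, and the sample IDs are hypothetical, and the short-lived exemption and fail-closed error path are left out.

```go
package main

import "fmt"

type certStatus int

// CertStatus choices as numbered in RFC 6960.
const (
	statusGood    certStatus = 0
	statusRevoked certStatus = 1
	statusUnknown certStatus = 2
)

// issuerSerial is the composite key: a serial is only meaningful
// relative to the issuer that assigned it.
type issuerSerial struct {
	IssuerID string
	Serial   string
}

// store is a hypothetical in-memory stand-in for the two repositories.
type store struct {
	managed map[issuerSerial]bool
	revoked map[issuerSerial]bool
}

// ocspStatus answers for exactly one (issuer, serial) pair. A serial
// known only under a different issuer is reported as unknown, not good.
func (s store) ocspStatus(issuerID, serial string) certStatus {
	key := issuerSerial{IssuerID: issuerID, Serial: serial}
	if !s.managed[key] {
		return statusUnknown
	}
	if s.revoked[key] {
		return statusRevoked
	}
	return statusGood
}

func main() {
	s := store{
		managed: map[issuerSerial]bool{
			{IssuerID: "iss-a", Serial: "01"}: true,
			{IssuerID: "iss-a", Serial: "02"}: true,
		},
		revoked: map[issuerSerial]bool{
			{IssuerID: "iss-a", Serial: "02"}: true,
		},
	}
	fmt.Println(s.ocspStatus("iss-a", "01"))
	fmt.Println(s.ocspStatus("iss-a", "02"))
	fmt.Println(s.ocspStatus("iss-b", "01"))
}
```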
Shankar	10a949fb9a	fix(lint): godoc comment on NewAuthWithNamedKeys must lead with function name (ST1020) CI failure on master (commit `4536147`) — staticcheck ST1020: internal/api/middleware/middleware.go:125:1: ST1020: comment on exported function NewAuthWithNamedKeys should be of the form "NewAuthWithNamedKeys ..." (staticcheck) When NewAuth was renamed to NewAuthWithNamedKeys during the M-002 auth unification, the leading godoc sentence was left pointing at the old name. Rewrite the comment so its first sentence starts with the new function name, and expand the body to describe the named-key + admin-flag contract introduced in `4536147`. Also gitignore /.gopath/ — session-scoped tool install cache, same category as /.gocache/ and /.gomodcache/. Verification: go vet ./internal/api/middleware/... — clean go build ./internal/api/middleware/... — clean go test ./internal/api/middleware/... — PASS (0.245s) staticcheck -checks=all,<project exclusions> — clean across middleware, handler, service, domain, cmd/server, scheduler Closes: CI failure on `4536147`.	2026-04-18 21:38:46 +00:00
Shankar Reddy	45361477ed	Unify API auth + RFC-compliant CRL/OCSP (M-002 + M-003 + M-006, auto-closes M-001) Closes the remaining P1 gaps from coverage-gap-audit.md (M-001/M-002/M-003/M-006) on top of the C-001/C-002 ownership + agent-FK contract fixes landed in `5c01c7f`. The work lands as a single commit spanning server, docs, tests, and the React client. M-002 — Named API keys with per-key actor propagation * Migration 000014 adds the 'api_keys' table (id, name, hash, principal, role, created_at, last_used_at, disabled_at) so every credential carries an identifiable principal instead of the opaque 'anonymous'/'api-key' sentinel. * Auth middleware now rotates through configured keys, performs constant-time hash comparison, stamps 'last_used_at', and emits an actor struct via contextWithActor(). The audit middleware, bulk-revocation handler, approval handlers, and MCP tool layer now read the principal off the context and persist it on every audit_events row. * Regression coverage: - internal/api/middleware/audit_test.go — actor propagation, principal redaction for disabled keys, anonymous fallback for unauthenticated endpoints. - internal/api/handler/bulk_revocation_handler_test.go, job_handler_test.go — principal-on-audit assertions. M-003 — Authorization gates (Phase B) * Approval handler rejects self-approval / self-rejection with 403 when the actor principal equals the job's requested_by field. * Bulk revocation is gated behind the 'admin' role; operators and viewers receive 403. * Regression coverage: - internal/service/job_test.go — TestApproveJob_NotSelf, TestRejectJob_NotSelf. - internal/api/handler/bulk_revocation_handler_test.go — TestBulkRevoke_RequiresAdmin, TestBulkRevoke_AdminSucceeds. M-006 — RFC-compliant CRL/OCSP on the unauthenticated .well-known mux * Per RFC 8615, relying parties cannot reasonably be asked to authenticate against the issuing certctl instance to retrieve revocation material. 
CRL and OCSP move off the authenticated '/api/v1/crl' and '/api/v1/ocsp/' paths onto: GET /.well-known/pki/crl/{issuer_id} Content-Type: application/pkix-crl (RFC 5280 §5) GET /.well-known/pki/ocsp/{issuer_id}/{serial} Content-Type: application/ocsp-response (RFC 6960) * Non-standard JSON CRL shape is removed; only DER is served. * Short-lived certificate exemption (profile TTL < 1h → skip CRL/OCSP) is preserved; the response simply omits the serial. * Routes are registered on the unauthenticated 'finalHandler' mux in cmd/server/main.go alongside EST ('/.well-known/est/') and SCEP ('/scep'). Legacy authenticated paths return 404. Regression coverage: - internal/api/handler/certificate_handler_test.go — content type, DER parseability, 404 for unknown issuer. - internal/api/handler/adversarial_path_test.go — unauthenticated access asserted for CRL, OCSP, EST, SCEP. - internal/api/router/router_test.go — route-table assertion that '.well-known/pki/', '.well-known/est/', and '/scep' are mounted on the unauthenticated branch. M-001 — Auto-closed by M-002 EST and SCEP were already registered on the unauthenticated 'finalHandler' mux; the router comment at internal/api/router/router.go:247 now matches reality. The adversarial-path tests above lock the behavior in. Verification (all gates green): * go vet ./... — clean * go build ./... — ok * go test -short ./... (55+ packages) — all pass * web/ : npm test (225 Vitest tests) — all pass * web/ : npx tsc --noEmit — clean * grep sweep for '/api/v1/(crl\|ocsp)' — 13 surviving hits, all intentional M-006 tombstone/relocation comments. Documentation: * coverage-gap-audit.md — status flips M-001/M-002/M-003/M-006 → Fixed, with per-finding resolution paragraphs citing regression test IDs. (Audit file lives outside this repo; see cowork root.) * CLAUDE.md Project Status line updated with the auth-unification closure note. 
* docs/features.md, docs/architecture.md, docs/quickstart.md, docs/concepts.md, docs/connectors.md, docs/test-env.md, docs/testing-guide.md, docs/compliance-.md, docs/demo-advanced.md — refreshed for the new '.well-known/pki/' namespace and named API keys. * api/openapi.yaml — documents the new unauthenticated endpoints and removes the legacy '/api/v1/crl' + '/api/v1/ocsp/' paths. .gitignore: adds '/.gocache/' and '/.gomodcache/' for the session-scoped Go caches so they never enter the tree.	2026-04-18 18:17:41 +00:00
Shankar Reddy	5c01c7f21f	fix(gui,api): close C-001 + C-002 — ownership + agent FK contract C-001 — CreateCertificate was server-accepted with null owner_id, team_id, renewal_policy_id because the GUI neither collected the fields nor enforced them, even though the backend's ManagedCertificate schema and handler contract treat them as required. Fix the contract at all four layers: - web/src/pages/CertificatesPage.tsx: replace owner_id/team_id free-text inputs with <select> elements fed by getOwners/getTeams/getPolicies queries; mark all three required; gate the Create button on owner_id + team_id + renewal_policy_id being set. - internal/api/handler/certificates.go: ValidateRequired for owner_id, team_id, renewal_policy_id on CreateCertificate so the handler returns HTTP 400 with the offending field name before the service layer is reached. - internal/mcp/types.go: drop ',omitempty' from CreateCertificateInput.RenewalPolicyID so the MCP schema reflects the required contract; Update inputs keep partial-update semantics. - api/openapi.yaml: 'required: [name, common_name, renewal_policy_id, issuer_id, owner_id, team_id]' was already present on the Create schema; clarified DeploymentTarget.agent_id description to note the FK contract. C-002 — CreateTargetWizard accepted an empty or bogus agent_id and the service inserted directly, producing a Postgres 23503 FK-violation that bubbled out as a generic HTTP 500. The FK itself (migration 000001 line 104: agent_id TEXT NOT NULL REFERENCES agents(id)) is correct; we keep the schema strict and add validation at three layers: - internal/service/target.go: introduce ErrAgentNotFound sentinel and pre-validate agent_id in TargetService.CreateTarget — empty string returns 'agent_id is required'; a nonexistent id returns the full 'referenced agent does not exist: <id>' error. Both wrap ErrAgentNotFound via fmt.Errorf %w so callers can use errors.Is. 
- internal/api/handler/targets.go: ValidateRequired on agent_id; map errors.Is(err, service.ErrAgentNotFound) to HTTP 400 instead of letting it fall through to the generic 500 branch. - internal/mcp/types.go: drop ',omitempty' from CreateTargetInput.AgentID to match the required contract. - web/src/pages/TargetsPage.tsx: replace the free-text Agent ID input with a <select> populated from getAgents(); include agent in the canProceedToReview gate so Next is disabled until an agent is chosen. Regression coverage (21 new subtests total): - TestCreateCertificate_MissingRequiredField_Returns400 — 6 subtests, one per required field, each proves the handler guard fires before the mock service is called. - TestCreateTarget_MissingAgentID_Returns400 — handler guard. - TestCreateTarget_NonexistentAgent_Returns400 — pins the ErrAgentNotFound -> 400 translation. - TestTargetService_CreateTarget_MissingAgentID — errors.Is sentinel. - TestTargetService_CreateTarget_NonexistentAgentID — errors.Is. - The existing TestTargetService_CreateTarget_Success, along with TestCreateTarget_{MissingName,MissingType,NameTooLong}_* handler tests, were updated to seed a real agent or include agent_id in the request body so the happy paths still run cleanly. Gates (Phase 4): - go build/vet/test/race: green - go test -cover: internal/service 68.7% (gate 55%), internal/api/handler 78.9% (gate 60%) - golangci-lint on service+handler+mcp: 0 issues - govulncheck: no reachable vulns - tsc --noEmit: clean - vitest: 223/223 passing See cowork/certctl-coverage-gap-audit.md entries C-001 and C-002.	2026-04-18 16:01:40 +00:00
Shankar	dfa9faa426	fix(policies): close the D-006 loop — TitleCase seed canonicals + severity-aware, config-consuming rule engine (D-008) D-008 was a three-part drift in the policy engine that made the D-005/D-006 remediation cosmetic below the DB layer: (a) migrations/seed.sql INSERTed rules with pre-D-005 lowercase types ('ownership', 'environment', 'lifetime', 'renewal_window') that the handler validator rejects on Create/Update but that raw SQL INSERTs bypassed entirely. At runtime evaluateRule's switch fell through to the default "unknown policy rule type" error branch on every demo rule × every cert × every cycle, flooding logs while emitting zero violations. (b) migrations/seed_demo.sql persisted lowercase severity values ('critical', 'error', 'warning') on policy_violations rows. INSERT succeeded because that column had no CHECK, but any frontend comparing against the canonical PolicySeverity enum mis-categorized every seeded violation. (c) evaluateRule hardcoded Severity: PolicySeverityWarning on every emitted violation and ignored rule.Config entirely — so the D-006 per-rule severity column (000013) and every per-arm Config JSON ({allowed_issuer_ids, allowed_domains, required_keys, allowed, lead_time_days, max_days}) was dead data below the evaluation layer. This commit lands (a)+(b)+(c) atomically. Shipping any subset leaves the feature half-working. ## Changes Domain (internal/domain/policy.go): * Add PolicyTypeCertificateLifetime as the 6th TitleCase canonical. Pre-D-008 the seeded "max-certificate-lifetime" rule had no engine arm — routing it through RenewalLeadTime would conflate "how close to expiry before we renew" with "how long can the cert possibly be", two distinct semantics. The new type accepts config {"max_days": int} and flags certs whose NotAfter - NotBefore exceeds the cap. 
Handler validator (internal/api/handler/validation.go): * ValidatePolicyType allowlist grown to 6 canonicals (AllowedIssuers, AllowedDomains, RequiredMetadata, AllowedEnvironments, RenewalLeadTime, CertificateLifetime). OpenAPI (api/openapi.yaml): * PolicyType enum grown to match domain. Frontend (web/src/api/types.ts, types.test.ts): * POLICY_TYPES tuple gains CertificateLifetime; pin test asserts all 6 canonicals and rejects casing drift. Migration 000014 (policy_violations severity CHECK): * Named CHECK constraint (policy_violations_severity_check) mirroring 000013's allowlist, defense-in-depth at the DB layer against future drift from bypassed writes (migrations, psql sessions, future callers). Symmetric down migration drops by name. Seed data: * migrations/seed.sql rewritten to emit TitleCase canonicals with per-arm config JSON that actually exercises the config-consuming paths (not the missing-field backstops): - pr-require-owner → RequiredMetadata {"required_keys":["owner"]} Warning - pr-allowed-environments → AllowedEnvironments {"allowed":["production","staging","development"]} Error - pr-max-certificate-lifetime → CertificateLifetime {"max_days":90} Critical - pr-min-renewal-window → RenewalLeadTime {"lead_time_days":14} Warning Severities are now differentiated per rule (D-006 intent). * migrations/seed_demo.sql violation rows flipped to TitleCase severity ('Critical', 'Error', 'Warning') so migration 000014 applies cleanly on upgrade paths. Engine rewrite (internal/service/policy.go): * evaluateRule rewritten. All six arms now: 1. Parse rule.Config into the per-arm typed struct. 2. Bad JSON → log at ValidateCertificate boundary and skip this rule (no co-located poisoning of other rules in the same batch). 3. Empty/null Config → emit the pre-D-008 missing-field violation (backwards compat invariant — operators who haven't reconfigured still see the same output). 4. 
Violations emitted carry rule.Severity (no more hardcoded Warning); D-006 column is now load-bearing. * CertificateLifetime arm reads NotBefore/NotAfter from the certificate's latest version via CertRepo. Injected via PolicyService.SetCertRepo() setter — avoids churning ~36 NewPolicyService call sites while keeping the lifetime arm optional (degrades to a log+skip if the setter is not wired). Server wiring (cmd/server/main.go): * policyService.SetCertRepo(certRepo) wired after construction. Tests (internal/service/policy_test.go): * 25 new subtests across 5 groups: - TestEvaluateRule_SeverityPassThrough (6): every rule type emits violations carrying rule.Severity, not hardcoded. - TestEvaluateRule_ConfigConsumed (12): every per-arm Config path exercised positive + negative. - TestEvaluateRule_EmptyConfig_BackCompat (3): empty/null Config still emits pre-D-008 missing-field violations. - TestEvaluateRule_BadConfig_SkipsRule: malformed JSON logs and skips cleanly without poisoning neighbors. - TestEvaluateRule_CertificateLifetime_RepoScenarios (3): ok when repo wired, log+skip when not, handles missing NotBefore/NotAfter edges. Provenance: D-008 surfaced during D-005/D-006 remediation review in `7a0ea35`. That commit added persistence and CI pins for the severity field but did not re-verify the evaluation layer consumed it; this finding and fix close the audit-process gap.	2026-04-18 14:55:56 +00:00
Shankar	7a0ea35b97	fix(policies): stop 400ing the "+ New Policy" button + add per-rule severity (D-005, D-006) Coverage Gap Audit findings D-005 (P0) + D-006 (P1) fixed together in a single commit because they share the same root cause — policy CRUD sending values the backend silently rejects — and splitting them would leave a half-working UI between commits. ## D-005 (P0): PoliciesPage dropdown 400s every Create Policy Root cause ---------- `web/src/pages/PoliciesPage.tsx` populated the Type `<select>` from a hardcoded `['key_algorithm', 'ownership', 'allowed_issuers', ...]` array. The backend's `internal/api/handler/validators.go::ValidatePolicyType` enforces the TitleCase allowlist `AllowedIssuers`, `AllowedDomains`, `RequiredMetadata`, `AllowedEnvironments`, `RenewalLeadTime` — defined in `internal/domain/policy.go`. Every Create Policy request was rejected with `400 invalid policy type`. The error surfaced only as a transient toast; the modal closed anyway. Silent user-visible failure. Fix --- - `web/src/api/types.ts`: added `POLICY_TYPES` and `POLICY_SEVERITIES` tuples with `as const` and narrowed `PolicyRule.type`, `.severity`, and `PolicyViolation.severity` to the literal-union types. Dropdown is now sourced from the tuple; casing drift becomes a compile error. - `web/src/pages/PoliciesPage.tsx`: rekeyed `severityStyles` / `severityDots` to the TitleCase values, added `humanize()` for display (AllowedIssuers → "Allowed Issuers"), removed the `badge-neutral` fallback that was papering over the mismatch. - `web/src/api/types.test.ts` (new): pins both tuples exactly. If anyone edits one side of the frontend/backend contract without the other, CI fails with a clear assertion. Pure-TS vitest, no RTL dependency. ## D-006 (P1): `severity` field silently dropped on create/update Root cause ---------- `PolicyRule` had no `Severity` field in `internal/domain/policy.go`. 
The frontend has always sent `severity` on create/update, but Go's `json.Decoder` (default settings, no `DisallowUnknownFields`) silently dropped it. The value never reached PostgreSQL. Every rule rendered with the same severity because there was no severity — just a display computation downstream. Fix: option (b), full-stack schema add (not delete-the-field) ------------------------------------------------------------- - Migration `000013_policy_rule_severity` (up + down): adds `severity VARCHAR(50) NOT NULL DEFAULT 'Warning'` to `policy_rules` with CHECK constraint `severity IN ('Warning', 'Error', 'Critical')`. No index — three-value column on a low-thousands-rows table, planner will seq-scan regardless. PG 11+ metadata-only ADD COLUMN, safe on live data. - `internal/domain/policy.go`: added `Severity PolicySeverity` field. - `internal/repository/postgres/policy.go`: plumbed `severity` through ListRules SELECT + Scan, GetRule SELECT + Scan, CreateRule INSERT, UpdateRule UPDATE (4 queries). - `internal/service/policy.go::UpdatePolicy`: if the client omits severity on a PUT (zero-value empty string), fetch the existing rule and preserve its severity. Without this, partial updates would trip the NOT NULL CHECK and 500. Preserves pre-existing behavior for Name/Type (out of scope). - `internal/api/handler/policies.go::CreatePolicy`: default empty severity to `'Warning'`, then validate via `ValidatePolicySeverity`. 400 with clear message instead of 500 on CHECK violation. `UpdatePolicy`: validates severity only when provided. - `internal/mcp/types.go` + `internal/mcp/tools.go`: added optional `severity` on the MCP `create_policy` / `update_policy` tool inputs so LLM callers stay in sync with the wire contract. - `api/openapi.yaml`: added `severity` to the `PolicyRule` schema with the enum and default. 
Acceptance criterion (user-defined) ----------------------------------- "Create a rule with severity=Critical, reload the page, and still see Critical — no silent drops." Verified end-to-end: frontend sends `severity: "Critical"`, handler validates, service persists, DB stores, GET returns, React renders the correct badge. Seed data --------- `migrations/seed.sql`: four demo rules now have differentiated severities — `pr-require-owner` → Warning, `pr-allowed-environments` → Error, `pr-max-certificate-lifetime` → Critical, `pr-min-renewal-window` → Warning. The user called out that seeding all four at the same severity makes the feature look decorative; differentiation demonstrates the column carries real signal. ## Integration test fix (side effect of D-006) `internal/integration/e2e_test.go::TestCrossResourceWorkflow/CreatePolicy` was sending `"severity": "High"` — a value from the pre-audit severity vocabulary that the new `ValidatePolicySeverity` correctly rejects with 400. Changed to `"Error"` (closest semantic match in the new TitleCase allowlist). Only severity reference in the integration/ directory; verified via grep. ## Out of scope, logged for follow-up (d/D-008) Three policy-engine drift issues orthogonal to D-005 + D-006, explicitly deferred per direction: 1. `migrations/seed.sql` policy_rules INSERTs use lowercase TYPE values (`'ownership'`, `'environment'`, `'lifetime'`, `'renewal_window'`). These are load-bearing on `internal/service/policy.go::evaluateRule`'s `switch rule.Type` (which also uses the lowercase strings). Migrating requires coordinated changes across seed + evaluation engine. 2. `migrations/seed_demo.sql:482-483` contains lowercase `'critical'` severity — will now fail the new CHECK constraint. Separate fix. 3. `evaluateRule` hardcodes `Severity: domain.PolicySeverityWarning` on emitted violations and ignores the configured `rule.Config`. The new severity column is read correctly on the CRUD path but not yet consulted during evaluation. 
## Verification Backend: - `go build ./...` — clean - `go vet ./...` — clean - `go test -short ./...` — all packages green, including `internal/service` (policy service), `internal/api/handler` (policy + MCP handler tests), `internal/integration` (e2e_test.go after fix), `internal/domain`, `internal/repository/postgres`. Frontend: - `tsc --noEmit` — clean - `vitest run` — 223/223 passing (4 new assertions in types.test.ts) - `vite build` — clean (only the pre-existing chunk-size warning)	2026-04-18 13:02:04 +00:00
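The D-006 root cause (a JSON field silently dropped because the target struct has no matching field) is easy to reproduce with the standard library. The sketch below is illustrative only: `ruleWithoutSeverity` and `decode` are hypothetical names, and the struct is a stand-in for the pre-fix `PolicyRule`. It shows that the default `json.Decoder` accepts the payload, while `DisallowUnknownFields` would have surfaced the mismatch as an error.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// ruleWithoutSeverity models a struct that lacks a Severity field, as the
// commit message describes for the pre-fix domain type.
type ruleWithoutSeverity struct {
	Name string `json:"name"`
	Type string `json:"type"`
}

// decode parses body into the struct. With strict set, unknown JSON keys
// become a decode error instead of being ignored.
func decode(body string, strict bool) (ruleWithoutSeverity, error) {
	var r ruleWithoutSeverity
	dec := json.NewDecoder(strings.NewReader(body))
	if strict {
		dec.DisallowUnknownFields()
	}
	err := dec.Decode(&r)
	return r, err
}

func main() {
	body := `{"name":"demo","type":"AllowedIssuers","severity":"Critical"}`

	_, err := decode(body, false)
	fmt.Println("default decoder error:", err) // severity is dropped without complaint

	_, err = decode(body, true)
	fmt.Println("strict decoder error:", err) // the unknown key is reported
}
```

The commit chose to add the field end to end rather than reject it; strict decoding is shown here only as a way to detect this class of drift.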
Shankar	875f433c52	fix(m-9): aggregate per-endpoint scan errors in NetworkScanService Before this fix, RunScan declared `scanErrors []string` but never appended to it. As a result: - the summary Info log ("network target scan completed") always reported `"errors": 0`, regardless of how many endpoints failed - the DiscoveryReport's `Errors` field — stored on the scan record and surfaced in the GUI scan history — was always nil Operators who needed to understand scan failures had to enable Debug logging and grep through the noise of expected sweep-scan connection refusals. The per-endpoint log level (Debug) is deliberate and correct — scanning a /24 typically produces 200+ connection-refused results, and logging each at Warn would create massive log spam at default verbosity. The bug was the silent loss of the aggregate count. This commit: - extracts the partitioning logic into `collectScanResults`, a pure method that splits per-endpoint results into discovered certificate entries and a list of endpoint error strings - populates the errors list with "<address>: <error>" so the scan record correlates failures back to specific endpoints - preserves the existing Debug-level per-endpoint log (sweep noise discipline) — no change to default-verbosity log output The summary Info log's "errors" field and the DiscoveryReport's Errors field now reflect the true failure count. Debug detail remains available for operators diagnosing specific endpoints. Audit scope note: the M-9 finding narrative implied broad Debug-level hiding of real errors across AWS SM, Azure KV, GCP SM, and network scan sentinel agents. On investigation, the three cloud-discovery connectors (awssm, azurekv, gcpsm) already use appropriate Warn/Error discipline for per-item and root-level failures. Only the network scanner had a silent observability gap, and it was a missed append rather than a misapplied log level. See audit resolution log for full details. 
CWE: CWE-778 (Insufficient Logging) — aggregate failure count lost. Tests: 4 new unit tests on collectScanResults covering the aggregation path (success + failure mix), all-success, all-failed, and empty-input degenerate cases. All tests pass with -race. Verification: - go build ./cmd/server/... ./cmd/agent/... ./cmd/mcp-server/... ./cmd/cli/... exit 0 - go vet ./... exit 0 - go test -race -count=1 -timeout 300s [full CI race path] exit 0 - golangci-lint run ./... --timeout 5m (v2.11.4) 0 issues - govulncheck ./... (@latest) 0 in-code vulnerabilities - go test -count=1 -cover ./internal/service/... 68.0% (> 55% threshold) Invariants preserved: - collectScanResults signature: method on *NetworkScanService, input []domain.NetworkScanResult, return ([]DiscoveredCertEntry, []string) - Debug log key names unchanged ("address", "error") - DiscoveryReport schema unchanged (Errors field already existed) - Sentinel agent ID "server-scanner" unchanged - No migration, no API, no wire-format change Refs: M-9 Medium finding; audit resolution log appended in follow-up commit on workspace-level audit report.	2026-04-18 02:34:14 +00:00
Shankar	297ff8349e	M-2 PR-F: Middleware/ACME ctx-propagation + contextcheck linter + audit closeout Final PR in the six-commit M-2 sequence (PR-A: CertificateService cluster `ad2734c`, PR-B: IssuerService+TargetService `20b0e75`, PR-C: Policy/Profile/Owner/Team `e5a7b45`, PR-D: Job/Notification/Audit `c2e9ebf`, PR-E: AgentService `6b2d137`, PR-F: this commit). PR-A through PR-E collapsed the service-layer shim methods and deleted every in-production context.Background() / context.TODO() call from internal/service/; this PR completes the sweep across the non-service tiers (HTTP middleware + ACME connector) and wires the contextcheck linter so regressions fail CI. Three narrow edits land the D-3 pattern (context.WithoutCancel for subsidiary async writes and deferred shutdown contexts): - internal/api/middleware/audit.go -- async audit goroutine now runs on auditCtx := context.WithoutCancel(r.Context()) instead of context.Background(). Preserves request-scoped values (trace ID, auth) while detaching from the request's cancellation so the audit write does not get killed when the response completes. Goroutine is still tracked via a.wg (M-1 shutdown drain) so Flush(ctx) behaviour is unchanged. CWE-772 Missing Release of Resource (goroutine leak potential) + CWE-400 Resource Exhaustion (missed cancellation propagation). - internal/api/middleware/middleware.go -- Recovery panic path now logs via slog.ErrorContext(ctx, ...) instead of log.Printf. Request-scoped trace/auth metadata now carries through the panic log, matching every other request log. D-3 non-bypass: the context is r.Context() captured before the defer, so even a panic mid-handler propagates the ctx's trace ID into the ERROR log line. - internal/connector/issuer/acme/acme.go (HTTP-01 challenge server shutdown) -- defer shutdown context derived from context.WithTimeout(context.WithoutCancel(ctx), 5s) instead of context.Background().
Preserves parent ctx values, detaches from parent cancellation so Shutdown always gets its full 5-second budget even when the parent was cancelled. Matches the same pattern applied in ACME's solveAuthorizationsDNS01 and solveAuthorizationsDNSPersist01. Linter wiring: .golangci.yml adds `contextcheck` to the enabled set. golangci-lint v2.11.4 now fails CI on any function that takes a context.Context parameter but calls into context.Background() or context.TODO() instead of propagating -- regression guard for all five prior PRs. Verification (CI parity, GOCACHE=/tmp/gocache GOMODCACHE=/tmp/gomodcache GOLANGCI_LINT_CACHE=/tmp/lintcache): - go build ./... -> 0 - go vet ./... -> 0 - golangci-lint run (contextcheck enabled) -> 0 issues - go test -race -short ./internal/api/middleware/... -> PASS - go test -race -short ./internal/scheduler/... -> PASS - go test -race -short ./internal/connector/issuer/acme/... -> PASS - go test -race -short ./internal/service/... -> PASS - rg "context\.(Background\|TODO)" internal/service/ internal/scheduler/ internal/connector/ internal/api/middleware/ -> 0 non-test hits (one pedagogical godoc reference in audit.go documenting why context.Background() would be wrong remains intentional) Wire-format invariants preserved: 0 API routes, 0 SQL migrations, 0 frontend bytes, 0 OpenAPI bytes, 0 connector interface signature changes, 0 new env vars, 0 new external dependencies (pure context stdlib). The AuditRecorder interface signature, the body-hash algorithm (SHA-256 16 hex chars), the excluded-path short-circuit, the actor-extraction path, the responseWriter status-capture wrapper, the AuditServiceAdapter, and all 116 API routes under /api/v1/, /.well-known/est/, /scep, /health, /auth are byte-identical. M-2 aggregate across PR-A through PR-F: 57 files, +635 / -613 (PR-A 12f +227/-237, PR-B 9f +150/-146, PR-C 17f +156/-148, PR-D 11f +67/-63, PR-E 4f +9/-15, PR-F 4f +26/-4). 
With M-2 closed, 8 of 10 Medium findings resolved; M-9, M-10, L-1..L-4, I-1..I-8 remain post-v2.1.0 hardening batch. Audit complete. Commit: `855124a9d9`. Sections: 12. Findings: 2/7/10/4/6.	2026-04-18 01:43:47 +00:00
Shankar	6b2d1375e6	fix(m2-pr-e): collapse AgentService.HeartbeatWithContext into Heartbeat PR-E of 6 in the M-2 end-to-end remediation sequence. Collapses the HeartbeatWithContext wrapper into a single ctx-first Heartbeat method, matching D-1 (ctx-only signatures, no dual forms). The handler-facing method name is preserved (D-4) — internal/api/handler/agents.go already declares `Heartbeat(ctx, ...)` on its local service interface, and the handler mock at internal/api/handler/agent_handler_test.go already takes `_ context.Context` as its first param, so no handler churn. Changes ------- internal/service/agent.go - Delete the zero-body Heartbeat wrapper that forwarded to HeartbeatWithContext with context.Background(). - Rename HeartbeatWithContext → Heartbeat (ctx-bearing body folded directly into the canonical method). internal/service/agent_test.go - TestHeartbeat (L95) and TestHeartbeat_NotFound (L128): agentService.HeartbeatWithContext(ctx, ...) → .Heartbeat(ctx, ...). internal/service/concurrent_test.go - L162: agentSvc.HeartbeatWithContext(ctx, agentID, metadata) → .Heartbeat(ctx, agentID, metadata). internal/service/context_test.go - L179 + L232: agentSvc.HeartbeatWithContext(ctx, ...) → .Heartbeat(...) - L185 + L238 t.Logf strings: "HeartbeatWithContext with ..." → "Heartbeat with ..." to match the collapsed method name. Verification (Go 1.25.9 linux/arm64, CI-parity caches) ------------------------------------------------------ go build ./... clean go vet ./... clean go test -short ./internal/service/... ./internal/api/handler/... \ ./internal/integration/... all ok go test -race -short same set all ok go test -short ./... all packages ok golangci-lint run ./... 0 issues Locked decisions from the M-2 plan: D-1 ctx-only signatures (no dual forms) D-4 preserve handler method names facing the router D-5 domain types stay ctx-free Audit complete. Commit: `855124a9d9`. Sections: 12. Findings: 2/7/10/4/6.	2026-04-18 01:25:20 +00:00
Shankar	c2e9ebf62f	fix(m2-pr-d): thread ctx through Job/Notification/Audit services Collapse CancelJobWithContext into CancelJob; eliminate 10 context.Background() hits across the Job+Notification+Audit service cluster by threading ctx through their handler-facing service interfaces. Services (ctx-first): - service/job.go: ListJobs, GetJob, CancelJob, ApproveJob, RejectJob now accept ctx; the CancelJobWithContext wrapper is removed (handler callers continue to invoke CancelJob, now ctx-aware). - service/notification.go: ListNotifications, GetNotification, MarkAsRead accept ctx. - service/audit.go: ListAuditEvents, GetAuditEvent accept ctx. Handlers (interface + callsites): - handler/jobs.go, handler/notifications.go, handler/audit.go: local service interfaces updated, r.Context() threaded at every callsite. Tests: - Mock services updated to match the new interfaces (ctx accepted and ignored via '_ context.Context' first parameter; Fn closure fields unchanged). - job_test.go / notification_test.go callsites thread context.Background() to match production shape. Verification: go build ./... ok go vet ./... ok go test -short ./... ok go test -race -short ./... ok golangci-lint run ./... 0 issues Locked decisions from the M-2 plan: D-1 ctx-only signatures (no dual forms) D-4 preserve handler method names facing the router D-5 domain types stay ctx-free Audit complete. Commit: `855124a9d9`. Sections: 12. Findings: 2/7/10/4/6.	2026-04-18 01:20:46 +00:00
Shankar	e5a7b4585c	M-2 PR-C: Collapse Policy/Profile/Owner/Team services to ctx-first signatures - Add ctx first param to 21 service-layer handler-interface methods across policy.go (6), profile.go (5), owner.go (5), team.go (5) - Replace 24 context.Background() call sites with received ctx; use context.WithoutCancel(ctx) for subsidiary audit-recording ops to preserve fire-and-forget audit semantics without inheriting caller cancellation - Add ctx first param to 21 handler-interface method signatures across policies.go (6), profiles.go (5), owners.go (5), teams.go (5) - Thread r.Context() through 21 HTTP handler sites (ListPolicies, GetPolicy, CreatePolicy, UpdatePolicy, DeletePolicy, ListViolations, ListProfiles, GetProfile, CreateProfile, UpdateProfile, DeleteProfile, ListOwners, GetOwner, CreateOwner, UpdateOwner, DeleteOwner, ListTeams, GetTeam, CreateTeam, UpdateTeam, DeleteTeam) - Update MockPolicyService/MockProfileService/MockOwnerService/MockTeamService mock method impls with _ context.Context first param (Fn fields unchanged — closures do not need ctx); update mock impls in integration/lifecycle_test.go for all four services - Update 12 service-layer test callsites (policy_test.go ×2, owner_test.go ×5, team_test.go ×5, profile_test.go ×13) to pass context.Background() at the call site Audit complete. Commit: `855124a9d9`. Sections: 12. Findings: 2/7/10/4/6.	2026-04-18 01:10:06 +00:00
Shankar	20b0e75d48	M-2 PR-B: Collapse IssuerService + TargetService to ctx-first signatures - Delete bare TestConnection wrapper in IssuerService; rename TestConnectionWithContext → TestConnection - Delete TestTargetConnection delegate shim in TargetService (canonical TestConnection already ctx-first) - Add ctx first param to 10 handler-interface methods (ListIssuers/GetIssuer/CreateIssuer/UpdateIssuer/DeleteIssuer and ListTargets/GetTarget/CreateTarget/UpdateTarget/DeleteTarget) - Replace 16 context.Background() call sites with received ctx - Thread r.Context() through 12 HTTP handler sites in issuers.go and targets.go (outer TargetHandler.TestTargetConnection HTTP method name preserved for router compatibility) - Update MockIssuerService, MockTargetService, and mockTargetService (integration) for ctx-first forwarding; update test callsite literals Audit complete. Commit: `855124a9d9`. Sections: 12. Findings: 2/7/10/4/6.	2026-04-18 00:46:58 +00:00
Shankar	ad2734c10a	fix(m-2): thread context through CertificateService cluster Collapses CertificateService, RevocationSvc, and CAOperationsSvc to ctx-accepting method signatures. Removes context.Background() synthesis at 24 internal call sites across certificate.go, revocation_svc.go, and ca_operations.go. - Primary repo calls inherit request cancellation via the passed ctx. - Audit and notification dispatches use context.WithoutCancel(ctx) so they survive client disconnect. - Collapses TriggerRenewal/TriggerRenewalWithActor, TriggerDeployment/TriggerDeploymentWithActor, and RevokeCertificate/RevokeCertificateWithActor sibling pairs into single canonical ctx-accepting methods (decisions D-1, D-2). Handlers pass r.Context(). Mocks and tests updated to match new signatures. No HTTP surface change, no OpenAPI change. PR 1 of 6 in the M-2 remediation chain. Master green at this commit. Refs: certctl-audit-report.md M-2 (L143, L224)	2026-04-18 00:29:37 +00:00
Shankar	5c69cdf33b	fix(audit): drain in-flight recording goroutines on shutdown (M-1) Audit events spawned from the HTTP middleware ran in detached goroutines using context.Background(). On SIGTERM the DB pool was closed before those goroutines finished writing, silently dropping audit events (CWE-662 Improper Synchronization / CWE-400 Uncontrolled Resource Consumption). NewAuditLog now returns an *AuditMiddleware struct that tracks every spawned goroutine with sync.WaitGroup. Callers wire the middleware via its Middleware method value (preserves the existing func(http.Handler) http.Handler shape) and drain the WaitGroup with Flush(ctx), which blocks until in-flight recordings complete or the provided context is cancelled — mirroring scheduler.WaitForCompletion. Flush is invoked in cmd/server/main.go between http.Server.Shutdown (no new requests accepted) and db.Close (pool torn down), with a timeout returning ErrAuditFlushTimeout wrapping ctx.Err(). Request-derived inputs (method, path, status) are snapshotted before the goroutine spawn so the worker does not race with http.Server reusing r after the handler returns. Tests: TestAuditLog_FlushDrainsInFlightGoroutines TestAuditLog_FlushTimeoutReturnsErrAuditFlushTimeout Verification: go build ./... : 0 go vet ./... : 0 go test -race -short ./... : 0 (all packages) go test -cover ./internal/api/middleware : 81.4% golangci-lint run : 0 issues govulncheck ./... : 0 vulns in called code	2026-04-17 17:29:48 +00:00
certctl-copilot	5d18fee987	fix(repository): idempotent sentinel agent creation via ON CONFLICT (M-6) Sentinel agents (server-scanner, cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm) were created on startup with a plain INSERT whose duplicate-key error was swallowed unconditionally. That silenced every other DB failure too (connectivity drop, permissions change, unrelated constraint violation) — a restart after the first boot quietly de-fanged cloud discovery and the network scanner (CWE-662, CWE-209-adjacent). Shape A: add AgentRepository.CreateIfNotExists using ON CONFLICT (id) DO NOTHING RETURNING id + sql.ErrNoRows discrimination. This keeps the strict Create semantics (duplicate-key is an error) intact for real agent registration and gives sentinels their own idempotent path. - repo: CreateIfNotExists returns (created bool, err error); false,nil on pre-existing row; false,wrapped err on anything else. - interface: CreateIfNotExists added to AgentRepository. - main.go: 4 sentinel sites log Error/Info/Debug distinctly. - mocks: service + integration mocks implement the new method. - tests: 4 new testcontainers integration tests cover first-insert, idempotent second-call, concurrent 16-goroutine race (exactly one creator, no duplicate-key panic), and pre-cancelled context surfacing. Coverage gates (go test -cover): service 67.6%/55, handler 78.6%/60, domain 92.7%/40, middleware 80.0%/30, crypto 86.7%/85. Race/vet/golangci-lint v2.11.4 (0 issues)/govulncheck v1.2.0 clean across all touched packages.	2026-04-17 16:32:07 +00:00
Shankar	4819f8263d	fix(repository): populate TargetIDs in certificate scan helper (M-7) scanCertificate never queried the certificate_target_mappings junction table, so Certificate.TargetIDs was always nil on reads. This silently broke deployment lookups, bulk revocation filters, cert detail pages, and any code path that iterated TargetIDs to dispatch target work. Fix: - Convert scanCertificate to a receiver method (r *CertificateRepository) so it has access to the DB for the secondary junction query. - Get(): scan the row, then call r.getTargetIDs(ctx, certID) to populate TargetIDs with a single targeted query. - List() and GetExpiringCertificates(): inline the scan loop so we can collect all certIDs first, then call getTargetIDsForCertificates once with pq.Array(certIDs) to avoid N+1 round-trips. Build a map and attach TargetIDs to each certificate in the result set. - Default TargetIDs to []string{} (not nil) when a cert has no mappings so JSON marshals as [] rather than null. Tests: - New integration test file certificate_targetids_test.go with 5 subtests exercising Get / List / GetExpiringCertificates single and multi-target cases plus the empty-slice vs nil contract. - Uses the shared testcontainers-go setupTestDB infrastructure and skips under 'go test -short' so CI (which excludes ./internal/repository/... from coverage paths anyway) stays green. Addresses M-7 from certctl-audit-report.md.	2026-04-17 15:41:08 +00:00


179 Commits