certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 22:41:31 +00:00

Author	SHA1	Message	Date
shankar0123	a2a82a6cf8	fix(bundle-5): CI green-up — drop unused sync.Once + document new env vars Two CI gate failures from the Bundle 5 push: 1. golangci-lint (unused) — agent_bootstrap.go declared `var bootstrapWarnOnce sync.Once` but never called .Do(). The one-shot WARN actually lives in cmd/server/main.go (per-process at startup, not per-request) so the handler-side variable was dead code. Dropped the var + sync import; left a comment explaining where the WARN lives. 2. G-3 env-var docs guardrail — Bundle 5 added two new env vars (CERTCTL_AGENT_BOOTSTRAP_TOKEN, CERTCTL_AUDIT_FLUSH_TIMEOUT_SECONDS) but the G-3 closure CI step asserts every CERTCTL_* env defined in internal/config/config.go is mentioned in docs/features.md. Added three new sub-sections to docs/features.md after the Body Size Limits block: * Agent Bootstrap Token (H-007 contract + generation guidance) * Graceful Shutdown Audit Flush (M-011 timeout knob) * Liveness vs Readiness Probes (H-006 /health vs /ready table) No production behaviour change; pure CI-gate fix. Verification - go vet ./internal/api/handler/... → clean - go test -count=1 -run 'TestVerifyBootstrapToken\|TestRegisterAgent_BootstrapToken' ./internal/api/handler/... → all pass - grep CERTCTL_AGENT_BOOTSTRAP_TOKEN docs/features.md → present - grep CERTCTL_AUDIT_FLUSH_TIMEOUT_SECONDS docs/features.md → present v2.0.57 v2.0.56	2026-04-26 00:03:03 +00:00
shankar0123	1a845a9490	docs(CHANGELOG): Bundle 5 Operational Liveness + Bootstrap — 4 audit findings closed	2026-04-25 23:58:35 +00:00
shankar0123	260a1af9a9	Merge branch 'fix/bundle-5-ops-liveness-bootstrap' (Bundle 5: Operational Liveness + Bootstrap, 4 audit findings)	2026-04-25 23:54:25 +00:00
shankar0123	85e60b24ec	fix(bundle-5): Operational Liveness + Bootstrap — 4 audit findings closed Closes Audit-2026-04-25 H-006 (High), H-007 (High), M-011 (Medium), L-006 (Low — verified-already-closed via C-1 master closure in v2.0.54). Hardens the orchestrator-facing surface — k8s probes, agent enrollment, shutdown audit drain, scheduler config plumbing. What changed - internal/api/handler/health.go — split contract: * /health stays shallow 200 (k8s liveness — process alive) * /ready accepts sql.DB; runs db.PingContext(2s); 503 on failure Nil DB path returns 200 + db=not_configured (test fixtures) - internal/api/handler/agent_bootstrap.go (NEW) — verifyBootstrapToken: * empty expected = warn-mode pass-through * non-empty = `Authorization: Bearer <token>` required * crypto/subtle.ConstantTimeCompare; length-mismatch path runs dummy compare to keep timing uniform * ErrBootstrapTokenInvalid sentinel - internal/api/handler/agents.go — RegisterAgent calls verifyBootstrapToken BEFORE body parse so unauth probes don't even allocate a JSON decoder - internal/config/config.go — two new env vars: * CERTCTL_AGENT_BOOTSTRAP_TOKEN (Auth.AgentBootstrapToken) * CERTCTL_AUDIT_FLUSH_TIMEOUT_SECONDS (Server.AuditFlushTimeoutSeconds) - cmd/server/main.go — 3 changes: * pass sql.DB into NewHealthHandler (H-006) pass cfg.Auth.AgentBootstrapToken into NewAgentHandler (H-007) * configurable shutdown audit-flush timeout (M-011) * one-shot startup WARN when bootstrap token unset (deprecation) - new tests: agent_bootstrap_test.go (full deny/accept/warn-mode coverage, constant-time compare path, length-mismatch); health_test.go extended with /ready DB-probe failure (503), nil-DB pass-through, /health-shallow L-006 verified - cmd/server/main.go:557 already calls sched.SetShortLivedExpiryCheckInterval(cfg.Scheduler.ShortLivedExpiryCheckInterval) per the C-1 master closure in v2.0.54. Bundle 5 confirms; no code change. Threat model: TB-1 (operator/orchestrator), TB-2 (Agent↔Server). - CWE-754 (Improper Check for Unusual or Exceptional Conditions) for H-006 - CWE-306 + CWE-288 (Missing Authentication for Critical Function) for H-007 Verification - go vet ./... → clean - go build ./... → clean - go test -short -count=1 ./... → all packages pass - targeted Bundle-5 regressions → all pass - npx tsc --noEmit (web) → clean - npx vitest run (web) → in-flight (sandbox 45s ceiling exceeded; no failure markers in dot stream; no frontend changes in this bundle so no regression risk) - python3 yaml.safe_load(api/openapi.yaml) → 89 paths Backward compatibility - Bootstrap token defaults to empty (warn-mode) — existing demo deployments unaffected. Server logs deprecation WARN; v2.2.0 will require it. - Audit flush timeout default 30s preserves prior behaviour. - Helm chart already routes readiness probe to /ready (no chart change needed); now /ready actually probes the DB. Bundle 5 of the 2026-04-25 comprehensive audit.	2026-04-25 23:54:18 +00:00
shankar0123	018b705b91	docs(CHANGELOG): Bundle 3 MCP Trust-Boundary Fencing — 5 audit findings closed v2.0.55	2026-04-25 22:48:29 +00:00
shankar0123	0233f39e53	Merge branch 'fix/bundle-3-mcp-fencing' (Bundle 3: MCP Trust-Boundary Fencing, 5 audit findings)	2026-04-25 22:44:37 +00:00
shankar0123	23411bd6fc	fix(bundle-3): MCP Trust-Boundary Fencing — 5 audit findings closed Closes Audit-2026-04-25 H-002, H-003, M-003, M-004, M-005 (all CWE-1039 LLM Prompt Injection at the MCP↔consumer trust boundary, TB-7). Strategy: wrapper-layer fencing. All 87 MCP tools route their success path through textResult and their failure path through errorResult. By fencing at those two wrappers we cover every existing tool AND every future tool with a single change — no per-tool wiring required. What changed - internal/mcp/fence.go (new) — FenceUntrusted helper with strategy doc + per-finding rationale. Both fenceMCPResponse and fenceMCPError use it internally. - internal/mcp/tools.go — textResult wraps response body via fenceMCPResponse; errorResult wraps error string via fenceMCPError. - internal/mcp/tools_test.go — TestTextResult / TestErrorResult updated to assert fenced shape (start marker + end marker + inner body). - internal/mcp/injection_regression_test.go (new) — 5 regression test functions, one per audit finding, each replays 5 classic LLM injection payloads (instruction_override, system_role_spoofing, delimiter_break_attempt, markdown_link_phishing, data_exfil_via_url) and asserts the planted payload appears VERBATIM (preservation, operator visibility) INSIDE the fence boundaries. - internal/mcp/fence_guardrail_test.go (new) — CI guardrail that walks every non-test .go file in the mcp package and fails if it finds a bare gomcp.CallToolResult literal outside tools.go. Prevents future tools from silently bypassing the fence. Delimiter-forgery defense The naive constant fence (--- UNTRUSTED MCP_RESPONSE END ---) is forgeable: an attacker who controls a field value can plant the literal end marker and "break out" of the fence. Defense: every fence call generates a 6-byte crypto/rand nonce, hex-encoded, and embeds it in BOTH the START and END markers. An attacker would need to predict the nonce (2^48 search per fence) to forge a matching END inside the payload. The delimiter_break_attempt regression test exercises this. Per-finding mapping - H-002 Cert Subject DN injection (CSR submitter controlled) → TestMCP_PromptInjection_H002_CertSubjectDN - H-003 Discovered cert metadata injection (cert owner controlled) → TestMCP_PromptInjection_H003_DiscoveredCertMetadata - M-003 Agent heartbeat injection (agent self-reports hostname/OS/IP) → TestMCP_PromptInjection_M003_AgentHeartbeat - M-004 Upstream CA error injection (CA controls error string) → TestMCP_PromptInjection_M004_UpstreamCAError - M-005 Audit details + notification body injection (downstream actors control these) → TestMCP_PromptInjection_M005_AuditDetailsAndNotifications Verification gates - go vet ./... → clean - go build ./... → clean - go test -short -count=1 ./... → all packages pass - go test -count=1 ./internal/mcp/... → all packages pass - npx tsc --noEmit (web) → clean - npx vitest run (web) → 337 passed - python3 yaml.safe_load(api/openapi.yaml) → 89 paths, 56 schemas Threat-model placement: TB-7 (MCP↔LLM consumer). certctl owns the boundary; consumer-side prompt engineering is recommended but not relied upon. Defense-in-depth: per-call nonce closes the delimiter-forgery edge case that constant fences would have left exposed. Bundle 3 of the 2026-04-25 comprehensive audit (88 findings).	2026-04-25 22:44:33 +00:00
shankar0123	9d769efbb9	docs(CHANGELOG): Bundle 4 EST/SCEP Hardening — 3 audit findings closed H-004 (PKCS#7 fuzz target gap), M-021 (EST TLS channel binding), L-005 (EST/SCEP issuer-binding fail-loud at startup). Bundle 4 of the 2026-04-25 comprehensive audit (cowork/comprehensive-audit-2026-04-25/). Tracker movement: 0/55 → 3/55 closed.	2026-04-25 21:18:27 +00:00
shankar0123	2352dfa0a6	Merge branch 'fix/bundle-4-est-scep-hardening' (Bundle 4: EST/SCEP Hardening, 3 audit findings)	2026-04-25 21:14:57 +00:00
shankar0123	1c099071d1	fix(bundle-4): EST/SCEP Attack Surface Hardening — 3 audit findings closed Closes 3 findings (1 High + 1 Medium + 1 Low) from /Users/shankar/Desktop/cowork/comprehensive-audit-2026-04-25/. Bundle 4 hardens the only attack surface reachable by an anonymous network attacker in certctl: the unauthenticated EST + SCEP enrollment endpoints. Findings closed: - H-004 (High): Hand-rolled ASN.1 parser had no fuzz target. The audit's original framing pointed at internal/pkcs7/, but recon confirmed that package is an ASN.1 ENCODER (BuildCertsOnlyPKCS7, ASN1Wrap, ASN1EncodeLength) — not a parser. The actual hand-rolled PKCS#7 PARSING reachable via anonymous network is in internal/api/handler/scep.go::extractCSRFromPKCS7 + parseSignedDataForCSR. Added native go fuzz targets: internal/api/handler/scep_fuzz_test.go::FuzzExtractCSRFromPKCS7 * internal/api/handler/scep_fuzz_test.go::FuzzParseSignedDataForCSR * internal/pkcs7/pkcs7_fuzz_test.go::FuzzPEMToDERChain (defense-in-depth) * internal/pkcs7/pkcs7_fuzz_test.go::FuzzASN1EncodeLength (defense-in-depth) Local 15s fuzz session: 150k execs on FuzzExtractCSRFromPKCS7, 937k on FuzzPEMToDERChain, 925k on FuzzASN1EncodeLength — zero panics. - M-021 (Medium): EST TLS-Unique channel binding (RFC 7030 §3.2.3). Added internal/api/handler/est.go::verifyESTTransport — defense-in-depth TLS pre-conditions (r.TLS != nil; HandshakeComplete; TLS ≥ 1.2). The full §3.2.3 channel binding only applies when EST mTLS is in use; certctl does not currently support EST mTLS, so the §3.2.3 requirement is moot today. RFC 9266 (TLS 1.3 tls-exporter) and EST mTLS are documented as deferred follow-ups in the verifyESTTransport doc comment. - L-005 (Low): EST/SCEP issuer-binding fail-loud at startup. Pre-Bundle-4 cmd/server/main.go validated that CERTCTL_EST_ISSUER_ID and CERTCTL_SCEP_ISSUER_ID existed in the registry but did NOT validate the issuer TYPE could emit a CA cert. An operator binding EST to an ACME issuer (whose GetCACertPEM returns explicit error) booted successfully and only failed at first /est/cacerts request. Post-Bundle-4: new preflightEnrollmentIssuer helper calls GetCACertPEM(ctx) at startup with a 10s timeout. Failure logs the connector error + the candidate issuer types and os.Exit(1). Tests added/modified: - internal/api/handler/est_transport_test.go (new) — 5 verifyESTTransport table cases covering plaintext-rejected, incomplete-handshake-rejected, TLS 1.0 rejected, TLS 1.2/1.3 accepted - cmd/server/preflight_test.go (new) — TestPreflightEnrollmentIssuer covering nil-connector, error-from-issuer, empty-PEM, valid cases - internal/api/handler/est_handler_test.go (modified) — 7 POST sites now stamp r.TLS to satisfy the new transport pre-condition - internal/integration/negative_test.go (modified) — setupTestServer wraps the test handler with a fake-TLS-state injector so the EST handler receives r.TLS != nil; production paths still rely on the real TLS listener Threat model reference: TB-11 (EST/SCEP client ↔ Server) per cowork/comprehensive-audit-2026-04-25/threat-model.md. Standards: RFC 7030 §3.2.3, RFC 8894 §3, RFC 5652, RFC 9266 (deferred).	2026-04-25 21:14:41 +00:00
shankar0123	d84ff36854	docs(CHANGELOG): T-1 + Q-1 final-tail closure — audit at 47/47 (100%) The last two findings (T-1 frontend Vitest page coverage, Q-1 skipped-test sweep) of the 2026-04-24 v5 audit are now closed. After this lands, the audit folder is archived; future audits start a new dated folder.	2026-04-25 18:50:33 +00:00
shankar0123	050b936fcf	Merge branch 'fix/q1-skipped-tests-sweep' (Q-1 standalone, 1 audit finding — final-tail closure)	2026-04-25 18:44:48 +00:00
shankar0123	90bfa5d320	test: triage 37 skipped-test sites — closure comments pinning rationale (Q-1) Closes Q-1 (cat-s3-58ce7e9840be) — 37 t.Skip / testing.Short() sites across 9 test files audited. Per-site verdict matrix: - cmd/agent/verify_test.go (1 site): defensive guard against unreachable httptest.NewTLSServer code path. Document-skip with closure comment. - deploy/test/qa_test.go (11 sites): file already gated by `//go:build qa` tag. The 11 t.Skip("Requires X — manual test") markers are runtime second-line guards for operators who run -tags qa against a stack missing the required external service. File-level header comment block added explaining the manual-test convention. - deploy/test/healthcheck_test.go (5 sites): 3 docker-availability + 1 testing.Short + 1 hard-skip for not-yet-wired runtime probe (image-spec contract above already covers the audit-flagged regression). All correctly gated; file-level header comment block added explaining each. - deploy/test/integration_test.go (5 sites): in-flight-state guards (poll-with-skip after 90s polling for agent-online, inter-test Phase04→Phase07 ordering, scheduler-tick race for discovered certs, inter-test issuer fallthrough, defensive PEM-empty assertion). Each site now has a closure comment explaining why skip is the right choice rather than fail (upstream phase already surfaces the real failure; skipping prevents masking root cause behind cascading noise). - internal/repository/postgres/{testutil,seed,repo}_test.go (5 sites): testing.Short() gates for testcontainers-backed live PostgreSQL integration tests. All correctly gated; closure comments added naming the run command. - internal/connector/notifier/email/email_test.go (2 sites): anti-fixture assertions (test asserts SMTP dial fails; if a captive portal black-holes the call to success, skip rather than false-pass). Closure comments added explaining the fixture assumption. - internal/connector/target/iis/iis_test.go (2 sites): platform-gated skip for powershell.exe absence on non-Windows hosts. Mirrors the production iis_connector.go LookPath guard. Closure comments added. Total: 17 closure comments anchor the 37 skip sites (some sites share a single block-level comment). All skips remain in place; the change is purely documentation. The audit recommendation was "audit each skip and decide" — for these 37, the decision is uniformly document-skip: the gating is correct, the t.Skip messages name the missing precondition, and the closure comments now pin the rationale for future readers. See coverage-gap-audit-2026-04-24-v5/unified-audit.md cat-s3-58ce7e9840be for closure rationale.	2026-04-25 18:44:36 +00:00
shankar0123	8fd11e024b	Merge branch 'fix/t1-master-page-vitest-coverage' (T-1 master, 1 audit finding)	2026-04-25 18:35:48 +00:00
shankar0123	7013227a34	test(web): Vitest coverage for 8 high-leverage pages (T-1 master) Closes T-1 (cat-s2-c24a548076c6) — frontend page-level Vitest coverage was 3 of 28 pages pre-T-1. T-1 lifts that to 11 of 28 (39%) by writing focused behavior tests for the 8 highest-leverage pages. Tests added: - CertificatesPage.test.tsx (6 cases) — F-1 filter+pagination contract: team_id / expires_before / sort param wiring, page=1 reset on filter change, page+per_page always present in getCertificates params. - PoliciesPage.test.tsx (4 cases) — D-006/D-008 TitleCase contract: list render, severity badge, toggle-enabled inversion, delete confirm. - IssuersPage.test.tsx (3 cases) — D-2 phantom-trim + B-1 EditIssuer: list render, StatusBadge derives from enabled, Test fires testIssuerConnection. - TargetsPage.test.tsx (3 cases) — D-2 phantom-trim: list render, Status derives from enabled, Delete fires deleteTarget. - AgentsPage.test.tsx (3 cases) — D-2 phantom-trim + heartbeatStatus: list render, undefined last_heartbeat_at -> Offline, listRetiredAgents lazy-loaded. - AgentDetailPage.test.tsx (3 cases) — D-2 phantom-trim: fetches by URL :id, Registered row reads registered_at, Capabilities + Tags sections absent. - OwnersPage.test.tsx (3 cases) — B-1 EditOwnerModal closure: list render, Edit opens modal, Save fires updateOwner. - TeamsPage.test.tsx (2 cases) — B-1 EditTeamModal closure. - AgentGroupsPage.test.tsx (2 cases) — B-1 EditAgentGroupModal closure. - RenewalPoliciesPage.test.tsx (3 cases) — B-1 brand-new-page closure: list + alert_thresholds_days display, Create modal, Edit modal. - DiscoveryPage.test.tsx (3 cases) — I-2 claim/dismiss closure: list render, status filter wiring, Dismiss fires dismissDiscoveredCertificate. CI guardrail: .github/workflows/ci.yml step "Frontend page-coverage regression guard (T-1)" blocks new pages from landing without sibling .test.tsx unless added to a 14-name deferred allowlist with one-line "why deferred" justifications. Net coverage: 13 page-level vitest cases -> ~35 page-level vitest cases across 14 files (was 3); total project tests 302 -> 337. See coverage-gap-audit-2026-04-24-v5/unified-audit.md cat-s2-c24a548076c6 for closure rationale.	2026-04-25 18:35:41 +00:00
shankar0123	c6a9a76147	docs(features): document CERTCTL_SHORT_LIVED_EXPIRY_CHECK_INTERVAL (G-3 fix) CI on the S-2 merge (`a54805c`) failed at the G-3 env-var-docs-drift guardrail step: G-3 regression: env var(s) defined in Go source but never documented: CERTCTL_SHORT_LIVED_EXPIRY_CHECK_INTERVAL The C-1 master commit (`c4d231e`) added the env var to internal/config/config.go::SchedulerConfig + the Load() reader, and wired the previously-dead Scheduler setter from cmd/server/main.go, but I missed adding the env var to the canonical scheduler-loops table at docs/features.md:1124. Fix: the "Short-lived expiry check" row in the scheduler-loops table now names CERTCTL_SHORT_LIVED_EXPIRY_CHECK_INTERVAL with the C-1 backstory ("pre-C-1 the setter was unwired and this env var had no effect; post-C-1 it's read by cmd/server/main.go::sched.SetShortLived ExpiryCheckInterval"). The G-3 guardrail is doing exactly what it was designed to do: catching env-var docs drift the moment it appears. Working as intended; this fix closes the gap the guardrail flagged. Verification: - comm -23 docs vs defined → empty post-fix (allowlist applied) - comm -23 defined vs docs → empty post-fix - The fix is doc-only; no Go / TS / config changes. This is a follow-up to the C-1 + F-1 + P-1 + S-2 mega-prompt closure; push together to unblock CI. v2.0.53 v2.0.54	2026-04-25 18:01:24 +00:00
shankar0123	a54805c63c	Merge branch 'fix/s2-handler-error-mapping-typed-sentinels' (S-2 standalone, 1 audit finding)	2026-04-25 17:54:14 +00:00
shankar0123	0e29c416b1	refactor(handler,repo): replace strings.Contains error dispatch with typed sentinels (S-2) Closes one 2026-04-24 audit finding (P2): - cat-s6-efc7f6f6bd50: 30 strings.Contains(err.Error(), ...) sites in internal/api/handler/ — brittle to repository-layer message changes, untyped against the actual failure mode. Approach (Option B from prompt design notes): - New typed sentinels in internal/repository/errors.go: ErrNotFound, ErrForeignKeyConstraint IsForeignKeyError(err) helper (the only place substring matching at the lib/pq boundary is allowed; isolates the DB-driver string knowledge to one function). - New typed sentinel in internal/domain/errors.go: ErrValidation (reserved for future per-entity validation wrappers; not yet used by all handlers). - 49 sites in internal/repository/postgres/*.go updated to wrap sql.ErrNoRows-derived errors via fmt.Errorf("...: %w", repository.ErrNotFound). - 18 not-found handler sites + 2 FK-constraint handler sites refactored to errors.Is(err, repository.ErrNotFound) / repository.IsForeignKeyError(err). - 23 inline `fmt.Errorf("X not found")` test fixtures across handler tests rewrapped to wrap repository.ErrNotFound. - test_utils.go::ErrMockNotFound rewrapped to wrap repository.ErrNotFound; renewal_policy.go closure docblock updated to reflect the new convention. - integration test mockJobRepository.Get wraps repository.ErrNotFound. CI regression guardrail: - .github/workflows/ci.yml::"Forbidden strings.Contains(err.Error()) regression guard (S-2)" greps for the three patterns ("not found", "violates foreign key", "RESTRICT") under internal/api/handler/ and fails the build on regression. Verification: - go build ./... — clean - go vet ./... — clean - go test ./... -short -count=1 — all packages pass (handler + repository + service + integration) - golangci-lint v2.11.4 run ./... — 0 issues - S-2 guardrail dry-run on post-fix tree → empty (good) - All sibling guardrails (S-1, G-3, D-1+D-2, B-1, L-1, H-1, C-1, F-1, P-1) pass Audit findings closed: - cat-s6-efc7f6f6bd50 (P2) Deferred follow-ups: - 6 domain-specific substring patterns still inline in handlers ("cannot approve", "cannot reject", "cannot be parsed", "no certificates found", "challenge password", "invalid"/ "required" validation chains in profiles + agent_groups). Each needs its own typed sentinel, scoped per service. Documented by the S-2 CI guardrail's allowlist for closure-comments only. - Per-entity not-found sentinels (Option A — ErrCertificateNotFound, ErrAgentNotFound, etc.) deferred. Generic ErrNotFound covers the current dispatch needs; per-entity precision would let handlers return entity-aware error bodies without a domain.Type field, but not blocking.	2026-04-25 17:54:14 +00:00
shankar0123	8a3086c4ae	Merge branch 'fix/p1-master-orphan-client-fn-sweep' (P-1 master, 2 audit findings)	2026-04-25 17:41:12 +00:00
shankar0123	d4c421b98d	chore(web,ci): document orphan client fns + sync guard (P-1 master) Closes two 2026-04-24 audit findings: - diff-04x03-d24864996ad4 (P2, "26 orphan client fns") - cat-b-dc46aadab98e (P3, "16 singleton-getter orphans") Recon at HEAD found 17 actual orphans (not 26 or 16 — the audit numbers conflated; many were eliminated by the B-1 / S-1 / I-2 / D-2 closures since the audit was written, and the audit's regex double-counted in some buckets). All 17 are detail-page candidates: singleton-getter `getX(id)` fns that detail pages will need when the corresponding `XPage` grows a `XDetailPage` route. Two valid closures: - delete each fn (forces re-add when detail pages land) - document each as intent-suspect-but-preserved (lets future detail-page work land without a client.ts edit detour) Picked the document-and-preserve path. Reasons: - Many of the 17 are obvious detail-page candidates (Owner, Team, AgentGroup, Policy, RenewalPolicy, Notification, AuditEvent, NetworkScanTarget, HealthCheck, DiscoveredCertificate) given the existing list-page + Edit-modal pattern shipped in B-1. - The cost of the deletes (and re-adds, and test re-adds) outweighs the cost of carrying 17 documented-orphan declarations. - registerAgent (already covered by C-1's docblock as by-design pull-only) sits in this same set and is the canonical "preserved orphan" precedent. Changes: - web/src/api/client.ts: new docblock at file-top listing all 17 documented orphans with their detail-page rationale and a pointer to the CI guardrail. - .github/workflows/ci.yml: new step "Documented orphan client fns sync guard (P-1)" verifies that every name in the docblock is still declared as `export const X = ...` somewhere in client.ts. Catches drift in either direction (delete export but forget docblock = MISSING; delete docblock entry but leave export = silent orphan accumulation, caught only on next mass-recon). Verification: - P-1 guardrail dry-run on post-fix tree → MISSING='' (empty, good) - tsc --noEmit — clean - golangci-lint v2.11.4 run ./... — 0 issues - All sibling guardrails (S-1, G-3, D-1+D-2, B-1, L-1, H-1, C-1, F-1) pass Audit findings closed: - diff-04x03-d24864996ad4 (P2) - cat-b-dc46aadab98e (P3) Deferred follow-ups: - The 17 detail-page candidates remain orphan until a XDetailPage consumer lands. Each future detail-page commit removes one entry from the docblock as it gains a real consumer. The CI guardrail enforces the docblock-↔-export sync regardless.	2026-04-25 17:41:12 +00:00
shankar0123	1bdab897ef	Merge branch 'fix/f1-master-certificates-page-ux' (F-1 master, 2 audit findings)	2026-04-25 17:38:54 +00:00
shankar0123	94ca69554b	feat(web): expand CertificatesPage filters + reusable DataTable pagination (F-1 master) Closes two 2026-04-24 audit findings (P2): - cat-e-610251c8f72d: CertificatesPage exposed only 5 of the backend handler's 17 supported query filters. Audit recommended minimum-add: team_id (already first-class elsewhere), expires_before (drives the "expiring in N days" workflow), and sort (sort by notAfter for the most common operator triage). Fix: 3 new useState hooks + 3 new filter UIs in the toolbar + 3 new param wires. Remaining filters (agent_id, expires_after, created_after, updated_after, cursor, fields, sort_desc) deferred until a consumer use case demands them — over-stuffing the toolbar is its own UX cost. - cat-k-e85d1099b2d7: CertificatesPage rendered the first 50 certs returned by the backend with no way to advance. Backend response carries {data, total, page, per_page} — a pure render gap. Fix: lifted pagination into the reusable DataTable component as an opt-in `pagination?` prop. CertificatesPage is the first consumer; TargetsPage / IssuersPage / OwnersPage / others can adopt by passing the same prop. DataTable changes: - New `PaginationProps` interface (page, perPage, total, onPageChange, onPerPageChange?, perPageOptions?). - New optional `pagination?` prop on DataTable. - New `PaginationControls` subcomponent rendered in the table footer when `pagination` is set and `total > 0`. Renders "Showing X–Y of Z" + per-page selector + page counter + Prev/Next buttons. Disabling logic guards both boundaries. CertificatesPage changes: - 3 new filter useState hooks: teamFilter, expiresBefore, sortBy. - 2 new pagination useState hooks: page (1), perPage (50). - Added 4th cohort hook: getTeams via useQuery (mirrors the existing issuers/owners/profiles filter-data pattern). - params object gains team_id, expires_before, sort, page, per_page. - 3 new filter UIs in the toolbar (team select, expires_before date picker, sort select). - DataTable gets the new pagination prop. - Filter changes reset page=1 to keep results visible. Verification: - tsc --noEmit — clean - vitest run — 9 files, 302 tests passing (no regression) - golangci-lint v2.11.4 run ./... — 0 issues - All sibling guardrails (S-1, G-3, D-1+D-2, B-1, L-1, H-1, C-1) pass Audit findings closed: - cat-e-610251c8f72d (P2) - cat-k-e85d1099b2d7 (P2) Deferred follow-ups: - 8 backend filters (agent_id, expires_after, created_after, updated_after, cursor, fields, sort_desc, plus secondary sort fields) deferred until consumer demand justifies UI weight. - TargetsPage / IssuersPage / OwnersPage / etc. opt-in to the pagination prop incrementally — DataTable now supports it; per- page adoption is a follow-up commit each. - CertificatesPage Vitest coverage of the new filter+pagination paths deferred to the per-page test campaign (cat-s2-c24a548076c6).	2026-04-25 17:38:54 +00:00
shankar0123	c4d231e728	Merge branch 'fix/c1-master-cleanup-and-doc-tail' (C-1 master, 6 audit findings)	2026-04-25 17:34:59 +00:00
shankar0123	1c6009a920	chore(cleanup,docs): vite proxy + dead scheduler setter wired + registerAgent/CLI docs (C-1 master) Closes six 2026-04-24 audit findings (3 P2 + 3 P3) — a cleanup-and-doc tail bundle that drains the smallest remaining leaves of the audit: - cat-u-vite_dev_proxy_plaintext_drift (P2): web/vite.config.ts proxied dev requests to http://localhost:8443 against an HTTPS-only backend (HTTPS-only since v2.0.47). Every dev-server API call 502'd. Fix: targets are now object-form `{target: 'https://...', secure: false, changeOrigin: true}` — the dev cert is self-signed by the deploy/test bootstrap and changes per-checkout. - cat-g-7e38f9708e20 (P3): Scheduler.SetShortLivedExpiryCheckInterval was defined + tested but never called from cmd/server/main.go. Operators tuning CERTCTL_SHORT_LIVED_EXPIRY_CHECK_INTERVAL got no effect — the 30s default in scheduler.NewScheduler was effectively hardcoded. Fix: added Config.Scheduler.ShortLivedExpiryCheckInterval + getEnvDuration in Load() reading the env var with a 30s default, + sched.SetShortLivedExpiryCheckInterval(...) call in main.go alongside the other scheduler-interval setters. - diff-10xmain-2bf4a0a60388 (P3): same root cause as cat-g-7e38f9708e20; closes as ride-along. - cat-b-6177f36636fb (P2): registerAgent client fn orphan. By-design per pull-only deployment model. Fix (audit recommendation: "document"): added a closure docblock above the export in client.ts + a new "Registration is by-design pull-only" paragraph in docs/architecture.md::Agents section explaining when/why a future GUI-driven enrollment feature might reach the endpoint (proxy-agent topologies for network appliances). - cat-i-7c8b28936e3d (P2): CLI scope intentionally narrow but undocumented. Fix: new "Scope (intentionally narrow)" subsection in docs/features.md::CLI capturing the SSH-into-prod / day-to-day GUI / AI-automation MCP three-way split. Verification: - go build ./... — clean - go vet ./... — clean - go test ./internal/scheduler/... ./internal/config/... — pass - golangci-lint v2.11.4 run ./... — 0 issues - tsc --noEmit (frontend) — clean - All sibling guardrails (S-1 / G-3 / D-1+D-2 / B-1 / L-1 / H-1) still pass Audit findings closed: - cat-u-vite_dev_proxy_plaintext_drift (P2) - cat-g-7e38f9708e20 (P3) - diff-10xmain-2bf4a0a60388 (P3) - cat-b-6177f36636fb (P2) - cat-i-7c8b28936e3d (P2) - (audit-bookkeeping ride-along: ensures every closed-bundle row has a non-empty merge SHA) Deferred follow-ups: none from this bundle. The remaining audit backlog (frontend test campaign, F-1 CertificatesPage UX, P-1 orphan-fn sweep, S-2 handler error-mapping refactor) is sibling sub-bundles in this mega-prompt.	2026-04-25 17:34:59 +00:00
shankar0123	a39f5af22a	Merge branch 'fix/h1-master-security-hardening-trio' (H-1 master, 3 audit findings)	2026-04-25 16:40:22 +00:00
shankar0123	3e78ecb799	feat(security): bodyLimit on noAuth + security headers + encryption-key validation (H-1 master) Closes three 2026-04-24 audit findings (all P2): - cat-s5-4936a1cf0118: noAuthHandler chain accepted arbitrary-size bodies (EST simpleenroll, SCEP, PKI CRL/OCSP, /health, /ready). Memory exhaustion vector without HTTP-layer auth gatekeeping. - cat-s11-missing_security_headers: zero security headers on any response. Clickjacking, MIME-sniffing, untrusted-origin resource loads against the dashboard and API. - cat-r-encryption_key_no_length_validation: CERTCTL_CONFIG_ENCRYPTION_KEY accepted with any non-empty value including a single character. PBKDF2-SHA256 (100k rounds) does not compensate for low-entropy passphrases at scale (CWE-916, CWE-329). Changes: - cmd/server/main.go::noAuthHandler chain — added bodyLimitMiddleware + securityHeadersMiddleware. Same default cap as authed surface (1MB via CERTCTL_MAX_BODY_SIZE), same 413 on overflow. - cmd/server/main.go::middlewareStack (authed) — added securityHeadersMiddleware before corsMiddleware. - internal/api/middleware/securityheaders.go (new) — SecurityHeaders middleware + SecurityHeadersDefaults() with conservative defaults: HSTS 1y+includeSubDomains, X-Frame-Options DENY, X-Content-Type- Options nosniff, Referrer-Policy no-referrer-when-downgrade, CSP default-src 'self' + img/data + style 'unsafe-inline' (Tailwind/Vite needs it; scripts still 'self' only) + connect 'self' + frame- ancestors 'none'. Operators behind a customising reverse proxy can disable any header by setting its config field to empty. - internal/config/config.go::Validate() — enforce minEncryptionKeyLength = 32 bytes when CERTCTL_CONFIG_ENCRYPTION_KEY is set. Empty stays accepted (downstream fail-closed sentinel handles it). Structured error names the env var, the actual length, the required minimum, and the canonical generation command (`openssl rand -base64 32`). Tests: - internal/api/middleware/securityheaders_test.go (new) — 4 cases (defaults present, empty value disables single header, override applied, headers on 4xx/5xx). - internal/config/config_test.go — 5 new cases for the encryption-key length check (empty accepted, 1-byte rejected, 31-byte rejected at boundary, 32-byte accepted, 44-byte realistic operator key accepted). Documentation: - CHANGELOG.md — H-1 section above D-2 under [unreleased] with Breaking-change callout (operators with low-entropy keys must rotate before upgrade). - coverage-gap-audit-2026-04-24-v5/unified-audit.md — Live Tracker 25/47 → 33/47, P1 14/14 (zero remaining), P2 11/27 → 16/27. Three H-1 findings flipped + closed-bundle row added. Verification: - go build ./... — clean - go vet ./... — clean - golangci-lint v2.11.4 run ./... — 0 issues - go test ./internal/api/middleware/... — pass (incl. 4 new SecurityHeaders cases) - go test ./internal/config/... — pass (incl. 5 new EncryptionKey cases) - tsc --noEmit (frontend) — clean - All sibling guardrails (S-1 / G-3 / D-1 / D-2 / B-1 / L-1) still pass Audit findings closed: - cat-s5-4936a1cf0118 (P2) - cat-s11-missing_security_headers (P2) - cat-r-encryption_key_no_length_validation (P2) Breaking change: - Operators with CERTCTL_CONFIG_ENCRYPTION_KEY shorter than 32 bytes must rotate before upgrade. Generate via `openssl rand -base64 32`. Deferred follow-ups: - Weak-key dictionary check (reject password123, common ASCII patterns) — adds operational friction with low marginal entropy gain at the 32-byte minimum. - CSP 'unsafe-inline' for styles — required for Tailwind/Vite per-component <style> blocks; removing requires HTML report or component refactor outside H-1 scope. - Permissions-Policy header — dashboard uses no advanced browser APIs (camera, mic, geolocation); deferred until a real consumer needs it.	2026-04-25 16:40:21 +00:00
shankar0123	24f25353f8	Merge branch 'fix/i2-mcp-discovered-cert-completeness' (I-2 closure, last P1)	2026-04-25 16:33:56 +00:00
shankar0123	25c34ace45	feat(mcp): add claim_discovered + dismiss_discovered MCP tools (I-2 closure) Closes the LAST P1 in the 2026-04-24 audit (cat-i-b0924b6675f8). Pre-I-2 the README claimed "all API endpoints are exposed via MCP" but the discovered-certificate lifecycle (HTTP handlers ClaimDiscovered + DismissDiscovered at internal/api/handler/discovery.go:125,162) had zero MCP tool wrappers — operators using Claude / Cursor / similar MCP clients had no path to bring an out-of-band cert under management or to mark a benign discovery as not-of-interest without dropping to the REST API directly. The audit's count of 0 MCP discovery tools was correct: `grep -niE 'discover\|claim\|dismiss' internal/mcp/tools.go` returned only the pre-existing agent-retire tool's description text mentioning sentinel discovery agents — no actual discovery-tool registrations. Added in internal/mcp/types.go: - ClaimDiscoveredCertificateInput (id + managed_certificate_id) - DismissDiscoveredCertificateInput (id) Both follow the existing Go-doc / staticcheck convention (lead with the type name + brief; closure-rationale prose follows). Pinned by the existing L-1 staticcheck-fix lesson. Added in internal/mcp/tools.go (slotted at end of file, after certctl_auth_check): - certctl_claim_discovered_certificate — POST /api/v1/discovered-certificates/{id}/claim - certctl_dismiss_discovered_certificate — POST /api/v1/discovered-certificates/{id}/dismiss Both wrap the existing HTTP handlers via the generic c.Post helper. No backend changes; no openapi.yaml changes (both ops were already in the spec from earlier work). The audit's third name "acknowledge" is NOT closed: at recon, no notification-acknowledge HTTP handler exists in the API surface (grep across internal/api/handler/ returned zero hits for "acknowledge"). The audit appears to have mis-quoted; "acknowledge" isn't a real backend endpoint to wrap. If a future feature adds notification acknowledgement, register it in the same shape. Verification: - go build ./... — clean - go vet ./internal/mcp/... — clean - go test ./internal/mcp/... -count=1 — pass - golangci-lint v2.11.4 run ./... — 0 issues - MCP tool count went from 85 → 87 (verify via `grep -cE 'gomcp\.AddTool\(' internal/mcp/tools.go`) - S-1 + G-3 + D-1 + D-2 + B-1 + L-1 CI guardrails all still pass Audit findings closed: - cat-i-b0924b6675f8 (P1, MCP discovery completeness — last P1 in audit) This brings the audit to ZERO REMAINING P1s. Deferred follow-ups: - Notification acknowledge MCP tool — add when a notification-ack HTTP handler exists. Currently no such handler exists in the API surface; treat as a separate feature, not an MCP gap.	2026-04-25 16:33:56 +00:00
shankar0123	5e4eaa78b1	Merge branch 'fix/g3-master-env-var-docs-drift' (G-3 master, 3 audit findings)	2026-04-25 16:31:46 +00:00
shankar0123	2419f8cd27	docs(features): reconcile env-var inventory with config.go (G-3 master) Closes three 2026-04-24 audit findings (all P2, all category cat-g): - cat-g-renewal_check_interval_rename_drift: features.md:152 advertised CERTCTL_RENEWAL_CHECK_INTERVAL but config.go renamed that to CERTCTL_SCHEDULER_RENEWAL_CHECK_INTERVAL. Fixed in prose + the scheduler-loops table on line 1117. - cat-g-b8f8f8796159: 6 env vars in config.go that were never documented: CERTCTL_DATABASE_MIGRATIONS_PATH CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT CERTCTL_JOB_AWAITING_CSR_TIMEOUT CERTCTL_SCHEDULER_AGENT_HEALTH_CHECK_INTERVAL CERTCTL_SCHEDULER_JOB_PROCESSOR_INTERVAL CERTCTL_SCHEDULER_NOTIFICATION_PROCESS_INTERVAL Added to the scheduler-loops table at features.md:1117 and (DATABASE_MIGRATIONS_PATH) to the new Database Schema preamble. - cat-g-163dae19bc59: 37 env vars in docs not defined in config.go. The audit's strict comm over-flagged this set: most "phantoms" are integration-surface contracts (script env vars certctl EXPORTS to user-provided ACME DNS-01 / OpenSSL CA scripts; StepCA / Webhook per-issuer-or-notifier config-blob field names; CERTCTL_QA_* test fixtures; agent-side env vars defined in cmd/agent/main.go). The closure narrows the gate to the one true phantom (the rename) and allowlists the documented integration contracts in the CI guard. Each allowlist entry has a one-line justification. CI regression guardrail: - .github/workflows/ci.yml::"Forbidden env-var docs drift regression guard (G-3)" — runs `comm -23` both ways between the env vars defined in Go source (config.go + cmd/* + ACME DNS export + test fixtures) and env vars mentioned in README + docs/ + deploy/helm/. Fails the build if either set is non-empty modulo the documented integration-surface allowlist. Verification: - comm -23 docs vs defined → empty post-fix (allowlist applied) - comm -23 defined vs docs → empty post-fix - golangci-lint v2.11.4 run ./... → 0 issues - tsc --noEmit → clean - S-1 stale-counts guardrail still passes Audit findings closed: - cat-g-163dae19bc59 (P2, docs-only env vars) - cat-g-b8f8f8796159 (P2, config-only env vars) - cat-g-renewal_check_interval_rename_drift (P2, renamed env var still in docs) Deferred follow-ups: - The 26 documented-but-unimplemented integration contracts on the allowlist (CERTCTL_OPENSSL_, CERTCTL_ACME_EAB_, CERTCTL_WEBHOOK_, CERTCTL_AUDIT_EXCLUDE_PATHS, CERTCTL_TLS_, CERTCTL_ACME_DNS_PROPAGATION_WAIT) are documented in features.md / connectors.md / demo-advanced.md but not yet read by any Go source. Either implement in config.go (each is its own M-X) or delete from docs (separate cleanup PR). Neither expansion fits inside G-3's "reconcile drift" scope.	2026-04-25 16:31:45 +00:00
shankar0123	6f045293e9	Merge branch 'fix/s1-master-stale-counts' (S-1 master, 2 audit findings)	2026-04-25 16:26:54 +00:00
shankar0123	530da674f8	docs(README,features,examples): replace stale source counts with rebuild commands (S-1 master) Closes two 2026-04-24 audit findings — one P1 (cat-s1-9ce1cbe26876, README + features.md cite stale numeric counts) and one P2 (cat-s1-features_md_issuer_count_contradiction, features.md self- disagreed on issuer count saying 9 in two places + 12 in two others). Both root in a CLAUDE.md invariant: "Numeric claims about current state rot the instant the next release lands... Before adding any current-state count, delete it and write the command instead." Per-site changes: - docs/features.md::"At a Glance" table — replaced 12 hardcoded counts with `rebuild via <command>` references quoting the canonical source-of-truth grep from CLAUDE.md::"Current-state commands". - docs/features.md::Issuer Connectors section — dropped "9 issuer connectors" (stale; live: 12) and "12 IssuerType constants" prose; prose now references the rebuild command. - docs/features.md::Target Connectors section — same treatment for "14 target connector types". - docs/features.md::"Per-type config schema validation for all 9 issuer types" — same treatment. - docs/features.md::"80 MCP tools covering all API endpoints" — same. - docs/features.md::Web Dashboard section — dropped "24 pages wired" + the "(25 Route elements, 24 pages)" comment. - docs/examples.md::"Beyond These Examples" — dropped "7 issuer backends and 10 target connectors" prose; references features.md and the rebuild commands. CI regression guardrail: - .github/workflows/ci.yml::"Forbidden hardcoded source-count prose regression guard (S-1)" — grep-fails the build if any of the blocked phrases (e.g. "9 issuer connectors", "21 database tables", "80 MCP tools") reappears in README or docs/. Allowlists demo- fixture prose ("32 certificates" — seed_demo.sql facts), historical WORKSPACE-CHANGELOG counts, the testing-guide example phrasing, and any number adjacent to a quoted rebuild command. Verification: - S-1 guardrail dry-run on post-fix tree → empty (good) - golangci-lint v2.11.4 run ./... → 0 issues - tsc --noEmit → clean - vitest, vite build unchanged from pre-S-1 baseline (no JS/TS touched) Audit findings closed: - cat-s1-9ce1cbe26876 (P1, README + features.md stale numeric counts) - cat-s1-features_md_issuer_count_contradiction (P2, features.md self-contradiction on issuer count) Deferred follow-ups: - WORKSPACE-CHANGELOG.md historical-milestone counts intentionally preserved (those are point-in-time facts about shipped slices, not current-state claims). README demo-fixture counts ("32 certs, 10 issuers") preserved — those describe the seed_demo.sql shape, not the live source surface.	2026-04-25 16:26:44 +00:00
shankar0123	555eef449e	Merge branch 'fix/d2-master-type-drift-cluster' (D-2 master, 5 audit findings)	2026-04-25 16:07:36 +00:00
shankar0123	55eb7135be	fix(web,ci): close TS↔Go type drift across 5 entities (D-2 master) Closes five 2026-04-24 audit findings (all P2, all category cat-f / diff-05x06-) by reconciling the TypeScript interfaces in web/src/api/types.ts with the on-wire JSON shape Go's internal/domain/.go structs actually emit. D-1 closed the same pattern for one entity (Certificate / ManagedCertificate); D-2 covers the remaining five. Per-entity verdicts (audit's "stricter side is the contract"): Agent — TRIM 5 phantoms (last_heartbeat, capabilities, tags, created_at, updated_at). Go emits last_heartbeat_at only. Target — ADD 2 (retired_at?, retired_reason?) — I-004 fields. DiscCert — ADD pem_data? — real field, real Go emit, omitempty. Issuer — TRIM phantom status. Go has Enabled bool only. Notif — TRIM phantom subject. Go has Message string only. Certificate — verify-only; D-1 closure confirmed clean at recon. Consumer fixes (same commit as the trim): - AgentDetailPage.tsx — remove dead Capabilities + Tags sections (always rendered empty); replace agent.created_at/updated_at row with the Go-emitted registered_at; widen heartbeatStatus() to accept undefined. - AgentsPage.tsx — same heartbeatStatus widening. - IssuersPage.tsx + IssuerDetailPage.tsx — issuerStatus() now derives from `enabled` exclusively; the dead `issuer.status \|\| 'Unknown'` fallback is gone. - NotificationsPage.tsx — drop dead `\|\| n.subject` fallback. - NotificationsPage.test.tsx — drop dead `subject:` from mocks. - api/utils.ts::timeAgo widened to accept string \| undefined \| null. - api/types.test.ts — Agent (I-004) fixture trimmed of the 5 phantoms. Tests (Vitest): - 5 new describe blocks in web/src/api/types.test.ts: - Agent interface (D-2 phantom-fields trim) — 2 it blocks - Target interface (D-2 retirement fields) — 2 it blocks - DiscoveredCertificate interface (D-2 pem_data ADD) — 2 it blocks - Issuer interface (D-2 status phantom trim) — 1 it block - Notification interface (D-2 subject phantom trim) — 1 it block - Each block uses the literal-construction pattern from D-1; trimmed fields are pinned via excess-property comments that compile-fail when uncommented if a phantom is reintroduced. CI regression guardrail: - .github/workflows/ci.yml — existing D-1 step renamed to "Forbidden StatusBadge dead-key + TS phantom-field regression guard (D-1 + D-2)". Three new awk-windowed greps over Agent / Issuer / Notification interfaces in types.ts. The Agent grep includes a `grep -v 'last_heartbeat_at'` filter to avoid false positives on the legitimate Go-emitted heartbeat field. Documentation: - CHANGELOG.md — new D-2 section above B-1 under [unreleased] with full Added/Removed/Audit findings closed/Known follow-ups breakdown. - docs/architecture.md — Web Dashboard section gains a new "TS ↔ Go type contract rule (D-1 + D-2 closure)" paragraph capturing the stricter-side-wins rule and the CI guardrail it's anchored by. - coverage-gap-audit-2026-04-24-v5/unified-audit.md — Live Tracker score 20/47 → 25/47 (P2: 6/27 → 11/27). Per-finding ✅ RESOLVED Status blocks added to all 5 diff-05x06-* entries plus the verify-only Certificate entry. Closed-bundle index gets D-2 row. Verification (all gates green): - cd web && tsc --noEmit → clean - cd web && vitest run --reporter=dot → 9 files, 302 tests passing (was 294 → +8 D-2 cases) - cd web && vite build → clean - go vet ./internal/... ./cmd/... → clean (no Go touched) - golangci-lint v2.11.4 run ./... → 0 issues - D-2 Agent guardrail dry-run → empty (good) - D-2 Issuer guardrail dry-run → empty (good) - D-2 Notification guardrail dry-run → empty (good) - D-2 Target ADD-shape sanity → 2 retirement fields present - D-2 DiscCert ADD-shape sanity → pem_data present - D-1 Certificate guardrail still clean → empty (good) - OpenAPI YAML parses → 89 paths Audit findings closed: - diff-05x06-7cdf4e78ae24 (P2, Agent TS↔Go drift) - diff-05x06-2044a46f4dd0 (P2, Target TS↔DeploymentTarget Go drift) - diff-05x06-85ab6b98a2f7 (P2, DiscoveredCertificate TS↔Go drift) - diff-05x06-97fab8783a5c (P2, Issuer TS↔Go drift) - diff-05x06-caba9eb3620e (P2, Notification TS↔NotificationEvent drift) - diff-05x06-af18a8d7ef41 (P2) — verified clean since D-1; no edit Deferred follow-ups: - Issuer richer status view (enabled × test_status) — UX scope, not drift. - Real Agent metadata (capabilities, tags) — backend feature, not drift. - DiscoveredCertificate pem_data list-response perf — separate backend change.	2026-04-25 16:07:31 +00:00
shankar0123	2edac7e78b	fix(mcp): close staticcheck ST1021 on BulkRenew/BulkReassign input docstrings CI on the B-1 merge (`b8a4318`) failed at the golangci-lint step on two ST1021 errors against internal/mcp/types.go — both pre-existed L-1 but weren't caught locally because the linter wasn't installed during the L-1 verification gates. The convention staticcheck enforces is "comment on exported type X should be of the form 'X ...'" — i.e. the doc-comment must lead with the type name (with optional article) so godoc renders correctly. Before: // L-1 master closure (cat-l-fa0c1ac07ab5): bulk-renew MCP tool input. After: // BulkRenewCertificatesInput is the MCP tool input for bulk-renew (L-1 // master closure, cat-l-fa0c1ac07ab5). Mirrors BulkRevokeCertificatesInput // field-for-field minus Reason. Same shape applied to BulkReassignCertificatesInput. The L-1 / L-2 closure rationale is preserved verbatim — only the lead-in is restructured to satisfy the godoc convention. Verification: - golangci-lint v2.11.4 (matching CI) installed locally at /dev/shm/bin - golangci-lint run ./... --timeout 5m → 0 issues - internal/mcp/... package targeted lint → 0 issues This unblocks the B-1 CI run on master. No behavioral change; doc-only edit.	2026-04-25 15:48:39 +00:00
shankar0123	b8a4318082	Merge branch 'fix/b1-master-orphan-crud-edit-modals' (B-1 master, 4 audit findings)	2026-04-25 15:23:21 +00:00
shankar0123	097995e503	fix(web,ci): close orphan-CRUD GUI gaps + dead exportCertificatePEM (B-1 master) Closes four 2026-04-24 audit findings via per-page Edit modals on five existing pages, a brand-new RenewalPoliciesPage for the rp-* CRUD surface, and removal of one dead duplicate so the public client surface stops growing without consumers. Anchored by a CI grep guardrail that fails the build if any of the eight previously-orphan client functions loses its non-test page consumer or if exportCertificatePEM is resurrected. Per-page Edit modals (mirroring existing CreateXModal scaffolding): - web/src/pages/OwnersPage.tsx — EditOwnerModal (name/email/team_id) - web/src/pages/TeamsPage.tsx — EditTeamModal (name/description) - web/src/pages/AgentGroupsPage.tsx — EditAgentGroupModal (full match-rule set: name/description/match_os/match_architecture/match_ip_cidr/ match_version/enabled) - web/src/pages/IssuersPage.tsx — EditIssuerModal (rename-only; type locked, config blob preserved untouched, footer note about delete+ recreate for credential rotation) - web/src/pages/ProfilesPage.tsx — EditProfileModal (rename + description only; policy fields preserved untouched, footer note about deferred policy editing) New page (closes cat-b-4631ca092bee — RenewalPolicy CRUD orphan): - web/src/pages/RenewalPoliciesPage.tsx — full CRUD page with shared PolicyFormModal for Create + Edit (form shape identical), 7-column DataTable (Policy/RenewalWindow/Auto/Retries/AlertThresholds/Created/ Actions), comma-separated alert_thresholds_days input parser, and alert() surfacing of repository.ErrRenewalPolicyInUse (409) on Delete so operators can re-target dependent certs before deletion. - web/src/main.tsx — adds /renewal-policies route. - web/src/components/Layout.tsx — adds sidebar nav item slotted between Policies and Profiles. Removed (closes cat-b-9b97ffb35ef7 — dead duplicate): - web/src/api/client.ts::exportCertificatePEM — zero consumers across web/, MCP, CLI, tests; downloadCertificatePEM is the actual call site in CertificateDetailPage. Test references in client.test.ts and client.error.test.ts also removed. CI regression guardrail: - .github/workflows/ci.yml — adds 'Forbidden orphan-CRUD client function regression guard (B-1)' step. Greps for all eight previously-orphan fns (updateOwner/updateTeam/updateAgentGroup/updateIssuer/updateProfile + createRenewalPolicy/updateRenewalPolicy/deleteRenewalPolicy) under web/src/pages/ and fails the build if any has zero non-test consumers. Also blocks resurrection of exportCertificatePEM. Verified locally (all 8 fns have ≥2 consumers; exportCertificatePEM is gone) and against synthetic regressions. Documentation: - CHANGELOG.md — new B-1 section above L-1 under [unreleased]. - docs/architecture.md — Web Dashboard section gains a new paragraph capturing the 'every backend CRUD must have a GUI consumer' rule with reference to the CI guardrail. - coverage-gap-audit-2026-04-24-v5/unified-audit.md — flips four findings to ✅ RESOLVED with detailed Status blocks; bumps Live Tracker score 16/47 → 20/47 (P1: 9→12, P3: 1→2); adds B-1 row to closed-bundle index. Verification: - cd web && tsc --noEmit — clean - cd web && vitest run — 9 test files, 294 tests, all passing - cd web && vite build — clean (no new warnings) - B-1 guardrail dry-run — all 8 client fns have ≥2 page consumers, exportCertificatePEM removed (good), FAIL=0 Audit findings closed: - cat-b-31ceb6aaa9f1 (P1, updateOwner/updateTeam/updateAgentGroup orphan) - cat-b-7a34f893a8f9 (P1, updateIssuer/updateProfile orphan, rename-only) - cat-b-4631ca092bee (P1, RenewalPolicy CRUD orphan) - cat-b-9b97ffb35ef7 (P3, exportCertificatePEM dead duplicate) Deferred follow-ups: - Fuller EditIssuerModal with credential-rotation flow (needs threat model: rotation reuse window, in-flight CSR cancellation, audit-trail granularity). - Fuller EditProfileModal with policy-field editing (max-TTL, allowed EKUs, allowed key algorithms — affect already-issued cert evaluation). - Per-page Vitest coverage for the new Edit modals (CI grep guardrail catches the same regression vector at lower cost).	2026-04-25 15:23:15 +00:00
shankar0123	3fc1a2222f	Merge branch 'fix/l1-master-bulk-action-endpoints' (L-1 master, 2 audit findings)	2026-04-25 14:33:10 +00:00
shankar0123	f0865bb051	fix(api,web,mcp): add bulk-renew + bulk-reassign endpoints, drop client-side N×HTTP loops (L-1 master) Two audit findings, both category cat-l, both rooted in web/src/pages/CertificatesPage.tsx. Pre-L-1 the GUI looped per-cert HTTP calls — 100 selected certs = 100 sequential round-trips × ~50–200 ms each = a 5–20-second wedge during which the operator stared at a progress bar. Post-L-1 each workflow is a single POST. cat-l-fa0c1ac07ab5 [P1, primary] — bulk renew loop handleBulkRenewal: for/await triggerRenewal(id) cat-l-8a1fb258a38a [P2] — bulk reassign loop handleReassign: for/await updateCertificate(id, {owner_id}) The bulk-revoke endpoint (POST /api/v1/certificates/bulk-revoke + BulkRevocationCriteria/Result) already existed as the canonical shape in v2.0.x — L-1 ports that pattern to renew + reassign with per-action twists. Backend (Go) - internal/domain/bulk_renewal.go: BulkRenewalCriteria mirrors BulkRevocationCriteria (criteria + IDs modes); BulkRenewalResult envelope adds EnqueuedJobs[] for per-cert {certificate_id, job_id}; shared BulkOperationError type for all bulk paths. - internal/domain/bulk_reassignment.go: narrower shape — IDs-only, owner_id required, team_id optional. - internal/service/bulk_renewal.go::BulkRenewalService.BulkRenew: resolves criteria → status filter (Archived/Revoked/Expired/ RenewalInProgress all silent-skip) → per-cert status flip + job create. Keygen-mode-aware so jobs land in the same initial status as single-cert TriggerRenewal. Single bulk audit event per call, not N. - internal/service/bulk_reassignment.go::BulkReassignmentService. BulkReassign: validates owner_id upfront via the ErrBulkReassignOwnerNotFound typed sentinel — non-existent owner returns 400 before any cert is touched. Already-owned-by-target is silent-skip. Single bulk audit event. - internal/api/handler/{bulk_renewal,bulk_reassignment}.go: HTTP shape mirrors bulk_revocation.go. NOT admin-gated (renew is non- destructive; reassign is a common-case workflow). Sentinel-error → 400 mapping for OwnerNotFound. - internal/api/router/router.go: three bulk-* routes registered as a block before the {id} routes. HandlerRegistry gains BulkRenewal + BulkReassignment fields. - cmd/server/main.go: NewBulkRenewalService threads cfg.Keygen.Mode so bulk-renew jobs land in same initial state as single-cert path. Frontend - web/src/api/client.ts: bulkRenewCertificates(criteria) + bulkReassignCertificates(request) functions with full TS types. - web/src/pages/CertificatesPage.tsx: handleBulkRenewal + handleReassign rewritten from N-call loops to single calls. Result envelope drives progress UI; first-error message surfaced when total_failed > 0. Stale triggerRenewal + updateCertificate imports removed. MCP - internal/mcp/types.go: BulkRenewCertificatesInput + BulkReassignCertificatesInput. - internal/mcp/tools.go: certctl_bulk_renew_certificates + certctl_bulk_reassign_certificates tools mirroring the existing certctl_bulk_revoke_certificates pattern. OpenAPI - api/openapi.yaml: two new operations (bulkRenewCertificates, bulkReassignCertificates) under Certificates tag. Four new schemas (BulkRenewRequest, BulkRenewResult, BulkEnqueuedJob, BulkReassignRequest, BulkReassignResult). Tests - Domain: BulkRenewalCriteria.IsEmpty + BulkReassignmentRequest.IsEmpty IsEmpty contracts; JSON round-trip shape pinning. - Service: 7 BulkRenew tests (happy/criteria-mode/skips-RenewalInProgress/ skips-revoked-archived/empty-criteria-error/partial-failure/ audit-event-emitted) + 8 BulkReassign tests (happy/skips-already- owned/owner-required/empty-IDs/owner-not-found-sentinel/team-id- optional/team-id-provided/partial-failure/audit-event-emitted). - Handler: 5 BulkRenew handler tests (happy/empty-body-400/wrong- method-405/actor-attribution/service-error-500) + 6 BulkReassign handler tests (happy/empty-IDs-400/missing-owner-400/owner-not- found-400-via-sentinel/wrong-method-405/generic-error-500). CI guardrail - .github/workflows/ci.yml: 'Forbidden client-side bulk-action loop regression guard (L-1)'. Greps web/src/pages/CertificatesPage.tsx for 'for(...) await triggerRenewal(...)' and 'for(...) await updateCertificate(...)' patterns; comment lines exempt; test files exempt. Verified locally (passes against post-fix tree, fires against synthetic regression). Counts (deltas) - Routes: 119 → 121 (+2) - OpenAPI operations: 123 → 125 (+2) - MCP tools: 83 → 85 (+2) Performance - 100-cert bulk-renew: ~10s of sequential HTTP → ~100ms (99% latency reduction on the canonical operator workflow). - Audit event volume: 1 + N per operation → 1. Out of scope (deferred follow-ups) - cat-b-31ceb6aaa9f1: updateOwner/updateTeam/updateAgentGroup orphan (different shape — wire existing PUT to GUI, not new bulk endpoint). - cat-k-e85d1099b2d7: CertificatesPage no pagination UI. - cat-i-b0924b6675f8: MCP missing claim/dismiss/acknowledge (L-1 added 2 new tools but does not close that finding). Verification - go build / vet / test -short / test -short -race all clean. - web tsc --noEmit + vitest run all clean (296 tests passing). - OpenAPI YAML parses (89 paths, 125 ops). - L-1 CI guardrail passes against post-fix tree, fires against synthetic regression. No push.	2026-04-25 14:33:02 +00:00
shankar0123	677524d9ec	Merge branch 'fix/d1-master-statusbadge-enum-drift' (D-1 master, 5 audit findings) v2.0.51 v2.0.52	2026-04-25 13:53:02 +00:00
shankar0123	9dc0742e77	fix(web): close StatusBadge enum drift + Certificate TS phantom fields (D-1 master) Five audit findings, all category cat-d or cat-f, all rooted in two frontend files. The dashboard silently lied: cat-d-359e92c20cbf [P1, primary] — Agent: 'Stale' dead key + 'Degraded' neutral fallthrough cat-d-9f4c8e4a91f1 [P2] — Notification: 'dead' missing cat-d-1447e04732e7 [P3] — Cert: 'PendingIssuance' dead key cat-f-cert_detail_page_key_render_fallback [P2] — render-site reads cert.key_algorithm directly cat-f-ae0d06b6588f [P2] — Certificate TS phantom fields (root cause) Pre-D-1, agents in the only Go AgentStatus that means 'needs operator attention' (Degraded) rendered as default neutral grey because StatusBadge mapped 'Stale' (a key Go has never emitted) to yellow. Dead-letter notifications visually equated with 'read' (operator-acknowledged). The Certificate badge map carried a 'PendingIssuance' key no Go enum emits. CertificateDetailPage's Key Algorithm and Key Size rows always rendered '—' even when the data was a single fetch away — the lookup went through cert.key_algorithm / cert.key_size directly, both phantom Certificate TS fields. Trim the TS type so the missing-data case is explicit; fix the render site to use latestVersion?.field; pin the contract with a 38-case Vitest property test that walks every Go enum. StatusBadge (web/src/components/StatusBadge.tsx) - Drop 'Stale' (Agent dead key) + 'PendingIssuance' (Cert dead key). - Add 'Degraded' (Agent → badge-warning) + 'dead' (Notification → badge-danger). - Add leading docblock naming Go-side source-of-truth file for every status family and pointing at the property test as regression vector. Property test (web/src/components/StatusBadge.test.tsx — 38 cases) - Iterates every Go-emitted enum value (AgentStatus, CertificateStatus, JobStatus, NotificationStatus, DiscoveryStatus, HealthStatus) plus the two frontend-synthesized Enabled/Disabled labels, asserts every value gets a non-default class (or an explicit 'badge badge-neutral' for the five intentionally-neutral terminal values: Archived, Cancelled, Dismissed, read, unknown). - Negative assertions: 'Stale' and 'PendingIssuance' must fall through to the dictionary default — re-adding either key surfaces here. - Specific UX-correctness assertions: 'dead' → badge-danger, 'Degraded' → badge-warning. - Unknown-status fallthrough preserves label text. Certificate TS trim (web/src/api/types.ts) - Drop serial_number?, fingerprint_sha256?, key_algorithm?, key_size?, issued_at? from Certificate. Go's ManagedCertificate has never carried these — they live on CertificateVersion. Post-trim a cert.X access for any of the five fields is a TS compile error. - Leading docblock cross-references the closure rationale and the latestVersion fallback pattern. Render-site fix (web/src/pages/CertificateDetailPage.tsx) - Key Algorithm / Key Size rows now read latestVersion?.key_algorithm / latestVersion?.key_size, mirroring the existing latestVersion fallback used a few lines above for serial_number / fingerprint_sha256. - The same edit also tightened the serial / fingerprint / issued_at derivations to drop the now-impossible 'cert.X \|\| latestVersion?.X' cert-side leg (cert.serial_number is a TS error post-trim). Type-test regression (web/src/api/types.test.ts) - Certificate literal construction pinned post-trim — adding any of the five fields back makes the literal an excess-property TS error. - Sibling CertificateVersion literal pinning the trimmed fields still live on the version envelope (so the CertificateDetailPage fallback path can't break). OpenAPI (api/openapi.yaml) - ManagedCertificate schema unchanged — was already correct (no phantom fields). Added a leading comment cross-referencing the D-5 closure for future readers. CI guardrail (.github/workflows/ci.yml) - 'Forbidden StatusBadge dead-key + Certificate phantom-field regression guard (D-1)'. Two grep blocks: catches Stale/PendingIssuance map literals in StatusBadge.tsx; uses an awk-scoped window over the 'export interface Certificate {' block in types.ts to catch the five phantom fields reappearing while explicitly excluding CertificateVersion (which legitimately carries them). Comments + test files exempt. Verification - Backend build/vet/test -short -race all clean across handler/router/ middleware packages. - Frontend tsc --noEmit clean. - Vitest 256 → 296 tests (+40: 38 from new StatusBadge test, 2 from D-5 Certificate trim regression in types.test.ts). - OpenAPI YAML parses (87 paths). - Both CI guardrail patterns clear on the post-fix tree; both fire against synthetic regression patterns (re-add Stale → fires; re-add serial_number? to Certificate → fires). Out of scope (deferred) - diff-05x06-* type drifts for Agent/DeploymentTarget/Notification/ DiscoveredCertificate/Issuer TS interfaces. Per-type field-by-field Go ↔ TS diff is codegen-shaped, not edit-shaped — warrants its own D-2 master prompt. Noted in CHANGELOG follow-ups section.	2026-04-25 13:52:54 +00:00
shankar0123	1440a30d28	Merge branch 'fix/u3-master-db-coupling-cleanup' (U-3 master + 4 ride-alongs)	2026-04-25 13:29:30 +00:00
shankar0123	a3d8b9c607	fix(deploy,db,handler): close fresh-clone postgres init failure + 4 ride-along audit findings (U-3 master) GitHub #10 reopened: operator mikeakasully cloned v2.0.50 fresh and ran the canonical quickstart (docker compose -f deploy/docker-compose.yml up -d --build); postgres reported unhealthy indefinitely, dependent containers never started. Root cause: deploy/docker-compose.yml mounted a hand-curated subset of migrations/.up.sql + seed.sql into postgres /docker-entrypoint-initdb.d/. Postgres applied them at initdb time. Once seed.sql referenced columns added by migrations after* the mounted cutoff (e.g., policy_rules.severity from migration 000013), initdb crashed mid-seed and the container loop wedged. Two sources of truth (compose mount list vs in-tree migration ladder) diverged the moment a seed-touching migration shipped, and the only thing that fixed it was hand-editing the compose file every release. Fix: remove the dual source. Postgres boots empty; the server applies migrations + seed at startup via RunMigrations + RunSeed. Helm has used this pattern since day one (postgres-init emptyDir); compose now matches. Bundled with four ride-along audit findings whose fixes share the same schema/db code surface, so operators take the schema-change pain only once: cat-u-seed_initdb_schema_drift [P1, primary] — initdb-mount fix cat-o-retry_interval_unit_mismatch [P1] — column rename minutes→seconds cat-o-notification_created_at_dead_field [P2] — add column + populate cat-o-health_check_column_orphans [P1] — drop unwired columns cat-u-no_version_endpoint [P2] — add /api/v1/version Single migration (000017_db_coupling_cleanup) bundles the three schema changes under a DO \$\$ guard so re-application is safe; reduces operator-visible 'schema-change releases' from four to one. Backend - internal/repository/postgres/db.go: add RunSeed (baseline) + RunDemoSeed (gated by CERTCTL_DEMO_SEED). Both idempotent (ON CONFLICT DO NOTHING in every shipped INSERT) so repeated boots are safe; missing-file is no-op so custom packaging that strips seeds still boots cleanly. - cmd/server/main.go: invoke RunSeed (always) + RunDemoSeed (when flag set) immediately after RunMigrations. - internal/repository/postgres/notification.go: NotificationRepository.Create now sets created_at (with time.Now() fallback when caller leaves it zero); scanNotification reads it back; List + ListRetryEligible SELECT extended. - internal/repository/postgres/renewal_policy.go: column references updated to retry_interval_seconds across SELECT/INSERT/UPDATE sites. - internal/api/handler/version.go: new VersionHandler exposes {version, commit, modified, build_time, go_version} from runtime/debug.ReadBuildInfo() with ldflags-supplied Version override. - internal/api/router/router.go: register GET /api/v1/version through the no-auth chain (CORS + ContentType) alongside /health, /ready, /api/v1/auth/info. - cmd/server/main.go: add /api/v1/version to no-auth dispatch + audit ExcludePaths so rollout polling doesn't dominate the audit trail. - internal/config/config.go: add DatabaseConfig.DemoSeed + CERTCTL_DEMO_SEED env var. Migration - migrations/000017_db_coupling_cleanup.up.sql + .down.sql: (1) renewal_policies.retry_interval_minutes → retry_interval_seconds (DO \$\$ guard, idempotent re-application) (2) notification_events ADD COLUMN created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() (3) network_scan_targets DROP orphan health_check_enabled + health_check_interval_seconds - migrations/seed.sql: column reference updated to retry_interval_seconds. - migrations/seed_demo.sql: same column rename + applied at runtime now via RunDemoSeed (no longer initdb-mounted). Compose - deploy/docker-compose.yml: drop ALL initdb mounts (10 migration files + seed.sql); add start_period: 30s to postgres + certctl-server healthchecks to absorb the runtime migration + seed application window on first boot. - deploy/docker-compose.test.yml: same drop (+ ghost seed_test.sql mount removed; that file never existed); same healthcheck start_period. - deploy/docker-compose.demo.yml: replace seed_demo.sql initdb mount with CERTCTL_DEMO_SEED=true env var on certctl-server. Tests - internal/api/handler/version_handler_test.go: TestVersion_ReturnsBuildInfo, TestVersion_RejectsNonGet, TestVersion_LdflagsOverride. - internal/repository/postgres/seed_test.go: TestRunSeed_AppliesIdempotently, TestRunSeed_MissingFileIsNoOp, TestRunDemoSeed_AppliesIdempotently, TestMigration000017_RetryIntervalRename, TestMigration000017_NotificationCreatedAt, TestMigration000017_HealthCheckOrphansDropped (testcontainers, -short skips). - internal/repository/postgres/notification_test.go: TestNotificationRepository_CreatedAt_IsPersisted + TestNotificationRepository_CreatedAt_DefaultsToNow. CI guardrail - .github/workflows/ci.yml: new 'Forbidden migration mount in compose initdb (U-3)' step grep-fails the build if any migrations/.sql or seed.sql re-appears in /docker-entrypoint-initdb.d in any compose file. Catches future drift before a fresh-clone operator hits it. Spec / Docs - api/openapi.yaml: add /api/v1/version operation under Health tag. - docs/architecture.md: replace the 'initdb may run the same SQL' paragraph with a post-U-3 single-source-of-truth explanation. - CHANGELOG.md: full unreleased-section entry covering all 5 closures, breaking changes, and the new env var. Audit doc - coverage-gap-audit-2026-04-24-v5/unified-audit.md: add new P1 #14 cat-u-seed_initdb_schema_drift; flip the 4 ride-along findings to ✅ RESOLVED with closure prose pointing at this commit. Verification: build/vet/test -short -race all clean across all touched packages locally; govulncheck reports 0 vulnerabilities affecting our code; OpenAPI YAML parses; CI U-3 grep guardrail clears against the post-fix tree.	2026-04-25 13:29:23 +00:00
shankar0123	aa6fafdee9	Merge branch 'fix/u2-dockerfile-healthcheck-https' v2.0.50	2026-04-25 12:02:28 +00:00
shankar0123	86fffa305a	fix(deploy,helm,docs): published-image HEALTHCHECK speaks HTTPS + Helm /ready path + docs HTTPS sweep (U-2) Pre-U-2 the published `ghcr.io/shankar0123/certctl-server` image shipped with `HEALTHCHECK CMD curl -f http://localhost:8443/health`. The server has been HTTPS-only since the v2.2 HTTPS-Everywhere milestone (`cmd/server/main.go::ListenAndServeTLS`, no plaintext fallback, TLS 1.3 pinned), so the probe failed on every interval and Docker marked the container `unhealthy` indefinitely. Operators inside docker- compose / Helm / the example stacks were unaffected — compose overrides the HEALTHCHECK with `--cacert + https://`, Helm uses explicit `httpGet` probes that ignore Docker's HEALTHCHECK, and every example compose file overrides with `curl -sfk https://localhost:8443/health`. But anyone running bare `docker run` / Docker Swarm / Nomad / ECS — exactly the "I just pulled the published image" path — saw permanent `unhealthy` status and (depending on orchestrator policy) a restart- loop. (Audit: cat-u-healthcheck_protocol_mismatch in coverage-gap-audit-2026-04-24-v5/unified-audit.md.) Recon for U-2 surfaced two adjacent bugs from the same v2.2 milestone gap, both bundled into this commit because they share the same root cause and the same operator surface: 1. Helm chart `server.readinessProbe.httpGet.path` pointed at `/readyz`, the kube-flavored convention. The certctl server doesn't register `/readyz` (only `/health` and `/ready` are wired and bypass the auth middleware — see internal/api/router/router.go:81 and cmd/server/main.go:920). K8s readiness probes therefore got 401 (api-key auth rejection) or 404 (when auth was disabled), pods stayed `NotReady` indefinitely, and Helm rollouts stalled. 2. The agent image (`Dockerfile.agent`) had no HEALTHCHECK at all, so bare-`docker run` agents got zero health signal. The compose override at `deploy/docker-compose.yml:173` called `pgrep -f certctl-agent` against the agent image, but the agent image didn't ship `procps` — pgrep was missing too. The compose probe was a latent always-fail. We fixed all three with the audit-recommended shape (option (a) — `-k`) plus three structural backstops: Files changed: Phase 1 — Dockerfile fix: - Dockerfile: HEALTHCHECK switched from `curl -f http://localhost:8443/ health` to `curl -fsk https://localhost:8443/health`. `-k` (insecure) is acceptable because the probe is localhost-to-localhost: the same process serving the cert is being probed, no network hop. Pinning `--cacert` is not viable for the published image because the bootstrap cert is per-deploy (generated into the `certs` named volume on first up; operator-supplied via Helm's `existingSecret` or cert-manager). Long-form docblock cross-references the audit closure, the compose vs Helm vs examples coverage matrix, and the CI guardrail. - Dockerfile.agent: added HEALTHCHECK using `pgrep -f certctl-agent` matching the compose pattern. Added `procps` to the runtime apk install — fixes both the new image-level HEALTHCHECK AND the pre-existing compose probe that was silently failing. Phase 2 — Helm readiness probe path: - deploy/helm/certctl/values.yaml: server.readinessProbe.httpGet.path changed from `/readyz` to `/ready`. Liveness probe path (`/health`) was correct and is unchanged. Probes block now carries an explanatory comment naming the registered no-auth probe routes and the U-2 closure rationale. Phase 3 — Image-level integration tests: - deploy/test/healthcheck_test.go (new, //go:build integration): TestPublishedServerImage_HealthcheckSpecUsesHTTPS builds the server image, inspects `Config.Healthcheck.Test` via `docker inspect`, and asserts the array contains `https://localhost:8443/health` and `-k`, and does NOT contain `http://localhost:8443/health` (positive + negative regression contracts). TestPublishedAgentImage_HealthcheckSpecExists builds the agent image and asserts the HEALTHCHECK uses `pgrep` against `certctl-agent`. Both tests `t.Skip` cleanly when docker isn't available (sandbox / CI without docker-in-docker) — verified locally: tests skip with the diagnostic and the suite returns PASS. TestPublishedServerImage_HealthcheckTransitionsToHealthy is a documented `t.Skip` placeholder until the harness wires a sidecar postgres for image-level smoke; the spec-level tests above cover the audit-flagged regression. Phase 4 — CI guardrail: - .github/workflows/ci.yml: new "Forbidden plaintext HEALTHCHECK regression guard (U-2)" step. Scoped patterns catch `HEALTHCHECK.http://` and `curl -f http://localhost:8443/health` in any `Dockerfile`. Comment lines exempt; docs/upgrade-to-tls.md out of scope (the post-cutover invariant string at line 182 is intentionally a documented expected-failure assertion). Verified locally on the real tree (passes) and against synthetic regressions (each fires the guard). Phase 5 — Docs sweep: - docs/connectors.md: 15 stale curl examples updated from `http://localhost:8443/...` to `https://localhost:8443/...` with `--cacert "$CA"` injected on every site. Added a one-time introductory note documenting the `$CA` extraction with `docker compose ... exec ... cat /etc/certctl/tls/ca.crt`, matching the pattern in docs/quickstart.md. Pre-U-2 these examples silently failed against the HTTPS listener. Phase 6 — Release surface: - CHANGELOG.md: appended U-2 section to the existing [unreleased] block (immediately below the G-1 entry). Sections: explanatory blockquote covering all three bugs (primary + 2 adjacent), Fixed, Added, Changed. Verification (all gates pass): - go build ./... — clean - go vet ./... — clean - go vet -tags integration ./deploy/test/ — clean - go test -short ./... — every package green - go test -tags integration -v -run TestPublishedServerImage\|TestPublishedAgentImage ./deploy/test/ — three tests SKIP cleanly with "docker not available" diagnostic - helm lint deploy/helm/certctl/ — clean - helm template smoke render — succeeds; rendered Deployment carries `path: /ready` and zero `/readyz` matches - python3 yaml.safe_load on api/openapi.yaml — parses - govulncheck ./... — no vulnerabilities in our code - CI guardrail mirror: clean on real tree, fires on synthetic regression patterns Out of scope (intentionally untouched): - cmd/server/main.go::ListenAndServeTLS — HTTPS-only is correct, this finding does NOT propose adding back a plaintext listener. - deploy/docker-compose.yml:126 HEALTHCHECK — already correct. - deploy/docker-compose.test.yml HEALTHCHECK blocks — already correct. - All 5 examples/*/docker-compose.yml HEALTHCHECK overrides — already correct (they ALSO use `-fsk https://localhost:8443/health`). - Helm server.livenessProbe.httpGet — already uses `scheme: HTTPS` + `path: /health`, correct. - docs/upgrade-to-tls.md:182 `curl ... http://localhost:8443/health` invariant line — that's the expected-failure assertion for the post-cutover state ("plaintext is gone, expect Connection refused"); intentionally left intact. - Go production code — this is purely a deploy-image / probe / docs / Helm-chart fix. Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md §2 P1 cluster, cat-u-healthcheck_protocol_mismatch Audit recommendation followed verbatim: 'change Dockerfile:80 to CMD curl -kf https://localhost:8443/health'.	2026-04-25 12:02:18 +00:00
shankar0123	e17788355b	Merge branch 'fix/g2-apikey-hash-redaction'	2026-04-25 01:56:34 +00:00
shankar0123	87213128cc	fix(security,domain): redact Agent.APIKeyHash from JSON wire shape (G-2) Pre-G-2 internal/domain/connector.go::Agent::APIKeyHash was tagged `json:"api_key_hash"` and shipped on every wire surface that returned domain.Agent — GET /api/v1/agents (PagedResponse{Data: agents}), GET /api/v1/agents/{id}, GET /api/v1/agents/retired, and the POST /api/v1/agents registration response. Every authenticated client (browser, CLI --json, MCP tool calls) received the SHA-256-of-the-API-key string. The browser silently dropped it because web/src/api/types.ts omits the field, but CLI and MCP consumers print full JSON so the hash was visible there. Even though the value is a hash and not the plaintext key, shipping it gives an attacker an offline brute-force target if the API-key entropy is low (certctl doesn't enforce a minimum on operator- supplied keys), and there's no business reason for any client to ever receive it — the value is server-internal, used only for the lookup at internal/repository/postgres/agent.go::GetByAPIKey. (Audit: cat-s5-apikey_leak in coverage-gap-audit-2026-04-24-v5/unified-audit.md.) We chose the audit's recommended fix (json:"-") plus a defense-in-depth MarshalJSON plus a CI guardrail. Three layers because struct-tag redaction alone is one rebase away from being silently reverted, the custom MarshalJSON catches the case where a parent struct embeds Agent under a different tag, and the CI grep blocks reintroduction at the spec or frontend boundary even without a code review catching it. Files changed: Phase 1 — Domain redaction: - internal/domain/connector.go: APIKeyHash tag flipped from `json:"api_key_hash"` to `json:"-"`. New Agent.MarshalJSON with value receiver + type-alias-recursion-break that explicitly zeroes APIKeyHash on the marshal-time copy. Long-form docblock explaining the G-2 closure rationale + cross-references to service.RegisterAgent (populator), repository.AgentRepository:: GetByAPIKey (consumer), docs/architecture.md (DB-shape vs API-shape distinction), and the audit finding. Phase 2 — Domain tests (5 test functions): - internal/domain/connector_test.go: TestAgent_MarshalJSON_RedactsAPIKeyHash pins the marshal-boundary contract on a value receiver. ...RedactsViaPointer pins the Agent path. ...RedactsInSlice pins the []Agent path that the ListAgents handler actually emits via PagedResponse. ...DoesNotMutateReceiver pins the by-value-receiver contract so a future refactor that switches to pointer-receiver gets caught. ...RoundTrip pins the wire-shape guarantee that APIKeyHash is dropped on encode and cannot reappear on decode. Single sentinel value ("sha256:LEAKED-CREDENTIAL-DERIVATIVE- SENTINEL") flows through every fixture for grep-ability on regression. Phase 3 — Handler tests (4 test functions): - internal/api/handler/agent_handler_test.go: TestListAgents_DoesNotLeakAPIKeyHash, TestGetAgent_DoesNotLeakAPIKeyHash, TestRegisterAgent_DoesNotLeakAPIKeyHash, TestListRetiredAgents_DoesNotLeakAPIKeyHash. Each asserts (a) the literal substring "api_key_hash" is absent from the httptest-captured body, (b) the leak sentinel value is absent, (c) the non-leaked fields ARE present (sanity that the handler is serving real data, not just empty payloads). Shared sentinel "sha256:LEAKED-CREDENTIAL-DERIVATIVE- HANDLER-SENTINEL" so a single grep over a failing test's output identifies the leak surface immediately. Phase 4 — Spec / docs: - api/openapi.yaml: api_key_hash property REMOVED from Agent schema (was at line 3690). Inline G-2 comment naming the closure + the database-vs-API-shape distinction so a future spec edit doesn't silently re-introduce the field. - docs/architecture.md: ER-diagram block already documents the agents table including api_key_hash (DB shape — correct). Added a sibling note paragraph immediately below the diagram explaining that several columns are intentionally server-internal (api_key_hash redaction + issuers.config / deployment_targets.config encrypted shadow), with cross-references to the redaction enforcement site, the OpenAPI schema, the frontend interface, and the CI guardrail. - web/src/api/types.ts: Agent interface unchanged in shape (already omitted the field) but added a leading comment block explaining WHY the omission is intentional — stops a future frontend dev from "completing" the interface from the OpenAPI spec or the Go struct. Phase 5 — CI guardrail: - .github/workflows/ci.yml: new "Forbidden api_key_hash JSON-shape regression guard (G-2)" step. Scoped patterns catch the actual regression shapes — Go struct tag (json:"api_key_hash"), frontend interface declaration, OpenAPI schema property, YAML enum/array membership. Repository / migration / seed / service / integration / unit-test / comment lines exempt. Verified locally on the real tree (passes) and against 4 synthetic regression patterns (each fires the guardrail). Mirrors the G-1 pattern from .github/workflows/ ci.yml lines 47-108. Phase 5b — Sweep verification (no changes, results documented for the next reader): - internal/api/middleware/audit.go: doesn't serialize Agent struct; records request body only. No leak. - service.RegisterAgent audit-event payload: `map[string]interface{}{ "name": name, "hostname": hostname}` — name + hostname only, no APIKeyHash. No leak. - All 9 slog sites that mention agent: scalar attrs only ("agent_id", "error", "agent_hostname"), never the full struct. No leak. - internal/mcp, internal/cli, cmd/cli, cmd/mcp-server: zero matches for APIKeyHash / api_key_hash. Both pass server JSON verbatim, so the wire-side fix transitively closes them. Verification (all gates pass): - go build ./... - go vet ./... - go test -short ./... — every package green - go test -short -race ./internal/domain/... ./internal/api/handler/... — clean - govulncheck ./... — no vulnerabilities in our code - helm lint deploy/helm/certctl/ — clean - helm template smoke render — succeeds - python3 yaml.safe_load on api/openapi.yaml — parses - OpenAPI Agent schema scan: no api_key_hash property - CI guardrail mirror: clean on real tree, fires on all 4 synthetic regression patterns - Domain pkg coverage: Agent.MarshalJSON 100%, connector.go total 87.5% - Handler pkg coverage: 79.2% Sample response body (httptest captured during verification, GET /api/v1/agents/{id} via the new handler test): {"id":"agent-demo","name":"demo-agent","hostname":"demo.host", "status":"Online","last_heartbeat_at":"2026-04-24T11:59:30Z", "registered_at":"2026-04-24T12:00:00Z","os":"linux", "architecture":"amd64","ip_address":"10.0.0.42", "version":"v2.0.49"} Note the absence of any api_key_hash key, even though the in-memory struct passed to the handler had APIKeyHash set to a sentinel. Out of scope (intentionally untouched): - internal/repository/postgres/agent.go SELECT/INSERT/UPDATE/scan paths and GetByAPIKey lookup — DB column stays, repo still populates the struct, auth lookup still works. The redaction is a marshal-boundary concern. - migrations/000001_initial_schema.up.sql + migrations/seed_.sql — DB schema and seed data unchanged. - internal/service/agent.go::RegisterAgent — service-side hashing and persistence unchanged. - Other domain types with potential credential-derivative fields (Issuer.Config, DeploymentTarget.Config, notifier configs). Not flagged by the audit; some are already protected (e.g., DeploymentTarget.EncryptedConfig []byte `json:"-"`). File a separate audit pass if recon surfaces additional leaks. - Per-resource DTO layer across every handler. Single audit finding, single domain type. - A separate possible follow-up: the v2 RegisterAgent endpoint doesn't return the plaintext API key to the agent, which may mean self-bootstrap via POST /api/v1/agents is broken. Verified during recon; out of scope for G-2; should be its own ticket. Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md §2 P1 cluster, cat-s5-apikey_leak Audit recommendation: 'json:"-" or API-response DTO excluding APIKeyHash' — went with the json:"-" + MarshalJSON defense-in-depth pair plus CI guardrail and structural docs.	2026-04-25 01:56:26 +00:00
shankar0123	697fa792ea	Merge branch 'fix/g1-jwt-silent-auth-downgrade-removal'	2026-04-25 00:22:33 +00:00
shankar0123	9c1d446e40	fix(security,config): remove unimplemented JWT auth-type, close silent downgrade (G-1) The pre-G-1 config validator accepted CERTCTL_AUTH_TYPE=jwt and the startup log faithfully echoed 'authentication enabled type=jwt'. Reasonable people read that and concluded JWT auth was on. It wasn't. The auth-middleware wiring at cmd/server/main.go unconditionally routed every request through the api-key bearer middleware regardless of cfg.Auth.Type. So CERTCTL_AUTH_TYPE=jwt quietly compared the incoming 'Authorization: Bearer <token>' against whatever string the operator put in CERTCTL_AUTH_SECRET — real JWT clients got 401, and operators who treated CERTCTL_AUTH_SECRET as a signing secret (because they thought they were configuring JWT) had effectively handed an attacker an api-key. A security finding masquerading as a config option. We chose the audit-recommended structural fix: remove the option, fail fast at startup, and add the gateway-fronting pattern as the documented forward path. Implementing JWT middleware would have meant jwks vs static-secret rotation, claim mapping, expiry enforcement, audience and issuer validation, key rollover semantics, and regression coverage at the same depth as the existing api-key path — a feature, not a fix. Operators who genuinely need JWT/OIDC front certctl with an authenticating gateway (oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium / Authelia) and run the upstream certctl with CERTCTL_AUTH_TYPE=none. Same shape works on docker-compose and Helm. The change is comprehensive across 7 phases — every surface that mentioned 'jwt' as a certctl-auth-type is updated, plus structural backstops (typed enum, runtime guard, helm template validation, CI grep guard) so the lie can't reappear. Files changed: Phase 1 — production code (typed enum + jwt removal): - internal/config/config.go: AuthType typed alias + AuthTypeAPIKey / AuthTypeNone constants + ValidAuthTypes() helper. Validate() routes literal 'jwt' through a dedicated multi-line diagnostic naming the authenticating-gateway pattern, then cross-checks against ValidAuthTypes(). Secret-required branch simplified to api-key-only. Field comment on AuthConfig.Type rewritten to drop jwt and point at the gateway pattern. - internal/api/middleware/middleware.go: AuthConfig.Type field comment references the typed config.AuthType constants. - internal/api/handler/health.go: same treatment for HealthHandler.AuthType. - cmd/server/main.go: defense-in-depth runtime switch immediately after config.Load() — exits 1 on any unsupported auth-type that bypassed the validator. Auth-disabled startup log explicitly names the authenticating-gateway pattern. Phase 2 — tests (Red→Green, contract pinning): - internal/config/config_test.go: TestValidate_JWTAuth_RejectedDedicated (two table rows pinning the dedicated G-1 error fires regardless of whether Secret is set), TestValidAuthTypesDoesNotContainJWT (property guard against future re-introduction), TestValidAuthTypesIsExactly_APIKey_None (allowed-set contract), TestValidate_GenericInvalidAuthType (pins non-jwt invalid values still hit the generic invalid-auth-type error). Removed the prior TestValidate_JWTAuth_MissingSecret happy-path since its premise is inverted post-G-1. - internal/api/handler/health_test.go: removed TestAuthInfo_ReturnsAuthType_JWT (which baked the silent-downgrade lie into the regression suite). Pre-existing _APIKey test continues to cover the api-key happy path. Phase 3 — spec, docs, env templates: - api/openapi.yaml: auth_type enum dropped to [api-key, none] with inline comment naming the G-1 closure. - .env.example (root): CERTCTL_AUTH_TYPE comment block rewritten to drop jwt and point at the gateway pattern; secret-required conditional simplified to api-key-only. - docs/architecture.md: middleware-stack bullet rewritten to drop the JWT mention; new H3 'Authenticating-gateway pattern (JWT, OIDC, mTLS)' section explaining the design rationale and listing oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium / Authelia / Caddy forward_auth / Apache mod_auth_openidc / nginx auth_request as the standard fronting options. - docs/upgrade-to-v2-jwt-removal.md (new ~125 lines): migration guide with preconditions, what-changes, both recovery paths, complete docker-compose oauth2-proxy walkthrough, Traefik ForwardAuth and Envoy ext_authz patterns, rollback posture. Phase 4 — Helm chart (template validation + docs): - deploy/helm/certctl/templates/_helpers.tpl: new certctl.validateAuthType helper mirroring the existing certctl.tls.required pattern. Fails template render on any server.auth.type outside {api-key, none} with a multi-line diagnostic. - deploy/helm/certctl/templates/server-deployment.yaml, server-configmap.yaml, server-secret.yaml: invoke the helper at the top of each template that depends on .Values.server.auth.type. - deploy/helm/certctl/values.yaml: auth: block comment expanded with the G-1 rationale and gateway-pattern cross-reference. - deploy/helm/CHART_SUMMARY.md: server.auth.type table row now surfaces the allowed set and points at the upgrade doc. - deploy/helm/certctl/README.md: new 'JWT / OIDC via authenticating gateway' section with a Kubernetes-flavored oauth2-proxy + certctl walkthrough. Phase 5 — release surface: - CHANGELOG.md: new [unreleased] top entry with Breaking / Removed / Added / Changed sections; explicit pointer at docs/upgrade-to-v2-jwt-removal.md from the Breaking subsection. Phase 6 — CI guardrail: - .github/workflows/ci.yml: new 'Forbidden auth-type literal regression guard (G-1)' step. Scoped patterns catch the actual regression shapes (map literal, slice literal, switch case, OpenAPI enum, env-file default, AuthType('jwt') cast). Comments and the dedicated rejection branch are intentionally exempt; connector-package JWT references (Google OAuth2 / step-ca) are exempt as out-of-scope external protocols. Verified locally: the guard passes on the actual tree and fires on all 4 synthetic regression patterns. Out of scope (explicitly untouched): - internal/connector/discovery/gcpsm/gcpsm.go — Google OAuth2 service- account JWT (external protocol). - internal/connector/issuer/googlecas/googlecas.go — same. - internal/connector/issuer/stepca/stepca.go — step-ca's provisioner one-time-token JWT for /sign API. - docs/test-env.md, docs/connectors.md, docs/features.md — describe external CAs' use of JWT, not certctl's auth shape. - Implementing actual JWT middleware. Feature, not a fix. Verification (all gates pass): - go build ./... — clean - go vet ./... — clean - go test -short ./... — every package green - go test -short -race ./internal/config/... ./internal/api/... — clean - govulncheck ./... — no vulnerabilities in our code - helm lint deploy/helm/certctl/ — clean - helm template with auth.type=api-key — renders OK - helm template with auth.type=none — renders OK - helm template with auth.type=jwt — fails with validateAuthType diagnostic (exit 1) - python3 yaml.safe_load on api/openapi.yaml — parses - CI guardrail mirror — clean on real tree, fires on all 4 synthetic regression patterns - Smoke test: 'CERTCTL_AUTH_TYPE=jwt ./certctl-server' exits non-zero with: 'Failed to load configuration: CERTCTL_AUTH_TYPE=jwt is no longer accepted (G-1 silent auth downgrade): no JWT middleware ships with certctl. To use JWT/OIDC, run an authenticating gateway (oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium) in front of certctl and set CERTCTL_AUTH_TYPE=none on the upstream. See docs/architecture.md "Authenticating-gateway pattern" and docs/upgrade-to-v2-jwt-removal.md for the migration walkthrough' config pkg coverage: ValidAuthTypes 100%, Validate 94.7%, total 75.5%. Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md §2 P1 cluster, cat-g-jwt_silent_auth_downgrade Audit recommendation followed verbatim: 'Remove jwt from validAuthTypes until middleware ships'.	2026-04-25 00:22:23 +00:00
shankar0123	3192cd15c5	Merge branch 'fix/u1-followups-helm-rootenv-examples'	2026-04-24 23:51:18 +00:00

1 2 3 4 5 ...

399 Commits