certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 23:31:39 +00:00

Author	SHA1	Message	Date
shankar0123	52b86a08f4	Bundle K (Coverage Audit Closure): MCP per-tool coverage — C-002 closed internal/mcp line coverage 28.0% -> 93.1% (+65.1pp; +8.1 above target) via internal/mcp/tools_per_tool_test.go (~580 LoC, 4 top-level + 174 sub-tests). Strategy: gomcp.NewInMemoryTransports() wires an in-process client + server pair; RegisterTools(server, client) is invoked against a mock certctl API; every one of 87 registered tools is dispatched via clientSession.CallTool. This is the first test in the package that exercises the closure bodies inside registerTools — existing tests (tools_test.go, injection_regression_test.go, fence_guardrail_test.go, retire_agent_test.go) tested the wrapper + HTTP client in isolation. Tests: TestMCP_AllTools_HappyPath: 87 sub-tests, mock 'ok' mode, asserts response fence end-to-end. TestMCP_AllTools_ErrorPath: 87 sub-tests, mock '5xx' mode, asserts MCP_ERROR fence. TestMCP_FenceInjectionResistance: 50 dispatches; asserts per-call nonce uniqueness (security property). TestMCP_FenceWithPlantedEndMarker: planted attacker nonce does not collide with real RNG nonce. TestMCP_RegisterTools_DispatchableToolCount: tool-inventory check (87 registered == 87 covered). Per-registerTools coverage: registerCertificateTools: 11.2% -> 84.1% registerCRLOCSPTools: 20.0% -> 100.0% registerIssuerTools: 20.0% -> 100.0% registerTargetTools: 20.0% -> 100.0% registerAgentTools: 13.5% -> 86.5% registerJobTools: 15.2% -> 90.9% registerPolicyTools: 19.4% -> 100.0% registerProfileTools: 20.0% -> 100.0% registerTeamTools: 20.0% -> 100.0% registerOwnerTools: 20.0% -> 100.0% registerAgentGroupTools: 20.0% -> 100.0% registerAuditTools: 20.0% -> 100.0% registerNotificationTools: 17.4% -> 95.7% registerStatsTools: 14.7% -> 91.2% registerDigestTools: 20.0% -> 100.0% registerMetricsTools: 20.0% -> 100.0% registerHealthTools: 19.4% -> 100.0% Binary-blob tools (certctl_get_der_crl, certctl_ocsp_check) bypass textResult by design — they return human-readable summaries instead of fenced JSON. Matches the existing fence_guardrail_test.go allowlist. Verification: go vet ./internal/mcp/... clean gofmt -l internal/mcp/ clean staticcheck -checks all clean (only pre-existing S1009 + ST1000 hits in master remain) go test -short -cover 93.1% coverage go test -race -count=1 PASS, 0 races Audit deliverables: findings.yaml: C-002 status open -> closed gap-backlog.md: closure log + C-002 strikethrough coverage-matrix.md: MCP row at 93.1% closure-plan.md: Bundle K [x] closed CHANGELOG.md: [unreleased] Bundle K entry	2026-04-27 16:47:38 +00:00
shankar0123	23411bd6fc	fix(bundle-3): MCP Trust-Boundary Fencing — 5 audit findings closed Closes Audit-2026-04-25 H-002, H-003, M-003, M-004, M-005 (all CWE-1039 LLM Prompt Injection at the MCP↔consumer trust boundary, TB-7). Strategy: wrapper-layer fencing. All 87 MCP tools route their success path through textResult and their failure path through errorResult. By fencing at those two wrappers we cover every existing tool AND every future tool with a single change — no per-tool wiring required. What changed - internal/mcp/fence.go (new) — FenceUntrusted helper with strategy doc + per-finding rationale. Both fenceMCPResponse and fenceMCPError use it internally. - internal/mcp/tools.go — textResult wraps response body via fenceMCPResponse; errorResult wraps error string via fenceMCPError. - internal/mcp/tools_test.go — TestTextResult / TestErrorResult updated to assert fenced shape (start marker + end marker + inner body). - internal/mcp/injection_regression_test.go (new) — 5 regression test functions, one per audit finding, each replays 5 classic LLM injection payloads (instruction_override, system_role_spoofing, delimiter_break_attempt, markdown_link_phishing, data_exfil_via_url) and asserts the planted payload appears VERBATIM (preservation, operator visibility) INSIDE the fence boundaries. - internal/mcp/fence_guardrail_test.go (new) — CI guardrail that walks every non-test .go file in the mcp package and fails if it finds a bare gomcp.CallToolResult literal outside tools.go. Prevents future tools from silently bypassing the fence. Delimiter-forgery defense The naive constant fence (--- UNTRUSTED MCP_RESPONSE END ---) is forgeable: an attacker who controls a field value can plant the literal end marker and "break out" of the fence. Defense: every fence call generates a 6-byte crypto/rand nonce, hex-encoded, and embeds it in BOTH the START and END markers. An attacker would need to predict the nonce (2^48 search per fence) to forge a matching END inside the payload. The delimiter_break_attempt regression test exercises this. Per-finding mapping - H-002 Cert Subject DN injection (CSR submitter controlled) → TestMCP_PromptInjection_H002_CertSubjectDN - H-003 Discovered cert metadata injection (cert owner controlled) → TestMCP_PromptInjection_H003_DiscoveredCertMetadata - M-003 Agent heartbeat injection (agent self-reports hostname/OS/IP) → TestMCP_PromptInjection_M003_AgentHeartbeat - M-004 Upstream CA error injection (CA controls error string) → TestMCP_PromptInjection_M004_UpstreamCAError - M-005 Audit details + notification body injection (downstream actors control these) → TestMCP_PromptInjection_M005_AuditDetailsAndNotifications Verification gates - go vet ./... → clean - go build ./... → clean - go test -short -count=1 ./... → all packages pass - go test -count=1 ./internal/mcp/... → all packages pass - npx tsc --noEmit (web) → clean - npx vitest run (web) → 337 passed - python3 yaml.safe_load(api/openapi.yaml) → 89 paths, 56 schemas Threat-model placement: TB-7 (MCP↔LLM consumer). certctl owns the boundary; consumer-side prompt engineering is recommended but not relied upon. Defense-in-depth: per-call nonce closes the delimiter-forgery edge case that constant fences would have left exposed. Bundle 3 of the 2026-04-25 comprehensive audit (88 findings).	2026-04-25 22:44:33 +00:00
shankar0123	25c34ace45	feat(mcp): add claim_discovered + dismiss_discovered MCP tools (I-2 closure) Closes the LAST P1 in the 2026-04-24 audit (cat-i-b0924b6675f8). Pre-I-2 the README claimed "all API endpoints are exposed via MCP" but the discovered-certificate lifecycle (HTTP handlers ClaimDiscovered + DismissDiscovered at internal/api/handler/discovery.go:125,162) had zero MCP tool wrappers — operators using Claude / Cursor / similar MCP clients had no path to bring an out-of-band cert under management or to mark a benign discovery as not-of-interest without dropping to the REST API directly. The audit's count of 0 MCP discovery tools was correct: `grep -niE 'discover\|claim\|dismiss' internal/mcp/tools.go` returned only the pre-existing agent-retire tool's description text mentioning sentinel discovery agents — no actual discovery-tool registrations. Added in internal/mcp/types.go: - ClaimDiscoveredCertificateInput (id + managed_certificate_id) - DismissDiscoveredCertificateInput (id) Both follow the existing Go-doc / staticcheck convention (lead with the type name + brief; closure-rationale prose follows). Pinned by the existing L-1 staticcheck-fix lesson. Added in internal/mcp/tools.go (slotted at end of file, after certctl_auth_check): - certctl_claim_discovered_certificate — POST /api/v1/discovered-certificates/{id}/claim - certctl_dismiss_discovered_certificate — POST /api/v1/discovered-certificates/{id}/dismiss Both wrap the existing HTTP handlers via the generic c.Post helper. No backend changes; no openapi.yaml changes (both ops were already in the spec from earlier work). The audit's third name "acknowledge" is NOT closed: at recon, no notification-acknowledge HTTP handler exists in the API surface (grep across internal/api/handler/ returned zero hits for "acknowledge"). The audit appears to have mis-quoted; "acknowledge" isn't a real backend endpoint to wrap. If a future feature adds notification acknowledgement, register it in the same shape. Verification: - go build ./... — clean - go vet ./internal/mcp/... — clean - go test ./internal/mcp/... -count=1 — pass - golangci-lint v2.11.4 run ./... — 0 issues - MCP tool count went from 85 → 87 (verify via `grep -cE 'gomcp\.AddTool\(' internal/mcp/tools.go`) - S-1 + G-3 + D-1 + D-2 + B-1 + L-1 CI guardrails all still pass Audit findings closed: - cat-i-b0924b6675f8 (P1, MCP discovery completeness — last P1 in audit) This brings the audit to ZERO REMAINING P1s. Deferred follow-ups: - Notification acknowledge MCP tool — add when a notification-ack HTTP handler exists. Currently no such handler exists in the API surface; treat as a separate feature, not an MCP gap.	2026-04-25 16:33:56 +00:00
shankar0123	2edac7e78b	fix(mcp): close staticcheck ST1021 on BulkRenew/BulkReassign input docstrings CI on the B-1 merge (`b8a4318`) failed at the golangci-lint step on two ST1021 errors against internal/mcp/types.go — both pre-existed L-1 but weren't caught locally because the linter wasn't installed during the L-1 verification gates. The convention staticcheck enforces is "comment on exported type X should be of the form 'X ...'" — i.e. the doc-comment must lead with the type name (with optional article) so godoc renders correctly. Before: // L-1 master closure (cat-l-fa0c1ac07ab5): bulk-renew MCP tool input. After: // BulkRenewCertificatesInput is the MCP tool input for bulk-renew (L-1 // master closure, cat-l-fa0c1ac07ab5). Mirrors BulkRevokeCertificatesInput // field-for-field minus Reason. Same shape applied to BulkReassignCertificatesInput. The L-1 / L-2 closure rationale is preserved verbatim — only the lead-in is restructured to satisfy the godoc convention. Verification: - golangci-lint v2.11.4 (matching CI) installed locally at /dev/shm/bin - golangci-lint run ./... --timeout 5m → 0 issues - internal/mcp/... package targeted lint → 0 issues This unblocks the B-1 CI run on master. No behavioral change; doc-only edit.	2026-04-25 15:48:39 +00:00
shankar0123	f0865bb051	fix(api,web,mcp): add bulk-renew + bulk-reassign endpoints, drop client-side N×HTTP loops (L-1 master) Two audit findings, both category cat-l, both rooted in web/src/pages/CertificatesPage.tsx. Pre-L-1 the GUI looped per-cert HTTP calls — 100 selected certs = 100 sequential round-trips × ~50–200 ms each = a 5–20-second wedge during which the operator stared at a progress bar. Post-L-1 each workflow is a single POST. cat-l-fa0c1ac07ab5 [P1, primary] — bulk renew loop handleBulkRenewal: for/await triggerRenewal(id) cat-l-8a1fb258a38a [P2] — bulk reassign loop handleReassign: for/await updateCertificate(id, {owner_id}) The bulk-revoke endpoint (POST /api/v1/certificates/bulk-revoke + BulkRevocationCriteria/Result) already existed as the canonical shape in v2.0.x — L-1 ports that pattern to renew + reassign with per-action twists. Backend (Go) - internal/domain/bulk_renewal.go: BulkRenewalCriteria mirrors BulkRevocationCriteria (criteria + IDs modes); BulkRenewalResult envelope adds EnqueuedJobs[] for per-cert {certificate_id, job_id}; shared BulkOperationError type for all bulk paths. - internal/domain/bulk_reassignment.go: narrower shape — IDs-only, owner_id required, team_id optional. - internal/service/bulk_renewal.go::BulkRenewalService.BulkRenew: resolves criteria → status filter (Archived/Revoked/Expired/ RenewalInProgress all silent-skip) → per-cert status flip + job create. Keygen-mode-aware so jobs land in the same initial status as single-cert TriggerRenewal. Single bulk audit event per call, not N. - internal/service/bulk_reassignment.go::BulkReassignmentService. BulkReassign: validates owner_id upfront via the ErrBulkReassignOwnerNotFound typed sentinel — non-existent owner returns 400 before any cert is touched. Already-owned-by-target is silent-skip. Single bulk audit event. - internal/api/handler/{bulk_renewal,bulk_reassignment}.go: HTTP shape mirrors bulk_revocation.go. NOT admin-gated (renew is non- destructive; reassign is a common-case workflow). Sentinel-error → 400 mapping for OwnerNotFound. - internal/api/router/router.go: three bulk-* routes registered as a block before the {id} routes. HandlerRegistry gains BulkRenewal + BulkReassignment fields. - cmd/server/main.go: NewBulkRenewalService threads cfg.Keygen.Mode so bulk-renew jobs land in same initial state as single-cert path. Frontend - web/src/api/client.ts: bulkRenewCertificates(criteria) + bulkReassignCertificates(request) functions with full TS types. - web/src/pages/CertificatesPage.tsx: handleBulkRenewal + handleReassign rewritten from N-call loops to single calls. Result envelope drives progress UI; first-error message surfaced when total_failed > 0. Stale triggerRenewal + updateCertificate imports removed. MCP - internal/mcp/types.go: BulkRenewCertificatesInput + BulkReassignCertificatesInput. - internal/mcp/tools.go: certctl_bulk_renew_certificates + certctl_bulk_reassign_certificates tools mirroring the existing certctl_bulk_revoke_certificates pattern. OpenAPI - api/openapi.yaml: two new operations (bulkRenewCertificates, bulkReassignCertificates) under Certificates tag. Four new schemas (BulkRenewRequest, BulkRenewResult, BulkEnqueuedJob, BulkReassignRequest, BulkReassignResult). Tests - Domain: BulkRenewalCriteria.IsEmpty + BulkReassignmentRequest.IsEmpty IsEmpty contracts; JSON round-trip shape pinning. - Service: 7 BulkRenew tests (happy/criteria-mode/skips-RenewalInProgress/ skips-revoked-archived/empty-criteria-error/partial-failure/ audit-event-emitted) + 8 BulkReassign tests (happy/skips-already- owned/owner-required/empty-IDs/owner-not-found-sentinel/team-id- optional/team-id-provided/partial-failure/audit-event-emitted). - Handler: 5 BulkRenew handler tests (happy/empty-body-400/wrong- method-405/actor-attribution/service-error-500) + 6 BulkReassign handler tests (happy/empty-IDs-400/missing-owner-400/owner-not- found-400-via-sentinel/wrong-method-405/generic-error-500). CI guardrail - .github/workflows/ci.yml: 'Forbidden client-side bulk-action loop regression guard (L-1)'. Greps web/src/pages/CertificatesPage.tsx for 'for(...) await triggerRenewal(...)' and 'for(...) await updateCertificate(...)' patterns; comment lines exempt; test files exempt. Verified locally (passes against post-fix tree, fires against synthetic regression). Counts (deltas) - Routes: 119 → 121 (+2) - OpenAPI operations: 123 → 125 (+2) - MCP tools: 83 → 85 (+2) Performance - 100-cert bulk-renew: ~10s of sequential HTTP → ~100ms (99% latency reduction on the canonical operator workflow). - Audit event volume: 1 + N per operation → 1. Out of scope (deferred follow-ups) - cat-b-31ceb6aaa9f1: updateOwner/updateTeam/updateAgentGroup orphan (different shape — wire existing PUT to GUI, not new bulk endpoint). - cat-k-e85d1099b2d7: CertificatesPage no pagination UI. - cat-i-b0924b6675f8: MCP missing claim/dismiss/acknowledge (L-1 added 2 new tools but does not close that finding). Verification - go build / vet / test -short / test -short -race all clean. - web tsc --noEmit + vitest run all clean (296 tests passing). - OpenAPI YAML parses (89 paths, 125 ops). - L-1 CI guardrail passes against post-fix tree, fires against synthetic regression. No push.	2026-04-25 14:33:02 +00:00
shankar0123	52248be717	v2.0.47: HTTPS Everywhere — TLS-only control plane, agents/CLI/MCP Breaking change release. Plaintext HTTP listener removed. The certctl control plane now terminates TLS 1.3 on :8443 via http.Server.ListenAndServeTLS. No CERTCTL_TLS_ENABLED=false escape hatch. No dual-listener mode. One-step cutover per docs/upgrade-to-tls.md. Server - cmd/server/tls.go: certHolder with SIGHUP hot-reload + atomic cert swap, buildServerTLSConfig (TLS 1.3 min, GetCertificate callback), preflightServerTLS validation - cmd/server/main.go: ListenAndServeTLS in place of ListenAndServe, watchSIGHUP wiring, cert/key path config threading - tls_test.go: 418-line regression coverage of reload, preflight, callback behavior, SAN validation Config - CERTCTL_TLS_CERT_PATH / CERTCTL_TLS_KEY_PATH (required) - Plaintext rejection: agents/CLI/MCP pre-flight-fail on http:// URLs with a pointer to docs/upgrade-to-tls.md Agents, CLI, MCP - All three pre-flight-reject http:// URLs with fail-loud diagnostic - CERTCTL_SERVER_CA_BUNDLE_PATH for private-CA trust - CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY for dev-only bypass (loud warning on startup) - install-agent.sh emits both vars as commented template lines docker-compose - certctl-tls-init sidecar generates SAN-valid self-signed cert into deploy/test/certs/ on first boot - All demo-stack curls pin against ca.crt with --cacert Helm chart - Three TLS provisioning modes, exactly one required: - server.tls.existingSecret (operator-supplied) - server.tls.certManager.enabled (cert-manager integration) - server.tls.selfSigned.enabled (eval only — not for production) - server-certificate.yaml template for cert-manager mode - helm install without a TLS source fails at template render with a pointer to docs/tls.md CI - .github/workflows/ci.yml Helm Chart Validation step renders the chart in both existingSecret and cert-manager modes, plus an inverse guard-regression test that asserts helm template MUST refuse to render when no TLS source is configured. Previously the single `helm template` invocation hit the certctl.tls.required fail-loud guard and exit-1'd CI. Four invocations now: lint (existingSecret), template (existingSecret), template (cert-manager), template (no args — must fail). Integration tests - deploy/test/integration_test.go stands up the Compose stack over HTTPS, extracts the CA bundle, and exercises every certctl API over https://localhost:8443 - All 34 integration subtests green (per Phase 8 local CI-parity) Documentation - New: docs/tls.md (provisioning patterns, rotation, SIGHUP reload) - New: docs/upgrade-to-tls.md (one-step cutover, no-downgrade warnings, fleet-roll sequencing) - CHANGELOG.md: v2.2.0 "HTTPS Everywhere — The Irony" entry (file heading unchanged; release tag is v2.0.47) - All curls in docs/, examples/, deploy/helm/ guides use https://localhost:8443 --cacert Verification - grep -rn "ListenAndServe[^T]" cmd/ internal/ → 0 hits - grep -rn "\"http://" cmd/ internal/ → 2 benign hits (Caddy admin API default, SSRF doc comment) — zero certctl endpoints - Tasks #197–#206 (Phases 0–8) all closed in the tracker Files: 65 changed, 3489 insertions, 372 deletions (pre-CI-fix).	2026-04-20 03:43:10 +00:00
shankar0123	675b87ba63	I-005: notification retry loop + dead-letter queue Critical alerts can no longer be silently dropped by a transient notifier failure. Failed notification attempts now ride an exponential backoff retry loop, with a 5-attempt budget before promotion to the dead-letter queue for operator intervention. Schema (migration 000016, idempotent): - retry_count INTEGER NOT NULL DEFAULT 0 - next_retry_at TIMESTAMPTZ - last_error TEXT - idx_notification_events_retry_sweep partial index (next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL Dead rows clear next_retry_at so the index stops matching them. Service contract: - NotificationService.RetryFailedNotifications drives 2^n-minute exponential backoff capped at 1h (notifRetryBackoffCap) with 5-attempt budget (notifRetryMaxAttempts). - Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to status='dead' via MarkAsDead. - Non-terminal failures record via RecordFailedAttempt. - Success path promotes to 'sent' without touching retry_count (audit preserves "delivered on attempt N"). - Missing-notifier branch defensively promotes to 'sent' to avoid wedging a row on a deleted channel. - RequeueNotification operator escape hatch atomically resets retry_count -> 0, next_retry_at -> NULL, last_error -> NULL, status -> pending via notifRepo.Requeue. Scheduler: - New always-on notificationRetryLoop wired into the base loop set at CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m). - sync/atomic.Bool idempotency guard. - sync.WaitGroup shutdown drain via WaitForCompletion. StatsService: - SetNotifRepo setter pattern preserves 9 pre-existing NewStatsService call sites (main.go + stats_test.go + 8 digest tests) without touching the constructor signature. - DashboardSummary.NotificationsDead populated via notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired (reports zero on systems without a notification repository). - CountByStatus error is non-fatal (dashboard summary is best-effort for this field). - Prometheus certctl_notification_dead_total counter emitted from the same snapshot. Handler: - New POST /api/v1/notifications/{id}/requeue endpoint. - dead status surfaces to MCP + CLI. Frontend: - NotificationsPage gains two-tab toolbar ("All" / "Dead letter") with queryKey: ['notifications', activeTab] so switching tabs doesn't serve stale data until the 30s refetch. - Dead rows surface "Retry {n}/5" + truncated last_error with full-text title tooltip. - Requeue mutation wrapped as mutationFn: (id: string) => requeueNotification(id) to prevent react-query v5's positional context argument from leaking into the API client — pinned against future refactors by strict-match toHaveBeenCalledWith('notif-dead-001') in NotificationsPage.test.tsx:181. Closes I-005.	2026-04-19 15:17:27 +00:00
shankar0123	0725713e19	Close I-004 (agent hard-delete cascades targets) coverage-gap finding Operator decision answered as full soft-delete with optional forced cascade — hard-delete is not reachable from any public surface. Prior to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents` whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id silently wiped every target, orphaning certs and aborting in-flight jobs. The finding closure reshapes the agent-removal contract around soft retirement with explicit preflight counts, an opt-in cascade gated by a mandatory reason, and unconditional protection for the four reserved sentinel agents used by discovery sources. Schema — migration 000015: migrations/000015_agent_retire.up.sql flips deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE RESTRICT, so a stray `DELETE FROM agents` now errors at the DB boundary instead of quietly destroying targets. Both `agents` and `deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason TEXT pair (TEXT not VARCHAR so operator comments are never truncated), indexed via partial indexes WHERE retired_at IS NOT NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT EXISTS) so repeated runs against partially-migrated databases converge. migrations/000015_agent_retire.down.sql restores CASCADE and drops the new columns for clean rollback. A dedicated repository-layer testcontainers test (internal/repository/postgres/migration_000015_test.go) asserts the before/after FK action, column presence, index presence, and round-trip idempotency under up→down→up. Domain — sentinel guard + dependency counts: internal/domain/connector.go gains IsRetired() on Agent, the exported SentinelAgentIDs slice listing server-scanner, cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the four reserved IDs documented in CLAUDE.md and created at startup in cmd/server/main.go), IsSentinelAgent(id string) predicate, AgentDependencyCounts{ActiveTargets, ActiveCertificates, PendingJobs} with a HasDependencies() method, and ActorTypeAgent / ActorTypeSystem enum values used by audit emission downstream. Coverage locked down by internal/domain/connector_test.go. Service — 8-step ordered contract: internal/service/agent_retire.go:RetireAgent(ctx, id, actor, opts{Force, Reason}) enforces a fixed execution order: (1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel unconditionally; force=true does NOT bypass it. (2) fetch — ErrAgentNotFound on miss. (3) idempotency — if IsRetired() already, return AgentRetirementResult{AlreadyRetired: true} with no new audit event and no state change (safe to replay from flaky clients). (4) preflight counts — collectAgentDependencyCounts runs ActiveTargets, ActiveCertificates, PendingJobs sequentially (not in parallel; keeps the per-query timeout predictable and matches the repo's existing call-chain shape). (5) force-reason guard — opts.Force=true with empty Reason returns ErrForceReasonRequired (wired into the 400 status surface). (6) dependency guard — HasDependencies() with opts.Force=false returns BlockedByDependenciesError{Counts} (wired into the 409 body with per-bucket counts). (7) mutation — single pinned retiredAt := time.Now(); agent retirement first, then cascade target retirement if opts.Force, all under the repo's single transaction so the two retired_at stamps match to the second. (8) best-effort audit — agent_retired always; agent_retirement_ cascaded additionally on the force path. Actor is whatever the handler resolves from the request; actor type is mapped by resolveActorType (system/agent-prefix→Agent/else→User). Audit emission failures are logged via slog.Error but do not abort the retirement (matches the house convention used by every other scheduler-emitted event). BlockedByDependenciesError implements Error() as "active_targets=%d, active_certificates=%d, pending_jobs=%d" and Unwrap() → ErrBlockedByDependencies. The single struct satisfies errors.Is via Unwrap (used by scheduler-level tests) and errors.As via the concrete type (used by the handler to fish out Counts for the 409 body). ListRetiredAgents(page, perPage) adds a separate paginated accessor with page<1→1 and perPage<1→50 normalization so retired rows are queryable without polluting the default agent listing. Sentinel guard coverage is asymmetric by design: all four reserved IDs are protected, and force=true cannot override. Regression tests in internal/service/agent_retire_test.go assert each of the eight steps in order, plus sentinel bypass attempts and idempotency replay. Handler + router — status-code surface: internal/api/handler/agents.go:RetireAgent exposes seven status codes on DELETE /agents/{id}: 200 on a fresh retirement (body echoes AgentRetirementResult). 204 on idempotent replay (AlreadyRetired=true; no new audit). 400 on ErrForceReasonRequired. 403 on ErrAgentIsSentinel. 404 on ErrAgentNotFound. 409 on BlockedByDependenciesError, with a custom body shape {error, counts{active_targets, active_certificates, pending_jobs}} that bypasses the default ErrorWithRequestID envelope so callers get the per-bucket numbers directly. 500 on any other error. Heartbeat HandleHeartbeat returns 410 Gone when the agent is retired (ErrAgentRetired), signalling the agent to shut down. Query params `force=true` and `reason=<text>` drive the cascade path; both are forwarded as url.Values through the new MCP transport. internal/api/router/router.go registers GET /api/v1/agents/retired literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's literal-beats-pattern-var precedence routes "retired" to the paginated retired-agents listing instead of fetching a hypothetical agent named "retired". Agent binary — clean shutdown on 410: cmd/agent/main.go gains the ErrAgentRetired sentinel, a retiredOnce sync.Once, and a retiredSignal chan struct{}. A markRetired(source, statusCode, body) helper closes the channel exactly once; the Run() select loop observes the close and returns ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired) and exits cleanly instead of spinning in the heartbeat retry loop. The 410 Gone surface is therefore terminal for the agent process. MCP transport: internal/mcp/client.go adds Client.DeleteWithQuery(path, query), a new additive transport method. Client.Delete is path-only; without this method the retire tool would silently drop `force` and `reason`, turning every cascade retire into a default soft-retire. The new method shares do()'s 204 normalization and 4xx/5xx error propagation so tool authors get one contract. internal/mcp/tools.go + internal/mcp/types.go expose the retire_agent tool with Force+Reason inputs wired through DeleteWithQuery. CLI: cmd/cli/main.go + internal/cli/client.go add two CLI surfaces: `agents list --retired` (client-side strip of --retired then delegation to ListRetiredAgents, sharing --page/--per-page parsing with the default listing) and `agents retire <id> [--force --reason "…"]` (mirrors ErrForceReasonRequired — force without reason is rejected client-side before the request is sent). JSON + table output modes both honor the new columns. Frontend: web/src/pages/AgentsPage.tsx surfaces retired/retire affordances. web/src/api/client.ts + web/src/api/types.ts expose the retire endpoint and the retired-listing. 4 new Vitest regression cases. OpenAPI: api/openapi.yaml documents DELETE /agents/{id} with all seven status codes, 410 on heartbeat, and the 409 per-bucket body shape. Regression coverage (six new test files, all green): internal/service/agent_retire_test.go — 8-step contract + sentinel guards internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat internal/mcp/retire_agent_test.go — DeleteWithQuery wire-through internal/cli/agent_retire_test.go — --retired listing + --force/--reason pairing internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down internal/domain/connector_test.go — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies Files: api/openapi.yaml — DELETE + 410 + 409 body shape cmd/agent/main.go — ErrAgentRetired, markRetired, retiredSignal cmd/cli/main.go — handleAgents list/get/retire dispatch docs/architecture.md, docs/concepts.md, docs/testing-guide.md — retirement contract narrative internal/api/handler/agents.go — RetireAgent, status surface, 410 on heartbeat internal/api/handler/agent_handler_test.go — extended coverage internal/api/handler/agent_retire_handler_test.go — new internal/api/router/router.go — /agents/retired before /agents/{id} internal/cli/agent_retire_test.go — new internal/cli/client.go — ListRetiredAgents + RetireAgent internal/domain/connector.go — IsRetired, SentinelAgentIDs, IsSentinelAgent, AgentDependencyCounts, ActorTypeAgent/System internal/domain/connector_test.go — new internal/integration/lifecycle_test.go — retirement fixture internal/mcp/client.go — DeleteWithQuery additive transport internal/mcp/retire_agent_test.go — new internal/mcp/tools.go, internal/mcp/types.go — retire_agent tool + Force/Reason inputs internal/repository/interfaces.go — AgentRepository retirement methods internal/repository/postgres/agent.go — retire + cascade target retire + counts internal/repository/postgres/migration_000015_test.go — new internal/service/agent.go — wire into AgentService surface internal/service/agent_retire.go — new 8-step contract internal/service/agent_retire_test.go — new internal/service/deployment.go — skip retired agents internal/service/target.go — skip retired agents internal/service/testutil_test.go — shared mocks extended migrations/000015_agent_retire.up.sql — new migrations/000015_agent_retire.down.sql — new web/src/api/client.ts, types.ts + tests — retire endpoint wiring web/src/pages/AgentsPage.tsx — retire UI	2026-04-19 05:24:00 +00:00
shankar0123	3287e174dc	Unify API auth + RFC-compliant CRL/OCSP (M-002 + M-003 + M-006, auto-closes M-001) Closes the remaining P1 gaps from coverage-gap-audit.md (M-001/M-002/M-003/M-006) on top of the C-001/C-002 ownership + agent-FK contract fixes landed in `a53a4b8`. The work lands as a single commit spanning server, docs, tests, and the React client. M-002 — Named API keys with per-key actor propagation * Migration 000014 adds the 'api_keys' table (id, name, hash, principal, role, created_at, last_used_at, disabled_at) so every credential carries an identifiable principal instead of the opaque 'anonymous'/'api-key' sentinel. * Auth middleware now rotates through configured keys, performs constant-time hash comparison, stamps 'last_used_at', and emits an actor struct via contextWithActor(). The audit middleware, bulk-revocation handler, approval handlers, and MCP tool layer now read the principal off the context and persist it on every audit_events row. * Regression coverage: - internal/api/middleware/audit_test.go — actor propagation, principal redaction for disabled keys, anonymous fallback for unauthenticated endpoints. - internal/api/handler/bulk_revocation_handler_test.go, job_handler_test.go — principal-on-audit assertions. M-003 — Authorization gates (Phase B) * Approval handler rejects self-approval / self-rejection with 403 when the actor principal equals the job's requested_by field. * Bulk revocation is gated behind the 'admin' role; operators and viewers receive 403. * Regression coverage: - internal/service/job_test.go — TestApproveJob_NotSelf, TestRejectJob_NotSelf. - internal/api/handler/bulk_revocation_handler_test.go — TestBulkRevoke_RequiresAdmin, TestBulkRevoke_AdminSucceeds. M-006 — RFC-compliant CRL/OCSP on the unauthenticated .well-known mux * Per RFC 8615, relying parties cannot reasonably be asked to authenticate against the issuing certctl instance to retrieve revocation material. CRL and OCSP move off the authenticated '/api/v1/crl' and '/api/v1/ocsp/' paths onto: GET /.well-known/pki/crl/{issuer_id} Content-Type: application/pkix-crl (RFC 5280 §5) GET /.well-known/pki/ocsp/{issuer_id}/{serial} Content-Type: application/ocsp-response (RFC 6960) * Non-standard JSON CRL shape is removed; only DER is served. * Short-lived certificate exemption (profile TTL < 1h → skip CRL/OCSP) is preserved; the response simply omits the serial. * Routes are registered on the unauthenticated 'finalHandler' mux in cmd/server/main.go alongside EST ('/.well-known/est/') and SCEP ('/scep'). Legacy authenticated paths return 404. Regression coverage: - internal/api/handler/certificate_handler_test.go — content type, DER parseability, 404 for unknown issuer. - internal/api/handler/adversarial_path_test.go — unauthenticated access asserted for CRL, OCSP, EST, SCEP. - internal/api/router/router_test.go — route-table assertion that '.well-known/pki/', '.well-known/est/', and '/scep' are mounted on the unauthenticated branch. M-001 — Auto-closed by M-002 EST and SCEP were already registered on the unauthenticated 'finalHandler' mux; the router comment at internal/api/router/router.go:247 now matches reality. The adversarial-path tests above lock the behavior in. Verification (all gates green): * go vet ./... — clean * go build ./... — ok * go test -short ./... (55+ packages) — all pass * web/ : npm test (225 Vitest tests) — all pass * web/ : npx tsc --noEmit — clean * grep sweep for '/api/v1/(crl\|ocsp)' — 13 surviving hits, all intentional M-006 tombstone/relocation comments. Documentation: * coverage-gap-audit.md — status flips M-001/M-002/M-003/M-006 → Fixed, with per-finding resolution paragraphs citing regression test IDs. (Audit file lives outside this repo; see cowork root.) * CLAUDE.md Project Status line updated with the auth-unification closure note. * docs/features.md, docs/architecture.md, docs/quickstart.md, docs/concepts.md, docs/connectors.md, docs/test-env.md, docs/testing-guide.md, docs/compliance-.md, docs/demo-advanced.md — refreshed for the new '.well-known/pki/' namespace and named API keys. * api/openapi.yaml — documents the new unauthenticated endpoints and removes the legacy '/api/v1/crl' + '/api/v1/ocsp/' paths. .gitignore: adds '/.gocache/' and '/.gomodcache/' for the session- scoped Go caches so they never enter the tree.	2026-04-18 18:17:41 +00:00
shankar0123	a53a4b845b	fix(gui,api): close C-001 + C-002 — ownership + agent FK contract C-001 — CreateCertificate was server-accepted with null owner_id, team_id, renewal_policy_id because the GUI neither collected the fields nor enforced them, even though the backend's ManagedCertificate schema and handler contract treat them as required. Fix the contract at all four layers: - web/src/pages/CertificatesPage.tsx: replace owner_id/team_id free- text inputs with <select> elements fed by getOwners/getTeams/ getPolicies queries; mark all three required; gate the Create button on owner_id + team_id + renewal_policy_id being set. - internal/api/handler/certificates.go: ValidateRequired for owner_id, team_id, renewal_policy_id on CreateCertificate so the handler returns HTTP 400 with the offending field name before the service layer is reached. - internal/mcp/types.go: drop ',omitempty' from CreateCertificateInput.RenewalPolicyID so the MCP schema reflects the required contract; Update inputs keep partial-update semantics. - api/openapi.yaml: 'required: [name, common_name, renewal_policy_id, issuer_id, owner_id, team_id]' was already present on the Create schema; clarified DeploymentTarget.agent_id description to note the FK contract. C-002 — CreateTargetWizard accepted an empty or bogus agent_id and the service inserted directly, producing a Postgres 23503 FK-violation that bubbled out as a generic HTTP 500. The FK itself (migration 000001 line 104: agent_id TEXT NOT NULL REFERENCES agents(id)) is correct; we keep the schema strict and add validation at three layers: - internal/service/target.go: introduce ErrAgentNotFound sentinel and pre-validate agent_id in TargetService.CreateTarget — empty string returns 'agent_id is required'; a nonexistent id returns the full 'referenced agent does not exist: <id>' error. Both wrap ErrAgentNotFound via fmt.Errorf %w so callers can use errors.Is. - internal/api/handler/targets.go: ValidateRequired on agent_id; map errors.Is(err, service.ErrAgentNotFound) to HTTP 400 instead of letting it fall through to the generic 500 branch. - internal/mcp/types.go: drop ',omitempty' from CreateTargetInput.AgentID to match the required contract. - web/src/pages/TargetsPage.tsx: replace the free-text Agent ID input with a <select> populated from getAgents(); include agent in the canProceedToReview gate so Next is disabled until an agent is chosen. Regression coverage (21 new subtests total): - TestCreateCertificate_MissingRequiredField_Returns400 — 6 subtests, one per required field, each proves the handler guard fires before the mock service is called. - TestCreateTarget_MissingAgentID_Returns400 — handler guard. - TestCreateTarget_NonexistentAgent_Returns400 — pins the ErrAgentNotFound -> 400 translation. - TestTargetService_CreateTarget_MissingAgentID — errors.Is sentinel. - TestTargetService_CreateTarget_NonexistentAgentID — errors.Is. - The existing TestTargetService_CreateTarget_Success, along with TestCreateTarget_{MissingName,MissingType,NameTooLong}_* handler tests, were updated to seed a real agent or include agent_id in the request body so the happy paths still run cleanly. Gates (Phase 4): - go build/vet/test/race: green - go test -cover: internal/service 68.7% (gate 55%), internal/api/handler 78.9% (gate 60%) - golangci-lint on service+handler+mcp: 0 issues - govulncheck: no reachable vulns - tsc --noEmit: clean - vitest: 223/223 passing See cowork/certctl-coverage-gap-audit.md entries C-001 and C-002.	2026-04-18 16:01:40 +00:00
shankar0123	eef1db0f0a	fix(policies): stop 400ing the "+ New Policy" button + add per-rule severity (D-005, D-006) Coverage Gap Audit findings D-005 (P0) + D-006 (P1) fixed together in a single commit because they share the same root cause — policy CRUD sending values the backend silently rejects — and splitting them would leave a half-working UI between commits. ## D-005 (P0): PoliciesPage dropdown 400s every Create Policy Root cause ---------- `web/src/pages/PoliciesPage.tsx` populated the Type `<select>` from a hardcoded `['key_algorithm', 'ownership', 'allowed_issuers', ...]` array. The backend's `internal/api/handler/validators.go::ValidatePolicyType` enforces the TitleCase allowlist `AllowedIssuers`, `AllowedDomains`, `RequiredMetadata`, `AllowedEnvironments`, `RenewalLeadTime` — defined in `internal/domain/policy.go`. Every Create Policy request was rejected with `400 invalid policy type`. The error surfaced only as a transient toast; the modal closed anyway. Silent user-visible failure. Fix --- - `web/src/api/types.ts`: added `POLICY_TYPES` and `POLICY_SEVERITIES` tuples with `as const` and narrowed `PolicyRule.type`, `.severity`, and `PolicyViolation.severity` to the literal-union types. Dropdown is now sourced from the tuple; casing drift becomes a compile error. - `web/src/pages/PoliciesPage.tsx`: rekeyed `severityStyles` / `severityDots` to the TitleCase values, added `humanize()` for display (AllowedIssuers → "Allowed Issuers"), removed the `badge-neutral` fallback that was papering over the mismatch. - `web/src/api/types.test.ts` (new): pins both tuples exactly. If anyone edits one side of the frontend/backend contract without the other, CI fails with a clear assertion. Pure-TS vitest, no RTL dependency. ## D-006 (P1): `severity` field silently dropped on create/update Root cause ---------- `PolicyRule` had no `Severity` field in `internal/domain/policy.go`. The frontend has always sent `severity` on create/update, but Go's `json.Decoder` (default settings, no `DisallowUnknownFields`) silently dropped it. The value never reached PostgreSQL. Every rule rendered with the same severity because there was no severity — just a display computation downstream. Fix: option (b), full-stack schema add (not delete-the-field) ------------------------------------------------------------- - Migration `000013_policy_rule_severity` (up + down): adds `severity VARCHAR(50) NOT NULL DEFAULT 'Warning'` to `policy_rules` with CHECK constraint `severity IN ('Warning', 'Error', 'Critical')`. No index — three-value column on a low-thousands-rows table, planner will seq-scan regardless. PG 11+ metadata-only ADD COLUMN, safe on live data. - `internal/domain/policy.go`: added `Severity PolicySeverity` field. - `internal/repository/postgres/policy.go`: plumbed `severity` through ListRules SELECT + Scan, GetRule SELECT + Scan, CreateRule INSERT, UpdateRule UPDATE (4 queries). - `internal/service/policy.go::UpdatePolicy`: if the client omits severity on a PUT (zero-value empty string), fetch the existing rule and preserve its severity. Without this, partial updates would trip the NOT NULL CHECK and 500. Preserves pre-existing behavior for Name/Type (out of scope). - `internal/api/handler/policies.go::CreatePolicy`: default empty severity to `'Warning'`, then validate via `ValidatePolicySeverity`. 400 with clear message instead of 500 on CHECK violation. `UpdatePolicy`: validates severity only when provided. - `internal/mcp/types.go` + `internal/mcp/tools.go`: added optional `severity` on the MCP `create_policy` / `update_policy` tool inputs so LLM callers stay in sync with the wire contract. - `api/openapi.yaml`: added `severity` to the `PolicyRule` schema with the enum and default. Acceptance criterion (user-defined) ----------------------------------- "Create a rule with severity=Critical, reload the page, and still see Critical — no silent drops." Verified end-to-end: frontend sends `severity: "Critical"`, handler validates, service persists, DB stores, GET returns, React renders the correct badge. Seed data --------- `migrations/seed.sql`: four demo rules now have differentiated severities — `pr-require-owner` → Warning, `pr-allowed-environments` → Error, `pr-max-certificate-lifetime` → Critical, `pr-min-renewal-window` → Warning. The user called out that seeding all four at the same severity makes the feature look decorative; differentiation demonstrates the column carries real signal. ## Integration test fix (side effect of D-006) `internal/integration/e2e_test.go::TestCrossResourceWorkflow/CreatePolicy` was sending `"severity": "High"` — a value from the pre-audit severity vocabulary that the new `ValidatePolicySeverity` correctly rejects with 400. Changed to `"Error"` (closest semantic match in the new TitleCase allowlist). Only severity reference in the integration/ directory; verified via grep. ## Out of scope, logged for follow-up (d/D-008) Three policy-engine drift issues orthogonal to D-005 + D-006, explicitly deferred per direction: 1. `migrations/seed.sql` policy_rules INSERTs use lowercase TYPE values (`'ownership'`, `'environment'`, `'lifetime'`, `'renewal_window'`). These are load-bearing on `internal/service/policy.go::evaluateRule`'s `switch rule.Type` (which also uses the lowercase strings). Migrating requires coordinated changes across seed + evaluation engine. 2. `migrations/seed_demo.sql:482-483` contains lowercase `'critical'` severity — will now fail the new CHECK constraint. Separate fix. 3. `evaluateRule` hardcodes `Severity: domain.PolicySeverityWarning` on emitted violations and ignores the configured `rule.Config`. The new severity column is read correctly on the CRUD path but not yet consulted during evaluation. ## Verification Backend: - `go build ./...` — clean - `go vet ./...` — clean - `go test -short ./...` — all packages green, including `internal/service` (policy service), `internal/api/handler` (policy + MCP handler tests), `internal/integration` (e2e_test.go after fix), `internal/domain`, `internal/repository/postgres`. Frontend: - `tsc --noEmit` — clean - `vitest run` — 223/223 passing (4 new assertions in types.test.ts) - `vite build` — clean (only the pre-existing chunk-size warning)	2026-04-18 13:02:04 +00:00
shankar0123	13cd4d98ba	feat(V2.2): bulk revocation — filter-based fleet-wide certificate revocation Add POST /api/v1/certificates/bulk-revoke with filter criteria (profile_id, owner_id, agent_id, issuer_id, team_id, certificate_ids), partial-failure tolerance, and audit trail. Includes MCP tool, CLI command (certs bulk-revoke), server-side bulk modal in GUI replacing client-side sequential loop, OpenAPI spec, compliance mapping updates, and 21 new tests (12 service, 7 handler, 1 CLI, 1 frontend). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 00:06:34 -04:00
shankar0123	a8fc177118	fix: resolve NULL csr_pem scan errors and QA smoke test failures Root cause: certificate_versions.csr_pem is nullable in the schema but Go code scanned it into a plain string. Used sql.NullString in ListVersions and GetLatestVersion to handle NULL values correctly. Also includes: partial update fetch-merge-update pattern to prevent FK violations, nil directory guard in discovery service, diagnostic slog logging in handlers, export handler 422 for unparseable PEM, OpenAPI spec corrections, MCP tool description improvements, and test fixes. Rewrites the Release Sign-Off section in testing-guide.md to individual test-level granularity (320 rows) with smoke test results audited and checked off (121 pass, 5 skip, 194 manual remaining). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-30 00:51:18 -04:00
shankar0123	ec21c9bb29	feat(m28+m29+m30): ACME ARI, email digest, and Helm chart M28: ACME Renewal Information (RFC 9702) — CA-directed renewal timing with cert ID computation, directory endpoint discovery, graceful degradation for non-ARI CAs. 19 tests. M29: Email notifier wiring + scheduled certificate digest — SMTP connector bridged to service layer via NotifierAdapter, DigestService with HTML email template, 7th scheduler loop (24h), digest preview/send API endpoints and GUI card. 21 tests. M30: Production-ready Helm chart — server Deployment, PostgreSQL StatefulSet, agent DaemonSet, ConfigMaps, Secrets, Ingress, security contexts, health probes, example values for dev/prod/ACME scenarios. Also: OpenAPI spec updates, MCP tool additions, CI helm-lint job, documentation updates across 5 doc files and README. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-28 21:18:35 -04:00
shankar0123	43a03c168c	fix: Go 1.25 upgrade, codebase audit fixes, MCP server tests Upgrade from Go 1.22 to 1.25 (minimum for MCP SDK, actively supported). CI updated to match. Codebase audit fixes: - Local CA parseIP() now uses net.ParseIP — IP SANs no longer silently dropped - Nil pointer guards in agent.go GetWorkWithTargets for target/cert enrichment - MCP CreateCertificateInput marks owner_id/team_id as required - NGINX connector uses CombinedOutput() — captures diagnostic output on failure - Jobs handler validates JSON decode on rejection body — returns 400 on malformed - CRL/OCSP handlers propagate requestID for error tracing MCP server tests (26 tests): - client_test.go: HTTP client coverage (GET/POST/PUT/DELETE, auth, 204, errors, binary) - tools_test.go: tool registration, pagination, end-to-end flows with mock API Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-23 17:36:25 -04:00
shankar0123	956230aec1	feat: M18a — MCP server exposing all 76 API endpoints as AI-native tools Separate standalone binary (cmd/mcp-server/) using official MCP Go SDK (modelcontextprotocol/go-sdk v1.4.1) with stdio transport. Stateless HTTP proxy translates MCP tool calls to certctl REST API requests. 76 tools across 16 resource domains with typed input structs and jsonschema tags for automatic LLM-friendly schema generation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-23 16:49:39 -04:00

16 Commits