certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 14:11:31 +00:00

Author	SHA1	Message	Date
shankar0123	0725713e19	Close I-004 (agent hard-delete cascades targets) coverage-gap finding Operator decision answered as full soft-delete with optional forced cascade — hard-delete is not reachable from any public surface. Prior to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents` whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id silently wiped every target, orphaning certs and aborting in-flight jobs. The finding closure reshapes the agent-removal contract around soft retirement with explicit preflight counts, an opt-in cascade gated by a mandatory reason, and unconditional protection for the four reserved sentinel agents used by discovery sources. Schema — migration 000015: migrations/000015_agent_retire.up.sql flips deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE RESTRICT, so a stray `DELETE FROM agents` now errors at the DB boundary instead of quietly destroying targets. Both `agents` and `deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason TEXT pair (TEXT not VARCHAR so operator comments are never truncated), indexed via partial indexes WHERE retired_at IS NOT NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT EXISTS) so repeated runs against partially-migrated databases converge. migrations/000015_agent_retire.down.sql restores CASCADE and drops the new columns for clean rollback. A dedicated repository-layer testcontainers test (internal/repository/postgres/migration_000015_test.go) asserts the before/after FK action, column presence, index presence, and round-trip idempotency under up→down→up. Domain — sentinel guard + dependency counts: internal/domain/connector.go gains IsRetired() on Agent, the exported SentinelAgentIDs slice listing server-scanner, cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the four reserved IDs documented in CLAUDE.md and created at startup in cmd/server/main.go), IsSentinelAgent(id string) predicate, AgentDependencyCounts{ActiveTargets, ActiveCertificates, PendingJobs} with a HasDependencies() method, and ActorTypeAgent / ActorTypeSystem enum values used by audit emission downstream. Coverage locked down by internal/domain/connector_test.go. Service — 8-step ordered contract: internal/service/agent_retire.go:RetireAgent(ctx, id, actor, opts{Force, Reason}) enforces a fixed execution order: (1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel unconditionally; force=true does NOT bypass it. (2) fetch — ErrAgentNotFound on miss. (3) idempotency — if IsRetired() already, return AgentRetirementResult{AlreadyRetired: true} with no new audit event and no state change (safe to replay from flaky clients). (4) preflight counts — collectAgentDependencyCounts runs ActiveTargets, ActiveCertificates, PendingJobs sequentially (not in parallel; keeps the per-query timeout predictable and matches the repo's existing call-chain shape). (5) force-reason guard — opts.Force=true with empty Reason returns ErrForceReasonRequired (wired into the 400 status surface). (6) dependency guard — HasDependencies() with opts.Force=false returns BlockedByDependenciesError{Counts} (wired into the 409 body with per-bucket counts). (7) mutation — single pinned retiredAt := time.Now(); agent retirement first, then cascade target retirement if opts.Force, all under the repo's single transaction so the two retired_at stamps match to the second. (8) best-effort audit — agent_retired always; agent_retirement_ cascaded additionally on the force path. Actor is whatever the handler resolves from the request; actor type is mapped by resolveActorType (system/agent-prefix→Agent/else→User). Audit emission failures are logged via slog.Error but do not abort the retirement (matches the house convention used by every other scheduler-emitted event). BlockedByDependenciesError implements Error() as "active_targets=%d, active_certificates=%d, pending_jobs=%d" and Unwrap() → ErrBlockedByDependencies. The single struct satisfies errors.Is via Unwrap (used by scheduler-level tests) and errors.As via the concrete type (used by the handler to fish out Counts for the 409 body). ListRetiredAgents(page, perPage) adds a separate paginated accessor with page<1→1 and perPage<1→50 normalization so retired rows are queryable without polluting the default agent listing. Sentinel guard coverage is asymmetric by design: all four reserved IDs are protected, and force=true cannot override. Regression tests in internal/service/agent_retire_test.go assert each of the eight steps in order, plus sentinel bypass attempts and idempotency replay. Handler + router — status-code surface: internal/api/handler/agents.go:RetireAgent exposes seven status codes on DELETE /agents/{id}: 200 on a fresh retirement (body echoes AgentRetirementResult). 204 on idempotent replay (AlreadyRetired=true; no new audit). 400 on ErrForceReasonRequired. 403 on ErrAgentIsSentinel. 404 on ErrAgentNotFound. 409 on BlockedByDependenciesError, with a custom body shape {error, counts{active_targets, active_certificates, pending_jobs}} that bypasses the default ErrorWithRequestID envelope so callers get the per-bucket numbers directly. 500 on any other error. Heartbeat HandleHeartbeat returns 410 Gone when the agent is retired (ErrAgentRetired), signalling the agent to shut down. Query params `force=true` and `reason=<text>` drive the cascade path; both are forwarded as url.Values through the new MCP transport. internal/api/router/router.go registers GET /api/v1/agents/retired literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's literal-beats-pattern-var precedence routes "retired" to the paginated retired-agents listing instead of fetching a hypothetical agent named "retired". Agent binary — clean shutdown on 410: cmd/agent/main.go gains the ErrAgentRetired sentinel, a retiredOnce sync.Once, and a retiredSignal chan struct{}. A markRetired(source, statusCode, body) helper closes the channel exactly once; the Run() select loop observes the close and returns ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired) and exits cleanly instead of spinning in the heartbeat retry loop. The 410 Gone surface is therefore terminal for the agent process. MCP transport: internal/mcp/client.go adds Client.DeleteWithQuery(path, query), a new additive transport method. Client.Delete is path-only; without this method the retire tool would silently drop `force` and `reason`, turning every cascade retire into a default soft-retire. The new method shares do()'s 204 normalization and 4xx/5xx error propagation so tool authors get one contract. internal/mcp/tools.go + internal/mcp/types.go expose the retire_agent tool with Force+Reason inputs wired through DeleteWithQuery. CLI: cmd/cli/main.go + internal/cli/client.go add two CLI surfaces: `agents list --retired` (client-side strip of --retired then delegation to ListRetiredAgents, sharing --page/--per-page parsing with the default listing) and `agents retire <id> [--force --reason "…"]` (mirrors ErrForceReasonRequired — force without reason is rejected client-side before the request is sent). JSON + table output modes both honor the new columns. Frontend: web/src/pages/AgentsPage.tsx surfaces retired/retire affordances. web/src/api/client.ts + web/src/api/types.ts expose the retire endpoint and the retired-listing. 4 new Vitest regression cases. OpenAPI: api/openapi.yaml documents DELETE /agents/{id} with all seven status codes, 410 on heartbeat, and the 409 per-bucket body shape. Regression coverage (six new test files, all green): internal/service/agent_retire_test.go — 8-step contract + sentinel guards internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat internal/mcp/retire_agent_test.go — DeleteWithQuery wire-through internal/cli/agent_retire_test.go — --retired listing + --force/--reason pairing internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down internal/domain/connector_test.go — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies Files: api/openapi.yaml — DELETE + 410 + 409 body shape cmd/agent/main.go — ErrAgentRetired, markRetired, retiredSignal cmd/cli/main.go — handleAgents list/get/retire dispatch docs/architecture.md, docs/concepts.md, docs/testing-guide.md — retirement contract narrative internal/api/handler/agents.go — RetireAgent, status surface, 410 on heartbeat internal/api/handler/agent_handler_test.go — extended coverage internal/api/handler/agent_retire_handler_test.go — new internal/api/router/router.go — /agents/retired before /agents/{id} internal/cli/agent_retire_test.go — new internal/cli/client.go — ListRetiredAgents + RetireAgent internal/domain/connector.go — IsRetired, SentinelAgentIDs, IsSentinelAgent, AgentDependencyCounts, ActorTypeAgent/System internal/domain/connector_test.go — new internal/integration/lifecycle_test.go — retirement fixture internal/mcp/client.go — DeleteWithQuery additive transport internal/mcp/retire_agent_test.go — new internal/mcp/tools.go, internal/mcp/types.go — retire_agent tool + Force/Reason inputs internal/repository/interfaces.go — AgentRepository retirement methods internal/repository/postgres/agent.go — retire + cascade target retire + counts internal/repository/postgres/migration_000015_test.go — new internal/service/agent.go — wire into AgentService surface internal/service/agent_retire.go — new 8-step contract internal/service/agent_retire_test.go — new internal/service/deployment.go — skip retired agents internal/service/target.go — skip retired agents internal/service/testutil_test.go — shared mocks extended migrations/000015_agent_retire.up.sql — new migrations/000015_agent_retire.down.sql — new web/src/api/client.ts, types.ts + tests — retire endpoint wiring web/src/pages/AgentsPage.tsx — retire UI	2026-04-19 05:24:00 +00:00
shankar0123	a53a4b845b	fix(gui,api): close C-001 + C-002 — ownership + agent FK contract C-001 — CreateCertificate was server-accepted with null owner_id, team_id, renewal_policy_id because the GUI neither collected the fields nor enforced them, even though the backend's ManagedCertificate schema and handler contract treat them as required. Fix the contract at all four layers: - web/src/pages/CertificatesPage.tsx: replace owner_id/team_id free- text inputs with <select> elements fed by getOwners/getTeams/ getPolicies queries; mark all three required; gate the Create button on owner_id + team_id + renewal_policy_id being set. - internal/api/handler/certificates.go: ValidateRequired for owner_id, team_id, renewal_policy_id on CreateCertificate so the handler returns HTTP 400 with the offending field name before the service layer is reached. - internal/mcp/types.go: drop ',omitempty' from CreateCertificateInput.RenewalPolicyID so the MCP schema reflects the required contract; Update inputs keep partial-update semantics. - api/openapi.yaml: 'required: [name, common_name, renewal_policy_id, issuer_id, owner_id, team_id]' was already present on the Create schema; clarified DeploymentTarget.agent_id description to note the FK contract. C-002 — CreateTargetWizard accepted an empty or bogus agent_id and the service inserted directly, producing a Postgres 23503 FK-violation that bubbled out as a generic HTTP 500. The FK itself (migration 000001 line 104: agent_id TEXT NOT NULL REFERENCES agents(id)) is correct; we keep the schema strict and add validation at three layers: - internal/service/target.go: introduce ErrAgentNotFound sentinel and pre-validate agent_id in TargetService.CreateTarget — empty string returns 'agent_id is required'; a nonexistent id returns the full 'referenced agent does not exist: <id>' error. Both wrap ErrAgentNotFound via fmt.Errorf %w so callers can use errors.Is. - internal/api/handler/targets.go: ValidateRequired on agent_id; map errors.Is(err, service.ErrAgentNotFound) to HTTP 400 instead of letting it fall through to the generic 500 branch. - internal/mcp/types.go: drop ',omitempty' from CreateTargetInput.AgentID to match the required contract. - web/src/pages/TargetsPage.tsx: replace the free-text Agent ID input with a <select> populated from getAgents(); include agent in the canProceedToReview gate so Next is disabled until an agent is chosen. Regression coverage (21 new subtests total): - TestCreateCertificate_MissingRequiredField_Returns400 — 6 subtests, one per required field, each proves the handler guard fires before the mock service is called. - TestCreateTarget_MissingAgentID_Returns400 — handler guard. - TestCreateTarget_NonexistentAgent_Returns400 — pins the ErrAgentNotFound -> 400 translation. - TestTargetService_CreateTarget_MissingAgentID — errors.Is sentinel. - TestTargetService_CreateTarget_NonexistentAgentID — errors.Is. - The existing TestTargetService_CreateTarget_Success, along with TestCreateTarget_{MissingName,MissingType,NameTooLong}_* handler tests, were updated to seed a real agent or include agent_id in the request body so the happy paths still run cleanly. Gates (Phase 4): - go build/vet/test/race: green - go test -cover: internal/service 68.7% (gate 55%), internal/api/handler 78.9% (gate 60%) - golangci-lint on service+handler+mcp: 0 issues - govulncheck: no reachable vulns - tsc --noEmit: clean - vitest: 223/223 passing See cowork/certctl-coverage-gap-audit.md entries C-001 and C-002.	2026-04-18 16:01:40 +00:00
shankar0123	eef1db0f0a	fix(policies): stop 400ing the "+ New Policy" button + add per-rule severity (D-005, D-006) Coverage Gap Audit findings D-005 (P0) + D-006 (P1) fixed together in a single commit because they share the same root cause — policy CRUD sending values the backend silently rejects — and splitting them would leave a half-working UI between commits. ## D-005 (P0): PoliciesPage dropdown 400s every Create Policy Root cause ---------- `web/src/pages/PoliciesPage.tsx` populated the Type `<select>` from a hardcoded `['key_algorithm', 'ownership', 'allowed_issuers', ...]` array. The backend's `internal/api/handler/validators.go::ValidatePolicyType` enforces the TitleCase allowlist `AllowedIssuers`, `AllowedDomains`, `RequiredMetadata`, `AllowedEnvironments`, `RenewalLeadTime` — defined in `internal/domain/policy.go`. Every Create Policy request was rejected with `400 invalid policy type`. The error surfaced only as a transient toast; the modal closed anyway. Silent user-visible failure. Fix --- - `web/src/api/types.ts`: added `POLICY_TYPES` and `POLICY_SEVERITIES` tuples with `as const` and narrowed `PolicyRule.type`, `.severity`, and `PolicyViolation.severity` to the literal-union types. Dropdown is now sourced from the tuple; casing drift becomes a compile error. - `web/src/pages/PoliciesPage.tsx`: rekeyed `severityStyles` / `severityDots` to the TitleCase values, added `humanize()` for display (AllowedIssuers → "Allowed Issuers"), removed the `badge-neutral` fallback that was papering over the mismatch. - `web/src/api/types.test.ts` (new): pins both tuples exactly. If anyone edits one side of the frontend/backend contract without the other, CI fails with a clear assertion. Pure-TS vitest, no RTL dependency. ## D-006 (P1): `severity` field silently dropped on create/update Root cause ---------- `PolicyRule` had no `Severity` field in `internal/domain/policy.go`. The frontend has always sent `severity` on create/update, but Go's `json.Decoder` (default settings, no `DisallowUnknownFields`) silently dropped it. The value never reached PostgreSQL. Every rule rendered with the same severity because there was no severity — just a display computation downstream. Fix: option (b), full-stack schema add (not delete-the-field) ------------------------------------------------------------- - Migration `000013_policy_rule_severity` (up + down): adds `severity VARCHAR(50) NOT NULL DEFAULT 'Warning'` to `policy_rules` with CHECK constraint `severity IN ('Warning', 'Error', 'Critical')`. No index — three-value column on a low-thousands-rows table, planner will seq-scan regardless. PG 11+ metadata-only ADD COLUMN, safe on live data. - `internal/domain/policy.go`: added `Severity PolicySeverity` field. - `internal/repository/postgres/policy.go`: plumbed `severity` through ListRules SELECT + Scan, GetRule SELECT + Scan, CreateRule INSERT, UpdateRule UPDATE (4 queries). - `internal/service/policy.go::UpdatePolicy`: if the client omits severity on a PUT (zero-value empty string), fetch the existing rule and preserve its severity. Without this, partial updates would trip the NOT NULL CHECK and 500. Preserves pre-existing behavior for Name/Type (out of scope). - `internal/api/handler/policies.go::CreatePolicy`: default empty severity to `'Warning'`, then validate via `ValidatePolicySeverity`. 400 with clear message instead of 500 on CHECK violation. `UpdatePolicy`: validates severity only when provided. - `internal/mcp/types.go` + `internal/mcp/tools.go`: added optional `severity` on the MCP `create_policy` / `update_policy` tool inputs so LLM callers stay in sync with the wire contract. - `api/openapi.yaml`: added `severity` to the `PolicyRule` schema with the enum and default. Acceptance criterion (user-defined) ----------------------------------- "Create a rule with severity=Critical, reload the page, and still see Critical — no silent drops." Verified end-to-end: frontend sends `severity: "Critical"`, handler validates, service persists, DB stores, GET returns, React renders the correct badge. Seed data --------- `migrations/seed.sql`: four demo rules now have differentiated severities — `pr-require-owner` → Warning, `pr-allowed-environments` → Error, `pr-max-certificate-lifetime` → Critical, `pr-min-renewal-window` → Warning. The user called out that seeding all four at the same severity makes the feature look decorative; differentiation demonstrates the column carries real signal. ## Integration test fix (side effect of D-006) `internal/integration/e2e_test.go::TestCrossResourceWorkflow/CreatePolicy` was sending `"severity": "High"` — a value from the pre-audit severity vocabulary that the new `ValidatePolicySeverity` correctly rejects with 400. Changed to `"Error"` (closest semantic match in the new TitleCase allowlist). Only severity reference in the integration/ directory; verified via grep. ## Out of scope, logged for follow-up (d/D-008) Three policy-engine drift issues orthogonal to D-005 + D-006, explicitly deferred per direction: 1. `migrations/seed.sql` policy_rules INSERTs use lowercase TYPE values (`'ownership'`, `'environment'`, `'lifetime'`, `'renewal_window'`). These are load-bearing on `internal/service/policy.go::evaluateRule`'s `switch rule.Type` (which also uses the lowercase strings). Migrating requires coordinated changes across seed + evaluation engine. 2. `migrations/seed_demo.sql:482-483` contains lowercase `'critical'` severity — will now fail the new CHECK constraint. Separate fix. 3. `evaluateRule` hardcodes `Severity: domain.PolicySeverityWarning` on emitted violations and ignores the configured `rule.Config`. The new severity column is read correctly on the CRUD path but not yet consulted during evaluation. ## Verification Backend: - `go build ./...` — clean - `go vet ./...` — clean - `go test -short ./...` — all packages green, including `internal/service` (policy service), `internal/api/handler` (policy + MCP handler tests), `internal/integration` (e2e_test.go after fix), `internal/domain`, `internal/repository/postgres`. Frontend: - `tsc --noEmit` — clean - `vitest run` — 223/223 passing (4 new assertions in types.test.ts) - `vite build` — clean (only the pre-existing chunk-size warning)	2026-04-18 13:02:04 +00:00
shankar0123	13cd4d98ba	feat(V2.2): bulk revocation — filter-based fleet-wide certificate revocation Add POST /api/v1/certificates/bulk-revoke with filter criteria (profile_id, owner_id, agent_id, issuer_id, team_id, certificate_ids), partial-failure tolerance, and audit trail. Includes MCP tool, CLI command (certs bulk-revoke), server-side bulk modal in GUI replacing client-side sequential loop, OpenAPI spec, compliance mapping updates, and 21 new tests (12 service, 7 handler, 1 CLI, 1 frontend). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 00:06:34 -04:00
shankar0123	43a03c168c	fix: Go 1.25 upgrade, codebase audit fixes, MCP server tests Upgrade from Go 1.22 to 1.25 (minimum for MCP SDK, actively supported). CI updated to match. Codebase audit fixes: - Local CA parseIP() now uses net.ParseIP — IP SANs no longer silently dropped - Nil pointer guards in agent.go GetWorkWithTargets for target/cert enrichment - MCP CreateCertificateInput marks owner_id/team_id as required - NGINX connector uses CombinedOutput() — captures diagnostic output on failure - Jobs handler validates JSON decode on rejection body — returns 400 on malformed - CRL/OCSP handlers propagate requestID for error tracing MCP server tests (26 tests): - client_test.go: HTTP client coverage (GET/POST/PUT/DELETE, auth, 204, errors, binary) - tools_test.go: tool registration, pagination, end-to-end flows with mock API Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-23 17:36:25 -04:00
shankar0123	956230aec1	feat: M18a — MCP server exposing all 76 API endpoints as AI-native tools Separate standalone binary (cmd/mcp-server/) using official MCP Go SDK (modelcontextprotocol/go-sdk v1.4.1) with stdio transport. Stateless HTTP proxy translates MCP tool calls to certctl REST API requests. 76 tools across 16 resource domains with typed input structs and jsonschema tags for automatic LLM-friendly schema generation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-23 16:49:39 -04:00

6 Commits