Commit Graph

6 Commits

Author SHA1 Message Date
shankar0123 0725713e19 Close I-004 (agent hard-delete cascades targets) coverage-gap finding
Operator decision answered as full soft-delete with optional forced
cascade — hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.

Schema — migration 000015:
  migrations/000015_agent_retire.up.sql flips
  deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
  RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
  boundary instead of quietly destroying targets. Both `agents` and
  `deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
  TEXT pair (TEXT not VARCHAR so operator comments are never
  truncated), indexed via partial indexes WHERE retired_at IS NOT
  NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
  CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
  EXISTS) so repeated runs against partially-migrated databases
  converge. migrations/000015_agent_retire.down.sql restores CASCADE
  and drops the new columns for clean rollback. A dedicated
  repository-layer testcontainers test
  (internal/repository/postgres/migration_000015_test.go) asserts the
  before/after FK action, column presence, index presence, and
  round-trip idempotency under up→down→up.

Domain — sentinel guard + dependency counts:
  internal/domain/connector.go gains IsRetired() on Agent, the
  exported SentinelAgentIDs slice listing server-scanner,
  cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
  four reserved IDs documented in CLAUDE.md and created at startup in
  cmd/server/main.go), IsSentinelAgent(id string) predicate,
  AgentDependencyCounts{ActiveTargets, ActiveCertificates,
  PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
  ActorTypeSystem enum values used by audit emission downstream.
  Coverage locked down by internal/domain/connector_test.go.

Service — 8-step ordered contract:
  internal/service/agent_retire.go:RetireAgent(ctx, id, actor,
  opts{Force, Reason}) enforces a fixed execution order:
  (1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel
      unconditionally; force=true does NOT bypass it.
  (2) fetch — ErrAgentNotFound on miss.
  (3) idempotency — if IsRetired() already, return
      AgentRetirementResult{AlreadyRetired: true} with no new audit
      event and no state change (safe to replay from flaky clients).
  (4) preflight counts — collectAgentDependencyCounts runs
      ActiveTargets, ActiveCertificates, PendingJobs sequentially
      (not in parallel; keeps the per-query timeout predictable and
      matches the repo's existing call-chain shape).
  (5) force-reason guard — opts.Force=true with empty Reason returns
      ErrForceReasonRequired (wired into the 400 status surface).
  (6) dependency guard — HasDependencies() with opts.Force=false
      returns BlockedByDependenciesError{Counts} (wired into the 409
      body with per-bucket counts).
  (7) mutation — single pinned retiredAt := time.Now(); agent
      retirement first, then cascade target retirement if opts.Force,
      all under the repo's single transaction so the two retired_at
      stamps match to the second.
  (8) best-effort audit — agent_retired always; agent_retirement_
      cascaded additionally on the force path. Actor is whatever the
      handler resolves from the request; actor type is mapped by
      resolveActorType (system/agent-prefix→Agent/else→User). Audit
      emission failures are logged via slog.Error but do not abort
      the retirement (matches the house convention used by every
      other scheduler-emitted event).

  BlockedByDependenciesError implements Error() as
  "active_targets=%d, active_certificates=%d, pending_jobs=%d" and
  Unwrap() → ErrBlockedByDependencies. The single struct satisfies
  errors.Is via Unwrap (used by scheduler-level tests) and errors.As
  via the concrete type (used by the handler to fish out Counts for
  the 409 body). ListRetiredAgents(page, perPage) adds a separate
  paginated accessor with page<1→1 and perPage<1→50 normalization so
  retired rows are queryable without polluting the default agent
  listing.

  Sentinel guard coverage is asymmetric by design: all four reserved
  IDs are protected, and force=true cannot override. Regression tests
  in internal/service/agent_retire_test.go assert each of the eight
  steps in order, plus sentinel bypass attempts and idempotency
  replay.

Handler + router — status-code surface:
  internal/api/handler/agents.go:RetireAgent exposes seven status
  codes on DELETE /agents/{id}:
    200 on a fresh retirement (body echoes AgentRetirementResult).
    204 on idempotent replay (AlreadyRetired=true; no new audit).
    400 on ErrForceReasonRequired.
    403 on ErrAgentIsSentinel.
    404 on ErrAgentNotFound.
    409 on BlockedByDependenciesError, with a custom body shape
        {error, counts{active_targets, active_certificates,
        pending_jobs}} that bypasses the default ErrorWithRequestID
        envelope so callers get the per-bucket numbers directly.
    500 on any other error.
  Heartbeat HandleHeartbeat returns 410 Gone when the agent is
  retired (ErrAgentRetired), signalling the agent to shut down.
  Query params `force=true` and `reason=<text>` drive the cascade
  path; both are forwarded as url.Values through the new MCP
  transport.

  internal/api/router/router.go registers GET /api/v1/agents/retired
  literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's
  literal-beats-pattern-var precedence routes "retired" to the
  paginated retired-agents listing instead of fetching a hypothetical
  agent named "retired".

Agent binary — clean shutdown on 410:
  cmd/agent/main.go gains the ErrAgentRetired sentinel, a
  retiredOnce sync.Once, and a retiredSignal chan struct{}. A
  markRetired(source, statusCode, body) helper closes the channel
  exactly once; the Run() select loop observes the close and returns
  ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
  and exits cleanly instead of spinning in the heartbeat retry loop.
  The 410 Gone surface is therefore terminal for the agent process.

MCP transport:
  internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
  a new additive transport method. Client.Delete is path-only; without
  this method the retire tool would silently drop `force` and `reason`,
  turning every cascade retire into a default soft-retire. The new
  method shares do()'s 204 normalization and 4xx/5xx error
  propagation so tool authors get one contract.
  internal/mcp/tools.go + internal/mcp/types.go expose the
  retire_agent tool with Force+Reason inputs wired through
  DeleteWithQuery.

CLI:
  cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
  `agents list --retired` (client-side strip of --retired then
  delegation to ListRetiredAgents, sharing --page/--per-page parsing
  with the default listing) and `agents retire <id> [--force --reason
  "…"]` (mirrors ErrForceReasonRequired — force without reason is
  rejected client-side before the request is sent). JSON + table
  output modes both honor the new columns.

Frontend:
  web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
  web/src/api/client.ts + web/src/api/types.ts expose the retire
  endpoint and the retired-listing. 4 new Vitest regression cases.

OpenAPI:
  api/openapi.yaml documents DELETE /agents/{id} with all seven
  status codes, 410 on heartbeat, and the 409 per-bucket body shape.

Regression coverage (six new test files, all green):
  internal/service/agent_retire_test.go           — 8-step contract + sentinel guards
  internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
  internal/mcp/retire_agent_test.go               — DeleteWithQuery wire-through
  internal/cli/agent_retire_test.go               — --retired listing + --force/--reason pairing
  internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
  internal/domain/connector_test.go               — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies

Files:
  api/openapi.yaml                                — DELETE + 410 + 409 body shape
  cmd/agent/main.go                               — ErrAgentRetired, markRetired, retiredSignal
  cmd/cli/main.go                                 — handleAgents list/get/retire dispatch
  docs/architecture.md, docs/concepts.md,
    docs/testing-guide.md                         — retirement contract narrative
  internal/api/handler/agents.go                  — RetireAgent, status surface, 410 on heartbeat
  internal/api/handler/agent_handler_test.go      — extended coverage
  internal/api/handler/agent_retire_handler_test.go — new
  internal/api/router/router.go                   — /agents/retired before /agents/{id}
  internal/cli/agent_retire_test.go               — new
  internal/cli/client.go                          — ListRetiredAgents + RetireAgent
  internal/domain/connector.go                    — IsRetired, SentinelAgentIDs,
                                                    IsSentinelAgent, AgentDependencyCounts,
                                                    ActorTypeAgent/System
  internal/domain/connector_test.go               — new
  internal/integration/lifecycle_test.go          — retirement fixture
  internal/mcp/client.go                          — DeleteWithQuery additive transport
  internal/mcp/retire_agent_test.go               — new
  internal/mcp/tools.go, internal/mcp/types.go    — retire_agent tool + Force/Reason inputs
  internal/repository/interfaces.go               — AgentRepository retirement methods
  internal/repository/postgres/agent.go           — retire + cascade target retire + counts
  internal/repository/postgres/migration_000015_test.go — new
  internal/service/agent.go                       — wire into AgentService surface
  internal/service/agent_retire.go                — new 8-step contract
  internal/service/agent_retire_test.go           — new
  internal/service/deployment.go                  — skip retired agents
  internal/service/target.go                      — skip retired agents
  internal/service/testutil_test.go               — shared mocks extended
  migrations/000015_agent_retire.up.sql           — new
  migrations/000015_agent_retire.down.sql         — new
  web/src/api/client.ts, types.ts + tests         — retire endpoint wiring
  web/src/pages/AgentsPage.tsx                    — retire UI
2026-04-19 05:24:00 +00:00
shankar0123 a53a4b845b fix(gui,api): close C-001 + C-002 — ownership + agent FK contract
C-001 — CreateCertificate was server-accepted with null owner_id,
team_id, renewal_policy_id because the GUI neither collected the fields
nor enforced them, even though the backend's ManagedCertificate schema
and handler contract treat them as required. Fix the contract at all
four layers:

  - web/src/pages/CertificatesPage.tsx: replace owner_id/team_id free-
    text inputs with <select> elements fed by getOwners/getTeams/
    getPolicies queries; mark all three required; gate the Create
    button on owner_id + team_id + renewal_policy_id being set.
  - internal/api/handler/certificates.go: ValidateRequired for
    owner_id, team_id, renewal_policy_id on CreateCertificate so the
    handler returns HTTP 400 with the offending field name before the
    service layer is reached.
  - internal/mcp/types.go: drop ',omitempty' from
    CreateCertificateInput.RenewalPolicyID so the MCP schema reflects
    the required contract; Update inputs keep partial-update semantics.
  - api/openapi.yaml: 'required: [name, common_name, renewal_policy_id,
    issuer_id, owner_id, team_id]' was already present on the Create
    schema; clarified DeploymentTarget.agent_id description to note the
    FK contract.

C-002 — CreateTargetWizard accepted an empty or bogus agent_id and the
service inserted directly, producing a Postgres 23503 FK-violation that
bubbled out as a generic HTTP 500. The FK itself (migration 000001 line
104: agent_id TEXT NOT NULL REFERENCES agents(id)) is correct; we keep
the schema strict and add validation at three layers:

  - internal/service/target.go: introduce
    ErrAgentNotFound sentinel and pre-validate agent_id in
    TargetService.CreateTarget — empty string returns
    'agent_id is required'; a nonexistent id returns the full
    'referenced agent does not exist: <id>' error. Both wrap
    ErrAgentNotFound via fmt.Errorf %w so callers can use errors.Is.
  - internal/api/handler/targets.go: ValidateRequired on agent_id; map
    errors.Is(err, service.ErrAgentNotFound) to HTTP 400 instead of
    letting it fall through to the generic 500 branch.
  - internal/mcp/types.go: drop ',omitempty' from
    CreateTargetInput.AgentID to match the required contract.
  - web/src/pages/TargetsPage.tsx: replace the free-text Agent ID input
    with a <select> populated from getAgents(); include agent in the
    canProceedToReview gate so Next is disabled until an agent is
    chosen.

Regression coverage (21 new subtests total):

  - TestCreateCertificate_MissingRequiredField_Returns400 — 6 subtests,
    one per required field, each proves the handler guard fires before
    the mock service is called.
  - TestCreateTarget_MissingAgentID_Returns400 — handler guard.
  - TestCreateTarget_NonexistentAgent_Returns400 — pins the
    ErrAgentNotFound -> 400 translation.
  - TestTargetService_CreateTarget_MissingAgentID — errors.Is sentinel.
  - TestTargetService_CreateTarget_NonexistentAgentID — errors.Is.
  - The existing TestTargetService_CreateTarget_Success, along with
    TestCreateTarget_{MissingName,MissingType,NameTooLong}_* handler
    tests, were updated to seed a real agent or include agent_id in
    the request body so the happy paths still run cleanly.

Gates (Phase 4):
  - go build/vet/test/race: green
  - go test -cover: internal/service 68.7% (gate 55%),
    internal/api/handler 78.9% (gate 60%)
  - golangci-lint on service+handler+mcp: 0 issues
  - govulncheck: no reachable vulns
  - tsc --noEmit: clean
  - vitest: 223/223 passing

See cowork/certctl-coverage-gap-audit.md entries C-001 and C-002.
2026-04-18 16:01:40 +00:00
shankar0123 eef1db0f0a fix(policies): stop 400ing the "+ New Policy" button + add per-rule severity (D-005, D-006)
Coverage Gap Audit findings D-005 (P0) + D-006 (P1) fixed together in a
single commit because they share the same root cause — policy CRUD sending
values the backend silently rejects — and splitting them would leave a
half-working UI between commits.

## D-005 (P0): PoliciesPage dropdown 400s every Create Policy

Root cause
----------
`web/src/pages/PoliciesPage.tsx` populated the Type `<select>` from a
hardcoded `['key_algorithm', 'ownership', 'allowed_issuers', ...]` array.
The backend's `internal/api/handler/validators.go::ValidatePolicyType`
enforces the TitleCase allowlist `AllowedIssuers`, `AllowedDomains`,
`RequiredMetadata`, `AllowedEnvironments`, `RenewalLeadTime` — defined in
`internal/domain/policy.go`. Every Create Policy request was rejected with
`400 invalid policy type`. The error surfaced only as a transient toast;
the modal closed anyway. Silent user-visible failure.

Fix
---
- `web/src/api/types.ts`: added `POLICY_TYPES` and `POLICY_SEVERITIES`
  tuples with `as const` and narrowed `PolicyRule.type`, `.severity`, and
  `PolicyViolation.severity` to the literal-union types. Dropdown is now
  sourced from the tuple; casing drift becomes a compile error.
- `web/src/pages/PoliciesPage.tsx`: rekeyed `severityStyles` /
  `severityDots` to the TitleCase values, added `humanize()` for display
  (AllowedIssuers → "Allowed Issuers"), removed the `badge-neutral`
  fallback that was papering over the mismatch.
- `web/src/api/types.test.ts` (new): pins both tuples exactly. If anyone
  edits one side of the frontend/backend contract without the other, CI
  fails with a clear assertion. Pure-TS vitest, no RTL dependency.

## D-006 (P1): `severity` field silently dropped on create/update

Root cause
----------
`PolicyRule` had no `Severity` field in `internal/domain/policy.go`. The
frontend has always sent `severity` on create/update, but Go's
`json.Decoder` (default settings, no `DisallowUnknownFields`) silently
dropped it. The value never reached PostgreSQL. Every rule rendered with
the same severity because there was no severity — just a display
computation downstream.

Fix: option (b), full-stack schema add (not delete-the-field)
-------------------------------------------------------------
- Migration `000013_policy_rule_severity` (up + down): adds
  `severity VARCHAR(50) NOT NULL DEFAULT 'Warning'` to `policy_rules` with
  CHECK constraint `severity IN ('Warning', 'Error', 'Critical')`. No
  index — three-value column on a low-thousands-rows table, planner will
  seq-scan regardless. PG 11+ metadata-only ADD COLUMN, safe on live data.
- `internal/domain/policy.go`: added `Severity PolicySeverity` field.
- `internal/repository/postgres/policy.go`: plumbed `severity` through
  ListRules SELECT + Scan, GetRule SELECT + Scan, CreateRule INSERT,
  UpdateRule UPDATE (4 queries).
- `internal/service/policy.go::UpdatePolicy`: if the client omits
  severity on a PUT (zero-value empty string), fetch the existing rule
  and preserve its severity. Without this, partial updates would trip the
  NOT NULL CHECK and 500. Preserves pre-existing behavior for Name/Type
  (out of scope).
- `internal/api/handler/policies.go::CreatePolicy`: default empty severity
  to `'Warning'`, then validate via `ValidatePolicySeverity`. 400 with
  clear message instead of 500 on CHECK violation. `UpdatePolicy`:
  validates severity only when provided.
- `internal/mcp/types.go` + `internal/mcp/tools.go`: added optional
  `severity` on the MCP `create_policy` / `update_policy` tool inputs so
  LLM callers stay in sync with the wire contract.
- `api/openapi.yaml`: added `severity` to the `PolicyRule` schema with
  the enum and default.

Acceptance criterion (user-defined)
-----------------------------------
"Create a rule with severity=Critical, reload the page, and still see
Critical — no silent drops." Verified end-to-end: frontend sends
`severity: "Critical"`, handler validates, service persists, DB stores,
GET returns, React renders the correct badge.

Seed data
---------
`migrations/seed.sql`: four demo rules now have differentiated severities
— `pr-require-owner` → Warning, `pr-allowed-environments` → Error,
`pr-max-certificate-lifetime` → Critical, `pr-min-renewal-window` →
Warning. The user called out that seeding all four at the same severity
makes the feature look decorative; differentiation demonstrates the
column carries real signal.

## Integration test fix (side effect of D-006)

`internal/integration/e2e_test.go::TestCrossResourceWorkflow/CreatePolicy`
was sending `"severity": "High"` — a value from the pre-audit severity
vocabulary that the new `ValidatePolicySeverity` correctly rejects with
400. Changed to `"Error"` (closest semantic match in the new TitleCase
allowlist). Only severity reference in the integration/ directory;
verified via grep.

## Out of scope, logged for follow-up (d/D-008)

Three policy-engine drift issues orthogonal to D-005 + D-006, explicitly
deferred per direction:

1. `migrations/seed.sql` policy_rules INSERTs use lowercase TYPE values
   (`'ownership'`, `'environment'`, `'lifetime'`, `'renewal_window'`).
   These are load-bearing on `internal/service/policy.go::evaluateRule`'s
   `switch rule.Type` (which also uses the lowercase strings). Migrating
   requires coordinated changes across seed + evaluation engine.
2. `migrations/seed_demo.sql:482-483` contains lowercase `'critical'`
   severity — will now fail the new CHECK constraint. Separate fix.
3. `evaluateRule` hardcodes `Severity: domain.PolicySeverityWarning` on
   emitted violations and ignores the configured `rule.Config`. The new
   severity column is read correctly on the CRUD path but not yet
   consulted during evaluation.

## Verification

Backend:
- `go build ./...` — clean
- `go vet ./...` — clean
- `go test -short ./...` — all packages green, including
  `internal/service` (policy service), `internal/api/handler` (policy +
  MCP handler tests), `internal/integration` (e2e_test.go after fix),
  `internal/domain`, `internal/repository/postgres`.

Frontend:
- `tsc --noEmit` — clean
- `vitest run` — 223/223 passing (4 new assertions in types.test.ts)
- `vite build` — clean (only the pre-existing chunk-size warning)
2026-04-18 13:02:04 +00:00
shankar0123 13cd4d98ba feat(V2.2): bulk revocation — filter-based fleet-wide certificate revocation
Add POST /api/v1/certificates/bulk-revoke with filter criteria (profile_id,
owner_id, agent_id, issuer_id, team_id, certificate_ids), partial-failure
tolerance, and audit trail. Includes MCP tool, CLI command (certs bulk-revoke),
server-side bulk modal in GUI replacing client-side sequential loop, OpenAPI
spec, compliance mapping updates, and 21 new tests (12 service, 7 handler,
1 CLI, 1 frontend).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 00:06:34 -04:00
shankar0123 43a03c168c fix: Go 1.25 upgrade, codebase audit fixes, MCP server tests
Upgrade from Go 1.22 to 1.25 (minimum for MCP SDK, actively supported).
CI updated to match.

Codebase audit fixes:
- Local CA parseIP() now uses net.ParseIP — IP SANs no longer silently dropped
- Nil pointer guards in agent.go GetWorkWithTargets for target/cert enrichment
- MCP CreateCertificateInput marks owner_id/team_id as required
- NGINX connector uses CombinedOutput() — captures diagnostic output on failure
- Jobs handler validates JSON decode on rejection body — returns 400 on malformed
- CRL/OCSP handlers propagate requestID for error tracing

MCP server tests (26 tests):
- client_test.go: HTTP client coverage (GET/POST/PUT/DELETE, auth, 204, errors, binary)
- tools_test.go: tool registration, pagination, end-to-end flows with mock API

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-23 17:36:25 -04:00
shankar0123 956230aec1 feat: M18a — MCP server exposing all 76 API endpoints as AI-native tools
Separate standalone binary (cmd/mcp-server/) using official MCP Go SDK
(modelcontextprotocol/go-sdk v1.4.1) with stdio transport. Stateless HTTP
proxy translates MCP tool calls to certctl REST API requests. 76 tools
across 16 resource domains with typed input structs and jsonschema tags
for automatic LLM-friendly schema generation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-23 16:49:39 -04:00