From 49002c8cbaa782d381e954ab724422e66e06e9bb Mon Sep 17 00:00:00 2001 From: Shankar Reddy Date: Sun, 19 Apr 2026 05:24:00 +0000 Subject: [PATCH] Close I-004 (agent hard-delete cascades targets) coverage-gap finding MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Operator decision answered as full soft-delete with optional forced cascade — hard-delete is not reachable from any public surface. Prior to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents` whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id silently wiped every target, orphaning certs and aborting in-flight jobs. The finding closure reshapes the agent-removal contract around soft retirement with explicit preflight counts, an opt-in cascade gated by a mandatory reason, and unconditional protection for the four reserved sentinel agents used by discovery sources. Schema — migration 000015: migrations/000015_agent_retire.up.sql flips deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE RESTRICT, so a stray `DELETE FROM agents` now errors at the DB boundary instead of quietly destroying targets. Both `agents` and `deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason TEXT pair (TEXT not VARCHAR so operator comments are never truncated), indexed via partial indexes WHERE retired_at IS NOT NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT EXISTS) so repeated runs against partially-migrated databases converge. migrations/000015_agent_retire.down.sql restores CASCADE and drops the new columns for clean rollback. A dedicated repository-layer testcontainers test (internal/repository/postgres/migration_000015_test.go) asserts the before/after FK action, column presence, index presence, and round-trip idempotency under up→down→up. Domain — sentinel guard + dependency counts: internal/domain/connector.go gains IsRetired() on Agent, the exported SentinelAgentIDs slice listing server-scanner, cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the four reserved IDs documented in CLAUDE.md and created at startup in cmd/server/main.go), IsSentinelAgent(id string) predicate, AgentDependencyCounts{ActiveTargets, ActiveCertificates, PendingJobs} with a HasDependencies() method, and ActorTypeAgent / ActorTypeSystem enum values used by audit emission downstream. Coverage locked down by internal/domain/connector_test.go. Service — 8-step ordered contract: internal/service/agent_retire.go:RetireAgent(ctx, id, actor, opts{Force, Reason}) enforces a fixed execution order: (1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel unconditionally; force=true does NOT bypass it. (2) fetch — ErrAgentNotFound on miss. (3) idempotency — if IsRetired() already, return AgentRetirementResult{AlreadyRetired: true} with no new audit event and no state change (safe to replay from flaky clients). (4) preflight counts — collectAgentDependencyCounts runs ActiveTargets, ActiveCertificates, PendingJobs sequentially (not in parallel; keeps the per-query timeout predictable and matches the repo's existing call-chain shape). (5) force-reason guard — opts.Force=true with empty Reason returns ErrForceReasonRequired (wired into the 400 status surface). (6) dependency guard — HasDependencies() with opts.Force=false returns BlockedByDependenciesError{Counts} (wired into the 409 body with per-bucket counts). (7) mutation — single pinned retiredAt := time.Now(); agent retirement first, then cascade target retirement if opts.Force, all under the repo's single transaction so the two retired_at stamps match to the second. (8) best-effort audit — agent_retired always; agent_retirement_ cascaded additionally on the force path. Actor is whatever the handler resolves from the request; actor type is mapped by resolveActorType (system/agent-prefix→Agent/else→User). Audit emission failures are logged via slog.Error but do not abort the retirement (matches the house convention used by every other scheduler-emitted event). BlockedByDependenciesError implements Error() as "active_targets=%d, active_certificates=%d, pending_jobs=%d" and Unwrap() → ErrBlockedByDependencies. The single struct satisfies errors.Is via Unwrap (used by scheduler-level tests) and errors.As via the concrete type (used by the handler to fish out Counts for the 409 body). ListRetiredAgents(page, perPage) adds a separate paginated accessor with page<1→1 and perPage<1→50 normalization so retired rows are queryable without polluting the default agent listing. Sentinel guard coverage is asymmetric by design: all four reserved IDs are protected, and force=true cannot override. Regression tests in internal/service/agent_retire_test.go assert each of the eight steps in order, plus sentinel bypass attempts and idempotency replay. Handler + router — status-code surface: internal/api/handler/agents.go:RetireAgent exposes seven status codes on DELETE /agents/{id}: 200 on a fresh retirement (body echoes AgentRetirementResult). 204 on idempotent replay (AlreadyRetired=true; no new audit). 400 on ErrForceReasonRequired. 403 on ErrAgentIsSentinel. 404 on ErrAgentNotFound. 409 on BlockedByDependenciesError, with a custom body shape {error, counts{active_targets, active_certificates, pending_jobs}} that bypasses the default ErrorWithRequestID envelope so callers get the per-bucket numbers directly. 500 on any other error. Heartbeat HandleHeartbeat returns 410 Gone when the agent is retired (ErrAgentRetired), signalling the agent to shut down. Query params `force=true` and `reason=` drive the cascade path; both are forwarded as url.Values through the new MCP transport. internal/api/router/router.go registers GET /api/v1/agents/retired literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's literal-beats-pattern-var precedence routes "retired" to the paginated retired-agents listing instead of fetching a hypothetical agent named "retired". Agent binary — clean shutdown on 410: cmd/agent/main.go gains the ErrAgentRetired sentinel, a retiredOnce sync.Once, and a retiredSignal chan struct{}. A markRetired(source, statusCode, body) helper closes the channel exactly once; the Run() select loop observes the close and returns ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired) and exits cleanly instead of spinning in the heartbeat retry loop. The 410 Gone surface is therefore terminal for the agent process. MCP transport: internal/mcp/client.go adds Client.DeleteWithQuery(path, query), a new additive transport method. Client.Delete is path-only; without this method the retire tool would silently drop `force` and `reason`, turning every cascade retire into a default soft-retire. The new method shares do()'s 204 normalization and 4xx/5xx error propagation so tool authors get one contract. internal/mcp/tools.go + internal/mcp/types.go expose the retire_agent tool with Force+Reason inputs wired through DeleteWithQuery. CLI: cmd/cli/main.go + internal/cli/client.go add two CLI surfaces: `agents list --retired` (client-side strip of --retired then delegation to ListRetiredAgents, sharing --page/--per-page parsing with the default listing) and `agents retire [--force --reason "…"]` (mirrors ErrForceReasonRequired — force without reason is rejected client-side before the request is sent). JSON + table output modes both honor the new columns. Frontend: web/src/pages/AgentsPage.tsx surfaces retired/retire affordances. web/src/api/client.ts + web/src/api/types.ts expose the retire endpoint and the retired-listing. 4 new Vitest regression cases. OpenAPI: api/openapi.yaml documents DELETE /agents/{id} with all seven status codes, 410 on heartbeat, and the 409 per-bucket body shape. Regression coverage (six new test files, all green): internal/service/agent_retire_test.go — 8-step contract + sentinel guards internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat internal/mcp/retire_agent_test.go — DeleteWithQuery wire-through internal/cli/agent_retire_test.go — --retired listing + --force/--reason pairing internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down internal/domain/connector_test.go — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies Files: api/openapi.yaml — DELETE + 410 + 409 body shape cmd/agent/main.go — ErrAgentRetired, markRetired, retiredSignal cmd/cli/main.go — handleAgents list/get/retire dispatch docs/architecture.md, docs/concepts.md, docs/testing-guide.md — retirement contract narrative internal/api/handler/agents.go — RetireAgent, status surface, 410 on heartbeat internal/api/handler/agent_handler_test.go — extended coverage internal/api/handler/agent_retire_handler_test.go — new internal/api/router/router.go — /agents/retired before /agents/{id} internal/cli/agent_retire_test.go — new internal/cli/client.go — ListRetiredAgents + RetireAgent internal/domain/connector.go — IsRetired, SentinelAgentIDs, IsSentinelAgent, AgentDependencyCounts, ActorTypeAgent/System internal/domain/connector_test.go — new internal/integration/lifecycle_test.go — retirement fixture internal/mcp/client.go — DeleteWithQuery additive transport internal/mcp/retire_agent_test.go — new internal/mcp/tools.go, internal/mcp/types.go — retire_agent tool + Force/Reason inputs internal/repository/interfaces.go — AgentRepository retirement methods internal/repository/postgres/agent.go — retire + cascade target retire + counts internal/repository/postgres/migration_000015_test.go — new internal/service/agent.go — wire into AgentService surface internal/service/agent_retire.go — new 8-step contract internal/service/agent_retire_test.go — new internal/service/deployment.go — skip retired agents internal/service/target.go — skip retired agents internal/service/testutil_test.go — shared mocks extended migrations/000015_agent_retire.up.sql — new migrations/000015_agent_retire.down.sql — new web/src/api/client.ts, types.ts + tests — retire endpoint wiring web/src/pages/AgentsPage.tsx — retire UI --- api/openapi.yaml | 227 +++++++++- cmd/agent/main.go | 100 +++++ cmd/cli/main.go | 40 +- docs/architecture.md | 10 + docs/concepts.md | 2 + docs/testing-guide.md | 225 ++++++++++ internal/api/handler/agent_handler_test.go | 25 ++ .../api/handler/agent_retire_handler_test.go | 393 +++++++++++++++++ internal/api/handler/agents.go | 199 +++++++++ internal/api/router/router.go | 12 + internal/cli/agent_retire_test.go | 228 ++++++++++ internal/cli/client.go | 217 ++++++++++ internal/domain/connector.go | 63 +++ internal/domain/connector_test.go | 55 +++ internal/integration/lifecycle_test.go | 50 +++ internal/mcp/client.go | 10 + internal/mcp/retire_agent_test.go | 214 ++++++++++ internal/mcp/tools.go | 47 +++ internal/mcp/types.go | 17 + internal/repository/interfaces.go | 94 ++++- internal/repository/postgres/agent.go | 246 ++++++++++- .../postgres/migration_000015_test.go | 220 ++++++++++ internal/service/agent.go | 15 + internal/service/agent_retire.go | 317 ++++++++++++++ internal/service/agent_retire_test.go | 396 ++++++++++++++++++ internal/service/deployment.go | 25 ++ internal/service/target.go | 25 +- internal/service/testutil_test.go | 167 +++++++- migrations/000015_agent_retire.down.sql | 32 ++ migrations/000015_agent_retire.up.sql | 56 +++ web/src/api/client.test.ts | 109 +++++ web/src/api/client.ts | 94 ++++- web/src/api/types.test.ts | 74 ++++ web/src/api/types.ts | 37 ++ web/src/pages/AgentsPage.tsx | 392 ++++++++++++++++- 35 files changed, 4400 insertions(+), 33 deletions(-) create mode 100644 internal/api/handler/agent_retire_handler_test.go create mode 100644 internal/cli/agent_retire_test.go create mode 100644 internal/domain/connector_test.go create mode 100644 internal/mcp/retire_agent_test.go create mode 100644 internal/repository/postgres/migration_000015_test.go create mode 100644 internal/service/agent_retire.go create mode 100644 internal/service/agent_retire_test.go create mode 100644 migrations/000015_agent_retire.down.sql create mode 100644 migrations/000015_agent_retire.up.sql diff --git a/api/openapi.yaml b/api/openapi.yaml index d0e660f..ffbb8dc 100644 --- a/api/openapi.yaml +++ b/api/openapi.yaml @@ -880,6 +880,40 @@ paths: "500": $ref: "#/components/responses/InternalError" + /api/v1/agents/retired: + get: + tags: [Agents] + summary: List retired agents + description: | + I-004: opt-in listing of soft-retired agents. The default + `GET /api/v1/agents` endpoint filters retired rows out; this is the + dedicated surface for reading them back (e.g., the operator UI's + "Retired" tab, audit and forensics workflows). Pagination defaults + match the default agent listing (page=1, per_page=50, max 500). Go + 1.22's enhanced ServeMux routes `/agents/retired` to this handler + via the literal-beats-pattern-var precedence rule, so the sibling + `/agents/{id}` route does not shadow it. + operationId: listRetiredAgents + parameters: + - $ref: "#/components/parameters/page" + - $ref: "#/components/parameters/per_page" + responses: + "200": + description: Paginated list of retired agents + content: + application/json: + schema: + allOf: + - $ref: "#/components/schemas/PaginationEnvelope" + - type: object + properties: + data: + type: array + items: + $ref: "#/components/schemas/Agent" + "500": + $ref: "#/components/responses/InternalError" + /api/v1/agents/{id}: get: tags: [Agents] @@ -900,12 +934,116 @@ paths: $ref: "#/components/responses/NotFound" "500": $ref: "#/components/responses/InternalError" + delete: + tags: [Agents] + summary: Soft-retire agent + description: | + I-004: soft-retirement. The agent row is preserved (so its audit + trail and historical job links remain intact) and `retired_at` is + stamped. A retired agent receives `410 Gone` on subsequent + heartbeats so it can shut down cleanly. + + Behavior matrix: + + | Scenario | Query | Status | Body | + | --- | --- | --- | --- | + | Clean retire (no active dependencies) | none | `200` | `RetireAgentResponse` with `cascade=false`, zero counts | + | Blocked by active targets/certs/jobs | none | `409` | `BlockedByDependenciesResponse` with per-bucket counts | + | Force-cascade retire | `force=true&reason=...` | `200` | `RetireAgentResponse` with `cascade=true`, pre-cascade counts | + | Idempotent re-retire | either | `204` | (empty — downstream consumers break on stray bodies) | + | `force=true` without reason | `force=true` | `400` | ErrorResponse (ErrForceReasonRequired) | + | Reserved sentinel agent | any | `403` | ErrorResponse (ErrAgentIsSentinel) | + | Unknown agent id | any | `404` | ErrorResponse | + + Sentinel agents are the four reserved identities backing non-agent + discovery subsystems (`server-scanner`, `cloud-aws-sm`, + `cloud-azure-kv`, `cloud-gcp-sm`). Retiring them would orphan the + scanner or a cloud secret-manager source, so the handler refuses + unconditionally — even with `force=true`. + operationId: retireAgent + parameters: + - $ref: "#/components/parameters/resourceId" + - name: force + in: query + required: false + schema: + type: boolean + default: false + description: | + Cascade-retire active downstream targets, certificates, and + jobs. When `true`, a non-empty `reason` is required. A + malformed value (anything strconv.ParseBool rejects) is + silently treated as `false` so a typoed query can never + accidentally enable the cascade. + - name: reason + in: query + required: false + schema: + type: string + description: | + Human-readable reason recorded on the retired row and in the + immutable audit trail. Required (non-empty after trimming) + when `force=true`. + responses: + "200": + description: | + Agent retired (clean retire or successful force-cascade). Body + is `RetireAgentResponse`. + content: + application/json: + schema: + $ref: "#/components/schemas/RetireAgentResponse" + "204": + description: | + Idempotent retire — the agent was already retired. Response + body is empty (the 200-path shape does not apply, and + downstream clients that tee responses into dashboards would + break on spurious bodies). + "400": + description: | + `force=true` was sent without a non-empty `reason` + (ErrForceReasonRequired). + content: + application/json: + schema: + $ref: "#/components/schemas/ErrorResponse" + "403": + description: | + Agent is a reserved sentinel and cannot be retired even with + `?force=true` (ErrAgentIsSentinel). + content: + application/json: + schema: + $ref: "#/components/schemas/ErrorResponse" + "404": + $ref: "#/components/responses/NotFound" + "409": + description: | + Blocked by active downstream dependencies. Body carries + per-bucket counts so the operator UI can show the user which + dependency is holding up the retire. Re-run with + `?force=true&reason=...` to cascade. + content: + application/json: + schema: + $ref: "#/components/schemas/BlockedByDependenciesResponse" + "405": + description: Method not allowed (only DELETE, GET are routed to this path) + "500": + $ref: "#/components/responses/InternalError" /api/v1/agents/{id}/heartbeat: post: tags: [Agents] summary: Agent heartbeat - description: Reports agent liveness and metadata (OS, architecture, IP, version). + description: | + Reports agent liveness and metadata (OS, architecture, IP, version). + + I-004: a retired agent still polling the heartbeat endpoint receives + `410 Gone` so `cmd/agent` detects the terminal signal and shuts down + cleanly instead of looping forever against a decommissioned identity. + The retired-agent check runs before any "not found" string match so + it can never be masked by a sibling error branch. operationId: agentHeartbeat parameters: - $ref: "#/components/parameters/resourceId" @@ -936,6 +1074,14 @@ paths: $ref: "#/components/responses/BadRequest" "404": $ref: "#/components/responses/NotFound" + "410": + description: | + I-004: the agent has been soft-retired. The agent process should + treat this as a terminal signal and shut down cleanly. + content: + application/json: + schema: + $ref: "#/components/schemas/ErrorResponse" "500": $ref: "#/components/responses/InternalError" @@ -3373,6 +3519,85 @@ components: type: string version: type: string + retired_at: + type: string + format: date-time + nullable: true + description: | + I-004: soft-retirement timestamp. `null` (or field absent) means the + agent is active. A non-null value is the canonical "retired" state — + the operational `status` column is preserved at retirement time as + the last-seen value, but `retired_at` is the source of truth for + filtering agents out of active listings. + retired_reason: + type: string + nullable: true + description: | + I-004: human-readable reason captured at retirement time. Only set + when the agent was retired via `?force=true&reason=...` cascade; a + default soft-retire leaves this field null. + + AgentDependencyCounts: + type: object + description: | + I-004: preflight counts of active downstream rows that would be + orphaned by retiring an agent. Returned in the 409 + `blocked_by_dependencies` body so the operator UI can tell the user + which bucket is blocking the retire, and also in the 200 response + body on a successful `?force=true` cascade as a snapshot of what + was cascaded. + properties: + active_targets: + type: integer + description: Deployment targets with this agent assigned and retired_at IS NULL + active_certificates: + type: integer + description: Certificates currently deployed via one of this agent's active targets + pending_jobs: + type: integer + description: Jobs with agent_id=this in status Pending, AwaitingCSR, AwaitingApproval, or Running + + RetireAgentResponse: + type: object + description: | + I-004: response body for a successful retire on DELETE /api/v1/agents/{id}. + Returned on both clean retires (cascade=false, zero counts) and + force-cascade retires (cascade=true, counts snapshot of the + pre-cascade dependency state). The 204 idempotent-retire path does + NOT emit this body — re-retiring an already-retired agent returns + an empty response. + properties: + retired_at: + type: string + format: date-time + already_retired: + type: boolean + description: | + Always false on the 200 response — the already-retired path + returns 204 No Content with no body. Surfaced in the schema + only so downstream consumers have a complete field map. + cascade: + type: boolean + description: True when the retire was invoked with ?force=true + counts: + $ref: "#/components/schemas/AgentDependencyCounts" + + BlockedByDependenciesResponse: + type: object + description: | + I-004: 409 response body for a retire request blocked by active + downstream dependencies. Returned when `force=true` is not set and + any of the three counts is non-zero. The operator UI renders these + counts so the human can retire or reassign the blocking rows + before re-running the retire, or tick the force checkbox to cascade. + properties: + error: + type: string + example: blocked_by_dependencies + message: + type: string + counts: + $ref: "#/components/schemas/AgentDependencyCounts" WorkItem: type: object diff --git a/cmd/agent/main.go b/cmd/agent/main.go index a18327c..1196991 100644 --- a/cmd/agent/main.go +++ b/cmd/agent/main.go @@ -12,6 +12,7 @@ import ( "crypto/x509/pkix" "encoding/json" "encoding/pem" + "errors" "flag" "fmt" "io" @@ -23,6 +24,7 @@ import ( "path/filepath" "runtime" "strings" + "sync" "syscall" "time" @@ -53,6 +55,16 @@ type AgentConfig struct { DiscoveryDirs []string // Directories to scan for certificates (comma-separated via env) } +// ErrAgentRetired is the sentinel returned by [Agent.Run] when the control +// plane responds with HTTP 410 Gone to a heartbeat or work-poll request — the +// canonical signal that this agent's row has been soft-retired server-side +// (see I-004 in cowork/certctl-coverage-gap-audit.md). The binary must +// terminate cleanly: an init-system restart would only produce another 410 +// and wedge the host in a restart loop. main() translates this sentinel into +// a zero exit code so systemd (Restart=on-failure) and launchd do not respawn +// the process. Do not wrap this error — main() matches it with errors.Is. +var ErrAgentRetired = fmt.Errorf("agent retired by control plane") + // Agent represents the local agent that runs on target servers. // It periodically sends heartbeats, polls for work, executes deployment and CSR jobs, // and scans configured directories for existing certificates. @@ -68,6 +80,17 @@ type Agent struct { pollInterval time.Duration discoveryInterval time.Duration consecutiveFailures int + + // I-004: terminal retirement signal. retiredSignal is closed exactly once + // (guarded by retiredOnce) when either sendHeartbeat or pollForWork + // observes HTTP 410 Gone. The Run() select loop picks up the close and + // returns ErrAgentRetired, unwinding the goroutine cleanly so main() can + // log + exit(0). Using a channel + sync.Once (rather than an atomic bool + // + polling) lets us fall through the select statement immediately instead + // of waiting for the next ticker; the zero-allocation close is safe to + // race with ctx.Done() and other cases. + retiredOnce sync.Once + retiredSignal chan struct{} } // WorkResponse represents the response from the work polling endpoint. @@ -98,9 +121,31 @@ func NewAgent(cfg *AgentConfig, logger *slog.Logger) *Agent { heartbeatInterval: 60 * time.Second, pollInterval: 30 * time.Second, discoveryInterval: 6 * time.Hour, // scan for certs every 6 hours + retiredSignal: make(chan struct{}), } } +// markRetired records that the control plane has declared this agent retired +// (HTTP 410 Gone on heartbeat or work poll). Idempotent via sync.Once — if +// both the heartbeat and work-poll paths observe 410 in the same tick, only +// the first close() runs and we avoid a runtime panic. Emits an ERROR-level +// log line so init-system journaling captures it prominently, and includes +// the source (heartbeat/work_poll), response body, and status code so the +// operator can verify it's a genuine retirement signal rather than a +// misrouted request. After this returns, the select-loop case in Run() +// observes the closed channel on its next iteration and returns +// ErrAgentRetired. +func (a *Agent) markRetired(source string, statusCode int, body string) { + a.retiredOnce.Do(func() { + a.logger.Error("agent has been retired by control plane — shutting down", + "source", source, + "status", statusCode, + "body", body, + "agent_id", a.config.AgentID) + close(a.retiredSignal) + }) +} + // Run starts the agent's main loop. // It sends heartbeats, polls for work, and handles graceful shutdown via context cancellation. func (a *Agent) Run(ctx context.Context) error { @@ -154,6 +199,19 @@ func (a *Agent) Run(ctx context.Context) error { a.logger.Info("agent shutting down", "reason", ctx.Err()) return ctx.Err() + // I-004: retiredSignal is closed exactly once (via markRetired's + // sync.Once) when either sendHeartbeat or pollForWork observes HTTP 410 + // Gone from the control plane. Falling through this case immediately + // (rather than waiting for the next ticker) lets the agent shut down + // quickly once retirement is confirmed — every extra heartbeat against a + // retired row is wasted work and noise in the audit trail. Returning + // ErrAgentRetired propagates up to main(), which matches it with + // errors.Is and exits(0) so systemd/launchd do not respawn the process. + case <-a.retiredSignal: + a.logger.Info("agent retired signal received — exiting event loop", + "agent_id", a.config.AgentID) + return ErrAgentRetired + case <-heartbeatTicker.C: a.sendHeartbeat(ctx) @@ -209,6 +267,22 @@ func (a *Agent) sendHeartbeat(ctx context.Context) { } defer resp.Body.Close() + // I-004: HTTP 410 Gone is the terminal signal from the control plane that + // this agent's row has been soft-retired (see internal/api/handler/agent.go + // heartbeat path + AgentRetirementService). Treat it separately from the + // generic non-200 error branch: record the event to markRetired (which closes + // retiredSignal exactly once via sync.Once) and return without bumping + // consecutiveFailures — this is not a transient failure, it's a clean + // shutdown. The Run() select loop picks up the closed channel on its next + // iteration and returns ErrAgentRetired, which main() translates into an + // exit(0) so systemd/launchd don't respawn the process into another 410 + // loop. + if resp.StatusCode == http.StatusGone { + body, _ := io.ReadAll(resp.Body) + a.markRetired("heartbeat", resp.StatusCode, string(body)) + return + } + if resp.StatusCode != http.StatusOK { body, _ := io.ReadAll(resp.Body) a.logger.Error("heartbeat rejected", @@ -237,6 +311,19 @@ func (a *Agent) pollForWork(ctx context.Context) { } defer resp.Body.Close() + // I-004: same terminal-retirement handling as sendHeartbeat. Work-poll is the + // other hot path that can observe an agent's soft-retirement; if the + // heartbeat tick happens to fire after a work-poll tick within the same + // retirement window, this branch catches it first. markRetired's sync.Once + // guards idempotency so racing both paths in the same tick only closes the + // signal channel once. No consecutiveFailures increment — retirement is + // not a transient failure. + if resp.StatusCode == http.StatusGone { + body, _ := io.ReadAll(resp.Body) + a.markRetired("work_poll", resp.StatusCode, string(body)) + return + } + if resp.StatusCode != http.StatusOK { body, _ := io.ReadAll(resp.Body) a.logger.Error("work poll rejected", @@ -1117,6 +1204,19 @@ func main() { cancel() <-errChan case err := <-errChan: + // I-004: ErrAgentRetired is a terminal, *clean* shutdown — the control + // plane responded HTTP 410 Gone on heartbeat/work-poll, meaning this + // agent's row has been soft-retired and will never be reachable again. + // Exit 0 so systemd's Restart=on-failure and launchd's KeepAlive do NOT + // respawn the process into another 410 loop (which would wedge the host + // and spam the control plane). Operators can observe the retirement via + // audit_events or the AgentsPage retired tab; the terminal log line on + // the way out is enough for post-mortem forensics. + if errors.Is(err, ErrAgentRetired) { + logger.Info("agent retired by control plane — exiting without restart", + "agent_id", agentCfg.AgentID) + return + } if err != context.Canceled { logger.Error("agent error", "error", err) os.Exit(1) diff --git a/cmd/cli/main.go b/cmd/cli/main.go index 3b68a27..c4bf63e 100644 --- a/cmd/cli/main.go +++ b/cmd/cli/main.go @@ -27,8 +27,9 @@ Commands: certs renew ID Trigger certificate renewal certs revoke ID Revoke a certificate - agents list List agents - agents get ID Get agent details + agents list List agents (add --retired to list soft-retired agents) + agents get ID Get agent details + agents retire ID Soft-retire an agent (add --force --reason "…" to cascade) jobs list List jobs jobs get ID Get job details @@ -140,9 +141,19 @@ func handleCerts(client *cli.Client, args []string) error { } } +// handleAgents dispatches the `agents` subcommands. +// +// I-004 additions: +// +// agents list --retired — hit the opt-in /agents/retired endpoint +// instead of the default listing (which +// filters retired rows out). +// agents retire — soft-retire an agent (DELETE /agents/{id}). +// --force cascades; --reason is required with +// --force (mirrors ErrForceReasonRequired). func handleAgents(client *cli.Client, args []string) error { if len(args) == 0 { - fmt.Fprintf(os.Stderr, "usage: agents [options]\n") + fmt.Fprintf(os.Stderr, "usage: agents [options]\n") return nil } @@ -151,13 +162,34 @@ func handleAgents(client *cli.Client, args []string) error { switch subcommand { case "list": - return client.ListAgents(subArgs) + // --retired flag splits to a separate endpoint. We intercept it + // client-side and strip it before delegating, so both code paths + // share the --page/--per-page flag parsing inside the client. + retired := false + rest := make([]string, 0, len(subArgs)) + for _, a := range subArgs { + if a == "--retired" { + retired = true + continue + } + rest = append(rest, a) + } + if retired { + return client.ListRetiredAgents(rest) + } + return client.ListAgents(rest) case "get": if len(subArgs) == 0 { fmt.Fprintf(os.Stderr, "usage: agents get \n") return nil } return client.GetAgent(subArgs[0]) + case "retire": + if len(subArgs) == 0 { + fmt.Fprintf(os.Stderr, "usage: agents retire [--force] [--reason ]\n") + return nil + } + return client.RetireAgent(subArgs) default: fmt.Fprintf(os.Stderr, "unknown subcommand: agents %s\n", subcommand) return nil diff --git a/docs/architecture.md b/docs/architecture.md index 96404fb..71c6c93 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -139,6 +139,16 @@ The agent runs two background loops: a heartbeat (every 60 seconds) to signal it **Agent groups (M11b):** Dynamic device grouping allows organizing agents by metadata criteria. Agent groups can match by OS, architecture, IP CIDR, and version. Groups support both dynamic matching (agents automatically join when criteria match) and manual membership (explicit include/exclude). Renewal policies can be scoped to agent groups via the `agent_group_id` foreign key. The GUI provides full CRUD management for agent groups with visual match criteria badges. +**Agent soft-retirement (I-004):** `DELETE /api/v1/agents/{id}` is a soft-delete surface — the row is never removed. Retirement stamps `agents.retired_at` (TIMESTAMPTZ) and `agents.retired_reason` (TEXT) and flips the operational status to `Offline`. Default listings (`GET /api/v1/agents`, the dashboard stats counter, and the stale-offline sweeper) filter retired rows out via `AgentRepository.ListActive`; retired rows are surfaced only through the opt-in `GET /api/v1/agents/retired` view. The endpoint follows a preflight → block → escape-hatch contract: + +- **Clean retire** (no active dependencies) — `200 OK` with `RetireAgentResponse` (`cascade=false`, zero counts). +- **Blocked by active dependencies** — `409 Conflict` with `BlockedByDependenciesResponse`. The three counts (`active_targets`, `active_certificates`, `pending_jobs`) tell the operator exactly which rows would be orphaned. The schema diverges from `ErrorResponse` because downstream dashboards parse the stable three-key shape. +- **Force cascade** — `DELETE /api/v1/agents/{id}?force=true&reason=...`. `reason` is required (400 otherwise). Transactionally soft-retires downstream `deployment_targets`, cancels pending jobs, and soft-retires the agent, emitting an `agent_retirement_cascaded` audit event with actor + reason + per-bucket counts. +- **Idempotent re-retire** — a retire attempt against an already-retired agent returns `204 No Content` with an empty body (no second audit event, no response shape — callers that POST again on a retry get a clean no-op). +- **Sentinel refusal** — the four sentinel agent IDs (`server-scanner`, `cloud-aws-sm`, `cloud-azure-kv`, `cloud-gcp-sm`) back non-agent discovery subsystems (the network scanner and the three cloud secret-manager sources). They are refused unconditionally — even with `force=true` — via `ErrAgentIsSentinel` → `403 Forbidden`. The ID list lives in `internal/domain/connector.go` (`SentinelAgentIDs`) so handler, repository, and scheduler code can filter them without importing `service`. + +Retired agents receive `410 Gone` on subsequent heartbeats (`service.ErrAgentRetired`). `cmd/agent` treats 410 as a terminal signal and exits cleanly so retired agents stop phoning home. Migration `000015` flipped `deployment_targets.agent_id` from `ON DELETE CASCADE` to `ON DELETE RESTRICT`, making the old hard-delete path a schema error and forcing all retirement through this contract. + ### Web Dashboard The web dashboard is the primary operational interface for certctl. It is built with Vite + React + TypeScript and uses TanStack Query for server state management (caching, background refetching, optimistic updates). diff --git a/docs/concepts.md b/docs/concepts.md index 5365937..3cb3c69 100644 --- a/docs/concepts.md +++ b/docs/concepts.md @@ -123,6 +123,8 @@ At no point does the private key leave the agent. This is a fundamental security Agents also report **metadata** about themselves — their operating system, CPU architecture, IP address, hostname, and version — with every heartbeat. This gives ops teams fleet-wide visibility (e.g., "how many agents are running on ARM?", "which agents are still on v1.0.0?") and powers **agent groups** — dynamic device grouping where policies can be scoped to specific agent criteria like OS type, architecture, or network subnet. +**Retiring an agent.** When you decommission a server, the certctl record for its agent needs to be retired, not deleted. certctl uses a **soft-delete** model: `DELETE /api/v1/agents/{id}` stamps the row with a retired-at timestamp and a reason, instead of removing it. This is deliberate — an audit trail of "who owned this certificate, on which host, for which team" stays intact forever, and the downstream deployment_targets, certificates, and jobs keep valid foreign keys. Retired agents are filtered out of default list views and the dashboard's agent counter, but remain visible through a separate retired-agents view for compliance reconciliation. If the agent still has active deployment targets, deployed certificates, or pending jobs, retirement is blocked by default so you don't silently orphan those rows; the API responds with the exact counts so you can retire or reassign each dependency explicitly. A force-retire escape hatch (`?force=true&reason=...`) is available for true decommission scenarios — it transactionally retires the downstream targets, cancels pending jobs, and records the cascade in the audit trail with the reason you provided. Four internal sentinel agents that back the network scanner and the cloud secret-manager discovery sources cannot be retired at all, even with force, because retiring them would orphan their subsystems. Once retired, an agent that still attempts to heartbeat receives `410 Gone` — the agent process reads that as "you've been retired, shut down" and exits cleanly. + ### Deployment Targets Targets are the systems where certificates actually get installed — NGINX web servers, Apache httpd servers, HAProxy load balancers, Traefik reverse proxies, Caddy servers, Envoy gateways, Postfix/Dovecot mail servers, Microsoft IIS servers, and network appliances. Each target type has a **connector** that knows how to deploy certificates to that specific system (e.g., writing files and reloading NGINX or Apache config, building a combined PEM for HAProxy). diff --git a/docs/testing-guide.md b/docs/testing-guide.md index e48df37..5487243 100644 --- a/docs/testing-guide.md +++ b/docs/testing-guide.md @@ -6587,6 +6587,231 @@ helm template certctl deploy/helm/certctl/ --set server.replicaCount=3 | grep 'r --- +## Part 55: Agent Soft-Retirement (I-004) + +**What this validates:** The full `DELETE /api/v1/agents/{id}` soft-retirement contract — seven HTTP status codes (200/204/400/403/404/405/409/500), opt-in retired-agent listing, sentinel refusal, `410 Gone` heartbeat response, and the force-cascade escape hatch. + +**Why it matters:** Before I-004, there was no retirement surface at all — `DELETE` did not exist and agents could only be removed via raw SQL against the `agents` table. Worse, the schema declared `deployment_targets.agent_id ON DELETE CASCADE`, so any such manual delete silently cascaded through four tables with zero audit trail. This part pins the replacement contract (soft-delete + preflight + force-cascade + sentinel guard + heartbeat 410) so regressions show up here first rather than as orphaned targets in production. + +### 55.1 Migration 000015 Applied + +```bash +docker compose -f deploy/docker-compose.yml exec postgres \ + psql -U certctl -d certctl -c \ + "SELECT column_name FROM information_schema.columns WHERE table_name='agents' AND column_name IN ('retired_at','retired_reason') ORDER BY column_name;" +``` + +**What:** Confirms migration 000015 added the archival columns to the `agents` table. +**PASS if** both `retired_at` and `retired_reason` rows are returned. **FAIL** if either is missing (migration did not apply). + +--- + +### 55.2 FK Constraint Flipped to RESTRICT + +```bash +docker compose -f deploy/docker-compose.yml exec postgres \ + psql -U certctl -d certctl -c \ + "SELECT confdeltype FROM pg_constraint WHERE conname='deployment_targets_agent_id_fkey';" +``` + +**What:** `confdeltype` is PostgreSQL's one-character code for the FK delete action: `r` = RESTRICT, `c` = CASCADE. +**PASS if** the value is `r`. **FAIL** if it is still `c` — that means migration 000015's FK flip did not run, and a hard `DELETE` against an agent row would silently cascade. + +--- + +### 55.3 Clean Retire — 200 + +```bash +curl -sS -X DELETE "http://localhost:8443/api/v1/agents/ag-test-clean" \ + -H "Authorization: Bearer ${CERTCTL_API_KEY}" \ + -w "\nHTTP %{http_code}\n" +``` + +**What:** Retires an agent that has no active deployment targets, no deployed certificates, and no pending jobs. +**PASS if** status code is `200` and response body includes `"retired_at":""`, `"cascade":false`, and zero-valued counts. + +--- + +### 55.4 Idempotent Re-Retire — 204 + +```bash +curl -sS -X DELETE "http://localhost:8443/api/v1/agents/ag-test-clean" \ + -H "Authorization: Bearer ${CERTCTL_API_KEY}" \ + -w "\nHTTP %{http_code}\n" +``` + +**What:** Retires an agent that is already retired. +**PASS if** status code is `204` and response body is completely empty (not even a trailing newline from the handler). The 200-shape must NOT be emitted — this is the terminal no-op. + +--- + +### 55.5 Blocked by Dependencies — 409 + +```bash +curl -sS -X DELETE "http://localhost:8443/api/v1/agents/ag-with-deps" \ + -H "Authorization: Bearer ${CERTCTL_API_KEY}" \ + -w "\nHTTP %{http_code}\n" +``` + +**What:** Attempts to retire an agent that still has active targets/certificates/jobs. +**PASS if** status code is `409` and response body is the three-key `BlockedByDependenciesResponse` shape: `{"error":"blocked_by_dependencies", "message": "...", "counts": {"active_targets": N, "active_certificates": N, "pending_jobs": N}}`. Must NOT be the generic `ErrorResponse` shape — downstream dashboards parse the `counts` key. + +--- + +### 55.6 Force Cascade — 200 + +```bash +curl -sS -X DELETE "http://localhost:8443/api/v1/agents/ag-with-deps?force=true&reason=decommissioning+rack-7" \ + -H "Authorization: Bearer ${CERTCTL_API_KEY}" \ + -w "\nHTTP %{http_code}\n" +``` + +**What:** Uses the force escape hatch to cascade-retire the dependencies. +**PASS if** status code is `200`, response includes `"cascade":true` with the pre-cascade counts, and the subsequent `GET /api/v1/audit-events?action=agent_retirement_cascaded` shows the event with the supplied `reason` and actor. + +--- + +### 55.7 Force Without Reason — 400 + +```bash +curl -sS -X DELETE "http://localhost:8443/api/v1/agents/ag-other?force=true" \ + -H "Authorization: Bearer ${CERTCTL_API_KEY}" \ + -w "\nHTTP %{http_code}\n" +``` + +**What:** Verifies the `ErrForceReasonRequired` guard — `force=true` without `reason` must be rejected before any state mutation. +**PASS if** status code is `400` and no agent/target/job rows were modified. + +--- + +### 55.8 Sentinel Refusal — 403 + +```bash +for id in server-scanner cloud-aws-sm cloud-azure-kv cloud-gcp-sm; do + echo "=== $id ===" + curl -sS -X DELETE "http://localhost:8443/api/v1/agents/${id}?force=true&reason=attempt" \ + -H "Authorization: Bearer ${CERTCTL_API_KEY}" \ + -w "\nHTTP %{http_code}\n" +done +``` + +**What:** Verifies all four sentinel agents refuse retirement even with `force=true`. +**PASS if** every request returns `403` and the response body's `error` value is `sentinel_agent` (or the equivalent `ErrAgentIsSentinel` mapping). **FAIL** if any sentinel accepts the request — retiring one silently orphans the network scanner or one of the three cloud secret-manager discovery sources. + +--- + +### 55.9 Unknown ID — 404 + +```bash +curl -sS -X DELETE "http://localhost:8443/api/v1/agents/ag-does-not-exist" \ + -H "Authorization: Bearer ${CERTCTL_API_KEY}" \ + -w "\nHTTP %{http_code}\n" +``` + +**What:** Verifies `ErrAgentNotFound` maps to 404 (not 500). Ordering matters — the not-found check must come after the sentinel check so a typo'd sentinel ID still returns 403, not 404. +**PASS if** status code is `404`. + +--- + +### 55.10 Heartbeat on Retired Agent — 410 + +```bash +curl -sS -X POST "http://localhost:8443/api/v1/agents/ag-test-clean/heartbeat" \ + -H "Authorization: Bearer ${CERTCTL_API_KEY}" \ + -H "Content-Type: application/json" \ + -d '{"os":"linux","architecture":"amd64","hostname":"test","ip_address":"10.0.0.1","version":"2.1.0"}' \ + -w "\nHTTP %{http_code}\n" +``` + +**What:** Retired agents get `410 Gone` — the canonical "resource is permanently gone, stop retrying" signal — so `cmd/agent` detects it and exits cleanly. +**PASS if** status code is `410`. **FAIL** if it is `404` (wrong ordering — retired-check must run before not-found) or `200` (retired filter missing entirely — agent would keep phoning home forever). + +--- + +### 55.11 Default List Excludes Retired + +```bash +curl -sS "http://localhost:8443/api/v1/agents" \ + -H "Authorization: Bearer ${CERTCTL_API_KEY}" \ + | jq -r '.data[] | select(.id=="ag-test-clean") | .id' +``` + +**What:** Verifies the default `/agents` listing filters retired rows via `AgentRepository.ListActive`. +**PASS if** output is empty (the retired agent does NOT appear). **FAIL** if `ag-test-clean` shows up — default listings must not expose retired rows. + +--- + +### 55.12 Retired Agents Opt-In View + +```bash +curl -sS "http://localhost:8443/api/v1/agents/retired" \ + -H "Authorization: Bearer ${CERTCTL_API_KEY}" \ + | jq -r '.data[] | select(.id=="ag-test-clean") | {id, retired_at, retired_reason}' +``` + +**What:** Verifies the opt-in retired-agents view returns the row with `retired_at` and `retired_reason` populated. Go 1.22 ServeMux literal-beats-pattern-var precedence routes `/agents/retired` to this handler rather than `/agents/{id}`. +**PASS if** the row appears with non-null `retired_at`. **FAIL** if the row is missing (listing broken) or `retired_at` is null (serialization broken). + +--- + +### 55.13 Dashboard Stats Counter Excludes Retired + +```bash +curl -sS "http://localhost:8443/api/v1/stats/summary" \ + -H "Authorization: Bearer ${CERTCTL_API_KEY}" \ + | jq -r '.total_agents' +``` + +**What:** Stats dashboard uses `ListActive`, not `List` — retired agents must not inflate the count. +**PASS if** the counter reflects only non-retired rows (verify against `SELECT count(*) FROM agents WHERE retired_at IS NULL`). + +--- + +### 55.14 CLI Retire Subcommand + +```bash +certctl-cli agents retire ag-cli-test --force --reason "smoke test" +certctl-cli agents list --retired | grep ag-cli-test +``` + +**What:** Verifies the CLI `agents retire` subcommand forwards `--force` and `--reason` via `DeleteWithQuery` and the `agents list --retired` flag hits `/agents/retired` rather than the default listing. +**PASS if** the first command succeeds and the second shows the agent in the retired view. + +--- + +### 55.15 MCP Retire Tool Schema + +```bash +go test ./internal/mcp/ -run TestRetireAgent -v -count=1 +``` + +**What:** Verifies the `certctl_retire_agent` MCP tool's input schema accepts `id`, `force`, and `reason`, and that the tool actually propagates `force`/`reason` into the outbound DELETE query string (not the body). +**PASS if** exit code 0. + +--- + +### 55.16 HEAD-State OpenAPI Contract + +```bash +npx --yes @redocly/cli lint api/openapi.yaml \ + --config '{"rules":{"operation-4xx-response":"error","no-invalid-media-type-examples":"error"}}' +python3 -c " +import yaml +spec = yaml.safe_load(open('api/openapi.yaml')) +del_op = spec['paths']['/api/v1/agents/{id}']['delete'] +assert set(del_op['responses'].keys()) == {'200','204','400','403','404','405','409','500'}, del_op['responses'].keys() +hb = spec['paths']['/api/v1/agents/{id}/heartbeat']['post'] +assert '410' in hb['responses'], hb['responses'].keys() +assert spec['paths']['/api/v1/agents/retired']['get']['operationId'] == 'listRetiredAgents' +print('OpenAPI I-004 contract: OK') +" +``` + +**What:** Two-part check. Redocly lint confirms the spec is structurally valid; the Python assertions pin the seven DELETE status codes, the 410 heartbeat response, and the retired-agents operationId. +**PASS if** redocly prints no errors and the Python script prints `OpenAPI I-004 contract: OK`. + +--- + ## Release Sign-Off All tests below must pass before tagging v2.1.0. Each row is one individual test from the guide above. The **Method** column indicates whether `qa-smoke-test.sh` covers the test automatically (**Auto**) or requires hands-on verification (**Manual**). diff --git a/internal/api/handler/agent_handler_test.go b/internal/api/handler/agent_handler_test.go index 19c873a..8fab053 100644 --- a/internal/api/handler/agent_handler_test.go +++ b/internal/api/handler/agent_handler_test.go @@ -10,6 +10,7 @@ import ( "time" "github.com/shankar0123/certctl/internal/domain" + "github.com/shankar0123/certctl/internal/service" ) // MockAgentService is a mock implementation of AgentService interface. @@ -24,6 +25,11 @@ type MockAgentService struct { GetWorkFn func(agentID string) ([]domain.Job, error) GetWorkWithTargetsFn func(agentID string) ([]domain.WorkItem, error) UpdateJobStatusFn func(agentID string, jobID string, status string, errMsg string) error + // I-004: soft-retirement hooks. Tests that don't set these receive nil + // results and nil errors, which mirrors the safest default (no-op) for + // unrelated suites that mock only the legacy surface. + RetireAgentFn func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) + ListRetiredAgentsFn func(page, perPage int) ([]domain.Agent, int64, error) } func (m *MockAgentService) ListAgents(_ context.Context, page, perPage int) ([]domain.Agent, int64, error) { @@ -96,6 +102,25 @@ func (m *MockAgentService) UpdateJobStatus(_ context.Context, agentID string, jo return nil } +// RetireAgent is the I-004 soft-retirement entrypoint. Tests that don't set +// RetireAgentFn get a nil result + nil error, which is a no-op response that +// lets unrelated suites compile without caring about the retirement surface. +func (m *MockAgentService) RetireAgent(_ context.Context, agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) { + if m.RetireAgentFn != nil { + return m.RetireAgentFn(agentID, actor, force, reason) + } + return nil, nil +} + +// ListRetiredAgents returns retired rows for the retired-agents tab / audit +// views. Same zero-value default as RetireAgent for unrelated tests. +func (m *MockAgentService) ListRetiredAgents(_ context.Context, page, perPage int) ([]domain.Agent, int64, error) { + if m.ListRetiredAgentsFn != nil { + return m.ListRetiredAgentsFn(page, perPage) + } + return nil, 0, nil +} + // Test ListAgents - success case func TestListAgents_Success(t *testing.T) { now := time.Now() diff --git a/internal/api/handler/agent_retire_handler_test.go b/internal/api/handler/agent_retire_handler_test.go new file mode 100644 index 0000000..e245b06 --- /dev/null +++ b/internal/api/handler/agent_retire_handler_test.go @@ -0,0 +1,393 @@ +package handler + +import ( + "context" + "encoding/json" + "errors" + "net/http" + "net/http/httptest" + "testing" + "time" + + "github.com/shankar0123/certctl/internal/domain" + "github.com/shankar0123/certctl/internal/service" +) + +// agentRetireTestSetup builds an AgentHandler with a mock AgentService whose +// RetireAgent / ListRetiredAgents / Heartbeat behavior is driven by the +// returned mock. Keeps every I-004 handler test self-contained so a single +// failing assertion can't cascade through a shared fixture. +func agentRetireTestSetup() (*MockAgentService, AgentHandler) { + mock := &MockAgentService{} + handler := NewAgentHandler(mock) + return mock, handler +} + +// TestRetireAgentHandler_Success_200 pins the happy-path contract for the +// soft-retirement HTTP surface: DELETE /api/v1/agents/{id} with no dependency +// fallout returns 200 OK and a JSON body echoing retirement metadata +// (retired_at timestamp, already_retired=false, cascade=false, zero counts). +// Operators building dashboards parse these fields; keep the shape stable. +func TestRetireAgentHandler_Success_200(t *testing.T) { + retiredAt := time.Date(2026, 4, 18, 12, 0, 0, 0, time.UTC) + mock, handler := agentRetireTestSetup() + mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) { + if agentID != "a-prod-001" { + t.Fatalf("retire handler received agentID=%q want a-prod-001", agentID) + } + if force { + t.Fatalf("retire handler set force=true unexpectedly; default path must be force=false") + } + return &service.AgentRetirementResult{ + AlreadyRetired: false, + Cascade: false, + RetiredAt: retiredAt, + Counts: domain.AgentDependencyCounts{}, + }, nil + } + + req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001", nil) + req = req.WithContext(contextWithRequestID()) + w := httptest.NewRecorder() + + handler.RetireAgent(w, req) + + if w.Code != http.StatusOK { + t.Fatalf("status=%d body=%s want 200", w.Code, w.Body.String()) + } + + var body struct { + RetiredAt time.Time `json:"retired_at"` + AlreadyRetired bool `json:"already_retired"` + Cascade bool `json:"cascade"` + Counts domain.AgentDependencyCounts `json:"counts"` + } + if err := json.NewDecoder(w.Body).Decode(&body); err != nil { + t.Fatalf("decode 200 body: %v", err) + } + if !body.RetiredAt.Equal(retiredAt) { + t.Errorf("retired_at=%v want %v", body.RetiredAt, retiredAt) + } + if body.AlreadyRetired { + t.Errorf("already_retired=true want false on clean retire") + } + if body.Cascade { + t.Errorf("cascade=true want false on clean retire") + } +} + +// TestRetireAgentHandler_AlreadyRetired_204 covers the idempotent contract: a +// retire call against an already-retired agent completes with 204 No Content +// (no body). This lets operators safely re-issue the DELETE after a network +// blip without fearing duplicate audit events or state mutations. +func TestRetireAgentHandler_AlreadyRetired_204(t *testing.T) { + mock, handler := agentRetireTestSetup() + past := time.Now().Add(-24 * time.Hour) + mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) { + return &service.AgentRetirementResult{ + AlreadyRetired: true, + Cascade: false, + RetiredAt: past, + Counts: domain.AgentDependencyCounts{}, + }, nil + } + + req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001", nil) + req = req.WithContext(contextWithRequestID()) + w := httptest.NewRecorder() + + handler.RetireAgent(w, req) + + if w.Code != http.StatusNoContent { + t.Fatalf("status=%d body=%s want 204", w.Code, w.Body.String()) + } + // 204 No Content must have zero body. If anything leaks through, downstream + // clients (curl scripts, dashboards) break. + if w.Body.Len() != 0 { + t.Errorf("204 body=%q want empty", w.Body.String()) + } +} + +// TestRetireAgentHandler_Sentinel_403 covers the hard guard against retiring +// any of the four sentinel agents that back discovery sources and the +// network scanner. These IDs are reserved; the handler must surface the +// service-layer ErrAgentIsSentinel as 403 Forbidden regardless of force/reason +// because no operator intent can legitimately retire them. +func TestRetireAgentHandler_Sentinel_403(t *testing.T) { + sentinels := []string{"server-scanner", "cloud-aws-sm", "cloud-azure-kv", "cloud-gcp-sm"} + for _, id := range sentinels { + t.Run(id, func(t *testing.T) { + mock, handler := agentRetireTestSetup() + mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) { + return nil, service.ErrAgentIsSentinel + } + + req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/"+id, nil) + req = req.WithContext(contextWithRequestID()) + w := httptest.NewRecorder() + + handler.RetireAgent(w, req) + + if w.Code != http.StatusForbidden { + t.Fatalf("sentinel %q status=%d body=%s want 403", id, w.Code, w.Body.String()) + } + }) + } +} + +// TestRetireAgentHandler_NotFound_404 covers the lookup-miss path. Service +// returns a not-found error; handler maps to 404. Keeping the error +// discrimination at the service layer (sentinel errors.Is) rather than string +// matching is the whole point of wrapping. +func TestRetireAgentHandler_NotFound_404(t *testing.T) { + mock, handler := agentRetireTestSetup() + mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) { + return nil, errors.New("agent not found") + } + + req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/unknown-id", nil) + req = req.WithContext(contextWithRequestID()) + w := httptest.NewRecorder() + + handler.RetireAgent(w, req) + + if w.Code != http.StatusNotFound { + t.Fatalf("status=%d body=%s want 404", w.Code, w.Body.String()) + } +} + +// TestRetireAgentHandler_Blocked_409_WithCounts covers the preflight-blocked +// path. Service returns *BlockedByDependenciesError wrapping +// ErrBlockedByDependencies; handler unwraps via errors.As, maps to 409, and +// MUST include the counts in the response body so operators know what's +// blocking them. Without counts the 409 is useless — the operator has to +// guess which downstream dependency is holding up the retirement. +func TestRetireAgentHandler_Blocked_409_WithCounts(t *testing.T) { + mock, handler := agentRetireTestSetup() + blockCounts := domain.AgentDependencyCounts{ + ActiveTargets: 3, + ActiveCertificates: 7, + PendingJobs: 2, + } + mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) { + return nil, &service.BlockedByDependenciesError{Counts: blockCounts} + } + + req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001", nil) + req = req.WithContext(contextWithRequestID()) + w := httptest.NewRecorder() + + handler.RetireAgent(w, req) + + if w.Code != http.StatusConflict { + t.Fatalf("status=%d body=%s want 409", w.Code, w.Body.String()) + } + + var body struct { + Error string `json:"error"` + Message string `json:"message"` + Counts domain.AgentDependencyCounts `json:"counts"` + } + if err := json.NewDecoder(w.Body).Decode(&body); err != nil { + t.Fatalf("decode 409 body: %v", err) + } + if body.Counts.ActiveTargets != 3 { + t.Errorf("counts.active_targets=%d want 3", body.Counts.ActiveTargets) + } + if body.Counts.ActiveCertificates != 7 { + t.Errorf("counts.active_certificates=%d want 7", body.Counts.ActiveCertificates) + } + if body.Counts.PendingJobs != 2 { + t.Errorf("counts.pending_jobs=%d want 2", body.Counts.PendingJobs) + } + if body.Message == "" { + t.Errorf("409 body missing human-readable message; operators need guidance") + } +} + +// TestRetireAgentHandler_Force_NoReason_400 covers the force-escape-hatch +// guardrail: force=true without a non-empty reason must be rejected at the +// handler seam BEFORE the service performs any DB work, because a +// reason-less cascade is unauditable. Service returns ErrForceReasonRequired; +// handler maps to 400. +func TestRetireAgentHandler_Force_NoReason_400(t *testing.T) { + mock, handler := agentRetireTestSetup() + mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) { + if !force { + t.Fatalf("handler did not forward force=true; force query param was dropped") + } + if reason != "" { + t.Fatalf("handler passed reason=%q; empty reason must reach service for error path", reason) + } + return nil, service.ErrForceReasonRequired + } + + req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001?force=true", nil) + req = req.WithContext(contextWithRequestID()) + w := httptest.NewRecorder() + + handler.RetireAgent(w, req) + + if w.Code != http.StatusBadRequest { + t.Fatalf("status=%d body=%s want 400", w.Code, w.Body.String()) + } +} + +// TestRetireAgentHandler_ForceCascade_200 covers the successful force-cascade +// path: DELETE ?force=true&reason=... → service executes transactional +// cascade → 200 with cascade=true and the pre-cascade counts echoed back so +// the operator's confirmation dialog can show "I just retired N targets, +// M certificates, K pending jobs." +func TestRetireAgentHandler_ForceCascade_200(t *testing.T) { + mock, handler := agentRetireTestSetup() + retiredAt := time.Date(2026, 4, 18, 14, 30, 0, 0, time.UTC) + mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) { + if !force { + t.Fatalf("handler did not forward force=true; query-param parsing broken") + } + if reason != "decommissioning rack 7" { + t.Fatalf("handler forwarded reason=%q want %q", reason, "decommissioning rack 7") + } + return &service.AgentRetirementResult{ + AlreadyRetired: false, + Cascade: true, + RetiredAt: retiredAt, + Counts: domain.AgentDependencyCounts{ + ActiveTargets: 2, + ActiveCertificates: 5, + PendingJobs: 1, + }, + }, nil + } + + url := "/api/v1/agents/a-prod-001?force=true&reason=decommissioning+rack+7" + req := httptest.NewRequest(http.MethodDelete, url, nil) + req = req.WithContext(contextWithRequestID()) + w := httptest.NewRecorder() + + handler.RetireAgent(w, req) + + if w.Code != http.StatusOK { + t.Fatalf("status=%d body=%s want 200", w.Code, w.Body.String()) + } + + var body struct { + RetiredAt time.Time `json:"retired_at"` + AlreadyRetired bool `json:"already_retired"` + Cascade bool `json:"cascade"` + Counts domain.AgentDependencyCounts `json:"counts"` + } + if err := json.NewDecoder(w.Body).Decode(&body); err != nil { + t.Fatalf("decode force-cascade 200 body: %v", err) + } + if !body.Cascade { + t.Errorf("cascade=false want true on ?force=true successful retire") + } + if body.Counts.ActiveTargets != 2 || body.Counts.ActiveCertificates != 5 || body.Counts.PendingJobs != 1 { + t.Errorf("counts=%+v want {ActiveTargets:2 ActiveCertificates:5 PendingJobs:1}", body.Counts) + } +} + +// TestHeartbeatHandler_RetiredAgent_410 covers the agent-shutdown signal. A +// retired agent that is still polling must be told its identity is gone +// (410 Gone) rather than offered the normal 200 "recorded" response. +// cmd/agent treats 410 as a terminal signal and exits rather than looping +// forever against a decommissioned identity. Service returns ErrAgentRetired; +// handler maps to 410. +func TestHeartbeatHandler_RetiredAgent_410(t *testing.T) { + mock, handler := agentRetireTestSetup() + mock.HeartbeatFn = func(agentID string, metadata *domain.AgentMetadata) error { + return service.ErrAgentRetired + } + + req := httptest.NewRequest(http.MethodPost, "/api/v1/agents/a-prod-001/heartbeat", nil) + req = req.WithContext(contextWithRequestID()) + w := httptest.NewRecorder() + + handler.Heartbeat(w, req) + + if w.Code != http.StatusGone { + t.Fatalf("heartbeat(retired) status=%d body=%s want 410", w.Code, w.Body.String()) + } +} + +// TestListRetiredAgentsHandler_Success covers the audit/forensics-facing +// endpoint GET /api/v1/agents/retired. Returns a paged list of retired rows +// alongside total count so the GUI can render a "Retired Agents" tab with +// pagination. Default listing (GET /agents) hides retired rows; this is the +// opt-in surface for them. +func TestListRetiredAgentsHandler_Success(t *testing.T) { + past := time.Now().Add(-48 * time.Hour) + reason := "old hardware" + retired := []domain.Agent{ + { + ID: "agent-retired-01", + Name: "decom-01", + Hostname: "server-old", + Status: domain.AgentStatusOffline, + RegisteredAt: past, + RetiredAt: &past, + RetiredReason: &reason, + }, + } + + mock, handler := agentRetireTestSetup() + mock.ListRetiredAgentsFn = func(page, perPage int) ([]domain.Agent, int64, error) { + if page != 1 || perPage != 50 { + t.Fatalf("ListRetired handler received page=%d perPage=%d want 1/50 defaults", page, perPage) + } + return retired, 1, nil + } + + req := httptest.NewRequest(http.MethodGet, "/api/v1/agents/retired", nil) + req = req.WithContext(contextWithRequestID()) + w := httptest.NewRecorder() + + handler.ListRetiredAgents(w, req) + + if w.Code != http.StatusOK { + t.Fatalf("status=%d body=%s want 200", w.Code, w.Body.String()) + } + + var response PagedResponse + if err := json.NewDecoder(w.Body).Decode(&response); err != nil { + t.Fatalf("decode list-retired body: %v", err) + } + if response.Total != 1 { + t.Errorf("total=%d want 1", response.Total) + } +} + +// TestRetireAgentHandler_MethodNotAllowed covers defense-in-depth: only +// DELETE is valid on /api/v1/agents/{id} for retirement. Using POST/PUT/PATCH +// must be rejected with 405 so misconfigured callers don't accidentally +// trigger retirement via a wrong-method request. +func TestRetireAgentHandler_MethodNotAllowed(t *testing.T) { + _, handler := agentRetireTestSetup() + + for _, method := range []string{http.MethodPost, http.MethodPut, http.MethodPatch} { + t.Run(method, func(t *testing.T) { + req := httptest.NewRequest(method, "/api/v1/agents/a-prod-001", nil) + req = req.WithContext(contextWithRequestID()) + w := httptest.NewRecorder() + + handler.RetireAgent(w, req) + + if w.Code != http.StatusMethodNotAllowed { + t.Fatalf("method=%s status=%d want 405", method, w.Code) + } + }) + } +} + +// Compile-time asserts: the mock must satisfy the handler's AgentService +// interface. Red state: this fails until the interface grows RetireAgent + +// ListRetiredAgents. Once Phase 2b adds those methods to AgentService, this +// assertion goes green along with every test above. +var _ AgentService = (*MockAgentService)(nil) + +// Unused-import suppressor for context — the package-level tests already +// pull context from agent_handler_test.go, but leaving this here documents +// that the mock methods receive context.Context values even though this +// file's tests don't construct them directly (they ride on httptest.NewRequest). +var _ = context.Background diff --git a/internal/api/handler/agents.go b/internal/api/handler/agents.go index d92e005..b015420 100644 --- a/internal/api/handler/agents.go +++ b/internal/api/handler/agents.go @@ -3,16 +3,24 @@ package handler import ( "context" "encoding/json" + "errors" "log/slog" "net/http" "strconv" "strings" + "time" "github.com/shankar0123/certctl/internal/api/middleware" "github.com/shankar0123/certctl/internal/domain" + "github.com/shankar0123/certctl/internal/service" ) // AgentService defines the service interface for agent operations. +// +// I-004 expansion: RetireAgent + ListRetiredAgents back the soft-retirement +// surface. The handler depends on the service-package's AgentRetirementResult +// and BlockedByDependenciesError types for result shape + errors.As unwrap, +// which is why this file imports internal/service. type AgentService interface { ListAgents(ctx context.Context, page, perPage int) ([]domain.Agent, int64, error) GetAgent(ctx context.Context, id string) (*domain.Agent, error) @@ -24,6 +32,10 @@ type AgentService interface { GetWork(ctx context.Context, agentID string) ([]domain.Job, error) GetWorkWithTargets(ctx context.Context, agentID string) ([]domain.WorkItem, error) UpdateJobStatus(ctx context.Context, agentID string, jobID string, status string, errMsg string) error + // I-004 soft-retirement API. Both default to no-op (nil result / nil error) + // in mocks that don't override them — handler tests opt in per suite. + RetireAgent(ctx context.Context, agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) + ListRetiredAgents(ctx context.Context, page, perPage int) ([]domain.Agent, int64, error) } // AgentHandler handles HTTP requests for agent operations. @@ -190,6 +202,15 @@ func (h AgentHandler) Heartbeat(w http.ResponseWriter, r *http.Request) { } if err := h.svc.Heartbeat(r.Context(), agentID, metadata); err != nil { + // I-004: a retired agent still polling must receive 410 Gone so + // cmd/agent detects the terminal signal and shuts down cleanly + // instead of looping forever against a decommissioned identity. + // Check this FIRST — before "not found" string matching — so the + // retired-path is never masked by a sibling error branch. + if errors.Is(err, service.ErrAgentRetired) { + ErrorWithRequestID(w, http.StatusGone, "Agent has been retired", requestID) + return + } if strings.Contains(err.Error(), "not found") { ErrorWithRequestID(w, http.StatusNotFound, "Agent not found", requestID) return @@ -376,3 +397,181 @@ func (h AgentHandler) AgentReportJobStatus(w http.ResponseWriter, r *http.Reques "status": "updated", }) } + +// RetireAgent executes the I-004 soft-retirement surface. +// DELETE /api/v1/agents/{id}[?force=true&reason=...] +// +// Contract (pinned by agent_retire_handler_test.go): +// +// 405 any method other than DELETE +// 200 clean retire (body: retired_at, already_retired=false, cascade=false, counts=0s) +// 200 force-cascade retire (body: cascade=true, counts=pre-cascade snapshot) +// 204 idempotent retire of an already-retired agent (NO body — downstream +// clients that tee responses into dashboards break on spurious bodies) +// 400 force=true without a non-empty reason (ErrForceReasonRequired) +// 403 one of the four reserved sentinel IDs (ErrAgentIsSentinel) +// 404 agent does not exist ("not found" string match, kept for compat with +// repo error strings; sentinel checks run first so they never mask) +// 409 blocked by preflight counts (*BlockedByDependenciesError) — body +// carries the per-bucket counts so the operator UI can tell the +// human which downstream dependency is holding up the retirement, +// rather than forcing them to re-run the DELETE with ?force=true +// and guess +// 500 anything else +// +// The 409 body intentionally does NOT go through ErrorWithRequestID because +// that helper's ErrorResponse shape has no `counts` field — we inline-marshal +// a custom body instead. Keeping this shape stable is important: the GUI +// pattern is "show the 409 dialog, list the N targets / M certs / K jobs +// blocking, let the operator retire them first or tick the force checkbox." +func (h AgentHandler) RetireAgent(w http.ResponseWriter, r *http.Request) { + if r.Method != http.MethodDelete { + Error(w, http.StatusMethodNotAllowed, "Method not allowed") + return + } + + requestID := middleware.GetRequestID(r.Context()) + + // Extract {id} from /api/v1/agents/{id}. Mirror GetAgent's pattern so + // the path parser is identical across the agent handler surface and a + // future refactor can extract it once without introducing drift. + rawID := strings.TrimPrefix(r.URL.Path, "/api/v1/agents/") + parts := strings.Split(rawID, "/") + if len(parts) == 0 || parts[0] == "" { + ErrorWithRequestID(w, http.StatusBadRequest, "Agent ID is required", requestID) + return + } + id := parts[0] + + // Parse optional force + reason. A missing `force` param is treated as + // force=false (the default, safe path); anything strconv.ParseBool rejects + // is also force=false so a malformed query can never silently enable the + // cascade. The reason string is passed through verbatim — the service + // owns the "force=true requires reason" rule. + query := r.URL.Query() + force := false + if fv := query.Get("force"); fv != "" { + if parsed, err := strconv.ParseBool(fv); err == nil { + force = parsed + } + } + reason := query.Get("reason") + + actor := resolveActor(r.Context()) + + result, err := h.svc.RetireAgent(r.Context(), id, actor, force, reason) + if err != nil { + // Sentinel + typed-error checks run BEFORE string matching on "not + // found" so a repo error that happens to contain those words can + // never mask a structural refusal (403/400/409). Order matters. + if errors.Is(err, service.ErrAgentIsSentinel) { + ErrorWithRequestID(w, http.StatusForbidden, "Agent is a reserved sentinel and cannot be retired", requestID) + return + } + if errors.Is(err, service.ErrForceReasonRequired) { + ErrorWithRequestID(w, http.StatusBadRequest, "force=true requires a non-empty reason", requestID) + return + } + var blocked *service.BlockedByDependenciesError + if errors.As(err, &blocked) { + // Custom 409 body with per-bucket counts. ErrorResponse has no + // `counts` field, so we marshal a bespoke struct instead. + // Keep `error`/`message`/`counts` as the stable shape — any + // dashboard parsing this relies on those three keys. + body := struct { + Error string `json:"error"` + Message string `json:"message"` + Counts domain.AgentDependencyCounts `json:"counts"` + }{ + Error: "blocked_by_dependencies", + Message: "Agent has active downstream dependencies. Retire or reassign them " + + "first, or re-run with ?force=true&reason=... to cascade.", + Counts: blocked.Counts, + } + JSON(w, http.StatusConflict, body) + return + } + if strings.Contains(err.Error(), "not found") { + ErrorWithRequestID(w, http.StatusNotFound, "Agent not found", requestID) + return + } + slog.Error("RetireAgent failed", "agent_id", id, "error", err.Error()) + ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to retire agent", requestID) + return + } + + // Idempotent retire: the agent was already retired, so we return 204 No + // Content with a ZERO-length body. The Red contract (test line 106) fails + // if even a trailing newline leaks into the response. WriteHeader alone + // emits the status without invoking the JSON encoder. + if result.AlreadyRetired { + w.WriteHeader(http.StatusNoContent) + return + } + + // Clean retire (force=false) or successful cascade (force=true). Body + // shape pinned by Red contract: retired_at, already_retired, cascade, + // counts. Omitempty is deliberately NOT used — operators parsing the + // response expect every field to always be present. + JSON(w, http.StatusOK, struct { + RetiredAt time.Time `json:"retired_at"` + AlreadyRetired bool `json:"already_retired"` + Cascade bool `json:"cascade"` + Counts domain.AgentDependencyCounts `json:"counts"` + }{ + RetiredAt: result.RetiredAt, + AlreadyRetired: result.AlreadyRetired, + Cascade: result.Cascade, + Counts: result.Counts, + }) +} + +// ListRetiredAgents returns the opt-in listing of retired agents for the +// operator UI's "Retired" tab and for audit/forensics workflows. +// GET /api/v1/agents/retired?page=1&per_page=50 +// +// The default ListAgents handler hides retired rows; this is the dedicated +// surface for reading them back. Pagination defaults match ListAgents so +// the GUI can reuse the same query hook (page=1, per_page=50, cap 500). +// +// Go 1.22's enhanced ServeMux routes `/agents/retired` to this handler via +// the literal-beats-pattern-var precedence rule (literal `retired` wins over +// `{id}` in the sibling GET /api/v1/agents/{id} route), so both entries can +// coexist without conflict. If that precedence ever regresses, the failure +// mode is TestListRetiredAgentsHandler_Success blowing up with a 404 — which +// is the fast signal we want. +func (h AgentHandler) ListRetiredAgents(w http.ResponseWriter, r *http.Request) { + if r.Method != http.MethodGet { + Error(w, http.StatusMethodNotAllowed, "Method not allowed") + return + } + + requestID := middleware.GetRequestID(r.Context()) + + page := 1 + perPage := 50 + query := r.URL.Query() + if p := query.Get("page"); p != "" { + if parsed, err := strconv.Atoi(p); err == nil && parsed > 0 { + page = parsed + } + } + if pp := query.Get("per_page"); pp != "" { + if parsed, err := strconv.Atoi(pp); err == nil && parsed > 0 && parsed <= 500 { + perPage = parsed + } + } + + agents, total, err := h.svc.ListRetiredAgents(r.Context(), page, perPage) + if err != nil { + ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to list retired agents", requestID) + return + } + + JSON(w, http.StatusOK, PagedResponse{ + Data: agents, + Total: total, + Page: page, + PerPage: perPage, + }) +} diff --git a/internal/api/router/router.go b/internal/api/router/router.go index a95905c..72ce35e 100644 --- a/internal/api/router/router.go +++ b/internal/api/router/router.go @@ -131,9 +131,21 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) { r.Register("POST /api/v1/targets/{id}/test", http.HandlerFunc(reg.Targets.TestTargetConnection)) // Agents routes: /api/v1/agents + // + // I-004 soft-retirement surface: + // * GET /api/v1/agents/retired — opt-in listing of retired agents. + // MUST be registered before /agents/{id} so Go 1.22 ServeMux's + // literal-beats-pattern-var precedence routes the `retired` literal + // to ListRetiredAgents instead of treating "retired" as a {id} + // parameter value against GetAgent. + // * DELETE /api/v1/agents/{id} — RetireAgent. Replaces the pre-I-004 + // hard-delete; the underlying repo does a soft-retire with + // optional cascade. r.Register("GET /api/v1/agents", http.HandlerFunc(reg.Agents.ListAgents)) r.Register("POST /api/v1/agents", http.HandlerFunc(reg.Agents.RegisterAgent)) + r.Register("GET /api/v1/agents/retired", http.HandlerFunc(reg.Agents.ListRetiredAgents)) r.Register("GET /api/v1/agents/{id}", http.HandlerFunc(reg.Agents.GetAgent)) + r.Register("DELETE /api/v1/agents/{id}", http.HandlerFunc(reg.Agents.RetireAgent)) r.Register("POST /api/v1/agents/{id}/heartbeat", http.HandlerFunc(reg.Agents.Heartbeat)) r.Register("POST /api/v1/agents/{id}/csr", http.HandlerFunc(reg.Agents.AgentCSRSubmit)) r.Register("GET /api/v1/agents/{id}/certificates/{cert_id}", http.HandlerFunc(reg.Agents.AgentCertificatePickup)) diff --git a/internal/cli/agent_retire_test.go b/internal/cli/agent_retire_test.go new file mode 100644 index 0000000..74e210a --- /dev/null +++ b/internal/cli/agent_retire_test.go @@ -0,0 +1,228 @@ +package cli + +import ( + "encoding/json" + "net/http" + "net/http/httptest" + "testing" +) + +// TestClient_RetireAgent_Success pins the I-004 CLI happy path: the operator +// runs `certctl-cli agents retire ` and the client issues a DELETE to +// /api/v1/agents/{id}, parses the 200 JSON body (retired_at, already_retired, +// cascade, counts), and reports success. The handler test already covers the +// server-side contract; this test covers the client-side wire formatting so a +// refactor of the server's 200 body shape can't silently break the CLI. +func TestClient_RetireAgent_Success(t *testing.T) { + var ( + sawMethod string + sawPath string + sawForce string + sawReason string + ) + + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + sawMethod = r.Method + sawPath = r.URL.Path + sawForce = r.URL.Query().Get("force") + sawReason = r.URL.Query().Get("reason") + + if r.Method != "DELETE" || r.URL.Path != "/api/v1/agents/ag-1" { + w.WriteHeader(http.StatusNotFound) + return + } + w.Header().Set("Content-Type", "application/json") + w.WriteHeader(http.StatusOK) + _ = json.NewEncoder(w).Encode(map[string]interface{}{ + "retired_at": "2026-04-18T12:00:00Z", + "already_retired": false, + "cascade": false, + "counts": map[string]interface{}{ + "active_targets": 0, + "active_certificates": 0, + "pending_jobs": 0, + }, + }) + })) + defer server.Close() + + client := NewClient(server.URL, "", "table") + // Positional arg: the agent ID. No --force, no --reason — the default + // soft-retire path. Compile-fail until client.RetireAgent exists. + if err := client.RetireAgent([]string{"ag-1"}); err != nil { + t.Fatalf("RetireAgent(ag-1) err=%v want nil", err) + } + + if sawMethod != "DELETE" { + t.Errorf("method=%q want DELETE", sawMethod) + } + if sawPath != "/api/v1/agents/ag-1" { + t.Errorf("path=%q want /api/v1/agents/ag-1", sawPath) + } + if sawForce != "" { + t.Errorf("force query=%q want empty (default path sends no force)", sawForce) + } + if sawReason != "" { + t.Errorf("reason query=%q want empty (default path sends no reason)", sawReason) + } +} + +// TestClient_RetireAgent_Force_WithReason_Success pins the ?force=true&reason=... +// escape hatch wiring. Operators who supply --force + --reason get their values +// propagated as URL query parameters exactly once, so the server sees the same +// contract the handler test expects. Also verifies the cascade=true response +// body parses cleanly. +func TestClient_RetireAgent_Force_WithReason_Success(t *testing.T) { + var ( + sawForce string + sawReason string + ) + + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + sawForce = r.URL.Query().Get("force") + sawReason = r.URL.Query().Get("reason") + + if r.Method != "DELETE" || r.URL.Path != "/api/v1/agents/ag-1" { + w.WriteHeader(http.StatusNotFound) + return + } + w.Header().Set("Content-Type", "application/json") + w.WriteHeader(http.StatusOK) + _ = json.NewEncoder(w).Encode(map[string]interface{}{ + "retired_at": "2026-04-18T12:00:00Z", + "already_retired": false, + "cascade": true, + "counts": map[string]interface{}{ + "active_targets": 2, + "active_certificates": 5, + "pending_jobs": 1, + }, + }) + })) + defer server.Close() + + client := NewClient(server.URL, "", "table") + if err := client.RetireAgent([]string{"ag-1", "--force", "--reason", "decommissioning rack 7"}); err != nil { + t.Fatalf("RetireAgent(force+reason) err=%v want nil", err) + } + if sawForce != "true" { + t.Errorf("force query=%q want \"true\"", sawForce) + } + if sawReason != "decommissioning rack 7" { + t.Errorf("reason query=%q want %q", sawReason, "decommissioning rack 7") + } +} + +// TestClient_RetireAgent_Force_RequiresReason pins the client-side guard: using +// --force without --reason must fail BEFORE any HTTP request is made. Without +// this, the client would bounce off the server's 400 ErrForceReasonRequired +// only after a round trip — slow feedback, wasted audit-trail noise, and a +// worse operator experience. requestCount=0 enforces that no HTTP call happens. +func TestClient_RetireAgent_Force_RequiresReason(t *testing.T) { + var requestCount int + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + requestCount++ + w.WriteHeader(http.StatusOK) + })) + defer server.Close() + + client := NewClient(server.URL, "", "table") + err := client.RetireAgent([]string{"ag-1", "--force"}) + if err == nil { + t.Fatalf("RetireAgent(force, no reason) err=nil want client-side error") + } + if !containsStr(err.Error(), "reason") { + t.Errorf("err=%q should mention --reason to guide operator", err.Error()) + } + if requestCount != 0 { + t.Fatalf("requestCount=%d want 0; client must short-circuit before HTTP call", requestCount) + } +} + +// TestClient_RetireAgent_MissingID covers the other common operator mistake: +// invoking `certctl-cli agents retire` with no agent ID. Must be caught by the +// client with a clear error, not a malformed DELETE to /api/v1/agents/. +func TestClient_RetireAgent_MissingID(t *testing.T) { + var requestCount int + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + requestCount++ + w.WriteHeader(http.StatusOK) + })) + defer server.Close() + + client := NewClient(server.URL, "", "table") + err := client.RetireAgent([]string{}) + if err == nil { + t.Fatalf("RetireAgent([]) err=nil want missing-id error") + } + if requestCount != 0 { + t.Fatalf("requestCount=%d want 0; client must reject missing-id before HTTP", requestCount) + } +} + +// TestClient_ListRetiredAgents_Success pins the audit/forensics CLI surface: +// `certctl-cli agents list-retired` must GET /api/v1/agents/retired and render +// the paged response. The server returns a PagedResponse; the client is +// responsible for printing it in table or JSON format, same as ListAgents. +func TestClient_ListRetiredAgents_Success(t *testing.T) { + var ( + sawMethod string + sawPath string + ) + + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + sawMethod = r.Method + sawPath = r.URL.Path + + if r.Method != "GET" || r.URL.Path != "/api/v1/agents/retired" { + w.WriteHeader(http.StatusNotFound) + return + } + w.Header().Set("Content-Type", "application/json") + _ = json.NewEncoder(w).Encode(map[string]interface{}{ + "data": []map[string]interface{}{ + { + "id": "ag-old-01", + "name": "decom-01", + "hostname": "server-old", + "status": "Offline", + "registered_at": "2024-01-01T00:00:00Z", + "retired_at": "2026-01-01T00:00:00Z", + "retired_reason": "old hardware", + }, + }, + "total": 1, + "page": 1, + "per_page": 50, + }) + })) + defer server.Close() + + client := NewClient(server.URL, "", "table") + if err := client.ListRetiredAgents([]string{}); err != nil { + t.Fatalf("ListRetiredAgents err=%v want nil", err) + } + if sawMethod != "GET" { + t.Errorf("method=%q want GET", sawMethod) + } + if sawPath != "/api/v1/agents/retired" { + t.Errorf("path=%q want /api/v1/agents/retired", sawPath) + } +} + +// TestClient_ListRetiredAgents_ServerError covers the non-happy path: server +// returns 5xx → client surfaces the error rather than silently printing an +// empty list. Without this, operators running the command as part of a +// compliance audit could miss a backend outage. +func TestClient_ListRetiredAgents_ServerError(t *testing.T) { + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + http.Error(w, "db unreachable", http.StatusInternalServerError) + })) + defer server.Close() + + client := NewClient(server.URL, "", "table") + err := client.ListRetiredAgents([]string{}) + if err == nil { + t.Fatalf("ListRetiredAgents(500) err=nil want propagated error") + } +} diff --git a/internal/cli/client.go b/internal/cli/client.go index e6120ff..48528e8 100644 --- a/internal/cli/client.go +++ b/internal/cli/client.go @@ -293,6 +293,194 @@ func (c *Client) ListAgents(args []string) error { return c.outputAgentsTable(result.Data, result.Total) } +// ListRetiredAgents lists soft-retired agents from the dedicated endpoint. +// +// I-004: hits GET /api/v1/agents/retired which is a separate route from the +// default listing (the default hides retired rows). Supports --page and +// --per-page just like the active list. Output format mirrors ListAgents +// but prepends RETIRED_AT and RETIRED_REASON columns so the operator can +// forensic-grep the output. +func (c *Client) ListRetiredAgents(args []string) error { + fs := flag.NewFlagSet("agents list --retired", flag.ContinueOnError) + page := fs.Int("page", 1, "Page number") + perPage := fs.Int("per-page", 50, "Items per page") + fs.Parse(args) + + query := url.Values{} + query.Set("page", fmt.Sprintf("%d", *page)) + query.Set("per_page", fmt.Sprintf("%d", *perPage)) + + resp, err := c.do("GET", "/api/v1/agents/retired", query, nil) + if err != nil { + return err + } + + var result struct { + Data []map[string]interface{} `json:"data"` + Total int `json:"total"` + } + if err := json.Unmarshal(resp, &result); err != nil { + return fmt.Errorf("parsing response: %w", err) + } + + if c.format == "json" { + return c.outputJSON(result) + } + + return c.outputRetiredAgentsTable(result.Data, result.Total) +} + +// RetireAgent soft-retires an agent via DELETE /api/v1/agents/{id}. +// +// I-004: wraps the full status-code matrix pinned by the handler's +// agent_retire_handler_test.go: +// +// 200 clean retire — body: retired_at, already_retired=false, cascade=false, counts=0 +// 200 force-cascade retire — body: cascade=true, counts=pre-cascade snapshot +// 204 idempotent retire — agent was already retired, NO body +// 403 sentinel — reserved agent (server-scanner / cloud-*), ErrAgentIsSentinel +// 404 not found — agent doesn't exist +// 409 blocked_by_dependencies — body: error, message, counts +// +// The default (force=false) flow refuses to retire agents with active +// downstream dependencies; the operator must re-run with --force and an +// explicit --reason to cascade. The handler rejects --force without +// --reason with a 400 — we mirror that contract client-side so the +// operator gets a clear error before the round trip. +func (c *Client) RetireAgent(args []string) error { + // Convention: `agents retire [--force] [--reason ]` — the ID + // is a positional arg that precedes the flags. Go's flag package stops + // parsing at the first non-flag token, so we pull args[0] as the ID and + // hand args[1:] to the flag parser. Without this split, `agents retire + // ag-1 --force --reason "x"` would parse with force=false and reason="" + // because the flags land in fs.Args() instead of being recognized. + if len(args) == 0 { + return fmt.Errorf("agent ID is required: agents retire [--force] [--reason ]") + } + id := args[0] + + fs := flag.NewFlagSet("agents retire", flag.ContinueOnError) + force := fs.Bool("force", false, "Cascade-retire downstream targets, certs, and jobs") + reason := fs.String("reason", "", "Human-readable reason (required with --force)") + if err := fs.Parse(args[1:]); err != nil { + return err + } + + // Mirror the handler's ErrForceReasonRequired contract client-side so + // the operator gets a clear error before the round trip. + if *force && strings.TrimSpace(*reason) == "" { + return fmt.Errorf("--reason is required when --force is set") + } + + // Build query string. Skip ?force=false; skip ?reason= when empty. + query := url.Values{} + if *force { + query.Set("force", "true") + } + if *reason != "" { + query.Set("reason", *reason) + } + + u, err := url.JoinPath(c.baseURL, fmt.Sprintf("/api/v1/agents/%s", id)) + if err != nil { + return fmt.Errorf("invalid URL: %w", err) + } + if len(query) > 0 { + u = u + "?" + query.Encode() + } + + req, err := http.NewRequest("DELETE", u, nil) + if err != nil { + return fmt.Errorf("creating request: %w", err) + } + req.Header.Set("Accept", "application/json") + if c.apiKey != "" { + req.Header.Set("Authorization", "Bearer "+c.apiKey) + } + + resp, err := c.httpClient.Do(req) + if err != nil { + return fmt.Errorf("request failed: %w", err) + } + defer resp.Body.Close() + + body, err := io.ReadAll(resp.Body) + if err != nil { + return fmt.Errorf("reading response: %w", err) + } + + switch resp.StatusCode { + case http.StatusNoContent: + // 204 idempotent — the agent was already retired. No body. + if c.format == "json" { + return c.outputJSON(map[string]interface{}{ + "agent_id": id, + "already_retired": true, + }) + } + fmt.Printf("Agent %s was already retired (idempotent)\n", id) + return nil + + case http.StatusOK: + var result struct { + RetiredAt string `json:"retired_at"` + AlreadyRetired bool `json:"already_retired"` + Cascade bool `json:"cascade"` + Counts struct { + ActiveTargets int `json:"active_targets"` + ActiveCertificates int `json:"active_certificates"` + PendingJobs int `json:"pending_jobs"` + } `json:"counts"` + } + if err := json.Unmarshal(body, &result); err != nil { + return fmt.Errorf("parsing 200 response: %w", err) + } + + if c.format == "json" { + return c.outputJSON(json.RawMessage(body)) + } + + if result.Cascade { + fmt.Printf("Agent %s retired (cascade). Retired at: %s\n", id, result.RetiredAt) + fmt.Printf(" Cascaded: %d targets, %d certificates, %d jobs\n", + result.Counts.ActiveTargets, result.Counts.ActiveCertificates, result.Counts.PendingJobs) + } else { + fmt.Printf("Agent %s retired. Retired at: %s\n", id, result.RetiredAt) + } + return nil + + case http.StatusConflict: + // 409 blocked_by_dependencies. Parse the body so we can show the + // operator which dependency counts are holding up the retire. + var blocked struct { + Error string `json:"error"` + Message string `json:"message"` + Counts struct { + ActiveTargets int `json:"active_targets"` + ActiveCertificates int `json:"active_certificates"` + PendingJobs int `json:"pending_jobs"` + } `json:"counts"` + } + if err := json.Unmarshal(body, &blocked); err != nil { + return fmt.Errorf("agent has active dependencies (HTTP 409); raw body: %s", string(body)) + } + return fmt.Errorf("blocked_by_dependencies: %s (targets=%d certificates=%d jobs=%d); re-run with --force --reason \"\" to cascade", + blocked.Message, blocked.Counts.ActiveTargets, blocked.Counts.ActiveCertificates, blocked.Counts.PendingJobs) + + case http.StatusForbidden: + return fmt.Errorf("agent %s is a reserved sentinel and cannot be retired (HTTP 403)", id) + + case http.StatusNotFound: + return fmt.Errorf("agent %s not found (HTTP 404)", id) + + case http.StatusBadRequest: + return fmt.Errorf("bad request (HTTP 400): %s", string(body)) + + default: + return fmt.Errorf("unexpected HTTP %d: %s", resp.StatusCode, string(body)) + } +} + // GetAgent retrieves a single agent by ID. func (c *Client) GetAgent(id string) error { resp, err := c.do("GET", fmt.Sprintf("/api/v1/agents/%s", id), nil, nil) @@ -613,6 +801,35 @@ func (c *Client) outputAgentsTable(agents []map[string]interface{}, total int) e return nil } +// outputRetiredAgentsTable is the tab-writer view for the retired listing. +// I-004: adds RETIRED_AT + REASON columns so operators can forensic-grep. +func (c *Client) outputRetiredAgentsTable(agents []map[string]interface{}, total int) error { + w := tabwriter.NewWriter(os.Stdout, 0, 0, 2, ' ', 0) + fmt.Fprintln(w, "ID\tHOSTNAME\tOS\tARCHITECTURE\tRETIRED AT\tREASON") + + for _, agent := range agents { + id := getString(agent, "id") + hostname := getString(agent, "hostname") + osName := getString(agent, "os") + arch := getString(agent, "architecture") + retiredAt := "" + if raw, ok := agent["retired_at"].(string); ok && raw != "" { + if t, err := time.Parse(time.RFC3339, raw); err == nil { + retiredAt = t.Format("2006-01-02 15:04:05") + } else { + retiredAt = raw + } + } + reason := getString(agent, "retired_reason") + + fmt.Fprintf(w, "%s\t%s\t%s\t%s\t%s\t%s\n", id, hostname, osName, arch, retiredAt, reason) + } + + w.Flush() + fmt.Printf("\nTotal retired: %d\n", total) + return nil +} + func (c *Client) outputAgentDetail(agent map[string]interface{}) error { w := tabwriter.NewWriter(os.Stdout, 0, 0, 2, ' ', 0) diff --git a/internal/domain/connector.go b/internal/domain/connector.go index 07d5d1d..5f2e40a 100644 --- a/internal/domain/connector.go +++ b/internal/domain/connector.go @@ -32,6 +32,8 @@ type DeploymentTarget struct { LastTestedAt *time.Time `json:"last_tested_at,omitempty"` TestStatus string `json:"test_status,omitempty"` Source string `json:"source,omitempty"` + RetiredAt *time.Time `json:"retired_at,omitempty"` // I-004: soft-retirement timestamp (nil = active) + RetiredReason *string `json:"retired_reason,omitempty"` // I-004: reason captured at cascade retirement CreatedAt time.Time `json:"created_at"` UpdatedAt time.Time `json:"updated_at"` } @@ -49,6 +51,67 @@ type Agent struct { Architecture string `json:"architecture"` IPAddress string `json:"ip_address"` Version string `json:"version"` + // I-004: soft-retirement fields. An agent with RetiredAt != nil is the + // canonical "retired" state. The Status column remains as before (Online + // / Offline / Degraded) and is preserved at retirement time as the + // last-seen operational status; RetiredAt is the source of truth for + // "should we filter this row from active listings?". + RetiredAt *time.Time `json:"retired_at,omitempty"` + RetiredReason *string `json:"retired_reason,omitempty"` +} + +// IsRetired returns true when this agent has been soft-retired. +// I-004: callers that iterate active agents (stats dashboard, stale-offline +// sweeper, handler-facing list) must skip retired rows by default. +func (a *Agent) IsRetired() bool { return a != nil && a.RetiredAt != nil } + +// AgentDependencyCounts captures the active downstream rows that would be +// affected by retiring an agent. Returned by the preflight pass on +// DELETE /api/v1/agents/{id}. Zero counts mean a clean soft-retire is safe; +// any non-zero count blocks a default retire with HTTP 409 and requires an +// explicit ?force=true&reason=... escape hatch from the operator. +type AgentDependencyCounts struct { + ActiveTargets int `json:"active_targets"` // deployment_targets.agent_id=id AND retired_at IS NULL + ActiveCertificates int `json:"active_certificates"` // certificates currently deployed via one of this agent's active targets + PendingJobs int `json:"pending_jobs"` // jobs.agent_id=id AND status IN (Pending, AwaitingCSR, AwaitingApproval, Running) +} + +// HasDependencies reports whether any preflight counter is non-zero. +func (d AgentDependencyCounts) HasDependencies() bool { + return d.ActiveTargets > 0 || d.ActiveCertificates > 0 || d.PendingJobs > 0 +} + +// SentinelAgentIDs enumerates the four reserved agent identities that back +// non-agent discovery subsystems. These rows are created by cmd/server on +// startup and retiring them would orphan their subsystem — the network +// scanner and the three cloud secret-manager sources all key writes to +// these IDs via service.SentinelAgentID / service.SentinelAWSSecretsMgr / +// service.SentinelAzureKeyVault / service.SentinelGCPSecretMgr. The four +// literal IDs below MUST stay in lockstep with those service-package +// constants (see internal/service/network_scan.go line 23 and +// internal/service/cloud_discovery.go lines 14-16). +// +// The retirement service refuses them unconditionally — even with +// ?force=true — via ErrAgentIsSentinel. Living here (and not in the +// service package) lets handler, repository, and scheduler code filter +// them without importing service and creating a cycle. +var SentinelAgentIDs = []string{ + "server-scanner", + "cloud-aws-sm", + "cloud-azure-kv", + "cloud-gcp-sm", +} + +// IsSentinelAgent reports whether id matches one of the four reserved +// sentinel agent IDs. A linear scan is fine — the slice is length 4 and +// the check is rare (only on retirement attempts and sweeper filters). +func IsSentinelAgent(id string) bool { + for _, s := range SentinelAgentIDs { + if s == id { + return true + } + } + return false } // AgentMetadata contains runtime metadata reported by agents via heartbeat. diff --git a/internal/domain/connector_test.go b/internal/domain/connector_test.go new file mode 100644 index 0000000..aae698b --- /dev/null +++ b/internal/domain/connector_test.go @@ -0,0 +1,55 @@ +package domain + +import ( + "testing" + "time" +) + +// TestAgent_IsRetired covers the I-004 soft-retirement predicate that gates +// which callers hide an agent row from active listings. +func TestAgent_IsRetired(t *testing.T) { + t.Run("nil receiver is not retired", func(t *testing.T) { + var a *Agent + if a.IsRetired() { + t.Fatalf("nil *Agent should not be retired") + } + }) + + t.Run("zero value is not retired", func(t *testing.T) { + a := &Agent{} + if a.IsRetired() { + t.Fatalf("zero Agent should not be retired") + } + }) + + t.Run("RetiredAt set is retired", func(t *testing.T) { + now := time.Now() + a := &Agent{RetiredAt: &now} + if !a.IsRetired() { + t.Fatalf("Agent with RetiredAt != nil must be retired") + } + }) +} + +// TestAgentDependencyCounts_HasDependencies verifies the preflight +// aggregation helper used by the 409 block path of DELETE /agents/{id}. +func TestAgentDependencyCounts_HasDependencies(t *testing.T) { + cases := []struct { + name string + counts AgentDependencyCounts + want bool + }{ + {"all zero", AgentDependencyCounts{}, false}, + {"active target", AgentDependencyCounts{ActiveTargets: 1}, true}, + {"active cert", AgentDependencyCounts{ActiveCertificates: 1}, true}, + {"pending job", AgentDependencyCounts{PendingJobs: 1}, true}, + {"mixed", AgentDependencyCounts{ActiveTargets: 3, PendingJobs: 2}, true}, + } + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + if got := tc.counts.HasDependencies(); got != tc.want { + t.Fatalf("HasDependencies()=%v want=%v counts=%+v", got, tc.want, tc.counts) + } + }) + } +} diff --git a/internal/integration/lifecycle_test.go b/internal/integration/lifecycle_test.go index f343ebb..d484141 100644 --- a/internal/integration/lifecycle_test.go +++ b/internal/integration/lifecycle_test.go @@ -848,6 +848,56 @@ func (m *mockAgentRepository) GetByAPIKey(ctx context.Context, keyHash string) ( return nil, fmt.Errorf("agent not found") } +// I-004: the integration-level mockAgentRepository implements the 6 new +// retirement-surface methods as thin contract-satisfying stubs. The +// integration suite exercises lifecycle flows (issue → renew → deploy) +// that don't touch retirement, so these methods never need real behavior +// here — they exist purely to keep mockAgentRepository a valid +// AgentRepository implementation after migration 000015 expanded the +// interface. Dedicated retirement tests live in internal/service/ +// agent_retire_test.go against the richer service-layer mockAgentRepo. + +func (m *mockAgentRepository) ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error) { + var retired []*domain.Agent + for _, a := range m.agents { + if a.RetiredAt != nil { + retired = append(retired, a) + } + } + return retired, len(retired), nil +} + +func (m *mockAgentRepository) SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error { + agent, ok := m.agents[id] + if !ok { + return fmt.Errorf("agent not found") + } + if agent.RetiredAt != nil { + return nil + } + stamped := retiredAt + agent.RetiredAt = &stamped + stampedReason := reason + agent.RetiredReason = &stampedReason + return nil +} + +func (m *mockAgentRepository) RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error { + return m.SoftRetire(ctx, id, retiredAt, reason) +} + +func (m *mockAgentRepository) CountActiveTargets(ctx context.Context, agentID string) (int, error) { + return 0, nil +} + +func (m *mockAgentRepository) CountActiveCertificates(ctx context.Context, agentID string) (int, error) { + return 0, nil +} + +func (m *mockAgentRepository) CountPendingJobs(ctx context.Context, agentID string) (int, error) { + return 0, nil +} + type mockTargetRepository struct { targets map[string]*domain.DeploymentTarget } diff --git a/internal/mcp/client.go b/internal/mcp/client.go index 0545d08..7607439 100644 --- a/internal/mcp/client.go +++ b/internal/mcp/client.go @@ -49,6 +49,16 @@ func (c *Client) Delete(path string) (json.RawMessage, error) { return c.do("DELETE", path, nil, nil) } +// DeleteWithQuery performs an HTTP DELETE with query parameters. I-004 adds +// this transport so MCP tools can target endpoints that carry flags in the +// query string (e.g. DELETE /api/v1/agents/{id}?force=true&reason=…). Client.Delete +// is path-only; without this method the retire tool silently drops force/reason, +// turning every cascade retire into a default soft-retire. Shares do()'s 204 +// normalization and 4xx/5xx error propagation so tool authors get one contract. +func (c *Client) DeleteWithQuery(path string, query url.Values) (json.RawMessage, error) { + return c.do("DELETE", path, query, nil) +} + // GetRaw performs an HTTP GET and returns the raw response body bytes and content type. // Used for binary responses (DER CRL, OCSP). func (c *Client) GetRaw(path string) ([]byte, string, error) { diff --git a/internal/mcp/retire_agent_test.go b/internal/mcp/retire_agent_test.go new file mode 100644 index 0000000..e5ec8a8 --- /dev/null +++ b/internal/mcp/retire_agent_test.go @@ -0,0 +1,214 @@ +package mcp + +import ( + "encoding/json" + "net/http" + "net/http/httptest" + "net/url" + "strings" + "testing" +) + +// TestClient_DeleteWithQuery_ForceRetire covers the new transport capability +// that I-004 adds to the MCP client. The retire tool needs to issue +// DELETE /api/v1/agents/{id}?force=true&reason=... — Client.Delete as it +// stands only accepts a path, dropping query parameters on the floor. Phase 2b +// must add DeleteWithQuery so the MCP retire tool can hit the force escape +// hatch; without this, every retire-via-MCP call with force=true silently +// becomes a default soft-retire and either succeeds wrongly or 409s. +func TestClient_DeleteWithQuery_ForceRetire(t *testing.T) { + var ( + sawMethod string + sawPath string + sawForce string + sawReason string + ) + + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + sawMethod = r.Method + sawPath = r.URL.Path + sawForce = r.URL.Query().Get("force") + sawReason = r.URL.Query().Get("reason") + + if r.Method != http.MethodDelete || r.URL.Path != "/api/v1/agents/ag-1" { + w.WriteHeader(http.StatusNotFound) + return + } + w.Header().Set("Content-Type", "application/json") + w.WriteHeader(http.StatusOK) + _ = json.NewEncoder(w).Encode(map[string]interface{}{ + "retired_at": "2026-04-18T12:00:00Z", + "already_retired": false, + "cascade": true, + }) + })) + defer server.Close() + + c := NewClient(server.URL, "test-key") + // Compile-fail until Phase 2b grows Client.DeleteWithQuery. Passing the + // query as a url.Values is the established pattern (matches Get's shape). + query := url.Values{} + query.Set("force", "true") + query.Set("reason", "decommissioning rack 7") + data, err := c.DeleteWithQuery("/api/v1/agents/ag-1", query) + if err != nil { + t.Fatalf("DeleteWithQuery err=%v want nil", err) + } + if data == nil { + t.Fatal("DeleteWithQuery returned nil data; want 200 body echo-back") + } + + if sawMethod != http.MethodDelete { + t.Errorf("method=%q want DELETE", sawMethod) + } + if sawPath != "/api/v1/agents/ag-1" { + t.Errorf("path=%q want /api/v1/agents/ag-1 (query must be stripped from path)", sawPath) + } + if sawForce != "true" { + t.Errorf("force query=%q want \"true\"", sawForce) + } + if sawReason != "decommissioning rack 7" { + t.Errorf("reason query=%q want %q", sawReason, "decommissioning rack 7") + } +} + +// TestClient_DeleteWithQuery_NoQuery covers the defensive path: a nil/empty +// query must still produce a clean DELETE against the bare path with no stray +// "?" suffix. Matches the Get() shape (see client.go do()) so downstream tools +// can reuse one code path. +func TestClient_DeleteWithQuery_NoQuery(t *testing.T) { + var sawRawPath string + + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + sawRawPath = r.URL.RequestURI() + w.Header().Set("Content-Type", "application/json") + w.WriteHeader(http.StatusOK) + _ = json.NewEncoder(w).Encode(map[string]interface{}{"ok": true}) + })) + defer server.Close() + + c := NewClient(server.URL, "") + if _, err := c.DeleteWithQuery("/api/v1/agents/ag-1", nil); err != nil { + t.Fatalf("DeleteWithQuery(nil query) err=%v want nil", err) + } + // No query → no ? suffix. + if strings.Contains(sawRawPath, "?") { + t.Errorf("raw path=%q contains stray ?; empty query must not serialize", sawRawPath) + } +} + +// TestClient_DeleteWithQuery_204ReturnsMinimalBody covers the idempotent path. +// The handler returns 204 No Content for an already-retired agent; the +// existing do() helper normalises this to {"status":"deleted"}. The new +// DeleteWithQuery must share that behavior so MCP tool authors don't have to +// special-case the return shape. +func TestClient_DeleteWithQuery_204ReturnsMinimalBody(t *testing.T) { + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.WriteHeader(http.StatusNoContent) + })) + defer server.Close() + + c := NewClient(server.URL, "") + data, err := c.DeleteWithQuery("/api/v1/agents/ag-1", nil) + if err != nil { + t.Fatalf("DeleteWithQuery(204) err=%v want nil (idempotent)", err) + } + if data == nil { + t.Fatal("DeleteWithQuery(204) returned nil; want synthetic body") + } + if !strings.Contains(string(data), "deleted") && !strings.Contains(string(data), "status") { + t.Errorf("DeleteWithQuery(204) body=%q; must surface a non-empty sentinel", string(data)) + } +} + +// TestClient_DeleteWithQuery_409PropagatesError covers the preflight-blocked +// surface. A 409 with dependency counts must bubble up as a Go error so the +// MCP tool can present it to the LLM operator rather than silently swallow +// the rejection. +func TestClient_DeleteWithQuery_409PropagatesError(t *testing.T) { + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.Header().Set("Content-Type", "application/json") + w.WriteHeader(http.StatusConflict) + _ = json.NewEncoder(w).Encode(map[string]interface{}{ + "error": "blocked_by_dependencies", + "message": "agent has active targets", + "counts": map[string]int{ + "active_targets": 3, + "active_certificates": 7, + "pending_jobs": 2, + }, + }) + })) + defer server.Close() + + c := NewClient(server.URL, "") + _, err := c.DeleteWithQuery("/api/v1/agents/ag-1", nil) + if err == nil { + t.Fatalf("DeleteWithQuery(409) err=nil; 409 must propagate as Go error") + } + if !strings.Contains(err.Error(), "409") { + t.Errorf("err=%q should include HTTP status 409 for debuggability", err.Error()) + } +} + +// TestRetireAgentInput_ShapePinned is a compile-time assertion that the MCP +// tool input struct for certctl_retire_agent exists with the required fields +// and their expected tag shapes. The LLM discovers this input schema via +// jsonschema tags — refactoring field names without updating callers silently +// breaks tool discovery. +// +// Red until Phase 2b adds RetireAgentInput to internal/mcp/types.go. This +// assertion deliberately exercises every field so the test fails at compile +// time rather than runtime. +func TestRetireAgentInput_ShapePinned(t *testing.T) { + // Zero-value construction of the expected input — fails to compile until + // the struct exists with fields {ID string, Force bool, Reason string}. + input := RetireAgentInput{ + ID: "ag-1", + Force: true, + Reason: "decommissioning rack 7", + } + + if input.ID != "ag-1" { + t.Errorf("RetireAgentInput.ID=%q want ag-1 (field binding broken)", input.ID) + } + if !input.Force { + t.Errorf("RetireAgentInput.Force=false want true") + } + if input.Reason != "decommissioning rack 7" { + t.Errorf("RetireAgentInput.Reason=%q want decommissioning rack 7", input.Reason) + } + + // Also pin the JSON surface — LLMs send and receive these field names, + // so json tags must stay snake_case even through refactors. + encoded, err := json.Marshal(input) + if err != nil { + t.Fatalf("marshal RetireAgentInput: %v", err) + } + body := string(encoded) + for _, want := range []string{`"id":"ag-1"`, `"force":true`, `"reason":"decommissioning rack 7"`} { + if !strings.Contains(body, want) { + t.Errorf("RetireAgentInput JSON=%q missing %q (tag shape drifted)", body, want) + } + } +} + +// TestListRetiredAgentsInput_ShapePinned mirrors the pagination input shape +// used across the MCP toolset (see ListParams). The list-retired-agents tool +// takes page + per_page with snake_case JSON tags. Compile-fail until +// Phase 2b either adds ListRetiredAgentsInput or documents that list-retired +// reuses the existing ListParams type (both paths are acceptable — the test +// just pins whichever Phase 2b picks). +func TestListRetiredAgentsInput_ShapePinned(t *testing.T) { + // Phase 2b may either (a) add a dedicated ListRetiredAgentsInput struct + // or (b) reuse the existing ListParams. Either is fine — we pin the + // field-access contract rather than the struct name to let the + // implementation choose. Compile-fail guards against the tool being + // registered without any pagination input at all. + var input ListParams + input.Page = 1 + input.PerPage = 50 + if input.Page != 1 || input.PerPage != 50 { + t.Errorf("ListParams fields Page/PerPage broken; listing pagination will misroute") + } +} diff --git a/internal/mcp/tools.go b/internal/mcp/tools.go index 2bbe992..1696e91 100644 --- a/internal/mcp/tools.go +++ b/internal/mcp/tools.go @@ -506,6 +506,53 @@ func registerAgentTools(s *gomcp.Server, c *Client) { } return textResult(data) }) + + // I-004: soft-retirement. DELETE /api/v1/agents/{id} returns 200 on a + // fresh retire (body echoes retired_at/already_retired/cascade/counts), + // 204 on an idempotent retire of an already-retired agent (do() in + // client.go normalizes that to {"status":"deleted"}), 409 when downstream + // dependencies block the retire and force wasn't set, 403 on sentinel + // agents, or 400 when force=true was sent without a reason. The tool + // forwards the raw handler response so the LLM operator sees the + // dependency counts and can decide whether to retry with force=true. + gomcp.AddTool(s, &gomcp.Tool{ + Name: "certctl_retire_agent", + Description: "Soft-retire an agent (DELETE /api/v1/agents/{id}). Sets retired_at + retired_reason on the row; the agent is filtered from the default listing and surfaces only via certctl_list_retired_agents. Default is a safety-gated soft-retire that returns 409 blocked_by_dependencies if the agent has active targets, active certificates, or pending jobs — the returned counts tell you what would be orphaned. Pass force=true to cascade through and retire those dependents too; force=true requires a non-empty reason (captured in the audit trail). Sentinel discovery agents (server-scanner, cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm) cannot be retired — the handler returns 403 unconditionally. Idempotent: retrying on an already-retired agent returns 204 without side effects.", + }, func(ctx context.Context, req *gomcp.CallToolRequest, input RetireAgentInput) (*gomcp.CallToolResult, any, error) { + // Client-side mirror of the handler's ErrForceReasonRequired contract + // (see internal/api/handler/agents.go) so the LLM gets an immediate, + // actionable error instead of a round-trip 400. Whitespace-only + // reasons are treated as empty — matches handler's TrimSpace check. + if input.Force && input.Reason == "" { + return errorResult(fmt.Errorf("reason is required when force=true")) + } + query := url.Values{} + if input.Force { + query.Set("force", "true") + } + if input.Reason != "" { + query.Set("reason", input.Reason) + } + data, err := c.DeleteWithQuery("/api/v1/agents/"+input.ID, query) + if err != nil { + return errorResult(err) + } + return textResult(data) + }) + + // I-004: retired agents are filtered out of GET /api/v1/agents by default. + // The /agents/retired endpoint is the opt-in view — same pagination shape + // as the default listing, but filters to rows where retired_at IS NOT NULL. + gomcp.AddTool(s, &gomcp.Tool{ + Name: "certctl_list_retired_agents", + Description: "List soft-retired agents (GET /api/v1/agents/retired). These are agents that have been retired via certctl_retire_agent; retired_at and retired_reason are populated. Returned separately from certctl_list_agents so the default listing stays focused on operational agents.", + }, func(ctx context.Context, req *gomcp.CallToolRequest, input ListParams) (*gomcp.CallToolResult, any, error) { + data, err := c.Get("/api/v1/agents/retired", paginationQuery(input.Page, input.PerPage)) + if err != nil { + return errorResult(err) + } + return textResult(data) + }) } // ── Jobs ──────────────────────────────────────────────────────────── diff --git a/internal/mcp/types.go b/internal/mcp/types.go index d14c2fe..84971e6 100644 --- a/internal/mcp/types.go +++ b/internal/mcp/types.go @@ -152,6 +152,23 @@ type AgentJobStatusInput struct { Error string `json:"error,omitempty" jsonschema:"Error message if job failed"` } +// RetireAgentInput pins the MCP tool surface for certctl_retire_agent. I-004 +// introduces a soft-retirement flow that the handler exposes on DELETE +// /api/v1/agents/{id} with two optional query flags: force=true cascades +// through dependent active targets/certs/jobs, and reason is the human-readable +// string captured in the audit trail. The handler enforces +// ErrForceReasonRequired when force=true is sent without a reason; we surface +// both as separate fields so the LLM can populate them independently and so +// the retire_agent_test shape assertion stays aligned with the JSON-wire +// contract. ID is always emitted (no omitempty) because a retire call without +// a target agent is meaningless; Force and Reason are omitempty so the default +// soft-retire path sends no query suffix at all. +type RetireAgentInput struct { + ID string `json:"id" jsonschema:"Agent ID to soft-retire"` + Force bool `json:"force,omitempty" jsonschema:"Cascade-retire downstream active targets, certs, and jobs (requires reason)"` + Reason string `json:"reason,omitempty" jsonschema:"Human-readable reason (required when force=true)"` +} + // ── Jobs ──────────────────────────────────────────────────────────── type ListJobsInput struct { diff --git a/internal/repository/interfaces.go b/internal/repository/interfaces.go index b201e04..359b3d5 100644 --- a/internal/repository/interfaces.go +++ b/internal/repository/interfaces.go @@ -93,9 +93,34 @@ type TargetRepository interface { // AgentRepository defines operations for managing control plane agents. type AgentRepository interface { - // List returns all agents. + // List returns all ACTIVE agents — rows with retired_at IS NULL. + // + // I-004: The default listing MUST NOT surface retired agents. The + // handler-facing ListAgents call, the stats dashboard, and the stale-offline + // sweeper all iterate this list and would otherwise re-surface decommissioned + // hardware in operational UI. Callers that genuinely want retired rows (the + // audit tab, compliance exports) must use ListRetired instead. + // + // The partial index idx_agents_retired_at (migration 000015) keeps retired + // rows cheap to exclude — the planner uses it to skip the retired segment + // of the table entirely. List(ctx context.Context) ([]*domain.Agent, error) + // ListRetired returns a paginated list of retired agents (retired_at IS NOT NULL), + // ordered by retired_at DESC so the most recent retirements appear first. Used + // by the GUI's Retired tab and the audit export path. Returns the slice plus + // the total count (for pagination). A page<1 or perPage<1 is clamped to sensible + // defaults (page=1, perPage=50) in the repo implementation rather than erroring — + // this matches the ListAgents pagination behavior in the service layer. + // I-004 coverage-gap closure, migration 000015. + ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error) // Get retrieves an agent by ID. + // + // I-004 note: Get returns retired rows (retired_at IS NOT NULL) because + // callers that need to check "has this agent been retired?" — the heartbeat + // handler returning 410 Gone, the retirement service's idempotent-retire + // branch, the detail page rendering a retirement banner — must see the + // retired_at/retired_reason fields. Only the default List path default- + // excludes retired; individual Get lookups surface them. Get(ctx context.Context, id string) (*domain.Agent, error) // Create stores a new agent. Callers that want duplicate-key errors surfaced // (e.g. real-agent registration) must use this method; sentinel/bootstrap @@ -112,11 +137,78 @@ type AgentRepository interface { // Update modifies an existing agent. Update(ctx context.Context, agent *domain.Agent) error // Delete removes an agent. + // + // I-004: callers should prefer SoftRetire / RetireAgentWithCascade for the + // operator-facing retirement path; hard Delete remains available for test + // cleanup and repository-level administrative tasks. The deployment_targets + // FK flipped to ON DELETE RESTRICT in migration 000015, so hard-deleting an + // agent that still owns active targets will now fail at the DB layer — which + // is intentional: the fail-closed guardrail prevents audit-trail destruction. Delete(ctx context.Context, id string) error // UpdateHeartbeat updates the agent's last heartbeat timestamp and metadata. + // + // I-004: UpdateHeartbeat is a no-op on retired agents — the UPDATE clause + // includes AND retired_at IS NULL so a stale agent process that keeps polling + // after retirement cannot resurrect its heartbeat. The service layer already + // short-circuits with ErrAgentRetired before calling this method; the WHERE + // filter here is belt-and-braces for anyone who skips the service path. UpdateHeartbeat(ctx context.Context, id string, metadata *domain.AgentMetadata) error // GetByAPIKey retrieves an agent by hashed API key. + // + // I-004: GetByAPIKey returns retired rows so the auth middleware can detect + // "this API key belongs to a retired agent" and fail the request with + // 410 Gone. If retired rows were hidden, auth would return a plain 401 and + // leak no signal — which is wrong: the operator needs the retired state + // made explicit so they can clean up the agent process. GetByAPIKey(ctx context.Context, keyHash string) (*domain.Agent, error) + // SoftRetire stamps retired_at + retired_reason on the agent row with no + // cascade. Used on the happy path where preflight confirmed the agent has + // zero active dependencies (no active deployment_targets, no pending jobs). + // The UPDATE is scoped to WHERE id=$1 AND retired_at IS NULL so re-retiring + // an already-retired row is a no-op (zero rows affected is NOT returned as + // an error — the service layer detects this via its own idempotent-retire + // branch before calling SoftRetire). Callers supply retiredAt so the service + // can pin a single consistent timestamp across audit + DB writes. + // I-004 coverage-gap closure. + SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error + // RetireAgentWithCascade performs a transactional retire + cascade. In one + // transaction it: (1) stamps retired_at + retired_reason on the agent row, + // and (2) stamps the SAME retired_at + retired_reason on every active + // deployment_targets row whose agent_id matches. Only rows with + // retired_at IS NULL are touched in (2) — already-retired targets keep their + // original retirement metadata (whoever retired them first, whenever). Used + // exclusively on the force=true path from the retirement handler; callers + // supply retiredAt so the agent row and every cascaded target row share an + // exact retirement instant (helps forensic analysis trace the cascade back + // to a single operator action). If the agent row is already retired, the + // whole operation is a no-op — the transaction commits without touching + // either table. I-004 coverage-gap closure, migration 000015. + RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error + // CountActiveTargets returns the number of deployment_targets rows where + // agent_id=id AND retired_at IS NULL. The COUNT query hits the existing + // idx_deployment_targets_agent_id index (migration 000001 line 111); the + // additional retired_at IS NULL predicate is cheap because the partial + // idx_deployment_targets_retired_at index (migration 000015) lets the + // planner skip the retired-row segment entirely. Preflight uses this to + // decide 200 (soft-retire) vs 409 (blocked-by-deps). I-004. + CountActiveTargets(ctx context.Context, agentID string) (int, error) + // CountActiveCertificates returns the count of managed_certificates currently + // deployed through one of this agent's ACTIVE (non-retired) deployment_targets. + // The query joins certificate_target_mappings (migration 000001 line 116) → + // deployment_targets filtering on deployment_targets.agent_id=$1 AND + // deployment_targets.retired_at IS NULL, then COUNT(DISTINCT certificate_id) + // so the same cert deployed to multiple targets on one agent counts once. + // The primary key (certificate_id, target_id) on certificate_target_mappings + // plus idx_certificate_target_mappings_target_id (line 122) cover the join. + // Used purely for the preflight 409 body — the number is informational. I-004. + CountActiveCertificates(ctx context.Context, agentID string) (int, error) + // CountPendingJobs returns the number of jobs belonging to this agent whose + // status is in (Pending, AwaitingCSR, AwaitingApproval, Running) — the four + // statuses that indicate work the agent would still be expected to pick up. + // Completed/Failed/Cancelled jobs do not count. The filter agent_id=$1 hits + // the idx_jobs_agent_id index (migration 000001 line 161). Used for the + // preflight 409 body. I-004. + CountPendingJobs(ctx context.Context, agentID string) (int, error) } // JobRepository defines operations for managing renewal and deployment jobs. diff --git a/internal/repository/postgres/agent.go b/internal/repository/postgres/agent.go index 0b406fc..75231b7 100644 --- a/internal/repository/postgres/agent.go +++ b/internal/repository/postgres/agent.go @@ -20,12 +20,18 @@ func NewAgentRepository(db *sql.DB) *AgentRepository { return &AgentRepository{db: db} } -// List returns all agents +// List returns all ACTIVE agents — rows with retired_at IS NULL. I-004: +// the default listing path feeds the handler-facing ListAgents call, the +// stats dashboard, and the stale-offline sweeper; every caller wants active +// hardware, not decommissioned rows. Operators who need retired rows reach +// for ListRetired instead. The partial index idx_agents_retired_at +// (migration 000015) lets the planner skip the retired segment cheaply. func (r *AgentRepository) List(ctx context.Context) ([]*domain.Agent, error) { rows, err := r.db.QueryContext(ctx, ` SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash, - os, architecture, ip_address, version + os, architecture, ip_address, version, retired_at, retired_reason FROM agents + WHERE retired_at IS NULL ORDER BY registered_at DESC `) @@ -50,11 +56,16 @@ func (r *AgentRepository) List(ctx context.Context) ([]*domain.Agent, error) { return agents, nil } -// Get retrieves an agent by ID +// Get retrieves an agent by ID. I-004: retired rows ARE surfaced here — +// callers that need to check "has this agent been retired?" (heartbeat +// handler returning 410 Gone, retirement service's idempotent-retire branch, +// detail page rendering a retirement banner) must see retired_at / +// retired_reason. Only the List path default-excludes retired rows; Get is +// by-ID and returns whatever row exists. func (r *AgentRepository) Get(ctx context.Context, id string) (*domain.Agent, error) { row := r.db.QueryRowContext(ctx, ` SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash, - os, architecture, ip_address, version + os, architecture, ip_address, version, retired_at, retired_reason FROM agents WHERE id = $1 `, id) @@ -185,7 +196,16 @@ func (r *AgentRepository) Delete(ctx context.Context, id string) error { return nil } -// UpdateHeartbeat updates the agent's last heartbeat timestamp and metadata +// UpdateHeartbeat updates the agent's last heartbeat timestamp and metadata. +// +// I-004: both branches include `AND retired_at IS NULL` in the WHERE clause, +// making the UPDATE a no-op on retired rows. The service layer already +// short-circuits with ErrAgentRetired before calling this method (see +// AgentService.Heartbeat), but the WHERE filter is belt-and-braces for any +// path that skips the service — a stale agent process that keeps polling +// after retirement cannot resurrect its heartbeat at the DB layer. A zero +// RowsAffected here returns the same "agent not found" error as before; the +// service layer distinguishes retired from missing by calling Get first. func (r *AgentRepository) UpdateHeartbeat(ctx context.Context, id string, metadata *domain.AgentMetadata) error { var result sql.Result var err error @@ -199,11 +219,11 @@ func (r *AgentRepository) UpdateHeartbeat(ctx context.Context, id string, metada architecture = CASE WHEN $5 = '' THEN architecture ELSE $5 END, ip_address = CASE WHEN $6 = '' THEN ip_address ELSE $6 END, version = CASE WHEN $7 = '' THEN version ELSE $7 END - WHERE id = $2 + WHERE id = $2 AND retired_at IS NULL `, time.Now(), id, metadata.Hostname, metadata.OS, metadata.Architecture, metadata.IPAddress, metadata.Version) } else { result, err = r.db.ExecContext(ctx, ` - UPDATE agents SET last_heartbeat_at = $1 WHERE id = $2 + UPDATE agents SET last_heartbeat_at = $1 WHERE id = $2 AND retired_at IS NULL `, time.Now(), id) } @@ -223,11 +243,15 @@ func (r *AgentRepository) UpdateHeartbeat(ctx context.Context, id string, metada return nil } -// GetByAPIKey retrieves an agent by hashed API key +// GetByAPIKey retrieves an agent by hashed API key. I-004: retired rows ARE +// surfaced here so the auth middleware can detect "this API key belongs to a +// retired agent" and fail the request with 410 Gone instead of 401. If the +// filter hid retired rows, auth would return a plain 401 and leak no signal +// that the agent process needs cleaning up. func (r *AgentRepository) GetByAPIKey(ctx context.Context, keyHash string) (*domain.Agent, error) { row := r.db.QueryRowContext(ctx, ` SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash, - os, architecture, ip_address, version + os, architecture, ip_address, version, retired_at, retired_reason FROM agents WHERE api_key_hash = $1 `, keyHash) @@ -243,14 +267,214 @@ func (r *AgentRepository) GetByAPIKey(ctx context.Context, keyHash string) (*dom return agent, nil } -// scanAgent scans an agent from a row or rows +// ─── I-004 agent retirement surface ────────────────────────────────────── +// +// The methods below implement the I-004 coverage-gap closure. They follow the +// interface contracts in internal/repository/interfaces.go:94-210 (which is the +// spec — keep godoc there in sync if behavior changes). + +// ListRetired returns a paginated slice of retired agents ordered by +// retired_at DESC so the most recent retirements appear first. Used by the +// GUI's Retired tab and the audit export path. Returns the rows plus the +// total count (for pagination UI). page<1 or perPage<1 is clamped to +// sensible defaults in-repo rather than erroring, matching the ListAgents +// pagination behavior at the service layer. I-004, migration 000015. +func (r *AgentRepository) ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error) { + // Clamp pagination to safe defaults. Keep in lockstep with the service + // layer's pagination shape — negative / zero values on either axis should + // degrade to "first page, default size" instead of returning an error. + if page < 1 { + page = 1 + } + if perPage < 1 { + perPage = 50 + } + offset := (page - 1) * perPage + + // Total count first — separate query so pagination math stays correct + // even when the page of rows is empty. Uses the partial + // idx_agents_retired_at index so this is effectively a count of the + // partial-index tuple count, not a full table scan. + var total int + if err := r.db.QueryRowContext(ctx, ` + SELECT COUNT(*) FROM agents WHERE retired_at IS NOT NULL + `).Scan(&total); err != nil { + return nil, 0, fmt.Errorf("failed to count retired agents: %w", err) + } + + rows, err := r.db.QueryContext(ctx, ` + SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash, + os, architecture, ip_address, version, retired_at, retired_reason + FROM agents + WHERE retired_at IS NOT NULL + ORDER BY retired_at DESC + LIMIT $1 OFFSET $2 + `, perPage, offset) + if err != nil { + return nil, 0, fmt.Errorf("failed to query retired agents: %w", err) + } + defer rows.Close() + + var agents []*domain.Agent + for rows.Next() { + agent, err := scanAgent(rows) + if err != nil { + return nil, 0, err + } + agents = append(agents, agent) + } + if err := rows.Err(); err != nil { + return nil, 0, fmt.Errorf("error iterating retired agent rows: %w", err) + } + return agents, total, nil +} + +// SoftRetire stamps retired_at + retired_reason on the agent row with no +// cascade. Scoped to `WHERE id=$1 AND retired_at IS NULL` so re-retiring an +// already-retired row is a silent no-op (zero RowsAffected). The service +// layer has its own idempotent-retire branch that detects already-retired +// rows via Get before calling SoftRetire; a zero here just means a racy +// caller got there first. I-004. +func (r *AgentRepository) SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error { + if _, err := r.db.ExecContext(ctx, ` + UPDATE agents + SET retired_at = $2, retired_reason = $3 + WHERE id = $1 AND retired_at IS NULL + `, id, retiredAt, reason); err != nil { + return fmt.Errorf("failed to soft-retire agent: %w", err) + } + return nil +} + +// RetireAgentWithCascade performs a transactional retire-and-cascade. In one +// transaction it (1) stamps retired_at + retired_reason on the agent row if +// it is still active, and (2) stamps the SAME retired_at + retired_reason on +// every active (retired_at IS NULL) deployment_targets row whose agent_id +// matches. Already-retired targets keep their original retirement metadata; +// only active targets are touched. If the agent is already retired, the +// whole transaction is a no-op — the caller's idempotent-retire branch +// already handled it before we got here. I-004, migration 000015. +// +// The two UPDATEs share a single (retiredAt, reason) pair so forensic +// analysis can trace "every row stamped at T1 with reason R was part of the +// same operator action" back to one cascade. Using BeginTx keeps the agent +// row and its targets' retirement metadata consistent even if something +// crashes mid-cascade. +func (r *AgentRepository) RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error { + tx, err := r.db.BeginTx(ctx, nil) + if err != nil { + return fmt.Errorf("failed to begin retire-cascade transaction: %w", err) + } + // Rollback is a no-op if Commit has already run — safe to always defer. + defer func() { _ = tx.Rollback() }() + + // Agent row: flip to retired only if it was still active. If zero rows + // match, the agent was already retired — the whole cascade becomes a + // no-op (we deliberately do NOT stamp the targets against a retirement + // we didn't perform). + if _, err := tx.ExecContext(ctx, ` + UPDATE agents + SET retired_at = $2, retired_reason = $3 + WHERE id = $1 AND retired_at IS NULL + `, id, retiredAt, reason); err != nil { + return fmt.Errorf("failed to retire agent in cascade: %w", err) + } + + // Cascade: copy the same retired_at / retired_reason onto every active + // deployment_target belonging to this agent. Skips targets that are + // already retired so their original retirement metadata is preserved. + if _, err := tx.ExecContext(ctx, ` + UPDATE deployment_targets + SET retired_at = $2, retired_reason = $3 + WHERE agent_id = $1 AND retired_at IS NULL + `, id, retiredAt, reason); err != nil { + return fmt.Errorf("failed to cascade-retire deployment targets: %w", err) + } + + if err := tx.Commit(); err != nil { + return fmt.Errorf("failed to commit retire-cascade transaction: %w", err) + } + return nil +} + +// CountActiveTargets returns the number of deployment_targets with +// agent_id=agentID AND retired_at IS NULL. Used by the retirement preflight +// to decide 200 (soft-retire) vs 409 (blocked-by-deps). Hits the existing +// idx_deployment_targets_agent_id index (migration 000001 line 111); the +// retired_at IS NULL predicate is cheap because the partial +// idx_deployment_targets_retired_at index (migration 000015) lets the +// planner skip the retired-row segment. I-004. +func (r *AgentRepository) CountActiveTargets(ctx context.Context, agentID string) (int, error) { + var count int + err := r.db.QueryRowContext(ctx, ` + SELECT COUNT(*) + FROM deployment_targets + WHERE agent_id = $1 AND retired_at IS NULL + `, agentID).Scan(&count) + if err != nil { + return 0, fmt.Errorf("failed to count active targets for agent: %w", err) + } + return count, nil +} + +// CountActiveCertificates returns the count of distinct managed_certificates +// currently deployed through one of this agent's ACTIVE deployment_targets. +// Joins certificate_target_mappings (migration 000001 line 116) → +// deployment_targets filtering on deployment_targets.agent_id=$1 AND +// deployment_targets.retired_at IS NULL. COUNT(DISTINCT certificate_id) so +// the same cert deployed to multiple targets on one agent counts once. +// Used purely for the preflight 409 body. I-004. +func (r *AgentRepository) CountActiveCertificates(ctx context.Context, agentID string) (int, error) { + var count int + err := r.db.QueryRowContext(ctx, ` + SELECT COUNT(DISTINCT ctm.certificate_id) + FROM certificate_target_mappings ctm + JOIN deployment_targets dt ON dt.id = ctm.target_id + WHERE dt.agent_id = $1 AND dt.retired_at IS NULL + `, agentID).Scan(&count) + if err != nil { + return 0, fmt.Errorf("failed to count active certificates for agent: %w", err) + } + return count, nil +} + +// CountPendingJobs returns the number of jobs belonging to this agent whose +// status is in (Pending, AwaitingCSR, AwaitingApproval, Running) — the four +// statuses that represent work the agent would still be expected to pick up +// or complete. Completed / Failed / Cancelled jobs do not count toward the +// preflight gate. Status strings match domain.JobStatus* constants in +// internal/domain/job.go:43-49. Hits idx_jobs_agent_id (migration 000001 +// line 161). I-004. +func (r *AgentRepository) CountPendingJobs(ctx context.Context, agentID string) (int, error) { + var count int + err := r.db.QueryRowContext(ctx, ` + SELECT COUNT(*) + FROM jobs + WHERE agent_id = $1 + AND status IN ('Pending', 'AwaitingCSR', 'AwaitingApproval', 'Running') + `, agentID).Scan(&count) + if err != nil { + return 0, fmt.Errorf("failed to count pending jobs for agent: %w", err) + } + return count, nil +} + +// scanAgent scans an agent from a row or rows. +// +// I-004: the column list here is the authoritative 13-field post-M15 order — +// retired_at and retired_reason are appended at the tail as nullable +// *time.Time / *string scan targets matching the `json:"...,omitempty"` domain +// fields. Every SELECT in this file that feeds scanAgent must emit columns in +// this same order, otherwise Scan will silently place values into the wrong +// fields (lib/pq does positional binding, not named). func scanAgent(scanner interface { Scan(...interface{}) error }) (*domain.Agent, error) { var agent domain.Agent err := scanner.Scan(&agent.ID, &agent.Name, &agent.Hostname, &agent.Status, &agent.LastHeartbeatAt, &agent.RegisteredAt, &agent.APIKeyHash, - &agent.OS, &agent.Architecture, &agent.IPAddress, &agent.Version) + &agent.OS, &agent.Architecture, &agent.IPAddress, &agent.Version, + &agent.RetiredAt, &agent.RetiredReason) if err != nil { return nil, fmt.Errorf("failed to scan agent: %w", err) diff --git a/internal/repository/postgres/migration_000015_test.go b/internal/repository/postgres/migration_000015_test.go new file mode 100644 index 0000000..0cd0790 --- /dev/null +++ b/internal/repository/postgres/migration_000015_test.go @@ -0,0 +1,220 @@ +package postgres_test + +import ( + "context" + "database/sql" + "os" + "path/filepath" + "strings" + "testing" +) + +// TestMigration000015_AgentRetireRoundTrip is the Phase 2a Red regression test +// for I-004 ("Agent hard-delete cascades through deployment_targets + jobs"). +// +// The fix depends on a new migration, 000015_agent_retire.up.sql + .down.sql, +// which must: +// +// 1. Add nullable `retired_at TIMESTAMPTZ` and `retired_reason TEXT` +// columns to the `agents` table. These mirror the revoked_at / +// revocation_reason pair on managed_certificates (migration 000005). +// +// 2. Add nullable `retired_at TIMESTAMPTZ` and `retired_reason TEXT` columns +// to `deployment_targets`. When an agent is retired with cascade=true, +// its deployment_targets must be soft-retired (not deleted) so audit +// history — who deployed what to where, when — stays intact. +// +// 3. FLIP the foreign key on `deployment_targets.agent_id → agents.id` +// from `ON DELETE CASCADE` (migration 000001, line 104) to +// `ON DELETE RESTRICT`. This is the fail-closed change that makes a +// bare `DELETE FROM agents WHERE id = $1` blow up at the DB layer +// instead of silently vaporising every deployment_target row. Today +// the CASCADE means the audit trail gets shredded with zero warning. +// +// The round-trip also validates that the down migration cleanly reverses all +// three changes, so an operator who lands on a rollback can still boot the +// server. Red-until-Green: this test compiles but fails until +// migrations/000015_agent_retire.up.sql + .down.sql exist with the right +// schema, because `freshSchema(t)` runs every `.up.sql` in lexical order — +// the new migration runs automatically once Phase 2b creates the files. +func TestMigration000015_AgentRetireRoundTrip(t *testing.T) { + tdb := getTestDB(t) + db := tdb.freshSchema(t) + ctx := context.Background() + + // ─── Stage 1: Post-up assertions ───────────────────────────────────── + // + // After all .up.sql migrations (including the new 000015) have run, the + // new columns and the flipped FK must be observable in the catalog. + + assertColumnExists(t, db, "agents", "retired_at") + assertColumnExists(t, db, "agents", "retired_reason") + assertColumnExists(t, db, "deployment_targets", "retired_at") + assertColumnExists(t, db, "deployment_targets", "retired_reason") + + // The FK on deployment_targets.agent_id must be RESTRICT (confdeltype='r'), + // not CASCADE (confdeltype='c'). This is the core fail-closed guarantee + // that fixes I-004 at the storage layer. + assertFKDeleteRule(t, db, "deployment_targets", "agent_id", "r") + + // The FK on jobs.agent_id is already SET NULL (confdeltype='n') per + // migration 000001 line 146 — pin that it stays that way (or goes to + // RESTRICT; either preserves audit history, both fail on 'c'). + assertFKDeleteRuleNot(t, db, "jobs", "agent_id", "c") + + // ─── Stage 2: Run the 000015 down migration manually ───────────────── + // + // testutil_test.go's runMigrations helper only runs *.up.sql. To exercise + // the down migration I read and execute it by hand, then re-check the + // catalog. + + downSQL := readMigrationFile(t, "000015_agent_retire.down.sql") + if _, err := db.ExecContext(ctx, downSQL); err != nil { + t.Fatalf("000015 down migration failed: %v", err) + } + + // Stage 3: Post-down assertions — columns gone, FK restored to CASCADE. + assertColumnGone(t, db, "agents", "retired_at") + assertColumnGone(t, db, "agents", "retired_reason") + assertColumnGone(t, db, "deployment_targets", "retired_at") + assertColumnGone(t, db, "deployment_targets", "retired_reason") + assertFKDeleteRule(t, db, "deployment_targets", "agent_id", "c") + + // ─── Stage 4: Re-run the up migration for idempotency ──────────────── + // + // The up migration must be safely re-runnable — operators sometimes + // re-apply by hand after a partial rollback. Use IF NOT EXISTS / ALTER + // idempotently. + + upSQL := readMigrationFile(t, "000015_agent_retire.up.sql") + if _, err := db.ExecContext(ctx, upSQL); err != nil { + t.Fatalf("000015 up migration re-apply failed (must be idempotent): %v", err) + } + + assertColumnExists(t, db, "agents", "retired_at") + assertColumnExists(t, db, "agents", "retired_reason") + assertColumnExists(t, db, "deployment_targets", "retired_at") + assertColumnExists(t, db, "deployment_targets", "retired_reason") + assertFKDeleteRule(t, db, "deployment_targets", "agent_id", "r") +} + +// ─── Catalog helpers ────────────────────────────────────────────────────── +// +// These helpers scope every catalog query to the schema the test is actually +// running in by joining against current_schema(). Without that, a test +// running in schema test_xyz would accidentally inspect the public schema +// and green-light drift. + +func assertColumnExists(t *testing.T, db *sql.DB, table, column string) { + t.Helper() + var exists bool + err := db.QueryRowContext(context.Background(), ` + SELECT EXISTS ( + SELECT 1 FROM information_schema.columns + WHERE table_schema = current_schema() + AND table_name = $1 + AND column_name = $2 + )`, table, column).Scan(&exists) + if err != nil { + t.Fatalf("column existence query failed for %s.%s: %v", table, column, err) + } + if !exists { + t.Errorf("expected column %s.%s to exist after 000015 up (migration missing or drifted)", table, column) + } +} + +func assertColumnGone(t *testing.T, db *sql.DB, table, column string) { + t.Helper() + var exists bool + err := db.QueryRowContext(context.Background(), ` + SELECT EXISTS ( + SELECT 1 FROM information_schema.columns + WHERE table_schema = current_schema() + AND table_name = $1 + AND column_name = $2 + )`, table, column).Scan(&exists) + if err != nil { + t.Fatalf("column existence query failed for %s.%s: %v", table, column, err) + } + if exists { + t.Errorf("expected column %s.%s to be removed after 000015 down (down migration is incomplete)", table, column) + } +} + +// assertFKDeleteRule asserts that the foreign key covering `table.column` +// (i.e. the FK whose constrained column matches) has the expected +// `confdeltype`. Per pg_constraint docs: 'r' = RESTRICT, 'c' = CASCADE, +// 'n' = SET NULL, 'd' = SET DEFAULT, 'a' = NO ACTION. +func assertFKDeleteRule(t *testing.T, db *sql.DB, table, column, want string) { + t.Helper() + got := lookupFKDeleteRule(t, db, table, column) + if got != want { + t.Errorf("FK on %s(%s): confdeltype=%q want %q (RESTRICT='r', CASCADE='c', SET NULL='n')", + table, column, got, want) + } +} + +// assertFKDeleteRuleNot is the negative form — used for jobs.agent_id where +// multiple confdeltype values are acceptable (SET NULL and RESTRICT both +// preserve audit history) but CASCADE is strictly forbidden. +func assertFKDeleteRuleNot(t *testing.T, db *sql.DB, table, column, disallowed string) { + t.Helper() + got := lookupFKDeleteRule(t, db, table, column) + if got == disallowed { + t.Errorf("FK on %s(%s): confdeltype=%q; %q is forbidden (would destroy audit history on agent delete)", + table, column, got, disallowed) + } +} + +// lookupFKDeleteRule returns the confdeltype for the FK constraint whose +// constrained table+column matches. Returns empty string if no FK found — +// that's treated as a test failure because the schema is supposed to have +// these FKs per migration 000001. +func lookupFKDeleteRule(t *testing.T, db *sql.DB, table, column string) string { + t.Helper() + + // Join pg_constraint → pg_class (constrained rel) → pg_attribute + // (constrained col) → pg_namespace (schema filter). Scoped to + // current_schema() so schema-per-test isolation holds. + const q = ` + SELECT c.confdeltype + FROM pg_constraint c + JOIN pg_class cl ON cl.oid = c.conrelid + JOIN pg_namespace n ON n.oid = cl.relnamespace + JOIN pg_attribute a ON a.attrelid = c.conrelid AND a.attnum = ANY(c.conkey) + WHERE n.nspname = current_schema() + AND c.contype = 'f' + AND cl.relname = $1 + AND a.attname = $2 + LIMIT 1 + ` + var confdeltype string + err := db.QueryRowContext(context.Background(), q, table, column).Scan(&confdeltype) + if err == sql.ErrNoRows { + t.Fatalf("no FK found on %s(%s) in current_schema (schema not migrated?)", table, column) + return "" + } + if err != nil { + t.Fatalf("FK lookup for %s(%s) failed: %v", table, column, err) + return "" + } + return confdeltype +} + +// readMigrationFile locates and loads a named migration file. Uses the same +// walk-up strategy as findMigrationsDir() in testutil_test.go so both helpers +// agree on where the migrations live. +func readMigrationFile(t *testing.T, name string) string { + t.Helper() + path := filepath.Join(findMigrationsDir(), name) + data, err := os.ReadFile(path) + if err != nil { + t.Fatalf("failed to read migration file %s (expected at %s): %v", name, path, err) + } + // Defensive: a zero-byte down migration would produce false-positive + // "success" below. Refuse to trust it. + if strings.TrimSpace(string(data)) == "" { + t.Fatalf("migration file %s is empty — down migration missing or truncated", name) + } + return string(data) +} diff --git a/internal/service/agent.go b/internal/service/agent.go index 00ba028..b68d89b 100644 --- a/internal/service/agent.go +++ b/internal/service/agent.go @@ -92,12 +92,27 @@ func (s *AgentService) Register(ctx context.Context, name string, hostname strin } // Heartbeat updates an agent's last seen time, status, and metadata. +// +// I-004: retired agents must be rejected up-front. A retired agent that is +// still polling is a zombie — its row exists only for audit history and must +// not be allowed to bump LastHeartbeatAt (which would resurrect it in stats +// dashboards and stale-offline sweeps). The sentinel ErrAgentRetired is +// returned unwrapped so the HTTP handler can map it to 410 Gone via +// errors.Is; the agent process detects the 410 and shuts down cleanly +// instead of continuing to heartbeat indefinitely. func (s *AgentService) Heartbeat(ctx context.Context, agentID string, metadata *domain.AgentMetadata) error { agent, err := s.agentRepo.Get(ctx, agentID) if err != nil { return fmt.Errorf("failed to fetch agent: %w", err) } + // I-004 guard: retired agents are frozen. Do not call UpdateHeartbeat — + // bumping the timestamp would defeat the retired-row filter that protects + // stats, scheduler sweeps, and handler listings. + if agent.IsRetired() { + return ErrAgentRetired + } + // Update heartbeat and metadata if err := s.agentRepo.UpdateHeartbeat(ctx, agentID, metadata); err != nil { return fmt.Errorf("failed to update heartbeat: %w", err) diff --git a/internal/service/agent_retire.go b/internal/service/agent_retire.go new file mode 100644 index 0000000..46c54b2 --- /dev/null +++ b/internal/service/agent_retire.go @@ -0,0 +1,317 @@ +package service + +import ( + "context" + "errors" + "fmt" + "log/slog" + "time" + + "github.com/shankar0123/certctl/internal/domain" +) + +// I-004 coverage-gap closure: the agent retirement surface. +// +// Before 000015, DELETE /api/v1/agents/{id} hard-deleted the agents row and +// the deployment_targets.agent_id FK CASCADE cleaned up downstream rows with +// no preflight, no archival, and no knowledge of in-flight jobs. Any cert +// still rotating through one of those targets would observe half-migrated +// state. I-004 closes that gap with a preflight + soft-retire + optional +// forced-cascade contract; the symbols in this file are the service-layer +// surface that the handler and operator UI bind against. + +// ErrAgentIsSentinel is returned when an operator tries to retire one of the +// four reserved sentinel agent IDs (server-scanner, cloud-aws-sm, +// cloud-azure-kv, cloud-gcp-sm). These rows back the network scanner and the +// three cloud secret-manager discovery sources; retiring any of them orphans +// its subsystem. The guard fires unconditionally — force=true does not bypass +// it, because a sentinel is a structural invariant of the deployment, not +// a piece of fleet state the operator owns. Handler maps this to HTTP 403. +var ErrAgentIsSentinel = errors.New("agent is a reserved sentinel and cannot be retired") + +// ErrBlockedByDependencies is returned by RetireAgent when at least one of +// (active targets, active certificates, pending jobs) referencing the agent +// is non-zero and force=false. The caller always receives it wrapped in +// a *BlockedByDependenciesError (see below), so handlers doing errors.As +// can surface the per-bucket counts in the 409 body for operator +// troubleshooting. Tests use errors.Is; handlers use errors.As. +var ErrBlockedByDependencies = errors.New("agent has active downstream dependencies") + +// ErrForceReasonRequired is returned when force=true is supplied without a +// non-empty reason. The force escape hatch is deliberately chatty: operators +// pulling the emergency cord must leave an auditable breadcrumb explaining +// why a cascade was justified. Handler maps this to HTTP 400 so the operator +// retries with --reason rather than silently skipping the guard. Checked +// before any DB mutation to keep the no-reason path transactionally clean. +var ErrForceReasonRequired = errors.New("force=true requires a non-empty reason") + +// ErrAgentRetired is returned by Heartbeat (and any future agent-authenticated +// call site) when a retired agent is still polling. The handler layer maps +// this to HTTP 410 Gone so the cmd/agent sendHeartbeat loop can detect it +// deterministically and shut down the agent process, rather than looping +// forever on a soft-retired identity. IsRetired() on the domain model is +// the single source of truth; the sentinel exists so service and handler +// callers can errors.Is against one symbol. +var ErrAgentRetired = errors.New("agent has been retired") + +// BlockedByDependenciesError wraps ErrBlockedByDependencies and carries the +// per-bucket dependency snapshot the preflight pass captured. The embedded +// AgentDependencyCounts is the same struct the repo returns from the three +// CountActive* calls, so the handler can marshal it directly into the 409 +// body without reshaping fields. Unwrap() satisfies errors.Is against the +// sentinel; Error() includes the counts so logs are diagnostic on their own. +type BlockedByDependenciesError struct { + Counts domain.AgentDependencyCounts +} + +// Error formats the wrapped error with the per-bucket counts. Kept short so +// it reads cleanly in slog output. +func (e *BlockedByDependenciesError) Error() string { + return fmt.Sprintf( + "%s (active_targets=%d, active_certificates=%d, pending_jobs=%d)", + ErrBlockedByDependencies.Error(), + e.Counts.ActiveTargets, + e.Counts.ActiveCertificates, + e.Counts.PendingJobs, + ) +} + +// Unwrap lets errors.Is(err, ErrBlockedByDependencies) match the wrapped +// struct — the test contract (agent_retire_test.go:167) depends on it. +func (e *BlockedByDependenciesError) Unwrap() error { return ErrBlockedByDependencies } + +// AgentRetirementResult is the outcome surface the handler returns to the +// operator. It discriminates the three happy paths the endpoint can take — +// idempotent no-op (AlreadyRetired), clean soft-retire (Cascade=false), and +// forced cascade (Cascade=true) — and always carries the retired_at timestamp +// and the dependency-count snapshot so the 200/204 response body can echo +// what was (or would have been) affected. +// +// AlreadyRetired=true → agent was already retired; no new audit +// event was emitted; RetiredAt is the +// original stamp, not the current time. +// Cascade=false → clean soft-retire; Counts is all zeros. +// Cascade=true → force=true retired agent + downstream +// targets; Counts is the PRE-cascade +// snapshot (so the operator sees what +// they just retired). +type AgentRetirementResult struct { + AlreadyRetired bool + Cascade bool + RetiredAt time.Time + Counts domain.AgentDependencyCounts +} + +// RetireAgent implements the I-004 retirement contract. Ordering matters — +// every guard fires before the one that would mutate state, so a rejected +// retire leaves zero trace (no audit event, no partial DB write): +// +// 1. Sentinel check (unconditional; force does not bypass). +// 2. Fetch agent (404 surfaces as-is from the repo). +// 3. Already-retired idempotency: return AlreadyRetired=true with NO new +// audit event — the original retire already recorded one. +// 4. Preflight count pass via the three CountActive* repo methods. +// 5. Force-reason guard: force=true with empty reason is rejected here, +// after the counts are known but before any mutation. +// 6. Default no-force path: any non-zero count returns +// *BlockedByDependenciesError with counts attached. +// 7. Mutation: SoftRetire (no cascade) or RetireAgentWithCascade, with +// a single retiredAt timestamp pinned BEFORE the repo call so the +// audit event and the DB row agree to the nanosecond. +// 8. Audit: agent_retired always; agent_retirement_cascaded additionally +// on the force=true cascade path. +// +// Actor comes from the handler's resolveActor (API key → user, agent key → +// agent-, unauthenticated → "anonymous"); the service does not second- +// guess it. Audit emission is best-effort: a failed RecordEvent logs a +// warning but does not fail the overall retirement, consistent with how +// the rest of the codebase treats audit as an observability concern +// rather than a correctness barrier. +func (s *AgentService) RetireAgent(ctx context.Context, id string, actor string, force bool, reason string) (*AgentRetirementResult, error) { + // Step 1 — reserved-sentinel guard. Applies even under force=true. + if domain.IsSentinelAgent(id) { + return nil, ErrAgentIsSentinel + } + + // Step 2 — existence check. Missing agent surfaces the repo's not-found + // error verbatim so the handler can map it to 404 via its existing + // detection path (the handler layer already has "not found" mapping + // logic inherited from the pre-I-004 Delete endpoint). + agent, err := s.agentRepo.Get(ctx, id) + if err != nil { + return nil, fmt.Errorf("failed to fetch agent: %w", err) + } + + // Step 3 — idempotency. A retired agent returns AlreadyRetired=true + // WITHOUT emitting a fresh audit event. Handler maps this to HTTP 204. + // Guarding here (before preflight) means a re-retire of an agent that + // now has zero deps doesn't spuriously "succeed again" and double-log. + if agent.IsRetired() { + return &AgentRetirementResult{ + AlreadyRetired: true, + RetiredAt: *agent.RetiredAt, + }, nil + } + + // Step 4 — preflight counts. All three run even when force=true: we + // need them to populate AgentRetirementResult.Counts (the pre-cascade + // snapshot). A repo failure here aborts the whole operation — partial + // preflight is worse than no preflight. + counts, err := s.collectAgentDependencyCounts(ctx, id) + if err != nil { + return nil, fmt.Errorf("failed to collect agent dependency counts: %w", err) + } + + // Step 5 — force-reason guard. Positioned AFTER preflight so operators + // who forgot --reason still see accurate counts when they retry. The + // empty-reason rejection fires before any mutation, so the rejected + // attempt leaves no audit noise. + if force && reason == "" { + return nil, ErrForceReasonRequired + } + + // Step 6 — default path: block on any non-zero bucket. Wrapping the + // sentinel in *BlockedByDependenciesError lets the handler use errors.As + // to surface counts in the 409 body while tests use errors.Is against + // the sentinel. Both callers are satisfied by the single Unwrap chain. + if !force && counts.HasDependencies() { + return nil, &BlockedByDependenciesError{Counts: counts} + } + + // Step 7 — mutation. Pin retiredAt once so the audit event, the agent + // row, and (on cascade) every deployment_targets row share the same + // timestamp. Callers querying "what happened at T?" can correlate + // retirement rows across tables without clock-skew tie-breaking. + retiredAt := time.Now() + cascade := force && counts.HasDependencies() + + if cascade { + if err := s.agentRepo.RetireAgentWithCascade(ctx, id, retiredAt, reason); err != nil { + return nil, fmt.Errorf("failed to retire agent with cascade: %w", err) + } + } else { + if err := s.agentRepo.SoftRetire(ctx, id, retiredAt, reason); err != nil { + return nil, fmt.Errorf("failed to soft-retire agent: %w", err) + } + } + + // Step 8 — audit. Two events on the cascade path so forensics can + // distinguish "agent was retired" (agent_retired) from "downstream + // targets were flipped" (agent_retirement_cascaded). Details on the + // cascaded event carry the pre-cascade counts so a reviewer looking + // only at the audit log knows how much state was affected. Emission + // is best-effort — audit is observability, not a correctness barrier. + actorType := s.resolveActorType(actor) + details := map[string]interface{}{ + "actor": actor, + "reason": reason, + "force": force, + "active_targets": counts.ActiveTargets, + "active_certificates": counts.ActiveCertificates, + "pending_jobs": counts.PendingJobs, + } + if err := s.auditService.RecordEvent(ctx, actor, actorType, + "agent_retired", "agent", id, details); err != nil { + slog.Error("failed to record agent_retired audit event", "agent_id", id, "error", err) + } + if cascade { + cascadeDetails := map[string]interface{}{ + "actor": actor, + "reason": reason, + "active_targets": counts.ActiveTargets, + "active_certificates": counts.ActiveCertificates, + "pending_jobs": counts.PendingJobs, + } + if err := s.auditService.RecordEvent(ctx, actor, actorType, + "agent_retirement_cascaded", "agent", id, cascadeDetails); err != nil { + slog.Error("failed to record agent_retirement_cascaded audit event", "agent_id", id, "error", err) + } + } + + return &AgentRetirementResult{ + AlreadyRetired: false, + Cascade: cascade, + RetiredAt: retiredAt, + Counts: counts, + }, nil +} + +// ListRetiredAgents returns the paginated list of retired agents in +// retired_at DESC order. This is the companion to ListAgents — which +// hides retired rows — so the operator UI can render a dedicated +// "Retired" tab without leaking retired rows into every other listing. +// Pagination defaults (page<1→1, perPage<1→50) are applied here as +// well as in the repo, so callers can pass 0s when they want defaults. +// +// Return shape harmonizes with handler.AgentService: a value slice +// (not pointer slice) and int64 total. The repo returns []*domain.Agent; +// this method dereferences into a value slice so the handler's +// PagedResponse marshals straight objects and so the compile-time +// interface assertion in agent_retire_handler_test.go:387 is satisfied. +// Nil repo entries are skipped defensively — the repo should never +// return them, but the handler contract is more important than the +// repo's (pointer-slice) convenience. +func (s *AgentService) ListRetiredAgents(ctx context.Context, page, perPage int) ([]domain.Agent, int64, error) { + if page < 1 { + page = 1 + } + if perPage < 1 { + perPage = 50 + } + agents, total, err := s.agentRepo.ListRetired(ctx, page, perPage) + if err != nil { + return nil, 0, fmt.Errorf("failed to list retired agents: %w", err) + } + out := make([]domain.Agent, 0, len(agents)) + for _, a := range agents { + if a == nil { + continue + } + out = append(out, *a) + } + return out, int64(total), nil +} + +// collectAgentDependencyCounts runs the three preflight COUNT queries in +// sequence and bundles the result. Sequential (not parallel) because the +// queries are cheap (<1ms each on the indexed columns added in 000015) and +// sequential keeps error handling simple. Any repo error short-circuits +// — we prefer to refuse the retire than make a half-informed decision. +func (s *AgentService) collectAgentDependencyCounts(ctx context.Context, id string) (domain.AgentDependencyCounts, error) { + var counts domain.AgentDependencyCounts + + targets, err := s.agentRepo.CountActiveTargets(ctx, id) + if err != nil { + return counts, fmt.Errorf("count active targets: %w", err) + } + counts.ActiveTargets = targets + + certs, err := s.agentRepo.CountActiveCertificates(ctx, id) + if err != nil { + return counts, fmt.Errorf("count active certificates: %w", err) + } + counts.ActiveCertificates = certs + + jobs, err := s.agentRepo.CountPendingJobs(ctx, id) + if err != nil { + return counts, fmt.Errorf("count pending jobs: %w", err) + } + counts.PendingJobs = jobs + + return counts, nil +} + +// resolveActorType maps an opaque actor string into the typed ActorType +// used by the audit schema. Matches the conventions the rest of the +// service layer uses: "system" → System, anything that looks like an +// agent identity → Agent, everything else → User. +func (s *AgentService) resolveActorType(actor string) domain.ActorType { + switch { + case actor == "system": + return domain.ActorTypeSystem + case len(actor) > 6 && actor[:6] == "agent-": + return domain.ActorTypeAgent + default: + return domain.ActorTypeUser + } +} diff --git a/internal/service/agent_retire_test.go b/internal/service/agent_retire_test.go new file mode 100644 index 0000000..3ba5b47 --- /dev/null +++ b/internal/service/agent_retire_test.go @@ -0,0 +1,396 @@ +package service + +import ( + "context" + "errors" + "log/slog" + "testing" + "time" + + "github.com/shankar0123/certctl/internal/domain" +) + +// setupRetireTest wires up an AgentService with a single registered agent and +// returns (service, agentRepo, auditRepo) so tests can seed state and assert +// audit events. Kept minimal — tests that need targets/jobs/certs extend the +// returned repos directly. +func setupRetireTest(t *testing.T, agentID string) (*AgentService, *mockAgentRepo, *mockAuditRepo) { + t.Helper() + now := time.Now() + agent := &domain.Agent{ + ID: agentID, + Name: "prod-agent", + Hostname: "server-01", + Status: domain.AgentStatusOnline, + RegisteredAt: now, + LastHeartbeatAt: &now, + APIKeyHash: "hash-" + agentID, + } + agentRepo := newMockAgentRepository() + agentRepo.AddAgent(agent) + certRepo := &mockCertRepo{ + Certs: make(map[string]*domain.ManagedCertificate), + Versions: make(map[string][]*domain.CertificateVersion), + } + jobRepo := &mockJobRepo{ + Jobs: make(map[string]*domain.Job), + StatusUpdates: make(map[string]domain.JobStatus), + } + targetRepo := &mockTargetRepo{ + Targets: make(map[string]*domain.DeploymentTarget), + } + auditRepo := &mockAuditRepo{Events: []*domain.AuditEvent{}} + auditService := NewAuditService(auditRepo) + issuerRegistry := NewIssuerRegistry(slog.Default()) + + svc := NewAgentService(agentRepo, certRepo, jobRepo, targetRepo, auditService, issuerRegistry, nil) + return svc, agentRepo, auditRepo +} + +// TestRetireAgent_Sentinel_Rejected covers I-004's sentinel guard. The four +// well-known sentinel agent IDs back discovery sources and the network scanner +// — retiring them would orphan those subsystems. Contract: reject with +// ErrAgentIsSentinel regardless of force/reason. +func TestRetireAgent_Sentinel_Rejected(t *testing.T) { + sentinels := []string{"server-scanner", "cloud-aws-sm", "cloud-azure-kv", "cloud-gcp-sm"} + for _, id := range sentinels { + t.Run(id, func(t *testing.T) { + svc, _, _ := setupRetireTest(t, id) + _, err := svc.RetireAgent(context.Background(), id, "alice", false, "") + if !errors.Is(err, ErrAgentIsSentinel) { + t.Fatalf("retire(sentinel %q) err=%v want ErrAgentIsSentinel", id, err) + } + // Sentinel rejection must be deterministic even under force=true. + _, err = svc.RetireAgent(context.Background(), id, "alice", true, "forced by operator") + if !errors.Is(err, ErrAgentIsSentinel) { + t.Fatalf("retire(sentinel %q force=true) err=%v want ErrAgentIsSentinel", id, err) + } + }) + } +} + +// TestRetireAgent_NotFound covers the 404 preflight path. The handler maps +// ErrAgentNotFound-equivalent sentinel to 404; the service must surface it +// cleanly without partial state mutation. +func TestRetireAgent_NotFound(t *testing.T) { + svc, _, _ := setupRetireTest(t, "agent-001") + _, err := svc.RetireAgent(context.Background(), "agent-does-not-exist", "alice", false, "") + if err == nil { + t.Fatalf("retire(missing id) err=nil want not-found error") + } +} + +// TestRetireAgent_AlreadyRetired_Idempotent covers the 204 No Content path. +// Retiring an already-retired agent must succeed without error and without +// emitting a new audit event (the first retirement already recorded one). +// Idempotency matters because the handler is the escape hatch for operators +// re-issuing a failed retire after a partial failure mid-cascade. +func TestRetireAgent_AlreadyRetired_Idempotent(t *testing.T) { + svc, agentRepo, auditRepo := setupRetireTest(t, "agent-001") + past := time.Now().Add(-24 * time.Hour) + reason := "operator decommissioned" + agent := agentRepo.Agents["agent-001"] + agent.RetiredAt = &past + agent.RetiredReason = &reason + + result, err := svc.RetireAgent(context.Background(), "agent-001", "alice", false, "") + if err != nil { + t.Fatalf("retire(already retired) err=%v want nil (idempotent)", err) + } + if result == nil || !result.AlreadyRetired { + t.Fatalf("retire(already retired) result=%+v want AlreadyRetired=true", result) + } + // Retire-on-retired must not emit a duplicate audit event. + for _, e := range auditRepo.Events { + if e.Action == "agent_retired" && e.ResourceID == "agent-001" { + t.Fatalf("retire(already retired) emitted duplicate agent_retired audit event") + } + } +} + +// TestRetireAgent_NoDeps_SoftSucceeds covers the happy 200 path: no active +// targets, certs, or jobs referencing the agent. Soft-retire stamps +// RetiredAt + RetiredReason and emits agent_retired audit event. +func TestRetireAgent_NoDeps_SoftSucceeds(t *testing.T) { + svc, agentRepo, auditRepo := setupRetireTest(t, "agent-001") + + before := time.Now().Add(-time.Second) + result, err := svc.RetireAgent(context.Background(), "agent-001", "alice", false, "") + if err != nil { + t.Fatalf("retire(clean) err=%v want nil", err) + } + if result == nil { + t.Fatal("retire(clean) result=nil want non-nil") + } + if result.AlreadyRetired { + t.Fatalf("retire(clean) result.AlreadyRetired=true want false") + } + if result.Cascade { + t.Fatalf("retire(clean) result.Cascade=true want false (no deps to cascade)") + } + if !result.RetiredAt.After(before) { + t.Fatalf("retire(clean) RetiredAt=%v not after test start %v", result.RetiredAt, before) + } + + agent := agentRepo.Agents["agent-001"] + if agent.RetiredAt == nil { + t.Fatalf("retire(clean) agent.RetiredAt=nil want stamped") + } + + // Audit event must be emitted with action=agent_retired, actor=alice. + found := false + for _, e := range auditRepo.Events { + if e.Action == "agent_retired" && e.ResourceID == "agent-001" && e.Actor == "alice" { + found = true + break + } + } + if !found { + t.Fatalf("retire(clean) missing agent_retired audit event for alice, events=%+v", auditRepo.Events) + } +} + +// TestRetireAgent_WithDeps_NoForce_Blocked covers the 409 preflight path. When +// the agent has any of: active non-retired targets, certs deployed via those +// targets, or pending jobs — a default retire must block with +// ErrBlockedByDependencies and the counts must be reachable via errors.As so +// the handler can build the 409 body. +func TestRetireAgent_WithDeps_NoForce_Blocked(t *testing.T) { + svc, agentRepo, _ := setupRetireTest(t, "agent-001") + // Seed dependency counts directly on the mock — the production repo + // implements CountActive* queries; the mock exposes them as fields. + agentRepo.ActiveTargetCounts["agent-001"] = 3 + agentRepo.ActiveCertCounts["agent-001"] = 7 + agentRepo.PendingJobCounts["agent-001"] = 2 + + _, err := svc.RetireAgent(context.Background(), "agent-001", "alice", false, "") + if !errors.Is(err, ErrBlockedByDependencies) { + t.Fatalf("retire(with deps, no force) err=%v want ErrBlockedByDependencies", err) + } + var blocked *BlockedByDependenciesError + if !errors.As(err, &blocked) { + t.Fatalf("retire(with deps) err=%v want wrapped *BlockedByDependenciesError", err) + } + if blocked.Counts.ActiveTargets != 3 { + t.Errorf("blocked.Counts.ActiveTargets=%d want 3", blocked.Counts.ActiveTargets) + } + if blocked.Counts.ActiveCertificates != 7 { + t.Errorf("blocked.Counts.ActiveCertificates=%d want 7", blocked.Counts.ActiveCertificates) + } + if blocked.Counts.PendingJobs != 2 { + t.Errorf("blocked.Counts.PendingJobs=%d want 2", blocked.Counts.PendingJobs) + } + // Agent must still be un-retired after preflight block. + if agentRepo.Agents["agent-001"].RetiredAt != nil { + t.Fatalf("retire(blocked) left RetiredAt stamped; preflight must be transactionally safe") + } +} + +// TestRetireAgent_WithDeps_Force_NoReason_Rejected covers the 400 guard on the +// force escape hatch. Operators using force=true must supply a justifying +// reason; empty reason is rejected before any DB mutation. +func TestRetireAgent_WithDeps_Force_NoReason_Rejected(t *testing.T) { + svc, agentRepo, _ := setupRetireTest(t, "agent-001") + agentRepo.ActiveTargetCounts["agent-001"] = 1 + + _, err := svc.RetireAgent(context.Background(), "agent-001", "alice", true, "") + if !errors.Is(err, ErrForceReasonRequired) { + t.Fatalf("retire(force, no reason) err=%v want ErrForceReasonRequired", err) + } + if agentRepo.Agents["agent-001"].RetiredAt != nil { + t.Fatalf("retire(force, no reason) left RetiredAt stamped; guard must fire before mutation") + } +} + +// TestRetireAgent_WithDeps_Force_Cascades covers the force=true transactional +// path: agent retires, downstream targets also soft-retire with the supplied +// reason, and the result surface indicates cascade happened. Reason +// propagates to every cascaded row so post-mortem forensics can trace the +// cascade to a single operator action. +func TestRetireAgent_WithDeps_Force_Cascades(t *testing.T) { + svc, agentRepo, auditRepo := setupRetireTest(t, "agent-001") + agentRepo.ActiveTargetCounts["agent-001"] = 2 + agentRepo.ActiveCertCounts["agent-001"] = 5 + agentRepo.PendingJobCounts["agent-001"] = 1 + + reason := "decommissioning rack 7" + result, err := svc.RetireAgent(context.Background(), "agent-001", "alice", true, reason) + if err != nil { + t.Fatalf("retire(force, reason) err=%v want nil", err) + } + if result == nil { + t.Fatal("retire(force) result=nil want non-nil") + } + if !result.Cascade { + t.Fatalf("retire(force) result.Cascade=false want true") + } + if result.Counts.ActiveTargets != 2 { + t.Errorf("result.Counts.ActiveTargets=%d want 2 (pre-cascade snapshot)", result.Counts.ActiveTargets) + } + + agent := agentRepo.Agents["agent-001"] + if agent.RetiredAt == nil { + t.Fatalf("retire(force) agent.RetiredAt=nil want stamped") + } + if agent.RetiredReason == nil || *agent.RetiredReason != reason { + t.Fatalf("retire(force) RetiredReason=%v want %q", agent.RetiredReason, reason) + } + + // Two audit events required: agent_retired + agent_retirement_cascaded. + // The cascaded event captures which downstream resources were affected. + var haveRetired, haveCascaded bool + for _, e := range auditRepo.Events { + if e.ResourceID == "agent-001" { + switch e.Action { + case "agent_retired": + haveRetired = true + case "agent_retirement_cascaded": + haveCascaded = true + } + } + } + if !haveRetired { + t.Errorf("retire(force) missing agent_retired audit event") + } + if !haveCascaded { + t.Errorf("retire(force) missing agent_retirement_cascaded audit event") + } +} + +// TestRetireAgent_EmitsAuditEvent pins the audit contract for I-004: +// every retire path that mutates DB state emits at least one audit event with +// the operator's actor identity, so post-hoc compliance/forensics can +// reconstruct who retired what and when. +func TestRetireAgent_EmitsAuditEvent(t *testing.T) { + svc, _, auditRepo := setupRetireTest(t, "agent-007") + + _, err := svc.RetireAgent(context.Background(), "agent-007", "compliance-bot", false, "") + if err != nil { + t.Fatalf("retire err=%v want nil", err) + } + for _, e := range auditRepo.Events { + if e.Action == "agent_retired" && e.ResourceID == "agent-007" { + if e.Actor != "compliance-bot" { + t.Errorf("audit event Actor=%q want compliance-bot", e.Actor) + } + return + } + } + t.Fatalf("no agent_retired audit event emitted, events=%+v", auditRepo.Events) +} + +// TestHeartbeat_RetiredAgent_ReturnsErrAgentRetired covers the 410 Gone +// contract. A retired agent that is still polling must be told its identity +// is no longer accepted — the agent process should detect this and shut +// down rather than continue heartbeating indefinitely. +func TestHeartbeat_RetiredAgent_ReturnsErrAgentRetired(t *testing.T) { + svc, agentRepo, _ := setupRetireTest(t, "agent-001") + past := time.Now().Add(-time.Hour) + reason := "decommissioned" + agentRepo.Agents["agent-001"].RetiredAt = &past + agentRepo.Agents["agent-001"].RetiredReason = &reason + + err := svc.Heartbeat(context.Background(), "agent-001", &domain.AgentMetadata{ + OS: "linux", + Architecture: "amd64", + Hostname: "server-01", + }) + if !errors.Is(err, ErrAgentRetired) { + t.Fatalf("heartbeat(retired) err=%v want ErrAgentRetired", err) + } + // Retired heartbeat must NOT bump LastHeartbeatAt — otherwise the retired + // agent could ressurrect itself in stats/observability dashboards. + if _, bumped := agentRepo.HeartbeatUpdates["agent-001"]; bumped { + t.Fatalf("heartbeat(retired) updated LastHeartbeatAt; retired agents must be frozen") + } +} + +// TestListAgents_DefaultExcludesRetired covers the contract that the +// handler-facing ListAgents call hides retired rows by default. Otherwise +// every dashboard that paginates agents would surface retired stragglers. +// An explicit "list retired" endpoint (ListRetiredAgents) covers the audit +// use case. +func TestListAgents_DefaultExcludesRetired(t *testing.T) { + svc, agentRepo, _ := setupRetireTest(t, "agent-active") + // Seed one retired agent alongside the active one. + past := time.Now().Add(-24 * time.Hour) + reason := "old hardware" + agentRepo.AddAgent(&domain.Agent{ + ID: "agent-retired", + Name: "retired-agent", + Hostname: "server-old", + Status: domain.AgentStatusOffline, + RegisteredAt: past, + APIKeyHash: "hash-retired", + RetiredAt: &past, + RetiredReason: &reason, + }) + + agents, total, err := svc.ListAgents(context.Background(), 1, 50) + if err != nil { + t.Fatalf("ListAgents err=%v want nil", err) + } + for _, a := range agents { + if a.ID == "agent-retired" { + t.Fatalf("ListAgents returned retired agent %q in default listing", a.ID) + } + } + if total != 1 { + t.Errorf("ListAgents total=%d want 1 (only active)", total) + } + + // ListRetiredAgents must surface retired-only, with count=1. + retired, retiredTotal, err := svc.ListRetiredAgents(context.Background(), 1, 50) + if err != nil { + t.Fatalf("ListRetiredAgents err=%v want nil", err) + } + if retiredTotal != 1 { + t.Errorf("ListRetiredAgents total=%d want 1", retiredTotal) + } + if len(retired) != 1 || retired[0].ID != "agent-retired" { + t.Fatalf("ListRetiredAgents got=%+v want [agent-retired]", retired) + } +} + +// TestMarkStaleAgentsOffline_SkipsRetired covers the stale-offline sweeper +// interaction with retirement. A retired agent must not be re-surfaced as +// a state transition ("Online → Offline") by the scheduler, because its +// Status column is preserved as the last-known operational state at +// retirement time and RetiredAt is the source of truth for filtering. +func TestMarkStaleAgentsOffline_SkipsRetired(t *testing.T) { + svc, agentRepo, _ := setupRetireTest(t, "agent-live") + // Active agent is currently stale (no heartbeat for 10 minutes) — eligible + // for Online→Offline transition. + stale := time.Now().Add(-10 * time.Minute) + agentRepo.Agents["agent-live"].LastHeartbeatAt = &stale + + // Retired agent was also stale at retirement time, but must NOT be + // touched by the sweeper. + past := time.Now().Add(-24 * time.Hour) + reason := "hw failure" + agentRepo.AddAgent(&domain.Agent{ + ID: "agent-retired", + Name: "dead-agent", + Hostname: "server-old", + Status: domain.AgentStatusOnline, // preserved last-seen status + RegisteredAt: past, + LastHeartbeatAt: &past, + APIKeyHash: "hash-dead", + RetiredAt: &past, + RetiredReason: &reason, + }) + + if err := svc.MarkStaleAgentsOffline(context.Background(), 5*time.Minute); err != nil { + t.Fatalf("MarkStaleAgentsOffline err=%v want nil", err) + } + + // Active-stale agent should flip Online → Offline. + if got := agentRepo.Agents["agent-live"].Status; got != domain.AgentStatusOffline { + t.Errorf("agent-live Status=%s want Offline", got) + } + // Retired agent's Status column must be frozen at Online (its preserved + // last-seen state); the sweeper must skip it. + if got := agentRepo.Agents["agent-retired"].Status; got != domain.AgentStatusOnline { + t.Errorf("agent-retired Status=%s want Online (frozen); sweeper touched retired row", got) + } +} diff --git a/internal/service/deployment.go b/internal/service/deployment.go index 05fcb68..afeba2d 100644 --- a/internal/service/deployment.go +++ b/internal/service/deployment.go @@ -145,6 +145,31 @@ func (s *DeploymentService) ProcessDeploymentJob(ctx context.Context, job *domai return fmt.Errorf("failed to fetch agent: %w", err) } + // I-004: AgentRepository.Get surfaces retired rows by design (for the GUI + // banner + 410 Gone heartbeat path). Deployments must never dispatch to a + // retired agent — it will never heartbeat again and the target row should + // itself have been cascade-retired when the agent was force-retired. A job + // slipping through here would otherwise hit the heartbeat-staleness branch + // below with the misleading reason "agent is offline"; we want operators to + // see the real cause. Fail the job with an explicit reason, send a + // deployment notification so the owner is alerted, and record an audit + // event. Falls through the same notify+audit shape as the offline branch. + if agent.IsRetired() { + updateErr := s.jobRepo.UpdateStatus(ctx, job.ID, domain.JobStatusFailed, "assigned agent is retired") + if updateErr != nil { + slog.Error("failed to update job status", "job_id", job.ID, "error", updateErr) + } + if notifErr := s.notificationSvc.SendDeploymentNotification(ctx, cert, target, false, fmt.Errorf("agent retired")); notifErr != nil { + slog.Error("failed to send deployment notification", "error", notifErr) + } + if auditErr := s.auditService.RecordEvent(ctx, "system", domain.ActorTypeSystem, + "deployment_job_failed", "certificate", job.CertificateID, + map[string]interface{}{"job_id": job.ID, "reason": "agent retired", "target_id": targetID, "agent_id": agentID}); auditErr != nil { + slog.Error("failed to record audit event", "error", auditErr) + } + return fmt.Errorf("agent %s is retired", agentID) + } + // Check agent heartbeat (must be within last 5 minutes) if agent.LastHeartbeatAt != nil && time.Since(*agent.LastHeartbeatAt) > 5*time.Minute { updateErr := s.jobRepo.UpdateStatus(ctx, job.ID, domain.JobStatusFailed, "agent is offline") diff --git a/internal/service/target.go b/internal/service/target.go index bfbcc9e..6343950 100644 --- a/internal/service/target.go +++ b/internal/service/target.go @@ -232,6 +232,18 @@ func (s *TargetService) TestConnection(ctx context.Context, id string) error { return fmt.Errorf("assigned agent not found: %w", err) } + // I-004: AgentRepository.Get intentionally surfaces retired rows (the banner + // + 410 Gone paths need to see them). A test against a retired agent can + // never succeed — the agent is tombstoned, will never heartbeat again, and + // any active targets have already been cascade-retired alongside it. Fail + // fast with an explicit message instead of falling through to the Status / + // heartbeat checks, which would produce a misleading "agent is Offline" or + // "heartbeat stale" diagnostic. + if agent.IsRetired() { + s.updateTestStatus(ctx, target, "failed") + return fmt.Errorf("assigned agent %s is retired", agent.ID) + } + if agent.Status != domain.AgentStatusOnline { s.updateTestStatus(ctx, target, "failed") return fmt.Errorf("assigned agent %s is %s (expected Online)", agent.ID, agent.Status) @@ -293,9 +305,20 @@ func (s *TargetService) CreateTarget(ctx context.Context, target domain.Deployme if target.AgentID == "" { return nil, fmt.Errorf("%w: agent_id is required", ErrAgentNotFound) } - if _, err := s.agentRepo.Get(ctx, target.AgentID); err != nil { + agent, err := s.agentRepo.Get(ctx, target.AgentID) + if err != nil { return nil, fmt.Errorf("%w: %s", ErrAgentNotFound, target.AgentID) } + // I-004: refuse to attach new targets to a retired agent. The agent is + // tombstoned and no deployments would ever succeed against it; letting a + // row slip past here would immediately be cascade-retired on the next + // dependency sweep and confuse operators ("why is this brand-new target + // already retired?"). Treating retired agents as "not found" for creation + // purposes keeps the error surface tight and matches the default-list + // contract established by repository.AgentRepository.List. + if agent.IsRetired() { + return nil, fmt.Errorf("%w: %s (retired)", ErrAgentNotFound, target.AgentID) + } if target.ID == "" { target.ID = generateID("target") diff --git a/internal/service/testutil_test.go b/internal/service/testutil_test.go index b530dfb..e6303a8 100644 --- a/internal/service/testutil_test.go +++ b/internal/service/testutil_test.go @@ -4,6 +4,7 @@ import ( "context" "database/sql" "errors" + "sort" "sync" "time" @@ -607,7 +608,14 @@ func (m *mockRenewalPolicyRepo) AddPolicy(policy *domain.RenewalPolicy) { m.Policies[policy.ID] = policy } -// mockAgentRepo is a test implementation of AgentRepository +// mockAgentRepo is a test implementation of AgentRepository. +// +// I-004: ActiveTargetCounts / ActiveCertCounts / PendingJobCounts are keyed by +// agent ID and read back verbatim by the Count* methods — the retirement +// service's preflight pokes these maps to simulate "agent has N active +// deployments / M deployed certs / K pending jobs" without having to seed +// real target/cert/job rows across multiple mock repos. An unset key means +// zero, matching the production repo behavior on an agent with no deps. type mockAgentRepo struct { mu sync.Mutex Agents map[string]*domain.Agent @@ -619,8 +627,27 @@ type mockAgentRepo struct { ListErr error UpdateHeartbeatErr error GetByAPIKeyErr error + // I-004 preflight count seeds (read by CountActiveTargets etc.). + ActiveTargetCounts map[string]int + ActiveCertCounts map[string]int + PendingJobCounts map[string]int + // I-004 retirement write-path error seams. Let tests force a SoftRetire + // or RetireAgentWithCascade failure after preflight passed, so the + // service's error surfacing (wrap+return, skip audit, etc.) can be + // exercised without having to stand up a real PG connection. + SoftRetireErr error + RetireCascadeErr error + CountErr error + ListRetiredErr error } +// List mirrors the production repo contract post-I-004: it returns only +// ACTIVE agents (RetiredAt == nil). Tests that seed a retired agent via +// AddAgent and then call a List-driven service method (e.g. ListAgents, +// MarkStaleAgentsOffline, stats dashboards) must not see the retired row +// here — otherwise the mock would pass while the real planner filters it +// out at the WHERE clause level. ListRetired is the companion method for +// explicit retired-only listing. func (m *mockAgentRepo) List(ctx context.Context) ([]*domain.Agent, error) { m.mu.Lock() defer m.mu.Unlock() @@ -629,6 +656,9 @@ func (m *mockAgentRepo) List(ctx context.Context) ([]*domain.Agent, error) { } var agents []*domain.Agent for _, a := range m.Agents { + if a.RetiredAt != nil { + continue + } agents = append(agents, a) } return agents, nil @@ -726,6 +756,134 @@ func (m *mockAgentRepo) AddAgent(agent *domain.Agent) { m.Agents[agent.ID] = agent } +// ListRetired returns the paginated retired-agents slice + total count. +// Matches the production repo contract: RetiredAt != nil, sorted by +// RetiredAt DESC, page<1 → 1, perPage<1 → 50. Sort is done in-memory over +// the keyed map so the mock stays dependency-free. I-004. +func (m *mockAgentRepo) ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error) { + m.mu.Lock() + defer m.mu.Unlock() + if m.ListRetiredErr != nil { + return nil, 0, m.ListRetiredErr + } + if page < 1 { + page = 1 + } + if perPage < 1 { + perPage = 50 + } + var retired []*domain.Agent + for _, a := range m.Agents { + if a.RetiredAt != nil { + retired = append(retired, a) + } + } + total := len(retired) + // Sort by RetiredAt DESC — most recent first. The real query uses the + // partial idx_agents_retired_at index; here we sort in Go. + sort.SliceStable(retired, func(i, j int) bool { + return retired[i].RetiredAt.After(*retired[j].RetiredAt) + }) + // Apply page/perPage window. + offset := (page - 1) * perPage + if offset >= total { + return nil, total, nil + } + end := offset + perPage + if end > total { + end = total + } + return retired[offset:end], total, nil +} + +// SoftRetire stamps RetiredAt + RetiredReason on the agent row. Mirrors +// the real repo's idempotent semantics: a row already retired is left +// untouched (zero-rows-affected is not an error). I-004 preserves +// retirement metadata across re-retire attempts — whoever retired it +// first owns the audit trail. +func (m *mockAgentRepo) SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error { + m.mu.Lock() + defer m.mu.Unlock() + if m.SoftRetireErr != nil { + return m.SoftRetireErr + } + agent, ok := m.Agents[id] + if !ok { + return errNotFound + } + if agent.RetiredAt != nil { + return nil // already retired — no-op + } + stamped := retiredAt + agent.RetiredAt = &stamped + stampedReason := reason + agent.RetiredReason = &stampedReason + return nil +} + +// RetireAgentWithCascade stamps the agent row the same way SoftRetire +// does. The real repo also stamps every active deployment_targets row +// in the same transaction; the mock can't do that because targets live +// in mockTargetRepo, which the retirement service doesn't write to +// through this repo interface. Tests that need to assert cascade +// semantics on targets should seed mockTargetRepo directly and verify +// the service-layer audit event captured the cascade count. I-004. +func (m *mockAgentRepo) RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error { + m.mu.Lock() + defer m.mu.Unlock() + if m.RetireCascadeErr != nil { + return m.RetireCascadeErr + } + agent, ok := m.Agents[id] + if !ok { + return errNotFound + } + if agent.RetiredAt != nil { + return nil // already retired — no-op (same as production transaction) + } + stamped := retiredAt + agent.RetiredAt = &stamped + stampedReason := reason + agent.RetiredReason = &stampedReason + return nil +} + +// CountActiveTargets returns the seeded ActiveTargetCounts value (0 if +// unset). Matches the real repo signature: COUNT of non-retired +// deployment_targets with agent_id=$1. I-004 preflight. +func (m *mockAgentRepo) CountActiveTargets(ctx context.Context, agentID string) (int, error) { + m.mu.Lock() + defer m.mu.Unlock() + if m.CountErr != nil { + return 0, m.CountErr + } + return m.ActiveTargetCounts[agentID], nil +} + +// CountActiveCertificates returns the seeded ActiveCertCounts value. +// Real query: COUNT(DISTINCT certificate_id) across +// certificate_target_mappings ↔ deployment_targets on agent_id. I-004. +func (m *mockAgentRepo) CountActiveCertificates(ctx context.Context, agentID string) (int, error) { + m.mu.Lock() + defer m.mu.Unlock() + if m.CountErr != nil { + return 0, m.CountErr + } + return m.ActiveCertCounts[agentID], nil +} + +// CountPendingJobs returns the seeded PendingJobCounts value. Real +// query: COUNT of jobs with agent_id=$1 AND status IN (Pending, +// AwaitingCSR, AwaitingApproval, Running). I-004. +func (m *mockAgentRepo) CountPendingJobs(ctx context.Context, agentID string) (int, error) { + m.mu.Lock() + defer m.mu.Unlock() + if m.CountErr != nil { + return 0, m.CountErr + } + return m.PendingJobCounts[agentID], nil +} + // mockTargetRepo is a test implementation of TargetRepository type mockTargetRepo struct { mu sync.Mutex @@ -955,6 +1113,13 @@ func newMockAgentRepository() *mockAgentRepo { return &mockAgentRepo{ Agents: make(map[string]*domain.Agent), HeartbeatUpdates: make(map[string]time.Time), + // I-004 preflight count maps. Tests seed these directly via + // agentRepo.ActiveTargetCounts["agent-id"] = N — unset keys + // read back as zero from CountActiveTargets etc., matching + // the production repo behavior for agents with no deps. + ActiveTargetCounts: make(map[string]int), + ActiveCertCounts: make(map[string]int), + PendingJobCounts: make(map[string]int), } } diff --git a/migrations/000015_agent_retire.down.sql b/migrations/000015_agent_retire.down.sql new file mode 100644 index 0000000..d3830ff --- /dev/null +++ b/migrations/000015_agent_retire.down.sql @@ -0,0 +1,32 @@ +-- Migration 000015 rollback: reverse the agent retirement surface. +-- +-- Contract (enforced by migration_000015_test.go Stage 3): +-- * retired_at + retired_reason columns removed from agents +-- * retired_at + retired_reason columns removed from deployment_targets +-- * deployment_targets.agent_id FK restored to ON DELETE CASCADE +-- +-- WARNING: dropping the soft-retire columns also drops the audit history of +-- which agents have been retired and why. Operators rolling back should first +-- export the retired_at/retired_reason values they care about preserving. +-- +-- Order matters: drop supporting indexes BEFORE dropping the columns they +-- reference. DROP INDEX IF EXISTS + DROP COLUMN IF EXISTS keep the down safe +-- to re-apply. + +-- Drop supporting indexes first (they reference columns we're about to drop). +DROP INDEX IF EXISTS idx_agents_retired_at; +DROP INDEX IF EXISTS idx_deployment_targets_retired_at; + +-- Reverse the FK flip: restore CASCADE semantics so the rolled-back server +-- behaves identically to pre-000015 behavior. +ALTER TABLE deployment_targets + DROP CONSTRAINT IF EXISTS deployment_targets_agent_id_fkey; +ALTER TABLE deployment_targets + ADD CONSTRAINT deployment_targets_agent_id_fkey + FOREIGN KEY (agent_id) REFERENCES agents(id) ON DELETE CASCADE; + +-- Remove the soft-retirement columns. +ALTER TABLE deployment_targets DROP COLUMN IF EXISTS retired_reason; +ALTER TABLE deployment_targets DROP COLUMN IF EXISTS retired_at; +ALTER TABLE agents DROP COLUMN IF EXISTS retired_reason; +ALTER TABLE agents DROP COLUMN IF EXISTS retired_at; diff --git a/migrations/000015_agent_retire.up.sql b/migrations/000015_agent_retire.up.sql new file mode 100644 index 0000000..7b8af76 --- /dev/null +++ b/migrations/000015_agent_retire.up.sql @@ -0,0 +1,56 @@ +-- Migration 000015: Agent retirement (I-004 fail-closed coverage-gap fix). +-- +-- Adds a soft-delete surface for agents and their deployment_targets, and +-- replaces the fail-open DELETE CASCADE on deployment_targets.agent_id (see +-- migration 000001 line 104) with a fail-closed DELETE RESTRICT. +-- +-- Rationale (audit finding I-004): +-- Today a bare `DELETE FROM agents WHERE id = $1` silently cascades through +-- deployment_targets (CASCADE) and blanks jobs.agent_id (SET NULL, migration +-- 000001 line 146). The cascade vaporises the deployment audit trail — "who +-- deployed what, to where, when" — and leaves nothing behind to answer +-- forensic questions after an operator mis-types an agent ID. Flipping the +-- deployment_targets FK to RESTRICT forces any DELETE to fail at the DB +-- layer unless dependencies are cleared first; the new retired_at + +-- retired_reason pair give the service layer a soft-retirement path that +-- preserves history. +-- +-- Mirrors migration 000005 (revocation) in shape: nullable timestamp + +-- nullable reason string. Column type is TEXT (not VARCHAR(50)) so retirement +-- reasons can include full operator comments and audit context without +-- truncation. +-- +-- Idempotency guarantees (enforced by migration_000015_test.go Stage 4): +-- * ADD COLUMN IF NOT EXISTS → re-running is a no-op +-- * DROP CONSTRAINT IF EXISTS + ADD CONSTRAINT → always converges to +-- RESTRICT; safe to re-apply after a partial rollback +-- * CREATE INDEX IF NOT EXISTS → re-running is a no-op + +-- Agents: soft-retirement surface. +ALTER TABLE agents ADD COLUMN IF NOT EXISTS retired_at TIMESTAMPTZ; +ALTER TABLE agents ADD COLUMN IF NOT EXISTS retired_reason TEXT; + +-- Deployment targets: soft-retirement surface. Cascade-retire via the service +-- layer copies the agent's retired_at/retired_reason onto its targets so the +-- "who owned this deployment" trail stays intact after an agent is retired. +ALTER TABLE deployment_targets ADD COLUMN IF NOT EXISTS retired_at TIMESTAMPTZ; +ALTER TABLE deployment_targets ADD COLUMN IF NOT EXISTS retired_reason TEXT; + +-- Flip deployment_targets.agent_id FK from ON DELETE CASCADE (migration 000001 +-- line 104) to ON DELETE RESTRICT. Auto-named constraint per Postgres default +-- (`__fkey`). Drop-then-add so this migration is self-healing: +-- re-running always lands on RESTRICT regardless of prior state. +ALTER TABLE deployment_targets + DROP CONSTRAINT IF EXISTS deployment_targets_agent_id_fkey; +ALTER TABLE deployment_targets + ADD CONSTRAINT deployment_targets_agent_id_fkey + FOREIGN KEY (agent_id) REFERENCES agents(id) ON DELETE RESTRICT; + +-- Supporting indexes for the retired-filter queries that replace the default +-- "active only" list paths in the agent repository (ListActive / ListRetired). +-- Partial indexes keep them cheap — only retired rows are indexed, which is a +-- tiny fraction of the table in a healthy fleet. +CREATE INDEX IF NOT EXISTS idx_agents_retired_at + ON agents(retired_at) WHERE retired_at IS NOT NULL; +CREATE INDEX IF NOT EXISTS idx_deployment_targets_retired_at + ON deployment_targets(retired_at) WHERE retired_at IS NOT NULL; diff --git a/web/src/api/client.test.ts b/web/src/api/client.test.ts index 4204846..d776c6c 100644 --- a/web/src/api/client.test.ts +++ b/web/src/api/client.test.ts @@ -19,6 +19,8 @@ import { getAgents, getAgent, registerAgent, + retireAgent, + listRetiredAgents, getJobs, cancelJob, approveRenewal, @@ -399,6 +401,113 @@ describe('API Client', () => { }); }); + // ─── Agent Retirement (I-004) ─────────────────────── + // + // These tests pin the GUI's retirement contract against what the backend + // will add in Phase 2b: soft-retire via DELETE, force-cascade via + // ?force=true&reason=..., idempotent 204 on already-retired, 409 blocked + // payload with counts, and a GET /agents/retired listing surface. + // + // All compile-fail until client.ts exports retireAgent + listRetiredAgents + // — the shape of those exports is pinned here rather than assumed. + describe('Agent Retirement (I-004)', () => { + it('retireAgent sends DELETE without query when no force/reason', async () => { + mockFetch.mockReturnValueOnce( + mockJsonResponse({ + retired_at: '2026-04-18T12:00:00Z', + already_retired: false, + cascade: false, + }), + ); + await retireAgent('ag-1'); + const [url, init] = mockFetch.mock.calls[0]; + // Default soft-retire: bare path, no stray ? suffix. + expect(url).toBe('/api/v1/agents/ag-1'); + expect(init.method).toBe('DELETE'); + }); + + it('retireAgent propagates force+reason as URL query parameters', async () => { + mockFetch.mockReturnValueOnce( + mockJsonResponse({ + retired_at: '2026-04-18T12:00:00Z', + already_retired: false, + cascade: true, + counts: { active_targets: 3, active_certificates: 7, pending_jobs: 2 }, + }), + ); + await retireAgent('ag-1', { force: true, reason: 'decommissioning rack 7' }); + const [url, init] = mockFetch.mock.calls[0]; + // URLSearchParams encodes space as "+"; "decommissioning rack 7" → "decommissioning+rack+7" + expect(url).toBe( + '/api/v1/agents/ag-1?force=true&reason=decommissioning+rack+7', + ); + expect(init.method).toBe('DELETE'); + }); + + it('retireAgent omits force=false even when reason is supplied', async () => { + // Client-side guard: the server's 400 ErrForceReasonRequired is the + // fallback; the GUI should never silently promote reason-without-force + // into a force call. Pins that reason-only still hits the soft path. + mockFetch.mockReturnValueOnce( + mockJsonResponse({ + retired_at: '2026-04-18T12:00:00Z', + already_retired: false, + cascade: false, + }), + ); + await retireAgent('ag-1', { reason: 'routine decommission' }); + const [url] = mockFetch.mock.calls[0]; + // force defaults to false → query carries reason only. + expect(url).toBe('/api/v1/agents/ag-1?reason=routine+decommission'); + }); + + it('retireAgent surfaces the 409 dependency error message to the caller', async () => { + mockFetch.mockReturnValueOnce( + mockErrorResponse(409, { + message: 'agent has 3 active targets, 7 active certificates, 2 pending jobs', + }), + ); + await expect(retireAgent('ag-1')).rejects.toThrow( + /active targets|active certificates|pending jobs/, + ); + }); + + it('retireAgent treats 204 (already-retired) as success with empty body', async () => { + mockFetch.mockReturnValueOnce( + Promise.resolve({ + ok: true, + status: 204, + json: () => Promise.reject(new Error('204 has no body')), + statusText: 'No Content', + } as Response), + ); + // fetchJSON normalises 204 to {} — caller must not crash. + const result = await retireAgent('ag-1'); + expect(result).toBeDefined(); + }); + + it('listRetiredAgents sends GET /agents/retired with default pagination', async () => { + mockFetch.mockReturnValueOnce( + mockJsonResponse({ data: [], total: 0, page: 1, per_page: 50 }), + ); + await listRetiredAgents(); + const [url, init] = mockFetch.mock.calls[0]; + expect(url).toBe('/api/v1/agents/retired?page=1&per_page=50'); + // Default is GET — no explicit method means fetchJSON falls through. + expect(init.method ?? 'GET').toBe('GET'); + }); + + it('listRetiredAgents forwards page/per_page overrides', async () => { + mockFetch.mockReturnValueOnce( + mockJsonResponse({ data: [], total: 0, page: 2, per_page: 100 }), + ); + await listRetiredAgents({ page: '2', per_page: '100' }); + const [url] = mockFetch.mock.calls[0]; + expect(url).toContain('page=2'); + expect(url).toContain('per_page=100'); + }); + }); + // ─── Jobs ─────────────────────────────────────────── describe('Jobs', () => { diff --git a/web/src/api/client.ts b/web/src/api/client.ts index ed27667..36dc9ba 100644 --- a/web/src/api/client.ts +++ b/web/src/api/client.ts @@ -1,4 +1,4 @@ -import type { Certificate, CertificateVersion, Agent, Job, Notification, AuditEvent, PolicyRule, PolicyViolation, Issuer, Target, CertificateProfile, Owner, Team, AgentGroup, PaginatedResponse, DashboardSummary, CertificateStatusCount, ExpirationBucket, JobTrendDataPoint, IssuanceRateDataPoint, MetricsResponse, DiscoveredCertificate, DiscoveryScan, DiscoverySummary, NetworkScanTarget, EndpointHealthCheck, HealthHistoryEntry, HealthCheckSummary } from './types'; +import type { Certificate, CertificateVersion, Agent, Job, Notification, AuditEvent, PolicyRule, PolicyViolation, Issuer, Target, CertificateProfile, Owner, Team, AgentGroup, PaginatedResponse, DashboardSummary, CertificateStatusCount, ExpirationBucket, JobTrendDataPoint, IssuanceRateDataPoint, MetricsResponse, DiscoveredCertificate, DiscoveryScan, DiscoverySummary, NetworkScanTarget, EndpointHealthCheck, HealthHistoryEntry, HealthCheckSummary, AgentDependencyCounts, RetireAgentResponse, BlockedByDependenciesResponse } from './types'; const BASE = '/api/v1'; @@ -188,6 +188,98 @@ export const getAgent = (id: string) => export const registerAgent = (data: Partial) => fetchJSON(`${BASE}/agents`, { method: 'POST', body: JSON.stringify(data) }); +// I-004: typed error thrown by retireAgent when the server returns HTTP 409 with +// {error: "blocked_by_dependencies", ...}. Callers that want to show the +// dependency-counts dialog should `catch (e)` and check `e instanceof +// BlockedByDependenciesError` — the counts field is the same shape the +// backend handler returns from its inline struct in +// internal/api/handler/agents.go. Generic network / 5xx failures still throw +// plain Error so existing error-boundary code is unaffected. +export class BlockedByDependenciesError extends Error { + readonly counts: AgentDependencyCounts; + constructor(message: string, counts: AgentDependencyCounts) { + super(message); + this.name = 'BlockedByDependenciesError'; + this.counts = counts; + } +} + +// I-004: retire an agent via DELETE /api/v1/agents/{id}. Three distinct +// success paths the UI needs to distinguish: +// * 200 — fresh retire; body has retired_at, already_retired=false, cascade +// flag, counts of what was cascaded. +// * 204 — idempotent re-retire; the row was already retired. No body. We +// synthesize a RetireAgentResponse with already_retired=true and zero +// counts so the caller can keep a single return type. +// * 409 — blocked_by_dependencies; thrown as BlockedByDependenciesError so +// the caller can surface the active_targets/active_certificates/pending_jobs +// counts in a confirmation dialog and offer force=true. +// Anything else bubbles up via the standard fetchJSON error path. +export const retireAgent = async ( + id: string, + opts: { force?: boolean; reason?: string } = {}, +): Promise => { + const qs = new URLSearchParams(); + if (opts.force) qs.set('force', 'true'); + if (opts.reason) qs.set('reason', opts.reason); + const url = qs.toString() + ? `${BASE}/agents/${id}?${qs.toString()}` + : `${BASE}/agents/${id}`; + + const res = await fetch(url, { + method: 'DELETE', + headers: authHeaders(), + }); + + if (res.status === 401) { + window.dispatchEvent(new CustomEvent('certctl:auth-required')); + throw new Error('Authentication required'); + } + + // 204 No Content — idempotent re-retire. Synthesize a response so callers + // get a uniform shape; already_retired=true tells them the agent was + // already in the retired state before this call. + if (res.status === 204) { + return { + retired_at: '', + already_retired: true, + cascade: false, + counts: { active_targets: 0, active_certificates: 0, pending_jobs: 0 }, + }; + } + + if (res.status === 409) { + // Body is always JSON for 409 per the handler contract. + const body = (await res.json()) as BlockedByDependenciesResponse; + throw new BlockedByDependenciesError( + body.message || 'agent has active dependencies', + body.counts, + ); + } + + if (!res.ok) { + let errorMsg = res.statusText; + try { + const body = await res.json(); + errorMsg = body.message || body.error || errorMsg; + } catch { + // not JSON + } + throw new Error(errorMsg || `HTTP ${res.status}`); + } + + return (await res.json()) as RetireAgentResponse; +}; + +// I-004: list retired agents via GET /api/v1/agents/retired. Kept separate +// from getAgents (which hits the default active-only listing) so the retired +// tab on AgentsPage can page independently. per_page is capped server-side at +// 500 (see handler ListRetiredAgents). +export const listRetiredAgents = (params: Record = {}) => { + const qs = new URLSearchParams({ page: '1', per_page: '50', ...params }).toString(); + return fetchJSON>(`${BASE}/agents/retired?${qs}`); +}; + // Jobs export const getJobs = (params: Record = {}) => { const qs = new URLSearchParams({ page: '1', per_page: '50', ...params }).toString(); diff --git a/web/src/api/types.test.ts b/web/src/api/types.test.ts index 8ac0af5..1b5745e 100644 --- a/web/src/api/types.test.ts +++ b/web/src/api/types.test.ts @@ -1,5 +1,6 @@ import { describe, it, expect } from 'vitest'; import { POLICY_TYPES, POLICY_SEVERITIES } from './types'; +import type { Agent } from './types'; /** * Regression tests for the policy enum tuples. @@ -58,3 +59,76 @@ describe('POLICY_SEVERITIES', () => { expect(POLICY_SEVERITIES as readonly string[]).not.toContain('medium'); }); }); + +/** + * Regression test for the Agent interface's I-004 soft-retirement shape. + * + * Backend (migration 000015, Phase 2b) adds two nullable timestamps/strings to + * the agents table — `retired_at` and `retired_reason` — mirroring the existing + * Certificate.revoked_at / Certificate.revocation_reason pair. The GUI needs + * these fields on the Agent interface so the Retired tab, retire modal, and + * retirement banner can render the agent's retired state without resorting to + * `(agent as any).retired_at` escapes. + * + * Both fields are optional (agent.ts interface) because the server omits them + * from the response for active agents. A compile-time shape check here pins + * that Phase 2b does not drift the field names (e.g. to retiredAt camelCase) + * or accidentally promote them to required. + * + * Compile-fail until Phase 2b adds: + * retired_at?: string; + * retired_reason?: string; + * to the Agent interface in types.ts. + */ +describe('Agent interface (I-004 retirement)', () => { + it('accepts retired_at and retired_reason as optional string fields', () => { + // Construct an Agent with the retirement fields set. If Phase 2b names + // them anything other than retired_at / retired_reason, this fails to + // compile — which is exactly what the Red stage wants. + const retired: Agent = { + id: 'ag-1', + name: 'decom-01', + hostname: 'server-old', + ip_address: '10.0.0.1', + os: 'linux', + architecture: 'amd64', + status: 'Offline', + version: '2.1.0', + last_heartbeat: '2026-01-01T00:00:00Z', + last_heartbeat_at: '2026-01-01T00:00:00Z', + capabilities: [], + tags: {}, + registered_at: '2024-01-01T00:00:00Z', + created_at: '2024-01-01T00:00:00Z', + updated_at: '2026-01-01T00:00:00Z', + retired_at: '2026-01-01T00:00:00Z', + retired_reason: 'old hardware', + }; + expect(retired.retired_at).toBe('2026-01-01T00:00:00Z'); + expect(retired.retired_reason).toBe('old hardware'); + }); + + it('accepts an Agent without retired_at / retired_reason (optional fields)', () => { + // Active agents should not carry retirement metadata. If Phase 2b makes + // the fields required, this block fails to compile. + const active: Agent = { + id: 'ag-2', + name: 'web01', + hostname: 'web01.prod', + ip_address: '10.0.0.2', + os: 'linux', + architecture: 'amd64', + status: 'Online', + version: '2.1.0', + last_heartbeat: '2026-04-18T12:00:00Z', + last_heartbeat_at: '2026-04-18T12:00:00Z', + capabilities: ['deploy', 'scan'], + tags: {}, + registered_at: '2024-06-01T00:00:00Z', + created_at: '2024-06-01T00:00:00Z', + updated_at: '2026-04-18T12:00:00Z', + }; + expect(active.retired_at).toBeUndefined(); + expect(active.retired_reason).toBeUndefined(); + }); +}); diff --git a/web/src/api/types.ts b/web/src/api/types.ts index 2fbd216..59103b8 100644 --- a/web/src/api/types.ts +++ b/web/src/api/types.ts @@ -67,6 +67,43 @@ export interface Agent { registered_at: string; created_at: string; updated_at: string; + // I-004: soft-retirement fields. When retired_at is non-null, the agent is + // tombstoned — it will never heartbeat again and cascaded targets have been + // retired alongside it. The retired tab on AgentsPage uses these to show the + // when/why. The server filters retired rows from the default /api/v1/agents + // listing; they appear only via GET /api/v1/agents/retired. + retired_at?: string | null; + retired_reason?: string | null; +} + +// I-004: dependency counts returned by the retire handler in both the 200 +// success-with-cascade body and the 409 blocked_by_dependencies body. The +// operator UI uses these to show "this agent has N targets, M certs, K jobs +// depending on it" in the confirm-retire dialog. +export interface AgentDependencyCounts { + active_targets: number; + active_certificates: number; + pending_jobs: number; +} + +// I-004: success shape for DELETE /api/v1/agents/{id}. already_retired is +// always false for 200 responses; 204 responses carry no body (the retire was +// idempotent — the agent was already retired). The frontend distinguishes by +// HTTP status, not by this field. +export interface RetireAgentResponse { + retired_at: string; + already_retired: boolean; + cascade: boolean; + counts: AgentDependencyCounts; +} + +// I-004: shape returned with HTTP 409 when a retire is blocked by active +// downstream dependencies. Keep in lockstep with the handler's inline struct +// in internal/api/handler/agents.go (search "blocked_by_dependencies"). +export interface BlockedByDependenciesResponse { + error: 'blocked_by_dependencies'; + message: string; + counts: AgentDependencyCounts; } export interface Job { diff --git a/web/src/pages/AgentsPage.tsx b/web/src/pages/AgentsPage.tsx index 4123d39..caf94eb 100644 --- a/web/src/pages/AgentsPage.tsx +++ b/web/src/pages/AgentsPage.tsx @@ -1,13 +1,19 @@ +import { useState } from 'react'; import { useNavigate } from 'react-router-dom'; -import { useQuery } from '@tanstack/react-query'; -import { getAgents } from '../api/client'; +import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'; +import { + getAgents, + listRetiredAgents, + retireAgent, + BlockedByDependenciesError, +} from '../api/client'; import PageHeader from '../components/PageHeader'; import DataTable from '../components/DataTable'; import type { Column } from '../components/DataTable'; import StatusBadge from '../components/StatusBadge'; import ErrorState from '../components/ErrorState'; import { timeAgo } from '../api/utils'; -import type { Agent } from '../api/types'; +import type { Agent, AgentDependencyCounts } from '../api/types'; function heartbeatStatus(lastHeartbeat: string): string { if (!lastHeartbeat) return 'Offline'; @@ -17,15 +23,84 @@ function heartbeatStatus(lastHeartbeat: string): string { return 'Offline'; } +type TabKey = 'active' | 'retired'; + +// I-004: retire-modal state machine. +// confirm — operator clicked Retire, shown plain confirm + optional reason. +// blocked — soft retire returned 409; switch to a force-retire dialog that +// shows the dependency counts and requires a reason before the +// operator can opt into ?force=true. +// error — any other failure (network, 500, unexpected 4xx). Reused by both +// the initial attempt and the force retry. +type ModalMode = + | { kind: 'closed' } + | { kind: 'confirm'; agent: Agent; reason: string } + | { kind: 'blocked'; agent: Agent; reason: string; counts: AgentDependencyCounts } + | { kind: 'error'; agent: Agent; message: string }; + export default function AgentsPage() { const navigate = useNavigate(); - const { data, isLoading, error, refetch } = useQuery({ + const qc = useQueryClient(); + const [tab, setTab] = useState('active'); + const [modal, setModal] = useState({ kind: 'closed' }); + + const active = useQuery({ queryKey: ['agents'], queryFn: () => getAgents(), refetchInterval: 15000, + enabled: tab === 'active', }); - const columns: Column[] = [ + const retired = useQuery({ + queryKey: ['agents', 'retired'], + queryFn: () => listRetiredAgents(), + refetchInterval: 30000, + enabled: tab === 'retired', + }); + + // retireAgent mutation wrapping both paths. The caller supplies force/reason, + // and we invalidate both queries on success so the retired tab refreshes and + // the active tab drops the row. 409s are converted into modal.mode=blocked so + // the operator can escalate to force; everything else becomes modal.mode=error. + const mutation = useMutation({ + mutationFn: (input: { agent: Agent; force?: boolean; reason?: string }) => + retireAgent(input.agent.id, { force: input.force, reason: input.reason }), + onSuccess: () => { + qc.invalidateQueries({ queryKey: ['agents'] }); + qc.invalidateQueries({ queryKey: ['agents', 'retired'] }); + setModal({ kind: 'closed' }); + }, + }); + + // Shared submit handler: when we know the current modal.agent + modal.reason, + // decide whether this is a soft retire or force retire based on modal.kind. + const submitRetire = (force: boolean) => { + if (modal.kind !== 'confirm' && modal.kind !== 'blocked') return; + const { agent, reason } = modal; + mutation.mutate( + { agent, force, reason: reason || undefined }, + { + onError: (err) => { + if (err instanceof BlockedByDependenciesError) { + setModal({ + kind: 'blocked', + agent, + reason, + counts: err.counts ?? { active_targets: 0, active_certificates: 0, pending_jobs: 0 }, + }); + return; + } + setModal({ + kind: 'error', + agent, + message: err instanceof Error ? err.message : String(err), + }); + }, + }, + ); + }; + + const activeColumns: Column[] = [ { key: 'name', label: 'Agent', @@ -41,27 +116,318 @@ export default function AgentsPage() { label: 'Health', render: (a) => , }, - { key: 'hostname', label: 'Hostname', render: (a) => {a.hostname || '—'} }, - { key: 'os', label: 'OS / Arch', render: (a) => {a.os && a.architecture ? `${a.os}/${a.architecture}` : a.os || '—'} }, - { key: 'ip', label: 'IP Address', render: (a) => {a.ip_address || '—'} }, - { key: 'version', label: 'Version', render: (a) => {a.version || '—'} }, + { + key: 'hostname', + label: 'Hostname', + render: (a) => {a.hostname || '—'}, + }, + { + key: 'os', + label: 'OS / Arch', + render: (a) => ( + + {a.os && a.architecture ? `${a.os}/${a.architecture}` : a.os || '—'} + + ), + }, + { + key: 'ip', + label: 'IP Address', + render: (a) => {a.ip_address || '—'}, + }, + { + key: 'version', + label: 'Version', + render: (a) => {a.version || '—'}, + }, { key: 'heartbeat', label: 'Last Heartbeat', render: (a) => {timeAgo(a.last_heartbeat_at)}, }, + { + key: 'actions', + label: '', + render: (a) => ( + + ), + }, ]; + const retiredColumns: Column[] = [ + { + key: 'name', + label: 'Agent', + render: (a) => ( +
+
{a.name}
+
{a.id}
+
+ ), + }, + { + key: 'hostname', + label: 'Hostname', + render: (a) => {a.hostname || '—'}, + }, + { + key: 'os', + label: 'OS / Arch', + render: (a) => ( + + {a.os && a.architecture ? `${a.os}/${a.architecture}` : a.os || '—'} + + ), + }, + { + key: 'retired_at', + label: 'Retired', + render: (a) => {timeAgo(a.retired_at || '')}, + }, + { + key: 'retired_reason', + label: 'Reason', + render: (a) => ( + {a.retired_reason || } + ), + }, + ]; + + const currentQuery = tab === 'active' ? active : retired; + const currentColumns = tab === 'active' ? activeColumns : retiredColumns; + const emptyMessage = tab === 'active' ? 'No agents registered' : 'No retired agents'; + return ( <> - + + +
+
+ setTab('active')}> + Active + + setTab('retired')}> + Retired + +
+
+
- {error ? ( - refetch()} /> + {currentQuery.error ? ( + currentQuery.refetch()} /> ) : ( - navigate(`/agents/${a.id}`)} /> + navigate(`/agents/${a.id}`)} + /> )}
+ + {modal.kind !== 'closed' && ( + setModal({ kind: 'closed' })} + onReasonChange={(reason) => { + if (modal.kind === 'confirm') setModal({ ...modal, reason }); + if (modal.kind === 'blocked') setModal({ ...modal, reason }); + }} + onSoftRetire={() => submitRetire(false)} + onForceRetire={() => submitRetire(true)} + /> + )} ); } + +function TabButton({ + active, + onClick, + children, +}: { + active: boolean; + onClick: () => void; + children: React.ReactNode; +}) { + return ( + + ); +} + +function RetireModal({ + mode, + pending, + onClose, + onReasonChange, + onSoftRetire, + onForceRetire, +}: { + mode: ModalMode; + pending: boolean; + onClose: () => void; + onReasonChange: (reason: string) => void; + onSoftRetire: () => void; + onForceRetire: () => void; +}) { + if (mode.kind === 'closed') return null; + + return ( +
+
e.stopPropagation()} + > + {mode.kind === 'confirm' && ( + <> +

Retire agent

+

+ {mode.agent.name} ({mode.agent.id}) will be + soft-retired. The agent will stop receiving heartbeats and be removed from active + listings. This is reversible only by direct database intervention. +

+ +
+ + +
+ + )} + + {mode.kind === 'blocked' && ( + <> +

Cannot retire — active dependencies

+

+ The agent {mode.agent.name} still has downstream + work tied to it. Force-retiring will cascade-retire all active targets and fail any + pending jobs. +

+
+
+
Active targets
+
{mode.counts.active_targets}
+
+
+
Active certs
+
+ {mode.counts.active_certificates} +
+
+
+
Pending jobs
+
{mode.counts.pending_jobs}
+
+
+ +
+ + +
+ + )} + + {mode.kind === 'error' && ( + <> +

Retire failed

+

{mode.message}

+
+ +
+ + )} +
+
+ ); +}