mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 18:01:37 +00:00
Close I-004 (agent hard-delete cascades targets) coverage-gap finding
Operator decision answered as full soft-delete with optional forced
cascade — hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.
Schema — migration 000015:
migrations/000015_agent_retire.up.sql flips
deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
boundary instead of quietly destroying targets. Both `agents` and
`deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
TEXT pair (TEXT not VARCHAR so operator comments are never
truncated), indexed via partial indexes WHERE retired_at IS NOT
NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
EXISTS) so repeated runs against partially-migrated databases
converge. migrations/000015_agent_retire.down.sql restores CASCADE
and drops the new columns for clean rollback. A dedicated
repository-layer testcontainers test
(internal/repository/postgres/migration_000015_test.go) asserts the
before/after FK action, column presence, index presence, and
round-trip idempotency under up→down→up.
Domain — sentinel guard + dependency counts:
internal/domain/connector.go gains IsRetired() on Agent, the
exported SentinelAgentIDs slice listing server-scanner,
cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
four reserved IDs documented in CLAUDE.md and created at startup in
cmd/server/main.go), IsSentinelAgent(id string) predicate,
AgentDependencyCounts{ActiveTargets, ActiveCertificates,
PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
ActorTypeSystem enum values used by audit emission downstream.
Coverage locked down by internal/domain/connector_test.go.
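A minimal sketch of the domain helpers described above, for orientation only. The four sentinel IDs come from the message; the Agent field set, the pointer-typed RetiredAt, and the method bodies are assumptions, not the code in internal/domain/connector.go.

```go
package main

import (
	"fmt"
	"time"
)

// SentinelAgentIDs lists the four reserved IDs named in the commit message.
var SentinelAgentIDs = []string{"server-scanner", "cloud-aws-sm", "cloud-azure-kv", "cloud-gcp-sm"}

// IsSentinelAgent reports whether id is one of the reserved sentinel agents.
func IsSentinelAgent(id string) bool {
	for _, s := range SentinelAgentIDs {
		if s == id {
			return true
		}
	}
	return false
}

// Agent is trimmed to the fields this sketch needs (assumed shape).
type Agent struct {
	ID        string
	RetiredAt *time.Time // nil means the agent is active
}

// IsRetired treats a non-nil RetiredAt as the retired state.
func (a *Agent) IsRetired() bool { return a != nil && a.RetiredAt != nil }

// AgentDependencyCounts mirrors the three preflight buckets.
type AgentDependencyCounts struct {
	ActiveTargets      int
	ActiveCertificates int
	PendingJobs        int
}

// HasDependencies is true when any bucket is non-zero.
func (c AgentDependencyCounts) HasDependencies() bool {
	return c.ActiveTargets > 0 || c.ActiveCertificates > 0 || c.PendingJobs > 0
}

func main() {
	now := time.Now()
	fmt.Println(IsSentinelAgent("cloud-aws-sm"), IsSentinelAgent("agent-42"))
	fmt.Println((&Agent{ID: "agent-42", RetiredAt: &now}).IsRetired())
	fmt.Println(AgentDependencyCounts{PendingJobs: 1}.HasDependencies())
}
```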
Service — 8-step ordered contract:
internal/service/agent_retire.go:RetireAgent(ctx, id, actor,
opts{Force, Reason}) enforces a fixed execution order:
(1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel
unconditionally; force=true does NOT bypass it.
(2) fetch — ErrAgentNotFound on miss.
(3) idempotency — if IsRetired() already, return
AgentRetirementResult{AlreadyRetired: true} with no new audit
event and no state change (safe to replay from flaky clients).
(4) preflight counts — collectAgentDependencyCounts runs
ActiveTargets, ActiveCertificates, PendingJobs sequentially
(not in parallel; keeps the per-query timeout predictable and
matches the repo's existing call-chain shape).
(5) force-reason guard — opts.Force=true with empty Reason returns
ErrForceReasonRequired (wired into the 400 status surface).
(6) dependency guard — HasDependencies() with opts.Force=false
returns BlockedByDependenciesError{Counts} (wired into the 409
body with per-bucket counts).
(7) mutation — single pinned retiredAt := time.Now(); agent
retirement first, then cascade target retirement if opts.Force,
all under the repo's single transaction so the two retired_at
stamps match to the second.
(8) best-effort audit — agent_retired always;
agent_retirement_cascaded additionally on the force path. Actor is whatever the
handler resolves from the request; actor type is mapped by
resolveActorType (system→System, agent-prefix→Agent, else→User). Audit
emission failures are logged via slog.Error but do not abort
the retirement (matches the house convention used by every
other scheduler-emitted event).
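The ordering above can be sketched against an in-memory store. This is an illustration under stated assumptions, not the code in internal/service/agent_retire.go: the store type, the Deps field standing in for the three count queries, and the audit slice are hypothetical, and the real method also takes a context and an actor.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

var (
	ErrAgentIsSentinel       = errors.New("agent is a reserved sentinel")
	ErrAgentNotFound         = errors.New("agent not found")
	ErrForceReasonRequired   = errors.New("force requires a reason")
	ErrBlockedByDependencies = errors.New("blocked by dependencies")
)

type Counts struct{ ActiveTargets, ActiveCertificates, PendingJobs int }

func (c Counts) HasDependencies() bool {
	return c.ActiveTargets+c.ActiveCertificates+c.PendingJobs > 0
}

type Agent struct {
	RetiredAt *time.Time
	Deps      Counts // hypothetical stand-in for the three preflight queries
}

type Result struct {
	AlreadyRetired bool
	Cascade        bool
	RetiredAt      time.Time
	Counts         Counts
}

// store is a hypothetical in-memory replacement for the repository layer.
type store struct {
	agents map[string]*Agent
	audit  []string
}

var sentinels = map[string]bool{
	"server-scanner": true, "cloud-aws-sm": true, "cloud-azure-kv": true, "cloud-gcp-sm": true,
}

func (s *store) RetireAgent(id string, force bool, reason string) (Result, error) {
	if sentinels[id] { // (1) sentinel guard; force never bypasses it
		return Result{}, ErrAgentIsSentinel
	}
	a, ok := s.agents[id] // (2) fetch
	if !ok {
		return Result{}, ErrAgentNotFound
	}
	if a.RetiredAt != nil { // (3) idempotent replay: no mutation, no audit
		return Result{AlreadyRetired: true, RetiredAt: *a.RetiredAt}, nil
	}
	counts := a.Deps                              // (4) preflight counts
	if force && strings.TrimSpace(reason) == "" { // (5) force-reason guard
		return Result{}, ErrForceReasonRequired
	}
	if counts.HasDependencies() && !force { // (6) dependency guard
		return Result{Counts: counts}, fmt.Errorf("%w: %+v", ErrBlockedByDependencies, counts)
	}
	retiredAt := time.Now() // (7) one pinned timestamp for agent and cascade
	a.RetiredAt = &retiredAt
	s.audit = append(s.audit, "agent_retired") // (8) audit
	if force {
		a.Deps = Counts{} // stands in for the cascade target retirement
		s.audit = append(s.audit, "agent_retirement_cascaded")
	}
	return Result{RetiredAt: retiredAt, Cascade: force, Counts: counts}, nil
}

func main() {
	s := &store{agents: map[string]*Agent{"agent-42": {Deps: Counts{ActiveTargets: 2}}}}
	_, err := s.RetireAgent("agent-42", false, "")
	fmt.Println("without force:", err)
	res, err := s.RetireAgent("agent-42", true, "host decommissioned")
	fmt.Println("with force:", res.Cascade, err, s.audit)
}
```

Because the sentinel check sits ahead of the fetch, a forced request against a reserved ID fails the same way whether or not the row exists.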
BlockedByDependenciesError implements Error() as
"active_targets=%d, active_certificates=%d, pending_jobs=%d" and
Unwrap() → ErrBlockedByDependencies. The single struct satisfies
errors.Is via Unwrap (used by scheduler-level tests) and errors.As
via the concrete type (used by the handler to fish out Counts for
the 409 body). ListRetiredAgents(page, perPage) adds a separate
paginated accessor with page<1→1 and perPage<1→50 normalization so
retired rows are queryable without polluting the default agent
listing.
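A short sketch of how one error struct can serve both errors.Is and errors.As, plus the paging normalization. The Error() format string and the page/perPage defaults are taken from the message; the pointer receiver and the helper names isBlocked, countsFrom and normalizePaging are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrBlockedByDependencies = errors.New("blocked by dependencies")

type AgentDependencyCounts struct{ ActiveTargets, ActiveCertificates, PendingJobs int }

// BlockedByDependenciesError carries the counts and unwraps to the sentinel.
type BlockedByDependenciesError struct{ Counts AgentDependencyCounts }

func (e *BlockedByDependenciesError) Error() string {
	return fmt.Sprintf("active_targets=%d, active_certificates=%d, pending_jobs=%d",
		e.Counts.ActiveTargets, e.Counts.ActiveCertificates, e.Counts.PendingJobs)
}

func (e *BlockedByDependenciesError) Unwrap() error { return ErrBlockedByDependencies }

// isBlocked matches on the sentinel, as a test might.
func isBlocked(err error) bool { return errors.Is(err, ErrBlockedByDependencies) }

// countsFrom extracts the concrete type, as a handler building a 409 body might.
func countsFrom(err error) (AgentDependencyCounts, bool) {
	var blocked *BlockedByDependenciesError
	if errors.As(err, &blocked) {
		return blocked.Counts, true
	}
	return AgentDependencyCounts{}, false
}

// normalizePaging applies the page<1 -> 1 and perPage<1 -> 50 defaults.
func normalizePaging(page, perPage int) (int, int) {
	if page < 1 {
		page = 1
	}
	if perPage < 1 {
		perPage = 50
	}
	return page, perPage
}

func main() {
	err := fmt.Errorf("retire agent: %w",
		&BlockedByDependenciesError{Counts: AgentDependencyCounts{ActiveTargets: 3}})
	fmt.Println(isBlocked(err))
	fmt.Println(countsFrom(err))
	fmt.Println(normalizePaging(0, 0))
}
```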
Sentinel guard coverage is asymmetric by design: all four reserved
IDs are protected, and force=true cannot override. Regression tests
in internal/service/agent_retire_test.go assert each of the eight
steps in order, plus sentinel bypass attempts and idempotency
replay.
Handler + router — status-code surface:
internal/api/handler/agents.go:RetireAgent exposes seven status
codes on DELETE /agents/{id}:
200 on a fresh retirement (body echoes AgentRetirementResult).
204 on idempotent replay (AlreadyRetired=true; no new audit).
400 on ErrForceReasonRequired.
403 on ErrAgentIsSentinel.
404 on ErrAgentNotFound.
409 on BlockedByDependenciesError, with a custom body shape
{error, counts{active_targets, active_certificates,
pending_jobs}} that bypasses the default ErrorWithRequestID
envelope so callers get the per-bucket numbers directly.
500 on any other error.
Heartbeat HandleHeartbeat returns 410 Gone when the agent is
retired (ErrAgentRetired), signalling the agent to shut down.
Query params `force=true` and `reason=<text>` drive the cascade
path; both are forwarded as url.Values through the new MCP
transport.
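The status surface can be sketched as two small pure functions. The error names and status codes follow the list above, and the lenient force parsing follows the OpenAPI parameter description in this commit; statusFor and parseForce are hypothetical helper names, and the real handler uses errors.As on the concrete blocked-error type where this sketch uses a sentinel.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strconv"
)

var (
	ErrForceReasonRequired   = errors.New("force requires a reason")
	ErrAgentIsSentinel       = errors.New("agent is a reserved sentinel")
	ErrAgentNotFound         = errors.New("agent not found")
	ErrBlockedByDependencies = errors.New("blocked by dependencies")
)

// parseForce treats anything strconv.ParseBool rejects as false, so a typoed
// query value cannot enable the cascade.
func parseForce(raw string) bool {
	v, err := strconv.ParseBool(raw)
	return err == nil && v
}

// statusFor maps a service outcome to the HTTP status the handler would send.
func statusFor(alreadyRetired bool, err error) int {
	switch {
	case err == nil && alreadyRetired:
		return http.StatusNoContent // 204: idempotent replay, empty body
	case err == nil:
		return http.StatusOK // 200: fresh retirement
	case errors.Is(err, ErrForceReasonRequired):
		return http.StatusBadRequest // 400
	case errors.Is(err, ErrAgentIsSentinel):
		return http.StatusForbidden // 403
	case errors.Is(err, ErrAgentNotFound):
		return http.StatusNotFound // 404
	case errors.Is(err, ErrBlockedByDependencies):
		return http.StatusConflict // 409, body carries per-bucket counts
	default:
		return http.StatusInternalServerError // 500
	}
}

func main() {
	fmt.Println(parseForce("true"), parseForce("yes"))
	fmt.Println(statusFor(false, fmt.Errorf("retire: %w", ErrBlockedByDependencies)))
}
```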
internal/api/router/router.go registers GET /api/v1/agents/retired
literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's
literal-beats-pattern-var precedence routes "retired" to the
paginated retired-agents listing instead of fetching a hypothetical
agent named "retired".
Agent binary — clean shutdown on 410:
cmd/agent/main.go gains the ErrAgentRetired sentinel, a
retiredOnce sync.Once, and a retiredSignal chan struct{}. A
markRetired(source, statusCode, body) helper closes the channel
exactly once; the Run() select loop observes the close and returns
ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
and exits cleanly instead of spinning in the heartbeat retry loop.
The 410 Gone surface is therefore terminal for the agent process.
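A reduced sketch of the close-once shutdown signal. The names ErrAgentRetired, retiredOnce, retiredSignal and markRetired come from the message; the simplified markRetired signature and the statuses channel, which stands in for the heartbeat loop's HTTP responses, are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync"
)

var ErrAgentRetired = errors.New("agent retired")

type Agent struct {
	retiredOnce   sync.Once
	retiredSignal chan struct{}
}

func NewAgent() *Agent { return &Agent{retiredSignal: make(chan struct{})} }

// markRetired closes the signal channel exactly once, even if several
// code paths observe a 410 concurrently.
func (a *Agent) markRetired() {
	a.retiredOnce.Do(func() { close(a.retiredSignal) })
}

// Run consumes heartbeat status codes until the retired signal fires.
// The statuses channel is a hypothetical stand-in for real heartbeat calls
// and is expected to stay open for the lifetime of the agent.
func (a *Agent) Run(statuses <-chan int) error {
	for {
		select {
		case <-a.retiredSignal:
			return ErrAgentRetired
		case st := <-statuses:
			if st == http.StatusGone {
				a.markRetired()
			}
		}
	}
}

func main() {
	a := NewAgent()
	statuses := make(chan int)
	go func() {
		statuses <- http.StatusOK
		statuses <- http.StatusGone
	}()
	err := a.Run(statuses)
	// main() in the real agent would exit cleanly on this match.
	fmt.Println(errors.Is(err, ErrAgentRetired))
}
```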
MCP transport:
internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
a new additive transport method. Client.Delete is path-only; without
this method the retire tool would silently drop `force` and `reason`,
turning every cascade retire into a default soft-retire. The new
method shares do()'s 204 normalization and 4xx/5xx error
propagation so tool authors get one contract.
internal/mcp/tools.go + internal/mcp/types.go expose the
retire_agent tool with Force+Reason inputs wired through
DeleteWithQuery.
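A sketch of what a query-aware DELETE helper might look like. Only the name DeleteWithQuery and its (path, query) parameters come from the message; the Client fields, the (int, error) return shape, the roundTripFunc stub transport and the demo helper are assumptions, and the real method shares do() with the other verbs.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

type Client struct {
	BaseURL string
	HTTP    *http.Client
}

// DeleteWithQuery sends DELETE path?query and returns the status code.
func (c *Client) DeleteWithQuery(path string, query url.Values) (int, error) {
	target := c.BaseURL + path
	if encoded := query.Encode(); encoded != "" {
		target += "?" + encoded
	}
	req, err := http.NewRequest(http.MethodDelete, target, nil)
	if err != nil {
		return 0, err
	}
	resp, err := c.HTTP.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 400 {
		return resp.StatusCode, fmt.Errorf("DELETE %s: status %d", path, resp.StatusCode)
	}
	return resp.StatusCode, nil
}

// roundTripFunc lets the demo answer requests in memory, without a network.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// demo returns the query parameters the stub transport actually received.
func demo() (url.Values, int, error) {
	var seen url.Values
	stub := roundTripFunc(func(r *http.Request) (*http.Response, error) {
		seen = r.URL.Query()
		return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody, Request: r}, nil
	})
	c := &Client{BaseURL: "http://certctl.invalid", HTTP: &http.Client{Transport: stub}}
	status, err := c.DeleteWithQuery("/api/v1/agents/agent-42",
		url.Values{"force": {"true"}, "reason": {"host decommissioned"}})
	return seen, status, err
}

func main() {
	seen, status, err := demo()
	fmt.Println(status, err, seen.Get("force"), seen.Get("reason"))
}
```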
CLI:
cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
`agents list --retired` (client-side strip of --retired then
delegation to ListRetiredAgents, sharing --page/--per-page parsing
with the default listing) and `agents retire <id> [--force --reason
"…"]` (mirrors ErrForceReasonRequired — force without reason is
rejected client-side before the request is sent). JSON + table
output modes both honor the new columns.
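The client-side pairing check can be sketched with the standard flag package. The flag names and the force-without-reason rule come from the message; parseRetireArgs, its error strings, and the assumption that the ID precedes the flags are illustrative only and not the code in cmd/cli/main.go.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"strings"
)

// parseRetireArgs parses the arguments that follow "agents retire".
func parseRetireArgs(args []string) (id string, force bool, reason string, err error) {
	if len(args) == 0 || strings.HasPrefix(args[0], "-") {
		return "", false, "", errors.New(`usage: agents retire <id> [--force --reason "..."]`)
	}
	id = args[0]

	fs := flag.NewFlagSet("agents retire", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr in this sketch
	fs.BoolVar(&force, "force", false, "cascade-retire dependent targets")
	fs.StringVar(&reason, "reason", "", "reason recorded with the retirement")
	if err = fs.Parse(args[1:]); err != nil {
		return "", false, "", err
	}

	// Mirror ErrForceReasonRequired before any request is sent.
	if force && strings.TrimSpace(reason) == "" {
		return "", false, "", errors.New("--force requires a non-empty --reason")
	}
	return id, force, reason, nil
}

func main() {
	_, _, _, err := parseRetireArgs([]string{"agent-42", "--force"})
	fmt.Println("force only:", err)
	id, force, reason, err := parseRetireArgs([]string{"agent-42", "--force", "--reason", "host decommissioned"})
	fmt.Println(id, force, reason, err)
}
```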
Frontend:
web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
web/src/api/client.ts + web/src/api/types.ts expose the retire
endpoint and the retired-listing. 4 new Vitest regression cases.
OpenAPI:
api/openapi.yaml documents DELETE /agents/{id} with all seven
status codes, 410 on heartbeat, and the 409 per-bucket body shape.
Regression coverage (six new test files, all green):
internal/service/agent_retire_test.go — 8-step contract + sentinel guards
internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
internal/mcp/retire_agent_test.go — DeleteWithQuery wire-through
internal/cli/agent_retire_test.go — --retired listing + --force/--reason pairing
internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
internal/domain/connector_test.go — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies
Files:
api/openapi.yaml — DELETE + 410 + 409 body shape
cmd/agent/main.go — ErrAgentRetired, markRetired, retiredSignal
cmd/cli/main.go — handleAgents list/get/retire dispatch
docs/architecture.md, docs/concepts.md, docs/testing-guide.md — retirement contract narrative
internal/api/handler/agents.go — RetireAgent, status surface, 410 on heartbeat
internal/api/handler/agent_handler_test.go — extended coverage
internal/api/handler/agent_retire_handler_test.go — new
internal/api/router/router.go — /agents/retired before /agents/{id}
internal/cli/agent_retire_test.go — new
internal/cli/client.go — ListRetiredAgents + RetireAgent
internal/domain/connector.go — IsRetired, SentinelAgentIDs, IsSentinelAgent, AgentDependencyCounts, ActorTypeAgent/System
internal/domain/connector_test.go — new
internal/integration/lifecycle_test.go — retirement fixture
internal/mcp/client.go — DeleteWithQuery additive transport
internal/mcp/retire_agent_test.go — new
internal/mcp/tools.go, internal/mcp/types.go — retire_agent tool + Force/Reason inputs
internal/repository/interfaces.go — AgentRepository retirement methods
internal/repository/postgres/agent.go — retire + cascade target retire + counts
internal/repository/postgres/migration_000015_test.go — new
internal/service/agent.go — wire into AgentService surface
internal/service/agent_retire.go — new 8-step contract
internal/service/agent_retire_test.go — new
internal/service/deployment.go — skip retired agents
internal/service/target.go — skip retired agents
internal/service/testutil_test.go — shared mocks extended
migrations/000015_agent_retire.up.sql — new
migrations/000015_agent_retire.down.sql — new
web/src/api/client.ts, types.ts + tests — retire endpoint wiring
web/src/pages/AgentsPage.tsx — retire UI
+226
-1
@@ -880,6 +880,40 @@ paths:
         "500":
           $ref: "#/components/responses/InternalError"
 
+  /api/v1/agents/retired:
+    get:
+      tags: [Agents]
+      summary: List retired agents
+      description: |
+        I-004: opt-in listing of soft-retired agents. The default
+        `GET /api/v1/agents` endpoint filters retired rows out; this is the
+        dedicated surface for reading them back (e.g., the operator UI's
+        "Retired" tab, audit and forensics workflows). Pagination defaults
+        match the default agent listing (page=1, per_page=50, max 500). Go
+        1.22's enhanced ServeMux routes `/agents/retired` to this handler
+        via the literal-beats-pattern-var precedence rule, so the sibling
+        `/agents/{id}` route does not shadow it.
+      operationId: listRetiredAgents
+      parameters:
+        - $ref: "#/components/parameters/page"
+        - $ref: "#/components/parameters/per_page"
+      responses:
+        "200":
+          description: Paginated list of retired agents
+          content:
+            application/json:
+              schema:
+                allOf:
+                  - $ref: "#/components/schemas/PaginationEnvelope"
+                  - type: object
+                    properties:
+                      data:
+                        type: array
+                        items:
+                          $ref: "#/components/schemas/Agent"
+        "500":
+          $ref: "#/components/responses/InternalError"
+
   /api/v1/agents/{id}:
     get:
       tags: [Agents]
@@ -900,12 +934,116 @@ paths:
           $ref: "#/components/responses/NotFound"
         "500":
           $ref: "#/components/responses/InternalError"
+    delete:
+      tags: [Agents]
+      summary: Soft-retire agent
+      description: |
+        I-004: soft-retirement. The agent row is preserved (so its audit
+        trail and historical job links remain intact) and `retired_at` is
+        stamped. A retired agent receives `410 Gone` on subsequent
+        heartbeats so it can shut down cleanly.
+
+        Behavior matrix:
+
+        | Scenario | Query | Status | Body |
+        | --- | --- | --- | --- |
+        | Clean retire (no active dependencies) | none | `200` | `RetireAgentResponse` with `cascade=false`, zero counts |
+        | Blocked by active targets/certs/jobs | none | `409` | `BlockedByDependenciesResponse` with per-bucket counts |
+        | Force-cascade retire | `force=true&reason=...` | `200` | `RetireAgentResponse` with `cascade=true`, pre-cascade counts |
+        | Idempotent re-retire | either | `204` | (empty — downstream consumers break on stray bodies) |
+        | `force=true` without reason | `force=true` | `400` | ErrorResponse (ErrForceReasonRequired) |
+        | Reserved sentinel agent | any | `403` | ErrorResponse (ErrAgentIsSentinel) |
+        | Unknown agent id | any | `404` | ErrorResponse |
+
+        Sentinel agents are the four reserved identities backing non-agent
+        discovery subsystems (`server-scanner`, `cloud-aws-sm`,
+        `cloud-azure-kv`, `cloud-gcp-sm`). Retiring them would orphan the
+        scanner or a cloud secret-manager source, so the handler refuses
+        unconditionally — even with `force=true`.
+      operationId: retireAgent
+      parameters:
+        - $ref: "#/components/parameters/resourceId"
+        - name: force
+          in: query
+          required: false
+          schema:
+            type: boolean
+            default: false
+          description: |
+            Cascade-retire active downstream targets, certificates, and
+            jobs. When `true`, a non-empty `reason` is required. A
+            malformed value (anything strconv.ParseBool rejects) is
+            silently treated as `false` so a typoed query can never
+            accidentally enable the cascade.
+        - name: reason
+          in: query
+          required: false
+          schema:
+            type: string
+          description: |
+            Human-readable reason recorded on the retired row and in the
+            immutable audit trail. Required (non-empty after trimming)
+            when `force=true`.
+      responses:
+        "200":
+          description: |
+            Agent retired (clean retire or successful force-cascade). Body
+            is `RetireAgentResponse`.
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/RetireAgentResponse"
+        "204":
+          description: |
+            Idempotent retire — the agent was already retired. Response
+            body is empty (the 200-path shape does not apply, and
+            downstream clients that tee responses into dashboards would
+            break on spurious bodies).
+        "400":
+          description: |
+            `force=true` was sent without a non-empty `reason`
+            (ErrForceReasonRequired).
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/ErrorResponse"
+        "403":
+          description: |
+            Agent is a reserved sentinel and cannot be retired even with
+            `?force=true` (ErrAgentIsSentinel).
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/ErrorResponse"
+        "404":
+          $ref: "#/components/responses/NotFound"
+        "409":
+          description: |
+            Blocked by active downstream dependencies. Body carries
+            per-bucket counts so the operator UI can show the user which
+            dependency is holding up the retire. Re-run with
+            `?force=true&reason=...` to cascade.
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/BlockedByDependenciesResponse"
+        "405":
+          description: Method not allowed (only DELETE, GET are routed to this path)
+        "500":
+          $ref: "#/components/responses/InternalError"
 
   /api/v1/agents/{id}/heartbeat:
     post:
       tags: [Agents]
       summary: Agent heartbeat
-      description: Reports agent liveness and metadata (OS, architecture, IP, version).
+      description: |
+        Reports agent liveness and metadata (OS, architecture, IP, version).
+
+        I-004: a retired agent still polling the heartbeat endpoint receives
+        `410 Gone` so `cmd/agent` detects the terminal signal and shuts down
+        cleanly instead of looping forever against a decommissioned identity.
+        The retired-agent check runs before any "not found" string match so
+        it can never be masked by a sibling error branch.
       operationId: agentHeartbeat
       parameters:
         - $ref: "#/components/parameters/resourceId"
@@ -936,6 +1074,14 @@ paths:
           $ref: "#/components/responses/BadRequest"
         "404":
           $ref: "#/components/responses/NotFound"
+        "410":
+          description: |
+            I-004: the agent has been soft-retired. The agent process should
+            treat this as a terminal signal and shut down cleanly.
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/ErrorResponse"
         "500":
           $ref: "#/components/responses/InternalError"
 
@@ -3373,6 +3519,85 @@ components:
           type: string
         version:
           type: string
+        retired_at:
+          type: string
+          format: date-time
+          nullable: true
+          description: |
+            I-004: soft-retirement timestamp. `null` (or field absent) means the
+            agent is active. A non-null value is the canonical "retired" state —
+            the operational `status` column is preserved at retirement time as
+            the last-seen value, but `retired_at` is the source of truth for
+            filtering agents out of active listings.
+        retired_reason:
+          type: string
+          nullable: true
+          description: |
+            I-004: human-readable reason captured at retirement time. Only set
+            when the agent was retired via `?force=true&reason=...` cascade; a
+            default soft-retire leaves this field null.
+
+    AgentDependencyCounts:
+      type: object
+      description: |
+        I-004: preflight counts of active downstream rows that would be
+        orphaned by retiring an agent. Returned in the 409
+        `blocked_by_dependencies` body so the operator UI can tell the user
+        which bucket is blocking the retire, and also in the 200 response
+        body on a successful `?force=true` cascade as a snapshot of what
+        was cascaded.
+      properties:
+        active_targets:
+          type: integer
+          description: Deployment targets with this agent assigned and retired_at IS NULL
+        active_certificates:
+          type: integer
+          description: Certificates currently deployed via one of this agent's active targets
+        pending_jobs:
+          type: integer
+          description: Jobs with agent_id=this in status Pending, AwaitingCSR, AwaitingApproval, or Running
+
+    RetireAgentResponse:
+      type: object
+      description: |
+        I-004: response body for a successful retire on DELETE /api/v1/agents/{id}.
+        Returned on both clean retires (cascade=false, zero counts) and
+        force-cascade retires (cascade=true, counts snapshot of the
+        pre-cascade dependency state). The 204 idempotent-retire path does
+        NOT emit this body — re-retiring an already-retired agent returns
+        an empty response.
+      properties:
+        retired_at:
+          type: string
+          format: date-time
+        already_retired:
+          type: boolean
+          description: |
+            Always false on the 200 response — the already-retired path
+            returns 204 No Content with no body. Surfaced in the schema
+            only so downstream consumers have a complete field map.
+        cascade:
+          type: boolean
+          description: True when the retire was invoked with ?force=true
+        counts:
+          $ref: "#/components/schemas/AgentDependencyCounts"
+
+    BlockedByDependenciesResponse:
+      type: object
+      description: |
+        I-004: 409 response body for a retire request blocked by active
+        downstream dependencies. Returned when `force=true` is not set and
+        any of the three counts is non-zero. The operator UI renders these
+        counts so the human can retire or reassign the blocking rows
+        before re-running the retire, or tick the force checkbox to cascade.
+      properties:
+        error:
+          type: string
+          example: blocked_by_dependencies
+        message:
+          type: string
+        counts:
+          $ref: "#/components/schemas/AgentDependencyCounts"
 
     WorkItem:
       type: object