certctl/internal/service/agent_retire.go
shankar0123 0725713e19 Close I-004 (agent hard-delete cascades targets) coverage-gap finding
Operator decision answered as full soft-delete with optional forced
cascade — hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.

Schema — migration 000015:
  migrations/000015_agent_retire.up.sql flips
  deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
  RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
  boundary instead of quietly destroying targets. Both `agents` and
  `deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
  TEXT pair (TEXT not VARCHAR so operator comments are never
  truncated), indexed via partial indexes WHERE retired_at IS NOT
  NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
  CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
  EXISTS) so repeated runs against partially-migrated databases
  converge. migrations/000015_agent_retire.down.sql restores CASCADE
  and drops the new columns for clean rollback. A dedicated
  repository-layer testcontainers test
  (internal/repository/postgres/migration_000015_test.go) asserts the
  before/after FK action, column presence, index presence, and
  round-trip idempotency under up→down→up.

Domain — sentinel guard + dependency counts:
  internal/domain/connector.go gains IsRetired() on Agent, the
  exported SentinelAgentIDs slice listing server-scanner,
  cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
  four reserved IDs documented in CLAUDE.md and created at startup in
  cmd/server/main.go), IsSentinelAgent(id string) predicate,
  AgentDependencyCounts{ActiveTargets, ActiveCertificates,
  PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
  ActorTypeSystem enum values used by audit emission downstream.
  Coverage locked down by internal/domain/connector_test.go.
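
  The domain helpers above can be sketched as follows. This is a
  minimal illustration, not the code in internal/domain/connector.go:
  the int field types and the loop-based lookup are assumptions; only
  the identifier names and the four sentinel IDs come from this commit.

```go
package main

import "fmt"

// SentinelAgentIDs lists the four reserved IDs named in the commit message.
var SentinelAgentIDs = []string{
	"server-scanner",
	"cloud-aws-sm",
	"cloud-azure-kv",
	"cloud-gcp-sm",
}

// IsSentinelAgent reports whether id is one of the reserved sentinel IDs.
func IsSentinelAgent(id string) bool {
	for _, s := range SentinelAgentIDs {
		if id == s {
			return true
		}
	}
	return false
}

// AgentDependencyCounts mirrors the three preflight buckets.
// Assumption: plain int fields; the real struct may use another width.
type AgentDependencyCounts struct {
	ActiveTargets      int
	ActiveCertificates int
	PendingJobs        int
}

// HasDependencies is true when any bucket is non-zero.
func (c AgentDependencyCounts) HasDependencies() bool {
	return c.ActiveTargets != 0 || c.ActiveCertificates != 0 || c.PendingJobs != 0
}

func main() {
	fmt.Println(IsSentinelAgent("cloud-aws-sm"))
	fmt.Println(IsSentinelAgent("agent-web-01")) // hypothetical fleet agent ID
	fmt.Println(AgentDependencyCounts{PendingJobs: 2}.HasDependencies())
}
```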

Service — 8-step ordered contract:
  internal/service/agent_retire.go:RetireAgent(ctx, id, actor, force,
  reason) enforces a fixed execution order:
  (1) sentinel guard: IsSentinelAgent(id) returns ErrAgentIsSentinel
      unconditionally; force=true does NOT bypass it.
  (2) fetch: ErrAgentNotFound on miss.
  (3) idempotency: if IsRetired() already, return
      AgentRetirementResult{AlreadyRetired: true} with no new audit
      event and no state change (safe to replay from flaky clients).
  (4) preflight counts: collectAgentDependencyCounts runs
      ActiveTargets, ActiveCertificates, PendingJobs sequentially
      (not in parallel; keeps the per-query timeout predictable and
      matches the repo's existing call-chain shape).
  (5) force-reason guard: force=true with an empty reason returns
      ErrForceReasonRequired (wired into the 400 status surface).
  (6) dependency guard: HasDependencies() with force=false
      returns BlockedByDependenciesError{Counts} (wired into the 409
      body with per-bucket counts).
  (7) mutation: single pinned retiredAt := time.Now(); agent
      retirement first, then cascade target retirement if force=true
      and dependencies exist, all under the repo's single transaction
      so the two retired_at stamps match exactly.
  (8) best-effort audit: agent_retired always;
      agent_retirement_cascaded additionally on the cascade path.
      Actor is whatever the handler resolves from the request; actor
      type is mapped by resolveActorType ("system"→System, "agent-"
      prefix→Agent, else→User). Audit emission failures are logged
      via slog.Error but do not abort the retirement (matches the
      house convention used by every other scheduler-emitted event).

  BlockedByDependenciesError implements Error() as the sentinel message
  followed by "(active_targets=%d, active_certificates=%d,
  pending_jobs=%d)", and Unwrap() → ErrBlockedByDependencies. The single
  struct satisfies
  errors.Is via Unwrap (used by scheduler-level tests) and errors.As
  via the concrete type (used by the handler to fish out Counts for
  the 409 body). ListRetiredAgents(page, perPage) adds a separate
  paginated accessor with page<1→1 and perPage<1→50 normalization so
  retired rows are queryable without polluting the default agent
  listing.

  Sentinel guard coverage is unconditional by design: all four reserved
  IDs are protected, and force=true cannot override. Regression tests
  in internal/service/agent_retire_test.go assert each of the eight
  steps in order, plus sentinel bypass attempts and idempotency
  replay.

Handler + router — status-code surface:
  internal/api/handler/agents.go:RetireAgent exposes seven status
  codes on DELETE /agents/{id}:
    200 on a fresh retirement (body echoes AgentRetirementResult).
    204 on idempotent replay (AlreadyRetired=true; no new audit).
    400 on ErrForceReasonRequired.
    403 on ErrAgentIsSentinel.
    404 on ErrAgentNotFound.
    409 on BlockedByDependenciesError, with a custom body shape
        {error, counts{active_targets, active_certificates,
        pending_jobs}} that bypasses the default ErrorWithRequestID
        envelope so callers get the per-bucket numbers directly.
    500 on any other error.
  Heartbeat HandleHeartbeat returns 410 Gone when the agent is
  retired (ErrAgentRetired), signalling the agent to shut down.
  Query params `force=true` and `reason=<text>` drive the cascade
  path; both are forwarded as url.Values through the new MCP
  transport.

  internal/api/router/router.go registers GET /api/v1/agents/retired
  literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's
  literal-beats-pattern-var precedence routes "retired" to the
  paginated retired-agents listing instead of fetching a hypothetical
  agent named "retired".

Agent binary — clean shutdown on 410:
  cmd/agent/main.go gains the ErrAgentRetired sentinel, a
  retiredOnce sync.Once, and a retiredSignal chan struct{}. A
  markRetired(source, statusCode, body) helper closes the channel
  exactly once; the Run() select loop observes the close and returns
  ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
  and exits cleanly instead of spinning in the heartbeat retry loop.
  The 410 Gone surface is therefore terminal for the agent process.

MCP transport:
  internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
  a new additive transport method. Client.Delete is path-only; without
  this method the retire tool would silently drop `force` and `reason`,
  turning every cascade retire into a default soft-retire. The new
  method shares do()'s 204 normalization and 4xx/5xx error
  propagation so tool authors get one contract.
  internal/mcp/tools.go + internal/mcp/types.go expose the
  retire_agent tool with Force+Reason inputs wired through
  DeleteWithQuery.

CLI:
  cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
  `agents list --retired` (client-side strip of --retired then
  delegation to ListRetiredAgents, sharing --page/--per-page parsing
  with the default listing) and `agents retire <id> [--force --reason
  "…"]` (mirrors ErrForceReasonRequired — force without reason is
  rejected client-side before the request is sent). JSON + table
  output modes both honor the new columns.

Frontend:
  web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
  web/src/api/client.ts + web/src/api/types.ts expose the retire
  endpoint and the retired-listing. 4 new Vitest regression cases.

OpenAPI:
  api/openapi.yaml documents DELETE /agents/{id} with all seven
  status codes, 410 on heartbeat, and the 409 per-bucket body shape.

Regression coverage (six new test files, all green):
  internal/service/agent_retire_test.go           — 8-step contract + sentinel guards
  internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
  internal/mcp/retire_agent_test.go               — DeleteWithQuery wire-through
  internal/cli/agent_retire_test.go               — --retired listing + --force/--reason pairing
  internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
  internal/domain/connector_test.go               — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies

Files:
  api/openapi.yaml                                — DELETE + 410 + 409 body shape
  cmd/agent/main.go                               — ErrAgentRetired, markRetired, retiredSignal
  cmd/cli/main.go                                 — handleAgents list/get/retire dispatch
  docs/architecture.md, docs/concepts.md,
    docs/testing-guide.md                         — retirement contract narrative
  internal/api/handler/agents.go                  — RetireAgent, status surface, 410 on heartbeat
  internal/api/handler/agent_handler_test.go      — extended coverage
  internal/api/handler/agent_retire_handler_test.go — new
  internal/api/router/router.go                   — /agents/retired before /agents/{id}
  internal/cli/agent_retire_test.go               — new
  internal/cli/client.go                          — ListRetiredAgents + RetireAgent
  internal/domain/connector.go                    — IsRetired, SentinelAgentIDs,
                                                    IsSentinelAgent, AgentDependencyCounts,
                                                    ActorTypeAgent/System
  internal/domain/connector_test.go               — new
  internal/integration/lifecycle_test.go          — retirement fixture
  internal/mcp/client.go                          — DeleteWithQuery additive transport
  internal/mcp/retire_agent_test.go               — new
  internal/mcp/tools.go, internal/mcp/types.go    — retire_agent tool + Force/Reason inputs
  internal/repository/interfaces.go               — AgentRepository retirement methods
  internal/repository/postgres/agent.go           — retire + cascade target retire + counts
  internal/repository/postgres/migration_000015_test.go — new
  internal/service/agent.go                       — wire into AgentService surface
  internal/service/agent_retire.go                — new 8-step contract
  internal/service/agent_retire_test.go           — new
  internal/service/deployment.go                  — skip retired agents
  internal/service/target.go                      — skip retired agents
  internal/service/testutil_test.go               — shared mocks extended
  migrations/000015_agent_retire.up.sql           — new
  migrations/000015_agent_retire.down.sql         — new
  web/src/api/client.ts, types.ts + tests         — retire endpoint wiring
  web/src/pages/AgentsPage.tsx                    — retire UI
2026-04-19 05:24:00 +00:00

package service

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	"time"

	"github.com/shankar0123/certctl/internal/domain"
)
// I-004 coverage-gap closure: the agent retirement surface.
//
// Before 000015, DELETE /api/v1/agents/{id} hard-deleted the agents row and
// the deployment_targets.agent_id FK CASCADE cleaned up downstream rows with
// no preflight, no archival, and no knowledge of in-flight jobs. Any cert
// still rotating through one of those targets would observe half-migrated
// state. I-004 closes that gap with a preflight + soft-retire + optional
// forced-cascade contract; the symbols in this file are the service-layer
// surface that the handler and operator UI bind against.

// ErrAgentIsSentinel is returned when an operator tries to retire one of the
// four reserved sentinel agent IDs (server-scanner, cloud-aws-sm,
// cloud-azure-kv, cloud-gcp-sm). These rows back the network scanner and the
// three cloud secret-manager discovery sources; retiring any of them orphans
// its subsystem. The guard fires unconditionally — force=true does not bypass
// it, because a sentinel is a structural invariant of the deployment, not
// a piece of fleet state the operator owns. Handler maps this to HTTP 403.
var ErrAgentIsSentinel = errors.New("agent is a reserved sentinel and cannot be retired")
// ErrBlockedByDependencies is returned by RetireAgent when at least one of
// (active targets, active certificates, pending jobs) referencing the agent
// is non-zero and force=false. The caller always receives it wrapped in
// a *BlockedByDependenciesError (see below), so handlers doing errors.As
// can surface the per-bucket counts in the 409 body for operator
// troubleshooting. Tests use errors.Is; handlers use errors.As.
var ErrBlockedByDependencies = errors.New("agent has active downstream dependencies")
// ErrForceReasonRequired is returned when force=true is supplied without a
// non-empty reason. The force escape hatch is deliberately chatty: operators
// pulling the emergency cord must leave an auditable breadcrumb explaining
// why a cascade was justified. Handler maps this to HTTP 400 so the operator
// retries with --reason rather than silently skipping the guard. Checked
// before any DB mutation to keep the no-reason path transactionally clean.
var ErrForceReasonRequired = errors.New("force=true requires a non-empty reason")
// ErrAgentRetired is returned by Heartbeat (and any future agent-authenticated
// call site) when a retired agent is still polling. The handler layer maps
// this to HTTP 410 Gone so the cmd/agent sendHeartbeat loop can detect it
// deterministically and shut down the agent process, rather than looping
// forever on a soft-retired identity. IsRetired() on the domain model is
// the single source of truth; the sentinel exists so service and handler
// callers can errors.Is against one symbol.
var ErrAgentRetired = errors.New("agent has been retired")
// BlockedByDependenciesError wraps ErrBlockedByDependencies and carries the
// per-bucket dependency snapshot the preflight pass captured. The embedded
// AgentDependencyCounts is the same struct the repo returns from the three
// CountActive* calls, so the handler can marshal it directly into the 409
// body without reshaping fields. Unwrap() satisfies errors.Is against the
// sentinel; Error() includes the counts so logs are diagnostic on their own.
type BlockedByDependenciesError struct {
	Counts domain.AgentDependencyCounts
}
// Error formats the wrapped error with the per-bucket counts. Kept short so
// it reads cleanly in slog output.
func (e *BlockedByDependenciesError) Error() string {
	return fmt.Sprintf(
		"%s (active_targets=%d, active_certificates=%d, pending_jobs=%d)",
		ErrBlockedByDependencies.Error(),
		e.Counts.ActiveTargets,
		e.Counts.ActiveCertificates,
		e.Counts.PendingJobs,
	)
}
// Unwrap lets errors.Is(err, ErrBlockedByDependencies) match the wrapped
// struct — the test contract (agent_retire_test.go:167) depends on it.
func (e *BlockedByDependenciesError) Unwrap() error { return ErrBlockedByDependencies }
// AgentRetirementResult is the outcome surface the handler returns to the
// operator. It discriminates the three happy paths the endpoint can take:
// idempotent no-op (AlreadyRetired), clean soft-retire (Cascade=false), and
// forced cascade (Cascade=true). It always carries the retired_at timestamp;
// on a fresh retirement it also carries the dependency-count snapshot so the
// 200 response body can echo what was affected. The idempotent path leaves
// Counts at its zero value.
//
//   AlreadyRetired=true → agent was already retired; no new audit event
//                         was emitted; RetiredAt is the original stamp,
//                         not the current time.
//   Cascade=false       → clean soft-retire; Counts is all zeros.
//   Cascade=true        → force=true retired agent + downstream targets;
//                         Counts is the PRE-cascade snapshot (so the
//                         operator sees what they just retired).
type AgentRetirementResult struct {
	AlreadyRetired bool
	Cascade        bool
	RetiredAt      time.Time
	Counts         domain.AgentDependencyCounts
}
// RetireAgent implements the I-004 retirement contract. Ordering matters:
// every guard fires before any mutation, so a rejected retire leaves zero
// trace (no audit event, no partial DB write). The steps are:
//
//  1. Sentinel check (unconditional; force does not bypass).
//  2. Fetch agent (the repo's not-found error is wrapped and returned).
//  3. Already-retired idempotency: return AlreadyRetired=true with NO new
//     audit event; the original retire already recorded one.
//  4. Preflight count pass via the three CountActive* repo methods.
//  5. Force-reason guard: force=true with empty reason is rejected here,
//     after the counts are known but before any mutation.
//  6. Default no-force path: any non-zero count returns
//     *BlockedByDependenciesError with counts attached.
//  7. Mutation: SoftRetire (no cascade) or RetireAgentWithCascade, with
//     a single retiredAt timestamp pinned BEFORE the repo call so the
//     audit event and the DB row agree to the nanosecond.
//  8. Audit: agent_retired always; agent_retirement_cascaded additionally
//     on the force=true cascade path.
//
// Actor comes from the handler's resolveActor (API key → user, agent key →
// agent-<id>, unauthenticated → "anonymous"); the service does not
// second-guess it. Audit emission is best-effort: a failed RecordEvent logs
// an error (slog.Error) but does not fail the overall retirement, consistent
// with how the rest of the codebase treats audit as an observability concern
// rather than a correctness barrier.
func (s *AgentService) RetireAgent(ctx context.Context, id string, actor string, force bool, reason string) (*AgentRetirementResult, error) {
	// Step 1: reserved-sentinel guard. Applies even under force=true.
	if domain.IsSentinelAgent(id) {
		return nil, ErrAgentIsSentinel
	}

	// Step 2: existence check. A missing agent surfaces the repo's not-found
	// error (wrapped with %w) so the handler can map it to 404 via its
	// existing detection path (the handler layer already has "not found"
	// mapping logic inherited from the pre-I-004 Delete endpoint).
	agent, err := s.agentRepo.Get(ctx, id)
	if err != nil {
		return nil, fmt.Errorf("failed to fetch agent: %w", err)
	}

	// Step 3: idempotency. A retired agent returns AlreadyRetired=true
	// WITHOUT emitting a fresh audit event. Handler maps this to HTTP 204.
	// Guarding here (before preflight) means a re-retire of an agent that
	// now has zero deps doesn't spuriously "succeed again" and double-log.
	if agent.IsRetired() {
		return &AgentRetirementResult{
			AlreadyRetired: true,
			RetiredAt:      *agent.RetiredAt,
		}, nil
	}

	// Step 4: preflight counts. All three run even when force=true: we
	// need them to populate AgentRetirementResult.Counts (the pre-cascade
	// snapshot). A repo failure here aborts the whole operation, because
	// partial preflight is worse than no preflight.
	counts, err := s.collectAgentDependencyCounts(ctx, id)
	if err != nil {
		return nil, fmt.Errorf("failed to collect agent dependency counts: %w", err)
	}

	// Step 5: force-reason guard. Positioned AFTER preflight so operators
	// who forgot --reason still see accurate counts when they retry. The
	// empty-reason rejection fires before any mutation, so the rejected
	// attempt leaves no audit noise.
	if force && reason == "" {
		return nil, ErrForceReasonRequired
	}

	// Step 6: default path, block on any non-zero bucket. Wrapping the
	// sentinel in *BlockedByDependenciesError lets the handler use errors.As
	// to surface counts in the 409 body while tests use errors.Is against
	// the sentinel. Both callers are satisfied by the single Unwrap chain.
	if !force && counts.HasDependencies() {
		return nil, &BlockedByDependenciesError{Counts: counts}
	}

	// Step 7: mutation. Pin retiredAt once so the audit event, the agent
	// row, and (on cascade) every deployment_targets row share the same
	// timestamp. Callers querying "what happened at T?" can correlate
	// retirement rows across tables without clock-skew tie-breaking.
	retiredAt := time.Now()
	cascade := force && counts.HasDependencies()
	if cascade {
		if err := s.agentRepo.RetireAgentWithCascade(ctx, id, retiredAt, reason); err != nil {
			return nil, fmt.Errorf("failed to retire agent with cascade: %w", err)
		}
	} else {
		if err := s.agentRepo.SoftRetire(ctx, id, retiredAt, reason); err != nil {
			return nil, fmt.Errorf("failed to soft-retire agent: %w", err)
		}
	}

	// Step 8: audit. Two events on the cascade path so forensics can
	// distinguish "agent was retired" (agent_retired) from "downstream
	// targets were flipped" (agent_retirement_cascaded). Details on the
	// cascaded event carry the pre-cascade counts so a reviewer looking
	// only at the audit log knows how much state was affected. Emission
	// is best-effort: audit is observability, not a correctness barrier.
	actorType := s.resolveActorType(actor)
	details := map[string]interface{}{
		"actor":               actor,
		"reason":              reason,
		"force":               force,
		"active_targets":      counts.ActiveTargets,
		"active_certificates": counts.ActiveCertificates,
		"pending_jobs":        counts.PendingJobs,
	}
	if err := s.auditService.RecordEvent(ctx, actor, actorType,
		"agent_retired", "agent", id, details); err != nil {
		slog.Error("failed to record agent_retired audit event", "agent_id", id, "error", err)
	}
	if cascade {
		cascadeDetails := map[string]interface{}{
			"actor":               actor,
			"reason":              reason,
			"active_targets":      counts.ActiveTargets,
			"active_certificates": counts.ActiveCertificates,
			"pending_jobs":        counts.PendingJobs,
		}
		if err := s.auditService.RecordEvent(ctx, actor, actorType,
			"agent_retirement_cascaded", "agent", id, cascadeDetails); err != nil {
			slog.Error("failed to record agent_retirement_cascaded audit event", "agent_id", id, "error", err)
		}
	}

	return &AgentRetirementResult{
		AlreadyRetired: false,
		Cascade:        cascade,
		RetiredAt:      retiredAt,
		Counts:         counts,
	}, nil
}
// ListRetiredAgents returns the paginated list of retired agents in
// retired_at DESC order. This is the companion to ListAgents — which
// hides retired rows — so the operator UI can render a dedicated
// "Retired" tab without leaking retired rows into every other listing.
// Pagination defaults (page<1→1, perPage<1→50) are applied here as
// well as in the repo, so callers can pass 0s when they want defaults.
//
// Return shape harmonizes with handler.AgentService: a value slice
// (not pointer slice) and int64 total. The repo returns []*domain.Agent;
// this method dereferences into a value slice so the handler's
// PagedResponse marshals straight objects and so the compile-time
// interface assertion in agent_retire_handler_test.go:387 is satisfied.
// Nil repo entries are skipped defensively — the repo should never
// return them, but the handler contract is more important than the
// repo's (pointer-slice) convenience.
func (s *AgentService) ListRetiredAgents(ctx context.Context, page, perPage int) ([]domain.Agent, int64, error) {
	if page < 1 {
		page = 1
	}
	if perPage < 1 {
		perPage = 50
	}
	agents, total, err := s.agentRepo.ListRetired(ctx, page, perPage)
	if err != nil {
		return nil, 0, fmt.Errorf("failed to list retired agents: %w", err)
	}
	out := make([]domain.Agent, 0, len(agents))
	for _, a := range agents {
		if a == nil {
			continue
		}
		out = append(out, *a)
	}
	return out, int64(total), nil
}
// collectAgentDependencyCounts runs the three preflight COUNT queries in
// sequence and bundles the result. Sequential (not parallel) because the
// queries are cheap (<1ms each on the indexed columns added in 000015) and
// sequential keeps error handling simple. Any repo error short-circuits
// — we prefer to refuse the retire than make a half-informed decision.
func (s *AgentService) collectAgentDependencyCounts(ctx context.Context, id string) (domain.AgentDependencyCounts, error) {
	var counts domain.AgentDependencyCounts
	targets, err := s.agentRepo.CountActiveTargets(ctx, id)
	if err != nil {
		return counts, fmt.Errorf("count active targets: %w", err)
	}
	counts.ActiveTargets = targets
	certs, err := s.agentRepo.CountActiveCertificates(ctx, id)
	if err != nil {
		return counts, fmt.Errorf("count active certificates: %w", err)
	}
	counts.ActiveCertificates = certs
	jobs, err := s.agentRepo.CountPendingJobs(ctx, id)
	if err != nil {
		return counts, fmt.Errorf("count pending jobs: %w", err)
	}
	counts.PendingJobs = jobs
	return counts, nil
}
// resolveActorType maps an opaque actor string into the typed ActorType
// used by the audit schema. Matches the conventions the rest of the
// service layer uses: "system" → System, anything that looks like an
// agent identity → Agent, everything else → User.
func (s *AgentService) resolveActorType(actor string) domain.ActorType {
	switch {
	case actor == "system":
		return domain.ActorTypeSystem
	case len(actor) > 6 && actor[:6] == "agent-":
		return domain.ActorTypeAgent
	default:
		return domain.ActorTypeUser
	}
}