Close I-004 (agent hard-delete cascades targets) coverage-gap finding

Operator decision answered as full soft-delete with optional forced
cascade — hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.

Schema — migration 000015:
  migrations/000015_agent_retire.up.sql flips
  deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
  RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
  boundary instead of quietly destroying targets. Both `agents` and
  `deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
  TEXT pair (TEXT not VARCHAR so operator comments are never
  truncated), indexed via partial indexes WHERE retired_at IS NOT
  NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
  CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
  EXISTS) so repeated runs against partially-migrated databases
  converge. migrations/000015_agent_retire.down.sql restores CASCADE
  and drops the new columns for clean rollback. A dedicated
  repository-layer testcontainers test
  (internal/repository/postgres/migration_000015_test.go) asserts the
  before/after FK action, column presence, index presence, and
  round-trip idempotency under up→down→up.

Domain — sentinel guard + dependency counts:
  internal/domain/connector.go gains IsRetired() on Agent, the
  exported SentinelAgentIDs slice listing server-scanner,
  cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
  four reserved IDs documented in CLAUDE.md and created at startup in
  cmd/server/main.go), IsSentinelAgent(id string) predicate,
  AgentDependencyCounts{ActiveTargets, ActiveCertificates,
  PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
  ActorTypeSystem enum values used by audit emission downstream.
  Coverage locked down by internal/domain/connector_test.go.

Service — 8-step ordered contract:
  internal/service/agent_retire.go:RetireAgent(ctx, id, actor,
  opts{Force, Reason}) enforces a fixed execution order:
  (1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel
      unconditionally; force=true does NOT bypass it.
  (2) fetch — ErrAgentNotFound on miss.
  (3) idempotency — if IsRetired() already, return
      AgentRetirementResult{AlreadyRetired: true} with no new audit
      event and no state change (safe to replay from flaky clients).
  (4) preflight counts — collectAgentDependencyCounts runs
      ActiveTargets, ActiveCertificates, PendingJobs sequentially
      (not in parallel; keeps the per-query timeout predictable and
      matches the repo's existing call-chain shape).
  (5) force-reason guard — opts.Force=true with empty Reason returns
      ErrForceReasonRequired (wired into the 400 status surface).
  (6) dependency guard — HasDependencies() with opts.Force=false
      returns BlockedByDependenciesError{Counts} (wired into the 409
      body with per-bucket counts).
  (7) mutation — single pinned retiredAt := time.Now(); agent
      retirement first, then cascade target retirement if opts.Force,
      all under the repo's single transaction so the two retired_at
      stamps match to the second.
  (8) best-effort audit — agent_retired always; agent_retirement_
      cascaded additionally on the force path. Actor is whatever the
      handler resolves from the request; actor type is mapped by
      resolveActorType (system/agent-prefix→Agent/else→User). Audit
      emission failures are logged via slog.Error but do not abort
      the retirement (matches the house convention used by every
      other scheduler-emitted event).

  BlockedByDependenciesError implements Error() as
  "active_targets=%d, active_certificates=%d, pending_jobs=%d" and
  Unwrap() → ErrBlockedByDependencies. The single struct satisfies
  errors.Is via Unwrap (used by scheduler-level tests) and errors.As
  via the concrete type (used by the handler to fish out Counts for
  the 409 body). ListRetiredAgents(page, perPage) adds a separate
  paginated accessor with page<1→1 and perPage<1→50 normalization so
  retired rows are queryable without polluting the default agent
  listing.

  Sentinel guard coverage is asymmetric by design: all four reserved
  IDs are protected, and force=true cannot override. Regression tests
  in internal/service/agent_retire_test.go assert each of the eight
  steps in order, plus sentinel bypass attempts and idempotency
  replay.

Handler + router — status-code surface:
  internal/api/handler/agents.go:RetireAgent exposes seven status
  codes on DELETE /agents/{id}:
    200 on a fresh retirement (body echoes AgentRetirementResult).
    204 on idempotent replay (AlreadyRetired=true; no new audit).
    400 on ErrForceReasonRequired.
    403 on ErrAgentIsSentinel.
    404 on ErrAgentNotFound.
    409 on BlockedByDependenciesError, with a custom body shape
        {error, counts{active_targets, active_certificates,
        pending_jobs}} that bypasses the default ErrorWithRequestID
        envelope so callers get the per-bucket numbers directly.
    500 on any other error.
  Heartbeat HandleHeartbeat returns 410 Gone when the agent is
  retired (ErrAgentRetired), signalling the agent to shut down.
  Query params `force=true` and `reason=<text>` drive the cascade
  path; both are forwarded as url.Values through the new MCP
  transport.

  internal/api/router/router.go registers GET /api/v1/agents/retired
  literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's
  literal-beats-pattern-var precedence routes "retired" to the
  paginated retired-agents listing instead of fetching a hypothetical
  agent named "retired".

Agent binary — clean shutdown on 410:
  cmd/agent/main.go gains the ErrAgentRetired sentinel, a
  retiredOnce sync.Once, and a retiredSignal chan struct{}. A
  markRetired(source, statusCode, body) helper closes the channel
  exactly once; the Run() select loop observes the close and returns
  ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
  and exits cleanly instead of spinning in the heartbeat retry loop.
  The 410 Gone surface is therefore terminal for the agent process.

MCP transport:
  internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
  a new additive transport method. Client.Delete is path-only; without
  this method the retire tool would silently drop `force` and `reason`,
  turning every cascade retire into a default soft-retire. The new
  method shares do()'s 204 normalization and 4xx/5xx error
  propagation so tool authors get one contract.
  internal/mcp/tools.go + internal/mcp/types.go expose the
  retire_agent tool with Force+Reason inputs wired through
  DeleteWithQuery.

CLI:
  cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
  `agents list --retired` (client-side strip of --retired then
  delegation to ListRetiredAgents, sharing --page/--per-page parsing
  with the default listing) and `agents retire <id> [--force --reason
  "…"]` (mirrors ErrForceReasonRequired — force without reason is
  rejected client-side before the request is sent). JSON + table
  output modes both honor the new columns.

Frontend:
  web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
  web/src/api/client.ts + web/src/api/types.ts expose the retire
  endpoint and the retired-listing. 4 new Vitest regression cases.

OpenAPI:
  api/openapi.yaml documents DELETE /agents/{id} with all seven
  status codes, 410 on heartbeat, and the 409 per-bucket body shape.

Regression coverage (six new test files, all green):
  internal/service/agent_retire_test.go           — 8-step contract + sentinel guards
  internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
  internal/mcp/retire_agent_test.go               — DeleteWithQuery wire-through
  internal/cli/agent_retire_test.go               — --retired listing + --force/--reason pairing
  internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
  internal/domain/connector_test.go               — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies

Files:
  api/openapi.yaml                                — DELETE + 410 + 409 body shape
  cmd/agent/main.go                               — ErrAgentRetired, markRetired, retiredSignal
  cmd/cli/main.go                                 — handleAgents list/get/retire dispatch
  docs/architecture.md, docs/concepts.md,
    docs/testing-guide.md                         — retirement contract narrative
  internal/api/handler/agents.go                  — RetireAgent, status surface, 410 on heartbeat
  internal/api/handler/agent_handler_test.go      — extended coverage
  internal/api/handler/agent_retire_handler_test.go — new
  internal/api/router/router.go                   — /agents/retired before /agents/{id}
  internal/cli/agent_retire_test.go               — new
  internal/cli/client.go                          — ListRetiredAgents + RetireAgent
  internal/domain/connector.go                    — IsRetired, SentinelAgentIDs,
                                                    IsSentinelAgent, AgentDependencyCounts,
                                                    ActorTypeAgent/System
  internal/domain/connector_test.go               — new
  internal/integration/lifecycle_test.go          — retirement fixture
  internal/mcp/client.go                          — DeleteWithQuery additive transport
  internal/mcp/retire_agent_test.go               — new
  internal/mcp/tools.go, internal/mcp/types.go    — retire_agent tool + Force/Reason inputs
  internal/repository/interfaces.go               — AgentRepository retirement methods
  internal/repository/postgres/agent.go           — retire + cascade target retire + counts
  internal/repository/postgres/migration_000015_test.go — new
  internal/service/agent.go                       — wire into AgentService surface
  internal/service/agent_retire.go                — new 8-step contract
  internal/service/agent_retire_test.go           — new
  internal/service/deployment.go                  — skip retired agents
  internal/service/target.go                      — skip retired agents
  internal/service/testutil_test.go               — shared mocks extended
  migrations/000015_agent_retire.up.sql           — new
  migrations/000015_agent_retire.down.sql         — new
  web/src/api/client.ts, types.ts + tests         — retire endpoint wiring
  web/src/pages/AgentsPage.tsx                    — retire UI
This commit is contained in:
Shankar Reddy
2026-04-19 05:24:00 +00:00
parent c17ea577e7
commit 49002c8cba
35 changed files with 4400 additions and 33 deletions
@@ -10,6 +10,7 @@ import (
"time"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// MockAgentService is a mock implementation of AgentService interface.
@@ -24,6 +25,11 @@ type MockAgentService struct {
GetWorkFn func(agentID string) ([]domain.Job, error)
GetWorkWithTargetsFn func(agentID string) ([]domain.WorkItem, error)
UpdateJobStatusFn func(agentID string, jobID string, status string, errMsg string) error
// I-004: soft-retirement hooks. Tests that don't set these receive nil
// results and nil errors, which mirrors the safest default (no-op) for
// unrelated suites that mock only the legacy surface.
RetireAgentFn func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error)
ListRetiredAgentsFn func(page, perPage int) ([]domain.Agent, int64, error)
}
func (m *MockAgentService) ListAgents(_ context.Context, page, perPage int) ([]domain.Agent, int64, error) {
@@ -96,6 +102,25 @@ func (m *MockAgentService) UpdateJobStatus(_ context.Context, agentID string, jo
return nil
}
// RetireAgent is the I-004 soft-retirement entrypoint. Tests that don't set
// RetireAgentFn get a nil result + nil error, which is a no-op response that
// lets unrelated suites compile without caring about the retirement surface.
func (m *MockAgentService) RetireAgent(_ context.Context, agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
if m.RetireAgentFn != nil {
return m.RetireAgentFn(agentID, actor, force, reason)
}
return nil, nil
}
// ListRetiredAgents returns retired rows for the retired-agents tab / audit
// views. Same zero-value default as RetireAgent for unrelated tests.
func (m *MockAgentService) ListRetiredAgents(_ context.Context, page, perPage int) ([]domain.Agent, int64, error) {
if m.ListRetiredAgentsFn != nil {
return m.ListRetiredAgentsFn(page, perPage)
}
return nil, 0, nil
}
// Test ListAgents - success case
func TestListAgents_Success(t *testing.T) {
now := time.Now()
@@ -0,0 +1,393 @@
package handler
import (
"context"
"encoding/json"
"errors"
"net/http"
"net/http/httptest"
"testing"
"time"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// agentRetireTestSetup builds an AgentHandler with a mock AgentService whose
// RetireAgent / ListRetiredAgents / Heartbeat behavior is driven by the
// returned mock. Keeps every I-004 handler test self-contained so a single
// failing assertion can't cascade through a shared fixture.
func agentRetireTestSetup() (*MockAgentService, AgentHandler) {
mock := &MockAgentService{}
handler := NewAgentHandler(mock)
return mock, handler
}
// TestRetireAgentHandler_Success_200 pins the happy-path contract for the
// soft-retirement HTTP surface: DELETE /api/v1/agents/{id} with no dependency
// fallout returns 200 OK and a JSON body echoing retirement metadata
// (retired_at timestamp, already_retired=false, cascade=false, zero counts).
// Operators building dashboards parse these fields; keep the shape stable.
func TestRetireAgentHandler_Success_200(t *testing.T) {
retiredAt := time.Date(2026, 4, 18, 12, 0, 0, 0, time.UTC)
mock, handler := agentRetireTestSetup()
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
if agentID != "a-prod-001" {
t.Fatalf("retire handler received agentID=%q want a-prod-001", agentID)
}
if force {
t.Fatalf("retire handler set force=true unexpectedly; default path must be force=false")
}
return &service.AgentRetirementResult{
AlreadyRetired: false,
Cascade: false,
RetiredAt: retiredAt,
Counts: domain.AgentDependencyCounts{},
}, nil
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusOK {
t.Fatalf("status=%d body=%s want 200", w.Code, w.Body.String())
}
var body struct {
RetiredAt time.Time `json:"retired_at"`
AlreadyRetired bool `json:"already_retired"`
Cascade bool `json:"cascade"`
Counts domain.AgentDependencyCounts `json:"counts"`
}
if err := json.NewDecoder(w.Body).Decode(&body); err != nil {
t.Fatalf("decode 200 body: %v", err)
}
if !body.RetiredAt.Equal(retiredAt) {
t.Errorf("retired_at=%v want %v", body.RetiredAt, retiredAt)
}
if body.AlreadyRetired {
t.Errorf("already_retired=true want false on clean retire")
}
if body.Cascade {
t.Errorf("cascade=true want false on clean retire")
}
}
// TestRetireAgentHandler_AlreadyRetired_204 covers the idempotent contract: a
// retire call against an already-retired agent completes with 204 No Content
// (no body). This lets operators safely re-issue the DELETE after a network
// blip without fearing duplicate audit events or state mutations.
func TestRetireAgentHandler_AlreadyRetired_204(t *testing.T) {
mock, handler := agentRetireTestSetup()
past := time.Now().Add(-24 * time.Hour)
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
return &service.AgentRetirementResult{
AlreadyRetired: true,
Cascade: false,
RetiredAt: past,
Counts: domain.AgentDependencyCounts{},
}, nil
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusNoContent {
t.Fatalf("status=%d body=%s want 204", w.Code, w.Body.String())
}
// 204 No Content must have zero body. If anything leaks through, downstream
// clients (curl scripts, dashboards) break.
if w.Body.Len() != 0 {
t.Errorf("204 body=%q want empty", w.Body.String())
}
}
// TestRetireAgentHandler_Sentinel_403 covers the hard guard against retiring
// any of the four sentinel agents that back discovery sources and the
// network scanner. These IDs are reserved; the handler must surface the
// service-layer ErrAgentIsSentinel as 403 Forbidden regardless of force/reason
// because no operator intent can legitimately retire them.
func TestRetireAgentHandler_Sentinel_403(t *testing.T) {
sentinels := []string{"server-scanner", "cloud-aws-sm", "cloud-azure-kv", "cloud-gcp-sm"}
for _, id := range sentinels {
t.Run(id, func(t *testing.T) {
mock, handler := agentRetireTestSetup()
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
return nil, service.ErrAgentIsSentinel
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/"+id, nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusForbidden {
t.Fatalf("sentinel %q status=%d body=%s want 403", id, w.Code, w.Body.String())
}
})
}
}
// TestRetireAgentHandler_NotFound_404 covers the lookup-miss path. Service
// returns a not-found error; handler maps to 404. Keeping the error
// discrimination at the service layer (sentinel errors.Is) rather than string
// matching is the whole point of wrapping.
func TestRetireAgentHandler_NotFound_404(t *testing.T) {
mock, handler := agentRetireTestSetup()
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
return nil, errors.New("agent not found")
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/unknown-id", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusNotFound {
t.Fatalf("status=%d body=%s want 404", w.Code, w.Body.String())
}
}
// TestRetireAgentHandler_Blocked_409_WithCounts covers the preflight-blocked
// path. Service returns *BlockedByDependenciesError wrapping
// ErrBlockedByDependencies; handler unwraps via errors.As, maps to 409, and
// MUST include the counts in the response body so operators know what's
// blocking them. Without counts the 409 is useless — the operator has to
// guess which downstream dependency is holding up the retirement.
func TestRetireAgentHandler_Blocked_409_WithCounts(t *testing.T) {
mock, handler := agentRetireTestSetup()
blockCounts := domain.AgentDependencyCounts{
ActiveTargets: 3,
ActiveCertificates: 7,
PendingJobs: 2,
}
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
return nil, &service.BlockedByDependenciesError{Counts: blockCounts}
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusConflict {
t.Fatalf("status=%d body=%s want 409", w.Code, w.Body.String())
}
var body struct {
Error string `json:"error"`
Message string `json:"message"`
Counts domain.AgentDependencyCounts `json:"counts"`
}
if err := json.NewDecoder(w.Body).Decode(&body); err != nil {
t.Fatalf("decode 409 body: %v", err)
}
if body.Counts.ActiveTargets != 3 {
t.Errorf("counts.active_targets=%d want 3", body.Counts.ActiveTargets)
}
if body.Counts.ActiveCertificates != 7 {
t.Errorf("counts.active_certificates=%d want 7", body.Counts.ActiveCertificates)
}
if body.Counts.PendingJobs != 2 {
t.Errorf("counts.pending_jobs=%d want 2", body.Counts.PendingJobs)
}
if body.Message == "" {
t.Errorf("409 body missing human-readable message; operators need guidance")
}
}
// TestRetireAgentHandler_Force_NoReason_400 covers the force-escape-hatch
// guardrail: force=true without a non-empty reason must be rejected at the
// handler seam BEFORE the service performs any DB work, because a
// reason-less cascade is unauditable. Service returns ErrForceReasonRequired;
// handler maps to 400.
func TestRetireAgentHandler_Force_NoReason_400(t *testing.T) {
mock, handler := agentRetireTestSetup()
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
if !force {
t.Fatalf("handler did not forward force=true; force query param was dropped")
}
if reason != "" {
t.Fatalf("handler passed reason=%q; empty reason must reach service for error path", reason)
}
return nil, service.ErrForceReasonRequired
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001?force=true", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusBadRequest {
t.Fatalf("status=%d body=%s want 400", w.Code, w.Body.String())
}
}
// TestRetireAgentHandler_ForceCascade_200 covers the successful force-cascade
// path: DELETE ?force=true&reason=... → service executes transactional
// cascade → 200 with cascade=true and the pre-cascade counts echoed back so
// the operator's confirmation dialog can show "I just retired N targets,
// M certificates, K pending jobs."
func TestRetireAgentHandler_ForceCascade_200(t *testing.T) {
mock, handler := agentRetireTestSetup()
retiredAt := time.Date(2026, 4, 18, 14, 30, 0, 0, time.UTC)
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
if !force {
t.Fatalf("handler did not forward force=true; query-param parsing broken")
}
if reason != "decommissioning rack 7" {
t.Fatalf("handler forwarded reason=%q want %q", reason, "decommissioning rack 7")
}
return &service.AgentRetirementResult{
AlreadyRetired: false,
Cascade: true,
RetiredAt: retiredAt,
Counts: domain.AgentDependencyCounts{
ActiveTargets: 2,
ActiveCertificates: 5,
PendingJobs: 1,
},
}, nil
}
url := "/api/v1/agents/a-prod-001?force=true&reason=decommissioning+rack+7"
req := httptest.NewRequest(http.MethodDelete, url, nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusOK {
t.Fatalf("status=%d body=%s want 200", w.Code, w.Body.String())
}
var body struct {
RetiredAt time.Time `json:"retired_at"`
AlreadyRetired bool `json:"already_retired"`
Cascade bool `json:"cascade"`
Counts domain.AgentDependencyCounts `json:"counts"`
}
if err := json.NewDecoder(w.Body).Decode(&body); err != nil {
t.Fatalf("decode force-cascade 200 body: %v", err)
}
if !body.Cascade {
t.Errorf("cascade=false want true on ?force=true successful retire")
}
if body.Counts.ActiveTargets != 2 || body.Counts.ActiveCertificates != 5 || body.Counts.PendingJobs != 1 {
t.Errorf("counts=%+v want {ActiveTargets:2 ActiveCertificates:5 PendingJobs:1}", body.Counts)
}
}
// TestHeartbeatHandler_RetiredAgent_410 covers the agent-shutdown signal. A
// retired agent that is still polling must be told its identity is gone
// (410 Gone) rather than offered the normal 200 "recorded" response.
// cmd/agent treats 410 as a terminal signal and exits rather than looping
// forever against a decommissioned identity. Service returns ErrAgentRetired;
// handler maps to 410.
func TestHeartbeatHandler_RetiredAgent_410(t *testing.T) {
mock, handler := agentRetireTestSetup()
mock.HeartbeatFn = func(agentID string, metadata *domain.AgentMetadata) error {
return service.ErrAgentRetired
}
req := httptest.NewRequest(http.MethodPost, "/api/v1/agents/a-prod-001/heartbeat", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.Heartbeat(w, req)
if w.Code != http.StatusGone {
t.Fatalf("heartbeat(retired) status=%d body=%s want 410", w.Code, w.Body.String())
}
}
// TestListRetiredAgentsHandler_Success covers the audit/forensics-facing
// endpoint GET /api/v1/agents/retired. Returns a paged list of retired rows
// alongside total count so the GUI can render a "Retired Agents" tab with
// pagination. Default listing (GET /agents) hides retired rows; this is the
// opt-in surface for them.
func TestListRetiredAgentsHandler_Success(t *testing.T) {
past := time.Now().Add(-48 * time.Hour)
reason := "old hardware"
retired := []domain.Agent{
{
ID: "agent-retired-01",
Name: "decom-01",
Hostname: "server-old",
Status: domain.AgentStatusOffline,
RegisteredAt: past,
RetiredAt: &past,
RetiredReason: &reason,
},
}
mock, handler := agentRetireTestSetup()
mock.ListRetiredAgentsFn = func(page, perPage int) ([]domain.Agent, int64, error) {
if page != 1 || perPage != 50 {
t.Fatalf("ListRetired handler received page=%d perPage=%d want 1/50 defaults", page, perPage)
}
return retired, 1, nil
}
req := httptest.NewRequest(http.MethodGet, "/api/v1/agents/retired", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.ListRetiredAgents(w, req)
if w.Code != http.StatusOK {
t.Fatalf("status=%d body=%s want 200", w.Code, w.Body.String())
}
var response PagedResponse
if err := json.NewDecoder(w.Body).Decode(&response); err != nil {
t.Fatalf("decode list-retired body: %v", err)
}
if response.Total != 1 {
t.Errorf("total=%d want 1", response.Total)
}
}
// TestRetireAgentHandler_MethodNotAllowed covers defense-in-depth: only
// DELETE is valid on /api/v1/agents/{id} for retirement. Using POST/PUT/PATCH
// must be rejected with 405 so misconfigured callers don't accidentally
// trigger retirement via a wrong-method request.
func TestRetireAgentHandler_MethodNotAllowed(t *testing.T) {
_, handler := agentRetireTestSetup()
for _, method := range []string{http.MethodPost, http.MethodPut, http.MethodPatch} {
t.Run(method, func(t *testing.T) {
req := httptest.NewRequest(method, "/api/v1/agents/a-prod-001", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusMethodNotAllowed {
t.Fatalf("method=%s status=%d want 405", method, w.Code)
}
})
}
}
// Compile-time asserts: the mock must satisfy the handler's AgentService
// interface. Red state: this fails until the interface grows RetireAgent +
// ListRetiredAgents. Once Phase 2b adds those methods to AgentService, this
// assertion goes green along with every test above.
var _ AgentService = (*MockAgentService)(nil)
// Unused-import suppressor for context — the package-level tests already
// pull context from agent_handler_test.go, but leaving this here documents
// that the mock methods receive context.Context values even though this
// file's tests don't construct them directly (they ride on httptest.NewRequest).
var _ = context.Background
+199
View File
@@ -3,16 +3,24 @@ package handler
import (
"context"
"encoding/json"
"errors"
"log/slog"
"net/http"
"strconv"
"strings"
"time"
"github.com/shankar0123/certctl/internal/api/middleware"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// AgentService defines the service interface for agent operations.
//
// I-004 expansion: RetireAgent + ListRetiredAgents back the soft-retirement
// surface. The handler depends on the service-package's AgentRetirementResult
// and BlockedByDependenciesError types for result shape + errors.As unwrap,
// which is why this file imports internal/service.
type AgentService interface {
ListAgents(ctx context.Context, page, perPage int) ([]domain.Agent, int64, error)
GetAgent(ctx context.Context, id string) (*domain.Agent, error)
@@ -24,6 +32,10 @@ type AgentService interface {
GetWork(ctx context.Context, agentID string) ([]domain.Job, error)
GetWorkWithTargets(ctx context.Context, agentID string) ([]domain.WorkItem, error)
UpdateJobStatus(ctx context.Context, agentID string, jobID string, status string, errMsg string) error
// I-004 soft-retirement API. Both default to no-op (nil result / nil error)
// in mocks that don't override them — handler tests opt in per suite.
RetireAgent(ctx context.Context, agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error)
ListRetiredAgents(ctx context.Context, page, perPage int) ([]domain.Agent, int64, error)
}
// AgentHandler handles HTTP requests for agent operations.
@@ -190,6 +202,15 @@ func (h AgentHandler) Heartbeat(w http.ResponseWriter, r *http.Request) {
}
if err := h.svc.Heartbeat(r.Context(), agentID, metadata); err != nil {
// I-004: a retired agent still polling must receive 410 Gone so
// cmd/agent detects the terminal signal and shuts down cleanly
// instead of looping forever against a decommissioned identity.
// Check this FIRST — before "not found" string matching — so the
// retired-path is never masked by a sibling error branch.
if errors.Is(err, service.ErrAgentRetired) {
ErrorWithRequestID(w, http.StatusGone, "Agent has been retired", requestID)
return
}
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Agent not found", requestID)
return
@@ -376,3 +397,181 @@ func (h AgentHandler) AgentReportJobStatus(w http.ResponseWriter, r *http.Reques
"status": "updated",
})
}
// RetireAgent executes the I-004 soft-retirement surface.
// DELETE /api/v1/agents/{id}[?force=true&reason=...]
//
// Contract (pinned by agent_retire_handler_test.go):
//
// 405 any method other than DELETE
// 200 clean retire (body: retired_at, already_retired=false, cascade=false, counts=0s)
// 200 force-cascade retire (body: cascade=true, counts=pre-cascade snapshot)
// 204 idempotent retire of an already-retired agent (NO body — downstream
// clients that tee responses into dashboards break on spurious bodies)
// 400 force=true without a non-empty reason (ErrForceReasonRequired)
// 403 one of the four reserved sentinel IDs (ErrAgentIsSentinel)
// 404 agent does not exist ("not found" string match, kept for compat with
// repo error strings; sentinel checks run first so they never mask)
// 409 blocked by preflight counts (*BlockedByDependenciesError) — body
// carries the per-bucket counts so the operator UI can tell the
// human which downstream dependency is holding up the retirement,
// rather than forcing them to re-run the DELETE with ?force=true
// and guess
// 500 anything else
//
// The 409 body intentionally does NOT go through ErrorWithRequestID because
// that helper's ErrorResponse shape has no `counts` field — we inline-marshal
// a custom body instead. Keeping this shape stable is important: the GUI
// pattern is "show the 409 dialog, list the N targets / M certs / K jobs
// blocking, let the operator retire them first or tick the force checkbox."
func (h AgentHandler) RetireAgent(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodDelete {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
requestID := middleware.GetRequestID(r.Context())
// Extract {id} from /api/v1/agents/{id}. Mirror GetAgent's pattern so
// the path parser is identical across the agent handler surface and a
// future refactor can extract it once without introducing drift.
rawID := strings.TrimPrefix(r.URL.Path, "/api/v1/agents/")
parts := strings.Split(rawID, "/")
if len(parts) == 0 || parts[0] == "" {
ErrorWithRequestID(w, http.StatusBadRequest, "Agent ID is required", requestID)
return
}
id := parts[0]
// Parse optional force + reason. A missing `force` param is treated as
// force=false (the default, safe path); anything strconv.ParseBool rejects
// is also force=false so a malformed query can never silently enable the
// cascade. The reason string is passed through verbatim — the service
// owns the "force=true requires reason" rule.
query := r.URL.Query()
force := false
if fv := query.Get("force"); fv != "" {
if parsed, err := strconv.ParseBool(fv); err == nil {
force = parsed
}
}
reason := query.Get("reason")
actor := resolveActor(r.Context())
result, err := h.svc.RetireAgent(r.Context(), id, actor, force, reason)
if err != nil {
// Sentinel + typed-error checks run BEFORE string matching on "not
// found" so a repo error that happens to contain those words can
// never mask a structural refusal (403/400/409). Order matters.
if errors.Is(err, service.ErrAgentIsSentinel) {
ErrorWithRequestID(w, http.StatusForbidden, "Agent is a reserved sentinel and cannot be retired", requestID)
return
}
if errors.Is(err, service.ErrForceReasonRequired) {
ErrorWithRequestID(w, http.StatusBadRequest, "force=true requires a non-empty reason", requestID)
return
}
var blocked *service.BlockedByDependenciesError
if errors.As(err, &blocked) {
// Custom 409 body with per-bucket counts. ErrorResponse has no
// `counts` field, so we marshal a bespoke struct instead.
// Keep `error`/`message`/`counts` as the stable shape — any
// dashboard parsing this relies on those three keys.
body := struct {
Error string `json:"error"`
Message string `json:"message"`
Counts domain.AgentDependencyCounts `json:"counts"`
}{
Error: "blocked_by_dependencies",
Message: "Agent has active downstream dependencies. Retire or reassign them " +
"first, or re-run with ?force=true&reason=... to cascade.",
Counts: blocked.Counts,
}
JSON(w, http.StatusConflict, body)
return
}
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Agent not found", requestID)
return
}
slog.Error("RetireAgent failed", "agent_id", id, "error", err.Error())
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to retire agent", requestID)
return
}
// Idempotent retire: the agent was already retired, so we return 204 No
// Content with a ZERO-length body. The Red contract (test line 106) fails
// if even a trailing newline leaks into the response. WriteHeader alone
// emits the status without invoking the JSON encoder.
if result.AlreadyRetired {
w.WriteHeader(http.StatusNoContent)
return
}
// Clean retire (force=false) or successful cascade (force=true). Body
// shape pinned by Red contract: retired_at, already_retired, cascade,
// counts. Omitempty is deliberately NOT used — operators parsing the
// response expect every field to always be present.
JSON(w, http.StatusOK, struct {
RetiredAt time.Time `json:"retired_at"`
AlreadyRetired bool `json:"already_retired"`
Cascade bool `json:"cascade"`
Counts domain.AgentDependencyCounts `json:"counts"`
}{
RetiredAt: result.RetiredAt,
AlreadyRetired: result.AlreadyRetired,
Cascade: result.Cascade,
Counts: result.Counts,
})
}
// ListRetiredAgents returns the opt-in listing of retired agents for the
// operator UI's "Retired" tab and for audit/forensics workflows.
// GET /api/v1/agents/retired?page=1&per_page=50
//
// The default ListAgents handler hides retired rows; this is the dedicated
// surface for reading them back. Pagination defaults match ListAgents so
// the GUI can reuse the same query hook (page=1, per_page=50, cap 500).
//
// Go 1.22's enhanced ServeMux routes `/agents/retired` to this handler via
// the literal-beats-pattern-var precedence rule (literal `retired` wins over
// `{id}` in the sibling GET /api/v1/agents/{id} route), so both entries can
// coexist without conflict. If that precedence ever regresses, the failure
// mode is TestListRetiredAgentsHandler_Success blowing up with a 404 — which
// is the fast signal we want.
func (h AgentHandler) ListRetiredAgents(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
requestID := middleware.GetRequestID(r.Context())
page := 1
perPage := 50
query := r.URL.Query()
if p := query.Get("page"); p != "" {
if parsed, err := strconv.Atoi(p); err == nil && parsed > 0 {
page = parsed
}
}
if pp := query.Get("per_page"); pp != "" {
if parsed, err := strconv.Atoi(pp); err == nil && parsed > 0 && parsed <= 500 {
perPage = parsed
}
}
agents, total, err := h.svc.ListRetiredAgents(r.Context(), page, perPage)
if err != nil {
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to list retired agents", requestID)
return
}
JSON(w, http.StatusOK, PagedResponse{
Data: agents,
Total: total,
Page: page,
PerPage: perPage,
})
}