Close I-004 (agent hard-delete cascades targets) coverage-gap finding

Operator decision: full soft-delete with an optional forced cascade;
hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.

Schema — migration 000015:
  migrations/000015_agent_retire.up.sql flips
  deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
  RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
  boundary instead of quietly destroying targets. Both `agents` and
  `deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
  TEXT pair (TEXT not VARCHAR so operator comments are never
  truncated), indexed via partial indexes WHERE retired_at IS NOT
  NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
  CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
  EXISTS) so repeated runs against partially-migrated databases
  converge. migrations/000015_agent_retire.down.sql restores CASCADE
  and drops the new columns for clean rollback. A dedicated
  repository-layer testcontainers test
  (internal/repository/postgres/migration_000015_test.go) asserts the
  before/after FK action, column presence, index presence, and
  round-trip idempotency under up→down→up.

Domain — sentinel guard + dependency counts:
  internal/domain/connector.go gains IsRetired() on Agent, the
  exported SentinelAgentIDs slice listing server-scanner,
  cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
  four reserved IDs documented in CLAUDE.md and created at startup in
  cmd/server/main.go), IsSentinelAgent(id string) predicate,
  AgentDependencyCounts{ActiveTargets, ActiveCertificates,
  PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
  ActorTypeSystem enum values used by audit emission downstream.
  Coverage locked down by internal/domain/connector_test.go.
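
  A minimal sketch of the domain helpers described above, for readers
  who want the shape without opening connector.go. The identifier names
  and the four IDs come from this message; the int field types and the
  linear-scan lookup are assumptions, not the committed code.

```go
package main

import "fmt"

// SentinelAgentIDs lists the four reserved IDs named in this commit message.
var SentinelAgentIDs = []string{
	"server-scanner",
	"cloud-aws-sm",
	"cloud-azure-kv",
	"cloud-gcp-sm",
}

// IsSentinelAgent reports whether id is one of the reserved sentinel IDs.
func IsSentinelAgent(id string) bool {
	for _, s := range SentinelAgentIDs {
		if s == id {
			return true
		}
	}
	return false
}

// AgentDependencyCounts carries the three preflight buckets.
// The int type is an assumption; the JSON keys match the 409 body shape.
type AgentDependencyCounts struct {
	ActiveTargets      int `json:"active_targets"`
	ActiveCertificates int `json:"active_certificates"`
	PendingJobs        int `json:"pending_jobs"`
}

// HasDependencies is true when any bucket is non-zero.
func (c AgentDependencyCounts) HasDependencies() bool {
	return c.ActiveTargets > 0 || c.ActiveCertificates > 0 || c.PendingJobs > 0
}

func main() {
	fmt.Println(IsSentinelAgent("cloud-aws-sm"))
	fmt.Println(IsSentinelAgent("a-prod-001"))
	fmt.Println(AgentDependencyCounts{PendingJobs: 1}.HasDependencies())
}
```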

Service — 8-step ordered contract:
  internal/service/agent_retire.go:RetireAgent(ctx, id, actor,
  opts{Force, Reason}) enforces a fixed execution order:
  (1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel
      unconditionally; force=true does NOT bypass it.
  (2) fetch — ErrAgentNotFound on miss.
  (3) idempotency — if IsRetired() already, return
      AgentRetirementResult{AlreadyRetired: true} with no new audit
      event and no state change (safe to replay from flaky clients).
  (4) preflight counts — collectAgentDependencyCounts runs
      ActiveTargets, ActiveCertificates, PendingJobs sequentially
      (not in parallel; keeps the per-query timeout predictable and
      matches the repo's existing call-chain shape).
  (5) force-reason guard — opts.Force=true with empty Reason returns
      ErrForceReasonRequired (wired into the 400 status surface).
  (6) dependency guard — HasDependencies() with opts.Force=false
      returns BlockedByDependenciesError{Counts} (wired into the 409
      body with per-bucket counts).
  (7) mutation — single pinned retiredAt := time.Now(); agent
      retirement first, then cascade target retirement if opts.Force,
      all under the repo's single transaction so the two retired_at
      stamps match to the second.
  (8) best-effort audit — agent_retired always; agent_retirement_
      cascaded additionally on the force path. Actor is whatever the
      handler resolves from the request; actor type is mapped by
      resolveActorType (system/agent-prefix→Agent/else→User). Audit
      emission failures are logged via slog.Error but do not abort
      the retirement (matches the house convention used by every
      other scheduler-emitted event).
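
  The guard ordering in steps (1) through (7) can be sketched against
  an in-memory map. This is an illustration only: the `agent` and
  `result` types, the collapsed single dependency count, and the error
  strings are hypothetical; the real service talks to the repository
  and emits audit events in step (8), which is omitted here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Error names come from the commit message; the message strings are assumed.
var (
	ErrAgentIsSentinel       = errors.New("agent is a reserved sentinel")
	ErrAgentNotFound         = errors.New("agent not found")
	ErrForceReasonRequired   = errors.New("force requires a reason")
	ErrBlockedByDependencies = errors.New("blocked by dependencies")
)

// agent is a hypothetical in-memory stand-in for the repository row.
type agent struct {
	retiredAt *time.Time
	deps      int // all three dependency buckets collapsed into one number
}

// result mirrors a subset of AgentRetirementResult.
type result struct {
	AlreadyRetired bool
	Cascade        bool
	RetiredAt      time.Time
}

var sentinels = map[string]bool{
	"server-scanner": true,
	"cloud-aws-sm":   true,
	"cloud-azure-kv": true,
	"cloud-gcp-sm":   true,
}

func retire(store map[string]*agent, id string, force bool, reason string) (*result, error) {
	if sentinels[id] { // (1) sentinel guard, never bypassed by force
		return nil, ErrAgentIsSentinel
	}
	a, ok := store[id] // (2) fetch
	if !ok {
		return nil, ErrAgentNotFound
	}
	if a.retiredAt != nil { // (3) idempotent replay, no state change
		return &result{AlreadyRetired: true, RetiredAt: *a.retiredAt}, nil
	}
	deps := a.deps              // (4) preflight counts
	if force && reason == "" { // (5) force-reason guard
		return nil, ErrForceReasonRequired
	}
	if deps > 0 && !force { // (6) dependency guard
		return nil, ErrBlockedByDependencies
	}
	now := time.Now() // (7) one pinned timestamp for agent and cascade
	a.retiredAt = &now
	return &result{Cascade: force, RetiredAt: now}, nil
}

func main() {
	store := map[string]*agent{"a-prod-001": {deps: 3}}
	_, err := retire(store, "a-prod-001", false, "")
	fmt.Println("without force:", err)
	res, _ := retire(store, "a-prod-001", true, "decommissioning rack 7")
	fmt.Println("cascade:", res.Cascade)
	res, _ = retire(store, "a-prod-001", false, "")
	fmt.Println("replay already retired:", res.AlreadyRetired)
}
```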

  BlockedByDependenciesError implements Error() as
  "active_targets=%d, active_certificates=%d, pending_jobs=%d" and
  Unwrap() → ErrBlockedByDependencies. The single struct satisfies
  errors.Is via Unwrap (used by scheduler-level tests) and errors.As
  via the concrete type (used by the handler to fish out Counts for
  the 409 body). ListRetiredAgents(page, perPage) adds a separate
  paginated accessor with page<1→1 and perPage<1→50 normalization so
  retired rows are queryable without polluting the default agent
  listing.
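
  The dual errors.Is / errors.As behaviour described above follows
  from the standard library's unwrap rules. A self-contained sketch,
  assuming int counts and a pointer receiver (the handler test in this
  commit constructs the error with &service.BlockedByDependenciesError);
  `isBlocked`, `countsFrom` and `normalizePage` are hypothetical helper
  names used only for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

var ErrBlockedByDependencies = errors.New("blocked by dependencies")

type AgentDependencyCounts struct {
	ActiveTargets, ActiveCertificates, PendingJobs int
}

type BlockedByDependenciesError struct {
	Counts AgentDependencyCounts
}

func (e *BlockedByDependenciesError) Error() string {
	return fmt.Sprintf("active_targets=%d, active_certificates=%d, pending_jobs=%d",
		e.Counts.ActiveTargets, e.Counts.ActiveCertificates, e.Counts.PendingJobs)
}

// Unwrap lets errors.Is match the package-level sentinel.
func (e *BlockedByDependenciesError) Unwrap() error { return ErrBlockedByDependencies }

// isBlocked is the errors.Is view: callers only need a yes/no answer.
func isBlocked(err error) bool { return errors.Is(err, ErrBlockedByDependencies) }

// countsFrom is the errors.As view: callers need the concrete Counts.
func countsFrom(err error) (AgentDependencyCounts, bool) {
	var blocked *BlockedByDependenciesError
	if errors.As(err, &blocked) {
		return blocked.Counts, true
	}
	return AgentDependencyCounts{}, false
}

// normalizePage applies the page<1→1 and perPage<1→50 rule.
func normalizePage(page, perPage int) (int, int) {
	if page < 1 {
		page = 1
	}
	if perPage < 1 {
		perPage = 50
	}
	return page, perPage
}

func main() {
	inner := &BlockedByDependenciesError{Counts: AgentDependencyCounts{3, 7, 2}}
	err := fmt.Errorf("retire a-prod-001: %w", inner)
	fmt.Println(isBlocked(err))
	counts, ok := countsFrom(err)
	fmt.Println(counts, ok)
	fmt.Println(normalizePage(0, -5))
}
```

  Both helpers keep working when the error is wrapped again with %w,
  which is why the handler can rely on errors.As rather than a type
  assertion.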

  The sentinel guard is deliberately stricter than the dependency
  guard: all four reserved IDs are protected, and force=true, which
  overrides the dependency guard, cannot override it. Regression tests
  in internal/service/agent_retire_test.go assert each of the eight
  steps in order, plus sentinel bypass attempts and idempotency
  replay.

Handler + router — status-code surface:
  internal/api/handler/agents.go:RetireAgent exposes seven status
  codes on DELETE /agents/{id}:
    200 on a fresh retirement (body echoes AgentRetirementResult).
    204 on idempotent replay (AlreadyRetired=true; no new audit).
    400 on ErrForceReasonRequired.
    403 on ErrAgentIsSentinel.
    404 on ErrAgentNotFound.
    409 on BlockedByDependenciesError, with a custom body shape
        {error, counts{active_targets, active_certificates,
        pending_jobs}} that bypasses the default ErrorWithRequestID
        envelope so callers get the per-bucket numbers directly.
    500 on any other error.
  The Heartbeat handler returns 410 Gone when the agent is retired
  (ErrAgentRetired), signalling the agent to shut down.
  Query params `force=true` and `reason=<text>` drive the cascade
  path; both are forwarded as url.Values through the new MCP
  transport.
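
  The status surface above reduces to an ordered error-to-status
  mapping. The sketch below mirrors the check order used by the
  handler in this commit (typed checks first, "not found" substring
  match last); `statusFor`, `errRepoNotFound` and `errOther` are
  hypothetical names, and the error strings are assumed.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
)

var (
	ErrAgentIsSentinel     = errors.New("agent is a reserved sentinel")
	ErrForceReasonRequired = errors.New("force requires a reason")

	// Hypothetical stand-ins for repository-level failures.
	errRepoNotFound = errors.New("agent not found")
	errOther        = errors.New("connection reset")
)

// BlockedByDependenciesError is reduced to an empty struct for this sketch.
type BlockedByDependenciesError struct{}

func (e *BlockedByDependenciesError) Error() string { return "blocked by dependencies" }

// statusFor maps a service outcome to the HTTP status code.
func statusFor(alreadyRetired bool, err error) int {
	if err == nil {
		if alreadyRetired {
			return http.StatusNoContent // idempotent replay
		}
		return http.StatusOK // fresh retirement
	}
	var blocked *BlockedByDependenciesError
	switch {
	case errors.Is(err, ErrAgentIsSentinel):
		return http.StatusForbidden
	case errors.Is(err, ErrForceReasonRequired):
		return http.StatusBadRequest
	case errors.As(err, &blocked):
		return http.StatusConflict
	case strings.Contains(err.Error(), "not found"):
		return http.StatusNotFound
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	fmt.Println(statusFor(false, nil))
	fmt.Println(statusFor(true, nil))
	fmt.Println(statusFor(false, ErrAgentIsSentinel))
	fmt.Println(statusFor(false, &BlockedByDependenciesError{}))
	fmt.Println(statusFor(false, errRepoNotFound))
	fmt.Println(statusFor(false, errOther))
}
```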

  internal/api/router/router.go registers the literal path
  GET /api/v1/agents/retired ahead of /api/v1/agents/{id}. Go 1.22
  ServeMux prefers the more specific literal segment over a wildcard
  regardless of registration order, so "retired" reaches the
  paginated retired-agents listing instead of being looked up as a
  hypothetical agent named "retired".

Agent binary — clean shutdown on 410:
  cmd/agent/main.go gains the ErrAgentRetired sentinel, a
  retiredOnce sync.Once, and a retiredSignal chan struct{}. A
  markRetired(source, statusCode, body) helper closes the channel
  exactly once; the Run() select loop observes the close and returns
  ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
  and exits cleanly instead of spinning in the heartbeat retry loop.
  The 410 Gone surface is therefore terminal for the agent process.
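
  A reduced sketch of the close-once shutdown signal. The field names
  follow this message; the `Agent` type, the parameterless markRetired,
  and the injected `heartbeat` function are simplifications (the real
  helper takes source, statusCode and body, and the real loop waits on
  a ticker instead of spinning).

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync"
)

var ErrAgentRetired = errors.New("agent retired")

type Agent struct {
	retiredOnce   sync.Once
	retiredSignal chan struct{}
}

func NewAgent() *Agent {
	return &Agent{retiredSignal: make(chan struct{})}
}

// markRetired closes the signal channel exactly once, so concurrent or
// repeated 410 responses cannot trigger a double-close panic.
func (a *Agent) markRetired() {
	a.retiredOnce.Do(func() { close(a.retiredSignal) })
}

// Run calls heartbeat until the server answers 410 Gone, then returns
// ErrAgentRetired. heartbeat returns the HTTP status code it observed.
func (a *Agent) Run(heartbeat func() int) error {
	for {
		if heartbeat() == http.StatusGone {
			a.markRetired()
		}
		select {
		case <-a.retiredSignal:
			return ErrAgentRetired
		default:
			// Not retired yet; a real loop would block on a ticker here.
		}
	}
}

func main() {
	calls := 0
	err := NewAgent().Run(func() int {
		calls++
		if calls < 3 {
			return http.StatusOK
		}
		return http.StatusGone
	})
	if errors.Is(err, ErrAgentRetired) {
		fmt.Println("retired after", calls, "heartbeats; exiting cleanly")
	}
}
```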

MCP transport:
  internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
  a new additive transport method. Client.Delete is path-only; without
  this method the retire tool would silently drop `force` and `reason`,
  turning every cascade retire into a default soft-retire. The new
  method shares do()'s 204 normalization and 4xx/5xx error
  propagation so tool authors get one contract.
  internal/mcp/tools.go + internal/mcp/types.go expose the
  retire_agent tool with Force+Reason inputs wired through
  DeleteWithQuery.
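
  Why a query-aware DELETE matters can be shown with net/url alone.
  The sketch below is not the MCP client; `newDeleteWithQuery`,
  `retireAgentURL` and the base URL are hypothetical, and only the
  /api/v1/agents/{id} path and the force/reason parameter names are
  taken from this commit.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// newDeleteWithQuery builds a DELETE request whose URL carries the
// encoded query. A path-only variant would drop query entirely.
func newDeleteWithQuery(baseURL, path string, query url.Values) (*http.Request, error) {
	u := baseURL + path
	if len(query) > 0 {
		u += "?" + query.Encode()
	}
	return http.NewRequest(http.MethodDelete, u, nil)
}

// retireAgentURL returns the URL a retire call would hit.
func retireAgentURL(baseURL, id string, force bool, reason string) (string, error) {
	q := url.Values{}
	if force {
		q.Set("force", "true")
	}
	if reason != "" {
		q.Set("reason", reason)
	}
	req, err := newDeleteWithQuery(baseURL, "/api/v1/agents/"+url.PathEscape(id), q)
	if err != nil {
		return "", err
	}
	return req.URL.String(), nil
}

func main() {
	u, err := retireAgentURL("http://localhost:8443", "a-prod-001", true, "decommissioning rack 7")
	if err != nil {
		panic(err)
	}
	fmt.Println(u)
}
```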

CLI:
  cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
  `agents list --retired` (client-side strip of --retired then
  delegation to ListRetiredAgents, sharing --page/--per-page parsing
  with the default listing) and `agents retire <id> [--force --reason
  "…"]` (mirrors ErrForceReasonRequired — force without reason is
  rejected client-side before the request is sent). JSON + table
  output modes both honor the new columns.

Frontend:
  web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
  web/src/api/client.ts + web/src/api/types.ts expose the retire
  endpoint and the retired-listing. 4 new Vitest regression cases.

OpenAPI:
  api/openapi.yaml documents DELETE /agents/{id} with all seven
  status codes, 410 on heartbeat, and the 409 per-bucket body shape.

Regression coverage (six new test files, all green):
  internal/service/agent_retire_test.go           — 8-step contract + sentinel guards
  internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
  internal/mcp/retire_agent_test.go               — DeleteWithQuery wire-through
  internal/cli/agent_retire_test.go               — --retired listing + --force/--reason pairing
  internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
  internal/domain/connector_test.go               — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies

Files:
  api/openapi.yaml                                — DELETE + 410 + 409 body shape
  cmd/agent/main.go                               — ErrAgentRetired, markRetired, retiredSignal
  cmd/cli/main.go                                 — handleAgents list/get/retire dispatch
  docs/architecture.md, docs/concepts.md,
    docs/testing-guide.md                         — retirement contract narrative
  internal/api/handler/agents.go                  — RetireAgent, status surface, 410 on heartbeat
  internal/api/handler/agent_handler_test.go      — extended coverage
  internal/api/handler/agent_retire_handler_test.go — new
  internal/api/router/router.go                   — /agents/retired before /agents/{id}
  internal/cli/agent_retire_test.go               — new
  internal/cli/client.go                          — ListRetiredAgents + RetireAgent
  internal/domain/connector.go                    — IsRetired, SentinelAgentIDs,
                                                    IsSentinelAgent, AgentDependencyCounts,
                                                    ActorTypeAgent/System
  internal/domain/connector_test.go               — new
  internal/integration/lifecycle_test.go          — retirement fixture
  internal/mcp/client.go                          — DeleteWithQuery additive transport
  internal/mcp/retire_agent_test.go               — new
  internal/mcp/tools.go, internal/mcp/types.go    — retire_agent tool + Force/Reason inputs
  internal/repository/interfaces.go               — AgentRepository retirement methods
  internal/repository/postgres/agent.go           — retire + cascade target retire + counts
  internal/repository/postgres/migration_000015_test.go — new
  internal/service/agent.go                       — wire into AgentService surface
  internal/service/agent_retire.go                — new 8-step contract
  internal/service/agent_retire_test.go           — new
  internal/service/deployment.go                  — skip retired agents
  internal/service/target.go                      — skip retired agents
  internal/service/testutil_test.go               — shared mocks extended
  migrations/000015_agent_retire.up.sql           — new
  migrations/000015_agent_retire.down.sql         — new
  web/src/api/client.ts, types.ts + tests         — retire endpoint wiring
  web/src/pages/AgentsPage.tsx                    — retire UI
shankar0123
2026-04-19 05:24:00 +00:00
parent 1ee77c89f8
commit 0725713e19
35 changed files with 4400 additions and 33 deletions
@@ -10,6 +10,7 @@ import (
"time"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// MockAgentService is a mock implementation of AgentService interface.
@@ -24,6 +25,11 @@ type MockAgentService struct {
GetWorkFn func(agentID string) ([]domain.Job, error)
GetWorkWithTargetsFn func(agentID string) ([]domain.WorkItem, error)
UpdateJobStatusFn func(agentID string, jobID string, status string, errMsg string) error
// I-004: soft-retirement hooks. Tests that don't set these receive nil
// results and nil errors, which mirrors the safest default (no-op) for
// unrelated suites that mock only the legacy surface.
RetireAgentFn func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error)
ListRetiredAgentsFn func(page, perPage int) ([]domain.Agent, int64, error)
}
func (m *MockAgentService) ListAgents(_ context.Context, page, perPage int) ([]domain.Agent, int64, error) {
@@ -96,6 +102,25 @@ func (m *MockAgentService) UpdateJobStatus(_ context.Context, agentID string, jo
return nil
}
// RetireAgent is the I-004 soft-retirement entrypoint. Tests that don't set
// RetireAgentFn get a nil result + nil error, which is a no-op response that
// lets unrelated suites compile without caring about the retirement surface.
func (m *MockAgentService) RetireAgent(_ context.Context, agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
if m.RetireAgentFn != nil {
return m.RetireAgentFn(agentID, actor, force, reason)
}
return nil, nil
}
// ListRetiredAgents returns retired rows for the retired-agents tab / audit
// views. Same zero-value default as RetireAgent for unrelated tests.
func (m *MockAgentService) ListRetiredAgents(_ context.Context, page, perPage int) ([]domain.Agent, int64, error) {
if m.ListRetiredAgentsFn != nil {
return m.ListRetiredAgentsFn(page, perPage)
}
return nil, 0, nil
}
// Test ListAgents - success case
func TestListAgents_Success(t *testing.T) {
now := time.Now()
@@ -0,0 +1,393 @@
package handler
import (
"context"
"encoding/json"
"errors"
"net/http"
"net/http/httptest"
"testing"
"time"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// agentRetireTestSetup builds an AgentHandler with a mock AgentService whose
// RetireAgent / ListRetiredAgents / Heartbeat behavior is driven by the
// returned mock. Keeps every I-004 handler test self-contained so a single
// failing assertion can't cascade through a shared fixture.
func agentRetireTestSetup() (*MockAgentService, AgentHandler) {
mock := &MockAgentService{}
handler := NewAgentHandler(mock)
return mock, handler
}
// TestRetireAgentHandler_Success_200 pins the happy-path contract for the
// soft-retirement HTTP surface: DELETE /api/v1/agents/{id} with no dependency
// fallout returns 200 OK and a JSON body echoing retirement metadata
// (retired_at timestamp, already_retired=false, cascade=false, zero counts).
// Operators building dashboards parse these fields; keep the shape stable.
func TestRetireAgentHandler_Success_200(t *testing.T) {
retiredAt := time.Date(2026, 4, 18, 12, 0, 0, 0, time.UTC)
mock, handler := agentRetireTestSetup()
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
if agentID != "a-prod-001" {
t.Fatalf("retire handler received agentID=%q want a-prod-001", agentID)
}
if force {
t.Fatalf("retire handler set force=true unexpectedly; default path must be force=false")
}
return &service.AgentRetirementResult{
AlreadyRetired: false,
Cascade: false,
RetiredAt: retiredAt,
Counts: domain.AgentDependencyCounts{},
}, nil
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusOK {
t.Fatalf("status=%d body=%s want 200", w.Code, w.Body.String())
}
var body struct {
RetiredAt time.Time `json:"retired_at"`
AlreadyRetired bool `json:"already_retired"`
Cascade bool `json:"cascade"`
Counts domain.AgentDependencyCounts `json:"counts"`
}
if err := json.NewDecoder(w.Body).Decode(&body); err != nil {
t.Fatalf("decode 200 body: %v", err)
}
if !body.RetiredAt.Equal(retiredAt) {
t.Errorf("retired_at=%v want %v", body.RetiredAt, retiredAt)
}
if body.AlreadyRetired {
t.Errorf("already_retired=true want false on clean retire")
}
if body.Cascade {
t.Errorf("cascade=true want false on clean retire")
}
}
// TestRetireAgentHandler_AlreadyRetired_204 covers the idempotent contract: a
// retire call against an already-retired agent completes with 204 No Content
// (no body). This lets operators safely re-issue the DELETE after a network
// blip without fearing duplicate audit events or state mutations.
func TestRetireAgentHandler_AlreadyRetired_204(t *testing.T) {
mock, handler := agentRetireTestSetup()
past := time.Now().Add(-24 * time.Hour)
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
return &service.AgentRetirementResult{
AlreadyRetired: true,
Cascade: false,
RetiredAt: past,
Counts: domain.AgentDependencyCounts{},
}, nil
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusNoContent {
t.Fatalf("status=%d body=%s want 204", w.Code, w.Body.String())
}
// 204 No Content must have zero body. If anything leaks through, downstream
// clients (curl scripts, dashboards) break.
if w.Body.Len() != 0 {
t.Errorf("204 body=%q want empty", w.Body.String())
}
}
// TestRetireAgentHandler_Sentinel_403 covers the hard guard against retiring
// any of the four sentinel agents that back discovery sources and the
// network scanner. These IDs are reserved; the handler must surface the
// service-layer ErrAgentIsSentinel as 403 Forbidden regardless of force/reason
// because no operator intent can legitimately retire them.
func TestRetireAgentHandler_Sentinel_403(t *testing.T) {
sentinels := []string{"server-scanner", "cloud-aws-sm", "cloud-azure-kv", "cloud-gcp-sm"}
for _, id := range sentinels {
t.Run(id, func(t *testing.T) {
mock, handler := agentRetireTestSetup()
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
return nil, service.ErrAgentIsSentinel
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/"+id, nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusForbidden {
t.Fatalf("sentinel %q status=%d body=%s want 403", id, w.Code, w.Body.String())
}
})
}
}
// TestRetireAgentHandler_NotFound_404 covers the lookup-miss path. Service
// returns a not-found error; handler maps to 404. The handler currently
// detects this case by matching "not found" in the error string (kept for
// compatibility with repo error strings), and only after the errors.Is /
// errors.As checks for the typed errors have run.
func TestRetireAgentHandler_NotFound_404(t *testing.T) {
mock, handler := agentRetireTestSetup()
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
return nil, errors.New("agent not found")
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/unknown-id", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusNotFound {
t.Fatalf("status=%d body=%s want 404", w.Code, w.Body.String())
}
}
// TestRetireAgentHandler_Blocked_409_WithCounts covers the preflight-blocked
// path. Service returns *BlockedByDependenciesError wrapping
// ErrBlockedByDependencies; handler unwraps via errors.As, maps to 409, and
// MUST include the counts in the response body so operators know what's
// blocking them. Without counts the 409 is useless — the operator has to
// guess which downstream dependency is holding up the retirement.
func TestRetireAgentHandler_Blocked_409_WithCounts(t *testing.T) {
mock, handler := agentRetireTestSetup()
blockCounts := domain.AgentDependencyCounts{
ActiveTargets: 3,
ActiveCertificates: 7,
PendingJobs: 2,
}
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
return nil, &service.BlockedByDependenciesError{Counts: blockCounts}
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusConflict {
t.Fatalf("status=%d body=%s want 409", w.Code, w.Body.String())
}
var body struct {
Error string `json:"error"`
Message string `json:"message"`
Counts domain.AgentDependencyCounts `json:"counts"`
}
if err := json.NewDecoder(w.Body).Decode(&body); err != nil {
t.Fatalf("decode 409 body: %v", err)
}
if body.Counts.ActiveTargets != 3 {
t.Errorf("counts.active_targets=%d want 3", body.Counts.ActiveTargets)
}
if body.Counts.ActiveCertificates != 7 {
t.Errorf("counts.active_certificates=%d want 7", body.Counts.ActiveCertificates)
}
if body.Counts.PendingJobs != 2 {
t.Errorf("counts.pending_jobs=%d want 2", body.Counts.PendingJobs)
}
if body.Message == "" {
t.Errorf("409 body missing human-readable message; operators need guidance")
}
}
// TestRetireAgentHandler_Force_NoReason_400 covers the force-escape-hatch
// guardrail: force=true without a non-empty reason must be rejected with 400
// because a reason-less cascade is unauditable. The handler forwards force
// and the empty reason verbatim; the service owns the rule and returns
// ErrForceReasonRequired, which the handler maps to 400.
func TestRetireAgentHandler_Force_NoReason_400(t *testing.T) {
mock, handler := agentRetireTestSetup()
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
if !force {
t.Fatalf("handler did not forward force=true; force query param was dropped")
}
if reason != "" {
t.Fatalf("handler passed reason=%q; empty reason must reach service for error path", reason)
}
return nil, service.ErrForceReasonRequired
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001?force=true", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusBadRequest {
t.Fatalf("status=%d body=%s want 400", w.Code, w.Body.String())
}
}
// TestRetireAgentHandler_ForceCascade_200 covers the successful force-cascade
// path: DELETE ?force=true&reason=... → service executes transactional
// cascade → 200 with cascade=true and the pre-cascade counts echoed back so
// the operator's confirmation dialog can show "I just retired N targets,
// M certificates, K pending jobs."
func TestRetireAgentHandler_ForceCascade_200(t *testing.T) {
mock, handler := agentRetireTestSetup()
retiredAt := time.Date(2026, 4, 18, 14, 30, 0, 0, time.UTC)
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
if !force {
t.Fatalf("handler did not forward force=true; query-param parsing broken")
}
if reason != "decommissioning rack 7" {
t.Fatalf("handler forwarded reason=%q want %q", reason, "decommissioning rack 7")
}
return &service.AgentRetirementResult{
AlreadyRetired: false,
Cascade: true,
RetiredAt: retiredAt,
Counts: domain.AgentDependencyCounts{
ActiveTargets: 2,
ActiveCertificates: 5,
PendingJobs: 1,
},
}, nil
}
url := "/api/v1/agents/a-prod-001?force=true&reason=decommissioning+rack+7"
req := httptest.NewRequest(http.MethodDelete, url, nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusOK {
t.Fatalf("status=%d body=%s want 200", w.Code, w.Body.String())
}
var body struct {
RetiredAt time.Time `json:"retired_at"`
AlreadyRetired bool `json:"already_retired"`
Cascade bool `json:"cascade"`
Counts domain.AgentDependencyCounts `json:"counts"`
}
if err := json.NewDecoder(w.Body).Decode(&body); err != nil {
t.Fatalf("decode force-cascade 200 body: %v", err)
}
if !body.Cascade {
t.Errorf("cascade=false want true on ?force=true successful retire")
}
if body.Counts.ActiveTargets != 2 || body.Counts.ActiveCertificates != 5 || body.Counts.PendingJobs != 1 {
t.Errorf("counts=%+v want {ActiveTargets:2 ActiveCertificates:5 PendingJobs:1}", body.Counts)
}
}
// TestHeartbeatHandler_RetiredAgent_410 covers the agent-shutdown signal. A
// retired agent that is still polling must be told its identity is gone
// (410 Gone) rather than offered the normal 200 "recorded" response.
// cmd/agent treats 410 as a terminal signal and exits rather than looping
// forever against a decommissioned identity. Service returns ErrAgentRetired;
// handler maps to 410.
func TestHeartbeatHandler_RetiredAgent_410(t *testing.T) {
mock, handler := agentRetireTestSetup()
mock.HeartbeatFn = func(agentID string, metadata *domain.AgentMetadata) error {
return service.ErrAgentRetired
}
req := httptest.NewRequest(http.MethodPost, "/api/v1/agents/a-prod-001/heartbeat", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.Heartbeat(w, req)
if w.Code != http.StatusGone {
t.Fatalf("heartbeat(retired) status=%d body=%s want 410", w.Code, w.Body.String())
}
}
// TestListRetiredAgentsHandler_Success covers the audit/forensics-facing
// endpoint GET /api/v1/agents/retired. Returns a paged list of retired rows
// alongside total count so the GUI can render a "Retired Agents" tab with
// pagination. Default listing (GET /agents) hides retired rows; this is the
// opt-in surface for them.
func TestListRetiredAgentsHandler_Success(t *testing.T) {
past := time.Now().Add(-48 * time.Hour)
reason := "old hardware"
retired := []domain.Agent{
{
ID: "agent-retired-01",
Name: "decom-01",
Hostname: "server-old",
Status: domain.AgentStatusOffline,
RegisteredAt: past,
RetiredAt: &past,
RetiredReason: &reason,
},
}
mock, handler := agentRetireTestSetup()
mock.ListRetiredAgentsFn = func(page, perPage int) ([]domain.Agent, int64, error) {
if page != 1 || perPage != 50 {
t.Fatalf("ListRetired handler received page=%d perPage=%d want 1/50 defaults", page, perPage)
}
return retired, 1, nil
}
req := httptest.NewRequest(http.MethodGet, "/api/v1/agents/retired", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.ListRetiredAgents(w, req)
if w.Code != http.StatusOK {
t.Fatalf("status=%d body=%s want 200", w.Code, w.Body.String())
}
var response PagedResponse
if err := json.NewDecoder(w.Body).Decode(&response); err != nil {
t.Fatalf("decode list-retired body: %v", err)
}
if response.Total != 1 {
t.Errorf("total=%d want 1", response.Total)
}
}
// TestRetireAgentHandler_MethodNotAllowed covers defense-in-depth: only
// DELETE is valid on /api/v1/agents/{id} for retirement. Using POST/PUT/PATCH
// must be rejected with 405 so misconfigured callers don't accidentally
// trigger retirement via a wrong-method request.
func TestRetireAgentHandler_MethodNotAllowed(t *testing.T) {
_, handler := agentRetireTestSetup()
for _, method := range []string{http.MethodPost, http.MethodPut, http.MethodPatch} {
t.Run(method, func(t *testing.T) {
req := httptest.NewRequest(method, "/api/v1/agents/a-prod-001", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusMethodNotAllowed {
t.Fatalf("method=%s status=%d want 405", method, w.Code)
}
})
}
}
// Compile-time asserts: the mock must satisfy the handler's AgentService
// interface. Red state: this fails until the interface grows RetireAgent +
// ListRetiredAgents. Once Phase 2b adds those methods to AgentService, this
// assertion goes green along with every test above.
var _ AgentService = (*MockAgentService)(nil)
// Unused-import suppressor for context — the package-level tests already
// pull context from agent_handler_test.go, but leaving this here documents
// that the mock methods receive context.Context values even though this
// file's tests don't construct them directly (they ride on httptest.NewRequest).
var _ = context.Background
@@ -3,16 +3,24 @@ package handler
import (
"context"
"encoding/json"
"errors"
"log/slog"
"net/http"
"strconv"
"strings"
"time"
"github.com/shankar0123/certctl/internal/api/middleware"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// AgentService defines the service interface for agent operations.
//
// I-004 expansion: RetireAgent + ListRetiredAgents back the soft-retirement
// surface. The handler depends on the service-package's AgentRetirementResult
// and BlockedByDependenciesError types for result shape + errors.As unwrap,
// which is why this file imports internal/service.
type AgentService interface {
ListAgents(ctx context.Context, page, perPage int) ([]domain.Agent, int64, error)
GetAgent(ctx context.Context, id string) (*domain.Agent, error)
@@ -24,6 +32,10 @@ type AgentService interface {
GetWork(ctx context.Context, agentID string) ([]domain.Job, error)
GetWorkWithTargets(ctx context.Context, agentID string) ([]domain.WorkItem, error)
UpdateJobStatus(ctx context.Context, agentID string, jobID string, status string, errMsg string) error
// I-004 soft-retirement API. Both default to no-op (nil result / nil error)
// in mocks that don't override them — handler tests opt in per suite.
RetireAgent(ctx context.Context, agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error)
ListRetiredAgents(ctx context.Context, page, perPage int) ([]domain.Agent, int64, error)
}
// AgentHandler handles HTTP requests for agent operations.
@@ -190,6 +202,15 @@ func (h AgentHandler) Heartbeat(w http.ResponseWriter, r *http.Request) {
}
if err := h.svc.Heartbeat(r.Context(), agentID, metadata); err != nil {
// I-004: a retired agent still polling must receive 410 Gone so
// cmd/agent detects the terminal signal and shuts down cleanly
// instead of looping forever against a decommissioned identity.
// Check this FIRST — before "not found" string matching — so the
// retired-path is never masked by a sibling error branch.
if errors.Is(err, service.ErrAgentRetired) {
ErrorWithRequestID(w, http.StatusGone, "Agent has been retired", requestID)
return
}
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Agent not found", requestID)
return
@@ -376,3 +397,181 @@ func (h AgentHandler) AgentReportJobStatus(w http.ResponseWriter, r *http.Reques
"status": "updated",
})
}
// RetireAgent executes the I-004 soft-retirement surface.
// DELETE /api/v1/agents/{id}[?force=true&reason=...]
//
// Contract (pinned by agent_retire_handler_test.go):
//
// 405 any method other than DELETE
// 200 clean retire (body: retired_at, already_retired=false, cascade=false, counts=0s)
// 200 force-cascade retire (body: cascade=true, counts=pre-cascade snapshot)
// 204 idempotent retire of an already-retired agent (NO body — downstream
// clients that tee responses into dashboards break on spurious bodies)
// 400 force=true without a non-empty reason (ErrForceReasonRequired)
// 403 one of the four reserved sentinel IDs (ErrAgentIsSentinel)
// 404 agent does not exist ("not found" string match, kept for compat with
// repo error strings; sentinel checks run first so they never mask)
// 409 blocked by preflight counts (*BlockedByDependenciesError) — body
// carries the per-bucket counts so the operator UI can tell the
// human which downstream dependency is holding up the retirement,
// rather than forcing them to re-run the DELETE with ?force=true
// and guess
// 500 anything else
//
// The 409 body intentionally does NOT go through ErrorWithRequestID because
// that helper's ErrorResponse shape has no `counts` field — we inline-marshal
// a custom body instead. Keeping this shape stable is important: the GUI
// pattern is "show the 409 dialog, list the N targets / M certs / K jobs
// blocking, let the operator retire them first or tick the force checkbox."
func (h AgentHandler) RetireAgent(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodDelete {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
requestID := middleware.GetRequestID(r.Context())
// Extract {id} from /api/v1/agents/{id}. Mirror GetAgent's pattern so
// the path parser is identical across the agent handler surface and a
// future refactor can extract it once without introducing drift.
rawID := strings.TrimPrefix(r.URL.Path, "/api/v1/agents/")
parts := strings.Split(rawID, "/")
if len(parts) == 0 || parts[0] == "" {
ErrorWithRequestID(w, http.StatusBadRequest, "Agent ID is required", requestID)
return
}
id := parts[0]
// Parse optional force + reason. A missing `force` param is treated as
// force=false (the default, safe path); anything strconv.ParseBool rejects
// is also force=false so a malformed query can never silently enable the
// cascade. The reason string is passed through verbatim — the service
// owns the "force=true requires reason" rule.
query := r.URL.Query()
force := false
if fv := query.Get("force"); fv != "" {
if parsed, err := strconv.ParseBool(fv); err == nil {
force = parsed
}
}
reason := query.Get("reason")
actor := resolveActor(r.Context())
result, err := h.svc.RetireAgent(r.Context(), id, actor, force, reason)
if err != nil {
// Sentinel + typed-error checks run BEFORE string matching on "not
// found" so a repo error that happens to contain those words can
// never mask a structural refusal (403/400/409). Order matters.
if errors.Is(err, service.ErrAgentIsSentinel) {
ErrorWithRequestID(w, http.StatusForbidden, "Agent is a reserved sentinel and cannot be retired", requestID)
return
}
if errors.Is(err, service.ErrForceReasonRequired) {
ErrorWithRequestID(w, http.StatusBadRequest, "force=true requires a non-empty reason", requestID)
return
}
var blocked *service.BlockedByDependenciesError
if errors.As(err, &blocked) {
// Custom 409 body with per-bucket counts. ErrorResponse has no
// `counts` field, so we marshal a bespoke struct instead.
// Keep `error`/`message`/`counts` as the stable shape — any
// dashboard parsing this relies on those three keys.
body := struct {
Error string `json:"error"`
Message string `json:"message"`
Counts domain.AgentDependencyCounts `json:"counts"`
}{
Error: "blocked_by_dependencies",
Message: "Agent has active downstream dependencies. Retire or reassign them " +
"first, or re-run with ?force=true&reason=... to cascade.",
Counts: blocked.Counts,
}
JSON(w, http.StatusConflict, body)
return
}
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Agent not found", requestID)
return
}
slog.Error("RetireAgent failed", "agent_id", id, "error", err.Error())
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to retire agent", requestID)
return
}
// Idempotent retire: the agent was already retired, so we return 204 No
// Content with a ZERO-length body. The Red contract (test line 106) fails
// if even a trailing newline leaks into the response. WriteHeader alone
// emits the status without invoking the JSON encoder.
if result.AlreadyRetired {
w.WriteHeader(http.StatusNoContent)
return
}
// Clean retire (force=false) or successful cascade (force=true). Body
// shape pinned by Red contract: retired_at, already_retired, cascade,
// counts. Omitempty is deliberately NOT used — operators parsing the
// response expect every field to always be present.
JSON(w, http.StatusOK, struct {
RetiredAt time.Time `json:"retired_at"`
AlreadyRetired bool `json:"already_retired"`
Cascade bool `json:"cascade"`
Counts domain.AgentDependencyCounts `json:"counts"`
}{
RetiredAt: result.RetiredAt,
AlreadyRetired: result.AlreadyRetired,
Cascade: result.Cascade,
Counts: result.Counts,
})
}
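The error-ordering rule in RetireAgent (typed checks first, "not found" substring match last) can be illustrated in isolation. This is a minimal sketch, not the handler itself: `errAgentIsSentinel`, `errForceReasonRequired`, `errNotFound`, `blockedError`, and `statusFor` are hypothetical stand-ins for the service-package errors named above.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
)

// Hypothetical stand-ins for the service-layer errors (assumptions, not the
// real service package).
var (
	errAgentIsSentinel     = errors.New("agent is a reserved sentinel")
	errForceReasonRequired = errors.New("force requires a reason")
	errNotFound            = errors.New("agent not found")
)

// blockedError mimics a typed error that carries preflight counts.
type blockedError struct{ targets, certs, jobs int }

func (e *blockedError) Error() string {
	return fmt.Sprintf("blocked by dependencies: targets=%d certs=%d jobs=%d",
		e.targets, e.certs, e.jobs)
}

// statusFor maps an error to an HTTP status. Typed checks run first; the
// "not found" substring match is the last resort before falling back to 500.
func statusFor(err error) int {
	var blocked *blockedError
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, errAgentIsSentinel):
		return http.StatusForbidden
	case errors.Is(err, errForceReasonRequired):
		return http.StatusBadRequest
	case errors.As(err, &blocked):
		return http.StatusConflict
	case strings.Contains(err.Error(), "not found"):
		return http.StatusNotFound
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	// The message contains "not found", but the wrapped sentinel error is
	// matched first, so the result is 403 rather than 404.
	wrapped := fmt.Errorf("replacement not found: %w", errAgentIsSentinel)
	fmt.Println(statusFor(wrapped))                   // 403
	fmt.Println(statusFor(errNotFound))               // 404
	fmt.Println(statusFor(&blockedError{targets: 2})) // 409
}
```

If the substring check ran first, the wrapped sentinel error in `main` would be reported as 404, which is the masking the handler comment warns about.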
// ListRetiredAgents returns the opt-in listing of retired agents for the
// operator UI's "Retired" tab and for audit/forensics workflows.
// GET /api/v1/agents/retired?page=1&per_page=50
//
// The default ListAgents handler hides retired rows; this is the dedicated
// surface for reading them back. Pagination defaults match ListAgents so
// the GUI can reuse the same query hook (page=1, per_page=50). A per_page
// above 500, or any non-positive or unparsable value, falls back to 50.
//
// Go 1.22's enhanced ServeMux routes `/agents/retired` to this handler via
// the literal-beats-pattern-var precedence rule (literal `retired` wins over
// `{id}` in the sibling GET /api/v1/agents/{id} route), so both entries can
// coexist without conflict. If that precedence ever regresses, the failure
// mode is TestListRetiredAgentsHandler_Success blowing up with a 404 — which
// is the fast signal we want.
func (h AgentHandler) ListRetiredAgents(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
requestID := middleware.GetRequestID(r.Context())
page := 1
perPage := 50
query := r.URL.Query()
if p := query.Get("page"); p != "" {
if parsed, err := strconv.Atoi(p); err == nil && parsed > 0 {
page = parsed
}
}
if pp := query.Get("per_page"); pp != "" {
if parsed, err := strconv.Atoi(pp); err == nil && parsed > 0 && parsed <= 500 {
perPage = parsed
}
}
agents, total, err := h.svc.ListRetiredAgents(r.Context(), page, perPage)
if err != nil {
slog.Error("ListRetiredAgents failed", "request_id", requestID, "error", err.Error())
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to list retired agents", requestID)
return
}
JSON(w, http.StatusOK, PagedResponse{
Data: agents,
Total: total,
Page: page,
PerPage: perPage,
})
}
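The pagination parsing in ListRetiredAgents follows a "parse or fall back" pattern: a value that is missing, unparsable, non-positive, or over the cap leaves the default in place. A small sketch of that rule, where `boundedInt` is a hypothetical helper name and not part of the handler package:

```go
package main

import (
	"fmt"
	"strconv"
)

// boundedInt returns raw as an int when it parses, is positive, and does not
// exceed max (max <= 0 means "no upper bound"). Otherwise it returns def.
// Note that over-cap values fall back to def; they are not clamped to max.
func boundedInt(raw string, def, max int) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 || (max > 0 && n > max) {
		return def
	}
	return n
}

func main() {
	fmt.Println(boundedInt("", 1, 0))       // 1: missing page uses the default
	fmt.Println(boundedInt("3", 1, 0))      // 3
	fmt.Println(boundedInt("500", 50, 500)) // 500: exactly at the cap is accepted
	fmt.Println(boundedInt("501", 50, 500)) // 50: over the cap falls back
}
```

Falling back instead of clamping means a client asking for 1000 rows gets 50, not 500. Whether that is the intended contract is a design choice the sketch only mirrors.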
@@ -131,9 +131,21 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) {
r.Register("POST /api/v1/targets/{id}/test", http.HandlerFunc(reg.Targets.TestTargetConnection))
// Agents routes: /api/v1/agents
//
// I-004 soft-retirement surface:
// * GET /api/v1/agents/retired — opt-in listing of retired agents.
// Listed ahead of /agents/{id} for readability only: Go 1.22 ServeMux
// picks the most specific pattern regardless of registration order,
// so the `retired` literal routes to ListRetiredAgents instead of
// being captured as an {id} parameter value by GetAgent.
// * DELETE /api/v1/agents/{id} — RetireAgent. Replaces the pre-I-004
// hard-delete; the underlying repo does a soft-retire with
// optional cascade.
r.Register("GET /api/v1/agents", http.HandlerFunc(reg.Agents.ListAgents))
r.Register("POST /api/v1/agents", http.HandlerFunc(reg.Agents.RegisterAgent))
r.Register("GET /api/v1/agents/retired", http.HandlerFunc(reg.Agents.ListRetiredAgents))
r.Register("GET /api/v1/agents/{id}", http.HandlerFunc(reg.Agents.GetAgent))
r.Register("DELETE /api/v1/agents/{id}", http.HandlerFunc(reg.Agents.RetireAgent))
r.Register("POST /api/v1/agents/{id}/heartbeat", http.HandlerFunc(reg.Agents.Heartbeat))
r.Register("POST /api/v1/agents/{id}/csr", http.HandlerFunc(reg.Agents.AgentCSRSubmit))
r.Register("GET /api/v1/agents/{id}/certificates/{cert_id}", http.HandlerFunc(reg.Agents.AgentCertificatePickup))
@@ -0,0 +1,228 @@
package cli
import (
"encoding/json"
"net/http"
"net/http/httptest"
"testing"
)
// TestClient_RetireAgent_Success pins the I-004 CLI happy path: the operator
// runs `certctl-cli agents retire <id>` and the client issues a DELETE to
// /api/v1/agents/{id}, parses the 200 JSON body (retired_at, already_retired,
// cascade, counts), and reports success. The handler test already covers the
// server-side contract; this test covers the client-side wire formatting so a
// refactor of the server's 200 body shape can't silently break the CLI.
func TestClient_RetireAgent_Success(t *testing.T) {
var (
sawMethod string
sawPath string
sawForce string
sawReason string
)
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
sawMethod = r.Method
sawPath = r.URL.Path
sawForce = r.URL.Query().Get("force")
sawReason = r.URL.Query().Get("reason")
if r.Method != "DELETE" || r.URL.Path != "/api/v1/agents/ag-1" {
w.WriteHeader(http.StatusNotFound)
return
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_ = json.NewEncoder(w).Encode(map[string]interface{}{
"retired_at": "2026-04-18T12:00:00Z",
"already_retired": false,
"cascade": false,
"counts": map[string]interface{}{
"active_targets": 0,
"active_certificates": 0,
"pending_jobs": 0,
},
})
}))
defer server.Close()
client := NewClient(server.URL, "", "table")
// Positional arg: the agent ID. No --force, no --reason — the default
// soft-retire path. Compile-fail until client.RetireAgent exists.
if err := client.RetireAgent([]string{"ag-1"}); err != nil {
t.Fatalf("RetireAgent(ag-1) err=%v want nil", err)
}
if sawMethod != "DELETE" {
t.Errorf("method=%q want DELETE", sawMethod)
}
if sawPath != "/api/v1/agents/ag-1" {
t.Errorf("path=%q want /api/v1/agents/ag-1", sawPath)
}
if sawForce != "" {
t.Errorf("force query=%q want empty (default path sends no force)", sawForce)
}
if sawReason != "" {
t.Errorf("reason query=%q want empty (default path sends no reason)", sawReason)
}
}
// TestClient_RetireAgent_Force_WithReason_Success pins the ?force=true&reason=...
// escape hatch wiring. Operators who supply --force + --reason get their values
// propagated as URL query parameters exactly once, so the server sees the same
// contract the handler test expects. Also verifies the cascade=true response
// body parses cleanly.
func TestClient_RetireAgent_Force_WithReason_Success(t *testing.T) {
var (
sawForce string
sawReason string
)
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
sawForce = r.URL.Query().Get("force")
sawReason = r.URL.Query().Get("reason")
if r.Method != "DELETE" || r.URL.Path != "/api/v1/agents/ag-1" {
w.WriteHeader(http.StatusNotFound)
return
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_ = json.NewEncoder(w).Encode(map[string]interface{}{
"retired_at": "2026-04-18T12:00:00Z",
"already_retired": false,
"cascade": true,
"counts": map[string]interface{}{
"active_targets": 2,
"active_certificates": 5,
"pending_jobs": 1,
},
})
}))
defer server.Close()
client := NewClient(server.URL, "", "table")
if err := client.RetireAgent([]string{"ag-1", "--force", "--reason", "decommissioning rack 7"}); err != nil {
t.Fatalf("RetireAgent(force+reason) err=%v want nil", err)
}
if sawForce != "true" {
t.Errorf("force query=%q want \"true\"", sawForce)
}
if sawReason != "decommissioning rack 7" {
t.Errorf("reason query=%q want %q", sawReason, "decommissioning rack 7")
}
}
// TestClient_RetireAgent_Force_RequiresReason pins the client-side guard: using
// --force without --reason must fail BEFORE any HTTP request is made. Without
// this, the client would bounce off the server's 400 ErrForceReasonRequired
// only after a round trip — slow feedback, wasted audit-trail noise, and a
// worse operator experience. requestCount=0 enforces that no HTTP call happens.
func TestClient_RetireAgent_Force_RequiresReason(t *testing.T) {
var requestCount int
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
requestCount++
w.WriteHeader(http.StatusOK)
}))
defer server.Close()
client := NewClient(server.URL, "", "table")
err := client.RetireAgent([]string{"ag-1", "--force"})
if err == nil {
t.Fatalf("RetireAgent(force, no reason) err=nil want client-side error")
}
if !containsStr(err.Error(), "reason") {
t.Errorf("err=%q should mention --reason to guide operator", err.Error())
}
if requestCount != 0 {
t.Fatalf("requestCount=%d want 0; client must short-circuit before HTTP call", requestCount)
}
}
// TestClient_RetireAgent_MissingID covers the other common operator mistake:
// invoking `certctl-cli agents retire` with no agent ID. Must be caught by the
// client with a clear error, not a malformed DELETE to /api/v1/agents/.
func TestClient_RetireAgent_MissingID(t *testing.T) {
var requestCount int
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
requestCount++
w.WriteHeader(http.StatusOK)
}))
defer server.Close()
client := NewClient(server.URL, "", "table")
err := client.RetireAgent([]string{})
if err == nil {
t.Fatalf("RetireAgent([]) err=nil want missing-id error")
}
if requestCount != 0 {
t.Fatalf("requestCount=%d want 0; client must reject missing-id before HTTP", requestCount)
}
}
// TestClient_ListRetiredAgents_Success pins the audit/forensics CLI surface:
// `certctl-cli agents list-retired` must GET /api/v1/agents/retired and render
// the paged response. The server returns a PagedResponse; the client is
// responsible for printing it in table or JSON format, same as ListAgents.
func TestClient_ListRetiredAgents_Success(t *testing.T) {
var (
sawMethod string
sawPath string
)
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
sawMethod = r.Method
sawPath = r.URL.Path
if r.Method != "GET" || r.URL.Path != "/api/v1/agents/retired" {
w.WriteHeader(http.StatusNotFound)
return
}
w.Header().Set("Content-Type", "application/json")
_ = json.NewEncoder(w).Encode(map[string]interface{}{
"data": []map[string]interface{}{
{
"id": "ag-old-01",
"name": "decom-01",
"hostname": "server-old",
"status": "Offline",
"registered_at": "2024-01-01T00:00:00Z",
"retired_at": "2026-01-01T00:00:00Z",
"retired_reason": "old hardware",
},
},
"total": 1,
"page": 1,
"per_page": 50,
})
}))
defer server.Close()
client := NewClient(server.URL, "", "table")
if err := client.ListRetiredAgents([]string{}); err != nil {
t.Fatalf("ListRetiredAgents err=%v want nil", err)
}
if sawMethod != "GET" {
t.Errorf("method=%q want GET", sawMethod)
}
if sawPath != "/api/v1/agents/retired" {
t.Errorf("path=%q want /api/v1/agents/retired", sawPath)
}
}
// TestClient_ListRetiredAgents_ServerError covers the non-happy path: server
// returns 5xx → client surfaces the error rather than silently printing an
// empty list. Without this, operators running the command as part of a
// compliance audit could miss a backend outage.
func TestClient_ListRetiredAgents_ServerError(t *testing.T) {
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
http.Error(w, "db unreachable", http.StatusInternalServerError)
}))
defer server.Close()
client := NewClient(server.URL, "", "table")
err := client.ListRetiredAgents([]string{})
if err == nil {
t.Fatalf("ListRetiredAgents(500) err=nil want propagated error")
}
}
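The client-side contract these tests pin (positional ID first, flags after, and `--force` rejected without `--reason` before any HTTP call) can be sketched as a pure argument parser. `retireArgs` and `parseRetireArgs` are hypothetical names for illustration; the real client method may be organised differently.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"net/url"
	"strings"
)

// retireArgs is a hypothetical container for the parsed command line.
type retireArgs struct {
	id    string
	query url.Values
}

// parseRetireArgs takes args[0] as the agent ID and parses the rest as flags.
// The flag package stops at the first non-flag token, so the ID must be
// removed before fs.Parse sees the slice.
func parseRetireArgs(args []string) (retireArgs, error) {
	if len(args) == 0 {
		return retireArgs{}, errors.New("agent ID is required")
	}
	fs := flag.NewFlagSet("agents retire", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the sketch's output
	force := fs.Bool("force", false, "cascade-retire downstream rows")
	reason := fs.String("reason", "", "required with --force")
	if err := fs.Parse(args[1:]); err != nil {
		return retireArgs{}, err
	}
	// Guard runs before any request is built, so no round trip is wasted.
	if *force && strings.TrimSpace(*reason) == "" {
		return retireArgs{}, errors.New("--reason is required when --force is set")
	}
	q := url.Values{}
	if *force {
		q.Set("force", "true")
	}
	if *reason != "" {
		q.Set("reason", *reason)
	}
	return retireArgs{id: args[0], query: q}, nil
}

func main() {
	ok, err := parseRetireArgs([]string{"ag-1", "--force", "--reason", "decommissioning rack 7"})
	fmt.Println(ok.id, ok.query.Encode(), err)

	_, err = parseRetireArgs([]string{"ag-1", "--force"})
	fmt.Println(err)
}
```

Keeping the parsing free of I/O makes the "zero requests sent" assertion trivial to reason about: the guard returns before anything network-related exists.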
@@ -293,6 +293,194 @@ func (c *Client) ListAgents(args []string) error {
return c.outputAgentsTable(result.Data, result.Total)
}
// ListRetiredAgents lists soft-retired agents from the dedicated endpoint.
//
// I-004: hits GET /api/v1/agents/retired which is a separate route from the
// default listing (the default hides retired rows). Supports --page and
// --per-page just like the active list. Output format mirrors ListAgents
// but appends RETIRED AT and REASON columns so the operator can grep the
// output during forensics.
func (c *Client) ListRetiredAgents(args []string) error {
fs := flag.NewFlagSet("agents list --retired", flag.ContinueOnError)
page := fs.Int("page", 1, "Page number")
perPage := fs.Int("per-page", 50, "Items per page")
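// ContinueOnError makes Parse return the error instead of exiting, so it
// must be checked; otherwise a bad flag silently runs with defaults.
if err := fs.Parse(args); err != nil {
return err
}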
query := url.Values{}
query.Set("page", fmt.Sprintf("%d", *page))
query.Set("per_page", fmt.Sprintf("%d", *perPage))
resp, err := c.do("GET", "/api/v1/agents/retired", query, nil)
if err != nil {
return err
}
var result struct {
Data []map[string]interface{} `json:"data"`
Total int `json:"total"`
}
if err := json.Unmarshal(resp, &result); err != nil {
return fmt.Errorf("parsing response: %w", err)
}
if c.format == "json" {
return c.outputJSON(result)
}
return c.outputRetiredAgentsTable(result.Data, result.Total)
}
// RetireAgent soft-retires an agent via DELETE /api/v1/agents/{id}.
//
// I-004: wraps the full status-code matrix pinned by the handler's
// agent_retire_handler_test.go:
//
// 200 clean retire — body: retired_at, already_retired=false, cascade=false, counts=0
// 200 force-cascade retire — body: cascade=true, counts=pre-cascade snapshot
// 204 idempotent retire — agent was already retired, NO body
// 400 force without reason: rejected by the server (ErrForceReasonRequired)
// 403 sentinel — reserved agent (server-scanner / cloud-*), ErrAgentIsSentinel
// 404 not found — agent doesn't exist
// 409 blocked_by_dependencies — body: error, message, counts
//
// The default (force=false) flow refuses to retire agents with active
// downstream dependencies; the operator must re-run with --force and an
// explicit --reason to cascade. The handler rejects --force without
// --reason with a 400 — we mirror that contract client-side so the
// operator gets a clear error before the round trip.
func (c *Client) RetireAgent(args []string) error {
// Convention: `agents retire <id> [--force] [--reason <reason>]` — the ID
// is a positional arg that precedes the flags. Go's flag package stops
// parsing at the first non-flag token, so we pull args[0] as the ID and
// hand args[1:] to the flag parser. Without this split, `agents retire
// ag-1 --force --reason "x"` would parse with force=false and reason=""
// because the flags land in fs.Args() instead of being recognized.
if len(args) == 0 {
return fmt.Errorf("agent ID is required: agents retire <id> [--force] [--reason <reason>]")
}
id := args[0]
fs := flag.NewFlagSet("agents retire", flag.ContinueOnError)
force := fs.Bool("force", false, "Cascade-retire downstream targets, certs, and jobs")
reason := fs.String("reason", "", "Human-readable reason (required with --force)")
if err := fs.Parse(args[1:]); err != nil {
return err
}
// Mirror the handler's ErrForceReasonRequired contract client-side so
// the operator gets a clear error before the round trip.
if *force && strings.TrimSpace(*reason) == "" {
return fmt.Errorf("--reason is required when --force is set")
}
// Build query string. Skip ?force=false; skip ?reason= when empty.
query := url.Values{}
if *force {
query.Set("force", "true")
}
if *reason != "" {
query.Set("reason", *reason)
}
// Escape the ID so characters such as "/" or "?" cannot alter the request path.
u, err := url.JoinPath(c.baseURL, "/api/v1/agents", url.PathEscape(id))
if err != nil {
return fmt.Errorf("invalid URL: %w", err)
}
if len(query) > 0 {
u = u + "?" + query.Encode()
}
req, err := http.NewRequest("DELETE", u, nil)
if err != nil {
return fmt.Errorf("creating request: %w", err)
}
req.Header.Set("Accept", "application/json")
if c.apiKey != "" {
req.Header.Set("Authorization", "Bearer "+c.apiKey)
}
resp, err := c.httpClient.Do(req)
if err != nil {
return fmt.Errorf("request failed: %w", err)
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
return fmt.Errorf("reading response: %w", err)
}
switch resp.StatusCode {
case http.StatusNoContent:
// 204 idempotent — the agent was already retired. No body.
if c.format == "json" {
return c.outputJSON(map[string]interface{}{
"agent_id": id,
"already_retired": true,
})
}
fmt.Printf("Agent %s was already retired (idempotent)\n", id)
return nil
case http.StatusOK:
var result struct {
RetiredAt string `json:"retired_at"`
AlreadyRetired bool `json:"already_retired"`
Cascade bool `json:"cascade"`
Counts struct {
ActiveTargets int `json:"active_targets"`
ActiveCertificates int `json:"active_certificates"`
PendingJobs int `json:"pending_jobs"`
} `json:"counts"`
}
if err := json.Unmarshal(body, &result); err != nil {
return fmt.Errorf("parsing 200 response: %w", err)
}
if c.format == "json" {
return c.outputJSON(json.RawMessage(body))
}
if result.Cascade {
fmt.Printf("Agent %s retired (cascade). Retired at: %s\n", id, result.RetiredAt)
fmt.Printf(" Cascaded: %d targets, %d certificates, %d jobs\n",
result.Counts.ActiveTargets, result.Counts.ActiveCertificates, result.Counts.PendingJobs)
} else {
fmt.Printf("Agent %s retired. Retired at: %s\n", id, result.RetiredAt)
}
return nil
case http.StatusConflict:
// 409 blocked_by_dependencies. Parse the body so we can show the
// operator which dependency counts are holding up the retire.
var blocked struct {
Error string `json:"error"`
Message string `json:"message"`
Counts struct {
ActiveTargets int `json:"active_targets"`
ActiveCertificates int `json:"active_certificates"`
PendingJobs int `json:"pending_jobs"`
} `json:"counts"`
}
if err := json.Unmarshal(body, &blocked); err != nil {
return fmt.Errorf("agent has active dependencies (HTTP 409); raw body: %s", string(body))
}
return fmt.Errorf("blocked_by_dependencies: %s (targets=%d certificates=%d jobs=%d); re-run with --force --reason \"<reason>\" to cascade",
blocked.Message, blocked.Counts.ActiveTargets, blocked.Counts.ActiveCertificates, blocked.Counts.PendingJobs)
case http.StatusForbidden:
return fmt.Errorf("agent %s is a reserved sentinel and cannot be retired (HTTP 403)", id)
case http.StatusNotFound:
return fmt.Errorf("agent %s not found (HTTP 404)", id)
case http.StatusBadRequest:
return fmt.Errorf("bad request (HTTP 400): %s", string(body))
default:
return fmt.Errorf("unexpected HTTP %d: %s", resp.StatusCode, string(body))
}
}
// GetAgent retrieves a single agent by ID.
func (c *Client) GetAgent(id string) error {
resp, err := c.do("GET", fmt.Sprintf("/api/v1/agents/%s", id), nil, nil)
@@ -613,6 +801,35 @@ func (c *Client) outputAgentsTable(agents []map[string]interface{}, total int) e
return nil
}
// outputRetiredAgentsTable is the tab-writer view for the retired listing.
// I-004: adds RETIRED_AT + REASON columns so operators can forensic-grep.
func (c *Client) outputRetiredAgentsTable(agents []map[string]interface{}, total int) error {
w := tabwriter.NewWriter(os.Stdout, 0, 0, 2, ' ', 0)
fmt.Fprintln(w, "ID\tHOSTNAME\tOS\tARCHITECTURE\tRETIRED AT\tREASON")
for _, agent := range agents {
id := getString(agent, "id")
hostname := getString(agent, "hostname")
osName := getString(agent, "os")
arch := getString(agent, "architecture")
retiredAt := ""
if raw, ok := agent["retired_at"].(string); ok && raw != "" {
if t, err := time.Parse(time.RFC3339, raw); err == nil {
retiredAt = t.Format("2006-01-02 15:04:05")
} else {
retiredAt = raw
}
}
reason := getString(agent, "retired_reason")
fmt.Fprintf(w, "%s\t%s\t%s\t%s\t%s\t%s\n", id, hostname, osName, arch, retiredAt, reason)
}
w.Flush()
fmt.Printf("\nTotal retired: %d\n", total)
return nil
}
func (c *Client) outputAgentDetail(agent map[string]interface{}) error {
w := tabwriter.NewWriter(os.Stdout, 0, 0, 2, ' ', 0)
@@ -32,6 +32,8 @@ type DeploymentTarget struct {
LastTestedAt *time.Time `json:"last_tested_at,omitempty"`
TestStatus string `json:"test_status,omitempty"`
Source string `json:"source,omitempty"`
RetiredAt *time.Time `json:"retired_at,omitempty"` // I-004: soft-retirement timestamp (nil = active)
RetiredReason *string `json:"retired_reason,omitempty"` // I-004: reason captured at cascade retirement
CreatedAt time.Time `json:"created_at"`
UpdatedAt time.Time `json:"updated_at"`
}
@@ -49,6 +51,67 @@ type Agent struct {
Architecture string `json:"architecture"`
IPAddress string `json:"ip_address"`
Version string `json:"version"`
// I-004: soft-retirement fields. An agent with RetiredAt != nil is the
// canonical "retired" state. The Status column remains as before (Online
// / Offline / Degraded) and is preserved at retirement time as the
// last-seen operational status; RetiredAt is the source of truth for
// "should we filter this row from active listings?".
RetiredAt *time.Time `json:"retired_at,omitempty"`
RetiredReason *string `json:"retired_reason,omitempty"`
}
// IsRetired returns true when this agent has been soft-retired.
// I-004: callers that iterate active agents (stats dashboard, stale-offline
// sweeper, handler-facing list) must skip retired rows by default.
func (a *Agent) IsRetired() bool { return a != nil && a.RetiredAt != nil }
// AgentDependencyCounts captures the active downstream rows that would be
// affected by retiring an agent. Returned by the preflight pass on
// DELETE /api/v1/agents/{id}. Zero counts mean a clean soft-retire is safe;
// any non-zero count blocks a default retire with HTTP 409 and requires an
// explicit ?force=true&reason=... escape hatch from the operator.
type AgentDependencyCounts struct {
ActiveTargets int `json:"active_targets"` // deployment_targets.agent_id=id AND retired_at IS NULL
ActiveCertificates int `json:"active_certificates"` // certificates currently deployed via one of this agent's active targets
PendingJobs int `json:"pending_jobs"` // jobs.agent_id=id AND status IN (Pending, AwaitingCSR, AwaitingApproval, Running)
}
// HasDependencies reports whether any preflight counter is non-zero.
func (d AgentDependencyCounts) HasDependencies() bool {
return d.ActiveTargets > 0 || d.ActiveCertificates > 0 || d.PendingJobs > 0
}
// SentinelAgentIDs enumerates the four reserved agent identities that back
// non-agent discovery subsystems. These rows are created by cmd/server on
// startup and retiring them would orphan their subsystem — the network
// scanner and the three cloud secret-manager sources all key writes to
// these IDs via service.SentinelAgentID / service.SentinelAWSSecretsMgr /
// service.SentinelAzureKeyVault / service.SentinelGCPSecretMgr. The four
// literal IDs below MUST stay in lockstep with those service-package
// constants (see internal/service/network_scan.go line 23 and
// internal/service/cloud_discovery.go lines 14-16).
//
// The retirement service refuses them unconditionally — even with
// ?force=true — via ErrAgentIsSentinel. Living here (and not in the
// service package) lets handler, repository, and scheduler code filter
// them without importing service and creating a cycle.
var SentinelAgentIDs = []string{
"server-scanner",
"cloud-aws-sm",
"cloud-azure-kv",
"cloud-gcp-sm",
}
// IsSentinelAgent reports whether id matches one of the four reserved
// sentinel agent IDs. A linear scan is fine — the slice is length 4 and
// the check is rare (only on retirement attempts and sweeper filters).
func IsSentinelAgent(id string) bool {
for _, s := range SentinelAgentIDs {
if s == id {
return true
}
}
return false
}
// AgentMetadata contains runtime metadata reported by agents via heartbeat.
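How the sentinel guard, the force/reason rule, and the preflight counts combine into an outcome can be sketched as a single decision function. This is an assumption-laden illustration: `depCounts`, `isSentinel`, and `decide` are hypothetical names, the check order is inferred from the handler's status matrix rather than taken from the service code, and the 204 and 404 cases are left out because they need a repository lookup.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// depCounts is a hypothetical stand-in for domain.AgentDependencyCounts.
type depCounts struct{ targets, certs, jobs int }

func (c depCounts) hasDependencies() bool {
	return c.targets > 0 || c.certs > 0 || c.jobs > 0
}

// The four reserved IDs, copied from SentinelAgentIDs.
var sentinelIDs = []string{"server-scanner", "cloud-aws-sm", "cloud-azure-kv", "cloud-gcp-sm"}

func isSentinel(id string) bool {
	for _, s := range sentinelIDs {
		if s == id {
			return true
		}
	}
	return false
}

// decide returns the HTTP status a retire request would produce.
// ASSUMPTION: sentinel refusal is checked first and applies even with force.
func decide(id string, c depCounts, force bool, reason string) int {
	switch {
	case isSentinel(id):
		return http.StatusForbidden
	case force && strings.TrimSpace(reason) == "":
		return http.StatusBadRequest
	case c.hasDependencies() && !force:
		return http.StatusConflict
	default:
		return http.StatusOK
	}
}

func main() {
	busy := depCounts{targets: 2, certs: 5, jobs: 1}
	fmt.Println(decide("cloud-aws-sm", busy, true, "cleanup"))        // 403
	fmt.Println(decide("ag-1", busy, false, ""))                      // 409
	fmt.Println(decide("ag-1", busy, true, "decommissioning rack 7")) // 200
}
```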
@@ -0,0 +1,55 @@
package domain
import (
"testing"
"time"
)
// TestAgent_IsRetired covers the I-004 soft-retirement predicate that gates
// which callers hide an agent row from active listings.
func TestAgent_IsRetired(t *testing.T) {
t.Run("nil receiver is not retired", func(t *testing.T) {
var a *Agent
if a.IsRetired() {
t.Fatalf("nil *Agent should not be retired")
}
})
t.Run("zero value is not retired", func(t *testing.T) {
a := &Agent{}
if a.IsRetired() {
t.Fatalf("zero Agent should not be retired")
}
})
t.Run("RetiredAt set is retired", func(t *testing.T) {
now := time.Now()
a := &Agent{RetiredAt: &now}
if !a.IsRetired() {
t.Fatalf("Agent with RetiredAt != nil must be retired")
}
})
}
// TestAgentDependencyCounts_HasDependencies verifies the preflight
// aggregation helper used by the 409 block path of DELETE /agents/{id}.
func TestAgentDependencyCounts_HasDependencies(t *testing.T) {
cases := []struct {
name string
counts AgentDependencyCounts
want bool
}{
{"all zero", AgentDependencyCounts{}, false},
{"active target", AgentDependencyCounts{ActiveTargets: 1}, true},
{"active cert", AgentDependencyCounts{ActiveCertificates: 1}, true},
{"pending job", AgentDependencyCounts{PendingJobs: 1}, true},
{"mixed", AgentDependencyCounts{ActiveTargets: 3, PendingJobs: 2}, true},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
if got := tc.counts.HasDependencies(); got != tc.want {
t.Fatalf("HasDependencies()=%v want=%v counts=%+v", got, tc.want, tc.counts)
}
})
}
}
@@ -848,6 +848,56 @@ func (m *mockAgentRepository) GetByAPIKey(ctx context.Context, keyHash string) (
return nil, fmt.Errorf("agent not found")
}
// I-004: the integration-level mockAgentRepository implements the 6 new
// retirement-surface methods as thin contract-satisfying stubs. The
// integration suite exercises lifecycle flows (issue → renew → deploy)
// that don't touch retirement, so these methods never need real behavior
// here — they exist purely to keep mockAgentRepository a valid
// AgentRepository implementation after migration 000015 expanded the
// interface. Dedicated retirement tests live in internal/service/
// agent_retire_test.go against the richer service-layer mockAgentRepo.
func (m *mockAgentRepository) ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error) {
var retired []*domain.Agent
for _, a := range m.agents {
if a.RetiredAt != nil {
retired = append(retired, a)
}
}
return retired, len(retired), nil
}
func (m *mockAgentRepository) SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error {
agent, ok := m.agents[id]
if !ok {
return fmt.Errorf("agent not found")
}
if agent.RetiredAt != nil {
return nil
}
stamped := retiredAt
agent.RetiredAt = &stamped
stampedReason := reason
agent.RetiredReason = &stampedReason
return nil
}
func (m *mockAgentRepository) RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error {
return m.SoftRetire(ctx, id, retiredAt, reason)
}
func (m *mockAgentRepository) CountActiveTargets(ctx context.Context, agentID string) (int, error) {
return 0, nil
}
func (m *mockAgentRepository) CountActiveCertificates(ctx context.Context, agentID string) (int, error) {
return 0, nil
}
func (m *mockAgentRepository) CountPendingJobs(ctx context.Context, agentID string) (int, error) {
return 0, nil
}
type mockTargetRepository struct {
targets map[string]*domain.DeploymentTarget
}
@@ -49,6 +49,16 @@ func (c *Client) Delete(path string) (json.RawMessage, error) {
return c.do("DELETE", path, nil, nil)
}
// DeleteWithQuery performs an HTTP DELETE with query parameters. I-004 adds
// this transport so MCP tools can target endpoints that carry flags in the
// query string (e.g. DELETE /api/v1/agents/{id}?force=true&reason=…). Client.Delete
// is path-only; without this method the retire tool silently drops force/reason,
// turning every cascade retire into a default soft-retire. Shares do()'s 204
// normalization and 4xx/5xx error propagation so tool authors get one contract.
func (c *Client) DeleteWithQuery(path string, query url.Values) (json.RawMessage, error) {
return c.do("DELETE", path, query, nil)
}
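The behaviour DeleteWithQuery relies on (query parameters reach the server, and a nil or empty query adds no stray "?") comes down to how the request URL is assembled. A minimal sketch of that assembly follows; `buildURL` is a hypothetical helper and the real `do()` implementation may differ, while the host name is a placeholder.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// buildURL joins base and path, appending the encoded query only when it is
// non-empty. url.Values.Encode returns "" for a nil or empty map, so the
// nil case needs no special handling.
func buildURL(base, path string, query url.Values) string {
	u := strings.TrimRight(base, "/") + path
	if encoded := query.Encode(); encoded != "" {
		u += "?" + encoded
	}
	return u
}

func main() {
	q := url.Values{}
	q.Set("force", "true")
	q.Set("reason", "decommissioning rack 7")

	// Keys are emitted in sorted order and spaces are encoded as "+".
	fmt.Println(buildURL("https://certctl.example", "/api/v1/agents/ag-1", q))
	fmt.Println(buildURL("https://certctl.example/", "/api/v1/agents/ag-1", nil))
}
```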
// GetRaw performs an HTTP GET and returns the raw response body bytes and content type.
// Used for binary responses (DER CRL, OCSP).
func (c *Client) GetRaw(path string) ([]byte, string, error) {
@@ -0,0 +1,214 @@
package mcp
import (
"encoding/json"
"net/http"
"net/http/httptest"
"net/url"
"strings"
"testing"
)
// TestClient_DeleteWithQuery_ForceRetire covers the new transport capability
// that I-004 adds to the MCP client. The retire tool needs to issue
// DELETE /api/v1/agents/{id}?force=true&reason=... — Client.Delete as it
// stands only accepts a path, dropping query parameters on the floor. Phase 2b
// must add DeleteWithQuery so the MCP retire tool can hit the force escape
// hatch; without this, every retire-via-MCP call with force=true silently
// becomes a default soft-retire and either succeeds wrongly or 409s.
func TestClient_DeleteWithQuery_ForceRetire(t *testing.T) {
var (
sawMethod string
sawPath string
sawForce string
sawReason string
)
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
sawMethod = r.Method
sawPath = r.URL.Path
sawForce = r.URL.Query().Get("force")
sawReason = r.URL.Query().Get("reason")
if r.Method != http.MethodDelete || r.URL.Path != "/api/v1/agents/ag-1" {
w.WriteHeader(http.StatusNotFound)
return
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_ = json.NewEncoder(w).Encode(map[string]interface{}{
"retired_at": "2026-04-18T12:00:00Z",
"already_retired": false,
"cascade": true,
})
}))
defer server.Close()
c := NewClient(server.URL, "test-key")
// Compile-fail until Phase 2b grows Client.DeleteWithQuery. Passing the
// query as a url.Values is the established pattern (matches Get's shape).
query := url.Values{}
query.Set("force", "true")
query.Set("reason", "decommissioning rack 7")
data, err := c.DeleteWithQuery("/api/v1/agents/ag-1", query)
if err != nil {
t.Fatalf("DeleteWithQuery err=%v want nil", err)
}
if data == nil {
t.Fatal("DeleteWithQuery returned nil data; want 200 body echo-back")
}
if sawMethod != http.MethodDelete {
t.Errorf("method=%q want DELETE", sawMethod)
}
if sawPath != "/api/v1/agents/ag-1" {
t.Errorf("path=%q want /api/v1/agents/ag-1 (query must be stripped from path)", sawPath)
}
if sawForce != "true" {
t.Errorf("force query=%q want \"true\"", sawForce)
}
if sawReason != "decommissioning rack 7" {
t.Errorf("reason query=%q want %q", sawReason, "decommissioning rack 7")
}
}
// TestClient_DeleteWithQuery_NoQuery covers the defensive path: a nil/empty
// query must still produce a clean DELETE against the bare path with no stray
// "?" suffix. Matches the Get() shape (see client.go do()) so downstream tools
// can reuse one code path.
func TestClient_DeleteWithQuery_NoQuery(t *testing.T) {
var sawRawPath string
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
sawRawPath = r.URL.RequestURI()
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_ = json.NewEncoder(w).Encode(map[string]interface{}{"ok": true})
}))
defer server.Close()
c := NewClient(server.URL, "")
if _, err := c.DeleteWithQuery("/api/v1/agents/ag-1", nil); err != nil {
t.Fatalf("DeleteWithQuery(nil query) err=%v want nil", err)
}
// No query → no ? suffix.
if strings.Contains(sawRawPath, "?") {
t.Errorf("raw path=%q contains stray ?; empty query must not serialize", sawRawPath)
}
}
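The "no stray ?" contract pinned by the two tests above can be sketched as a small URL-building helper. This is an illustrative assumption, not the code in client.go: the name buildRequestURL and the base URL are hypothetical, and only the standard net/url package is used.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildRequestURL is a hypothetical helper (not the actual client.go code).
// It appends the encoded query only when there is something to encode, so a
// nil or empty url.Values never leaves a trailing "?" on the request URL.
func buildRequestURL(baseURL, path string, query url.Values) string {
	full := baseURL + path
	if encoded := query.Encode(); encoded != "" {
		full += "?" + encoded
	}
	return full
}

func main() {
	q := url.Values{}
	q.Set("force", "true")
	q.Set("reason", "decommissioning rack 7")

	// Encode sorts keys and escapes spaces as "+".
	fmt.Println(buildRequestURL("http://localhost:8443", "/api/v1/agents/ag-1", q))
	// http://localhost:8443/api/v1/agents/ag-1?force=true&reason=decommissioning+rack+7

	// A nil query encodes to the empty string, so the bare path is used.
	fmt.Println(buildRequestURL("http://localhost:8443", "/api/v1/agents/ag-1", nil))
	// http://localhost:8443/api/v1/agents/ag-1
}
```

Keeping the query as url.Values until the last moment leaves escaping to the standard library rather than to string concatenation at each call site.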
// TestClient_DeleteWithQuery_204ReturnsMinimalBody covers the idempotent path.
// The handler returns 204 No Content for an already-retired agent; the
// existing do() helper normalises this to {"status":"deleted"}. The new
// DeleteWithQuery must share that behavior so MCP tool authors don't have to
// special-case the return shape.
func TestClient_DeleteWithQuery_204ReturnsMinimalBody(t *testing.T) {
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusNoContent)
}))
defer server.Close()
c := NewClient(server.URL, "")
data, err := c.DeleteWithQuery("/api/v1/agents/ag-1", nil)
if err != nil {
t.Fatalf("DeleteWithQuery(204) err=%v want nil (idempotent)", err)
}
if data == nil {
t.Fatal("DeleteWithQuery(204) returned nil; want synthetic body")
}
if !strings.Contains(string(data), "deleted") && !strings.Contains(string(data), "status") {
t.Errorf("DeleteWithQuery(204) body=%q; must surface a non-empty sentinel", string(data))
}
}
// TestClient_DeleteWithQuery_409PropagatesError covers the preflight-blocked
// surface. A 409 with dependency counts must bubble up as a Go error so the
// MCP tool can present it to the LLM operator rather than silently swallow
// the rejection.
func TestClient_DeleteWithQuery_409PropagatesError(t *testing.T) {
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusConflict)
_ = json.NewEncoder(w).Encode(map[string]interface{}{
"error": "blocked_by_dependencies",
"message": "agent has active targets",
"counts": map[string]int{
"active_targets": 3,
"active_certificates": 7,
"pending_jobs": 2,
},
})
}))
defer server.Close()
c := NewClient(server.URL, "")
_, err := c.DeleteWithQuery("/api/v1/agents/ag-1", nil)
if err == nil {
t.Fatalf("DeleteWithQuery(409) err=nil; 409 must propagate as Go error")
}
if !strings.Contains(err.Error(), "409") {
t.Errorf("err=%q should include HTTP status 409 for debuggability", err.Error())
}
}
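The 204 and 409 tests above describe two halves of one status-handling rule. A minimal sketch of that rule follows, assuming a helper shaped like the one below; interpretResponse and the exact error text are hypothetical and only illustrate the behavior the tests pin (synthetic body on 204, Go error carrying the status code on non-2xx).

```go
package main

import (
	"fmt"
	"net/http"
)

// interpretResponse is a hypothetical stand-in for the status handling the
// tests expect from the client. It is not the real do() helper.
func interpretResponse(status int, body []byte) ([]byte, error) {
	switch {
	case status == http.StatusNoContent:
		// 204 has no body; return a small synthetic one so callers always
		// receive non-nil data on success.
		return []byte(`{"status":"deleted"}`), nil
	case status >= 200 && status < 300:
		return body, nil
	default:
		// Include the numeric status so a 409 is recognizable in logs and
		// in the text surfaced to the MCP caller.
		return nil, fmt.Errorf("API error (HTTP %d): %s", status, body)
	}
}

func main() {
	data, err := interpretResponse(http.StatusNoContent, nil)
	fmt.Println(string(data), err)

	_, err = interpretResponse(http.StatusConflict, []byte(`{"error":"blocked_by_dependencies"}`))
	fmt.Println(err)
}
```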
// TestRetireAgentInput_ShapePinned is a compile-time assertion that the MCP
// tool input struct for certctl_retire_agent exists with the required fields
// and their expected tag shapes. The LLM discovers this input schema via
// jsonschema tags — refactoring field names without updating callers silently
// breaks tool discovery.
//
// Red until Phase 2b adds RetireAgentInput to internal/mcp/types.go. This
// assertion deliberately exercises every field so the test fails at compile
// time rather than runtime.
func TestRetireAgentInput_ShapePinned(t *testing.T) {
// Populated construction of the expected input; fails to compile until
// the struct exists with fields {ID string, Force bool, Reason string}.
input := RetireAgentInput{
ID: "ag-1",
Force: true,
Reason: "decommissioning rack 7",
}
if input.ID != "ag-1" {
t.Errorf("RetireAgentInput.ID=%q want ag-1 (field binding broken)", input.ID)
}
if !input.Force {
t.Errorf("RetireAgentInput.Force=false want true")
}
if input.Reason != "decommissioning rack 7" {
t.Errorf("RetireAgentInput.Reason=%q want decommissioning rack 7", input.Reason)
}
// Also pin the JSON surface — LLMs send and receive these field names,
// so json tags must stay snake_case even through refactors.
encoded, err := json.Marshal(input)
if err != nil {
t.Fatalf("marshal RetireAgentInput: %v", err)
}
body := string(encoded)
for _, want := range []string{`"id":"ag-1"`, `"force":true`, `"reason":"decommissioning rack 7"`} {
if !strings.Contains(body, want) {
t.Errorf("RetireAgentInput JSON=%q missing %q (tag shape drifted)", body, want)
}
}
}
// TestListRetiredAgentsInput_ShapePinned mirrors the pagination input shape
// used across the MCP toolset (see ListParams). The list-retired-agents tool
// takes page + per_page with snake_case JSON tags. Compile-fail until
// Phase 2b either adds ListRetiredAgentsInput or documents that list-retired
// reuses the existing ListParams type (both paths are acceptable — the test
// just pins whichever Phase 2b picks).
func TestListRetiredAgentsInput_ShapePinned(t *testing.T) {
// Phase 2b may either (a) add a dedicated ListRetiredAgentsInput struct
// or (b) reuse the existing ListParams. Either is fine — we pin the
// field-access contract rather than the struct name to let the
// implementation choose. Compile-fail guards against the tool being
// registered without any pagination input at all.
var input ListParams
input.Page = 1
input.PerPage = 50
if input.Page != 1 || input.PerPage != 50 {
t.Errorf("ListParams fields Page/PerPage broken; listing pagination will misroute")
}
}
@@ -506,6 +506,53 @@ func registerAgentTools(s *gomcp.Server, c *Client) {
}
return textResult(data)
})
// I-004: soft-retirement. DELETE /api/v1/agents/{id} returns 200 on a
// fresh retire (body echoes retired_at/already_retired/cascade/counts),
// 204 on an idempotent retire of an already-retired agent (do() in
// client.go normalizes that to {"status":"deleted"}), 409 when downstream
// dependencies block the retire and force wasn't set, 403 on sentinel
// agents, or 400 when force=true was sent without a reason. The tool
// forwards the raw handler response so the LLM operator sees the
// dependency counts and can decide whether to retry with force=true.
gomcp.AddTool(s, &gomcp.Tool{
Name: "certctl_retire_agent",
Description: "Soft-retire an agent (DELETE /api/v1/agents/{id}). Sets retired_at + retired_reason on the row; the agent is filtered from the default listing and surfaces only via certctl_list_retired_agents. Default is a safety-gated soft-retire that returns 409 blocked_by_dependencies if the agent has active targets, active certificates, or pending jobs — the returned counts tell you what would be orphaned. Pass force=true to cascade through and retire those dependents too; force=true requires a non-empty reason (captured in the audit trail). Sentinel discovery agents (server-scanner, cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm) cannot be retired — the handler returns 403 unconditionally. Idempotent: retrying on an already-retired agent returns 204 without side effects.",
}, func(ctx context.Context, req *gomcp.CallToolRequest, input RetireAgentInput) (*gomcp.CallToolResult, any, error) {
// Client-side mirror of the handler's ErrForceReasonRequired contract
// (see internal/api/handler/agents.go) so the LLM gets an immediate,
// actionable error instead of a round-trip 400. Only an empty reason is
// caught here; a whitespace-only reason is still sent and is left to the
// handler's TrimSpace check to reject.
if input.Force && input.Reason == "" {
return errorResult(fmt.Errorf("reason is required when force=true"))
}
query := url.Values{}
if input.Force {
query.Set("force", "true")
}
if input.Reason != "" {
query.Set("reason", input.Reason)
}
data, err := c.DeleteWithQuery("/api/v1/agents/"+input.ID, query)
if err != nil {
return errorResult(err)
}
return textResult(data)
})
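The force/reason guard and the query construction in the tool above could be factored into one testable helper. The sketch below is an assumption about how that might look, not the registered tool code: retireQuery and errReasonRequired are hypothetical names, and it applies strings.TrimSpace so that whitespace-only reasons are rejected before any request is sent.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// errReasonRequired is a hypothetical client-side error mirroring the
// handler's "force requires a reason" rule.
var errReasonRequired = errors.New("reason is required when force=true")

// retireQuery validates the force/reason pair and builds the query string
// parameters for DELETE /api/v1/agents/{id}. Zero values produce no
// parameters, so a default soft-retire carries an empty query.
func retireQuery(force bool, reason string) (url.Values, error) {
	trimmed := strings.TrimSpace(reason)
	if force && trimmed == "" {
		return nil, errReasonRequired
	}
	query := url.Values{}
	if force {
		query.Set("force", "true")
	}
	if trimmed != "" {
		query.Set("reason", trimmed)
	}
	return query, nil
}

func main() {
	q, err := retireQuery(true, "decommissioning rack 7")
	fmt.Println(q.Encode(), err)

	_, err = retireQuery(true, "   ")
	fmt.Println(err)
}
```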
// I-004: retired agents are filtered out of GET /api/v1/agents by default.
// The /agents/retired endpoint is the opt-in view — same pagination shape
// as the default listing, but filters to rows where retired_at IS NOT NULL.
gomcp.AddTool(s, &gomcp.Tool{
Name: "certctl_list_retired_agents",
Description: "List soft-retired agents (GET /api/v1/agents/retired). These are agents that have been retired via certctl_retire_agent; retired_at and retired_reason are populated. Returned separately from certctl_list_agents so the default listing stays focused on operational agents.",
}, func(ctx context.Context, req *gomcp.CallToolRequest, input ListParams) (*gomcp.CallToolResult, any, error) {
data, err := c.Get("/api/v1/agents/retired", paginationQuery(input.Page, input.PerPage))
if err != nil {
return errorResult(err)
}
return textResult(data)
})
}
// ── Jobs ────────────────────────────────────────────────────────────
@@ -152,6 +152,23 @@ type AgentJobStatusInput struct {
Error string `json:"error,omitempty" jsonschema:"Error message if job failed"`
}
// RetireAgentInput pins the MCP tool surface for certctl_retire_agent. I-004
// introduces a soft-retirement flow that the handler exposes on DELETE
// /api/v1/agents/{id} with two optional query flags: force=true cascades
// through dependent active targets/certs/jobs, and reason is the human-readable
// string captured in the audit trail. The handler enforces
// ErrForceReasonRequired when force=true is sent without a reason; we surface
// both as separate fields so the LLM can populate them independently and so
// the retire_agent_test shape assertion stays aligned with the JSON-wire
// contract. ID is always emitted (no omitempty) because a retire call without
// a target agent is meaningless; Force and Reason are omitempty, and the tool
// handler only sets query parameters for non-zero values, so the default
// soft-retire path sends no query suffix at all.
type RetireAgentInput struct {
ID string `json:"id" jsonschema:"Agent ID to soft-retire"`
Force bool `json:"force,omitempty" jsonschema:"Cascade-retire downstream active targets, certs, and jobs (requires reason)"`
Reason string `json:"reason,omitempty" jsonschema:"Human-readable reason (required when force=true)"`
}
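The effect of the json tags on the wire shape can be checked in isolation. The sketch below copies the struct with the jsonschema tags dropped (an intentional simplification, so this is not the exact type in internal/mcp/types.go) and prints what encoding/json emits for the default and forced cases.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RetireAgentInput is a local copy of the struct above with the jsonschema
// tags omitted; only the json tags matter for this demonstration.
type RetireAgentInput struct {
	ID     string `json:"id"`
	Force  bool   `json:"force,omitempty"`
	Reason string `json:"reason,omitempty"`
}

// encode marshals the input and returns the JSON text, or the error text if
// marshalling fails (it cannot fail for this struct, but the error is still
// handled rather than ignored).
func encode(in RetireAgentInput) string {
	b, err := json.Marshal(in)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// Default soft-retire: zero-valued Force and Reason are dropped.
	fmt.Println(encode(RetireAgentInput{ID: "ag-1"}))
	// {"id":"ag-1"}

	// Forced cascade: all three fields appear, in declaration order.
	fmt.Println(encode(RetireAgentInput{ID: "ag-1", Force: true, Reason: "decommissioning rack 7"}))
	// {"id":"ag-1","force":true,"reason":"decommissioning rack 7"}
}
```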
// ── Jobs ────────────────────────────────────────────────────────────
type ListJobsInput struct {
@@ -93,9 +93,34 @@ type TargetRepository interface {
// AgentRepository defines operations for managing control plane agents.
type AgentRepository interface {
-// List returns all agents.
+// List returns all ACTIVE agents: rows with retired_at IS NULL.
//
// I-004: The default listing MUST NOT surface retired agents. The
// handler-facing ListAgents call, the stats dashboard, and the stale-offline
// sweeper all iterate this list and would otherwise re-surface decommissioned
// hardware in operational UI. Callers that genuinely want retired rows (the
// audit tab, compliance exports) must use ListRetired instead.
//
// The partial index idx_agents_retired_at (migration 000015) keeps retired
// rows cheap to exclude — the planner uses it to skip the retired segment
// of the table entirely.
List(ctx context.Context) ([]*domain.Agent, error)
// ListRetired returns a paginated list of retired agents (retired_at IS NOT NULL),
// ordered by retired_at DESC so the most recent retirements appear first. Used
// by the GUI's Retired tab and the audit export path. Returns the slice plus
// the total count (for pagination). A page<1 or perPage<1 is clamped to sensible
// defaults (page=1, perPage=50) in the repo implementation rather than erroring —
// this matches the ListAgents pagination behavior in the service layer.
// I-004 coverage-gap closure, migration 000015.
ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error)
// Get retrieves an agent by ID.
//
// I-004 note: Get returns retired rows (retired_at IS NOT NULL) because
// callers that need to check "has this agent been retired?" — the heartbeat
// handler returning 410 Gone, the retirement service's idempotent-retire
// branch, the detail page rendering a retirement banner — must see the
// retired_at/retired_reason fields. Only the List path excludes retired rows
// by default; individual Get lookups surface them.
Get(ctx context.Context, id string) (*domain.Agent, error)
// Create stores a new agent. Callers that want duplicate-key errors surfaced
// (e.g. real-agent registration) must use this method; sentinel/bootstrap
@@ -112,11 +137,78 @@ type AgentRepository interface {
// Update modifies an existing agent.
Update(ctx context.Context, agent *domain.Agent) error
// Delete removes an agent.
//
// I-004: callers should prefer SoftRetire / RetireAgentWithCascade for the
// operator-facing retirement path; hard Delete remains available for test
// cleanup and repository-level administrative tasks. The deployment_targets
// FK flipped to ON DELETE RESTRICT in migration 000015, so hard-deleting an
// agent that still owns active targets will now fail at the DB layer — which
// is intentional: the fail-closed guardrail prevents audit-trail destruction.
Delete(ctx context.Context, id string) error
// UpdateHeartbeat updates the agent's last heartbeat timestamp and metadata.
//
// I-004: UpdateHeartbeat is a no-op on retired agents — the UPDATE clause
// includes AND retired_at IS NULL so a stale agent process that keeps polling
// after retirement cannot resurrect its heartbeat. The service layer already
// short-circuits with ErrAgentRetired before calling this method; the WHERE
// filter here is belt-and-braces for anyone who skips the service path.
UpdateHeartbeat(ctx context.Context, id string, metadata *domain.AgentMetadata) error
// GetByAPIKey retrieves an agent by hashed API key.
//
// I-004: GetByAPIKey returns retired rows so the auth middleware can detect
// "this API key belongs to a retired agent" and fail the request with
// 410 Gone. If retired rows were hidden, auth would return a plain 401 and
// leak no signal — which is wrong: the operator needs the retired state
// made explicit so they can clean up the agent process.
GetByAPIKey(ctx context.Context, keyHash string) (*domain.Agent, error)
// SoftRetire stamps retired_at + retired_reason on the agent row with no
// cascade. Used on the happy path where preflight confirmed the agent has
// zero active dependencies (no active deployment_targets, no pending jobs).
// The UPDATE is scoped to WHERE id=$1 AND retired_at IS NULL so re-retiring
// an already-retired row is a no-op (zero rows affected is NOT returned as
// an error — the service layer detects this via its own idempotent-retire
// branch before calling SoftRetire). Callers supply retiredAt so the service
// can pin a single consistent timestamp across audit + DB writes.
// I-004 coverage-gap closure.
SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error
// RetireAgentWithCascade performs a transactional retire + cascade. In one
// transaction it: (1) stamps retired_at + retired_reason on the agent row,
// and (2) stamps the SAME retired_at + retired_reason on every active
// deployment_targets row whose agent_id matches. Only rows with
// retired_at IS NULL are touched in (2) — already-retired targets keep their
// original retirement metadata (whoever retired them first, whenever). Used
// exclusively on the force=true path from the retirement handler; callers
// supply retiredAt so the agent row and every cascaded target row share an
// exact retirement instant (helps forensic analysis trace the cascade back
// to a single operator action). If the agent row is already retired, the
// whole operation is a no-op — the transaction commits without touching
// either table. I-004 coverage-gap closure, migration 000015.
RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error
// CountActiveTargets returns the number of deployment_targets rows where
// agent_id=id AND retired_at IS NULL. The COUNT query hits the existing
// idx_deployment_targets_agent_id index (migration 000001 line 111); the
// additional retired_at IS NULL predicate is cheap because the partial
// idx_deployment_targets_retired_at index (migration 000015) lets the
// planner skip the retired-row segment entirely. Preflight uses this to
// decide 200 (soft-retire) vs 409 (blocked-by-deps). I-004.
CountActiveTargets(ctx context.Context, agentID string) (int, error)
// CountActiveCertificates returns the count of managed_certificates currently
// deployed through one of this agent's ACTIVE (non-retired) deployment_targets.
// The query joins certificate_target_mappings (migration 000001 line 116) →
// deployment_targets filtering on deployment_targets.agent_id=$1 AND
// deployment_targets.retired_at IS NULL, then COUNT(DISTINCT certificate_id)
// so the same cert deployed to multiple targets on one agent counts once.
// The primary key (certificate_id, target_id) on certificate_target_mappings
// plus idx_certificate_target_mappings_target_id (line 122) cover the join.
// Used purely for the preflight 409 body — the number is informational. I-004.
CountActiveCertificates(ctx context.Context, agentID string) (int, error)
// CountPendingJobs returns the number of jobs belonging to this agent whose
// status is in (Pending, AwaitingCSR, AwaitingApproval, Running) — the four
// statuses that indicate work the agent would still be expected to pick up.
// Completed/Failed/Cancelled jobs do not count. The filter agent_id=$1 hits
// the idx_jobs_agent_id index (migration 000001 line 161). Used for the
// preflight 409 body. I-004.
CountPendingJobs(ctx context.Context, agentID string) (int, error)
}
// JobRepository defines operations for managing renewal and deployment jobs.
@@ -20,12 +20,18 @@ func NewAgentRepository(db *sql.DB) *AgentRepository {
return &AgentRepository{db: db}
}
-// List returns all agents
+// List returns all ACTIVE agents (rows with retired_at IS NULL). I-004:
// the default listing path feeds the handler-facing ListAgents call, the
// stats dashboard, and the stale-offline sweeper; every caller wants active
// hardware, not decommissioned rows. Operators who need retired rows reach
// for ListRetired instead. The partial index idx_agents_retired_at
// (migration 000015) lets the planner skip the retired segment cheaply.
func (r *AgentRepository) List(ctx context.Context) ([]*domain.Agent, error) {
rows, err := r.db.QueryContext(ctx, `
SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash,
-os, architecture, ip_address, version
+os, architecture, ip_address, version, retired_at, retired_reason
FROM agents
WHERE retired_at IS NULL
ORDER BY registered_at DESC
`)
@@ -50,11 +56,16 @@ func (r *AgentRepository) List(ctx context.Context) ([]*domain.Agent, error) {
return agents, nil
}
-// Get retrieves an agent by ID
+// Get retrieves an agent by ID. I-004: retired rows ARE surfaced here because
// callers that need to check "has this agent been retired?" (heartbeat
// handler returning 410 Gone, retirement service's idempotent-retire branch,
// detail page rendering a retirement banner) must see retired_at /
// retired_reason. Only the List path default-excludes retired rows; Get is
// by-ID and returns whatever row exists.
func (r *AgentRepository) Get(ctx context.Context, id string) (*domain.Agent, error) {
row := r.db.QueryRowContext(ctx, `
SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash,
-os, architecture, ip_address, version
+os, architecture, ip_address, version, retired_at, retired_reason
FROM agents
WHERE id = $1
`, id)
@@ -185,7 +196,16 @@ func (r *AgentRepository) Delete(ctx context.Context, id string) error {
return nil
}
-// UpdateHeartbeat updates the agent's last heartbeat timestamp and metadata
+// UpdateHeartbeat updates the agent's last heartbeat timestamp and metadata.
//
// I-004: both branches include `AND retired_at IS NULL` in the WHERE clause,
// making the UPDATE a no-op on retired rows. The service layer already
// short-circuits with ErrAgentRetired before calling this method (see
// AgentService.Heartbeat), but the WHERE filter is belt-and-braces for any
// path that skips the service — a stale agent process that keeps polling
// after retirement cannot resurrect its heartbeat at the DB layer. A zero
// RowsAffected here returns the same "agent not found" error as before; the
// service layer distinguishes retired from missing by calling Get first.
func (r *AgentRepository) UpdateHeartbeat(ctx context.Context, id string, metadata *domain.AgentMetadata) error {
var result sql.Result
var err error
@@ -199,11 +219,11 @@ func (r *AgentRepository) UpdateHeartbeat(ctx context.Context, id string, metada
architecture = CASE WHEN $5 = '' THEN architecture ELSE $5 END,
ip_address = CASE WHEN $6 = '' THEN ip_address ELSE $6 END,
version = CASE WHEN $7 = '' THEN version ELSE $7 END
-WHERE id = $2
+WHERE id = $2 AND retired_at IS NULL
`, time.Now(), id, metadata.Hostname, metadata.OS, metadata.Architecture, metadata.IPAddress, metadata.Version)
} else {
result, err = r.db.ExecContext(ctx, `
-UPDATE agents SET last_heartbeat_at = $1 WHERE id = $2
+UPDATE agents SET last_heartbeat_at = $1 WHERE id = $2 AND retired_at IS NULL
`, time.Now(), id)
}
@@ -223,11 +243,15 @@ func (r *AgentRepository) UpdateHeartbeat(ctx context.Context, id string, metada
return nil
}
-// GetByAPIKey retrieves an agent by hashed API key
+// GetByAPIKey retrieves an agent by hashed API key. I-004: retired rows ARE
// surfaced here so the auth middleware can detect "this API key belongs to a
// retired agent" and fail the request with 410 Gone instead of 401. If the
// filter hid retired rows, auth would return a plain 401 and leak no signal
// that the agent process needs cleaning up.
func (r *AgentRepository) GetByAPIKey(ctx context.Context, keyHash string) (*domain.Agent, error) {
row := r.db.QueryRowContext(ctx, `
SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash,
-os, architecture, ip_address, version
+os, architecture, ip_address, version, retired_at, retired_reason
FROM agents
WHERE api_key_hash = $1
`, keyHash)
@@ -243,14 +267,214 @@ func (r *AgentRepository) GetByAPIKey(ctx context.Context, keyHash string) (*dom
return agent, nil
}
-// scanAgent scans an agent from a row or rows
// ─── I-004 agent retirement surface ──────────────────────────────────────
//
// The methods below implement the I-004 coverage-gap closure. They follow the
// interface contracts in internal/repository/interfaces.go:94-210 (which is the
// spec — keep godoc there in sync if behavior changes).
// ListRetired returns a paginated slice of retired agents ordered by
// retired_at DESC so the most recent retirements appear first. Used by the
// GUI's Retired tab and the audit export path. Returns the rows plus the
// total count (for pagination UI). page<1 or perPage<1 is clamped to
// sensible defaults in-repo rather than erroring, matching the ListAgents
// pagination behavior at the service layer. I-004, migration 000015.
func (r *AgentRepository) ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error) {
// Clamp pagination to safe defaults. Keep in lockstep with the service
// layer's pagination shape — negative / zero values on either axis should
// degrade to "first page, default size" instead of returning an error.
if page < 1 {
page = 1
}
if perPage < 1 {
perPage = 50
}
offset := (page - 1) * perPage
// Total count first, as a separate query, so pagination math stays correct
// even when the page of rows is empty. The partial idx_agents_retired_at
// index covers exactly the retired rows, so the planner can answer this
// from the index instead of a full table scan.
var total int
if err := r.db.QueryRowContext(ctx, `
SELECT COUNT(*) FROM agents WHERE retired_at IS NOT NULL
`).Scan(&total); err != nil {
return nil, 0, fmt.Errorf("failed to count retired agents: %w", err)
}
rows, err := r.db.QueryContext(ctx, `
SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash,
os, architecture, ip_address, version, retired_at, retired_reason
FROM agents
WHERE retired_at IS NOT NULL
ORDER BY retired_at DESC
LIMIT $1 OFFSET $2
`, perPage, offset)
if err != nil {
return nil, 0, fmt.Errorf("failed to query retired agents: %w", err)
}
defer rows.Close()
var agents []*domain.Agent
for rows.Next() {
agent, err := scanAgent(rows)
if err != nil {
return nil, 0, err
}
agents = append(agents, agent)
}
if err := rows.Err(); err != nil {
return nil, 0, fmt.Errorf("error iterating retired agent rows: %w", err)
}
return agents, total, nil
}
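The clamping and offset arithmetic at the top of ListRetired can be isolated into a pure function, which makes the edge cases easy to test without a database. The helper name limitOffset is hypothetical; the defaults (page=1, perPage=50) are the ones stated in the godoc above.

```go
package main

import "fmt"

// limitOffset is a hypothetical pure helper mirroring the clamp in
// ListRetired: non-positive values fall back to page=1 and perPage=50, and
// the SQL OFFSET is derived from the clamped values.
func limitOffset(page, perPage int) (limit, offset int) {
	if page < 1 {
		page = 1
	}
	if perPage < 1 {
		perPage = 50
	}
	return perPage, (page - 1) * perPage
}

func main() {
	fmt.Println(limitOffset(0, 0))   // 50 0
	fmt.Println(limitOffset(3, 20))  // 20 40
	fmt.Println(limitOffset(-5, 10)) // 10 0
}
```

The sketch sets no upper bound on perPage because the source describes none; a cap would be a separate design decision.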
// SoftRetire stamps retired_at + retired_reason on the agent row with no
// cascade. Scoped to `WHERE id=$1 AND retired_at IS NULL` so re-retiring an
// already-retired row is a silent no-op (zero RowsAffected). The service
// layer has its own idempotent-retire branch that detects already-retired
// rows via Get before calling SoftRetire; a zero here just means a racy
// caller got there first. I-004.
func (r *AgentRepository) SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error {
if _, err := r.db.ExecContext(ctx, `
UPDATE agents
SET retired_at = $2, retired_reason = $3
WHERE id = $1 AND retired_at IS NULL
`, id, retiredAt, reason); err != nil {
return fmt.Errorf("failed to soft-retire agent: %w", err)
}
return nil
}
// RetireAgentWithCascade performs a transactional retire-and-cascade. In one
// transaction it (1) stamps retired_at + retired_reason on the agent row if
// it is still active, and (2) stamps the SAME retired_at + retired_reason on
// every active (retired_at IS NULL) deployment_targets row whose agent_id
// matches. Already-retired targets keep their original retirement metadata;
// only active targets are touched. If the agent is already retired, the
// whole transaction is a no-op — the caller's idempotent-retire branch
// already handled it before we got here. I-004, migration 000015.
//
// The two UPDATEs share a single (retiredAt, reason) pair so forensic
// analysis can trace "every row stamped at T1 with reason R was part of the
// same operator action" back to one cascade. Using BeginTx keeps the agent
// row and its targets' retirement metadata consistent even if something
// crashes mid-cascade.
func (r *AgentRepository) RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error {
tx, err := r.db.BeginTx(ctx, nil)
if err != nil {
return fmt.Errorf("failed to begin retire-cascade transaction: %w", err)
}
// Rollback is a no-op if Commit has already run — safe to always defer.
defer func() { _ = tx.Rollback() }()
// Agent row: flip to retired only if it was still active. If zero rows
// match, the agent was already retired — the whole cascade becomes a
// no-op (we deliberately do NOT stamp the targets against a retirement
// we didn't perform).
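res, err := tx.ExecContext(ctx, `
UPDATE agents
SET retired_at = $2, retired_reason = $3
WHERE id = $1 AND retired_at IS NULL
`, id, retiredAt, reason)
if err != nil {
return fmt.Errorf("failed to retire agent in cascade: %w", err)
}
if affected, err := res.RowsAffected(); err != nil {
return fmt.Errorf("failed to read agent retire result: %w", err)
} else if affected == 0 {
return nil // nothing retired here: skip the target UPDATE; deferred Rollback ends the tx
}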
// Cascade: copy the same retired_at / retired_reason onto every active
// deployment_target belonging to this agent. Skips targets that are
// already retired so their original retirement metadata is preserved.
if _, err := tx.ExecContext(ctx, `
UPDATE deployment_targets
SET retired_at = $2, retired_reason = $3
WHERE agent_id = $1 AND retired_at IS NULL
`, id, retiredAt, reason); err != nil {
return fmt.Errorf("failed to cascade-retire deployment targets: %w", err)
}
if err := tx.Commit(); err != nil {
return fmt.Errorf("failed to commit retire-cascade transaction: %w", err)
}
return nil
}
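The cascade semantics described in the godoc above (one shared timestamp and reason, already-retired targets untouched, no-op when the agent is already retired) can be modelled in memory to make the invariants explicit. This is a simplified sketch with hypothetical types (store, retirement); it does not use database/sql and says nothing about how the real transaction behaves under failure.

```go
package main

import (
	"fmt"
	"time"
)

// retirement holds the metadata pair stamped on a retired row.
type retirement struct {
	At     time.Time
	Reason string
}

// store is a hypothetical in-memory stand-in for the two tables. A nil
// *retirement means the row is still active.
type store struct {
	agents  map[string]*retirement
	targets map[string]map[string]*retirement // agentID -> targetID -> state
}

// demoAt is a fixed instant so the demo output is deterministic.
var demoAt = time.Date(2026, 4, 18, 12, 0, 0, 0, time.UTC)

// newDemoStore returns one active agent with one active target and one
// target that was retired earlier for a different reason.
func newDemoStore() *store {
	earlier := &retirement{At: demoAt.Add(-24 * time.Hour), Reason: "earlier"}
	return &store{
		agents:  map[string]*retirement{"ag-1": nil},
		targets: map[string]map[string]*retirement{"ag-1": {"t-1": nil, "t-2": earlier}},
	}
}

// retireWithCascade stamps the agent and every active target with the same
// (at, reason) pair and returns how many targets were stamped. It does
// nothing if the agent is unknown or already retired.
func (s *store) retireWithCascade(agentID string, at time.Time, reason string) int {
	current, ok := s.agents[agentID]
	if !ok || current != nil {
		return 0
	}
	s.agents[agentID] = &retirement{At: at, Reason: reason}
	stamped := 0
	for id, state := range s.targets[agentID] {
		if state == nil { // skip targets that carry earlier retirement metadata
			s.targets[agentID][id] = &retirement{At: at, Reason: reason}
			stamped++
		}
	}
	return stamped
}

func main() {
	s := newDemoStore()
	fmt.Println(s.retireWithCascade("ag-1", demoAt, "decommissioning rack 7")) // 1
	fmt.Println(s.retireWithCascade("ag-1", demoAt, "second attempt"))         // 0
	fmt.Println(s.targets["ag-1"]["t-2"].Reason)                               // earlier
}
```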
// CountActiveTargets returns the number of deployment_targets with
// agent_id=agentID AND retired_at IS NULL. Used by the retirement preflight
// to decide 200 (soft-retire) vs 409 (blocked-by-deps). Hits the existing
// idx_deployment_targets_agent_id index (migration 000001 line 111); the
// retired_at IS NULL predicate is cheap because the partial
// idx_deployment_targets_retired_at index (migration 000015) lets the
// planner skip the retired-row segment. I-004.
func (r *AgentRepository) CountActiveTargets(ctx context.Context, agentID string) (int, error) {
var count int
err := r.db.QueryRowContext(ctx, `
SELECT COUNT(*)
FROM deployment_targets
WHERE agent_id = $1 AND retired_at IS NULL
`, agentID).Scan(&count)
if err != nil {
return 0, fmt.Errorf("failed to count active targets for agent: %w", err)
}
return count, nil
}
// CountActiveCertificates returns the count of distinct managed_certificates
// currently deployed through one of this agent's ACTIVE deployment_targets.
// Joins certificate_target_mappings (migration 000001 line 116) →
// deployment_targets filtering on deployment_targets.agent_id=$1 AND
// deployment_targets.retired_at IS NULL. COUNT(DISTINCT certificate_id) so
// the same cert deployed to multiple targets on one agent counts once.
// Used purely for the preflight 409 body. I-004.
func (r *AgentRepository) CountActiveCertificates(ctx context.Context, agentID string) (int, error) {
var count int
err := r.db.QueryRowContext(ctx, `
SELECT COUNT(DISTINCT ctm.certificate_id)
FROM certificate_target_mappings ctm
JOIN deployment_targets dt ON dt.id = ctm.target_id
WHERE dt.agent_id = $1 AND dt.retired_at IS NULL
`, agentID).Scan(&count)
if err != nil {
return 0, fmt.Errorf("failed to count active certificates for agent: %w", err)
}
return count, nil
}
// CountPendingJobs returns the number of jobs belonging to this agent whose
// status is in (Pending, AwaitingCSR, AwaitingApproval, Running) — the four
// statuses that represent work the agent would still be expected to pick up
// or complete. Completed / Failed / Cancelled jobs do not count toward the
// preflight gate. Status strings match domain.JobStatus* constants in
// internal/domain/job.go:43-49. Hits idx_jobs_agent_id (migration 000001
// line 161). I-004.
func (r *AgentRepository) CountPendingJobs(ctx context.Context, agentID string) (int, error) {
var count int
err := r.db.QueryRowContext(ctx, `
SELECT COUNT(*)
FROM jobs
WHERE agent_id = $1
AND status IN ('Pending', 'AwaitingCSR', 'AwaitingApproval', 'Running')
`, agentID).Scan(&count)
if err != nil {
return 0, fmt.Errorf("failed to count pending jobs for agent: %w", err)
}
return count, nil
}
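The three Count* methods feed a preflight decision whose outcomes are listed in the MCP tool comment (200, 204, 400, 403, 409). The sketch below shows one way those outcomes could be derived from the counts. The function retireStatus, the dependencyCounts type, and in particular the order of the checks are assumptions made for illustration; the source names the status codes but not the handler's exact evaluation order.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// dependencyCounts mirrors the three preflight counters (hypothetical type).
type dependencyCounts struct {
	ActiveTargets      int
	ActiveCertificates int
	PendingJobs        int
}

// sentinelAgentIDs lists the four reserved discovery agents named in the
// commit message.
var sentinelAgentIDs = map[string]bool{
	"server-scanner": true,
	"cloud-aws-sm":   true,
	"cloud-azure-kv": true,
	"cloud-gcp-sm":   true,
}

// retireStatus returns the HTTP status a retire request would receive under
// the ASSUMED check order: sentinel, force-without-reason, already retired,
// blocked by dependencies, then success.
func retireStatus(id string, alreadyRetired, force bool, reason string, c dependencyCounts) int {
	blocked := c.ActiveTargets > 0 || c.ActiveCertificates > 0 || c.PendingJobs > 0
	switch {
	case sentinelAgentIDs[id]:
		return http.StatusForbidden
	case force && strings.TrimSpace(reason) == "":
		return http.StatusBadRequest
	case alreadyRetired:
		return http.StatusNoContent
	case blocked && !force:
		return http.StatusConflict
	default:
		return http.StatusOK
	}
}

func main() {
	busy := dependencyCounts{ActiveTargets: 3, ActiveCertificates: 7, PendingJobs: 2}
	fmt.Println(retireStatus("ag-1", false, false, "", busy))                      // 409
	fmt.Println(retireStatus("ag-1", false, true, "decommissioning rack 7", busy)) // 200
	fmt.Println(retireStatus("server-scanner", false, true, "cleanup", busy))      // 403
}
```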
// scanAgent scans an agent from a row or rows.
//
// I-004: the column list here is the authoritative 13-field post-M15 order —
// retired_at and retired_reason are appended at the tail as nullable
// *time.Time / *string scan targets matching the `json:"...,omitempty"` domain
// fields. Every SELECT in this file that feeds scanAgent must emit columns in
// this same order, otherwise Scan will silently place values into the wrong
// fields (lib/pq does positional binding, not named).
func scanAgent(scanner interface {
Scan(...interface{}) error
}) (*domain.Agent, error) {
var agent domain.Agent
err := scanner.Scan(&agent.ID, &agent.Name, &agent.Hostname, &agent.Status,
&agent.LastHeartbeatAt, &agent.RegisteredAt, &agent.APIKeyHash,
-&agent.OS, &agent.Architecture, &agent.IPAddress, &agent.Version)
+&agent.OS, &agent.Architecture, &agent.IPAddress, &agent.Version,
+&agent.RetiredAt, &agent.RetiredReason)
if err != nil {
return nil, fmt.Errorf("failed to scan agent: %w", err)
@@ -0,0 +1,220 @@
package postgres_test
import (
"context"
"database/sql"
"os"
"path/filepath"
"strings"
"testing"
)
// TestMigration000015_AgentRetireRoundTrip is the Phase 2a Red regression test
// for I-004 ("Agent hard-delete cascades through deployment_targets + jobs").
//
// The fix depends on a new migration, 000015_agent_retire.up.sql + .down.sql,
// which must:
//
// 1. Add nullable `retired_at TIMESTAMPTZ` and `retired_reason TEXT`
// columns to the `agents` table. These mirror the revoked_at /
// revocation_reason pair on managed_certificates (migration 000005).
//
// 2. Add nullable `retired_at TIMESTAMPTZ` and `retired_reason TEXT` columns
// to `deployment_targets`. When an agent is retired with cascade=true,
// its deployment_targets must be soft-retired (not deleted) so audit
// history — who deployed what to where, when — stays intact.
//
// 3. FLIP the foreign key on `deployment_targets.agent_id → agents.id`
// from `ON DELETE CASCADE` (migration 000001, line 104) to
// `ON DELETE RESTRICT`. This is the fail-closed change that makes a
// bare `DELETE FROM agents WHERE id = $1` blow up at the DB layer
// instead of silently vaporising every deployment_target row. Today
// the CASCADE means the audit trail gets shredded with zero warning.
//
// The round-trip also validates that the down migration cleanly reverses all
// three changes, so an operator who lands on a rollback can still boot the
// server. Red-until-Green: this test compiles but fails until
// migrations/000015_agent_retire.up.sql + .down.sql exist with the right
// schema. No test change is needed at that point: `freshSchema(t)` runs every
// `.up.sql` in lexical order, so the new migration is picked up automatically
// once Phase 2b creates the files.
func TestMigration000015_AgentRetireRoundTrip(t *testing.T) {
tdb := getTestDB(t)
db := tdb.freshSchema(t)
ctx := context.Background()
// ─── Stage 1: Post-up assertions ─────────────────────────────────────
//
// After all .up.sql migrations (including the new 000015) have run, the
// new columns and the flipped FK must be observable in the catalog.
assertColumnExists(t, db, "agents", "retired_at")
assertColumnExists(t, db, "agents", "retired_reason")
assertColumnExists(t, db, "deployment_targets", "retired_at")
assertColumnExists(t, db, "deployment_targets", "retired_reason")
// The FK on deployment_targets.agent_id must be RESTRICT (confdeltype='r'),
// not CASCADE (confdeltype='c'). This is the core fail-closed guarantee
// that fixes I-004 at the storage layer.
assertFKDeleteRule(t, db, "deployment_targets", "agent_id", "r")
// The FK on jobs.agent_id is already SET NULL (confdeltype='n') per
// migration 000001 line 146. Pin that it never becomes CASCADE: SET NULL and
// RESTRICT both preserve audit history, so this assertion fails only on 'c'.
assertFKDeleteRuleNot(t, db, "jobs", "agent_id", "c")
// ─── Stage 2: Run the 000015 down migration manually ─────────────────
//
// testutil_test.go's runMigrations helper only runs *.up.sql. To exercise
// the down migration I read and execute it by hand, then re-check the
// catalog.
downSQL := readMigrationFile(t, "000015_agent_retire.down.sql")
if _, err := db.ExecContext(ctx, downSQL); err != nil {
t.Fatalf("000015 down migration failed: %v", err)
}
// ─── Stage 3: Post-down assertions ───────────────────────────────────
//
// After the down migration the columns must be gone and the FK restored
// to CASCADE.
assertColumnGone(t, db, "agents", "retired_at")
assertColumnGone(t, db, "agents", "retired_reason")
assertColumnGone(t, db, "deployment_targets", "retired_at")
assertColumnGone(t, db, "deployment_targets", "retired_reason")
assertFKDeleteRule(t, db, "deployment_targets", "agent_id", "c")
// ─── Stage 4: Re-run the up migration for idempotency ────────────────
//
// The up migration must be safely re-runnable: operators sometimes
// re-apply it by hand after a partial rollback. That requires guarded DDL
// (ADD COLUMN IF NOT EXISTS, DROP CONSTRAINT IF EXISTS before ADD
// CONSTRAINT, CREATE INDEX IF NOT EXISTS).
upSQL := readMigrationFile(t, "000015_agent_retire.up.sql")
if _, err := db.ExecContext(ctx, upSQL); err != nil {
t.Fatalf("000015 up migration re-apply failed (must be idempotent): %v", err)
}
assertColumnExists(t, db, "agents", "retired_at")
assertColumnExists(t, db, "agents", "retired_reason")
assertColumnExists(t, db, "deployment_targets", "retired_at")
assertColumnExists(t, db, "deployment_targets", "retired_reason")
assertFKDeleteRule(t, db, "deployment_targets", "agent_id", "r")
}
// ─── Catalog helpers ──────────────────────────────────────────────────────
//
// These helpers scope every catalog query to the schema the test is actually
// running in by filtering on current_schema(). Without that filter, a test
// running in schema test_xyz would accidentally inspect the public schema
// and green-light drift.
func assertColumnExists(t *testing.T, db *sql.DB, table, column string) {
t.Helper()
var exists bool
err := db.QueryRowContext(context.Background(), `
SELECT EXISTS (
SELECT 1 FROM information_schema.columns
WHERE table_schema = current_schema()
AND table_name = $1
AND column_name = $2
)`, table, column).Scan(&exists)
if err != nil {
t.Fatalf("column existence query failed for %s.%s: %v", table, column, err)
}
if !exists {
t.Errorf("expected column %s.%s to exist after 000015 up (migration missing or drifted)", table, column)
}
}
func assertColumnGone(t *testing.T, db *sql.DB, table, column string) {
t.Helper()
var exists bool
err := db.QueryRowContext(context.Background(), `
SELECT EXISTS (
SELECT 1 FROM information_schema.columns
WHERE table_schema = current_schema()
AND table_name = $1
AND column_name = $2
)`, table, column).Scan(&exists)
if err != nil {
t.Fatalf("column existence query failed for %s.%s: %v", table, column, err)
}
if exists {
t.Errorf("expected column %s.%s to be removed after 000015 down (down migration is incomplete)", table, column)
}
}
// assertFKDeleteRule asserts that the foreign key covering `table.column`
// (i.e. the FK whose constrained column matches) has the expected
// `confdeltype`. Per pg_constraint docs: 'r' = RESTRICT, 'c' = CASCADE,
// 'n' = SET NULL, 'd' = SET DEFAULT, 'a' = NO ACTION.
func assertFKDeleteRule(t *testing.T, db *sql.DB, table, column, want string) {
t.Helper()
got := lookupFKDeleteRule(t, db, table, column)
if got != want {
t.Errorf("FK on %s(%s): confdeltype=%q want %q (RESTRICT='r', CASCADE='c', SET NULL='n')",
table, column, got, want)
}
}
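When one of these assertions fails, the single-letter confdeltype codes are easier to read as action names. The sketch below is illustrative only: `fkDeleteActionName` is a hypothetical helper that does not exist in this commit, and its mapping simply restates the codes listed in the comment on assertFKDeleteRule.

```go
package main

import "fmt"

// fkDeleteActionName decodes a pg_constraint.confdeltype code into the
// ON DELETE action it stands for. Hypothetical helper, not part of the commit.
func fkDeleteActionName(code string) string {
	switch code {
	case "a":
		return "NO ACTION"
	case "r":
		return "RESTRICT"
	case "c":
		return "CASCADE"
	case "n":
		return "SET NULL"
	case "d":
		return "SET DEFAULT"
	default:
		// Surface unexpected codes instead of guessing.
		return "UNKNOWN(" + code + ")"
	}
}

func main() {
	for _, code := range []string{"r", "c", "n"} {
		fmt.Printf("confdeltype=%q means ON DELETE %s\n", code, fkDeleteActionName(code))
	}
}
```

A helper along these lines could be used inside the t.Errorf messages so a failure names the action directly rather than the catalog code.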
// assertFKDeleteRuleNot is the negative form — used for jobs.agent_id where
// multiple confdeltype values are acceptable (SET NULL and RESTRICT both
// preserve audit history) but CASCADE is strictly forbidden.
func assertFKDeleteRuleNot(t *testing.T, db *sql.DB, table, column, disallowed string) {
t.Helper()
got := lookupFKDeleteRule(t, db, table, column)
if got == disallowed {
t.Errorf("FK on %s(%s): confdeltype=%q; %q is forbidden (would destroy audit history on agent delete)",
table, column, got, disallowed)
}
}
// lookupFKDeleteRule returns the confdeltype for the FK constraint whose
// constrained table+column matches. If no FK is found it fails the test via
// t.Fatalf rather than returning an empty string, because the schema is
// supposed to have these FKs per migration 000001.
func lookupFKDeleteRule(t *testing.T, db *sql.DB, table, column string) string {
t.Helper()
// Join pg_constraint → pg_class (constrained rel) → pg_attribute
// (constrained col) → pg_namespace (schema filter). Scoped to
// current_schema() so schema-per-test isolation holds.
const q = `
SELECT c.confdeltype
FROM pg_constraint c
JOIN pg_class cl ON cl.oid = c.conrelid
JOIN pg_namespace n ON n.oid = cl.relnamespace
JOIN pg_attribute a ON a.attrelid = c.conrelid AND a.attnum = ANY(c.conkey)
WHERE n.nspname = current_schema()
AND c.contype = 'f'
AND cl.relname = $1
AND a.attname = $2
LIMIT 1
`
var confdeltype string
err := db.QueryRowContext(context.Background(), q, table, column).Scan(&confdeltype)
if err == sql.ErrNoRows {
t.Fatalf("no FK found on %s(%s) in current_schema (schema not migrated?)", table, column)
return ""
}
if err != nil {
t.Fatalf("FK lookup for %s(%s) failed: %v", table, column, err)
return ""
}
return confdeltype
}
// readMigrationFile locates and loads a named migration file. Uses the same
// walk-up strategy as findMigrationsDir() in testutil_test.go so both helpers
// agree on where the migrations live.
func readMigrationFile(t *testing.T, name string) string {
t.Helper()
path := filepath.Join(findMigrationsDir(), name)
data, err := os.ReadFile(path)
if err != nil {
t.Fatalf("failed to read migration file %s (expected at %s): %v", name, path, err)
}
// Defensive: a zero-byte migration file could let the caller's ExecContext
// report a false-positive "success". Refuse to trust it.
if strings.TrimSpace(string(data)) == "" {
t.Fatalf("migration file %s is empty; migration missing or truncated", name)
}
return string(data)
}
@@ -92,12 +92,27 @@ func (s *AgentService) Register(ctx context.Context, name string, hostname strin
}
// Heartbeat updates an agent's last seen time, status, and metadata.
//
// I-004: retired agents must be rejected up-front. A retired agent that is
// still polling is a zombie — its row exists only for audit history and must
// not be allowed to bump LastHeartbeatAt (which would resurrect it in stats
// dashboards and stale-offline sweeps). The sentinel ErrAgentRetired is
// returned unwrapped so the HTTP handler can map it to 410 Gone via
// errors.Is; the agent process detects the 410 and shuts down cleanly
// instead of continuing to heartbeat indefinitely.
func (s *AgentService) Heartbeat(ctx context.Context, agentID string, metadata *domain.AgentMetadata) error {
agent, err := s.agentRepo.Get(ctx, agentID)
if err != nil {
return fmt.Errorf("failed to fetch agent: %w", err)
}
// I-004 guard: retired agents are frozen. Do not call UpdateHeartbeat —
// bumping the timestamp would defeat the retired-row filter that protects
// stats, scheduler sweeps, and handler listings.
if agent.IsRetired() {
return ErrAgentRetired
}
// Update heartbeat and metadata
if err := s.agentRepo.UpdateHeartbeat(ctx, agentID, metadata); err != nil {
return fmt.Errorf("failed to update heartbeat: %w", err)
@@ -0,0 +1,317 @@
package service
import (
"context"
"errors"
"fmt"
"log/slog"
"time"
"github.com/shankar0123/certctl/internal/domain"
)
// I-004 coverage-gap closure: the agent retirement surface.
//
// Before 000015, DELETE /api/v1/agents/{id} hard-deleted the agents row and
// the deployment_targets.agent_id FK CASCADE cleaned up downstream rows with
// no preflight, no archival, and no knowledge of in-flight jobs. Any cert
// still rotating through one of those targets would observe half-migrated
// state. I-004 closes that gap with a preflight + soft-retire + optional
// forced-cascade contract; the symbols in this file are the service-layer
// surface that the handler and operator UI bind against.
// ErrAgentIsSentinel is returned when an operator tries to retire one of the
// four reserved sentinel agent IDs (server-scanner, cloud-aws-sm,
// cloud-azure-kv, cloud-gcp-sm). These rows back the network scanner and the
// three cloud secret-manager discovery sources; retiring any of them orphans
// its subsystem. The guard fires unconditionally — force=true does not bypass
// it, because a sentinel is a structural invariant of the deployment, not
// a piece of fleet state the operator owns. Handler maps this to HTTP 403.
var ErrAgentIsSentinel = errors.New("agent is a reserved sentinel and cannot be retired")
// ErrBlockedByDependencies is returned by RetireAgent when at least one of
// (active targets, active certificates, pending jobs) referencing the agent
// is non-zero and force=false. The caller always receives it wrapped in
// a *BlockedByDependenciesError (see below), so handlers doing errors.As
// can surface the per-bucket counts in the 409 body for operator
// troubleshooting. Tests use errors.Is; handlers use errors.As.
var ErrBlockedByDependencies = errors.New("agent has active downstream dependencies")
// ErrForceReasonRequired is returned when force=true is supplied without a
// non-empty reason. The force escape hatch is deliberately chatty: operators
// pulling the emergency cord must leave an auditable breadcrumb explaining
// why a cascade was justified. Handler maps this to HTTP 400 so the operator
// retries with --reason rather than silently skipping the guard. Checked
// before any DB mutation to keep the no-reason path transactionally clean.
var ErrForceReasonRequired = errors.New("force=true requires a non-empty reason")
// ErrAgentRetired is returned by Heartbeat (and any future agent-authenticated
// call site) when a retired agent is still polling. The handler layer maps
// this to HTTP 410 Gone so the cmd/agent sendHeartbeat loop can detect it
// deterministically and shut down the agent process, rather than looping
// forever on a soft-retired identity. IsRetired() on the domain model is
// the single source of truth; the sentinel exists so service and handler
// callers can errors.Is against one symbol.
var ErrAgentRetired = errors.New("agent has been retired")
// BlockedByDependenciesError wraps ErrBlockedByDependencies and carries the
// per-bucket dependency snapshot the preflight pass captured. The embedded
// AgentDependencyCounts is the same struct the repo returns from the three
// CountActive* calls, so the handler can marshal it directly into the 409
// body without reshaping fields. Unwrap() satisfies errors.Is against the
// sentinel; Error() includes the counts so logs are diagnostic on their own.
type BlockedByDependenciesError struct {
Counts domain.AgentDependencyCounts
}
// Error formats the wrapped error with the per-bucket counts. Kept short so
// it reads cleanly in slog output.
func (e *BlockedByDependenciesError) Error() string {
return fmt.Sprintf(
"%s (active_targets=%d, active_certificates=%d, pending_jobs=%d)",
ErrBlockedByDependencies.Error(),
e.Counts.ActiveTargets,
e.Counts.ActiveCertificates,
e.Counts.PendingJobs,
)
}
// Unwrap lets errors.Is(err, ErrBlockedByDependencies) match the wrapped
// struct — the test contract (agent_retire_test.go:167) depends on it.
func (e *BlockedByDependenciesError) Unwrap() error { return ErrBlockedByDependencies }
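The split between errors.Is for tests and errors.As for handlers can be seen in isolation. The following is a minimal sketch with stand-in types: `dependencyCounts`, `blockedByDependenciesError`, and `blockedCounts` are local illustrations, and the `int` field types are an assumption because domain.AgentDependencyCounts is not shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// errBlockedByDependencies stands in for the service-layer sentinel.
var errBlockedByDependencies = errors.New("agent has active downstream dependencies")

// dependencyCounts stands in for domain.AgentDependencyCounts (field types assumed).
type dependencyCounts struct {
	ActiveTargets      int
	ActiveCertificates int
	PendingJobs        int
}

// blockedByDependenciesError carries the counts and unwraps to the sentinel.
type blockedByDependenciesError struct {
	Counts dependencyCounts
}

func (e *blockedByDependenciesError) Error() string {
	return fmt.Sprintf("%s (active_targets=%d, active_certificates=%d, pending_jobs=%d)",
		errBlockedByDependencies.Error(),
		e.Counts.ActiveTargets, e.Counts.ActiveCertificates, e.Counts.PendingJobs)
}

func (e *blockedByDependenciesError) Unwrap() error { return errBlockedByDependencies }

// blockedCounts extracts the counts from anywhere in err's chain, as a
// handler building a 409 body might do.
func blockedCounts(err error) (dependencyCounts, bool) {
	var blocked *blockedByDependenciesError
	if errors.As(err, &blocked) {
		return blocked.Counts, true
	}
	return dependencyCounts{}, false
}

func main() {
	var err error = &blockedByDependenciesError{
		Counts: dependencyCounts{ActiveTargets: 3, ActiveCertificates: 7, PendingJobs: 2},
	}
	err = fmt.Errorf("retire agent: %w", err) // extra wrapping does not break either check

	fmt.Println("matches sentinel:", errors.Is(err, errBlockedByDependencies))
	if counts, ok := blockedCounts(err); ok {
		fmt.Printf("counts: %+v\n", counts)
	}
}
```

Both checks walk the same Unwrap chain, which is why a single Unwrap method is enough to satisfy the two kinds of caller.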
// AgentRetirementResult is the outcome surface the handler returns to the
// operator. It discriminates the three happy paths the endpoint can take:
// idempotent no-op (AlreadyRetired), clean soft-retire (Cascade=false), and
// forced cascade (Cascade=true). RetiredAt is always set. Counts carries the
// preflight dependency snapshot on the two mutating paths so the 200 response
// body can echo what was affected; it is left zero-valued on the
// AlreadyRetired path, which skips preflight and maps to a bodiless 204.
//
// AlreadyRetired=true → agent was already retired; no new audit
// event was emitted; RetiredAt is the
// original stamp, not the current time.
// Cascade=false → clean soft-retire; Counts is all zeros.
// Cascade=true → force=true retired agent + downstream
// targets; Counts is the PRE-cascade
// snapshot (so the operator sees what
// they just retired).
type AgentRetirementResult struct {
AlreadyRetired bool
Cascade bool
RetiredAt time.Time
Counts domain.AgentDependencyCounts
}
// RetireAgent implements the I-004 retirement contract. Ordering matters:
// every guard fires before the first step that mutates state, so a rejected
// retire leaves zero trace (no audit event, no partial DB write):
//
// 1. Sentinel check (unconditional; force does not bypass).
// 2. Fetch agent (a not-found error from the repo is wrapped with %w and
// returned, so the handler can still map it to 404).
// 3. Already-retired idempotency: return AlreadyRetired=true with NO new
// audit event — the original retire already recorded one.
// 4. Preflight count pass via the three CountActive* repo methods.
// 5. Force-reason guard: force=true with empty reason is rejected here,
// after the counts are known but before any mutation.
// 6. Default no-force path: any non-zero count returns
// *BlockedByDependenciesError with counts attached.
// 7. Mutation: SoftRetire (no cascade) or RetireAgentWithCascade, with
// a single retiredAt timestamp pinned BEFORE the repo call so the
// audit event and the DB row agree (to PostgreSQL's microsecond
// TIMESTAMPTZ resolution; Go's sub-microsecond digits are not stored).
// 8. Audit: agent_retired always; agent_retirement_cascaded additionally
// on the force=true cascade path.
//
// Actor comes from the handler's resolveActor (API key → user, agent key →
// agent-<id>, unauthenticated → "anonymous"); the service does not
// second-guess it. Audit emission is best-effort: a failed RecordEvent is
// logged via slog.Error but does not fail the overall retirement, consistent
// with how the rest of the codebase treats audit as an observability concern
// rather than a correctness barrier.
func (s *AgentService) RetireAgent(ctx context.Context, id string, actor string, force bool, reason string) (*AgentRetirementResult, error) {
// Step 1 — reserved-sentinel guard. Applies even under force=true.
if domain.IsSentinelAgent(id) {
return nil, ErrAgentIsSentinel
}
// Step 2 — existence check. Missing agent surfaces the repo's not-found
// error wrapped with %w (errors.Is still matches it, and the message text
// is preserved) so the handler can map it to 404 via its existing
// detection path (the handler layer already has "not found" mapping
// logic inherited from the pre-I-004 Delete endpoint).
agent, err := s.agentRepo.Get(ctx, id)
if err != nil {
return nil, fmt.Errorf("failed to fetch agent: %w", err)
}
// Step 3 — idempotency. A retired agent returns AlreadyRetired=true
// WITHOUT emitting a fresh audit event. Handler maps this to HTTP 204.
// Guarding here (before preflight) means a re-retire of an agent that
// now has zero deps doesn't spuriously "succeed again" and double-log.
if agent.IsRetired() {
return &AgentRetirementResult{
AlreadyRetired: true,
RetiredAt: *agent.RetiredAt,
}, nil
}
// Step 4 — preflight counts. All three run even when force=true: we
// need them to populate AgentRetirementResult.Counts (the pre-cascade
// snapshot). A repo failure here aborts the whole operation — partial
// preflight is worse than no preflight.
counts, err := s.collectAgentDependencyCounts(ctx, id)
if err != nil {
return nil, fmt.Errorf("failed to collect agent dependency counts: %w", err)
}
// Step 5 — force-reason guard. Positioned AFTER preflight so operators
// who forgot --reason still see accurate counts when they retry. The
// empty-reason rejection fires before any mutation, so the rejected
// attempt leaves no audit noise.
if force && reason == "" {
return nil, ErrForceReasonRequired
}
// Step 6 — default path: block on any non-zero bucket. Wrapping the
// sentinel in *BlockedByDependenciesError lets the handler use errors.As
// to surface counts in the 409 body while tests use errors.Is against
// the sentinel. Both callers are satisfied by the single Unwrap chain.
if !force && counts.HasDependencies() {
return nil, &BlockedByDependenciesError{Counts: counts}
}
// Step 7 — mutation. Pin retiredAt once so the audit event, the agent
// row, and (on cascade) every deployment_targets row share the same
// timestamp. Callers querying "what happened at T?" can correlate
// retirement rows across tables without clock-skew tie-breaking.
retiredAt := time.Now()
cascade := force && counts.HasDependencies()
if cascade {
if err := s.agentRepo.RetireAgentWithCascade(ctx, id, retiredAt, reason); err != nil {
return nil, fmt.Errorf("failed to retire agent with cascade: %w", err)
}
} else {
if err := s.agentRepo.SoftRetire(ctx, id, retiredAt, reason); err != nil {
return nil, fmt.Errorf("failed to soft-retire agent: %w", err)
}
}
// Step 8 — audit. Two events on the cascade path so forensics can
// distinguish "agent was retired" (agent_retired) from "downstream
// targets were flipped" (agent_retirement_cascaded). Details on the
// cascaded event carry the pre-cascade counts so a reviewer looking
// only at the audit log knows how much state was affected. Emission
// is best-effort — audit is observability, not a correctness barrier.
actorType := s.resolveActorType(actor)
details := map[string]interface{}{
"actor": actor,
"reason": reason,
"force": force,
"active_targets": counts.ActiveTargets,
"active_certificates": counts.ActiveCertificates,
"pending_jobs": counts.PendingJobs,
}
if err := s.auditService.RecordEvent(ctx, actor, actorType,
"agent_retired", "agent", id, details); err != nil {
slog.Error("failed to record agent_retired audit event", "agent_id", id, "error", err)
}
if cascade {
cascadeDetails := map[string]interface{}{
"actor": actor,
"reason": reason,
"active_targets": counts.ActiveTargets,
"active_certificates": counts.ActiveCertificates,
"pending_jobs": counts.PendingJobs,
}
if err := s.auditService.RecordEvent(ctx, actor, actorType,
"agent_retirement_cascaded", "agent", id, cascadeDetails); err != nil {
slog.Error("failed to record agent_retirement_cascaded audit event", "agent_id", id, "error", err)
}
}
return &AgentRetirementResult{
AlreadyRetired: false,
Cascade: cascade,
RetiredAt: retiredAt,
Counts: counts,
}, nil
}
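The guard ordering in RetireAgent can be summarised as a pure decision function. This is a simplified sketch, not the service code: `retireInput` and `decideRetire` are hypothetical names, and the repo fetch (step 2) and preflight query errors (step 4) are deliberately left out so only the branching remains.

```go
package main

import "fmt"

// retireInput collects the facts the guards look at. Hypothetical type.
type retireInput struct {
	IsSentinel     bool
	AlreadyRetired bool
	HasDeps        bool
	Force          bool
	Reason         string
}

// decideRetire mirrors the order of the guards: the first matching case wins,
// and only the last two outcomes would mutate state.
func decideRetire(in retireInput) string {
	switch {
	case in.IsSentinel:
		return "reject: sentinel" // force does not bypass this
	case in.AlreadyRetired:
		return "no-op: already retired" // no new audit event
	case in.Force && in.Reason == "":
		return "reject: force requires reason"
	case !in.Force && in.HasDeps:
		return "reject: blocked by dependencies"
	case in.Force && in.HasDeps:
		return "retire: cascade"
	default:
		return "retire: soft"
	}
}

func main() {
	cases := []retireInput{
		{IsSentinel: true, Force: true, Reason: "forced by operator"},
		{HasDeps: true},
		{HasDeps: true, Force: true},
		{HasDeps: true, Force: true, Reason: "decommissioning rack 7"},
		{Force: true, Reason: "cleanup"},
	}
	for _, c := range cases {
		fmt.Printf("%+v -> %s\n", c, decideRetire(c))
	}
}
```

Note the last case: force=true with no dependencies falls through to a plain soft-retire, matching `cascade := force && counts.HasDependencies()` in the function above.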
// ListRetiredAgents returns the paginated list of retired agents in
// retired_at DESC order. This is the companion to ListAgents — which
// hides retired rows — so the operator UI can render a dedicated
// "Retired" tab without leaking retired rows into every other listing.
// Pagination defaults (page<1→1, perPage<1→50) are applied here as
// well as in the repo, so callers can pass 0s when they want defaults.
//
// Return shape harmonizes with handler.AgentService: a value slice
// (not pointer slice) and int64 total. The repo returns []*domain.Agent;
// this method dereferences into a value slice so the handler's
// PagedResponse marshals straight objects and so the compile-time
// interface assertion in agent_retire_handler_test.go:387 is satisfied.
// Nil repo entries are skipped defensively — the repo should never
// return them, but the handler contract is more important than the
// repo's (pointer-slice) convenience.
func (s *AgentService) ListRetiredAgents(ctx context.Context, page, perPage int) ([]domain.Agent, int64, error) {
if page < 1 {
page = 1
}
if perPage < 1 {
perPage = 50
}
agents, total, err := s.agentRepo.ListRetired(ctx, page, perPage)
if err != nil {
return nil, 0, fmt.Errorf("failed to list retired agents: %w", err)
}
out := make([]domain.Agent, 0, len(agents))
for _, a := range agents {
if a == nil {
continue
}
out = append(out, *a)
}
return out, int64(total), nil
}
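The pagination defaults applied in ListRetiredAgents can be sketched on their own. The names `normalizePage` and `pageOffset` are hypothetical, and the offset formula is an assumption about how a repo would typically translate page numbers into SQL OFFSET; the actual ListRetired query is not shown in this commit view.

```go
package main

import "fmt"

// defaultPerPage matches the perPage<1 → 50 default described above.
const defaultPerPage = 50

// normalizePage applies the defaults: page<1 becomes 1, perPage<1 becomes 50.
func normalizePage(page, perPage int) (int, int) {
	if page < 1 {
		page = 1
	}
	if perPage < 1 {
		perPage = defaultPerPage
	}
	return page, perPage
}

// pageOffset computes the row offset for a 1-based page number.
// Assumed convention; the real repo may differ.
func pageOffset(page, perPage int) int {
	page, perPage = normalizePage(page, perPage)
	return (page - 1) * perPage
}

func main() {
	page, perPage := normalizePage(0, 0)
	fmt.Println("defaults:", page, perPage)
	fmt.Println("offset for page 3 of 20:", pageOffset(3, 20))
}
```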
// collectAgentDependencyCounts runs the three preflight COUNT queries in
// sequence and bundles the result. Sequential (not parallel) because the
// queries are cheap (<1ms each on the indexed columns added in 000015) and
// sequential keeps error handling simple. Any repo error short-circuits:
// we would rather refuse the retire than make a half-informed decision.
func (s *AgentService) collectAgentDependencyCounts(ctx context.Context, id string) (domain.AgentDependencyCounts, error) {
var counts domain.AgentDependencyCounts
targets, err := s.agentRepo.CountActiveTargets(ctx, id)
if err != nil {
return counts, fmt.Errorf("count active targets: %w", err)
}
counts.ActiveTargets = targets
certs, err := s.agentRepo.CountActiveCertificates(ctx, id)
if err != nil {
return counts, fmt.Errorf("count active certificates: %w", err)
}
counts.ActiveCertificates = certs
jobs, err := s.agentRepo.CountPendingJobs(ctx, id)
if err != nil {
return counts, fmt.Errorf("count pending jobs: %w", err)
}
counts.PendingJobs = jobs
return counts, nil
}
// resolveActorType maps an opaque actor string into the typed ActorType
// used by the audit schema. Matches the conventions the rest of the
// service layer uses: "system" → System, anything that looks like an
// agent identity → Agent, everything else → User.
func (s *AgentService) resolveActorType(actor string) domain.ActorType {
switch {
case actor == "system":
return domain.ActorTypeSystem
case len(actor) > 6 && actor[:6] == "agent-":
return domain.ActorTypeAgent
default:
return domain.ActorTypeUser
}
}
@@ -0,0 +1,396 @@
package service
import (
"context"
"errors"
"log/slog"
"testing"
"time"
"github.com/shankar0123/certctl/internal/domain"
)
// setupRetireTest wires up an AgentService with a single registered agent and
// returns (service, agentRepo, auditRepo) so tests can seed state and assert
// audit events. Kept minimal — tests that need targets/jobs/certs extend the
// returned repos directly.
func setupRetireTest(t *testing.T, agentID string) (*AgentService, *mockAgentRepo, *mockAuditRepo) {
t.Helper()
now := time.Now()
agent := &domain.Agent{
ID: agentID,
Name: "prod-agent",
Hostname: "server-01",
Status: domain.AgentStatusOnline,
RegisteredAt: now,
LastHeartbeatAt: &now,
APIKeyHash: "hash-" + agentID,
}
agentRepo := newMockAgentRepository()
agentRepo.AddAgent(agent)
certRepo := &mockCertRepo{
Certs: make(map[string]*domain.ManagedCertificate),
Versions: make(map[string][]*domain.CertificateVersion),
}
jobRepo := &mockJobRepo{
Jobs: make(map[string]*domain.Job),
StatusUpdates: make(map[string]domain.JobStatus),
}
targetRepo := &mockTargetRepo{
Targets: make(map[string]*domain.DeploymentTarget),
}
auditRepo := &mockAuditRepo{Events: []*domain.AuditEvent{}}
auditService := NewAuditService(auditRepo)
issuerRegistry := NewIssuerRegistry(slog.Default())
svc := NewAgentService(agentRepo, certRepo, jobRepo, targetRepo, auditService, issuerRegistry, nil)
return svc, agentRepo, auditRepo
}
// TestRetireAgent_Sentinel_Rejected covers I-004's sentinel guard. The four
// well-known sentinel agent IDs back discovery sources and the network scanner
// — retiring them would orphan those subsystems. Contract: reject with
// ErrAgentIsSentinel regardless of force/reason.
func TestRetireAgent_Sentinel_Rejected(t *testing.T) {
sentinels := []string{"server-scanner", "cloud-aws-sm", "cloud-azure-kv", "cloud-gcp-sm"}
for _, id := range sentinels {
t.Run(id, func(t *testing.T) {
svc, _, _ := setupRetireTest(t, id)
_, err := svc.RetireAgent(context.Background(), id, "alice", false, "")
if !errors.Is(err, ErrAgentIsSentinel) {
t.Fatalf("retire(sentinel %q) err=%v want ErrAgentIsSentinel", id, err)
}
// Sentinel rejection must be deterministic even under force=true.
_, err = svc.RetireAgent(context.Background(), id, "alice", true, "forced by operator")
if !errors.Is(err, ErrAgentIsSentinel) {
t.Fatalf("retire(sentinel %q force=true) err=%v want ErrAgentIsSentinel", id, err)
}
})
}
}
// TestRetireAgent_NotFound covers the 404 preflight path. The handler maps
// the repo's not-found error (an ErrAgentNotFound equivalent) to 404; the
// service must surface it cleanly without partial state mutation.
func TestRetireAgent_NotFound(t *testing.T) {
svc, _, _ := setupRetireTest(t, "agent-001")
_, err := svc.RetireAgent(context.Background(), "agent-does-not-exist", "alice", false, "")
if err == nil {
t.Fatalf("retire(missing id) err=nil want not-found error")
}
}
// TestRetireAgent_AlreadyRetired_Idempotent covers the 204 No Content path.
// Retiring an already-retired agent must succeed without error and without
// emitting a new audit event (the first retirement already recorded one).
// Idempotency matters because the handler is the escape hatch for operators
// re-issuing a failed retire after a partial failure mid-cascade.
func TestRetireAgent_AlreadyRetired_Idempotent(t *testing.T) {
svc, agentRepo, auditRepo := setupRetireTest(t, "agent-001")
past := time.Now().Add(-24 * time.Hour)
reason := "operator decommissioned"
agent := agentRepo.Agents["agent-001"]
agent.RetiredAt = &past
agent.RetiredReason = &reason
result, err := svc.RetireAgent(context.Background(), "agent-001", "alice", false, "")
if err != nil {
t.Fatalf("retire(already retired) err=%v want nil (idempotent)", err)
}
if result == nil || !result.AlreadyRetired {
t.Fatalf("retire(already retired) result=%+v want AlreadyRetired=true", result)
}
// Retire-on-retired must not emit a duplicate audit event.
for _, e := range auditRepo.Events {
if e.Action == "agent_retired" && e.ResourceID == "agent-001" {
t.Fatalf("retire(already retired) emitted duplicate agent_retired audit event")
}
}
}
// TestRetireAgent_NoDeps_SoftSucceeds covers the happy 200 path: no active
// targets, certs, or jobs referencing the agent. Soft-retire stamps
// RetiredAt + RetiredReason and emits an agent_retired audit event.
func TestRetireAgent_NoDeps_SoftSucceeds(t *testing.T) {
svc, agentRepo, auditRepo := setupRetireTest(t, "agent-001")
before := time.Now().Add(-time.Second)
result, err := svc.RetireAgent(context.Background(), "agent-001", "alice", false, "")
if err != nil {
t.Fatalf("retire(clean) err=%v want nil", err)
}
if result == nil {
t.Fatal("retire(clean) result=nil want non-nil")
}
if result.AlreadyRetired {
t.Fatalf("retire(clean) result.AlreadyRetired=true want false")
}
if result.Cascade {
t.Fatalf("retire(clean) result.Cascade=true want false (no deps to cascade)")
}
if !result.RetiredAt.After(before) {
t.Fatalf("retire(clean) RetiredAt=%v not after test start %v", result.RetiredAt, before)
}
agent := agentRepo.Agents["agent-001"]
if agent.RetiredAt == nil {
t.Fatalf("retire(clean) agent.RetiredAt=nil want stamped")
}
// Audit event must be emitted with action=agent_retired, actor=alice.
found := false
for _, e := range auditRepo.Events {
if e.Action == "agent_retired" && e.ResourceID == "agent-001" && e.Actor == "alice" {
found = true
break
}
}
if !found {
t.Fatalf("retire(clean) missing agent_retired audit event for alice, events=%+v", auditRepo.Events)
}
}
// TestRetireAgent_WithDeps_NoForce_Blocked covers the 409 preflight path. When
// the agent has any of: active non-retired targets, certs deployed via those
// targets, or pending jobs — a default retire must block with
// ErrBlockedByDependencies and the counts must be reachable via errors.As so
// the handler can build the 409 body.
func TestRetireAgent_WithDeps_NoForce_Blocked(t *testing.T) {
svc, agentRepo, _ := setupRetireTest(t, "agent-001")
// Seed dependency counts directly on the mock — the production repo
// implements CountActive* queries; the mock exposes them as fields.
agentRepo.ActiveTargetCounts["agent-001"] = 3
agentRepo.ActiveCertCounts["agent-001"] = 7
agentRepo.PendingJobCounts["agent-001"] = 2
_, err := svc.RetireAgent(context.Background(), "agent-001", "alice", false, "")
if !errors.Is(err, ErrBlockedByDependencies) {
t.Fatalf("retire(with deps, no force) err=%v want ErrBlockedByDependencies", err)
}
var blocked *BlockedByDependenciesError
if !errors.As(err, &blocked) {
t.Fatalf("retire(with deps) err=%v want wrapped *BlockedByDependenciesError", err)
}
if blocked.Counts.ActiveTargets != 3 {
t.Errorf("blocked.Counts.ActiveTargets=%d want 3", blocked.Counts.ActiveTargets)
}
if blocked.Counts.ActiveCertificates != 7 {
t.Errorf("blocked.Counts.ActiveCertificates=%d want 7", blocked.Counts.ActiveCertificates)
}
if blocked.Counts.PendingJobs != 2 {
t.Errorf("blocked.Counts.PendingJobs=%d want 2", blocked.Counts.PendingJobs)
}
// Agent must still be un-retired after preflight block.
if agentRepo.Agents["agent-001"].RetiredAt != nil {
t.Fatalf("retire(blocked) left RetiredAt stamped; preflight must be transactionally safe")
}
}
// TestRetireAgent_WithDeps_Force_NoReason_Rejected covers the 400 guard on the
// force escape hatch. Operators using force=true must supply a justifying
// reason; empty reason is rejected before any DB mutation.
func TestRetireAgent_WithDeps_Force_NoReason_Rejected(t *testing.T) {
svc, agentRepo, _ := setupRetireTest(t, "agent-001")
agentRepo.ActiveTargetCounts["agent-001"] = 1
_, err := svc.RetireAgent(context.Background(), "agent-001", "alice", true, "")
if !errors.Is(err, ErrForceReasonRequired) {
t.Fatalf("retire(force, no reason) err=%v want ErrForceReasonRequired", err)
}
if agentRepo.Agents["agent-001"].RetiredAt != nil {
t.Fatalf("retire(force, no reason) left RetiredAt stamped; guard must fire before mutation")
}
}
// TestRetireAgent_WithDeps_Force_Cascades covers the force=true transactional
// path: agent retires, downstream targets also soft-retire with the supplied
// reason, and the result surface indicates cascade happened. Reason
// propagates to every cascaded row so post-mortem forensics can trace the
// cascade to a single operator action.
func TestRetireAgent_WithDeps_Force_Cascades(t *testing.T) {
svc, agentRepo, auditRepo := setupRetireTest(t, "agent-001")
agentRepo.ActiveTargetCounts["agent-001"] = 2
agentRepo.ActiveCertCounts["agent-001"] = 5
agentRepo.PendingJobCounts["agent-001"] = 1
reason := "decommissioning rack 7"
result, err := svc.RetireAgent(context.Background(), "agent-001", "alice", true, reason)
if err != nil {
t.Fatalf("retire(force, reason) err=%v want nil", err)
}
if result == nil {
t.Fatal("retire(force) result=nil want non-nil")
}
if !result.Cascade {
t.Fatalf("retire(force) result.Cascade=false want true")
}
if result.Counts.ActiveTargets != 2 {
t.Errorf("result.Counts.ActiveTargets=%d want 2 (pre-cascade snapshot)", result.Counts.ActiveTargets)
}
agent := agentRepo.Agents["agent-001"]
if agent.RetiredAt == nil {
t.Fatalf("retire(force) agent.RetiredAt=nil want stamped")
}
if agent.RetiredReason == nil || *agent.RetiredReason != reason {
t.Fatalf("retire(force) RetiredReason=%v want %q", agent.RetiredReason, reason)
}
// Two audit events required: agent_retired + agent_retirement_cascaded.
// The cascaded event captures which downstream resources were affected.
var haveRetired, haveCascaded bool
for _, e := range auditRepo.Events {
if e.ResourceID == "agent-001" {
switch e.Action {
case "agent_retired":
haveRetired = true
case "agent_retirement_cascaded":
haveCascaded = true
}
}
}
if !haveRetired {
t.Errorf("retire(force) missing agent_retired audit event")
}
if !haveCascaded {
t.Errorf("retire(force) missing agent_retirement_cascaded audit event")
}
}
// TestRetireAgent_EmitsAuditEvent pins the audit contract for I-004:
// every retire path that mutates DB state emits at least one audit event with
// the operator's actor identity, so post-hoc compliance/forensics can
// reconstruct who retired what and when.
func TestRetireAgent_EmitsAuditEvent(t *testing.T) {
svc, _, auditRepo := setupRetireTest(t, "agent-007")
_, err := svc.RetireAgent(context.Background(), "agent-007", "compliance-bot", false, "")
if err != nil {
t.Fatalf("retire err=%v want nil", err)
}
for _, e := range auditRepo.Events {
if e.Action == "agent_retired" && e.ResourceID == "agent-007" {
if e.Actor != "compliance-bot" {
t.Errorf("audit event Actor=%q want compliance-bot", e.Actor)
}
return
}
}
t.Fatalf("no agent_retired audit event emitted, events=%+v", auditRepo.Events)
}
// TestHeartbeat_RetiredAgent_ReturnsErrAgentRetired covers the 410 Gone
// contract. A retired agent that is still polling must be told its identity
// is no longer accepted — the agent process should detect this and shut
// down rather than continue heartbeating indefinitely.
func TestHeartbeat_RetiredAgent_ReturnsErrAgentRetired(t *testing.T) {
svc, agentRepo, _ := setupRetireTest(t, "agent-001")
past := time.Now().Add(-time.Hour)
reason := "decommissioned"
agentRepo.Agents["agent-001"].RetiredAt = &past
agentRepo.Agents["agent-001"].RetiredReason = &reason
err := svc.Heartbeat(context.Background(), "agent-001", &domain.AgentMetadata{
OS: "linux",
Architecture: "amd64",
Hostname: "server-01",
})
if !errors.Is(err, ErrAgentRetired) {
t.Fatalf("heartbeat(retired) err=%v want ErrAgentRetired", err)
}
// Retired heartbeat must NOT bump LastHeartbeatAt; otherwise the retired
// agent could resurrect itself in stats/observability dashboards.
if _, bumped := agentRepo.HeartbeatUpdates["agent-001"]; bumped {
t.Fatalf("heartbeat(retired) updated LastHeartbeatAt; retired agents must be frozen")
}
}
// TestListAgents_DefaultExcludesRetired covers the contract that the
// handler-facing ListAgents call hides retired rows by default. Otherwise
// every dashboard that paginates agents would surface retired stragglers.
// An explicit "list retired" endpoint (ListRetiredAgents) covers the audit
// use case.
func TestListAgents_DefaultExcludesRetired(t *testing.T) {
svc, agentRepo, _ := setupRetireTest(t, "agent-active")
// Seed one retired agent alongside the active one.
past := time.Now().Add(-24 * time.Hour)
reason := "old hardware"
agentRepo.AddAgent(&domain.Agent{
ID: "agent-retired",
Name: "retired-agent",
Hostname: "server-old",
Status: domain.AgentStatusOffline,
RegisteredAt: past,
APIKeyHash: "hash-retired",
RetiredAt: &past,
RetiredReason: &reason,
})
agents, total, err := svc.ListAgents(context.Background(), 1, 50)
if err != nil {
t.Fatalf("ListAgents err=%v want nil", err)
}
for _, a := range agents {
if a.ID == "agent-retired" {
t.Fatalf("ListAgents returned retired agent %q in default listing", a.ID)
}
}
if total != 1 {
t.Errorf("ListAgents total=%d want 1 (only active)", total)
}
// ListRetiredAgents must surface retired-only, with count=1.
retired, retiredTotal, err := svc.ListRetiredAgents(context.Background(), 1, 50)
if err != nil {
t.Fatalf("ListRetiredAgents err=%v want nil", err)
}
if retiredTotal != 1 {
t.Errorf("ListRetiredAgents total=%d want 1", retiredTotal)
}
if len(retired) != 1 || retired[0].ID != "agent-retired" {
t.Fatalf("ListRetiredAgents got=%+v want [agent-retired]", retired)
}
}
// TestMarkStaleAgentsOffline_SkipsRetired covers the stale-offline sweeper
// interaction with retirement. A retired agent must not be re-surfaced as
// a state transition ("Online → Offline") by the scheduler, because its
// Status column is preserved as the last-known operational state at
// retirement time and RetiredAt is the source of truth for filtering.
func TestMarkStaleAgentsOffline_SkipsRetired(t *testing.T) {
svc, agentRepo, _ := setupRetireTest(t, "agent-live")
// Active agent is currently stale (no heartbeat for 10 minutes) — eligible
// for Online→Offline transition.
stale := time.Now().Add(-10 * time.Minute)
agentRepo.Agents["agent-live"].LastHeartbeatAt = &stale
// Retired agent was also stale at retirement time, but must NOT be
// touched by the sweeper.
past := time.Now().Add(-24 * time.Hour)
reason := "hw failure"
agentRepo.AddAgent(&domain.Agent{
ID: "agent-retired",
Name: "dead-agent",
Hostname: "server-old",
Status: domain.AgentStatusOnline, // preserved last-seen status
RegisteredAt: past,
LastHeartbeatAt: &past,
APIKeyHash: "hash-dead",
RetiredAt: &past,
RetiredReason: &reason,
})
if err := svc.MarkStaleAgentsOffline(context.Background(), 5*time.Minute); err != nil {
t.Fatalf("MarkStaleAgentsOffline err=%v want nil", err)
}
// Active-stale agent should flip Online → Offline.
if got := agentRepo.Agents["agent-live"].Status; got != domain.AgentStatusOffline {
t.Errorf("agent-live Status=%s want Offline", got)
}
// Retired agent's Status column must be frozen at Online (its preserved
// last-seen state); the sweeper must skip it.
if got := agentRepo.Agents["agent-retired"].Status; got != domain.AgentStatusOnline {
t.Errorf("agent-retired Status=%s want Online (frozen); sweeper touched retired row", got)
}
}
@@ -145,6 +145,31 @@ func (s *DeploymentService) ProcessDeploymentJob(ctx context.Context, job *domai
return fmt.Errorf("failed to fetch agent: %w", err)
}
// I-004: AgentRepository.Get surfaces retired rows by design (for the GUI
// banner + 410 Gone heartbeat path). Deployments must never dispatch to a
// retired agent — it will never heartbeat again and the target row should
// itself have been cascade-retired when the agent was force-retired. A job
// slipping through here would otherwise hit the heartbeat-staleness branch
// below with the misleading reason "agent is offline"; we want operators to
// see the real cause. Fail the job with an explicit reason, send a
// deployment notification so the owner is alerted, and record an audit
// event. This mirrors the notify+audit shape of the offline branch.
if agent.IsRetired() {
updateErr := s.jobRepo.UpdateStatus(ctx, job.ID, domain.JobStatusFailed, "assigned agent is retired")
if updateErr != nil {
slog.Error("failed to update job status", "job_id", job.ID, "error", updateErr)
}
if notifErr := s.notificationSvc.SendDeploymentNotification(ctx, cert, target, false, fmt.Errorf("agent retired")); notifErr != nil {
slog.Error("failed to send deployment notification", "error", notifErr)
}
if auditErr := s.auditService.RecordEvent(ctx, "system", domain.ActorTypeSystem,
"deployment_job_failed", "certificate", job.CertificateID,
map[string]interface{}{"job_id": job.ID, "reason": "agent retired", "target_id": targetID, "agent_id": agentID}); auditErr != nil {
slog.Error("failed to record audit event", "error", auditErr)
}
return fmt.Errorf("agent %s is retired", agentID)
}
// Check agent heartbeat (must be within last 5 minutes)
if agent.LastHeartbeatAt != nil && time.Since(*agent.LastHeartbeatAt) > 5*time.Minute {
updateErr := s.jobRepo.UpdateStatus(ctx, job.ID, domain.JobStatusFailed, "agent is offline")
@@ -232,6 +232,18 @@ func (s *TargetService) TestConnection(ctx context.Context, id string) error {
return fmt.Errorf("assigned agent not found: %w", err)
}
// I-004: AgentRepository.Get intentionally surfaces retired rows (the banner
// + 410 Gone paths need to see them). A test against a retired agent can
// never succeed — the agent is tombstoned, will never heartbeat again, and
// any active targets have already been cascade-retired alongside it. Fail
// fast with an explicit message instead of falling through to the Status /
// heartbeat checks, which would produce a misleading "agent is Offline" or
// "heartbeat stale" diagnostic.
if agent.IsRetired() {
s.updateTestStatus(ctx, target, "failed")
return fmt.Errorf("assigned agent %s is retired", agent.ID)
}
if agent.Status != domain.AgentStatusOnline {
s.updateTestStatus(ctx, target, "failed")
return fmt.Errorf("assigned agent %s is %s (expected Online)", agent.ID, agent.Status)
@@ -293,9 +305,20 @@ func (s *TargetService) CreateTarget(ctx context.Context, target domain.Deployme
if target.AgentID == "" {
return nil, fmt.Errorf("%w: agent_id is required", ErrAgentNotFound)
}
agent, err := s.agentRepo.Get(ctx, target.AgentID)
if err != nil {
return nil, fmt.Errorf("%w: %s", ErrAgentNotFound, target.AgentID)
}
// I-004: refuse to attach new targets to a retired agent. The agent is
// tombstoned and no deployment would ever succeed against it; a target row
// that slipped past here would be cascade-retired on the next dependency
// sweep and confuse operators ("why is this brand-new target already
// retired?"). Treating retired agents as "not found" for creation purposes
// keeps the error surface tight and matches the default-list contract
// established by repository.AgentRepository.List.
if agent.IsRetired() {
return nil, fmt.Errorf("%w: %s (retired)", ErrAgentNotFound, target.AgentID)
}
if target.ID == "" {
target.ID = generateID("target")
@@ -4,6 +4,7 @@ import (
"context"
"database/sql"
"errors"
"sort"
"sync"
"time"
@@ -607,7 +608,14 @@ func (m *mockRenewalPolicyRepo) AddPolicy(policy *domain.RenewalPolicy) {
m.Policies[policy.ID] = policy
}
// mockAgentRepo is a test implementation of AgentRepository.
//
// I-004: ActiveTargetCounts / ActiveCertCounts / PendingJobCounts are keyed by
// agent ID and read back verbatim by the Count* methods. Tests seed these
// maps so the retirement service's preflight sees "agent has N active
// deployments / M deployed certs / K pending jobs" without having to seed
// real target/cert/job rows across multiple mock repos. An unset key means
// zero, matching the production repo behavior on an agent with no deps.
type mockAgentRepo struct {
mu sync.Mutex
Agents map[string]*domain.Agent
@@ -619,8 +627,27 @@ type mockAgentRepo struct {
ListErr error
UpdateHeartbeatErr error
GetByAPIKeyErr error
// I-004 preflight count seeds (read by CountActiveTargets etc.).
ActiveTargetCounts map[string]int
ActiveCertCounts map[string]int
PendingJobCounts map[string]int
// I-004 retirement write-path error seams. Let tests force a SoftRetire
// or RetireAgentWithCascade failure after preflight passed, so the
// service's error surfacing (wrap+return, skip audit, etc.) can be
// exercised without having to stand up a real PG connection.
SoftRetireErr error
RetireCascadeErr error
CountErr error
ListRetiredErr error
}
// List mirrors the production repo contract post-I-004: it returns only
// ACTIVE agents (RetiredAt == nil). Tests that seed a retired agent via
// AddAgent and then call a List-driven service method (e.g. ListAgents,
// MarkStaleAgentsOffline, stats dashboards) must not see the retired row
// here; otherwise the mock would return a row that the real query filters
// out in its WHERE clause, and tests would diverge from production.
// ListRetired is the companion method for explicit retired-only listing.
func (m *mockAgentRepo) List(ctx context.Context) ([]*domain.Agent, error) {
m.mu.Lock()
defer m.mu.Unlock()
@@ -629,6 +656,9 @@ func (m *mockAgentRepo) List(ctx context.Context) ([]*domain.Agent, error) {
}
var agents []*domain.Agent
for _, a := range m.Agents {
if a.RetiredAt != nil {
continue
}
agents = append(agents, a)
}
return agents, nil
@@ -726,6 +756,134 @@ func (m *mockAgentRepo) AddAgent(agent *domain.Agent) {
m.Agents[agent.ID] = agent
}
// ListRetired returns the paginated retired-agents slice + total count.
// Matches the production repo contract: RetiredAt != nil, sorted by
// RetiredAt DESC, page<1 → 1, perPage<1 → 50. Sort is done in-memory over
// the keyed map so the mock stays dependency-free. I-004.
func (m *mockAgentRepo) ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error) {
m.mu.Lock()
defer m.mu.Unlock()
if m.ListRetiredErr != nil {
return nil, 0, m.ListRetiredErr
}
if page < 1 {
page = 1
}
if perPage < 1 {
perPage = 50
}
var retired []*domain.Agent
for _, a := range m.Agents {
if a.RetiredAt != nil {
retired = append(retired, a)
}
}
total := len(retired)
// Sort by RetiredAt DESC — most recent first. The real query uses the
// partial idx_agents_retired_at index; here we sort in Go.
sort.SliceStable(retired, func(i, j int) bool {
return retired[i].RetiredAt.After(*retired[j].RetiredAt)
})
// Apply page/perPage window.
offset := (page - 1) * perPage
if offset >= total {
return nil, total, nil
}
end := offset + perPage
if end > total {
end = total
}
return retired[offset:end], total, nil
}
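// Illustrative sketch of the page/perPage window used by ListRetired above,
// pulled out as a pure function so the clamping rules (page<1 becomes 1,
// perPage<1 becomes 50) are easy to check in isolation. pageWindow is a
// hypothetical helper, not part of the repo; unlike the mock, it returns an
// empty window rather than a nil slice for an out-of-range page.

```go
package main

import "fmt"

// pageWindow returns the half-open [start, end) slice bounds for a 1-based page.
func pageWindow(total, page, perPage int) (start, end int) {
	if page < 1 {
		page = 1
	}
	if perPage < 1 {
		perPage = 50
	}
	start = (page - 1) * perPage
	if start >= total {
		return total, total // page is past the end: empty window
	}
	end = start + perPage
	if end > total {
		end = total
	}
	return start, end
}

func main() {
	items := []string{"a", "b", "c", "d", "e"}
	s, e := pageWindow(len(items), 2, 2)
	fmt.Println(items[s:e]) // prints [c d]
	s, e = pageWindow(len(items), 9, 2)
	fmt.Println(len(items[s:e])) // prints 0
}
```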
// SoftRetire stamps RetiredAt + RetiredReason on the agent row. Mirrors
// the real repo's idempotent semantics: a row already retired is left
// untouched (zero-rows-affected is not an error). I-004 preserves
// retirement metadata across re-retire attempts — whoever retired it
// first owns the audit trail.
func (m *mockAgentRepo) SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error {
m.mu.Lock()
defer m.mu.Unlock()
if m.SoftRetireErr != nil {
return m.SoftRetireErr
}
agent, ok := m.Agents[id]
if !ok {
return errNotFound
}
if agent.RetiredAt != nil {
return nil // already retired — no-op
}
stamped := retiredAt
agent.RetiredAt = &stamped
stampedReason := reason
agent.RetiredReason = &stampedReason
return nil
}
// RetireAgentWithCascade stamps the agent row the same way SoftRetire
// does. The real repo also stamps every active deployment_targets row
// in the same transaction; the mock can't do that because targets live
// in mockTargetRepo, which the retirement service doesn't write to
// through this repo interface. Tests that need to assert cascade
// semantics on targets should seed mockTargetRepo directly and verify
// the service-layer audit event captured the cascade count. I-004.
func (m *mockAgentRepo) RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error {
m.mu.Lock()
defer m.mu.Unlock()
if m.RetireCascadeErr != nil {
return m.RetireCascadeErr
}
agent, ok := m.Agents[id]
if !ok {
return errNotFound
}
if agent.RetiredAt != nil {
return nil // already retired — no-op (same as production transaction)
}
stamped := retiredAt
agent.RetiredAt = &stamped
stampedReason := reason
agent.RetiredReason = &stampedReason
return nil
}
// CountActiveTargets returns the seeded ActiveTargetCounts value (0 if
// unset). Mirrors the real repo query: COUNT of non-retired
// deployment_targets with agent_id=$1. I-004 preflight.
func (m *mockAgentRepo) CountActiveTargets(ctx context.Context, agentID string) (int, error) {
m.mu.Lock()
defer m.mu.Unlock()
if m.CountErr != nil {
return 0, m.CountErr
}
return m.ActiveTargetCounts[agentID], nil
}
// CountActiveCertificates returns the seeded ActiveCertCounts value.
// Real query: COUNT(DISTINCT certificate_id) across
// certificate_target_mappings ↔ deployment_targets on agent_id. I-004.
func (m *mockAgentRepo) CountActiveCertificates(ctx context.Context, agentID string) (int, error) {
m.mu.Lock()
defer m.mu.Unlock()
if m.CountErr != nil {
return 0, m.CountErr
}
return m.ActiveCertCounts[agentID], nil
}
// CountPendingJobs returns the seeded PendingJobCounts value. Real
// query: COUNT of jobs with agent_id=$1 AND status IN (Pending,
// AwaitingCSR, AwaitingApproval, Running). I-004.
func (m *mockAgentRepo) CountPendingJobs(ctx context.Context, agentID string) (int, error) {
m.mu.Lock()
defer m.mu.Unlock()
if m.CountErr != nil {
return 0, m.CountErr
}
return m.PendingJobCounts[agentID], nil
}
// mockTargetRepo is a test implementation of TargetRepository
type mockTargetRepo struct {
mu sync.Mutex
@@ -955,6 +1113,13 @@ func newMockAgentRepository() *mockAgentRepo {
return &mockAgentRepo{
Agents: make(map[string]*domain.Agent),
HeartbeatUpdates: make(map[string]time.Time),
// I-004 preflight count maps. Tests seed these directly via
// agentRepo.ActiveTargetCounts["agent-id"] = N — unset keys
// read back as zero from CountActiveTargets etc., matching
// the production repo behavior for agents with no deps.
ActiveTargetCounts: make(map[string]int),
ActiveCertCounts: make(map[string]int),
PendingJobCounts: make(map[string]int),
}
}