Mirror of https://github.com/shankar0123/certctl.git, synced 2026-06-07 16:21:30 +00:00
Close I-004 (agent hard-delete cascades targets) coverage-gap finding
Operator decision answered as full soft-delete with optional forced
cascade — hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.
Schema — migration 000015:
migrations/000015_agent_retire.up.sql flips
deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
boundary instead of quietly destroying targets. Both `agents` and
`deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
TEXT pair (TEXT not VARCHAR so operator comments are never
truncated), indexed via partial indexes WHERE retired_at IS NOT
NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
EXISTS) so repeated runs against partially-migrated databases
converge. migrations/000015_agent_retire.down.sql restores CASCADE
and drops the new columns for clean rollback. A dedicated
repository-layer testcontainers test
(internal/repository/postgres/migration_000015_test.go) asserts the
before/after FK action, column presence, index presence, and
round-trip idempotency under up→down→up.
Domain — sentinel guard + dependency counts:
internal/domain/connector.go gains IsRetired() on Agent, the
exported SentinelAgentIDs slice listing server-scanner,
cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
four reserved IDs documented in CLAUDE.md and created at startup in
cmd/server/main.go), IsSentinelAgent(id string) predicate,
AgentDependencyCounts{ActiveTargets, ActiveCertificates,
PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
ActorTypeSystem enum values used by audit emission downstream.
Coverage locked down by internal/domain/connector_test.go.
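
A minimal sketch of the domain helpers described above, for orientation only. The four sentinel IDs and the names IsSentinelAgent, AgentDependencyCounts and HasDependencies come from this message; the int field types and the "any bucket non-zero" rule are assumptions, not a copy of internal/domain/connector.go.

```go
package main

import "fmt"

// SentinelAgentIDs lists the four reserved IDs named in the commit message.
var SentinelAgentIDs = []string{
	"server-scanner",
	"cloud-aws-sm",
	"cloud-azure-kv",
	"cloud-gcp-sm",
}

// IsSentinelAgent reports whether id is one of the reserved sentinel agents.
func IsSentinelAgent(id string) bool {
	for _, s := range SentinelAgentIDs {
		if s == id {
			return true
		}
	}
	return false
}

// AgentDependencyCounts carries the three preflight buckets.
// Field types are assumed to be int for this sketch.
type AgentDependencyCounts struct {
	ActiveTargets      int
	ActiveCertificates int
	PendingJobs        int
}

// HasDependencies is assumed to be true when any bucket is non-zero.
func (c AgentDependencyCounts) HasDependencies() bool {
	return c.ActiveTargets > 0 || c.ActiveCertificates > 0 || c.PendingJobs > 0
}

func main() {
	fmt.Println(IsSentinelAgent("cloud-aws-sm"))
	fmt.Println(IsSentinelAgent("a-prod-001"))
	fmt.Println(AgentDependencyCounts{PendingJobs: 1}.HasDependencies())
}
```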
Service — 8-step ordered contract:
internal/service/agent_retire.go:RetireAgent(ctx, id, actor,
opts{Force, Reason}) enforces a fixed execution order:
(1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel
unconditionally; force=true does NOT bypass it.
(2) fetch — ErrAgentNotFound on miss.
(3) idempotency — if IsRetired() already, return
AgentRetirementResult{AlreadyRetired: true} with no new audit
event and no state change (safe to replay from flaky clients).
(4) preflight counts — collectAgentDependencyCounts runs
ActiveTargets, ActiveCertificates, PendingJobs sequentially
(not in parallel; keeps the per-query timeout predictable and
matches the repo's existing call-chain shape).
(5) force-reason guard — opts.Force=true with empty Reason returns
ErrForceReasonRequired (wired into the 400 status surface).
(6) dependency guard — HasDependencies() with opts.Force=false
returns BlockedByDependenciesError{Counts} (wired into the 409
body with per-bucket counts).
(7) mutation — single pinned retiredAt := time.Now(); agent
retirement first, then cascade target retirement if opts.Force,
all under the repo's single transaction so the two retired_at
stamps match to the second.
(8) best-effort audit — agent_retired always; agent_retirement_
cascaded additionally on the force path. Actor is whatever the
handler resolves from the request; actor type is mapped by
resolveActorType (system/agent-prefix→Agent/else→User). Audit
emission failures are logged via slog.Error but do not abort
the retirement (matches the house convention used by every
other scheduler-emitted event).
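
The guard ordering in steps (1) to (7) can be sketched as a pure decision function. This is an illustration under assumptions, not the code in agent_retire.go: agentState, outcome and decideRetire are hypothetical names, the database fetch and preflight queries are replaced by plain booleans, and the audit step (8) is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrAgentIsSentinel       = errors.New("agent is a reserved sentinel")
	ErrAgentNotFound         = errors.New("agent not found")
	ErrForceReasonRequired   = errors.New("force requires a reason")
	ErrBlockedByDependencies = errors.New("blocked by dependencies")
)

// agentState is a hypothetical stand-in for what the fetch (step 2) and
// preflight counts (step 4) would return from the repository.
type agentState struct {
	Exists          bool
	Retired         bool
	HasDependencies bool
}

// outcome is a hypothetical label for the non-error results.
type outcome string

const (
	outcomeAlreadyRetired outcome = "already_retired"
	outcomeRetired        outcome = "retired"
	outcomeCascaded       outcome = "retired_with_cascade"
)

var sentinels = map[string]bool{
	"server-scanner": true, "cloud-aws-sm": true,
	"cloud-azure-kv": true, "cloud-gcp-sm": true,
}

// decideRetire applies the guards in the order listed in the commit message.
func decideRetire(id string, st agentState, force bool, reason string) (outcome, error) {
	if sentinels[id] { // (1) sentinel guard, never bypassed by force
		return "", ErrAgentIsSentinel
	}
	if !st.Exists { // (2) fetch
		return "", ErrAgentNotFound
	}
	if st.Retired { // (3) idempotency
		return outcomeAlreadyRetired, nil
	}
	// (4) preflight counts are represented by st.HasDependencies.
	if force && reason == "" { // (5) force-reason guard
		return "", ErrForceReasonRequired
	}
	if st.HasDependencies && !force { // (6) dependency guard
		return "", ErrBlockedByDependencies
	}
	if force { // (7) mutation, with cascade on the force path
		return outcomeCascaded, nil
	}
	return outcomeRetired, nil
}

func main() {
	busy := agentState{Exists: true, HasDependencies: true}
	fmt.Println(decideRetire("a-prod-001", busy, false, ""))
	fmt.Println(decideRetire("a-prod-001", busy, true, "decommissioning rack 7"))
	fmt.Println(decideRetire("server-scanner", busy, true, "cleanup"))
}
```

Because the sentinel check runs first, a forced call against a reserved ID never reaches the later guards.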
BlockedByDependenciesError implements Error() as
"active_targets=%d, active_certificates=%d, pending_jobs=%d" and
Unwrap() → ErrBlockedByDependencies. The single struct satisfies
errors.Is via Unwrap (used by scheduler-level tests) and errors.As
via the concrete type (used by the handler to fish out Counts for
the 409 body). ListRetiredAgents(page, perPage) adds a separate
paginated accessor with page<1→1 and perPage<1→50 normalization so
retired rows are queryable without polluting the default agent
listing.
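
A sketch of how one error struct can serve both errors.Is and errors.As, plus the paging normalization mentioned above. The Error() format string and the Unwrap target are taken from this message; the helper names isBlocked, countsFrom and normalizePaging are hypothetical, and the int field types are assumed.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrBlockedByDependencies = errors.New("blocked by dependencies")

// AgentDependencyCounts mirrors the three buckets; int types are assumed.
type AgentDependencyCounts struct {
	ActiveTargets      int
	ActiveCertificates int
	PendingJobs        int
}

// BlockedByDependenciesError carries the counts and unwraps to the sentinel.
type BlockedByDependenciesError struct {
	Counts AgentDependencyCounts
}

func (e *BlockedByDependenciesError) Error() string {
	return fmt.Sprintf("active_targets=%d, active_certificates=%d, pending_jobs=%d",
		e.Counts.ActiveTargets, e.Counts.ActiveCertificates, e.Counts.PendingJobs)
}

func (e *BlockedByDependenciesError) Unwrap() error { return ErrBlockedByDependencies }

// isBlocked matches through Unwrap, as a caller that only needs a yes/no would.
func isBlocked(err error) bool { return errors.Is(err, ErrBlockedByDependencies) }

// countsFrom extracts the concrete type, as a handler building a 409 body would.
func countsFrom(err error) (AgentDependencyCounts, bool) {
	var blocked *BlockedByDependenciesError
	if errors.As(err, &blocked) {
		return blocked.Counts, true
	}
	return AgentDependencyCounts{}, false
}

// normalizePaging applies the page<1 -> 1 and perPage<1 -> 50 defaults.
func normalizePaging(page, perPage int) (int, int) {
	if page < 1 {
		page = 1
	}
	if perPage < 1 {
		perPage = 50
	}
	return page, perPage
}

func main() {
	err := fmt.Errorf("retire agent: %w", &BlockedByDependenciesError{
		Counts: AgentDependencyCounts{ActiveTargets: 3, ActiveCertificates: 7, PendingJobs: 2},
	})
	fmt.Println(isBlocked(err))
	fmt.Println(countsFrom(err))
	fmt.Println(normalizePaging(0, 0))
}
```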
The sentinel guard is unconditional by design: all four reserved
IDs are protected, and force=true cannot override it. Regression tests
in internal/service/agent_retire_test.go assert each of the eight
steps in order, plus sentinel bypass attempts and idempotency
replay.
Handler + router — status-code surface:
internal/api/handler/agents.go:RetireAgent exposes seven status
codes on DELETE /agents/{id}:
200 on a fresh retirement (body echoes AgentRetirementResult).
204 on idempotent replay (AlreadyRetired=true; no new audit).
400 on ErrForceReasonRequired.
403 on ErrAgentIsSentinel.
404 on ErrAgentNotFound.
409 on BlockedByDependenciesError, with a custom body shape
{error, counts{active_targets, active_certificates,
pending_jobs}} that bypasses the default ErrorWithRequestID
envelope so callers get the per-bucket numbers directly.
500 on any other error.
The Heartbeat handler returns 410 Gone when the agent is
retired (ErrAgentRetired), signalling the agent to shut down.
Query params `force=true` and `reason=<text>` drive the cascade
path; both are forwarded as url.Values through the new MCP
transport.
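
The status-code surface above can be sketched as one mapping function. This is a simplified illustration, not the handler itself: retireStatus, errExampleNotFound and errExampleOther are hypothetical names, and the 404 branch follows the handler's doc comment in the diff, which describes a "not found" string match that runs after the sentinel checks.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
)

var (
	ErrForceReasonRequired   = errors.New("force requires a reason")
	ErrAgentIsSentinel       = errors.New("agent is a reserved sentinel")
	ErrBlockedByDependencies = errors.New("blocked by dependencies")

	// Hypothetical example errors, used only to exercise the mapping.
	errExampleNotFound = errors.New("agent not found")
	errExampleOther    = errors.New("connection reset")
)

// retireStatus maps a service result to an HTTP status.
// alreadyRetired is only consulted when err is nil.
func retireStatus(alreadyRetired bool, err error) int {
	switch {
	case err == nil && alreadyRetired:
		return http.StatusNoContent // idempotent replay
	case err == nil:
		return http.StatusOK // fresh retirement
	case errors.Is(err, ErrForceReasonRequired):
		return http.StatusBadRequest
	case errors.Is(err, ErrAgentIsSentinel):
		return http.StatusForbidden
	case errors.Is(err, ErrBlockedByDependencies):
		return http.StatusConflict
	case strings.Contains(err.Error(), "not found"):
		return http.StatusNotFound
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	fmt.Println(retireStatus(false, nil))
	fmt.Println(retireStatus(true, nil))
	fmt.Println(retireStatus(false, fmt.Errorf("retire: %w", ErrBlockedByDependencies)))
	fmt.Println(retireStatus(false, errExampleNotFound))
}
```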
internal/api/router/router.go registers GET /api/v1/agents/retired
literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's
literal-beats-pattern-var precedence routes "retired" to the
paginated retired-agents listing instead of fetching a hypothetical
agent named "retired".
Agent binary — clean shutdown on 410:
cmd/agent/main.go gains the ErrAgentRetired sentinel, a
retiredOnce sync.Once, and a retiredSignal chan struct{}. A
markRetired(source, statusCode, body) helper closes the channel
exactly once; the Run() select loop observes the close and returns
ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
and exits cleanly instead of spinning in the heartbeat retry loop.
The 410 Gone surface is therefore terminal for the agent process.
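
The close-once shutdown signal can be sketched as below. The names ErrAgentRetired, retiredOnce, retiredSignal and markRetired come from this message; the agent struct, the heartbeat callback and the one-millisecond ticker are assumptions made so the example runs on its own, and do not reflect the real cmd/agent loop.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync"
	"time"
)

var ErrAgentRetired = errors.New("agent retired")

// agent is a hypothetical container for the two shutdown fields.
type agent struct {
	retiredOnce   sync.Once
	retiredSignal chan struct{}
}

func newAgent() *agent { return &agent{retiredSignal: make(chan struct{})} }

// markRetired closes retiredSignal exactly once; repeat calls are no-ops.
func (a *agent) markRetired(source string, statusCode int, body string) {
	a.retiredOnce.Do(func() {
		fmt.Printf("retired via %s (status %d): %s\n", source, statusCode, body)
		close(a.retiredSignal)
	})
}

// Run polls heartbeat until the retired signal is observed.
// heartbeat is a stand-in that returns the HTTP status of one poll.
func (a *agent) Run(heartbeat func() int) error {
	ticker := time.NewTicker(time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-a.retiredSignal:
			return ErrAgentRetired
		case <-ticker.C:
			if code := heartbeat(); code == http.StatusGone {
				a.markRetired("heartbeat", code, "Agent has been retired")
			}
		}
	}
}

func main() {
	a := newAgent()
	calls := 0
	err := a.Run(func() int {
		calls++
		if calls < 3 {
			return http.StatusOK
		}
		return http.StatusGone
	})
	if errors.Is(err, ErrAgentRetired) {
		fmt.Println("agent retired, exiting cleanly")
		return
	}
	fmt.Println("unexpected error:", err)
}
```

Closing a channel rather than sending on it means every goroutine selecting on retiredSignal sees the shutdown, and sync.Once guards against a double close if several request paths observe a 410.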
MCP transport:
internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
a new additive transport method. Client.Delete is path-only; without
this method the retire tool would silently drop `force` and `reason`,
turning every cascade retire into a default soft-retire. The new
method shares do()'s 204 normalization and 4xx/5xx error
propagation so tool authors get one contract.
internal/mcp/tools.go + internal/mcp/types.go expose the
retire_agent tool with Force+Reason inputs wired through
DeleteWithQuery.
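
To illustrate why a path-only delete drops the cascade inputs, here is a sketch of how the query string could be attached. retireQuery and deleteURL are hypothetical helpers, and the base URL is a placeholder; the real Client.DeleteWithQuery also performs the request and shares do()'s response handling, which is not reproduced here.

```go
package main

import (
	"fmt"
	"net/url"
)

// retireQuery builds the query parameters for a retire call.
// Empty values are left out so a default soft-retire sends no query at all.
func retireQuery(force bool, reason string) url.Values {
	q := url.Values{}
	if force {
		q.Set("force", "true")
	}
	if reason != "" {
		q.Set("reason", reason)
	}
	return q
}

// deleteURL joins base, path and an optional encoded query string.
func deleteURL(base, path string, query url.Values) string {
	target := base + path
	if encoded := query.Encode(); encoded != "" {
		target += "?" + encoded
	}
	return target
}

func main() {
	base := "http://localhost:8443" // placeholder address for the sketch
	fmt.Println(deleteURL(base, "/api/v1/agents/a-prod-001", retireQuery(false, "")))
	fmt.Println(deleteURL(base, "/api/v1/agents/a-prod-001", retireQuery(true, "decommissioning rack 7")))
}
```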
CLI:
cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
`agents list --retired` (client-side strip of --retired then
delegation to ListRetiredAgents, sharing --page/--per-page parsing
with the default listing) and `agents retire <id> [--force --reason
"…"]` (mirrors ErrForceReasonRequired — force without reason is
rejected client-side before the request is sent). JSON + table
output modes both honor the new columns.
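
The two client-side behaviours described for the CLI can be sketched as follows. stripFlag and validateRetire are hypothetical helper names and the error text is invented for the example; only the behaviour (remove --retired before shared parsing, reject force without reason before sending) is taken from this message.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrForceReasonRequired here is a local stand-in with made-up wording.
var ErrForceReasonRequired = errors.New("--force requires --reason")

// stripFlag removes every occurrence of a boolean flag from args and
// reports whether it was present, leaving the other arguments in order.
func stripFlag(args []string, flag string) ([]string, bool) {
	rest := make([]string, 0, len(args))
	found := false
	for _, a := range args {
		if a == flag {
			found = true
			continue
		}
		rest = append(rest, a)
	}
	return rest, found
}

// validateRetire rejects a forced retire that has no reason attached.
func validateRetire(force bool, reason string) error {
	if force && reason == "" {
		return ErrForceReasonRequired
	}
	return nil
}

func main() {
	rest, retired := stripFlag([]string{"--retired", "--page", "2"}, "--retired")
	fmt.Println(rest, retired)
	fmt.Println(validateRetire(true, ""))
	fmt.Println(validateRetire(true, "decommissioning rack 7"))
}
```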
Frontend:
web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
web/src/api/client.ts + web/src/api/types.ts expose the retire
endpoint and the retired-listing. 4 new Vitest regression cases.
OpenAPI:
api/openapi.yaml documents DELETE /agents/{id} with all seven
status codes, 410 on heartbeat, and the 409 per-bucket body shape.
Regression coverage (six new test files, all green):
internal/service/agent_retire_test.go — 8-step contract + sentinel guards
internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
internal/mcp/retire_agent_test.go — DeleteWithQuery wire-through
internal/cli/agent_retire_test.go — --retired listing + --force/--reason pairing
internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
internal/domain/connector_test.go — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies
Files:
api/openapi.yaml — DELETE + 410 + 409 body shape
cmd/agent/main.go — ErrAgentRetired, markRetired, retiredSignal
cmd/cli/main.go — handleAgents list/get/retire dispatch
docs/architecture.md, docs/concepts.md,
docs/testing-guide.md — retirement contract narrative
internal/api/handler/agents.go — RetireAgent, status surface, 410 on heartbeat
internal/api/handler/agent_handler_test.go — extended coverage
internal/api/handler/agent_retire_handler_test.go — new
internal/api/router/router.go — /agents/retired before /agents/{id}
internal/cli/agent_retire_test.go — new
internal/cli/client.go — ListRetiredAgents + RetireAgent
internal/domain/connector.go — IsRetired, SentinelAgentIDs,
IsSentinelAgent, AgentDependencyCounts,
ActorTypeAgent/System
internal/domain/connector_test.go — new
internal/integration/lifecycle_test.go — retirement fixture
internal/mcp/client.go — DeleteWithQuery additive transport
internal/mcp/retire_agent_test.go — new
internal/mcp/tools.go, internal/mcp/types.go — retire_agent tool + Force/Reason inputs
internal/repository/interfaces.go — AgentRepository retirement methods
internal/repository/postgres/agent.go — retire + cascade target retire + counts
internal/repository/postgres/migration_000015_test.go — new
internal/service/agent.go — wire into AgentService surface
internal/service/agent_retire.go — new 8-step contract
internal/service/agent_retire_test.go — new
internal/service/deployment.go — skip retired agents
internal/service/target.go — skip retired agents
internal/service/testutil_test.go — shared mocks extended
migrations/000015_agent_retire.up.sql — new
migrations/000015_agent_retire.down.sql — new
web/src/api/client.ts, types.ts + tests — retire endpoint wiring
web/src/pages/AgentsPage.tsx — retire UI
@@ -10,6 +10,7 @@ import (
	"time"

	"github.com/shankar0123/certctl/internal/domain"
	"github.com/shankar0123/certctl/internal/service"
)

// MockAgentService is a mock implementation of AgentService interface.
@@ -24,6 +25,11 @@ type MockAgentService struct {
	GetWorkFn            func(agentID string) ([]domain.Job, error)
	GetWorkWithTargetsFn func(agentID string) ([]domain.WorkItem, error)
	UpdateJobStatusFn    func(agentID string, jobID string, status string, errMsg string) error
	// I-004: soft-retirement hooks. Tests that don't set these receive nil
	// results and nil errors, which mirrors the safest default (no-op) for
	// unrelated suites that mock only the legacy surface.
	RetireAgentFn       func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error)
	ListRetiredAgentsFn func(page, perPage int) ([]domain.Agent, int64, error)
}

func (m *MockAgentService) ListAgents(_ context.Context, page, perPage int) ([]domain.Agent, int64, error) {
@@ -96,6 +102,25 @@ func (m *MockAgentService) UpdateJobStatus(_ context.Context, agentID string, jo
	return nil
}

// RetireAgent is the I-004 soft-retirement entrypoint. Tests that don't set
// RetireAgentFn get a nil result + nil error, which is a no-op response that
// lets unrelated suites compile without caring about the retirement surface.
func (m *MockAgentService) RetireAgent(_ context.Context, agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
	if m.RetireAgentFn != nil {
		return m.RetireAgentFn(agentID, actor, force, reason)
	}
	return nil, nil
}

// ListRetiredAgents returns retired rows for the retired-agents tab / audit
// views. Same zero-value default as RetireAgent for unrelated tests.
func (m *MockAgentService) ListRetiredAgents(_ context.Context, page, perPage int) ([]domain.Agent, int64, error) {
	if m.ListRetiredAgentsFn != nil {
		return m.ListRetiredAgentsFn(page, perPage)
	}
	return nil, 0, nil
}

// Test ListAgents - success case
func TestListAgents_Success(t *testing.T) {
	now := time.Now()

@@ -0,0 +1,393 @@
package handler

import (
	"context"
	"encoding/json"
	"errors"
	"net/http"
	"net/http/httptest"
	"testing"
	"time"

	"github.com/shankar0123/certctl/internal/domain"
	"github.com/shankar0123/certctl/internal/service"
)

// agentRetireTestSetup builds an AgentHandler with a mock AgentService whose
// RetireAgent / ListRetiredAgents / Heartbeat behavior is driven by the
// returned mock. Keeps every I-004 handler test self-contained so a single
// failing assertion can't cascade through a shared fixture.
func agentRetireTestSetup() (*MockAgentService, AgentHandler) {
	mock := &MockAgentService{}
	handler := NewAgentHandler(mock)
	return mock, handler
}

// TestRetireAgentHandler_Success_200 pins the happy-path contract for the
// soft-retirement HTTP surface: DELETE /api/v1/agents/{id} with no dependency
// fallout returns 200 OK and a JSON body echoing retirement metadata
// (retired_at timestamp, already_retired=false, cascade=false, zero counts).
// Operators building dashboards parse these fields; keep the shape stable.
func TestRetireAgentHandler_Success_200(t *testing.T) {
	retiredAt := time.Date(2026, 4, 18, 12, 0, 0, 0, time.UTC)
	mock, handler := agentRetireTestSetup()
	mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
		if agentID != "a-prod-001" {
			t.Fatalf("retire handler received agentID=%q want a-prod-001", agentID)
		}
		if force {
			t.Fatalf("retire handler set force=true unexpectedly; default path must be force=false")
		}
		return &service.AgentRetirementResult{
			AlreadyRetired: false,
			Cascade:        false,
			RetiredAt:      retiredAt,
			Counts:         domain.AgentDependencyCounts{},
		}, nil
	}

	req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001", nil)
	req = req.WithContext(contextWithRequestID())
	w := httptest.NewRecorder()

	handler.RetireAgent(w, req)

	if w.Code != http.StatusOK {
		t.Fatalf("status=%d body=%s want 200", w.Code, w.Body.String())
	}

	var body struct {
		RetiredAt      time.Time                    `json:"retired_at"`
		AlreadyRetired bool                         `json:"already_retired"`
		Cascade        bool                         `json:"cascade"`
		Counts         domain.AgentDependencyCounts `json:"counts"`
	}
	if err := json.NewDecoder(w.Body).Decode(&body); err != nil {
		t.Fatalf("decode 200 body: %v", err)
	}
	if !body.RetiredAt.Equal(retiredAt) {
		t.Errorf("retired_at=%v want %v", body.RetiredAt, retiredAt)
	}
	if body.AlreadyRetired {
		t.Errorf("already_retired=true want false on clean retire")
	}
	if body.Cascade {
		t.Errorf("cascade=true want false on clean retire")
	}
}

// TestRetireAgentHandler_AlreadyRetired_204 covers the idempotent contract: a
// retire call against an already-retired agent completes with 204 No Content
// (no body). This lets operators safely re-issue the DELETE after a network
// blip without fearing duplicate audit events or state mutations.
func TestRetireAgentHandler_AlreadyRetired_204(t *testing.T) {
	mock, handler := agentRetireTestSetup()
	past := time.Now().Add(-24 * time.Hour)
	mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
		return &service.AgentRetirementResult{
			AlreadyRetired: true,
			Cascade:        false,
			RetiredAt:      past,
			Counts:         domain.AgentDependencyCounts{},
		}, nil
	}

	req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001", nil)
	req = req.WithContext(contextWithRequestID())
	w := httptest.NewRecorder()

	handler.RetireAgent(w, req)

	if w.Code != http.StatusNoContent {
		t.Fatalf("status=%d body=%s want 204", w.Code, w.Body.String())
	}
	// 204 No Content must have zero body. If anything leaks through, downstream
	// clients (curl scripts, dashboards) break.
	if w.Body.Len() != 0 {
		t.Errorf("204 body=%q want empty", w.Body.String())
	}
}

// TestRetireAgentHandler_Sentinel_403 covers the hard guard against retiring
// any of the four sentinel agents that back discovery sources and the
// network scanner. These IDs are reserved; the handler must surface the
// service-layer ErrAgentIsSentinel as 403 Forbidden regardless of force/reason
// because no operator intent can legitimately retire them.
func TestRetireAgentHandler_Sentinel_403(t *testing.T) {
	sentinels := []string{"server-scanner", "cloud-aws-sm", "cloud-azure-kv", "cloud-gcp-sm"}
	for _, id := range sentinels {
		t.Run(id, func(t *testing.T) {
			mock, handler := agentRetireTestSetup()
			mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
				return nil, service.ErrAgentIsSentinel
			}

			req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/"+id, nil)
			req = req.WithContext(contextWithRequestID())
			w := httptest.NewRecorder()

			handler.RetireAgent(w, req)

			if w.Code != http.StatusForbidden {
				t.Fatalf("sentinel %q status=%d body=%s want 403", id, w.Code, w.Body.String())
			}
		})
	}
}

// TestRetireAgentHandler_NotFound_404 covers the lookup-miss path. Service
// returns a not-found error; handler maps to 404. The handler currently
// detects this case by matching "not found" in the error string (kept for
// compatibility with repository error strings), so the mock returns a plain
// error carrying that text rather than a typed sentinel.
func TestRetireAgentHandler_NotFound_404(t *testing.T) {
	mock, handler := agentRetireTestSetup()
	mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
		return nil, errors.New("agent not found")
	}

	req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/unknown-id", nil)
	req = req.WithContext(contextWithRequestID())
	w := httptest.NewRecorder()

	handler.RetireAgent(w, req)

	if w.Code != http.StatusNotFound {
		t.Fatalf("status=%d body=%s want 404", w.Code, w.Body.String())
	}
}

// TestRetireAgentHandler_Blocked_409_WithCounts covers the preflight-blocked
// path. Service returns *BlockedByDependenciesError wrapping
// ErrBlockedByDependencies; handler unwraps via errors.As, maps to 409, and
// MUST include the counts in the response body so operators know what's
// blocking them. Without counts the 409 is useless: the operator has to
// guess which downstream dependency is holding up the retirement.
func TestRetireAgentHandler_Blocked_409_WithCounts(t *testing.T) {
	mock, handler := agentRetireTestSetup()
	blockCounts := domain.AgentDependencyCounts{
		ActiveTargets:      3,
		ActiveCertificates: 7,
		PendingJobs:        2,
	}
	mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
		return nil, &service.BlockedByDependenciesError{Counts: blockCounts}
	}

	req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001", nil)
	req = req.WithContext(contextWithRequestID())
	w := httptest.NewRecorder()

	handler.RetireAgent(w, req)

	if w.Code != http.StatusConflict {
		t.Fatalf("status=%d body=%s want 409", w.Code, w.Body.String())
	}

	var body struct {
		Error   string                       `json:"error"`
		Message string                       `json:"message"`
		Counts  domain.AgentDependencyCounts `json:"counts"`
	}
	if err := json.NewDecoder(w.Body).Decode(&body); err != nil {
		t.Fatalf("decode 409 body: %v", err)
	}
	if body.Counts.ActiveTargets != 3 {
		t.Errorf("counts.active_targets=%d want 3", body.Counts.ActiveTargets)
	}
	if body.Counts.ActiveCertificates != 7 {
		t.Errorf("counts.active_certificates=%d want 7", body.Counts.ActiveCertificates)
	}
	if body.Counts.PendingJobs != 2 {
		t.Errorf("counts.pending_jobs=%d want 2", body.Counts.PendingJobs)
	}
	if body.Message == "" {
		t.Errorf("409 body missing human-readable message; operators need guidance")
	}
}

// TestRetireAgentHandler_Force_NoReason_400 covers the force-escape-hatch
// guardrail: force=true without a non-empty reason must be rejected before
// any retirement is written, because a reason-less cascade is unauditable.
// The handler forwards force=true and the empty reason unchanged; the service
// returns ErrForceReasonRequired; handler maps to 400.
func TestRetireAgentHandler_Force_NoReason_400(t *testing.T) {
	mock, handler := agentRetireTestSetup()
	mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
		if !force {
			t.Fatalf("handler did not forward force=true; force query param was dropped")
		}
		if reason != "" {
			t.Fatalf("handler passed reason=%q; empty reason must reach service for error path", reason)
		}
		return nil, service.ErrForceReasonRequired
	}

	req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001?force=true", nil)
	req = req.WithContext(contextWithRequestID())
	w := httptest.NewRecorder()

	handler.RetireAgent(w, req)

	if w.Code != http.StatusBadRequest {
		t.Fatalf("status=%d body=%s want 400", w.Code, w.Body.String())
	}
}

// TestRetireAgentHandler_ForceCascade_200 covers the successful force-cascade
// path: DELETE ?force=true&reason=... → service executes transactional
// cascade → 200 with cascade=true and the pre-cascade counts echoed back so
// the operator's confirmation dialog can show "I just retired N targets,
// M certificates, K pending jobs."
func TestRetireAgentHandler_ForceCascade_200(t *testing.T) {
	mock, handler := agentRetireTestSetup()
	retiredAt := time.Date(2026, 4, 18, 14, 30, 0, 0, time.UTC)
	mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
		if !force {
			t.Fatalf("handler did not forward force=true; query-param parsing broken")
		}
		if reason != "decommissioning rack 7" {
			t.Fatalf("handler forwarded reason=%q want %q", reason, "decommissioning rack 7")
		}
		return &service.AgentRetirementResult{
			AlreadyRetired: false,
			Cascade:        true,
			RetiredAt:      retiredAt,
			Counts: domain.AgentDependencyCounts{
				ActiveTargets:      2,
				ActiveCertificates: 5,
				PendingJobs:        1,
			},
		}, nil
	}

	url := "/api/v1/agents/a-prod-001?force=true&reason=decommissioning+rack+7"
	req := httptest.NewRequest(http.MethodDelete, url, nil)
	req = req.WithContext(contextWithRequestID())
	w := httptest.NewRecorder()

	handler.RetireAgent(w, req)

	if w.Code != http.StatusOK {
		t.Fatalf("status=%d body=%s want 200", w.Code, w.Body.String())
	}

	var body struct {
		RetiredAt      time.Time                    `json:"retired_at"`
		AlreadyRetired bool                         `json:"already_retired"`
		Cascade        bool                         `json:"cascade"`
		Counts         domain.AgentDependencyCounts `json:"counts"`
	}
	if err := json.NewDecoder(w.Body).Decode(&body); err != nil {
		t.Fatalf("decode force-cascade 200 body: %v", err)
	}
	if !body.Cascade {
		t.Errorf("cascade=false want true on ?force=true successful retire")
	}
	if body.Counts.ActiveTargets != 2 || body.Counts.ActiveCertificates != 5 || body.Counts.PendingJobs != 1 {
		t.Errorf("counts=%+v want {ActiveTargets:2 ActiveCertificates:5 PendingJobs:1}", body.Counts)
	}
}

// TestHeartbeatHandler_RetiredAgent_410 covers the agent-shutdown signal. A
// retired agent that is still polling must be told its identity is gone
// (410 Gone) rather than offered the normal 200 "recorded" response.
// cmd/agent treats 410 as a terminal signal and exits rather than looping
// forever against a decommissioned identity. Service returns ErrAgentRetired;
// handler maps to 410.
func TestHeartbeatHandler_RetiredAgent_410(t *testing.T) {
	mock, handler := agentRetireTestSetup()
	mock.HeartbeatFn = func(agentID string, metadata *domain.AgentMetadata) error {
		return service.ErrAgentRetired
	}

	req := httptest.NewRequest(http.MethodPost, "/api/v1/agents/a-prod-001/heartbeat", nil)
	req = req.WithContext(contextWithRequestID())
	w := httptest.NewRecorder()

	handler.Heartbeat(w, req)

	if w.Code != http.StatusGone {
		t.Fatalf("heartbeat(retired) status=%d body=%s want 410", w.Code, w.Body.String())
	}
}

// TestListRetiredAgentsHandler_Success covers the audit/forensics-facing
// endpoint GET /api/v1/agents/retired. Returns a paged list of retired rows
// alongside total count so the GUI can render a "Retired Agents" tab with
// pagination. Default listing (GET /agents) hides retired rows; this is the
// opt-in surface for them.
func TestListRetiredAgentsHandler_Success(t *testing.T) {
	past := time.Now().Add(-48 * time.Hour)
	reason := "old hardware"
	retired := []domain.Agent{
		{
			ID:            "agent-retired-01",
			Name:          "decom-01",
			Hostname:      "server-old",
			Status:        domain.AgentStatusOffline,
			RegisteredAt:  past,
			RetiredAt:     &past,
			RetiredReason: &reason,
		},
	}

	mock, handler := agentRetireTestSetup()
	mock.ListRetiredAgentsFn = func(page, perPage int) ([]domain.Agent, int64, error) {
		if page != 1 || perPage != 50 {
			t.Fatalf("ListRetired handler received page=%d perPage=%d want 1/50 defaults", page, perPage)
		}
		return retired, 1, nil
	}

	req := httptest.NewRequest(http.MethodGet, "/api/v1/agents/retired", nil)
	req = req.WithContext(contextWithRequestID())
	w := httptest.NewRecorder()

	handler.ListRetiredAgents(w, req)

	if w.Code != http.StatusOK {
		t.Fatalf("status=%d body=%s want 200", w.Code, w.Body.String())
	}

	var response PagedResponse
	if err := json.NewDecoder(w.Body).Decode(&response); err != nil {
		t.Fatalf("decode list-retired body: %v", err)
	}
	if response.Total != 1 {
		t.Errorf("total=%d want 1", response.Total)
	}
}

// TestRetireAgentHandler_MethodNotAllowed covers defense-in-depth: only
// DELETE is valid on /api/v1/agents/{id} for retirement. Using POST/PUT/PATCH
// must be rejected with 405 so misconfigured callers don't accidentally
// trigger retirement via a wrong-method request.
func TestRetireAgentHandler_MethodNotAllowed(t *testing.T) {
	_, handler := agentRetireTestSetup()

	for _, method := range []string{http.MethodPost, http.MethodPut, http.MethodPatch} {
		t.Run(method, func(t *testing.T) {
			req := httptest.NewRequest(method, "/api/v1/agents/a-prod-001", nil)
			req = req.WithContext(contextWithRequestID())
			w := httptest.NewRecorder()

			handler.RetireAgent(w, req)

			if w.Code != http.StatusMethodNotAllowed {
				t.Fatalf("method=%s status=%d want 405", method, w.Code)
			}
		})
	}
}

// Compile-time asserts: the mock must satisfy the handler's AgentService
// interface. Red state: this fails until the interface grows RetireAgent +
// ListRetiredAgents. Once Phase 2b adds those methods to AgentService, this
// assertion goes green along with every test above.
var _ AgentService = (*MockAgentService)(nil)

// Unused-import suppressor for context: the package-level tests already
// pull context from agent_handler_test.go, but leaving this here documents
// that the mock methods receive context.Context values even though this
// file's tests don't construct them directly (they ride on httptest.NewRequest).
var _ = context.Background
@@ -3,16 +3,24 @@ package handler
import (
	"context"
	"encoding/json"
	"errors"
	"log/slog"
	"net/http"
	"strconv"
	"strings"
	"time"

	"github.com/shankar0123/certctl/internal/api/middleware"
	"github.com/shankar0123/certctl/internal/domain"
	"github.com/shankar0123/certctl/internal/service"
)

// AgentService defines the service interface for agent operations.
//
// I-004 expansion: RetireAgent + ListRetiredAgents back the soft-retirement
// surface. The handler depends on the service-package's AgentRetirementResult
// and BlockedByDependenciesError types for result shape + errors.As unwrap,
// which is why this file imports internal/service.
type AgentService interface {
	ListAgents(ctx context.Context, page, perPage int) ([]domain.Agent, int64, error)
	GetAgent(ctx context.Context, id string) (*domain.Agent, error)
@@ -24,6 +32,10 @@ type AgentService interface {
	GetWork(ctx context.Context, agentID string) ([]domain.Job, error)
	GetWorkWithTargets(ctx context.Context, agentID string) ([]domain.WorkItem, error)
	UpdateJobStatus(ctx context.Context, agentID string, jobID string, status string, errMsg string) error
	// I-004 soft-retirement API. Both default to no-op (nil result / nil error)
	// in mocks that don't override them; handler tests opt in per suite.
	RetireAgent(ctx context.Context, agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error)
	ListRetiredAgents(ctx context.Context, page, perPage int) ([]domain.Agent, int64, error)
}

// AgentHandler handles HTTP requests for agent operations.
@@ -190,6 +202,15 @@ func (h AgentHandler) Heartbeat(w http.ResponseWriter, r *http.Request) {
	}

	if err := h.svc.Heartbeat(r.Context(), agentID, metadata); err != nil {
		// I-004: a retired agent still polling must receive 410 Gone so
		// cmd/agent detects the terminal signal and shuts down cleanly
		// instead of looping forever against a decommissioned identity.
		// Check this FIRST, before "not found" string matching, so the
		// retired-path is never masked by a sibling error branch.
		if errors.Is(err, service.ErrAgentRetired) {
			ErrorWithRequestID(w, http.StatusGone, "Agent has been retired", requestID)
			return
		}
		if strings.Contains(err.Error(), "not found") {
			ErrorWithRequestID(w, http.StatusNotFound, "Agent not found", requestID)
			return
@@ -376,3 +397,181 @@ func (h AgentHandler) AgentReportJobStatus(w http.ResponseWriter, r *http.Reques
|
||||
"status": "updated",
|
||||
})
|
||||
}
|
||||
|
||||
// RetireAgent executes the I-004 soft-retirement surface.
|
||||
// DELETE /api/v1/agents/{id}[?force=true&reason=...]
|
||||
//
|
||||
// Contract (pinned by agent_retire_handler_test.go):
//
//   405  any method other than DELETE
//   200  clean retire (body: retired_at, already_retired=false, cascade=false, counts=0s)
//   200  force-cascade retire (body: cascade=true, counts=pre-cascade snapshot)
//   204  idempotent retire of an already-retired agent (NO body: downstream
//        clients that tee responses into dashboards break on spurious bodies)
//   400  force=true without a non-empty reason (ErrForceReasonRequired)
//   403  one of the four reserved sentinel IDs (ErrAgentIsSentinel)
//   404  agent does not exist ("not found" string match, kept for compat with
//        repo error strings; sentinel checks run first so they never mask)
//   409  blocked by preflight counts (*BlockedByDependenciesError); the body
//        carries the per-bucket counts so the operator UI can tell the
//        human which downstream dependency is holding up the retirement,
//        rather than forcing them to re-run the DELETE with ?force=true
//        and guess
//   500  anything else
|
||||
//
|
||||
// The 409 body intentionally does NOT go through ErrorWithRequestID because
|
||||
// that helper's ErrorResponse shape has no `counts` field — we inline-marshal
|
||||
// a custom body instead. Keeping this shape stable is important: the GUI
|
||||
// pattern is "show the 409 dialog, list the N targets / M certs / K jobs
|
||||
// blocking, let the operator retire them first or tick the force checkbox."
|
||||
func (h AgentHandler) RetireAgent(w http.ResponseWriter, r *http.Request) {
|
||||
if r.Method != http.MethodDelete {
|
||||
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
|
||||
return
|
||||
}
|
||||
|
||||
requestID := middleware.GetRequestID(r.Context())
|
||||
|
||||
// Extract {id} from /api/v1/agents/{id}. Mirror GetAgent's pattern so
|
||||
// the path parser is identical across the agent handler surface and a
|
||||
// future refactor can extract it once without introducing drift.
|
||||
rawID := strings.TrimPrefix(r.URL.Path, "/api/v1/agents/")
|
||||
parts := strings.Split(rawID, "/")
|
||||
if len(parts) == 0 || parts[0] == "" {
|
||||
ErrorWithRequestID(w, http.StatusBadRequest, "Agent ID is required", requestID)
|
||||
return
|
||||
}
|
||||
id := parts[0]
|
||||
|
||||
// Parse optional force + reason. A missing `force` param is treated as
|
||||
// force=false (the default, safe path); anything strconv.ParseBool rejects
|
||||
// is also force=false so a malformed query can never silently enable the
|
||||
// cascade. The reason string is passed through verbatim — the service
|
||||
// owns the "force=true requires reason" rule.
|
||||
query := r.URL.Query()
|
||||
force := false
|
||||
if fv := query.Get("force"); fv != "" {
|
||||
if parsed, err := strconv.ParseBool(fv); err == nil {
|
||||
force = parsed
|
||||
}
|
||||
}
|
||||
reason := query.Get("reason")
|
||||
|
||||
actor := resolveActor(r.Context())
|
||||
|
||||
result, err := h.svc.RetireAgent(r.Context(), id, actor, force, reason)
|
||||
if err != nil {
|
||||
// Sentinel + typed-error checks run BEFORE string matching on "not
|
||||
// found" so a repo error that happens to contain those words can
|
||||
// never mask a structural refusal (403/400/409). Order matters.
|
||||
if errors.Is(err, service.ErrAgentIsSentinel) {
|
||||
ErrorWithRequestID(w, http.StatusForbidden, "Agent is a reserved sentinel and cannot be retired", requestID)
|
||||
return
|
||||
}
|
||||
if errors.Is(err, service.ErrForceReasonRequired) {
|
||||
ErrorWithRequestID(w, http.StatusBadRequest, "force=true requires a non-empty reason", requestID)
|
||||
return
|
||||
}
|
||||
var blocked *service.BlockedByDependenciesError
|
||||
if errors.As(err, &blocked) {
|
||||
// Custom 409 body with per-bucket counts. ErrorResponse has no
|
||||
// `counts` field, so we marshal a bespoke struct instead.
|
||||
// Keep `error`/`message`/`counts` as the stable shape — any
|
||||
// dashboard parsing this relies on those three keys.
|
||||
body := struct {
|
||||
Error string `json:"error"`
|
||||
Message string `json:"message"`
|
||||
Counts domain.AgentDependencyCounts `json:"counts"`
|
||||
}{
|
||||
Error: "blocked_by_dependencies",
|
||||
Message: "Agent has active downstream dependencies. Retire or reassign them " +
|
||||
"first, or re-run with ?force=true&reason=... to cascade.",
|
||||
Counts: blocked.Counts,
|
||||
}
|
||||
JSON(w, http.StatusConflict, body)
|
||||
return
|
||||
}
|
||||
if strings.Contains(err.Error(), "not found") {
|
||||
ErrorWithRequestID(w, http.StatusNotFound, "Agent not found", requestID)
|
||||
return
|
||||
}
|
||||
slog.Error("RetireAgent failed", "agent_id", id, "error", err.Error())
|
||||
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to retire agent", requestID)
|
||||
return
|
||||
}
|
||||
|
||||
// Idempotent retire: the agent was already retired, so we return 204 No
|
||||
// Content with a ZERO-length body. The Red contract (test line 106) fails
|
||||
// if even a trailing newline leaks into the response. WriteHeader alone
|
||||
// emits the status without invoking the JSON encoder.
|
||||
if result.AlreadyRetired {
|
||||
w.WriteHeader(http.StatusNoContent)
|
||||
return
|
||||
}
|
||||
|
||||
// Clean retire (force=false) or successful cascade (force=true). Body
|
||||
// shape pinned by Red contract: retired_at, already_retired, cascade,
|
||||
// counts. Omitempty is deliberately NOT used — operators parsing the
|
||||
// response expect every field to always be present.
|
||||
JSON(w, http.StatusOK, struct {
|
||||
RetiredAt time.Time `json:"retired_at"`
|
||||
AlreadyRetired bool `json:"already_retired"`
|
||||
Cascade bool `json:"cascade"`
|
||||
Counts domain.AgentDependencyCounts `json:"counts"`
|
||||
}{
|
||||
RetiredAt: result.RetiredAt,
|
||||
AlreadyRetired: result.AlreadyRetired,
|
||||
Cascade: result.Cascade,
|
||||
Counts: result.Counts,
|
||||
})
|
||||
}
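
The check ordering in RetireAgent (typed errors first, the "not found" string match last) can be exercised in isolation. The following is a minimal sketch, not the handler itself: `errAgentIsSentinel`, `errForceReasonRequired`, `blockedError`, and `statusFor` are hypothetical stand-ins for the service-package types, which are not shown in this diff.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
)

// Hypothetical stand-ins for service.ErrAgentIsSentinel and
// service.ErrForceReasonRequired.
var (
	errAgentIsSentinel     = errors.New("agent is a reserved sentinel")
	errForceReasonRequired = errors.New("force requires a reason")
	errNotFound            = errors.New("agent not found")
	errOther               = errors.New("db unreachable")

	// A sentinel refusal wrapped in a message that happens to contain
	// the words "not found".
	wrappedSentinel = fmt.Errorf("replacement agent not found: %w", errAgentIsSentinel)
)

// blockedError is a stand-in for *service.BlockedByDependenciesError.
type blockedError struct{ targets int }

func (e *blockedError) Error() string {
	return fmt.Sprintf("blocked by %d active targets", e.targets)
}

// statusFor maps an error to an HTTP status. Typed checks run before the
// string match so a message containing "not found" cannot mask them.
func statusFor(err error) int {
	var blocked *blockedError
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, errAgentIsSentinel):
		return http.StatusForbidden
	case errors.Is(err, errForceReasonRequired):
		return http.StatusBadRequest
	case errors.As(err, &blocked):
		return http.StatusConflict
	case strings.Contains(err.Error(), "not found"):
		return http.StatusNotFound
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	fmt.Println(statusFor(wrappedSentinel))          // 403, not 404
	fmt.Println(statusFor(&blockedError{targets: 2})) // 409
	fmt.Println(statusFor(errNotFound))               // 404
}
```

If the string match were moved to the top of the switch, the wrapped sentinel error would be reported as 404, which is the masking the handler comment warns about.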
|
||||
|
||||
// ListRetiredAgents returns the opt-in listing of retired agents for the
|
||||
// operator UI's "Retired" tab and for audit/forensics workflows.
|
||||
// GET /api/v1/agents/retired?page=1&per_page=50
|
||||
//
|
||||
// The default ListAgents handler hides retired rows; this is the dedicated
|
||||
// surface for reading them back. Pagination defaults match ListAgents so
|
||||
// the GUI can reuse the same query hook (page=1, per_page=50; a per_page
// above 500 is ignored and falls back to 50 rather than being clamped).
|
||||
//
|
||||
// Go 1.22's enhanced ServeMux routes `/agents/retired` to this handler via
|
||||
// the literal-beats-pattern-var precedence rule (literal `retired` wins over
|
||||
// `{id}` in the sibling GET /api/v1/agents/{id} route), so both entries can
|
||||
// coexist without conflict. If that precedence ever regresses, the failure
|
||||
// mode is TestListRetiredAgentsHandler_Success blowing up with a 404 — which
|
||||
// is the fast signal we want.
|
||||
func (h AgentHandler) ListRetiredAgents(w http.ResponseWriter, r *http.Request) {
|
||||
if r.Method != http.MethodGet {
|
||||
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
|
||||
return
|
||||
}
|
||||
|
||||
requestID := middleware.GetRequestID(r.Context())
|
||||
|
||||
page := 1
|
||||
perPage := 50
|
||||
query := r.URL.Query()
|
||||
if p := query.Get("page"); p != "" {
|
||||
if parsed, err := strconv.Atoi(p); err == nil && parsed > 0 {
|
||||
page = parsed
|
||||
}
|
||||
}
|
||||
if pp := query.Get("per_page"); pp != "" {
|
||||
if parsed, err := strconv.Atoi(pp); err == nil && parsed > 0 && parsed <= 500 {
|
||||
perPage = parsed
|
||||
}
|
||||
}
|
||||
|
||||
agents, total, err := h.svc.ListRetiredAgents(r.Context(), page, perPage)
|
||||
if err != nil {
|
||||
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to list retired agents", requestID)
|
||||
return
|
||||
}
|
||||
|
||||
JSON(w, http.StatusOK, PagedResponse{
|
||||
Data: agents,
|
||||
Total: total,
|
||||
Page: page,
|
||||
PerPage: perPage,
|
||||
})
|
||||
}
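
Both handlers parse their query parameters leniently: a bad value silently falls back to the safe default instead of producing a 400. A minimal sketch of that behaviour, assuming hypothetical helpers `parseForce` and `parsePaging` that take the raw query string (the handlers above inline this logic against `r.URL.Query()` instead):

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// parseForce returns true only when the query carries a `force` value that
// strconv.ParseBool accepts as true. Missing or malformed values, and an
// unparseable query string, all yield false, so the cascade is never
// enabled by accident.
func parseForce(rawQuery string) bool {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return false
	}
	parsed, err := strconv.ParseBool(q.Get("force"))
	return err == nil && parsed
}

// parsePaging applies the defaults page=1, per_page=50. A per_page outside
// 1..500 is ignored (the default is kept), it is not clamped to 500.
func parsePaging(rawQuery string) (page, perPage int) {
	page, perPage = 1, 50
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return page, perPage
	}
	if n, err := strconv.Atoi(q.Get("page")); err == nil && n > 0 {
		page = n
	}
	if n, err := strconv.Atoi(q.Get("per_page")); err == nil && n > 0 && n <= 500 {
		perPage = n
	}
	return page, perPage
}

func main() {
	fmt.Println(parseForce("force=true")) // true
	fmt.Println(parseForce("force=yes"))  // false: ParseBool rejects "yes"
	fmt.Println(parsePaging("page=3&per_page=9999"))
}
```

Note that `strconv.ParseBool` accepts a small fixed set of spellings (`1`, `t`, `true`, `0`, `f`, `false` and their capitalised forms), so `force=yes` leaves the cascade disabled.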
|
||||
|
||||
@@ -131,9 +131,21 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) {
|
||||
r.Register("POST /api/v1/targets/{id}/test", http.HandlerFunc(reg.Targets.TestTargetConnection))
|
||||
|
||||
// Agents routes: /api/v1/agents
|
||||
//
|
||||
// I-004 soft-retirement surface:
|
||||
// * GET /api/v1/agents/retired: opt-in listing of retired agents.
// Go 1.22 ServeMux picks the most specific matching pattern
// regardless of registration order, so the `retired` literal is
// routed to ListRetiredAgents instead of being treated as an {id}
// parameter value against GetAgent. Listing it before /agents/{id}
// keeps the related routes together; with the standard mux it is
// not what decides the match.
|
||||
// * DELETE /api/v1/agents/{id} — RetireAgent. Replaces the pre-I-004
|
||||
// hard-delete; the underlying repo does a soft-retire with
|
||||
// optional cascade.
|
||||
r.Register("GET /api/v1/agents", http.HandlerFunc(reg.Agents.ListAgents))
|
||||
r.Register("POST /api/v1/agents", http.HandlerFunc(reg.Agents.RegisterAgent))
|
||||
r.Register("GET /api/v1/agents/retired", http.HandlerFunc(reg.Agents.ListRetiredAgents))
|
||||
r.Register("GET /api/v1/agents/{id}", http.HandlerFunc(reg.Agents.GetAgent))
|
||||
r.Register("DELETE /api/v1/agents/{id}", http.HandlerFunc(reg.Agents.RetireAgent))
|
||||
r.Register("POST /api/v1/agents/{id}/heartbeat", http.HandlerFunc(reg.Agents.Heartbeat))
|
||||
r.Register("POST /api/v1/agents/{id}/csr", http.HandlerFunc(reg.Agents.AgentCSRSubmit))
|
||||
r.Register("GET /api/v1/agents/{id}/certificates/{cert_id}", http.HandlerFunc(reg.Agents.AgentCertificatePickup))
|
||||
|
||||
@@ -0,0 +1,228 @@
|
||||
package cli
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"net/http"
|
||||
"net/http/httptest"
|
||||
"testing"
|
||||
)
|
||||
|
||||
// TestClient_RetireAgent_Success pins the I-004 CLI happy path: the operator
|
||||
// runs `certctl-cli agents retire <id>` and the client issues a DELETE to
|
||||
// /api/v1/agents/{id}, parses the 200 JSON body (retired_at, already_retired,
|
||||
// cascade, counts), and reports success. The handler test already covers the
|
||||
// server-side contract; this test covers the client-side wire formatting so a
|
||||
// refactor of the server's 200 body shape can't silently break the CLI.
|
||||
func TestClient_RetireAgent_Success(t *testing.T) {
|
||||
var (
|
||||
sawMethod string
|
||||
sawPath string
|
||||
sawForce string
|
||||
sawReason string
|
||||
)
|
||||
|
||||
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
sawMethod = r.Method
|
||||
sawPath = r.URL.Path
|
||||
sawForce = r.URL.Query().Get("force")
|
||||
sawReason = r.URL.Query().Get("reason")
|
||||
|
||||
if r.Method != "DELETE" || r.URL.Path != "/api/v1/agents/ag-1" {
|
||||
w.WriteHeader(http.StatusNotFound)
|
||||
return
|
||||
}
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
w.WriteHeader(http.StatusOK)
|
||||
_ = json.NewEncoder(w).Encode(map[string]interface{}{
|
||||
"retired_at": "2026-04-18T12:00:00Z",
|
||||
"already_retired": false,
|
||||
"cascade": false,
|
||||
"counts": map[string]interface{}{
|
||||
"active_targets": 0,
|
||||
"active_certificates": 0,
|
||||
"pending_jobs": 0,
|
||||
},
|
||||
})
|
||||
}))
|
||||
defer server.Close()
|
||||
|
||||
client := NewClient(server.URL, "", "table")
|
||||
// Positional arg: the agent ID. No --force, no --reason — the default
|
||||
// soft-retire path. Compile-fail until client.RetireAgent exists.
|
||||
if err := client.RetireAgent([]string{"ag-1"}); err != nil {
|
||||
t.Fatalf("RetireAgent(ag-1) err=%v want nil", err)
|
||||
}
|
||||
|
||||
if sawMethod != "DELETE" {
|
||||
t.Errorf("method=%q want DELETE", sawMethod)
|
||||
}
|
||||
if sawPath != "/api/v1/agents/ag-1" {
|
||||
t.Errorf("path=%q want /api/v1/agents/ag-1", sawPath)
|
||||
}
|
||||
if sawForce != "" {
|
||||
t.Errorf("force query=%q want empty (default path sends no force)", sawForce)
|
||||
}
|
||||
if sawReason != "" {
|
||||
t.Errorf("reason query=%q want empty (default path sends no reason)", sawReason)
|
||||
}
|
||||
}
|
||||
|
||||
// TestClient_RetireAgent_Force_WithReason_Success pins the ?force=true&reason=...
|
||||
// escape hatch wiring. Operators who supply --force + --reason get their values
|
||||
// propagated as URL query parameters exactly once, so the server sees the same
|
||||
// contract the handler test expects. Also verifies the cascade=true response
|
||||
// body parses cleanly.
|
||||
func TestClient_RetireAgent_Force_WithReason_Success(t *testing.T) {
|
||||
var (
|
||||
sawForce string
|
||||
sawReason string
|
||||
)
|
||||
|
||||
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
sawForce = r.URL.Query().Get("force")
|
||||
sawReason = r.URL.Query().Get("reason")
|
||||
|
||||
if r.Method != "DELETE" || r.URL.Path != "/api/v1/agents/ag-1" {
|
||||
w.WriteHeader(http.StatusNotFound)
|
||||
return
|
||||
}
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
w.WriteHeader(http.StatusOK)
|
||||
_ = json.NewEncoder(w).Encode(map[string]interface{}{
|
||||
"retired_at": "2026-04-18T12:00:00Z",
|
||||
"already_retired": false,
|
||||
"cascade": true,
|
||||
"counts": map[string]interface{}{
|
||||
"active_targets": 2,
|
||||
"active_certificates": 5,
|
||||
"pending_jobs": 1,
|
||||
},
|
||||
})
|
||||
}))
|
||||
defer server.Close()
|
||||
|
||||
client := NewClient(server.URL, "", "table")
|
||||
if err := client.RetireAgent([]string{"ag-1", "--force", "--reason", "decommissioning rack 7"}); err != nil {
|
||||
t.Fatalf("RetireAgent(force+reason) err=%v want nil", err)
|
||||
}
|
||||
if sawForce != "true" {
|
||||
t.Errorf("force query=%q want \"true\"", sawForce)
|
||||
}
|
||||
if sawReason != "decommissioning rack 7" {
|
||||
t.Errorf("reason query=%q want %q", sawReason, "decommissioning rack 7")
|
||||
}
|
||||
}
|
||||
|
||||
// TestClient_RetireAgent_Force_RequiresReason pins the client-side guard: using
|
||||
// --force without --reason must fail BEFORE any HTTP request is made. Without
|
||||
// this, the client would bounce off the server's 400 ErrForceReasonRequired
|
||||
// only after a round trip — slow feedback, wasted audit-trail noise, and a
|
||||
// worse operator experience. requestCount=0 enforces that no HTTP call happens.
|
||||
func TestClient_RetireAgent_Force_RequiresReason(t *testing.T) {
|
||||
var requestCount int
|
||||
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
requestCount++
|
||||
w.WriteHeader(http.StatusOK)
|
||||
}))
|
||||
defer server.Close()
|
||||
|
||||
client := NewClient(server.URL, "", "table")
|
||||
err := client.RetireAgent([]string{"ag-1", "--force"})
|
||||
if err == nil {
|
||||
t.Fatalf("RetireAgent(force, no reason) err=nil want client-side error")
|
||||
}
|
||||
if !containsStr(err.Error(), "reason") {
|
||||
t.Errorf("err=%q should mention --reason to guide operator", err.Error())
|
||||
}
|
||||
if requestCount != 0 {
|
||||
t.Fatalf("requestCount=%d want 0; client must short-circuit before HTTP call", requestCount)
|
||||
}
|
||||
}
|
||||
|
||||
// TestClient_RetireAgent_MissingID covers the other common operator mistake:
|
||||
// invoking `certctl-cli agents retire` with no agent ID. Must be caught by the
|
||||
// client with a clear error, not a malformed DELETE to /api/v1/agents/.
|
||||
func TestClient_RetireAgent_MissingID(t *testing.T) {
|
||||
var requestCount int
|
||||
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
requestCount++
|
||||
w.WriteHeader(http.StatusOK)
|
||||
}))
|
||||
defer server.Close()
|
||||
|
||||
client := NewClient(server.URL, "", "table")
|
||||
err := client.RetireAgent([]string{})
|
||||
if err == nil {
|
||||
t.Fatalf("RetireAgent([]) err=nil want missing-id error")
|
||||
}
|
||||
if requestCount != 0 {
|
||||
t.Fatalf("requestCount=%d want 0; client must reject missing-id before HTTP", requestCount)
|
||||
}
|
||||
}
|
||||
|
||||
// TestClient_ListRetiredAgents_Success pins the audit/forensics CLI surface:
|
||||
// `certctl-cli agents list-retired` must GET /api/v1/agents/retired and render
|
||||
// the paged response. The server returns a PagedResponse; the client is
|
||||
// responsible for printing it in table or JSON format, same as ListAgents.
|
||||
func TestClient_ListRetiredAgents_Success(t *testing.T) {
|
||||
var (
|
||||
sawMethod string
|
||||
sawPath string
|
||||
)
|
||||
|
||||
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
sawMethod = r.Method
|
||||
sawPath = r.URL.Path
|
||||
|
||||
if r.Method != "GET" || r.URL.Path != "/api/v1/agents/retired" {
|
||||
w.WriteHeader(http.StatusNotFound)
|
||||
return
|
||||
}
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
_ = json.NewEncoder(w).Encode(map[string]interface{}{
|
||||
"data": []map[string]interface{}{
|
||||
{
|
||||
"id": "ag-old-01",
|
||||
"name": "decom-01",
|
||||
"hostname": "server-old",
|
||||
"status": "Offline",
|
||||
"registered_at": "2024-01-01T00:00:00Z",
|
||||
"retired_at": "2026-01-01T00:00:00Z",
|
||||
"retired_reason": "old hardware",
|
||||
},
|
||||
},
|
||||
"total": 1,
|
||||
"page": 1,
|
||||
"per_page": 50,
|
||||
})
|
||||
}))
|
||||
defer server.Close()
|
||||
|
||||
client := NewClient(server.URL, "", "table")
|
||||
if err := client.ListRetiredAgents([]string{}); err != nil {
|
||||
t.Fatalf("ListRetiredAgents err=%v want nil", err)
|
||||
}
|
||||
if sawMethod != "GET" {
|
||||
t.Errorf("method=%q want GET", sawMethod)
|
||||
}
|
||||
if sawPath != "/api/v1/agents/retired" {
|
||||
t.Errorf("path=%q want /api/v1/agents/retired", sawPath)
|
||||
}
|
||||
}
|
||||
|
||||
// TestClient_ListRetiredAgents_ServerError covers the non-happy path: server
|
||||
// returns 5xx → client surfaces the error rather than silently printing an
|
||||
// empty list. Without this, operators running the command as part of a
|
||||
// compliance audit could miss a backend outage.
|
||||
func TestClient_ListRetiredAgents_ServerError(t *testing.T) {
|
||||
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
http.Error(w, "db unreachable", http.StatusInternalServerError)
|
||||
}))
|
||||
defer server.Close()
|
||||
|
||||
client := NewClient(server.URL, "", "table")
|
||||
err := client.ListRetiredAgents([]string{})
|
||||
if err == nil {
|
||||
t.Fatalf("ListRetiredAgents(500) err=nil want propagated error")
|
||||
}
|
||||
}
|
||||
@@ -293,6 +293,194 @@ func (c *Client) ListAgents(args []string) error {
|
||||
return c.outputAgentsTable(result.Data, result.Total)
|
||||
}
|
||||
|
||||
// ListRetiredAgents lists soft-retired agents from the dedicated endpoint.
|
||||
//
|
||||
// I-004: hits GET /api/v1/agents/retired which is a separate route from the
|
||||
// default listing (the default hides retired rows). Supports --page and
|
||||
// --per-page just like the active list. Output format mirrors ListAgents
|
||||
// but appends RETIRED AT and REASON columns so the operator can
|
||||
// forensic-grep the output.
|
||||
func (c *Client) ListRetiredAgents(args []string) error {
|
||||
fs := flag.NewFlagSet("agents list --retired", flag.ContinueOnError)
|
||||
page := fs.Int("page", 1, "Page number")
|
||||
perPage := fs.Int("per-page", 50, "Items per page")
|
||||
if err := fs.Parse(args); err != nil {
return err
}
|
||||
|
||||
query := url.Values{}
|
||||
query.Set("page", fmt.Sprintf("%d", *page))
|
||||
query.Set("per_page", fmt.Sprintf("%d", *perPage))
|
||||
|
||||
resp, err := c.do("GET", "/api/v1/agents/retired", query, nil)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
var result struct {
|
||||
Data []map[string]interface{} `json:"data"`
|
||||
Total int `json:"total"`
|
||||
}
|
||||
if err := json.Unmarshal(resp, &result); err != nil {
|
||||
return fmt.Errorf("parsing response: %w", err)
|
||||
}
|
||||
|
||||
if c.format == "json" {
|
||||
return c.outputJSON(result)
|
||||
}
|
||||
|
||||
return c.outputRetiredAgentsTable(result.Data, result.Total)
|
||||
}
|
||||
|
||||
// RetireAgent soft-retires an agent via DELETE /api/v1/agents/{id}.
|
||||
//
|
||||
// I-004: wraps the full status-code matrix pinned by the handler's
|
||||
// agent_retire_handler_test.go:
|
||||
//
|
||||
// 200 clean retire — body: retired_at, already_retired=false, cascade=false, counts=0
|
||||
// 200 force-cascade retire — body: cascade=true, counts=pre-cascade snapshot
|
||||
// 204 idempotent retire — agent was already retired, NO body
|
||||
// 403 sentinel — reserved agent (server-scanner / cloud-*), ErrAgentIsSentinel
|
||||
// 404 not found — agent doesn't exist
|
||||
// 409 blocked_by_dependencies — body: error, message, counts
|
||||
//
|
||||
// The default (force=false) flow refuses to retire agents with active
|
||||
// downstream dependencies; the operator must re-run with --force and an
|
||||
// explicit --reason to cascade. The handler rejects --force without
|
||||
// --reason with a 400 — we mirror that contract client-side so the
|
||||
// operator gets a clear error before the round trip.
|
||||
func (c *Client) RetireAgent(args []string) error {
|
||||
// Convention: `agents retire <id> [--force] [--reason <reason>]` — the ID
|
||||
// is a positional arg that precedes the flags. Go's flag package stops
|
||||
// parsing at the first non-flag token, so we pull args[0] as the ID and
|
||||
// hand args[1:] to the flag parser. Without this split, `agents retire
|
||||
// ag-1 --force --reason "x"` would parse with force=false and reason=""
|
||||
// because the flags land in fs.Args() instead of being recognized.
|
||||
if len(args) == 0 {
|
||||
return fmt.Errorf("agent ID is required: agents retire <id> [--force] [--reason <reason>]")
|
||||
}
|
||||
id := args[0]
|
||||
|
||||
fs := flag.NewFlagSet("agents retire", flag.ContinueOnError)
|
||||
force := fs.Bool("force", false, "Cascade-retire downstream targets, certs, and jobs")
|
||||
reason := fs.String("reason", "", "Human-readable reason (required with --force)")
|
||||
if err := fs.Parse(args[1:]); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
// Mirror the handler's ErrForceReasonRequired contract client-side so
|
||||
// the operator gets a clear error before the round trip.
|
||||
if *force && strings.TrimSpace(*reason) == "" {
|
||||
return fmt.Errorf("--reason is required when --force is set")
|
||||
}
|
||||
|
||||
// Build query string. Skip ?force=false; skip ?reason= when empty.
|
||||
query := url.Values{}
|
||||
if *force {
|
||||
query.Set("force", "true")
|
||||
}
|
||||
if *reason != "" {
|
||||
query.Set("reason", *reason)
|
||||
}
|
||||
|
||||
u, err := url.JoinPath(c.baseURL, fmt.Sprintf("/api/v1/agents/%s", id))
|
||||
if err != nil {
|
||||
return fmt.Errorf("invalid URL: %w", err)
|
||||
}
|
||||
if len(query) > 0 {
|
||||
u = u + "?" + query.Encode()
|
||||
}
|
||||
|
||||
req, err := http.NewRequest("DELETE", u, nil)
|
||||
if err != nil {
|
||||
return fmt.Errorf("creating request: %w", err)
|
||||
}
|
||||
req.Header.Set("Accept", "application/json")
|
||||
if c.apiKey != "" {
|
||||
req.Header.Set("Authorization", "Bearer "+c.apiKey)
|
||||
}
|
||||
|
||||
resp, err := c.httpClient.Do(req)
|
||||
if err != nil {
|
||||
return fmt.Errorf("request failed: %w", err)
|
||||
}
|
||||
defer resp.Body.Close()
|
||||
|
||||
body, err := io.ReadAll(resp.Body)
|
||||
if err != nil {
|
||||
return fmt.Errorf("reading response: %w", err)
|
||||
}
|
||||
|
||||
switch resp.StatusCode {
|
||||
case http.StatusNoContent:
|
||||
// 204 idempotent — the agent was already retired. No body.
|
||||
if c.format == "json" {
|
||||
return c.outputJSON(map[string]interface{}{
|
||||
"agent_id": id,
|
||||
"already_retired": true,
|
||||
})
|
||||
}
|
||||
fmt.Printf("Agent %s was already retired (idempotent)\n", id)
|
||||
return nil
|
||||
|
||||
case http.StatusOK:
|
||||
var result struct {
|
||||
RetiredAt string `json:"retired_at"`
|
||||
AlreadyRetired bool `json:"already_retired"`
|
||||
Cascade bool `json:"cascade"`
|
||||
Counts struct {
|
||||
ActiveTargets int `json:"active_targets"`
|
||||
ActiveCertificates int `json:"active_certificates"`
|
||||
PendingJobs int `json:"pending_jobs"`
|
||||
} `json:"counts"`
|
||||
}
|
||||
if err := json.Unmarshal(body, &result); err != nil {
|
||||
return fmt.Errorf("parsing 200 response: %w", err)
|
||||
}
|
||||
|
||||
if c.format == "json" {
|
||||
return c.outputJSON(json.RawMessage(body))
|
||||
}
|
||||
|
||||
if result.Cascade {
|
||||
fmt.Printf("Agent %s retired (cascade). Retired at: %s\n", id, result.RetiredAt)
|
||||
fmt.Printf(" Cascaded: %d targets, %d certificates, %d jobs\n",
|
||||
result.Counts.ActiveTargets, result.Counts.ActiveCertificates, result.Counts.PendingJobs)
|
||||
} else {
|
||||
fmt.Printf("Agent %s retired. Retired at: %s\n", id, result.RetiredAt)
|
||||
}
|
||||
return nil
|
||||
|
||||
case http.StatusConflict:
|
||||
// 409 blocked_by_dependencies. Parse the body so we can show the
|
||||
// operator which dependency counts are holding up the retire.
|
||||
var blocked struct {
|
||||
Error string `json:"error"`
|
||||
Message string `json:"message"`
|
||||
Counts struct {
|
||||
ActiveTargets int `json:"active_targets"`
|
||||
ActiveCertificates int `json:"active_certificates"`
|
||||
PendingJobs int `json:"pending_jobs"`
|
||||
} `json:"counts"`
|
||||
}
|
||||
if err := json.Unmarshal(body, &blocked); err != nil {
|
||||
return fmt.Errorf("agent has active dependencies (HTTP 409); raw body: %s", string(body))
|
||||
}
|
||||
return fmt.Errorf("blocked_by_dependencies: %s (targets=%d certificates=%d jobs=%d); re-run with --force --reason \"<reason>\" to cascade",
|
||||
blocked.Message, blocked.Counts.ActiveTargets, blocked.Counts.ActiveCertificates, blocked.Counts.PendingJobs)
|
||||
|
||||
case http.StatusForbidden:
|
||||
return fmt.Errorf("agent %s is a reserved sentinel and cannot be retired (HTTP 403)", id)
|
||||
|
||||
case http.StatusNotFound:
|
||||
return fmt.Errorf("agent %s not found (HTTP 404)", id)
|
||||
|
||||
case http.StatusBadRequest:
|
||||
return fmt.Errorf("bad request (HTTP 400): %s", string(body))
|
||||
|
||||
default:
|
||||
return fmt.Errorf("unexpected HTTP %d: %s", resp.StatusCode, string(body))
|
||||
}
|
||||
}
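
The positional-ID-then-flags convention described in RetireAgent can be sketched on its own. This is an illustration under assumptions, not the CLI's actual code path: `retireArgs` and `parseRetireArgs` are hypothetical names, and the sketch silences the flag package's usage output with `io.Discard`, which the real command may not do.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"strings"
)

// retireArgs holds the parsed form of `agents retire <id> [--force] [--reason <reason>]`.
type retireArgs struct {
	ID     string
	Force  bool
	Reason string
}

// parseRetireArgs takes args[0] as the agent ID and hands the rest to the
// flag parser, because Go's flag package stops at the first non-flag token.
func parseRetireArgs(args []string) (retireArgs, error) {
	if len(args) == 0 {
		return retireArgs{}, errors.New("agent ID is required: agents retire <id> [--force] [--reason <reason>]")
	}
	fs := flag.NewFlagSet("agents retire", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // sketch only: keep parse errors out of stderr
	force := fs.Bool("force", false, "Cascade-retire downstream targets, certs, and jobs")
	reason := fs.String("reason", "", "Human-readable reason (required with --force)")
	if err := fs.Parse(args[1:]); err != nil {
		return retireArgs{}, err
	}
	// Client-side mirror of the server's force-requires-reason rule.
	if *force && strings.TrimSpace(*reason) == "" {
		return retireArgs{}, errors.New("--reason is required when --force is set")
	}
	return retireArgs{ID: args[0], Force: *force, Reason: *reason}, nil
}

func main() {
	a, err := parseRetireArgs([]string{"ag-1", "--force", "--reason", "decommissioning rack 7"})
	fmt.Printf("%+v %v\n", a, err)

	_, err = parseRetireArgs([]string{"ag-1", "--force"})
	fmt.Println(err) // --reason is required when --force is set
}
```

Passing the whole slice to `fs.Parse` instead of `args[1:]` would stop at `ag-1` and leave both flags at their defaults, which is the failure mode the comment in RetireAgent describes.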
|
||||
|
||||
// GetAgent retrieves a single agent by ID.
|
||||
func (c *Client) GetAgent(id string) error {
|
||||
resp, err := c.do("GET", fmt.Sprintf("/api/v1/agents/%s", id), nil, nil)
|
||||
@@ -613,6 +801,35 @@ func (c *Client) outputAgentsTable(agents []map[string]interface{}, total int) e
|
||||
return nil
|
||||
}
|
||||
|
||||
// outputRetiredAgentsTable is the tab-writer view for the retired listing.
|
||||
// I-004: adds RETIRED_AT + REASON columns so operators can forensic-grep.
|
||||
func (c *Client) outputRetiredAgentsTable(agents []map[string]interface{}, total int) error {
|
||||
w := tabwriter.NewWriter(os.Stdout, 0, 0, 2, ' ', 0)
|
||||
fmt.Fprintln(w, "ID\tHOSTNAME\tOS\tARCHITECTURE\tRETIRED AT\tREASON")
|
||||
|
||||
for _, agent := range agents {
|
||||
id := getString(agent, "id")
|
||||
hostname := getString(agent, "hostname")
|
||||
osName := getString(agent, "os")
|
||||
arch := getString(agent, "architecture")
|
||||
retiredAt := ""
|
||||
if raw, ok := agent["retired_at"].(string); ok && raw != "" {
|
||||
if t, err := time.Parse(time.RFC3339, raw); err == nil {
|
||||
retiredAt = t.Format("2006-01-02 15:04:05")
|
||||
} else {
|
||||
retiredAt = raw
|
||||
}
|
||||
}
|
||||
reason := getString(agent, "retired_reason")
|
||||
|
||||
fmt.Fprintf(w, "%s\t%s\t%s\t%s\t%s\t%s\n", id, hostname, osName, arch, retiredAt, reason)
|
||||
}
|
||||
|
||||
w.Flush()
|
||||
fmt.Printf("\nTotal retired: %d\n", total)
|
||||
return nil
|
||||
}
|
||||
|
||||
func (c *Client) outputAgentDetail(agent map[string]interface{}) error {
|
||||
w := tabwriter.NewWriter(os.Stdout, 0, 0, 2, ' ', 0)
|
||||
|
||||
|
||||
@@ -32,6 +32,8 @@ type DeploymentTarget struct {
|
||||
LastTestedAt *time.Time `json:"last_tested_at,omitempty"`
|
||||
TestStatus string `json:"test_status,omitempty"`
|
||||
Source string `json:"source,omitempty"`
|
||||
RetiredAt *time.Time `json:"retired_at,omitempty"` // I-004: soft-retirement timestamp (nil = active)
|
||||
RetiredReason *string `json:"retired_reason,omitempty"` // I-004: reason captured at cascade retirement
|
||||
CreatedAt time.Time `json:"created_at"`
|
||||
UpdatedAt time.Time `json:"updated_at"`
|
||||
}
|
||||
@@ -49,6 +51,67 @@ type Agent struct {
|
||||
Architecture string `json:"architecture"`
|
||||
IPAddress string `json:"ip_address"`
|
||||
Version string `json:"version"`
|
||||
// I-004: soft-retirement fields. An agent with RetiredAt != nil is the
|
||||
// canonical "retired" state. The Status column remains as before (Online
|
||||
// / Offline / Degraded) and is preserved at retirement time as the
|
||||
// last-seen operational status; RetiredAt is the source of truth for
|
||||
// "should we filter this row from active listings?".
|
||||
RetiredAt *time.Time `json:"retired_at,omitempty"`
|
||||
RetiredReason *string `json:"retired_reason,omitempty"`
|
||||
}
|
||||
|
||||
// IsRetired returns true when this agent has been soft-retired.
|
||||
// I-004: callers that iterate active agents (stats dashboard, stale-offline
|
||||
// sweeper, handler-facing list) must skip retired rows by default.
|
||||
func (a *Agent) IsRetired() bool { return a != nil && a.RetiredAt != nil }
|
||||
|
||||
// AgentDependencyCounts captures the active downstream rows that would be
|
||||
// affected by retiring an agent. Returned by the preflight pass on
|
||||
// DELETE /api/v1/agents/{id}. Zero counts mean a clean soft-retire is safe;
|
||||
// any non-zero count blocks a default retire with HTTP 409 and requires an
|
||||
// explicit ?force=true&reason=... escape hatch from the operator.
|
||||
type AgentDependencyCounts struct {
|
||||
ActiveTargets int `json:"active_targets"` // deployment_targets.agent_id=id AND retired_at IS NULL
|
||||
ActiveCertificates int `json:"active_certificates"` // certificates currently deployed via one of this agent's active targets
|
||||
PendingJobs int `json:"pending_jobs"` // jobs.agent_id=id AND status IN (Pending, AwaitingCSR, AwaitingApproval, Running)
|
||||
}
|
||||
|
||||
// HasDependencies reports whether any preflight counter is non-zero.
|
||||
func (d AgentDependencyCounts) HasDependencies() bool {
|
||||
return d.ActiveTargets > 0 || d.ActiveCertificates > 0 || d.PendingJobs > 0
|
||||
}
|
||||
|
||||
// SentinelAgentIDs enumerates the four reserved agent identities that back
|
||||
// non-agent discovery subsystems. These rows are created by cmd/server on
|
||||
// startup and retiring them would orphan their subsystem — the network
|
||||
// scanner and the three cloud secret-manager sources all key writes to
|
||||
// these IDs via service.SentinelAgentID / service.SentinelAWSSecretsMgr /
|
||||
// service.SentinelAzureKeyVault / service.SentinelGCPSecretMgr. The four
|
||||
// literal IDs below MUST stay in lockstep with those service-package
|
||||
// constants (see internal/service/network_scan.go line 23 and
|
||||
// internal/service/cloud_discovery.go lines 14-16).
|
||||
//
|
||||
// The retirement service refuses them unconditionally — even with
|
||||
// ?force=true — via ErrAgentIsSentinel. Living here (and not in the
|
||||
// service package) lets handler, repository, and scheduler code filter
|
||||
// them without importing service and creating a cycle.
|
||||
var SentinelAgentIDs = []string{
	"server-scanner",
	"cloud-aws-sm",
	"cloud-azure-kv",
	"cloud-gcp-sm",
}

// IsSentinelAgent reports whether id matches one of the four reserved
// sentinel agent IDs. A linear scan is fine — the slice is length 4 and
// the check is rare (only on retirement attempts and sweeper filters).
func IsSentinelAgent(id string) bool {
	for _, s := range SentinelAgentIDs {
		if s == id {
			return true
		}
	}
	return false
}
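// The lockstep requirement between this slice and the service-package
// constants could be guarded by a small drift check. The sketch below is an
// assumption-laden illustration: the lower-case constants are hypothetical
// stand-ins for the service constants, and their literal values are assumed
// to equal the four IDs listed above.

```go
package main

import "fmt"

// SentinelAgentIDs is a local copy of the domain slice from the diff.
var SentinelAgentIDs = []string{
	"server-scanner",
	"cloud-aws-sm",
	"cloud-azure-kv",
	"cloud-gcp-sm",
}

// Hypothetical stand-ins for the service-package constants.
const (
	sentinelAgentID       = "server-scanner"
	sentinelAWSSecretsMgr = "cloud-aws-sm"
	sentinelAzureKeyVault = "cloud-azure-kv"
	sentinelGCPSecretMgr  = "cloud-gcp-sm"
)

// expectedSentinels lists the IDs the service side is assumed to write to.
func expectedSentinels() []string {
	return []string{sentinelAgentID, sentinelAWSSecretsMgr, sentinelAzureKeyVault, sentinelGCPSecretMgr}
}

// missingSentinels returns every expected ID that is absent from ids.
func missingSentinels(ids, expected []string) []string {
	seen := make(map[string]bool, len(ids))
	for _, id := range ids {
		seen[id] = true
	}
	var missing []string
	for _, e := range expected {
		if !seen[e] {
			missing = append(missing, e)
		}
	}
	return missing
}

func main() {
	fmt.Println(len(missingSentinels(SentinelAgentIDs, expectedSentinels())))
}
```

// In a real test the expected list would reference the service constants
// directly, so renaming one side without the other fails the check.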
|
||||
|
||||
// AgentMetadata contains runtime metadata reported by agents via heartbeat.
|
||||
|
||||
@@ -0,0 +1,55 @@
|
||||
package domain
|
||||
|
||||
import (
|
||||
"testing"
|
||||
"time"
|
||||
)
|
||||
|
||||
// TestAgent_IsRetired covers the I-004 soft-retirement predicate that gates
|
||||
// which callers hide an agent row from active listings.
|
||||
func TestAgent_IsRetired(t *testing.T) {
|
||||
t.Run("nil receiver is not retired", func(t *testing.T) {
|
||||
var a *Agent
|
||||
if a.IsRetired() {
|
||||
t.Fatalf("nil *Agent should not be retired")
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("zero value is not retired", func(t *testing.T) {
|
||||
a := &Agent{}
|
||||
if a.IsRetired() {
|
||||
t.Fatalf("zero Agent should not be retired")
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("RetiredAt set is retired", func(t *testing.T) {
|
||||
now := time.Now()
|
||||
a := &Agent{RetiredAt: &now}
|
||||
if !a.IsRetired() {
|
||||
t.Fatalf("Agent with RetiredAt != nil must be retired")
|
||||
}
|
||||
})
|
||||
}
|
||||
|
||||
// TestAgentDependencyCounts_HasDependencies verifies the preflight
|
||||
// aggregation helper used by the 409 block path of DELETE /agents/{id}.
|
||||
func TestAgentDependencyCounts_HasDependencies(t *testing.T) {
|
||||
cases := []struct {
|
||||
name string
|
||||
counts AgentDependencyCounts
|
||||
want bool
|
||||
}{
|
||||
{"all zero", AgentDependencyCounts{}, false},
|
||||
{"active target", AgentDependencyCounts{ActiveTargets: 1}, true},
|
||||
{"active cert", AgentDependencyCounts{ActiveCertificates: 1}, true},
|
||||
{"pending job", AgentDependencyCounts{PendingJobs: 1}, true},
|
||||
{"mixed", AgentDependencyCounts{ActiveTargets: 3, PendingJobs: 2}, true},
|
||||
}
|
||||
for _, tc := range cases {
|
||||
t.Run(tc.name, func(t *testing.T) {
|
||||
if got := tc.counts.HasDependencies(); got != tc.want {
|
||||
t.Fatalf("HasDependencies()=%v want=%v counts=%+v", got, tc.want, tc.counts)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
@@ -848,6 +848,56 @@ func (m *mockAgentRepository) GetByAPIKey(ctx context.Context, keyHash string) (
|
||||
return nil, fmt.Errorf("agent not found")
|
||||
}
|
||||
|
||||
// I-004: the integration-level mockAgentRepository implements the 6 new
|
||||
// retirement-surface methods as thin contract-satisfying stubs. The
|
||||
// integration suite exercises lifecycle flows (issue → renew → deploy)
|
||||
// that don't touch retirement, so these methods never need real behavior
|
||||
// here — they exist purely to keep mockAgentRepository a valid
|
||||
// AgentRepository implementation after migration 000015 expanded the
|
||||
// interface. Dedicated retirement tests live in internal/service/
|
||||
// agent_retire_test.go against the richer service-layer mockAgentRepo.
|
||||
|
||||
func (m *mockAgentRepository) ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error) {
|
||||
var retired []*domain.Agent
|
||||
for _, a := range m.agents {
|
||||
if a.RetiredAt != nil {
|
||||
retired = append(retired, a)
|
||||
}
|
||||
}
|
||||
return retired, len(retired), nil
|
||||
}
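// The stub above ignores page and perPage. A paginating variant could compute
// its window as sketched below, assuming the clamping defaults that the
// AgentRepository.ListRetired contract describes (page=1, perPage=50).
// paginate is a hypothetical helper, not part of the repository.

```go
package main

import "fmt"

// paginate returns the half-open [start, end) window into a result set of
// the given total size, clamping invalid inputs to page=1 and perPage=50.
func paginate(total, page, perPage int) (start, end int) {
	if page < 1 {
		page = 1
	}
	if perPage < 1 {
		perPage = 50
	}
	start = (page - 1) * perPage
	if start > total {
		start = total // page past the end yields an empty window
	}
	end = start + perPage
	if end > total {
		end = total
	}
	return start, end
}

func main() {
	fmt.Println(paginate(120, 0, 0))  // 0 50
	fmt.Println(paginate(120, 3, 50)) // 100 120
	fmt.Println(paginate(10, 5, 50))  // 10 10
}
```

// A mock using this helper would return retired[start:end] together with
// len(retired) as the total.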
|
||||
|
||||
func (m *mockAgentRepository) SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error {
|
||||
agent, ok := m.agents[id]
|
||||
if !ok {
|
||||
return fmt.Errorf("agent not found")
|
||||
}
|
||||
if agent.RetiredAt != nil {
|
||||
return nil
|
||||
}
|
||||
stamped := retiredAt
|
||||
agent.RetiredAt = &stamped
|
||||
stampedReason := reason
|
||||
agent.RetiredReason = &stampedReason
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *mockAgentRepository) RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error {
|
||||
return m.SoftRetire(ctx, id, retiredAt, reason)
|
||||
}
|
||||
|
||||
func (m *mockAgentRepository) CountActiveTargets(ctx context.Context, agentID string) (int, error) {
|
||||
return 0, nil
|
||||
}
|
||||
|
||||
func (m *mockAgentRepository) CountActiveCertificates(ctx context.Context, agentID string) (int, error) {
|
||||
return 0, nil
|
||||
}
|
||||
|
||||
func (m *mockAgentRepository) CountPendingJobs(ctx context.Context, agentID string) (int, error) {
|
||||
return 0, nil
|
||||
}
|
||||
|
||||
type mockTargetRepository struct {
|
||||
targets map[string]*domain.DeploymentTarget
|
||||
}
|
||||
|
||||
@@ -49,6 +49,16 @@ func (c *Client) Delete(path string) (json.RawMessage, error) {
|
||||
return c.do("DELETE", path, nil, nil)
|
||||
}
|
||||
|
||||
// DeleteWithQuery performs an HTTP DELETE with query parameters. I-004 adds
// this transport so MCP tools can target endpoints that carry flags in the
// query string (e.g. DELETE /api/v1/agents/{id}?force=true&reason=…). Client.Delete
// is path-only; without this method the retire tool silently drops force/reason,
// turning every cascade retire into a default soft-retire. Shares do()'s 204
// normalization and 4xx/5xx error propagation so tool authors get one contract.
func (c *Client) DeleteWithQuery(path string, query url.Values) (json.RawMessage, error) {
	return c.do("DELETE", path, query, nil)
}
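// do() itself is not shown in this diff. As a rough sketch of the behavior
// the tests expect from it (query appended only when non-empty, reason
// URL-encoded), the hypothetical helpers below build the request URL with
// net/url. Their names and the base URL are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// retireQuery (hypothetical) mirrors how the retire tool fills url.Values:
// force and reason are only set when present.
func retireQuery(force bool, reason string) url.Values {
	q := url.Values{}
	if force {
		q.Set("force", "true")
	}
	if reason != "" {
		q.Set("reason", reason)
	}
	return q
}

// buildRequestURL (hypothetical) appends the encoded query only when it is
// non-empty, so a nil or empty query never leaves a stray "?".
func buildRequestURL(baseURL, path string, query url.Values) string {
	u := baseURL + path
	if len(query) > 0 {
		u += "?" + query.Encode()
	}
	return u
}

func main() {
	base := "http://localhost:8443" // assumed address
	fmt.Println(buildRequestURL(base, "/api/v1/agents/ag-1", retireQuery(true, "decommissioning rack 7")))
	fmt.Println(buildRequestURL(base, "/api/v1/agents/ag-1", nil))
}
```

// url.Values.Encode sorts keys and escapes spaces as "+", which the server
// side decodes back through r.URL.Query().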
|
||||
|
||||
// GetRaw performs an HTTP GET and returns the raw response body bytes and content type.
|
||||
// Used for binary responses (DER CRL, OCSP).
|
||||
func (c *Client) GetRaw(path string) ([]byte, string, error) {
|
||||
|
||||
@@ -0,0 +1,214 @@
|
||||
package mcp
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"net/http"
|
||||
"net/http/httptest"
|
||||
"net/url"
|
||||
"strings"
|
||||
"testing"
|
||||
)
|
||||
|
||||
// TestClient_DeleteWithQuery_ForceRetire covers the new transport capability
|
||||
// that I-004 adds to the MCP client. The retire tool needs to issue
|
||||
// DELETE /api/v1/agents/{id}?force=true&reason=... — Client.Delete as it
|
||||
// stands only accepts a path, dropping query parameters on the floor. Phase 2b
|
||||
// must add DeleteWithQuery so the MCP retire tool can hit the force escape
|
||||
// hatch; without this, every retire-via-MCP call with force=true silently
|
||||
// becomes a default soft-retire and either succeeds wrongly or 409s.
|
||||
func TestClient_DeleteWithQuery_ForceRetire(t *testing.T) {
|
||||
var (
|
||||
sawMethod string
|
||||
sawPath string
|
||||
sawForce string
|
||||
sawReason string
|
||||
)
|
||||
|
||||
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
sawMethod = r.Method
|
||||
sawPath = r.URL.Path
|
||||
sawForce = r.URL.Query().Get("force")
|
||||
sawReason = r.URL.Query().Get("reason")
|
||||
|
||||
if r.Method != http.MethodDelete || r.URL.Path != "/api/v1/agents/ag-1" {
|
||||
w.WriteHeader(http.StatusNotFound)
|
||||
return
|
||||
}
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
w.WriteHeader(http.StatusOK)
|
||||
_ = json.NewEncoder(w).Encode(map[string]interface{}{
|
||||
"retired_at": "2026-04-18T12:00:00Z",
|
||||
"already_retired": false,
|
||||
"cascade": true,
|
||||
})
|
||||
}))
|
||||
defer server.Close()
|
||||
|
||||
c := NewClient(server.URL, "test-key")
|
||||
// Compile-fail until Phase 2b grows Client.DeleteWithQuery. Passing the
|
||||
// query as a url.Values is the established pattern (matches Get's shape).
|
||||
query := url.Values{}
|
||||
query.Set("force", "true")
|
||||
query.Set("reason", "decommissioning rack 7")
|
||||
data, err := c.DeleteWithQuery("/api/v1/agents/ag-1", query)
|
||||
if err != nil {
|
||||
t.Fatalf("DeleteWithQuery err=%v want nil", err)
|
||||
}
|
||||
if data == nil {
|
||||
t.Fatal("DeleteWithQuery returned nil data; want 200 body echo-back")
|
||||
}
|
||||
|
||||
if sawMethod != http.MethodDelete {
|
||||
t.Errorf("method=%q want DELETE", sawMethod)
|
||||
}
|
||||
if sawPath != "/api/v1/agents/ag-1" {
|
||||
t.Errorf("path=%q want /api/v1/agents/ag-1 (query must be stripped from path)", sawPath)
|
||||
}
|
||||
if sawForce != "true" {
|
||||
t.Errorf("force query=%q want \"true\"", sawForce)
|
||||
}
|
||||
if sawReason != "decommissioning rack 7" {
|
||||
t.Errorf("reason query=%q want %q", sawReason, "decommissioning rack 7")
|
||||
}
|
||||
}
|
||||
|
||||
// TestClient_DeleteWithQuery_NoQuery covers the defensive path: a nil/empty
|
||||
// query must still produce a clean DELETE against the bare path with no stray
|
||||
// "?" suffix. Matches the Get() shape (see client.go do()) so downstream tools
|
||||
// can reuse one code path.
|
||||
func TestClient_DeleteWithQuery_NoQuery(t *testing.T) {
|
||||
var sawRawPath string
|
||||
|
||||
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
sawRawPath = r.URL.RequestURI()
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
w.WriteHeader(http.StatusOK)
|
||||
_ = json.NewEncoder(w).Encode(map[string]interface{}{"ok": true})
|
||||
}))
|
||||
defer server.Close()
|
||||
|
||||
c := NewClient(server.URL, "")
|
||||
if _, err := c.DeleteWithQuery("/api/v1/agents/ag-1", nil); err != nil {
|
||||
t.Fatalf("DeleteWithQuery(nil query) err=%v want nil", err)
|
||||
}
|
||||
// No query → no ? suffix.
|
||||
if strings.Contains(sawRawPath, "?") {
|
||||
t.Errorf("raw path=%q contains stray ?; empty query must not serialize", sawRawPath)
|
||||
}
|
||||
}
|
||||
|
||||
// TestClient_DeleteWithQuery_204ReturnsMinimalBody covers the idempotent path.
|
||||
// The handler returns 204 No Content for an already-retired agent; the
|
||||
// existing do() helper normalises this to {"status":"deleted"}. The new
|
||||
// DeleteWithQuery must share that behavior so MCP tool authors don't have to
|
||||
// special-case the return shape.
|
||||
func TestClient_DeleteWithQuery_204ReturnsMinimalBody(t *testing.T) {
|
||||
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
w.WriteHeader(http.StatusNoContent)
|
||||
}))
|
||||
defer server.Close()
|
||||
|
||||
c := NewClient(server.URL, "")
|
||||
data, err := c.DeleteWithQuery("/api/v1/agents/ag-1", nil)
|
||||
if err != nil {
|
||||
t.Fatalf("DeleteWithQuery(204) err=%v want nil (idempotent)", err)
|
||||
}
|
||||
if data == nil {
|
||||
t.Fatal("DeleteWithQuery(204) returned nil; want synthetic body")
|
||||
}
|
||||
if !strings.Contains(string(data), "deleted") && !strings.Contains(string(data), "status") {
|
||||
t.Errorf("DeleteWithQuery(204) body=%q; must surface a non-empty sentinel", string(data))
|
||||
}
|
||||
}
|
||||
|
||||
// TestClient_DeleteWithQuery_409PropagatesError covers the preflight-blocked
|
||||
// surface. A 409 with dependency counts must bubble up as a Go error so the
|
||||
// MCP tool can present it to the LLM operator rather than silently swallow
|
||||
// the rejection.
|
||||
func TestClient_DeleteWithQuery_409PropagatesError(t *testing.T) {
|
||||
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
w.WriteHeader(http.StatusConflict)
|
||||
_ = json.NewEncoder(w).Encode(map[string]interface{}{
|
||||
"error": "blocked_by_dependencies",
|
||||
"message": "agent has active targets",
|
||||
"counts": map[string]int{
|
||||
"active_targets": 3,
|
||||
"active_certificates": 7,
|
||||
"pending_jobs": 2,
|
||||
},
|
||||
})
|
||||
}))
|
||||
defer server.Close()
|
||||
|
||||
c := NewClient(server.URL, "")
|
||||
_, err := c.DeleteWithQuery("/api/v1/agents/ag-1", nil)
|
||||
if err == nil {
|
||||
t.Fatalf("DeleteWithQuery(409) err=nil; 409 must propagate as Go error")
|
||||
}
|
||||
if !strings.Contains(err.Error(), "409") {
|
||||
t.Errorf("err=%q should include HTTP status 409 for debuggability", err.Error())
|
||||
}
|
||||
}
|
||||
|
||||
// TestRetireAgentInput_ShapePinned is a compile-time assertion that the MCP
|
||||
// tool input struct for certctl_retire_agent exists with the required fields
|
||||
// and their expected tag shapes. The LLM discovers this input schema via
|
||||
// jsonschema tags — refactoring field names without updating callers silently
|
||||
// breaks tool discovery.
|
||||
//
|
||||
// Red until Phase 2b adds RetireAgentInput to internal/mcp/types.go. This
|
||||
// assertion deliberately exercises every field so the test fails at compile
|
||||
// time rather than runtime.
|
||||
func TestRetireAgentInput_ShapePinned(t *testing.T) {
|
||||
// Zero-value construction of the expected input — fails to compile until
|
||||
// the struct exists with fields {ID string, Force bool, Reason string}.
|
||||
input := RetireAgentInput{
|
||||
ID: "ag-1",
|
||||
Force: true,
|
||||
Reason: "decommissioning rack 7",
|
||||
}
|
||||
|
||||
if input.ID != "ag-1" {
|
||||
t.Errorf("RetireAgentInput.ID=%q want ag-1 (field binding broken)", input.ID)
|
||||
}
|
||||
if !input.Force {
|
||||
t.Errorf("RetireAgentInput.Force=false want true")
|
||||
}
|
||||
if input.Reason != "decommissioning rack 7" {
|
||||
t.Errorf("RetireAgentInput.Reason=%q want decommissioning rack 7", input.Reason)
|
||||
}
|
||||
|
||||
// Also pin the JSON surface — LLMs send and receive these field names,
|
||||
// so json tags must stay snake_case even through refactors.
|
||||
encoded, err := json.Marshal(input)
|
||||
if err != nil {
|
||||
t.Fatalf("marshal RetireAgentInput: %v", err)
|
||||
}
|
||||
body := string(encoded)
|
||||
for _, want := range []string{`"id":"ag-1"`, `"force":true`, `"reason":"decommissioning rack 7"`} {
|
||||
if !strings.Contains(body, want) {
|
||||
t.Errorf("RetireAgentInput JSON=%q missing %q (tag shape drifted)", body, want)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// TestListRetiredAgentsInput_ShapePinned mirrors the pagination input shape
|
||||
// used across the MCP toolset (see ListParams). The list-retired-agents tool
|
||||
// takes page + per_page with snake_case JSON tags. Compile-fail until
|
||||
// Phase 2b either adds ListRetiredAgentsInput or documents that list-retired
|
||||
// reuses the existing ListParams type (both paths are acceptable — the test
|
||||
// just pins whichever Phase 2b picks).
|
||||
func TestListRetiredAgentsInput_ShapePinned(t *testing.T) {
|
||||
// Phase 2b may either (a) add a dedicated ListRetiredAgentsInput struct
|
||||
// or (b) reuse the existing ListParams. Either is fine — we pin the
|
||||
// field-access contract rather than the struct name to let the
|
||||
// implementation choose. Compile-fail guards against the tool being
|
||||
// registered without any pagination input at all.
|
||||
var input ListParams
|
||||
input.Page = 1
|
||||
input.PerPage = 50
|
||||
if input.Page != 1 || input.PerPage != 50 {
|
||||
t.Errorf("ListParams fields Page/PerPage broken; listing pagination will misroute")
|
||||
}
|
||||
}
|
||||
@@ -506,6 +506,53 @@ func registerAgentTools(s *gomcp.Server, c *Client) {
|
||||
}
|
||||
return textResult(data)
|
||||
})
|
||||
|
||||
// I-004: soft-retirement. DELETE /api/v1/agents/{id} returns 200 on a
|
||||
// fresh retire (body echoes retired_at/already_retired/cascade/counts),
|
||||
// 204 on an idempotent retire of an already-retired agent (do() in
|
||||
// client.go normalizes that to {"status":"deleted"}), 409 when downstream
|
||||
// dependencies block the retire and force wasn't set, 403 on sentinel
|
||||
// agents, or 400 when force=true was sent without a reason. The tool
|
||||
// forwards the raw handler response so the LLM operator sees the
|
||||
// dependency counts and can decide whether to retry with force=true.
|
||||
gomcp.AddTool(s, &gomcp.Tool{
|
||||
Name: "certctl_retire_agent",
|
||||
Description: "Soft-retire an agent (DELETE /api/v1/agents/{id}). Sets retired_at + retired_reason on the row; the agent is filtered from the default listing and surfaces only via certctl_list_retired_agents. Default is a safety-gated soft-retire that returns 409 blocked_by_dependencies if the agent has active targets, active certificates, or pending jobs — the returned counts tell you what would be orphaned. Pass force=true to cascade through and retire those dependents too; force=true requires a non-empty reason (captured in the audit trail). Sentinel discovery agents (server-scanner, cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm) cannot be retired — the handler returns 403 unconditionally. Idempotent: retrying on an already-retired agent returns 204 without side effects.",
|
||||
}, func(ctx context.Context, req *gomcp.CallToolRequest, input RetireAgentInput) (*gomcp.CallToolResult, any, error) {
|
||||
// Client-side mirror of the handler's ErrForceReasonRequired contract
|
||||
// (see internal/api/handler/agents.go) so the LLM gets an immediate,
|
||||
// actionable error instead of a round-trip 400. Whitespace-only
|
||||
// reasons are treated as empty — matches handler's TrimSpace check.
|
||||
if input.Force && strings.TrimSpace(input.Reason) == "" {
|
||||
return errorResult(fmt.Errorf("reason is required when force=true"))
|
||||
}
|
||||
query := url.Values{}
|
||||
if input.Force {
|
||||
query.Set("force", "true")
|
||||
}
|
||||
if input.Reason != "" {
|
||||
query.Set("reason", input.Reason)
|
||||
}
|
||||
data, err := c.DeleteWithQuery("/api/v1/agents/"+input.ID, query)
|
||||
if err != nil {
|
||||
return errorResult(err)
|
||||
}
|
||||
return textResult(data)
|
||||
})
|
||||
|
||||
// I-004: retired agents are filtered out of GET /api/v1/agents by default.
|
||||
// The /agents/retired endpoint is the opt-in view — same pagination shape
|
||||
// as the default listing, but filters to rows where retired_at IS NOT NULL.
|
||||
gomcp.AddTool(s, &gomcp.Tool{
|
||||
Name: "certctl_list_retired_agents",
|
||||
Description: "List soft-retired agents (GET /api/v1/agents/retired). These are agents that have been retired via certctl_retire_agent; retired_at and retired_reason are populated. Returned separately from certctl_list_agents so the default listing stays focused on operational agents.",
|
||||
}, func(ctx context.Context, req *gomcp.CallToolRequest, input ListParams) (*gomcp.CallToolResult, any, error) {
|
||||
data, err := c.Get("/api/v1/agents/retired", paginationQuery(input.Page, input.PerPage))
|
||||
if err != nil {
|
||||
return errorResult(err)
|
||||
}
|
||||
return textResult(data)
|
||||
})
|
||||
}
|
||||
|
||||
// ── Jobs ────────────────────────────────────────────────────────────
|
||||
|
||||
@@ -152,6 +152,23 @@ type AgentJobStatusInput struct {
|
||||
Error string `json:"error,omitempty" jsonschema:"Error message if job failed"`
|
||||
}
|
||||
|
||||
// RetireAgentInput pins the MCP tool surface for certctl_retire_agent. I-004
|
||||
// introduces a soft-retirement flow that the handler exposes on DELETE
|
||||
// /api/v1/agents/{id} with two optional query flags: force=true cascades
|
||||
// through dependent active targets/certs/jobs, and reason is the human-readable
|
||||
// string captured in the audit trail. The handler enforces
|
||||
// ErrForceReasonRequired when force=true is sent without a reason; we surface
|
||||
// both as separate fields so the LLM can populate them independently and so
|
||||
// the retire_agent_test shape assertion stays aligned with the JSON-wire
|
||||
// contract. ID is always emitted (no omitempty) because a retire call without
|
||||
// a target agent is meaningless; Force and Reason are omitempty so the default
|
||||
// soft-retire path sends no query suffix at all.
|
||||
type RetireAgentInput struct {
	ID     string `json:"id" jsonschema:"Agent ID to soft-retire"`
	Force  bool   `json:"force,omitempty" jsonschema:"Cascade-retire downstream active targets, certs, and jobs (requires reason)"`
	Reason string `json:"reason,omitempty" jsonschema:"Human-readable reason (required when force=true)"`
}
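// To illustrate the omitempty choice, the sketch below marshals a local copy
// of the struct (jsonschema tags dropped for brevity) for the default and the
// forced case. encode is a hypothetical helper used only for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RetireAgentInput is a local copy carrying only the json tags.
type RetireAgentInput struct {
	ID     string `json:"id"`
	Force  bool   `json:"force,omitempty"`
	Reason string `json:"reason,omitempty"`
}

// encode returns the JSON form of in, or an empty string on error.
func encode(in RetireAgentInput) string {
	b, err := json.Marshal(in)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Default soft-retire: only the ID is emitted.
	fmt.Println(encode(RetireAgentInput{ID: "ag-1"}))
	// Forced cascade: all three fields appear.
	fmt.Println(encode(RetireAgentInput{ID: "ag-1", Force: true, Reason: "decommissioning rack 7"}))
}
```

// Because Force is a bool with omitempty, force=false and an absent force
// are indistinguishable on the wire, which suits an opt-in flag.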
|
||||
|
||||
// ── Jobs ────────────────────────────────────────────────────────────
|
||||
|
||||
type ListJobsInput struct {
|
||||
|
||||
@@ -93,9 +93,34 @@ type TargetRepository interface {
|
||||
|
||||
// AgentRepository defines operations for managing control plane agents.
|
||||
type AgentRepository interface {
|
||||
// List returns all agents.
|
||||
// List returns all ACTIVE agents — rows with retired_at IS NULL.
|
||||
//
|
||||
// I-004: The default listing MUST NOT surface retired agents. The
|
||||
// handler-facing ListAgents call, the stats dashboard, and the stale-offline
|
||||
// sweeper all iterate this list and would otherwise re-surface decommissioned
|
||||
// hardware in operational UI. Callers that genuinely want retired rows (the
|
||||
// audit tab, compliance exports) must use ListRetired instead.
|
||||
//
|
||||
// The partial index idx_agents_retired_at (migration 000015) only covers
// rows WHERE retired_at IS NOT NULL, so it serves ListRetired rather than
// this query; the retired_at IS NULL filter here is applied without it.
|
||||
List(ctx context.Context) ([]*domain.Agent, error)
|
||||
// ListRetired returns a paginated list of retired agents (retired_at IS NOT NULL),
|
||||
// ordered by retired_at DESC so the most recent retirements appear first. Used
|
||||
// by the GUI's Retired tab and the audit export path. Returns the slice plus
|
||||
// the total count (for pagination). A page<1 or perPage<1 is clamped to sensible
|
||||
// defaults (page=1, perPage=50) in the repo implementation rather than erroring —
|
||||
// this matches the ListAgents pagination behavior in the service layer.
|
||||
// I-004 coverage-gap closure, migration 000015.
|
||||
ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error)
|
||||
// Get retrieves an agent by ID.
|
||||
//
|
||||
// I-004 note: Get returns retired rows (retired_at IS NOT NULL) because
|
||||
// callers that need to check "has this agent been retired?" — the heartbeat
|
||||
// handler returning 410 Gone, the retirement service's idempotent-retire
|
||||
// branch, the detail page rendering a retirement banner — must see the
|
||||
// retired_at/retired_reason fields. Only the List path excludes retired
// rows by default; individual Get lookups surface them.
|
||||
Get(ctx context.Context, id string) (*domain.Agent, error)
|
||||
// Create stores a new agent. Callers that want duplicate-key errors surfaced
|
||||
// (e.g. real-agent registration) must use this method; sentinel/bootstrap
|
||||
@@ -112,11 +137,78 @@ type AgentRepository interface {
|
||||
// Update modifies an existing agent.
|
||||
Update(ctx context.Context, agent *domain.Agent) error
|
||||
// Delete removes an agent.
|
||||
//
|
||||
// I-004: callers should prefer SoftRetire / RetireAgentWithCascade for the
|
||||
// operator-facing retirement path; hard Delete remains available for test
|
||||
// cleanup and repository-level administrative tasks. The deployment_targets
|
||||
// FK flipped to ON DELETE RESTRICT in migration 000015, so hard-deleting an
|
||||
// agent that still owns active targets will now fail at the DB layer — which
|
||||
// is intentional: the fail-closed guardrail prevents audit-trail destruction.
|
||||
Delete(ctx context.Context, id string) error
|
||||
// UpdateHeartbeat updates the agent's last heartbeat timestamp and metadata.
|
||||
//
|
||||
// I-004: UpdateHeartbeat is a no-op on retired agents — the UPDATE clause
|
||||
// includes AND retired_at IS NULL so a stale agent process that keeps polling
|
||||
// after retirement cannot resurrect its heartbeat. The service layer already
|
||||
// short-circuits with ErrAgentRetired before calling this method; the WHERE
|
||||
// filter here is belt-and-braces for anyone who skips the service path.
|
||||
UpdateHeartbeat(ctx context.Context, id string, metadata *domain.AgentMetadata) error
|
||||
// GetByAPIKey retrieves an agent by hashed API key.
|
||||
//
|
||||
// I-004: GetByAPIKey returns retired rows so the auth middleware can detect
|
||||
// "this API key belongs to a retired agent" and fail the request with
|
||||
// 410 Gone. If retired rows were hidden, auth would return a plain 401 that
// gives no hint of retirement, which is wrong: the operator needs the retired
// state made explicit so they can clean up the agent process.
|
||||
GetByAPIKey(ctx context.Context, keyHash string) (*domain.Agent, error)
|
||||
// SoftRetire stamps retired_at + retired_reason on the agent row with no
|
||||
// cascade. Used on the happy path where preflight confirmed the agent has
|
||||
// zero active dependencies (no active deployment_targets, no pending jobs).
|
||||
// The UPDATE is scoped to WHERE id=$1 AND retired_at IS NULL so re-retiring
|
||||
// an already-retired row is a no-op (zero rows affected is NOT returned as
|
||||
// an error — the service layer detects this via its own idempotent-retire
|
||||
// branch before calling SoftRetire). Callers supply retiredAt so the service
|
||||
// can pin a single consistent timestamp across audit + DB writes.
|
||||
// I-004 coverage-gap closure.
|
||||
SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error
|
||||
// RetireAgentWithCascade performs a transactional retire + cascade. In one
|
||||
// transaction it: (1) stamps retired_at + retired_reason on the agent row,
|
||||
// and (2) stamps the SAME retired_at + retired_reason on every active
|
||||
// deployment_targets row whose agent_id matches. Only rows with
|
||||
// retired_at IS NULL are touched in (2) — already-retired targets keep their
|
||||
// original retirement metadata (whoever retired them first, whenever). Used
|
||||
// exclusively on the force=true path from the retirement handler; callers
|
||||
// supply retiredAt so the agent row and every cascaded target row share an
|
||||
// exact retirement instant (helps forensic analysis trace the cascade back
|
||||
// to a single operator action). If the agent row is already retired, the
|
||||
// whole operation is a no-op — the transaction commits without touching
|
||||
// either table. I-004 coverage-gap closure, migration 000015.
|
||||
RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error
|
||||
// CountActiveTargets returns the number of deployment_targets rows where
|
||||
// agent_id=id AND retired_at IS NULL. The COUNT query hits the existing
// idx_deployment_targets_agent_id index (migration 000001 line 111); the
// additional retired_at IS NULL predicate is applied as a filter on those
// rows, since the partial idx_deployment_targets_retired_at index
// (migration 000015) only covers retired rows. Preflight uses this to
// decide 200 (soft-retire) vs 409 (blocked-by-deps). I-004.
|
||||
CountActiveTargets(ctx context.Context, agentID string) (int, error)
|
||||
// CountActiveCertificates returns the count of managed_certificates currently
|
||||
// deployed through one of this agent's ACTIVE (non-retired) deployment_targets.
|
||||
// The query joins certificate_target_mappings (migration 000001 line 116) →
|
||||
// deployment_targets filtering on deployment_targets.agent_id=$1 AND
|
||||
// deployment_targets.retired_at IS NULL, then COUNT(DISTINCT certificate_id)
|
||||
// so the same cert deployed to multiple targets on one agent counts once.
|
||||
// The primary key (certificate_id, target_id) on certificate_target_mappings
|
||||
// plus idx_certificate_target_mappings_target_id (line 122) cover the join.
|
||||
// Used purely for the preflight 409 body — the number is informational. I-004.
|
||||
CountActiveCertificates(ctx context.Context, agentID string) (int, error)
|
||||
// CountPendingJobs returns the number of jobs belonging to this agent whose
|
||||
// status is in (Pending, AwaitingCSR, AwaitingApproval, Running) — the four
|
||||
// statuses that indicate work the agent would still be expected to pick up.
|
||||
// Completed/Failed/Cancelled jobs do not count. The filter agent_id=$1 hits
|
||||
// the idx_jobs_agent_id index (migration 000001 line 161). Used for the
|
||||
// preflight 409 body. I-004.
|
||||
CountPendingJobs(ctx context.Context, agentID string) (int, error)
|
||||
}
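// The RetireAgentWithCascade contract above can be illustrated with an
// in-memory sketch. This is not the Postgres implementation: memStore, record
// and the returned cascade count are hypothetical, and a map stands in for
// the single transaction. It only demonstrates the three documented rules:
// shared retirement instant, untouched already-retired targets, and a no-op
// on an already-retired agent.

```go
package main

import (
	"fmt"
	"time"
)

// record is a simplified stand-in for an agents or deployment_targets row.
type record struct {
	AgentID       string // owning agent; unused on agent rows
	RetiredAt     *time.Time
	RetiredReason *string
}

type memStore struct {
	agents  map[string]*record
	targets map[string]*record
}

// retireWithCascade stamps the agent and each of its active targets with the
// same instant and reason, returning how many targets were cascaded.
func (s *memStore) retireWithCascade(id string, at time.Time, reason string) (int, error) {
	agent, ok := s.agents[id]
	if !ok {
		return 0, fmt.Errorf("agent %q not found", id)
	}
	if agent.RetiredAt != nil {
		return 0, nil // already retired: touch neither table
	}
	agent.RetiredAt, agent.RetiredReason = &at, &reason
	cascaded := 0
	for _, t := range s.targets {
		if t.AgentID != id || t.RetiredAt != nil {
			continue // other agent's target, or retired earlier
		}
		t.RetiredAt, t.RetiredReason = &at, &reason
		cascaded++
	}
	return cascaded, nil
}

// demo runs one cascade plus a repeat call and reports what was observed.
func demo() (cascaded int, sharedInstant, keptOriginal, secondIsNoop bool) {
	earlier := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	oldReason := "replaced"
	s := &memStore{
		agents: map[string]*record{"ag-1": {}},
		targets: map[string]*record{
			"t-1": {AgentID: "ag-1"},
			"t-2": {AgentID: "ag-1", RetiredAt: &earlier, RetiredReason: &oldReason},
			"t-3": {AgentID: "ag-2"},
		},
	}
	at := time.Date(2026, 4, 18, 12, 0, 0, 0, time.UTC)
	cascaded, _ = s.retireWithCascade("ag-1", at, "decommissioning rack 7")
	sharedInstant = s.targets["t-1"].RetiredAt.Equal(*s.agents["ag-1"].RetiredAt)
	keptOriginal = *s.targets["t-2"].RetiredReason == "replaced" && s.targets["t-3"].RetiredAt == nil
	again, _ := s.retireWithCascade("ag-1", at.Add(time.Hour), "retry")
	secondIsNoop = again == 0 && s.agents["ag-1"].RetiredAt.Equal(at)
	return
}

func main() {
	fmt.Println(demo())
}
```

// A database-backed version would run both UPDATE statements inside one
// transaction so the agent and its targets can never disagree.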
|
||||
|
||||
// JobRepository defines operations for managing renewal and deployment jobs.
|
||||
|
||||
@@ -20,12 +20,18 @@ func NewAgentRepository(db *sql.DB) *AgentRepository {
|
||||
 	return &AgentRepository{db: db}
 }
 
-// List returns all agents
+// List returns all ACTIVE agents — rows with retired_at IS NULL. I-004:
+// the default listing path feeds the handler-facing ListAgents call, the
+// stats dashboard, and the stale-offline sweeper; every caller wants active
+// hardware, not decommissioned rows. Operators who need retired rows reach
+// for ListRetired instead. The partial index idx_agents_retired_at (migration
+// 000015) covers only retired rows, so it serves ListRetired, not this query.
 func (r *AgentRepository) List(ctx context.Context) ([]*domain.Agent, error) {
 	rows, err := r.db.QueryContext(ctx, `
 		SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash,
-			os, architecture, ip_address, version
+			os, architecture, ip_address, version, retired_at, retired_reason
 		FROM agents
+		WHERE retired_at IS NULL
 		ORDER BY registered_at DESC
 	`)
 
@@ -50,11 +56,16 @@ func (r *AgentRepository) List(ctx context.Context) ([]*domain.Agent, error) {
|
||||
 	return agents, nil
 }
 
-// Get retrieves an agent by ID
+// Get retrieves an agent by ID. I-004: retired rows ARE surfaced here —
+// callers that need to check "has this agent been retired?" (heartbeat
+// handler returning 410 Gone, retirement service's idempotent-retire branch,
+// detail page rendering a retirement banner) must see retired_at /
+// retired_reason. Only the List path default-excludes retired rows; Get is
+// by-ID and returns whatever row exists.
 func (r *AgentRepository) Get(ctx context.Context, id string) (*domain.Agent, error) {
 	row := r.db.QueryRowContext(ctx, `
 		SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash,
-			os, architecture, ip_address, version
+			os, architecture, ip_address, version, retired_at, retired_reason
 		FROM agents
 		WHERE id = $1
 	`, id)
@@ -185,7 +196,16 @@ func (r *AgentRepository) Delete(ctx context.Context, id string) error {
|
||||
 	return nil
 }
 
-// UpdateHeartbeat updates the agent's last heartbeat timestamp and metadata
+// UpdateHeartbeat updates the agent's last heartbeat timestamp and metadata.
+//
+// I-004: both branches include `AND retired_at IS NULL` in the WHERE clause,
+// making the UPDATE a no-op on retired rows. The service layer already
+// short-circuits with ErrAgentRetired before calling this method (see
+// AgentService.Heartbeat), but the WHERE filter is belt-and-braces for any
+// path that skips the service — a stale agent process that keeps polling
+// after retirement cannot resurrect its heartbeat at the DB layer. A zero
+// RowsAffected here returns the same "agent not found" error as before; the
+// service layer distinguishes retired from missing by calling Get first.
 func (r *AgentRepository) UpdateHeartbeat(ctx context.Context, id string, metadata *domain.AgentMetadata) error {
 	var result sql.Result
 	var err error
@@ -199,11 +219,11 @@ func (r *AgentRepository) UpdateHeartbeat(ctx context.Context, id string, metada
|
||||
 			architecture = CASE WHEN $5 = '' THEN architecture ELSE $5 END,
 			ip_address = CASE WHEN $6 = '' THEN ip_address ELSE $6 END,
 			version = CASE WHEN $7 = '' THEN version ELSE $7 END
-		WHERE id = $2
+		WHERE id = $2 AND retired_at IS NULL
 		`, time.Now(), id, metadata.Hostname, metadata.OS, metadata.Architecture, metadata.IPAddress, metadata.Version)
 	} else {
 		result, err = r.db.ExecContext(ctx, `
-			UPDATE agents SET last_heartbeat_at = $1 WHERE id = $2
+			UPDATE agents SET last_heartbeat_at = $1 WHERE id = $2 AND retired_at IS NULL
 		`, time.Now(), id)
 	}
 
@@ -223,11 +243,15 @@ func (r *AgentRepository) UpdateHeartbeat(ctx context.Context, id string, metada
|
||||
return nil
|
||||
}
|
||||
|
||||
// GetByAPIKey retrieves an agent by hashed API key
|
||||
// GetByAPIKey retrieves an agent by hashed API key. I-004: retired rows ARE
|
||||
// surfaced here so the auth middleware can detect "this API key belongs to a
|
||||
// retired agent" and fail the request with 410 Gone instead of 401. If the
|
||||
// filter hid retired rows, auth would return a plain 401 and give no signal
|
||||
// that the agent process needs cleaning up.
|
||||
func (r *AgentRepository) GetByAPIKey(ctx context.Context, keyHash string) (*domain.Agent, error) {
|
||||
row := r.db.QueryRowContext(ctx, `
|
||||
SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash,
|
||||
os, architecture, ip_address, version
|
||||
os, architecture, ip_address, version, retired_at, retired_reason
|
||||
FROM agents
|
||||
WHERE api_key_hash = $1
|
||||
`, keyHash)
|
||||
@@ -243,14 +267,214 @@ func (r *AgentRepository) GetByAPIKey(ctx context.Context, keyHash string) (*dom
|
||||
return agent, nil
|
||||
}
|
||||
|
||||
// scanAgent scans an agent from a row or rows
|
||||
// ─── I-004 agent retirement surface ──────────────────────────────────────
|
||||
//
|
||||
// The methods below implement the I-004 coverage-gap closure. They follow the
|
||||
// interface contracts in internal/repository/interfaces.go:94-210 (which is the
|
||||
// spec — keep godoc there in sync if behavior changes).
|
||||
|
||||
// ListRetired returns a paginated slice of retired agents ordered by
|
||||
// retired_at DESC so the most recent retirements appear first. Used by the
|
||||
// GUI's Retired tab and the audit export path. Returns the rows plus the
|
||||
// total count (for pagination UI). page<1 or perPage<1 is clamped to
|
||||
// sensible defaults in-repo rather than erroring, matching the ListAgents
|
||||
// pagination behavior at the service layer. I-004, migration 000015.
|
||||
func (r *AgentRepository) ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error) {
|
||||
// Clamp pagination to safe defaults. Keep in lockstep with the service
|
||||
// layer's pagination shape — negative / zero values on either axis should
|
||||
// degrade to "first page, default size" instead of returning an error.
|
||||
if page < 1 {
|
||||
page = 1
|
||||
}
|
||||
if perPage < 1 {
|
||||
perPage = 50
|
||||
}
|
||||
offset := (page - 1) * perPage
|
||||
|
||||
// Total count first — separate query so pagination math stays correct
|
||||
// even when the page of rows is empty. Uses the partial
|
||||
// idx_agents_retired_at index, so this is effectively a count over that
// index's entries, not a full table scan.
|
||||
var total int
|
||||
if err := r.db.QueryRowContext(ctx, `
|
||||
SELECT COUNT(*) FROM agents WHERE retired_at IS NOT NULL
|
||||
`).Scan(&total); err != nil {
|
||||
return nil, 0, fmt.Errorf("failed to count retired agents: %w", err)
|
||||
}
|
||||
|
||||
rows, err := r.db.QueryContext(ctx, `
|
||||
SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash,
|
||||
os, architecture, ip_address, version, retired_at, retired_reason
|
||||
FROM agents
|
||||
WHERE retired_at IS NOT NULL
|
||||
ORDER BY retired_at DESC
|
||||
LIMIT $1 OFFSET $2
|
||||
`, perPage, offset)
|
||||
if err != nil {
|
||||
return nil, 0, fmt.Errorf("failed to query retired agents: %w", err)
|
||||
}
|
||||
defer rows.Close()
|
||||
|
||||
var agents []*domain.Agent
|
||||
for rows.Next() {
|
||||
agent, err := scanAgent(rows)
|
||||
if err != nil {
|
||||
return nil, 0, err
|
||||
}
|
||||
agents = append(agents, agent)
|
||||
}
|
||||
if err := rows.Err(); err != nil {
|
||||
return nil, 0, fmt.Errorf("error iterating retired agent rows: %w", err)
|
||||
}
|
||||
return agents, total, nil
|
||||
}
|
||||
|
||||
// SoftRetire stamps retired_at + retired_reason on the agent row with no
|
||||
// cascade. Scoped to `WHERE id=$1 AND retired_at IS NULL` so re-retiring an
|
||||
// already-retired row is a silent no-op (zero RowsAffected). The service
|
||||
// layer has its own idempotent-retire branch that detects already-retired
|
||||
// rows via Get before calling SoftRetire; a zero here just means a racy
|
||||
// caller got there first. I-004.
|
||||
func (r *AgentRepository) SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error {
|
||||
if _, err := r.db.ExecContext(ctx, `
|
||||
UPDATE agents
|
||||
SET retired_at = $2, retired_reason = $3
|
||||
WHERE id = $1 AND retired_at IS NULL
|
||||
`, id, retiredAt, reason); err != nil {
|
||||
return fmt.Errorf("failed to soft-retire agent: %w", err)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// RetireAgentWithCascade performs a transactional retire-and-cascade. In one
|
||||
// transaction it (1) stamps retired_at + retired_reason on the agent row if
|
||||
// it is still active, and (2) stamps the SAME retired_at + retired_reason on
|
||||
// every active (retired_at IS NULL) deployment_targets row whose agent_id
|
||||
// matches. Already-retired targets keep their original retirement metadata;
|
||||
// only active targets are touched. If the agent is already retired, the
|
||||
// whole transaction is a no-op — the caller's idempotent-retire branch
|
||||
// already handled it before we got here. I-004, migration 000015.
|
||||
//
|
||||
// The two UPDATEs share a single (retiredAt, reason) pair so forensic
|
||||
// analysis can trace "every row stamped at T1 with reason R was part of the
|
||||
// same operator action" back to one cascade. Using BeginTx keeps the agent
|
||||
// row and its targets' retirement metadata consistent even if something
|
||||
// crashes mid-cascade.
|
||||
func (r *AgentRepository) RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error {
|
||||
tx, err := r.db.BeginTx(ctx, nil)
|
||||
if err != nil {
|
||||
return fmt.Errorf("failed to begin retire-cascade transaction: %w", err)
|
||||
}
|
||||
// Rollback is a no-op if Commit has already run — safe to always defer.
|
||||
defer func() { _ = tx.Rollback() }()
|
||||
|
||||
// Agent row: flip to retired only if it was still active. If zero rows
|
||||
// match, the agent was already retired — the whole cascade becomes a
|
||||
// no-op (we deliberately do NOT stamp the targets against a retirement
|
||||
// we didn't perform).
|
||||
if _, err := tx.ExecContext(ctx, `
|
||||
UPDATE agents
|
||||
SET retired_at = $2, retired_reason = $3
|
||||
WHERE id = $1 AND retired_at IS NULL
|
||||
`, id, retiredAt, reason); err != nil {
|
||||
return fmt.Errorf("failed to retire agent in cascade: %w", err)
|
||||
}
|
||||
|
||||
// Cascade: copy the same retired_at / retired_reason onto every active
|
||||
// deployment_target belonging to this agent. Skips targets that are
|
||||
// already retired so their original retirement metadata is preserved.
|
||||
if _, err := tx.ExecContext(ctx, `
|
||||
UPDATE deployment_targets
|
||||
SET retired_at = $2, retired_reason = $3
|
||||
WHERE agent_id = $1 AND retired_at IS NULL
|
||||
`, id, retiredAt, reason); err != nil {
|
||||
return fmt.Errorf("failed to cascade-retire deployment targets: %w", err)
|
||||
}
|
||||
|
||||
if err := tx.Commit(); err != nil {
|
||||
return fmt.Errorf("failed to commit retire-cascade transaction: %w", err)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// CountActiveTargets returns the number of deployment_targets with
|
||||
// agent_id=agentID AND retired_at IS NULL. Used by the retirement preflight
|
||||
// to decide 200 (soft-retire) vs 409 (blocked-by-deps). Hits the existing
|
||||
// idx_deployment_targets_agent_id index (migration 000001 line 111); the
|
||||
// retired_at IS NULL predicate is then applied as a filter on those rows. The
// partial idx_deployment_targets_retired_at index (migration 000015) covers
// only retired rows, so it does not serve this predicate. I-004.
|
||||
func (r *AgentRepository) CountActiveTargets(ctx context.Context, agentID string) (int, error) {
|
||||
var count int
|
||||
err := r.db.QueryRowContext(ctx, `
|
||||
SELECT COUNT(*)
|
||||
FROM deployment_targets
|
||||
WHERE agent_id = $1 AND retired_at IS NULL
|
||||
`, agentID).Scan(&count)
|
||||
if err != nil {
|
||||
return 0, fmt.Errorf("failed to count active targets for agent: %w", err)
|
||||
}
|
||||
return count, nil
|
||||
}
|
||||
|
||||
// CountActiveCertificates returns the count of distinct managed_certificates
|
||||
// currently deployed through one of this agent's ACTIVE deployment_targets.
|
||||
// Joins certificate_target_mappings (migration 000001 line 116) →
|
||||
// deployment_targets filtering on deployment_targets.agent_id=$1 AND
|
||||
// deployment_targets.retired_at IS NULL. COUNT(DISTINCT certificate_id) so
|
||||
// the same cert deployed to multiple targets on one agent counts once.
|
||||
// Used purely for the preflight 409 body. I-004.
|
||||
func (r *AgentRepository) CountActiveCertificates(ctx context.Context, agentID string) (int, error) {
|
||||
var count int
|
||||
err := r.db.QueryRowContext(ctx, `
|
||||
SELECT COUNT(DISTINCT ctm.certificate_id)
|
||||
FROM certificate_target_mappings ctm
|
||||
JOIN deployment_targets dt ON dt.id = ctm.target_id
|
||||
WHERE dt.agent_id = $1 AND dt.retired_at IS NULL
|
||||
`, agentID).Scan(&count)
|
||||
if err != nil {
|
||||
return 0, fmt.Errorf("failed to count active certificates for agent: %w", err)
|
||||
}
|
||||
return count, nil
|
||||
}
|
||||
|
||||
// CountPendingJobs returns the number of jobs belonging to this agent whose
|
||||
// status is in (Pending, AwaitingCSR, AwaitingApproval, Running) — the four
|
||||
// statuses that represent work the agent would still be expected to pick up
|
||||
// or complete. Completed / Failed / Cancelled jobs do not count toward the
|
||||
// preflight gate. Status strings match domain.JobStatus* constants in
|
||||
// internal/domain/job.go:43-49. Hits idx_jobs_agent_id (migration 000001
|
||||
// line 161). I-004.
|
||||
func (r *AgentRepository) CountPendingJobs(ctx context.Context, agentID string) (int, error) {
|
||||
var count int
|
||||
err := r.db.QueryRowContext(ctx, `
|
||||
SELECT COUNT(*)
|
||||
FROM jobs
|
||||
WHERE agent_id = $1
|
||||
AND status IN ('Pending', 'AwaitingCSR', 'AwaitingApproval', 'Running')
|
||||
`, agentID).Scan(&count)
|
||||
if err != nil {
|
||||
return 0, fmt.Errorf("failed to count pending jobs for agent: %w", err)
|
||||
}
|
||||
return count, nil
|
||||
}
|
||||
|
||||
// scanAgent scans an agent from a row or rows.
|
||||
//
|
||||
// I-004: the column list here is the authoritative 13-field post-M15 order —
|
||||
// retired_at and retired_reason are appended at the tail as nullable
|
||||
// *time.Time / *string scan targets matching the `json:"...,omitempty"` domain
|
||||
// fields. Every SELECT in this file that feeds scanAgent must emit columns in
|
||||
// this same order, otherwise Scan will silently place values into the wrong
|
||||
// fields (database/sql's Scan binds by position, not by column name).
|
||||
func scanAgent(scanner interface {
|
||||
Scan(...interface{}) error
|
||||
}) (*domain.Agent, error) {
|
||||
var agent domain.Agent
|
||||
err := scanner.Scan(&agent.ID, &agent.Name, &agent.Hostname, &agent.Status,
|
||||
&agent.LastHeartbeatAt, &agent.RegisteredAt, &agent.APIKeyHash,
|
||||
&agent.OS, &agent.Architecture, &agent.IPAddress, &agent.Version)
|
||||
&agent.OS, &agent.Architecture, &agent.IPAddress, &agent.Version,
|
||||
&agent.RetiredAt, &agent.RetiredReason)
|
||||
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("failed to scan agent: %w", err)
|
||||
|
||||
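The pagination clamp that ListRetired applies before building its LIMIT/OFFSET query can be looked at in isolation. The following is a minimal sketch, not code from the repository: `clampPage` and `defaultPerPage` are hypothetical names, and the fallback of 50 simply mirrors the value visible in the diff above.

```go
package main

import "fmt"

// defaultPerPage mirrors the fallback page size (50) that ListRetired uses
// in the diff. The constant name itself is an assumption.
const defaultPerPage = 50

// clampPage is a hypothetical helper. It converts a 1-based page number and
// a page size into LIMIT/OFFSET values, degrading invalid input to
// "first page, default size" instead of returning an error.
func clampPage(page, perPage int) (limit, offset int) {
	if page < 1 {
		page = 1
	}
	if perPage < 1 {
		perPage = defaultPerPage
	}
	return perPage, (page - 1) * perPage
}

func main() {
	// Each pair is (page, perPage) as a caller might pass it in.
	for _, in := range [][2]int{{0, 0}, {3, 20}, {-5, 10}} {
		limit, offset := clampPage(in[0], in[1])
		fmt.Printf("page=%d perPage=%d -> LIMIT %d OFFSET %d\n", in[0], in[1], limit, offset)
	}
}
```

Clamping rather than erroring keeps the repository tolerant of callers that pass zero values, at the cost of hiding a malformed request; whether that trade-off is right depends on how strictly the handler validates its query parameters.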
@@ -0,0 +1,220 @@
|
||||
package postgres_test
|
||||
|
||||
import (
|
||||
"context"
|
||||
"database/sql"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"strings"
|
||||
"testing"
|
||||
)
|
||||
|
||||
// TestMigration000015_AgentRetireRoundTrip is the Phase 2a Red regression test
|
||||
// for I-004 ("Agent hard-delete cascades through deployment_targets + jobs").
|
||||
//
|
||||
// The fix depends on a new migration, 000015_agent_retire.up.sql + .down.sql,
|
||||
// which must:
|
||||
//
|
||||
// 1. Add nullable `retired_at TIMESTAMPTZ` and `retired_reason TEXT`
|
||||
// columns to the `agents` table. These mirror the revoked_at /
|
||||
// revocation_reason pair on managed_certificates (migration 000005).
|
||||
//
|
||||
// 2. Add nullable `retired_at TIMESTAMPTZ` and `retired_reason TEXT` columns
|
||||
// to `deployment_targets`. When an agent is retired with cascade=true,
|
||||
// its deployment_targets must be soft-retired (not deleted) so audit
|
||||
// history — who deployed what to where, when — stays intact.
|
||||
//
|
||||
// 3. FLIP the foreign key on `deployment_targets.agent_id → agents.id`
|
||||
// from `ON DELETE CASCADE` (migration 000001, line 104) to
|
||||
// `ON DELETE RESTRICT`. This is the fail-closed change that makes a
|
||||
// bare `DELETE FROM agents WHERE id = $1` blow up at the DB layer
|
||||
// instead of silently vaporising every deployment_target row. Today
|
||||
// the CASCADE means the audit trail gets shredded with zero warning.
|
||||
//
|
||||
// The round-trip also validates that the down migration cleanly reverses all
|
||||
// three changes, so an operator who lands on a rollback can still boot the
|
||||
// server. Red-until-Green: this test compiles but fails until
|
||||
// migrations/000015_agent_retire.up.sql + .down.sql exist with the right
|
||||
// schema, because `freshSchema(t)` runs every `.up.sql` in lexical order —
|
||||
// the new migration runs automatically once Phase 2b creates the files.
|
||||
func TestMigration000015_AgentRetireRoundTrip(t *testing.T) {
|
||||
tdb := getTestDB(t)
|
||||
db := tdb.freshSchema(t)
|
||||
ctx := context.Background()
|
||||
|
||||
// ─── Stage 1: Post-up assertions ─────────────────────────────────────
|
||||
//
|
||||
// After all .up.sql migrations (including the new 000015) have run, the
|
||||
// new columns and the flipped FK must be observable in the catalog.
|
||||
|
||||
assertColumnExists(t, db, "agents", "retired_at")
|
||||
assertColumnExists(t, db, "agents", "retired_reason")
|
||||
assertColumnExists(t, db, "deployment_targets", "retired_at")
|
||||
assertColumnExists(t, db, "deployment_targets", "retired_reason")
|
||||
|
||||
// The FK on deployment_targets.agent_id must be RESTRICT (confdeltype='r'),
|
||||
// not CASCADE (confdeltype='c'). This is the core fail-closed guarantee
|
||||
// that fixes I-004 at the storage layer.
|
||||
assertFKDeleteRule(t, db, "deployment_targets", "agent_id", "r")
|
||||
|
||||
// The FK on jobs.agent_id is already SET NULL (confdeltype='n') per
|
||||
// migration 000001 line 146 — pin that it stays that way (or goes to
|
||||
// RESTRICT; either preserves audit history, both fail on 'c').
|
||||
assertFKDeleteRuleNot(t, db, "jobs", "agent_id", "c")
|
||||
|
||||
// ─── Stage 2: Run the 000015 down migration manually ─────────────────
|
||||
//
|
||||
// testutil_test.go's runMigrations helper only runs *.up.sql. To exercise
|
||||
// the down migration I read and execute it by hand, then re-check the
|
||||
// catalog.
|
||||
|
||||
downSQL := readMigrationFile(t, "000015_agent_retire.down.sql")
|
||||
if _, err := db.ExecContext(ctx, downSQL); err != nil {
|
||||
t.Fatalf("000015 down migration failed: %v", err)
|
||||
}
|
||||
|
||||
// Stage 3: Post-down assertions — columns gone, FK restored to CASCADE.
|
||||
assertColumnGone(t, db, "agents", "retired_at")
|
||||
assertColumnGone(t, db, "agents", "retired_reason")
|
||||
assertColumnGone(t, db, "deployment_targets", "retired_at")
|
||||
assertColumnGone(t, db, "deployment_targets", "retired_reason")
|
||||
assertFKDeleteRule(t, db, "deployment_targets", "agent_id", "c")
|
||||
|
||||
// ─── Stage 4: Re-run the up migration for idempotency ────────────────
|
||||
//
|
||||
// The up migration must be safely re-runnable — operators sometimes
|
||||
// re-apply by hand after a partial rollback. The migration must therefore
// guard each statement (IF NOT EXISTS, DROP ... IF EXISTS before ADD).
|
||||
|
||||
upSQL := readMigrationFile(t, "000015_agent_retire.up.sql")
|
||||
if _, err := db.ExecContext(ctx, upSQL); err != nil {
|
||||
t.Fatalf("000015 up migration re-apply failed (must be idempotent): %v", err)
|
||||
}
|
||||
|
||||
assertColumnExists(t, db, "agents", "retired_at")
|
||||
assertColumnExists(t, db, "agents", "retired_reason")
|
||||
assertColumnExists(t, db, "deployment_targets", "retired_at")
|
||||
assertColumnExists(t, db, "deployment_targets", "retired_reason")
|
||||
assertFKDeleteRule(t, db, "deployment_targets", "agent_id", "r")
|
||||
}
|
||||
|
||||
// ─── Catalog helpers ──────────────────────────────────────────────────────
|
||||
//
|
||||
// These helpers scope every catalog query to the schema the test is actually
|
||||
// running in by filtering on current_schema(). Without that, a test
|
||||
// running in schema test_xyz would accidentally inspect the public schema
|
||||
// and green-light drift.
|
||||
|
||||
func assertColumnExists(t *testing.T, db *sql.DB, table, column string) {
|
||||
t.Helper()
|
||||
var exists bool
|
||||
err := db.QueryRowContext(context.Background(), `
|
||||
SELECT EXISTS (
|
||||
SELECT 1 FROM information_schema.columns
|
||||
WHERE table_schema = current_schema()
|
||||
AND table_name = $1
|
||||
AND column_name = $2
|
||||
)`, table, column).Scan(&exists)
|
||||
if err != nil {
|
||||
t.Fatalf("column existence query failed for %s.%s: %v", table, column, err)
|
||||
}
|
||||
if !exists {
|
||||
t.Errorf("expected column %s.%s to exist after 000015 up (migration missing or drifted)", table, column)
|
||||
}
|
||||
}
|
||||
|
||||
func assertColumnGone(t *testing.T, db *sql.DB, table, column string) {
|
||||
t.Helper()
|
||||
var exists bool
|
||||
err := db.QueryRowContext(context.Background(), `
|
||||
SELECT EXISTS (
|
||||
SELECT 1 FROM information_schema.columns
|
||||
WHERE table_schema = current_schema()
|
||||
AND table_name = $1
|
||||
AND column_name = $2
|
||||
)`, table, column).Scan(&exists)
|
||||
if err != nil {
|
||||
t.Fatalf("column existence query failed for %s.%s: %v", table, column, err)
|
||||
}
|
||||
if exists {
|
||||
t.Errorf("expected column %s.%s to be removed after 000015 down (down migration is incomplete)", table, column)
|
||||
}
|
||||
}
|
||||
|
||||
// assertFKDeleteRule asserts that the foreign key covering `table.column`
|
||||
// (i.e. the FK whose constrained column matches) has the expected
|
||||
// `confdeltype`. Per pg_constraint docs: 'r' = RESTRICT, 'c' = CASCADE,
|
||||
// 'n' = SET NULL, 'd' = SET DEFAULT, 'a' = NO ACTION.
|
||||
func assertFKDeleteRule(t *testing.T, db *sql.DB, table, column, want string) {
|
||||
t.Helper()
|
||||
got := lookupFKDeleteRule(t, db, table, column)
|
||||
if got != want {
|
||||
t.Errorf("FK on %s(%s): confdeltype=%q want %q (RESTRICT='r', CASCADE='c', SET NULL='n')",
|
||||
table, column, got, want)
|
||||
}
|
||||
}
|
||||
|
||||
// assertFKDeleteRuleNot is the negative form — used for jobs.agent_id where
|
||||
// multiple confdeltype values are acceptable (SET NULL and RESTRICT both
|
||||
// preserve audit history) but CASCADE is strictly forbidden.
|
||||
func assertFKDeleteRuleNot(t *testing.T, db *sql.DB, table, column, disallowed string) {
|
||||
t.Helper()
|
||||
got := lookupFKDeleteRule(t, db, table, column)
|
||||
if got == disallowed {
|
||||
t.Errorf("FK on %s(%s): confdeltype=%q; %q is forbidden (would destroy audit history on agent delete)",
|
||||
table, column, got, disallowed)
|
||||
}
|
||||
}
|
||||
|
||||
// lookupFKDeleteRule returns the confdeltype for the FK constraint whose
|
||||
// constrained table+column matches. If no FK is found the helper fails the
// test via t.Fatalf, because the schema is supposed to have these FKs per
// migration 000001.
|
||||
func lookupFKDeleteRule(t *testing.T, db *sql.DB, table, column string) string {
|
||||
t.Helper()
|
||||
|
||||
// Join pg_constraint → pg_class (constrained rel) → pg_attribute
|
||||
// (constrained col) → pg_namespace (schema filter). Scoped to
|
||||
// current_schema() so schema-per-test isolation holds.
|
||||
const q = `
|
||||
SELECT c.confdeltype
|
||||
FROM pg_constraint c
|
||||
JOIN pg_class cl ON cl.oid = c.conrelid
|
||||
JOIN pg_namespace n ON n.oid = cl.relnamespace
|
||||
JOIN pg_attribute a ON a.attrelid = c.conrelid AND a.attnum = ANY(c.conkey)
|
||||
WHERE n.nspname = current_schema()
|
||||
AND c.contype = 'f'
|
||||
AND cl.relname = $1
|
||||
AND a.attname = $2
|
||||
LIMIT 1
|
||||
`
|
||||
var confdeltype string
|
||||
err := db.QueryRowContext(context.Background(), q, table, column).Scan(&confdeltype)
|
||||
if err == sql.ErrNoRows {
|
||||
t.Fatalf("no FK found on %s(%s) in current_schema (schema not migrated?)", table, column)
|
||||
return ""
|
||||
}
|
||||
if err != nil {
|
||||
t.Fatalf("FK lookup for %s(%s) failed: %v", table, column, err)
|
||||
return ""
|
||||
}
|
||||
return confdeltype
|
||||
}
|
||||
|
||||
// readMigrationFile locates and loads a named migration file. Uses the same
|
||||
// walk-up strategy as findMigrationsDir() in testutil_test.go so both helpers
|
||||
// agree on where the migrations live.
|
||||
func readMigrationFile(t *testing.T, name string) string {
|
||||
t.Helper()
|
||||
path := filepath.Join(findMigrationsDir(), name)
|
||||
data, err := os.ReadFile(path)
|
||||
if err != nil {
|
||||
t.Fatalf("failed to read migration file %s (expected at %s): %v", name, path, err)
|
||||
}
|
||||
// Defensive: a zero-byte down migration would produce false-positive
|
||||
// "success" below. Refuse to trust it.
|
||||
if strings.TrimSpace(string(data)) == "" {
|
||||
t.Fatalf("migration file %s is empty — down migration missing or truncated", name)
|
||||
}
|
||||
return string(data)
|
||||
}
|
||||
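The single-character confdeltype codes that the catalog helpers above compare against are easy to misread in a failing test log. As a sketch only, a small translation function could make failures self-describing; `fkDeleteRuleName` is a hypothetical helper that does not appear in the commit, and the code-to-action mapping is the one listed in the assertFKDeleteRule comment.

```go
package main

import "fmt"

// fkDeleteRuleName translates a pg_constraint.confdeltype code into the
// ON DELETE action it stands for. Hypothetical helper for readable test
// failures; unknown codes are reported as such rather than guessed.
func fkDeleteRuleName(code string) string {
	switch code {
	case "a":
		return "NO ACTION"
	case "r":
		return "RESTRICT"
	case "c":
		return "CASCADE"
	case "n":
		return "SET NULL"
	case "d":
		return "SET DEFAULT"
	default:
		return fmt.Sprintf("unknown(%q)", code)
	}
}

func main() {
	// "r" is what the test expects after 000015 up, "c" after 000015 down.
	for _, code := range []string{"r", "c", "n", "x"} {
		fmt.Printf("confdeltype %q means ON DELETE %s\n", code, fkDeleteRuleName(code))
	}
}
```

A helper like this could be called inside the t.Errorf messages so that a regression reads as "got CASCADE, want RESTRICT" instead of a pair of letters.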
@@ -92,12 +92,27 @@ func (s *AgentService) Register(ctx context.Context, name string, hostname strin
|
||||
}
|
||||
|
||||
// Heartbeat updates an agent's last seen time, status, and metadata.
|
||||
//
|
||||
// I-004: retired agents must be rejected up-front. A retired agent that is
|
||||
// still polling is a zombie — its row exists only for audit history and must
|
||||
// not be allowed to bump LastHeartbeatAt (which would resurrect it in stats
|
||||
// dashboards and stale-offline sweeps). The sentinel ErrAgentRetired is
|
||||
// returned unwrapped so the HTTP handler can map it to 410 Gone via
|
||||
// errors.Is; the agent process detects the 410 and shuts down cleanly
|
||||
// instead of continuing to heartbeat indefinitely.
|
||||
func (s *AgentService) Heartbeat(ctx context.Context, agentID string, metadata *domain.AgentMetadata) error {
|
||||
agent, err := s.agentRepo.Get(ctx, agentID)
|
||||
if err != nil {
|
||||
return fmt.Errorf("failed to fetch agent: %w", err)
|
||||
}
|
||||
|
||||
// I-004 guard: retired agents are frozen. Do not call UpdateHeartbeat —
|
||||
// bumping the timestamp would defeat the retired-row filter that protects
|
||||
// stats, scheduler sweeps, and handler listings.
|
||||
if agent.IsRetired() {
|
||||
return ErrAgentRetired
|
||||
}
|
||||
|
||||
// Update heartbeat and metadata
|
||||
if err := s.agentRepo.UpdateHeartbeat(ctx, agentID, metadata); err != nil {
|
||||
return fmt.Errorf("failed to update heartbeat: %w", err)
|
||||
|
||||
@@ -0,0 +1,317 @@
|
||||
package service
|
||||
|
||||
import (
|
||||
"context"
|
||||
"errors"
|
||||
"fmt"
|
||||
"log/slog"
|
||||
"time"
|
||||
|
||||
"github.com/shankar0123/certctl/internal/domain"
|
||||
)
|
||||
|
||||
// I-004 coverage-gap closure: the agent retirement surface.
|
||||
//
|
||||
// Before 000015, DELETE /api/v1/agents/{id} hard-deleted the agents row and
|
||||
// the deployment_targets.agent_id FK CASCADE cleaned up downstream rows with
|
||||
// no preflight, no archival, and no knowledge of in-flight jobs. Any cert
|
||||
// still rotating through one of those targets would observe half-migrated
|
||||
// state. I-004 closes that gap with a preflight + soft-retire + optional
|
||||
// forced-cascade contract; the symbols in this file are the service-layer
|
||||
// surface that the handler and operator UI bind against.
|
||||
|
||||
// ErrAgentIsSentinel is returned when an operator tries to retire one of the
|
||||
// four reserved sentinel agent IDs (server-scanner, cloud-aws-sm,
|
||||
// cloud-azure-kv, cloud-gcp-sm). These rows back the network scanner and the
|
||||
// three cloud secret-manager discovery sources; retiring any of them orphans
|
||||
// its subsystem. The guard fires unconditionally — force=true does not bypass
|
||||
// it, because a sentinel is a structural invariant of the deployment, not
|
||||
// a piece of fleet state the operator owns. Handler maps this to HTTP 403.
|
||||
var ErrAgentIsSentinel = errors.New("agent is a reserved sentinel and cannot be retired")
|
||||
|
||||
// ErrBlockedByDependencies is returned by RetireAgent when at least one of
|
||||
// (active targets, active certificates, pending jobs) referencing the agent
|
||||
// is non-zero and force=false. The caller always receives it wrapped in
|
||||
// a *BlockedByDependenciesError (see below), so handlers doing errors.As
|
||||
// can surface the per-bucket counts in the 409 body for operator
|
||||
// troubleshooting. Tests use errors.Is; handlers use errors.As.
|
||||
var ErrBlockedByDependencies = errors.New("agent has active downstream dependencies")
|
||||
|
||||
// ErrForceReasonRequired is returned when force=true is supplied without a
|
||||
// non-empty reason. The force escape hatch is deliberately chatty: operators
|
||||
// pulling the emergency cord must leave an auditable breadcrumb explaining
|
||||
// why a cascade was justified. Handler maps this to HTTP 400 so the operator
|
||||
// retries with --reason rather than silently skipping the guard. Checked
|
||||
// before any DB mutation to keep the no-reason path transactionally clean.
|
||||
var ErrForceReasonRequired = errors.New("force=true requires a non-empty reason")
|
||||
|
||||
// ErrAgentRetired is returned by Heartbeat (and any future agent-authenticated
|
||||
// call site) when a retired agent is still polling. The handler layer maps
|
||||
// this to HTTP 410 Gone so the cmd/agent sendHeartbeat loop can detect it
|
||||
// deterministically and shut down the agent process, rather than looping
|
||||
// forever on a soft-retired identity. IsRetired() on the domain model is
|
||||
// the single source of truth; the sentinel exists so service and handler
|
||||
// callers can errors.Is against one symbol.
|
||||
var ErrAgentRetired = errors.New("agent has been retired")
|
||||
|
||||
// BlockedByDependenciesError wraps ErrBlockedByDependencies and carries the
|
||||
// per-bucket dependency snapshot the preflight pass captured. The Counts field
// holds the same AgentDependencyCounts value the preflight builds from the
// repo's three count methods (CountActiveTargets, CountActiveCertificates,
// CountPendingJobs), so the handler can marshal it directly into the 409
|
||||
// body without reshaping fields. Unwrap() satisfies errors.Is against the
|
||||
// sentinel; Error() includes the counts so logs are diagnostic on their own.
|
||||
type BlockedByDependenciesError struct {
|
||||
Counts domain.AgentDependencyCounts
|
||||
}
|
||||
|
||||
// Error formats the wrapped error with the per-bucket counts. Kept short so
|
||||
// it reads cleanly in slog output.
|
||||
func (e *BlockedByDependenciesError) Error() string {
|
||||
return fmt.Sprintf(
|
||||
"%s (active_targets=%d, active_certificates=%d, pending_jobs=%d)",
|
||||
ErrBlockedByDependencies.Error(),
|
||||
e.Counts.ActiveTargets,
|
||||
e.Counts.ActiveCertificates,
|
||||
e.Counts.PendingJobs,
|
||||
)
|
||||
}
|
||||
|
||||
// Unwrap lets errors.Is(err, ErrBlockedByDependencies) match the wrapped
|
||||
// struct — the test contract (agent_retire_test.go:167) depends on it.
|
||||
func (e *BlockedByDependenciesError) Unwrap() error { return ErrBlockedByDependencies }
|
||||
|
||||
// AgentRetirementResult is the outcome surface the handler returns to the
|
||||
// operator. It discriminates the three happy paths the endpoint can take —
|
||||
// idempotent no-op (AlreadyRetired), clean soft-retire (Cascade=false), and
|
||||
// forced cascade (Cascade=true) — and always carries the retired_at timestamp
|
||||
// and the dependency-count snapshot so the handler's response can echo
|
||||
// what was (or would have been) affected.
|
||||
//
|
||||
// AlreadyRetired=true → agent was already retired; no new audit
|
||||
// event was emitted; RetiredAt is the
|
||||
// original stamp, not the current time.
|
||||
// Cascade=false → clean soft-retire; Counts is all zeros.
|
||||
// Cascade=true → force=true retired agent + downstream
|
||||
// targets; Counts is the PRE-cascade
|
||||
// snapshot (so the operator sees what
|
||||
// they just retired).
|
||||
type AgentRetirementResult struct {
|
||||
AlreadyRetired bool
|
||||
Cascade bool
|
||||
RetiredAt time.Time
|
||||
Counts domain.AgentDependencyCounts
|
||||
}
|
||||
|
||||
// RetireAgent implements the I-004 retirement contract. Ordering matters —
|
||||
// every guard fires before the step that mutates state, so a rejected
|
||||
// retire leaves zero trace (no audit event, no partial DB write):
|
||||
//
|
||||
// 1. Sentinel check (unconditional; force does not bypass).
|
||||
// 2. Fetch agent (404 surfaces as-is from the repo).
|
||||
// 3. Already-retired idempotency: return AlreadyRetired=true with NO new
|
||||
// audit event — the original retire already recorded one.
|
||||
// 4. Preflight count pass via the three repo count methods
// (CountActiveTargets, CountActiveCertificates, CountPendingJobs).
|
||||
// 5. Force-reason guard: force=true with empty reason is rejected here,
|
||||
// after the counts are known but before any mutation.
|
||||
// 6. Default no-force path: any non-zero count returns
|
||||
// *BlockedByDependenciesError with counts attached.
|
||||
// 7. Mutation: SoftRetire (no cascade) or RetireAgentWithCascade, with
|
||||
// a single retiredAt timestamp pinned BEFORE the repo call so the
|
||||
// audit event and the DB row agree to the nanosecond.
|
||||
// 8. Audit: agent_retired always; agent_retirement_cascaded additionally
|
||||
// on the force=true cascade path.
|
||||
//
|
||||
// Actor comes from the handler's resolveActor (API key → user, agent key →
|
||||
// agent-<id>, unauthenticated → "anonymous"); the service does not second-
|
||||
// guess it. Audit emission is best-effort: a failed RecordEvent logs a
|
||||
// warning but does not fail the overall retirement, consistent with how
|
||||
// the rest of the codebase treats audit as an observability concern
|
||||
// rather than a correctness barrier.
|
||||
func (s *AgentService) RetireAgent(ctx context.Context, id string, actor string, force bool, reason string) (*AgentRetirementResult, error) {
	// Step 1 — reserved-sentinel guard. Applies even under force=true.
	if domain.IsSentinelAgent(id) {
		return nil, ErrAgentIsSentinel
	}

	// Step 2 — existence check. Missing agent surfaces the repo's not-found
	// error verbatim so the handler can map it to 404 via its existing
	// detection path (the handler layer already has "not found" mapping
	// logic inherited from the pre-I-004 Delete endpoint).
	agent, err := s.agentRepo.Get(ctx, id)
	if err != nil {
		return nil, fmt.Errorf("failed to fetch agent: %w", err)
	}

	// Step 3 — idempotency. A retired agent returns AlreadyRetired=true
	// WITHOUT emitting a fresh audit event. Handler maps this to HTTP 204.
	// Guarding here (before preflight) means a re-retire of an agent that
	// now has zero deps doesn't spuriously "succeed again" and double-log.
	if agent.IsRetired() {
		return &AgentRetirementResult{
			AlreadyRetired: true,
			RetiredAt:      *agent.RetiredAt,
		}, nil
	}

	// Step 4 — preflight counts. All three run even when force=true: we
	// need them to populate AgentRetirementResult.Counts (the pre-cascade
	// snapshot). A repo failure here aborts the whole operation — partial
	// preflight is worse than no preflight.
	counts, err := s.collectAgentDependencyCounts(ctx, id)
	if err != nil {
		return nil, fmt.Errorf("failed to collect agent dependency counts: %w", err)
	}

	// Step 5 — force-reason guard. Positioned AFTER preflight so operators
	// who forgot --reason still see accurate counts when they retry. The
	// empty-reason rejection fires before any mutation, so the rejected
	// attempt leaves no audit noise.
	if force && reason == "" {
		return nil, ErrForceReasonRequired
	}

	// Step 6 — default path: block on any non-zero bucket. Wrapping the
	// sentinel in *BlockedByDependenciesError lets the handler use errors.As
	// to surface counts in the 409 body while tests use errors.Is against
	// the sentinel. Both callers are satisfied by the single Unwrap chain.
	if !force && counts.HasDependencies() {
		return nil, &BlockedByDependenciesError{Counts: counts}
	}

	// Step 7 — mutation. Pin retiredAt once so the audit event, the agent
	// row, and (on cascade) every deployment_targets row share the same
	// timestamp. Callers querying "what happened at T?" can correlate
	// retirement rows across tables without clock-skew tie-breaking.
	retiredAt := time.Now()
	cascade := force && counts.HasDependencies()

	if cascade {
		if err := s.agentRepo.RetireAgentWithCascade(ctx, id, retiredAt, reason); err != nil {
			return nil, fmt.Errorf("failed to retire agent with cascade: %w", err)
		}
	} else {
		if err := s.agentRepo.SoftRetire(ctx, id, retiredAt, reason); err != nil {
			return nil, fmt.Errorf("failed to soft-retire agent: %w", err)
		}
	}

	// Step 8 — audit. Two events on the cascade path so forensics can
	// distinguish "agent was retired" (agent_retired) from "downstream
	// targets were flipped" (agent_retirement_cascaded). Details on the
	// cascaded event carry the pre-cascade counts so a reviewer looking
	// only at the audit log knows how much state was affected. Emission
	// is best-effort — audit is observability, not a correctness barrier.
	actorType := s.resolveActorType(actor)
	details := map[string]interface{}{
		"actor":               actor,
		"reason":              reason,
		"force":               force,
		"active_targets":      counts.ActiveTargets,
		"active_certificates": counts.ActiveCertificates,
		"pending_jobs":        counts.PendingJobs,
	}
	if err := s.auditService.RecordEvent(ctx, actor, actorType,
		"agent_retired", "agent", id, details); err != nil {
		slog.Error("failed to record agent_retired audit event", "agent_id", id, "error", err)
	}
	if cascade {
		cascadeDetails := map[string]interface{}{
			"actor":               actor,
			"reason":              reason,
			"active_targets":      counts.ActiveTargets,
			"active_certificates": counts.ActiveCertificates,
			"pending_jobs":        counts.PendingJobs,
		}
		if err := s.auditService.RecordEvent(ctx, actor, actorType,
			"agent_retirement_cascaded", "agent", id, cascadeDetails); err != nil {
			slog.Error("failed to record agent_retirement_cascaded audit event", "agent_id", id, "error", err)
		}
	}

	return &AgentRetirementResult{
		AlreadyRetired: false,
		Cascade:        cascade,
		RetiredAt:      retiredAt,
		Counts:         counts,
	}, nil
}

// ListRetiredAgents returns the paginated list of retired agents in
// retired_at DESC order. This is the companion to ListAgents — which
// hides retired rows — so the operator UI can render a dedicated
// "Retired" tab without leaking retired rows into every other listing.
// Pagination defaults (page<1→1, perPage<1→50) are applied here as
// well as in the repo, so callers can pass 0s when they want defaults.
//
// Return shape harmonizes with handler.AgentService: a value slice
// (not pointer slice) and int64 total. The repo returns []*domain.Agent;
// this method dereferences into a value slice so the handler's
// PagedResponse marshals straight objects and so the compile-time
// interface assertion in agent_retire_handler_test.go:387 is satisfied.
// Nil repo entries are skipped defensively — the repo should never
// return them, but the handler contract is more important than the
// repo's (pointer-slice) convenience.
func (s *AgentService) ListRetiredAgents(ctx context.Context, page, perPage int) ([]domain.Agent, int64, error) {
	if page < 1 {
		page = 1
	}
	if perPage < 1 {
		perPage = 50
	}
	agents, total, err := s.agentRepo.ListRetired(ctx, page, perPage)
	if err != nil {
		return nil, 0, fmt.Errorf("failed to list retired agents: %w", err)
	}
	out := make([]domain.Agent, 0, len(agents))
	for _, a := range agents {
		if a == nil {
			continue
		}
		out = append(out, *a)
	}
	return out, int64(total), nil
}
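The two mechanics ListRetiredAgents combines (pagination defaults and the nil-skipping pointer-to-value copy) can be tried in isolation. The following is a rough sketch under stated assumptions: Agent is a hypothetical one-field stand-in for domain.Agent, and normalizePage and derefAgents are illustrative helper names that do not exist in the repository.

```go
package main

import "fmt"

// Agent is a hypothetical stand-in for domain.Agent.
type Agent struct {
	ID string
}

// normalizePage applies the defaults described above: page<1 becomes 1 and
// perPage<1 becomes 50.
func normalizePage(page, perPage int) (int, int) {
	if page < 1 {
		page = 1
	}
	if perPage < 1 {
		perPage = 50
	}
	return page, perPage
}

// derefAgents copies every non-nil entry into a value slice, skipping nil
// pointers instead of panicking on dereference.
func derefAgents(in []*Agent) []Agent {
	out := make([]Agent, 0, len(in))
	for _, a := range in {
		if a == nil {
			continue
		}
		out = append(out, *a)
	}
	return out
}

func main() {
	page, perPage := normalizePage(0, 0)
	fmt.Println(page, perPage) // 1 50

	got := derefAgents([]*Agent{{ID: "a"}, nil, {ID: "b"}})
	fmt.Println(len(got), got[0].ID, got[1].ID) // 2 a b
}
```

Pre-sizing the output with the input length avoids reallocation; when nil entries are skipped the slice simply ends up shorter than its capacity.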

// collectAgentDependencyCounts runs the three preflight COUNT queries in
// sequence and bundles the result. Sequential (not parallel) because the
// queries are cheap (<1ms each on the indexed columns added in 000015) and
// sequential keeps error handling simple. Any repo error short-circuits
// — we prefer to refuse the retire than make a half-informed decision.
func (s *AgentService) collectAgentDependencyCounts(ctx context.Context, id string) (domain.AgentDependencyCounts, error) {
	var counts domain.AgentDependencyCounts

	targets, err := s.agentRepo.CountActiveTargets(ctx, id)
	if err != nil {
		return counts, fmt.Errorf("count active targets: %w", err)
	}
	counts.ActiveTargets = targets

	certs, err := s.agentRepo.CountActiveCertificates(ctx, id)
	if err != nil {
		return counts, fmt.Errorf("count active certificates: %w", err)
	}
	counts.ActiveCertificates = certs

	jobs, err := s.agentRepo.CountPendingJobs(ctx, id)
	if err != nil {
		return counts, fmt.Errorf("count pending jobs: %w", err)
	}
	counts.PendingJobs = jobs

	return counts, nil
}

// resolveActorType maps an opaque actor string into the typed ActorType
// used by the audit schema. Matches the conventions the rest of the
// service layer uses: "system" → System, anything that looks like an
// agent identity → Agent, everything else → User.
func (s *AgentService) resolveActorType(actor string) domain.ActorType {
	switch {
	case actor == "system":
		return domain.ActorTypeSystem
	case len(actor) > 6 && actor[:6] == "agent-":
		return domain.ActorTypeAgent
	default:
		return domain.ActorTypeUser
	}
}
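The prefix test in resolveActorType has one edge worth seeing concretely: the length guard is strictly greater than 6, so the bare string "agent-" with no ID after it falls through to the user bucket. The sketch below reproduces the same switch against a hypothetical local ActorType; the string values of the constants are assumptions for illustration, not the values defined in the domain package.

```go
package main

import "fmt"

// ActorType is a hypothetical stand-in for domain.ActorType; the string
// values below are illustrative only.
type ActorType string

const (
	ActorTypeSystem ActorType = "System"
	ActorTypeAgent  ActorType = "Agent"
	ActorTypeUser   ActorType = "User"
)

// classifyActor mirrors the switch shown above: exact "system", then an
// "agent-" prefix followed by at least one more character, else user.
func classifyActor(actor string) ActorType {
	switch {
	case actor == "system":
		return ActorTypeSystem
	case len(actor) > 6 && actor[:6] == "agent-":
		return ActorTypeAgent
	default:
		return ActorTypeUser
	}
}

func main() {
	for _, actor := range []string{"system", "agent-42", "agent-", "alice"} {
		fmt.Printf("%q -> %s\n", actor, classifyActor(actor))
	}
	// Prints:
	// "system" -> System
	// "agent-42" -> Agent
	// "agent-" -> User
	// "alice" -> User
}
```

Note that a plain strings.HasPrefix(actor, "agent-") would classify the bare prefix as an agent, so the two forms are not interchangeable.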
@@ -0,0 +1,396 @@
package service

import (
	"context"
	"errors"
	"log/slog"
	"testing"
	"time"

	"github.com/shankar0123/certctl/internal/domain"
)

// setupRetireTest wires up an AgentService with a single registered agent and
// returns (service, agentRepo, auditRepo) so tests can seed state and assert
// audit events. Kept minimal — tests that need targets/jobs/certs extend the
// returned repos directly.
func setupRetireTest(t *testing.T, agentID string) (*AgentService, *mockAgentRepo, *mockAuditRepo) {
	t.Helper()
	now := time.Now()
	agent := &domain.Agent{
		ID:              agentID,
		Name:            "prod-agent",
		Hostname:        "server-01",
		Status:          domain.AgentStatusOnline,
		RegisteredAt:    now,
		LastHeartbeatAt: &now,
		APIKeyHash:      "hash-" + agentID,
	}
	agentRepo := newMockAgentRepository()
	agentRepo.AddAgent(agent)
	certRepo := &mockCertRepo{
		Certs:    make(map[string]*domain.ManagedCertificate),
		Versions: make(map[string][]*domain.CertificateVersion),
	}
	jobRepo := &mockJobRepo{
		Jobs:          make(map[string]*domain.Job),
		StatusUpdates: make(map[string]domain.JobStatus),
	}
	targetRepo := &mockTargetRepo{
		Targets: make(map[string]*domain.DeploymentTarget),
	}
	auditRepo := &mockAuditRepo{Events: []*domain.AuditEvent{}}
	auditService := NewAuditService(auditRepo)
	issuerRegistry := NewIssuerRegistry(slog.Default())

	svc := NewAgentService(agentRepo, certRepo, jobRepo, targetRepo, auditService, issuerRegistry, nil)
	return svc, agentRepo, auditRepo
}

// TestRetireAgent_Sentinel_Rejected covers I-004's sentinel guard. The four
// well-known sentinel agent IDs back discovery sources and the network scanner
// — retiring them would orphan those subsystems. Contract: reject with
// ErrAgentIsSentinel regardless of force/reason.
func TestRetireAgent_Sentinel_Rejected(t *testing.T) {
	sentinels := []string{"server-scanner", "cloud-aws-sm", "cloud-azure-kv", "cloud-gcp-sm"}
	for _, id := range sentinels {
		t.Run(id, func(t *testing.T) {
			svc, _, _ := setupRetireTest(t, id)
			_, err := svc.RetireAgent(context.Background(), id, "alice", false, "")
			if !errors.Is(err, ErrAgentIsSentinel) {
				t.Fatalf("retire(sentinel %q) err=%v want ErrAgentIsSentinel", id, err)
			}
			// Sentinel rejection must be deterministic even under force=true.
			_, err = svc.RetireAgent(context.Background(), id, "alice", true, "forced by operator")
			if !errors.Is(err, ErrAgentIsSentinel) {
				t.Fatalf("retire(sentinel %q force=true) err=%v want ErrAgentIsSentinel", id, err)
			}
		})
	}
}

// TestRetireAgent_NotFound covers the 404 preflight path. The handler maps
// ErrAgentNotFound-equivalent sentinel to 404; the service must surface it
// cleanly without partial state mutation.
func TestRetireAgent_NotFound(t *testing.T) {
	svc, _, _ := setupRetireTest(t, "agent-001")
	_, err := svc.RetireAgent(context.Background(), "agent-does-not-exist", "alice", false, "")
	if err == nil {
		t.Fatalf("retire(missing id) err=nil want not-found error")
	}
}

// TestRetireAgent_AlreadyRetired_Idempotent covers the 204 No Content path.
// Retiring an already-retired agent must succeed without error and without
// emitting a new audit event (the first retirement already recorded one).
// Idempotency matters because the handler is the escape hatch for operators
// re-issuing a failed retire after a partial failure mid-cascade.
func TestRetireAgent_AlreadyRetired_Idempotent(t *testing.T) {
	svc, agentRepo, auditRepo := setupRetireTest(t, "agent-001")
	past := time.Now().Add(-24 * time.Hour)
	reason := "operator decommissioned"
	agent := agentRepo.Agents["agent-001"]
	agent.RetiredAt = &past
	agent.RetiredReason = &reason

	result, err := svc.RetireAgent(context.Background(), "agent-001", "alice", false, "")
	if err != nil {
		t.Fatalf("retire(already retired) err=%v want nil (idempotent)", err)
	}
	if result == nil || !result.AlreadyRetired {
		t.Fatalf("retire(already retired) result=%+v want AlreadyRetired=true", result)
	}
	// Retire-on-retired must not emit a duplicate audit event.
	for _, e := range auditRepo.Events {
		if e.Action == "agent_retired" && e.ResourceID == "agent-001" {
			t.Fatalf("retire(already retired) emitted duplicate agent_retired audit event")
		}
	}
}

// TestRetireAgent_NoDeps_SoftSucceeds covers the happy 200 path: no active
// targets, certs, or jobs referencing the agent. Soft-retire stamps
// RetiredAt + RetiredReason and emits agent_retired audit event.
func TestRetireAgent_NoDeps_SoftSucceeds(t *testing.T) {
	svc, agentRepo, auditRepo := setupRetireTest(t, "agent-001")

	before := time.Now().Add(-time.Second)
	result, err := svc.RetireAgent(context.Background(), "agent-001", "alice", false, "")
	if err != nil {
		t.Fatalf("retire(clean) err=%v want nil", err)
	}
	if result == nil {
		t.Fatal("retire(clean) result=nil want non-nil")
	}
	if result.AlreadyRetired {
		t.Fatalf("retire(clean) result.AlreadyRetired=true want false")
	}
	if result.Cascade {
		t.Fatalf("retire(clean) result.Cascade=true want false (no deps to cascade)")
	}
	if !result.RetiredAt.After(before) {
		t.Fatalf("retire(clean) RetiredAt=%v not after test start %v", result.RetiredAt, before)
	}

	agent := agentRepo.Agents["agent-001"]
	if agent.RetiredAt == nil {
		t.Fatalf("retire(clean) agent.RetiredAt=nil want stamped")
	}

	// Audit event must be emitted with action=agent_retired, actor=alice.
	found := false
	for _, e := range auditRepo.Events {
		if e.Action == "agent_retired" && e.ResourceID == "agent-001" && e.Actor == "alice" {
			found = true
			break
		}
	}
	if !found {
		t.Fatalf("retire(clean) missing agent_retired audit event for alice, events=%+v", auditRepo.Events)
	}
}

// TestRetireAgent_WithDeps_NoForce_Blocked covers the 409 preflight path. When
// the agent has any of: active non-retired targets, certs deployed via those
// targets, or pending jobs — a default retire must block with
// ErrBlockedByDependencies and the counts must be reachable via errors.As so
// the handler can build the 409 body.
func TestRetireAgent_WithDeps_NoForce_Blocked(t *testing.T) {
	svc, agentRepo, _ := setupRetireTest(t, "agent-001")
	// Seed dependency counts directly on the mock — the production repo
	// implements CountActive* queries; the mock exposes them as fields.
	agentRepo.ActiveTargetCounts["agent-001"] = 3
	agentRepo.ActiveCertCounts["agent-001"] = 7
	agentRepo.PendingJobCounts["agent-001"] = 2

	_, err := svc.RetireAgent(context.Background(), "agent-001", "alice", false, "")
	if !errors.Is(err, ErrBlockedByDependencies) {
		t.Fatalf("retire(with deps, no force) err=%v want ErrBlockedByDependencies", err)
	}
	var blocked *BlockedByDependenciesError
	if !errors.As(err, &blocked) {
		t.Fatalf("retire(with deps) err=%v want wrapped *BlockedByDependenciesError", err)
	}
	if blocked.Counts.ActiveTargets != 3 {
		t.Errorf("blocked.Counts.ActiveTargets=%d want 3", blocked.Counts.ActiveTargets)
	}
	if blocked.Counts.ActiveCertificates != 7 {
		t.Errorf("blocked.Counts.ActiveCertificates=%d want 7", blocked.Counts.ActiveCertificates)
	}
	if blocked.Counts.PendingJobs != 2 {
		t.Errorf("blocked.Counts.PendingJobs=%d want 2", blocked.Counts.PendingJobs)
	}
	// Agent must still be un-retired after preflight block.
	if agentRepo.Agents["agent-001"].RetiredAt != nil {
		t.Fatalf("retire(blocked) left RetiredAt stamped; preflight must be transactionally safe")
	}
}

// TestRetireAgent_WithDeps_Force_NoReason_Rejected covers the 400 guard on the
// force escape hatch. Operators using force=true must supply a justifying
// reason; empty reason is rejected before any DB mutation.
func TestRetireAgent_WithDeps_Force_NoReason_Rejected(t *testing.T) {
	svc, agentRepo, _ := setupRetireTest(t, "agent-001")
	agentRepo.ActiveTargetCounts["agent-001"] = 1

	_, err := svc.RetireAgent(context.Background(), "agent-001", "alice", true, "")
	if !errors.Is(err, ErrForceReasonRequired) {
		t.Fatalf("retire(force, no reason) err=%v want ErrForceReasonRequired", err)
	}
	if agentRepo.Agents["agent-001"].RetiredAt != nil {
		t.Fatalf("retire(force, no reason) left RetiredAt stamped; guard must fire before mutation")
	}
}

// TestRetireAgent_WithDeps_Force_Cascades covers the force=true transactional
// path: agent retires, downstream targets also soft-retire with the supplied
// reason, and the result surface indicates cascade happened. Reason
// propagates to every cascaded row so post-mortem forensics can trace the
// cascade to a single operator action.
func TestRetireAgent_WithDeps_Force_Cascades(t *testing.T) {
	svc, agentRepo, auditRepo := setupRetireTest(t, "agent-001")
	agentRepo.ActiveTargetCounts["agent-001"] = 2
	agentRepo.ActiveCertCounts["agent-001"] = 5
	agentRepo.PendingJobCounts["agent-001"] = 1

	reason := "decommissioning rack 7"
	result, err := svc.RetireAgent(context.Background(), "agent-001", "alice", true, reason)
	if err != nil {
		t.Fatalf("retire(force, reason) err=%v want nil", err)
	}
	if result == nil {
		t.Fatal("retire(force) result=nil want non-nil")
	}
	if !result.Cascade {
		t.Fatalf("retire(force) result.Cascade=false want true")
	}
	if result.Counts.ActiveTargets != 2 {
		t.Errorf("result.Counts.ActiveTargets=%d want 2 (pre-cascade snapshot)", result.Counts.ActiveTargets)
	}

	agent := agentRepo.Agents["agent-001"]
	if agent.RetiredAt == nil {
		t.Fatalf("retire(force) agent.RetiredAt=nil want stamped")
	}
	if agent.RetiredReason == nil || *agent.RetiredReason != reason {
		t.Fatalf("retire(force) RetiredReason=%v want %q", agent.RetiredReason, reason)
	}

	// Two audit events required: agent_retired + agent_retirement_cascaded.
	// The cascaded event captures which downstream resources were affected.
	var haveRetired, haveCascaded bool
	for _, e := range auditRepo.Events {
		if e.ResourceID == "agent-001" {
			switch e.Action {
			case "agent_retired":
				haveRetired = true
			case "agent_retirement_cascaded":
				haveCascaded = true
			}
		}
	}
	if !haveRetired {
		t.Errorf("retire(force) missing agent_retired audit event")
	}
	if !haveCascaded {
		t.Errorf("retire(force) missing agent_retirement_cascaded audit event")
	}
}

// TestRetireAgent_EmitsAuditEvent pins the audit contract for I-004:
// every retire path that mutates DB state emits at least one audit event with
// the operator's actor identity, so post-hoc compliance/forensics can
// reconstruct who retired what and when.
func TestRetireAgent_EmitsAuditEvent(t *testing.T) {
	svc, _, auditRepo := setupRetireTest(t, "agent-007")

	_, err := svc.RetireAgent(context.Background(), "agent-007", "compliance-bot", false, "")
	if err != nil {
		t.Fatalf("retire err=%v want nil", err)
	}
	for _, e := range auditRepo.Events {
		if e.Action == "agent_retired" && e.ResourceID == "agent-007" {
			if e.Actor != "compliance-bot" {
				t.Errorf("audit event Actor=%q want compliance-bot", e.Actor)
			}
			return
		}
	}
	t.Fatalf("no agent_retired audit event emitted, events=%+v", auditRepo.Events)
}

// TestHeartbeat_RetiredAgent_ReturnsErrAgentRetired covers the 410 Gone
// contract. A retired agent that is still polling must be told its identity
// is no longer accepted — the agent process should detect this and shut
// down rather than continue heartbeating indefinitely.
func TestHeartbeat_RetiredAgent_ReturnsErrAgentRetired(t *testing.T) {
	svc, agentRepo, _ := setupRetireTest(t, "agent-001")
	past := time.Now().Add(-time.Hour)
	reason := "decommissioned"
	agentRepo.Agents["agent-001"].RetiredAt = &past
	agentRepo.Agents["agent-001"].RetiredReason = &reason

	err := svc.Heartbeat(context.Background(), "agent-001", &domain.AgentMetadata{
		OS:           "linux",
		Architecture: "amd64",
		Hostname:     "server-01",
	})
	if !errors.Is(err, ErrAgentRetired) {
		t.Fatalf("heartbeat(retired) err=%v want ErrAgentRetired", err)
	}
	// Retired heartbeat must NOT bump LastHeartbeatAt — otherwise the retired
	// agent could resurrect itself in stats/observability dashboards.
	if _, bumped := agentRepo.HeartbeatUpdates["agent-001"]; bumped {
		t.Fatalf("heartbeat(retired) updated LastHeartbeatAt; retired agents must be frozen")
	}
}

// TestListAgents_DefaultExcludesRetired covers the contract that the
// handler-facing ListAgents call hides retired rows by default. Otherwise
// every dashboard that paginates agents would surface retired stragglers.
// An explicit "list retired" endpoint (ListRetiredAgents) covers the audit
// use case.
func TestListAgents_DefaultExcludesRetired(t *testing.T) {
	svc, agentRepo, _ := setupRetireTest(t, "agent-active")
	// Seed one retired agent alongside the active one.
	past := time.Now().Add(-24 * time.Hour)
	reason := "old hardware"
	agentRepo.AddAgent(&domain.Agent{
		ID:            "agent-retired",
		Name:          "retired-agent",
		Hostname:      "server-old",
		Status:        domain.AgentStatusOffline,
		RegisteredAt:  past,
		APIKeyHash:    "hash-retired",
		RetiredAt:     &past,
		RetiredReason: &reason,
	})

	agents, total, err := svc.ListAgents(context.Background(), 1, 50)
	if err != nil {
		t.Fatalf("ListAgents err=%v want nil", err)
	}
	for _, a := range agents {
		if a.ID == "agent-retired" {
			t.Fatalf("ListAgents returned retired agent %q in default listing", a.ID)
		}
	}
	if total != 1 {
		t.Errorf("ListAgents total=%d want 1 (only active)", total)
	}

	// ListRetiredAgents must surface retired-only, with count=1.
	retired, retiredTotal, err := svc.ListRetiredAgents(context.Background(), 1, 50)
	if err != nil {
		t.Fatalf("ListRetiredAgents err=%v want nil", err)
	}
	if retiredTotal != 1 {
		t.Errorf("ListRetiredAgents total=%d want 1", retiredTotal)
	}
	if len(retired) != 1 || retired[0].ID != "agent-retired" {
		t.Fatalf("ListRetiredAgents got=%+v want [agent-retired]", retired)
	}
}

// TestMarkStaleAgentsOffline_SkipsRetired covers the stale-offline sweeper
// interaction with retirement. A retired agent must not be re-surfaced as
// a state transition ("Online → Offline") by the scheduler, because its
// Status column is preserved as the last-known operational state at
// retirement time and RetiredAt is the source of truth for filtering.
func TestMarkStaleAgentsOffline_SkipsRetired(t *testing.T) {
	svc, agentRepo, _ := setupRetireTest(t, "agent-live")
	// Active agent is currently stale (no heartbeat for 10 minutes) — eligible
	// for Online→Offline transition.
	stale := time.Now().Add(-10 * time.Minute)
	agentRepo.Agents["agent-live"].LastHeartbeatAt = &stale

	// Retired agent was also stale at retirement time, but must NOT be
	// touched by the sweeper.
	past := time.Now().Add(-24 * time.Hour)
	reason := "hw failure"
	agentRepo.AddAgent(&domain.Agent{
		ID:              "agent-retired",
		Name:            "dead-agent",
		Hostname:        "server-old",
		Status:          domain.AgentStatusOnline, // preserved last-seen status
		RegisteredAt:    past,
		LastHeartbeatAt: &past,
		APIKeyHash:      "hash-dead",
		RetiredAt:       &past,
		RetiredReason:   &reason,
	})

	if err := svc.MarkStaleAgentsOffline(context.Background(), 5*time.Minute); err != nil {
		t.Fatalf("MarkStaleAgentsOffline err=%v want nil", err)
	}

	// Active-stale agent should flip Online → Offline.
	if got := agentRepo.Agents["agent-live"].Status; got != domain.AgentStatusOffline {
		t.Errorf("agent-live Status=%s want Offline", got)
	}
	// Retired agent's Status column must be frozen at Online (its preserved
	// last-seen state); the sweeper must skip it.
	if got := agentRepo.Agents["agent-retired"].Status; got != domain.AgentStatusOnline {
		t.Errorf("agent-retired Status=%s want Online (frozen); sweeper touched retired row", got)
	}
}
@@ -145,6 +145,31 @@ func (s *DeploymentService) ProcessDeploymentJob(ctx context.Context, job *domai
		return fmt.Errorf("failed to fetch agent: %w", err)
	}

	// I-004: AgentRepository.Get surfaces retired rows by design (for the GUI
	// banner + 410 Gone heartbeat path). Deployments must never dispatch to a
	// retired agent — it will never heartbeat again and the target row should
	// itself have been cascade-retired when the agent was force-retired. A job
	// slipping through here would otherwise hit the heartbeat-staleness branch
	// below with the misleading reason "agent is offline"; we want operators to
	// see the real cause. Fail the job with an explicit reason, send a
	// deployment notification so the owner is alerted, and record an audit
	// event. Follows the same notify+audit shape as the offline branch.
	if agent.IsRetired() {
		updateErr := s.jobRepo.UpdateStatus(ctx, job.ID, domain.JobStatusFailed, "assigned agent is retired")
		if updateErr != nil {
			slog.Error("failed to update job status", "job_id", job.ID, "error", updateErr)
		}
		if notifErr := s.notificationSvc.SendDeploymentNotification(ctx, cert, target, false, fmt.Errorf("agent retired")); notifErr != nil {
			slog.Error("failed to send deployment notification", "error", notifErr)
		}
		if auditErr := s.auditService.RecordEvent(ctx, "system", domain.ActorTypeSystem,
			"deployment_job_failed", "certificate", job.CertificateID,
			map[string]interface{}{"job_id": job.ID, "reason": "agent retired", "target_id": targetID, "agent_id": agentID}); auditErr != nil {
			slog.Error("failed to record audit event", "error", auditErr)
		}
		return fmt.Errorf("agent %s is retired", agentID)
	}

	// Check agent heartbeat (must be within last 5 minutes)
	if agent.LastHeartbeatAt != nil && time.Since(*agent.LastHeartbeatAt) > 5*time.Minute {
		updateErr := s.jobRepo.UpdateStatus(ctx, job.ID, domain.JobStatusFailed, "agent is offline")
@@ -232,6 +232,18 @@ func (s *TargetService) TestConnection(ctx context.Context, id string) error {
		return fmt.Errorf("assigned agent not found: %w", err)
	}

	// I-004: AgentRepository.Get intentionally surfaces retired rows (the banner
	// + 410 Gone paths need to see them). A test against a retired agent can
	// never succeed — the agent is tombstoned, will never heartbeat again, and
	// any active targets have already been cascade-retired alongside it. Fail
	// fast with an explicit message instead of falling through to the Status /
	// heartbeat checks, which would produce a misleading "agent is Offline" or
	// "heartbeat stale" diagnostic.
	if agent.IsRetired() {
		s.updateTestStatus(ctx, target, "failed")
		return fmt.Errorf("assigned agent %s is retired", agent.ID)
	}

	if agent.Status != domain.AgentStatusOnline {
		s.updateTestStatus(ctx, target, "failed")
		return fmt.Errorf("assigned agent %s is %s (expected Online)", agent.ID, agent.Status)
@@ -293,9 +305,20 @@ func (s *TargetService) CreateTarget(ctx context.Context, target domain.Deployme
 	if target.AgentID == "" {
 		return nil, fmt.Errorf("%w: agent_id is required", ErrAgentNotFound)
 	}
-	if _, err := s.agentRepo.Get(ctx, target.AgentID); err != nil {
+	agent, err := s.agentRepo.Get(ctx, target.AgentID)
+	if err != nil {
 		return nil, fmt.Errorf("%w: %s", ErrAgentNotFound, target.AgentID)
 	}
+	// I-004: refuse to attach new targets to a retired agent. The agent is
+	// tombstoned and no deployments would ever succeed against it; letting a
+	// row slip past here would immediately be cascade-retired on the next
+	// dependency sweep and confuse operators ("why is this brand-new target
+	// already retired?"). Treating retired agents as "not found" for creation
+	// purposes keeps the error surface tight and matches the default-list
+	// contract established by repository.AgentRepository.List.
+	if agent.IsRetired() {
+		return nil, fmt.Errorf("%w: %s (retired)", ErrAgentNotFound, target.AgentID)
+	}
 
 	if target.ID == "" {
 		target.ID = generateID("target")
@@ -4,6 +4,7 @@ import (
	"context"
	"database/sql"
	"errors"
	"sort"
	"sync"
	"time"

@@ -607,7 +608,14 @@ func (m *mockRenewalPolicyRepo) AddPolicy(policy *domain.RenewalPolicy) {
 	m.Policies[policy.ID] = policy
 }
 
-// mockAgentRepo is a test implementation of AgentRepository
+// mockAgentRepo is a test implementation of AgentRepository.
+//
+// I-004: ActiveTargetCounts / ActiveCertCounts / PendingJobCounts are keyed by
+// agent ID and read back verbatim by the Count* methods — the retirement
+// service's preflight pokes these maps to simulate "agent has N active
+// deployments / M deployed certs / K pending jobs" without having to seed
+// real target/cert/job rows across multiple mock repos. An unset key means
+// zero, matching the production repo behavior on an agent with no deps.
 type mockAgentRepo struct {
 	mu sync.Mutex
 	Agents map[string]*domain.Agent
@@ -619,8 +627,27 @@ type mockAgentRepo struct {
	ListErr error
	UpdateHeartbeatErr error
	GetByAPIKeyErr error
	// I-004 preflight count seeds (read by CountActiveTargets etc.).
	ActiveTargetCounts map[string]int
	ActiveCertCounts map[string]int
	PendingJobCounts map[string]int
	// I-004 retirement write-path error seams. Let tests force a SoftRetire
	// or RetireAgentWithCascade failure after preflight passed, so the
	// service's error surfacing (wrap+return, skip audit, etc.) can be
	// exercised without having to stand up a real PG connection.
	SoftRetireErr error
	RetireCascadeErr error
	CountErr error
	ListRetiredErr error
}

// List mirrors the production repo contract post-I-004: it returns only
// ACTIVE agents (RetiredAt == nil). Tests that seed a retired agent via
// AddAgent and then call a List-driven service method (e.g. ListAgents,
// MarkStaleAgentsOffline, stats dashboards) must not see the retired row
// here — otherwise the mock would pass while the real planner filters it
// out at the WHERE clause level. ListRetired is the companion method for
// explicit retired-only listing.
func (m *mockAgentRepo) List(ctx context.Context) ([]*domain.Agent, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
@@ -629,6 +656,9 @@ func (m *mockAgentRepo) List(ctx context.Context) ([]*domain.Agent, error) {
 	}
 	var agents []*domain.Agent
 	for _, a := range m.Agents {
+		if a.RetiredAt != nil {
+			continue
+		}
 		agents = append(agents, a)
 	}
 	return agents, nil
@@ -726,6 +756,134 @@ func (m *mockAgentRepo) AddAgent(agent *domain.Agent) {
 	m.Agents[agent.ID] = agent
 }

+// ListRetired returns the paginated retired-agents slice + total count.
+// Matches the production repo contract: RetiredAt != nil, sorted by
+// RetiredAt DESC, page<1 → 1, perPage<1 → 50. Sort is done in-memory over
+// the keyed map so the mock stays dependency-free. I-004.
+func (m *mockAgentRepo) ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error) {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.ListRetiredErr != nil {
+		return nil, 0, m.ListRetiredErr
+	}
+	if page < 1 {
+		page = 1
+	}
+	if perPage < 1 {
+		perPage = 50
+	}
+	var retired []*domain.Agent
+	for _, a := range m.Agents {
+		if a.RetiredAt != nil {
+			retired = append(retired, a)
+		}
+	}
+	total := len(retired)
+	// Sort by RetiredAt DESC — most recent first. The real query uses the
+	// partial idx_agents_retired_at index; here we sort in Go.
+	sort.SliceStable(retired, func(i, j int) bool {
+		return retired[i].RetiredAt.After(*retired[j].RetiredAt)
+	})
+	// Apply page/perPage window.
+	offset := (page - 1) * perPage
+	if offset >= total {
+		return nil, total, nil
+	}
+	end := offset + perPage
+	if end > total {
+		end = total
+	}
+	return retired[offset:end], total, nil
+}
+
+// SoftRetire stamps RetiredAt + RetiredReason on the agent row. Mirrors
+// the real repo's idempotent semantics: a row already retired is left
+// untouched (zero-rows-affected is not an error). I-004 preserves
+// retirement metadata across re-retire attempts — whoever retired it
+// first owns the audit trail.
+func (m *mockAgentRepo) SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.SoftRetireErr != nil {
+		return m.SoftRetireErr
+	}
+	agent, ok := m.Agents[id]
+	if !ok {
+		return errNotFound
+	}
+	if agent.RetiredAt != nil {
+		return nil // already retired — no-op
+	}
+	stamped := retiredAt
+	agent.RetiredAt = &stamped
+	stampedReason := reason
+	agent.RetiredReason = &stampedReason
+	return nil
+}
+
+// RetireAgentWithCascade stamps the agent row the same way SoftRetire
+// does. The real repo also stamps every active deployment_targets row
+// in the same transaction; the mock can't do that because targets live
+// in mockTargetRepo, which the retirement service doesn't write to
+// through this repo interface. Tests that need to assert cascade
+// semantics on targets should seed mockTargetRepo directly and verify
+// the service-layer audit event captured the cascade count. I-004.
+func (m *mockAgentRepo) RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.RetireCascadeErr != nil {
+		return m.RetireCascadeErr
+	}
+	agent, ok := m.Agents[id]
+	if !ok {
+		return errNotFound
+	}
+	if agent.RetiredAt != nil {
+		return nil // already retired — no-op (same as production transaction)
+	}
+	stamped := retiredAt
+	agent.RetiredAt = &stamped
+	stampedReason := reason
+	agent.RetiredReason = &stampedReason
+	return nil
+}
+
+// CountActiveTargets returns the seeded ActiveTargetCounts value (0 if
+// unset). Matches the real repo signature: COUNT of non-retired
+// deployment_targets with agent_id=$1. I-004 preflight.
+func (m *mockAgentRepo) CountActiveTargets(ctx context.Context, agentID string) (int, error) {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.CountErr != nil {
+		return 0, m.CountErr
+	}
+	return m.ActiveTargetCounts[agentID], nil
+}
+
+// CountActiveCertificates returns the seeded ActiveCertCounts value.
+// Real query: COUNT(DISTINCT certificate_id) across
+// certificate_target_mappings ↔ deployment_targets on agent_id. I-004.
+func (m *mockAgentRepo) CountActiveCertificates(ctx context.Context, agentID string) (int, error) {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.CountErr != nil {
+		return 0, m.CountErr
+	}
+	return m.ActiveCertCounts[agentID], nil
+}
+
+// CountPendingJobs returns the seeded PendingJobCounts value. Real
+// query: COUNT of jobs with agent_id=$1 AND status IN (Pending,
+// AwaitingCSR, AwaitingApproval, Running). I-004.
+func (m *mockAgentRepo) CountPendingJobs(ctx context.Context, agentID string) (int, error) {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.CountErr != nil {
+		return 0, m.CountErr
+	}
+	return m.PendingJobCounts[agentID], nil
+}
+
 // mockTargetRepo is a test implementation of TargetRepository
 type mockTargetRepo struct {
 	mu sync.Mutex
@@ -955,6 +1113,13 @@ func newMockAgentRepository() *mockAgentRepo {
 	return &mockAgentRepo{
 		Agents: make(map[string]*domain.Agent),
 		HeartbeatUpdates: make(map[string]time.Time),
+		// I-004 preflight count maps. Tests seed these directly via
+		// agentRepo.ActiveTargetCounts["agent-id"] = N — unset keys
+		// read back as zero from CountActiveTargets etc., matching
+		// the production repo behavior for agents with no deps.
+		ActiveTargetCounts: make(map[string]int),
+		ActiveCertCounts: make(map[string]int),
+		PendingJobCounts: make(map[string]int),
 	}
 }