mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-11 23:19:01 +00:00
Close I-004 (agent hard-delete cascades targets) coverage-gap finding
Operator decision answered as full soft-delete with optional forced
cascade — hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.
Schema — migration 000015:
migrations/000015_agent_retire.up.sql flips
deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
boundary instead of quietly destroying targets. Both `agents` and
`deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
TEXT pair (TEXT not VARCHAR so operator comments are never
truncated), indexed via partial indexes WHERE retired_at IS NOT
NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
EXISTS) so repeated runs against partially-migrated databases
converge. migrations/000015_agent_retire.down.sql restores CASCADE
and drops the new columns for clean rollback. A dedicated
repository-layer testcontainers test
(internal/repository/postgres/migration_000015_test.go) asserts the
before/after FK action, column presence, index presence, and
round-trip idempotency under up→down→up.
Domain — sentinel guard + dependency counts:
internal/domain/connector.go gains IsRetired() on Agent, the
exported SentinelAgentIDs slice listing server-scanner,
cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
four reserved IDs documented in CLAUDE.md and created at startup in
cmd/server/main.go), IsSentinelAgent(id string) predicate,
AgentDependencyCounts{ActiveTargets, ActiveCertificates,
PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
ActorTypeSystem enum values used by audit emission downstream.
Coverage locked down by internal/domain/connector_test.go.
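As a rough illustration of the two domain predicates described above, the
sketch below re-creates them in a standalone program. The int field types and
the "any bucket non-zero" rule for HasDependencies are assumptions; the real
definitions live in internal/domain/connector.go, which this page does not show.

```go
package main

import "fmt"

// SentinelAgentIDs lists the four reserved IDs named in the commit message.
var SentinelAgentIDs = []string{"server-scanner", "cloud-aws-sm", "cloud-azure-kv", "cloud-gcp-sm"}

// IsSentinelAgent reports whether id is one of the reserved sentinel IDs.
func IsSentinelAgent(id string) bool {
	for _, s := range SentinelAgentIDs {
		if s == id {
			return true
		}
	}
	return false
}

// AgentDependencyCounts mirrors the three preflight buckets.
// The int field types are an assumption made for this sketch.
type AgentDependencyCounts struct {
	ActiveTargets      int
	ActiveCertificates int
	PendingJobs        int
}

// HasDependencies is assumed to be true when any bucket is non-zero.
func (c AgentDependencyCounts) HasDependencies() bool {
	return c.ActiveTargets > 0 || c.ActiveCertificates > 0 || c.PendingJobs > 0
}

func main() {
	fmt.Println(IsSentinelAgent("cloud-aws-sm"))
	fmt.Println(IsSentinelAgent("agent-001"))
	fmt.Println(AgentDependencyCounts{PendingJobs: 2}.HasDependencies())
}
```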
Service — 8-step ordered contract:
internal/service/agent_retire.go:RetireAgent(ctx, id, actor, force,
reason) enforces a fixed execution order:
  (1) sentinel guard: IsSentinelAgent(id) returns ErrAgentIsSentinel
      unconditionally; force=true does NOT bypass it.
  (2) fetch: ErrAgentNotFound on miss.
  (3) idempotency: if IsRetired() already, return
      AgentRetirementResult{AlreadyRetired: true} with no new audit
      event and no state change (safe to replay from flaky clients).
  (4) preflight counts: collectAgentDependencyCounts runs
      ActiveTargets, ActiveCertificates, PendingJobs sequentially
      (not in parallel; keeps the per-query timeout predictable and
      matches the repo's existing call-chain shape).
  (5) force-reason guard: force=true with an empty reason returns
      ErrForceReasonRequired (wired into the 400 status surface).
  (6) dependency guard: HasDependencies() with force=false
      returns BlockedByDependenciesError{Counts} (wired into the 409
      body with per-bucket counts).
  (7) mutation: single pinned retiredAt := time.Now(); agent
      retirement first, then cascade target retirement if force=true
      and dependencies exist, all under the repo's single transaction
      so every retired_at stamp carries the same pinned timestamp.
  (8) best-effort audit: agent_retired always;
      agent_retirement_cascaded additionally on the cascade path.
      Actor is whatever the handler resolves from the request; actor
      type is mapped by resolveActorType ("system"→System,
      "agent-" prefix→Agent, else→User). Audit emission failures are
      logged via slog.Error but do not abort the retirement (matches
      the house convention used by every other scheduler-emitted
      event).
BlockedByDependenciesError implements Error() as the sentinel message
followed by "(active_targets=%d, active_certificates=%d,
pending_jobs=%d)" and Unwrap() → ErrBlockedByDependencies. The single
struct satisfies errors.Is via Unwrap (used by the service-level
tests) and errors.As via the concrete type (used by the handler to
fish out Counts for the 409 body). ListRetiredAgents(page, perPage)
adds a separate paginated accessor with page<1→1 and perPage<1→50
normalization so retired rows are queryable without polluting the
default agent listing.
The guards are asymmetric by design: force=true overrides the
dependency guard but never the sentinel guard, and all four reserved
IDs are protected. Regression tests in
internal/service/agent_retire_test.go assert each of the eight steps
in order, plus sentinel bypass attempts and idempotency replay.
Handler + router — status-code surface:
internal/api/handler/agents.go:RetireAgent exposes seven status
codes on DELETE /agents/{id}:
  200 on a fresh retirement (body echoes AgentRetirementResult).
  204 on idempotent replay (AlreadyRetired=true; no new audit).
  400 on ErrForceReasonRequired.
  403 on ErrAgentIsSentinel.
  404 on ErrAgentNotFound.
  409 on BlockedByDependenciesError, with a custom body shape
      {error, counts{active_targets, active_certificates,
      pending_jobs}} that bypasses the default ErrorWithRequestID
      envelope so callers get the per-bucket numbers directly.
  500 on any other error.
The heartbeat handler, HandleHeartbeat, returns 410 Gone when the
agent is retired (ErrAgentRetired), signalling the agent to shut down.
Query params `force=true` and `reason=<text>` drive the cascade
path; both are forwarded as url.Values through the new MCP
transport.
internal/api/router/router.go registers GET /api/v1/agents/retired
as a literal path ahead of /api/v1/agents/{id}. Go 1.22 ServeMux
prefers the more specific literal pattern over the wildcard whatever
the registration order, so "retired" routes to the paginated
retired-agents listing instead of fetching a hypothetical agent named
"retired".
Agent binary — clean shutdown on 410:
cmd/agent/main.go gains the ErrAgentRetired sentinel, a
retiredOnce sync.Once, and a retiredSignal chan struct{}. A
markRetired(source, statusCode, body) helper closes the channel
exactly once; the Run() select loop observes the close and returns
ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
and exits cleanly instead of spinning in the heartbeat retry loop.
The 410 Gone surface is therefore terminal for the agent process.
MCP transport:
internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
a new additive transport method. Client.Delete is path-only; without
this method the retire tool would silently drop `force` and `reason`,
turning every cascade retire into a default soft-retire. The new
method shares do()'s 204 normalization and 4xx/5xx error
propagation so tool authors get one contract.
internal/mcp/tools.go + internal/mcp/types.go expose the
retire_agent tool with Force+Reason inputs wired through
DeleteWithQuery.
CLI:
cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
`agents list --retired` (client-side strip of --retired then
delegation to ListRetiredAgents, sharing --page/--per-page parsing
with the default listing) and `agents retire <id> [--force --reason
"…"]` (mirrors ErrForceReasonRequired — force without reason is
rejected client-side before the request is sent). JSON + table
output modes both honor the new columns.
Frontend:
web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
web/src/api/client.ts + web/src/api/types.ts expose the retire
endpoint and the retired-listing. 4 new Vitest regression cases.
OpenAPI:
api/openapi.yaml documents DELETE /agents/{id} with all seven
status codes, 410 on heartbeat, and the 409 per-bucket body shape.
Regression coverage (six new test files, all green):
internal/service/agent_retire_test.go — 8-step contract + sentinel guards
internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
internal/mcp/retire_agent_test.go — DeleteWithQuery wire-through
internal/cli/agent_retire_test.go — --retired listing + --force/--reason pairing
internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
internal/domain/connector_test.go — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies
Files:
api/openapi.yaml — DELETE + 410 + 409 body shape
cmd/agent/main.go — ErrAgentRetired, markRetired, retiredSignal
cmd/cli/main.go — handleAgents list/get/retire dispatch
docs/architecture.md, docs/concepts.md,
docs/testing-guide.md — retirement contract narrative
internal/api/handler/agents.go — RetireAgent, status surface, 410 on heartbeat
internal/api/handler/agent_handler_test.go — extended coverage
internal/api/handler/agent_retire_handler_test.go — new
internal/api/router/router.go — /agents/retired before /agents/{id}
internal/cli/agent_retire_test.go — new
internal/cli/client.go — ListRetiredAgents + RetireAgent
internal/domain/connector.go — IsRetired, SentinelAgentIDs,
IsSentinelAgent, AgentDependencyCounts,
ActorTypeAgent/System
internal/domain/connector_test.go — new
internal/integration/lifecycle_test.go — retirement fixture
internal/mcp/client.go — DeleteWithQuery additive transport
internal/mcp/retire_agent_test.go — new
internal/mcp/tools.go, internal/mcp/types.go — retire_agent tool + Force/Reason inputs
internal/repository/interfaces.go — AgentRepository retirement methods
internal/repository/postgres/agent.go — retire + cascade target retire + counts
internal/repository/postgres/migration_000015_test.go — new
internal/service/agent.go — wire into AgentService surface
internal/service/agent_retire.go — new 8-step contract
internal/service/agent_retire_test.go — new
internal/service/deployment.go — skip retired agents
internal/service/target.go — skip retired agents
internal/service/testutil_test.go — shared mocks extended
migrations/000015_agent_retire.up.sql — new
migrations/000015_agent_retire.down.sql — new
web/src/api/client.ts, types.ts + tests — retire endpoint wiring
web/src/pages/AgentsPage.tsx — retire UI
@@ -92,12 +92,27 @@ func (s *AgentService) Register(ctx context.Context, name string, hostname strin
}

// Heartbeat updates an agent's last seen time, status, and metadata.
//
// I-004: retired agents must be rejected up-front. A retired agent that is
// still polling is a zombie — its row exists only for audit history and must
// not be allowed to bump LastHeartbeatAt (which would resurrect it in stats
// dashboards and stale-offline sweeps). The sentinel ErrAgentRetired is
// returned unwrapped so the HTTP handler can map it to 410 Gone via
// errors.Is; the agent process detects the 410 and shuts down cleanly
// instead of continuing to heartbeat indefinitely.
func (s *AgentService) Heartbeat(ctx context.Context, agentID string, metadata *domain.AgentMetadata) error {
	agent, err := s.agentRepo.Get(ctx, agentID)
	if err != nil {
		return fmt.Errorf("failed to fetch agent: %w", err)
	}

	// I-004 guard: retired agents are frozen. Do not call UpdateHeartbeat —
	// bumping the timestamp would defeat the retired-row filter that protects
	// stats, scheduler sweeps, and handler listings.
	if agent.IsRetired() {
		return ErrAgentRetired
	}

	// Update heartbeat and metadata
	if err := s.agentRepo.UpdateHeartbeat(ctx, agentID, metadata); err != nil {
		return fmt.Errorf("failed to update heartbeat: %w", err)

@@ -0,0 +1,317 @@
package service

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	"time"

	"github.com/shankar0123/certctl/internal/domain"
)

// I-004 coverage-gap closure: the agent retirement surface.
//
// Before 000015, DELETE /api/v1/agents/{id} hard-deleted the agents row and
// the deployment_targets.agent_id FK CASCADE cleaned up downstream rows with
// no preflight, no archival, and no knowledge of in-flight jobs. Any cert
// still rotating through one of those targets would observe half-migrated
// state. I-004 closes that gap with a preflight + soft-retire + optional
// forced-cascade contract; the symbols in this file are the service-layer
// surface that the handler and operator UI bind against.

// ErrAgentIsSentinel is returned when an operator tries to retire one of the
// four reserved sentinel agent IDs (server-scanner, cloud-aws-sm,
// cloud-azure-kv, cloud-gcp-sm). These rows back the network scanner and the
// three cloud secret-manager discovery sources; retiring any of them orphans
// its subsystem. The guard fires unconditionally — force=true does not bypass
// it, because a sentinel is a structural invariant of the deployment, not
// a piece of fleet state the operator owns. Handler maps this to HTTP 403.
var ErrAgentIsSentinel = errors.New("agent is a reserved sentinel and cannot be retired")

// ErrBlockedByDependencies is returned by RetireAgent when at least one of
// (active targets, active certificates, pending jobs) referencing the agent
// is non-zero and force=false. The caller always receives it wrapped in
// a *BlockedByDependenciesError (see below), so handlers doing errors.As
// can surface the per-bucket counts in the 409 body for operator
// troubleshooting. Tests use errors.Is; handlers use errors.As.
var ErrBlockedByDependencies = errors.New("agent has active downstream dependencies")

// ErrForceReasonRequired is returned when force=true is supplied without a
// non-empty reason. The force escape hatch is deliberately chatty: operators
// pulling the emergency cord must leave an auditable breadcrumb explaining
// why a cascade was justified. Handler maps this to HTTP 400 so the operator
// retries with --reason rather than silently skipping the guard. Checked
// before any DB mutation to keep the no-reason path transactionally clean.
var ErrForceReasonRequired = errors.New("force=true requires a non-empty reason")

// ErrAgentRetired is returned by Heartbeat (and any future agent-authenticated
// call site) when a retired agent is still polling. The handler layer maps
// this to HTTP 410 Gone so the cmd/agent sendHeartbeat loop can detect it
// deterministically and shut down the agent process, rather than looping
// forever on a soft-retired identity. IsRetired() on the domain model is
// the single source of truth; the sentinel exists so service and handler
// callers can errors.Is against one symbol.
var ErrAgentRetired = errors.New("agent has been retired")

// BlockedByDependenciesError wraps ErrBlockedByDependencies and carries the
// per-bucket dependency snapshot the preflight pass captured. The embedded
// AgentDependencyCounts is the same struct the repo returns from the three
// CountActive* calls, so the handler can marshal it directly into the 409
// body without reshaping fields. Unwrap() satisfies errors.Is against the
// sentinel; Error() includes the counts so logs are diagnostic on their own.
type BlockedByDependenciesError struct {
	Counts domain.AgentDependencyCounts
}

// Error formats the wrapped error with the per-bucket counts. Kept short so
// it reads cleanly in slog output.
func (e *BlockedByDependenciesError) Error() string {
	return fmt.Sprintf(
		"%s (active_targets=%d, active_certificates=%d, pending_jobs=%d)",
		ErrBlockedByDependencies.Error(),
		e.Counts.ActiveTargets,
		e.Counts.ActiveCertificates,
		e.Counts.PendingJobs,
	)
}

// Unwrap lets errors.Is(err, ErrBlockedByDependencies) match the wrapped
// struct — the test contract (agent_retire_test.go:167) depends on it.
func (e *BlockedByDependenciesError) Unwrap() error { return ErrBlockedByDependencies }

// AgentRetirementResult is the outcome surface the handler returns to the
// operator. It discriminates the three happy paths the endpoint can take —
// idempotent no-op (AlreadyRetired), clean soft-retire (Cascade=false), and
// forced cascade (Cascade=true) — and always carries the retired_at timestamp
// and the dependency-count snapshot so the 200/204 response body can echo
// what was (or would have been) affected.
//
//   AlreadyRetired=true  → agent was already retired; no new audit
//                          event was emitted; RetiredAt is the
//                          original stamp, not the current time.
//   Cascade=false        → clean soft-retire; Counts is all zeros.
//   Cascade=true         → force=true retired agent + downstream
//                          targets; Counts is the PRE-cascade
//                          snapshot (so the operator sees what
//                          they just retired).
type AgentRetirementResult struct {
	AlreadyRetired bool
	Cascade        bool
	RetiredAt      time.Time
	Counts         domain.AgentDependencyCounts
}

// RetireAgent implements the I-004 retirement contract. Ordering matters —
// every guard fires before the one that would mutate state, so a rejected
// retire leaves zero trace (no audit event, no partial DB write):
//
//  1. Sentinel check (unconditional; force does not bypass).
//  2. Fetch agent (404 surfaces as-is from the repo).
//  3. Already-retired idempotency: return AlreadyRetired=true with NO new
//     audit event — the original retire already recorded one.
//  4. Preflight count pass via the three CountActive* repo methods.
//  5. Force-reason guard: force=true with empty reason is rejected here,
//     after the counts are known but before any mutation.
//  6. Default no-force path: any non-zero count returns
//     *BlockedByDependenciesError with counts attached.
//  7. Mutation: SoftRetire (no cascade) or RetireAgentWithCascade, with
//     a single retiredAt timestamp pinned BEFORE the repo call so the
//     audit event and the DB row agree to the nanosecond.
//  8. Audit: agent_retired always; agent_retirement_cascaded additionally
//     on the force=true cascade path.
//
// Actor comes from the handler's resolveActor (API key → user, agent key →
// agent-<id>, unauthenticated → "anonymous"); the service does not second-
// guess it. Audit emission is best-effort: a failed RecordEvent logs an
// error via slog.Error but does not fail the overall retirement, consistent
// with how the rest of the codebase treats audit as an observability concern
// rather than a correctness barrier.
func (s *AgentService) RetireAgent(ctx context.Context, id string, actor string, force bool, reason string) (*AgentRetirementResult, error) {
	// Step 1 — reserved-sentinel guard. Applies even under force=true.
	if domain.IsSentinelAgent(id) {
		return nil, ErrAgentIsSentinel
	}

	// Step 2 — existence check. Missing agent surfaces the repo's not-found
	// error verbatim so the handler can map it to 404 via its existing
	// detection path (the handler layer already has "not found" mapping
	// logic inherited from the pre-I-004 Delete endpoint).
	agent, err := s.agentRepo.Get(ctx, id)
	if err != nil {
		return nil, fmt.Errorf("failed to fetch agent: %w", err)
	}

	// Step 3 — idempotency. A retired agent returns AlreadyRetired=true
	// WITHOUT emitting a fresh audit event. Handler maps this to HTTP 204.
	// Guarding here (before preflight) means a re-retire of an agent that
	// now has zero deps doesn't spuriously "succeed again" and double-log.
	if agent.IsRetired() {
		return &AgentRetirementResult{
			AlreadyRetired: true,
			RetiredAt:      *agent.RetiredAt,
		}, nil
	}

	// Step 4 — preflight counts. All three run even when force=true: we
	// need them to populate AgentRetirementResult.Counts (the pre-cascade
	// snapshot). A repo failure here aborts the whole operation — partial
	// preflight is worse than no preflight.
	counts, err := s.collectAgentDependencyCounts(ctx, id)
	if err != nil {
		return nil, fmt.Errorf("failed to collect agent dependency counts: %w", err)
	}

	// Step 5 — force-reason guard. Positioned AFTER preflight so operators
	// who forgot --reason still see accurate counts when they retry. The
	// empty-reason rejection fires before any mutation, so the rejected
	// attempt leaves no audit noise.
	if force && reason == "" {
		return nil, ErrForceReasonRequired
	}

	// Step 6 — default path: block on any non-zero bucket. Wrapping the
	// sentinel in *BlockedByDependenciesError lets the handler use errors.As
	// to surface counts in the 409 body while tests use errors.Is against
	// the sentinel. Both callers are satisfied by the single Unwrap chain.
	if !force && counts.HasDependencies() {
		return nil, &BlockedByDependenciesError{Counts: counts}
	}

	// Step 7 — mutation. Pin retiredAt once so the audit event, the agent
	// row, and (on cascade) every deployment_targets row share the same
	// timestamp. Callers querying "what happened at T?" can correlate
	// retirement rows across tables without clock-skew tie-breaking.
	retiredAt := time.Now()
	cascade := force && counts.HasDependencies()

	if cascade {
		if err := s.agentRepo.RetireAgentWithCascade(ctx, id, retiredAt, reason); err != nil {
			return nil, fmt.Errorf("failed to retire agent with cascade: %w", err)
		}
	} else {
		if err := s.agentRepo.SoftRetire(ctx, id, retiredAt, reason); err != nil {
			return nil, fmt.Errorf("failed to soft-retire agent: %w", err)
		}
	}

	// Step 8 — audit. Two events on the cascade path so forensics can
	// distinguish "agent was retired" (agent_retired) from "downstream
	// targets were flipped" (agent_retirement_cascaded). Details on the
	// cascaded event carry the pre-cascade counts so a reviewer looking
	// only at the audit log knows how much state was affected. Emission
	// is best-effort — audit is observability, not a correctness barrier.
	actorType := s.resolveActorType(actor)
	details := map[string]interface{}{
		"actor":               actor,
		"reason":              reason,
		"force":               force,
		"active_targets":      counts.ActiveTargets,
		"active_certificates": counts.ActiveCertificates,
		"pending_jobs":        counts.PendingJobs,
	}
	if err := s.auditService.RecordEvent(ctx, actor, actorType,
		"agent_retired", "agent", id, details); err != nil {
		slog.Error("failed to record agent_retired audit event", "agent_id", id, "error", err)
	}
	if cascade {
		cascadeDetails := map[string]interface{}{
			"actor":               actor,
			"reason":              reason,
			"active_targets":      counts.ActiveTargets,
			"active_certificates": counts.ActiveCertificates,
			"pending_jobs":        counts.PendingJobs,
		}
		if err := s.auditService.RecordEvent(ctx, actor, actorType,
			"agent_retirement_cascaded", "agent", id, cascadeDetails); err != nil {
			slog.Error("failed to record agent_retirement_cascaded audit event", "agent_id", id, "error", err)
		}
	}

	return &AgentRetirementResult{
		AlreadyRetired: false,
		Cascade:        cascade,
		RetiredAt:      retiredAt,
		Counts:         counts,
	}, nil
}

// ListRetiredAgents returns the paginated list of retired agents in
// retired_at DESC order. This is the companion to ListAgents — which
// hides retired rows — so the operator UI can render a dedicated
// "Retired" tab without leaking retired rows into every other listing.
// Pagination defaults (page<1→1, perPage<1→50) are applied here as
// well as in the repo, so callers can pass 0s when they want defaults.
//
// Return shape harmonizes with handler.AgentService: a value slice
// (not pointer slice) and int64 total. The repo returns []*domain.Agent;
// this method dereferences into a value slice so the handler's
// PagedResponse marshals straight objects and so the compile-time
// interface assertion in agent_retire_handler_test.go:387 is satisfied.
// Nil repo entries are skipped defensively — the repo should never
// return them, but the handler contract is more important than the
// repo's (pointer-slice) convenience.
func (s *AgentService) ListRetiredAgents(ctx context.Context, page, perPage int) ([]domain.Agent, int64, error) {
	if page < 1 {
		page = 1
	}
	if perPage < 1 {
		perPage = 50
	}
	agents, total, err := s.agentRepo.ListRetired(ctx, page, perPage)
	if err != nil {
		return nil, 0, fmt.Errorf("failed to list retired agents: %w", err)
	}
	out := make([]domain.Agent, 0, len(agents))
	for _, a := range agents {
		if a == nil {
			continue
		}
		out = append(out, *a)
	}
	return out, int64(total), nil
}

// collectAgentDependencyCounts runs the three preflight COUNT queries in
// sequence and bundles the result. Sequential (not parallel) because the
// queries are cheap (<1ms each on the indexed columns added in 000015) and
// sequential keeps error handling simple. Any repo error short-circuits
// — we prefer to refuse the retire than make a half-informed decision.
func (s *AgentService) collectAgentDependencyCounts(ctx context.Context, id string) (domain.AgentDependencyCounts, error) {
	var counts domain.AgentDependencyCounts

	targets, err := s.agentRepo.CountActiveTargets(ctx, id)
	if err != nil {
		return counts, fmt.Errorf("count active targets: %w", err)
	}
	counts.ActiveTargets = targets

	certs, err := s.agentRepo.CountActiveCertificates(ctx, id)
	if err != nil {
		return counts, fmt.Errorf("count active certificates: %w", err)
	}
	counts.ActiveCertificates = certs

	jobs, err := s.agentRepo.CountPendingJobs(ctx, id)
	if err != nil {
		return counts, fmt.Errorf("count pending jobs: %w", err)
	}
	counts.PendingJobs = jobs

	return counts, nil
}

// resolveActorType maps an opaque actor string into the typed ActorType
// used by the audit schema. Matches the conventions the rest of the
// service layer uses: "system" → System, anything that looks like an
// agent identity → Agent, everything else → User.
func (s *AgentService) resolveActorType(actor string) domain.ActorType {
	switch {
	case actor == "system":
		return domain.ActorTypeSystem
	case len(actor) > 6 && actor[:6] == "agent-":
		return domain.ActorTypeAgent
	default:
		return domain.ActorTypeUser
	}
}
@@ -0,0 +1,396 @@
package service

import (
	"context"
	"errors"
	"log/slog"
	"testing"
	"time"

	"github.com/shankar0123/certctl/internal/domain"
)

// setupRetireTest wires up an AgentService with a single registered agent and
// returns (service, agentRepo, auditRepo) so tests can seed state and assert
// audit events. Kept minimal — tests that need targets/jobs/certs extend the
// returned repos directly.
func setupRetireTest(t *testing.T, agentID string) (*AgentService, *mockAgentRepo, *mockAuditRepo) {
	t.Helper()
	now := time.Now()
	agent := &domain.Agent{
		ID:              agentID,
		Name:            "prod-agent",
		Hostname:        "server-01",
		Status:          domain.AgentStatusOnline,
		RegisteredAt:    now,
		LastHeartbeatAt: &now,
		APIKeyHash:      "hash-" + agentID,
	}
	agentRepo := newMockAgentRepository()
	agentRepo.AddAgent(agent)
	certRepo := &mockCertRepo{
		Certs:    make(map[string]*domain.ManagedCertificate),
		Versions: make(map[string][]*domain.CertificateVersion),
	}
	jobRepo := &mockJobRepo{
		Jobs:          make(map[string]*domain.Job),
		StatusUpdates: make(map[string]domain.JobStatus),
	}
	targetRepo := &mockTargetRepo{
		Targets: make(map[string]*domain.DeploymentTarget),
	}
	auditRepo := &mockAuditRepo{Events: []*domain.AuditEvent{}}
	auditService := NewAuditService(auditRepo)
	issuerRegistry := NewIssuerRegistry(slog.Default())

	svc := NewAgentService(agentRepo, certRepo, jobRepo, targetRepo, auditService, issuerRegistry, nil)
	return svc, agentRepo, auditRepo
}

// TestRetireAgent_Sentinel_Rejected covers I-004's sentinel guard. The four
// well-known sentinel agent IDs back discovery sources and the network scanner
// — retiring them would orphan those subsystems. Contract: reject with
// ErrAgentIsSentinel regardless of force/reason.
func TestRetireAgent_Sentinel_Rejected(t *testing.T) {
	sentinels := []string{"server-scanner", "cloud-aws-sm", "cloud-azure-kv", "cloud-gcp-sm"}
	for _, id := range sentinels {
		t.Run(id, func(t *testing.T) {
			svc, _, _ := setupRetireTest(t, id)
			_, err := svc.RetireAgent(context.Background(), id, "alice", false, "")
			if !errors.Is(err, ErrAgentIsSentinel) {
				t.Fatalf("retire(sentinel %q) err=%v want ErrAgentIsSentinel", id, err)
			}
			// Sentinel rejection must be deterministic even under force=true.
			_, err = svc.RetireAgent(context.Background(), id, "alice", true, "forced by operator")
			if !errors.Is(err, ErrAgentIsSentinel) {
				t.Fatalf("retire(sentinel %q force=true) err=%v want ErrAgentIsSentinel", id, err)
			}
		})
	}
}

// TestRetireAgent_NotFound covers the 404 preflight path. The handler maps
// ErrAgentNotFound-equivalent sentinel to 404; the service must surface it
// cleanly without partial state mutation.
func TestRetireAgent_NotFound(t *testing.T) {
	svc, _, _ := setupRetireTest(t, "agent-001")
	_, err := svc.RetireAgent(context.Background(), "agent-does-not-exist", "alice", false, "")
	if err == nil {
		t.Fatalf("retire(missing id) err=nil want not-found error")
	}
}

// TestRetireAgent_AlreadyRetired_Idempotent covers the 204 No Content path.
// Retiring an already-retired agent must succeed without error and without
// emitting a new audit event (the first retirement already recorded one).
// Idempotency matters because the handler is the escape hatch for operators
// re-issuing a failed retire after a partial failure mid-cascade.
func TestRetireAgent_AlreadyRetired_Idempotent(t *testing.T) {
	svc, agentRepo, auditRepo := setupRetireTest(t, "agent-001")
	past := time.Now().Add(-24 * time.Hour)
	reason := "operator decommissioned"
	agent := agentRepo.Agents["agent-001"]
	agent.RetiredAt = &past
	agent.RetiredReason = &reason

	result, err := svc.RetireAgent(context.Background(), "agent-001", "alice", false, "")
	if err != nil {
		t.Fatalf("retire(already retired) err=%v want nil (idempotent)", err)
	}
	if result == nil || !result.AlreadyRetired {
		t.Fatalf("retire(already retired) result=%+v want AlreadyRetired=true", result)
	}
	// Retire-on-retired must not emit a duplicate audit event.
	for _, e := range auditRepo.Events {
		if e.Action == "agent_retired" && e.ResourceID == "agent-001" {
			t.Fatalf("retire(already retired) emitted duplicate agent_retired audit event")
		}
	}
}

// TestRetireAgent_NoDeps_SoftSucceeds covers the happy 200 path: no active
// targets, certs, or jobs referencing the agent. Soft-retire stamps
// RetiredAt + RetiredReason and emits agent_retired audit event.
func TestRetireAgent_NoDeps_SoftSucceeds(t *testing.T) {
	svc, agentRepo, auditRepo := setupRetireTest(t, "agent-001")

	before := time.Now().Add(-time.Second)
	result, err := svc.RetireAgent(context.Background(), "agent-001", "alice", false, "")
	if err != nil {
		t.Fatalf("retire(clean) err=%v want nil", err)
	}
	if result == nil {
		t.Fatal("retire(clean) result=nil want non-nil")
	}
	if result.AlreadyRetired {
		t.Fatalf("retire(clean) result.AlreadyRetired=true want false")
	}
	if result.Cascade {
		t.Fatalf("retire(clean) result.Cascade=true want false (no deps to cascade)")
	}
	if !result.RetiredAt.After(before) {
		t.Fatalf("retire(clean) RetiredAt=%v not after test start %v", result.RetiredAt, before)
	}

	agent := agentRepo.Agents["agent-001"]
	if agent.RetiredAt == nil {
		t.Fatalf("retire(clean) agent.RetiredAt=nil want stamped")
	}

	// Audit event must be emitted with action=agent_retired, actor=alice.
	found := false
	for _, e := range auditRepo.Events {
		if e.Action == "agent_retired" && e.ResourceID == "agent-001" && e.Actor == "alice" {
			found = true
			break
		}
	}
	if !found {
		t.Fatalf("retire(clean) missing agent_retired audit event for alice, events=%+v", auditRepo.Events)
	}
}

// TestRetireAgent_WithDeps_NoForce_Blocked covers the 409 preflight path. When
// the agent has any of: active non-retired targets, certs deployed via those
// targets, or pending jobs — a default retire must block with
// ErrBlockedByDependencies and the counts must be reachable via errors.As so
// the handler can build the 409 body.
func TestRetireAgent_WithDeps_NoForce_Blocked(t *testing.T) {
	svc, agentRepo, _ := setupRetireTest(t, "agent-001")
	// Seed dependency counts directly on the mock — the production repo
	// implements CountActive* queries; the mock exposes them as fields.
	agentRepo.ActiveTargetCounts["agent-001"] = 3
	agentRepo.ActiveCertCounts["agent-001"] = 7
	agentRepo.PendingJobCounts["agent-001"] = 2

	_, err := svc.RetireAgent(context.Background(), "agent-001", "alice", false, "")
	if !errors.Is(err, ErrBlockedByDependencies) {
		t.Fatalf("retire(with deps, no force) err=%v want ErrBlockedByDependencies", err)
	}
	var blocked *BlockedByDependenciesError
	if !errors.As(err, &blocked) {
		t.Fatalf("retire(with deps) err=%v want wrapped *BlockedByDependenciesError", err)
	}
	if blocked.Counts.ActiveTargets != 3 {
		t.Errorf("blocked.Counts.ActiveTargets=%d want 3", blocked.Counts.ActiveTargets)
	}
	if blocked.Counts.ActiveCertificates != 7 {
		t.Errorf("blocked.Counts.ActiveCertificates=%d want 7", blocked.Counts.ActiveCertificates)
	}
	if blocked.Counts.PendingJobs != 2 {
		t.Errorf("blocked.Counts.PendingJobs=%d want 2", blocked.Counts.PendingJobs)
	}
	// Agent must still be un-retired after preflight block.
	if agentRepo.Agents["agent-001"].RetiredAt != nil {
		t.Fatalf("retire(blocked) left RetiredAt stamped; preflight must be transactionally safe")
	}
}

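The test above expects one error value to satisfy both `errors.Is(err, ErrBlockedByDependencies)` and `errors.As(err, &blocked)`. The commit's actual type definition is not shown in this diff, so the following is only a minimal sketch of one way to get that behavior: a struct error whose `Unwrap` returns the sentinel. The `DependencyCounts` struct, `preflight`, `isBlocked`, and `blockedTargets` are hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBlockedByDependencies is the sentinel that callers match with errors.Is.
var ErrBlockedByDependencies = errors.New("agent retirement blocked by dependencies")

// DependencyCounts is an assumed shape for the preflight counts.
type DependencyCounts struct {
	ActiveTargets      int
	ActiveCertificates int
	PendingJobs        int
}

// BlockedByDependenciesError carries the counts alongside the sentinel.
type BlockedByDependenciesError struct {
	Counts DependencyCounts
}

func (e *BlockedByDependenciesError) Error() string {
	return fmt.Sprintf("%v: targets=%d certificates=%d jobs=%d",
		ErrBlockedByDependencies, e.Counts.ActiveTargets,
		e.Counts.ActiveCertificates, e.Counts.PendingJobs)
}

// Unwrap makes errors.Is(err, ErrBlockedByDependencies) report true.
func (e *BlockedByDependenciesError) Unwrap() error { return ErrBlockedByDependencies }

// preflight (hypothetical) returns the typed error when any count is non-zero.
func preflight(c DependencyCounts) error {
	if c.ActiveTargets > 0 || c.ActiveCertificates > 0 || c.PendingJobs > 0 {
		return &BlockedByDependenciesError{Counts: c}
	}
	return nil
}

// isBlocked reports whether err matches the sentinel anywhere in its chain.
func isBlocked(err error) bool { return errors.Is(err, ErrBlockedByDependencies) }

// blockedTargets extracts ActiveTargets via errors.As, or -1 if err is not
// a blocked-by-dependencies error.
func blockedTargets(err error) int {
	var blocked *BlockedByDependenciesError
	if errors.As(err, &blocked) {
		return blocked.Counts.ActiveTargets
	}
	return -1
}

func main() {
	inner := preflight(DependencyCounts{ActiveTargets: 3, ActiveCertificates: 7, PendingJobs: 2})
	// Extra wrapping with %w keeps both Is and As working.
	err := fmt.Errorf("retire agent-001: %w", inner)
	fmt.Println(isBlocked(err), blockedTargets(err))
}
```

With this shape a handler can branch on the sentinel and still read the counts to build a 409 response body.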
// TestRetireAgent_WithDeps_Force_NoReason_Rejected covers the 400 guard on the
// force escape hatch. Operators using force=true must supply a justifying
// reason; empty reason is rejected before any DB mutation.
func TestRetireAgent_WithDeps_Force_NoReason_Rejected(t *testing.T) {
	svc, agentRepo, _ := setupRetireTest(t, "agent-001")
	agentRepo.ActiveTargetCounts["agent-001"] = 1

	_, err := svc.RetireAgent(context.Background(), "agent-001", "alice", true, "")
	if !errors.Is(err, ErrForceReasonRequired) {
		t.Fatalf("retire(force, no reason) err=%v want ErrForceReasonRequired", err)
	}
	if agentRepo.Agents["agent-001"].RetiredAt != nil {
		t.Fatalf("retire(force, no reason) left RetiredAt stamped; guard must fire before mutation")
	}
}

// TestRetireAgent_WithDeps_Force_Cascades covers the force=true transactional
// path: agent retires, downstream targets also soft-retire with the supplied
// reason, and the result surface indicates cascade happened. Reason
// propagates to every cascaded row so post-mortem forensics can trace the
// cascade to a single operator action.
func TestRetireAgent_WithDeps_Force_Cascades(t *testing.T) {
	svc, agentRepo, auditRepo := setupRetireTest(t, "agent-001")
	agentRepo.ActiveTargetCounts["agent-001"] = 2
	agentRepo.ActiveCertCounts["agent-001"] = 5
	agentRepo.PendingJobCounts["agent-001"] = 1

	reason := "decommissioning rack 7"
	result, err := svc.RetireAgent(context.Background(), "agent-001", "alice", true, reason)
	if err != nil {
		t.Fatalf("retire(force, reason) err=%v want nil", err)
	}
	if result == nil {
		t.Fatal("retire(force) result=nil want non-nil")
	}
	if !result.Cascade {
		t.Fatalf("retire(force) result.Cascade=false want true")
	}
	if result.Counts.ActiveTargets != 2 {
		t.Errorf("result.Counts.ActiveTargets=%d want 2 (pre-cascade snapshot)", result.Counts.ActiveTargets)
	}

	agent := agentRepo.Agents["agent-001"]
	if agent.RetiredAt == nil {
		t.Fatalf("retire(force) agent.RetiredAt=nil want stamped")
	}
	if agent.RetiredReason == nil || *agent.RetiredReason != reason {
		t.Fatalf("retire(force) RetiredReason=%v want %q", agent.RetiredReason, reason)
	}

	// Two audit events required: agent_retired + agent_retirement_cascaded.
	// The cascaded event captures which downstream resources were affected.
	var haveRetired, haveCascaded bool
	for _, e := range auditRepo.Events {
		if e.ResourceID == "agent-001" {
			switch e.Action {
			case "agent_retired":
				haveRetired = true
			case "agent_retirement_cascaded":
				haveCascaded = true
			}
		}
	}
	if !haveRetired {
		t.Errorf("retire(force) missing agent_retired audit event")
	}
	if !haveCascaded {
		t.Errorf("retire(force) missing agent_retirement_cascaded audit event")
	}
}

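Read together, the RetireAgent tests above pin a set of guards: sentinel rejection, not-found, idempotent re-retire, mandatory reason under force, dependency block, and forced cascade. The service body is not part of this excerpt, so the sketch below only illustrates one guard ordering that would satisfy those tests. The `retireRequest` struct, `decide`, and the outcome labels are hypothetical, and the choice to treat force with zero dependencies as a plain soft-retire is an assumption.

```go
package main

import "fmt"

// outcome values are hypothetical labels for the paths the tests exercise.
type outcome string

const (
	outcomeSentinel       outcome = "rejected: sentinel agent"
	outcomeNotFound       outcome = "rejected: not found"
	outcomeAlreadyRetired outcome = "no-op: already retired"
	outcomeReasonRequired outcome = "rejected: force requires a reason"
	outcomeBlocked        outcome = "rejected: blocked by dependencies"
	outcomeCascaded       outcome = "retired with cascade"
	outcomeSoftRetired    outcome = "soft-retired"
)

// retireRequest flattens the inputs a preflight would need (assumed shape).
type retireRequest struct {
	IsSentinel     bool
	Exists         bool
	AlreadyRetired bool
	Force          bool
	Reason         string
	Dependencies   int // active targets + certificates + pending jobs
}

// decide evaluates the guards in an assumed order; every rejection happens
// before any state would be mutated.
func decide(r retireRequest) outcome {
	switch {
	case r.IsSentinel:
		return outcomeSentinel // holds even when Force is true
	case !r.Exists:
		return outcomeNotFound
	case r.AlreadyRetired:
		return outcomeAlreadyRetired // idempotent, no new audit event
	case r.Force && r.Reason == "":
		return outcomeReasonRequired
	case r.Dependencies > 0 && !r.Force:
		return outcomeBlocked
	case r.Dependencies > 0:
		return outcomeCascaded
	default:
		return outcomeSoftRetired
	}
}

func main() {
	fmt.Println(decide(retireRequest{Exists: true, Dependencies: 3}))
	fmt.Println(decide(retireRequest{Exists: true, Dependencies: 3, Force: true, Reason: "decommissioning rack 7"}))
}
```

Keeping the decision in a pure function like this makes each guard testable without a repository mock, at the cost of duplicating state the service already loads.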
// TestRetireAgent_EmitsAuditEvent pins the audit contract for I-004:
// every retire path that mutates DB state emits at least one audit event with
// the operator's actor identity, so post-hoc compliance/forensics can
// reconstruct who retired what and when.
func TestRetireAgent_EmitsAuditEvent(t *testing.T) {
	svc, _, auditRepo := setupRetireTest(t, "agent-007")

	_, err := svc.RetireAgent(context.Background(), "agent-007", "compliance-bot", false, "")
	if err != nil {
		t.Fatalf("retire err=%v want nil", err)
	}
	for _, e := range auditRepo.Events {
		if e.Action == "agent_retired" && e.ResourceID == "agent-007" {
			if e.Actor != "compliance-bot" {
				t.Errorf("audit event Actor=%q want compliance-bot", e.Actor)
			}
			return
		}
	}
	t.Fatalf("no agent_retired audit event emitted, events=%+v", auditRepo.Events)
}

// TestHeartbeat_RetiredAgent_ReturnsErrAgentRetired covers the 410 Gone
// contract. A retired agent that is still polling must be told its identity
// is no longer accepted — the agent process should detect this and shut
// down rather than continue heartbeating indefinitely.
func TestHeartbeat_RetiredAgent_ReturnsErrAgentRetired(t *testing.T) {
	svc, agentRepo, _ := setupRetireTest(t, "agent-001")
	past := time.Now().Add(-time.Hour)
	reason := "decommissioned"
	agentRepo.Agents["agent-001"].RetiredAt = &past
	agentRepo.Agents["agent-001"].RetiredReason = &reason

	err := svc.Heartbeat(context.Background(), "agent-001", &domain.AgentMetadata{
		OS:           "linux",
		Architecture: "amd64",
		Hostname:     "server-01",
	})
	if !errors.Is(err, ErrAgentRetired) {
		t.Fatalf("heartbeat(retired) err=%v want ErrAgentRetired", err)
	}
	// Retired heartbeat must NOT bump LastHeartbeatAt — otherwise the retired
	// agent could resurrect itself in stats/observability dashboards.
	if _, bumped := agentRepo.HeartbeatUpdates["agent-001"]; bumped {
		t.Fatalf("heartbeat(retired) updated LastHeartbeatAt; retired agents must be frozen")
	}
}

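The heartbeat test above checks only the service-level error; the 410 Gone response it mentions is produced by a handler that is not shown in this excerpt. As a rough sketch of that mapping, assuming a handler that matches `ErrAgentRetired` with `errors.Is`, the snippet below uses only the standard library. The `heartbeatFunc` type, `heartbeatHandler`, `statusFor`, the `agent_id` query parameter, and the 200 success status are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// ErrAgentRetired mirrors the sentinel name used in the test above.
var ErrAgentRetired = errors.New("agent is retired")

// heartbeatFunc stands in for the service call (hypothetical signature).
type heartbeatFunc func(agentID string) error

// heartbeatHandler maps service errors onto HTTP status codes.
func heartbeatHandler(hb heartbeatFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		err := hb(r.URL.Query().Get("agent_id"))
		switch {
		case err == nil:
			w.WriteHeader(http.StatusOK)
		case errors.Is(err, ErrAgentRetired):
			// 410 tells a still-polling agent its identity is gone for good.
			http.Error(w, "agent retired", http.StatusGone)
		default:
			http.Error(w, "internal error", http.StatusInternalServerError)
		}
	}
}

// statusFor runs the handler once in-process and returns the status code.
func statusFor(hb heartbeatFunc) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/heartbeat?agent_id=agent-001", nil)
	heartbeatHandler(hb)(rec, req)
	return rec.Code
}

func main() {
	retired := func(string) error { return fmt.Errorf("heartbeat: %w", ErrAgentRetired) }
	fmt.Println(statusFor(retired))
}
```

Matching with `errors.Is` rather than `==` keeps the mapping correct if the service wraps the sentinel with extra context.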
// TestListAgents_DefaultExcludesRetired covers the contract that the
// handler-facing ListAgents call hides retired rows by default. Otherwise
// every dashboard that paginates agents would surface retired stragglers.
// An explicit "list retired" endpoint (ListRetiredAgents) covers the audit
// use case.
func TestListAgents_DefaultExcludesRetired(t *testing.T) {
	svc, agentRepo, _ := setupRetireTest(t, "agent-active")
	// Seed one retired agent alongside the active one.
	past := time.Now().Add(-24 * time.Hour)
	reason := "old hardware"
	agentRepo.AddAgent(&domain.Agent{
		ID:            "agent-retired",
		Name:          "retired-agent",
		Hostname:      "server-old",
		Status:        domain.AgentStatusOffline,
		RegisteredAt:  past,
		APIKeyHash:    "hash-retired",
		RetiredAt:     &past,
		RetiredReason: &reason,
	})

	agents, total, err := svc.ListAgents(context.Background(), 1, 50)
	if err != nil {
		t.Fatalf("ListAgents err=%v want nil", err)
	}
	for _, a := range agents {
		if a.ID == "agent-retired" {
			t.Fatalf("ListAgents returned retired agent %q in default listing", a.ID)
		}
	}
	if total != 1 {
		t.Errorf("ListAgents total=%d want 1 (only active)", total)
	}

	// ListRetiredAgents must surface retired-only, with count=1.
	retired, retiredTotal, err := svc.ListRetiredAgents(context.Background(), 1, 50)
	if err != nil {
		t.Fatalf("ListRetiredAgents err=%v want nil", err)
	}
	if retiredTotal != 1 {
		t.Errorf("ListRetiredAgents total=%d want 1", retiredTotal)
	}
	if len(retired) != 1 || retired[0].ID != "agent-retired" {
		t.Fatalf("ListRetiredAgents got=%+v want [agent-retired]", retired)
	}
}

// TestMarkStaleAgentsOffline_SkipsRetired covers the stale-offline sweeper
// interaction with retirement. A retired agent must not be re-surfaced as
// a state transition ("Online → Offline") by the scheduler, because its
// Status column is preserved as the last-known operational state at
// retirement time and RetiredAt is the source of truth for filtering.
func TestMarkStaleAgentsOffline_SkipsRetired(t *testing.T) {
	svc, agentRepo, _ := setupRetireTest(t, "agent-live")
	// Active agent is currently stale (no heartbeat for 10 minutes) — eligible
	// for Online→Offline transition.
	stale := time.Now().Add(-10 * time.Minute)
	agentRepo.Agents["agent-live"].LastHeartbeatAt = &stale

	// Retired agent was also stale at retirement time, but must NOT be
	// touched by the sweeper.
	past := time.Now().Add(-24 * time.Hour)
	reason := "hw failure"
	agentRepo.AddAgent(&domain.Agent{
		ID:              "agent-retired",
		Name:            "dead-agent",
		Hostname:        "server-old",
		Status:          domain.AgentStatusOnline, // preserved last-seen status
		RegisteredAt:    past,
		LastHeartbeatAt: &past,
		APIKeyHash:      "hash-dead",
		RetiredAt:       &past,
		RetiredReason:   &reason,
	})

	if err := svc.MarkStaleAgentsOffline(context.Background(), 5*time.Minute); err != nil {
		t.Fatalf("MarkStaleAgentsOffline err=%v want nil", err)
	}

	// Active-stale agent should flip Online → Offline.
	if got := agentRepo.Agents["agent-live"].Status; got != domain.AgentStatusOffline {
		t.Errorf("agent-live Status=%s want Offline", got)
	}
	// Retired agent's Status column must be frozen at Online (its preserved
	// last-seen state); the sweeper must skip it.
	if got := agentRepo.Agents["agent-retired"].Status; got != domain.AgentStatusOnline {
		t.Errorf("agent-retired Status=%s want Online (frozen); sweeper touched retired row", got)
	}
}
@@ -145,6 +145,31 @@ func (s *DeploymentService) ProcessDeploymentJob(ctx context.Context, job *domai
 		return fmt.Errorf("failed to fetch agent: %w", err)
 	}
 
+	// I-004: AgentRepository.Get surfaces retired rows by design (for the GUI
+	// banner + 410 Gone heartbeat path). Deployments must never dispatch to a
+	// retired agent — it will never heartbeat again and the target row should
+	// itself have been cascade-retired when the agent was force-retired. A job
+	// slipping through here would otherwise hit the heartbeat-staleness branch
+	// below with the misleading reason "agent is offline"; we want operators to
+	// see the real cause. Fail the job with an explicit reason, send a
+	// deployment notification so the owner is alerted, and record an audit
+	// event. Falls through the same notify+audit shape as the offline branch.
+	if agent.IsRetired() {
+		updateErr := s.jobRepo.UpdateStatus(ctx, job.ID, domain.JobStatusFailed, "assigned agent is retired")
+		if updateErr != nil {
+			slog.Error("failed to update job status", "job_id", job.ID, "error", updateErr)
+		}
+		if notifErr := s.notificationSvc.SendDeploymentNotification(ctx, cert, target, false, fmt.Errorf("agent retired")); notifErr != nil {
+			slog.Error("failed to send deployment notification", "error", notifErr)
+		}
+		if auditErr := s.auditService.RecordEvent(ctx, "system", domain.ActorTypeSystem,
+			"deployment_job_failed", "certificate", job.CertificateID,
+			map[string]interface{}{"job_id": job.ID, "reason": "agent retired", "target_id": targetID, "agent_id": agentID}); auditErr != nil {
+			slog.Error("failed to record audit event", "error", auditErr)
+		}
+		return fmt.Errorf("agent %s is retired", agentID)
+	}
+
 	// Check agent heartbeat (must be within last 5 minutes)
 	if agent.LastHeartbeatAt != nil && time.Since(*agent.LastHeartbeatAt) > 5*time.Minute {
 		updateErr := s.jobRepo.UpdateStatus(ctx, job.ID, domain.JobStatusFailed, "agent is offline")

@@ -232,6 +232,18 @@ func (s *TargetService) TestConnection(ctx context.Context, id string) error {
 		return fmt.Errorf("assigned agent not found: %w", err)
 	}
 
+	// I-004: AgentRepository.Get intentionally surfaces retired rows (the banner
+	// + 410 Gone paths need to see them). A test against a retired agent can
+	// never succeed — the agent is tombstoned, will never heartbeat again, and
+	// any active targets have already been cascade-retired alongside it. Fail
+	// fast with an explicit message instead of falling through to the Status /
+	// heartbeat checks, which would produce a misleading "agent is Offline" or
+	// "heartbeat stale" diagnostic.
+	if agent.IsRetired() {
+		s.updateTestStatus(ctx, target, "failed")
+		return fmt.Errorf("assigned agent %s is retired", agent.ID)
+	}
+
 	if agent.Status != domain.AgentStatusOnline {
 		s.updateTestStatus(ctx, target, "failed")
 		return fmt.Errorf("assigned agent %s is %s (expected Online)", agent.ID, agent.Status)
@@ -293,9 +305,20 @@ func (s *TargetService) CreateTarget(ctx context.Context, target domain.Deployme
 	if target.AgentID == "" {
 		return nil, fmt.Errorf("%w: agent_id is required", ErrAgentNotFound)
 	}
-	if _, err := s.agentRepo.Get(ctx, target.AgentID); err != nil {
+	agent, err := s.agentRepo.Get(ctx, target.AgentID)
+	if err != nil {
 		return nil, fmt.Errorf("%w: %s", ErrAgentNotFound, target.AgentID)
 	}
+	// I-004: refuse to attach new targets to a retired agent. The agent is
+	// tombstoned and no deployments would ever succeed against it; letting a
+	// row slip past here would immediately be cascade-retired on the next
+	// dependency sweep and confuse operators ("why is this brand-new target
+	// already retired?"). Treating retired agents as "not found" for creation
+	// purposes keeps the error surface tight and matches the default-list
+	// contract established by repository.AgentRepository.List.
+	if agent.IsRetired() {
+		return nil, fmt.Errorf("%w: %s (retired)", ErrAgentNotFound, target.AgentID)
+	}
 
 	if target.ID == "" {
 		target.ID = generateID("target")

@@ -4,6 +4,7 @@ import (
 	"context"
 	"database/sql"
 	"errors"
+	"sort"
 	"sync"
 	"time"
 
@@ -607,7 +608,14 @@ func (m *mockRenewalPolicyRepo) AddPolicy(policy *domain.RenewalPolicy) {
 	m.Policies[policy.ID] = policy
 }
 
-// mockAgentRepo is a test implementation of AgentRepository
+// mockAgentRepo is a test implementation of AgentRepository.
+//
+// I-004: ActiveTargetCounts / ActiveCertCounts / PendingJobCounts are keyed by
+// agent ID and read back verbatim by the Count* methods — the retirement
+// service's preflight pokes these maps to simulate "agent has N active
+// deployments / M deployed certs / K pending jobs" without having to seed
+// real target/cert/job rows across multiple mock repos. An unset key means
+// zero, matching the production repo behavior on an agent with no deps.
 type mockAgentRepo struct {
 	mu     sync.Mutex
 	Agents map[string]*domain.Agent
@@ -619,8 +627,27 @@ type mockAgentRepo struct {
 	ListErr            error
 	UpdateHeartbeatErr error
 	GetByAPIKeyErr     error
+	// I-004 preflight count seeds (read by CountActiveTargets etc.).
+	ActiveTargetCounts map[string]int
+	ActiveCertCounts   map[string]int
+	PendingJobCounts   map[string]int
+	// I-004 retirement write-path error seams. Let tests force a SoftRetire
+	// or RetireAgentWithCascade failure after preflight passed, so the
+	// service's error surfacing (wrap+return, skip audit, etc.) can be
+	// exercised without having to stand up a real PG connection.
+	SoftRetireErr    error
+	RetireCascadeErr error
+	CountErr         error
+	ListRetiredErr   error
 }
 
+// List mirrors the production repo contract post-I-004: it returns only
+// ACTIVE agents (RetiredAt == nil). Tests that seed a retired agent via
+// AddAgent and then call a List-driven service method (e.g. ListAgents,
+// MarkStaleAgentsOffline, stats dashboards) must not see the retired row
+// here — otherwise the mock would pass while the real planner filters it
+// out at the WHERE clause level. ListRetired is the companion method for
+// explicit retired-only listing.
 func (m *mockAgentRepo) List(ctx context.Context) ([]*domain.Agent, error) {
 	m.mu.Lock()
 	defer m.mu.Unlock()
@@ -629,6 +656,9 @@ func (m *mockAgentRepo) List(ctx context.Context) ([]*domain.Agent, error) {
 	}
 	var agents []*domain.Agent
 	for _, a := range m.Agents {
+		if a.RetiredAt != nil {
+			continue
+		}
 		agents = append(agents, a)
 	}
 	return agents, nil
@@ -726,6 +756,134 @@ func (m *mockAgentRepo) AddAgent(agent *domain.Agent) {
 	m.Agents[agent.ID] = agent
 }
 
+// ListRetired returns the paginated retired-agents slice + total count.
+// Matches the production repo contract: RetiredAt != nil, sorted by
+// RetiredAt DESC, page<1 → 1, perPage<1 → 50. Sort is done in-memory over
+// the keyed map so the mock stays dependency-free. I-004.
+func (m *mockAgentRepo) ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error) {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.ListRetiredErr != nil {
+		return nil, 0, m.ListRetiredErr
+	}
+	if page < 1 {
+		page = 1
+	}
+	if perPage < 1 {
+		perPage = 50
+	}
+	var retired []*domain.Agent
+	for _, a := range m.Agents {
+		if a.RetiredAt != nil {
+			retired = append(retired, a)
+		}
+	}
+	total := len(retired)
+	// Sort by RetiredAt DESC — most recent first. The real query uses the
+	// partial idx_agents_retired_at index; here we sort in Go.
+	sort.SliceStable(retired, func(i, j int) bool {
+		return retired[i].RetiredAt.After(*retired[j].RetiredAt)
+	})
+	// Apply page/perPage window.
+	offset := (page - 1) * perPage
+	if offset >= total {
+		return nil, total, nil
+	}
+	end := offset + perPage
+	if end > total {
+		end = total
+	}
+	return retired[offset:end], total, nil
+}
+
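The page/perPage handling in the ListRetired mock above (defaults for out-of-range inputs, then an offset window clamped to the total) can be isolated into a small helper. This is only an illustrative sketch: `pageWindow` is a hypothetical name that does not appear in the commit, and unlike the mock it returns an empty window instead of a nil slice when the page is past the end.

```go
package main

import "fmt"

// pageWindow converts 1-based page/perPage values into a half-open
// [start, end) index range over total items. Out-of-range inputs fall back
// to the same defaults the mock uses: page<1 becomes 1, perPage<1 becomes 50.
func pageWindow(total, page, perPage int) (start, end int) {
	if page < 1 {
		page = 1
	}
	if perPage < 1 {
		perPage = 50
	}
	start = (page - 1) * perPage
	if start >= total {
		// Past the last page: empty window.
		return total, total
	}
	end = start + perPage
	if end > total {
		end = total
	}
	return start, end
}

func main() {
	items := []string{"a", "b", "c", "d", "e"}
	start, end := pageWindow(len(items), 2, 2)
	fmt.Println(items[start:end])
}
```

Returning indices rather than a slice keeps the helper independent of the element type, so the same window logic could back both an in-memory mock and an OFFSET/LIMIT query.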
+// SoftRetire stamps RetiredAt + RetiredReason on the agent row. Mirrors
+// the real repo's idempotent semantics: a row already retired is left
+// untouched (zero-rows-affected is not an error). I-004 preserves
+// retirement metadata across re-retire attempts — whoever retired it
+// first owns the audit trail.
+func (m *mockAgentRepo) SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.SoftRetireErr != nil {
+		return m.SoftRetireErr
+	}
+	agent, ok := m.Agents[id]
+	if !ok {
+		return errNotFound
+	}
+	if agent.RetiredAt != nil {
+		return nil // already retired — no-op
+	}
+	stamped := retiredAt
+	agent.RetiredAt = &stamped
+	stampedReason := reason
+	agent.RetiredReason = &stampedReason
+	return nil
+}
+
+// RetireAgentWithCascade stamps the agent row the same way SoftRetire
+// does. The real repo also stamps every active deployment_targets row
+// in the same transaction; the mock can't do that because targets live
+// in mockTargetRepo, which the retirement service doesn't write to
+// through this repo interface. Tests that need to assert cascade
+// semantics on targets should seed mockTargetRepo directly and verify
+// the service-layer audit event captured the cascade count. I-004.
+func (m *mockAgentRepo) RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.RetireCascadeErr != nil {
+		return m.RetireCascadeErr
+	}
+	agent, ok := m.Agents[id]
+	if !ok {
+		return errNotFound
+	}
+	if agent.RetiredAt != nil {
+		return nil // already retired — no-op (same as production transaction)
+	}
+	stamped := retiredAt
+	agent.RetiredAt = &stamped
+	stampedReason := reason
+	agent.RetiredReason = &stampedReason
+	return nil
+}
+
+// CountActiveTargets returns the seeded ActiveTargetCounts value (0 if
+// unset). Matches the real repo signature: COUNT of non-retired
+// deployment_targets with agent_id=$1. I-004 preflight.
+func (m *mockAgentRepo) CountActiveTargets(ctx context.Context, agentID string) (int, error) {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.CountErr != nil {
+		return 0, m.CountErr
+	}
+	return m.ActiveTargetCounts[agentID], nil
+}
+
+// CountActiveCertificates returns the seeded ActiveCertCounts value.
+// Real query: COUNT(DISTINCT certificate_id) across
+// certificate_target_mappings ↔ deployment_targets on agent_id. I-004.
+func (m *mockAgentRepo) CountActiveCertificates(ctx context.Context, agentID string) (int, error) {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.CountErr != nil {
+		return 0, m.CountErr
+	}
+	return m.ActiveCertCounts[agentID], nil
+}
+
+// CountPendingJobs returns the seeded PendingJobCounts value. Real
+// query: COUNT of jobs with agent_id=$1 AND status IN (Pending,
+// AwaitingCSR, AwaitingApproval, Running). I-004.
+func (m *mockAgentRepo) CountPendingJobs(ctx context.Context, agentID string) (int, error) {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.CountErr != nil {
+		return 0, m.CountErr
+	}
+	return m.PendingJobCounts[agentID], nil
+}
+
 // mockTargetRepo is a test implementation of TargetRepository
 type mockTargetRepo struct {
 	mu sync.Mutex
@@ -955,6 +1113,13 @@ func newMockAgentRepository() *mockAgentRepo {
 	return &mockAgentRepo{
 		Agents:           make(map[string]*domain.Agent),
 		HeartbeatUpdates: make(map[string]time.Time),
+		// I-004 preflight count maps. Tests seed these directly via
+		// agentRepo.ActiveTargetCounts["agent-id"] = N — unset keys
+		// read back as zero from CountActiveTargets etc., matching
+		// the production repo behavior for agents with no deps.
+		ActiveTargetCounts: make(map[string]int),
+		ActiveCertCounts:   make(map[string]int),
+		PendingJobCounts:   make(map[string]int),
 	}
 }
 