mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 15:01:32 +00:00
Close I-004 (agent hard-delete cascades targets) coverage-gap finding
Operator decision answered as full soft-delete with optional forced
cascade — hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.
Schema — migration 000015:
migrations/000015_agent_retire.up.sql flips
deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
boundary instead of quietly destroying targets. Both `agents` and
`deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
TEXT pair (TEXT not VARCHAR so operator comments are never
truncated), indexed via partial indexes WHERE retired_at IS NOT
NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
EXISTS) so repeated runs against partially-migrated databases
converge. migrations/000015_agent_retire.down.sql restores CASCADE
and drops the new columns for clean rollback. A dedicated
repository-layer testcontainers test
(internal/repository/postgres/migration_000015_test.go) asserts the
before/after FK action, column presence, index presence, and
round-trip idempotency under up→down→up.
Domain — sentinel guard + dependency counts:
internal/domain/connector.go gains IsRetired() on Agent, the
exported SentinelAgentIDs slice listing server-scanner,
cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
four reserved IDs documented in CLAUDE.md and created at startup in
cmd/server/main.go), IsSentinelAgent(id string) predicate,
AgentDependencyCounts{ActiveTargets, ActiveCertificates,
PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
ActorTypeSystem enum values used by audit emission downstream.
Coverage locked down by internal/domain/connector_test.go.
Service — 8-step ordered contract:
internal/service/agent_retire.go:RetireAgent(ctx, id, actor,
opts{Force, Reason}) enforces a fixed execution order:
(1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel
unconditionally; force=true does NOT bypass it.
(2) fetch — ErrAgentNotFound on miss.
(3) idempotency — if IsRetired() already, return
AgentRetirementResult{AlreadyRetired: true} with no new audit
event and no state change (safe to replay from flaky clients).
(4) preflight counts — collectAgentDependencyCounts runs
ActiveTargets, ActiveCertificates, PendingJobs sequentially
(not in parallel; keeps the per-query timeout predictable and
matches the repo's existing call-chain shape).
(5) force-reason guard — opts.Force=true with empty Reason returns
ErrForceReasonRequired (wired into the 400 status surface).
(6) dependency guard — HasDependencies() with opts.Force=false
returns BlockedByDependenciesError{Counts} (wired into the 409
body with per-bucket counts).
(7) mutation — single pinned retiredAt := time.Now(); agent
retirement first, then cascade target retirement if opts.Force,
all under the repo's single transaction so the two retired_at
stamps match to the second.
(8) best-effort audit — agent_retired always; agent_retirement_
cascaded additionally on the force path. Actor is whatever the
handler resolves from the request; actor type is mapped by
resolveActorType (system/agent-prefix→Agent/else→User). Audit
emission failures are logged via slog.Error but do not abort
the retirement (matches the house convention used by every
other scheduler-emitted event).
BlockedByDependenciesError implements Error() as
"active_targets=%d, active_certificates=%d, pending_jobs=%d" and
Unwrap() → ErrBlockedByDependencies. The single struct satisfies
errors.Is via Unwrap (used by scheduler-level tests) and errors.As
via the concrete type (used by the handler to fish out Counts for
the 409 body). ListRetiredAgents(page, perPage) adds a separate
paginated accessor with page<1→1 and perPage<1→50 normalization so
retired rows are queryable without polluting the default agent
listing.
Sentinel guard coverage is asymmetric by design: all four reserved
IDs are protected, and force=true cannot override. Regression tests
in internal/service/agent_retire_test.go assert each of the eight
steps in order, plus sentinel bypass attempts and idempotency
replay.
Handler + router — status-code surface:
internal/api/handler/agents.go:RetireAgent exposes seven status
codes on DELETE /agents/{id}:
200 on a fresh retirement (body echoes AgentRetirementResult).
204 on idempotent replay (AlreadyRetired=true; no new audit).
400 on ErrForceReasonRequired.
403 on ErrAgentIsSentinel.
404 on ErrAgentNotFound.
409 on BlockedByDependenciesError, with a custom body shape
{error, counts{active_targets, active_certificates,
pending_jobs}} that bypasses the default ErrorWithRequestID
envelope so callers get the per-bucket numbers directly.
500 on any other error.
Heartbeat HandleHeartbeat returns 410 Gone when the agent is
retired (ErrAgentRetired), signalling the agent to shut down.
Query params `force=true` and `reason=<text>` drive the cascade
path; both are forwarded as url.Values through the new MCP
transport.
internal/api/router/router.go registers GET /api/v1/agents/retired
literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's
literal-beats-pattern-var precedence routes "retired" to the
paginated retired-agents listing instead of fetching a hypothetical
agent named "retired".
Agent binary — clean shutdown on 410:
cmd/agent/main.go gains the ErrAgentRetired sentinel, a
retiredOnce sync.Once, and a retiredSignal chan struct{}. A
markRetired(source, statusCode, body) helper closes the channel
exactly once; the Run() select loop observes the close and returns
ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
and exits cleanly instead of spinning in the heartbeat retry loop.
The 410 Gone surface is therefore terminal for the agent process.
MCP transport:
internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
a new additive transport method. Client.Delete is path-only; without
this method the retire tool would silently drop `force` and `reason`,
turning every cascade retire into a default soft-retire. The new
method shares do()'s 204 normalization and 4xx/5xx error
propagation so tool authors get one contract.
internal/mcp/tools.go + internal/mcp/types.go expose the
retire_agent tool with Force+Reason inputs wired through
DeleteWithQuery.
CLI:
cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
`agents list --retired` (client-side strip of --retired then
delegation to ListRetiredAgents, sharing --page/--per-page parsing
with the default listing) and `agents retire <id> [--force --reason
"…"]` (mirrors ErrForceReasonRequired — force without reason is
rejected client-side before the request is sent). JSON + table
output modes both honor the new columns.
Frontend:
web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
web/src/api/client.ts + web/src/api/types.ts expose the retire
endpoint and the retired-listing. 4 new Vitest regression cases.
OpenAPI:
api/openapi.yaml documents DELETE /agents/{id} with all seven
status codes, 410 on heartbeat, and the 409 per-bucket body shape.
Regression coverage (six new test files, all green):
internal/service/agent_retire_test.go — 8-step contract + sentinel guards
internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
internal/mcp/retire_agent_test.go — DeleteWithQuery wire-through
internal/cli/agent_retire_test.go — --retired listing + --force/--reason pairing
internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
internal/domain/connector_test.go — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies
Files:
api/openapi.yaml — DELETE + 410 + 409 body shape
cmd/agent/main.go — ErrAgentRetired, markRetired, retiredSignal
cmd/cli/main.go — handleAgents list/get/retire dispatch
docs/architecture.md, docs/concepts.md,
docs/testing-guide.md — retirement contract narrative
internal/api/handler/agents.go — RetireAgent, status surface, 410 on heartbeat
internal/api/handler/agent_handler_test.go — extended coverage
internal/api/handler/agent_retire_handler_test.go — new
internal/api/router/router.go — /agents/retired before /agents/{id}
internal/cli/agent_retire_test.go — new
internal/cli/client.go — ListRetiredAgents + RetireAgent
internal/domain/connector.go — IsRetired, SentinelAgentIDs,
IsSentinelAgent, AgentDependencyCounts,
ActorTypeAgent/System
internal/domain/connector_test.go — new
internal/integration/lifecycle_test.go — retirement fixture
internal/mcp/client.go — DeleteWithQuery additive transport
internal/mcp/retire_agent_test.go — new
internal/mcp/tools.go, internal/mcp/types.go — retire_agent tool + Force/Reason inputs
internal/repository/interfaces.go — AgentRepository retirement methods
internal/repository/postgres/agent.go — retire + cascade target retire + counts
internal/repository/postgres/migration_000015_test.go — new
internal/service/agent.go — wire into AgentService surface
internal/service/agent_retire.go — new 8-step contract
internal/service/agent_retire_test.go — new
internal/service/deployment.go — skip retired agents
internal/service/target.go — skip retired agents
internal/service/testutil_test.go — shared mocks extended
migrations/000015_agent_retire.up.sql — new
migrations/000015_agent_retire.down.sql — new
web/src/api/client.ts, types.ts + tests — retire endpoint wiring
web/src/pages/AgentsPage.tsx — retire UI
This commit is contained in:
@@ -4,6 +4,7 @@ import (
|
||||
"context"
|
||||
"database/sql"
|
||||
"errors"
|
||||
"sort"
|
||||
"sync"
|
||||
"time"
|
||||
|
||||
@@ -607,7 +608,14 @@ func (m *mockRenewalPolicyRepo) AddPolicy(policy *domain.RenewalPolicy) {
|
||||
m.Policies[policy.ID] = policy
|
||||
}
|
||||
|
||||
// mockAgentRepo is a test implementation of AgentRepository
|
||||
// mockAgentRepo is a test implementation of AgentRepository.
|
||||
//
|
||||
// I-004: ActiveTargetCounts / ActiveCertCounts / PendingJobCounts are keyed by
|
||||
// agent ID and read back verbatim by the Count* methods — the retirement
|
||||
// service's preflight pokes these maps to simulate "agent has N active
|
||||
// deployments / M deployed certs / K pending jobs" without having to seed
|
||||
// real target/cert/job rows across multiple mock repos. An unset key means
|
||||
// zero, matching the production repo behavior on an agent with no deps.
|
||||
type mockAgentRepo struct {
|
||||
mu sync.Mutex
|
||||
Agents map[string]*domain.Agent
|
||||
@@ -619,8 +627,27 @@ type mockAgentRepo struct {
|
||||
ListErr error
|
||||
UpdateHeartbeatErr error
|
||||
GetByAPIKeyErr error
|
||||
// I-004 preflight count seeds (read by CountActiveTargets etc.).
|
||||
ActiveTargetCounts map[string]int
|
||||
ActiveCertCounts map[string]int
|
||||
PendingJobCounts map[string]int
|
||||
// I-004 retirement write-path error seams. Let tests force a SoftRetire
|
||||
// or RetireAgentWithCascade failure after preflight passed, so the
|
||||
// service's error surfacing (wrap+return, skip audit, etc.) can be
|
||||
// exercised without having to stand up a real PG connection.
|
||||
SoftRetireErr error
|
||||
RetireCascadeErr error
|
||||
CountErr error
|
||||
ListRetiredErr error
|
||||
}
|
||||
|
||||
// List mirrors the production repo contract post-I-004: it returns only
|
||||
// ACTIVE agents (RetiredAt == nil). Tests that seed a retired agent via
|
||||
// AddAgent and then call a List-driven service method (e.g. ListAgents,
|
||||
// MarkStaleAgentsOffline, stats dashboards) must not see the retired row
|
||||
// here — otherwise the mock would pass while the real planner filters it
|
||||
// out at the WHERE clause level. ListRetired is the companion method for
|
||||
// explicit retired-only listing.
|
||||
func (m *mockAgentRepo) List(ctx context.Context) ([]*domain.Agent, error) {
|
||||
m.mu.Lock()
|
||||
defer m.mu.Unlock()
|
||||
@@ -629,6 +656,9 @@ func (m *mockAgentRepo) List(ctx context.Context) ([]*domain.Agent, error) {
|
||||
}
|
||||
var agents []*domain.Agent
|
||||
for _, a := range m.Agents {
|
||||
if a.RetiredAt != nil {
|
||||
continue
|
||||
}
|
||||
agents = append(agents, a)
|
||||
}
|
||||
return agents, nil
|
||||
@@ -726,6 +756,134 @@ func (m *mockAgentRepo) AddAgent(agent *domain.Agent) {
|
||||
m.Agents[agent.ID] = agent
|
||||
}
|
||||
|
||||
// ListRetired returns the paginated retired-agents slice + total count.
|
||||
// Matches the production repo contract: RetiredAt != nil, sorted by
|
||||
// RetiredAt DESC, page<1 → 1, perPage<1 → 50. Sort is done in-memory over
|
||||
// the keyed map so the mock stays dependency-free. I-004.
|
||||
func (m *mockAgentRepo) ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error) {
|
||||
m.mu.Lock()
|
||||
defer m.mu.Unlock()
|
||||
if m.ListRetiredErr != nil {
|
||||
return nil, 0, m.ListRetiredErr
|
||||
}
|
||||
if page < 1 {
|
||||
page = 1
|
||||
}
|
||||
if perPage < 1 {
|
||||
perPage = 50
|
||||
}
|
||||
var retired []*domain.Agent
|
||||
for _, a := range m.Agents {
|
||||
if a.RetiredAt != nil {
|
||||
retired = append(retired, a)
|
||||
}
|
||||
}
|
||||
total := len(retired)
|
||||
// Sort by RetiredAt DESC — most recent first. The real query uses the
|
||||
// partial idx_agents_retired_at index; here we sort in Go.
|
||||
sort.SliceStable(retired, func(i, j int) bool {
|
||||
return retired[i].RetiredAt.After(*retired[j].RetiredAt)
|
||||
})
|
||||
// Apply page/perPage window.
|
||||
offset := (page - 1) * perPage
|
||||
if offset >= total {
|
||||
return nil, total, nil
|
||||
}
|
||||
end := offset + perPage
|
||||
if end > total {
|
||||
end = total
|
||||
}
|
||||
return retired[offset:end], total, nil
|
||||
}
|
||||
|
||||
// SoftRetire stamps RetiredAt + RetiredReason on the agent row. Mirrors
|
||||
// the real repo's idempotent semantics: a row already retired is left
|
||||
// untouched (zero-rows-affected is not an error). I-004 preserves
|
||||
// retirement metadata across re-retire attempts — whoever retired it
|
||||
// first owns the audit trail.
|
||||
func (m *mockAgentRepo) SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error {
|
||||
m.mu.Lock()
|
||||
defer m.mu.Unlock()
|
||||
if m.SoftRetireErr != nil {
|
||||
return m.SoftRetireErr
|
||||
}
|
||||
agent, ok := m.Agents[id]
|
||||
if !ok {
|
||||
return errNotFound
|
||||
}
|
||||
if agent.RetiredAt != nil {
|
||||
return nil // already retired — no-op
|
||||
}
|
||||
stamped := retiredAt
|
||||
agent.RetiredAt = &stamped
|
||||
stampedReason := reason
|
||||
agent.RetiredReason = &stampedReason
|
||||
return nil
|
||||
}
|
||||
|
||||
// RetireAgentWithCascade stamps the agent row the same way SoftRetire
|
||||
// does. The real repo also stamps every active deployment_targets row
|
||||
// in the same transaction; the mock can't do that because targets live
|
||||
// in mockTargetRepo, which the retirement service doesn't write to
|
||||
// through this repo interface. Tests that need to assert cascade
|
||||
// semantics on targets should seed mockTargetRepo directly and verify
|
||||
// the service-layer audit event captured the cascade count. I-004.
|
||||
func (m *mockAgentRepo) RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error {
|
||||
m.mu.Lock()
|
||||
defer m.mu.Unlock()
|
||||
if m.RetireCascadeErr != nil {
|
||||
return m.RetireCascadeErr
|
||||
}
|
||||
agent, ok := m.Agents[id]
|
||||
if !ok {
|
||||
return errNotFound
|
||||
}
|
||||
if agent.RetiredAt != nil {
|
||||
return nil // already retired — no-op (same as production transaction)
|
||||
}
|
||||
stamped := retiredAt
|
||||
agent.RetiredAt = &stamped
|
||||
stampedReason := reason
|
||||
agent.RetiredReason = &stampedReason
|
||||
return nil
|
||||
}
|
||||
|
||||
// CountActiveTargets returns the seeded ActiveTargetCounts value (0 if
|
||||
// unset). Matches the real repo signature: COUNT of non-retired
|
||||
// deployment_targets with agent_id=$1. I-004 preflight.
|
||||
func (m *mockAgentRepo) CountActiveTargets(ctx context.Context, agentID string) (int, error) {
|
||||
m.mu.Lock()
|
||||
defer m.mu.Unlock()
|
||||
if m.CountErr != nil {
|
||||
return 0, m.CountErr
|
||||
}
|
||||
return m.ActiveTargetCounts[agentID], nil
|
||||
}
|
||||
|
||||
// CountActiveCertificates returns the seeded ActiveCertCounts value.
|
||||
// Real query: COUNT(DISTINCT certificate_id) across
|
||||
// certificate_target_mappings ↔ deployment_targets on agent_id. I-004.
|
||||
func (m *mockAgentRepo) CountActiveCertificates(ctx context.Context, agentID string) (int, error) {
|
||||
m.mu.Lock()
|
||||
defer m.mu.Unlock()
|
||||
if m.CountErr != nil {
|
||||
return 0, m.CountErr
|
||||
}
|
||||
return m.ActiveCertCounts[agentID], nil
|
||||
}
|
||||
|
||||
// CountPendingJobs returns the seeded PendingJobCounts value. Real
|
||||
// query: COUNT of jobs with agent_id=$1 AND status IN (Pending,
|
||||
// AwaitingCSR, AwaitingApproval, Running). I-004.
|
||||
func (m *mockAgentRepo) CountPendingJobs(ctx context.Context, agentID string) (int, error) {
|
||||
m.mu.Lock()
|
||||
defer m.mu.Unlock()
|
||||
if m.CountErr != nil {
|
||||
return 0, m.CountErr
|
||||
}
|
||||
return m.PendingJobCounts[agentID], nil
|
||||
}
|
||||
|
||||
// mockTargetRepo is a test implementation of TargetRepository
|
||||
type mockTargetRepo struct {
|
||||
mu sync.Mutex
|
||||
@@ -955,6 +1113,13 @@ func newMockAgentRepository() *mockAgentRepo {
|
||||
return &mockAgentRepo{
|
||||
Agents: make(map[string]*domain.Agent),
|
||||
HeartbeatUpdates: make(map[string]time.Time),
|
||||
// I-004 preflight count maps. Tests seed these directly via
|
||||
// agentRepo.ActiveTargetCounts["agent-id"] = N — unset keys
|
||||
// read back as zero from CountActiveTargets etc., matching
|
||||
// the production repo behavior for agents with no deps.
|
||||
ActiveTargetCounts: make(map[string]int),
|
||||
ActiveCertCounts: make(map[string]int),
|
||||
PendingJobCounts: make(map[string]int),
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
Reference in New Issue
Block a user