mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-12 16:38:55 +00:00
Close I-004 (agent hard-delete cascades targets) coverage-gap finding

Operator decision answered as full soft-delete with optional forced
cascade — hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.

Schema — migration 000015:
migrations/000015_agent_retire.up.sql flips
deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
boundary instead of quietly destroying targets. Both `agents` and
`deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
TEXT pair (TEXT not VARCHAR so operator comments are never
truncated), indexed via partial indexes WHERE retired_at IS NOT
NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
EXISTS) so repeated runs against partially-migrated databases
converge. migrations/000015_agent_retire.down.sql restores CASCADE
and drops the new columns for clean rollback. A dedicated
repository-layer testcontainers test
(internal/repository/postgres/migration_000015_test.go) asserts the
before/after FK action, column presence, index presence, and
round-trip idempotency under up→down→up.

Domain — sentinel guard + dependency counts:
internal/domain/connector.go gains IsRetired() on Agent, the
exported SentinelAgentIDs slice listing server-scanner,
cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
four reserved IDs documented in CLAUDE.md and created at startup in
cmd/server/main.go), IsSentinelAgent(id string) predicate,
AgentDependencyCounts{ActiveTargets, ActiveCertificates,
PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
ActorTypeSystem enum values used by audit emission downstream.
Coverage locked down by internal/domain/connector_test.go.
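
A minimal sketch of the two domain predicates described above. The names follow the commit message, but the bodies are assumptions: the sentinel check is written as a plain membership test, and HasDependencies() is assumed to report true when any one bucket is non-zero (the message does not spell out the rule).

```go
package main

import "fmt"

// SentinelAgentIDs lists the four reserved IDs named in the commit message.
var SentinelAgentIDs = []string{
	"server-scanner", "cloud-aws-sm", "cloud-azure-kv", "cloud-gcp-sm",
}

// IsSentinelAgent reports whether id is one of the reserved sentinel agents.
func IsSentinelAgent(id string) bool {
	for _, s := range SentinelAgentIDs {
		if s == id {
			return true
		}
	}
	return false
}

// AgentDependencyCounts mirrors the three preflight buckets.
type AgentDependencyCounts struct {
	ActiveTargets, ActiveCertificates, PendingJobs int
}

// HasDependencies is assumed to be true when any bucket is non-zero.
func (c AgentDependencyCounts) HasDependencies() bool {
	return c.ActiveTargets > 0 || c.ActiveCertificates > 0 || c.PendingJobs > 0
}

func main() {
	fmt.Println(IsSentinelAgent("cloud-aws-sm"))
	counts := AgentDependencyCounts{PendingJobs: 2}
	fmt.Println(counts.HasDependencies())
}
```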

Service — 8-step ordered contract:
internal/service/agent_retire.go:RetireAgent(ctx, id, actor,
opts{Force, Reason}) enforces a fixed execution order:
(1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel
unconditionally; force=true does NOT bypass it.
(2) fetch — ErrAgentNotFound on miss.
(3) idempotency — if IsRetired() already, return
AgentRetirementResult{AlreadyRetired: true} with no new audit
event and no state change (safe to replay from flaky clients).
(4) preflight counts — collectAgentDependencyCounts runs
ActiveTargets, ActiveCertificates, PendingJobs sequentially
(not in parallel; keeps the per-query timeout predictable and
matches the repo's existing call-chain shape).
(5) force-reason guard — opts.Force=true with empty Reason returns
ErrForceReasonRequired (wired into the 400 status surface).
(6) dependency guard — HasDependencies() with opts.Force=false
returns BlockedByDependenciesError{Counts} (wired into the 409
body with per-bucket counts).
(7) mutation — single pinned retiredAt := time.Now(); agent
retirement first, then cascade target retirement if opts.Force,
all under the repo's single transaction so the agent and target
retired_at stamps are identical.
(8) best-effort audit — agent_retired always; agent_retirement_
cascaded additionally on the force path. Actor is whatever the
handler resolves from the request; actor type is mapped by
resolveActorType (system→System, agent-prefix→Agent, else→User). Audit
emission failures are logged via slog.Error but do not abort
the retirement (matches the house convention used by every
other scheduler-emitted event).
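
The ordering of the eight steps can be sketched against an in-memory store. Everything here except the error names is a hypothetical stand-in: memStore replaces the repository, a single Deps integer collapses the three preflight counts, and audit emission is left out. It is meant to show the guard order only, not the real service code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	ErrAgentIsSentinel       = errors.New("agent is a reserved sentinel")
	ErrAgentNotFound         = errors.New("agent not found")
	ErrForceReasonRequired   = errors.New("force requires a reason")
	ErrBlockedByDependencies = errors.New("agent has active dependencies")
)

// agentRow and memStore are hypothetical stand-ins for the repository layer.
type agentRow struct {
	RetiredAt *time.Time
	Deps      int // collapses the three preflight counts into one number
}

type memStore map[string]*agentRow

type retireResult struct {
	AlreadyRetired bool
	Cascaded       bool
}

var sentinelIDs = map[string]bool{
	"server-scanner": true, "cloud-aws-sm": true,
	"cloud-azure-kv": true, "cloud-gcp-sm": true,
}

func retireAgent(s memStore, id string, force bool, reason string) (retireResult, error) {
	if sentinelIDs[id] { // (1) sentinel guard runs before anything else
		return retireResult{}, ErrAgentIsSentinel
	}
	row, ok := s[id] // (2) fetch
	if !ok {
		return retireResult{}, ErrAgentNotFound
	}
	if row.RetiredAt != nil { // (3) idempotent replay: no state change
		return retireResult{AlreadyRetired: true}, nil
	}
	deps := row.Deps           // (4) preflight counts
	if force && reason == "" { // (5) force-reason guard
		return retireResult{}, ErrForceReasonRequired
	}
	if deps > 0 && !force { // (6) dependency guard
		return retireResult{}, ErrBlockedByDependencies
	}
	now := time.Now() // (7) one pinned timestamp for the whole mutation
	row.RetiredAt = &now
	if force {
		row.Deps = 0 // stand-in for cascading retirement to the targets
	}
	// (8) best-effort audit emission is omitted from this sketch
	return retireResult{Cascaded: force}, nil
}

func main() {
	s := memStore{"agent-1": {Deps: 3}}
	_, err := retireAgent(s, "agent-1", false, "")
	fmt.Println(err)
	res, err := retireAgent(s, "agent-1", true, "host decommissioned")
	fmt.Println(res.Cascaded, err)
}
```

Because the sentinel guard is step (1), a reserved ID is rejected even when it is absent from the store and even with force set.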
BlockedByDependenciesError implements Error() as
"active_targets=%d, active_certificates=%d, pending_jobs=%d" and
Unwrap() → ErrBlockedByDependencies. The single struct satisfies
errors.Is via Unwrap (used by scheduler-level tests) and errors.As
via the concrete type (used by the handler to fish out Counts for
the 409 body). ListRetiredAgents(page, perPage) adds a separate
paginated accessor with page<1→1 and perPage<1→50 normalization so
retired rows are queryable without polluting the default agent
listing.
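
A sketch of how one struct can serve both errors.Is and errors.As, plus the pagination clamp. The Error() format string and the Unwrap target come from the text above; the pointer receiver and the helpers isBlocked, blockedCounts and normalizePaging are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrBlockedByDependencies = errors.New("agent has active dependencies")

type AgentDependencyCounts struct {
	ActiveTargets, ActiveCertificates, PendingJobs int
}

type BlockedByDependenciesError struct {
	Counts AgentDependencyCounts
}

func (e *BlockedByDependenciesError) Error() string {
	return fmt.Sprintf("active_targets=%d, active_certificates=%d, pending_jobs=%d",
		e.Counts.ActiveTargets, e.Counts.ActiveCertificates, e.Counts.PendingJobs)
}

// Unwrap lets errors.Is match the sentinel through the concrete type.
func (e *BlockedByDependenciesError) Unwrap() error { return ErrBlockedByDependencies }

// isBlocked is the sentinel-style check a test might use (hypothetical helper).
func isBlocked(err error) bool { return errors.Is(err, ErrBlockedByDependencies) }

// blockedCounts is the concrete-type check a handler might use (hypothetical helper).
func blockedCounts(err error) (AgentDependencyCounts, bool) {
	var blocked *BlockedByDependenciesError
	if errors.As(err, &blocked) {
		return blocked.Counts, true
	}
	return AgentDependencyCounts{}, false
}

// normalizePaging applies the page<1→1 and perPage<1→50 rule.
func normalizePaging(page, perPage int) (int, int) {
	if page < 1 {
		page = 1
	}
	if perPage < 1 {
		perPage = 50
	}
	return page, perPage
}

func main() {
	err := fmt.Errorf("retire agent-1: %w", &BlockedByDependenciesError{
		Counts: AgentDependencyCounts{ActiveTargets: 2, ActiveCertificates: 5, PendingJobs: 1},
	})
	fmt.Println(isBlocked(err))
	if c, ok := blockedCounts(err); ok {
		fmt.Println(c.ActiveTargets, c.ActiveCertificates, c.PendingJobs)
	}
	fmt.Println(normalizePaging(0, 0))
}
```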
Sentinel guard coverage is unconditional by design: all four reserved
IDs are protected, and force=true (which overrides the dependency
guard) cannot override it. Regression tests
in internal/service/agent_retire_test.go assert each of the eight
steps in order, plus sentinel bypass attempts and idempotency
replay.

Handler + router — status-code surface:
internal/api/handler/agents.go:RetireAgent exposes seven status
codes on DELETE /agents/{id}:
200 on a fresh retirement (body echoes AgentRetirementResult).
204 on idempotent replay (AlreadyRetired=true; no new audit).
400 on ErrForceReasonRequired.
403 on ErrAgentIsSentinel.
404 on ErrAgentNotFound.
409 on BlockedByDependenciesError, with a custom body shape
{error, counts{active_targets, active_certificates,
pending_jobs}} that bypasses the default ErrorWithRequestID
envelope so callers get the per-bucket numbers directly.
500 on any other error.
Heartbeat HandleHeartbeat returns 410 Gone when the agent is
retired (ErrAgentRetired), signalling the agent to shut down.
Query params `force=true` and `reason=<text>` drive the cascade
path; both are forwarded as url.Values through the new MCP
transport.
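
The seven-code surface amounts to a single mapping from the service result to an HTTP status. The sketch below assumes the handler branches with errors.Is on sentinel errors; the function name retireStatus and the errOther placeholder are hypothetical, and the custom 409 body is not shown.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

var (
	ErrAgentIsSentinel       = errors.New("agent is a reserved sentinel")
	ErrAgentNotFound         = errors.New("agent not found")
	ErrForceReasonRequired   = errors.New("force requires a reason")
	ErrBlockedByDependencies = errors.New("agent has active dependencies")

	// errOther stands in for any error the handler does not recognise.
	errOther = errors.New("database unavailable")
)

// retireStatus maps a retirement outcome to the documented status code.
func retireStatus(alreadyRetired bool, err error) int {
	switch {
	case err == nil && alreadyRetired:
		return http.StatusNoContent // 204: idempotent replay
	case err == nil:
		return http.StatusOK // 200: fresh retirement
	case errors.Is(err, ErrForceReasonRequired):
		return http.StatusBadRequest // 400
	case errors.Is(err, ErrAgentIsSentinel):
		return http.StatusForbidden // 403
	case errors.Is(err, ErrAgentNotFound):
		return http.StatusNotFound // 404
	case errors.Is(err, ErrBlockedByDependencies):
		return http.StatusConflict // 409
	default:
		return http.StatusInternalServerError // 500
	}
}

func main() {
	wrapped := fmt.Errorf("retire agent-1: %w", ErrBlockedByDependencies)
	fmt.Println(retireStatus(false, wrapped))
	fmt.Println(retireStatus(true, nil))
}
```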
internal/api/router/router.go registers GET /api/v1/agents/retired
literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's
literal-beats-pattern-var precedence routes "retired" to the
paginated retired-agents listing instead of fetching a hypothetical
agent named "retired".

Agent binary — clean shutdown on 410:
cmd/agent/main.go gains the ErrAgentRetired sentinel, a
retiredOnce sync.Once, and a retiredSignal chan struct{}. A
markRetired(source, statusCode, body) helper closes the channel
exactly once; the Run() select loop observes the close and returns
ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
and exits cleanly instead of spinning in the heartbeat retry loop.
The 410 Gone surface is therefore terminal for the agent process.
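
A sketch of the close-once shutdown signal described above. The identifiers ErrAgentRetired, retiredOnce, retiredSignal, markRetired and Run come from the commit message; the Agent struct, its work channel and NewAgent are assumptions added so the example runs on its own.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrAgentRetired = errors.New("agent retired")

// Agent is a hypothetical container for the two retirement fields.
type Agent struct {
	retiredOnce   sync.Once
	retiredSignal chan struct{}
	work          chan string // stand-in for the agent's normal job feed
}

func NewAgent() *Agent {
	return &Agent{retiredSignal: make(chan struct{}), work: make(chan string)}
}

// markRetired closes the signal channel exactly once, however many
// 410 responses arrive. A real agent would presumably log its arguments.
func (a *Agent) markRetired(source string, statusCode int, body string) {
	a.retiredOnce.Do(func() { close(a.retiredSignal) })
}

// Run returns ErrAgentRetired as soon as the signal channel is closed.
func (a *Agent) Run() error {
	for {
		select {
		case <-a.retiredSignal:
			return ErrAgentRetired
		case job := <-a.work:
			fmt.Println("processing", job)
		}
	}
}

func main() {
	a := NewAgent()
	a.markRetired("heartbeat", 410, "gone")
	a.markRetired("heartbeat", 410, "gone") // second call is a safe no-op
	if err := a.Run(); errors.Is(err, ErrAgentRetired) {
		fmt.Println("clean exit")
	}
}
```

Without the sync.Once, a second 410 would close an already-closed channel and panic.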

MCP transport:
internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
a new additive transport method. Client.Delete is path-only; without
this method the retire tool would silently drop `force` and `reason`,
turning every cascade retire into a default soft-retire. The new
method shares do()'s 204 normalization and 4xx/5xx error
propagation so tool authors get one contract.
internal/mcp/tools.go + internal/mcp/types.go expose the
retire_agent tool with Force+Reason inputs wired through
DeleteWithQuery.
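
To illustrate why a path-only Delete loses the cascade inputs, here is a sketch of the URL construction a query-aware delete needs. The helpers retireQuery and deleteURL and the host name are hypothetical; the real DeleteWithQuery signature is only known from the description above.

```go
package main

import (
	"fmt"
	"net/url"
)

// retireQuery builds the optional cascade parameters.
func retireQuery(force bool, reason string) url.Values {
	q := url.Values{}
	if force {
		q.Set("force", "true")
	}
	if reason != "" {
		q.Set("reason", reason)
	}
	return q
}

// deleteURL appends the encoded query only when there is one, so a
// plain soft-retire produces the same URL a path-only Delete would.
func deleteURL(baseURL, path string, query url.Values) string {
	u := baseURL + path
	if encoded := query.Encode(); encoded != "" {
		u += "?" + encoded
	}
	return u
}

func main() {
	base := "http://certctl.test" // placeholder host
	fmt.Println(deleteURL(base, "/api/v1/agents/a1", retireQuery(false, "")))
	fmt.Println(deleteURL(base, "/api/v1/agents/a1", retireQuery(true, "decommissioned host")))
}
```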

CLI:
cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
`agents list --retired` (client-side strip of --retired then
delegation to ListRetiredAgents, sharing --page/--per-page parsing
with the default listing) and `agents retire <id> [--force --reason
"…"]` (mirrors ErrForceReasonRequired — force without reason is
rejected client-side before the request is sent). JSON + table
output modes both honor the new columns.
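
The two client-side behaviours can be sketched as small pure functions. Both helper names and the error text are hypothetical; the sketch assumes the flag is a bare `--retired` token and that an empty reason string is what counts as "no reason".

```go
package main

import (
	"errors"
	"fmt"
)

var errForceNeedsReason = errors.New("--force requires --reason")

// stripFlag removes every occurrence of flag from args and reports
// whether it was present, leaving the remaining args for shared parsing.
func stripFlag(args []string, flag string) ([]string, bool) {
	rest := make([]string, 0, len(args))
	found := false
	for _, a := range args {
		if a == flag {
			found = true
			continue
		}
		rest = append(rest, a)
	}
	return rest, found
}

// validateRetireFlags rejects force without a reason before any request is sent.
func validateRetireFlags(force bool, reason string) error {
	if force && reason == "" {
		return errForceNeedsReason
	}
	return nil
}

func main() {
	rest, retired := stripFlag([]string{"--page", "2", "--retired"}, "--retired")
	fmt.Println(rest, retired)
	fmt.Println(validateRetireFlags(true, ""))
}
```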

Frontend:
web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
web/src/api/client.ts + web/src/api/types.ts expose the retire
endpoint and the retired-listing. 4 new Vitest regression cases.

OpenAPI:
api/openapi.yaml documents DELETE /agents/{id} with all seven
status codes, 410 on heartbeat, and the 409 per-bucket body shape.

Regression coverage (six new test files, all green):
internal/service/agent_retire_test.go — 8-step contract + sentinel guards
internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
internal/mcp/retire_agent_test.go — DeleteWithQuery wire-through
internal/cli/agent_retire_test.go — --retired listing + --force/--reason pairing
internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
internal/domain/connector_test.go — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies

Files:
api/openapi.yaml — DELETE + 410 + 409 body shape
cmd/agent/main.go — ErrAgentRetired, markRetired, retiredSignal
cmd/cli/main.go — handleAgents list/get/retire dispatch
docs/architecture.md, docs/concepts.md,
docs/testing-guide.md — retirement contract narrative
internal/api/handler/agents.go — RetireAgent, status surface, 410 on heartbeat
internal/api/handler/agent_handler_test.go — extended coverage
internal/api/handler/agent_retire_handler_test.go — new
internal/api/router/router.go — /agents/retired before /agents/{id}
internal/cli/agent_retire_test.go — new
internal/cli/client.go — ListRetiredAgents + RetireAgent
internal/domain/connector.go — IsRetired, SentinelAgentIDs,
IsSentinelAgent, AgentDependencyCounts,
ActorTypeAgent/System
internal/domain/connector_test.go — new
internal/integration/lifecycle_test.go — retirement fixture
internal/mcp/client.go — DeleteWithQuery additive transport
internal/mcp/retire_agent_test.go — new
internal/mcp/tools.go, internal/mcp/types.go — retire_agent tool + Force/Reason inputs
internal/repository/interfaces.go — AgentRepository retirement methods
internal/repository/postgres/agent.go — retire + cascade target retire + counts
internal/repository/postgres/migration_000015_test.go — new
internal/service/agent.go — wire into AgentService surface
internal/service/agent_retire.go — new 8-step contract
internal/service/agent_retire_test.go — new
internal/service/deployment.go — skip retired agents
internal/service/target.go — skip retired agents
internal/service/testutil_test.go — shared mocks extended
migrations/000015_agent_retire.up.sql — new
migrations/000015_agent_retire.down.sql — new
web/src/api/client.ts, types.ts + tests — retire endpoint wiring
web/src/pages/AgentsPage.tsx — retire UI
@@ -93,9 +93,34 @@ type TargetRepository interface {
 
 // AgentRepository defines operations for managing control plane agents.
 type AgentRepository interface {
-	// List returns all agents.
+	// List returns all ACTIVE agents — rows with retired_at IS NULL.
+	//
+	// I-004: The default listing MUST NOT surface retired agents. The
+	// handler-facing ListAgents call, the stats dashboard, and the stale-offline
+	// sweeper all iterate this list and would otherwise re-surface decommissioned
+	// hardware in operational UI. Callers that genuinely want retired rows (the
+	// audit tab, compliance exports) must use ListRetired instead.
+	//
+	// The partial index idx_agents_retired_at (migration 000015) keeps retired
+	// rows cheap to exclude — the planner uses it to skip the retired segment
+	// of the table entirely.
 	List(ctx context.Context) ([]*domain.Agent, error)
+	// ListRetired returns a paginated list of retired agents (retired_at IS NOT NULL),
+	// ordered by retired_at DESC so the most recent retirements appear first. Used
+	// by the GUI's Retired tab and the audit export path. Returns the slice plus
+	// the total count (for pagination). A page<1 or perPage<1 is clamped to sensible
+	// defaults (page=1, perPage=50) in the repo implementation rather than erroring —
+	// this matches the ListAgents pagination behavior in the service layer.
+	// I-004 coverage-gap closure, migration 000015.
+	ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error)
 	// Get retrieves an agent by ID.
+	//
+	// I-004 note: Get returns retired rows (retired_at IS NOT NULL) because
+	// callers that need to check "has this agent been retired?" — the heartbeat
+	// handler returning 410 Gone, the retirement service's idempotent-retire
+	// branch, the detail page rendering a retirement banner — must see the
+	// retired_at/retired_reason fields. Only the default List path default-
+	// excludes retired; individual Get lookups surface them.
 	Get(ctx context.Context, id string) (*domain.Agent, error)
 	// Create stores a new agent. Callers that want duplicate-key errors surfaced
 	// (e.g. real-agent registration) must use this method; sentinel/bootstrap
@@ -112,11 +137,78 @@ type AgentRepository interface {
 	// Update modifies an existing agent.
 	Update(ctx context.Context, agent *domain.Agent) error
 	// Delete removes an agent.
+	//
+	// I-004: callers should prefer SoftRetire / RetireAgentWithCascade for the
+	// operator-facing retirement path; hard Delete remains available for test
+	// cleanup and repository-level administrative tasks. The deployment_targets
+	// FK flipped to ON DELETE RESTRICT in migration 000015, so hard-deleting an
+	// agent that still owns active targets will now fail at the DB layer — which
+	// is intentional: the fail-closed guardrail prevents audit-trail destruction.
 	Delete(ctx context.Context, id string) error
 	// UpdateHeartbeat updates the agent's last heartbeat timestamp and metadata.
+	//
+	// I-004: UpdateHeartbeat is a no-op on retired agents — the UPDATE clause
+	// includes AND retired_at IS NULL so a stale agent process that keeps polling
+	// after retirement cannot resurrect its heartbeat. The service layer already
+	// short-circuits with ErrAgentRetired before calling this method; the WHERE
+	// filter here is belt-and-braces for anyone who skips the service path.
 	UpdateHeartbeat(ctx context.Context, id string, metadata *domain.AgentMetadata) error
 	// GetByAPIKey retrieves an agent by hashed API key.
+	//
+	// I-004: GetByAPIKey returns retired rows so the auth middleware can detect
+	// "this API key belongs to a retired agent" and fail the request with
+	// 410 Gone. If retired rows were hidden, auth would return a plain 401 and
+	// leak no signal — which is wrong: the operator needs the retired state
+	// made explicit so they can clean up the agent process.
 	GetByAPIKey(ctx context.Context, keyHash string) (*domain.Agent, error)
+	// SoftRetire stamps retired_at + retired_reason on the agent row with no
+	// cascade. Used on the happy path where preflight confirmed the agent has
+	// zero active dependencies (no active deployment_targets, no pending jobs).
+	// The UPDATE is scoped to WHERE id=$1 AND retired_at IS NULL so re-retiring
+	// an already-retired row is a no-op (zero rows affected is NOT returned as
+	// an error — the service layer detects this via its own idempotent-retire
+	// branch before calling SoftRetire). Callers supply retiredAt so the service
+	// can pin a single consistent timestamp across audit + DB writes.
+	// I-004 coverage-gap closure.
+	SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error
+	// RetireAgentWithCascade performs a transactional retire + cascade. In one
+	// transaction it: (1) stamps retired_at + retired_reason on the agent row,
+	// and (2) stamps the SAME retired_at + retired_reason on every active
+	// deployment_targets row whose agent_id matches. Only rows with
+	// retired_at IS NULL are touched in (2) — already-retired targets keep their
+	// original retirement metadata (whoever retired them first, whenever). Used
+	// exclusively on the force=true path from the retirement handler; callers
+	// supply retiredAt so the agent row and every cascaded target row share an
+	// exact retirement instant (helps forensic analysis trace the cascade back
+	// to a single operator action). If the agent row is already retired, the
+	// whole operation is a no-op — the transaction commits without touching
+	// either table. I-004 coverage-gap closure, migration 000015.
+	RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error
+	// CountActiveTargets returns the number of deployment_targets rows where
+	// agent_id=id AND retired_at IS NULL. The COUNT query hits the existing
+	// idx_deployment_targets_agent_id index (migration 000001 line 111); the
+	// additional retired_at IS NULL predicate is cheap because the partial
+	// idx_deployment_targets_retired_at index (migration 000015) lets the
+	// planner skip the retired-row segment entirely. Preflight uses this to
+	// decide 200 (soft-retire) vs 409 (blocked-by-deps). I-004.
+	CountActiveTargets(ctx context.Context, agentID string) (int, error)
+	// CountActiveCertificates returns the count of managed_certificates currently
+	// deployed through one of this agent's ACTIVE (non-retired) deployment_targets.
+	// The query joins certificate_target_mappings (migration 000001 line 116) →
+	// deployment_targets filtering on deployment_targets.agent_id=$1 AND
+	// deployment_targets.retired_at IS NULL, then COUNT(DISTINCT certificate_id)
+	// so the same cert deployed to multiple targets on one agent counts once.
+	// The primary key (certificate_id, target_id) on certificate_target_mappings
+	// plus idx_certificate_target_mappings_target_id (line 122) cover the join.
+	// Used purely for the preflight 409 body — the number is informational. I-004.
+	CountActiveCertificates(ctx context.Context, agentID string) (int, error)
+	// CountPendingJobs returns the number of jobs belonging to this agent whose
+	// status is in (Pending, AwaitingCSR, AwaitingApproval, Running) — the four
+	// statuses that indicate work the agent would still be expected to pick up.
+	// Completed/Failed/Cancelled jobs do not count. The filter agent_id=$1 hits
+	// the idx_jobs_agent_id index (migration 000001 line 161). Used for the
+	// preflight 409 body. I-004.
+	CountPendingJobs(ctx context.Context, agentID string) (int, error)
 }
 
 // JobRepository defines operations for managing renewal and deployment jobs.
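
The cascade semantics documented on RetireAgentWithCascade and CountActiveTargets can be exercised with an in-memory fake. This is a sketch only: memRepo, its row types, the at helper and newDemoRepo are hypothetical, there is no context or error return, and a missing agent is treated as a no-op, which the real Postgres implementation may handle differently.

```go
package main

import (
	"fmt"
	"time"
)

type retirement struct {
	RetiredAt *time.Time
	Reason    string
}

type targetRow struct {
	AgentID string
	retirement
}

// memRepo is a hypothetical in-memory stand-in for the Postgres repository.
type memRepo struct {
	agents  map[string]*retirement
	targets map[string]*targetRow
}

// at builds a fixed timestamp so the example is deterministic.
func at(sec int64) time.Time { return time.Unix(sec, 0).UTC() }

// RetireAgentWithCascade stamps the agent and every still-active target
// with the same instant and reason; already-retired rows are untouched.
func (r *memRepo) RetireAgentWithCascade(id string, retiredAt time.Time, reason string) {
	a, ok := r.agents[id]
	if !ok || a.RetiredAt != nil {
		return // missing or already retired: no-op in this sketch
	}
	a.RetiredAt, a.Reason = &retiredAt, reason
	for _, t := range r.targets {
		if t.AgentID == id && t.RetiredAt == nil {
			t.RetiredAt, t.Reason = &retiredAt, reason
		}
	}
}

// CountActiveTargets counts targets with agent_id=id and no retired_at.
func (r *memRepo) CountActiveTargets(id string) int {
	n := 0
	for _, t := range r.targets {
		if t.AgentID == id && t.RetiredAt == nil {
			n++
		}
	}
	return n
}

// newDemoRepo seeds one active target and one previously retired target.
func newDemoRepo() *memRepo {
	old := at(100)
	return &memRepo{
		agents: map[string]*retirement{"a1": {}},
		targets: map[string]*targetRow{
			"t1": {AgentID: "a1"},
			"t2": {AgentID: "a1", retirement: retirement{RetiredAt: &old, Reason: "replaced"}},
		},
	}
}

func main() {
	repo := newDemoRepo()
	fmt.Println(repo.CountActiveTargets("a1"))
	repo.RetireAgentWithCascade("a1", at(200), "decommissioned")
	fmt.Println(repo.CountActiveTargets("a1"), repo.targets["t2"].Reason)
}
```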