Close I-004 (agent hard-delete cascades targets) coverage-gap finding

The operator decision is full soft-delete with an optional forced
cascade; hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.

Schema — migration 000015:
  migrations/000015_agent_retire.up.sql flips
  deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
  RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
  boundary instead of quietly destroying targets. Both `agents` and
  `deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
  TEXT pair (TEXT not VARCHAR so operator comments are never
  truncated), indexed via partial indexes WHERE retired_at IS NOT
  NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
  CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
  EXISTS) so repeated runs against partially-migrated databases
  converge. migrations/000015_agent_retire.down.sql restores CASCADE
  and drops the new columns for clean rollback. A dedicated
  repository-layer testcontainers test
  (internal/repository/postgres/migration_000015_test.go) asserts the
  before/after FK action, column presence, index presence, and
  round-trip idempotency under up→down→up.

Domain — sentinel guard + dependency counts:
  internal/domain/connector.go gains IsRetired() on Agent, the
  exported SentinelAgentIDs slice listing server-scanner,
  cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
  four reserved IDs documented in CLAUDE.md and created at startup in
  cmd/server/main.go), IsSentinelAgent(id string) predicate,
  AgentDependencyCounts{ActiveTargets, ActiveCertificates,
  PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
  ActorTypeSystem enum values used by audit emission downstream.
  Coverage locked down by internal/domain/connector_test.go.

Service — 8-step ordered contract:
  internal/service/agent_retire.go:RetireAgent(ctx, id, actor,
  opts{Force, Reason}) enforces a fixed execution order:
  (1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel
      unconditionally; force=true does NOT bypass it.
  (2) fetch — ErrAgentNotFound on miss.
  (3) idempotency — if IsRetired() already, return
      AgentRetirementResult{AlreadyRetired: true} with no new audit
      event and no state change (safe to replay from flaky clients).
  (4) preflight counts — collectAgentDependencyCounts runs
      ActiveTargets, ActiveCertificates, PendingJobs sequentially
      (not in parallel; keeps the per-query timeout predictable and
      matches the repo's existing call-chain shape).
  (5) force-reason guard — opts.Force=true with empty Reason returns
      ErrForceReasonRequired (wired into the 400 status surface).
  (6) dependency guard — HasDependencies() with opts.Force=false
      returns BlockedByDependenciesError{Counts} (wired into the 409
      body with per-bucket counts).
  (7) mutation — single pinned retiredAt := time.Now(); agent
      retirement first, then cascade target retirement if opts.Force,
      all under the repo's single transaction so the two retired_at
      stamps match to the second.
  (8) best-effort audit — agent_retired always; agent_retirement_
      cascaded additionally on the force path. Actor is whatever the
      handler resolves from the request; actor type is mapped by
      resolveActorType (system→System, agent-prefix→Agent, else→User). Audit
      emission failures are logged via slog.Error but do not abort
      the retirement (matches the house convention used by every
      other scheduler-emitted event).
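
  Illustrative sketch, not part of this commit: the guard order in
  steps (1) to (6) and the audit events in step (8) can be modeled as a
  pure decision function. It is written in TypeScript to match the
  frontend code in this commit; the real contract is the Go function
  RetireAgent. decideRetirement, Outcome, and the two callbacks are
  hypothetical names, and HasDependencies() is assumed to mean "any
  count is non-zero".

```typescript
// Hypothetical model of the ordered guards; the real code is Go.
const SENTINEL_AGENT_IDS = ['server-scanner', 'cloud-aws-sm', 'cloud-azure-kv', 'cloud-gcp-sm'];

interface Counts { activeTargets: number; activeCertificates: number; pendingJobs: number }
interface AgentRow { id: string; retiredAt: string | null }
interface RetireOpts { force: boolean; reason: string }

type Outcome =
  | { kind: 'sentinel' }
  | { kind: 'not_found' }
  | { kind: 'already_retired' }
  | { kind: 'force_reason_required' }
  | { kind: 'blocked'; counts: Counts }
  | { kind: 'retired'; cascade: boolean; auditEvents: string[] };

function decideRetirement(
  id: string,
  lookup: (id: string) => AgentRow | undefined,
  countDeps: (id: string) => Counts,
  opts: RetireOpts,
): Outcome {
  // (1) sentinel guard runs first, so force can never bypass it.
  if (SENTINEL_AGENT_IDS.some(function (s) { return s === id; })) return { kind: 'sentinel' };
  // (2) fetch.
  const agent = lookup(id);
  if (!agent) return { kind: 'not_found' };
  // (3) idempotency: no state change and no audit event on replay.
  if (agent.retiredAt !== null) return { kind: 'already_retired' };
  // (4) preflight counts.
  const counts = countDeps(id);
  // (5) force without a reason is rejected before the dependency guard.
  if (opts.force && opts.reason === '') return { kind: 'force_reason_required' };
  // (6) dependencies block the retire unless force is set (assumed: any count > 0).
  const hasDeps = counts.activeTargets + counts.activeCertificates + counts.pendingJobs > 0;
  if (hasDeps && !opts.force) return { kind: 'blocked', counts: counts };
  // (7) + (8) mutation is not modeled; only the audit event names are.
  const auditEvents = ['agent_retired'];
  if (opts.force) auditEvents.push('agent_retirement_cascaded');
  return { kind: 'retired', cascade: opts.force, auditEvents: auditEvents };
}

const sentinelOutcome = decideRetirement(
  'cloud-aws-sm',
  function () { return undefined; },
  function () { return { activeTargets: 0, activeCertificates: 0, pendingJobs: 0 }; },
  { force: true, reason: 'cleanup' },
);
console.log(sentinelOutcome.kind); // "sentinel"
```

  Because the sentinel check precedes the fetch, a reserved ID is
  rejected even when the lookup would have failed.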

  BlockedByDependenciesError implements Error() as
  "active_targets=%d, active_certificates=%d, pending_jobs=%d" and
  Unwrap() → ErrBlockedByDependencies. The single struct satisfies
  errors.Is via Unwrap (used by scheduler-level tests) and errors.As
  via the concrete type (used by the handler to fish out Counts for
  the 409 body). ListRetiredAgents(page, perPage) adds a separate
  paginated accessor with page<1→1 and perPage<1→50 normalization so
  retired rows are queryable without polluting the default agent
  listing.
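
  A minimal sketch of the two small rules in this paragraph, the
  Error() message format and the paging normalization, written in
  TypeScript for illustration only. blockedMessage and normalizePaging
  are hypothetical names; the Go originals are not reproduced here, and
  any upper bound on perPage is left out because this paragraph does
  not describe one.

```typescript
interface DependencyCounts {
  active_targets: number;
  active_certificates: number;
  pending_jobs: number;
}

// Mirrors the "active_targets=%d, active_certificates=%d, pending_jobs=%d" format.
function blockedMessage(c: DependencyCounts): string {
  return 'active_targets=' + c.active_targets +
    ', active_certificates=' + c.active_certificates +
    ', pending_jobs=' + c.pending_jobs;
}

// page < 1 falls back to 1 and perPage < 1 falls back to 50; other values pass through.
function normalizePaging(page: number, perPage: number): { page: number; perPage: number } {
  return {
    page: page < 1 ? 1 : page,
    perPage: perPage < 1 ? 50 : perPage,
  };
}

console.log(blockedMessage({ active_targets: 3, active_certificates: 7, pending_jobs: 2 }));
// "active_targets=3, active_certificates=7, pending_jobs=2"
console.log(normalizePaging(0, -5)); // { page: 1, perPage: 50 }
```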

  The sentinel guard is deliberately stricter than the dependency
  guard: all four reserved IDs are protected, and force=true cannot
  override it. Regression tests
  in internal/service/agent_retire_test.go assert each of the eight
  steps in order, plus sentinel bypass attempts and idempotency
  replay.

Handler + router — status-code surface:
  internal/api/handler/agents.go:RetireAgent exposes seven status
  codes on DELETE /agents/{id}:
    200 on a fresh retirement (body echoes AgentRetirementResult).
    204 on idempotent replay (AlreadyRetired=true; no new audit).
    400 on ErrForceReasonRequired.
    403 on ErrAgentIsSentinel.
    404 on ErrAgentNotFound.
    409 on BlockedByDependenciesError, with a custom body shape
        {error, counts{active_targets, active_certificates,
        pending_jobs}} that bypasses the default ErrorWithRequestID
        envelope so callers get the per-bucket numbers directly.
    500 on any other error.
  Heartbeat HandleHeartbeat returns 410 Gone when the agent is
  retired (ErrAgentRetired), signalling the agent to shut down.
  Query params `force=true` and `reason=<text>` drive the cascade
  path; both are forwarded as url.Values through the new MCP
  transport.
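
  For readers wiring a client against this surface, the seven codes
  listed above can be summarized as a lookup from outcome to HTTP
  status. This is a sketch in TypeScript under assumed names
  (RetireOutcome and statusFor are hypothetical); the handler itself
  is Go and maps Go error values, not string labels.

```typescript
type RetireOutcome =
  | 'retired'
  | 'already_retired'
  | 'force_reason_required'
  | 'sentinel'
  | 'not_found'
  | 'blocked'
  | 'other_error';

function statusFor(outcome: RetireOutcome): number {
  switch (outcome) {
    case 'retired': return 200;               // fresh retirement, body echoes the result
    case 'already_retired': return 204;       // idempotent replay, no body
    case 'force_reason_required': return 400; // force=true with an empty reason
    case 'sentinel': return 403;              // reserved agent, force cannot override
    case 'not_found': return 404;
    case 'blocked': return 409;               // body carries the per-bucket counts
    default: return 500;                      // any other error
  }
}

console.log(statusFor('blocked')); // 409
```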

  internal/api/router/router.go registers GET /api/v1/agents/retired
  literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's
  literal-beats-pattern-var precedence routes "retired" to the
  paginated retired-agents listing instead of fetching a hypothetical
  agent named "retired".

Agent binary — clean shutdown on 410:
  cmd/agent/main.go gains the ErrAgentRetired sentinel, a
  retiredOnce sync.Once, and a retiredSignal chan struct{}. A
  markRetired(source, statusCode, body) helper closes the channel
  exactly once; the Run() select loop observes the close and returns
  ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
  and exits cleanly instead of spinning in the heartbeat retry loop.
  The 410 Gone surface is therefore terminal for the agent process.

MCP transport:
  internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
  a new additive transport method. Client.Delete is path-only; without
  this method the retire tool would silently drop `force` and `reason`,
  turning every cascade retire into a default soft-retire. The new
  method shares do()'s 204 normalization and 4xx/5xx error
  propagation so tool authors get one contract.
  internal/mcp/tools.go + internal/mcp/types.go expose the
  retire_agent tool with Force+Reason inputs wired through
  DeleteWithQuery.
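
  A sketch of the behavior DeleteWithQuery is meant to preserve,
  modeled in TypeScript with the standard URLSearchParams class. The
  Go method is not reproduced here, and deletePathWithQuery is a
  hypothetical name. The point is only that a path-only delete loses
  the parameters while the query-aware form keeps them.

```typescript
// Appends the query string only when there is something to send,
// so a default soft-retire keeps a bare path with no stray "?".
function deletePathWithQuery(path: string, query: { [key: string]: string }): string {
  const qs = new URLSearchParams(query).toString();
  return qs === '' ? path : path + '?' + qs;
}

console.log(deletePathWithQuery('/api/v1/agents/ag-1', {}));
// "/api/v1/agents/ag-1"
console.log(deletePathWithQuery('/api/v1/agents/ag-1', { force: 'true', reason: 'decommissioning rack 7' }));
// "/api/v1/agents/ag-1?force=true&reason=decommissioning+rack+7"
```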

CLI:
  cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
  `agents list --retired` (client-side strip of --retired then
  delegation to ListRetiredAgents, sharing --page/--per-page parsing
  with the default listing) and `agents retire <id> [--force --reason
  "…"]` (mirrors ErrForceReasonRequired — force without reason is
  rejected client-side before the request is sent). JSON + table
  output modes both honor the new columns.

Frontend:
  web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
  web/src/api/client.ts + web/src/api/types.ts expose the retire
  endpoint and the retired-listing. 4 new Vitest regression cases.

OpenAPI:
  api/openapi.yaml documents DELETE /agents/{id} with all seven
  status codes, 410 on heartbeat, and the 409 per-bucket body shape.

Regression coverage (six new test files, all green):
  internal/service/agent_retire_test.go           — 8-step contract + sentinel guards
  internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
  internal/mcp/retire_agent_test.go               — DeleteWithQuery wire-through
  internal/cli/agent_retire_test.go               — --retired listing + --force/--reason pairing
  internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
  internal/domain/connector_test.go               — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies

Files:
  api/openapi.yaml                                — DELETE + 410 + 409 body shape
  cmd/agent/main.go                               — ErrAgentRetired, markRetired, retiredSignal
  cmd/cli/main.go                                 — handleAgents list/get/retire dispatch
  docs/architecture.md, docs/concepts.md,
    docs/testing-guide.md                         — retirement contract narrative
  internal/api/handler/agents.go                  — RetireAgent, status surface, 410 on heartbeat
  internal/api/handler/agent_handler_test.go      — extended coverage
  internal/api/handler/agent_retire_handler_test.go — new
  internal/api/router/router.go                   — /agents/retired before /agents/{id}
  internal/cli/agent_retire_test.go               — new
  internal/cli/client.go                          — ListRetiredAgents + RetireAgent
  internal/domain/connector.go                    — IsRetired, SentinelAgentIDs,
                                                    IsSentinelAgent, AgentDependencyCounts,
                                                    ActorTypeAgent/System
  internal/domain/connector_test.go               — new
  internal/integration/lifecycle_test.go          — retirement fixture
  internal/mcp/client.go                          — DeleteWithQuery additive transport
  internal/mcp/retire_agent_test.go               — new
  internal/mcp/tools.go, internal/mcp/types.go    — retire_agent tool + Force/Reason inputs
  internal/repository/interfaces.go               — AgentRepository retirement methods
  internal/repository/postgres/agent.go           — retire + cascade target retire + counts
  internal/repository/postgres/migration_000015_test.go — new
  internal/service/agent.go                       — wire into AgentService surface
  internal/service/agent_retire.go                — new 8-step contract
  internal/service/agent_retire_test.go           — new
  internal/service/deployment.go                  — skip retired agents
  internal/service/target.go                      — skip retired agents
  internal/service/testutil_test.go               — shared mocks extended
  migrations/000015_agent_retire.up.sql           — new
  migrations/000015_agent_retire.down.sql         — new
  web/src/api/client.ts, types.ts + tests         — retire endpoint wiring
  web/src/pages/AgentsPage.tsx                    — retire UI
Shankar Reddy
2026-04-19 05:24:00 +00:00
parent c17ea577e7
commit 49002c8cba
35 changed files with 4400 additions and 33 deletions
+109
@@ -19,6 +19,8 @@ import {
getAgents,
getAgent,
registerAgent,
retireAgent,
listRetiredAgents,
getJobs,
cancelJob,
approveRenewal,
@@ -399,6 +401,113 @@ describe('API Client', () => {
});
});
// ─── Agent Retirement (I-004) ───────────────────────
//
// These tests pin the GUI's retirement contract against what the backend
// will add in Phase 2b: soft-retire via DELETE, force-cascade via
// ?force=true&reason=..., idempotent 204 on already-retired, 409 blocked
// payload with counts, and a GET /agents/retired listing surface.
//
// All compile-fail until client.ts exports retireAgent + listRetiredAgents
// — the shape of those exports is pinned here rather than assumed.
describe('Agent Retirement (I-004)', () => {
it('retireAgent sends DELETE without query when no force/reason', async () => {
mockFetch.mockReturnValueOnce(
mockJsonResponse({
retired_at: '2026-04-18T12:00:00Z',
already_retired: false,
cascade: false,
}),
);
await retireAgent('ag-1');
const [url, init] = mockFetch.mock.calls[0];
// Default soft-retire: bare path, no stray ? suffix.
expect(url).toBe('/api/v1/agents/ag-1');
expect(init.method).toBe('DELETE');
});
it('retireAgent propagates force+reason as URL query parameters', async () => {
mockFetch.mockReturnValueOnce(
mockJsonResponse({
retired_at: '2026-04-18T12:00:00Z',
already_retired: false,
cascade: true,
counts: { active_targets: 3, active_certificates: 7, pending_jobs: 2 },
}),
);
await retireAgent('ag-1', { force: true, reason: 'decommissioning rack 7' });
const [url, init] = mockFetch.mock.calls[0];
// URLSearchParams encodes space as "+"; "decommissioning rack 7" → "decommissioning+rack+7"
expect(url).toBe(
'/api/v1/agents/ag-1?force=true&reason=decommissioning+rack+7',
);
expect(init.method).toBe('DELETE');
});
it('retireAgent omits force=false even when reason is supplied', async () => {
// Client-side guard: the GUI must never silently promote
// reason-without-force into a force call. (The server's 400
// ErrForceReasonRequired covers the opposite case, force without reason.)
// Pins that reason-only still hits the soft path.
mockFetch.mockReturnValueOnce(
mockJsonResponse({
retired_at: '2026-04-18T12:00:00Z',
already_retired: false,
cascade: false,
}),
);
await retireAgent('ag-1', { reason: 'routine decommission' });
const [url] = mockFetch.mock.calls[0];
// force defaults to false → query carries reason only.
expect(url).toBe('/api/v1/agents/ag-1?reason=routine+decommission');
});
it('retireAgent surfaces the 409 dependency error message to the caller', async () => {
mockFetch.mockReturnValueOnce(
mockErrorResponse(409, {
message: 'agent has 3 active targets, 7 active certificates, 2 pending jobs',
}),
);
await expect(retireAgent('ag-1')).rejects.toThrow(
/active targets|active certificates|pending jobs/,
);
});
it('retireAgent treats 204 (already-retired) as success with empty body', async () => {
mockFetch.mockReturnValueOnce(
Promise.resolve({
ok: true,
status: 204,
json: () => Promise.reject(new Error('204 has no body')),
statusText: 'No Content',
} as Response),
);
// retireAgent synthesizes an already_retired response on 204; the caller must not crash.
const result = await retireAgent('ag-1');
expect(result).toBeDefined();
});
it('listRetiredAgents sends GET /agents/retired with default pagination', async () => {
mockFetch.mockReturnValueOnce(
mockJsonResponse({ data: [], total: 0, page: 1, per_page: 50 }),
);
await listRetiredAgents();
const [url, init] = mockFetch.mock.calls[0];
expect(url).toBe('/api/v1/agents/retired?page=1&per_page=50');
// Default is GET — no explicit method means fetchJSON falls through.
expect(init.method ?? 'GET').toBe('GET');
});
it('listRetiredAgents forwards page/per_page overrides', async () => {
mockFetch.mockReturnValueOnce(
mockJsonResponse({ data: [], total: 0, page: 2, per_page: 100 }),
);
await listRetiredAgents({ page: '2', per_page: '100' });
const [url] = mockFetch.mock.calls[0];
expect(url).toContain('page=2');
expect(url).toContain('per_page=100');
});
});
// ─── Jobs ───────────────────────────────────────────
describe('Jobs', () => {
+93 -1
@@ -1,4 +1,4 @@
-import type { Certificate, CertificateVersion, Agent, Job, Notification, AuditEvent, PolicyRule, PolicyViolation, Issuer, Target, CertificateProfile, Owner, Team, AgentGroup, PaginatedResponse, DashboardSummary, CertificateStatusCount, ExpirationBucket, JobTrendDataPoint, IssuanceRateDataPoint, MetricsResponse, DiscoveredCertificate, DiscoveryScan, DiscoverySummary, NetworkScanTarget, EndpointHealthCheck, HealthHistoryEntry, HealthCheckSummary } from './types';
+import type { Certificate, CertificateVersion, Agent, Job, Notification, AuditEvent, PolicyRule, PolicyViolation, Issuer, Target, CertificateProfile, Owner, Team, AgentGroup, PaginatedResponse, DashboardSummary, CertificateStatusCount, ExpirationBucket, JobTrendDataPoint, IssuanceRateDataPoint, MetricsResponse, DiscoveredCertificate, DiscoveryScan, DiscoverySummary, NetworkScanTarget, EndpointHealthCheck, HealthHistoryEntry, HealthCheckSummary, AgentDependencyCounts, RetireAgentResponse, BlockedByDependenciesResponse } from './types';
const BASE = '/api/v1';
@@ -188,6 +188,98 @@ export const getAgent = (id: string) =>
export const registerAgent = (data: Partial<Agent>) =>
fetchJSON<Agent>(`${BASE}/agents`, { method: 'POST', body: JSON.stringify(data) });
// I-004: typed error thrown by retireAgent when the server returns HTTP 409 with
// {error: "blocked_by_dependencies", ...}. Callers that want to show the
// dependency-counts dialog should `catch (e)` and check `e instanceof
// BlockedByDependenciesError` — the counts field is the same shape the
// backend handler returns from its inline struct in
// internal/api/handler/agents.go. Generic network / 5xx failures still throw
// plain Error so existing error-boundary code is unaffected.
export class BlockedByDependenciesError extends Error {
readonly counts: AgentDependencyCounts;
constructor(message: string, counts: AgentDependencyCounts) {
super(message);
this.name = 'BlockedByDependenciesError';
this.counts = counts;
}
}
// I-004: retire an agent via DELETE /api/v1/agents/{id}. Three distinct
// success paths the UI needs to distinguish:
// * 200 — fresh retire; body has retired_at, already_retired=false, cascade
// flag, counts of what was cascaded.
// * 204 — idempotent re-retire; the row was already retired. No body. We
// synthesize a RetireAgentResponse with already_retired=true and zero
// counts so the caller can keep a single return type.
// * 409 — blocked_by_dependencies; thrown as BlockedByDependenciesError so
// the caller can surface the active_targets/active_certificates/pending_jobs
// counts in a confirmation dialog and offer force=true.
// Anything else bubbles up via the standard fetchJSON error path.
export const retireAgent = async (
id: string,
opts: { force?: boolean; reason?: string } = {},
): Promise<RetireAgentResponse> => {
const qs = new URLSearchParams();
if (opts.force) qs.set('force', 'true');
if (opts.reason) qs.set('reason', opts.reason);
const url = qs.toString()
? `${BASE}/agents/${id}?${qs.toString()}`
: `${BASE}/agents/${id}`;
const res = await fetch(url, {
method: 'DELETE',
headers: authHeaders(),
});
if (res.status === 401) {
window.dispatchEvent(new CustomEvent('certctl:auth-required'));
throw new Error('Authentication required');
}
// 204 No Content — idempotent re-retire. Synthesize a response so callers
// get a uniform shape; already_retired=true tells them the agent was
// already in the retired state before this call.
if (res.status === 204) {
return {
retired_at: '',
already_retired: true,
cascade: false,
counts: { active_targets: 0, active_certificates: 0, pending_jobs: 0 },
};
}
if (res.status === 409) {
// Body is always JSON for 409 per the handler contract.
const body = (await res.json()) as BlockedByDependenciesResponse;
throw new BlockedByDependenciesError(
body.message || 'agent has active dependencies',
body.counts,
);
}
if (!res.ok) {
let errorMsg = res.statusText;
try {
const body = await res.json();
errorMsg = body.message || body.error || errorMsg;
} catch {
// not JSON
}
throw new Error(errorMsg || `HTTP ${res.status}`);
}
return (await res.json()) as RetireAgentResponse;
};
// I-004: list retired agents via GET /api/v1/agents/retired. Kept separate
// from getAgents (which hits the default active-only listing) so the retired
// tab on AgentsPage can page independently. per_page is capped server-side at
// 500 (see handler ListRetiredAgents).
export const listRetiredAgents = (params: Record<string, string> = {}) => {
const qs = new URLSearchParams({ page: '1', per_page: '50', ...params }).toString();
return fetchJSON<PaginatedResponse<Agent>>(`${BASE}/agents/retired?${qs}`);
};
// Jobs
export const getJobs = (params: Record<string, string> = {}) => {
const qs = new URLSearchParams({ page: '1', per_page: '50', ...params }).toString();
+74
@@ -1,5 +1,6 @@
import { describe, it, expect } from 'vitest';
import { POLICY_TYPES, POLICY_SEVERITIES } from './types';
import type { Agent } from './types';
/**
* Regression tests for the policy enum tuples.
@@ -58,3 +59,76 @@ describe('POLICY_SEVERITIES', () => {
expect(POLICY_SEVERITIES as readonly string[]).not.toContain('medium');
});
});
/**
* Regression test for the Agent interface's I-004 soft-retirement shape.
*
* Backend (migration 000015, Phase 2b) adds two nullable timestamps/strings to
* the agents table — `retired_at` and `retired_reason` — mirroring the existing
* Certificate.revoked_at / Certificate.revocation_reason pair. The GUI needs
* these fields on the Agent interface so the Retired tab, retire modal, and
* retirement banner can render the agent's retired state without resorting to
* `(agent as any).retired_at` escapes.
*
* Both fields are optional (types.ts interface) because the server omits them
* from the response for active agents. A compile-time shape check here pins
* that Phase 2b does not drift the field names (e.g. to retiredAt camelCase)
* or accidentally promote them to required.
*
* Compile-fail until Phase 2b adds:
* retired_at?: string;
* retired_reason?: string;
* to the Agent interface in types.ts.
*/
describe('Agent interface (I-004 retirement)', () => {
it('accepts retired_at and retired_reason as optional string fields', () => {
// Construct an Agent with the retirement fields set. If Phase 2b names
// them anything other than retired_at / retired_reason, this fails to
// compile — which is exactly what the Red stage wants.
const retired: Agent = {
id: 'ag-1',
name: 'decom-01',
hostname: 'server-old',
ip_address: '10.0.0.1',
os: 'linux',
architecture: 'amd64',
status: 'Offline',
version: '2.1.0',
last_heartbeat: '2026-01-01T00:00:00Z',
last_heartbeat_at: '2026-01-01T00:00:00Z',
capabilities: [],
tags: {},
registered_at: '2024-01-01T00:00:00Z',
created_at: '2024-01-01T00:00:00Z',
updated_at: '2026-01-01T00:00:00Z',
retired_at: '2026-01-01T00:00:00Z',
retired_reason: 'old hardware',
};
expect(retired.retired_at).toBe('2026-01-01T00:00:00Z');
expect(retired.retired_reason).toBe('old hardware');
});
it('accepts an Agent without retired_at / retired_reason (optional fields)', () => {
// Active agents should not carry retirement metadata. If Phase 2b makes
// the fields required, this block fails to compile.
const active: Agent = {
id: 'ag-2',
name: 'web01',
hostname: 'web01.prod',
ip_address: '10.0.0.2',
os: 'linux',
architecture: 'amd64',
status: 'Online',
version: '2.1.0',
last_heartbeat: '2026-04-18T12:00:00Z',
last_heartbeat_at: '2026-04-18T12:00:00Z',
capabilities: ['deploy', 'scan'],
tags: {},
registered_at: '2024-06-01T00:00:00Z',
created_at: '2024-06-01T00:00:00Z',
updated_at: '2026-04-18T12:00:00Z',
};
expect(active.retired_at).toBeUndefined();
expect(active.retired_reason).toBeUndefined();
});
});
+37
@@ -67,6 +67,43 @@ export interface Agent {
registered_at: string;
created_at: string;
updated_at: string;
// I-004: soft-retirement fields. When retired_at is non-null, the agent is
// tombstoned — it will never heartbeat again and cascaded targets have been
// retired alongside it. The retired tab on AgentsPage uses these to show the
// when/why. The server filters retired rows from the default /api/v1/agents
// listing; they appear only via GET /api/v1/agents/retired.
retired_at?: string | null;
retired_reason?: string | null;
}
// I-004: dependency counts returned by the retire handler in both the 200
// success-with-cascade body and the 409 blocked_by_dependencies body. The
// operator UI uses these to show "this agent has N targets, M certs, K jobs
// depending on it" in the confirm-retire dialog.
export interface AgentDependencyCounts {
active_targets: number;
active_certificates: number;
pending_jobs: number;
}
// I-004: success shape for DELETE /api/v1/agents/{id}. already_retired is
// always false for 200 responses; 204 responses carry no body (the retire was
// idempotent — the agent was already retired). The frontend distinguishes by
// HTTP status, not by this field.
export interface RetireAgentResponse {
retired_at: string;
already_retired: boolean;
cascade: boolean;
counts: AgentDependencyCounts;
}
// I-004: shape returned with HTTP 409 when a retire is blocked by active
// downstream dependencies. Keep in lockstep with the handler's inline struct
// in internal/api/handler/agents.go (search "blocked_by_dependencies").
export interface BlockedByDependenciesResponse {
error: 'blocked_by_dependencies';
message: string;
counts: AgentDependencyCounts;
}
export interface Job {