certctl/web/src/pages/AgentsPage.tsx
shankar0123 0725713e19 Close I-004 (agent hard-delete cascades targets) coverage-gap finding
Operator decision answered as full soft-delete with optional forced
cascade — hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.

Schema — migration 000015:
  migrations/000015_agent_retire.up.sql flips
  deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
  RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
  boundary instead of quietly destroying targets. Both `agents` and
  `deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
  TEXT pair (TEXT not VARCHAR so operator comments are never
  truncated), indexed via partial indexes WHERE retired_at IS NOT
  NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
  CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
  EXISTS) so repeated runs against partially-migrated databases
  converge. migrations/000015_agent_retire.down.sql restores CASCADE
  and drops the new columns for clean rollback. A dedicated
  repository-layer testcontainers test
  (internal/repository/postgres/migration_000015_test.go) asserts the
  before/after FK action, column presence, index presence, and
  round-trip idempotency under up→down→up.

Domain — sentinel guard + dependency counts:
  internal/domain/connector.go gains IsRetired() on Agent, the
  exported SentinelAgentIDs slice listing server-scanner,
  cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
  four reserved IDs documented in CLAUDE.md and created at startup in
  cmd/server/main.go), IsSentinelAgent(id string) predicate,
  AgentDependencyCounts{ActiveTargets, ActiveCertificates,
  PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
  ActorTypeSystem enum values used by audit emission downstream.
  Coverage locked down by internal/domain/connector_test.go.
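
A minimal sketch of the domain helpers described above. Only the names (SentinelAgentIDs, IsSentinelAgent, AgentDependencyCounts, HasDependencies) and the four reserved IDs come from this message; the int field types, the value receiver, and the linear scan are assumptions.

```go
package main

import "fmt"

// SentinelAgentIDs lists the four reserved IDs named in the commit message.
var SentinelAgentIDs = []string{"server-scanner", "cloud-aws-sm", "cloud-azure-kv", "cloud-gcp-sm"}

// IsSentinelAgent reports whether id is one of the reserved sentinel agents.
func IsSentinelAgent(id string) bool {
	for _, s := range SentinelAgentIDs {
		if s == id {
			return true
		}
	}
	return false
}

// AgentDependencyCounts holds the three preflight buckets.
// The int field types are an assumption.
type AgentDependencyCounts struct {
	ActiveTargets, ActiveCertificates, PendingJobs int
}

// HasDependencies is true when any bucket is non-zero.
func (c AgentDependencyCounts) HasDependencies() bool {
	return c.ActiveTargets > 0 || c.ActiveCertificates > 0 || c.PendingJobs > 0
}

func main() {
	fmt.Println(IsSentinelAgent("cloud-aws-sm"), IsSentinelAgent("agent-42"))
	fmt.Println(AgentDependencyCounts{PendingJobs: 1}.HasDependencies())
}
```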

Service — 8-step ordered contract:
  internal/service/agent_retire.go:RetireAgent(ctx, id, actor,
  opts{Force, Reason}) enforces a fixed execution order:
  (1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel
      unconditionally; force=true does NOT bypass it.
  (2) fetch — ErrAgentNotFound on miss.
  (3) idempotency — if IsRetired() already, return
      AgentRetirementResult{AlreadyRetired: true} with no new audit
      event and no state change (safe to replay from flaky clients).
  (4) preflight counts — collectAgentDependencyCounts runs
      ActiveTargets, ActiveCertificates, PendingJobs sequentially
      (not in parallel; keeps the per-query timeout predictable and
      matches the repo's existing call-chain shape).
  (5) force-reason guard — opts.Force=true with empty Reason returns
      ErrForceReasonRequired (wired into the 400 status surface).
  (6) dependency guard — HasDependencies() with opts.Force=false
      returns BlockedByDependenciesError{Counts} (wired into the 409
      body with per-bucket counts).
  (7) mutation — single pinned retiredAt := time.Now(); agent
      retirement first, then cascade target retirement if opts.Force,
      all under the repo's single transaction so the two retired_at
      stamps match to the second.
  (8) best-effort audit — agent_retired always; agent_retirement_
      cascaded additionally on the force path. Actor is whatever the
      handler resolves from the request; actor type is mapped by
      resolveActorType (system/agent-prefix→Agent/else→User). Audit
      emission failures are logged via slog.Error but do not abort
      the retirement (matches the house convention used by every
      other scheduler-emitted event).

  BlockedByDependenciesError implements Error() as
  "active_targets=%d, active_certificates=%d, pending_jobs=%d" and
  Unwrap() → ErrBlockedByDependencies. The single struct satisfies
  errors.Is via Unwrap (used by scheduler-level tests) and errors.As
  via the concrete type (used by the handler to fish out Counts for
  the 409 body). ListRetiredAgents(page, perPage) adds a separate
  paginated accessor with page<1→1 and perPage<1→50 normalization so
  retired rows are queryable without polluting the default agent
  listing.

  The sentinel guard is deliberately asymmetric with the dependency
  guard: force=true bypasses the dependency guard but never the
  sentinel guard, which covers all four reserved IDs. Regression tests
  in internal/service/agent_retire_test.go assert each of the eight
  steps in order, plus sentinel bypass attempts and idempotency
  replay.

Handler + router — status-code surface:
  internal/api/handler/agents.go:RetireAgent exposes seven status
  codes on DELETE /agents/{id}:
    200 on a fresh retirement (body echoes AgentRetirementResult).
    204 on idempotent replay (AlreadyRetired=true; no new audit).
    400 on ErrForceReasonRequired.
    403 on ErrAgentIsSentinel.
    404 on ErrAgentNotFound.
    409 on BlockedByDependenciesError, with a custom body shape
        {error, counts{active_targets, active_certificates,
        pending_jobs}} that bypasses the default ErrorWithRequestID
        envelope so callers get the per-bucket numbers directly.
    500 on any other error.
  HandleHeartbeat returns 410 Gone when the agent is retired
  (ErrAgentRetired), signalling the agent to shut down.
  Query params `force=true` and `reason=<text>` drive the cascade
  path; both are forwarded as url.Values through the new MCP
  transport.

  internal/api/router/router.go registers GET /api/v1/agents/retired
  literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's
  literal-beats-pattern-var precedence routes "retired" to the
  paginated retired-agents listing instead of fetching a hypothetical
  agent named "retired".

Agent binary — clean shutdown on 410:
  cmd/agent/main.go gains the ErrAgentRetired sentinel, a
  retiredOnce sync.Once, and a retiredSignal chan struct{}. A
  markRetired(source, statusCode, body) helper closes the channel
  exactly once; the Run() select loop observes the close and returns
  ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
  and exits cleanly instead of spinning in the heartbeat retry loop.
  The 410 Gone surface is therefore terminal for the agent process.

MCP transport:
  internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
  a new additive transport method. Client.Delete is path-only; without
  this method the retire tool would silently drop `force` and `reason`,
  turning every cascade retire into a default soft-retire. The new
  method shares do()'s 204 normalization and 4xx/5xx error
  propagation so tool authors get one contract.
  internal/mcp/tools.go + internal/mcp/types.go expose the
  retire_agent tool with Force+Reason inputs wired through
  DeleteWithQuery.

CLI:
  cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
  `agents list --retired` (client-side strip of --retired then
  delegation to ListRetiredAgents, sharing --page/--per-page parsing
  with the default listing) and `agents retire <id> [--force --reason
  "…"]` (mirrors ErrForceReasonRequired — force without reason is
  rejected client-side before the request is sent). JSON + table
  output modes both honor the new columns.
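
The two client-side behaviours above (stripping --retired before delegating, and rejecting force without a reason) could look roughly like this. stripFlag and validateRetireFlags are hypothetical names, and the error text is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// stripFlag removes every occurrence of flag from args and reports whether
// it was present, so the remaining args can go to the shared pagination parser.
func stripFlag(args []string, flag string) (rest []string, found bool) {
	rest = make([]string, 0, len(args))
	for _, a := range args {
		if a == flag {
			found = true
			continue
		}
		rest = append(rest, a)
	}
	return rest, found
}

// validateRetireFlags mirrors the server's force-reason guard on the client.
func validateRetireFlags(force bool, reason string) error {
	if force && reason == "" {
		return errors.New("--force requires --reason")
	}
	return nil
}

func main() {
	rest, retired := stripFlag([]string{"--retired", "--page", "2"}, "--retired")
	fmt.Println(rest, retired)
	fmt.Println(validateRetireFlags(true, ""))
}
```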

Frontend:
  web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
  web/src/api/client.ts + web/src/api/types.ts expose the retire
  endpoint and the retired-listing. 4 new Vitest regression cases.

OpenAPI:
  api/openapi.yaml documents DELETE /agents/{id} with all seven
  status codes, 410 on heartbeat, and the 409 per-bucket body shape.

Regression coverage (six new test files, all green):
  internal/service/agent_retire_test.go           — 8-step contract + sentinel guards
  internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
  internal/mcp/retire_agent_test.go               — DeleteWithQuery wire-through
  internal/cli/agent_retire_test.go               — --retired listing + --force/--reason pairing
  internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
  internal/domain/connector_test.go               — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies

Files:
  api/openapi.yaml                                — DELETE + 410 + 409 body shape
  cmd/agent/main.go                               — ErrAgentRetired, markRetired, retiredSignal
  cmd/cli/main.go                                 — handleAgents list/get/retire dispatch
  docs/architecture.md, docs/concepts.md,
    docs/testing-guide.md                         — retirement contract narrative
  internal/api/handler/agents.go                  — RetireAgent, status surface, 410 on heartbeat
  internal/api/handler/agent_handler_test.go      — extended coverage
  internal/api/handler/agent_retire_handler_test.go — new
  internal/api/router/router.go                   — /agents/retired before /agents/{id}
  internal/cli/agent_retire_test.go               — new
  internal/cli/client.go                          — ListRetiredAgents + RetireAgent
  internal/domain/connector.go                    — IsRetired, SentinelAgentIDs,
                                                    IsSentinelAgent, AgentDependencyCounts,
                                                    ActorTypeAgent/System
  internal/domain/connector_test.go               — new
  internal/integration/lifecycle_test.go          — retirement fixture
  internal/mcp/client.go                          — DeleteWithQuery additive transport
  internal/mcp/retire_agent_test.go               — new
  internal/mcp/tools.go, internal/mcp/types.go    — retire_agent tool + Force/Reason inputs
  internal/repository/interfaces.go               — AgentRepository retirement methods
  internal/repository/postgres/agent.go           — retire + cascade target retire + counts
  internal/repository/postgres/migration_000015_test.go — new
  internal/service/agent.go                       — wire into AgentService surface
  internal/service/agent_retire.go                — new 8-step contract
  internal/service/agent_retire_test.go           — new
  internal/service/deployment.go                  — skip retired agents
  internal/service/target.go                      — skip retired agents
  internal/service/testutil_test.go               — shared mocks extended
  migrations/000015_agent_retire.up.sql           — new
  migrations/000015_agent_retire.down.sql         — new
  web/src/api/client.ts, types.ts + tests         — retire endpoint wiring
  web/src/pages/AgentsPage.tsx                    — retire UI
2026-04-19 05:24:00 +00:00


import { useState } from 'react';
import { useNavigate } from 'react-router-dom';
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query';
import {
  getAgents,
  listRetiredAgents,
  retireAgent,
  BlockedByDependenciesError,
} from '../api/client';
import PageHeader from '../components/PageHeader';
import DataTable from '../components/DataTable';
import type { Column } from '../components/DataTable';
import StatusBadge from '../components/StatusBadge';
import ErrorState from '../components/ErrorState';
import { timeAgo } from '../api/utils';
import type { Agent, AgentDependencyCounts } from '../api/types';

function heartbeatStatus(lastHeartbeat: string): string {
  if (!lastHeartbeat) return 'Offline';
  const ago = Date.now() - new Date(lastHeartbeat).getTime();
  if (ago < 5 * 60 * 1000) return 'Online';
  if (ago < 15 * 60 * 1000) return 'Stale';
  return 'Offline';
}

type TabKey = 'active' | 'retired';

// I-004: retire-modal state machine.
//   confirm — operator clicked Retire, shown plain confirm + optional reason.
//   blocked — soft retire returned 409; switch to a force-retire dialog that
//             shows the dependency counts and requires a reason before the
//             operator can opt into ?force=true.
//   error   — any other failure (network, 500, unexpected 4xx). Reused by both
//             the initial attempt and the force retry.
type ModalMode =
  | { kind: 'closed' }
  | { kind: 'confirm'; agent: Agent; reason: string }
  | { kind: 'blocked'; agent: Agent; reason: string; counts: AgentDependencyCounts }
  | { kind: 'error'; agent: Agent; message: string };

export default function AgentsPage() {
  const navigate = useNavigate();
  const qc = useQueryClient();
  const [tab, setTab] = useState<TabKey>('active');
  const [modal, setModal] = useState<ModalMode>({ kind: 'closed' });

  const active = useQuery({
    queryKey: ['agents'],
    queryFn: () => getAgents(),
    refetchInterval: 15000,
    enabled: tab === 'active',
  });
  const retired = useQuery({
    queryKey: ['agents', 'retired'],
    queryFn: () => listRetiredAgents(),
    refetchInterval: 30000,
    enabled: tab === 'retired',
  });

  // retireAgent mutation wrapping both paths. The caller supplies force/reason,
  // and we invalidate both queries on success so the retired tab refreshes and
  // the active tab drops the row. Failures are handled per call in submitRetire:
  // a 409 becomes modal.kind='blocked' so the operator can escalate to force;
  // everything else becomes modal.kind='error'.
  const mutation = useMutation({
    mutationFn: (input: { agent: Agent; force?: boolean; reason?: string }) =>
      retireAgent(input.agent.id, { force: input.force, reason: input.reason }),
    onSuccess: () => {
      qc.invalidateQueries({ queryKey: ['agents'] });
      qc.invalidateQueries({ queryKey: ['agents', 'retired'] });
      setModal({ kind: 'closed' });
    },
  });

  // Shared submit handler: reads agent + reason from the current modal state.
  // The caller chooses soft vs. force retire via the `force` argument; the
  // handler is a no-op unless the modal is in the confirm or blocked state.
  const submitRetire = (force: boolean) => {
    if (modal.kind !== 'confirm' && modal.kind !== 'blocked') return;
    const { agent, reason } = modal;
    mutation.mutate(
      { agent, force, reason: reason || undefined },
      {
        onError: (err) => {
          if (err instanceof BlockedByDependenciesError) {
            setModal({
              kind: 'blocked',
              agent,
              reason,
              counts: err.counts ?? { active_targets: 0, active_certificates: 0, pending_jobs: 0 },
            });
            return;
          }
          setModal({
            kind: 'error',
            agent,
            message: err instanceof Error ? err.message : String(err),
          });
        },
      },
    );
  };

  const activeColumns: Column<Agent>[] = [
    {
      key: 'name',
      label: 'Agent',
      render: (a) => (
        <div>
          <div className="font-medium text-ink">{a.name}</div>
          <div className="text-xs text-ink-faint">{a.id}</div>
        </div>
      ),
    },
    {
      key: 'status',
      label: 'Health',
      render: (a) => <StatusBadge status={a.status || heartbeatStatus(a.last_heartbeat_at)} />,
    },
    {
      key: 'hostname',
      label: 'Hostname',
      render: (a) => <span className="text-ink-muted font-mono text-xs">{a.hostname || '—'}</span>,
    },
    {
      key: 'os',
      label: 'OS / Arch',
      render: (a) => (
        <span className="text-ink-muted text-xs">
          {a.os && a.architecture ? `${a.os}/${a.architecture}` : a.os || '—'}
        </span>
      ),
    },
    {
      key: 'ip',
      label: 'IP Address',
      render: (a) => <span className="text-ink-muted font-mono text-xs">{a.ip_address || '—'}</span>,
    },
    {
      key: 'version',
      label: 'Version',
      render: (a) => <span className="text-ink-muted text-xs">{a.version || '—'}</span>,
    },
    {
      key: 'heartbeat',
      label: 'Last Heartbeat',
      render: (a) => <span className="text-ink-muted text-xs">{timeAgo(a.last_heartbeat_at)}</span>,
    },
    {
      key: 'actions',
      label: '',
      render: (a) => (
        <button
          type="button"
          onClick={(e) => {
            // Table rows are navigable via onRowClick. The retire button must
            // not trigger the row-click handler or the modal will race the
            // navigation and unmount mid-render.
            e.stopPropagation();
            setModal({ kind: 'confirm', agent: a, reason: '' });
          }}
          className="px-3 py-1 text-xs font-medium text-danger border border-danger/30 rounded hover:bg-danger/10"
        >
          Retire
        </button>
      ),
    },
  ];

  const retiredColumns: Column<Agent>[] = [
    {
      key: 'name',
      label: 'Agent',
      render: (a) => (
        <div>
          <div className="font-medium text-ink">{a.name}</div>
          <div className="text-xs text-ink-faint">{a.id}</div>
        </div>
      ),
    },
    {
      key: 'hostname',
      label: 'Hostname',
      render: (a) => <span className="text-ink-muted font-mono text-xs">{a.hostname || '—'}</span>,
    },
    {
      key: 'os',
      label: 'OS / Arch',
      render: (a) => (
        <span className="text-ink-muted text-xs">
          {a.os && a.architecture ? `${a.os}/${a.architecture}` : a.os || '—'}
        </span>
      ),
    },
    {
      key: 'retired_at',
      label: 'Retired',
      render: (a) => <span className="text-ink-muted text-xs">{timeAgo(a.retired_at || '')}</span>,
    },
    {
      key: 'retired_reason',
      label: 'Reason',
      render: (a) => (
        // Fall back to the same placeholder the other columns use when no
        // reason was recorded, instead of rendering an empty element.
        <span className="text-ink-muted text-xs">{a.retired_reason || <em>—</em>}</span>
      ),
    },
  ];

  const currentQuery = tab === 'active' ? active : retired;
  const currentColumns = tab === 'active' ? activeColumns : retiredColumns;
  const emptyMessage = tab === 'active' ? 'No agents registered' : 'No retired agents';

  return (
    <>
      <PageHeader
        title="Agents"
        subtitle={
          tab === 'active' && active.data
            ? `${active.data.total} active`
            : tab === 'retired' && retired.data
              ? `${retired.data.total} retired`
              : undefined
        }
      />
      <div className="px-6 pt-2">
        <div className="flex gap-2 border-b border-border">
          <TabButton active={tab === 'active'} onClick={() => setTab('active')}>
            Active
          </TabButton>
          <TabButton active={tab === 'retired'} onClick={() => setTab('retired')}>
            Retired
          </TabButton>
        </div>
      </div>
      <div className="flex-1 overflow-y-auto">
        {currentQuery.error ? (
          <ErrorState error={currentQuery.error as Error} onRetry={() => currentQuery.refetch()} />
        ) : (
          <DataTable
            columns={currentColumns}
            data={currentQuery.data?.data || []}
            isLoading={currentQuery.isLoading}
            emptyMessage={emptyMessage}
            onRowClick={(a) => navigate(`/agents/${a.id}`)}
          />
        )}
      </div>
      {modal.kind !== 'closed' && (
        <RetireModal
          mode={modal}
          pending={mutation.isPending}
          onClose={() => setModal({ kind: 'closed' })}
          onReasonChange={(reason) => {
            if (modal.kind === 'confirm') setModal({ ...modal, reason });
            if (modal.kind === 'blocked') setModal({ ...modal, reason });
          }}
          onSoftRetire={() => submitRetire(false)}
          onForceRetire={() => submitRetire(true)}
        />
      )}
    </>
  );
}

function TabButton({
  active,
  onClick,
  children,
}: {
  active: boolean;
  onClick: () => void;
  children: React.ReactNode;
}) {
  return (
    <button
      type="button"
      onClick={onClick}
      className={
        active
          ? 'px-4 py-2 text-sm font-medium text-ink border-b-2 border-accent -mb-px'
          : 'px-4 py-2 text-sm text-ink-muted hover:text-ink'
      }
    >
      {children}
    </button>
  );
}

function RetireModal({
  mode,
  pending,
  onClose,
  onReasonChange,
  onSoftRetire,
  onForceRetire,
}: {
  mode: ModalMode;
  pending: boolean;
  onClose: () => void;
  onReasonChange: (reason: string) => void;
  onSoftRetire: () => void;
  onForceRetire: () => void;
}) {
  if (mode.kind === 'closed') return null;
  return (
    <div
      role="dialog"
      aria-modal="true"
      className="fixed inset-0 z-40 flex items-center justify-center bg-black/40"
      onClick={onClose}
    >
      <div
        className="w-full max-w-lg rounded-lg bg-surface p-6 shadow-lg border border-border"
        onClick={(e) => e.stopPropagation()}
      >
        {mode.kind === 'confirm' && (
          <>
            <h2 className="text-lg font-semibold text-ink">Retire agent</h2>
            <p className="mt-2 text-sm text-ink-muted">
              <span className="font-mono">{mode.agent.name}</span> ({mode.agent.id}) will be
              soft-retired. Its heartbeats will be rejected and it will be removed from active
              listings. This is reversible only by direct database intervention.
            </p>
            <label className="mt-4 block text-xs font-medium text-ink-muted">
              Reason (optional)
              <input
                type="text"
                value={mode.reason}
                onChange={(e) => onReasonChange(e.target.value)}
                placeholder="e.g. decommissioning rack 7"
                className="mt-1 w-full rounded border border-border bg-surface-alt px-2 py-1 text-sm"
              />
            </label>
            <div className="mt-6 flex justify-end gap-2">
              <button
                type="button"
                onClick={onClose}
                className="px-4 py-2 text-sm text-ink-muted hover:text-ink"
                disabled={pending}
              >
                Cancel
              </button>
              <button
                type="button"
                onClick={onSoftRetire}
                disabled={pending}
                className="px-4 py-2 text-sm font-medium text-white bg-danger rounded hover:bg-danger/90 disabled:opacity-50"
              >
                {pending ? 'Retiring…' : 'Retire'}
              </button>
            </div>
          </>
        )}
        {mode.kind === 'blocked' && (
          <>
            <h2 className="text-lg font-semibold text-ink">Cannot retire: active dependencies</h2>
            <p className="mt-2 text-sm text-ink-muted">
              The agent <span className="font-mono">{mode.agent.name}</span> still has downstream
              work tied to it. Force-retiring will cascade-retire all active targets and fail any
              pending jobs.
            </p>
            <dl className="mt-4 grid grid-cols-3 gap-3 text-center">
              <div className="rounded border border-border bg-surface-alt p-3">
                <dt className="text-xs text-ink-muted">Active targets</dt>
                <dd className="mt-1 text-xl font-semibold text-ink">{mode.counts.active_targets}</dd>
              </div>
              <div className="rounded border border-border bg-surface-alt p-3">
                <dt className="text-xs text-ink-muted">Active certs</dt>
                <dd className="mt-1 text-xl font-semibold text-ink">
                  {mode.counts.active_certificates}
                </dd>
              </div>
              <div className="rounded border border-border bg-surface-alt p-3">
                <dt className="text-xs text-ink-muted">Pending jobs</dt>
                <dd className="mt-1 text-xl font-semibold text-ink">{mode.counts.pending_jobs}</dd>
              </div>
            </dl>
            <label className="mt-4 block text-xs font-medium text-ink-muted">
              Reason <span className="text-danger">(required for force retire)</span>
              <input
                type="text"
                value={mode.reason}
                onChange={(e) => onReasonChange(e.target.value)}
                placeholder="e.g. rack 7 decommission, cascade retire"
                className="mt-1 w-full rounded border border-border bg-surface-alt px-2 py-1 text-sm"
              />
            </label>
            <div className="mt-6 flex justify-end gap-2">
              <button
                type="button"
                onClick={onClose}
                className="px-4 py-2 text-sm text-ink-muted hover:text-ink"
                disabled={pending}
              >
                Cancel
              </button>
              <button
                type="button"
                onClick={onForceRetire}
                // Backend enforces reason on force; keep the GUI in lockstep
                // rather than letting a 400 bounce back.
                disabled={pending || !mode.reason.trim()}
                className="px-4 py-2 text-sm font-medium text-white bg-danger rounded hover:bg-danger/90 disabled:opacity-50"
              >
                {pending ? 'Force-retiring…' : 'Force retire'}
              </button>
            </div>
          </>
        )}
        {mode.kind === 'error' && (
          <>
            <h2 className="text-lg font-semibold text-ink">Retire failed</h2>
            <p className="mt-2 text-sm text-danger">{mode.message}</p>
            <div className="mt-6 flex justify-end">
              <button
                type="button"
                onClick={onClose}
                className="px-4 py-2 text-sm text-ink-muted hover:text-ink"
              >
                Close
              </button>
            </div>
          </>
        )}
      </div>
    </div>
  );
}