shankar0123 0725713e19 Close I-004 (agent hard-delete cascades targets) coverage-gap finding
Operator decision answered as full soft-delete with optional forced
cascade — hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.

Schema — migration 000015:
  migrations/000015_agent_retire.up.sql flips
  deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
  RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
  boundary instead of quietly destroying targets. Both `agents` and
  `deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
  TEXT pair (TEXT not VARCHAR so operator comments are never
  truncated), indexed via partial indexes WHERE retired_at IS NOT
  NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
  CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
  EXISTS) so repeated runs against partially-migrated databases
  converge. migrations/000015_agent_retire.down.sql restores CASCADE
  and drops the new columns for clean rollback. A dedicated
  repository-layer testcontainers test
  (internal/repository/postgres/migration_000015_test.go) asserts the
  before/after FK action, column presence, and re-apply idempotency
  under up→down→up.
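
  The self-healing shape described above can be sketched as an
  ordered statement list. This is an illustration, not the contents
  of the migration file: column types and the constraint name come
  from this message, while the index names (idx_agents_retired_at,
  idx_deployment_targets_retired_at) and the helper functions are
  hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// retireUpStatements sketches a re-runnable up migration. Every statement is
// guarded (IF NOT EXISTS / IF EXISTS) so a second run converges instead of
// failing. Index names are hypothetical.
func retireUpStatements() []string {
	return []string{
		`ALTER TABLE agents ADD COLUMN IF NOT EXISTS retired_at TIMESTAMPTZ`,
		`ALTER TABLE agents ADD COLUMN IF NOT EXISTS retired_reason TEXT`,
		`ALTER TABLE deployment_targets ADD COLUMN IF NOT EXISTS retired_at TIMESTAMPTZ`,
		`ALTER TABLE deployment_targets ADD COLUMN IF NOT EXISTS retired_reason TEXT`,
		// Drop-then-add makes the FK flip repeatable.
		`ALTER TABLE deployment_targets DROP CONSTRAINT IF EXISTS deployment_targets_agent_id_fkey`,
		`ALTER TABLE deployment_targets ADD CONSTRAINT deployment_targets_agent_id_fkey FOREIGN KEY (agent_id) REFERENCES agents(id) ON DELETE RESTRICT`,
		`CREATE INDEX IF NOT EXISTS idx_agents_retired_at ON agents (retired_at) WHERE retired_at IS NOT NULL`,
		`CREATE INDEX IF NOT EXISTS idx_deployment_targets_retired_at ON deployment_targets (retired_at) WHERE retired_at IS NOT NULL`,
	}
}

// countContaining reports how many statements contain needle.
func countContaining(stmts []string, needle string) int {
	n := 0
	for _, s := range stmts {
		if strings.Contains(s, needle) {
			n++
		}
	}
	return n
}

func main() {
	stmts := retireUpStatements()
	for _, s := range stmts {
		fmt.Println(s + ";")
	}
	fmt.Println("RESTRICT constraints:", countContaining(stmts, "ON DELETE RESTRICT"))
}
```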

Domain — sentinel guard + dependency counts:
  internal/domain/connector.go gains IsRetired() on Agent, the
  exported SentinelAgentIDs slice listing server-scanner,
  cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
  four reserved IDs documented in CLAUDE.md and created at startup in
  cmd/server/main.go), IsSentinelAgent(id string) predicate,
  AgentDependencyCounts{ActiveTargets, ActiveCertificates,
  PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
  ActorTypeSystem enum values used by audit emission downstream.
  Coverage locked down by internal/domain/connector_test.go.
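
  A minimal sketch of the domain additions listed above. The
  identifiers and the four sentinel IDs come from this message; the
  field types, the RetiredAt pointer, and the "> 0" test inside
  HasDependencies are assumptions about the real code.

```go
package main

import (
	"fmt"
	"time"
)

// SentinelAgentIDs lists the reserved agents used by discovery sources.
var SentinelAgentIDs = []string{
	"server-scanner",
	"cloud-aws-sm",
	"cloud-azure-kv",
	"cloud-gcp-sm",
}

// IsSentinelAgent reports whether id is one of the reserved sentinel agents.
func IsSentinelAgent(id string) bool {
	for _, s := range SentinelAgentIDs {
		if s == id {
			return true
		}
	}
	return false
}

// Agent is trimmed to the fields this sketch needs (assumed shape).
type Agent struct {
	ID        string
	RetiredAt *time.Time
}

// IsRetired reports whether the agent carries a retirement timestamp.
func (a Agent) IsRetired() bool { return a.RetiredAt != nil }

// AgentDependencyCounts holds the preflight counts (int fields are assumed).
type AgentDependencyCounts struct {
	ActiveTargets      int
	ActiveCertificates int
	PendingJobs        int
}

// HasDependencies is true when any bucket is non-zero.
func (c AgentDependencyCounts) HasDependencies() bool {
	return c.ActiveTargets > 0 || c.ActiveCertificates > 0 || c.PendingJobs > 0
}

func main() {
	now := time.Now()
	fmt.Println(IsSentinelAgent("cloud-aws-sm"))
	fmt.Println(Agent{ID: "agent-1", RetiredAt: &now}.IsRetired())
	fmt.Println(AgentDependencyCounts{PendingJobs: 1}.HasDependencies())
}
```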

Service — 8-step ordered contract:
  internal/service/agent_retire.go:RetireAgent(ctx, id, actor,
  opts{Force, Reason}) enforces a fixed execution order:
  (1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel
      unconditionally; force=true does NOT bypass it.
  (2) fetch — ErrAgentNotFound on miss.
  (3) idempotency — if IsRetired() already, return
      AgentRetirementResult{AlreadyRetired: true} with no new audit
      event and no state change (safe to replay from flaky clients).
  (4) preflight counts — collectAgentDependencyCounts runs
      ActiveTargets, ActiveCertificates, PendingJobs sequentially
      (not in parallel; keeps the per-query timeout predictable and
      matches the repo's existing call-chain shape).
  (5) force-reason guard — opts.Force=true with empty Reason returns
      ErrForceReasonRequired (wired into the 400 status surface).
  (6) dependency guard — HasDependencies() with opts.Force=false
      returns BlockedByDependenciesError{Counts} (wired into the 409
      body with per-bucket counts).
  (7) mutation — a single pinned retiredAt := time.Now() is used for
      both writes: agent retirement first, then cascade target
      retirement if opts.Force, all under the repo's single
      transaction, so the two retired_at stamps are identical.
  (8) best-effort audit — agent_retired always; agent_retirement_
      cascaded additionally on the force path. Actor is whatever the
      handler resolves from the request; actor type is mapped by
      resolveActorType (system/agent-prefix→Agent/else→User). Audit
      emission failures are logged via slog.Error but do not abort
      the retirement (matches the house convention used by every
      other scheduler-emitted event).
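
  The guard ordering above can be sketched against an in-memory map.
  This is a simplified stand-in, not the service code: the store, the
  single deps counter (in place of the three per-bucket counts), and
  the result struct are hypothetical, and steps (7) and (8) are
  reduced to a timestamp write with audit emission omitted.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	ErrAgentIsSentinel       = errors.New("agent is a reserved sentinel")
	ErrAgentNotFound         = errors.New("agent not found")
	ErrForceReasonRequired   = errors.New("force requires a reason")
	ErrBlockedByDependencies = errors.New("blocked by dependencies")
)

// agent is a hypothetical in-memory row.
type agent struct {
	retiredAt *time.Time
	deps      int // stand-in for targets + certificates + jobs
}

type result struct {
	AlreadyRetired bool
	Cascaded       bool
}

var sentinels = map[string]bool{
	"server-scanner": true, "cloud-aws-sm": true,
	"cloud-azure-kv": true, "cloud-gcp-sm": true,
}

// retire applies the guards in the documented order.
func retire(store map[string]*agent, id string, force bool, reason string) (result, error) {
	if sentinels[id] { // (1) sentinel guard, not bypassable by force
		return result{}, ErrAgentIsSentinel
	}
	a, ok := store[id] // (2) fetch
	if !ok {
		return result{}, ErrAgentNotFound
	}
	if a.retiredAt != nil { // (3) idempotent replay, no state change
		return result{AlreadyRetired: true}, nil
	}
	deps := a.deps              // (4) preflight counts
	if force && reason == "" { // (5) force-reason guard
		return result{}, ErrForceReasonRequired
	}
	if deps > 0 && !force { // (6) dependency guard
		return result{}, ErrBlockedByDependencies
	}
	now := time.Now() // (7) one pinned timestamp for every write
	a.retiredAt = &now
	// (8) audit emission would go here; omitted in this sketch.
	return result{Cascaded: force}, nil
}

func main() {
	store := map[string]*agent{"agent-1": {deps: 2}}
	_, err := retire(store, "agent-1", false, "")
	fmt.Println(err)
	res, err := retire(store, "agent-1", true, "host decommissioned")
	fmt.Println(res, err)
}
```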

  BlockedByDependenciesError implements Error() as
  "active_targets=%d, active_certificates=%d, pending_jobs=%d" and
  Unwrap() → ErrBlockedByDependencies. The single struct satisfies
  errors.Is via Unwrap (used by scheduler-level tests) and errors.As
  via the concrete type (used by the handler to fish out Counts for
  the 409 body). ListRetiredAgents(page, perPage) adds a separate
  paginated accessor with page<1→1 and perPage<1→50 normalization so
  retired rows are queryable without polluting the default agent
  listing.
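
  A sketch of the error type and the pagination normalization just
  described. The Error() format string, the Unwrap target, and the
  page/perPage defaults are taken from this message; the pointer
  receiver, the sentinel's message text, and the helper names
  isBlocked, countsFrom, and normalizePage are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrBlockedByDependencies = errors.New("agent has active dependencies")

type AgentDependencyCounts struct {
	ActiveTargets      int
	ActiveCertificates int
	PendingJobs        int
}

// BlockedByDependenciesError carries the per-bucket counts.
type BlockedByDependenciesError struct {
	Counts AgentDependencyCounts
}

func (e *BlockedByDependenciesError) Error() string {
	return fmt.Sprintf("active_targets=%d, active_certificates=%d, pending_jobs=%d",
		e.Counts.ActiveTargets, e.Counts.ActiveCertificates, e.Counts.PendingJobs)
}

// Unwrap lets errors.Is match the sentinel.
func (e *BlockedByDependenciesError) Unwrap() error { return ErrBlockedByDependencies }

// isBlocked is the errors.Is path (sentinel comparison).
func isBlocked(err error) bool { return errors.Is(err, ErrBlockedByDependencies) }

// countsFrom is the errors.As path (extract the concrete type).
func countsFrom(err error) (AgentDependencyCounts, bool) {
	var blocked *BlockedByDependenciesError
	if errors.As(err, &blocked) {
		return blocked.Counts, true
	}
	return AgentDependencyCounts{}, false
}

// normalizePage applies the page<1→1 and perPage<1→50 defaults.
func normalizePage(page, perPage int) (int, int) {
	if page < 1 {
		page = 1
	}
	if perPage < 1 {
		perPage = 50
	}
	return page, perPage
}

func main() {
	var err error = &BlockedByDependenciesError{Counts: AgentDependencyCounts{ActiveTargets: 3}}
	wrapped := fmt.Errorf("retire agent: %w", err)
	counts, ok := countsFrom(wrapped)
	fmt.Println(isBlocked(wrapped), ok, counts.ActiveTargets)
	fmt.Println(normalizePage(0, -5))
}
```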

  Sentinel guard coverage is unconditional by design: all four
  reserved IDs are protected, and force=true cannot override.
  Regression tests in internal/service/agent_retire_test.go assert
  each of the eight steps in order, plus sentinel bypass attempts and
  idempotency replay.

Handler + router — status-code surface:
  internal/api/handler/agents.go:RetireAgent exposes seven status
  codes on DELETE /agents/{id}:
    200 on a fresh retirement (body echoes AgentRetirementResult).
    204 on idempotent replay (AlreadyRetired=true; no new audit).
    400 on ErrForceReasonRequired.
    403 on ErrAgentIsSentinel.
    404 on ErrAgentNotFound.
    409 on BlockedByDependenciesError, with a custom body shape
        {error, counts{active_targets, active_certificates,
        pending_jobs}} that bypasses the default ErrorWithRequestID
        envelope so callers get the per-bucket numbers directly.
    500 on any other error.
  Heartbeat HandleHeartbeat returns 410 Gone when the agent is
  retired (ErrAgentRetired), signalling the agent to shut down.
  Query params `force=true` and `reason=<text>` drive the cascade
  path; both are forwarded as url.Values through the new MCP
  transport.
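
  The status surface above amounts to an error-to-code mapping,
  sketched below. The function names retireStatus and heartbeatStatus
  and the sentinel message strings are hypothetical, the 409 case is
  reduced to a sentinel match (the real handler also extracts Counts),
  and the heartbeat codes other than 410 are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

var (
	ErrForceReasonRequired   = errors.New("force requires a reason")
	ErrAgentIsSentinel       = errors.New("agent is a reserved sentinel")
	ErrAgentNotFound         = errors.New("agent not found")
	ErrBlockedByDependencies = errors.New("blocked by dependencies")
	ErrAgentRetired          = errors.New("agent retired")
)

// retireStatus maps a RetireAgent outcome to the DELETE /agents/{id} status.
func retireStatus(alreadyRetired bool, err error) int {
	switch {
	case err == nil && alreadyRetired:
		return http.StatusNoContent // idempotent replay
	case err == nil:
		return http.StatusOK // fresh retirement
	case errors.Is(err, ErrForceReasonRequired):
		return http.StatusBadRequest
	case errors.Is(err, ErrAgentIsSentinel):
		return http.StatusForbidden
	case errors.Is(err, ErrAgentNotFound):
		return http.StatusNotFound
	case errors.Is(err, ErrBlockedByDependencies):
		return http.StatusConflict
	default:
		return http.StatusInternalServerError
	}
}

// heartbeatStatus returns 410 Gone for a retired agent.
func heartbeatStatus(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrAgentRetired):
		return http.StatusGone
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	wrapped := fmt.Errorf("retire: %w", ErrBlockedByDependencies)
	fmt.Println(retireStatus(false, wrapped))
	fmt.Println(heartbeatStatus(ErrAgentRetired))
}
```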

  internal/api/router/router.go registers GET /api/v1/agents/retired
  literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's
  literal-beats-pattern-var precedence routes "retired" to the
  paginated retired-agents listing instead of fetching a hypothetical
  agent named "retired".

Agent binary — clean shutdown on 410:
  cmd/agent/main.go gains the ErrAgentRetired sentinel, a
  retiredOnce sync.Once, and a retiredSignal chan struct{}. A
  markRetired(source, statusCode, body) helper closes the channel
  exactly once; the Run() select loop observes the close and returns
  ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
  and exits cleanly instead of spinning in the heartbeat retry loop.
  The 410 Gone surface is therefore terminal for the agent process.
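
  A sketch of the close-once shutdown signal described above. The
  names ErrAgentRetired, retiredOnce, retiredSignal, and markRetired
  come from this message; the Agent struct, the simplified
  markRetired() signature (the real helper takes source, statusCode,
  body), and the heartbeat callback are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var ErrAgentRetired = errors.New("agent retired")

// Agent holds only the shutdown plumbing for this sketch.
type Agent struct {
	retiredOnce   sync.Once
	retiredSignal chan struct{}
}

func NewAgent() *Agent {
	return &Agent{retiredSignal: make(chan struct{})}
}

// markRetired closes the signal channel exactly once; repeat calls are no-ops,
// so several code paths can report a 410 without a double-close panic.
func (a *Agent) markRetired() {
	a.retiredOnce.Do(func() { close(a.retiredSignal) })
}

// Run calls heartbeat on every tick and returns ErrAgentRetired once the
// signal channel is closed. heartbeat returns an HTTP status code.
func (a *Agent) Run(heartbeat func() int, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-a.retiredSignal:
			return ErrAgentRetired
		case <-ticker.C:
			if heartbeat() == 410 {
				a.markRetired()
			}
		}
	}
}

func main() {
	a := NewAgent()
	calls := 0
	err := a.Run(func() int {
		calls++
		if calls >= 3 {
			return 410 // server reports the agent as retired
		}
		return 200
	}, 10*time.Millisecond)
	if errors.Is(err, ErrAgentRetired) {
		fmt.Println("agent retired; exiting cleanly")
	}
}
```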

MCP transport:
  internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
  a new additive transport method. Client.Delete is path-only; without
  this method the retire tool would silently drop `force` and `reason`,
  turning every cascade retire into a default soft-retire. The new
  method shares do()'s 204 normalization and 4xx/5xx error
  propagation so tool authors get one contract.
  internal/mcp/tools.go + internal/mcp/types.go expose the
  retire_agent tool with Force+Reason inputs wired through
  DeleteWithQuery.
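
  The silent-drop problem and the query-carrying fix can be sketched
  as follows. This is not the MCP client: the Client struct, the
  (int, error) return shape, the retireQuery and captureDelete
  helpers, and the capturing transport are all hypothetical, and the
  204 normalization shared with do() is left out. No network access
  is needed because the transport is stubbed.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// Client is a hypothetical minimal transport.
type Client struct {
	BaseURL string
	HTTP    *http.Client
}

// DeleteWithQuery issues a DELETE with the query string attached to the path.
func (c *Client) DeleteWithQuery(path string, query url.Values) (int, error) {
	target := c.BaseURL + path
	if len(query) > 0 {
		target += "?" + query.Encode()
	}
	req, err := http.NewRequest(http.MethodDelete, target, nil)
	if err != nil {
		return 0, err
	}
	resp, err := c.HTTP.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// retireQuery builds the force/reason parameters for a retire call.
func retireQuery(force bool, reason string) url.Values {
	q := url.Values{}
	if force {
		q.Set("force", "true")
	}
	if reason != "" {
		q.Set("reason", reason)
	}
	return q
}

// captureTransport records the outgoing request instead of sending it.
type captureTransport struct{ last *http.Request }

func (t *captureTransport) RoundTrip(r *http.Request) (*http.Response, error) {
	t.last = r
	return &http.Response{
		StatusCode: http.StatusOK,
		Header:     make(http.Header),
		Body:       http.NoBody,
		Request:    r,
	}, nil
}

// captureDelete returns the method and raw query the client would send.
func captureDelete(path string, q url.Values) (string, string, error) {
	t := &captureTransport{}
	c := &Client{BaseURL: "http://example.invalid", HTTP: &http.Client{Transport: t}}
	if _, err := c.DeleteWithQuery(path, q); err != nil {
		return "", "", err
	}
	return t.last.Method, t.last.URL.RawQuery, nil
}

func main() {
	method, rawQuery, err := captureDelete("/api/v1/agents/agent-1", retireQuery(true, "host decommissioned"))
	fmt.Println(method, rawQuery, err)
}
```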

CLI:
  cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
  `agents list --retired` (client-side strip of --retired then
  delegation to ListRetiredAgents, sharing --page/--per-page parsing
  with the default listing) and `agents retire <id> [--force --reason
  "…"]` (mirrors ErrForceReasonRequired — force without reason is
  rejected client-side before the request is sent). JSON + table
  output modes both honor the new columns.
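
  The two client-side behaviours above can be sketched as small pure
  functions. Both helper names (stripRetired, validateRetire) and the
  error text are hypothetical; only the --retired strip and the
  force-without-reason rejection come from this message.

```go
package main

import (
	"errors"
	"fmt"
)

var errForceReasonRequired = errors.New("--force requires --reason")

// stripRetired removes --retired from args and reports whether it was present,
// so the remaining flags can go through the shared --page/--per-page parsing.
func stripRetired(args []string) ([]string, bool) {
	rest := make([]string, 0, len(args))
	retired := false
	for _, a := range args {
		if a == "--retired" {
			retired = true
			continue
		}
		rest = append(rest, a)
	}
	return rest, retired
}

// validateRetire rejects force without a reason before any request is sent.
func validateRetire(force bool, reason string) error {
	if force && reason == "" {
		return errForceReasonRequired
	}
	return nil
}

func main() {
	rest, retired := stripRetired([]string{"--page", "2", "--retired"})
	fmt.Println(rest, retired)
	fmt.Println(validateRetire(true, ""))
}
```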

Frontend:
  web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
  web/src/api/client.ts + web/src/api/types.ts expose the retire
  endpoint and the retired-listing. 4 new Vitest regression cases.

OpenAPI:
  api/openapi.yaml documents DELETE /agents/{id} with all seven
  status codes, 410 on heartbeat, and the 409 per-bucket body shape.

Regression coverage (six new test files, all green):
  internal/service/agent_retire_test.go           — 8-step contract + sentinel guards
  internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
  internal/mcp/retire_agent_test.go               — DeleteWithQuery wire-through
  internal/cli/agent_retire_test.go               — --retired listing + --force/--reason pairing
  internal/repository/postgres/migration_000015_test.go — FK flip + columns + up↔down
  internal/domain/connector_test.go               — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies

Files:
  api/openapi.yaml                                — DELETE + 410 + 409 body shape
  cmd/agent/main.go                               — ErrAgentRetired, markRetired, retiredSignal
  cmd/cli/main.go                                 — handleAgents list/get/retire dispatch
  docs/architecture.md, docs/concepts.md,
    docs/testing-guide.md                         — retirement contract narrative
  internal/api/handler/agents.go                  — RetireAgent, status surface, 410 on heartbeat
  internal/api/handler/agent_handler_test.go      — extended coverage
  internal/api/handler/agent_retire_handler_test.go — new
  internal/api/router/router.go                   — /agents/retired before /agents/{id}
  internal/cli/agent_retire_test.go               — new
  internal/cli/client.go                          — ListRetiredAgents + RetireAgent
  internal/domain/connector.go                    — IsRetired, SentinelAgentIDs,
                                                    IsSentinelAgent, AgentDependencyCounts,
                                                    ActorTypeAgent/System
  internal/domain/connector_test.go               — new
  internal/integration/lifecycle_test.go          — retirement fixture
  internal/mcp/client.go                          — DeleteWithQuery additive transport
  internal/mcp/retire_agent_test.go               — new
  internal/mcp/tools.go, internal/mcp/types.go    — retire_agent tool + Force/Reason inputs
  internal/repository/interfaces.go               — AgentRepository retirement methods
  internal/repository/postgres/agent.go           — retire + cascade target retire + counts
  internal/repository/postgres/migration_000015_test.go — new
  internal/service/agent.go                       — wire into AgentService surface
  internal/service/agent_retire.go                — new 8-step contract
  internal/service/agent_retire_test.go           — new
  internal/service/deployment.go                  — skip retired agents
  internal/service/target.go                      — skip retired agents
  internal/service/testutil_test.go               — shared mocks extended
  migrations/000015_agent_retire.up.sql           — new
  migrations/000015_agent_retire.down.sql         — new
  web/src/api/client.ts, types.ts + tests         — retire endpoint wiring
  web/src/pages/AgentsPage.tsx                    — retire UI
2026-04-19 05:24:00 +00:00


package postgres_test

import (
	"context"
	"database/sql"
	"os"
	"path/filepath"
	"strings"
	"testing"
)

// TestMigration000015_AgentRetireRoundTrip is the Phase 2a Red regression test
// for I-004 ("Agent hard-delete cascades through deployment_targets + jobs").
//
// The fix depends on a new migration, 000015_agent_retire.up.sql + .down.sql,
// which must:
//
//  1. Add nullable `retired_at TIMESTAMPTZ` and `retired_reason TEXT`
//     columns to the `agents` table. These mirror the revoked_at /
//     revocation_reason pair on managed_certificates (migration 000005).
//
//  2. Add nullable `retired_at TIMESTAMPTZ` and `retired_reason TEXT` columns
//     to `deployment_targets`. When an agent is retired with force=true,
//     its deployment_targets must be soft-retired (not deleted) so audit
//     history — who deployed what to where, when — stays intact.
//
//  3. FLIP the foreign key on `deployment_targets.agent_id → agents.id`
//     from `ON DELETE CASCADE` (migration 000001, line 104) to
//     `ON DELETE RESTRICT`. This is the fail-closed change that makes a
//     bare `DELETE FROM agents WHERE id = $1` blow up at the DB layer
//     instead of silently vaporising every deployment_target row. Today
//     the CASCADE means the audit trail gets shredded with zero warning.
//
// The round-trip also validates that the down migration cleanly reverses all
// three changes, so an operator who lands on a rollback can still boot the
// server. Red-until-Green: this test compiles but fails until
// migrations/000015_agent_retire.up.sql + .down.sql exist with the right
// schema, because `freshSchema(t)` runs every `.up.sql` in lexical order —
// the new migration runs automatically once Phase 2b creates the files.
func TestMigration000015_AgentRetireRoundTrip(t *testing.T) {
	tdb := getTestDB(t)
	db := tdb.freshSchema(t)
	ctx := context.Background()

	// ─── Stage 1: Post-up assertions ─────────────────────────────────────
	//
	// After all .up.sql migrations (including the new 000015) have run, the
	// new columns and the flipped FK must be observable in the catalog.
	assertColumnExists(t, db, "agents", "retired_at")
	assertColumnExists(t, db, "agents", "retired_reason")
	assertColumnExists(t, db, "deployment_targets", "retired_at")
	assertColumnExists(t, db, "deployment_targets", "retired_reason")

	// The FK on deployment_targets.agent_id must be RESTRICT (confdeltype='r'),
	// not CASCADE (confdeltype='c'). This is the core fail-closed guarantee
	// that fixes I-004 at the storage layer.
	assertFKDeleteRule(t, db, "deployment_targets", "agent_id", "r")

	// The FK on jobs.agent_id is already SET NULL (confdeltype='n') per
	// migration 000001 line 146 — pin that it stays that way (or goes to
	// RESTRICT; either preserves audit history, both fail on 'c').
	assertFKDeleteRuleNot(t, db, "jobs", "agent_id", "c")

	// ─── Stage 2: Run the 000015 down migration manually ─────────────────
	//
	// testutil_test.go's runMigrations helper only runs *.up.sql. To exercise
	// the down migration I read and execute it by hand, then re-check the
	// catalog.
	downSQL := readMigrationFile(t, "000015_agent_retire.down.sql")
	if _, err := db.ExecContext(ctx, downSQL); err != nil {
		t.Fatalf("000015 down migration failed: %v", err)
	}

	// ─── Stage 3: Post-down assertions ───────────────────────────────────
	//
	// Columns gone, FK restored to CASCADE.
	assertColumnGone(t, db, "agents", "retired_at")
	assertColumnGone(t, db, "agents", "retired_reason")
	assertColumnGone(t, db, "deployment_targets", "retired_at")
	assertColumnGone(t, db, "deployment_targets", "retired_reason")
	assertFKDeleteRule(t, db, "deployment_targets", "agent_id", "c")

	// ─── Stage 4: Re-run the up migration for idempotency ────────────────
	//
	// The up migration must be safely re-runnable — operators sometimes
	// re-apply by hand after a partial rollback. Use IF NOT EXISTS / ALTER
	// idempotently.
	upSQL := readMigrationFile(t, "000015_agent_retire.up.sql")
	if _, err := db.ExecContext(ctx, upSQL); err != nil {
		t.Fatalf("000015 up migration re-apply failed (must be idempotent): %v", err)
	}
	assertColumnExists(t, db, "agents", "retired_at")
	assertColumnExists(t, db, "agents", "retired_reason")
	assertColumnExists(t, db, "deployment_targets", "retired_at")
	assertColumnExists(t, db, "deployment_targets", "retired_reason")
	assertFKDeleteRule(t, db, "deployment_targets", "agent_id", "r")
}

// ─── Catalog helpers ──────────────────────────────────────────────────────
//
// These helpers scope every catalog query to the schema the test is actually
// running in by filtering on current_schema(). Without that, a test
// running in schema test_xyz would accidentally inspect the public schema
// and green-light drift.
func assertColumnExists(t *testing.T, db *sql.DB, table, column string) {
	t.Helper()
	var exists bool
	err := db.QueryRowContext(context.Background(), `
		SELECT EXISTS (
			SELECT 1 FROM information_schema.columns
			WHERE table_schema = current_schema()
			AND table_name = $1
			AND column_name = $2
		)`, table, column).Scan(&exists)
	if err != nil {
		t.Fatalf("column existence query failed for %s.%s: %v", table, column, err)
	}
	if !exists {
		t.Errorf("expected column %s.%s to exist after 000015 up (migration missing or drifted)", table, column)
	}
}

func assertColumnGone(t *testing.T, db *sql.DB, table, column string) {
	t.Helper()
	var exists bool
	err := db.QueryRowContext(context.Background(), `
		SELECT EXISTS (
			SELECT 1 FROM information_schema.columns
			WHERE table_schema = current_schema()
			AND table_name = $1
			AND column_name = $2
		)`, table, column).Scan(&exists)
	if err != nil {
		t.Fatalf("column existence query failed for %s.%s: %v", table, column, err)
	}
	if exists {
		t.Errorf("expected column %s.%s to be removed after 000015 down (down migration is incomplete)", table, column)
	}
}

// assertFKDeleteRule asserts that the foreign key covering `table.column`
// (i.e. the FK whose constrained column matches) has the expected
// `confdeltype`. Per pg_constraint docs: 'r' = RESTRICT, 'c' = CASCADE,
// 'n' = SET NULL, 'd' = SET DEFAULT, 'a' = NO ACTION.
func assertFKDeleteRule(t *testing.T, db *sql.DB, table, column, want string) {
	t.Helper()
	got := lookupFKDeleteRule(t, db, table, column)
	if got != want {
		t.Errorf("FK on %s(%s): confdeltype=%q want %q (RESTRICT='r', CASCADE='c', SET NULL='n')",
			table, column, got, want)
	}
}

// assertFKDeleteRuleNot is the negative form — used for jobs.agent_id where
// multiple confdeltype values are acceptable (SET NULL and RESTRICT both
// preserve audit history) but CASCADE is strictly forbidden.
func assertFKDeleteRuleNot(t *testing.T, db *sql.DB, table, column, disallowed string) {
	t.Helper()
	got := lookupFKDeleteRule(t, db, table, column)
	if got == disallowed {
		t.Errorf("FK on %s(%s): confdeltype=%q; %q is forbidden (would destroy audit history on agent delete)",
			table, column, got, disallowed)
	}
}

// lookupFKDeleteRule returns the confdeltype for the FK constraint whose
// constrained table+column matches. If no FK is found it fails the test
// immediately via t.Fatalf rather than returning an empty string, because
// the schema is supposed to have these FKs per migration 000001.
func lookupFKDeleteRule(t *testing.T, db *sql.DB, table, column string) string {
	t.Helper()
	// Join pg_constraint → pg_class (constrained rel) → pg_attribute
	// (constrained col) → pg_namespace (schema filter). Scoped to
	// current_schema() so schema-per-test isolation holds.
	const q = `
		SELECT c.confdeltype
		FROM pg_constraint c
		JOIN pg_class cl ON cl.oid = c.conrelid
		JOIN pg_namespace n ON n.oid = cl.relnamespace
		JOIN pg_attribute a ON a.attrelid = c.conrelid AND a.attnum = ANY(c.conkey)
		WHERE n.nspname = current_schema()
		AND c.contype = 'f'
		AND cl.relname = $1
		AND a.attname = $2
		LIMIT 1
	`
	var confdeltype string
	err := db.QueryRowContext(context.Background(), q, table, column).Scan(&confdeltype)
	// t.Fatalf stops the test goroutine, so no explicit return is needed
	// after either failure branch.
	if err == sql.ErrNoRows {
		t.Fatalf("no FK found on %s(%s) in current_schema (schema not migrated?)", table, column)
	}
	if err != nil {
		t.Fatalf("FK lookup for %s(%s) failed: %v", table, column, err)
	}
	return confdeltype
}

// readMigrationFile locates and loads a named migration file. Uses the same
// walk-up strategy as findMigrationsDir() in testutil_test.go so both helpers
// agree on where the migrations live.
func readMigrationFile(t *testing.T, name string) string {
	t.Helper()
	path := filepath.Join(findMigrationsDir(), name)
	data, err := os.ReadFile(path)
	if err != nil {
		t.Fatalf("failed to read migration file %s (expected at %s): %v", name, path, err)
	}
	// Defensive: a zero-byte down migration would produce false-positive
	// "success" in the caller. Refuse to trust it.
	if strings.TrimSpace(string(data)) == "" {
		t.Fatalf("migration file %s is empty — down migration missing or truncated", name)
	}
	return string(data)
}