mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-10 04:48:52 +00:00
Close I-004 (agent hard-delete cascades targets) coverage-gap finding
Operator decision answered as full soft-delete with optional forced
cascade — hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.
Schema — migration 000015:
migrations/000015_agent_retire.up.sql flips
deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
boundary instead of quietly destroying targets. Both `agents` and
`deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
TEXT pair (TEXT not VARCHAR so operator comments are never
truncated), indexed via partial indexes WHERE retired_at IS NOT
NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
EXISTS) so repeated runs against partially-migrated databases
converge. migrations/000015_agent_retire.down.sql restores CASCADE
and drops the new columns for clean rollback. A dedicated
repository-layer testcontainers test
(internal/repository/postgres/migration_000015_test.go) asserts the
before/after FK action, column presence, index presence, and
round-trip idempotency under up→down→up.
Domain — sentinel guard + dependency counts:
internal/domain/connector.go gains IsRetired() on Agent, the
exported SentinelAgentIDs slice listing server-scanner,
cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
four reserved IDs documented in CLAUDE.md and created at startup in
cmd/server/main.go), IsSentinelAgent(id string) predicate,
AgentDependencyCounts{ActiveTargets, ActiveCertificates,
PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
ActorTypeSystem enum values used by audit emission downstream.
Coverage locked down by internal/domain/connector_test.go.
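A minimal sketch of the two domain predicates described above. The four IDs are copied from this message; the linear scan and the "any bucket non-zero" reading of HasDependencies are assumptions, not the code in internal/domain/connector.go.

```go
package main

import "fmt"

// SentinelAgentIDs lists the four reserved IDs named in the commit message.
var SentinelAgentIDs = []string{
	"server-scanner",
	"cloud-aws-sm",
	"cloud-azure-kv",
	"cloud-gcp-sm",
}

// IsSentinelAgent reports whether id is one of the reserved sentinel agents.
func IsSentinelAgent(id string) bool {
	for _, s := range SentinelAgentIDs {
		if s == id {
			return true
		}
	}
	return false
}

// AgentDependencyCounts mirrors the three preflight buckets.
type AgentDependencyCounts struct {
	ActiveTargets      int
	ActiveCertificates int
	PendingJobs        int
}

// HasDependencies is assumed to mean "any bucket is non-zero".
func (c AgentDependencyCounts) HasDependencies() bool {
	return c.ActiveTargets > 0 || c.ActiveCertificates > 0 || c.PendingJobs > 0
}

func main() {
	fmt.Println(IsSentinelAgent("cloud-aws-sm"))
	fmt.Println(IsSentinelAgent("agent-42"))
	fmt.Println(AgentDependencyCounts{PendingJobs: 1}.HasDependencies())
}
```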
Service — 8-step ordered contract:
internal/service/agent_retire.go:RetireAgent(ctx, id, actor,
opts{Force, Reason}) enforces a fixed execution order:
(1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel
unconditionally; force=true does NOT bypass it.
(2) fetch — ErrAgentNotFound on miss.
(3) idempotency — if IsRetired() already, return
AgentRetirementResult{AlreadyRetired: true} with no new audit
event and no state change (safe to replay from flaky clients).
(4) preflight counts — collectAgentDependencyCounts runs
ActiveTargets, ActiveCertificates, PendingJobs sequentially
(not in parallel; keeps the per-query timeout predictable and
matches the repo's existing call-chain shape).
(5) force-reason guard — opts.Force=true with empty Reason returns
ErrForceReasonRequired (wired into the 400 status surface).
(6) dependency guard — HasDependencies() with opts.Force=false
returns BlockedByDependenciesError{Counts} (wired into the 409
body with per-bucket counts).
(7) mutation — single pinned retiredAt := time.Now(); agent
retirement first, then cascade target retirement if opts.Force,
all under the repo's single transaction; both UPDATEs receive the
same retiredAt value, so the two retired_at stamps are identical.
(8) best-effort audit — agent_retired always; agent_retirement_
cascaded additionally on the force path. Actor is whatever the
handler resolves from the request; actor type is mapped by
resolveActorType (system/agent-prefix→Agent/else→User). Audit
emission failures are logged via slog.Error but do not abort
the retirement (matches the house convention used by every
other scheduler-emitted event).
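The guard ordering in steps 1-6 can be sketched as a pure decision function. The agentState struct, the decide name, and the string outcomes are hypothetical stand-ins for the repository lookups and AgentRetirementResult; only the ordering and the error names come from the list above.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrAgentIsSentinel       = errors.New("agent is a reserved sentinel")
	ErrAgentNotFound         = errors.New("agent not found")
	ErrForceReasonRequired   = errors.New("force requires a reason")
	ErrBlockedByDependencies = errors.New("agent has active dependencies")
)

// agentState is a hypothetical snapshot of what steps 2 and 4 would load.
type agentState struct {
	Sentinel     bool
	Exists       bool
	Retired      bool
	Dependencies int // sum of the three preflight buckets
}

// decide applies the guards in the documented order; the first match wins.
func decide(a agentState, force bool, reason string) (string, error) {
	switch {
	case a.Sentinel: // step 1: never bypassed, even with force
		return "", ErrAgentIsSentinel
	case !a.Exists: // step 2
		return "", ErrAgentNotFound
	case a.Retired: // step 3: idempotent replay, no state change
		return "already-retired", nil
	case force && reason == "": // step 5
		return "", ErrForceReasonRequired
	case a.Dependencies > 0 && !force: // step 6
		return "", ErrBlockedByDependencies
	case force: // step 7 with cascade
		return "retired-with-cascade", nil
	default: // step 7 without cascade
		return "retired", nil
	}
}

func main() {
	fmt.Println(decide(agentState{Exists: true, Dependencies: 2}, false, ""))
	fmt.Println(decide(agentState{Exists: true, Dependencies: 2}, true, "host decommissioned"))
}
```

One consequence of the order is that an already-retired agent replayed with force=true and no reason still gets the idempotent result, because step 3 runs before step 5.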
BlockedByDependenciesError implements Error() as
"active_targets=%d, active_certificates=%d, pending_jobs=%d" and
Unwrap() → ErrBlockedByDependencies. The single struct satisfies
errors.Is via Unwrap (used by scheduler-level tests) and errors.As
via the concrete type (used by the handler to fish out Counts for
the 409 body). ListRetiredAgents(page, perPage) adds a separate
paginated accessor with page<1→1 and perPage<1→50 normalization so
retired rows are queryable without polluting the default agent
listing.
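A sketch of how one struct can serve both errors.Is and errors.As, as described above. The pointer receiver and the isBlocked / countsOf helpers are assumptions for illustration; the Error() format string and the Unwrap target are taken from this message.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrBlockedByDependencies = errors.New("agent has active dependencies")

type AgentDependencyCounts struct {
	ActiveTargets      int
	ActiveCertificates int
	PendingJobs        int
}

// BlockedByDependenciesError carries the per-bucket counts for the 409 body.
type BlockedByDependenciesError struct {
	Counts AgentDependencyCounts
}

func (e *BlockedByDependenciesError) Error() string {
	return fmt.Sprintf("active_targets=%d, active_certificates=%d, pending_jobs=%d",
		e.Counts.ActiveTargets, e.Counts.ActiveCertificates, e.Counts.PendingJobs)
}

// Unwrap lets errors.Is match the sentinel without a second error type.
func (e *BlockedByDependenciesError) Unwrap() error { return ErrBlockedByDependencies }

// isBlocked is the sentinel-style check (hypothetical helper).
func isBlocked(err error) bool { return errors.Is(err, ErrBlockedByDependencies) }

// countsOf extracts the concrete type to read Counts (hypothetical helper).
func countsOf(err error) (AgentDependencyCounts, bool) {
	var blocked *BlockedByDependenciesError
	if errors.As(err, &blocked) {
		return blocked.Counts, true
	}
	return AgentDependencyCounts{}, false
}

func main() {
	inner := &BlockedByDependenciesError{Counts: AgentDependencyCounts{ActiveTargets: 3, PendingJobs: 1}}
	err := fmt.Errorf("retire agent: %w", inner)
	fmt.Println(isBlocked(err))
	counts, ok := countsOf(err)
	fmt.Println(counts, ok)
}
```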
Sentinel guard coverage is asymmetric by design: all four reserved
IDs are protected, and force=true cannot override. Regression tests
in internal/service/agent_retire_test.go assert each of the eight
steps in order, plus sentinel bypass attempts and idempotency
replay.
Handler + router — status-code surface:
internal/api/handler/agents.go:RetireAgent exposes seven status
codes on DELETE /agents/{id}:
200 on a fresh retirement (body echoes AgentRetirementResult).
204 on idempotent replay (AlreadyRetired=true; no new audit).
400 on ErrForceReasonRequired.
403 on ErrAgentIsSentinel.
404 on ErrAgentNotFound.
409 on BlockedByDependenciesError, with a custom body shape
{error, counts{active_targets, active_certificates,
pending_jobs}} that bypasses the default ErrorWithRequestID
envelope so callers get the per-bucket numbers directly.
500 on any other error.
Heartbeat HandleHeartbeat returns 410 Gone when the agent is
retired (ErrAgentRetired), signalling the agent to shut down.
Query params `force=true` and `reason=<text>` drive the cascade
path; both are forwarded as url.Values through the new MCP
transport.
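The status surface above could be mapped roughly as follows. statusFor and errUnexpected are hypothetical names, and the switch layout is an assumption; only the error-to-code pairs come from the list.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

var (
	ErrForceReasonRequired = errors.New("force requires a reason")
	ErrAgentIsSentinel     = errors.New("agent is a reserved sentinel")
	ErrAgentNotFound       = errors.New("agent not found")
	errUnexpected          = errors.New("unexpected failure") // hypothetical catch-all
)

// BlockedByDependenciesError is reduced to a marker type for this sketch.
type BlockedByDependenciesError struct{}

func (e *BlockedByDependenciesError) Error() string { return "blocked by dependencies" }

// statusFor maps a RetireAgent outcome to the documented HTTP status.
func statusFor(err error, alreadyRetired bool) int {
	var blocked *BlockedByDependenciesError
	switch {
	case err == nil && alreadyRetired:
		return http.StatusNoContent // 204: idempotent replay
	case err == nil:
		return http.StatusOK // 200: fresh retirement
	case errors.Is(err, ErrForceReasonRequired):
		return http.StatusBadRequest // 400
	case errors.Is(err, ErrAgentIsSentinel):
		return http.StatusForbidden // 403
	case errors.Is(err, ErrAgentNotFound):
		return http.StatusNotFound // 404
	case errors.As(err, &blocked):
		return http.StatusConflict // 409
	default:
		return http.StatusInternalServerError // 500
	}
}

func main() {
	fmt.Println(statusFor(nil, false))
	fmt.Println(statusFor(&BlockedByDependenciesError{}, false))
	fmt.Println(statusFor(errUnexpected, false))
}
```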
internal/api/router/router.go registers GET /api/v1/agents/retired
literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's
literal-beats-pattern-var precedence routes "retired" to the
paginated retired-agents listing instead of fetching a hypothetical
agent named "retired".
Agent binary — clean shutdown on 410:
cmd/agent/main.go gains the ErrAgentRetired sentinel, a
retiredOnce sync.Once, and a retiredSignal chan struct{}. A
markRetired(source, statusCode, body) helper closes the channel
exactly once; the Run() select loop observes the close and returns
ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
and exits cleanly instead of spinning in the heartbeat retry loop.
The 410 Gone surface is therefore terminal for the agent process.
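A reduced sketch of the close-once shutdown signal. The agent struct, newAgent, and runOnce are hypothetical, and markRetired is simplified to take no arguments (the real helper takes source, statusCode, body).

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrAgentRetired = errors.New("agent retired")

type agent struct {
	retiredOnce   sync.Once
	retiredSignal chan struct{}
}

func newAgent() *agent { return &agent{retiredSignal: make(chan struct{})} }

// markRetired closes the signal channel exactly once, however often it is called.
func (a *agent) markRetired() {
	a.retiredOnce.Do(func() { close(a.retiredSignal) })
}

// Run loops until the retired signal is observed.
func (a *agent) Run(tick <-chan struct{}) error {
	for {
		select {
		case <-a.retiredSignal:
			return ErrAgentRetired
		case <-tick:
			// A heartbeat would go here; a 410 response would call markRetired.
		}
	}
}

// runOnce marks the agent retired twice, then runs the loop.
func runOnce() error {
	a := newAgent()
	a.markRetired()
	a.markRetired() // no-op instead of a double-close panic
	return a.Run(nil)
}

func main() {
	err := runOnce()
	fmt.Println(errors.Is(err, ErrAgentRetired))
}
```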
MCP transport:
internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
a new additive transport method. Client.Delete is path-only; without
this method the retire tool would silently drop `force` and `reason`,
turning every cascade retire into a default soft-retire. The new
method shares do()'s 204 normalization and 4xx/5xx error
propagation so tool authors get one contract.
internal/mcp/tools.go + internal/mcp/types.go expose the
retire_agent tool with Force+Reason inputs wired through
DeleteWithQuery.
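A sketch of why the query-aware method matters. buildDeleteURL and retireQuery are hypothetical helpers, not the Client.DeleteWithQuery implementation; they only show force and reason surviving into the request URL via url.Values.

```go
package main

import (
	"fmt"
	"net/url"
)

// retireQuery builds the cascade parameters; empty values are omitted.
func retireQuery(force bool, reason string) url.Values {
	q := url.Values{}
	if force {
		q.Set("force", "true")
	}
	if reason != "" {
		q.Set("reason", reason)
	}
	return q
}

// buildDeleteURL appends the encoded query, if any, to base+path.
func buildDeleteURL(baseURL, path string, query url.Values) string {
	u := baseURL + path
	if len(query) > 0 {
		u += "?" + query.Encode() // Encode sorts keys and escapes values
	}
	return u
}

func main() {
	q := retireQuery(true, "host decommissioned")
	fmt.Println(buildDeleteURL("https://certctl.example", "/api/v1/agents/a-1", q))
}
```

A path-only delete corresponds to calling buildDeleteURL with an empty url.Values, which is the silent soft-retire the paragraph above warns about.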
CLI:
cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
`agents list --retired` (client-side strip of --retired then
delegation to ListRetiredAgents, sharing --page/--per-page parsing
with the default listing) and `agents retire <id> [--force --reason
"…"]` (mirrors ErrForceReasonRequired — force without reason is
rejected client-side before the request is sent). JSON + table
output modes both honor the new columns.
Frontend:
web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
web/src/api/client.ts + web/src/api/types.ts expose the retire
endpoint and the retired-listing. 4 new Vitest regression cases.
OpenAPI:
api/openapi.yaml documents DELETE /agents/{id} with all seven
status codes, 410 on heartbeat, and the 409 per-bucket body shape.
Regression coverage (six new test files, all green):
internal/service/agent_retire_test.go — 8-step contract + sentinel guards
internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
internal/mcp/retire_agent_test.go — DeleteWithQuery wire-through
internal/cli/agent_retire_test.go — --retired listing + --force/--reason pairing
internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
internal/domain/connector_test.go — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies
Files:
api/openapi.yaml — DELETE + 410 + 409 body shape
cmd/agent/main.go — ErrAgentRetired, markRetired, retiredSignal
cmd/cli/main.go — handleAgents list/get/retire dispatch
docs/architecture.md, docs/concepts.md,
docs/testing-guide.md — retirement contract narrative
internal/api/handler/agents.go — RetireAgent, status surface, 410 on heartbeat
internal/api/handler/agent_handler_test.go — extended coverage
internal/api/handler/agent_retire_handler_test.go — new
internal/api/router/router.go — /agents/retired before /agents/{id}
internal/cli/agent_retire_test.go — new
internal/cli/client.go — ListRetiredAgents + RetireAgent
internal/domain/connector.go — IsRetired, SentinelAgentIDs,
IsSentinelAgent, AgentDependencyCounts,
ActorTypeAgent/System
internal/domain/connector_test.go — new
internal/integration/lifecycle_test.go — retirement fixture
internal/mcp/client.go — DeleteWithQuery additive transport
internal/mcp/retire_agent_test.go — new
internal/mcp/tools.go, internal/mcp/types.go — retire_agent tool + Force/Reason inputs
internal/repository/interfaces.go — AgentRepository retirement methods
internal/repository/postgres/agent.go — retire + cascade target retire + counts
internal/repository/postgres/migration_000015_test.go — new
internal/service/agent.go — wire into AgentService surface
internal/service/agent_retire.go — new 8-step contract
internal/service/agent_retire_test.go — new
internal/service/deployment.go — skip retired agents
internal/service/target.go — skip retired agents
internal/service/testutil_test.go — shared mocks extended
migrations/000015_agent_retire.up.sql — new
migrations/000015_agent_retire.down.sql — new
web/src/api/client.ts, types.ts + tests — retire endpoint wiring
web/src/pages/AgentsPage.tsx — retire UI
@@ -20,12 +20,18 @@ func NewAgentRepository(db *sql.DB) *AgentRepository {
 	return &AgentRepository{db: db}
 }
 
-// List returns all agents
+// List returns all ACTIVE agents — rows with retired_at IS NULL. I-004:
+// the default listing path feeds the handler-facing ListAgents call, the
+// stats dashboard, and the stale-offline sweeper; every caller wants active
+// hardware, not decommissioned rows. Operators who need retired rows reach
+// for ListRetired instead. The partial index idx_agents_retired_at
+// (migration 000015) lets the planner skip the retired segment cheaply.
 func (r *AgentRepository) List(ctx context.Context) ([]*domain.Agent, error) {
 	rows, err := r.db.QueryContext(ctx, `
 		SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash,
-		       os, architecture, ip_address, version
+		       os, architecture, ip_address, version, retired_at, retired_reason
 		FROM agents
+		WHERE retired_at IS NULL
 		ORDER BY registered_at DESC
 	`)
 
@@ -50,11 +56,16 @@ func (r *AgentRepository) List(ctx context.Context) ([]*domain.Agent, error) {
 	return agents, nil
 }
 
-// Get retrieves an agent by ID
+// Get retrieves an agent by ID. I-004: retired rows ARE surfaced here —
+// callers that need to check "has this agent been retired?" (heartbeat
+// handler returning 410 Gone, retirement service's idempotent-retire branch,
+// detail page rendering a retirement banner) must see retired_at /
+// retired_reason. Only the List path default-excludes retired rows; Get is
+// by-ID and returns whatever row exists.
 func (r *AgentRepository) Get(ctx context.Context, id string) (*domain.Agent, error) {
 	row := r.db.QueryRowContext(ctx, `
 		SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash,
-		       os, architecture, ip_address, version
+		       os, architecture, ip_address, version, retired_at, retired_reason
 		FROM agents
 		WHERE id = $1
 	`, id)
@@ -185,7 +196,16 @@ func (r *AgentRepository) Delete(ctx context.Context, id string) error {
 	return nil
 }
 
-// UpdateHeartbeat updates the agent's last heartbeat timestamp and metadata
+// UpdateHeartbeat updates the agent's last heartbeat timestamp and metadata.
+//
+// I-004: both branches include `AND retired_at IS NULL` in the WHERE clause,
+// making the UPDATE a no-op on retired rows. The service layer already
+// short-circuits with ErrAgentRetired before calling this method (see
+// AgentService.Heartbeat), but the WHERE filter is belt-and-braces for any
+// path that skips the service — a stale agent process that keeps polling
+// after retirement cannot resurrect its heartbeat at the DB layer. A zero
+// RowsAffected here returns the same "agent not found" error as before; the
+// service layer distinguishes retired from missing by calling Get first.
 func (r *AgentRepository) UpdateHeartbeat(ctx context.Context, id string, metadata *domain.AgentMetadata) error {
 	var result sql.Result
 	var err error
@@ -199,11 +219,11 @@ func (r *AgentRepository) UpdateHeartbeat(ctx context.Context, id string, metada
 				architecture = CASE WHEN $5 = '' THEN architecture ELSE $5 END,
 				ip_address = CASE WHEN $6 = '' THEN ip_address ELSE $6 END,
 				version = CASE WHEN $7 = '' THEN version ELSE $7 END
-			WHERE id = $2
+			WHERE id = $2 AND retired_at IS NULL
 		`, time.Now(), id, metadata.Hostname, metadata.OS, metadata.Architecture, metadata.IPAddress, metadata.Version)
 	} else {
 		result, err = r.db.ExecContext(ctx, `
-			UPDATE agents SET last_heartbeat_at = $1 WHERE id = $2
+			UPDATE agents SET last_heartbeat_at = $1 WHERE id = $2 AND retired_at IS NULL
 		`, time.Now(), id)
 	}
 
@@ -223,11 +243,15 @@ func (r *AgentRepository) UpdateHeartbeat(ctx context.Context, id string, metada
 	return nil
 }
 
-// GetByAPIKey retrieves an agent by hashed API key
+// GetByAPIKey retrieves an agent by hashed API key. I-004: retired rows ARE
+// surfaced here so the auth middleware can detect "this API key belongs to a
+// retired agent" and fail the request with 410 Gone instead of 401. If the
+// filter hid retired rows, auth would return a plain 401 and leak no signal
+// that the agent process needs cleaning up.
 func (r *AgentRepository) GetByAPIKey(ctx context.Context, keyHash string) (*domain.Agent, error) {
 	row := r.db.QueryRowContext(ctx, `
 		SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash,
-		       os, architecture, ip_address, version
+		       os, architecture, ip_address, version, retired_at, retired_reason
 		FROM agents
 		WHERE api_key_hash = $1
 	`, keyHash)
@@ -243,14 +267,214 @@ func (r *AgentRepository) GetByAPIKey(ctx context.Context, keyHash string) (*dom
 	return agent, nil
 }
 
-// scanAgent scans an agent from a row or rows
+// ─── I-004 agent retirement surface ──────────────────────────────────────
+//
+// The methods below implement the I-004 coverage-gap closure. They follow the
+// interface contracts in internal/repository/interfaces.go:94-210 (which is the
+// spec — keep godoc there in sync if behavior changes).
+
+// ListRetired returns a paginated slice of retired agents ordered by
+// retired_at DESC so the most recent retirements appear first. Used by the
+// GUI's Retired tab and the audit export path. Returns the rows plus the
+// total count (for pagination UI). page<1 or perPage<1 is clamped to
+// sensible defaults in-repo rather than erroring, matching the ListAgents
+// pagination behavior at the service layer. I-004, migration 000015.
+func (r *AgentRepository) ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error) {
+	// Clamp pagination to safe defaults. Keep in lockstep with the service
+	// layer's pagination shape — negative / zero values on either axis should
+	// degrade to "first page, default size" instead of returning an error.
+	if page < 1 {
+		page = 1
+	}
+	if perPage < 1 {
+		perPage = 50
+	}
+	offset := (page - 1) * perPage
+
+	// Total count first — separate query so pagination math stays correct
+	// even when the page of rows is empty. Uses the partial
+	// idx_agents_retired_at index so this is effectively a count of the
+	// partial-index tuple count, not a full table scan.
+	var total int
+	if err := r.db.QueryRowContext(ctx, `
+		SELECT COUNT(*) FROM agents WHERE retired_at IS NOT NULL
+	`).Scan(&total); err != nil {
+		return nil, 0, fmt.Errorf("failed to count retired agents: %w", err)
+	}
+
+	rows, err := r.db.QueryContext(ctx, `
+		SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash,
+		       os, architecture, ip_address, version, retired_at, retired_reason
+		FROM agents
+		WHERE retired_at IS NOT NULL
+		ORDER BY retired_at DESC
+		LIMIT $1 OFFSET $2
+	`, perPage, offset)
+	if err != nil {
+		return nil, 0, fmt.Errorf("failed to query retired agents: %w", err)
+	}
+	defer rows.Close()
+
+	var agents []*domain.Agent
+	for rows.Next() {
+		agent, err := scanAgent(rows)
+		if err != nil {
+			return nil, 0, err
+		}
+		agents = append(agents, agent)
+	}
+	if err := rows.Err(); err != nil {
+		return nil, 0, fmt.Errorf("error iterating retired agent rows: %w", err)
+	}
+	return agents, total, nil
+}
+
+// SoftRetire stamps retired_at + retired_reason on the agent row with no
+// cascade. Scoped to `WHERE id=$1 AND retired_at IS NULL` so re-retiring an
+// already-retired row is a silent no-op (zero RowsAffected). The service
+// layer has its own idempotent-retire branch that detects already-retired
+// rows via Get before calling SoftRetire; a zero here just means a racy
+// caller got there first. I-004.
+func (r *AgentRepository) SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error {
+	if _, err := r.db.ExecContext(ctx, `
+		UPDATE agents
+		SET retired_at = $2, retired_reason = $3
+		WHERE id = $1 AND retired_at IS NULL
+	`, id, retiredAt, reason); err != nil {
+		return fmt.Errorf("failed to soft-retire agent: %w", err)
+	}
+	return nil
+}
+
+// RetireAgentWithCascade performs a transactional retire-and-cascade. In one
+// transaction it (1) stamps retired_at + retired_reason on the agent row if
+// it is still active, and (2) stamps the SAME retired_at + retired_reason on
+// every active (retired_at IS NULL) deployment_targets row whose agent_id
+// matches. Already-retired targets keep their original retirement metadata;
+// only active targets are touched. If the agent is already retired, the
+// whole transaction is a no-op — the caller's idempotent-retire branch
+// already handled it before we got here. I-004, migration 000015.
+//
+// The two UPDATEs share a single (retiredAt, reason) pair so forensic
+// analysis can trace "every row stamped at T1 with reason R was part of the
+// same operator action" back to one cascade. Using BeginTx keeps the agent
+// row and its targets' retirement metadata consistent even if something
+// crashes mid-cascade.
+func (r *AgentRepository) RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error {
+	tx, err := r.db.BeginTx(ctx, nil)
+	if err != nil {
+		return fmt.Errorf("failed to begin retire-cascade transaction: %w", err)
+	}
+	// Rollback is a no-op if Commit has already run — safe to always defer.
+	defer func() { _ = tx.Rollback() }()
+
+	// Agent row: flip to retired only if it was still active. If zero rows
+	// match, the agent was already retired — the whole cascade becomes a
+	// no-op (we deliberately do NOT stamp the targets against a retirement
+	// we didn't perform).
+	if _, err := tx.ExecContext(ctx, `
+		UPDATE agents
+		SET retired_at = $2, retired_reason = $3
+		WHERE id = $1 AND retired_at IS NULL
+	`, id, retiredAt, reason); err != nil {
+		return fmt.Errorf("failed to retire agent in cascade: %w", err)
+	}
+
+	// Cascade: copy the same retired_at / retired_reason onto every active
+	// deployment_target belonging to this agent. Skips targets that are
+	// already retired so their original retirement metadata is preserved.
+	if _, err := tx.ExecContext(ctx, `
+		UPDATE deployment_targets
+		SET retired_at = $2, retired_reason = $3
+		WHERE agent_id = $1 AND retired_at IS NULL
+	`, id, retiredAt, reason); err != nil {
+		return fmt.Errorf("failed to cascade-retire deployment targets: %w", err)
+	}
+
+	if err := tx.Commit(); err != nil {
+		return fmt.Errorf("failed to commit retire-cascade transaction: %w", err)
+	}
+	return nil
+}
+
+// CountActiveTargets returns the number of deployment_targets with
+// agent_id=agentID AND retired_at IS NULL. Used by the retirement preflight
+// to decide 200 (soft-retire) vs 409 (blocked-by-deps). Hits the existing
+// idx_deployment_targets_agent_id index (migration 000001 line 111); the
+// retired_at IS NULL predicate is cheap because the partial
+// idx_deployment_targets_retired_at index (migration 000015) lets the
+// planner skip the retired-row segment. I-004.
+func (r *AgentRepository) CountActiveTargets(ctx context.Context, agentID string) (int, error) {
+	var count int
+	err := r.db.QueryRowContext(ctx, `
+		SELECT COUNT(*)
+		FROM deployment_targets
+		WHERE agent_id = $1 AND retired_at IS NULL
+	`, agentID).Scan(&count)
+	if err != nil {
+		return 0, fmt.Errorf("failed to count active targets for agent: %w", err)
+	}
+	return count, nil
+}
+
+// CountActiveCertificates returns the count of distinct managed_certificates
+// currently deployed through one of this agent's ACTIVE deployment_targets.
+// Joins certificate_target_mappings (migration 000001 line 116) →
+// deployment_targets filtering on deployment_targets.agent_id=$1 AND
+// deployment_targets.retired_at IS NULL. COUNT(DISTINCT certificate_id) so
+// the same cert deployed to multiple targets on one agent counts once.
+// Used purely for the preflight 409 body. I-004.
+func (r *AgentRepository) CountActiveCertificates(ctx context.Context, agentID string) (int, error) {
+	var count int
+	err := r.db.QueryRowContext(ctx, `
+		SELECT COUNT(DISTINCT ctm.certificate_id)
+		FROM certificate_target_mappings ctm
+		JOIN deployment_targets dt ON dt.id = ctm.target_id
+		WHERE dt.agent_id = $1 AND dt.retired_at IS NULL
+	`, agentID).Scan(&count)
+	if err != nil {
+		return 0, fmt.Errorf("failed to count active certificates for agent: %w", err)
+	}
+	return count, nil
+}
+
+// CountPendingJobs returns the number of jobs belonging to this agent whose
+// status is in (Pending, AwaitingCSR, AwaitingApproval, Running) — the four
+// statuses that represent work the agent would still be expected to pick up
+// or complete. Completed / Failed / Cancelled jobs do not count toward the
+// preflight gate. Status strings match domain.JobStatus* constants in
+// internal/domain/job.go:43-49. Hits idx_jobs_agent_id (migration 000001
+// line 161). I-004.
+func (r *AgentRepository) CountPendingJobs(ctx context.Context, agentID string) (int, error) {
+	var count int
+	err := r.db.QueryRowContext(ctx, `
+		SELECT COUNT(*)
+		FROM jobs
+		WHERE agent_id = $1
+		  AND status IN ('Pending', 'AwaitingCSR', 'AwaitingApproval', 'Running')
+	`, agentID).Scan(&count)
+	if err != nil {
+		return 0, fmt.Errorf("failed to count pending jobs for agent: %w", err)
+	}
+	return count, nil
+}
+
+// scanAgent scans an agent from a row or rows.
+//
+// I-004: the column list here is the authoritative 13-field post-M15 order —
+// retired_at and retired_reason are appended at the tail as nullable
+// *time.Time / *string scan targets matching the `json:"...,omitempty"` domain
+// fields. Every SELECT in this file that feeds scanAgent must emit columns in
+// this same order, otherwise Scan will silently place values into the wrong
+// fields (lib/pq does positional binding, not named).
 func scanAgent(scanner interface {
 	Scan(...interface{}) error
 }) (*domain.Agent, error) {
 	var agent domain.Agent
 	err := scanner.Scan(&agent.ID, &agent.Name, &agent.Hostname, &agent.Status,
 		&agent.LastHeartbeatAt, &agent.RegisteredAt, &agent.APIKeyHash,
-		&agent.OS, &agent.Architecture, &agent.IPAddress, &agent.Version)
+		&agent.OS, &agent.Architecture, &agent.IPAddress, &agent.Version,
+		&agent.RetiredAt, &agent.RetiredReason)
 
 	if err != nil {
 		return nil, fmt.Errorf("failed to scan agent: %w", err)
@@ -0,0 +1,220 @@
+package postgres_test
+
+import (
+	"context"
+	"database/sql"
+	"os"
+	"path/filepath"
+	"strings"
+	"testing"
+)
+
+// TestMigration000015_AgentRetireRoundTrip is the Phase 2a Red regression test
+// for I-004 ("Agent hard-delete cascades through deployment_targets + jobs").
+//
+// The fix depends on a new migration, 000015_agent_retire.up.sql + .down.sql,
+// which must:
+//
+//  1. Add nullable `retired_at TIMESTAMPTZ` and `retired_reason TEXT`
+//     columns to the `agents` table. These mirror the revoked_at /
+//     revocation_reason pair on managed_certificates (migration 000005).
+//
+//  2. Add nullable `retired_at TIMESTAMPTZ` and `retired_reason TEXT` columns
+//     to `deployment_targets`. When an agent is retired with cascade=true,
+//     its deployment_targets must be soft-retired (not deleted) so audit
+//     history — who deployed what to where, when — stays intact.
+//
+//  3. FLIP the foreign key on `deployment_targets.agent_id → agents.id`
+//     from `ON DELETE CASCADE` (migration 000001, line 104) to
+//     `ON DELETE RESTRICT`. This is the fail-closed change that makes a
+//     bare `DELETE FROM agents WHERE id = $1` blow up at the DB layer
+//     instead of silently vaporising every deployment_target row. Today
+//     the CASCADE means the audit trail gets shredded with zero warning.
+//
+// The round-trip also validates that the down migration cleanly reverses all
+// three changes, so an operator who lands on a rollback can still boot the
+// server. Red-until-Green: this test compiles but fails until
+// migrations/000015_agent_retire.up.sql + .down.sql exist with the right
+// schema, because `freshSchema(t)` runs every `.up.sql` in lexical order —
+// the new migration runs automatically once Phase 2b creates the files.
+func TestMigration000015_AgentRetireRoundTrip(t *testing.T) {
+	tdb := getTestDB(t)
+	db := tdb.freshSchema(t)
+	ctx := context.Background()
+
+	// ─── Stage 1: Post-up assertions ─────────────────────────────────────
+	//
+	// After all .up.sql migrations (including the new 000015) have run, the
+	// new columns and the flipped FK must be observable in the catalog.
+
+	assertColumnExists(t, db, "agents", "retired_at")
+	assertColumnExists(t, db, "agents", "retired_reason")
+	assertColumnExists(t, db, "deployment_targets", "retired_at")
+	assertColumnExists(t, db, "deployment_targets", "retired_reason")
+
+	// The FK on deployment_targets.agent_id must be RESTRICT (confdeltype='r'),
+	// not CASCADE (confdeltype='c'). This is the core fail-closed guarantee
+	// that fixes I-004 at the storage layer.
+	assertFKDeleteRule(t, db, "deployment_targets", "agent_id", "r")
+
+	// The FK on jobs.agent_id is already SET NULL (confdeltype='n') per
+	// migration 000001 line 146 — pin that it stays that way (or goes to
+	// RESTRICT; either preserves audit history, both fail on 'c').
+	assertFKDeleteRuleNot(t, db, "jobs", "agent_id", "c")
+
+	// ─── Stage 2: Run the 000015 down migration manually ─────────────────
+	//
+	// testutil_test.go's runMigrations helper only runs *.up.sql. To exercise
+	// the down migration I read and execute it by hand, then re-check the
+	// catalog.
+
+	downSQL := readMigrationFile(t, "000015_agent_retire.down.sql")
+	if _, err := db.ExecContext(ctx, downSQL); err != nil {
+		t.Fatalf("000015 down migration failed: %v", err)
+	}
+
+	// Stage 3: Post-down assertions — columns gone, FK restored to CASCADE.
+	assertColumnGone(t, db, "agents", "retired_at")
+	assertColumnGone(t, db, "agents", "retired_reason")
+	assertColumnGone(t, db, "deployment_targets", "retired_at")
+	assertColumnGone(t, db, "deployment_targets", "retired_reason")
+	assertFKDeleteRule(t, db, "deployment_targets", "agent_id", "c")
+
+	// ─── Stage 4: Re-run the up migration for idempotency ────────────────
+	//
+	// The up migration must be safely re-runnable — operators sometimes
+	// re-apply by hand after a partial rollback. Use IF NOT EXISTS / ALTER
+	// idempotently.
+
+	upSQL := readMigrationFile(t, "000015_agent_retire.up.sql")
+	if _, err := db.ExecContext(ctx, upSQL); err != nil {
+		t.Fatalf("000015 up migration re-apply failed (must be idempotent): %v", err)
+	}
+
+	assertColumnExists(t, db, "agents", "retired_at")
+	assertColumnExists(t, db, "agents", "retired_reason")
+	assertColumnExists(t, db, "deployment_targets", "retired_at")
+	assertColumnExists(t, db, "deployment_targets", "retired_reason")
+	assertFKDeleteRule(t, db, "deployment_targets", "agent_id", "r")
+}
+
+// ─── Catalog helpers ──────────────────────────────────────────────────────
+//
+// These helpers scope every catalog query to the schema the test is actually
+// running in by joining against current_schema(). Without that, a test
+// running in schema test_xyz would accidentally inspect the public schema
+// and green-light drift.
+
+func assertColumnExists(t *testing.T, db *sql.DB, table, column string) {
+	t.Helper()
+	var exists bool
+	err := db.QueryRowContext(context.Background(), `
+		SELECT EXISTS (
+			SELECT 1 FROM information_schema.columns
+			WHERE table_schema = current_schema()
+			  AND table_name = $1
+			  AND column_name = $2
+		)`, table, column).Scan(&exists)
+	if err != nil {
+		t.Fatalf("column existence query failed for %s.%s: %v", table, column, err)
+	}
+	if !exists {
+		t.Errorf("expected column %s.%s to exist after 000015 up (migration missing or drifted)", table, column)
+	}
+}
+
+func assertColumnGone(t *testing.T, db *sql.DB, table, column string) {
+	t.Helper()
+	var exists bool
+	err := db.QueryRowContext(context.Background(), `
+		SELECT EXISTS (
+			SELECT 1 FROM information_schema.columns
+			WHERE table_schema = current_schema()
+			  AND table_name = $1
+			  AND column_name = $2
+		)`, table, column).Scan(&exists)
+	if err != nil {
+		t.Fatalf("column existence query failed for %s.%s: %v", table, column, err)
+	}
+	if exists {
+		t.Errorf("expected column %s.%s to be removed after 000015 down (down migration is incomplete)", table, column)
+	}
+}
+
// assertFKDeleteRule asserts that the foreign key covering `table.column`
// (i.e. the FK whose constrained column matches) has the expected
// `confdeltype`. Per pg_constraint docs: 'r' = RESTRICT, 'c' = CASCADE,
// 'n' = SET NULL, 'd' = SET DEFAULT, 'a' = NO ACTION.
func assertFKDeleteRule(t *testing.T, db *sql.DB, table, column, want string) {
	t.Helper()
	got := lookupFKDeleteRule(t, db, table, column)
	if got != want {
		t.Errorf("FK on %s(%s): confdeltype=%q want %q (RESTRICT='r', CASCADE='c', SET NULL='n')",
			table, column, got, want)
	}
}

// assertFKDeleteRuleNot is the negative form, used for jobs.agent_id where
// multiple confdeltype values are acceptable (SET NULL and RESTRICT both
// preserve audit history) but CASCADE is strictly forbidden.
func assertFKDeleteRuleNot(t *testing.T, db *sql.DB, table, column, disallowed string) {
	t.Helper()
	got := lookupFKDeleteRule(t, db, table, column)
	if got == disallowed {
		t.Errorf("FK on %s(%s): confdeltype=%q; %q is forbidden (would destroy audit history on agent delete)",
			table, column, got, disallowed)
	}
}

// lookupFKDeleteRule returns the confdeltype for the FK constraint whose
// constrained table+column matches. If no FK is found it fails the test
// immediately instead of returning, because the schema is supposed to have
// these FKs per migration 000001.
func lookupFKDeleteRule(t *testing.T, db *sql.DB, table, column string) string {
	t.Helper()

	// Join pg_constraint → pg_class (constrained rel) → pg_attribute
	// (constrained col) → pg_namespace (schema filter). Scoped to
	// current_schema() so schema-per-test isolation holds.
	const q = `
		SELECT c.confdeltype
		FROM pg_constraint c
		JOIN pg_class cl ON cl.oid = c.conrelid
		JOIN pg_namespace n ON n.oid = cl.relnamespace
		JOIN pg_attribute a ON a.attrelid = c.conrelid AND a.attnum = ANY(c.conkey)
		WHERE n.nspname = current_schema()
		  AND c.contype = 'f'
		  AND cl.relname = $1
		  AND a.attname = $2
		LIMIT 1
	`
	var confdeltype string
	err := db.QueryRowContext(context.Background(), q, table, column).Scan(&confdeltype)
	// t.Fatalf stops the test goroutine, so no return is needed after it.
	if err == sql.ErrNoRows {
		t.Fatalf("no FK found on %s(%s) in current_schema (schema not migrated?)", table, column)
	}
	if err != nil {
		t.Fatalf("FK lookup for %s(%s) failed: %v", table, column, err)
	}
	return confdeltype
}

// readMigrationFile locates and loads a named migration file. Uses the same
// walk-up strategy as findMigrationsDir() in testutil_test.go so both helpers
// agree on where the migrations live.
func readMigrationFile(t *testing.T, name string) string {
	t.Helper()
	path := filepath.Join(findMigrationsDir(), name)
	data, err := os.ReadFile(path)
	if err != nil {
		t.Fatalf("failed to read migration file %s (expected at %s): %v", name, path, err)
	}
	// Defensive: an empty (or whitespace-only) migration file would let the
	// caller's ExecContext report a false-positive "success". Refuse to
	// trust it, whether it is an up or a down migration.
	if strings.TrimSpace(string(data)) == "" {
		t.Fatalf("migration file %s is empty: migration missing or truncated", name)
	}
	return string(data)
}
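
The single-letter `confdeltype` codes listed in the `assertFKDeleteRule` comment are easy to misread in a failure message. As a minimal sketch, a small lookup can translate them into SQL keywords before they are printed. `fkActionName` is a hypothetical helper, not part of this commit; the code-to-keyword mapping is the one quoted from the pg_constraint docs above.

```go
package main

import "fmt"

// fkActionName translates a pg_constraint.confdeltype code into the SQL
// keyword it stands for. Hypothetical helper for illustration only; the
// commit itself compares the raw one-letter codes.
func fkActionName(code string) string {
	switch code {
	case "a":
		return "NO ACTION"
	case "r":
		return "RESTRICT"
	case "c":
		return "CASCADE"
	case "n":
		return "SET NULL"
	case "d":
		return "SET DEFAULT"
	default:
		// Unknown codes are surfaced rather than silently mapped.
		return fmt.Sprintf("unknown(%q)", code)
	}
}

func main() {
	// Print the readable name for each documented code plus one bogus value.
	for _, code := range []string{"r", "c", "n", "d", "a", "x"} {
		fmt.Printf("%s => %s\n", code, fkActionName(code))
	}
}
```

A failure message built this way could read `got CASCADE, want RESTRICT` instead of `confdeltype="c" want "r"`, at the cost of one more helper to maintain.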