security: atomic pending-job claim with FOR UPDATE SKIP LOCKED (H-6)

Fixes H-6 (CWE-362): GetPendingJobs returned pending rows without row locks, so two scheduler replicas in an HA deployment could both read the same row, both decide it was theirs, and race on UpdateStatus, producing duplicate Running jobs and duplicate certificate issuances.

Remediation: a claim-style repository API that selects + transitions Pending -> Running in one transaction with SELECT ... FOR UPDATE SKIP LOCKED. Concurrent claimants observe disjoint row sets; no worker ever sees another worker's claimed row.

Repository changes (internal/repository/postgres/job.go):
- New ClaimPendingJobs(ctx, jobType, limit): BEGIN; SELECT id,... FROM jobs WHERE status='Pending' (optional type filter, optional LIMIT) FOR UPDATE SKIP LOCKED; UPDATE jobs SET status='Running', updated_at=NOW() WHERE id = ANY($ids); COMMIT. Returns the claimed rows with status already flipped.
- New ClaimPendingByAgentID(ctx, agentID): mirrors M31 UNION ALL semantics (direct agent_id match, target->agent JOIN fallback, certificate->target->agent chain for AwaitingCSR) but wraps each branch in FOR UPDATE SKIP LOCKED and flips Deployment/Renewal rows to Running. AwaitingCSR rows are returned in place (state transition deferred until SubmitCSR, consistent with M8 semantics).
- Existing GetPendingJobs / ListPendingByAgentID retained for legacy compatibility; their godoc now directs production callers to the Claim* variants.

Production caller switches:
- internal/service/job.go ProcessPendingJobs: ListByStatus(Pending) -> ClaimPendingJobs(ctx, "", 0). Eliminates the real scheduler race between two replicas tick-firing simultaneously.
- internal/service/agent.go GetPendingWork: ListPendingByAgentID -> ClaimPendingByAgentID. Eliminates the race between two pollers for the same agent (e.g. brief network blip causing duplicate poll) and between a scheduler tick and an agent poll.
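
The claim transaction described for ClaimPendingJobs can be sketched roughly as follows. This is a minimal illustration, not the code in job.go: the function names buildClaimSelect and claimPendingJobs, the column names (id, type, status, updated_at), and the choice to return only IDs are assumptions. The sketch also updates claimed rows one at a time instead of using id = ANY($ids), because binding a Go slice as a PostgreSQL array is driver-specific.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"strings"
)

// buildClaimSelect assembles the locking SELECT. An empty jobType matches any
// type and limit <= 0 adds no LIMIT clause. Table and column names are assumed.
func buildClaimSelect(jobType string, limit int) (string, []any) {
	var sb strings.Builder
	args := []any{}
	sb.WriteString("SELECT id FROM jobs WHERE status = 'Pending'")
	if jobType != "" {
		args = append(args, jobType)
		fmt.Fprintf(&sb, " AND type = $%d", len(args))
	}
	if limit > 0 {
		args = append(args, limit)
		fmt.Fprintf(&sb, " LIMIT $%d", len(args))
	}
	// SKIP LOCKED makes concurrent claimants pass over rows another
	// transaction has already locked instead of blocking on them.
	sb.WriteString(" FOR UPDATE SKIP LOCKED")
	return sb.String(), args
}

// claimPendingJobs selects and flips Pending rows to Running in one transaction.
// It is a hypothetical stand-in that returns the claimed IDs only.
func claimPendingJobs(ctx context.Context, db *sql.DB, jobType string, limit int) ([]string, error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return nil, err
	}
	defer tx.Rollback() // returns sql.ErrTxDone harmlessly after a successful Commit

	query, args := buildClaimSelect(jobType, limit)
	rows, err := tx.QueryContext(ctx, query, args...)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	// Next returning false has already closed the result set, so the
	// connection is free for the UPDATE statements below.
	if err := rows.Err(); err != nil {
		return nil, err
	}

	for _, id := range ids {
		_, err := tx.ExecContext(ctx,
			"UPDATE jobs SET status = 'Running', updated_at = NOW() WHERE id = $1", id)
		if err != nil {
			return nil, err
		}
	}
	if err := tx.Commit(); err != nil {
		return nil, err
	}
	return ids, nil
}

func main() {
	// Print the statements only; running claimPendingJobs needs a live database.
	for _, c := range []struct {
		jobType string
		limit   int
	}{{"", 0}, {"Renewal", 1}} {
		q, args := buildClaimSelect(c.jobType, c.limit)
		fmt.Println(q, args)
	}
}
```

Because the row locks are held until COMMIT, a second replica running the same SELECT during that window skips the locked rows, and after the commit it no longer matches them because their status is Running.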

Safety argument for pre-flipping Pending -> Running inside the claim transaction: ProcessRenewalJob and ProcessDeploymentJob both call UpdateStatus(Running) unconditionally on entry, so an early flip is idempotent. On panic, the scheduler's panic recovery leaves the job in Running, which the existing stale-running reaper handles.

Tests (internal/repository/postgres/repo_test.go, skipped in -short):
- TestJobRepository_ClaimPendingJobs_FlipsToRunning: seed 5 Pending, claim once, assert all 5 returned + DB rows Running, residual claim returns 0.
- TestJobRepository_ClaimPendingJobs_ConcurrentDisjoint: seed M=40 Pending Renewals, spawn N=8 goroutines each calling ClaimPendingJobs(_, JobTypeRenewal, 1) in a loop. Invariants: (a) no job ID claimed by more than one worker, (b) sum of claims == 40, (c) all 40 rows in Running state in the DB. Bounded empty-streak guard (20 iterations) covers SKIP LOCKED transient zeros under contention.
- TestJobRepository_ClaimPendingByAgentID_TransitionsDeployments: seeds 2 Pending Deployment + 1 AwaitingCSR for agent A plus 1 Pending Renewal for agent B (scope check). Asserts deployments flip to Running, AwaitingCSR is returned but preserved, agent B's renewal never appears.

Mock updates: testutil_test.go, lifecycle_test.go, verification_test.go gained ClaimPendingJobs/ClaimPendingByAgentID on their mock job repos mirroring the real Pending -> Running semantics. Mocks intentionally do NOT write to StatusUpdates (that map tracks UpdateStatus() call history specifically; the real claim path uses a bulk UPDATE, not UpdateStatus).

Verification (CI-scope):
- go build ./cmd/...: ok
- go vet ./...: ok
- go test -race -short on service, api/handler, api/middleware, scheduler, connector/..., domain, validation, tlsprobe: ok
- Coverage gates: service 67.6% (>=55), handler 78.6% (>=60), middleware 80.0% (>=30), domain 92.7% (>=40). All hold.
- golangci-lint 2.11.4: 0 issues
- govulncheck: no vulnerabilities in call graph
- Frontend: tsc clean, 218 vitest tests pass, vite build ok
- helm lint + helm template: ok
- Invariant sweeps: FOR UPDATE SKIP LOCKED present in job.go; H-1 through H-5 fixtures unchanged.

Refs: H-6 in certctl-audit-report.md
2026-06-11 21:18:55 +00:00 · 2026-04-17 02:34:56 +00:00
parent 25564021e8
commit 0a75a3065f
8 changed files with 709 additions and 11 deletions
@@ -120,10 +120,20 @@ type JobRepository interface {
 	ListByCertificate(ctx context.Context, certID string) ([]*domain.Job, error)
 	// UpdateStatus updates a job's status and optional error message.
 	UpdateStatus(ctx context.Context, id string, status domain.JobStatus, errMsg string) error
-	// GetPendingJobs returns jobs not yet processed of a specific type.
+	// GetPendingJobs returns jobs not yet processed of a specific type. Prefer ClaimPendingJobs in
+	// production paths where concurrent schedulers may race — see H-6 (CWE-362) remediation.
 	GetPendingJobs(ctx context.Context, jobType domain.JobType) ([]*domain.Job, error)
 	// ListPendingByAgentID returns pending deployment jobs and AwaitingCSR jobs for a specific agent.
+	// Prefer ClaimPendingByAgentID in production paths — see H-6 (CWE-362) remediation.
 	ListPendingByAgentID(ctx context.Context, agentID string) ([]*domain.Job, error)
+	// ClaimPendingJobs atomically claims up to `limit` Pending jobs and transitions them to Running
+	// using SELECT FOR UPDATE SKIP LOCKED inside a transaction. An empty jobType matches any type;
+	// limit <= 0 means no limit. H-6 (CWE-362) race remediation.
+	ClaimPendingJobs(ctx context.Context, jobType domain.JobType, limit int) ([]*domain.Job, error)
+	// ClaimPendingByAgentID atomically claims pending deployment jobs for an agent (flipping them
+	// to Running) and locks AwaitingCSR jobs against concurrent observers (leaving state intact,
+	// since the CSR-submission path drives the next transition). H-6 (CWE-362) race remediation.
+	ClaimPendingByAgentID(ctx context.Context, agentID string) ([]*domain.Job, error)
 }

 // RenewalPolicyRepository defines operations for managing renewal policies.