mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-11 21:18:55 +00:00
security: atomic pending-job claim with FOR UPDATE SKIP LOCKED (H-6)
Fixes H-6 (CWE-362) — GetPendingJobs returned pending rows without row
locks, so two scheduler replicas in an HA deployment could both read the
same row, both decide it was theirs, and race on UpdateStatus, producing
duplicate Running jobs and duplicate certificate issuances.
Remediation: a claim-style repository API that selects + transitions
Pending -> Running in one transaction with SELECT ... FOR UPDATE SKIP
LOCKED. Concurrent claimants observe disjoint row sets; no worker ever
sees another worker's claimed row.
Repository changes (internal/repository/postgres/job.go):
- New ClaimPendingJobs(ctx, jobType, limit): BEGIN; SELECT id,...
FROM jobs WHERE status='Pending' (optional type filter, optional
LIMIT) FOR UPDATE SKIP LOCKED; UPDATE jobs SET status='Running',
updated_at=NOW() WHERE id = ANY($ids); COMMIT. Returns the claimed
rows with status already flipped.
- New ClaimPendingByAgentID(ctx, agentID): mirrors M31 UNION ALL
semantics (direct agent_id match, target->agent JOIN fallback,
certificate->target->agent chain for AwaitingCSR) but wraps each
branch in FOR UPDATE SKIP LOCKED and flips Deployment/Renewal rows
to Running. AwaitingCSR rows are returned in place (state
transition deferred until SubmitCSR, consistent with M8 semantics).
- Existing GetPendingJobs / ListPendingByAgentID retained for legacy
compatibility; their godoc now directs production callers to the
Claim* variants.
Production caller switches:
- internal/service/job.go ProcessPendingJobs: ListByStatus(Pending)
-> ClaimPendingJobs(ctx, "", 0). Eliminates the real scheduler
race between two replicas tick-firing simultaneously.
- internal/service/agent.go GetPendingWork: ListPendingByAgentID ->
ClaimPendingByAgentID. Eliminates the race between two pollers
for the same agent (e.g. brief network blip causing duplicate
poll) and between a scheduler tick and an agent poll.
Safety argument for pre-flipping Pending -> Running inside the claim
transaction: ProcessRenewalJob and ProcessDeploymentJob both call
UpdateStatus(Running) unconditionally on entry, so an early flip is
idempotent. On panic, the scheduler's panic recovery leaves the job
in Running which the existing stale-running reaper handles.
Tests (internal/repository/postgres/repo_test.go, skipped in -short):
- TestJobRepository_ClaimPendingJobs_FlipsToRunning: seed 5 Pending,
claim once, assert all 5 returned + DB rows Running, residual
claim returns 0.
- TestJobRepository_ClaimPendingJobs_ConcurrentDisjoint: seed M=40
Pending Renewals, spawn N=8 goroutines each calling
ClaimPendingJobs(_, JobTypeRenewal, 1) in a loop. Invariants:
(a) no job ID claimed by more than one worker, (b) sum of claims
== 40, (c) all 40 rows in Running state in the DB. Bounded
empty-streak guard (20 iterations) covers SKIP LOCKED transient
zeros under contention.
- TestJobRepository_ClaimPendingByAgentID_TransitionsDeployments:
seeds 2 Pending Deployment + 1 AwaitingCSR for agent A plus 1
Pending Renewal for agent B (scope check). Asserts deployments
flip to Running, AwaitingCSR is returned but preserved, agent B's
renewal never appears.
Mock updates: testutil_test.go, lifecycle_test.go, verification_test.go
gained ClaimPendingJobs/ClaimPendingByAgentID on their mock job repos
mirroring the real Pending -> Running semantics. Mocks intentionally
do NOT write to StatusUpdates (that map tracks UpdateStatus() call
history specifically; the real claim path uses a bulk UPDATE, not
UpdateStatus).
Verification (CI-scope):
- go build ./cmd/...: ok
- go vet ./...: ok
- go test -race -short on service, api/handler, api/middleware,
scheduler, connector/..., domain, validation, tlsprobe: ok
- Coverage gates: service 67.6% (>=55), handler 78.6% (>=60),
middleware 80.0% (>=30), domain 92.7% (>=40). All hold.
- golangci-lint 2.11.4: 0 issues
- govulncheck: no vulnerabilities in call graph
- Frontend: tsc clean, 218 vitest tests pass, vite build ok
- helm lint + helm template: ok
- Invariant sweeps: FOR UPDATE SKIP LOCKED present in job.go;
H-1 through H-5 fixtures unchanged.
Refs: H-6 in certctl-audit-report.md
This commit is contained in:
@@ -120,10 +120,20 @@ type JobRepository interface {
|
||||
ListByCertificate(ctx context.Context, certID string) ([]*domain.Job, error)
|
||||
// UpdateStatus updates a job's status and optional error message.
|
||||
UpdateStatus(ctx context.Context, id string, status domain.JobStatus, errMsg string) error
|
||||
// GetPendingJobs returns jobs not yet processed of a specific type.
|
||||
// GetPendingJobs returns jobs not yet processed of a specific type. Prefer ClaimPendingJobs in
|
||||
// production paths where concurrent schedulers may race — see H-6 (CWE-362) remediation.
|
||||
GetPendingJobs(ctx context.Context, jobType domain.JobType) ([]*domain.Job, error)
|
||||
// ListPendingByAgentID returns pending deployment jobs and AwaitingCSR jobs for a specific agent.
|
||||
// Prefer ClaimPendingByAgentID in production paths — see H-6 (CWE-362) remediation.
|
||||
ListPendingByAgentID(ctx context.Context, agentID string) ([]*domain.Job, error)
|
||||
// ClaimPendingJobs atomically claims up to `limit` Pending jobs and transitions them to Running
|
||||
// using SELECT FOR UPDATE SKIP LOCKED inside a transaction. An empty jobType matches any type;
|
||||
// limit <= 0 means no limit. H-6 (CWE-362) race remediation.
|
||||
ClaimPendingJobs(ctx context.Context, jobType domain.JobType, limit int) ([]*domain.Job, error)
|
||||
// ClaimPendingByAgentID atomically claims pending deployment jobs for an agent (flipping them
|
||||
// to Running) and locks AwaitingCSR jobs against concurrent observers (leaving state intact,
|
||||
// since the CSR-submission path drives the next transition). H-6 (CWE-362) race remediation.
|
||||
ClaimPendingByAgentID(ctx context.Context, agentID string) ([]*domain.Job, error)
|
||||
}
|
||||
|
||||
// RenewalPolicyRepository defines operations for managing renewal policies.
|
||||
|
||||
Reference in New Issue
Block a user