mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 18:51:32 +00:00
1b03d0c594
Phase-9 docker compose smoke surfaced a latent production-breaking bug introduced by commit89b910a(H-6 atomic pending-job claim). The ClaimPendingByAgentID query in internal/repository/postgres/job.go combined UNION ALL with FOR UPDATE SKIP LOCKED in a single statement. Postgres rejects this with: ERROR: FOR UPDATE is not allowed with UNION/INTERSECT/EXCEPT Every agent work-poll returns HTTP 500 in any real deployment where an agent is actually polling. From the compose log: request_id=6da47015-... GET /api/v1/agents/agent-demo-1/work status=500 duration_ms=2 The schema-per-test unit harness in internal/repository/postgres/ *_test.go never inserted jobs and polled, so the SQL execution path was never exercised. The bug has been latent in master since89b910alanded. Fix: split the UNION ALL into two separate FOR UPDATE SKIP LOCKED queries within the existing transaction. The H-6 atomicity invariant (concurrent pollers never see the same Pending row) is preserved because: 1. The two queries run inside the same transaction (tx). 2. Each query independently locks its result rows with FOR UPDATE SKIP LOCKED. 3. The subsequent UPDATE that flips Pending -> Running runs in the same transaction, so the rows stay invisible to concurrent callers from initial SELECT through final COMMIT. 4. The transaction is the unit of consistency, not the single SQL statement. Two queries: - Branch 1 (direct): jobs.agent_id = + status='Pending' + type='Deployment'. ORDER BY created_at ASC, FOR UPDATE SKIP LOCKED. - Branch 2 (fallback): jobs.agent_id IS NULL + INNER JOIN deployment_targets dt ON jobs.target_id = dt.id WHERE dt.agent_id = . ORDER BY j.created_at ASC, FOR UPDATE OF j SKIP LOCKED (FOR UPDATE OF needed because the join brings in dt). Branch 3 (AwaitingCSR) is unchanged — already a single SELECT, not affected by the UNION restriction. Inline comment explains the fix's load-bearing-ness so a future refactor doesn't merge them back into one UNION query. Verify (sandbox): go vet clean; go test -short -count=1 PASS on internal/repository/postgres/. Workstation re-runs 'docker compose up' to confirm the agent's GET /work returns 200 with the next pending-deployment claim. Note: this is NOT a regression introduced by Auth Bundle 2 or the 2026-05-11 audit fixes; it's a pre-existing latent defect from H-6. Including in v2.1.0 because shipping with a broken agent work-poll would block the demo path on day one of release.