feat(agent): per-target deploy mutex serializes concurrent deploys to the same target

Phase 2 of the deploy-hardening I master bundle. Closes the agent-side
race window where two concurrent renewals against the same target ID
(typical: two SAN entries renewing in the same window) would otherwise
collide on the connector's temp-file path or run the reload command
against itself.

The Agent struct grows a sync.Map of *sync.Mutex keyed on target ID;
targetDeployMutex(targetID) lazy-init's one on first acquisition.
executeDeploymentJob acquires the mutex before connector.DeployCertificate
and releases via defer at function exit — the lock spans the full
Deploy duration including PreCommit (validate), atomic-rename, PostCommit
(reload), and post-deploy verify (Phases 4-9).

Granularity per frozen decision 0.5: one mutex per target ID, NOT per
(target, cert) pair. Cert deploy throughput is operator-grade
tens-per-minute; coarse serialization simplifies reasoning about
reload-side race windows. Mutexes live for the agent's lifetime —
target IDs are bounded so no janitor needed (~16 bytes per entry).

Empty TargetID (defensive — should never happen for deploy jobs)
bypasses the lock to avoid a singleton serialization point pulling
all targetless work onto a shared mutex.

Tests (5 named cases in cmd/agent/deploy_mutex_test.go):

- TestAgent_ConcurrentDeploysToSameTarget_Serialize — race-detector
  smoke; 10 goroutines acquire same target's mutex; max-in-flight
  asserts == 1
- TestAgent_DifferentTargetIDs_ParallelizeIndependently — per-target
  granularity proof
- TestAgent_EmptyTargetID_ReturnsNilMutex — defensive contract
- TestAgent_TargetMutex_IsStable — sync.Map LoadOrStore returns same
  pointer across calls
- TestAgent_TargetMutex_RaceLookup — race-free under N=50 concurrent
  lookups for same key

go test -race -count=1 green; gofmt + go vet + golangci-lint v2.11.4
all 0 issues against my new code (pre-existing import-grouping drift
in agent_test.go / main.go / verify*.go is unrelated to this change
and not caught by `go fmt ./...` which CI uses).

Phase 3 next: ValidateOnly method on target.Connector interface;
default impl returns ErrValidateOnlyNotSupported across all 13
connectors.
This commit is contained in:
shankar0123
2026-04-30 14:32:40 +00:00
parent f5c67a51b2
commit 30b251ea13
2 changed files with 200 additions and 0 deletions
+57
View File
@@ -95,6 +95,47 @@ type Agent struct {
// race with ctx.Done() and other cases.
retiredOnce sync.Once
retiredSignal chan struct{}
// Deploy-hardening I Phase 2: per-target deploy mutex.
// Two cert renewals against the same target ID (e.g., two SAN
// entries renewing in the same window, or a fast-cycling
// renewal-then-test workflow) MUST serialize at the agent
// dispatch site. Without this lock, the underlying connector's
// temp-file path could collide and the reload command would
// race against itself.
//
// Granularity is one mutex per target ID, NOT per (target, cert)
// pair — frozen decision 0.5. Cert deploy throughput is
// operator-grade tens-per-minute; coarse serialization is fine
// and simplifies reasoning about reload-side race windows.
//
// sync.Map is sized for thousands of unique target IDs without
// rehash thrash; LoadOrStore is atomic + lock-free on the
// hot path. Mutexes live for the agent's lifetime — no janitor
// because target IDs are bounded and the per-target memory
// (~16 bytes per entry) is negligible vs. typical agent heap.
//
// Job items without a TargetID (e.g., agent-managed cert + no
// connector dispatch — should never happen for deploy jobs but
// defended anyway) bypass the lock to avoid a singleton
// serialization point.
deployMutexes sync.Map // map[string]*sync.Mutex, keyed on JobItem.TargetID
}
// targetDeployMutex returns the per-target-ID *sync.Mutex,
// lazy-initialising one on first acquisition. Returns nil when
// targetID is empty (caller should skip the lock entirely).
//
// Phase 2 of the deploy-hardening I master bundle: the load-bearing
// serialization point that defends against concurrent deploys to the
// same target stomping each other's temp-file paths or reload
// commands.
func (a *Agent) targetDeployMutex(targetID string) *sync.Mutex {
if targetID == "" {
return nil
}
v, _ := a.deployMutexes.LoadOrStore(targetID, &sync.Mutex{})
return v.(*sync.Mutex)
}
// WorkResponse represents the response from the work polling endpoint.
@@ -667,6 +708,22 @@ func (a *Agent) executeDeploymentJob(ctx context.Context, job JobItem) {
},
}
// Phase 2 of the deploy-hardening I master bundle:
// per-target deploy mutex. Acquire BEFORE
// DeployCertificate so two concurrent renewals against
// the same target ID serialize. The lock is held for the
// full Deploy duration including PreCommit (validate),
// PostCommit (reload), and post-deploy verify (Phases
// 4-9). Released on every return path via defer.
var targetID string
if job.TargetID != nil {
targetID = *job.TargetID
}
if mu := a.targetDeployMutex(targetID); mu != nil {
mu.Lock()
defer mu.Unlock()
}
result, err := connector.DeployCertificate(ctx, deployReq)
if err != nil {
a.logger.Error("deployment failed",