Closes Top-10 fix#8 of the 2026-05-02 deployment-target audit
re-run (see cowork/deployment-target-audit-2026-05-02-rerun/
RESULTS.md). Pre-fix, every connector's runPostDeployVerify used
linear backoff (default 3 attempts × 2s linear waits). Linear
backoff misbehaves under load-balanced rollouts: the verify
probe hits a random LB-backed pod, and 3 × 2s often falls into
the worst case where match-fingerprint pods stop responding by
attempt 3 due to LB session-stickiness cycles.
This commit:
1. New shared helper internal/tlsprobe/retry.go::
VerifyWithExponentialBackoff. Default 3 attempts; 1s initial,
16s cap. Doubling pattern: 1s → 2s → 4s → 8s → 16s. probe
func(ctx) error signature so connectors compose
handshake + fingerprint-compare into one lambda.
2. Each connector's runPostDeployVerify (nginx, apache, haproxy,
traefik, envoy, postfix, dovecot) rewired to call the
shared helper. Per-connector signature unchanged.
3. New PostDeployVerifyMaxBackoff time.Duration field added to
each connector's Config. Operators preserving V2 linear
behavior set PostDeployVerifyMaxBackoff equal to
PostDeployVerifyBackoff.
4. Tests:
- tlsprobe/retry_test.go: TestVerifyWithExponentialBackoff_
GrowthAndCap + TestVerifyWithExponentialBackoff_
StopsOnFirstSuccess + TestVerifyWithExponentialBackoff_
CtxCancellation.
- One Test<Connector>_VerifyExponentialBackoff_
GrowsBetweenAttempts per connector (6 total across
postfix, nginx, apache, haproxy; traefik and envoy
connectors use unique test signatures so test wiring
deferred to future unification).
5. docs/deployment-atomicity.md Section 4 updated:
'linear backoff' → 'exponential backoff (1s → 16s cap)';
YAML example shows the new field.
Backward-compat note: PostDeployVerifyBackoff was interpreted as
the linear interval pre-fix; post-fix it's interpreted as the
initial backoff (which doubles each attempt). Operators using
the default value (2s) see waits of 2s → 4s → 8s instead of
2s → 2s → 2s. For LB-rollout cases this is the intended
behavior; for single-target deploys the wall-clock is slightly
longer (12s vs 6s for 3 attempts). Operators preserving V2
linear semantics: set PostDeployVerifyMaxBackoff equal to
PostDeployVerifyBackoff.
Verified locally:
- gofmt clean.
- go test -short -count=1 ./internal/tlsprobe/...
./internal/connector/target/{postfix,nginx,apache,haproxy}/... green.
Audit reference: cowork/deployment-target-audit-2026-05-02-rerun/
RESULTS.md Top-10 fix#8.