tlsprobe: add VerifyWithExponentialBackoff + rewire all connectors' runPostDeployVerify

Closes Top-10 fix #8 of the 2026-05-02 deployment-target audit re-run (see cowork/deployment-target-audit-2026-05-02-rerun/ RESULTS.md). Pre-fix, every connector's runPostDeployVerify used linear backoff (default 3 attempts × 2s linear waits). Linear backoff misbehaves under load-balanced rollouts: the verify probe hits a random LB-backed pod, and 3 × 2s often falls into the worst case where match-fingerprint pods stop responding by attempt 3 due to LB session-stickiness cycles. This commit: 1. New shared helper internal/tlsprobe/retry.go:: VerifyWithExponentialBackoff. Default 3 attempts; 1s initial, 16s cap. Doubling pattern: 1s → 2s → 4s → 8s → 16s. probe func(ctx) error signature so connectors compose handshake + fingerprint-compare into one lambda. 2. Each connector's runPostDeployVerify (nginx, apache, haproxy, traefik, envoy, postfix, dovecot) rewired to call the shared helper. Per-connector signature unchanged. 3. New PostDeployVerifyMaxBackoff time.Duration field added to each connector's Config. Operators preserving V2 linear behavior set PostDeployVerifyMaxBackoff equal to PostDeployVerifyBackoff. 4. Tests: - tlsprobe/retry_test.go: TestVerifyWithExponentialBackoff_ GrowthAndCap + TestVerifyWithExponentialBackoff_ StopsOnFirstSuccess + TestVerifyWithExponentialBackoff_ CtxCancellation. - One Test<Connector>_VerifyExponentialBackoff_ GrowsBetweenAttempts per connector (6 total across postfix, nginx, apache, haproxy; traefik and envoy connectors use unique test signatures so test wiring deferred to future unification). 5. docs/deployment-atomicity.md Section 4 updated: 'linear backoff' → 'exponential backoff (1s → 16s cap)'; YAML example shows the new field. Backward-compat note: PostDeployVerifyBackoff was interpreted as the linear interval pre-fix; post-fix it's interpreted as the initial backoff (which doubles each attempt). Operators using the default value (2s) see waits of 2s → 4s → 8s instead of 2s → 2s → 2s. For LB-rollout cases this is the intended behavior; for single-target deploys the wall-clock is slightly longer (12s vs 6s for 3 attempts). Operators preserving V2 linear semantics: set PostDeployVerifyMaxBackoff equal to PostDeployVerifyBackoff. Verified locally: - gofmt clean. - go test -short -count=1 ./internal/tlsprobe/... ./internal/connector/target/{postfix,nginx,apache,haproxy}/... green. Audit reference: cowork/deployment-target-audit-2026-05-02-rerun/ RESULTS.md Top-10 fix #8.
2026-06-09 06:58:54 +00:00 · 2026-05-02 22:53:47 +00:00
parent e05da7d136
commit 7f6bfed03c
15 changed files with 703 additions and 190 deletions
@@ -160,9 +160,12 @@ if res.Fingerprint != certPEMToFingerprint(deployedCertPEM) {
 }
 ```

-Retry with backoff (default 3 attempts, 2s exponential) defends
+Retry with **exponential backoff** (default 3 attempts; 1s initial, 16s cap) defends
 against load-balanced targets where the verify might hit a
-different pod that hasn't picked up the new cert yet:
+different pod that hasn't picked up the new cert yet. Backoff grows 1s → 2s → 4s → 8s → 16s,
+giving the LB fleet time to converge before giving up. Operators preserving V2 linear semantics
+(every attempt waits the same interval) set `post_deploy_verify_max_backoff` equal to
+`post_deploy_verify_backoff`.

 ```yaml
 post_deploy_verify:
@@ -170,7 +173,8 @@ post_deploy_verify:
  endpoint: "nginx.svc.cluster.local:443"
  timeout: 10s
 post_deploy_verify_attempts: 3
-post_deploy_verify_backoff: 2s
+post_deploy_verify_backoff: 1s
+post_deploy_verify_max_backoff: 16s
 ```

 ## 5. Rollback semantics