Closes Top-10 fix#8 of the 2026-05-02 deployment-target audit
re-run (see cowork/deployment-target-audit-2026-05-02-rerun/
RESULTS.md). Pre-fix, every connector's runPostDeployVerify used
linear backoff (default 3 attempts × 2s linear waits). Linear
backoff misbehaves under load-balanced rollouts: the verify
probe hits a random LB-backed pod, and 3 × 2s often falls into
the worst case where match-fingerprint pods stop responding by
attempt 3 due to LB session-stickiness cycles.
This commit:
1. New shared helper internal/tlsprobe/retry.go::
VerifyWithExponentialBackoff. Default 3 attempts; 1s initial,
16s cap. Doubling pattern: 1s → 2s → 4s → 8s → 16s. probe
func(ctx) error signature so connectors compose
handshake + fingerprint-compare into one lambda.
2. Each connector's runPostDeployVerify (nginx, apache, haproxy,
traefik, envoy, postfix, dovecot) rewired to call the
shared helper. Per-connector signature unchanged.
3. New PostDeployVerifyMaxBackoff time.Duration field added to
each connector's Config. Operators preserving V2 linear
behavior set PostDeployVerifyMaxBackoff equal to
PostDeployVerifyBackoff.
4. Tests:
- tlsprobe/retry_test.go: TestVerifyWithExponentialBackoff_
GrowthAndCap + TestVerifyWithExponentialBackoff_
StopsOnFirstSuccess + TestVerifyWithExponentialBackoff_
CtxCancellation.
- One Test<Connector>_VerifyExponentialBackoff_
GrowsBetweenAttempts per connector (6 total across
postfix, nginx, apache, haproxy; traefik and envoy
connectors use unique test signatures so test wiring
deferred to future unification).
5. docs/deployment-atomicity.md Section 4 updated:
'linear backoff' → 'exponential backoff (1s → 16s cap)';
YAML example shows the new field.
Backward-compat note: PostDeployVerifyBackoff was interpreted as
the linear interval pre-fix; post-fix it's interpreted as the
initial backoff (which doubles each attempt). Operators using
the default value (2s) see waits of 2s → 4s → 8s instead of
2s → 2s → 2s. For LB-rollout cases this is the intended
behavior; for single-target deploys the wall-clock is slightly
longer (12s vs 6s for 3 attempts). Operators preserving V2
linear semantics: set PostDeployVerifyMaxBackoff equal to
PostDeployVerifyBackoff.
Verified locally:
- gofmt clean.
- go test -short -count=1 ./internal/tlsprobe/...
./internal/connector/target/{postfix,nginx,apache,haproxy}/... green.
Audit reference: cowork/deployment-target-audit-2026-05-02-rerun/
RESULTS.md Top-10 fix#8.
Phase 5 of the deploy-hardening I master bundle. Mirrors the Phase 4
NGINX template for Apache httpd. Test count lifts 3 → 34 (above the
prompt's >=30 target; matches and slightly exceeds the IIS bar).
Apache-specific quirks codified in apache.go:
- Validate command convention is `apachectl configtest` (NOT
`apachectl -t` — that flag exists but configtest is the documented
operator-facing form).
- Reload command convention is `apachectl graceful` for zero-
downtime worker swap (NOT `apachectl restart` which drops
in-flight TLS sessions).
- Per-distro user defaults: Debian/Ubuntu apache2, RHEL/CentOS
apache, Alpine httpd. pickFirstExistingUser walks the list and
picks the one that resolves on the host; falls back to no-chown
when none exist (cross-distro portability without operator
config; same approach as nginx).
- Default key file mode 0600 for back-compat with operators
relying on the historical hard-coded value (matches the
pre-Phase-5 implementation behavior).
DeployCertificate refactor:
- Replaces the duplicated os.WriteFile chain with deploy.Apply.
- PreCommit runs the operator's ValidateCommand via the test
seam (which wraps `sh -c <cmd>` in production).
- PostCommit runs ReloadCommand the same way.
- Post-deploy TLS verify (frozen-decision-0.3 default ON when
Endpoint is configured): probes the configured target,
compares leaf cert SHA-256 against deployed bytes, retries with
exponential backoff (default 3 attempts / 2s backoff for
load-balanced targets).
- Rollback wires: reload-fail → restore backups + retry reload;
verify-fail → restore backups + reload again. Second-failure
surfaces ErrRollbackFailed for operator-actionable triage.
ValidateOnly real implementation replaces the Phase 3 stub.
Returns ErrValidateOnlyNotSupported when no ValidateCommand
configured; otherwise runs the validate-with-the-target command
without touching the live cert.
Test seams (SetTestRunValidate / SetTestRunReload / SetTestProbe)
allow tests to skip exec without `apachectl` on PATH; mirror the
nginx pattern.
Tests (34 total: 31 in apache_atomic_test.go + 3 pre-existing
in apache_test.go):
- Atomic invariants (happy, validate-fail-no-files-changed,
reload-fail-rollback, rollback-also-fail-escalation)
- SHA-256 idempotency (full skip + partial-mismatch full-deploy)
- Post-deploy verify (match-success, mismatch-rollback,
dial-timeout-rollback, retries-until-match,
retries-exhausted-rollback, no-endpoint-skips, disabled-skips)
- Ownership / mode preservation (existing-mode, override-wins,
default-key-0600, default-cert-0644)
- Backup retention (keeps-N, disabled-no-backups, backup-created)
- Concurrency (same-paths-serialize)
- ValidateOnly (happy, fails, no-command-sentinel, stderr-in-error)
- Edge cases (no-chain, no-key, ctx-cancelled, verify-rollback-
reload, deployment-id-prefix, metadata-populated)
Coverage: Apache 86.6% (above the >=85% prompt bar). Race detector
clean. golangci-lint v2.11.4 clean.
Smoke test connectorsAtPhase3 list shrunk from 12 to 11
entries (apache removed; nginx + apache now have real impls).
Phase 6 next: HAProxy (combined PEM atomic write + `haproxy -c -f`
validate + uplift 3 → >=30).