mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 17:41:29 +00:00
b8b7e1e3dd
Closes Top-10 fix #8 of the 2026-05-02 deployment-target audit re-run (see cowork/deployment-target-audit-2026-05-02-rerun/RESULTS.md).

Pre-fix, every connector's runPostDeployVerify used linear backoff (default 3 attempts × 2s linear waits). Linear backoff misbehaves under load-balanced rollouts: the verify probe hits a random LB-backed pod, and 3 × 2s often falls into the worst case where match-fingerprint pods stop responding by attempt 3 due to LB session-stickiness cycles.

This commit:

1. New shared helper internal/tlsprobe/retry.go::VerifyWithExponentialBackoff. Default 3 attempts; 1s initial, 16s cap. Doubling pattern: 1s → 2s → 4s → 8s → 16s. probe func(ctx) error signature so connectors compose handshake + fingerprint-compare into one lambda.
2. Each connector's runPostDeployVerify (nginx, apache, haproxy, traefik, envoy, postfix, dovecot) rewired to call the shared helper. Per-connector signature unchanged.
3. New PostDeployVerifyMaxBackoff time.Duration field added to each connector's Config. Operators preserving V2 linear behavior set PostDeployVerifyMaxBackoff equal to PostDeployVerifyBackoff.
4. Tests:
   - tlsprobe/retry_test.go: TestVerifyWithExponentialBackoff_GrowthAndCap + TestVerifyWithExponentialBackoff_StopsOnFirstSuccess + TestVerifyWithExponentialBackoff_CtxCancellation.
   - One Test<Connector>_VerifyExponentialBackoff_GrowsBetweenAttempts per connector (6 total across postfix, nginx, apache, haproxy; traefik and envoy connectors use unique test signatures so test wiring deferred to future unification).
5. docs/deployment-atomicity.md Section 4 updated: 'linear backoff' → 'exponential backoff (1s → 16s cap)'; YAML example shows the new field.

Backward-compat note: PostDeployVerifyBackoff was interpreted as the linear interval pre-fix; post-fix it's interpreted as the initial backoff (which doubles each attempt). Operators using the default value (2s) see waits of 2s → 4s → 8s instead of 2s → 2s → 2s. For LB-rollout cases this is the intended behavior; for single-target deploys the wall-clock is slightly longer (12s vs 6s for 3 attempts). Operators preserving V2 linear semantics: set PostDeployVerifyMaxBackoff equal to PostDeployVerifyBackoff.

Verified locally:
- gofmt clean.
- go test -short -count=1 ./internal/tlsprobe/... ./internal/connector/target/{postfix,nginx,apache,haproxy}/... green.

Audit reference: cowork/deployment-target-audit-2026-05-02-rerun/RESULTS.md Top-10 fix #8.
69 lines
1.8 KiB
Go
// Copyright (c) 2025 Certctl Contributors <certctl@proton.me>
//
// SPDX-License-Identifier: BSL-1.1
// See COPYING for license details.

package tlsprobe

import (
	"context"
	"time"
)

// RetryConfig holds parameters for exponential-backoff retries.
// Zero values use defaults: 3 attempts, 1s initial, 16s max.
type RetryConfig struct {
	Attempts       int           // total attempts; 0 = use 3 default
	InitialBackoff time.Duration // base; 0 = use 1 * time.Second default
	MaxBackoff     time.Duration // cap; 0 = use 16 * time.Second default
}

// VerifyWithExponentialBackoff calls the probe at most cfg.Attempts times,
// waiting cfg.InitialBackoff, 2*InitialBackoff, 4*InitialBackoff, ... capped at
// cfg.MaxBackoff between consecutive attempts. Returns nil on first probe success;
// returns the last attempt's error on full exhaustion.
//
// The probe function returns:
//   - nil error on success → return immediately, no further attempts.
//   - non-nil error → wait the exponentially-growing backoff and retry.
//
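// ctx is checked only while waiting between attempts: cancellation during a
// wait aborts immediately and returns ctx.Err(). The probe is passed the same
// ctx and is responsible for honoring it during its own I/O.
//
// Top-10 fix #8 of the 2026-05-02 deployment-target audit re-run.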
func VerifyWithExponentialBackoff(ctx context.Context, cfg RetryConfig, probe func(ctx context.Context) error) error {
	attempts := cfg.Attempts
	if attempts <= 0 {
		attempts = 3
	}
	initial := cfg.InitialBackoff
	if initial <= 0 {
		initial = 1 * time.Second
	}
	maxBackoff := cfg.MaxBackoff // named to avoid shadowing the max builtin
	if maxBackoff <= 0 {
		maxBackoff = 16 * time.Second
	}

	// Clamp the first wait too, so an InitialBackoff larger than MaxBackoff
	// still honors the documented cap.
	backoff := initial
	if backoff > maxBackoff {
		backoff = maxBackoff
	}

	var lastErr error
	for i := 0; i < attempts; i++ {
		if i > 0 {
			// Sleep before each retry; a cancelled ctx cuts the wait short.
			// An explicit timer (rather than time.After) is stopped on the
			// cancellation path so it is released right away.
			timer := time.NewTimer(backoff)
			select {
			case <-ctx.Done():
				timer.Stop()
				return ctx.Err()
			case <-timer.C:
			}
			// Double for the next wait, clamped to the cap.
			backoff *= 2
			if backoff > maxBackoff {
				backoff = maxBackoff
			}
		}
		err := probe(ctx)
		if err == nil {
			return nil
		}
		lastErr = err
	}
	return lastErr
}