mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 21:51:30 +00:00
8043e2bbac2d40cb4c31452297f40960ccec8d4d
4 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
b8b7e1e3dd |
tlsprobe: add VerifyWithExponentialBackoff + rewire all connectors' runPostDeployVerify
Closes Top-10 fix #8 of the 2026-05-02 deployment-target audit re-run (see cowork/deployment-target-audit-2026-05-02-rerun/ RESULTS.md). Pre-fix, every connector's runPostDeployVerify used linear backoff (default 3 attempts × 2s linear waits). Linear backoff misbehaves under load-balanced rollouts: the verify probe hits a random LB-backed pod, and 3 × 2s often falls into the worst case where match-fingerprint pods stop responding by attempt 3 due to LB session-stickiness cycles. This commit: 1. New shared helper internal/tlsprobe/retry.go:: VerifyWithExponentialBackoff. Default 3 attempts; 1s initial, 16s cap. Doubling pattern: 1s → 2s → 4s → 8s → 16s. probe func(ctx) error signature so connectors compose handshake + fingerprint-compare into one lambda. 2. Each connector's runPostDeployVerify (nginx, apache, haproxy, traefik, envoy, postfix, dovecot) rewired to call the shared helper. Per-connector signature unchanged. 3. New PostDeployVerifyMaxBackoff time.Duration field added to each connector's Config. Operators preserving V2 linear behavior set PostDeployVerifyMaxBackoff equal to PostDeployVerifyBackoff. 4. Tests: - tlsprobe/retry_test.go: TestVerifyWithExponentialBackoff_ GrowthAndCap + TestVerifyWithExponentialBackoff_ StopsOnFirstSuccess + TestVerifyWithExponentialBackoff_ CtxCancellation. - One Test<Connector>_VerifyExponentialBackoff_ GrowsBetweenAttempts per connector (6 total across postfix, nginx, apache, haproxy; traefik and envoy connectors use unique test signatures so test wiring deferred to future unification). 5. docs/deployment-atomicity.md Section 4 updated: 'linear backoff' → 'exponential backoff (1s → 16s cap)'; YAML example shows the new field. Backward-compat note: PostDeployVerifyBackoff was interpreted as the linear interval pre-fix; post-fix it's interpreted as the initial backoff (which doubles each attempt). Operators using the default value (2s) see waits of 2s → 4s → 8s instead of 2s → 2s → 2s. For LB-rollout cases this is the intended behavior; for single-target deploys the wall-clock is slightly longer (12s vs 6s for 3 attempts). Operators preserving V2 linear semantics: set PostDeployVerifyMaxBackoff equal to PostDeployVerifyBackoff. Verified locally: - gofmt clean. - go test -short -count=1 ./internal/tlsprobe/... ./internal/connector/target/{postfix,nginx,apache,haproxy}/... green. Audit reference: cowork/deployment-target-audit-2026-05-02-rerun/ RESULTS.md Top-10 fix #8. |
||
|
|
85d247455b |
docs(postfix): add Mode=postfix vs Mode=dovecot decision matrix subsection
Closes Top-10 fix #9 of the 2026-05-02 deployment-target audit
re-run (see cowork/deployment-target-audit-2026-05-02-rerun/
RESULTS.md). Pre-fix, the Postfix connector's docs in
docs/connectors.md described the connector as a single
"Postfix / Dovecot" target without explicit guidance on when to
use Mode=postfix vs Mode=dovecot. Operators with a mail server
running both Postfix (MTA, port 25) and Dovecot (IMAPS, port
993) had to read source to figure out the dual-deploy pattern.
Bundle 11 (commit
|
||
|
|
fb88e0f8a8 |
docs(deployment-atomicity): K8s row honest + audit-closure rollup
Closes Bundle 1 of the 2026-05-02 deployment-target coverage audit
(see cowork/deployment-target-audit-2026-05-02/RESULTS.md). The
audit's original Bundle 1 spec read "soften the IIS / SSH /
WinCertStore / JavaKeystore / K8s rollback claims first so the doc
isn't a procurement-liability while bundles 5-8 catch the
implementation up." Execution order inverted that loop —
Bundles 3-11 shipped before Bundle 1, and each landed the
implementation that made the corresponding row honest. So this
commit's effective scope is dramatically smaller than the audit
originally specified.
Three changes, all in docs/deployment-atomicity.md:
1. L95 k8ssecret row softened. Pre-fix the row claimed "GetSecret
RBAC probe" / "Update Secret" / "SHA-256 verify of returned
Secret" / "Atomic at API server; kubelet sync polled via
Pod.Status.ContainerStatuses" — as if all four columns described
live behavior. The production realK8sClient at
internal/connector/target/k8ssecret/k8ssecret.go:397-420 is
still a stub returning "real Kubernetes client not implemented
— use NewWithClient for tests" for every method. Post-fix the
row says so explicitly, points at the stub source, notes that
test mocks via NewWithClient work today, and forward-references
the Bundle 2 tracking prompt at
cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md.
2. New Section 1.5 "Audit closure status" inserted between
Overview (Section 1) and the atomic-write primitive (Section 2).
Pins which deployment-target-audit bundles shipped with their
commit hashes:
envoy Bundle 3
|
||
|
|
b95a548f65 |
docs: deploy-hardening I — atomic deploy + post-verify operator guide + connectors / README updates
Phase 12 of the deploy-hardening I master bundle. NEW docs/deployment-atomicity.md (12 sections, ~280 lines): 1. Overview — the three procurement-checklist gaps closed 2. The atomic-write primitive (Plan / File / Apply algorithm) 3. Per-connector atomic contract table (all 13 connectors) 4. Post-deploy TLS verification (handshake + SHA-256 + retries) 5. Rollback semantics (3 triggers + escalation path) 6. ValidateOnly dry-run mode (per-connector matrix) 7. File ownership + mode preservation (precedence + per-distro defaults) 8. Per-target deploy mutex (Phase 2) 9. Idempotency via SHA-256 (defends against retry storms) 10. Troubleshooting matrix (one row per failure mode) 11. V3-Pro deferrals (multi-region, pin manifests, SOC 2 export) 12. Per-connector quick reference (paste-able config snippets) UPDATE README.md::Deployment Targets — every connector row now notes the atomic + verify + rollback semantics that landed in deploy-hardening I. Added a closing paragraph linking to the new docs/deployment-atomicity.md. UPDATE docs/features.md — two new env-var rows: - CERTCTL_DEPLOY_BACKUP_RETENTION (default 3, -1 disables) - CERTCTL_K8S_DEPLOY_KUBELET_SYNC_TIMEOUT (default 60s) The G-3 docs-drift CI guard is satisfied: every new CERTCTL_DEPLOY_* env var documented here also appears in source (internal/deploy/types.go for BACKUP_RETENTION, k8ssecret config for KUBELET_SYNC_TIMEOUT). S-1 stale-counts guard: no literal-number current-state counts in the new doc — the per-connector tests are referenced via the file:line pattern (internal/connector/target/<name>/<name>_atomic_test.go) so the operator can grep for the actual count. Phase 13 next: pre-commit verification (full matrix + CI guard reproductions). |