Commit Graph

4 Commits

Author SHA1 Message Date
shankar0123 7f6bfed03c tlsprobe: add VerifyWithExponentialBackoff + rewire all connectors' runPostDeployVerify
Closes Top-10 fix #8 of the 2026-05-02 deployment-target audit
re-run (see cowork/deployment-target-audit-2026-05-02-rerun/
RESULTS.md). Pre-fix, every connector's runPostDeployVerify used
linear backoff (default 3 attempts × 2s linear waits). Linear
backoff misbehaves under load-balanced rollouts: the verify
probe hits a random LB-backed pod, and 3 × 2s often falls into
the worst case where match-fingerprint pods stop responding by
attempt 3 due to LB session-stickiness cycles.

This commit:

1. New shared helper internal/tlsprobe/retry.go::
   VerifyWithExponentialBackoff. Default 3 attempts; 1s initial,
   16s cap. Doubling pattern: 1s → 2s → 4s → 8s → 16s. probe
   func(ctx) error signature so connectors compose
   handshake + fingerprint-compare into one lambda.

2. Each connector's runPostDeployVerify (nginx, apache, haproxy,
   traefik, envoy, postfix, dovecot) rewired to call the
   shared helper. Per-connector signature unchanged.

3. New PostDeployVerifyMaxBackoff time.Duration field added to
   each connector's Config. Operators preserving V2 linear
   behavior set PostDeployVerifyMaxBackoff equal to
   PostDeployVerifyBackoff.

4. Tests:
   - tlsprobe/retry_test.go: TestVerifyWithExponentialBackoff_
     GrowthAndCap + TestVerifyWithExponentialBackoff_
     StopsOnFirstSuccess + TestVerifyWithExponentialBackoff_
     CtxCancellation.
   - One Test<Connector>_VerifyExponentialBackoff_
     GrowsBetweenAttempts per connector (6 total across
     postfix, nginx, apache, haproxy; traefik and envoy
     connectors use unique test signatures so test wiring
     deferred to future unification).

5. docs/deployment-atomicity.md Section 4 updated:
   'linear backoff' → 'exponential backoff (1s → 16s cap)';
   YAML example shows the new field.

Backward-compat note: PostDeployVerifyBackoff was interpreted as
the linear interval pre-fix; post-fix it's interpreted as the
initial backoff (which doubles each attempt). Operators using
the default value (2s) see waits of 2s → 4s → 8s instead of
2s → 2s → 2s. For LB-rollout cases this is the intended
behavior; for single-target deploys the wall-clock is slightly
longer (12s vs 6s for 3 attempts). Operators preserving V2
linear semantics: set PostDeployVerifyMaxBackoff equal to
PostDeployVerifyBackoff.

Verified locally:
- gofmt clean.
- go test -short -count=1 ./internal/tlsprobe/...
  ./internal/connector/target/{postfix,nginx,apache,haproxy}/... green.

Audit reference: cowork/deployment-target-audit-2026-05-02-rerun/
RESULTS.md Top-10 fix #8.
2026-05-02 22:56:07 +00:00
shankar0123 e05da7d136 docs(postfix): add Mode=postfix vs Mode=dovecot decision matrix subsection
Closes Top-10 fix #9 of the 2026-05-02 deployment-target audit
re-run (see cowork/deployment-target-audit-2026-05-02-rerun/
RESULTS.md). Pre-fix, the Postfix connector's docs in
docs/connectors.md described the connector as a single
"Postfix / Dovecot" target without explicit guidance on when to
use Mode=postfix vs Mode=dovecot. Operators with a mail server
running both Postfix (MTA, port 25) and Dovecot (IMAPS, port
993) had to read source to figure out the dual-deploy pattern.

Bundle 11 (commit 88e8881) added test pin for Mode=dovecot
(TestPostfix_Atomic_DovecotMode_HappyPath +
TestPostfix_Atomic_DovecotMode_VerifyFails_Rollback). This
commit lands the operator-facing doc that complements the test:

1. New "Choosing Mode=postfix vs Mode=dovecot" subsection in
   docs/connectors.md "Built-in: Postfix / Dovecot" section.
   Covers:
   - When to use each mode (MTA on 25 vs IMAPS on 993).
   - Daemon-specific defaults (cert_path, key_path,
     validate_command, reload_command) cited verbatim from
     internal/connector/target/postfix/postfix.go applyDefaults.
   - Note that postfix is the default when mode is unset.
   - Post-deploy verify endpoint is operator-supplied, NOT a
     per-mode default (the connector does not bake in
     port 25 / 993 — operators set post_deploy_verify.endpoint
     themselves to point at their daemon's listener).
   - Dual-deploy pattern for hosts running both daemons (two
     separate targets; byte-equal cert hits SHA-256
     idempotency on subsequent renewals; targets are independent
     in the scheduler so one reload failing rolls back that
     target only).
   - Shared-cert-via-symlink pattern (atomic-write os.Rename
     follows symlinks).
   - Daemon-specific quirks (Postfix STARTTLS chain
     requirements for external MTA validation; Dovecot IMAPS
     client-facing chain shipping; reload independence).
   - Test pin reference (Bundle 11 commit hash + dovecot test
     names; postfix-mode equivalent test names).

2. Forward-pointer footnote in docs/deployment-atomicity.md
   Section 3 "Per-connector atomic contract" pointing at the
   new subsection.

No code changes; no test changes; doc-only commit.

Verified locally:
- All defaults cited verbatim from postfix.go::applyDefaults
  (cert_path, key_path, validate_command, reload_command).
- Bundle 11 test names verified to exist in
  internal/connector/target/postfix/postfix_atomic_test.go
  (TestPostfix_Atomic_DovecotMode_HappyPath at L272,
  TestPostfix_Atomic_DovecotMode_VerifyFails_Rollback at L354).
- Spec's claim of "verify port 25 / 993 default" was incorrect:
  the connector does not bake in a per-mode verify port.
  Doc reflects ground truth.

Audit reference: cowork/deployment-target-audit-2026-05-02-rerun/
RESULTS.md Top-10 fix #9.
2026-05-02 22:46:44 +00:00
shankar0123 97a1243cc4 docs(deployment-atomicity): K8s row honest + audit-closure rollup
Closes Bundle 1 of the 2026-05-02 deployment-target coverage audit
(see cowork/deployment-target-audit-2026-05-02/RESULTS.md). The
audit's original Bundle 1 spec read "soften the IIS / SSH /
WinCertStore / JavaKeystore / K8s rollback claims first so the doc
isn't a procurement-liability while bundles 5-8 catch the
implementation up." Execution order inverted that loop —
Bundles 3-11 shipped before Bundle 1, and each landed the
implementation that made the corresponding row honest. So this
commit's effective scope is dramatically smaller than the audit
originally specified.

Three changes, all in docs/deployment-atomicity.md:

1. L95 k8ssecret row softened. Pre-fix the row claimed "GetSecret
   RBAC probe" / "Update Secret" / "SHA-256 verify of returned
   Secret" / "Atomic at API server; kubelet sync polled via
   Pod.Status.ContainerStatuses" — as if all four columns described
   live behavior. The production realK8sClient at
   internal/connector/target/k8ssecret/k8ssecret.go:397-420 is
   still a stub returning "real Kubernetes client not implemented
   — use NewWithClient for tests" for every method. Post-fix the
   row says so explicitly, points at the stub source, notes that
   test mocks via NewWithClient work today, and forward-references
   the Bundle 2 tracking prompt at
   cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md.

2. New Section 1.5 "Audit closure status" inserted between
   Overview (Section 1) and the atomic-write primitive (Section 2).
   Pins which deployment-target-audit bundles shipped with their
   commit hashes:

     envoy           Bundle 3   d8cd981
     traefik         Bundle 4   37634e6
     iis             Bundle 5   223f279
     ssh             Bundle 6   eb39059
     wincertstore    Bundle 7   1dd1dd4
     javakeystore    Bundle 8   87e0009
     caddy           Bundle 9   8cda860
     postfix/dovecot Bundle 11  88e8881

   Outstanding: Bundle 2 (K8s real client) — the V2 P0 blocker.
   Bundle 10 (loadtest, commit 6286cd4) is documented separately
   at deploy/test/loadtest/README.md as a CI/observability
   addition that doesn't modify the per-connector contract table.
   Section 1.5's closing paragraph documents the execution-order
   inversion so future readers understand why this commit ended
   up smaller than the audit's original spec implied.

3. Section 1's gap table updated. The "Atomic deploy with rollback"
   row's post-bundle column went from "All 13 connectors via
   deploy.Apply" to "12 of 13 connectors via deploy.Apply (K8s
   pending Bundle 2 — see Section 1.5)" with an anchor link.

Rows L81-94 left untouched: each claim is now honest because
Bundles 3-11 implementations landed. Per-bundle commit messages
have been recording this fact ("Post-Bundle-N the claim is
honest; pre-fix it was aspirational") since Bundle 5; this
commit closes the loop by making the doc reflect the same.

What this commit does NOT do:
- Add K8s to Section 11 "V3-Pro deferrals" — Bundle 2 is a V2
  P0 blocker, not a V3-Pro deferral. Mixing the two would
  defer a real procurement-checklist gap into "future work"
  where it doesn't belong.
- Edit rows L81-94 of the per-connector table — they're honest
  as-is.
- Touch docs/architecture.md / connectors.md / security.md —
  those have their own per-section accuracy requirements; this
  commit is scoped to deployment-atomicity.md.

Verified locally:
- gofmt -l ./internal/ ./cmd/  clean (doc-only commit; no Go diff).
- markdown structure check via `grep -n '^## '`: Section 1.5
  inserted cleanly between 1 and 2; no other headings disturbed.
- All 8 commit hashes in Section 1.5 verified against
  `git log --oneline --reverse v2.0.67..HEAD` at HEAD=88e8881.

Audit reference: cowork/deployment-target-audit-2026-05-02/RESULTS.md
Bundle 1.
2026-05-02 20:06:24 +00:00
claude 2eb608f40f docs: deploy-hardening I — atomic deploy + post-verify operator guide + connectors / README updates
Phase 12 of the deploy-hardening I master bundle.

NEW docs/deployment-atomicity.md (12 sections, ~280 lines):
1. Overview — the three procurement-checklist gaps closed
2. The atomic-write primitive (Plan / File / Apply algorithm)
3. Per-connector atomic contract table (all 13 connectors)
4. Post-deploy TLS verification (handshake + SHA-256 + retries)
5. Rollback semantics (3 triggers + escalation path)
6. ValidateOnly dry-run mode (per-connector matrix)
7. File ownership + mode preservation (precedence + per-distro defaults)
8. Per-target deploy mutex (Phase 2)
9. Idempotency via SHA-256 (defends against retry storms)
10. Troubleshooting matrix (one row per failure mode)
11. V3-Pro deferrals (multi-region, pin manifests, SOC 2 export)
12. Per-connector quick reference (paste-able config snippets)

UPDATE README.md::Deployment Targets — every connector row now
notes the atomic + verify + rollback semantics that landed in
deploy-hardening I. Added a closing paragraph linking to the new
docs/deployment-atomicity.md.

UPDATE docs/features.md — two new env-var rows:
- CERTCTL_DEPLOY_BACKUP_RETENTION (default 3, -1 disables)
- CERTCTL_K8S_DEPLOY_KUBELET_SYNC_TIMEOUT (default 60s)

The G-3 docs-drift CI guard is satisfied: every new
CERTCTL_DEPLOY_* env var documented here also appears in source
(internal/deploy/types.go for BACKUP_RETENTION, k8ssecret config
for KUBELET_SYNC_TIMEOUT).

S-1 stale-counts guard: no literal-number current-state counts in
the new doc — the per-connector tests are referenced via the
file:line pattern (internal/connector/target/<name>/<name>_atomic_test.go)
so the operator can grep for the actual count.

Phase 13 next: pre-commit verification (full matrix + CI guard
reproductions).
2026-04-30 15:30:45 +00:00