mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 13:41:30 +00:00
febf50090b
Closes Bundle 3 of the 2026-05-02 deployment-target coverage audit (see cowork/deployment-target-audit-2026-05-02/RESULTS.md). The audit ranked this fix #3 by acquirer impact behind the K8s real client (#1) and the docs realignment (#2 / Bundle 1). Two production-grade gaps closed: 1. SDS JSON config write was non-atomic. Cert/key/chain at envoy.go L155/L168/L183 went through deploy.AtomicWriteFile (atomic + backups + ownership preservation), but the SDS JSON at L260 went through os.WriteFile directly. A power loss / OOM / process-kill mid-write of the SDS JSON produces a torn file Envoy cannot parse, and Envoy's file-based SDS watcher refuses to load any cert (not just the rotating one) until the JSON is repaired by hand. Replaced with deploy.AtomicWriteFile and threaded ctx through writeSDSConfig. 2. No watcher pickup confirmation before returning success. Pre-fix, DeployCertificate returned the moment file writes completed. Envoy's SDS watcher is asynchronous; a caller running post-deploy TLS verify immediately after DeployCertificate could see Envoy still serving the old cert (watcher latency, load-balanced replica hit one that hadn't reloaded yet). Added the canonical post-deploy verify pattern (mirrors nginx.go::runPostDeployVerify L416): probe seam + retry/backoff + SHA-256 fingerprint compare against request.CertPEM. On verify failure, restore from per-file backups via the new restoreFromBackups helper. Envoy has no PostCommit reload to re-run; the watcher auto-reloads on the restored files. Config additions to envoy.Config (mirror nginx.Config L84-93): - PostDeployVerify *PostDeployVerifyConfig (Enabled, Endpoint, Timeout) - PostDeployVerifyAttempts int (default 3 in runPostDeployVerify) - PostDeployVerifyBackoff time.Duration (default 2s) - BackupRetention int (mirrors nginx; passed to AtomicWriteFile per file) Default behaviour unchanged for callers that don't set PostDeployVerify — verify is opt-in. nil or Enabled=false skips it entirely. Probe seam: c.probe = tlsprobe.ProbeTLS at construction; tests inject via the new SetTestProbe method. Same shape NGINX uses (nginx.go:130); also mirrors the existing Traefik SetTestProbe at traefik.go:62. WriteResult retention: every AtomicWriteFile call now retains its *deploy.WriteResult in a local []*deploy.WriteResult slice so the rollback path can restore from BackupPath across all four files (cert, key, chain, SDS JSON), not just the cert. Pre-fix the cert's WriteResult was discarded. restoreFromBackups (envoy.go new): iterates the WriteResults from a successful per-file pass, rewrites each non-idempotent destination from its BackupPath via AtomicWriteFile{SkipIdempotent:true, BackupRetention:-1}. The -1 prevents backup-of-the-backup pollution. For files that didn't exist pre-deploy (BackupPath == ""), restore = remove. Mirrors nginx.go::rollbackToBackups (L487-515) with the reload step elided. Idempotency gate: shouldRunVerify returns true unless EVERY WriteResult was Idempotent — same all-files semantics NGINX gets from res.SkippedAsIdempotent. Pre-fix Envoy had no verify at all, so there was no gate to get wrong; this introduces the correct all-files shape from the start. Tests added to envoy_atomic_test.go: - TestEnvoy_Atomic_SDSConfigWriteIsAtomic — pre-writes a sentinel SDS JSON, runs DeployCertificate, asserts a backup file with deploy.BackupSuffix appears alongside the new sds.json (proves AtomicWriteFile is now in the SDS path). - TestEnvoy_Atomic_WatcherPickupRetries — stub probe returns wrong fingerprint on attempts 1+2 and correct on attempt 3; deploy succeeds; probe called exactly 3 times. - TestEnvoy_Atomic_WatcherPickupAllAttemptsFail_RollsBack — pre-writes SENTINEL bytes for cert+key, stub probe always wrong; deploy returns wrapped error AND the destination files contain the sentinel bytes (rollback restored). - TestEnvoy_Atomic_PostDeployVerifyDisabledByDefault — Config with nil PostDeployVerify; asserts probe is never called (opt-in default preserved). A small certPEMFingerprint helper added to the test file mirrors the production envoy.certPEMToFingerprint (which is package-private — external tests can't call it). docs/deployment-atomicity.md L87 row already documents "TLS handshake | atomic-write replaces os.WriteFile" — pre-fix the claim was aspirational (verify happened in the agent verify-and-report path, not the connector; SDS JSON wasn't atomic). Post-fix the claim is honest. No doc change required. Verified locally: - gofmt -l ./internal/connector/target/envoy/ clean - go vet ./internal/connector/target/envoy/... clean - staticcheck ./internal/connector/target/envoy/... clean - go build ./... clean - go test -race -count=1 ./internal/connector/target/envoy/... green (5 pre-existing tests + 4 new = 9 total) - go test -short -count=1 ./internal/connector/target/... green Audit reference: cowork/deployment-target-audit-2026-05-02/RESULTS.md Bundle 3.