Files
certctl/docs/deployment-atomicity.md
T
shankar0123 97a1243cc4 docs(deployment-atomicity): K8s row honest + audit-closure rollup
Closes Bundle 1 of the 2026-05-02 deployment-target coverage audit
(see cowork/deployment-target-audit-2026-05-02/RESULTS.md). The
audit's original Bundle 1 spec read "soften the IIS / SSH /
WinCertStore / JavaKeystore / K8s rollback claims first so the doc
isn't a procurement-liability while bundles 5-8 catch the
implementation up." Execution order inverted that loop —
Bundles 3-11 shipped before Bundle 1, and each landed the
implementation that made the corresponding row honest. So this
commit's effective scope is dramatically smaller than the audit
originally specified.

Three changes, all in docs/deployment-atomicity.md:

1. L95 k8ssecret row softened. Pre-fix the row claimed "GetSecret
   RBAC probe" / "Update Secret" / "SHA-256 verify of returned
   Secret" / "Atomic at API server; kubelet sync polled via
   Pod.Status.ContainerStatuses" — as if all four columns described
   live behavior. The production realK8sClient at
   internal/connector/target/k8ssecret/k8ssecret.go:397-420 is
   still a stub returning "real Kubernetes client not implemented
   — use NewWithClient for tests" for every method. Post-fix the
   row says so explicitly, points at the stub source, notes that
   test mocks via NewWithClient work today, and forward-references
   the Bundle 2 tracking prompt at
   cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md.

2. New Section 1.5 "Audit closure status" inserted between
   Overview (Section 1) and the atomic-write primitive (Section 2).
   Pins which deployment-target-audit bundles shipped with their
   commit hashes:

     envoy           Bundle 3   d8cd981
     traefik         Bundle 4   37634e6
     iis             Bundle 5   223f279
     ssh             Bundle 6   eb39059
     wincertstore    Bundle 7   1dd1dd4
     javakeystore    Bundle 8   87e0009
     caddy           Bundle 9   8cda860
     postfix/dovecot Bundle 11  88e8881

   Outstanding: Bundle 2 (K8s real client) — the V2 P0 blocker.
   Bundle 10 (loadtest, commit 6286cd4) is documented separately
   at deploy/test/loadtest/README.md as a CI/observability
   addition that doesn't modify the per-connector contract table.
   Section 1.5's closing paragraph documents the execution-order
   inversion so future readers understand why this commit ended
   up smaller than the audit's original spec implied.

3. Section 1's gap table updated. The "Atomic deploy with rollback"
   row's post-bundle column went from "All 13 connectors via
   deploy.Apply" to "12 of 13 connectors via deploy.Apply (K8s
   pending Bundle 2 — see Section 1.5)" with an anchor link.

Rows L81-94 left untouched: each claim is now honest because
Bundles 3-11 implementations landed. Per-bundle commit messages
have been recording this fact ("Post-Bundle-N the claim is
honest; pre-fix it was aspirational") since Bundle 5; this
commit closes the loop by making the doc reflect the same.

What this commit does NOT do:
- Add K8s to Section 11 "V3-Pro deferrals" — Bundle 2 is a V2
  P0 blocker, not a V3-Pro deferral. Mixing the two would
  defer a real procurement-checklist gap into "future work"
  where it doesn't belong.
- Edit rows L81-94 of the per-connector table — they're honest
  as-is.
- Touch docs/architecture.md / connectors.md / security.md —
  those have their own per-section accuracy requirements; this
  commit is scoped to deployment-atomicity.md.

Verified locally:
- gofmt -l ./internal/ ./cmd/  clean (doc-only commit; no Go diff).
- markdown structure check via `grep -n '^## '`: Section 1.5
  inserted cleanly between 1 and 2; no other headings disturbed.
- All 8 commit hashes in Section 1.5 verified against
  `git log --oneline --reverse v2.0.67..HEAD` at HEAD=88e8881.

Audit reference: cowork/deployment-target-audit-2026-05-02/RESULTS.md
Bundle 1.
2026-05-02 20:06:24 +00:00

16 KiB
Raw Blame History

Deployment Atomicity, Post-Deploy Verification, and Rollback

Deploy-hardening I master bundle (v2.X.0). Operator + integrator reference for the atomic-write + post-deploy TLS verify + rollback pipeline that closes the procurement-checklist gap with commercial competitors (Venafi, DigiCert Certificate Manager, Sectigo).

1. Overview

Before deploy-hardening I, certctl's target connectors used duplicated os.WriteFile flows. A failure mid-deploy could leave a target with a renewed cert but no chain (or vice versa); a reload-fail produced a half-deployed state that required manual rollback; a wrong-vhost cert was silent until users reported it.

Deploy-hardening I closes three procurement-checklist gaps in a single shared primitive:

Gap Pre-bundle Post-bundle
Atomic deploy with rollback F5 only (transactional API) 12 of 13 connectors via deploy.Apply (K8s pending Bundle 2 — see Section 1.5)
Post-deploy TLS verification None NGINX/Apache/HAProxy/Traefik/Caddy/Envoy/Postfix all do TLS handshake + SHA-256 fingerprint compare; fail → rollback
Vendor-specific deployment recipes Light docs (Bundle II — cowork/deploy-hardening-ii-prompt.md)

This document describes the operator-visible surface. The Go-level contract lives at internal/deploy/doc.go.

1.5. Audit closure status (2026-05-02 deployment-target audit)

The 2026-05-02 deployment-target coverage audit (cowork/deployment-target-audit-2026-05-02/RESULTS.md) tightened the atomic + rollback contract on the connectors below. All bundles in the table are committed to master as of this section's last edit; commit hashes pin to the canonical landing commit for each piece of work.

Connector Bundle Commit Closes
envoy Bundle 3 d8cd981 atomic SDS JSON write + post-deploy watcher pickup poll
traefik Bundle 4 37634e6 single deploy.Apply Plan + all-files atomicity + rollback
iis Bundle 5 223f279 pre-deploy Get-WebBinding snapshot + on-failure binding rollback
ssh Bundle 6 eb39059 pre-deploy SFTP snapshot + reload-failure rollback
wincertstore Bundle 7 1dd1dd4 Get-ChildItem snapshot + on-import-failure rollback
javakeystore Bundle 8 87e0009 keytool -exportkeystore snapshot + on-import-failure rollback + operator playbook for argv password
caddy Bundle 9 8cda860 duration metric fix + file-mode PEM validate + api-mode SHA-256 idempotency
postfix/dovecot Bundle 11 88e8881 applyDefaults + verify-fails-rollback test pin under Mode=dovecot

Outstanding from the same audit:

  • Bundle 2 (k8ssecret). The production realK8sClient is still a stub (see Section 3 / row k8ssecret below). Replacing it with a real k8s.io/client-go implementation + ResourceVersion plumbing
    • post-deploy SHA-256 verify + kubelet sync poll is the remaining V2 P0 blocker. Tracking prompt: cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md.

Bundle 10 (per-connector loadtest harness, commit 6286cd4) does not modify the per-connector contract table; it's a CI / observability addition documented separately at deploy/test/loadtest/README.md.

The original Bundle 1 audit spec read "soften the IIS / SSH / WinCertStore / JavaKeystore rollback claims first while bundles 58 catch the implementation up". Execution order inverted that loop — Bundles 311 shipped before the doc-realignment commit, so the rows in Section 3 below are honest as-shipped without ever needing a softening pass. The K8s row is the one exception, and Section 3's notes call it out explicitly.

2. The atomic-write primitive — Plan / Apply

internal/deploy.Apply(ctx, plan) is the load-bearing entry point. Connectors build a Plan describing one or more files + their PreCommit (validate) and PostCommit (reload) hooks; Apply executes them all-or-nothing.

plan := deploy.Plan{
    Files: []deploy.File{
        {Path: "/etc/nginx/certs/cert.pem", Bytes: certPEM, Mode: 0644},
        {Path: "/etc/nginx/certs/chain.pem", Bytes: chainPEM, Mode: 0644},
        {Path: "/etc/nginx/certs/key.pem",   Bytes: keyPEM,   Mode: 0640},
    },
    PreCommit: func(ctx context.Context, tempPaths map[string]string) error {
        // Run `nginx -t` against the staged config — bytes already
        // written to <path>.certctl-tmp.<unix-nanos>.
        return runValidate(ctx, "nginx -t")
    },
    PostCommit: func(ctx context.Context) error {
        return runReload(ctx, "nginx -s reload")
    },
}
res, err := deploy.Apply(ctx, plan)

Apply's algorithm:

  1. Per-file mutex acquired (sync.Map; coarse-grained per-path serialization).
  2. SHA-256 idempotency short-circuit. If every File's destination already matches, return Result.SkippedAsIdempotent=true without firing PreCommit/PostCommit.
  3. Pre-deploy backup: copy each existing destination to <path>.certctl-bak.<unix-nanos>.
  4. Write each File's bytes to <path>.certctl-tmp.<unix-nanos> in the destination directory (same-filesystem rename).
  5. Apply ownership (chown + chmod) to each temp file BEFORE rename so the swap is atomic with the right perms.
  6. Call PreCommit(ctx, tempPaths). On error: clean up temps; return ErrValidateFailed.
  7. os.Rename each temp → final. POSIX guarantees atomic.
  8. Call PostCommit(ctx). On error: restore each backup; re-call PostCommit. If second PostCommit also fails: return ErrRollbackFailed (operator-actionable).
  9. Janitor: prune backups beyond Plan.BackupRetention (default 3, -1 to disable).

3. Per-connector atomic contract

Connector PreCommit (validate) PostCommit (reload) Post-deploy verify Quirks
nginx nginx -t nginx -s reload TLS handshake to host:443 Default key mode 0640 (worker reads via group)
apache apachectl configtest apachectl graceful TLS handshake Default key mode 0600; per-distro user (apache2/apache/httpd)
haproxy haproxy -c -f <cfg> systemctl reload haproxy TLS handshake Combined PEM (cert+chain+key in one file); default mode 0600
traefik (none — file watcher) (none — file watcher auto-reloads) TLS handshake atomic-write only; ValidateOnly returns sentinel
caddy (file mode) (none) (none — file watcher) TLS handshake atomic-write replaces os.WriteFile
caddy (api mode) Probe admin /config/ POST /load (already atomic at admin server) (admin server confirms) ValidateOnly real impl probes admin API
envoy (none — SDS file watcher) (none — SDS file watcher) TLS handshake atomic-write replaces os.WriteFile
postfix postfix check postfix reload TLS handshake to port 25 Chain appended to cert if no ChainPath
dovecot doveconf -n doveadm reload TLS handshake to port 993 Same code path as postfix
f5 (Authenticate probe) (Transactional commit) TLS handshake to VS Already transactional; rollback automatic via failed commit
iis (Get-WebSite probe) (PowerShell cert install) TLS handshake Already explicit pre-deploy backup + post-rollback re-import
ssh (Connect probe) (SCP upload + remote chmod) tls.Dial to remote TLS port Pre-deploy SCP backup of remote files
wincertstore (Get-ChildItem Cert:) (Import-PfxCertificate) (admin probe) Get-ChildItem snapshot for rollback
javakeystore (keytool -list) (keytool -importkeystore) (admin probe) keytool snapshot; rollback via keytool -delete + re-import
k8ssecret (V2 blocker — see note below) (V2 blocker — see note below) (V2 blocker — see note below) V2 blocker — Bundle 2 of the 2026-05-02 deployment-target audit. Production realK8sClient at internal/connector/target/k8ssecret/k8ssecret.go:397-420 is a stub (every method returns "real Kubernetes client not implemented — use NewWithClient for tests"). The SHA-256 post-deploy verify and kubelet sync poll are designed but not yet implemented; production deploys to a real cluster fail with "not implemented" until Bundle 2 lands. Test mocks via NewWithClient work today. Tracking prompt: cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md.

4. Post-deploy TLS verification

Frozen decision 0.3 (deploy-hardening I): post-deploy verify is ON by default when the operator configures PostDeployVerify.Endpoint. Per-target opt-out via PostDeployVerify.Enabled = false.

The connector-side flow:

// After Apply returns successfully, the connector dials the
// configured endpoint, pulls the leaf cert SHA-256, and compares.
res := tlsprobe.ProbeTLS(ctx, "nginx-test:443", 10*time.Second)
if res.Fingerprint != certPEMToFingerprint(deployedCertPEM) {
    // Mismatch — wrong vhost, NGINX serving cached cert,
    // load-balanced target hit a different pod, etc.
    rollbackToBackups(ctx, applyResult.BackupPaths)
    emitAlert("post-deploy verify SHA-256 mismatch")
}

Retry with backoff (default 3 attempts, 2s exponential) defends against load-balanced targets where the verify might hit a different pod that hasn't picked up the new cert yet:

post_deploy_verify:
  enabled: true
  endpoint: "nginx.svc.cluster.local:443"
  timeout: 10s
post_deploy_verify_attempts: 3
post_deploy_verify_backoff: 2s

5. Rollback semantics

Rollback fires automatically on three triggers:

  1. PostCommit (reload) fails → Apply restores backups + retries reload. Returns ErrReloadFailed on success (degraded no-op) or ErrRollbackFailed if the second reload also fails.
  2. Post-deploy verify fails → Connector manually triggers rollback (Apply already returned successfully). Backups are restored + reload is invoked again. Same escalation path on second failure.
  3. Mid-loop rename fails (rare; only with cross-filesystem misuse) → Apply rolls back the renames that already succeeded.

ErrRollbackFailed is operator-actionable. The destination is in a known-bad state; operators must either:

  • Restore from Result.BackupPaths manually + run <reload command>
  • Push a fresh known-good cert via the next deploy cycle

The certctl_deploy_rollback_total{outcome="also_failed"} metric is the alert target.

6. ValidateOnly — dry-run mode

target.Connector.ValidateOnly(ctx, request) runs the validate step without touching the live cert. Connectors that can't dry-run (Traefik / Envoy / Caddy file mode) return target.ErrValidateOnlyNotSupported.

Connector ValidateOnly
nginx nginx -t
apache apachectl configtest
haproxy haproxy -c -f <cfg>
postfix/dovecot postfix check / doveconf -n
caddy (api) GET /config/ probe
caddy (file) / traefik / envoy ErrValidateOnlyNotSupported
f5 client.Authenticate() probe
iis Get-WebSite -Name <SiteName>
ssh client.Connect() probe
wincertstore Get-ChildItem Cert:\<loc>\<store>
javakeystore keytool -list -keystore <path>
k8ssecret client.GetSecret() RBAC probe

Operators preview a deploy via the agent's --dry-run flag (or the equivalent CLI invocation).

7. File ownership + mode preservation

The single most common silent-failure mode pre-bundle: agent runs as root, calls os.WriteFile(path, bytes, 0600), locks NGINX out of the existing nginx:nginx 0640 key file.

Per frozen decision 0.7, deploy.Apply resolves ownership via this precedence:

  1. Explicit File.Mode / File.Owner / File.Group (per-target config) → use as given.
  2. Existing destination file → preserve its chown + chmod.
  3. Plan.Defaults.Mode / .Owner / .Group → use as fallback for new files.
  4. Nothing set → os.WriteFile default (0644) for new files; preserved for existing.

Per-connector defaults (cross-distro, fall back to no-chown if no candidate user exists):

Connector Default user Default group Default cert mode Default key mode
nginx nginx → www-data nginx → www-data 0644 0640
apache apache → www-data → httpd same 0644 0600
haproxy haproxy haproxy n/a (combined PEM) 0600
postfix postfix → dovecot → _postfix same 0644 0600
traefik (none) (none) 0644 0600
envoy (none) (none) 0644 0600
caddy (none) (none) 0644 0600

8. Per-target deploy mutex

Phase 2 of the master bundle: the agent (cmd/agent/main.go) serializes concurrent deploys to the same target ID via a sync.Map[targetID]*sync.Mutex. Granularity per frozen decision 0.5: one mutex per target, NOT per (target, cert).

Cert deploy throughput is operator-grade tens-per-minute. Coarse serialization is fine and simplifies reasoning about reload-side race windows.

9. Idempotency via SHA-256

Every deploy.Apply short-circuits when all File destinations already match SHA-256 of the new bytes. PreCommit + PostCommit do not fire; backups are not created; the result reports SkippedAsIdempotent = true.

Defends against agent-restart retry storms that would otherwise hammer targets with no-op reloads. Operator-visible signal: certctl_deploy_idempotent_skip_total{target_type="..."}.

10. Troubleshooting matrix

Symptom Root cause Operator action
ErrValidateFailed: nginx -t failed Validate command rejected the staged config Read PreCommit's wrapped error for the nginx stderr; fix config
ErrReloadFailed: nginx -s reload failed; rolled back Reload command failed; rollback succeeded; serving the OLD cert Investigate why reload failed; re-deploy when fixed
ErrRollbackFailed Reload AND rollback both failed; in known-bad state Restore from Result.BackupPaths manually; run reload command directly; check disk space + ownership
post-deploy TLS verify SHA-256 mismatch New cert deployed but a different cert is being served (cached, wrong vhost, stale pod in load balancer) Check NGINX SSL session cache TTL; verify SNI; bump verify retries via PostDeployVerifyAttempts
chown ... permission denied (in agent log) Non-root agent OR target user doesn't exist on host Verify agent runs as root in production; check distro user (Debian: www-data, RHEL: nginx)
Backups accumulating in cert dir BackupRetention misconfigured Set BackupRetention: 3 (default) or higher on per-target config
File world-readable after deploy Default mode 0644 applied to new key file Set explicit KeyFileMode: 0640 (NGINX) or KeyFileMode: 0600 (Apache)

11. V3-Pro deferrals

Out of scope for the V2-free deploy-hardening I bundle:

  • Multi-region deployment coordination — orchestration of N data-center deploys with operator approval gates per stage.
  • Cert-pinning verification against mobile-app pin manifests.
  • SOC 2 evidence-report generator — auto-export of the deploy audit trail in the format SOC 2 auditors expect.
  • Customer-paid validation matrices — vendor-version certified quirks (e.g. "tested on F5 v15.1 + v17.0 + v17.5"). See cowork/deploy-hardening-ii-prompt.md for the per-vendor edge-case audit + integration test sidecars.

12. Per-connector quick reference

Paste-able config snippets for the most-used connectors. Full field reference at docs/connectors.md.

NGINX

target_type: nginx
target_config:
  cert_path: /etc/nginx/certs/cert.pem
  chain_path: /etc/nginx/certs/chain.pem
  key_path: /etc/nginx/certs/key.pem
  reload_command: "nginx -s reload"
  validate_command: "nginx -t"
  cert_file_mode: 0644
  key_file_mode: 0640
  post_deploy_verify:
    enabled: true
    endpoint: "nginx.example.com:443"
    timeout: 10s
  backup_retention: 3

HAProxy

target_type: haproxy
target_config:
  pem_path: /etc/haproxy/certs/cert.pem
  reload_command: "systemctl reload haproxy"
  validate_command: "haproxy -c -f /etc/haproxy/haproxy.cfg"
  pem_file_mode: 0600
  post_deploy_verify:
    enabled: true
    endpoint: "haproxy.example.com:443"

Traefik (file watcher; no reload command)

target_type: traefik
target_config:
  cert_dir: /etc/traefik/certs
  cert_file: cert.pem
  key_file: key.pem
  post_deploy_verify:
    enabled: true
    endpoint: "traefik.example.com:443"

See per-connector tests at internal/connector/target/<name>/<name>_atomic_test.go for the full failure-mode matrix each connector handles.