Files
certctl/docs/reference/deployment-model.md
T
shankar0123 3a807ae37e docs: Phase 2 mechanical file moves to subdirectory structure
Pure git mv operations; no content edits. Internal links remain pointing
at old paths and will be fixed in Phase 11. Per the Phase 1 audit
recommendations at cowork/docs-overhaul-phase-1-audit-2026-05-04/.

35 files moved across 8 audience-organized subdirectories:

  docs/getting-started/ (5):
    quickstart.md, concepts.md, examples.md, advanced-demo.md (was
    demo-advanced.md), why-certctl.md

  docs/reference/ (6):
    architecture.md, api.md (was openapi.md), mcp.md,
    intermediate-ca-hierarchy.md, deployment-model.md (was
    deployment-atomicity.md), vendor-matrix.md (was
    deployment-vendor-matrix.md)

  docs/reference/protocols/ (6):
    acme-server.md, acme-server-threat-model.md, scep-intune.md,
    est.md, crl-ocsp.md, async-ca-polling.md (was async-polling.md)

  docs/operator/ (4):
    security.md, tls.md, database-tls.md, approval-workflow.md

  docs/operator/runbooks/ (3):
    cloud-targets.md (was runbook-cloud-targets.md), expiry-alerts.md
    (was runbook-expiry-alerts.md), disaster-recovery.md

  docs/migration/ (3):
    from-certbot.md (was migrate-from-certbot.md), from-acmesh.md
    (was migrate-from-acmesh.md), cert-manager-coexistence.md (was
    certctl-for-cert-manager-users.md)

  docs/compliance/ (4):
    index.md (was compliance.md), soc2.md (was compliance-soc2.md),
    pci-dss.md (was compliance-pci-dss.md), nist-sp-800-57.md (was
    compliance-nist.md)

  docs/contributor/ (4):
    testing-strategy.md, test-environment.md (was test-env.md),
    ci-pipeline.md, qa-test-suite.md (was qa-test-guide.md)

Deferred to later Phase 2 sub-phases:
  - connectors.md split (Phase 4): docs/connectors.md +
    docs/connector-{apache,f5,iis,k8s,nginx}.md still at top level
  - testing-guide.md prune (Phase 5): docs/testing-guide.md still
    at top level
  - features.md disperse (Phase 6): docs/features.md still at top
    level
  - legacy-est-scep.md split (Phase 7): docs/legacy-est-scep.md
    still at top level
  - ACME walkthrough re-homing (Phase 8): three
    docs/acme-*-walkthrough.md still at top level
  - Upgrade docs archive (Phase 3): two docs/upgrade-*.md still
    at top level

Cross-reference updates (Phase 11) will happen after all moves and
content edits land. Internal links to docs/* paths are temporarily
broken until that phase completes.
2026-05-05 02:49:28 +00:00

17 KiB
Raw Blame History

Deployment Atomicity, Post-Deploy Verification, and Rollback

Deploy-hardening I master bundle (v2.X.0). Operator + integrator reference for the atomic-write + post-deploy TLS verify + rollback pipeline that closes the procurement-checklist gap with commercial competitors (Venafi, DigiCert Certificate Manager, Sectigo).

1. Overview

Before deploy-hardening I, certctl's target connectors used duplicated os.WriteFile flows. A failure mid-deploy could leave a target with a renewed cert but no chain (or vice versa); a reload-fail produced a half-deployed state that required manual rollback; a wrong-vhost cert was silent until users reported it.

Deploy-hardening I closes three procurement-checklist gaps in a single shared primitive:

Gap Pre-bundle Post-bundle
Atomic deploy with rollback F5 only (transactional API) 12 of 13 connectors via deploy.Apply (K8s pending Bundle 2 — see Section 1.5)
Post-deploy TLS verification None NGINX/Apache/HAProxy/Traefik/Caddy/Envoy/Postfix all do TLS handshake + SHA-256 fingerprint compare; fail → rollback
Vendor-specific deployment recipes Light docs (Bundle II — cowork/deploy-hardening-ii-prompt.md)

This document describes the operator-visible surface. The Go-level contract lives at internal/deploy/doc.go.

1.5. Audit closure status (2026-05-02 deployment-target audit)

The 2026-05-02 deployment-target coverage audit (cowork/deployment-target-audit-2026-05-02/RESULTS.md) tightened the atomic + rollback contract on the connectors below. All bundles in the table are committed to master as of this section's last edit; commit hashes pin to the canonical landing commit for each piece of work.

Connector Bundle Commit Closes
envoy Bundle 3 d8cd981 atomic SDS JSON write + post-deploy watcher pickup poll
traefik Bundle 4 37634e6 single deploy.Apply Plan + all-files atomicity + rollback
iis Bundle 5 223f279 pre-deploy Get-WebBinding snapshot + on-failure binding rollback
ssh Bundle 6 eb39059 pre-deploy SFTP snapshot + reload-failure rollback
wincertstore Bundle 7 1dd1dd4 Get-ChildItem snapshot + on-import-failure rollback
javakeystore Bundle 8 87e0009 keytool -exportkeystore snapshot + on-import-failure rollback + operator playbook for argv password
caddy Bundle 9 8cda860 duration metric fix + file-mode PEM validate + api-mode SHA-256 idempotency
postfix/dovecot Bundle 11 88e8881 applyDefaults + verify-fails-rollback test pin under Mode=dovecot

Outstanding from the same audit:

  • Bundle 2 (k8ssecret). The production realK8sClient is still a stub (see Section 3 / row k8ssecret below). Replacing it with a real k8s.io/client-go implementation + ResourceVersion plumbing
    • post-deploy SHA-256 verify + kubelet sync poll is the remaining V2 P0 blocker. Tracking prompt: cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md.

Bundle 10 (per-connector loadtest harness, commit 6286cd4) does not modify the per-connector contract table; it's a CI / observability addition documented separately at deploy/test/loadtest/README.md.

The original Bundle 1 audit spec read "soften the IIS / SSH / WinCertStore / JavaKeystore rollback claims first while bundles 58 catch the implementation up". Execution order inverted that loop — Bundles 311 shipped before the doc-realignment commit, so the rows in Section 3 below are honest as-shipped without ever needing a softening pass. The K8s row is the one exception, and Section 3's notes call it out explicitly.

2. The atomic-write primitive — Plan / Apply

internal/deploy.Apply(ctx, plan) is the load-bearing entry point. Connectors build a Plan describing one or more files + their PreCommit (validate) and PostCommit (reload) hooks; Apply executes them all-or-nothing.

plan := deploy.Plan{
    Files: []deploy.File{
        {Path: "/etc/nginx/certs/cert.pem", Bytes: certPEM, Mode: 0644},
        {Path: "/etc/nginx/certs/chain.pem", Bytes: chainPEM, Mode: 0644},
        {Path: "/etc/nginx/certs/key.pem",   Bytes: keyPEM,   Mode: 0640},
    },
    PreCommit: func(ctx context.Context, tempPaths map[string]string) error {
        // Run `nginx -t` against the staged config — bytes already
        // written to <path>.certctl-tmp.<unix-nanos>.
        return runValidate(ctx, "nginx -t")
    },
    PostCommit: func(ctx context.Context) error {
        return runReload(ctx, "nginx -s reload")
    },
}
res, err := deploy.Apply(ctx, plan)

Apply's algorithm:

  1. Per-file mutex acquired (sync.Map; coarse-grained per-path serialization).
  2. SHA-256 idempotency short-circuit. If every File's destination already matches, return Result.SkippedAsIdempotent=true without firing PreCommit/PostCommit.
  3. Pre-deploy backup: copy each existing destination to <path>.certctl-bak.<unix-nanos>.
  4. Write each File's bytes to <path>.certctl-tmp.<unix-nanos> in the destination directory (same-filesystem rename).
  5. Apply ownership (chown + chmod) to each temp file BEFORE rename so the swap is atomic with the right perms.
  6. Call PreCommit(ctx, tempPaths). On error: clean up temps; return ErrValidateFailed.
  7. os.Rename each temp → final. POSIX guarantees atomic.
  8. Call PostCommit(ctx). On error: restore each backup; re-call PostCommit. If second PostCommit also fails: return ErrRollbackFailed (operator-actionable).
  9. Janitor: prune backups beyond Plan.BackupRetention (default 3, -1 to disable).

3. Per-connector atomic contract

Connector PreCommit (validate) PostCommit (reload) Post-deploy verify Quirks
nginx nginx -t nginx -s reload TLS handshake to host:443 Default key mode 0640 (worker reads via group)
apache apachectl configtest apachectl graceful TLS handshake Default key mode 0600; per-distro user (apache2/apache/httpd)
haproxy haproxy -c -f <cfg> systemctl reload haproxy TLS handshake Combined PEM (cert+chain+key in one file); default mode 0600
traefik (none — file watcher) (none — file watcher auto-reloads) TLS handshake atomic-write only; ValidateOnly returns sentinel
caddy (file mode) (none) (none — file watcher) TLS handshake atomic-write replaces os.WriteFile
caddy (api mode) Probe admin /config/ POST /load (already atomic at admin server) (admin server confirms) ValidateOnly real impl probes admin API
envoy (none — SDS file watcher) (none — SDS file watcher) TLS handshake atomic-write replaces os.WriteFile
postfix postfix check postfix reload TLS handshake to port 25 Chain appended to cert if no ChainPath
dovecot doveconf -n doveadm reload TLS handshake to port 993 Same code path as postfix
f5 (Authenticate probe) (Transactional commit) TLS handshake to VS Already transactional; rollback automatic via failed commit
iis (Get-WebSite probe) (PowerShell cert install) TLS handshake Already explicit pre-deploy backup + post-rollback re-import
ssh (Connect probe) (SCP upload + remote chmod) tls.Dial to remote TLS port Pre-deploy SCP backup of remote files
wincertstore (Get-ChildItem Cert:) (Import-PfxCertificate) (admin probe) Get-ChildItem snapshot for rollback
javakeystore (keytool -list) (keytool -importkeystore) (admin probe) keytool snapshot; rollback via keytool -delete + re-import
k8ssecret (V2 blocker — see note below) (V2 blocker — see note below) (V2 blocker — see note below) V2 blocker — Bundle 2 of the 2026-05-02 deployment-target audit. Production realK8sClient at internal/connector/target/k8ssecret/k8ssecret.go:397-420 is a stub (every method returns "real Kubernetes client not implemented — use NewWithClient for tests"). The SHA-256 post-deploy verify and kubelet sync poll are designed but not yet implemented; production deploys to a real cluster fail with "not implemented" until Bundle 2 lands. Test mocks via NewWithClient work today. Tracking prompt: cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md.

Postfix vs Dovecot mode: see "Choosing Mode=postfix vs Mode=dovecot" in docs/connectors.md for the per-mode defaults (cert/key paths, validate + reload commands), the dual-deploy guidance for mail servers running both daemons, and the test-pin reference (Bundle 11 commit 88e8881).

4. Post-deploy TLS verification

Frozen decision 0.3 (deploy-hardening I): post-deploy verify is ON by default when the operator configures PostDeployVerify.Endpoint. Per-target opt-out via PostDeployVerify.Enabled = false.

The connector-side flow:

// After Apply returns successfully, the connector dials the
// configured endpoint, pulls the leaf cert SHA-256, and compares.
res := tlsprobe.ProbeTLS(ctx, "nginx-test:443", 10*time.Second)
if res.Fingerprint != certPEMToFingerprint(deployedCertPEM) {
    // Mismatch — wrong vhost, NGINX serving cached cert,
    // load-balanced target hit a different pod, etc.
    rollbackToBackups(ctx, applyResult.BackupPaths)
    emitAlert("post-deploy verify SHA-256 mismatch")
}

Retry with exponential backoff (default 3 attempts; 1s initial, 16s cap) defends against load-balanced targets where the verify might hit a different pod that hasn't picked up the new cert yet. Backoff grows 1s → 2s → 4s → 8s → 16s, giving the LB fleet time to converge before giving up. Operators preserving V2 linear semantics (every attempt waits the same interval) set post_deploy_verify_max_backoff equal to post_deploy_verify_backoff.

post_deploy_verify:
  enabled: true
  endpoint: "nginx.svc.cluster.local:443"
  timeout: 10s
post_deploy_verify_attempts: 3
post_deploy_verify_backoff: 1s
post_deploy_verify_max_backoff: 16s

5. Rollback semantics

Rollback fires automatically on three triggers:

  1. PostCommit (reload) fails → Apply restores backups + retries reload. Returns ErrReloadFailed on success (degraded no-op) or ErrRollbackFailed if the second reload also fails.
  2. Post-deploy verify fails → Connector manually triggers rollback (Apply already returned successfully). Backups are restored + reload is invoked again. Same escalation path on second failure.
  3. Mid-loop rename fails (rare; only with cross-filesystem misuse) → Apply rolls back the renames that already succeeded.

ErrRollbackFailed is operator-actionable. The destination is in a known-bad state; operators must either:

  • Restore from Result.BackupPaths manually + run <reload command>
  • Push a fresh known-good cert via the next deploy cycle

The certctl_deploy_rollback_total{outcome="also_failed"} metric is the alert target.

6. ValidateOnly — dry-run mode

target.Connector.ValidateOnly(ctx, request) runs the validate step without touching the live cert. Connectors that can't dry-run (Traefik / Envoy / Caddy file mode) return target.ErrValidateOnlyNotSupported.

Connector ValidateOnly
nginx nginx -t
apache apachectl configtest
haproxy haproxy -c -f <cfg>
postfix/dovecot postfix check / doveconf -n
caddy (api) GET /config/ probe
caddy (file) / traefik / envoy ErrValidateOnlyNotSupported
f5 client.Authenticate() probe
iis Get-WebSite -Name <SiteName>
ssh client.Connect() probe
wincertstore Get-ChildItem Cert:\<loc>\<store>
javakeystore keytool -list -keystore <path>
k8ssecret client.GetSecret() RBAC probe

Operators preview a deploy via the agent's --dry-run flag (or the equivalent CLI invocation).

7. File ownership + mode preservation

The single most common silent-failure mode pre-bundle: agent runs as root, calls os.WriteFile(path, bytes, 0600), locks NGINX out of the existing nginx:nginx 0640 key file.

Per frozen decision 0.7, deploy.Apply resolves ownership via this precedence:

  1. Explicit File.Mode / File.Owner / File.Group (per-target config) → use as given.
  2. Existing destination file → preserve its chown + chmod.
  3. Plan.Defaults.Mode / .Owner / .Group → use as fallback for new files.
  4. Nothing set → os.WriteFile default (0644) for new files; preserved for existing.

Per-connector defaults (cross-distro, fall back to no-chown if no candidate user exists):

Connector Default user Default group Default cert mode Default key mode
nginx nginx → www-data nginx → www-data 0644 0640
apache apache → www-data → httpd same 0644 0600
haproxy haproxy haproxy n/a (combined PEM) 0600
postfix postfix → dovecot → _postfix same 0644 0600
traefik (none) (none) 0644 0600
envoy (none) (none) 0644 0600
caddy (none) (none) 0644 0600

8. Per-target deploy mutex

Phase 2 of the master bundle: the agent (cmd/agent/main.go) serializes concurrent deploys to the same target ID via a sync.Map[targetID]*sync.Mutex. Granularity per frozen decision 0.5: one mutex per target, NOT per (target, cert).

Cert deploy throughput is operator-grade tens-per-minute. Coarse serialization is fine and simplifies reasoning about reload-side race windows.

9. Idempotency via SHA-256

Every deploy.Apply short-circuits when all File destinations already match SHA-256 of the new bytes. PreCommit + PostCommit do not fire; backups are not created; the result reports SkippedAsIdempotent = true.

Defends against agent-restart retry storms that would otherwise hammer targets with no-op reloads. Operator-visible signal: certctl_deploy_idempotent_skip_total{target_type="..."}.

10. Troubleshooting matrix

Symptom Root cause Operator action
ErrValidateFailed: nginx -t failed Validate command rejected the staged config Read PreCommit's wrapped error for the nginx stderr; fix config
ErrReloadFailed: nginx -s reload failed; rolled back Reload command failed; rollback succeeded; serving the OLD cert Investigate why reload failed; re-deploy when fixed
ErrRollbackFailed Reload AND rollback both failed; in known-bad state Restore from Result.BackupPaths manually; run reload command directly; check disk space + ownership
post-deploy TLS verify SHA-256 mismatch New cert deployed but a different cert is being served (cached, wrong vhost, stale pod in load balancer) Check NGINX SSL session cache TTL; verify SNI; bump verify retries via PostDeployVerifyAttempts
chown ... permission denied (in agent log) Non-root agent OR target user doesn't exist on host Verify agent runs as root in production; check distro user (Debian: www-data, RHEL: nginx)
Backups accumulating in cert dir BackupRetention misconfigured Set BackupRetention: 3 (default) or higher on per-target config
File world-readable after deploy Default mode 0644 applied to new key file Set explicit KeyFileMode: 0640 (NGINX) or KeyFileMode: 0600 (Apache)

11. V3-Pro deferrals

Out of scope for the V2-free deploy-hardening I bundle:

  • Multi-region deployment coordination — orchestration of N data-center deploys with operator approval gates per stage.
  • Cert-pinning verification against mobile-app pin manifests.
  • SOC 2 evidence-report generator — auto-export of the deploy audit trail in the format SOC 2 auditors expect.
  • Customer-paid validation matrices — vendor-version certified quirks (e.g. "tested on F5 v15.1 + v17.0 + v17.5"). See cowork/deploy-hardening-ii-prompt.md for the per-vendor edge-case audit + integration test sidecars.

12. Per-connector quick reference

Paste-able config snippets for the most-used connectors. Full field reference at docs/connectors.md.

NGINX

target_type: nginx
target_config:
  cert_path: /etc/nginx/certs/cert.pem
  chain_path: /etc/nginx/certs/chain.pem
  key_path: /etc/nginx/certs/key.pem
  reload_command: "nginx -s reload"
  validate_command: "nginx -t"
  cert_file_mode: 0644
  key_file_mode: 0640
  post_deploy_verify:
    enabled: true
    endpoint: "nginx.example.com:443"
    timeout: 10s
  backup_retention: 3

HAProxy

target_type: haproxy
target_config:
  pem_path: /etc/haproxy/certs/cert.pem
  reload_command: "systemctl reload haproxy"
  validate_command: "haproxy -c -f /etc/haproxy/haproxy.cfg"
  pem_file_mode: 0600
  post_deploy_verify:
    enabled: true
    endpoint: "haproxy.example.com:443"

Traefik (file watcher; no reload command)

target_type: traefik
target_config:
  cert_dir: /etc/traefik/certs
  cert_file: cert.pem
  key_file: key.pem
  post_deploy_verify:
    enabled: true
    endpoint: "traefik.example.com:443"

See per-connector tests at internal/connector/target/<name>/<name>_atomic_test.go for the full failure-mode matrix each connector handles.