Phase 12 of the deploy-hardening I master bundle.

NEW docs/deployment-atomicity.md (12 sections, ~280 lines):
1. Overview: the three procurement-checklist gaps closed
2. The atomic-write primitive (Plan / File / Apply algorithm)
3. Per-connector atomic contract table (all 13 connectors)
4. Post-deploy TLS verification (handshake + SHA-256 + retries)
5. Rollback semantics (3 triggers + escalation path)
6. ValidateOnly dry-run mode (per-connector matrix)
7. File ownership + mode preservation (precedence + per-distro defaults)
8. Per-target deploy mutex (Phase 2)
9. Idempotency via SHA-256 (defends against retry storms)
10. Troubleshooting matrix (one row per failure mode)
11. V3-Pro deferrals (multi-region, pin manifests, SOC 2 export)
12. Per-connector quick reference (paste-able config snippets)

UPDATE README.md::Deployment Targets: every connector row now notes the atomic + verify + rollback semantics that landed in deploy-hardening I. Added a closing paragraph linking to the new docs/deployment-atomicity.md.

UPDATE docs/features.md: two new env-var rows:
- CERTCTL_DEPLOY_BACKUP_RETENTION (default 3, -1 disables)
- CERTCTL_K8S_DEPLOY_KUBELET_SYNC_TIMEOUT (default 60s)

The G-3 docs-drift CI guard is satisfied: every new CERTCTL_DEPLOY_* env var documented here also appears in source (internal/deploy/types.go for BACKUP_RETENTION, k8ssecret config for KUBELET_SYNC_TIMEOUT).

S-1 stale-counts guard: no literal-number current-state counts in the new doc; the per-connector tests are referenced via the file:line pattern (internal/connector/target/<name>/<name>_atomic_test.go) so the operator can grep for the actual count.

Phase 13 next: pre-commit verification (full matrix + CI guard reproductions).
Deployment Atomicity, Post-Deploy Verification, and Rollback
Deploy-hardening I master bundle (v2.X.0). Operator + integrator reference for the atomic-write + post-deploy TLS verify + rollback pipeline that closes the procurement-checklist gap with commercial competitors (Venafi, DigiCert Certificate Manager, Sectigo).
1. Overview
Before deploy-hardening I, certctl's target connectors used
duplicated os.WriteFile flows. A failure mid-deploy could leave
a target with a renewed cert but no chain (or vice versa); a
reload-fail produced a half-deployed state that required manual
rollback; a wrong-vhost cert was silent until users reported it.
Deploy-hardening I closes three procurement-checklist gaps in a single shared primitive:
| Gap | Pre-bundle | Post-bundle |
|---|---|---|
| Atomic deploy with rollback | F5 only (transactional API) | All 13 connectors via deploy.Apply |
| Post-deploy TLS verification | None | NGINX/Apache/HAProxy/Traefik/Caddy/Envoy/Postfix all do TLS handshake + SHA-256 fingerprint compare; fail → rollback |
| Vendor-specific deployment recipes | Light docs | (Bundle II — cowork/deploy-hardening-ii-prompt.md) |
This document describes the operator-visible surface. The Go-level
contract lives at internal/deploy/doc.go.
2. The atomic-write primitive — Plan / Apply
internal/deploy.Apply(ctx, plan) is the load-bearing entry
point. Connectors build a Plan describing one or more files +
their PreCommit (validate) and PostCommit (reload) hooks; Apply
executes them all-or-nothing.
plan := deploy.Plan{
    Files: []deploy.File{
        {Path: "/etc/nginx/certs/cert.pem", Bytes: certPEM, Mode: 0644},
        {Path: "/etc/nginx/certs/chain.pem", Bytes: chainPEM, Mode: 0644},
        {Path: "/etc/nginx/certs/key.pem", Bytes: keyPEM, Mode: 0640},
    },
    PreCommit: func(ctx context.Context, tempPaths map[string]string) error {
        // Run `nginx -t` against the staged config; the bytes are already
        // written to <path>.certctl-tmp.<unix-nanos>.
        return runValidate(ctx, "nginx -t")
    },
    PostCommit: func(ctx context.Context) error {
        return runReload(ctx, "nginx -s reload")
    },
}
res, err := deploy.Apply(ctx, plan)
Apply's algorithm:
- Per-file mutex acquired (sync.Map; coarse-grained per-path serialization).
- SHA-256 idempotency short-circuit. If every File's destination already matches, return Result.SkippedAsIdempotent=true without firing PreCommit/PostCommit.
- Pre-deploy backup: copy each existing destination to <path>.certctl-bak.<unix-nanos>.
- Write each File's bytes to <path>.certctl-tmp.<unix-nanos> in the destination directory (same-filesystem rename).
- Apply ownership (chown + chmod) to each temp file BEFORE rename so the swap is atomic with the right perms.
- Call PreCommit(ctx, tempPaths). On error: clean up temps; return ErrValidateFailed.
- os.Rename each temp → final. POSIX guarantees the rename is atomic.
- Call PostCommit(ctx). On error: restore each backup; re-call PostCommit. If second PostCommit also fails: return ErrRollbackFailed (operator-actionable).
- Janitor: prune backups beyond Plan.BackupRetention (default 3, -1 to disable).
3. Per-connector atomic contract
| Connector | PreCommit (validate) | PostCommit (reload) | Post-deploy verify | Quirks |
|---|---|---|---|---|
| nginx | nginx -t | nginx -s reload | TLS handshake to host:443 | Default key mode 0640 (worker reads via group) |
| apache | apachectl configtest | apachectl graceful | TLS handshake | Default key mode 0600; per-distro user (apache2/apache/httpd) |
| haproxy | haproxy -c -f <cfg> | systemctl reload haproxy | TLS handshake | Combined PEM (cert+chain+key in one file); default mode 0600 |
| traefik | (none — file watcher) | (none — file watcher auto-reloads) | TLS handshake | atomic-write only; ValidateOnly returns sentinel |
| caddy (file mode) | (none) | (none — file watcher) | TLS handshake | atomic-write replaces os.WriteFile |
| caddy (api mode) | Probe admin /config/ | POST /load (already atomic at admin server) | (admin server confirms) | ValidateOnly real impl probes admin API |
| envoy | (none — SDS file watcher) | (none — SDS file watcher) | TLS handshake | atomic-write replaces os.WriteFile |
| postfix | postfix check | postfix reload | TLS handshake to port 25 | Chain appended to cert if no ChainPath |
| dovecot | doveconf -n | doveadm reload | TLS handshake to port 993 | Same code path as postfix |
| f5 | (Authenticate probe) | (Transactional commit) | TLS handshake to VS | Already transactional; rollback automatic via failed commit |
| iis | (Get-WebSite probe) | (PowerShell cert install) | TLS handshake | Already explicit pre-deploy backup + post-rollback re-import |
| ssh | (Connect probe) | (SCP upload + remote chmod) | tls.Dial to remote TLS port | Pre-deploy SCP backup of remote files |
| wincertstore | (Get-ChildItem Cert:) | (Import-PfxCertificate) | (admin probe) | Get-ChildItem snapshot for rollback |
| javakeystore | (keytool -list) | (keytool -importkeystore) | (admin probe) | keytool snapshot; rollback via keytool -delete + re-import |
| k8ssecret | (GetSecret RBAC probe) | (Update Secret) | SHA-256 verify of returned Secret | Atomic at API server; kubelet sync polled via Pod.Status.ContainerStatuses |
4. Post-deploy TLS verification
Frozen decision 0.3 (deploy-hardening I): post-deploy verify is
ON by default when the operator configures
PostDeployVerify.Endpoint. Per-target opt-out via
PostDeployVerify.Enabled = false.
The connector-side flow:
// After Apply returns successfully, the connector dials the
// configured endpoint, pulls the leaf cert SHA-256, and compares.
res := tlsprobe.ProbeTLS(ctx, "nginx-test:443", 10*time.Second)
if res.Fingerprint != certPEMToFingerprint(deployedCertPEM) {
    // Mismatch: wrong vhost, NGINX serving cached cert,
    // load-balanced target hit a different pod, etc.
    rollbackToBackups(ctx, applyResult.BackupPaths)
    emitAlert("post-deploy verify SHA-256 mismatch")
}
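The fingerprint comparison in the flow above can be illustrated with the standard library alone. The sketch below is an assumption about what a helper like certPEMToFingerprint might do (SHA-256 over the DER bytes of the leaf certificate, hex-encoded); the function names fingerprintDER, fingerprintFromPEM, and selfSignedPEM are hypothetical, and the certificate is a throwaway generated only for the demo.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/hex"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// fingerprintDER returns the lowercase hex SHA-256 of a DER-encoded certificate.
func fingerprintDER(der []byte) string {
	sum := sha256.Sum256(der)
	return hex.EncodeToString(sum[:])
}

// fingerprintFromPEM fingerprints the first CERTIFICATE block in pemBytes,
// which is the leaf when the PEM is ordered leaf-first.
func fingerprintFromPEM(pemBytes []byte) (string, error) {
	for {
		block, rest := pem.Decode(pemBytes)
		if block == nil {
			return "", errors.New("no CERTIFICATE block found")
		}
		if block.Type == "CERTIFICATE" {
			return fingerprintDER(block.Bytes), nil
		}
		pemBytes = rest
	}
}

// selfSignedPEM builds a throwaway certificate for the demo and returns
// both its PEM and DER encodings.
func selfSignedPEM() (certPEM, der []byte, err error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "demo.invalid"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err = x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), der, nil
}

func main() {
	certPEM, der, err := selfSignedPEM()
	if err != nil {
		panic(err)
	}
	deployed, err := fingerprintFromPEM(certPEM)
	if err != nil {
		panic(err)
	}
	// In a real probe the served bytes would come from the TLS handshake,
	// e.g. ConnectionState().PeerCertificates[0].Raw on a *tls.Conn.
	served := fingerprintDER(der)
	fmt.Println("match:", deployed == served)
}
```

Hashing the DER bytes rather than the PEM text keeps the comparison independent of line wrapping or trailing whitespace in the deployed file.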
Retry with backoff (default 3 attempts, 2s exponential) defends against load-balanced targets where the verify might hit a different pod that hasn't picked up the new cert yet:
post_deploy_verify:
  enabled: true
  endpoint: "nginx.svc.cluster.local:443"
  timeout: 10s
post_deploy_verify_attempts: 3
post_deploy_verify_backoff: 2s
5. Rollback semantics
Rollback fires automatically on three triggers:
- PostCommit (reload) fails → Apply restores backups + retries reload. Returns ErrReloadFailed if the rollback succeeds (degraded no-op: the old cert keeps serving) or ErrRollbackFailed if the second reload also fails.
- Post-deploy verify fails → Connector manually triggers rollback (Apply already returned successfully). Backups are restored + reload is invoked again. Same escalation path on second failure.
- Mid-loop rename fails (rare; only with cross-filesystem misuse) → Apply rolls back the renames that already succeeded.
ErrRollbackFailed is operator-actionable. The destination is in
a known-bad state; operators must either:
- Restore from Result.BackupPaths manually + run <reload command>
- Push a fresh known-good cert via the next deploy cycle
The certctl_deploy_rollback_total{outcome="also_failed"} metric
is the alert target.
6. ValidateOnly — dry-run mode
target.Connector.ValidateOnly(ctx, request) runs the validate
step without touching the live cert. Connectors that can't
dry-run (Traefik / Envoy / Caddy file mode) return
target.ErrValidateOnlyNotSupported.
| Connector | ValidateOnly |
|---|---|
| nginx | nginx -t |
| apache | apachectl configtest |
| haproxy | haproxy -c -f <cfg> |
| postfix/dovecot | postfix check / doveconf -n |
| caddy (api) | GET /config/ probe |
| caddy (file) / traefik / envoy | ErrValidateOnlyNotSupported |
| f5 | client.Authenticate() probe |
| iis | Get-WebSite -Name <SiteName> |
| ssh | client.Connect() probe |
| wincertstore | Get-ChildItem Cert:\<loc>\<store> |
| javakeystore | keytool -list -keystore <path> |
| k8ssecret | client.GetSecret() RBAC probe |
Operators preview a deploy via the agent's --dry-run flag (or
the equivalent CLI invocation).
7. File ownership + mode preservation
The single most common silent-failure mode pre-bundle: agent runs
as root, calls os.WriteFile(path, bytes, 0600), locks NGINX out
of the existing nginx:nginx 0640 key file.
Per frozen decision 0.7, deploy.Apply resolves ownership via
this precedence:
- Explicit File.Mode / File.Owner / File.Group (per-target config) → use as given.
- Existing destination file → preserve its chown + chmod.
- Plan.Defaults.Mode / .Owner / .Group → use as fallback for new files.
- Nothing set → os.WriteFile default (0644) for new files; preserved for existing.
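The precedence above reduces to "first value that is set wins". A minimal sketch for the mode alone follows; owner and group would follow the same pattern. The helper names modePtr, existingMode, and resolveMode are hypothetical, and nil is used here to mean "not set", which is an assumption about how the real Plan represents unset fields.

```go
package main

import (
	"fmt"
	"os"
)

// modePtr builds an optional mode; nil means "not set".
func modePtr(m os.FileMode) *os.FileMode { return &m }

// existingMode returns the permission bits of path, or nil if it cannot
// be stat'ed (for example, because the file does not exist yet).
func existingMode(path string) *os.FileMode {
	info, err := os.Stat(path)
	if err != nil {
		return nil
	}
	return modePtr(info.Mode().Perm())
}

// resolveMode applies the precedence:
// explicit config > existing file > plan default > 0644.
func resolveMode(explicit, existing, planDefault *os.FileMode) os.FileMode {
	for _, candidate := range []*os.FileMode{explicit, existing, planDefault} {
		if candidate != nil {
			return *candidate
		}
	}
	return 0o644
}

func main() {
	// New key file, no explicit config: the connector default applies.
	fmt.Printf("%04o\n", resolveMode(nil, existingMode("/nonexistent/key.pem"), modePtr(0o640)))

	// Existing file already at 0640, default says 0600: the existing mode is preserved.
	fmt.Printf("%04o\n", resolveMode(nil, modePtr(0o640), modePtr(0o600)))

	// Explicit per-target config overrides everything.
	fmt.Printf("%04o\n", resolveMode(modePtr(0o600), modePtr(0o640), modePtr(0o644)))
}
```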
Per-connector defaults (cross-distro, fall back to no-chown if no candidate user exists):
| Connector | Default user | Default group | Default cert mode | Default key mode |
|---|---|---|---|---|
| nginx | nginx → www-data | nginx → www-data | 0644 | 0640 |
| apache | apache → www-data → httpd | same | 0644 | 0600 |
| haproxy | haproxy | haproxy | n/a (combined PEM) | 0600 |
| postfix | postfix → dovecot → _postfix | same | 0644 | 0600 |
| traefik | (none) | (none) | 0644 | 0600 |
| envoy | (none) | (none) | 0644 | 0600 |
| caddy | (none) | (none) | 0644 | 0600 |
8. Per-target deploy mutex
Phase 2 of the master bundle: the agent (cmd/agent/main.go)
serializes concurrent deploys to the same target ID via a
sync.Map[targetID]*sync.Mutex. Granularity per frozen decision
0.5: one mutex per target, NOT per (target, cert).
Cert deploy throughput is on the order of tens per minute, so coarse serialization costs little and simplifies reasoning about reload-side race windows.
9. Idempotency via SHA-256
Every deploy.Apply short-circuits when all File destinations
already match SHA-256 of the new bytes. PreCommit + PostCommit do
not fire; backups are not created; the result reports
SkippedAsIdempotent = true.
Defends against agent-restart retry storms that would otherwise
hammer targets with no-op reloads. Operator-visible signal:
certctl_deploy_idempotent_skip_total{target_type="..."}.
10. Troubleshooting matrix
| Symptom | Root cause | Operator action |
|---|---|---|
| ErrValidateFailed: nginx -t failed | Validate command rejected the staged config | Read PreCommit's wrapped error for the nginx stderr; fix config |
| ErrReloadFailed: nginx -s reload failed; rolled back | Reload command failed; rollback succeeded; serving the OLD cert | Investigate why reload failed; re-deploy when fixed |
| ErrRollbackFailed | Reload AND rollback both failed; in known-bad state | Restore from Result.BackupPaths manually; run reload command directly; check disk space + ownership |
| post-deploy TLS verify SHA-256 mismatch | New cert deployed but a different cert is being served (cached, wrong vhost, stale pod in load balancer) | Check NGINX SSL session cache TTL; verify SNI; bump verify retries via PostDeployVerifyAttempts |
| chown ... permission denied (in agent log) | Non-root agent OR target user doesn't exist on host | Verify agent runs as root in production; check distro user (Debian: www-data, RHEL: nginx) |
| Backups accumulating in cert dir | BackupRetention misconfigured | Set BackupRetention: 3 (default) or higher on per-target config |
| File world-readable after deploy | Default mode 0644 applied to new key file | Set explicit KeyFileMode: 0640 (NGINX) or KeyFileMode: 0600 (Apache) |
11. V3-Pro deferrals
Out of scope for the V2-free deploy-hardening I bundle:
- Multi-region deployment coordination — orchestration of N data-center deploys with operator approval gates per stage.
- Cert-pinning verification against mobile-app pin manifests.
- SOC 2 evidence-report generator — auto-export of the deploy audit trail in the format SOC 2 auditors expect.
- Customer-paid validation matrices: vendor-version certified
quirks (e.g. "tested on F5 v15.1 + v17.0 + v17.5"). See
cowork/deploy-hardening-ii-prompt.md for the per-vendor edge-case audit + integration test sidecars.
12. Per-connector quick reference
Paste-able config snippets for the most-used connectors. Full
field reference at docs/connectors.md.
NGINX
target_type: nginx
target_config:
  cert_path: /etc/nginx/certs/cert.pem
  chain_path: /etc/nginx/certs/chain.pem
  key_path: /etc/nginx/certs/key.pem
  reload_command: "nginx -s reload"
  validate_command: "nginx -t"
  cert_file_mode: 0644
  key_file_mode: 0640
  post_deploy_verify:
    enabled: true
    endpoint: "nginx.example.com:443"
    timeout: 10s
  backup_retention: 3
HAProxy
target_type: haproxy
target_config:
  pem_path: /etc/haproxy/certs/cert.pem
  reload_command: "systemctl reload haproxy"
  validate_command: "haproxy -c -f /etc/haproxy/haproxy.cfg"
  pem_file_mode: 0600
  post_deploy_verify:
    enabled: true
    endpoint: "haproxy.example.com:443"
Traefik (file watcher; no reload command)
target_type: traefik
target_config:
  cert_dir: /etc/traefik/certs
  cert_file: cert.pem
  key_file: key.pem
  post_deploy_verify:
    enabled: true
    endpoint: "traefik.example.com:443"
See per-connector tests at
internal/connector/target/<name>/<name>_atomic_test.go for the
full failure-mode matrix each connector handles.