Closes Bundle 1 of the 2026-05-02 deployment-target coverage audit
(see cowork/deployment-target-audit-2026-05-02/RESULTS.md). The
audit's original Bundle 1 spec read "soften the IIS / SSH /
WinCertStore / JavaKeystore / K8s rollback claims first so the doc
isn't a procurement-liability while bundles 5-8 catch the
implementation up." Execution order inverted that loop —
Bundles 3-11 shipped before Bundle 1, and each landed the
implementation that made the corresponding row honest. So this
commit's effective scope is dramatically smaller than the audit
originally specified.
Three changes, all in docs/deployment-atomicity.md:
1. L95 k8ssecret row softened. Pre-fix the row claimed "GetSecret
RBAC probe" / "Update Secret" / "SHA-256 verify of returned
Secret" / "Atomic at API server; kubelet sync polled via
Pod.Status.ContainerStatuses" — as if all four columns described
live behavior. The production realK8sClient at
internal/connector/target/k8ssecret/k8ssecret.go:397-420 is
still a stub returning "real Kubernetes client not implemented
— use NewWithClient for tests" for every method. Post-fix the
row says so explicitly, points at the stub source, notes that
test mocks via NewWithClient work today, and forward-references
the Bundle 2 tracking prompt at
cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md.
2. New Section 1.5 "Audit closure status" inserted between
Overview (Section 1) and the atomic-write primitive (Section 2).
Pins which deployment-target-audit bundles shipped with their
commit hashes:
envoy Bundle 3 febf500
traefik Bundle 4 b767f57
iis Bundle 5 30daadb
ssh Bundle 6 636de7f
wincertstore Bundle 7 60ae92b
javakeystore Bundle 8 eb390b2
caddy Bundle 9 08a86d3
postfix/dovecot Bundle 11 b829365
Outstanding: Bundle 2 (K8s real client) — the V2 P0 blocker.
Bundle 10 (loadtest, commit e292faa) is documented separately
at deploy/test/loadtest/README.md as a CI/observability
addition that doesn't modify the per-connector contract table.
Section 1.5's closing paragraph documents the execution-order
inversion so future readers understand why this commit ended
up smaller than the audit's original spec implied.
3. Section 1's gap table updated. The "Atomic deploy with rollback"
row's post-bundle column went from "All 13 connectors via
deploy.Apply" to "12 of 13 connectors via deploy.Apply (K8s
pending Bundle 2 — see Section 1.5)" with an anchor link.
Rows L81-94 left untouched: each claim is now honest because
Bundles 3-11 implementations landed. Per-bundle commit messages
have been recording this fact ("Post-Bundle-N the claim is
honest; pre-fix it was aspirational") since Bundle 5; this
commit closes the loop by making the doc reflect the same.
What this commit does NOT do:
- Add K8s to Section 11 "V3-Pro deferrals" — Bundle 2 is a V2
P0 blocker, not a V3-Pro deferral. Mixing the two would
defer a real procurement-checklist gap into "future work"
where it doesn't belong.
- Edit rows L81-94 of the per-connector table — they're honest
as-is.
- Touch docs/architecture.md / connectors.md / security.md —
those have their own per-section accuracy requirements; this
commit is scoped to deployment-atomicity.md.
Verified locally:
- gofmt -l ./internal/ ./cmd/ clean (doc-only commit; no Go diff).
- markdown structure check via `grep -n '^## '`: Section 1.5
inserted cleanly between 1 and 2; no other headings disturbed.
- All 8 commit hashes in Section 1.5 verified against
`git log --oneline --reverse v2.0.67..HEAD` at HEAD=b829365.
Audit reference: cowork/deployment-target-audit-2026-05-02/RESULTS.md
Bundle 1.
Deployment Atomicity, Post-Deploy Verification, and Rollback
Deploy-hardening I master bundle (v2.X.0). Operator + integrator reference for the atomic-write + post-deploy TLS verify + rollback pipeline that closes the procurement-checklist gap with commercial competitors (Venafi, DigiCert Certificate Manager, Sectigo).
1. Overview
Before deploy-hardening I, certctl's target connectors used
duplicated os.WriteFile flows. A failure mid-deploy could leave
a target with a renewed cert but no chain (or vice versa); a
reload-fail produced a half-deployed state that required manual
rollback; a wrong-vhost cert was silent until users reported it.
Deploy-hardening I closes three procurement-checklist gaps in a single shared primitive:
| Gap | Pre-bundle | Post-bundle |
|---|---|---|
| Atomic deploy with rollback | F5 only (transactional API) | 12 of 13 connectors via deploy.Apply (K8s pending Bundle 2 — see Section 1.5) |
| Post-deploy TLS verification | None | NGINX/Apache/HAProxy/Traefik/Caddy/Envoy/Postfix all do TLS handshake + SHA-256 fingerprint compare; fail → rollback |
| Vendor-specific deployment recipes | Light docs | (Bundle II — cowork/deploy-hardening-ii-prompt.md) |
This document describes the operator-visible surface. The Go-level
contract lives at internal/deploy/doc.go.
1.5. Audit closure status (2026-05-02 deployment-target audit)
The 2026-05-02 deployment-target coverage audit
(cowork/deployment-target-audit-2026-05-02/RESULTS.md) tightened the
atomic + rollback contract on the connectors below. All bundles in the
table are committed to master as of this section's last edit; commit
hashes pin to the canonical landing commit for each piece of work.
| Connector | Bundle | Commit | Closes |
|---|---|---|---|
| envoy | Bundle 3 | d8cd981 | atomic SDS JSON write + post-deploy watcher pickup poll |
| traefik | Bundle 4 | 37634e6 | single deploy.Apply Plan + all-files atomicity + rollback |
| iis | Bundle 5 | 223f279 | pre-deploy Get-WebBinding snapshot + on-failure binding rollback |
| ssh | Bundle 6 | eb39059 | pre-deploy SFTP snapshot + reload-failure rollback |
| wincertstore | Bundle 7 | 1dd1dd4 | Get-ChildItem snapshot + on-import-failure rollback |
| javakeystore | Bundle 8 | 87e0009 | keytool -exportkeystore snapshot + on-import-failure rollback + operator playbook for argv password |
| caddy | Bundle 9 | 8cda860 | duration metric fix + file-mode PEM validate + api-mode SHA-256 idempotency |
| postfix/dovecot | Bundle 11 | 88e8881 | applyDefaults + verify-fails-rollback test pin under Mode=dovecot |
Outstanding from the same audit:
- Bundle 2 (k8ssecret). The production realK8sClient is still a stub
  (see Section 3 / row k8ssecret below). Replacing it with a real
  k8s.io/client-go implementation + ResourceVersion plumbing
  + post-deploy SHA-256 verify + kubelet sync poll is the remaining
  V2 P0 blocker. Tracking prompt:
  cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md.
Bundle 10 (per-connector loadtest harness, commit 6286cd4) does not
modify the per-connector contract table; it's a CI / observability
addition documented separately at deploy/test/loadtest/README.md.
The original Bundle 1 audit spec read "soften the IIS / SSH / WinCertStore / JavaKeystore rollback claims first while bundles 5–8 catch the implementation up". Execution order inverted that loop — Bundles 3–11 shipped before the doc-realignment commit, so the rows in Section 3 below are honest as-shipped without ever needing a softening pass. The K8s row is the one exception, and Section 3's notes call it out explicitly.
2. The atomic-write primitive — Plan / Apply
internal/deploy.Apply(ctx, plan) is the load-bearing entry
point. Connectors build a Plan describing one or more files +
their PreCommit (validate) and PostCommit (reload) hooks; Apply
executes them all-or-nothing.
plan := deploy.Plan{
	Files: []deploy.File{
		{Path: "/etc/nginx/certs/cert.pem", Bytes: certPEM, Mode: 0644},
		{Path: "/etc/nginx/certs/chain.pem", Bytes: chainPEM, Mode: 0644},
		{Path: "/etc/nginx/certs/key.pem", Bytes: keyPEM, Mode: 0640},
	},
	PreCommit: func(ctx context.Context, tempPaths map[string]string) error {
		// Run `nginx -t` against the staged config; bytes are already
		// written to <path>.certctl-tmp.<unix-nanos>.
		return runValidate(ctx, "nginx -t")
	},
	PostCommit: func(ctx context.Context) error {
		return runReload(ctx, "nginx -s reload")
	},
}
res, err := deploy.Apply(ctx, plan)
Apply's algorithm:
- Per-file mutex acquired (sync.Map; coarse-grained per-path serialization).
- SHA-256 idempotency short-circuit. If every File's destination
  already matches, return Result.SkippedAsIdempotent=true without
  firing PreCommit/PostCommit.
- Pre-deploy backup: copy each existing destination to
  <path>.certctl-bak.<unix-nanos>.
- Write each File's bytes to <path>.certctl-tmp.<unix-nanos> in the
  destination directory (same-filesystem rename).
- Apply ownership (chown + chmod) to each temp file BEFORE rename so
  the swap is atomic with the right perms.
- Call PreCommit(ctx, tempPaths). On error: clean up temps; return
  ErrValidateFailed.
- os.Rename each temp → final. POSIX guarantees the rename is atomic
  when source and destination are on the same filesystem.
- Call PostCommit(ctx). On error: restore each backup; re-call
  PostCommit. If the second PostCommit also fails: return
  ErrRollbackFailed (operator-actionable).
- Janitor: prune backups beyond Plan.BackupRetention (default 3,
  -1 to disable).
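The backup, stage, chmod, and rename steps above can be sketched for a single file as follows. This is a minimal illustration under stated assumptions, not the deploy.Apply implementation: stageAndSwap and demoStageAndSwap are hypothetical names, ownership (chown) and the PreCommit/PostCommit hooks are left out, and only the temp/backup suffix pattern is taken from the list above.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// stageAndSwap backs up any existing file at path, writes data to a temp
// file in the same directory, fixes its mode, and renames it into place.
// It returns the backup path ("" when no previous file existed).
// Hypothetical helper for illustration only.
func stageAndSwap(path string, data []byte, mode os.FileMode) (string, error) {
	stamp := time.Now().UnixNano()

	// Pre-deploy backup of the current destination, if any.
	backup := ""
	old, err := os.ReadFile(path)
	switch {
	case err == nil:
		backup = fmt.Sprintf("%s.certctl-bak.%d", path, stamp)
		if err := os.WriteFile(backup, old, 0o600); err != nil {
			return "", err
		}
	case !os.IsNotExist(err):
		return "", err
	}

	// Stage next to the destination so the rename stays on one filesystem.
	tmp := fmt.Sprintf("%s.certctl-tmp.%d", path, stamp)
	if err := os.WriteFile(tmp, data, mode); err != nil {
		return "", err
	}
	// os.WriteFile is subject to umask, so set the mode explicitly
	// before the swap.
	if err := os.Chmod(tmp, mode); err != nil {
		os.Remove(tmp)
		return "", err
	}
	if err := os.Rename(tmp, path); err != nil {
		os.Remove(tmp)
		return "", err
	}
	return backup, nil
}

// demoStageAndSwap replaces "old" with "new" in a scratch directory and
// reports the live content plus the backed-up content.
func demoStageAndSwap() (current, backedUp string, err error) {
	dir, err := os.MkdirTemp("", "certctl-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	path := dir + "/cert.pem"
	if _, err := stageAndSwap(path, []byte("old"), 0o644); err != nil {
		return "", "", err
	}
	backup, err := stageAndSwap(path, []byte("new"), 0o644)
	if err != nil {
		return "", "", err
	}
	cur, err := os.ReadFile(path)
	if err != nil {
		return "", "", err
	}
	bak, err := os.ReadFile(backup)
	if err != nil {
		return "", "", err
	}
	return string(cur), string(bak), nil
}
```

Because the temp file lives in the destination directory, a reader of the final path sees either the old bytes or the new bytes, never a partial write.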
3. Per-connector atomic contract
| Connector | PreCommit (validate) | PostCommit (reload) | Post-deploy verify | Quirks |
|---|---|---|---|---|
| nginx | nginx -t | nginx -s reload | TLS handshake to host:443 | Default key mode 0640 (worker reads via group) |
| apache | apachectl configtest | apachectl graceful | TLS handshake | Default key mode 0600; per-distro user (apache2/apache/httpd) |
| haproxy | haproxy -c -f <cfg> | systemctl reload haproxy | TLS handshake | Combined PEM (cert+chain+key in one file); default mode 0600 |
| traefik | (none — file watcher) | (none — file watcher auto-reloads) | TLS handshake | atomic-write only; ValidateOnly returns sentinel |
| caddy (file mode) | (none) | (none — file watcher) | TLS handshake | atomic-write replaces os.WriteFile |
| caddy (api mode) | Probe admin /config/ | POST /load (already atomic at admin server) | (admin server confirms) | ValidateOnly real impl probes admin API |
| envoy | (none — SDS file watcher) | (none — SDS file watcher) | TLS handshake | atomic-write replaces os.WriteFile |
| postfix | postfix check | postfix reload | TLS handshake to port 25 | Chain appended to cert if no ChainPath |
| dovecot | doveconf -n | doveadm reload | TLS handshake to port 993 | Same code path as postfix |
| f5 | (Authenticate probe) | (Transactional commit) | TLS handshake to VS | Already transactional; rollback automatic via failed commit |
| iis | (Get-WebSite probe) | (PowerShell cert install) | TLS handshake | Already explicit pre-deploy backup + post-rollback re-import |
| ssh | (Connect probe) | (SCP upload + remote chmod) | tls.Dial to remote TLS port | Pre-deploy SCP backup of remote files |
| wincertstore | (Get-ChildItem Cert:) | (Import-PfxCertificate) | (admin probe) | Get-ChildItem snapshot for rollback |
| javakeystore | (keytool -list) | (keytool -importkeystore) | (admin probe) | keytool snapshot; rollback via keytool -delete + re-import |
| k8ssecret | (V2 blocker — see note below) | (V2 blocker — see note below) | (V2 blocker — see note below) | V2 blocker — Bundle 2 of the 2026-05-02 deployment-target audit. Production realK8sClient at internal/connector/target/k8ssecret/k8ssecret.go:397-420 is a stub (every method returns "real Kubernetes client not implemented — use NewWithClient for tests"). The SHA-256 post-deploy verify and kubelet sync poll are designed but not yet implemented; production deploys to a real cluster fail with "not implemented" until Bundle 2 lands. Test mocks via NewWithClient work today. Tracking prompt: cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md. |
4. Post-deploy TLS verification
Frozen decision 0.3 (deploy-hardening I): post-deploy verify is
ON by default when the operator configures
PostDeployVerify.Endpoint. Per-target opt-out via
PostDeployVerify.Enabled = false.
The connector-side flow:
// After Apply returns successfully, the connector dials the
// configured endpoint, pulls the leaf cert SHA-256, and compares.
res := tlsprobe.ProbeTLS(ctx, "nginx-test:443", 10*time.Second)
if res.Fingerprint != certPEMToFingerprint(deployedCertPEM) {
	// Mismatch: wrong vhost, NGINX serving cached cert,
	// load-balanced target hit a different pod, etc.
	rollbackToBackups(ctx, applyResult.BackupPaths)
	emitAlert("post-deploy verify SHA-256 mismatch")
}
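For readers wondering what the fingerprint side of that comparison might involve, here is a minimal sketch using only the Go standard library. It assumes the fingerprint is the lowercase hex SHA-256 of the leaf certificate's DER bytes; the actual encoding used by tlsprobe and certPEMToFingerprint is not stated in this document, and leafFingerprintSHA256 and demoFingerprint are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/pem"
	"errors"
)

// leafFingerprintSHA256 returns the lowercase hex SHA-256 of the DER bytes
// in the first CERTIFICATE block of certPEM (the leaf, by convention).
// Assumption: the probe side reports the same encoding.
func leafFingerprintSHA256(certPEM []byte) (string, error) {
	rest := certPEM
	for {
		var block *pem.Block
		block, rest = pem.Decode(rest)
		if block == nil {
			return "", errors.New("no CERTIFICATE block found")
		}
		if block.Type == "CERTIFICATE" {
			sum := sha256.Sum256(block.Bytes)
			return hex.EncodeToString(sum[:]), nil
		}
	}
}

// demoFingerprint wraps placeholder bytes in a PEM block. The bytes are
// not a real certificate; they only exercise the hashing path.
func demoFingerprint() (string, error) {
	p := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: []byte("abc")})
	return leafFingerprintSHA256(p)
}
```

Hashing the DER bytes rather than the PEM text keeps the comparison stable across line-wrapping and whitespace differences in the PEM encoding.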
Retry with backoff (default 3 attempts, 2s exponential) defends against load-balanced targets where the verify might hit a different pod that hasn't picked up the new cert yet:
post_deploy_verify:
  enabled: true
  endpoint: "nginx.svc.cluster.local:443"
  timeout: 10s
post_deploy_verify_attempts: 3
post_deploy_verify_backoff: 2s
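A retry loop of this shape could look like the sketch below. It is an illustration under assumptions, not the connector code: verifyWithRetry and its probe callback are hypothetical, and doubling the delay after each attempt is one reading of "2s exponential".

```go
package main

import (
	"context"
	"errors"
	"time"
)

var errFingerprintMismatch = errors.New("post-deploy verify SHA-256 mismatch")

// verifyWithRetry calls probe up to attempts times and succeeds as soon as
// the served fingerprint equals want. The delay doubles after each failed
// attempt (assumed backoff policy).
func verifyWithRetry(ctx context.Context, attempts int, backoff time.Duration, want string,
	probe func(context.Context) (string, error)) error {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	delay := backoff
	for i := 0; i < attempts; i++ {
		got, err := probe(ctx)
		if err == nil && got == want {
			return nil
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = errFingerprintMismatch
		}
		if i == attempts-1 {
			break // no sleep after the final attempt
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		delay *= 2
	}
	return lastErr
}

// demoVerifyWithRetry simulates a load-balanced target where the first two
// probes hit a backend that still serves the old certificate.
func demoVerifyWithRetry() (int, error) {
	calls := 0
	probe := func(context.Context) (string, error) {
		calls++
		if calls < 3 {
			return "stale", nil
		}
		return "fresh", nil
	}
	err := verifyWithRetry(context.Background(), 3, time.Millisecond, "fresh", probe)
	return calls, err
}
```

Only after the last attempt fails would the caller proceed to rollback, so a briefly stale backend does not trigger an unnecessary restore.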
5. Rollback semantics
Rollback fires automatically on three triggers:
- PostCommit (reload) fails → Apply restores backups + retries
  reload. Returns ErrReloadFailed if the rollback succeeds (degraded
  no-op) or ErrRollbackFailed if the second reload also fails.
- Post-deploy verify fails → Connector manually triggers rollback
  (Apply already returned successfully). Backups are restored + reload
  is invoked again. Same escalation path on second failure.
- Mid-loop rename fails (rare; only with cross-filesystem misuse) →
  Apply rolls back the renames that already succeeded.
ErrRollbackFailed is operator-actionable. The destination is in
a known-bad state; operators must either:
- Restore from Result.BackupPaths manually + run <reload command>
- Push a fresh known-good cert via the next deploy cycle
The certctl_deploy_rollback_total{outcome="also_failed"} metric
is the alert target.
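The reload-failure escalation can be sketched as below. The two sentinel errors are local stand-ins that reuse the names from this section; reloadOrRollback and its callbacks are hypothetical and do not mirror the internal/deploy code.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the sentinels named in this section.
var (
	ErrReloadFailed   = errors.New("reload failed; rolled back")
	ErrRollbackFailed = errors.New("reload and rollback both failed")
)

// reloadOrRollback runs reload once. On failure it restores the backups
// and reloads again, then reports which of the two outcomes occurred.
func reloadOrRollback(reload, restore func() error) error {
	firstErr := reload()
	if firstErr == nil {
		return nil
	}
	if err := restore(); err != nil {
		return fmt.Errorf("%w: restore: %v", ErrRollbackFailed, err)
	}
	if err := reload(); err != nil {
		return fmt.Errorf("%w: second reload: %v", ErrRollbackFailed, err)
	}
	// Rollback worked: the target is serving the old certificate again.
	return fmt.Errorf("%w: %v", ErrReloadFailed, firstErr)
}

// demoReloadOrRollback exercises both outcomes with fake reload commands.
func demoReloadOrRollback() (recovered, stuck bool) {
	noop := func() error { return nil }
	calls := 0
	failOnce := func() error {
		calls++
		if calls == 1 {
			return errors.New("exit status 1")
		}
		return nil
	}
	alwaysFail := func() error { return errors.New("exit status 1") }

	recovered = errors.Is(reloadOrRollback(failOnce, noop), ErrReloadFailed)
	stuck = errors.Is(reloadOrRollback(alwaysFail, noop), ErrRollbackFailed)
	return recovered, stuck
}
```

Wrapping with %w lets callers branch with errors.Is while still keeping the underlying command error in the message for the operator.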
6. ValidateOnly — dry-run mode
target.Connector.ValidateOnly(ctx, request) runs the validate
step without touching the live cert. Connectors that can't
dry-run (Traefik / Envoy / Caddy file mode) return
target.ErrValidateOnlyNotSupported.
| Connector | ValidateOnly |
|---|---|
| nginx | nginx -t |
| apache | apachectl configtest |
| haproxy | haproxy -c -f <cfg> |
| postfix/dovecot | postfix check / doveconf -n |
| caddy (api) | GET /config/ probe |
| caddy (file) / traefik / envoy | ErrValidateOnlyNotSupported |
| f5 | client.Authenticate() probe |
| iis | Get-WebSite -Name <SiteName> |
| ssh | client.Connect() probe |
| wincertstore | Get-ChildItem Cert:\<loc>\<store> |
| javakeystore | keytool -list -keystore <path> |
| k8ssecret | client.GetSecret() RBAC probe |
Operators preview a deploy via the agent's --dry-run flag (or
the equivalent CLI invocation).
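A caller might treat the "not supported" sentinel as a skip rather than a failure. The sketch below shows one way to do that with errors.Is. It simplifies the signature (the request argument is dropped), defines a local stand-in for target.ErrValidateOnlyNotSupported, and all type and function names are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Local stand-in for target.ErrValidateOnlyNotSupported.
var ErrValidateOnlyNotSupported = errors.New("validate-only not supported")

// validator is a simplified view of the connector interface.
type validator interface {
	ValidateOnly(ctx context.Context) error
}

// fileWatcherTarget models a connector with no validate command.
type fileWatcherTarget struct{ name string }

func (f fileWatcherTarget) ValidateOnly(context.Context) error {
	return fmt.Errorf("%s: %w", f.name, ErrValidateOnlyNotSupported)
}

// commandTarget models a connector that shells out to a validate command.
type commandTarget struct{ validate func(context.Context) error }

func (c commandTarget) ValidateOnly(ctx context.Context) error { return c.validate(ctx) }

// dryRun maps the ValidateOnly result onto an operator-facing status.
func dryRun(ctx context.Context, v validator) string {
	err := v.ValidateOnly(ctx)
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, ErrValidateOnlyNotSupported):
		return "skipped"
	default:
		return "failed: " + err.Error()
	}
}

// demoDryRun runs the three cases: passing, unsupported, and failing.
func demoDryRun() [3]string {
	ctx := context.Background()
	pass := commandTarget{validate: func(context.Context) error { return nil }}
	fail := commandTarget{validate: func(context.Context) error {
		return errors.New("nginx -t: exit status 1")
	}}
	return [3]string{
		dryRun(ctx, pass),
		dryRun(ctx, fileWatcherTarget{name: "traefik"}),
		dryRun(ctx, fail),
	}
}
```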
7. File ownership + mode preservation
The single most common silent-failure mode pre-bundle: agent runs
as root, calls os.WriteFile(path, bytes, 0600), locks NGINX out
of the existing nginx:nginx 0640 key file.
Per frozen decision 0.7, deploy.Apply resolves ownership via
this precedence:
- Explicit File.Mode / File.Owner / File.Group (per-target config) →
  use as given.
- Existing destination file → preserve its chown + chmod.
- Plan.Defaults.Mode / .Owner / .Group → use as fallback for new files.
- Nothing set → os.WriteFile default (0644) for new files; preserved
  for existing.
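The precedence above can be sketched for the mode bits alone. This is an illustration under assumptions: resolveMode is a hypothetical helper, pointers stand in for "not set", and owner/group resolution (which needs root to exercise) is omitted.

```go
package main

import "os"

// resolveMode picks the permission bits for path using the precedence:
// explicit setting, then the existing file's mode, then the plan default,
// then 0644. A nil pointer means "not configured".
func resolveMode(path string, explicit, fallback *os.FileMode) os.FileMode {
	if explicit != nil {
		return *explicit
	}
	if info, err := os.Stat(path); err == nil {
		return info.Mode().Perm()
	}
	if fallback != nil {
		return *fallback
	}
	return 0o644
}

// demoResolveMode walks the four precedence levels against a scratch
// directory holding one existing 0640 key file.
func demoResolveMode() ([4]uint32, error) {
	var out [4]uint32
	dir, err := os.MkdirTemp("", "certctl-demo")
	if err != nil {
		return out, err
	}
	defer os.RemoveAll(dir)

	existing := dir + "/key.pem"
	if err := os.WriteFile(existing, nil, 0o640); err != nil {
		return out, err
	}
	// Set the mode explicitly; os.WriteFile is subject to umask.
	if err := os.Chmod(existing, 0o640); err != nil {
		return out, err
	}
	missing := dir + "/new.pem"
	explicit, fallback := os.FileMode(0o400), os.FileMode(0o600)

	out[0] = uint32(resolveMode(existing, &explicit, &fallback)) // explicit wins
	out[1] = uint32(resolveMode(existing, nil, &fallback))       // existing preserved
	out[2] = uint32(resolveMode(missing, nil, &fallback))        // plan default
	out[3] = uint32(resolveMode(missing, nil, nil))              // built-in 0644
	return out, nil
}
```

Preserving the existing mode ahead of the plan default is what keeps a root-run agent from tightening a 0640 key file that the web server reads via its group.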
Per-connector defaults (cross-distro, fall back to no-chown if no candidate user exists):
| Connector | Default user | Default group | Default cert mode | Default key mode |
|---|---|---|---|---|
| nginx | nginx → www-data | nginx → www-data | 0644 | 0640 |
| apache | apache → www-data → httpd | same | 0644 | 0600 |
| haproxy | haproxy | haproxy | n/a (combined PEM) | 0600 |
| postfix | postfix → dovecot → _postfix | same | 0644 | 0600 |
| traefik | (none) | (none) | 0644 | 0600 |
| envoy | (none) | (none) | 0644 | 0600 |
| caddy | (none) | (none) | 0644 | 0600 |
8. Per-target deploy mutex
Phase 2 of the master bundle: the agent (cmd/agent/main.go)
serializes concurrent deploys to the same target ID via a
sync.Map[targetID]*sync.Mutex. Granularity per frozen decision
0.5: one mutex per target, NOT per (target, cert).
Cert deploy throughput is operator-grade tens-per-minute. Coarse serialization is fine and simplifies reasoning about reload-side race windows.
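One way to build a per-target lock on top of sync.Map is sketched below. The targetLocks type and the target ID are hypothetical; the agent's actual structure in cmd/agent/main.go is not reproduced here.

```go
package main

import "sync"

// targetLocks hands out one mutex per target ID. The sync.Map stores
// target ID (string) -> *sync.Mutex.
type targetLocks struct{ m sync.Map }

// lock blocks until the caller holds the mutex for targetID and returns
// the function that releases it.
func (l *targetLocks) lock(targetID string) (unlock func()) {
	// LoadOrStore keeps exactly one mutex per ID even when several
	// goroutines race to create it.
	v, _ := l.m.LoadOrStore(targetID, &sync.Mutex{})
	mu := v.(*sync.Mutex)
	mu.Lock()
	return mu.Unlock
}

// demoTargetLocks runs 50 concurrent "deploys" against one target and
// returns the highest number that were ever inside the critical section.
func demoTargetLocks() int {
	var locks targetLocks
	var wg sync.WaitGroup
	inFlight, maxInFlight := 0, 0
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			unlock := locks.lock("target-nginx-1")
			defer unlock()
			inFlight++ // guarded by the per-target mutex
			if inFlight > maxInFlight {
				maxInFlight = inFlight
			}
			inFlight--
		}()
	}
	wg.Wait()
	return maxInFlight
}
```

Deploys to different target IDs obtain different mutexes and proceed in parallel; only deploys to the same target queue behind each other.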
9. Idempotency via SHA-256
Every deploy.Apply short-circuits when all File destinations
already match SHA-256 of the new bytes. PreCommit + PostCommit do
not fire; backups are not created; the result reports
SkippedAsIdempotent = true.
Defends against agent-restart retry storms that would otherwise
hammer targets with no-op reloads. Operator-visible signal:
certctl_deploy_idempotent_skip_total{target_type="..."}.
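The all-files comparison can be sketched as follows. allUpToDate is a hypothetical helper, and the sketch compares content only (mode and ownership are assumed to be handled elsewhere).

```go
package main

import (
	"crypto/sha256"
	"os"
)

// allUpToDate reports whether every destination already holds bytes with
// the same SHA-256 as the bytes about to be deployed. A missing or
// unreadable file counts as "not up to date".
func allUpToDate(files map[string][]byte) bool {
	for path, want := range files {
		have, err := os.ReadFile(path)
		if err != nil {
			return false
		}
		if sha256.Sum256(have) != sha256.Sum256(want) {
			return false
		}
	}
	return true
}

// demoAllUpToDate checks an unchanged cert/key pair, then a renewed cert
// paired with the same key.
func demoAllUpToDate() (unchanged, afterRenewal bool, err error) {
	dir, err := os.MkdirTemp("", "certctl-demo")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	cert, key := dir+"/cert.pem", dir+"/key.pem"
	if err := os.WriteFile(cert, []byte("cert-v1"), 0o644); err != nil {
		return false, false, err
	}
	if err := os.WriteFile(key, []byte("key-v1"), 0o600); err != nil {
		return false, false, err
	}
	unchanged = allUpToDate(map[string][]byte{
		cert: []byte("cert-v1"), key: []byte("key-v1"),
	})
	afterRenewal = allUpToDate(map[string][]byte{
		cert: []byte("cert-v2"), key: []byte("key-v1"),
	})
	return unchanged, afterRenewal, nil
}
```

The check is all-or-nothing: a single differing file means the whole plan runs, which matches the all-or-nothing contract of Apply.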
10. Troubleshooting matrix
| Symptom | Root cause | Operator action |
|---|---|---|
| ErrValidateFailed: nginx -t failed | Validate command rejected the staged config | Read PreCommit's wrapped error for the nginx stderr; fix config |
| ErrReloadFailed: nginx -s reload failed; rolled back | Reload command failed; rollback succeeded; serving the OLD cert | Investigate why reload failed; re-deploy when fixed |
| ErrRollbackFailed | Reload AND rollback both failed; in known-bad state | Restore from Result.BackupPaths manually; run reload command directly; check disk space + ownership |
| post-deploy TLS verify SHA-256 mismatch | New cert deployed but a different cert is being served (cached, wrong vhost, stale pod in load balancer) | Check NGINX SSL session cache TTL; verify SNI; bump verify retries via PostDeployVerifyAttempts |
| chown ... permission denied (in agent log) | Non-root agent OR target user doesn't exist on host | Verify agent runs as root in production; check distro user (Debian: www-data, RHEL: nginx) |
| Backups accumulating in cert dir | BackupRetention misconfigured | Set BackupRetention: 3 (default) or higher on per-target config |
| File world-readable after deploy | Default mode 0644 applied to new key file | Set explicit KeyFileMode: 0640 (NGINX) or KeyFileMode: 0600 (Apache) |
11. V3-Pro deferrals
Out of scope for the V2-free deploy-hardening I bundle:
- Multi-region deployment coordination — orchestration of N data-center deploys with operator approval gates per stage.
- Cert-pinning verification against mobile-app pin manifests.
- SOC 2 evidence-report generator — auto-export of the deploy audit trail in the format SOC 2 auditors expect.
- Customer-paid validation matrices: vendor-version certified
  quirks (e.g. "tested on F5 v15.1 + v17.0 + v17.5"). See
  cowork/deploy-hardening-ii-prompt.md for the per-vendor edge-case
  audit + integration test sidecars.
12. Per-connector quick reference
Paste-able config snippets for the most-used connectors. Full
field reference at docs/connectors.md.
NGINX
target_type: nginx
target_config:
  cert_path: /etc/nginx/certs/cert.pem
  chain_path: /etc/nginx/certs/chain.pem
  key_path: /etc/nginx/certs/key.pem
  reload_command: "nginx -s reload"
  validate_command: "nginx -t"
  cert_file_mode: 0644
  key_file_mode: 0640
  post_deploy_verify:
    enabled: true
    endpoint: "nginx.example.com:443"
    timeout: 10s
  backup_retention: 3
HAProxy
target_type: haproxy
target_config:
  pem_path: /etc/haproxy/certs/cert.pem
  reload_command: "systemctl reload haproxy"
  validate_command: "haproxy -c -f /etc/haproxy/haproxy.cfg"
  pem_file_mode: 0600
  post_deploy_verify:
    enabled: true
    endpoint: "haproxy.example.com:443"
Traefik (file watcher; no reload command)
target_type: traefik
target_config:
  cert_dir: /etc/traefik/certs
  cert_file: cert.pem
  key_file: key.pem
  post_deploy_verify:
    enabled: true
    endpoint: "traefik.example.com:443"
See per-connector tests at
internal/connector/target/<name>/<name>_atomic_test.go for the
full failure-mode matrix each connector handles.