Files
certctl/docs/reference/deployment-model.md
shankar0123 d60a0ac297 fix(security): close BUNDLE 1 — server+agent connector config validation chain
Bundle 1 closure (2026-05-12 acquisition diligence audit). Closes the
acquisition-blocker chain: target.edit (default r-operator grant per
migrations/000029_rbac.up.sql:196) → arbitrary reload_command stored
without validation → agent createTargetConnector json.Unmarshal-only
→ sh -c on agent host. README's 'shell injection prevention on all
connector scripts' claim is now true at the chain level.

Server-side: new internal/connector/target/configcheck package + a
configcheck.Validate call in target.go::Create + ::Update +
::CreateTarget + ::UpdateTarget (all 4 entry points). Rejects shell
metacharacters in reload_command / validate_command / restart_command
for nginx, apache, haproxy, postfix/dovecot, javakeystore, ssh. Sentinel
errors.Is(err, service.ErrInvalidConnectorConfig) available for handler
400 mapping. Non-shell connector types (F5, IIS, Caddy, Traefik, Envoy,
cloud targets, K8s) are no-ops by design.

Agent-side: defense-in-depth connector.ValidateConfig(ctx, configJSON)
call in cmd/agent/main.go inserted between createTargetConnector and
DeployCertificate. This catches (a) configs pre-dating the server gate,
(b) encrypted-blob tampering, (c) per-connector filesystem invariants
that the server can't check.

F5 (S2 finding): proven docs-vs-code drift, not a security bug. The
applyDefaults function never set Insecure=true; runtime default has
always been Go zero-value (false → TLS verified). Three lying 'default
true' comments in f5/f5.go (lines 30, 45-47, 126) rewritten to match
actual code behavior.

Docs (C4 + C9): README L12 + L68 narrowed — 'any CA / any server' →
'Twelve native CA connectors plus an OpenSSL adapter; fifteen native
deployment-target connectors plus a proxy-agent pattern.' 'Every deploy
goes through atomic-write + ...' narrowed to file-based connectors with
inline link to per-target guarantee matrix. New deployment-model.md §1.6
ships a 15-target × 8-property guarantee table covering atomic write /
owner-perms / SHA-256 idempotency / pre-deploy snapshot / on-failure
rollback / post-deploy TLS verify / Prometheus counters / shell-injection
validation — including the K8s preview honesty marker (CLAIM-H4).

Tests: internal/connector/target/configcheck/configcheck_test.go covers
14 shell-injection payloads (semicolon, pipe, backtick, dollar-paren,
redirect, and-chain, newline, double-quote, escape, dollar-var) × 7
shell-using connectors + benign-command acceptance + non-shell no-op
behavior + empty config + malformed JSON. All pass.

Verification (run from /sessions/gifted-blissful-pasteur/mnt/cowork/certctl):
  go fmt ./...              # clean (no diffs)
  go vet ./...              # clean (no findings)
  go test -short -count=1 ./internal/... ./cmd/...
                            # 60+ packages all ok, zero FAIL

Audit-Closes: BUNDLE-1 RT-C1 SEC-M4 CLAIM-M2 CLAIM-L3
Audit-Verifies-False: S2 (F5 'default insecure' was a comment lie, code was always secure)
2026-05-12 23:48:08 +00:00

21 KiB
Raw Permalink Blame History

Deployment Atomicity, Post-Deploy Verification, and Rollback

Last reviewed: 2026-05-05

Deploy-hardening I master bundle (v2.X.0). Operator + integrator reference for the atomic-write + post-deploy TLS verify + rollback pipeline that closes the procurement-checklist gap with commercial competitors (Venafi, DigiCert Certificate Manager, Sectigo).

1. Overview

Before deploy-hardening I, certctl's target connectors used duplicated os.WriteFile flows. A failure mid-deploy could leave a target with a renewed cert but no chain (or vice versa); a reload-fail produced a half-deployed state that required manual rollback; a wrong-vhost cert was silent until users reported it.

Deploy-hardening I closes three procurement-checklist gaps in a single shared primitive:

Gap Pre-bundle Post-bundle
Atomic deploy with rollback F5 only (transactional API) 12 of 13 connectors via deploy.Apply (K8s pending Bundle 2 — see Section 1.5)
Post-deploy TLS verification None NGINX/Apache/HAProxy/Traefik/Caddy/Envoy/Postfix all do TLS handshake + SHA-256 fingerprint compare; fail → rollback
Vendor-specific deployment recipes Light docs (Bundle II — per the project's deploy-hardening II spec)

This document describes the operator-visible surface. The Go-level contract lives at internal/deploy/doc.go.

1.6. Per-target guarantee matrix

Added 2026-05-12 (Bundle 1 / CLAIM-M2 closure). The README previously claimed "every deploy goes through atomic-write + ownership-preservation

  • SHA-256 idempotency + per-target Prometheus counters + pre-deploy snapshot + on-failure rollback." That claim is true for the file-based deploy primitive only. Cloud / API targets use vendor-SDK semantics and do not share the same primitive. This matrix is the authoritative per-target answer.

Legend: ✓ = supported / always on. ✗ = not applicable to this target family. ◐ = partial / vendor-specific equivalent. preview = ships but the production code path is a stub (see CLAIM-H4).

Target Atomic write Owner/perms preserved SHA-256 idempotency Pre-deploy snapshot On-failure rollback Post-deploy TLS verify Prometheus counters Server+agent shell-injection validation
NGINX
Apache
HAProxy
Caddy ✗ (no operator commands)
Traefik
Envoy
Postfix / Dovecot
SSH known-hosts ✗ (no TLS endpoint)
JavaKeystore ✗ (file format, no socket)
IIS ◐ (Windows cert store API)
WinCertStore ◐ (Windows cert store API)
F5 BIG-IP ✓ (iControl REST transaction) ✗ (no FS) ◐ (cert object name) ◐ (transaction rollback) ✓ (transaction rollback) ✓ (mgmt API GET)
AWS ACM ✗ (SDK call) ✗ (no FS) ◐ (ACM-side replace) ◐ (re-import old ARN)
Azure Key Vault ✗ (SDK call) ✗ (no FS) ◐ (KV-side versioning) ◐ (KV versioning)
Kubernetes Secrets preview preview preview preview preview preview preview

Notes on the matrix:

  • Atomic write / owner-perms / SHA-256 idempotency / snapshot / rollback are properties of the shared deploy.Apply primitive in internal/deploy/. They apply to file-based targets where certctl writes to disk.
  • Cloud / API targets (AWS ACM, Azure Key Vault) use the vendor SDK's import / replace operation. The vendor handles versioning and atomicity at their layer. certctl tracks the operation outcome via Prometheus counters; "rollback" in this row means "re-import the previous cert ARN" rather than the file-primitive's os.Rename rollback.
  • F5 uses iControl REST transactions for atomicity (deploy-hardening I docs above). It does not touch a filesystem; the snapshot/rollback semantics live in the F5 transaction protocol.
  • Kubernetes Secrets ships but the production client (realK8sClient) returns "real Kubernetes client not implemented" for all methods (see internal/connector/target/k8ssecret/k8ssecret.go:395+). Operators evaluating against a real cluster should treat this connector as preview until the production client lands.
  • Server+agent shell-injection validation (Bundle 1 / RT-C1 closure 2026-05-12) is on for every connector that accepts operator-supplied command strings: reload_command, validate_command, restart_command. Validation runs at API ingestion (internal/service/target.go::Create + ::Update + ::CreateTarget + ::UpdateTarget via internal/connector/target/configcheck) AND on the agent before deploy (cmd/agent/main.go post-createTargetConnector, calling each connector's full ValidateConfig method). Connectors that do not accept operator shell strings (Caddy / Traefik / Envoy / cloud targets) skip this check by design.

1.5. Audit closure status (2026-05-02 deployment-target audit)

The 2026-05-02 deployment-target coverage audit (the 2026-05-02 deployment-target audit) tightened the atomic + rollback contract on the connectors below. All bundles in the table are committed to master as of this section's last edit; commit hashes pin to the canonical landing commit for each piece of work.

Connector Bundle Commit Closes
envoy Bundle 3 d8cd981 atomic SDS JSON write + post-deploy watcher pickup poll
traefik Bundle 4 37634e6 single deploy.Apply Plan + all-files atomicity + rollback
iis Bundle 5 223f279 pre-deploy Get-WebBinding snapshot + on-failure binding rollback
ssh Bundle 6 eb39059 pre-deploy SFTP snapshot + reload-failure rollback
wincertstore Bundle 7 1dd1dd4 Get-ChildItem snapshot + on-import-failure rollback
javakeystore Bundle 8 87e0009 keytool -exportkeystore snapshot + on-import-failure rollback + operator playbook for argv password
caddy Bundle 9 8cda860 duration metric fix + file-mode PEM validate + api-mode SHA-256 idempotency
postfix/dovecot Bundle 11 88e8881 applyDefaults + verify-fails-rollback test pin under Mode=dovecot

Outstanding from the same audit:

  • Bundle 2 (k8ssecret). The production realK8sClient is still a stub (see Section 3 / row k8ssecret below). Replacing it with a real k8s.io/client-go implementation + ResourceVersion plumbing
    • post-deploy SHA-256 verify + kubelet sync poll is the remaining V2 P0 blocker. Tracking prompt: the project's k8s-real-client spec.

Bundle 10 (per-connector loadtest harness, commit 6286cd4) does not modify the per-connector contract table; it's a CI / observability addition documented separately at deploy/test/loadtest/README.md.

The original Bundle 1 audit spec read "soften the IIS / SSH / WinCertStore / JavaKeystore rollback claims first while bundles 58 catch the implementation up". Execution order inverted that loop — Bundles 311 shipped before the doc-realignment commit, so the rows in Section 3 below are honest as-shipped without ever needing a softening pass. The K8s row is the one exception, and Section 3's notes call it out explicitly.

2. The atomic-write primitive — Plan / Apply

internal/deploy.Apply(ctx, plan) is the load-bearing entry point. Connectors build a Plan describing one or more files + their PreCommit (validate) and PostCommit (reload) hooks; Apply executes them all-or-nothing.

plan := deploy.Plan{
    Files: []deploy.File{
        {Path: "/etc/nginx/certs/cert.pem", Bytes: certPEM, Mode: 0644},
        {Path: "/etc/nginx/certs/chain.pem", Bytes: chainPEM, Mode: 0644},
        {Path: "/etc/nginx/certs/key.pem",   Bytes: keyPEM,   Mode: 0640},
    },
    PreCommit: func(ctx context.Context, tempPaths map[string]string) error {
        // Run `nginx -t` against the staged config — bytes already
        // written to <path>.certctl-tmp.<unix-nanos>.
        return runValidate(ctx, "nginx -t")
    },
    PostCommit: func(ctx context.Context) error {
        return runReload(ctx, "nginx -s reload")
    },
}
res, err := deploy.Apply(ctx, plan)

Apply's algorithm:

  1. Per-file mutex acquired (sync.Map; coarse-grained per-path serialization).
  2. SHA-256 idempotency short-circuit. If every File's destination already matches, return Result.SkippedAsIdempotent=true without firing PreCommit/PostCommit.
  3. Pre-deploy backup: copy each existing destination to <path>.certctl-bak.<unix-nanos>.
  4. Write each File's bytes to <path>.certctl-tmp.<unix-nanos> in the destination directory (same-filesystem rename).
  5. Apply ownership (chown + chmod) to each temp file BEFORE rename so the swap is atomic with the right perms.
  6. Call PreCommit(ctx, tempPaths). On error: clean up temps; return ErrValidateFailed.
  7. os.Rename each temp → final. POSIX guarantees atomic.
  8. Call PostCommit(ctx). On error: restore each backup; re-call PostCommit. If second PostCommit also fails: return ErrRollbackFailed (operator-actionable).
  9. Janitor: prune backups beyond Plan.BackupRetention (default 3, -1 to disable).

3. Per-connector atomic contract

Connector PreCommit (validate) PostCommit (reload) Post-deploy verify Quirks
nginx nginx -t nginx -s reload TLS handshake to host:443 Default key mode 0640 (worker reads via group)
apache apachectl configtest apachectl graceful TLS handshake Default key mode 0600; per-distro user (apache2/apache/httpd)
haproxy haproxy -c -f <cfg> systemctl reload haproxy TLS handshake Combined PEM (cert+chain+key in one file); default mode 0600
traefik (none — file watcher) (none — file watcher auto-reloads) TLS handshake atomic-write only; ValidateOnly returns sentinel
caddy (file mode) (none) (none — file watcher) TLS handshake atomic-write replaces os.WriteFile
caddy (api mode) Probe admin /config/ POST /load (already atomic at admin server) (admin server confirms) ValidateOnly real impl probes admin API
envoy (none — SDS file watcher) (none — SDS file watcher) TLS handshake atomic-write replaces os.WriteFile
postfix postfix check postfix reload TLS handshake to port 25 Chain appended to cert if no ChainPath
dovecot doveconf -n doveadm reload TLS handshake to port 993 Same code path as postfix
f5 (Authenticate probe) (Transactional commit) TLS handshake to VS Already transactional; rollback automatic via failed commit
iis (Get-WebSite probe) (PowerShell cert install) TLS handshake Already explicit pre-deploy backup + post-rollback re-import
ssh (Connect probe) (SCP upload + remote chmod) tls.Dial to remote TLS port Pre-deploy SCP backup of remote files
wincertstore (Get-ChildItem Cert:) (Import-PfxCertificate) (admin probe) Get-ChildItem snapshot for rollback
javakeystore (keytool -list) (keytool -importkeystore) (admin probe) keytool snapshot; rollback via keytool -delete + re-import
k8ssecret (V2 blocker — see note below) (V2 blocker — see note below) (V2 blocker — see note below) V2 blocker — Bundle 2 of the 2026-05-02 deployment-target audit. Production realK8sClient at internal/connector/target/k8ssecret/k8ssecret.go:397-420 is a stub (every method returns "real Kubernetes client not implemented — use NewWithClient for tests"). The SHA-256 post-deploy verify and kubelet sync poll are designed but not yet implemented; production deploys to a real cluster fail with "not implemented" until Bundle 2 lands. Test mocks via NewWithClient work today. Tracking prompt: the project's k8s-real-client spec.

Postfix vs Dovecot mode: see "Choosing Mode=postfix vs Mode=dovecot" in docs/connectors.md for the per-mode defaults (cert/key paths, validate + reload commands), the dual-deploy guidance for mail servers running both daemons, and the test-pin reference (Bundle 11 commit 88e8881).

4. Post-deploy TLS verification

Frozen decision 0.3 (deploy-hardening I): post-deploy verify is ON by default when the operator configures PostDeployVerify.Endpoint. Per-target opt-out via PostDeployVerify.Enabled = false.

The connector-side flow:

// After Apply returns successfully, the connector dials the
// configured endpoint, pulls the leaf cert SHA-256, and compares.
res := tlsprobe.ProbeTLS(ctx, "nginx-test:443", 10*time.Second)
if res.Fingerprint != certPEMToFingerprint(deployedCertPEM) {
    // Mismatch — wrong vhost, NGINX serving cached cert,
    // load-balanced target hit a different pod, etc.
    rollbackToBackups(ctx, applyResult.BackupPaths)
    emitAlert("post-deploy verify SHA-256 mismatch")
}

Retry with exponential backoff (default 3 attempts; 1s initial, 16s cap) defends against load-balanced targets where the verify might hit a different pod that hasn't picked up the new cert yet. Backoff grows 1s → 2s → 4s → 8s → 16s, giving the LB fleet time to converge before giving up. Operators preserving V2 linear semantics (every attempt waits the same interval) set post_deploy_verify_max_backoff equal to post_deploy_verify_backoff.

post_deploy_verify:
  enabled: true
  endpoint: "nginx.svc.cluster.local:443"
  timeout: 10s
post_deploy_verify_attempts: 3
post_deploy_verify_backoff: 1s
post_deploy_verify_max_backoff: 16s

5. Rollback semantics

Rollback fires automatically on three triggers:

  1. PostCommit (reload) fails → Apply restores backups + retries reload. Returns ErrReloadFailed on success (degraded no-op) or ErrRollbackFailed if the second reload also fails.
  2. Post-deploy verify fails → Connector manually triggers rollback (Apply already returned successfully). Backups are restored + reload is invoked again. Same escalation path on second failure.
  3. Mid-loop rename fails (rare; only with cross-filesystem misuse) → Apply rolls back the renames that already succeeded.

ErrRollbackFailed is operator-actionable. The destination is in a known-bad state; operators must either:

  • Restore from Result.BackupPaths manually + run <reload command>
  • Push a fresh known-good cert via the next deploy cycle

The certctl_deploy_rollback_total{outcome="also_failed"} metric is the alert target.

6. ValidateOnly — dry-run mode

target.Connector.ValidateOnly(ctx, request) runs the validate step without touching the live cert. Connectors that can't dry-run (Traefik / Envoy / Caddy file mode) return target.ErrValidateOnlyNotSupported.

Connector ValidateOnly
nginx nginx -t
apache apachectl configtest
haproxy haproxy -c -f <cfg>
postfix/dovecot postfix check / doveconf -n
caddy (api) GET /config/ probe
caddy (file) / traefik / envoy ErrValidateOnlyNotSupported
f5 client.Authenticate() probe
iis Get-WebSite -Name <SiteName>
ssh client.Connect() probe
wincertstore Get-ChildItem Cert:\<loc>\<store>
javakeystore keytool -list -keystore <path>
k8ssecret client.GetSecret() RBAC probe

Operators preview a deploy via the agent's --dry-run flag (or the equivalent CLI invocation).

7. File ownership + mode preservation

The single most common silent-failure mode pre-bundle: agent runs as root, calls os.WriteFile(path, bytes, 0600), locks NGINX out of the existing nginx:nginx 0640 key file.

Per frozen decision 0.7, deploy.Apply resolves ownership via this precedence:

  1. Explicit File.Mode / File.Owner / File.Group (per-target config) → use as given.
  2. Existing destination file → preserve its chown + chmod.
  3. Plan.Defaults.Mode / .Owner / .Group → use as fallback for new files.
  4. Nothing set → os.WriteFile default (0644) for new files; preserved for existing.

Per-connector defaults (cross-distro, fall back to no-chown if no candidate user exists):

Connector Default user Default group Default cert mode Default key mode
nginx nginx → www-data nginx → www-data 0644 0640
apache apache → www-data → httpd same 0644 0600
haproxy haproxy haproxy n/a (combined PEM) 0600
postfix postfix → dovecot → _postfix same 0644 0600
traefik (none) (none) 0644 0600
envoy (none) (none) 0644 0600
caddy (none) (none) 0644 0600

8. Per-target deploy mutex

Phase 2 of the master bundle: the agent (cmd/agent/main.go) serializes concurrent deploys to the same target ID via a sync.Map[targetID]*sync.Mutex. Granularity per frozen decision 0.5: one mutex per target, NOT per (target, cert).

Cert deploy throughput is operator-grade tens-per-minute. Coarse serialization is fine and simplifies reasoning about reload-side race windows.

9. Idempotency via SHA-256

Every deploy.Apply short-circuits when all File destinations already match SHA-256 of the new bytes. PreCommit + PostCommit do not fire; backups are not created; the result reports SkippedAsIdempotent = true.

Defends against agent-restart retry storms that would otherwise hammer targets with no-op reloads. Operator-visible signal: certctl_deploy_idempotent_skip_total{target_type="..."}.

10. Troubleshooting matrix

Symptom Root cause Operator action
ErrValidateFailed: nginx -t failed Validate command rejected the staged config Read PreCommit's wrapped error for the nginx stderr; fix config
ErrReloadFailed: nginx -s reload failed; rolled back Reload command failed; rollback succeeded; serving the OLD cert Investigate why reload failed; re-deploy when fixed
ErrRollbackFailed Reload AND rollback both failed; in known-bad state Restore from Result.BackupPaths manually; run reload command directly; check disk space + ownership
post-deploy TLS verify SHA-256 mismatch New cert deployed but a different cert is being served (cached, wrong vhost, stale pod in load balancer) Check NGINX SSL session cache TTL; verify SNI; bump verify retries via PostDeployVerifyAttempts
chown ... permission denied (in agent log) Non-root agent OR target user doesn't exist on host Verify agent runs as root in production; check distro user (Debian: www-data, RHEL: nginx)
Backups accumulating in cert dir BackupRetention misconfigured Set BackupRetention: 3 (default) or higher on per-target config
File world-readable after deploy Default mode 0644 applied to new key file Set explicit KeyFileMode: 0640 (NGINX) or KeyFileMode: 0600 (Apache)

11. V3-Pro deferrals

Out of scope for the V2-free deploy-hardening I bundle:

  • Multi-region deployment coordination — orchestration of N data-center deploys with operator approval gates per stage.
  • Cert-pinning verification against mobile-app pin manifests.
  • Audit-evidence report generator — auto-export of the deploy audit trail in a reviewer-friendly format.
  • Customer-paid validation matrices — vendor-version certified quirks (e.g. "tested on F5 v15.1 + v17.0 + v17.5"). See the project's deploy-hardening II spec for the per-vendor edge-case audit + integration test sidecars.

12. Per-connector quick reference

Paste-able config snippets for the most-used connectors. Full field reference at docs/connectors.md.

NGINX

target_type: nginx
target_config:
  cert_path: /etc/nginx/certs/cert.pem
  chain_path: /etc/nginx/certs/chain.pem
  key_path: /etc/nginx/certs/key.pem
  reload_command: "nginx -s reload"
  validate_command: "nginx -t"
  cert_file_mode: 0644
  key_file_mode: 0640
  post_deploy_verify:
    enabled: true
    endpoint: "nginx.example.com:443"
    timeout: 10s
  backup_retention: 3

HAProxy

target_type: haproxy
target_config:
  pem_path: /etc/haproxy/certs/cert.pem
  reload_command: "systemctl reload haproxy"
  validate_command: "haproxy -c -f /etc/haproxy/haproxy.cfg"
  pem_file_mode: 0600
  post_deploy_verify:
    enabled: true
    endpoint: "haproxy.example.com:443"

Traefik (file watcher; no reload command)

target_type: traefik
target_config:
  cert_dir: /etc/traefik/certs
  cert_file: cert.pem
  key_file: key.pem
  post_deploy_verify:
    enabled: true
    endpoint: "traefik.example.com:443"

See per-connector tests at internal/connector/target/<name>/<name>_atomic_test.go for the full failure-mode matrix each connector handles.