Files
certctl/docs/reference/deployment-model.md
T
shankar0123 d809874fa1 docs: retire compliance subtree + sweep framework name-drops from prose
Per operator decision the framework-mapping docs are gone. They
were aspirational (no audit, no certification, no validated
mapping); keeping them around was misleading.

Files deleted (1,883 lines):
- docs/compliance/index.md
- docs/compliance/soc2.md
- docs/compliance/pci-dss.md
- docs/compliance/nist-sp-800-57.md

Hyperlinks removed:
- README.md: 'Auditor / compliance' row in the doc table; the
  '(compliance mapping included)' parenthetical in the
  positioning paragraph
- docs/README.md: the '## Compliance' section table; the
  'Auditor / compliance team' reading-order-by-role row

Prose name-drops swept across 24 files:
- README.md: 'FedRAMP boundary CAs / financial-services policy
  CAs' → '4-level boundary CAs / 3-level policy CAs';
  'Compliance-grade for PCI-DSS Level 1, FedRAMP Moderate / High,
  SOC 2 Type II, HIPAA' → cut entirely
- getting-started/{quickstart,concepts,examples,why-certctl,
  advanced-demo}.md: 'compliance' → 'audit' / 'policy';
  'PCI-DSS / SOC 2 / NIST SP 800-57' framework lists cut;
  ''pci': 'true'' tag example → ''environment': 'production''
- migration/cert-manager-coexistence.md: 'compliance rules' →
  'policy rules'
- operator/approval-workflow.md: 'Compliance customers (PCI-DSS
  Level 1, FedRAMP Moderate / High, SOC 2 Type II, HIPAA)' →
  'Operators'; entire 'Compliance control mapping' table
  (PCI-DSS §6.4.5 / NIST SP 800-53 SA-15 / SOC 2 Type II CC6.1
  / HIPAA §164.308(a)(4)) deleted; 'compliance contract' →
  'two-person-integrity contract'; 'compliance auditors' →
  'reviewers'
- operator/legacy-clients-tls-1.2.md: 'PCI-DSS v4.0 Req 4 §2.2.5'
  audit-reference → CWE-326 (kept); 'PCI-DSS Req 4 §2.2.5
  attestation' section retitled to 'TLS posture summary' and
  rewritten without framework framing; 'PCI-DSS, NIST, and
  major browsers will eventually deprecate TLS 1.2' →
  'Major browsers and OS vendors will eventually deprecate
  TLS 1.2'
- operator/database-tls.md: PCI-DSS Req 4 §2.2.5 audit-ref →
  CWE-319 only; 'PCI-DSS scope' → 'sensitive data'; PCI-DSS
  Req 4 v4.0 prose footing → cut
- operator/runbooks/disaster-recovery.md: 'SOC 2 / PCI
  procurement-team deliverable' → 'on-call deliverable';
  'compliance auditors' → 'reviewers'
- reference/connectors/{acme,aws-acm,azure-kv,globalsign,
  local-ca,openssl,ssh,index}.md: 'compliance reporting
  (PCI-DSS §3.6, HIPAA §164.312)' → 'audit reporting';
  'Compliance environments (PCI-DSS Level 1, FedRAMP High,
  HIPAA)' → 'Regulated environments'; 'compliance audits' →
  'audit'; 'FedRAMP boundary CA' pattern names →
  '4-level boundary CA' (technically descriptive)
- reference/protocols/est.md: 'compliance-hook seam' →
  'device-state hook seam'; 'compliance gating' → 'device-state
  gating'; 'est_compliance_failed' → 'est_device_state_failed'
- reference/protocols/scep-intune.md: 'Optional compliance
  check' → 'Optional device-state check'; failure-counter
  'compliance_failed' → 'device_state_failed'; 'Conditional
  Access compliance gating' → 'Conditional Access
  device-state gating'
- reference/intermediate-ca-hierarchy.md: 'FedRAMP boundary-CA
  deployments where the regulator requires...' →
  'Boundary-CA deployments where you want separation of policy
  and issuing authorities'; pattern A retitled '4-level FedRAMP
  boundary CA' → '4-level boundary CA'
- reference/architecture.md: broken Related-docs link to
  compliance.md removed; the rest of that block had stale
  pre-Phase-2 paths (quickstart.md, demo-advanced.md,
  connectors.md, openapi.md, testing-guide.md, test-env.md) —
  retargeted to current locations
- reference/deployment-model.md: 'SOC 2 evidence-report
  generator' → 'Audit-evidence report generator'
- reference/vendor-matrix.md: 'SOC 2 / PCI auditors paste this
  into evidence packs' → 'reviewers paste this into
  vendor-evaluation packs'
- contributor/qa-test-suite.md: 'compliance exist' coverage
  description cut; 'Compliance (PCI / SOC2 / HIPAA-relevant)'
  risk-class label → 'Audit-relevant'

What was kept:
- CWE references (legitimate technical pointers)
- Microsoft API/feature names that happen to use 'compliance'
  literally ('Microsoft Graph compliance API',
  'device-compliance validators' — these are MS product names,
  not framework name-drops)
- 'NIST PQC' on the landing page (Post-Quantum Cryptography is
  the actual NIST standard family, not a compliance framework)

Verified: zero hyperlinks into docs/compliance/ remain. All 24
ci-guards/*.sh pass locally. qa-doc-seed-count.sh clean.
Net diff: 26 files / -1,883 deletions in compliance/ + -32 net
across the prose sweep.

Companion edits in cowork/ (CLAUDE.md doc-tree summary +
WORKSPACE-CHANGELOG.md retirement note) land separately.
2026-05-05 05:26:44 +00:00

17 KiB
Raw Blame History

Deployment Atomicity, Post-Deploy Verification, and Rollback

Last reviewed: 2026-05-05

Deploy-hardening I master bundle (v2.X.0). Operator + integrator reference for the atomic-write + post-deploy TLS verify + rollback pipeline that closes the procurement-checklist gap with commercial competitors (Venafi, DigiCert Certificate Manager, Sectigo).

1. Overview

Before deploy-hardening I, certctl's target connectors used duplicated os.WriteFile flows. A failure mid-deploy could leave a target with a renewed cert but no chain (or vice versa); a reload-fail produced a half-deployed state that required manual rollback; a wrong-vhost cert was silent until users reported it.

Deploy-hardening I closes three procurement-checklist gaps in a single shared primitive:

Gap Pre-bundle Post-bundle
Atomic deploy with rollback F5 only (transactional API) 12 of 13 connectors via deploy.Apply (K8s pending Bundle 2 — see Section 1.5)
Post-deploy TLS verification None NGINX/Apache/HAProxy/Traefik/Caddy/Envoy/Postfix all do TLS handshake + SHA-256 fingerprint compare; fail → rollback
Vendor-specific deployment recipes Light docs (Bundle II — cowork/deploy-hardening-ii-prompt.md)

This document describes the operator-visible surface. The Go-level contract lives at internal/deploy/doc.go.

1.5. Audit closure status (2026-05-02 deployment-target audit)

The 2026-05-02 deployment-target coverage audit (cowork/deployment-target-audit-2026-05-02/RESULTS.md) tightened the atomic + rollback contract on the connectors below. All bundles in the table are committed to master as of this section's last edit; commit hashes pin to the canonical landing commit for each piece of work.

Connector Bundle Commit Closes
envoy Bundle 3 d8cd981 atomic SDS JSON write + post-deploy watcher pickup poll
traefik Bundle 4 37634e6 single deploy.Apply Plan + all-files atomicity + rollback
iis Bundle 5 223f279 pre-deploy Get-WebBinding snapshot + on-failure binding rollback
ssh Bundle 6 eb39059 pre-deploy SFTP snapshot + reload-failure rollback
wincertstore Bundle 7 1dd1dd4 Get-ChildItem snapshot + on-import-failure rollback
javakeystore Bundle 8 87e0009 keytool -exportkeystore snapshot + on-import-failure rollback + operator playbook for argv password
caddy Bundle 9 8cda860 duration metric fix + file-mode PEM validate + api-mode SHA-256 idempotency
postfix/dovecot Bundle 11 88e8881 applyDefaults + verify-fails-rollback test pin under Mode=dovecot

Outstanding from the same audit:

  • Bundle 2 (k8ssecret). The production realK8sClient is still a stub (see Section 3 / row k8ssecret below). Replacing it with a real k8s.io/client-go implementation + ResourceVersion plumbing
    • post-deploy SHA-256 verify + kubelet sync poll is the remaining V2 P0 blocker. Tracking prompt: cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md.

Bundle 10 (per-connector loadtest harness, commit 6286cd4) does not modify the per-connector contract table; it's a CI / observability addition documented separately at deploy/test/loadtest/README.md.

The original Bundle 1 audit spec read "soften the IIS / SSH / WinCertStore / JavaKeystore rollback claims first while bundles 58 catch the implementation up". Execution order inverted that loop — Bundles 311 shipped before the doc-realignment commit, so the rows in Section 3 below are honest as-shipped without ever needing a softening pass. The K8s row is the one exception, and Section 3's notes call it out explicitly.

2. The atomic-write primitive — Plan / Apply

internal/deploy.Apply(ctx, plan) is the load-bearing entry point. Connectors build a Plan describing one or more files + their PreCommit (validate) and PostCommit (reload) hooks; Apply executes them all-or-nothing.

plan := deploy.Plan{
    Files: []deploy.File{
        {Path: "/etc/nginx/certs/cert.pem", Bytes: certPEM, Mode: 0644},
        {Path: "/etc/nginx/certs/chain.pem", Bytes: chainPEM, Mode: 0644},
        {Path: "/etc/nginx/certs/key.pem",   Bytes: keyPEM,   Mode: 0640},
    },
    PreCommit: func(ctx context.Context, tempPaths map[string]string) error {
        // Run `nginx -t` against the staged config — bytes already
        // written to <path>.certctl-tmp.<unix-nanos>.
        return runValidate(ctx, "nginx -t")
    },
    PostCommit: func(ctx context.Context) error {
        return runReload(ctx, "nginx -s reload")
    },
}
res, err := deploy.Apply(ctx, plan)

Apply's algorithm:

  1. Per-file mutex acquired (sync.Map; coarse-grained per-path serialization).
  2. SHA-256 idempotency short-circuit. If every File's destination already matches, return Result.SkippedAsIdempotent=true without firing PreCommit/PostCommit.
  3. Pre-deploy backup: copy each existing destination to <path>.certctl-bak.<unix-nanos>.
  4. Write each File's bytes to <path>.certctl-tmp.<unix-nanos> in the destination directory (same-filesystem rename).
  5. Apply ownership (chown + chmod) to each temp file BEFORE rename so the swap is atomic with the right perms.
  6. Call PreCommit(ctx, tempPaths). On error: clean up temps; return ErrValidateFailed.
  7. os.Rename each temp → final. POSIX guarantees atomic.
  8. Call PostCommit(ctx). On error: restore each backup; re-call PostCommit. If second PostCommit also fails: return ErrRollbackFailed (operator-actionable).
  9. Janitor: prune backups beyond Plan.BackupRetention (default 3, -1 to disable).

3. Per-connector atomic contract

Connector PreCommit (validate) PostCommit (reload) Post-deploy verify Quirks
nginx nginx -t nginx -s reload TLS handshake to host:443 Default key mode 0640 (worker reads via group)
apache apachectl configtest apachectl graceful TLS handshake Default key mode 0600; per-distro user (apache2/apache/httpd)
haproxy haproxy -c -f <cfg> systemctl reload haproxy TLS handshake Combined PEM (cert+chain+key in one file); default mode 0600
traefik (none — file watcher) (none — file watcher auto-reloads) TLS handshake atomic-write only; ValidateOnly returns sentinel
caddy (file mode) (none) (none — file watcher) TLS handshake atomic-write replaces os.WriteFile
caddy (api mode) Probe admin /config/ POST /load (already atomic at admin server) (admin server confirms) ValidateOnly real impl probes admin API
envoy (none — SDS file watcher) (none — SDS file watcher) TLS handshake atomic-write replaces os.WriteFile
postfix postfix check postfix reload TLS handshake to port 25 Chain appended to cert if no ChainPath
dovecot doveconf -n doveadm reload TLS handshake to port 993 Same code path as postfix
f5 (Authenticate probe) (Transactional commit) TLS handshake to VS Already transactional; rollback automatic via failed commit
iis (Get-WebSite probe) (PowerShell cert install) TLS handshake Already explicit pre-deploy backup + post-rollback re-import
ssh (Connect probe) (SCP upload + remote chmod) tls.Dial to remote TLS port Pre-deploy SCP backup of remote files
wincertstore (Get-ChildItem Cert:) (Import-PfxCertificate) (admin probe) Get-ChildItem snapshot for rollback
javakeystore (keytool -list) (keytool -importkeystore) (admin probe) keytool snapshot; rollback via keytool -delete + re-import
k8ssecret (V2 blocker — see note below) (V2 blocker — see note below) (V2 blocker — see note below) V2 blocker — Bundle 2 of the 2026-05-02 deployment-target audit. Production realK8sClient at internal/connector/target/k8ssecret/k8ssecret.go:397-420 is a stub (every method returns "real Kubernetes client not implemented — use NewWithClient for tests"). The SHA-256 post-deploy verify and kubelet sync poll are designed but not yet implemented; production deploys to a real cluster fail with "not implemented" until Bundle 2 lands. Test mocks via NewWithClient work today. Tracking prompt: cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md.

Postfix vs Dovecot mode: see "Choosing Mode=postfix vs Mode=dovecot" in docs/connectors.md for the per-mode defaults (cert/key paths, validate + reload commands), the dual-deploy guidance for mail servers running both daemons, and the test-pin reference (Bundle 11 commit 88e8881).

4. Post-deploy TLS verification

Frozen decision 0.3 (deploy-hardening I): post-deploy verify is ON by default when the operator configures PostDeployVerify.Endpoint. Per-target opt-out via PostDeployVerify.Enabled = false.

The connector-side flow:

// After Apply returns successfully, the connector dials the
// configured endpoint, pulls the leaf cert SHA-256, and compares.
res := tlsprobe.ProbeTLS(ctx, "nginx-test:443", 10*time.Second)
if res.Fingerprint != certPEMToFingerprint(deployedCertPEM) {
    // Mismatch — wrong vhost, NGINX serving cached cert,
    // load-balanced target hit a different pod, etc.
    rollbackToBackups(ctx, applyResult.BackupPaths)
    emitAlert("post-deploy verify SHA-256 mismatch")
}

Retry with exponential backoff (default 3 attempts; 1s initial, 16s cap) defends against load-balanced targets where the verify might hit a different pod that hasn't picked up the new cert yet. Backoff grows 1s → 2s → 4s → 8s → 16s, giving the LB fleet time to converge before giving up. Operators preserving V2 linear semantics (every attempt waits the same interval) set post_deploy_verify_max_backoff equal to post_deploy_verify_backoff.

post_deploy_verify:
  enabled: true
  endpoint: "nginx.svc.cluster.local:443"
  timeout: 10s
post_deploy_verify_attempts: 3
post_deploy_verify_backoff: 1s
post_deploy_verify_max_backoff: 16s

5. Rollback semantics

Rollback fires automatically on three triggers:

  1. PostCommit (reload) fails → Apply restores backups + retries reload. Returns ErrReloadFailed on success (degraded no-op) or ErrRollbackFailed if the second reload also fails.
  2. Post-deploy verify fails → Connector manually triggers rollback (Apply already returned successfully). Backups are restored + reload is invoked again. Same escalation path on second failure.
  3. Mid-loop rename fails (rare; only with cross-filesystem misuse) → Apply rolls back the renames that already succeeded.

ErrRollbackFailed is operator-actionable. The destination is in a known-bad state; operators must either:

  • Restore from Result.BackupPaths manually + run <reload command>
  • Push a fresh known-good cert via the next deploy cycle

The certctl_deploy_rollback_total{outcome="also_failed"} metric is the alert target.

6. ValidateOnly — dry-run mode

target.Connector.ValidateOnly(ctx, request) runs the validate step without touching the live cert. Connectors that can't dry-run (Traefik / Envoy / Caddy file mode) return target.ErrValidateOnlyNotSupported.

Connector ValidateOnly
nginx nginx -t
apache apachectl configtest
haproxy haproxy -c -f <cfg>
postfix/dovecot postfix check / doveconf -n
caddy (api) GET /config/ probe
caddy (file) / traefik / envoy ErrValidateOnlyNotSupported
f5 client.Authenticate() probe
iis Get-WebSite -Name <SiteName>
ssh client.Connect() probe
wincertstore Get-ChildItem Cert:\<loc>\<store>
javakeystore keytool -list -keystore <path>
k8ssecret client.GetSecret() RBAC probe

Operators preview a deploy via the agent's --dry-run flag (or the equivalent CLI invocation).

7. File ownership + mode preservation

The single most common silent-failure mode pre-bundle: agent runs as root, calls os.WriteFile(path, bytes, 0600), locks NGINX out of the existing nginx:nginx 0640 key file.

Per frozen decision 0.7, deploy.Apply resolves ownership via this precedence:

  1. Explicit File.Mode / File.Owner / File.Group (per-target config) → use as given.
  2. Existing destination file → preserve its chown + chmod.
  3. Plan.Defaults.Mode / .Owner / .Group → use as fallback for new files.
  4. Nothing set → os.WriteFile default (0644) for new files; preserved for existing.

Per-connector defaults (cross-distro, fall back to no-chown if no candidate user exists):

Connector Default user Default group Default cert mode Default key mode
nginx nginx → www-data nginx → www-data 0644 0640
apache apache → www-data → httpd same 0644 0600
haproxy haproxy haproxy n/a (combined PEM) 0600
postfix postfix → dovecot → _postfix same 0644 0600
traefik (none) (none) 0644 0600
envoy (none) (none) 0644 0600
caddy (none) (none) 0644 0600

8. Per-target deploy mutex

Phase 2 of the master bundle: the agent (cmd/agent/main.go) serializes concurrent deploys to the same target ID via a sync.Map[targetID]*sync.Mutex. Granularity per frozen decision 0.5: one mutex per target, NOT per (target, cert).

Cert deploy throughput is operator-grade tens-per-minute. Coarse serialization is fine and simplifies reasoning about reload-side race windows.

9. Idempotency via SHA-256

Every deploy.Apply short-circuits when all File destinations already match SHA-256 of the new bytes. PreCommit + PostCommit do not fire; backups are not created; the result reports SkippedAsIdempotent = true.

Defends against agent-restart retry storms that would otherwise hammer targets with no-op reloads. Operator-visible signal: certctl_deploy_idempotent_skip_total{target_type="..."}.

10. Troubleshooting matrix

Symptom Root cause Operator action
ErrValidateFailed: nginx -t failed Validate command rejected the staged config Read PreCommit's wrapped error for the nginx stderr; fix config
ErrReloadFailed: nginx -s reload failed; rolled back Reload command failed; rollback succeeded; serving the OLD cert Investigate why reload failed; re-deploy when fixed
ErrRollbackFailed Reload AND rollback both failed; in known-bad state Restore from Result.BackupPaths manually; run reload command directly; check disk space + ownership
post-deploy TLS verify SHA-256 mismatch New cert deployed but a different cert is being served (cached, wrong vhost, stale pod in load balancer) Check NGINX SSL session cache TTL; verify SNI; bump verify retries via PostDeployVerifyAttempts
chown ... permission denied (in agent log) Non-root agent OR target user doesn't exist on host Verify agent runs as root in production; check distro user (Debian: www-data, RHEL: nginx)
Backups accumulating in cert dir BackupRetention misconfigured Set BackupRetention: 3 (default) or higher on per-target config
File world-readable after deploy Default mode 0644 applied to new key file Set explicit KeyFileMode: 0640 (NGINX) or KeyFileMode: 0600 (Apache)

11. V3-Pro deferrals

Out of scope for the V2-free deploy-hardening I bundle:

  • Multi-region deployment coordination — orchestration of N data-center deploys with operator approval gates per stage.
  • Cert-pinning verification against mobile-app pin manifests.
  • Audit-evidence report generator — auto-export of the deploy audit trail in a reviewer-friendly format.
  • Customer-paid validation matrices — vendor-version certified quirks (e.g. "tested on F5 v15.1 + v17.0 + v17.5"). See cowork/deploy-hardening-ii-prompt.md for the per-vendor edge-case audit + integration test sidecars.

12. Per-connector quick reference

Paste-able config snippets for the most-used connectors. Full field reference at docs/connectors.md.

NGINX

target_type: nginx
target_config:
  cert_path: /etc/nginx/certs/cert.pem
  chain_path: /etc/nginx/certs/chain.pem
  key_path: /etc/nginx/certs/key.pem
  reload_command: "nginx -s reload"
  validate_command: "nginx -t"
  cert_file_mode: 0644
  key_file_mode: 0640
  post_deploy_verify:
    enabled: true
    endpoint: "nginx.example.com:443"
    timeout: 10s
  backup_retention: 3

HAProxy

target_type: haproxy
target_config:
  pem_path: /etc/haproxy/certs/cert.pem
  reload_command: "systemctl reload haproxy"
  validate_command: "haproxy -c -f /etc/haproxy/haproxy.cfg"
  pem_file_mode: 0600
  post_deploy_verify:
    enabled: true
    endpoint: "haproxy.example.com:443"

Traefik (file watcher; no reload command)

target_type: traefik
target_config:
  cert_dir: /etc/traefik/certs
  cert_file: cert.pem
  key_file: key.pem
  post_deploy_verify:
    enabled: true
    endpoint: "traefik.example.com:443"

See per-connector tests at internal/connector/target/<name>/<name>_atomic_test.go for the full failure-mode matrix each connector handles.