# Deployment Atomicity, Post-Deploy Verification, and Rollback
> Deploy-hardening I master bundle (v2.X.0). Operator + integrator
> reference for the atomic-write + post-deploy TLS verify +
> rollback pipeline that closes the procurement-checklist gap with
> commercial competitors (Venafi, DigiCert Certificate Manager,
> Sectigo).
## 1. Overview
Before deploy-hardening I, certctl's target connectors used
duplicated `os.WriteFile` flows. A failure mid-deploy could leave
a target with a renewed cert but no chain (or vice versa); a
reload-fail produced a half-deployed state that required manual
rollback; a wrong-vhost cert was silent until users reported it.
Deploy-hardening I closes three procurement-checklist gaps in
a single shared primitive:
| Gap | Pre-bundle | Post-bundle |
|---|---|---|
| **Atomic deploy with rollback** | F5 only (transactional API) | 12 of 13 connectors via `deploy.Apply` (K8s pending Bundle 2 — see [Section 1.5](#15-audit-closure-status-2026-05-02-deployment-target-audit)) |
| **Post-deploy TLS verification** | None | NGINX/Apache/HAProxy/Traefik/Caddy/Envoy/Postfix all do TLS handshake + SHA-256 fingerprint compare; fail → rollback |
| **Vendor-specific deployment recipes** | Light docs | (Bundle II — `cowork/deploy-hardening-ii-prompt.md`) |
This document describes the operator-visible surface. The Go-level
contract lives at `internal/deploy/doc.go`.
## 1.5. Audit closure status (2026-05-02 deployment-target audit)
The 2026-05-02 deployment-target coverage audit
(`cowork/deployment-target-audit-2026-05-02/RESULTS.md`) tightened the
atomic + rollback contract on the connectors below. All bundles in the
table are committed to `master` as of this section's last edit; commit
hashes pin to the canonical landing commit for each piece of work.
| Connector | Bundle | Commit | Closes |
|-----------------|-----------|-----------|--------|
| envoy | Bundle 3 | `d8cd981` | atomic SDS JSON write + post-deploy watcher pickup poll |
| traefik | Bundle 4 | `37634e6` | single `deploy.Apply` Plan + all-files atomicity + rollback |
| iis | Bundle 5 | `223f279` | pre-deploy `Get-WebBinding` snapshot + on-failure binding rollback |
| ssh | Bundle 6 | `eb39059` | pre-deploy SFTP snapshot + reload-failure rollback |
| wincertstore | Bundle 7 | `1dd1dd4` | `Get-ChildItem` snapshot + on-import-failure rollback |
| javakeystore | Bundle 8 | `87e0009` | `keytool -exportkeystore` snapshot + on-import-failure rollback + operator playbook for argv password |
| caddy | Bundle 9 | `8cda860` | duration metric fix + file-mode PEM validate + api-mode SHA-256 idempotency |
| postfix/dovecot | Bundle 11 | `88e8881` | applyDefaults + verify-fails-rollback test pin under Mode=dovecot |
**Outstanding from the same audit:**
- **Bundle 2 (k8ssecret).** The production `realK8sClient` is still a
  stub (see Section 3 / row `k8ssecret` below). Replacing it with a
  real `k8s.io/client-go` implementation, `ResourceVersion` plumbing,
  post-deploy SHA-256 verify, and a kubelet sync poll is the remaining
  V2 P0 blocker. Tracking prompt:
  `cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md`.

Bundle 10 (per-connector loadtest harness, commit `6286cd4`) does not
modify the per-connector contract table; it's a CI / observability
addition documented separately at `deploy/test/loadtest/README.md`.
The original Bundle 1 audit spec read "soften the IIS / SSH /
WinCertStore / JavaKeystore rollback claims first while bundles 5-8
catch the implementation up". Execution order inverted that loop:
Bundles 3-11 shipped before the doc-realignment commit, so the rows
in Section 3 below are honest as-shipped without ever needing a
softening pass. The K8s row is the one exception, and Section 3's
notes call it out explicitly.
## 2. The atomic-write primitive — `Plan` / `Apply`
`internal/deploy.Apply(ctx, plan)` is the load-bearing entry
point. Connectors build a `Plan` describing one or more files +
their PreCommit (validate) and PostCommit (reload) hooks; Apply
executes them all-or-nothing.
```go
plan := deploy.Plan{
	Files: []deploy.File{
		{Path: "/etc/nginx/certs/cert.pem", Bytes: certPEM, Mode: 0644},
		{Path: "/etc/nginx/certs/chain.pem", Bytes: chainPEM, Mode: 0644},
		{Path: "/etc/nginx/certs/key.pem", Bytes: keyPEM, Mode: 0640},
	},
	PreCommit: func(ctx context.Context, tempPaths map[string]string) error {
		// Run `nginx -t` against the staged config; the bytes are
		// already written to <path>.certctl-tmp.<unix-nanos>.
		return runValidate(ctx, "nginx -t")
	},
	PostCommit: func(ctx context.Context) error {
		return runReload(ctx, "nginx -s reload")
	},
}
res, err := deploy.Apply(ctx, plan)
```
Apply's algorithm:
1. Per-file mutex acquired (sync.Map; coarse-grained per-path
serialization).
2. SHA-256 idempotency short-circuit. If every File's destination
already matches, return `Result.SkippedAsIdempotent=true`
without firing PreCommit/PostCommit.
3. Pre-deploy backup: copy each existing destination to
`<path>.certctl-bak.<unix-nanos>`.
4. Write each File's bytes to `<path>.certctl-tmp.<unix-nanos>`
in the destination directory (same-filesystem rename).
5. Apply ownership (chown + chmod) to each temp file BEFORE
rename so the swap is atomic with the right perms.
6. Call `PreCommit(ctx, tempPaths)`. On error: clean up temps;
return `ErrValidateFailed`.
7. `os.Rename` each temp → final. POSIX guarantees that each rename
is atomic when source and destination share a filesystem.
8. Call `PostCommit(ctx)`. On error: restore each backup; re-call
PostCommit. If second PostCommit also fails: return
`ErrRollbackFailed` (operator-actionable).
9. Janitor: prune backups beyond `Plan.BackupRetention`
(default 3, -1 to disable).
## 3. Per-connector atomic contract
| Connector | PreCommit (validate) | PostCommit (reload) | Post-deploy verify | Quirks |
|---|---|---|---|---|
| nginx | `nginx -t` | `nginx -s reload` | TLS handshake to `host:443` | Default key mode 0640 (worker reads via group) |
| apache | `apachectl configtest` | `apachectl graceful` | TLS handshake | Default key mode 0600; per-distro user (apache2/apache/httpd) |
| haproxy | `haproxy -c -f <cfg>` | `systemctl reload haproxy` | TLS handshake | Combined PEM (cert+chain+key in one file); default mode 0600 |
| traefik | (none — file watcher) | (none — file watcher auto-reloads) | TLS handshake | atomic-write only; ValidateOnly returns sentinel |
| caddy (file mode) | (none) | (none — file watcher) | TLS handshake | atomic-write replaces os.WriteFile |
| caddy (api mode) | Probe admin /config/ | POST /load (already atomic at admin server) | (admin server confirms) | ValidateOnly real impl probes admin API |
| envoy | (none — SDS file watcher) | (none — SDS file watcher) | TLS handshake | atomic-write replaces os.WriteFile |
| postfix | `postfix check` | `postfix reload` | TLS handshake to port 25 | Chain appended to cert if no ChainPath |
| dovecot | `doveconf -n` | `doveadm reload` | TLS handshake to port 993 | Same code path as postfix |
| f5 | (Authenticate probe) | (Transactional commit) | TLS handshake to VS | Already transactional; rollback automatic via failed commit |
| iis | (Get-WebSite probe) | (PowerShell cert install) | TLS handshake | Already explicit pre-deploy backup + post-rollback re-import |
| ssh | (Connect probe) | (SCP upload + remote chmod) | `tls.Dial` to remote TLS port | Pre-deploy SCP backup of remote files |
| wincertstore | (Get-ChildItem Cert:\) | (Import-PfxCertificate) | (admin probe) | Get-ChildItem snapshot for rollback |
| javakeystore | (`keytool -list`) | (`keytool -importkeystore`) | (admin probe) | keytool snapshot; rollback via `keytool -delete` + re-import |
| k8ssecret | (V2 blocker — see note below) | (V2 blocker — see note below) | (V2 blocker — see note below) | **V2 blocker — Bundle 2 of the 2026-05-02 deployment-target audit.** Production `realK8sClient` at `internal/connector/target/k8ssecret/k8ssecret.go:397-420` is a stub (every method returns `"real Kubernetes client not implemented — use NewWithClient for tests"`). The SHA-256 post-deploy verify and kubelet sync poll are designed but not yet implemented; production deploys to a real cluster fail with "not implemented" until Bundle 2 lands. Test mocks via `NewWithClient` work today. Tracking prompt: `cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md`. |
> **Postfix vs Dovecot mode**: see "Choosing Mode=postfix vs Mode=dovecot" in
> `docs/connectors.md` for the per-mode defaults (cert/key paths, validate +
> reload commands), the dual-deploy guidance for mail servers running both
> daemons, and the test-pin reference (Bundle 11 commit `88e8881`).
## 4. Post-deploy TLS verification
Frozen decision 0.3 (deploy-hardening I): post-deploy verify is
**ON by default** when the operator configures
`PostDeployVerify.Endpoint`. Per-target opt-out via
`PostDeployVerify.Enabled = false`.
The connector-side flow:
```go
// After Apply returns successfully, the connector dials the
// configured endpoint, pulls the leaf cert SHA-256, and compares.
res := tlsprobe.ProbeTLS(ctx, "nginx-test:443", 10*time.Second)
if res.Fingerprint != certPEMToFingerprint(deployedCertPEM) {
	// Mismatch: wrong vhost, NGINX serving cached cert,
	// load-balanced target hit a different pod, etc.
	rollbackToBackups(ctx, applyResult.BackupPaths)
	emitAlert("post-deploy verify SHA-256 mismatch")
}
```
Verification retries with **exponential backoff** (default 3 attempts; 1s initial wait, 16s cap). This defends
against load-balanced targets where the verify might hit a
different pod that hasn't picked up the new cert yet. The wait doubles after each failed attempt
(1s → 2s → 4s → 8s → 16s) and then stays at the cap, giving the LB fleet time to converge before giving up;
the later steps only come into play when the attempt count is raised above the default. Operators preserving V2 linear semantics
(every attempt waits the same interval) set `post_deploy_verify_max_backoff` equal to
`post_deploy_verify_backoff`.
```yaml
post_deploy_verify:
  enabled: true
  endpoint: "nginx.svc.cluster.local:443"
  timeout: 10s
post_deploy_verify_attempts: 3
post_deploy_verify_backoff: 1s
post_deploy_verify_max_backoff: 16s
```
## 5. Rollback semantics
Rollback fires automatically on three triggers:
1. **PostCommit (reload) fails** → Apply restores backups + retries
reload. Returns `ErrReloadFailed` if the retried reload succeeds
(the target is back on the old cert) or `ErrRollbackFailed` if the
second reload also fails.
2. **Post-deploy verify fails** → Connector manually triggers
rollback (Apply already returned successfully). Backups are
restored + reload is invoked again. Same escalation path on
second failure.
3. **Mid-loop rename fails** (rare; only with cross-filesystem
misuse) → Apply rolls back the renames that already
succeeded.
`ErrRollbackFailed` is operator-actionable. The destination is in
a known-bad state; operators must either:
- Restore from `Result.BackupPaths` manually + run `<reload command>`
- Push a fresh known-good cert via the next deploy cycle
The `certctl_deploy_rollback_total{outcome="also_failed"}` metric
is the alert target.
## 6. ValidateOnly — dry-run mode
`target.Connector.ValidateOnly(ctx, request)` runs the validate
step without touching the live cert. Connectors that can't
dry-run (Traefik / Envoy / Caddy file mode) return
`target.ErrValidateOnlyNotSupported`.
| Connector | ValidateOnly |
|---|---|
| nginx | `nginx -t` |
| apache | `apachectl configtest` |
| haproxy | `haproxy -c -f <cfg>` |
| postfix/dovecot | `postfix check` / `doveconf -n` |
| caddy (api) | GET /config/ probe |
| caddy (file) / traefik / envoy | `ErrValidateOnlyNotSupported` |
| f5 | `client.Authenticate()` probe |
| iis | `Get-WebSite -Name <SiteName>` |
| ssh | `client.Connect()` probe |
| wincertstore | `Get-ChildItem Cert:\<loc>\<store>` |
| javakeystore | `keytool -list -keystore <path>` |
| k8ssecret | `client.GetSecret()` RBAC probe |
Operators preview a deploy via the agent's `--dry-run` flag (or
the equivalent CLI invocation).
## 7. File ownership + mode preservation
The single most common silent-failure mode pre-bundle: agent runs
as root, calls `os.WriteFile(path, bytes, 0600)`, locks NGINX out
of the existing nginx:nginx 0640 key file.
Per frozen decision 0.7, `deploy.Apply` resolves ownership via
this precedence:
1. Explicit `File.Mode` / `File.Owner` / `File.Group` (per-target
config) → use as given.
2. Existing destination file → preserve its `chown` + `chmod`.
3. `Plan.Defaults.Mode` / `.Owner` / `.Group` → use as fallback
for new files.
4. Nothing set → `os.WriteFile` default (0644) for new files;
preserved for existing.
Per-connector defaults (cross-distro, fall back to no-chown if
no candidate user exists):
| Connector | Default user | Default group | Default cert mode | Default key mode |
|---|---|---|---|---|
| nginx | nginx → www-data | nginx → www-data | 0644 | 0640 |
| apache | apache → www-data → httpd | same | 0644 | 0600 |
| haproxy | haproxy | haproxy | n/a (combined PEM) | 0600 |
| postfix | postfix → dovecot → _postfix | same | 0644 | 0600 |
| traefik | (none) | (none) | 0644 | 0600 |
| envoy | (none) | (none) | 0644 | 0600 |
| caddy | (none) | (none) | 0644 | 0600 |
## 8. Per-target deploy mutex
Phase 2 of the master bundle: the agent (`cmd/agent/main.go`)
serializes concurrent deploys to the same target ID via a
`sync.Map[targetID]*sync.Mutex`. Granularity per frozen decision
0.5: one mutex per target, NOT per (target, cert).
Certificate deploy throughput is modest (on the order of tens of
deploys per minute), so coarse serialization costs little and
simplifies reasoning about reload-side race windows.
## 9. Idempotency via SHA-256
Every `deploy.Apply` short-circuits when all File destinations
already match SHA-256 of the new bytes. PreCommit + PostCommit do
not fire; backups are not created; the result reports
`SkippedAsIdempotent = true`.
The short-circuit defends against agent-restart retry storms that
would otherwise hammer targets with no-op reloads. Operator-visible signal:
`certctl_deploy_idempotent_skip_total{target_type="..."}`.
## 10. Troubleshooting matrix
| Symptom | Root cause | Operator action |
|---|---|---|
| `ErrValidateFailed: nginx -t failed` | Validate command rejected the staged config | Read PreCommit's wrapped error for the nginx stderr; fix config |
| `ErrReloadFailed: nginx -s reload failed; rolled back` | Reload command failed; rollback succeeded; serving the OLD cert | Investigate why reload failed; re-deploy when fixed |
| `ErrRollbackFailed` | Reload AND rollback both failed; in known-bad state | Restore from `Result.BackupPaths` manually; run reload command directly; check disk space + ownership |
| `post-deploy TLS verify SHA-256 mismatch` | New cert deployed but a different cert is being served (cached, wrong vhost, stale pod in load balancer) | Check NGINX SSL session cache TTL; verify SNI; bump verify retries via `PostDeployVerifyAttempts` |
| `chown ... permission denied` (in agent log) | Non-root agent OR target user doesn't exist on host | Verify agent runs as root in production; check distro user (Debian: www-data, RHEL: nginx) |
| Backups accumulating in cert dir | BackupRetention misconfigured | Set `BackupRetention: 3` (default) or higher on per-target config |
| File world-readable after deploy | Default mode 0644 applied to new key file | Set explicit `KeyFileMode: 0640` (NGINX) or `KeyFileMode: 0600` (Apache) |
## 11. V3-Pro deferrals
Out of scope for the V2-free deploy-hardening I bundle:
- **Multi-region deployment coordination** — orchestration of N
data-center deploys with operator approval gates per stage.
- **Cert-pinning verification against mobile-app pin manifests**.
- **SOC 2 evidence-report generator** — auto-export of the
deploy audit trail in the format SOC 2 auditors expect.
- **Customer-paid validation matrices** — vendor-version certified
quirks (e.g. "tested on F5 v15.1 + v17.0 + v17.5"). See
`cowork/deploy-hardening-ii-prompt.md` for the per-vendor
edge-case audit + integration test sidecars.
## 12. Per-connector quick reference
Paste-able config snippets for the most-used connectors. Full
field reference at `docs/connectors.md`.
### NGINX
```yaml
target_type: nginx
target_config:
  cert_path: /etc/nginx/certs/cert.pem
  chain_path: /etc/nginx/certs/chain.pem
  key_path: /etc/nginx/certs/key.pem
  reload_command: "nginx -s reload"
  validate_command: "nginx -t"
  cert_file_mode: 0644
  key_file_mode: 0640
  post_deploy_verify:
    enabled: true
    endpoint: "nginx.example.com:443"
    timeout: 10s
  backup_retention: 3
```
### HAProxy
```yaml
target_type: haproxy
target_config:
  pem_path: /etc/haproxy/certs/cert.pem
  reload_command: "systemctl reload haproxy"
  validate_command: "haproxy -c -f /etc/haproxy/haproxy.cfg"
  pem_file_mode: 0600
  post_deploy_verify:
    enabled: true
    endpoint: "haproxy.example.com:443"
```
### Traefik (file watcher; no reload command)
```yaml
target_type: traefik
target_config:
  cert_dir: /etc/traefik/certs
  cert_file: cert.pem
  key_file: key.pem
  post_deploy_verify:
    enabled: true
    endpoint: "traefik.example.com:443"
```
See per-connector tests at
`internal/connector/target/<name>/<name>_atomic_test.go` for the
full failure-mode matrix each connector handles.