docs: Phase 2 mechanical file moves to subdirectory structure

Pure git mv operations; no content edits. Internal links remain pointing
at old paths and will be fixed in Phase 11. Per the Phase 1 audit
recommendations at cowork/docs-overhaul-phase-1-audit-2026-05-04/.

35 files moved across 8 audience-organized subdirectories:

  docs/getting-started/ (5):
    quickstart.md, concepts.md, examples.md, advanced-demo.md (was
    demo-advanced.md), why-certctl.md

  docs/reference/ (6):
    architecture.md, api.md (was openapi.md), mcp.md,
    intermediate-ca-hierarchy.md, deployment-model.md (was
    deployment-atomicity.md), vendor-matrix.md (was
    deployment-vendor-matrix.md)

  docs/reference/protocols/ (6):
    acme-server.md, acme-server-threat-model.md, scep-intune.md,
    est.md, crl-ocsp.md, async-ca-polling.md (was async-polling.md)

  docs/operator/ (4):
    security.md, tls.md, database-tls.md, approval-workflow.md

  docs/operator/runbooks/ (3):
    cloud-targets.md (was runbook-cloud-targets.md), expiry-alerts.md
    (was runbook-expiry-alerts.md), disaster-recovery.md

  docs/migration/ (3):
    from-certbot.md (was migrate-from-certbot.md), from-acmesh.md
    (was migrate-from-acmesh.md), cert-manager-coexistence.md (was
    certctl-for-cert-manager-users.md)

  docs/compliance/ (4):
    index.md (was compliance.md), soc2.md (was compliance-soc2.md),
    pci-dss.md (was compliance-pci-dss.md), nist-sp-800-57.md (was
    compliance-nist.md)

  docs/contributor/ (4):
    testing-strategy.md, test-environment.md (was test-env.md),
    ci-pipeline.md, qa-test-suite.md (was qa-test-guide.md)

Deferred to later Phase 2 sub-phases:
  - connectors.md split (Phase 4): docs/connectors.md +
    docs/connector-{apache,f5,iis,k8s,nginx}.md still at top level
  - testing-guide.md prune (Phase 5): docs/testing-guide.md still
    at top level
  - features.md disperse (Phase 6): docs/features.md still at top
    level
  - legacy-est-scep.md split (Phase 7): docs/legacy-est-scep.md
    still at top level
  - ACME walkthrough re-homing (Phase 8): three
    docs/acme-*-walkthrough.md still at top level
  - Upgrade docs archive (Phase 3): two docs/upgrade-*.md still
    at top level

Cross-reference updates (Phase 11) will happen after all moves and
content edits land. Internal links to docs/* paths are temporarily
broken until that phase completes.
This commit is contained in:
shankar0123
2026-05-05 02:49:28 +00:00
parent cda957f302
commit 3a807ae37e
35 changed files with 0 additions and 0 deletions
+359
View File
@@ -0,0 +1,359 @@
# Deployment Atomicity, Post-Deploy Verification, and Rollback
> Deploy-hardening I master bundle (v2.X.0). Operator + integrator
> reference for the atomic-write + post-deploy TLS verify +
> rollback pipeline that closes the procurement-checklist gap with
> commercial competitors (Venafi, DigiCert Certificate Manager,
> Sectigo).
## 1. Overview
Before deploy-hardening I, certctl's target connectors used
duplicated `os.WriteFile` flows. A failure mid-deploy could leave
a target with a renewed cert but no chain (or vice versa); a
reload-fail produced a half-deployed state that required manual
rollback; a wrong-vhost cert was silent until users reported it.
Deploy-hardening I closes three procurement-checklist gaps in
a single shared primitive:
| Gap | Pre-bundle | Post-bundle |
|---|---|---|
| **Atomic deploy with rollback** | F5 only (transactional API) | 12 of 13 connectors via `deploy.Apply` (K8s pending Bundle 2 — see [Section 1.5](#15-audit-closure-status-2026-05-02-deployment-target-audit)) |
| **Post-deploy TLS verification** | None | NGINX/Apache/HAProxy/Traefik/Caddy/Envoy/Postfix all do TLS handshake + SHA-256 fingerprint compare; fail → rollback |
| **Vendor-specific deployment recipes** | Light docs | (Bundle II — `cowork/deploy-hardening-ii-prompt.md`) |
This document describes the operator-visible surface. The Go-level
contract lives at `internal/deploy/doc.go`.
## 1.5. Audit closure status (2026-05-02 deployment-target audit)
The 2026-05-02 deployment-target coverage audit
(`cowork/deployment-target-audit-2026-05-02/RESULTS.md`) tightened the
atomic + rollback contract on the connectors below. All bundles in the
table are committed to `master` as of this section's last edit; commit
hashes pin to the canonical landing commit for each piece of work.
| Connector | Bundle | Commit | Closes |
|-----------------|-----------|-----------|--------|
| envoy | Bundle 3 | `d8cd981` | atomic SDS JSON write + post-deploy watcher pickup poll |
| traefik | Bundle 4 | `37634e6` | single `deploy.Apply` Plan + all-files atomicity + rollback |
| iis | Bundle 5 | `223f279` | pre-deploy `Get-WebBinding` snapshot + on-failure binding rollback |
| ssh | Bundle 6 | `eb39059` | pre-deploy SFTP snapshot + reload-failure rollback |
| wincertstore | Bundle 7 | `1dd1dd4` | `Get-ChildItem` snapshot + on-import-failure rollback |
| javakeystore | Bundle 8 | `87e0009` | `keytool -exportkeystore` snapshot + on-import-failure rollback + operator playbook for argv password |
| caddy | Bundle 9 | `8cda860` | duration metric fix + file-mode PEM validate + api-mode SHA-256 idempotency |
| postfix/dovecot | Bundle 11 | `88e8881` | applyDefaults + verify-fails-rollback test pin under Mode=dovecot |
**Outstanding from the same audit:**
- **Bundle 2 (k8ssecret).** The production `realK8sClient` is still a
stub (see Section 3 / row `k8ssecret` below). Replacing it with a
real `k8s.io/client-go` implementation + `ResourceVersion` plumbing
+ post-deploy SHA-256 verify + kubelet sync poll is the remaining
V2 P0 blocker. Tracking prompt:
`cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md`.
Bundle 10 (per-connector loadtest harness, commit `6286cd4`) does not
modify the per-connector contract table; it's a CI / observability
addition documented separately at `deploy/test/loadtest/README.md`.
The original Bundle 1 audit spec read "soften the IIS / SSH /
WinCertStore / JavaKeystore rollback claims first while bundles 58
catch the implementation up". Execution order inverted that loop —
Bundles 311 shipped before the doc-realignment commit, so the rows
in Section 3 below are honest as-shipped without ever needing a
softening pass. The K8s row is the one exception, and Section 3's
notes call it out explicitly.
## 2. The atomic-write primitive — `Plan` / `Apply`
`internal/deploy.Apply(ctx, plan)` is the load-bearing entry
point. Connectors build a `Plan` describing one or more files +
their PreCommit (validate) and PostCommit (reload) hooks; Apply
executes them all-or-nothing.
```go
plan := deploy.Plan{
Files: []deploy.File{
{Path: "/etc/nginx/certs/cert.pem", Bytes: certPEM, Mode: 0644},
{Path: "/etc/nginx/certs/chain.pem", Bytes: chainPEM, Mode: 0644},
{Path: "/etc/nginx/certs/key.pem", Bytes: keyPEM, Mode: 0640},
},
PreCommit: func(ctx context.Context, tempPaths map[string]string) error {
// Run `nginx -t` against the staged config — bytes already
// written to <path>.certctl-tmp.<unix-nanos>.
return runValidate(ctx, "nginx -t")
},
PostCommit: func(ctx context.Context) error {
return runReload(ctx, "nginx -s reload")
},
}
res, err := deploy.Apply(ctx, plan)
```
Apply's algorithm:
1. Per-file mutex acquired (sync.Map; coarse-grained per-path
serialization).
2. SHA-256 idempotency short-circuit. If every File's destination
already matches, return `Result.SkippedAsIdempotent=true`
without firing PreCommit/PostCommit.
3. Pre-deploy backup: copy each existing destination to
`<path>.certctl-bak.<unix-nanos>`.
4. Write each File's bytes to `<path>.certctl-tmp.<unix-nanos>`
in the destination directory (same-filesystem rename).
5. Apply ownership (chown + chmod) to each temp file BEFORE
rename so the swap is atomic with the right perms.
6. Call `PreCommit(ctx, tempPaths)`. On error: clean up temps;
return `ErrValidateFailed`.
7. `os.Rename` each temp → final. POSIX guarantees atomic.
8. Call `PostCommit(ctx)`. On error: restore each backup; re-call
PostCommit. If second PostCommit also fails: return
`ErrRollbackFailed` (operator-actionable).
9. Janitor: prune backups beyond `Plan.BackupRetention`
(default 3, -1 to disable).
## 3. Per-connector atomic contract
| Connector | PreCommit (validate) | PostCommit (reload) | Post-deploy verify | Quirks |
|---|---|---|---|---|
| nginx | `nginx -t` | `nginx -s reload` | TLS handshake to `host:443` | Default key mode 0640 (worker reads via group) |
| apache | `apachectl configtest` | `apachectl graceful` | TLS handshake | Default key mode 0600; per-distro user (apache2/apache/httpd) |
| haproxy | `haproxy -c -f <cfg>` | `systemctl reload haproxy` | TLS handshake | Combined PEM (cert+chain+key in one file); default mode 0600 |
| traefik | (none — file watcher) | (none — file watcher auto-reloads) | TLS handshake | atomic-write only; ValidateOnly returns sentinel |
| caddy (file mode) | (none) | (none — file watcher) | TLS handshake | atomic-write replaces os.WriteFile |
| caddy (api mode) | Probe admin /config/ | POST /load (already atomic at admin server) | (admin server confirms) | ValidateOnly real impl probes admin API |
| envoy | (none — SDS file watcher) | (none — SDS file watcher) | TLS handshake | atomic-write replaces os.WriteFile |
| postfix | `postfix check` | `postfix reload` | TLS handshake to port 25 | Chain appended to cert if no ChainPath |
| dovecot | `doveconf -n` | `doveadm reload` | TLS handshake to port 993 | Same code path as postfix |
| f5 | (Authenticate probe) | (Transactional commit) | TLS handshake to VS | Already transactional; rollback automatic via failed commit |
| iis | (Get-WebSite probe) | (PowerShell cert install) | TLS handshake | Already explicit pre-deploy backup + post-rollback re-import |
| ssh | (Connect probe) | (SCP upload + remote chmod) | `tls.Dial` to remote TLS port | Pre-deploy SCP backup of remote files |
| wincertstore | (Get-ChildItem Cert:\) | (Import-PfxCertificate) | (admin probe) | Get-ChildItem snapshot for rollback |
| javakeystore | (`keytool -list`) | (`keytool -importkeystore`) | (admin probe) | keytool snapshot; rollback via `keytool -delete` + re-import |
| k8ssecret | (V2 blocker — see note below) | (V2 blocker — see note below) | (V2 blocker — see note below) | **V2 blocker — Bundle 2 of the 2026-05-02 deployment-target audit.** Production `realK8sClient` at `internal/connector/target/k8ssecret/k8ssecret.go:397-420` is a stub (every method returns `"real Kubernetes client not implemented — use NewWithClient for tests"`). The SHA-256 post-deploy verify and kubelet sync poll are designed but not yet implemented; production deploys to a real cluster fail with "not implemented" until Bundle 2 lands. Test mocks via `NewWithClient` work today. Tracking prompt: `cowork/deployment-target-audit-2026-05-02/k8s-real-client-prompt.md`. |
> **Postfix vs Dovecot mode**: see "Choosing Mode=postfix vs Mode=dovecot" in
> `docs/connectors.md` for the per-mode defaults (cert/key paths, validate +
> reload commands), the dual-deploy guidance for mail servers running both
> daemons, and the test-pin reference (Bundle 11 commit `88e8881`).
## 4. Post-deploy TLS verification
Frozen decision 0.3 (deploy-hardening I): post-deploy verify is
**ON by default** when the operator configures
`PostDeployVerify.Endpoint`. Per-target opt-out via
`PostDeployVerify.Enabled = false`.
The connector-side flow:
```go
// After Apply returns successfully, the connector dials the
// configured endpoint, pulls the leaf cert SHA-256, and compares.
res := tlsprobe.ProbeTLS(ctx, "nginx-test:443", 10*time.Second)
if res.Fingerprint != certPEMToFingerprint(deployedCertPEM) {
// Mismatch — wrong vhost, NGINX serving cached cert,
// load-balanced target hit a different pod, etc.
rollbackToBackups(ctx, applyResult.BackupPaths)
emitAlert("post-deploy verify SHA-256 mismatch")
}
```
Retry with **exponential backoff** (default 3 attempts; 1s initial, 16s cap) defends
against load-balanced targets where the verify might hit a
different pod that hasn't picked up the new cert yet. Backoff grows 1s → 2s → 4s → 8s → 16s,
giving the LB fleet time to converge before giving up. Operators preserving V2 linear semantics
(every attempt waits the same interval) set `post_deploy_verify_max_backoff` equal to
`post_deploy_verify_backoff`.
```yaml
post_deploy_verify:
enabled: true
endpoint: "nginx.svc.cluster.local:443"
timeout: 10s
post_deploy_verify_attempts: 3
post_deploy_verify_backoff: 1s
post_deploy_verify_max_backoff: 16s
```
## 5. Rollback semantics
Rollback fires automatically on three triggers:
1. **PostCommit (reload) fails** → Apply restores backups + retries
reload. Returns `ErrReloadFailed` on success (degraded
no-op) or `ErrRollbackFailed` if the second reload also fails.
2. **Post-deploy verify fails** → Connector manually triggers
rollback (Apply already returned successfully). Backups are
restored + reload is invoked again. Same escalation path on
second failure.
3. **Mid-loop rename fails** (rare; only with cross-filesystem
misuse) → Apply rolls back the renames that already
succeeded.
`ErrRollbackFailed` is operator-actionable. The destination is in
a known-bad state; operators must either:
- Restore from `Result.BackupPaths` manually + run `<reload command>`
- Push a fresh known-good cert via the next deploy cycle
The `certctl_deploy_rollback_total{outcome="also_failed"}` metric
is the alert target.
## 6. ValidateOnly — dry-run mode
`target.Connector.ValidateOnly(ctx, request)` runs the validate
step without touching the live cert. Connectors that can't
dry-run (Traefik / Envoy / Caddy file mode) return
`target.ErrValidateOnlyNotSupported`.
| Connector | ValidateOnly |
|---|---|
| nginx | `nginx -t` |
| apache | `apachectl configtest` |
| haproxy | `haproxy -c -f <cfg>` |
| postfix/dovecot | `postfix check` / `doveconf -n` |
| caddy (api) | GET /config/ probe |
| caddy (file) / traefik / envoy | `ErrValidateOnlyNotSupported` |
| f5 | `client.Authenticate()` probe |
| iis | `Get-WebSite -Name <SiteName>` |
| ssh | `client.Connect()` probe |
| wincertstore | `Get-ChildItem Cert:\<loc>\<store>` |
| javakeystore | `keytool -list -keystore <path>` |
| k8ssecret | `client.GetSecret()` RBAC probe |
Operators preview a deploy via the agent's `--dry-run` flag (or
the equivalent CLI invocation).
## 7. File ownership + mode preservation
The single most common silent-failure mode pre-bundle: agent runs
as root, calls `os.WriteFile(path, bytes, 0600)`, locks NGINX out
of the existing nginx:nginx 0640 key file.
Per frozen decision 0.7, `deploy.Apply` resolves ownership via
this precedence:
1. Explicit `File.Mode` / `File.Owner` / `File.Group` (per-target
config) → use as given.
2. Existing destination file → preserve its `chown` + `chmod`.
3. `Plan.Defaults.Mode` / `.Owner` / `.Group` → use as fallback
for new files.
4. Nothing set → `os.WriteFile` default (0644) for new files;
preserved for existing.
Per-connector defaults (cross-distro, fall back to no-chown if
no candidate user exists):
| Connector | Default user | Default group | Default cert mode | Default key mode |
|---|---|---|---|---|
| nginx | nginx → www-data | nginx → www-data | 0644 | 0640 |
| apache | apache → www-data → httpd | same | 0644 | 0600 |
| haproxy | haproxy | haproxy | n/a (combined PEM) | 0600 |
| postfix | postfix → dovecot → _postfix | same | 0644 | 0600 |
| traefik | (none) | (none) | 0644 | 0600 |
| envoy | (none) | (none) | 0644 | 0600 |
| caddy | (none) | (none) | 0644 | 0600 |
## 8. Per-target deploy mutex
Phase 2 of the master bundle: the agent (`cmd/agent/main.go`)
serializes concurrent deploys to the same target ID via a
`sync.Map[targetID]*sync.Mutex`. Granularity per frozen decision
0.5: one mutex per target, NOT per (target, cert).
Cert deploy throughput is operator-grade tens-per-minute. Coarse
serialization is fine and simplifies reasoning about reload-side
race windows.
## 9. Idempotency via SHA-256
Every `deploy.Apply` short-circuits when all File destinations
already match SHA-256 of the new bytes. PreCommit + PostCommit do
not fire; backups are not created; the result reports
`SkippedAsIdempotent = true`.
Defends against agent-restart retry storms that would otherwise
hammer targets with no-op reloads. Operator-visible signal:
`certctl_deploy_idempotent_skip_total{target_type="..."}`.
## 10. Troubleshooting matrix
| Symptom | Root cause | Operator action |
|---|---|---|
| `ErrValidateFailed: nginx -t failed` | Validate command rejected the staged config | Read PreCommit's wrapped error for the nginx stderr; fix config |
| `ErrReloadFailed: nginx -s reload failed; rolled back` | Reload command failed; rollback succeeded; serving the OLD cert | Investigate why reload failed; re-deploy when fixed |
| `ErrRollbackFailed` | Reload AND rollback both failed; in known-bad state | Restore from `Result.BackupPaths` manually; run reload command directly; check disk space + ownership |
| `post-deploy TLS verify SHA-256 mismatch` | New cert deployed but a different cert is being served (cached, wrong vhost, stale pod in load balancer) | Check NGINX SSL session cache TTL; verify SNI; bump verify retries via `PostDeployVerifyAttempts` |
| `chown ... permission denied` (in agent log) | Non-root agent OR target user doesn't exist on host | Verify agent runs as root in production; check distro user (Debian: www-data, RHEL: nginx) |
| Backups accumulating in cert dir | BackupRetention misconfigured | Set `BackupRetention: 3` (default) or higher on per-target config |
| File world-readable after deploy | Default mode 0644 applied to new key file | Set explicit `KeyFileMode: 0640` (NGINX) or `KeyFileMode: 0600` (Apache) |
## 11. V3-Pro deferrals
Out of scope for the V2-free deploy-hardening I bundle:
- **Multi-region deployment coordination** — orchestration of N
data-center deploys with operator approval gates per stage.
- **Cert-pinning verification against mobile-app pin manifests**.
- **SOC 2 evidence-report generator** — auto-export of the
deploy audit trail in the format SOC 2 auditors expect.
- **Customer-paid validation matrices** — vendor-version certified
quirks (e.g. "tested on F5 v15.1 + v17.0 + v17.5"). See
`cowork/deploy-hardening-ii-prompt.md` for the per-vendor
edge-case audit + integration test sidecars.
## 12. Per-connector quick reference
Paste-able config snippets for the most-used connectors. Full
field reference at `docs/connectors.md`.
### NGINX
```yaml
target_type: nginx
target_config:
cert_path: /etc/nginx/certs/cert.pem
chain_path: /etc/nginx/certs/chain.pem
key_path: /etc/nginx/certs/key.pem
reload_command: "nginx -s reload"
validate_command: "nginx -t"
cert_file_mode: 0644
key_file_mode: 0640
post_deploy_verify:
enabled: true
endpoint: "nginx.example.com:443"
timeout: 10s
backup_retention: 3
```
### HAProxy
```yaml
target_type: haproxy
target_config:
pem_path: /etc/haproxy/certs/cert.pem
reload_command: "systemctl reload haproxy"
validate_command: "haproxy -c -f /etc/haproxy/haproxy.cfg"
pem_file_mode: 0600
post_deploy_verify:
enabled: true
endpoint: "haproxy.example.com:443"
```
### Traefik (file watcher; no reload command)
```yaml
target_type: traefik
target_config:
cert_dir: /etc/traefik/certs
cert_file: cert.pem
key_file: key.pem
post_deploy_verify:
enabled: true
endpoint: "traefik.example.com:443"
```
See per-connector tests at
`internal/connector/target/<name>/<name>_atomic_test.go` for the
full failure-mode matrix each connector handles.