mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 20:01:31 +00:00
b4ce357803
Bundle 1 closure (2026-05-12 acquisition diligence audit). Closes the
acquisition-blocker chain: target.edit (default r-operator grant per
migrations/000029_rbac.up.sql:196) → arbitrary reload_command stored
without validation → agent createTargetConnector json.Unmarshal-only
→ sh -c on agent host. README's 'shell injection prevention on all
connector scripts' claim is now true at the chain level.
Server-side: new internal/connector/target/configcheck package + a
configcheck.Validate call in target.go::Create + ::Update +
::CreateTarget + ::UpdateTarget (all 4 entry points). Rejects shell
metacharacters in reload_command / validate_command / restart_command
for nginx, apache, haproxy, postfix/dovecot, javakeystore, ssh. Sentinel
errors.Is(err, service.ErrInvalidConnectorConfig) available for handler
400 mapping. Non-shell connector types (F5, IIS, Caddy, Traefik, Envoy,
cloud targets, K8s) are no-ops by design.
Agent-side: defense-in-depth connector.ValidateConfig(ctx, configJSON)
call in cmd/agent/main.go inserted between createTargetConnector and
DeployCertificate. This catches (a) configs pre-dating the server gate,
(b) encrypted-blob tampering, (c) per-connector filesystem invariants
that the server can't check.
F5 (S2 finding): proven docs-vs-code drift, not a security bug. The
applyDefaults function never set Insecure=true; runtime default has
always been Go zero-value (false → TLS verified). Three lying 'default
true' comments in f5/f5.go (lines 30, 45-47, 126) rewritten to match
actual code behavior.
Docs (C4 + C9): README L12 + L68 narrowed — 'any CA / any server' →
'Twelve native CA connectors plus an OpenSSL adapter; fifteen native
deployment-target connectors plus a proxy-agent pattern.' 'Every deploy
goes through atomic-write + ...' narrowed to file-based connectors with
inline link to per-target guarantee matrix. New deployment-model.md §1.6
ships a 15-target × 8-property guarantee table covering atomic write /
owner-perms / SHA-256 idempotency / pre-deploy snapshot / on-failure
rollback / post-deploy TLS verify / Prometheus counters / shell-injection
validation — including the K8s preview honesty marker (CLAIM-H4).
Tests: internal/connector/target/configcheck/configcheck_test.go covers
14 shell-injection payloads (semicolon, pipe, backtick, dollar-paren,
redirect, and-chain, newline, double-quote, escape, dollar-var) × 7
shell-using connectors + benign-command acceptance + non-shell no-op
behavior + empty config + malformed JSON. All pass.
Verification (run from /sessions/gifted-blissful-pasteur/mnt/cowork/certctl):
go fmt ./... # clean (no diffs)
go vet ./... # clean (no findings)
go test -short -count=1 ./internal/... ./cmd/...
# 60+ packages all ok, zero FAIL
Audit-Closes: BUNDLE-1 RT-C1 SEC-M4 CLAIM-M2 CLAIM-L3
Audit-Verifies-False: S2 (F5 'default insecure' was a comment lie, code was always secure)
402 lines
21 KiB
Markdown
402 lines
21 KiB
Markdown
# Deployment Atomicity, Post-Deploy Verification, and Rollback
|
||
|
||
> Last reviewed: 2026-05-05
|
||
|
||
> Deploy-hardening I master bundle (v2.X.0). Operator + integrator
|
||
> reference for the atomic-write + post-deploy TLS verify +
|
||
> rollback pipeline that closes the procurement-checklist gap with
|
||
> commercial competitors (Venafi, DigiCert Certificate Manager,
|
||
> Sectigo).
|
||
|
||
## 1. Overview
|
||
|
||
Before deploy-hardening I, certctl's target connectors used
|
||
duplicated `os.WriteFile` flows. A failure mid-deploy could leave
|
||
a target with a renewed cert but no chain (or vice versa); a
|
||
reload-fail produced a half-deployed state that required manual
|
||
rollback; a wrong-vhost cert was silent until users reported it.
|
||
|
||
Deploy-hardening I closes three procurement-checklist gaps in
|
||
a single shared primitive:
|
||
|
||
| Gap | Pre-bundle | Post-bundle |
|
||
|---|---|---|
|
||
| **Atomic deploy with rollback** | F5 only (transactional API) | 12 of 13 connectors via `deploy.Apply` (K8s pending Bundle 2 — see [Section 1.5](#15-audit-closure-status-2026-05-02-deployment-target-audit)) |
|
||
| **Post-deploy TLS verification** | None | NGINX/Apache/HAProxy/Traefik/Caddy/Envoy/Postfix all do TLS handshake + SHA-256 fingerprint compare; fail → rollback |
|
||
| **Vendor-specific deployment recipes** | Light docs | (Bundle II — per the project's deploy-hardening II spec) |
|
||
|
||
This document describes the operator-visible surface. The Go-level
|
||
contract lives at `internal/deploy/doc.go`.
|
||
|
||
## 1.6. Per-target guarantee matrix
|
||
|
||
Added 2026-05-12 (Bundle 1 / CLAIM-M2 closure). The README previously
|
||
claimed "every deploy goes through atomic-write + ownership-preservation
|
||
+ SHA-256 idempotency + per-target Prometheus counters + pre-deploy
|
||
snapshot + on-failure rollback." That claim is true for the file-based
|
||
deploy primitive only. Cloud / API targets use vendor-SDK semantics and
|
||
do not share the same primitive. This matrix is the authoritative
|
||
per-target answer.
|
||
|
||
Legend: ✓ = supported / always on. ✗ = not applicable to this target
|
||
family. ◐ = partial / vendor-specific equivalent. preview = ships but
|
||
the production code path is a stub (see CLAIM-H4).
|
||
|
||
| Target | Atomic write | Owner/perms preserved | SHA-256 idempotency | Pre-deploy snapshot | On-failure rollback | Post-deploy TLS verify | Prometheus counters | Server+agent shell-injection validation |
|
||
|---|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|
||
| NGINX | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
|
||
| Apache | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
|
||
| HAProxy | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
|
||
| Caddy | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ (no operator commands) |
|
||
| Traefik | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
|
||
| Envoy | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
|
||
| Postfix / Dovecot| ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
|
||
| SSH known-hosts | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ (no TLS endpoint) | ✓ | ✓ |
|
||
| JavaKeystore | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ (file format, no socket) | ✓ | ✓ |
|
||
| IIS | ◐ (Windows cert store API) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
|
||
| WinCertStore | ◐ (Windows cert store API) | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ |
|
||
| F5 BIG-IP | ✓ (iControl REST transaction) | ✗ (no FS) | ◐ (cert object name) | ◐ (transaction rollback) | ✓ (transaction rollback) | ✓ (mgmt API GET) | ✓ | ✗ |
|
||
| AWS ACM | ✗ (SDK call) | ✗ (no FS) | ◐ (ACM-side replace) | ✗ | ◐ (re-import old ARN) | ✗ | ✓ | ✗ |
|
||
| Azure Key Vault | ✗ (SDK call) | ✗ (no FS) | ◐ (KV-side versioning) | ✗ | ◐ (KV versioning) | ✗ | ✓ | ✗ |
|
||
| Kubernetes Secrets | preview | preview | preview | preview | preview | preview | preview | ✗ |
|
||
|
||
**Notes on the matrix:**
|
||
|
||
- **Atomic write / owner-perms / SHA-256 idempotency / snapshot / rollback** are properties of the shared `deploy.Apply` primitive in `internal/deploy/`. They apply to file-based targets where certctl writes to disk.
|
||
- **Cloud / API targets** (AWS ACM, Azure Key Vault) use the vendor SDK's import / replace operation. The vendor handles versioning and atomicity at their layer. certctl tracks the operation outcome via Prometheus counters; "rollback" in this row means "re-import the previous cert ARN" rather than the file-primitive's `os.Rename` rollback.
|
||
- **F5** uses iControl REST transactions for atomicity (deploy-hardening I docs above). It does not touch a filesystem; the snapshot/rollback semantics live in the F5 transaction protocol.
|
||
- **Kubernetes Secrets** ships but the production client (`realK8sClient`) returns `"real Kubernetes client not implemented"` for all methods (see `internal/connector/target/k8ssecret/k8ssecret.go:395+`). Operators evaluating against a real cluster should treat this connector as preview until the production client lands.
|
||
- **Server+agent shell-injection validation** (Bundle 1 / RT-C1 closure 2026-05-12) is on for every connector that accepts operator-supplied command strings: `reload_command`, `validate_command`, `restart_command`. Validation runs at API ingestion (`internal/service/target.go::Create` + `::Update` + `::CreateTarget` + `::UpdateTarget` via `internal/connector/target/configcheck`) AND on the agent before deploy (`cmd/agent/main.go` post-`createTargetConnector`, calling each connector's full `ValidateConfig` method). Connectors that do not accept operator shell strings (Caddy / Traefik / Envoy / cloud targets) skip this check by design.
|
||
|
||
## 1.5. Audit closure status (2026-05-02 deployment-target audit)
|
||
|
||
The 2026-05-02 deployment-target coverage audit
|
||
(the 2026-05-02 deployment-target audit) tightened the
|
||
atomic + rollback contract on the connectors below. All bundles in the
|
||
table are committed to `master` as of this section's last edit; commit
|
||
hashes pin to the canonical landing commit for each piece of work.
|
||
|
||
| Connector | Bundle | Commit | Closes |
|
||
|-----------------|-----------|-----------|--------|
|
||
| envoy | Bundle 3 | `d8cd981` | atomic SDS JSON write + post-deploy watcher pickup poll |
|
||
| traefik | Bundle 4 | `37634e6` | single `deploy.Apply` Plan + all-files atomicity + rollback |
|
||
| iis | Bundle 5 | `223f279` | pre-deploy `Get-WebBinding` snapshot + on-failure binding rollback |
|
||
| ssh | Bundle 6 | `eb39059` | pre-deploy SFTP snapshot + reload-failure rollback |
|
||
| wincertstore | Bundle 7 | `1dd1dd4` | `Get-ChildItem` snapshot + on-import-failure rollback |
|
||
| javakeystore | Bundle 8 | `87e0009` | `keytool -exportkeystore` snapshot + on-import-failure rollback + operator playbook for argv password |
|
||
| caddy | Bundle 9 | `8cda860` | duration metric fix + file-mode PEM validate + api-mode SHA-256 idempotency |
|
||
| postfix/dovecot | Bundle 11 | `88e8881` | applyDefaults + verify-fails-rollback test pin under Mode=dovecot |
|
||
|
||
**Outstanding from the same audit:**
|
||
|
||
- **Bundle 2 (k8ssecret).** The production `realK8sClient` is still a
|
||
stub (see Section 3 / row `k8ssecret` below). Replacing it with a
|
||
real `k8s.io/client-go` implementation + `ResourceVersion` plumbing
|
||
+ post-deploy SHA-256 verify + kubelet sync poll is the remaining
|
||
V2 P0 blocker. Tracking prompt:
|
||
the project's k8s-real-client spec.
|
||
|
||
Bundle 10 (per-connector loadtest harness, commit `6286cd4`) does not
|
||
modify the per-connector contract table; it's a CI / observability
|
||
addition documented separately at `deploy/test/loadtest/README.md`.
|
||
|
||
The original Bundle 1 audit spec read "soften the IIS / SSH /
|
||
WinCertStore / JavaKeystore rollback claims first while bundles 5–8
|
||
catch the implementation up". Execution order inverted that loop —
|
||
Bundles 3–11 shipped before the doc-realignment commit, so the rows
|
||
in Section 3 below are honest as-shipped without ever needing a
|
||
softening pass. The K8s row is the one exception, and Section 3's
|
||
notes call it out explicitly.
|
||
|
||
## 2. The atomic-write primitive — `Plan` / `Apply`
|
||
|
||
`internal/deploy.Apply(ctx, plan)` is the load-bearing entry
|
||
point. Connectors build a `Plan` describing one or more files +
|
||
their PreCommit (validate) and PostCommit (reload) hooks; Apply
|
||
executes them all-or-nothing.
|
||
|
||
```go
|
||
plan := deploy.Plan{
|
||
Files: []deploy.File{
|
||
{Path: "/etc/nginx/certs/cert.pem", Bytes: certPEM, Mode: 0644},
|
||
{Path: "/etc/nginx/certs/chain.pem", Bytes: chainPEM, Mode: 0644},
|
||
{Path: "/etc/nginx/certs/key.pem", Bytes: keyPEM, Mode: 0640},
|
||
},
|
||
PreCommit: func(ctx context.Context, tempPaths map[string]string) error {
|
||
// Run `nginx -t` against the staged config — bytes already
|
||
// written to <path>.certctl-tmp.<unix-nanos>.
|
||
return runValidate(ctx, "nginx -t")
|
||
},
|
||
PostCommit: func(ctx context.Context) error {
|
||
return runReload(ctx, "nginx -s reload")
|
||
},
|
||
}
|
||
res, err := deploy.Apply(ctx, plan)
|
||
```
|
||
|
||
Apply's algorithm:
|
||
|
||
1. Per-file mutex acquired (sync.Map; coarse-grained per-path
|
||
serialization).
|
||
2. SHA-256 idempotency short-circuit. If every File's destination
|
||
already matches, return `Result.SkippedAsIdempotent=true`
|
||
without firing PreCommit/PostCommit.
|
||
3. Pre-deploy backup: copy each existing destination to
|
||
`<path>.certctl-bak.<unix-nanos>`.
|
||
4. Write each File's bytes to `<path>.certctl-tmp.<unix-nanos>`
|
||
in the destination directory (same-filesystem rename).
|
||
5. Apply ownership (chown + chmod) to each temp file BEFORE
|
||
rename so the swap is atomic with the right perms.
|
||
6. Call `PreCommit(ctx, tempPaths)`. On error: clean up temps;
|
||
return `ErrValidateFailed`.
|
||
7. `os.Rename` each temp → final. POSIX guarantees atomic.
|
||
8. Call `PostCommit(ctx)`. On error: restore each backup; re-call
|
||
PostCommit. If second PostCommit also fails: return
|
||
`ErrRollbackFailed` (operator-actionable).
|
||
9. Janitor: prune backups beyond `Plan.BackupRetention`
|
||
(default 3, -1 to disable).
|
||
|
||
## 3. Per-connector atomic contract
|
||
|
||
| Connector | PreCommit (validate) | PostCommit (reload) | Post-deploy verify | Quirks |
|
||
|---|---|---|---|---|
|
||
| nginx | `nginx -t` | `nginx -s reload` | TLS handshake to `host:443` | Default key mode 0640 (worker reads via group) |
|
||
| apache | `apachectl configtest` | `apachectl graceful` | TLS handshake | Default key mode 0600; per-distro user (apache2/apache/httpd) |
|
||
| haproxy | `haproxy -c -f <cfg>` | `systemctl reload haproxy` | TLS handshake | Combined PEM (cert+chain+key in one file); default mode 0600 |
|
||
| traefik | (none — file watcher) | (none — file watcher auto-reloads) | TLS handshake | atomic-write only; ValidateOnly returns sentinel |
|
||
| caddy (file mode) | (none) | (none — file watcher) | TLS handshake | atomic-write replaces os.WriteFile |
|
||
| caddy (api mode) | Probe admin /config/ | POST /load (already atomic at admin server) | (admin server confirms) | ValidateOnly real impl probes admin API |
|
||
| envoy | (none — SDS file watcher) | (none — SDS file watcher) | TLS handshake | atomic-write replaces os.WriteFile |
|
||
| postfix | `postfix check` | `postfix reload` | TLS handshake to port 25 | Chain appended to cert if no ChainPath |
|
||
| dovecot | `doveconf -n` | `doveadm reload` | TLS handshake to port 993 | Same code path as postfix |
|
||
| f5 | (Authenticate probe) | (Transactional commit) | TLS handshake to VS | Already transactional; rollback automatic via failed commit |
|
||
| iis | (Get-WebSite probe) | (PowerShell cert install) | TLS handshake | Already explicit pre-deploy backup + post-rollback re-import |
|
||
| ssh | (Connect probe) | (SCP upload + remote chmod) | `tls.Dial` to remote TLS port | Pre-deploy SCP backup of remote files |
|
||
| wincertstore | (Get-ChildItem Cert:\) | (Import-PfxCertificate) | (admin probe) | Get-ChildItem snapshot for rollback |
|
||
| javakeystore | (`keytool -list`) | (`keytool -importkeystore`) | (admin probe) | keytool snapshot; rollback via `keytool -delete` + re-import |
|
||
| k8ssecret | (V2 blocker — see note below) | (V2 blocker — see note below) | (V2 blocker — see note below) | **V2 blocker — Bundle 2 of the 2026-05-02 deployment-target audit.** Production `realK8sClient` at `internal/connector/target/k8ssecret/k8ssecret.go:397-420` is a stub (every method returns `"real Kubernetes client not implemented — use NewWithClient for tests"`). The SHA-256 post-deploy verify and kubelet sync poll are designed but not yet implemented; production deploys to a real cluster fail with "not implemented" until Bundle 2 lands. Test mocks via `NewWithClient` work today. Tracking prompt: the project's k8s-real-client spec. |
|
||
|
||
> **Postfix vs Dovecot mode**: see "Choosing Mode=postfix vs Mode=dovecot" in
|
||
> `docs/connectors.md` for the per-mode defaults (cert/key paths, validate +
|
||
> reload commands), the dual-deploy guidance for mail servers running both
|
||
> daemons, and the test-pin reference (Bundle 11 commit `88e8881`).
|
||
|
||
## 4. Post-deploy TLS verification
|
||
|
||
Frozen decision 0.3 (deploy-hardening I): post-deploy verify is
|
||
**ON by default** when the operator configures
|
||
`PostDeployVerify.Endpoint`. Per-target opt-out via
|
||
`PostDeployVerify.Enabled = false`.
|
||
|
||
The connector-side flow:
|
||
|
||
```go
|
||
// After Apply returns successfully, the connector dials the
|
||
// configured endpoint, pulls the leaf cert SHA-256, and compares.
|
||
res := tlsprobe.ProbeTLS(ctx, "nginx-test:443", 10*time.Second)
|
||
if res.Fingerprint != certPEMToFingerprint(deployedCertPEM) {
|
||
// Mismatch — wrong vhost, NGINX serving cached cert,
|
||
// load-balanced target hit a different pod, etc.
|
||
rollbackToBackups(ctx, applyResult.BackupPaths)
|
||
emitAlert("post-deploy verify SHA-256 mismatch")
|
||
}
|
||
```
|
||
|
||
Retry with **exponential backoff** (default 3 attempts; 1s initial, 16s cap) defends
|
||
against load-balanced targets where the verify might hit a
|
||
different pod that hasn't picked up the new cert yet. Backoff grows 1s → 2s → 4s → 8s → 16s,
|
||
giving the LB fleet time to converge before giving up. Operators preserving V2 linear semantics
|
||
(every attempt waits the same interval) set `post_deploy_verify_max_backoff` equal to
|
||
`post_deploy_verify_backoff`.
|
||
|
||
```yaml
|
||
post_deploy_verify:
|
||
enabled: true
|
||
endpoint: "nginx.svc.cluster.local:443"
|
||
timeout: 10s
|
||
post_deploy_verify_attempts: 3
|
||
post_deploy_verify_backoff: 1s
|
||
post_deploy_verify_max_backoff: 16s
|
||
```
|
||
|
||
## 5. Rollback semantics
|
||
|
||
Rollback fires automatically on three triggers:
|
||
|
||
1. **PostCommit (reload) fails** → Apply restores backups + retries
|
||
reload. Returns `ErrReloadFailed` on success (degraded
|
||
no-op) or `ErrRollbackFailed` if the second reload also fails.
|
||
2. **Post-deploy verify fails** → Connector manually triggers
|
||
rollback (Apply already returned successfully). Backups are
|
||
restored + reload is invoked again. Same escalation path on
|
||
second failure.
|
||
3. **Mid-loop rename fails** (rare; only with cross-filesystem
|
||
misuse) → Apply rolls back the renames that already
|
||
succeeded.
|
||
|
||
`ErrRollbackFailed` is operator-actionable. The destination is in
|
||
a known-bad state; operators must either:
|
||
- Restore from `Result.BackupPaths` manually + run `<reload command>`
|
||
- Push a fresh known-good cert via the next deploy cycle
|
||
|
||
The `certctl_deploy_rollback_total{outcome="also_failed"}` metric
|
||
is the alert target.
|
||
|
||
## 6. ValidateOnly — dry-run mode
|
||
|
||
`target.Connector.ValidateOnly(ctx, request)` runs the validate
|
||
step without touching the live cert. Connectors that can't
|
||
dry-run (Traefik / Envoy / Caddy file mode) return
|
||
`target.ErrValidateOnlyNotSupported`.
|
||
|
||
| Connector | ValidateOnly |
|
||
|---|---|
|
||
| nginx | `nginx -t` |
|
||
| apache | `apachectl configtest` |
|
||
| haproxy | `haproxy -c -f <cfg>` |
|
||
| postfix/dovecot | `postfix check` / `doveconf -n` |
|
||
| caddy (api) | GET /config/ probe |
|
||
| caddy (file) / traefik / envoy | `ErrValidateOnlyNotSupported` |
|
||
| f5 | `client.Authenticate()` probe |
|
||
| iis | `Get-WebSite -Name <SiteName>` |
|
||
| ssh | `client.Connect()` probe |
|
||
| wincertstore | `Get-ChildItem Cert:\<loc>\<store>` |
|
||
| javakeystore | `keytool -list -keystore <path>` |
|
||
| k8ssecret | `client.GetSecret()` RBAC probe |
|
||
|
||
Operators preview a deploy via the agent's `--dry-run` flag (or
|
||
the equivalent CLI invocation).
|
||
|
||
## 7. File ownership + mode preservation
|
||
|
||
The single most common silent-failure mode pre-bundle: agent runs
|
||
as root, calls `os.WriteFile(path, bytes, 0600)`, locks NGINX out
|
||
of the existing nginx:nginx 0640 key file.
|
||
|
||
Per frozen decision 0.7, `deploy.Apply` resolves ownership via
|
||
this precedence:
|
||
|
||
1. Explicit `File.Mode` / `File.Owner` / `File.Group` (per-target
|
||
config) → use as given.
|
||
2. Existing destination file → preserve its `chown` + `chmod`.
|
||
3. `Plan.Defaults.Mode` / `.Owner` / `.Group` → use as fallback
|
||
for new files.
|
||
4. Nothing set → `os.WriteFile` default (0644) for new files;
|
||
preserved for existing.
|
||
|
||
Per-connector defaults (cross-distro, fall back to no-chown if
|
||
no candidate user exists):
|
||
|
||
| Connector | Default user | Default group | Default cert mode | Default key mode |
|
||
|---|---|---|---|---|
|
||
| nginx | nginx → www-data | nginx → www-data | 0644 | 0640 |
|
||
| apache | apache → www-data → httpd | same | 0644 | 0600 |
|
||
| haproxy | haproxy | haproxy | n/a (combined PEM) | 0600 |
|
||
| postfix | postfix → dovecot → _postfix | same | 0644 | 0600 |
|
||
| traefik | (none) | (none) | 0644 | 0600 |
|
||
| envoy | (none) | (none) | 0644 | 0600 |
|
||
| caddy | (none) | (none) | 0644 | 0600 |
|
||
|
||
## 8. Per-target deploy mutex
|
||
|
||
Phase 2 of the master bundle: the agent (`cmd/agent/main.go`)
|
||
serializes concurrent deploys to the same target ID via a
|
||
`sync.Map[targetID]*sync.Mutex`. Granularity per frozen decision
|
||
0.5: one mutex per target, NOT per (target, cert).
|
||
|
||
Cert deploy throughput is operator-grade tens-per-minute. Coarse
|
||
serialization is fine and simplifies reasoning about reload-side
|
||
race windows.
|
||
|
||
## 9. Idempotency via SHA-256
|
||
|
||
Every `deploy.Apply` short-circuits when all File destinations
|
||
already match SHA-256 of the new bytes. PreCommit + PostCommit do
|
||
not fire; backups are not created; the result reports
|
||
`SkippedAsIdempotent = true`.
|
||
|
||
Defends against agent-restart retry storms that would otherwise
|
||
hammer targets with no-op reloads. Operator-visible signal:
|
||
`certctl_deploy_idempotent_skip_total{target_type="..."}`.
|
||
|
||
## 10. Troubleshooting matrix
|
||
|
||
| Symptom | Root cause | Operator action |
|
||
|---|---|---|
|
||
| `ErrValidateFailed: nginx -t failed` | Validate command rejected the staged config | Read PreCommit's wrapped error for the nginx stderr; fix config |
|
||
| `ErrReloadFailed: nginx -s reload failed; rolled back` | Reload command failed; rollback succeeded; serving the OLD cert | Investigate why reload failed; re-deploy when fixed |
|
||
| `ErrRollbackFailed` | Reload AND rollback both failed; in known-bad state | Restore from `Result.BackupPaths` manually; run reload command directly; check disk space + ownership |
|
||
| `post-deploy TLS verify SHA-256 mismatch` | New cert deployed but a different cert is being served (cached, wrong vhost, stale pod in load balancer) | Check NGINX SSL session cache TTL; verify SNI; bump verify retries via `PostDeployVerifyAttempts` |
|
||
| `chown ... permission denied` (in agent log) | Non-root agent OR target user doesn't exist on host | Verify agent runs as root in production; check distro user (Debian: www-data, RHEL: nginx) |
|
||
| Backups accumulating in cert dir | BackupRetention misconfigured | Set `BackupRetention: 3` (default) or higher on per-target config |
|
||
| File world-readable after deploy | Default mode 0644 applied to new key file | Set explicit `KeyFileMode: 0640` (NGINX) or `KeyFileMode: 0600` (Apache) |
|
||
|
||
## 11. V3-Pro deferrals
|
||
|
||
Out of scope for the V2-free deploy-hardening I bundle:
|
||
|
||
- **Multi-region deployment coordination** — orchestration of N
|
||
data-center deploys with operator approval gates per stage.
|
||
- **Cert-pinning verification against mobile-app pin manifests**.
|
||
- **Audit-evidence report generator** — auto-export of the
|
||
deploy audit trail in a reviewer-friendly format.
|
||
- **Customer-paid validation matrices** — vendor-version certified
|
||
quirks (e.g. "tested on F5 v15.1 + v17.0 + v17.5"). See
|
||
the project's deploy-hardening II spec for the per-vendor
|
||
edge-case audit + integration test sidecars.
|
||
|
||
## 12. Per-connector quick reference
|
||
|
||
Paste-able config snippets for the most-used connectors. Full
|
||
field reference at `docs/connectors.md`.
|
||
|
||
### NGINX
|
||
|
||
```yaml
|
||
target_type: nginx
|
||
target_config:
|
||
cert_path: /etc/nginx/certs/cert.pem
|
||
chain_path: /etc/nginx/certs/chain.pem
|
||
key_path: /etc/nginx/certs/key.pem
|
||
reload_command: "nginx -s reload"
|
||
validate_command: "nginx -t"
|
||
cert_file_mode: 0644
|
||
key_file_mode: 0640
|
||
post_deploy_verify:
|
||
enabled: true
|
||
endpoint: "nginx.example.com:443"
|
||
timeout: 10s
|
||
backup_retention: 3
|
||
```
|
||
|
||
### HAProxy
|
||
|
||
```yaml
|
||
target_type: haproxy
|
||
target_config:
|
||
pem_path: /etc/haproxy/certs/cert.pem
|
||
reload_command: "systemctl reload haproxy"
|
||
validate_command: "haproxy -c -f /etc/haproxy/haproxy.cfg"
|
||
pem_file_mode: 0600
|
||
post_deploy_verify:
|
||
enabled: true
|
||
endpoint: "haproxy.example.com:443"
|
||
```
|
||
|
||
### Traefik (file watcher; no reload command)
|
||
|
||
```yaml
|
||
target_type: traefik
|
||
target_config:
|
||
cert_dir: /etc/traefik/certs
|
||
cert_file: cert.pem
|
||
key_file: key.pem
|
||
post_deploy_verify:
|
||
enabled: true
|
||
endpoint: "traefik.example.com:443"
|
||
```
|
||
|
||
See per-connector tests at
|
||
`internal/connector/target/<name>/<name>_atomic_test.go` for the
|
||
full failure-mode matrix each connector handles.
|