config: default hardening + operator docs (Phase 2 closure — SEC-H1, SEC-H3, SEC-M4, DEPL-H1, DEPL-M2 + doc-only carve-outs)

Eleven findings from the architecture diligence audit's Phase 2 bundle closed in one PR. All touch the same backend config + Helm chart + operator docs surface, so reviewing in one diff is the natural fit. config.go: three new fail-closed Validate() branches behind sentinels ===================================================================== Three new error sentinels exported from internal/config/config.go for tests to pin via errors.Is + message-text: - ErrAgentBootstrapTokenRequired (SEC-H1) - ErrACMEInsecureWithoutAck (SEC-M4) - ErrDemoModeAckExpired (SEC-H3) SEC-H1 (staged): introduces CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY as an opt-in feature flag. When true AND the bootstrap token is empty, Validate() returns ErrAgentBootstrapTokenRequired and the server refuses to start. Default in THIS release: false (warn-mode pass-through preserved). WORKSPACE-ROADMAP.md schedules the default flip to true for v2.2.0 — operators get one upgrade window. SEC-M4: upgrades the existing boot-time WARN log for CERTCTL_ACME_INSECURE=true into a hard refuse-to-start gate behind CERTCTL_ACME_INSECURE_ACK=true. The ACK env var must be paired with the existing INSECURE flag; either alone fails closed. The boot-time WARN log at cmd/server/main.go:611 continues to fire for the ACK'd case so every restart logs the reminder. SEC-H3: tightens the sticky DemoModeAck bit so it expires after 24h. When DemoModeAck=true, Validate() now requires CERTCTL_DEMO_MODE_ACK_TS to be set as a unix-epoch timestamp within the last 24h (24h-tolerance on the past side, 1-minute clock-skew on the future side). Catches the "forgotten demo deployment promoted to production" failure mode — next container restart past 24h refuses unless re-ack'd. Tests in internal/config/config_test.go cover every new branch: positive (passes when properly set), negative (each fail-closed path fires with the matching sentinel + message-text). 11 new tests added. Helm chart + HA runbook (DEPL-H1) ================================= Created docs/operator/runbooks/ha.md documenting the three values flips required for production HA: server.replicas, podDisruptionBudget, service.sessionAffinity. Cross-link comments added to deploy/helm/certctl/values.yaml next to the server.replicas (line 19) and podDisruptionBudget (line 566) defaults. DEFAULTS DO NOT CHANGE — that's the point per the prompt's 'do not flip networkPolicy default' guidance: a default-enabled PDB blocks fresh helm install on single-node clusters. CI guard (DEPL-M2) ================== scripts/ci-guards/no-change-me-in-prod-compose.sh grep-fails any 'change-me-' literal in compose files OTHER than docker-compose.demo.yml. Catches the placeholder-credential-leak regression one layer earlier than the runtime Validate() fail-closed guards from Bundle 2 (2026-05-12). Excludes comment lines so docs explaining the pattern don't trip the guard. Verified to fire on a synthetic leak; clean on the current tree. Consolidated 'Security carve-outs' doc section ============================================== docs/operator/security.md grows by one new section documenting the seven existing carve-outs in one canonical place: - SEC-M3: 3 InsecureSkipVerify=true sites (Agent dev, verify probe, tlsprobe) - SEC-M5: F5 connector InsecureSkipVerify per-config field - SEC-M4: ACME insecure + new ACK gate - SEC-L1: CSP 'unsafe-inline' on style-src (Tailwind carve-out) - SEC-L2: break-glass Argon2id rest-defense reminder - SEC-L3: 1 MB body-size cap + CERTCTL_MAX_BODY_SIZE override - DEPL-M2: change-me-* placeholder credentials in demo overlay - DEPL-M3: K8s NetworkPolicy operator-opt-in default Each entry cites the file:line, the rationale for the carve-out, and the operator action. CHANGELOG + ENVIRONMENTS coverage ================================== CHANGELOG.md grows by one new '### Breaking changes (scheduled for v2.2.0)' section under Unreleased, documenting SEC-H1 / SEC-M4 / SEC-H3 with explicit upgrade-window guidance for each. deploy/ENVIRONMENTS.md adds five rows: AGENT_BOOTSTRAP_TOKEN + AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY + DEMO_MODE_ACK + DEMO_MODE_ACK_TS + ACME_INSECURE_ACK. G-3 env-docs-drift CI guard stays clean. WORKSPACE-ROADMAP.md (cowork-side) schedules the SEC-H1 default-flip for v2.2.0. Sandbox limitation ================== The certctl repo's working tree is 6.1 GB which fills the sandbox volume; the go1.25.10 toolchain download (go.mod requires it, sandbox has 1.25.9) keeps failing on disk-full. Local 'go build' / 'go test' were NOT run in this commit's verification path. make verify MUST be run on the operator's workstation before push per CLAUDE.md operating rules. CI guards (no-change-me, G-3 env-docs-drift, doc-rot-detector, + all existing) verified clean by running each individually. Closes: cowork/certctl-architecture-diligence-audit.html#fix-SEC-H1, cowork/certctl-architecture-diligence-audit.html#fix-SEC-H3, cowork/certctl-architecture-diligence-audit.html#fix-SEC-M4, cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H1, cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M2, cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M3, cowork/certctl-architecture-diligence-audit.html#fix-SEC-M3, cowork/certctl-architecture-diligence-audit.html#fix-SEC-M5, cowork/certctl-architecture-diligence-audit.html#fix-SEC-L1, cowork/certctl-architecture-diligence-audit.html#fix-SEC-L2, cowork/certctl-architecture-diligence-audit.html#fix-SEC-L3
2026-06-07 12:21:31 +00:00 · 2026-05-13 19:50:00 +00:00
parent 95cb002905
commit 69a2b5c55a
8 changed files with 722 additions and 4 deletions
@@ -403,6 +403,124 @@ the end of step 4, extend the window before step 5.
  from the env var and restart. That's appropriate for a small env-var
  inventory; it would not scale to a per-user-key-issued model.

+## Security carve-outs &amp; operator-tunable defaults
+
+Phase 2 of the architecture diligence remediation (2026-05-13)
+consolidated the following carve-outs into one canonical section so
+operators reviewing security posture have a single search target. Each
+entry cites the exact file:line of the carve-out, why it exists, and
+what the operator should do.
+
+### TLS verification — dev escape hatches
+
+certctl has three `InsecureSkipVerify=true` sites that are dev/probe
+escape hatches, never enabled by default in production:
+
+- **Agent dev escape** — `cmd/agent/main.go:179` (wired from
+  `cmd/agent/main.go:61` config field + `cmd/agent/main.go:1371` CLI
+  flag). Operators flip this only when debugging an agent against a
+  self-signed control plane that hasn't been added to the agent's
+  trust store. Document as `--insecure-skip-verify` in the agent's
+  install runbook; the agent logs a startup WARN any time the flag
+  is set. SEC-M3 pins that the carve-out is intentional.
+- **Agent verification probe** — `cmd/agent/verify.go:78`. The probe
+  intentionally opens a TLS connection with verification disabled so
+  it can inspect any certificate the endpoint serves (including
+  self-signed or expired ones — that's the whole point of a probe).
+  The probe never returns trust state to a security-relevant code
+  path; it only reads cert metadata. SEC-M3 pins this.
+- **tlsprobe (network scanner)** — `internal/tlsprobe/probe.go:54`.
+  Same rationale as the agent verify probe — network discovery must
+  introspect any certificate it finds, including the ones with the
+  problems we're scanning for. SEC-M3 pins this.
+
+### F5 target connector — `InsecureSkipVerify` per-config
+
+The F5 target connector exposes an `Insecure: bool` field on its
+per-target config blob (default `false`). When set,
+`internal/connector/target/f5/f5.go:134` builds the HTTP client with
+`InsecureSkipVerify: config.Insecure`. SEC-M5 closure: operator
+opt-in for self-signed F5 BIG-IP device certs; mitigation is to run
+the F5 + the proxy-agent on a network-segmented internal subnet.
+Document in the F5 connector's per-target setup guide.
+
+### ACME issuer — `CERTCTL_ACME_INSECURE` (now gated on ACK)
+
+`internal/connector/issuer/acme/acme.go:201` builds the ACME HTTP
+client with `InsecureSkipVerify: true` for the Pebble integration
+test path. The per-issuer runtime setting comes from
+`CERTCTL_ACME_INSECURE` (`internal/config/config.go:2116`); Phase 2
+SEC-M4 closure (2026-05-13) added the fail-closed gate so the operator
+must ALSO set `CERTCTL_ACME_INSECURE_ACK=true` for the server to boot.
+Production deploys must never set either flag. The boot-time WARN log
+at `cmd/server/main.go:611` continues to fire for the ACK'd case so
+every restart logs the reminder.
+
+### CSP `'unsafe-inline'` on `style-src`
+
+`internal/api/middleware/securityheaders.go:58` ships the dashboard
+CSP with `style-src 'self' 'unsafe-inline'`. This is required because
+Tailwind compiles utility classes into a single stylesheet at build
+time, but inline-style attributes appear in the dashboard via inline
+`<svg>` elements + Recharts' `<ResponsiveContainer>` injecting inline
+width/height. SEC-L1 closure: the carve-out is necessary today; the
+planned tightening flow is the frontend audit's FE-H2 (icon library)
+ decorative-SVG sweep that then unlocks the CSP hardening (drops
+`'unsafe-inline'`).
+
+### Break-glass admin — Argon2id rest-defense reminder
+
+The break-glass admin path (`docs/operator/runbooks/disaster-recovery.md`)
+hashes the operator-supplied password with Argon2id and stores the
+hash in the `breakglass_credentials` table. SEC-L2 reminder: the
+strength of the rest-defense is operator-supplied — pick a password
+with sufficient entropy (≥ 64 random bits via `openssl rand -base64
+12`) and rotate after every use. Argon2id resists offline cracking
+but an operator-supplied "Password123" hashes the same way.
+
+### Body-size limit (1 MB default) — operator-tunable
+
+The `http.MaxBytesReader` wrap caps inbound request bodies at 1 MB
+by default. The cap is necessary defense against unbounded-body DOS
+but catches legitimate operator workflows:
+
+- Bulk truststore PEM bundle uploads (CA bundles for federated trust
+  stores can be > 1 MB).
+- Multi-MB CRL pushes via the CRL-cache endpoint.
+- Bulk-import of certificates with embedded chains.
+
+SEC-L3 closure: operators raise the cap via `CERTCTL_MAX_BODY_SIZE`
+(units: bytes; e.g. `CERTCTL_MAX_BODY_SIZE=10485760` for 10 MB).
+Document in `deploy/ENVIRONMENTS.md`.
+
+### Demo Compose placeholder credentials
+
+`deploy/docker-compose.demo.yml` ships `CERTCTL_AUTH_SECRET=change-me-in-production`,
+`CERTCTL_CONFIG_ENCRYPTION_KEY=change-me-32-char-encryption-key`, and
+`CERTCTL_API_KEY=change-me-in-production` as documented demo
+defaults. The runtime `Validate()` fail-closed guards
+(`internal/config/config.go::Validate`, Bundle 2 2026-05-12) refuse
+to start if those literal strings reach a non-demo config. Phase 2
+DEPL-M2 closure adds a CI guard
+(`scripts/ci-guards/no-change-me-in-prod-compose.sh`) that fails the
+build at PR time if a `change-me-*` literal leaks into a non-demo
+compose file — catching the regression one layer before the runtime
+guard fires.
+
+### Kubernetes NetworkPolicy — operator-opt-in
+
+`deploy/helm/certctl/templates/networkpolicy.yaml` ships the template
+but `deploy/helm/certctl/values.yaml` defaults `networkPolicy.enabled:
+false`. DEPL-M3 rationale: most Kubernetes clusters don't have a
+NetworkPolicy controller installed (kind / minikube / fresh k3s); a
+default-enabled NetworkPolicy renders fine but produces no
+enforcement, and bare-metal `kube-router`-style controllers may
+interpret a permissive default differently. Production deploys with a
+real NetworkPolicy controller (Calico, Cilium, Antrea) flip the
+values key to `true` and tune the policy in their values overlay.
+Document the production-enable in
+`docs/operator/runbooks/ha.md` (added Phase 2 DEPL-H1).
+
 ## Reporting a vulnerability

 Email `certctl@proton.me`. Coordinated disclosure preferred; we will