diff --git a/CHANGELOG.md b/CHANGELOG.md index aad49db..0c9311f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,6 +2,50 @@ ## Unreleased +### Breaking changes (scheduled for v2.2.0) + +- **SEC-H1 staged: `CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY` opt-in flag.** + Phase 2 of the architecture diligence remediation (2026-05-13) introduces + a new env var that, when set to `true`, makes the server refuse to start + unless `CERTCTL_AGENT_BOOTSTRAP_TOKEN` is also set to a real value. + Default in this release: `false` (preserves the v2.1.x warn-mode + pass-through behavior for backward compatibility). Default flip to + `true` is scheduled for v2.2.0 per `WORKSPACE-ROADMAP.md`. + + **Operator action before the v2.2.0 upgrade:** generate a real + bootstrap token (`openssl rand -base64 32`) and set + `CERTCTL_AGENT_BOOTSTRAP_TOKEN` in your env. When v2.2.0 ships, the + deny-empty default flips to `true` and a missing or empty token will + fail closed at boot. Operators with the token already set: no action + required. + +- **SEC-M4: `CERTCTL_ACME_INSECURE` now requires explicit ACK.** + Pre-Phase-2, `CERTCTL_ACME_INSECURE=true` produced only a boot-time + WARN log. Post-Phase-2 (THIS release), the server refuses to start + unless `CERTCTL_ACME_INSECURE_ACK=true` is set alongside it. ACME + directory TLS verification is the load-bearing defense against a + network attacker intercepting ACME enrollment; the existing flag was + too easy to flip via a copy-pasted Pebble runbook. + + **Operator action:** if you intentionally run against a self-signed + ACME server (Pebble, step-ca, internal dev), add + `CERTCTL_ACME_INSECURE_ACK=true` to your env. Production deploys + MUST never set either flag. + +- **SEC-H3: `CERTCTL_DEMO_MODE_ACK` is no longer sticky — 24h re-ack required.** + Pre-Phase-2, setting `CERTCTL_DEMO_MODE_ACK=true` was sticky for the + lifetime of the container. Post-Phase-2, operators must ALSO set + `CERTCTL_DEMO_MODE_ACK_TS=$(date +%s)` to a unix epoch within the + last 24h. The next container restart past 24h refuses to start + unless a fresh TS is supplied. Catches the "forgotten demo deployment + promoted to production" failure mode. + + **Operator action:** demo deploys must set `CERTCTL_DEMO_MODE_ACK_TS` + at every `docker compose up`. The demo Compose helper script handles + this automatically when wired; standalone demo deploys add it + manually. Production deploys: this guard is irrelevant + (`CERTCTL_DEMO_MODE_ACK` should not be set in production). + ### Security - **Alg-downgrade defense relaxed for Keycloak-shape IdPs (v2.1.0 pre-tag fix).** diff --git a/deploy/ENVIRONMENTS.md b/deploy/ENVIRONMENTS.md index f05ff4c..b9a8dae 100644 --- a/deploy/ENVIRONMENTS.md +++ b/deploy/ENVIRONMENTS.md @@ -417,6 +417,11 @@ Every `CERTCTL_*` environment variable is read by the server's `internal/config/ | `CERTCTL_CORS_ORIGINS` | (empty) | Allowed CORS origins, comma-separated. Empty = deny all cross-origin | | `CERTCTL_RATE_LIMIT_RPS` | `10` | Requests per second per client | | `CERTCTL_RATE_LIMIT_BURST` | `20` | Burst allowance above RPS | +| `CERTCTL_AGENT_BOOTSTRAP_TOKEN` | (empty) | Agent-registration bootstrap secret. Empty = v2.1.x warn-mode pass-through. Set to a real value (`openssl rand -base64 32`); the deny-empty flag's default flip in v2.2.0 will require it. | +| `CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY` | `false` | Phase 2 SEC-H1 staged flag. When `true`, the server refuses to start unless `CERTCTL_AGENT_BOOTSTRAP_TOKEN` is non-empty. Default flip to `true` scheduled for v2.2.0. | +| `CERTCTL_DEMO_MODE_ACK` | `false` | Acknowledges demo-mode synthetic admin posture (required when `CERTCTL_AUTH_TYPE=none` binds to a non-loopback host). Must be paired with `CERTCTL_DEMO_MODE_ACK_TS` per Phase 2 SEC-H3. | +| `CERTCTL_DEMO_MODE_ACK_TS` | (empty) | Phase 2 SEC-H3: unix-epoch timestamp at which DemoModeAck was last acknowledged. When `CERTCTL_DEMO_MODE_ACK=true`, this must parse as a unix epoch within the last 24h. Set via `CERTCTL_DEMO_MODE_ACK_TS=$(date +%s)` at every `docker compose up`. | +| `CERTCTL_ACME_INSECURE_ACK` | `false` | Phase 2 SEC-M4: explicit ACK required to boot with `CERTCTL_ACME_INSECURE=true`. Production deploys MUST never set either flag. | ### Agent diff --git a/deploy/helm/certctl/values.yaml b/deploy/helm/certctl/values.yaml index 9adc2da..5c5f26c 100644 --- a/deploy/helm/certctl/values.yaml +++ b/deploy/helm/certctl/values.yaml @@ -15,7 +15,10 @@ fullnameOverride: "" # Certctl Server Configuration # ============================================================================== server: - # Number of replicas (for HA deployments) + # Number of replicas (for HA deployments). + # Phase 2 DEPL-H1: production HA is operator-opt-in across this field + # + podDisruptionBudget.enabled + server.service.sessionAffinity. + # See docs/operator/runbooks/ha.md for the smallest-possible HA overlay. replicas: 1 # Image configuration @@ -561,7 +564,11 @@ kubernetesSecrets: enabled: false # ============================================================================== -# Pod Disruption Budget (for HA deployments) +# Pod Disruption Budget (for HA deployments). +# Phase 2 DEPL-H1: defaults to enabled=false because a PDB template +# rendered at `replicas: 1` blocks every rolling restart on a +# single-node cluster. Production HA flips this to true alongside +# server.replicas ≥ 2. See docs/operator/runbooks/ha.md. # ============================================================================== podDisruptionBudget: enabled: false diff --git a/docs/operator/runbooks/ha.md b/docs/operator/runbooks/ha.md new file mode 100644 index 0000000..e5f840a --- /dev/null +++ b/docs/operator/runbooks/ha.md @@ -0,0 +1,113 @@ +# High-Availability Deployment Runbook + +> Last reviewed: 2026-05-13 + + + + +certctl's Helm chart ships with conservative single-replica defaults +that produce a working `helm install` against any Kubernetes cluster. +Production HA is operator-opt-in across three values surfaces — none +of which the chart flips on your behalf. + +This runbook documents the three changes, why they default off, and +the smallest-possible HA values overlay. + +--- + +## Why HA is opt-in (not default) + +Three load-bearing reasons the chart defaults are `replicas: 1` and +`podDisruptionBudget.enabled: false`: + +1. **A 1-replica deployment works on every cluster.** A multi-replica + default with `minAvailable: 2` would render a PDB at install time; + if the cluster has fewer than 2 nodes available (single-node + `kind` / `minikube` / fresh `k3s` clusters), Helm renders fine but + the first `kubectl rollout` blocks indefinitely waiting for the + second replica that can never schedule. Defaulting off keeps the + demo path one-command. + +2. **Postgres is a singleton in the bundled chart.** The chart's + `postgres-statefulset.yaml` runs ONE Postgres pod. Scaling the + server tier past 1 replica without an externalized Postgres + a + pgbouncer-style proxy doesn't actually buy HA at the DB tier — the + single Postgres pod is the failure domain. Operators who want true + HA route Postgres to a managed service (RDS, Cloud SQL, AlloyDB, + AKS-managed-Postgres, Aiven) or run their own cluster (Patroni, + CloudNativePG, Zalando postgres-operator). See the + [external-Postgres values example](../../deploy/helm/examples/values-external-db.yaml). + +3. **Session affinity is HTTPS-only.** The control plane is HTTPS-only + (TLS 1.3 pinned). Adding `sessionAffinity: ClientIP` to the + server Service mid-deployment when a sticky front-end LB is in + play (NGINX Ingress, Cloud LB with backend service) is the right + default for OIDC + RBAC session cookies. But operators who terminate + TLS at a different layer (Envoy mesh, Cloudflare in front of the + cluster) may have already solved affinity upstream — flipping it + on by default would over-constrain those paths. + +## The smallest production-HA overlay + +Three Helm values to flip: + +```yaml +# values-ha.yaml — copy into your overlay and edit to taste. + +server: + # ≥ 2 replicas is the minimum for the PDB to render. 3 gives you + # a true rolling-restart tolerance window (1 down for upgrade, + # 2 still serving) without dropping below minAvailable. + replicas: 3 + + service: + # Required when the front-end LB doesn't already enforce + # session affinity. OIDC + RBAC session cookies need to land + # on the same backend pod for the session lifetime. + sessionAffinity: ClientIP + +podDisruptionBudget: + # Renders the PDB template; controller-side voluntary disruptions + # (node-drain for k8s upgrade, cluster-autoscaler scale-down) + # respect this floor. + enabled: true + # With server.replicas: 3, minAvailable: 2 leaves headroom for one + # rolling restart at a time. + minAvailable: 2 + # maxUnavailable is mutually exclusive with minAvailable; pick one. + # maxUnavailable: 1 +``` + +Apply with: + +```bash +helm upgrade certctl deploy/helm/certctl/ -f values-ha.yaml +``` + +## What you still own as the operator + +Three things the chart does not solve, even at `replicas: 3`: + +1. **Postgres HA.** Route to an externalized Postgres (managed cloud + or operator-managed cluster). The chart's bundled StatefulSet + pod is a development/single-AZ pattern, not a production HA path. +2. **TLS material lifecycle.** The chart accepts an `existingSecret` + for the server cert; rotating it is operator-side automation. + The dashboard + agent can issue their own certs via the local CA + (eat-your-own-dogfood); the operator can wire `cert-manager` if + they prefer that path. +3. **Backup CronJob.** Phase 4 of the architecture diligence + remediation plan (DEPL-H2) ships a `backup-cronjob.yaml` template; + until that lands, backups are operator-run per the existing + `docs/operator/runbooks/postgres-backup.md` runbook. + +## Cross-references + +- `deploy/helm/certctl/values.yaml` lines 19, 446, 566 — the three + defaults this runbook documents. +- `docs/operator/runbooks/postgres-backup.md` — Postgres backup + runbook (today, operator-run). +- `docs/operator/runbooks/disaster-recovery.md` — DR procedure. +- Phase 4 (Helm Chart, DR, And Ops Surface) of the architecture + diligence remediation plan tracks the chart-level work + (backup CronJob, PrometheusRule starter, migration hook, etc.). diff --git a/docs/operator/security.md b/docs/operator/security.md index 369687e..466c65a 100644 --- a/docs/operator/security.md +++ b/docs/operator/security.md @@ -403,6 +403,124 @@ the end of step 4, extend the window before step 5. from the env var and restart. That's appropriate for a small env-var inventory; it would not scale to a per-user-key-issued model. +## Security carve-outs & operator-tunable defaults + +Phase 2 of the architecture diligence remediation (2026-05-13) +consolidated the following carve-outs into one canonical section so +operators reviewing security posture have a single search target. Each +entry cites the exact file:line of the carve-out, why it exists, and +what the operator should do. + +### TLS verification — dev escape hatches + +certctl has three `InsecureSkipVerify=true` sites that are dev/probe +escape hatches, never enabled by default in production: + +- **Agent dev escape** — `cmd/agent/main.go:179` (wired from + `cmd/agent/main.go:61` config field + `cmd/agent/main.go:1371` CLI + flag). Operators flip this only when debugging an agent against a + self-signed control plane that hasn't been added to the agent's + trust store. Document as `--insecure-skip-verify` in the agent's + install runbook; the agent logs a startup WARN any time the flag + is set. SEC-M3 pins that the carve-out is intentional. +- **Agent verification probe** — `cmd/agent/verify.go:78`. The probe + intentionally opens a TLS connection with verification disabled so + it can inspect any certificate the endpoint serves (including + self-signed or expired ones — that's the whole point of a probe). + The probe never returns trust state to a security-relevant code + path; it only reads cert metadata. SEC-M3 pins this. +- **tlsprobe (network scanner)** — `internal/tlsprobe/probe.go:54`. + Same rationale as the agent verify probe — network discovery must + introspect any certificate it finds, including the ones with the + problems we're scanning for. SEC-M3 pins this. + +### F5 target connector — `InsecureSkipVerify` per-config + +The F5 target connector exposes an `Insecure: bool` field on its +per-target config blob (default `false`). When set, +`internal/connector/target/f5/f5.go:134` builds the HTTP client with +`InsecureSkipVerify: config.Insecure`. SEC-M5 closure: operator +opt-in for self-signed F5 BIG-IP device certs; mitigation is to run +the F5 + the proxy-agent on a network-segmented internal subnet. +Document in the F5 connector's per-target setup guide. + +### ACME issuer — `CERTCTL_ACME_INSECURE` (now gated on ACK) + +`internal/connector/issuer/acme/acme.go:201` builds the ACME HTTP +client with `InsecureSkipVerify: true` for the Pebble integration +test path. The per-issuer runtime setting comes from +`CERTCTL_ACME_INSECURE` (`internal/config/config.go:2116`); Phase 2 +SEC-M4 closure (2026-05-13) added the fail-closed gate so the operator +must ALSO set `CERTCTL_ACME_INSECURE_ACK=true` for the server to boot. +Production deploys must never set either flag. The boot-time WARN log +at `cmd/server/main.go:611` continues to fire for the ACK'd case so +every restart logs the reminder. + +### CSP `'unsafe-inline'` on `style-src` + +`internal/api/middleware/securityheaders.go:58` ships the dashboard +CSP with `style-src 'self' 'unsafe-inline'`. This is required because +Tailwind compiles utility classes into a single stylesheet at build +time, but inline-style attributes appear in the dashboard via inline +`` elements + Recharts' `` injecting inline +width/height. SEC-L1 closure: the carve-out is necessary today; the +planned tightening flow is the frontend audit's FE-H2 (icon library) ++ decorative-SVG sweep that then unlocks the CSP hardening (drops +`'unsafe-inline'`). + +### Break-glass admin — Argon2id rest-defense reminder + +The break-glass admin path (`docs/operator/runbooks/disaster-recovery.md`) +hashes the operator-supplied password with Argon2id and stores the +hash in the `breakglass_credentials` table. SEC-L2 reminder: the +strength of the rest-defense is operator-supplied — pick a password +with sufficient entropy (≥ 64 random bits via `openssl rand -base64 +12`) and rotate after every use. Argon2id resists offline cracking +but an operator-supplied "Password123" hashes the same way. + +### Body-size limit (1 MB default) — operator-tunable + +The `http.MaxBytesReader` wrap caps inbound request bodies at 1 MB +by default. The cap is necessary defense against unbounded-body DOS +but catches legitimate operator workflows: + +- Bulk truststore PEM bundle uploads (CA bundles for federated trust + stores can be > 1 MB). +- Multi-MB CRL pushes via the CRL-cache endpoint. +- Bulk-import of certificates with embedded chains. + +SEC-L3 closure: operators raise the cap via `CERTCTL_MAX_BODY_SIZE` +(units: bytes; e.g. `CERTCTL_MAX_BODY_SIZE=10485760` for 10 MB). +Document in `deploy/ENVIRONMENTS.md`. + +### Demo Compose placeholder credentials + +`deploy/docker-compose.demo.yml` ships `CERTCTL_AUTH_SECRET=change-me-in-production`, +`CERTCTL_CONFIG_ENCRYPTION_KEY=change-me-32-char-encryption-key`, and +`CERTCTL_API_KEY=change-me-in-production` as documented demo +defaults. The runtime `Validate()` fail-closed guards +(`internal/config/config.go::Validate`, Bundle 2 2026-05-12) refuse +to start if those literal strings reach a non-demo config. Phase 2 +DEPL-M2 closure adds a CI guard +(`scripts/ci-guards/no-change-me-in-prod-compose.sh`) that fails the +build at PR time if a `change-me-*` literal leaks into a non-demo +compose file — catching the regression one layer before the runtime +guard fires. + +### Kubernetes NetworkPolicy — operator-opt-in + +`deploy/helm/certctl/templates/networkpolicy.yaml` ships the template +but `deploy/helm/certctl/values.yaml` defaults `networkPolicy.enabled: +false`. DEPL-M3 rationale: most Kubernetes clusters don't have a +NetworkPolicy controller installed (kind / minikube / fresh k3s); a +default-enabled NetworkPolicy renders fine but produces no +enforcement, and bare-metal `kube-router`-style controllers may +interpret a permissive default differently. Production deploys with a +real NetworkPolicy controller (Calico, Cilium, Antrea) flip the +values key to `true` and tune the policy in their values overlay. +Document the production-enable in +`docs/operator/runbooks/ha.md` (added Phase 2 DEPL-H1). + ## Reporting a vulnerability Email `certctl@proton.me`. Coordinated disclosure preferred; we will diff --git a/internal/config/config.go b/internal/config/config.go index e6da123..ee70787 100644 --- a/internal/config/config.go +++ b/internal/config/config.go @@ -2,6 +2,7 @@ package config import ( "crypto/tls" + "errors" "fmt" "log/slog" "net" @@ -11,6 +12,54 @@ import ( "time" ) +// Phase 2 (Default Hardening + Operator Docs) introduced three new +// error sentinels surfaced by Validate(). Tests pin them by +// errors.Is(err, ErrX) AND by message-text match for double safety; +// downstream callers may inspect the wrapped chain to react to a +// specific failure class without parsing the user-facing message. +// +// All three are staged behavior. Their default in this release is +// "off / opt-in" — production deploys must explicitly acknowledge to +// activate. The default-flip schedule lives in WORKSPACE-ROADMAP.md. +var ( + // ErrAgentBootstrapTokenRequired is returned by Validate() when + // CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY=true and the token is + // empty. Phase 2 SEC-H1 closure — staged feature flag (default + // false this release; default true scheduled for v2.2.0). + ErrAgentBootstrapTokenRequired = errors.New( + "CERTCTL_AGENT_BOOTSTRAP_TOKEN is empty and CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY=true — refuse to start. " + + "Generate a real secret (e.g. openssl rand -base64 32) and set CERTCTL_AGENT_BOOTSTRAP_TOKEN, " + + "or unset CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY to keep the warn-mode pass-through default", + ) + + // ErrACMEInsecureWithoutAck is returned by Validate() when + // CERTCTL_ACME_INSECURE=true and CERTCTL_ACME_INSECURE_ACK is missing + // or false. Phase 2 SEC-M4 closure: upgrade the existing boot-time + // WARN log to a hard refuse-to-start gate behind an explicit ACK. + ErrACMEInsecureWithoutAck = errors.New( + "CERTCTL_ACME_INSECURE=true but CERTCTL_ACME_INSECURE_ACK is not true — refuse to start. " + + "ACME directory TLS verification is DISABLED; every round-trip skips certificate chain validation. " + + "Production deploys MUST NOT enable this. To unlock for dev / Pebble / step-ca with self-signed roots, " + + "set CERTCTL_ACME_INSECURE_ACK=true alongside CERTCTL_ACME_INSECURE=true", + ) + + // ErrDemoModeAckExpired is returned by Validate() when + // DemoModeAck=true and CERTCTL_DEMO_MODE_ACK_TS is missing or older + // than 24h. Phase 2 SEC-H3 closure: the sticky DemoModeAck bit now + // expires, forcing operators of accidentally-promoted demo + // deployments to re-acknowledge the synthetic-admin posture daily. + ErrDemoModeAckExpired = errors.New( + "CERTCTL_DEMO_MODE_ACK=true requires CERTCTL_DEMO_MODE_ACK_TS= set within the last 24h — refuse to start. " + + "This guard catches forgotten demo deployments accidentally left in production. " + + "Set CERTCTL_DEMO_MODE_ACK_TS=$(date +%s) at every compose-up; the demo compose helper script does this automatically", + ) +) + +// demoModeAckMaxAge is the maximum allowable age of +// CERTCTL_DEMO_MODE_ACK_TS before the demo-mode ACK is considered +// expired. Hard-coded at 24h per Phase 2 SEC-H3. +const demoModeAckMaxAge = 24 * time.Hour + // Config represents the complete application configuration. // All configuration values are read from environment variables with CERTCTL_ prefix. type Config struct { @@ -655,6 +704,20 @@ type ACMEConfig struct { // Only use for testing with self-signed ACME servers like Pebble. Never in production. // Setting: CERTCTL_ACME_INSECURE environment variable. Insecure bool + + // InsecureAck is the Phase 2 SEC-M4 closure (2026-05-13): when + // Insecure=true, Validate() refuses to start unless InsecureAck is + // also true. Pre-Phase-2 the Insecure flag only emitted a boot-time + // WARN log; this guard converts that to a hard fail-closed gate so + // the dev-only escape hatch cannot be flipped accidentally in + // production via a copy-pasted Pebble runbook. + // + // Acknowledged (Insecure=true + InsecureAck=true): boot proceeds + WARN logs. + // Unack'd (Insecure=true + InsecureAck=false): ErrACMEInsecureWithoutAck. + // Off (Insecure=false): InsecureAck is ignored entirely. + // + // Setting: CERTCTL_ACME_INSECURE_ACK environment variable. + InsecureAck bool } // ACMEServerConfig is the SERVER-side ACME (RFC 8555 + RFC 9773 ARI) @@ -1590,6 +1653,16 @@ type AuthConfig struct { // Setting: CERTCTL_AGENT_BOOTSTRAP_TOKEN environment variable. AgentBootstrapToken string + // AgentBootstrapTokenDenyEmpty is the staged feature flag for SEC-H1 + // (Phase 2, 2026-05-13). When true AND AgentBootstrapToken is empty, + // Validate() returns ErrAgentBootstrapTokenRequired and the server + // refuses to start. Default: false (warn-mode pass-through preserved + // for backward compatibility with operators on the v2.1.x line). + // WORKSPACE-ROADMAP.md schedules the default flip to true for the + // v2.2.0 cut — operators get one upgrade-window to set a real token. + // Setting: CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY environment variable. + AgentBootstrapTokenDenyEmpty bool + // Session holds the Auth Bundle 2 Phase 4 session-service tunables. // Defaults are documented on the SessionConfig fields. The session // service is wired into cmd/server/main.go alongside the OIDC @@ -1624,6 +1697,22 @@ type AuthConfig struct { // Setting: CERTCTL_DEMO_MODE_ACK environment variable. DemoModeAck bool + // DemoModeAckTS is the unix-epoch timestamp at which DemoModeAck was + // last acknowledged. Phase 2 SEC-H3 closure (2026-05-13): the sticky + // DemoModeAck bit now expires after 24h. When DemoModeAck=true, + // Validate() requires DemoModeAckTS to be set AND parse as a unix + // epoch within the last demoModeAckMaxAge (24h); otherwise + // ErrDemoModeAckExpired fires and the server refuses to start. + // + // This catches the canonical "demo deployment accidentally + // promoted to production and forgotten about" failure mode: the + // container restart that re-loads config now refuses unless the + // operator re-supplies a fresh timestamp. + // + // Setting: CERTCTL_DEMO_MODE_ACK_TS (unix epoch, e.g. `$(date +%s)`). + // The demo compose helper sets this automatically at compose-up. + DemoModeAckTS string + // DemoModeResidualStrict refuses startup when Auth.Type != none // and `actor-demo-anon` has residual role grants in actor_roles. // Default false (emit WARN log + audit row instead). Audit @@ -1915,7 +2004,8 @@ func Load() (*Config, error) { Secret: getEnv("CERTCTL_AUTH_SECRET", ""), // Audit 2026-05-10 HIGH-12 closure: required-true to allow // CERTCTL_AUTH_TYPE=none with a non-loopback listen address. - DemoModeAck: getEnvBool("CERTCTL_DEMO_MODE_ACK", false), + DemoModeAck: getEnvBool("CERTCTL_DEMO_MODE_ACK", false), + DemoModeAckTS: getEnv("CERTCTL_DEMO_MODE_ACK_TS", ""), // Audit 2026-05-11 A-8 closure: when true, the preflight // residual-grants detector refuses startup if actor-demo-anon // has any actor_roles rows. Default false (WARN-only). @@ -1927,7 +2017,8 @@ func Load() (*Config, error) { // Bundle-5 / Audit H-007: agent-registration bootstrap secret. // Empty (default) = warn-mode pass-through; v2.2.0 will require it. - AgentBootstrapToken: getEnv("CERTCTL_AGENT_BOOTSTRAP_TOKEN", ""), + AgentBootstrapToken: getEnv("CERTCTL_AGENT_BOOTSTRAP_TOKEN", ""), + AgentBootstrapTokenDenyEmpty: getEnvBool("CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY", false), // Bundle 1 Phase 6: one-shot bootstrap token for the // /v1/auth/bootstrap endpoint that mints the first admin // key. Empty = bootstrap endpoint disabled (default). @@ -2114,6 +2205,7 @@ func Load() (*Config, error) { Profile: getEnv("CERTCTL_ACME_PROFILE", ""), ARIEnabled: getEnvBool("CERTCTL_ACME_ARI_ENABLED", false), Insecure: getEnvBool("CERTCTL_ACME_INSECURE", false), + InsecureAck: getEnvBool("CERTCTL_ACME_INSECURE_ACK", false), }, // ACME server (RFC 8555 + RFC 9773 ARI) — distinct from the // consumer-side ACME issuer connector above. Server uses @@ -2603,6 +2695,60 @@ func (c *Config) Validate() error { return fmt.Errorf("auth secret is required for auth type %s", c.Auth.Type) } + // Phase 2 SEC-H1 closure (2026-05-13): the AgentBootstrapTokenDenyEmpty + // staged feature flag. When the operator opts in via + // CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY=true AND the bootstrap + // token is empty, Validate() returns a fail-closed error. Default + // flag value is false, preserving the existing v2.1.x warn-mode + // pass-through behavior for backward compatibility. The default-flip + // to true is scheduled for v2.2.0 in WORKSPACE-ROADMAP.md — operators + // get one upgrade window to set a real token. + if c.Auth.AgentBootstrapTokenDenyEmpty && c.Auth.AgentBootstrapToken == "" { + return fmt.Errorf("phase-2 SEC-H1 fail-closed guard: %w", ErrAgentBootstrapTokenRequired) + } + + // Phase 2 SEC-M4 closure (2026-05-13): convert the existing boot-time + // WARN log for CERTCTL_ACME_INSECURE=true into a hard refuse-to-start + // gate behind an explicit ACK env var. The dev-only escape hatch can + // no longer be flipped accidentally via a copy-pasted Pebble runbook + // — production deploys must explicitly set both Insecure=true AND + // InsecureAck=true to acknowledge they understand the consequences. + // The boot-time WARN log path in cmd/server/main.go continues to fire + // for the ACK'd case so the operator sees the reminder every restart. + if c.ACME.Insecure && !c.ACME.InsecureAck { + return fmt.Errorf("phase-2 SEC-M4 fail-closed guard: %w", ErrACMEInsecureWithoutAck) + } + + // Phase 2 SEC-H3 closure (2026-05-13): the sticky DemoModeAck bit + // now expires after demoModeAckMaxAge (24h). When the operator sets + // CERTCTL_DEMO_MODE_ACK=true, they MUST also set + // CERTCTL_DEMO_MODE_ACK_TS=$(date +%s) and re-supply it within the + // 24h window on every restart. The demo compose helper script does + // this automatically at compose-up. Catches the canonical + // "forgotten demo deployment promoted to production" failure mode: + // the next container restart refuses unless the operator re-acks. + if c.Auth.DemoModeAck { + if c.Auth.DemoModeAckTS == "" { + return fmt.Errorf("phase-2 SEC-H3 fail-closed guard (missing TS): %w", ErrDemoModeAckExpired) + } + ackEpoch, err := strconv.ParseInt(strings.TrimSpace(c.Auth.DemoModeAckTS), 10, 64) + if err != nil { + return fmt.Errorf("phase-2 SEC-H3 fail-closed guard: CERTCTL_DEMO_MODE_ACK_TS=%q must parse as a unix epoch integer (try $(date +%%s)); parse error %w: %w", + c.Auth.DemoModeAckTS, err, ErrDemoModeAckExpired) + } + ackTime := time.Unix(ackEpoch, 0) + if time.Since(ackTime) > demoModeAckMaxAge { + return fmt.Errorf("phase-2 SEC-H3 fail-closed guard (TS age %s exceeds %s): %w", + time.Since(ackTime).Round(time.Second), demoModeAckMaxAge, ErrDemoModeAckExpired) + } + // Future-dated timestamps are also rejected — likely operator clock skew + // or a typo. Allow a small future skew (1m) to absorb minor clock drift. + if time.Until(ackTime) > time.Minute { + return fmt.Errorf("phase-2 SEC-H3 fail-closed guard (TS is %s in the future, exceeds 1m clock-skew tolerance): %w", + time.Until(ackTime).Round(time.Second), ErrDemoModeAckExpired) + } + } + // Audit 2026-05-10 HIGH-12 closure: refuse to start when // CERTCTL_AUTH_TYPE=none is bound to a non-loopback address unless // the operator explicitly acknowledges the bypass via diff --git a/internal/config/config_test.go b/internal/config/config_test.go index d4817e1..faacd17 100644 --- a/internal/config/config_test.go +++ b/internal/config/config_test.go @@ -7,10 +7,12 @@ import ( "crypto/x509" "crypto/x509/pkix" "encoding/pem" + "errors" "log/slog" "math/big" "os" "path/filepath" + "strconv" "strings" "testing" "time" @@ -398,6 +400,233 @@ func TestLoad_CommaSeparatedList(t *testing.T) { } } +// Phase 2 SEC-H1 (2026-05-13) — AgentBootstrapTokenDenyEmpty staged flag. +// When false (default), an empty token is permitted (v2.1.x warn-mode +// pass-through preserved). When true, an empty token fails closed. +func TestValidate_AgentBootstrapTokenDenyEmpty_DefaultFalse_AllowsEmpty(t *testing.T) { + cfg := &Config{ + Server: validServerConfig(t), + Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25}, + Log: LogConfig{Level: "info", Format: "json"}, + Auth: AuthConfig{Type: "api-key", Secret: "test-secret", AgentBootstrapToken: "", AgentBootstrapTokenDenyEmpty: false}, + Keygen: KeygenConfig{Mode: "agent"}, + Scheduler: validSchedulerConfig(), + } + if err := cfg.Validate(); err != nil { + t.Fatalf("Validate() returned error with deny-empty=false + empty token: %v", err) + } +} + +func TestValidate_AgentBootstrapTokenDenyEmpty_True_EmptyTokenFailsClosed(t *testing.T) { + cfg := &Config{ + Server: validServerConfig(t), + Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25}, + Log: LogConfig{Level: "info", Format: "json"}, + Auth: AuthConfig{Type: "api-key", Secret: "test-secret", AgentBootstrapToken: "", AgentBootstrapTokenDenyEmpty: true}, + Keygen: KeygenConfig{Mode: "agent"}, + Scheduler: validSchedulerConfig(), + } + err := cfg.Validate() + if err == nil { + t.Fatal("Validate() returned nil; want ErrAgentBootstrapTokenRequired") + } + if !errors.Is(err, ErrAgentBootstrapTokenRequired) { + t.Errorf("Validate() err = %v; want errors.Is to match ErrAgentBootstrapTokenRequired", err) + } + if !strings.Contains(err.Error(), "CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY=true") { + t.Errorf("Validate() error = %q; want message to mention the deny-empty env var name", err.Error()) + } +} + +func TestValidate_AgentBootstrapTokenDenyEmpty_True_RealTokenPasses(t *testing.T) { + cfg := &Config{ + Server: validServerConfig(t), + Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25}, + Log: LogConfig{Level: "info", Format: "json"}, + Auth: AuthConfig{Type: "api-key", Secret: "test-secret", AgentBootstrapToken: "a-real-32-byte-token-value-here-x", AgentBootstrapTokenDenyEmpty: true}, + Keygen: KeygenConfig{Mode: "agent"}, + Scheduler: validSchedulerConfig(), + } + if err := cfg.Validate(); err != nil { + t.Fatalf("Validate() returned error with deny-empty=true + real token: %v", err) + } +} + +// Phase 2 SEC-M4 (2026-05-13) — ACME insecure now requires explicit ACK. +func TestValidate_ACMEInsecure_WithoutAck_FailsClosed(t *testing.T) { + cfg := &Config{ + Server: validServerConfig(t), + Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25}, + Log: LogConfig{Level: "info", Format: "json"}, + Auth: AuthConfig{Type: "api-key", Secret: "test-secret"}, + Keygen: KeygenConfig{Mode: "agent"}, + Scheduler: validSchedulerConfig(), + ACME: ACMEConfig{Insecure: true, InsecureAck: false}, + } + err := cfg.Validate() + if err == nil { + t.Fatal("Validate() returned nil; want ErrACMEInsecureWithoutAck") + } + if !errors.Is(err, ErrACMEInsecureWithoutAck) { + t.Errorf("Validate() err = %v; want errors.Is to match ErrACMEInsecureWithoutAck", err) + } + if !strings.Contains(err.Error(), "CERTCTL_ACME_INSECURE_ACK") { + t.Errorf("Validate() error = %q; want message to mention CERTCTL_ACME_INSECURE_ACK", err.Error()) + } +} + +func TestValidate_ACMEInsecure_WithAck_Passes(t *testing.T) { + cfg := &Config{ + Server: validServerConfig(t), + Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25}, + Log: LogConfig{Level: "info", Format: "json"}, + Auth: AuthConfig{Type: "api-key", Secret: "test-secret"}, + Keygen: KeygenConfig{Mode: "agent"}, + Scheduler: validSchedulerConfig(), + ACME: ACMEConfig{Insecure: true, InsecureAck: true}, + } + if err := cfg.Validate(); err != nil { + t.Fatalf("Validate() returned error with Insecure=true + InsecureAck=true: %v", err) + } +} + +func TestValidate_ACMEInsecureFalse_IgnoresAck(t *testing.T) { + // InsecureAck is irrelevant when Insecure=false. No fail-closed branch. + cfg := &Config{ + Server: validServerConfig(t), + Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25}, + Log: LogConfig{Level: "info", Format: "json"}, + Auth: AuthConfig{Type: "api-key", Secret: "test-secret"}, + Keygen: KeygenConfig{Mode: "agent"}, + Scheduler: validSchedulerConfig(), + ACME: ACMEConfig{Insecure: false, InsecureAck: false}, + } + if err := cfg.Validate(); err != nil { + t.Fatalf("Validate() returned error with Insecure=false: %v", err) + } +} + +// Phase 2 SEC-H3 (2026-05-13) — DemoModeAck now expires after 24h via DemoModeAckTS. +// Note: DemoModeAck=true on a loopback bind requires only the timestamp guard; +// no HIGH-12 cross-firing because the existing HIGH-12 guard fires only on +// non-loopback hosts. All tests here keep the server host as loopback so we +// observe ONLY the new SEC-H3 behavior. +func TestValidate_DemoModeAck_MissingTS_FailsClosed(t *testing.T) { + cfg := &Config{ + Server: validServerConfig(t), + Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25}, + Log: LogConfig{Level: "info", Format: "json"}, + Auth: AuthConfig{Type: "api-key", Secret: "test-secret", DemoModeAck: true, DemoModeAckTS: ""}, + Keygen: KeygenConfig{Mode: "agent"}, + Scheduler: validSchedulerConfig(), + } + err := cfg.Validate() + if err == nil { + t.Fatal("Validate() returned nil; want ErrDemoModeAckExpired with empty TS") + } + if !errors.Is(err, ErrDemoModeAckExpired) { + t.Errorf("Validate() err = %v; want errors.Is to match ErrDemoModeAckExpired", err) + } + if !strings.Contains(err.Error(), "CERTCTL_DEMO_MODE_ACK_TS") { + t.Errorf("Validate() error = %q; want message to mention CERTCTL_DEMO_MODE_ACK_TS", err.Error()) + } +} + +func TestValidate_DemoModeAck_StaleTS_FailsClosed(t *testing.T) { + // TS older than 24h → expired. + staleEpoch := time.Now().Add(-25 * time.Hour).Unix() + cfg := &Config{ + Server: validServerConfig(t), + Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25}, + Log: LogConfig{Level: "info", Format: "json"}, + Auth: AuthConfig{Type: "api-key", Secret: "test-secret", DemoModeAck: true, DemoModeAckTS: strconv.FormatInt(staleEpoch, 10)}, + Keygen: KeygenConfig{Mode: "agent"}, + Scheduler: validSchedulerConfig(), + } + err := cfg.Validate() + if err == nil { + t.Fatal("Validate() returned nil; want ErrDemoModeAckExpired with 25h-old TS") + } + if !errors.Is(err, ErrDemoModeAckExpired) { + t.Errorf("Validate() err = %v; want errors.Is to match ErrDemoModeAckExpired", err) + } +} + +func TestValidate_DemoModeAck_FreshTS_Passes(t *testing.T) { + // TS within 24h → passes. + freshEpoch := time.Now().Add(-1 * time.Hour).Unix() + cfg := &Config{ + Server: validServerConfig(t), + Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25}, + Log: LogConfig{Level: "info", Format: "json"}, + Auth: AuthConfig{Type: "api-key", Secret: "test-secret", DemoModeAck: true, DemoModeAckTS: strconv.FormatInt(freshEpoch, 10)}, + Keygen: KeygenConfig{Mode: "agent"}, + Scheduler: validSchedulerConfig(), + } + if err := cfg.Validate(); err != nil { + t.Fatalf("Validate() returned error with 1h-old TS: %v", err) + } +} + +func TestValidate_DemoModeAck_NonNumericTS_FailsClosed(t *testing.T) { + cfg := &Config{ + Server: validServerConfig(t), + Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25}, + Log: LogConfig{Level: "info", Format: "json"}, + Auth: AuthConfig{Type: "api-key", Secret: "test-secret", DemoModeAck: true, DemoModeAckTS: "yesterday"}, + Keygen: KeygenConfig{Mode: "agent"}, + Scheduler: validSchedulerConfig(), + } + err := cfg.Validate() + if err == nil { + t.Fatal("Validate() returned nil; want ErrDemoModeAckExpired with non-numeric TS") + } + if !errors.Is(err, ErrDemoModeAckExpired) { + t.Errorf("Validate() err = %v; want errors.Is to match ErrDemoModeAckExpired", err) + } + if !strings.Contains(err.Error(), "parse") { + t.Errorf("Validate() error = %q; want message to mention parse failure", err.Error()) + } +} + +func TestValidate_DemoModeAck_FutureDatedTS_FailsClosed(t *testing.T) { + // > 1m future-dated → clock-skew rejection. + futureEpoch := time.Now().Add(10 * time.Minute).Unix() + cfg := &Config{ + Server: validServerConfig(t), + Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25}, + Log: LogConfig{Level: "info", Format: "json"}, + Auth: AuthConfig{Type: "api-key", Secret: "test-secret", DemoModeAck: true, DemoModeAckTS: strconv.FormatInt(futureEpoch, 10)}, + Keygen: KeygenConfig{Mode: "agent"}, + Scheduler: validSchedulerConfig(), + } + err := cfg.Validate() + if err == nil { + t.Fatal("Validate() returned nil; want ErrDemoModeAckExpired with future-dated TS") + } + if !errors.Is(err, ErrDemoModeAckExpired) { + t.Errorf("Validate() err = %v; want errors.Is to match ErrDemoModeAckExpired", err) + } + if !strings.Contains(err.Error(), "future") { + t.Errorf("Validate() error = %q; want message to mention future-dated TS", err.Error()) + } +} + +func TestValidate_DemoModeAckFalse_IgnoresTS(t *testing.T) { + // DemoModeAck=false → TS is irrelevant; no fail-closed branch. + cfg := &Config{ + Server: validServerConfig(t), + Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25}, + Log: LogConfig{Level: "info", Format: "json"}, + Auth: AuthConfig{Type: "api-key", Secret: "test-secret", DemoModeAck: false, DemoModeAckTS: ""}, + Keygen: KeygenConfig{Mode: "agent"}, + Scheduler: validSchedulerConfig(), + } + if err := cfg.Validate(); err != nil { + t.Fatalf("Validate() returned error with DemoModeAck=false: %v", err) + } +} + func TestValidate_ValidConfig(t *testing.T) { cfg := &Config{ Server: validServerConfig(t), diff --git a/scripts/ci-guards/no-change-me-in-prod-compose.sh b/scripts/ci-guards/no-change-me-in-prod-compose.sh new file mode 100755 index 0000000..9b71f69 --- /dev/null +++ b/scripts/ci-guards/no-change-me-in-prod-compose.sh @@ -0,0 +1,56 @@ +#!/usr/bin/env bash +# scripts/ci-guards/no-change-me-in-prod-compose.sh +# +# Phase 2 DEPL-M2 closure (2026-05-13): the demo Compose overlay +# (`deploy/docker-compose.demo.yml`) intentionally ships placeholder +# `change-me-*` credentials (CERTCTL_AUTH_SECRET=change-me-in-production, +# CERTCTL_CONFIG_ENCRYPTION_KEY=change-me-32-char-encryption-key, etc.) +# behind the DemoModeAck=true exemption in Validate(). The runtime +# fail-closed guards in internal/config/config.go::Validate (Bundle 2 +# 2026-05-12) refuse to start when those strings reach a non-demo +# config, so the runtime path is protected. +# +# This guard catches the SAME class of mistake one layer earlier — at +# CI / PR-review time, before the change reaches any operator's +# workstation. Specifically: any non-demo compose file (base +# `docker-compose.yml`, `docker-compose.dev.yml`, or any operator- +# authored overlay that doesn't carry the `.demo.yml` suffix) must +# never contain a `change-me-` literal. +# +# This is belt-and-suspenders alongside the runtime guard: a +# placeholder leaking into the base compose surfaces here BEFORE any +# operator's `docker compose up` triggers the fail-closed boot. +# +# Allowlist: deploy/docker-compose.demo.yml — the demo overlay +# legitimately ships the placeholder strings as documented defaults. + +set -e + +# Scan every tracked compose file under deploy/ except the demo overlay. +# Exclude comment lines (starting with optional whitespace then `#`) so +# documentation discussing the placeholder pattern doesn't trip the guard. +# The remaining matches are actual YAML key=value or env: lines. +VIOLATIONS=$(git ls-files 'deploy/*compose*.yml' 'deploy/*compose*.yaml' \ + | grep -v 'docker-compose\.demo\.yml$' \ + | xargs grep -nE 'change-me-' 2>/dev/null \ + | grep -vE ':\s*#' \ + || true) + +if [ -n "$VIOLATIONS" ]; then + echo "::error::DEPL-M2 regression: 'change-me-' placeholder credential leaked into a non-demo compose file." + echo "" + echo "Placeholder credentials are exempt only in deploy/docker-compose.demo.yml" + echo "(the demo overlay sets DemoModeAck=true at runtime, which unlocks them via" + echo "Validate()'s Bundle 2 fail-closed guards). Production / dev / staging compose" + echo "files must use real values." + echo "" + echo "Violations:" + echo "$VIOLATIONS" + echo "" + echo "Fix: either move the offending compose to the .demo.yml overlay, or replace" + echo "the placeholder with an env-var interpolation (\${CERTCTL_AUTH_SECRET:?required})" + echo "that fails compose-up cleanly when the operator forgot to set it." + exit 1 +fi + +echo "no-change-me-in-prod-compose guard OK: no 'change-me-' placeholders in non-demo compose files"