mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 12:21:31 +00:00
config: default hardening + operator docs (Phase 2 closure — SEC-H1, SEC-H3, SEC-M4, DEPL-H1, DEPL-M2 + doc-only carve-outs)
Eleven findings from the architecture diligence audit's Phase 2 bundle
closed in one PR. All touch the same backend config + Helm chart +
operator docs surface, so reviewing in one diff is the natural fit.
config.go: three new fail-closed Validate() branches behind sentinels
=====================================================================
Three new error sentinels exported from internal/config/config.go for
tests to pin via errors.Is + message-text:
- ErrAgentBootstrapTokenRequired (SEC-H1)
- ErrACMEInsecureWithoutAck (SEC-M4)
- ErrDemoModeAckExpired (SEC-H3)
SEC-H1 (staged): introduces CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY
as an opt-in feature flag. When true AND the bootstrap token is empty,
Validate() returns ErrAgentBootstrapTokenRequired and the server
refuses to start. Default in THIS release: false (warn-mode
pass-through preserved). WORKSPACE-ROADMAP.md schedules the default
flip to true for v2.2.0 — operators get one upgrade window.
SEC-M4: upgrades the existing boot-time WARN log for
CERTCTL_ACME_INSECURE=true into a hard refuse-to-start gate behind
CERTCTL_ACME_INSECURE_ACK=true. The ACK env var must be paired with
the existing INSECURE flag; either alone fails closed. The boot-time
WARN log at cmd/server/main.go:611 continues to fire for the ACK'd
case so every restart logs the reminder.
SEC-H3: tightens the sticky DemoModeAck bit so it expires after 24h.
When DemoModeAck=true, Validate() now requires CERTCTL_DEMO_MODE_ACK_TS
to be set as a unix-epoch timestamp within the last 24h (24h-tolerance
on the past side, 1-minute clock-skew on the future side). Catches the
"forgotten demo deployment promoted to production" failure mode —
next container restart past 24h refuses unless re-ack'd.
Tests in internal/config/config_test.go cover every new branch:
positive (passes when properly set), negative (each fail-closed path
fires with the matching sentinel + message-text). 11 new tests added.
Helm chart + HA runbook (DEPL-H1)
=================================
Created docs/operator/runbooks/ha.md documenting the three values
flips required for production HA: server.replicas, podDisruptionBudget,
service.sessionAffinity. Cross-link comments added to
deploy/helm/certctl/values.yaml next to the server.replicas (line 19)
and podDisruptionBudget (line 566) defaults. DEFAULTS DO NOT CHANGE
— that's the point per the prompt's 'do not flip networkPolicy default'
guidance: a default-enabled PDB blocks fresh helm install on
single-node clusters.
CI guard (DEPL-M2)
==================
scripts/ci-guards/no-change-me-in-prod-compose.sh grep-fails any
'change-me-' literal in compose files OTHER than docker-compose.demo.yml.
Catches the placeholder-credential-leak regression one layer earlier
than the runtime Validate() fail-closed guards from Bundle 2 (2026-05-12).
Excludes comment lines so docs explaining the pattern don't trip the
guard. Verified to fire on a synthetic leak; clean on the current tree.
Consolidated 'Security carve-outs' doc section
==============================================
docs/operator/security.md grows by one new section documenting the
seven existing carve-outs in one canonical place:
- SEC-M3: 3 InsecureSkipVerify=true sites (Agent dev, verify probe, tlsprobe)
- SEC-M5: F5 connector InsecureSkipVerify per-config field
- SEC-M4: ACME insecure + new ACK gate
- SEC-L1: CSP 'unsafe-inline' on style-src (Tailwind carve-out)
- SEC-L2: break-glass Argon2id rest-defense reminder
- SEC-L3: 1 MB body-size cap + CERTCTL_MAX_BODY_SIZE override
- DEPL-M2: change-me-* placeholder credentials in demo overlay
- DEPL-M3: K8s NetworkPolicy operator-opt-in default
Each entry cites the file:line, the rationale for the carve-out, and
the operator action.
CHANGELOG + ENVIRONMENTS coverage
==================================
CHANGELOG.md grows by one new '### Breaking changes (scheduled for
v2.2.0)' section under Unreleased, documenting SEC-H1 / SEC-M4 / SEC-H3
with explicit upgrade-window guidance for each.
deploy/ENVIRONMENTS.md adds five rows: AGENT_BOOTSTRAP_TOKEN +
AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY + DEMO_MODE_ACK + DEMO_MODE_ACK_TS +
ACME_INSECURE_ACK. G-3 env-docs-drift CI guard stays clean.
WORKSPACE-ROADMAP.md (cowork-side) schedules the SEC-H1 default-flip
for v2.2.0.
Sandbox limitation
==================
The certctl repo's working tree is 6.1 GB which fills the sandbox
volume; the go1.25.10 toolchain download (go.mod requires it,
sandbox has 1.25.9) keeps failing on disk-full. Local 'go build' /
'go test' were NOT run in this commit's verification path.
make verify MUST be run on the operator's workstation before push
per CLAUDE.md operating rules.
CI guards (no-change-me, G-3 env-docs-drift, doc-rot-detector, +
all existing) verified clean by running each individually.
Closes: cowork/certctl-architecture-diligence-audit.html#fix-SEC-H1,
cowork/certctl-architecture-diligence-audit.html#fix-SEC-H3,
cowork/certctl-architecture-diligence-audit.html#fix-SEC-M4,
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H1,
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M2,
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M3,
cowork/certctl-architecture-diligence-audit.html#fix-SEC-M3,
cowork/certctl-architecture-diligence-audit.html#fix-SEC-M5,
cowork/certctl-architecture-diligence-audit.html#fix-SEC-L1,
cowork/certctl-architecture-diligence-audit.html#fix-SEC-L2,
cowork/certctl-architecture-diligence-audit.html#fix-SEC-L3
This commit is contained in:
@@ -403,6 +403,124 @@ the end of step 4, extend the window before step 5.
|
||||
from the env var and restart. That's appropriate for a small env-var
|
||||
inventory; it would not scale to a per-user-key-issued model.
|
||||
|
||||
## Security carve-outs & operator-tunable defaults
|
||||
|
||||
Phase 2 of the architecture diligence remediation (2026-05-13)
|
||||
consolidated the following carve-outs into one canonical section so
|
||||
operators reviewing security posture have a single search target. Each
|
||||
entry cites the exact file:line of the carve-out, why it exists, and
|
||||
what the operator should do.
|
||||
|
||||
### TLS verification — dev escape hatches
|
||||
|
||||
certctl has three `InsecureSkipVerify=true` sites that are dev/probe
|
||||
escape hatches, never enabled by default in production:
|
||||
|
||||
- **Agent dev escape** — `cmd/agent/main.go:179` (wired from
|
||||
`cmd/agent/main.go:61` config field + `cmd/agent/main.go:1371` CLI
|
||||
flag). Operators flip this only when debugging an agent against a
|
||||
self-signed control plane that hasn't been added to the agent's
|
||||
trust store. Document as `--insecure-skip-verify` in the agent's
|
||||
install runbook; the agent logs a startup WARN any time the flag
|
||||
is set. SEC-M3 pins that the carve-out is intentional.
|
||||
- **Agent verification probe** — `cmd/agent/verify.go:78`. The probe
|
||||
intentionally opens a TLS connection with verification disabled so
|
||||
it can inspect any certificate the endpoint serves (including
|
||||
self-signed or expired ones — that's the whole point of a probe).
|
||||
The probe never returns trust state to a security-relevant code
|
||||
path; it only reads cert metadata. SEC-M3 pins this.
|
||||
- **tlsprobe (network scanner)** — `internal/tlsprobe/probe.go:54`.
|
||||
Same rationale as the agent verify probe — network discovery must
|
||||
introspect any certificate it finds, including the ones with the
|
||||
problems we're scanning for. SEC-M3 pins this.
|
||||
|
||||
### F5 target connector — `InsecureSkipVerify` per-config
|
||||
|
||||
The F5 target connector exposes an `Insecure: bool` field on its
|
||||
per-target config blob (default `false`). When set,
|
||||
`internal/connector/target/f5/f5.go:134` builds the HTTP client with
|
||||
`InsecureSkipVerify: config.Insecure`. SEC-M5 closure: operator
|
||||
opt-in for self-signed F5 BIG-IP device certs; mitigation is to run
|
||||
the F5 + the proxy-agent on a network-segmented internal subnet.
|
||||
Document in the F5 connector's per-target setup guide.
|
||||
|
||||
### ACME issuer — `CERTCTL_ACME_INSECURE` (now gated on ACK)
|
||||
|
||||
`internal/connector/issuer/acme/acme.go:201` builds the ACME HTTP
|
||||
client with `InsecureSkipVerify: true` for the Pebble integration
|
||||
test path. The per-issuer runtime setting comes from
|
||||
`CERTCTL_ACME_INSECURE` (`internal/config/config.go:2116`); Phase 2
|
||||
SEC-M4 closure (2026-05-13) added the fail-closed gate so the operator
|
||||
must ALSO set `CERTCTL_ACME_INSECURE_ACK=true` for the server to boot.
|
||||
Production deploys must never set either flag. The boot-time WARN log
|
||||
at `cmd/server/main.go:611` continues to fire for the ACK'd case so
|
||||
every restart logs the reminder.
|
||||
|
||||
### CSP `'unsafe-inline'` on `style-src`
|
||||
|
||||
`internal/api/middleware/securityheaders.go:58` ships the dashboard
|
||||
CSP with `style-src 'self' 'unsafe-inline'`. This is required because
|
||||
Tailwind compiles utility classes into a single stylesheet at build
|
||||
time, but inline-style attributes appear in the dashboard via inline
|
||||
`<svg>` elements + Recharts' `<ResponsiveContainer>` injecting inline
|
||||
width/height. SEC-L1 closure: the carve-out is necessary today; the
|
||||
planned tightening flow is the frontend audit's FE-H2 (icon library)
|
||||
+ decorative-SVG sweep that then unlocks the CSP hardening (drops
|
||||
`'unsafe-inline'`).
|
||||
|
||||
### Break-glass admin — Argon2id rest-defense reminder
|
||||
|
||||
The break-glass admin path (`docs/operator/runbooks/disaster-recovery.md`)
|
||||
hashes the operator-supplied password with Argon2id and stores the
|
||||
hash in the `breakglass_credentials` table. SEC-L2 reminder: the
|
||||
strength of the rest-defense is operator-supplied — pick a password
|
||||
with sufficient entropy (≥ 64 random bits via `openssl rand -base64
|
||||
12`) and rotate after every use. Argon2id resists offline cracking
|
||||
but an operator-supplied "Password123" hashes the same way.
|
||||
|
||||
### Body-size limit (1 MB default) — operator-tunable
|
||||
|
||||
The `http.MaxBytesReader` wrap caps inbound request bodies at 1 MB
|
||||
by default. The cap is necessary defense against unbounded-body DOS
|
||||
but catches legitimate operator workflows:
|
||||
|
||||
- Bulk truststore PEM bundle uploads (CA bundles for federated trust
|
||||
stores can be > 1 MB).
|
||||
- Multi-MB CRL pushes via the CRL-cache endpoint.
|
||||
- Bulk-import of certificates with embedded chains.
|
||||
|
||||
SEC-L3 closure: operators raise the cap via `CERTCTL_MAX_BODY_SIZE`
|
||||
(units: bytes; e.g. `CERTCTL_MAX_BODY_SIZE=10485760` for 10 MB).
|
||||
Document in `deploy/ENVIRONMENTS.md`.
|
||||
|
||||
### Demo Compose placeholder credentials
|
||||
|
||||
`deploy/docker-compose.demo.yml` ships `CERTCTL_AUTH_SECRET=change-me-in-production`,
|
||||
`CERTCTL_CONFIG_ENCRYPTION_KEY=change-me-32-char-encryption-key`, and
|
||||
`CERTCTL_API_KEY=change-me-in-production` as documented demo
|
||||
defaults. The runtime `Validate()` fail-closed guards
|
||||
(`internal/config/config.go::Validate`, Bundle 2 2026-05-12) refuse
|
||||
to start if those literal strings reach a non-demo config. Phase 2
|
||||
DEPL-M2 closure adds a CI guard
|
||||
(`scripts/ci-guards/no-change-me-in-prod-compose.sh`) that fails the
|
||||
build at PR time if a `change-me-*` literal leaks into a non-demo
|
||||
compose file — catching the regression one layer before the runtime
|
||||
guard fires.
|
||||
|
||||
### Kubernetes NetworkPolicy — operator-opt-in
|
||||
|
||||
`deploy/helm/certctl/templates/networkpolicy.yaml` ships the template
|
||||
but `deploy/helm/certctl/values.yaml` defaults `networkPolicy.enabled:
|
||||
false`. DEPL-M3 rationale: most Kubernetes clusters don't have a
|
||||
NetworkPolicy controller installed (kind / minikube / fresh k3s); a
|
||||
default-enabled NetworkPolicy renders fine but produces no
|
||||
enforcement, and bare-metal `kube-router`-style controllers may
|
||||
interpret a permissive default differently. Production deploys with a
|
||||
real NetworkPolicy controller (Calico, Cilium, Antrea) flip the
|
||||
values key to `true` and tune the policy in their values overlay.
|
||||
Document the production-enable in
|
||||
`docs/operator/runbooks/ha.md` (added Phase 2 DEPL-H1).
|
||||
|
||||
## Reporting a vulnerability
|
||||
|
||||
Email `certctl@proton.me`. Coordinated disclosure preferred; we will
|
||||
|
||||
Reference in New Issue
Block a user