mirror of https://github.com/shankar0123/certctl.git (synced 2026-06-07 16:21:30 +00:00)
config: default hardening + operator docs (Phase 2 closure — SEC-H1, SEC-H3, SEC-M4, DEPL-H1, DEPL-M2 + doc-only carve-outs)
Eleven findings from the architecture diligence audit's Phase 2 bundle
closed in one PR. All touch the same backend config + Helm chart +
operator docs surface, so reviewing in one diff is the natural fit.
config.go: three new fail-closed Validate() branches behind sentinels
=====================================================================
Three new error sentinels exported from internal/config/config.go for
tests to pin via errors.Is + message-text:
- ErrAgentBootstrapTokenRequired (SEC-H1)
- ErrACMEInsecureWithoutAck (SEC-M4)
- ErrDemoModeAckExpired (SEC-H3)
SEC-H1 (staged): introduces CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY
as an opt-in feature flag. When true AND the bootstrap token is empty,
Validate() returns ErrAgentBootstrapTokenRequired and the server
refuses to start. Default in THIS release: false (warn-mode
pass-through preserved). WORKSPACE-ROADMAP.md schedules the default
flip to true for v2.2.0 — operators get one upgrade window.
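
The staged SEC-H1 gate described above can be sketched as follows. This is a minimal illustration, not the code in internal/config/config.go: the `agentConfig` struct, its field names, the `validateBootstrapToken` helper, and the sentinel's message text are all assumptions made for the example. Only the sentinel name and the two environment variables come from this commit.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAgentBootstrapTokenRequired reuses the sentinel name from the commit;
// the message text is an assumption.
var ErrAgentBootstrapTokenRequired = errors.New("agent bootstrap token required")

// agentConfig is a hypothetical stand-in for the real config struct.
type agentConfig struct {
	BootstrapToken string // CERTCTL_AGENT_BOOTSTRAP_TOKEN
	DenyEmpty      bool   // CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY
}

// validateBootstrapToken fails closed only when the opt-in flag is set AND
// the token is empty; every other combination passes through.
func validateBootstrapToken(c agentConfig) error {
	if c.DenyEmpty && c.BootstrapToken == "" {
		return ErrAgentBootstrapTokenRequired
	}
	return nil
}

func main() {
	cases := []agentConfig{
		{BootstrapToken: "", DenyEmpty: false},      // this release's default
		{BootstrapToken: "", DenyEmpty: true},       // opt-in: refuse to start
		{BootstrapToken: "s3cret", DenyEmpty: true}, // opt-in, token present
	}
	for _, c := range cases {
		err := validateBootstrapToken(c)
		fmt.Printf("denyEmpty=%t tokenSet=%t refused=%t\n",
			c.DenyEmpty, c.BootstrapToken != "",
			errors.Is(err, ErrAgentBootstrapTokenRequired))
	}
}
```

Flipping the default in v2.2.0 would, under this sketch, amount to changing the zero value of `DenyEmpty`; the branch itself stays the same.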
SEC-M4: upgrades the existing boot-time WARN log for
CERTCTL_ACME_INSECURE=true into a hard refuse-to-start gate behind
CERTCTL_ACME_INSECURE_ACK=true. The ACK env var must be paired with
the existing INSECURE flag; either alone fails closed. The boot-time
WARN log at cmd/server/main.go:611 continues to fire for the ACK'd
case so every restart logs the reminder.
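
A hedged sketch of the SEC-M4 pairing rule follows. The helper name `validateACMEInsecure` and the message text are hypothetical. The sketch also assumes that the ACK-alone case returns the same sentinel as the INSECURE-alone case, which the description above does not spell out.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrACMEInsecureWithoutAck reuses the sentinel name from the commit;
// the message text is an assumption.
var ErrACMEInsecureWithoutAck = errors.New(
	"CERTCTL_ACME_INSECURE and CERTCTL_ACME_INSECURE_ACK must be set together")

// validateACMEInsecure accepts only the two paired states: both flags unset
// (secure default) or both set (insecure and acknowledged). Either flag alone
// fails closed.
func validateACMEInsecure(insecure, ack bool) error {
	if insecure != ack {
		return ErrACMEInsecureWithoutAck
	}
	return nil
}

func main() {
	cases := []struct{ insecure, ack bool }{
		{false, false},
		{true, false},
		{false, true},
		{true, true},
	}
	for _, c := range cases {
		err := validateACMEInsecure(c.insecure, c.ack)
		fmt.Printf("insecure=%t ack=%t refused=%t\n",
			c.insecure, c.ack, errors.Is(err, ErrACMEInsecureWithoutAck))
	}
}
```

In the acknowledged case the function returns nil, so the existing boot-time WARN log is still the only reminder on each restart.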
SEC-H3: tightens the sticky DemoModeAck bit so it expires after 24h.
When DemoModeAck=true, Validate() now requires CERTCTL_DEMO_MODE_ACK_TS
to be set as a unix-epoch timestamp within the last 24h (24h-tolerance
on the past side, 1-minute clock-skew on the future side). Catches the
"forgotten demo deployment promoted to production" failure mode —
next container restart past 24h refuses unless re-ack'd.
Tests in internal/config/config_test.go cover every new branch:
positive (passes when properly set), negative (each fail-closed path
fires with the matching sentinel + message-text). 11 new tests added.
Helm chart + HA runbook (DEPL-H1)
=================================
Created docs/operator/runbooks/ha.md documenting the three values
flips required for production HA: server.replicas, podDisruptionBudget,
service.sessionAffinity. Cross-link comments added to
deploy/helm/certctl/values.yaml next to the server.replicas (line 19)
and podDisruptionBudget (line 566) defaults. DEFAULTS DO NOT CHANGE
— that's the point per the prompt's 'do not flip networkPolicy default'
guidance: a default-enabled PDB blocks fresh helm install on
single-node clusters.
CI guard (DEPL-M2)
==================
scripts/ci-guards/no-change-me-in-prod-compose.sh grep-fails any
'change-me-' literal in compose files OTHER than docker-compose.demo.yml.
Catches the placeholder-credential-leak regression one layer earlier
than the runtime Validate() fail-closed guards from Bundle 2 (2026-05-12).
Excludes comment lines so docs explaining the pattern don't trip the
guard. Verified to fire on a synthetic leak; clean on the current tree.
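
The real guard is a shell script; the Go sketch below only restates its matching rule for readers who want to reason about edge cases. The function name `placeholderLines` and the example paths are hypothetical. Treating a line as a comment when it starts with `#` after trimming whitespace is an assumption about how the script excludes comments.

```go
package main

import (
	"bufio"
	"fmt"
	"path/filepath"
	"strings"
)

// demoOverlay is the one compose file allowed to carry placeholders.
const demoOverlay = "docker-compose.demo.yml"

// placeholderLines returns the 1-based line numbers in content that contain
// the 'change-me-' literal outside of comment lines. The demo overlay is
// exempt and always yields nil.
func placeholderLines(path, content string) []int {
	if filepath.Base(path) == demoOverlay {
		return nil
	}
	var hits []int
	sc := bufio.NewScanner(strings.NewReader(content))
	for n := 1; sc.Scan(); n++ {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, "#") {
			continue // whole-line comment: docs may mention the pattern
		}
		if strings.Contains(line, "change-me-") {
			hits = append(hits, n)
		}
	}
	return hits
}

func main() {
	compose := strings.Join([]string{
		"services:",
		"  db:",
		"    # change-me-postgres is demo-only",
		"    environment:",
		"      POSTGRES_PASSWORD: change-me-postgres",
	}, "\n")
	fmt.Println(placeholderLines("deploy/docker-compose.yml", compose))
	fmt.Println(placeholderLines("deploy/docker-compose.demo.yml", compose))
}
```

Note that this sketch still flags a placeholder that appears in a trailing inline comment; whether the shell guard does the same depends on its grep pattern.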
Consolidated 'Security carve-outs' doc section
==============================================
docs/operator/security.md grows by one new section documenting the
eight existing carve-outs in one canonical place:
- SEC-M3: 3 InsecureSkipVerify=true sites (Agent dev, verify probe, tlsprobe)
- SEC-M5: F5 connector InsecureSkipVerify per-config field
- SEC-M4: ACME insecure + new ACK gate
- SEC-L1: CSP 'unsafe-inline' on style-src (Tailwind carve-out)
- SEC-L2: break-glass Argon2id rest-defense reminder
- SEC-L3: 1 MB body-size cap + CERTCTL_MAX_BODY_SIZE override
- DEPL-M2: change-me-* placeholder credentials in demo overlay
- DEPL-M3: K8s NetworkPolicy operator-opt-in default
Each entry cites the file:line, the rationale for the carve-out, and
the operator action.
CHANGELOG + ENVIRONMENTS coverage
==================================
CHANGELOG.md grows by one new '### Breaking changes (scheduled for
v2.2.0)' section under Unreleased, documenting SEC-H1 / SEC-M4 / SEC-H3
with explicit upgrade-window guidance for each.
deploy/ENVIRONMENTS.md adds five rows: AGENT_BOOTSTRAP_TOKEN +
AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY + DEMO_MODE_ACK + DEMO_MODE_ACK_TS +
ACME_INSECURE_ACK. G-3 env-docs-drift CI guard stays clean.
WORKSPACE-ROADMAP.md (cowork-side) schedules the SEC-H1 default-flip
for v2.2.0.
Sandbox limitation
==================
The certctl repo's working tree is 6.1 GB which fills the sandbox
volume; the go1.25.10 toolchain download (go.mod requires it,
sandbox has 1.25.9) keeps failing on disk-full. Local 'go build' /
'go test' were NOT run in this commit's verification path.
make verify MUST be run on the operator's workstation before push
per CLAUDE.md operating rules.
CI guards (no-change-me, G-3 env-docs-drift, doc-rot-detector, +
all existing) verified clean by running each individually.
Closes: cowork/certctl-architecture-diligence-audit.html#fix-SEC-H1,
cowork/certctl-architecture-diligence-audit.html#fix-SEC-H3,
cowork/certctl-architecture-diligence-audit.html#fix-SEC-M4,
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H1,
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M2,
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M3,
cowork/certctl-architecture-diligence-audit.html#fix-SEC-M3,
cowork/certctl-architecture-diligence-audit.html#fix-SEC-M5,
cowork/certctl-architecture-diligence-audit.html#fix-SEC-L1,
cowork/certctl-architecture-diligence-audit.html#fix-SEC-L2,
cowork/certctl-architecture-diligence-audit.html#fix-SEC-L3
@@ -0,0 +1,113 @@
# High-Availability Deployment Runbook

> Last reviewed: 2026-05-13

<!-- Phase 2 DEPL-H1 closure -->

certctl's Helm chart ships with conservative single-replica defaults
that produce a working `helm install` against any Kubernetes cluster.
Production HA is operator-opt-in across three values surfaces, none
of which the chart flips on your behalf.

This runbook documents the three changes, why they default off, and
the smallest-possible HA values overlay.

---

## Why HA is opt-in (not default)

Three load-bearing reasons the chart defaults are `replicas: 1` and
`podDisruptionBudget.enabled: false`:

1. **A 1-replica deployment works on every cluster.** A multi-replica
   default with `minAvailable: 2` would render a PDB at install time;
   if the cluster has fewer than 2 nodes available (single-node
   `kind` / `minikube` / fresh `k3s` clusters), Helm renders fine but
   the first `kubectl rollout` blocks indefinitely waiting for the
   second replica that can never schedule. Defaulting off keeps the
   demo path one-command.

2. **Postgres is a singleton in the bundled chart.** The chart's
   `postgres-statefulset.yaml` runs ONE Postgres pod. Scaling the
   server tier past 1 replica without an externalized Postgres + a
   pgbouncer-style proxy doesn't actually buy HA at the DB tier: the
   single Postgres pod is the failure domain. Operators who want true
   HA route Postgres to a managed service (RDS, Cloud SQL, AlloyDB,
   Azure Database for PostgreSQL, Aiven) or run their own cluster
   (Patroni, CloudNativePG, Zalando postgres-operator). See the
   [external-Postgres values example](../../../deploy/helm/examples/values-external-db.yaml).

3. **Session affinity depends on where TLS terminates.** The control
   plane is HTTPS-only (TLS 1.3 pinned). Adding
   `sessionAffinity: ClientIP` to the server Service is the right
   choice for OIDC + RBAC session cookies when a front-end LB
   (NGINX Ingress, cloud LB with a backend service) sits in front of
   it. But operators who terminate TLS at a different layer (Envoy
   mesh, Cloudflare in front of the cluster) may have already solved
   affinity upstream; flipping it on by default would over-constrain
   those paths.

## The smallest production-HA overlay

Three Helm values to flip:

```yaml
# values-ha.yaml: copy into your overlay and edit to taste.

server:
  # ≥ 2 replicas is the minimum for the PDB to render. 3 gives you
  # a true rolling-restart tolerance window (1 down for upgrade,
  # 2 still serving) without dropping below minAvailable.
  replicas: 3

service:
  # Required when the front-end LB doesn't already enforce
  # session affinity. OIDC + RBAC session cookies need to land
  # on the same backend pod for the session lifetime.
  sessionAffinity: ClientIP

podDisruptionBudget:
  # Renders the PDB template; controller-side voluntary disruptions
  # (node-drain for k8s upgrade, cluster-autoscaler scale-down)
  # respect this floor.
  enabled: true
  # With server.replicas: 3, minAvailable: 2 leaves headroom for one
  # rolling restart at a time.
  minAvailable: 2
  # maxUnavailable is mutually exclusive with minAvailable; pick one.
  # maxUnavailable: 1
```

Apply with:

```bash
helm upgrade certctl deploy/helm/certctl/ -f values-ha.yaml
```
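
To see why the overlay pairs `replicas: 3` with `minAvailable: 2`, the sketch below works through the arithmetic a PodDisruptionBudget applies to voluntary disruptions: the number of pods that may be evicted at once is the count of healthy pods minus `minAvailable`, floored at zero. The function name `allowedDisruptions` is hypothetical, and the sketch assumes every replica is healthy.

```go
package main

import "fmt"

// allowedDisruptions returns how many pods a PDB with an integer
// minAvailable lets a voluntary disruption (node drain, autoscaler
// scale-down) evict at once, given the number of healthy replicas.
func allowedDisruptions(healthyReplicas, minAvailable int) int {
	if d := healthyReplicas - minAvailable; d > 0 {
		return d
	}
	return 0
}

func main() {
	const minAvailable = 2
	for _, replicas := range []int{1, 2, 3} {
		fmt.Printf("replicas=%d minAvailable=%d allowed=%d\n",
			replicas, minAvailable, allowedDisruptions(replicas, minAvailable))
	}
}
```

With two replicas the budget leaves no room at all, so a node drain would stall; three replicas is the smallest count that lets one pod go down while the floor holds.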

## What you still own as the operator

Three things the chart does not solve, even at `replicas: 3`:

1. **Postgres HA.** Route to an externalized Postgres (managed cloud
   or operator-managed cluster). The chart's bundled StatefulSet
   pod is a development/single-AZ pattern, not a production HA path.
2. **TLS material lifecycle.** The chart accepts an `existingSecret`
   for the server cert; rotating it is operator-side automation.
   The dashboard + agent can issue their own certs via the local CA
   (eat-your-own-dogfood); the operator can wire `cert-manager` if
   they prefer that path.
3. **Backup CronJob.** Phase 4 of the architecture diligence
   remediation plan (DEPL-H2) ships a `backup-cronjob.yaml` template;
   until that lands, backups are operator-run per the existing
   `docs/operator/runbooks/postgres-backup.md` runbook.

## Cross-references

- `deploy/helm/certctl/values.yaml` lines 19, 446, 566: the three
  defaults this runbook documents.
- `docs/operator/runbooks/postgres-backup.md`: Postgres backup
  runbook (today, operator-run).
- `docs/operator/runbooks/disaster-recovery.md`: DR procedure.
- Phase 4 (Helm Chart, DR, And Ops Surface) of the architecture
  diligence remediation plan tracks the chart-level work
  (backup CronJob, PrometheusRule starter, migration hook, etc.).