From 073f28f437fe9f467beea794d3bf1458e5c5113b Mon Sep 17 00:00:00 2001
From: Shankar
Date: Fri, 24 Apr 2026 23:51:13 +0000
Subject: [PATCH] fix(deploy,examples,env): close U-1 trap end-to-end across Helm, examples, and root env
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Follow-up to 78dcc9e (U-1 docker-compose fix) — closes the remaining
adjacent code paths that share the postgres-first-boot-password-binding
root cause but were scoped out of the original commit.

The runtime diagnostic in
internal/repository/postgres/db.go::wrapPingError (landed in 67f352d)
already covers every NewDB call site, so Helm operators and example
users hit the SQLSTATE 28P01 guidance for free at startup.

What was missing: deployment-shape-specific remediation guidance
(kubectl vs docker-compose), the hardcoded password in the *root*
.env.example, and shared ops notes for the 5 examples/ compose files.
This commit closes all three.

Files changed:

- .env.example (root) — line 16 had `postgres://certctl:certctl@...`
  with the password hardcoded literally instead of interpolating
  POSTGRES_PASSWORD. If a user copied this file as their .env
  (binary-direct deployment, not docker-compose) and rotated
  POSTGRES_PASSWORD on line 10, the URL on line 16 still carried
  'certctl' — silent two-line drift. Replaced 'certctl' with the same
  default that line 10 carries ('change-me-in-production') and added an
  explanatory comment block describing the docker-compose override
  semantics, when this URL matters (binary-direct), and the
  cross-reference to the U-1 wrapPingError diagnostic. Also fixed an
  adjacent bug: line 31 CERTCTL_SERVER_URL was `http://localhost:8443`,
  which agents reject at startup since v2.2 (HTTPS-everywhere milestone
  made the control plane HTTPS-only with TLS 1.3 pinned). Updated to
  https:// with a comment pointing operators at the bootstrap CA bundle.
- deploy/helm/certctl/values.yaml — postgresql.auth.password field had
  a one-line 'REQUIRED' comment. Expanded into a full WARNING block
  (~25 lines) explaining the PVC retention semantics, the failure
  symptom, and both kubectl-flavored remediation paths: non-destructive
  (`kubectl exec ... ALTER ROLE`) preferred for environments with data,
  and destructive (`helm uninstall + kubectl delete pvc`) for dev/demo.
  Cross-references the wrapPingError runtime diagnostic.
- deploy/helm/certctl/README.md (new, ~115 lines) — chart-level
  operational guide. Covers quick install, both remediation paths with
  concrete kubectl commands, why-we-don't-fix-this-in-the-chart
  explanation, cross-references to the docker-compose docs, server API
  key rotation (the easy case — comma-separated key list), TLS
  provisioning shapes, embedded-vs-external postgres, and uninstall
  semantics with the PVC retention gotcha called out.
- examples/README.md (new, ~55 lines) — shared operational notes for
  the 5 example deployments. Covers the postgres password rotation trap
  with example-flavored remediation paths
  (`docker compose -f examples//...`), the TLS warning, and teardown
  semantics. Replaces what would otherwise be 5x duplication across
  per-example READMEs.
- examples/{acme-nginx,acme-wildcard-dns01,multi-issuer,private-ca-traefik,
  step-ca-haproxy}/*.md — one-line cross-reference at the top of each
  example's primary doc, pointing at examples/README.md for the shared
  ops notes. Avoids 5x duplication of the same warning text while still
  surfacing the link in every operator's first-touch surface.

Verification:

- go build ./... — clean
- go vet ./... — clean
- go test -short ./internal/repository/postgres/ — 4/4 wrapPingError
  tests still passing (no production-code touch in this commit)
- helm lint deploy/helm/certctl/ — clean (1 INFO about chart icon,
  pre-existing)
- helm template smoke test — renders without error
- python3 yaml.safe_load on values.yaml — parses

Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md §2 P1 cluster,
cat-u-quickstart_postgres_password_volume_trap

Closes the three deliberate scope-outs from 78dcc9e (Helm, root
.env.example, examples/) end-to-end.

Adjacent bugs caught while in scope:

- root .env.example:16 hardcoded password not matching line 10
- root .env.example:31 http:// URL incompatible with HTTPS-only v2.2
---
 .env.example                                  |  18 ++-
 deploy/helm/certctl/README.md                 | 114 ++++++++++++++++++
 deploy/helm/certctl/values.yaml               |  25 +++-
 examples/README.md                            |  55 +++++++++
 examples/acme-nginx/acme-nginx.md             |   2 +
 .../acme-wildcard-dns01.md                    |   2 +
 examples/multi-issuer/multi-issuer.md         |   2 +
 .../private-ca-traefik/private-ca-traefik.md  |   2 +
 examples/step-ca-haproxy/step-ca-haproxy.md   |   2 +
 9 files changed, 219 insertions(+), 3 deletions(-)
 create mode 100644 deploy/helm/certctl/README.md
 create mode 100644 examples/README.md

diff --git a/.env.example b/.env.example
index 826eda1..5b8739d 100644
--- a/.env.example
+++ b/.env.example
@@ -13,7 +13,18 @@ POSTGRES_PASSWORD=change-me-in-production
 # Certctl Server
 # All server vars use the CERTCTL_ prefix (see internal/config/config.go)
 # ==============================================================================
-CERTCTL_DATABASE_URL=postgres://certctl:certctl@postgres:5432/certctl?sslmode=disable
+# IMPORTANT: keep the password segment of CERTCTL_DATABASE_URL in sync with
+# POSTGRES_PASSWORD above. If you deploy via `deploy/docker-compose.yml`,
+# this value is *overridden* by the compose file's
+# `postgres://certctl:${POSTGRES_PASSWORD:-certctl}@postgres:5432/...`
+# interpolation — but if you run the binary directly with this .env loaded
+# (e.g. `set -a; source .env; ./certctl-server`), update *both* lines.
+# Background: editing POSTGRES_PASSWORD after the postgres data directory
+# has been initialized once does NOT rotate the password — initdb only
+# seeds pg_authid on first boot of an empty volume. See docs/quickstart.md
+# "Warning" callout and `internal/repository/postgres/db.go::wrapPingError`
+# for the SQLSTATE 28P01 diagnostic that fires when the two drift.
+CERTCTL_DATABASE_URL=postgres://certctl:change-me-in-production@postgres:5432/certctl?sslmode=disable
 CERTCTL_SERVER_HOST=0.0.0.0
 CERTCTL_SERVER_PORT=8443
 CERTCTL_LOG_LEVEL=info
@@ -28,7 +39,10 @@ CERTCTL_AUTH_TYPE=none
 # ==============================================================================
 # Certctl Agent
 # ==============================================================================
-CERTCTL_SERVER_URL=http://localhost:8443
+# HTTPS-only as of v2.2 (TLS 1.3 pinned). Agents reject http:// URLs at
+# startup. Use the docker-compose self-signed bootstrap CA bundle from
+# `deploy/test/certs/ca.crt` or supply your own via CERTCTL_SERVER_CA_BUNDLE_PATH.
+CERTCTL_SERVER_URL=https://localhost:8443
 CERTCTL_API_KEY=change-me-in-production
 CERTCTL_AGENT_NAME=local-agent
 
diff --git a/deploy/helm/certctl/README.md b/deploy/helm/certctl/README.md
new file mode 100644
index 0000000..d19aace
--- /dev/null
+++ b/deploy/helm/certctl/README.md
@@ -0,0 +1,114 @@
+# certctl Helm Chart
+
+Production-ready Helm chart for deploying [certctl](https://github.com/shankar0123/certctl) on Kubernetes. Wires up the certctl server (Deployment), PostgreSQL (StatefulSet with PVC), and the agent (DaemonSet — one per node) on a private cluster, with health probes, security contexts, and optional Ingress.
+
+## Quick install
+
+```bash
+helm install certctl deploy/helm/certctl/ \
+  --create-namespace --namespace certctl \
+  --set server.auth.apiKey="$(openssl rand -base64 32)" \
+  --set postgresql.auth.password="$(openssl rand -base64 24)"
+```
+
+This brings up:
+
+- `-server` Deployment (HTTPS-only on port 8443; TLS 1.3)
+- `-postgres` StatefulSet (PostgreSQL 16-alpine, 1 replica, 10Gi PVC by default)
+- `-agent` DaemonSet (polls server, generates ECDSA P-256 keys locally)
+- Service objects, optional Ingress, and ServiceAccount with RBAC
+
+See [`values.yaml`](values.yaml) for the full configuration surface — issuer settings, target connectors, scheduler intervals, notifier credentials, and resource requests/limits all live there.
+
+## Operational notes
+
+### Postgres password rotation — read this before changing `postgresql.auth.password`
+
+**The trap.** `postgresql.auth.password` is bound to `pg_authid` exactly once — when the StatefulSet's PVC is provisioned and `initdb` runs. The official `postgres:16-alpine` image only runs `initdb` when `/var/lib/postgresql/data` is empty, so on every subsequent rollout the `POSTGRES_PASSWORD` env var is read into the container but **ignored** by postgres itself. The certctl-server container also picks up the new value (via the database URL helper template), so the two halves diverge: server presents the new password, postgres still expects the old one.
+
+**Symptom.** The certctl-server pod's startup log shows:
+
+```
+failed to ping database: postgres rejected the configured credentials
+(SQLSTATE 28P01 — invalid_password). If you recently rotated POSTGRES_PASSWORD ...
+```
+
+That diagnostic is emitted by `internal/repository/postgres/db.go::wrapPingError` — it points operators at the two remediation paths below.
+
+**Remediation, non-destructive (preferred for any environment with real data):**
+
+```bash
+# 1. Rotate the password in postgres directly
+kubectl -n certctl exec -it -postgres-0 -- \
+  psql -U certctl -c "ALTER ROLE certctl PASSWORD '';"
+
+# 2. Update the secret / Helm values to the same value
+helm upgrade deploy/helm/certctl/ \
+  --reuse-values \
+  --set postgresql.auth.password=''
+
+# 3. Bounce the certctl-server pod so it re-reads the secret
+kubectl -n certctl rollout restart deployment/-server
+```
+
+**Remediation, destructive (DESTROYS ALL CERTCTL DATA — only acceptable on dev/demo clusters):**
+
+```bash
+helm uninstall -n certctl
+kubectl -n certctl delete pvc -l \
+  app.kubernetes.io/name=certctl,app.kubernetes.io/component=postgres
+helm install deploy/helm/certctl/ \
+  --namespace certctl \
+  --set postgresql.auth.password=''
+```
+
+The PVC re-creates empty, `initdb` runs on first boot of the new postgres pod, and `pg_authid` is seeded with the new password.
+
+**Why we don't fix this in the chart.** The env-vs-`pg_authid` divergence is intrinsic to how the upstream `postgres` image bootstraps — `initdb` is run-once-per-empty-data-dir, and there is no upstream-supported way to make subsequent boots re-seed `pg_authid` from `POSTGRES_PASSWORD`. The ergonomic answer is the runtime diagnostic plus this operational note.
+
+**Cross-references.** Same root cause is documented for the docker-compose path in [`docs/quickstart.md`](../../../docs/quickstart.md) (Warning callout after the `cp .env.example .env` block) and in [`deploy/ENVIRONMENTS.md`](../../ENVIRONMENTS.md) (Stateful volume — first-boot password binding section). The runtime diagnostic itself lives in `internal/repository/postgres/db.go::wrapPingError` with regression coverage in `internal/repository/postgres/db_test.go`.
+
+### Server API key rotation
+
+Unlike the postgres password, `server.auth.apiKey` accepts a comma-separated list, so zero-downtime rotation is straightforward:
+
+```bash
+# 1. Add the new key alongside the old
+helm upgrade deploy/helm/certctl/ \
+  --reuse-values \
+  --set server.auth.apiKey='new-key,old-key'
+
+# 2. Roll your agents / clients over to the new key
+
+# 3. Remove the old key
+helm upgrade deploy/helm/certctl/ \
+  --reuse-values \
+  --set server.auth.apiKey='new-key'
+```
+
+### TLS certificate sourcing
+
+By default the chart provisions a self-signed cert via the same init-container pattern as the docker-compose deploy. For production, supply an operator-managed Secret (cert-manager, internal CA, etc.) — see [`docs/tls.md`](../../../docs/tls.md) for the full provisioning matrix and [`docs/upgrade-to-tls.md`](../../../docs/upgrade-to-tls.md) for upgrade-from-HTTP procedures.
+
+## Disabling embedded postgres
+
+If you have an existing PostgreSQL cluster, disable the embedded one and point at it directly:
+
+```bash
+helm install certctl deploy/helm/certctl/ \
+  --set postgresql.enabled=false \
+  --set server.databaseUrl='postgres://certctl:@my-pg-host:5432/certctl?sslmode=require'
+```
+
+The volume-trap section above does **not** apply to this configuration — your postgres operator (or cloud DB) handles password rotation, and you control `pg_authid` directly.
+
+## Uninstall
+
+```bash
+helm uninstall -n certctl
+# Optional — also delete the postgres PVC (DESTROYS DATA):
+kubectl -n certctl delete pvc -l \
+  app.kubernetes.io/name=certctl,app.kubernetes.io/component=postgres
+```
+
+By default `helm uninstall` retains the StatefulSet's PVCs, so reinstalling with the same release name preserves the database. If you've changed `postgresql.auth.password` in your values between uninstall and reinstall, you'll hit the trap on the reinstall — apply the non-destructive remediation above, or also delete the PVC.
diff --git a/deploy/helm/certctl/values.yaml b/deploy/helm/certctl/values.yaml
index 84600c0..de5f5ff 100644
--- a/deploy/helm/certctl/values.yaml
+++ b/deploy/helm/certctl/values.yaml
@@ -260,7 +260,30 @@ postgresql:
   auth:
     database: certctl
     username: certctl
-    password: ""  # REQUIRED - set via --set or values override
+    # REQUIRED — set via `--set postgresql.auth.password=` or values override.
+    #
+    # WARNING (U-1): rotating this value after first deploy does NOT change the
+    # database password. The `postgres:16-alpine` image runs `initdb` only when
+    # /var/lib/postgresql/data is empty, so POSTGRES_PASSWORD is written into
+    # pg_authid exactly once — on the first boot of the StatefulSet's PVC.
+    # Subsequent rollouts pick up the new env value in the postgres container
+    # but the certctl-server container's CERTCTL_DATABASE_URL also picks up
+    # the new value, while pg_authid still expects the old one — leading to
+    # `pq: password authentication failed for user "certctl"` (SQLSTATE 28P01).
+    #
+    # The certctl-server emits guidance via internal/repository/postgres/db.go::
+    # wrapPingError when it sees SQLSTATE 28P01 at startup. To resolve in a
+    # Helm deployment:
+    #   - Non-destructive (preferred for environments with data):
+    #       kubectl exec -it -postgres-0 -- \
+    #         psql -U certctl -c "ALTER ROLE certctl PASSWORD '';"
+    #     then update the secret/values to match and let the certctl-server
+    #     pod restart against the matching credential.
+    #   - Destructive (DESTROYS DATA — only acceptable on dev/demo PVCs):
+    #       helm uninstall && \
+    #       kubectl delete pvc -l app.kubernetes.io/name=certctl,app.kubernetes.io/component=postgres && \
+    #       helm install ...   # PVC re-creates empty, initdb seeds new password
+    password: ""
 
   # Storage configuration
   storage:
diff --git a/examples/README.md b/examples/README.md
new file mode 100644
index 0000000..56e1407
--- /dev/null
+++ b/examples/README.md
@@ -0,0 +1,55 @@
+# Deployment Examples
+
+Five turnkey docker-compose scenarios that show certctl deployed against real CA backends and target shapes. Each subdirectory is self-contained — pick the one closest to your stack and have it running in minutes.
+
+| Example | Stack | What it shows |
+|---------|-------|---------------|
+| [`acme-nginx/`](acme-nginx/acme-nginx.md) | Let's Encrypt + NGINX (HTTP-01) | The default public-CA path: ACME-issued certs deployed to NGINX. |
+| [`acme-wildcard-dns01/`](acme-wildcard-dns01/acme-wildcard-dns01.md) | Let's Encrypt wildcard (DNS-01) | Wildcard certificates via DNS-01 with pluggable DNS hooks. |
+| [`private-ca-traefik/`](private-ca-traefik/private-ca-traefik.md) | Local CA + Traefik | Internal-only certs from a private CA, deployed to Traefik. |
+| [`step-ca-haproxy/`](step-ca-haproxy/step-ca-haproxy.md) | Smallstep step-ca + HAProxy | Self-hosted CA with HAProxy as the deployment target. |
+| [`multi-issuer/`](multi-issuer/multi-issuer.md) | Let's Encrypt + Local CA | Public + private certs side-by-side from a single dashboard. |
+
+## Common operational notes
+
+These notes apply to **every** example. They're called out here so the per-example walkthroughs stay focused on the issuer/target wiring instead of repeating ops boilerplate.
+
+### Postgres password rotation — first-boot binding trap (U-1)
+
+Every example file uses `${DB_PASSWORD:-certctl-dev-password}` as the postgres password env var, with the data directory persisted via a named volume. The `postgres:16-alpine` image runs `initdb` exactly once — when `/var/lib/postgresql/data` is empty — and that's the only time `POSTGRES_PASSWORD` is written into `pg_authid`. If you boot once with the default and then change `DB_PASSWORD` (in your shell, in a `.env` file, or in a wrapper script), the certctl-server container picks up the new value but the postgres container continues to authenticate against the old one. The server fails its startup `db.Ping()` with `pq: password authentication failed for user "certctl"` (SQLSTATE 28P01).
+
+The certctl-server emits guidance pointing at the fix when this fires (see `internal/repository/postgres/db.go::wrapPingError`). The two remediation paths:
+
+- **Destructive — wipes all certctl data, only acceptable on demo/test setups:**
+  ```bash
+  docker compose -f examples//docker-compose.yml down -v
+  docker compose -f examples//docker-compose.yml up -d --build
+  ```
+- **Non-destructive — preserves data, rotates `pg_authid` in place:**
+  ```bash
+  docker compose -f examples//docker-compose.yml exec postgres \
+    psql -U certctl -c "ALTER ROLE certctl PASSWORD '';"
+  # Then redeploy with DB_PASSWORD set to in your shell or .env
+  ```
+
+The cleanest practice for a fresh demo: set `DB_PASSWORD` once in your shell **before** the very first `docker compose up`, and don't change it during the demo's lifetime. If you must rotate, use the non-destructive path.
+
+Same root cause and remediation pattern is documented for the canonical quickstart in [`../docs/quickstart.md`](../docs/quickstart.md), the production compose surface in [`../deploy/ENVIRONMENTS.md`](../deploy/ENVIRONMENTS.md), and the Helm chart in [`../deploy/helm/certctl/README.md`](../deploy/helm/certctl/README.md).
+
+### TLS for the certctl control plane
+
+Every example boots certctl with HTTPS-only on port 8443 (TLS 1.3 pinned, no plaintext listener as of v2.2). The shipped `certctl-tls-init` init container generates a self-signed ECDSA-P256 cert on first boot — fine for the example demos, **never** acceptable for a public deployment. For production, swap the init container for cert-manager, an operator-supplied Secret, or your internal CA — see [`../docs/tls.md`](../docs/tls.md) for the full pattern matrix.
+
+### Tearing down
+
+To stop services but **keep** the postgres volume (so you can pick up where you left off):
+```bash
+docker compose -f examples//docker-compose.yml down
+```
+
+To stop services **and** wipe all data (clean slate for the next run):
+```bash
+docker compose -f examples//docker-compose.yml down -v
+```
+
+Note that `down -v` is the only canonical way to recover from the postgres-password trap when the non-destructive `ALTER ROLE` route is unavailable (e.g., you've forgotten the original password).
diff --git a/examples/acme-nginx/acme-nginx.md b/examples/acme-nginx/acme-nginx.md
index 4dcec6e..7b99f23 100644
--- a/examples/acme-nginx/acme-nginx.md
+++ b/examples/acme-nginx/acme-nginx.md
@@ -2,6 +2,8 @@
 
 This example demonstrates certctl's core use case: **automatically manage TLS certificates for NGINX using Let's Encrypt (ACME HTTP-01 challenges).**
 
+> **Operational notes** shared by every example (postgres password rotation trap, TLS provisioning, teardown semantics) live in [`../README.md`](../README.md). Read it first if you plan to change `DB_PASSWORD` after the initial `docker compose up` — the postgres volume binds the password on first boot only.
+
 ## What This Does
 
 - Deploys certctl server (control plane) with PostgreSQL
diff --git a/examples/acme-wildcard-dns01/acme-wildcard-dns01.md b/examples/acme-wildcard-dns01/acme-wildcard-dns01.md
index 1db5f1f..96daf09 100644
--- a/examples/acme-wildcard-dns01/acme-wildcard-dns01.md
+++ b/examples/acme-wildcard-dns01/acme-wildcard-dns01.md
@@ -2,6 +2,8 @@
 
 **What this does:** Issues wildcard certificates (e.g., `*.example.com`) from Let's Encrypt using DNS-01 challenge validation.
 
+> **Operational notes** shared by every example (postgres password rotation trap, TLS provisioning, teardown semantics) live in [`../README.md`](../README.md). Read it first if you plan to change `DB_PASSWORD` after the initial `docker compose up` — the postgres volume binds the password on first boot only.
+
 This example is ideal for:
 - Issuing wildcard certificates (`*.example.com`)
 - Services behind NAT, firewalls, or non-public networks
diff --git a/examples/multi-issuer/multi-issuer.md b/examples/multi-issuer/multi-issuer.md
index 8d49588..6bc219f 100644
--- a/examples/multi-issuer/multi-issuer.md
+++ b/examples/multi-issuer/multi-issuer.md
@@ -2,6 +2,8 @@
 
 This example demonstrates certctl managing **both public and internal certificates from a single dashboard**. Public-facing services use Let's Encrypt (ACME), while internal services use a private Local CA — all visible and managed in one place.
 
+> **Operational notes** shared by every example (postgres password rotation trap, TLS provisioning, teardown semantics) live in [`../README.md`](../README.md). Read it first if you plan to change `DB_PASSWORD` after the initial `docker compose up` — the postgres volume binds the password on first boot only.
+
 ## The Use Case
 
 You have:
diff --git a/examples/private-ca-traefik/private-ca-traefik.md b/examples/private-ca-traefik/private-ca-traefik.md
index e541f67..f031bdb 100644
--- a/examples/private-ca-traefik/private-ca-traefik.md
+++ b/examples/private-ca-traefik/private-ca-traefik.md
@@ -1,5 +1,7 @@
 # Private CA + Traefik Example
 
+> **Operational notes** shared by every example (postgres password rotation trap, TLS provisioning, teardown semantics) live in [`../README.md`](../README.md). Read it first if you plan to change `DB_PASSWORD` after the initial `docker compose up` — the postgres volume binds the password on first boot only.
+
 This example demonstrates certctl managing certificates for **internal services without public CA dependency**. Ideal for enterprise environments where:
 
 - All services are internal (VPN, private networks)
diff --git a/examples/step-ca-haproxy/step-ca-haproxy.md b/examples/step-ca-haproxy/step-ca-haproxy.md
index 717f630..efe8563 100644
--- a/examples/step-ca-haproxy/step-ca-haproxy.md
+++ b/examples/step-ca-haproxy/step-ca-haproxy.md
@@ -2,6 +2,8 @@
 
 This example demonstrates certctl managing certificates issued by **Smallstep step-ca** and deploying them to **HAProxy**.
 
+> **Operational notes** shared by every example (postgres password rotation trap, TLS provisioning, teardown semantics) live in [`../README.md`](../README.md). Read it first if you plan to change `DB_PASSWORD` after the initial `docker compose up` — the postgres volume binds the password on first boot only.
+
 ## Scenario
 
 You're a Smallstep user running step-ca as your internal PKI. You have HAProxy load balancers that need certificates. This setup: