mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 23:42:00 +00:00

Files

T

shankar0123 86fffa305a fix(deploy,helm,docs): published-image HEALTHCHECK speaks HTTPS + Helm /ready path + docs HTTPS sweep (U-2)

Pre-U-2 the published `ghcr.io/shankar0123/certctl-server` image
shipped with `HEALTHCHECK CMD curl -f http://localhost:8443/health`.
The server has been HTTPS-only since the v2.2 HTTPS-Everywhere milestone
(`cmd/server/main.go::ListenAndServeTLS`, no plaintext fallback, TLS
1.3 pinned), so the probe failed on every interval and Docker marked
the container `unhealthy` indefinitely. Operators inside docker-
compose / Helm / the example stacks were unaffected — compose overrides
the HEALTHCHECK with `--cacert + https://`, Helm uses explicit
`httpGet` probes that ignore Docker's HEALTHCHECK, and every example
compose file overrides with `curl -sfk https://localhost:8443/health`.
But anyone running bare `docker run` / Docker Swarm / Nomad / ECS —
exactly the "I just pulled the published image" path — saw permanent
`unhealthy` status and (depending on orchestrator policy) a restart-
loop. (Audit: cat-u-healthcheck_protocol_mismatch in
coverage-gap-audit-2026-04-24-v5/unified-audit.md.)

Recon for U-2 surfaced two adjacent bugs from the same v2.2 milestone
gap, both bundled into this commit because they share the same root
cause and the same operator surface:

1. Helm chart `server.readinessProbe.httpGet.path` pointed at
`/readyz`, the kube-flavored convention. The certctl server
doesn't register `/readyz` (only `/health` and `/ready` are
wired and bypass the auth middleware — see
internal/api/router/router.go:81 and cmd/server/main.go:920).
K8s readiness probes therefore got 401 (api-key auth rejection)
or 404 (when auth was disabled), pods stayed `NotReady`
indefinitely, and Helm rollouts stalled.

2. The agent image (`Dockerfile.agent`) had no HEALTHCHECK at all,
so bare-`docker run` agents got zero health signal. The
compose override at `deploy/docker-compose.yml:173` called
`pgrep -f certctl-agent` against the agent image, but the
agent image didn't ship `procps` — pgrep was missing too. The
compose probe was a latent always-fail.

We fixed all three with the audit-recommended shape (option (a) — `-k`)
plus three structural backstops:

Files changed:

Phase 1 — Dockerfile fix:
- Dockerfile: HEALTHCHECK switched from `curl -f http://localhost:8443/
health` to `curl -fsk https://localhost:8443/health`. `-k`
(insecure) is acceptable because the probe is localhost-to-localhost:
the same process serving the cert is being probed, no network hop.
Pinning `--cacert` is not viable for the published image because
the bootstrap cert is per-deploy (generated into the `certs` named
volume on first up; operator-supplied via Helm's `existingSecret`
or cert-manager). Long-form docblock cross-references the audit
closure, the compose vs Helm vs examples coverage matrix, and the
CI guardrail.
- Dockerfile.agent: added HEALTHCHECK using `pgrep -f certctl-agent`
matching the compose pattern. Added `procps` to the runtime apk
install — fixes both the new image-level HEALTHCHECK AND the
pre-existing compose probe that was silently failing.

Phase 2 — Helm readiness probe path:
- deploy/helm/certctl/values.yaml: server.readinessProbe.httpGet.path
changed from `/readyz` to `/ready`. Liveness probe path
(`/health`) was correct and is unchanged. Probes block now carries
an explanatory comment naming the registered no-auth probe routes
and the U-2 closure rationale.

Phase 3 — Image-level integration tests:
- deploy/test/healthcheck_test.go (new, //go:build integration):
TestPublishedServerImage_HealthcheckSpecUsesHTTPS builds the server
image, inspects `Config.Healthcheck.Test` via `docker inspect`,
and asserts the array contains `https://localhost:8443/health` and
`-k`, and does NOT contain `http://localhost:8443/health`
(positive + negative regression contracts).
TestPublishedAgentImage_HealthcheckSpecExists builds the agent image
and asserts the HEALTHCHECK uses `pgrep` against `certctl-agent`.
Both tests `t.Skip` cleanly when docker isn't available (sandbox /
CI without docker-in-docker) — verified locally: tests skip with the
diagnostic and the suite returns PASS.
TestPublishedServerImage_HealthcheckTransitionsToHealthy is a
documented `t.Skip` placeholder until the harness wires a sidecar
postgres for image-level smoke; the spec-level tests above cover the
audit-flagged regression.

Phase 4 — CI guardrail:
- .github/workflows/ci.yml: new "Forbidden plaintext HEALTHCHECK
regression guard (U-2)" step. Scoped patterns catch
`HEALTHCHECK.*http://` and `curl -f http://localhost:8443/health`
in any `Dockerfile*`. Comment lines exempt; docs/upgrade-to-tls.md
out of scope (the post-cutover invariant string at line 182 is
intentionally a documented expected-failure assertion). Verified
locally on the real tree (passes) and against synthetic regressions
(each fires the guard).

Phase 5 — Docs sweep:
- docs/connectors.md: 15 stale curl examples updated from
`http://localhost:8443/...` to `https://localhost:8443/...` with
`--cacert "$CA"` injected on every site. Added a one-time
introductory note documenting the `$CA` extraction with
`docker compose ... exec ... cat /etc/certctl/tls/ca.crt`,
matching the pattern in docs/quickstart.md. Pre-U-2 these examples
silently failed against the HTTPS listener.

Phase 6 — Release surface:
- CHANGELOG.md: appended U-2 section to the existing [unreleased]
block (immediately below the G-1 entry). Sections: explanatory
blockquote covering all three bugs (primary + 2 adjacent), Fixed,
Added, Changed.

Verification (all gates pass):
- go build ./... — clean
- go vet ./... — clean
- go vet -tags integration ./deploy/test/ — clean
- go test -short ./... — every package green
- go test -tags integration -v -run TestPublishedServerImage|TestPublishedAgentImage ./deploy/test/ —
three tests SKIP cleanly with "docker not available" diagnostic
- helm lint deploy/helm/certctl/ — clean
- helm template smoke render — succeeds; rendered Deployment carries
`path: /ready` and zero `/readyz` matches
- python3 yaml.safe_load on api/openapi.yaml — parses
- govulncheck ./... — no vulnerabilities in our code
- CI guardrail mirror: clean on real tree, fires on synthetic
regression patterns

Out of scope (intentionally untouched):
- cmd/server/main.go::ListenAndServeTLS — HTTPS-only is correct,
this finding does NOT propose adding back a plaintext listener.
- deploy/docker-compose.yml:126 HEALTHCHECK — already correct.
- deploy/docker-compose.test.yml HEALTHCHECK blocks — already correct.
- All 5 examples/*/docker-compose.yml HEALTHCHECK overrides — already
correct (they ALSO use `-fsk https://localhost:8443/health`).
- Helm server.livenessProbe.httpGet — already uses `scheme: HTTPS` +
`path: /health`, correct.
- docs/upgrade-to-tls.md:182 `curl ... http://localhost:8443/health`
invariant line — that's the expected-failure assertion for the
post-cutover state ("plaintext is gone, expect Connection refused");
intentionally left intact.
- Go production code — this is purely a deploy-image / probe / docs /
Helm-chart fix.

Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
§2 P1 cluster, cat-u-healthcheck_protocol_mismatch
Audit recommendation followed verbatim: 'change Dockerfile:80
to CMD curl -kf https://localhost:8443/health'.

2026-04-25 12:02:18 +00:00

templates

fix(security,config): remove unimplemented JWT auth-type, close silent downgrade (G-1)

2026-04-25 00:22:23 +00:00

.helmignore

feat(m28+m29+m30): ACME ARI, email digest, and Helm chart

2026-03-28 21:18:35 -04:00

Chart.yaml

feat(m28+m29+m30): ACME ARI, email digest, and Helm chart

2026-03-28 21:18:35 -04:00

README.md

fix(security,config): remove unimplemented JWT auth-type, close silent downgrade (G-1)

2026-04-25 00:22:23 +00:00

values.yaml

fix(deploy,helm,docs): published-image HEALTHCHECK speaks HTTPS + Helm /ready path + docs HTTPS sweep (U-2)

2026-04-25 12:02:18 +00:00

README.md

certctl Helm Chart

Production-ready Helm chart for deploying certctl on Kubernetes. Wires up the certctl server (Deployment), PostgreSQL (StatefulSet with PVC), and the agent (DaemonSet — one per node) on a private cluster, with health probes, security contexts, and optional Ingress.

Quick install

helm install certctl deploy/helm/certctl/ \
  --create-namespace --namespace certctl \
  --set server.auth.apiKey="$(openssl rand -base64 32)" \
  --set postgresql.auth.password="$(openssl rand -base64 24)"

This brings up:

<release>-server Deployment (HTTPS-only on port 8443; TLS 1.3)
<release>-postgres StatefulSet (PostgreSQL 16-alpine, 1 replica, 10Gi PVC by default)
<release>-agent DaemonSet (polls server, generates ECDSA P-256 keys locally)
Service objects, optional Ingress, and ServiceAccount with RBAC

See values.yaml for the full configuration surface — issuer settings, target connectors, scheduler intervals, notifier credentials, and resource requests/limits all live there.

Operational notes

Postgres password rotation — read this before changing `postgresql.auth.password`

The trap. postgresql.auth.password is bound to pg_authid exactly once — when the StatefulSet's PVC is provisioned and initdb runs. The official postgres:16-alpine image only runs initdb when /var/lib/postgresql/data is empty, so on every subsequent rollout the POSTGRES_PASSWORD env var is read into the container but ignored by postgres itself. The certctl-server container also picks up the new value (via the database URL helper template), so the two halves diverge: server presents the new password, postgres still expects the old one.

Symptom. The certctl-server pod's startup log shows:

failed to ping database: postgres rejected the configured credentials
(SQLSTATE 28P01 — invalid_password). If you recently rotated POSTGRES_PASSWORD ...

That diagnostic is emitted by internal/repository/postgres/db.go::wrapPingError — it points operators at the two remediation paths below.

Remediation, non-destructive (preferred for any environment with real data):

# 1. Rotate the password in postgres directly
kubectl -n certctl exec -it <release>-postgres-0 -- \
  psql -U certctl -c "ALTER ROLE certctl PASSWORD '<new-password>';"

# 2. Update the secret / Helm values to the same value
helm upgrade <release> deploy/helm/certctl/ \
  --reuse-values \
  --set postgresql.auth.password='<new-password>'

# 3. Bounce the certctl-server pod so it re-reads the secret
kubectl -n certctl rollout restart deployment/<release>-server

Remediation, destructive (DESTROYS ALL CERTCTL DATA — only acceptable on dev/demo clusters):

helm uninstall <release> -n certctl
kubectl -n certctl delete pvc -l \
  app.kubernetes.io/name=certctl,app.kubernetes.io/component=postgres
helm install <release> deploy/helm/certctl/ \
  --namespace certctl \
  --set postgresql.auth.password='<new-password>'

The PVC re-creates empty, initdb runs on first boot of the new postgres pod, and pg_authid is seeded with the new password.

Why we don't fix this in the chart. The env-vs-pg_authid divergence is intrinsic to how the upstream postgres image bootstraps — initdb is run-once-per-empty-data-dir, and there is no upstream-supported way to make subsequent boots re-seed pg_authid from POSTGRES_PASSWORD. The ergonomic answer is the runtime diagnostic plus this operational note.

Cross-references. Same root cause is documented for the docker-compose path in docs/quickstart.md (Warning callout after the cp .env.example .env block) and in deploy/ENVIRONMENTS.md (Stateful volume — first-boot password binding section). The runtime diagnostic itself lives in internal/repository/postgres/db.go::wrapPingError with regression coverage in internal/repository/postgres/db_test.go.

Server API key rotation

Unlike the postgres password, server.auth.apiKey accepts a comma-separated list, so zero-downtime rotation is straightforward:

# 1. Add the new key alongside the old
helm upgrade <release> deploy/helm/certctl/ \
  --reuse-values \
  --set server.auth.apiKey='new-key,old-key'

# 2. Roll your agents / clients over to the new key

# 3. Remove the old key
helm upgrade <release> deploy/helm/certctl/ \
  --reuse-values \
  --set server.auth.apiKey='new-key'

JWT / OIDC via authenticating gateway

certctl's in-process auth surface is intentionally narrow: server.auth.type=api-key for production deployments and server.auth.type=none for development. There is no in-process JWT, OIDC, mTLS, or SAML middleware. (server.auth.type=jwt was accepted pre-G-1 but silently routed every request through the api-key bearer middleware — silent auth downgrade. The chart now fails at helm install/helm upgrade template time via the certctl.validateAuthType helper if you set it. See ../../../docs/upgrade-to-v2-jwt-removal.md if you previously had this in your values.)

For deployments that need JWT/OIDC, the canonical Kubernetes-flavored shape is to put oauth2-proxy in front of the certctl Service, attach an authenticating Ingress middleware, and run certctl with server.auth.type=none:

# 1. Install oauth2-proxy (or any OIDC-terminating sidecar) in the same namespace
helm install oauth2-proxy oauth2-proxy/oauth2-proxy \
  --namespace certctl \
  --set config.clientID="$OIDC_CLIENT_ID" \
  --set config.clientSecret="$OIDC_CLIENT_SECRET" \
  --set config.cookieSecret="$(openssl rand -base64 32)" \
  --set config.configFile='|
    provider = "oidc"
    oidc_issuer_url = "https://your-issuer/"
    upstreams = ["http://<release>-server.certctl.svc.cluster.local:8443"]
    pass_authorization_header = true
    set_authorization_header = true
    email_domains = ["*"]
  '

# 2. Install certctl with type=none (gateway terminates auth)
helm install certctl deploy/helm/certctl/ \
  --namespace certctl \
  --set server.auth.type=none \
  --set postgresql.auth.password="$(openssl rand -base64 24)"

# 3. Attach an Ingress that routes through oauth2-proxy
#    (Traefik ForwardAuth, nginx auth_request, Envoy ext_authz, etc.)

Same root pattern works with Pomerium, Authelia, Caddy forward_auth, Apache mod_auth_openidc, or any service-mesh ext_authz. See ../../../docs/architecture.md "Authenticating-gateway pattern" for the full design rationale and ../../../docs/upgrade-to-v2-jwt-removal.md for the migration walkthrough.

TLS certificate sourcing

By default the chart provisions a self-signed cert via the same init-container pattern as the docker-compose deploy. For production, supply an operator-managed Secret (cert-manager, internal CA, etc.) — see docs/tls.md for the full provisioning matrix and docs/upgrade-to-tls.md for upgrade-from-HTTP procedures.

Disabling embedded postgres

If you have an existing PostgreSQL cluster, disable the embedded one and point at it directly:

helm install certctl deploy/helm/certctl/ \
  --set postgresql.enabled=false \
  --set server.databaseUrl='postgres://certctl:<pw>@my-pg-host:5432/certctl?sslmode=require'

The volume-trap section above does not apply to this configuration — your postgres operator (or cloud DB) handles password rotation, and you control pg_authid directly.

Uninstall

helm uninstall <release> -n certctl
# Optional — also delete the postgres PVC (DESTROYS DATA):
kubectl -n certctl delete pvc -l \
  app.kubernetes.io/name=certctl,app.kubernetes.io/component=postgres

By default helm uninstall retains the StatefulSet's PVCs, so reinstalling with the same release name preserves the database. If you've changed postgresql.auth.password in your values between uninstall and reinstall, you'll hit the trap on the reinstall — apply the non-destructive remediation above, or also delete the PVC.

README.md

certctl Helm Chart

Quick install

Operational notes

Postgres password rotation — read this before changing postgresql.auth.password

Server API key rotation

JWT / OIDC via authenticating gateway

TLS certificate sourcing

Disabling embedded postgres

Uninstall

Postgres password rotation — read this before changing `postgresql.auth.password`