fix(deploy,helm,docs): published-image HEALTHCHECK speaks HTTPS + Helm /ready path + docs HTTPS sweep (U-2)

Pre-U-2 the published `ghcr.io/shankar0123/certctl-server` image
shipped with `HEALTHCHECK CMD curl -f http://localhost:8443/health`.
The server has been HTTPS-only since the v2.2 HTTPS-Everywhere milestone
(`cmd/server/main.go::ListenAndServeTLS`, no plaintext fallback, TLS
1.3 pinned), so the probe failed on every interval and Docker marked
the container `unhealthy` indefinitely. Operators inside docker-
compose / Helm / the example stacks were unaffected — compose overrides
the HEALTHCHECK with `--cacert + https://`, Helm uses explicit
`httpGet` probes that ignore Docker's HEALTHCHECK, and every example
compose file overrides with `curl -sfk https://localhost:8443/health`.
But anyone running bare `docker run` / Docker Swarm / Nomad / ECS —
exactly the "I just pulled the published image" path — saw permanent
`unhealthy` status and (depending on orchestrator policy) a restart-
loop. (Audit: cat-u-healthcheck_protocol_mismatch in
coverage-gap-audit-2026-04-24-v5/unified-audit.md.)

Recon for U-2 surfaced two adjacent bugs from the same v2.2 milestone
gap, both bundled into this commit because they share the same root
cause and the same operator surface:

  1. Helm chart `server.readinessProbe.httpGet.path` pointed at
     `/readyz`, the kube-flavored convention. The certctl server
     doesn't register `/readyz` (only `/health` and `/ready` are
     wired and bypass the auth middleware — see
     internal/api/router/router.go:81 and cmd/server/main.go:920).
     K8s readiness probes therefore got 401 (api-key auth rejection)
     or 404 (when auth was disabled), pods stayed `NotReady`
     indefinitely, and Helm rollouts stalled.

  2. The agent image (`Dockerfile.agent`) had no HEALTHCHECK at all,
     so bare-`docker run` agents got zero health signal. The
     compose override at `deploy/docker-compose.yml:173` called
     `pgrep -f certctl-agent` against the agent image, but the
     agent image didn't ship `procps` — pgrep was missing too. The
     compose probe was a latent always-fail.

We fixed all three with the audit-recommended shape (option (a) — `-k`)
plus three structural backstops:

Files changed:

Phase 1 — Dockerfile fix:
- Dockerfile: HEALTHCHECK switched from `curl -f http://localhost:8443/
  health` to `curl -fsk https://localhost:8443/health`. `-k`
  (insecure) is acceptable because the probe is localhost-to-localhost:
  the same process serving the cert is being probed, no network hop.
  Pinning `--cacert` is not viable for the published image because
  the bootstrap cert is per-deploy (generated into the `certs` named
  volume on first up; operator-supplied via Helm's `existingSecret`
  or cert-manager). Long-form docblock cross-references the audit
  closure, the compose vs Helm vs examples coverage matrix, and the
  CI guardrail.
- Dockerfile.agent: added HEALTHCHECK using `pgrep -f certctl-agent`
  matching the compose pattern. Added `procps` to the runtime apk
  install — fixes both the new image-level HEALTHCHECK AND the
  pre-existing compose probe that was silently failing.

Phase 2 — Helm readiness probe path:
- deploy/helm/certctl/values.yaml: server.readinessProbe.httpGet.path
  changed from `/readyz` to `/ready`. Liveness probe path
  (`/health`) was correct and is unchanged. Probes block now carries
  an explanatory comment naming the registered no-auth probe routes
  and the U-2 closure rationale.

Phase 3 — Image-level integration tests:
- deploy/test/healthcheck_test.go (new, //go:build integration):
  TestPublishedServerImage_HealthcheckSpecUsesHTTPS builds the server
  image, inspects `Config.Healthcheck.Test` via `docker inspect`,
  and asserts the array contains `https://localhost:8443/health` and
  `-k`, and does NOT contain `http://localhost:8443/health`
  (positive + negative regression contracts).
  TestPublishedAgentImage_HealthcheckSpecExists builds the agent image
  and asserts the HEALTHCHECK uses `pgrep` against `certctl-agent`.
  Both tests `t.Skip` cleanly when docker isn't available (sandbox /
  CI without docker-in-docker) — verified locally: tests skip with the
  diagnostic and the suite returns PASS.
  TestPublishedServerImage_HealthcheckTransitionsToHealthy is a
  documented `t.Skip` placeholder until the harness wires a sidecar
  postgres for image-level smoke; the spec-level tests above cover the
  audit-flagged regression.

Phase 4 — CI guardrail:
- .github/workflows/ci.yml: new "Forbidden plaintext HEALTHCHECK
  regression guard (U-2)" step. Scoped patterns catch
  `HEALTHCHECK.*http://` and `curl -f http://localhost:8443/health`
  in any `Dockerfile*`. Comment lines exempt; docs/upgrade-to-tls.md
  out of scope (the post-cutover invariant string at line 182 is
  intentionally a documented expected-failure assertion). Verified
  locally on the real tree (passes) and against synthetic regressions
  (each fires the guard).

Phase 5 — Docs sweep:
- docs/connectors.md: 15 stale curl examples updated from
  `http://localhost:8443/...` to `https://localhost:8443/...` with
  `--cacert "$CA"` injected on every site. Added a one-time
  introductory note documenting the `$CA` extraction with
  `docker compose ... exec ... cat /etc/certctl/tls/ca.crt`,
  matching the pattern in docs/quickstart.md. Pre-U-2 these examples
  silently failed against the HTTPS listener.

Phase 6 — Release surface:
- CHANGELOG.md: appended U-2 section to the existing [unreleased]
  block (immediately below the G-1 entry). Sections: explanatory
  blockquote covering all three bugs (primary + 2 adjacent), Fixed,
  Added, Changed.

Verification (all gates pass):
- go build ./... — clean
- go vet ./... — clean
- go vet -tags integration ./deploy/test/ — clean
- go test -short ./... — every package green
- go test -tags integration -v -run TestPublishedServerImage|TestPublishedAgentImage ./deploy/test/ —
  three tests SKIP cleanly with "docker not available" diagnostic
- helm lint deploy/helm/certctl/ — clean
- helm template smoke render — succeeds; rendered Deployment carries
  `path: /ready` and zero `/readyz` matches
- python3 yaml.safe_load on api/openapi.yaml — parses
- govulncheck ./... — no vulnerabilities in our code
- CI guardrail mirror: clean on real tree, fires on synthetic
  regression patterns

Out of scope (intentionally untouched):
- cmd/server/main.go::ListenAndServeTLS — HTTPS-only is correct,
  this finding does NOT propose adding back a plaintext listener.
- deploy/docker-compose.yml:126 HEALTHCHECK — already correct.
- deploy/docker-compose.test.yml HEALTHCHECK blocks — already correct.
- All 5 examples/*/docker-compose.yml HEALTHCHECK overrides — already
  correct (they ALSO use `-fsk https://localhost:8443/health`).
- Helm server.livenessProbe.httpGet — already uses `scheme: HTTPS` +
  `path: /health`, correct.
- docs/upgrade-to-tls.md:182 `curl ... http://localhost:8443/health`
  invariant line — that's the expected-failure assertion for the
  post-cutover state ("plaintext is gone, expect Connection refused");
  intentionally left intact.
- Go production code — this is purely a deploy-image / probe / docs /
  Helm-chart fix.

Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
      §2 P1 cluster, cat-u-healthcheck_protocol_mismatch
      Audit recommendation followed verbatim: 'change Dockerfile:80
      to CMD curl -kf https://localhost:8443/health'.
This commit is contained in:
shankar0123
2026-04-25 12:02:18 +00:00
parent e17788355b
commit 86fffa305a
7 changed files with 388 additions and 19 deletions
+22
View File
@@ -37,6 +37,28 @@ All notable changes to certctl are documented in this file. Dates use ISO 8601.
- `internal/api/handler/health_test.go` — the prior `TestAuthInfo_ReturnsAuthType_JWT` (which asserted the handler echoed `"jwt"`, baking the silent-downgrade lie into the regression suite) is removed; the pre-existing `TestAuthInfo_ReturnsAuthType_APIKey` continues to cover the api-key happy path.
- Auth-disabled startup log in `main.go` now points operators at the authenticating-gateway pattern explicitly.
### U-2: Dockerfile HEALTHCHECK protocol mismatch — closed end-to-end
> Pre-U-2 the published `ghcr.io/shankar0123/certctl-server` image shipped with `HEALTHCHECK CMD curl -f http://localhost:8443/health`. The server has been HTTPS-only since the v2.2 HTTPS-Everywhere milestone (`cmd/server/main.go::ListenAndServeTLS`, no plaintext fallback, TLS 1.3 pinned), so the probe failed every interval and Docker marked the container `unhealthy` indefinitely. Operators inside docker-compose / Helm / the example stacks were unaffected — compose overrides the HEALTHCHECK with `--cacert + https://`, Helm uses explicit `httpGet` probes that ignore Docker's HEALTHCHECK, and every example compose file overrides with `curl -sfk https://localhost:8443/health`. But anyone running bare `docker run` / Docker Swarm / Nomad / ECS — exactly the "I just pulled the published image" path — saw permanent `unhealthy` status and (depending on orchestrator policy) a restart-loop. Recon for U-2 also surfaced two adjacent bugs from the same v2.2 milestone gap: the Helm chart's `readinessProbe.httpGet.path` pointed at `/readyz`, a route the server doesn't register (only `/health` and `/ready` are wired and bypass the auth middleware), so K8s readiness probes were getting 404/auth-rejection and pods stayed `NotReady`; and the agent image had no HEALTHCHECK at all (the compose override called `pgrep -f certctl-agent` against an image that didn't ship `procps` — latent always-fail). All three are closed in this commit.
### Fixed
- **`Dockerfile` HEALTHCHECK now speaks HTTPS.** Bare `docker run` / Swarm / Nomad / ECS users no longer see `unhealthy` forever. The probe uses `curl -fsk https://localhost:8443/health``-k` (insecure) is acceptable because the probe is localhost-to-localhost: the same process serving the cert is being probed; the probe never traverses a network. Compose / Helm / examples already perform full cert-chain validation and are unaffected.
- **Helm `server.readinessProbe.httpGet.path` corrected from `/readyz` to `/ready`.** The `/readyz` path was never registered as a no-auth route (see `internal/api/router/router.go:81` and `cmd/server/main.go:920`), so K8s readiness probes received 401 (api-key auth rejection) or 404 (when auth was disabled). Pods previously failed to report Ready under most realistic Helm deployments. Liveness probe path (`/health`) was already correct and is unchanged.
- **`docs/connectors.md` curl examples** (15 sites) updated from `http://localhost:8443/...` to `https://localhost:8443/...` with a one-time `--cacert "$CA"` extraction note matching the existing pattern in `docs/quickstart.md`. Pre-U-2 these examples silently failed against the HTTPS listener.
### Added
- **`Dockerfile.agent` HEALTHCHECK** — `pgrep -f certctl-agent` process-presence check (the agent has no HTTP listener; presence is the right primitive). Bare-`docker run` agents now report health-status the same way compose-managed ones do. Also adds `procps` to the runtime image so `pgrep` is actually available — pre-U-2 the docker-compose override at `deploy/docker-compose.yml:173` called `pgrep -f certctl-agent` against an image that lacked it (latent always-fail; container was reported unhealthy in compose too, just rarely noticed because nothing acted on the signal).
- **`deploy/test/healthcheck_test.go`** (`//go:build integration`) — image-level integration tests. `TestPublishedServerImage_HealthcheckSpecUsesHTTPS` builds the server image, inspects `Config.Healthcheck.Test` via `docker inspect`, and asserts the array contains `https://localhost:8443/health` and `-k`, and does NOT contain `http://localhost:8443/health` (negative regression contract). `TestPublishedAgentImage_HealthcheckSpecExists` builds the agent image and asserts the HEALTHCHECK uses `pgrep` against `certctl-agent`. Both tests `t.Skip` cleanly when docker isn't available (sandbox / CI without docker-in-docker). A third runtime test (`TestPublishedServerImage_HealthcheckTransitionsToHealthy`) is a `t.Skip` placeholder until the harness wires a sidecar postgres for image-level smoke — documented honestly so the next refactor adopts it instead of rediscovering the gap.
- **CI regression guardrail** in `.github/workflows/ci.yml` (`Forbidden plaintext HEALTHCHECK regression guard (U-2)`) — grep-fails the build if any `Dockerfile*` carries `HEALTHCHECK.*http://` or `curl -f http://localhost:8443/health`. Comments exempt; the `docs/upgrade-to-tls.md:182` post-cutover invariant string (which deliberately documents the expected-failure shape) is out of the guardrail's scope because the guardrail only scans Dockerfiles.
### Changed
- `Dockerfile` final-stage HEALTHCHECK lines now carry a long-form docblock explaining the `-k` design choice, the published-image vs compose vs Helm vs examples coverage matrix, and cross-references to the audit closure + the integration test.
- `Dockerfile.agent` runtime stage adds `procps` to the apk install so the new HEALTHCHECK and the existing compose override both have a working `pgrep`.
- `deploy/helm/certctl/values.yaml` server probes block now carries an explanatory comment naming the registered probe routes (`/health`, `/ready`) and the U-2 closure rationale for the `/readyz``/ready` correction.
## [2.2.0] — 2026-04-19
### HTTPS Everywhere — The Irony