mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-13 00:28:57 +00:00
fix(deploy,helm,docs): published-image HEALTHCHECK speaks HTTPS + Helm /ready path + docs HTTPS sweep (U-2)
Pre-U-2 the published `ghcr.io/shankar0123/certctl-server` image shipped with `HEALTHCHECK CMD curl -f http://localhost:8443/health`. The server has been HTTPS-only since the v2.2 HTTPS-Everywhere milestone (`cmd/server/main.go::ListenAndServeTLS`, no plaintext fallback, TLS 1.3 pinned), so the probe failed on every interval and Docker marked the container `unhealthy` indefinitely. Operators inside docker- compose / Helm / the example stacks were unaffected — compose overrides the HEALTHCHECK with `--cacert + https://`, Helm uses explicit `httpGet` probes that ignore Docker's HEALTHCHECK, and every example compose file overrides with `curl -sfk https://localhost:8443/health`. But anyone running bare `docker run` / Docker Swarm / Nomad / ECS — exactly the "I just pulled the published image" path — saw permanent `unhealthy` status and (depending on orchestrator policy) a restart- loop. (Audit: cat-u-healthcheck_protocol_mismatch in coverage-gap-audit-2026-04-24-v5/unified-audit.md.) Recon for U-2 surfaced two adjacent bugs from the same v2.2 milestone gap, both bundled into this commit because they share the same root cause and the same operator surface: 1. Helm chart `server.readinessProbe.httpGet.path` pointed at `/readyz`, the kube-flavored convention. The certctl server doesn't register `/readyz` (only `/health` and `/ready` are wired and bypass the auth middleware — see internal/api/router/router.go:81 and cmd/server/main.go:920). K8s readiness probes therefore got 401 (api-key auth rejection) or 404 (when auth was disabled), pods stayed `NotReady` indefinitely, and Helm rollouts stalled. 2. The agent image (`Dockerfile.agent`) had no HEALTHCHECK at all, so bare-`docker run` agents got zero health signal. The compose override at `deploy/docker-compose.yml:173` called `pgrep -f certctl-agent` against the agent image, but the agent image didn't ship `procps` — pgrep was missing too. The compose probe was a latent always-fail. We fixed all three with the audit-recommended shape (option (a) — `-k`) plus three structural backstops: Files changed: Phase 1 — Dockerfile fix: - Dockerfile: HEALTHCHECK switched from `curl -f http://localhost:8443/ health` to `curl -fsk https://localhost:8443/health`. `-k` (insecure) is acceptable because the probe is localhost-to-localhost: the same process serving the cert is being probed, no network hop. Pinning `--cacert` is not viable for the published image because the bootstrap cert is per-deploy (generated into the `certs` named volume on first up; operator-supplied via Helm's `existingSecret` or cert-manager). Long-form docblock cross-references the audit closure, the compose vs Helm vs examples coverage matrix, and the CI guardrail. - Dockerfile.agent: added HEALTHCHECK using `pgrep -f certctl-agent` matching the compose pattern. Added `procps` to the runtime apk install — fixes both the new image-level HEALTHCHECK AND the pre-existing compose probe that was silently failing. Phase 2 — Helm readiness probe path: - deploy/helm/certctl/values.yaml: server.readinessProbe.httpGet.path changed from `/readyz` to `/ready`. Liveness probe path (`/health`) was correct and is unchanged. Probes block now carries an explanatory comment naming the registered no-auth probe routes and the U-2 closure rationale. Phase 3 — Image-level integration tests: - deploy/test/healthcheck_test.go (new, //go:build integration): TestPublishedServerImage_HealthcheckSpecUsesHTTPS builds the server image, inspects `Config.Healthcheck.Test` via `docker inspect`, and asserts the array contains `https://localhost:8443/health` and `-k`, and does NOT contain `http://localhost:8443/health` (positive + negative regression contracts). TestPublishedAgentImage_HealthcheckSpecExists builds the agent image and asserts the HEALTHCHECK uses `pgrep` against `certctl-agent`. Both tests `t.Skip` cleanly when docker isn't available (sandbox / CI without docker-in-docker) — verified locally: tests skip with the diagnostic and the suite returns PASS. TestPublishedServerImage_HealthcheckTransitionsToHealthy is a documented `t.Skip` placeholder until the harness wires a sidecar postgres for image-level smoke; the spec-level tests above cover the audit-flagged regression. Phase 4 — CI guardrail: - .github/workflows/ci.yml: new "Forbidden plaintext HEALTHCHECK regression guard (U-2)" step. Scoped patterns catch `HEALTHCHECK.*http://` and `curl -f http://localhost:8443/health` in any `Dockerfile*`. Comment lines exempt; docs/upgrade-to-tls.md out of scope (the post-cutover invariant string at line 182 is intentionally a documented expected-failure assertion). Verified locally on the real tree (passes) and against synthetic regressions (each fires the guard). Phase 5 — Docs sweep: - docs/connectors.md: 15 stale curl examples updated from `http://localhost:8443/...` to `https://localhost:8443/...` with `--cacert "$CA"` injected on every site. Added a one-time introductory note documenting the `$CA` extraction with `docker compose ... exec ... cat /etc/certctl/tls/ca.crt`, matching the pattern in docs/quickstart.md. Pre-U-2 these examples silently failed against the HTTPS listener. Phase 6 — Release surface: - CHANGELOG.md: appended U-2 section to the existing [unreleased] block (immediately below the G-1 entry). Sections: explanatory blockquote covering all three bugs (primary + 2 adjacent), Fixed, Added, Changed. Verification (all gates pass): - go build ./... — clean - go vet ./... — clean - go vet -tags integration ./deploy/test/ — clean - go test -short ./... — every package green - go test -tags integration -v -run TestPublishedServerImage|TestPublishedAgentImage ./deploy/test/ — three tests SKIP cleanly with "docker not available" diagnostic - helm lint deploy/helm/certctl/ — clean - helm template smoke render — succeeds; rendered Deployment carries `path: /ready` and zero `/readyz` matches - python3 yaml.safe_load on api/openapi.yaml — parses - govulncheck ./... — no vulnerabilities in our code - CI guardrail mirror: clean on real tree, fires on synthetic regression patterns Out of scope (intentionally untouched): - cmd/server/main.go::ListenAndServeTLS — HTTPS-only is correct, this finding does NOT propose adding back a plaintext listener. - deploy/docker-compose.yml:126 HEALTHCHECK — already correct. - deploy/docker-compose.test.yml HEALTHCHECK blocks — already correct. - All 5 examples/*/docker-compose.yml HEALTHCHECK overrides — already correct (they ALSO use `-fsk https://localhost:8443/health`). - Helm server.livenessProbe.httpGet — already uses `scheme: HTTPS` + `path: /health`, correct. - docs/upgrade-to-tls.md:182 `curl ... http://localhost:8443/health` invariant line — that's the expected-failure assertion for the post-cutover state ("plaintext is gone, expect Connection refused"); intentionally left intact. - Go production code — this is purely a deploy-image / probe / docs / Helm-chart fix. Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md §2 P1 cluster, cat-u-healthcheck_protocol_mismatch Audit recommendation followed verbatim: 'change Dockerfile:80 to CMD curl -kf https://localhost:8443/health'.
This commit is contained in:
@@ -37,6 +37,28 @@ All notable changes to certctl are documented in this file. Dates use ISO 8601.
|
||||
- `internal/api/handler/health_test.go` — the prior `TestAuthInfo_ReturnsAuthType_JWT` (which asserted the handler echoed `"jwt"`, baking the silent-downgrade lie into the regression suite) is removed; the pre-existing `TestAuthInfo_ReturnsAuthType_APIKey` continues to cover the api-key happy path.
|
||||
- Auth-disabled startup log in `main.go` now points operators at the authenticating-gateway pattern explicitly.
|
||||
|
||||
### U-2: Dockerfile HEALTHCHECK protocol mismatch — closed end-to-end
|
||||
|
||||
> Pre-U-2 the published `ghcr.io/shankar0123/certctl-server` image shipped with `HEALTHCHECK CMD curl -f http://localhost:8443/health`. The server has been HTTPS-only since the v2.2 HTTPS-Everywhere milestone (`cmd/server/main.go::ListenAndServeTLS`, no plaintext fallback, TLS 1.3 pinned), so the probe failed every interval and Docker marked the container `unhealthy` indefinitely. Operators inside docker-compose / Helm / the example stacks were unaffected — compose overrides the HEALTHCHECK with `--cacert + https://`, Helm uses explicit `httpGet` probes that ignore Docker's HEALTHCHECK, and every example compose file overrides with `curl -sfk https://localhost:8443/health`. But anyone running bare `docker run` / Docker Swarm / Nomad / ECS — exactly the "I just pulled the published image" path — saw permanent `unhealthy` status and (depending on orchestrator policy) a restart-loop. Recon for U-2 also surfaced two adjacent bugs from the same v2.2 milestone gap: the Helm chart's `readinessProbe.httpGet.path` pointed at `/readyz`, a route the server doesn't register (only `/health` and `/ready` are wired and bypass the auth middleware), so K8s readiness probes were getting 404/auth-rejection and pods stayed `NotReady`; and the agent image had no HEALTHCHECK at all (the compose override called `pgrep -f certctl-agent` against an image that didn't ship `procps` — latent always-fail). All three are closed in this commit.
|
||||
|
||||
### Fixed
|
||||
|
||||
- **`Dockerfile` HEALTHCHECK now speaks HTTPS.** Bare `docker run` / Swarm / Nomad / ECS users no longer see `unhealthy` forever. The probe uses `curl -fsk https://localhost:8443/health` — `-k` (insecure) is acceptable because the probe is localhost-to-localhost: the same process serving the cert is being probed; the probe never traverses a network. Compose / Helm / examples already perform full cert-chain validation and are unaffected.
|
||||
- **Helm `server.readinessProbe.httpGet.path` corrected from `/readyz` to `/ready`.** The `/readyz` path was never registered as a no-auth route (see `internal/api/router/router.go:81` and `cmd/server/main.go:920`), so K8s readiness probes received 401 (api-key auth rejection) or 404 (when auth was disabled). Pods previously failed to report Ready under most realistic Helm deployments. Liveness probe path (`/health`) was already correct and is unchanged.
|
||||
- **`docs/connectors.md` curl examples** (15 sites) updated from `http://localhost:8443/...` to `https://localhost:8443/...` with a one-time `--cacert "$CA"` extraction note matching the existing pattern in `docs/quickstart.md`. Pre-U-2 these examples silently failed against the HTTPS listener.
|
||||
|
||||
### Added
|
||||
|
||||
- **`Dockerfile.agent` HEALTHCHECK** — `pgrep -f certctl-agent` process-presence check (the agent has no HTTP listener; presence is the right primitive). Bare-`docker run` agents now report health-status the same way compose-managed ones do. Also adds `procps` to the runtime image so `pgrep` is actually available — pre-U-2 the docker-compose override at `deploy/docker-compose.yml:173` called `pgrep -f certctl-agent` against an image that lacked it (latent always-fail; container was reported unhealthy in compose too, just rarely noticed because nothing acted on the signal).
|
||||
- **`deploy/test/healthcheck_test.go`** (`//go:build integration`) — image-level integration tests. `TestPublishedServerImage_HealthcheckSpecUsesHTTPS` builds the server image, inspects `Config.Healthcheck.Test` via `docker inspect`, and asserts the array contains `https://localhost:8443/health` and `-k`, and does NOT contain `http://localhost:8443/health` (negative regression contract). `TestPublishedAgentImage_HealthcheckSpecExists` builds the agent image and asserts the HEALTHCHECK uses `pgrep` against `certctl-agent`. Both tests `t.Skip` cleanly when docker isn't available (sandbox / CI without docker-in-docker). A third runtime test (`TestPublishedServerImage_HealthcheckTransitionsToHealthy`) is a `t.Skip` placeholder until the harness wires a sidecar postgres for image-level smoke — documented honestly so the next refactor adopts it instead of rediscovering the gap.
|
||||
- **CI regression guardrail** in `.github/workflows/ci.yml` (`Forbidden plaintext HEALTHCHECK regression guard (U-2)`) — grep-fails the build if any `Dockerfile*` carries `HEALTHCHECK.*http://` or `curl -f http://localhost:8443/health`. Comments exempt; the `docs/upgrade-to-tls.md:182` post-cutover invariant string (which deliberately documents the expected-failure shape) is out of the guardrail's scope because the guardrail only scans Dockerfiles.
|
||||
|
||||
### Changed
|
||||
|
||||
- `Dockerfile` final-stage HEALTHCHECK lines now carry a long-form docblock explaining the `-k` design choice, the published-image vs compose vs Helm vs examples coverage matrix, and cross-references to the audit closure + the integration test.
|
||||
- `Dockerfile.agent` runtime stage adds `procps` to the apk install so the new HEALTHCHECK and the existing compose override both have a working `pgrep`.
|
||||
- `deploy/helm/certctl/values.yaml` server probes block now carries an explanatory comment naming the registered probe routes (`/health`, `/ready`) and the U-2 closure rationale for the `/readyz` → `/ready` correction.
|
||||
|
||||
## [2.2.0] — 2026-04-19
|
||||
|
||||
### HTTPS Everywhere — The Irony
|
||||
|
||||
Reference in New Issue
Block a user