Pre-U-2 the published `ghcr.io/shankar0123/certctl-server` image
shipped with `HEALTHCHECK CMD curl -f http://localhost:8443/health`.
The server has been HTTPS-only since the v2.2 HTTPS-Everywhere milestone
(`cmd/server/main.go::ListenAndServeTLS`, no plaintext fallback, TLS
1.3 pinned), so the probe failed on every interval and Docker marked
the container `unhealthy` indefinitely. Operators inside docker-
compose / Helm / the example stacks were unaffected — compose overrides
the HEALTHCHECK with `--cacert + https://`, Helm uses explicit
`httpGet` probes that ignore Docker's HEALTHCHECK, and every example
compose file overrides with `curl -sfk https://localhost:8443/health`.
But anyone running bare `docker run` / Docker Swarm / Nomad / ECS —
exactly the "I just pulled the published image" path — saw permanent
`unhealthy` status and (depending on orchestrator policy) a restart-
loop. (Audit: cat-u-healthcheck_protocol_mismatch in
coverage-gap-audit-2026-04-24-v5/unified-audit.md.)
Recon for U-2 surfaced two adjacent bugs from the same v2.2 milestone
gap, both bundled into this commit because they share the same root
cause and the same operator surface:
1. Helm chart `server.readinessProbe.httpGet.path` pointed at
`/readyz`, the kube-flavored convention. The certctl server
doesn't register `/readyz` (only `/health` and `/ready` are
wired and bypass the auth middleware — see
internal/api/router/router.go:81 and cmd/server/main.go:920).
K8s readiness probes therefore got 401 (api-key auth rejection)
or 404 (when auth was disabled), pods stayed `NotReady`
indefinitely, and Helm rollouts stalled.
2. The agent image (`Dockerfile.agent`) had no HEALTHCHECK at all,
so bare-`docker run` agents got zero health signal. The
compose override at `deploy/docker-compose.yml:173` called
`pgrep -f certctl-agent` against the agent image, but the
agent image didn't ship `procps` — pgrep was missing too. The
compose probe was a latent always-fail.
We fixed all three with the audit-recommended shape (option (a) — `-k`)
plus three structural backstops:
Files changed:
Phase 1 — Dockerfile fix:
- Dockerfile: HEALTHCHECK switched from `curl -f http://localhost:8443/
health` to `curl -fsk https://localhost:8443/health`. `-k`
(insecure) is acceptable because the probe is localhost-to-localhost:
the same process serving the cert is being probed, no network hop.
Pinning `--cacert` is not viable for the published image because
the bootstrap cert is per-deploy (generated into the `certs` named
volume on first up; operator-supplied via Helm's `existingSecret`
or cert-manager). Long-form docblock cross-references the audit
closure, the compose vs Helm vs examples coverage matrix, and the
CI guardrail.
- Dockerfile.agent: added HEALTHCHECK using `pgrep -f certctl-agent`
matching the compose pattern. Added `procps` to the runtime apk
install — fixes both the new image-level HEALTHCHECK AND the
pre-existing compose probe that was silently failing.
Phase 2 — Helm readiness probe path:
- deploy/helm/certctl/values.yaml: server.readinessProbe.httpGet.path
changed from `/readyz` to `/ready`. Liveness probe path
(`/health`) was correct and is unchanged. Probes block now carries
an explanatory comment naming the registered no-auth probe routes
and the U-2 closure rationale.
Phase 3 — Image-level integration tests:
- deploy/test/healthcheck_test.go (new, //go:build integration):
TestPublishedServerImage_HealthcheckSpecUsesHTTPS builds the server
image, inspects `Config.Healthcheck.Test` via `docker inspect`,
and asserts the array contains `https://localhost:8443/health` and
`-k`, and does NOT contain `http://localhost:8443/health`
(positive + negative regression contracts).
TestPublishedAgentImage_HealthcheckSpecExists builds the agent image
and asserts the HEALTHCHECK uses `pgrep` against `certctl-agent`.
Both tests `t.Skip` cleanly when docker isn't available (sandbox /
CI without docker-in-docker) — verified locally: tests skip with the
diagnostic and the suite returns PASS.
TestPublishedServerImage_HealthcheckTransitionsToHealthy is a
documented `t.Skip` placeholder until the harness wires a sidecar
postgres for image-level smoke; the spec-level tests above cover the
audit-flagged regression.
Phase 4 — CI guardrail:
- .github/workflows/ci.yml: new "Forbidden plaintext HEALTHCHECK
regression guard (U-2)" step. Scoped patterns catch
`HEALTHCHECK.*http://` and `curl -f http://localhost:8443/health`
in any `Dockerfile*`. Comment lines exempt; docs/upgrade-to-tls.md
out of scope (the post-cutover invariant string at line 182 is
intentionally a documented expected-failure assertion). Verified
locally on the real tree (passes) and against synthetic regressions
(each fires the guard).
Phase 5 — Docs sweep:
- docs/connectors.md: 15 stale curl examples updated from
`http://localhost:8443/...` to `https://localhost:8443/...` with
`--cacert "$CA"` injected on every site. Added a one-time
introductory note documenting the `$CA` extraction with
`docker compose ... exec ... cat /etc/certctl/tls/ca.crt`,
matching the pattern in docs/quickstart.md. Pre-U-2 these examples
silently failed against the HTTPS listener.
Phase 6 — Release surface:
- CHANGELOG.md: appended U-2 section to the existing [unreleased]
block (immediately below the G-1 entry). Sections: explanatory
blockquote covering all three bugs (primary + 2 adjacent), Fixed,
Added, Changed.
Verification (all gates pass):
- go build ./... — clean
- go vet ./... — clean
- go vet -tags integration ./deploy/test/ — clean
- go test -short ./... — every package green
- go test -tags integration -v -run TestPublishedServerImage|TestPublishedAgentImage ./deploy/test/ —
three tests SKIP cleanly with "docker not available" diagnostic
- helm lint deploy/helm/certctl/ — clean
- helm template smoke render — succeeds; rendered Deployment carries
`path: /ready` and zero `/readyz` matches
- python3 yaml.safe_load on api/openapi.yaml — parses
- govulncheck ./... — no vulnerabilities in our code
- CI guardrail mirror: clean on real tree, fires on synthetic
regression patterns
Out of scope (intentionally untouched):
- cmd/server/main.go::ListenAndServeTLS — HTTPS-only is correct,
this finding does NOT propose adding back a plaintext listener.
- deploy/docker-compose.yml:126 HEALTHCHECK — already correct.
- deploy/docker-compose.test.yml HEALTHCHECK blocks — already correct.
- All 5 examples/*/docker-compose.yml HEALTHCHECK overrides — already
correct (they ALSO use `-fsk https://localhost:8443/health`).
- Helm server.livenessProbe.httpGet — already uses `scheme: HTTPS` +
`path: /health`, correct.
- docs/upgrade-to-tls.md:182 `curl ... http://localhost:8443/health`
invariant line — that's the expected-failure assertion for the
post-cutover state ("plaintext is gone, expect Connection refused");
intentionally left intact.
- Go production code — this is purely a deploy-image / probe / docs /
Helm-chart fix.
Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
§2 P1 cluster, cat-u-healthcheck_protocol_mismatch
Audit recommendation followed verbatim: 'change Dockerfile:80
to CMD curl -kf https://localhost:8443/health'.
Addresses Medium finding M-4 in the audit report. The multi-stage
Dockerfiles previously had no ARG declarations for HTTP_PROXY,
HTTPS_PROXY, or NO_PROXY, so corporate-proxy environments silently
failed at 'npm ci' (frontend stage) and 'go mod download' (Go builder).
The npm retry idiom (`npm ci --include=dev || npm ci --include=dev`)
masked the failure because the upstream 'Exit handler never called!'
bug exits 0 despite the install crash.
Fix: thread HTTP_PROXY / HTTPS_PROXY / NO_PROXY ARGs through every
Docker build stage that performs network I/O, re-export them as ENV
with both upper- and lower-case aliases (apk/curl/npm read lowercase;
Go/Node read uppercase), and forward the host shell's environment via
`build.args:` in every compose file and `build-args:` in the release
workflow's docker/build-push-action steps. Defaults are empty strings
so un-proxied builds remain byte-identical to the pre-fix tree.
Scope: Dockerfile (frontend + Go builder stages), Dockerfile.agent
(Go builder stage), deploy/docker-compose.yml (server + agent),
deploy/docker-compose.dev.yml (server + agent), deploy/docker-compose.test.yml
(server + agent), .github/workflows/release.yml (both docker/build-push-action
v6 invocations). Zero Go, web, test, or runtime code changes. Zero
base-image changes. Existing npm `||` retry idiom and `ARG TARGETARCH`
preserved verbatim.
CWE-1173 (Improper Use of Validated Input) / CWE-16 (Configuration).
Verification:
- YAML parses clean across all four compose files and release.yml.
- yamllint -d relaxed: clean exit across all five YAML files.
- All six `build.args:` blocks expose HTTP_PROXY, HTTPS_PROXY, NO_PROXY
with default-empty ${VAR:-} substitution.
- Both release.yml docker/build-push-action steps expose the same
three keys sourced from ${{ secrets.HTTP_PROXY }}, etc.
- Dockerfiles contain 5 proxy ARG declarations total (Dockerfile has 2
stages × 3 ARGs = 6 lines, Dockerfile.agent has 1 stage × 3 ARGs = 3
lines); lowercase ENV aliases verified present in every stage.
- git diff --shortstat: 6 files changed, 117 insertions(+), 0 deletions.
Pure additive.
Docker-live verification (`docker build`, `docker compose config`)
deferred to CI / post-commit smoke because the sandbox has no Docker
runtime. hadolint, go, golangci-lint, govulncheck likewise unavailable
in the sandbox; per-layer CI coverage gates (service 55%, handler 60%,
domain 40%, middleware 30%) are trivially unaffected as M-4 touches
zero Go source files.
npm has a known bug where `npm ci` can crash with "Exit handler never
called!" behind corporate proxies yet exit with code 0. This adds a
single retry on failure and verifies tsc is actually installed before
proceeding to build.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous fix (--include=dev) was necessary but insufficient. The
real issue is that node_modules created by npm ci in one layer can be
lost when COPY web/ . creates the next layer — depending on the Docker
storage driver (fuse-overlayfs, vfs). Merging install and build into a
single RUN eliminates the layer boundary entirely.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
npm ci skips devDependencies when NODE_ENV=production leaks from the
host environment into the Docker build. This breaks the frontend stage
because typescript and vite are devDependencies. Adding --include=dev
makes the install hermetic regardless of host environment.
Closes#9
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
go.mod requires go >= 1.25.0 but both Dockerfiles used golang:1.22-alpine,
causing `go mod download` to fail during container build.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Dockerfile was copying raw web/ source files but never building the
frontend. Since .gitignore excludes web/dist/, the Docker image had no
built frontend — only the Vite dev entry point (web/index.html) which
references /src/main.tsx and only works with the Vite dev server. This
caused a blank page when accessing the dashboard.
Fix: Add a Node.js build stage that runs npm ci && npm run build, then
copy only web/dist/ into the final image. Also add web/node_modules and
web/dist to .dockerignore to keep the build context clean.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Runtime fixes:
- Fix env var mismatch (CERTCTL_DB_URL → CERTCTL_DATABASE_URL)
- Fix table name mismatches (certificates → managed_certificates, notifications → notification_events)
- Add renewal_policy_id to certificate queries
- Remove non-existent created_at from notification queries
- Add env var fallback for agent CLI flags
- Graceful degradation for missing notifiers/issuers in demo mode
- Copy web/ directory in Dockerfile for dashboard serving
Service layer:
- Implement handler-service interface pattern across all services
- Wire up certificate, agent, job, policy, team, owner, audit, notification services
Documentation:
- Add concepts.md: beginner-friendly guide to TLS, CAs, private keys
- Rewrite quickstart.md with accurate API examples matching actual handlers
- Add demo-advanced.md: interactive demo with cert issuance and automated script
- Update architecture.md with correct table names and connector interfaces
- Update connectors.md to match actual Go interface signatures
- Update demo-guide.md with cross-references to new docs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>