mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 16:11:29 +00:00
86fffa305a
Pre-U-2 the published `ghcr.io/shankar0123/certctl-server` image shipped with `HEALTHCHECK CMD curl -f http://localhost:8443/health`. The server has been HTTPS-only since the v2.2 HTTPS-Everywhere milestone (`cmd/server/main.go::ListenAndServeTLS`, no plaintext fallback, TLS 1.3 pinned), so the probe failed on every interval and Docker marked the container `unhealthy` indefinitely. Operators inside docker- compose / Helm / the example stacks were unaffected — compose overrides the HEALTHCHECK with `--cacert + https://`, Helm uses explicit `httpGet` probes that ignore Docker's HEALTHCHECK, and every example compose file overrides with `curl -sfk https://localhost:8443/health`. But anyone running bare `docker run` / Docker Swarm / Nomad / ECS — exactly the "I just pulled the published image" path — saw permanent `unhealthy` status and (depending on orchestrator policy) a restart- loop. (Audit: cat-u-healthcheck_protocol_mismatch in coverage-gap-audit-2026-04-24-v5/unified-audit.md.) Recon for U-2 surfaced two adjacent bugs from the same v2.2 milestone gap, both bundled into this commit because they share the same root cause and the same operator surface: 1. Helm chart `server.readinessProbe.httpGet.path` pointed at `/readyz`, the kube-flavored convention. The certctl server doesn't register `/readyz` (only `/health` and `/ready` are wired and bypass the auth middleware — see internal/api/router/router.go:81 and cmd/server/main.go:920). K8s readiness probes therefore got 401 (api-key auth rejection) or 404 (when auth was disabled), pods stayed `NotReady` indefinitely, and Helm rollouts stalled. 2. The agent image (`Dockerfile.agent`) had no HEALTHCHECK at all, so bare-`docker run` agents got zero health signal. The compose override at `deploy/docker-compose.yml:173` called `pgrep -f certctl-agent` against the agent image, but the agent image didn't ship `procps` — pgrep was missing too. The compose probe was a latent always-fail. We fixed all three with the audit-recommended shape (option (a) — `-k`) plus three structural backstops: Files changed: Phase 1 — Dockerfile fix: - Dockerfile: HEALTHCHECK switched from `curl -f http://localhost:8443/ health` to `curl -fsk https://localhost:8443/health`. `-k` (insecure) is acceptable because the probe is localhost-to-localhost: the same process serving the cert is being probed, no network hop. Pinning `--cacert` is not viable for the published image because the bootstrap cert is per-deploy (generated into the `certs` named volume on first up; operator-supplied via Helm's `existingSecret` or cert-manager). Long-form docblock cross-references the audit closure, the compose vs Helm vs examples coverage matrix, and the CI guardrail. - Dockerfile.agent: added HEALTHCHECK using `pgrep -f certctl-agent` matching the compose pattern. Added `procps` to the runtime apk install — fixes both the new image-level HEALTHCHECK AND the pre-existing compose probe that was silently failing. Phase 2 — Helm readiness probe path: - deploy/helm/certctl/values.yaml: server.readinessProbe.httpGet.path changed from `/readyz` to `/ready`. Liveness probe path (`/health`) was correct and is unchanged. Probes block now carries an explanatory comment naming the registered no-auth probe routes and the U-2 closure rationale. Phase 3 — Image-level integration tests: - deploy/test/healthcheck_test.go (new, //go:build integration): TestPublishedServerImage_HealthcheckSpecUsesHTTPS builds the server image, inspects `Config.Healthcheck.Test` via `docker inspect`, and asserts the array contains `https://localhost:8443/health` and `-k`, and does NOT contain `http://localhost:8443/health` (positive + negative regression contracts). TestPublishedAgentImage_HealthcheckSpecExists builds the agent image and asserts the HEALTHCHECK uses `pgrep` against `certctl-agent`. Both tests `t.Skip` cleanly when docker isn't available (sandbox / CI without docker-in-docker) — verified locally: tests skip with the diagnostic and the suite returns PASS. TestPublishedServerImage_HealthcheckTransitionsToHealthy is a documented `t.Skip` placeholder until the harness wires a sidecar postgres for image-level smoke; the spec-level tests above cover the audit-flagged regression. Phase 4 — CI guardrail: - .github/workflows/ci.yml: new "Forbidden plaintext HEALTHCHECK regression guard (U-2)" step. Scoped patterns catch `HEALTHCHECK.*http://` and `curl -f http://localhost:8443/health` in any `Dockerfile*`. Comment lines exempt; docs/upgrade-to-tls.md out of scope (the post-cutover invariant string at line 182 is intentionally a documented expected-failure assertion). Verified locally on the real tree (passes) and against synthetic regressions (each fires the guard). Phase 5 — Docs sweep: - docs/connectors.md: 15 stale curl examples updated from `http://localhost:8443/...` to `https://localhost:8443/...` with `--cacert "$CA"` injected on every site. Added a one-time introductory note documenting the `$CA` extraction with `docker compose ... exec ... cat /etc/certctl/tls/ca.crt`, matching the pattern in docs/quickstart.md. Pre-U-2 these examples silently failed against the HTTPS listener. Phase 6 — Release surface: - CHANGELOG.md: appended U-2 section to the existing [unreleased] block (immediately below the G-1 entry). Sections: explanatory blockquote covering all three bugs (primary + 2 adjacent), Fixed, Added, Changed. Verification (all gates pass): - go build ./... — clean - go vet ./... — clean - go vet -tags integration ./deploy/test/ — clean - go test -short ./... — every package green - go test -tags integration -v -run TestPublishedServerImage|TestPublishedAgentImage ./deploy/test/ — three tests SKIP cleanly with "docker not available" diagnostic - helm lint deploy/helm/certctl/ — clean - helm template smoke render — succeeds; rendered Deployment carries `path: /ready` and zero `/readyz` matches - python3 yaml.safe_load on api/openapi.yaml — parses - govulncheck ./... — no vulnerabilities in our code - CI guardrail mirror: clean on real tree, fires on synthetic regression patterns Out of scope (intentionally untouched): - cmd/server/main.go::ListenAndServeTLS — HTTPS-only is correct, this finding does NOT propose adding back a plaintext listener. - deploy/docker-compose.yml:126 HEALTHCHECK — already correct. - deploy/docker-compose.test.yml HEALTHCHECK blocks — already correct. - All 5 examples/*/docker-compose.yml HEALTHCHECK overrides — already correct (they ALSO use `-fsk https://localhost:8443/health`). - Helm server.livenessProbe.httpGet — already uses `scheme: HTTPS` + `path: /health`, correct. - docs/upgrade-to-tls.md:182 `curl ... http://localhost:8443/health` invariant line — that's the expected-failure assertion for the post-cutover state ("plaintext is gone, expect Connection refused"); intentionally left intact. - Go production code — this is purely a deploy-image / probe / docs / Helm-chart fix. Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md §2 P1 cluster, cat-u-healthcheck_protocol_mismatch Audit recommendation followed verbatim: 'change Dockerfile:80 to CMD curl -kf https://localhost:8443/health'.
77 lines
2.7 KiB
Docker
77 lines
2.7 KiB
Docker
# Multi-stage build for certctl agent
|
|
# Stage 1: Build
|
|
FROM golang:1.25-alpine AS builder
|
|
|
|
# Proxy propagation (M-4, Issue #9) — defaulted to empty so un-proxied builds
|
|
# behave identically to the pre-fix tree. When `HTTP_PROXY`/`HTTPS_PROXY`/
|
|
# `NO_PROXY` are forwarded via `docker build --build-arg` (or compose
|
|
# `build.args`), they are re-exported as ENV with both upper- and lower-case
|
|
# names because apk and curl read the lowercase variants while Go reads the
|
|
# uppercase ones.
|
|
ARG HTTP_PROXY=
|
|
ARG HTTPS_PROXY=
|
|
ARG NO_PROXY=
|
|
ENV HTTP_PROXY=${HTTP_PROXY} \
|
|
HTTPS_PROXY=${HTTPS_PROXY} \
|
|
NO_PROXY=${NO_PROXY} \
|
|
http_proxy=${HTTP_PROXY} \
|
|
https_proxy=${HTTPS_PROXY} \
|
|
no_proxy=${NO_PROXY}
|
|
|
|
RUN apk add --no-cache git ca-certificates
|
|
|
|
WORKDIR /app
|
|
|
|
COPY go.mod go.sum ./
|
|
RUN go mod download
|
|
|
|
COPY . .
|
|
|
|
ARG TARGETARCH=amd64
|
|
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build \
|
|
-ldflags="-w -s" \
|
|
-o bin/agent \
|
|
./cmd/agent
|
|
|
|
# Stage 2: Runtime
|
|
FROM alpine:3.19
|
|
|
|
# U-2: `procps` ships pgrep, which the HEALTHCHECK below uses to verify the
|
|
# agent process is alive. Pre-U-2 the deploy/docker-compose.yml agent
|
|
# HEALTHCHECK called `pgrep -f certctl-agent` against this image but
|
|
# pgrep wasn't installed — the compose probe was a latent always-fail.
|
|
# Adding procps here fixes both the new image-level HEALTHCHECK and the
|
|
# pre-existing compose override. Adds ~250KB to the image; acceptable for
|
|
# observability parity with the server image.
|
|
RUN apk add --no-cache ca-certificates curl procps
|
|
|
|
RUN addgroup -g 1000 certctl && \
|
|
adduser -D -u 1000 -G certctl certctl
|
|
|
|
WORKDIR /app
|
|
|
|
COPY --from=builder /app/bin/agent .
|
|
|
|
# Create key storage directory for agent-side keygen
|
|
RUN mkdir -p /var/lib/certctl/keys && \
|
|
chown -R certctl:certctl /app /var/lib/certctl
|
|
|
|
USER certctl
|
|
|
|
# Image-level HEALTHCHECK for bare `docker run` / Docker Swarm / Nomad / ECS.
|
|
#
|
|
# U-2 (P1, cat-u-healthcheck_protocol_mismatch — adjacent fix): the agent
|
|
# has no HTTP listener (it polls the server via outbound HTTPS), so a
|
|
# process-presence check is the correct primitive. Pre-U-2 the agent image
|
|
# shipped with no HEALTHCHECK at all, so bare-`docker run` operators got
|
|
# zero health signal and orchestrators that key off Docker's HEALTHCHECK
|
|
# (Swarm, Nomad, ECS) saw the container reported as `none`. The compose
|
|
# override at deploy/docker-compose.yml:173 used the same `pgrep -f
|
|
# certctl-agent` shape; we mirror it here so the published image has
|
|
# parity with the compose stack and the override on docker-compose.yml
|
|
# becomes redundant-but-correct rather than load-bearing.
|
|
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
|
|
CMD pgrep -f certctl-agent > /dev/null || exit 1
|
|
|
|
ENTRYPOINT ["/app/agent"]
|