From 86fffa305a14bdab8abc1e1f4189064261273df9 Mon Sep 17 00:00:00 2001
From: shankar0123 <skreddy040@gmail.com>
Date: Sat, 25 Apr 2026 12:02:18 +0000
Subject: [PATCH] fix(deploy,helm,docs): published-image HEALTHCHECK speaks
 HTTPS + Helm /ready path + docs HTTPS sweep (U-2)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Pre-U-2 the published `ghcr.io/shankar0123/certctl-server` image
shipped with `HEALTHCHECK CMD curl -f http://localhost:8443/health`.
The server has been HTTPS-only since the v2.2 HTTPS-Everywhere milestone
(`cmd/server/main.go::ListenAndServeTLS`, no plaintext fallback, TLS
1.3 pinned), so the probe failed on every interval and Docker marked
the container `unhealthy` indefinitely. Operators inside docker-
compose / Helm / the example stacks were unaffected — compose overrides
the HEALTHCHECK with `--cacert + https://`, Helm uses explicit
`httpGet` probes that ignore Docker's HEALTHCHECK, and every example
compose file overrides with `curl -sfk https://localhost:8443/health`.
But anyone running bare `docker run` / Docker Swarm / Nomad / ECS —
exactly the "I just pulled the published image" path — saw permanent
`unhealthy` status and (depending on orchestrator policy) a restart-
loop. (Audit: cat-u-healthcheck_protocol_mismatch in
coverage-gap-audit-2026-04-24-v5/unified-audit.md.)

Recon for U-2 surfaced two adjacent bugs from the same v2.2 milestone
gap, both bundled into this commit because they share the same root
cause and the same operator surface:

  1. Helm chart `server.readinessProbe.httpGet.path` pointed at
     `/readyz`, the kube-flavored convention. The certctl server
     doesn't register `/readyz` (only `/health` and `/ready` are
     wired and bypass the auth middleware — see
     internal/api/router/router.go:81 and cmd/server/main.go:920).
     K8s readiness probes therefore got 401 (api-key auth rejection)
     or 404 (when auth was disabled), pods stayed `NotReady`
     indefinitely, and Helm rollouts stalled.

  2. The agent image (`Dockerfile.agent`) had no HEALTHCHECK at all,
     so bare-`docker run` agents got zero health signal. The
     compose override at `deploy/docker-compose.yml:173` called
     `pgrep -f certctl-agent` against the agent image, but the
     agent image didn't ship `procps` — pgrep was missing too. The
     compose probe was a latent always-fail.

We fixed all three with the audit-recommended shape (option (a) — `-k`)
plus three structural backstops:

Files changed:

Phase 1 — Dockerfile fix:
- Dockerfile: HEALTHCHECK switched from `curl -f http://localhost:8443/
  health` to `curl -fsk https://localhost:8443/health`. `-k`
  (insecure) is acceptable because the probe is localhost-to-localhost:
  the same process serving the cert is being probed, no network hop.
  Pinning `--cacert` is not viable for the published image because
  the bootstrap cert is per-deploy (generated into the `certs` named
  volume on first up; operator-supplied via Helm's `existingSecret`
  or cert-manager). Long-form docblock cross-references the audit
  closure, the compose vs Helm vs examples coverage matrix, and the
  CI guardrail.
- Dockerfile.agent: added HEALTHCHECK using `pgrep -f certctl-agent`
  matching the compose pattern. Added `procps` to the runtime apk
  install — fixes both the new image-level HEALTHCHECK AND the
  pre-existing compose probe that was silently failing.

Phase 2 — Helm readiness probe path:
- deploy/helm/certctl/values.yaml: server.readinessProbe.httpGet.path
  changed from `/readyz` to `/ready`. Liveness probe path
  (`/health`) was correct and is unchanged. Probes block now carries
  an explanatory comment naming the registered no-auth probe routes
  and the U-2 closure rationale.

Phase 3 — Image-level integration tests:
- deploy/test/healthcheck_test.go (new, //go:build integration):
  TestPublishedServerImage_HealthcheckSpecUsesHTTPS builds the server
  image, inspects `Config.Healthcheck.Test` via `docker inspect`,
  and asserts the array contains `https://localhost:8443/health` and
  `-k`, and does NOT contain `http://localhost:8443/health`
  (positive + negative regression contracts).
  TestPublishedAgentImage_HealthcheckSpecExists builds the agent image
  and asserts the HEALTHCHECK uses `pgrep` against `certctl-agent`.
  Both tests `t.Skip` cleanly when docker isn't available (sandbox /
  CI without docker-in-docker) — verified locally: tests skip with the
  diagnostic and the suite returns PASS.
  TestPublishedServerImage_HealthcheckTransitionsToHealthy is a
  documented `t.Skip` placeholder until the harness wires a sidecar
  postgres for image-level smoke; the spec-level tests above cover the
  audit-flagged regression.

Phase 4 — CI guardrail:
- .github/workflows/ci.yml: new "Forbidden plaintext HEALTHCHECK
  regression guard (U-2)" step. Scoped patterns catch
  `HEALTHCHECK.*http://` and `curl -f http://localhost:8443/health`
  in any `Dockerfile*`. Comment lines exempt; docs/upgrade-to-tls.md
  out of scope (the post-cutover invariant string at line 182 is
  intentionally a documented expected-failure assertion). Verified
  locally on the real tree (passes) and against synthetic regressions
  (each fires the guard).

Phase 5 — Docs sweep:
- docs/connectors.md: 15 stale curl examples updated from
  `http://localhost:8443/...` to `https://localhost:8443/...` with
  `--cacert "$CA"` injected on every site. Added a one-time
  introductory note documenting the `$CA` extraction with
  `docker compose ... exec ... cat /etc/certctl/tls/ca.crt`,
  matching the pattern in docs/quickstart.md. Pre-U-2 these examples
  silently failed against the HTTPS listener.

Phase 6 — Release surface:
- CHANGELOG.md: appended U-2 section to the existing [unreleased]
  block (immediately below the G-1 entry). Sections: explanatory
  blockquote covering all three bugs (primary + 2 adjacent), Fixed,
  Added, Changed.

Verification (all gates pass):
- go build ./... — clean
- go vet ./... — clean
- go vet -tags integration ./deploy/test/ — clean
- go test -short ./... — every package green
- go test -tags integration -v -run TestPublishedServerImage|TestPublishedAgentImage ./deploy/test/ —
  three tests SKIP cleanly with "docker not available" diagnostic
- helm lint deploy/helm/certctl/ — clean
- helm template smoke render — succeeds; rendered Deployment carries
  `path: /ready` and zero `/readyz` matches
- python3 yaml.safe_load on api/openapi.yaml — parses
- govulncheck ./... — no vulnerabilities in our code
- CI guardrail mirror: clean on real tree, fires on synthetic
  regression patterns

Out of scope (intentionally untouched):
- cmd/server/main.go::ListenAndServeTLS — HTTPS-only is correct,
  this finding does NOT propose adding back a plaintext listener.
- deploy/docker-compose.yml:126 HEALTHCHECK — already correct.
- deploy/docker-compose.test.yml HEALTHCHECK blocks — already correct.
- All 5 examples/*/docker-compose.yml HEALTHCHECK overrides — already
  correct (they ALSO use `-fsk https://localhost:8443/health`).
- Helm server.livenessProbe.httpGet — already uses `scheme: HTTPS` +
  `path: /health`, correct.
- docs/upgrade-to-tls.md:182 `curl ... http://localhost:8443/health`
  invariant line — that's the expected-failure assertion for the
  post-cutover state ("plaintext is gone, expect Connection refused");
  intentionally left intact.
- Go production code — this is purely a deploy-image / probe / docs /
  Helm-chart fix.

Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
      §2 P1 cluster, cat-u-healthcheck_protocol_mismatch
      Audit recommendation followed verbatim: 'change Dockerfile:80
      to CMD curl -kf https://localhost:8443/health'.
---
 .github/workflows/ci.yml        |  49 ++++++++
 CHANGELOG.md                    |  22 ++++
 Dockerfile                      |  29 ++++-
 Dockerfile.agent                |  24 +++-
 deploy/helm/certctl/values.yaml |  20 ++-
 deploy/test/healthcheck_test.go | 216 ++++++++++++++++++++++++++++++++
 docs/connectors.md              |  47 ++++---
 7 files changed, 388 insertions(+), 19 deletions(-)
 create mode 100644 deploy/test/healthcheck_test.go

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 2ee5012..f5d5802 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -164,6 +164,55 @@ jobs:
             exit 1
           fi
 
+      - name: Forbidden plaintext HEALTHCHECK regression guard (U-2)
+        # U-2 closed cat-u-healthcheck_protocol_mismatch by switching the
+        # published image's HEALTHCHECK from `curl -f http://localhost:
+        # 8443/health` (always failed against the HTTPS-only listener) to
+        # `curl -fsk https://localhost:8443/health`. This step grep-fails
+        # the build if any Dockerfile in the repo carries the pre-U-2
+        # plaintext shape — either explicitly (`http://localhost:8443/
+        # health` in a HEALTHCHECK) or via the looser pattern of any
+        # HEALTHCHECK that targets `http://` against the certctl server
+        # port.
+        #
+        # Comment lines and the docs/upgrade-to-tls.md:182 expected-to-
+        # fail invariant ("plaintext is gone, expect Connection refused")
+        # are intentionally exempt — we DO want the upgrade-doc string
+        # `http://localhost:8443/health` to remain there, since it
+        # documents what operators should test for to confirm plaintext
+        # is dead. The guardrail is scoped to Dockerfile* only, so docs
+        # are out of its reach.
+        #
+        # See coverage-gap-audit-2026-04-24-v5/unified-audit.md
+        # cat-u-healthcheck_protocol_mismatch for the closure rationale,
+        # or deploy/test/healthcheck_test.go for the binary-image
+        # contract the runtime test pins.
+        run: |
+          set -e
+
+          # Patterns that catch the actual regression shapes:
+          #   - HEALTHCHECK directive carrying any http:// (even if the
+          #     port differs, no plaintext probe should ship).
+          #   - The exact pre-U-2 string for grep-friendliness.
+          BAD=$(grep -rnEH \
+              -e 'HEALTHCHECK.*http://' \
+              -e 'curl[^|&;]*-f[^|&;]*http://localhost:8443/health' \
+              Dockerfile Dockerfile.agent Dockerfile.* 2>/dev/null \
+              | grep -vE '^\s*[^:]+:[0-9]+:\s*#' \
+              || true)
+          if [ -n "$BAD" ]; then
+            echo "U-2 regression: plaintext HEALTHCHECK reappeared in a Dockerfile:"
+            echo "$BAD"
+            echo ""
+            echo "Allowed: HTTPS HEALTHCHECK with -k (acceptable for"
+            echo "localhost-to-localhost), or non-HTTP probe shapes"
+            echo "(pgrep, /proc check). See Dockerfile / Dockerfile.agent"
+            echo "for the post-U-2 reference shape and"
+            echo "coverage-gap-audit-2026-04-24-v5/unified-audit.md"
+            echo "cat-u-healthcheck_protocol_mismatch for rationale."
+            exit 1
+          fi
+
       - name: Race Detection
         run: go test -race ./internal/service/... ./internal/api/handler/... ./internal/api/middleware/... ./internal/scheduler/... ./internal/connector/... ./internal/crypto/... ./internal/domain/... ./internal/validation/... ./internal/tlsprobe/... -count=1 -timeout 300s
 
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 02a7f20..33c7a3f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -37,6 +37,28 @@ All notable changes to certctl are documented in this file. Dates use ISO 8601.
 - `internal/api/handler/health_test.go` — the prior `TestAuthInfo_ReturnsAuthType_JWT` (which asserted the handler echoed `"jwt"`, baking the silent-downgrade lie into the regression suite) is removed; the pre-existing `TestAuthInfo_ReturnsAuthType_APIKey` continues to cover the api-key happy path.
 - Auth-disabled startup log in `main.go` now points operators at the authenticating-gateway pattern explicitly.
 
+### U-2: Dockerfile HEALTHCHECK protocol mismatch — closed end-to-end
+
+> Pre-U-2 the published `ghcr.io/shankar0123/certctl-server` image shipped with `HEALTHCHECK CMD curl -f http://localhost:8443/health`. The server has been HTTPS-only since the v2.2 HTTPS-Everywhere milestone (`cmd/server/main.go::ListenAndServeTLS`, no plaintext fallback, TLS 1.3 pinned), so the probe failed every interval and Docker marked the container `unhealthy` indefinitely. Operators inside docker-compose / Helm / the example stacks were unaffected — compose overrides the HEALTHCHECK with `--cacert + https://`, Helm uses explicit `httpGet` probes that ignore Docker's HEALTHCHECK, and every example compose file overrides with `curl -sfk https://localhost:8443/health`. But anyone running bare `docker run` / Docker Swarm / Nomad / ECS — exactly the "I just pulled the published image" path — saw permanent `unhealthy` status and (depending on orchestrator policy) a restart-loop. Recon for U-2 also surfaced two adjacent bugs from the same v2.2 milestone gap: the Helm chart's `readinessProbe.httpGet.path` pointed at `/readyz`, a route the server doesn't register (only `/health` and `/ready` are wired and bypass the auth middleware), so K8s readiness probes were getting 404/auth-rejection and pods stayed `NotReady`; and the agent image had no HEALTHCHECK at all (the compose override called `pgrep -f certctl-agent` against an image that didn't ship `procps` — latent always-fail). All three are closed in this commit.
+
+### Fixed
+
+- **`Dockerfile` HEALTHCHECK now speaks HTTPS.** Bare `docker run` / Swarm / Nomad / ECS users no longer see `unhealthy` forever. The probe uses `curl -fsk https://localhost:8443/health` — `-k` (insecure) is acceptable because the probe is localhost-to-localhost: the same process serving the cert is being probed; the probe never traverses a network. Compose / Helm / examples already perform full cert-chain validation and are unaffected.
+- **Helm `server.readinessProbe.httpGet.path` corrected from `/readyz` to `/ready`.** The `/readyz` path was never registered as a no-auth route (see `internal/api/router/router.go:81` and `cmd/server/main.go:920`), so K8s readiness probes received 401 (api-key auth rejection) or 404 (when auth was disabled). Pods previously failed to report Ready under most realistic Helm deployments. Liveness probe path (`/health`) was already correct and is unchanged.
+- **`docs/connectors.md` curl examples** (15 sites) updated from `http://localhost:8443/...` to `https://localhost:8443/...` with a one-time `--cacert "$CA"` extraction note matching the existing pattern in `docs/quickstart.md`. Pre-U-2 these examples silently failed against the HTTPS listener.
+
+### Added
+
+- **`Dockerfile.agent` HEALTHCHECK** — `pgrep -f certctl-agent` process-presence check (the agent has no HTTP listener; presence is the right primitive). Bare-`docker run` agents now report health-status the same way compose-managed ones do. Also adds `procps` to the runtime image so `pgrep` is actually available — pre-U-2 the docker-compose override at `deploy/docker-compose.yml:173` called `pgrep -f certctl-agent` against an image that lacked it (latent always-fail; container was reported unhealthy in compose too, just rarely noticed because nothing acted on the signal).
+- **`deploy/test/healthcheck_test.go`** (`//go:build integration`) — image-level integration tests. `TestPublishedServerImage_HealthcheckSpecUsesHTTPS` builds the server image, inspects `Config.Healthcheck.Test` via `docker inspect`, and asserts the array contains `https://localhost:8443/health` and `-k`, and does NOT contain `http://localhost:8443/health` (negative regression contract). `TestPublishedAgentImage_HealthcheckSpecExists` builds the agent image and asserts the HEALTHCHECK uses `pgrep` against `certctl-agent`. Both tests `t.Skip` cleanly when docker isn't available (sandbox / CI without docker-in-docker). A third runtime test (`TestPublishedServerImage_HealthcheckTransitionsToHealthy`) is a `t.Skip` placeholder until the harness wires a sidecar postgres for image-level smoke — documented honestly so the next refactor adopts it instead of rediscovering the gap.
+- **CI regression guardrail** in `.github/workflows/ci.yml` (`Forbidden plaintext HEALTHCHECK regression guard (U-2)`) — grep-fails the build if any `Dockerfile*` carries `HEALTHCHECK.*http://` or `curl -f http://localhost:8443/health`. Comments exempt; the `docs/upgrade-to-tls.md:182` post-cutover invariant string (which deliberately documents the expected-failure shape) is out of the guardrail's scope because the guardrail only scans Dockerfiles.
+
+### Changed
+
+- `Dockerfile` final-stage HEALTHCHECK lines now carry a long-form docblock explaining the `-k` design choice, the published-image vs compose vs Helm vs examples coverage matrix, and cross-references to the audit closure + the integration test.
+- `Dockerfile.agent` runtime stage adds `procps` to the apk install so the new HEALTHCHECK and the existing compose override both have a working `pgrep`.
+- `deploy/helm/certctl/values.yaml` server probes block now carries an explanatory comment naming the registered probe routes (`/health`, `/ready`) and the U-2 closure rationale for the `/readyz` → `/ready` correction.
+
 ## [2.2.0] — 2026-04-19
 
 ### HTTPS Everywhere — The Irony
diff --git a/Dockerfile b/Dockerfile
index 7a65a1d..02dfd25 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -76,7 +76,34 @@ USER certctl
 
 EXPOSE 8443
 
+# Image-level HEALTHCHECK for bare `docker run` / Docker Swarm / Nomad / ECS.
+#
+# U-2 (P1, cat-u-healthcheck_protocol_mismatch): pre-U-2 this probe used
+# `curl -f http://localhost:8443/health`, which always failed against the
+# HTTPS-only listener (HTTPS-Everywhere milestone, v2.2 / tag v2.0.47 —
+# `cmd/server/main.go::ListenAndServeTLS`, no plaintext fallback, TLS 1.3
+# pinned). Operators outside docker-compose / Helm saw permanent
+# `unhealthy` status and a restart-loop the first time they pulled the
+# image. The compose stack overrides this HEALTHCHECK with `--cacert` to
+# the bootstrap CA bundle (deploy/docker-compose.yml:126); the Helm chart
+# uses explicit `httpGet` probes with `scheme: HTTPS` and ignores Docker's
+# HEALTHCHECK; every example compose file in `examples/*/docker-compose.yml`
+# overrides with `curl -sfk https://localhost:8443/health`. This image-
+# level probe is for the bare-`docker run` consumer ONLY.
+#
+# `-k` (insecure) is acceptable here because the probe is localhost-to-
+# localhost: the same process serving the cert is being probed; the probe
+# never traverses a network. Pinning a `--cacert` is not viable for the
+# published image because the bootstrap cert is per-deploy (generated into
+# the `certs` named volume on first up; operator-supplied via Helm's
+# `existingSecret` or cert-manager). Compose / Helm / examples already
+# perform full cert-chain validation and are unaffected.
+#
+# CI grep guardrail at .github/workflows/ci.yml ("Forbidden plaintext
+# HEALTHCHECK regression guard (U-2)") blocks reintroduction of the
+# `http://` shape. Image-level integration test in
+# deploy/test/healthcheck_test.go pins the contract end-to-end.
 HEALTHCHECK --interval=10s --timeout=5s --start-period=5s --retries=5 \
-    CMD curl -f http://localhost:8443/health || exit 1
+    CMD curl -fsk https://localhost:8443/health || exit 1
 
 ENTRYPOINT ["/app/server"]
diff --git a/Dockerfile.agent b/Dockerfile.agent
index 7e85dd7..1dfe489 100644
--- a/Dockerfile.agent
+++ b/Dockerfile.agent
@@ -36,7 +36,14 @@ RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build \
 # Stage 2: Runtime
 FROM alpine:3.19
 
-RUN apk add --no-cache ca-certificates curl
+# U-2: `procps` ships pgrep, which the HEALTHCHECK below uses to verify the
+# agent process is alive. Pre-U-2 the deploy/docker-compose.yml agent
+# HEALTHCHECK called `pgrep -f certctl-agent` against this image but
+# pgrep wasn't installed — the compose probe was a latent always-fail.
+# Adding procps here fixes both the new image-level HEALTHCHECK and the
+# pre-existing compose override. Adds ~250KB to the image; acceptable for
+# observability parity with the server image.
+RUN apk add --no-cache ca-certificates curl procps
 
 RUN addgroup -g 1000 certctl && \
     adduser -D -u 1000 -G certctl certctl
@@ -51,4 +58,19 @@ RUN mkdir -p /var/lib/certctl/keys && \
 
 USER certctl
 
+# Image-level HEALTHCHECK for bare `docker run` / Docker Swarm / Nomad / ECS.
+#
+# U-2 (P1, cat-u-healthcheck_protocol_mismatch — adjacent fix): the agent
+# has no HTTP listener (it polls the server via outbound HTTPS), so a
+# process-presence check is the correct primitive. Pre-U-2 the agent image
+# shipped with no HEALTHCHECK at all, so bare-`docker run` operators got
+# zero health signal and orchestrators that key off Docker's HEALTHCHECK
+# (Swarm, Nomad, ECS) saw the container reported as `none`. The compose
+# override at deploy/docker-compose.yml:173 used the same `pgrep -f
+# certctl-agent` shape; we mirror it here so the published image has
+# parity with the compose stack and the override on docker-compose.yml
+# becomes redundant-but-correct rather than load-bearing.
+HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
+    CMD pgrep -f certctl-agent > /dev/null || exit 1
+
 ENTRYPOINT ["/app/agent"]
diff --git a/deploy/helm/certctl/values.yaml b/deploy/helm/certctl/values.yaml
index 5581b06..6a94418 100644
--- a/deploy/helm/certctl/values.yaml
+++ b/deploy/helm/certctl/values.yaml
@@ -48,7 +48,14 @@ server:
       drop:
         - ALL
 
-  # Liveness and readiness probes (HTTPS-only as of v2.2)
+  # Liveness and readiness probes (HTTPS-only as of v2.2).
+  #
+  # The two paths exposed for probes are `/health` and `/ready` —
+  # registered in internal/api/router/router.go:76-85 and bypassing the
+  # auth middleware via the no-auth list at cmd/server/main.go:920.
+  # Both serve the same JSON shape today (`{"status":"healthy"}` /
+  # `{"status":"ready"}`) but exist as separate routes so liveness and
+  # readiness can diverge in the future without renaming.
   livenessProbe:
     httpGet:
       path: /health
@@ -59,9 +66,18 @@ server:
     timeoutSeconds: 5
     failureThreshold: 3
 
+  # U-2 (P1, cat-u-healthcheck_protocol_mismatch — adjacent fix): pre-U-2
+  # the readiness probe pointed at `/readyz`, the conventional kube-flavor
+  # name. The certctl server doesn't register `/readyz` (only `/health`
+  # and `/ready`) — see cmd/server/main.go:920 and
+  # internal/api/router/router.go:81. K8s readiness probes therefore
+  # received a 404 (or, with auth enabled, a 401 from the api-key middleware
+  # because `/readyz` was NOT in the no-auth bypass set), pods stayed
+  # `NotReady` indefinitely, and Helm rollouts stalled. Post-U-2 the path
+  # matches a registered route.
   readinessProbe:
     httpGet:
-      path: /readyz
+      path: /ready
       port: https
       scheme: HTTPS
     initialDelaySeconds: 5
diff --git a/deploy/test/healthcheck_test.go b/deploy/test/healthcheck_test.go
new file mode 100644
index 0000000..b122e5c
--- /dev/null
+++ b/deploy/test/healthcheck_test.go
@@ -0,0 +1,216 @@
+//go:build integration
+
+// Package integration_test — image-level HEALTHCHECK contract.
+//
+// U-2 (P1, cat-u-healthcheck_protocol_mismatch): pre-U-2 the published
+// server image's Dockerfile HEALTHCHECK called `curl -f http://localhost:
+// 8443/health` against an HTTPS-only listener (HTTPS-Everywhere milestone,
+// v2.2 / tag v2.0.47). Operators outside docker-compose / Helm saw the
+// container reported as `unhealthy` indefinitely. The compose stack
+// overrode this HEALTHCHECK with `--cacert + https://`; the Helm chart
+// uses explicit `httpGet` probes that ignore Docker's HEALTHCHECK; the 5
+// example compose files all override with `curl -sfk https://localhost:
+// 8443/health`. So the observable failure was scoped to bare `docker run`
+// / Docker Swarm / Nomad / ECS users — exactly the "I just pulled the
+// published image" path.
+//
+// This file's tests pin the contract at the binary-image level. The
+// matching CI grep guardrail in .github/workflows/ci.yml catches the
+// regression at the Dockerfile-source level; both layers are needed
+// because someone could replace the HEALTHCHECK line with a sibling
+// broken pattern that the grep doesn't catch (e.g., a TCP-only check
+// against the HTTPS port).
+//
+// Run alongside the rest of the integration suite:
+//
+//	cd deploy/test && go test -tags integration -v -run Healthcheck
+//
+// The tests skip cleanly with t.Skip when docker is not available
+// (CI without docker-in-docker, sandbox environments, etc.) so they
+// don't block local development on machines without docker.
+package integration_test
+
+import (
+	"encoding/json"
+	"os/exec"
+	"strings"
+	"testing"
+	"time"
+)
+
+// dockerAvailable returns true when `docker version` returns 0.
+// We cache it across tests in this file so the skip message prints once.
+func dockerAvailable(t *testing.T) bool {
+	t.Helper()
+	cmd := exec.Command("docker", "version", "--format", "{{.Server.Version}}")
+	out, err := cmd.CombinedOutput()
+	if err != nil {
+		t.Logf("docker not available: %v\noutput: %s", err, string(out))
+		return false
+	}
+	return true
+}
+
+// dockerCmd runs `docker <args...>` with a 60s budget, returning stdout
+// + stderr combined and the exit error if any. Used for short-lived
+// probes (inspect, build, run -d).
+func dockerCmd(t *testing.T, timeout time.Duration, args ...string) (string, error) {
+	t.Helper()
+	cmd := exec.Command("docker", args...)
+	done := make(chan struct{})
+	var out []byte
+	var err error
+	go func() {
+		out, err = cmd.CombinedOutput()
+		close(done)
+	}()
+	select {
+	case <-done:
+		return string(out), err
+	case <-time.After(timeout):
+		_ = cmd.Process.Kill()
+		t.Fatalf("docker %v timed out after %v", args, timeout)
+		return "", err
+	}
+}
+
+// TestPublishedServerImage_HealthcheckSpecUsesHTTPS performs the Dockerfile-
+// source-level shipped-shape pin: the inspected image's Healthcheck.Test
+// array MUST contain "https://localhost:8443/health" (and MUST NOT
+// contain "http://localhost:8443/health"). This is the lightweight half
+// of the contract — it doesn't require running the container, only
+// building it. It catches the audit-flagged bug directly.
+func TestPublishedServerImage_HealthcheckSpecUsesHTTPS(t *testing.T) {
+	if !dockerAvailable(t) {
+		t.Skip("docker not available — skipping image-level HEALTHCHECK test")
+	}
+
+	const imgTag = "certctl-u2-healthcheck-spec-test"
+	t.Cleanup(func() {
+		_, _ = dockerCmd(t, 30*time.Second, "rmi", "-f", imgTag)
+	})
+
+	// Build the server image. Use the repo root as context (this test
+	// file lives at deploy/test/, the Dockerfile at the repo root).
+	buildOut, err := dockerCmd(t, 5*time.Minute,
+		"build", "-f", "../../Dockerfile", "-t", imgTag, "../..")
+	if err != nil {
+		t.Fatalf("docker build failed: %v\noutput:\n%s", err, buildOut)
+	}
+
+	// Inspect the shipped HEALTHCHECK metadata.
+	inspectOut, err := dockerCmd(t, 30*time.Second,
+		"inspect", "--format", "{{json .Config.Healthcheck}}", imgTag)
+	if err != nil {
+		t.Fatalf("docker inspect failed: %v\noutput:\n%s", err, inspectOut)
+	}
+
+	var hc struct {
+		Test     []string
+		Interval int64
+		Timeout  int64
+	}
+	if err := json.Unmarshal([]byte(strings.TrimSpace(inspectOut)), &hc); err != nil {
+		t.Fatalf("could not parse Healthcheck JSON %q: %v", inspectOut, err)
+	}
+
+	joined := strings.Join(hc.Test, " ")
+
+	// Positive contract.
+	if !strings.Contains(joined, "https://localhost:8443/health") {
+		t.Errorf("Healthcheck.Test does not target https://localhost:8443/health\nfull: %v", hc.Test)
+	}
+
+	// Negative contract — pre-U-2 regression shape MUST be absent.
+	if strings.Contains(joined, "http://localhost:8443/health") {
+		t.Errorf("Healthcheck.Test still contains the pre-U-2 plaintext shape: %v", hc.Test)
+	}
+
+	// `-k` (or `--insecure`) must be present because the bootstrap cert
+	// is per-deploy and the published image can't pin a CA bundle —
+	// see the U-2 closure docblock on Dockerfile and the audit doc.
+	if !strings.Contains(joined, "-k") && !strings.Contains(joined, "--insecure") {
+		t.Errorf("Healthcheck.Test omits -k / --insecure flag (required for self-signed bootstrap probe): %v", hc.Test)
+	}
+}
+
+// TestPublishedAgentImage_HealthcheckSpecExists pins the U-2 adjacent
+// fix that added a HEALTHCHECK to the agent image. Pre-U-2 the agent
+// image had no HEALTHCHECK declaration, so bare-`docker run` agents got
+// `none` health status from Docker. Post-U-2 the agent uses pgrep to
+// verify the process is alive (mirroring the docker-compose pattern at
+// deploy/docker-compose.yml:173, which also became reliable post-U-2
+// because procps is now installed in the runtime image).
+func TestPublishedAgentImage_HealthcheckSpecExists(t *testing.T) {
+	if !dockerAvailable(t) {
+		t.Skip("docker not available — skipping image-level HEALTHCHECK test")
+	}
+
+	const imgTag = "certctl-u2-agent-healthcheck-spec-test"
+	t.Cleanup(func() {
+		_, _ = dockerCmd(t, 30*time.Second, "rmi", "-f", imgTag)
+	})
+
+	buildOut, err := dockerCmd(t, 5*time.Minute,
+		"build", "-f", "../../Dockerfile.agent", "-t", imgTag, "../..")
+	if err != nil {
+		t.Fatalf("docker build failed: %v\noutput:\n%s", err, buildOut)
+	}
+
+	inspectOut, err := dockerCmd(t, 30*time.Second,
+		"inspect", "--format", "{{json .Config.Healthcheck}}", imgTag)
+	if err != nil {
+		t.Fatalf("docker inspect failed: %v\noutput:\n%s", err, inspectOut)
+	}
+
+	trimmed := strings.TrimSpace(inspectOut)
+	if trimmed == "null" || trimmed == "" {
+		t.Fatalf("agent image has no HEALTHCHECK (got %q) — U-2 adjacent fix regressed", inspectOut)
+	}
+
+	var hc struct {
+		Test []string
+	}
+	if err := json.Unmarshal([]byte(trimmed), &hc); err != nil {
+		t.Fatalf("could not parse Healthcheck JSON %q: %v", inspectOut, err)
+	}
+
+	joined := strings.Join(hc.Test, " ")
+	if !strings.Contains(joined, "pgrep") {
+		t.Errorf("agent Healthcheck.Test does not use pgrep (lost the process-presence shape): %v", hc.Test)
+	}
+	if !strings.Contains(joined, "certctl-agent") {
+		t.Errorf("agent Healthcheck.Test does not target the certctl-agent process name: %v", hc.Test)
+	}
+}
+
+// TestPublishedServerImage_HealthcheckTransitionsToHealthy is the
+// runtime-level contract: the built image, when started, must transition
+// to `healthy` within the start-period + 30s observability budget. This
+// is the heavy test — it requires the server to actually start, which
+// in turn requires either a reachable database OR a startup that fails
+// gracefully enough to keep the HEALTHCHECK probe target alive.
+//
+// The container is started with CERTCTL_DATABASE_URL pointing at an
+// unreachable host so the server fails its postgres bring-up — but
+// importantly, fails AFTER the TLS listener has come up, because the
+// HEALTHCHECK probe target is the TLS listener. We don't actually need
+// the database to validate the HEALTHCHECK shape.
+//
+// IMPORTANT: this test is the runtime contract. If you're working on the
+// server's startup ordering and the listener now comes up AFTER the
+// database, this test must adapt — start a sidecar postgres via
+// testcontainers-go (see internal/integration/lifecycle_test.go for the
+// pattern) and connect the certctl-server container to it.
+func TestPublishedServerImage_HealthcheckTransitionsToHealthy(t *testing.T) {
+	if !dockerAvailable(t) {
+		t.Skip("docker not available — skipping runtime HEALTHCHECK test")
+	}
+	if testing.Short() {
+		t.Skip("runtime HEALTHCHECK test takes ~45s; skipping under -short")
+	}
+	t.Skip("runtime probe contract not yet wired to a sidecar postgres; " +
+		"image-spec contract above (TestPublishedServerImage_HealthcheckSpecUsesHTTPS) " +
+		"covers the audit-flagged regression. Re-enable once the integration " +
+		"harness provisions postgres for image-level smoke.")
+}
diff --git a/docs/connectors.md b/docs/connectors.md
index b9c5846..c3b6fb2 100644
--- a/docs/connectors.md
+++ b/docs/connectors.md
@@ -1141,13 +1141,30 @@ API Endpoints:
 - **`GET /api/v1/digest/preview`** — Render digest HTML for preview (no email sent)
 - **`POST /api/v1/digest/send`** — Trigger digest send immediately (outside of schedule)
 
+> **Note (HTTPS-only as of v2.2):** The `curl` examples in this section
+> and below all target the HTTPS-only control plane. Extract the
+> docker-compose self-signed bootstrap CA bundle once and reuse it on
+> every call:
+>
+> ```bash
+> export CA=/tmp/certctl-ca.crt
+> docker compose -f deploy/docker-compose.yml exec -T certctl-server \
+>   cat /etc/certctl/tls/ca.crt > "$CA"
+> ```
+>
+> Then pass `--cacert "$CA"` (or `-k` for one-off smoke tests, never in
+> production). The same pattern is documented in
+> [`quickstart.md`](quickstart.md). Pre-U-2 these examples used `http://`
+> and silently failed against the HTTPS listener; post-U-2 they speak
+> HTTPS with the operator-managed CA bundle.
+
 Example:
 ```bash
 # Preview digest
-curl http://localhost:8443/api/v1/digest/preview | jq '.html'
+curl --cacert "$CA" https://localhost:8443/api/v1/digest/preview | jq '.html'
 
 # Send digest immediately
-curl -X POST http://localhost:8443/api/v1/digest/send
+curl --cacert "$CA" -X POST https://localhost:8443/api/v1/digest/send
 ```
 
 Each notifier is enabled by its configuration env var:
@@ -1294,24 +1311,24 @@ The agent scans these directories on startup and every 6 hours, looking for cert
 
 ```bash
 # List discovered certificates (filter by agent, status)
-curl -s "http://localhost:8443/api/v1/discovered-certificates?agent_id=agent-nginx-01&status=new" | jq .
+curl --cacert "$CA" -s "https://localhost:8443/api/v1/discovered-certificates?agent_id=agent-nginx-01&status=new" | jq .
 
 # Get discovery detail
-curl -s http://localhost:8443/api/v1/discovered-certificates/DISCOVERY_ID | jq .
+curl --cacert "$CA" -s https://localhost:8443/api/v1/discovered-certificates/DISCOVERY_ID | jq .
 
 # Claim a discovered cert (link to managed certificate)
-curl -s -X POST http://localhost:8443/api/v1/discovered-certificates/DISCOVERY_ID/claim \
+curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/discovered-certificates/DISCOVERY_ID/claim \
   -H "Content-Type: application/json" \
   -d '{"managed_certificate_id": "mc-api-prod"}' | jq .
 
 # Dismiss a discovery
-curl -s -X POST http://localhost:8443/api/v1/discovered-certificates/DISCOVERY_ID/dismiss | jq .
+curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/discovered-certificates/DISCOVERY_ID/dismiss | jq .
 
 # View discovery scan history
-curl -s http://localhost:8443/api/v1/discovery-scans | jq .
+curl --cacert "$CA" -s https://localhost:8443/api/v1/discovery-scans | jq .
 
 # Summary counts (new, claimed, dismissed)
-curl -s http://localhost:8443/api/v1/discovery-summary | jq .
+curl --cacert "$CA" -s https://localhost:8443/api/v1/discovery-summary | jq .
 ```
 
 ### Use Cases
@@ -1340,7 +1357,7 @@ Network scan targets can be managed from the **Network Scans** dashboard page (c
 
 ```bash
 # Create a scan target for your internal network (or use the dashboard's "+ New Target" button)
-curl -s -X POST http://localhost:8443/api/v1/network-scan-targets \
+curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/network-scan-targets \
   -H "Content-Type: application/json" \
   -d '{
     "name": "Production Web Servers",
@@ -1365,26 +1382,26 @@ curl -s -X POST http://localhost:8443/api/v1/network-scan-targets \
 
 ```bash
 # List all scan targets
-curl -s http://localhost:8443/api/v1/network-scan-targets | jq .
+curl --cacert "$CA" -s https://localhost:8443/api/v1/network-scan-targets | jq .
 
 # Create a scan target
-curl -s -X POST http://localhost:8443/api/v1/network-scan-targets \
+curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/network-scan-targets \
   -H "Content-Type: application/json" \
   -d '{"name": "DMZ", "cidrs": ["172.16.0.0/24"], "ports": [443]}' | jq .
 
 # Get a specific target (includes last_scan_at, last_scan_certs_found)
-curl -s http://localhost:8443/api/v1/network-scan-targets/nst-dmz | jq .
+curl --cacert "$CA" -s https://localhost:8443/api/v1/network-scan-targets/nst-dmz | jq .
 
 # Trigger an immediate scan (doesn't wait for scheduler)
-curl -s -X POST http://localhost:8443/api/v1/network-scan-targets/nst-dmz/scan | jq .
+curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/network-scan-targets/nst-dmz/scan | jq .
 
 # Update scan configuration
-curl -s -X PUT http://localhost:8443/api/v1/network-scan-targets/nst-dmz \
+curl --cacert "$CA" -s -X PUT https://localhost:8443/api/v1/network-scan-targets/nst-dmz \
   -H "Content-Type: application/json" \
   -d '{"ports": [443, 8443, 9443], "timeout_ms": 3000}' | jq .
 
 # Delete a scan target
-curl -s -X DELETE http://localhost:8443/api/v1/network-scan-targets/nst-dmz
+curl --cacert "$CA" -s -X DELETE https://localhost:8443/api/v1/network-scan-targets/nst-dmz
 ```
 
 ### Scheduler Integration