Five audit findings, all category cat-d or cat-f, all rooted in two
frontend files. The dashboard silently lied:
cat-d-359e92c20cbf [P1, primary] — Agent: 'Stale' dead key + 'Degraded'
neutral fallthrough
cat-d-9f4c8e4a91f1 [P2] — Notification: 'dead' missing
cat-d-1447e04732e7 [P3] — Cert: 'PendingIssuance' dead key
cat-f-cert_detail_page_key_render_fallback [P2] — render-site reads
cert.key_algorithm directly
cat-f-ae0d06b6588f [P2] — Certificate TS phantom fields (root cause)
Pre-D-1, agents in the only Go AgentStatus that means 'needs operator
attention' (Degraded) rendered as default neutral grey because StatusBadge
mapped 'Stale' (a key Go has never emitted) to yellow. Dead-letter
notifications were visually equated with 'read' (operator-acknowledged).
The Certificate badge map carried a 'PendingIssuance' key no Go enum emits.
CertificateDetailPage's Key Algorithm and Key Size rows always rendered
'—' even when the data was a single fetch away — the lookup went through
cert.key_algorithm / cert.key_size directly, both phantom Certificate TS
fields. Trim the TS type so the missing-data case is explicit; fix the
render site to use latestVersion?.field; pin the contract with a 38-case
Vitest property test that walks every Go enum.
StatusBadge (web/src/components/StatusBadge.tsx)
- Drop 'Stale' (Agent dead key) + 'PendingIssuance' (Cert dead key).
- Add 'Degraded' (Agent → badge-warning) + 'dead' (Notification → badge-danger).
- Add leading docblock naming Go-side source-of-truth file for every
status family and pointing at the property test as regression vector.
Property test (web/src/components/StatusBadge.test.tsx — 38 cases)
- Iterates every Go-emitted enum value (AgentStatus, CertificateStatus,
JobStatus, NotificationStatus, DiscoveryStatus, HealthStatus) plus the
two frontend-synthesized Enabled/Disabled labels, asserts every value
gets a non-default class (or an explicit 'badge badge-neutral' for the
five intentionally-neutral terminal values: Archived, Cancelled,
Dismissed, read, unknown).
- Negative assertions: 'Stale' and 'PendingIssuance' must fall through
to the dictionary default — re-adding either key surfaces here.
- Specific UX-correctness assertions: 'dead' → badge-danger,
'Degraded' → badge-warning.
- Unknown-status fallthrough preserves label text.
Certificate TS trim (web/src/api/types.ts)
- Drop serial_number?, fingerprint_sha256?, key_algorithm?, key_size?,
issued_at? from Certificate. Go's ManagedCertificate has never carried
these — they live on CertificateVersion. Post-trim a cert.X access for
any of the five fields is a TS compile error.
- Leading docblock cross-references the closure rationale and the
latestVersion fallback pattern.
Render-site fix (web/src/pages/CertificateDetailPage.tsx)
- Key Algorithm / Key Size rows now read latestVersion?.key_algorithm /
latestVersion?.key_size, mirroring the existing latestVersion fallback
used a few lines above for serial_number / fingerprint_sha256.
- The same edit also tightened the serial / fingerprint / issued_at
derivations to drop the now-impossible 'cert.X || latestVersion?.X'
cert-side leg (cert.serial_number is a TS error post-trim).
Type-test regression (web/src/api/types.test.ts)
- Certificate literal construction pinned post-trim — adding any of the
five fields back makes the literal an excess-property TS error.
- Sibling CertificateVersion literal pinning the trimmed fields still
live on the version envelope (so the CertificateDetailPage fallback
path can't break).
OpenAPI (api/openapi.yaml)
- ManagedCertificate schema unchanged — was already correct (no phantom
fields). Added a leading comment cross-referencing the D-5 closure for
future readers.
CI guardrail (.github/workflows/ci.yml)
- 'Forbidden StatusBadge dead-key + Certificate phantom-field regression
guard (D-1)'. Two grep blocks: catches Stale/PendingIssuance map
literals in StatusBadge.tsx; uses an awk-scoped window over the
'export interface Certificate {' block in types.ts to catch the five
phantom fields reappearing while explicitly excluding CertificateVersion
(which legitimately carries them). Comments + test files exempt.
Verification
- Backend build/vet/test -short -race all clean across handler/router/
middleware packages.
- Frontend tsc --noEmit clean.
- Vitest 256 → 296 tests (+40: 38 from new StatusBadge test, 2 from D-5
Certificate trim regression in types.test.ts).
- OpenAPI YAML parses (87 paths).
- Both CI guardrail patterns clear on the post-fix tree; both fire
against synthetic regression patterns (re-add Stale → fires; re-add
serial_number? to Certificate → fires).
Out of scope (deferred)
- diff-05x06-* type drifts for Agent/DeploymentTarget/Notification/
DiscoveredCertificate/Issuer TS interfaces. Per-type field-by-field
Go ↔ TS diff is codegen-shaped, not edit-shaped — warrants its own
D-2 master prompt. Noted in CHANGELOG follow-ups section.
Changelog
All notable changes to certctl are documented in this file. Dates use ISO 8601. Versions follow Semantic Versioning.
[unreleased] — 2026-04-25
D-1: StatusBadge enum drift + Certificate phantom fields — closed end-to-end
The dashboard silently lied in five places. Agents in the `Degraded` state (the only Go-side AgentStatus that means "needs operator attention") rendered as default neutral grey because StatusBadge mapped `Stale` (a key Go has never emitted) to yellow and let the real `Degraded` value fall through to the dictionary default. Dead-letter notifications (`status: 'dead'`, retries exhausted) rendered as default neutral, visually equated with `read` (operator-acknowledged). The Certificate badge map carried a `PendingIssuance` key that no Go enum value ever emits: a dead key and a latent confusion vector. CertificateDetailPage's Key Algorithm and Key Size rows always rendered `—` even when the data was a single fetch away, because the lookup went through `cert.key_algorithm` directly, and the underlying `Certificate` TypeScript interface declared five optional fields (`serial_number`, `fingerprint_sha256`, `key_algorithm`, `key_size`, `issued_at`) that Go's `ManagedCertificate` has never carried (those values live on `CertificateVersion`). Five findings, two files, one frontend rebuild. Pre-D-1 the only reason this didn't trip a regression suite was that the regression suite never asserted "every Go-emitted enum value gets a non-default StatusBadge class". D-1 fixes the visual lies and adds a 38-case Vitest property test that walks every Go enum and pins the contract.
Breaking Changes
- `Certificate` TypeScript interface no longer declares `serial_number?`, `fingerprint_sha256?`, `key_algorithm?`, `key_size?`, or `issued_at?`. The Go `ManagedCertificate` (`internal/domain/certificate.go`) has never emitted these fields on list responses; they live on `CertificateVersion` and are reachable via `getCertificateVersions(id)`. Pre-D-5 (the cat-f phantom-fields finding) the optional declarations made `cert.X` always-undefined on lists, and downstream consumers silently rendered `—` for every cert. Post-D-5 a `cert.X` access for any of the five fields is a TypeScript compile error, forcing every consumer to acknowledge the version-fallback pattern. The OpenAPI `ManagedCertificate` schema was already correct; only the TS type had drifted.
- StatusBadge no longer maps `Stale` (Agent) or `PendingIssuance` (Certificate). Both were dead keys: no Go enum value emits them. Operators with custom CSS hooked off `.badge-warning` for `Stale` will see the same color come back via the new `Degraded` mapping (same class), but JS/TS code that switches on the literal `'Stale'` will need to switch on `'Degraded'` instead. The `PendingIssuance` deletion has no documented downstream consumer.
Added
- `web/src/components/StatusBadge.tsx`: `Degraded` (Agent) → `badge-warning` and `dead` (Notification) → `badge-danger`. Both mappings restore the color contract for the two real Go-side values that previously fell through to the dictionary default. The `Degraded` mapping cross-references `internal/domain/connector.go::AgentStatusDegraded`; the `dead` mapping cross-references `internal/domain/notification.go::NotificationStatusDead`.
- `web/src/components/StatusBadge.test.tsx`: 38-case Vitest property test. Iterates every Go-side enum value (`AgentStatus`, `CertificateStatus`, `JobStatus`, `NotificationStatus`, `DiscoveryStatus`, `HealthStatus`) plus the two frontend-synthesized `Enabled`/`Disabled` labels, and asserts every value gets a non-default class (or, for the five intentionally-neutral terminal values like `Archived`/`Cancelled`/`read`, an explicit `badge badge-neutral`). Includes negative assertions on the deleted `Stale` and `PendingIssuance` keys (must fall through to neutral) and specific UX-correctness assertions on the operator-attention semantics (`dead` → danger, `Degraded` → warning).
- `web/src/api/types.test.ts`: D-5 Certificate phantom-fields trim regression. A `Certificate` literal construction pinned post-trim, plus a sibling `CertificateVersion` literal pinning that the trimmed fields still live on the version envelope. The `tsc --noEmit` gate in CI is the primary enforcement; the test is the documentation of intent.
- CI regression guardrail in `.github/workflows/ci.yml` (`Forbidden StatusBadge dead-key + Certificate phantom-field regression guard (D-1)`). Two grep blocks: (1) catches `Stale: 'badge-...'` or `PendingIssuance: 'badge-...'` in `web/src/components/StatusBadge.tsx`; (2) uses an awk-scoped window over the `export interface Certificate {` block in `web/src/api/types.ts` to catch any of the five phantom fields reappearing, and explicitly excludes the `CertificateVersion` block, which legitimately carries them. Verified locally on the post-fix tree (passes) and against synthetic regressions (each fires the guardrail).
Changed
- `web/src/pages/CertificateDetailPage.tsx`: Key Algorithm and Key Size rows now read from `latestVersion?.key_algorithm` / `latestVersion?.key_size`. Mirrors the existing `latestVersion` fallback used for `serial_number` and `fingerprint_sha256` earlier in the same file. Pre-D-4 these rows accessed `cert.key_algorithm` and `cert.key_size` directly (both phantom fields per D-5), so the rows always rendered `—`. The same file's `serial_number` / `fingerprint_sha256` / `issued_at` derivations were also simplified to drop the now-impossible `cert.X || latestVersion?.X` cert-side leg.
- `web/src/components/StatusBadge.tsx` adds a leading docblock naming the Go-side source-of-truth file for every status family it maps (`AgentStatus`, `CertificateStatus`, `JobStatus`, `NotificationStatus`, `DiscoveryStatus`, `HealthStatus`) and pointing at the property test as the regression vector for future enum changes.
- `api/openapi.yaml::ManagedCertificate` gets a leading comment cross-referencing the D-5 closure and explaining why per-issuance fields legitimately don't appear here (they live on `CertificateVersion`). Schema property list unchanged; the OpenAPI spec was already correct.
Closed audit findings
- `cat-d-359e92c20cbf` (P1 primary). Agent: `Stale` dead key + `Degraded` neutral fallthrough
- `cat-d-9f4c8e4a91f1` (P2). Notification: `dead` missing
- `cat-d-1447e04732e7` (P3). Certificate: `PendingIssuance` dead key
- `cat-f-cert_detail_page_key_render_fallback` (P2). Render-site uses `cert.key_algorithm` directly
- `cat-f-ae0d06b6588f` (P2). Certificate TS phantom fields (root cause)
Known follow-ups (deferred from D-1 scope)
The audit's broader type-drift cluster (diff-05x06-7cdf4e78ae24 Agent TS, diff-05x06-2044a46f4dd0 DeploymentTarget TS, diff-05x06-caba9eb3620e Notification TS, diff-05x06-85ab6b98a2f7 DiscoveredCertificate TS, diff-05x06-97fab8783a5c Issuer TS) is out of D-1 scope. Recon for those is per-type field-by-field diff Go ↔ TS — codegen-shaped, not edit-shaped — and warrants its own D-2 master prompt.
U-3: GitHub #10 reopened — fresh-clone first-up postgres init failure (P1) — closed end-to-end
Operator mikeakasully cloned v2.0.50 fresh, ran the canonical quickstart `docker compose -f deploy/docker-compose.yml up -d --build`, and postgres reported `unhealthy` indefinitely; dependent containers (certctl-server, certctl-agent) never started. Root cause: the deploy compose stack mounted both a hand-curated subset of `migrations/*.up.sql` and `seed.sql` into postgres `/docker-entrypoint-initdb.d/`. Postgres applied them at initdb time. Once `seed.sql` referenced columns added by migrations after the mounted cutoff (e.g., `policy_rules.severity` from migration 000013, which the mount list never included), initdb crashed mid-seed and the container loop wedged. Two sources of truth (the mount list and the in-tree migration ladder) diverged the moment a seed-touching migration shipped, and the only thing that fixed it was hand-editing the compose file every release. The U-3 closure removes the dual source: postgres now boots empty and the server applies the entire migration ladder + seed at startup via `RunMigrations` + `RunSeed`. Same pattern Helm has used since day one. Bundled with four ride-along audit findings whose fixes are in adjacent code (column rename, missing column, dropped orphan columns, new build-identity endpoint) so operators take the schema-change pain only once.
Breaking Changes
- `deploy/docker-compose.yml` postgres no longer initdb-mounts the migration files or `seed.sql`. Operators running on a populated `postgres_data` volume from a pre-U-3 release see no behavioral change (the schema is already in place; `RunMigrations` is `IF NOT EXISTS` and `RunSeed` is `ON CONFLICT DO NOTHING`). Operators running on a fresh clone now rely on the server to apply both, which is the bug fix. There is no rollback path other than re-introducing the dual-source-of-truth hazard. See `internal/repository/postgres/db.go::RunSeed` for the runtime contract.
- `migrations/000017_db_coupling_cleanup.up.sql` renames `renewal_policies.retry_interval_minutes` → `retry_interval_seconds`. The column always held seconds; the column name lied (`cat-o-retry_interval_unit_mismatch`). Operators running raw SQL against the old name need to update their queries. The Go layer (`internal/repository/postgres/renewal_policy.go`) is updated in lockstep so the in-tree code path is unaffected.
- `migrations/000017_db_coupling_cleanup.up.sql` drops `network_scan_targets.health_check_enabled` and `network_scan_targets.health_check_interval_seconds`. These columns were declared by a long-ago migration but never wired into Go code (`cat-o-health_check_column_orphans`); they were schema noise that confused operators reading raw SQL. Anyone with custom dashboards selecting those columns will break.
- The compose demo overlay (`deploy/docker-compose.demo.yml`) no longer initdb-mounts `seed_demo.sql`. It now sets `CERTCTL_DEMO_SEED=true` and the server applies the demo seed at boot via `RunDemoSeed` after baseline migrations + `seed.sql` are in place. Same single-source-of-truth pattern as the production path.
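
The runtime seed contract described above (idempotent application, missing file treated as a no-op) could look roughly like the sketch below. It is an illustration only, not the code in `internal/repository/postgres/db.go`: the `execer` interface, the `runSeed` name, and the decision to send the whole file in a single `Exec` call are assumptions, and whether a driver accepts multi-statement SQL in one call depends on the driver.

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// execer is the subset of *sql.DB this sketch needs (hypothetical name).
type execer interface {
	Exec(query string, args ...any) (sql.Result, error)
}

// runSeed applies the SQL file at path. A missing file is treated as a
// no-op so packaging that strips the seed does not break startup.
// Idempotency is the seed file's job: every INSERT is expected to carry
// ON CONFLICT ... DO NOTHING, so re-running on a populated DB is safe.
func runSeed(db execer, path string) error {
	seed, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return nil // no seed shipped: nothing to do
	}
	if err != nil {
		return fmt.Errorf("read seed %s: %w", path, err)
	}
	if _, err := db.Exec(string(seed)); err != nil {
		return fmt.Errorf("apply seed %s: %w", path, err)
	}
	return nil
}
```

In a server `main`, a call like this would sit immediately after the migration step, so the seed never runs against a schema older than the one it references.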
Added
- Migration `000017_db_coupling_cleanup` (up + down). Bundles three schema changes in idempotent SQL: (1) rename `renewal_policies.retry_interval_minutes` → `retry_interval_seconds` (`DO $$` guard so re-application is safe), (2) add `notification_events.created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()`, (3) drop the orphan `network_scan_targets.health_check_*` columns. Reduces operator-visible "schema-change releases" from four to one.
- `internal/repository/postgres.RunSeed`: runtime equivalent of the deleted initdb mount for `seed.sql`. Called from `cmd/server/main.go` immediately after `RunMigrations`. Idempotent (every INSERT in the shipped seed uses `ON CONFLICT (id) DO NOTHING`); missing-file is a no-op so operators with custom packaging that strips the seed don't break.
- `internal/repository/postgres.RunDemoSeed` + `config.DatabaseConfig.DemoSeed` + `CERTCTL_DEMO_SEED` env var. Replaces the deleted `seed_demo.sql` initdb mount. The compose demo overlay sets `CERTCTL_DEMO_SEED=true` and the server applies the demo seed after baseline. Same idempotency contract as the baseline path. Default-off so a vanilla deploy never lands fake-history rows.
- `GET /api/v1/version` endpoint + `internal/api/handler.VersionHandler`. Returns `{version, commit, modified, build_time, go_version}` from `runtime/debug.ReadBuildInfo()` with ldflags-supplied `Version` taking priority. Wired through the no-auth dispatch in `cmd/server/main.go` so probes and rollout systems can read build identity without Bearer credentials. Audit middleware excludes the path so rollout polls don't dominate the audit trail. Closes `cat-u-no_version_endpoint`.
- `notification_events.created_at` column is now populated by `NotificationRepository.Create` (with a `time.Now()` fallback when the caller leaves it zero) and read back by `scanNotification`. Pre-U-3 the JSON API serialised `0001-01-01T00:00:00Z`. Closes `cat-o-notification_created_at_dead_field`.
- Regression tests for the U-3 contract: six seed and migration tests (`TestRunSeed_AppliesIdempotently`, `TestRunSeed_MissingFileIsNoOp`, `TestRunDemoSeed_AppliesIdempotently`, `TestMigration000017_RetryIntervalRename`, `TestMigration000017_NotificationCreatedAt`, `TestMigration000017_HealthCheckOrphansDropped`), plus `TestNotificationRepository_CreatedAt_IsPersisted` / `TestNotificationRepository_CreatedAt_DefaultsToNow` for the round-trip. All testcontainers-gated (skipped under `-short`). Three handler-layer unit tests pin `/api/v1/version` (`TestVersion_ReturnsBuildInfo`, `TestVersion_RejectsNonGet`, `TestVersion_LdflagsOverride`).
- CI regression guardrail in `.github/workflows/ci.yml` (`Forbidden migration mount in compose initdb (U-3)`): grep-fails the build if any `migrations/.*\.sql` or `seed.*\.sql` file is re-mounted into `/docker-entrypoint-initdb.d` in any compose file. Catches future drift before a fresh-clone operator hits it.
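
The build-identity lookup behind `GET /api/v1/version` can be sketched as follows. This is a minimal illustration, not certctl's actual `VersionHandler`: the `Version` variable name, the `versionInfo` struct, and the choice to represent `modified` as a boolean are assumptions. Only the use of `runtime/debug.ReadBuildInfo()`, the response field names, and the ldflags-takes-priority rule come from the entry above.

```go
package main

import (
	"encoding/json"
	"net/http"
	"runtime"
	"runtime/debug"
)

// Version is meant to be set at link time, for example with
// -ldflags "-X main.Version=v1.2.3". The variable name is an assumption.
var Version string

// versionInfo mirrors the documented response fields (hypothetical type).
type versionInfo struct {
	Version   string `json:"version"`
	Commit    string `json:"commit"`
	Modified  bool   `json:"modified"`
	BuildTime string `json:"build_time"`
	GoVersion string `json:"go_version"`
}

// buildVersionInfo prefers the ldflags-supplied Version and falls back to
// whatever the Go toolchain embedded in the binary.
func buildVersionInfo() versionInfo {
	info := versionInfo{Version: Version, GoVersion: runtime.Version()}
	bi, ok := debug.ReadBuildInfo()
	if !ok {
		return info
	}
	if info.Version == "" {
		info.Version = bi.Main.Version
	}
	for _, s := range bi.Settings {
		switch s.Key {
		case "vcs.revision":
			info.Commit = s.Value
		case "vcs.modified":
			info.Modified = s.Value == "true"
		case "vcs.time":
			info.BuildTime = s.Value
		}
	}
	return info
}

// versionHandler serves the build identity and rejects non-GET methods.
func versionHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		w.Header().Set("Allow", http.MethodGet)
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(buildVersionInfo())
}
```

The `vcs.*` settings are only present when the binary is built from a VCS checkout with stamping enabled, which is one reason to keep an explicit ldflags override as the first choice.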
Changed
- `deploy/docker-compose.yml` + `deploy/docker-compose.test.yml`: postgres `volumes:` no longer mount migrations or seed files; postgres healthcheck gains `start_period: 30s`; certctl-server healthcheck gains `start_period: 30s` to absorb the runtime migration + seed application window on first boot.
- `deploy/docker-compose.demo.yml`: replaces the `seed_demo.sql` initdb mount with the `CERTCTL_DEMO_SEED=true` env var on `certctl-server`.
- `migrations/seed.sql`: `INSERT INTO renewal_policies` updated to use the new `retry_interval_seconds` column name (lockstep with migration 000017).
- `internal/repository/postgres/renewal_policy.go`: column references updated to `retry_interval_seconds` across SELECT, INSERT, and UPDATE sites (lockstep with migration 000017).
Closed audit findings
- `cat-u-seed_initdb_schema_drift` (P1, primary U-3 finding)
- `cat-o-retry_interval_unit_mismatch` (P1)
- `cat-o-notification_created_at_dead_field` (P2)
- `cat-o-health_check_column_orphans` (P1)
- `cat-u-no_version_endpoint` (P2)
G-1: JWT silent auth downgrade — closed end-to-end
Pre-G-1 the config validator accepted `CERTCTL_AUTH_TYPE=jwt` and the startup log faithfully echoed `"authentication enabled" "type"="jwt"`. Reasonable people read that and concluded JWT was on. It wasn't. The auth-middleware wiring at `cmd/server/main.go` unconditionally routed every request through the api-key bearer middleware regardless of `cfg.Auth.Type`. So `CERTCTL_AUTH_TYPE=jwt` quietly compared incoming `Authorization: Bearer <something>` against whatever string the operator put in `CERTCTL_AUTH_SECRET`: real JWT clients got 401, and operators who treated `CERTCTL_AUTH_SECRET` as a signing secret (because they thought they were configuring JWT) had effectively handed an attacker an api-key. A security finding masquerading as a config option. We chose to remove the option rather than ship JWT middleware, which is the audit-recommended structural fix that closes the hazard. Operators who actually need JWT/OIDC front certctl with an authenticating gateway (oauth2-proxy / Envoy `ext_authz` / Traefik `ForwardAuth` / Pomerium / Authelia) and run the upstream certctl with `CERTCTL_AUTH_TYPE=none`. The same pattern works on docker-compose and Helm.
Breaking Changes
- `CERTCTL_AUTH_TYPE=jwt` is no longer accepted. Pre-G-1 the value was silently downgraded to api-key middleware. Post-G-1 the server fails at startup with a dedicated diagnostic naming the authenticating-gateway pattern. Operators with this in their env block must either switch to `api-key` (if they were de facto using api-key auth all along; the same Bearer token continues to work) or switch to `none` and front certctl with an oauth2-proxy / Envoy / Traefik / Pomerium gateway. See `docs/upgrade-to-v2-jwt-removal.md`.
- Helm chart `server.auth.type=jwt` now fails at `helm install` / `helm upgrade` template time. New `certctl.validateAuthType` template helper runs on every template that depends on `.Values.server.auth.type` (`server-deployment.yaml`, `server-configmap.yaml`, `server-secret.yaml`) and fails the render with a pointer at the gateway-fronting pattern.
- OpenAPI spec `auth_type` enum no longer includes `jwt`. API consumers checking `/api/v1/auth/info` against the spec will see a smaller enum.
Removed
- Documented references to JWT in the certctl auth surface (config docblocks, middleware/health-handler comments, `.env.example`, `docs/architecture.md` middleware-stack bullet). Connector-level JWT references (Google OAuth2 service-account JWT in `internal/connector/discovery/gcpsm/`, `internal/connector/issuer/googlecas/`; step-ca's provisioner one-time-token JWT in `internal/connector/issuer/stepca/`) are unrelated and untouched: those are external-protocol uses, not certctl's own auth shape.
Added
- `config.AuthType` typed alias with `AuthTypeAPIKey` / `AuthTypeNone` exported constants. Single source of truth for the allowed set across the validator, the runtime defense-in-depth switch in `main.go`, and the helm chart's `validateAuthType` helper.
- `config.ValidAuthTypes()` helper returning the complete allowed set; pinned by a property test (`TestValidAuthTypesDoesNotContainJWT`) that fails the build if `"jwt"` is ever re-added to the slice.
- Defense-in-depth runtime guard in `cmd/server/main.go` immediately after `config.Load()`: a `switch config.AuthType(cfg.Auth.Type)` that exits 1 if the validator was bypassed (test harness, alt config loader, env-var rebinding).
- `certctl.validateAuthType` Helm template helper mirroring the existing `certctl.tls.required` pattern. Fails template render on any `server.auth.type` outside `{api-key, none}`.
- `docs/architecture.md` "Authenticating-gateway pattern (JWT, OIDC, mTLS)" section explaining the design rationale for the narrow in-process auth surface and listing oauth2-proxy / Envoy `ext_authz` / Traefik `ForwardAuth` / Pomerium / Authelia / Caddy `forward_auth` / Apache `mod_auth_openidc` / nginx `auth_request` as the standard fronting options.
- `docs/upgrade-to-v2-jwt-removal.md` migration guide. Same shape as `docs/upgrade-to-tls.md`. Walks through the dedicated startup error, both recovery paths (`api-key` vs gateway-fronting), a complete docker-compose oauth2-proxy walkthrough, Traefik ForwardAuth and Envoy `ext_authz` patterns, and rollback posture.
- `deploy/helm/certctl/README.md` "JWT / OIDC via authenticating gateway" section with a Kubernetes-flavored oauth2-proxy + certctl walkthrough.
- CI regression guardrail in `.github/workflows/ci.yml` (`Forbidden auth-type literal regression guard (G-1)`): grep-fails the build if `"jwt"` appears as an auth-type literal in production code or spec. Connector packages exempt (legitimate external-protocol uses).
- Negative test coverage in `internal/config/config_test.go`: `TestValidate_JWTAuth_RejectedDedicated` (two table rows pinning that the dedicated G-1 error fires regardless of whether `Secret` is set), `TestValidAuthTypesDoesNotContainJWT` (property-level guard), `TestValidAuthTypesIsExactly_APIKey_None` (allowed-set contract), `TestValidate_GenericInvalidAuthType` (pins that other invalid values still surface the generic invalid-auth-type error, so the dedicated G-1 path doesn't accidentally swallow non-jwt typos).
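
The typed allowed set and the dedicated `jwt` rejection described above might be shaped like the sketch below. The `AuthType`, `AuthTypeAPIKey`, `AuthTypeNone`, and `ValidAuthTypes` names and the two allowed values come from this entry; the `validateAuthType` function, the `errJWTRemoved` sentinel, and both error messages are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// AuthType is the typed alias for the allowed in-process auth modes.
type AuthType string

const (
	AuthTypeAPIKey AuthType = "api-key"
	AuthTypeNone   AuthType = "none"
)

// ValidAuthTypes returns the complete allowed set. "jwt" must never
// reappear here; a property test can pin that.
func ValidAuthTypes() []AuthType {
	return []AuthType{AuthTypeAPIKey, AuthTypeNone}
}

// errJWTRemoved is a hypothetical sentinel for the dedicated diagnostic.
var errJWTRemoved = errors.New(`auth type "jwt" is not supported: ` +
	`front certctl with an authenticating gateway and use "none", ` +
	`or use "api-key" (see docs/upgrade-to-v2-jwt-removal.md)`)

// validateAuthType gives "jwt" its own error so it is not confused with
// an ordinary typo, then checks the value against the allowed set.
func validateAuthType(raw string) error {
	if raw == "jwt" {
		return errJWTRemoved
	}
	for _, t := range ValidAuthTypes() {
		if AuthType(raw) == t {
			return nil
		}
	}
	return fmt.Errorf("invalid auth type %q: must be one of %v", raw, ValidAuthTypes())
}
```

Keeping the `jwt` branch ahead of the generic check is what lets a separate test assert that other invalid values still reach the generic error.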
Changed
- `internal/api/middleware/middleware.go::AuthConfig.Type` field comment now references the typed `config.AuthType` constants instead of an inline string enumeration.
- `internal/api/handler/health.go::HealthHandler.AuthType` field comment same treatment.
- `internal/api/handler/health_test.go`: the prior `TestAuthInfo_ReturnsAuthType_JWT` (which asserted the handler echoed `"jwt"`, baking the silent-downgrade lie into the regression suite) is removed; the pre-existing `TestAuthInfo_ReturnsAuthType_APIKey` continues to cover the api-key happy path.
- Auth-disabled startup log in `main.go` now points operators at the authenticating-gateway pattern explicitly.
U-2: Dockerfile HEALTHCHECK protocol mismatch — closed end-to-end
Pre-U-2 the published `ghcr.io/shankar0123/certctl-server` image shipped with `HEALTHCHECK CMD curl -f http://localhost:8443/health`. The server has been HTTPS-only since the v2.2 HTTPS-Everywhere milestone (`cmd/server/main.go::ListenAndServeTLS`, no plaintext fallback, TLS 1.3 pinned), so the probe failed every interval and Docker marked the container `unhealthy` indefinitely. Operators inside docker-compose / Helm / the example stacks were unaffected: compose overrides the HEALTHCHECK with `--cacert` + `https://`, Helm uses explicit `httpGet` probes that ignore Docker's HEALTHCHECK, and every example compose file overrides with `curl -sfk https://localhost:8443/health`. But anyone running bare `docker run` / Docker Swarm / Nomad / ECS (exactly the "I just pulled the published image" path) saw permanent `unhealthy` status and, depending on orchestrator policy, a restart-loop. Recon for U-2 also surfaced two adjacent bugs from the same v2.2 milestone gap: the Helm chart's `readinessProbe.httpGet.path` pointed at `/readyz`, a route the server doesn't register (only `/health` and `/ready` are wired and bypass the auth middleware), so K8s readiness probes were getting 404/auth-rejection and pods stayed `NotReady`; and the agent image had no HEALTHCHECK at all (the compose override called `pgrep -f certctl-agent` against an image that didn't ship `procps`, a latent always-fail). All three are closed in this commit.
Fixed
- `Dockerfile` HEALTHCHECK now speaks HTTPS. Bare `docker run` / Swarm / Nomad / ECS users no longer see `unhealthy` forever. The probe uses `curl -fsk https://localhost:8443/health`. `-k` (insecure) is acceptable because the probe is localhost-to-localhost: the same process serving the cert is being probed; the probe never traverses a network. Compose / Helm / examples already perform full cert-chain validation and are unaffected.
- Helm `server.readinessProbe.httpGet.path` corrected from `/readyz` to `/ready`. The `/readyz` path was never registered as a no-auth route (see `internal/api/router/router.go:81` and `cmd/server/main.go:920`), so K8s readiness probes received 401 (api-key auth rejection) or 404 (when auth was disabled). Pods previously failed to report Ready under most realistic Helm deployments. Liveness probe path (`/health`) was already correct and is unchanged.
- `docs/connectors.md` curl examples (15 sites) updated from `http://localhost:8443/...` to `https://localhost:8443/...` with a one-time `--cacert "$CA"` extraction note matching the existing pattern in `docs/quickstart.md`. Pre-U-2 these examples silently failed against the HTTPS listener.
Added
- `Dockerfile.agent` HEALTHCHECK: `pgrep -f certctl-agent` process-presence check (the agent has no HTTP listener; presence is the right primitive). Bare-`docker run` agents now report health-status the same way compose-managed ones do. Also adds `procps` to the runtime image so `pgrep` is actually available. Pre-U-2 the docker-compose override at `deploy/docker-compose.yml:173` called `pgrep -f certctl-agent` against an image that lacked it (latent always-fail; the container was reported unhealthy in compose too, just rarely noticed because nothing acted on the signal).
- `deploy/test/healthcheck_test.go` (`//go:build integration`): image-level integration tests. `TestPublishedServerImage_HealthcheckSpecUsesHTTPS` builds the server image, inspects `Config.Healthcheck.Test` via `docker inspect`, and asserts the array contains `https://localhost:8443/health` and `-k`, and does NOT contain `http://localhost:8443/health` (negative regression contract). `TestPublishedAgentImage_HealthcheckSpecExists` builds the agent image and asserts the HEALTHCHECK uses `pgrep` against `certctl-agent`. Both tests `t.Skip` cleanly when docker isn't available (sandbox / CI without docker-in-docker). A third runtime test (`TestPublishedServerImage_HealthcheckTransitionsToHealthy`) is a `t.Skip` placeholder until the harness wires a sidecar postgres for image-level smoke; this is documented honestly so the next refactor adopts it instead of rediscovering the gap.
- CI regression guardrail in `.github/workflows/ci.yml` (`Forbidden plaintext HEALTHCHECK regression guard (U-2)`): grep-fails the build if any `Dockerfile*` carries `HEALTHCHECK.*http://` or `curl -f http://localhost:8443/health`. Comments exempt; the `docs/upgrade-to-tls.md:182` post-cutover invariant string (which deliberately documents the expected-failure shape) is out of the guardrail's scope because the guardrail only scans Dockerfiles.
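
The positive and negative assertions on the image's healthcheck spec can be expressed as a small pure function, sketched below. This is not the code in `deploy/test/healthcheck_test.go`: the `checkServerHealthcheck` name is hypothetical, and the input is assumed to be the `Config.Healthcheck.Test` string array already extracted from `docker inspect`. The sketch accepts `-k` either on its own or inside a short-flag cluster such as `-fsk`.

```go
package main

import (
	"fmt"
	"strings"
)

// checkServerHealthcheck validates a Docker healthcheck Test array, for
// example {"CMD-SHELL", "curl -fsk https://localhost:8443/health || exit 1"}.
func checkServerHealthcheck(test []string) error {
	joined := strings.Join(test, " ")

	// Negative contract: the plaintext probe must never come back.
	if strings.Contains(joined, "http://localhost:8443/health") {
		return fmt.Errorf("healthcheck probes plaintext HTTP: %q", joined)
	}
	// Positive contract: the probe targets the HTTPS listener.
	if !strings.Contains(joined, "https://localhost:8443/health") {
		return fmt.Errorf("healthcheck does not probe the HTTPS endpoint: %q", joined)
	}
	// The probe must skip verification (-k), alone or in a flag cluster.
	for _, field := range strings.Fields(joined) {
		isShortFlags := strings.HasPrefix(field, "-") && !strings.HasPrefix(field, "--")
		if isShortFlags && strings.Contains(field, "k") {
			return nil
		}
	}
	return fmt.Errorf("healthcheck is missing curl's -k flag: %q", joined)
}
```

Keeping the check free of Docker calls means the contract itself can be unit-tested without docker-in-docker, while the integration test only has to supply the inspected array.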
Changed
- `Dockerfile` final-stage HEALTHCHECK lines now carry a long-form docblock explaining the `-k` design choice, the published-image vs compose vs Helm vs examples coverage matrix, and cross-references to the audit closure + the integration test.
- `Dockerfile.agent` runtime stage adds `procps` to the apk install so the new HEALTHCHECK and the existing compose override both have a working `pgrep`.
- `deploy/helm/certctl/values.yaml` server probes block now carries an explanatory comment naming the registered probe routes (`/health`, `/ready`) and the U-2 closure rationale for the `/readyz` → `/ready` correction.
[2.2.0] — 2026-04-19
HTTPS Everywhere — The Irony
certctl manages other teams' certificates. Until v2.2, it didn't terminate TLS on its own control plane. We treated the server as an internal service sitting behind whatever TLS-terminating infrastructure the operator already owned — reverse proxies, Kubernetes Ingress controllers, service mesh sidecars. Working through an EST coverage-gap audit surfaced this as a credibility problem we wanted to fix head-on: a cert-lifecycle product should ship with HTTPS by default. This release flips that. Self-signed bootstrap for docker-compose demos, operator-supplied Secret for Helm (with optional cert-manager integration), and a one-step cutover with no backward-compat bridge. Out-of-date agents will fail at the TLS handshake layer on upgrade; the upgrade guide walks operators through the roll.
Breaking Changes
- HTTPS-only control plane. The plaintext HTTP listener is gone. There is no `CERTCTL_TLS_ENABLED=false` escape hatch and no `:8080` fallback. Operators who were running certctl behind their own TLS terminator must either (a) continue doing so and let the downstream TLS terminator talk to certctl's HTTPS listener, or (b) bring their own cert/key and terminate on certctl directly. Either path requires config changes; see `docs/upgrade-to-tls.md` for a one-step cutover.
- Agents reject `CERTCTL_SERVER_URL=http://...` at startup. This is a pre-flight config validation failure with a fail-loud diagnostic pointing at `docs/upgrade-to-tls.md`. It is not a TCP-refused error and not a TLS-handshake error: the agent will not even attempt the network call. Every agent deployment must be reconfigured before upgrading the server.
- CLI and MCP clients require `https://` URLs. Same pre-flight rejection of plaintext schemes.
- TLS 1.2 is not supported. TLS 1.3 only. The server's `tls.Config.MinVersion` is pinned to `tls.VersionTLS13`. Any client still negotiating TLS 1.2 will fail at the handshake. Modern curl, Go stdlib, browsers, and Kubernetes tooling are all TLS 1.3-capable by default; legacy clients may need an upgrade.
- Helm chart requires a TLS source. `helm install` without one of `server.tls.existingSecret`, `server.tls.certManager.enabled`, or (for eval only) `server.tls.selfSigned.enabled` fails at template time with a diagnostic pointing at `docs/tls.md`. There is no default-to-plaintext path.
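
The pre-flight scheme rejection described for agents, the CLI, and MCP clients amounts to validating the URL before any connection is attempted. A minimal sketch follows; the `validateServerURL` name and the error wording are assumptions, and only the rule itself (plaintext schemes are refused at startup with a pointer at `docs/upgrade-to-tls.md`) comes from the notes above.

```go
package main

import (
	"fmt"
	"net/url"
)

// validateServerURL rejects anything that is not an https:// URL with a
// host. It runs at config-load time, before any network call is made.
func validateServerURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("CERTCTL_SERVER_URL is not a valid URL: %w", err)
	}
	if u.Scheme != "https" {
		return fmt.Errorf("CERTCTL_SERVER_URL must use https:// (got scheme %q); see docs/upgrade-to-tls.md", u.Scheme)
	}
	if u.Host == "" {
		return fmt.Errorf("CERTCTL_SERVER_URL %q has no host", raw)
	}
	return nil
}
```

Failing here rather than at the first request is what turns a confusing handshake or connection error into a configuration error the operator can act on.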
Added
- Self-signed bootstrap for Docker Compose demos. A `certctl-tls-init` init container runs before the server on first boot, generates a SAN-valid self-signed cert into `deploy/test/certs/`, and exits. The server mounts the resulting cert/key. Every curl in the demo stack pins against `./deploy/test/certs/ca.crt` with `--cacert`.
- Helm chart TLS provisioning, three modes. Operator-supplied Secret (`server.tls.existingSecret`), cert-manager integration (`server.tls.certManager.enabled` with issuer selection), or self-signed (`server.tls.selfSigned.enabled`; eval only, not supported for production). Chart templates enforce exactly one is active.
- Hot-reload of TLS cert/key on `SIGHUP`. Overwrite the cert/key on disk, send `SIGHUP` to the server PID, watch the `slog.Info("tls.reload", ...)` log line, and new TLS connections use the new cert. Failure during reload is logged and does not crash the server; the previous cert remains in use.
- Agent CA-bundle env vars. `CERTCTL_SERVER_CA_BUNDLE_PATH` points at a PEM file the agent's HTTP client will trust. `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY` disables verification (development only; the agent logs a loud warning at startup). `install-agent.sh` writes both as commented template lines into the generated `agent.env`.
- Integration test suite runs over HTTPS. `go test -tags=integration ./deploy/test/...` stands up the full Compose stack, extracts the self-signed CA bundle, and exercises every certctl API over `https://localhost:8443`. All 34 subtests green.
- `docs/tls.md` covers cert provisioning patterns: bring-your-own Secret, cert-manager, self-signed bootstrap, SAN requirements, rotation workflows, SIGHUP reload semantics, troubleshooting.
- `docs/upgrade-to-tls.md` is the one-step cutover guide for existing v2.1 operators. Walks through the agent fleet roll, Helm upgrade sequencing, downgrade-is-not-supported warnings, and cert-provisioning decision tree.
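
One common way to get the SIGHUP reload semantics described above (new connections pick up the new cert; a failed reload keeps the previous one) is to serve certificates through `tls.Config.GetCertificate`. The sketch below illustrates that technique under stated assumptions; it is not certctl's implementation, and the `certReloader`, `newTLSConfig`, and `watchSIGHUP` names are hypothetical. It assumes a Unix-like platform for `syscall.SIGHUP` and Go 1.21+ for `log/slog`.

```go
package main

import (
	"crypto/tls"
	"errors"
	"log/slog"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// certReloader holds the certificate currently served to new connections.
type certReloader struct {
	mu   sync.RWMutex
	cert *tls.Certificate
}

// Reload swaps in the key pair from disk. On failure the previous
// certificate is left untouched.
func (r *certReloader) Reload(certFile, keyFile string) error {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return err
	}
	r.mu.Lock()
	r.cert = &cert
	r.mu.Unlock()
	return nil
}

// GetCertificate is called by crypto/tls once per handshake.
func (r *certReloader) GetCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	if r.cert == nil {
		return nil, errors.New("no certificate loaded")
	}
	return r.cert, nil
}

// newTLSConfig pins TLS 1.3 and defers certificate choice to the reloader.
func newTLSConfig(r *certReloader) *tls.Config {
	return &tls.Config{MinVersion: tls.VersionTLS13, GetCertificate: r.GetCertificate}
}

// watchSIGHUP reloads on every SIGHUP and logs the outcome either way.
func watchSIGHUP(r *certReloader, certFile, keyFile string) {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGHUP)
	go func() {
		for range ch {
			if err := r.Reload(certFile, keyFile); err != nil {
				slog.Error("tls.reload failed; keeping previous certificate", "error", err)
				continue
			}
			slog.Info("tls.reload", "cert_file", certFile)
		}
	}()
}
```

With this shape, an `http.Server` whose `TLSConfig` comes from `newTLSConfig` can be started with `ListenAndServeTLS("", "")`, since the standard library accepts empty file names when `GetCertificate` is set. Existing connections keep the certificate they negotiated with.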
Changed
- `cmd/server/main.go` now calls `http.Server.ListenAndServeTLS(certFile, keyFile)`. The plaintext `ListenAndServe` code path is deleted: `grep -rn "ListenAndServe[^T]" cmd/ internal/` returns zero hits.
- All documentation curls (`docs/testing-guide.md`, `docs/quickstart.md`, `deploy/helm/INSTALLATION.md`, `deploy/helm/DEPLOYMENT_GUIDE.md`, `deploy/ENVIRONMENTS.md`, `docs/openapi.md`, migration guides, example READMEs) use `https://localhost:8443` and `--cacert` against the demo stack's bundle.
- OpenAPI spec (`api/openapi.yaml`) `servers` blocks default to `https://localhost:8443`.
Security
- TLS 1.3 pinned via `tls.Config.MinVersion = tls.VersionTLS13`.
- Plaintext HTTP listener removed entirely: no port 8080, no `Upgrade-Insecure-Requests`, no HSTS-required redirect dance. There is only one port: 8443, TLS 1.3.
- `grep -rn "http://" cmd/ internal/` returns zero hits outside test fixtures and the agent-side URL-scheme rejection error message.
Upgrade Notes
Read docs/upgrade-to-tls.md before upgrading. The short version:
- Pick a TLS source: bring-your-own cert, cert-manager, or self-signed bootstrap.
- Upgrade the server with TLS configured. First boot over HTTPS.
- Roll the agent fleet: set `CERTCTL_SERVER_URL=https://...` and, if using a private CA, `CERTCTL_SERVER_CA_BUNDLE_PATH`. Old agents will fail loud at startup, which is expected.
- Roll CLI/MCP clients the same way.
There is no backward-compat bridge. There is no dual-listener mode. The cutover is one step.