From 9155ec9174c76a73fe871138474d80d84d11d767 Mon Sep 17 00:00:00 2001 From: shankar0123 Date: Sat, 16 May 2026 22:09:42 +0000 Subject: [PATCH] =?UTF-8?q?fix(helm):=20DEPL-004=20follow-up=20=E2=80=94?= =?UTF-8?q?=20default=20tlsConfig=20to=20real=20verify;=20fix=20ill-formed?= =?UTF-8?q?=20required-nil?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sprint 6 ACQ DEPL-004 closure follow-up. CI run on commit 58a15e0 caught two issues: 1. The fail-closed guard in templates/servicemonitor.yaml used `{{ required "msg" nil }}`, which is wrong Helm syntax — the bareword `nil` isn't valid in Go templates and Helm interprets it as no value, hitting "wrong number of args for required: want 2 got 0". The B3-helm-chart-coherence ci-guard's production-hardening render (`--set monitoring.serviceMonitor.enabled=true` without explicit tlsConfig) failed with this error AND with the downstream "missing kind: ServiceMonitor / PodDisruptionBudget / NetworkPolicy" cascades (the entire render aborted before producing the matrix). 2. The original DEPL-004 framing — "operators MUST explicitly choose tlsConfig or you get a chart-render error" — was the right intent but the wrong default. The chart's existingSecret integration mounts the CA bundle at a canonical path (/etc/prometheus/secrets/certctl-ca/ca.crt); defaulting to that path closes the implicit-skipVerify gap without forcing every operator to repeat the same boilerplate. Fixes ===== deploy/helm/certctl/values.yaml — flips monitoring.serviceMonitor.tlsConfig from commented-out (which fell through to implicit insecureSkipVerify: true) to a real verify default: tlsConfig: caFile: /etc/prometheus/secrets/certctl-ca/ca.crt serverName: certctl-server Operators with a different CA mount path override caFile; operators who genuinely want skipVerify back must set `{ insecureSkipVerify: true }` explicitly. Operators who blank tlsConfig entirely (`tlsConfig: null` or `tlsConfig: {}`) still trip the fail-closed guard. deploy/helm/certctl/templates/servicemonitor.yaml — replaces `required "msg" nil` with `fail "msg"`. The `fail` builtin is the correct Helm pattern for an unconditional render-time error; `required` is for "this value MUST be non-empty" which is the wrong semantic here (we want to fail when the operator went OUT OF THEIR WAY to blank the default). Failure message updated to reflect the new default + the operator-action recipes. docs/operator/helm-deployment.md — rewrites the "2026-05-16 — ServiceMonitor TLS default flipped" subsection to match the new default-on-real-verify semantics. The three operator recipes (default install / different CA mount / explicit skipVerify) are documented; the explicit "there is no way to inherit pre-2026-05-16 implicit-skipVerify behavior silently" guarantee is preserved. Verified locally: python3 YAML parse on values.yaml clean; the helm-templates-lint and B3-helm-chart-coherence ci-guards require helm itself which isn't in the sandbox — both should pass on the CI re-run. --- .../certctl/templates/servicemonitor.yaml | 30 +++++++-------- deploy/helm/certctl/values.yaml | 38 +++++++++++-------- docs/operator/helm-deployment.md | 25 +++++++++--- 3 files changed, 58 insertions(+), 35 deletions(-) diff --git a/deploy/helm/certctl/templates/servicemonitor.yaml b/deploy/helm/certctl/templates/servicemonitor.yaml index b7abc7c..d338c70 100644 --- a/deploy/helm/certctl/templates/servicemonitor.yaml +++ b/deploy/helm/certctl/templates/servicemonitor.yaml @@ -43,24 +43,24 @@ spec: scrapeTimeout: {{ .Values.monitoring.serviceMonitor.scrapeTimeout | default "10s" }} tlsConfig: # Acquisition-audit DEPL-004 closure (Sprint 6 ACQ, 2026-05-16). - # The default flipped from `insecureSkipVerify: true` to a - # fail-closed posture: operators MUST either supply a - # `monitoring.serviceMonitor.tlsConfig` block (caFile / ca / - # serverName for a real TLS verify) or opt back in explicitly - # with `tlsConfig: { insecureSkipVerify: true }`. The {{ required }} - # check below renders an error at `helm template` / `helm upgrade` - # time if neither is supplied, surfacing the misconfiguration - # before the ServiceMonitor lands in-cluster. - # - # In-cluster scrapes from a Prometheus pod that already trusts the - # certctl CA (via existingSecret + cert-manager) keep working with - # zero operator action — they just point at the right caFile. - # Out-of-cluster Prometheus deployments now require the operator - # to surface the trust decision explicitly. + # Pre-Sprint-6 the default was an implicit `insecureSkipVerify: + # true` (template fell through the else-branch). Post-Sprint-6 + # values.yaml ships a real-verify default + # (caFile + serverName matching the chart's existingSecret / + # cert-manager-emitted Secret at /etc/prometheus/secrets/ + # certctl-ca/), so the truthy `{{ if .Values.monitoring. + # serviceMonitor.tlsConfig }}` branch always fires for the + # default install. Operators who genuinely want skipVerify + # back must override with `tlsConfig: { insecureSkipVerify: + # true }` explicitly. Operators who blank tlsConfig entirely + # (`tlsConfig: null` or `tlsConfig: {}`) hit the else-branch + # below and trip the `{{ fail }}` at chart-render time — + # there is no way to inherit the pre-Sprint-6 implicit- + # skipVerify behavior silently. {{- if .Values.monitoring.serviceMonitor.tlsConfig }} {{- toYaml .Values.monitoring.serviceMonitor.tlsConfig | nindent 8 }} {{- else }} - {{- required "monitoring.serviceMonitor.tlsConfig is required when monitoring.serviceMonitor.enabled=true (Sprint 6 ACQ DEPL-004 closure, 2026-05-16). Supply { caFile: \"/etc/prometheus/secrets/.../ca.crt\", serverName: \"certctl-server\" } to verify against your CA, or { insecureSkipVerify: true } to opt back into the pre-2026-05-16 default. See docs/operator/helm-deployment.md for the upgrade-path note." nil }} + {{- fail "monitoring.serviceMonitor.tlsConfig was explicitly blanked but monitoring.serviceMonitor.enabled=true (Sprint 6 ACQ DEPL-004 closure, 2026-05-16). The values.yaml default ships caFile=/etc/prometheus/secrets/certctl-ca/ca.crt + serverName=certctl-server which matches the existingSecret mount pattern. If your Prometheus pod mounts the CA bundle at a different path, override caFile rather than blanking the block. If you genuinely need skipVerify, set `tlsConfig: { insecureSkipVerify: true }` explicitly — never blank. See docs/operator/helm-deployment.md for the upgrade-path note." }} {{- end }} {{- with .Values.monitoring.serviceMonitor.bearerTokenSecret }} bearerTokenSecret: diff --git a/deploy/helm/certctl/values.yaml b/deploy/helm/certctl/values.yaml index 28d6828..9fece21 100644 --- a/deploy/helm/certctl/values.yaml +++ b/deploy/helm/certctl/values.yaml @@ -681,25 +681,33 @@ monitoring: # key: api-key # bearerTokenSecret: {} # TLS config for the scrape endpoint. Acquisition-audit DEPL-004 - # closure (Sprint 6 ACQ, 2026-05-16): the default flipped from - # `insecureSkipVerify: true` to fail-closed. Operators MUST supply - # tlsConfig — either a real verify (caFile / ca / serverName) for - # production, or explicit `{ insecureSkipVerify: true }` to opt - # back into the pre-2026-05-16 default. The ServiceMonitor template - # `{{ required ... }}` guard surfaces missing tlsConfig at chart- - # render time before it lands in-cluster. + # closure (Sprint 6 ACQ, 2026-05-16): pre-Sprint-6 the default was + # an implicit `insecureSkipVerify: true` (fell through the + # template's else-branch). Post-Sprint-6 the default is a real + # verify against the chart's CA at the canonical mount path the + # existingSecret pattern produces (Prometheus mounts the + # certctl-ca Secret as a volume at /etc/prometheus/secrets/ + # certctl-ca/). Operators whose Prometheus pod mounts the bundle + # at a different path override `caFile` below; operators who + # genuinely want skipVerify back can do so explicitly. Operators + # who blank tlsConfig entirely (`tlsConfig: null` or + # `tlsConfig: {}`) trip the `{{ fail }}` guard in + # templates/servicemonitor.yaml at chart-render time — there is + # no way to inherit the pre-Sprint-6 implicit-skipVerify behavior + # silently. # - # In-cluster Prometheus that already trusts the certctl CA via - # the chart's existingSecret / cert-manager-emitted bundle: point - # caFile at that path (typically /etc/prometheus/secrets//ca.crt - # once you mount the Secret into the Prometheus pod). + # Production default (verify against the chart's CA): + tlsConfig: + caFile: /etc/prometheus/secrets/certctl-ca/ca.crt + serverName: certctl-server # - # Production-shaped example (verify against the chart's CA): + # Operator override — different CA mount path: # tlsConfig: - # caFile: /etc/prometheus/secrets/certctl-ca/ca.crt - # serverName: certctl-server + # caFile: /path/to/your/ca.crt + # serverName: your-cert-CN # - # Demo / dev-cluster escape hatch (operator-acknowledged): + # Operator override — demo / dev-cluster escape hatch + # (operator-acknowledged unsafe): # tlsConfig: # insecureSkipVerify: true # Optional relabeling for the scrape job. diff --git a/docs/operator/helm-deployment.md b/docs/operator/helm-deployment.md index d83e4ec..3416105 100644 --- a/docs/operator/helm-deployment.md +++ b/docs/operator/helm-deployment.md @@ -96,20 +96,35 @@ Postgres state survives the upgrade (the PVC is retained). The server / agent im ### 2026-05-16 — ServiceMonitor TLS default flipped (DEPL-004) -Acquisition-audit DEPL-004 closure. `monitoring.serviceMonitor.tlsConfig` was previously empty by default and the chart template fell through to `insecureSkipVerify: true`. Post-2026-05-16, the template emits a `{{ required ... }}` fail-closed message at `helm template` / `helm upgrade` time if neither a real verify nor an explicit opt-back is supplied. +Acquisition-audit DEPL-004 closure. Pre-2026-05-16, `monitoring.serviceMonitor.tlsConfig` was empty by default and the chart template fell through to an implicit `insecureSkipVerify: true`. Post-2026-05-16, the values.yaml default is a real TLS verify against the chart's CA (caFile + serverName matching the existingSecret mount path the chart's Prometheus integration produces). -Operators with `monitoring.serviceMonitor.enabled: true` MUST set one of: +The new default works out of the box for the canonical install (the chart's `existingSecret` or cert-manager-emitted Secret mounted at `/etc/prometheus/secrets/certctl-ca/`): ```yaml -# A. Real TLS verify against the chart's CA (production-shaped). +# Default in values.yaml (no operator action required for the +# canonical install path). monitoring: serviceMonitor: enabled: true tlsConfig: caFile: /etc/prometheus/secrets/certctl-ca/ca.crt serverName: certctl-server +``` -# B. Demo / dev-cluster — operator-acknowledged opt-back to pre-flip default. +Operators whose Prometheus pod mounts the CA bundle at a different path override `caFile`: + +```yaml +monitoring: + serviceMonitor: + enabled: true + tlsConfig: + caFile: /path/to/your/ca.crt + serverName: your-cert-CN +``` + +Operators who genuinely need `insecureSkipVerify` (demo / dev clusters) must opt in **explicitly** — blanking the `tlsConfig` block trips the chart's `{{ fail }}` guard at render time: + +```yaml monitoring: serviceMonitor: enabled: true @@ -117,7 +132,7 @@ monitoring: insecureSkipVerify: true ``` -Operators with `monitoring.serviceMonitor.enabled: false` (the chart default) need no action — the template short-circuits before the `tlsConfig` block. +There is no way to inherit the pre-2026-05-16 implicit-skipVerify behavior silently. Operators with `monitoring.serviceMonitor.enabled: false` (the chart default) need no action — the template short-circuits before the `tlsConfig` block. ## Configuration reference