fix(helm): DEPL-004 — ServiceMonitor TLS default flipped to fail-closed

Acquisition-audit DEPL-004 closure (Sprint 6 ACQ, 2026-05-16).

Pre-2026-05-16, monitoring.serviceMonitor.tlsConfig in values.yaml was
empty by default, and the ServiceMonitor template fell through to an
implicit `insecureSkipVerify: true` else-branch. Operators opting into
the ServiceMonitor (monitoring.serviceMonitor.enabled=true) got no
Prometheus TLS verification by default: in-cluster scrapes tolerate
this, but out-of-cluster scrapes silently skip the chain check.

The template now emits a fail-closed `{{ required ... }}` message at
`helm template` / `helm upgrade` time if neither a real verify nor an
explicit opt-back is supplied. The error string lists both escape
hatches and the docs cross-link, so the operator sees the fix in the
same line they hit the error.

Operators with monitoring.serviceMonitor.enabled=false (the chart
default): no action required. The template short-circuits before the
tlsConfig block.

Operators who had ServiceMonitor on with no tlsConfig set: helm upgrade
will fail until they supply either { caFile: ..., serverName: ... }
(production-shaped) or { insecureSkipVerify: true }
(operator-acknowledged opt-back).

Files
=====

- deploy/helm/certctl/templates/servicemonitor.yaml: replace the
  else-branch insecureSkipVerify default with a {{ required ... }} Helm
  builtin that fails the render with a clear remediation message
  pointing at both escape hatches and docs/operator/helm-deployment.md
- deploy/helm/certctl/values.yaml: rewrite the tlsConfig comment block
  to document the new fail-closed posture + both upgrade paths
  (production verify vs operator-acknowledged opt-back)
- docs/operator/helm-deployment.md: new "2026-05-16 — ServiceMonitor TLS
  default flipped (DEPL-004)" subsection in the existing Upgrade section
  with the two operator-action recipes
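
As a rough illustration of the two upgrade paths described above, an
override file might look like the sketch below. The file name
values-override.yaml is hypothetical. The caFile path and serverName are
the placeholder values from the chart's values.yaml comment and assume
the CA Secret is mounted into the Prometheus pod at that path; adjust
both to your own setup.

```yaml
# values-override.yaml (hypothetical file name, for illustration only)
monitoring:
  serviceMonitor:
    enabled: true

    # Path 1: production-shaped. Verify the scrape endpoint against the CA.
    # The path and serverName are placeholders taken from the values.yaml
    # comment, not values guaranteed to match your cluster.
    tlsConfig:
      caFile: /etc/prometheus/secrets/certctl-ca/ca.crt
      serverName: certctl-server

    # Path 2: operator-acknowledged opt-back to the pre-2026-05-16
    # behavior. Use instead of Path 1, never alongside it.
    # tlsConfig:
    #   insecureSkipVerify: true
```

Only one tlsConfig block should be active at a time. Leaving tlsConfig
unset, or setting it to an empty map, takes the else-branch and fails
the render with the remediation message.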
2026-07-26 15:08:12 +00:00 · 2026-05-16 19:44:48 +00:00
parent 5ea45a19b9
commit d7546aedca
3 changed files with 60 additions and 11 deletions
@@ -42,15 +42,25 @@ spec:
      interval: {{ .Values.monitoring.serviceMonitor.interval | default "30s" }}
      scrapeTimeout: {{ .Values.monitoring.serviceMonitor.scrapeTimeout | default "10s" }}
      tlsConfig:
-        # The certctl server uses self-signed bootstrap TLS or operator-
-        # provided cert-manager TLS — the ServiceMonitor consumes the
-        # same CA bundle the server presents. When server.tls.existingSecret
-        # is set, operators usually want to pull the matching ca.crt key
-        # out of that Secret. Adjust if your CA chain lives elsewhere.
+        # Acquisition-audit DEPL-004 closure (Sprint 6 ACQ, 2026-05-16).
+        # The default flipped from `insecureSkipVerify: true` to a
+        # fail-closed posture: operators MUST either supply a
+        # `monitoring.serviceMonitor.tlsConfig` block (caFile / ca /
+        # serverName for a real TLS verify) or opt back in explicitly
+        # with `tlsConfig: { insecureSkipVerify: true }`. The `required`
+        # check below renders an error at `helm template` / `helm upgrade`
+        # time if neither is supplied, surfacing the misconfiguration
+        # before the ServiceMonitor lands in-cluster.
+        #
+        # In-cluster scrapes from a Prometheus pod that already trusts the
+        # certctl CA (via existingSecret + cert-manager) keep working as
+        # long as tlsConfig.caFile already points at that CA bundle.
+        # Out-of-cluster Prometheus deployments now require the operator
+        # to surface the trust decision explicitly.
        {{- if .Values.monitoring.serviceMonitor.tlsConfig }}
        {{- toYaml .Values.monitoring.serviceMonitor.tlsConfig | nindent 8 }}
        {{- else }}
-        insecureSkipVerify: true
+        {{- required "monitoring.serviceMonitor.tlsConfig is required when monitoring.serviceMonitor.enabled=true (Sprint 6 ACQ DEPL-004 closure, 2026-05-16). Supply { caFile: \"/etc/prometheus/secrets/.../ca.crt\", serverName: \"certctl-server\" } to verify against your CA, or { insecureSkipVerify: true } to opt back into the pre-2026-05-16 default. See docs/operator/helm-deployment.md for the upgrade-path note." nil }}
        {{- end }}
      {{- with .Values.monitoring.serviceMonitor.bearerTokenSecret }}
      bearerTokenSecret:
@@ -680,14 +680,28 @@ monitoring:
    #     name: certctl-prometheus-key
    #     key: api-key
    # bearerTokenSecret: {}
-    # TLS config for the scrape endpoint. The certctl server presents
-    # the same TLS cert the rest of the chart uses; insecureSkipVerify
-    # defaults to true so demos work out of the box. Production deploys
-    # should pin the CA via caFile or ca.secret.
+    # TLS config for the scrape endpoint. Acquisition-audit DEPL-004
+    # closure (Sprint 6 ACQ, 2026-05-16): the default flipped from
+    # `insecureSkipVerify: true` to fail-closed. Operators MUST supply
+    # tlsConfig — either a real verify (caFile / ca / serverName) for
+    # production, or explicit `{ insecureSkipVerify: true }` to opt
+    # back into the pre-2026-05-16 default. The ServiceMonitor template
+    # `{{ required ... }}` guard surfaces missing tlsConfig at chart-
+    # render time before it lands in-cluster.
+    #
+    # In-cluster Prometheus that already trusts the certctl CA via
+    # the chart's existingSecret / cert-manager-emitted bundle: point
+    # caFile at that path (typically /etc/prometheus/secrets/<name>/ca.crt
+    # once you mount the Secret into the Prometheus pod).
+    #
+    # Production-shaped example (verify against the chart's CA):
    # tlsConfig:
    #   caFile: /etc/prometheus/secrets/certctl-ca/ca.crt
    #   serverName: certctl-server
-    # tlsConfig: {}
+    #
+    # Demo / dev-cluster escape hatch (operator-acknowledged):
+    # tlsConfig:
+    #   insecureSkipVerify: true
    # Optional relabeling for the scrape job.
    # relabelings: []