fix(helm): DEPL-003 + DEPL-006 — render viaHook env, sessionAffinity, HA backend default

Sprint 3 unified-master-audit closure — two Helm-chart correctness
defects with overlapping CI-guard surface.

DEPL-003 — CERTCTL_MIGRATIONS_VIA_HOOK never rendered:
  Pre-fix the env var was documented in values.yaml and the
  migration-job.yaml comment but never made it into the server
  Deployment env block. With migrations.viaHook=true the operator's
  intent is 'the pre-install/pre-upgrade Helm Job owns migrations,'
  but the server pods, missing the env, ran their own
  cmd/server/migrations.go::runBootMigrations alongside the hook
  Job, racing on the schema lock.
  Fix: render '- name: CERTCTL_MIGRATIONS_VIA_HOOK / value: true'
  in server-deployment.yaml under '{{- if .Values.migrations.viaHook }}'.

DEPL-006 — HA example missing rate-limit backend + sessionAffinity:
  values-prod-ha.yaml sets replicas:3 but inherited the chart-wide
  default rateLimiting.backend=memory (which gives each pod its
  own bucket map, effectively tripling the cap on a 3-replica fleet)
  AND the chart had no render path for server.service.sessionAffinity
  even though docs/operator/runbooks/ha.md instructed operators to
  set it for ClientIP-routed sticky sessions.
  Fix:
    - server-service.yaml gains a conditional sessionAffinity +
      sessionAffinityConfig.clientIP.timeoutSeconds render.
    - values.yaml grows the matching schema entries (default empty
      so single-replica deploys are unaffected).
    - values-prod-ha.yaml flips rateLimiting.backend=postgres and
      service.sessionAffinity=ClientIP.
    - NOTES.txt emits a loud warning when replicas>1 + either toggle
      is still in the default state, so the misconfig surfaces at
      helm install time instead of in a confused login-flow bug
      report a week later.

CI:
  scripts/ci-guards/B3-helm-chart-coherence.sh gains 'Check 7'
  (DEPL-003 viaHook env render — both positive and negative —
  the inverse case catches future drift that drops the {{- if }}
  guard) and 'Check 8' (DEPL-006 sessionAffinity render). Both
  helm-template through to assert the rendered YAML carries the
  expected text.

Closes DEPL-003, DEPL-006.
This commit is contained in:
shankar0123
2026-05-16 04:30:37 +00:00
parent 15fedbaa06
commit 6a640ac3e7
6 changed files with 136 additions and 1 deletions
+25
View File
@@ -72,3 +72,28 @@ IMPORTANT NOTES FOR PRODUCTION:
- All containers run as non-root
- Implement network policies to restrict traffic between components
- Consider pod security policies or security standards for your cluster
{{- /*
DEPL-006 closure (Sprint 3, 2026-05-16). Loud notice when the
operator runs a multi-replica deploy without crossing the two
required HA toggles. Per-pod rate-limit buckets and round-robin
load balancing both silently break correctness above replicas:1.
*/}}
{{- if gt (int .Values.server.replicas) 1 }}
⚠️ HA MISCONFIGURATION WARNINGS (replicas={{ .Values.server.replicas }}):
{{- $backend := .Values.server.rateLimiting.backend | default "memory" }}
{{- if eq $backend "memory" }}
- server.rateLimiting.backend = "memory" with replicas > 1 gives each
pod its own bucket map, so the configured cap is effectively
multiplied by the replica count. Set
`--set server.rateLimiting.backend=postgres` (see DEPL-006 /
docs/operator/runbooks/ha.md).
{{- end }}
{{- if not .Values.server.service.sessionAffinity }}
- server.service.sessionAffinity is empty. Round-robin Service load
balancing routes login → /api/v1/auth/login → /api/v1/auth/csrf
across different pods, breaking the CSRF token + session cookie
handshake. Set
`--set server.service.sessionAffinity=ClientIP`.
{{- end }}
{{- end }}
@@ -51,6 +51,20 @@ spec:
containerPort: {{ .Values.server.port }}
protocol: TCP
env:
# DEPL-003 closure (Sprint 3, 2026-05-16). Pre-fix the
# CERTCTL_MIGRATIONS_VIA_HOOK env var was documented in
# values.yaml (L797-810) and migration-job.yaml comments
# but was never rendered into the server Deployment env
# block. With migrations.viaHook=true the operator's
# intent is "the pre-install/pre-upgrade Helm Job owns
# migrations" — but the server pods, missing the env,
# ran their own boot-time RunMigrations alongside the
# hook Job, racing on the schema lock. cmd/server/migrations.go
# only short-circuits when this env is "true" (line 144).
{{- if .Values.migrations.viaHook }}
- name: CERTCTL_MIGRATIONS_VIA_HOOK
value: "true"
{{- end }}
- name: CERTCTL_SERVER_HOST
value: "0.0.0.0"
- name: CERTCTL_SERVER_PORT
@@ -11,6 +11,23 @@ metadata:
{{- end }}
spec:
type: {{ .Values.server.service.type }}
{{- /*
DEPL-006 closure (Sprint 3, 2026-05-16). Render the optional
sessionAffinity field. docs/operator/runbooks/ha.md instructs
operators to set sessionAffinity: ClientIP for replicas > 1 so
login + CSRF flows stay on the same pod; pre-fix the chart did
not actually pass the value through. sessionAffinityConfig
clientIP.timeoutSeconds renders only when set, otherwise
Kubernetes applies its default (10800s / 3h).
*/}}
{{- if .Values.server.service.sessionAffinity }}
sessionAffinity: {{ .Values.server.service.sessionAffinity }}
{{- with .Values.server.service.sessionAffinityTimeoutSeconds }}
sessionAffinityConfig:
clientIP:
timeoutSeconds: {{ . }}
{{- end }}
{{- end }}
ports:
- port: {{ .Values.server.service.port }}
targetPort: https
+11
View File
@@ -160,6 +160,17 @@ server:
type: ClusterIP
port: 8443
annotations: {}
# DEPL-006 closure (Sprint 3, 2026-05-16). Optional sticky-session
# routing. REQUIRED when server.replicas > 1 so login + CSRF token
# rows stay on the same pod for the duration of a session — the
# default round-robin load balancing breaks those flows. Set to
# "ClientIP" for production HA (see deploy/helm/examples/values-prod-ha.yaml).
# Leave empty for single-replica deploys.
sessionAffinity: ""
# When sessionAffinity is set, timeout window (in seconds) the
# Service maps a source IP to the same pod. Default null →
# Kubernetes applies its built-in default (10800s / 3h).
sessionAffinityTimeoutSeconds: null
# Authentication configuration.
# Valid types: "api-key" (production) or "none" (demo only — disables
+16
View File
@@ -36,6 +36,14 @@ server:
service:
type: ClusterIP
# DEPL-006 closure (Sprint 3, 2026-05-16): with replicas:3, the
# default round-robin Service load balancing breaks login/CSRF
# flows because the session cookie + the CSRF token row land on
# different pods between requests. sessionAffinity: ClientIP
# routes every connection from a given source IP to the same
# pod for the configured timeout window. docs/operator/runbooks/ha.md
# documents this; pre-fix the chart did not actually render it.
sessionAffinity: ClientIP
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8443"
@@ -53,6 +61,14 @@ server:
rateLimiting:
rps: 500
burst: 1000
# DEPL-006 closure (Sprint 3, 2026-05-16): replicas > 1 REQUIRES
# the postgres backend so per-key buckets are cross-replica-
# consistent. The default 'memory' backend gives each pod its
# own bucket map, so a 3-replica fleet effectively triples the
# configured cap (a client churning across pods bypasses the
# limit). See deploy/helm/certctl/values.yaml L217-226 for the
# canonical comment.
backend: postgres
postgresql:
enabled: true
+53 -1
View File
@@ -152,10 +152,62 @@ if helm template c "$CHART" \
failed=1
fi
# Check 8 — DEPL-006 (Sprint 3 closure): when
# server.service.sessionAffinity is set, server-service.yaml MUST
# render the field. Pre-fix the chart silently dropped the value
# and docs/operator/runbooks/ha.md's instructions had no effect.
if helm template c "$CHART" \
--set server.tls.existingSecret=ci \
--set postgresql.auth.password=p \
--set server.auth.apiKey=k \
--set server.service.sessionAffinity=ClientIP \
> "$TMP/affinity.yaml" 2> "$TMP/affinity.err"; then
if ! grep -q 'sessionAffinity: ClientIP' "$TMP/affinity.yaml"; then
echo "::error file=${CHART}::B3 regression (DEPL-006): server.service.sessionAffinity=ClientIP did not render. Round-robin Service routing will break login + CSRF flows."
failed=1
fi
else
echo "::error file=${CHART}::B3 regression (DEPL-006): sessionAffinity mode failed to render."
cat "$TMP/affinity.err"
failed=1
fi
# Check 7 — DEPL-003 (Sprint 3 closure): when migrations.viaHook=true
# is set, the server Deployment env block MUST render
# CERTCTL_MIGRATIONS_VIA_HOOK=true. Without it the server runs
# boot-time migrations alongside the pre-install/pre-upgrade hook
# Job, racing on the schema lock.
if helm template c "$CHART" \
--set server.tls.existingSecret=ci \
--set postgresql.auth.password=p \
--set server.auth.apiKey=k \
--set migrations.viaHook=true \
> "$TMP/viahook.yaml" 2> "$TMP/viahook.err"; then
if ! grep -q 'name: CERTCTL_MIGRATIONS_VIA_HOOK' "$TMP/viahook.yaml"; then
echo "::error file=${CHART}::B3 regression (DEPL-003): migrations.viaHook=true did not render CERTCTL_MIGRATIONS_VIA_HOOK in the server Deployment env block. Server pods will race the hook Job."
failed=1
fi
# Inverse: with viaHook unset (default), the env var MUST NOT render.
if helm template c "$CHART" \
--set server.tls.existingSecret=ci \
--set postgresql.auth.password=p \
--set server.auth.apiKey=k \
> "$TMP/noviahook.yaml" 2> "$TMP/noviahook.err"; then
if grep -q 'name: CERTCTL_MIGRATIONS_VIA_HOOK' "$TMP/noviahook.yaml"; then
echo "::error file=${CHART}::B3 regression (DEPL-003): default render (viaHook=false) leaked CERTCTL_MIGRATIONS_VIA_HOOK env var. Conditional template missing the {{- if }} guard."
failed=1
fi
fi
else
echo "::error file=${CHART}::B3 regression (DEPL-003): migrations.viaHook=true mode failed to render."
cat "$TMP/viahook.err"
failed=1
fi
if [ "$failed" -ne 0 ]; then
echo ""
echo "${GUARD_NAME}: FAILED — Helm chart coherence regression."
exit 1
fi
echo "${GUARD_NAME}: clean (default + external-Postgres + cert-manager + production hardening + 3 fail-fast gates all green)."
echo "${GUARD_NAME}: clean (default + external-Postgres + cert-manager + production hardening + 3 fail-fast gates + DEPL-003 viaHook env render all green)."