deploy(helm): close Phase 4 — chart surface + DR + ops runbooks

Phase 4 of the certctl architecture diligence remediation closure.
Seven findings, all in deploy/helm/certctl/.

DEPL-H2 (High) — ship deploy/helm/certctl/templates/backup-cronjob.yaml
  Operator opt-in via backup.enabled=true. Default OFF. CronJob runs
  pg_dump --format=custom --no-owner --no-acl --dbname=certctl
  matching the canonical shape in
  docs/operator/runbooks/postgres-backup.md (so manual and
  automated dumps use the same flags and format). Sink: PVC (default) or S3
  via aws-cli. Documented as in-cluster-Postgres only — managed DB
  deployments rely on their provider's PITR.
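
  For reference, the dump invocation described above can be sketched
  as a small wrapper. This is an illustration only: the function
  names, the BACKUP_DIR variable, the /backups default, and the
  timestamped file name are assumptions, not the contents of
  backup-cronjob.yaml. Connection settings are assumed to come from
  the usual libpq environment variables (PGHOST, PGUSER, PGPASSWORD).

```shell
#!/usr/bin/env bash
# Hypothetical sketch: builds and runs the pg_dump call described above.
# BACKUP_DIR, the /backups default, and the file naming are assumptions.
set -euo pipefail

# Print the pg_dump arguments, one per line, for a given output file.
pg_dump_args() {
  local out_file="$1"
  # Flags mirror the canonical shape named in the commit message;
  # --file is the standard pg_dump option for the output path.
  printf '%s\n' \
    --format=custom \
    --no-owner \
    --no-acl \
    --dbname=certctl \
    "--file=${out_file}"
}

# Run one dump into a timestamped file under BACKUP_DIR.
run_backup() {
  local backup_dir="${BACKUP_DIR:-/backups}"
  local stamp out_file line
  local -a args=()
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  out_file="${backup_dir}/certctl-${stamp}.dump"
  while IFS= read -r line; do
    args+=("$line")
  done < <(pg_dump_args "$out_file")
  pg_dump "${args[@]}"
}
```

  Calling run_backup by hand should produce a dump that
  `pg_restore --list <file>` can read, which is a quick sanity check
  before relying on the CronJob.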

DEPL-M1 (Med) — Helm pre-install/pre-upgrade migration hook
  deploy/helm/certctl/templates/migration-job.yaml — runs
  `certctl-server --migrate-only` before the server Deployment
  rolls. The --migrate-only flag (new in cmd/server/main.go) is a
  hermetic schema-mutation pass: load config, open DB pool, run
  RunMigrations + RunSeed, exit 0. No HTTP listener, no scheduler,
  no signing setup.

  Server's boot-time RunMigrations call is now gated on
  CERTCTL_MIGRATIONS_VIA_HOOK — when set true, the server skips
  the boot path (the hook owns the work). Default still runs at
  boot, so Compose / VM / bare-metal deploys are unchanged.

  migrations.viaHook: false in values.yaml (off by default).
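
  The gating rule above can be sketched as follows. This is a shell
  rendering for illustration only; the real check lives in the Go
  server, and treating only the literal string "true" as enabled is
  an assumption. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the boot-time migration gate described above.
# Assumption: only the exact value "true" hands migrations to the hook.
set -euo pipefail

# Succeeds (exit 0) when the server should run migrations at boot.
should_run_boot_migrations() {
  if [ "${CERTCTL_MIGRATIONS_VIA_HOOK:-}" = "true" ]; then
    return 1  # the Helm hook owns migrations; skip the boot path
  fi
  return 0    # default: migrate at boot (Compose / VM / bare metal)
}
```

  With the variable unset the function succeeds, which matches the
  unchanged default for non-Helm deploys.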

DEPL-M4 (Med) — explicit Postgres StatefulSet strategy fields
  deploy/helm/certctl/templates/postgres-statefulset.yaml adds:
    spec.updateStrategy.type: OnDelete
    spec.podManagementPolicy: OrderedReady
  Operator-controlled Postgres upgrades (the OnDelete strategy
  means a chart template tweak no longer triggers an immediate
  Postgres restart). OrderedReady aligns with the standard
  Postgres-on-Kubernetes pattern for any future HA work.
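
  A stricter form of the strategy check can be sketched like this.
  The function name is hypothetical, and the patterns assume the
  fields render on their own lines exactly as listed above. Piping a
  whole chart render through it could match another resource's
  `type: OnDelete`, so it is best fed the StatefulSet manifest alone.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: confirm a rendered manifest carries both
# strategy fields. Reads the manifest from stdin.
set -euo pipefail

has_operator_controlled_rollout() {
  local manifest
  manifest="$(cat)"
  # Both fields must be present for the check to succeed.
  grep -Eq '^[[:space:]]*type:[[:space:]]*OnDelete[[:space:]]*$' <<<"$manifest" &&
    grep -Eq '^[[:space:]]*podManagementPolicy:[[:space:]]*OrderedReady[[:space:]]*$' <<<"$manifest"
}
```

  Under OnDelete, Kubernetes applies a changed pod template only when
  the pod is deleted, so a Postgres upgrade becomes an explicit
  operator action rather than a side effect of `helm upgrade`.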

DEPL-M5 (Med) — per-fleet-size resource ladder documentation
  deploy/helm/certctl/values.yaml — extended comments next to
  server.resources + agent.resources documenting:
    "≤ 500 certs / 100 agents" → defaults are validated
    "5K certs / 1K agents" → starter suggestions, TBD Phase 8
    "50K certs / 10K agents" → starter suggestions, TBD Phase 8
  Numbers for the small-fleet case derive from the measured
  baselines in docs/operator/performance-baselines.md
  (50ms p50, < 3s for 1000-cert inventory walk, etc.). Larger
  fleet numbers explicitly marked TBD pending Phase 8 load-test
  runs — operators tune empirically until then.

DEPL-L1 (Low) — Helm rollback runbook
  docs/operator/runbooks/rollback.md — covers helm rollback
  mechanics, the schema-migration manual-cleanup path (when
  *.down.sql files apply vs. when full restore is the only safe
  path), and the per-migration-class safe-to-rollback table.

DEPL-L2 (Low) — Prometheus alerting rules (PrometheusRule)
  deploy/helm/certctl/templates/prometheusrules.yaml — opt-in via
  monitoring.prometheusRules.enabled=true. Default OFF. Four
  starter rules using verified metric names from
  internal/api/handler/metrics.go:
    CertctlCertificateExpiringSoon (certctl_certificate_expiring_soon)
    CertctlAgentOffline ((agent_total - agent_online) > 0 for 1h)
    CertctlJobFailureRateHigh (failure rate over 5% for 15m)
    CertctlIssuanceFailures (any failures over 15m window)
  All thresholds operator-tunable via
  monitoring.prometheusRules.thresholds.* in values.

DEPL-L3 (Low) — Prometheus bearer-token setup runbook
  docs/operator/runbooks/prometheus-bearer-token.md — documents
  the API-key + Secret + values wiring for the RBAC-gated
  /api/v1/metrics/prometheus scrape endpoint. End-to-end
  procedure with troubleshooting steps + rotation guide.

CI guard: scripts/ci-guards/helm-templates-lint.sh
  Six-combo matrix: defaults / backup PVC / backup S3 /
  prometheusRules / migrations.viaHook / all-on. Each runs helm
  template + checks render success. helm lint also gated.
  Wired into the auto-pickup loop in .github/workflows/ci.yml;
  azure/setup-helm@b9e51907 (v4.3.0, SHA-pinned per Phase 1
  RED-2) installs helm v3.16.0 on the runner.

Verification (all pass):
  ls deploy/helm/certctl/templates/{backup-cronjob,migration-job,prometheusrules}.yaml
  grep -E 'updateStrategy|podManagementPolicy' deploy/helm/certctl/templates/postgres-statefulset.yaml  # 2 matches
  helm template deploy/helm/certctl/ --set backup.enabled=true \
    --set monitoring.prometheusRules.enabled=true --set migrations.viaHook=true \
    | grep -E "kind: (CronJob|PrometheusRule|Job)"  # 3 matches
  helm lint deploy/helm/certctl/  # 0 failed
  ls docs/operator/runbooks/{rollback,prometheus-bearer-token}.md
  bash scripts/ci-guards/helm-templates-lint.sh  # 6/6 matrix combinations pass
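
  The kind check above counts grep matches. A variant that names
  each expected kind, and so reports which one is missing, might look
  like the sketch below. The function name is hypothetical, and it
  assumes `kind:` renders at column 0 as in standard helm output.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: assert that every expected kind appears in a
# rendered manifest. Exact-line matching keeps "Job" from being
# satisfied by "CronJob".
set -euo pipefail

require_kinds() {
  local manifest="$1"
  shift
  local kind missing=0
  for kind in "$@"; do
    if ! grep -Eq "^kind:[[:space:]]*${kind}[[:space:]]*\$" <<<"$manifest"; then
      echo "missing kind: ${kind}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

  Usage would be along the lines of
  `require_kinds "$(helm template deploy/helm/certctl/ ...)" CronJob PrometheusRule Job`,
  with the same --set flags as the verification command above.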

Go build clean (cmd/server compiles, migrate-only path verified by
the build target). YAML validated.

Closes: cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H2
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M1
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M4
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M5
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L1
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L2
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L3
shankar0123
2026-05-14 00:58:00 +00:00
parent b2284ef2a4
commit d6f4d5c5e8
10 changed files with 1223 additions and 11 deletions
@@ -0,0 +1,87 @@
#!/usr/bin/env bash
# scripts/ci-guards/helm-templates-lint.sh
#
# Phase 4 closure (2026-05-14): Helm chart lint + template-render gate.
#
# Runs `helm lint` against the chart and `helm template` against six
# representative value combinations to catch:
# - Syntax errors in any chart template
# - Schema-violation in values.yaml
# - Missing required values uncovered by the opt-in toggles
# (backup, monitoring.prometheusRules, migrations.viaHook)
# - Render errors when new templates are added without updating
# this guard's coverage matrix
#
# The opt-in templates added in Phase 4 (backup-cronjob.yaml,
# prometheusrules.yaml, migration-job.yaml) default OFF; without
# explicit coverage in the guard's matrix they would never render in
# CI and silent breakage could ship.
set -euo pipefail
CHART_DIR="deploy/helm/certctl"
if [ ! -d "$CHART_DIR" ]; then
  echo "helm-templates-lint: skipped — $CHART_DIR not found (running outside repo root?)"
  exit 0
fi
if ! command -v helm >/dev/null 2>&1; then
  echo "helm-templates-lint: skipped — helm not on PATH."
  echo " Install: https://helm.sh/docs/intro/install/"
  exit 0
fi
echo "helm-templates-lint: running helm lint"
helm lint "$CHART_DIR" >/dev/null
# Minimal valid value set to satisfy chart preflight validators
# (server.tls.existingSecret, server.auth.apiKey, postgresql.auth.password).
# These are NOT real secrets — they're just non-empty strings to
# make the chart render in lint mode.
BASE_VALUES=(
  --set "server.tls.existingSecret=lint-test-tls"
  --set "server.auth.apiKey=lint-test-apikey"
  --set "postgresql.auth.password=lint-test-pgpass"
)
render_and_check() {
  local label="$1"
  shift
  local out
  out="$(helm template "$CHART_DIR" "${BASE_VALUES[@]}" "$@" 2>&1)" || {
    echo "helm-templates-lint: FAIL — template render error for '$label'"
    echo "$out" | tail -20
    return 1
  }
  echo "helm-templates-lint: OK — '$label'"
}
# Matrix:
# 1. Defaults (no Phase 4 opt-ins) — confirms the chart still
# renders cleanly when every Phase 4 feature is off.
# 2. backup.enabled=true (PVC sink) — confirms backup-cronjob renders.
# 3. backup.enabled=true + sink=s3 — confirms S3 sink branch renders.
# 4. monitoring.prometheusRules.enabled=true — confirms PrometheusRule renders.
# 5. migrations.viaHook=true — confirms migration-job hook renders.
# 6. All Phase 4 opt-ins on simultaneously — confirms no template
# interaction breaks the others.
render_and_check "defaults"
render_and_check "backup.enabled (pvc)" \
  --set "backup.enabled=true"
render_and_check "backup.enabled (s3)" \
  --set "backup.enabled=true" \
  --set "backup.sink=s3" \
  --set "backup.s3.bucket=lint-test-bucket"
render_and_check "monitoring.prometheusRules.enabled" \
  --set "monitoring.enabled=true" \
  --set "monitoring.prometheusRules.enabled=true"
render_and_check "migrations.viaHook" \
  --set "migrations.viaHook=true"
render_and_check "all phase 4 opt-ins" \
  --set "backup.enabled=true" \
  --set "monitoring.enabled=true" \
  --set "monitoring.prometheusRules.enabled=true" \
  --set "migrations.viaHook=true"
echo "helm-templates-lint: all matrix combinations rendered cleanly"