deploy(helm): close Phase 4 — chart surface + DR + ops runbooks

Phase 4 of the certctl architecture diligence remediation closure.
Seven findings, all in deploy/helm/certctl/.

DEPL-H2 (High) — ship deploy/helm/certctl/templates/backup-cronjob.yaml
  Operator opt-in via backup.enabled=true. Default OFF. CronJob runs
  pg_dump --format=custom --no-owner --no-acl --dbname=certctl
  matching the canonical shape in
  docs/operator/runbooks/postgres-backup.md (so manual and
  automated dumps are byte-identical). Sink: PVC (default) OR S3
  via aws-cli. Documented as in-cluster-Postgres only — managed DB
  deployments rely on their provider's PITR.

DEPL-M1 (Med) — Helm pre-install/pre-upgrade migration hook
  deploy/helm/certctl/templates/migration-job.yaml — runs
  `certctl-server --migrate-only` before the server Deployment
  rolls. The --migrate-only flag (new in cmd/server/main.go) is a
  hermetic schema-mutation pass: load config, open DB pool, run
  RunMigrations + RunSeed, exit 0. No HTTP listener, no scheduler,
  no signing setup.

  Server's boot-time RunMigrations call is now gated on
  CERTCTL_MIGRATIONS_VIA_HOOK — when set true, the server skips
  the boot path (the hook owns the work). Default still runs at
  boot, so Compose / VM / bare-metal deploys are unchanged.

  migrations.viaHook: false in values.yaml (off by default).

DEPL-M4 (Med) — explicit Postgres StatefulSet strategy fields
  deploy/helm/certctl/templates/postgres-statefulset.yaml adds:
    spec.updateStrategy.type: OnDelete
    spec.podManagementPolicy: OrderedReady
  Operator-controlled Postgres upgrades (the OnDelete strategy
  means a chart template tweak no longer triggers an immediate
  Postgres restart). OrderedReady aligns with the standard
  Postgres-on-Kubernetes pattern for any future HA work.

DEPL-M5 (Med) — per-fleet-size resource ladder documentation
  deploy/helm/certctl/values.yaml — extended comments next to
  server.resources + agent.resources documenting:
    "≤ 500 certs / 100 agents" → defaults are validated
    "5K certs / 1K agents" → starter suggestions, TBD Phase 8
    "50K certs / 10K agents" → starter suggestions, TBD Phase 8
  Numbers for the small-fleet case derive from the measured
  baselines in docs/operator/performance-baselines.md
  (50ms p50, < 3s for 1000-cert inventory walk, etc.). Larger
  fleet numbers explicitly marked TBD pending Phase 8 load-test
  runs — operators tune empirically until then.

DEPL-L1 (Low) — Helm rollback runbook
  docs/operator/runbooks/rollback.md — covers helm rollback
  mechanics, the schema-migration manual-cleanup path (when
  *.down.sql files apply vs. when full restore is the only safe
  path), and the per-migration-class safe-to-rollback table.

DEPL-L2 (Low) — Prometheus AlertManager rules
  deploy/helm/certctl/templates/prometheusrules.yaml — opt-in via
  monitoring.prometheusRules.enabled=true. Default OFF. Four
  starter rules using verified metric names from
  internal/api/handler/metrics.go:
    CertctlCertificateExpiringSoon (certctl_certificate_expiring_soon)
    CertctlAgentOffline ((agent_total - agent_online) > 0 for 1h)
    CertctlJobFailureRateHigh (failure rate over 5% for 15m)
    CertctlIssuanceFailures (any failures over 15m window)
  All thresholds operator-tunable via
  monitoring.prometheusRules.thresholds.* in values.

DEPL-L3 (Low) — Prometheus bearer-token setup runbook
  docs/operator/runbooks/prometheus-bearer-token.md — documents
  the API-key + Secret + values wiring for the RBAC-gated
  /api/v1/metrics/prometheus scrape endpoint. End-to-end
  procedure with troubleshooting steps + rotation guide.

CI guard: scripts/ci-guards/helm-templates-lint.sh
  Six-combo matrix: defaults / backup PVC / backup S3 /
  prometheusRules / migrations.viaHook / all-on. Each runs helm
  template + checks render success. helm lint also gated.
  Wired into the auto-pickup loop in .github/workflows/ci.yml;
  azure/setup-helm@b9e51907 (v4.3.0, SHA-pinned per Phase 1
  RED-2) installs helm v3.16.0 on the runner.

Verification (all pass):
  ls deploy/helm/certctl/templates/{backup-cronjob,migration-job,prometheusrules}.yaml
  grep -E 'updateStrategy|podManagementPolicy' deploy/helm/certctl/templates/postgres-statefulset.yaml  # 2 matches
  helm template deploy/helm/certctl/ --set backup.enabled=true \
    --set monitoring.prometheusRules.enabled=true --set migrations.viaHook=true \
    | grep -E "kind: (CronJob|PrometheusRule|Job)"  # 3 matches
  helm lint deploy/helm/certctl/  # 0 failed
  ls docs/operator/runbooks/{rollback,prometheus-bearer-token}.md
  bash scripts/ci-guards/helm-templates-lint.sh  # 6/6 matrix combinations pass

Go build clean (cmd/server compiles, migrate-only path verified by
the build target). YAML validated.

Closes: cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H2
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M1
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M4
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M5
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L1
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L2
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L3
This commit is contained in:
shankar0123
2026-05-14 00:58:00 +00:00
parent b2284ef2a4
commit d6f4d5c5e8
10 changed files with 1223 additions and 11 deletions
+9
View File
@@ -176,6 +176,15 @@ jobs:
# 167 legitimate tests for no observable behavior change. The
# Test<Func>_<Scenario>_<ExpectedResult> form remains the
# recommended pattern for parameterized scenarios, but is not gated.
# Phase 4 DEPL-* prerequisite (2026-05-14): helm-templates-lint.sh
# needs the `helm` CLI on PATH to run helm lint + helm template
# against the chart. The official azure/setup-helm action installs
# a SHA-pinned helm binary into the runner.
- name: Install Helm (for helm-templates-lint guard)
uses: azure/setup-helm@b9e51907a09c216f16ebe8536097933489208112 # v4.3.0
with:
version: v3.16.0
- name: Regression guards (extracted to scripts/ci-guards/)
# All named regression guards live at scripts/ci-guards/<id>.sh per
# ci-pipeline-cleanup bundle Phase 1. Each guard is callable locally:
+71 -11
View File
@@ -55,6 +55,26 @@ import (
)
func main() {
// Phase 4 DEPL-M1 closure (2026-05-14): --migrate-only flag for
// the Helm pre-install/pre-upgrade hook (see
// deploy/helm/certctl/templates/migration-job.yaml). When set, the
// server loads config, opens the DB pool, runs migrations + seed,
// and exits — no HTTP listener, no scheduler, no signing work.
// Same migration code path as boot-time RunMigrations; only the
// surrounding lifecycle differs.
//
// Hand-parsed (instead of pulling in flag.Parse) because the rest
// of the server's config surface is env-var driven via
// config.Load(); adding a flag.Parse() with global state risks
// conflicting with other binaries that import cmd/server later.
migrateOnly := false
for _, arg := range os.Args[1:] {
if arg == "--migrate-only" {
migrateOnly = true
break
}
}
// Load configuration
cfg, err := config.Load()
if err != nil {
@@ -146,13 +166,37 @@ func main() {
defer db.Close()
logger.Info("connected to database")
// Run migrations
logger.Info("running migrations", "path", cfg.Database.MigrationsPath)
if err := postgres.RunMigrations(db, cfg.Database.MigrationsPath); err != nil {
logger.Error("failed to run migrations", "error", err)
os.Exit(1)
// Phase 4 DEPL-M1 closure (2026-05-14): migration-via-hook posture.
//
// Three lifecycles to support:
// (a) Compose / VM / bare-metal: server runs migrations at boot.
// Default behavior — preserved unchanged.
// (b) Helm with pre-install/pre-upgrade hook: the migration Job
// runs `certctl-server --migrate-only`, does its work, and
// exits. The server Deployment's pods then start with
// CERTCTL_MIGRATIONS_VIA_HOOK=true set; they see the env
// var and skip their boot-time RunMigrations call so the
// Job's work isn't duplicated.
// (c) Bare `certctl-server --migrate-only` invocation (e.g.
// operator running a one-shot migration from the CLI):
// runs migrations + seed and exits cleanly. No HTTP
// listener, no scheduler, no signing work.
//
// migrateOnly captures case (c); CERTCTL_MIGRATIONS_VIA_HOOK
// captures case (b). Both paths converge on the same RunMigrations
// + RunSeed code below.
migrationsViaHook := strings.EqualFold(os.Getenv("CERTCTL_MIGRATIONS_VIA_HOOK"), "true")
if migrateOnly || !migrationsViaHook {
logger.Info("running migrations", "path", cfg.Database.MigrationsPath)
if err := postgres.RunMigrations(db, cfg.Database.MigrationsPath); err != nil {
logger.Error("failed to run migrations", "error", err)
os.Exit(1)
}
logger.Info("migrations completed")
} else {
logger.Info("skipping migrations at boot (CERTCTL_MIGRATIONS_VIA_HOOK=true — Helm pre-install/pre-upgrade hook owns this work)")
}
logger.Info("migrations completed")
// Apply baseline seed data.
//
@@ -166,12 +210,28 @@ func main() {
// server runs RunMigrations above, then this RunSeed call lands the
// baseline data — all from a single source of truth (this binary).
// See internal/repository/postgres/db.go::RunSeed for the contract.
logger.Info("applying baseline seed", "path", cfg.Database.MigrationsPath)
if err := postgres.RunSeed(db, cfg.Database.MigrationsPath); err != nil {
logger.Error("failed to apply seed data", "error", err)
os.Exit(1)
//
// Phase 4 DEPL-M1: same migration-via-hook gating as RunMigrations.
// When the hook owns migrations it also owns the seed pass.
if migrateOnly || !migrationsViaHook {
logger.Info("applying baseline seed", "path", cfg.Database.MigrationsPath)
if err := postgres.RunSeed(db, cfg.Database.MigrationsPath); err != nil {
logger.Error("failed to apply seed data", "error", err)
os.Exit(1)
}
logger.Info("seed completed")
} else {
logger.Info("skipping baseline seed at boot (CERTCTL_MIGRATIONS_VIA_HOOK=true — hook applies seed alongside migrations)")
}
// Phase 4 DEPL-M1: --migrate-only early-exit. Migrations + seed are
// done; the operator only asked for the migration pass. Skip the
// HTTP listener, scheduler, signing setup, banner, etc. Exit 0
// cleanly so Kubernetes Job lifecycle reports success.
if migrateOnly {
logger.Info("--migrate-only: migrations + seed complete; exiting without starting server lifecycle")
os.Exit(0)
}
logger.Info("seed completed")
// Apply demo overlay seed when CERTCTL_DEMO_SEED=true. Pre-U-3 the demo
// overlay (deploy/docker-compose.demo.yml) mounted seed_demo.sql into
@@ -0,0 +1,178 @@
{{- /*
Phase 4 DEPL-H2 closure (2026-05-14): opt-in Helm CronJob for
PostgreSQL backups.
OPERATOR OPT-IN. Default `backup.enabled: false`. Turning it on
requires:
- In-cluster Postgres (this CronJob does NOT cover managed DB
services — for AWS RDS / GCP CloudSQL / Azure DB rely on the
provider's PITR).
- A sink choice (PVC or S3) configured in values.yaml.
- For S3: a Secret holding AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY
(or use a service account with IRSA on EKS).
The pg_dump invocation matches the canonical shape documented in
docs/operator/runbooks/postgres-backup.md so a manual run and a
CronJob run produce byte-identical dumps:
pg_dump --format=custom --no-owner --no-acl --dbname=certctl
For sink choices beyond PVC + S3 (GCS, Azure Blob, NFS, restic, etc.),
extend the `aws s3 cp` line below. The Job is intentionally minimal —
it does ONE thing (capture + ship), not orchestrate retention or
rotation. Off-host retention is the sink's responsibility (S3 lifecycle
rules, PVC snapshot retention on the storage class, etc.).
*/ -}}
{{- if .Values.backup.enabled }}
apiVersion: batch/v1
kind: CronJob
metadata:
name: {{ include "certctl.fullname" . }}-postgres-backup
labels:
{{- include "certctl.labels" . | nindent 4 }}
app.kubernetes.io/component: postgres-backup
spec:
schedule: {{ .Values.backup.schedule | quote }}
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: {{ .Values.backup.successfulJobsHistoryLimit | default 3 }}
failedJobsHistoryLimit: {{ .Values.backup.failedJobsHistoryLimit | default 1 }}
startingDeadlineSeconds: {{ .Values.backup.startingDeadlineSeconds | default 300 }}
jobTemplate:
spec:
backoffLimit: {{ .Values.backup.backoffLimit | default 1 }}
activeDeadlineSeconds: {{ .Values.backup.activeDeadlineSeconds | default 3600 }}
template:
metadata:
labels:
{{- include "certctl.labels" . | nindent 12 }}
app.kubernetes.io/component: postgres-backup
spec:
restartPolicy: Never
{{- with .Values.imagePullSecrets }}
imagePullSecrets:
{{- toYaml . | nindent 12 }}
{{- end }}
serviceAccountName: {{ include "certctl.serviceAccountName" . }}
securityContext:
runAsUser: 1000
runAsGroup: 1000
runAsNonRoot: true
fsGroup: 1000
containers:
- name: backup
image: {{ .Values.backup.image | default "postgres:16-alpine" | quote }}
imagePullPolicy: {{ .Values.backup.imagePullPolicy | default "IfNotPresent" | quote }}
env:
- name: PGHOST
value: {{ include "certctl.fullname" . }}-postgres
- name: PGPORT
value: {{ .Values.postgresql.service.port | default 5432 | quote }}
- name: PGUSER
valueFrom:
secretKeyRef:
name: {{ include "certctl.fullname" . }}-postgres
key: username
- name: PGPASSWORD
valueFrom:
secretKeyRef:
name: {{ include "certctl.fullname" . }}-postgres
key: password
- name: PGDATABASE
valueFrom:
secretKeyRef:
name: {{ include "certctl.fullname" . }}-postgres
key: database
{{- if eq (.Values.backup.sink | default "pvc") "s3" }}
# S3 sink — operator provides AWS credentials via the
# Secret referenced in backup.s3.credentialsSecret. The
# credentials need s3:PutObject + s3:ListBucket on the
# target bucket only; least-privilege per industry
# standard.
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: {{ .Values.backup.s3.credentialsSecret.name | quote }}
key: {{ .Values.backup.s3.credentialsSecret.accessKeyIdKey | default "AWS_ACCESS_KEY_ID" }}
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: {{ .Values.backup.s3.credentialsSecret.name | quote }}
key: {{ .Values.backup.s3.credentialsSecret.secretAccessKeyKey | default "AWS_SECRET_ACCESS_KEY" }}
{{- with .Values.backup.s3.region }}
- name: AWS_DEFAULT_REGION
value: {{ . | quote }}
{{- end }}
{{- end }}
command:
- /bin/sh
- -ceu
- |
# Phase 4 DEPL-H2: canonical pg_dump shape per
# docs/operator/runbooks/postgres-backup.md.
# Custom-format compressed dump, no ownership /
# ACL embedded — produces a portable artifact
# restorable into any Postgres ≥ source major
# via `pg_restore -d certctl <dump>`.
set -euo pipefail
TIMESTAMP="$(date -u +%Y%m%dT%H%M%SZ)"
DUMP_FILE="/tmp/certctl-${TIMESTAMP}.dump"
echo "[backup-cronjob] capturing dump at ${TIMESTAMP}"
pg_dump --format=custom --no-owner --no-acl --dbname="${PGDATABASE}" \
> "${DUMP_FILE}"
# Integrity check — pg_restore --list parses the
# dump's table-of-contents; a corrupt dump fails
# here without shipping garbage off-host. Same
# check the manual runbook performs.
echo "[backup-cronjob] verifying dump integrity"
pg_restore --list "${DUMP_FILE}" > /dev/null
{{- if eq (.Values.backup.sink | default "pvc") "s3" }}
# S3 sink — requires aws-cli. The default
# postgres:16-alpine image does NOT include
# aws-cli; operators MUST set
# backup.image to an image that bundles both
# (e.g. ghcr.io/your-org/postgres-aws:16) OR
# override backup.command to install aws-cli at
# runtime. The line below assumes the image has
# `aws` on PATH.
S3_PATH="{{ .Values.backup.s3.bucket }}/{{ .Values.backup.s3.prefix | default "certctl" }}/certctl-${TIMESTAMP}.dump"
echo "[backup-cronjob] uploading to s3://${S3_PATH}"
aws s3 cp "${DUMP_FILE}" "s3://${S3_PATH}"
rm -f "${DUMP_FILE}"
{{- else }}
# PVC sink — dump lands at /backups/certctl-${TIMESTAMP}.dump
# mounted from backup.pvc.claimName. Retention is the
# PVC's responsibility (storage-class snapshot lifecycle
# or a separate cleanup CronJob). The Job moves the
# file from /tmp to /backups atomically; never
# writes partial dumps into the durable mount.
FINAL_PATH="/backups/certctl-${TIMESTAMP}.dump"
echo "[backup-cronjob] persisting to ${FINAL_PATH}"
mv "${DUMP_FILE}" "${FINAL_PATH}"
{{- end }}
echo "[backup-cronjob] done"
{{- if ne (.Values.backup.sink | default "pvc") "s3" }}
volumeMounts:
- name: backups
mountPath: /backups
{{- end }}
resources:
{{- toYaml (.Values.backup.resources | default dict) | nindent 16 }}
{{- if ne (.Values.backup.sink | default "pvc") "s3" }}
volumes:
- name: backups
persistentVolumeClaim:
claimName: {{ .Values.backup.pvc.claimName | quote }}
{{- end }}
{{- with .Values.nodeAffinity }}
affinity:
nodeAffinity:
{{- toYaml . | nindent 14 }}
{{- end }}
{{- with .Values.backup.tolerations }}
tolerations:
{{- toYaml . | nindent 12 }}
{{- end }}
{{- end }}
@@ -0,0 +1,89 @@
{{- /*
Phase 4 DEPL-M1 closure (2026-05-14): Helm pre-install / pre-upgrade
hook that runs Postgres migrations before the server Deployment rolls.
Pre-DEPL-M1, postgres.RunMigrations was invoked at server boot
(cmd/server/main.go:151) as the only migration path. That works for
Compose deployments but conflicts with Kubernetes rolling deploys:
when a new server image lands with a schema change, multiple replicas
race the migration during the rollout. The hook resolves the race by
running migrations OUT OF BAND, exactly once, before any new server
pod starts.
How it works:
- The Job ships the same certctl-server image as the Deployment, so
the migration code path is binary-identical to the boot-time path.
- It runs `certctl-server --migrate-only` (a flag the cmd/server
main process must support — see cmd/server/main.go for the flag
parse + early-exit path).
- The CERTCTL_MIGRATIONS_VIA_HOOK=true env var is ALSO set on the
server Deployment (via values.yaml). When the server boots, it
sees this env var and skips its own RunMigrations call — the
hook already did the work. Compose deploys don't set the env
var, so they keep the boot-time path unchanged.
- hook-delete-policy hook-succeeded means the Job is cleaned up
automatically on success but retained on failure for operator
diagnosis.
- The hook-weight ensures the migration Job runs before any other
pre-install/pre-upgrade resources (the StatefulSet's PVC has to
exist first; in practice the StatefulSet has no hook so it lands
naturally in the install phase after the Job completes).
Operators on Compose: this hook is a no-op for you. The server still
runs migrations at boot per the existing path.
*/ -}}
{{- if .Values.migrations.viaHook }}
apiVersion: batch/v1
kind: Job
metadata:
name: {{ include "certctl.fullname" . }}-migrate
labels:
{{- include "certctl.labels" . | nindent 4 }}
app.kubernetes.io/component: migration
annotations:
"helm.sh/hook": pre-install,pre-upgrade
"helm.sh/hook-weight": "-5"
"helm.sh/hook-delete-policy": hook-succeeded,before-hook-creation
spec:
backoffLimit: {{ .Values.migrations.backoffLimit | default 1 }}
activeDeadlineSeconds: {{ .Values.migrations.activeDeadlineSeconds | default 600 }}
template:
metadata:
labels:
{{- include "certctl.labels" . | nindent 8 }}
app.kubernetes.io/component: migration
spec:
restartPolicy: Never
serviceAccountName: {{ include "certctl.serviceAccountName" . }}
securityContext:
{{- include "certctl.podSecurityContext" .Values.server.securityContext | nindent 8 }}
{{- with .Values.imagePullSecrets }}
imagePullSecrets:
{{- toYaml . | nindent 8 }}
{{- end }}
containers:
- name: migrate
image: {{ include "certctl.serverImage" . }}
imagePullPolicy: {{ .Values.server.image.pullPolicy }}
# Migration-only entrypoint. The server binary supports a
# --migrate-only flag that runs postgres.RunMigrations +
# postgres.RunSeed and exits cleanly (zero on success,
# non-zero on migration failure). See cmd/server/main.go
# for the implementation. The flag is hermetic — no HTTP
# listener starts, no scheduler ticks, no signing
# operations occur. Pure schema-mutation pass.
command:
- /app/server
- --migrate-only
env:
- name: CERTCTL_DATABASE_URL
value: {{ include "certctl.databaseURL" . | quote }}
- name: CERTCTL_LOG_LEVEL
value: {{ .Values.server.logging.level | default "info" | quote }}
- name: CERTCTL_LOG_FORMAT
value: {{ .Values.server.logging.format | default "json" | quote }}
resources:
{{- toYaml (.Values.migrations.resources | default .Values.server.resources) | nindent 12 }}
securityContext:
{{- include "certctl.containerSecurityContext" .Values.server.securityContext | nindent 12 }}
{{- end }}
@@ -9,6 +9,21 @@ metadata:
spec:
serviceName: {{ include "certctl.fullname" . }}-postgres
replicas: 1
# Phase 4 DEPL-M4 closure (2026-05-14): explicit StatefulSet update +
# pod-management strategies. Defaults make Postgres upgrades
# operator-controlled rather than automatic:
# updateStrategy.type: OnDelete — Postgres pods do NOT roll
# automatically when the StatefulSet spec changes. Operator
# deletes the pod explicitly after taking a backup + reviewing
# the change. Prevents an accidental Helm-template tweak from
# triggering a database restart at an awkward time.
# podManagementPolicy: OrderedReady — when scaling Postgres to
# a replica >1 (future HA work), pods come up one at a time
# and must reach Ready before the next pod is created. Aligns
# with the standard Postgres-on-Kubernetes pattern.
updateStrategy:
type: OnDelete
podManagementPolicy: OrderedReady
selector:
matchLabels:
{{- include "certctl.postgresSelectorLabels" . | nindent 6 }}
@@ -0,0 +1,145 @@
{{- /*
Phase 4 DEPL-L2 closure (2026-05-14): opt-in Prometheus AlertManager
rules covering the four operationally-actionable alerts every certctl
deployment wants out of the box.
OPERATOR OPT-IN. Default `monitoring.prometheusRules.enabled: false`.
Turning it on requires Prometheus Operator CRDs (PrometheusRule kind)
to be installed in-cluster. Without them this template renders an
object Kubernetes will reject — keep the toggle off if you're scraping
with vanilla Prometheus + a Helm-installed AlertManager rules
ConfigMap instead.
Metric names + thresholds verified against the actual
internal/api/handler/metrics.go exposition path:
- certctl_certificate_expiring_soon: server-side count of certs with
ExpiresAt in (now, now + 30d]. The 30-day window is computed in
internal/service/stats.go::GetDashboardSummary.
- certctl_agent_online: agents with heartbeat in the last 5 minutes.
A drop below certctl_agent_total signals offline agents.
- certctl_job_failed_total + certctl_job_completed_total: cumulative
counters; ratio gives the failure rate over the rate() window.
- certctl_issuance_failures_total: cumulative counter of failed
issuance attempts (renewal failures are issuance failures with a
specific error_class label).
Adjust thresholds per fleet — the defaults below are tuned for the
demo dataset (15 certs / 1 agent) and may need raising for production
fleets with thousands of certs where a steady rate of expiring certs
is the normal operating state.
*/ -}}
{{- if and .Values.monitoring.enabled .Values.monitoring.prometheusRules.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: {{ include "certctl.fullname" . }}-rules
labels:
{{- include "certctl.labels" . | nindent 4 }}
app.kubernetes.io/component: monitoring
{{- with .Values.monitoring.prometheusRules.labels }}
{{- toYaml . | nindent 4 }}
{{- end }}
spec:
groups:
- name: certctl.alerts
interval: {{ .Values.monitoring.prometheusRules.interval | default "60s" }}
rules:
# ---------------------------------------------------------------
# Alert: CertctlCertificateExpiringSoon
# Series: certctl_certificate_expiring_soon
# The certctl-server counts certs with ExpiresAt in
# (now, now + 30d] every metrics scrape. Fires whenever any cert
# crosses into that window — operator must triage or extend
# automation coverage. Rapid renewal infrastructure should keep
# this number small in steady state.
# ---------------------------------------------------------------
- alert: CertctlCertificateExpiringSoon
expr: certctl_certificate_expiring_soon > {{ .Values.monitoring.prometheusRules.thresholds.expiringCertificateCount | default 0 }}
for: {{ .Values.monitoring.prometheusRules.thresholds.expiringCertificateFor | default "5m" }}
labels:
severity: warning
component: certctl
annotations:
summary: "certctl: {{`{{ $value }}`}} certificate(s) expiring within 30 days"
description: >-
certctl_certificate_expiring_soon has been > {{ .Values.monitoring.prometheusRules.thresholds.expiringCertificateCount | default 0 }}
for 5+ minutes. Investigate via
/api/v1/certificates?status=expiring or the dashboard's
Expiring tab. If renewal automation should have covered
these, check the renewal scheduler logs for the cert IDs
+ the per-issuer failure rate.
# ---------------------------------------------------------------
# Alert: CertctlAgentOffline
# Series: certctl_agent_total - certctl_agent_online
# Agents flip from online → offline after 5 minutes without a
# heartbeat (internal/service/stats.go::GetDashboardSummary).
# The 1h `for:` window prevents a flapping agent from paging the
# operator on every transient network blip.
# ---------------------------------------------------------------
- alert: CertctlAgentOffline
expr: (certctl_agent_total - certctl_agent_online) > {{ .Values.monitoring.prometheusRules.thresholds.offlineAgentCount | default 0 }}
for: {{ .Values.monitoring.prometheusRules.thresholds.offlineAgentFor | default "1h" }}
labels:
severity: warning
component: certctl-agent
annotations:
summary: "certctl: {{`{{ $value }}`}} agent(s) offline for >1h"
description: >-
One or more certctl-agent instances have been without a
heartbeat for over an hour. Check the agent logs on the
affected hosts. If the agent host is intentionally
decommissioned, retire the agent via the dashboard or
POST /api/v1/agents/{id}/retire to suppress this alert.
# ---------------------------------------------------------------
# Alert: CertctlJobFailureRateHigh
# Series: certctl_job_failed_total / (certctl_job_failed_total + certctl_job_completed_total)
# Computes the failure rate over a 15-minute rate() window so
# short bursts don't fire but a sustained issue does. The 5%
# threshold is a conservative starter — adjust per fleet's
# baseline.
# ---------------------------------------------------------------
- alert: CertctlJobFailureRateHigh
expr: >-
(
rate(certctl_job_failed_total[15m])
/
clamp_min(rate(certctl_job_failed_total[15m]) + rate(certctl_job_completed_total[15m]), 1)
) > {{ .Values.monitoring.prometheusRules.thresholds.jobFailureRate | default 0.05 }}
for: {{ .Values.monitoring.prometheusRules.thresholds.jobFailureRateFor | default "15m" }}
labels:
severity: warning
component: certctl
annotations:
summary: "certctl: job failure rate above 5% over 15m"
description: >-
The 15m rate of certctl_job_failed_total / total jobs
has been above 5% for 15+ minutes. Open
/api/v1/jobs?status=failed to see the failing job IDs
and root-cause the recurring error class.
# ---------------------------------------------------------------
# Alert: CertctlIssuanceFailures
# Series: certctl_issuance_failures_total
# Any non-zero rate of issuance failures over a 15m window is
# operationally significant — a single CA outage or expired
# ACME account can cascade across the fleet.
# ---------------------------------------------------------------
- alert: CertctlIssuanceFailures
expr: rate(certctl_issuance_failures_total[15m]) > {{ .Values.monitoring.prometheusRules.thresholds.issuanceFailureRate | default 0 }}
for: {{ .Values.monitoring.prometheusRules.thresholds.issuanceFailureFor | default "15m" }}
labels:
severity: warning
component: certctl
annotations:
summary: "certctl: certificate issuance / renewal failures over 15m"
description: >-
certctl_issuance_failures_total has been incrementing
over the last 15 minutes. Check the per-issuer breakdown
via /api/v1/issuers + the failed-job log in
/api/v1/jobs?status=failed. Common causes: CA
outage, ACME account rate-limit, EAB credential
expiration, stepca provisioner key rotation without
certctl-side update.
{{- end }}
+193
View File
@@ -31,6 +31,36 @@ server:
port: 8443
# Resource requests and limits
#
# Phase 4 DEPL-M5 (2026-05-14): per-fleet-size tuning ladder. The
# default values below are validated against the demo dataset
# (15 certs / 1 agent) and the baselines in
# docs/operator/performance-baselines.md (single endpoint < 5s for
# 100 sequential requests = ~50ms p50; cursor-paginated 1000-cert
# inventory walk < 3s; renewal scan for 15 certs < 100ms).
#
# Larger fleet recommendations (TBD pending Phase 8 load-test runs;
# operators tune empirically until then — capture readings in your
# own loadtest-baselines log):
#
# ≤ 500 certs / 100 agents: defaults below (100m / 128Mi req, 500m / 512Mi lim)
# 5K certs / 1K agents: tune up — TBD Phase 8 (suggested starter: 500m / 512Mi req, 2000m / 2Gi lim)
# 50K certs / 10K agents: tune up — TBD Phase 8 (suggested starter: 2000m / 2Gi req, 4000m / 4Gi lim)
#
# The "suggested starter" values above are operator-tuning starting
# points, NOT validated. Phase 8 (load test coverage expansion) will
# measure them against synthetic fleets and replace the suggestions
# with measured ceilings. Until then, treat them as a "raise CPU
# before raising memory; raise both before scaling out" mental
# model. Per docs/operator/performance-baselines.md, certctl-server
# is CPU-bound on issuance / renewal scan work and memory-bound on
# the inventory query path.
#
# Database scale (postgresql.* below) tracks server scale roughly
# 1:1 — at 50K certs the Postgres instance needs 4 CPU / 4Gi RAM
# and shared_buffers ≥ 1Gi. Postgres tuning is out of scope for
# this comment; see docs/operator/runbooks/postgres-backup.md
# for the production-tuning entry-point.
resources:
requests:
cpu: 100m
@@ -449,6 +479,26 @@ agent:
replicas: 1
# Resource requests and limits
#
# Phase 4 DEPL-M5 (2026-05-14): per-fleet-size tuning ladder for the
# agent. Defaults are sized for the standard "one cert per host"
# operating pattern: the agent polls the server every 60s (default
# CERTCTL_AGENT_POLL_INTERVAL), generates ECDSA P-256 keys locally on
# issuance/renewal events, and is otherwise idle. CPU is bursty only
# during keygen + CSR submission.
#
# Tuning ladder (TBD pending Phase 8 — measure on your fleet):
#
# 1 cert / host (typical): defaults below (50m / 64Mi req, 200m / 256Mi lim)
# 10 certs / host: stays at defaults — agent is poll-driven, not work-bound by cert count
# 100 certs / host (rare): raise lim to 500m / 512Mi if you see throttling on issuance bursts
#
# The agent does NOT cache certs in memory — issuance is one-shot
# generate-then-deploy. So per-host memory scales with whatever
# truststore PEM bundles the agent's connectors load (Apache /
# Postfix / similar), not with the cert count. Defaults are
# appropriate for any "agent terminates ≤ 100 certs on this host"
# deployment.
resources:
requests:
cpu: 50m
@@ -612,6 +662,149 @@ monitoring:
# Optional relabeling for the scrape job.
# relabelings: []
# ----------------------------------------------------------------------
# Phase 4 DEPL-L2 closure (2026-05-14): PrometheusRule (alert rules)
#
# Operator opt-in. Requires Prometheus Operator CRDs (the
# `monitoring.coreos.com/v1` PrometheusRule kind) installed in
# cluster. Without those CRDs the rendered object is rejected by
# `kubectl apply` — keep enabled: false if you scrape with vanilla
# Prometheus + AlertManager rules ConfigMap instead.
#
# Four starter rules ship out of the box (see
# templates/prometheusrules.yaml for the full PromQL):
#
# CertctlCertificateExpiringSoon — certs expiring within 30d
# CertctlAgentOffline — agent without heartbeat for >1h
# CertctlJobFailureRateHigh — job-failure rate over 5% (15m)
# CertctlIssuanceFailures — any issuance failures in last 15m
#
# All thresholds are operator-tunable via the `thresholds:` block
# below. The defaults are tuned for the demo dataset (15 certs / 1
# agent); production fleets with sustained renewal volume MAY want
# to raise the expiringCertificateCount + jobFailureRate thresholds
# to suppress steady-state noise.
prometheusRules:
enabled: false
# Evaluation interval for the rule group.
interval: 60s
# Additional labels applied to the PrometheusRule metadata.
# labels: {}
# Per-alert threshold / duration tunables.
thresholds:
# Fire when more than N certs are in the expiring-soon window.
expiringCertificateCount: 0
expiringCertificateFor: 5m
# Fire when more than N agents are offline (server - online).
offlineAgentCount: 0
offlineAgentFor: 1h
# Fire when job failure rate exceeds this fraction (15m window).
jobFailureRate: 0.05
jobFailureRateFor: 15m
# Fire when issuance failure rate exceeds this value (15m window).
issuanceFailureRate: 0
issuanceFailureFor: 15m
# ==============================================================================
# Backup CronJob (Phase 4 DEPL-H2 closure, 2026-05-14)
# ==============================================================================
# Operator opt-in. Default OFF. The CronJob runs `pg_dump --format=custom
# --no-owner --no-acl --dbname=certctl` matching the canonical shape
# documented in docs/operator/runbooks/postgres-backup.md (so manual
# and automated dumps are byte-identical) and ships the result to a
# sink chosen below.
#
# DO NOT enable this for managed Postgres deployments (AWS RDS / GCP
# Cloud SQL / Azure DB) — those have built-in PITR backup that this
# CronJob cannot match. For in-cluster Postgres only.
backup:
enabled: false
# Cron expression (UTC). Default: 02:30 UTC daily.
schedule: "30 2 * * *"
# Sink: "pvc" (default — dump lands on a PersistentVolumeClaim) or
# "s3" (uploads via aws-cli — requires an image that bundles
# aws-cli, see backup.image below).
sink: pvc
# Container image. The default postgres:16-alpine has pg_dump but
# NOT aws-cli; for sink: s3 set this to an image that bundles both
# (e.g. ghcr.io/your-org/postgres-aws:16) or override the Job's
# command to install aws-cli at runtime.
image: postgres:16-alpine
imagePullPolicy: IfNotPresent
# PVC sink config — used when sink: pvc.
pvc:
# Name of an existing PersistentVolumeClaim mounted at /backups
# in the Job's pod. The PVC's storage class controls durability
# and snapshot retention. Operator creates this PVC out of band
# via their own storage policy.
claimName: certctl-backups
# S3 sink config — used when sink: s3.
s3:
# Target bucket (without s3:// prefix).
bucket: ""
# Object key prefix inside the bucket. Dumps land at
# s3://<bucket>/<prefix>/certctl-<TIMESTAMP>.dump.
prefix: certctl
# AWS region (sets AWS_DEFAULT_REGION). Optional if the image's
# AWS SDK can resolve the region another way (instance profile,
# IRSA, etc.).
region: ""
# Secret holding AWS credentials. The IAM principal needs
# s3:PutObject + s3:ListBucket on the target bucket only.
credentialsSecret:
name: certctl-backup-aws-creds
accessKeyIdKey: AWS_ACCESS_KEY_ID
secretAccessKeyKey: AWS_SECRET_ACCESS_KEY
# Job housekeeping.
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 1
startingDeadlineSeconds: 300
backoffLimit: 1
activeDeadlineSeconds: 3600
# Resource budget for the backup container. pg_dump is generally
# memory-light; ~250MB RSS for fleets up to 100K certs is typical.
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
# Optional tolerations for the backup Job pod.
tolerations: []
# ==============================================================================
# Migrations via Helm hook (Phase 4 DEPL-M1 closure, 2026-05-14)
# ==============================================================================
# When viaHook: true, the chart deploys templates/migration-job.yaml as
# a pre-install + pre-upgrade hook that runs `certctl-server
# --migrate-only` (a hermetic schema-mutation pass) before the server
# Deployment rolls.
#
# Set CERTCTL_MIGRATIONS_VIA_HOOK=true in the server Deployment env to
# tell the server to skip its boot-time RunMigrations call (the hook
# already did the work; running again at boot would race across
# replicas during rollouts).
#
# Default OFF — when off, the server runs migrations at boot exactly
# as it always has (Compose deploys keep this path).
migrations:
viaHook: false
# Job housekeeping.
backoffLimit: 1
activeDeadlineSeconds: 600
# Resource budget for the migration Job pod. The migration pass is
# I/O-bound on Postgres; matches the server's resource budget by
# default. Override here if migrations on a large database need
# more headroom than the steady-state server.
# resources:
# requests:
# cpu: 100m
# memory: 128Mi
# limits:
# cpu: 500m
# memory: 512Mi
# ==============================================================================
# Network Policy (Bundle 3 closure / D11)
# ==============================================================================
@@ -0,0 +1,243 @@
# Runbook: Prometheus bearer token for the metrics scrape endpoint
> Last reviewed: 2026-05-14
Use this when:
- You're enabling Prometheus Operator scraping via the Helm chart's
`monitoring.serviceMonitor.enabled` toggle.
- Your Prometheus scrapes are returning 401 against
`/api/v1/metrics/prometheus`.
- An auditor asks "how is the metrics endpoint authenticated?"
## The constraint
The certctl server exposes Prometheus metrics at
`/api/v1/metrics/prometheus`. This endpoint is **RBAC-gated on the
`metrics.read` permission** (per `internal/api/router/router.go`).
Like every other gated handler, it requires an authenticated actor
holding that permission — there is no anonymous-scrape path.
The rationale: the metrics payload includes operational counters
(cert counts by status, agent counts, issuance failure rates) that
a public-facing observer should not see. Most certctl deployments
expose a reverse proxy / load balancer to the wider network; the
auth gate on `/api/v1/metrics/prometheus` prevents an external
observer from learning operational state via the metrics endpoint
even when the proxy itself is reachable.
## What you need to set up
Three pieces:
1. **An API key with `metrics.read` permission** (and only that
permission — least-privilege).
2. **A Kubernetes Secret** holding that API key.
3. **`monitoring.serviceMonitor.bearerTokenSecret`** in the chart's
values pointing at the Secret.
## Step 1: Create the metrics-read role + API key
The chart's seed migration ships a `metrics-read` role-template, but
some operators want a dedicated identity per scrape source. Both
approaches work; the dedicated-identity path is below.
```bash
# 1. Bootstrap or impersonate a session with auth.role.assign +
# auth.apikey.create permissions (admin actor is fine).
# 2. Create a role with only metrics.read.
curl -sS --cacert ./ca.crt -X POST \
-H "Authorization: Bearer ${ADMIN_API_KEY}" \
-H "Content-Type: application/json" \
https://certctl.your-org.example/api/v1/auth/roles \
-d '{"id":"r-prometheus-scrape","name":"Prometheus scrape","permissions":["metrics.read"]}'
# 3. Create an actor that holds the role.
curl -sS --cacert ./ca.crt -X POST \
-H "Authorization: Bearer ${ADMIN_API_KEY}" \
-H "Content-Type: application/json" \
https://certctl.your-org.example/api/v1/auth/actors \
-d '{"id":"actor-prometheus","name":"Prometheus scrape","roles":["r-prometheus-scrape"]}'
# 4. Mint an API key for the actor. The response includes a
# `key_value` field that's only returned ONCE — capture it.
curl -sS --cacert ./ca.crt -X POST \
-H "Authorization: Bearer ${ADMIN_API_KEY}" \
-H "Content-Type: application/json" \
https://certctl.your-org.example/api/v1/auth/apikeys \
-d '{"actor_id":"actor-prometheus","name":"prometheus-scrape-token"}' \
| tee /tmp/prom-key.json
# Extract just the secret material:
jq -r '.key_value' /tmp/prom-key.json
```
The mint endpoint returns the API key plaintext exactly once. The
server stores only a constant-time-comparable hash; if you lose the
key value, mint a new one.
## Step 2: Create the Kubernetes Secret
```bash
NAMESPACE=certctl
API_KEY=$(jq -r '.key_value' /tmp/prom-key.json)
kubectl create secret generic certctl-prometheus-key \
-n "$NAMESPACE" \
--from-literal=api-key="$API_KEY"
```
Now scrub the temporary file:
```bash
shred -u /tmp/prom-key.json
```
## Step 3: Wire the Secret into the chart values
In your `values.yaml` (or `--set` overrides):
```yaml
monitoring:
enabled: true
serviceMonitor:
enabled: true
interval: 30s
scrapeTimeout: 10s
bearerTokenSecret:
name: certctl-prometheus-key
key: api-key
```
Re-apply the chart:
```bash
helm upgrade certctl . -n "$NAMESPACE" --reuse-values
```
The rendered ServiceMonitor will now include the `bearerTokenSecret`
block. Prometheus Operator's reconciler picks it up and injects the
bearer token into the scrape request.
## Verification
```bash
# 1. Confirm the ServiceMonitor renders with the secret reference
kubectl get servicemonitor -n "$NAMESPACE" certctl-server -o yaml \
| grep -A2 bearerTokenSecret
# Expected:
# bearerTokenSecret:
# name: certctl-prometheus-key
# key: api-key
# 2. Tail the certctl-server logs for the next ~60 seconds (one
# Prometheus scrape interval). Look for incoming GET /metrics/prometheus
# requests authenticated successfully — no 401s.
kubectl logs -n "$NAMESPACE" -l app.kubernetes.io/component=server \
--tail=100 -f | grep -E "GET /api/v1/metrics/prometheus|metrics-scrape"
# 3. From the Prometheus UI's "Targets" page, the certctl-server
# target should be UP and last-scrape-error empty. If it's
# showing 401, the bearer token isn't reaching the request — see
# troubleshooting below.
```
## Troubleshooting
### Prometheus target shows 401
Three possible causes:
1. **Wrong Secret name / key.** Run
`kubectl get secret -n "$NAMESPACE" certctl-prometheus-key -o yaml`
and confirm the `data.api-key` field exists with a base64-encoded
non-empty value. The Secret's data field name must match the
`bearerTokenSecret.key` value in `monitoring.serviceMonitor`.
2. **API key doesn't have `metrics.read`.** Hit the gating endpoint
manually from inside the cluster with the same key:
```bash
kubectl run --rm -it --image=curlimages/curl debug -- \
curl -sS -H "Authorization: Bearer <API_KEY>" \
https://certctl-server.certctl.svc.cluster.local:8443/api/v1/metrics/prometheus
```
A 401 here means the role doesn't include `metrics.read`. A 403
means the role exists but the API key isn't assigned to it.
3. **TLS verification failure (not a 401, but masquerading as one in
Prometheus's logs).** The default ServiceMonitor template sets
`insecureSkipVerify: true` to support demos — production deploys
should set `tlsConfig.caFile` or `tlsConfig.ca.secret` per the
ServiceMonitor docs.
### Prometheus target shows TLS errors
`monitoring.serviceMonitor.tlsConfig` overrides the default. Three
patterns:
```yaml
# Pattern 1: trust the system CA bundle (production behind a real CA)
tlsConfig:
caFile: /etc/ssl/certs/ca-certificates.crt
serverName: certctl.your-org.example
# Pattern 2: trust a CA from a Secret mounted by Prometheus Operator
tlsConfig:
ca:
secret:
name: certctl-ca
key: ca.crt
serverName: certctl.your-org.example
# Pattern 3: skip verification (DEMO ONLY — DO NOT USE IN PRODUCTION)
tlsConfig:
insecureSkipVerify: true
```
The certctl server's self-signed bootstrap cert (default
`server.tls.existingSecret` from the chart) presents a CN of
`certctl-server`. If your `serverName` doesn't match, the scrape
fails with `x509: certificate is valid for certctl-server, not ...`.
## Rotation
API keys are constant-time-compared, stored hashed, and never
logged. Rotation:
```bash
# 1. Mint a new key (same actor + role)
curl -sS --cacert ./ca.crt -X POST \
-H "Authorization: Bearer ${ADMIN_API_KEY}" \
-H "Content-Type: application/json" \
https://certctl.your-org.example/api/v1/auth/apikeys \
-d '{"actor_id":"actor-prometheus","name":"prometheus-scrape-token-v2"}' \
| tee /tmp/prom-key-new.json
# 2. Update the Secret in place
kubectl create secret generic certctl-prometheus-key \
-n certctl \
--from-literal=api-key="$(jq -r '.key_value' /tmp/prom-key-new.json)" \
--dry-run=client -o yaml | kubectl apply -f -
# 3. Wait one scrape interval; verify the next scrape uses the new key.
# 4. Revoke the old key
curl -sS --cacert ./ca.crt -X DELETE \
-H "Authorization: Bearer ${ADMIN_API_KEY}" \
https://certctl.your-org.example/api/v1/auth/apikeys/<OLD_KEY_ID>
# 5. Scrub the temp file
shred -u /tmp/prom-key-new.json
```
Prometheus Operator picks up Secret changes automatically — no
ServiceMonitor edit needed, no Prometheus restart.
## Related reading
- [`docs/operator/rbac.md`](../rbac.md) — the full RBAC primitive,
permission catalogue, and role-assignment workflow.
- [`docs/operator/security.md`](../security.md) — the broader auth
posture including the API key / OIDC / break-glass paths.
- [`docs/operator/auth-threat-model.md`](../auth-threat-model.md) —
why `/api/v1/metrics/prometheus` is gated, and what an
unauthenticated leak of metrics data would reveal.
+193
View File
@@ -0,0 +1,193 @@
# Runbook: Helm rollback for certctl
> Last reviewed: 2026-05-14
Use this when:
- A `helm upgrade` rolled out a bad release and the operator wants to
return to the previous working state.
- A schema migration shipped a change the operator wants to back out.
- An emergency change needs reverting and forward-fix isn't yet
available.
This page covers `helm rollback` mechanics + the cases where
rollback is NOT enough on its own (schema migrations are the main
one).
## What `helm rollback` does
`helm rollback <release> [revision]` re-applies the manifests from a
previous Helm revision. It re-creates / updates Kubernetes objects to
match that revision's template output and is safe for:
- **Deployment image bumps:** rolls the container image back to the
previous tag. Pods restart with the old image.
- **ConfigMap / Secret content changes:** old values land in the
config; pods that consume them via `envFrom` or volume mounts get
the prior values on the next restart.
- **Resource requests / limits / replica count:** the spec changes
back to the prior values. Kubernetes reschedules pods accordingly.
- **Service / Ingress / NetworkPolicy changes:** networking flips
back to the previous shape immediately.
## What `helm rollback` does NOT do
The Kubernetes layer is reversible; the **database schema is not**.
This is the single most common gap in a rollback plan.
### Schema migrations are forward-only by design
certctl's migrations under `migrations/` are numbered up-migrations
(`NNNNNN_*.up.sql`) with paired down-migrations
(`NNNNNN_*.down.sql`) shipped alongside. The `postgres.RunMigrations`
path applied at server boot only runs the `*.up.sql` files. The
`*.down.sql` files exist for development reference + a hypothetical
"surgical revert" path but are **not invoked by `helm rollback`**.
The implication: if `v2.1.0 → v2.2.0` ships migrations 000100,
000101, 000102 (adding columns, changing constraints, dropping
indexes), then `helm rollback` to v2.1.0 takes you back to the v2.1.0
container image — but the database still has migrations 000100-102
applied. The v2.1.0 server code doesn't know about those columns; it
either ignores them (best case) or fails to start (if the schema
diverged in a way the older code can't tolerate).
### When is rollback safe without a schema revert?
Migrations are **additive-only** in 90%+ of cases. The categories:
| Migration class | Safe to roll back without schema revert? | Why |
|---|---|---|
| Add column with default | Yes | Old code ignores the new column |
| Add table | Yes | Old code doesn't reference the table |
| Add index | Yes | Old code doesn't depend on the index existing |
| Add CHECK / FOREIGN KEY constraint | Usually yes | Only fails on row data inserted by new code that violates the old code's constraints |
| Rename column / table | NO | Old code's queries reference the original name |
| Drop column / table | NO (data loss) | New code already stopped writing the column; old code expects it |
| Type change (`VARCHAR(40)``TEXT`) | Usually yes | Old code's column read still works |
| Backfill a column | Yes | Old code ignores the backfilled value |
If your upgrade only added columns / tables / indexes, `helm
rollback` is sufficient. If it renamed or dropped anything, you need
a database-level revert.
## Procedure: standard rollback (additive-only migrations)
```bash
# 1. Identify the target revision
helm history certctl -n <namespace>
# 2. Take a backup BEFORE rolling back (defense in depth — if
# rollback exposes a data corruption issue, restore is the only
# path back)
# See docs/operator/runbooks/postgres-backup.md for the canonical
# pg_dump invocation.
# 3. Roll back to the chosen revision
helm rollback certctl <revision> -n <namespace> --wait --timeout 5m
# 4. Verify
kubectl get pods -n <namespace> -l app.kubernetes.io/instance=certctl
kubectl logs -n <namespace> -l app.kubernetes.io/component=server --tail=50
```
Watch for migration-version mismatch warnings in the server logs. If
the older server code refuses to start because the schema is ahead
of what it knows about, escalate to "rollback with schema revert."
## Procedure: rollback with schema revert
This is the rare case. Use it when:
- A column / table was renamed or dropped in the rolled-up release.
- The older code refuses to start with the newer schema.
```bash
# 1. Take a fresh backup right NOW (the current schema is what we're
# reverting from; if anything goes wrong we want a clean
# forward-recovery option)
kubectl exec -n <namespace> statefulset/certctl-postgres -- \
pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
> "certctl-pre-rollback-$(date -u +%Y%m%dT%H%M%SZ).dump"
# 2. Stop the server Deployment to prevent it from writing to the
# database during the revert
kubectl scale deploy/certctl-server -n <namespace> --replicas=0
# 3. Apply the relevant *.down.sql files manually, one at a time, in
# reverse migration-number order. Example for reverting two
# migrations:
NEW=000102 # newest migration on the running schema
OLD=000100 # oldest migration to revert (inclusive)
for MIG in 000102 000101 000100; do
kubectl exec -i -n <namespace> statefulset/certctl-postgres -- \
psql --user=certctl --dbname=certctl \
< migrations/${MIG}_*.down.sql
done
# 4. Manually update the schema_migrations table to reflect the
# reverted state (the migration runner's bookkeeping)
kubectl exec -n <namespace> statefulset/certctl-postgres -- \
psql --user=certctl --dbname=certctl -c \
"DELETE FROM schema_migrations WHERE version > $((OLD - 1));"
# 5. NOW run helm rollback. The server pod will start with a schema
# that matches its code.
helm rollback certctl <revision> -n <namespace> --wait --timeout 5m
```
The `*.down.sql` files are tested but only against pristine schemas —
they may not handle every data shape a production database
accumulates. ALWAYS take a backup first; the down-migrations are
a recovery tool, not a transactional contract.
## Procedure: full restore (when revert isn't tractable)
When a down-migration would lose data (drop columns / tables that
hold rows the older code can't read but the newer code populated), a
full restore is the only safe path. This is the procedure described
in
[`docs/operator/runbooks/disaster-recovery.md`](disaster-recovery.md#postgres-restore).
The summary:
1. Stop certctl.
2. Take a backup of the CURRENT schema (defense in depth).
3. Restore the LAST backup taken BEFORE the bad upgrade.
4. Roll the Helm release back to the matching code version.
5. Restart certctl.
6. Re-run any audited writes that happened in the window between the
backup and the bad upgrade (read the audit log; the API surface
is recoverable).
The DR runbook owns the canonical commands.
## Common pitfalls
- **Forgetting the backup before rollback.** A schema-revert path is
not safe without a fresh backup. If something goes wrong mid-revert
and your most recent backup is from last night, you've lost any
cert-issuance history between then and now.
- **Rolling back the chart without rolling back the database state**
on a release that included a destructive migration (drop column,
drop table). Symptoms: old code starts, queries fail with
"column does not exist," server crashes in a loop. Recovery
requires schema revert OR full restore.
- **Letting the agents drift.** `helm rollback` updates the agent
DaemonSet's image too — agents on different versions than the
server may produce incompatible CSR payloads. After rollback,
confirm agent images are at the matching version via
`kubectl get daemonset certctl-agent -o jsonpath='{.spec.template.spec.containers[0].image}'`.
- **GHCR images pinned by digest:** the rollback restores the prior
`image:` value from the Helm template. If your operator workflow
uses `image.digest` pinning, the digest comes back too — make
sure that digest still exists on ghcr.io. They do persist; old
tags are never deleted, but a private mirror may have garbage-collected.
## Related reading
- [`docs/operator/runbooks/postgres-backup.md`](postgres-backup.md) —
the backup procedure that's the precondition for any
schema-revert path.
- [`docs/operator/runbooks/disaster-recovery.md`](disaster-recovery.md) —
the full restore procedure when rollback isn't tractable.
- [`docs/migration/api-keys-to-rbac.md`](../../migration/api-keys-to-rbac.md) —
example of a migration that the runtime supports rolling back via
feature flag (rare).
+87
View File
@@ -0,0 +1,87 @@
#!/usr/bin/env bash
# scripts/ci-guards/helm-templates-lint.sh
#
# Phase 4 closure (2026-05-14): Helm chart lint + template-render gate.
#
# Runs `helm lint` against the chart and `helm template` against four
# representative value combinations to catch:
# - Syntax errors in any chart template
# - Schema-violation in values.yaml
# - Missing required values uncovered by the opt-in toggles
# (backup, monitoring.prometheusRules, migrations.viaHook)
# - Render errors when new templates are added without updating
# this guard's coverage matrix
#
# The opt-in templates added in Phase 4 (backup-cronjob.yaml,
# prometheusrules.yaml, migration-job.yaml) default OFF; without
# explicit coverage in the guard's matrix they would never render in
# CI and silent breakage could ship.
set -euo pipefail
CHART_DIR="deploy/helm/certctl"
if [ ! -d "$CHART_DIR" ]; then
echo "helm-templates-lint: skipped — $CHART_DIR not found (running outside repo root?)"
exit 0
fi
if ! command -v helm >/dev/null 2>&1; then
echo "helm-templates-lint: skipped — helm not on PATH."
echo " Install: https://helm.sh/docs/intro/install/"
exit 0
fi
echo "helm-templates-lint: running helm lint"
helm lint "$CHART_DIR" >/dev/null
# Minimal valid value set to satisfy chart preflight validators
# (server.tls.existingSecret, server.auth.apiKey, postgresql.auth.password).
# These are NOT real secrets — they're just non-empty strings to
# make the chart render in lint mode.
BASE_VALUES=(
--set "server.tls.existingSecret=lint-test-tls"
--set "server.auth.apiKey=lint-test-apikey"
--set "postgresql.auth.password=lint-test-pgpass"
)
render_and_check() {
local label="$1"
shift
local out
out="$(helm template "$CHART_DIR" "${BASE_VALUES[@]}" "$@" 2>&1)" || {
echo "helm-templates-lint: FAIL — template render error for '$label'"
echo "$out" | tail -20
return 1
}
echo "helm-templates-lint: OK — '$label'"
}
# Matrix:
# 1. Defaults (no Phase 4 opt-ins) — confirms the chart still
# renders cleanly when every Phase 4 feature is off.
# 2. backup.enabled=true (PVC sink) — confirms backup-cronjob renders.
# 3. backup.enabled=true + sink=s3 — confirms S3 sink branch renders.
# 4. monitoring.prometheusRules.enabled=true — confirms PrometheusRule renders.
# 5. migrations.viaHook=true — confirms migration-job hook renders.
# 6. All Phase 4 opt-ins on simultaneously — confirms no template
# interaction breaks the others.
render_and_check "defaults"
render_and_check "backup.enabled (pvc)" \
--set "backup.enabled=true"
render_and_check "backup.enabled (s3)" \
--set "backup.enabled=true" \
--set "backup.sink=s3" \
--set "backup.s3.bucket=lint-test-bucket"
render_and_check "monitoring.prometheusRules.enabled" \
--set "monitoring.enabled=true" \
--set "monitoring.prometheusRules.enabled=true"
render_and_check "migrations.viaHook" \
--set "migrations.viaHook=true"
render_and_check "all phase 4 opt-ins" \
--set "backup.enabled=true" \
--set "monitoring.enabled=true" \
--set "monitoring.prometheusRules.enabled=true" \
--set "migrations.viaHook=true"
echo "helm-templates-lint: all matrix combinations rendered cleanly"