Phase 4 of the certctl architecture diligence remediation closure.
Seven findings, all in deploy/helm/certctl/.
DEPL-H2 (High) — ship deploy/helm/certctl/templates/backup-cronjob.yaml
Operator opt-in via backup.enabled=true. Default OFF. CronJob runs
pg_dump --format=custom --no-owner --no-acl --dbname=certctl
matching the canonical shape in
docs/operator/runbooks/postgres-backup.md (so manual and
automated dumps use the same format and flags). Sink: PVC (default) OR S3
via aws-cli. Documented as in-cluster-Postgres only — managed DB
deployments rely on their provider's PITR.
DEPL-M1 (Med) — Helm pre-install/pre-upgrade migration hook
deploy/helm/certctl/templates/migration-job.yaml — runs
`certctl-server --migrate-only` before the server Deployment
rolls. The --migrate-only flag (new in cmd/server/main.go) is a
hermetic schema-mutation pass: load config, open DB pool, run
RunMigrations + RunSeed, exit 0. No HTTP listener, no scheduler,
no signing setup.
Server's boot-time RunMigrations call is now gated on
CERTCTL_MIGRATIONS_VIA_HOOK — when set true, the server skips
the boot path (the hook owns the work). Default still runs at
boot, so Compose / VM / bare-metal deploys are unchanged.
migrations.viaHook: false in values.yaml (off by default).
DEPL-M4 (Med) — explicit Postgres StatefulSet strategy fields
deploy/helm/certctl/templates/postgres-statefulset.yaml adds:
spec.updateStrategy.type: OnDelete
spec.podManagementPolicy: OrderedReady
Operator-controlled Postgres upgrades (the OnDelete strategy
means a chart template tweak no longer triggers an immediate
Postgres restart). OrderedReady aligns with the standard
Postgres-on-Kubernetes pattern for any future HA work.
DEPL-M5 (Med) — per-fleet-size resource ladder documentation
deploy/helm/certctl/values.yaml — extended comments next to
server.resources + agent.resources documenting:
"≤ 500 certs / 100 agents" → defaults are validated
"5K certs / 1K agents" → starter suggestions, TBD Phase 8
"50K certs / 10K agents" → starter suggestions, TBD Phase 8
Numbers for the small-fleet case derive from the measured
baselines in docs/operator/performance-baselines.md
(50ms p50, < 3s for 1000-cert inventory walk, etc.). Larger
fleet numbers explicitly marked TBD pending Phase 8 load-test
runs — operators tune empirically until then.
DEPL-L1 (Low) — Helm rollback runbook
docs/operator/runbooks/rollback.md — covers helm rollback
mechanics, the schema-migration manual-cleanup path (when
*.down.sql files apply vs. when full restore is the only safe
path), and the per-migration-class safe-to-rollback table.
DEPL-L2 (Low) — Prometheus AlertManager rules
deploy/helm/certctl/templates/prometheusrules.yaml — opt-in via
monitoring.prometheusRules.enabled=true. Default OFF. Four
starter rules using verified metric names from
internal/api/handler/metrics.go:
CertctlCertificateExpiringSoon (certctl_certificate_expiring_soon)
CertctlAgentOffline ((agent_total - agent_online) > 0 for 1h)
CertctlJobFailureRateHigh (failure rate over 5% for 15m)
CertctlIssuanceFailures (any failures over 15m window)
All thresholds operator-tunable via
monitoring.prometheusRules.thresholds.* in values.
DEPL-L3 (Low) — Prometheus bearer-token setup runbook
docs/operator/runbooks/prometheus-bearer-token.md — documents
the API-key + Secret + values wiring for the RBAC-gated
/api/v1/metrics/prometheus scrape endpoint. End-to-end
procedure with troubleshooting steps + rotation guide.
CI guard: scripts/ci-guards/helm-templates-lint.sh
Six-combo matrix: defaults / backup PVC / backup S3 /
prometheusRules / migrations.viaHook / all-on. Each runs helm
template + checks render success. helm lint also gated.
Wired into the auto-pickup loop in .github/workflows/ci.yml;
azure/setup-helm@b9e51907 (v4.3.0, SHA-pinned per Phase 1
RED-2) installs helm v3.16.0 on the runner.
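
For readers who have not opened the guard script, the six-combo matrix can be sketched roughly as below. This is an illustrative sketch, not the contents of scripts/ci-guards/helm-templates-lint.sh: the function name run_matrix, the HELM/CHART override variables, and in particular the backup.sink=s3 key used to select the S3 sink are assumptions made for the example.

```shell
# Sketch of a render matrix. CHART and HELM can be overridden from the environment.
CHART="${CHART:-deploy/helm/certctl/}"
HELM="${HELM:-helm}"

# Render the chart once per combination and report how many succeeded.
# Each matrix line is "name|flags". backup.sink is a HYPOTHETICAL key name.
run_matrix() {
  pass=0
  total=0
  while IFS='|' read -r name flags; do
    total=$((total + 1))
    # $flags is left unquoted on purpose so it splits into separate arguments.
    if "$HELM" template "$CHART" $flags >/dev/null 2>&1 </dev/null; then
      echo "PASS $name"
      pass=$((pass + 1))
    else
      echo "FAIL $name"
    fi
  done <<'EOF'
defaults|
backup-pvc|--set backup.enabled=true
backup-s3|--set backup.enabled=true --set backup.sink=s3
prometheus-rules|--set monitoring.prometheusRules.enabled=true
migrations-hook|--set migrations.viaHook=true
all-on|--set backup.enabled=true --set monitoring.prometheusRules.enabled=true --set migrations.viaHook=true
EOF
  echo "$pass/$total matrix combinations pass"
  [ "$pass" -eq "$total" ]
}
```

Calling run_matrix returns non-zero as soon as any combination fails to render, which is what lets a CI step gate on it.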
Verification (all pass):
ls deploy/helm/certctl/templates/{backup-cronjob,migration-job,prometheusrules}.yaml
grep -E 'updateStrategy|podManagementPolicy' deploy/helm/certctl/templates/postgres-statefulset.yaml # 2 matches
helm template deploy/helm/certctl/ --set backup.enabled=true \
--set monitoring.prometheusRules.enabled=true --set migrations.viaHook=true \
| grep -E "kind: (CronJob|PrometheusRule|Job)" # 3 matches
helm lint deploy/helm/certctl/ # 0 failed
ls docs/operator/runbooks/{rollback,prometheus-bearer-token}.md
bash scripts/ci-guards/helm-templates-lint.sh # 6/6 matrix combinations pass
Go build clean (cmd/server compiles, migrate-only path verified by
the build target). YAML validated.
Closes: cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H2
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M1
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M4
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M5
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L1
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L2
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L3
Runbook: Helm rollback for certctl
Last reviewed: 2026-05-14
Use this when:
- A helm upgrade rolled out a bad release and the operator wants to return to the previous working state.
- A schema migration shipped a change the operator wants to back out.
- An emergency change needs reverting and forward-fix isn't yet available.
This page covers helm rollback mechanics + the cases where
rollback is NOT enough on its own (schema migrations are the main
one).
What helm rollback does
helm rollback <release> [revision] re-applies the manifests from a
previous Helm revision. It re-creates / updates Kubernetes objects to
match that revision's template output and is safe for:
- Deployment image bumps: rolls the container image back to the previous tag. Pods restart with the old image.
- ConfigMap / Secret content changes: old values land in the config; pods that consume them via envFrom or volume mounts get the prior values on the next restart.
- Resource requests / limits / replica count: the spec changes back to the prior values. Kubernetes reschedules pods accordingly.
- Service / Ingress / NetworkPolicy changes: networking flips back to the previous shape immediately.
What helm rollback does NOT do
The Kubernetes layer is reversible; the database schema is not. This is the single most common gap in a rollback plan.
Schema migrations are forward-only by design
certctl's migrations under migrations/ are numbered up-migrations
(NNNNNN_*.up.sql) with paired down-migrations
(NNNNNN_*.down.sql) shipped alongside. The postgres.RunMigrations
path applied at server boot only runs the *.up.sql files. The
*.down.sql files exist for development reference + a hypothetical
"surgical revert" path but are not invoked by helm rollback.
The implication: if v2.1.0 → v2.2.0 ships migrations 000100,
000101, 000102 (adding columns, changing constraints, dropping
indexes), then helm rollback to v2.1.0 takes you back to the v2.1.0
container image — but the database still has migrations 000100-102
applied. The v2.1.0 server code doesn't know about those columns; it
either ignores them (best case) or fails to start (if the schema
diverged in a way the older code can't tolerate).
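
Before rolling back, it can help to list which migrations will stay applied after the image goes back. The sketch below assumes you have a checkout of the newer release's migrations/ directory and know the last migration number the older release shipped; the helper names strip_zeros and migrations_beyond are made up for this example and are not part of certctl.

```shell
# Strip leading zeros so "000100" compares as decimal 100, not as octal.
strip_zeros() {
  n="${1#"${1%%[!0]*}"}"
  echo "${n:-0}"
}

# List the *.up.sql files in DIR whose migration number is greater than LAST.
# Usage: migrations_beyond DIR LAST
migrations_beyond() {
  dir="$1"
  last="$(strip_zeros "$2")"
  for f in "$dir"/*.up.sql; do
    [ -e "$f" ] || continue          # the glob matched nothing
    base="${f##*/}"                  # file name without the directory
    num="$(strip_zeros "${base%%_*}")"
    if [ "$num" -gt "$last" ]; then
      echo "$base"
    fi
  done
}
```

For the example in this section, migrations_beyond migrations 000099 would print the up-migrations numbered 000100 and later; those are the files to review against the table in the next section.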
When is rollback safe without a schema revert?
Migrations are additive-only in 90%+ of cases. The categories:
| Migration class | Safe to roll back without schema revert? | Why |
|---|---|---|
| Add column with default | Yes | Old code ignores the new column |
| Add table | Yes | Old code doesn't reference the table |
| Add index | Yes | Old code doesn't depend on the index existing |
| Add CHECK / FOREIGN KEY constraint | Usually yes | Fails only if the old code writes rows that violate the newly added constraint |
| Rename column / table | NO | Old code's queries reference the original name |
| Drop column / table | NO (data loss) | New code already stopped writing the column; old code expects it |
| Type change (VARCHAR(40) → TEXT) | Usually yes | Old code's column read still works |
| Backfill a column | Yes | Old code ignores the backfilled value |
If your upgrade only added columns / tables / indexes, helm rollback is sufficient. If it renamed or dropped anything, you need
a database-level revert.
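
A first-pass triage of a migration file against the table above can be sketched with grep. This is a rough heuristic under stated assumptions, not a certctl tool: the function name classify_migration is hypothetical, it only looks for DROP TABLE / DROP COLUMN / RENAME keywords, and it will miss forms such as ALTER TABLE ... DROP <name> without the COLUMN keyword. Read the SQL yourself before relying on the answer.

```shell
# Print "needs-schema-revert" if the SQL file drops or renames a column/table,
# otherwise "helm-rollback-sufficient". Keyword match only; case-insensitive.
classify_migration() {
  if grep -Eiq 'drop[[:space:]]+(table|column)|rename[[:space:]]' "$1"; then
    echo "needs-schema-revert"
  else
    echo "helm-rollback-sufficient"
  fi
}
```

Run it once per *.up.sql file introduced by the upgrade; a single needs-schema-revert result means the standard procedure below is not enough.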
Procedure: standard rollback (additive-only migrations)
# 1. Identify the target revision
helm history certctl -n <namespace>
# 2. Take a backup BEFORE rolling back (defense in depth — if
# rollback exposes a data corruption issue, restore is the only
# path back)
# See docs/operator/runbooks/postgres-backup.md for the canonical
# pg_dump invocation.
# 3. Roll back to the chosen revision
helm rollback certctl <revision> -n <namespace> --wait --timeout 5m
# 4. Verify
kubectl get pods -n <namespace> -l app.kubernetes.io/instance=certctl
kubectl logs -n <namespace> -l app.kubernetes.io/component=server --tail=50
Watch for migration-version mismatch warnings in the server logs. If the older server code refuses to start because the schema is ahead of what it knows about, escalate to "rollback with schema revert."
Procedure: rollback with schema revert
This is the rare case. Use it when:
- A column / table was renamed or dropped in the release being rolled back.
- The older code refuses to start with the newer schema.
# 1. Take a fresh backup right NOW (the current schema is what we're
# reverting from; if anything goes wrong we want a clean
# forward-recovery option)
kubectl exec -n <namespace> statefulset/certctl-postgres -- \
pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
> "certctl-pre-rollback-$(date -u +%Y%m%dT%H%M%SZ).dump"
# 2. Stop the server Deployment to prevent it from writing to the
# database during the revert
kubectl scale deploy/certctl-server -n <namespace> --replicas=0
# 3. Apply the relevant *.down.sql files manually, one at a time, in
# reverse migration-number order. Example for reverting three
# migrations (000102, 000101, 000100):
NEW=000102 # newest migration on the running schema
OLD=000100 # oldest migration to revert (inclusive)
# The 10# prefix forces base 10; bash would otherwise parse the
# zero-padded numbers as octal (000100 would become 64).
for ((n = 10#$NEW; n >= 10#$OLD; n--)); do
  MIG=$(printf '%06d' "$n")
  kubectl exec -i -n <namespace> statefulset/certctl-postgres -- \
    psql --username=certctl --dbname=certctl \
    < migrations/${MIG}_*.down.sql
done
# 4. Manually update the schema_migrations table to reflect the
# reverted state (the migration runner's bookkeeping)
kubectl exec -n <namespace> statefulset/certctl-postgres -- \
  psql --username=certctl --dbname=certctl -c \
  "DELETE FROM schema_migrations WHERE version >= $((10#$OLD));"
# 5. NOW run helm rollback. The server pod will start with a schema
# that matches its code.
helm rollback certctl <revision> -n <namespace> --wait --timeout 5m
The *.down.sql files are tested but only against pristine schemas —
they may not handle every data shape a production database
accumulates. ALWAYS take a backup first; the down-migrations are
a recovery tool, not a transactional contract.
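
One detail worth rehearsing before a real revert is the handling of zero-padded migration numbers: inside $(( )), bash reads 000100 as octal 64, which would make the bookkeeping DELETE far too broad. The sketch below shows a portable way to compute the revert order and the cutoff without touching a cluster. The helper names are hypothetical, and it assumes schema_migrations.version is a numeric column, as the procedure above implies.

```shell
# Strip leading zeros so the shell never treats the value as octal.
strip_zeros() {
  n="${1#"${1%%[!0]*}"}"
  echo "${n:-0}"
}

# Print migration numbers from NEWEST down to OLDEST (inclusive),
# zero-padded to six digits, one per line.
# Usage: revert_order NEWEST OLDEST
revert_order() {
  i="$(strip_zeros "$1")"
  oldest="$(strip_zeros "$2")"
  while [ "$i" -ge "$oldest" ]; do
    printf '%06d\n' "$i"
    i=$((i - 1))
  done
}

# Print the bookkeeping statement that removes OLDEST and everything newer.
# Usage: bookkeeping_sql OLDEST
bookkeeping_sql() {
  echo "DELETE FROM schema_migrations WHERE version >= $(strip_zeros "$1");"
}
```

For example, revert_order 000102 000100 prints 000102, 000101, 000100 on separate lines, and bookkeeping_sql 000100 prints DELETE FROM schema_migrations WHERE version >= 100;. Printing the statement and reviewing it before piping it to psql is a cheap safeguard.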
Procedure: full restore (when revert isn't tractable)
When a down-migration would lose data (drop columns / tables that
hold rows the older code can't read but the newer code populated), a
full restore is the only safe path. This is the procedure described
in
docs/operator/runbooks/disaster-recovery.md.
The summary:
- Stop certctl.
- Take a backup of the CURRENT schema (defense in depth).
- Restore the LAST backup taken BEFORE the bad upgrade.
- Roll the Helm release back to the matching code version.
- Restart certctl.
- Re-run any audited writes that happened in the window between the backup and the bad upgrade (read the audit log; the API surface is recoverable).
The DR runbook owns the canonical commands.
Common pitfalls
- Forgetting the backup before rollback. A schema-revert path is not safe without a fresh backup. If something goes wrong mid-revert and your most recent backup is from last night, you've lost any cert-issuance history between then and now.
- Rolling back the chart without rolling back the database state on a release that included a destructive migration (drop column, drop table). Symptoms: old code starts, queries fail with "column does not exist," server crashes in a loop. Recovery requires schema revert OR full restore.
- Letting the agents drift. helm rollback updates the agent DaemonSet's image too; agents on different versions than the server may produce incompatible CSR payloads. After rollback, confirm agent images are at the matching version via kubectl get daemonset certctl-agent -n <namespace> -o jsonpath='{.spec.template.spec.containers[0].image}'.
- GHCR images pinned by digest: the rollback restores the prior image: value from the Helm template. If your operator workflow uses image.digest pinning, the digest comes back too; make sure that digest still exists on ghcr.io. They do persist (old tags are never deleted), but a private mirror may have garbage-collected them.
Related reading
- docs/operator/runbooks/postgres-backup.md: the backup procedure that's the precondition for any schema-revert path.
- docs/operator/runbooks/disaster-recovery.md: the full restore procedure when rollback isn't tractable.
- docs/migration/api-keys-to-rbac.md: example of a migration that the runtime supports rolling back via feature flag (rare).