mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 19:51:33 +00:00

shankar0123 d6f4d5c5e8 deploy(helm): close Phase 4 — chart surface + DR + ops runbooks

Phase 4 of the certctl architecture diligence remediation closure.
Seven findings, all in deploy/helm/certctl/.

DEPL-H2 (High) — ship deploy/helm/certctl/templates/backup-cronjob.yaml
  Operator opt-in via backup.enabled=true. Default OFF. CronJob runs
  pg_dump --format=custom --no-owner --no-acl --dbname=certctl
  matching the canonical shape in
  docs/operator/runbooks/postgres-backup.md (so manual and
  automated dumps are byte-identical). Sink: PVC (default) OR S3
  via aws-cli. Documented as in-cluster-Postgres only — managed DB
  deployments rely on their provider's PITR.

DEPL-M1 (Med) — Helm pre-install/pre-upgrade migration hook
  deploy/helm/certctl/templates/migration-job.yaml — runs
  `certctl-server --migrate-only` before the server Deployment
  rolls. The --migrate-only flag (new in cmd/server/main.go) is a
  hermetic schema-mutation pass: load config, open DB pool, run
  RunMigrations + RunSeed, exit 0. No HTTP listener, no scheduler,
  no signing setup.

  Server's boot-time RunMigrations call is now gated on
  CERTCTL_MIGRATIONS_VIA_HOOK — when set true, the server skips
  the boot path (the hook owns the work). Default still runs at
  boot, so Compose / VM / bare-metal deploys are unchanged.

  migrations.viaHook: false in values.yaml (off by default).

DEPL-M4 (Med) — explicit Postgres StatefulSet strategy fields
  deploy/helm/certctl/templates/postgres-statefulset.yaml adds:
    spec.updateStrategy.type: OnDelete
    spec.podManagementPolicy: OrderedReady
  Operator-controlled Postgres upgrades (the OnDelete strategy
  means a chart template tweak no longer triggers an immediate
  Postgres restart). OrderedReady aligns with the standard
  Postgres-on-Kubernetes pattern for any future HA work.

DEPL-M5 (Med) — per-fleet-size resource ladder documentation
  deploy/helm/certctl/values.yaml — extended comments next to
  server.resources + agent.resources documenting:
    "≤ 500 certs / 100 agents" → defaults are validated
    "5K certs / 1K agents" → starter suggestions, TBD Phase 8
    "50K certs / 10K agents" → starter suggestions, TBD Phase 8
  Numbers for the small-fleet case derive from the measured
  baselines in docs/operator/performance-baselines.md
  (50ms p50, < 3s for 1000-cert inventory walk, etc.). Larger
  fleet numbers explicitly marked TBD pending Phase 8 load-test
  runs — operators tune empirically until then.

DEPL-L1 (Low) — Helm rollback runbook
  docs/operator/runbooks/rollback.md — covers helm rollback
  mechanics, the schema-migration manual-cleanup path (when
  *.down.sql files apply vs. when full restore is the only safe
  path), and the per-migration-class safe-to-rollback table.

DEPL-L2 (Low) — Prometheus AlertManager rules
  deploy/helm/certctl/templates/prometheusrules.yaml — opt-in via
  monitoring.prometheusRules.enabled=true. Default OFF. Four
  starter rules using verified metric names from
  internal/api/handler/metrics.go:
    CertctlCertificateExpiringSoon (certctl_certificate_expiring_soon)
    CertctlAgentOffline ((agent_total - agent_online) > 0 for 1h)
    CertctlJobFailureRateHigh (failure rate over 5% for 15m)
    CertctlIssuanceFailures (any failures over 15m window)
  All thresholds operator-tunable via
  monitoring.prometheusRules.thresholds.* in values.

DEPL-L3 (Low) — Prometheus bearer-token setup runbook
  docs/operator/runbooks/prometheus-bearer-token.md — documents
  the API-key + Secret + values wiring for the RBAC-gated
  /api/v1/metrics/prometheus scrape endpoint. End-to-end
  procedure with troubleshooting steps + rotation guide.

CI guard: scripts/ci-guards/helm-templates-lint.sh
  Six-combo matrix: defaults / backup PVC / backup S3 /
  prometheusRules / migrations.viaHook / all-on. Each runs helm
  template + checks render success. helm lint also gated.
  Wired into the auto-pickup loop in .github/workflows/ci.yml;
  azure/setup-helm@b9e51907 (v4.3.0, SHA-pinned per Phase 1
  RED-2) installs helm v3.16.0 on the runner.

Verification (all pass):
  ls deploy/helm/certctl/templates/{backup-cronjob,migration-job,prometheusrules}.yaml
  grep -E 'updateStrategy|podManagementPolicy' deploy/helm/certctl/templates/postgres-statefulset.yaml  # 2 matches
  helm template deploy/helm/certctl/ --set backup.enabled=true \
    --set monitoring.prometheusRules.enabled=true --set migrations.viaHook=true \
    | grep -E "kind: (CronJob|PrometheusRule|Job)"  # 3 matches
  helm lint deploy/helm/certctl/  # 0 failed
  ls docs/operator/runbooks/{rollback,prometheus-bearer-token}.md
  bash scripts/ci-guards/helm-templates-lint.sh  # 6/6 matrix combinations pass

Go build clean (cmd/server compiles, migrate-only path verified by
the build target). YAML validated.

Closes: cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H2
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M1
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M4
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M5
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L1
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L2
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L3

2026-05-14 00:58:00 +00:00


Runbook: Helm rollback for certctl

Last reviewed: 2026-05-14

Use this when:

A helm upgrade rolled out a bad release and the operator wants to return to the previous working state.
A schema migration shipped a change the operator wants to back out.
An emergency change needs reverting and forward-fix isn't yet available.

This page covers helm rollback mechanics + the cases where rollback is NOT enough on its own (schema migrations are the main one).

What `helm rollback` does

helm rollback <release> [revision] re-applies the manifests from a previous Helm revision. It re-creates / updates Kubernetes objects to match that revision's template output and is safe for:

Deployment image bumps: rolls the container image back to the previous tag. Pods restart with the old image.
ConfigMap / Secret content changes: old values land in the config; pods that consume them via envFrom or volume mounts get the prior values on the next restart.
Resource requests / limits / replica count: the spec changes back to the prior values. Kubernetes reschedules pods accordingly.
Service / Ingress / NetworkPolicy changes: networking flips back to the previous shape immediately.

What `helm rollback` does NOT do

The Kubernetes layer is reversible; the database schema is not. This is the single most common gap in a rollback plan.

Schema migrations are forward-only by design

certctl's migrations under migrations/ are numbered up-migrations (NNNNNN_*.up.sql) with paired down-migrations (NNNNNN_*.down.sql) shipped alongside. The postgres.RunMigrations path applied at server boot only runs the *.up.sql files. The *.down.sql files exist for development reference + a hypothetical "surgical revert" path but are not invoked by helm rollback.

The implication: if v2.1.0 → v2.2.0 ships migrations 000100, 000101, 000102 (adding columns, changing constraints, dropping indexes), then helm rollback to v2.1.0 takes you back to the v2.1.0 container image — but the database still has migrations 000100-102 applied. The v2.1.0 server code doesn't know about those columns; it either ignores them (best case) or fails to start (if the schema diverged in a way the older code can't tolerate).

When is rollback safe without a schema revert?

Migrations are additive-only in 90%+ of cases. The categories:

Migration class	Safe to roll back without schema revert?	Why
Add column with default	Yes	Old code ignores the new column
Add table	Yes	Old code doesn't reference the table
Add index	Yes	Old code doesn't depend on the index existing
Add CHECK / FOREIGN KEY constraint	Usually yes	Fails only if the old code writes rows that violate the new constraint
Rename column / table	NO	Old code's queries reference the original name
Drop column / table	NO (data loss)	New code already stopped writing the column; old code expects it
Type change (`VARCHAR(40)` → `TEXT`)	Usually yes	Old code's column read still works
Backfill a column	Yes	Old code ignores the backfilled value

If your upgrade only added columns / tables / indexes, helm rollback is sufficient. If it renamed or dropped anything, you need a database-level revert.
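One way to apply the table above before picking a procedure is to scan the up-migrations in the upgrade range for DROP or RENAME statements. The following is a minimal sketch, not part of certctl: the function name is hypothetical, it assumes the NNNNNN_*.up.sql naming described earlier, and the pattern is a deliberately coarse heuristic that also flags harmless statements such as DROP INDEX, so treat its output as a list of files to read rather than a verdict.

```shell
# Prints every up-migration in the given range that contains a DROP or
# RENAME statement, so it can be reviewed before choosing a procedure.
# Returns 0 when nothing is flagged, 1 otherwise.
# Assumption: files are named <dir>/NNNNNN_<name>.up.sql.
find_destructive_migrations() {
  fdm_dir=$1
  shift
  fdm_flagged=0
  for fdm_mig in "$@"; do
    for fdm_file in "$fdm_dir"/"$fdm_mig"_*.up.sql; do
      [ -e "$fdm_file" ] || continue   # glob matched nothing
      if grep -Eiq '(^|[^[:alnum:]_])(DROP|RENAME)[[:space:]]' "$fdm_file"; then
        echo "$fdm_file"
        fdm_flagged=1
      fi
    done
  done
  return "$fdm_flagged"
}

# Usage:
#   find_destructive_migrations migrations 000100 000101 000102 \
#     && echo "additive only: standard rollback" \
#     || echo "review flagged files: schema revert or restore may be needed"
```

An empty result suggests the standard rollback is enough; any flagged file is a reason to read the migration and consider the schema-revert or full-restore procedures.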

Procedure: standard rollback (additive-only migrations)

# 1. Identify the target revision
helm history certctl -n <namespace>

# 2. Take a backup BEFORE rolling back (defense in depth — if
#    rollback exposes a data corruption issue, restore is the only
#    path back)
#    See docs/operator/runbooks/postgres-backup.md for the canonical
#    pg_dump invocation.

# 3. Roll back to the chosen revision
helm rollback certctl <revision> -n <namespace> --wait --timeout 5m

# 4. Verify
kubectl get pods -n <namespace> -l app.kubernetes.io/instance=certctl
kubectl logs -n <namespace> -l app.kubernetes.io/component=server --tail=50

Watch for migration-version mismatch warnings in the server logs. If the older server code refuses to start because the schema is ahead of what it knows about, escalate to "rollback with schema revert."
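Step 1 above means reading the helm history table by eye. If the target is "the last revision that ran a known-good chart", a small filter can pick it out. This is a sketch under assumptions: the function name is hypothetical, the chart name certctl-2.1.0 in the usage line is illustrative, and it relies on the default table output where the revision number is the first column and the chart name appears as a single whitespace-delimited field.

```shell
# Reads `helm history` table output on stdin and prints the highest
# revision number whose row contains the given chart name as a field.
# Prints nothing if no row matches.
latest_revision_for_chart() {
  awk -v chart="$1" '
    NR > 1 {                      # skip the header row
      for (i = 2; i <= NF; i++) {
        if ($i == chart) { rev = $1; break }
      }
    }
    END { if (rev != "") print rev }
  '
}

# Usage (chart name is illustrative):
#   helm history certctl -n <namespace> | latest_revision_for_chart certctl-2.1.0
```

The printed number can then be passed as <revision> in step 3. It is only a convenience; confirm the revision against the full history output before rolling back.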

Procedure: rollback with schema revert

This is the rare case. Use it when:

A column / table was renamed or dropped in the release being rolled back.
The older code refuses to start with the newer schema.

# 1. Take a fresh backup right NOW (the current schema is what we're
#    reverting from; if anything goes wrong we want a clean
#    forward-recovery option)
kubectl exec -n <namespace> statefulset/certctl-postgres -- \
  pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
  > "certctl-pre-rollback-$(date -u +%Y%m%dT%H%M%SZ).dump"

# 2. Stop the server Deployment to prevent it from writing to the
#    database during the revert
kubectl scale deploy/certctl-server -n <namespace> --replicas=0

# 3. Apply the relevant *.down.sql files manually, one at a time, in
#    reverse migration-number order. Example for reverting three
#    migrations (000102, 000101, 000100):
NEW=000102  # newest migration on the running schema
OLD=000100  # oldest migration to revert (inclusive)
for MIG in 000102 000101 000100; do  # every migration from NEW down to OLD
  kubectl exec -i -n <namespace> statefulset/certctl-postgres -- \
    psql --username=certctl --dbname=certctl \
    < migrations/${MIG}_*.down.sql
done

# 4. Manually update the schema_migrations table to reflect the
#    reverted state (the migration runner's bookkeeping). Compare
#    against ${OLD} directly: shell arithmetic such as $((OLD - 1))
#    reads the leading zeros in 000100 as octal (64, not 100).
kubectl exec -n <namespace> statefulset/certctl-postgres -- \
  psql --username=certctl --dbname=certctl -c \
  "DELETE FROM schema_migrations WHERE version >= ${OLD};"

# 5. NOW run helm rollback. The server pod will start with a schema
#    that matches its code.
helm rollback certctl <revision> -n <namespace> --wait --timeout 5m

The *.down.sql files are tested but only against pristine schemas — they may not handle every data shape a production database accumulates. ALWAYS take a backup first; the down-migrations are a recovery tool, not a transactional contract.
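For longer ranges, the reverse-order list in step 3 can be generated instead of typed by hand. Migration numbers such as 000100 carry leading zeros, which shell arithmetic reads as octal, so the sketch below strips them before counting. Both function names are hypothetical, and it assumes six-digit, gap-free migration numbers; a number with no matching *.down.sql file still needs handling by the caller.

```shell
# Strips leading zeros so shell arithmetic treats the value as decimal
# (a bare 000100 would be parsed as octal 100, i.e. 64).
to_decimal() {
  td_n=${1#"${1%%[!0]*}"}
  echo "${td_n:-0}"
}

# Prints six-digit migration numbers from newest down to oldest, inclusive.
migrations_to_revert() {
  mtr_n=$(to_decimal "$1")
  mtr_oldest=$(to_decimal "$2")
  while [ "$mtr_n" -ge "$mtr_oldest" ]; do
    printf '%06d\n' "$mtr_n"
    mtr_n=$((mtr_n - 1))
  done
}

# Usage:
#   for MIG in $(migrations_to_revert "$NEW" "$OLD"); do
#     ls migrations/${MIG}_*.down.sql   # confirm the file exists first
#   done
```

Listing the files before piping any of them into psql gives one more chance to catch a missing or unexpected down-migration.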

Procedure: full restore (when revert isn't tractable)

When a down-migration would lose data (drop columns / tables that hold rows the older code can't read but the newer code populated), a full restore is the only safe path. This is the procedure described in docs/operator/runbooks/disaster-recovery.md. The summary:

Stop certctl.
Take a backup of the CURRENT schema (defense in depth).
Restore the LAST backup taken BEFORE the bad upgrade.
Roll the Helm release back to the matching code version.
Restart certctl.
Re-run any audited writes that happened in the window between the backup and the bad upgrade (read the audit log; the API surface is recoverable).

The DR runbook owns the canonical commands.

Common pitfalls

Forgetting the backup before rollback. A schema-revert path is not safe without a fresh backup. If something goes wrong mid-revert and your most recent backup is from last night, you've lost any cert-issuance history between then and now.
Rolling back the chart without rolling back the database state on a release that included a destructive migration (drop column, drop table). Symptoms: old code starts, queries fail with "column does not exist," server crashes in a loop. Recovery requires schema revert OR full restore.
Letting the agents drift. helm rollback updates the agent DaemonSet's image too; agents on a different version than the server may produce incompatible CSR payloads. After rollback, confirm agent images are at the matching version via kubectl get daemonset certctl-agent -n <namespace> -o jsonpath='{.spec.template.spec.containers[0].image}'.
GHCR images pinned by digest: the rollback restores the prior image: value from the Helm template. If your operator workflow uses image.digest pinning, the digest comes back too, so make sure that digest still exists in the registry you pull from. Images on ghcr.io persist because old tags are never deleted, but a private mirror may have garbage-collected them.
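The agent-drift check above can be turned into a comparison of image tags. The sketch below is illustrative only: the function names are hypothetical, and it assumes both images are referenced by tag. It compares tags rather than digests because server and agent are different images, so their digests differ even at the same version.

```shell
# Extracts the tag from an image reference such as
# registry.example/org/name:v2.1.0. Prints an empty line if there is no tag.
image_tag() {
  it_ref=${1##*/}        # drop registry host and repository path
  it_ref=${it_ref%%@*}   # drop any @sha256:... digest suffix
  case "$it_ref" in
    *:*) echo "${it_ref#*:}" ;;
    *)   echo "" ;;
  esac
}

# Succeeds only when both references carry the same non-empty tag.
versions_match() {
  vm_server=$(image_tag "$1")
  vm_agent=$(image_tag "$2")
  [ -n "$vm_server" ] && [ "$vm_server" = "$vm_agent" ]
}

# Usage (resource names taken from the commands in this runbook):
#   server=$(kubectl get deploy certctl-server -n <namespace> \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
#   agent=$(kubectl get daemonset certctl-agent -n <namespace> \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
#   versions_match "$server" "$agent" || echo "server/agent version drift"
```

References pinned only by digest have no tag to compare, so the check reports a mismatch for them and the versions need confirming by hand.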

docs/operator/runbooks/postgres-backup.md — the backup procedure that's the precondition for any schema-revert path.
docs/operator/runbooks/disaster-recovery.md — the full restore procedure when rollback isn't tractable.
docs/migration/api-keys-to-rbac.md — example of a migration that the runtime supports rolling back via feature flag (rare).
