mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 17:22:07 +00:00
d6f4d5c5e8
Phase 4 of the certctl architecture diligence remediation closure.
Seven findings, all in deploy/helm/certctl/.
DEPL-H2 (High) — ship deploy/helm/certctl/templates/backup-cronjob.yaml
Operator opt-in via backup.enabled=true. Default OFF. CronJob runs
pg_dump --format=custom --no-owner --no-acl --dbname=certctl
matching the canonical shape in
docs/operator/runbooks/postgres-backup.md (so manual and
automated dumps use the same invocation and format). Sink: PVC (default) OR S3
via aws-cli. Documented as in-cluster-Postgres only — managed DB
deployments rely on their provider's PITR.
DEPL-M1 (Med) — Helm pre-install/pre-upgrade migration hook
deploy/helm/certctl/templates/migration-job.yaml — runs
`certctl-server --migrate-only` before the server Deployment
rolls. The --migrate-only flag (new in cmd/server/main.go) is a
hermetic schema-mutation pass: load config, open DB pool, run
RunMigrations + RunSeed, exit 0. No HTTP listener, no scheduler,
no signing setup.
Server's boot-time RunMigrations call is now gated on
CERTCTL_MIGRATIONS_VIA_HOOK — when set true, the server skips
the boot path (the hook owns the work). Default still runs at
boot, so Compose / VM / bare-metal deploys are unchanged.
migrations.viaHook: false in values.yaml (off by default).
DEPL-M4 (Med) — explicit Postgres StatefulSet strategy fields
deploy/helm/certctl/templates/postgres-statefulset.yaml adds:
spec.updateStrategy.type: OnDelete
spec.podManagementPolicy: OrderedReady
Operator-controlled Postgres upgrades (the OnDelete strategy
means a chart template tweak no longer triggers an immediate
Postgres restart). OrderedReady aligns with the standard
Postgres-on-Kubernetes pattern for any future HA work.
DEPL-M5 (Med) — per-fleet-size resource ladder documentation
deploy/helm/certctl/values.yaml — extended comments next to
server.resources + agent.resources documenting:
"≤ 500 certs / 100 agents" → defaults are validated
"5K certs / 1K agents" → starter suggestions, TBD Phase 8
"50K certs / 10K agents" → starter suggestions, TBD Phase 8
Numbers for the small-fleet case derive from the measured
baselines in docs/operator/performance-baselines.md
(50ms p50, < 3s for 1000-cert inventory walk, etc.). Larger
fleet numbers explicitly marked TBD pending Phase 8 load-test
runs — operators tune empirically until then.
DEPL-L1 (Low) — Helm rollback runbook
docs/operator/runbooks/rollback.md — covers helm rollback
mechanics, the schema-migration manual-cleanup path (when
*.down.sql files apply vs. when full restore is the only safe
path), and the per-migration-class safe-to-rollback table.
DEPL-L2 (Low) — Prometheus AlertManager rules
deploy/helm/certctl/templates/prometheusrules.yaml — opt-in via
monitoring.prometheusRules.enabled=true. Default OFF. Four
starter rules using verified metric names from
internal/api/handler/metrics.go:
CertctlCertificateExpiringSoon (certctl_certificate_expiring_soon)
CertctlAgentOffline ((agent_total - agent_online) > 0 for 1h)
CertctlJobFailureRateHigh (failure rate over 5% for 15m)
CertctlIssuanceFailures (any failures over 15m window)
All thresholds operator-tunable via
monitoring.prometheusRules.thresholds.* in values.
DEPL-L3 (Low) — Prometheus bearer-token setup runbook
docs/operator/runbooks/prometheus-bearer-token.md — documents
the API-key + Secret + values wiring for the RBAC-gated
/api/v1/metrics/prometheus scrape endpoint. End-to-end
procedure with troubleshooting steps + rotation guide.
CI guard: scripts/ci-guards/helm-templates-lint.sh
Six-combo matrix: defaults / backup PVC / backup S3 /
prometheusRules / migrations.viaHook / all-on. Each runs helm
template + checks render success. helm lint also gated.
Wired into the auto-pickup loop in .github/workflows/ci.yml;
azure/setup-helm@b9e51907 (v4.3.0, SHA-pinned per Phase 1
RED-2) installs helm v3.16.0 on the runner.
Verification (all pass):
ls deploy/helm/certctl/templates/{backup-cronjob,migration-job,prometheusrules}.yaml
grep -E 'updateStrategy|podManagementPolicy' deploy/helm/certctl/templates/postgres-statefulset.yaml # 2 matches
helm template deploy/helm/certctl/ --set backup.enabled=true \
--set monitoring.prometheusRules.enabled=true --set migrations.viaHook=true \
| grep -E "kind: (CronJob|PrometheusRule|Job)" # 3 matches
helm lint deploy/helm/certctl/ # 0 failed
ls docs/operator/runbooks/{rollback,prometheus-bearer-token}.md
bash scripts/ci-guards/helm-templates-lint.sh # 6/6 matrix combinations pass
Go build clean (cmd/server compiles, migrate-only path verified by
the build target). YAML validated.
Closes: cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H2
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M1
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M4
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M5
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L1
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L2
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L3
194 lines
8.4 KiB
Markdown
# Runbook: Helm rollback for certctl

> Last reviewed: 2026-05-14

Use this when:

- A `helm upgrade` rolled out a bad release and the operator wants to
  return to the previous working state.
- A schema migration shipped a change the operator wants to back out.
- An emergency change needs reverting and a forward-fix isn't yet
  available.

This page covers `helm rollback` mechanics and the cases where
rollback is NOT enough on its own (schema migrations are the main
one).

## What `helm rollback` does

`helm rollback <release> [revision]` re-applies the manifests from a
previous Helm revision. It re-creates / updates Kubernetes objects to
match that revision's template output and is safe for:

- **Deployment image bumps:** rolls the container image back to the
  previous tag. Pods restart with the old image.
- **ConfigMap / Secret content changes:** old values land in the
  config; pods that consume them via `envFrom` or volume mounts get
  the prior values on the next restart.
- **Resource requests / limits / replica count:** the spec changes
  back to the prior values. Kubernetes reschedules pods accordingly.
- **Service / Ingress / NetworkPolicy changes:** networking flips
  back to the previous shape immediately.

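Before rolling back, it can help to see exactly which objects will change. The following is a minimal sketch, assuming the release is named `certctl` as in the commands on this page. `rollback_preview` is a hypothetical helper name, and the revision numbers come from `helm history`.

```bash
# Hypothetical helper: diff the manifests of the current revision against
# the rollback target. Lines starting with "-" disappear after the
# rollback; lines starting with "+" come back.
rollback_preview() {
  local target="$1" current="$2" ns="$3" tmp
  tmp="$(mktemp -d)" || return 1
  helm get manifest certctl --revision "$current" -n "$ns" > "$tmp/current.yaml"
  helm get manifest certctl --revision "$target" -n "$ns" > "$tmp/target.yaml"
  # diff exits 1 when the files differ; that is the expected case here.
  diff -u "$tmp/current.yaml" "$tmp/target.yaml" || true
  rm -rf "$tmp"
}
```

For example, `rollback_preview 6 7 <namespace>` compares revision 7 (current) with revision 6 (the rollback target). The diff covers only the Kubernetes manifests; it says nothing about the database schema.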
## What `helm rollback` does NOT do

The Kubernetes layer is reversible; the **database schema is not**.
This is the single most common gap in a rollback plan.

### Schema migrations are forward-only by design

certctl's migrations under `migrations/` are numbered up-migrations
(`NNNNNN_*.up.sql`) with paired down-migrations
(`NNNNNN_*.down.sql`) shipped alongside. The `postgres.RunMigrations`
path applied at server boot only runs the `*.up.sql` files. The
`*.down.sql` files exist for development reference and a hypothetical
"surgical revert" path but are **not invoked by `helm rollback`**.

The implication: if `v2.1.0 → v2.2.0` ships migrations 000100,
000101, 000102 (adding columns, changing constraints, dropping
indexes), then `helm rollback` to v2.1.0 takes you back to the v2.1.0
container image, but the database still has migrations 000100 through
000102 applied. The v2.1.0 server code doesn't know about those
changes; it either ignores them (best case) or fails to start (if the
schema diverged in a way the older code can't tolerate).

### When is rollback safe without a schema revert?

Migrations are **additive-only** in 90%+ of cases. The categories:

| Migration class | Safe to roll back without schema revert? | Why |
|---|---|---|
| Add column with default | Yes | Old code ignores the new column |
| Add table | Yes | Old code doesn't reference the table |
| Add index | Yes | Old code doesn't depend on the index existing |
| Add CHECK / FOREIGN KEY constraint | Usually yes | Fails only if the old code writes rows that violate the new constraint |
| Rename column / table | NO | Old code's queries reference the original name |
| Drop column / table | NO (data loss) | New code already stopped writing the column; old code expects it |
| Type change (`VARCHAR(40)` → `TEXT`) | Usually yes | Old code's column read still works |
| Backfill a column | Yes | Old code ignores the backfilled value |

If your upgrade only added columns / tables / indexes, `helm
rollback` is sufficient. If it renamed or dropped anything, you need
a database-level revert.

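The table can be turned into a rough pre-flight check. This is a heuristic sketch, assuming the `migrations/NNNNNN_*.up.sql` layout described above; `classify_migration` and `classify_all` are hypothetical helper names. A plain `grep` is no SQL parser (it also matches keywords inside comments), so treat the output as a prompt for manual review rather than a verdict.

```bash
# Hypothetical helper: flag up-migrations containing statements that the
# table above marks as unsafe to roll back past (renames and drops).
classify_migration() {
  if grep -Eiq 'drop[[:space:]]+(table|column)|(^|[[:space:]])rename[[:space:]]' "$1"; then
    echo "needs-revert"
  else
    echo "additive"
  fi
}

# Classify every up-migration in a directory (defaults to migrations/).
classify_all() {
  local dir="${1:-migrations}" f
  for f in "$dir"/*.up.sql; do
    [ -e "$f" ] || continue
    printf '%s\t%s\n' "$(classify_migration "$f")" "$f"
  done
}
```

Running `classify_all migrations` against the checkout of the release being rolled back gives one line per migration. Anything reported as `needs-revert` deserves a read before choosing between the standard procedure and a schema revert.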
## Procedure: standard rollback (additive-only migrations)

```bash
# 1. Identify the target revision
helm history certctl -n <namespace>

# 2. Take a backup BEFORE rolling back (defense in depth: if
#    rollback exposes a data corruption issue, restore is the only
#    path back)
#    See docs/operator/runbooks/postgres-backup.md for the canonical
#    pg_dump invocation.

# 3. Roll back to the chosen revision
helm rollback certctl <revision> -n <namespace> --wait --timeout 5m

# 4. Verify
kubectl get pods -n <namespace> -l app.kubernetes.io/instance=certctl
kubectl logs -n <namespace> -l app.kubernetes.io/component=server --tail=50
```

Watch for migration-version mismatch warnings in the server logs. If
the older server code refuses to start because the schema is ahead
of what it knows about, escalate to "rollback with schema revert."

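The verification step can be wrapped in a small helper. This is a sketch, assuming the `certctl-server` Deployment name and the `app.kubernetes.io/component=server` label used elsewhere on this page; `verify_rollback` is a hypothetical name. This runbook does not specify the wording of the mismatch warning, so the filter matches loosely on `migrat` and may also surface harmless lines.

```bash
# Hypothetical helper: wait for the rolled-back Deployment, then surface
# migration-related log lines for a human to read.
verify_rollback() {
  local ns="$1"
  # Block until the Deployment reports all replicas ready, or give up.
  kubectl rollout status deployment/certctl-server -n "$ns" --timeout=120s || return 1
  # Loose, case-insensitive match; prints a note when nothing matches.
  kubectl logs -n "$ns" -l app.kubernetes.io/component=server --tail=200 \
    | grep -i 'migrat' || echo "no migration-related log lines in the last 200"
}
```

A non-zero return means the rollout itself never became healthy, which is the signal to consider the schema-revert procedure.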
## Procedure: rollback with schema revert

This is the rare case. Use it when:
- A column / table was renamed or dropped in the release being rolled back.
- The older code refuses to start with the newer schema.

```bash
# 1. Take a fresh backup right NOW (the current schema is what we're
#    reverting from; if anything goes wrong we want a clean
#    forward-recovery option)
kubectl exec -n <namespace> statefulset/certctl-postgres -- \
  pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
  > "certctl-pre-rollback-$(date -u +%Y%m%dT%H%M%SZ).dump"

# 2. Stop the server Deployment to prevent it from writing to the
#    database during the revert
kubectl scale deploy/certctl-server -n <namespace> --replicas=0

# 3. Apply the relevant *.down.sql files manually, one at a time, in
#    reverse migration-number order. Example for reverting three
#    migrations:
NEW=000102   # newest migration on the running schema
OLD=000100   # oldest migration to revert (inclusive)
for MIG in 000102 000101 000100; do   # NEW down to OLD
  # ON_ERROR_STOP makes psql abort on the first SQL error instead of
  # carrying on. The glob must match exactly one file, or bash rejects
  # the redirect as ambiguous.
  kubectl exec -i -n <namespace> statefulset/certctl-postgres -- \
    psql --username=certctl --dbname=certctl --set ON_ERROR_STOP=1 \
    < migrations/${MIG}_*.down.sql || break
done

# 4. Manually update the schema_migrations table to reflect the
#    reverted state (the migration runner's bookkeeping). The 10#
#    prefix forces base 10; bash would otherwise read the leading
#    zeros in OLD as octal and delete far more rows than intended.
kubectl exec -n <namespace> statefulset/certctl-postgres -- \
  psql --username=certctl --dbname=certctl -c \
  "DELETE FROM schema_migrations WHERE version >= $((10#$OLD));"

# 5. NOW run helm rollback. The server pod will start with a schema
#    that matches its code.
helm rollback certctl <revision> -n <namespace> --wait --timeout 5m
```

The `*.down.sql` files are tested, but only against pristine schemas;
they may not handle every data shape a production database
accumulates. ALWAYS take a backup first; the down-migrations are
a recovery tool, not a transactional contract.

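For longer ranges, hard-coding the migration list gets error-prone, and computing it needs care: bash arithmetic reads numbers with a leading zero as octal, so `$((000100 - 1))` evaluates to 63 rather than 99, and a number such as `000108` is rejected as invalid octal. The sketch below sidesteps that by normalising through `expr`, which parses its operands as decimal. `reverse_migrations` is a hypothetical helper name, and the six-digit padding follows the `NNNNNN_*` file naming described on this page.

```bash
# Hypothetical helper: print migration numbers from $1 (newest) down to
# $2 (oldest, inclusive), zero-padded to six digits.
reverse_migrations() {
  local n old
  # expr treats its operands as decimal, so leading zeros are harmless.
  n="$(expr "$1" + 0)"
  old="$(expr "$2" + 0)"
  while [ "$n" -ge "$old" ]; do
    printf '%06d\n' "$n"
    n=$((n - 1))
  done
}

# Dry run: list the down-migration files in the order they would be applied.
for MIG in $(reverse_migrations 000102 000100); do
  echo "would apply: migrations/${MIG}_*.down.sql"
done
```

Reviewing the dry-run output before touching the database is a cheap way to confirm both the range and the order.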
## Procedure: full restore (when revert isn't tractable)

When a down-migration would lose data (drop columns / tables that
hold rows the older code can't read but the newer code populated), a
full restore is the only safe path. This is the procedure described
in
[`docs/operator/runbooks/disaster-recovery.md`](disaster-recovery.md#postgres-restore).
The summary:

1. Stop certctl.
2. Take a backup of the CURRENT schema (defense in depth).
3. Restore the LAST backup taken BEFORE the bad upgrade.
4. Roll the Helm release back to the matching code version.
5. Restart certctl.
6. Re-run any audited writes that happened in the window between that
   backup and the restore (read the audit log; the API surface
   is recoverable).

The DR runbook owns the canonical commands.

## Common pitfalls

- **Forgetting the backup before rollback.** A schema-revert path is
  not safe without a fresh backup. If something goes wrong mid-revert
  and your most recent backup is from last night, you've lost any
  cert-issuance history between then and now.
- **Rolling back the chart without rolling back the database state**
  on a release that included a destructive migration (drop column,
  drop table). Symptoms: old code starts, queries fail with
  "column does not exist," server crashes in a loop. Recovery
  requires schema revert OR full restore.
- **Letting the agents drift.** `helm rollback` updates the agent
  DaemonSet's image too; agents on different versions than the
  server may produce incompatible CSR payloads. After rollback,
  confirm agent images are at the matching version via
  `kubectl get daemonset certctl-agent -n <namespace> -o jsonpath='{.spec.template.spec.containers[0].image}'`.
- **GHCR images pinned by digest.** The rollback restores the prior
  `image:` value from the Helm template. If your operator workflow
  uses `image.digest` pinning, the digest comes back too, so make
  sure that digest still exists wherever the cluster pulls from. On
  ghcr.io the images persist (old tags are never deleted), but a
  private mirror may have garbage-collected them.

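The agent-drift check can be scripted. This is a minimal sketch, assuming the `certctl-server` Deployment and `certctl-agent` DaemonSet names that appear on this page; `images_match` is a hypothetical helper name. It compares only the text after the last `:` in each image reference, so it is meaningful for tag-pinned images and not for digest-pinned ones.

```bash
# Hypothetical helper: succeed only when the server Deployment and the
# agent DaemonSet run the same image tag.
images_match() {
  local ns="$1" jp='{.spec.template.spec.containers[0].image}' server agent
  server="$(kubectl get deployment certctl-server -n "$ns" -o jsonpath="$jp")"
  agent="$(kubectl get daemonset certctl-agent -n "$ns" -o jsonpath="$jp")"
  echo "server=${server} agent=${agent}"
  # Strip everything up to the last ":" and compare what remains.
  [ "${server##*:}" = "${agent##*:}" ]
}
```

Calling `images_match <namespace>` after a rollback prints both image references and returns non-zero when the tags differ.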
## Related reading

- [`docs/operator/runbooks/postgres-backup.md`](postgres-backup.md):
  the backup procedure that's the precondition for any
  schema-revert path.
- [`docs/operator/runbooks/disaster-recovery.md`](disaster-recovery.md):
  the full restore procedure when rollback isn't tractable.
- [`docs/migration/api-keys-to-rbac.md`](../../migration/api-keys-to-rbac.md):
  example of a migration that the runtime supports rolling back via
  feature flag (rare).