Sprint 3 unified-master-audit closure. docs/operator/runbooks/postgres-backup.md
sections 110-143 still said 'certctl ships no backup CronJob template
in the Helm chart' and the three sample recipes that followed
included an 'in-cluster Postgres → S3' rollup that the operator
'should roll their own.' But the chart actually DOES ship that
CronJob:
deploy/helm/certctl/templates/backup-cronjob.yaml (Phase 4
DEPL-H2 closure, 2026-05-14) — opt-in via 'backup.enabled: true',
PVC + S3 sinks, pg_dump shape byte-comparable with the manual
command earlier in the runbook.
Operators following the pre-fix runbook would write a duplicate
CronJob from scratch while the working template sat unused under
their nose.
Rewrite of sections 110-143:
- Lead with the shipped CronJob, two install one-liners (PVC + S3).
- Move the recipes-by-topology block down to 'When the bundled
CronJob is NOT the answer' — still call out managed Postgres
(use provider PITR) and bare-VM Postgres (systemd + pg_dump +
restic) as deliberately out-of-scope.
- Add 'Recovery objectives' subsection: RPO ≈ 24h at the default
nightly schedule, RTO ≈ 30-60min from the existing drill steps
further down the page. Tells the reader where the bundled
CronJob fits in their RPO/RTO budget without overpromising
(anything below 24h RPO needs WAL-shipping, which the CronJob
doesn't do).
- Bump '> Last reviewed:' to today.
Closes DEPL-005.
9.0 KiB
Runbook: PostgreSQL backup for certctl
Last reviewed: 2026-05-16
Use this when:
- You're setting up a new certctl deployment and need a backup policy before going to production.
- A buyer or auditor asks "where's the backup automation?" and you need to point at the recommended cadence + procedure.
- You're rotating the encryption key, swapping CAs, or doing any other destructive maintenance and want a snapshot to roll back to.
certctl does not ship a built-in backup daemon. Postgres is the system of record for every piece of certctl state that isn't on the operator's filesystem (CA keys, OCSP responder keys, SCEP/EST trust bundles — see "Operator-managed (NOT in DB)" in the disaster-recovery runbook); backing it up is treated as a standard PostgreSQL operations task that the operator owns end-to-end with their existing tooling.
This page is the recommended recipe.
What to back up
| Layer | Tool | Cadence |
|---|---|---|
certctl database (the row data) |
pg_dump (logical) or pg_basebackup + WAL archive (physical PIT) |
≥ daily, retention ≥ 30d |
CA cert + key (CERTCTL_CA_CERT_PATH, CERTCTL_CA_KEY_PATH) |
Out-of-band file backup (operator's existing secret-management tool) | On change |
| SCEP RA cert + key (per profile) | Out-of-band file backup | On change |
| OCSP responder keys | Out-of-band file backup (CERTCTL_OCSP_RESPONDER_KEY_DIR) |
On change |
| Trust-anchor PEM bundles | Out-of-band file backup | On change |
| Env vars (auth secret, etc.) | Operator's secret-management tool (Vault, AWS Secrets Manager, etc.) | On rotation |
A backup of only the Postgres database without the operator-managed file material is not a complete restore artifact — see the disaster-recovery runbook's Postgres-restore section for the full inventory. The DR runbook owns the restore procedure; this page owns the capture procedure.
Logical backup (recommended for most deployments)
pg_dump -Fc produces a portable compressed dump that's easy to
restore into a fresh Postgres instance at any version ≥ the dump's
source version. Best for deployments where the DB is small enough
that a full logical dump fits the backup window (rough rule of thumb:
under a million managed_certificates rows + corresponding history).
docker-compose
# 1. Snapshot. Run from any host that can reach the postgres container.
TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)
docker compose -f deploy/docker-compose.yml exec -T postgres \
pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
> "certctl-${TIMESTAMP}.dump"
# 2. Verify integrity (catch transport / truncation bugs early).
docker run --rm -v "$PWD:/dumps" -w /dumps postgres:16-alpine \
pg_restore --list "certctl-${TIMESTAMP}.dump" > /dev/null \
&& echo "OK: pg_restore --list parses the dump cleanly" \
|| { echo "CORRUPT DUMP"; exit 1; }
# 3. Move to durable storage (S3, GCS, NFS, encrypted-at-rest blob
# storage of your choice). DO NOT leave the dump on the certctl host
# alone — that defeats the purpose of having a backup.
aws s3 cp "certctl-${TIMESTAMP}.dump" "s3://your-bucket/certctl/"
Kubernetes (with the bundled Helm chart)
# 1. Snapshot via kubectl exec into the postgres StatefulSet pod.
TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)
NAMESPACE=certctl
kubectl exec -n "$NAMESPACE" statefulset/postgres -- \
pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
> "certctl-${TIMESTAMP}.dump"
# 2. Same verification step as above.
# 3. Same off-host storage step as above.
Restore (cross-reference)
The restore procedure lives in disaster-recovery.md § Postgres restore. The key reminders: stop certctl first, restore the DB, run any migrations newer than the snapshot, truncate the CRL + OCSP caches, then restart.
Physical / PITR backup (large fleets, RPO < 1h)
Logical dumps have a coarse RPO (the last successful dump). For deployments where ≤ 1h of cert-issuance history loss is unacceptable, pair Postgres physical backups with continuous WAL archiving:
pg_basebackupfor the initial seedarchive_command = '<your-WAL-archiver>'inpostgresql.confto ship every WAL segment off the host as it closespgbackrestorwal-gfor the operational layer (both are battle-tested, support encryption, and integrate cleanly with S3 / GCS / Azure Blob)
certctl ships nothing in this layer — it's standard PostgreSQL DBA
work, and shipping a bespoke recipe would just be a worse version of
what pgbackrest already does. The
pgbackrest configuration guide
is the authoritative reference.
Automation paths
certctl ships an opt-in Helm CronJob for the in-cluster-Postgres
case (the most common bundled-deploy shape). The template lives at
deploy/helm/certctl/templates/backup-cronjob.yaml and is gated by
backup.enabled in values.yaml. Default OFF; flip it on with one
toggle and a sink choice. For managed Postgres (AWS RDS / GCP Cloud
SQL / Azure DB) the operator relies on the provider's PITR layer;
this CronJob is intentionally scoped to the in-cluster-Postgres path.
Enabling the bundled CronJob
# PVC sink (in-cluster persistent volume — simplest)
helm upgrade --install certctl charts/certctl \
--set backup.enabled=true \
--set backup.sink=pvc \
--set backup.pvc.storageClassName=<your-storage-class> \
--set backup.pvc.size=20Gi \
--set backup.schedule="0 2 * * *"
# S3 sink (off-cluster, recommended for any deploy past the lab)
kubectl create secret generic certctl-backup-aws \
--from-literal=AWS_ACCESS_KEY_ID=AKIA... \
--from-literal=AWS_SECRET_ACCESS_KEY=... \
--namespace certctl
helm upgrade --install certctl charts/certctl \
--set backup.enabled=true \
--set backup.sink=s3 \
--set backup.s3.bucket=my-certctl-backups \
--set backup.s3.region=us-east-1 \
--set backup.s3.credentialsSecret=certctl-backup-aws \
--set backup.schedule="0 2 * * *"
The CronJob runs pg_dump --format=custom --no-owner --no-acl --dbname=certctl (the same shape as the manual command earlier in
this runbook, so a manual dump and a Job dump are byte-comparable)
and ships the artifact to the configured sink. Off-host retention
is the sink's responsibility — S3 lifecycle rules or PVC snapshot
retention on the storage class, not the CronJob.
When the bundled CronJob is NOT the answer
- Managed Postgres (AWS RDS / GCP Cloud SQL / Azure DB). Use the provider's built-in PITR; configure retention ≥ 30 days. The certctl deployment surface is the connection string alone — no CronJob to run.
- Self-hosted Postgres on a VM (no Kubernetes). Use a systemd
timer +
pg_dump+restic(orborgbackup) to encrypted off-host storage. The bundled CronJob has no equivalent on bare VMs. - Already running pgbackrest / wal-g. Keep using it. The bundled CronJob is for the operator who doesn't yet have a backup posture, not a replacement for production-grade WAL-shipping.
Recovery objectives
The bundled CronJob targets the same RPO/RTO that any nightly-dump strategy gives you:
- RPO ≈ 24h at the default
0 2 * * *schedule (you lose at most one day of writes if Postgres burns down). Tighten by running every 6h or 1h; tighten further by switching to WAL-shipping (out of scope for the bundled CronJob). - RTO ≈ 30–60min for the restore drill below — drop the dump into a fresh Postgres instance, point certctl at it, confirm routes return 200. Empirically measured during the disaster-recovery runbook drill.
If your contractual RPO is below 24h, run pgbackrest WAL-shipping alongside (or instead of) the CronJob.
Verification — what to dry-run quarterly
A backup you've never restored is a backup you don't have. Add this to your quarterly on-call rotation:
- Pick the most recent dump from the previous quarter.
- Stand up a throwaway Postgres instance (Docker, kind, anything).
pg_restore -d certctl <the dump>.- Bring up a certctl-server container pointed at the throwaway DB
(
CERTCTL_DATABASE_URL=postgres://certctl:...@throwaway/...). - Confirm
/api/v1/versionreturns 200,/api/v1/certificateslists the expected rows, and the scheduler logs show no migration-version mismatch. - Tear down. Note the timing in your DR registry.
The disaster-recovery runbook covers what to do when this dry-run reveals a gap.
Related reading
docs/operator/runbooks/disaster-recovery.md— the restore companiondocs/operator/secret-custody.md— what the operator-managed file material (CA keys, RA keys, trust anchors) contains, why it lives outside the DB, and what it costs to lose