Mirror of https://github.com/shankar0123/certctl.git (synced 2026-06-07 16:21:30 +00:00)
docs(runbook): DEPL-005 — rewrite postgres-backup automation paths to reference the shipped CronJob
Sprint 3 unified-master-audit closure. In docs/operator/runbooks/postgres-backup.md,
lines 110-143 still said 'certctl ships no backup CronJob template
in the Helm chart' and the three sample recipes that followed
included an 'in-cluster Postgres → S3' rollup that the operator
'should roll their own.' But the chart actually DOES ship that
CronJob:

  deploy/helm/certctl/templates/backup-cronjob.yaml (Phase 4
  DEPL-H2 closure, 2026-05-14): opt-in via 'backup.enabled: true',
  PVC + S3 sinks, pg_dump shape byte-comparable with the manual
  command earlier in the runbook.

Operators following the pre-fix runbook would write a duplicate
CronJob from scratch while the working template sat unused under
their nose.
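
For operators who prefer a values file over `--set` flags, the opt-in could be expressed roughly as follows. This is a minimal sketch and not part of the commit: the key names mirror the `--set` paths used in the rewritten runbook (`backup.enabled`, `backup.sink`, `backup.s3.*`, `backup.schedule`), while the file name `backup-values.yaml`, the `VALUES_FILE` variable, and the bucket, region, and secret names are placeholder assumptions.

```shell
#!/bin/sh
# Sketch: write a Helm values override equivalent to the S3-sink --set flags.
# VALUES_FILE is a hypothetical override for the output path.
values_file=${VALUES_FILE:-backup-values.yaml}

# Nested YAML keys correspond to dotted --set paths (backup.s3.bucket, etc.).
# Bucket, region, and secret name are placeholders; substitute your own.
cat > "$values_file" <<'EOF'
backup:
  enabled: true
  sink: s3
  schedule: "0 2 * * *"
  s3:
    bucket: my-certctl-backups
    region: us-east-1
    credentialsSecret: certctl-backup-aws
EOF

echo "wrote $values_file"

# With helm installed, the file would then be passed with -f, for example:
#   helm upgrade --install certctl <chart-path> -f "$values_file"
```

Keeping the settings in a file makes the backup configuration reviewable in version control, which a one-off `--set` invocation is not.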

Rewrite of lines 110-143:

- Lead with the shipped CronJob, two install one-liners (PVC + S3).
- Move the recipes-by-topology block down to 'When the bundled
  CronJob is NOT the answer'; still call out managed Postgres
  (use provider PITR) and bare-VM Postgres (systemd + pg_dump +
  restic) as deliberately out-of-scope.
- Add 'Recovery objectives' subsection: RPO ≈ 24h at the default
  nightly schedule, RTO ≈ 30-60min from the existing drill steps
  further down the page. Tells the reader where the bundled
  CronJob fits in their RPO/RTO budget without overpromising
  (anything below 24h RPO needs WAL-shipping, which the CronJob
  doesn't do).
- Bump '> Last reviewed:' to today.

Closes DEPL-005.
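
The RPO ≈ 24h figure only holds while the nightly job keeps succeeding. A monitoring check along the following lines could flag a stale backup directory. It is a hedged sketch, not something the chart ships: the function name `backup_is_fresh`, the `*.dump` file pattern, and the idea of inspecting a mounted backup directory are all assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if DIR holds at least one *.dump file
# modified within the last MAX_AGE_HOURS hours, fail otherwise.
backup_is_fresh() {
    dir=$1
    max_age_hours=$2
    max_age_minutes=$((max_age_hours * 60))

    # find's "-mmin -N" matches files modified less than N minutes ago.
    newest=$(find "$dir" -type f -name '*.dump' -mmin "-$max_age_minutes" | head -n 1)

    # A non-empty result means a sufficiently recent dump exists.
    [ -n "$newest" ]
}
```

A caller might run `backup_is_fresh /backups 26 || echo "no dump newer than 26h" >&2`, where `/backups` is a placeholder path and the two extra hours allow for job runtime on top of the 24h schedule.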
--- a/docs/operator/runbooks/postgres-backup.md
+++ b/docs/operator/runbooks/postgres-backup.md
@@ -1,6 +1,6 @@
 # Runbook: PostgreSQL backup for certctl
 
-> Last reviewed: 2026-05-13
+> Last reviewed: 2026-05-16
 
 Use this when:
 - You're setting up a new certctl deployment and need a backup policy
@@ -109,38 +109,76 @@ is the authoritative reference.
 
 ## Automation paths
 
-This is the gap an acquisition reviewer typically wants to see filled.
-certctl ships no backup CronJob template in the Helm chart — the
-operator owns this layer because:
+certctl ships an **opt-in Helm CronJob** for the in-cluster-Postgres
+case (the most common bundled-deploy shape). The template lives at
+`deploy/helm/certctl/templates/backup-cronjob.yaml` and is gated by
+`backup.enabled` in `values.yaml`. Default OFF; flip it on with one
+toggle and a sink choice. For managed Postgres (AWS RDS / GCP Cloud
+SQL / Azure DB) the operator relies on the provider's PITR layer;
+this CronJob is intentionally scoped to the in-cluster-Postgres path.
 
-1. The right tool depends on the deployment topology (in-cluster
-   Postgres vs. managed Postgres vs. self-hosted on a VM).
-2. The right secret-management integration depends on the operator's
-   existing stack (Vault, AWS Secrets Manager, GCP Secret Manager,
-   sealed-secrets, External Secrets).
-3. The right storage backend depends on the operator's existing
-   off-host blob storage.
+### Enabling the bundled CronJob
 
-A bundled CronJob would be a half-answer for any operator with an
-established backup posture, and would have to be torn out before
-production. Three sample recipes that cover the common cases:
+```bash
+# PVC sink (in-cluster persistent volume — simplest)
+helm upgrade --install certctl charts/certctl \
+  --set backup.enabled=true \
+  --set backup.sink=pvc \
+  --set backup.pvc.storageClassName=<your-storage-class> \
+  --set backup.pvc.size=20Gi \
+  --set backup.schedule="0 2 * * *"
 
-- **In-cluster Postgres → S3:** a CronJob running an alpine image with
-  `aws-cli` + the `pg_dump` command above, output piped to
-  `aws s3 cp`. Cosign-signed if your supply-chain policy requires it.
-- **Managed Postgres (AWS RDS / GCP Cloud SQL / Azure DB):** rely on
-  the cloud provider's built-in PITR backup; configure retention
-  ≥ 30 days; the certctl deployment surface is the connection string
-  alone.
-- **Self-hosted VM:** systemd timer + `pg_dump` + `restic` (or
-  `borgbackup`) to encrypted off-host storage.
+# S3 sink (off-cluster, recommended for any deploy past the lab)
+kubectl create secret generic certctl-backup-aws \
+  --from-literal=AWS_ACCESS_KEY_ID=AKIA... \
+  --from-literal=AWS_SECRET_ACCESS_KEY=... \
+  --namespace certctl
+helm upgrade --install certctl charts/certctl \
+  --set backup.enabled=true \
+  --set backup.sink=s3 \
+  --set backup.s3.bucket=my-certctl-backups \
+  --set backup.s3.region=us-east-1 \
+  --set backup.s3.credentialsSecret=certctl-backup-aws \
+  --set backup.schedule="0 2 * * *"
+```
 
-Tracked in [WORKSPACE-ROADMAP.md](../../../WORKSPACE-ROADMAP.md) as a
-post-v2.1.0 nice-to-have: an opt-in Helm CronJob template for the
-in-cluster-Postgres-to-S3 case as a starter. The right time to ship
-it is when a real operator asks for it; speculatively shipping it
-without that signal would just produce a template every deployment
-ends up rewriting.
+The CronJob runs `pg_dump --format=custom --no-owner --no-acl
+--dbname=certctl` (the same shape as the manual command earlier in
+this runbook, so a manual dump and a Job dump are byte-comparable)
+and ships the artifact to the configured sink. Off-host retention
+is the sink's responsibility — S3 lifecycle rules or PVC snapshot
+retention on the storage class, not the CronJob.
 
+### When the bundled CronJob is NOT the answer
+
+- **Managed Postgres (AWS RDS / GCP Cloud SQL / Azure DB).** Use the
+  provider's built-in PITR; configure retention ≥ 30 days. The
+  certctl deployment surface is the connection string alone — no
+  CronJob to run.
+- **Self-hosted Postgres on a VM (no Kubernetes).** Use a systemd
+  timer + `pg_dump` + `restic` (or `borgbackup`) to encrypted
+  off-host storage. The bundled CronJob has no equivalent on bare
+  VMs.
+- **Already running pgbackrest / wal-g.** Keep using it. The bundled
+  CronJob is for the operator who doesn't yet have a backup posture,
+  not a replacement for production-grade WAL-shipping.
+
+### Recovery objectives
+
+The bundled CronJob targets the same RPO/RTO that any nightly-dump
+strategy gives you:
+
+- **RPO ≈ 24h** at the default `0 2 * * *` schedule (you lose at
+  most one day of writes if Postgres burns down). Tighten by running
+  every 6h or 1h; tighten further by switching to WAL-shipping
+  (out of scope for the bundled CronJob).
+- **RTO ≈ 30–60min** for the restore drill below — drop the dump
+  into a fresh Postgres instance, point certctl at it, confirm
+  routes return 200. Empirically measured during the
+  [disaster-recovery runbook](disaster-recovery.md) drill.
+
+If your contractual RPO is below 24h, run pgbackrest WAL-shipping
+alongside (or instead of) the CronJob.
+
 ## Verification — what to dry-run quarterly
 