docs(b12): observability reference + Postgres backup runbook

Closes acquisition-diligence Bundle 12: Observability, DR, Operations
Receipts, And Performance Proof.

Source IDs: D5, D6, D8, T9, finding 7, OPS-H1, OPS-M1, OPS-M2, LOW-7.

Two new operator-facing references; both non-audit-framed per the
Bundle 5 doc-placement policy.

docs/operator/observability.md: single canonical statement of what
certctl emits, what it doesn't, and what survives a restart:

- Metrics surface: both /api/v1/metrics (JSON) and
  /api/v1/metrics/prometheus (text exposition v0.0.4); inventory of
  certctl_certificate_* gauges + certctl_issuance_duration_seconds
  per-issuer-type histogram + certctl_uptime_seconds.
- Prometheus library vs hand-rolled exposition: explicit scope
  statement; hand-rolled fmt.Fprintf is intentional for v2.x given the
  shallow metric surface; client_golang migration tracked as v3 item
  (closes OPS-M1).
- Tracing: explicit deferral; no OTel SDK setup, OTel packages are
  indirect-only in go.mod, no spans, no OTLP exporter; tracked as v3
  item; in the meantime structured logs carry request_id and
  certctl_issuance_duration_seconds carries the per-issuer latency
  signal (closes OPS-M2).
- Logging: structured JSON via log/slog; CERTCTL_LOG_LEVEL control; no
  key material / bearer tokens / session cookies in log lines.
- Rate-limit semantics under restarts + replicas: per-process,
  in-memory, reset-on-restart, NOT shared across replicas; full
  inventory of the 5 limiter call sites (break-glass login, SCEP/Intune
  per-device, EST per-principal CSR, EST HTTP-Basic source-IP, ACME
  per-account); multi-replica + sticky-session implications;
  database-backed sliding window deferred to v3 (closes D8).
- Performance harness scope: cross-references the explicit 'What it
  explicitly does NOT measure' list in deploy/test/loadtest/README.md
  (closes LOW-7 + finding 7).

docs/operator/runbooks/postgres-backup.md: operator-runnable backup
procedure:

- Inventory of what to back up (DB + operator-managed file material
  that lives outside the DB: CA keys, RA keys, OCSP responder keys,
  trust bundles).
- Logical backup recipe with docker-compose + Kubernetes variants,
  integrity verification step, off-host storage step.
- Physical / PITR recipe pointing at pgbackrest / wal-g (certctl ships
  nothing here; this is standard PostgreSQL DBA work).
- Three sample automation paths (in-cluster Postgres → S3 CronJob,
  managed Postgres PITR, self-hosted VM systemd timer + restic).
- Quarterly restore-dry-run procedure.
- Helm CronJob template deliberately not shipped: three documented
  reasons (deployment topology / secret-management integration /
  off-host storage all vary by operator) plus roadmap entry for
  shipping a starter template when a real operator asks for one
  (closes D6 + OPS-H1).

Both new docs wired into docs/README.md Operator + Runbooks tables.

D5 (ServiceMonitor) and T9 (canonical k6 load-test) were already
shipped in Bundle 3 (deploy/helm/certctl/templates/servicemonitor.yaml)
and in deploy/test/loadtest/ + .github/workflows/loadtest.yml
respectively; this bundle doesn't touch them, it just records the
closure in the audit HTML.

Verified:

  bash scripts/ci-guards/G-3-env-docs-drift.sh  # PASS
  bash scripts/ci-guards/doc-rot-detector.sh    # PASS

All 35 scripts/ci-guards/*.sh green.
2026-06-07 13:51:36 +00:00 · 2026-05-13 02:09:11 +00:00
parent 072e2af198
commit 57b539c378
3 changed files with 385 additions and 0 deletions
@@ -0,0 +1,169 @@
+# Runbook: PostgreSQL backup for certctl
+
+> Last reviewed: 2026-05-13
+
+Use this when:
+- You're setting up a new certctl deployment and need a backup policy
+  before going to production.
+- A buyer or auditor asks "where's the backup automation?" and you need
+  to point at the recommended cadence + procedure.
+- You're rotating the encryption key, swapping CAs, or doing any other
+  destructive maintenance and want a snapshot to roll back to.
+
+certctl does not ship a built-in backup daemon. Postgres is the system
+of record for all certctl state except the file material the operator
+keeps on their own filesystem (CA keys, OCSP responder keys, SCEP/EST
+trust bundles; see "Operator-managed (NOT in DB)" in the
+[disaster-recovery runbook](disaster-recovery.md#postgres-restore)).
+Backing up Postgres is treated as a standard PostgreSQL operations
+task that the operator owns end-to-end with their existing tooling.
+
+This page is the recommended recipe.
+
+## What to back up
+
+| Layer                              | Tool                                                                    | Cadence                  |
+|---|---|---|
+| `certctl` database (the row data)  | `pg_dump` (logical) **or** `pg_basebackup` + WAL archive (physical, PITR) | ≥ daily, retention ≥ 30d |
+| CA cert + key (`CERTCTL_CA_CERT_PATH`, `CERTCTL_CA_KEY_PATH`) | Out-of-band file backup (operator's existing secret-management tool) | On change |
+| SCEP RA cert + key (per profile)   | Out-of-band file backup                                                 | On change                |
+| OCSP responder keys                | Out-of-band file backup (`CERTCTL_OCSP_RESPONDER_KEY_DIR`)              | On change                |
+| Trust-anchor PEM bundles           | Out-of-band file backup                                                 | On change                |
+| Env vars (auth secret, etc.)       | Operator's secret-management tool (Vault, AWS Secrets Manager, etc.)    | On rotation              |
+
+A backup of only the Postgres database without the operator-managed
+file material is **not a complete restore artifact** — see the
+[disaster-recovery runbook's Postgres-restore section](disaster-recovery.md#postgres-restore)
+for the full inventory. The DR runbook owns the restore procedure;
+this page owns the capture procedure.
+
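The 30-day retention floor in the table implies a pruning step for logical dumps. A minimal sketch of how expired dumps could be listed, assuming files follow the `certctl-<UTC timestamp>.dump` naming used in the recipes on this page; `expired_dumps` is a hypothetical helper name, not something certctl ships:

```bash
# Hypothetical helper, not shipped with certctl.
# expired_dumps CUTOFF FILE...
# Prints every FILE named certctl-YYYYmmddTHHMMSSZ.dump whose embedded
# timestamp is strictly older than CUTOFF (same timestamp format).
expired_dumps() {
  local cutoff file stamp
  # Strip the T and Z separators so stamps compare as 14-digit numbers.
  cutoff=$(printf '%s' "$1" | tr -d 'TZ')
  shift
  for file in "$@"; do
    stamp=${file##*/}                 # drop the directory part
    case $stamp in
      certctl-*.dump) ;;
      *) continue ;;                  # not a dump produced by this recipe
    esac
    stamp=${stamp#certctl-}
    stamp=${stamp%.dump}
    stamp=$(printf '%s' "$stamp" | tr -d 'TZ')
    # Skip names whose timestamp is not exactly 14 digits.
    case $stamp in
      *[!0-9]*) continue ;;
    esac
    [ "${#stamp}" -eq 14 ] || continue
    if [ "$stamp" -lt "$cutoff" ]; then
      printf '%s\n' "$file"
    fi
  done
}
```

With GNU `date`, a 30-day cutoff could be computed as `date -u -d '30 days ago' +%Y%m%dT%H%M%SZ`. The function only lists candidates; reviewing the list before deleting anything is left to the operator.
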
+## Logical backup (recommended for most deployments)
+
+`pg_dump -Fc` produces a portable compressed dump that's easy to
+restore into a fresh Postgres instance at any version ≥ the dump's
+source version. Best for deployments where the DB is small enough
+that a full logical dump fits the backup window (rough rule of thumb:
+under a million `managed_certificates` rows + corresponding history).
+
+### docker-compose
+
+```bash
+# 1. Snapshot. Run from any host that can reach the postgres container.
+TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)
+docker compose -f deploy/docker-compose.yml exec -T postgres \
+  pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
+  > "certctl-${TIMESTAMP}.dump"
+
+# 2. Verify integrity (catch transport / truncation bugs early).
+docker run --rm -v "$PWD:/dumps" -w /dumps postgres:16-alpine \
+  pg_restore --list "certctl-${TIMESTAMP}.dump" > /dev/null \
+  && echo "OK: pg_restore --list parses the dump cleanly" \
+  || { echo "CORRUPT DUMP"; exit 1; }
+
+# 3. Move to durable storage (S3, GCS, NFS, encrypted-at-rest blob
+# storage of your choice). DO NOT keep the only copy of the dump on
+# the certctl host; that defeats the purpose of having a backup.
+aws s3 cp "certctl-${TIMESTAMP}.dump" "s3://your-bucket/certctl/"
+```
+
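One failure mode the verification step guards against is a truncated file left behind by a failed `pg_dump`. The sketch below shows one way to avoid ever giving a failed dump its final name, plus a cheap pre-check for the `PGDMP` signature that custom-format archives begin with. `capture_to` and `has_pgdmp_magic` are hypothetical helper names, and the signature check complements `pg_restore --list` rather than replacing it:

```bash
# Hypothetical helpers, not shipped with certctl.

# capture_to OUT CMD [ARGS...]
# Runs CMD with stdout redirected to OUT.partial and renames it to OUT
# only if CMD exits 0, so a failed dump never carries the final name.
capture_to() {
  local out=$1
  shift
  if "$@" > "${out}.partial"; then
    mv -- "${out}.partial" "$out"
  else
    rm -f -- "${out}.partial"
    return 1
  fi
}

# has_pgdmp_magic FILE
# Succeeds if FILE starts with the 5-byte "PGDMP" signature of a
# pg_dump custom-format archive.
has_pgdmp_magic() {
  [ "$(head -c 5 -- "$1" 2>/dev/null)" = "PGDMP" ]
}

# Example wiring with the docker-compose command from step 1:
#   capture_to "certctl-${TIMESTAMP}.dump" \
#     docker compose -f deploy/docker-compose.yml exec -T postgres \
#     pg_dump --format=custom --no-owner --no-acl --dbname=certctl
#   has_pgdmp_magic "certctl-${TIMESTAMP}.dump" || echo "not a custom-format dump" >&2
```
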
+### Kubernetes (with the bundled Helm chart)
+
+```bash
+# 1. Snapshot via kubectl exec into the postgres StatefulSet pod.
+TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)
+NAMESPACE=certctl
+kubectl exec -n "$NAMESPACE" statefulset/postgres -- \
+  pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
+  > "certctl-${TIMESTAMP}.dump"
+
+# 2. Same verification step as above.
+# 3. Same off-host storage step as above.
+```
+
+### Restore (cross-reference)
+
+The restore procedure lives in
+[disaster-recovery.md § Postgres restore](disaster-recovery.md#postgres-restore).
+The key reminders: stop certctl first, restore the DB, run any
+migrations newer than the snapshot, truncate the CRL + OCSP caches,
+then restart.
+
+## Physical / PITR backup (large fleets, RPO < 1h)
+
+Logical dumps have a coarse RPO: anything written after the last
+successful dump is lost. For deployments that cannot tolerate losing
+up to an hour of cert-issuance history, pair Postgres physical backups
+with continuous WAL archiving:
+
+- `pg_basebackup` for the initial seed
+- `archive_command = '<your-WAL-archiver>'` in `postgresql.conf` to
+  ship every WAL segment off the host as it closes
+- `pgbackrest` or `wal-g` for the operational layer (both are
+  battle-tested, support encryption, and integrate cleanly with S3 /
+  GCS / Azure Blob)
+
+certctl ships nothing in this layer — it's standard PostgreSQL DBA
+work, and shipping a bespoke recipe would just be a worse version of
+what `pgbackrest` already does. The
+[pgbackrest configuration guide](https://pgbackrest.org/configuration.html)
+is the authoritative reference.
+
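To make the `archive_command` placeholder concrete, the sketch below illustrates the contract PostgreSQL documents for an archiver: exit 0 only when the segment is safely stored, and refuse to overwrite a segment that already exists. It is a teaching stand-in for `<your-WAL-archiver>`, not a replacement for `pgbackrest` or `wal-g`; the function name, script path, and archive directory are assumptions:

```bash
# Hypothetical stand-in for '<your-WAL-archiver>'; not shipped with certctl.
# archive_wal SRC_PATH FILE_NAME ARCHIVE_DIR
# If saved as a script, postgresql.conf could reference it roughly as
# (script path and directory are illustrative):
#   archive_command = '/usr/local/bin/archive_wal.sh %p %f /mnt/wal-archive'
archive_wal() {
  local src=$1 name=$2 dir=$3 dest tmp
  dest="${dir}/${name}"
  tmp="${dest}.tmp"
  # Refuse to overwrite a segment that was already archived.
  if [ -e "$dest" ]; then
    echo "archive_wal: ${name} already exists in ${dir}" >&2
    return 1
  fi
  # Copy under a temporary name, then rename, so a reader of the
  # archive never sees a half-written segment.
  if cp -- "$src" "$tmp" && mv -- "$tmp" "$dest"; then
    return 0
  fi
  rm -f -- "$tmp"
  return 1
}
```

A non-zero return makes PostgreSQL keep the segment and retry later, which is the behavior to preserve in any real archiver.
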
+## Automation paths
+
+This is the gap an acquisition reviewer typically wants to see filled.
+certctl ships no backup CronJob template in the Helm chart — the
+operator owns this layer because:
+
+1. The right tool depends on the deployment topology (in-cluster
+   Postgres vs. managed Postgres vs. self-hosted on a VM).
+2. The right secret-management integration depends on the operator's
+   existing stack (Vault, AWS Secrets Manager, GCP Secret Manager,
+   sealed-secrets, External Secrets).
+3. The right storage backend depends on the operator's existing
+   off-host blob storage.
+
+A bundled CronJob would be a half-answer for any operator with an
+established backup posture, and would have to be torn out before
+production. Three sample recipes that cover the common cases:
+
+- **In-cluster Postgres → S3:** a CronJob running an alpine image with
+  `aws-cli` + the `pg_dump` command above, output piped to
+  `aws s3 cp`. Use a Cosign-signed image if your supply-chain policy
+  requires it.
+- **Managed Postgres (AWS RDS / GCP Cloud SQL / Azure DB):** rely on
+  the cloud provider's built-in PITR backup; configure retention
+  ≥ 30 days; the certctl deployment surface is the connection string
+  alone.
+- **Self-hosted VM:** systemd timer + `pg_dump` + `restic` (or
+  `borgbackup`) to encrypted off-host storage.
+
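Whichever path is chosen, a scheduled job can fail silently. A minimal sketch of a freshness check that a monitoring job could run against a local dump directory before the off-host copy; the function name, the directory, and the 26-hour threshold are illustrative assumptions:

```bash
# Hypothetical check, not shipped with certctl.
# has_fresh_dump DIR MAX_AGE_MINUTES
# Succeeds if DIR directly contains at least one certctl-*.dump file
# modified within the last MAX_AGE_MINUTES minutes.
# Relies on the -maxdepth and -mmin options of GNU/BSD find.
has_fresh_dump() {
  local dir=$1 max_age=$2 found
  found=$(find "$dir" -maxdepth 1 -type f -name 'certctl-*.dump' \
            -mmin "-${max_age}" 2>/dev/null | head -n 1)
  [ -n "$found" ]
}

# Example: alert when no dump has landed in the last 26 hours
# (the directory is illustrative):
#   has_fresh_dump /var/backups/certctl 1560 \
#     || echo "no certctl dump in the last 26h" >&2
```
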
+Tracked in [WORKSPACE-ROADMAP.md](../../../WORKSPACE-ROADMAP.md) as a
+post-v2.1.0 nice-to-have: an opt-in Helm CronJob template for the
+in-cluster-Postgres-to-S3 case as a starter. The right time to ship
+it is when a real operator asks for it; speculatively shipping it
+without that signal would just produce a template every deployment
+ends up rewriting.
+
+## Verification — what to dry-run quarterly
+
+A backup you've never restored is a backup you don't have. Add this
+to your quarterly on-call rotation:
+
+1. Pick the most recent dump from the previous quarter.
+2. Stand up a throwaway Postgres instance (Docker, kind, anything).
+3. `pg_restore -d certctl <the dump>`.
+4. Bring up a certctl-server container pointed at the throwaway DB
+   (`CERTCTL_DATABASE_URL=postgres://certctl:...@throwaway/...`).
+5. Confirm `/api/v1/version` returns 200, `/api/v1/certificates`
+   lists the expected rows, and the scheduler logs show no
+   migration-version mismatch.
+6. Tear down. Note the timing in your DR registry.
+
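For the "note the timing" part of step 6, a small wrapper can record how long each dry-run step took. This is a sketch under assumptions: `timed_step` and the CSV layout are hypothetical, and the DR registry is taken to be a plain append-only file:

```bash
# Hypothetical helper, not shipped with certctl.
# timed_step LOG LABEL CMD [ARGS...]
# Runs CMD, appends "UTC-start,label,exit-status,seconds" to LOG, and
# returns CMD's exit status.
timed_step() {
  local log=$1 label=$2 started begin end status
  shift 2
  started=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  begin=$(date +%s)
  status=0
  "$@" || status=$?
  end=$(date +%s)
  printf '%s,%s,%s,%s\n' "$started" "$label" "$status" "$((end - begin))" >> "$log"
  return "$status"
}

# Example: time the restore in step 3 (the log file name is illustrative):
#   timed_step dr-drill.csv restore pg_restore -d certctl "$DUMP"
```

Recording the exit status next to the duration keeps failed drills visible in the same log instead of silently missing.
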
+The [disaster-recovery runbook](disaster-recovery.md) covers what to
+do when this dry-run reveals a gap.
+
+## Related reading
+
+- [`docs/operator/runbooks/disaster-recovery.md`](disaster-recovery.md) — the restore companion
+- [`docs/operator/secret-custody.md`](../secret-custody.md) — what
+  the operator-managed file material (CA keys, RA keys, trust
+  anchors) contains, why it lives outside the DB, and what it costs
+  to lose