Closes acquisition-diligence Bundle 12 — Observability, DR,
Operations Receipts, And Performance Proof. Source IDs: D5, D6, D8,
T9, finding 7, OPS-H1, OPS-M1, OPS-M2, LOW-7.
Two new operator-facing references; both non-audit-framed per the
Bundle 5 doc-placement policy.
docs/operator/observability.md — single canonical statement of what
certctl emits, what it doesn't, and what survives a restart:
- Metrics surface: both /api/v1/metrics (JSON) and
/api/v1/metrics/prometheus (text exposition v0.0.4); inventory of
certctl_certificate_* gauges + certctl_issuance_duration_seconds
per-issuer-type histogram + certctl_uptime_seconds.
- Prometheus library vs hand-rolled exposition: explicit scope
statement — hand-rolled fmt.Fprintf is intentional for v2.x given
the shallow metric surface; client_golang migration tracked as
v3 item (closes OPS-M1).
- Tracing: explicit deferral — no OTel SDK setup, OTel packages
are indirect-only in go.mod, no spans, no OTLP exporter; tracked
as v3 item; in the meantime structured logs carry request_id and
certctl_issuance_duration_seconds carries the per-issuer latency
signal (closes OPS-M2).
- Logging: structured JSON via log/slog; CERTCTL_LOG_LEVEL control;
no key material / bearer tokens / session cookies in log lines.
- Rate-limit semantics under restarts + replicas: per-process,
in-memory, reset-on-restart, NOT shared across replicas; full
inventory of the 5 limiter call sites (break-glass login,
SCEP/Intune per-device, EST per-principal CSR, EST HTTP-Basic
source-IP, ACME per-account); multi-replica + sticky-session
implications; database-backed sliding window deferred to v3
(closes D8).
- Performance harness scope: cross-references the explicit
'What it explicitly does NOT measure' list in
deploy/test/loadtest/README.md (closes LOW-7 + finding 7).
docs/operator/runbooks/postgres-backup.md — operator-runnable
backup procedure:
- Inventory of what to back up (DB + operator-managed file
material that lives outside the DB: CA keys, RA keys, OCSP
responder keys, trust bundles).
- Logical backup recipe with docker-compose + Kubernetes variants,
integrity verification step, off-host storage step.
- Physical / PITR recipe pointing at pgbackrest / wal-g
(certctl ships nothing here — standard PostgreSQL DBA work).
- Three sample automation paths (in-cluster Postgres → S3 CronJob,
managed Postgres PITR, self-hosted VM systemd timer + restic).
- Quarterly restore-dry-run procedure.
- Helm CronJob template deliberately not shipped — three
documented reasons (deployment topology / secret-management
integration / off-host storage all vary by operator) plus
roadmap entry for shipping a starter template when a real
operator asks for one (closes D6 + OPS-H1).
Both new docs wired into docs/README.md Operator + Runbooks tables.
D5 (ServiceMonitor) and T9 (canonical k6 load-test) were already
shipped: D5 in Bundle 3 (deploy/helm/certctl/templates/servicemonitor.yaml),
T9 in deploy/test/loadtest/ + .github/workflows/loadtest.yml. This
bundle doesn't touch them; it just records the closure in the audit
HTML.
Verified:
bash scripts/ci-guards/G-3-env-docs-drift.sh # PASS
bash scripts/ci-guards/doc-rot-detector.sh # PASS
All 35 scripts/ci-guards/*.sh green.
Runbook: PostgreSQL backup for certctl
Last reviewed: 2026-05-13
Use this when:
- You're setting up a new certctl deployment and need a backup policy before going to production.
- A buyer or auditor asks "where's the backup automation?" and you need to point at the recommended cadence + procedure.
- You're rotating the encryption key, swapping CAs, or doing any other destructive maintenance and want a snapshot to roll back to.
certctl does not ship a built-in backup daemon. Postgres is the system of record for every piece of certctl state that isn't on the operator's filesystem (CA keys, OCSP responder keys, SCEP/EST trust bundles — see "Operator-managed (NOT in DB)" in the disaster-recovery runbook); backing it up is treated as a standard PostgreSQL operations task that the operator owns end-to-end with their existing tooling.
This page is the recommended recipe.
What to back up
| Layer | Tool | Cadence |
|---|---|---|
| certctl database (the row data) | pg_dump (logical) or pg_basebackup + WAL archive (physical PIT) | ≥ daily, retention ≥ 30d |
| CA cert + key (CERTCTL_CA_CERT_PATH, CERTCTL_CA_KEY_PATH) | Out-of-band file backup (operator's existing secret-management tool) | On change |
| SCEP RA cert + key (per profile) | Out-of-band file backup | On change |
| OCSP responder keys | Out-of-band file backup (CERTCTL_OCSP_RESPONDER_KEY_DIR) | On change |
| Trust-anchor PEM bundles | Out-of-band file backup | On change |
| Env vars (auth secret, etc.) | Operator's secret-management tool (Vault, AWS Secrets Manager, etc.) | On rotation |
A backup of only the Postgres database without the operator-managed file material is not a complete restore artifact — see the disaster-recovery runbook's Postgres-restore section for the full inventory. The DR runbook owns the restore procedure; this page owns the capture procedure.
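The retention target in the table (≥ 30d) implies some way to tell which dumps have aged past the window. A minimal sketch, assuming dumps are named certctl-<UTC timestamp>.dump with the timestamp format produced by date -u +%Y%m%dT%H%M%SZ, as in the recipes on this page. The helper names dump_timestamp and expired_dumps are hypothetical and not part of certctl.

```shell
#!/bin/sh
# Sketch only: list dumps whose embedded UTC timestamp is older than a cutoff.
# Assumption: file names look like certctl-YYYYmmddTHHMMSSZ.dump.

# dump_timestamp PATH: print the timestamp embedded in the file name,
# or return 1 if the name does not follow the expected pattern.
dump_timestamp() {
  name=${1##*/}
  case "$name" in
    certctl-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]T[0-9][0-9][0-9][0-9][0-9][0-9]Z.dump)
      ts=${name#certctl-}
      printf '%s\n' "${ts%.dump}"
      ;;
    *)
      return 1
      ;;
  esac
}

# expired_dumps CUTOFF FILE...: print every FILE strictly older than CUTOFF.
# The timestamps are fixed-width, so after dropping the T and Z markers
# they compare correctly as plain integers.
expired_dumps() {
  cutoff=$(printf '%s' "$1" | tr -d 'TZ')
  shift
  for f in "$@"; do
    stamp=$(dump_timestamp "$f") || continue   # skip unrelated files
    digits=$(printf '%s' "$stamp" | tr -d 'TZ')
    if [ "$digits" -lt "$cutoff" ]; then
      printf '%s\n' "$f"
    fi
  done
}
```

With GNU date, a 30-day cutoff can be computed as date -u -d '30 days ago' +%Y%m%dT%H%M%SZ. The function only prints candidates, so the actual deletion stays an explicit operator decision.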
Logical backup (recommended for most deployments)
pg_dump -Fc produces a portable compressed dump that's easy to
restore into a fresh Postgres instance at any version ≥ the dump's
source version. Best for deployments where the DB is small enough
that a full logical dump fits the backup window (rough rule of thumb:
under a million managed_certificates rows + corresponding history).
docker-compose
# 1. Snapshot. Run from any host that can reach the postgres container.
TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)
docker compose -f deploy/docker-compose.yml exec -T postgres \
  pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
  > "certctl-${TIMESTAMP}.dump"
# 2. Verify integrity (catch transport / truncation bugs early).
docker run --rm -v "$PWD:/dumps" -w /dumps postgres:16-alpine \
  pg_restore --list "certctl-${TIMESTAMP}.dump" > /dev/null \
  && echo "OK: pg_restore --list parses the dump cleanly" \
  || { echo "CORRUPT DUMP"; exit 1; }
# 3. Move to durable storage (S3, GCS, NFS, encrypted-at-rest blob
# storage of your choice). DO NOT leave the dump on the certctl host
# alone — that defeats the purpose of having a backup.
aws s3 cp "certctl-${TIMESTAMP}.dump" "s3://your-bucket/certctl/"
Kubernetes (with the bundled Helm chart)
# 1. Snapshot via kubectl exec into the postgres StatefulSet pod.
TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)
NAMESPACE=certctl
kubectl exec -n "$NAMESPACE" statefulset/postgres -- \
  pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
  > "certctl-${TIMESTAMP}.dump"
# 2. Same verification step as above.
# 3. Same off-host storage step as above.
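When a dump is captured through a shell redirect, a failed exec can leave an empty or partial file behind. A cheap pre-check, sketched below, is to confirm the file is non-empty and starts with the PGDMP magic bytes that pg_dump writes at the head of a custom-format archive. This does not replace the pg_restore --list step; it only fails faster. The function name looks_like_custom_dump is hypothetical.

```shell
#!/bin/sh
# Sketch only: quick sanity check before the full pg_restore --list verification.

# looks_like_custom_dump FILE: succeed if FILE is non-empty and begins
# with the "PGDMP" header of a pg_dump custom-format archive.
looks_like_custom_dump() {
  [ -s "$1" ] || { echo "empty or missing: $1" >&2; return 1; }
  magic=$(head -c 5 "$1")
  [ "$magic" = "PGDMP" ] || { echo "not a custom-format dump: $1" >&2; return 1; }
}
```

For example, looks_like_custom_dump "certctl-${TIMESTAMP}.dump" can run right after the kubectl exec line, before the dump is handed to pg_restore --list.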
Restore (cross-reference)
The restore procedure lives in disaster-recovery.md § Postgres restore. The key reminders: stop certctl first, restore the DB, run any migrations newer than the snapshot, truncate the CRL + OCSP caches, then restart.
Physical / PITR backup (large fleets, RPO < 1h)
Logical dumps have a coarse RPO (the last successful dump). For deployments that cannot tolerate losing even an hour of cert-issuance history, pair Postgres physical backups with continuous WAL archiving:
- pg_basebackup for the initial seed
- archive_command = '<your-WAL-archiver>' in postgresql.conf to ship every WAL segment off the host as it closes
- pgbackrest or wal-g for the operational layer (both are battle-tested, support encryption, and integrate cleanly with S3 / GCS / Azure Blob)
certctl ships nothing in this layer — it's standard PostgreSQL DBA
work, and shipping a bespoke recipe would just be a worse version of
what pgbackrest already does. The pgbackrest configuration guide is
the authoritative reference.
Automation paths
This is the gap an acquisition reviewer typically wants to see filled. certctl ships no backup CronJob template in the Helm chart — the operator owns this layer because:
- The right tool depends on the deployment topology (in-cluster Postgres vs. managed Postgres vs. self-hosted on a VM).
- The right secret-management integration depends on the operator's existing stack (Vault, AWS Secrets Manager, GCP Secret Manager, sealed-secrets, External Secrets).
- The right storage backend depends on the operator's existing off-host blob storage.
A bundled CronJob would be a half-answer for any operator with an established backup posture, and would have to be torn out before production. Three sample recipes that cover the common cases:
- In-cluster Postgres → S3: a CronJob running an alpine image with aws-cli + the pg_dump command above, output piped to aws s3 cp. Cosign-signed if your supply-chain policy requires it.
- Managed Postgres (AWS RDS / GCP Cloud SQL / Azure DB): rely on the cloud provider's built-in PITR backup; configure retention ≥ 30 days; the certctl deployment surface is the connection string alone.
- Self-hosted VM: systemd timer + pg_dump + restic (or borgbackup) to encrypted off-host storage.
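The in-cluster-to-S3 path reduces to the same three steps as the manual recipe, run on a schedule. A minimal sketch of the script such a CronJob or timer could invoke, assuming pg_dump, pg_restore, and the aws CLI are on PATH and that connection settings come from the standard PG* environment variables. The function name backup_once and the bucket URL are placeholders, not something certctl ships.

```shell
#!/bin/sh
# Sketch only: one unattended backup run (dump, verify, ship).
# Assumptions: PGHOST/PGUSER/PGPASSWORD (or equivalent) are already set,
# and the destination is an S3 prefix the job is allowed to write to.

# backup_once DEST [WORKDIR]
# Returns 1 if the dump fails, 2 if it does not verify, 3 if the upload fails.
backup_once() {
  dest=$1                       # e.g. s3://your-bucket/certctl/
  workdir=${2:-.}
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  dump="$workdir/certctl-$stamp.dump"

  # 1. Snapshot, using the same flags as the manual recipe.
  pg_dump --format=custom --no-owner --no-acl --dbname=certctl --file="$dump" \
    || { echo "pg_dump failed" >&2; return 1; }

  # 2. Verify before shipping; a dump that does not parse is never uploaded.
  pg_restore --list "$dump" > /dev/null \
    || { echo "CORRUPT DUMP: $dump" >&2; return 2; }

  # 3. Move off-host.
  aws s3 cp "$dump" "$dest" \
    || { echo "upload failed, keeping $dump for retry" >&2; return 3; }

  echo "OK: $dump -> $dest"
}
```

Distinct return codes let the scheduler's alerting tell a database problem apart from a storage problem. The self-hosted variant would swap the aws s3 cp step for a restic or borgbackup invocation.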
Tracked in WORKSPACE-ROADMAP.md as a post-v2.1.0 nice-to-have: an opt-in Helm CronJob template for the in-cluster-Postgres-to-S3 case as a starter. The right time to ship it is when a real operator asks for it; speculatively shipping it without that signal would just produce a template every deployment ends up rewriting.
Verification — what to dry-run quarterly
A backup you've never restored is a backup you don't have. Add this to your quarterly on-call rotation:
- Pick the most recent dump from the previous quarter.
- Stand up a throwaway Postgres instance (Docker, kind, anything).
- pg_restore -d certctl <the dump>.
- Bring up a certctl-server container pointed at the throwaway DB (CERTCTL_DATABASE_URL=postgres://certctl:...@throwaway/...).
- Confirm /api/v1/version returns 200, /api/v1/certificates lists the expected rows, and the scheduler logs show no migration-version mismatch.
- Tear down. Note the timing in your DR registry.
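The HTTP part of the confirmation step can be scripted so the dry-run produces a pass/fail line for the DR registry. A minimal sketch, assuming curl is available and that both endpoints are reachable from where the check runs. If your deployment requires authentication on /api/v1/certificates, the curl call needs the matching header added. The function names http_status and dry_run_check are hypothetical. The sketch checks status codes only; comparing row counts and reading the scheduler logs remain manual.

```shell
#!/bin/sh
# Sketch only: status-code check against a restored throwaway instance.

# http_status URL: print the HTTP status code of a GET request.
# curl prints 000 when the connection itself fails.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# dry_run_check BASE_URL: fail on the first endpoint that is not 200.
dry_run_check() {
  base=${1%/}                   # tolerate a trailing slash
  for path in /api/v1/version /api/v1/certificates; do
    code=$(http_status "$base$path") || code=000
    if [ "$code" != "200" ]; then
      echo "FAIL: $path returned $code" >&2
      return 1
    fi
  done
  echo "OK: restored instance answers on both endpoints"
}
```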
The disaster-recovery runbook covers what to do when this dry-run reveals a gap.
Related reading
- docs/operator/runbooks/disaster-recovery.md: the restore companion
- docs/operator/secret-custody.md: what the operator-managed file material (CA keys, RA keys, trust anchors) contains, why it lives outside the DB, and what it costs to lose