mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 15:32:02 +00:00
57b539c378
Closes acquisition-diligence Bundle 12 — Observability, DR,
Operations Receipts, And Performance Proof. Source IDs: D5, D6, D8,
T9, finding 7, OPS-H1, OPS-M1, OPS-M2, LOW-7.
Two new operator-facing references; both non-audit-framed per the
Bundle 5 doc-placement policy.
docs/operator/observability.md — single canonical statement of what
certctl emits, what it doesn't, and what survives a restart:
- Metrics surface: both /api/v1/metrics (JSON) and
/api/v1/metrics/prometheus (text exposition v0.0.4); inventory of
certctl_certificate_* gauges + certctl_issuance_duration_seconds
per-issuer-type histogram + certctl_uptime_seconds.
- Prometheus library vs hand-rolled exposition: explicit scope
statement — hand-rolled fmt.Fprintf is intentional for v2.x given
the shallow metric surface; client_golang migration tracked as
v3 item (closes OPS-M1).
- Tracing: explicit deferral — no OTel SDK setup, OTel packages
are indirect-only in go.mod, no spans, no OTLP exporter; tracked
as v3 item; in the meantime structured logs carry request_id and
certctl_issuance_duration_seconds carries the per-issuer latency
signal (closes OPS-M2).
- Logging: structured JSON via log/slog; CERTCTL_LOG_LEVEL control;
no key material / bearer tokens / session cookies in log lines.
- Rate-limit semantics under restarts + replicas: per-process,
in-memory, reset-on-restart, NOT shared across replicas; full
inventory of the 5 limiter call sites (break-glass login,
SCEP/Intune per-device, EST per-principal CSR, EST HTTP-Basic
source-IP, ACME per-account); multi-replica + sticky-session
implications; database-backed sliding window deferred to v3
(closes D8).
- Performance harness scope: cross-references the explicit
'What it explicitly does NOT measure' list in
deploy/test/loadtest/README.md (closes LOW-7 + finding 7).
docs/operator/runbooks/postgres-backup.md — operator-runnable
backup procedure:
- Inventory of what to back up (DB + operator-managed file
material that lives outside the DB: CA keys, RA keys, OCSP
responder keys, trust bundles).
- Logical backup recipe with docker-compose + Kubernetes variants,
integrity verification step, off-host storage step.
- Physical / PITR recipe pointing at pgbackrest / wal-g
(certctl ships nothing here — standard PostgreSQL DBA work).
- Three sample automation paths (in-cluster Postgres → S3 CronJob,
managed Postgres PITR, self-hosted VM systemd timer + restic).
- Quarterly restore-dry-run procedure.
- Helm CronJob template deliberately not shipped — three
documented reasons (deployment topology / secret-management
integration / off-host storage all vary by operator) plus
roadmap entry for shipping a starter template when a real
operator asks for one (closes D6 + OPS-H1).
Both new docs wired into docs/README.md Operator + Runbooks tables.
D5 (ServiceMonitor) and T9 (canonical k6 load-test) were already
shipped in Bundle 3 (deploy/helm/certctl/templates/servicemonitor.yaml)
and in deploy/test/loadtest/ + .github/workflows/loadtest.yml
respectively; this bundle doesn't touch them — it just records the
closure in the audit HTML.
Verified:
bash scripts/ci-guards/G-3-env-docs-drift.sh # PASS
bash scripts/ci-guards/doc-rot-detector.sh # PASS
All 35 scripts/ci-guards/*.sh green.
170 lines
7.7 KiB
Markdown
170 lines
7.7 KiB
Markdown
# Runbook: PostgreSQL backup for certctl
|
|
|
|
> Last reviewed: 2026-05-13
|
|
|
|
Use this when:
|
|
- You're setting up a new certctl deployment and need a backup policy
|
|
before going to production.
|
|
- A buyer or auditor asks "where's the backup automation?" and you need
|
|
to point at the recommended cadence + procedure.
|
|
- You're rotating the encryption key, swapping CAs, or doing any other
|
|
destructive maintenance and want a snapshot to roll back to.
|
|
|
|
certctl does not ship a built-in backup daemon. Postgres is the system
|
|
of record for every piece of certctl state that isn't on the
|
|
operator's filesystem (CA keys, OCSP responder keys, SCEP/EST trust
|
|
bundles — see "Operator-managed (NOT in DB)" in the
|
|
[disaster-recovery runbook](disaster-recovery.md#postgres-restore));
|
|
backing it up is treated as a standard PostgreSQL operations task
|
|
that the operator owns end-to-end with their existing tooling.
|
|
|
|
This page is the recommended recipe.
|
|
|
|
## What to back up
|
|
|
|
| Layer | Tool | Cadence |
|
|
|---|---|---|
|
|
| `certctl` database (the row data) | `pg_dump` (logical) **or** `pg_basebackup` + WAL archive (physical PIT) | ≥ daily, retention ≥ 30d |
|
|
| CA cert + key (`CERTCTL_CA_CERT_PATH`, `CERTCTL_CA_KEY_PATH`) | Out-of-band file backup (operator's existing secret-management tool) | On change |
|
|
| SCEP RA cert + key (per profile) | Out-of-band file backup | On change |
|
|
| OCSP responder keys | Out-of-band file backup (`CERTCTL_OCSP_RESPONDER_KEY_DIR`) | On change |
|
|
| Trust-anchor PEM bundles | Out-of-band file backup | On change |
|
|
| Env vars (auth secret, etc.) | Operator's secret-management tool (Vault, AWS Secrets Manager, etc.) | On rotation |
|
|
|
|
A backup of only the Postgres database without the operator-managed
|
|
file material is **not a complete restore artifact** — see the
|
|
[disaster-recovery runbook's Postgres-restore section](disaster-recovery.md#postgres-restore)
|
|
for the full inventory. The DR runbook owns the restore procedure;
|
|
this page owns the capture procedure.
|
|
|
|
## Logical backup (recommended for most deployments)
|
|
|
|
`pg_dump -Fc` produces a portable compressed dump that's easy to
|
|
restore into a fresh Postgres instance at any version ≥ the dump's
|
|
source version. Best for deployments where the DB is small enough
|
|
that a full logical dump fits the backup window (rough rule of thumb:
|
|
under a million `managed_certificates` rows + corresponding history).
|
|
|
|
### docker-compose
|
|
|
|
```bash
|
|
# 1. Snapshot. Run from any host that can reach the postgres container.
|
|
TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)
|
|
docker compose -f deploy/docker-compose.yml exec -T postgres \
|
|
pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
|
|
> "certctl-${TIMESTAMP}.dump"
|
|
|
|
# 2. Verify integrity (catch transport / truncation bugs early).
|
|
docker run --rm -v "$PWD:/dumps" -w /dumps postgres:16-alpine \
|
|
pg_restore --list "certctl-${TIMESTAMP}.dump" > /dev/null \
|
|
&& echo "OK: pg_restore --list parses the dump cleanly" \
|
|
|| { echo "CORRUPT DUMP"; exit 1; }
|
|
|
|
# 3. Move to durable storage (S3, GCS, NFS, encrypted-at-rest blob
|
|
# storage of your choice). DO NOT leave the dump on the certctl host
|
|
# alone — that defeats the purpose of having a backup.
|
|
aws s3 cp "certctl-${TIMESTAMP}.dump" "s3://your-bucket/certctl/"
|
|
```
|
|
|
|
### Kubernetes (with the bundled Helm chart)
|
|
|
|
```bash
|
|
# 1. Snapshot via kubectl exec into the postgres StatefulSet pod.
|
|
TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)
|
|
NAMESPACE=certctl
|
|
kubectl exec -n "$NAMESPACE" statefulset/postgres -- \
|
|
pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
|
|
> "certctl-${TIMESTAMP}.dump"
|
|
|
|
# 2. Same verification step as above.
|
|
# 3. Same off-host storage step as above.
|
|
```
|
|
|
|
### Restore (cross-reference)
|
|
|
|
The restore procedure lives in
|
|
[disaster-recovery.md § Postgres restore](disaster-recovery.md#postgres-restore).
|
|
The key reminders: stop certctl first, restore the DB, run any
|
|
migrations newer than the snapshot, truncate the CRL + OCSP caches,
|
|
then restart.
|
|
|
|
## Physical / PITR backup (large fleets, RPO < 1h)
|
|
|
|
Logical dumps have a coarse RPO (the last successful dump). For
|
|
deployments where ≤ 1h of cert-issuance history loss is unacceptable,
|
|
pair Postgres physical backups with continuous WAL archiving:
|
|
|
|
- `pg_basebackup` for the initial seed
|
|
- `archive_command = '<your-WAL-archiver>'` in `postgresql.conf` to
|
|
ship every WAL segment off the host as it closes
|
|
- `pgbackrest` or `wal-g` for the operational layer (both are
|
|
battle-tested, support encryption, and integrate cleanly with S3 /
|
|
GCS / Azure Blob)
|
|
|
|
certctl ships nothing in this layer — it's standard PostgreSQL DBA
|
|
work, and shipping a bespoke recipe would just be a worse version of
|
|
what `pgbackrest` already does. The
|
|
[pgbackrest configuration guide](https://pgbackrest.org/configuration.html)
|
|
is the authoritative reference.
|
|
|
|
## Automation paths
|
|
|
|
This is the gap an acquisition reviewer typically wants to see filled.
|
|
certctl ships no backup CronJob template in the Helm chart — the
|
|
operator owns this layer because:
|
|
|
|
1. The right tool depends on the deployment topology (in-cluster
|
|
Postgres vs. managed Postgres vs. self-hosted on a VM).
|
|
2. The right secret-management integration depends on the operator's
|
|
existing stack (Vault, AWS Secrets Manager, GCP Secret Manager,
|
|
sealed-secrets, External Secrets).
|
|
3. The right storage backend depends on the operator's existing
|
|
off-host blob storage.
|
|
|
|
A bundled CronJob would be a half-answer for any operator with an
|
|
established backup posture, and would have to be torn out before
|
|
production. Three sample recipes that cover the common cases:
|
|
|
|
- **In-cluster Postgres → S3:** a CronJob running an alpine image with
|
|
`aws-cli` + the `pg_dump` command above, output piped to
|
|
`aws s3 cp`. Cosign-signed if your supply-chain policy requires it.
|
|
- **Managed Postgres (AWS RDS / GCP Cloud SQL / Azure DB):** rely on
|
|
the cloud provider's built-in PITR backup; configure retention
|
|
≥ 30 days; the certctl deployment surface is the connection string
|
|
alone.
|
|
- **Self-hosted VM:** systemd timer + `pg_dump` + `restic` (or
|
|
`borgbackup`) to encrypted off-host storage.
|
|
|
|
Tracked in [WORKSPACE-ROADMAP.md](../../../WORKSPACE-ROADMAP.md) as a
|
|
post-v2.1.0 nice-to-have: an opt-in Helm CronJob template for the
|
|
in-cluster-Postgres-to-S3 case as a starter. The right time to ship
|
|
it is when a real operator asks for it; speculatively shipping it
|
|
without that signal would just produce a template every deployment
|
|
ends up rewriting.
|
|
|
|
## Verification — what to dry-run quarterly
|
|
|
|
A backup you've never restored is a backup you don't have. Add this
|
|
to your quarterly on-call rotation:
|
|
|
|
1. Pick the most recent dump from the previous quarter.
|
|
2. Stand up a throwaway Postgres instance (Docker, kind, anything).
|
|
3. `pg_restore -d certctl <the dump>`.
|
|
4. Bring up a certctl-server container pointed at the throwaway DB
|
|
(`CERTCTL_DATABASE_URL=postgres://certctl:...@throwaway/...`).
|
|
5. Confirm `/api/v1/version` returns 200, `/api/v1/certificates`
|
|
lists the expected rows, and the scheduler logs show no
|
|
migration-version mismatch.
|
|
6. Tear down. Note the timing in your DR registry.
|
|
|
|
The [disaster-recovery runbook](disaster-recovery.md) covers what to
|
|
do when this dry-run reveals a gap.
|
|
|
|
## Related reading
|
|
|
|
- [`docs/operator/runbooks/disaster-recovery.md`](disaster-recovery.md) — the restore companion
|
|
- [`docs/operator/secret-custody.md`](../secret-custody.md) — what
|
|
the operator-managed file material (CA keys, RA keys, trust
|
|
anchors) contains, why it lives outside the DB, and what it costs
|
|
to lose
|