mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 22:51:30 +00:00

shankar0123 57b539c378 docs(b12): observability reference + Postgres backup runbook

Closes acquisition-diligence Bundle 12 — Observability, DR,
Operations Receipts, And Performance Proof. Source IDs: D5, D6, D8,
T9, finding 7, OPS-H1, OPS-M1, OPS-M2, LOW-7.

Two new operator-facing references; both non-audit-framed per the
Bundle 5 doc-placement policy.

docs/operator/observability.md — single canonical statement of what
certctl emits, what it doesn't, and what survives a restart:
  - Metrics surface: both /api/v1/metrics (JSON) and
    /api/v1/metrics/prometheus (text exposition v0.0.4); inventory of
    certctl_certificate_* gauges + certctl_issuance_duration_seconds
    per-issuer-type histogram + certctl_uptime_seconds.
  - Prometheus library vs hand-rolled exposition: explicit scope
    statement — hand-rolled fmt.Fprintf is intentional for v2.x given
    the shallow metric surface; client_golang migration tracked as
    v3 item (closes OPS-M1).
  - Tracing: explicit deferral — no OTel SDK setup, OTel packages
    are indirect-only in go.mod, no spans, no OTLP exporter; tracked
    as v3 item; in the meantime structured logs carry request_id and
    certctl_issuance_duration_seconds carries the per-issuer latency
    signal (closes OPS-M2).
  - Logging: structured JSON via log/slog; CERTCTL_LOG_LEVEL control;
    no key material / bearer tokens / session cookies in log lines.
  - Rate-limit semantics under restarts + replicas: per-process,
    in-memory, reset-on-restart, NOT shared across replicas; full
    inventory of the 5 limiter call sites (break-glass login,
    SCEP/Intune per-device, EST per-principal CSR, EST HTTP-Basic
    source-IP, ACME per-account); multi-replica + sticky-session
    implications; database-backed sliding window deferred to v3
    (closes D8).
  - Performance harness scope: cross-references the explicit
    'What it explicitly does NOT measure' list in
    deploy/test/loadtest/README.md (closes LOW-7 + finding 7).

docs/operator/runbooks/postgres-backup.md — operator-runnable
backup procedure:
  - Inventory of what to back up (DB + operator-managed file
    material that lives outside the DB: CA keys, RA keys, OCSP
    responder keys, trust bundles).
  - Logical backup recipe with docker-compose + Kubernetes variants,
    integrity verification step, off-host storage step.
  - Physical / PITR recipe pointing at pgbackrest / wal-g
    (certctl ships nothing here — standard PostgreSQL DBA work).
  - Three sample automation paths (in-cluster Postgres → S3 CronJob,
    managed Postgres PITR, self-hosted VM systemd timer + restic).
  - Quarterly restore-dry-run procedure.
  - Helm CronJob template deliberately not shipped — three
    documented reasons (deployment topology / secret-management
    integration / off-host storage all vary by operator) plus
    roadmap entry for shipping a starter template when a real
    operator asks for one (closes D6 + OPS-H1).

Both new docs wired into docs/README.md Operator + Runbooks tables.

D5 (ServiceMonitor) and T9 (canonical k6 load-test) were already
shipped in Bundle 3 (deploy/helm/certctl/templates/servicemonitor.yaml)
and in deploy/test/loadtest/ + .github/workflows/loadtest.yml
respectively; this bundle doesn't touch them — it just records the
closure in the audit HTML.

Verified:
  bash scripts/ci-guards/G-3-env-docs-drift.sh    # PASS
  bash scripts/ci-guards/doc-rot-detector.sh      # PASS
  All 35 scripts/ci-guards/*.sh green.

2026-05-13 02:09:11 +00:00

7.7 KiB

Runbook: PostgreSQL backup for certctl

Last reviewed: 2026-05-13

Use this when:

You're setting up a new certctl deployment and need a backup policy before going to production.
A buyer or auditor asks "where's the backup automation?" and you need to point at the recommended cadence + procedure.
You're rotating the encryption key, swapping CAs, or doing any other destructive maintenance and want a snapshot to roll back to.

certctl does not ship a built-in backup daemon. Postgres is the system of record for every piece of certctl state that isn't on the operator's filesystem (CA keys, OCSP responder keys, SCEP/EST trust bundles — see "Operator-managed (NOT in DB)" in the disaster-recovery runbook); backing it up is treated as a standard PostgreSQL operations task that the operator owns end-to-end with their existing tooling.

This page is the recommended recipe.

What to back up

Layer	Tool	Cadence
`certctl` database (the row data)	`pg_dump` (logical) or `pg_basebackup` + WAL archive (physical, PITR)	≥ daily, retention ≥ 30d
CA cert + key (`CERTCTL_CA_CERT_PATH`, `CERTCTL_CA_KEY_PATH`)	Out-of-band file backup (operator's existing secret-management tool)	On change
SCEP RA cert + key (per profile)	Out-of-band file backup	On change
OCSP responder keys	Out-of-band file backup (`CERTCTL_OCSP_RESPONDER_KEY_DIR`)	On change
Trust-anchor PEM bundles	Out-of-band file backup	On change
Env vars (auth secret, etc.)	Operator's secret-management tool (Vault, AWS Secrets Manager, etc.)	On rotation

A backup of only the Postgres database without the operator-managed file material is not a complete restore artifact — see the disaster-recovery runbook's Postgres-restore section for the full inventory. The DR runbook owns the restore procedure; this page owns the capture procedure.
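Because the dump alone is not a complete restore artifact, a capture job can refuse to report success until the file material is also present. The following is a minimal sketch, assuming the three environment variables named in the table above are set in the job's environment. The function name `check_file_material` is hypothetical and not part of certctl. SCEP RA material and trust-anchor bundles are not covered, because this page names no variables for their locations.

```shell
# Hypothetical pre-flight check (not shipped with certctl): confirm that the
# operator-managed file material named in the table above exists before a
# backup run is recorded as complete.
# The body runs in a subshell, so its variables stay local.
check_file_material() (
  status=0
  # The CA cert and key must each be a non-empty file.
  for f in "${CERTCTL_CA_CERT_PATH:-}" "${CERTCTL_CA_KEY_PATH:-}"; do
    if [ -z "$f" ] || [ ! -s "$f" ]; then
      echo "MISSING file: ${f:-<unset>}" >&2
      status=1
    fi
  done
  # The OCSP responder key directory must exist and contain at least one entry.
  d="${CERTCTL_OCSP_RESPONDER_KEY_DIR:-}"
  if [ -z "$d" ] || [ ! -d "$d" ] || [ -z "$(ls -A "$d")" ]; then
    echo "MISSING or empty directory: ${d:-<unset>}" >&2
    status=1
  fi
  return "$status"
)
```

Calling `check_file_material || exit 1` before the dump step makes a missing key fail the run instead of surfacing during a restore.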

Logical backup (recommended for most deployments)

pg_dump -Fc produces a portable compressed dump that's easy to restore into a fresh Postgres instance at any version ≥ the dump's source version. Best for deployments where the DB is small enough that a full logical dump fits the backup window (rough rule of thumb: under a million managed_certificates rows + corresponding history).

docker-compose

```shell
# 1. Snapshot. Run from any host that can reach the postgres container.
TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)
docker compose -f deploy/docker-compose.yml exec -T postgres \
  pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
  > "certctl-${TIMESTAMP}.dump"

# 2. Verify integrity (catch transport / truncation bugs early).
docker run --rm -v "$PWD:/dumps" -w /dumps postgres:16-alpine \
  pg_restore --list "certctl-${TIMESTAMP}.dump" > /dev/null \
  && echo "OK: pg_restore --list parses the dump cleanly" \
  || { echo "CORRUPT DUMP"; exit 1; }

# 3. Move to durable storage (S3, GCS, NFS, encrypted-at-rest blob
# storage of your choice). DO NOT leave the dump on the certctl host
# alone — that defeats the purpose of having a backup.
aws s3 cp "certctl-${TIMESTAMP}.dump" "s3://your-bucket/certctl/"
```
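One caveat with a plain `>` redirect: the shell creates the output file even when `pg_dump` fails, so a truncated file can end up under the final name. The following is a minimal sketch of a wrapper that avoids this by writing to a temporary name and renaming only on success. The function name `atomic_dump` and the `.partial` suffix are hypothetical, not part of certctl.

```shell
# Hypothetical helper: run a dump command, write its stdout to a temporary
# file, and rename it into place only if the command succeeded and produced
# output. The body runs in a subshell, so its variables stay local.
# Usage: atomic_dump <final-path> <command> [args...]
atomic_dump() (
  out=$1
  shift
  tmp="${out}.partial"
  if "$@" > "$tmp" && [ -s "$tmp" ]; then
    mv "$tmp" "$out"
  else
    rm -f "$tmp"
    echo "dump failed; nothing written to $out" >&2
    return 1
  fi
)

# Example call, reusing the snapshot command from step 1:
#   atomic_dump "certctl-${TIMESTAMP}.dump" \
#     docker compose -f deploy/docker-compose.yml exec -T postgres \
#     pg_dump --format=custom --no-owner --no-acl --dbname=certctl
```

The `pg_restore --list` check in step 2 is still worth running. The wrapper only guarantees that a file under the final name came from a dump command that exited cleanly.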

Kubernetes (with the bundled Helm chart)

```shell
# 1. Snapshot via kubectl exec into the postgres StatefulSet pod.
TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)
NAMESPACE=certctl
kubectl exec -n "$NAMESPACE" statefulset/postgres -- \
  pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
  > "certctl-${TIMESTAMP}.dump"

# 2. Same verification step as above.
# 3. Same off-host storage step as above.
```
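For either variant, the `pg_restore --list` check runs before the dump leaves the host, so it says nothing about the copy that later comes back from off-host storage. One option is to upload a checksum sidecar next to the dump and re-check it after download. The following is a minimal sketch, assuming `sha256sum` (GNU coreutils or BusyBox) is available. The function names and the `.sha256` suffix are hypothetical.

```shell
# Hypothetical helpers: record a SHA-256 checksum next to the dump before
# upload, and re-check it after download. Both bodies run in a subshell so
# the cd does not leak into the caller.

# Usage: write_checksum <dump-path>   (creates <dump-path>.sha256)
write_checksum() (
  cd "$(dirname "$1")" &&
    sha256sum "$(basename "$1")" > "$(basename "$1").sha256"
)

# Usage: verify_checksum <dump-path>  (non-zero exit on mismatch)
verify_checksum() (
  cd "$(dirname "$1")" &&
    sha256sum -c "$(basename "$1").sha256" > /dev/null 2>&1
)
```

The sidecar stores only the file's base name, so the dump and its `.sha256` file verify correctly from whichever directory they are downloaded into together.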

Restore (cross-reference)

The restore procedure lives in disaster-recovery.md § Postgres restore. The key reminders: stop certctl first, restore the DB, run any migrations newer than the snapshot, truncate the CRL + OCSP caches, then restart.

Physical / PITR backup (large fleets, RPO < 1h)

Logical dumps have a coarse RPO: everything written since the last successful dump is lost. For deployments that cannot tolerate losing up to an hour of cert-issuance history, pair Postgres physical backups with continuous WAL archiving:

pg_basebackup for the initial seed
archive_mode = on plus archive_command = '<your-WAL-archiver>' in postgresql.conf to ship every WAL segment off the host as it closes (PostgreSQL ignores archive_command unless archive_mode is enabled)
pgbackrest or wal-g for the operational layer (both are battle-tested, support encryption, and integrate cleanly with S3 / GCS / Azure Blob)

certctl ships nothing in this layer — it's standard PostgreSQL DBA work, and shipping a bespoke recipe would just be a worse version of what pgbackrest already does. The pgbackrest configuration guide is the authoritative reference.
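To make the `<your-WAL-archiver>` placeholder concrete, the sketch below shows the contract PostgreSQL expects from an archive command: exit zero only when the segment is safely stored, and refuse to overwrite an existing segment. It is an illustration of that contract, not a recommended archiver, and pgbackrest or wal-g should fill this slot in production. The function name `archive_wal` and the `WAL_ARCHIVE_DIR` variable are hypothetical.

```shell
# Illustration only: the behaviour PostgreSQL expects from archive_command.
# PostgreSQL substitutes %p (path of the WAL segment) and %f (its file name),
# so a script wrapping this function would be wired in roughly as:
#   archive_command = '/path/to/your-script %p %f'
# WAL_ARCHIVE_DIR is a hypothetical variable naming an off-host mount.
# Usage: archive_wal <segment-path> <segment-name>
archive_wal() (
  src=$1
  name=$2
  archive_dir=${WAL_ARCHIVE_DIR:?set WAL_ARCHIVE_DIR}
  # Refuse to overwrite: an existing segment with the same name is an error.
  if [ -e "$archive_dir/$name" ]; then
    echo "refusing to overwrite $archive_dir/$name" >&2
    return 1
  fi
  # Copy under a temporary name, then rename, so an interrupted copy never
  # leaves a truncated file under the final segment name.
  cp "$src" "$archive_dir/$name.tmp" &&
    mv "$archive_dir/$name.tmp" "$archive_dir/$name"
)
```

A non-zero exit tells PostgreSQL the segment was not archived, so it keeps the segment and retries later.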

Automation paths

This is the gap an acquisition reviewer typically wants to see filled. certctl ships no backup CronJob template in the Helm chart — the operator owns this layer because:

The right tool depends on the deployment topology (in-cluster Postgres vs. managed Postgres vs. self-hosted on a VM).
The right secret-management integration depends on the operator's existing stack (Vault, AWS Secrets Manager, GCP Secret Manager, sealed-secrets, External Secrets).
The right storage backend depends on the operator's existing off-host blob storage.

A bundled CronJob would be a half-answer for any operator with an established backup posture, and would have to be torn out before production. Three sample recipes that cover the common cases:

In-cluster Postgres → S3: a CronJob running an alpine image with aws-cli + the pg_dump command above, output piped to aws s3 cp. Use a cosign-signed image if your supply-chain policy requires it.
Managed Postgres (AWS RDS / GCP Cloud SQL / Azure Database for PostgreSQL): rely on the cloud provider's built-in PITR backup; configure retention ≥ 30 days; the certctl deployment surface is the connection string alone.
Self-hosted VM: systemd timer + pg_dump + restic (or borgbackup) to encrypted off-host storage.

Tracked in WORKSPACE-ROADMAP.md as a post-v2.1.0 nice-to-have: an opt-in Helm CronJob template for the in-cluster-Postgres-to-S3 case as a starter. The right time to ship it is when a real operator asks for it; speculatively shipping it without that signal would just produce a template every deployment ends up rewriting.
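For the in-cluster Postgres to S3 recipe, the container's entrypoint could look roughly like the sketch below. It is not a template shipped with certctl. It assumes the image provides `pg_dump`, `pg_restore`, and the AWS CLI, and that the libpq connection variables (`PGHOST`, `PGUSER`, `PGPASSWORD`) are injected from a Secret. `BACKUP_BUCKET` and the function name `backup_to_s3` are hypothetical. Unlike the piped variant described above, it stages the dump in a scratch file so it can be verified before upload.

```shell
# Hypothetical CronJob entrypoint (not shipped with certctl). Runs the same
# three steps as the manual recipe: snapshot, verify, ship off-host.
# The body runs in a subshell, so its variables stay local.
backup_to_s3() (
  bucket=${BACKUP_BUCKET:?set BACKUP_BUCKET}
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  dump="${TMPDIR:-/tmp}/certctl-${stamp}.dump"
  rc=0
  {
    # 1. Snapshot with the same flags as the manual recipe.
    pg_dump --format=custom --no-owner --no-acl --dbname=certctl > "$dump" &&
      # 2. Confirm the archive parses before it leaves the pod.
      pg_restore --list "$dump" > /dev/null &&
      # 3. Ship off-host.
      aws s3 cp "$dump" "s3://${bucket}/certctl/"
  } || rc=$?
  # Remove the scratch copy whether or not the run succeeded.
  rm -f "$dump"
  return "$rc"
)
```

A failure at any step skips the later steps and returns a non-zero code, which is what lets the CronJob controller mark the run as failed.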

Verification — what to dry-run quarterly

A backup you've never restored is a backup you don't have. Add this to your quarterly on-call rotation:

Pick the most recent dump from the previous quarter.
Stand up a throwaway Postgres instance (Docker, kind, anything).
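pg_restore -d certctl <the dump>. The target certctl database must already exist, so create it first if the throwaway instance starts empty.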
Bring up a certctl-server container pointed at the throwaway DB (CERTCTL_DATABASE_URL=postgres://certctl:...@throwaway/...).
Confirm /api/v1/version returns 200, /api/v1/certificates lists the expected rows, and the scheduler logs show no migration-version mismatch.
Tear down. Note the timing in your DR registry.
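The endpoint checks and the timing note in the steps above can be scripted so that every quarterly run leaves the same kind of record. The following is a minimal sketch, assuming `curl` is installed and that the endpoints answer without extra headers; add whatever authentication your deployment requires. The names `timed_step`, `expect_200`, `DR_LOG`, and `CERTCTL_URL` are hypothetical.

```shell
# Hypothetical dry-run helpers (not shipped with certctl).

# Run a step, then append "<UTC time> <step> <seconds>s rc=<exit code>"
# to the file named by DR_LOG (default: dr-dry-run.log).
# Usage: timed_step <step-name> <command> [args...]
timed_step() (
  name=$1
  shift
  start=$(date +%s)
  rc=0
  "$@" || rc=$?
  end=$(date +%s)
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $name $((end - start))s rc=$rc" \
    >> "${DR_LOG:-dr-dry-run.log}"
  return "$rc"
)

# Succeed only if the URL answers with HTTP 200.
expect_200() (
  code=$(curl -s -o /dev/null -w '%{http_code}' "$1") || code=000
  [ "$code" = "200" ]
)

# Example sequence against the throwaway instance:
#   timed_step restore pg_restore -d certctl <the dump>
#   timed_step version expect_200 "$CERTCTL_URL/api/v1/version"
#   timed_step list    expect_200 "$CERTCTL_URL/api/v1/certificates"
```

An HTTP 200 from `/api/v1/certificates` does not confirm that the expected rows are present, and the scheduler logs still need a manual look for a migration-version mismatch.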

The disaster-recovery runbook covers what to do when this dry-run reveals a gap.

docs/operator/runbooks/disaster-recovery.md — the restore companion
docs/operator/secret-custody.md — what the operator-managed file material (CA keys, RA keys, trust anchors) contains, why it lives outside the DB, and what it costs to lose
