Mirror of https://github.com/shankar0123/certctl.git (synced 2026-06-07 12:21:31 +00:00)
feat(ci): DEPL-005 + DATA-012 — weekly backup/restore smoke + audit-chain round-trip assertion

Acquisition-audit DEPL-005 (backup runbook exists but no CI restore
test) + DATA-012 closure (Sprint 4 ACQ, 2026-05-16).

A backup procedure that has never been restore-tested is not a backup
procedure. The Helm CronJob at
deploy/helm/certctl/templates/backup-cronjob.yaml and the operator
runbook at docs/operator/runbooks/postgres-backup.md both document a
`pg_dump -Fc --no-owner --no-acl`-based backup strategy, but the dump
shape has never been restored end-to-end under CI.

This sprint adds the missing assertion. Each Monday at 07:00 UTC (1h
offset from loadtest.yml's 06:00 slot so the two jobs don't fight for
runners), boot a real postgres:16-alpine service container pinned to
the SAME sha256 digest as deploy/docker-compose.yml, exercise the
audit_events hash chain with 24 synthetic rows representing an
issue/renew/revoke/auth-login cycle, take a custom-format dump,
DROP SCHEMA public CASCADE (simulating an operator-side data-loss
event), pg_restore, and assert:

    pre.row_count       == post.row_count
    pre.chain_head_hash == post.chain_head_hash  (BYTE-EXACT)
    post.first_break_id == ""                    (verify_chain clean)
    post.verifier_walked == pre.row_count        (every row walked)

The chain-head byte-exact assertion is the load-bearing one. Migration
000047 hashes each row's canonical payload with
`to_char(timestamp AT TIME ZONE 'UTC', 'YYYY-MM-DD"T"HH24:MI:SS.US"Z"')`;
any TIMESTAMPTZ-precision loss in the dump/restore path (a real concern
across major Postgres upgrades or with --format=plain) would corrupt
the hash. The point of testing is to PROVE the property, not to defend
against a known quirk.

Files
=====

- .github/workflows/backup-restore.yml: Mondays 07:00 UTC +
  workflow_dispatch. Postgres service container; Go 1.25.10;
  contents:read; 15-min timeout. Action SHAs pinned to match ci.yml's
  pinning convention.
- deploy/test/backup-restore-smoke.sh: bash orchestrator: preflight
  (postgresql-client + Go + python3 on PATH); wait-for-ready loop;
  DROP SCHEMA + workload + dump + DROP SCHEMA + restore + verify +
  python3 JSON diff. ::error:: prefix on any assertion failure. Same
  script runs unchanged locally against any reachable Postgres.
- deploy/test/backupsmoke/main.go: Go program with --mode=workload and
  --mode=verify. Imports the repo's
  internal/repository/postgres.RunMigrations and emits a small JSON
  snapshot to stdout. INSERT shape mirrors
  internal/repository/postgres/audit_chain_test.go.
- docs/operator/runbooks/postgres-backup.md: adds a 'CI restore
  verification' subsection after the existing quarterly-dry-run
  section, points at the new workflow + harness + smoke program, bumps
  the last-reviewed marker.

Verified locally: gofmt clean, go vet clean, staticcheck clean,
`go build ./deploy/test/backupsmoke` succeeds, bash -n on the shell
harness, python3 -c yaml.safe_load on the workflow, dry-run of the
JSON-diff python block on synthetic pre.json/post.json covers both
PASS and ::error:: paths.
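
The four assertions listed in the commit message can be sketched as a small comparison function. This is an illustration only, not the python3 block that ships in deploy/test/backup-restore-smoke.sh: the field names follow the pre./post. names used above, while the flat JSON shape and the function names `compare_snapshots` and `report` are assumptions made for the example.

```python
def compare_snapshots(pre: dict, post: dict) -> list:
    """Return human-readable assertion failures; an empty list means PASS.

    Assumes (hypothetically) that both snapshots are flat JSON objects
    keyed by the names used in the commit message.
    """
    failures = []
    if pre.get("row_count") != post.get("row_count"):
        failures.append(
            f"row_count mismatch: pre={pre.get('row_count')} "
            f"post={post.get('row_count')}"
        )
    # Byte-exact comparison of the chain head: plain string equality.
    if pre.get("chain_head_hash") != post.get("chain_head_hash"):
        failures.append("chain_head_hash differs across dump/restore")
    # An empty (or missing) first_break_id means the verifier found no break.
    if post.get("first_break_id"):
        failures.append(f"chain break at id {post['first_break_id']}")
    # Every row that existed before the backup must have been walked.
    if post.get("verifier_walked") != pre.get("row_count"):
        failures.append(
            f"verifier walked {post.get('verifier_walked')} rows, "
            f"expected {pre.get('row_count')}"
        )
    return failures


def report(failures: list) -> int:
    """Print failures with the ::error:: prefix and return an exit code."""
    for msg in failures:
        print(f"::error::{msg}")
    if not failures:
        print("PASS: snapshots match")
    return 1 if failures else 0
```

A caller would load pre.json and post.json with `json.load`, pass both dicts to `compare_snapshots`, and hand the result of `report` to `sys.exit`. Returning the failures as a list rather than stopping at the first one means a single run reports every broken property at once.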
@@ -1,6 +1,6 @@
 # Runbook: PostgreSQL backup for certctl
 
-> Last reviewed: 2026-05-16
+> Last reviewed: 2026-05-16 (Sprint 4 ACQ — CI restore verification subsection added)
 
 Use this when:
 - You're setting up a new certctl deployment and need a backup policy
@@ -198,6 +198,42 @@ to your quarterly on-call rotation:
 The [disaster-recovery runbook](disaster-recovery.md) covers what to
 do when this dry-run reveals a gap.
 
+## CI restore verification
+
+> Acquisition-audit DEPL-005 + DATA-012 closure (Sprint 4 ACQ,
+> 2026-05-16). The quarterly dry-run above is the operator-side
+> proof; the workflow below is the upstream-side proof.
+
+The certctl repo ships a weekly GitHub Actions workflow that
+exercises the **exact** pg_dump shape this runbook recommends
+(`--format=custom --no-owner --no-acl`) against a real Postgres
+container, then asserts the audit_events hash chain round-trips
+byte-for-byte across the dump → restore boundary. A regression in
+the dump format, in a Postgres minor bump, or in migration 000047's
+canonical-payload serialization would surface in the next Monday
+run instead of on a customer's restore day.
+
+- **Workflow:** [`.github/workflows/backup-restore.yml`](../../../.github/workflows/backup-restore.yml)
+— Mondays 07:00 UTC + `workflow_dispatch`. Postgres service
+container pinned to the same SHA256 digest as
+`deploy/docker-compose.yml`.
+- **Harness:** [`deploy/test/backup-restore-smoke.sh`](../../../deploy/test/backup-restore-smoke.sh)
+— runs the workload → `pg_dump -Fc` → `DROP SCHEMA public CASCADE`
+→ `pg_restore` → verify cycle. Locally runnable against any
+reachable Postgres (it DROPs the schema, so do not point it at
+data you care about).
+- **Workload + verifier:** [`deploy/test/backupsmoke/main.go`](../../../deploy/test/backupsmoke/main.go)
+— generates 24 synthetic `audit_events` rows representing an
+issue/renew/revoke/auth-login cycle, snapshots the chain head
+before the backup, and after restore runs
+`audit_events_verify_chain()` to confirm `first_break_id IS NULL`.
+
+The CI workflow is not a replacement for the quarterly operator
+dry-run — it does not exercise the operator-managed file material
+(CA keys, RA keys, trust anchors) listed in the "What to back up"
+table above. Treat it as the dump-shape regression test; the
+quarterly run remains the full-restore correctness test.
+
 ## Related reading
 
 - [`docs/operator/runbooks/disaster-recovery.md`](disaster-recovery.md) — the restore companion
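
To see why the byte-for-byte chain-head comparison is sensitive to timestamp precision, here is a toy hash chain in Python. Only the timestamp format is taken from the commit message (UTC, microsecond precision, literal "T" and "Z"). Everything else is an assumption made for illustration: the use of SHA-256, the `prev|action|timestamp` payload layout, and the helper names. The real canonical payload is defined by migration 000047 and is not reproduced here.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def canonical_ts(ts: datetime) -> str:
    """Format a timezone-aware timestamp as UTC with microsecond precision,
    mirroring the to_char pattern quoted in the commit message."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")


def chain_head(rows, genesis: str = "") -> str:
    """Fold (action, timestamp) rows into one head hash.

    Hypothetical layout: each link hashes the previous hash, the action,
    and the canonical timestamp, so a change to any row alters the head.
    """
    prev = genesis
    for action, ts in rows:
        payload = f"{prev}|{action}|{canonical_ts(ts)}"
        prev = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return prev


def truncate_to_millis(ts: datetime) -> datetime:
    """Simulate a lossy round trip that keeps only millisecond precision."""
    return ts.replace(microsecond=ts.microsecond // 1000 * 1000)


base = datetime(2026, 5, 16, 7, 0, 0, 123456, tzinfo=timezone.utc)
actions = ["issue", "renew", "revoke", "auth-login"]
rows = [(a, base + timedelta(seconds=i)) for i, a in enumerate(actions)]

original_head = chain_head(rows)
lossless_head = chain_head(list(rows))  # faithful restore
lossy_head = chain_head([(a, truncate_to_millis(t)) for a, t in rows])

print("lossless restore matches:", lossless_head == original_head)
print("lossy restore matches:   ", lossy_head == original_head)
```

In this sketch the row count is identical in both restores, so a row-count check alone would pass the lossy case. Only the head-hash comparison distinguishes the two.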