mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 22:51:30 +00:00
374ec574c5
Acquisition-audit DEPL-005 (backup runbook exists but no CI restore test) + DATA-012 closure (Sprint 4 ACQ, 2026-05-16). A backup procedure that has never been restore-tested is not a backup procedure. The Helm CronJob at deploy/helm/certctl/templates/backup- cronjob.yaml and the operator runbook at docs/operator/runbooks/postgres-backup.md both document a `pg_dump -Fc --no-owner --no-acl`-based backup strategy, but the dump shape has never been restored end-to-end under CI. This sprint adds the missing assertion. Each Monday at 07:00 UTC (1h offset from loadtest.yml's 06:00 slot so the two jobs don't fight for runners), boot a real postgres:16-alpine service container pinned to the SAME sha256 digest as deploy/docker-compose.yml, exercise the audit_events hash chain with 24 synthetic rows representing an issue/renew/revoke/auth-login cycle, take a custom-format dump, DROP SCHEMA public CASCADE (simulating an operator-side data-loss event), pg_restore, and assert: pre.row_count == post.row_count pre.chain_head_hash == post.chain_head_hash (BYTE-EXACT) post.first_break_id == "" (verify_chain clean) post.verifier_walked == pre.row_count (every row walked) The chain-head byte-exact assertion is the load-bearing one. Migration 000047 hashes each row's canonical payload with `to_char(timestamp AT TIME ZONE 'UTC', 'YYYY-MM-DD"T"HH24:MI:SS.US"Z"')` — any TIMESTAMPTZ-precision loss in the dump/restore path (a real concern across major Postgres upgrades or with --format=plain) would corrupt the hash. The point of testing is to PROVE the property, not to defend against a known quirk. Files ===== - .github/workflows/backup-restore.yml — Mondays 07:00 UTC + workflow_dispatch. Postgres service container; Go 1.25.10; contents:read; 15-min timeout. Action SHAs pinned to match ci.yml's pinning convention. - deploy/test/backup-restore-smoke.sh — bash orchestrator: preflight (postgresql-client + Go + python3 on PATH); wait-for-ready loop; DROP SCHEMA + workload + dump + DROP SCHEMA + restore + verify + python3 JSON diff. ::error:: prefix on any assertion failure. Same script runs unchanged locally against any reachable Postgres. - deploy/test/backupsmoke/main.go — Go program with --mode=workload and --mode=verify. Imports the repo's internal/repository/postgres.RunMigrations and emits a small JSON snapshot to stdout. INSERT shape mirrors internal/repository/postgres/audit_chain_test.go. - docs/operator/runbooks/postgres-backup.md — adds a 'CI restore verification' subsection after the existing quarterly-dry-run section, points at the new workflow + harness + smoke program, bumps the last-reviewed marker. Verified locally: gofmt clean, go vet clean, staticcheck clean, `go build ./deploy/test/backupsmoke` succeeds, bash -n on the shell harness, python3 -c yaml.safe_load on the workflow, dry-run of the JSON-diff python block on synthetic pre.json/post.json covers both PASS and ::error:: paths.