Files
certctl/docs/operator
shankar0123 374ec574c5 feat(ci): DEPL-005 + DATA-012 — weekly backup/restore smoke + audit-chain round-trip assertion
Acquisition-audit DEPL-005 (backup runbook exists but no CI restore
test) + DATA-012 closure (Sprint 4 ACQ, 2026-05-16).

A backup procedure that has never been restore-tested is not a backup
procedure. The Helm CronJob at deploy/helm/certctl/templates/backup-
cronjob.yaml and the operator runbook at
docs/operator/runbooks/postgres-backup.md both document a
`pg_dump -Fc --no-owner --no-acl`-based backup strategy, but the
dump shape has never been restored end-to-end under CI. This sprint
adds the missing assertion.

Each Monday at 07:00 UTC (1h offset from loadtest.yml's 06:00 slot so
the two jobs don't fight for runners), boot a real postgres:16-alpine
service container pinned to the SAME sha256 digest as
deploy/docker-compose.yml, exercise the audit_events hash chain
with 24 synthetic rows representing an issue/renew/revoke/auth-login
cycle, take a custom-format dump, DROP SCHEMA public CASCADE
(simulating an operator-side data-loss event), pg_restore, and
assert:

  pre.row_count        == post.row_count
  pre.chain_head_hash  == post.chain_head_hash    (BYTE-EXACT)
  post.first_break_id  == ""                      (verify_chain clean)
  post.verifier_walked == pre.row_count           (every row walked)

The chain-head byte-exact assertion is the load-bearing one.
Migration 000047 hashes each row's canonical payload with
`to_char(timestamp AT TIME ZONE 'UTC',
'YYYY-MM-DD"T"HH24:MI:SS.US"Z"')` — any TIMESTAMPTZ-precision loss
in the dump/restore path (a real concern across major Postgres
upgrades or with --format=plain) would corrupt the hash. The point
of testing is to PROVE the property, not to defend against a known
quirk.

Files
=====
- .github/workflows/backup-restore.yml — Mondays 07:00 UTC +
  workflow_dispatch. Postgres service container; Go 1.25.10;
  contents:read; 15-min timeout. Action SHAs pinned to match
  ci.yml's pinning convention.
- deploy/test/backup-restore-smoke.sh — bash orchestrator: preflight
  (postgresql-client + Go + python3 on PATH); wait-for-ready loop;
  DROP SCHEMA + workload + dump + DROP SCHEMA + restore + verify
  + python3 JSON diff. ::error:: prefix on any assertion failure.
  Same script runs unchanged locally against any reachable Postgres.
- deploy/test/backupsmoke/main.go — Go program with --mode=workload
  and --mode=verify. Imports the repo's
  internal/repository/postgres.RunMigrations and emits a small JSON
  snapshot to stdout. INSERT shape mirrors
  internal/repository/postgres/audit_chain_test.go.
- docs/operator/runbooks/postgres-backup.md — adds a 'CI restore
  verification' subsection after the existing quarterly-dry-run
  section, points at the new workflow + harness + smoke program,
  bumps the last-reviewed marker.

Verified locally: gofmt clean, go vet clean, staticcheck clean,
`go build ./deploy/test/backupsmoke` succeeds, bash -n on the shell
harness, python3 -c yaml.safe_load on the workflow, dry-run of the
JSON-diff python block on synthetic pre.json/post.json covers both
PASS and ::error:: paths.
2026-05-16 17:27:57 +00:00
..
2026-05-05 18:18:29 +00:00