certctl/.github/workflows/backup-restore.yml
shankar0123 374ec574c5 feat(ci): DEPL-005 + DATA-012 — weekly backup/restore smoke + audit-chain round-trip assertion
Acquisition-audit DEPL-005 (backup runbook exists but no CI restore
test) + DATA-012 closure (Sprint 4 ACQ, 2026-05-16).

A backup procedure that has never been restore-tested is not a backup
procedure. The Helm CronJob at
deploy/helm/certctl/templates/backup-cronjob.yaml and the operator
runbook at docs/operator/runbooks/postgres-backup.md both document a
`pg_dump -Fc --no-owner --no-acl`-based backup strategy, but the
dump shape has never been restored end-to-end under CI. This sprint
adds the missing assertion.

Each Monday at 07:00 UTC (1h offset from loadtest.yml's 06:00 slot so
the two jobs don't fight for runners), boot a real postgres:16-alpine
service container pinned to the SAME sha256 digest as
deploy/docker-compose.yml, exercise the audit_events hash chain
with 24 synthetic rows representing an issue/renew/revoke/auth-login
cycle, take a custom-format dump, DROP SCHEMA public CASCADE
(simulating an operator-side data-loss event), pg_restore, and
assert:

  pre.row_count        == post.row_count
  pre.chain_head_hash  == post.chain_head_hash    (BYTE-EXACT)
  post.first_break_id  == ""                      (verify_chain clean)
  post.verifier_walked == pre.row_count           (every row walked)
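
The four assertions above can be sketched as a small JSON comparison.
This is a minimal illustration, not the harness's actual python3
block: the field names come from the list above, while the function
name `diff_snapshots` and the sample values are assumptions.

```python
import json


def diff_snapshots(pre, post):
    """Return ::error:: lines; an empty list means the round trip passed."""
    errors = []
    if pre["row_count"] != post["row_count"]:
        errors.append(
            f"::error::row_count {pre['row_count']} != {post['row_count']}"
        )
    # Byte-exact comparison of the chain head, no normalisation applied.
    if pre["chain_head_hash"] != post["chain_head_hash"]:
        errors.append("::error::chain_head_hash changed across dump/restore")
    if post["first_break_id"] != "":
        errors.append(f"::error::chain breaks at {post['first_break_id']}")
    if post["verifier_walked"] != pre["row_count"]:
        errors.append("::error::verifier did not walk every row")
    return errors


# Hypothetical snapshots standing in for pre.json / post.json.
pre = json.loads(
    '{"row_count": 24, "chain_head_hash": "ab12",'
    ' "first_break_id": "", "verifier_walked": 24}'
)
post = dict(pre)
print(diff_snapshots(pre, post) or "PASS")
```

Collecting every failure before reporting, rather than stopping at the
first, keeps a single CI run informative when several properties
break at once.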

The chain-head byte-exact assertion is the load-bearing one.
Migration 000047 hashes each row's canonical payload with
`to_char(timestamp AT TIME ZONE 'UTC',
'YYYY-MM-DD"T"HH24:MI:SS.US"Z"')` — any TIMESTAMPTZ-precision loss
in the dump/restore path (a real concern across major Postgres
upgrades or with --format=plain) would corrupt the hash. The point
of testing is to PROVE the property, not to defend against a known
quirk.
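
To make the precision concern concrete, here is a hedged sketch of
why sub-millisecond truncation would change the chain head. The
timestamp format mirrors the to_char pattern quoted above; the
payload layout, the use of SHA-256, and the names `canonical_ts`,
`row_hash`, and `chain_head` are assumptions for illustration, not
the contents of migration 000047.

```python
import hashlib
from datetime import datetime, timezone


def canonical_ts(ts):
    """Python analogue of the quoted to_char pattern (microsecond precision)."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")


def row_hash(prev_hash, action, ts):
    # Hypothetical payload layout: previous hash, action, canonical timestamp.
    payload = f"{prev_hash}|{action}|{canonical_ts(ts)}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def chain_head(rows):
    """Fold every row into the chain and return the final hash."""
    head = ""
    for action, ts in rows:
        head = row_hash(head, action, ts)
    return head


original = [
    ("issue", datetime(2026, 5, 16, 17, 27, 57, 123456, tzinfo=timezone.utc)),
    ("renew", datetime(2026, 5, 16, 17, 27, 58, 654321, tzinfo=timezone.utc)),
]
# Simulate a restore path that keeps only millisecond precision.
truncated = [
    (action, ts.replace(microsecond=ts.microsecond // 1000 * 1000))
    for action, ts in original
]

print(canonical_ts(original[0][1]))
print(chain_head(original) == chain_head(truncated))
```

Because each row's hash feeds the next, one altered timestamp
anywhere in the table changes the head, which is why a single
byte-exact comparison covers the whole chain.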

Files
=====
- .github/workflows/backup-restore.yml — Mondays 07:00 UTC +
  workflow_dispatch. Postgres service container; Go 1.25.10;
  contents:read; 15-min timeout. Action SHAs pinned to match
  ci.yml's pinning convention.
- deploy/test/backup-restore-smoke.sh — bash orchestrator: preflight
  (postgresql-client + Go + python3 on PATH); wait-for-ready loop;
  DROP SCHEMA + workload + dump + DROP SCHEMA + restore + verify
  + python3 JSON diff. ::error:: prefix on any assertion failure.
  Same script runs unchanged locally against any reachable Postgres.
- deploy/test/backupsmoke/main.go — Go program with --mode=workload
  and --mode=verify. Imports the repo's
  internal/repository/postgres.RunMigrations and emits a small JSON
  snapshot to stdout. INSERT shape mirrors
  internal/repository/postgres/audit_chain_test.go.
- docs/operator/runbooks/postgres-backup.md — adds a 'CI restore
  verification' subsection after the existing quarterly-dry-run
  section, points at the new workflow + harness + smoke program,
  bumps the last-reviewed marker.
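
The wait-for-ready loop mentioned for the shell harness could look
roughly like the following. This is a sketch in Python rather than
the script's own bash, and the function names, attempt count, and
delay are assumptions. It relies only on pg_isready exiting 0 when
the server accepts connections and on pg_isready reading
PGHOST/PGPORT from the environment.

```python
import subprocess
import time


def pg_isready_probe():
    """Return True when pg_isready reports the server accepts connections."""
    result = subprocess.run(
        ["pg_isready"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for_postgres(probe, attempts=30, delay=1.0, sleep=time.sleep):
    """Poll probe() until it succeeds; return the attempt number that did."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        sleep(delay)
    raise TimeoutError(f"Postgres not ready after {attempts} attempts")


# Demonstration with a fake probe that succeeds on the third call, so
# the sketch runs without a database. Pass pg_isready_probe for real use.
fake_results = iter([False, False, True])
print(wait_for_postgres(lambda: next(fake_results), delay=0))
```

Injecting the probe and the sleep function keeps the loop testable
without a running database or real delays.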

Verified locally: gofmt clean, go vet clean, staticcheck clean,
`go build ./deploy/test/backupsmoke` succeeds, bash -n on the shell
harness, python3 -c yaml.safe_load on the workflow, dry-run of the
JSON-diff python block on synthetic pre.json/post.json covers both
PASS and ::error:: paths.
2026-05-16 17:27:57 +00:00


# Acquisition-audit DEPL-005 + DATA-012 closure (Sprint 4 ACQ,
# 2026-05-16). Weekly backup-restore smoke test.
#
# Why
# ===
# The Helm CronJob at deploy/helm/certctl/templates/backup-cronjob.yaml
# and the operator runbook at docs/operator/runbooks/postgres-backup.md
# both document a pg_dump -Fc -based backup strategy, but the dump has
# never been restored end-to-end under CI. A backup procedure that has
# never been restore-tested is not a backup procedure. This workflow
# adds the missing assertion.
#
# What
# ====
# Each Monday at 07:00 UTC (1h offset from loadtest.yml's 06:00 UTC
# slot so they don't fight for runners), boot a real postgres:16-alpine
# container pinned to the same digest as the production
# deploy/docker-compose.yml, exercise the audit_events hash chain
# with a small synthetic workload, pg_dump the database, drop the
# schema, pg_restore, and assert that the chain head and row count
# round-trip byte-for-byte.
#
# The chain head round-trip property is the load-bearing assertion.
# Migration 000047 hashes each audit_events row's canonical payload
# with `to_char(timestamp AT TIME ZONE 'UTC',
# 'YYYY-MM-DD"T"HH24:MI:SS.US"Z"')`. Any TIMESTAMPTZ-precision loss
# in the dump→restore path (a real concern across major Postgres
# upgrades or with --format=plain) would corrupt the hash. The whole
# point of testing instead of trusting docs is to PROVE the property
# under a real workload.
#
# Workflow boundaries
# ===================
# - Does not exercise PITR / WAL archiving (DR runbook owns that).
# - Does not exercise the Helm CronJob's S3 sink or scheduling
#   (operator-side concern, not a property of the dump shape).
# - Does not deploy or boot the certctl-server itself — the smoke
#   harness talks to Postgres directly; we're testing the dump,
#   not the server.
name: backup-restore-smoke
on:
  # Manual trigger from the Actions tab — useful before tagging a
  # release that touches the audit_events schema, or after a dep
  # bump that could affect canonical-payload formatting.
  workflow_dispatch:
  schedule:
    # Mondays at 07:00 UTC. Off-peak, off-set 1h from loadtest.yml
    # (06:00 UTC) so the two jobs don't fight for runners on the
    # GitHub-hosted ubuntu-latest pool.
    - cron: '0 7 * * 1'
# Defense-in-depth: this job reads source and exercises a database;
# it never needs write access to PRs, branches, releases, or
# packages. Pin permissions to the minimum.
permissions:
  contents: read
jobs:
  backup-restore:
    name: pg_dump / pg_restore smoke
    runs-on: ubuntu-latest
    # 15-minute hard cap. The actual workload + dump + restore + verify
    # cycle runs in well under a minute on a warm runner; 15 minutes
    # absorbs cold image pulls, slow runner provisioning, and the
    # Postgres service-container readiness wait without letting a stuck
    # job consume the runner indefinitely.
    timeout-minutes: 15
    # Postgres service container. Pin to the same digest as
    # deploy/docker-compose.yml so the smoke runs against the exact
    # image the production deploy uses — a regression that surfaces
    # only on a specific Postgres minor bump shows up here on the
    # next image refresh in compose, not silently on a customer site.
    services:
      postgres:
        image: postgres:16-alpine@sha256:890480b08124ce7f79960a9bb16fe39729aa302bd384bfd7c408fee6c8f7adb7
        env:
          POSTGRES_DB: certctl
          POSTGRES_USER: certctl
          POSTGRES_PASSWORD: certctl
        ports:
          - 5432:5432
        # GitHub's services-container health check. The smoke shell
        # also waits for pg_isready as a belt-and-suspenders guard.
        options: >-
          --health-cmd "pg_isready -U certctl -d certctl"
          --health-interval 5s
          --health-timeout 3s
          --health-retries 10
    steps:
      - name: Checkout
        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
      - name: Set up Go
        uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5
        with:
          go-version: '1.25.10'
          # Cache go-build + go-mod for the weekly run. Keep the
          # cache key bound to go.sum so a dep bump invalidates it.
          cache: true
      - name: Run backup-restore smoke
        env:
          PGHOST: 127.0.0.1
          PGPORT: '5432'
          PGUSER: certctl
          PGPASSWORD: certctl
          PGDATABASE: certctl
          # Insert enough rows to exercise the chain over a non-trivial
          # length. 24 ≫ 1 — large enough to surface ordering bugs,
          # small enough that the dump finishes in seconds.
          SMOKE_ROWS: '24'
        run: bash deploy/test/backup-restore-smoke.sh