From 374ec574c5606908b4bc0f55de4eaa71003d4bfc Mon Sep 17 00:00:00 2001
From: shankar0123 <skreddy040@gmail.com>
Date: Sat, 16 May 2026 17:27:57 +0000
Subject: [PATCH] =?UTF-8?q?feat(ci):=20DEPL-005=20+=20DATA-012=20=E2=80=94?=
 =?UTF-8?q?=20weekly=20backup/restore=20smoke=20+=20audit-chain=20round-tr?=
 =?UTF-8?q?ip=20assertion?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Acquisition-audit DEPL-005 (backup runbook exists but no CI restore
test) + DATA-012 closure (Sprint 4 ACQ, 2026-05-16).

A backup procedure that has never been restore-tested is not a backup
procedure. The Helm CronJob at
deploy/helm/certctl/templates/backup-cronjob.yaml and the operator
runbook at docs/operator/runbooks/postgres-backup.md both document a
`pg_dump -Fc --no-owner --no-acl`-based backup strategy, but the
dump shape has never been restored end-to-end under CI. This sprint
adds the missing assertion.

Each Monday at 07:00 UTC (1h offset from loadtest.yml's 06:00 slot so
the two jobs don't fight for runners), boot a real postgres:16-alpine
service container pinned to the SAME sha256 digest as
deploy/docker-compose.yml, exercise the audit_events hash chain
with 24 synthetic rows representing an issue/renew/revoke/auth-login
cycle, take a custom-format dump, DROP SCHEMA public CASCADE
(simulating an operator-side data-loss event), pg_restore, and
assert:

  pre.row_count        == post.row_count
  pre.chain_head_hash  == post.chain_head_hash    (BYTE-EXACT)
  post.first_break_id  == ""                      (verify_chain clean)
  post.verifier_walked == pre.row_count           (every row walked)
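
For readers unfamiliar with hash chains, the last two assertions can be
pictured with the sketch below. It is an illustration only, not the
logic of migration 000047: the SHA-256 choice, the "|" separator, the
GENESIS sentinel, and the helper names build_chain / verify_chain are
all assumptions made for this example, and the real verifier may report
its walked count differently.

```python
import hashlib

GENESIS = "0" * 64  # hypothetical sentinel meaning "no previous row"


def row_hash(prev_hash: str, payload: str) -> str:
    """Link a row to its predecessor (assumed: SHA-256 over prev|payload)."""
    return hashlib.sha256((prev_hash + "|" + payload).encode("utf-8")).hexdigest()


def build_chain(payloads):
    """Return (rows, head); each row is (id, payload, stored_hash)."""
    rows, prev = [], GENESIS
    for i, payload in enumerate(payloads):
        prev = row_hash(prev, payload)
        rows.append((f"audit-smoke-{i:04d}", payload, prev))
    return rows, prev


def verify_chain(rows):
    """Walk rows in order; return (first_break_id, walked)."""
    prev, walked = GENESIS, 0
    for row_id, payload, stored in rows:
        walked += 1
        if row_hash(prev, payload) != stored:
            return row_id, walked  # first row whose hash no longer matches
        prev = stored
    return "", walked


payloads = [f"certificate.issue|2026-05-16T07:00:00.{i:06d}Z" for i in range(24)]
rows, head = build_chain(payloads)

# A faithful restore: verifier is clean, every row walked, head unchanged.
restored = list(rows)
assert verify_chain(restored) == ("", 24)
assert restored[-1][2] == head

# A lossy restore: one payload lost its microseconds, stored hash untouched.
damaged = list(rows)
rid, payload, stored = damaged[5]
damaged[5] = (rid, payload.replace("000005Z", "000000Z"), stored)
print(verify_chain(damaged))  # ('audit-smoke-0005', 6)
```

In this toy model a clean walk plus an unchanged head is what the
first_break_id and verifier_walked checks express; a single altered
payload is reported at the first row whose recomputed hash disagrees
with the stored one.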

The chain-head byte-exact assertion is the load-bearing one.
Migration 000047 hashes each row's canonical payload with
`to_char(timestamp AT TIME ZONE 'UTC',
'YYYY-MM-DD"T"HH24:MI:SS.US"Z"')` — any TIMESTAMPTZ-precision loss
in the dump/restore path (a real concern across major Postgres
upgrades or with --format=plain) would corrupt the hash. The point
of testing is to PROVE the property, not to defend against a known
quirk.
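
Why precision matters can be shown with a small sketch. It mirrors only
the to_char format string quoted above; the payload layout
("action|timestamp"), the SHA-256 choice, and the helper names
canonical_ts / payload_hash are assumptions for illustration, not the
migration's actual canonical payload.

```python
import hashlib
from datetime import datetime, timezone


def canonical_ts(ts: datetime) -> str:
    """Python analogue of 'YYYY-MM-DD"T"HH24:MI:SS.US"Z"' rendered in UTC."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f") + "Z"


def payload_hash(action: str, ts: datetime) -> str:
    # Hypothetical payload layout; the real one lives in migration 000047.
    return hashlib.sha256(f"{action}|{canonical_ts(ts)}".encode("utf-8")).hexdigest()


original = datetime(2026, 5, 16, 7, 0, 0, 123456, tzinfo=timezone.utc)
# Simulate a dump/restore path that keeps only milliseconds.
truncated = original.replace(microsecond=(original.microsecond // 1000) * 1000)

print(canonical_ts(original))   # 2026-05-16T07:00:00.123456Z
print(canonical_ts(truncated))  # 2026-05-16T07:00:00.123000Z

# Three lost digits are enough to change the hash of the row.
assert payload_hash("certificate.issue", original) != payload_hash(
    "certificate.issue", truncated
)
```

Under these assumptions, any restore that rounds or truncates the
microsecond field yields a different canonical string and therefore a
different hash, which is the failure the byte-exact chain-head
comparison is meant to catch.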

Files
=====
- .github/workflows/backup-restore.yml — Mondays 07:00 UTC +
  workflow_dispatch. Postgres service container; Go 1.25.10;
  contents:read; 15-min timeout. Action SHAs pinned to match
  ci.yml's pinning convention.
- deploy/test/backup-restore-smoke.sh — bash orchestrator: preflight
  (postgresql-client + Go + python3 on PATH); wait-for-ready loop;
  DROP SCHEMA + workload + dump + DROP SCHEMA + restore + verify
  + python3 JSON diff. ::error:: prefix on any assertion failure.
  Same script runs unchanged locally against any reachable Postgres.
- deploy/test/backupsmoke/main.go — Go program with --mode=workload
  and --mode=verify. Imports the repo's
  internal/repository/postgres.RunMigrations and emits a small JSON
  snapshot to stdout. INSERT shape mirrors
  internal/repository/postgres/audit_chain_test.go.
- docs/operator/runbooks/postgres-backup.md — adds a 'CI restore
  verification' subsection after the existing quarterly-dry-run
  section, points at the new workflow + harness + smoke program,
  bumps the last-reviewed marker.

Verified locally: gofmt clean, go vet clean, staticcheck clean,
`go build ./deploy/test/backupsmoke` succeeds, bash -n on the shell
harness, python3 -c yaml.safe_load on the workflow, dry-run of the
JSON-diff python block on synthetic pre.json/post.json covers both
PASS and ::error:: paths.
---
 .github/workflows/backup-restore.yml      | 118 ++++++++++++
 deploy/test/backup-restore-smoke.sh       | 225 ++++++++++++++++++++++
 deploy/test/backupsmoke/main.go           | 222 +++++++++++++++++++++
 docs/operator/runbooks/postgres-backup.md |  38 +++-
 4 files changed, 602 insertions(+), 1 deletion(-)
 create mode 100644 .github/workflows/backup-restore.yml
 create mode 100755 deploy/test/backup-restore-smoke.sh
 create mode 100644 deploy/test/backupsmoke/main.go

diff --git a/.github/workflows/backup-restore.yml b/.github/workflows/backup-restore.yml
new file mode 100644
index 0000000..16ce5a6
--- /dev/null
+++ b/.github/workflows/backup-restore.yml
@@ -0,0 +1,118 @@
+# Acquisition-audit DEPL-005 + DATA-012 closure (Sprint 4 ACQ,
+# 2026-05-16). Weekly backup-restore smoke test.
+#
+# Why
+# ===
+# The Helm CronJob at deploy/helm/certctl/templates/backup-cronjob.yaml
+# and the operator runbook at docs/operator/runbooks/postgres-backup.md
+# both document a pg_dump -Fc -based backup strategy, but the dump has
+# never been restored end-to-end under CI. A backup procedure that has
+# never been restore-tested is not a backup procedure. This workflow
+# adds the missing assertion.
+#
+# What
+# ====
+# Each Monday at 07:00 UTC (1h offset from loadtest.yml's 06:00 UTC
+# slot so they don't fight for runners), boot a real
+# postgres:16-alpine container pinned to the same digest as the
+# production deploy/docker-compose.yml, exercise the audit_events
+# hash chain with a small synthetic workload, pg_dump the database,
+# drop the schema, pg_restore, and assert the chain head + row count
+# round-trip byte-for-byte.
+#
+# The chain head round-trip property is the load-bearing assertion.
+# Migration 000047 hashes each audit_events row's canonical payload
+# with `to_char(timestamp AT TIME ZONE 'UTC',
+# 'YYYY-MM-DD"T"HH24:MI:SS.US"Z"')`. Any TIMESTAMPTZ-precision loss
+# in the dump→restore path (a real concern across major Postgres
+# upgrades or with --format=plain) would corrupt the hash. The whole
+# point of testing instead of trusting docs is to PROVE the property
+# under a real workload.
+#
+# Workflow boundaries
+# ===================
+# - Does not exercise PITR / WAL archiving (DR runbook owns that).
+# - Does not exercise the Helm CronJob's S3 sink or scheduling
+#   (operator-side concern, not a property of the dump shape).
+# - Does not deploy or boot the certctl-server itself — the smoke
+#   harness talks to Postgres directly; we're testing the dump,
+#   not the server.
+
+name: backup-restore-smoke
+
+on:
+  # Manual trigger from the Actions tab — useful before tagging a
+  # release that touches the audit_events schema, or after a dep
+  # bump that could affect canonical-payload formatting.
+  workflow_dispatch:
+
+  schedule:
+    # Mondays at 07:00 UTC. Off-peak, offset 1h from loadtest.yml
+    # (06:00 UTC) so the two jobs don't fight for runners on the
+    # GitHub-hosted ubuntu-latest pool.
+    - cron: '0 7 * * 1'
+
+# Defense-in-depth: this job reads source and exercises a database;
+# it never needs write access to PRs, branches, releases, or
+# packages. Pin permissions to the minimum.
+permissions:
+  contents: read
+
+jobs:
+  backup-restore:
+    name: pg_dump / pg_restore smoke
+    runs-on: ubuntu-latest
+
+    # 15-minute hard cap. The actual workload + dump + restore + verify
+    # cycle runs in well under a minute on a warm runner; 15 minutes
+    # absorbs cold image pulls, slow runner provisioning, and the
+    # Postgres service-container readiness wait without letting a stuck
+    # job consume the runner indefinitely.
+    timeout-minutes: 15
+
+    # Postgres service container. Pin to the same digest as
+    # deploy/docker-compose.yml so the smoke runs against the exact
+    # image the production deploy uses — a regression that surfaces
+    # only on a specific Postgres minor bump shows up here on the
+    # next image refresh in compose, not silently on a customer site.
+    services:
+      postgres:
+        image: postgres:16-alpine@sha256:890480b08124ce7f79960a9bb16fe39729aa302bd384bfd7c408fee6c8f7adb7
+        env:
+          POSTGRES_DB: certctl
+          POSTGRES_USER: certctl
+          POSTGRES_PASSWORD: certctl
+        ports:
+          - 5432:5432
+        # GitHub's services-container health check. The smoke shell
+        # also waits for pg_isready as a belt-and-suspenders guard.
+        options: >-
+          --health-cmd "pg_isready -U certctl -d certctl"
+          --health-interval 5s
+          --health-timeout 3s
+          --health-retries 10
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4
+
+      - name: Set up Go
+        uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff  # v5
+        with:
+          go-version: '1.25.10'
+          # Cache go-build + go-mod for the weekly run. Keep the
+          # cache key bound to go.sum so a dep bump invalidates it.
+          cache: true
+
+      - name: Run backup-restore smoke
+        env:
+          PGHOST: 127.0.0.1
+          PGPORT: '5432'
+          PGUSER: certctl
+          PGPASSWORD: certctl
+          PGDATABASE: certctl
+          # Insert enough rows to exercise the chain over a non-trivial
+          # length. 24 ≫ 1 — large enough to surface ordering bugs,
+          # small enough that the dump finishes in seconds.
+          SMOKE_ROWS: '24'
+        run: bash deploy/test/backup-restore-smoke.sh
diff --git a/deploy/test/backup-restore-smoke.sh b/deploy/test/backup-restore-smoke.sh
new file mode 100755
index 0000000..12e7842
--- /dev/null
+++ b/deploy/test/backup-restore-smoke.sh
@@ -0,0 +1,225 @@
+#!/usr/bin/env bash
+# Copyright 2026 certctl LLC. All rights reserved.
+# SPDX-License-Identifier: BUSL-1.1
+#
+# Acquisition-audit DEPL-005 + DATA-012 closure (Sprint 4 ACQ,
+# 2026-05-16). Backup/restore smoke harness — orchestrates a real
+# pg_dump -Fc → DROP SCHEMA → CREATE SCHEMA → pg_restore loop
+# around the audit_events hash chain and asserts the chain head
+# round-trips byte-for-byte.
+#
+# This script is the body of the `.github/workflows/backup-restore.yml`
+# weekly job AND the same thing an operator can run locally against a
+# running Postgres to gain confidence before a real restore.
+#
+# Prereqs
+# =======
+# - psql / pg_dump / pg_restore installed and on PATH (ubuntu-latest
+#   ships postgresql-client by default; on macOS use Homebrew's
+#   libpq).
+# - A reachable Postgres at $PGHOST:$PGPORT, plus the certctl user +
+#   database created. In CI we point this at the GHA service container
+#   (postgres:16-alpine, pinned to the same digest as
+#   deploy/docker-compose.yml). Locally, point it wherever — the
+#   script DROPs the public schema of the database it connects to,
+#   so DO NOT POINT THIS AT A DATABASE YOU CARE ABOUT.
+# - Go 1.25+ on PATH so the smoke program can be built. (CI's
+#   setup-go step handles this.)
+# - jq is NOT required — JSON snapshots are compared via python3.
+#
+# Behavior contract
+# =================
+# - On success: exit 0, prints "PASS" + a summary line.
+# - On any assertion failure: prints `::error::<reason>`, exits 1.
+#   (The ::error:: prefix is the GitHub Actions log-annotation shape;
+#    it surfaces as a red banner in the Actions run UI.)
+#
+# Non-goals
+# =========
+# - Does not exercise PITR / WAL archiving. The Sprint 4 scope is the
+#   pg_dump/pg_restore path only; managed-DB PITR is the operator's
+#   responsibility per docs/operator/runbooks/postgres-backup.md.
+# - Does not regenerate the audit chain after restore. A "restore
+#   that rewrote history" would mask exactly the bug under test.
+
+set -euo pipefail
+
+REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
+WORKDIR="$(mktemp -d)"
+trap 'rm -rf "$WORKDIR"' EXIT
+
+# ----------------------------------------------------------------------
+# Configuration — every knob is env-overridable so the same script
+# runs unchanged in CI (where the GHA service container exposes
+# 127.0.0.1:5432) and on an operator's laptop (where they may have
+# Postgres on a UNIX socket or a different port).
+# ----------------------------------------------------------------------
+: "${PGHOST:=127.0.0.1}"
+: "${PGPORT:=5432}"
+: "${PGUSER:=certctl}"
+: "${PGPASSWORD:=certctl}"
+: "${PGDATABASE:=certctl}"
+: "${SMOKE_ROWS:=24}"
+: "${MIGRATIONS_PATH:=${REPO_ROOT}/migrations}"
+
+# psql/pg_dump/pg_restore all read PG* env vars. Export so we don't
+# have to spell them out on every command line.
+export PGHOST PGPORT PGUSER PGPASSWORD PGDATABASE
+
+DB_URL="postgres://${PGUSER}:${PGPASSWORD}@${PGHOST}:${PGPORT}/${PGDATABASE}?sslmode=disable"
+
+fail() {
+	# GitHub Actions log annotation. The `::error::` prefix is what
+	# the Actions UI uses to highlight a line in the run log.
+	echo "::error::backup-restore-smoke: $*" >&2
+	exit 1
+}
+
+step() { printf '\n=== %s ===\n' "$*"; }
+
+# ----------------------------------------------------------------------
+# Sanity preflight
+# ----------------------------------------------------------------------
+step "preflight"
+command -v psql       >/dev/null || fail "psql not on PATH (install postgresql-client)"
+command -v pg_dump    >/dev/null || fail "pg_dump not on PATH"
+command -v pg_restore >/dev/null || fail "pg_restore not on PATH"
+command -v go         >/dev/null || fail "go not on PATH (need Go to build the smoke program)"
+command -v python3    >/dev/null || fail "python3 not on PATH (used for JSON diff)"
+test -d "${MIGRATIONS_PATH}" || fail "migrations dir not found: ${MIGRATIONS_PATH}"
+
+# Wait for Postgres readiness up to 60s. pg_isready returns 0 when
+# the server is accepting connections, so the loop is the canonical
+# CI-friendly "wait for the service container" pattern.
+step "waiting for postgres at ${PGHOST}:${PGPORT}"
+for _ in $(seq 1 60); do
+	if pg_isready -h "${PGHOST}" -p "${PGPORT}" -U "${PGUSER}" -d "${PGDATABASE}" -q; then
+		break
+	fi
+	sleep 1
+done
+pg_isready -h "${PGHOST}" -p "${PGPORT}" -U "${PGUSER}" -d "${PGDATABASE}" -q \
+	|| fail "postgres not ready after 60s at ${PGHOST}:${PGPORT}"
+
+# Wipe any prior state in the target DB. A previous failed run could
+# have left rows behind; the smoke contract is "starts from clean."
+step "wiping ${PGDATABASE} schema (DROP SCHEMA public CASCADE; CREATE SCHEMA public)"
+psql -v ON_ERROR_STOP=1 -c 'DROP SCHEMA IF EXISTS public CASCADE; CREATE SCHEMA public; GRANT ALL ON SCHEMA public TO PUBLIC;'
+
+# ----------------------------------------------------------------------
+# Build the smoke program. We `go build` into ${WORKDIR}, which the
+# EXIT trap removes, so no binary is left behind and both phases
+# reuse one compiled artifact instead of paying for two `go run`s.
+# ----------------------------------------------------------------------
+step "building smoke program"
+cd "${REPO_ROOT}"
+go build -o "${WORKDIR}/smoke" ./deploy/test/backupsmoke
+
+# ----------------------------------------------------------------------
+# Phase 1 — workload: migrate, insert rows, snapshot chain head.
+# ----------------------------------------------------------------------
+step "phase 1 — workload (${SMOKE_ROWS} audit_events rows)"
+"${WORKDIR}/smoke" \
+	--mode=workload \
+	--db-url="${DB_URL}" \
+	--migrations-path="${MIGRATIONS_PATH}" \
+	--rows="${SMOKE_ROWS}" \
+	| tee "${WORKDIR}/pre.json"
+
+# ----------------------------------------------------------------------
+# Phase 2 — backup. Canonical pg_dump shape per
+# deploy/helm/certctl/templates/backup-cronjob.yaml: --format=custom,
+# --no-owner, --no-acl. --no-owner / --no-acl keep the dump portable
+# across Postgres installations with different role layouts (the
+# audit-trail hash chain is data, not ACL state).
+# ----------------------------------------------------------------------
+step "phase 2 — pg_dump -Fc"
+pg_dump --format=custom --no-owner --no-acl --dbname="${PGDATABASE}" --file="${WORKDIR}/backup.dump"
+test -s "${WORKDIR}/backup.dump" || fail "pg_dump produced an empty file"
+
+# ----------------------------------------------------------------------
+# Phase 3 — wipe. The fresh-schema approach is the closest analogue
+# to "operator nuked the wrong volume." DROP DATABASE would require
+# connecting to a different DB plus a reconnect dance; DROP SCHEMA
+# achieves the same "no rows, no tables, no functions" end state
+# against the same database and is restore-compatible (pg_dump -Fc
+# bundles the object DDL in the dump, so pg_restore recreates it).
+# ----------------------------------------------------------------------
+step "phase 3 — drop schema (simulating data-loss event)"
+psql -v ON_ERROR_STOP=1 -c 'DROP SCHEMA IF EXISTS public CASCADE; CREATE SCHEMA public; GRANT ALL ON SCHEMA public TO PUBLIC;'
+
+# Sanity: confirm audit_events is actually gone before restore. A
+# regression here (e.g. DROP SCHEMA silently becoming a no-op) would
+# let the verifier "succeed" by reading the original rows, making
+# the test a false pass.
+PRE_RESTORE_TABLES=$(psql -tAc "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public'")
+if [ "${PRE_RESTORE_TABLES}" -ne 0 ]; then
+	fail "post-DROP SCHEMA, expected 0 public tables; saw ${PRE_RESTORE_TABLES}"
+fi
+
+# ----------------------------------------------------------------------
+# Phase 4 — restore.
+# ----------------------------------------------------------------------
+step "phase 4 — pg_restore"
+pg_restore --dbname="${PGDATABASE}" --no-owner --no-acl --exit-on-error "${WORKDIR}/backup.dump"
+
+# ----------------------------------------------------------------------
+# Phase 5 — verify: re-snapshot, run audit_events_verify_chain().
+# ----------------------------------------------------------------------
+step "phase 5 — verify (audit_events_verify_chain() + snapshot)"
+"${WORKDIR}/smoke" \
+	--mode=verify \
+	--db-url="${DB_URL}" \
+	| tee "${WORKDIR}/post.json"
+
+# ----------------------------------------------------------------------
+# Phase 6 — assert.
+#
+#   pre.row_count       == post.row_count
+#   pre.chain_head_hash == post.chain_head_hash   (BYTE-EXACT)
+#   post.first_break_id == ""                     (verifier clean)
+#   post.verifier_walked == pre.row_count         (every row walked)
+#
+# Use python3 rather than jq so the script runs unchanged on macOS
+# without an extra Homebrew install.
+# ----------------------------------------------------------------------
+step "phase 6 — assertions"
+python3 - <<'PY' "${WORKDIR}/pre.json" "${WORKDIR}/post.json"
+import json, sys
+
+pre  = json.load(open(sys.argv[1]))
+post = json.load(open(sys.argv[2]))
+
+def bail(msg):
+    print(f"::error::backup-restore-smoke: {msg}", file=sys.stderr)
+    sys.exit(1)
+
+if pre["row_count"] != post["row_count"]:
+    bail(f"row_count mismatch: pre={pre['row_count']} post={post['row_count']}")
+
+if pre["chain_head_hash"] != post["chain_head_hash"]:
+    bail(
+        "chain_head_hash mismatch — pg_dump/pg_restore did NOT round-trip the "
+        "audit_events hash chain byte-for-byte. "
+        f"pre={pre['chain_head_hash']} post={post['chain_head_hash']}"
+    )
+
+if post.get("first_break_id", "") != "":
+    bail(
+        "audit_events_verify_chain() reports a break post-restore at id="
+        f"{post['first_break_id']} pos={post.get('first_break_pos', '?')} — "
+        "the chain is no longer self-consistent after the restore."
+    )
+
+if post.get("verifier_walked", -1) != pre["row_count"]:
+    bail(
+        f"verifier_walked={post.get('verifier_walked')} != pre.row_count="
+        f"{pre['row_count']} — verifier short-circuited or read stale rows."
+    )
+
+print(
+    f"PASS  rows={pre['row_count']}  "
+    f"chain_head={pre['chain_head_hash'][:16]}…  "
+    f"verifier=clean"
+)
+PY
diff --git a/deploy/test/backupsmoke/main.go b/deploy/test/backupsmoke/main.go
new file mode 100644
index 0000000..82493a3
--- /dev/null
+++ b/deploy/test/backupsmoke/main.go
@@ -0,0 +1,222 @@
+// Copyright 2026 certctl LLC. All rights reserved.
+// SPDX-License-Identifier: BUSL-1.1
+
+// Command backupsmoke is the workload+verifier half of the
+// backup/restore CI gate (acquisition-audit DEPL-005 + DATA-012
+// closure, Sprint 4 ACQ, 2026-05-16).
+//
+// The companion shell harness `deploy/test/backup-restore-smoke.sh`
+// orchestrates the dump/drop/restore lifecycle around two
+// invocations of this program: one before the backup
+// (--mode=workload) and one after the restore (--mode=verify). Both
+// emit a small JSON snapshot to stdout; the shell harness diffs them
+// and asserts the chain head + row count round-trip byte-for-byte.
+//
+// Modes
+// =====
+//
+//	--mode=workload
+//	  Run all up-migrations against `--migrations-path`, then
+//	  generate `--rows` (default 24) audit_events rows representing
+//	  an issue / renew / revoke / auth-login cycle. Emit a snapshot
+//	  with the post-workload row_count + chain head row_hash.
+//
+//	--mode=verify
+//	  Run `audit_events_verify_chain()` (the per-row hash-chain
+//	  verifier installed by migration 000047) and capture
+//	  first_break_id / first_break_pos / verifier_walked. Emit a
+//	  snapshot with row_count + chain head row_hash + verifier
+//	  output. No mutations.
+//
+// The CI assertion contract
+// =========================
+//
+// After workload → pg_dump -Fc → drop → pg_restore → verify, the shell asserts:
+//
+//	pre.row_count        == post.row_count
+//	pre.chain_head_hash  == post.chain_head_hash   (byte-exact)
+//	post.first_break_id  == ""                     (verifier clean)
+//	post.verifier_walked == pre.row_count          (every row walked)
+//
+// A pg_dump format-quirk that didn't preserve TIMESTAMPTZ
+// microseconds would surface as a chain-head mismatch (the
+// canonical payload re-formats `timestamp AT TIME ZONE 'UTC'` to
+// microsecond ISO-8601 — any precision loss breaks the hash). A
+// trigger-or-function regression would surface as a verifier non-
+// empty first_break_id. The test exists to PROVE these properties
+// under a real workload, not to defend against a known quirk.
+package main
+
+import (
+	"context"
+	"database/sql"
+	"encoding/json"
+	"flag"
+	"fmt"
+	"log"
+	"os"
+	"time"
+
+	_ "github.com/lib/pq"
+
+	"github.com/certctl-io/certctl/internal/repository/postgres"
+)
+
+// Snapshot is the on-the-wire shape emitted to stdout. The shell
+// orchestrator loads it with python3 (json.load) and compares the
+// relevant fields. Keep it stable: any rename here must land
+// alongside a shell-harness change.
+type Snapshot struct {
+	Phase          string `json:"phase"`
+	RowCount       int    `json:"row_count"`
+	ChainHead      string `json:"chain_head_hash"`
+	FirstBreakID   string `json:"first_break_id,omitempty"`
+	FirstBreakPos  int    `json:"first_break_pos,omitempty"`
+	VerifierWalked int    `json:"verifier_walked,omitempty"`
+}
+
+func main() {
+	var (
+		mode           = flag.String("mode", "", "workload | verify")
+		dbURL          = flag.String("db-url", os.Getenv("DATABASE_URL"), "Postgres URL (or set DATABASE_URL)")
+		migrationsPath = flag.String("migrations-path", "./migrations", "Path to the migrations/ directory (workload mode only)")
+		rows           = flag.Int("rows", 24, "Number of audit_events rows to insert (workload mode only)")
+	)
+	flag.Parse()
+
+	if *dbURL == "" {
+		log.Fatal("--db-url or DATABASE_URL is required")
+	}
+	if *mode == "" {
+		log.Fatal("--mode is required (workload | verify)")
+	}
+
+	db, err := sql.Open("postgres", *dbURL)
+	if err != nil {
+		log.Fatalf("sql.Open: %v", err)
+	}
+	defer db.Close()
+
+	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
+	defer cancel()
+	if err := db.PingContext(ctx); err != nil {
+		log.Fatalf("ping: %v", err)
+	}
+
+	switch *mode {
+	case "workload":
+		// Run all up-migrations end-to-end. The trigger + verifier
+		// function installed by migration 000047 must be in place
+		// before the inserts below; partial migration would mask a
+		// real bug.
+		if err := postgres.RunMigrations(db, *migrationsPath); err != nil {
+			log.Fatalf("RunMigrations(%s): %v", *migrationsPath, err)
+		}
+		if err := runWorkload(ctx, db, *rows); err != nil {
+			log.Fatalf("runWorkload: %v", err)
+		}
+		snap, err := snapshot(ctx, db, "workload", false)
+		if err != nil {
+			log.Fatalf("snapshot: %v", err)
+		}
+		emit(snap)
+	case "verify":
+		snap, err := snapshot(ctx, db, "verify", true)
+		if err != nil {
+			log.Fatalf("snapshot: %v", err)
+		}
+		emit(snap)
+	default:
+		log.Fatalf("unknown --mode=%q (workload | verify)", *mode)
+	}
+}
+
+// runWorkload inserts n audit_events rows representing an
+// issue / renew / revoke / auth-login cycle. Patterns mirror the
+// shape the application emits (see internal/service/audit_*.go),
+// so the canonical payload exercised here is representative.
+//
+// event_category is omitted on each INSERT — migration 000032 gave
+// the column DEFAULT 'cert_lifecycle', which is also the value the
+// application uses for cert lifecycle events. Auth rows get the
+// default too, which is harmless for the round-trip property under
+// test (only the canonical-payload byte sequence matters).
+//
+// Timestamps are monotonic via the `NOW() + ($interval ||
+// ' microsecond')::interval` pattern from
+// internal/repository/postgres/audit_chain_test.go. Deterministic
+// ordering keeps the chain walk reproducible; the head hash itself
+// still differs between runs because the payload embeds NOW().
+func runWorkload(ctx context.Context, db *sql.DB, n int) error {
+	actions := []struct{ act, resType, resID string }{
+		{"certificate.issue", "certificate", "mc-smoke"},
+		{"certificate.renew", "certificate", "mc-smoke"},
+		{"certificate.revoke", "certificate", "mc-smoke"},
+		{"auth.login", "session", "sess-smoke"},
+	}
+	for i := 0; i < n; i++ {
+		a := actions[i%len(actions)]
+		id := fmt.Sprintf("audit-smoke-%04d", i)
+		_, err := db.ExecContext(ctx, `
+			INSERT INTO audit_events (
+				id, actor, actor_type, action,
+				resource_type, resource_id, details, timestamp
+			)
+			VALUES (
+				$1, 'smoke-actor', 'User', $2,
+				$3, $4, '{}'::jsonb,
+				NOW() + ($5 || ' microsecond')::interval
+			)
+		`, id, a.act, a.resType, a.resID, fmt.Sprintf("%d", i))
+		if err != nil {
+			return fmt.Errorf("insert row %d (%s): %w", i, id, err)
+		}
+	}
+	return nil
+}
+
+// snapshot reads the chain head + row count, optionally invoking
+// the on-demand verifier. Verifier output goes in three additional
+// fields so the workload-side snapshot can omit them via the
+// `omitempty` tag.
+func snapshot(ctx context.Context, db *sql.DB, phase string, runVerifier bool) (*Snapshot, error) {
+	s := &Snapshot{Phase: phase}
+
+	if err := db.QueryRowContext(ctx, `SELECT COUNT(*) FROM audit_events`).Scan(&s.RowCount); err != nil {
+		return nil, fmt.Errorf("count(audit_events): %w", err)
+	}
+
+	if err := db.QueryRowContext(ctx, `SELECT row_hash FROM audit_chain_head WHERE id = 1`).Scan(&s.ChainHead); err != nil {
+		return nil, fmt.Errorf("read audit_chain_head: %w", err)
+	}
+
+	if runVerifier {
+		var brokenID sql.NullString
+		var brokenPos, walked int
+		err := db.QueryRowContext(ctx, `
+			SELECT first_break_id, first_break_pos, row_count
+			FROM audit_events_verify_chain()
+		`).Scan(&brokenID, &brokenPos, &walked)
+		if err != nil {
+			return nil, fmt.Errorf("audit_events_verify_chain(): %w", err)
+		}
+		if brokenID.Valid {
+			s.FirstBreakID = brokenID.String
+		}
+		s.FirstBreakPos = brokenPos
+		s.VerifierWalked = walked
+	}
+
+	return s, nil
+}
+
+// emit pretty-prints the snapshot to stdout. The trailing newline
+// from json.Encoder suits both the shell `tee` that captures it and
+// the python3 json.load that later reads the captured file.
+func emit(s *Snapshot) {
+	enc := json.NewEncoder(os.Stdout)
+	enc.SetIndent("", "  ")
+	if err := enc.Encode(s); err != nil {
+		log.Fatalf("encode snapshot: %v", err)
+	}
+}
diff --git a/docs/operator/runbooks/postgres-backup.md b/docs/operator/runbooks/postgres-backup.md
index 4daef22..1b9165e 100644
--- a/docs/operator/runbooks/postgres-backup.md
+++ b/docs/operator/runbooks/postgres-backup.md
@@ -1,6 +1,6 @@
 # Runbook: PostgreSQL backup for certctl
 
-> Last reviewed: 2026-05-16
+> Last reviewed: 2026-05-16 (Sprint 4 ACQ — CI restore verification subsection added)
 
 Use this when:
 - You're setting up a new certctl deployment and need a backup policy
@@ -198,6 +198,42 @@ to your quarterly on-call rotation:
 The [disaster-recovery runbook](disaster-recovery.md) covers what to
 do when this dry-run reveals a gap.
 
+## CI restore verification
+
+> Acquisition-audit DEPL-005 + DATA-012 closure (Sprint 4 ACQ,
+> 2026-05-16). The quarterly dry-run above is the operator-side
+> proof; the workflow below is the upstream-side proof.
+
+The certctl repo ships a weekly GitHub Actions workflow that
+exercises the **exact** pg_dump shape this runbook recommends
+(`--format=custom --no-owner --no-acl`) against a real Postgres
+container, then asserts the audit_events hash chain round-trips
+byte-for-byte across the dump → restore boundary. A regression in
+the dump format, in a Postgres minor bump, or in migration 000047's
+canonical-payload serialization would surface in the next Monday
+run instead of on a customer's restore day.
+
+- **Workflow:** [`.github/workflows/backup-restore.yml`](../../../.github/workflows/backup-restore.yml)
+  — Mondays 07:00 UTC + `workflow_dispatch`. Postgres service
+  container pinned to the same SHA256 digest as
+  `deploy/docker-compose.yml`.
+- **Harness:** [`deploy/test/backup-restore-smoke.sh`](../../../deploy/test/backup-restore-smoke.sh)
+  — runs the workload → `pg_dump -Fc` → `DROP SCHEMA public CASCADE`
+  → `pg_restore` → verify cycle. Locally runnable against any
+  reachable Postgres (it DROPs the schema, so do not point it at
+  data you care about).
+- **Workload + verifier:** [`deploy/test/backupsmoke/main.go`](../../../deploy/test/backupsmoke/main.go)
+  — generates 24 synthetic `audit_events` rows representing an
+  issue/renew/revoke/auth-login cycle, snapshots the chain head
+  before the backup, and after restore runs
+  `audit_events_verify_chain()` to confirm `first_break_id IS NULL`.
+
+The CI workflow is not a replacement for the quarterly operator
+dry-run — it does not exercise the operator-managed file material
+(CA keys, RA keys, trust anchors) listed in the "What to back up"
+table above. Treat it as the dump-shape regression test; the
+quarterly run remains the full-restore correctness test.
+
 ## Related reading
 
 - [`docs/operator/runbooks/disaster-recovery.md`](disaster-recovery.md) — the restore companion