From 374ec574c5606908b4bc0f55de4eaa71003d4bfc Mon Sep 17 00:00:00 2001 From: shankar0123 Date: Sat, 16 May 2026 17:27:57 +0000 Subject: [PATCH] =?UTF-8?q?feat(ci):=20DEPL-005=20+=20DATA-012=20=E2=80=94?= =?UTF-8?q?=20weekly=20backup/restore=20smoke=20+=20audit-chain=20round-tr?= =?UTF-8?q?ip=20assertion?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Acquisition-audit DEPL-005 (backup runbook exists but no CI restore test) + DATA-012 closure (Sprint 4 ACQ, 2026-05-16). A backup procedure that has never been restore-tested is not a backup procedure. The Helm CronJob at deploy/helm/certctl/templates/backup-cronjob.yaml and the operator runbook at docs/operator/runbooks/postgres-backup.md both document a `pg_dump -Fc --no-owner --no-acl`-based backup strategy, but the dump shape has never been restored end-to-end under CI. This sprint adds the missing assertion. Each Monday at 07:00 UTC (1h offset from loadtest.yml's 06:00 slot so the two jobs don't fight for runners), boot a real postgres:16-alpine service container pinned to the SAME sha256 digest as deploy/docker-compose.yml, exercise the audit_events hash chain with 24 synthetic rows representing an issue/renew/revoke/auth-login cycle, take a custom-format dump, DROP SCHEMA public CASCADE (simulating an operator-side data-loss event), pg_restore, and assert: pre.row_count == post.row_count pre.chain_head_hash == post.chain_head_hash (BYTE-EXACT) post.first_break_id == "" (verify_chain clean) post.verifier_walked == pre.row_count (every row walked) The chain-head byte-exact assertion is the load-bearing one. Migration 000047 hashes each row's canonical payload with `to_char(timestamp AT TIME ZONE 'UTC', 'YYYY-MM-DD"T"HH24:MI:SS.US"Z"')` — any TIMESTAMPTZ-precision loss in the dump/restore path (a real concern across major Postgres upgrades or with --format=plain) would corrupt the hash. The point of testing is to PROVE the property, not to defend against a known quirk. 
Files ===== - .github/workflows/backup-restore.yml — Mondays 07:00 UTC + workflow_dispatch. Postgres service container; Go 1.25.10; contents:read; 15-min timeout. Action SHAs pinned to match ci.yml's pinning convention. - deploy/test/backup-restore-smoke.sh — bash orchestrator: preflight (postgresql-client + Go + python3 on PATH); wait-for-ready loop; DROP SCHEMA + workload + dump + DROP SCHEMA + restore + verify + python3 JSON diff. ::error:: prefix on any assertion failure. Same script runs unchanged locally against any reachable Postgres. - deploy/test/backupsmoke/main.go — Go program with --mode=workload and --mode=verify. Imports the repo's internal/repository/postgres.RunMigrations and emits a small JSON snapshot to stdout. INSERT shape mirrors internal/repository/postgres/audit_chain_test.go. - docs/operator/runbooks/postgres-backup.md — adds a 'CI restore verification' subsection after the existing quarterly-dry-run section, points at the new workflow + harness + smoke program, bumps the last-reviewed marker. Verified locally: gofmt clean, go vet clean, staticcheck clean, `go build ./deploy/test/backupsmoke` succeeds, bash -n on the shell harness, python3 -c yaml.safe_load on the workflow, dry-run of the JSON-diff python block on synthetic pre.json/post.json covers both PASS and ::error:: paths. 
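Reviewer note: the four-way assertion contract described in this message can be sketched as a standalone function over two synthetic snapshots. This is a minimal illustration under stated assumptions, not the harness code itself: `check_round_trip`, the failure strings, and the sample hash values are hypothetical, and only the four JSON field names are taken from the snapshot shape this commit introduces.

```python
# Sketch of the round-trip assertions. check_round_trip and the sample
# snapshots below are illustrative assumptions, not part of the harness.

def check_round_trip(pre, post):
    """Return a list of failure messages; an empty list means PASS."""
    failures = []
    if pre["row_count"] != post["row_count"]:
        failures.append("row_count mismatch")
    if pre["chain_head_hash"] != post["chain_head_hash"]:
        failures.append("chain_head_hash mismatch")
    # first_break_id is left out of the JSON when the verifier is clean.
    if post.get("first_break_id", "") != "":
        failures.append("verifier reported a break")
    # A missing verifier_walked field counts as "did not walk every row".
    if post.get("verifier_walked", -1) != pre["row_count"]:
        failures.append("verifier did not walk every row")
    return failures


# Made-up snapshots: 24 rows and a placeholder 64-character hash.
pre = {"row_count": 24, "chain_head_hash": "ab" * 32}
clean = {"row_count": 24, "chain_head_hash": "ab" * 32, "verifier_walked": 24}
drifted = dict(clean, chain_head_hash="cd" * 32)

for name, post in (("clean", clean), ("drifted", drifted)):
    failures = check_round_trip(pre, post)
    print(name, "PASS" if not failures else "::error:: " + "; ".join(failures))
```

Collecting failures in a list rather than exiting on the first one is a choice made for this sketch only; the harness described above bails on the first failed assertion.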
--- .github/workflows/backup-restore.yml | 118 ++++++++++++ deploy/test/backup-restore-smoke.sh | 225 ++++++++++++++++++++++ deploy/test/backupsmoke/main.go | 222 +++++++++++++++++++++ docs/operator/runbooks/postgres-backup.md | 38 +++- 4 files changed, 602 insertions(+), 1 deletion(-) create mode 100644 .github/workflows/backup-restore.yml create mode 100755 deploy/test/backup-restore-smoke.sh create mode 100644 deploy/test/backupsmoke/main.go diff --git a/.github/workflows/backup-restore.yml b/.github/workflows/backup-restore.yml new file mode 100644 index 0000000..16ce5a6 --- /dev/null +++ b/.github/workflows/backup-restore.yml @@ -0,0 +1,118 @@ +# Acquisition-audit DEPL-005 + DATA-012 closure (Sprint 4 ACQ, +# 2026-05-16). Weekly backup-restore smoke test. +# +# Why +# === +# The Helm CronJob at deploy/helm/certctl/templates/backup-cronjob.yaml +# and the operator runbook at docs/operator/runbooks/postgres-backup.md +# both document a pg_dump -Fc -based backup strategy, but the dump has +# never been restored end-to-end under CI. A backup procedure that has +# never been restore-tested is not a backup procedure. This workflow +# adds the missing assertion. +# +# What +# ==== +# Each Monday at 07:00 UTC (1h offset from loadtest.yml's 06:00 UTC +# slot so they don't fight for runners), boot a real Postgres +# 16-alpine container against the same digest pin as the production +# deploy/docker-compose.yml, exercise the audit_events hash chain +# with a small synthetic workload, pg_dump the database, drop the +# schema, pg_restore, and assert the chain head + row count +# round-trip byte-for-byte. +# +# The chain head round-trip property is the load-bearing assertion. +# Migration 000047 hashes each audit_events row's canonical payload +# with `to_char(timestamp AT TIME ZONE 'UTC', +# 'YYYY-MM-DD"T"HH24:MI:SS.US"Z"')`. 
Any TIMESTAMPTZ-precision loss +# in the dump→restore path (a real concern across major Postgres +# upgrades or with --format=plain) would corrupt the hash. The whole +# point of testing instead of trusting docs is to PROVE the property +# under a real workload. +# +# Workflow boundaries +# =================== +# - Does not exercise PITR / WAL archiving (DR runbook owns that). +# - Does not exercise the Helm CronJob's S3 sink or scheduling +# (operator-side concern, not a property of the dump shape). +# - Does not deploy or boot the certctl-server itself — the smoke +# harness talks to Postgres directly; we're testing the dump, +# not the server. + +name: backup-restore-smoke + +on: + # Manual trigger from the Actions tab — useful before tagging a + # release that touches the audit_events schema, or after a dep + # bump that could affect canonical-payload formatting. + workflow_dispatch: + + schedule: + # Mondays at 07:00 UTC. Off-peak, off-set 1h from loadtest.yml + # (06:00 UTC) so the two jobs don't fight for runners on the + # GitHub-hosted ubuntu-latest pool. + - cron: '0 7 * * 1' + +# Defense-in-depth: this job reads source and exercises a database; +# it never needs write access to PRs, branches, releases, or +# packages. Pin permissions to the minimum. +permissions: + contents: read + +jobs: + backup-restore: + name: pg_dump / pg_restore smoke + runs-on: ubuntu-latest + + # 15-minute hard cap. The actual workload + dump + restore + verify + # cycle runs in well under a minute on a warm runner; 15 minutes + # absorbs cold image pulls, slow runner provisioning, and the + # Postgres service-container readiness wait without letting a stuck + # job consume the runner indefinitely. + timeout-minutes: 15 + + # Postgres service container. 
Pin to the same digest as + # deploy/docker-compose.yml so the smoke runs against the exact + # image the production deploy uses — a regression that surfaces + # only on a specific Postgres minor bump shows up here on the + # next image refresh in compose, not silently on a customer site. + services: + postgres: + image: postgres:16-alpine@sha256:890480b08124ce7f79960a9bb16fe39729aa302bd384bfd7c408fee6c8f7adb7 + env: + POSTGRES_DB: certctl + POSTGRES_USER: certctl + POSTGRES_PASSWORD: certctl + ports: + - 5432:5432 + # GitHub's services-container health check. The smoke shell + # also waits for pg_isready as a belt-and-suspenders guard. + options: >- + --health-cmd "pg_isready -U certctl -d certctl" + --health-interval 5s + --health-timeout 3s + --health-retries 10 + + steps: + - name: Checkout + uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4 + + - name: Set up Go + uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5 + with: + go-version: '1.25.10' + # Cache go-build + go-mod for the weekly run. Keep the + # cache key bound to go.sum so a dep bump invalidates it. + cache: true + + - name: Run backup-restore smoke + env: + PGHOST: 127.0.0.1 + PGPORT: '5432' + PGUSER: certctl + PGPASSWORD: certctl + PGDATABASE: certctl + # Insert enough rows to exercise the chain over a non-trivial + # length. 24 ≫ 1 — large enough to surface ordering bugs, + # small enough that the dump finishes in seconds. + SMOKE_ROWS: '24' + run: bash deploy/test/backup-restore-smoke.sh diff --git a/deploy/test/backup-restore-smoke.sh b/deploy/test/backup-restore-smoke.sh new file mode 100755 index 0000000..12e7842 --- /dev/null +++ b/deploy/test/backup-restore-smoke.sh @@ -0,0 +1,225 @@ +#!/usr/bin/env bash +# Copyright 2026 certctl LLC. All rights reserved. +# SPDX-License-Identifier: BUSL-1.1 +# +# Acquisition-audit DEPL-005 + DATA-012 closure (Sprint 4 ACQ, +# 2026-05-16). 
Backup/restore smoke harness — orchestrates a real +# pg_dump -Fc → DROP SCHEMA → CREATE SCHEMA → pg_restore loop +# around the audit_events hash chain and asserts the chain head +# round-trips byte-for-byte. +# +# This script is the body of the `.github/workflows/backup-restore.yml` +# weekly job AND the same thing an operator can run locally against a +# running Postgres to gain confidence before a real restore. +# +# Prereqs +# ======= +# - psql / pg_dump / pg_restore installed and on PATH (ubuntu-latest +# ships postgresql-client by default; on macOS use Homebrew's +# libpq). +# - A reachable Postgres at $PGHOST:$PGPORT, plus the certctl user + +# database created. In CI we point this at the GHA service container +# (postgres:16-alpine, pinned to the same digest as +# deploy/docker-compose.yml). Locally, point it wherever — the +# script DROPs the public schema of its target database, so DO NOT +# POINT THIS AT A DATABASE YOU CARE ABOUT. +# - Go 1.25+ on PATH so the smoke program can be built. (CI's +# setup-go step handles this.) +# - jq is NOT required — JSON snapshots are compared via python3. +# +# Behavior contract +# ================= +# - On success: exit 0, prints "PASS" + a summary line. +# - On any assertion failure: prints `::error::`, exits 1. +# (The ::error:: prefix is the GitHub Actions log-annotation shape; +# it surfaces as a red banner in the Actions run UI.) +# +# Non-goals +# ========= +# - Does not exercise PITR / WAL archiving. The Sprint 4 scope is the +# pg_dump/pg_restore path only; managed-DB PITR is the operator's +# responsibility per docs/operator/runbooks/postgres-backup.md. +# - Does not regenerate the audit chain after restore. A "restore +# that rewrote history" would mask exactly the bug under test. + +set -euo pipefail + +REPO_ROOT="$(cd "$(dirname "$0")/../.." 
&& pwd)" +WORKDIR="$(mktemp -d)" +trap 'rm -rf "$WORKDIR"' EXIT + +# ---------------------------------------------------------------------- +# Configuration — every knob is env-overridable so the same script +# runs unchanged in CI (where the GHA service container exposes +# 127.0.0.1:5432) and on an operator's laptop (where they may have +# Postgres on a UNIX socket or a different port). +# ---------------------------------------------------------------------- +: "${PGHOST:=127.0.0.1}" +: "${PGPORT:=5432}" +: "${PGUSER:=certctl}" +: "${PGPASSWORD:=certctl}" +: "${PGDATABASE:=certctl}" +: "${SMOKE_ROWS:=24}" +: "${MIGRATIONS_PATH:=${REPO_ROOT}/migrations}" + +# psql/pg_dump/pg_restore all read PG* env vars. Export so we don't +# have to spell them out on every command line. +export PGHOST PGPORT PGUSER PGPASSWORD PGDATABASE + +DB_URL="postgres://${PGUSER}:${PGPASSWORD}@${PGHOST}:${PGPORT}/${PGDATABASE}?sslmode=disable" + +fail() { + # GitHub Actions log annotation. The `::error::` prefix is what + # the Actions UI uses to highlight a line in the run log. + echo "::error::backup-restore-smoke: $*" >&2 + exit 1 +} + +step() { printf '\n=== %s ===\n' "$*"; } + +# ---------------------------------------------------------------------- +# Sanity preflight +# ---------------------------------------------------------------------- +step "preflight" +command -v psql >/dev/null || fail "psql not on PATH (install postgresql-client)" +command -v pg_dump >/dev/null || fail "pg_dump not on PATH" +command -v pg_restore >/dev/null || fail "pg_restore not on PATH" +command -v go >/dev/null || fail "go not on PATH (need Go to build the smoke program)" +command -v python3 >/dev/null || fail "python3 not on PATH (used for JSON diff)" +test -d "${MIGRATIONS_PATH}" || fail "migrations dir not found: ${MIGRATIONS_PATH}" + +# Wait for Postgres readiness up to 60s. 
pg_isready returns 0 when +# the server is accepting connections, so the loop is the canonical +# CI-friendly "wait for the service container" pattern. +step "waiting for postgres at ${PGHOST}:${PGPORT}" +for _ in $(seq 1 60); do + if pg_isready -h "${PGHOST}" -p "${PGPORT}" -U "${PGUSER}" -d "${PGDATABASE}" -q; then + break + fi + sleep 1 +done +pg_isready -h "${PGHOST}" -p "${PGPORT}" -U "${PGUSER}" -d "${PGDATABASE}" -q \ + || fail "postgres not ready after 60s at ${PGHOST}:${PGPORT}" + +# Wipe any prior state in the target DB. A previous failed run could +# have left rows behind; the smoke contract is "starts from clean." +step "wiping ${PGDATABASE} schema (DROP SCHEMA public CASCADE; CREATE SCHEMA public)" +psql -v ON_ERROR_STOP=1 -c 'DROP SCHEMA IF EXISTS public CASCADE; CREATE SCHEMA public; GRANT ALL ON SCHEMA public TO PUBLIC;' + +# ---------------------------------------------------------------------- +# Build the smoke program. We `go build` once into ${WORKDIR} (removed +# by the EXIT trap) so both phases reuse a single binary and nothing +# is left behind in the repo tree. +# ---------------------------------------------------------------------- +step "building smoke program" +cd "${REPO_ROOT}" +go build -o "${WORKDIR}/smoke" ./deploy/test/backupsmoke + +# ---------------------------------------------------------------------- +# Phase 1 — workload: migrate, insert rows, snapshot chain head. +# ---------------------------------------------------------------------- +step "phase 1 — workload (${SMOKE_ROWS} audit_events rows)" +"${WORKDIR}/smoke" \ + --mode=workload \ + --db-url="${DB_URL}" \ + --migrations-path="${MIGRATIONS_PATH}" \ + --rows="${SMOKE_ROWS}" \ + | tee "${WORKDIR}/pre.json" + +# ---------------------------------------------------------------------- +# Phase 2 — backup. Canonical pg_dump shape per +# deploy/helm/certctl/templates/backup-cronjob.yaml: --format=custom, +# --no-owner, --no-acl. 
--no-owner / --no-acl keep the dump portable +# across Postgres installations with different role layouts (the +# audit-trail hash chain is data, not ACL state). +# ---------------------------------------------------------------------- +step "phase 2 — pg_dump -Fc" +pg_dump --format=custom --no-owner --no-acl --dbname="${PGDATABASE}" --file="${WORKDIR}/backup.dump" +test -s "${WORKDIR}/backup.dump" || fail "pg_dump produced an empty file" + +# ---------------------------------------------------------------------- +# Phase 3 — wipe. The fresh-schema approach is the closest analogue +# to "operator nuked the wrong volume." DROP DATABASE would require +# connecting to a different DB plus a reconnect dance; DROP SCHEMA +# achieves the same "no rows, no schema, no functions" end state +# inside the existing connection and is restore-compatible (pg_dump +# -Fc bundles the schema in the dump, so pg_restore recreates it). +# ---------------------------------------------------------------------- +step "phase 3 — drop schema (simulating data-loss event)" +psql -v ON_ERROR_STOP=1 -c 'DROP SCHEMA IF EXISTS public CASCADE; CREATE SCHEMA public; GRANT ALL ON SCHEMA public TO PUBLIC;' + +# Sanity: confirm audit_events is actually gone before restore. A +# regression here (e.g. DROP SCHEMA silently becoming a no-op) would let +# the verifier "succeed" by reading the original rows, making the test +# false-pass. +PRE_RESTORE_TABLES=$(psql -tAc "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public'") +if [ "${PRE_RESTORE_TABLES}" -ne 0 ]; then + fail "post-DROP SCHEMA, expected 0 public tables; saw ${PRE_RESTORE_TABLES}" +fi + +# ---------------------------------------------------------------------- +# Phase 4 — restore. 
+# ---------------------------------------------------------------------- +step "phase 4 — pg_restore" +pg_restore --dbname="${PGDATABASE}" --no-owner --no-acl --exit-on-error "${WORKDIR}/backup.dump" + +# ---------------------------------------------------------------------- +# Phase 5 — verify: re-snapshot, run audit_events_verify_chain(). +# ---------------------------------------------------------------------- +step "phase 5 — verify (audit_events_verify_chain() + snapshot)" +"${WORKDIR}/smoke" \ + --mode=verify \ + --db-url="${DB_URL}" \ + | tee "${WORKDIR}/post.json" + +# ---------------------------------------------------------------------- +# Phase 6 — assert. +# +# pre.row_count == post.row_count +# pre.chain_head_hash == post.chain_head_hash (BYTE-EXACT) +# post.first_break_id == "" (verifier clean) +# post.verifier_walked == pre.row_count (every row walked) +# +# Use python3 rather than jq so the script runs unchanged on macOS +# without an extra Homebrew install. +# ---------------------------------------------------------------------- +step "phase 6 — assertions" +python3 - <<'PY' "${WORKDIR}/pre.json" "${WORKDIR}/post.json" +import json, sys + +pre = json.load(open(sys.argv[1])) +post = json.load(open(sys.argv[2])) + +def bail(msg): + print(f"::error::backup-restore-smoke: {msg}", file=sys.stderr) + sys.exit(1) + +if pre["row_count"] != post["row_count"]: + bail(f"row_count mismatch: pre={pre['row_count']} post={post['row_count']}") + +if pre["chain_head_hash"] != post["chain_head_hash"]: + bail( + "chain_head_hash mismatch — pg_dump/pg_restore did NOT round-trip the " + "audit_events hash chain byte-for-byte. " + f"pre={pre['chain_head_hash']} post={post['chain_head_hash']}" + ) + +if post.get("first_break_id", "") != "": + bail( + "audit_events_verify_chain() reports a break post-restore at id=" + f"{post['first_break_id']} pos={post.get('first_break_pos', '?')} — " + "the chain is no longer self-consistent after the restore." 
+ ) + +if post.get("verifier_walked", -1) != pre["row_count"]: + bail( + f"verifier_walked={post.get('verifier_walked')} != pre.row_count=" + f"{pre['row_count']} — verifier short-circuited or read stale rows." + ) + +print( + f"PASS rows={pre['row_count']} " + f"chain_head={pre['chain_head_hash'][:16]}… " + f"verifier=clean" +) +PY diff --git a/deploy/test/backupsmoke/main.go b/deploy/test/backupsmoke/main.go new file mode 100644 index 0000000..82493a3 --- /dev/null +++ b/deploy/test/backupsmoke/main.go @@ -0,0 +1,222 @@ +// Copyright 2026 certctl LLC. All rights reserved. +// SPDX-License-Identifier: BUSL-1.1 + +// Command backupsmoke is the workload+verifier half of the +// backup/restore CI gate (acquisition-audit DEPL-005 + DATA-012 +// closure, Sprint 4 ACQ, 2026-05-16). +// +// The companion shell harness `deploy/test/backup-restore-smoke.sh` +// orchestrates the dump/drop/restore lifecycle around two +// invocations of this program: one before the backup +// (--mode=workload) and one after the restore (--mode=verify). Both +// emit a small JSON snapshot to stdout; the shell harness diffs them +// and asserts the chain head + row count round-trip byte-for-byte. +// +// Modes +// ===== +// +// --mode=workload +// Run all up-migrations against `--migrations-path`, then +// generate `--rows` (default 24) audit_events rows representing +// an issue / renew / revoke / auth-login cycle. Emit a snapshot +// with the post-workload row_count + chain head row_hash. +// +// --mode=verify +// Run `audit_events_verify_chain()` (the per-row hash-chain +// verifier installed by migration 000047) and capture +// first_break_id / first_break_pos / verifier_walked. Emit a +// snapshot with row_count + chain head row_hash + verifier +// output. No mutations. 
+// +// The CI assertion contract +// ========================= +// +// After (workload → pg_dump -Fc → DROP/CREATE SCHEMA → pg_restore → verify), +// the shell asserts these, plus post.verifier_walked == pre.row_count: +// +// pre.row_count == post.row_count +// pre.chain_head_hash == post.chain_head_hash (byte-exact) +// post.first_break_id == "" (verifier clean) +// +// A pg_dump format-quirk that didn't preserve TIMESTAMPTZ +// microseconds would surface as a chain-head mismatch (the +// canonical payload re-formats `timestamp AT TIME ZONE 'UTC'` to +// microsecond ISO-8601 — any precision loss breaks the hash). A +// trigger-or-function regression would surface as a non-empty +// verifier first_break_id. The test exists to PROVE these properties +// under a real workload, not to defend against a known quirk. +package main + +import ( + "context" + "database/sql" + "encoding/json" + "flag" + "fmt" + "log" + "os" + "time" + + _ "github.com/lib/pq" + + "github.com/certctl-io/certctl/internal/repository/postgres" +) + +// Snapshot is the on-the-wire shape emitted to stdout. The shell +// orchestrator parses it in a python3 heredoc (json.load) and diffs +// the relevant fields. Keep it stable — any rename here must land +// alongside a shell-harness change. 
+type Snapshot struct { + Phase string `json:"phase"` + RowCount int `json:"row_count"` + ChainHead string `json:"chain_head_hash"` + FirstBreakID string `json:"first_break_id,omitempty"` + FirstBreakPos int `json:"first_break_pos,omitempty"` + VerifierWalked int `json:"verifier_walked,omitempty"` +} + +func main() { + var ( + mode = flag.String("mode", "", "workload | verify") + dbURL = flag.String("db-url", os.Getenv("DATABASE_URL"), "Postgres URL (or set DATABASE_URL)") + migrationsPath = flag.String("migrations-path", "./migrations", "Path to the migrations/ directory (workload mode only)") + rows = flag.Int("rows", 24, "Number of audit_events rows to insert (workload mode only)") + ) + flag.Parse() + + if *dbURL == "" { + log.Fatal("--db-url or DATABASE_URL is required") + } + if *mode == "" { + log.Fatal("--mode is required (workload | verify)") + } + + db, err := sql.Open("postgres", *dbURL) + if err != nil { + log.Fatalf("sql.Open: %v", err) + } + defer db.Close() + + ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute) + defer cancel() + if err := db.PingContext(ctx); err != nil { + log.Fatalf("ping: %v", err) + } + + switch *mode { + case "workload": + // Run all up-migrations end-to-end. The trigger + verifier + // function installed by migration 000047 must be in place + // before the inserts below; partial migration would mask a + // real bug. 
+ if err := postgres.RunMigrations(db, *migrationsPath); err != nil { + log.Fatalf("RunMigrations(%s): %v", *migrationsPath, err) + } + if err := runWorkload(ctx, db, *rows); err != nil { + log.Fatalf("runWorkload: %v", err) + } + snap, err := snapshot(ctx, db, "workload", false) + if err != nil { + log.Fatalf("snapshot: %v", err) + } + emit(snap) + case "verify": + snap, err := snapshot(ctx, db, "verify", true) + if err != nil { + log.Fatalf("snapshot: %v", err) + } + emit(snap) + default: + log.Fatalf("unknown --mode=%q (workload | verify)", *mode) + } +} + +// runWorkload inserts n audit_events rows representing an +// issue / renew / revoke / auth-login cycle. Patterns mirror the +// shape the application emits (see internal/service/audit_*.go), +// so the canonical payload exercised here is representative. +// +// event_category is omitted on each INSERT — migration 000032 gave +// the column DEFAULT 'cert_lifecycle', which is also the value the +// application uses for cert lifecycle events. Auth rows get the +// default too, which is harmless for the round-trip property under +// test (only the canonical-payload byte sequence matters). +// +// Timestamps are monotonic via the `NOW() + ($interval || +// ' microsecond')::interval` pattern from +// internal/repository/postgres/audit_chain_test.go — ordering +// determinism is necessary for the chain head to be stable across +// runs. 
+func runWorkload(ctx context.Context, db *sql.DB, n int) error { + actions := []struct{ act, resType, resID string }{ + {"certificate.issue", "certificate", "mc-smoke"}, + {"certificate.renew", "certificate", "mc-smoke"}, + {"certificate.revoke", "certificate", "mc-smoke"}, + {"auth.login", "session", "sess-smoke"}, + } + for i := 0; i < n; i++ { + a := actions[i%len(actions)] + id := fmt.Sprintf("audit-smoke-%04d", i) + _, err := db.ExecContext(ctx, ` + INSERT INTO audit_events ( + id, actor, actor_type, action, + resource_type, resource_id, details, timestamp + ) + VALUES ( + $1, 'smoke-actor', 'User', $2, + $3, $4, '{}'::jsonb, + NOW() + ($5 || ' microsecond')::interval + ) + `, id, a.act, a.resType, a.resID, fmt.Sprintf("%d", i)) + if err != nil { + return fmt.Errorf("insert row %d (%s): %w", i, id, err) + } + } + return nil +} + +// snapshot reads the chain head + row count, optionally invoking +// the on-demand verifier. Verifier output goes in three additional +// fields so the workload-side snapshot can omit them via the +// `omitempty` tag. 
+func snapshot(ctx context.Context, db *sql.DB, phase string, runVerifier bool) (*Snapshot, error) { + s := &Snapshot{Phase: phase} + + if err := db.QueryRowContext(ctx, `SELECT COUNT(*) FROM audit_events`).Scan(&s.RowCount); err != nil { + return nil, fmt.Errorf("count(audit_events): %w", err) + } + + if err := db.QueryRowContext(ctx, `SELECT row_hash FROM audit_chain_head WHERE id = 1`).Scan(&s.ChainHead); err != nil { + return nil, fmt.Errorf("read audit_chain_head: %w", err) + } + + if runVerifier { + var brokenID sql.NullString + var brokenPos, walked sql.NullInt64 // NULL-safe: a clean chain may report no break position + err := db.QueryRowContext(ctx, ` + SELECT first_break_id, first_break_pos, row_count + FROM audit_events_verify_chain() + `).Scan(&brokenID, &brokenPos, &walked) + if err != nil { + return nil, fmt.Errorf("audit_events_verify_chain(): %w", err) + } + if brokenID.Valid { + s.FirstBreakID = brokenID.String + } + s.FirstBreakPos = int(brokenPos.Int64) // stays 0 when the column is NULL + s.VerifierWalked = int(walked.Int64) + } + + return s, nil +} + +// emit pretty-prints the snapshot to stdout. The trailing newline +// from json.Encoder is the right shape for both shell `tee` and +// python3 stdin handling. +func emit(s *Snapshot) { + enc := json.NewEncoder(os.Stdout) + enc.SetIndent("", " ") + if err := enc.Encode(s); err != nil { + log.Fatalf("encode snapshot: %v", err) + } +} diff --git a/docs/operator/runbooks/postgres-backup.md b/docs/operator/runbooks/postgres-backup.md index 4daef22..1b9165e 100644 --- a/docs/operator/runbooks/postgres-backup.md +++ b/docs/operator/runbooks/postgres-backup.md @@ -1,6 +1,6 @@ # Runbook: PostgreSQL backup for certctl -> Last reviewed: 2026-05-16 +> Last reviewed: 2026-05-16 (Sprint 4 ACQ — CI restore verification subsection added) Use this when: - You're setting up a new certctl deployment and need a backup policy @@ -198,6 +198,42 @@ to your quarterly on-call rotation: The [disaster-recovery runbook](disaster-recovery.md) covers what to do when this dry-run reveals a gap. 
+## CI restore verification + +> Acquisition-audit DEPL-005 + DATA-012 closure (Sprint 4 ACQ, +> 2026-05-16). The quarterly dry-run above is the operator-side +> proof; the workflow below is the upstream-side proof. + +The certctl repo ships a weekly GitHub Actions workflow that +exercises the **exact** pg_dump shape this runbook recommends +(`--format=custom --no-owner --no-acl`) against a real Postgres +container, then asserts the audit_events hash chain round-trips +byte-for-byte across the dump → restore boundary. A regression in +the dump format, in a Postgres minor bump, or in migration 000047's +canonical-payload serialization would surface in the next Monday +run instead of on a customer's restore day. + +- **Workflow:** [`.github/workflows/backup-restore.yml`](../../../.github/workflows/backup-restore.yml) + — Mondays 07:00 UTC + `workflow_dispatch`. Postgres service + container pinned to the same SHA256 digest as + `deploy/docker-compose.yml`. +- **Harness:** [`deploy/test/backup-restore-smoke.sh`](../../../deploy/test/backup-restore-smoke.sh) + — runs the workload → `pg_dump -Fc` → `DROP SCHEMA public CASCADE` + → `pg_restore` → verify cycle. Locally runnable against any + reachable Postgres (it DROPs the schema, so do not point it at + data you care about). +- **Workload + verifier:** [`deploy/test/backupsmoke/main.go`](../../../deploy/test/backupsmoke/main.go) + — generates 24 synthetic `audit_events` rows representing an + issue/renew/revoke/auth-login cycle, snapshots the chain head + before the backup, and after restore runs + `audit_events_verify_chain()` to confirm `first_break_id IS NULL`. + +The CI workflow is not a replacement for the quarterly operator +dry-run — it does not exercise the operator-managed file material +(CA keys, RA keys, trust anchors) listed in the "What to back up" +table above. Treat it as the dump-shape regression test; the +quarterly run remains the full-restore correctness test. 
+ ## Related reading - [`docs/operator/runbooks/disaster-recovery.md`](disaster-recovery.md) — the restore companion