mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 12:21:31 +00:00
feat(ci): DEPL-005 + DATA-012 — weekly backup/restore smoke + audit-chain round-trip assertion
Acquisition-audit DEPL-005 (backup runbook exists but no CI restore
test) + DATA-012 closure (Sprint 4 ACQ, 2026-05-16).

A backup procedure that has never been restore-tested is not a backup
procedure. The Helm CronJob at
deploy/helm/certctl/templates/backup-cronjob.yaml and the operator
runbook at docs/operator/runbooks/postgres-backup.md both document a
`pg_dump -Fc --no-owner --no-acl`-based backup strategy, but the dump
shape has never been restored end-to-end under CI. This sprint adds
the missing assertion.

Each Monday at 07:00 UTC (1h offset from loadtest.yml's 06:00 slot so
the two jobs don't fight for runners), boot a real postgres:16-alpine
service container pinned to the SAME sha256 digest as
deploy/docker-compose.yml, exercise the audit_events hash chain with
24 synthetic rows representing an issue/renew/revoke/auth-login cycle,
take a custom-format dump, DROP SCHEMA public CASCADE (simulating an
operator-side data-loss event), pg_restore, and assert:

    pre.row_count        == post.row_count
    pre.chain_head_hash  == post.chain_head_hash  (BYTE-EXACT)
    post.first_break_id  == ""                    (verify_chain clean)
    post.verifier_walked == pre.row_count         (every row walked)

The chain-head byte-exact assertion is the load-bearing one. Migration
000047 hashes each row's canonical payload with
`to_char(timestamp AT TIME ZONE 'UTC', 'YYYY-MM-DD"T"HH24:MI:SS.US"Z"')`
— any TIMESTAMPTZ-precision loss in the dump/restore path (a real
concern across major Postgres upgrades or with --format=plain) would
corrupt the hash. The point of testing is to PROVE the property, not
to defend against a known quirk.

Files
=====

- .github/workflows/backup-restore.yml — Mondays 07:00 UTC +
  workflow_dispatch. Postgres service container; Go 1.25.10;
  contents:read; 15-min timeout. Action SHAs pinned to match ci.yml's
  pinning convention.
- deploy/test/backup-restore-smoke.sh — bash orchestrator: preflight
  (postgresql-client + Go + python3 on PATH); wait-for-ready loop;
  DROP SCHEMA + workload + dump + DROP SCHEMA + restore + verify +
  python3 JSON diff. ::error:: prefix on any assertion failure. Same
  script runs unchanged locally against any reachable Postgres.
- deploy/test/backupsmoke/main.go — Go program with --mode=workload
  and --mode=verify. Imports the repo's
  internal/repository/postgres.RunMigrations and emits a small JSON
  snapshot to stdout. INSERT shape mirrors
  internal/repository/postgres/audit_chain_test.go.
- docs/operator/runbooks/postgres-backup.md — adds a 'CI restore
  verification' subsection after the existing quarterly-dry-run
  section, points at the new workflow + harness + smoke program, bumps
  the last-reviewed marker.

Verified locally: gofmt clean, go vet clean, staticcheck clean,
`go build ./deploy/test/backupsmoke` succeeds, bash -n on the shell
harness, python3 -c yaml.safe_load on the workflow, dry-run of the
JSON-diff python block on synthetic pre.json/post.json covers both
PASS and ::error:: paths.
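The sensitivity of the byte-exact assertion to timestamp precision can be illustrated with a minimal Python sketch. It reproduces the microsecond ISO-8601 shape of the `to_char` pattern quoted above and shows that dropping the last three fractional digits changes the digest. The payload layout (`action|timestamp`) and the use of SHA-256 are assumptions made for illustration only; the real canonical payload lives in migration 000047 and is not shown on this page.

```python
import hashlib
from datetime import datetime, timezone


def canonical_ts(ts: datetime) -> str:
    """Format a timestamp as microsecond ISO-8601 in UTC, e.g. 2026-05-16T07:00:00.123456Z."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f") + "Z"


def row_hash(action: str, ts: datetime) -> str:
    # Hypothetical payload layout and hash function (assumptions, not taken
    # from migration 000047).
    payload = f"{action}|{canonical_ts(ts)}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


original = datetime(2026, 5, 16, 7, 0, 0, 123456, tzinfo=timezone.utc)
# Simulate a dump/restore path that keeps only millisecond precision.
truncated = original.replace(microsecond=(original.microsecond // 1000) * 1000)

print(canonical_ts(original))
print(canonical_ts(truncated))
print(row_hash("certificate.issue", original) == row_hash("certificate.issue", truncated))
```

Under these assumptions the two formatted timestamps differ only in their last three fractional digits, yet the hashes no longer match, which is the failure mode the chain-head comparison is meant to catch.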
.github/workflows/backup-restore.yml
@@ -0,0 +1,118 @@
# Acquisition-audit DEPL-005 + DATA-012 closure (Sprint 4 ACQ,
# 2026-05-16). Weekly backup-restore smoke test.
#
# Why
# ===
# The Helm CronJob at deploy/helm/certctl/templates/backup-cronjob.yaml
# and the operator runbook at docs/operator/runbooks/postgres-backup.md
# both document a pg_dump -Fc -based backup strategy, but the dump has
# never been restored end-to-end under CI. A backup procedure that has
# never been restore-tested is not a backup procedure. This workflow
# adds the missing assertion.
#
# What
# ====
# Each Monday at 07:00 UTC (1h offset from loadtest.yml's 06:00 UTC
# slot so they don't fight for runners), boot a real postgres:16-alpine
# container pinned to the same digest as the production
# deploy/docker-compose.yml, exercise the audit_events hash chain
# with a small synthetic workload, pg_dump the database, drop the
# schema, pg_restore, and assert the chain head + row count
# round-trip byte-for-byte.
#
# The chain head round-trip property is the load-bearing assertion.
# Migration 000047 hashes each audit_events row's canonical payload
# with `to_char(timestamp AT TIME ZONE 'UTC',
# 'YYYY-MM-DD"T"HH24:MI:SS.US"Z"')`. Any TIMESTAMPTZ-precision loss
# in the dump→restore path (a real concern across major Postgres
# upgrades or with --format=plain) would corrupt the hash. The whole
# point of testing instead of trusting docs is to PROVE the property
# under a real workload.
#
# Workflow boundaries
# ===================
#   - Does not exercise PITR / WAL archiving (DR runbook owns that).
#   - Does not exercise the Helm CronJob's S3 sink or scheduling
#     (operator-side concern, not a property of the dump shape).
#   - Does not deploy or boot the certctl-server itself — the smoke
#     harness talks to Postgres directly; we're testing the dump,
#     not the server.

name: backup-restore-smoke

on:
  # Manual trigger from the Actions tab — useful before tagging a
  # release that touches the audit_events schema, or after a dep
  # bump that could affect canonical-payload formatting.
  workflow_dispatch:

  schedule:
    # Mondays at 07:00 UTC. Off-peak, offset 1h from loadtest.yml
    # (06:00 UTC) so the two jobs don't fight for runners on the
    # GitHub-hosted ubuntu-latest pool.
    - cron: '0 7 * * 1'

# Defense-in-depth: this job reads source and exercises a database;
# it never needs write access to PRs, branches, releases, or
# packages. Pin permissions to the minimum.
permissions:
  contents: read

jobs:
  backup-restore:
    name: pg_dump / pg_restore smoke
    runs-on: ubuntu-latest

    # 15-minute hard cap. The actual workload + dump + restore + verify
    # cycle runs in well under a minute on a warm runner; 15 minutes
    # absorbs cold image pulls, slow runner provisioning, and the
    # Postgres service-container readiness wait without letting a stuck
    # job consume the runner indefinitely.
    timeout-minutes: 15

    # Postgres service container. Pin to the same digest as
    # deploy/docker-compose.yml so the smoke runs against the exact
    # image the production deploy uses — a regression that surfaces
    # only on a specific Postgres minor bump shows up here on the
    # next image refresh in compose, not silently on a customer site.
    services:
      postgres:
        image: postgres:16-alpine@sha256:890480b08124ce7f79960a9bb16fe39729aa302bd384bfd7c408fee6c8f7adb7
        env:
          POSTGRES_DB: certctl
          POSTGRES_USER: certctl
          POSTGRES_PASSWORD: certctl
        ports:
          - 5432:5432
        # GitHub's services-container health check. The smoke shell
        # also waits for pg_isready as a belt-and-suspenders guard.
        options: >-
          --health-cmd "pg_isready -U certctl -d certctl"
          --health-interval 5s
          --health-timeout 3s
          --health-retries 10

    steps:
      - name: Checkout
        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4

      - name: Set up Go
        uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5
        with:
          go-version: '1.25.10'
          # Cache go-build + go-mod for the weekly run. Keep the
          # cache key bound to go.sum so a dep bump invalidates it.
          cache: true

      - name: Run backup-restore smoke
        env:
          PGHOST: 127.0.0.1
          PGPORT: '5432'
          PGUSER: certctl
          PGPASSWORD: certctl
          PGDATABASE: certctl
          # Insert enough rows to exercise the chain over a non-trivial
          # length. 24 ≫ 1 — large enough to surface ordering bugs,
          # small enough that the dump finishes in seconds.
          SMOKE_ROWS: '24'
        run: bash deploy/test/backup-restore-smoke.sh
deploy/test/backup-restore-smoke.sh (executable)
@@ -0,0 +1,225 @@
#!/usr/bin/env bash
# Copyright 2026 certctl LLC. All rights reserved.
# SPDX-License-Identifier: BUSL-1.1
#
# Acquisition-audit DEPL-005 + DATA-012 closure (Sprint 4 ACQ,
# 2026-05-16). Backup/restore smoke harness — orchestrates a real
# pg_dump -Fc → DROP SCHEMA → CREATE SCHEMA → pg_restore loop
# around the audit_events hash chain and asserts the chain head
# round-trips byte-for-byte.
#
# This script is the body of the `.github/workflows/backup-restore.yml`
# weekly job AND the same thing an operator can run locally against a
# running Postgres to gain confidence before a real restore.
#
# Prereqs
# =======
#   - psql / pg_dump / pg_restore installed and on PATH (ubuntu-latest
#     ships postgresql-client by default; on macOS use Homebrew's
#     libpq).
#   - A reachable Postgres at $PGHOST:$PGPORT, plus the certctl user +
#     database created. In CI we point this at the GHA service container
#     (postgres:16-alpine, pinned to the same digest as
#     deploy/docker-compose.yml). Locally, point it wherever, but the
#     script DROPs the public schema of the database it connects to,
#     so DO NOT POINT THIS AT A DATABASE YOU CARE ABOUT.
#   - Go 1.25+ on PATH so the smoke program can be built. (CI's
#     setup-go step handles this.)
#   - jq is NOT required — JSON snapshots are compared via python3.
#
# Behavior contract
# =================
#   - On success: exit 0, prints "PASS" + a summary line.
#   - On any assertion failure: prints `::error::<reason>`, exits 1.
#     (The ::error:: prefix is the GitHub Actions log-annotation shape;
#     it surfaces as a red banner in the Actions run UI.)
#
# Non-goals
# =========
#   - Does not exercise PITR / WAL archiving. The Sprint 4 scope is the
#     pg_dump/pg_restore path only; managed-DB PITR is the operator's
#     responsibility per docs/operator/runbooks/postgres-backup.md.
#   - Does not regenerate the audit chain after restore. A "restore
#     that rewrote history" would mask exactly the bug under test.

set -euo pipefail

REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
WORKDIR="$(mktemp -d)"
trap 'rm -rf "$WORKDIR"' EXIT

# ----------------------------------------------------------------------
# Configuration — every knob is env-overridable so the same script
# runs unchanged in CI (where the GHA service container exposes
# 127.0.0.1:5432) and on an operator's laptop (where they may have
# Postgres on a UNIX socket or a different port).
# ----------------------------------------------------------------------
: "${PGHOST:=127.0.0.1}"
: "${PGPORT:=5432}"
: "${PGUSER:=certctl}"
: "${PGPASSWORD:=certctl}"
: "${PGDATABASE:=certctl}"
: "${SMOKE_ROWS:=24}"
: "${MIGRATIONS_PATH:=${REPO_ROOT}/migrations}"

# psql/pg_dump/pg_restore all read PG* env vars. Export so we don't
# have to spell them out on every command line.
export PGHOST PGPORT PGUSER PGPASSWORD PGDATABASE

DB_URL="postgres://${PGUSER}:${PGPASSWORD}@${PGHOST}:${PGPORT}/${PGDATABASE}?sslmode=disable"

fail() {
  # GitHub Actions log annotation. The `::error::` prefix is what
  # the Actions UI uses to highlight a line in the run log.
  echo "::error::backup-restore-smoke: $*" >&2
  exit 1
}

step() { printf '\n=== %s ===\n' "$*"; }

# ----------------------------------------------------------------------
# Sanity preflight
# ----------------------------------------------------------------------
step "preflight"
command -v psql >/dev/null || fail "psql not on PATH (install postgresql-client)"
command -v pg_dump >/dev/null || fail "pg_dump not on PATH"
command -v pg_restore >/dev/null || fail "pg_restore not on PATH"
command -v go >/dev/null || fail "go not on PATH (need Go to build the smoke program)"
command -v python3 >/dev/null || fail "python3 not on PATH (used for JSON diff)"
test -d "${MIGRATIONS_PATH}" || fail "migrations dir not found: ${MIGRATIONS_PATH}"

# Wait for Postgres readiness up to 60s. pg_isready returns 0 when
# the server is accepting connections, so the loop is the canonical
# CI-friendly "wait for the service container" pattern.
step "waiting for postgres at ${PGHOST}:${PGPORT}"
for _ in $(seq 1 60); do
  if pg_isready -h "${PGHOST}" -p "${PGPORT}" -U "${PGUSER}" -d "${PGDATABASE}" -q; then
    break
  fi
  sleep 1
done
pg_isready -h "${PGHOST}" -p "${PGPORT}" -U "${PGUSER}" -d "${PGDATABASE}" -q \
  || fail "postgres not ready after 60s at ${PGHOST}:${PGPORT}"

# Wipe any prior state in the target DB. A previous failed run could
# have left rows behind; the smoke contract is "starts from clean."
step "wiping ${PGDATABASE} schema (DROP SCHEMA public CASCADE; CREATE SCHEMA public)"
psql -v ON_ERROR_STOP=1 -c 'DROP SCHEMA IF EXISTS public CASCADE; CREATE SCHEMA public; GRANT ALL ON SCHEMA public TO PUBLIC;'

# ----------------------------------------------------------------------
# Build the smoke program. The binary is built once into ${WORKDIR}
# so both phases reuse it, and the EXIT trap removes it, so nothing
# is left behind in the repo.
# ----------------------------------------------------------------------
step "building smoke program"
cd "${REPO_ROOT}"
go build -o "${WORKDIR}/smoke" ./deploy/test/backupsmoke

# ----------------------------------------------------------------------
# Phase 1 — workload: migrate, insert rows, snapshot chain head.
# ----------------------------------------------------------------------
step "phase 1 — workload (${SMOKE_ROWS} audit_events rows)"
"${WORKDIR}/smoke" \
  --mode=workload \
  --db-url="${DB_URL}" \
  --migrations-path="${MIGRATIONS_PATH}" \
  --rows="${SMOKE_ROWS}" \
  | tee "${WORKDIR}/pre.json"

# ----------------------------------------------------------------------
# Phase 2 — backup. Canonical pg_dump shape per
# deploy/helm/certctl/templates/backup-cronjob.yaml: --format=custom,
# --no-owner, --no-acl. --no-owner / --no-acl keep the dump portable
# across Postgres installations with different role layouts (the
# audit-trail hash chain is data, not ACL state).
# ----------------------------------------------------------------------
step "phase 2 — pg_dump -Fc"
pg_dump --format=custom --no-owner --no-acl --dbname="${PGDATABASE}" --file="${WORKDIR}/backup.dump"
test -s "${WORKDIR}/backup.dump" || fail "pg_dump produced an empty file"

# ----------------------------------------------------------------------
# Phase 3 — wipe. The fresh-schema approach is the closest analogue
# to "operator nuked the wrong volume." DROP DATABASE would require
# connecting to a different DB and a reconnect dance; DROP SCHEMA
# achieves the same "no rows, no schema, no functions" end state
# inside the existing connection and is restore-compatible (pg_dump
# -Fc bundles the schema in the dump, so pg_restore recreates it).
# ----------------------------------------------------------------------
step "phase 3 — drop schema (simulating data-loss event)"
psql -v ON_ERROR_STOP=1 -c 'DROP SCHEMA IF EXISTS public CASCADE; CREATE SCHEMA public; GRANT ALL ON SCHEMA public TO PUBLIC;'

# Sanity: confirm audit_events is actually gone before restore. A
# regression here (e.g. DROP SCHEMA silently no-op) would let the
# verifier "succeed" by reading the original rows, making the test
# false-pass.
PRE_RESTORE_TABLES=$(psql -tAc "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public'")
if [ "${PRE_RESTORE_TABLES}" -ne 0 ]; then
  fail "post-DROP SCHEMA, expected 0 public tables; saw ${PRE_RESTORE_TABLES}"
fi

# ----------------------------------------------------------------------
# Phase 4 — restore.
# ----------------------------------------------------------------------
step "phase 4 — pg_restore"
pg_restore --dbname="${PGDATABASE}" --no-owner --no-acl --exit-on-error "${WORKDIR}/backup.dump"

# ----------------------------------------------------------------------
# Phase 5 — verify: re-snapshot, run audit_events_verify_chain().
# ----------------------------------------------------------------------
step "phase 5 — verify (audit_events_verify_chain() + snapshot)"
"${WORKDIR}/smoke" \
  --mode=verify \
  --db-url="${DB_URL}" \
  | tee "${WORKDIR}/post.json"

# ----------------------------------------------------------------------
# Phase 6 — assert.
#
#   pre.row_count        == post.row_count
#   pre.chain_head_hash  == post.chain_head_hash  (BYTE-EXACT)
#   post.first_break_id  == ""                    (verifier clean)
#   post.verifier_walked == pre.row_count         (every row walked)
#
# Use python3 rather than jq so the script runs unchanged on macOS
# without an extra Homebrew install.
# ----------------------------------------------------------------------
step "phase 6 — assertions"
python3 - <<'PY' "${WORKDIR}/pre.json" "${WORKDIR}/post.json"
import json, sys

pre = json.load(open(sys.argv[1]))
post = json.load(open(sys.argv[2]))

def bail(msg):
    print(f"::error::backup-restore-smoke: {msg}", file=sys.stderr)
    sys.exit(1)

if pre["row_count"] != post["row_count"]:
    bail(f"row_count mismatch: pre={pre['row_count']} post={post['row_count']}")

if pre["chain_head_hash"] != post["chain_head_hash"]:
    bail(
        "chain_head_hash mismatch — pg_dump/pg_restore did NOT round-trip the "
        "audit_events hash chain byte-for-byte. "
        f"pre={pre['chain_head_hash']} post={post['chain_head_hash']}"
    )

if post.get("first_break_id", "") != "":
    bail(
        "audit_events_verify_chain() reports a break post-restore at id="
        f"{post['first_break_id']} pos={post.get('first_break_pos', '?')} — "
        "the chain is no longer self-consistent after the restore."
    )

if post.get("verifier_walked", -1) != pre["row_count"]:
    bail(
        f"verifier_walked={post.get('verifier_walked')} != pre.row_count="
        f"{pre['row_count']} — verifier short-circuited or read stale rows."
    )

print(
    f"PASS rows={pre['row_count']} "
    f"chain_head={pre['chain_head_hash'][:16]}… "
    f"verifier=clean"
)
PY
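The four phase 6 assertions can also be exercised without a database. The sketch below restates them as a pure function and runs it on synthetic snapshots, roughly the kind of dry-run the commit message mentions. The function name `check_round_trip` and the sample values are hypothetical, and unlike the harness, which exits on the first failure, this version collects every failed check.

```python
def check_round_trip(pre: dict, post: dict) -> list:
    """Return the failed assertions as strings; an empty list means PASS."""
    failures = []
    if pre["row_count"] != post["row_count"]:
        failures.append("row_count mismatch")
    if pre["chain_head_hash"] != post["chain_head_hash"]:
        failures.append("chain_head_hash mismatch")
    # first_break_id is omitted from the JSON when the verifier found no break.
    if post.get("first_break_id", "") != "":
        failures.append("verifier reported a break")
    if post.get("verifier_walked", -1) != pre["row_count"]:
        failures.append("verifier did not walk every row")
    return failures


# Synthetic snapshots shaped like the program's JSON output (values invented).
pre = {"phase": "workload", "row_count": 24, "chain_head_hash": "ab" * 32}
good = {"phase": "verify", "row_count": 24, "chain_head_hash": "ab" * 32, "verifier_walked": 24}
bad = dict(good, chain_head_hash="cd" * 32, first_break_id="audit-smoke-0007")

print(check_round_trip(pre, good))
print(check_round_trip(pre, bad))
```

Keeping the comparison free of I/O makes both the PASS path and the failure path cheap to test before wiring it to real `pre.json` and `post.json` files.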
deploy/test/backupsmoke/main.go
@@ -0,0 +1,222 @@
// Copyright 2026 certctl LLC. All rights reserved.
// SPDX-License-Identifier: BUSL-1.1

// Command backupsmoke is the workload+verifier half of the
// backup/restore CI gate (acquisition-audit DEPL-005 + DATA-012
// closure, Sprint 4 ACQ, 2026-05-16).
//
// The companion shell harness `deploy/test/backup-restore-smoke.sh`
// orchestrates the dump/drop/restore lifecycle around two
// invocations of this program: one before the backup
// (--mode=workload) and one after the restore (--mode=verify). Both
// emit a small JSON snapshot to stdout; the shell harness diffs them
// and asserts the chain head + row count round-trip byte-for-byte.
//
// Modes
// =====
//
//	--mode=workload
//	    Run all up-migrations against `--migrations-path`, then
//	    generate `--rows` (default 24) audit_events rows representing
//	    an issue / renew / revoke / auth-login cycle. Emit a snapshot
//	    with the post-workload row_count + chain head row_hash.
//
//	--mode=verify
//	    Run `audit_events_verify_chain()` (the per-row hash-chain
//	    verifier installed by migration 000047) and capture
//	    first_break_id / first_break_pos / verifier_walked. Emit a
//	    snapshot with row_count + chain head row_hash + verifier
//	    output. No mutations.
//
// The CI assertion contract
// =========================
//
// After (workload → pg_dump -Fc → DROP SCHEMA + CREATE SCHEMA →
// pg_restore → verify), the shell asserts:
//
//	pre.row_count        == post.row_count
//	pre.chain_head_hash  == post.chain_head_hash  (byte-exact)
//	post.first_break_id  == ""                    (verifier clean)
//	post.verifier_walked == pre.row_count         (every row walked)
//
// A pg_dump format-quirk that didn't preserve TIMESTAMPTZ
// microseconds would surface as a chain-head mismatch (the
// canonical payload re-formats `timestamp AT TIME ZONE 'UTC'` to
// microsecond ISO-8601 — any precision loss breaks the hash). A
// trigger-or-function regression would surface as a non-empty
// first_break_id from the verifier. The test exists to PROVE these
// properties under a real workload, not to defend against a known
// quirk.
package main

import (
	"context"
	"database/sql"
	"encoding/json"
	"flag"
	"fmt"
	"log"
	"os"
	"time"

	_ "github.com/lib/pq"

	"github.com/certctl-io/certctl/internal/repository/postgres"
)

// Snapshot is the on-the-wire shape emitted to stdout. The shell
// orchestrator parses it with python3 (json.load) and diffs the
// relevant fields. Keep it stable — any rename here must land
// alongside a shell-harness change.
type Snapshot struct {
	Phase          string `json:"phase"`
	RowCount       int    `json:"row_count"`
	ChainHead      string `json:"chain_head_hash"`
	FirstBreakID   string `json:"first_break_id,omitempty"`
	FirstBreakPos  int    `json:"first_break_pos,omitempty"`
	VerifierWalked int    `json:"verifier_walked,omitempty"`
}

func main() {
	var (
		mode           = flag.String("mode", "", "workload | verify")
		dbURL          = flag.String("db-url", os.Getenv("DATABASE_URL"), "Postgres URL (or set DATABASE_URL)")
		migrationsPath = flag.String("migrations-path", "./migrations", "Path to the migrations/ directory (workload mode only)")
		rows           = flag.Int("rows", 24, "Number of audit_events rows to insert (workload mode only)")
	)
	flag.Parse()

	if *dbURL == "" {
		log.Fatal("--db-url or DATABASE_URL is required")
	}
	if *mode == "" {
		log.Fatal("--mode is required (workload | verify)")
	}

	db, err := sql.Open("postgres", *dbURL)
	if err != nil {
		log.Fatalf("sql.Open: %v", err)
	}
	defer db.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	if err := db.PingContext(ctx); err != nil {
		log.Fatalf("ping: %v", err)
	}

	switch *mode {
	case "workload":
		// Run all up-migrations end-to-end. The trigger + verifier
		// function installed by migration 000047 must be in place
		// before the inserts below; partial migration would mask a
		// real bug.
		if err := postgres.RunMigrations(db, *migrationsPath); err != nil {
			log.Fatalf("RunMigrations(%s): %v", *migrationsPath, err)
		}
		if err := runWorkload(ctx, db, *rows); err != nil {
			log.Fatalf("runWorkload: %v", err)
		}
		snap, err := snapshot(ctx, db, "workload", false)
		if err != nil {
			log.Fatalf("snapshot: %v", err)
		}
		emit(snap)
	case "verify":
		snap, err := snapshot(ctx, db, "verify", true)
		if err != nil {
			log.Fatalf("snapshot: %v", err)
		}
		emit(snap)
	default:
		log.Fatalf("unknown --mode=%q (workload | verify)", *mode)
	}
}

// runWorkload inserts n audit_events rows representing an
// issue / renew / revoke / auth-login cycle. Patterns mirror the
// shape the application emits (see internal/service/audit_*.go),
// so the canonical payload exercised here is representative.
//
// event_category is omitted on each INSERT — migration 000032 gave
// the column DEFAULT 'cert_lifecycle', which is also the value the
// application uses for cert lifecycle events. Auth rows get the
// default too, which is harmless for the round-trip property under
// test (only the canonical-payload byte sequence matters).
//
// Timestamps are monotonic via the `NOW() + ($interval ||
// ' microsecond')::interval` pattern from
// internal/repository/postgres/audit_chain_test.go — ordering
// determinism is necessary for the chain head to be stable across
// runs.
func runWorkload(ctx context.Context, db *sql.DB, n int) error {
	actions := []struct{ act, resType, resID string }{
		{"certificate.issue", "certificate", "mc-smoke"},
		{"certificate.renew", "certificate", "mc-smoke"},
		{"certificate.revoke", "certificate", "mc-smoke"},
		{"auth.login", "session", "sess-smoke"},
	}
	for i := 0; i < n; i++ {
		a := actions[i%len(actions)]
		id := fmt.Sprintf("audit-smoke-%04d", i)
		_, err := db.ExecContext(ctx, `
			INSERT INTO audit_events (
				id, actor, actor_type, action,
				resource_type, resource_id, details, timestamp
			)
			VALUES (
				$1, 'smoke-actor', 'User', $2,
				$3, $4, '{}'::jsonb,
				NOW() + ($5 || ' microsecond')::interval
			)
		`, id, a.act, a.resType, a.resID, fmt.Sprintf("%d", i))
		if err != nil {
			return fmt.Errorf("insert row %d (%s): %w", i, id, err)
		}
	}
	return nil
}

// snapshot reads the chain head + row count, optionally invoking
// the on-demand verifier. Verifier output goes in three additional
// fields so the workload-side snapshot can omit them via the
// `omitempty` tag.
func snapshot(ctx context.Context, db *sql.DB, phase string, runVerifier bool) (*Snapshot, error) {
	s := &Snapshot{Phase: phase}

	if err := db.QueryRowContext(ctx, `SELECT COUNT(*) FROM audit_events`).Scan(&s.RowCount); err != nil {
		return nil, fmt.Errorf("count(audit_events): %w", err)
	}

	if err := db.QueryRowContext(ctx, `SELECT row_hash FROM audit_chain_head WHERE id = 1`).Scan(&s.ChainHead); err != nil {
		return nil, fmt.Errorf("read audit_chain_head: %w", err)
	}

	if runVerifier {
		var brokenID sql.NullString
		var brokenPos, walked int
		err := db.QueryRowContext(ctx, `
			SELECT first_break_id, first_break_pos, row_count
			FROM audit_events_verify_chain()
		`).Scan(&brokenID, &brokenPos, &walked)
		if err != nil {
			return nil, fmt.Errorf("audit_events_verify_chain(): %w", err)
		}
		if brokenID.Valid {
			s.FirstBreakID = brokenID.String
		}
		s.FirstBreakPos = brokenPos
		s.VerifierWalked = walked
	}

	return s, nil
}

// emit pretty-prints the snapshot to stdout. The trailing newline
// from json.Encoder is the right shape for both shell `tee` and
// python3's json.load on the resulting file.
func emit(s *Snapshot) {
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", " ")
	if err := enc.Encode(s); err != nil {
		log.Fatalf("encode snapshot: %v", err)
	}
}
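For readers unfamiliar with hash-chain verification, the idea behind the `first_break_id` / `first_break_pos` / `verifier_walked` fields can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SQL installed by migration 000047: the chaining rule (`SHA-256(prev_hash + payload)`), the all-zero genesis value, and the helper names `link`, `build_chain`, and `verify_chain` are all hypothetical.

```python
import hashlib

GENESIS = "0" * 64  # hypothetical starting value for the chain (assumption)


def link(prev_hash: str, payload: str) -> str:
    """Hash one row onto the chain: digest of the previous hash plus this payload."""
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def build_chain(payloads):
    """Build rows whose row_hash commits to every earlier row."""
    rows, prev = [], GENESIS
    for i, payload in enumerate(payloads):
        prev = link(prev, payload)
        rows.append({"id": f"audit-smoke-{i:04d}", "payload": payload, "row_hash": prev})
    return rows


def verify_chain(rows):
    """Return (first_break_id, first_break_pos, walked); the id is None when clean."""
    prev = GENESIS
    for pos, row in enumerate(rows):
        if link(prev, row["payload"]) != row["row_hash"]:
            return row["id"], pos, pos + 1
        prev = row["row_hash"]
    return None, None, len(rows)


rows = build_chain(["certificate.issue", "certificate.renew", "certificate.revoke", "auth.login"])
print(verify_chain(rows))

# Alter one stored payload without recomputing its hash, as a lossy restore might.
rows[2]["payload"] = "certificate.revoke (tampered)"
print(verify_chain(rows))
```

Because each hash commits to the previous one, comparing only the chain head before and after a restore is enough to detect a change anywhere in the chain, while the walk pinpoints the first row where it diverges.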
docs/operator/runbooks/postgres-backup.md
@@ -1,6 +1,6 @@
 # Runbook: PostgreSQL backup for certctl
 
-> Last reviewed: 2026-05-16
+> Last reviewed: 2026-05-16 (Sprint 4 ACQ — CI restore verification subsection added)
 
 Use this when:
 - You're setting up a new certctl deployment and need a backup policy
@@ -198,6 +198,42 @@ to your quarterly on-call rotation:
 The [disaster-recovery runbook](disaster-recovery.md) covers what to
 do when this dry-run reveals a gap.
 
+## CI restore verification
+
+> Acquisition-audit DEPL-005 + DATA-012 closure (Sprint 4 ACQ,
+> 2026-05-16). The quarterly dry-run above is the operator-side
+> proof; the workflow below is the upstream-side proof.
+
+The certctl repo ships a weekly GitHub Actions workflow that
+exercises the **exact** pg_dump shape this runbook recommends
+(`--format=custom --no-owner --no-acl`) against a real Postgres
+container, then asserts the audit_events hash chain round-trips
+byte-for-byte across the dump → restore boundary. A regression in
+the dump format, in a Postgres minor bump, or in migration 000047's
+canonical-payload serialization would surface in the next Monday
+run instead of on a customer's restore day.
+
+- **Workflow:** [`.github/workflows/backup-restore.yml`](../../../.github/workflows/backup-restore.yml)
+  — Mondays 07:00 UTC + `workflow_dispatch`. Postgres service
+  container pinned to the same SHA256 digest as
+  `deploy/docker-compose.yml`.
+- **Harness:** [`deploy/test/backup-restore-smoke.sh`](../../../deploy/test/backup-restore-smoke.sh)
+  — runs the workload → `pg_dump -Fc` → `DROP SCHEMA public CASCADE`
+  → `pg_restore` → verify cycle. Locally runnable against any
+  reachable Postgres (it DROPs the schema, so do not point it at
+  data you care about).
+- **Workload + verifier:** [`deploy/test/backupsmoke/main.go`](../../../deploy/test/backupsmoke/main.go)
+  — generates 24 synthetic `audit_events` rows representing an
+  issue/renew/revoke/auth-login cycle, snapshots the chain head
+  before the backup, and after restore runs
+  `audit_events_verify_chain()` to confirm `first_break_id IS NULL`.
+
+The CI workflow is not a replacement for the quarterly operator
+dry-run — it does not exercise the operator-managed file material
+(CA keys, RA keys, trust anchors) listed in the "What to back up"
+table above. Treat it as the dump-shape regression test; the
+quarterly run remains the full-restore correctness test.
+
 ## Related reading
 
 - [`docs/operator/runbooks/disaster-recovery.md`](disaster-recovery.md) — the restore companion