mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 18:51:32 +00:00
374ec574c5
Acquisition-audit DEPL-005 (backup runbook exists but no CI restore test) + DATA-012 closure (Sprint 4 ACQ, 2026-05-16).

A backup procedure that has never been restore-tested is not a backup procedure. The Helm CronJob at deploy/helm/certctl/templates/backup-cronjob.yaml and the operator runbook at docs/operator/runbooks/postgres-backup.md both document a `pg_dump -Fc --no-owner --no-acl`-based backup strategy, but the dump shape has never been restored end-to-end under CI. This sprint adds the missing assertion.

Each Monday at 07:00 UTC (1h offset from loadtest.yml's 06:00 slot so the two jobs don't fight for runners), boot a real postgres:16-alpine service container pinned to the SAME sha256 digest as deploy/docker-compose.yml, exercise the audit_events hash chain with 24 synthetic rows representing an issue/renew/revoke/auth-login cycle, take a custom-format dump, DROP SCHEMA public CASCADE (simulating an operator-side data-loss event), pg_restore, and assert:

  pre.row_count        == post.row_count
  pre.chain_head_hash  == post.chain_head_hash  (BYTE-EXACT)
  post.first_break_id  == ""                    (verify_chain clean)
  post.verifier_walked == pre.row_count         (every row walked)

The chain-head byte-exact assertion is the load-bearing one. Migration 000047 hashes each row's canonical payload with `to_char(timestamp AT TIME ZONE 'UTC', 'YYYY-MM-DD"T"HH24:MI:SS.US"Z"')`. Any TIMESTAMPTZ-precision loss in the dump/restore path (a real concern across major Postgres upgrades or with --format=plain) would corrupt the hash. The point of testing is to PROVE the property, not to defend against a known quirk.

Files
=====
- .github/workflows/backup-restore.yml: Mondays 07:00 UTC + workflow_dispatch. Postgres service container; Go 1.25.10; contents:read; 15-min timeout. Action SHAs pinned to match ci.yml's pinning convention.
- deploy/test/backup-restore-smoke.sh: bash orchestrator: preflight (postgresql-client + Go + python3 on PATH); wait-for-ready loop; DROP SCHEMA + workload + dump + DROP SCHEMA + restore + verify + python3 JSON diff. ::error:: prefix on any assertion failure. Same script runs unchanged locally against any reachable Postgres.
- deploy/test/backupsmoke/main.go: Go program with --mode=workload and --mode=verify. Imports the repo's internal/repository/postgres.RunMigrations and emits a small JSON snapshot to stdout. INSERT shape mirrors internal/repository/postgres/audit_chain_test.go.
- docs/operator/runbooks/postgres-backup.md: adds a 'CI restore verification' subsection after the existing quarterly-dry-run section, points at the new workflow + harness + smoke program, bumps the last-reviewed marker.

Verified locally: gofmt clean, go vet clean, staticcheck clean, `go build ./deploy/test/backupsmoke` succeeds, bash -n on the shell harness, python3 -c yaml.safe_load on the workflow, dry-run of the JSON-diff python block on synthetic pre.json/post.json covers both PASS and ::error:: paths.
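The four assertions listed in the commit message can be sketched as a small Python comparison. This is a hedged illustration, not the harness's actual python3 block: the function name `diff_snapshots` is hypothetical, and the JSON layout (the four field names as top-level keys, with the hash serialized as a string) is an assumption based on the field names the commit message mentions.

```python
import json


def diff_snapshots(pre, post):
    """Compare pre-dump and post-restore snapshots.

    Returns a list of '::error::'-prefixed messages; an empty list means PASS.
    """
    errors = []
    if pre["row_count"] != post["row_count"]:
        errors.append(
            f"::error::row_count {pre['row_count']} != {post['row_count']}"
        )
    # String equality on the serialized hash is the byte-exact check
    # (assuming the snapshot stores the hash as a string).
    if pre["chain_head_hash"] != post["chain_head_hash"]:
        errors.append("::error::chain_head_hash changed across dump/restore")
    if post["first_break_id"] != "":
        errors.append(f"::error::chain breaks at id {post['first_break_id']}")
    if post["verifier_walked"] != pre["row_count"]:
        errors.append(
            f"::error::verifier walked {post['verifier_walked']}"
            f" of {pre['row_count']} rows"
        )
    return errors


if __name__ == "__main__":
    # Synthetic snapshots standing in for pre.json / post.json.
    pre = json.loads('{"row_count": 24, "chain_head_hash": "00ff"}')
    post = json.loads(
        '{"row_count": 24, "chain_head_hash": "00ff",'
        ' "first_break_id": "", "verifier_walked": 24}'
    )
    problems = diff_snapshots(pre, post)
    print("\n".join(problems) if problems else "PASS")
```

Collecting every failure before reporting, rather than stopping at the first one, lets a single CI run show whether a bad restore lost rows, broke the chain, or both.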
119 lines
4.6 KiB
YAML
# Acquisition-audit DEPL-005 + DATA-012 closure (Sprint 4 ACQ,
# 2026-05-16). Weekly backup-restore smoke test.
#
# Why
# ===
# The Helm CronJob at deploy/helm/certctl/templates/backup-cronjob.yaml
# and the operator runbook at docs/operator/runbooks/postgres-backup.md
# both document a backup strategy based on pg_dump -Fc, but the dump
# has never been restored end-to-end under CI. A backup procedure that
# has never been restore-tested is not a backup procedure. This
# workflow adds the missing assertion.
#
# What
# ====
# Each Monday at 07:00 UTC (1h offset from loadtest.yml's 06:00 UTC
# slot so they don't fight for runners), boot a real Postgres
# 16-alpine container against the same digest pin as the production
# deploy/docker-compose.yml, exercise the audit_events hash chain
# with a small synthetic workload, pg_dump the database, drop the
# schema, pg_restore, and assert the chain head + row count
# round-trip byte-for-byte.
#
# The chain head round-trip property is the load-bearing assertion.
# Migration 000047 hashes each audit_events row's canonical payload
# with `to_char(timestamp AT TIME ZONE 'UTC',
# 'YYYY-MM-DD"T"HH24:MI:SS.US"Z"')`. Any TIMESTAMPTZ-precision loss
# in the dump→restore path (a real concern across major Postgres
# upgrades or with --format=plain) would corrupt the hash. The whole
# point of testing instead of trusting docs is to PROVE the property
# under a real workload.
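#
# Illustration (timestamp invented for this note, not taken from
# migration 000047): the format string keeps all six fractional
# digits, so
#   SELECT to_char(TIMESTAMPTZ '2026-05-16 07:00:00.123456+00'
#                  AT TIME ZONE 'UTC', 'YYYY-MM-DD"T"HH24:MI:SS.US"Z"');
# returns 2026-05-16T07:00:00.123456Z. If a restore path kept only
# milliseconds, the same row would render as
# 2026-05-16T07:00:00.123000Z and hash to a different value.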
#
# Workflow boundaries
# ===================
#   - Does not exercise PITR / WAL archiving (DR runbook owns that).
#   - Does not exercise the Helm CronJob's S3 sink or scheduling
#     (operator-side concern, not a property of the dump shape).
#   - Does not deploy or boot the certctl-server itself: the smoke
#     harness talks to Postgres directly; we're testing the dump,
#     not the server.

name: backup-restore-smoke

on:
  # Manual trigger from the Actions tab. Useful before tagging a
  # release that touches the audit_events schema, or after a dep
  # bump that could affect canonical-payload formatting.
  workflow_dispatch:

  schedule:
    # Mondays at 07:00 UTC. Off-peak, offset 1h from loadtest.yml
    # (06:00 UTC) so the two jobs don't fight for runners on the
    # GitHub-hosted ubuntu-latest pool.
    - cron: '0 7 * * 1'

# Defense-in-depth: this job reads source and exercises a database;
# it never needs write access to PRs, branches, releases, or
# packages. Pin permissions to the minimum.
permissions:
  contents: read

jobs:
  backup-restore:
    name: pg_dump / pg_restore smoke
    runs-on: ubuntu-latest

    # 15-minute hard cap. The actual workload + dump + restore + verify
    # cycle runs in well under a minute on a warm runner; 15 minutes
    # absorbs cold image pulls, slow runner provisioning, and the
    # Postgres service-container readiness wait without letting a stuck
    # job consume the runner indefinitely.
    timeout-minutes: 15

    # Postgres service container. Pin to the same digest as
    # deploy/docker-compose.yml so the smoke runs against the exact
    # image the production deploy uses. A regression that surfaces
    # only on a specific Postgres minor bump shows up here on the
    # next image refresh in compose, not silently on a customer site.
    services:
      postgres:
        image: postgres:16-alpine@sha256:890480b08124ce7f79960a9bb16fe39729aa302bd384bfd7c408fee6c8f7adb7
        env:
          POSTGRES_DB: certctl
          POSTGRES_USER: certctl
          POSTGRES_PASSWORD: certctl
        ports:
          - 5432:5432
        # GitHub's services-container health check. The smoke shell
        # also waits for pg_isready as a belt-and-suspenders guard.
        options: >-
          --health-cmd "pg_isready -U certctl -d certctl"
          --health-interval 5s
          --health-timeout 3s
          --health-retries 10

    steps:
      - name: Checkout
        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4

      - name: Set up Go
        uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5
        with:
          go-version: '1.25.10'
          # Cache go-build + go-mod for the weekly run. Keep the
          # cache key bound to go.sum so a dep bump invalidates it.
          cache: true

      - name: Run backup-restore smoke
        env:
          PGHOST: 127.0.0.1
          PGPORT: '5432'
          PGUSER: certctl
          PGPASSWORD: certctl
          PGDATABASE: certctl
          # Insert enough rows to exercise the chain over a non-trivial
          # length. 24 ≫ 1: large enough to surface ordering bugs,
          # small enough that the dump finishes in seconds.
          SMOKE_ROWS: '24'
        run: bash deploy/test/backup-restore-smoke.sh
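        # Local-run sketch, assuming the harness reads the same PG* and
        # SMOKE_ROWS variables outside CI (the values below are
        # placeholders for any reachable Postgres):
        #   PGHOST=127.0.0.1 PGPORT=5432 PGUSER=certctl \
        #   PGPASSWORD=certctl PGDATABASE=certctl SMOKE_ROWS=24 \
        #   bash deploy/test/backup-restore-smoke.sh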