certctl

gsadmin/certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 15:51:30 +00:00

Commit Graph

Author

SHA1

Message

Date

shankar0123

374ec574c5

feat(ci): DEPL-005 + DATA-012 — weekly backup/restore smoke + audit-chain round-trip assertion

Acquisition-audit DEPL-005 (backup runbook exists but no CI restore
test) + DATA-012 closure (Sprint 4 ACQ, 2026-05-16).

A backup procedure that has never been restore-tested is not a backup
procedure. The Helm CronJob at
deploy/helm/certctl/templates/backup-cronjob.yaml and the operator
runbook at docs/operator/runbooks/postgres-backup.md both document a
`pg_dump -Fc --no-owner --no-acl`-based backup strategy, but the
dump shape has never been restored end-to-end under CI. This sprint
adds the missing assertion.

Each Monday at 07:00 UTC (1h offset from loadtest.yml's 06:00 slot so
the two jobs don't fight for runners), boot a real postgres:16-alpine
service container pinned to the SAME sha256 digest as
deploy/docker-compose.yml, exercise the audit_events hash chain
with 24 synthetic rows representing an issue/renew/revoke/auth-login
cycle, take a custom-format dump, DROP SCHEMA public CASCADE
(simulating an operator-side data-loss event), pg_restore, and
assert:

  pre.row_count        == post.row_count
  pre.chain_head_hash  == post.chain_head_hash    (BYTE-EXACT)
  post.first_break_id  == ""                      (verify_chain clean)
  post.verifier_walked == pre.row_count           (every row walked)
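
The four assertions above can be sketched as a small comparison over
the two JSON snapshots. This is an illustrative sketch, not the
python3 block that ships in deploy/test/backup-restore-smoke.sh: the
function name compare_snapshots and the exact snapshot shape (four
keys, both snapshots carrying all of them) are assumptions.

```python
def compare_snapshots(pre, post):
    """Return failure messages; an empty list means the round trip passed."""
    failures = []
    if post["row_count"] != pre["row_count"]:
        failures.append(
            f"row_count changed: {pre['row_count']} -> {post['row_count']}"
        )
    # Byte-exact string comparison of the chain head hash.
    if post["chain_head_hash"] != pre["chain_head_hash"]:
        failures.append("chain_head_hash differs after restore")
    # An empty first_break_id means the verifier found no break.
    if post["first_break_id"] != "":
        failures.append(f"chain break at id {post['first_break_id']}")
    if post["verifier_walked"] != pre["row_count"]:
        failures.append(
            f"verifier walked {post['verifier_walked']} of {pre['row_count']} rows"
        )
    return failures


# Synthetic snapshots; the hash value is a placeholder, not real output.
pre = {"row_count": 24, "chain_head_hash": "ab" * 32,
       "first_break_id": "", "verifier_walked": 24}
post = dict(pre)

for message in compare_snapshots(pre, post):
    print(f"::error::{message}")
```

Returning every failure instead of stopping at the first lets a single
run report all broken properties at once.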

The chain-head byte-exact assertion is the load-bearing one.
Migration 000047 hashes each row's canonical payload with
`to_char(timestamp AT TIME ZONE 'UTC',
'YYYY-MM-DD"T"HH24:MI:SS.US"Z"')` — any TIMESTAMPTZ-precision loss
in the dump/restore path (a real concern across major Postgres
upgrades or with --format=plain) would corrupt the hash. The point
of testing is to PROVE the property, not to defend against a known
quirk.
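
To illustrate why timestamp precision matters, here is a toy hash
chain that formats each timestamp the same way as the to_char pattern
above (UTC, six fractional digits). The payload layout, the use of
SHA-256, and the names canonical_ts and chain_head are assumptions for
illustration only; the real canonical payload is defined by Migration
000047 and is not reproduced here.

```python
import hashlib
from datetime import datetime, timezone


def canonical_ts(ts):
    # Same shape as to_char(... 'YYYY-MM-DD"T"HH24:MI:SS.US"Z"'):
    # UTC with six fractional digits.
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")


def chain_head(rows):
    # Each row's hash covers the previous hash, so one altered row
    # changes every later hash, including the head.
    prev = ""
    for action, ts in rows:
        payload = f"{prev}|{action}|{canonical_ts(ts)}"
        prev = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return prev


base = datetime(2026, 5, 16, 17, 27, 57, 123456, tzinfo=timezone.utc)
rows = [("issue", base), ("renew", base), ("revoke", base)]

# Simulate a restore path that keeps only millisecond precision.
lossy = [(a, t.replace(microsecond=t.microsecond // 1000 * 1000))
         for a, t in rows]

print(canonical_ts(base))                     # 2026-05-16T17:27:57.123456Z
print(chain_head(rows) == chain_head(lossy))  # False
```

Dropping three digits from any timestamp changes the head hash, which
is why comparing the head alone is enough to catch this class of
corruption.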

Files
=====
- .github/workflows/backup-restore.yml — Mondays 07:00 UTC +
  workflow_dispatch. Postgres service container; Go 1.25.10;
  contents:read; 15-min timeout. Action SHAs pinned to match
  ci.yml's pinning convention.
- deploy/test/backup-restore-smoke.sh — bash orchestrator: preflight
  (postgresql-client + Go + python3 on PATH); wait-for-ready loop;
  DROP SCHEMA + workload + dump + DROP SCHEMA + restore + verify
  + python3 JSON diff. ::error:: prefix on any assertion failure.
  Same script runs unchanged locally against any reachable Postgres.
- deploy/test/backupsmoke/main.go — Go program with --mode=workload
  and --mode=verify. Imports the repo's
  internal/repository/postgres.RunMigrations and emits a small JSON
  snapshot to stdout. INSERT shape mirrors
  internal/repository/postgres/audit_chain_test.go.
- docs/operator/runbooks/postgres-backup.md — adds a 'CI restore
  verification' subsection after the existing quarterly-dry-run
  section, points at the new workflow + harness + smoke program,
  bumps the last-reviewed marker.

Verified locally: gofmt clean, go vet clean, staticcheck clean,
`go build ./deploy/test/backupsmoke` succeeds, bash -n on the shell
harness, python3 -c yaml.safe_load on the workflow, dry-run of the
JSON-diff python block on synthetic pre.json/post.json covers both
PASS and ::error:: paths.

2026-05-16 17:27:57 +00:00

shankar0123

3ce05ab0a8

docs(runbook): DEPL-005 — rewrite postgres-backup automation paths to reference the shipped CronJob

Sprint 3 unified-master-audit closure. docs/operator/runbooks/postgres-backup.md
sections 110-143 still said 'certctl ships no backup CronJob template
in the Helm chart' and the three sample recipes that followed
included an 'in-cluster Postgres → S3' rollup that the operator
'should roll their own.' But the chart actually DOES ship that
CronJob:

  deploy/helm/certctl/templates/backup-cronjob.yaml (Phase 4
  DEPL-H2 closure, 2026-05-14) — opt-in via 'backup.enabled: true',
  PVC + S3 sinks, pg_dump shape byte-comparable with the manual
  command earlier in the runbook.

Operators following the pre-fix runbook would write a duplicate
CronJob from scratch while the working template sat unused under
their nose.

Rewrite of sections 110-143:
  - Lead with the shipped CronJob, two install one-liners (PVC + S3).
  - Move the recipes-by-topology block down to 'When the bundled
    CronJob is NOT the answer' — still call out managed Postgres
    (use provider PITR) and bare-VM Postgres (systemd + pg_dump +
    restic) as deliberately out-of-scope.
  - Add 'Recovery objectives' subsection: RPO ≈ 24h at the default
    nightly schedule, RTO ≈ 30-60min from the existing drill steps
    further down the page. Tells the reader where the bundled
    CronJob fits in their RPO/RTO budget without overpromising
    (anything below 24h RPO needs WAL-shipping, which the CronJob
    doesn't do).
  - Bump '> Last reviewed:' to today.
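
The RPO figure in the list above follows from simple arithmetic: with
periodic dumps and no WAL shipping, everything written since the last
completed dump is lost. A minimal sketch, assuming a hypothetical
02:00 UTC nightly schedule (the chart's actual default time is not
stated here):

```python
from datetime import datetime, timedelta, timezone


def data_loss_window(last_backup, failure):
    # Everything written after the last completed dump is lost.
    return failure - last_backup


last_backup = datetime(2026, 5, 16, 2, 0, tzinfo=timezone.utc)
# Worst case: the failure lands just before the next nightly dump.
failure = datetime(2026, 5, 17, 1, 59, tzinfo=timezone.utc)

print(data_loss_window(last_backup, failure))  # 23:59:00
```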

Closes DEPL-005.

2026-05-16 04:31:31 +00:00

shankar0123

57b539c378

docs(b12): observability reference + Postgres backup runbook

Closes acquisition-diligence Bundle 12 — Observability, DR,
Operations Receipts, And Performance Proof. Source IDs: D5, D6, D8,
T9, finding 7, OPS-H1, OPS-M1, OPS-M2, LOW-7.

Two new operator-facing references; both non-audit-framed per the
Bundle 5 doc-placement policy.

docs/operator/observability.md — single canonical statement of what
certctl emits, what it doesn't, and what survives a restart:
  - Metrics surface: both /api/v1/metrics (JSON) and
    /api/v1/metrics/prometheus (text exposition v0.0.4); inventory of
    certctl_certificate_* gauges + certctl_issuance_duration_seconds
    per-issuer-type histogram + certctl_uptime_seconds.
  - Prometheus library vs hand-rolled exposition: explicit scope
    statement — hand-rolled fmt.Fprintf is intentional for v2.x given
    the shallow metric surface; client_golang migration tracked as
    v3 item (closes OPS-M1).
  - Tracing: explicit deferral — no OTel SDK setup, OTel packages
    are indirect-only in go.mod, no spans, no OTLP exporter; tracked
    as v3 item; in the meantime structured logs carry request_id and
    certctl_issuance_duration_seconds carries the per-issuer latency
    signal (closes OPS-M2).
  - Logging: structured JSON via log/slog; CERTCTL_LOG_LEVEL control;
    no key material / bearer tokens / session cookies in log lines.
  - Rate-limit semantics under restarts + replicas: per-process,
    in-memory, reset-on-restart, NOT shared across replicas; full
    inventory of the 5 limiter call sites (break-glass login,
    SCEP/Intune per-device, EST per-principal CSR, EST HTTP-Basic
    source-IP, ACME per-account); multi-replica + sticky-session
    implications; database-backed sliding window deferred to v3
    (closes D8).
  - Performance harness scope: cross-references the explicit
    'What it explicitly does NOT measure' list in
    deploy/test/loadtest/README.md (closes LOW-7 + finding 7).
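
The rate-limit bullet above can be illustrated with a toy model. This
sketch is not certctl's limiter: the class SlidingWindowLimiter, the
helper admitted, and the round-robin load balancing are all
assumptions. It only shows why per-process, in-memory state multiplies
the effective limit by the replica count.

```python
from collections import deque


class SlidingWindowLimiter:
    """In-memory limiter; its state lives and dies with one process."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.hits = {}

    def allow(self, key, now):
        q = self.hits.setdefault(key, deque())
        # Drop hits that have aged out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True


def admitted(replicas, key, attempts, now=0.0):
    # Round-robin the attempts across replicas, as a load balancer
    # without sticky sessions might.
    return sum(
        replicas[i % len(replicas)].allow(key, now) for i in range(attempts)
    )


one = [SlidingWindowLimiter(5, 60)]
three = [SlidingWindowLimiter(5, 60) for _ in range(3)]

print(admitted(one, "device-1", 30))    # 5
print(admitted(three, "device-1", 30))  # 15
```

Replacing a replica with a fresh SlidingWindowLimiter models a
restart: its counters start from zero again.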

docs/operator/runbooks/postgres-backup.md — operator-runnable
backup procedure:
  - Inventory of what to back up (DB + operator-managed file
    material that lives outside the DB: CA keys, RA keys, OCSP
    responder keys, trust bundles).
  - Logical backup recipe with docker-compose + Kubernetes variants,
    integrity verification step, off-host storage step.
  - Physical / PITR recipe pointing at pgbackrest / wal-g
    (certctl ships nothing here — standard PostgreSQL DBA work).
  - Three sample automation paths (in-cluster Postgres → S3 CronJob,
    managed Postgres PITR, self-hosted VM systemd timer + restic).
  - Quarterly restore-dry-run procedure.
  - Helm CronJob template deliberately not shipped — three
    documented reasons (deployment topology / secret-management
    integration / off-host storage all vary by operator) plus
    roadmap entry for shipping a starter template when a real
    operator asks for one (closes D6 + OPS-H1).

Both new docs wired into docs/README.md Operator + Runbooks tables.

D5 (ServiceMonitor) and T9 (canonical k6 load-test) were already
shipped in Bundle 3 (deploy/helm/certctl/templates/servicemonitor.yaml)
and in deploy/test/loadtest/ + .github/workflows/loadtest.yml
respectively; this bundle doesn't touch them — it just records the
closure in the audit HTML.

Verified:
  bash scripts/ci-guards/G-3-env-docs-drift.sh    # PASS
  bash scripts/ci-guards/doc-rot-detector.sh      # PASS
  All 35 scripts/ci-guards/*.sh green.

2026-05-13 02:09:11 +00:00

3 Commits