shankar0123 43836aca7c feat(audit): COMP-001-HASH — per-row hash chain on audit_events (tamper-evidence)
Sprint 6 closure of the audit's HIGH-severity COMP-001-HASH finding.

Pre-fix posture: migration 000018 installs a WORM trigger on
audit_events that blocks UPDATE / DELETE for the application role.
But the trigger header itself documents a compliance-superuser
bypass (backup restore, retention purges, breach recovery). Without
a hash chain, that role can rewrite any row's actor / action /
details / timestamp / event_category with no on-disk trace.

HIPAA §164.312(b), FedRAMP AU-9, NIST 800-53 AU-10 want tamper-
EVIDENCE, not just tamper-prevention. This commit ships the
evidence layer.

Wire shape:

  migrations/000047_audit_events_hash_chain.up.sql
    + pgcrypto extension (digest function)
    + audit_chain_head: single-row sentinel table holding the most
      recent row_hash; FOR UPDATE row-lock serialises chain writes
      under concurrent INSERTs so two parallel writers can't read
      the same prev_hash and produce a forked chain
    + audit_events: prev_hash + row_hash columns
    + audit_events_canonical_payload(): centralised hash input
      builder. UTC + microsecond ISO-8601 keeps the hash session-
      timezone-independent. All columns separated by '|' so a
      concatenation-ambiguity exploit can't fabricate a collision
    + audit_events_compute_hash_chain(): BEFORE-INSERT trigger
      function. Reads sentinel FOR UPDATE → computes
      sha256(prev_hash || id || actor || actor_type || action ||
      resource_type || resource_id || details::text ||
      timestamp_utc_iso || event_category) → writes both columns +
      advances the sentinel
    + backfill loop walks every existing row in (timestamp ASC, id
      ASC) order; WORM trigger temporarily DISABLEd inside this
      migration's transaction so backfill UPDATEs land cleanly,
      ENABLEd before COMMIT
    + audit_events_verify_chain(): STABLE plpgsql verifier. Walks
      the chain end-to-end and returns the first break:
        (first_break_id TEXT, first_break_pos INT, row_count INT)
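
Illustrative sketch only (not part of this commit): the chain logic the
migration describes, mirrored in plain Go. The Event type, the function
names, the use of "" for a NULL prev_hash, and the -1 position on a clean
walk are all assumptions made for the example. The real implementation is
the SQL in migration 000047.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
	"time"
)

// Event mirrors the audit_events columns named above (hypothetical type).
type Event struct {
	ID, Actor, ActorType, Action string
	ResourceType, ResourceID     string
	Details                      string // stands in for details::text
	Timestamp                    time.Time
	EventCategory                string
	PrevHash, RowHash            string // PrevHash == "" stands in for NULL
}

// canonicalPayload joins the fields with '|' and renders the timestamp
// as UTC with microsecond precision, so the hash input does not depend
// on the session time zone. Fields that themselves contain '|' are not
// escaped in this sketch.
func canonicalPayload(prev string, e Event) string {
	ts := e.Timestamp.UTC().Format("2006-01-02T15:04:05.000000Z")
	return strings.Join([]string{prev, e.ID, e.Actor, e.ActorType, e.Action,
		e.ResourceType, e.ResourceID, e.Details, ts, e.EventCategory}, "|")
}

// rowHash returns the hex-encoded SHA-256 of the canonical payload.
func rowHash(prev string, e Event) string {
	sum := sha256.Sum256([]byte(canonicalPayload(prev, e)))
	return hex.EncodeToString(sum[:])
}

// appendEvent plays the role of the BEFORE-INSERT trigger: link the new
// row to the current chain head, then hash it.
func appendEvent(chain []Event, e Event) []Event {
	if n := len(chain); n > 0 {
		e.PrevHash = chain[n-1].RowHash
	}
	e.RowHash = rowHash(e.PrevHash, e)
	return append(chain, e)
}

// verifyChain walks the chain and returns the first break as
// (id, 0-indexed position, rows walked). A clean walk returns
// ("", -1, len(chain)); the -1 is an assumption of this sketch.
func verifyChain(chain []Event) (string, int, int) {
	prev := ""
	for i, e := range chain {
		if e.PrevHash != prev || e.RowHash != rowHash(prev, e) {
			return e.ID, i, i + 1
		}
		prev = e.RowHash
	}
	return "", -1, len(chain)
}

func main() {
	base := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	var chain []Event
	for i := 0; i < 3; i++ {
		chain = appendEvent(chain, Event{
			ID: fmt.Sprintf("audit-%03d", i+1), Actor: "tester", ActorType: "User",
			Action: fmt.Sprintf("action_%d", i), ResourceType: "certificate",
			ResourceID: "mc-test", Details: "{}",
			Timestamp: base.Add(time.Duration(i) * time.Second),
		})
	}
	fmt.Println(verifyChain(chain)) // clean walk over three rows
	chain[1].Actor = "tampered"     // rewrite the middle row in place
	fmt.Println(verifyChain(chain)) // break reported at the middle row
}
```

Rewriting the middle row's actor without recomputing row_hash is the same
tampering pattern the integration test below applies through SQL.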

  internal/repository/postgres/audit.go
    + AuditRepository.VerifyHashChain — calls the SQL function and
      maps the OUT parameters to Go return values

  internal/repository/interfaces.go
    + AuditRepository.VerifyHashChain in the contract; every
      in-memory mock + stub picks up the no-op implementation

  internal/scheduler/scheduler.go
    + AuditChainVerifier + AuditChainBreakRecorder interfaces
    + auditChainVerifyInterval (default 6h)
    + auditChainVerifyLoop: runs once on start + every tick;
      atomic.Bool guard + 5-min per-tick context timeout match every
      other GC loop's pattern
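
Illustrative sketch only: one way a run-on-start ticker loop with an
atomic.Bool overlap guard and a per-tick context timeout can look. The
type and method names here (chainVerifier, verifyLoop, runOnce) are
hypothetical and are not the scheduler's actual identifiers.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// chainVerifier is a hypothetical stand-in for AuditChainVerifier.
type chainVerifier interface {
	VerifyHashChain(ctx context.Context) (brokenID string, brokenPos, rowCount int, err error)
}

type verifyLoop struct {
	verifier chainVerifier
	interval time.Duration // e.g. 6h in production
	timeout  time.Duration // per-tick budget, e.g. 5m
	running  atomic.Bool   // true while a walk is in flight
}

// runOnce performs one guarded walk. It returns false when a previous
// walk is still running and this tick was skipped.
func (l *verifyLoop) runOnce(parent context.Context) bool {
	if !l.running.CompareAndSwap(false, true) {
		return false
	}
	defer l.running.Store(false)

	ctx, cancel := context.WithTimeout(parent, l.timeout)
	defer cancel()

	id, pos, rows, err := l.verifier.VerifyHashChain(ctx)
	switch {
	case err != nil:
		fmt.Println("audit chain verify failed:", err)
	case id != "":
		fmt.Printf("audit chain BREAK at id=%s pos=%d\n", id, pos)
	default:
		fmt.Printf("audit chain clean over %d rows\n", rows)
	}
	return true
}

// run walks once immediately, then on every tick until ctx is done.
func (l *verifyLoop) run(ctx context.Context) {
	l.runOnce(ctx)
	t := time.NewTicker(l.interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			l.runOnce(ctx)
		}
	}
}

// fakeVerifier always reports a clean three-row chain.
type fakeVerifier struct{ calls atomic.Int64 }

func (f *fakeVerifier) VerifyHashChain(ctx context.Context) (string, int, int, error) {
	f.calls.Add(1)
	return "", -1, 3, nil
}

// demo runs the loop briefly against the fake and returns the walk count.
func demo() int64 {
	f := &fakeVerifier{}
	l := &verifyLoop{verifier: f, interval: 20 * time.Millisecond, timeout: time.Second}
	ctx, cancel := context.WithTimeout(context.Background(), 70*time.Millisecond)
	defer cancel()
	l.run(ctx)
	return f.calls.Load()
}

func main() {
	fmt.Println("walks:", demo())
}
```

The compare-and-swap guard means a walk that outlives the tick interval
causes later ticks to be skipped rather than stacked.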

  internal/service/audit_chain_metric.go
    + AuditChainCounter type with atomic counters. Sticky-first-
      detection on (BrokenAtID, BrokenAtPos) so the actionable
      alarm doesn't drift across walks. Snapshot() returns the
      full state for the metrics handler
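
Illustrative sketch only: a counter with sticky-first-detection
semantics. Field and method names are hypothetical, and whether the real
AuditChainCounter increments the break counter on every walk that finds a
break (as here) or only once is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// chainSnapshot is the read-side view handed to a metrics handler.
type chainSnapshot struct {
	BreaksDetected   int64
	Walks            int64
	Rows             int64
	LastVerifiedUnix int64
	BrokenAtID       string
	BrokenAtPos      int
}

// chainCounter keeps numeric state in atomics and the first detected
// break under a mutex.
type chainCounter struct {
	breaks, walks, rows, lastVerified atomic.Int64

	mu          sync.Mutex
	hasBreak    bool
	brokenAtID  string
	brokenAtPos int
}

// RecordWalk records one verifier result. Only the first reported break
// location is kept, so later walks cannot move the alarm.
func (c *chainCounter) RecordWalk(brokenID string, brokenPos, rowCount int, atUnix int64) {
	c.walks.Add(1)
	c.rows.Store(int64(rowCount))
	c.lastVerified.Store(atUnix)
	if brokenID == "" {
		return
	}
	c.breaks.Add(1)
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.hasBreak {
		c.hasBreak, c.brokenAtID, c.brokenAtPos = true, brokenID, brokenPos
	}
}

// Snapshot returns a copy of the current state.
func (c *chainCounter) Snapshot() chainSnapshot {
	c.mu.Lock()
	defer c.mu.Unlock()
	return chainSnapshot{
		BreaksDetected:   c.breaks.Load(),
		Walks:            c.walks.Load(),
		Rows:             c.rows.Load(),
		LastVerifiedUnix: c.lastVerified.Load(),
		BrokenAtID:       c.brokenAtID,
		BrokenAtPos:      c.brokenAtPos,
	}
}

func main() {
	c := &chainCounter{}
	c.RecordWalk("", -1, 3, 100)     // clean walk
	c.RecordWalk("row-7", 7, 8, 200) // first break: location is pinned
	c.RecordWalk("row-2", 2, 3, 300) // later break: counted, location unchanged
	fmt.Printf("%+v\n", c.Snapshot())
}
```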

  internal/api/handler/metrics.go
    + AuditChainCounterSnapshotter interface + Prometheus
      exposition for four series:
        certctl_audit_chain_break_detected_total counter (the alarm)
        certctl_audit_chain_verify_total          counter (walks done)
        certctl_audit_chain_rows                  gauge (last walk size)
        certctl_audit_chain_last_verified_at      gauge (unix seconds)
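
Illustrative sketch only: writing the four series in the Prometheus text
exposition format. The series names and types come from the list above;
the HELP strings, the chainMetrics type, and the function names are
invented for the example and may differ from the handler's real output.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// chainMetrics carries the values to expose (hypothetical type).
type chainMetrics struct {
	BreaksDetected   int64
	Walks            int64
	Rows             int64
	LastVerifiedUnix int64
}

// writeChainMetrics emits HELP, TYPE, and sample lines for each series.
func writeChainMetrics(w io.Writer, m chainMetrics) error {
	series := []struct {
		name, kind, help string
		value            int64
	}{
		{"certctl_audit_chain_break_detected_total", "counter", "Hash chain breaks detected.", m.BreaksDetected},
		{"certctl_audit_chain_verify_total", "counter", "Completed chain walks.", m.Walks},
		{"certctl_audit_chain_rows", "gauge", "Rows seen by the last walk.", m.Rows},
		{"certctl_audit_chain_last_verified_at", "gauge", "Unix time of the last walk.", m.LastVerifiedUnix},
	}
	for _, s := range series {
		_, err := fmt.Fprintf(w, "# HELP %s %s\n# TYPE %s %s\n%s %d\n",
			s.name, s.help, s.name, s.kind, s.name, s.value)
		if err != nil {
			return err
		}
	}
	return nil
}

// renderChainMetrics is a convenience wrapper that returns the text.
func renderChainMetrics(m chainMetrics) string {
	var b strings.Builder
	_ = writeChainMetrics(&b, m) // strings.Builder writes do not fail
	return b.String()
}

func main() {
	m := chainMetrics{BreaksDetected: 1, Walks: 4, Rows: 3, LastVerifiedUnix: 1700000000}
	if err := writeChainMetrics(os.Stdout, m); err != nil {
		fmt.Fprintln(os.Stderr, "write metrics:", err)
	}
}
```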

  internal/config/config.go
    + AuditChainConfig{ VerifyInterval } + CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL
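
Illustrative sketch only: reading the interval from the environment with
a 6h fallback. That the variable accepts Go duration syntax ("30m",
"6h") and that invalid values fall back to the default are assumptions;
parseVerifyInterval is a hypothetical helper, not the config package's
API.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultVerifyInterval matches the 6h default named in this commit.
const defaultVerifyInterval = 6 * time.Hour

// parseVerifyInterval returns the parsed duration, or the default when
// the value is empty, malformed, or not positive.
func parseVerifyInterval(raw string) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultVerifyInterval
	}
	return d
}

func main() {
	raw := os.Getenv("CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL")
	fmt.Println("audit chain verify interval:", parseVerifyInterval(raw))
}
```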

  cmd/server/main.go
    + wires AuditChainCounter into both the scheduler (recorder) +
      metrics handler (snapshotter) — single instance shared so the
      writer + reader are guaranteed to converge

  internal/repository/postgres/audit_chain_test.go (NEW)
    + TestAuditEventsHashChain_FreshTable: empty walk → clean
    + TestAuditEventsHashChain_AppendLinksRows: three INSERTs
      produce a strictly-linked chain; prev_hash on row 0 is NULL;
      verifier walks clean over the 3 rows
    + TestAuditEventsHashChain_VerifierDetectsTampering: simulate
      the compliance-superuser threat model (DISABLE WORM, UPDATE
      a middle row, ENABLE WORM); verifier returns the tampered
      row's id at position 1

  docs/operator/audit-chain.md (NEW)
    + Layered-defenses explainer (WORM + hash chain). Verifier
      function reference. Recommended Prometheus alert rule.
      Performance scaling table (10k to 10M rows). Step-by-step
      runbook for what to do when a break is detected. Operator
      configuration table.

  Test-stub additions for AuditRepository.VerifyHashChain:
    internal/service/testutil_test.go  — mockAuditRepo
    internal/service/acme_test.go      — fakeAuditRepo
    internal/integration/lifecycle_test.go — mockAuditRepository
    internal/api/handler/scep_intune_e2e_test.go — intuneE2EAuditRepo

Verified locally:
  go vet ./...                                          (clean)
  gofmt -l internal/ cmd/                               (clean)
  go test -short -count=1 ./internal/scheduler/... ./internal/config/...
    ./internal/service/... ./internal/api/handler/... ./internal/repository/...
    (all green)

Verified with testcontainers + postgres:16-alpine + the migration
runner (these tests skip under -short; running them requires docker):
  go test -count=1 -run TestAuditEventsHashChain ./internal/repository/postgres/...

Closes COMP-001-HASH leg of Sprint 6. COMP-002-RETENTION lands in
the next commit (separate concern: federated-user PII retention).
2026-05-16 06:17:15 +00:00


package postgres_test

import (
	"context"
	"encoding/json"
	"fmt"
	"testing"
	"time"
)

// Sprint 6 COMP-001-HASH closure tests. Migration 000047 installs the
// per-row hash chain on audit_events; this suite runs the live trigger
// against testcontainers + postgres:16-alpine + the migration runner
// from migrations_test.go.
//
// The tests cover four invariants:
//
//  1. Fresh table: a clean walk over zero rows returns
//     brokenAtID == "" + rowCount == 0.
//  2. Append: three inserts produce a strictly-linked chain (each
//     row's prev_hash equals the previous row's row_hash; row 0's
//     prev_hash is NULL).
//  3. Verifier-clean: after the append, audit_events_verify_chain()
//     returns brokenAtID == "" + rowCount == 3.
//  4. Verifier-detection: tampering with a row's `actor` (via the
//     compliance-superuser bypass; we DISABLE and then re-ENABLE the
//     WORM trigger to simulate the threat model) makes
//     audit_events_verify_chain() return the tampered row's id +
//     its 0-indexed position.
//
// Gated by testing.Short() so the default `go test ./... -short` CI
// loop doesn't require docker-in-docker.

func TestAuditEventsHashChain_FreshTable(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping integration test in short mode")
	}
	tdb := setupTestDB(t)
	defer tdb.teardown(t)
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	var brokenID string
	var brokenPos int
	var rowCount int
	row := tdb.db.QueryRowContext(ctx, `SELECT COALESCE(first_break_id, ''), first_break_pos, row_count FROM audit_events_verify_chain()`)
	if err := row.Scan(&brokenID, &brokenPos, &rowCount); err != nil {
		t.Fatalf("verify_chain on empty table: %v", err)
	}
	if brokenID != "" || rowCount != 0 {
		t.Errorf("expected clean empty walk; got brokenID=%q rowCount=%d", brokenID, rowCount)
	}
}

func TestAuditEventsHashChain_AppendLinksRows(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping integration test in short mode")
	}
	tdb := setupTestDB(t)
	defer tdb.teardown(t)
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Insert three rows in chronological order. The BEFORE-INSERT
	// trigger populates prev_hash + row_hash on each.
	for i, id := range []string{"audit-chain-001", "audit-chain-002", "audit-chain-003"} {
		_, err := tdb.db.ExecContext(ctx, `
			INSERT INTO audit_events (id, actor, actor_type, action, resource_type, resource_id, details, timestamp)
			VALUES ($1, 'tester', 'User', $2, 'certificate', 'mc-test', '{}'::jsonb, NOW() + ($3 || ' microsecond')::interval)
		`, id, fmt.Sprintf("action_%d", i), fmt.Sprintf("%d", i))
		if err != nil {
			t.Fatalf("insert %s: %v", id, err)
		}
	}

	// Pull the three rows back in chain order. The first row's
	// prev_hash MUST be NULL (genesis); each subsequent row's
	// prev_hash MUST equal the previous row's row_hash.
	rows, err := tdb.db.QueryContext(ctx, `
		SELECT id, prev_hash, row_hash
		FROM audit_events
		ORDER BY timestamp ASC, id ASC
	`)
	if err != nil {
		t.Fatalf("select chain: %v", err)
	}
	defer rows.Close()

	type chainRow struct {
		ID       string
		PrevHash *string
		RowHash  string
	}
	var chain []chainRow
	for rows.Next() {
		var r chainRow
		if err := rows.Scan(&r.ID, &r.PrevHash, &r.RowHash); err != nil {
			t.Fatalf("scan: %v", err)
		}
		chain = append(chain, r)
	}
	// rows.Next() returns false on both exhaustion and error, so check
	// for an iteration error before trusting the row count.
	if err := rows.Err(); err != nil {
		t.Fatalf("iterate chain: %v", err)
	}
	if len(chain) != 3 {
		t.Fatalf("expected 3 rows, got %d", len(chain))
	}
	if chain[0].PrevHash != nil {
		t.Errorf("row 0 prev_hash should be NULL (genesis); got %q", *chain[0].PrevHash)
	}
	if chain[0].RowHash == "" {
		t.Errorf("row 0 row_hash should be non-empty")
	}
	for i := 1; i < len(chain); i++ {
		if chain[i].PrevHash == nil {
			t.Errorf("row %d prev_hash should equal row %d row_hash; got NULL", i, i-1)
			continue
		}
		// Dereference before formatting so the message shows the hash
		// rather than a pointer address.
		if *chain[i].PrevHash != chain[i-1].RowHash {
			t.Errorf("row %d prev_hash should equal row %d row_hash; prev=%s hash=%s",
				i, i-1, *chain[i].PrevHash, chain[i-1].RowHash)
		}
	}

	// Verifier walks clean.
	var brokenID string
	var brokenPos int
	var rowCount int
	if err := tdb.db.QueryRowContext(ctx,
		`SELECT COALESCE(first_break_id, ''), first_break_pos, row_count FROM audit_events_verify_chain()`,
	).Scan(&brokenID, &brokenPos, &rowCount); err != nil {
		t.Fatalf("verify_chain: %v", err)
	}
	if brokenID != "" || rowCount != 3 {
		t.Errorf("verifier should report clean walk over 3 rows; got brokenID=%q pos=%d rows=%d",
			brokenID, brokenPos, rowCount)
	}
}

func TestAuditEventsHashChain_VerifierDetectsTampering(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping integration test in short mode")
	}
	tdb := setupTestDB(t)
	defer tdb.teardown(t)
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Seed three rows. Use deterministic timestamps so the walk order
	// is unambiguous (timestamp ASC, id ASC).
	base := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	ids := []string{"audit-chain-t-001", "audit-chain-t-002", "audit-chain-t-003"}
	for i, id := range ids {
		_, err := tdb.db.ExecContext(ctx, `
			INSERT INTO audit_events (id, actor, actor_type, action, resource_type, resource_id, details, timestamp)
			VALUES ($1, 'tester', 'User', $2, 'certificate', 'mc-test', '{}'::jsonb, $3)
		`, id, fmt.Sprintf("action_%d", i), base.Add(time.Duration(i)*time.Second))
		if err != nil {
			t.Fatalf("insert %s: %v", id, err)
		}
	}

	// Simulate the compliance-superuser threat model: temporarily
	// disable the WORM trigger and rewrite the middle row's actor.
	// (Production deployments don't have routine ability to do this;
	// the threat is a backup-restore operator with PG-superuser
	// credentials, or post-compromise persistence.)
	if _, err := tdb.db.ExecContext(ctx, `ALTER TABLE audit_events DISABLE TRIGGER audit_events_worm_trigger`); err != nil {
		t.Fatalf("disable worm: %v", err)
	}
	if _, err := tdb.db.ExecContext(ctx, `UPDATE audit_events SET actor = 'tampered' WHERE id = $1`, ids[1]); err != nil {
		t.Fatalf("tamper update: %v", err)
	}
	if _, err := tdb.db.ExecContext(ctx, `ALTER TABLE audit_events ENABLE TRIGGER audit_events_worm_trigger`); err != nil {
		t.Fatalf("enable worm: %v", err)
	}

	// Verifier MUST detect the break at position 1 (the middle row's
	// 0-indexed position).
	var brokenID string
	var brokenPos int
	var rowCount int
	if err := tdb.db.QueryRowContext(ctx,
		`SELECT COALESCE(first_break_id, ''), first_break_pos, row_count FROM audit_events_verify_chain()`,
	).Scan(&brokenID, &brokenPos, &rowCount); err != nil {
		t.Fatalf("verify_chain: %v", err)
	}
	if brokenID != ids[1] {
		t.Errorf("expected break at %s; got %s", ids[1], brokenID)
	}
	if brokenPos != 1 {
		t.Errorf("expected break position 1; got %d", brokenPos)
	}
	if rowCount != 2 {
		// rowCount is "rows walked through the break"; the verifier
		// returns immediately on first mismatch so rowCount should be
		// position + 1 = 2.
		t.Errorf("expected row_count = 2 (walked through the break); got %d", rowCount)
	}
}

// The blank-identifier reference keeps the encoding/json import in
// use: Go rejects unused imports at compile time, and the active test
// bodies don't reference the package. It leaves room for future
// hash-chain tests that exercise details JSONB determinism without
// re-importing.
var _ = json.RawMessage(nil)