mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 21:21:40 +00:00

shankar0123 43836aca7c feat(audit): COMP-001-HASH — per-row hash chain on audit_events (tamper-evidence)

Sprint 6 closure of the audit's HIGH-severity COMP-001-HASH finding.

Pre-fix posture: migration 000018 installs a WORM trigger on
audit_events that blocks UPDATE / DELETE for the application role.
But the trigger header itself documents a compliance-superuser
bypass (backup restore, retention purges, breach recovery). Without
a hash chain, that role can rewrite any row's actor / action /
details / timestamp / event_category with no on-disk trace.

HIPAA §164.312(b), FedRAMP AU-9, NIST 800-53 AU-10 want tamper-
EVIDENCE, not just tamper-prevention. This commit ships the
evidence layer.

Wire shape:

  migrations/000047_audit_events_hash_chain.up.sql
    + pgcrypto extension (digest function)
    + audit_chain_head: single-row sentinel table holding the most
      recent row_hash; FOR UPDATE row-lock serialises chain writes
      under concurrent INSERTs so two parallel writers can't read
      the same prev_hash and produce a forked chain
    + audit_events: prev_hash + row_hash columns
    + audit_events_canonical_payload(): centralised hash input
      builder. UTC + microsecond ISO-8601 keeps the hash session-
      timezone-independent. All columns separated by '|' so a
      concatenation-ambiguity exploit can't fabricate a collision
    + audit_events_compute_hash_chain(): BEFORE-INSERT trigger
      function. Reads sentinel FOR UPDATE → computes
      sha256(prev_hash || id || actor || actor_type || action ||
      resource_type || resource_id || details::text ||
      timestamp_utc_iso || event_category) → writes both columns +
      advances the sentinel
    + backfill loop walks every existing row in (timestamp ASC, id
      ASC) order; WORM trigger temporarily DISABLEd inside this
      migration's transaction so backfill UPDATEs land cleanly,
      ENABLEd before COMMIT
    + audit_events_verify_chain(): STABLE plpgsql verifier. Walks
      the chain end-to-end and returns the first break:
        (first_break_id TEXT, first_break_pos INT, row_count INT)

  internal/repository/postgres/audit.go
    + AuditRepository.VerifyHashChain — calls the SQL function and
      maps the OUT parameters to Go return values

  internal/repository/interfaces.go
    + AuditRepository.VerifyHashChain in the contract; every
      in-memory mock + stub picks up the no-op implementation

  internal/scheduler/scheduler.go
    + AuditChainVerifier + AuditChainBreakRecorder interfaces
    + auditChainVerifyInterval (default 6h)
    + auditChainVerifyLoop: runs once on start + every tick;
      atomic.Bool guard + 5-min per-tick context timeout match every
      other GC loop's pattern

  internal/service/audit_chain_metric.go
    + AuditChainCounter type with atomic counters. Sticky-first-
      detection on (BrokenAtID, BrokenAtPos) so the actionable
      alarm doesn't drift across walks. Snapshot() returns the
      full state for the metrics handler

  internal/api/handler/metrics.go
    + AuditChainCounterSnapshotter interface + Prometheus
      exposition for four series:
        certctl_audit_chain_break_detected_total counter (the alarm)
        certctl_audit_chain_verify_total          counter (walks done)
        certctl_audit_chain_rows                  gauge (last walk size)
        certctl_audit_chain_last_verified_at      gauge (unix seconds)

  internal/config/config.go
    + AuditChainConfig{ VerifyInterval } + CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL

  cmd/server/main.go
    + wires AuditChainCounter into both the scheduler (recorder) +
      metrics handler (snapshotter) — single instance shared so the
      writer + reader are guaranteed to converge

  internal/repository/postgres/audit_chain_test.go (NEW)
    + TestAuditEventsHashChain_FreshTable: empty walk → clean
    + TestAuditEventsHashChain_AppendLinksRows: three INSERTs
      produce a strictly-linked chain; prev_hash on row 0 is NULL;
      verifier walks clean over the 3 rows
    + TestAuditEventsHashChain_VerifierDetectsTampering: simulate
      the compliance-superuser threat model (DISABLE WORM, UPDATE
      a middle row, ENABLE WORM); verifier returns the tampered
      row's id at position 1

  docs/operator/audit-chain.md (NEW)
    + Layered-defenses explainer (WORM + hash chain). Verifier
      function reference. Recommended Prometheus alert rule.
      Performance scaling table (10k to 10M rows). Step-by-step
      runbook for what to do when a break is detected. Operator
      configuration table.

  Test-stub additions for AuditRepository.VerifyHashChain:
    internal/service/testutil_test.go  — mockAuditRepo
    internal/service/acme_test.go      — fakeAuditRepo
    internal/integration/lifecycle_test.go — mockAuditRepository
    internal/api/handler/scep_intune_e2e_test.go — intuneE2EAuditRepo

Verified locally:
  go vet ./...                                          (clean)
  gofmt -l internal/ cmd/                               (clean)
  go test -short -count=1 ./internal/scheduler/... ./internal/config/...
    ./internal/service/... ./internal/api/handler/... ./internal/repository/...
    (all green)

Verified with testcontainers + postgres:16-alpine + the migration
runner (not gated under -short — requires docker):
  go test -count=1 -run TestAuditEventsHashChain ./internal/repository/postgres/...

Closes COMP-001-HASH leg of Sprint 6. COMP-002-RETENTION lands in
the next commit (separate concern: federated-user PII retention).

2026-05-16 06:17:15 +00:00

6.7 KiB


Audit-trail tamper-evidence (audit_events hash chain)

Last reviewed: 2026-05-16

Sprint 6 COMP-001-HASH closure. The audit_events table has two layered defenses against history rewrites:

| Layer | Migration | What it blocks |
|---|---|---|
| WORM trigger | `000018_audit_events_worm.up.sql` | The application role cannot `UPDATE` or `DELETE` rows (tamper-prevention). |
| Hash chain | `000047_audit_events_hash_chain.up.sql` | A compliance superuser (DB-superuser-equivalent) who bypasses the WORM trigger CAN still rewrite rows, but the rewrite is detectable: every subsequent `audit_events_verify_chain()` walk reports the first broken row's id + position (tamper-evidence). |

This document covers the hash-chain layer. The WORM layer is documented inline in migrations/000018_audit_events_worm.up.sql.

Why a hash chain in addition to WORM

The WORM trigger documents (in its header comment) that a compliance superuser role exists by design — backup-restore, retention purges, and breach-recovery operators need a way through. Without a hash chain, that role can rewrite any row's actor / action / details content with no on-disk trace.

HIPAA §164.312(b), FedRAMP AU-9, and NIST 800-53 AU-10 want tamper-evidence, not just tamper-prevention. The hash chain provides it: every row carries a row_hash = sha256(prev_hash || id || actor || actor_type || action || resource_type || resource_id || details::text || timestamp_iso8601_utc || event_category), and the genesis row's prev_hash is NULL. Mutating any field in any row breaks the chain at that row's position; the verifier returns the first break.
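The linking rule can be illustrated outside the database. The following is a minimal Go sketch, not the migration's plpgsql: the `Row` type, the reduced column set, and the helper names are hypothetical, and the plain `'|'` join does not escape separators inside field values. It shows why mutating one field breaks the chain at exactly that row's position.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// Row is a hypothetical, trimmed-down audit row; the real table has more columns.
type Row struct {
	ID, Actor, Action, Details string
	PrevHash                   string // "" stands in for SQL NULL on the genesis row
	RowHash                    string
}

// hashRow hashes prev_hash plus the row's fields, joined with '|' so that
// field boundaries are explicit in the hashed payload.
func hashRow(prev string, r Row) string {
	payload := strings.Join([]string{prev, r.ID, r.Actor, r.Action, r.Details}, "|")
	sum := sha256.Sum256([]byte(payload))
	return hex.EncodeToString(sum[:])
}

// appendRow links r to the current chain head and returns the extended chain.
func appendRow(chain []Row, r Row) []Row {
	if n := len(chain); n > 0 {
		r.PrevHash = chain[n-1].RowHash
	}
	r.RowHash = hashRow(r.PrevHash, r)
	return append(chain, r)
}

// verifyChain returns the 0-indexed position of the first break, or -1 if
// every row's stored hash matches the recomputed one.
func verifyChain(chain []Row) int {
	prev := ""
	for i, r := range chain {
		if r.PrevHash != prev || r.RowHash != hashRow(prev, r) {
			return i
		}
		prev = r.RowHash
	}
	return -1
}

func main() {
	var chain []Row
	for _, id := range []string{"a", "b", "c"} {
		chain = appendRow(chain, Row{ID: id, Actor: "alice", Action: "issue", Details: "{}"})
	}
	fmt.Println(verifyChain(chain)) // -1: chain is intact

	chain[1].Actor = "mallory" // simulate a superuser rewrite of the middle row
	fmt.Println(verifyChain(chain)) // 1: first break is the tampered row
}
```

A tamperer who also recomputes the rewritten row's `RowHash` moves the break to the next row instead, because that row's stored `PrevHash` no longer matches.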

The verifier function

audit_events_verify_chain() is a STABLE plpgsql function shipped in migration 000047. It walks every row in (timestamp ASC, id ASC) order, recomputes each row's expected hash, and returns:

first_break_id  TEXT  -- NULL if the chain validated end-to-end
first_break_pos INT   -- 0-indexed position of the first break
row_count       INT   -- rows walked (= position + 1 on break, else table size)

Call it directly from psql:

SELECT first_break_id, first_break_pos, row_count FROM audit_events_verify_chain();

Scheduled verification + Prometheus exposure

The scheduler's auditChainVerifyLoop calls the verifier every CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL (default 6h) and writes the results into the AuditChainCounter instance shared with the metrics handler. Four metrics get exposed at /api/v1/metrics/prometheus:

| Metric | Type | Meaning |
|---|---|---|
| `certctl_audit_chain_break_detected_total` | counter | Sticky once non-zero; the actionable alarm. |
| `certctl_audit_chain_verify_total` | counter | Walks completed. Cross-check that the loop is alive. |
| `certctl_audit_chain_rows` | gauge | Most recent walk's row count. |
| `certctl_audit_chain_last_verified_at` | gauge | Unix seconds of most recent walk (0 = never). |
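As a rough illustration of how one shared instance can feed all four series, here is a minimal Go sketch of a sticky-first-detection counter. It is not the `AuditChainCounter` source: the type, field, and method names are hypothetical, and incrementing the break counter once per walk that reports a break is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// chainCounter is a hypothetical stand-in for the counter shared between the
// scheduler (writer) and the metrics handler (reader).
type chainCounter struct {
	breaks, walks, rows, lastVerified atomic.Int64

	mu         sync.Mutex
	brokenID   string
	brokenPos  int
	haveBroken bool
}

// snapshot is the read-side view a metrics handler could render.
type snapshot struct {
	Breaks, Walks, Rows, LastVerified int64
	BrokenID                          string
	BrokenPos                         int
}

// RecordWalk stores the outcome of one verifier walk; breakID == "" means clean.
func (c *chainCounter) RecordWalk(breakID string, breakPos int, rowCount, nowUnix int64) {
	c.walks.Add(1)
	c.rows.Store(rowCount)
	c.lastVerified.Store(nowUnix)
	if breakID == "" {
		return
	}
	c.breaks.Add(1) // assumption: one increment per walk that reports a break
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.haveBroken { // sticky: later walks never overwrite the first detection
		c.brokenID, c.brokenPos, c.haveBroken = breakID, breakPos, true
	}
}

// Snapshot returns the current state for exposition.
func (c *chainCounter) Snapshot() snapshot {
	c.mu.Lock()
	defer c.mu.Unlock()
	return snapshot{
		Breaks: c.breaks.Load(), Walks: c.walks.Load(),
		Rows: c.rows.Load(), LastVerified: c.lastVerified.Load(),
		BrokenID: c.brokenID, BrokenPos: c.brokenPos,
	}
}

func main() {
	c := &chainCounter{}
	c.RecordWalk("", 0, 10, 100)      // clean walk
	c.RecordWalk("evt-7", 3, 12, 200) // first break detected
	c.RecordWalk("evt-2", 1, 12, 300) // later walk reports a different row
	fmt.Printf("%+v\n", c.Snapshot()) // BrokenID stays "evt-7"
}
```

Keeping the first `(id, position)` pair fixed means the value an operator sees at alert time is the one that triggered the alert, even if later walks report a different row.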

The recommended alert rule is:

groups:
  - name: certctl-audit-chain  # placeholder group name
    rules:
      - alert: AuditChainBreak
        expr: certctl_audit_chain_break_detected_total > 0
        for: 1m
        labels:
          severity: page
          category: compliance
        annotations:
          summary: "audit_events hash chain break detected; investigate immediately"
          runbook: "<your-runbook-url>/audit-chain-break"

Cross-check certctl_audit_chain_last_verified_at (should advance roughly every CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL) and certctl_audit_chain_verify_total (should increment monotonically). A stalled _verified_at with an unchanged _verify_total means the scheduler loop has died — page on that too.
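The liveness cross-check can be expressed as a comparison of two scrapes. This is a minimal Go sketch under stated assumptions: the `sample` type and `loopStalled` helper are hypothetical, and the grace period of two full intervals is an illustrative choice, not a documented threshold.

```go
package main

import (
	"fmt"
	"time"
)

// verifyInterval mirrors the documented 6h default.
const verifyInterval = 6 * time.Hour

// sample is one scrape of the two liveness series.
type sample struct {
	VerifyTotal    int64 // certctl_audit_chain_verify_total
	LastVerifiedAt int64 // certctl_audit_chain_last_verified_at (unix seconds, 0 = never)
}

// loopStalled compares two scrapes taken `elapsed` apart. It reports a stall
// only when neither series moved and at least two intervals have passed
// (assumed grace period, so a single slow walk does not page anyone).
func loopStalled(prev, cur sample, elapsed, interval time.Duration) bool {
	if elapsed < 2*interval {
		return false
	}
	return cur.VerifyTotal == prev.VerifyTotal && cur.LastVerifiedAt == prev.LastVerifiedAt
}

func main() {
	before := sample{VerifyTotal: 3, LastVerifiedAt: 1000}
	after := sample{VerifyTotal: 3, LastVerifiedAt: 1000}
	fmt.Println(loopStalled(before, after, 13*time.Hour, verifyInterval))
}
```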

Performance notes

The walk is O(N) plpgsql over the audit_events table. On testcontainers + postgres:16-alpine the cost scales linearly:

| Row count | Walk duration (approx) |
|---|---|
| 10k | < 50 ms |
| 100k | < 500 ms |
| 1M | 2-3 s |
| 10M | 25-30 s |

A 5-minute per-tick context timeout (in internal/scheduler/scheduler.go::runAuditChainVerify) bounds the worst case. Fleets with > 10M audit rows should consider:

- Lengthening CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL to 24h.
- Pre-aggregating older rows (out of scope today: this would require a "chain checkpoint" concept that re-anchors the genesis hash to a snapshot's row_hash; future work if needed).
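The per-tick bound can be pictured as a guard plus a context deadline. The following Go sketch is not the scheduler's code: `guard`, `runTick`, and the two demo walk functions are hypothetical names, and the 20 ms demo timeout exists only so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

const (
	tickTimeout = 5 * time.Minute       // the documented per-tick bound
	demoTimeout = 20 * time.Millisecond // short stand-in so the demo finishes quickly
)

var errAlreadyRunning = errors.New("previous walk still running")

// guard skips a tick while an earlier walk is still in flight.
type guard struct{ running atomic.Bool }

// runTick runs one verification walk, bounded by timeout.
func (g *guard) runTick(timeout time.Duration, walk func(context.Context) error) error {
	if !g.running.CompareAndSwap(false, true) {
		return errAlreadyRunning // overlapping ticks are dropped, not queued
	}
	defer g.running.Store(false)

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return walk(ctx)
}

// fastWalk stands in for a verifier call that finishes immediately.
func fastWalk(ctx context.Context) error { return ctx.Err() }

// slowWalk stands in for a walk over a very large table: it runs until the
// context deadline fires or one second passes, whichever comes first.
func slowWalk(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(time.Second):
		return nil
	}
}

func main() {
	g := &guard{}
	fmt.Println(g.runTick(tickTimeout, fastWalk)) // completes well inside the bound
	fmt.Println(g.runTick(demoTimeout, slowWalk)) // cut off by the deadline
}
```

A walk that hits the deadline returns an error for that tick only; the guard is released, so the next tick starts a fresh walk.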

What to do when a break is detected

1. Don't panic, don't auto-remediate. The break is a forensic signal, not a self-healing event.
2. Capture the position + id. The metric exposes both, but the sticky in-memory state (AuditChainCounter.BrokenAtID) only records the first break detected. Run the verifier yourself in SQL to confirm the current first break (it stops at the first mismatch, so later mutations only surface through the snapshot diff in step 5):
   ```
   SELECT first_break_id, first_break_pos, row_count FROM audit_events_verify_chain();
   ```
3. Snapshot the table. pg_dump --table=audit_events --data-only to a chain-of-custody location. The next investigative step is recovering the original row content from the most recent backup that pre-dates the tampering; without this snapshot you can't tell which write order caused the divergence.
4. Audit the compliance-superuser credential trail. The break implies someone with non-app DB credentials wrote to audit_events. Rotate the credential, investigate every recent session that authenticated under it, and review the WAL for the write.
5. Restore + cross-reference. If you keep streaming WAL or periodic snapshots, restore a known-good snapshot to a sandbox and EXCEPT-diff the two audit_events tables to enumerate every mutated row.

Backfill behavior

Migration 000047 backfills existing audit_events rows in (timestamp ASC, id ASC) order during its transaction. The WORM trigger is DISABLEd for the duration of the backfill and ENABLEd again before COMMIT, so its state is unchanged once the migration commits. The migration is idempotent: a re-run treats only rows with row_hash IS NULL as backfill targets, so already-hashed rows are not touched.

Once backfill completes, row_hash becomes NOT NULL. prev_hash remains nullable so the genesis row (first row in the chain) stays representable.

Operator configuration

| Env var | Default | Notes |
|---|---|---|
| `CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL` | `6h` | Tick cadence for the scheduler's verify loop. Zero or negative is ignored. |
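The "zero or negative is ignored" rule can be sketched as follows. This assumes the value is written in Go duration syntax, as the `6h` default suggests, and that empty or malformed input also falls back to the default; `verifyIntervalFrom` is a hypothetical helper, not the function in internal/config/config.go.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultVerifyInterval mirrors the documented 6h default.
const defaultVerifyInterval = 6 * time.Hour

// verifyIntervalFrom parses a raw env value. Zero or negative durations are
// ignored per the table above; treating empty or malformed input the same
// way is an assumption of this sketch.
func verifyIntervalFrom(raw string) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultVerifyInterval
	}
	return d
}

func main() {
	raw := os.Getenv("CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL")
	fmt.Println(verifyIntervalFrom(raw))
}
```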
