mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 22:51:30 +00:00
43836aca7c
Sprint 6 closure of the audit's HIGH-severity COMP-001-HASH finding.

Pre-fix posture: migration 000018 installs a WORM trigger on
audit_events that blocks UPDATE / DELETE for the application role.
But the trigger header itself documents a compliance-superuser
bypass (backup restore, retention purges, breach recovery). Without
a hash chain, that role can rewrite any row's actor / action /
details / timestamp / event_category with no on-disk trace.

HIPAA §164.312(b), FedRAMP AU-9, NIST 800-53 AU-10 want
tamper-EVIDENCE, not just tamper-prevention. This commit ships the
evidence layer.

Wire shape:

  migrations/000047_audit_events_hash_chain.up.sql
    + pgcrypto extension (digest function)
    + audit_chain_head: single-row sentinel table holding the most
      recent row_hash; FOR UPDATE row-lock serialises chain writes
      under concurrent INSERTs so two parallel writers can't read
      the same prev_hash and produce a forked chain
    + audit_events: prev_hash + row_hash columns
    + audit_events_canonical_payload(): centralised hash input
      builder. UTC + microsecond ISO-8601 keeps the hash
      session-timezone-independent. All columns separated by '|' so a
      concatenation-ambiguity exploit can't fabricate a collision
    + audit_events_compute_hash_chain(): BEFORE-INSERT trigger
      function. Reads sentinel FOR UPDATE → computes
      sha256(prev_hash || id || actor || actor_type || action ||
      resource_type || resource_id || details::text ||
      timestamp_utc_iso || event_category) → writes both columns +
      advances the sentinel
    + backfill loop walks every existing row in (timestamp ASC, id
      ASC) order; WORM trigger temporarily DISABLEd inside this
      migration's transaction so backfill UPDATEs land cleanly,
      ENABLEd before COMMIT
    + audit_events_verify_chain(): STABLE plpgsql verifier. Walks
      the chain end-to-end and returns the first break:
      (first_break_id TEXT, first_break_pos INT, row_count INT)

  internal/repository/postgres/audit.go
    + AuditRepository.VerifyHashChain: calls the SQL function and
      maps the OUT parameters to Go return values

  internal/repository/interfaces.go
    + AuditRepository.VerifyHashChain in the contract; every
      in-memory mock + stub picks up the no-op implementation

  internal/scheduler/scheduler.go
    + AuditChainVerifier + AuditChainBreakRecorder interfaces
    + auditChainVerifyInterval (default 6h)
    + auditChainVerifyLoop: runs once on start + every tick;
      atomic.Bool guard + 5-min per-tick context timeout match every
      other GC loop's pattern

  internal/service/audit_chain_metric.go
    + AuditChainCounter type with atomic counters.
      Sticky-first-detection on (BrokenAtID, BrokenAtPos) so the
      actionable alarm doesn't drift across walks. Snapshot() returns
      the full state for the metrics handler

  internal/api/handler/metrics.go
    + AuditChainCounterSnapshotter interface + Prometheus
      exposition for four series:
        certctl_audit_chain_break_detected_total  counter (the alarm)
        certctl_audit_chain_verify_total          counter (walks done)
        certctl_audit_chain_rows                  gauge (last walk size)
        certctl_audit_chain_last_verified_at      gauge (unix seconds)

  internal/config/config.go
    + AuditChainConfig{ VerifyInterval } + CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL

  cmd/server/main.go
    + wires AuditChainCounter into both the scheduler (recorder) +
      metrics handler (snapshotter); a single instance is shared so
      the writer + reader are guaranteed to converge

  internal/repository/postgres/audit_chain_test.go (NEW)
    + TestAuditEventsHashChain_FreshTable: empty walk → clean
    + TestAuditEventsHashChain_AppendLinksRows: three INSERTs
      produce a strictly-linked chain; prev_hash on row 0 is NULL;
      verifier walks clean over the 3 rows
    + TestAuditEventsHashChain_VerifierDetectsTampering: simulate
      the compliance-superuser threat model (DISABLE WORM, UPDATE
      a middle row, ENABLE WORM); verifier returns the tampered
      row's id at position 1

  docs/operator/audit-chain.md (NEW)
    + Layered-defenses explainer (WORM + hash chain). Verifier
      function reference. Recommended Prometheus alert rule.
      Performance scaling table (10k to 10M rows). Step-by-step
      runbook for what to do when a break is detected. Operator
      configuration table.

Test-stub additions for AuditRepository.VerifyHashChain:

  internal/service/testutil_test.go: mockAuditRepo
  internal/service/acme_test.go: fakeAuditRepo
  internal/integration/lifecycle_test.go: mockAuditRepository
  internal/api/handler/scep_intune_e2e_test.go: intuneE2EAuditRepo

Verified locally:

  go vet ./...             (clean)
  gofmt -l internal/ cmd/  (clean)
  go test -short -count=1 ./internal/scheduler/... ./internal/config/...
    ./internal/service/... ./internal/api/handler/... ./internal/repository/...
    (all green)

Verified with testcontainers + postgres:16-alpine + the migration
runner (not gated under -short; requires docker):

  go test -count=1 -run TestAuditEventsHashChain ./internal/repository/postgres/...

Closes COMP-001-HASH leg of Sprint 6. COMP-002-RETENTION lands in
the next commit (separate concern: federated-user PII retention).

162 lines
6.7 KiB
Markdown
# Audit-trail tamper-evidence (audit_events hash chain)

> Last reviewed: 2026-05-16

Sprint 6 COMP-001-HASH closure. The `audit_events` table has two
layered defenses against history rewrites:

| Layer | Migration | What it blocks |
|---|---|---|
| **WORM trigger** | `000018_audit_events_worm.up.sql` | The application role cannot `UPDATE` or `DELETE` rows (tamper-**prevention**). |
| **Hash chain** | `000047_audit_events_hash_chain.up.sql` | A compliance superuser (DB-superuser-equivalent) who bypasses the WORM trigger CAN still rewrite rows, but the rewrite is **detectable**: every subsequent `audit_events_verify_chain()` walk reports the first broken row's id + position (tamper-**evidence**). |

This document covers the hash-chain layer. The WORM layer is
documented inline in `migrations/000018_audit_events_worm.up.sql`.

## Why a hash chain in addition to WORM

The WORM trigger documents (in its header comment) that a compliance
superuser role exists by design: backup-restore, retention purges,
and breach-recovery operators need a way through. Without a hash
chain, that role can rewrite any row's `actor` / `action` / `details`
content with no on-disk trace.

HIPAA §164.312(b), FedRAMP AU-9, and NIST 800-53 AU-10 want
tamper-**evidence**, not just tamper-prevention. The hash chain
provides it: every row carries a `row_hash = sha256(prev_hash || id
|| actor || actor_type || action || resource_type || resource_id
|| details::text || timestamp_iso8601_utc || event_category)`, and
the genesis row's `prev_hash` is `NULL`. Mutating any field in any
row breaks the chain at that row's position; the verifier returns
the first break.

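
As a rough illustration of the chaining rule above, the following Go sketch builds a canonical payload and links rows in memory. It is not the migration's plpgsql: `AuditEvent`, `canonicalPayload`, `rowHash`, and `appendEvent` are hypothetical names, the `|` separator and the microsecond UTC timestamp follow the description in the migration's commit message, and the exact byte layout produced by `audit_events_canonical_payload()` may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
	"time"
)

// AuditEvent is a hypothetical in-memory stand-in for one audit_events row.
type AuditEvent struct {
	ID, Actor, ActorType, Action string
	ResourceType, ResourceID     string
	Details                      string // stands in for details::text
	Timestamp                    time.Time
	EventCategory                string
	PrevHash                     *string // nil on the genesis row (SQL NULL)
	RowHash                      string
}

// canonicalPayload joins the hashed columns with '|' and renders the
// timestamp in UTC with microsecond precision (assumed layout).
func canonicalPayload(prevHash string, e AuditEvent) string {
	ts := e.Timestamp.UTC().Format("2006-01-02T15:04:05.000000Z")
	return strings.Join([]string{
		prevHash, e.ID, e.Actor, e.ActorType, e.Action,
		e.ResourceType, e.ResourceID, e.Details, ts, e.EventCategory,
	}, "|")
}

// rowHash returns the hex-encoded SHA-256 of the canonical payload.
func rowHash(prevHash string, e AuditEvent) string {
	sum := sha256.Sum256([]byte(canonicalPayload(prevHash, e)))
	return hex.EncodeToString(sum[:])
}

// appendEvent links e to the current chain head and returns the new chain.
// The first event keeps a nil PrevHash, mirroring the NULL genesis row.
func appendEvent(chain []AuditEvent, e AuditEvent) []AuditEvent {
	prev := ""
	e.PrevHash = nil
	if n := len(chain); n > 0 {
		prev = chain[n-1].RowHash
		e.PrevHash = &prev
	}
	e.RowHash = rowHash(prev, e)
	return append(chain, e)
}
```

Because each `RowHash` covers the previous row's hash, changing any hashed field of a stored event yields a different recomputed hash for that row, which is the property the verifier relies on.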
## The verifier function

`audit_events_verify_chain()` is a STABLE plpgsql function shipped
in migration 000047. It walks every row in `(timestamp ASC, id ASC)`
order, recomputes each row's expected hash, and returns:

```
first_break_id   TEXT  -- NULL if the chain validated end-to-end
first_break_pos  INT   -- 0-indexed position of the first break
row_count        INT   -- rows walked (= position + 1 on break, else table size)
```

Call it directly from psql:

```sql
SELECT first_break_id, first_break_pos, row_count FROM audit_events_verify_chain();
```

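
The walk described above can be sketched in Go over an in-memory slice. This is only a model of the documented contract (ordering, first break, row count), not the plpgsql source: the `row` type, its pre-joined `Payload` field, `hashRow`, and `verifyChain` are hypothetical, and the check that a row's stored `PrevHash` matches its predecessor is an assumption about how the real function detects breaks.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"time"
)

// row is a hypothetical, simplified audit_events row. Payload stands in
// for the already-joined canonical columns; "" in PrevHash stands in for NULL.
type row struct {
	ID        string
	Timestamp time.Time
	Payload   string
	PrevHash  string
	RowHash   string
}

// hashRow recomputes the expected hash for a payload given its predecessor.
func hashRow(prevHash, payload string) string {
	sum := sha256.Sum256([]byte(prevHash + "|" + payload))
	return hex.EncodeToString(sum[:])
}

// verifyChain walks rows in (timestamp ASC, id ASC) order and reports the
// first row whose link or hash does not match. On a break, rowCount is
// position + 1; on a clean walk it is the number of rows.
func verifyChain(rows []row) (breakID string, breakPos, rowCount int, broken bool) {
	sorted := append([]row(nil), rows...) // do not reorder the caller's slice
	sort.SliceStable(sorted, func(i, j int) bool {
		if !sorted[i].Timestamp.Equal(sorted[j].Timestamp) {
			return sorted[i].Timestamp.Before(sorted[j].Timestamp)
		}
		return sorted[i].ID < sorted[j].ID
	})
	prev := ""
	for pos, r := range sorted {
		if r.PrevHash != prev || r.RowHash != hashRow(prev, r.Payload) {
			return r.ID, pos, pos + 1, true
		}
		prev = r.RowHash
	}
	return "", 0, len(sorted), false
}
```

The extra `broken` return value plays the role of the SQL function's `NULL` `first_break_id`, since a Go string cannot be null.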
## Scheduled verification + Prometheus exposure

The scheduler's `auditChainVerifyLoop` calls the verifier every
`CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL` (default 6h) and writes the
results into the `AuditChainCounter` instance shared with the
metrics handler. Four metrics are exposed at
`/api/v1/metrics/prometheus`:

| Metric | Type | Meaning |
|---|---|---|
| `certctl_audit_chain_break_detected_total` | counter | Sticky once non-zero: the actionable alarm. |
| `certctl_audit_chain_verify_total` | counter | Walks completed. Cross-check that the loop is alive. |
| `certctl_audit_chain_rows` | gauge | Most recent walk's row count. |
| `certctl_audit_chain_last_verified_at` | gauge | Unix seconds of most recent walk (0 = never). |

The recommended alert rule, written in Prometheus 2.x rule-file
syntax as an entry under a rule group's `rules:` list, is:

```yaml
- alert: AuditChainBreak
  expr: certctl_audit_chain_break_detected_total > 0
  for: 1m
  labels:
    severity: page
    category: compliance
  annotations:
    summary: "audit_events hash chain break detected, investigate immediately"
    runbook: "<your-runbook-url>/audit-chain-break"
```

Cross-check `certctl_audit_chain_last_verified_at` (should advance
roughly every `CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL`) and
`certctl_audit_chain_verify_total` (should increment monotonically).
A stalled `_verified_at` with an unchanged `_verify_total` means the
scheduler loop has died; page on that too.

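
To make the "sticky" alarm and the liveness cross-check concrete, here is a minimal Go sketch of a recorder with a snapshot method. It is a simplified stand-in, not the code in `internal/service/audit_chain_metric.go`: the names `sketchChainCounter`, `ChainSnapshot`, `RecordWalk`, and `Stale` are hypothetical, it uses a mutex where the commit message mentions atomic counters, it assumes the break counter increments on every walk that sees a break, and the "older than two intervals" staleness threshold is an arbitrary choice for illustration.

```go
package main

import "sync"

// ChainSnapshot is a hypothetical copy of the state a metrics handler
// would render as the four series listed above.
type ChainSnapshot struct {
	BreakDetectedTotal uint64
	VerifyTotal        uint64
	Rows               int
	LastVerifiedAt     int64 // unix seconds, 0 = never
	BrokenAtID         string
	BrokenAtPos        int
}

// sketchChainCounter records walk results; one instance would be shared
// by the writer (scheduler) and the reader (metrics handler).
type sketchChainCounter struct {
	mu   sync.Mutex
	snap ChainSnapshot
}

// RecordWalk stores one walk's outcome. An empty breakID means the chain
// validated. The first detected break is kept even if later walks report
// a different row, so the alarm does not drift.
func (c *sketchChainCounter) RecordWalk(breakID string, breakPos, rows int, atUnix int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.snap.VerifyTotal++
	c.snap.Rows = rows
	c.snap.LastVerifiedAt = atUnix
	if breakID == "" {
		return
	}
	if c.snap.BreakDetectedTotal == 0 {
		c.snap.BrokenAtID, c.snap.BrokenAtPos = breakID, breakPos
	}
	c.snap.BreakDetectedTotal++
}

// Snapshot returns a copy of the current state.
func (c *sketchChainCounter) Snapshot() ChainSnapshot {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.snap
}

// Stale reports whether no walk has been recorded, or the last one is
// older than two verify intervals (assumed threshold).
func (s ChainSnapshot) Stale(nowUnix, intervalSeconds int64) bool {
	return s.LastVerifiedAt == 0 || nowUnix-s.LastVerifiedAt > 2*intervalSeconds
}
```

Sharing a single instance between the scheduler and the metrics handler is what lets the exposed values reflect the most recent walk without any extra synchronization between the two.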
## Performance notes

The walk is `O(N)` plpgsql over the `audit_events` table. On
testcontainers + postgres:16-alpine the cost scales linearly:

| Row count | Walk duration (approx) |
|---|---|
| 10k | < 50 ms |
| 100k | < 500 ms |
| 1M | 2-3 s |
| 10M | 25-30 s |

A 5-minute per-tick context timeout (in
`internal/scheduler/scheduler.go::runAuditChainVerify`) bounds the
worst case. Fleets with > 10M audit rows should consider:

1. Lengthening `CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL` to 24h.
2. Pre-aggregating older rows (out of scope today; this would require a
   "chain checkpoint" concept that re-anchors the genesis hash to a
   snapshot's row_hash; future work if needed).

## What to do when a break is detected

1. **Don't panic, don't auto-remediate.** The break is a forensic
   signal, not a self-healing event.
2. **Capture the position + id.** The metric exposes both, but the
   sticky in-memory state (`AuditChainCounter.BrokenAtID`) only
   records the first break. Run the verifier yourself in SQL to
   confirm them; each call reports only the first break, so
   enumerating every mutated row is left to step 5:

   ```sql
   SELECT first_break_id, first_break_pos, row_count FROM audit_events_verify_chain();
   ```

3. **Snapshot the table.** `pg_dump --table=audit_events --data-only`
   to a chain-of-custody location. The next investigative step is
   recovering the original row content from the most recent backup
   that pre-dates the tampering; without this snapshot you can't
   tell which write order caused the divergence.
4. **Audit the compliance-superuser credential trail.** The break
   implies someone with non-app DB credentials wrote to
   `audit_events`. Rotate the credential, investigate every recent
   session that authenticated under it, and review the WAL for the
   write.
5. **Restore + cross-reference.** If you keep streaming WAL or
   periodic snapshots, restore a known-good snapshot to a sandbox
   and `EXCEPT`-diff the two `audit_events` tables to enumerate
   every mutated row.

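
The cross-reference in step 5 is naturally done in SQL, but the comparison it performs can be modelled in a few lines of Go. This sketch assumes both tables have already been loaded into maps keyed by row id; the `auditRow` type (a subset of columns) and `mutatedIDs` are hypothetical names used only for illustration.

```go
package main

import "sort"

// auditRow is a hypothetical subset of the audit_events columns.
type auditRow struct {
	ID, Actor, Action, Details string
}

// mutatedIDs returns the ids of rows that exist in the known-good snapshot
// but are missing from, or differ in, the suspect table. Rows that exist
// only in the suspect table (written after the snapshot) are not reported.
func mutatedIDs(knownGood, suspect map[string]auditRow) []string {
	var ids []string
	for id, good := range knownGood {
		cur, ok := suspect[id]
		if !ok || cur != good {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids) // stable output for reports
	return ids
}
```

This mirrors a one-directional `EXCEPT` (known-good minus suspect): it surfaces rewritten and deleted rows while ignoring legitimate appends made after the snapshot was taken.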
## Backfill behavior

Migration 000047 backfills existing `audit_events` rows in
`(timestamp ASC, id ASC)` order during its transaction. The WORM
trigger is temporarily `DISABLE`d for the duration and `ENABLE`d
again before the transaction commits, so the table's protection is
unchanged once the migration finishes. The migration is idempotent:
a re-run sees `row_hash IS NULL` rows as the only backfill targets, so
already-hashed rows are not touched.

Once backfill completes, `row_hash` becomes `NOT NULL`. `prev_hash`
remains nullable so the genesis row (first row in the chain) stays
representable.

## Operator configuration

| Env var | Default | Notes |
|---|---|---|
| `CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL` | `6h` | Tick cadence for the scheduler's verify loop. Zero or negative is ignored. |

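
One plausible reading of the table above, sketched in Go: the value is a Go duration string, and a zero, negative, or unparseable value falls back to the 6h default. The helper names `parseVerifyInterval` and `verifyIntervalFromEnv` are hypothetical, and treating "ignored" as "use the default" (including for unparseable input) is an assumption; the actual handling lives in `internal/config/config.go`.

```go
package main

import (
	"os"
	"time"
)

// defaultVerifyInterval matches the documented default of 6h.
const defaultVerifyInterval = 6 * time.Hour

// parseVerifyInterval turns a duration string such as "24h" into a tick
// cadence. Empty, invalid, zero, or negative input yields the default.
func parseVerifyInterval(raw string) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultVerifyInterval
	}
	return d
}

// verifyIntervalFromEnv reads the documented environment variable.
func verifyIntervalFromEnv() time.Duration {
	return parseVerifyInterval(os.Getenv("CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL"))
}
```

For example, under these assumptions `CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL=24h` selects a daily walk, while `0s` or `-1h` leaves the 6h cadence in place.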
## See also

- `migrations/000047_audit_events_hash_chain.up.sql`: migration source.
- `migrations/000018_audit_events_worm.up.sql`: paired WORM trigger.
- `internal/repository/postgres/audit_chain_test.go`: testcontainers integration tests.
- `internal/repository/postgres/audit_worm_test.go`: WORM behaviour tests.
- `internal/scheduler/scheduler.go::auditChainVerifyLoop`: scheduler loop.
- `internal/service/audit_chain_metric.go`: `AuditChainCounter`.
- `internal/api/handler/metrics.go`: Prometheus exposer.