feat(audit): COMP-001-HASH — per-row hash chain on audit_events (tamper-evidence)

Sprint 6 closure of the audit's HIGH-severity COMP-001-HASH finding.

Pre-fix posture: migration 000018 installs a WORM trigger on
audit_events that blocks UPDATE / DELETE for the application role.
But the trigger header itself documents a compliance-superuser
bypass (backup restore, retention purges, breach recovery). Without
a hash chain, that role can rewrite any row's actor / action /
details / timestamp / event_category with no on-disk trace.

HIPAA §164.312(b), FedRAMP AU-9, and NIST 800-53 AU-10 call for
tamper-EVIDENCE, not just tamper-prevention. This commit ships the
evidence layer.

Wire shape:

  migrations/000047_audit_events_hash_chain.up.sql
    + pgcrypto extension (digest function)
    + audit_chain_head: single-row sentinel table holding the most
      recent row_hash; FOR UPDATE row-lock serialises chain writes
      under concurrent INSERTs so two parallel writers can't read
      the same prev_hash and produce a forked chain
    + audit_events: prev_hash + row_hash columns
    + audit_events_canonical_payload(): centralised hash input
      builder. Timestamps are rendered as UTC ISO-8601 with
      microsecond precision, so the hash does not depend on the
      session time zone. Fields are joined with '|' so that moving
      characters between adjacent columns cannot yield the same
      hash input
    + audit_events_compute_hash_chain(): BEFORE-INSERT trigger
      function. Reads sentinel FOR UPDATE → computes
      sha256(prev_hash || id || actor || actor_type || action ||
      resource_type || resource_id || details::text ||
      timestamp_utc_iso || event_category) → writes both columns +
      advances the sentinel
    + backfill loop walks every existing row in (timestamp ASC, id
      ASC) order; WORM trigger temporarily DISABLEd inside this
      migration's transaction so backfill UPDATEs land cleanly,
      ENABLEd before COMMIT
    + audit_events_verify_chain(): STABLE plpgsql verifier. Walks
      the chain end-to-end and returns the first break:
        (first_break_id TEXT, first_break_pos INT, row_count INT)
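
The chain construction and the verifier walk described above can be sketched in Go as an in-memory model. This is an illustration under assumptions, not the migration's SQL: the Event type, Append, and VerifyChain are hypothetical names, the exact field order and timestamp layout are guesses based on the list above, and row_count is assumed to be the total number of rows even when a break is found.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
	"time"
)

// Event is a hypothetical in-memory stand-in for one audit_events row.
type Event struct {
	ID, Actor, ActorType, Action string
	ResourceType, ResourceID     string
	Details, EventCategory       string
	Timestamp                    time.Time
	PrevHash, RowHash            string // PrevHash is "" on the first row
}

// canonicalPayload joins the hashed fields with '|' and renders the
// timestamp in UTC with microsecond precision, so the result does not
// depend on the caller's time zone. Field order is an assumption.
func canonicalPayload(e Event) string {
	ts := e.Timestamp.UTC().Format("2006-01-02T15:04:05.000000Z")
	return strings.Join([]string{
		e.PrevHash, e.ID, e.Actor, e.ActorType, e.Action,
		e.ResourceType, e.ResourceID, e.Details, ts, e.EventCategory,
	}, "|")
}

// rowHash returns the hex-encoded SHA-256 of the canonical payload.
func rowHash(e Event) string {
	sum := sha256.Sum256([]byte(canonicalPayload(e)))
	return hex.EncodeToString(sum[:])
}

// Append links e to the current chain head and returns the extended chain.
func Append(chain []Event, e Event) []Event {
	e.PrevHash = ""
	if n := len(chain); n > 0 {
		e.PrevHash = chain[n-1].RowHash
	}
	e.RowHash = rowHash(e)
	return append(chain, e)
}

// VerifyChain walks the chain in order and returns the first break as
// (id, position, rowCount), or ("", -1, rowCount) when the chain is clean.
func VerifyChain(chain []Event) (string, int, int) {
	prev := ""
	for i, e := range chain {
		if e.PrevHash != prev || rowHash(e) != e.RowHash {
			return e.ID, i, len(chain)
		}
		prev = e.RowHash
	}
	return "", -1, len(chain)
}

func main() {
	base := time.Date(2026, 5, 16, 6, 0, 0, 0, time.UTC)
	var chain []Event
	for i, action := range []string{"create", "renew", "revoke"} {
		chain = Append(chain, Event{
			ID:        fmt.Sprintf("evt-%d", i),
			Actor:     "alice",
			Action:    action,
			Timestamp: base.Add(time.Duration(i) * time.Second),
		})
	}
	fmt.Println(VerifyChain(chain)) // clean walk over three rows

	chain[1].Actor = "mallory" // rewrite a middle row without rehashing
	fmt.Println(VerifyChain(chain)) // reports evt-1 at position 1
}
```

Rewriting a row changes its canonical payload, so its stored row_hash no longer matches. Recomputing that row's hash instead would break the next row's prev_hash link, so an attacker has to rewrite every later row and the sentinel to hide the change.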

  internal/repository/postgres/audit.go
    + AuditRepository.VerifyHashChain — calls the SQL function and
      maps the OUT parameters to Go return values
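
A minimal sketch of how the OUT parameters might be mapped to Go return values. It assumes database/sql (the repository may use a different driver API), assumes the verifier returns NULL for first_break_id and first_break_pos on a clean chain, and uses the ("", -1, n) clean-chain convention that the in-memory stub in this commit returns. chainResult and toGo are hypothetical names.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// chainResult mirrors the verifier's OUT columns. Pointer fields let
// database/sql represent SQL NULL as nil (assumed for a clean chain).
type chainResult struct {
	FirstBreakID  *string
	FirstBreakPos *int64
	RowCount      int64
}

// toGo flattens the nullable columns: a clean chain maps to ("", -1, n).
func (r chainResult) toGo() (string, int, int) {
	if r.FirstBreakID == nil || r.FirstBreakPos == nil {
		return "", -1, int(r.RowCount)
	}
	return *r.FirstBreakID, int(*r.FirstBreakPos), int(r.RowCount)
}

// VerifyHashChain calls the SQL verifier and maps its single result row.
func VerifyHashChain(ctx context.Context, db *sql.DB) (string, int, int, error) {
	const q = `SELECT first_break_id, first_break_pos, row_count
	           FROM audit_events_verify_chain()`
	var res chainResult
	err := db.QueryRowContext(ctx, q).
		Scan(&res.FirstBreakID, &res.FirstBreakPos, &res.RowCount)
	if err != nil {
		return "", -1, 0, fmt.Errorf("verify audit hash chain: %w", err)
	}
	id, pos, rows := res.toGo()
	return id, pos, rows, nil
}

func main() {
	// Exercise only the mapping; the query itself needs a live database.
	id, pos := "evt-1", int64(1)
	clean := chainResult{RowCount: 3}
	broken := chainResult{FirstBreakID: &id, FirstBreakPos: &pos, RowCount: 3}
	fmt.Println(clean.toGo())
	fmt.Println(broken.toGo())
}
```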

  internal/repository/interfaces.go
    + AuditRepository.VerifyHashChain in the contract; every
      in-memory mock + stub picks up the no-op implementation

  internal/scheduler/scheduler.go
    + AuditChainVerifier + AuditChainBreakRecorder interfaces
    + auditChainVerifyInterval (default 6h)
    + auditChainVerifyLoop: runs once on start + every tick;
      atomic.Bool guard + 5-min per-tick context timeout match every
      other GC loop's pattern
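
The guard-plus-timeout pattern named above could look roughly like the sketch below. Only the two interface names and the VerifyHashChain signature come from this commit; the RecordWalk method, the verifyLoop type, and the fakes used by demo are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// AuditChainVerifier matches the repository method added in this commit.
type AuditChainVerifier interface {
	VerifyHashChain(ctx context.Context) (breakID string, breakPos, rowCount int, err error)
}

// AuditChainBreakRecorder is named in the commit; RecordWalk is an assumed method.
type AuditChainBreakRecorder interface {
	RecordWalk(breakID string, breakPos, rowCount int)
}

type verifyLoop struct {
	verifier AuditChainVerifier
	recorder AuditChainBreakRecorder
	timeout  time.Duration // per-tick budget, e.g. 5 minutes
	running  atomic.Bool   // true while a walk is in flight
}

// tick runs one walk. It returns false when the previous walk is still
// running, so slow walks never overlap.
func (l *verifyLoop) tick(parent context.Context) bool {
	if !l.running.CompareAndSwap(false, true) {
		return false
	}
	defer l.running.Store(false)

	ctx, cancel := context.WithTimeout(parent, l.timeout)
	defer cancel()

	id, pos, rows, err := l.verifier.VerifyHashChain(ctx)
	if err != nil {
		return true // a real loop would log the error here
	}
	l.recorder.RecordWalk(id, pos, rows)
	return true
}

// run verifies once on start, then on every tick until ctx is cancelled.
func (l *verifyLoop) run(ctx context.Context, interval time.Duration) {
	l.tick(ctx)
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			l.tick(ctx)
		}
	}
}

// Fakes for the demo below.
type cleanVerifier struct{ rows int }

func (v cleanVerifier) VerifyHashChain(context.Context) (string, int, int, error) {
	return "", -1, v.rows, nil
}

type countingRecorder struct{ walks int }

func (r *countingRecorder) RecordWalk(string, int, int) { r.walks++ }

// demo runs one normal tick, then one tick while the guard is held.
func demo() (ran, skippedWhileBusy bool, walks int) {
	rec := &countingRecorder{}
	l := &verifyLoop{verifier: cleanVerifier{rows: 3}, recorder: rec, timeout: 5 * time.Minute}
	ran = l.tick(context.Background())
	l.running.Store(true) // simulate a walk that has not finished
	skippedWhileBusy = !l.tick(context.Background())
	return ran, skippedWhileBusy, rec.walks
}

func main() {
	fmt.Println(demo())
}
```

CompareAndSwap makes the guard safe even if ticks arrive from more than one goroutine; a plain load followed by a store would leave a window in which two walks could start.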

  internal/service/audit_chain_metric.go
    + AuditChainCounter type with atomic counters. Sticky-first-
      detection on (BrokenAtID, BrokenAtPos) so the actionable
      alarm doesn't drift across walks. Snapshot() returns the
      full state for the metrics handler
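
Sticky-first-detection can be sketched as follows. The snapshot field names appear in this commit, but their types, the RecordWalk method, and the choice to increment the break counter on every walk that finds a break are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// AuditChainSnapshot uses field names from this commit; the types are assumed.
type AuditChainSnapshot struct {
	BreaksDetected     uint64
	WalksCompleted     uint64
	LastRowCount       int64
	LastVerifiedAtUnix int64
	BrokenAtID         string
	BrokenAtPos        int
}

// AuditChainCounter keeps lock-free counters plus a mutex-guarded
// record of the first break ever seen.
type AuditChainCounter struct {
	breaks, walks    atomic.Uint64
	lastRows, lastAt atomic.Int64

	mu          sync.Mutex
	hasBreak    bool
	brokenAtID  string
	brokenAtPos int
}

// RecordWalk stores one verifier result. An empty breakID means the
// walk was clean. Only the first break location is retained, so later
// walks that report a different row do not move the alarm.
func (c *AuditChainCounter) RecordWalk(breakID string, breakPos, rowCount int, unixNow int64) {
	c.walks.Add(1)
	c.lastRows.Store(int64(rowCount))
	c.lastAt.Store(unixNow)
	if breakID == "" {
		return
	}
	c.breaks.Add(1)
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.hasBreak {
		c.hasBreak = true
		c.brokenAtID, c.brokenAtPos = breakID, breakPos
	}
}

// Snapshot returns the current state; BrokenAtPos is -1 until a break is seen.
func (c *AuditChainCounter) Snapshot() AuditChainSnapshot {
	c.mu.Lock()
	id, pos := c.brokenAtID, c.brokenAtPos
	if !c.hasBreak {
		pos = -1
	}
	c.mu.Unlock()
	return AuditChainSnapshot{
		BreaksDetected:     c.breaks.Load(),
		WalksCompleted:     c.walks.Load(),
		LastRowCount:       c.lastRows.Load(),
		LastVerifiedAtUnix: c.lastAt.Load(),
		BrokenAtID:         id,
		BrokenAtPos:        pos,
	}
}

func main() {
	c := &AuditChainCounter{}
	c.RecordWalk("", -1, 10, 100)      // clean walk
	c.RecordWalk("evt-4", 4, 10, 200)  // first break: retained
	c.RecordWalk("evt-7", 7, 12, 300)  // later break: counted, location ignored
	fmt.Printf("%+v\n", c.Snapshot())
}
```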

  internal/api/handler/metrics.go
    + AuditChainCounterSnapshotter interface + Prometheus
      exposition for four series:
        certctl_audit_chain_break_detected_total counter (the alarm)
        certctl_audit_chain_verify_total          counter (walks done)
        certctl_audit_chain_rows                  gauge (last walk size)
        certctl_audit_chain_last_verified_at      gauge (unix seconds)

  internal/config/config.go
    + AuditChainConfig{ VerifyInterval } + CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL
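
A sketch of how the interval might be read from the environment, assuming the variable holds a Go duration string such as "30m" and that an unset variable falls back to the 6h default. parseVerifyInterval and loadAuditChainConfig are hypothetical helpers, not the functions in config.go.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

const (
	envVerifyInterval     = "CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL"
	defaultVerifyInterval = 6 * time.Hour
)

// AuditChainConfig carries the single knob named in this commit.
type AuditChainConfig struct {
	VerifyInterval time.Duration
}

// parseVerifyInterval returns the default for an empty value and
// rejects malformed or non-positive durations.
func parseVerifyInterval(raw string) (time.Duration, error) {
	if raw == "" {
		return defaultVerifyInterval, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", envVerifyInterval, err)
	}
	if d <= 0 {
		return 0, fmt.Errorf("%s: must be positive, got %s", envVerifyInterval, d)
	}
	return d, nil
}

// loadAuditChainConfig reads the interval from the process environment.
func loadAuditChainConfig() (AuditChainConfig, error) {
	d, err := parseVerifyInterval(os.Getenv(envVerifyInterval))
	if err != nil {
		return AuditChainConfig{}, err
	}
	return AuditChainConfig{VerifyInterval: d}, nil
}

func main() {
	cfg, err := loadAuditChainConfig()
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("audit chain verify interval:", cfg.VerifyInterval)
}
```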

  cmd/server/main.go
    + wires AuditChainCounter into both the scheduler (recorder) +
      metrics handler (snapshotter) — single instance shared so the
      writer + reader are guaranteed to converge

  internal/repository/postgres/audit_chain_test.go (NEW)
    + TestAuditEventsHashChain_FreshTable: empty walk → clean
    + TestAuditEventsHashChain_AppendLinksRows: three INSERTs
      produce a strictly-linked chain; prev_hash on row 0 is NULL;
      verifier walks clean over the 3 rows
    + TestAuditEventsHashChain_VerifierDetectsTampering: simulate
      the compliance-superuser threat model (DISABLE WORM, UPDATE
      a middle row, ENABLE WORM); verifier returns the tampered
      row's id at position 1

  docs/operator/audit-chain.md (NEW)
    + Layered-defenses explainer (WORM + hash chain). Verifier
      function reference. Recommended Prometheus alert rule.
      Performance scaling table (10k to 10M rows). Step-by-step
      runbook for what to do when a break is detected. Operator
      configuration table.

  Test-stub additions for AuditRepository.VerifyHashChain:
    internal/service/testutil_test.go  — mockAuditRepo
    internal/service/acme_test.go      — fakeAuditRepo
    internal/integration/lifecycle_test.go — mockAuditRepository
    internal/api/handler/scep_intune_e2e_test.go — intuneE2EAuditRepo

Verified locally:
  go vet ./...                                          (clean)
  gofmt -l internal/ cmd/                               (clean)
  go test -short -count=1 ./internal/scheduler/... ./internal/config/...
    ./internal/service/... ./internal/api/handler/... ./internal/repository/...
    (all green)

Verified with testcontainers + postgres:16-alpine + the migration
runner (not gated under -short — requires docker):
  go test -count=1 -run TestAuditEventsHashChain ./internal/repository/postgres/...

Closes COMP-001-HASH leg of Sprint 6. COMP-002-RETENTION lands in
the next commit (separate concern: federated-user PII retention).
This commit is contained in:
shankar0123
2026-05-16 06:17:15 +00:00
parent 8c2d3c844e
commit 43836aca7c
15 changed files with 1127 additions and 0 deletions
@@ -102,6 +102,20 @@ type ExpiryAlertSnapshotter interface {
	SnapshotExpiryAlerts() []service.ExpiryAlertSnapshotEntry
}

// AuditChainCounterSnapshotter is the surface MetricsHandler consumes
// to emit the Sprint 6 COMP-001-HASH tamper-evidence counters:
//
//	certctl_audit_chain_break_detected_total  counter
//	certctl_audit_chain_verify_total          counter
//	certctl_audit_chain_rows                  gauge
//	certctl_audit_chain_last_verified_at      gauge (unix seconds)
//
// *service.AuditChainCounter satisfies this. nil disables emission;
// cmd/server/main.go wires the instance at startup.
type AuditChainCounterSnapshotter interface {
	Snapshot() service.AuditChainSnapshot
}

// MetricsHandler handles HTTP requests for metrics.
// Supports both JSON format (GET /api/v1/metrics) and Prometheus exposition format
// (GET /api/v1/metrics/prometheus) for integration with Prometheus, Grafana, Datadog, etc.
@@ -129,6 +143,10 @@ type MetricsHandler struct {
	// 2026-05-03 Infisical deep-research deliverable. nil disables
	// emission of certctl_expiry_alerts_total{channel,threshold,result}.
	expiryAlerts ExpiryAlertSnapshotter
	// Sprint 6 COMP-001-HASH tamper-evidence counters. nil disables
	// emission of certctl_audit_chain_* metrics. *service.AuditChainCounter
	// is the production wiring; cmd/server/main.go sets this at startup.
	auditChainCounter AuditChainCounterSnapshotter
}

// NewMetricsHandler creates a new MetricsHandler with a service dependency.
@@ -177,6 +195,14 @@ func (h *MetricsHandler) SetExpiryAlerts(c ExpiryAlertSnapshotter) {
	h.expiryAlerts = c
}

// SetAuditChainCounter wires the Sprint 6 COMP-001-HASH tamper-evidence
// counters for the Prometheus exposition. nil disables the block.
// The counter is also passed to scheduler.SetAuditChainBreakRecorder so
// the verify loop writes to the same instance the handler reads.
func (h *MetricsHandler) SetAuditChainCounter(c AuditChainCounterSnapshotter) {
	h.auditChainCounter = c
}

// MetricsResponse represents the JSON metrics response for V2.
type MetricsResponse struct {
Gauge MetricsGauge `json:"gauge"`
@@ -523,6 +549,29 @@ func (h MetricsHandler) GetPrometheusMetrics(w http.ResponseWriter, r *http.Requ
			}
		}
	}

	// Sprint 6 COMP-001-HASH tamper-evidence counters. Emitted as four
	// adjacent series so an alert rule can fire on any non-zero
	// certctl_audit_chain_break_detected_total (the operator-actionable
	// signal — see docs/operator/audit-chain.md).
	if h.auditChainCounter != nil {
		snap := h.auditChainCounter.Snapshot()
		fmt.Fprintf(w, "\n# HELP certctl_audit_chain_break_detected_total Number of audit_events hash-chain breaks detected (Sprint 6 COMP-001-HASH).\n")
		fmt.Fprintf(w, "# TYPE certctl_audit_chain_break_detected_total counter\n")
		fmt.Fprintf(w, "certctl_audit_chain_break_detected_total %d\n", snap.BreaksDetected)

		fmt.Fprintf(w, "# HELP certctl_audit_chain_verify_total Number of audit_events_verify_chain() walks completed by the scheduler.\n")
		fmt.Fprintf(w, "# TYPE certctl_audit_chain_verify_total counter\n")
		fmt.Fprintf(w, "certctl_audit_chain_verify_total %d\n", snap.WalksCompleted)

		fmt.Fprintf(w, "# HELP certctl_audit_chain_rows Most recent walk's row count (gauge — last-write-wins).\n")
		fmt.Fprintf(w, "# TYPE certctl_audit_chain_rows gauge\n")
		fmt.Fprintf(w, "certctl_audit_chain_rows %d\n", snap.LastRowCount)

		fmt.Fprintf(w, "# HELP certctl_audit_chain_last_verified_at Unix seconds of most recent walk (0 = never).\n")
		fmt.Fprintf(w, "# TYPE certctl_audit_chain_last_verified_at gauge\n")
		fmt.Fprintf(w, "certctl_audit_chain_last_verified_at %d\n", snap.LastVerifiedAtUnix)
	}
}

// formatLE formats a histogram bucket boundary the way Prometheus
@@ -170,6 +170,14 @@ func (r *intuneE2EAuditRepo) List(_ context.Context, _ *repository.AuditFilter)
	return nil, nil
}

// VerifyHashChain satisfies the Sprint 6 COMP-001-HASH interface
// addition. In-memory stub: always clean.
func (r *intuneE2EAuditRepo) VerifyHashChain(_ context.Context) (string, int, int, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	return "", -1, len(r.events), nil
}

func (r *intuneE2EAuditRepo) actions() []string {
	r.mu.Lock()
	defer r.mu.Unlock()