mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 12:41:30 +00:00
43836aca7c
Sprint 6 closure of the audit's HIGH-severity COMP-001-HASH finding.
Pre-fix posture: migration 000018 installs a WORM trigger on
audit_events that blocks UPDATE / DELETE for the application role.
But the trigger header itself documents a compliance-superuser
bypass (backup restore, retention purges, breach recovery). Without
a hash chain, that role can rewrite any row's actor / action /
details / timestamp / event_category with no on-disk trace.
HIPAA §164.312(b), FedRAMP AU-9, NIST 800-53 AU-10 want tamper-
EVIDENCE, not just tamper-prevention. This commit ships the
evidence layer.
Wire shape:

  migrations/000047_audit_events_hash_chain.up.sql
    + pgcrypto extension (digest function)
    + audit_chain_head: single-row sentinel table holding the most
      recent row_hash; FOR UPDATE row-lock serialises chain writes
      under concurrent INSERTs so two parallel writers can't read
      the same prev_hash and produce a forked chain
    + audit_events: prev_hash + row_hash columns
    + audit_events_canonical_payload(): centralised hash input
      builder. UTC + microsecond ISO-8601 keeps the hash
      session-timezone-independent. All columns separated by '|' so a
      concatenation-ambiguity exploit can't fabricate a collision
    + audit_events_compute_hash_chain(): BEFORE-INSERT trigger
      function. Reads sentinel FOR UPDATE → computes
      sha256(prev_hash || id || actor || actor_type || action ||
      resource_type || resource_id || details::text ||
      timestamp_utc_iso || event_category) → writes both columns +
      advances the sentinel
    + backfill loop walks every existing row in (timestamp ASC, id
      ASC) order; WORM trigger temporarily DISABLEd inside this
      migration's transaction so backfill UPDATEs land cleanly,
      ENABLEd before COMMIT
    + audit_events_verify_chain(): STABLE plpgsql verifier. Walks
      the chain end-to-end and returns the first break:
      (first_break_id TEXT, first_break_pos INT, row_count INT)

  internal/repository/postgres/audit.go
    + AuditRepository.VerifyHashChain: calls the SQL function and
      maps the OUT parameters to Go return values

  internal/repository/interfaces.go
    + AuditRepository.VerifyHashChain in the contract; every
      in-memory mock + stub picks up the no-op implementation

  internal/scheduler/scheduler.go
    + AuditChainVerifier + AuditChainBreakRecorder interfaces
    + auditChainVerifyInterval (default 6h)
    + auditChainVerifyLoop: runs once on start + every tick;
      atomic.Bool guard + 5-min per-tick context timeout match every
      other GC loop's pattern

  internal/service/audit_chain_metric.go
    + AuditChainCounter type with atomic counters.
      Sticky-first-detection on (BrokenAtID, BrokenAtPos) so the
      actionable alarm doesn't drift across walks. Snapshot() returns
      the full state for the metrics handler

  internal/api/handler/metrics.go
    + AuditChainCounterSnapshotter interface + Prometheus
      exposition for four series:
        certctl_audit_chain_break_detected_total  counter  (the alarm)
        certctl_audit_chain_verify_total          counter  (walks done)
        certctl_audit_chain_rows                  gauge    (last walk size)
        certctl_audit_chain_last_verified_at      gauge    (unix seconds)

  internal/config/config.go
    + AuditChainConfig{ VerifyInterval } + CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL

  cmd/server/main.go
    + wires AuditChainCounter into both the scheduler (recorder) +
      metrics handler (snapshotter); single instance shared so the
      writer + reader are guaranteed to converge

  internal/repository/postgres/audit_chain_test.go (NEW)
    + TestAuditEventsHashChain_FreshTable: empty walk → clean
    + TestAuditEventsHashChain_AppendLinksRows: three INSERTs
      produce a strictly-linked chain; prev_hash on row 0 is NULL;
      verifier walks clean over the 3 rows
    + TestAuditEventsHashChain_VerifierDetectsTampering: simulate
      the compliance-superuser threat model (DISABLE WORM, UPDATE
      a middle row, ENABLE WORM); verifier returns the tampered
      row's id at position 1

  docs/operator/audit-chain.md (NEW)
    + Layered-defenses explainer (WORM + hash chain). Verifier
      function reference. Recommended Prometheus alert rule.
      Performance scaling table (10k to 10M rows). Step-by-step
      runbook for what to do when a break is detected. Operator
      configuration table.

Test-stub additions for AuditRepository.VerifyHashChain:

  internal/service/testutil_test.go              mockAuditRepo
  internal/service/acme_test.go                  fakeAuditRepo
  internal/integration/lifecycle_test.go         mockAuditRepository
  internal/api/handler/scep_intune_e2e_test.go   intuneE2EAuditRepo
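The chain construction and verification described for migration 000047 can be illustrated in memory. The following is a minimal Go sketch, not the migration's plpgsql: the auditEvent type, the helper names, the timestamp layout, and the -1 "no break" position are all assumptions made for illustration. Only the hashed column order, the '|' separator, and the (id, position, row count) result shape come from the description above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
	"time"
)

// auditEvent is a hypothetical in-memory stand-in for one audit_events row.
type auditEvent struct {
	ID, Actor, ActorType, Action string
	ResourceType, ResourceID     string
	Details, EventCategory       string
	Timestamp                    time.Time
	PrevHash, RowHash            string // PrevHash is "" on the first row (NULL in SQL)
}

// canonicalPayload joins every hashed column with '|' and renders the
// timestamp in UTC with microsecond precision, so the hash input does not
// depend on the local timezone. The exact layout string is an assumption.
func canonicalPayload(e auditEvent) string {
	ts := e.Timestamp.UTC().Format("2006-01-02T15:04:05.000000Z")
	return strings.Join([]string{
		e.PrevHash, e.ID, e.Actor, e.ActorType, e.Action,
		e.ResourceType, e.ResourceID, e.Details, ts, e.EventCategory,
	}, "|")
}

// rowHash returns the hex-encoded SHA-256 of the canonical payload.
func rowHash(e auditEvent) string {
	sum := sha256.Sum256([]byte(canonicalPayload(e)))
	return hex.EncodeToString(sum[:])
}

// appendEvent links e to the current chain head and stores its hash.
func appendEvent(chain []auditEvent, e auditEvent) []auditEvent {
	e.PrevHash = ""
	if n := len(chain); n > 0 {
		e.PrevHash = chain[n-1].RowHash
	}
	e.RowHash = rowHash(e)
	return append(chain, e)
}

// verifyChain walks the chain in order and reports the first row whose link
// or stored hash does not match. It returns ("", -1, n) for a clean chain.
func verifyChain(chain []auditEvent) (firstBreakID string, firstBreakPos, rowCount int) {
	prev := ""
	for i, e := range chain {
		if e.PrevHash != prev || rowHash(e) != e.RowHash {
			return e.ID, i, len(chain)
		}
		prev = e.RowHash
	}
	return "", -1, len(chain)
}

// demoChain builds three linked rows with made-up field values.
func demoChain() []auditEvent {
	base := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	var chain []auditEvent
	for i := 1; i <= 3; i++ {
		chain = appendEvent(chain, auditEvent{
			ID: fmt.Sprintf("evt-%d", i), Actor: "alice", ActorType: "user",
			Action: "certificate.issue", ResourceType: "certificate",
			ResourceID: fmt.Sprintf("cert-%d", i), Details: "{}",
			EventCategory: "lifecycle",
			Timestamp:     base.Add(time.Duration(i) * time.Second),
		})
	}
	return chain
}

func main() {
	chain := demoChain()
	id, pos, n := verifyChain(chain)
	fmt.Printf("id=%q pos=%d rows=%d\n", id, pos, n) // id="" pos=-1 rows=3

	chain[1].Actor = "mallory" // rewrite a middle row without rehashing
	id, pos, n = verifyChain(chain)
	fmt.Printf("id=%q pos=%d rows=%d\n", id, pos, n) // id="evt-2" pos=1 rows=3
}
```

The tampered case mirrors TestAuditEventsHashChain_VerifierDetectsTampering: editing a middle row leaves its stored row_hash stale, so the walk stops at position 1.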
Verified locally:

  go vet ./...                 (clean)
  gofmt -l internal/ cmd/      (clean)
  go test -short -count=1 ./internal/scheduler/... ./internal/config/...
    ./internal/service/... ./internal/api/handler/... ./internal/repository/...
                               (all green)

Verified with testcontainers + postgres:16-alpine + the migration
runner (not gated under -short; requires docker):

  go test -count=1 -run TestAuditEventsHashChain ./internal/repository/postgres/...

Closes COMP-001-HASH leg of Sprint 6. COMP-002-RETENTION lands in
the next commit (separate concern: federated-user PII retention).
118 lines
4.2 KiB
Go
// Copyright 2026 certctl LLC. All rights reserved.
// SPDX-License-Identifier: BUSL-1.1

package service

import (
	"sync/atomic"
	"time"
)

// AuditChainCounter is the metric-side companion to the Sprint 6
// COMP-001-HASH chain verifier. The scheduler's auditChainVerifyLoop
// calls RecordSuccess on every clean walk and RecordBreak on
// detection; the Prometheus metrics handler reads the snapshot.
//
// Wire shape:
//
//	scheduler.AuditChainVerifier      → *postgres.AuditRepository
//	                                    (calls audit_events_verify_chain SQL func)
//	scheduler.AuditChainBreakRecorder → *AuditChainCounter (this file)
//	handler.MetricsHandler            → reads Snapshot() / LastBreakID() / ...
//
// Three counters get surfaced (matching the existing
// /api/v1/metrics/prometheus naming conventions):
//
//	certctl_audit_chain_break_detected_total  counter (cumulative)
//	certctl_audit_chain_verify_total          counter (every walk)
//	certctl_audit_chain_rows                  gauge   (last walk's row count)
//
// Plus three info-label fields (broken_at_id, broken_at_pos,
// last_verified_at_unix) so operators can render a
// "last walk: clean, 1.2M rows, T-37m" panel.
//
// The counters use atomic.Uint64 so writes from the scheduler
// goroutine and reads from the HTTP handler goroutine don't need a
// mutex. The string field (brokenAtID) is held in an atomic.Value
// for the same reason, so nothing in this struct takes a lock.
// brokenAtID and brokenAtPos are stored and loaded as two separate
// atomic operations, so a Snapshot taken while the first break is
// being recorded may briefly see the ID without its position.
type AuditChainCounter struct {
	breaksDetected atomic.Uint64
	walksCompleted atomic.Uint64
	lastRowCount   atomic.Uint64
	lastVerifiedAt atomic.Int64 // unix seconds; 0 = never

	// brokenAtID / brokenAtPos are sticky: they record the *first*
	// detected break, not the most recent walk's data. Operators
	// reset by restarting the process (or a future Phase 2 reset
	// endpoint behind auth.audit.admin).
	brokenAtID  atomic.Value // string
	brokenAtPos atomic.Int64
}

// NewAuditChainCounter returns a zero-state counter. Wire from
// cmd/server/main.go and pass to both the scheduler
// (SetAuditChainBreakRecorder) and the metrics handler
// (SetAuditChainCounter).
func NewAuditChainCounter() *AuditChainCounter {
	c := &AuditChainCounter{}
	c.brokenAtID.Store("")
	c.brokenAtPos.Store(-1)
	return c
}

// RecordSuccess marks a clean walk. The scheduler calls this on every
// tick where VerifyHashChain returned brokenAtID == "".
func (c *AuditChainCounter) RecordSuccess(rowCount int) {
	c.walksCompleted.Add(1)
	if rowCount < 0 {
		rowCount = 0
	}
	c.lastRowCount.Store(uint64(rowCount))
	c.lastVerifiedAt.Store(time.Now().Unix())
}

// RecordBreak marks a detected break. Sticky: subsequent breaks do not
// overwrite the (brokenAtID, brokenAtPos) fields, because the first
// detection is the actionable signal. The breaksDetected counter still
// increments on every observation so operators can tell whether the
// tampering is ongoing or one-shot.
func (c *AuditChainCounter) RecordBreak(brokenAtID string, brokenAtPos int) {
	c.breaksDetected.Add(1)
	c.walksCompleted.Add(1)
	c.lastVerifiedAt.Store(time.Now().Unix())
	// Sticky-first-detection: only record if the field is still empty.
	// The load-then-store pair is not atomic on its own; it assumes a
	// single writer (the scheduler's verify loop).
	if cur, _ := c.brokenAtID.Load().(string); cur == "" {
		c.brokenAtID.Store(brokenAtID)
		c.brokenAtPos.Store(int64(brokenAtPos))
	}
}

// AuditChainSnapshot is the counter state handed to the Prometheus
// exposer. It is a plain value copy built from atomic loads, so the
// caller needs no further synchronisation.
type AuditChainSnapshot struct {
	BreaksDetected uint64
	WalksCompleted uint64
	LastRowCount   uint64
	// LastVerifiedAtUnix is 0 if the loop has never run; otherwise the
	// unix-epoch second of the most recent walk (clean or break).
	LastVerifiedAtUnix int64
	// BrokenAtID is "" if no break has ever been recorded.
	BrokenAtID  string
	BrokenAtPos int64
}

// Snapshot returns a point-in-time view of every counter. The metrics
// handler renders this into Prometheus exposition format.
func (c *AuditChainCounter) Snapshot() AuditChainSnapshot {
	id, _ := c.brokenAtID.Load().(string)
	return AuditChainSnapshot{
		BreaksDetected:     c.breaksDetected.Load(),
		WalksCompleted:     c.walksCompleted.Load(),
		LastRowCount:       c.lastRowCount.Load(),
		LastVerifiedAtUnix: c.lastVerifiedAt.Load(),
		BrokenAtID:         id,
		BrokenAtPos:        c.brokenAtPos.Load(),
	}
}