mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 15:21:35 +00:00
43836aca7c
Sprint 6 closure of the audit's HIGH-severity COMP-001-HASH finding.

Pre-fix posture: migration 000018 installs a WORM trigger on
audit_events that blocks UPDATE / DELETE for the application role.
But the trigger header itself documents a compliance-superuser
bypass (backup restore, retention purges, breach recovery). Without
a hash chain, that role can rewrite any row's actor / action /
details / timestamp / event_category with no on-disk trace.

HIPAA §164.312(b), FedRAMP AU-9, NIST 800-53 AU-10 want
tamper-EVIDENCE, not just tamper-prevention. This commit ships the
evidence layer.

Wire shape:

  migrations/000047_audit_events_hash_chain.up.sql
    + pgcrypto extension (digest function)
    + audit_chain_head: single-row sentinel table holding the most
      recent row_hash; FOR UPDATE row-lock serialises chain writes
      under concurrent INSERTs so two parallel writers can't read
      the same prev_hash and produce a forked chain
    + audit_events: prev_hash + row_hash columns
    + audit_events_canonical_payload(): centralised hash input
      builder. UTC + microsecond ISO-8601 keeps the hash
      session-timezone-independent. All columns separated by '|'
      so a concatenation-ambiguity exploit can't fabricate a
      collision
    + audit_events_compute_hash_chain(): BEFORE-INSERT trigger
      function. Reads sentinel FOR UPDATE → computes
      sha256(prev_hash || id || actor || actor_type || action ||
      resource_type || resource_id || details::text ||
      timestamp_utc_iso || event_category) → writes both columns +
      advances the sentinel
    + backfill loop walks every existing row in (timestamp ASC, id
      ASC) order; WORM trigger temporarily DISABLEd inside this
      migration's transaction so backfill UPDATEs land cleanly,
      ENABLEd before COMMIT
    + audit_events_verify_chain(): STABLE plpgsql verifier. Walks
      the chain end-to-end and returns the first break:
      (first_break_id TEXT, first_break_pos INT, row_count INT)

  internal/repository/postgres/audit.go
    + AuditRepository.VerifyHashChain: calls the SQL function and
      maps the OUT parameters to Go return values

  internal/repository/interfaces.go
    + AuditRepository.VerifyHashChain in the contract; every
      in-memory mock + stub picks up the no-op implementation

  internal/scheduler/scheduler.go
    + AuditChainVerifier + AuditChainBreakRecorder interfaces
    + auditChainVerifyInterval (default 6h)
    + auditChainVerifyLoop: runs once on start + every tick;
      atomic.Bool guard + 5-min per-tick context timeout match every
      other GC loop's pattern

  internal/service/audit_chain_metric.go
    + AuditChainCounter type with atomic counters.
      Sticky-first-detection on (BrokenAtID, BrokenAtPos) so the
      actionable alarm doesn't drift across walks. Snapshot()
      returns the full state for the metrics handler

  internal/api/handler/metrics.go
    + AuditChainCounterSnapshotter interface + Prometheus
      exposition for four series:
        certctl_audit_chain_break_detected_total  counter  (the alarm)
        certctl_audit_chain_verify_total          counter  (walks done)
        certctl_audit_chain_rows                  gauge    (last walk size)
        certctl_audit_chain_last_verified_at      gauge    (unix seconds)

  internal/config/config.go
    + AuditChainConfig{ VerifyInterval } + CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL

  cmd/server/main.go
    + wires AuditChainCounter into both the scheduler (recorder) +
      metrics handler (snapshotter); single instance shared so the
      writer + reader are guaranteed to converge

  internal/repository/postgres/audit_chain_test.go (NEW)
    + TestAuditEventsHashChain_FreshTable: empty walk → clean
    + TestAuditEventsHashChain_AppendLinksRows: three INSERTs
      produce a strictly-linked chain; prev_hash on row 0 is NULL;
      verifier walks clean over the 3 rows
    + TestAuditEventsHashChain_VerifierDetectsTampering: simulate
      the compliance-superuser threat model (DISABLE WORM, UPDATE
      a middle row, ENABLE WORM); verifier returns the tampered
      row's id at position 1

  docs/operator/audit-chain.md (NEW)
    + Layered-defenses explainer (WORM + hash chain). Verifier
      function reference. Recommended Prometheus alert rule.
      Performance scaling table (10k to 10M rows). Step-by-step
      runbook for what to do when a break is detected. Operator
      configuration table.

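
Illustration only, not part of the migration: the chaining and
verification rules above can be modelled in memory with a few lines
of Go. The Event type, the helper names, the timestamp layout string,
and the row count returned on a break are all assumptions made for
this sketch; the authoritative hash input is whatever
audit_events_canonical_payload() emits in SQL.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
	"time"
)

// Event is a hypothetical in-memory stand-in for an audit_events row.
type Event struct {
	ID, Actor, ActorType, Action string
	ResourceType, ResourceID     string
	Details                      string // JSON text, standing in for details::text
	Timestamp                    time.Time
	EventCategory                string
	PrevHash, RowHash            string // hex; PrevHash is "" on the first row
}

// canonicalPayload joins every hashed column with '|' and renders the
// timestamp in UTC with microsecond precision, so the hash input does
// not depend on the local time zone. The layout is an assumption.
func canonicalPayload(prev string, e Event) string {
	ts := e.Timestamp.UTC().Format("2006-01-02T15:04:05.000000Z")
	return strings.Join([]string{prev, e.ID, e.Actor, e.ActorType, e.Action,
		e.ResourceType, e.ResourceID, e.Details, ts, e.EventCategory}, "|")
}

// rowHash returns the hex SHA-256 of the canonical payload.
func rowHash(prev string, e Event) string {
	sum := sha256.Sum256([]byte(canonicalPayload(prev, e)))
	return hex.EncodeToString(sum[:])
}

// Append links e to the current chain head and returns the new chain.
func Append(chain []Event, e Event) []Event {
	if n := len(chain); n > 0 {
		e.PrevHash = chain[n-1].RowHash
	}
	e.RowHash = rowHash(e.PrevHash, e)
	return append(chain, e)
}

// Verify walks the chain in order and reports the first row whose
// link or hash does not match. It returns ("", -1, len) when clean.
func Verify(chain []Event) (brokenAtID string, brokenAtPos, rowCount int) {
	prev := ""
	for i, e := range chain {
		if e.PrevHash != prev || e.RowHash != rowHash(prev, e) {
			return e.ID, i, len(chain)
		}
		prev = e.RowHash
	}
	return "", -1, len(chain)
}

func main() {
	var chain []Event
	for _, id := range []string{"a", "b", "c"} {
		chain = Append(chain, Event{ID: id, Actor: "alice", Action: "issue",
			Timestamp: time.Date(2026, 6, 7, 15, 21, 35, 0, time.UTC)})
	}
	id, pos, n := Verify(chain)
	fmt.Printf("before tampering: break=%q pos=%d rows=%d\n", id, pos, n)

	// Rewrite a middle row without recomputing its hash, as a
	// privileged role bypassing the WORM trigger could.
	chain[1].Actor = "mallory"
	id, pos, n = Verify(chain)
	fmt.Printf("after tampering:  break=%q pos=%d rows=%d\n", id, pos, n)
}
```

Because each row_hash covers the previous row's hash, an attacker who
edits one row would have to recompute every later hash as well, which
is why the verifier only needs to report the first mismatch.
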
Test-stub additions for AuditRepository.VerifyHashChain:
  internal/service/testutil_test.go              mockAuditRepo
  internal/service/acme_test.go                  fakeAuditRepo
  internal/integration/lifecycle_test.go         mockAuditRepository
  internal/api/handler/scep_intune_e2e_test.go   intuneE2EAuditRepo

Verified locally:
  go vet ./...                                   (clean)
  gofmt -l internal/ cmd/                        (clean)
  go test -short -count=1 ./internal/scheduler/... ./internal/config/...
    ./internal/service/... ./internal/api/handler/... ./internal/repository/...
                                                 (all green)

Verified with testcontainers + postgres:16-alpine + the migration
runner (not gated under -short; requires docker):
  go test -count=1 -run TestAuditEventsHashChain ./internal/repository/postgres/...

Closes COMP-001-HASH leg of Sprint 6. COMP-002-RETENTION lands in
the next commit (separate concern: federated-user PII retention).
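
Illustration only: the "sticky-first-detection" behaviour described
for AuditChainCounter could look roughly like the sketch below. The
type, field, and method names here are hypothetical and are not taken
from internal/service/audit_chain_metric.go; in particular, counting
one break per walk that finds one is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Snapshot is a point-in-time copy of the counter state, in the shape
// a metrics handler might read.
type Snapshot struct {
	BreakDetectedTotal uint64
	VerifyTotal        uint64
	Rows               int64
	LastVerifiedAt     int64 // unix seconds
	BrokenAtID         string
	BrokenAtPos        int
}

// ChainCounter is a hypothetical counter: atomics for the numeric
// series, a mutex for the (id, pos) pair that must change together.
type ChainCounter struct {
	breaks, walks      atomic.Uint64
	rows, lastVerified atomic.Int64

	mu          sync.Mutex
	brokenAtID  string
	brokenAtPos int
}

// NewChainCounter returns a counter with no break recorded.
func NewChainCounter() *ChainCounter {
	return &ChainCounter{brokenAtPos: -1}
}

// RecordWalk stores the outcome of one verifier walk. The first
// reported break location is kept; later walks never overwrite it.
func (c *ChainCounter) RecordWalk(brokenAtID string, brokenAtPos, rowCount int, nowUnix int64) {
	c.walks.Add(1)
	c.rows.Store(int64(rowCount))
	c.lastVerified.Store(nowUnix)
	if brokenAtID == "" {
		return // clean walk
	}
	c.breaks.Add(1)
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.brokenAtID == "" { // sticky: only the first detection is recorded
		c.brokenAtID, c.brokenAtPos = brokenAtID, brokenAtPos
	}
}

// Snapshot returns the full state for a reader such as a metrics handler.
func (c *ChainCounter) Snapshot() Snapshot {
	c.mu.Lock()
	id, pos := c.brokenAtID, c.brokenAtPos
	c.mu.Unlock()
	return Snapshot{
		BreakDetectedTotal: c.breaks.Load(),
		VerifyTotal:        c.walks.Load(),
		Rows:               c.rows.Load(),
		LastVerifiedAt:     c.lastVerified.Load(),
		BrokenAtID:         id,
		BrokenAtPos:        pos,
	}
}

func main() {
	c := NewChainCounter()
	c.RecordWalk("", -1, 3, 100)      // clean walk
	c.RecordWalk("row-b", 1, 3, 200)  // first break: location is recorded
	c.RecordWalk("row-c", 2, 4, 300)  // later break: location stays at row-b
	fmt.Printf("%+v\n", c.Snapshot())
}
```

Sharing one such instance between the writer (scheduler) and the
reader (metrics handler) is what lets both sides see the same state.
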
206 lines
6.8 KiB
Go
// Copyright 2026 certctl LLC. All rights reserved.
// SPDX-License-Identifier: BUSL-1.1

package postgres

import (
	"context"
	"database/sql"
	"fmt"
	"strings"

	"github.com/certctl-io/certctl/internal/domain"
	"github.com/certctl-io/certctl/internal/repository"
	"github.com/google/uuid"
)

// AuditRepository implements repository.AuditRepository
type AuditRepository struct {
	db *sql.DB
}

// NewAuditRepository creates a new AuditRepository
func NewAuditRepository(db *sql.DB) *AuditRepository {
	return &AuditRepository{db: db}
}

// Create stores a new audit event using the repository's package-level
// *sql.DB. Use CreateWithTx when the audit event must be atomic with
// another database operation in a service-layer transaction.
func (r *AuditRepository) Create(ctx context.Context, event *domain.AuditEvent) error {
	return r.CreateWithTx(ctx, r.db, event)
}

// CreateWithTx stores a new audit event using the supplied Querier.
// Pass *sql.Tx (typically from postgres.WithinTx) to participate in a
// caller's transaction; pass *sql.DB or call Create for stand-alone
// inserts. The SQL and side-effect contract is identical to Create;
// CreateWithTx is the load-bearing path that closes the audit's
// atomicity blocker (audit row must be transactional with the
// operation that triggered it).
func (r *AuditRepository) CreateWithTx(ctx context.Context, q repository.Querier, event *domain.AuditEvent) error {
	if event.ID == "" {
		event.ID = uuid.New().String()
	}
	// Bundle 1 Phase 8: empty EventCategory defaults to
	// cert_lifecycle (matches the migration's DEFAULT clause + the
	// DB CHECK constraint). The boundary catches callers that
	// haven't yet been migrated to the categorized API.
	if event.EventCategory == "" {
		event.EventCategory = domain.EventCategoryCertLifecycle
	}

	err := q.QueryRowContext(ctx, `
		INSERT INTO audit_events (
			id, actor, actor_type, action, resource_type, resource_id, details, timestamp, event_category
		) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
		RETURNING id
	`, event.ID, event.Actor, event.ActorType, event.Action, event.ResourceType,
		event.ResourceID, event.Details, event.Timestamp, event.EventCategory).Scan(&event.ID)

	if err != nil {
		return fmt.Errorf("failed to create audit event: %w", err)
	}

	return nil
}

// List returns audit events matching the filter criteria
func (r *AuditRepository) List(ctx context.Context, filter *repository.AuditFilter) ([]*domain.AuditEvent, error) {
	if filter == nil {
		filter = &repository.AuditFilter{}
	}

	// Set defaults
	if filter.Page < 1 {
		filter.Page = 1
	}
	if filter.PerPage == 0 || filter.PerPage > 500 {
		filter.PerPage = 50
	}

	// Build WHERE clause
	var whereConditions []string
	var args []interface{}
	argCount := 1

	if filter.Actor != "" {
		whereConditions = append(whereConditions, fmt.Sprintf("actor = $%d", argCount))
		args = append(args, filter.Actor)
		argCount++
	}
	if filter.ActorType != "" {
		whereConditions = append(whereConditions, fmt.Sprintf("actor_type = $%d", argCount))
		args = append(args, filter.ActorType)
		argCount++
	}
	if filter.ResourceType != "" {
		whereConditions = append(whereConditions, fmt.Sprintf("resource_type = $%d", argCount))
		args = append(args, filter.ResourceType)
		argCount++
	}
	if filter.ResourceID != "" {
		whereConditions = append(whereConditions, fmt.Sprintf("resource_id = $%d", argCount))
		args = append(args, filter.ResourceID)
		argCount++
	}
	if !filter.From.IsZero() {
		whereConditions = append(whereConditions, fmt.Sprintf("timestamp >= $%d", argCount))
		args = append(args, filter.From)
		argCount++
	}
	if !filter.To.IsZero() {
		whereConditions = append(whereConditions, fmt.Sprintf("timestamp <= $%d", argCount))
		args = append(args, filter.To)
		argCount++
	}
	if filter.EventCategory != "" {
		whereConditions = append(whereConditions, fmt.Sprintf("event_category = $%d", argCount))
		args = append(args, filter.EventCategory)
		argCount++
	}

	whereClause := ""
	if len(whereConditions) > 0 {
		whereClause = "WHERE " + strings.Join(whereConditions, " AND ")
	}

	// Get total count
	countQuery := fmt.Sprintf("SELECT COUNT(*) FROM audit_events %s", whereClause)
	var total int
	if err := r.db.QueryRowContext(ctx, countQuery, args...).Scan(&total); err != nil {
		return nil, fmt.Errorf("failed to count audit events: %w", err)
	}

	// Get paginated results
	offset := (filter.Page - 1) * filter.PerPage
	query := fmt.Sprintf(`
		SELECT id, actor, actor_type, action, resource_type, resource_id, details, timestamp, event_category
		FROM audit_events
		%s
		ORDER BY timestamp DESC
		LIMIT $%d OFFSET $%d
	`, whereClause, argCount, argCount+1)

	args = append(args, filter.PerPage, offset)

	rows, err := r.db.QueryContext(ctx, query, args...)
	if err != nil {
		return nil, fmt.Errorf("failed to query audit events: %w", err)
	}
	defer rows.Close()

	var events []*domain.AuditEvent
	for rows.Next() {
		var event domain.AuditEvent
		if err := rows.Scan(&event.ID, &event.Actor, &event.ActorType, &event.Action,
			&event.ResourceType, &event.ResourceID, &event.Details, &event.Timestamp, &event.EventCategory); err != nil {
			return nil, fmt.Errorf("failed to scan audit event: %w", err)
		}
		events = append(events, &event)
	}

	if err := rows.Err(); err != nil {
		return nil, fmt.Errorf("error iterating audit event rows: %w", err)
	}

	return events, nil
}

// VerifyHashChain calls the migration 000047 audit_events_verify_chain()
// stored function and returns its three OUT parameters. This is the
// Sprint 6 COMP-001-HASH tamper-evidence verifier: the scheduler's
// auditChainVerifyLoop invokes it every CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL
// tick and emits the certctl_audit_chain_break_detected_total counter on
// any non-empty brokenAtID.
//
// A clean walk returns brokenAtID == "" and brokenAtPos == -1; on a
// query or scan error the same sentinel values are returned alongside
// the wrapped error.
//
// The chain walk happens entirely server-side (plpgsql, STABLE). For an
// audit_events table with N rows the cost is O(N) per call; we expect
// modest fleets (single-digit-millions of events) so the per-tick cost
// is bounded. Operators with very large audit tables can lengthen the
// interval; the metric is sticky once incremented, so even an hourly
// walk is enough lead time to surface tampering for human investigation.
func (r *AuditRepository) VerifyHashChain(ctx context.Context) (brokenAtID string, brokenAtPos int, rowCount int, err error) {
	var (
		brokenID sql.NullString
		pos      sql.NullInt32
		total    sql.NullInt32
	)
	row := r.db.QueryRowContext(ctx, `SELECT first_break_id, first_break_pos, row_count FROM audit_events_verify_chain()`)
	if err := row.Scan(&brokenID, &pos, &total); err != nil {
		return "", -1, 0, fmt.Errorf("audit_events_verify_chain: %w", err)
	}
	if brokenID.Valid {
		brokenAtID = brokenID.String
	}
	if pos.Valid {
		brokenAtPos = int(pos.Int32)
	} else {
		brokenAtPos = -1
	}
	if total.Valid {
		rowCount = int(total.Int32)
	}
	return brokenAtID, brokenAtPos, rowCount, nil
}
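
Illustration only, separate from the file above: a caller might
interpret the three values returned by VerifyHashChain roughly as
follows. The ChainVerifier interface, fakeRepo, verifyOnce, and
errUnavailable are hypothetical names introduced for this sketch; the
5-minute timeout mirrors the per-tick timeout mentioned in the commit
message, and the message wording is an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ChainVerifier is a hypothetical one-method view of the repository,
// using the same signature as AuditRepository.VerifyHashChain.
type ChainVerifier interface {
	VerifyHashChain(ctx context.Context) (brokenAtID string, brokenAtPos int, rowCount int, err error)
}

// fakeRepo returns canned results so the sketch runs without a database.
type fakeRepo struct {
	id        string
	pos, rows int
	err       error
}

func (f fakeRepo) VerifyHashChain(context.Context) (string, int, int, error) {
	return f.id, f.pos, f.rows, f.err
}

// errUnavailable is a made-up error used to exercise the failure path.
var errUnavailable = errors.New("database unavailable")

// verifyOnce runs one walk under a bounded context and turns the
// result into a human-readable line. A non-empty id means a break.
func verifyOnce(v ChainVerifier) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	id, pos, rows, err := v.VerifyHashChain(ctx)
	if err != nil {
		// A failed walk says nothing about tampering; report it separately.
		return "", fmt.Errorf("verify audit chain: %w", err)
	}
	if id != "" {
		return fmt.Sprintf("BREAK at row %s (position %d, row_count %d)", id, pos, rows), nil
	}
	return fmt.Sprintf("chain intact over %d rows", rows), nil
}

func main() {
	for _, repo := range []fakeRepo{
		{pos: -1, rows: 3},              // clean walk
		{id: "evt-2", pos: 1, rows: 3},  // tampered middle row
		{pos: -1, err: errUnavailable},  // walk could not run
	} {
		msg, err := verifyOnce(repo)
		fmt.Println(msg, err)
	}
}
```

Keeping the error path distinct from the break path matters here: a
database outage should not increment a tamper alarm.
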