Files
certctl/internal/service/audit.go
T
shankar0123 02438ad9e1 ci: floor raise + doc drift (Phase 3 closure — TEST-H1/H2/M1/M2/M3/M4/L1, ARCH-H3/L1/L2/L3/L4)
Twelve findings from the architecture diligence audit's Phase 3 bundle
closed in one PR. All touch the CI workflows + small doc-drift fixes
across the production Go tree + migration headers.

CI workflow changes
====================

TEST-H1 — Race detection on ./... -short
  .github/workflows/ci.yml:106 was a 9-package explicit list. Audit
  finding TEST-H1 flagged that 25+ packages (internal/auth/*,
  internal/repository/*, internal/mcp, internal/scep, internal/pkcs7,
  internal/api/router, internal/api/acme, internal/cli, internal/cms,
  internal/config, internal/deploy, internal/integration,
  internal/ratelimit, internal/secret, internal/trustanchor, all of
  cmd/) silently dropped off race coverage.
  Post-fix: 'go test -race -short ./... -count=1 -timeout 600s'.
  76 testing.Short() guards already cover testcontainers + live-DB
  integration suites, so -short keeps the long-running tests out.

TEST-H2 — Cross-platform build matrix
  New 'cross-platform-build' job in ci.yml. Matrix:
  ubuntu-latest + windows-latest + macos-latest, fail-fast: false.
  Builds cmd/server + cmd/agent + cmd/cli + cmd/mcp-server on each.
  Catches Windows-specific regressions (path separators, file
  permissions, exec.Command semantics) the pre-Phase-3 Ubuntu-only
  CI missed.

TEST-L1 — actions/setup-go cache: true (explicit)
  setup-go v5 defaults cache: true; making it explicit so a future
  setup-go upgrade can't silently flip it. Re-runs hit the Go module
  + build cache instead of recompiling cold.

TEST-M1 — Mutation-testing floor at 55%
  security-deep-scan.yml::go-mutesting step rewritten. Removed
  continue-on-error + per-package '|| true'. New post-loop check
  extracts every 'The mutation score is X.YZ' line and fails the
  step if any package drops below 0.55. Floor rationale: starter
  ratio catches major regressions without rejecting the audit's
  'this is OK' steady state; raise quarterly.

TEST-M2 — 3 advisory deep-scan gates promoted to blocking
  Removed continue-on-error: true from:
    - gosec (filtered to G201/G202/G304/G108 high-signal rules:
      SQL-injection + path-traversal + pprof-exposed)
    - osv-scanner (multi-ecosystem CVE; complements govulncheck
      which is already blocking in ci.yml)
    - trivy image scan (--severity HIGH,CRITICAL --exit-code 1)
  continue-on-error count: 15 → 11.
  ZAP / schemathesis / nuclei / testssl stay advisory because their
  false-positive rates on https://localhost:8443-targeted DAST runs
  are high.

TEST-M3 — Playwright harness stub
  web/package.json adds '@playwright/test' devDep + 'e2e' / 'e2e:install'
  npm scripts. web/playwright.config.ts ships single chromium project
  with webServer block pointing at 'npm run dev'. web/src/__tests__/
  e2e/smoke.spec.ts proves the harness wires through. The full 15-flow
  suite ships in frontend-design-audit Phase 8 (TEST-H1 in THAT audit);
  this is the wiring + a single smoke test as the regression floor.
  New Makefile target: 'make e2e-test'.

Doc/code drift fixes
====================

TEST-M4 + ARCH-L2 — Skip inventory artifact + CI guard
  scripts/skip-inventory.sh walks every t.Skip site under cmd/ +
  internal/ + deploy/test/ and emits docs/testing/skip-inventory.md
  grouped by package with file:line:expression triples. Current
  inventory: 142 t.Skip sites, 76 testing.Short() guards.
  scripts/ci-guards/skip-inventory-drift.sh regenerates and fails on
  diff (excluding the 'Last reviewed' timestamp line which drifts
  daily). The Markdown is the canonical acquisition-diligence artifact
  for 'what tests are being skipped and why.'

ARCH-H3 — MCP catalogue floor reconciliation
  Audit framing was '121 vs floor 150 — doc/code drift.' Live count
  via the test's actual regex over all 5 tool files (tools.go +
  tools_audit_fix.go + tools_auth.go + tools_auth_bundle2.go +
  tools_est.go): 155 unique 'Name: "certctl_*"' declarations.
  Pre-Phase-3 audit measured tools.go in isolation (121) and missed
  the other 4 files (+34 unique names). The test at
  internal/ciparity/surface_parity_test.go::TestSurfaceParity_MCP
  passes today (155 ≥ 150). Added a clarifying comment near
  mcpBaselineFloor explaining the measurement scope so future
  reviewers don't repeat the audit's framing error.
  STATUS: stale — no code drift, just a measurement scoping error in
  the audit.

ARCH-L1 — panic() rationale comments
  5 panic sites in production Go (excluding _test.go):
    - internal/repository/postgres/tx.go:84
    - internal/service/issuer.go:861 (mustJSON)
    - internal/service/est.go:728 (mustParseTime)
    - internal/service/acme.go:1288 (rand source failure — already documented)
    - internal/pkcs7/certrep.go:270 (OID marshal — already documented)
  Added ARCH-L1 rationale comments to the 3 sites that didn't have
  them. All 5 are defensible impossible-path / rethrow / hardcoded-
  constant guards.

ARCH-L3 — Migration IF-NOT-EXISTS carve-outs
  4 migrations skip the literal 'IF NOT EXISTS' token but ARE
  idempotent via different Postgres patterns:
    - 000014_policy_violation_severity_check.up.sql: ALTER TABLE
      ADD CONSTRAINT CHECK doesn't accept IF NOT EXISTS; idempotency
      via DROP CONSTRAINT IF EXISTS preamble.
    - 000018_audit_events_worm.up.sql: CREATE OR REPLACE FUNCTION
      + DROP TRIGGER IF EXISTS + CREATE TRIGGER + DO $$ pg_roles
      existence check. CREATE TRIGGER doesn't take IF NOT EXISTS.
    - 000030_rbac_admin_perms.up.sql: INSERT ... ON CONFLICT DO NOTHING.
    - 000039_audit_crit1_perms.up.sql: same INSERT + ON CONFLICT pattern.
  Added ARCH-L3 header comments to each explaining the carve-out so
  reviewers don't flag the missing literal token.
  STATUS: largely stale — migrations are already idempotent.

ARCH-L4 — TODO/FIXME → see #<descriptor>
  5 TODOs rewritten to the allowed 'see #<descriptor>' pattern:
    - internal/repository/postgres/auth.go:220 → see #bundle-2-scope-fk
    - internal/connector/discovery/gcpsm/gcpsm.go:547 → see #gcpsm-pagination
    - internal/service/audit.go:244 → see #audit-pagination-count
    - internal/service/job.go:295, 299 → see #validation-job-impl
  New CI guard scripts/ci-guards/no-todo-in-prod.sh grep-fails any
  new TODO/FIXME in cmd/ + internal/ (excluding _test.go); allows
  'see #N' / 'see #<descriptor>' patterns.

Sandbox limitation
==================
The 6.1 GB certctl working tree fills the sandbox volume; go1.25.10
toolchain download fails with 'no space left on device' (sandbox has
1.25.9; go.mod requires 1.25.10). Local 'go test' / 'go build' NOT
run in this commit. Operator must run 'make verify' on their
workstation before push per CLAUDE.md operating rules.

The smoke.spec.ts NOT executed in the sandbox (no chromium installed).
Operator runs 'cd web && npm install && npx playwright install
--with-deps chromium && npm run e2e' on first wire-up.

All CI guards (no-todo-in-prod, skip-inventory-drift, G-3
env-docs-drift, doc-rot-detector, and every existing guard) verified
clean by running each individually.

Closes: cowork/certctl-architecture-diligence-audit.html#fix-TEST-H1,
        cowork/certctl-architecture-diligence-audit.html#fix-TEST-H2,
        cowork/certctl-architecture-diligence-audit.html#fix-TEST-M1,
        cowork/certctl-architecture-diligence-audit.html#fix-TEST-M2,
        cowork/certctl-architecture-diligence-audit.html#fix-TEST-M3,
        cowork/certctl-architecture-diligence-audit.html#fix-TEST-M4,
        cowork/certctl-architecture-diligence-audit.html#fix-TEST-L1,
        cowork/certctl-architecture-diligence-audit.html#fix-ARCH-H3,
        cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L1,
        cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L2,
        cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L3,
        cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L4
2026-05-13 20:10:08 +00:00

312 lines
11 KiB
Go

package service
import (
"context"
"encoding/json"
"fmt"
"time"
"github.com/certctl-io/certctl/internal/domain"
"github.com/certctl-io/certctl/internal/repository"
)
// AuditService provides business logic for recording and retrieving audit events.
type AuditService struct {
auditRepo repository.AuditRepository
}
// NewAuditService creates a new audit service.
func NewAuditService(auditRepo repository.AuditRepository) *AuditService {
return &AuditService{
auditRepo: auditRepo,
}
}
// RecordEvent records an audit event with actor, action, and resource information.
//
// Bundle-6 / Audit H-008 + M-022 / CWE-532: every details map flows through
// RedactDetailsForAudit BEFORE marshaling. The redactor scrubs credential
// keys (api_key, password, token, *_pem, eab_secret, ...) and PII keys
// (email, phone, ssn, name, address, ip_address, ...) and surfaces a
// `redacted_keys` array so operators can audit the redactor itself during
// a compliance review. See internal/service/audit_redact.go.
func (s *AuditService) RecordEvent(ctx context.Context, actor string, actorType domain.ActorType, action string, resourceType string, resourceID string, details map[string]interface{}) error {
return s.RecordEventWithCategory(ctx, actor, actorType, action, "", resourceType, resourceID, details)
}
// RecordEventWithCategory is the Bundle 1 Phase 8 categorized variant
// of RecordEvent. eventCategory is one of
// domain.EventCategoryCertLifecycle, domain.EventCategoryAuth,
// domain.EventCategoryConfig — empty defaults to cert_lifecycle in
// the persistence layer + DB CHECK constraint.
//
// Existing 90+ call sites that don't yet pass a category route
// through the legacy RecordEvent and inherit the cert_lifecycle
// default; new callers (auth handlers, bootstrap, config-mutation
// handlers) call this method directly with their explicit category.
// Both paths share the same redaction + marshaling contract.
func (s *AuditService) RecordEventWithCategory(ctx context.Context, actor string, actorType domain.ActorType, action, eventCategory, resourceType, resourceID string, details map[string]interface{}) error {
redacted := RedactDetailsForAudit(details)
detailsJSON, err := json.Marshal(redacted)
if err != nil {
detailsJSON = []byte("{}")
}
event := &domain.AuditEvent{
ID: generateID("audit"),
Timestamp: time.Now(),
Actor: actor,
ActorType: actorType,
Action: action,
ResourceType: resourceType,
ResourceID: resourceID,
Details: json.RawMessage(detailsJSON),
EventCategory: eventCategory,
}
if err := s.auditRepo.Create(ctx, event); err != nil {
return fmt.Errorf("failed to record audit event: %w", err)
}
return nil
}
// RecordEventWithTx records an audit event using the supplied repository.Querier.
//
// Pass *sql.Tx (typically obtained from postgres.WithinTx) to participate in
// a caller's transaction so the audit row is atomic with the operation that
// triggered it. Closes the #3 acquisition-readiness blocker from the
// 2026-05-01 issuer coverage audit (audit row not transactional with the
// operation it audits).
//
// Same redaction + marshalling contract as RecordEvent; only the database
// handle changes.
func (s *AuditService) RecordEventWithTx(ctx context.Context, q repository.Querier, actor string, actorType domain.ActorType, action string, resourceType string, resourceID string, details map[string]interface{}) error {
redacted := RedactDetailsForAudit(details)
detailsJSON, err := json.Marshal(redacted)
if err != nil {
detailsJSON = []byte("{}")
}
event := &domain.AuditEvent{
ID: generateID("audit"),
Timestamp: time.Now(),
Actor: actor,
ActorType: actorType,
Action: action,
ResourceType: resourceType,
ResourceID: resourceID,
Details: json.RawMessage(detailsJSON),
}
if err := s.auditRepo.CreateWithTx(ctx, q, event); err != nil {
return fmt.Errorf("failed to record audit event: %w", err)
}
return nil
}
// RecordEventWithCategoryWithTx records a categorized audit event using
// the supplied repository.Querier so the row is committed in the same
// transaction as the underlying action. Mirrors RecordEventWithCategory
// but takes the Querier (typically *sql.Tx from postgres.WithinTx).
//
// Audit 2026-05-10 HIGH-6 closure — closes the gap where Bundle-1+2
// auth-mutation paths emitted the audit row via a separate, non-
// transactional connection. A DB hiccup or connection reset between
// the action and the audit-row INSERT used to leave the action
// committed with no audit trail (CWE-778). With this method, the
// audit row participates in the action's transaction: rollback on
// any failure removes both the action row AND any audit row that the
// caller wrote inside the tx.
func (s *AuditService) RecordEventWithCategoryWithTx(ctx context.Context, q repository.Querier, actor string, actorType domain.ActorType, action, eventCategory, resourceType, resourceID string, details map[string]interface{}) error {
redacted := RedactDetailsForAudit(details)
detailsJSON, err := json.Marshal(redacted)
if err != nil {
detailsJSON = []byte("{}")
}
event := &domain.AuditEvent{
ID: generateID("audit"),
Timestamp: time.Now(),
Actor: actor,
ActorType: actorType,
Action: action,
ResourceType: resourceType,
ResourceID: resourceID,
Details: json.RawMessage(detailsJSON),
EventCategory: eventCategory,
}
if err := s.auditRepo.CreateWithTx(ctx, q, event); err != nil {
return fmt.Errorf("failed to record audit event: %w", err)
}
return nil
}
// List returns audit events matching filter criteria.
func (s *AuditService) List(ctx context.Context, filter *repository.AuditFilter) ([]*domain.AuditEvent, error) {
events, err := s.auditRepo.List(ctx, filter)
if err != nil {
return nil, fmt.Errorf("failed to list audit events: %w", err)
}
return events, nil
}
// ListByResource returns all audit events for a specific resource.
func (s *AuditService) ListByResource(ctx context.Context, resourceType string, resourceID string) ([]*domain.AuditEvent, error) {
filter := &repository.AuditFilter{
ResourceType: resourceType,
ResourceID: resourceID,
PerPage: 1000, // reasonable default for single resource
}
events, err := s.auditRepo.List(ctx, filter)
if err != nil {
return nil, fmt.Errorf("failed to list audit events: %w", err)
}
return events, nil
}
// ListByActor returns all audit events for a specific actor.
func (s *AuditService) ListByActor(ctx context.Context, actor string) ([]*domain.AuditEvent, error) {
filter := &repository.AuditFilter{
Actor: actor,
PerPage: 1000,
}
events, err := s.auditRepo.List(ctx, filter)
if err != nil {
return nil, fmt.Errorf("failed to list audit events: %w", err)
}
return events, nil
}
// ListByAction returns all audit events for a specific action type.
func (s *AuditService) ListByAction(ctx context.Context, action string, from, to time.Time) ([]*domain.AuditEvent, error) {
filter := &repository.AuditFilter{
From: from,
To: to,
PerPage: 1000,
}
events, err := s.auditRepo.List(ctx, filter)
if err != nil {
return nil, fmt.Errorf("failed to list audit events: %w", err)
}
// Filter by action on client side (repository may not filter by action directly)
var filtered []*domain.AuditEvent
for _, e := range events {
if e.Action == action {
filtered = append(filtered, e)
}
}
return filtered, nil
}
// ListAuditEvents returns paginated audit events (handler interface method).
func (s *AuditService) ListAuditEvents(ctx context.Context, page, perPage int) ([]domain.AuditEvent, int64, error) {
return s.ListAuditEventsByCategory(ctx, "", page, perPage)
}
// ListAuditEventsByCategory is the Bundle 1 Phase 8 categorized variant.
// Empty eventCategory disables the filter.
func (s *AuditService) ListAuditEventsByCategory(ctx context.Context, eventCategory string, page, perPage int) ([]domain.AuditEvent, int64, error) {
if page < 1 {
page = 1
}
if perPage < 1 {
perPage = 50
}
filter := &repository.AuditFilter{
EventCategory: eventCategory,
Page: page,
PerPage: perPage,
}
events, err := s.auditRepo.List(ctx, filter)
if err != nil {
return nil, 0, fmt.Errorf("failed to list audit events: %w", err)
}
// Convert pointers to values for the handler interface
var result []domain.AuditEvent
for _, e := range events {
if e != nil {
result = append(result, *e)
}
}
// see #audit-pagination-count — the repository currently returns
// the full filtered slice and we surface len(result) as total. This
// works for the audit page's current shape (server-side filter +
// client-side pagination over a bounded window) but is wrong when the
// frontend ports to server-side cursoring (Phase 9 P-H2). At that
// point the repository must add a CountAuditEvents(filter) method and
// this line becomes total, _ := s.repo.CountAuditEvents(ctx, filter).
total := int64(len(result))
return result, total, nil
}
// ExportEventsByFilter returns audit events matching a date-range +
// optional category filter without pagination — the export handler
// uses this to stream NDJSON for compliance evidence collection.
//
// Audit 2026-05-10 HIGH-11 closure: pre-fix, the `audit.export`
// permission was seeded into r-admin and r-auditor (migration 000031)
// but no endpoint enforced it — misleading capability advertisement.
// This method is the service-layer building block for the new
// GET /api/v1/audit/export endpoint.
//
// Bounded callers: the handler enforces a max 90-day range + max-rows
// cap before invoking this; the service-layer method itself is
// permissive so future callers (compliance-job runner, MCP tool) can
// reuse the helper without duplicating the bound enforcement.
func (s *AuditService) ExportEventsByFilter(ctx context.Context, from, to time.Time, eventCategory string, maxRows int) ([]domain.AuditEvent, error) {
if maxRows <= 0 {
maxRows = 50000
}
filter := &repository.AuditFilter{
EventCategory: eventCategory,
From: from,
To: to,
Page: 1,
PerPage: maxRows,
}
events, err := s.auditRepo.List(ctx, filter)
if err != nil {
return nil, fmt.Errorf("failed to list audit events for export: %w", err)
}
out := make([]domain.AuditEvent, 0, len(events))
for _, e := range events {
if e != nil {
out = append(out, *e)
}
}
return out, nil
}
// GetAuditEvent returns a single audit event (handler interface method).
func (s *AuditService) GetAuditEvent(ctx context.Context, id string) (*domain.AuditEvent, error) {
filter := &repository.AuditFilter{
ResourceID: id,
PerPage: 1,
}
events, err := s.auditRepo.List(ctx, filter)
if err != nil {
return nil, fmt.Errorf("failed to get audit event: %w", err)
}
if len(events) == 0 {
return nil, fmt.Errorf("audit event not found")
}
return events[0], nil
}