auth-bundle-1 Phase 6-7-8: bootstrap path + scope-down CLI + auditor-role split

# Phase 6: day-0 admin bootstrap

* internal/auth/bootstrap/ (new package): Strategy interface + EnvTokenStrategy with constant-time compare, one-shot consumption guarded by a sync.Mutex, and an optional admin-existence probe. Bundle 2's OIDC-first-admin flow will plug in alongside as an alternate Strategy.
* BootstrapService.ValidateAndMint: validates the operator's CERTCTL_BOOTSTRAP_TOKEN, mints a random 32-byte API key (64 hex chars), persists its SHA-256 hash to api_keys, grants r-admin via actor_roles, calls AddHashed on the runtime keystore so the just-minted key authenticates the next request without a restart, and records bootstrap.consume in the audit trail with category=auth.
* internal/auth/keystore.go (new): KeyStore interface + StaticKeyStore (immutable, env-var keys only) + MutableKeyStore (env-var keys + DB-loaded api_keys + runtime AddHashed). The auth middleware now consumes a KeyStore so the bootstrap path can extend the lookup table at runtime.
* migrations/000031_api_keys.up/down.sql: api_keys table with (id, name UNIQUE, key_hash UNIQUE, tenant_id, admin, created_by, created_at, expires_at, last_used_at). Idempotent.
* /v1/auth/bootstrap GET (probe) + POST (mint), both auth-exempt. Both routes are documented in api/openapi.yaml and added to the AuthExemptRouterRoutes allowlist. The token never leaves internal/auth/bootstrap; the minted plaintext key flows only into the HTTP response body.
* Startup warning emitted when CERTCTL_BOOTSTRAP_TOKEN is set AND admin actors already exist (config-drift signal).
* Tests: 4 strategy invariants (empty token born disabled, wrong token = ErrInvalidToken without consumption, one-shot consumption, admin-exists closes the path), 5 service tests (happy path + actor-name validation + propagation of strategy errors + nil-deps guard + 32-byte entropy budget), 8 HTTP-handler tests (201/410/401/400 status mapping + token-leak hygiene scan of slog output and audit details + Location header). The token-leak test redirects slog.Default to a buffer for the test scope.
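The mint step in ValidateAndMint (32 random bytes, hex-encoded to 64 chars, SHA-256 hash persisted) can be sketched as below. This is a minimal sketch, not the shipped code: mintAPIKey and hashAPIKey are hypothetical names, and hashing the hex-encoded string (rather than the raw 32 bytes) is an assumption, since the commit does not say which form is hashed.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// mintAPIKey (hypothetical) draws 32 bytes from the OS CSPRNG, hex-encodes
// them into the 64-char plaintext key returned once to the operator, and
// derives the digest that would be stored in api_keys.key_hash.
func mintAPIKey() (plaintext, keyHash string, err error) {
	raw := make([]byte, 32)
	if _, err = rand.Read(raw); err != nil {
		return "", "", fmt.Errorf("mint api key: %w", err)
	}
	plaintext = hex.EncodeToString(raw)
	return plaintext, hashAPIKey(plaintext), nil
}

// hashAPIKey (hypothetical) hashes the plaintext exactly as a request-time
// lookup would, so a presented key can be matched against the stored digest.
// Assumption: the hex string, not the raw bytes, is what gets hashed.
func hashAPIKey(plaintext string) string {
	sum := sha256.Sum256([]byte(plaintext))
	return hex.EncodeToString(sum[:])
}

func main() {
	key, hash, err := mintAPIKey()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(key), len(hash))     // 64 64
	fmt.Println(hashAPIKey(key) == hash) // true
}
```

Persisting only the digest means a dump of api_keys does not yield usable keys; the plaintext exists only in the single POST response.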
# Phase 7: API-key migration + scope-down CLI

* GET /v1/auth/keys handler + service method ListKeys, backed by ActorRoleRepository.ListDistinctActors. Returns one row per (actor_id, actor_type) pair with the slice of role IDs that actor holds. Permission: auth.role.list.
* internal/cli/auth_scope_down.go: AuthListKeys, AuthScopeDown (interactive), AuthScopeDownNonInteractive (JSON config), AuthScopeDownSuggest (--suggest with optional --apply). The synthetic actor-demo-anon is filtered out of every interactive and bulk path; the non-interactive flow logs and skips it explicitly.
* SuggestRoleFromAuditEvents (pure function): walks 30 days of audit events per actor and returns the narrowest matching role (admin / mcp / viewer / agent / operator) plus a one-line reason. Classification: any admin-shaped action wins; otherwise all-MCP → mcp; all-read-only → viewer; all-agent-shaped → agent; otherwise operator. The test table pins all six classifications.
* CLI subcommand tree extended: 'auth keys list' + 'auth keys scope-down [--non-interactive <cfg>] [--suggest [--apply]]'.
* CHANGELOG.md leads v2.1.0 with the SECURITY: AUDIT YOUR API KEYS call-out + four flow examples.

# Phase 8: auditor role + event_category column

* migrations/000032_audit_category.up/down.sql: ALTER TABLE audit_events ADD COLUMN event_category TEXT NOT NULL DEFAULT 'cert_lifecycle', plus a CHECK constraint (cert_lifecycle/auth/config) and indexes on (event_category) and (event_category, timestamp DESC) for the auditor-filter query path. The WORM trigger from migration 000018 continues to enforce append-only at the DB layer (the trigger does not block DDL).
* domain.AuditEvent gains EventCategory string (omitempty), plus domain.EventCategoryCertLifecycle / Auth / Config constants.
* AuditService.RecordEventWithCategory added as a sibling of RecordEvent; legacy callers stay on RecordEvent (which defaults to cert_lifecycle). Auth callers (RoleService, ActorRoleService, BootstrapService) switched to RecordEventWithCategory(..., 'auth', ...).
* GET /v1/audit?category=<cat>: the handler accepts the optional query param, validates it against the enum (400 on an invalid value), and dispatches through ListAuditEventsByCategory. OpenAPI updated with the new query param + the AuditEvent.event_category schema field.
* Postgres AuditRepository.Create now writes event_category; AuditRepository.List filters on it, with AuditFilter.EventCategory gating the WHERE clause.
* Tests: 5 audit-category-filter HTTP tests (dispatch routing, back-compat fallback, 400 for invalid values, all 3 enum values accepted, page+category combine, JSON output surfaces the field). 3 auditor-role invariants (auditor holds exactly audit.read+audit.export, no mutating perms, disjoint from viewer except audit.read).

# Cross-phase wiring

* HandlerRegistry.Bootstrap field added; cmd/server/main.go wires the bootstrap service ahead of RegisterHandlers (extracted an assembleNamedAPIKeys helper into auth_backfill.go and moved the keystore + bootstrap construction up alongside the auth repos).
* AuthCheckResolver / AuthActorRoleService extended with ListKeys to satisfy the Phase 7 surface; existing fakes updated.
* fakeAudit + mockAuditService test stubs gain RecordEventWithCategory + ListAuditEventsByCategory; existing tests untouched.

# Verifications

* gofmt -l: clean across every modified file.
* go vet ./...: clean.
* staticcheck across internal/auth + handler + router + cli + service + repository + cmd + domain: clean.
* go test -short -count=1: green across every Bundle-1-touched package (internal/auth incl. bootstrap, internal/api/handler, internal/api/router, internal/cli, internal/service/auth, internal/service, internal/domain/auth, internal/repository/postgres, cmd/server, cmd/cli), plus internal/scheduler, internal/api/middleware, cmd/agent, internal/mcp.
2026-06-07 19:21:29 +00:00 · 2026-05-09 20:15:43 +00:00
parent 60a589ab96
commit 3ef45e2ad4
38 changed files with 3159 additions and 140 deletions
@@ -0,0 +1,194 @@
+// Package bootstrap ships the day-0 admin-creation primitive for Bundle 1
+// Phase 6. The control plane comes up with no admin-roled actors; the
+// operator hands the env-var token to a single curl call; the server
+// mints the first admin API key, returns the key value once, then locks
+// the bootstrap door behind it.
+//
+// The Strategy interface is the forward-compat seam: Bundle 2 plugs in an
+// OIDC-first-admin strategy (the operator logs in via OIDC, the server
+// recognizes their group claim, the first such login auto-grants r-admin)
+// alongside the env-var-token strategy this file ships. Both implementations
+// satisfy the same interface; the boot path picks one based on which
+// CERTCTL_BOOTSTRAP_* env var is set.
+package bootstrap
+
+import (
+	"context"
+	"crypto/subtle"
+	"errors"
+	"sync"
+)
+
+// Sentinel errors the HTTP handler maps to status codes.
+var (
+	// ErrDisabled is returned when the bootstrap path is not callable
+	// either because (a) no token was set, or (b) admin actors already
+	// exist, or (c) the token was already consumed by an earlier call.
+	// Maps to HTTP 410 Gone.
+	ErrDisabled = errors.New("bootstrap: endpoint disabled")
+
+	// ErrInvalidToken is returned when the supplied token does not
+	// match the env-var token (constant-time compared). Maps to HTTP
+	// 401 Unauthorized. An unset CERTCTL_BOOTSTRAP_TOKEN never reaches
+	// this error: NewEnvTokenStrategy builds that strategy already
+	// consumed, so callers get ErrDisabled (410 Gone) instead.
+	ErrInvalidToken = errors.New("bootstrap: invalid token")
+
+	// ErrInvalidActorName is returned when the requested admin-key
+	// name is empty or contains characters that would break audit
+	// attribution. Maps to HTTP 400.
+	ErrInvalidActorName = errors.New("bootstrap: invalid actor name")
+)
+
+// Strategy is the bundle 1 -> bundle 2 forward-compat seam. Each
+// strategy gates the day-0 admin path with a different credential type:
+// Bundle 1 ships EnvTokenStrategy (CERTCTL_BOOTSTRAP_TOKEN); Bundle 2
+// adds OIDCFirstAdminStrategy (CERTCTL_BOOTSTRAP_OIDC_GROUP). The
+// service holds whichever strategy was wired at boot.
+type Strategy interface {
+	// Available reports whether the strategy is currently callable.
+	// Returns false once the strategy is consumed (one-shot semantics)
+	// OR once the strategy detects an existing admin (via the
+	// AdminExistenceProbe). The HTTP handler maps !Available to 410
+	// Gone before doing any token validation, so probing for "is there
+	// a bootstrap path open" is safe.
+	Available(ctx context.Context) (bool, error)
+
+	// Validate consumes the credential and returns nil when the caller
+	// is permitted to mint the first admin. The strategy MUST flip its
+	// consumed state atomically on the first successful Validate so a
+	// concurrent racing call gets ErrDisabled. Returning a non-nil
+	// error MUST NOT mark the strategy consumed; the operator can
+	// retry with the correct credential.
+	Validate(ctx context.Context, token string) error
+}
+
+// AdminExistenceProbe is the callback the EnvTokenStrategy uses to ask
+// the actor-role repository whether any actor holds r-admin. Lives at
+// this package boundary so the strategy doesn't import internal/repository
+// (would create a cycle: bootstrap -> repository -> postgres -> bootstrap
+// when the postgres adapter is wired).
+type AdminExistenceProbe func(ctx context.Context) (bool, error)
+
+// EnvTokenStrategy is the env-var-token Bundle 1 implementation. The
+// operator sets CERTCTL_BOOTSTRAP_TOKEN, the server boots with this
+// strategy, the first valid Validate call atomically flips the
+// `consumed` flag and the next call returns ErrDisabled.
+//
+// The token comparison is crypto/subtle.ConstantTimeCompare so timing
+// attacks can't leak the token byte-by-byte. The token itself never
+// leaves this package: the strategy holds it in memory, the handler
+// receives only error sentinels, the audit row records the event but
+// not the token value.
+type EnvTokenStrategy struct {
+	token       string              // set once at construction; never mutated
+	probe       AdminExistenceProbe // optional; nil = skip the existence probe
+	mu          sync.Mutex          // guards consumed
+	consumed    bool                // flipped to true after first successful Validate
+	tokenLength int                 // cached for the zero-length reject in Validate
+}
+
+// NewEnvTokenStrategy constructs the env-var-token strategy. token must
+// be the raw value of CERTCTL_BOOTSTRAP_TOKEN. probe is optional; when
+// non-nil it gates Available + Validate on "no admin exists yet" so the
+// caller can't bootstrap a second admin after the fleet has stabilized.
+//
+// When token is empty the returned strategy is born consumed —
+// Available returns false, Validate returns ErrDisabled. This matches
+// the boot-path contract that an unset env var disables the endpoint.
+func NewEnvTokenStrategy(token string, probe AdminExistenceProbe) *EnvTokenStrategy {
+	s := &EnvTokenStrategy{
+		token:       token,
+		probe:       probe,
+		tokenLength: len(token),
+	}
+	if token == "" {
+		s.consumed = true
+	}
+	return s
+}
+
+// Available implements Strategy.
+func (s *EnvTokenStrategy) Available(ctx context.Context) (bool, error) {
+	s.mu.Lock()
+	consumed := s.consumed
+	s.mu.Unlock()
+	if consumed {
+		return false, nil
+	}
+	if s.probe != nil {
+		exists, err := s.probe(ctx)
+		if err != nil {
+			return false, err
+		}
+		if exists {
+			return false, nil
+		}
+	}
+	return true, nil
+}
+
+// Validate implements Strategy.
+func (s *EnvTokenStrategy) Validate(ctx context.Context, token string) error {
+	// Fast-path: if the strategy is consumed, return ErrDisabled before
+	// doing any constant-time compare. The state flip below acquires
+	// the same mutex so this read is safe.
+	s.mu.Lock()
+	if s.consumed {
+		s.mu.Unlock()
+		return ErrDisabled
+	}
+	// Refuse zero-length tokens up front. ConstantTimeCompare returns
+	// 1 when both inputs are empty, which would otherwise produce a
+	// permanent backdoor on misconfigured deployments where token=""
+	// at construction; NewEnvTokenStrategy already covers that, but
+	// belt-and-braces here in case a future caller builds the struct
+	// directly instead of going through NewEnvTokenStrategy.
+	if s.tokenLength == 0 || len(token) == 0 {
+		s.mu.Unlock()
+		return ErrInvalidToken
+	}
+	// Constant-time compare of the contents. ConstantTimeCompare
+	// returns 0 immediately when lengths differ, so the token length is
+	// not hidden; equal-length inputs are compared without early exit.
+	if subtle.ConstantTimeCompare([]byte(s.token), []byte(token)) != 1 {
+		s.mu.Unlock()
+		return ErrInvalidToken
+	}
+	// External probe: respect the "admin already exists" gate even
+	// after a valid token was supplied. This closes the race where an
+	// admin is granted through some other path in the gap between
+	// Available and Validate.
+	if s.probe != nil {
+		// Drop the lock for the probe — repo calls may be slow and
+		// holding the mutex through I/O would serialize every
+		// concurrent bootstrap attempt. Re-acquire after.
+		s.mu.Unlock()
+		exists, err := s.probe(ctx)
+		if err != nil {
+			return err
+		}
+		if exists {
+			return ErrDisabled
+		}
+		s.mu.Lock()
+		// Re-check consumed because a concurrent caller might have
+		// flipped it while we were probing.
+		if s.consumed {
+			s.mu.Unlock()
+			return ErrDisabled
+		}
+	}
+	s.consumed = true
+	s.mu.Unlock()
+	return nil
+}
+
+// IsConsumed reports whether the strategy has already been used. Test
+// helper; production callers should use Available which also runs the
+// admin-existence probe.
+func (s *EnvTokenStrategy) IsConsumed() bool {
+	s.mu.Lock()
+	defer s.mu.Unlock()
+	return s.consumed
+}