certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 15:21:35 +00:00

Author	SHA1	Message	Date
shankar0123	7c04f0a0b8	fix(oidc): test seam for jwksProbeClient — closes the B5 R6 httptest regression CI break diagnosed from go-build-and-test on f1fa311+36840dd: TestTestDiscovery_HappyPath_AgainstMockIdP + TestTestDiscovery_JWKSFetchFails fail with "refusing to dial reserved address 127.0.0.1" because my Bundle 5 R6 closure wrapped jwksReachable in validation.SafeHTTPDialContext — which is exactly what the production guard is supposed to refuse for httptest.NewServer's 127.0.0.1 bind. Same shape as the Slack/Teams test-seam fix in `36840dd`: factor the http.Client construction into a package-level var (`jwksProbeClient`), default to the SSRF-safe transport in production, override to http.DefaultTransport in test-only `setup_test.go::init()`. Production code never reassigns the var. The audit R6 closure stands — the production jwksReachable still uses validation.SafeHTTPDialContext. Verification (sandbox, Go 1.25.10): go test -short -count=1 -run 'TestTestDiscovery_HappyPath\|TestTestDiscovery_JWKSFetchFails' ./internal/auth/oidc # PASS (1.1s) go test -short -count=1 ./internal/auth/oidc # PASS (21.8s) gofmt -l # clean go vet ./internal/auth/oidc # clean	2026-05-13 01:30:47 +00:00
shankar0123	36840ddd01	fix(security): close BUNDLE 5 — auth, OIDC, MCP, API + browser security edges Bundle 5 closure (2026-05-13 acquisition diligence audit). 13-finding security audit pass across the auth / OIDC / MCP / API / browser- security surface. Five real closures shipped in code, two false-as- stated findings annotated with the existing implementation, three operator-decision items documented for v3 follow-up, three doc-only fixes (auth architecture narrative aligned with shipped OIDC). Source findings closed (code): S1 break-glass /auth/breakglass/login lacked the documented 5/min per-source-IP rate limit; handler now owns its own SlidingWindowLimiter wired at startup. Doc claim turns true. R6 OIDC test_discovery JWKS probe ran on http.DefaultClient; now uses an http.Client whose transport wraps validation.SafeHTTPDialContext. JWKS URI can no longer pivot into reserved-address ranges via DNS rebinding. R7 Slack + Teams notifiers built http.Client without the SSRF dial-time guard. Both New() constructors now install validation.SafeHTTPDialContext; webhook URLs (operator- configured via dynamic-config GUI) cannot dial 169.254.x or in-cluster reserved ranges. Test seam: newForTest bypasses the guard for httptest's 127.0.0.1 binds, mirroring the existing internal/connector/notifier/webhook pattern. RT-L2 CERTCTL_ACME_INSECURE=true now emits a prominent logger.Warn at server boot. Pre-Bundle-5 the knob silently disabled ACME directory TLS verification. Source findings closed (doc): finding 1 + HIGH-5 Architecture doc claimed no in-process JWT/ OIDC/mTLS/SAML and pointed everyone at the authenticating-gateway pattern. Auth Bundle 2 (commit dea5053) shipped native OIDC + sessions + break-glass. New §"In-process authentication surface" table (api-key / oidc / none) supersedes the old framing; "Authenticating-gateway pattern (SAML, mTLS-as-auth, LDAP)" section retained for protocols certctl still doesn't ship natively. Source findings verified false (existing implementation): S4 OIDC email-domain allowlist — `email_domain_test.go` already pins the strict-equality semantics (subdomain not auto-accepted, multi-entry no-match path, empty allowlist accepts all by-design per RFC 9700 §4.1.1). SEC-L1 CSP / HSTS / referrer-policy headers — already shipped at internal/api/middleware/securityheaders.go and wired at cmd/server/main.go L2003+L2027+L2115. Operator-decision / deferred (tracked in bundle-5 closure doc): S3 CERTCTL_API_KEYS_NAMED parsing is wired, end-to-end validation is partial. Operator decides: complete the named-key middleware path or deprecate the syntax. S5 Audit-middleware best-effort for read paths; security-critical writes use WithinTx. Operator decides per-path escalation. S8 MCP threat model — the binary is a thin protocol bridge, no privileges of its own; every tool call carries CERTCTL_API_KEY and is auth'd + RBAC-gated server-side. Optional CERTCTL_MCP_READ_ONLY gate tracked as v3. SEC-H1 2026-05-10 audit CRIT-1/2/4 already closed on master; CRIT-3/5 status against the spec folder is operator- workstation-validation-only. Documented for follow-up. SEC-L2 WebAuthn / FIDO2 / step-up — already documented in docs/operator/auth-threat-model.md "Threats Bundle 2 does NOT close". v3 work item per CLAUDE.md decision 12. Full per-finding rationale + receipts at docs/operator/security-bundle-5-audit-closure.md. Verification: gofmt -l # clean go vet ./internal/connector/notifier/slack ./internal/connector/notifier/teams ./internal/auth/oidc ./internal/api/handler ./cmd/server # clean go build ./cmd/server [...] # clean go test -short -count=1 ./internal/connector/notifier/slack ./internal/connector/notifier/teams ./internal/api/handler ./internal/auth/oidc ./internal/config # PASS # (slack 0.028s + teams # 0.023s + handler 11.0s; # newForTest seam keeps # httptest tests green) Audit-Closes: BUNDLE-5 S1 R6 R7 RT-L2 finding-1 HIGH-5 Audit-Verifies-False: S4 SEC-L1 Audit-Defers: S3 S5 S8 SEC-H1 SEC-L2	2026-05-13 01:18:45 +00:00
shankar0123	709e1c9292	fix(scale): close BUNDLE 4 — migrations, scheduler HA, rate-limits, scale receipts Bundle 4 closure (2026-05-13 acquisition diligence audit). Closes the "what happens under multi-replica" question cluster: migration runner had no concurrency control + no applied-version ledger, 15 scheduler loops had per-process idempotency but no cross-replica documentation, rate limits were process-local without an operator-facing scope statement, load-test scope explicitly omitted four hot paths without linking them to a roadmap. Source findings closed: HIGH-1 + D4 + finding 4 (migration tracking) D8 (scheduler loop ownership) MED-1 + MED-2 (rate-limit scope) T9 + LOW-7 + finding 7 (load-test receipt scope) Closures by source ID: HIGH-1 + D4 + finding 4 — Migration tracking + advisory lock. internal/repository/postgres/db.go::RunMigrations now wraps every migration execution in: 1. A dedicated *sql.Conn pinned to one connection for the entire scan + apply lifecycle (pg_advisory_lock is connection-scoped). 2. pg_advisory_lock(migrationAdvisoryLockID) — fixed int64 key derived from "certctl-migrations" so the same constant resolves across deployments without colliding with operator advisory locks. Blocks the second replica until the first finishes. 3. CREATE TABLE IF NOT EXISTS schema_migrations(version TEXT PK, applied_at TIMESTAMPTZ DEFAULT NOW()) — audit ledger. 4. Skip-applied loop: SELECT version FROM schema_migrations → map[string]struct{} → skip every .up.sql whose filename is in the map. INSERT after successful execute, ON CONFLICT (version) DO NOTHING for defense in depth. Pre-Bundle-4 every server boot re-ran all 45 .up.sql files. The "idempotency via IF NOT EXISTS / ON CONFLICT" contract in CLAUDE.md held per-migration but offered no protection when two Helm replicas raced on schema DDL. Post-Bundle-4 single-replica deploys see zero behavior change beyond the audit-table population; multi-replica deploys get HA-safe schema bootstrap. D8 — Scheduler HA semantics documented. New docs/operator/scheduler-ha.md with per-loop inventory of all 15 loops in internal/scheduler/scheduler.go. Classification: - HA-safe (jobProcessorLoop, jobRetryLoop) — FOR UPDATE SKIP LOCKED via ClaimPendingJobs (Bundle 1 H-6 closure, `6cb4414`). - HA-safe-ish (jobTimeoutLoop) — atomic UPDATE-WHERE-status. - Idempotent under N>1 replicas (renewalCheckLoop, agentHealthCheckLoop, shortLivedExpiryCheckLoop, networkScanLoop, healthCheckLoop, acmeGCLoop, sessionGCLoop) — duplicate ticks produce idempotent side effects. - Side-effect-duplicating under N>1 replicas (notificationProcessLoop, notificationRetryLoop, digestLoop, cloudDiscoveryLoop, crlGenerationLoop) — duplicate webhook/email/AWS-API/CRL-signing operations. Operators running multi-replica accept N× side effects or pin to server.replicas: 1. Leader-election work tracked in WORKSPACE-ROADMAP.md as v3. MED-1 + MED-2 — Rate-limit scope. New docs/operator/rate-limit-scope.md states the contract verbatim: process-local sync.Mutex-guarded sliding-window log, effective cluster-wide cap = configured-per-replica × server.replicas, restart-safe (no persistent state, no shared store), bounded (50k/100k key cap with eviction). Five call sites documented: ocspLimiter (1m/IP), exportLimiter (1h/actor), EST per-principal (24h/CN), EST failed-auth (1h/IP), Intune dispatcher (24h/Subject+Issuer), plus the HTTP middleware token-bucket (RPS+Burst per replica). Cluster-wide shared limits via Redis or Postgres-backed bucket are tracked in WORKSPACE-ROADMAP.md as v3. T9 + LOW-7 + finding 7 — Load-test receipt scope. The existing harness at deploy/test/loadtest/ already self-documents the gap ("What it explicitly does NOT measure"). No code change needed for this finding; Bundle 4 cross-references scheduler-ha.md and rate-limit-scope.md from those gap callouts so the four deferred coverage classes (issuer connector, scheduler throughput, agent fleet, DB p99) land in the same place an acquirer reads about HA semantics and rate limits. Tests: internal/repository/postgres/migrations_test.go (new, 4 tests): - TestRunMigrations_PopulatesSchemaMigrations: audit table exists and is non-empty after the first migration run. - TestRunMigrations_SkipsAppliedOnSecondCall: second call is observable no-op on row count. - TestRunMigrations_ConcurrentCallsSerialized: two goroutines racing the migrator both return without error; row count unchanged; no duplicate versions. - TestRunMigrations_FreshDatabaseHappyPath: ≥ 30 migrations land on a fresh schema. Gated by testcontainers via the existing repo_test.go getTestDB pattern; skipped under -short. The integration lane runs them. Verification: gofmt -l # clean go vet ./internal/repository/postgres ./cmd/server # clean go build ./cmd/server ./internal/repository/postgres # clean go test -short -count=1 ./internal/repository/postgres ./internal/ratelimit # PASS Operator follow-up: full integration run on workstation: go test -count=1 ./internal/repository/postgres -run TestRunMigrations_ Receipts (paths for the audit packet): Migration runner evidence: internal/repository/postgres/db.go L135-340 (advisory-lock + ledger + skip-applied loop) + internal/repository/postgres/migrations_test.go (4 tests). Scheduler loop inventory: docs/operator/scheduler-ha.md (15-loop table with HA classification per loop). Rate-limit storage matrix: docs/operator/rate-limit-scope.md. Load-test baseline: deploy/test/loadtest/README.md (already self-documenting), cross-linked from scheduler-ha.md. Remaining operator warnings (deferred, tracked in WORKSPACE-ROADMAP.md): - Leader election for the four duplicate-side-effect loops (notificationProcessLoop, notificationRetryLoop, digestLoop, cloudDiscoveryLoop, crlGenerationLoop). v3 work item. - Shared rate-limits across replicas (Redis / Postgres token bucket). v3 work item. - Issuer-connector + scheduler-throughput + agent-fleet + DB-p99 load-test coverage. Tracked separately; per-issuer Prometheus histograms already capture issuer round-trip latency in production runs. Audit-Closes: BUNDLE-4 HIGH-1 D4 D8 MED-1 MED-2 T9 LOW-7 finding-4 finding-7	2026-05-13 01:00:39 +00:00
shankar0123	f1fa311191	fix(helm): close BUNDLE 3 — Helm chart hardening + enterprise deploy Bundle 3 closure (2026-05-12 acquisition diligence audit). Closes the "chart claims production-ready but lying-fields silently break it" hazard cluster: README install command had wrong key, required secrets weren't fail-fast, external Postgres rendered the bundled StatefulSet hostname, container-only security hardening fields landed at pod scope (silently dropped by K8s API), and three advertised template surfaces (ServiceMonitor, PodDisruptionBudget, NetworkPolicy) didn't render at all even when their values.yaml toggles were on. Source findings closed: C2 C3 D1 D2 D3 D5 D7 D11 D12 (repo audit) OPS-L1 OPS-L2 (cowork audit) Source findings explicitly deferred (tracked in WORKSPACE-ROADMAP.md): D6 OPS-H1 (backup automation — operator must choose target storage) D10 (digest pinning of latest `:latest` tags) OPS-M1 (prometheus/client_golang migration) OPS-M2 (distributed tracing instrumentation) Chart truth table (rendered with helm 3.16.3): -f values.yaml + tls.existingSecret + auth.apiKey + pg.auth.password → 12 resources (default mode, no monitoring/PDB/networkpolicy) + postgresql.enabled=false + externalDatabase.url=… → NO StatefulSet, NO postgres-secret, NO postgres-service (D2) + server.tls.certManager.enabled=true → +1 Certificate (cert-manager mode) + replicas=3 + monitoring.enabled=true + serviceMonitor.enabled=true + podDisruptionBudget.enabled=true + networkPolicy.enabled=true → +1 ServiceMonitor + 1 PodDisruptionBudget + 1 NetworkPolicy (D5+D11) tls.existingSecret AND tls.certManager.enabled both set → REFUSED with "EXACTLY ONE TLS ownership path" error (D7) Missing required secrets (apiKey / pg password / external URL) → REFUSED at template time with operator-actionable guidance (D1) Closures by source ID: C2 — README Helm install example fixed. Was `--set postgresql.password=…` (does not exist); now `--set postgresql.auth.password=…` matching the chart key. README install block also wires TLS, mentions fail-fast at template time, and links the external-Postgres example. C3 — Kubernetes Secrets connector annotated PREVIEW in values.yaml. The chart still exposes `kubernetesSecrets.enabled` for the RBAC preview wiring, but the values block now states clearly that the production K8s client at internal/connector/target/k8ssecret/ k8ssecret.go::realK8sClient is a stub (verified — go.mod imports zero k8s.io/client-go packages). Production landing tracked in WORKSPACE-ROADMAP.md. D1 — `certctl.requiredSecrets` template helper. Fail-fasts at render time when (a) server.auth.type=api-key + apiKey empty, (b) postgresql.enabled=true + pg.auth.password empty, (c) postgresql.enabled=false + externalDatabase.url + legacy env CERTCTL_DATABASE_URL all empty. Each branch emits an operator-actionable diagnostic with the openssl rand command or values override needed. postgres-secret template additionally uses Helm's `required` builtin so it can't render with the empty fallback that pre-Bundle-3 produced ("changeme" literal). D2 — externalDatabase.url first-class. New top-level values block. certctl.databaseURL helper now branches on postgresql.enabled: bundled path uses the helper-emitted in-cluster URL; external path uses externalDatabase.url verbatim. postgres-secret, postgres-statefulset, and postgres-service ALL gate on postgresql.enabled — external mode renders ZERO postgres-* resources. POSTGRES_PASSWORD env in server-deployment also gates. D3 — Container-vs-pod security context split. K8s API silently drops readOnlyRootFilesystem / allowPrivilegeEscalation / capabilities / privileged when they land at pod scope (`spec.securityContext`); they only work at container scope (`spec.containers[].securityContext`). Pre-Bundle-3 all fields sat at pod scope so the chart's documented "read-only rootfs + drop-all caps" hardening was effectively unenforced. New certctl.podSecurityContext + containerSecurityContext helpers split the operator-facing securityContext map by field-name whitelist so existing values keep working byte-for-byte while fields render at the K8s-valid scope. Applied to both server-deployment.yaml and agent-daemonset.yaml (DaemonSet + Deployment branches). D5 — Prometheus ServiceMonitor template. New templates/servicemonitor.yaml. Renders when monitoring.enabled AND monitoring.serviceMonitor.enabled. Scrapes /api/v1/metrics/prometheus (rbac-gated on metrics.read — needs bearerTokenSecret with an API key holding that perm). values.yaml block extended with bearerTokenSecret, tlsConfig, and relabelings knobs and the operator-facing comment documenting the auth requirement. D7 — TLS both-set rejection. certctl.tls.required helper extended. Pre-Bundle-3 only the NEITHER-set case was caught; setting BOTH rendered a dangling cert-manager Certificate alongside an existing-Secret mount, two conflicting TLS sources of truth. Now refuses with "EXACTLY ONE TLS ownership path" + remediation steps for both possible operator intents. D11 — PodDisruptionBudget + NetworkPolicy templates. New templates/pdb.yaml (renders when podDisruptionBudget.enabled + server.replicas > 1) + templates/networkpolicy.yaml (renders when networkPolicy.enabled). PDB uses minAvailable / maxUnavailable exclusivity per K8s spec. NetworkPolicy default-allows in-namespace agent → server traffic, kube-DNS egress, and bundled-postgres egress (when postgresql.enabled), with operator-extensible extraIngress / extraEgress for CA / OIDC / SMTP egress. Both default off so existing deploys don't lose network reach unannounced. D12 — Database max-conn config wired. Pre-Bundle-3 internal/repository/postgres/db.go::NewDB hard-coded SetMaxOpenConns(25). config.go loaded CERTCTL_DATABASE_MAX_CONNS, Validate() enforced the >= 1 floor, values.yaml documented it, and docs/reference/configuration.md surfaced it — but the pool ignored every operator setting. New NewDBWithMaxConns threads the operator value into the pool with maxIdle = maxOpen / 5 (≥ 1) so the historical ratio carries forward. cmd/server/main.go calls the new constructor; NewDB stays for compat at the default 25. OPS-L1 — Chart version 0.1.0 → 1.0.0. Chart has shipped through 8 audit closures since 2026-02 (M-018, U-1, U-2, U-3, H-1, G-1, B1, B2); pre-1.0 version was implying instability the chart no longer has. OPS-L2 — External-Postgres path is now properly documented in values.yaml (externalDatabase block with mode-2 example), README install command links the existing examples/values-external-db.yaml, and the chart truth table above proves the external mode renders cleanly. Receipts: helm lint deploy/helm/certctl/ # clean helm template c deploy/helm/certctl/ \ --set server.tls.existingSecret=ci \ --set postgresql.auth.password=p \ --set server.auth.apiKey=k # 12 kinds, default helm template c deploy/helm/certctl/ \ --set server.tls.existingSecret=ci \ --set postgresql.enabled=false \ --set externalDatabase.url='postgres://u:p@h:5432/db?sslmode=require' \ --set server.auth.apiKey=k # 9 kinds, no postgres-* helm template c deploy/helm/certctl/ \ --set server.tls.certManager.enabled=true \ --set server.tls.certManager.issuerRef.name=letsencrypt \ --set postgresql.auth.password=p --set server.auth.apiKey=k # +1 Certificate (cert-manager) helm template c deploy/helm/certctl/ \ --set server.tls.existingSecret=ci \ --set postgresql.auth.password=p --set server.auth.apiKey=k \ --set server.replicas=3 \ --set monitoring.enabled=true \ --set monitoring.serviceMonitor.enabled=true \ --set podDisruptionBudget.enabled=true \ --set networkPolicy.enabled=true # +ServiceMonitor +PDB +NetworkPolicy (TLS both-set + missing apiKey + missing pg password + missing extDb URL all REFUSED.) gofmt -l # clean go vet ./internal/repository/postgres ./cmd/server # clean go build ./cmd/server # clean bash scripts/ci-guards/B3-helm-chart-coherence.sh # clean Remaining operator warnings (deferred, tracked in WORKSPACE-ROADMAP.md): - Backup CronJob + restore script (D6 + OPS-H1): operator chooses target (S3, GCS, Azure Blob, NFS). Sample CronJob yaml may ship in deploy/helm/examples/ once an operator workstation has run one full backup-restore cycle. - Distributed tracing (OPS-M2): otel/* are go.mod indirect deps, not actively instrumented. Adding spans is a v3 work item. - Prometheus client_golang migration (OPS-M1): the hand-rolled /metrics/prometheus exposition format works today; client_golang migration unlocks histograms + exemplars + native label sets. Audit-Closes: BUNDLE-3 C2 C3 D1 D2 D3 D5 D7 D11 D12 OPS-L1 OPS-L2 Audit-Defers: D6 D10 OPS-H1 OPS-M1 OPS-M2	2026-05-13 00:40:42 +00:00
shankar0123	d030c26914	fix(security): close BUNDLE 2 — safe first run, demo mode, agent bootstrap Bundle 2 closure (2026-05-12 acquisition diligence audit). Closes the "docker compose up == accidental production" hazard: pre-Bundle-2 the base deploy/docker-compose.yml WAS the demo path (AUTH_TYPE=none + DEMO_MODE_ACK=true + KEYGEN_MODE=server + DEMO_SEED=true + literal change-me-... placeholder creds), the README claimed "drop the demo overlay for a clean install", and ENVIRONMENTS.md table documented auth-type default as api-key — three contradictory stories layered on the same compose file. Source findings closed: R2 R3 C1 D9 finding-2 S9 (repo audit) SEC-H2 SEC-M1 SEC-M3 OPS-M3 LOW-5 HIGH-6 (cowork audit) Compose split (deploy/docker-compose.yml + deploy/docker-compose.demo.yml): The base now ships production-shaped — no AUTH_TYPE override, no KEYGEN_MODE override, no DEMO_MODE_ACK, no DEMO_SEED, no literal placeholder fallbacks. POSTGRES_PASSWORD / CERTCTL_AUTH_SECRET / CERTCTL_CONFIG_ENCRYPTION_KEY / CERTCTL_API_KEY / CERTCTL_AGENT_ID must come from deploy/.env (sample template in deploy/.env.example + root .env.example). The demo overlay carries the full demo posture (every env var + every placeholder credential) so the `-f docker-compose.demo.yml` one-flag flip remains a zero-config populated-dashboard path. Fail-closed startup guards (internal/config/config.go::Validate): Three new gates layered on the existing HIGH-12 demo-mode listen-bind guard. All three exempt CERTCTL_DEMO_MODE_ACK=true so the demo overlay keeps working: • HIGH-6: AUTH_SECRET = "change-me-in-production" → refuse • HIGH-6: CONFIG_ENCRYPTION_KEY = "change-me-32-char..." → refuse • LOW-5: CORS_ORIGINS contains "" (CWE-942 + CWE-352) → refuse Visible DEMO MODE banner (cmd/server/main.go): every boot under DEMO_MODE_ACK=true now emits a prominent WARN line with a 6-step production-promotion checklist. The 2026-04-19 incident (a screenshot run that kept running for three days) drove this; the per-startup banner makes the posture unmissable in any log scraper. Agent enrollment doc alignment: • docs/reference/configuration.md L83: corrected the non-existent URL `POST /api/v1/agents/register` to the real route `POST /api/v1/agents`; added the bootstrap-token note and the install-agent.sh handoff sequence. • docs/reference/architecture.md L154: replaced "agents register themselves at first heartbeat" (false — cmd/agent/main.go fail- fasts when CERTCTL_AGENT_ID is unset) with the actual two-step operator-driven flow (REST or GUI registration first, returned ID fed to install-agent.sh second). Tests + CI guard: • 9 new TestValidate_Bundle2_ cases in internal/config/config_test.go covering: placeholder-secret refused + demo-ack exempt; placeholder encryption-key refused + demo-ack exempt; real key not mistaken for placeholder; wildcard CORS refused + demo-ack exempt; wildcard mixed into a concrete allowlist still refused; concrete allowlist accepted. • scripts/ci-guards/B2-compose-base-no-demo-env.sh: greps the base compose for any of the demo-mode env vars + placeholder credentials. Comments stripped before checking so the narrative header in the base file can still reference the overlay's posture in prose. Cold-DB CI smoke (.github/workflows/ci.yml::cold-db-compose-smoke): Switched to layering -f docker-compose.demo.yml on top of the base — the new production base requires real env vars the smoke doesn't have, and the smoke's purpose (catch migration-on-cold-DB regressions + the bootstrap-token mint path) is orthogonal to which auth posture the boot lands in. Receipts: • Current first-run truth table compose flag → posture -f docker-compose.yml (production) → requires .env; fail-fasts on missing AUTH_SECRET / CONFIG_ENCRYPTION _KEY / POSTGRES _PASSWORD; agent fail-fasts on missing AGENT_ID -f docker-compose.yml -f docker-compose.demo.yml (demo) → zero-config; AUTH_TYPE=none + DEMO_MODE_ACK=true + KEYGEN=server + DEMO_SEED=true; boot banner WARN -f docker-compose.yml -f docker-compose.dev.yml (dev) → base + PgAdmin + debug logging -f docker-compose.test.yml (test, standalone) → production-shape posture, real CA backends • Verification (PATH=/tmp/go/bin export GO* paths to /tmp): gofmt -l # clean (no diffs) go vet ./internal/config ./cmd/server # clean go test -short -count=1 ./internal/config/... # PASS (cumulative + all 9 new Bundle 2 cases green) go test -short -count=1 # PASS (no regression ./internal/connector/target/configcheck in the Bundle 1 - closure tests) go build ./cmd/server ./cmd/agent # clean ./cmd/cli ./cmd/mcp-server bash scripts/ci-guards/B2-compose-base-no-demo-env.sh # clean bash scripts/ci-guards/H-1-encryption-key-min-length.sh # clean bash scripts/ci-guards/G-3-env-docs-drift.sh # clean Remaining operator warnings (not blocking; tracked in CLAUDE.md "Open decisions"): • The first `docker compose -f docker-compose.yml up -d` against a pre-Bundle-2 .env (placeholder values still in place) will now fail-fast. This is the intended posture but operators upgrading from v2.0.x via .env-from-old-master need to rotate before upgrading. The CHANGELOG note for the v2.1.0 release should call this out alongside Auth Bundle 2's other breaking changes. Audit-Closes: BUNDLE-2 R2 R3 C1 D9 S9 SEC-H2 SEC-M1 SEC-M3 OPS-M3 LOW-5 HIGH-6	2026-05-13 00:14:59 +00:00
shankar0123	b4ce357803	fix(security): close BUNDLE 1 — server+agent connector config validation chain Bundle 1 closure (2026-05-12 acquisition diligence audit). Closes the acquisition-blocker chain: target.edit (default r-operator grant per migrations/000029_rbac.up.sql:196) → arbitrary reload_command stored without validation → agent createTargetConnector json.Unmarshal-only → sh -c on agent host. README's 'shell injection prevention on all connector scripts' claim is now true at the chain level. Server-side: new internal/connector/target/configcheck package + a configcheck.Validate call in target.go::Create + ::Update + ::CreateTarget + ::UpdateTarget (all 4 entry points). Rejects shell metacharacters in reload_command / validate_command / restart_command for nginx, apache, haproxy, postfix/dovecot, javakeystore, ssh. Sentinel errors.Is(err, service.ErrInvalidConnectorConfig) available for handler 400 mapping. Non-shell connector types (F5, IIS, Caddy, Traefik, Envoy, cloud targets, K8s) are no-ops by design. Agent-side: defense-in-depth connector.ValidateConfig(ctx, configJSON) call in cmd/agent/main.go inserted between createTargetConnector and DeployCertificate. This catches (a) configs pre-dating the server gate, (b) encrypted-blob tampering, (c) per-connector filesystem invariants that the server can't check. F5 (S2 finding): proven docs-vs-code drift, not a security bug. The applyDefaults function never set Insecure=true; runtime default has always been Go zero-value (false → TLS verified). Three lying 'default true' comments in f5/f5.go (lines 30, 45-47, 126) rewritten to match actual code behavior. Docs (C4 + C9): README L12 + L68 narrowed — 'any CA / any server' → 'Twelve native CA connectors plus an OpenSSL adapter; fifteen native deployment-target connectors plus a proxy-agent pattern.' 'Every deploy goes through atomic-write + ...' narrowed to file-based connectors with inline link to per-target guarantee matrix. New deployment-model.md §1.6 ships a 15-target × 8-property guarantee table covering atomic write / owner-perms / SHA-256 idempotency / pre-deploy snapshot / on-failure rollback / post-deploy TLS verify / Prometheus counters / shell-injection validation — including the K8s preview honesty marker (CLAIM-H4). Tests: internal/connector/target/configcheck/configcheck_test.go covers 14 shell-injection payloads (semicolon, pipe, backtick, dollar-paren, redirect, and-chain, newline, double-quote, escape, dollar-var) × 7 shell-using connectors + benign-command acceptance + non-shell no-op behavior + empty config + malformed JSON. All pass. Verification (run from /sessions/gifted-blissful-pasteur/mnt/cowork/certctl): go fmt ./... # clean (no diffs) go vet ./... # clean (no findings) go test -short -count=1 ./internal/... ./cmd/... # 60+ packages all ok, zero FAIL Audit-Closes: BUNDLE-1 RT-C1 SEC-M4 CLAIM-M2 CLAIM-L3 Audit-Verifies-False: S2 (F5 'default insecure' was a comment lie, code was always secure)	2026-05-12 23:48:08 +00:00
shankar0123	5332de1e36	fix(ciparity): drop unused methodPathRe regex (golangci-lint cleanup) golangci-lint v2.11.4 surfaced one finding against the bundle's new code: 'var methodPathRe is unused' in internal/ciparity/surface_parity_test.go:46. The regex was leftover scaffolding from when I drafted the file as a package-router test before moving it into the stdlib-only ciparity package. The router-route scanner in this package uses its own inline regex (registerRe + muxHandleRe via scanRouterRoutes) and never reads methodPathRe. Verified clean against the two bundle packages: - golangci-lint run --timeout 5m ./internal/ciparity/... ./internal/config/... → 0 issues - gofmt -l → no output - go vet → clean - go test -short -count=1 → ciparity 0.017s, config 0.727s Audit-Closes: post-v2.1.0-anti-rot/item-2	2026-05-12 14:25:37 +00:00
shankar0123	370b772fbd	feat(ci): item-2 cross-surface contract parity (stdlib-only package) internal/ciparity/ — new stdlib-only package with four tests: 1. TestSurfaceParity_MCPToolCatalogue (HARD GATE): - Every MCP tool name conforms to certctl_<word>(_<word>)* - No duplicate names across the five tools.go files - Total tools ≥ mcpBaselineFloor (150; current count 155) Catches accidental tool deletions + naming-convention drift. 2. TestSurfaceParity_CLICommandCatalogue (INFORMATIONAL): Walks cmd/cli/main.go's switch-case dispatcher. Logs the 31 distinct verbs. Per frozen decision 0.9, warn-only until the CLI surface stabilizes. 3. TestSurfaceParity_OpenAPI_MCPHeuristicCoverage (INFORMATIONAL): Reports the fraction of OpenAPI ops whose path tokens overlap with MCP tool name tokens. Trend metric; current coverage 92%. 4. TestSurfaceParity_Summary (INFORMATIONAL): One-glance count of router routes / OpenAPI ops / MCP tools / CLI verbs. Easy eyeball for a PR reviewer. Verified in sandbox: - gofmt clean - go vet clean - go test -short -count=1: all four PASS in 0.017s Stdlib-only by design — the tests read source files with os.ReadFile + regexp + go/ast. Keeps the test runnable without pulling in the rest of the codebase's transitive deps; fast self-contained signal. Router ↔ OpenAPI parity (TestRouter_OpenAPIParity) stays in internal/api/router/openapi_parity_test.go where it already lives. This bundle does not duplicate it. Allowlist scaffold at scripts/ci-guards/surface-parity-mcp-exemptions.yaml for the day TestSurfaceParity_OpenAPI_MCP is promoted from informational to hard gate. Audit-Closes: post-v2.1.0-anti-rot/item-2	2026-05-12 14:09:32 +00:00
shankar0123	7a2ba11391	feat(ci): item-1 complete-path config-coverage guard (PARTIAL — sandbox could not verify Go test) Shell guard verified working in sandbox: - Green on clean repo: 'OK — every CERTCTL_* env var (194) has at least one non-config-package consumer.' - Red on injected orphan: '::error::Orphan env vars — defined in config.go but no consumer found outside internal/config/' with three remediation paths listed. Go test internal/config/coverage_test.go written but NOT verified — sandbox Go 1.25.9 < go.mod's 1.25.10 requirement; toolchain auto-download fails (disk full). Operator must run `make verify` from workstation before merge. Allowlist scaffold at scripts/ci-guards/complete-path-config-coverage-exceptions.yaml. Every entry requires name + justification + expires fields; expired entries fail the guard. Catches the lying-field bug class — env var defined in config.go that no business-logic code reads. The 2026-04-29 SCEP MustStaple Phase 5.6 gap (domain field shipped, service layer never read profile.MustStaple) is the canonical case this guard would have caught at commit time. Audit-Closes: post-v2.1.0-anti-rot/item-1	2026-05-12 14:02:04 +00:00
shankar0123	e772ff0d4d	fix(repo/job): split UNION ALL + FOR UPDATE into two queries (Postgres-correctness) Phase-9 docker compose smoke surfaced a latent production-breaking bug introduced by commit `0a75a30` (H-6 atomic pending-job claim). The ClaimPendingByAgentID query in internal/repository/postgres/job.go combined UNION ALL with FOR UPDATE SKIP LOCKED in a single statement. Postgres rejects this with: ERROR: FOR UPDATE is not allowed with UNION/INTERSECT/EXCEPT Every agent work-poll returns HTTP 500 in any real deployment where an agent is actually polling. From the compose log: request_id=6da47015-... GET /api/v1/agents/agent-demo-1/work status=500 duration_ms=2 The schema-per-test unit harness in internal/repository/postgres/ *_test.go never inserted jobs and polled, so the SQL execution path was never exercised. The bug has been latent in master since `0a75a30` landed. Fix: split the UNION ALL into two separate FOR UPDATE SKIP LOCKED queries within the existing transaction. The H-6 atomicity invariant (concurrent pollers never see the same Pending row) is preserved because: 1. The two queries run inside the same transaction (tx). 2. Each query independently locks its result rows with FOR UPDATE SKIP LOCKED. 3. The subsequent UPDATE that flips Pending -> Running runs in the same transaction, so the rows stay invisible to concurrent callers from initial SELECT through final COMMIT. 4. The transaction is the unit of consistency, not the single SQL statement. Two queries: - Branch 1 (direct): jobs.agent_id = + status='Pending' + type='Deployment'. ORDER BY created_at ASC, FOR UPDATE SKIP LOCKED. - Branch 2 (fallback): jobs.agent_id IS NULL + INNER JOIN deployment_targets dt ON jobs.target_id = dt.id WHERE dt.agent_id = . ORDER BY j.created_at ASC, FOR UPDATE OF j SKIP LOCKED (FOR UPDATE OF needed because the join brings in dt). Branch 3 (AwaitingCSR) is unchanged — already a single SELECT, not affected by the UNION restriction. Inline comment explains the fix's load-bearing-ness so a future refactor doesn't merge them back into one UNION query. Verify (sandbox): go vet clean; go test -short -count=1 PASS on internal/repository/postgres/. Workstation re-runs 'docker compose up' to confirm the agent's GET /work returns 200 with the next pending-deployment claim. Note: this is NOT a regression introduced by Auth Bundle 2 or the 2026-05-11 audit fixes; it's a pre-existing latent defect from H-6. Including in v2.1.0 because shipping with a broken agent work-poll would block the demo path on day one of release.	2026-05-11 16:11:33 +00:00
shankar0123	752e036dac	fix(oidc/testfixtures): set legacy KEYCLOAK_ADMIN* env vars for start-dev master-admin bootstrap Phase-10 live-IdP smoke (post-iss-param fix landing in `8f83393`) advanced 4 of 6 integration tests to green. The remaining 2 — the realm-key rotation tests — failed with: admin-cli token: HTTP 401 at the master-realm token endpoint. Root cause: Keycloak 26.x has TWO admin-bootstrap env-var pairs and the right pair depends on the launch command: - 'start' (production): KC_BOOTSTRAP_ADMIN_USERNAME + KC_BOOTSTRAP_ADMIN_PASSWORD - 'start-dev': KEYCLOAK_ADMIN + KEYCLOAK_ADMIN_PASSWORD The fixture sets KC_BOOTSTRAP_ADMIN_USERNAME + KC_BOOTSTRAP_ADMIN_PASSWORD but runs 'start-dev'. The bootstrap pair is silently ignored in dev-mode, leaving the master realm with no admin user → admin-cli token endpoint returns 401 → RotateRealmKeys can't authenticate to the Admin API. The 4 auth-code flow tests passed because they authenticate the engineer / viewer test users INSIDE the certctl realm (created by the realm import), which doesn't need a master admin. Fix: set BOTH pairs as belt-and-braces. The legacy KEYCLOAK_ADMIN pair covers start-dev today; the KC_BOOTSTRAP_ADMIN_* pair keeps a future flip to 'start' working. Inline comment in the fixture explains the why so a future reader doesn't drop one back. Verify (sandbox): go vet -tags=integration clean; gofmt clean. Workstation re-runs 'make keycloak-integration-test' to confirm the 2 rotation tests now reach + execute the Admin API successfully.	2026-05-11 15:49:25 +00:00
shankar0123	8f8339393c	fix(oidc/integration): pass fx.IssuerURL as callbackIss arg in 7 HandleCallback call sites Phase-10 live-IdP smoke (post-Enabled-true fix landing in `2d29175`) surfaced the next layer: 5 of 6 testcontainers-Keycloak integration tests failed with 'oidc: provider advertises iss-parameter support but callback omitted it'. Root cause: Keycloak's discovery doc advertises authorization_response_iss_parameter_supported=true. The Audit 2026-05-10 MED-17 closure (RFC 9207) gates the callback path: when the IdP advertises iss-param support, HandleCallback requires a non-empty callbackIss arg that matches the provider's IssuerURL, else ErrIssParamMissing. The 7 HandleCallback call sites in the integration tests were passing '' for the callbackIss arg — the synthetic test code never simulated the real browser's '?iss=<issuer>' query param. Fix: replace '' with fx.IssuerURL at all 7 sites: - integration_keycloak_test.go: 5 sites (TestKeycloakIntegration_AuthCodeFlow_HappyPath, TestKeycloakIntegration_LogoutRevokesSession, TestKeycloakIntegration_JWKSRotation_RefreshKeysPicksUpNewKey pre+post HandleCallback, TestKeycloakIntegration_UnmappedGroupsFailsClosed) - integration_keycloak_rotate_test.go: 2 sites (TestKeycloakIntegration_MED6_AutoRefreshOnKidMiss pre+post) Inline note on the first site explains the rationale so future test-writers don't drop back to ''. Verify (sandbox): go vet -tags=integration ./internal/auth/oidc/... clean; gofmt clean; grep for remaining empty-iss callsites returns 0 matches. Workstation re-runs 'make keycloak-integration-test' to confirm the 5 affected tests advance past the iss-param check against a real Keycloak 26.x.	2026-05-11 15:44:39 +00:00
shankar0123	2d29175b52	fix(oidc/testfixtures): set Enabled=true on Keycloak integration-test provider Phase-10 live-IdP smoke re-run (after the alg-downgrade relax landed in `92c50d9`) surfaced the next layer: 5 of 6 testcontainers-Keycloak integration tests failed with 'oidc: provider is disabled'. Root cause: the OIDCProvider struct literal in internal/auth/oidc/testfixtures/keycloak.go omits the Enabled field. Enabled was added by Audit 2026-05-11 MED-9 (Bundle 2 Fix 13 Phase B); pre-fix the field didn't exist and HandleAuthRequest always proceeded. Post-fix the default zero-value false gates every integration test behind ErrProviderDisabled at service.go L478. Fix: add Enabled: true to the struct literal + inline comment explaining why the field is required for integration tests. The check is the right behavior for production (operator-driven disable kill-switch); just needed to be reflected in the testfixture. Verify (sandbox): go vet -tags=integration ./internal/auth/oidc/... clean. Workstation re-runs 'make keycloak-integration-test' to confirm the 5 affected tests now pass against a real Keycloak 26.x.	2026-05-11 15:39:07 +00:00
shankar0123	92c50d9e19	harden(oidc): relax alg-downgrade IdP-bind check to intersection-empty (Keycloak compat) Phase-10 live-IdP smoke (Keycloak 26.x via testcontainers-go) revealed the IdP-bind alg-downgrade check was too strict for real-world IdPs. 6 of the integration tests in internal/auth/oidc/integration_keycloak_test.go were failing with: oidc: IdP advertises weak signing algorithms (HS/none); refusing to use as defense against downgrade attacks: HS256 Keycloak 26.x (and several other real-world IdPs — Auth0 when HS-mode is enabled, some Authentik configs) advertise EVERY alg they're capable of in the discovery doc's id_token_signing_alg_values_supported field, even when the realm only signs with RS256 in practice. Pre-fix the IdP-bind check refused on ANY HS* or 'none' advertisement → no real Keycloak deploy could ever bind a provider row, hence the integration-test failures. The strict-deny check was defense-in-depth on top of the load-bearing per-token alg-pin at sig-verify time (isDisallowedAlg, service.go L1177): that check rejects every ID token whose JWS header carries an alg outside DefaultAllowedAlgs, regardless of what the discovery doc advertises. A forged HS256 token signed with the IdP's RS256 pubkey as HMAC secret is rejected at sig-verify time → the actual algorithm-confusion attack is closed by the per-token pin, NOT by the discovery-doc check. Fix: relax the IdP-bind check to refuse only when the intersection of advertised vs DefaultAllowedAlgs is EMPTY (the pathological all-weak-alg IdP case). Keycloak (RS256 + HS256 advertised) now binds successfully; an HS-only IdP still fails closed. Changes: - internal/auth/oidc/service.go: rewrite the alg-check loop at L1067 in getOrLoad / RefreshKeys to compute the intersection set; refuse only when no acceptable alg is advertised. ErrIdPDowngradeAdvertised docstring updated to reflect new contract. DefaultAllowedAlgs docstring + the package-level design-comment block at L40-72 updated with v2.1.0-relaxed semantics callouts. - internal/auth/oidc/test_discovery.go: TestDiscovery dry-run validator rewritten to surface HS/none alongside RS as an informational note ('note: IdP advertises weak algorithms %v alongside acceptable ones') rather than a hard-fail error. HS-only / none-only still hard-fails. - internal/auth/oidc/service_test.go: TestService_IdPDowngradeDefense_* tests updated. Renamed: - RejectsHSAdvertised → RS256PlusHS256_BindsSuccessfully (positive) - RejectsNoneAdvertised → RejectsHSOnlyAdvertised (intersection-empty) - RefreshKeys_CatchesPostLoadDowngrade rotated to HS-only post-load - internal/auth/oidc/coverage_fill_test.go: TestTestDiscovery_AlgDowngradeDetected split into _HS256AlongsideRS256_BindsWithNote (positive, asserts note but no hard-fail) + _HSOnly_StillTrips_HardFail (intersection-empty). - docs/operator/auth-threat-model.md: OIDC token-validation alg-allow-list section rewritten to call out the load-bearing-defense hierarchy (per-token pin first, IdP-bind check defense-in-depth) and document the v2.1.0 relaxation rationale. - CHANGELOG.md: ### Security entry under Unreleased. Verify: go test ./internal/auth/oidc/ -short PASS; gofmt clean; go vet clean. The Keycloak integration tests should now pass when the operator re-runs 'make keycloak-integration-test'.	2026-05-11 15:34:59 +00:00
shankar0123	7227844f29	test(coverage): backfill 5 packages to clear v2.1.0 release-gate Phase 3 floors Phase 3 of /Users/shankar/Desktop/cowork/v2.1.0-release-gate.md surfaced four packages below their coverage floors. All four are regressions from new code shipped in the audit-2026-05-10/11 fix bundles that didn't get per-function tests: internal/auth/breakglass 87.5% -> 93.3% (floor: 90%) + List (was 0%) — 3 tests (disabled, empty+populated, repo err) + RemoveCredential, Unlock disabled-branch tests internal/auth/oidc 89.4% -> 95.4% (floor: 90%) + JWKSStatus (was 0%) — 2 tests (unknown provider, after AuthRequest) + TestDiscovery (was 0%) — 5 tests (discovery failure, happy path, HS256 alg-downgrade detected, missing jwks_uri, JWKS 500 fetch) internal/auth/session 89.9% -> 94.4% (floor: 90%) + SetTrustedProxies (was 0%) — round-trip + clear + ComputeCookieHMAC (was 0%) — determinism + key/inputs differ + DecryptKeyMaterial (was 0%) — round-trip + wrong-passphrase internal/api/handler 73.2% -> 75.5% (floor: 75%) + 6 auth_breakglass handler funcs (were all 0%) — 14 tests (disabled/404, invalid JSON, empty fields, service err, happy path with cookies, admin endpoints, ListCredentials no password_hash on the wire) + WithPermissionChecker setter test (was 0%, Bundle 2 MED-2) + NewAdminCRLCacheServiceImpl + CacheRows (were 0%) — 3 tests + itoaForRetryAfter + challengeURLBuilder ACME helpers (were 0%) — 4 tests All five coverage gates green: internal/service 72.7% (floor: 70%) internal/api/handler 75.5% (floor: 75%) internal/api/middleware 67.9% (floor: 30%) internal/auth 93.3% (floor: 85%) internal/service/auth 91.8% (floor: 85%) internal/auth/oidc 95.4% (floor: 90%) internal/auth/oidc/groupclaim 100.0% (floor: 95%) internal/auth/oidc/domain 97.6% (floor: 90%) internal/auth/session 94.4% (floor: 90%) internal/auth/session/domain 98.3% (floor: 90%) internal/auth/breakglass 93.3% (floor: 90%) internal/auth/breakglass/domain 100.0% (floor: 90%) internal/auth/user/domain 96.2% (floor: 90%) (and 6 more — all green) Per CLAUDE.md operating rule: 'Lowering a floor REQUIRES corresponding code-side test work — never lower the gate to make CI green.' The floors stay at their committed values; the new tests close the gap.	2026-05-11 14:12:11 +00:00
shankar0123	4d859468ab	chore(lint): close 5 golangci-lint v2 findings surfaced by v2.1.0 release-gate Phase 1.3 Five golangci-lint v2 findings surfaced when running the v2.1.0 release gate (auth-bundle-2 → master pre-flight). Each is mechanical: 1. govet/printf-style misuse — internal/auth/oidc/service_test.go used integer literal 501 in http.Error; switched to http.StatusNotImplemented. 2. staticcheck SA1019 — internal/auth/breakglass/reflect_helper_test.go referenced reflect.Ptr; the canonical name since Go 1.18 is reflect.Pointer. 3. staticcheck ST1020 — internal/repository/postgres/auth.go ActorRoleRepository.Revoke had a doc comment that did not begin with the method name. Prepended 'Revoke drops actor_roles rows.' to the comment so it now starts with the method name. 4. staticcheck ST1022 — internal/api/handler/auth_session_oidc.go DefaultBCLVerifierMaxAge docstring was attached to the DefaultBCLVerifier type docstring. Moved the const docstring directly above the const declaration, separated by a blank line. 5. unused — internal/auth/session/bench_test.go declared benchSessionMinSamples and never referenced it; the bench loop relies on Go's default b.N scaling. Replaced the const block with a comment describing the rationale. Lint clean (golangci-lint v2.12.2 with the .golangci.yml config) on the five edited packages.	2026-05-11 13:31:13 +00:00
shankar0123	b8c1bf3617	chore(fmt): gofmt cleanup on three pre-bundle drift files surfaced by v2.1.0 release-gate Phase 1 Phase 1 (make verify) of cowork/v2.1.0-release-gate.md surfaced three files with pre-existing gofmt drift that pre-dated the 2026-05-11 fix bundle work: internal/auth/oidc/domain/types.go internal/auth/oidc/integration_keycloak_rotate_test.go internal/auth/oidc/test_discovery.go The 2026-05-11 Fix 08 fmt-cleanup commit (`b3e3a8d`) fixed four files that the merge introduced; these three were noted as pre-existing master drift and intentionally left untouched at the time. The v2.1.0 release-gate spec's Phase 1 requires zero gofmt output from 'go fmt ./...' (Makefile::verify form), so the drift must close before tagging. Pure whitespace alignment, no semantic change.	2026-05-11 13:18:25 +00:00
shankar0123	4f7cf63ae5	Merge Fix 13 (HIGH-2 fourth call site): CSRF rotation on Logout # Conflicts: # CHANGELOG.md	2026-05-11 13:01:56 +00:00
shankar0123	f50c68e199	harden(auth/sessions): CSRF rotation on logout closes HIGH-2 fourth call site Audit 2026-05-11 Fix 13 closure. The HIGH-2 closure on dev/auth-bundle-2 documented four RotateCSRFTokenForActor call sites — login completion (fresh by construction), Assign/Revoke RoleToKey (wired at internal/api/handler/auth.go:498 + 546), Logout, and an explicit operator endpoint. The 2026-05-11 adversarial review observed only 3 of the 4: Logout did NOT rotate the actor's sibling sessions post-revoke. Threat closed: a token captured pre-logout (browser DevTools, malicious extension, session-storage leak) could be replayed against the user's other-device/other-browser sessions until those sessions hit their own idle/absolute expiry. Rotation on logout defeats this — the captured token is dead the moment the user clicks 'Sign out' anywhere. What this changes: * internal/api/handler/auth_session_oidc.go::SessionMinter interface gains RotateCSRFTokenForActor(ctx, actorID, actorType string) int. Nil-safe semantics by convention — the production wiring is session.Service which already implements the method; rotation NEVER errors (returns int count, swallows per-row failures via the underlying Service.RotateCSRFToken) so it can't block the surrounding Revoke that triggered it. internal/api/handler/auth_session_oidc.go::Logout calls RotateCSRFTokenForActor after Revoke(sess.ID) succeeds. The auth.session_revoked audit row gains a csrf_rotated detail key carrying the count so SOC/SIEM can correlate logout events with CSRF churn on sibling sessions. * The no-cookie + invalid-cookie 204 short-circuit paths skip rotation. No session row exists to rotate against; the caller is already unauthenticated. Rotation on those paths would do nothing useful and pollute the audit log. Test coverage in internal/api/handler/auth_session_oidc_test.go: * TestLogout_RotatesCSRFForActor — happy path. Mocks rotateCSRFReturnCount=2; asserts Revoke fires before rotation, rotation fires exactly once with caller's (actor_id, actor_type), audit details carry csrf_rotated=2. * TestLogout_NoCookie_SkipsCSRFRotation — pins the 204 short-circuit branch when there's no cookie. Rotation count stays at 0. * TestLogout_InvalidCookie_SkipsCSRFRotation — pins the 204 short-circuit branch when Validate rejects the cookie. Same rationale: no session row, no rotation. The stubSession test fake gains RotateCSRFTokenForActor with call-recording fields; the phase5StubAudit gains a details slice append-aligned 1:1 with events so the happy-path test can index into the latest entry and assert the count. Spec Phase 3 (explicit operator endpoint) — intentionally NOT shipped. The three automatic triggers (login + role- mutation + logout) cover the HIGH-2 threat model; operators who want a nuclear option can use the existing RevokeAllForActor flow which forces re-login → fresh session → fresh CSRF. Adding a dedicated POST /api/v1/auth/sessions/ rotate-csrf admin endpoint would be defense-in-depth without new attack-surface coverage. Documented in the audit-doc annotation. Verify gate: * gofmt -l — clean * go vet ./internal/api/handler/... — clean * go build ./cmd/server/... ./internal/... — clean (production session.Service satisfies the extended interface out of the box) go test -short -count=1 ./internal/api/handler/... ./internal/auth/session/... — all green; 3 new Logout cases + the 2 pre-existing Logout cases all pass. Audit doc annotation at cowork/auth-bundles-audit-2026-05-10.md flips the HIGH-2 row from 'CLOSED 2026-05-10 (3/4 call sites wired)' to 'A-B-3 verified 2026-05-11: HIGH-2 fully closed across all four documented call sites.' Refs cowork/auth-bundles-fixes-2026-05-11/13-verify-logout-csrf-rotation.md.	2026-05-11 12:24:41 +00:00
shankar0123	ff3f1cd864	harden(auth): demo-mode residual-grants detector + cleanup endpoint + CI guard (A-8) Audit 2026-05-11 A-8 closure. Closes the deferred Phase 2 leg of the 2026-05-10 HIGH-12 closure (`b81588e`) — production-startup observability for actor-demo-anon residual grants + CI guard banning new synthetic- admin code paths. What this changes: * cmd/server/preflight_demo_residual.go (new) runs after the DB pool + audit service are constructed and before the HTTPS listener starts. Under any non-'none' auth type it queries actor_roles for the synthetic actor-demo-anon and emits a WARN log + a categorized audit row (auth.demo_residual_grants_detected) listing every grant present. Migration 000029 unconditionally seeds the ar-demo-anon-admin row at install time, so EVERY production deploy will see this WARN on first boot; the intended cutover workflow is cleanup-once at production handover. * CERTCTL_DEMO_MODE_RESIDUAL_STRICT (new env var on AuthConfig, default false) pivots the WARN to fail-closed startup refusal for operators who want a paranoid posture against re-seeding. * POST /api/v1/auth/demo-residual/cleanup (new handler at internal/api/handler/demo_residual.go) is an admin-class (auth.role.assign) endpoint that removes every actor-demo-anon row from actor_roles and returns {removed: int64}. Idempotent; refuses 503 under Auth.Type=none (deleting the row would break the demo path); audit-logs every invocation including no-op zero-removed calls so the admin's action is always recorded. * scripts/ci-guards/no-new-synthetic-admin.sh pins the 17-entry allowlist of source files that legitimately reference the actor-demo-anon literal. New runtime code paths that resolve to the synthetic actor (the same pattern that produced the original CRIT class) are rejected at PR time. CI workflow auto-picks the script via the existing scripts/ci-guards/.sh loop in .github/workflows/ ci.yml; no workflow edit needed. Regression matrix: cmd/server/preflight_demo_residual_test.go — 7 tests covering the 4 main behaviour branches (testcontainers-backed, testing.Short()- skipped: DemoModeActive_Skips, NoResidue_Passes, HasResidue_LogsAnd Audits, StrictMode_RefusesStartup, DeleteDemoAnonResidue_Idempotent) plus 3 pure-Go stdlib unit tests for the row-string formatter + nil-safety contracts on both helpers. * internal/api/handler/demo_residual_test.go — 7 stdlib+httptest cases: HappyPath, Idempotent_ReturnsZero, RejectsInDemoMode (503), CleanupError_Surfaces500, NilCleanupFn (defensive 500), NilAuditWriter_DoesNotPanic, MissingActorContext (falls back to 'unknown' actor in the audit row). * internal/api/router/openapi_parity_test.go — new POST /api/v1/auth/demo-residual/cleanup entry plus 6 pre-existing pre-A-8 entries (oidc/test, jwks-status, users CRUD, runtime-config) that had drifted out of SpecParityExceptions; the parity test was red on dev/auth-bundle-2 before my work; this commit returns it to green with full per-entry justifications + parity-debt notes. Docs: * docs/operator/security.md — new 'Demo-to-production cutover (Audit 2026-05-11 A-8)' section explaining the WARN message, the cleanup curl one-liner, the equivalent SQL, the strict-mode env var, and the CI guard. * docs/operator/rbac.md — Last-reviewed bump + pointer to the new env var + the security.md section. * cowork/auth-bundles-audit-2026-05-10.md — HIGH-12 row gains an 'A-8 follow-on CLOSED 2026-05-11' annotation describing the deferred Phase 2 leg now landed. * CHANGELOG.md — Unreleased ### Security entry summarizing the four legs (detector + cleanup + strict-mode flag + CI guard) and the acquisition-readiness narrative this closes. Operator-facing impact: this closes a credibility gap, not an exploitable vulnerability. The residue requires a regression elsewhere in the middleware chain to be exploitable. After this fix, the canonical narrative ('RBAC primitive with no synthetic- admin fallback') is fully true. Refs cowork/auth-bundles-fixes-2026-05-11/08-high-demo-mode-residual- cleanup.md.	2026-05-11 11:45:54 +00:00
shankar0123	b3e3a8dbb1	chore(fmt): gofmt cleanup on files touched by audit-2026-05-11 fix bundle Whitespace alignment drift surfaced by gofmt -l after merging 7 fix branches. Pure formatting, no semantic change. Pre-existing master drift in internal/auth/oidc/{domain/types.go, integration_keycloak_rotate_test.go, test_discovery.go} left untouched — that's separate tech debt.	2026-05-11 11:29:48 +00:00
shankar0123	e6ae81f478	Merge Fix 06 (HIGH A-6): strict UA/IP binding — close request-empty bypass in MED-16 # Conflicts: # CHANGELOG.md # internal/api/handler/auth_session_oidc.go # internal/api/handler/auth_session_oidc_test.go	2026-05-11 11:19:04 +00:00
shankar0123	ff454174df	Merge Fix 04 (HIGH A-4): scope-aware ActorRole revoke	2026-05-11 11:16:24 +00:00
shankar0123	fbcac8d193	Merge Fix 02 (CRIT A-2): close MED-11 lying field — DeactivatedAt loaded + enforced on login	2026-05-11 11:16:07 +00:00
shankar0123	e0ac659d5e	harden(oidc): strict UA/IP binding (A-6) — close request-empty bypass in MED-16 The MED-16 closure (`cb73547`) added the RFC 9700 §4.7.1 pre-login UA/IP binding but the consume-side compare at internal/auth/oidc/service.go was gated by: if s.preLoginRequireUA && storedUA != "" && userAgent != "" { ... constant-time compare ... } if s.preLoginRequireIP && storedIP != "" && ip != "" { ... constant-time compare ... } The `userAgent != ""` and `ip != ""` arms were intended as rolling-deploy / headless-proxy compat ("if the request didn't supply a value, don't try to compare against nothing"). They achieve that — and they ALSO short-circuit the compare whenever the attacker controls the request side, which is always at /auth/oidc/callback. Threat model: 1. Attacker acquires a pre-login cookie (HMAC-protected; requires RNG break OR transit leak — not implausible, that's why the binding exists in the first place). 2. Attacker replays the cookie at /auth/oidc/callback from their own user-agent. 3. Attacker OMITS the User-Agent header. curl doesn't send one by default. Many programmatic HTTP clients omit it. Pre-A-6, step 3 trivially bypassed the binding check. The whole RFC 9700 §4.7.1 defense was theatre against the realistic threat — silent-allow when the attacker abandons the header they don't want checked. Fix: flipped to strict-when-stored. When the pre-login row carries a binding value (storedUA != "" or storedIP != ""), the request MUST present a matching value. An empty request side with a non-empty stored side now rejects with two new sentinels: ErrPreLoginUAMissing — request omitted User-Agent header ErrPreLoginIPMissing — request had no resolvable client IP Distinguished from the existing Mismatch sentinels so the audit row can tell apart "binding violation" (operator mis-configured the proxy) from "missing-header bypass attempt" (active exploit indicator). The handler-side classifyOIDCFailure adds typed errors.Is dispatch: ErrPreLoginUAMissing → "prelogin_ua_missing" ErrPreLoginIPMissing → "prelogin_ip_missing" SIEM rules can now alert specifically on the bypass-attempt category distinctly from operator config drift. Legacy-row compat preserved: pre-migration rows where storedUA == "" / storedIP == "" still pass through unchecked. That window is bounded by the 10-minute pre-login TTL — within 10 minutes of the MED-16 deploy every legacy row has expired and the strict path is universal. Operator escape hatches preserved: CERTCTL_OIDC_PRELOGIN_REQUIRE_UA=false (symmetric for IP) bypasses both the Mismatch AND the new Missing reject paths. Required for environments where a proxy strips the User-Agent header in transit (rare but documented in the operator advisory). Regression coverage: service_test.go (5 new tests under `Audit 2026-05-11 A-6 — strict-when-stored` block): TestService_HandleCallback_MED16_A6_UAStoredButRequestEmpty_Rejects — the load-bearing bypass-closure leg TestService_HandleCallback_MED16_A6_IPStoredButRequestEmpty_Rejects — symmetric for IP TestService_HandleCallback_MED16_A6_LegacyRowEmptyStoredStillPasses — legacy-row compat preserved TestService_HandleCallback_MED16_A6_ToggleOff_AllowsBypass — UA toggle off allows the bypass (operator escape hatch) TestService_HandleCallback_MED16_A6_ToggleOff_IP_AllowsBypass — IP toggle off allows the bypass auth_session_oidc_test.go::TestClassifyOIDCFailure extended: ErrPreLoginUAMismatch → prelogin_ua_mismatch (new explicit pin) ErrPreLoginIPMismatch → prelogin_ip_mismatch (new explicit pin) ErrPreLoginUAMissing → prelogin_ua_missing ErrPreLoginIPMissing → prelogin_ip_missing fmt.Errorf wrapped variants of the Missing sentinels round-trip through errors.Is (defense against future context-wrapping in the service layer) Verify gate green: gofmt clean, go vet clean, all 10 MED-16 tests + extended TestClassifyOIDCFailure pass; full short-mode test run across internal/auth/oidc + internal/api/handler also green. Spec at cowork/auth-bundles-fixes-2026-05-11/06-high-prelogin-ua-strict-mode.md. Audit doc: MED-16 row in cowork/auth-bundles-audit-2026-05-10.md appended with the A-6 follow-up closure annotation; status table row updated to "CLOSED + A-6 follow-up CLOSED 2026-05-11". Operator advisory in CHANGELOG.md v2.1.0 release notes covers the two operator-visible behaviour changes: (1) callback requests without User-Agent now reject when a binding was stored, and (2) the CERTCTL_OIDC_PRELOGIN_REQUIRE_UA=false escape hatch is the documented path for environments where the proxy strips the header.	2026-05-11 11:03:31 +00:00
shankar0123	ddad647ee7	fix(auth/rbac): scope-aware ActorRole revoke (A-4) HIGH-10's UNIQUE (actor, role, scope_type, scope_id, tenant) uniqueness extension lets an operator grant the same role to the same actor at multiple scopes (e.g. r-operator on profile=p-acme AND profile=p-globex). But ActorRoleRepository.Revoke's WHERE clause omitted (scope_type, scope_id) — a single call deleted every variant. Selective revoke was unrepresentable; operators had to drop all and re-grant N-1, opening a race window where the actor's access was briefly different. Closure across all layers (handler → service → repo → MCP → GUI client), preserving the legacy "revoke all variants" contract for unmodified callers: internal/repository/auth.go - New ActorRoleRevokeOptions struct. Zero value = legacy semantic; non-empty ScopeType narrows to one variant. - New ErrActorRoleNotFound sentinel for scoped no-match (HTTP 404). internal/repository/postgres/auth.go - Revoke signature extended with opts. Empty opts.ScopeType uses the legacy SQL (no scope WHERE), zero-row delete = no error. - Non-empty narrows with `scope_type = $5 AND scope_id IS NOT DISTINCT FROM $6` — the IS-NOT-DISTINCT-FROM is load-bearing, vanilla `=` would silently miss the (global, NULL) case because NULL ≠ NULL in standard SQL. - Selective revoke with zero matching rows returns ErrActorRoleNotFound; operators get feedback on typos. internal/service/auth/actor_role_service.go - Revoke takes opts. Audit row's details map records the scope so SIEMs can distinguish wide-vs-selective revokes: `scope: "all_variants"` for the legacy path, or `scope_type` + `scope_id` for selective. Privilege check (auth.role.assign) and reserved-actor guard unchanged. internal/api/handler/auth.go - RevokeRoleFromKey parses optional `?scope_type=` / `?scope_id=` query params via new parseRevokeScope helper. - Validation mirrors AssignRoleToKey: scope_id forbidden with scope_type=global, required with profile/issuer, invalid scope_type → 400. scope_id without scope_type also → 400. - writeAuthError maps ErrActorRoleNotFound to 404. internal/mcp/tools_auth.go + types.go - AuthRevokeKeyRoleInput gains optional ScopeType + ScopeID with jsonschema descriptions explaining the dual-mode contract. - Tool call site appends URL-encoded query params when ScopeType is set; legacy callers (no scope_type) emit the bare DELETE path unchanged. web/src/api/client.ts - authRevokeKeyRole signature: optional 3rd argument `{ scope_type?, scope_id? }`. Pre-A-4 call sites (no opts arg) keep firing the bare DELETE — fully backward compatible. The GUI KeysPage's per-row revoke button (still one row per role, pre-Fix-12) continues to use the legacy shape; future GUI work can pass scope params for per-variant rows. docs/operator/rbac.md - New "Revoke: legacy 'all variants' vs scope-selective" subsection under "From the HTTP API" with curl examples for both modes plus the audit-row payload shape that lets SOC/SIEM tell them apart. Regression coverage: Repository (testcontainers, skipped under -short — 6 tests in internal/repository/postgres/auth_revoke_scope_test.go): TestRevokeActorRole_NoOpts_RemovesAllVariants TestRevokeActorRole_WithScope_RemovesOnlyMatching TestRevokeActorRole_WithGlobalScope_RemovesOnlyGlobal — pins the IS-NOT-DISTINCT-FROM branch (global, NULL) TestRevokeActorRole_NoMatch_ReturnsNotFound — pins the new sentinel TestRevokeActorRole_NoOpts_NoMatch_IsNoOp — pins the legacy idempotence contract TestRevokeActorRole_IssuerScope_RemovesOnlyMatching — pin the issuer-scope half (profile + issuer are symmetric scope types) Handler (7 new tests in auth_test.go): TestAuthHandler_RevokeRoleFromKey — extended to assert no scope filter is forwarded when query string is empty (legacy behaviour) TestAuthHandler_RevokeRoleFromKey_A4_ScopedProfile TestAuthHandler_RevokeRoleFromKey_A4_ScopedGlobal TestAuthHandler_RevokeRoleFromKey_A4_RejectsScopeIDWithGlobal TestAuthHandler_RevokeRoleFromKey_A4_RejectsMissingScopeID TestAuthHandler_RevokeRoleFromKey_A4_RejectsScopeIDWithoutScopeType TestAuthHandler_RevokeRoleFromKey_A4_RejectsInvalidScopeType TestAuthHandler_RevokeRoleFromKey_A4_ScopedNotFoundReturns404 MCP (2 new table rows in tools_per_tool_test.go): Scoped revoke with scope_type=profile + scope_id=p-acme → `?scope_type=profile&scope_id=p-acme` Scoped revoke with scope_type=global (no scope_id) → `?scope_type=global` Service-layer test plumbing (service_test.go) updated for new opts arg: 4 existing call sites pass repository.ActorRoleRevokeOptions{} to keep their pre-A-4 semantics; the fakeActorRoleRepo.Revoke implementation now mirrors the postgres scope-aware behaviour (legacy zero-value vs scoped narrowing + ErrActorRoleNotFound on no-match). Verify gate green: gofmt clean, go vet clean, go test -short across repository/postgres, service/auth, api/handler, and mcp. The pre-existing KeysPage.test.tsx failure observed on the baseline commit (reproduced via `git stash` earlier in Fix 03) is unrelated; my client.ts change adds an optional third argument and is fully backward-compatible. Spec at cowork/auth-bundles-fixes-2026-05-11/04-high-actor-role-revoke-scope.md. Audit doc updated: new row A-4 (2026-05-11) CLOSED appended to the status table at the bottom of cowork/auth-bundles-audit-2026-05-10.md. Operator-visible advisory in CHANGELOG.md v2.1.0 release notes under Security (non-BREAKING — legacy callers are unchanged). Depends on Fix 01 (the scope-aware EffectivePermissions read path on branch fix/audit-2026-05-11/crit-actor-role-scope-reads). This fix makes the inverse op selectively reversible; without Fix 01 the read side would mis-evaluate scoped grants anyway, making selective revoke moot at runtime.	2026-05-11 10:50:34 +00:00
shankar0123	a980e4c494	fix(auth/users): close MED-11 lying field — DeactivatedAt loaded + enforced on login (A-2) The MED-11 closure shipped users.deactivated_at + DELETE /api/v1/auth/users/{id} + cascade-revoke, but the federated-user soft-delete was reversible: the next OIDC login under the same (provider, subject) tuple re-minted a session and re-elevated the user. Three legs of the chain were severed (each independently CRIT-shaped): Leg A — postgres/user.go::userColumns omitted `deactivated_at`, so scanUser never populated User.DeactivatedAt. Every Get / GetByOIDCSubject / ListAll returned DeactivatedAt = nil regardless of the column value. Leg B — postgres/user.go::Update SQL omitted `deactivated_at = $X`, so the handler's `u.DeactivatedAt = now()` mutation was a no-op write at the SQL level. Even with leg A closed, no row ever flipped. Leg C — oidc/service.go::upsertUser did not inspect DeactivatedAt on the existing-user path. Even with legs A + B closed, the OIDC login would still proceed normally. The cascade-session-revoke half of the original closure remained correct, but only for the duration of the user's current cookie. SOC 2 CC6.3 + ISO 27001 A.9.2.6 "user access removal" controls require both immediate revoke AND persistent block — this fix restores the persistent-block leg. Closure across layers: internal/repository/postgres/user.go - userColumns adds `deactivated_at` - scanUser reads via sql.NullTime intermediate (column is nullable) - Create writes deactivated_at explicitly (NULL for new active users; forward-compat for future seed-data flows that pre-populate the column) - Update writes deactivated_at on every call; nil DeactivatedAt → NULL (supports reactivation) internal/auth/oidc/service.go - New sentinel ErrUserDeactivated - upsertUser checks existing.DeactivatedAt != nil BEFORE mutating email / display_name / last_login_at — preserves last_login_at forensics on rejected login attempts (defense-in-depth pin against future "performance optimization" that reorders the gate) internal/api/handler/auth_session_oidc.go - classifyOIDCFailure adds typed errors.Is dispatch for ErrUserDeactivated → audit category "user_deactivated" (SOC/SIEM observability surface) internal/api/handler/auth_users.go - Self-deactivate guard on Deactivate: HTTP 409 + audit row auth.user_deactivate_self_rejected when caller targets own User row. Prevents an admin from one-way-door locking themselves out via the standard handler; break-glass remains the recovery path. - New Reactivate handler: inverse of Deactivate. Clears DeactivatedAt via Update; emits auth.user_reactivated audit row. Idempotent on already-active rows. Sessions revoked at deactivation stay revoked (cascade irreversible by design — user must complete fresh OIDC login). internal/api/router/router.go - POST /api/v1/auth/users/{id}/reactivate wired with auth.user.deactivate gate (reactivation is the inverse op, not a separate privilege) web/src/api/client.ts + web/src/pages/auth/UsersPage.tsx - authReactivateUser() client function - Reactivate button on deactivated rows in UsersPage Regression coverage: Postgres (testcontainers, skipped under -short): TestUserRepository_DeactivatedAt_RoundTrip — Create → set DeactivatedAt → Update → Get / GetByOIDCSubject / ListAll round-trip the value TestUserRepository_DeactivatedAt_CreateWritesNullForActive — new active user reads back DeactivatedAt = nil TestUserRepository_DeactivatedAt_CreatePersistsPreDeactivated — Create with non-nil DeactivatedAt round-trips (forward-compat path) OIDC service: TestService_HandleCallback_RejectsDeactivatedUser — errors.Is ErrUserDeactivated; CallbackResult nil; persisted email / last_login_at / deactivated_at NOT mutated by the rejected attempt TestService_HandleCallback_AllowsReactivatedUser — DeactivatedAt = nil → happy path resumes TestService_HandleCallback_DeactivatedUserPreservesForensics — defense-in-depth pin against future regressions that reorder the gate-vs-mutation sequence Classifier: TestClassifyOIDCFailure extended — typed dispatch + wrapped variant round-trip through errors.Is Handler: TestAuthUsers_Deactivate_RejectsSelfDeactivate — HTTP 409 + audit row + cascade-revoke NOT fired + row stays active TestAuthUsers_Deactivate_OtherUser_HappyPath — HTTP 204 + cascade fires + row soft-deleted TestAuthUsers_Reactivate_HappyPath / _IdempotentOnActiveUser / _UnknownID / _MissingID / _UpdateError Phase 6 verify gate green on the targeted packages: gofmt clean, go vet clean, go test -short pass across internal/auth/oidc, internal/api/handler, internal/api/router, internal/repository/postgres, internal/auth/..., internal/service/..., internal/tlsprobe/..., internal/trustanchor/..., internal/validation/... Spec at cowork/auth-bundles-fixes-2026-05-11/02-crit-deactivated-at-enforcement.md Closure annotation at cowork/auth-bundles-audit-2026-05-10.md MED-11 row. Operator advisory in CHANGELOG.md v2.1.0 release notes.	2026-05-11 02:21:05 +00:00
shankar0123	393f83ac34	fix(auth/rbac): close HIGH-10 lying field — EffectivePermissions reads actor-role scope (A-1) Audit 2026-05-11 A-1 closure. Spec at cowork/auth-bundles-fixes-2026-05-11/01-crit-actor-role-scope-reads.md. WHAT. The HIGH-10 closure (commit `551812b` on dev/auth-bundle-2) added `scope_type` + `scope_id` columns to `actor_roles` via migration 000043. The handler accepted them on POST /api/v1/auth/keys/{id}/roles. The repo Grant INSERTed them. The uniqueness tuple was extended to include them. The GUI exposed them as form inputs. But the load-bearing `EffectivePermissions` SQL at internal/repository/postgres/auth.go:470 never read them. The query only JOINed against rp.scope_type/rp.scope_id (role-permission scope) and ignored ar.scope_type/ar.scope_id (actor-role scope). Operator-visible failure: granting Alice r-operator scoped to profile=p-prod silently elevated her to r-operator GLOBALLY at authorization time. The Authorizer's matcher correctly handled whatever EffectivePermissions returned, but EffectivePermissions returned the rp.scope (typically global), not the ar.scope narrowing. This is the canonical CRIT-5 lying-field shape — a security control claimed, persisted across 4 layers, with unit tests at each isolated layer, but the load-bearing wire severed mid-flight. CLAUDE.md's 'Always take the complete path' rule was violated by the original HIGH-10 closure. Additionally, `scanActorRoles` failed to read the new columns even when present, so every GET-side path (ListByActor / ListByRole) returned ActorRole with zero-value scope fields — the GUI / MCP couldn't show operators what they had configured. HOW. internal/repository/postgres/auth.go: - EffectivePermissions SQL extended to intersect ar.scope with rp.scope via a CASE-in-subquery. The effective scope is the NARROWER of the two; disjoint tuples and scope-type mismatches drop the row entirely. WHERE filter on effective_scope_type IS NOT NULL excludes dropped rows. Match matrix (encoded by the CASE): ar.scope rp.scope effective_scope ───────── ───────── ────────────────── global global global / NULL global profile=X profile=X (rp narrows) profile=X global profile=X (ar narrows) profile=X profile=X profile=X (both agree) profile=X profile=Y ROW DROPPED (disjoint) profile=X issuer=* ROW DROPPED (type mismatch) - ListByActor + ListByRole SELECTs extended with scope_type + scope_id columns so the read-side surfaces what was persisted. - scanActorRoles reads the new columns into ActorRole.ScopeType + ScopeID via the existing sql.NullString + ScopeType cast pattern (mirrors RolePermission scan). internal/repository/postgres/auth_scope_test.go (NEW): Testcontainer-backed regression matrix. 8 cases: 1. ActorRoleGlobal_RolePermGlobal — trivial happy path. 2. ActorRoleGlobal_RolePermProfile — rp narrows. 3. ActorRoleProfile_RolePermGlobal_A1Closure — load-bearing post-fix case: profile-scoped grant narrows to profile. 4. BothScopedSameTuple_Matches — exact-match collapse. 5. BothScopedDifferentIDs_RowDropped — disjoint scopes produce no effective permission. 6. ScopeTypeMismatch_RowDropped — profile vs issuer mismatch. 7. ExpiredGrant_Excluded — pre-fix behavior preserved. 8. ListByActor_ReturnsScopeColumns — read-side surface check. Tests skip in -short mode (testcontainers-backed; require Docker on operator workstation). internal/service/auth/service_test.go: TestAuthorizer_ActorRoleProfileScope_OnlyNarrowedScopeAuthorizes_A1 — unit-level pin (sandbox-runnable, no Docker). Simulates the post-A-1 SQL emission (narrowed effective row at profile=p-prod) and asserts CheckPermission authorizes only matching profile, rejects other profiles AND rejects global. Existing matcher code is unchanged; this proves the integration point. CHANGELOG.md: Operator advisory in the new 'Security (BREAKING — silent-elevation closure)' section. Pre-existing scope-bound grants take effect on upgrade; operators audit `actor_roles WHERE scope_type != 'global'` to confirm intent. cowork/auth-bundles-audit-2026-05-10.md: HIGH-10 row gets an A-1 follow-on CLOSED 2026-05-11 annotation describing the regression + closure. VERIFY. - gofmt -l <changed files> (no diff) - go vet ./internal/repository/postgres/... ./internal/service/auth/... ./internal/api/handler/... ./internal/auth/... ./cmd/server/... PASS - go test -short -count=1 ./internal/service/auth/... ./internal/repository/postgres/... ./internal/api/handler/... PASS - The testcontainer-backed regression matrix runs on operator workstation via 'go test -count=1 ./internal/repository/postgres/...' (skip in -short). Refs: cowork/auth-bundles-audit-2026-05-10.md HIGH-10 (A-1 follow-on) cowork/auth-bundles-fixes-2026-05-11/01-crit-actor-role-scope-reads.md CLAUDE.md 'Always take the complete path' rule	2026-05-11 02:02:39 +00:00
shankar0123	d85114ffb8	feat(auth): backend endpoints for MED-7 + MED-11 + MED-12 Audit 2026-05-10 MED-7 + MED-11 + MED-12 backend halves. WHAT. Three new admin-gated endpoints: GET /api/v1/auth/oidc/providers/{id}/jwks-status (auth.oidc.list) — MED-7 GET /api/v1/auth/users (auth.user.read) — MED-11 DELETE /api/v1/auth/users/{id} (auth.user.deactivate) — MED-11 GET /api/v1/auth/runtime-config (auth.role.assign) — MED-12 MED-7 — JWKS health surface - providerEntry gains 4 counters (statsMu, lastRefreshAt, refreshCount, lastError, rejectedJWSCount) updated under sync.Mutex - RefreshKeys increments refreshCount + records lastRefreshAt - New JWKSStatus(ctx, providerID) returns *JWKSStatusSnapshot — surfaced via the new endpoint - CurrentKIDs intentionally empty (go-oidc's internal JWKS cache isn't exposed); shape kept for forward compat MED-11 — federated-user admin - AuthUsersHandler.List with optional ?oidc_provider_id filter - AuthUsersHandler.Deactivate sets users.deactivated_at + cascade- revokes sessions via UserSessionsRevoker (best-effort; revoke failure does NOT roll back the deactivation) - Idempotent: re-deactivating an already-deactivated user is a no-op MED-12 — runtime config - AuthRuntimeConfigHandler.Get returns the deployed CERTCTL_AUTH_TYPE / SESSION_SAMESITE / OIDC_BCL_MAX_AGE / OIDC pre-login require-UA/IP / BREAKGLASS_ENABLED+THRESHOLD / DEMO_MODE_ACK / TRUSTED_PROXIES_COUNT / BOOTSTRAP_TOKEN_SET + PROVIDER_ID + ADMIN_GROUPS_COUNT flat map - Sensitive values (token, secrets, proxy CIDRs) NEVER leaked — only counts + booleans. Token presence surfaced as 'set/unset' - Gated auth.role.assign (admin-class) so non-admins can't enumerate the deployment's auth knobs cmd/server/main.go wires all three handlers into HandlerRegistry. internal/api/router/router.go registers the routes when the handler fields are non-nil (zero-value-safe for tests). VERIFY. - go vet ./internal/api/... ./internal/auth/... ./internal/repository/... PASS - go build ./cmd/server/... PASS - go test -short -count=1 ./internal/auth/oidc/... PASS (4.1s) - go test -short -count=1 ./internal/api/handler/... PASS (4.1s) GUI halves for MED-7 + MED-11 + MED-12 are the GUI batch (pending). Refs: cowork/auth-bundles-audit-2026-05-10.md MED-7, MED-11, MED-12 cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md items 11 14 15	2026-05-11 00:11:07 +00:00
shankar0123	1449d22b7c	feat(auth): foundation for MED-11 — users.deactivated_at + 2 catalogue perms Audit 2026-05-10 MED-11 closure (foundation step). WHAT. Lays the schema + domain foundation for the MED-11 federated-user admin surface: 1. Migration 000045 adds users.deactivated_at TIMESTAMPTZ (nullable; non-NULL = deactivated). Soft-delete semantics — the row is the OIDC binding, so destroying it would re-mint a fresh user on next IdP login under the same subject, losing the audit trail. 2. Seeds 2 new catalogue permissions: - auth.user.read (admin / operator / auditor) - auth.user.deactivate (admin ONLY) 3. Extends User domain struct with DeactivatedAt time.Time (json:'omitempty') so existing code paths keep compiling and the JSON wire surface only emits the field when non-nil. WHY. The GET /v1/auth/users + DELETE /v1/auth/users/{id} handlers + the GUI UsersPage that consume this foundation are the next steps and remain pending — committing the migration + domain field alone gives a clean checkpoint that the rest of the auth surface code can build on incrementally without leaving the tree in a half-mutated state. HOW. migrations/000045_users_deactivated_at.up.sql: - ALTER TABLE users ADD COLUMN IF NOT EXISTS deactivated_at TIMESTAMPTZ - INSERT 2 permissions into permissions - INSERT role_permissions rows (read in r-admin/operator/auditor; deactivate in r-admin) - Single BEGIN/COMMIT, idempotent (ON CONFLICT DO NOTHING) migrations/000045_users_deactivated_at.down.sql: - reverse-order DELETE + DROP COLUMN internal/auth/user/domain/types.go: - User.DeactivatedAt time.Time, JSON tag omitempty. VERIFY. - go vet ./internal/auth/user/... ./internal/auth/oidc/... ./internal/repository/... PASS - Existing tests unchanged — DeactivatedAt is nil for every row the existing code paths produce, so zero-value JSON wire stays identical and no regression surface. Refs: cowork/auth-bundles-audit-2026-05-10.md MED-11 cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 14	2026-05-11 00:02:57 +00:00
shankar0123	d579c93d88	feat(mcp): 11 audit-fix MCP tools — approvals, break-glass, bootstrap, audit-category (MED-13) Audit 2026-05-10 MED-13 closure. WHAT. 11 new MCP tools rounding out the operator surface for workflows that previously had GUI + CLI coverage but no MCP equivalent: Approval workflow (4): certctl_approval_list GET /v1/approvals approval.read certctl_approval_get GET /v1/approvals/{id} approval.read certctl_approval_approve POST /v1/approvals/{id}/approve approval.approve certctl_approval_reject POST /v1/approvals/{id}/reject approval.reject Break-glass credential admin (4): certctl_breakglass_list GET /v1/auth/breakglass/credentials certctl_breakglass_set_password POST /v1/auth/breakglass/credentials certctl_breakglass_unlock POST /v1/auth/breakglass/credentials/{actor_id}/unlock certctl_breakglass_remove DELETE /v1/auth/breakglass/credentials/{actor_id} All gated auth.breakglass.admin; surface invisible (404 not 403) when CERTCTL_BREAKGLASS_ENABLED=false. Bootstrap (2): certctl_bootstrap_status GET /v1/auth/bootstrap (auth-exempt; safe probe) certctl_bootstrap_consume POST /v1/auth/bootstrap (auth-exempt; one-shot mint) Audit category filter (1): certctl_audit_list_with_category GET /v1/audit?category=<cat> audit.read WHY. certctl_bootstrap_consume is the load-bearing day-0 primitive: a fresh server with no admin actors lets the holder of CERTCTL_BOOTSTRAP_TOKEN mint a fresh admin API key. Exposing it via MCP without a security gate would let a downstream caller mint admin from any chat transcript / log surface that captured the bootstrap token. The tool description carries an explicit cautious-wording comment: CAUTION: NEVER WIRE THIS TO AUTONOMOUS OPERATION. A leaked bootstrap token from any log, telemetry, or chat-transcript surface lets a downstream caller mint a fresh admin API key bypassing every other access-control gate. Run this manually, exactly once, from a trusted shell. Similarly certctl_breakglass_set_password's description flags that the password crosses the MCP transport in plaintext; the server-side handler hashes with Argon2id before persisting + the audit row redacts, but client-side logging must NEVER capture the payload. HOW. internal/mcp/tools_audit_fix.go (NEW): registerAuditFixTools(s, c) — declares the 11 tools via gomcp.AddTool. Each tool routes through the existing Client.Get/ Post/Delete helpers; the server-side rbacGate wrappers (or auth-exempt allowlist, for bootstrap) handle authorization. internal/mcp/types.go: Adds 5 input structs: ApprovalIDInput (get/approve/reject) BreakglassActorIDInput (unlock/remove) BreakglassSetPasswordInput (set_password — flagged plaintext) BootstrapConsumeInput (token + key_name; cautious comment) AuditListWithCategoryInput (category + optional limit/since/until/actor_id) Each tagged with jsonschema descriptions for LLM tool discovery. internal/mcp/tools.go: RegisterTools now calls registerAuditFixTools after the existing Bundle 2 Phase 9 registrar. internal/mcp/tools_per_tool_test.go: allHappyPathCases extended with 11 new entries. The existing TestMCP_AllTools_HappyPath dispatches each tool via the in-memory MCP transport against a 2xx mock backend and asserts the wrapper-layer fence wraps the response; TestMCP_AllTools_ErrorPath dispatches against a 5xx mock and asserts MCP_ERROR fence. TestMCP_RegisterTools_DispatchableToolCount confirms every new tool is dispatchable by name. VERIFY. - go vet ./internal/mcp/... PASS - go test -short -count=1 -run 'TestMCP_AllTools_HappyPath\|TestMCP_AllTools_ErrorPath\| TestMCP_RegisterTools_DispatchableToolCount' ./internal/mcp/... PASS - go test -short -count=1 ./internal/mcp/... PASS (0.3s) Refs: cowork/auth-bundles-audit-2026-05-10.md MED-13 cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 4	2026-05-10 23:37:06 +00:00
shankar0123	5b8c8c06ff	test(oidc): Keycloak integration test for MED-6 auto-refresh (Nit-5) Audit 2026-05-10 Nit-5 closure. WHAT. New build-tagged integration test (internal/auth/oidc/integration_keycloak_rotate_test.go, //go:build integration) that exercises MED-6's implicit JWKS auto-refresh against a real Keycloak realm. Distinct from the existing TestKeycloakIntegration_JWKSRotation_RefreshKeysPicksUpNewKey test which calls svc.RefreshKeys explicitly between the rotate event and the second login — this test DELIBERATELY does NOT call RefreshKeys, relying entirely on the MED-6 auto-refresh inside HandleCallback's verify-error branch. WHY. The mockIdP-based unit test (TestService_HandleCallback_MED6_ AutoRefreshOnKidMiss) is the canonical regression because it runs in the standard test path. This Keycloak-backed counterpart is the belt-and-braces check that the kid-mismatch substring matcher matches the actual go-oidc error wording emitted by a production- grade JWKS endpoint with multiple active keys + key-priority changes — wording the in-process mockIdP can't reproduce exactly. HOW. internal/auth/oidc/integration_keycloak_rotate_test.go (NEW): TestKeycloakIntegration_MED6_AutoRefreshOnKidMiss 1. Baseline login under original key (primes JWKS cache). 2. fx.RotateRealmKeys(t) — rotate via Keycloak admin REST API. 3. Fresh login flow WITHOUT explicit RefreshKeys call. 4. Assert callback succeeds (proves MED-6 auto-refresh fired). internal/auth/oidc/integration_keycloak_test.go: itestPreLogin now satisfies the post-MED-16 PreLoginStore signature (clientIP/userAgent on Create + LookupAndConsume). Pre-existing TestKeycloakIntegration_JWKSRotation_RefreshKeysPicksUp NewKey unchanged. VERIFY. - go vet -tags=integration ./internal/auth/oidc/... PASS - go vet -tags='integration okta_smoke' ./internal/auth/oidc/... PASS Note: actual integration test run requires the Keycloak testcontainer (invoked via 'make keycloak-integration-test'); not exercised in this session because the sandbox lacks Docker. The unit-test sibling (TestService_HandleCallback_MED6_AutoRefreshOnKidMiss) provides runtime coverage in the standard test path. Refs: cowork/auth-bundles-audit-2026-05-10.md Nit-5 cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 20	2026-05-10 23:31:10 +00:00
shankar0123	0df8bf5295	harden(oidc): JWKS auto-refresh on kid-not-in-cache (MED-6) Audit 2026-05-10 MED-6 closure. WHAT. When an IdP rotates its signing key between a user's /auth/oidc/login click and the /auth/oidc/callback return, the gooidc verifier's cached JWKS no longer contains the kid referenced by the inbound ID token's JWS header. Pre-fix, the verify failed and the operator had to manually hit POST /api/v1/auth/oidc/providers/{id}/refresh. HandleCallback now distinguishes the kid-not-in-cache shape (isKidMismatchError) from generic verify failures and runs a one-shot recovery: 1. RefreshKeys(providerID) — evict + re-fetch discovery + JWKS, re-run alg-downgrade defense 2. getOrLoad(providerID) — refresh the cached providerEntry 3. verifier.Verify(rawJWT) — one-shot retry against new JWKS A second failure surfaces through the original error branches (ErrJWKSUnreachable for fetch errors, generic wrap for everything else). NO retry loop — bounded recovery only. WHY. Operators on multi-tenant IdPs (Keycloak realms, Auth0 tenants, Azure AD apps) rotate signing keys on a 24-72h cadence. Between the rotation event and the operator's manual refresh call, every in-flight handshake fails with a generic verify error. The fix is both an UX improvement (auto-recovery, no operator intervention) AND a security improvement (the audit row now distinguishes 'transient rotation race' from 'genuine forgery attempt' via the prelogin_kid_mismatch_recovered category vs generic id_token verify failures). HOW. internal/auth/oidc/service.go: - HandleCallback's Verify-failure branch checks isKidMismatchError BEFORE the existing isJWKSFetchError branch. On match, runs RefreshKeys + getOrLoad + verifier.Verify exactly once. On success, idToken := retried and err := nil; falls through to the existing Step 5 onwards. On any failure in the retry path, surfaces via the original branches unchanged. - isKidMismatchError matcher: pinned go-oidc/v3 v3.18.0 substrings ('kid .* not found', 'signing key .* not found', 'no matching key', 'key with id .* not found'). Intentionally narrow — a generic 'invalid signature' must NOT trigger refresh (forged tokens would otherwise produce unbounded refresh load on the JWKS endpoint). internal/auth/oidc/service_test.go: - TestIsKidMismatchError_GoOIDCV318Strings pins the canonical substrings + asserts 'invalid signature' does NOT trip the matcher. - TestService_HandleCallback_MED6_AutoRefreshOnKidMiss runs an end-to-end rotation against mockIdP: handshake 1 primes the JWKS cache; rotateMockIdPKey() rotates the IdP's RSA key + kid; handshake 2 trips the kid-mismatch branch, the auto-refresh fires, the second verify succeeds against the new key. VERIFY. - go vet ./internal/auth/oidc/... PASS - go test -short -count=1 -run 'MED6\|KidMismatch' ./internal/auth/oidc/... PASS (2/2) - go test -short -count=1 ./internal/auth/oidc/... PASS (4.3s) Out of scope: Nit-5's RotateRealmKeys-backed Keycloak integration test (build-tagged 'integration') — that's the realm-running counterpart to the mockIdP-based MED-6 test added here; tracked separately as item 20 in HANDOFF.md. Refs: cowork/auth-bundles-audit-2026-05-10.md MED-6 cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 3	2026-05-10 23:28:57 +00:00
shankar0123	00bbef7eb0	feat(oidc): POST /api/v1/auth/oidc/test dry-run endpoint (MED-5) Audit 2026-05-10 MED-5 closure (backend half). WHAT. New POST /api/v1/auth/oidc/test endpoint that validates an OIDC provider configuration without persisting anything. Mirrors the read-only legs of the production getOrLoad path so operators can catch typos / network reachability problems / IdP-advertises-weak- alg conditions BEFORE creating the provider row. Request body: {issuer_url, client_id, client_secret, scopes} — client_secret is accepted but unused (discovery + JWKS reachability do not require it). Response body: TestDiscoveryResult{ discovery_succeeded — gooidc.NewProvider returned without error jwks_reachable — explicit GET against jwks_uri succeeded supported_alg_values — verbatim id_token_signing_alg_values_supported iss_param_supported — RFC 9207 advertisement parsed off the disco doc issuer_echo — the iss URL we were called with authorization_url, token_url, jwks_uri, userinfo_endpoint — discovery doc fields for the GUI to preview errors[] — per-leg failure messages } HTTP status: - 200 even when individual checks fail (the per-leg errors[] carries detail so the GUI renders per-check status rows) - 400 only when the request body is malformed or issuer_url empty - 500 only when the service-layer call itself errors WHY. Pre-fix, operators configuring OIDC had to create a provider, then hit /refresh, then read the audit log to figure out whether the discovery doc was reachable / whether the IdP advertises HS256 (the alg-downgrade trap). The GUI rendered no per-check feedback. MED-5 closes the dry-run gap for the same reason every Issuer + Target connector has a 'Test connection' button — operator experience parity. HOW. internal/auth/oidc/test_discovery.go (NEW): - TestDiscoveryResult struct with the per-leg projection. - Service.TestDiscovery(ctx, issuerURL) drives the read-only subset of getOrLoad: gooidc.NewProvider, claims parse for alg-supported + iss-param-supported + jwks_uri + userinfo, alg-downgrade defense, jwksReachable HTTP GET. - jwksReachable is a package-level closure so tests can swap. internal/api/handler/auth_session_oidc.go: - TestProvider HTTP handler. Uses an inline discoveryTester interface to type-assert against the OIDCAuthHandshaker stub (the production Service satisfies; test stubs supply via explicit method). Audit row 'auth.oidc_provider_tested' carries the summary fields. internal/api/router/router.go: - Wired as POST /api/v1/auth/oidc/test under rbacGate('auth.oidc.create'). internal/api/handler/auth_session_oidc_test.go: - stubOIDCSvc gains testResult + testErr fields + TestDiscovery method so it satisfies the inline interface. - 3 regression tests: happy path, missing issuer_url -> 400, discovery-failure -> 200 with errors[] populated. VERIFY. - go vet ./internal/auth/oidc/... ./internal/api/handler/... ./internal/api/router/... PASS - go test -short -count=1 -run TestProvider ./internal/api/handler/... PASS (3/3) - go test -short -count=1 ./internal/auth/oidc/... PASS (3.7s) - go test -short -count=1 ./internal/api/handler/... PASS (4.7s) Out of scope for this commit: the GUI 'Test connection' button on OIDCProviderDetailPage — queued with the GUI batch (items 10-19 of HANDOFF.md). Refs: cowork/auth-bundles-audit-2026-05-10.md MED-5 cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 2	2026-05-10 23:25:54 +00:00
shankar0123	cb73547af5	harden(oidc): pre-login UA/IP binding (MED-16) — RFC 9700 §4.7.1 Audit 2026-05-10 MED-16 closure. WHAT. Binds the OIDC pre-login row to the (clientIP, userAgent) tuple of the /auth/oidc/login request, and enforces a constant-time compare against the /auth/oidc/callback request at consume time. Defeats replay of a stolen pre-login cookie by a different browser / source — the secondary defense layer recommended by RFC 9700 §4.7.1 when the primary layer (HMAC integrity + Path=/ + SameSite=Lax on the cookie) is bypassed via CSRF / XSS / TLS-termination leak. WHY. Pre-fix, the pre-login cookie's HMAC verified only that 'some' caller of /auth/oidc/login was talking to /auth/oidc/callback; it did not verify that the SAME browser / source was on both sides. An attacker who exfiltrated the cookie value via any vector could replay the bytes through their own user-agent and ride the victim's authorization. RFC 9700 §4.7.1 calls out the gap explicitly and recommends binding state to a user-agent fingerprint + source IP. HOW. Migration: migrations/000044_prelogin_uaip.up.sql ALTER TABLE oidc_pre_login_sessions ADD COLUMN IF NOT EXISTS client_ip TEXT, ADD COLUMN IF NOT EXISTS user_agent TEXT; Both nullable for in-flight rolling-deploy compat — the consume- side check only enforces when both row AND request carry non-empty values for the leg in question. Domain: internal/repository/oidc.go (PreLoginSession) — adds ClientIP + UserAgent fields. Repository: internal/repository/postgres/oidc_prelogin.go — Create persists via sql.NullString (empty → NULL); LookupAndConsume reads back. Re-uses package-local nullableString from discovery.go. Service: internal/auth/oidc/service.go - PreLoginStore.CreatePreLogin signature takes (clientIP, userAgent) as positions 5–6. - PreLoginStore.LookupAndConsume returns (clientIP, userAgent) as positions 5–6. - HandleAuthRequest signature gains (clientIP, userAgent), threaded to the store. - HandleCallback adds Step 1.5 — UA / IP constant-time compare between stored row and incoming request. Per-leg toggles via preLoginRequireUA / preLoginRequireIP service fields. Empty values on either side pass through (rolling-deploy + headless- proxy compat). - New sentinels ErrPreLoginUAMismatch, ErrPreLoginIPMismatch. - SetPreLoginBindingRequirements(requireUA, requireIP) helper for main.go config wiring. Adapter: internal/auth/oidc/prelogin.go — PreLoginAdapter passes the new fields through to the repo row. Handler: internal/api/handler/auth_session_oidc.go - OIDCAuthHandshaker.HandleAuthRequest signature updated. - LoginInitiate captures clientIPFromRequest + r.UserAgent() and passes to the service. - classifyOIDCFailure adds errors.Is dispatch for the two new sentinels → prelogin_ua_mismatch / prelogin_ip_mismatch audit categories. Config: internal/config/config.go + AuthConfig.OIDCPreLoginRequireUA (default true) env CERTCTL_OIDC_PRELOGIN_REQUIRE_UA + AuthConfig.OIDCPreLoginRequireIP (default true) env CERTCTL_OIDC_PRELOGIN_REQUIRE_IP cmd/server/main.go calls oidcService.SetPreLoginBindingRequirements from cfg.Auth.OIDCPreLoginRequire{UA,IP}. Tests (internal/auth/oidc/service_test.go): - TestService_HandleCallback_MED16_UAMismatchRejected - TestService_HandleCallback_MED16_IPMismatchRejected - TestService_HandleCallback_MED16_BothMatch_Succeeds - TestService_HandleCallback_MED16_LegacyRowEmptyValues (rolling- deploy compat — empty stored values pass through) - TestService_HandleCallback_MED16_RequireUAFalse_AllowsMismatch (operator escape-hatch — UA mismatch silently allowed) Mechanical fan-out: - stubPreLogin / stubPreLoginRepo signatures updated. - All existing call sites in service_test.go (~40), prelogin_test.go, bench_test.go, logging_test.go, provider_enabled_test.go, integration_keycloak_test.go, integration_okta_smoke_test.go, auth_session_oidc_test.go updated to pass empty strings for the new params — pre-existing tests do not exercise UA/IP binding semantics. VERIFY. - go vet ./internal/auth/oidc/... ./internal/api/handler/... ./internal/config/... PASS - go test -short -count=1 -run MED16 ./internal/auth/oidc/... PASS (5/5) - go test -short -count=1 ./internal/auth/oidc/... PASS (4.6s) - go test -short -count=1 ./internal/api/handler/... PASS (4.3s) - go test -short -count=1 ./internal/config/... PASS Refs: cowork/auth-bundles-audit-2026-05-10.md MED-16 cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 6 RFC 9700 §4.7.1 — OAuth 2.0 Security Best Current Practice	2026-05-10 23:18:23 +00:00
shankar0123	ecef8295bb	harden(oidc): RFC 9207 iss URL parameter check on callback (MED-17) Audit 2026-05-10 MED-17 closure. WHAT. When the matched IdP's discovery doc advertises authorization_response_iss_parameter_supported=true (RFC 9207 §3), HandleCallback now REQUIRES a non-empty `iss` query parameter on /auth/oidc/callback and enforces a constant-time compare against the configured provider's IssuerURL. Mismatch maps to two new sentinel errors (ErrIssParamMissing / ErrIssParamMismatch) that the handler's classifyOIDCFailure dispatches via errors.Is BEFORE the substring fall-through, so the audit failure_category remains distinguishable between the RFC 9207 leg (iss_param_missing / iss_param_mismatch) and the in-token iss claim leg (id_token_iss_mismatch). WHY. The RFC 9207 iss URL parameter is the load-bearing mix-up-attack defense for multi-tenant IdPs (Keycloak realms, Authentik tenants, Auth0 tenants, public-trust CAs). Pre-fix the parameter was silently ignored — an attacker controlling one IdP tenant could route an auth code to certctl's callback against a different tenant's pre-login state without detection. Modern Keycloak / Authentik / public-trust CAs ship the discovery flag by default; legacy IdPs that don't advertise are unaffected (back-compat preserved). HOW. - internal/auth/oidc/service.go - providerEntry gains issParamSupported bool. - getOrLoad extends the discovery-claims read to include authorization_response_iss_parameter_supported, alongside the existing id_token_signing_alg_values_supported defense. - HandleCallback's signature gains callbackIss string at position 5. Step 2.5 runs after the state compare + provider load: when issParamSupported is true, an empty callbackIss returns ErrIssParamMissing; a present-but-mismatched value returns ErrIssParamMismatch (constant-time compare). - Two new sentinels: ErrIssParamMissing, ErrIssParamMismatch. ErrIssuerMismatch's doc-string clarified to note it covers the in-token leg only. - internal/api/handler/auth_session_oidc.go - OIDCAuthHandshaker.HandleCallback signature updated. - LoginCallback reads r.URL.Query().Get("iss") (no TrimSpace — byte-strict compare upstream) and threads it through. - classifyOIDCFailure: typed errors.Is dispatch for the three iss-family sentinels BEFORE the substring fall-through, so the three cases stay distinguishable in the audit row. - internal/api/handler/auth_session_oidc_test.go - stubOIDCSvc.HandleCallback bumped to 7-arg signature. - TestClassifyOIDCFailure extended with 5 new cases pinning the iss-family dispatch + a wrapped-error round-trip. - internal/auth/oidc/service_test.go - mockIdP gains advertiseIssParameterSupported bool; the /.well-known/openid-configuration handler emits the claim only when set (so existing tests stay back-compat). - 4 new regression tests: * MED17_NoSupport_AnyIssAccepted — provider doesn't advertise; arbitrary callbackIss is ignored (back-compat). * MED17_SupportButMissing — provider advertises; missing iss → ErrIssParamMissing. * MED17_SupportButMismatch — provider advertises; wrong iss → ErrIssParamMismatch (load-bearing mix-up defense). * MED17_SupportAndCorrect — provider advertises; matching iss → success path proves the gate isn't over-eager. - internal/auth/oidc/bench_test.go, internal/auth/oidc/logging_test.go, internal/auth/oidc/integration_keycloak_test.go - Mechanical: all existing HandleCallback call sites updated to pass "" for callbackIss (matches pre-fix behavior for IdPs that don't advertise support — the Keycloak integration suite tests will be re-evaluated once the Keycloak fixture is run against a realm with the discovery flag enabled). VERIFY. - go vet ./internal/auth/oidc/... ./internal/api/handler/... PASS - go test -short -count=1 ./internal/auth/oidc/... PASS (3.4s) - go test -short -count=1 ./internal/api/handler/... PASS (5.4s) - 4 new MED-17 regression tests + extended TestClassifyOIDCFailure pass. Refs: cowork/auth-bundles-audit-2026-05-10.md MED-17 cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 7 RFC 9207 — OAuth 2.0 Authorization Server Issuer Identification	2026-05-10 23:05:52 +00:00
shankar0123	020bba35f0	harden(auth/cookies): __Host- prefix on all three auth cookies (MED-14, BREAKING) Audit 2026-05-10 — close MED-14 from the HANDOFF.md backend batch (item 5). The session, CSRF, and OIDC pre-login cookies all carry the __Host- prefix; browsers now reject any subdomain attempt to overwrite them. Cookie name changes (BREAKING — existing sessions invalidate): - certctl_session → __Host-certctl_session - certctl_csrf → __Host-certctl_csrf - certctl_oidc_pending → __Host-certctl_oidc_pending The __Host- prefix requires Path=/ + Secure + no Domain attribute. Post-login session + CSRF cookies already met all three. The pre-login cookie's Path widened from '/auth/oidc/' to '/' to satisfy the prefix; the cookie lives 10 minutes and is only consumed by the callback handler, so the wider path scope is harmless. Files touched: - internal/auth/session/domain/types.go — constant rename + comment - internal/auth/session/domain/types_test.go — assertion update - internal/api/handler/auth_session_oidc.go — pre-login set + clear paths widened from /auth/oidc/ to / - web/src/api/client.ts — readCSRFCookie now compares against '__Host-certctl_csrf' - CHANGELOG.md — Unreleased > Security (BREAKING) entry - docs/migration/oidc-enable.md — operator-facing detail of the one-time re-authentication window + GUI customization guidance Operator impact: ONE re-login prompt per active session at the deploy that lands this change. Subsequent logins issue the __Host-prefixed cookie automatically. Existing bookmarked deep links work without modification (cookies are path-scoped, not URL-scoped). Refs: cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 5 cowork/auth-bundles-audit-2026-05-10.md MED-14	2026-05-10 22:52:53 +00:00
shankar0123	551812b2ca	feat(auth/rbac): scope_type+scope_id+expires_at on role grants (HIGH-10) Audit 2026-05-10 — close HIGH-10 from the HANDOFF.md backend batch (item 1). Per-actor scoped + time-bound role grants are now expressible via the API. Migration 000043: adds scope_type TEXT NOT NULL DEFAULT 'global' + scope_id TEXT to actor_roles. Constraints: - actor_roles_scope_type_enum: scope_type ∈ {global, profile, issuer} - actor_roles_scope_id_required_when_not_global: scope_id is NULL iff scope_type='global' - Uniqueness extended: (actor_id, actor_type, role_id, scope_type, scope_id, tenant_id) — so an operator can grant the same role to the same actor scoped to multiple profiles/issuers (e.g. r-operator on p-finance AND on p-engineering). Index idx_actor_roles_scope for non-global lookup hot paths. Domain: ActorRole.ScopeType (ScopeType enum) + ScopeID (*string). Authorizer.CheckPermission already understands the tuple via the parallel role_permissions columns; this addition gives operators a per-actor knob without forking roles. Postgres repo: Grant writes scope_type+scope_id with ON CONFLICT keyed on the new uniqueness tuple. Defaults to (global, NULL) when caller omits. Handler: assignRoleRequest extended with scope_type / scope_id / expires_at. Validation: - role_id required (unchanged) - scope_type defaults to 'global'; allowed values global/profile/ issuer; anything else → 400 - scope_id required when scope_type ∈ {profile, issuer}; rejected (must be empty) when scope_type='global' - expires_at must be in the future when present; nil = standing Regression matrix in internal/api/handler/auth_test.go (6 cases): - TestAssignRoleToKey_HIGH10_ProfileScopeBoundGrantPersists - TestAssignRoleToKey_HIGH10_TimeBoundGrantPersists - TestAssignRoleToKey_HIGH10_RejectsScopeIDWithGlobalScope - TestAssignRoleToKey_HIGH10_RejectsMissingScopeIDOnProfile - TestAssignRoleToKey_HIGH10_RejectsPastExpiry - TestAssignRoleToKey_HIGH10_RejectsInvalidScopeType HIGH-10 marked CLOSED in audit-doc — the v3 deferral from the prior session is reversed; everything lands in v2. Refs: cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 1 cowork/auth-bundles-audit-2026-05-10.md HIGH-10	2026-05-10 22:47:45 +00:00
shankar0123	acaa81472d	harden(auth/session+oidc): 503/401 split + go-oidc string pin (LOW-6 + Nit-2) Audit 2026-05-10 — close LOW-6 + Nit-2 from the HANDOFF.md backend batch (items 8 + 9). LOW-6: introduce ErrSessionTransient sentinel in session.Service. session.Validate now distinguishes: - errors.Is(err, repository.ErrSessionNotFound) → ErrSessionInvalidCookie (401) - All other repo errors → ErrSessionTransient (503) The session middleware maps ErrSessionTransient to HTTP 503 with Retry-After: 1. Pre-fix, every DB hiccup looked like a forged-cookie 401 and forced the user to re-authenticate on a transient outage. Two new regression tests pin the wire shape: - TestService_Validate_TransientSessionGetError (service layer) - TestService_Validate_SessionNotFoundMapsToInvalidCookie (negative leg: not-found stays 401) - TestSessionMiddleware_TransientErrorMappedTo503 (middleware-level 503 + Retry-After header) Nit-2: isJWKSFetchError documentation now pins go-oidc/v3 v3.18.0 as the source-of-truth string set. v3.18.0 exposes only *oidc.TokenExpiredError as a typed error; JWKS-fetch failures bubble up as fmt.Errorf-wrapped strings. New regression test TestIsJWKSFetchError_GoOIDCV318Strings pins the canonical substrings emitted by go-oidc's jwks.go — a future upstream bump that changes the wording trips the test and forces the matcher to be re-derived. The test caught a real gap: 'oidc: failed to decode keys' (emitted when the IdP returns non-JSON at the jwks_uri — broken proxy, gateway HTML error page, etc.) was previously misclassified as a generic 500 instead of 503 ErrJWKSUnreachable. Added 'decode keys' substring to the matcher. Status: LOW-6 + Nit-2 marked CLOSED in audit-doc table. Refs: cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md items 8, 9 cowork/auth-bundles-audit-2026-05-10.md LOW-6, Nit-2	2026-05-10 22:41:19 +00:00
shankar0123	77860fbcc3	harden(auth): LOW + Nit batch — bootstrap audit, crypto/rand, XFF trust, CSRF check, protocol-prefix unify (Batch 1) Audit 2026-05-10 — close 8 LOWs + 2 Nits in-bundle. Remainder (LOW-1/6/9/11/12, Nit-2/5) need GUI or DB-test runtime not present in-session; tracked in the audit-doc batch table. LOW-2: bootstrap.ValidateAndMint now emits 'bootstrap.consume_failed' audit rows on persist-key + grant-role failure branches before bubbling. Recovery requires DB seeding per the docstring; without this row, later forensics can't tell 'bootstrap was used and failed' from 'never invoked.' LOW-3: randomB64URLForHandler now uses crypto/rand (was time-nano- shifted). Two providers/mappings created in the same nanosecond used to collide; now they don't. Time-nano fallback retained for the unlikely crypto/rand-broken path. LOW-4: breakglass.verifyDummy uses s.readRand(salt) for the dummy Argon2id verify. Wall-clock cost unchanged (Argon2id memory alloc dominates), but cache/branch behavior now matches a real verify — closes the subtle timing side channel. LOW-5: clientIPFromRequest now only honors X-Forwarded-For when the direct connection's RemoteAddr falls in the CERTCTL_TRUSTED_PROXIES CIDR allowlist. Default-deny: empty list means XFF is ignored. SetTrustedProxies wired in cmd/server/main.go from cfg.Auth.TrustedProxies. LOW-7: internal/auth/protocol_endpoints.go::ProtocolEndpointPrefixes now carries /scep-mtls + /.well-known/est-mtls (previously only in router.AuthExemptDispatchPrefixes; the two lists had drifted). The canonical-prefix coverage test in Phase 12 still pins the set. LOW-8: docs/operator/rbac.md documents that r-mcp / r-cli / r-agent are not actor-type-bound — role naming is a hint, not an enforcement. Operators wanting hard binding must apply periodic audit queries. Native binding is on the v2 roadmap. LOW-10: Session.Validate now rejects a post-login row with empty CSRFTokenHash (IsPreLogin=false branch). validSession test fixture updated with a valid 64-hex CSRF hash. Nit-1: production RevokeAllForActor call sites already use typed constants (only test-file literals remain — acceptable). Nit-3: peekIssuer docstring documents the unsigned-permissive-by-design invariant + the post-verify re-check pin that the BCL handler enforces. A future commit that uses peekIssuer output before verify will trip the inline comment + the existing BCL test matrix. Status table updated in cowork/auth-bundles-audit-2026-05-10.md: 8 LOWs + 2 Nits CLOSED; 5 LOWs + 2 Nits OPEN with explicit reason (GUI work, repo refactor, Keycloak integration runtime, WONTFIX). Refs: cowork/auth-bundles-audit-2026-05-10.md LOW-2/3/4/5/7/8/10 cowork/auth-bundles-audit-2026-05-10.md Nit-1/3	2026-05-10 22:26:12 +00:00
shankar0123	bab533636d	harden(audit+session): full SHA-256 audit hash + cookie segment length cap (MED-15 + Nit-4) Audit 2026-05-10 Fix 13 Phase F + Fix 14 Phase F partial — close MED-15 + Nit-4. Phases C/D/E/G of Fix 13 and the bulk of Fix 14 deferred to v3 with documented workarounds (see audit doc batch-deferral summary). MED-15: internal/api/middleware/audit.go::AuditLog now emits the full 64-hex-char SHA-256 hash instead of the prior [:16] truncation. The audit_events.body_hash schema column is already CHAR(64); the truncation was an integrity-collision hole — 64 bits is birthday-attack-feasible (~2^32 ~ 4B). Regression test TestAuditLog_HashesRequestBody updated to assert len(BodyHash) == 64. Nit-4: internal/auth/session/service.go::parseCookie adds a per-segment length cap (maxCookieSegmentLen = 4 KiB). Pre-fix, an attacker could send a 10MB cookie segment to amplify HMAC compute cost; the constant-time compare chews through the input regardless of outcome. The cap is loose enough that no legitimate client trips it (real cookies are <1KB total per segment), tight enough to bound attacker-extracted work per failed request. Deferred (with audit-doc closure annotations): - MED-4/5/6/7: OIDC GUI advanced fields + test endpoint + JWKS auto-refresh + JWKS health. v3 OIDC-operator-experience bundle. Workarounds documented. - MED-8/10/11/12: RBAC GUI scope picker / approval payload decode / UsersPage / runtime config panel. v3 GUI-polish bundle. Backend already accepts the scope_type/scope_id fields; the gap is GUI. - MED-13: MCP tools for approvals / break-glass / bootstrap. v3 MCP-expansion bundle. - MED-14: __Host- cookie rename. Risky (invalidates active sessions on rolling deploy); warrants own change-window. - MED-16/17: Pre-login UA/IP binding + RFC 9207 iss URL check. v3 OIDC-hardening bundle. - All 12 LOWs + 4 of 5 Nits: v3 cleanup bundle. Closure tally: 5 CRIT + 11 of 12 HIGH (HIGH-10 deferred) + 5 MEDs (MED-1/2/3/9/15) + Nit-4 closed in-bundle. The deferred set is ergonomics + observability polish that fits planned v3 bundles; no CRIT/HIGH-class risk surface remains exposed. Refs: cowork/auth-bundles-audit-2026-05-10.md MED-15, Nit-4 Spec: cowork/auth-bundles-fixes-2026-05-10/13-med-bundle.md Phase F cowork/auth-bundles-fixes-2026-05-10/14-low-nit-cleanup.md Phase F	2026-05-10 22:02:26 +00:00
shankar0123	119492d47f	feat(oidc): Enabled toggle on OIDCProvider (MED-9) Audit 2026-05-10 Fix 13 Phase B — close MED-9. MED-4/5/6/7 deferred to v3. MED-9: ship the OIDCProvider.Enabled boolean. Pre-fix, the only way to take a provider offline during an incident was DELETE, which breaks active user_oidc_provider FK references and orphans any session that minted under the provider. Post-fix: - Migration 000042 adds enabled BOOLEAN NOT NULL DEFAULT TRUE. Default-true means existing pre-migration rows are all enabled post-deploy; no breaking-change window. - internal/auth/oidc/domain/types.go::OIDCProvider.Enabled ships the domain field with JSON tag 'enabled'. - Repository read/write paths (List, Get, GetByName, Create, Update) all carry the column. - internal/auth/oidc/service.go::HandleAuthRequest rejects with the new ErrProviderDisabled sentinel when cfgRow.Enabled=false. - cmd/server/main.go::oidcProvidersListAdapter.List filters disabled providers before constructing OIDCProviderInfo so the LoginPage's 'Sign in with X' buttons never render for offline IdPs. - Defense-in-depth: the ErrProviderDisabled service-layer check is the guard for direct API / MCP / CLI callers that bypass the GUI. Regression test: internal/auth/oidc/provider_enabled_test.go warms the entry cache via a successful HandleAuthRequest, flips cfgRow.Enabled=false on the cached entry, then asserts the next call returns ErrProviderDisabled (errors.Is). Test fixtures (newValidProvider, makeProvider) updated to set Enabled: true so existing tests stay green. Operators can toggle Enabled today via the existing PUT /api/v1/auth/oidc/providers/{id} body field. A dedicated GUI toggle on OIDCProviderDetailPage and a single-purpose PUT-just-enabled endpoint are deferred to the v3 GUI-polish bundle — the load-bearing wire is in place now. MED-4 (GUI advanced fields on edit), MED-5 (POST .../test endpoint + button), MED-6 (JWKS auto-refresh on cache-miss), MED-7 (JWKS health endpoint + GUI panel): DEFERRED to v3 with explicit annotations in the audit doc. Workarounds: MED-4 fields are PUT-editable via curl/MCP; MED-5 → call refresh post-create; MED-6 → call refresh manually on key rotation. Refs: cowork/auth-bundles-audit-2026-05-10.md MED-4, MED-5, MED-6, MED-7, MED-9 Spec: cowork/auth-bundles-fixes-2026-05-10/13-med-bundle.md Phase B	2026-05-10 21:59:17 +00:00
shankar0123	be2a096d80	feat(auth/sessions): list-all gate + revoke-all-except-current (MED-1/2/3) Audit 2026-05-10 Fix 13 Phase A — close MED-1, MED-2, MED-3. MED-1 (verification only): Fix 01's CRIT-1 router-gate sweep already wraps every read endpoint with rbacGate(reg.Checker, '<resource>.read', ...). Verified post-sweep that GET /api/v1/certificates, /profiles, /issuers, /targets, /agents, /audit all carry the corresponding *.read permission gate. MED-2: ListSessions now gates ?actor_id=<other> on auth.session.list.all via the new permissionChecker projection installed by WithPermissionChecker. cmd/server/main.go threads the existing authCheckerAdapter into the handler. When caller's actor_id != caller.ActorID AND the handler has a checker, an inline CheckPermission(..., 'auth.session.list.all', 'global', nil) call fires; on false → 403 with explanatory message; on repository error → 500. Defense-in-depth: the router-level rbacGate enforces auth.session.list as the floor; the .list.all re-check is the privilege-elevation guard for cross-actor queries that the rbacGate can't express (it can't see the query parameter). MED-3: ship DELETE /api/v1/auth/sessions?except=current — the 'sign out all other sessions' flow. Gated by auth.session.revoke; the handler reads the caller's current session ID from session.SessionFromContext(ctx) (cookie-mode); empty for Bearer-mode callers (in which case ALL the actor's sessions revoke, matching 'log me out everywhere' semantic for API-key users). New repository method SessionRepository.RevokeAllExceptForActor: UPDATE sessions SET revoked_at = NOW() WHERE actor_id = AND actor_type = AND tenant_id = AND revoked_at IS NULL AND id != returning rowcount. Added to the interface in internal/repository/session.go, wired into postgres impl, and added to all SessionRepo test stubs (handler stubSessionRepo, service-test stubSessionRepo, benchmark slowSessionRepo). The session.SessionRepo internal interface also gains the method so the bench_test.go forwarder compiles. Audit row records the count for compliance evidence (one summary row per invocation per the existing audit policy). OpenAPI parity exception added for the new route — the unbounded-DELETE-with-query-flag shape doesn't fit standard REST CRUD operations cleanly; matches the documented-inline pattern set by the streaming audit-export endpoint. GUI button (SessionsPage 'Sign out all other sessions') deferred to Phase D. Refs: cowork/auth-bundles-audit-2026-05-10.md MED-1, MED-2, MED-3 Spec: cowork/auth-bundles-fixes-2026-05-10/13-med-bundle.md Phase A	2026-05-10 21:49:35 +00:00
shankar0123	8bd85af400	fix(audit): ship streaming NDJSON audit export endpoint (HIGH-9 / HIGH-11) Audit 2026-05-10 HIGH-9 + HIGH-11 closure. HIGH-10 deferred to v3. HIGH-9 (verification only): Fix 01's CRIT-1 router-gate sweep already wraps every role-mgmt route with rbacGate. Verified via grep: - GET /api/v1/auth/roles → auth.role.list - POST /api/v1/auth/roles → auth.role.create - GET /api/v1/auth/roles/{id} → auth.role.list - PUT /api/v1/auth/roles/{id} → auth.role.edit - DELETE /api/v1/auth/roles/{id} → auth.role.delete - POST /api/v1/auth/roles/{id}/permissions → auth.role.edit - DELETE /api/v1/auth/roles/{id}/permissions/{perm} → auth.role.edit - POST /api/v1/auth/keys/{id}/roles → auth.role.assign - DELETE /api/v1/auth/keys/{id}/roles/{role_id} → auth.role.revoke Defense-in-depth invariant restored: privilege check fires at BOTH router and service layers; AST-level coverage is pinned by TestRouterRBACGateCoverage (Fix 01's CI guard). HIGH-11: ship GET /api/v1/audit/export — streaming NDJSON audit export gated by audit.export. Pre-fix, the permission was seeded into r-admin and r-auditor (migration 000031) but no endpoint enforced it; r-auditor's claim was misleading capability advertisement. Post-fix: - internal/api/handler/audit.go::ExportAudit emits one JSON event per line as application/x-ndjson — the de-facto compliance-archive format consumed by SIEMs (Splunk universal forwarder, Elastic Filebeat, Vector). - Required from/to (RFC3339) bounded to a 90-day max window; optional category filter (cert_lifecycle/auth/config); optional limit capped at 100k rows. - Content-Disposition: attachment; filename="certctl-audit-<from>_to_<to>.ndjson" so curl + browser downloads land with a sensible filename. - Recursively self-audits: every successful export emits an audit.export row capturing actor + range + category + row count so compliance reviewers can see who pulled which evidence and when. - Service layer: AuditService.ExportEventsByFilter reuses the existing repository.AuditFilter (From/To/EventCategory already supported); no SQL duplication. - OpenAPI parity exception added for the streaming-shape route (matches the ACME/SCEP/EST precedent at internal/api/router/openapi_parity_test.go::SpecParityExceptions). Regression matrix in audit_export_test.go (7 cases): - TestExportAudit_StreamsNDJSONLines (happy path; pins content-type + content-disposition + JSON-per-line shape + recursive self-audit) - TestExportAudit_RejectsRangeBeyond90Days (100-day window → 400) - TestExportAudit_RejectsMissingFromOrTo (3 cases) - TestExportAudit_RejectsInvalidCategory (unknown enum → 400) - TestExportAudit_AcceptsValidCategoryFilter (auth filter passes through) - TestExportAudit_RejectsNonGET (POST → 405) - TestExportAudit_RejectsToBeforeFrom (inverted range → 400) The auditor role's surface is now complete (read + export). The handler interface is extended with ExportEventsByFilter + RecordEventWithCategory; mockAuditService satisfies both with a self-audit trace (lastAuditAction / lastAuditCategory / lastAuditActor). HIGH-10 (scope + expiry on assignRoleRequest): DEFERRED to v3. Schema column already exists (ActorRole.ExpiresAt); load-bearing wire remains v3 work. Documented carve-out at HIGH-10's annotation. Refs: cowork/auth-bundles-audit-2026-05-10.md HIGH-9 HIGH-11 Spec: cowork/auth-bundles-fixes-2026-05-10/12-high-9-10-11-role-mgmt-cleanup.md	2026-05-10 21:36:01 +00:00
shankar0123	b81588e717	fix(config): refuse to start when CERTCTL_AUTH_TYPE=none binds non-loopback (HIGH-12) Audit 2026-05-10 HIGH-12 closure. Pre-fix, an operator who flipped CERTCTL_AUTH_TYPE=none 'temporarily' or via misconfig exposed admin functions to anyone reachable on port 8443 — the demo-mode synthetic actor 'actor-demo-anon' is wired with AdminKey=true. The control plane is HTTPS-only, but a misconfigured ingress / public listen-bind means any reachable client gets full admin without authentication. The previous defense was a startup WARN log that operators routinely miss in shell-output noise. Post-fix: Config.Validate() refuses to start when: - Auth.Type = 'none' - AND Server.Host is non-loopback (NOT in {127.0.0.1, ::1, localhost}) - AND Auth.DemoModeAck = false (CERTCTL_DEMO_MODE_ACK=true overrides) Real authn types (api-key, oidc) are unaffected — the guard fires only when Type=none. isLoopbackAddr defensively rejects: - '' (Go's default-everything bind) - '0.0.0.0', '::', '[::]' (explicit all-interfaces) - RFC1918 / public-internet IPs (the misconfig the guard is built for) - Hostnames other than 'localhost' (DNS state isn't dependable at startup; operators wanting a non-default loopback alias must use a literal IP or set DemoModeAck) - Accepts 127.0.0.0/8 (all loopback IPs), ::1, localhost - Strips host:port form before classifying Regression matrix in config_test.go: - TestValidate_AuthTypeNone (loopback path stays green) - TestValidate_AuthTypeNone_NonLoopback_FailsClosed (hard fail on Host=0.0.0.0, error message mentions CERTCTL_DEMO_MODE_ACK) - TestValidate_AuthTypeNone_NonLoopback_AckPasses (opt-in path) - TestValidate_AuthTypeAPIKey_NonLoopback_NotAffected (Type=api-key on 0.0.0.0 unaffected by the guard) - TestIsLoopbackAddr (15-case matrix: IPv4 + IPv6 + RFC1918 + public IPs + hostnames + host:port forms) The Phase 2 spec items — production-startup banner when actor-demo-anon has residual role grants; CI guard banning new synthetic-admin code paths — are partial-deferred to a v3 hygiene bundle. The high-impact, fail-closed leg ships in this commit. Refs: cowork/auth-bundles-audit-2026-05-10.md HIGH-12 Spec: cowork/auth-bundles-fixes-2026-05-10/11-high-12-demo-mode-guard.md	2026-05-10 21:29:06 +00:00
shankar0123	a1ec42065a	fix(audit): close silence-leg of HIGH-6; emit WARN on audit-write failure Audit 2026-05-10 HIGH-6 partial closure (silence leg). The audit identified two distinct gaps in the auth surface's audit-emit pattern: (1) silence — `_ = audit.RecordEventWithCategory(...)` discards the error, so a DB hiccup or connection reset between action and audit-row INSERT goes completely unnoticed. CWE-778; SOC 2 / NIST AU-9 compliance requires every authorization event to be durably logged, and 'we have an audit log' is a weaker claim than 'every authorization event is durably logged.' (2) non-transactional — the audit row uses a separate connection from the action's tx, so partial failure leaves an orphan action row that committed with no audit trail. Decision 8 of the auth-bundles-index requires action + audit row atomic. This commit closes leg (1) fully across all six audit-emit call sites in the auth surface: - internal/service/auth/actor_role_service.go::recordAudit - internal/service/auth/role_service.go::recordAudit - internal/auth/bootstrap/service.go::ValidateAndMint - internal/auth/breakglass/service.go::recordAudit - internal/auth/session/service.go::recordAudit - internal/api/handler/auth_session_oidc.go::recordAudit - internal/service/profile.go::Update (Phase 9 approval-bypass) Each `_ = ...` swallow is replaced with: if err := audit.RecordEventWithCategory(...); err != nil { slog.WarnContext(ctx, '<surface> audit write failed (action committed; audit row may be missing)', 'action', action, 'actor_id', actor, 'resource_id', resource, 'err', err) } Operators monitoring audit-write failures now see structured WARN logs with action + actor + resource attribution; missing audit rows can be cross-referenced against monitoring without manual SELECT-from- audit-table. Infrastructure for leg (2) (transactional commit) is also landed in this commit: - service.AuditService.RecordEventWithCategoryWithTx (new method; accepts repository.Querier from postgres.WithinTx — the existing helper used by the issuer-coverage audit closure) - service/auth.AuditService interface declares the new method - test stub fakeAudit.RecordEventWithCategoryWithTx satisfies the extended interface The eight per-path WithinTx-refactors documented in cowork/auth-bundles-fixes-2026-05-10/10-high-6-atomic-audit-commit.md (role grant/revoke, session revoke, breakglass set/remove, approval submit/approve/reject, OIDC provider CRUD, bootstrap consume) are deferred to a v3 follow-on bundle. Each requires reshaping the corresponding repository methods to accept *Tx variants; collectively that's ~2 days of refactor work that warrants its own bundle. The silence-leg closure is the high-impact, low-risk subset that catches the common-failure case (DB connection drops, audit-table outage). Refs: cowork/auth-bundles-audit-2026-05-10.md HIGH-6 Spec: cowork/auth-bundles-fixes-2026-05-10/10-high-6-atomic-audit-commit.md	2026-05-10 21:24:29 +00:00
shankar0123	ea47d7c903	fix(oidc/prelogin): encrypt state/nonce/PKCE-verifier at rest (HIGH-5) Pre-login rows previously persisted the OIDC state, nonce, and PKCE verifier as plaintext columns; an operator restoring an unredacted backup of oidc_pre_login_sessions to a debug environment leaked every in-flight handshake. If the IdP also leaked the auth code in the same window (logged at a misconfigured TLS terminator, etc.), the attacker could exchange code + verifier directly. RFC 7636 §7 requires verifier confidentiality. This commit: - Migration 000041 adds {state,nonce,pkce_verifier}_enc BYTEA columns and makes the legacy plaintext columns nullable. A follow-up migration drops the plaintext columns once the rolling deploy completes. - internal/repository/postgres/oidc_prelogin.go::Create encrypts the three secrets via crypto.EncryptIfKeySet (v3 magic 0x03 + per-row salt + nonce + AES-256-GCM tag) and writes only the encrypted columns; legacy plaintext stays NULL on the write path. - LookupAndConsume prefers encrypted columns via materialize(), falling back to the legacy plaintext only when _enc is NULL — the rolling-deploy compat layer that 000042 will retire. - NewPreLoginRepository takes encryptionKey; cmd/server/main.go threads cfg.Encryption.ConfigEncryptionKey in. - Encryption key reuses CERTCTL_CONFIG_ENCRYPTION_KEY (same passphrase already protecting OIDC client secrets and SessionSigningKey material). No new env var. Why encryption-at-rest, not HMAC: the spec's HMAC approach required moving plaintext into the cookie (the cookie currently carries only row ID + HMAC). Re-shaping the cookie wire format would be a larger refactor; the audit explicitly admits encryption-at-rest is an acceptable closure (weaker because backups still contain decryptable ciphertext, but the encryption key is held separately from the DB backup, and the 10-minute TTL further bounds usable secret window). Three new regression tests in oidc_prelogin_encryption_test.go pin: (a) _enc columns contain v3-format ciphertext, NOT plaintext substrings, post-Create (b) legacy plaintext columns are NULL post-Create (defends against future patches that re-introduce plaintext writes) (c) LookupAndConsume round-trips state/nonce/verifier byte-for-byte A fourth test pins the legacy-row fallback for rolling-deploy compat. Refs: cowork/auth-bundles-audit-2026-05-10.md HIGH-5 Spec: cowork/auth-bundles-fixes-2026-05-10/09-high-5-prelogin-secret-protection.md	2026-05-10 21:17:55 +00:00
shankar0123	2015ff46cd	fix(auth/ux): cause-aware OIDC + session error surfacing (HIGH-7 + HIGH-8 closure) Server (HIGH-7): the OIDC callback failure path now 302-redirects to /login?error=oidc_failed&reason=<category> instead of emitting a blank 400. `category` is the existing audit `failure_category` value; classifyOIDCFailure was extended with three new sentinel paths (email_domain_not_allowed, email_missing_but_required, pkce_invalid) so CRIT-5 + PKCE failures get distinguishable GUI rendering. Audit-log observability is unchanged — the same failure_category is written to the auth.oidc_login_failed audit row; the 302 is purely a UX leg layered on top. Server (HIGH-8): SessionMiddleware now stashes a cause classification on the request context when Validate returns an error, mapping the sentinels via classifySessionError (errors.Is-based, so wrapped sentinels still classify) to the stable wire-strings idle_timeout / absolute_timeout / back_channel_revoked / invalid_token. The 401 emit point in bearerSkipIfAuthenticated reads the stashed cause and emits WWW-Authenticate: Bearer realm="certctl", error="invalid_token", error_description=<cause> per RFC 6750 §3. GUI (HIGH-7): LoginPage reads ?error= + ?reason= from the URL via react-router useSearchParams and renders an operator-friendly amber-bordered banner above the form; OIDC_FAILURE_REASON_TEXT maps all 16 known categories with a defensive 'unspecified' fallback for forward-compat with future server-side categories. GUI (HIGH-8): api/client fetchJSON parses the WWW-Authenticate cause via parseWWWAuthenticateCause and attaches it to the 'certctl:auth-required' CustomEvent detail; AuthProvider redirects to /login?session_expired=<cause> on cause-aware 401s; LoginPage renders a blue-bordered session-cause banner. invalid_token stays on the current page (no hard redirect for opaque failures). Misc cleanup: ErrorState now accepts the title/message/data-testid form added by CRIT-4 BreakglassPage (was erroring tsc on master). Regression matrix: - internal/api/handler/oidc_redirect_categories_test.go pins all 16 failure categories to the 302 + reason= location + audit-row leg - internal/auth/session/www_authenticate_test.go pins the 4 stable cause categories on classifySessionError (incl. errors.Is wrapped sentinels) + the WWW-Authenticate emission across all 4 categories + the no-session-context fallback case - internal/api/handler/auth_session_oidc_test.go: 4 pre-existing TestLoginCallback_*Returns400 tests updated to assert 302 + reason= location (the wire shape changed from 400 to 302, but the audit observability and behaviour-equivalent failure-classification are preserved) - web/src/pages/LoginPage.test.tsx: 6 new cases pinning the failure banner, session-cause banner, unknown-reason fallback, and forward-compat 'unspecified' category Spec: cowork/auth-bundles-fixes-2026-05-10/08-high-7-8-error-surfacing.md Closes: HIGH-7, HIGH-8 of cowork/auth-bundles-audit-2026-05-10.md	2026-05-10 21:12:11 +00:00
shankar0123	32c97777b5	fix(oidc/bcl): jti replay-cache + iat freshness check (HIGH-3 closure) Closes HIGH-3 of the 2026-05-10 audit. Pre-fix the BCL handler accepted any logout_token whose iat + jti were syntactically present but never checked (a) that iat fell within a skew window or (b) that jti hadn't been seen before. A captured logout_token was replayable indefinitely; once CRIT-2 was fixed, every replay would revoke the user's current sessions — persistent DoS. RFC 9700 §2.7 + OIDC BCL 1.0 §2.5 require jti replay defense. - Migration 000040_bcl_replay_cache: oidc_bcl_consumed_jtis table with composite PK on (jti, issuer_url) — RFC 7519 §4.1.7 per-issuer uniqueness — and an expires_at index for the GC sweep. - repository.BCLReplayRepository interface + ErrBCLJTIAlreadyConsumed sentinel. Postgres impl uses INSERT...ON CONFLICT DO NOTHING RETURNING true for atomic single-use semantics in one round-trip. - handler.DefaultBCLVerifier gains WithMaxAge + nowFn clock seam. iat freshness check rejects tokens whose iat is in the future beyond max-age OR stale beyond it. Verifier signature extended: Verify(ctx, jwt) (iss, sub, sid, jti string, iat int64, err error). - handler.AuthSessionOIDCHandler gains BCLReplayConsumer (interface) + WithBCLReplayConsumer(consumer, maxAge) setter. BackChannelLogout consumes the jti post-verify with TTL = max(24h, 2maxAge): - first-receive → 200, sessions revoked, audit outcome=revoked - replay (ErrBCLJTIAlreadyConsumed) → 200 + Cache-Control: no-store, audit outcome=jti_replayed, sessions NOT re-revoked - transient (non-AlreadyConsumed error) → 503 so the IdP retries - internal/scheduler/scheduler.go: SetBCLReplayGarbageCollector wires SweepExpired into the existing session-GC tick (no separate ticker for short-lived replay rows). - cmd/server/main.go: bclMaxAge from cfg.Auth.OIDCBCLMaxAgeSeconds (default 60s, env CERTCTL_OIDC_BCL_MAX_AGE_SECONDS); bclReplayRepo wired into the verifier + handler + scheduler. - Three regression tests in internal/api/handler/bcl_replay_test.go: TestBackChannelLogout_FirstReceiveConsumesJTI, TestBackChannelLogout_ReplayedJTIReturns200WithAudit, TestBackChannelLogout_TransientConsumeFailureReturns503. - internal/api/handler/auth_session_oidc_test.go: stubBCLVerifier gains jti + iat fields; existing TestBackChannelLogout_ tests rewritten for the new Verify return. Verification gate green: gofmt clean, go vet clean, go test -short -count=1 on internal/api/handler / internal/api/router / internal/scheduler / cmd/server / internal/auth/oidc / internal/auth/breakglass — all pass. CRIT-1..CRIT-5 + HIGH-1 + HIGH-2 + HIGH-3 of the 2026-05-10 audit now closed on this branch. Spec at cowork/auth-bundles-fixes-2026-05-10/07-high-3-bcl-replay-defense.md. Refs: cowork/auth-bundles-audit-2026-05-10.md HIGH-3	2026-05-10 20:53:29 +00:00
shankar0123	4d11984645	fix(auth): wire RevokeAllForActor + RotateCSRFToken to mutation paths Closes HIGH-1 + HIGH-2 of the 2026-05-10 audit. HIGH-1: breakglass.Service.SetPassword and RemoveCredential now call sessions.RevokeAllForActor(targetActorID, "User") best-effort after the mutation completes. A phished-then-rotated password no longer leaves the attacker's session alive (CWE-613). Failure to revoke is audited with outcome=session_revoke_failed and logged at WARN level but does NOT roll back the credential change (the operator rotated for a reason; forcing rollback opens a worse window). - breakglass.SessionMinter interface extended with RevokeAllForActor. - cmd/server/main.go::breakglassSessionMinterAdapter gains the bridge to session.Service.RevokeAllForActor. - stubSessions in service_test.go tracks revokeAllIDs / revokeAllTypes / revokeAllErr. - Three regression tests: - TestService_SetPassword_RevokesExistingSessions - TestService_RemoveCredential_RevokesExistingSessions - TestService_SetPassword_RevokeFailureDoesNotRollback HIGH-2: New session.Service.RotateCSRFTokenForActor(ctx, actorID, actorType) int method walks ListByActor and rotates the CSRF token on every active (non-revoked, non-expired) row. Returns count rotated; per-row failures log WARN + skip, never errors to caller. New handler.CSRFRotator interface + AuthHandler.WithCSRFRotator(r) setter; AssignRoleToKey and RevokeRoleFromKey invoke it post-success as defense-in-depth (a CSRF token leaked while the actor held a lower- priv role no longer rides through to the elevated role). - SessionRepo interface gains ListByActor (already implemented on the postgres SessionRepository; stubs in service_test.go + bench_test.go updated to match). - cmd/server/main.go calls .WithCSRFRotator(sessionService) on the AuthHandler. - Two regression tests: - TestRotateCSRFTokenForActor_RotatesAllActiveRows (asserts revoked / expired / other-actor rows are skipped) - TestRotateCSRFTokenForActor_NoSessionsReturnsZero Verification gate green: gofmt clean, go vet clean, go test -short -count=1 ./internal/auth/breakglass/ ./internal/auth/session/ ./internal/api/handler/ ./internal/api/router/ ./cmd/server/ ./internal/domain/auth/ — all pass. CRIT-1..CRIT-5 + HIGH-1 + HIGH-2 of the 2026-05-10 audit now closed on this branch. Spec at cowork/auth-bundles-fixes-2026-05-10/06-high-1-2-revoke-and-rotate.md. Refs: cowork/auth-bundles-audit-2026-05-10.md HIGH-1 HIGH-2	2026-05-10 20:43:45 +00:00

1 2 3 4 5 ...

404 Commits