certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 23:51:41 +00:00

Author	SHA1	Message	Date
shankar0123	4f2d865b51	feat(middleware): SEC-008 — Permissions-Policy deny-all-features header Acquisition-audit SEC-008 closure (Sprint 2 ACQ, 2026-05-16). Add Permissions-Policy as a sixth security header alongside HSTS, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, and CSP. Default value is a deny-all-features baseline: accelerometer=(), camera=(), geolocation=(), microphone=(), payment=(), usb=(), interest-cohort=() certctl is a control-plane API + dashboard; no part of the surface needs camera / microphone / geolocation / accelerometer / payment / USB access, and `interest-cohort=()` opts out of the deprecated FLoC browser feature. The deny-all default removes those attack/fingerprint surfaces if certctl is ever embedded in a malicious page or if a dashboard route is XSS-compromised post-CSP-bypass. Per-field empty-string suppression is preserved: operators who want to allow a feature (e.g. hardware-attestation flows wanting WebAuthn's USB transport) can either set Cfg.PermissionsPolicy to their own narrowed allowlist or set it to "" to suppress the header entirely. Tests: - TestSecurityHeaders_PermissionsPolicyDefault — pins the literal default value byte-for-byte so any widening (e.g. someone adding camera=*) breaks the test. - TestSecurityHeaders_PermissionsPolicyOverrideToEmptySuppresses — pins the operator escape hatch and that the per-field suppression contract still holds field-by-field. - TestSecurityHeaders_DefaultsAllPresent gains Permissions-Policy in its loop, so the existing on-error and on-2xx paths now cover the new header too. The middleware pre-trim slice capacity bumps from 5 → 6 entries.	2026-05-16 17:13:17 +00:00
shankar0123	2e9262cfb7	fix(handler): SEC-021 — wrap BCL provider re-fetch via SafeOIDCContext Acquisition-audit Sprint 1 follow-up to SEC-001 (2026-05-16). Companion to SEC-020 (prior commit). Closes the second of the two adjacent OIDC call sites the original SEC-001 sweep missed: the per-request discovery re-fetch in DefaultBCLVerifier.Verify. Pre-fix: func (v *DefaultBCLVerifier) Verify(ctx, logoutToken) { ... provider, perr := gooidc.NewProvider(ctx, matched.IssuerURL) ... } Same shape as service.go::fetchUserinfoGroups (closed in the prior commit) and service.go:1084 (closed by SEC-001 itself). go-oidc's NewProvider derives its http.Client from ctx; bare ctx falls through to http.DefaultClient at the discovery-doc + JWKS-fetch dial. An IdP whose registered IssuerURL resolves to a reserved address (or is rebinding to one at logout time) would trigger an unguarded HTTPS egress on every back-channel-logout request. Post-fix: provider, perr := gooidc.NewProvider( oidcsvc.SafeOIDCContext(ctx), matched.IssuerURL) The 'oidcsvc' alias for github.com/certctl-io/certctl/internal/auth/oidc is added to the import block (matches the canonical alias used in cmd/server/main.go:29). SafeOIDCContext routes the dial through validation.SafeHTTPDialContext, which re-resolves the issuer host at dial time and refuses reserved-address answers (loopback / link-local / 169.254.169.254 cloud-metadata). Files touched: internal/api/handler/auth_session_oidc_bcl.go — add oidcsvc import + wrap ctx at the NewProvider call site internal/api/handler/auth_session_oidc_bcl_test.go — NEW FILE. TestDefaultBCLVerifier_SSRF_BlocksReservedAddress constructs a stubProviderRepo with IssuerURL='http://127.0.0.1:1' (literal loopback — the IP-literal class that SafeHTTPDialContext. isReservedIPForDial refuses up-front, before any DNS resolution). Hand-rolls a 3-segment JWT whose payload base64url-decodes to {"iss":"<loopback url>"} so peekIssuer extracts the matching issuer and provs.List() returns the seeded provider. Calls Verify and asserts the error wraps the dial-time reserved-address rejection (substring match on 'refusing to dial' / 'reserved address') AND that it's wrapped through the 'provider discovery:' prefix that distinguishes a discovery-time dial failure from a signature-verification failure. docs/operator/auth-threat-model.md — NEW subsection 'Userinfo + BCL SSRF parity (post-SEC-001 follow-up)' under '### Back-channel logout'. Documents both SEC-020 and SEC-021 closures, the context-key shape (why a single SafeOIDCContext wrap covers both go-oidc and oauth2 legs), and the out-of-scope RFC 1918 carve-out (covered separately by acquisition-audit Sprint 5 RED-005). Cross- references the two pinning tests by name so future audits can locate the load-bearing enforcement. Verified: gofmt -l internal/ docs/ (clean) go vet ./... (clean) go test -race -short ./internal/api/handler/... (all green) TestDefaultBCLVerifier_SSRF_BlocksReservedAddress (new; green) All 4 cited CI guards pass. Acceptance grep on the BCL handler: internal/api/handler/auth_session_oidc_bcl.go:132: provider, perr := gooidc.NewProvider(oidcsvc.SafeOIDCContext(ctx), matched.IssuerURL) No bare-ctx NewProvider remains in the BCL verifier. Combined with the SEC-020 commit, every gooidc.NewProvider + Provider.UserInfo call site in the production OIDC + BCL surface now routes through SafeOIDCContext. Closes acquisition-audit SEC-021. Sprint 1 ACQ is complete (2/2 findings). The single sprint shipped as two operator-authored commits (per-finding, mirrors the project's commit cadence for closures).	2026-05-16 16:41:39 +00:00
shankar0123	663b14bfd8	feat(retention): COMP-002-RETENTION — federated-user PII purge pipeline Sprint 6 closure of the audit's MED-severity COMP-002-RETENTION finding. Pre-fix posture: the federated-user admin surface (auth_users.go::Deactivate) sets users.deactivated_at on soft-delete, but the PII columns (email, display_name, oidc_subject) stay populated forever. No in-code primitive for GDPR right-to-be- forgotten; no scheduled retention purge. This commit ships the audit's recommended two-phase fix: Phase 1 — operator-callable scrub primitive internal/service/user_retention.go UserRetentionService.DeleteUserPII(ctx, userID): - revoke all active sessions (defense-in-depth) - email := 'purged@redacted.local' - display_name := '[purged]' - oidc_subject := 'sha256:' \|\| hex(sha256(original)) - audit_events row with action=user.purge_pii, category=auth, actor=system Why hash oidc_subject instead of NULL: 1. (oidc_provider_id, oidc_subject) UNIQUE constraint would trip on multiple purged users converging to NULL 2. The hash is one-way; the original IdP-side identifier is unrecoverable. Re-login under the same subject mints a fresh u-id (right-to-be-forgotten semantics) 3. Forensic continuity: an operator can recompute sha256(<known-subject>) and confirm "this user was deactivated then purged" users.id itself is preserved so historical audit_events.actor = u-X rows still resolve. The forensic- attribution chain stays intact even after the PII is gone. Phase 2 — scheduled batch purge internal/scheduler/scheduler.go UserRetentionPurger interface + userRetentionLoop: - PurgeDeactivatedUsers enumerates every user with deactivated_at < NOW() - retention_window - DeleteUserPII per row - per-tick batch cap (default 200) keeps blast radius predictable; large backlogs spread across multiple ticks - atomic.Bool guard + 5-min per-tick context.WithTimeout Repository contract grew a single new method: internal/repository/user.go::ListDeactivatedBefore(ctx, t) internal/repository/postgres/user.go: SQL-side filter (deactivated_at IS NOT NULL AND deactivated_at < $1) ORDER BY deactivated_at ASC, cross-tenant. Configuration CERTCTL_USER_RETENTION_INTERVAL default 24h CERTCTL_USER_RETENTION_WINDOW default 30 days CERTCTL_USER_RETENTION_BATCH_CAP default 200 Test stub additions for repository.UserRepository.ListDeactivatedBefore: internal/auth/oidc/service_test.go::stubUsers internal/api/handler/auth_users_test.go::stubFullUserRepo internal/api/handler/auth_session_oidc_test.go::stubUserRepo Documentation docs/operator/privacy-and-retention.md - retention pipeline diagram (day-0 deactivate → day-N purge) - operator config table - verification runbook (4 steps with SQL) - what's NOT covered (deferred: DSAR export, api_keys cascade, retroactive audit_events.details redaction) Tests internal/service/user_retention_test.go (NEW, 4 tests): TestDeleteUserPII_ScrubsAndRevokes TestDeleteUserPII_IsIdempotent TestPurgeDeactivatedUsers_RespectsWindow TestPurgeDeactivatedUsers_BatchCap Verified locally: go vet ./... (clean) gofmt -l internal/ cmd/ (clean) go test -short -count=1 \ ./internal/service/... ./internal/scheduler/... ./internal/config/... (all green) Cross-sprint interaction: pairs with COMP-001-HASH (prior commit). The user.purge_pii audit row this service emits flows through the new hash chain, so the scrub event is itself tamper-evident. Closes COMP-002-RETENTION. Sprint 6 is complete (2/2 findings).	2026-05-16 06:18:39 +00:00
shankar0123	43836aca7c	feat(audit): COMP-001-HASH — per-row hash chain on audit_events (tamper-evidence) Sprint 6 closure of the audit's HIGH-severity COMP-001-HASH finding. Pre-fix posture: migration 000018 installs a WORM trigger on audit_events that blocks UPDATE / DELETE for the application role. But the trigger header itself documents a compliance-superuser bypass (backup restore, retention purges, breach recovery). Without a hash chain, that role can rewrite any row's actor / action / details / timestamp / event_category with no on-disk trace. HIPAA §164.312(b), FedRAMP AU-9, NIST 800-53 AU-10 want tamper- EVIDENCE, not just tamper-prevention. This commit ships the evidence layer. Wire shape: migrations/000047_audit_events_hash_chain.up.sql + pgcrypto extension (digest function) + audit_chain_head: single-row sentinel table holding the most recent row_hash; FOR UPDATE row-lock serialises chain writes under concurrent INSERTs so two parallel writers can't read the same prev_hash and produce a forked chain + audit_events: prev_hash + row_hash columns + audit_events_canonical_payload(): centralised hash input builder. UTC + microsecond ISO-8601 keeps the hash session- timezone-independent. All columns separated by '\|' so a concatenation-ambiguity exploit can't fabricate a collision + audit_events_compute_hash_chain(): BEFORE-INSERT trigger function. Reads sentinel FOR UPDATE → computes sha256(prev_hash \|\| id \|\| actor \|\| actor_type \|\| action \|\| resource_type \|\| resource_id \|\| details::text \|\| timestamp_utc_iso \|\| event_category) → writes both columns + advances the sentinel + backfill loop walks every existing row in (timestamp ASC, id ASC) order; WORM trigger temporarily DISABLEd inside this migration's transaction so backfill UPDATEs land cleanly, ENABLEd before COMMIT + audit_events_verify_chain(): STABLE plpgsql verifier. Walks the chain end-to-end and returns the first break: (first_break_id TEXT, first_break_pos INT, row_count INT) internal/repository/postgres/audit.go + AuditRepository.VerifyHashChain — calls the SQL function and maps the OUT parameters to Go return values internal/repository/interfaces.go + AuditRepository.VerifyHashChain in the contract; every in-memory mock + stub picks up the no-op implementation internal/scheduler/scheduler.go + AuditChainVerifier + AuditChainBreakRecorder interfaces + auditChainVerifyInterval (default 6h) + auditChainVerifyLoop: runs once on start + every tick; atomic.Bool guard + 5-min per-tick context timeout match every other GC loop's pattern internal/service/audit_chain_metric.go + AuditChainCounter type with atomic counters. Sticky-first- detection on (BrokenAtID, BrokenAtPos) so the actionable alarm doesn't drift across walks. Snapshot() returns the full state for the metrics handler internal/api/handler/metrics.go + AuditChainCounterSnapshotter interface + Prometheus exposition for four series: certctl_audit_chain_break_detected_total counter (the alarm) certctl_audit_chain_verify_total counter (walks done) certctl_audit_chain_rows gauge (last walk size) certctl_audit_chain_last_verified_at gauge (unix seconds) internal/config/config.go + AuditChainConfig{ VerifyInterval } + CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL cmd/server/main.go + wires AuditChainCounter into both the scheduler (recorder) + metrics handler (snapshotter) — single instance shared so the writer + reader are guaranteed to converge internal/repository/postgres/audit_chain_test.go (NEW) + TestAuditEventsHashChain_FreshTable: empty walk → clean + TestAuditEventsHashChain_AppendLinksRows: three INSERTs produce a strictly-linked chain; prev_hash on row 0 is NULL; verifier walks clean over the 3 rows + TestAuditEventsHashChain_VerifierDetectsTampering: simulate the compliance-superuser threat model (DISABLE WORM, UPDATE a middle row, ENABLE WORM); verifier returns the tampered row's id at position 1 docs/operator/audit-chain.md (NEW) + Layered-defenses explainer (WORM + hash chain). Verifier function reference. Recommended Prometheus alert rule. Performance scaling table (10k to 10M rows). Step-by-step runbook for what to do when a break is detected. Operator configuration table. Test-stub additions for AuditRepository.VerifyHashChain: internal/service/testutil_test.go — mockAuditRepo internal/service/acme_test.go — fakeAuditRepo internal/integration/lifecycle_test.go — mockAuditRepository internal/api/handler/scep_intune_e2e_test.go — intuneE2EAuditRepo Verified locally: go vet ./... (clean) gofmt -l internal/ cmd/ (clean) go test -short -count=1 ./internal/scheduler/... ./internal/config/... ./internal/service/... ./internal/api/handler/... ./internal/repository/... (all green) Verified with testcontainers + postgres:16-alpine + the migration runner (not gated under -short — requires docker): go test -count=1 -run TestAuditEventsHashChain ./internal/repository/postgres/... Closes COMP-001-HASH leg of Sprint 6. COMP-002-RETENTION lands in the next commit (separate concern: federated-user PII retention).	2026-05-16 06:17:15 +00:00
shankar0123	8f2e5771db	fix(middleware): SEC-006 — TTL-evict idle token-bucket rate-limiter entries Sprint 2 unified-master-audit closure. Pre-fix the keyed rate limiter's bucket map had no eviction. The package-level comment explicitly noted the leak: high-cardinality unauthenticated traffic (CGNAT churn, Tor exit lists, botnets, infinite-cardinality scanners) grew process memory unboundedly. Production deploys with millions of unique IPs would eventually OOM. Fix: - RateLimitConfig.BucketTTL (env CERTCTL_RATE_LIMIT_BUCKET_TTL, default 1h, clamp-floor 1m). 1h chosen to be well above realistic operator IP churn windows (returning clients keep their bucket) and well below the unbounded-leak window the pre-fix code allowed. - tokenBucket gains a lastAccess field updated on every allow() call via touch(); reading via lastAccessTime() under the bucket's own mutex. - keyedRateLimiter.sweepLoop runs in a single goroutine per limiter (production wires 2: default + no-auth fallback), waking every BucketTTL/4. sweep() removes any bucket whose lastAccess is older than the cutoff and bumps evictedTotal atomically. - Both NewRateLimiter call sites in cmd/server/main.go (default stack and no-auth fallback) now thread cfg.RateLimit.BucketTTL. Regression coverage: - TestKeyedRateLimiter_SweepEvictsIdleBuckets: 1000 synthetic IP keys populate the map, advance past TTL, call sweep() directly, assert map drained to 0 + evictedTotal=1000 + fresh key creates new bucket (map not poisoned). - TestKeyedRateLimiter_SweepKeepsActiveBuckets: inverse — a bucket touched within the TTL window survives the sweep. Catches a future regression that inverts the cutoff comparison. Closes SEC-006.	2026-05-16 04:01:18 +00:00
shankar0123	7268d12a17	feat(web): close FE-M6 — migrate static inline-style attrs to Tailwind + correct CSP rationale comment Closes frontend-design-audit finding FE-M6 (Med): CSP allows 'unsafe-inline' for `style-src` — necessary today because of inline SVG `style=` attrs (related to FE-H2) ═══════════════════════════ GROUND-TRUTH FINDINGS ═══════════════════ Ground-truth recon found 4 audit-framing errors: (1) The "17 inline-style tsx files" count was stale — actual is 9 (8 after excluding a Layout.tsx comment match the audit's grep counted). (2) The CSP rationale comment at securityheaders.go:35 LIED about WHY 'unsafe-inline' is needed. It claimed "Tailwind (via Vite) injects per-component <style> blocks at build time." Verified against the post-build artifact: `grep -c '<style' dist/index.html` = 0; Vite's CSS output is a single .css file linked via `<link rel="stylesheet">`. The 'unsafe-inline' grant exists for React's `style={...}` attribute model, NOT for Vite or Tailwind. (3) The 9 sites split cleanly into: LOAD-BEARING DYNAMIC (5 sites; can't be Tailwind utilities because values are computed at runtime): - Tooltip.tsx Floating-UI position (left/top px per-tick) - AgentFleetPage.tsx dynamic color+width chart bars - dashboard/charts.tsx Recharts color props - CertificatesPage.tsx progress-bar percent width - IssuerHierarchyPage.tsx depth-based marginLeft STATIC PIXEL VALUES (3 files, ~12 sites; clean Tailwind migration targets): - UsersPage.tsx — filter UI + table styling - DigestPage.tsx — iframe min-height - AuthProvider.tsx — demo-mode banner (4) Fully eliminating 'unsafe-inline' would require either banning dynamic `style={...}` (CSS-in-JS rewrite of the 5 load-bearing sites) or adopting CSP nonces with React 18+'s style runtime. Neither fits the original FE-M6 phase budget. ═══════════════════════════ CHANGES ═══════════════════════════════ web/src/pages/auth/UsersPage.tsx: 9 inline-style attrs → Tailwind utility classes. The filter UI (mb-4, mr-2, w-[280px] p-1), the table (w-full border-collapse), the thead row (border-b-2 border-gray-300 text-left), per-row borders (border-b border-gray-200 + opacity-50/100 conditional), buttons (px-3 py-1), the empty-state cell (p-3 text-center). Behavior-preserving. web/src/pages/DigestPage.tsx: iframe `style={{ minHeight: '600px' }}` → className "min-h-[600px]" (composed into the existing className). web/src/components/AuthProvider.tsx: Demo-mode banner: 6-prop `style={{ background, color, padding, fontSize, fontWeight, textAlign }}` → className "bg-red-700 text-white px-4 py-2 text-[13px] font-semibold text-center". Same visual. internal/api/middleware/securityheaders.go: CSP rationale comment rewritten to accurately describe WHY 'unsafe-inline' is required. New comment: - Names the 5 load-bearing dynamic-style sites explicitly - Lists the 3 static sites that were migrated to Tailwind today - Documents that the OLD comment's "Tailwind/Vite injects <style> blocks" claim was factually wrong (verified against built dist/index.html — zero <style> tags emitted) - Records the future-tightening path (React style-runtime nonces OR CSS-in-JS rewrite of the 5 sites) and notes it doesn't fit the original FE-M6 phase budget ═══════════════════════════ AUDIT FRAMING ════════════════════════ The audit said FE-M6 was about "inline SVG style= attrs (related to FE-H2)." Ground-truth: FE-H2 (Phase 3 Layout SVG → Lucide icons) ALREADY happened; the remaining inline-style sites have nothing to do with SVGs. The audit's bridge from FE-H2 → FE-M6 was a red herring. The OPERATOR-VISIBLE win from this closure: • 3 production tsx files now use Tailwind utility classes for static styling — consistent with the rest of the codebase. • The CSP comment now tells the truth about why 'unsafe-inline' is needed, so the next operator who reads it doesn't waste time hunting for non-existent <style> blocks. • The inline-style attribute surface is reduced to ONLY load-bearing dynamic styling — making any future tightening work (nonces, CSS-in-JS migration) easier to scope. The CSP header itself is UNCHANGED ("style-src 'self' 'unsafe-inline'"). True elimination of 'unsafe-inline' is a separate workstream tracked in the corrected comment. ═══════════════════════════ VERIFICATION ═══════════════════════════ • gofmt -l internal/api/middleware/securityheaders.go — clean • go vet ./internal/api/middleware/... — exit 0 • go test -short -count=1 ./internal/api/middleware/... — ok 0.247s (existing securityheaders_test.go pins the Content-Security-Policy header value byte-string; unchanged by this commit so test stays green) • npx tsc --noEmit — exit 0 • npx vitest run AuthProvider DigestPage UsersPage — 16/16 pass • npx vite build — built in 3.42s Ground-truth: origin/master tip `9ba5ee4` (P-M2 just pushed) verified via GitHub API BEFORE commit. Falsifiable proof: a future engineer reading securityheaders.go:35 sees an accurate explanation of why 'unsafe-inline' is needed, NOT the previous false "Tailwind/Vite" claim.	2026-05-14 20:40:55 +00:00
shankar0123	fc237de357	feat(audit): close P-H2 — server-side `since` / `until` time-range filters Closes frontend-design-audit finding P-H2 (High): AuditPage filters time-range client-side; comment says "server may not support time params" — fetches the entire event window, throws 99% away in JS Ground-truth recon found the closure is much smaller than the audit's "1 day backend + 2 hours frontend" estimate: • repository AuditFilter.From / .To: ALREADY exist in internal/repository/filters.go:57-58 • postgres.AuditRepository.List: ALREADY pushes `timestamp >= since` + `timestamp <= until` predicates into the SQL query (internal/repository/postgres/audit.go:107-116) • Composite index idx_audit_events_category_timestamp on (event_category, timestamp DESC) added in migration 000032 makes the new query hit an index scan • MCP `certctl_audit_list_with_category` tool's docstring already advertises `since` / `until` (internal/mcp/tools_audit_fix.go:174) — but the server silently ignored them, making the published contract a lie The only missing piece was the handler exposing the params + the frontend porting from client-side filtering. ~150 lines total. ═══════════════════════════ CHANGES ═══════════════════════════════ Service (internal/service/audit.go): • New ListAuditEventsByFilter(ctx, since, until, category, page, perPage) threads time bounds into the existing repository. AuditFilter.From / .To fields. • Existing ListAuditEvents + ListAuditEventsByCategory become thin wrappers around the new method with zero times. Handler (internal/api/handler/audit.go): • Interface gains ListAuditEventsByFilter signature. • ListAuditEvents handler parses `since` + `until` RFC3339 query params; 400 on malformed input or `until` not after `since`. • Single dispatch via ListAuditEventsByFilter for ALL request shapes (with or without time bounds, with or without category). Tests (internal/api/handler/audit_handler_test.go): • mockAuditService gains listByFiltFunc + lastFilterSince/Until/ Category trace fields. • 5 new subtests: - TestListAuditEvents_WithSinceUntil — happy path, both bounds - TestListAuditEvents_SinceOnly — one-sided open-ended - TestListAuditEvents_InvalidSince — 400 on garbage - TestListAuditEvents_UntilBeforeSince — 400 on reversed range - TestListAuditEvents_TimeRangePlusCategory — composes with auditor-role category=auth filter Frontend (web/src/pages/AuditPage.tsx): • TIME_RANGES dropdown now sends `since` as RFC3339 (now − N hours) via the existing useQuery params object instead of filtering client-side after the fact. • Pre-P-H2 `filtered = data.data.filter(e => now-ts<N)` block deleted (replaced by `filtered = data?.data \|\| []`); comment documents why for the diff reader. OpenAPI (api/openapi.yaml): • listAuditEvents gains `since` + `until` query-param specs (format: date-time, description, P-H2 closure date). • Description block explains the `since`/`until` vs `from`/`to` naming divergence from the sibling /audit/export endpoint (different param semantics: list = open-ended bounds, export = required ≤ 90-day compliance window). ═══════════════════════════ VERIFICATION ═══════════════════════════ Backend (Go toolchain now wired in sandbox — go1.25.10 ARM64 from .gomodcache, GOCACHE on /tmp partition): • gofmt -l on all touched files: clean • go vet ./... — exit 0 • go test -short -count=1 ./internal/api/handler/... — ok 4.195s (existing 14 subtests + 5 new = 19/19 pass) • go test -short -count=1 ./internal/service/... — ok 4.733s • staticcheck ./internal/api/handler/... ./internal/service/...: zero findings Frontend: • npm ci — 634 packages, exit 0 (resolves cleanly post-Hotfix #9) • npx tsc --noEmit — exit 0 • npx vitest run src/pages/AuditPage.test.tsx — 4/4 pass • npx vite build — built in 3.49s Ground-truth: origin/master tip `b22cdb3` verified via GitHub API BEFORE commit per the operating rule. ═══════════════════════════ RELATED NOTES ════════════════════════ • AuditPage's `resource_type` / `actor` / `action` query params are ALSO silently ignored by the server today — the handler doesn't parse them. That's a separate latent gap (the audit only flagged the time filter); tracked as a follow-up for the next audit-handler pass. Not scope-creeping into this commit. • The `total` returned by ListAuditEventsByFilter is len(result), not a separate COUNT(*) query — same limitation as before; when the page ports to server-side cursoring the repository will need a CountAuditEvents(filter) method. Documented in the service comment.	2026-05-14 19:35:51 +00:00
shankar0123	03f0e08a77	fix(middleware): Hotfix #14 — staticcheck QF1008 from Hotfix #12 CI run #571 (commit `af5c392`, "Hotfix #12 — CodeQL #34 go/reflected-xss in etag.go") failed: internal/api/middleware/etag.go:261:11: QF1008: could remove embedded field "ResponseWriter" from selector (staticcheck) hdr := r.ResponseWriter.Header() Root cause: etagRecorder embeds http.ResponseWriter: type etagRecorder struct { http.ResponseWriter body *bytes.Buffer status int headerWritten bool headerWrittenOnWire bool bodyTruncated bool } etagRecorder DOES override Write() and WriteHeader() — those buffer / track instead of writing through. So r.ResponseWriter.Write(b) and r.ResponseWriter.WriteHeader(s) ARE intentional embedded-field selectors (calling the recorder's own Write would recurse infinitely; calling its WriteHeader would skip the wire flush). staticcheck recognizes those as load-bearing and doesn't flag. But etagRecorder does NOT override Header(). So r.ResponseWriter.Header() and r.Header() are equivalent — staticcheck QF1008 wants the shorter form. The Hotfix #12 change added a new r.ResponseWriter.Header() that I missed. Fix: Change r.ResponseWriter.Header() → r.Header() at line 261 (the Content-Type defense added in Hotfix #12). Behavior is byte- identical: r.Header() is the promoted method from the embedded ResponseWriter. Added a comment block immediately above the fix explaining why the neighboring r.ResponseWriter.WriteHeader / r.ResponseWriter.Write calls intentionally KEEP the explicit selector (overridden methods → embedded form required to bypass recursion). Future engineers won't get confused by the asymmetric pattern. Hotfix #13 (signer FileDriver path-injection — local commit `38f86bc`, not yet pushed) does NOT have the same risk: FileDriver has no embedded struct / interface, only direct fields, so QF1008 can't apply. Verification (sandbox constraints — Go unavailable): • Manual syntax inspection: brace count balanced (27/27), paren count balanced (53/53). Diff +9/-1. • No remaining r.ResponseWriter.Header() in the file (verified via grep — empty match). • All 48 CI guards pass. • Other CI noise on run #571 (windows-latest syscall.Stat_t, Node.js 20 deprecation warnings) is PRE-EXISTING and not introduced by either Hotfix #12 or #13 — see the failure log: undefined: syscall.Stat_t fires in internal/deploy/ownership.go which neither hotfix touched. Ground-truth: origin/master tip `af5c392` verified via GitHub API. Local is at `38f86bc` (Hotfix #13) which the operator hasn't pushed yet; this commit lands on top. After push the order is: `af5c392` → `38f86bc` → <this>. Operator: please run `make verify` from the repo root before pushing — sandbox can't run staticcheck/go vet/go test.	2026-05-14 19:12:43 +00:00
shankar0123	af5c39252f	fix(middleware): Hotfix #12 — CodeQL #34 go/reflected-xss in etag.go CodeQL alert #34 (severity: HIGH, rule: go/reflected-xss) fired on commit `8191b1e` (Phase 6 SCALE-L2 ETag middleware): internal/api/middleware/etag.go:220 return r.ResponseWriter.Write(b) "Cross-site scripting vulnerability due to user-provided value." Root cause (analysis): The etagRecorder type buffers response bytes from the wrapped handler so the ETag middleware can hash the body before deciding 304-vs-200. On the over-sized-response truncation path (body > 64 KiB), bytes are forwarded directly to the underlying ResponseWriter at line 220. CodeQL's data-flow query traces: *http.Request (source: user input) → handler reads query/path/body → handler echoes data into the JSON response payload (a cert's common_name, an audit row's actor display name, etc.) → json.NewEncoder(w).Encode(...) calls w.Write([]byte) → etagRecorder.Write forwards to r.ResponseWriter.Write(b) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ sink — CodeQL flags reflected-XSS CodeQL can't see that the wrapped handler set Content-Type: application/json via handler.JSON() before any byte was written; it sees a generic byte forwarder writing to an http.ResponseWriter with no proximate Content-Type guarantee. Browsers don't interpret application/json as HTML — so this is technically a false positive — but the data-flow path is real and a future handler that forgets to set Content-Type would convert it into a real vuln (browsers can content-sniff a JSON body as text/html when Content-Type is absent). Fix (defense-in-depth, not just suppression): Add an explicit Content-Type guard at writeHeadersToWire() — the centralized chokepoint that ALL wire-write paths funnel through (line 213 in Write's truncation branch, line 258 in flush's main branch). If Content-Type is unset at this point, default to "application/json; charset=utf-8". This: 1. Makes the Content-Type invariant the middleware relies on explicit at the sink, which is the standard pattern CodeQL's go/reflected-xss recognizes as "validated before write". 2. Adds REAL defense-in-depth: a hypothetical future handler wired through ETag that forgot Content-Type can no longer expose a content-sniff vuln. The middleware enforces the safe shape at the boundary. 3. Is behavior-preserving for the 5 current consumers — every wrapped list endpoint (/api/v1/{certificates,agents,jobs, audit,discovered-certificates}) routes JSON responses through handler.JSON() at internal/api/handler/response.go:60, which already sets Content-Type: application/json. Path is no-op for them. Why not a simpler approach: • Removing line 220 (refactor to avoid the data-flow): the truncation path is required behavior — once buffer > 64 KiB the middleware degrades to no-caching pass-through, which requires writing the body bytes to the wire. The data flow is structural. • html.EscapeString(b) before write: would corrupt JSON. Wrong encoder for the content type. • Bare CodeQL suppression comment: closes the alert without actually addressing the latent bug a future handler could create. Defense-in-depth is the operator's stated preference per the CLAUDE.md "always take the complete path" principle. Verification (sandbox constraints disclosed honestly): • Manual syntax inspection — diff is 21-line additive, all inside writeHeadersToWire(). Brace count balanced (27/27), paren count balanced (53/53). No imports changed (http.Header API was already in use). • CI guards: all 48 pass locally. • Existing etag_test.go has 10 contract tests covering: ETag emit on GET, 304-on-If-None-Match, 200-on-mutation, POST bypass, 5xx/4xx pass-through, OversizedResponse degradation, wildcard match, HEAD parity, PassThrough body preservation. Behavior analysis (see commit body): every test either (a) has the handler set Content-Type explicitly (no-op for the new guard) or (b) goes through the 304-direct-write path in ETag() which bypasses the recorder entirely. All 10 tests should remain green when `make verify` runs on workstation. • Go toolchain NOT available in sandbox (no `go vet` / `go test` / `golangci-lint` / `staticcheck`). Disk pressure on the shared /sessions partition (166 MB free of 9.8 GB) prevented installing Go for this run. The CLAUDE.md operating rule allows this fallback path provided the verification gap is disclosed and the operator runs `make verify` on workstation BEFORE pushing. Operator: please run `make verify` from the repo root on your workstation before pushing. The change is minimal + additive, but the Go test suite should be the final green-light. Falsifiable proof for the next CodeQL scan: alert #34 should auto-close on the next push to master once the post-fix run sees the Content-Type setter precede every Write to the wire. Ground-truth: origin/master tip `6c00f7b` verified via GitHub API BEFORE commit per the operating rule.	2026-05-14 19:03:50 +00:00
shankar0123	a41fc2d75c	feat(ratelimit): Phase 13 Sprint 13.3 — wire backend selector + scheduler janitor + docs + helm (ARCH-M1 closure complete) Phase 13 Sprint 13.3 — the completion half of the ARCH-M1 substantive close. Sprint 13.2 shipped the Postgres-backed sliding-window limiter + multi-replica integration test; Sprint 13.3 wires the 6 call sites in cmd/server/main.go through the operator- chosen backend selector, adds the rate_limit_buckets scheduler janitor sweep, rewrites the observability doc, exposes the env-var in the helm chart, and promotes the multi-replica integration test to a required CI status check. Signature ground-truth (sprint 13.2 + 13.3) =========================================== Prompt-template signatures: `Allow(key string) error` and "5 call sites." Actual repo: `Allow(key string, now time.Time) error` and 6 NewSlidingWindowLimiter call sites in cmd/server/main.go (the prompt miscounted the second EST per-principal arm). Per CLAUDE.md "the repo is truth," matched the live shape. What changed ============ internal/config/server.go (+40 LOC): - Added `SlidingWindowBackend string` + `SlidingWindowJanitorInterval time.Duration` to RateLimitConfig with full operator-facing documentation of the two valid values (memory\|postgres) + when-to-use-which decision tree. internal/config/config.go (+27 LOC): - Load() reads CERTCTL_RATE_LIMIT_BACKEND (default "memory") + CERTCTL_RATE_LIMIT_JANITOR_INTERVAL (default 5m). - Validate() rejects anything other than ""/"memory"/"postgres" (empty = memory equivalence for test-built Configs that bypass Load()). Janitor interval must be ≥ 1 minute when set. - Failure modes return clear ::error:: with the env-var name + the valid values, so an operator typo ("postgress" → memory in a 3-replica cluster) fails fast at startup. internal/ratelimit/factory.go (NEW, 67 LOC): - NewLimiter(backend, db, maxN, window, mapCap) Limiter — single factory the 6 cmd/server/main.go call sites route through. - Drop-in signature: same maxN/window/mapCap as NewSlidingWindowLimiter (mapCap accepted + ignored for postgres — the rate_limit_buckets table grows until the janitor sweeps). - Defensive panic on unknown backend (config.Validate is SoT; this is belt-and-suspenders). internal/ratelimit/postgres_gc.go (NEW, 73 LOC): - PostgresGC struct + NewPostgresGC + GarbageCollect. - Single-statement DELETE FROM rate_limit_buckets WHERE updated_at < NOW() - maxWindow. Idempotent. - maxWindow <= 0 is a no-op (operator opt-out). internal/scheduler/scheduler.go (+90 LOC): - New RateLimitGarbageCollector interface (mirrors the ACMEGarbageCollector / SessionGarbageCollector contracts). - rateLimitGC field + rateLimitGCInterval + rateLimitGCRunning on Scheduler. - SetRateLimitGarbageCollector(gc) + SetRateLimitGCInterval(d) Setters following the existing acmeGC/sessionGC pattern. - rateLimitGCLoop() — JitteredTicker + atomic.Bool guard + per-tick context.WithTimeout(1m). Logs row count at Debug. - Loop counted in the Start() WaitGroup only when the GC is non-nil; cmd/server/main.go skips SetRateLimitGarbageCollector when backend=memory so the loop never launches for that case. cmd/server/main.go (35 LOC diff): - All 6 ratelimit.NewSlidingWindowLimiter call sites now route through ratelimit.NewLimiter(cfg.RateLimit.SlidingWindowBackend, db, ...). Grep verification post-fix returns ZERO hits. - Six sites: breakglass loginLimiter (580), ocspLimiter (1003), exportLimiter (1068), EST failed-basic (1535), EST per-principal SCEP-mTLS arm (1591), EST per-principal SCEP arm (1613). The intune.NewPerDeviceRateLimiter site at line 1823 stays unmoved — its inner type-alias wrapper is the prompt's out-of-scope (cmd/server/.go only). - Conditionally constructs PostgresGC + wires the scheduler janitor when backend=postgres; logs the wiring decision either way so operators see "rate-limit GC sweep enabled (postgres backend)" or "in-memory backend self-prunes" in the boot log. internal/api/handler/{est,export,certificates,auth_breakglass}.go: - Replaced 5 ratelimit.SlidingWindowLimiter field/Setter types with ratelimit.Limiter (the interface). Allow() satisfies the same call shape on both backends; the in-memory tests that construct SlidingWindowLimiter still compile because the concrete type satisfies the interface (compile-time check in internal/ratelimit/limiter.go pins this). docs/operator/observability.md (176 LOC diff): - Replaced the "per-process, in-memory, reset-on-restart, not shared across replicas" paragraph with the new configurable-backend section: operator decision tree, backend internals (memory vs postgres), janitor description, falsifiable closure proof (the Sprint 13.2 integration test name + invocation), helm chart wiring example. - Updated inventory to reflect the actual handler file paths + actual cap configurations (the prior doc said "60s window" for several limiters that actually use 60m / 24h windows). - Doc smoke confirmed: grep -c 'per-process, in-memory, reset-on-restart' docs/operator/observability.md = 0. deploy/helm/certctl/values.yaml + templates/server-configmap.yaml + templates/server-deployment.yaml: - Exposed server.rateLimiting.backend (default "memory") + server.rateLimiting.janitorInterval (default "5m") under the existing rateLimiting block. - ConfigMap renders both as rate-limit-backend + rate-limit-janitor-interval keys. - Deployment wires CERTCTL_RATE_LIMIT_BACKEND + CERTCTL_RATE_LIMIT_JANITOR_INTERVAL env vars from the configmap. - Helm render: `helm template deploy/helm/certctl --set server.rateLimiting.backend=postgres` shows the env-var on the server-deployment.yaml output. .github/workflows/ci.yml (+12 LOC): - Added a new step in the Go Build & Test job that runs the Sprint 13.2 multi-replica integration test (TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas) with -tags=integration -race -timeout=300s. Fails the CI status check if the cross-replica row lock ever stops arbitrating across replicas — the ARCH-M1 closure regression gate. Verification (all green locally; postgres integration via CI) ============================================================ $ grep -nE 'NewSlidingWindowLimiter' cmd/server/.go (zero hits — Sprint 13.3 receipt) $ go test -short -count=1 \ ./internal/config/... ./internal/ratelimit/... \ ./internal/scheduler/... ./internal/api/handler/... \ ./cmd/server/... ok internal/config 1.177s ok internal/ratelimit 0.007s ok internal/scheduler 9.165s ok internal/api/handler 6.245s ok cmd/server 0.390s $ staticcheck ./internal/ratelimit/... ./internal/scheduler/... \ ./internal/config/... ./internal/api/handler/... ./cmd/server/... (clean) $ gofmt -l internal/ cmd/server/ (clean) $ grep -c 'per-process, in-memory, reset-on-restart' \ docs/operator/observability.md 0 (doc smoke — the audit's verbatim phrasing is gone) $ bash scripts/ci-guards/G-3-env-docs-drift.sh G-3 env-docs-drift: clean. $ bash scripts/ci-guards/complete-path-config-coverage.sh OK — every CERTCTL_* env var (197) has at least one non-config- package consumer. Selector contract verified — config.Validate() rejects any value other than ""/memory/postgres at startup with a clear error message. Sprint 13.4 next (ARCH-H1 OpenAPI authoring batch 1) is on a different axis; ARCH-M1 closure is complete with this commit modulo the Sprint 13.7 audit-HTML flip + zero-floor pin. Closes: ARCH-M1 substantive remediation. The cross-replica rate- limit-cap-enforcement gap that the audit recommended deferring to v3 is closed; operators with server.replicas > 1 flip CERTCTL_RATE_LIMIT_BACKEND=postgres and get exactly-cap enforcement across the cluster (proved by the multi-replica integration test now gating CI).	2026-05-14 11:52:13 +00:00
shankar0123	cd374b243e	refactor(handler): split auth_session_oidc.go by handler-section (Phase 9, 11 of N) Phase 9 ARCH-M2 closure Sprint 11. Splits internal/api/handler/auth_session_oidc.go (was 1577 LOC, the fifth-largest backend hotspot from the original audit) via the Option B sibling-file pattern — new files stay in `package handler` so every external caller of `handler.AuthSessionOIDCHandler.{LoginInitiate, LoginCallback, BackChannelLogout, Logout, ListSessions, RevokeSession, RevokeAllExceptCurrent, ListProviders, CreateProvider, UpdateProvider, DeleteProvider, TestProvider, RefreshProvider, ListGroupMappings, AddGroupMapping, RemoveGroupMapping}` and `handler.{DefaultBCLVerifier, NewDefaultBCLVerifier, DefaultBCLVerifierMaxAge}` resolves the same way. Pure mechanical relocation; no signature, no behavior, no import-graph change. Section-based split (Option B + audit's verb prescription) ========================================================== The audit's Tasks-Deferred row prescribed splitting "per handler verb (login / callback / refresh / logout / backchannel)." The file itself documents a three-section layout in its package doc-comment: 1. Public OIDC handshake (auth-exempt) 2. Session management (RBAC-gated) 3. OIDC provider + group-mapping CRUD (RBAC-gated) Going strictly verb-by-verb would have: - mis-grouped RefreshProvider (which is an ADMIN op on a provider's signing-key cache, not a session refresh — same auth.oidc.edit permission as Update/Delete); - split LoginInitiate + LoginCallback into separate files despite them sharing the state cookie + pre-login row flow; - left the other 9 handlers (Sessions, Provider CRUD, Group Mappings) with no obvious home. Sprint 11 follows the file's own self-described section split plus a fourth file for the DefaultBCLVerifier, which the original file already kept under a separate banner. What moved ========== New `internal/api/handler/auth_session_oidc_handshake.go` (391 LOC) — Section 1 / Public OIDC handshake handlers (auth-exempt): - LoginInitiate (GET /auth/oidc/login?provider=<id>) - LoginCallback (GET /auth/oidc/callback?code=...&state=...) - BackChannelLogout (POST /auth/oidc/back-channel-logout) - Logout (POST /auth/logout) New `internal/api/handler/auth_session_oidc_sessions.go` (208 LOC) — Section 2 / Session-management handlers (RBAC-gated): - sessionResponse projection type + sessionToResponse mapper - ListSessions (GET /api/v1/auth/sessions) - RevokeSession (DELETE /api/v1/auth/sessions/{id}) - RevokeAllExceptCurrent (DELETE /api/v1/auth/sessions/all-except-current) New `internal/api/handler/auth_session_oidc_crud.go` (470 LOC) — Section 3 / OIDC provider + group-mapping CRUD (RBAC-gated): - oidcProviderResponse + oidcProviderRequest projection types, providerToResponse mapper - ListProviders / CreateProvider / UpdateProvider / DeleteProvider / TestProvider / RefreshProvider - groupMappingResponse + groupMappingRequest projection types, mappingToResponse mapper - ListGroupMappings / AddGroupMapping / RemoveGroupMapping New `internal/api/handler/auth_session_oidc_bcl.go` (225 LOC) — DefaultBCLVerifier (handler's default implementation of the BackChannelLogoutVerifier interface declared in auth_session_oidc.go): - DefaultBCLVerifierMaxAge constant - DefaultBCLVerifier struct + NewDefaultBCLVerifier - WithMaxAge builder - Verify (the OpenID Connect Back-Channel Logout 1.0 §2.6 verification: events claim, iat window, algorithm allowlist, audience match, sub/sid/jti decode) - peekIssuer unexported helper What stays in auth_session_oidc.go (452 LOC, down from 1577) ============================================================ - Package + import block. - Service-layer interface projections (OIDCAuthHandshaker, SessionMinter, BackChannelLogoutVerifier) — declared once and consumed by every section. - SessionCookieAttrs config struct. - AuthSessionOIDCHandler struct + permissionChecker / BCLReplayConsumer / AuditRecorder interfaces + NewAuthSession- OIDCHandler constructor + the WithPermissionChecker / WithBCLReplayConsumer builder methods. - The shared helpers consumed across multiple sections: encryptClientSecret, recordAudit, clearPreLoginCookie, clearSessionCookies, clientIPFromRequest, classifyOIDCFailure, randomB64URLForHandler, defaultIfBlank, defaultIntIfZero. Side-effect import cleanup ========================== Four imports drop from auth_session_oidc.go as a clean side effect of the cut: - "encoding/json" (used only in CRUD + BCL — moved out) - "fmt" (used only in BCL — moved out) - gooidc "github.com/coreos/go-oidc/v3/oidc" (used only in BCL — moved out) - oidcdomain "github.com/certctl-io/certctl/internal/auth/oidc/domain" (used in handshake + CRUD + BCL — moved out) Per-import audit on every new sibling file is in the commit's diff: each carries only the imports its extracted code actually consumes. Net effect ========== auth_session_oidc.go: 1577 → 452 LOC (-1,125 = -71.3%). Four new sibling files at 1,294 LOC total (1,125 moved + ~169 of header + Phase 9 doc-comment overhead). The original hotspot drops below the cmd/agent/main.go target for Sprint 12 (1489 LOC). Cumulative Phase 9 progress (top 5 hotspots) ============================================ config.go 3403 → 1342 (-60.6%, Sprints 1-7) cmd/server/main.go 2966 → 2260 (-23.8%, Sprints 8 + 8b) service/acme.go 1965 → 1162 (-40.9%, Sprints 9 + 9b) mcp/tools.go 1867 → 109 (-94.2%, Sprint 10) auth_session_oidc 1577 → 452 (-71.3%, Sprint 11) TOTAL across 5 files: 11,778 → 5,325 LOC = -6,453 (-54.8%) Behavior preservation contract ============================== 1. gofmt -l clean across all 5 affected files. 2. go vet ./internal/api/handler/... — no findings. 3. staticcheck ./internal/api/handler/... — no findings. 4. go test -short -count=1 ./internal/api/handler/... — green (includes the 1,439-line auth_session_oidc_test.go suite that pins every moved handler's behavior including BCL replay, CSRF rotation, audit emission, and the Phase-5 RBAC path). 5. Broader-importer build green: go build ./... . 6. Broader-importer tests green: go test -short -count=1 ./cmd/server/... ./internal/api/router/... . cmd/server/main.go consumes handler.DefaultBCLVerifier + handler.NewDefaultBCLVerifier + handler.DefaultBCLVerifierMaxAge across three call sites; all three resolve unchanged through Go's same-package public-export mechanism (the type + constructor moved to a sibling file in the same `handler` package). The mcp/tools_auth_bundle2.go comment string referencing "oidcProviderRequest" is descriptive prose, not an import. What remains for Phase 9 ======================== One sibling-file split queued: - Sprint 12: cmd/agent/main.go (1489 LOC) → main + poll + deploy + register sibling files in same cmd/agent package (mirrors the cmd/server pattern from Sprints 8 + 8b). Refs: ARCH-M2 (god-files), Phase 9 audit. Sprint 11 closes the auth-session-OIDC handler hotspot from the audit's top-5 list.	2026-05-14 10:22:33 +00:00
shankar0123	51529ea609	fix(router): invert ETag wrap so rbacGate stays outer — close CRIT-1 ratchet CI run on master@0ad881c2 failed TestRouterRBACGateCoverage on five routes: GET /api/v1/agents GET /api/v1/audit GET /api/v1/certificates GET /api/v1/discovered-certificates GET /api/v1/jobs These are the five top-5 read endpoints that Phase 6 SCALE-L2 (commit `8191b1ee`) wrapped with the new etagged() helper. The existing rbacGate wrap was preserved INSIDE the etagged() call: r.Register("GET /api/v1/certificates", etagged(rbacGate(reg.Checker, "cert.read", reg.Certificates.ListCertificates))) Functionally this is safe (the rbacGate still runs at request time; the ETag middleware emits ETag only on 2xx, so 401s/403s never get cached), but it FAILS the AST-based RBAC coverage test introduced by the 2026-05-10 auth-bundle audit (CRIT-1). That test walks router.go's `r.Register(route, handler)` calls and asserts the second argument is either `rbacGate(...)` or `rbacGateScoped(...)` or that the route is in `authExemptRoutes` / matches a `protocolPrefixes` entry. With `etagged()` as the outer wrap, the test's AST inspection sees `etagged(...)` and counts the route as ungated. CRIT-1's standing rule (test header): "Removing an existing rbacGate wrap requires either (a) moving the route to authExemptRoutes here, or (b) demonstrating the new approach in the commit body." Phase 6 did neither — the rbacGate wrap was demoted from outer to inner without an authExemptRoutes entry and without the test being taught about the new shape. This is exactly the regression the CRIT-1 ratchet is designed to catch. Root cause: rbacGate's signature is func rbacGate(checker, perm string, h http.HandlerFunc) http.Handler and etagged's signature was func etagged(h http.Handler) http.Handler so etagged COULD wrap rbacGate but rbacGate could NOT wrap etagged (the third arg type didn't match). Phase 6 took the type-easy path; this hotfix takes the security-correct path. Fix ==== Rename `etagged()` → `etaggedFunc()` and change its signature to `http.HandlerFunc → http.HandlerFunc` so it can be used INSIDE the rbacGate call: r.Register("GET /api/v1/certificates", rbacGate(reg.Checker, "cert.read", etaggedFunc(reg.Certificates.ListCertificates))) New runtime order: request → rbacGate → etaggedFunc → handler Unauthenticated requests now bounce at HTTP 403 BEFORE the response-buffering ETag middleware ever runs. The SHA-256-over-body cost only applies to authenticated 2xx responses — also a small perf win on top of fixing the lint. The internal implementation reduces to: func etaggedFunc(h http.HandlerFunc) http.HandlerFunc { return middleware.ETag(h).ServeHTTP } middleware.ETag itself is unchanged. The five call sites swap wrap order; everything else stays identical. Pattern lesson ============== golangci-lint and staticcheck check different layers; the AST-based TestRouterRBACGateCoverage is ANOTHER layer (a Go test, not a linter) that the local `go test ./internal/api/router/...` step would have caught. Phase 6's pre-commit verification ran `go test ./internal/scheduler/ ./internal/api/middleware/` explicitly but missed `./internal/api/router/` — which is where this test lives. Future commits that touch router.go MUST run `go test ./internal/api/router/... -count=1` before push. Adding this to the standing pre-commit rule alongside the "`golangci-lint run` AND `staticcheck` BOTH must pass" rule from the previous hotfix. Verification: go build ./internal/api/router/... → ok go test ./internal/api/router/... -count=1 -short → ok (TestRouterRBACGateCoverage passes) go test ./internal/api/router/... \ ./internal/api/middleware/... -count=1 -short → ok (router + ETag tests both green) staticcheck ./internal/api/router/... \ ./internal/api/middleware/... → clean gofmt -l internal/api/router/router.go → clean Closes: CI failure run on master@0ad881c2 — TestRouterRBACGateCoverage	2026-05-14 03:32:14 +00:00
shankar0123	0ad881c2bd	fix(lint): U1000 — delete dead etagRecorder.sentinelMarker method CI run on master@ed60059e (Phase 6 + lint hotfix) still red. The golangci-lint step now passes cleanly (0 issues — yesterday's ST1021 fix landed), but the workflow also has a SEPARATE `staticcheck ./...` step at the end that runs raw staticcheck without golangci-lint's directive-resolution layer: internal/api/middleware/etag.go:254:24: func (etagRecorder).sentinelMarker is unused (U1000) Root cause: Phase 6's etag.go shipped a dead no-op method `func (r etagRecorder) sentinelMarker() {}` with a `//nolint:unused` directive. golangci-lint's `unused` linter respects the directive; raw staticcheck's U1000 does NOT — `//nolint:` is a golangci-lint convention, not a staticcheck convention (staticcheck uses `//lint:ignore U1000 reason` syntax). The comment claimed the method "anchors" documentation about the `headerWrittenOnWire` field. Reading the actual code: the field is used directly in `writeHeadersToWire` (line 241); the method is pure dead code with a misleading comment. Deleting it loses nothing — the sentinel field stays where it's needed. Pattern lesson logged in the Tasks-Deferred table: golangci-lint's `//nolint:LINTER` directive is a golangci-lint invention. Raw staticcheck (or any underlying linter run outside golangci-lint) ignores it. The certctl workflow runs BOTH golangci-lint AND a standalone `staticcheck ./...` step, so any future `//nolint:unused` / `//nolint:staticcheck` use needs to be paired with `//lint:ignore U1000` (or equivalent) for staticcheck to honor it — OR the code should be deleted / exported / actually used. Verification: staticcheck ./... → exit 0, no output (mirrors CI's invocation) go vet ./internal/api/middleware/... → clean go test ./internal/api/middleware/... -count=1 -short → ok (0.25s) gofmt -l → clean Closes: CI run on master@ed60059e U1000 lint failure	2026-05-14 03:11:57 +00:00
shankar0123	8191b1ee64	scheduler+db: close Phase 6 — scale hardening across pool, jitter, ETag, asyncpoll Phase 6 of the certctl architecture diligence remediation. Five findings across the same scheduler-and-DB-pool surface. SCALE-M1 (Med) — DB pool default bumped 25 → 50 internal/config/config.go line 1972: MaxConnections: getEnvInt("CERTCTL_DATABASE_MAX_CONNS", 50) Postgres default max_connections is 100; 50 leaves headroom for pg_dump + ad-hoc psql + a server replica without exhausting the DB-side cap. Operator override env var unchanged. Operator-tune ladder for larger fleets (5K / 50K certs) lives in docs/operator/scale.md as starter values pending Phase 8 load tests — explicitly marked TBD. SCALE-M3 (Med) — async-CA poll budget operator-configurable Live state was partially-already-shipped: all 4 async-CA connectors (digicert, entrust, globalsign, sectigo) already have per-connector CERTCTL_<NAME>_POLL_MAX_WAIT_SECONDS (Audit fix #5 closed pre-Phase-6). What was missing: a global package-default override. Shipped: - internal/connector/issuer/asyncpoll/asyncpoll.go gains SetDefaultMaxWait(d) + effectiveDefaultMaxWait var + the currentDefaultMaxWait() priority resolver. - cmd/server/main.go reads CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS at boot and calls SetDefaultMaxWait. - deploy/ENVIRONMENTS.md documents the new env var (G-3 guard green). Naming deviation from the prompt's CERTCTL_ASYNC_POLL_MAX_ATTEMPTS: the live code tracks wall-clock time (MaxWait), not attempt count. Matched the existing per-connector nomenclature (_POLL_MAX_WAIT_SECONDS) so the priority chain reads naturally. SCALE-M5 (Med) — JitteredTicker wrapper for all 15 scheduler loops internal/scheduler/jitter.go ships NewJitteredTicker(interval, jitterPct) + DefaultSchedulerJitter (±10%). All 15 sites in internal/scheduler/scheduler.go migrated from bare time.NewTicker to NewJitteredTicker(interval, DefaultSchedulerJitter). Base intervals unchanged; only the per-tick envelope adds ±10% randomized delay so multiple loops with the same nominal cadence don't co-fire and spike CPU + DB at wall-clock boundaries. internal/scheduler/jitter_test.go pins: - Bounded envelope (each tick within ±jitterPct of interval) - Mean drift < 30% of nominal (sign-bug detector) - Stop() releases the goroutine + closes C - Stop() idempotent (no panic on repeat) - Zero-jitter behaves like time.NewTicker - Negative and >=1 jitterPct values clamped defensively CI guard scripts/ci-guards/no-bare-newticker-in-scheduler.sh blocks any future bare time.NewTicker in scheduler.go. SCALE-L1 (Low) — renewal-sweep semaphore behavior documented docs/operator/scale.md "Scheduler tick budgets" section explains the per-tick concurrency semaphore (CERTCTL_RENEWAL_CONCURRENCY=25 default), the ctx-cancellation drain on tick-budget overrun, and operator tuning advice (raise concurrency + DB pool together). No code change — the behavior is defensible as-is per the audit. SCALE-L2 (Low) — ETag middleware for top-5 read endpoints internal/api/middleware/etag.go computes SHA-256 ETag over the buffered response body, respects If-None-Match, short-circuits to 304 Not Modified on match. GET/HEAD only; non-2xx responses pass through unchanged. 64 KiB buffer cap degrades gracefully on oversized responses (no caching, body still flushes intact). Wired around the top-5 read endpoints via etagged() helper in internal/api/router/router.go: GET /api/v1/certificates GET /api/v1/agents GET /api/v1/jobs GET /api/v1/audit GET /api/v1/discovered-certificates internal/api/middleware/etag_test.go pins 11 behaviors including 304-on-repeat, 200-after-mutation-with-new-ETag, POST bypass, 4xx/5xx pass-through, oversized-response degradation, wildcard match, HEAD-treated-like-GET, byte-equal pass-through. Cross-cutting fixes: - internal/config/config_test.go::TestLoad_DefaultValues updated to assert the new 50 default (was 25). - deploy/helm/certctl/values.yaml comment corrected — agent pollInterval is hardcoded 30s, not env-configurable; the Phase 4 comment mistakenly referenced CERTCTL_AGENT_POLL_INTERVAL which G-3 caught as a phantom env var. - asyncpoll.go reformatted by gofmt; functionally unchanged. Verification (all pass): grep -nE 'SetMaxOpenConns' internal/repository/postgres/db.go # finds 1 site grep -nE 'CERTCTL_DATABASE_MAX_CONNS.*50' internal/config/config.go # config default is 50 grep -rnE 'CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS' internal/ deploy/ENVIRONMENTS.md # wired grep -cE 'time\.NewTicker\(' internal/scheduler/scheduler.go # 0 (all migrated) grep -cE 'JitteredTicker' internal/scheduler/scheduler.go # 15 ls internal/scheduler/jitter.go internal/api/middleware/etag.go # both exist ls docs/operator/scale.md # exists bash scripts/ci-guards/no-bare-newticker-in-scheduler.sh # clean bash scripts/ci-guards/G-3-env-docs-drift.sh # clean go test ./internal/scheduler/ ./internal/api/middleware/ \ ./internal/connector/issuer/asyncpoll/ ./internal/config/ # 4/4 packages green Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M1 cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M3 cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M5 cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L1 cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L2	2026-05-14 01:23:03 +00:00
shankar0123	21aeed4f4e	legal: addlicense headers + normalize legacy variants (Phase 0 RED-4) Phase 0 closure (Path B2, post-rewrite): addlicense sweep — adds the canonical certctl LLC copyright + BUSL-1.1 SPDX header to every production Go file. Template: // Copyright 2026 certctl LLC. All rights reserved. // SPDX-License-Identifier: BUSL-1.1 Coverage: 338 / 338 production Go files (cmd/ + internal/, excluding _test.go and /testdata/). Pre-sweep coverage was 22 / 338 (6.5%); post-sweep is 338 / 338 (100%). Normalized 22 pre-existing legacy headers (`// Copyright (c) certctl` + `// SPDX-License-Identifier: BSL-1.1`) and 1 file using a `Certctl Contributors` attribution. The legacy SPDX ID `BSL-1.1` is non-standard; the official SPDX identifier for Business Source License 1.1 is `BUSL-1.1` (capital U). All 338 files now share the canonical form. Generated via: addlicense -c "certctl LLC" -y 2026 \ -f cowork/legal/copyright-header.tpl \ -ignore '/testdata/' -ignore '/_test.go' \ cmd/ internal/ Verification: find cmd internal -name '.go' -not -name '_test.go' \ -not -path '/testdata/' \ -exec grep -L '^// Copyright 2026 certctl LLC' {} \; \| wc -l Returns: 0 gofmt clean. Header additions are comments only, no compile impact. Closes: cowork/certctl-architecture-diligence-audit.html#fix-RED-4	2026-05-13 21:23:35 +00:00
shankar0123	596e675ec7	fix(security): close BUNDLE 5 — auth, OIDC, MCP, API + browser security edges Bundle 5 closure (2026-05-13 acquisition diligence audit). 13-finding security audit pass across the auth / OIDC / MCP / API / browser- security surface. Five real closures shipped in code, two false-as- stated findings annotated with the existing implementation, three operator-decision items documented for v3 follow-up, three doc-only fixes (auth architecture narrative aligned with shipped OIDC). Source findings closed (code): S1 break-glass /auth/breakglass/login lacked the documented 5/min per-source-IP rate limit; handler now owns its own SlidingWindowLimiter wired at startup. Doc claim turns true. R6 OIDC test_discovery JWKS probe ran on http.DefaultClient; now uses an http.Client whose transport wraps validation.SafeHTTPDialContext. JWKS URI can no longer pivot into reserved-address ranges via DNS rebinding. R7 Slack + Teams notifiers built http.Client without the SSRF dial-time guard. Both New() constructors now install validation.SafeHTTPDialContext; webhook URLs (operator- configured via dynamic-config GUI) cannot dial 169.254.x or in-cluster reserved ranges. Test seam: newForTest bypasses the guard for httptest's 127.0.0.1 binds, mirroring the existing internal/connector/notifier/webhook pattern. RT-L2 CERTCTL_ACME_INSECURE=true now emits a prominent logger.Warn at server boot. Pre-Bundle-5 the knob silently disabled ACME directory TLS verification. Source findings closed (doc): finding 1 + HIGH-5 Architecture doc claimed no in-process JWT/ OIDC/mTLS/SAML and pointed everyone at the authenticating-gateway pattern. Auth Bundle 2 (commit dea5053) shipped native OIDC + sessions + break-glass. New §"In-process authentication surface" table (api-key / oidc / none) supersedes the old framing; "Authenticating-gateway pattern (SAML, mTLS-as-auth, LDAP)" section retained for protocols certctl still doesn't ship natively. Source findings verified false (existing implementation): S4 OIDC email-domain allowlist — `email_domain_test.go` already pins the strict-equality semantics (subdomain not auto-accepted, multi-entry no-match path, empty allowlist accepts all by-design per RFC 9700 §4.1.1). SEC-L1 CSP / HSTS / referrer-policy headers — already shipped at internal/api/middleware/securityheaders.go and wired at cmd/server/main.go L2003+L2027+L2115. Operator-decision / deferred (tracked in bundle-5 closure doc): S3 CERTCTL_API_KEYS_NAMED parsing is wired, end-to-end validation is partial. Operator decides: complete the named-key middleware path or deprecate the syntax. S5 Audit-middleware best-effort for read paths; security-critical writes use WithinTx. Operator decides per-path escalation. S8 MCP threat model — the binary is a thin protocol bridge, no privileges of its own; every tool call carries CERTCTL_API_KEY and is auth'd + RBAC-gated server-side. Optional CERTCTL_MCP_READ_ONLY gate tracked as v3. SEC-H1 2026-05-10 audit CRIT-1/2/4 already closed on master; CRIT-3/5 status against the spec folder is operator- workstation-validation-only. Documented for follow-up. SEC-L2 WebAuthn / FIDO2 / step-up — already documented in docs/operator/auth-threat-model.md "Threats Bundle 2 does NOT close". v3 work item per CLAUDE.md decision 12. Full per-finding rationale + receipts at docs/operator/security-bundle-5-audit-closure.md. Verification: gofmt -l # clean go vet ./internal/connector/notifier/slack ./internal/connector/notifier/teams ./internal/auth/oidc ./internal/api/handler ./cmd/server # clean go build ./cmd/server [...] # clean go test -short -count=1 ./internal/connector/notifier/slack ./internal/connector/notifier/teams ./internal/api/handler ./internal/auth/oidc ./internal/config # PASS # (slack 0.028s + teams # 0.023s + handler 11.0s; # newForTest seam keeps # httptest tests green) Audit-Closes: BUNDLE-5 S1 R6 R7 RT-L2 finding-1 HIGH-5 Audit-Verifies-False: S4 SEC-L1 Audit-Defers: S3 S5 S8 SEC-H1 SEC-L2	2026-05-13 01:18:45 +00:00
shankar0123	80cbd2db59	test(coverage): backfill 5 packages to clear v2.1.0 release-gate Phase 3 floors Phase 3 of /Users/shankar/Desktop/cowork/v2.1.0-release-gate.md surfaced four packages below their coverage floors. All four are regressions from new code shipped in the audit-2026-05-10/11 fix bundles that didn't get per-function tests: internal/auth/breakglass 87.5% -> 93.3% (floor: 90%) + List (was 0%) — 3 tests (disabled, empty+populated, repo err) + RemoveCredential, Unlock disabled-branch tests internal/auth/oidc 89.4% -> 95.4% (floor: 90%) + JWKSStatus (was 0%) — 2 tests (unknown provider, after AuthRequest) + TestDiscovery (was 0%) — 5 tests (discovery failure, happy path, HS256 alg-downgrade detected, missing jwks_uri, JWKS 500 fetch) internal/auth/session 89.9% -> 94.4% (floor: 90%) + SetTrustedProxies (was 0%) — round-trip + clear + ComputeCookieHMAC (was 0%) — determinism + key/inputs differ + DecryptKeyMaterial (was 0%) — round-trip + wrong-passphrase internal/api/handler 73.2% -> 75.5% (floor: 75%) + 6 auth_breakglass handler funcs (were all 0%) — 14 tests (disabled/404, invalid JSON, empty fields, service err, happy path with cookies, admin endpoints, ListCredentials no password_hash on the wire) + WithPermissionChecker setter test (was 0%, Bundle 2 MED-2) + NewAdminCRLCacheServiceImpl + CacheRows (were 0%) — 3 tests + itoaForRetryAfter + challengeURLBuilder ACME helpers (were 0%) — 4 tests All five coverage gates green: internal/service 72.7% (floor: 70%) internal/api/handler 75.5% (floor: 75%) internal/api/middleware 67.9% (floor: 30%) internal/auth 93.3% (floor: 85%) internal/service/auth 91.8% (floor: 85%) internal/auth/oidc 95.4% (floor: 90%) internal/auth/oidc/groupclaim 100.0% (floor: 95%) internal/auth/oidc/domain 97.6% (floor: 90%) internal/auth/session 94.4% (floor: 90%) internal/auth/session/domain 98.3% (floor: 90%) internal/auth/breakglass 93.3% (floor: 90%) internal/auth/breakglass/domain 100.0% (floor: 90%) internal/auth/user/domain 96.2% (floor: 90%) (and 6 more — all green) Per CLAUDE.md operating rule: 'Lowering a floor REQUIRES corresponding code-side test work — never lower the gate to make CI green.' The floors stay at their committed values; the new tests close the gap.	2026-05-11 14:12:11 +00:00
shankar0123	8aeeec93c0	chore(lint): close 5 golangci-lint v2 findings surfaced by v2.1.0 release-gate Phase 1.3 Five golangci-lint v2 findings surfaced when running the v2.1.0 release gate (auth-bundle-2 → master pre-flight). Each is mechanical: 1. govet/printf-style misuse — internal/auth/oidc/service_test.go used integer literal 501 in http.Error; switched to http.StatusNotImplemented. 2. staticcheck SA1019 — internal/auth/breakglass/reflect_helper_test.go referenced reflect.Ptr; the canonical name since Go 1.18 is reflect.Pointer. 3. staticcheck ST1020 — internal/repository/postgres/auth.go ActorRoleRepository.Revoke had a doc comment that did not begin with the method name. Prepended 'Revoke drops actor_roles rows.' to the comment so it now starts with the method name. 4. staticcheck ST1022 — internal/api/handler/auth_session_oidc.go DefaultBCLVerifierMaxAge docstring was attached to the DefaultBCLVerifier type docstring. Moved the const docstring directly above the const declaration, separated by a blank line. 5. unused — internal/auth/session/bench_test.go declared benchSessionMinSamples and never referenced it; the bench loop relies on Go's default b.N scaling. Replaced the const block with a comment describing the rationale. Lint clean (golangci-lint v2.12.2 with the .golangci.yml config) on the five edited packages.	2026-05-11 13:31:13 +00:00
shankar0123	a4b2919f59	Merge Fix 13 (HIGH-2 fourth call site): CSRF rotation on Logout # Conflicts: # CHANGELOG.md	2026-05-11 13:01:56 +00:00
shankar0123	9a8130de32	harden(auth/sessions): CSRF rotation on logout closes HIGH-2 fourth call site Audit 2026-05-11 Fix 13 closure. The HIGH-2 closure on dev/auth-bundle-2 documented four RotateCSRFTokenForActor call sites — login completion (fresh by construction), Assign/Revoke RoleToKey (wired at internal/api/handler/auth.go:498 + 546), Logout, and an explicit operator endpoint. The 2026-05-11 adversarial review observed only 3 of the 4: Logout did NOT rotate the actor's sibling sessions post-revoke. Threat closed: a token captured pre-logout (browser DevTools, malicious extension, session-storage leak) could be replayed against the user's other-device/other-browser sessions until those sessions hit their own idle/absolute expiry. Rotation on logout defeats this — the captured token is dead the moment the user clicks 'Sign out' anywhere. What this changes: * internal/api/handler/auth_session_oidc.go::SessionMinter interface gains RotateCSRFTokenForActor(ctx, actorID, actorType string) int. Nil-safe semantics by convention — the production wiring is session.Service which already implements the method; rotation NEVER errors (returns int count, swallows per-row failures via the underlying Service.RotateCSRFToken) so it can't block the surrounding Revoke that triggered it. internal/api/handler/auth_session_oidc.go::Logout calls RotateCSRFTokenForActor after Revoke(sess.ID) succeeds. The auth.session_revoked audit row gains a csrf_rotated detail key carrying the count so SOC/SIEM can correlate logout events with CSRF churn on sibling sessions. * The no-cookie + invalid-cookie 204 short-circuit paths skip rotation. No session row exists to rotate against; the caller is already unauthenticated. Rotation on those paths would do nothing useful and pollute the audit log. Test coverage in internal/api/handler/auth_session_oidc_test.go: * TestLogout_RotatesCSRFForActor — happy path. Mocks rotateCSRFReturnCount=2; asserts Revoke fires before rotation, rotation fires exactly once with caller's (actor_id, actor_type), audit details carry csrf_rotated=2. * TestLogout_NoCookie_SkipsCSRFRotation — pins the 204 short-circuit branch when there's no cookie. Rotation count stays at 0. * TestLogout_InvalidCookie_SkipsCSRFRotation — pins the 204 short-circuit branch when Validate rejects the cookie. Same rationale: no session row, no rotation. The stubSession test fake gains RotateCSRFTokenForActor with call-recording fields; the phase5StubAudit gains a details slice append-aligned 1:1 with events so the happy-path test can index into the latest entry and assert the count. Spec Phase 3 (explicit operator endpoint) — intentionally NOT shipped. The three automatic triggers (login + role- mutation + logout) cover the HIGH-2 threat model; operators who want a nuclear option can use the existing RevokeAllForActor flow which forces re-login → fresh session → fresh CSRF. Adding a dedicated POST /api/v1/auth/sessions/ rotate-csrf admin endpoint would be defense-in-depth without new attack-surface coverage. Documented in the audit-doc annotation. Verify gate: * gofmt -l — clean * go vet ./internal/api/handler/... — clean * go build ./cmd/server/... ./internal/... — clean (production session.Service satisfies the extended interface out of the box) go test -short -count=1 ./internal/api/handler/... ./internal/auth/session/... — all green; 3 new Logout cases + the 2 pre-existing Logout cases all pass. Audit doc annotation at cowork/auth-bundles-audit-2026-05-10.md flips the HIGH-2 row from 'CLOSED 2026-05-10 (3/4 call sites wired)' to 'A-B-3 verified 2026-05-11: HIGH-2 fully closed across all four documented call sites.' Refs cowork/auth-bundles-fixes-2026-05-11/13-verify-logout-csrf-rotation.md.	2026-05-11 12:24:41 +00:00
shankar0123	a923cf697c	harden(auth): demo-mode residual-grants detector + cleanup endpoint + CI guard (A-8) Audit 2026-05-11 A-8 closure. Closes the deferred Phase 2 leg of the 2026-05-10 HIGH-12 closure (`2e97cc1`) — production-startup observability for actor-demo-anon residual grants + CI guard banning new synthetic- admin code paths. What this changes: * cmd/server/preflight_demo_residual.go (new) runs after the DB pool + audit service are constructed and before the HTTPS listener starts. Under any non-'none' auth type it queries actor_roles for the synthetic actor-demo-anon and emits a WARN log + a categorized audit row (auth.demo_residual_grants_detected) listing every grant present. Migration 000029 unconditionally seeds the ar-demo-anon-admin row at install time, so EVERY production deploy will see this WARN on first boot; the intended cutover workflow is cleanup-once at production handover. * CERTCTL_DEMO_MODE_RESIDUAL_STRICT (new env var on AuthConfig, default false) pivots the WARN to fail-closed startup refusal for operators who want a paranoid posture against re-seeding. * POST /api/v1/auth/demo-residual/cleanup (new handler at internal/api/handler/demo_residual.go) is an admin-class (auth.role.assign) endpoint that removes every actor-demo-anon row from actor_roles and returns {removed: int64}. Idempotent; refuses 503 under Auth.Type=none (deleting the row would break the demo path); audit-logs every invocation including no-op zero-removed calls so the admin's action is always recorded. * scripts/ci-guards/no-new-synthetic-admin.sh pins the 17-entry allowlist of source files that legitimately reference the actor-demo-anon literal. New runtime code paths that resolve to the synthetic actor (the same pattern that produced the original CRIT class) are rejected at PR time. CI workflow auto-picks the script via the existing scripts/ci-guards/.sh loop in .github/workflows/ ci.yml; no workflow edit needed. Regression matrix: cmd/server/preflight_demo_residual_test.go — 7 tests covering the 4 main behaviour branches (testcontainers-backed, testing.Short()- skipped: DemoModeActive_Skips, NoResidue_Passes, HasResidue_LogsAnd Audits, StrictMode_RefusesStartup, DeleteDemoAnonResidue_Idempotent) plus 3 pure-Go stdlib unit tests for the row-string formatter + nil-safety contracts on both helpers. * internal/api/handler/demo_residual_test.go — 7 stdlib+httptest cases: HappyPath, Idempotent_ReturnsZero, RejectsInDemoMode (503), CleanupError_Surfaces500, NilCleanupFn (defensive 500), NilAuditWriter_DoesNotPanic, MissingActorContext (falls back to 'unknown' actor in the audit row). * internal/api/router/openapi_parity_test.go — new POST /api/v1/auth/demo-residual/cleanup entry plus 6 pre-existing pre-A-8 entries (oidc/test, jwks-status, users CRUD, runtime-config) that had drifted out of SpecParityExceptions; the parity test was red on dev/auth-bundle-2 before my work; this commit returns it to green with full per-entry justifications + parity-debt notes. Docs: * docs/operator/security.md — new 'Demo-to-production cutover (Audit 2026-05-11 A-8)' section explaining the WARN message, the cleanup curl one-liner, the equivalent SQL, the strict-mode env var, and the CI guard. * docs/operator/rbac.md — Last-reviewed bump + pointer to the new env var + the security.md section. * cowork/auth-bundles-audit-2026-05-10.md — HIGH-12 row gains an 'A-8 follow-on CLOSED 2026-05-11' annotation describing the deferred Phase 2 leg now landed. * CHANGELOG.md — Unreleased ### Security entry summarizing the four legs (detector + cleanup + strict-mode flag + CI guard) and the acquisition-readiness narrative this closes. Operator-facing impact: this closes a credibility gap, not an exploitable vulnerability. The residue requires a regression elsewhere in the middleware chain to be exploitable. After this fix, the canonical narrative ('RBAC primitive with no synthetic- admin fallback') is fully true. Refs cowork/auth-bundles-fixes-2026-05-11/08-high-demo-mode-residual- cleanup.md.	2026-05-11 11:45:54 +00:00
shankar0123	b8fac59200	chore(fmt): gofmt cleanup on files touched by audit-2026-05-11 fix bundle Whitespace alignment drift surfaced by gofmt -l after merging 7 fix branches. Pure formatting, no semantic change. Pre-existing master drift in internal/auth/oidc/{domain/types.go, integration_keycloak_rotate_test.go, test_discovery.go} left untouched — that's separate tech debt.	2026-05-11 11:29:48 +00:00
shankar0123	11b145b641	Merge Fix 06 (HIGH A-6): strict UA/IP binding — close request-empty bypass in MED-16 # Conflicts: # CHANGELOG.md # internal/api/handler/auth_session_oidc.go # internal/api/handler/auth_session_oidc_test.go	2026-05-11 11:19:04 +00:00
shankar0123	68af18d081	Merge Fix 04 (HIGH A-4): scope-aware ActorRole revoke	2026-05-11 11:16:24 +00:00
shankar0123	92519436a1	harden(oidc): strict UA/IP binding (A-6) — close request-empty bypass in MED-16 The MED-16 closure (`2a1a0b3`) added the RFC 9700 §4.7.1 pre-login UA/IP binding but the consume-side compare at internal/auth/oidc/service.go was gated by: if s.preLoginRequireUA && storedUA != "" && userAgent != "" { ... constant-time compare ... } if s.preLoginRequireIP && storedIP != "" && ip != "" { ... constant-time compare ... } The `userAgent != ""` and `ip != ""` arms were intended as rolling-deploy / headless-proxy compat ("if the request didn't supply a value, don't try to compare against nothing"). They achieve that — and they ALSO short-circuit the compare whenever the attacker controls the request side, which is always at /auth/oidc/callback. Threat model: 1. Attacker acquires a pre-login cookie (HMAC-protected; requires RNG break OR transit leak — not implausible, that's why the binding exists in the first place). 2. Attacker replays the cookie at /auth/oidc/callback from their own user-agent. 3. Attacker OMITS the User-Agent header. curl doesn't send one by default. Many programmatic HTTP clients omit it. Pre-A-6, step 3 trivially bypassed the binding check. The whole RFC 9700 §4.7.1 defense was theatre against the realistic threat — silent-allow when the attacker abandons the header they don't want checked. Fix: flipped to strict-when-stored. When the pre-login row carries a binding value (storedUA != "" or storedIP != ""), the request MUST present a matching value. An empty request side with a non-empty stored side now rejects with two new sentinels: ErrPreLoginUAMissing — request omitted User-Agent header ErrPreLoginIPMissing — request had no resolvable client IP Distinguished from the existing Mismatch sentinels so the audit row can tell apart "binding violation" (operator mis-configured the proxy) from "missing-header bypass attempt" (active exploit indicator). The handler-side classifyOIDCFailure adds typed errors.Is dispatch: ErrPreLoginUAMissing → "prelogin_ua_missing" ErrPreLoginIPMissing → "prelogin_ip_missing" SIEM rules can now alert specifically on the bypass-attempt category distinctly from operator config drift. Legacy-row compat preserved: pre-migration rows where storedUA == "" / storedIP == "" still pass through unchecked. That window is bounded by the 10-minute pre-login TTL — within 10 minutes of the MED-16 deploy every legacy row has expired and the strict path is universal. Operator escape hatches preserved: CERTCTL_OIDC_PRELOGIN_REQUIRE_UA=false (symmetric for IP) bypasses both the Mismatch AND the new Missing reject paths. Required for environments where a proxy strips the User-Agent header in transit (rare but documented in the operator advisory). Regression coverage: service_test.go (5 new tests under `Audit 2026-05-11 A-6 — strict-when-stored` block): TestService_HandleCallback_MED16_A6_UAStoredButRequestEmpty_Rejects — the load-bearing bypass-closure leg TestService_HandleCallback_MED16_A6_IPStoredButRequestEmpty_Rejects — symmetric for IP TestService_HandleCallback_MED16_A6_LegacyRowEmptyStoredStillPasses — legacy-row compat preserved TestService_HandleCallback_MED16_A6_ToggleOff_AllowsBypass — UA toggle off allows the bypass (operator escape hatch) TestService_HandleCallback_MED16_A6_ToggleOff_IP_AllowsBypass — IP toggle off allows the bypass auth_session_oidc_test.go::TestClassifyOIDCFailure extended: ErrPreLoginUAMismatch → prelogin_ua_mismatch (new explicit pin) ErrPreLoginIPMismatch → prelogin_ip_mismatch (new explicit pin) ErrPreLoginUAMissing → prelogin_ua_missing ErrPreLoginIPMissing → prelogin_ip_missing fmt.Errorf wrapped variants of the Missing sentinels round-trip through errors.Is (defense against future context-wrapping in the service layer) Verify gate green: gofmt clean, go vet clean, all 10 MED-16 tests + extended TestClassifyOIDCFailure pass; full short-mode test run across internal/auth/oidc + internal/api/handler also green. Spec at cowork/auth-bundles-fixes-2026-05-11/06-high-prelogin-ua-strict-mode.md. Audit doc: MED-16 row in cowork/auth-bundles-audit-2026-05-10.md appended with the A-6 follow-up closure annotation; status table row updated to "CLOSED + A-6 follow-up CLOSED 2026-05-11". Operator advisory in CHANGELOG.md v2.1.0 release notes covers the two operator-visible behaviour changes: (1) callback requests without User-Agent now reject when a binding was stored, and (2) the CERTCTL_OIDC_PRELOGIN_REQUIRE_UA=false escape hatch is the documented path for environments where the proxy strips the header.	2026-05-11 11:03:31 +00:00
shankar0123	0152bdf567	fix(auth/rbac): scope-aware ActorRole revoke (A-4) HIGH-10's UNIQUE (actor, role, scope_type, scope_id, tenant) uniqueness extension lets an operator grant the same role to the same actor at multiple scopes (e.g. r-operator on profile=p-acme AND profile=p-globex). But ActorRoleRepository.Revoke's WHERE clause omitted (scope_type, scope_id) — a single call deleted every variant. Selective revoke was unrepresentable; operators had to drop all and re-grant N-1, opening a race window where the actor's access was briefly different. Closure across all layers (handler → service → repo → MCP → GUI client), preserving the legacy "revoke all variants" contract for unmodified callers: internal/repository/auth.go - New ActorRoleRevokeOptions struct. Zero value = legacy semantic; non-empty ScopeType narrows to one variant. - New ErrActorRoleNotFound sentinel for scoped no-match (HTTP 404). internal/repository/postgres/auth.go - Revoke signature extended with opts. Empty opts.ScopeType uses the legacy SQL (no scope WHERE), zero-row delete = no error. - Non-empty narrows with `scope_type = $5 AND scope_id IS NOT DISTINCT FROM $6` — the IS-NOT-DISTINCT-FROM is load-bearing, vanilla `=` would silently miss the (global, NULL) case because NULL ≠ NULL in standard SQL. - Selective revoke with zero matching rows returns ErrActorRoleNotFound; operators get feedback on typos. internal/service/auth/actor_role_service.go - Revoke takes opts. Audit row's details map records the scope so SIEMs can distinguish wide-vs-selective revokes: `scope: "all_variants"` for the legacy path, or `scope_type` + `scope_id` for selective. Privilege check (auth.role.assign) and reserved-actor guard unchanged. internal/api/handler/auth.go - RevokeRoleFromKey parses optional `?scope_type=` / `?scope_id=` query params via new parseRevokeScope helper. - Validation mirrors AssignRoleToKey: scope_id forbidden with scope_type=global, required with profile/issuer, invalid scope_type → 400. scope_id without scope_type also → 400. - writeAuthError maps ErrActorRoleNotFound to 404. internal/mcp/tools_auth.go + types.go - AuthRevokeKeyRoleInput gains optional ScopeType + ScopeID with jsonschema descriptions explaining the dual-mode contract. - Tool call site appends URL-encoded query params when ScopeType is set; legacy callers (no scope_type) emit the bare DELETE path unchanged. web/src/api/client.ts - authRevokeKeyRole signature: optional 3rd argument `{ scope_type?, scope_id? }`. Pre-A-4 call sites (no opts arg) keep firing the bare DELETE — fully backward compatible. The GUI KeysPage's per-row revoke button (still one row per role, pre-Fix-12) continues to use the legacy shape; future GUI work can pass scope params for per-variant rows. docs/operator/rbac.md - New "Revoke: legacy 'all variants' vs scope-selective" subsection under "From the HTTP API" with curl examples for both modes plus the audit-row payload shape that lets SOC/SIEM tell them apart. Regression coverage: Repository (testcontainers, skipped under -short — 6 tests in internal/repository/postgres/auth_revoke_scope_test.go): TestRevokeActorRole_NoOpts_RemovesAllVariants TestRevokeActorRole_WithScope_RemovesOnlyMatching TestRevokeActorRole_WithGlobalScope_RemovesOnlyGlobal — pins the IS-NOT-DISTINCT-FROM branch (global, NULL) TestRevokeActorRole_NoMatch_ReturnsNotFound — pins the new sentinel TestRevokeActorRole_NoOpts_NoMatch_IsNoOp — pins the legacy idempotence contract TestRevokeActorRole_IssuerScope_RemovesOnlyMatching — pin the issuer-scope half (profile + issuer are symmetric scope types) Handler (7 new tests in auth_test.go): TestAuthHandler_RevokeRoleFromKey — extended to assert no scope filter is forwarded when query string is empty (legacy behaviour) TestAuthHandler_RevokeRoleFromKey_A4_ScopedProfile TestAuthHandler_RevokeRoleFromKey_A4_ScopedGlobal TestAuthHandler_RevokeRoleFromKey_A4_RejectsScopeIDWithGlobal TestAuthHandler_RevokeRoleFromKey_A4_RejectsMissingScopeID TestAuthHandler_RevokeRoleFromKey_A4_RejectsScopeIDWithoutScopeType TestAuthHandler_RevokeRoleFromKey_A4_RejectsInvalidScopeType TestAuthHandler_RevokeRoleFromKey_A4_ScopedNotFoundReturns404 MCP (2 new table rows in tools_per_tool_test.go): Scoped revoke with scope_type=profile + scope_id=p-acme → `?scope_type=profile&scope_id=p-acme` Scoped revoke with scope_type=global (no scope_id) → `?scope_type=global` Service-layer test plumbing (service_test.go) updated for new opts arg: 4 existing call sites pass repository.ActorRoleRevokeOptions{} to keep their pre-A-4 semantics; the fakeActorRoleRepo.Revoke implementation now mirrors the postgres scope-aware behaviour (legacy zero-value vs scoped narrowing + ErrActorRoleNotFound on no-match). Verify gate green: gofmt clean, go vet clean, go test -short across repository/postgres, service/auth, api/handler, and mcp. The pre-existing KeysPage.test.tsx failure observed on the baseline commit (reproduced via `git stash` earlier in Fix 03) is unrelated; my client.ts change adds an optional third argument and is fully backward-compatible. Spec at cowork/auth-bundles-fixes-2026-05-11/04-high-actor-role-revoke-scope.md. Audit doc updated: new row A-4 (2026-05-11) CLOSED appended to the status table at the bottom of cowork/auth-bundles-audit-2026-05-10.md. Operator-visible advisory in CHANGELOG.md v2.1.0 release notes under Security (non-BREAKING — legacy callers are unchanged). Depends on Fix 01 (the scope-aware EffectivePermissions read path on branch fix/audit-2026-05-11/crit-actor-role-scope-reads). This fix makes the inverse op selectively reversible; without Fix 01 the read side would mis-evaluate scoped grants anyway, making selective revoke moot at runtime.	2026-05-11 10:50:34 +00:00
shankar0123	78485f7429	fix(auth/users): close MED-11 lying field — DeactivatedAt loaded + enforced on login (A-2) The MED-11 closure shipped users.deactivated_at + DELETE /api/v1/auth/users/{id} + cascade-revoke, but the federated-user soft-delete was reversible: the next OIDC login under the same (provider, subject) tuple re-minted a session and re-elevated the user. Three legs of the chain were severed (each independently CRIT-shaped): Leg A — postgres/user.go::userColumns omitted `deactivated_at`, so scanUser never populated User.DeactivatedAt. Every Get / GetByOIDCSubject / ListAll returned DeactivatedAt = nil regardless of the column value. Leg B — postgres/user.go::Update SQL omitted `deactivated_at = $X`, so the handler's `u.DeactivatedAt = now()` mutation was a no-op write at the SQL level. Even with leg A closed, no row ever flipped. Leg C — oidc/service.go::upsertUser did not inspect DeactivatedAt on the existing-user path. Even with legs A + B closed, the OIDC login would still proceed normally. The cascade-session-revoke half of the original closure remained correct, but only for the duration of the user's current cookie. SOC 2 CC6.3 + ISO 27001 A.9.2.6 "user access removal" controls require both immediate revoke AND persistent block — this fix restores the persistent-block leg. Closure across layers: internal/repository/postgres/user.go - userColumns adds `deactivated_at` - scanUser reads via sql.NullTime intermediate (column is nullable) - Create writes deactivated_at explicitly (NULL for new active users; forward-compat for future seed-data flows that pre-populate the column) - Update writes deactivated_at on every call; nil DeactivatedAt → NULL (supports reactivation) internal/auth/oidc/service.go - New sentinel ErrUserDeactivated - upsertUser checks existing.DeactivatedAt != nil BEFORE mutating email / display_name / last_login_at — preserves last_login_at forensics on rejected login attempts (defense-in-depth pin against future "performance optimization" that reorders the gate) internal/api/handler/auth_session_oidc.go - classifyOIDCFailure adds typed errors.Is dispatch for ErrUserDeactivated → audit category "user_deactivated" (SOC/SIEM observability surface) internal/api/handler/auth_users.go - Self-deactivate guard on Deactivate: HTTP 409 + audit row auth.user_deactivate_self_rejected when caller targets own User row. Prevents an admin from one-way-door locking themselves out via the standard handler; break-glass remains the recovery path. - New Reactivate handler: inverse of Deactivate. Clears DeactivatedAt via Update; emits auth.user_reactivated audit row. Idempotent on already-active rows. Sessions revoked at deactivation stay revoked (cascade irreversible by design — user must complete fresh OIDC login). internal/api/router/router.go - POST /api/v1/auth/users/{id}/reactivate wired with auth.user.deactivate gate (reactivation is the inverse op, not a separate privilege) web/src/api/client.ts + web/src/pages/auth/UsersPage.tsx - authReactivateUser() client function - Reactivate button on deactivated rows in UsersPage Regression coverage: Postgres (testcontainers, skipped under -short): TestUserRepository_DeactivatedAt_RoundTrip — Create → set DeactivatedAt → Update → Get / GetByOIDCSubject / ListAll round-trip the value TestUserRepository_DeactivatedAt_CreateWritesNullForActive — new active user reads back DeactivatedAt = nil TestUserRepository_DeactivatedAt_CreatePersistsPreDeactivated — Create with non-nil DeactivatedAt round-trips (forward-compat path) OIDC service: TestService_HandleCallback_RejectsDeactivatedUser — errors.Is ErrUserDeactivated; CallbackResult nil; persisted email / last_login_at / deactivated_at NOT mutated by the rejected attempt TestService_HandleCallback_AllowsReactivatedUser — DeactivatedAt = nil → happy path resumes TestService_HandleCallback_DeactivatedUserPreservesForensics — defense-in-depth pin against future regressions that reorder the gate-vs-mutation sequence Classifier: TestClassifyOIDCFailure extended — typed dispatch + wrapped variant round-trip through errors.Is Handler: TestAuthUsers_Deactivate_RejectsSelfDeactivate — HTTP 409 + audit row + cascade-revoke NOT fired + row stays active TestAuthUsers_Deactivate_OtherUser_HappyPath — HTTP 204 + cascade fires + row soft-deleted TestAuthUsers_Reactivate_HappyPath / _IdempotentOnActiveUser / _UnknownID / _MissingID / _UpdateError Phase 6 verify gate green on the targeted packages: gofmt clean, go vet clean, go test -short pass across internal/auth/oidc, internal/api/handler, internal/api/router, internal/repository/postgres, internal/auth/..., internal/service/..., internal/tlsprobe/..., internal/trustanchor/..., internal/validation/... Spec at cowork/auth-bundles-fixes-2026-05-11/02-crit-deactivated-at-enforcement.md Closure annotation at cowork/auth-bundles-audit-2026-05-10.md MED-11 row. Operator advisory in CHANGELOG.md v2.1.0 release notes.	2026-05-11 02:21:05 +00:00
shankar0123	172b30b8f1	feat(auth): backend endpoints for MED-7 + MED-11 + MED-12 Audit 2026-05-10 MED-7 + MED-11 + MED-12 backend halves. WHAT. Three new admin-gated endpoints: GET /api/v1/auth/oidc/providers/{id}/jwks-status (auth.oidc.list) — MED-7 GET /api/v1/auth/users (auth.user.read) — MED-11 DELETE /api/v1/auth/users/{id} (auth.user.deactivate) — MED-11 GET /api/v1/auth/runtime-config (auth.role.assign) — MED-12 MED-7 — JWKS health surface - providerEntry gains 4 counters (statsMu, lastRefreshAt, refreshCount, lastError, rejectedJWSCount) updated under sync.Mutex - RefreshKeys increments refreshCount + records lastRefreshAt - New JWKSStatus(ctx, providerID) returns *JWKSStatusSnapshot — surfaced via the new endpoint - CurrentKIDs intentionally empty (go-oidc's internal JWKS cache isn't exposed); shape kept for forward compat MED-11 — federated-user admin - AuthUsersHandler.List with optional ?oidc_provider_id filter - AuthUsersHandler.Deactivate sets users.deactivated_at + cascade- revokes sessions via UserSessionsRevoker (best-effort; revoke failure does NOT roll back the deactivation) - Idempotent: re-deactivating an already-deactivated user is a no-op MED-12 — runtime config - AuthRuntimeConfigHandler.Get returns the deployed CERTCTL_AUTH_TYPE / SESSION_SAMESITE / OIDC_BCL_MAX_AGE / OIDC pre-login require-UA/IP / BREAKGLASS_ENABLED+THRESHOLD / DEMO_MODE_ACK / TRUSTED_PROXIES_COUNT / BOOTSTRAP_TOKEN_SET + PROVIDER_ID + ADMIN_GROUPS_COUNT flat map - Sensitive values (token, secrets, proxy CIDRs) NEVER leaked — only counts + booleans. Token presence surfaced as 'set/unset' - Gated auth.role.assign (admin-class) so non-admins can't enumerate the deployment's auth knobs cmd/server/main.go wires all three handlers into HandlerRegistry. internal/api/router/router.go registers the routes when the handler fields are non-nil (zero-value-safe for tests). VERIFY. - go vet ./internal/api/... ./internal/auth/... ./internal/repository/... PASS - go build ./cmd/server/... PASS - go test -short -count=1 ./internal/auth/oidc/... PASS (4.1s) - go test -short -count=1 ./internal/api/handler/... PASS (4.1s) GUI halves for MED-7 + MED-11 + MED-12 are the GUI batch (pending). Refs: cowork/auth-bundles-audit-2026-05-10.md MED-7, MED-11, MED-12 cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md items 11 14 15	2026-05-11 00:11:07 +00:00
shankar0123	b4b98799d5	feat(oidc): POST /api/v1/auth/oidc/test dry-run endpoint (MED-5) Audit 2026-05-10 MED-5 closure (backend half). WHAT. New POST /api/v1/auth/oidc/test endpoint that validates an OIDC provider configuration without persisting anything. Mirrors the read-only legs of the production getOrLoad path so operators can catch typos / network reachability problems / IdP-advertises-weak- alg conditions BEFORE creating the provider row. Request body: {issuer_url, client_id, client_secret, scopes} — client_secret is accepted but unused (discovery + JWKS reachability do not require it). Response body: TestDiscoveryResult{ discovery_succeeded — gooidc.NewProvider returned without error jwks_reachable — explicit GET against jwks_uri succeeded supported_alg_values — verbatim id_token_signing_alg_values_supported iss_param_supported — RFC 9207 advertisement parsed off the disco doc issuer_echo — the iss URL we were called with authorization_url, token_url, jwks_uri, userinfo_endpoint — discovery doc fields for the GUI to preview errors[] — per-leg failure messages } HTTP status: - 200 even when individual checks fail (the per-leg errors[] carries detail so the GUI renders per-check status rows) - 400 only when the request body is malformed or issuer_url empty - 500 only when the service-layer call itself errors WHY. Pre-fix, operators configuring OIDC had to create a provider, then hit /refresh, then read the audit log to figure out whether the discovery doc was reachable / whether the IdP advertises HS256 (the alg-downgrade trap). The GUI rendered no per-check feedback. MED-5 closes the dry-run gap for the same reason every Issuer + Target connector has a 'Test connection' button — operator experience parity. HOW. internal/auth/oidc/test_discovery.go (NEW): - TestDiscoveryResult struct with the per-leg projection. - Service.TestDiscovery(ctx, issuerURL) drives the read-only subset of getOrLoad: gooidc.NewProvider, claims parse for alg-supported + iss-param-supported + jwks_uri + userinfo, alg-downgrade defense, jwksReachable HTTP GET. - jwksReachable is a package-level closure so tests can swap. internal/api/handler/auth_session_oidc.go: - TestProvider HTTP handler. Uses an inline discoveryTester interface to type-assert against the OIDCAuthHandshaker stub (the production Service satisfies; test stubs supply via explicit method). Audit row 'auth.oidc_provider_tested' carries the summary fields. internal/api/router/router.go: - Wired as POST /api/v1/auth/oidc/test under rbacGate('auth.oidc.create'). internal/api/handler/auth_session_oidc_test.go: - stubOIDCSvc gains testResult + testErr fields + TestDiscovery method so it satisfies the inline interface. - 3 regression tests: happy path, missing issuer_url -> 400, discovery-failure -> 200 with errors[] populated. VERIFY. - go vet ./internal/auth/oidc/... ./internal/api/handler/... ./internal/api/router/... PASS - go test -short -count=1 -run TestProvider ./internal/api/handler/... PASS (3/3) - go test -short -count=1 ./internal/auth/oidc/... PASS (3.7s) - go test -short -count=1 ./internal/api/handler/... PASS (4.7s) Out of scope for this commit: the GUI 'Test connection' button on OIDCProviderDetailPage — queued with the GUI batch (items 10-19 of HANDOFF.md). Refs: cowork/auth-bundles-audit-2026-05-10.md MED-5 cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 2	2026-05-10 23:25:54 +00:00
shankar0123	2a1a0b347c	harden(oidc): pre-login UA/IP binding (MED-16) — RFC 9700 §4.7.1 Audit 2026-05-10 MED-16 closure. WHAT. Binds the OIDC pre-login row to the (clientIP, userAgent) tuple of the /auth/oidc/login request, and enforces a constant-time compare against the /auth/oidc/callback request at consume time. Defeats replay of a stolen pre-login cookie by a different browser / source — the secondary defense layer recommended by RFC 9700 §4.7.1 when the primary layer (HMAC integrity + Path=/ + SameSite=Lax on the cookie) is bypassed via CSRF / XSS / TLS-termination leak. WHY. Pre-fix, the pre-login cookie's HMAC verified only that 'some' caller of /auth/oidc/login was talking to /auth/oidc/callback; it did not verify that the SAME browser / source was on both sides. An attacker who exfiltrated the cookie value via any vector could replay the bytes through their own user-agent and ride the victim's authorization. RFC 9700 §4.7.1 calls out the gap explicitly and recommends binding state to a user-agent fingerprint + source IP. HOW. Migration: migrations/000044_prelogin_uaip.up.sql ALTER TABLE oidc_pre_login_sessions ADD COLUMN IF NOT EXISTS client_ip TEXT, ADD COLUMN IF NOT EXISTS user_agent TEXT; Both nullable for in-flight rolling-deploy compat — the consume- side check only enforces when both row AND request carry non-empty values for the leg in question. Domain: internal/repository/oidc.go (PreLoginSession) — adds ClientIP + UserAgent fields. Repository: internal/repository/postgres/oidc_prelogin.go — Create persists via sql.NullString (empty → NULL); LookupAndConsume reads back. Re-uses package-local nullableString from discovery.go. Service: internal/auth/oidc/service.go - PreLoginStore.CreatePreLogin signature takes (clientIP, userAgent) as positions 5–6. - PreLoginStore.LookupAndConsume returns (clientIP, userAgent) as positions 5–6. - HandleAuthRequest signature gains (clientIP, userAgent), threaded to the store. - HandleCallback adds Step 1.5 — UA / IP constant-time compare between stored row and incoming request. Per-leg toggles via preLoginRequireUA / preLoginRequireIP service fields. Empty values on either side pass through (rolling-deploy + headless- proxy compat). - New sentinels ErrPreLoginUAMismatch, ErrPreLoginIPMismatch. - SetPreLoginBindingRequirements(requireUA, requireIP) helper for main.go config wiring. Adapter: internal/auth/oidc/prelogin.go — PreLoginAdapter passes the new fields through to the repo row. Handler: internal/api/handler/auth_session_oidc.go - OIDCAuthHandshaker.HandleAuthRequest signature updated. - LoginInitiate captures clientIPFromRequest + r.UserAgent() and passes to the service. - classifyOIDCFailure adds errors.Is dispatch for the two new sentinels → prelogin_ua_mismatch / prelogin_ip_mismatch audit categories. Config: internal/config/config.go + AuthConfig.OIDCPreLoginRequireUA (default true) env CERTCTL_OIDC_PRELOGIN_REQUIRE_UA + AuthConfig.OIDCPreLoginRequireIP (default true) env CERTCTL_OIDC_PRELOGIN_REQUIRE_IP cmd/server/main.go calls oidcService.SetPreLoginBindingRequirements from cfg.Auth.OIDCPreLoginRequire{UA,IP}. Tests (internal/auth/oidc/service_test.go): - TestService_HandleCallback_MED16_UAMismatchRejected - TestService_HandleCallback_MED16_IPMismatchRejected - TestService_HandleCallback_MED16_BothMatch_Succeeds - TestService_HandleCallback_MED16_LegacyRowEmptyValues (rolling- deploy compat — empty stored values pass through) - TestService_HandleCallback_MED16_RequireUAFalse_AllowsMismatch (operator escape-hatch — UA mismatch silently allowed) Mechanical fan-out: - stubPreLogin / stubPreLoginRepo signatures updated. - All existing call sites in service_test.go (~40), prelogin_test.go, bench_test.go, logging_test.go, provider_enabled_test.go, integration_keycloak_test.go, integration_okta_smoke_test.go, auth_session_oidc_test.go updated to pass empty strings for the new params — pre-existing tests do not exercise UA/IP binding semantics. VERIFY. - go vet ./internal/auth/oidc/... ./internal/api/handler/... ./internal/config/... PASS - go test -short -count=1 -run MED16 ./internal/auth/oidc/... PASS (5/5) - go test -short -count=1 ./internal/auth/oidc/... PASS (4.6s) - go test -short -count=1 ./internal/api/handler/... PASS (4.3s) - go test -short -count=1 ./internal/config/... PASS Refs: cowork/auth-bundles-audit-2026-05-10.md MED-16 cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 6 RFC 9700 §4.7.1 — OAuth 2.0 Security Best Current Practice	2026-05-10 23:18:23 +00:00
shankar0123	2cd2a5c52f	harden(oidc): RFC 9207 iss URL parameter check on callback (MED-17) Audit 2026-05-10 MED-17 closure. WHAT. When the matched IdP's discovery doc advertises authorization_response_iss_parameter_supported=true (RFC 9207 §3), HandleCallback now REQUIRES a non-empty `iss` query parameter on /auth/oidc/callback and enforces a constant-time compare against the configured provider's IssuerURL. Mismatch maps to two new sentinel errors (ErrIssParamMissing / ErrIssParamMismatch) that the handler's classifyOIDCFailure dispatches via errors.Is BEFORE the substring fall-through, so the audit failure_category remains distinguishable between the RFC 9207 leg (iss_param_missing / iss_param_mismatch) and the in-token iss claim leg (id_token_iss_mismatch). WHY. The RFC 9207 iss URL parameter is the load-bearing mix-up-attack defense for multi-tenant IdPs (Keycloak realms, Authentik tenants, Auth0 tenants, public-trust CAs). Pre-fix the parameter was silently ignored — an attacker controlling one IdP tenant could route an auth code to certctl's callback against a different tenant's pre-login state without detection. Modern Keycloak / Authentik / public-trust CAs ship the discovery flag by default; legacy IdPs that don't advertise are unaffected (back-compat preserved). HOW. - internal/auth/oidc/service.go - providerEntry gains issParamSupported bool. - getOrLoad extends the discovery-claims read to include authorization_response_iss_parameter_supported, alongside the existing id_token_signing_alg_values_supported defense. - HandleCallback's signature gains callbackIss string at position 5. Step 2.5 runs after the state compare + provider load: when issParamSupported is true, an empty callbackIss returns ErrIssParamMissing; a present-but-mismatched value returns ErrIssParamMismatch (constant-time compare). - Two new sentinels: ErrIssParamMissing, ErrIssParamMismatch. ErrIssuerMismatch's doc-string clarified to note it covers the in-token leg only. - internal/api/handler/auth_session_oidc.go - OIDCAuthHandshaker.HandleCallback signature updated. - LoginCallback reads r.URL.Query().Get("iss") (no TrimSpace — byte-strict compare upstream) and threads it through. - classifyOIDCFailure: typed errors.Is dispatch for the three iss-family sentinels BEFORE the substring fall-through, so the three cases stay distinguishable in the audit row. - internal/api/handler/auth_session_oidc_test.go - stubOIDCSvc.HandleCallback bumped to 7-arg signature. - TestClassifyOIDCFailure extended with 5 new cases pinning the iss-family dispatch + a wrapped-error round-trip. - internal/auth/oidc/service_test.go - mockIdP gains advertiseIssParameterSupported bool; the /.well-known/openid-configuration handler emits the claim only when set (so existing tests stay back-compat). - 4 new regression tests: * MED17_NoSupport_AnyIssAccepted — provider doesn't advertise; arbitrary callbackIss is ignored (back-compat). * MED17_SupportButMissing — provider advertises; missing iss → ErrIssParamMissing. * MED17_SupportButMismatch — provider advertises; wrong iss → ErrIssParamMismatch (load-bearing mix-up defense). * MED17_SupportAndCorrect — provider advertises; matching iss → success path proves the gate isn't over-eager. - internal/auth/oidc/bench_test.go, internal/auth/oidc/logging_test.go, internal/auth/oidc/integration_keycloak_test.go - Mechanical: all existing HandleCallback call sites updated to pass "" for callbackIss (matches pre-fix behavior for IdPs that don't advertise support — the Keycloak integration suite tests will be re-evaluated once the Keycloak fixture is run against a realm with the discovery flag enabled). VERIFY. - go vet ./internal/auth/oidc/... ./internal/api/handler/... PASS - go test -short -count=1 ./internal/auth/oidc/... PASS (3.4s) - go test -short -count=1 ./internal/api/handler/... PASS (5.4s) - 4 new MED-17 regression tests + extended TestClassifyOIDCFailure pass. Refs: cowork/auth-bundles-audit-2026-05-10.md MED-17 cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 7 RFC 9207 — OAuth 2.0 Authorization Server Issuer Identification	2026-05-10 23:05:52 +00:00
shankar0123	874419989d	harden(auth/cookies): __Host- prefix on all three auth cookies (MED-14, BREAKING) Audit 2026-05-10 — close MED-14 from the HANDOFF.md backend batch (item 5). The session, CSRF, and OIDC pre-login cookies all carry the __Host- prefix; browsers now reject any subdomain attempt to overwrite them. Cookie name changes (BREAKING — existing sessions invalidate): - certctl_session → __Host-certctl_session - certctl_csrf → __Host-certctl_csrf - certctl_oidc_pending → __Host-certctl_oidc_pending The __Host- prefix requires Path=/ + Secure + no Domain attribute. Post-login session + CSRF cookies already met all three. The pre-login cookie's Path widened from '/auth/oidc/' to '/' to satisfy the prefix; the cookie lives 10 minutes and is only consumed by the callback handler, so the wider path scope is harmless. Files touched: - internal/auth/session/domain/types.go — constant rename + comment - internal/auth/session/domain/types_test.go — assertion update - internal/api/handler/auth_session_oidc.go — pre-login set + clear paths widened from /auth/oidc/ to / - web/src/api/client.ts — readCSRFCookie now compares against '__Host-certctl_csrf' - CHANGELOG.md — Unreleased > Security (BREAKING) entry - docs/migration/oidc-enable.md — operator-facing detail of the one-time re-authentication window + GUI customization guidance Operator impact: ONE re-login prompt per active session at the deploy that lands this change. Subsequent logins issue the __Host-prefixed cookie automatically. Existing bookmarked deep links work without modification (cookies are path-scoped, not URL-scoped). Refs: cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 5 cowork/auth-bundles-audit-2026-05-10.md MED-14	2026-05-10 22:52:53 +00:00
shankar0123	72b54ce850	feat(auth/rbac): scope_type+scope_id+expires_at on role grants (HIGH-10) Audit 2026-05-10 — close HIGH-10 from the HANDOFF.md backend batch (item 1). Per-actor scoped + time-bound role grants are now expressible via the API. Migration 000043: adds scope_type TEXT NOT NULL DEFAULT 'global' + scope_id TEXT to actor_roles. Constraints: - actor_roles_scope_type_enum: scope_type ∈ {global, profile, issuer} - actor_roles_scope_id_required_when_not_global: scope_id is NULL iff scope_type='global' - Uniqueness extended: (actor_id, actor_type, role_id, scope_type, scope_id, tenant_id) — so an operator can grant the same role to the same actor scoped to multiple profiles/issuers (e.g. r-operator on p-finance AND on p-engineering). Index idx_actor_roles_scope for non-global lookup hot paths. Domain: ActorRole.ScopeType (ScopeType enum) + ScopeID (*string). Authorizer.CheckPermission already understands the tuple via the parallel role_permissions columns; this addition gives operators a per-actor knob without forking roles. Postgres repo: Grant writes scope_type+scope_id with ON CONFLICT keyed on the new uniqueness tuple. Defaults to (global, NULL) when caller omits. Handler: assignRoleRequest extended with scope_type / scope_id / expires_at. Validation: - role_id required (unchanged) - scope_type defaults to 'global'; allowed values global/profile/ issuer; anything else → 400 - scope_id required when scope_type ∈ {profile, issuer}; rejected (must be empty) when scope_type='global' - expires_at must be in the future when present; nil = standing Regression matrix in internal/api/handler/auth_test.go (6 cases): - TestAssignRoleToKey_HIGH10_ProfileScopeBoundGrantPersists - TestAssignRoleToKey_HIGH10_TimeBoundGrantPersists - TestAssignRoleToKey_HIGH10_RejectsScopeIDWithGlobalScope - TestAssignRoleToKey_HIGH10_RejectsMissingScopeIDOnProfile - TestAssignRoleToKey_HIGH10_RejectsPastExpiry - TestAssignRoleToKey_HIGH10_RejectsInvalidScopeType HIGH-10 marked CLOSED in audit-doc — the v3 deferral from the prior session is reversed; everything lands in v2. Refs: cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 1 cowork/auth-bundles-audit-2026-05-10.md HIGH-10	2026-05-10 22:47:45 +00:00
shankar0123	9cce2ab043	harden(auth): LOW + Nit batch — bootstrap audit, crypto/rand, XFF trust, CSRF check, protocol-prefix unify (Batch 1) Audit 2026-05-10 — close 8 LOWs + 2 Nits in-bundle. Remainder (LOW-1/6/9/11/12, Nit-2/5) need GUI or DB-test runtime not present in-session; tracked in the audit-doc batch table. LOW-2: bootstrap.ValidateAndMint now emits 'bootstrap.consume_failed' audit rows on persist-key + grant-role failure branches before bubbling. Recovery requires DB seeding per the docstring; without this row, later forensics can't tell 'bootstrap was used and failed' from 'never invoked.' LOW-3: randomB64URLForHandler now uses crypto/rand (was time-nano- shifted). Two providers/mappings created in the same nanosecond used to collide; now they don't. Time-nano fallback retained for the unlikely crypto/rand-broken path. LOW-4: breakglass.verifyDummy uses s.readRand(salt) for the dummy Argon2id verify. Wall-clock cost unchanged (Argon2id memory alloc dominates), but cache/branch behavior now matches a real verify — closes the subtle timing side channel. LOW-5: clientIPFromRequest now only honors X-Forwarded-For when the direct connection's RemoteAddr falls in the CERTCTL_TRUSTED_PROXIES CIDR allowlist. Default-deny: empty list means XFF is ignored. SetTrustedProxies wired in cmd/server/main.go from cfg.Auth.TrustedProxies. LOW-7: internal/auth/protocol_endpoints.go::ProtocolEndpointPrefixes now carries /scep-mtls + /.well-known/est-mtls (previously only in router.AuthExemptDispatchPrefixes; the two lists had drifted). The canonical-prefix coverage test in Phase 12 still pins the set. LOW-8: docs/operator/rbac.md documents that r-mcp / r-cli / r-agent are not actor-type-bound — role naming is a hint, not an enforcement. Operators wanting hard binding must apply periodic audit queries. Native binding is on the v2 roadmap. LOW-10: Session.Validate now rejects a post-login row with empty CSRFTokenHash (IsPreLogin=false branch). validSession test fixture updated with a valid 64-hex CSRF hash. Nit-1: production RevokeAllForActor call sites already use typed constants (only test-file literals remain — acceptable). Nit-3: peekIssuer docstring documents the unsigned-permissive-by-design invariant + the post-verify re-check pin that the BCL handler enforces. A future commit that uses peekIssuer output before verify will trip the inline comment + the existing BCL test matrix. Status table updated in cowork/auth-bundles-audit-2026-05-10.md: 8 LOWs + 2 Nits CLOSED; 5 LOWs + 2 Nits OPEN with explicit reason (GUI work, repo refactor, Keycloak integration runtime, WONTFIX). Refs: cowork/auth-bundles-audit-2026-05-10.md LOW-2/3/4/5/7/8/10 cowork/auth-bundles-audit-2026-05-10.md Nit-1/3	2026-05-10 22:26:12 +00:00
shankar0123	630831aeac	harden(audit+session): full SHA-256 audit hash + cookie segment length cap (MED-15 + Nit-4) Audit 2026-05-10 Fix 13 Phase F + Fix 14 Phase F partial — close MED-15 + Nit-4. Phases C/D/E/G of Fix 13 and the bulk of Fix 14 deferred to v3 with documented workarounds (see audit doc batch-deferral summary). MED-15: internal/api/middleware/audit.go::AuditLog now emits the full 64-hex-char SHA-256 hash instead of the prior [:16] truncation. The audit_events.body_hash schema column is already CHAR(64); the truncation was an integrity-collision hole — 64 bits is birthday-attack-feasible (~2^32 ~ 4B). Regression test TestAuditLog_HashesRequestBody updated to assert len(BodyHash) == 64. Nit-4: internal/auth/session/service.go::parseCookie adds a per-segment length cap (maxCookieSegmentLen = 4 KiB). Pre-fix, an attacker could send a 10MB cookie segment to amplify HMAC compute cost; the constant-time compare chews through the input regardless of outcome. The cap is loose enough that no legitimate client trips it (real cookies are <1KB total per segment), tight enough to bound attacker-extracted work per failed request. Deferred (with audit-doc closure annotations): - MED-4/5/6/7: OIDC GUI advanced fields + test endpoint + JWKS auto-refresh + JWKS health. v3 OIDC-operator-experience bundle. Workarounds documented. - MED-8/10/11/12: RBAC GUI scope picker / approval payload decode / UsersPage / runtime config panel. v3 GUI-polish bundle. Backend already accepts the scope_type/scope_id fields; the gap is GUI. - MED-13: MCP tools for approvals / break-glass / bootstrap. v3 MCP-expansion bundle. - MED-14: __Host- cookie rename. Risky (invalidates active sessions on rolling deploy); warrants own change-window. - MED-16/17: Pre-login UA/IP binding + RFC 9207 iss URL check. v3 OIDC-hardening bundle. - All 12 LOWs + 4 of 5 Nits: v3 cleanup bundle. Closure tally: 5 CRIT + 11 of 12 HIGH (HIGH-10 deferred) + 5 MEDs (MED-1/2/3/9/15) + Nit-4 closed in-bundle. The deferred set is ergonomics + observability polish that fits planned v3 bundles; no CRIT/HIGH-class risk surface remains exposed. Refs: cowork/auth-bundles-audit-2026-05-10.md MED-15, Nit-4 Spec: cowork/auth-bundles-fixes-2026-05-10/13-med-bundle.md Phase F cowork/auth-bundles-fixes-2026-05-10/14-low-nit-cleanup.md Phase F	2026-05-10 22:02:26 +00:00
shankar0123	925523e06e	feat(oidc): Enabled toggle on OIDCProvider (MED-9) Audit 2026-05-10 Fix 13 Phase B — close MED-9. MED-4/5/6/7 deferred to v3. MED-9: ship the OIDCProvider.Enabled boolean. Pre-fix, the only way to take a provider offline during an incident was DELETE, which breaks active user_oidc_provider FK references and orphans any session that minted under the provider. Post-fix: - Migration 000042 adds enabled BOOLEAN NOT NULL DEFAULT TRUE. Default-true means existing pre-migration rows are all enabled post-deploy; no breaking-change window. - internal/auth/oidc/domain/types.go::OIDCProvider.Enabled ships the domain field with JSON tag 'enabled'. - Repository read/write paths (List, Get, GetByName, Create, Update) all carry the column. - internal/auth/oidc/service.go::HandleAuthRequest rejects with the new ErrProviderDisabled sentinel when cfgRow.Enabled=false. - cmd/server/main.go::oidcProvidersListAdapter.List filters disabled providers before constructing OIDCProviderInfo so the LoginPage's 'Sign in with X' buttons never render for offline IdPs. - Defense-in-depth: the ErrProviderDisabled service-layer check is the guard for direct API / MCP / CLI callers that bypass the GUI. Regression test: internal/auth/oidc/provider_enabled_test.go warms the entry cache via a successful HandleAuthRequest, flips cfgRow.Enabled=false on the cached entry, then asserts the next call returns ErrProviderDisabled (errors.Is). Test fixtures (newValidProvider, makeProvider) updated to set Enabled: true so existing tests stay green. Operators can toggle Enabled today via the existing PUT /api/v1/auth/oidc/providers/{id} body field. A dedicated GUI toggle on OIDCProviderDetailPage and a single-purpose PUT-just-enabled endpoint are deferred to the v3 GUI-polish bundle — the load-bearing wire is in place now. MED-4 (GUI advanced fields on edit), MED-5 (POST .../test endpoint + button), MED-6 (JWKS auto-refresh on cache-miss), MED-7 (JWKS health endpoint + GUI panel): DEFERRED to v3 with explicit annotations in the audit doc. Workarounds: MED-4 fields are PUT-editable via curl/MCP; MED-5 → call refresh post-create; MED-6 → call refresh manually on key rotation. Refs: cowork/auth-bundles-audit-2026-05-10.md MED-4, MED-5, MED-6, MED-7, MED-9 Spec: cowork/auth-bundles-fixes-2026-05-10/13-med-bundle.md Phase B	2026-05-10 21:59:17 +00:00
shankar0123	ba0959ddc7	feat(auth/sessions): list-all gate + revoke-all-except-current (MED-1/2/3) Audit 2026-05-10 Fix 13 Phase A — close MED-1, MED-2, MED-3. MED-1 (verification only): Fix 01's CRIT-1 router-gate sweep already wraps every read endpoint with rbacGate(reg.Checker, '<resource>.read', ...). Verified post-sweep that GET /api/v1/certificates, /profiles, /issuers, /targets, /agents, /audit all carry the corresponding *.read permission gate. MED-2: ListSessions now gates ?actor_id=<other> on auth.session.list.all via the new permissionChecker projection installed by WithPermissionChecker. cmd/server/main.go threads the existing authCheckerAdapter into the handler. When caller's actor_id != caller.ActorID AND the handler has a checker, an inline CheckPermission(..., 'auth.session.list.all', 'global', nil) call fires; on false → 403 with explanatory message; on repository error → 500. Defense-in-depth: the router-level rbacGate enforces auth.session.list as the floor; the .list.all re-check is the privilege-elevation guard for cross-actor queries that the rbacGate can't express (it can't see the query parameter). MED-3: ship DELETE /api/v1/auth/sessions?except=current — the 'sign out all other sessions' flow. Gated by auth.session.revoke; the handler reads the caller's current session ID from session.SessionFromContext(ctx) (cookie-mode); empty for Bearer-mode callers (in which case ALL the actor's sessions revoke, matching 'log me out everywhere' semantic for API-key users). New repository method SessionRepository.RevokeAllExceptForActor: UPDATE sessions SET revoked_at = NOW() WHERE actor_id = AND actor_type = AND tenant_id = AND revoked_at IS NULL AND id != returning rowcount. Added to the interface in internal/repository/session.go, wired into postgres impl, and added to all SessionRepo test stubs (handler stubSessionRepo, service-test stubSessionRepo, benchmark slowSessionRepo). The session.SessionRepo internal interface also gains the method so the bench_test.go forwarder compiles. Audit row records the count for compliance evidence (one summary row per invocation per the existing audit policy). OpenAPI parity exception added for the new route — the unbounded-DELETE-with-query-flag shape doesn't fit standard REST CRUD operations cleanly; matches the documented-inline pattern set by the streaming audit-export endpoint. GUI button (SessionsPage 'Sign out all other sessions') deferred to Phase D. Refs: cowork/auth-bundles-audit-2026-05-10.md MED-1, MED-2, MED-3 Spec: cowork/auth-bundles-fixes-2026-05-10/13-med-bundle.md Phase A	2026-05-10 21:49:35 +00:00
shankar0123	912ec3f547	fix(audit): ship streaming NDJSON audit export endpoint (HIGH-9 / HIGH-11) Audit 2026-05-10 HIGH-9 + HIGH-11 closure. HIGH-10 deferred to v3. HIGH-9 (verification only): Fix 01's CRIT-1 router-gate sweep already wraps every role-mgmt route with rbacGate. Verified via grep: - GET /api/v1/auth/roles → auth.role.list - POST /api/v1/auth/roles → auth.role.create - GET /api/v1/auth/roles/{id} → auth.role.list - PUT /api/v1/auth/roles/{id} → auth.role.edit - DELETE /api/v1/auth/roles/{id} → auth.role.delete - POST /api/v1/auth/roles/{id}/permissions → auth.role.edit - DELETE /api/v1/auth/roles/{id}/permissions/{perm} → auth.role.edit - POST /api/v1/auth/keys/{id}/roles → auth.role.assign - DELETE /api/v1/auth/keys/{id}/roles/{role_id} → auth.role.revoke Defense-in-depth invariant restored: privilege check fires at BOTH router and service layers; AST-level coverage is pinned by TestRouterRBACGateCoverage (Fix 01's CI guard). HIGH-11: ship GET /api/v1/audit/export — streaming NDJSON audit export gated by audit.export. Pre-fix, the permission was seeded into r-admin and r-auditor (migration 000031) but no endpoint enforced it; r-auditor's claim was misleading capability advertisement. Post-fix: - internal/api/handler/audit.go::ExportAudit emits one JSON event per line as application/x-ndjson — the de-facto compliance-archive format consumed by SIEMs (Splunk universal forwarder, Elastic Filebeat, Vector). - Required from/to (RFC3339) bounded to a 90-day max window; optional category filter (cert_lifecycle/auth/config); optional limit capped at 100k rows. - Content-Disposition: attachment; filename="certctl-audit-<from>_to_<to>.ndjson" so curl + browser downloads land with a sensible filename. - Recursively self-audits: every successful export emits an audit.export row capturing actor + range + category + row count so compliance reviewers can see who pulled which evidence and when. - Service layer: AuditService.ExportEventsByFilter reuses the existing repository.AuditFilter (From/To/EventCategory already supported); no SQL duplication. - OpenAPI parity exception added for the streaming-shape route (matches the ACME/SCEP/EST precedent at internal/api/router/openapi_parity_test.go::SpecParityExceptions). Regression matrix in audit_export_test.go (7 cases): - TestExportAudit_StreamsNDJSONLines (happy path; pins content-type + content-disposition + JSON-per-line shape + recursive self-audit) - TestExportAudit_RejectsRangeBeyond90Days (100-day window → 400) - TestExportAudit_RejectsMissingFromOrTo (3 cases) - TestExportAudit_RejectsInvalidCategory (unknown enum → 400) - TestExportAudit_AcceptsValidCategoryFilter (auth filter passes through) - TestExportAudit_RejectsNonGET (POST → 405) - TestExportAudit_RejectsToBeforeFrom (inverted range → 400) The auditor role's surface is now complete (read + export). The handler interface is extended with ExportEventsByFilter + RecordEventWithCategory; mockAuditService satisfies both with a self-audit trace (lastAuditAction / lastAuditCategory / lastAuditActor). HIGH-10 (scope + expiry on assignRoleRequest): DEFERRED to v3. Schema column already exists (ActorRole.ExpiresAt); load-bearing wire remains v3 work. Documented carve-out at HIGH-10's annotation. Refs: cowork/auth-bundles-audit-2026-05-10.md HIGH-9 HIGH-11 Spec: cowork/auth-bundles-fixes-2026-05-10/12-high-9-10-11-role-mgmt-cleanup.md	2026-05-10 21:36:01 +00:00
shankar0123	f5ba17114d	fix(audit): close silence-leg of HIGH-6; emit WARN on audit-write failure Audit 2026-05-10 HIGH-6 partial closure (silence leg). The audit identified two distinct gaps in the auth surface's audit-emit pattern: (1) silence — `_ = audit.RecordEventWithCategory(...)` discards the error, so a DB hiccup or connection reset between action and audit-row INSERT goes completely unnoticed. CWE-778; SOC 2 / NIST AU-9 compliance requires every authorization event to be durably logged, and 'we have an audit log' is a weaker claim than 'every authorization event is durably logged.' (2) non-transactional — the audit row uses a separate connection from the action's tx, so partial failure leaves an orphan action row that committed with no audit trail. Decision 8 of the auth-bundles-index requires action + audit row atomic. This commit closes leg (1) fully across all six audit-emit call sites in the auth surface: - internal/service/auth/actor_role_service.go::recordAudit - internal/service/auth/role_service.go::recordAudit - internal/auth/bootstrap/service.go::ValidateAndMint - internal/auth/breakglass/service.go::recordAudit - internal/auth/session/service.go::recordAudit - internal/api/handler/auth_session_oidc.go::recordAudit - internal/service/profile.go::Update (Phase 9 approval-bypass) Each `_ = ...` swallow is replaced with: if err := audit.RecordEventWithCategory(...); err != nil { slog.WarnContext(ctx, '<surface> audit write failed (action committed; audit row may be missing)', 'action', action, 'actor_id', actor, 'resource_id', resource, 'err', err) } Operators monitoring audit-write failures now see structured WARN logs with action + actor + resource attribution; missing audit rows can be cross-referenced against monitoring without manual SELECT-from- audit-table. Infrastructure for leg (2) (transactional commit) is also landed in this commit: - service.AuditService.RecordEventWithCategoryWithTx (new method; accepts repository.Querier from postgres.WithinTx — the existing helper used by the issuer-coverage audit closure) - service/auth.AuditService interface declares the new method - test stub fakeAudit.RecordEventWithCategoryWithTx satisfies the extended interface The eight per-path WithinTx-refactors documented in cowork/auth-bundles-fixes-2026-05-10/10-high-6-atomic-audit-commit.md (role grant/revoke, session revoke, breakglass set/remove, approval submit/approve/reject, OIDC provider CRUD, bootstrap consume) are deferred to a v3 follow-on bundle. Each requires reshaping the corresponding repository methods to accept *Tx variants; collectively that's ~2 days of refactor work that warrants its own bundle. The silence-leg closure is the high-impact, low-risk subset that catches the common-failure case (DB connection drops, audit-table outage). Refs: cowork/auth-bundles-audit-2026-05-10.md HIGH-6 Spec: cowork/auth-bundles-fixes-2026-05-10/10-high-6-atomic-audit-commit.md	2026-05-10 21:24:29 +00:00
shankar0123	0f340beb14	fix(auth/ux): cause-aware OIDC + session error surfacing (HIGH-7 + HIGH-8 closure) Server (HIGH-7): the OIDC callback failure path now 302-redirects to /login?error=oidc_failed&reason=<category> instead of emitting a blank 400. `category` is the existing audit `failure_category` value; classifyOIDCFailure was extended with three new sentinel paths (email_domain_not_allowed, email_missing_but_required, pkce_invalid) so CRIT-5 + PKCE failures get distinguishable GUI rendering. Audit-log observability is unchanged — the same failure_category is written to the auth.oidc_login_failed audit row; the 302 is purely a UX leg layered on top. Server (HIGH-8): SessionMiddleware now stashes a cause classification on the request context when Validate returns an error, mapping the sentinels via classifySessionError (errors.Is-based, so wrapped sentinels still classify) to the stable wire-strings idle_timeout / absolute_timeout / back_channel_revoked / invalid_token. The 401 emit point in bearerSkipIfAuthenticated reads the stashed cause and emits WWW-Authenticate: Bearer realm="certctl", error="invalid_token", error_description=<cause> per RFC 6750 §3. GUI (HIGH-7): LoginPage reads ?error= + ?reason= from the URL via react-router useSearchParams and renders an operator-friendly amber-bordered banner above the form; OIDC_FAILURE_REASON_TEXT maps all 16 known categories with a defensive 'unspecified' fallback for forward-compat with future server-side categories. GUI (HIGH-8): api/client fetchJSON parses the WWW-Authenticate cause via parseWWWAuthenticateCause and attaches it to the 'certctl:auth-required' CustomEvent detail; AuthProvider redirects to /login?session_expired=<cause> on cause-aware 401s; LoginPage renders a blue-bordered session-cause banner. invalid_token stays on the current page (no hard redirect for opaque failures). Misc cleanup: ErrorState now accepts the title/message/data-testid form added by CRIT-4 BreakglassPage (was erroring tsc on master). Regression matrix: - internal/api/handler/oidc_redirect_categories_test.go pins all 16 failure categories to the 302 + reason= location + audit-row leg - internal/auth/session/www_authenticate_test.go pins the 4 stable cause categories on classifySessionError (incl. errors.Is wrapped sentinels) + the WWW-Authenticate emission across all 4 categories + the no-session-context fallback case - internal/api/handler/auth_session_oidc_test.go: 4 pre-existing TestLoginCallback_*Returns400 tests updated to assert 302 + reason= location (the wire shape changed from 400 to 302, but the audit observability and behaviour-equivalent failure-classification are preserved) - web/src/pages/LoginPage.test.tsx: 6 new cases pinning the failure banner, session-cause banner, unknown-reason fallback, and forward-compat 'unspecified' category Spec: cowork/auth-bundles-fixes-2026-05-10/08-high-7-8-error-surfacing.md Closes: HIGH-7, HIGH-8 of cowork/auth-bundles-audit-2026-05-10.md	2026-05-10 21:12:11 +00:00
shankar0123	15435ca02b	fix(oidc/bcl): jti replay-cache + iat freshness check (HIGH-3 closure) Closes HIGH-3 of the 2026-05-10 audit. Pre-fix the BCL handler accepted any logout_token whose iat + jti were syntactically present but never checked (a) that iat fell within a skew window or (b) that jti hadn't been seen before. A captured logout_token was replayable indefinitely; once CRIT-2 was fixed, every replay would revoke the user's current sessions — persistent DoS. RFC 9700 §2.7 + OIDC BCL 1.0 §2.5 require jti replay defense. - Migration 000040_bcl_replay_cache: oidc_bcl_consumed_jtis table with composite PK on (jti, issuer_url) — RFC 7519 §4.1.7 per-issuer uniqueness — and an expires_at index for the GC sweep. - repository.BCLReplayRepository interface + ErrBCLJTIAlreadyConsumed sentinel. Postgres impl uses INSERT...ON CONFLICT DO NOTHING RETURNING true for atomic single-use semantics in one round-trip. - handler.DefaultBCLVerifier gains WithMaxAge + nowFn clock seam. iat freshness check rejects tokens whose iat is in the future beyond max-age OR stale beyond it. Verifier signature extended: Verify(ctx, jwt) (iss, sub, sid, jti string, iat int64, err error). - handler.AuthSessionOIDCHandler gains BCLReplayConsumer (interface) + WithBCLReplayConsumer(consumer, maxAge) setter. BackChannelLogout consumes the jti post-verify with TTL = max(24h, 2maxAge): - first-receive → 200, sessions revoked, audit outcome=revoked - replay (ErrBCLJTIAlreadyConsumed) → 200 + Cache-Control: no-store, audit outcome=jti_replayed, sessions NOT re-revoked - transient (non-AlreadyConsumed error) → 503 so the IdP retries - internal/scheduler/scheduler.go: SetBCLReplayGarbageCollector wires SweepExpired into the existing session-GC tick (no separate ticker for short-lived replay rows). - cmd/server/main.go: bclMaxAge from cfg.Auth.OIDCBCLMaxAgeSeconds (default 60s, env CERTCTL_OIDC_BCL_MAX_AGE_SECONDS); bclReplayRepo wired into the verifier + handler + scheduler. - Three regression tests in internal/api/handler/bcl_replay_test.go: TestBackChannelLogout_FirstReceiveConsumesJTI, TestBackChannelLogout_ReplayedJTIReturns200WithAudit, TestBackChannelLogout_TransientConsumeFailureReturns503. - internal/api/handler/auth_session_oidc_test.go: stubBCLVerifier gains jti + iat fields; existing TestBackChannelLogout_ tests rewritten for the new Verify return. Verification gate green: gofmt clean, go vet clean, go test -short -count=1 on internal/api/handler / internal/api/router / internal/scheduler / cmd/server / internal/auth/oidc / internal/auth/breakglass — all pass. CRIT-1..CRIT-5 + HIGH-1 + HIGH-2 + HIGH-3 of the 2026-05-10 audit now closed on this branch. Spec at cowork/auth-bundles-fixes-2026-05-10/07-high-3-bcl-replay-defense.md. Refs: cowork/auth-bundles-audit-2026-05-10.md HIGH-3	2026-05-10 20:53:29 +00:00
shankar0123	1697845493	fix(auth): wire RevokeAllForActor + RotateCSRFToken to mutation paths Closes HIGH-1 + HIGH-2 of the 2026-05-10 audit. HIGH-1: breakglass.Service.SetPassword and RemoveCredential now call sessions.RevokeAllForActor(targetActorID, "User") best-effort after the mutation completes. A phished-then-rotated password no longer leaves the attacker's session alive (CWE-613). Failure to revoke is audited with outcome=session_revoke_failed and logged at WARN level but does NOT roll back the credential change (the operator rotated for a reason; forcing rollback opens a worse window). - breakglass.SessionMinter interface extended with RevokeAllForActor. - cmd/server/main.go::breakglassSessionMinterAdapter gains the bridge to session.Service.RevokeAllForActor. - stubSessions in service_test.go tracks revokeAllIDs / revokeAllTypes / revokeAllErr. - Three regression tests: - TestService_SetPassword_RevokesExistingSessions - TestService_RemoveCredential_RevokesExistingSessions - TestService_SetPassword_RevokeFailureDoesNotRollback HIGH-2: New session.Service.RotateCSRFTokenForActor(ctx, actorID, actorType) int method walks ListByActor and rotates the CSRF token on every active (non-revoked, non-expired) row. Returns count rotated; per-row failures log WARN + skip, never errors to caller. New handler.CSRFRotator interface + AuthHandler.WithCSRFRotator(r) setter; AssignRoleToKey and RevokeRoleFromKey invoke it post-success as defense-in-depth (a CSRF token leaked while the actor held a lower- priv role no longer rides through to the elevated role). - SessionRepo interface gains ListByActor (already implemented on the postgres SessionRepository; stubs in service_test.go + bench_test.go updated to match). - cmd/server/main.go calls .WithCSRFRotator(sessionService) on the AuthHandler. - Two regression tests: - TestRotateCSRFTokenForActor_RotatesAllActiveRows (asserts revoked / expired / other-actor rows are skipped) - TestRotateCSRFTokenForActor_NoSessionsReturnsZero Verification gate green: gofmt clean, go vet clean, go test -short -count=1 ./internal/auth/breakglass/ ./internal/auth/session/ ./internal/api/handler/ ./internal/api/router/ ./cmd/server/ ./internal/domain/auth/ — all pass. CRIT-1..CRIT-5 + HIGH-1 + HIGH-2 of the 2026-05-10 audit now closed on this branch. Spec at cowork/auth-bundles-fixes-2026-05-10/06-high-1-2-revoke-and-rotate.md. Refs: cowork/auth-bundles-audit-2026-05-10.md HIGH-1 HIGH-2	2026-05-10 20:43:45 +00:00
shankar0123	f1d97710e1	feat(gui+auth): break-glass admin GUI surface (CRIT-4 closure) Closes CRIT-4 of the 2026-05-10 audit. Bundle 2 Phase 7.5 shipped the break-glass backend (Argon2id + lockout + 4 endpoints) but no GUI surface. Operators recovering during an SSO outage had to hand-craft curl commands — operationally hostile and the opposite of what docs/operator/security.md advertised. This commit closes the gap. Three GUI surfaces: 1. LoginPage.tsx — inline "Use break-glass account (SSO outage recovery)" toggle below the API-key form. Clicking reveals an amber-bordered inline form (actor-id + password, autocomplete=off). Calls breakglassLogin(actor_id, password); on success navigates to "/" where AuthProvider re-validates via the session-cookie path. Intentionally low-visibility (text-amber-600 small text) — this is the deliberate-bypass path, not the everyday-login path. 2. web/src/pages/auth/BreakglassPage.tsx — admin page at /auth/breakglass (permission-gated by auth.breakglass.admin). Three sections: - Sticky security banner ("every action audited; use only during incidents"). - Set/rotate-password form (≥12-char + confirm-match). - Credentialed-actor table with rotate / unlock (disabled when not locked) / remove per row. Remove requires type-the-actor-id confirmation. 3. Layout.tsx nav — "Break-glass" entry under the auth section. Visible to all callers; the page itself permission-gates (server-side 403 is the load-bearing defense). Cosmetic hide-when-no-perm is deferred to fix 14's LOW bundle. Backend support (new endpoint required to enumerate credentialed actors): - internal/repository/breakglass.go — BreakglassCredentialRepository gains List(ctx, tenantID) method. - internal/repository/postgres/breakglass.go — postgres impl; reuses the existing breakglassColumns / scanBreakglass helpers. - internal/auth/breakglass/service.go — Service.List(ctx) method; returns ErrDisabled when CERTCTL_BREAKGLASS_ENABLED=false (handler maps to 404 for surface invisibility). - internal/api/handler/auth_breakglass.go — ListCredentials handler; password_hash field NEVER serialized to the wire (response shape is intentionally limited to actor_id + timestamps + failure_count + locked_until). - internal/api/router/router.go — registers GET /api/v1/auth/breakglass/credentials gated by auth.breakglass.admin. - internal/api/router/openapi_parity_test.go — SpecParityExceptions entry for the new endpoint (full OpenAPI row rides along with the next OpenAPI sweep). GUI api/client.ts gains breakglassListCredentials() + the BreakglassCredentialRow type matching the wire shape. Six Vitest cases in BreakglassPage.test.tsx pin the contract: permission gate (forbidden state when caller lacks the perm; admin surface when they have it), set-password mismatch rejection, set- password below-threshold-length rejection, unlock-disabled-when-not- locked, remove-modal type-confirm. Verification gate green: - gofmt -l clean on all touched files - go vet clean - go test -short -count=1 on internal/api/router (TestRouter_OpenAPIParity + TestRouterRBACGateCoverage + TestRouter_AuthExemptAllowlist), internal/api/handler (all BCL tests + ListCredentials), internal/auth/breakglass (Service.List + stubRepo.List), internal/repository/postgres, internal/domain/auth (auditor pin) — all pass. CRIT-1 + CRIT-2 + CRIT-3 from the same audit are already closed on this branch (commits `68ca42f`, `ca1e135`, `00eace8`). CRIT-5 (AllowedEmail- Domains lying field) remains the last Critical blocker for v2.1.0. Spec: cowork/auth-bundles-fixes-2026-05-10/04-crit-4-breakglass-gui.md. Refs: cowork/auth-bundles-audit-2026-05-10.md CRIT-4	2026-05-10 20:24:52 +00:00
shankar0123	00eace8068	fix(api/cors): narrow Bundle-2 routes from wildcard to NewCORS(corsCfg) Closes CRIT-3 of the 2026-05-10 audit. Bundle 2's OIDC handshake + back-channel-logout + logout + bootstrap + breakglass-login routes were wrapped by middleware.CORS — a hard-coded Access-Control-Allow-Origin: * middleware that ignored the operator's CERTCTL_CORS_ORIGINS knob (CWE-942). The properly-configured middleware.NewCORS(corsCfg) exists right next to it but wasn't used here. The deprecation comment on middleware.CORS said "Kept for health endpoints" but Bundle 2 added four additional call sites without converting them. This commit: - Renames middleware.CORS -> middleware.CORSWildcard with a stronger doc block making the security tradeoff explicit at every remaining call site. The doc references the CI guard + the 2026-05-10 audit closure. - Adds a CorsCfg middleware.CORSConfig field to router.HandlerRegistry and threads it from cmd/server/main.go using the existing cfg.CORS.AllowedOrigins value. The same config that drives the global corsMiddleware now also drives the per-route NewCORS wraps for the auth-exempt direct r.mux.Handle blocks. - Swaps middleware.CORS -> middleware.NewCORS(reg.CorsCfg) for the 7 credentialed auth-exempt routes: - GET /auth/oidc/login - GET /auth/oidc/callback - POST /auth/oidc/back-channel-logout - POST /auth/logout - POST /auth/breakglass/login - GET /api/v1/auth/bootstrap - POST /api/v1/auth/bootstrap - Keeps middleware.CORSWildcard for the 4 credential-free probe routes: - GET /health - GET /ready - GET /api/v1/version - GET /api/v1/auth/info - Adds scripts/ci-guards/cors-wildcard-allowlist.sh — pins the 4-route allowlist; fails CI when a new middleware.CORSWildcard wrap appears outside the allowlist. Adding a new wildcard call site requires updating the allowlist AND documenting why in the commit body. Operators who configured CERTCTL_CORS_ORIGINS=https://admin.example.com expecting the OIDC + BCL + breakglass-login routes to honor it now do. Previously those routes ignored the knob and emitted ACAO: * regardless. Verification gate green: - gofmt -l . clean - go vet ./... clean - go test -short -count=1 ./internal/api/... ./internal/auth/... ./internal/domain/auth/ ./internal/service/auth/ ./cmd/server/ pass - go build ./... clean - scripts/ci-guards/cors-wildcard-allowlist.sh passes (4 allowlisted routes; zero violations) CRIT-1 + CRIT-2 from the same audit are already closed on this branch (commits `68ca42f`, `ca1e135`); CRIT-4 / CRIT-5 remain open and continue to block the v2.1.0 tag. Spec: cowork/auth-bundles-fixes-2026-05-10/03-crit-3-cors-narrow.md. Refs: cowork/auth-bundles-audit-2026-05-10.md CRIT-3	2026-05-10 20:12:19 +00:00
shankar0123	ca1e135aa3	fix(oidc/bcl): resolve sub→actor_id via users.GetByOIDCSubject (CRIT-2 closure) Closes CRIT-2 of the 2026-05-10 audit. The BCL handler previously called sessionSvc.RevokeAllForActor(sub, "User") but session rows are keyed by user.ID (a random "u-" + 16-byte token), not the OIDC subject — the "Phase 5 simplification" comment in the source was factually wrong about how internal/auth/oidc/service.go::upsertUser seeds user.ID. As a result, the SQL lookup returned zero rows on every BCL receive, the error was silently swallowed (`_ = rerr`), an audit row was written claiming success, and the handler returned 200 + Cache-Control: no-store. OIDC BCL 1.0 §2.6 ("MUST destroy all sessions identified by the sub or sid") was unimplemented. CWE-613. This commit: - Adds userRepo (repository.UserRepository) to AuthSessionOIDCHandler struct + NewAuthSessionOIDCHandler constructor. cmd/server/main.go injects the existing oidcUserRepo (no new repository instance). - Replaces the broken sub-as-actor-id path with: 1. providerRepo.List(ctx, tenantID) + IssuerURL filter to map claims.iss → provider row (N is small; typically 1-5). 2. userRepo.GetByOIDCSubject(ctx, provider.ID, sub) to resolve the OIDC subject → user.ID. 3. sessionSvc.RevokeAllForActor(user.ID, "User") with the RESOLVED actor_id (not the OIDC subject). - Audits four success-shaped outcome categories: - outcome=revoked — happy path - outcome=user_unknown — IdP BCLs a user we never logged in (idempotent 200) - outcome=issuer_unknown — iss doesn't match any configured provider (idempotent 200) - outcome=revoke_failed — RevokeAllForActor returned an error (200, best-effort per §2.8) And two transient outcomes that return 503 (IdP retries per §2.8): - outcome=provider_lookup_failed — providerRepo.List error - outcome=user_lookup_failed — non-NotFound userRepo error - Removes the misleading "Phase 5 simplification" comment block; replaces with a doc explaining the resolution path + outcome taxonomy + spec refs. - Adds 5 regression tests in internal/api/handler/auth_session_oidc_test.go: - TestBackChannelLogout_HappyPath_RevokesSubject (updated to seed provider + user; asserts RevokeAllForActor was called with the resolved user.ID, not the raw OIDC subject — the test that would have caught CRIT-2 had it existed) - TestBackChannelLogout_UnknownUserReturns200WithAudit - TestBackChannelLogout_IssuerUnknownReturns200WithAudit - TestBackChannelLogout_TransientUserRepoErrorReturns503 - TestBackChannelLogout_RevokeFailureReturns200WithAuditFailureOutcome - Introduces stubUserRepo in the handler test file (matching the four repository.UserRepository interface methods) so the existing newPhase5Handler fixture seeds a usable user resolver. Verification gate green: - gofmt -l . clean - go vet ./... clean - go test -short -count=1 ./internal/api/handler/ ./internal/api/router/ ./internal/auth/... ./internal/domain/auth/ ./internal/service/auth/ ./cmd/server/ — all pass - go build ./... clean CRIT-1 from the same audit is already closed on this branch (commit `68ca42f`); CRIT-3 / CRIT-4 / CRIT-5 remain open and continue to block the v2.1.0 tag. Spec: cowork/auth-bundles-fixes-2026-05-10/02-crit-2-bcl-sub-lookup.md. Refs: cowork/auth-bundles-audit-2026-05-10.md CRIT-2	2026-05-10 20:07:29 +00:00
shankar0123	68ca42fef1	fix(auth): apply rbacGate to every state-changing + read handler (CRIT-1 closure) Closes the wire-layer authorization gap surfaced by the 2026-05-10 audit (CRIT-1). Before this commit only ~24 of ~140 routes carried rbacGate enforcement — all of them admin-only fine-grained perms (auth.session., auth.oidc., auth.breakglass.admin, cert.bulk_revoke, crl.admin, scep.admin, est.admin, ca.hierarchy.manage). Every catalogued legacy-CRUD perm (cert.read/issue/revoke/delete, profile.edit/delete, issuer.edit/delete, target., agent., plus role-mgmt verbs) was declared in internal/domain/auth/validate.go but never wired at the router. A r-viewer Bearer was essentially r-admin minus five verbs at the wire layer (CWE-862). This commit: - Adds rbacGateScoped(checker, perm, scopeType, scopeFn, h) helper to internal/api/router/router.go for path-bound scope resolution. Per-profile and per-issuer grants (Decision 2) now reach the wire layer. - Wraps every state-changing route AND every read endpoint in router.go with rbacGate (global) or rbacGateScoped (path-bound). The auth-management routes (POST /api/v1/auth/roles, etc.) gain router-level enforcement in addition to the existing service-layer Authorizer check — defense in depth (HIGH-9 of the same audit collapses into this closure). - Auth-exempt surfaces stay un-gated by design: login, callback, BCL, logout, breakglass-login, bootstrap, health, auth-info, version. Allowlist is documented in TestRouterRBACGateCoverage. - Extends internal/domain/auth/validate.go CanonicalPermissions with 30 new perms across 12 namespaces: cert.edit; job.read, job.cancel; approval.read, approval.approve, approval.reject; policy.read/edit/delete; team.read/edit/delete; owner.read/edit/delete; notification.read/edit; discovery.read/run/claim; network_scan.read/edit/run; healthcheck.read/edit/delete/acknowledge; digest.read, digest.send; verification.read, verification.run; stats.read; metrics.read. - Updates DefaultRoles for r-admin / r-operator / r-viewer / r-mcp / r-cli / r-agent. r-auditor gets NOTHING new — the auditor pin (TestAuditorRoleHoldsExactlyAuditReadAndExport) stays invariant. - Migration 000039_audit_crit1_perms seeds the new perm rows + role grants per the updated DefaultRoles map. Idempotent ON CONFLICT DO NOTHING. Reverse migration removes role_permissions before permissions (ON DELETE RESTRICT on the FK). - AST-level CI guard TestRouterRBACGateCoverage in internal/api/router/router_rbac_coverage_test.go walks router.go and asserts every state-changing + read route is wrapped (or in the documented allowlist). Adding a new ungated route fails CI. - Updates docs/operator/rbac.md permission-catalogue table with the new namespaces + footer link to the AST CI guard. - Updates certctl/CHANGELOG.md v2.1.0 section with the closure narrative. Audit doc cowork/auth-bundles-audit-2026-05-10.md CRIT-1 row annotated CLOSED 2026-05-10. Bundle's exit-gate spec lives at cowork/auth-bundles-fixes-2026-05-10/01-crit-1-rbac-gates.md. CRIT-2 / CRIT-3 / CRIT-4 / CRIT-5 of the same audit remain open and continue to block the v2.1.0 tag. Verification gate green: - gofmt -d (no diff after gofmt -w on the touched files) - go vet ./... - go test -short -count=1 ./... (all packages pass including auditor pin) - go build ./... HIGH-9 of the audit closes via this commit's router-layer rbacGate on POST /api/v1/auth/keys/{id}/roles + DELETE /api/v1/auth/keys/{id}/roles/{role_id} (defense-in-depth on top of the existing service-layer privilege check). Refs: cowork/auth-bundles-audit-2026-05-10.md CRIT-1 HIGH-9	2026-05-10 19:58:26 +00:00
shankar0123	1d01c87663	auth-bundle-2 Phase 7 + Phase 7.5: OIDC first-admin bootstrap + break-glass admin (Argon2id, lockout, default-OFF, surface-invisibility) Phase 7 — OIDC first-admin bootstrap (Decision 3): - Optional AdminBootstrapHook closure on oidc.Service. When wired, HandleCallback consults the hook AFTER group resolution + user upsert and BEFORE the empty-mapping fail-closed check. Hook receives (providerID, groups, userID); returns grantAdmin=true when the user matches CERTCTL_BOOTSTRAP_ADMIN_GROUPS AND no admin exists yet in the tenant. - cmd/server/main.go wires the hook as a closure that: Filters by CERTCTL_BOOTSTRAP_OIDC_PROVIDER_ID (if configured). * Probes AdminExists via authActorRoleRepo (admin-already-exists silently returns false; bootstrap mode is one-shot per tenant). * Walks group intersection. * On match: grants r-admin via authActorRoleRepo.Grant + emits the bootstrap.oidc_first_admin audit row with event_category=auth + INFO log. - Coexists with the Bundle 1 env-var-token bootstrap. Both paths can be configured; first match wins (admin-existence probe short-circuits the second). - HandleCallback's empty-mapping fail-closed check moved AFTER the hook so a fresh deployment with zero group_role_mappings can still mint the first admin. - 5 tests in service_test.go: hook grants admin on match, hook returns false preserves empty-mapping fail-closed, admin-already- exists silently falls through to normal mapping, hook-error wraps + bubbles, idempotent when admin is already in the mapped role set. Phase 7.5 — Break-glass admin (Decision 4, default-OFF): Migration 000038 ships: - breakglass_credentials table — at-most-one-credential-per-actor (UNIQUE(actor_id)), Argon2id PHC-format password_hash, lockout state machine (failure_count, locked_until, last_failure_at). FK CASCADE on users(id) so deleting a user atomically removes their credential. - Two new permissions seeded into r-admin only: auth.breakglass.admin — set/rotate/unlock/remove credentials. auth.breakglass.login — actor uses break-glass to log in. CanonicalPermissions extended in lockstep. internal/auth/breakglass/service.go (~580 LOC): - Service.Enabled() reflects CERTCTL_BREAKGLASS_ENABLED. - SetPassword: Argon2id with OWASP 2024 params (m=64MiB, t=3, p=4, salt=16 random bytes, output=32 bytes); per-password random salt; PHC-format hash output. Min 12 / max 256 byte input. - Authenticate: constant-time-compare via subtle.ConstantTimeCompare on every code path. Identical 401 + identical timing across the wrong-password / locked-account / non-existent-actor paths so an attacker cannot probe whether a given actor has break-glass configured. Non-existent-actor + locked-account paths run a verifyDummy() Argon2id pass for timing parity. Lockout state machine: failure_count++ on every wrong attempt; threshold (default 5) trips locked_until = NOW() + duration (default 15m). Successful Authenticate resets the counter. Reset-window: failures aged out after CERTCTL_BREAKGLASS_LOCKOUT_RESET_INTERVAL (default 1h) auto-reset on next attempt. - Unlock + RemoveCredential: admin-only (auth.breakglass.admin gated at the router via rbacGate). Audit rows on every operation. - All public methods refuse to act when Enabled()==false (returns ErrDisabled; the handler maps to HTTP 404 — surface invisibility). internal/repository/postgres/breakglass.go ships the 5-method postgres impl with atomic single-statement IncrementFailure (so concurrent racing wrong-password attempts can't observe an intermediate state and slip past the threshold) and idempotent ResetFailureCount. internal/api/handler/auth_breakglass.go ships the 4-endpoint HTTP surface: - POST /auth/breakglass/login (auth-exempt; 5/min rate-limited per source IP via the existing rate limiter; returns 404 when disabled). On success sets the post-login session cookie + CSRF cookie via SessionService.Create + 204. On any failure: uniform 401 + identical timing (the service has already audited the specific failure category). - POST /api/v1/auth/breakglass/credentials (auth.breakglass.admin) - POST /api/v1/auth/breakglass/credentials/{actor_id}/unlock (auth.breakglass.admin) - DELETE /api/v1/auth/breakglass/credentials/{actor_id} (auth.breakglass.admin) Admin endpoints share the surface-invisibility property: when CERTCTL_BREAKGLASS_ENABLED=false, every admin endpoint also returns 404 (not 403) so probing via the admin surface gets the same signal as probing the login endpoint. Tests (internal/auth/breakglass/service_test.go): All 8 Phase 7.5 spec-mandated negative cases: 1. Service.Enabled()==false → all ops return ErrDisabled. 2. Wrong password → ErrInvalidCredentials, failure_count++, audit row with event_category=auth. 3. Failure_count exceeds threshold → locked, subsequent attempts (including with the CORRECT password) return identical-shape 401 while the lockout window holds. 4. Lockout window expires → next attempt with correct password succeeds + resets the counter. 5. Password < 12 bytes (or > 256 bytes) → ErrWeakPassword. 6. Password leak hygiene — the service has zero slog calls; the audit-row map literal never includes the password plaintext. 7. Argon2id hash never appears in logs OR API responses — pinned by `json:"-"` tag on BreakglassCredential.PasswordHash + a belt-and-braces json.Marshal probe asserting the hash bytes never appear in the marshaled output. 8. Constant-time-compare verified via timing-statistical test — wrong-password vs no-credential paths take statistically indistinguishable time (within 5x ratio). The verifyDummy() hash compute on the no-credential + locked paths is what keeps timing parity; absent that, an attacker could side- channel "actor doesn't have a credential" via timing. Plus coverage-lift batch covering: SetPassword first-time vs rotate, no-caller-id rejection, no-target-id rejection, RNG failure surface, Authenticate happy-path mints session, no-credential audit row, session-mint-failure surface, FailureResetInterval recycle, Unlock + RemoveCredential happy paths, hash-format unit tests (round-trip, mismatch, malformed/wrong-version/bad-base64 formats), nil-audit + nil-session pass-through. Coverage on internal/auth/breakglass/ at 91.5% per-statement (above the Phase 7.5 spec ≥ 90% floor). cmd/server/main.go wiring: - Constructs breakglassRepo + breakglassService + breakglassHandler after the OIDC service block. - breakglassSessionMinterAdapter shim bridges *session.Service.Create to the breakglass.SessionMinter port. - Logs WARN at boot when CERTCTL_BREAKGLASS_ENABLED=true (operator visibility for the deliberate SSO-bypass). internal/config/config.go gains: - AuthConfig.BootstrapAdminGroups + BootstrapOIDCProviderID for Phase 7 (CERTCTL_BOOTSTRAP_ADMIN_GROUPS comma-list + CERTCTL_BOOTSTRAP_OIDC_PROVIDER_ID). - AuthConfig.Breakglass nested struct with 4 env vars (CERTCTL_BREAKGLASS_ENABLED + LOCKOUT_THRESHOLD + LOCKOUT_DURATION + LOCKOUT_RESET_INTERVAL). Router wiring: - 4 new breakglass routes registered when reg.AuthBreakglass != nil; public login route via direct r.mux.Handle (auth-exempt), 3 admin routes via r.Register + rbacGate(auth.breakglass.admin). - POST /auth/breakglass/login pinned in AuthExemptRouterRoutes allowlist with Phase 7.5 justification. - SpecParityExceptions extended with 4 new entries documenting the Phase 7.5 deferral of full per-endpoint OpenAPI rows (handler doc-block at the top of auth_breakglass.go is the operator-facing reference). Threat model (encoded in service.go + auth_breakglass.go doc-blocks + migration 000038 docstrings, to be promoted to docs/operator/auth- threat-model.md in Phase 12): - Break-glass is a deliberate bypass of the SSO security boundary. An attacker who phishes the password OR finds it in a compromised password manager bypasses MFA, OIDC, and every group-claim gate. - Recommendation: keep CERTCTL_BREAKGLASS_ENABLED=false in steady- state. Enable only during SSO-broken incidents. Disable after recovery. - WebAuthn pairing (v3 per Decision 12) is the load-bearing second factor. Without it, break-glass is best treated as an emergency- only path. - Audit trail surfaces every break-glass action under event_category=auth; the auditor role can monitor for unexpected break-glass logins. Verifications: gofmt clean, go vet clean across all touched packages, go test -short -count=1 green across internal/auth/oidc (3.0s; new Phase 7 hook tests integrated alongside the 21+ Phase 3 negatives), internal/auth/breakglass (3.6s; 8 spec-mandated negatives + coverage batch passing), internal/config + internal/domain/auth + internal/api/ router + internal/api/handler all green, no regressions in Bundle 1 packages.	2026-05-10 06:51:41 +00:00
shankar0123	3189f3cd71	auth-bundle-2 Phase 6: session middleware + CSRF token plumbing + chained-auth combinator + AuthInfo OIDC providers extension + 2 CI guards (Bundle-1-compat + Bundle-1-to-2-upgrade) Phase 6 wires the Phase 4 session service + Phase 5 OIDC handlers into the request path. Three middlewares + one combinator land in internal/auth/session/middleware.go: 1. SessionMiddleware reads `certctl_session` cookie, validates via SessionService.Validate, populates the legacy UserKey/AdminKey + Phase 3 RBAC context keys (ActorIDKey/ActorTypeKey/TenantIDKey) so downstream RequirePermission + audit-attribution see a consistent caller. Best-effort UpdateLastSeen keeps the idle- expiry sliding window fresh. CRITICALLY: never 401s on validate failure — defers to the next middleware so the chained-auth combinator can fall back to Bearer. 2. CSRFMiddleware gates state-changing methods (POST/PUT/DELETE/ PATCH) for session-authenticated requests. API-key actors are EXEMPT (no session row in context => CSRF doesn't apply; they're not browser-driven). Constant-time-compares SHA-256(X-CSRF-Token header) against the session row's stored hash via SessionService.ValidateCSRF. Mismatch returns 403. 3. ChainAuthSessionThenBearer is the load-bearing chained-auth combinator: tries the session cookie first; on miss/invalid, falls back to the API-key Bearer middleware; if neither authenticates, 401. The composition uses bearerSkipIfAuthenticated so a request with both a valid session AND a valid Bearer uses the session (cookie wins per the Bundle 2 contract). Middleware chain order in cmd/server/main.go (per Phase 6 spec): RequestID → Logging → Recovery → CORS → RateLimit → AUTH (chained: session → Bearer) → CSRF (state-changing only; API-key exempt) → Audit → Handler The chained authMiddleware replaces the bare Bundle-1 bearerMiddleware at the chain entry point; csrfMiddleware lands immediately after so session-authenticated requests pass through CSRF before audit. Both new middlewares are pass-throughs when sessionService is nil (pre-Phase-4 builds). AuthInfo extension (Category E): GET /api/v1/auth/info now returns the list of configured OIDC providers (id + display_name + login_url where login_url = `/auth/oidc/login?provider=<id>`) so the GUI Login page renders the correct "Sign in with X" buttons. Endpoint stays auth-exempt; the providers list is public configuration. Wired via HealthHandler.OIDCProvidersResolver + a new OIDCProvidersListResolver projection interface; the cmd/server adapter oidcProvidersListAdapter projects the postgres OIDCProviderRepository into the public-safe shape. Resolver lookups are best-effort: failures fall back to the minimal payload rather than 500-ing the GUI's auth probe. Nil resolver preserves the pre-Phase-6 minimal shape so test fixtures + no-db deploys keep compiling. Bypass list preserved (Category E): the existing public-route allowlist in router.AuthExemptRouterRoutes is preserved by virtue of those routes registering via direct r.mux.Handle (they bypass the entire chain). The protocol-endpoint allowlist (ACME/SCEP/EST/OCSP/ CRL) bypasses via cmd/server/main.go::buildFinalHandler URL-prefix dispatch — those routes never reach the auth middleware at all. Both preservations are pinned by the Bundle-1 compat CI guard below. Tests (internal/auth/session/middleware_test.go): All 7 Phase 6 spec-mandated middleware-chain tests pass: 1. Session cookie + correct CSRF → 200. 2. Session cookie + wrong CSRF → 403. 3. Bearer-only (no session) + no CSRF → 200 (API-key actors are CSRF-exempt by design). 4. No cookie + no Bearer → 401. 5. Expired cookie + valid Bearer → fall back to Bearer succeeds. 6. Tampered cookie → 401 (no Bearer to fall back to). 7. Bypass-list awareness — state-changing method, no auth, no session row → uniform 401 (NOT a CSRF 403; the CSRF check is gated on session-row presence and never fires for unauth requests). Plus coverage-lift tests covering nil-service pass-through, safe- methods bypass, SessionFromContext nil + populated, isStateChangingMethod matrix, clientIPFromRequest variants (RemoteAddr / XFF first-hop / XFF single / no-port), nil-bearer chain branches. Coverage on internal/auth/session/middleware.go: 100% per-function across the 9 entry points (SessionValidator interfaces + NewSessionMiddleware + NewCSRFMiddleware + ChainAuthSessionThenBearer + bearerSkipIfAuthenticated + SessionFromContext + isStateChangingMethod + clientIPFromRequest + lastIndexByte). Package coverage 94.9%. Two new CI guards: scripts/ci-guards/bundle-1-compat-regression.sh — Bundle-1-only compat invariants. Static-source checks that protect the Bundle-1 path since spinning up docker-compose + running the integration test suite is sandbox-infeasible: 1. SessionMiddleware MUST defer-to-next on missing/invalid cookie. 2. CSRFMiddleware MUST be pass-through on missing session row. 3. cmd/server/main.go MUST wire ChainAuthSessionThenBearer. 4. The 4 public OIDC routes MUST be in AuthExemptRouterRoutes. 5. AuthInfo MUST guard on OIDCProvidersResolver != nil. scripts/ci-guards/bundle-1-to-2-upgrade-regression.sh — Bundle-1 → Bundle-2 upgrade invariants: 1. Migrations 000034..000037 use CREATE TABLE IF NOT EXISTS. 2. Migrations are wrapped in BEGIN; ... COMMIT;. 3. NO DROP TABLE / ALTER ... DROP COLUMN against any of the 19 protected Bundle-1 tables (api_keys, audit_events, certificates, certificate_versions, profiles, issuers, targets, agents, jobs, owners, teams, agent_groups, notifications, roles, permissions, role_permissions, actor_roles, tenants, approvals, intermediate_cas, issuance_approval_requests). 4. 000037 INSERTs use ON CONFLICT DO NOTHING (idempotent re-apply). 5. ChainAuthSessionThenBearer is wired (Bundle-1 Bearer keys continue to authenticate post-upgrade). 6. Bootstrap handler is registered (fresh-deployment bootstrap still works). Both guards are sandbox-feasible static analysis. When the operator gets a Linux VM with docker-in-docker, promote both to real `docker compose up` integration tests against a v2.1.0 baseline DB dump. Verifications: gofmt clean, go vet ./internal/auth/... ./internal/api/... ./cmd/server/... clean, go test -short -count=1 -race green across internal/auth/session (94.9% coverage), internal/api/handler, internal/api/router, no regressions in Bundle 1 packages, both new ci-guards green.	2026-05-10 06:22:25 +00:00
shankar0123	9c679a5960	auth-bundle-2 Phase 5: OIDC + session HTTP surface (13 endpoints), pre-login store, OpenID Connect Back-Channel Logout 1.0, cookieAuth scheme, 7 new auth permissions, CI guard, handler tests Phase 5 of the bundle puts the Phase 3 OIDC service + Phase 4 session service on the wire. 13 HTTP endpoints split into three logical groups: Public OIDC handshake (auth-exempt; protocol-mediated): GET /auth/oidc/login?provider=<id> -> 302 to IdP authorization URL + sets certctl_oidc_pending cookie (10-min TTL, Path=/auth/oidc/, SameSite=Lax) GET /auth/oidc/callback?code=...&state=... -> consume pre-login row, run Phase 3's 11-step token validation, mint post-login session, 302 to dashboard POST /auth/oidc/back-channel-logout -> OpenID Connect BCL 1.0 — IdP POSTs logout_token JWT; certctl validates signature against IdP JWKS via Phase 3 alg allow-list, required claims (iss/aud/iat/jti/ events; exactly one of sub/sid; nonce ABSENT per spec §2.4), revokes matching sessions, returns 200 with Cache-Control: no-store POST /auth/logout -> revoke caller's session Session management (RBAC-gated auth.session.): GET /api/v1/auth/sessions -> auth.session.list (own / all) DELETE /api/v1/auth/sessions/{id} -> auth.session.revoke (own bypass) OIDC provider + group-mapping CRUD (RBAC-gated auth.oidc.): GET /api/v1/auth/oidc/providers -> auth.oidc.list POST /api/v1/auth/oidc/providers -> auth.oidc.create (client_secret encrypted at rest via internal/crypto.EncryptIfKeySet) PUT /api/v1/auth/oidc/providers/{id} -> auth.oidc.edit DELETE /api/v1/auth/oidc/providers/{id} -> auth.oidc.delete (refused via ErrOIDCProviderInUse → 409 when users authenticated via this provider) POST /api/v1/auth/oidc/providers/{id}/refresh -> auth.oidc.edit (re-runs IdP downgrade defense via OIDCService.RefreshKeys) GET /api/v1/auth/oidc/group-mappings -> auth.oidc.list POST /api/v1/auth/oidc/group-mappings -> auth.oidc.edit DELETE /api/v1/auth/oidc/group-mappings/{id} -> auth.oidc.edit Migration 000037 ships: - oidc_pre_login_sessions table (10-min absolute TTL, FK CASCADE on oidc_provider_id, FK RESTRICT on signing_key_id; index on absolute_expires_at for the GC sweep); - 7 new permissions seeded into r-admin only: auth.session.list, auth.session.list.all, auth.session.revoke, auth.oidc.list, auth.oidc.create, auth.oidc.edit, auth.oidc.delete CanonicalPermissions extended in lockstep at internal/domain/auth/ validate.go. Pre-login machinery: - internal/repository/oidc.go gains PreLoginRepository interface + PreLoginSession struct + ErrPreLoginNotFound / ErrPreLoginExpired sentinels. - internal/repository/postgres/oidc_prelogin.go ships the impl; LookupAndConsume uses DELETE ... RETURNING for atomic single-use. - internal/auth/oidc/prelogin.go is the PreLoginAdapter that bridges the OIDC service's Phase 3 PreLoginStore interface to the new repository, signing the cookie value under the active SessionSigningKey via the same v1.<id>.<key>.<HMAC> wire format Phase 4 uses for post-login cookies. Defense-in-depth: the pre-login `pl-` prefix is enforced by ParseCookieValue(prefix); a stolen pre-login cookie cannot be replayed against the post-login Validate path (pinned by TestService_Validate_RejectsPreLoginCookieAtPostLoginGate). Session package extension: - internal/auth/session/service.go gains exported SignCookieValue, ParseCookieValue (with caller-supplied id-1 prefix), ComputeCookieHMAC, DecryptKeyMaterial wrappers so the OIDC pre-login adapter shares the same length-prefixed HMAC math without code duplication. - parseCookie no longer hardcodes the `ses-` prefix check (moved to Validate as defense-in-depth; pre-login cookie verification uses the `pl-` prefix via ParseCookieValue). Cookie attributes (all Phase 5 endpoints honor CERTCTL_SESSION_SAMESITE + Secure=true via SessionCookieAttrs from Phase 4 config): - certctl_oidc_pending: Path=/auth/oidc/, MaxAge=600s, SameSite=Lax (cannot be Strict because the IdP-initiated callback is a top-level navigation from a different origin). - certctl_session: Path=/, Expires=8h, SameSite=Lax\|Strict, HttpOnly. - certctl_csrf: Path=/, Expires=8h, HttpOnly=false (intentional — GUI must read it to echo into X-CSRF-Token header). Audit logging on every mutating operation (event_category="auth"): auth.oidc_login_succeeded / failed / unmapped_groups auth.oidc_back_channel_logout / failed auth.session_revoked auth.oidc_provider_{created,updated,deleted,refreshed} auth.group_mapping_{added,removed} OpenAPI updates: - cookieAuth security scheme added to api/openapi.yaml under components.securitySchemes (apiKey / cookie / certctl_session). - The 13 Phase 5 routes are added to SpecParityExceptions with a deferral note: full per-endpoint OpenAPI rows land in a follow-on commit alongside the GUI work (Phase 8) so the ergonomic shape can be validated against the live GUI client. CI guard: scripts/ci-guards/N-bundle-2-security-empty-preserved.sh asserts api/openapi.yaml has ≥ 14 'security: []' occurrences (the pre-Bundle-2 baseline). Reducing the count below 14 would silently force a Bearer-or-cookie requirement onto an endpoint that legitimately runs without certctl-issued credentials; the guard fires before that regression lands. Handler tests (internal/api/handler/auth_session_oidc_test.go): - All 6 prompt-mandated negative cases: BCL with missing events claim -> 400 BCL with nonce present -> 400 (per spec §2.4) BCL with sig signed by an unknown key -> 400 Callback with replayed state -> 400 Callback with PKCE verifier mismatch -> 400 Callback with expired pre-login row -> 400 - Plus happy paths for every endpoint, edge cases (missing-cookie, duplicate-name, in-use-409, wrong-tenant), and the Helper-function coverage (peekIssuer, classifyOIDCFailure, defaultIfBlank, defaultIntIfZero, clientIPFromRequest, encryptClientSecret). Coverage on internal/api/handler/auth_session_oidc.go: 80.9% per-function (above the Phase 5 spec's ≥ 80% floor). Server wiring (cmd/server/main.go): Wired AFTER sessionService (Phase 4) so the OIDC PreLoginAdapter can sign pre-login cookies under the active SessionSigningKey: oidcProviderRepo + oidcMappingRepo + oidcUserRepo + oidcPreLoginRepo -> preLoginAdapter -> oidcService -> authSessionOIDCHandler. sessionMinterAdapter shim bridges *session.Service.Create to the oidcsvc.SessionMinter port the OIDC service consumes. Router wiring (internal/api/router/router.go): 4 public OIDC routes via direct r.mux.Handle (auth-exempt; pinned in AuthExemptRouterRoutes); 9 RBAC-gated routes via r.Register + rbacGate(checker, perm, h). Routes only register when reg.AuthSessionOIDC != nil so pre-Phase-5 builds skip the block entirely. Verifications: gofmt clean, go vet clean across all touched packages, go test -short -count=1 green across internal/api/handler (74 tests + new Phase 5 batch), internal/api/router (parity + auth-exempt allowlist), internal/auth/oidc + session (no regressions), full domain + scheduler + config sweeps green, ci-guard N-bundle-2-security-empty-preserved.sh green (17 ≥ 14 baseline).	2026-05-10 06:08:27 +00:00
shankar0123	5d79e53ad0	auth-bundle-1 follow-on: close coverage gaps to clear Phase 12 floors CI run #486 (post-Bundle-1 merge + Go 1.25.10 bump) failed three coverage-threshold gates: internal/api/handler 74.7% < floor 75 (-0.3pp) internal/auth 66.3% < floor 85 (-18.7pp) internal/service/auth 51.1% < floor 85 (-33.9pp) The Phase 12 gate file's "85% with negative-test coverage" claim turned out to be aspirational — the read-side and Update-path methods on RoleService / PermissionService / ActorRoleService had zero unit-test coverage, and internal/auth's keystore + HasPermission helper had zero tests. This commit closes the gap without lowering the gate. Per-package CI-style averages after this commit (per scripts/check-coverage-thresholds.sh's per-function-mean): internal/api/handler 76.1% (+1.4pp, margin +1.1pp) internal/auth 90.5% (+24.2pp, margin +5.5pp) internal/service/auth 93.7% (+42.6pp, margin +8.7pp) Tests added: internal/service/auth/service_test.go (+18 tests, +518 LOC): PermissionService.List, PermissionService.GetByName, RoleService.Get (4 paths), RoleService.List (system caller), RoleService.Update (4 paths), RoleService.ListPermissions (3 paths), RoleService.AddPermission/RemovePermission round-trip + gate paths, RoleService.Delete (success + nil-caller + no-perm + audit), RoleService.Create (nil-caller), ActorRoleService.ListForActor (self-bypass + cross-actor + nil-caller + system + with-perm), ActorRoleService.Effective- Permissions (same shape), ActorRoleService.ListKeys (3 paths + system bypass), ActorRoleService.Revoke (4 paths), Authorizer edge cases (empty actorID short-circuit, empty tenantID default, scoped-grant-without-scope-id no-match invariant, repo-error wrap-and-return, HoldsAnyOf early-exit), recordAudit nil-arm short-circuits. internal/auth/keystore_test.go (NEW, +175 LOC): StaticKeyStore.Len, StaticKeyStore.LookupByHash hit + miss, MutableKeyStore seeded lookup + Len, Add registers new key, AddHashed registers from precomputed hash, AddHashed replaces on duplicate hash (idempotent boot-loader contract), HasPermission no-actor / default-actor-type / checker-error / scoped-check threading. internal/auth/bootstrap/service_test.go (+36 LOC): Service.Available nil-receiver/nil-strategy short-circuit, Service.Available delegates to Strategy when configured. internal/api/handler/auth_test.go (+208 LOC): GetRole returns role + permissions, GetRole 404 + 401, UpdateRole 200 + invalid-JSON-400 + 401, ListKeys returns actor list + 401, RemoveRolePermission 204 (global + scoped) + 401, rolePermToResponse scope encoding pin via GetRole. Verified: gofmt -l . clean (touched files only). go vet ./internal/auth/... ./internal/service/auth/... ./internal/api/handler/ rc=0. go test -count=1 -short on the four packages green. CI-style per-function averages computed via the live scripts/check-coverage-thresholds.sh arithmetic — all three gated packages clear their floors with margin. Per CLAUDE.md "complete path" + "do not lower the gate to make CI green": gate file unchanged. The 85/85/75 floors stand.	2026-05-10 02:04:36 +00:00

1 2 3 4

191 Commits