Commit Graph

14 Commits

Author SHA1 Message Date
shankar0123 663b14bfd8 feat(retention): COMP-002-RETENTION — federated-user PII purge pipeline
Sprint 6 closure of the audit's MED-severity COMP-002-RETENTION
finding.

Pre-fix posture: the federated-user admin surface
(auth_users.go::Deactivate) sets users.deactivated_at on soft-delete,
but the PII columns (email, display_name, oidc_subject) stay
populated forever. No in-code primitive for GDPR right-to-be-
forgotten; no scheduled retention purge.

This commit ships the audit's recommended two-phase fix:

  Phase 1 — operator-callable scrub primitive
    internal/service/user_retention.go
      UserRetentionService.DeleteUserPII(ctx, userID):
        - revoke all active sessions (defense-in-depth)
        - email := 'purged@redacted.local'
        - display_name := '[purged]'
        - oidc_subject := 'sha256:' || hex(sha256(original))
        - audit_events row with action=user.purge_pii,
          category=auth, actor=system

      Why hash oidc_subject instead of NULL:
        1. (oidc_provider_id, oidc_subject) UNIQUE constraint would
           trip on multiple purged users converging to NULL
        2. The hash is one-way; the original IdP-side identifier is
           unrecoverable. Re-login under the same subject mints a
           fresh u-id (right-to-be-forgotten semantics)
        3. Forensic continuity: an operator can recompute
           sha256(<known-subject>) and confirm "this user was
           deactivated then purged"

      users.id itself is preserved so historical
      audit_events.actor = u-X rows still resolve. The forensic-
      attribution chain stays intact even after the PII is gone.

  Phase 2 — scheduled batch purge
    internal/scheduler/scheduler.go
      UserRetentionPurger interface + userRetentionLoop:
        - PurgeDeactivatedUsers enumerates every user with
          deactivated_at < NOW() - retention_window
        - DeleteUserPII per row
        - per-tick batch cap (default 200) keeps blast radius
          predictable; large backlogs spread across multiple ticks
        - atomic.Bool guard + 5-min per-tick context.WithTimeout

    Repository contract grew a single new method:
      internal/repository/user.go::ListDeactivatedBefore(ctx, t)
      internal/repository/postgres/user.go: SQL-side filter
      (deactivated_at IS NOT NULL AND deactivated_at < $1)
      ORDER BY deactivated_at ASC, cross-tenant.

  Configuration
    CERTCTL_USER_RETENTION_INTERVAL   default 24h
    CERTCTL_USER_RETENTION_WINDOW     default 30 days
    CERTCTL_USER_RETENTION_BATCH_CAP  default 200

  Test stub additions for repository.UserRepository.ListDeactivatedBefore:
    internal/auth/oidc/service_test.go::stubUsers
    internal/api/handler/auth_users_test.go::stubFullUserRepo
    internal/api/handler/auth_session_oidc_test.go::stubUserRepo

  Documentation
    docs/operator/privacy-and-retention.md
      - retention pipeline diagram (day-0 deactivate → day-N purge)
      - operator config table
      - verification runbook (4 steps with SQL)
      - what's NOT covered (deferred: DSAR export, api_keys cascade,
        retroactive audit_events.details redaction)

  Tests
    internal/service/user_retention_test.go (NEW, 4 tests):
      TestDeleteUserPII_ScrubsAndRevokes
      TestDeleteUserPII_IsIdempotent
      TestPurgeDeactivatedUsers_RespectsWindow
      TestPurgeDeactivatedUsers_BatchCap

Verified locally:
  go vet ./...                                   (clean)
  gofmt -l internal/ cmd/                        (clean)
  go test -short -count=1 \
    ./internal/service/... ./internal/scheduler/... ./internal/config/...
    (all green)

Cross-sprint interaction: pairs with COMP-001-HASH (prior commit).
The user.purge_pii audit row this service emits flows through the
new hash chain, so the scrub event is itself tamper-evident.

Closes COMP-002-RETENTION. Sprint 6 is complete (2/2 findings).
2026-05-16 06:18:39 +00:00
shankar0123 fefeccfa59 harden(oidc): relax alg-downgrade IdP-bind check to intersection-empty (Keycloak compat)
Phase-10 live-IdP smoke (Keycloak 26.x via testcontainers-go) revealed
the IdP-bind alg-downgrade check was too strict for real-world IdPs.
6 of the integration tests in internal/auth/oidc/integration_keycloak*_test.go
were failing with:

  oidc: IdP advertises weak signing algorithms (HS*/none);
  refusing to use as defense against downgrade attacks: HS256

Keycloak 26.x (and several other real-world IdPs — Auth0 when HS-mode is
enabled, some Authentik configs) advertise EVERY alg they're capable of
in the discovery doc's id_token_signing_alg_values_supported field, even
when the realm only signs with RS256 in practice. Pre-fix the IdP-bind
check refused on ANY HS* or 'none' advertisement → no real Keycloak deploy
could ever bind a provider row, hence the integration-test failures.

The strict-deny check was defense-in-depth on top of the load-bearing
per-token alg-pin at sig-verify time (isDisallowedAlg, service.go L1177):
that check rejects every ID token whose JWS header carries an alg outside
DefaultAllowedAlgs, regardless of what the discovery doc advertises.
A forged HS256 token signed with the IdP's RS256 pubkey as HMAC secret
is rejected at sig-verify time → the actual algorithm-confusion attack
is closed by the per-token pin, NOT by the discovery-doc check.

Fix: relax the IdP-bind check to refuse only when the intersection of
advertised vs DefaultAllowedAlgs is EMPTY (the pathological all-weak-alg
IdP case). Keycloak (RS256 + HS256 advertised) now binds successfully;
an HS-only IdP still fails closed.

Changes:
- internal/auth/oidc/service.go: rewrite the alg-check loop at L1067 in
  getOrLoad / RefreshKeys to compute the intersection set; refuse only
  when no acceptable alg is advertised. ErrIdPDowngradeAdvertised
  docstring updated to reflect new contract. DefaultAllowedAlgs
  docstring + the package-level design-comment block at L40-72 updated
  with v2.1.0-relaxed semantics callouts.
- internal/auth/oidc/test_discovery.go: TestDiscovery dry-run validator
  rewritten to surface HS*/none alongside RS* as an informational note
  ('note: IdP advertises weak algorithms %v alongside acceptable ones')
  rather than a hard-fail error. HS-only / none-only still hard-fails.
- internal/auth/oidc/service_test.go: TestService_IdPDowngradeDefense_*
  tests updated. Renamed:
  - RejectsHSAdvertised → RS256PlusHS256_BindsSuccessfully (positive)
  - RejectsNoneAdvertised → RejectsHSOnlyAdvertised (intersection-empty)
  - RefreshKeys_CatchesPostLoadDowngrade rotated to HS-only post-load
- internal/auth/oidc/coverage_fill_test.go: TestTestDiscovery_AlgDowngradeDetected
  split into _HS256AlongsideRS256_BindsWithNote (positive, asserts note
  but no hard-fail) + _HSOnly_StillTrips_HardFail (intersection-empty).
- docs/operator/auth-threat-model.md: OIDC token-validation alg-allow-list
  section rewritten to call out the load-bearing-defense hierarchy
  (per-token pin first, IdP-bind check defense-in-depth) and document
  the v2.1.0 relaxation rationale.
- CHANGELOG.md: ### Security entry under Unreleased.

Verify: go test ./internal/auth/oidc/ -short PASS; gofmt clean; go vet
clean. The Keycloak integration tests should now pass when the operator
re-runs 'make keycloak-integration-test'.
2026-05-11 15:34:59 +00:00
shankar0123 8aeeec93c0 chore(lint): close 5 golangci-lint v2 findings surfaced by v2.1.0 release-gate Phase 1.3
Five golangci-lint v2 findings surfaced when running the v2.1.0 release
gate (auth-bundle-2 → master pre-flight). Each is mechanical:

1. govet/printf-style misuse — internal/auth/oidc/service_test.go used
   integer literal 501 in http.Error; switched to http.StatusNotImplemented.

2. staticcheck SA1019 — internal/auth/breakglass/reflect_helper_test.go
   referenced reflect.Ptr; the canonical name since Go 1.18 is
   reflect.Pointer.

3. staticcheck ST1020 — internal/repository/postgres/auth.go
   ActorRoleRepository.Revoke had a doc comment that did not begin with
   the method name. Prepended 'Revoke drops actor_roles rows.' to the
   comment so it now starts with the method name.

4. staticcheck ST1022 — internal/api/handler/auth_session_oidc.go
   DefaultBCLVerifierMaxAge docstring was attached to the DefaultBCLVerifier
   type docstring. Moved the const docstring directly above the const
   declaration, separated by a blank line.

5. unused — internal/auth/session/bench_test.go declared
   benchSessionMinSamples and never referenced it; the bench loop relies
   on Go's default b.N scaling. Replaced the const block with a comment
   describing the rationale.

Lint clean (golangci-lint v2.12.2 with the .golangci.yml config) on the
five edited packages.
2026-05-11 13:31:13 +00:00
shankar0123 11b145b641 Merge Fix 06 (HIGH A-6): strict UA/IP binding — close request-empty bypass in MED-16
# Conflicts:
#	CHANGELOG.md
#	internal/api/handler/auth_session_oidc.go
#	internal/api/handler/auth_session_oidc_test.go
2026-05-11 11:19:04 +00:00
shankar0123 92519436a1 harden(oidc): strict UA/IP binding (A-6) — close request-empty bypass in MED-16
The MED-16 closure (2a1a0b3) added the RFC 9700 §4.7.1 pre-login
UA/IP binding but the consume-side compare at
internal/auth/oidc/service.go was gated by:

  if s.preLoginRequireUA && storedUA != "" && userAgent != "" {
      ... constant-time compare ...
  }
  if s.preLoginRequireIP && storedIP != "" && ip != "" {
      ... constant-time compare ...
  }

The `userAgent != ""` and `ip != ""` arms were intended as
rolling-deploy / headless-proxy compat ("if the request didn't supply
a value, don't try to compare against nothing"). They achieve that —
and they ALSO short-circuit the compare whenever the **attacker**
controls the request side, which is always at /auth/oidc/callback.

Threat model:
  1. Attacker acquires a pre-login cookie (HMAC-protected; requires
     RNG break OR transit leak — not implausible, that's why the
     binding exists in the first place).
  2. Attacker replays the cookie at /auth/oidc/callback from their
     own user-agent.
  3. Attacker OMITS the User-Agent header. curl doesn't send one by
     default. Many programmatic HTTP clients omit it.

Pre-A-6, step 3 trivially bypassed the binding check. The whole
RFC 9700 §4.7.1 defense was theatre against the realistic threat —
silent-allow when the attacker abandons the header they don't want
checked.

Fix: flipped to strict-when-stored. When the pre-login row carries a
binding value (storedUA != "" or storedIP != ""), the request MUST
present a matching value. An empty request side with a non-empty
stored side now rejects with two new sentinels:

  ErrPreLoginUAMissing  — request omitted User-Agent header
  ErrPreLoginIPMissing  — request had no resolvable client IP

Distinguished from the existing *Mismatch sentinels so the audit
row can tell apart "binding violation" (operator mis-configured the
proxy) from "missing-header bypass attempt" (active exploit indicator).
The handler-side classifyOIDCFailure adds typed errors.Is dispatch:

  ErrPreLoginUAMissing → "prelogin_ua_missing"
  ErrPreLoginIPMissing → "prelogin_ip_missing"

SIEM rules can now alert specifically on the bypass-attempt category
distinctly from operator config drift.

Legacy-row compat preserved: pre-migration rows where storedUA == ""
/ storedIP == "" still pass through unchecked. That window is
bounded by the 10-minute pre-login TTL — within 10 minutes of the
MED-16 deploy every legacy row has expired and the strict path is
universal.

Operator escape hatches preserved: CERTCTL_OIDC_PRELOGIN_REQUIRE_UA=false
(symmetric for IP) bypasses both the *Mismatch AND the new *Missing
reject paths. Required for environments where a proxy strips the
User-Agent header in transit (rare but documented in the operator
advisory).

Regression coverage:

  service_test.go (5 new tests under
  `Audit 2026-05-11 A-6 — strict-when-stored` block):
    TestService_HandleCallback_MED16_A6_UAStoredButRequestEmpty_Rejects
      — the load-bearing bypass-closure leg
    TestService_HandleCallback_MED16_A6_IPStoredButRequestEmpty_Rejects
      — symmetric for IP
    TestService_HandleCallback_MED16_A6_LegacyRowEmptyStoredStillPasses
      — legacy-row compat preserved
    TestService_HandleCallback_MED16_A6_ToggleOff_AllowsBypass
      — UA toggle off allows the bypass (operator escape hatch)
    TestService_HandleCallback_MED16_A6_ToggleOff_IP_AllowsBypass
      — IP toggle off allows the bypass

  auth_session_oidc_test.go::TestClassifyOIDCFailure extended:
    ErrPreLoginUAMismatch → prelogin_ua_mismatch (new explicit pin)
    ErrPreLoginIPMismatch → prelogin_ip_mismatch (new explicit pin)
    ErrPreLoginUAMissing → prelogin_ua_missing
    ErrPreLoginIPMissing → prelogin_ip_missing
    fmt.Errorf wrapped variants of the *Missing sentinels round-trip
    through errors.Is (defense against future context-wrapping in
    the service layer)

Verify gate green: gofmt clean, go vet clean, all 10 MED-16 tests
+ extended TestClassifyOIDCFailure pass; full short-mode test run
across internal/auth/oidc + internal/api/handler also green.

Spec at cowork/auth-bundles-fixes-2026-05-11/06-high-prelogin-ua-strict-mode.md.
Audit doc: MED-16 row in cowork/auth-bundles-audit-2026-05-10.md
appended with the A-6 follow-up closure annotation; status table
row updated to "CLOSED + A-6 follow-up CLOSED 2026-05-11".
Operator advisory in CHANGELOG.md v2.1.0 release notes covers the
two operator-visible behaviour changes: (1) callback requests
without User-Agent now reject when a binding was stored, and (2)
the CERTCTL_OIDC_PRELOGIN_REQUIRE_UA=false escape hatch is the
documented path for environments where the proxy strips the header.
2026-05-11 11:03:31 +00:00
shankar0123 78485f7429 fix(auth/users): close MED-11 lying field — DeactivatedAt loaded + enforced on login (A-2)
The MED-11 closure shipped users.deactivated_at + DELETE /api/v1/auth/users/{id}
+ cascade-revoke, but the federated-user soft-delete was reversible: the next
OIDC login under the same (provider, subject) tuple re-minted a session and
re-elevated the user.

Three legs of the chain were severed (each independently CRIT-shaped):

  Leg A — postgres/user.go::userColumns omitted `deactivated_at`, so scanUser
          never populated User.DeactivatedAt. Every Get / GetByOIDCSubject /
          ListAll returned DeactivatedAt = nil regardless of the column value.

  Leg B — postgres/user.go::Update SQL omitted `deactivated_at = $X`, so the
          handler's `u.DeactivatedAt = now()` mutation was a no-op write at
          the SQL level. Even with leg A closed, no row ever flipped.

  Leg C — oidc/service.go::upsertUser did not inspect DeactivatedAt on the
          existing-user path. Even with legs A + B closed, the OIDC login
          would still proceed normally.

The cascade-session-revoke half of the original closure remained correct, but
only for the duration of the user's current cookie. SOC 2 CC6.3 + ISO 27001
A.9.2.6 "user access removal" controls require both immediate revoke AND
persistent block — this fix restores the persistent-block leg.

Closure across layers:

  internal/repository/postgres/user.go
    - userColumns adds `deactivated_at`
    - scanUser reads via sql.NullTime intermediate (column is nullable)
    - Create writes deactivated_at explicitly (NULL for new active users;
      forward-compat for future seed-data flows that pre-populate the column)
    - Update writes deactivated_at on every call; nil DeactivatedAt → NULL
      (supports reactivation)

  internal/auth/oidc/service.go
    - New sentinel ErrUserDeactivated
    - upsertUser checks existing.DeactivatedAt != nil BEFORE mutating email /
      display_name / last_login_at — preserves last_login_at forensics on
      rejected login attempts (defense-in-depth pin against future
      "performance optimization" that reorders the gate)

  internal/api/handler/auth_session_oidc.go
    - classifyOIDCFailure adds typed errors.Is dispatch for ErrUserDeactivated
      → audit category "user_deactivated" (SOC/SIEM observability surface)

  internal/api/handler/auth_users.go
    - Self-deactivate guard on Deactivate: HTTP 409 + audit row
      auth.user_deactivate_self_rejected when caller targets own User row.
      Prevents an admin from one-way-door locking themselves out via the
      standard handler; break-glass remains the recovery path.
    - New Reactivate handler: inverse of Deactivate. Clears DeactivatedAt
      via Update; emits auth.user_reactivated audit row. Idempotent on
      already-active rows. Sessions revoked at deactivation stay revoked
      (cascade irreversible by design — user must complete fresh OIDC
      login).

  internal/api/router/router.go
    - POST /api/v1/auth/users/{id}/reactivate wired with auth.user.deactivate
      gate (reactivation is the inverse op, not a separate privilege)

  web/src/api/client.ts + web/src/pages/auth/UsersPage.tsx
    - authReactivateUser() client function
    - Reactivate button on deactivated rows in UsersPage

Regression coverage:

  Postgres (testcontainers, skipped under -short):
    TestUserRepository_DeactivatedAt_RoundTrip — Create → set DeactivatedAt
      → Update → Get / GetByOIDCSubject / ListAll round-trip the value
    TestUserRepository_DeactivatedAt_CreateWritesNullForActive — new active
      user reads back DeactivatedAt = nil
    TestUserRepository_DeactivatedAt_CreatePersistsPreDeactivated — Create
      with non-nil DeactivatedAt round-trips (forward-compat path)

  OIDC service:
    TestService_HandleCallback_RejectsDeactivatedUser — errors.Is
      ErrUserDeactivated; CallbackResult nil; persisted email / last_login_at
      / deactivated_at NOT mutated by the rejected attempt
    TestService_HandleCallback_AllowsReactivatedUser — DeactivatedAt = nil
      → happy path resumes
    TestService_HandleCallback_DeactivatedUserPreservesForensics —
      defense-in-depth pin against future regressions that reorder the
      gate-vs-mutation sequence

  Classifier:
    TestClassifyOIDCFailure extended — typed dispatch + wrapped variant
      round-trip through errors.Is

  Handler:
    TestAuthUsers_Deactivate_RejectsSelfDeactivate — HTTP 409 + audit
      row + cascade-revoke NOT fired + row stays active
    TestAuthUsers_Deactivate_OtherUser_HappyPath — HTTP 204 + cascade
      fires + row soft-deleted
    TestAuthUsers_Reactivate_HappyPath / _IdempotentOnActiveUser /
      _UnknownID / _MissingID / _UpdateError

Phase 6 verify gate green on the targeted packages: gofmt clean, go vet
clean, go test -short pass across internal/auth/oidc, internal/api/handler,
internal/api/router, internal/repository/postgres, internal/auth/...,
internal/service/..., internal/tlsprobe/..., internal/trustanchor/...,
internal/validation/...

Spec at cowork/auth-bundles-fixes-2026-05-11/02-crit-deactivated-at-enforcement.md
Closure annotation at cowork/auth-bundles-audit-2026-05-10.md MED-11 row.
Operator advisory in CHANGELOG.md v2.1.0 release notes.
2026-05-11 02:21:05 +00:00
shankar0123 e005c004e1 harden(oidc): JWKS auto-refresh on kid-not-in-cache (MED-6)
Audit 2026-05-10 MED-6 closure.

WHAT.

When an IdP rotates its signing key between a user's /auth/oidc/login
click and the /auth/oidc/callback return, the gooidc verifier's
cached JWKS no longer contains the kid referenced by the inbound
ID token's JWS header. Pre-fix, the verify failed and the operator
had to manually hit POST /api/v1/auth/oidc/providers/{id}/refresh.

HandleCallback now distinguishes the kid-not-in-cache shape
(isKidMismatchError) from generic verify failures and runs a
one-shot recovery:

  1. RefreshKeys(providerID)   — evict + re-fetch discovery + JWKS,
                                 re-run alg-downgrade defense
  2. getOrLoad(providerID)     — refresh the cached providerEntry
  3. verifier.Verify(rawJWT)   — one-shot retry against new JWKS

A second failure surfaces through the original error branches
(ErrJWKSUnreachable for fetch errors, generic wrap for everything
else). NO retry loop — bounded recovery only.

WHY.

Operators on multi-tenant IdPs (Keycloak realms, Auth0 tenants,
Azure AD apps) rotate signing keys on a 24-72h cadence. Between
the rotation event and the operator's manual refresh call, every
in-flight handshake fails with a generic verify error. The fix is
both an UX improvement (auto-recovery, no operator intervention)
AND a security improvement (the audit row now distinguishes
'transient rotation race' from 'genuine forgery attempt' via the
prelogin_kid_mismatch_recovered category vs generic id_token verify
failures).

HOW.

internal/auth/oidc/service.go:
  - HandleCallback's Verify-failure branch checks isKidMismatchError
    BEFORE the existing isJWKSFetchError branch. On match, runs
    RefreshKeys + getOrLoad + verifier.Verify exactly once. On
    success, idToken := retried and err := nil; falls through to
    the existing Step 5 onwards. On any failure in the retry path,
    surfaces via the original branches unchanged.
  - isKidMismatchError matcher: pinned go-oidc/v3 v3.18.0 substrings
    ('kid .* not found', 'signing key .* not found', 'no matching
    key', 'key with id .* not found'). Intentionally narrow — a
    generic 'invalid signature' must NOT trigger refresh (forged
    tokens would otherwise produce unbounded refresh load on the
    JWKS endpoint).

internal/auth/oidc/service_test.go:
  - TestIsKidMismatchError_GoOIDCV318Strings pins the canonical
    substrings + asserts 'invalid signature' does NOT trip the
    matcher.
  - TestService_HandleCallback_MED6_AutoRefreshOnKidMiss runs an
    end-to-end rotation against mockIdP: handshake 1 primes the
    JWKS cache; rotateMockIdPKey() rotates the IdP's RSA key + kid;
    handshake 2 trips the kid-mismatch branch, the auto-refresh
    fires, the second verify succeeds against the new key.

VERIFY.

- go vet ./internal/auth/oidc/...                           PASS
- go test -short -count=1 -run 'MED6|KidMismatch'
  ./internal/auth/oidc/...                                  PASS (2/2)
- go test -short -count=1 ./internal/auth/oidc/...          PASS (4.3s)

Out of scope: Nit-5's RotateRealmKeys-backed Keycloak integration
test (build-tagged 'integration') — that's the realm-running
counterpart to the mockIdP-based MED-6 test added here; tracked
separately as item 20 in HANDOFF.md.

Refs: cowork/auth-bundles-audit-2026-05-10.md MED-6
      cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 3
2026-05-10 23:28:57 +00:00
shankar0123 2a1a0b347c harden(oidc): pre-login UA/IP binding (MED-16) — RFC 9700 §4.7.1
Audit 2026-05-10 MED-16 closure.

WHAT.

Binds the OIDC pre-login row to the (clientIP, userAgent) tuple of
the /auth/oidc/login request, and enforces a constant-time compare
against the /auth/oidc/callback request at consume time. Defeats
replay of a stolen pre-login cookie by a different browser /
source — the secondary defense layer recommended by RFC 9700 §4.7.1
when the primary layer (HMAC integrity + Path=/ + SameSite=Lax on
the cookie) is bypassed via CSRF / XSS / TLS-termination leak.

WHY.

Pre-fix, the pre-login cookie's HMAC verified only that 'some'
caller of /auth/oidc/login was talking to /auth/oidc/callback; it
did not verify that the SAME browser / source was on both sides.
An attacker who exfiltrated the cookie value via any vector could
replay the bytes through their own user-agent and ride the victim's
authorization. RFC 9700 §4.7.1 calls out the gap explicitly and
recommends binding state to a user-agent fingerprint + source IP.

HOW.

Migration:
  migrations/000044_prelogin_uaip.up.sql
    ALTER TABLE oidc_pre_login_sessions
      ADD COLUMN IF NOT EXISTS client_ip   TEXT,
      ADD COLUMN IF NOT EXISTS user_agent  TEXT;
  Both nullable for in-flight rolling-deploy compat — the consume-
  side check only enforces when both row AND request carry non-empty
  values for the leg in question.

Domain:
  internal/repository/oidc.go (PreLoginSession) — adds ClientIP +
    UserAgent fields.

Repository:
  internal/repository/postgres/oidc_prelogin.go — Create persists
    via sql.NullString (empty → NULL); LookupAndConsume reads back.
    Re-uses package-local nullableString from discovery.go.

Service:
  internal/auth/oidc/service.go
    - PreLoginStore.CreatePreLogin signature takes (clientIP,
      userAgent) as positions 5–6.
    - PreLoginStore.LookupAndConsume returns (clientIP, userAgent)
      as positions 5–6.
    - HandleAuthRequest signature gains (clientIP, userAgent),
      threaded to the store.
    - HandleCallback adds Step 1.5 — UA / IP constant-time compare
      between stored row and incoming request. Per-leg toggles via
      preLoginRequireUA / preLoginRequireIP service fields. Empty
      values on either side pass through (rolling-deploy + headless-
      proxy compat).
    - New sentinels ErrPreLoginUAMismatch, ErrPreLoginIPMismatch.
    - SetPreLoginBindingRequirements(requireUA, requireIP) helper
      for main.go config wiring.

Adapter:
  internal/auth/oidc/prelogin.go — PreLoginAdapter passes the new
    fields through to the repo row.

Handler:
  internal/api/handler/auth_session_oidc.go
    - OIDCAuthHandshaker.HandleAuthRequest signature updated.
    - LoginInitiate captures clientIPFromRequest + r.UserAgent()
      and passes to the service.
    - classifyOIDCFailure adds errors.Is dispatch for the two new
      sentinels → prelogin_ua_mismatch / prelogin_ip_mismatch
      audit categories.

Config:
  internal/config/config.go
    + AuthConfig.OIDCPreLoginRequireUA (default true)
      env CERTCTL_OIDC_PRELOGIN_REQUIRE_UA
    + AuthConfig.OIDCPreLoginRequireIP (default true)
      env CERTCTL_OIDC_PRELOGIN_REQUIRE_IP
  cmd/server/main.go calls oidcService.SetPreLoginBindingRequirements
    from cfg.Auth.OIDCPreLoginRequire{UA,IP}.

Tests (internal/auth/oidc/service_test.go):
  - TestService_HandleCallback_MED16_UAMismatchRejected
  - TestService_HandleCallback_MED16_IPMismatchRejected
  - TestService_HandleCallback_MED16_BothMatch_Succeeds
  - TestService_HandleCallback_MED16_LegacyRowEmptyValues  (rolling-
    deploy compat — empty stored values pass through)
  - TestService_HandleCallback_MED16_RequireUAFalse_AllowsMismatch
    (operator escape-hatch — UA mismatch silently allowed)

Mechanical fan-out:
  - stubPreLogin / stubPreLoginRepo signatures updated.
  - All existing call sites in service_test.go (~40), prelogin_test.go,
    bench_test.go, logging_test.go, provider_enabled_test.go,
    integration_keycloak_test.go, integration_okta_smoke_test.go,
    auth_session_oidc_test.go updated to pass empty strings for the
    new params — pre-existing tests do not exercise UA/IP binding
    semantics.

VERIFY.

- go vet ./internal/auth/oidc/... ./internal/api/handler/...
  ./internal/config/...                                       PASS
- go test -short -count=1 -run MED16 ./internal/auth/oidc/... PASS (5/5)
- go test -short -count=1 ./internal/auth/oidc/...            PASS (4.6s)
- go test -short -count=1 ./internal/api/handler/...          PASS (4.3s)
- go test -short -count=1 ./internal/config/...               PASS

Refs: cowork/auth-bundles-audit-2026-05-10.md MED-16
      cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 6
      RFC 9700 §4.7.1 — OAuth 2.0 Security Best Current Practice
2026-05-10 23:18:23 +00:00
shankar0123 2cd2a5c52f harden(oidc): RFC 9207 iss URL parameter check on callback (MED-17)
Audit 2026-05-10 MED-17 closure.

WHAT.

When the matched IdP's discovery doc advertises
authorization_response_iss_parameter_supported=true (RFC 9207 §3),
HandleCallback now REQUIRES a non-empty `iss` query parameter on
/auth/oidc/callback and enforces a constant-time compare against the
configured provider's IssuerURL. Mismatch maps to two new sentinel
errors (ErrIssParamMissing / ErrIssParamMismatch) that the handler's
classifyOIDCFailure dispatches via errors.Is BEFORE the substring
fall-through, so the audit failure_category remains distinguishable
between the RFC 9207 leg (iss_param_missing / iss_param_mismatch) and
the in-token iss claim leg (id_token_iss_mismatch).

WHY.

The RFC 9207 iss URL parameter is the load-bearing mix-up-attack
defense for multi-tenant IdPs (Keycloak realms, Authentik tenants,
Auth0 tenants, public-trust CAs). Pre-fix the parameter was silently
ignored — an attacker controlling one IdP tenant could route an auth
code to certctl's callback against a different tenant's pre-login
state without detection. Modern Keycloak / Authentik / public-trust
CAs ship the discovery flag by default; legacy IdPs that don't
advertise are unaffected (back-compat preserved).

HOW.

- internal/auth/oidc/service.go
  - providerEntry gains issParamSupported bool.
  - getOrLoad extends the discovery-claims read to include
    authorization_response_iss_parameter_supported, alongside the
    existing id_token_signing_alg_values_supported defense.
  - HandleCallback's signature gains callbackIss string at position 5.
    Step 2.5 runs after the state compare + provider load: when
    issParamSupported is true, an empty callbackIss returns
    ErrIssParamMissing; a present-but-mismatched value returns
    ErrIssParamMismatch (constant-time compare).
  - Two new sentinels: ErrIssParamMissing, ErrIssParamMismatch.
    ErrIssuerMismatch's doc-string clarified to note it covers the
    in-token leg only.

- internal/api/handler/auth_session_oidc.go
  - OIDCAuthHandshaker.HandleCallback signature updated.
  - LoginCallback reads r.URL.Query().Get("iss") (no TrimSpace —
    byte-strict compare upstream) and threads it through.
  - classifyOIDCFailure: typed errors.Is dispatch for the three
    iss-family sentinels BEFORE the substring fall-through, so the
    three cases stay distinguishable in the audit row.

- internal/api/handler/auth_session_oidc_test.go
  - stubOIDCSvc.HandleCallback bumped to 7-arg signature.
  - TestClassifyOIDCFailure extended with 5 new cases pinning the
    iss-family dispatch + a wrapped-error round-trip.

- internal/auth/oidc/service_test.go
  - mockIdP gains advertiseIssParameterSupported bool; the
    /.well-known/openid-configuration handler emits the claim only
    when set (so existing tests stay back-compat).
  - 4 new regression tests:
    * MED17_NoSupport_AnyIssAccepted — provider doesn't advertise;
      arbitrary callbackIss is ignored (back-compat).
    * MED17_SupportButMissing — provider advertises; missing iss →
      ErrIssParamMissing.
    * MED17_SupportButMismatch — provider advertises; wrong iss →
      ErrIssParamMismatch (load-bearing mix-up defense).
    * MED17_SupportAndCorrect — provider advertises; matching iss →
      success path proves the gate isn't over-eager.

- internal/auth/oidc/bench_test.go,
  internal/auth/oidc/logging_test.go,
  internal/auth/oidc/integration_keycloak_test.go
  - Mechanical: all existing HandleCallback call sites updated to
    pass "" for callbackIss (matches pre-fix behavior for IdPs that
    don't advertise support — the Keycloak integration suite tests
    will be re-evaluated once the Keycloak fixture is run against a
    realm with the discovery flag enabled).

VERIFY.

- go vet ./internal/auth/oidc/... ./internal/api/handler/...   PASS
- go test -short -count=1 ./internal/auth/oidc/...              PASS (3.4s)
- go test -short -count=1 ./internal/api/handler/...            PASS (5.4s)
- 4 new MED-17 regression tests + extended TestClassifyOIDCFailure pass.

Refs: cowork/auth-bundles-audit-2026-05-10.md MED-17
      cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 7
      RFC 9207 — OAuth 2.0 Authorization Server Issuer Identification
2026-05-10 23:05:52 +00:00
shankar0123 e7c4654b16 harden(auth/session+oidc): 503/401 split + go-oidc string pin (LOW-6 + Nit-2)
Audit 2026-05-10 — close LOW-6 + Nit-2 from the HANDOFF.md backend
batch (items 8 + 9).

LOW-6: introduce ErrSessionTransient sentinel in session.Service.
session.Validate now distinguishes:
  - errors.Is(err, repository.ErrSessionNotFound) → ErrSessionInvalidCookie (401)
  - All other repo errors                         → ErrSessionTransient (503)
The session middleware maps ErrSessionTransient to HTTP 503 with
Retry-After: 1. Pre-fix, every DB hiccup looked like a forged-cookie
401 and forced the user to re-authenticate on a transient outage.
Two new regression tests pin the wire shape:
  - TestService_Validate_TransientSessionGetError (service layer)
  - TestService_Validate_SessionNotFoundMapsToInvalidCookie (negative
    leg: not-found stays 401)
  - TestSessionMiddleware_TransientErrorMappedTo503 (middleware-level
    503 + Retry-After header)

Nit-2: isJWKSFetchError documentation now pins go-oidc/v3 v3.18.0 as
the source-of-truth string set. v3.18.0 exposes only
*oidc.TokenExpiredError as a typed error; JWKS-fetch failures bubble
up as fmt.Errorf-wrapped strings. New regression test
TestIsJWKSFetchError_GoOIDCV318Strings pins the canonical substrings
emitted by go-oidc's jwks.go — a future upstream bump that changes
the wording trips the test and forces the matcher to be re-derived.
The test caught a real gap: 'oidc: failed to decode keys' (emitted
when the IdP returns non-JSON at the jwks_uri — broken proxy, gateway
HTML error page, etc.) was previously misclassified as a generic 500
instead of 503 ErrJWKSUnreachable. Added 'decode keys' substring to
the matcher.

Status: LOW-6 + Nit-2 marked CLOSED in audit-doc table.

Refs: cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md items 8, 9
      cowork/auth-bundles-audit-2026-05-10.md LOW-6, Nit-2
2026-05-10 22:41:19 +00:00
shankar0123 925523e06e feat(oidc): Enabled toggle on OIDCProvider (MED-9)
Audit 2026-05-10 Fix 13 Phase B — close MED-9. MED-4/5/6/7 deferred to v3.

MED-9: ship the OIDCProvider.Enabled boolean. Pre-fix, the only way
to take a provider offline during an incident was DELETE, which
breaks active user_oidc_provider FK references and orphans any
session that minted under the provider. Post-fix:

  - Migration 000042 adds enabled BOOLEAN NOT NULL DEFAULT TRUE.
    Default-true means existing pre-migration rows are all enabled
    post-deploy; no breaking-change window.
  - internal/auth/oidc/domain/types.go::OIDCProvider.Enabled ships
    the domain field with JSON tag 'enabled'.
  - Repository read/write paths (List, Get, GetByName, Create, Update)
    all carry the column.
  - internal/auth/oidc/service.go::HandleAuthRequest rejects with
    the new ErrProviderDisabled sentinel when cfgRow.Enabled=false.
  - cmd/server/main.go::oidcProvidersListAdapter.List filters
    disabled providers before constructing OIDCProviderInfo so the
    LoginPage's 'Sign in with X' buttons never render for offline
    IdPs.
  - Defense-in-depth: the ErrProviderDisabled service-layer check
    is the guard for direct API / MCP / CLI callers that bypass the
    GUI.

Regression test: internal/auth/oidc/provider_enabled_test.go warms
the entry cache via a successful HandleAuthRequest, flips
cfgRow.Enabled=false on the cached entry, then asserts the next call
returns ErrProviderDisabled (errors.Is). Test fixtures (newValidProvider,
makeProvider) updated to set Enabled: true so existing tests stay
green.

Operators can toggle Enabled today via the existing PUT
/api/v1/auth/oidc/providers/{id} body field. A dedicated GUI
toggle on OIDCProviderDetailPage and a single-purpose PUT-just-enabled
endpoint are deferred to the v3 GUI-polish bundle — the load-bearing
wire is in place now.

MED-4 (GUI advanced fields on edit), MED-5 (POST .../test endpoint
+ button), MED-6 (JWKS auto-refresh on cache-miss), MED-7 (JWKS
health endpoint + GUI panel): DEFERRED to v3 with explicit
annotations in the audit doc. Workarounds: MED-4 fields are
PUT-editable via curl/MCP; MED-5 → call refresh post-create;
MED-6 → call refresh manually on key rotation.

Refs: cowork/auth-bundles-audit-2026-05-10.md MED-4, MED-5, MED-6,
      MED-7, MED-9
Spec: cowork/auth-bundles-fixes-2026-05-10/13-med-bundle.md Phase B
2026-05-10 21:59:17 +00:00
shankar0123 9b6294e83d auth-bundle-2 Phase 14: session + OIDC validation benchmarks (steady-state + cold paths) + auth-benchmarks.md operator doc + Makefile targets
Closes Phase 14 of cowork/auth-bundle-2-prompt.md. Ships four
benchmarks producing four numbers + the operator-doc table; three
default-tag benchmarks runnable on every CI runner, the fourth
(cold-cache OIDC) runnable on operator-side Docker hosts via the
new make target.

Files
=====

internal/auth/session/bench_test.go (NEW):
* BenchmarkSession_SteadyState (target p99 < 1ms; measured 5µs).
  Warm in-memory repo + warm session row. Pure CPU: parseCookie +
  HMAC verify + map lookup + sentinel checks.
* BenchmarkSession_ColdProcess (target p99 < 10ms; measured 7.1ms).
  Same pipeline but with a configurable per-call delay simulating
  a 1ms Postgres RTT on each repo call. Two repo calls per
  Validate (signing-key fetch + session-row fetch) = 2ms minimum;
  Go time.Sleep granularity adds ~1-2ms jitter. Documented why
  testcontainers Postgres isn't viable inside b.N: 30+ second
  container boot incompatible with per-iteration timing.
* slowSessionRepo + slowKeyRepo wrappers add the per-call delay
  via time.Sleep; they delegate to the existing in-memory stubs.
* reportPercentiles helper sorts + reports p50/p95/p99/max via
  b.ReportMetric (Go testing.B doesn't surface percentiles
  natively).

internal/auth/oidc/bench_test.go (NEW):
* BenchmarkOIDC_SteadyState (target p99 < 5ms; measured 1.5ms).
  Drives full HandleCallback against an in-process mockIdP
  (httptest.Server localhost loopback). Pre-warmed JWKS cache via
  RefreshKeys at setup. Pipeline: pre-login consume + state
  compare + token exchange (localhost ~50-200µs) + go-oidc
  Verify (RSA-2048 sig verify + alg pin) + service-layer iss/
  aud/azp/at_hash/exp/iat/nonce re-checks + group-claim
  resolution + group→role mapping + user upsert + session mint.
* The localhost-loopback /token call adds ~100-500µs of TCP
  overhead vs pure crypto; the prompt's "no network calls"
  steady-state framing accommodates this since the localhost
  loopback is the closest practical proxy for a same-region
  IdP /token call (which adds 5-15ms in production).

internal/auth/oidc/bench_keycloak_test.go (NEW, //go:build integration):
* BenchmarkOIDC_ColdCache (target p99 < 200ms; operator-runs).
  Drives RefreshKeys against a live Keycloak container from the
  Phase 10 testfixtures harness. Each iteration evicts the
  in-process cache + re-fetches discovery + re-fetches JWKS over
  real HTTP + re-runs the IdP-downgrade-attack defense.
* Network-bounded: the cold path is dominated by HTTPS RTT to
  the IdP discovery endpoint, NOT crypto. The 200ms cap
  accommodates a geographically-distant IdP (~150ms RTT) plus
  the in-process JWKS fetch + downgrade-defense logic (~5ms
  locally).
* Reuses the sharedKeycloak fixture from
  integration_keycloak_test.go (Phase 10) so the benchmark
  doesn't pay the 60-90s container boot cost separately. Skips
  with a clear message if invoked without the integration test
  setup.
* Reports p50/p95/p99/max in MILLISECONDS (vs the
  microsecond-granularity steady-state benchmarks) since the
  cold path is two orders of magnitude slower.

internal/auth/oidc/service_test.go (MODIFIED):
* Refactored newMockIdP(t *testing.T) to delegate to a new
  newMockIdPWithTB(t testing.TB) sibling. Standard Go pattern
  for sharing test fixtures between *testing.T and *testing.B.
  No behavior change for existing service_test.go tests; the
  benchmark file in bench_test.go calls newMockIdPWithTB(b)
  to get the same fixture.

docs/operator/auth-benchmarks.md (NEW):
* Result table with all four benchmarks + targets + measured
  numbers + status markers. Four-row matrix for the default-tag
  benchmarks; the fourth row (cold-cache) is operator-recorded
  with an empty cell waiting for the first Docker-equipped run.
* Hardware floor section pinning the 4 vCPU / 8 GiB RAM /
  Postgres 16 / Go 1.25 baseline. GitHub-hosted Ubuntu runners
  satisfy this; operators on weaker hardware re-record.
* "What each benchmark covers (and what it doesn't)" section
  per benchmark, distinguishing the warm steady-state pipeline
  from the cold path's network-bounded budget.
* "Cold-cache OIDC: how to run" subsection documenting the
  make target + the test+benchmark coupling needed to populate
  sharedKeycloak. Operator-recorded baseline table seeded
  empty for first runs.
* "Why the cold path is bounded by network latency, not crypto"
  section explaining the budget breakdown:
    - TCP handshake (1 RTT)
    - TLS 1.3 handshake (1-2 RTTs)
    - 2 HTTPS GETs (discovery + JWKS, 1 RTT each)
    - In-process crypto on the certctl side (~5-10ms total)
  So the 200ms cap is operator-checkable: real measurement >
  200ms means the IdP is slow OR network congestion OR DNS
  issues — the diagnosis is upstream of certctl. Real
  measurement < 200ms means the IdP is on a fast same-region
  link.
* Methodology section pinning the per-iteration timing capture
  + sort + percentile-extract approach.
* Pre-merge audit section for the Phase 14 exit gate: four
  benchmarks ran, four numbers recorded, steady-state targets
  met, cold path is operator-runnable + measurably-bounded.

Makefile (MODIFIED):
* Added `make benchmark-auth` (default-tag, runs three of four
  benchmarks at 2000 samples each).
* Added `make benchmark-auth-coldcache` (integration-tagged,
  runs OIDC cold-cache against live Keycloak; requires Docker).
* Both targets carry explanatory comment blocks.

docs/README.md (MODIFIED):
* Added the auth-benchmarks.md doc to the Operator nav table
  alongside performance-baselines.md.

Measured baselines at Phase 14 close (linux/arm64, 4 vCPU)
==========================================================

  BenchmarkSession_SteadyState     p99 = 5µs    (target < 1ms)   ✓ 200× under
  BenchmarkSession_ColdProcess     p99 = 7.1ms  (target < 10ms)  ✓
  BenchmarkOIDC_SteadyState        p99 = 1.5ms  (target < 5ms)   ✓ 3× under
  BenchmarkOIDC_ColdCache          operator-runs (Docker required)

Verification
============

* gofmt -l on three new bench files: clean.
* go vet ./internal/auth/session/... ./internal/auth/oidc/...: clean
  (default tag).
* go vet -tags integration ./internal/auth/oidc/...: clean (integration
  tag covers the bench_keycloak_test.go file).
* go test -short -count=1 across all 5 OIDC + session packages:
  green; the bench_*_test.go files compile but don't run under
  -short (testing.Short() guards + benchmarks are not selected
  by -run pattern).
* All three runnable benchmarks executed and produce the numbers
  above; recorded in auth-benchmarks.md.
2026-05-10 16:51:28 +00:00
shankar0123 1d01c87663 auth-bundle-2 Phase 7 + Phase 7.5: OIDC first-admin bootstrap +
break-glass admin (Argon2id, lockout, default-OFF, surface-invisibility)

Phase 7 — OIDC first-admin bootstrap (Decision 3):

  - Optional AdminBootstrapHook closure on *oidc.Service. When wired,
    HandleCallback consults the hook AFTER group resolution + user
    upsert and BEFORE the empty-mapping fail-closed check. Hook
    receives (providerID, groups, userID); returns grantAdmin=true
    when the user matches CERTCTL_BOOTSTRAP_ADMIN_GROUPS AND no
    admin exists yet in the tenant.
  - cmd/server/main.go wires the hook as a closure that:
      * Filters by CERTCTL_BOOTSTRAP_OIDC_PROVIDER_ID (if configured).
      * Probes AdminExists via authActorRoleRepo (admin-already-exists
        silently returns false; bootstrap mode is one-shot per tenant).
      * Walks group intersection.
      * On match: grants r-admin via authActorRoleRepo.Grant + emits
        the bootstrap.oidc_first_admin audit row with
        event_category=auth + INFO log.
  - Coexists with the Bundle 1 env-var-token bootstrap. Both paths
    can be configured; first match wins (admin-existence probe
    short-circuits the second).
  - HandleCallback's empty-mapping fail-closed check moved AFTER the
    hook so a fresh deployment with zero group_role_mappings can
    still mint the first admin.
  - 5 tests in service_test.go: hook grants admin on match, hook
    returns false preserves empty-mapping fail-closed, admin-already-
    exists silently falls through to normal mapping, hook-error wraps
    + bubbles, idempotent when admin is already in the mapped role set.

Phase 7.5 — Break-glass admin (Decision 4, default-OFF):

Migration 000038 ships:

  - breakglass_credentials table — at-most-one-credential-per-actor
    (UNIQUE(actor_id)), Argon2id PHC-format password_hash, lockout
    state machine (failure_count, locked_until, last_failure_at).
    FK CASCADE on users(id) so deleting a user atomically removes
    their credential.
  - Two new permissions seeded into r-admin only:
      auth.breakglass.admin — set/rotate/unlock/remove credentials.
      auth.breakglass.login — actor uses break-glass to log in.
    CanonicalPermissions extended in lockstep.

internal/auth/breakglass/service.go (~580 LOC):

  - Service.Enabled() reflects CERTCTL_BREAKGLASS_ENABLED.
  - SetPassword: Argon2id with OWASP 2024 params (m=64MiB, t=3, p=4,
    salt=16 random bytes, output=32 bytes); per-password random salt;
    PHC-format hash output. Min 12 / max 256 byte input.
  - Authenticate: constant-time-compare via subtle.ConstantTimeCompare
    on every code path. Identical 401 + identical timing across the
    wrong-password / locked-account / non-existent-actor paths so an
    attacker cannot probe whether a given actor has break-glass
    configured. Non-existent-actor + locked-account paths run a
    verifyDummy() Argon2id pass for timing parity. Lockout state
    machine: failure_count++ on every wrong attempt; threshold (default
    5) trips locked_until = NOW() + duration (default 15m). Successful
    Authenticate resets the counter. Reset-window: failures aged out
    after CERTCTL_BREAKGLASS_LOCKOUT_RESET_INTERVAL (default 1h)
    auto-reset on next attempt.
  - Unlock + RemoveCredential: admin-only (auth.breakglass.admin
    gated at the router via rbacGate). Audit rows on every operation.
  - All public methods refuse to act when Enabled()==false (returns
    ErrDisabled; the handler maps to HTTP 404 — surface invisibility).

internal/repository/postgres/breakglass.go ships the 5-method
postgres impl with atomic single-statement IncrementFailure (so
concurrent racing wrong-password attempts can't observe an
intermediate state and slip past the threshold) and idempotent
ResetFailureCount.

internal/api/handler/auth_breakglass.go ships the 4-endpoint HTTP
surface:

  - POST /auth/breakglass/login (auth-exempt; 5/min rate-limited per
    source IP via the existing rate limiter; returns 404 when
    disabled). On success sets the post-login session cookie + CSRF
    cookie via SessionService.Create + 204. On any failure:
    uniform 401 + identical timing (the service has already audited
    the specific failure category).
  - POST /api/v1/auth/breakglass/credentials (auth.breakglass.admin)
  - POST /api/v1/auth/breakglass/credentials/{actor_id}/unlock
    (auth.breakglass.admin)
  - DELETE /api/v1/auth/breakglass/credentials/{actor_id}
    (auth.breakglass.admin)

Admin endpoints share the surface-invisibility property: when
CERTCTL_BREAKGLASS_ENABLED=false, every admin endpoint also returns
404 (not 403) so probing via the admin surface gets the same signal
as probing the login endpoint.

Tests (internal/auth/breakglass/service_test.go):

All 8 Phase 7.5 spec-mandated negative cases:

  1. Service.Enabled()==false → all ops return ErrDisabled.
  2. Wrong password → ErrInvalidCredentials, failure_count++,
     audit row with event_category=auth.
  3. Failure_count exceeds threshold → locked, subsequent attempts
     (including with the CORRECT password) return identical-shape
     401 while the lockout window holds.
  4. Lockout window expires → next attempt with correct password
     succeeds + resets the counter.
  5. Password < 12 bytes (or > 256 bytes) → ErrWeakPassword.
  6. Password leak hygiene — the service has zero slog calls; the
     audit-row map literal never includes the password plaintext.
  7. Argon2id hash never appears in logs OR API responses — pinned
     by `json:"-"` tag on BreakglassCredential.PasswordHash + a
     belt-and-braces json.Marshal probe asserting the hash bytes
     never appear in the marshaled output.
  8. Constant-time-compare verified via timing-statistical test —
     wrong-password vs no-credential paths take statistically
     indistinguishable time (within 5x ratio). The verifyDummy()
     hash compute on the no-credential + locked paths is what
     keeps timing parity; absent that, an attacker could side-
     channel "actor doesn't have a credential" via timing.

Plus coverage-lift batch covering: SetPassword first-time vs rotate,
no-caller-id rejection, no-target-id rejection, RNG failure surface,
Authenticate happy-path mints session, no-credential audit row,
session-mint-failure surface, FailureResetInterval recycle, Unlock
+ RemoveCredential happy paths, hash-format unit tests (round-trip,
mismatch, malformed/wrong-version/bad-base64 formats), nil-audit +
nil-session pass-through.

Coverage on internal/auth/breakglass/ at 91.5% per-statement (above
the Phase 7.5 spec ≥ 90% floor).

cmd/server/main.go wiring:

  - Constructs breakglassRepo + breakglassService + breakglassHandler
    after the OIDC service block.
  - breakglassSessionMinterAdapter shim bridges *session.Service.Create
    to the breakglass.SessionMinter port.
  - Logs WARN at boot when CERTCTL_BREAKGLASS_ENABLED=true (operator
    visibility for the deliberate SSO-bypass).

internal/config/config.go gains:

  - AuthConfig.BootstrapAdminGroups + BootstrapOIDCProviderID for
    Phase 7 (CERTCTL_BOOTSTRAP_ADMIN_GROUPS comma-list +
    CERTCTL_BOOTSTRAP_OIDC_PROVIDER_ID).
  - AuthConfig.Breakglass nested struct with 4 env vars
    (CERTCTL_BREAKGLASS_ENABLED + LOCKOUT_THRESHOLD + LOCKOUT_DURATION
    + LOCKOUT_RESET_INTERVAL).

Router wiring:

  - 4 new breakglass routes registered when reg.AuthBreakglass != nil;
    public login route via direct r.mux.Handle (auth-exempt), 3 admin
    routes via r.Register + rbacGate(auth.breakglass.admin).
  - POST /auth/breakglass/login pinned in AuthExemptRouterRoutes
    allowlist with Phase 7.5 justification.
  - SpecParityExceptions extended with 4 new entries documenting
    the Phase 7.5 deferral of full per-endpoint OpenAPI rows
    (handler doc-block at the top of auth_breakglass.go is the
    operator-facing reference).

Threat model (encoded in service.go + auth_breakglass.go doc-blocks
+ migration 000038 docstrings, to be promoted to docs/operator/auth-
threat-model.md in Phase 12):

  - Break-glass is a deliberate bypass of the SSO security boundary.
    An attacker who phishes the password OR finds it in a compromised
    password manager bypasses MFA, OIDC, and every group-claim gate.
  - Recommendation: keep CERTCTL_BREAKGLASS_ENABLED=false in steady-
    state. Enable only during SSO-broken incidents. Disable after
    recovery.
  - WebAuthn pairing (v3 per Decision 12) is the load-bearing second
    factor. Without it, break-glass is best treated as an emergency-
    only path.
  - Audit trail surfaces every break-glass action under
    event_category=auth; the auditor role can monitor for unexpected
    break-glass logins.

Verifications: gofmt clean, go vet clean across all touched packages,
go test -short -count=1 green across internal/auth/oidc (3.0s; new
Phase 7 hook tests integrated alongside the 21+ Phase 3 negatives),
internal/auth/breakglass (3.6s; 8 spec-mandated negatives + coverage
batch passing), internal/config + internal/domain/auth + internal/api/
router + internal/api/handler all green, no regressions in Bundle 1
packages.
2026-05-10 06:51:41 +00:00
shankar0123 854135dfb7 auth-bundle-2 Phase 3: OIDC service (HandleAuthRequest, HandleCallback,
RefreshKeys), hand-rolled group-claim resolver, 21+ negative-test
matrix, token-leak hygiene, IdP downgrade-attack defense

Phase 3 of the bundle ships the business logic that turns the Phase 2
storage primitives into a working OpenID Connect 1.0 + RFC 7636 PKCE
authorization-code flow against any enterprise IdP (Okta / Azure AD /
Google Workspace / Keycloak / Authentik / Auth0).

Service surface:

  - Service.HandleAuthRequest(providerID) -> authURL, cookie, preLoginID
    Builds the IdP redirect with PKCE-S256 (mandatory; RFC 9700 §2.1.1),
    server-generated 32-byte state + nonce, persisted to the pre-login
    row keyed by the cookie value.
  - Service.HandleCallback(cookie, code, state, ip, ua) -> *CallbackResult
    11-step validation: pre-login lookup-and-consume (single-use),
    constant-time state compare, code-for-token exchange with PKCE
    verifier, ID-token verify (alg pin via go-oidc/v3), service-layer
    re-checks of iss / aud / azp (multi-aud requires it; mismatch
    rejected) / at_hash (REQUIRED when access_token returned —
    Phase 3 lifts the OIDC core "MAY" to a service-level "MUST") /
    exp / iat-window / nonce, group-claim resolution with userinfo
    fallback, group->role mapping (fail-closed on no match),
    user upsert, session mint via SessionMinter port.
  - Service.RefreshKeys(providerID) — explicit cache eviction +
    re-load. Re-runs the IdP downgrade-attack defense so a provider
    that later rotates to advertising HS* / none is caught BEFORE the
    next user login attempt.

Security posture (every fail-closed branch is a sentinel error +
test):

  - Algorithm pinning: allow-list {RS256, RS512, ES256, ES384, EdDSA};
    deny-list {HS256, HS384, HS512, none}. Belt-and-braces re-check
    via isDisallowedAlg after go-oidc.Verify.
  - PKCE-S256 mandatory (oauth2.GenerateVerifier + S256ChallengeOption);
    `plain` rejection sentinel exists for defense-in-depth.
  - State + nonce: 32-byte crypto/rand, base64url-no-pad,
    constant-time compare, single-use.
  - IdP downgrade-attack defense: at provider creation / RefreshKeys,
    reject any IdP whose discovery doc advertises HS* / none in
    id_token_signing_alg_values_supported.
  - JWKS fail-closed: in-flight login fails 503; existing sessions
    untouched. isJWKSFetchError detects the gooidc verify-error
    shape; ErrJWKSUnreachable is the wire mapping.
  - Token-leak hygiene: ID tokens, access tokens, refresh tokens,
    authorization codes, PKCE verifiers, state, nonce, signing key
    bytes — NEVER logged at any level. logging_test.go pins the
    invariant via a slog buffer + grep-assert across HandleAuthRequest,
    HandleCallback, alg rejection, and provider-load paths.

Group-claim resolver (internal/auth/oidc/groupclaim/):

  - Hand-rolled per Decision 10 (no JSON-path lib; ~150 LOC).
  - URL-shape paths (https:// / http://) treated as a single
    literal key — Auth0 namespaced claims like
    https://your-namespace/groups work without splitting on the
    dots in the URL.
  - Dot-separated paths walked through nested map[string]interface{}.
  - []interface{} / []string / single-string normalized to []string;
    bool / number / object / nil → fail closed.
  - 18 unit tests + sentinels (ErrPathEmpty, ErrSegmentMissing,
    ErrSegmentNotObject, ErrInvalidValueType).

Test surface:

  - service_test.go: 57 test functions including all 21 prompt-mandated
    negative cases (wrong aud / wrong iss / expired / unknown alg /
    alg=none / HMAC alg / azp missing on multi-aud / azp mismatched /
    at_hash missing / at_hash mismatched / iat in future / iat too old /
    nonce mismatched / state mismatched / state replayed / PKCE plain
    sentinel / pre-login replay / forged cookie / IdP downgrade /
    group-claim missing / group-claim unmapped) plus the userinfo
    fallback matrix (happy path + endpoint-missing + endpoint-failing +
    userinfo-also-empty), HandleAuthRequest entry point + RNG-failure
    paths, upsertUser update + create + display-name fallback +
    Validate-error paths, decryptClientSecret real-encrypt round-trip
    + bad-passphrase, alg-parser malformed-header matrix.
  - logging_test.go: 4 hygiene tests pinning no token / code / verifier /
    state / cookie / client_secret / alg name appears in any captured
    log line.
  - groupclaim/resolver_test.go: 18 cases covering Okta string-array,
    Keycloak realm_access.roles, Auth0 namespaced URL claim,
    single-string normalization, deeply-nested 3-segment walks, and
    every fail-closed branch.

Coverage:
  internal/auth/oidc                  92.2%  (floor: 90)
  internal/auth/oidc/groupclaim      100.0%  (floor: 95)
  internal/auth/oidc/domain           96.2%  (floor: 90)

Coverage gates added at .github/coverage-thresholds.yml so a future
regression in any fail-closed branch fails CI before the commit lands.

Phase 3 of cowork/auth-bundle-2-prompt.md is closed. Next up: Phase 4
(Session service: cookies, revocation, sliding-vs-absolute expiry).
2026-05-10 04:56:03 +00:00