mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 15:01:32 +00:00

Files

T

shankar0123 56e2ea1ad7 docs: v2.1.0 release polish — strip internal bundle/phase tags, update status for OIDC ship

README:
- Rewrite Status block: drop the stale 'federated identity not yet
  shipped' line; flag v2.1.0 OIDC + sessions + back-channel logout
  + break-glass as early-access; encourage GitHub issues for IdP
  rough edges. (A1 framing — keep early-access umbrella, no
  SAML/WebAuthn/JIT roadmap teaser.)
- Add OIDC SSO bullet to 'What it does' covering per-IdP runbooks,
  group-claim → role mapping, AES-256-GCM client_secret encryption,
  JWKS auto-refresh, PKCE-S256, RFC 9700 §4.7.1 pre-login binding,
  RFC 9207 iss check, __Host- cookies, CSRF rotation, idle+absolute
  expiry, BCL, break-glass admin.
- Update Security paragraph: three auth paths (API keys / OIDC /
  break-glass), HMAC-signed sessions, CSRF rotation, RFC OIDC BCL.
- Correct CI coverage thresholds against
  .github/coverage-thresholds.yml (service 70%, handler 75%,
  crypto 88%, auth packages 85-95%); 'static analysis' replaces
  the inflated '11 linters' claim (actual count is 4 active).

Docs B3 sweep — strip operator-facing 'Bundle N' / 'Phase N' tags:
- docs/operator/auth-threat-model.md — rewrite intro; rename 5 H2
  sections (API-key + RBAC defenses / OIDC + sessions + break-glass
  defenses / OIDC + sessions threat catalogue / Closed federated-
  identity threats / Future-work threats); clean ~12 H3/prose hits.
- docs/operator/rbac.md — strip Bundle 1 framing from intro,
  scope_id deferral note, MCP tools section, day-0 bootstrap, and
  'Where to look next'.
- docs/operator/auth-benchmarks.md — drop 'Phase 14' framing from
  title intro, hardware floor caption, result table caption,
  methodology, and pre-merge audit section.
- docs/operator/security.md — already cleaned earlier this session
  (RBAC / day-0 / approval-bypass / OIDC federation / sessions /
  OIDC first-admin / break-glass H3s).
- docs/operator/oidc-runbooks/{index,keycloak,authentik,okta,
  azure-ad}.md — strip Auth Bundle 2 framing + Phase 10/3/4
  references; replace with feature-name prose.
- docs/operator/legacy-clients-tls-1.2.md — drop Bundle F / M-023
  audit-reference framing; keep CWE-326.
- docs/operator/database-tls.md — drop Bundle B / M-018 framing
  from intro + Helm section.
- docs/operator/runbooks/disaster-recovery.md — drop 'Production
  hardening II Phase 10' status callout.
- docs/migration/oidc-enable.md — retitle 'Enable OIDC SSO';
  strip Bundle 1/2 framing from prereqs, troubleshooting, related
  docs; update __Host- cookie callout from 'audit MED-14' to
  v2.1.0-BREAKING.
- docs/migration/api-keys-to-rbac.md — strip Bundle 1 framing from
  intro, migration table, IsAdmin section, and cross-references.
- docs/migration/acme-from-cert-manager.md — strip residual
  'Phase 5' tags from cert-manager integration test references.
- docs/reference/configuration.md — retitle Auth section.
- docs/reference/profiles.md — strip Bundle 1 Phase 9 framing
  from RequiresApproval section + Related list.
- docs/reference/auth-standards-implemented.md — rewrite intro
  (API-key + RBAC + OIDC + sessions + back-channel logout +
  break-glass); rename 'Bundle 1 (RBAC) standards covered
  separately' H2; clean per-row Phase references.
- docs/README.md — rewrite nav-table entries to drop Bundle 1/2
  parentheticals; retitle 'Enable OIDC SSO' migration entry.

No code or test changes; pure operator-facing prose polish for
the v2.1.0 tag.

2026-05-11 16:54:07 +00:00

39 KiB

Raw Blame History

Authentication & authorization threat model

Last reviewed: 2026-05-10

This document describes the attack surface around authentication and authorization in certctl. It complements rbac.md and the per-IdP runbooks at oidc-runbooks/index.md - those docs explain how to USE the controls; this one explains what those controls defend against and which threats they explicitly do NOT close.

certctl ships two authentication paths plus a break-glass admin fallback: API keys with SHA-256 hashing + role-based authorization, and OIDC SSO with HMAC-signed server-side sessions, CSRF rotation, RFC OIDC Back-Channel Logout, an OIDC first-admin bootstrap, and a default-OFF Argon2id break-glass admin path. Each surface brings its own threat catalogue + mitigations, documented below.

Threat actors

External attacker with no credential - probing the public HTTP surface. The default trust boundary for everything except the protocol-level endpoints (ACME / SCEP / EST / OCSP / CRL, which authenticate via embedded credentials per their own RFCs).
Authenticated caller with the wrong role - has a valid API key but the role doesn't grant the requested operation. The primary RBAC threat model.
Compromised API key - attacker holds a valid Bearer token that an honest operator originally provisioned. The key may carry any role.
Insider operator - legitimate access; potentially trying to escalate privilege or bypass the approval workflow.
Compromised audit reviewer (auditor role) - read-only access to audit events but otherwise untrusted.

The following actors are added by the federated-identity surface:

OIDC-federated end user - authenticates via the organization's IdP (Keycloak / Okta / Auth0 / Entra ID / Authentik / Workspace-via-broker). The user's credential lives at the IdP; certctl never sees it. Attack vectors center on token forgery, session hijacking, and group-claim manipulation.
Stolen session cookie holder - attacker holds a valid certctl_session cookie value (typically via XSS, network MITM, or a developer who pasted a token into a chat / pastebin). Holds the attacker-side ability to make requests as the legitimate user until the cookie expires (idle 1h / absolute 8h defaults) or is revoked.
Compromised IdP - the upstream IdP itself is rogue: signs tokens for arbitrary users, mints groups arbitrarily, etc. Largely out of certctl's control; mitigations are bounded to "the audit trail records the source provider on every login, blast radius is bounded by group_role_mapping configured for that provider."
Break-glass-password holder - operator with the local Argon2id password set up for SSO outages. Bypasses the OIDC + group-claim layer entirely. The default-OFF posture is the load-bearing mitigation; once enabled the password is the entire attack surface.

API-key + RBAC defenses

API-key authentication

API keys live in CERTCTL_API_KEYS_NAMED (env-var) or api_keys (DB row, written by the day-0 admin bootstrap and the future role-management API). Keys hash via SHA-256; the middleware compares hashes via crypto/subtle.ConstantTimeCompare to defeat timing attacks.
The auth middleware populates ActorIDKey / ActorTypeKey / TenantIDKey on every authenticated request context. Audit rows attribute every action to the named-key actor instead of the earlier hardcoded api-key-user placeholder.
Demo mode (CERTCTL_AUTH_TYPE=none) injects the synthetic actor-demo-anon actor with admin grants. Production deploys MUST NOT use demo mode.

Authorization (RBAC)

Every gated handler routes through auth.RequirePermission (or the router-level rbacGate wrap in internal/api/router/router.go). The middleware resolves the actor's effective permissions via the Authorizer.CheckPermission service-layer call; on miss, the handler returns HTTP 403 BEFORE the body runs. This is the load-bearing gate.
The five admin-only fine-grained perms (cert.bulk_revoke / crl.admin / scep.admin / est.admin / ca.hierarchy.manage) are seeded into r-admin only. To delegate one, an operator creates a custom role with the specific perm and grants it to the right actor.
The auditor split: r-auditor holds only audit.read + audit.export. Pinned by the internal/domain/auth/auditor_test.go invariants. A regulator with the auditor key cannot read certificates, profiles, issuers, or any mutating surface.
The privilege-escalation guard: granting or revoking a role requires the caller to hold auth.role.assign (enforced in internal/service/auth/actor_role_service.go). A non-admin cannot self-grant admin.
The reserved-actor guard: mutations against actor-demo-anon return HTTP 409 from the service layer (ErrAuthReservedActor). The synthetic actor is operator- inaccessible.

Day-0 bootstrap

CERTCTL_BOOTSTRAP_TOKEN is constant-time-compared by EnvTokenStrategy.Validate. The strategy is one-shot via sync.Mutex-guarded consumed bool; the second call returns ErrDisabled (HTTP 410), not ErrInvalidToken (HTTP 401), so a probing attacker cannot distinguish "wrong token, retry" from "already consumed".
The strategy also re-probes admin existence on every Validate. If an admin actor lands during the gap between Available and Validate, the second caller still gets HTTP 410.
The minted plaintext key is written to the response body once. It is NEVER logged. The token-leak hygiene test in internal/api/handler/auth_bootstrap_test.go redirects slog.Default to a buffer and grep-asserts that neither the bootstrap token nor the minted key appears in any log line, audit row, or HTTP header.
The minted key is hashed before persistence. Lost key → rotate via the regular RBAC API; the plaintext is not recoverable from the DB.

Approval workflow + flip-flop loophole closure

CertificateProfile.RequiresApproval=true gates two surfaces: (a) issuance + renewal of every cert pointing at the profile, (b) edits to the profile itself. The flip-flop loophole closure closure prevents the flip-flop bypass where an admin disables approval, mutates, re-enables.
Same-actor self-approve is rejected at the service layer with ErrApproveBySameActor for both cert_issuance and profile_edit kinds. Two-person integrity is the load-bearing invariant; pinned by tests in internal/service/approval_test.go.

Audit trail

Every mutating operation flows through AuditService.RecordEvent or RecordEventWithCategory. The audit-category extension added the event_category column with a CHECK constraint enforcing the closed enum (cert_lifecycle / auth / config); the category surfaces the auth-mutation slice to the auditor view.
The WORM trigger from migration 000018 (audit_events_worm_trigger) blocks UPDATE and DELETE at the database layer. Even an admin DB user cannot tamper with audit history without dropping the trigger.
The audit redactor (internal/service/audit_redact.go) scrubs credentials + PII from the details JSONB before persistence; an _redacted_keys field surfaces what the redactor took out for compliance review.

Protocol-endpoint allowlist

ACME / SCEP / EST / OCSP / CRL endpoints authenticate via embedded credentials defined by their own RFCs (JWS-signed, challenge passwords, mTLS, public-by-RFC). The auth middleware explicitly bypasses these via IsProtocolEndpoint. The internal/api/router/phase12_protocol_allowlist_test.go regression test pins the invariant at three layers (middleware bypass, allowlist constant, router-level no-rbacGate-wraps-protocol-paths).

OIDC + sessions + break-glass defenses

OIDC token validation

Algorithm allow-list, never none, never HMAC. The service- layer pinning lives in internal/auth/oidc/service.go::disallowedAlgs
- isDisallowedAlg. The per-token alg check at sig-verify time (isDisallowedAlg, line ~1177) is the load-bearing defense — every ID token whose JWS header carries an alg outside the allow-list (RS256 / RS512 / ES256 / ES384 / EdDSA) is rejected with ErrAlgRejected. coreos/go-oidc additionally enforces the allow-list per-token at verify time as defense-in-depth against an upstream library regression. The IdP-downgrade-attack secondary defense at provider creation / RefreshKeys (v2.1.0-relaxed semantics) intersects the IdP's advertised id_token_signing_alg_values_supported with the allow-list and rejects only when the intersection is EMPTY — i.e., the IdP advertises NO acceptable alg. Pre-v2.1.0 the check strict-denied on ANY HS*/none advertisement; that broke against Keycloak 26.x (which lists every alg it's capable of in its discovery doc, including HS*, even when the realm only signs with RS256). The relaxation is safe because the per-token alg pin already prevents a real algorithm-confusion attack — a forged HS256 token using the IdP's RS256 pubkey as HMAC secret is rejected at sig-verify regardless of what the discovery doc advertises. Operators worried about a compromised IdP rotating to weak algs without rotating its certctl provider config get defense-in-depth from JWKSStatus + the alert hooks in the GUI panel.
Exact iss match. ID-token iss claim must equal the configured OIDCProvider.IssuerURL byte-for-byte (sentinel ErrIssuerMismatch). A token from a different IdP - even one with the same aud - cannot ride a misconfigured provider row.
aud + azp checks. Service-layer re-verification of the audience claim (must include client_id) plus the azp claim for multi-aud tokens (per OIDC core §3.1.3.7 step 5; sentinels ErrAudienceMismatch, ErrAZPRequired, ErrAZPMismatch). An attacker with a token issued for a different client cannot replay it against certctl.
at_hash REQUIRED when access_token is present. OIDC core treats at_hash as a "MAY"; certctl tightens to "MUST" (ErrATHashRequired). A substituted access token cannot ride alongside a clean ID token through the verifier.
Single-use state + nonce. Both 32-byte random server-generated values, persisted in the pre-login row keyed by the cookie. The pre-login row is consumed via DELETE...RETURNING on lookup (atomic single-use). subtle.ConstantTimeCompare on both. State replay returns ErrPreLoginNotFound; nonce mismatch returns ErrNonceMismatch.
PKCE-S256 mandatory. RFC 9700 §2.1.1 requires PKCE on auth- code; certctl hard-codes S256 via oauth2.GenerateVerifier + oauth2.S256ChallengeOption. The plain method is not just unsupported - the ErrPKCEPlainRejected sentinel exists so a future regression that surfaces a plain path trips a test.
iat window. Configurable per-provider (default 300s, capped at 600s by the domain validator). Defends against clock-skew attacks where an attacker submits a stale-but-valid token.
JWKS rotation handled transparently by coreos/go-oidc's built- in cache, plus the operator-triggered Service.RefreshKeys for forced refresh (and the auto-refresh on JWKS-cache TTL expiry, default 3600s).
JWKS-fetch failure during a key rotation: fail closed. The service maps go-oidc's network errors to ErrJWKSUnreachable (HTTP 503 to the in-flight login). Existing sessions are untouched. No exponential backoff, no auto-retry; the operator triages.
Encrypted client_secret at rest. AES-256-GCM via internal/crypto.EncryptIfKeySet (the same v3-blob path issuer
- target credentials use). The client_secret_encrypted column is json:"-" on the domain type so a misconfigured handler cannot wire-leak.

Session minting + cookies

Length-prefixed HMAC. Cookie wire format is v1.<session_id>.<signing_key_id>.<base64url-no-pad(HMAC-SHA256)>. HMAC input is length-prefixed as len(sid):sid:len(kid):kid
- NOT bare-concat. The bare-concat form admits a collision attack: <a, bc> and <ab, c> produce identical HMAC inputs, letting a forger swap one byte across the boundary. Pinned by TestComputeHMAC_LengthPrefixDefeatsConcatCollision + TestService_Validate_ConcatenationCollisionDefeatedByLengthPrefix. The v1. version prefix is reserved; unknown prefixes are rejected with no fallback.
Cookie hardening. HttpOnly=true (no JS access; defends XSS cookie theft), Secure=true (HTTPS-only; defends network MITM given HTTPS-Everywhere v2.2 milestone), SameSite=Lax default (configurable to Strict via CERTCTL_SESSION_SAMESITE), Path=/, no domain attribute (host-only).
Idle + absolute timeouts. 1h idle / 8h absolute defaults (configurable via CERTCTL_SESSION_IDLE_TIMEOUT / _ABSOLUTE_TIMEOUT). The session row tracks last_seen_at, idle_expires_at, absolute_expires_at independently; the scheduler's sessionGCLoop (default 1h) sweeps expired rows.
CSRF defense. Plaintext CSRF token in the JS-readable certctl_csrf cookie (intentionally HttpOnly=false so the GUI reads it for the X-CSRF-Token header). SHA-256 hash on the session row. CSRFMiddleware on state-changing methods uses subtle.ConstantTimeCompare against the hash. API-key actors (no session row) are CSRF-exempt - pinned by the bundle-1-compat CI guard.
Optional defense-in-depth IP / UA bind (default OFF; CERTCTL_SESSION_BIND_IP / _BIND_USER_AGENT). Mismatch returns ErrSessionIPMismatch / ErrSessionUAMismatch. Use with care - mobile clients on changing networks fail closed.
Signing-key rotation primitive. RotateSigningKey mints a new HMAC key; the old key stays valid for the configured retention window (default 24h via CERTCTL_SESSION_SIGNING_KEY_RETENTION) so existing cookies validate during the rollover. Past retention, the old key's row is dropped and any cookie still signed under it returns ErrSigningKeyNotFound.
EnsureInitialSigningKey is fail-fatal at server boot. Wired in cmd/server/main.go via logger.Error + os.Exit(1) so a server with a broken DB or RNG cannot boot into a state where session validation is impossible.
Pre-login cookie discriminated from post-login. Pre-login carries the pl- id prefix; post-login carries ses-. Defense- in-depth: Validate rejects pre-login cookies (pinned by TestService_Validate_RejectsPreLoginCookieAtPostLoginGate) so a stolen pre-login cookie cannot be replayed against the post-login gate.

Back-channel logout

OpenID Connect Back-Channel Logout 1.0 (NOT RFC 8414). Endpoint: POST /auth/oidc/back-channel-logout. The IdP signs a logout JWT and POSTs it to certctl when a user logs out at the IdP. The handler validates the JWT against the IdP's JWKS via the same alg allow-list as the login flow.
Required claims pinned. iss / aud / iat / jti / events (with the spec-mandated logout event type); exactly one of sub / sid; nonce MUST be absent (per spec §2.4
- logout tokens MUST NOT carry a nonce). All four pinned by the back-channel-logout negative-test matrix.
jti-based replay defense. The handler tracks recently-seen jti values to defeat logout-token replay attacks where an attacker captures a logout JWT and replays it.
Cache-Control: no-store on the response per spec §2.5.

OIDC first-admin bootstrap

Coexists with the env-var-token bootstrap path. Both can be configured; the admin-existence probe ensures only one wins.
Group-scoped. CERTCTL_BOOTSTRAP_ADMIN_GROUPS is a comma- separated allowlist of IdP group names; users in any one of those groups become admins on FIRST login per tenant. Non-empty intersection with the user's resolved groups is required.
One-shot per tenant via admin-existence probe. Once any actor holds r-admin in the tenant, the bootstrap hook silently falls through to normal mapping (no admin grant). Operators rely on this to avoid an "always-admin-on-login" backdoor.
Explicit OIDC provider gate. CERTCTL_BOOTSTRAP_OIDC_PROVIDER_ID pins which provider's tokens are eligible. A multi-IdP deploy cannot have any provider's group claims become admin.
Audit row on every grant. bootstrap.oidc_first_admin event with event_category=auth + INFO log; the auditor monitors.

Break-glass admin

Default-OFF. CERTCTL_BREAKGLASS_ENABLED=false is the default; the entire surface (4 endpoints) is disabled. Operators flip it on during SSO incidents and back off after recovery.
Surface invisibility via 404-not-403. Every endpoint returns HTTP 404 when disabled - public login AND admin endpoints. A scanner cannot distinguish "endpoint disabled" from "endpoint doesn't exist." All five service-layer methods short-circuit with ErrDisabled before any DB lookup; the handler maps to http.NotFound.
Argon2id with OWASP 2024 params. m=64MiB, t=3, p=4, 16-byte salt, 32-byte output, per-password random salt, PHC-format hash. The hash column is json:"-" so handlers cannot wire-leak.
Lockout state machine. CERTCTL_BREAKGLASS_LOCKOUT_THRESHOLD (default 5) failures within CERTCTL_BREAKGLASS_LOCKOUT_RESET_INTERVAL (default 1h) trip a CERTCTL_BREAKGLASS_LOCKOUT_DURATION lock (default 30s; bumped from 100ms after the test discovered Argon2id verify itself takes ~80-200ms each, making a millisecond-scale lockout invisible). Atomic single-statement IncrementFailure defeats concurrent racing attempts. Idempotent ResetFailureCount.
Constant-time across all failure paths. verifyDummy() runs a real Argon2id pass against an all-zeros throwaway salt on the no-credential and locked-account paths so all three failure modes (wrong password / locked / no actor) take statistically indistinguishable time. Pinned by TestPhase7_5_ConstantTimeAcrossWrongPasswordAndNoCredentialPaths (asserts within 5x ratio on durations).
Audit row + WARN log at boot. auth.breakglass_login_* events with event_category=auth. cmd/server/main.go emits a WARN-level log when ENABLED=true so the operator's log review notices an over-long enablement.
Rate limit on the public login endpoint. 5 attempts/minute via the existing middleware.NewRateLimiter.

OIDC + sessions threat catalogue

The following sub-sections enumerate the threat surface introduced by the OIDC + sessions surface and the mitigations the platform ships. They are deliberately exhaustive - if a threat is listed here it has a concrete mitigation or a documented "operator-driven, out of scope" framing. New threats discovered post-2026-05-10 should be added here with a dated commit note.

OIDC token forgery vectors and mitigations

Vector	Mitigation
Alg confusion (HS256 token signed with the IdP's public key)	Alg allow-list rejects HS256 / HS384 / HS512 / `none`. Service-layer + go-oidc enforce in two layers. IdP-downgrade-attack defense at provider-creation time.
Audience injection (token issued for a different client)	Service-layer `aud` re-check post-go-oidc verify; multi-aud tokens require matching `azp`. Sentinels `ErrAudienceMismatch` / `ErrAZPRequired` / `ErrAZPMismatch`.
Issuer mismatch (token from a different IdP with the same alg + key shape)	Exact `iss` string match (`ErrIssuerMismatch`). The 21-case OIDC negative-test matrix pins the byte-for-byte requirement.
Nonce replay (capturing a fresh token + replaying with the same nonce)	Single-use nonce stored in the pre-login row; `LookupAndConsume` is `DELETE...RETURNING` (atomic). Second use returns `ErrPreLoginNotFound`.
State replay (CSRF on the IdP redirect)	Same single-use mechanism as nonce. State is `subtle.ConstantTimeCompare`d.
`at_hash` substitution (clean ID token with a swapped access token)	`at_hash` REQUIRED when access_token present (certctl tightens OIDC core's MAY → MUST). `ErrATHashRequired` if missing; `ErrATHashMismatch` if non-matching.
`iat` window manipulation (stale token replay)	`iat_window_seconds` configurable per-provider (default 300, cap 600). Future `iat` returns `ErrIATInFuture`; older-than-window returns `ErrIATTooOld`.
JWKS rotation mid-login	coreos/go-oidc's built-in cache + auto-refresh on TTL expiry. Operator-triggered `Service.RefreshKeys` for forced refresh.
JWKS-fetch failure during a key rotation	`ErrJWKSUnreachable` (HTTP 503 to in-flight login). Existing sessions untouched. Operator clicks "Refresh discovery cache" once IdP recovers. No exponential backoff.

Session hijacking vectors and mitigations

Vector	Mitigation
Cookie theft via XSS	`HttpOnly` on the session cookie; CSP headers from the security-hardening middleware prevent inline-script execution.
Cookie theft via network MITM	`Secure` flag + TLS 1.3-only control plane (HTTPS-Everywhere v2.2 milestone).
CSRF on state-changing methods	`SameSite=Lax` default + double-submit-cookie pattern with hashed CSRF token on the session row. CSRFMiddleware fires on POST/PUT/PATCH/DELETE for session-authenticated callers; API-key actors are exempt.
Session-cookie forgery via concatenation collision	Length-prefixed HMAC input (`len(sid):sid:len(kid):kid`). Pinned by two tests + a doc-block at the top of `service.go`.
Stolen-cookie replay (attacker uses a valid cookie until expiry)	Short idle timeout (1h default) + admin-revoke-all-for-actor + back-channel logout from IdP + GUI session revocation.
Cross-tab session interference	Cookie value is opaque + length-prefixed; tabs sharing the cookie share the session row. Sign-out in one tab calls `POST /auth/logout`; the next request from any tab gets a missing-row 401.
Session-row race on sign-out vs in-flight request	`Validate` is the single point that reads the row; missing row = 401. There is no "stale read" path because every request re-validates.

IdP compromise scenarios

A rogue IdP issues malicious tokens (signs tokens for arbitrary users, mints arbitrary groups, etc.). Mitigations are largely out of certctl's control - the trust root is the IdP. Documented behaviors:

Operator should monitor IdP audit logs. Federated identity is only as trustworthy as the IdP it federates from. The iss claim on every certctl audit row points at the source IdP so the operator can correlate against IdP-side audit.
Operator can rotate group-role mappings from the GUI without redeploying. If the IdP is compromised but not yet decommissioned, the operator can dial down access via Auth → OIDC Providers → <provider> → Group → role mappings and remove every mapping. Subsequent logins fail closed (ErrGroupsUnmapped); existing sessions continue until expiry.
The audit trail records every OIDC login including the source provider. Blast radius is bounded by the group_role_mapping table for that provider. A compromised provider configured with only engineers → r-operator cannot escalate to r-admin via any token forgery.
The provider-delete path returns 409 when sessions exist for it. ErrOIDCProviderInUse forces the operator to revoke the provider's active sessions before deletion - prevents accidental loss of audit lineage on a hot incident.

Back-channel logout failure modes

Mode	Behavior	Mitigation
IdP unreachable	certctl never receives the logout signal; sessions persist until idle/absolute timeout (1h/8h defaults).	Operator keeps absolute timeout short relative to risk tolerance. Manual revoke via GUI is always available.
Logout token signature invalid	certctl returns 400; no session revoked; `auth.oidc_back_channel_logout_failed` audit row.	Operator-monitored audit row surfaces forged-logout-token attempts.
Logout token replay (attacker captures + replays a valid logout JWT)	`jti`-based deduplication rejects the replay; first delivery succeeds, second returns 400.	Pinned by back-channel-logout negative tests.
Logout token alg confusion	Same alg allow-list as the login flow; HS-family rejected.	The OIDC alg allow-list applies to BCL too (same `Provider.RemoteKeySet`).
Missing `events` claim	Spec §2.4 requires the OIDC-defined logout event type; missing returns 400.	Pinned by negative test.
`nonce` claim present	Spec §2.4 requires `nonce` MUST NOT appear in logout tokens; presence returns 400.	Pinned by negative test.

Group-claim manipulation

Per-IdP group-claim shapes are documented in oidc-runbooks/index.md. Manipulation threats:

Vector	Mitigation
Operator misconfigures mapping (e.g. `engineers → r-admin` instead of `r-operator`)	`auth.group_mapping_added` / `_removed` audit row with `event_category=auth`. The auditor role monitors.
Operator misconfigures `groups_claim_path` (e.g. `groups` when Auth0 emits `https://your-namespace/groups`)	User's group claim is ignored, user lands at "no roles assigned" screen. The GUI's OIDC provider detail page surfaces the configured path so the operator can verify.
IdP renames a group (e.g. `engineers → eng-team`)	Mappings silently break; users get fewer roles than expected. `auth.oidc_login_unmapped_groups` audit row fires on every such login; auditor monitors for unexpected spikes.
IdP user maintainer adds a user to an unintended group	Group is mapped to a higher-privilege role than intended; user gets the role on next login. Bounded blast radius: the group→role mapping is what they got, not arbitrary admin. Defense-in-depth: review mappings periodically; the auditor role can pull `auth.oidc_login_succeeded` rows by `details.subject` to spot drift.

Bootstrap phase risks

This section extends the day-0 bootstrap section with the OIDC first-admin path.

Vector	Mitigation
`CERTCTL_BOOTSTRAP_TOKEN` (env-var fallback path) leaks	One-shot via `consumed` bool + admin-existence probe. Both arms close the path the moment any admin lands.
`CERTCTL_BOOTSTRAP_ADMIN_GROUPS` misconfigured to a wide group (e.g. `everyone`)	Unintended user becomes admin on first OIDC login. Mitigation: scope-down via `certctl-cli auth keys scope-down --suggest`. Operators configure narrow groups. The audit row on `bootstrap.oidc_first_admin` surfaces every grant.
Both bootstrap strategies enabled simultaneously	Whichever fires first wins; the second sees admin-already-exists and falls through to normal mapping. No double-admin landing.
`CERTCTL_BOOTSTRAP_OIDC_PROVIDER_ID` left unset with multi-IdP deploy	Hook fires on ANY provider's tokens. Mitigation: explicit gate documented in `cmd/server/main.go` startup logging; operator audit reviewed pre-tag.

Break-glass risks

Vector	Mitigation
Phished password (operator gives password to attacker)	Bypasses OIDC + every group-claim gate. Mitigation: default-OFF posture; lockout after 5 failures; WebAuthn pairing (v3 / Decision 12) closes the gap properly.
Brute-force online	Lockout state machine + 5/min rate limit on `/auth/breakglass/login`.
Brute-force offline (DB compromise)	Argon2id with OWASP 2024 params (~80-200ms per verify). Cracking remains expensive even with GPU.
Operator forgets to disable post-incident	Break-glass becomes a permanent backdoor. Mitigation: WARN log at boot when ENABLED=true; audit row on every break-glass login; runbook prescribes "disable within 24h of SSO recovery."
Side-channel timing on no-credential vs wrong-password vs locked	All three paths take statistically indistinguishable time via `verifyDummy()`. Pinned by the timing-statistical test.
Surface fingerprinting (scanner identifies break-glass exists)	All four endpoints return 404 (NOT 403) when disabled. Surface-invisibility - identical to a non-existent route.
Reserved-actor `actor-demo-anon` mutation via break-glass admin	Service layer rejects with `ErrAuthReservedActor` (HTTP 409). Same gate as the RBAC path.

Token-leak hygiene (the explicit grep policy)

ID tokens, access tokens, refresh tokens, authorization codes, PKCE verifiers, state, nonce, signing keys, break-glass passwords MUST NEVER appear in any log line at any level.

The invariant is enforced by per-package logging_test.go files that redirect slog.Default to a buffer, run the service paths, and grep-assert the secret values are absent from every captured line. The pattern is internal/auth/bootstrap/service_test.go; the OIDC, session, and break-glass packages follow the same shape:

internal/auth/oidc/logging_test.go - token / code / verifier / state / nonce / cookie / client_secret / alg name absent from HandleAuthRequest, HandleCallback, alg-rejection, and provider- load paths.
internal/auth/session/service_test.go - signing-key bytes absent from cookie-mint + validate paths.
internal/auth/breakglass/service_test.go - plaintext password + Argon2id hash absent from every audit row + log line + HTTP-response shape (json:"-" probe via json.Marshal).

The details JSONB column on audit_events runs through the audit redactor (internal/service/audit_redact.go) before persistence; the redactor's allow-list is conservative enough that adding a new token-shaped field to a new audit row defaults to redacted, not leaked.

Closed federated-identity threats

Each item below was an open threat under the earlier API-key-only deployment posture. Status reflects current closure as of v2.1.0.

OIDC federation - ✅ closed. SAML and WebAuthn remain on the future-work list (Decision 12 — WebAuthn pairs with break-glass for hardware-token MFA). The break-glass path is a partial mitigation for the no-MFA case during SSO incidents.
Session management - ✅ closed. HMAC-signed __Host-certctl_session cookie with length-prefixed wire format, 1h idle / 8h absolute expiry, scheduler-driven GC, server-side revocation list (delete the row), GUI's "Sessions" page surfaces own + all-actor revocation, back-channel logout from the IdP.
Local password accounts (break-glass) - ✅ closed. Argon2id
- lockout + default-OFF + 404-not-403 surface invisibility. NOT for general human auth - only the "SSO is broken, need admin access right now" path. WebAuthn pairing on the future-work list.
OIDC first-admin bootstrap - ✅ closed. CERTCTL_BOOTSTRAP_ADMIN_GROUPS + CERTCTL_BOOTSTRAP_OIDC_PROVIDER_ID env vars + group-scoped + admin-existence-probe.
Rate limiting on the bootstrap endpoint - acceptable (one-shot by construction; per-IP rate limiting on the broader API is in place via middleware.NewRateLimiter). The break-glass /auth/breakglass/login endpoint carries the same rate-limit primitive at 5/min.

Future-work threats

The following are not yet closed:

WebAuthn / FIDO2 second factor - operator console is OIDC (or break-glass password) only. No hardware-token requirement even on the admin path. Decision 12.
Time-bound role grants / JIT elevation - the actor_roles.expires_at column exists, no UI/API yet.
SAML federation - OIDC only. Operators on SAML-only IdPs use the broker pattern (run Keycloak as a SAML-to-OIDC bridge); see the Google Workspace runbook for the same broker shape.
Multi-tenant data isolation activation - the schema and repository layer carry tenant_id columns + a query-coverage CI guard, but tenant ACLs are not enforced. v2.1.0 ships single-tenant only (t-default seeded). The managed-service hosting work (operator decision item) is where multi-tenant flips on.
HSM / FIPS-validated signing key for sessions - the session signing key is software-only (HMAC-SHA256, in-memory key material, encrypted at rest via internal/crypto). Operators in FIPS 140-3 environments need to supply their own Signer implementation; the abstraction at internal/crypto/signer/ accommodates this but no PKCS#11 driver ships yet.
OIDC RP-initiated logout (the "/end_session_endpoint" flow where certctl signs a logout token + redirects the browser to the IdP). v2.1.0 implements ONLY the back-channel flow (IdP → certctl). Operators wanting the full bidirectional logout pair wait on a follow-on release.
GUI E2E via Playwright - tracked alongside #9 above.
Per-IdP runbook external-tester sign-off - encouraged via the operator-sign-off footers in oidc-runbooks/*.md but NOT a merge gate (operator decision 2026-05-10; the earlier "≥ 2 external testers" requirement was retired).

Compliance mapping

The control set in this document supports the following framework requirements. This is a mapping; it is not a claim of formal certification.

SOC 2 CC6.1 (logical access controls) - RBAC primitive with role-based gating on every mutating endpoint.
SOC 2 CC6.3 (privileged access management) - r-admin role separation + role-grant audit trail with two-person integrity on approval-tier profile edits.
HIPAA §164.312(b) (audit controls) - event_category column lets the auditor role review authentication / authorization changes specifically. WORM trigger keeps the audit table append-only at the database layer.
NIST SSDF PO.5.2 (separation of duties) - two-person integrity for compliance-tier issuance via the RequiresApproval flow + the approval-bypass closure on profile edits.
FedRAMP AU-9 (audit information protection) - WORM enforcement + auditor-only read access (the auditor role cannot mutate, the WORM trigger blocks UPDATE/DELETE).
PCI-DSS §10 (audit logging) - every mutating operation emits an audit row with actor + action + resource + timestamp + category. The audit table is append-only.

Operator-facing checks

Run these periodically to verify the controls are working.

certctl-cli auth keys list - confirm no unexpected actor holds r-admin. Audit any new admin grants against the audit log.
SELECT actor, action, COUNT(*) FROM audit_events WHERE action LIKE 'approval_%' AND timestamp > NOW() - INTERVAL '7 days' GROUP BY actor, action; - confirm approvals are happening and not concentrated in a single approver.
SELECT COUNT(*) FROM audit_events WHERE actor = 'system-bypass'; - MUST return 0 in production. A non-zero count means CERTCTL_APPROVAL_BYPASS=true was set; production deploys MUST leave it unset.
SELECT actor, COUNT(*) FROM audit_events WHERE action = 'bootstrap.consume'; - MUST return at most one row per tenant. Multiple rows means the bootstrap endpoint was called more than once, which the strategy's one-shot guard should have prevented; investigate.
certctl-cli auth me while authenticated as the auditor key - effective_permissions must contain audit.read + audit.export ONLY. Any other permission means a role grant widened the auditor's surface; revoke immediately.

The following checks were added with v2.1.0's federated-identity surface:

SELECT COUNT(*) FROM oidc_providers; - confirm only the expected providers are configured. An unexpected row is a compromise indicator. Cross-check with the auth.oidc_provider_created audit row to find when + by whom.
SELECT actor_id, COUNT(*) FROM sessions WHERE NOT revoked AND absolute_expires_at > NOW() GROUP BY actor_id ORDER BY 2 DESC;
- confirm no actor has an unexpectedly large session count. Multi-session-per-actor is normal (laptop + phone), but a single actor with 50+ active sessions is a compromised-key signal.
SELECT COUNT(*) FROM audit_events WHERE action LIKE 'auth.oidc_login_unmapped_groups' AND timestamp > NOW() - INTERVAL '7 days'; - non-zero rows mean users are completing IdP authentication but failing the group-mapping step. Either the IdP renamed a group, or an unauthorized user attempted access. Investigate.
SELECT COUNT(*) FROM audit_events WHERE action LIKE 'auth.breakglass_%' AND timestamp > NOW() - INTERVAL '7 days';
- non-zero rows in steady state mean break-glass is being used outside an SSO incident OR was left enabled. Confirm CERTCTL_BREAKGLASS_ENABLED is false in non-incident windows.
SELECT COUNT(*) FROM audit_events WHERE action = 'bootstrap.oidc_first_admin'; - MUST return at most one row per tenant. Multiple rows means the OIDC bootstrap hook fired more than once per tenant, which the admin-existence probe should have prevented; investigate.
SELECT COUNT(*) FROM session_signing_keys WHERE retired_at IS NOT NULL AND retired_at < NOW() - INTERVAL '7 days'; - retired keys past the retention window should have been GC'd. Non-zero rows mean the scheduler's sessionGCLoop is wedged.

Cross-references

API-key + RBAC anchors:

rbac.md - the operator how-to
security.md - the wider security posture
approval-workflow.md - the two-person integrity gate
docs/migration/api-keys-to-rbac.md - upgrade flow
internal/auth/ - middleware + keystore + RequirePermission + bootstrap
internal/service/auth/ - Authorizer + privilege-escalation guard + reserved-actor guard
migrations/000029_rbac.up.sql - schema + seed
migrations/000030_rbac_admin_perms.up.sql - five admin-only fine-grained perms
migrations/000032_audit_category.up.sql - auditor surface
migrations/000033_approval_kinds.up.sql - approval-bypass closure

OIDC + sessions + back-channel logout + break-glass anchors:

oidc-runbooks/index.md - per-IdP setup guides (Keycloak / Authentik / Okta / Auth0 / Entra ID / Google Workspace) with cross-IdP recurring concepts at the top
internal/auth/oidc/ - OIDC service (HandleAuthRequest / HandleCallback / RefreshKeys), hand-rolled groupclaim resolver, alg allow-list, IdP downgrade-attack defense
internal/auth/session/ - session service (length-prefixed HMAC, cookie minting, idle/absolute expiry, signing-key rotation, GC), CSRF middleware, chained-auth combinator
internal/auth/breakglass/ - default-OFF break-glass admin (Argon2id + lockout + constant-time + surface-invisibility)
internal/auth/oidc/testfixtures/ - Keycloak testcontainers harness (//go:build integration)
migrations/000034_oidc_providers.up.sql - OIDC providers + group-role mappings tables
migrations/000035_sessions.up.sql - sessions + session-signing- keys tables
migrations/000036_users.up.sql - users (federated-human identity) table
migrations/000037_oidc_pre_login.up.sql - pre-login table + 7 new auth permissions
migrations/000038_breakglass_credentials.up.sql - break-glass credentials table + 2 new permissions
scripts/ci-guards/N-bundle-2-security-empty-preserved.sh - OpenAPI security: [] count guard
scripts/ci-guards/bundle-1-compat-regression.sh - API-key-only compat assertions (5 invariants)
scripts/ci-guards/bundle-1-to-2-upgrade-regression.sh - OIDC-upgrade-path assertions (6 invariants)

39 KiB Raw Blame History