shankar0123 a923cf697c harden(auth): demo-mode residual-grants detector + cleanup endpoint + CI guard (A-8)
Audit 2026-05-11 A-8 closure. Closes the deferred Phase 2 leg of the
2026-05-10 HIGH-12 closure (2e97cc1) — production-startup observability
for actor-demo-anon residual grants + CI guard banning new synthetic-
admin code paths.

What this changes:

* cmd/server/preflight_demo_residual.go (new) runs after the DB pool +
  audit service are constructed and before the HTTPS listener starts.
  Under any non-'none' auth type it queries actor_roles for the
  synthetic actor-demo-anon and emits a WARN log + a categorized audit
  row (auth.demo_residual_grants_detected) listing every grant
  present. Migration 000029 unconditionally seeds the ar-demo-anon-admin
  row at install time, so EVERY production deploy will see this WARN
  on first boot; the intended cutover workflow is cleanup-once at
  production handover.

* CERTCTL_DEMO_MODE_RESIDUAL_STRICT (new env var on AuthConfig,
  default false) pivots the WARN to fail-closed startup refusal for
  operators who want a paranoid posture against re-seeding.

* POST /api/v1/auth/demo-residual/cleanup (new handler at
  internal/api/handler/demo_residual.go) is an admin-class
  (auth.role.assign) endpoint that removes every actor-demo-anon row
  from actor_roles and returns {removed: int64}. Idempotent; refuses
  503 under Auth.Type=none (deleting the row would break the demo
  path); audit-logs every invocation including no-op zero-removed
  calls so the admin's action is always recorded.

* scripts/ci-guards/no-new-synthetic-admin.sh pins the 17-entry
  allowlist of source files that legitimately reference the
  actor-demo-anon literal. New runtime code paths that resolve to the
  synthetic actor (the same pattern that produced the original CRIT
  class) are rejected at PR time. CI workflow auto-picks the script
  via the existing scripts/ci-guards/*.sh loop in .github/workflows/
  ci.yml; no workflow edit needed.

Regression matrix:

* cmd/server/preflight_demo_residual_test.go — 7 tests covering the
  4 main behaviour branches (testcontainers-backed, testing.Short()-
  skipped: DemoModeActive_Skips, NoResidue_Passes, HasResidue_LogsAnd
  Audits, StrictMode_RefusesStartup, DeleteDemoAnonResidue_Idempotent)
  plus 3 pure-Go stdlib unit tests for the row-string formatter +
  nil-safety contracts on both helpers.

* internal/api/handler/demo_residual_test.go — 7 stdlib+httptest
  cases: HappyPath, Idempotent_ReturnsZero, RejectsInDemoMode (503),
  CleanupError_Surfaces500, NilCleanupFn (defensive 500),
  NilAuditWriter_DoesNotPanic, MissingActorContext (falls back to
  'unknown' actor in the audit row).

* internal/api/router/openapi_parity_test.go — new
  POST /api/v1/auth/demo-residual/cleanup entry plus 6 pre-existing
  pre-A-8 entries (oidc/test, jwks-status, users CRUD, runtime-config)
  that had drifted out of SpecParityExceptions; the parity test was
  red on dev/auth-bundle-2 before my work; this commit returns it to
  green with full per-entry justifications + parity-debt notes.

Docs:

* docs/operator/security.md — new 'Demo-to-production cutover (Audit
  2026-05-11 A-8)' section explaining the WARN message, the cleanup
  curl one-liner, the equivalent SQL, the strict-mode env var, and
  the CI guard.

* docs/operator/rbac.md — Last-reviewed bump + pointer to the new
  env var + the security.md section.

* cowork/auth-bundles-audit-2026-05-10.md — HIGH-12 row gains an
  'A-8 follow-on CLOSED 2026-05-11' annotation describing the
  deferred Phase 2 leg now landed.

* CHANGELOG.md — Unreleased ### Security entry summarizing the four
  legs (detector + cleanup + strict-mode flag + CI guard) and the
  acquisition-readiness narrative this closes.

Operator-facing impact: this closes a credibility gap, not an
exploitable vulnerability. The residue requires a regression
elsewhere in the middleware chain to be exploitable. After this
fix, the canonical narrative ('RBAC primitive with no synthetic-
admin fallback') is fully true.

Refs cowork/auth-bundles-fixes-2026-05-11/08-high-demo-mode-residual-
cleanup.md.
2026-05-11 11:45:54 +00:00

certctl Security Posture & Operator Guidance

Last reviewed: 2026-05-11

This document collects the operator-facing security guidance that the source code's per-finding comment blocks reference. Each section names the audit finding it closes, the threat model, and the operator action required (if any).

OCSP responder availability

Audit reference: Bundle C / M-020. CWE-770 (allocation of resources without limits or throttling); RFC 6960 (OCSP); RFC 7633 (Must-Staple).

certctl ships an OCSP responder at /.well-known/pki/ocsp/{issuer_id}/{serial} that signs a fresh response per request. Pre-Bundle-C the unauth handler chain had no rate limit, so an attacker could DoS the responder and force fail-open relying parties to accept revoked certificates as valid. Bundle C adds the same per-key rate limiter to the unauth chain that the authenticated chain has used since Bundle B. Per-IP keying applies because OCSP traffic is unauthenticated.

The rate limiter alone does not solve the underlying revocation-bypass risk. The architectural fix is for issued certificates to carry the OCSP Must-Staple TLS Feature extension (RFC 7633, OID 1.3.6.1.5.5.7.1.24). When present, conforming TLS clients refuse to negotiate a session unless the server staples a fresh signed OCSP response in the TLS handshake. This shifts revocation enforcement from the client's discretion (which most fail-open by default) to a hard requirement that the connection cannot complete without proof of non-revocation.
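To check whether an issued certificate actually carries the extension, an operator can inspect it for the TLS Feature OID. The following is a minimal sketch using only the Go standard library; `hasMustStaple` and `selfSigned` are hypothetical helper names for illustration, not part of certctl. The throwaway self-signed certificate exists only so the example runs on its own.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/asn1"
	"fmt"
	"math/big"
	"time"
)

// OID of the TLS Feature extension (RFC 7633): 1.3.6.1.5.5.7.1.24.
var oidTLSFeature = asn1.ObjectIdentifier{1, 3, 6, 1, 5, 5, 7, 1, 24}

// hasMustStaple reports whether cert carries a TLS Feature extension that
// lists status_request (feature value 5), which is what "Must-Staple" means.
func hasMustStaple(cert *x509.Certificate) bool {
	for _, ext := range cert.Extensions {
		if !ext.Id.Equal(oidTLSFeature) {
			continue
		}
		var features []int // the extension value is a SEQUENCE OF INTEGER
		if _, err := asn1.Unmarshal(ext.Value, &features); err != nil {
			return false
		}
		for _, f := range features {
			if f == 5 {
				return true
			}
		}
	}
	return false
}

// selfSigned builds a throwaway certificate for the demo (hypothetical helper).
func selfSigned(mustStaple bool) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "demo.example"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
	}
	if mustStaple {
		value, err := asn1.Marshal([]int{5})
		if err != nil {
			return nil, err
		}
		tmpl.ExtraExtensions = []pkix.Extension{{Id: oidTLSFeature, Value: value}}
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	for _, want := range []bool{true, false} {
		cert, err := selfSigned(want)
		if err != nil {
			panic(err)
		}
		fmt.Printf("issued with must-staple=%v, detected=%v\n", want, hasMustStaple(cert))
	}
}
```

In practice the certificate would be parsed from the PEM file deployed to the target rather than generated in-process.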

Operator action

For certificates issued to systems where revocation correctness matters:

  1. Configure the issuer profile to set must-staple: true. Out-of-the-box profiles in migrations/seed.sql do not set this; operators add it at profile-creation time via the API or by editing seed data.
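  2. Confirm the relying party honors the extension. Firefox enforces Must-Staple; Chrome does not, and support in library-based clients (OpenSSL and others) depends on the version and on how the application is written, so test the specific relying party. Clients without support silently ignore the extension.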
  3. Confirm the deployment target is configured for OCSP stapling so the server can actually deliver the stapled response in the handshake.
  • nginx: ssl_stapling on; ssl_stapling_verify on;
  • Apache: SSLUseStapling on
  • HAProxy: set ssl ocsp-response /path/to/response.der
  • Envoy: ocsp_staple_policy: MUST_STAPLE

What this does NOT cover

  • CRL fallback. Must-Staple does not affect CRL behavior. Operators with CRL-based relying parties should use the rate-limit + caching defense alone; there is no client-side equivalent to Must-Staple for CRLs.
  • Self-issued certs in air-gapped networks. When the relying party cannot reach the OCSP responder at all (the threat model the audit cited), Must-Staple is the only mechanism that closes the bypass. CRL distribution similarly requires the relying party to fetch the CRL, which is also subject to the same network-availability concern.

Postgres transport encryption

See docs/database-tls.md. Bundle B / M-018.

Encryption at rest

Bundle B / M-001. PBKDF2-SHA256 at 600,000 rounds (OWASP 2024 Password Storage Cheat Sheet floor) for the operator-supplied passphrase that derives the AES-256-GCM key for sensitive config columns. v3 blob format with a per-ciphertext random salt; v1/v2 read fallback for legacy rows. See internal/crypto/encryption.go and the accompanying tests for the format spec.

Authentication surface

Bundle B / M-002. Two layers decide auth-exempt status:

  1. Router layer (internal/api/router/router.go::AuthExemptRouterRoutes): the endpoints registered via direct r.mux.Handle without going through the middleware chain (/health, /ready, /api/v1/auth/info, /api/v1/version, plus /api/v1/auth/bootstrap GET + POST per Bundle 1 Phase 6).
  2. Dispatch layer (internal/api/router/router.go::AuthExemptDispatchPrefixes): URL-prefix routing in cmd/server/main.go::buildFinalHandler for /.well-known/pki/*, /.well-known/est/*, /.well-known/est-mtls, and /scep[/...]* (incl. /scep-mtls).

Both lists have AST-walking regression tests (auth_exempt_test.go) that fail CI if a new bypass lands without updating the documented constant.

RBAC primitive (Bundle 1)

Bundle 1 ships role-based authorization on top of API-key authentication. Every gated handler routes through the auth.RequirePermission middleware (or its router-level wrap rbacGate); the middleware resolves the actor's effective permissions via the service-layer Authorizer.CheckPermission and returns HTTP 403 BEFORE the handler body runs on miss. The seven default roles (admin / operator / viewer / agent / mcp / cli / auditor), 33-permission canonical catalogue, and the auditor split (r-auditor holds only audit.read + audit.export) are seeded by migration 000029.

For the operator how-to, see rbac.md. For the threat model + compliance mapping, see auth-threat-model.md. For the upgrade flow from a pre-Bundle-1 deployment, see docs/migration/api-keys-to-rbac.md.

Day-0 admin bootstrap (Bundle 1 Phase 6)

Fresh deployments where no admin actor exists yet can mint the first admin via POST /api/v1/auth/bootstrap - set CERTCTL_BOOTSTRAP_TOKEN, POST a single curl with the token, and the server returns the plaintext key value once. The token is constant-time-compared; the strategy is one-shot via mutex; the admin-existence probe re-closes the path once an admin lands. The token is NEVER logged. The minted plaintext key flows only into the HTTP response body. See rbac.md for the full flow.
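The three guards described above (constant-time token comparison, one-shot use behind a mutex, and the admin-existence probe) can be sketched as follows. This is an illustrative reduction, not the certctl implementation: `bootstrapper`, `mint`, and the `adminExists` callback are hypothetical names, and hashing both tokens before comparison is an assumption made here so the comparison always runs over equal-length inputs.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

var errBootstrapClosed = errors.New("bootstrap unavailable")

// bootstrapper is a hypothetical holder for the one-shot state.
type bootstrapper struct {
	mu          sync.Mutex
	token       string      // value of the bootstrap token env var
	used        bool        // flips to true after the first successful mint
	adminExists func() bool // probe against the role store
}

// mint returns a fresh plaintext key exactly once, and only if the
// presented token matches and no admin exists yet.
func (b *bootstrapper) mint(presented string) (string, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.token == "" || b.used || b.adminExists() {
		return "", errBootstrapClosed
	}
	// Hash both sides so ConstantTimeCompare sees equal-length inputs.
	want := sha256.Sum256([]byte(b.token))
	got := sha256.Sum256([]byte(presented))
	if subtle.ConstantTimeCompare(want[:], got[:]) != 1 {
		return "", errBootstrapClosed
	}
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return "", err
	}
	b.used = true
	return hex.EncodeToString(raw), nil // returned to the caller only, never logged
}

func main() {
	b := &bootstrapper{token: "example-token", adminExists: func() bool { return false }}
	_, err := b.mint("wrong-token")
	fmt.Println("wrong token:", err)
	key, err := b.mint("example-token")
	fmt.Println("first mint: key length", len(key), "err", err)
	_, err = b.mint("example-token")
	fmt.Println("second mint:", err)
}
```

Every failure path returns the same error, so a caller cannot tell a wrong token from an already-used one.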

Approval-bypass closure (Bundle 1 Phase 9)

CertificateProfile.RequiresApproval=true profiles route both issuance/renewal AND profile edits through the ApprovalService two-person integrity gate (Phase 9 closes the flip-flop loophole where an admin could disable approval, mutate, re-enable). Same-actor self-approve is rejected at the service layer with ErrApproveBySameActor. See docs/reference/profiles.md for the full gate semantics.

OIDC federation (Bundle 2 Phases 1-7)

Bundle 2 adds OIDC SSO on top of the API-key + RBAC foundation. Operators configure one or more identity providers (Keycloak, Authentik, Okta, Auth0, Entra ID, or Google Workspace via Keycloak broker); end users sign in at the IdP, certctl validates the returned ID token, and a session cookie is minted.

The token-validation pipeline pins:

  • Algorithm allow-list: RS256 / RS512 / ES256 / ES384 / EdDSA only. HS256 / HS384 / HS512 / none are rejected at the service-layer sentinel level.
  • IdP-downgrade-attack defense at provider creation AND every RefreshKeys: the IdP's advertised id_token_signing_alg_values_supported is intersected with the allow-list; a provider that advertises HS-family is rejected before any token is signed under the weak alg.
  • Exact iss match (ErrIssuerMismatch).
  • aud membership + azp for multi-aud tokens (per OIDC core §3.1.3.7 step 5).
  • at_hash REQUIRED-when-access_token-present (Phase 3 tightening of the spec MAY → MUST so a substituted access token cannot ride alongside a clean ID token).
  • Single-use state + nonce (32-byte random server-generated; atomic DELETE...RETURNING on consume).
  • PKCE-S256 mandatory; plain rejected.
  • Configurable iat window (default 300s, capped 600s).
  • JWKS cache with operator-triggered RefreshKeys + auto-refresh on TTL expiry (default 3600s); JWKS-fetch failure during a key rotation returns 503 to the in-flight login (existing sessions untouched).
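The downgrade defense in the second bullet amounts to filtering the IdP's advertised algorithms against the allow-list and refusing the provider outright if a weak algorithm appears. A minimal sketch follows; `usableAlgs` is a hypothetical name, treating `none` the same as the HS family at provider level is an assumption, and so is the choice to silently drop other unlisted algorithms (PS256 here) rather than reject the provider.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Allow-list from the bullet above.
var allowedAlgs = map[string]bool{
	"RS256": true, "RS512": true, "ES256": true, "ES384": true, "EdDSA": true,
}

// usableAlgs intersects the advertised id_token_signing_alg_values_supported
// with the allow-list. Advertising an HS-family algorithm (or "none", by
// assumption) rejects the provider as a whole.
func usableAlgs(advertised []string) ([]string, error) {
	var usable []string
	for _, alg := range advertised {
		if strings.HasPrefix(alg, "HS") || alg == "none" {
			return nil, fmt.Errorf("provider advertises rejected algorithm %q", alg)
		}
		if allowedAlgs[alg] {
			usable = append(usable, alg)
		}
	}
	if len(usable) == 0 {
		return nil, errors.New("provider advertises no allowed algorithm")
	}
	return usable, nil
}

func main() {
	fmt.Println(usableAlgs([]string{"RS256", "PS256", "ES256"}))
	fmt.Println(usableAlgs([]string{"RS256", "HS256"}))
}
```

Running the same check on every key refresh, as the bullet describes, catches an IdP that starts advertising a weak algorithm after it was registered.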

OIDC client_secret is encrypted at rest via AES-256-GCM (v3 blob format: magic 0x03 + salt(16) + nonce(12) + ciphertext+tag) using the CERTCTL_CONFIG_ENCRYPTION_KEY passphrase. The encryption invariant is pinned by an integration test (internal/repository/postgres/oidc_encryption_invariant_test.go) that asserts ciphertext != plaintext + correct blob shape + round-trip recovery + wrong-passphrase fails.
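The blob layout above can be illustrated with a short standard-library sketch. This is a reconstruction from the layout stated in this document, not the code in internal/crypto/encryption.go: the function names are hypothetical, the PBKDF2 round count is passed in so tests can use a small value, and the absence of additional authenticated data is an assumption. PBKDF2 is written out by hand for the single-block case (32-byte key from HMAC-SHA256) to keep the example free of external modules.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"errors"
	"fmt"
)

const (
	magicV3  = 0x03
	saltLen  = 16
	nonceLen = 12
	tagLen   = 16
)

// deriveKey is single-block PBKDF2-HMAC-SHA256, yielding a 32-byte key.
func deriveKey(passphrase string, salt []byte, rounds int) []byte {
	prf := hmac.New(sha256.New, []byte(passphrase))
	prf.Write(salt)
	prf.Write([]byte{0, 0, 0, 1}) // block index 1, big-endian
	u := prf.Sum(nil)
	key := append([]byte(nil), u...)
	for i := 1; i < rounds; i++ {
		prf.Reset()
		prf.Write(u)
		u = prf.Sum(u[:0])
		for j := range key {
			key[j] ^= u[j]
		}
	}
	return key
}

func newAEAD(passphrase string, salt []byte, rounds int) (cipher.AEAD, error) {
	block, err := aes.NewCipher(deriveKey(passphrase, salt, rounds))
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encryptV3 emits magic || salt(16) || nonce(12) || ciphertext+tag.
func encryptV3(passphrase string, plaintext []byte, rounds int) ([]byte, error) {
	header := make([]byte, 1+saltLen+nonceLen)
	header[0] = magicV3
	if _, err := rand.Read(header[1:]); err != nil { // fresh salt and nonce per blob
		return nil, err
	}
	aead, err := newAEAD(passphrase, header[1:1+saltLen], rounds)
	if err != nil {
		return nil, err
	}
	return aead.Seal(header, header[1+saltLen:], plaintext, nil), nil
}

// decryptV3 reverses encryptV3 and fails on a wrong passphrase or tampering.
func decryptV3(passphrase string, blob []byte, rounds int) ([]byte, error) {
	if len(blob) < 1+saltLen+nonceLen+tagLen || blob[0] != magicV3 {
		return nil, errors.New("not a v3 blob")
	}
	aead, err := newAEAD(passphrase, blob[1:1+saltLen], rounds)
	if err != nil {
		return nil, err
	}
	return aead.Open(nil, blob[1+saltLen:1+saltLen+nonceLen], blob[1+saltLen+nonceLen:], nil)
}

func main() {
	const rounds = 600_000 // round count stated in "Encryption at rest"
	blob, err := encryptV3("operator passphrase", []byte("client-secret"), rounds)
	if err != nil {
		panic(err)
	}
	plain, err := decryptV3("operator passphrase", blob, rounds)
	fmt.Printf("blob bytes=%d magic=0x%02x recovered=%q err=%v\n", len(blob), blob[0], plain, err)
}
```

Because the salt is random per ciphertext, two encryptions of the same secret under the same passphrase produce different keys and different blobs.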

Per-IdP setup guides at oidc-runbooks/index.md cover Keycloak, Authentik, Okta, Auth0, Entra ID, and Google Workspace.

Sessions + back-channel logout (Bundle 2 Phases 4-6)

Successful OIDC login mints a session cookie: v1.<session_id>.<signing_key_id>.<base64url-no-pad(HMAC-SHA256)>. The HMAC input is length-prefixed as len:sid:len:kid to defeat concatenation-collision attacks on bare-concat designs. Cookie attributes:

  • HttpOnly=true (no JS access; defends XSS cookie theft).
  • Secure=true (HTTPS-only; defends network MITM).
  • SameSite=Lax default (configurable to Strict via CERTCTL_SESSION_SAMESITE).
  • Path=/, host-only.

Idle timeout default 1h; absolute timeout default 8h; both configurable via CERTCTL_SESSION_IDLE_TIMEOUT and CERTCTL_SESSION_ABSOLUTE_TIMEOUT. The scheduler's sessionGCLoop (default 1h interval) sweeps expired rows.

CSRF defense: plaintext CSRF token in the JS-readable certctl_csrf cookie (intentionally HttpOnly=false for the GUI to echo into the X-CSRF-Token header); SHA-256 hash on the session row; subtle.ConstantTimeCompare in CSRFMiddleware. API-key actors are CSRF-exempt (no session row in context).
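The hash-and-compare step can be sketched in a few lines. `newCSRFToken` and `csrfValid` are hypothetical names, and the 32-byte token size and base64url encoding are assumptions; the part taken from the text is that only the SHA-256 hash is stored server-side and that the comparison uses `subtle.ConstantTimeCompare`.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"fmt"
)

// newCSRFToken returns the plaintext token (for the JS-readable cookie) and
// its SHA-256 hash (for the session row).
func newCSRFToken() (string, [32]byte, error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return "", [32]byte{}, err
	}
	token := base64.RawURLEncoding.EncodeToString(raw)
	return token, sha256.Sum256([]byte(token)), nil
}

// csrfValid hashes the X-CSRF-Token header value and compares it to the
// stored hash in constant time.
func csrfValid(headerToken string, storedHash [32]byte) bool {
	sum := sha256.Sum256([]byte(headerToken))
	return subtle.ConstantTimeCompare(sum[:], storedHash[:]) == 1
}

func main() {
	token, hash, err := newCSRFToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("echoed token accepted:", csrfValid(token, hash))
	fmt.Println("forged token accepted:", csrfValid("forged", hash))
}
```

Storing only the hash means a read-only leak of the session table does not hand an attacker usable CSRF tokens.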

Session signing keys rotate via RotateSigningKey; the old key stays valid for CERTCTL_SESSION_SIGNING_KEY_RETENTION (default 24h) so existing cookies validate during rollover. Past retention, the old key's row is dropped and any cookie still signed under it returns ErrSigningKeyNotFound. EnsureInitialSigningKey is fail-fatal at server boot.

Back-channel logout per OpenID Connect Back-Channel Logout 1.0 (NOT RFC 8414): POST /auth/oidc/back-channel-logout accepts a JWT-signed logout token from the IdP, validates the JWT against the IdP's JWKS (same alg allow-list as login), pins required claims (iss / aud / iat / jti / events; exactly one of sub / sid; nonce MUST be absent), defeats replay via jti-based deduplication, and revokes matching sessions.

For threat-model coverage of these surfaces, see auth-threat-model.md. For the operator-runnable performance baselines, see auth-benchmarks.md.

OIDC first-admin bootstrap (Bundle 2 Phase 7)

Coexists with Bundle 1's env-var-token bootstrap. When the operator sets CERTCTL_BOOTSTRAP_ADMIN_GROUPS + (optionally) CERTCTL_BOOTSTRAP_OIDC_PROVIDER_ID, the first user with one of those IdP groups becomes admin on first login per tenant. Subsequent users go through normal mapping. The admin-existence probe ensures only one wins between the two bootstrap paths; once any actor holds r-admin, the OIDC bootstrap hook silently falls through to normal mapping. Audit row on every grant (bootstrap.oidc_first_admin, event_category=auth).

Break-glass admin (Bundle 2 Phase 7.5)

Default-OFF (CERTCTL_BREAKGLASS_ENABLED=false). When enabled, the local-password admin path bypasses OIDC + group-claim layers; intended ONLY for SSO-broken incidents.

  • Argon2id with OWASP 2024 params (m=64 MiB, t=3, p=4, 16-byte salt, 32-byte output, per-password random salt, PHC-format hash). Hash column is json:"-" so handlers cannot wire-leak.
  • Lockout state machine: 5 failures (default; configurable via CERTCTL_BREAKGLASS_LOCKOUT_THRESHOLD) within 1h reset window (_LOCKOUT_RESET_INTERVAL) trips a 30s lockout (_LOCKOUT_DURATION). Atomic single-statement IncrementFailure defeats concurrent racing attempts.
  • Constant-time across all failure paths via verifyDummy() — wrong-password / locked-account / no-actor all take statistically indistinguishable time.
  • Surface invisibility: when disabled, ALL four endpoints return HTTP 404 (NOT 403). Scanners cannot distinguish "endpoint disabled" from "endpoint doesn't exist".
  • WARN log at server boot when ENABLED=true; audit row on every break-glass login (auth.breakglass_login_*, event_category=auth); WebAuthn/FIDO2 second factor pairing on the v3 roadmap (Decision 12).

Operators should disable break-glass within 24h of SSO recovery so it does not become a permanent backdoor; the runbook at auth-threat-model.md#break-glass-risks-phase-75 documents the full state machine.

Demo-to-production cutover (Audit 2026-05-11 A-8)

Migration 000029_rbac.up.sql unconditionally seeds an actor-demo-anon → r-admin row into actor_roles. This row is the runtime principal injected by the demo-mode middleware when CERTCTL_AUTH_TYPE=none. Under any non-none auth type the row is DORMANT — the middleware chain never resolves to it. But its existence is a footgun: a future regression that resolves an unauthenticated request to actor-demo-anon (a misrouted CORS preflight, a fallback in a new auth-exempt route) would silently re-elevate to admin.

certctl-server detects this residue at startup and emits a WARN log + an auth.demo_residual_grants_detected audit row listing every grant present on actor-demo-anon. Every production deploy will see this WARN on first boot — the migration baseline is part of the install, not a side effect of running demo mode.

Operator workflow at production cutover:

  1. Drain the WARN by calling the cleanup endpoint with an admin API key:

    curl -X POST --cacert deploy/test/certs/ca.crt \
         -H "Authorization: Bearer $ADMIN_KEY" \
         https://certctl.example.com:8443/api/v1/auth/demo-residual/cleanup
    # → {"removed": 1}
    

    The endpoint is gated auth.role.assign (admin-class) and refuses to run when CERTCTL_AUTH_TYPE=none (HTTP 503 — the residue IS the active runtime state at that auth type). The cleanup is idempotent; a second call returns {"removed": 0} and still leaves an audit row.

    Equivalent SQL for operators preferring direct DB access:

    DELETE FROM actor_roles WHERE actor_id = 'actor-demo-anon';
    
  2. To make subsequent boots refuse startup if the row reappears (the most paranoid stance), set:

    CERTCTL_DEMO_MODE_RESIDUAL_STRICT=true
    

    With the flag set, any actor-demo-anon row under a non-none auth type causes certctl-server to log the WARN AND exit non-zero before binding the HTTPS listener. Default is false (WARN only).

  3. The CI guard scripts/ci-guards/no-new-synthetic-admin.sh pins the set of source files that may reference the actor-demo-anon literal. New runtime code paths that resolve to the synthetic actor are rejected at PR time so the credibility gap stays closed.
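The startup detector's behaviour reduces to a four-way decision on the auth type, the residual grants found, and the strict flag. The sketch below restates that decision as a pure function; `decide` and the outcome names are hypothetical, and `"api-key"` stands in for any non-none auth type.

```go
package main

import "fmt"

type outcome string

const (
	skipDemoMode  outcome = "skip"   // auth type none: the row is the live runtime state
	passClean     outcome = "pass"   // no actor-demo-anon rows found
	warnAndAudit  outcome = "warn"   // WARN log + audit row, startup continues
	refuseStartup outcome = "refuse" // WARN log, then exit non-zero before binding HTTPS
)

// decide maps the preflight inputs to the documented behaviour.
func decide(authType string, residualGrants []string, strict bool) outcome {
	if authType == "none" {
		return skipDemoMode
	}
	if len(residualGrants) == 0 {
		return passClean
	}
	if strict {
		return refuseStartup
	}
	return warnAndAudit
}

func main() {
	grants := []string{"r-admin"}
	fmt.Println(decide("none", grants, true))     // skip
	fmt.Println(decide("api-key", nil, true))     // pass
	fmt.Println(decide("api-key", grants, false)) // warn
	fmt.Println(decide("api-key", grants, true))  // refuse
}
```

After a successful cleanup call the grants list is empty, so both the default and the strict configuration boot cleanly.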

Migrating an existing deployment to OIDC

A Bundle-1-merged deployment that wants to add OIDC follows the step-by-step at docs/migration/oidc-enable.md: configure CERTCTL_CONFIG_ENCRYPTION_KEY, pick + configure an IdP per the relevant runbook, configure the certctl-side OIDCProvider + group→role mappings, verify the login flow against a single test user, then announce the SSO endpoint to the rest of the organization.

Per-user rate limiting

Bundle B / M-025. Authenticated callers are bucketed by API-key name; unauthenticated callers (probes, OCSP relying parties, EST/SCEP enrollees) are bucketed by source IP. RPS and BurstSize are per-key budgets. PerUserRPS / PerUserBurstSize give authenticated clients a separate budget when set non-zero.

API key rotation

Audit reference: L-004. CWE-924 (improper enforcement of message integrity during transmission in a communication channel) - operator UX variant.

certctl's API keys are configured via the CERTCTL_API_KEYS_NAMED env var (format name1:key1,name2:key2:admin) and parsed at startup into an in-memory list. There is no DB-resident key store, no GUI, no /api/v1/keys endpoint - the env var IS the key inventory.

Pre-Bundle-G the env var rejected duplicate names, so rotating a key required: stop accepting OLDKEY → restart → roll NEWKEY out. Any client polling against OLDKEY during the restart window hit a 401.

Bundle G adds a double-key rotation window: two entries can share a name during the rollover, and both keys validate. Operators run the rotation as:

  1. Generate the new key. openssl rand -hex 32 produces a 256-bit value with sufficient entropy.

  2. Append the new entry to CERTCTL_API_KEYS_NAMED alongside the existing one:

    CERTCTL_API_KEYS_NAMED="alice:OLDKEY:admin,alice:NEWKEY:admin"
    

    Both entries MUST carry the same admin flag - startup fails loud if they don't (a non-admin shouldn't share an identity with an admin).

  3. Restart certctl. A startup INFO log confirms the rotation window is active:

    INFO api-key rotation window active name=alice entries=2 see=docs/security.md::api-key-rotation
    
  4. Roll the new key out to all clients. Both keys validate during this phase. Audit-trail actor + per-user rate-limit bucket stay consistent across the rollover (both entries produce the same UserKey context value, the shared name).

  5. Remove the old entry from CERTCTL_API_KEYS_NAMED:

    CERTCTL_API_KEYS_NAMED="alice:NEWKEY:admin"
    
  6. Restart certctl. OLDKEY now fails with 401. Rotation complete.

The rotation window has no operator-set timeout - it lasts for as long as both entries are in the env var. Best practice is a 24-72h window covering a full deploy cadence; if a client hasn't rolled to NEWKEY by the end of step 4, extend the window before step 5.

What the contract guarantees

  • Two entries with the same name: allowed if both have the same admin flag.
  • Two entries with the same name but mismatched admin: rejected at startup (privilege escalation guard).
  • Two entries with the same (name, key) pair: rejected at startup (typo guard - rotation requires DIFFERENT keys under the same name).
  • Single-entry steady state: unchanged from pre-Bundle-G behavior.
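The four rules above can be expressed as a small validating parser for the `name:key[:admin]` format. This is a sketch, not the certctl parser: `parseNamedKeys` and `namedKey` are hypothetical names, and rejecting any third field other than the literal `admin` is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

type namedKey struct {
	Name  string
	Key   string
	Admin bool
}

// parseNamedKeys parses "name1:key1,name2:key2:admin" and enforces the
// rotation contract: same name is allowed only with the same admin flag
// and a different key.
func parseNamedKeys(raw string) ([]namedKey, error) {
	var out []namedKey
	adminByName := map[string]bool{}
	seenPair := map[[2]string]bool{}
	for _, entry := range strings.Split(raw, ",") {
		parts := strings.Split(strings.TrimSpace(entry), ":")
		if len(parts) < 2 || len(parts) > 3 || parts[0] == "" || parts[1] == "" {
			return nil, fmt.Errorf("malformed entry %q", entry)
		}
		k := namedKey{Name: parts[0], Key: parts[1]}
		if len(parts) == 3 {
			if parts[2] != "admin" {
				return nil, fmt.Errorf("unknown flag %q for %q", parts[2], k.Name)
			}
			k.Admin = true
		}
		if prev, ok := adminByName[k.Name]; ok && prev != k.Admin {
			return nil, fmt.Errorf("entries for %q disagree on the admin flag", k.Name)
		}
		if seenPair[[2]string{k.Name, k.Key}] {
			return nil, fmt.Errorf("duplicate (name, key) pair for %q", k.Name)
		}
		adminByName[k.Name] = k.Admin
		seenPair[[2]string{k.Name, k.Key}] = true
		out = append(out, k)
	}
	return out, nil
}

func main() {
	keys, err := parseNamedKeys("alice:OLDKEY:admin,alice:NEWKEY:admin")
	fmt.Println(len(keys), err)
	_, err = parseNamedKeys("alice:OLDKEY:admin,alice:NEWKEY")
	fmt.Println(err)
}
```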

What the contract does NOT do

  • No automatic expiration of OLDKEY. The operator removes the entry in step 5; certctl doesn't track timestamps. A future enhancement could add a rotated_at annotation if operators ask for it.
  • No GUI / API for key management. Keys are env-var only by design; building a key-management surface is a separate feature project.
  • No revocation list. If a key leaks, the only path is to remove it from the env var and restart. That's appropriate for a small env-var inventory; it would not scale to a per-user-key-issued model.

Reporting a vulnerability

Email certctl@proton.me. Coordinated disclosure preferred; we will acknowledge within 72h.